This how-to will demonstrate the use of MethylAction to detect differentially methylated regions (DMRs) among three groups using data from MBD-isolated Genome Sequencing (MiGS). While MethylAction is designed for genome-wide analysis, this example data is only for a subset of the genome and a subset of samples so the example can be worked through quickly. Please refer to the function documentation for advanced options.
RAM and CPU requirements will depend on the depth of the sequencing and the number of samples and groups. We recommend very high performance machines. As a reference, we used a linux server with 20 cores of 2.80GHz CPUs and 64GB of RAM for a genome-wide analysis of 22 samples across 3 groups. A high performance computing cluster (HPC) was used to obtain 1,000s of permutations.
For the purposes of the example, less CPU and RAM are required. This how-to has been tested on a MacBook Pro with a 2.4 GHz CPU and 8 GB of RAM.
First, start R (by typing the “R” command at the linux command line) and install pre-requisite packages from Bioconductor:
source("http://bioconductor.org/biocLite.R") biocLite(c("GenomicRanges","IRanges","devtools","DESeq","GenomicAlignments", "Repitools","Rsubread","ggbio"))
Then, install both Goldmine and MethylAction from GitHub. Be sure to accept installation of any additional pre-requisite packages from CRAN.
library(devtools) install_github("jeffbhasin/goldmine") install_github("jeffbhasin/methylaction")
Please obtain the “methylaction_demo.tar.gz” file and extract it. This archive can be donwloaded from the methylaction_demo repository releases page. This contains all the example input data needed to complete this how-to.
From the linux command line, use the command “tar -zxvf methylaction_demo.tar.gz” for extraction.
Note that your session of R must have the working directory set to the location of the “methylaction_demo” folder extracted from the tar.gz file so paths to files in this how-to will be valid. Use the “setwd()” function to set this directory.
For example (change to your location):
First, start R, and load the “methylaction” R package into the session.
Unique sample identifiers, paths to BAM files, group assignments, and any sample covariates are defined in a CSV file. Please see “input/samples.csv” for an example of the format. Note that comparison groups must be ordered as you want the groups to appear in the output results. This is important, because in the output, patterns between the groups will be coded using sequences of binary digits, where each digit represents a group, and this is the same as the order the groups are encountered in the sample CSV file. Optionally, a column called “color” can be provided that defines colors for each groups. These will be used in certain plotting and reporting functions. If this column is omitted, colors are automatically assigned using RColorBrewer.
The sample CSV file can be read into R using the readSampleInfo() command:
samp <- readSampleInfo("input/samples.csv") print(samp)
The command will output confirmation of group sample sizes and group order.
All read alignments are read and processed from the BAM files into an RData workspace that is saved to disk, and this prevents the user from having to re-process data again for each run of MethylAction. The getReads() function obtains the coordinates of the read alignments from the BAM file. The initial stage of the program works with fragment counts in 50bp windows, which are also computed once and saved by the function getCounts().
There are two other variables specific to the experiment that are needed in the next step. These are the window size (we recommend 50bp) and the fragment size used for generating the sequencing library. This fragment size is not the read length – it is the average size of a fragment in the sequencing library that was prepared. Often this number is available from BioAnayzer results, and should be known to whomever did the sequencing library preparation for your study. In the case of paired end data, the fragsize variable can be set to “paired” and valid mate pairs will be used.
First, define the variables that will be needed for the preprocessing functions:
chrs <- "chr22" fragsize <- 120 winsize <- 50 ncore <- 1
While the demo BAM files contain only chr22 to reduce size, MethylAction can be run genome-wide on your own dataset. To do this, specify “chrs” as a vector of chromosome names. We recommend excluding the sex chromosomes in mixed-sex cohorts.
# For human autosomes (chr1 to chr22) chrs <- paste0("chr",1:22) # Including the sex chromosomes chrs <- paste0("chr",c(1:22,"X","Y"))
Then, read in alignments and generate binned count tables (these steps may be memory, disk, and CPU intensive):
reads <- getReads(samp=samp, chrs=chrs, fragsize=fragsize, ncore=ncore) counts <- getCounts(samp=samp, reads=reads, chrs=chrs, winsize=winsize, ncore=ncore)
We recommend saving all of the above into a single preprocessing RData, which can be loaded prior to running the DMR detection step described next.
save(samp, reads, counts, winsize, fragsize, chrs, file="output/prepro.rd",compress=T)
For future runs of DMR detection, the saved RData can be loaded rather than spending time re-preprocessing the data.
With all the preprocessing completed, differentially methylated regions (DMRs) can be detected using a call to the methylaction() function. There are many options to this function that will affect the DMR detection. Here we have used recommended defaults. See the function documentation for more details. This function performs multiple steps, which are all CPU, RAM, and disk intensive when run on larger data sets.
First, set the number of cores to use based on your hardware (we recommend reducing this number if there are memory issues):
ncore <- 1
Then, run methylaction() to call DMRs:
ma <- methylaction(samp=samp, counts=counts, reads=reads, ncore=ncore)
Finally, save the results object to disk:
It is also useful to save a DMR list as a CSV file for viewing in a spreadsheet application.
write.csv(makeDT(ma$dmr), row.names=FALSE, file="dmrs.csv")
The function methylaction() is designed to output a great deal of information about the internals of the DMR calling in order to facilitate comparisons between different settings and to prevent needing to re-run the command on large datasets to view intermediate states. The output object is a list. For a list of all detected DMRs, look at ma$dmr. Each DMR is assigned a pattern code, where each digit represents a group. The pattern indicates the differential methylation status between the groups. All groups with the digit is 1 are hypermethylated with respect to all groups with the digit 0. See the ma$args object for the arguments provided to the call of methylaction() that generated the output. If you want to access data from any internal steps of the function, see the objects nested under ma$data.
The maSummary() function provides details about the intermediate stages of the method.
## stat count percent ## 1 Window Size 50 ## 2 Total Windows 1026091 ## 3 All Zero Windows (filtered) 440064 42.89 ## 4 All Below FDR Windows (filtered) 409356 39.89 ## 5 Signal Windows (move on to stage one) 176671 17.22 ## 6 Windows Tested in Stage One 176671 ## 7 Sig Pattern in Stage One 158274 89.59 ## 8 Non-Sig Pattern in Stage One 0 0 ## 9 Ambig Pattern in Stage One 18397 10.41 ## 10 Regions Formed By Joining Adjacent Patterns 1351 ## 11 Regions Tested in Stage Two 1351 ## 12 Regions That Pass ANODEV 1330 98.45 ## 13 ANODEV Sig with Sig Pattern 1263 94.96 ## 14 ANODEV Sig with Non-sig Pattern 0 0 ## 15 ANODEV Sig with Ambig Pattern 67 5.04 ## 16 Total DMRs 1263
The built-in function table() can be used to obtain counts of DMRs by pattern and “frequent” classification.
DMRs can be visualized genome-wide via a heatmap or karyogram.
In this case, the Karyogram only spans chr22 due to the subset of data for this example.
It is also useful to view DMRs and the windowed sequencing read counts in a genome browser, such as the UCSC genome browser. The maBed() and maTracks() functions can create BED files suitable for uploading as custom tracks to http://genome.ucsc.edu/.
maBed(ma,file="dmrs.bed") maTracks(ma=ma, reads=reads, path="beds")
Because of the two stage testing approach, type I error rates may be inflated with this method. To determine if this level is acceptable, we have implemented permutation and bootstrapping approaches. This establishes a false discovery rate (FDR) for each pattern of DMR among the groups. If these FDRs are too high, they can be recalculated at lower p-value thresholds until they reach acceptable levels. Then, DMRs filtered at this p-value can be used as the definitive list for the study.
Bootstraps can be enabled by adding the “nperms” and “perm.boot” options to methylaction(). The resulting output list will then have an “fdr” object that reports false discovery rates (FDRs).
ma <- methylaction(samp=samp, counts=counts, reads=reads, perm.boot=T, nperms=3, ncore=ncore)
Finally, save the results object to disk:
If “perm.boot” is set to be FALSE, then regular permutations (sampling without replacement) are performed rather than bootstrapping (sampling with replacement).
See the maPerm(), maPermMerge(), and maPermFdr() for manual methods to run permutations. These are useful for spreading permutations across multiple machines or in a high performance computing (HPC) environment.