Computational Methods for Exploring the Microbiome


The following tutorial focuses on a subset of data derived from: The microbiota at multiple body sites during pregnancy in a rural Tanzanian population and the effects of Moringa supplemented probiotic yogurt (doi:10.1128/AEM.00780-15). The study comprised a longitudinal analysis of 56 pregnant women. We will revisit the finding that there is a major shift in the vaginal microbiome at birth, using a set of tools completely independent of those in the paper and a workflow carried out entirely in R. For the purposes of this analysis we will focus on a subset of 20 women, comparing enrollment in the second trimester to birth. We are using the vaginal microbiome data set here because it has relatively low diversity compared to the gut and showed very clear shifts in composition in previous analyses.

The tutorial will require some basic knowledge of the R language. At this point in time, knowing bash, Python, MATLAB, R, or some combination thereof is required in the microbiome field. All of these languages use more or less the same basic logic (for biological computing, anyway) but differ in their syntax. It is more important to know what you want to do than how to do it. Do not hesitate to ask for help if you are having trouble, and check here for helpful cheat sheets. The commands you will need are all supplied in gray boxes with the output below.

If you are not familiar with R, please watch these videos:
-Introduction to R Programming with DataCamp
-Getting started with R and RStudio
If you are interested in learning more R, I recommend this book.


Installing necessary requirements


This entire process could take ~10-15 minutes; please ensure this has been done before coming to class!


If you do not already have R, download the correct version for your operating system here. Then download the newest version of RStudio for your operating system here. Please ensure your version of R is at least 3.3.2 (ideally 3.4.0). You can check this with the following command:

print(R.Version()$version.string)
## [1] "R version 3.4.0 (2017-04-21)"

Open RStudio and click File > New File > R Script. Save this file in a folder of your choice as YourName.MicrobiomeTutorial.R.

Next, click Session > Set Working Directory > Choose Directory. Select the folder in which you placed your .R file.



For Ubuntu users only:

###For Ubuntu users only, you may need to do the following:
sudo apt-get install libcurl4-gnutls-dev #in terminal
sudo apt-get install curl #in terminal
sudo apt-get install gsl-bin libgsl0-dev #in terminal
install.packages("RCurl") #in R



Next, run the following code, which will download all required packages for this tutorial if you do not already have them. It will also download the raw data from the European Nucleotide Archive. If asked to compile, reply yes. On OSX you may be asked to install developer tools, which you should accept. If asked to update other packages, type n for none. You may have to do this multiple times. For advanced users: this script may update versions of packages you already have, so feel free to manually install or update as needed based on the source code of the install script.

source("https://raw.githubusercontent.com/jbisanz/BMS270_BMI219/master/InstallTutorial.R")

On OSX and Ubuntu, ensure that RawData (downloaded by the above script), Tutorial_metadata.txt, rdp_species_assignment_14.fa.gz, rdp_train_set_14.fa.gz, error_profile.RDS, and tree.RDS are in the same directory as YourName.MicrobiomeTutorial.R. If you are running Windows, the sequencing data was not automatically downloaded; please manually download it from this link and decompress it in the same directory as YourName.MicrobiomeTutorial.R.
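To verify that everything is in place, here is a quick sketch you could run from your working directory:

required.files<-c("Tutorial_metadata.txt", "rdp_species_assignment_14.fa.gz", "rdp_train_set_14.fa.gz", "error_profile.RDS", "tree.RDS")
required.files[!file.exists(required.files)] #anything printed here is still missing
dir.exists("RawData") #should return TRUE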


Load packages

To ensure that you have successfully installed all packages, run the following commands one by one, paying attention to the output. Warning messages about object masking are expected in this case and are OK.

require(vegan) #Diversity functions for microbiome data
require(dada2) #Our denoising and primary sequence processing tool
require(phyloseq) #plotting and analysis tools for microbiome data
require(MicrobeR) #plotting and functions for microbiome data
require(DESeq2) #differential expression/feature abundance tool
require(genefilter) #useful to help filter tables

require(plyr) #data formatting functions
require(reshape2) #data formatting functions
require(ggplot2) #general plotting function

require(DECIPHER) #for nucleotide alignments
require(phangorn) #phylogenetic tree tools
require(ggtree) #tree plotting tools

If any of these fail to load, try re-sourcing the InstallTutorial.R script above; it will only download the data once. If you are still unable to install any of the packages, please email me.



Finally, it is a good idea to be aware of your environment and what versions of packages you have installed. You can do this with the following command. Your environment will not look exactly like mine, but this is a good thing to keep track of for when you try to repeat analyses or publish your data.

print(sessionInfo())
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.4
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ggtree_1.8.1               treeio_1.0.2              
##  [3] phangorn_2.2.0             ape_4.1                   
##  [5] DECIPHER_2.4.0             RSQLite_1.1-2             
##  [7] Biostrings_2.44.0          XVector_0.16.0            
##  [9] ggplot2_2.2.1.9000         reshape2_1.4.2            
## [11] plyr_1.8.4                 genefilter_1.58.1         
## [13] DESeq2_1.16.1              SummarizedExperiment_1.6.1
## [15] DelayedArray_0.2.2         matrixStats_0.52.2        
## [17] Biobase_2.36.2             GenomicRanges_1.28.1      
## [19] GenomeInfoDb_1.12.0        IRanges_2.10.0            
## [21] S4Vectors_0.14.0           BiocGenerics_0.22.0       
## [23] MicrobeR_0.1.1             phyloseq_1.20.0           
## [25] dada2_1.4.0                Rcpp_0.12.10              
## [27] vegan_2.4-3                lattice_0.20-35           
## [29] permute_0.9-4             
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-131             bitops_1.0-6            
##  [3] RColorBrewer_1.1-2       httr_1.2.1              
##  [5] rprojroot_1.2            tools_3.4.0             
##  [7] backports_1.0.5          R6_2.2.1                
##  [9] DT_0.2                   rpart_4.1-11            
## [11] KernSmooth_2.23-15       Hmisc_4.0-3             
## [13] DBI_0.6-1                lazyeval_0.2.0          
## [15] mgcv_1.8-17              colorspace_1.3-2        
## [17] nnet_7.3-12              ade4_1.7-6              
## [19] gridExtra_2.2.1          compiler_3.4.0          
## [21] htmlTable_1.9            plotly_4.6.0            
## [23] checkmate_1.8.2          caTools_1.17.1          
## [25] scales_0.4.1             quadprog_1.5-5          
## [27] stringr_1.2.0            digest_0.6.12           
## [29] Rsamtools_1.28.0         foreign_0.8-68          
## [31] rmarkdown_1.5            base64enc_0.1-3         
## [33] htmltools_0.3.6          htmlwidgets_0.8         
## [35] textmineR_2.0.5          hwriter_1.3.2           
## [37] jsonlite_1.4             BiocParallel_1.10.1     
## [39] gtools_3.5.0             acepack_1.4.1           
## [41] dplyr_0.5.0              RCurl_1.95-4.8          
## [43] magrittr_1.5             GenomeInfoDbData_0.99.0 
## [45] Formula_1.2-1            biomformat_1.4.0        
## [47] Matrix_1.2-10            munsell_0.4.3           
## [49] RcppProgress_0.3         stringi_1.1.5           
## [51] yaml_2.1.14              MASS_7.3-47             
## [53] zlibbioc_1.22.0          rhdf5_2.20.0            
## [55] gplots_3.0.1             grid_3.4.0              
## [57] gdata_2.17.0             splines_3.4.0           
## [59] annotate_1.54.0          multtest_2.32.0         
## [61] locfit_1.5-9.1           knitr_1.15.1            
## [63] igraph_1.0.1             seqinr_3.3-6            
## [65] geneplotter_1.54.0       codetools_0.2-15        
## [67] fastmatch_1.1-0          XML_3.98-1.7            
## [69] evaluate_0.10            ShortRead_1.34.0        
## [71] latticeExtra_0.6-28      data.table_1.10.4       
## [73] RcppParallel_4.3.20      foreach_1.4.3           
## [75] gtable_0.2.0             purrr_0.2.2             
## [77] tidyr_0.6.2              assertthat_0.2.0        
## [79] xtable_1.8-2             survival_2.41-3         
## [81] viridisLite_0.2.0        tibble_1.3.0            
## [83] rvcheck_0.0.8            iterators_1.0.8         
## [85] memoise_1.1.0            AnnotationDbi_1.38.0    
## [87] GenomicAlignments_1.12.0 cluster_2.0.6

You are now ready for the tutorial! Wait for class before moving on!


2. Metadata Import


Background:

The key to a good study starts before the sequencing even begins. Metadata refers to a list of your sample names and what they actually are. In its simplest form, it could simply be disease X versus healthy control; in an ideal world, it is full of as many relevant covariates as possible. Before starting the DNA extractions, I highly recommend putting together the metadata to ensure you have sufficient information for meaningful analysis. Here we have provided a text file called “Tutorial_metadata.txt” which is in tab-separated value (TSV) format. These files can be opened and edited in Excel or Google Sheets, and along with comma-separated value (CSV) format, are among the most common ways to store metadata. To get you started, try applying the following code to import and view the associated metadata:

metadata<-read.table("Tutorial_metadata.txt",
                     sep ='\t', # our table is a TSV file with tabs between every column
                     header = TRUE, #the first row of our metadata contains the variable names
                     stringsAsFactors = FALSE #This has to do with how R treats text, for now it will be easiest to not treat text as a factor
                     )
Nice.Table(metadata)

Finally, we will do some minor formatting of our data. We will name the rows of our metadata table using the sample_name variable; while not strictly necessary, some tools expect this, including phyloseq and MicrobeR. We will also explicitly order some variables in our metadata, putting SecondTrimester (Visit 1) before Birth and ordering the Nugent class from healthy -> not healthy.

rownames(metadata)<-metadata$sample_name
metadata$Timepoint<-factor(metadata$Timepoint, levels=c("SecondTrimester","Birth")) #This puts the classes in order for future analyses and plots
metadata$anonymized_name<-as.factor(metadata$anonymized_name) #make the participant name a categorical variable
metadata$nugent_class<-factor(metadata$nugent_class, levels=c("healthy","intermediate","bv")) #make nugent class a categorical variable ordered from healthy -> bacterial vaginosis

Variable Explanations:

  • sample_name : The unique identifier of the sample. This corresponds to the name that will be generated in our feature table.
  • age : The age of the participant in years.
  • anonymized_name : A unique identifier of the participant in the study.
  • bmi : The body mass index (kg/m²).
  • Timepoint : Time point in the study, i.e. either baseline (second trimester) or at birth.
  • weight_infant_1 : The delivery weight of the infant in kg.
  • vaginal_ph : The pH of the microbial environment.
  • nugent_score : A clinical test based on manual interpretation of Gram-stain morphologies during microscopy.
  • nugent_class : A categorical variable derived from the Nugent score where low scores are healthy, and high scores are Bacterial Vaginosis.
  • FTP : The FTP address which the files can be pulled from.

3. Sequencing Import and Processing


Background:

We can use R directly to download our files, although this can be done in many ways, including manually. The code has been included here; however, it was part of your install script, so you should not need to download the data again. In this case, the data is being pulled from the European Nucleotide Archive. If you wanted to reanalyze a published data set, say from QIITA, you could easily adapt this exact same code.

#Do not run again unless you did not download the raw data
dir.create("RawData") # A directory to store all of our data
for (i in 1:nrow(metadata)){
  if(!file.exists(paste0("RawData/", metadata[i,"sample_name"],".fastq.gz"))){
    download.file(metadata[i,"FTP"], paste0("RawData/", metadata[i,"sample_name"],".fastq.gz"), method="wget", quiet=F)
  } else {
    print(paste0("You have already downloaded: ", metadata[i,"sample_name"]))
  }
}


There are many options for how to handle 16S rRNA gene sequencing data. Conventionally, clustering of sequences at a set threshold, usually 97%, is common; the resulting clusters are called OTUs (Operational Taxonomic Units). It is now in vogue to use some form of denoising and avoid clustering altogether. This has the advantage of providing the highest possible resolution, which can even resolve strains of the same species in some cases. There is some debate about what these should be called, but for our purposes we will refer to them as SVs (Sequence Variants). Today we will use DADA2 denoising, which can be accomplished completely in R. Similar approaches include Deblur (Knight Group) and UNOISE (Robert Edgar).

To carry out these steps, I recommend inspecting the following dada2 tutorial: http://benjjneb.github.io/dada2/tutorial.html. Many of the steps we will apply have been copied verbatim from this tutorial. Note: We have single ended 1x150 sequencing so overlapping is not necessary and certain filtering steps only require the passing of one variable rather than two.



To help understand the denoising approach, you should look at the format of our raw data. Take a look at the first lines of a FASTQ file:

readLines("RawData/2024.birth.MNIH02UP.BFF10F.V.fastq.gz", n=4) #read top 4 lines of file and print to scren
## [1] "@2024.birth.MNIH02UP.BFF10F.V_0 orig_bc=CCACACCTTCTG new_bc=CCACACCTTCTG bc_diffs=0"                                                                    
## [2] "TTCCAGCTCTGCGAGCTTGCTCCCATATTGTTGCAGTTAAAACGCCCGTAGTCTGAATTGGCCAGCAATGGTCGTACGTATCTTTACGTTCACTGTGAACAAATCAGGACGCTTAGAGTATGGCTACATGAATGACTCAGCGCAGTATGAA"
## [3] "+"                                                                                                                                                      
## [4] ">AAA3@CFFFFFGEEEEGFGGGHGGHHHHHHHHHHGGHHHHHHGGGGGGGGGHHHHHHHGHHHHHGHHHHHHHCGHHGGHHHHHHHHGHHHHHHHHHGHHHHHHHHFFGHGGGGGFHHHHHGFGGGHHHHBFHHHHHHHHHGGCGGFHFHH"

FASTQ files typically are a repeating pattern of 4 lines:
1. The first starts with an @ sign followed by a unique identifier; in this case it denotes the sample of origin and some artifacts of the demultiplexing process.
2. The nucleotide sequence.
3. A + sign denoting the end of the nucleotide sequence.
4. The quality string. These characters encode a numeric score giving the probability that a sequencing error has occurred. In typical (Phred+33) encoding, H denotes a score of 39, which corresponds to roughly a 1:7,900 probability of an error, while a Q score of 10 denotes a 1:10 probability of an incorrect base call (see the sketch below). It is imperative that the majority of your sequencing run has high quality scores, typically >30, but there is no fixed cutoff.
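To make the encoding concrete, here is a small sketch converting a quality character to its Phred score and error probability, assuming standard Phred+33 encoding:

q<-utf8ToInt("H")-33 #ASCII value of the quality character minus 33 under Phred+33 encoding
p.error<-10^(-q/10) #by definition, Q = -10*log10(P(error))
print(paste0("Q=", q, ", P(error)=", signif(p.error, 3))) #Q=39, P(error)~1.26e-4, i.e. about 1 in 7,900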

Further Reading:

FASTQ files and quality scores
How Illumina Sequencers work
QIIME2
QIIME - The conventional approach for 16S rRNA gene Sequencing
Mothur

Resources:

DADA2
DeBlur
UNOISE
dada2 Tutorial Read here further for theoretical aspects.


Start by examining the quality of a few samples, with x as cycle number and y as quality score. What do you think of the quality?

FASTQs<-paste0("RawData/", list.files("RawData")) # generate a list of file names
plotQualityProfile(FASTQs[c(1,10,20)]) #plot 3 example samples

Remember that sequencing quality almost always decreases as the number of cycles increases.



It would appear we have decent-quality sequencing, although some samples show a sharp decline in quality in the late cycles and lower quality in the early cycles. Try using the filterAndTrim command of dada2, as per the tutorial, to filter your data. Try these values if you think they are appropriate:

  • truncQ=2, #truncate reads after a quality score of 2 or less
  • truncLen=130, #truncate after 130 bases
  • trimLeft=10, #remove 10 bases off the 5’ end of the sequence
  • maxN=0, #Don’t allow any Ns in sequence
  • maxEE=2, #A maximum number of expected errors
  • rm.phix=TRUE, #Remove lingering PhiX (control DNA used in sequencing) as it is likely there is some.
  • multithread=TRUE #Run parallel, if you can this will speed up filtering significantly. On some windows systems, this may not be possible.



If you filter too aggressively, you might not end up with any output from filterAndTrim, as nothing will pass the filter. Make sure you have all your filtered files (check inside your FilteredData/ folder).

dir.create("FilteredData") # Create a directory for filtered Read Data
filt.FASTQs<-paste0("FilteredData/", metadata$sample_name, "_filtered.fastq.gz") # Create new names for the filtered sequence data
filtered<-filterAndTrim(fwd=FASTQs, #The list of fastqs that we currently have
                        filt=filt.FASTQs, #The list of of new names for the filtered versions
                        compress=TRUE, #store the results as .gz to save on storage space
                        truncQ=2, #truncate reads after a quality score of 2 or less
                        truncLen=130, #truncate after 130 bases
                        trimLeft=10, #remove 10 bases off the 5' end of the sequence
                        maxN=0, #Don't allow any Ns in sequence
                        maxEE=2, #A maximum number of expected errors
                        rm.phix=TRUE, #Remove lingering PhiX... it is highly likely there is some
                        multithread=TRUE #Run parallel; if you can, this will speed up filtering significantly ***On Windows you may need to set = FALSE if you receive an error relating to mc.cores
                        )

It would be a good idea to look at how many reads were lost during filtering.

print(paste("We removed a total of", sum(filtered[,1])-sum(filtered[,2]), "from", sum(filtered[,1]), "reads"))
## [1] "We removed a total of 12181 from 834252 reads"

This seems to be an acceptable loss. Now we need to do the most computationally intensive part of the process: learning the error rates. I highly recommend enabling multithreading if possible.

if(!file.exists("error_profile.RDS")){
  error_profile<-learnErrors(filt.FASTQs, multithread=TRUE)
  saveRDS(error_profile, "error_profile.RDS")
} else {
    print("Error profiles loaded from file")
    error_profile<-readRDS("error_profile.RDS")
}
## [1] "Error profiles loaded from file"


This took ~3 minutes on both a 2017 2.8 GHz Intel Core i5 iMac and a 2011 2.0 GHz Intel Core i7 MacBook Pro, each with 16 GB of RAM. The time to do this could vary significantly depending on your computer. For this reason we have provided the error profile with the downloads for the tutorial. The code above checks whether you already have the profile and loads it from disk when possible rather than recalculating it.



Now dereplicate your sequences which will help speed things up later during denoising.

derep<-derepFastq(filt.FASTQs, verbose=FALSE) #dereplicate sequences
names(derep) <- metadata$sample_name #name the dereplicated sequences by sample of origin

Now you can generate your denoised reads. Apply the dada( ) function to denoise.

denoised<-dada(derep, err=error_profile, multithread = TRUE) #denoise sequences, again on windows you may need to set multithread=FALSE
## Sample 1 - 6441 reads in 1220 unique sequences.
## Sample 2 - 22622 reads in 9691 unique sequences.
## Sample 3 - 14029 reads in 6108 unique sequences.
## Sample 4 - 11990 reads in 1982 unique sequences.
## Sample 5 - 20111 reads in 3303 unique sequences.
## Sample 6 - 10209 reads in 5776 unique sequences.
## Sample 7 - 16056 reads in 2500 unique sequences.
## Sample 8 - 9506 reads in 2399 unique sequences.
## Sample 9 - 76023 reads in 15313 unique sequences.
## Sample 10 - 14505 reads in 2621 unique sequences.
## Sample 11 - 14756 reads in 5396 unique sequences.
## Sample 12 - 32724 reads in 12288 unique sequences.
## Sample 13 - 36929 reads in 4287 unique sequences.
## Sample 14 - 12677 reads in 2867 unique sequences.
## Sample 15 - 7040 reads in 777 unique sequences.
## Sample 16 - 32002 reads in 9957 unique sequences.
## Sample 17 - 23173 reads in 3252 unique sequences.
## Sample 18 - 9004 reads in 4266 unique sequences.
## Sample 19 - 47104 reads in 37546 unique sequences.
## Sample 20 - 25698 reads in 8917 unique sequences.
## Sample 21 - 14539 reads in 2223 unique sequences.
## Sample 22 - 17800 reads in 4720 unique sequences.
## Sample 23 - 17628 reads in 2746 unique sequences.
## Sample 24 - 15712 reads in 2355 unique sequences.
## Sample 25 - 26490 reads in 4603 unique sequences.
## Sample 26 - 18022 reads in 7373 unique sequences.
## Sample 27 - 17282 reads in 5436 unique sequences.
## Sample 28 - 17952 reads in 8670 unique sequences.
## Sample 29 - 24017 reads in 4545 unique sequences.
## Sample 30 - 14883 reads in 2829 unique sequences.
## Sample 31 - 34471 reads in 4255 unique sequences.
## Sample 32 - 13541 reads in 2455 unique sequences.
## Sample 33 - 15574 reads in 2810 unique sequences.
## Sample 34 - 29157 reads in 8353 unique sequences.
## Sample 35 - 11413 reads in 1024 unique sequences.
## Sample 36 - 16472 reads in 3774 unique sequences.
## Sample 37 - 16027 reads in 2971 unique sequences.
## Sample 38 - 26512 reads in 9589 unique sequences.
## Sample 39 - 19470 reads in 4059 unique sequences.
## Sample 40 - 12510 reads in 1302 unique sequences.

Finally make your table of sequence variants (SVs) and check the dimensions to ensure it seems plausible:

sv.table<-makeSequenceTable(denoised) #make your feature table
print(paste("There are", ncol(sv.table),"SVs in", nrow(sv.table), "Samples"))
## [1] "There are 1793 SVs in 40 Samples"


At this point, it would be wise to address sequence chimeras (for a discussion of chimeras, see here). These erroneous sequences have no biological meaning, and removing them helps reduce the number of features you will compare. There are many chimera-removal algorithms, but today use the built-in removeBimeraDenovo( ) function with method=“pooled”. Why do you think we would want to use the pooled method? After removing chimeras, check how many reads you removed.

sv.table.nochim <- removeBimeraDenovo(sv.table, method="pooled", multithread=TRUE, verbose=FALSE) #remove chimeras, again you may need to specify multithread=FALSE on some windows systems
print(paste("After chimera removal, there are", ncol(sv.table.nochim),"SVs in", nrow(sv.table.nochim), "Samples"))
## [1] "After chimera removal, there are 1503 SVs in 40 Samples"
print(paste("Chimeric sequences represented", round(100*(1-sum(sv.table.nochim)/sum(sv.table)),2), "% of sequences"))
## [1] "Chimeric sequences represented 4.29 % of sequences"

Note: When doing PCR for 16S rRNA gene sequencing, keep the cycle count low (don't use 35 cycles) and use high-fidelity enzymes; both help limit chimera formation.



Now one last filtering step: remove SVs that are not present in at least 2 samples with at least 1 count! There are many ways you could apply this, but try filterfun(kOverA(2,1)) with the filter_taxa command from phyloseq, or Confidence.Filter.OTUs from MicrobeR. Note: generally speaking, you should calculate alpha diversity metrics before applying any filtering, but for the purposes of reducing computational requirements, we are going to trim spurious features out now rather than later.

sv.table.filtered<-as(filter_taxa(otu_table(sv.table.nochim, taxa_are_rows = FALSE), filterfun(kOverA(2,1)), prune = TRUE), "matrix")
print(paste("After required features be present in at least 2 samples there are ", ncol(sv.table.filtered),"SVs in", nrow(sv.table.filtered), "Samples"))
## [1] "After required features be present in at least 2 samples there are  202 SVs in 40 Samples"



Now we have a table of sequences and samples, but the sequences aren’t particularly helpful to us. What we really want to know is which organisms those sequences belong to. There are multiple choices of database for this step. For the purposes of today, we will use the Ribosomal Database Project (RDP). Typically the Greengenes or SILVA databases are more commonly used, but we are using the RDP database to save on computational time (I personally use SILVA). Use the assignTaxonomy() function of dada2 with the rdp_train_set_14.fa.gz that you downloaded as part of your installation.

taxonomy <- as.data.frame(assignTaxonomy(sv.table.filtered, "rdp_train_set_14.fa.gz", multithread=TRUE, verbose=TRUE)) #assign taxonomy
## Finished processing reference fasta.

To make visualizing the data easier, we can assign a unique ID to each of our sequence variants. I will create a new column in the taxonomy table called Seq that contains the SV sequence, and another column for the unique ID.

taxonomy$Seq<-rownames(taxonomy)
taxonomy$SV_ID<-paste0("SV_", seq(from=1, to=nrow(taxonomy), by=1))
colnames(sv.table.filtered)<-taxonomy[colnames(sv.table.filtered),]$SV_ID #Our table of SVs currently has sequences for names, switch it to the unique ID you created above
rownames(taxonomy)<-taxonomy$SV_ID


One last step before we start the analysis: generate a phylogenetic tree. We will use this workflow to carry it out completely in R. If you were on a Unix computer with QIIME installed, you could also use MicrobeR’s Make.Tree command. The code below would generate a new tree; however, this can take some time to run, so instead we will use a precomputed tree which was downloaded as part of the tutorial data. Again, if it is missing, this code will regenerate it.

if(file.exists("tree.RDS")) {
    tree<-readRDS("tree.RDS")
} else{
seqs <- taxonomy$Seq
names(seqs) <- taxonomy$SV_ID # This propagates to the tip labels of the tree
alignment <- AlignSeqs(DNAStringSet(seqs), anchor=NA, verbose = FALSE)
phang.align <- phyDat(as(alignment, "matrix"), type="DNA")
dm <- dist.ml(phang.align)
treeNJ <- NJ(dm) # Note, tip order != sequence order
fit = pml(treeNJ, data=phang.align)
fitGTR <- update(fit, k=4, inv=0.2)
fitGTR <- optim.pml(fitGTR, model="GTR", 
                    optInv=TRUE, 
                    optGamma=TRUE,
                    rearrangement = "stochastic",
                    control = pml.control(trace = 0))
tree<-fitGTR$tree
tree<-midpoint(tree) #midpoint root the tree
saveRDS(tree, "tree.RDS")
}

Bonus: It is always a good idea to plot a tree to visually inspect it and ensure nothing looks out of place. We will color it by Phylum and add labels for the Genus of each SV. Does it look correct to you?

plottree <- ggtree(tree, layout = "circular", branch.length = "none")
plottree <- (plottree %<+% taxonomy[tree$tip.label, c("SV_ID", "Phylum", "Genus")] 
            #+ geom_tippoint(aes(color = Phylum),  size = 2) # to have points by tip
            + geom_tiplab2(aes(color = Phylum, label=Genus),  size = 2) 
            + theme(legend.position = "right") 
            + ggtitle("Cladogram of SV tree")
            )
print(plottree)

You have now generated all of the data we require to start meaningful analysis!!!


4. Exploratory Analysis


Background:


Before doing any statistical analysis, it is always a good idea to get a feel for your data. Bar plots and heat maps are a good approach for this. Phyloseq contains a number of tools for these analyses, as does MicrobeR. In this next section, experiment with different plots to get a 1000-foot view of your data and an idea of what questions you might want to ask in more detailed analyses.

Resources:

Phyloseq Tutorials/Graphics


One thing to always be aware of when dealing with microbiome data (or many ‘omics’ data sets) is whether your table is set up with sample names as rows and taxa as columns, or vice versa. Many tools explicitly expect one orientation versus the other. Going forward in this analysis, I would recommend transposing your table with t( ) to have your taxa as rows and samples as columns.

sv.table.analysis<-t(sv.table.filtered) #transpose table

If you want to take advantage of phyloseq, it would be ideal to have a phyloseq object which has your feature table, taxonomic assignments, phylogenetic tree, and metadata all in one. Try creating one.

phyobj<-phyloseq(otu_table(sv.table.analysis, taxa_are_rows = TRUE), tax_table(as.matrix(taxonomy)), phy_tree(tree), sample_data(metadata)) #make phyloseq object

Get a summary of this object:

phyobj #look at summary
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 202 taxa and 40 samples ]
## sample_data() Sample Data:       [ 40 samples by 10 sample variables ]
## tax_table()   Taxonomy Table:    [ 202 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 202 tips and 201 internal nodes ]

For plotting purposes it is often easier to look at higher-level taxonomic assignments. Let's start by creating these. Hint: see this forum link: https://github.com/joey711/phyloseq/issues/616 , or this ham-fisted way of doing it, which should be available as a function in your environment called Summarize.Taxa(): https://github.com/jbisanz/MicrobeR/blob/master/R/Summarize.Taxa.R

taxa.summaries<-Summarize.Taxa(sv.table.analysis, taxonomy) #Summarize counts at each taxonomic level

Now try making a bar plot of Genus level Abundances using either Phyloseq or MicrobeR’s Microbiome.Barplot( ).

Microbiome.Barplot(taxa.summaries[[6]], metadata, CATEGORY = "Timepoint") + ggtitle("Genus Abundances by Time Point")

What do you notice about the composition of the microbiome at the genus level when comparing visit one and the birth time point? Are you starting to get an idea for follow-up analyses?



Depending on the nature of the data, trends can sometimes be spotted in heat maps. Let’s create one.

Microbiome.Heatmap(sv.table.analysis, metadata, CATEGORY = "Timepoint", TRANSFORM="percent", NTOPFEATURES = 30)
## [1] "No plotting method specified, using ggplot2"
## [1] "Minimum abundance: 0"
## [1] "Maximum abundance: 98.3070175438597"

Did this confirm what we saw above? What genus does SV_1 belong to? Try to figure this out. Hint: look at your taxonomy table.
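Since the rows of the taxonomy table are now named by SV_ID (as we set up earlier), one possible lookup is a one-liner:

taxonomy["SV_1", c("Phylum","Family","Genus")] #pull the taxonomic assignment for SV_1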


5. Alpha Diversity


Background:

Alpha diversity is a measure of diversity within samples. Perhaps the simplest, and most commonly applied, metric is the number of species present in an ecosystem. Some metrics, such as Chao1, try to estimate both the species you can see and the ones you can’t. Others are a little more complex, such as Shannon diversity, which uses both the number of species present and their abundances: \[Div_{Shannon} = -\sum_{i=1}^{R} p_i \ln p_i\] where \(p_i\) is the proportional abundance of feature \(i\) and \(R\) is the number of features.
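To see the formula in action, here is a minimal sketch for a toy community of four species, checked against vegan's diversity( ):

counts<-c(50,30,15,5) #a toy community of four species
p<-counts/sum(counts) #proportional abundances p_i
-sum(p*log(p)) #Shannon diversity computed by hand
diversity(counts, index="shannon") #vegan returns the identical value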

Further Reading:

Comparison of alpha diversity metrics
Studying Microbial Diversity


Let's start by using two commonly applied measures of alpha diversity: observed species and Shannon diversity. You could use either vegan's diversity( ) or phyloseq's plot_richness( ).

print(plot_richness(phyobj, x="Timepoint", color="anonymized_name", measures=c("Observed","Shannon")) + theme_bw()) 

Note: You may get an error here about a lack of singleton reads. This is a result of our earlier filtering and does not represent a problem for your data set.

While this figure helps show that alpha diversity is altered, it is difficult to see how it changes on a per-individual basis. Since phyloseq creates this figure using a package called ggplot2, it is possible to add connecting lines directly, using metadata already part of the plot.

print(plot_richness(phyobj, x="Timepoint", color="anonymized_name", measures=c("Observed","Shannon")) 
      + geom_line(aes(group=anonymized_name)) # add lines connected by individual
      + theme_bw() #I prefer a white rather than gray background
      + theme(legend.position="none") #we do not need the legend, but feel free to include
)



It certainly appears that we have a trend, and it would be ideal to test this statistically. Generally speaking, in a 2-way comparison, a t-test or Mann-Whitney U test is appropriate. The data does not appear to be highly skewed, so it should be safe to carry out a t-test. Since the data represents repeated sampling of the same individuals, you should ensure this is also a paired analysis. Both the t-test (t.test) and Mann-Whitney U (wilcox.test) are built into R. Test the effect on observed SVs:

totest<-metadata
totest$Observed<-specnumber(sv.table.analysis[,metadata$sample_name], MARGIN=2) #add the number of observed SVs to the table. The [,metadata$sample_name] portion ensures that the order of the SV table matches the metadata; otherwise we could accidentally shuffle our values and sample names.
t.test(Observed~Timepoint, data=totest, paired=T)
## 
##  Paired t-test
## 
## data:  Observed by Timepoint
## t = -3.3999, df = 19, p-value = 0.003005
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -26.980742  -6.419258
## sample estimates:
## mean of the differences 
##                   -16.7
#OR A SAFER WAY OF DOING IT that doesn't assume pairs are ordered and allows for unpaired samples
#int<-intersect(subset(totest, Timepoint=="Birth")$anonymized_name, subset(totest, Timepoint=="SecondTrimester")$anonymized_name) #find the intersect of participants present at both time points
#totest<-totest[totest$anonymized_name %in% int,]  #only keep complete cases
#totest<-totest[order(totest$anonymized_name),] #order on name to put pairs together if they were not already
#t.test(totest[totest$Timepoint=="Birth",]$Observed, totest[totest$Timepoint=="SecondTrimester",]$Observed, paired=T)

*Depending on your alpha diversity metric of choice, you may or may not wish to subsample (normalize for sequencing effort) your data before the calculation. We will skip doing so today, but you will subsample for later analyses; see the sketch below.
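If you did want to subsample first, a sketch using vegan's rrarefy (which expects samples as rows) might look like this:

#rared<-rrarefy(sv.table.filtered, sample=min(rowSums(sv.table.filtered))) #randomly subsample every sample to the depth of the shallowest
#specnumber(rared) #then recalculate observed SVs on the even-depth table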

Conclusions:

What has this told us about the composition and diversity of the microbiome comparing early pregnancy and birth?


6. Beta Diversity


Background:

Beta diversity refers to between-sample similarity and is a commonly used way to compare communities as a whole. There are many metrics commonly applied; the most commonly encountered will be UniFrac and/or Bray-Curtis. At their simplest level, they look at the overlap between two communities and generate a value of dissimilarity. Typically a value of 0 denotes identical community composition while 1 suggests non-overlapping communities. The key lies in how they are calculated. One large difference is that UniFrac applies a phylogenetic tree to normalize these values for the evolutionary distance between community members, while Bray-Curtis does not. As an example, to Bray-Curtis, Lactobacillus crispatus and Lactobacillus iners are just as different as Lactobacillus crispatus and E. coli; because L. crispatus and L. iners belong to the same genus, this is a much smaller difference to a UniFrac metric. One last difference: unweighted UniFrac only uses the presence/absence of features, while weighted UniFrac also uses their abundances. As such, unweighted UniFrac (and similar non-abundance-weighted metrics) is prone to the effects of very low abundance taxa, which may be variable based on sampling depth. For this reason it is usually advisable to subsample or rarefy data. This is a way to normalize for sequencing depth in silico, which is appropriate here but may not always be advisable.
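To demystify these metrics, vegan's vegdist( ) can compute Bray-Curtis dissimilarities directly from our table; a quick sketch (vegdist expects samples as rows, hence the transpose):

bray<-vegdist(t(sv.table.analysis), method="bray") #all pairwise Bray-Curtis dissimilarities
as.matrix(bray)[1:3,1:3] #inspect a few sample pairs: 0 = identical, 1 = non-overlapping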



Many people confuse the idea of beta diversity metrics with ordination/principal components/coordinates analysis (NMDS/PCA/PCoA). Beta diversity creates sets of pair-wise dissimilarity values, while ordination approaches project them into a visual space wherein samples with more similar community structures are plotted next to each other.
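The two steps can be decoupled: any dissimilarity matrix can be ordinated with classical PCoA using base R's cmdscale( ). A minimal sketch:

ord<-cmdscale(vegdist(t(sv.table.analysis), method="bray"), k=2, eig=TRUE) #project the pairwise distances into 2 dimensions
plot(ord$points, xlab="PCo1", ylab="PCo2") #samples with similar communities land near each other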

Further Reading:

Studying Microbial Diversity

Resources:

https://joey711.github.io/phyloseq/plot_ordination-examples.html
http://cc.oulu.fi/~jarioksa/softhelp/vegan/html/adonis.html


To start, generate principal coordinate analyses of Bray-Curtis, weighted UniFrac, unweighted UniFrac, and Jensen-Shannon divergence. This can be accomplished using phyloseq's plot_ordination or MicrobeR's PCoA.

print(PCoA("braycurtis", metadata, sv.table.analysis, COLOR="Timepoint", TREE=tree, ADONIS = F))
## [1] "Subsampling OTU table to 6090 , currently has  202  taxa."
## [1] "...sampled to 6090 reads with 201 taxa"
## [1] "Doing Bray Curtis..."

print(PCoA("weightedunifrac", metadata, sv.table.analysis, COLOR="Timepoint", TREE=tree, ADONIS = F))
## [1] "Subsampling OTU table to 6090 , currently has  202  taxa."
## [1] "...sampled to 6090 reads with 201 taxa"
## [1] "Doing Weighted UniFrac..."

print(PCoA("unweightedunifrac", metadata, sv.table.analysis, COLOR="Timepoint", TREE=tree, ADONIS = F))
## [1] "Subsampling OTU table to 6090 , currently has  202  taxa."
## [1] "...sampled to 6090 reads with 201 taxa"
## [1] "Doing Unweighted UniFrac..."

print(PCoA("jsd", metadata, sv.table.analysis, COLOR="Timepoint", TREE=tree, ADONIS = F))
## [1] "Subsampling OTU table to 6090 , currently has  202  taxa."
## [1] "...sampled to 6090 reads with 201 taxa"



What do you note about how each of the approaches depicted the data? Which one looks best to you? As an additional technical note, ideally you want as much variation explained on the graph as possible, which is usually shown in the axis labels. Pick the metric that you think best represents the data and move forward with it; I am going to select weighted UniFrac. Sometimes meaningful information is contained on an additional axis of the PCoA. While not necessary, try rendering this in 3D using MicrobeR's PCoA3D! If you do not see the rendering, enable WebGL in your browser: in Chrome, type chrome://flags in your address bar, then enable WebGL 2.0. This may not work on some Windows systems.

PCoA3D("weightedunifrac", metadata, sv.table.analysis, COLOR="Timepoint", TREE=tree) #This may not work on some windows system
## [1] "Subsampling OTU table to 6090 , currently has  202  taxa."
## [1] "...sampled to 6090 reads with 201 taxa"
## [1] "Doing Weighted UniFrac..."

Did this help at all?



You have established visually that there appear to be differences in community structure between Visit 1 and Birth. Remember that this is longitudinal sampling. To help look at sample pairing, connect points belonging to the same person using an approach similar to the one you used in the alpha diversity analysis.

print(PCoA("weightedunifrac", metadata, sv.table.analysis, COLOR="Timepoint", TREE=tree, ADONIS = F) 
      + geom_line(aes(group=anonymized_name), color="grey") #Add a grey line connecting an individual's time points
      + geom_text(aes(label=anonymized_name, color=Timepoint), hjust=0, vjust=1, size=3) #add in the individual's name
      + ggtitle("Weighted UniFrac: Points joined by Participant")
      )
## [1] "Subsampling OTU table to 6090 , currently has  202  taxa."
## [1] "...sampled to 6090 reads with 201 taxa"
## [1] "Doing Weighted UniFrac..."

To test this statistically, two commonly applied tests are ANOSIM and ADONIS. Today we will apply ADONIS, as it can handle both categorical and continuous variables, while ANOSIM is limited to categorical variables. The general steps in this testing are to (1) generate your distance matrix (if you have not already), and (2) carry out the ADONIS test with vegan's adonis( ) function. Look at the effect of time, but also of the participant of origin.


1- Generate the distance matrix

sv.subsampled<-Subsample.Table(sv.table.analysis) #normalize for sequencing depth
## [1] "Subsampling OTU table to 6090 , currently has  202  taxa."
## [1] "...sampled to 6090 reads with 201 taxa"
unweighted<-UniFrac(phyloseq(sv.subsampled, tree), weighted=F) #Generate a distance matrix, consider taking a look at this


2- Now run the ADONIS test.

adonis(unweighted ~ anonymized_name + Timepoint, data=metadata, permutations = 9999)
## 
## Call:
## adonis(formula = unweighted ~ anonymized_name + Timepoint, data = metadata,      permutations = 9999) 
## 
## Permutation: free
## Number of permutations: 9999
## 
## Terms added sequentially (first to last)
## 
##                 Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)   
## anonymized_name 19    2.9159 0.15347  1.4010 0.52819 0.0697 . 
## Timepoint        1    0.5234 0.52339  4.7781 0.09481 0.0024 **
## Residuals       19    2.0813 0.10954         0.37700          
## Total           39    5.5206                 1.00000          
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Were the results significant? An important number to also consider is the R2 value, which indicates the % of variation explained. What explains more of the variation, time or person? What would you have expected?



But again, remember this is a longitudinal analysis, so we can account for repeated sampling of individuals by specifying strata. Repeat the test using strata.

adonis(unweighted~Timepoint, strata=metadata$anonymized_name, data=metadata) #include strata
## 
## Call:
## adonis(formula = unweighted ~ Timepoint, data = metadata, strata = metadata$anonymized_name) 
## 
## Blocks:  strata 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)   
## Timepoint  1    0.5234 0.52339    3.98 0.09481  0.009 **
## Residuals 38    4.9972 0.13150         0.90519          
## Total     39    5.5206                 1.00000          
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Did this change your result?

Conclusions:

What has this told us about the gross composition of the microbiome comparing early pregnancy and birth?


7. Testing for differential features with DESeq2


Background

Ultimately, what people often care most about is individual features/strains. After all, scientists tend to take a reductionist approach and want to identify novel pathogens or new probiotics in microbiome data. While 16S rRNA gene sequencing is fundamentally a tool designed to look at communities as a whole, many approaches exist for looking at individual features, with a wide range of underlying philosophies. Generally speaking, it is a good idea to use multivariate statistics, such as the beta diversity metrics above, before looking for differences in individual taxa.

While there is not necessarily a single correct way to find significant taxa, there is certainly a wrong way: do not take raw sequence counts and do a series of uncorrected t-tests. Remember that your data is a technical sampling (if subsampled) of a technical sampling (which DNA binds to the sequencing flow cell), of a technical sampling (which DNA was amplified by the PCR primers), of a technical sampling (which portion of a sample was extracted for DNA), of a biological sampling (one arbitrary sample collection from an individual). As such, it is important to realize that if you were to analyze replicate samples at every step of the process you would end up with slightly different numbers (but hopefully the same trends). Even the same sequencing library sequenced multiple times will yield slightly different results. Some tools are more false-positive prone than others, and some create better visualizations than others. Some links to tools are provided below, but today we will use DESeq2. DESeq2 was originally intended for RNA-seq data, but this is conceptually no different from an amplicon library: both are just tables of counts with similar technical considerations.

Resources:

DESeq2
ALDEX2
LEFse
edgeR
PhyloSeq_to_DESeq2


We will use an approach based on these instructions.

#You can manually feed data to DESeq2, but today I am going to use the phyloseq_to_deseq2() function of phyloseq to do this.
dds = phyloseq_to_deseq2(phyobj, ~ Timepoint)
dds = DESeq(dds, test="Wald", fitType="local", parallel = TRUE)

Did this fail? It probably should have, and this is a fairly common problem. There is a high probability that some microbes are not present in every sample, and because the default normalization relies on geometric means (the log geometric mean is undefined whenever a feature has a zero count in any sample), these zeros cause the size factor calculation to fail.

There are a number of ways we can work around this issue. Try the approach described here, which supplies a new geometric mean function that tolerates zeros:

dds = phyloseq_to_deseq2(phyobj, ~ Timepoint)
## converting counts to integer mode
gm_mean = function(x, na.rm=TRUE){exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))} #declare new function
geoMeans = apply(counts(dds), 1, gm_mean) #calculate the geometric means
dds = estimateSizeFactors(dds, geoMeans = geoMeans) #manually calculate the size factor using these new geometric means instead of the default DESeq2 approach
dds = DESeq(dds, test="Wald", fitType="local")
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 139 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing

Now that you have calculated your results, you can get the data back in tabular form and filter for only those that have an FDR value < 0.1 and a fold change of > 2-fold. Remember that DESeq2 reports changes on a log2 scale, so the actual fold change is 2^log2FoldChange; if we want a 2-fold change either up or down, 1/-1 is our cutoff.

results.deseq<-as.data.frame(results(dds))
filt.results.deseq<-subset(results.deseq, padj<0.1 & (log2FoldChange>1 | log2FoldChange<(-1))) # FDR<0.1 AND either foldchange>1 OR foldchange<-1
print(paste("There are", nrow(filt.results.deseq), "significantly different SVs"))
## [1] "There are 22 significantly different SVs"
Nice.Table(filt.results.deseq) #display features in a table

Now this is good, but remember our study design is over time, so you must account for this. See here for ideas on how to accomplish this.

dds = phyloseq_to_deseq2(phyobj, ~ anonymized_name + Timepoint)
## converting counts to integer mode
geoMeans = apply(counts(dds), 1, gm_mean)
dds = estimateSizeFactors(dds, geoMeans = geoMeans)
dds = DESeq(dds, test="Wald", fitType="local", parallel = TRUE) #may need to set parallel = FALSE on some windows systems
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates: 2 workers
## mean-dispersion relationship
## final dispersion estimates, fitting model and testing: 2 workers

Apply your filtering criteria again and look at how accounting for the participant affected the number of significant results. Try changing your significance and fold change cutoffs and see how this changes the number of significant features.

results.deseq<-as.data.frame(results(dds))
filt.results.deseq<-subset(results.deseq, padj<0.1 & (log2FoldChange>1 | log2FoldChange<(-1)))
print(paste("There are", nrow(filt.results.deseq), "significantly different SVs"))
## [1] "There are 18 significantly different SVs"
Nice.Table(filt.results.deseq)

Here we have our results, but the actual sequences aren’t very helpful. We could look them up in our taxonomy table to find their higher-order taxonomic assignments, but instead let’s try to assign them directly to a species. Dada2 has a tool for this!

#we will only assign species to the sequences in the significant results
species<-assignSpecies(taxonomy[rownames(filt.results.deseq),]$Seq, "rdp_species_assignment_14.fa.gz", allowMultiple = TRUE, verbose=TRUE)
## 14 out of 18 were assigned to the species level.
species.results<-cbind(filt.results.deseq, species) #paste the new genus and species names onto the right of the deseq results
Nice.Table(species.results)

This is a lot more helpful. *Note that you have the option to allow multiple best hits for species assignments, which creates a /-separated list of species. We can now see exactly which organisms have changed.

It would be nice to visualize these changes. One common way is a volcano plot. Try to create one with fold change on the x axis and -log10(p-value) on the y axis.

print(ggplot(results.deseq, aes(x=log2FoldChange, y=-log10(pvalue)))
      + geom_point(alpha=0.2) #alpha is opacity, this helps see overlapping points
      + theme_bw()
      + geom_point(data=species.results, aes(x=log2FoldChange, y=-log10(pvalue)), color="red") #add red points for only the significant results
      + geom_text(data=species.results, aes(x=log2FoldChange, y=-log10(pvalue), label=Genus), color="red", hjust=1, vjust=1, alpha=0.5) #add genus names as well for the significant findings
      )

It is also generally a good idea to plot significant features individually, so let's try this. There are many options for how to plot abundance; in this case I will opt for the centered log2 ratio (CLR) as it is already on a log scale, but try multiple methods of plotting, including % abundance.
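For intuition: the CLR is each count (plus a small pseudocount to handle zeros) expressed relative to its sample's geometric mean on a log2 scale. A bare-bones sketch of the idea, with 0.5 as an illustrative pseudocount (MicrobeR's Make.CLR handles these details for us below):

clr.sketch<-function(x, pseudo=0.5){log2(x+pseudo)-mean(log2(x+pseudo))} #equivalent to log2((x+pseudo)/geometric mean of the sample)
#manual.clr<-apply(sv.table.analysis, 2, clr.sketch) #apply per sample (samples are columns in our table)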

normalized.table<-Make.CLR(sv.table.analysis) #normalize with CLR
toplot<-merge(metadata[,c("anonymized_name","Timepoint", "sample_name")], t(normalized.table), by="row.names", all=F) #merge our metadata with the CLR-normalized data for plotting

toplot<-melt(toplot, id.vars=c("anonymized_name","Timepoint", "sample_name"), #convert to data long-form which is easier to plot. Check what it looks like before and after
             measure.vars=rownames(filt.results.deseq), #only include values for the significant results
             variable.name="SV", #the name of the column with the significantly different SVs
             value.name="Abundance" #the name of the column containing the CLR abundance
            )

print(ggplot(toplot, aes(x=Timepoint, y=Abundance))
      + geom_boxplot() #plot as boxplots
      + geom_line(aes(group=anonymized_name)) #add lines connecting participants
      + theme_bw() # This is my pick for plotting theme
      + ylab("CLR Abundance")
      + facet_wrap(~SV, scales="free") #repeat plot for every significant SV
      )



Why do you think we have a highly abundant organism without a known genus or species? How do you think you could fix this problem? Give it a stab and see if you can figure out what SV_2 is. The entire internet is at your fingertips, and you can find the sequence in your taxonomy table.
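As a starting point, you can pull the sequence itself back out of the taxonomy table and, for example, BLAST it against NCBI:

taxonomy["SV_2","Seq"] #the full sequence of SV_2, ready to paste into a BLAST search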

Conclusions:



What do you think this says about organisms that shift during pregnancy? Does this fit with what you saw in your bar plot, alpha diversity, and beta diversity analyses? Have you replicated the findings of the original paper? Did you do a more comprehensive analysis than the original paper? How would you analyze the full time series of data instead of just the enrollment and birth visits?


8. Novel analysis


If you have finished early, it is now your turn to come up with a novel question to ask using this data. Perhaps look at Nugent classification, or age, or pH. Before you leave the tutorial, show us your novel analysis and let us know what you think it means. Try to generate new figures and/or analyses. Here is an idea to get you started:

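For example, does community structure track Nugent class? A sketch (assuming, as with Timepoint, that any metadata column can be passed to COLOR in MicrobeR's PCoA):

print(PCoA("weightedunifrac", metadata, sv.table.analysis, COLOR="nugent_class", TREE=tree, ADONIS = F)) #color by Nugent class instead of time point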
## [1] "Subsampling OTU table to 6090 , currently has  202  taxa."
## [1] "...sampled to 6090 reads with 201 taxa"
## [1] "Doing Weighted UniFrac..."


9. Wrap up


You have now completed a fairly comprehensive analysis of a microbiome data set, starting with raw data like you would get directly from a sequencer or read repository. These steps are fairly generalizable to multiple study types. If you think 16S rRNA gene analysis is something you will do often, you should experiment with other approaches and see if you converge on the same conclusions: QIIME, QIIME2, UPARSE, Mothur.

If you would like feedback on your code, please email it to jordan.bisanz@ucsf.edu.