QIIME2 Workshop

UCSF Benioff Center for Microbiome Medicine Bioinformatics Workshop Series

16 April 2020

Preparing for tutorial

Installation Option 1: Running Locally (if using OSX or Linux) *Preferred

We are going to be installing a local version of QIIME2 version 2020.2. The installation instructions are available here. The installation process could take >30min depending on your internet connection. Please install well ahead of the workshop.

  • Step 1: Download and install miniconda and read the instructions for installation.
  • Step 2: Using the terminal, download the instructions needed to install qiime as below by selecting the line for your operating system*:
wget -O qiime2.yml https://data.qiime2.org/distro/core/qiime2-2020.2-py36-osx-conda.yml   # use this line for Mac/OSX
wget -O qiime2.yml https://data.qiime2.org/distro/core/qiime2-2020.2-py36-linux-conda.yml # use this line for linux

*Note: the wget tool will download linked files to your computer. The -O qiime2.yml in the command above specifies name of the file when it is downloaded. In the event you do not already have wget installed, you can get it using conda with the command conda install wget.

Out of interest, you should look at what is contained within this .yml file: it is a list of the dependencies required to run qiime2. You can do this with the bash command less qiime2.yml.

Now we can create our Conda environment specifically for qiime2 as below:

conda env create --name qiime2-2020.2 --file qiime2.yml

In this command conda env create is telling conda we want to create a new environment. --name qiime2-2020.2 is specifying that we want our environment to be named qiime2-2020.2 and --file qiime2.yml is providing the instructions for the installation*.

*Note: we specify the input to many shell tools using what are called flags which often take the form of either a single dash followed by a letter (ex -n) or two dashes followed by a word (ex --name). You can often use either but using the full description for the flag makes the code easier to read and understand later.

Next we want to activate our new environment as below:

conda activate qiime2-2020.2

*Note: at any time you can deactivate your environment with the command conda deactivate

Now we can create a new folder which will contain all the files for the tutorial and the results we calculate:

mkdir tutorial16S # make a new folder 
cd tutorial16S # change directory into the new folder

And finally we will download our raw data:

wget https://github.com/jbisanz/MicrobiomeWorkshop2020/raw/master/qiime2/tutorialdata.tar.gz
tar -xvf tutorialdata.tar.gz # decompress the data which should form a new folder called data

Installation Option 2: Running on Interactive Node of Wynton Cluster (Mandatory for Windows Users)

Note: to use this option, you need to have signed up for an account in advance.

Start by logging into a wynton login node as below:

ssh yourusername@log1.wynton.ucsf.edu #use ssh to login to wynton

Since we can’t run code directly on a login node, we subsequently connect to one of the development nodes (dev1, dev2, or dev3) as below:

ssh dev3 #use ssh again to login to a development node

You should currently be in your own home directory which you can check using the pwd command. In my case it is /wynton/home/turnbaugh/jbisanz. Remember this location for later as we will need it to download our results back from the cluster.

Rather than have everyone install their own Conda environment, you can instead access the conda install and qiime2 environment which I maintain as below:

source /turnbaugh/qb3share/shared_resources/Conda/etc/profile.d/conda.sh #loading conda into your terminal session
conda activate qiime2-2020.2 #activating a conda environment I have already made

*Note: at any time you can deactivate your environment with the command conda deactivate

Now we can create a new folder which will contain all the files for the tutorial and the results we calculate:

mkdir tutorial16S # make a new folder 
cd tutorial16S # change directory into the new folder

And finally we can copy the tutorial data to our directory:

cp /wynton/home/turnbaugh/jbisanz/tutorialdata.tar.gz . #copy the tutorial data from my folder to yours
tar -xzvf tutorialdata.tar.gz # decompress the raw data into a new folder called data/

!!Stop here as we will go through the remainder together!!


Background

Today we will be examining a portion of a longitudinal dataset derived from germ-free mice which were administered human fecal samples before and after a prolonged weight loss diet. Mice which received fecal samples from after diet proceeded to lose a significant amount of weight compared to the pre-diet group. This leads to the obvious question, what aspect of the human microbiome can transmit a weight loss phenotype to mice? To answer this question, we used V4 16S rRNA amplicon sequencing prepared according to a slightly modified version of the earth microbiome project protocols. The data we will use today has been down sampled to 20% of the original sequencing data and only includes 18 samples (6 mice x 3 time points). This dataset is small enough that no step of processing the data will take more than a minute or two to run on a standard personal computer.

Two quick notes before we begin relevant to those using the wynton cluster:

  • In general, you should never be running your actual analysis on the interactive development nodes (dev1, dev2, dev3); however, this dataset is so small that it should have minimal impact. What you would ideally do would be to chain all of your commands together into a single script and submit this using qsub. An example of such a practice using an older version of qiime could be found here.

  • To access your results from the cluster, you will need to transfer them to your own computer. This can be accomplished using a command line tool called scp (secure copy). Because we will be putting all of our results into a single folder called artifacts, we can download them to our local computer using the following command to get the results. Note, this commands needs to be run from a separate terminal window which is not connected to the cluster and you would of course substitute in your username in place of jbisanz.

scp -r jbisanz@log1.wynton.ucsf.edu:/wynton/home/turnbaugh/jbisanz/tutorial16S/artifacts TutorialResults

In the above commands scp -r means to recursively copy the entire folder, then we specify the source of the transfer with our username and the server: jbisanz@log1.wynton.ucsf.edu, and then the folder on that server we want: /wynton/home/turnbaugh/jbisanz/tutorial16S/, and finally where we want it to go on our personal computer: TutorialResults. To view the artifact, you can click and drag it onto view.qiime2.org. You can also access the precomputed artifacts here.


Importing Sequencing Data

The data we have today was sequenced using 251x151 reads on an Illumina MiSeq, the target region is ~252 base pairs long which means that there is considerable overlap between these reads allowing for many layers of error correction as in the schematic below:

Forward ------------------------->
                   <--------------- Reverse

Typically, sequence reads come as compressed fastq files (.fastq.gz) with two separate files denoting the forward and reverse orientation (usually marked _R1.fastq.gz or _R2.fastq.gz).

Let us explore the files we will be using today that you downloaded before the workshop. These are found in data/reads and we can list them as below using the list command (ls -lh) wherein -l specifies in a tabular list and -h specifies the file sizes in a human readable format (ex Mb):

ls -lh data/reads/

You should see a list of 18 files as below:

-rwxr-xr-x. 1 jbisanz lsd 1.7M Apr 14 10:54 MOUSE.G11.Day1_1.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.5M Apr 14 10:54 MOUSE.G11.Day1_2.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.5M Apr 14 10:54 MOUSE.G11.Day3_1.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.4M Apr 14 10:54 MOUSE.G11.Day3_2.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.5M Apr 14 10:54 MOUSE.G11.Day7_1.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.4M Apr 14 10:54 MOUSE.G11.Day7_2.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.4M Apr 14 10:54 MOUSE.G12.Day1_1.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.2M Apr 14 10:54 MOUSE.G12.Day1_2.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.2M Apr 14 10:54 MOUSE.G12.Day3_1.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.2M Apr 14 10:54 MOUSE.G12.Day3_2.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.3M Apr 14 10:54 MOUSE.G12.Day7_1.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.2M Apr 14 10:54 MOUSE.G12.Day7_2.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.4M Apr 14 10:54 MOUSE.G21.Day1_1.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.2M Apr 14 10:54 MOUSE.G21.Day1_2.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.7M Apr 14 10:54 MOUSE.G21.Day3_1.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.6M Apr 14 10:54 MOUSE.G21.Day3_2.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 2.0M Apr 14 10:54 MOUSE.G21.Day7_1.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.7M Apr 14 10:54 MOUSE.G21.Day7_2.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.6M Apr 14 10:54 MOUSE.H11.Day1_1.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.4M Apr 14 10:54 MOUSE.H11.Day1_2.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.7M Apr 14 10:54 MOUSE.H11.Day3_1.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.6M Apr 14 10:54 MOUSE.H11.Day3_2.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.6M Apr 14 10:54 MOUSE.H11.Day7_1.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.5M Apr 14 10:54 MOUSE.H11.Day7_2.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.6M Apr 14 10:54 MOUSE.H12.Day1_1.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.4M Apr 14 10:54 MOUSE.H12.Day1_2.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.5M Apr 14 10:54 MOUSE.H12.Day3_1.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.3M Apr 14 10:54 MOUSE.H12.Day3_2.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.5M Apr 14 10:54 MOUSE.H12.Day7_1.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.4M Apr 14 10:54 MOUSE.H12.Day7_2.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.6M Apr 14 10:54 MOUSE.H13.Day1_1.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.4M Apr 14 10:54 MOUSE.H13.Day1_2.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.6M Apr 14 10:54 MOUSE.H13.Day3_1.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.4M Apr 14 10:54 MOUSE.H13.Day3_2.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.4M Apr 14 10:54 MOUSE.H13.Day7_1.fastq.gz
-rwxr-xr-x. 1 jbisanz lsd 1.3M Apr 14 10:54 MOUSE.H13.Day7_2.fastq.gz

If we wanted to examine what these files look like, we can chain two commands together: zcat which will decompress the file on the fly, and head to show is the top 10 lines. We can connect these together using the pipe operator | which sends the results of zcat to head:

zcat data/reads/MOUSE.G11.Day1_1.fastq.gz | head

*Note: on some OS X systems zcat may not function as intended which is a frustrating design feature, you can use gzcat instead

We should get an output that looks like this:

@EXP2.MOUSE.G11.Day1.PreWTL_40 M01869:152:000000000-AMWM8:1:1101:19987:1450 1:N:0:0 orig_bc=CTAATCCGGAAC new_bc=CTAATCCGGAAC bc_diffs=0
TACGTAGGTGGCAAGCGTTGTCCGGATTTACTGGGCGTAAAGGGAGCGTAGGCGGATGTTTACGTCCGTTGTCCAATCCTCCCGCTCCCCTTTCCTCCTCCCTTCCCTCCTCTCTTTCTTGAGTCCCGTTGTGGTCTCTAGACTTCCTTTTTTCCCGTTGAACTCCTTAGTTCTTCCGCAGACTTCCAGTTCCGCCGCCCCCTTTCTGCACTTTCCCTTCCCCTTATCCTCTCCCCCCTGTGTCTCACCCC
+
@6A@C9,-C-,C<9<D@CC+BEFCC,BCE9CE@99@@F+,,;,,,,@+@77-C+++7+:<9,-,:,,+,9,8,,,,<,996,,+8@7+,899,,,5,4:,:+9944,,,8B,,,5,9:?,,,,<,:,+,,++++,,,,,,,,,<5CA?,,:,8,,:+,8++,,8,8,8,,,,,:3,,*****,3,<8,,7,,6***5*4*14:<2>;,,+55+2,,5>+*5+02*+++33;+*++*/*1**,**+1++*1*
@EXP2.MOUSE.G11.Day1.PreWTL_640 M01869:152:000000000-AMWM8:1:1101:9058:1630 1:N:0:0 orig_bc=CTAATCCGGAAC new_bc=CTAATCCGGAAC bc_diffs=0
TACGTAGGTGGCGAGCGTTGTCCGGATTTACTGGGCGTAAAGGGAGCGTAGGTGGACTTTTAAGTGAGTTGTGAAATACCCGGGCTCAACTTTGGTGCTGCATTTCATACTGGAAGTCTTGAGTGCAGTAGAGGAGAATGGAATTCCTAGTGTAGCGGTGAAATGCGTAGAGATTAGGAAGAACACCAGTGGCGAAGGCGATTCTCTGGACTGTACCTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACA
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGFEGGGF,-CC,,,,<,C,,,,<,:9C++6@CC5,C5C,,,:,5C,:,<@EA9,,<E,,,,,:CF,,,5C,9,,,,,,5,,+,,::,=,BAFEE,,:AC9,7+7F7=,3C,>3@>F+,,@>,,7,,3,3@,@>78F:,@<7*>*:<*;>;@CE@7,=C7B,,6>76C*/=****0;@E*+*/*1293*))*0**+2*

FASTQ files contain a repeating pattern of 4 lines:

  • 1: The header line which specifies the name of the sample and information about it.
  • 2: The DNA sequence
  • 3: a + sign which may optionally repeat the header line
  • 4: a quality string which denotes the certainty of the base call for each base of the sequence. See here for help interpreting its meaning.

Due to a design choice, QIIME2 prefers to work on a specific file structure called and artifact (.qza) and to get our data into QIIME2 we need to use the tool qiime tools import. We specify the sample names and locations of all the files using what is called a manifest which has been provided in data/read_manifest.csv. Below I have looked at the top 10 lines of the manifest.

How would you look at the top 10 lines of the read manifest?

sample-id,absolute-filepath,direction
MOUSE.G11.Day1,$PWD/data/reads/MOUSE.G11.Day1_1.fastq.gz,forward
MOUSE.G11.Day1,$PWD/data/reads/MOUSE.G11.Day1_2.fastq.gz,reverse
MOUSE.G11.Day3,$PWD/data/reads/MOUSE.G11.Day3_1.fastq.gz,forward
MOUSE.G11.Day3,$PWD/data/reads/MOUSE.G11.Day3_2.fastq.gz,reverse
MOUSE.G11.Day7,$PWD/data/reads/MOUSE.G11.Day7_1.fastq.gz,forward
MOUSE.G11.Day7,$PWD/data/reads/MOUSE.G11.Day7_2.fastq.gz,reverse
MOUSE.G12.Day1,$PWD/data/reads/MOUSE.G12.Day1_1.fastq.gz,forward
MOUSE.G12.Day1,$PWD/data/reads/MOUSE.G12.Day1_2.fastq.gz,reverse
MOUSE.G12.Day3,$PWD/data/reads/MOUSE.G12.Day3_1.fastq.gz,forward

We will start by creating a folder to store all of our artifacts in using the make directory command (mkdir):

mkdir artifacts

Since this is the first qiime command we will run. Make sure you have activated your qiime2 environment (conda activate qiime2-2020.2) and confirm your current working directory using pwd. The basic structure of a QIIME2 command is qiime which is followed by the name of what is called a plugin, as described here. Finally we get a series of commands and flags which in this case describes that we want to import data of the type ParedEndSequencesWithQuality aka paired FASTQ files. Finally, to make the code easier to read, we can separate it over multiple lines using \.

qiime tools import \
 --type 'SampleData[PairedEndSequencesWithQuality]' \
 --input-path data/read_manifest.csv \
 --output-path artifacts/Reads.qza \
 --input-format PairedEndFastqManifestPhred33

Now, if you were to look within the artifacts folder, what would you find? How would we see the contents of artifacts?

Finally as a last check, we want to look at the quality of our sequencing data which can be accomplished using the qiime demux summarize command.

qiime demux summarize \
  --i-data artifacts/Reads.qza \
  --o-visualization artifacts/ReadQC.qzv

Note, many outputs we create today are called visualizations (.qzv). These can either be viewed directly using the qiime tools view command, or by clicking and dragging the file onto view.qiime2.org. We can now see a number of useful outputs such as the number of reads on a per-sample basis and sequencing quality as a function of read position as below:

*Note: depending on the method of library construction, it may be necessary to remove primer/adapter sequences which can be easily accomplished using the qiime cutadapt tool.


Dada2

Now we can get into the real workhorse of processing the data which is Dada2. It is going to take care of doing a number of distinct tasks:

  • Filtering and running QC on our samples
  • Learning sequencing error profiles for the run
  • Correcting sequencing errors
  • Overlapping reads
  • Removing Chimeras !Important!
  • Generating a final feature table

It takes a number of arguments below which are explained using by comments (anything following a #). Remember you can always look at the help for a function by running something like qiime dada2 denoise-paired --help.

The flags we will be providing are explained below:

  • --i-demultiplexed-seqs the input reads we prepared above
  • --p-trunc-len-f how many bases to trim forward read to
  • --p-trunc-len-r how many bases to trim reverse read to
  • --p-trim-left-f how many bases to trim of front of forward read
  • --p-trim-left-r how many bases to trim of front of reverse read
  • --p-n-threads number of cores/processors to use
  • --o-table artifacts/Feature_Table.qza a table of features and their counts in samples
  • --o-representative-sequences a list of the DNA sequences for each feature
  • --o-denoising-stats a table containing information about the denoising process
  • --verbose give us lots of updates while running
qiime dada2 denoise-paired \
    --i-demultiplexed-seqs artifacts/Reads.qza \
    --p-trunc-len-f 220 \
    --p-trunc-len-r 130 \
    --p-trim-left-f 5 \
    --p-trim-left-r 5 \
    --p-n-threads 4 \
    --o-table artifacts/Feature_Table.qza \
    --o-representative-sequences artifacts/Feature_Sequences.qza \
    --o-denoising-stats artifacts/Feature_Statistics.qza \
    --verbose

Now that we have done the hard part of our data processing, there is no point in holding onto duplicate copies of our sequencing data. We can remove the Reads.qza artifact below using the rm command, but as always, be very careful when using rm as there is no backup.

rm artifacts/Reads.qza

From the work we have done thus far, there are two var important files:

  • Feature_Table.qza this contains the abundances of each feature (organism) on a per-sample basis
  • Feature_Sequences.qza this contains the DNA sequence of each feature

A major downside to using QIIME2 is that it is very difficult to directly look at your data without first exporting it to another format. In our session on using R to do microbiome analysis we will use a package I developed (qiime2R) which facilitates reading them into R in more standard formats; however, today we will use the qiime tools export function to do that with the sequences:

qiime tools export --input-path artifacts/Feature_Sequences.qza --output-path exports

Now if we look at the top of the sequences using head exports/dna-sequences.fasta, we see the following in a standard fasta format:

>cd3e52559a9e53dce7721688de8fd9b5
AGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGCATGGCAAGCCAGATGTGAAAGCCCGGGGCTCAACCCCGGGACTGCATTTGGAACTGTCAGGCTAGAGTGTCGGAGAGGAAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGACGATGACTGACGTTGAGGCTCGAAAGCGTGGGGAGCAA
>438fb020cf251f2c2f4e89552e4a78ff
AGGTCTCAAGCGTTGTTCGGAATCACTGGGCGTAAAGCGTGCGTAGGCTGTTTCGTAAGTCGTGTGTGAAAGGCGCGGGCTCAACCCGCGGACGGCACATGATACTGCGAGACTAGAGTAATGGAGGGGGAACCGGAATTCTCGGTGTAGCAGTGAAATGCGTAGATATCGAGAGGAACACTCGTGGCGAAGGCGGGTTCCTGGACATTAACTGACGCTGAGGCACGAAGGCCAGGGGAGCGA

*Note: rather than use a name such as OTU_1 or E. coli, the sequences are refered to with a unique alpha numeric identifier like cd3e52559a9e53dce7721688de8fd9b5. This is actually a md5 checksum of the sequence and as such is a unique fingerprint to each feature’s sequence while using significantly less characters.


Building a tree

For many analyses, we may want to consider the relatedness of the organisms across samples which requires the generation of a phylogenetic tree. This generally consists of three steps which are completed by qiime phylogeny align-to-tree-mafft-fasttree:

  • Align the sequences (MAFFT)
  • Mask uninformative or ambiguous sequences
  • Build a tree (FastTree)
qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences artifacts/Feature_Sequences.qza \
  --o-alignment artifacts/Feature_Alignment.qza \
  --o-masked-alignment artifacts/Feature_MaskedAlignment.qza \
  --o-tree artifacts/Feature_TreeUnrooted.qza \
  --o-rooted-tree artifacts/Feature_TreeRooted.qza

Unfortunately QIIME2 does not have a tool for visualizing trees. We will cover how to do this in the “Advanced visualization and statistical analysis in R” workshop but for now I include a graphical representation of it looks like below:


Taxonomic Analysis

As you may have guessed, neither cd3e52559a9e53dce7721688de8fd9b5 or AGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAA… are particularly helpful in figuring out which microorganisms are present in a sample. To do that we need to assign these sequences to a taxonomy. A number of tools and databases have been developed for this purpose, and we will examine my preferred tool/database in the “Advanced visualization and statistical analysis in R” workshop, but for now we will use the qiime feature-classifier classify-sklearn function and the Green Genes 13.8 classifier which as been trained on V4 sequencing data data/gg-13-8-99-515-806-nb-classifier.qza as below:

qiime feature-classifier classify-sklearn \
  --i-reads artifacts/Feature_Sequences.qza \
  --i-classifier data/gg-13-8-99-515-806-nb-classifier.qza \
  --o-classification artifacts/Feature_Taxonomy.qza

Again, having a table with >200 taxonomic assignments is not particularly helpful unless we know which samples contain which organisms and in what abundances. A classic way to examine this is through generating a taxonomic bar plot. QIIME2 has an excellent tool for generating these plots called qiime taxa barplot. At this point we should look at a very important file for conducting meaningful analysis which is the metadata. These files describe each sample and the relevant variables of which we want to analyze. You have been provided a metadata file in data/metadata.tsv which we can view using the less command as below:

less data/metadata.tsv
SampleID        MouseID Time_days       Treatment
#q2:types       categorical     numeric categorical
MOUSE.G11.Day1  G11     1       Before_Diet
MOUSE.G11.Day3  G11     3       Before_Diet
MOUSE.G11.Day7  G11     7       Before_Diet
MOUSE.G12.Day1  G12     1       Before_Diet
MOUSE.G12.Day3  G12     3       Before_Diet
MOUSE.G12.Day7  G12     7       Before_Diet
MOUSE.G21.Day1  G21     1       Before_Diet
MOUSE.G21.Day3  G21     3       Before_Diet
MOUSE.G21.Day7  G21     7       Before_Diet
MOUSE.H11.Day1  H11     1       After_Diet
MOUSE.H11.Day3  H11     3       After_Diet
MOUSE.H11.Day7  H11     7       After_Diet
MOUSE.H12.Day1  H12     1       After_Diet
MOUSE.H12.Day3  H12     3       After_Diet
MOUSE.H12.Day7  H12     7       After_Diet
MOUSE.H13.Day1  H13     1       After_Diet
MOUSE.H13.Day3  H13     3       After_Diet
MOUSE.H13.Day7  H13     7       After_Diet

As you can see above, our data contains 4 columns:

  • SampleID - a unique sample identifier
  • MouseID - a unique identifier for each mouse in the study (3 samples per mouse)
  • Time_days - a measurement of time after colonization in days
  • Treatment - a column describing if the mouse received human samples from before or after diet

The second line starting with #q2:types is a special feature for qiime2 which allows you define whether or not the variable should be treated as a number or a category which is particularly important for variables like time. Now we are ready to generate our bar plots:

qiime taxa barplot \
  --i-table artifacts/Feature_Table.qza \
  --i-taxonomy artifacts/Feature_Taxonomy.qza \
  --m-metadata-file data/metadata.tsv \
  --o-visualization artifacts/TaxonomyBarplot.qzv

View the output by either using qiime tools view artifacts/TaxonomyBarplot.qzv or by downloading the file using scp and using view.qiime2.org. Use the sliders and drop boxes to try to recreate the following plot:

What conclusions can we draw from this plot? What are its strengths? What are its limitations? What organisms do you think are different between these two communities (if any)?


Alpha Diversity Analysis

Alpha diversity is a measure of diversity within samples. Perhaps the simplest metric you could think of is the number of species present in an ecosystem which is commonly applied. Others are a little bit more complex such as Shannon Diversity which use both the number of species present and their abundances: \[Div_{shannon} = - \sum_{i=1}^R p_i ln p_i\] Where p is the proportional abundance of feature i.

Further Reading:

As a first step before calculating diversity metrics we are going to rarefy our data to negate the effects of uneven sequencing depth. The idea of rarification, also called subsampling, is a somewhat divisive notion. While I would absolutely not recommend using subsampling before doing differential abundance metrics, when calculating diversity metrics they are necessary as apparent diversity within a sample is heavily influenced by sequencing depth, especially metrics such as richness which do not consider the abundances of features. Generally we would sample to the lowest sequencing depth observed in any given sample. We can figure out what that is by using the qiime feature-table summarize function and viewing the output as below:

qiime feature-table summarize \
  --i-table artifacts/Feature_Table.qza \
  --o-visualization artifacts/Feature_Table_Summary.qzv

After viewing the results we find:

Sample ID   Feature Count
MOUSE.G21.Day7  15549
MOUSE.G21.Day3  15345
MOUSE.H11.Day3  13992
MOUSE.G11.Day1  13814
MOUSE.H13.Day1  13809
MOUSE.H12.Day3  13487
MOUSE.H12.Day7  13240
MOUSE.H13.Day3  12988
MOUSE.G11.Day3  12978
MOUSE.H11.Day1  12938
MOUSE.G11.Day7  12879
MOUSE.H12.Day1  12579
MOUSE.H11.Day7  12372
MOUSE.G21.Day1  11939
MOUSE.H13.Day7  11818
MOUSE.G12.Day1  11268
MOUSE.G12.Day3  10914
MOUSE.G12.Day7  10414

What is the appropriate sampling depth?

Next we can create a new version of our table using qiime feature-table rarefy specifying the depth using --p-sampling-depth as below:

qiime feature-table rarefy \
  --i-table artifacts/Feature_Table.qza \
  --o-rarefied-table artifacts/Feature_Table_S10414.qza \
  --p-sampling-depth 10414

Now we can calculate two metrics richness (called observed_otus in QIIME2) and Shannon Diversity:

qiime diversity alpha \
  --i-table artifacts/Feature_Table_S10414.qza \
  --p-metric observed_otus \
  --o-alpha-diversity artifacts/observed_diversity.qza
qiime diversity alpha \
  --i-table artifacts/Feature_Table_S10414.qza \
  --p-metric shannon \
  --o-alpha-diversity artifacts/shannon_diversity.qza

We can now use qiime diversity alpha-group-significance to plot and statistically test if diversity is different by groups

qiime diversity alpha-group-significance \
  --i-alpha-diversity artifacts/shannon_diversity.qza \
  --m-metadata-file data/metadata.tsv \
  --o-visualization artifacts/shannon_diversity_plot.qzv

Now do the same for observed_otus, does this tell us a different story?

Problem: What statistical assumption have we violated and how could we overcome it? We will cover this more in depth in our next workshop.


Beta-diversity analysis

Ahh, the QIIME PCoA plot, a perennial favorite found in ~2/3 of microbiome manuscripts since the year 2010. There are actually a number of discrete steps involved in the generation of these plots which are generated by a single pipeline in QIIME2 (and QIIME 1). The first step is the calculation of similarity between samples. But how does one actually numerically describe the similarity between two samples? There are a variety of commonly used metrics but today we will look at a handful while in the next workshop we will discuss some more advanced metrics:

  • Bray Curtis Considers the abundances of every feature.
  • Unweighted UniFrac Considers the phylogenetic relatedness of all of the features which are present/absent between the two samples.
  • Weighted UniFrac Considers the phylogenetic relatedness, but also abundance of samples.

The second important component is ordination of which we will be using Principal Coordinates Analysis. These methods seek to find a way to represent a complex set of pair-wise measurements in 2-3 axes to visualize.

Below we are using the qiime diversity core-metrics-phylogenetic pipeline which will automate these tasks for us and put them all in artifacts/diversity. *Note: this pipeline will also recalculate some of the alpha diversity metrics which might provide slighly different estimates than when we ran them above, can you think why this would be?

qiime diversity core-metrics-phylogenetic \
  --i-table artifacts/Feature_Table.qza \
  --p-sampling-depth 10414 \
  --i-phylogeny artifacts/Feature_TreeRooted.qza \
  --m-metadata-file data/metadata.tsv \
  --output-dir artifacts/diversity

Now that we have our results, lets start with the weighted unifrac PCoA (qiime tools view artifacts/diversity/weighted_unifrac_emperor.qzv) and explore. Emperor is a graphical user interface that allows you to explore your data and has a number of handy options for making publication-quality figures. As you play with the tools and look at the other distance metrics, think about these questions:

  • How many groups do you see in this data?
  • How does the location of the samples relate to their transplant group and time?
  • How do different distance metrics compare in their visualizations?

Challenge: recreate this near publication-ready figure:

In our next workshop we will cover more advancing graphing approaches to convey important experimental data in these plots which necessitates using R.


Differential Abundance Analysis

Ultimately, what people often care most about is individual features/strains as these could be novel pathogens, new probiotics, or diagnostics. While fundamentally, 16S rRNA gene sequencing is a tool designed to look at communities as a whole, many approaches exist for looking at individual features. There are also a wide range of approaches and ethos for these. Generally speaking, it is a good idea to use multivariate statistics such as the beta diversity metrics above before looking for differences in individual taxa. While there is not necessarily a single correct way to find significant taxa, there is certainly a wrong way: Do not take raw sequence counts and do a series of uncorrected t-tests. Today we will use ANCOM as it is built into QIIME2; however, it is not my preferred method and we will cover alternate approaches in our advanced workshop. Rather than consider the abundances as proportions, ANCOM and other approaches apply corrections for compositionality which involve taking log ratios often in the form of a centered log ratio (clr) as below:

\[clr_x = log(\frac{counts_x}{g(counts_y)})\]

Where \(clr_x\) is the centered log ratio abundance of feature \(x\) and \(g(counts_y)\) represents the geometric mean of all features in the sample. This method of normalization has a number of advantages including allowing the use of many more standard statistical methods. Believe it or not, proportions violate the statistical assumptions of many commonly used statistical approaches. Karl Pearson wrote about this in 1896. For more information on CLR transformations and why compositionality is important, I would recommend read Microbiome Datasets Are Compositional: And This Is Not Optional.

Using log ratios creates an obvious problem: you can not take the log of 0. To get around this, we must add a pseudo-count to our data (which in this case defaults to 1). In very sparse data sets there are more complex methods of handling data sets that are preferable, but that is not our case today. We can use qiime composition add-pseudocount as below to add 1 to every count in our table

qiime composition add-pseudocount \
  --i-table artifacts/Feature_Table.qza \
  --o-composition-table artifacts/Feature_Table_Pseudocounted.qza

Finally we are ready to test our data examining the effect of treatment (--m-metadata-column Treatment):

qiime composition ancom \
  --i-table artifacts/Feature_Table_Pseudocounted.qza \
  --m-metadata-file data/metadata.tsv \
  --m-metadata-column Treatment \
  --o-visualization artifacts/ANCOM.qzv

Now explore the resulting visualization (artifacts/ANCOM.qzv) using qiime tools view or view.qiime2.org. Think about the following questions:

  • What statistical assumption have we violated in this analysis? How could we get around it?
  • How many significantly different taxa are there?
  • What are the taxonomic identities of these taxa?
  • Are these taxa going up or down?

Our Advanced visualization and statistical analysis in R workshop on April 30 will cover this topic in more detail and show how we can plot these abundances and carry out our statistical analysis in a more appropriate manner


Resources: