1 Background

In this session, we will continue where we left off in our QIIME2 session, looking at a humanized mouse data set which presents an interesting weight loss phenotype upon colonization, as seen in the figure below. We will take advantage of QIIME2 to do the initial processing of our data, but extend the analysis and visualization using R.

Figure 1. Mouse weights after transplant of human microbiomes.


2 A Quick Reminder of How We Got Here

2.1 Import Sequencing Data to QIIME2

qiime tools import \
 --type 'SampleData[PairedEndSequencesWithQuality]' \
 --input-path artifacts/manifest.csv \
 --output-path artifacts/Reads.qza \
 --input-format PairedEndFastqManifestPhred33

2.2 Read QC, Denoise, Overlap, Remove Chimeras, and Make Table

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs artifacts/Reads.qza \
  --p-trunc-len-f 220 \
  --p-trunc-len-r 130 \
  --p-trim-left-f 5 \
  --p-trim-left-r 5 \
  --p-n-threads 4 \
  --o-table artifacts/Feature_Table.qza \
  --o-representative-sequences artifacts/Feature_Sequences.qza \
  --o-denoising-stats artifacts/Feature_Statistics.qza \
  --verbose

2.3 Build a Phylogenetic Tree

qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences artifacts/Feature_Sequences.qza \
  --o-alignment artifacts/Feature_Alignment.qza \
  --o-masked-alignment artifacts/Feature_MaskedAlignment.qza \
  --o-tree artifacts/Feature_TreeUnrooted.qza \
  --o-rooted-tree artifacts/Feature_TreeRooted.qza

3 Preparing our R session

Start by creating a project for your session by going to File > New Project > New Directory > YourChoiceHere.

Next we will create folders for our outputs: figures, tables, and backups (RDS). Note: RDS files are an R-specific format that can be easily written and reloaded using saveRDS() and readRDS(). See here for more information.
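
For example, a quick round trip (a minimal sketch using a temporary file so it runs anywhere):

tmp<-tempfile(fileext=".RDS")
saveRDS(mtcars, tmp) # write any R object to disk
reloaded<-readRDS(tmp) # read it back in
identical(mtcars, reloaded) # should be TRUE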

dir.create("figures")
dir.create("tables")
dir.create("RDS")
dir.create("artifacts")

I have also taken the liberty of adding additional samples and sequencing depth, and I provide the updated artifacts at the link here. You can download them directly in R using:

download.file(url="https://github.com/jbisanz/MicrobiomeWorkshop2020/raw/master/rvis/artifacts/Feature_Table.qza", destfile="artifacts/Feature_Table.qza")
download.file(url="https://github.com/jbisanz/MicrobiomeWorkshop2020/raw/master/rvis/artifacts/Feature_Sequences.qza", destfile="artifacts/Feature_Sequences.qza")
download.file(url="https://github.com/jbisanz/MicrobiomeWorkshop2020/raw/master/rvis/artifacts/Feature_TreeRooted.qza", destfile="artifacts/Feature_TreeRooted.qza")
download.file(url="https://github.com/jbisanz/MicrobiomeWorkshop2020/raw/master/rvis/artifacts/metadata.tsv", destfile="artifacts/metadata.tsv")

3.1 Installing and Loading Required Packages

Generally speaking, there are 3 sources of relevant packages for R:

  • CRAN: the main R repository, installed via install.packages()
  • Bioconductor: bioinformatics-focused packages, installed via BiocManager::install()
  • GitHub: development versions and packages not yet in a repository, installed via devtools::install_github()

You may already have some of these packages installed, so the code below first tries to load the package (require()), and if it can’t (if(!...)) then installs the package (install.packages()).

# Packages from CRAN:
if(!require(tidyverse)){install.packages("tidyverse")}
if(!require(vegan)){install.packages("vegan")} # useful ecology package
if(!require(ape)){install.packages("ape")} # useful ecology package
if(!require(lmerTest)){install.packages("lmerTest")} # useful package for statistics and longitudinal designs
if(!require(broom)){install.packages("broom")} #useful package for handling stats
if(!require(BiocManager)){install.packages("BiocManager")} # required for installing Bioconductor Packages
if(!require(devtools)){install.packages("devtools")} # required for installing GitHub Packages

# Packages from Bioconductor:
if(!require(dada2)){BiocManager::install("dada2")} # The original R package used by QIIME2
if(!require(ggtree)){BiocManager::install("ggtree")} # Plotting and manipulating trees
if(!require(philr)){BiocManager::install("philr")} # Testing of phylogenetic signals

# Packages from GitHub
if(!require(qiime2R)){devtools::install_github("jbisanz/qiime2R")} # Package for reading artifacts

4 Importing Data to R

4.1 Metadata

We will start by exploring our metadata:

metadata<-read_tsv("artifacts/metadata.tsv")
str(metadata)
Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame':    30 obs. of  4 variables:
 $ SampleID : chr  "MOUSE.G11.Day1" "MOUSE.G12.Day1" "MOUSE.G21.Day1" "MOUSE.H12.Day1" ...
 $ MouseID  : chr  "G11" "G12" "G21" "H12" ...
 $ Time_days: num  1 1 1 1 1 1 3 3 3 3 ...
 $ Treatment: chr  "PreWTL" "PreWTL" "PreWTL" "PostWTL" ...
 - attr(*, "spec")=
  .. cols(
  ..   SampleID = col_character(),
  ..   MouseID = col_character(),
  ..   Time_days = col_double(),
  ..   Treatment = col_character()
  .. )
SampleID MouseID Time_days Treatment
MOUSE.G11.Day1 G11 1 PreWTL
MOUSE.G12.Day1 G12 1 PreWTL
MOUSE.G21.Day1 G21 1 PreWTL
MOUSE.H12.Day1 H12 1 PostWTL
MOUSE.H21.Day1 H21 1 PostWTL
MOUSE.H22.Day1 H22 1 PostWTL
MOUSE.G11.Day3 G11 3 PreWTL
MOUSE.G12.Day3 G12 3 PreWTL
MOUSE.G21.Day3 G21 3 PreWTL
MOUSE.H12.Day3 H12 3 PostWTL
MOUSE.H21.Day3 H21 3 PostWTL
MOUSE.H22.Day3 H22 3 PostWTL
MOUSE.G11.Day7 G11 7 PreWTL
MOUSE.G12.Day7 G12 7 PreWTL
MOUSE.G21.Day7 G21 7 PreWTL
MOUSE.H12.Day7 H12 7 PostWTL
MOUSE.H21.Day7 H21 7 PostWTL
MOUSE.H22.Day7 H22 7 PostWTL
MOUSE.G11.Day14 G11 14 PreWTL
MOUSE.G12.Day14 G12 14 PreWTL
MOUSE.G21.Day14 G21 14 PreWTL
MOUSE.H12.Day14 H12 14 PostWTL
MOUSE.H21.Day14 H21 14 PostWTL
MOUSE.H22.Day14 H22 14 PostWTL
MOUSE.G11.Day21 G11 21 PreWTL
MOUSE.G12.Day21 G12 21 PreWTL
MOUSE.G21.Day21 G21 21 PreWTL
MOUSE.H12.Day21 H12 21 PostWTL
MOUSE.H21.Day21 H21 21 PostWTL
MOUSE.H22.Day21 H22 21 PostWTL

Variables:

  • SampleID: a unique identifier for samples
  • MouseID: a unique identifier that marks mice which are repeatedly sampled (6 mice)
  • Time_days: amount of time that has passed since colonization (5 time points)
  • Treatment: if the microbiome was from humans before weight loss (PreWTL) or after (PostWTL)

4.2 Feature Table

SVtable<-read_qza("artifacts/Feature_Table.qza")$data

Now we can look at our data directly, which is difficult to do in QIIME2. In this table, each column is a SampleID, and each row is a Ribosomal Sequence Variant (SV):

SVtable[1:5,1:5] # looking at the first 5 rows and first 5 columns
MOUSE.G11.Day1 MOUSE.G11.Day14 MOUSE.G11.Day21 MOUSE.G11.Day3 MOUSE.G11.Day7
438fb020cf251f2c2f4e89552e4a78ff 152 6607 7162 10526 11216
6bd982a66b0c74116b5c1cd9b824c113 62 8665 6912 7190 8438
bd87538a8a84679b27e33c28432046c2 996 3773 1243 8943 4344
cece12865e81c96a523b3bb6a4511561 726 0 0 232 0
cd3e52559a9e53dce7721688de8fd9b5 9649 0 0 2573 154

To see the total number of samples and SVs, we can look at the dimensions of the table:

dim(SVtable)
[1] 419  30

R always uses rows, then columns. So in this case we have 419 SVs and 30 samples. Did any samples go missing?
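
One quick way to check is to compare against the metadata we loaded earlier (a small sketch):

setdiff(metadata$SampleID, colnames(SVtable)) # SampleIDs in the metadata but not the table
setdiff(colnames(SVtable), metadata$SampleID) # and the reverse; character(0) means nothing is missing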

If we now want to know the total number of reads, we can simply sum the entire table:

sum(SVtable)
[1] 1992076

Now we can see that we have 1,992,076 reads which is about ~1/10 of a MiSeq run. Where would the other ~10-20 million reads go?

If we want to see the number of reads per sample, we simply need to add up the columns:

colSums(SVtable) %>% sort()
 MOUSE.H21.Day1  MOUSE.G12.Day7  MOUSE.G12.Day3  MOUSE.G12.Day1 
          48550           52018           54891           56486 
MOUSE.G21.Day21  MOUSE.H22.Day1  MOUSE.G21.Day1  MOUSE.H12.Day1 
          57423           57941           59504           61475 
 MOUSE.H21.Day3  MOUSE.H22.Day3  MOUSE.G11.Day7  MOUSE.G11.Day3 
          61659           61969           62617           63943 
 MOUSE.H21.Day7  MOUSE.H12.Day7 MOUSE.G11.Day21  MOUSE.G11.Day1 
          64706           65082           66089           66616 
 MOUSE.H12.Day3 MOUSE.H21.Day21 MOUSE.H12.Day14 MOUSE.H12.Day21 
          66890           67664           67953           69759 
MOUSE.G11.Day14 MOUSE.H22.Day21 MOUSE.H21.Day14 MOUSE.G12.Day14 
          70253           70331           71406           71715 
 MOUSE.H22.Day7  MOUSE.G21.Day3  MOUSE.G21.Day7 MOUSE.H22.Day14 
          75140           75407           76936           77712 
MOUSE.G21.Day14 MOUSE.G12.Day21 
          78763           91178 

So as you can see, we have between 48,550 reads and 91,178 reads per sample which is on the high end of average for 16S rRNA amplicon sequencing.

If we want to see how frequently certain features are found, we can do the converse and add up the rows:

rowSums(SVtable) %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1      33     263    4754    1535  190956 

So we can see the average number of times an SV is seen is 4,754, with a range from 1 to 190,956.

4.3 Feature Sequences

seqs<-read_qza("artifacts/Feature_Sequences.qza")$data
seqs
  A DNAStringSet instance of length 419
      width seq                                        names               
  [1]   243 AGGTCTCAAGCGTTGTTCGG...ACGAAGGCCAGGGGAGCGA 438fb020cf251f2c2...
  [2]   243 AGGATCCGAGCGTTATCCGG...TCGAAAGTGTGGGTATCAA 6bd982a66b0c74116...
  [3]   243 AGGGGGCAAGCGTTATCCGG...TCGAAAGCGTGGGGAGCAA bd87538a8a84679b2...
  [4]   243 AGGATCCGAGCGTTATCCGG...TCGAAAGTGTGGGTATCAA cece12865e81c96a5...
  [5]   243 AGGGGGCAAGCGTTATCCGG...TCGAAAGCGTGGGGAGCAA cd3e52559a9e53dce...
  ...   ... ...
[415]   243 ATGGTGCAAGCGTTATCCGG...TCGAAAGTGTGGGTATCAA bacf4da9ccd7c31fa...
[416]   225 AGGGGGCAAGCGTTATCCGG...TCGAAAGCGTGGGGAGCAA 266e9ce3254292205...
[417]   225 AGGATCCGAGCGTTATCCGG...TCGAAAGTGTGGGTATCAA bb629518734c6be78...
[418]   215 AGGTCTCAAGCGCAGGGGCT...CCCAGTAGTCCGGCTGACT f1e8905a5850b8c08...
[419]   242 AGGTGGCAAGCGTTGTCCGG...TCGAAAGTGTGGGTAGCAA d0fa2acf99d76b680...

Here we have a copy of the actual DNA sequences for each SV. To demonstrate how the unique alphanumeric identifiers are formed, we can regenerate them using a hash function:

seqs$`438fb020cf251f2c2f4e89552e4a78ff` %>%
  as.character() %>%
  openssl::md5()
[1] "438fb020cf251f2c2f4e89552e4a78ff"

4.4 Phylogenetic Tree

tree<-read_qza("artifacts/Feature_TreeRooted.qza")$data

We can now get an idea of the structure of our tree by using the ggtree function to plot it:

ggtree(tree)

It is a great idea to always look at a tree to make sure there aren't any weird placements.


5 Assigning Taxonomy

As we had previously discussed, there are multiple methods and databases with which taxonomy can be assigned; however, my favourite method involves exact matching to a reference database, which can also tell you all possible species to which a sequence variant could belong. My preferred database for doing this is SILVA. There are preformatted databases available for download here. We will download them via R below. Note: this will require ~127 MB of hard drive space; you can skip this section by downloading the assigned taxonomy here.

dir.create("taxdb")
download.file(url="https://zenodo.org/record/3731176/files/silva_nr_v138_train_set.fa.gz?download=1", destfile="taxdb/SILVA138_trained.fa.gz")
download.file(url="https://zenodo.org/record/3731176/files/silva_species_assignment_v138.fa.gz?download=1", destfile="taxdb/SILVA138_species.fa.gz")

We assign taxonomy in two phases: first we assign down to the genus level using the Dada2 implementation of the RDP classifier, then we add species-level calls by exact matching.

taxonomy<-assignTaxonomy(seqs, "taxdb/SILVA138_trained.fa.gz")
taxonomy<-addSpecies(taxonomy, "taxdb/SILVA138_species.fa.gz", allowMultiple = TRUE)

If you did not download the databases and run the code above, you can download the resulting object and read it in as below:

download.file(url="https://github.com/jbisanz/MicrobiomeWorkshop2020/raw/master/rvis/RDS/taxonomy.RDS", destfile="RDS/taxonomy.RDS")
taxonomy<-readRDS("RDS/taxonomy.RDS")

Let’s start by taking a look at the assigned taxonomy:

View(taxonomy)

Note that Dada2 used the full DNA sequences as identifiers rather than the shorter hashed feature IDs. We could either recalculate their hashes, or just swap them in as below:

taxonomy<-
tibble(FeatureID=names(seqs), Seq=as.character(seqs)) %>%
  left_join(
    taxonomy %>% 
      as.data.frame() %>%
      rownames_to_column("Seq")
  ) %>%
  as_tibble()

And now we have both combined:

taxonomy
# A tibble: 419 x 9
   FeatureID   Seq       Kingdom  Phylum Class Order Family Genus  Species 
   <chr>       <chr>     <fct>    <fct>  <fct> <fct> <fct>  <fct>  <fct>   
 1 438fb020cf… AGGTCTCA… Bacteria Verru… Verr… Verr… Akker… Akker… mucinip…
 2 6bd982a66b… AGGATCCG… Bacteria Bacte… Bact… Bact… Bacte… Bacte… cellulo…
 3 bd87538a8a… AGGGGGCA… Bacteria Firmi… Clos… Lach… Lachn… Blaut… caecimu…
 4 cece12865e… AGGATCCG… Bacteria Bacte… Bact… Bact… Bacte… Bacte… vulgatus
 5 cd3e52559a… AGGGGGCA… Bacteria Firmi… Clos… Lach… Lachn… <NA>   <NA>    
 6 20cf13f228… AGGATCCG… Bacteria Bacte… Bact… Bact… Bacte… Bacte… uniform…
 7 ca1392eaa9… AGGATCCG… Bacteria Bacte… Bact… Bact… Bacte… Bacte… dorei/v…
 8 d8999f8529… AGGATCCG… Bacteria Bacte… Bact… Bact… Bacte… Bacte… dorei/f…
 9 128696fec6… AGGTGGCG… Bacteria Firmi… Clos… Clos… Clost… Clost… celatum…
10 db79739512… AGGATCCG… Bacteria Bacte… Bact… Bact… Tanne… Parab… distaso…
# … with 409 more rows

Note that there are NAs for some genera and species: why would this be?

Question: Any taxa pop out to you?

Finally, let's look at assignment rates across levels. We will use the apply() function here, which allows us to apply a function across either rows (MARGIN=1) or columns (MARGIN=2).

taxonomy %>%
  select(-FeatureID, -Seq) %>%
  apply(., MARGIN=2, function(x) sum(!is.na(x))/nrow(taxonomy)*100)
  Kingdom    Phylum     Class     Order    Family     Genus   Species 
100.00000  99.76134  99.76134  99.52267  95.46539  78.28162  27.92363 

Question: What taxonomic level has the poorest assignment?


6 Analyzing Alpha Diversity Over Time

As we discussed in our QIIME2 session, there are multiple methods for calculating diversity; here we will use Shannon's diversity. This is also the first instance where a curious design feature of many R packages shows up. Some people like their samples as rows and their features as columns, while others (including me) like samples as columns and features as rows. When we encounter a package where we need to flip the input, we can either use the transpose function (t()) or specify which margin to operate on (e.g. MARGIN=2 for columns). Also, remember that we should control for uneven read depth for many diversity metrics.

shannondiv<-
  SVtable %>%
  t() %>%
  rrarefy(sample=min(colSums(SVtable))) %>%
  diversity(index="shannon") %>%
  data.frame(Shannon=.) %>%
  rownames_to_column("SampleID") %>%
  left_join(metadata)

Let's inspect what we just created, which is essentially a copy of our metadata with an additional column for the Shannon diversity:

head(shannondiv)
         SampleID  Shannon MouseID Time_days Treatment
1  MOUSE.G11.Day1 3.004961     G11         1    PreWTL
2 MOUSE.G11.Day14 3.463410     G11        14    PreWTL
3 MOUSE.G11.Day21 3.426358     G11        21    PreWTL
4  MOUSE.G11.Day3 3.025028     G11         3    PreWTL
5  MOUSE.G11.Day7 3.027034     G11         7    PreWTL
6  MOUSE.G12.Day1 3.004284     G12         1    PreWTL

Because we have all of this in a tabular format, we can now easily plot it using ggplot! I highly recommend downloading a ggplot cheat sheet and keeping a copy near your desk.

Let's sequentially go through building a plot, starting with the ggplot() command, which initializes the plot:

ggplot(shannondiv, aes(x=Time_days, y=Shannon))

Ok, we have a plot, but there is nothing on it! We need to add a plot type (a geom) to it. In the aesthetics portion of the command (aes()) we can specify which graphical parameters we want mapped to columns in the data. Off the top of my head, some graphical parameters you might want to use include: color, fill, shape, size, alpha (transparency), and linetype.

If we wanted to make a scatter plot, we would use geom_point():

ggplot(shannondiv, aes(x=Time_days, y=Shannon)) +
  geom_point()

OK, now we are starting to get somewhere, but we can't tell which points are from which group; we can map color to the treatment group as below:

ggplot(shannondiv, aes(x=Time_days, y=Shannon, color=Treatment)) +
  geom_point()

Hmm, maybe what we really need to know is the means and errors. We can take advantage of a built-in function of ggplot called stat_summary() which calculates these for us. We can then use it to add error bars, points, and connecting lines:

ggplot(shannondiv, aes(x=Time_days, y=Shannon, color=Treatment)) +
  stat_summary(geom="errorbar") +
  stat_summary(geom="line") +
  stat_summary(geom="point")

Note: stat_summary defaults to mean + se. We can supply alternate functions using fun.data but won’t for now.
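
For illustration only, a sketch of supplying an alternate summary via fun.data; the function must return a data frame with columns y, ymin, and ymax (here mean ± standard deviation rather than the default mean ± se):

mean_sd<-function(x){data.frame(y=mean(x), ymin=mean(x)-sd(x), ymax=mean(x)+sd(x))} # custom summary function
ggplot(shannondiv, aes(x=Time_days, y=Shannon, color=Treatment)) +
  stat_summary(geom="errorbar", fun.data=mean_sd) +
  stat_summary(geom="point", fun.data=mean_sd)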

We are getting closer, but this figure is a little bland for my taste, and when possible, it is often preferable to show individual points. To do this I will overlay the individual points and, instead of using lines and error bars, use a ribbon which corresponds to the mean +/- error.

ggplot(shannondiv, aes(x=Time_days, y=Shannon, color=Treatment)) +
  stat_summary(geom="ribbon") +
  geom_point()

Ugly? Maybe, but instead of mapping to color, I am going to switch to mapping to fill, and also use a different point shape whose fill we can control separately from its border (color). See here for R shape codes.

ggplot(shannondiv, aes(x=Time_days, y=Shannon, fill=Treatment)) +
  stat_summary(geom="ribbon") +
  geom_point(shape=21)

Because we have overlapping lines, we can go ahead and add some transparency using alpha.

ggplot(shannondiv, aes(x=Time_days, y=Shannon, fill=Treatment)) +
  stat_summary(geom="ribbon", alpha=0.6) +
  geom_point(shape=21)

Ok, I like the basic plot, but now let’s adjust some features. For one, I would prefer a red and blue color scheme. My favourite shades are indianred and cornflowerblue. See here for R color codes. We can override the default scale using scale_fill_manual():

ggplot(shannondiv, aes(x=Time_days, y=Shannon, fill=Treatment)) +
  stat_summary(geom="ribbon", alpha=0.6) +
  geom_point(shape=21) +
  scale_fill_manual(values=c("indianred","cornflowerblue"))

Next up, I would like to adjust the overall aesthetics of the plot and will do so by changing the theme. One of my favourites is theme_bw().

ggplot(shannondiv, aes(x=Time_days, y=Shannon, fill=Treatment)) +
  stat_summary(geom="ribbon", alpha=0.6) +
  geom_point(shape=21) +
  scale_fill_manual(values=c("indianred","cornflowerblue")) +
  theme_bw()

You can also define your own custom themes. I have included one in qiime2R called theme_q2r() which uses font sizes more appropriate for publication figures.

ggplot(shannondiv, aes(x=Time_days, y=Shannon, fill=Treatment)) +
  stat_summary(geom="ribbon", alpha=0.6) +
  geom_point(shape=21) +
  scale_fill_manual(values=c("indianred","cornflowerblue")) +
  theme_q2r()

Finally we can touch up our axes and title.

ggplot(shannondiv, aes(x=Time_days, y=Shannon, fill=Treatment)) +
  stat_summary(geom="ribbon", alpha=0.6) +
  geom_point(shape=21) +
  scale_fill_manual(values=c("indianred","cornflowerblue")) +
  theme_q2r() +
  ylab("Shannon Diversity") +
  xlab("Time (days)") +
  ggtitle("Shannon Diversity in Recipient Mice")

Now that we have our finished plot, we can save it using ggsave(). Note that the units for height and width default to inches, which is very helpful when planning out a publication or a poster!

ggsave("figures/shannon_over_time.pdf", height=3, width=4)

Ok, looking at this plot, there are a number of questions we want to ask:

  • Are PreWTL samples more/less diverse than PostWTL?
  • Does diversity change over time?
  • Does diversity change over time differently depending on PreWTL/PostWTL?

Using the I-test (note: eye), I would propose that only the middle question has the possibility of being true. Given our study design, we need to account for repeated sampling, which can be accomplished through a linear mixed-effects model including a random effect denoting the MouseID. We can run this model using the lmer() function and the notation Shannon~Treatment*Time_days + (1|MouseID). This means that our dependent variable is Shannon diversity and we want to test the effect of Treatment, Time, and their interaction (*) while controlling for repeated sampling within MouseID. Linear mixed-effects models have a number of beneficial features, like being able to handle complex study designs and tolerate missing data, but one must carefully evaluate the results, and they often require decently powered experiments.

fit<-lmer(Shannon~Treatment*Time_days + (1|MouseID), data=shannondiv)

Note the message about a singular fit: how do you think you should handle this?
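
One quick check is lme4's isSingular(), which the message points to via ?isSingular:

isSingular(fit) # TRUE here: the MouseID variance is estimated at essentially zero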

We can check a couple of diagnostic plots, including the residuals, where we would hope to see a fairly random distribution:

plot(fit)

And a quantile-quantile plot, where we would hope that the points fall on the center line, suggesting the residuals follow the assumed distribution.

qqmath(fit)

Finally we can look at the results:

summary(fit)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: Shannon ~ Treatment * Time_days + (1 | MouseID)
   Data: shannondiv

REML criterion at convergence: -4

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-1.54965 -0.69518  0.01111  0.67136  2.16605 

Random effects:
 Groups   Name        Variance Std.Dev.
 MouseID  (Intercept) 0.00000  0.000   
 Residual             0.02434  0.156   
Number of obs: 30, groups:  MouseID, 6

Fixed effects:
                           Estimate Std. Error        df t value Pr(>|t|)
(Intercept)                3.041226   0.064346 26.000000  47.263   <2e-16
TreatmentPreWTL           -0.090627   0.091000 26.000000  -0.996   0.3285
Time_days                  0.012457   0.005454 26.000000   2.284   0.0308
TreatmentPreWTL:Time_days  0.008106   0.007713 26.000000   1.051   0.3029
                             
(Intercept)               ***
TreatmentPreWTL              
Time_days                 *  
TreatmentPreWTL:Time_days    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) TrPWTL Tm_dys
TrtmntPrWTL -0.707              
Time_days   -0.780  0.551       
TrtmPWTL:T_  0.551 -0.780 -0.707
convergence code: 0
boundary (singular) fit: see ?isSingular

From above, we can see that there was a significant effect of Time (P=0.031) with an estimated effect of 0.012±0.005 diversity units/day, while no other term was significant. Perhaps we want to do a more straightforward statistical analysis, which would be to test whether there is a significant difference between Pre/Post at each time point; we can accomplish that as below using Welch's t-tests followed by a Bonferroni correction.

shannondiv %>%
 group_by(Time_days) %>%
  do(
    t.test(Shannon~Treatment, data=.) %>%
    tidy()
    ) %>%
  ungroup() %>%
  mutate(AdjPvalue=p.adjust(p.value, method="bonferroni")) %>%
  select(Time_days, Pvalue=p.value, Bonferroni=AdjPvalue)
# A tibble: 5 x 3
  Time_days Pvalue Bonferroni
      <dbl>  <dbl>      <dbl>
1         1 0.505       1    
2         3 0.136       0.679
3         7 0.0253      0.126
4        14 0.630       1    
5        21 0.963       1    

So as we can see, the difference at 7 days is not significant when we consider multiple testing correction.

6.1 Alpha Diversity Conclusions

You can’t win them all, and in this case it would appear that alpha diversity (measured by Shannon’s index) is not significantly different between treatment groups and is unlikely to explain the differences in weight loss.

7 Beta Diversity

We will start by calculating our distance matrix:

braycurtis<-
  SVtable %>%
  t() %>%
  rrarefy(sample=min(colSums(SVtable))) %>%
  apply(., MARGIN=1, function(x) x/sum(x)) %>% t() %>% # convert each sample (row) to proportions; apply() over rows returns samples as columns, so transpose back
  vegdist(method = "bray")

Let's now look at the distance matrix, or at least its top corner:

as.matrix(braycurtis)[1:10, 1:10]
MOUSE.G11.Day1 MOUSE.G11.Day14 MOUSE.G11.Day21 MOUSE.G11.Day3 MOUSE.G11.Day7 MOUSE.G12.Day1 MOUSE.G12.Day14 MOUSE.G12.Day21 MOUSE.G12.Day3 MOUSE.G12.Day7
MOUSE.G11.Day1 0.0000000 0.9443029 0.9317079 0.9128938 0.9604865 0.4914967 0.9525329 0.9442062 0.9037734 0.9432727
MOUSE.G11.Day14 0.9443029 0.0000000 0.5221430 0.8250190 0.5698763 0.9601955 0.6525649 0.5337583 0.7516608 0.5775732
MOUSE.G11.Day21 0.9317079 0.5221430 0.0000000 0.8422434 0.6688037 0.9510593 0.7235263 0.4320101 0.7722957 0.6713900
MOUSE.G11.Day3 0.9128938 0.8250190 0.8422434 0.0000000 0.8443195 0.9391927 0.8774215 0.8343303 0.7130303 0.8325179
MOUSE.G11.Day7 0.9604865 0.5698763 0.6688037 0.8443195 0.0000000 0.9761717 0.7268493 0.6761330 0.7654650 0.4555076
MOUSE.G12.Day1 0.4914967 0.9601955 0.9510593 0.9391927 0.9761717 0.0000000 0.9669770 0.9595806 0.9212065 0.9606964
MOUSE.G12.Day14 0.9525329 0.6525649 0.7235263 0.8774215 0.7268493 0.9669770 0.0000000 0.7070862 0.8078133 0.7697195
MOUSE.G12.Day21 0.9442062 0.5337583 0.4320101 0.8343303 0.6761330 0.9595806 0.7070862 0.0000000 0.7804602 0.6528492
MOUSE.G12.Day3 0.9037734 0.7516608 0.7722957 0.7130303 0.7654650 0.9212065 0.8078133 0.7804602 0.0000000 0.7300499
MOUSE.G12.Day7 0.9432727 0.5775732 0.6713900 0.8325179 0.4555076 0.9606964 0.7697195 0.6528492 0.7300499 0.0000000

How can we visualize this? Today we will use principal coordinates analysis (PCoA):

pcbraycurtis<-pcoa(braycurtis)

When conducting a PCoA (or PCA), it is always a good idea to look at how much variation has been explained in your plot; this can be done with what is called a scree plot:

pcbraycurtis$values %>%
  as.data.frame() %>%
  rownames_to_column("PC") %>%
  mutate(PC=as.numeric(PC)) %>%
  select(PC, Relative_eig) %>%
  mutate(PercentVariation=Relative_eig*100) %>%
  ggplot(aes(x=PC, y=PercentVariation)) +
  geom_point()

Based on this plot we can see that >30% of the variation is explained on the first two axes, so a 2D plot makes a good amount of sense here. We will now extract the numeric coordinates from the PCoA object and merge them with our metadata for plotting.

plotbraycurtis<-
  pcbraycurtis$vectors %>%
  as.data.frame() %>%
  rownames_to_column("SampleID") %>%
  select(SampleID, Axis.1, Axis.2) %>% 
  left_join(metadata)

head(plotbraycurtis)
         SampleID      Axis.1     Axis.2 MouseID Time_days Treatment
1  MOUSE.G11.Day1  0.32158424 0.43350845     G11         1    PreWTL
2 MOUSE.G11.Day14 -0.37601087 0.07914481     G11        14    PreWTL
3 MOUSE.G11.Day21 -0.33207366 0.07054293     G11        21    PreWTL
4  MOUSE.G11.Day3 -0.08192775 0.14094713     G11         3    PreWTL
5  MOUSE.G11.Day7 -0.34157371 0.07445824     G11         7    PreWTL
6  MOUSE.G12.Day1  0.33845678 0.41348738     G12         1    PreWTL

Finally we can create our plot like we did above:

plotbraycurtis %>%
  ggplot(aes(x=Axis.1, y=Axis.2, color=Treatment)) +
  geom_point()

OK, not a bad start. First, let's fix the aesthetics and include the variation explained on the axes.

plotbraycurtis %>%
  ggplot(aes(x=Axis.1, y=Axis.2, color=Treatment)) +
  geom_point() +
  xlab(paste0("PC1:", round(pcbraycurtis$values$Relative_eig[1]*100, 2),"%")) +
  ylab(paste0("PC2:", round(pcbraycurtis$values$Relative_eig[2]*100, 2),"%")) +
  theme_q2r() +
  scale_color_manual(values=c("indianred","cornflowerblue"))

Ok, this looks great, but what is special about that group of samples up in the top right corner? A key piece of missing information is that these are the same mice over time, and it would be ideal to denote time on this graph. We could do this by replacing the points with the day number using geom_text() and specifying label=Time_days in the aesthetics:

plotbraycurtis %>%
  ggplot(aes(x=Axis.1, y=Axis.2, color=Treatment, label=Time_days)) +
  geom_text() +
  xlab(paste0("PC1:", round(pcbraycurtis$values$Relative_eig[1]*100, 2),"%")) +
  ylab(paste0("PC2:", round(pcbraycurtis$values$Relative_eig[2]*100, 2),"%")) +
  theme_q2r() +
  scale_color_manual(values=c("indianred","cornflowerblue"))

This is one option, or we could instead color by time, and use a different shape for the groups as below:

plotbraycurtis %>%
  ggplot(aes(x=Axis.1, y=Axis.2, color=Time_days, shape=Treatment)) +
  geom_point() +
  xlab(paste0("PC1:", round(pcbraycurtis$values$Relative_eig[1]*100, 2),"%")) +
  ylab(paste0("PC2:", round(pcbraycurtis$values$Relative_eig[2]*100, 2),"%")) +
  theme_q2r()

Both of these are great options, but when the dataset is small enough, I like to connect points over time from the same individual/animal using geom_path(). To do this, we need to specify the grouping with group=MouseID and sort the samples by time, as this is the direction the arrows will be drawn in:

plotbraycurtis %>%
  arrange(Time_days) %>%
  ggplot(aes(x=Axis.1, y=Axis.2, color=Treatment, group=MouseID)) +
  geom_path(arrow=arrow(type="closed", length=unit(0.3,"cm")), alpha=0.6) +
  xlab(paste0("PC1:", round(pcbraycurtis$values$Relative_eig[1]*100, 2),"%")) +
  ylab(paste0("PC2:", round(pcbraycurtis$values$Relative_eig[2]*100, 2),"%")) +
  theme_q2r()  +
  scale_color_manual(values=c("indianred","cornflowerblue"))

I think I like this the most, and so I am going to save it again using ggsave().

ggsave("figures/pcoa.pdf", height=3, width=4)

Ok, so based on this plot, what would we conclude? I would say that the two communities started different, changed over time, but always stayed different. If we want to statistically test this, we can turn to a family of statistical tests for ANOVA-like analysis of distance matrices, in this case ADONIS (a PERMANOVA). Similar to above, we will use the same test of Time*Treatment, but include MouseID as a stratum for permutational testing to control for repeated sampling. Just to be safe, I am also going to reorder my metadata table to make sure it matches the order of samples in the distance matrix.

metadata<-metadata[match(labels(braycurtis),metadata$SampleID),]
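
A quick sanity check that the two now line up (labels() returns the sample names of a distance matrix):

stopifnot(all(labels(braycurtis) == metadata$SampleID)) # errors if any sample is out of order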

adonis(braycurtis~Time_days*Treatment, strata=metadata$MouseID, data=metadata)

Call:
adonis(formula = braycurtis ~ Time_days * Treatment, data = metadata,      strata = metadata$MouseID) 

Blocks:  strata 
Permutation: free
Number of permutations: 999

Terms added sequentially (first to last)

                    Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
Time_days            1    1.0555 1.05547  3.6449 0.09797  0.001 ***
Treatment            1    1.5601 1.56007  5.3875 0.14480  0.001 ***
Time_days:Treatment  1    0.6293 0.62927  2.1731 0.05841  0.012 *  
Residuals           26    7.5289 0.28957         0.69882           
Total               29   10.7737                 1.00000           
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on this analysis we can conclude that there are significant effects of Time, Treatment, and their interaction (P<0.05 for all). It is then helpful to look at their proportion of variation explained (R2), wherein we can see that Treatment had the highest value at 14%, followed by 9.8% for Time, and 5.8% for their interaction (the communities change with respect to each other over time).

7.1 Beta Diversity Conclusions

Pre and post weight loss microbiotas are two compositionally distinct communities and with this multivariate evidence in hand, we should now investigate which organisms are different between them.

8 Differential Sequence Variant Abundance

As we previously discussed, proportions violate the assumptions of many statistical approaches, and there are a number of transformations that can help overcome this issue. Today we will be using centered log ratios (CLR), which involve a log transformation and scaling of abundance relative to the average abundance in the sample. You can conceptually think of this like the ∆∆Ct normalization used in qPCR, but wherein there is no housekeeping gene and we instead estimate a control from the distribution of the data. In this mode of scaling the data, the columns sum to 0 (or near it, due to rounding issues); an abundance >0 is above average in abundance, and <0 is below average. Read more about compositional data here.

To calculate the CLR abundance, we will first add 0.5 to every value in the table (SVtable+0.5), which allows us to then take the log2 of all of the abundances (log2()); we then operate along columns (MARGIN=2) and take the ratio of x against the geometric mean of x (mean(x), since we are in log space). Note: we are subtracting rather than dividing because log(x/y)=log(x)-log(y). Before doing any of this, we will start by removing very rare features: in this case, anything which is seen fewer than 10 times in the entire dataset (SVtable[rowSums(SVtable)>=10,]) or is found in 3 or fewer samples (filtSVtable[rowSums(apply(filtSVtable, 2, function(x) if_else(x>0, 1, 0)))>3,]).

filtSVtable<-SVtable[rowSums(SVtable)>=10,]
filtSVtable<-filtSVtable[rowSums(apply(filtSVtable, 2, function(x) if_else(x>0, 1, 0 )))>3,]

How would we figure out how many reads we have lost?

How would we figure out how many sequence variants we have removed?
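
If you want to check your answers, one approach (a quick sketch):

sum(SVtable)-sum(filtSVtable) # total reads lost to filtering
nrow(SVtable)-nrow(filtSVtable) # number of sequence variants removed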

CLRabundance<-apply(log2(filtSVtable+0.5), MARGIN=2, function(x) x-mean(x))
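
We can quickly verify the claim above that each sample's CLR values center on zero (down to floating-point error):

range(colSums(CLRabundance)) # both values should be ~0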

We will again fit linear mixed effects models to our transformed data to extract which sequence variants are significantly different between groups.

tests<-
  CLRabundance %>%
  as.data.frame() %>%
  rownames_to_column("FeatureID") %>%
  pivot_longer(-FeatureID, names_to = "SampleID", values_to = "CLR") %>%
  left_join(metadata) %>%
  group_by(FeatureID) %>%
  do(
    lmer(CLR~Time_days*Treatment + (1|MouseID), data=.) %>%
    anova() %>%
    tidy()
  )

Now we can add in multiple testing correction with the Benjamini-Hochberg FDR.

tests<-
  tests %>%
  group_by(term) %>%
  mutate(FDR=p.adjust(p.value, method="BH"))

Now we can look at the list of features which are significantly different between the two treatment groups, or which change differently over time between the two treatment groups, while adding in taxonomy to help guide us.

SigFeatures<-
  tests %>%
  ungroup() %>%
  filter(term!="Time_days") %>%
  filter(FDR<0.05) %>%
  select(FeatureID) %>%
  distinct()

dim(SigFeatures)
[1] 63  1

We can now see that there were 63 hits, but how do we make sense of them? It might help to visualize their abundances over time using something like a heat map. To do this, I will calculate fold change on a per-time-point basis (the fold change between the mean CLR abundances within each group). I am also going to assign a new name to each feature which combines its FeatureID, Genus, and Species.

foldchanges<-
  CLRabundance %>%
  as.data.frame() %>%
  rownames_to_column("FeatureID") %>%
  pivot_longer(-FeatureID, names_to = "SampleID", values_to = "CLR") %>%
  left_join(metadata) %>%
  group_by(FeatureID, Time_days, Treatment) %>%
  summarize(MeanCLR=mean(CLR)) %>%
  ungroup() %>%
  pivot_wider(values_from = MeanCLR, names_from = Treatment) %>%
  mutate(log2FC=PostWTL-PreWTL) %>%
  left_join(taxonomy) %>%
  mutate(Name=paste(FeatureID, Genus, Species))


head(foldchanges)
# A tibble: 6 x 14
  FeatureID Time_days PostWTL PreWTL log2FC Seq   Kingdom Phylum Class
  <chr>         <dbl>   <dbl>  <dbl>  <dbl> <chr> <fct>   <fct>  <fct>
1 0032a6aa…         1   0.396   4.54 -4.15  AGGG… Bacter… Actin… Cori…
2 0032a6aa…         3  -3.32    3.99 -7.31  AGGG… Bacter… Actin… Cori…
3 0032a6aa…         7  -2.77   -3.20  0.426 AGGG… Bacter… Actin… Cori…
4 0032a6aa…        14  -3.98   -3.84 -0.131 AGGG… Bacter… Actin… Cori…
5 0032a6aa…        21  -4.04   -3.86 -0.174 AGGG… Bacter… Actin… Cori…
6 00c3c59e…         1   1.79   -2.36  4.15  AGGA… Bacter… Bacte… Bact…
# … with 5 more variables: Order <fct>, Family <fct>, Genus <fct>,
#   Species <fct>, Name <chr>

Now that we have the fold changes by time, we can make our heat map (geom_tile()).

foldchanges %>%
  filter(FeatureID %in% SigFeatures$FeatureID) %>%
  ggplot(aes(x=Time_days, y=Name, fill=log2FC)) +
  geom_tile()

This is OK, but not super helpful because of the color scheme and the X-axis. We must adjust the X-axis to be a factor/character instead of a number:

foldchanges %>%
  filter(FeatureID %in% SigFeatures$FeatureID) %>%
  mutate(Time_days=factor(Time_days, levels=c(1,3,7,14,21))) %>%
  ggplot(aes(x=Time_days, y=Name, fill=log2FC)) +
  geom_tile()

Now we can add in a manual fill scheme; in this case, since we want a scale centered at 0, we can use scale_fill_gradient2().

foldchanges %>%
  filter(FeatureID %in% SigFeatures$FeatureID) %>%
  mutate(Time_days=factor(Time_days, levels=c(1,3,7,14,21))) %>%
  ggplot(aes(x=Time_days, y=Name, fill=log2FC)) +
  geom_tile() +
  scale_fill_gradient2(low="cornflowerblue", high="indianred")

It would probably help the clarity if we were to try to meaningfully order the rows. This could be done via clustering, but for simplicity, let’s do it by sorting by the average log2FC across all time points. To control the order on the y-axis we need to convert the names to a factor and specify their order.

featureorder<-
  foldchanges %>%
  group_by(Name) %>%
  summarize(log2FC=mean(log2FC)) %>%
  arrange(log2FC) %>%
  pull(Name)

foldchanges %>%
  filter(FeatureID %in% SigFeatures$FeatureID) %>%
  mutate(Time_days=factor(Time_days, levels=c(1,3,7,14,21))) %>%
  mutate(Name=factor(Name, levels=featureorder)) %>%
  ggplot(aes(x=Time_days, y=Name, fill=log2FC)) +
  geom_tile() +
  scale_fill_gradient2(low="cornflowerblue", high="indianred")

And now finally, we can adjust the aesthetics to get us to a polished product:

foldchanges %>%
  filter(FeatureID %in% SigFeatures$FeatureID) %>%
  mutate(Time_days=factor(Time_days, levels=c(1,3,7,14,21))) %>%
  mutate(Name=factor(Name, levels=featureorder)) %>%
  ggplot(aes(x=Time_days, y=Name, fill=log2FC)) +
  geom_tile() +
  scale_fill_gradient2(low="cornflowerblue", high="indianred") +
  ylab("Sequence Variant") +
  xlab("Time (days)") +
  theme_q2r() +
  coord_cartesian(expand=FALSE) #shrink excess white space

Finally, let’s save a copy.

ggsave("figures/heatmap.pdf", width=6, height=5)

A very wise person once said "you can upload it, download it, and reload it, but eventually you actually have to read it and think about it." So looking through this list of sequence variants, do any of them pop out to you?

To help understand the results, it can also help to plot the individual features' abundances over time. We can easily accomplish that as below:

individualabundances<-
  CLRabundance %>%
  as.data.frame() %>%
  rownames_to_column("FeatureID") %>%
  filter(FeatureID %in% SigFeatures$FeatureID) %>%
  pivot_longer(-FeatureID, names_to = "SampleID", values_to = "CLR") %>%
  left_join(metadata) %>%
  left_join(taxonomy) %>%
  mutate(name=paste(FeatureID, Genus, Species)) %>%
  select(name, CLR, Time_days, Treatment)
individualabundances  
# A tibble: 1,890 x 4
   name                                             CLR Time_days Treatment
   <chr>                                          <dbl>     <dbl> <chr>    
 1 bd87538a8a84679b27e33c28432046c2 Blautia caec…  8.47         1 PreWTL   
 2 bd87538a8a84679b27e33c28432046c2 Blautia caec…  8.97        14 PreWTL   
 3 bd87538a8a84679b27e33c28432046c2 Blautia caec…  7.39        21 PreWTL   
 4 bd87538a8a84679b27e33c28432046c2 Blautia caec… 10.8          3 PreWTL   
 5 bd87538a8a84679b27e33c28432046c2 Blautia caec…  9.96         7 PreWTL   
 6 bd87538a8a84679b27e33c28432046c2 Blautia caec…  5.77         1 PreWTL   
 7 bd87538a8a84679b27e33c28432046c2 Blautia caec…  7.45        14 PreWTL   
 8 bd87538a8a84679b27e33c28432046c2 Blautia caec…  7.21        21 PreWTL   
 9 bd87538a8a84679b27e33c28432046c2 Blautia caec… 10.8          3 PreWTL   
10 bd87538a8a84679b27e33c28432046c2 Blautia caec…  9.59         7 PreWTL   
# … with 1,880 more rows

To create separate plots on a per-feature basis, we can take advantage of either the facet_wrap() or facet_grid() functions as below. Note: by specifying scales="free" we allow each panel to have its own scale, and by specifying ncol=6 we request 6 columns (and by extension 11 rows).

individualabundances %>%
  ggplot(aes(x=Time_days, y=CLR, color=Treatment)) +
  stat_summary(geom="errorbar", width=0) +
  stat_summary(geom="line") +
  stat_summary(geom="point") +
  theme_q2r() +
  facet_wrap(~name, scales="free", ncol=6)

ggsave("figures/perFeature.pdf", height=20, width=30)

This is the type of plot I would never include in a manuscript, but it is helpful for interpreting the heat map. You could also select one panel from this plot to use in a figure for an organism of interest.

Questions:

  • What do you make of the first and last organisms (P. distasonis)?
  • What do you think about Escherichia/Shigella?
  • Any other organisms seem familiar?

Since we have a number of significantly different features, it may help to visualize them on a phylogenetic tree to spot any big-picture differences. We can do that using the ggtree() function, which allows us to use ggplot2 aesthetic mappings in the context of the tree. To merge data with the tree, we need it in one large data.frame or tibble wherein the first column matches the tip labels of the tree:

tree

Phylogenetic tree with 419 tips and 418 internal nodes.

Tip labels:
    438fb020cf251f2c2f4e89552e4a78ff, cde9017648b65ce5f31e70347eb2cda3, 20cf13f228f8d16b499e9493dd2ee208, cece12865e81c96a523b3bb6a4511561, ca1392eaa969b2980a69cc34a327e112, 5669573ccb2eb47522bed90a9e3439e4, ...
Node labels:
    root, 0.987, 0.952, 0.995, 0.938, 0.740, ...

Rooted; includes branch lengths.
head(tree$tip.label)
[1] "438fb020cf251f2c2f4e89552e4a78ff" "cde9017648b65ce5f31e70347eb2cda3"
[3] "20cf13f228f8d16b499e9493dd2ee208" "cece12865e81c96a523b3bb6a4511561"
[5] "ca1392eaa969b2980a69cc34a327e112" "5669573ccb2eb47522bed90a9e3439e4"

As you can see above, the tip labels are the feature IDs. I would like to map both the average fold change that we used above and whether the hit was significant or not, so I will create an object below:

treedata<-
foldchanges %>%
  group_by(FeatureID) %>%
  summarize(log2FC=mean(log2FC)) %>%
  mutate(Significant=if_else(FeatureID %in% SigFeatures$FeatureID, "*",""))

To merge this data to the tree we use the unusual %<+% operator followed by our standard ggplot2 syntax of +. Note: you can specify aes mappings within any given geom function as below.

ggtree(tree) %<+% 
  treedata +
  geom_tippoint(aes(color=log2FC)) +
  geom_tiplab(aes(label=Significant))

OK, a good start, but let’s adjust the color scale and change our shape such that it always has an outline (a purely aesthetic feature that I prefer).

ggtree(tree) %<+% 
  treedata +
  geom_tippoint(aes(fill=log2FC), shape=21) +
  geom_tiplab(aes(label=Significant)) +
  scale_fill_gradient2(low="cornflowerblue", high="darkred")

Note: ggtree removes the legend by default; let's add it back in on the right side.

ggtree(tree) %<+% 
  treedata +
  geom_tippoint(aes(fill=log2FC), shape=21) +
  geom_tiplab(aes(label=Significant)) +
  scale_fill_gradient2(low="cornflowerblue", high="darkred") +
  theme(legend.position="right")

Notice all the grey points? They are grey because their values are NA: these are features which did not pass our filtering thresholds before the CLR transformation. We can "prune" them from the tree as below:

drop.tip(tree, tree$tip.label[!tree$tip.label %in% treedata$FeatureID]) %>%
  ggtree() %<+% 
  treedata +
  geom_tippoint(aes(fill=log2FC), shape=21) +
  geom_tiplab(aes(label=Significant)) +
  scale_fill_gradient2(low="cornflowerblue", high="darkred") +
  theme(legend.position="right")

We could also make this tree circular using the layout option:

drop.tip(tree, tree$tip.label[!tree$tip.label %in% treedata$FeatureID]) %>%
  ggtree(layout="circular") %<+% treedata +
  geom_tippoint(aes(fill=log2FC), shape=21) +
  geom_tiplab(aes(label=Significant)) +
  scale_fill_gradient2(low="cornflowerblue", high="darkred") +
  theme(legend.position="right")

This looks pretty good, although maybe it was overly reductive to simplify to a single log2 fold change. Instead, let's merge our heat map with our tree using gheatmap()! I will splice together bits of code from our heat map and our tree to do this.

heatmapplot<-
foldchanges %>%
  filter(FeatureID %in% treedata$FeatureID) %>%
  mutate(Time_days=factor(Time_days, levels=c(1,3,7,14,21))) %>%
  select(FeatureID, Time_days, log2FC) %>%
  pivot_wider(names_from = "Time_days", values_from="log2FC") %>%
  as.data.frame() %>%
  column_to_rownames("FeatureID")

treeplot<-
drop.tip(tree, tree$tip.label[!tree$tip.label %in% treedata$FeatureID]) %>%
  ggtree(layout="circular") %<+% 
  treedata +
  geom_tiplab2(aes(label=Significant), align=TRUE, offset=0.65 )

gheatmap(treeplot, heatmapplot) +
  scale_fill_gradient2(low="cornflowerblue", high="darkred")

8.1 Differential Abundance Conclusions

  • Which of these figures clearly communicated a biological message and was easy to interpret?
  • If the goal of creating a visualization is to make a biological pattern clear while obscuring the data as little as possible, how do you feel about the last tree we created? See panel A of this figure for an example of a dataset with a clearer phylogenetic signal.
  • If we have 60+ organisms that could be responsible how would you prioritize following up on them?
  • How would you follow up this analysis with additional analysis or experimentation to understand which of the identified organisms was responsible for inducing weight loss?

Hopefully this session has been helpful, and if you have any feedback or questions, please email me at jordan.bisanz@ucsf.edu.