In this session, we will continue where we left off in our QIIME2 session, looking at a humanized mouse data set which presents an interesting weight-loss phenotype on colonization, as seen in the figure below. We will take advantage of QIIME2 for the initial processing of our data, but extend the analysis and visualization using R. As a reminder, the data were processed with the QIIME2 commands below:
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path artifacts/manifest.csv \
--output-path artifacts/Reads.qza \
--input-format PairedEndFastqManifestPhred33
qiime dada2 denoise-paired \
--i-demultiplexed-seqs artifacts/Reads.qza \
--p-trunc-len-f 220 \
--p-trunc-len-r 130 \
--p-trim-left-f 5 \
--p-trim-left-r 5 \
--p-n-threads 4 \
--o-table artifacts/Feature_Table.qza \
--o-representative-sequences artifacts/Feature_Sequences.qza \
--o-denoising-stats artifacts/Feature_Statistics.qza \
--verbose
qiime phylogeny align-to-tree-mafft-fasttree \
--i-sequences artifacts/Feature_Sequences.qza \
--o-alignment artifacts/Feature_Alignment.qza \
--o-masked-alignment artifacts/Feature_MaskedAlignment.qza \
--o-tree artifacts/Feature_TreeUnrooted.qza \
--o-rooted-tree artifacts/Feature_TreeRooted.qza
Start by creating a project for your session by going to File > New Project > New Directory > YourChoiceHere.
Next we will create folders for our outputs: figures, tables, and backups (RDS). Note: RDS files are a special R-specific format that can be easily written and reloaded using saveRDS() and readRDS(). See here for more information.
dir.create("figures")
dir.create("tables")
dir.create("RDS")
dir.create("artifacts")
I have also taken the liberty of adding in additional samples and sequencing depth, and have provided the updated artifacts at the link here. You can download them directly in R using:
download.file(url="https://github.com/jbisanz/MicrobiomeWorkshop2020/raw/master/rvis/artifacts/Feature_Table.qza", destfile="artifacts/Feature_Table.qza")
download.file(url="https://github.com/jbisanz/MicrobiomeWorkshop2020/raw/master/rvis/artifacts/Feature_Sequences.qza", destfile="artifacts/Feature_Sequences.qza")
download.file(url="https://github.com/jbisanz/MicrobiomeWorkshop2020/raw/master/rvis/artifacts/Feature_TreeRooted.qza", destfile="artifacts/Feature_TreeRooted.qza")
download.file(url="https://github.com/jbisanz/MicrobiomeWorkshop2020/raw/master/rvis/artifacts/metadata.tsv", destfile="artifacts/metadata.tsv")
Generally speaking, there are 3 sources of relevant packages for R:

1. CRAN (install.packages())
2. Bioconductor (BiocManager::install())
3. GitHub (devtools::install_github())

You may already have some of these packages installed, so the code below first tries to load each package (require()), and if it can't (if(!...)), it then installs the package.
# Packages from CRAN:
if(!require(tidyverse)){install.packages("tidyverse")}
if(!require(vegan)){install.packages("vegan")} # useful ecology package
if(!require(ape)){install.packages("ape")} # useful ecology package
if(!require(lmerTest)){install.packages("lmerTest")} # useful package for statistics and longitudinal designs
if(!require(broom)){install.packages("broom")} #useful package for handling stats
if(!require(BiocManager)){install.packages("BiocManager")} # required for installing Bioconductor Packages
if(!require(devtools)){install.packages("devtools")} # required for installing GitHub Packages
# Packages from Bioconductor:
if(!require(dada2)){BiocManager::install("dada2")} # The original R package used by QIIME2
if(!require(ggtree)){BiocManager::install("ggtree")} # Plotting and manipulating trees
if(!require(philr)){BiocManager::install("philr")} # Testing of phylogenetic signals
# Packages from GitHub
if(!require(qiime2R)){devtools::install_github("jbisanz/qiime2R")} # Package for reading artifacts
We will start by exploring our metadata:
metadata<-read_tsv("artifacts/metadata.tsv")
str(metadata)
Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 30 obs. of 4 variables:
$ SampleID : chr "MOUSE.G11.Day1" "MOUSE.G12.Day1" "MOUSE.G21.Day1" "MOUSE.H12.Day1" ...
$ MouseID : chr "G11" "G12" "G21" "H12" ...
$ Time_days: num 1 1 1 1 1 1 3 3 3 3 ...
$ Treatment: chr "PreWTL" "PreWTL" "PreWTL" "PostWTL" ...
- attr(*, "spec")=
.. cols(
.. SampleID = col_character(),
.. MouseID = col_character(),
.. Time_days = col_double(),
.. Treatment = col_character()
.. )
SampleID | MouseID | Time_days | Treatment |
---|---|---|---|
MOUSE.G11.Day1 | G11 | 1 | PreWTL |
MOUSE.G12.Day1 | G12 | 1 | PreWTL |
MOUSE.G21.Day1 | G21 | 1 | PreWTL |
MOUSE.H12.Day1 | H12 | 1 | PostWTL |
MOUSE.H21.Day1 | H21 | 1 | PostWTL |
MOUSE.H22.Day1 | H22 | 1 | PostWTL |
MOUSE.G11.Day3 | G11 | 3 | PreWTL |
MOUSE.G12.Day3 | G12 | 3 | PreWTL |
MOUSE.G21.Day3 | G21 | 3 | PreWTL |
MOUSE.H12.Day3 | H12 | 3 | PostWTL |
MOUSE.H21.Day3 | H21 | 3 | PostWTL |
MOUSE.H22.Day3 | H22 | 3 | PostWTL |
MOUSE.G11.Day7 | G11 | 7 | PreWTL |
MOUSE.G12.Day7 | G12 | 7 | PreWTL |
MOUSE.G21.Day7 | G21 | 7 | PreWTL |
MOUSE.H12.Day7 | H12 | 7 | PostWTL |
MOUSE.H21.Day7 | H21 | 7 | PostWTL |
MOUSE.H22.Day7 | H22 | 7 | PostWTL |
MOUSE.G11.Day14 | G11 | 14 | PreWTL |
MOUSE.G12.Day14 | G12 | 14 | PreWTL |
MOUSE.G21.Day14 | G21 | 14 | PreWTL |
MOUSE.H12.Day14 | H12 | 14 | PostWTL |
MOUSE.H21.Day14 | H21 | 14 | PostWTL |
MOUSE.H22.Day14 | H22 | 14 | PostWTL |
MOUSE.G11.Day21 | G11 | 21 | PreWTL |
MOUSE.G12.Day21 | G12 | 21 | PreWTL |
MOUSE.G21.Day21 | G21 | 21 | PreWTL |
MOUSE.H12.Day21 | H12 | 21 | PostWTL |
MOUSE.H21.Day21 | H21 | 21 | PostWTL |
MOUSE.H22.Day21 | H22 | 21 | PostWTL |
Next, we will read our feature table into R using qiime2R's read_qza() function:
SVtable<-read_qza("artifacts/Feature_Table.qza")$data
Now we can look at our data directly, which is difficult to do in QIIME2. In this table, each column is a SampleID, and each row is a ribosomal sequence variant (SV):
SVtable[1:5,1:5] # looking at the first 5 rows and first 5 columns
MOUSE.G11.Day1 | MOUSE.G11.Day14 | MOUSE.G11.Day21 | MOUSE.G11.Day3 | MOUSE.G11.Day7 | |
---|---|---|---|---|---|
438fb020cf251f2c2f4e89552e4a78ff | 152 | 6607 | 7162 | 10526 | 11216 |
6bd982a66b0c74116b5c1cd9b824c113 | 62 | 8665 | 6912 | 7190 | 8438 |
bd87538a8a84679b27e33c28432046c2 | 996 | 3773 | 1243 | 8943 | 4344 |
cece12865e81c96a523b3bb6a4511561 | 726 | 0 | 0 | 232 | 0 |
cd3e52559a9e53dce7721688de8fd9b5 | 9649 | 0 | 0 | 2573 | 154 |
To see the total number of samples and SVs, we can look at the dimensions of the table:
dim(SVtable)
[1] 419 30
R always uses rows, then columns. So in this case we have 419 SVs and 30 samples. Did any samples go missing?
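One quick way to check (a sketch; assuming nothing went missing, this returns an empty vector) is to compare the metadata's sample IDs against the table's column names:
setdiff(metadata$SampleID, colnames(SVtable)) # character(0) means no samples went missing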
If we now want to know the total number of reads we have, we can just add up the entire table:
sum(SVtable)
[1] 1992076
Now we can see that we have 1,992,076 reads, which is roughly 1/10 of a MiSeq run. Where would the other ~10-20 million reads go?
If we want to see the number of reads per sample, we simply need to add up the columns:
colSums(SVtable) %>% sort()
MOUSE.H21.Day1 MOUSE.G12.Day7 MOUSE.G12.Day3 MOUSE.G12.Day1
48550 52018 54891 56486
MOUSE.G21.Day21 MOUSE.H22.Day1 MOUSE.G21.Day1 MOUSE.H12.Day1
57423 57941 59504 61475
MOUSE.H21.Day3 MOUSE.H22.Day3 MOUSE.G11.Day7 MOUSE.G11.Day3
61659 61969 62617 63943
MOUSE.H21.Day7 MOUSE.H12.Day7 MOUSE.G11.Day21 MOUSE.G11.Day1
64706 65082 66089 66616
MOUSE.H12.Day3 MOUSE.H21.Day21 MOUSE.H12.Day14 MOUSE.H12.Day21
66890 67664 67953 69759
MOUSE.G11.Day14 MOUSE.H22.Day21 MOUSE.H21.Day14 MOUSE.G12.Day14
70253 70331 71406 71715
MOUSE.H22.Day7 MOUSE.G21.Day3 MOUSE.G21.Day7 MOUSE.H22.Day14
75140 75407 76936 77712
MOUSE.G21.Day14 MOUSE.G12.Day21
78763 91178
So as you can see, we have between 48,550 reads and 91,178 reads per sample which is on the high end of average for 16S rRNA amplicon sequencing.
If we want to see how frequently certain features are found, we can do the converse and add up the rows:
rowSums(SVtable) %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 33 263 4754 1535 190956
So we can see the average number of times an SV is seen was 4,754, with a range from 1-190,956.
seqs<-read_qza("artifacts/Feature_Sequences.qza")$data
seqs
A DNAStringSet instance of length 419
width seq names
[1] 243 AGGTCTCAAGCGTTGTTCGG...ACGAAGGCCAGGGGAGCGA 438fb020cf251f2c2...
[2] 243 AGGATCCGAGCGTTATCCGG...TCGAAAGTGTGGGTATCAA 6bd982a66b0c74116...
[3] 243 AGGGGGCAAGCGTTATCCGG...TCGAAAGCGTGGGGAGCAA bd87538a8a84679b2...
[4] 243 AGGATCCGAGCGTTATCCGG...TCGAAAGTGTGGGTATCAA cece12865e81c96a5...
[5] 243 AGGGGGCAAGCGTTATCCGG...TCGAAAGCGTGGGGAGCAA cd3e52559a9e53dce...
... ... ...
[415] 243 ATGGTGCAAGCGTTATCCGG...TCGAAAGTGTGGGTATCAA bacf4da9ccd7c31fa...
[416] 225 AGGGGGCAAGCGTTATCCGG...TCGAAAGCGTGGGGAGCAA 266e9ce3254292205...
[417] 225 AGGATCCGAGCGTTATCCGG...TCGAAAGTGTGGGTATCAA bb629518734c6be78...
[418] 215 AGGTCTCAAGCGCAGGGGCT...CCCAGTAGTCCGGCTGACT f1e8905a5850b8c08...
[419] 242 AGGTGGCAAGCGTTGTCCGG...TCGAAAGTGTGGGTAGCAA d0fa2acf99d76b680...
Here we have a copy of the actual DNA sequence for each SV. To demonstrate how the unique alphanumeric identifiers are formed, we can regenerate one using a hash function:
seqs$`438fb020cf251f2c2f4e89552e4a78ff` %>%
as.character() %>%
openssl::md5()
[1] "438fb020cf251f2c2f4e89552e4a78ff"
tree<-read_qza("artifacts/Feature_TreeRooted.qza")$data
We can now get an idea of the structure of our tree by plotting it with the ggtree() function:
ggtree(tree)
It is a great idea to always look at a tree to make sure there aren’t any weird placements.
As we had previously discussed, there are multiple methods and databases with which taxonomy can be assigned; however, my favourite method involves exact matching to a reference database, which can also tell you all possible species to which the sequence variant could belong. My preferred database for doing this is SILVA. There are preformatted databases available for download here. We will download them via R below. Note: this will require ~127 Mb of hard drive space; you can skip this section by downloading the assigned taxonomy here.
dir.create("taxdb")
download.file(url="https://zenodo.org/record/3731176/files/silva_nr_v138_train_set.fa.gz?download=1", destfile="taxdb/SILVA138_trained.fa.gz")
download.file(url="https://zenodo.org/record/3731176/files/silva_species_assignment_v138.fa.gz?download=1", destfile="taxdb/SILVA138_species.fa.gz")
We assign taxonomy in two phases: first we assign down to the genus level using the Dada2 version of the RDP classifier (assignTaxonomy()), and then we add species-level calls by exact matching (addSpecies()).
taxonomy<-assignTaxonomy(seqs, "taxdb/SILVA138_trained.fa.gz")
taxonomy<-addSpecies(taxonomy, "taxdb/SILVA138_species.fa.gz", allowMultiple = TRUE)
If you did not download the databases and run the code above, you can download the resulting taxonomy object and read it in as below:
download.file(url="https://github.com/jbisanz/MicrobiomeWorkshop2020/raw/master/rvis/RDS/taxonomy.RDS", destfile="RDS/taxonomy.RDS")
taxonomy<-readRDS("RDS/taxonomy.RDS")
Let’s start by taking a look at the assigned taxonomy:
View(taxonomy)
Note that Dada2 kept the DNA sequence and not the shorter hashed feature IDs. We could either recalculate their hashes, or just swap them out as below:
taxonomy<-
tibble(FeatureID=names(seqs), Seq=as.character(seqs)) %>%
left_join(
taxonomy %>%
as.data.frame() %>%
rownames_to_column("Seq")
) %>%
as_tibble()
And now we have both combined:
taxonomy
# A tibble: 419 x 9
FeatureID Seq Kingdom Phylum Class Order Family Genus Species
<chr> <chr> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
1 438fb020cf… AGGTCTCA… Bacteria Verru… Verr… Verr… Akker… Akker… mucinip…
2 6bd982a66b… AGGATCCG… Bacteria Bacte… Bact… Bact… Bacte… Bacte… cellulo…
3 bd87538a8a… AGGGGGCA… Bacteria Firmi… Clos… Lach… Lachn… Blaut… caecimu…
4 cece12865e… AGGATCCG… Bacteria Bacte… Bact… Bact… Bacte… Bacte… vulgatus
5 cd3e52559a… AGGGGGCA… Bacteria Firmi… Clos… Lach… Lachn… <NA> <NA>
6 20cf13f228… AGGATCCG… Bacteria Bacte… Bact… Bact… Bacte… Bacte… uniform…
7 ca1392eaa9… AGGATCCG… Bacteria Bacte… Bact… Bact… Bacte… Bacte… dorei/v…
8 d8999f8529… AGGATCCG… Bacteria Bacte… Bact… Bact… Bacte… Bacte… dorei/f…
9 128696fec6… AGGTGGCG… Bacteria Firmi… Clos… Clos… Clost… Clost… celatum…
10 db79739512… AGGATCCG… Bacteria Bacte… Bact… Bact… Tanne… Parab… distaso…
# … with 409 more rows
Note that there are NAs for some genera and species: why would this be?
Question: Any taxa pop out to you?
Finally, let’s look at assignment rates across levels. We will use the apply() function here, which allows us to apply a function across either rows (MARGIN=1) or columns (MARGIN=2).
taxonomy %>%
select(-FeatureID, -Seq) %>%
apply(., MARGIN=2, function(x) sum(!is.na(x))/nrow(taxonomy)*100)
Kingdom Phylum Class Order Family Genus Species
100.00000 99.76134 99.76134 99.52267 95.46539 78.28162 27.92363
Question: What taxonomic level has the poorest assignment?
As we discussed in our QIIME2 session, there are multiple methods for calculating diversity; here we will use Shannon’s diversity. This is also the first instance where a curious design feature of many R packages shows up. Some people like their samples as rows and their features as columns, and others (including me) like samples as columns and features as rows. When we encounter a package where we need to flip the input, we can either use the transpose function (t()) or tell the function to operate on the other margin (MARGIN=2). Also, remember that we should control for uneven read depth for many diversity metrics.
shannondiv<-
SVtable %>%
t() %>%
rrarefy(sample=min(colSums(SVtable))) %>%
diversity(index="shannon") %>%
data.frame(Shannon=.) %>%
rownames_to_column("SampleID") %>%
left_join(metadata)
Let’s inspect what we just created, which is essentially a copy of our metadata with an additional column for the Shannon diversity:
head(shannondiv)
SampleID Shannon MouseID Time_days Treatment
1 MOUSE.G11.Day1 3.004961 G11 1 PreWTL
2 MOUSE.G11.Day14 3.463410 G11 14 PreWTL
3 MOUSE.G11.Day21 3.426358 G11 21 PreWTL
4 MOUSE.G11.Day3 3.025028 G11 3 PreWTL
5 MOUSE.G11.Day7 3.027034 G11 7 PreWTL
6 MOUSE.G12.Day1 3.004284 G12 1 PreWTL
Because we have all of this in a tabular format, we can now easily plot it using ggplot! I highly recommend downloading a ggplot cheat sheet and keeping a copy near your desk.
Let’s sequentially go through building a plot, starting with the ggplot() command which initializes the plot:
ggplot(shannondiv, aes(x=Time_days, y=Shannon))
Ok, we have a plot, but there is not anything on it! We need to add some type of plot to it. In the aesthetics portion of the command (aes()) we can specify which parameters we want mapped to columns in the data. Off the top of my head, some graphical parameters you might want to use include: color, fill, shape, size, alpha (transparency), and linetype.
If we wanted to make a scatter plot, we would use geom_point():
ggplot(shannondiv, aes(x=Time_days, y=Shannon)) +
geom_point()
OK, now we are starting to get somewhere, but we can’t tell which points are from which group. We can assign color to group as below:
ggplot(shannondiv, aes(x=Time_days, y=Shannon, color=Treatment)) +
geom_point()
Hmm, maybe what we really need to know is the means and error. We can take advantage of a built-in function of ggplot called stat_summary() which calculates these for us. We can then use it to add error bars, points, and connecting lines:
ggplot(shannondiv, aes(x=Time_days, y=Shannon, color=Treatment)) +
stat_summary(geom="errorbar") +
stat_summary(geom="line") +
stat_summary(geom="point")
Note: stat_summary() defaults to mean + se. We can supply alternate functions using fun.data but won’t for now.
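For example, if we wanted the bars to span roughly a 95% interval (mean ± 2 standard errors), a sketch using ggplot2’s built-in mean_se() might look like:
ggplot(shannondiv, aes(x=Time_days, y=Shannon, color=Treatment)) +
  stat_summary(geom="errorbar", fun.data=mean_se, fun.args=list(mult=2)) + # widen to ~95%
  stat_summary(geom="line") +
  stat_summary(geom="point")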
We are getting closer, but this figure is a little bland for my taste, and when possible it is often preferable to show individual points. To do this I will overlay the individual points and, instead of using lines and error bars, use a ribbon which corresponds to the mean +/- error.
ggplot(shannondiv, aes(x=Time_days, y=Shannon, color=Treatment)) +
stat_summary(geom="ribbon") +
geom_point()
Ugly? Maybe, but instead of mapping to color, I am going to switch to mapping to fill, and also use a different point shape for which we can control the fill separately from the border (color). See here for R shape codes.
ggplot(shannondiv, aes(x=Time_days, y=Shannon, fill=Treatment)) +
stat_summary(geom="ribbon") +
geom_point(shape=21)
Because we have overlapping lines, we can go ahead and add some transparency using alpha.
ggplot(shannondiv, aes(x=Time_days, y=Shannon, fill=Treatment)) +
stat_summary(geom="ribbon", alpha=0.6) +
geom_point(shape=21)
Ok, I like the basic plot, but now let’s adjust some features. For one, I would prefer a red and blue color scheme. My favourite shades are indianred and cornflowerblue. See here for R color codes. We can override the default scale using scale_fill_manual():
ggplot(shannondiv, aes(x=Time_days, y=Shannon, fill=Treatment)) +
stat_summary(geom="ribbon", alpha=0.6) +
geom_point(shape=21) +
scale_fill_manual(values=c("indianred","cornflowerblue"))
Next up, I would like to adjust the overall aesthetics of the plot and will do so by changing the theme. One of my favourites is accomplished using theme_bw().
ggplot(shannondiv, aes(x=Time_days, y=Shannon, fill=Treatment)) +
stat_summary(geom="ribbon", alpha=0.6) +
geom_point(shape=21) +
scale_fill_manual(values=c("indianred","cornflowerblue")) +
theme_bw()
You can also define your own custom themes. I have included one in qiime2R called theme_q2r() which uses font sizes more appropriate for publication figures.
ggplot(shannondiv, aes(x=Time_days, y=Shannon, fill=Treatment)) +
stat_summary(geom="ribbon", alpha=0.6) +
geom_point(shape=21) +
scale_fill_manual(values=c("indianred","cornflowerblue")) +
theme_q2r()
Finally we can touch up our axes and title.
ggplot(shannondiv, aes(x=Time_days, y=Shannon, fill=Treatment)) +
stat_summary(geom="ribbon", alpha=0.6) +
geom_point(shape=21) +
scale_fill_manual(values=c("indianred","cornflowerblue")) +
theme_q2r() +
ylab("Shannon Diversity") +
xlab("Time (days)") +
ggtitle("Shannon Diversity in Recipient Mice")
Now that we have our finished plot, we can save it using ggsave(). Note that the units for height and width default to inches, which is very helpful when planning out a publication or a poster!
ggsave("figures/shannon_over_time.pdf", height=3, width=4)
Ok, looking at this plot, there are a number of questions we want to ask: is diversity different between the treatments, does diversity change over time, and does diversity change differently over time between the treatments?

Using the I-test (note: eye), I would propose only the middle question has the possibility of being true. Given our study design, we need to account for repeated sampling, which can be accomplished through a linear mixed-effects model including a random effect which denotes the MouseID. We can run this model using the lmer() function and the notation Shannon~Treatment*Time_days + (1|MouseID). This means that our dependent variable is Shannon diversity, and we want to test the effects of Treatment, Time, and their interaction (*), while controlling for repeated sampling within MouseID. Linear mixed-effects models have a number of beneficial features, like being able to handle complex study designs and tolerating missing data, but one must carefully evaluate the results and they often require decently powered experiments.
fit<-lmer(Shannon~Treatment*Time_days + (1|MouseID), data=shannondiv)
Note the message about a singular fit: how do you think you should handle this?
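One reasonable option (my suggestion, not the only valid approach): since the MouseID variance is estimated at exactly zero, the random effect contributes nothing, so we could compare against a plain fixed-effects model:
fit_lm<-lm(Shannon~Treatment*Time_days, data=shannondiv) # drop the random effect
summary(fit_lm) # coefficients should closely match the mixed model above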
We can check a couple of diagnostic plots, including the residuals, where we would hope to see a fairly random distribution:
plot(fit)
And a quantile-quantile plot, where we would hope that the points fall on the center line, suggesting the samples come from populations with similar distributions.
qqmath(fit)
Finally we can look at the results:
summary(fit)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: Shannon ~ Treatment * Time_days + (1 | MouseID)
Data: shannondiv
REML criterion at convergence: -4
Scaled residuals:
Min 1Q Median 3Q Max
-1.54965 -0.69518 0.01111 0.67136 2.16605
Random effects:
Groups Name Variance Std.Dev.
MouseID (Intercept) 0.00000 0.000
Residual 0.02434 0.156
Number of obs: 30, groups: MouseID, 6
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 3.041226 0.064346 26.000000 47.263 <2e-16
TreatmentPreWTL -0.090627 0.091000 26.000000 -0.996 0.3285
Time_days 0.012457 0.005454 26.000000 2.284 0.0308
TreatmentPreWTL:Time_days 0.008106 0.007713 26.000000 1.051 0.3029
(Intercept) ***
TreatmentPreWTL
Time_days *
TreatmentPreWTL:Time_days
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) TrPWTL Tm_dys
TrtmntPrWTL -0.707
Time_days -0.780 0.551
TrtmPWTL:T_ 0.551 -0.780 -0.707
convergence code: 0
boundary (singular) fit: see ?isSingular
From above, we can see that there was a significant effect of Time (P=0.031) with an estimated effect of 0.012±0.005 diversity units/day, while no other term is significant. Perhaps we want to do a more straightforward statistical analysis, which would be to test whether there is a significant difference between Pre/Post at each time point; we can accomplish that as below using Welch’s t-tests followed by a Bonferroni correction.
shannondiv %>%
group_by(Time_days) %>%
do(
t.test(Shannon~Treatment, data=.) %>%
tidy()
) %>%
ungroup() %>%
mutate(AdjPvalue=p.adjust(p.value, method="bonferroni")) %>%
select(Time_days, Pvalue=p.value, Bonferroni=AdjPvalue)
# A tibble: 5 x 3
Time_days Pvalue Bonferroni
<dbl> <dbl> <dbl>
1 1 0.505 1
2 3 0.136 0.679
3 7 0.0253 0.126
4 14 0.630 1
5 21 0.963 1
So as we can see, the difference at 7 days is not significant when we consider multiple testing correction.
You can’t win them all, and in this case it would appear that alpha diversity (measured by Shannon’s index) is not significantly different between treatment groups and is unlikely to explain the differences in weight loss.
We will start by calculating our distance matrix:
braycurtis<-
SVtable %>%
t() %>%
rrarefy(sample=min(colSums(SVtable))) %>%
apply(., 2, function(x) x/sum(x)) %>% # convert to proportions
vegdist(method = "bray")
Let’s now look at the distance matrix, or at least its top corner:
as.matrix(braycurtis)[1:10, 1:10]
MOUSE.G11.Day1 | MOUSE.G11.Day14 | MOUSE.G11.Day21 | MOUSE.G11.Day3 | MOUSE.G11.Day7 | MOUSE.G12.Day1 | MOUSE.G12.Day14 | MOUSE.G12.Day21 | MOUSE.G12.Day3 | MOUSE.G12.Day7 | |
---|---|---|---|---|---|---|---|---|---|---|
MOUSE.G11.Day1 | 0.0000000 | 0.9443029 | 0.9317079 | 0.9128938 | 0.9604865 | 0.4914967 | 0.9525329 | 0.9442062 | 0.9037734 | 0.9432727 |
MOUSE.G11.Day14 | 0.9443029 | 0.0000000 | 0.5221430 | 0.8250190 | 0.5698763 | 0.9601955 | 0.6525649 | 0.5337583 | 0.7516608 | 0.5775732 |
MOUSE.G11.Day21 | 0.9317079 | 0.5221430 | 0.0000000 | 0.8422434 | 0.6688037 | 0.9510593 | 0.7235263 | 0.4320101 | 0.7722957 | 0.6713900 |
MOUSE.G11.Day3 | 0.9128938 | 0.8250190 | 0.8422434 | 0.0000000 | 0.8443195 | 0.9391927 | 0.8774215 | 0.8343303 | 0.7130303 | 0.8325179 |
MOUSE.G11.Day7 | 0.9604865 | 0.5698763 | 0.6688037 | 0.8443195 | 0.0000000 | 0.9761717 | 0.7268493 | 0.6761330 | 0.7654650 | 0.4555076 |
MOUSE.G12.Day1 | 0.4914967 | 0.9601955 | 0.9510593 | 0.9391927 | 0.9761717 | 0.0000000 | 0.9669770 | 0.9595806 | 0.9212065 | 0.9606964 |
MOUSE.G12.Day14 | 0.9525329 | 0.6525649 | 0.7235263 | 0.8774215 | 0.7268493 | 0.9669770 | 0.0000000 | 0.7070862 | 0.8078133 | 0.7697195 |
MOUSE.G12.Day21 | 0.9442062 | 0.5337583 | 0.4320101 | 0.8343303 | 0.6761330 | 0.9595806 | 0.7070862 | 0.0000000 | 0.7804602 | 0.6528492 |
MOUSE.G12.Day3 | 0.9037734 | 0.7516608 | 0.7722957 | 0.7130303 | 0.7654650 | 0.9212065 | 0.8078133 | 0.7804602 | 0.0000000 | 0.7300499 |
MOUSE.G12.Day7 | 0.9432727 | 0.5775732 | 0.6713900 | 0.8325179 | 0.4555076 | 0.9606964 | 0.7697195 | 0.6528492 | 0.7300499 | 0.0000000 |
How can we visualize this? Today we will use principal coordinates analysis (PCoA):
pcbraycurtis<-pcoa(braycurtis)
When conducting a PCoA (or PCA), it is always a good idea to look at how much variation has been explained in your plot; this can be done with what is called a scree plot:
pcbraycurtis$values %>%
as.data.frame() %>%
rownames_to_column("PC") %>%
mutate(PC=as.numeric(PC)) %>%
select(PC, Relative_eig) %>%
mutate(PercentVariation=Relative_eig*100) %>%
ggplot(aes(x=PC, y=PercentVariation)) +
geom_point()
Based on this plot we can see that >30% of variation is explained on the first two axes, so a 2D plot makes a good amount of sense here. We will now extract the numeric coordinates from the PCoA object and merge them with our metadata for plotting.
plotbraycurtis<-
pcbraycurtis$vectors %>%
as.data.frame() %>%
rownames_to_column("SampleID") %>%
select(SampleID, Axis.1, Axis.2) %>%
left_join(metadata)
head(plotbraycurtis)
SampleID Axis.1 Axis.2 MouseID Time_days Treatment
1 MOUSE.G11.Day1 0.32158424 0.43350845 G11 1 PreWTL
2 MOUSE.G11.Day14 -0.37601087 0.07914481 G11 14 PreWTL
3 MOUSE.G11.Day21 -0.33207366 0.07054293 G11 21 PreWTL
4 MOUSE.G11.Day3 -0.08192775 0.14094713 G11 3 PreWTL
5 MOUSE.G11.Day7 -0.34157371 0.07445824 G11 7 PreWTL
6 MOUSE.G12.Day1 0.33845678 0.41348738 G12 1 PreWTL
Finally we can create our plot like we did above:
plotbraycurtis %>%
ggplot(aes(x=Axis.1, y=Axis.2, color=Treatment)) +
geom_point()
OK, not a bad start. First, let’s fix the aesthetics and include the variation explained on the axes.
plotbraycurtis %>%
ggplot(aes(x=Axis.1, y=Axis.2, color=Treatment)) +
geom_point() +
xlab(paste0("PC1:", round(pcbraycurtis$values$Relative_eig[1]*100, 2),"%")) +
ylab(paste0("PC2:", round(pcbraycurtis$values$Relative_eig[2]*100, 2),"%")) +
theme_q2r() +
scale_color_manual(values=c("indianred","cornflowerblue"))
Ok, this looks great, but what is special about that group of samples up in the top right corner? A key piece of missing information is that these are the same mice over time, and it would be ideal to denote time on this graph. We could do this by replacing the points with the day number, using geom_text() and specifying label=Time_days in the aesthetics:
plotbraycurtis %>%
ggplot(aes(x=Axis.1, y=Axis.2, color=Treatment, label=Time_days)) +
geom_text() +
xlab(paste0("PC1:", round(pcbraycurtis$values$Relative_eig[1]*100, 2),"%")) +
ylab(paste0("PC2:", round(pcbraycurtis$values$Relative_eig[2]*100, 2),"%")) +
theme_q2r() +
scale_color_manual(values=c("indianred","cornflowerblue"))
This is one option, or we could instead color by time, and use a different shape for the groups as below:
plotbraycurtis %>%
ggplot(aes(x=Axis.1, y=Axis.2, color=Time_days, shape=Treatment)) +
geom_point() +
xlab(paste0("PC1:", round(pcbraycurtis$values$Relative_eig[1]*100, 2),"%")) +
ylab(paste0("PC2:", round(pcbraycurtis$values$Relative_eig[2]*100, 2),"%")) +
theme_q2r()
Both of these are great options, but when the dataset is small enough, I like to connect points over time from the same individual/animal using geom_path(). To do this, we need to specify the grouping as group=MouseID and sort the samples by time, as this is the direction in which the arrows will be drawn:
plotbraycurtis %>%
arrange(Time_days) %>%
ggplot(aes(x=Axis.1, y=Axis.2, color=Treatment, group=MouseID)) +
geom_path(arrow=arrow(type="closed", length=unit(0.3,"cm")), alpha=0.6) +
xlab(paste0("PC1:", round(pcbraycurtis$values$Relative_eig[1]*100, 2),"%")) +
ylab(paste0("PC2:", round(pcbraycurtis$values$Relative_eig[2]*100, 2),"%")) +
theme_q2r() +
scale_color_manual(values=c("indianred","cornflowerblue"))
I think I like this the most, and so I am going to save it again using ggsave().
ggsave("figures/pcoa.pdf", height=3, width=4)
Ok, so based on this plot, what would we conclude? I would say that the two communities started different, changed over time, but always stayed different. If we want to statistically test this, we can turn to a family of statistical tests for ANOVA-like analyses of distance matrices, in this case ADONIS. Similar to above, we will use the same test of Time_days*Treatment, but include MouseID as a stratum for permutational testing to control for repeated sampling. Just to be safe, I am also going to reorder my metadata table to make sure it matches the order of samples in the distance matrix.
metadata<-metadata[match(labels(braycurtis),metadata$SampleID),]
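As an optional sanity check, we can confirm the ordering now lines up before running the test:
stopifnot(all(labels(braycurtis)==metadata$SampleID)) # errors if misaligned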
adonis(braycurtis~Time_days*Treatment, strata=metadata$MouseID, data=metadata)
Call:
adonis(formula = braycurtis ~ Time_days * Treatment, data = metadata, strata = metadata$MouseID)
Blocks: strata
Permutation: free
Number of permutations: 999
Terms added sequentially (first to last)
Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
Time_days 1 1.0555 1.05547 3.6449 0.09797 0.001 ***
Treatment 1 1.5601 1.56007 5.3875 0.14480 0.001 ***
Time_days:Treatment 1 0.6293 0.62927 2.1731 0.05841 0.012 *
Residuals 26 7.5289 0.28957 0.69882
Total 29 10.7737 1.00000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on this analysis we can conclude that there are significant effects of Time, Treatment, and their interaction (P<0.05 for all). It is then helpful to look at their proportion of variation explained (R2), wherein we can see that Treatment had the highest value at 14%, followed by 9.8% for Time and 5.8% for their interaction (the communities change with respect to each other over time).
Pre- and post-weight-loss microbiotas are two compositionally distinct communities, and with this multivariate evidence in hand, we should now investigate which organisms differ between them.
As we previously discussed, proportions violate the assumptions of many statistical approaches, and there are a number of transformations of the data that can help overcome this issue. Today we will be using centered log ratios (CLR), which involve a log transformation and a scaling of abundance relative to the average abundance in the sample. You can conceptually think of this like the ∆∆Ct normalization used in qPCR, but wherein there is no housekeeping gene and we instead estimate a control from the distribution of the data. In this mode of scaling the data, the columns sum to 0 (or near it, due to rounding issues), and an abundance >0 is above average in abundance while <0 is below average. Read more about compositional data here.

To calculate the CLR abundance we will first add 0.5 to every value in the table (SVtable+0.5), which allows us to then take the log2 of all of the abundances (log2()); we then operate along columns (MARGIN=2) and take the ratio of x against the geometric mean of x (mean(x)). Note: we subtract rather than divide because we are in log space, and log(x/y)=log(x)-log(y). Before doing any of this, we will start by removing very rare features: in this case, anything which is seen fewer than 10 times in the entire dataset (SVtable[rowSums(SVtable)>=10,]) or is found in 3 or fewer samples (filtSVtable[rowSums(apply(filtSVtable, 2, function(x) if_else(x>0, 1, 0)))>3,]).
filtSVtable<-SVtable[rowSums(SVtable)>=10,]
filtSVtable<-filtSVtable[rowSums(apply(filtSVtable, 2, function(x) if_else(x>0, 1, 0 )))>3,]
How would we figure out how many reads we have lost?
How would we figure out how many sequence variants we have removed?
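A sketch of one way to answer both questions by comparing the tables before and after filtering:
sum(SVtable)-sum(filtSVtable) # total reads removed by filtering
nrow(SVtable)-nrow(filtSVtable) # sequence variants removed by filtering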
CLRabundance<-apply(log2(filtSVtable+0.5), MARGIN=2, function(x) x-mean(x))
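As a quick check of the property described above, the CLR-transformed columns should sum to approximately 0 (up to floating point error):
range(colSums(CLRabundance)) # both values should be ~0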
We will again fit linear mixed effects models to our transformed data to extract which sequence variants are significantly different between groups.
tests<-
CLRabundance %>%
as.data.frame() %>%
rownames_to_column("FeatureID") %>%
pivot_longer(-FeatureID, names_to = "SampleID", values_to = "CLR") %>%
left_join(metadata) %>%
group_by(FeatureID) %>%
do(
lmer(CLR~Time_days*Treatment + (1|MouseID), data=.) %>%
anova() %>%
tidy()
)
Now we can add in multiple testing correction with the Benjamini-Hochberg FDR.
tests<-
tests %>%
group_by(term) %>%
mutate(FDR=p.adjust(p.value, method="BH"))
Now we can pull out the list of features which are significantly different between the two treatment groups, or which change differently over time between the groups. Later we will add in taxonomy to help guide us.
SigFeatures<-
tests %>%
ungroup() %>%
filter(term!="Time_days") %>%
filter(FDR<0.05) %>%
select(FeatureID) %>%
distinct()
dim(SigFeatures)
[1] 63 1
We can now see that there were 63 hits, but how do we make sense of them? It might help to visualize their abundances over time using something like a heat map. To do this I will calculate a fold change on a per-time-point basis (the fold change between the mean CLR abundance within each group). I am also going to assign each feature a new name composed of its FeatureID, Genus, and Species.
foldchanges<-
CLRabundance %>%
as.data.frame() %>%
rownames_to_column("FeatureID") %>%
pivot_longer(-FeatureID, names_to = "SampleID", values_to = "CLR") %>%
left_join(metadata) %>%
group_by(FeatureID, Time_days, Treatment) %>%
summarize(MeanCLR=mean(CLR)) %>%
ungroup() %>%
pivot_wider(values_from = MeanCLR, names_from = Treatment) %>%
mutate(log2FC=PostWTL-PreWTL) %>%
left_join(taxonomy) %>%
mutate(Name=paste(FeatureID, Genus, Species))
head(foldchanges)
# A tibble: 6 x 14
FeatureID Time_days PostWTL PreWTL log2FC Seq Kingdom Phylum Class
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <fct> <fct> <fct>
1 0032a6aa… 1 0.396 4.54 -4.15 AGGG… Bacter… Actin… Cori…
2 0032a6aa… 3 -3.32 3.99 -7.31 AGGG… Bacter… Actin… Cori…
3 0032a6aa… 7 -2.77 -3.20 0.426 AGGG… Bacter… Actin… Cori…
4 0032a6aa… 14 -3.98 -3.84 -0.131 AGGG… Bacter… Actin… Cori…
5 0032a6aa… 21 -4.04 -3.86 -0.174 AGGG… Bacter… Actin… Cori…
6 00c3c59e… 1 1.79 -2.36 4.15 AGGA… Bacter… Bacte… Bact…
# … with 5 more variables: Order <fct>, Family <fct>, Genus <fct>,
# Species <fct>, Name <chr>
Now that we have the fold changes by time, we can make our heat map (geom_tile()):
foldchanges %>%
filter(FeatureID %in% SigFeatures$FeatureID) %>%
ggplot(aes(x=Time_days, y=Name, fill=log2FC)) +
geom_tile()
This is OK, but not super helpful because of the color scheme and the X-axis. We must adjust the X-axis to be a factor/character instead of a number:
foldchanges %>%
filter(FeatureID %in% SigFeatures$FeatureID) %>%
mutate(Time_days=factor(Time_days, levels=c(1,3,7,14,21))) %>%
ggplot(aes(x=Time_days, y=Name, fill=log2FC)) +
geom_tile()
Now we add in a manual fill scheme; in this case, since we want a scale centered at 0, we can use scale_fill_gradient2():
foldchanges %>%
filter(FeatureID %in% SigFeatures$FeatureID) %>%
mutate(Time_days=factor(Time_days, levels=c(1,3,7,14,21))) %>%
ggplot(aes(x=Time_days, y=Name, fill=log2FC)) +
geom_tile() +
scale_fill_gradient2(low="cornflowerblue", high="indianred")
It would probably help the clarity if we were to meaningfully order the rows. This could be done via clustering (a sketch follows the next block), but for simplicity, let’s do it by sorting on the average log2FC across all time points. To control the order on the y-axis, we need to convert the names to a factor and specify their order.
featureorder<-
foldchanges %>%
group_by(Name) %>%
summarize(log2FC=mean(log2FC)) %>%
arrange(log2FC) %>%
pull(Name)
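If you wanted to try the clustering-based ordering mentioned above, here is a sketch: cluster the per-feature fold-change profiles with hclust() and take the order of the dendrogram leaves.
clusterorder<-
  foldchanges %>%
  select(Name, Time_days, log2FC) %>%
  pivot_wider(names_from=Time_days, values_from=log2FC) %>%
  as.data.frame() %>%
  column_to_rownames("Name") %>%
  dist() %>% # Euclidean distances between fold-change profiles
  hclust() %>% # hierarchical clustering (default complete linkage)
  {.$labels[.$order]} # extract the leaf order of the dendrogram
# then substitute clusterorder for featureorder in the factor levels below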
foldchanges %>%
filter(FeatureID %in% SigFeatures$FeatureID) %>%
mutate(Time_days=factor(Time_days, levels=c(1,3,7,14,21))) %>%
mutate(Name=factor(Name, levels=featureorder)) %>%
ggplot(aes(x=Time_days, y=Name, fill=log2FC)) +
geom_tile() +
scale_fill_gradient2(low="cornflowerblue", high="indianred")
And now finally, we can adjust the aesthetics to get us to a polished product:
foldchanges %>%
filter(FeatureID %in% SigFeatures$FeatureID) %>%
mutate(Time_days=factor(Time_days, levels=c(1,3,7,14,21))) %>%
mutate(Name=factor(Name, levels=featureorder)) %>%
ggplot(aes(x=Time_days, y=Name, fill=log2FC)) +
geom_tile() +
scale_fill_gradient2(low="cornflowerblue", high="indianred") +
ylab("Sequence Variant") +
xlab("Time (days)") +
theme_q2r() +
coord_cartesian(expand=FALSE) #shrink excess white space
Finally, let’s save a copy.
ggsave("figures/heatmap.pdf", width=6, height=5)
A very wise person once said “you can upload it, download it, and reload it, but eventually you actually have to read it and think about it.” So looking through this list of sequence variants, do any of them pop out to you?
To help understand the results, it can also help to plot individual features’ abundances over time. We can easily accomplish that as below:
individualabundances<-
CLRabundance %>%
as.data.frame() %>%
rownames_to_column("FeatureID") %>%
filter(FeatureID %in% SigFeatures$FeatureID) %>%
pivot_longer(-FeatureID, names_to = "SampleID", values_to = "CLR") %>%
left_join(metadata) %>%
left_join(taxonomy) %>%
mutate(name=paste(FeatureID, Genus, Species)) %>%
select(name, CLR, Time_days, Treatment)
individualabundances
# A tibble: 1,890 x 4
name CLR Time_days Treatment
<chr> <dbl> <dbl> <chr>
1 bd87538a8a84679b27e33c28432046c2 Blautia caec… 8.47 1 PreWTL
2 bd87538a8a84679b27e33c28432046c2 Blautia caec… 8.97 14 PreWTL
3 bd87538a8a84679b27e33c28432046c2 Blautia caec… 7.39 21 PreWTL
4 bd87538a8a84679b27e33c28432046c2 Blautia caec… 10.8 3 PreWTL
5 bd87538a8a84679b27e33c28432046c2 Blautia caec… 9.96 7 PreWTL
6 bd87538a8a84679b27e33c28432046c2 Blautia caec… 5.77 1 PreWTL
7 bd87538a8a84679b27e33c28432046c2 Blautia caec… 7.45 14 PreWTL
8 bd87538a8a84679b27e33c28432046c2 Blautia caec… 7.21 21 PreWTL
9 bd87538a8a84679b27e33c28432046c2 Blautia caec… 10.8 3 PreWTL
10 bd87538a8a84679b27e33c28432046c2 Blautia caec… 9.59 7 PreWTL
# … with 1,880 more rows
To create separate plots on a per-feature basis, we can take advantage of either the facet_wrap() or facet_grid() functions as below. Note: by specifying scales="free" we allow each plot to have its own scale, and by specifying ncol=6 we are asking for 6 columns (and, by extension, 11 rows).
individualabundances %>%
ggplot(aes(x=Time_days, y=CLR, color=Treatment)) +
stat_summary(geom="errorbar", width=0) +
stat_summary(geom="line") +
stat_summary(geom="point") +
theme_q2r() +
facet_wrap(~name, scales="free", ncol=6)
ggsave("figures/perFeature.pdf", height=20, width=30)
This is the type of plot I would never include in a manuscript, but it is helpful for interpreting the heat map. You could also select one panel from this plot to use in a figure for an organism of interest, as sketched below.
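For example, to pull out a single panel for an organism of interest (the genus here is purely illustrative and may or may not be among your significant hits):
individualabundances %>%
  filter(grepl("Akkermansia", name)) %>% # hypothetical organism of interest
  ggplot(aes(x=Time_days, y=CLR, color=Treatment)) +
  stat_summary(geom="errorbar", width=0) +
  stat_summary(geom="line") +
  stat_summary(geom="point") +
  theme_q2r()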
Since we have a number of significantly different features, it may help to visualize them on a phylogenetic tree to spot any big-picture differences. We can do that using the ggtree() function, which allows us to use ggplot2 aesthetic mappings in the context of a tree. To merge data with the tree, we need it in one large data.frame or tibble wherein the first column maps to the tips of the tree:
tree
Phylogenetic tree with 419 tips and 418 internal nodes.
Tip labels:
438fb020cf251f2c2f4e89552e4a78ff, cde9017648b65ce5f31e70347eb2cda3, 20cf13f228f8d16b499e9493dd2ee208, cece12865e81c96a523b3bb6a4511561, ca1392eaa969b2980a69cc34a327e112, 5669573ccb2eb47522bed90a9e3439e4, ...
Node labels:
root, 0.987, 0.952, 0.995, 0.938, 0.740, ...
Rooted; includes branch lengths.
head(tree$tip.label)
[1] "438fb020cf251f2c2f4e89552e4a78ff" "cde9017648b65ce5f31e70347eb2cda3"
[3] "20cf13f228f8d16b499e9493dd2ee208" "cece12865e81c96a523b3bb6a4511561"
[5] "ca1392eaa969b2980a69cc34a327e112" "5669573ccb2eb47522bed90a9e3439e4"
As you can see above, the tip labels are the feature IDs. I would like to map on the average fold change that we calculated above, as well as whether the hit was significant or not, so I will create an object below:
treedata<-
foldchanges %>%
group_by(FeatureID) %>%
summarize(log2FC=mean(log2FC)) %>%
mutate(Significant=if_else(FeatureID %in% SigFeatures$FeatureID, "*",""))
To merge this data with the tree, we use the unusual %<+% operator, followed by our standard ggplot2 syntax of +. Note: you can specify aes() mappings within any given geom function as below.
ggtree(tree) %<+%
treedata +
geom_tippoint(aes(color=log2FC)) +
geom_tiplab(aes(label=Significant))
OK, a good start, but let’s adjust the color scale and change our shape such that it always has an outline (a purely aesthetic feature that I prefer).
ggtree(tree) %<+%
treedata +
geom_tippoint(aes(fill=log2FC), shape=21) +
geom_tiplab(aes(label=Significant)) +
scale_fill_gradient2(low="cornflowerblue", high="darkred")
Note: ggtree removes the legend by default; let’s add it back in on the right side.
ggtree(tree) %<+%
treedata +
geom_tippoint(aes(fill=log2FC), shape=21) +
geom_tiplab(aes(label=Significant)) +
scale_fill_gradient2(low="cornflowerblue", high="darkred") +
theme(legend.position="right")
Notice all the grey tips? They are grey because their values are NA: these are features which did not make our filtering thresholds before the CLR transformation. We can “prune” them from the tree as below:
drop.tip(tree, tree$tip.label[!tree$tip.label %in% treedata$FeatureID]) %>%
ggtree() %<+%
treedata +
geom_tippoint(aes(fill=log2FC), shape=21) +
geom_tiplab(aes(label=Significant)) +
scale_fill_gradient2(low="cornflowerblue", high="darkred") +
theme(legend.position="right")
We could also make this tree circular using the layout option:
drop.tip(tree, tree$tip.label[!tree$tip.label %in% treedata$FeatureID]) %>%
ggtree(layout="circular") %<+% treedata +
geom_tippoint(aes(fill=log2FC), shape=21) +
geom_tiplab(aes(label=Significant)) +
scale_fill_gradient2(low="cornflowerblue", high="darkred") +
theme(legend.position="right")
This looks pretty good, although maybe it was overly reductive to simplify down to a single log2 fold change. Instead, let’s merge our heat map with our tree using ggtree’s gheatmap() function! I will splice together bits of code from our heat map and our tree to do this.
heatmapplot<-
foldchanges %>%
filter(FeatureID %in% treedata$FeatureID) %>%
mutate(Time_days=factor(Time_days, levels=c(1,3,7,14,21))) %>%
select(FeatureID, Time_days, log2FC) %>%
pivot_wider(names_from = "Time_days", values_from="log2FC") %>%
as.data.frame() %>%
column_to_rownames("FeatureID")
treeplot<-
drop.tip(tree, tree$tip.label[!tree$tip.label %in% treedata$FeatureID]) %>%
ggtree(layout="circular") %<+%
treedata +
geom_tiplab2(aes(label=Significant), align=TRUE, offset=0.65 )
gheatmap(treeplot, heatmapplot) +
scale_fill_gradient2(low="cornflowerblue", high="darkred")
Hopefully this session has been helpful. If you have any feedback or questions, please email me at jordan.bisanz@ucsf.edu.