ubiome类似数据dada2处理探索4
前面我们探索了处理不能拼接的V4 PE150数据,首先双向reads根据质量情况分别切成120bp,然后使用dada2 R包进行了直接+10N拼接,生成ASV表,再分别使用dada2包和qiime2进行了物种注释,基本上完成了一个最简单的分析过程,这里,使用比较流行的phyloseq包进行下多样性分析。
说实话,之前从没有使用R进行过16S数据分析,一般认为R速度慢些,而且不熟悉。这里用了一次感觉还不错,虽然由于只有一个样本参数报错,改了下(基本上是直接删除参数的),总算有个图出来了。下面是详细过程:
1.安装加载包
BiocManager::install("phyloseq", version = "3.10")
library(phyloseq); packageVersion("phyloseq")
library(Biostrings); packageVersion("Biostrings")
library(ggplot2); packageVersion("ggplot2")
theme_set(theme_bw())
2.准备sample-meta表信息
由于只有一个样本,做了最简单的处理,目的是为了不报错,代码运行,可能有不合理之处,欢迎指正。
samples.out <- "NULL"#rownames(seqtab.nochim)
subject <- "join"#sapply(strsplit(samples.out, "D"), `[`, 1)
gender <- "man"#substr(subject,1,1)
subject <- substr(subject,2,999)
day <- "2019" #as.integer(sapply(strsplit(samples.out, "D"), `[`, 2))
samdf <- data.frame(Subject=subject)#, Gender=gender, Day=day)
samdf$When <- "Early"
samdf$When[samdf$Day>100] <- "Late"
rownames(samdf) <- samples.out
#sample_names() 这个是软件报错提示运行的,上面运行后不需要这行
ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE),
sample_data(samdf),
tax_table(taxa))
ps <- prune_samples(sample_names(ps) != "Mock", ps) # Remove mock sample
3.otutable处理,多样性和柱形图绘制
dna <- Biostrings::DNAStringSet(taxa_names(ps))
names(dna) <- taxa_names(ps)
ps <- merge_phyloseq(ps, dna)
taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps)))
ps
#这是输出信息
phyloseq-class experiment-level object
otu_table() OTU Table: [ 348 taxa and 1 samples ]
sample_data() Sample Data: [ 1 samples by 1 sample variables ]
tax_table() Taxonomy Table: [ 348 taxa by 7 taxonomic ranks ]
refseq() DNAStringSet: [ 348 reference sequences ]
#丰富度绘图
plot_richness(ps, measures=c("Shannon", "Simpson"))#, color="When")
# Transform data to proportions as appropriate for Bray-Curtis distances
#以下两行beta多样性,只有一个样本,就没有运行
#ps.prop <- transform_sample_counts(ps, function(otu) otu/sum(otu))
#ord.nmds.bray <- ordinate(ps.prop, method="NMDS", distance="bray")
#选前20属绘图
top20 <- names(sort(taxa_sums(ps), decreasing=TRUE))[1:20]
ps.top20 <- transform_sample_counts(ps, function(OTU) OTU/sum(OTU))
ps.top20 <- prune_taxa(top20, ps.top20)
plot_bar(ps.top20, fill="Family") #+ facet_wrap(~When, scales="free_x")
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-LiK9Kz6y-1580028204222)(https://github.com/zd200572/mp_kejijizhe/raw/master/blogs/Rplot-shannon.png)]
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-7A459PuC-1580028204231)(https://github.com/zd200572/mp_kejijizhe/raw/master/blogs/Rplot-bar.png)]
4.另一种序列注释方法
官方文档里介绍了另一种物种注释方法,这里也试试。需要下载训练好的参考数据库集,RDP_v16-mod_March2018.RData。下载地址: http://DECIPHER.codes/Downloads.html
#包安装和加载
BiocManager::install("DECIPHER", version = "3.10")
library(DECIPHER); packageVersion("DECIPHER")
dna <- DNAStringSet(getSequences(seqtab.nochim)) # Create a DNAStringSet from the ASVs
load("~/Biodata/Refrence/RDP_v16-mod_March2018.RData") # CHANGE TO THE PATH OF YOUR TRAINING SET
ids <- IdTaxa(dna, trainingSet, strand="top", processors=NULL, verbose=FALSE) # use all processors
ranks <- c("domain", "phylum", "class", "order", "family", "genus", "species") # ranks of interest
# Convert the output object of class "Taxa" to a matrix analogous to the output from assignTaxonomy
taxid <- t(sapply(ids, function(x) {
m <- match(ranks, x$rank)
taxa <- x$taxon[m]
taxa[startsWith(taxa, "unclassified_")] <- NA
taxa
}))
colnames(taxid) <- ranks; rownames(taxid) <- getSequences(seqtab.nochim)
比较下三种注释方法的差别
好了,到这里,基本上两三个物种注释方法已经完成了,下面比较下三种方法的差别。
前面已经进行的探索有:
上一篇: mapreduce运行遇到的问题-1
推荐阅读