TCGA数据挖掘(四):表达差异分析(4)
- 2019 年 10 月 4 日
- 笔记
在之前我们的文章:TCGA数据挖掘(三):表达差异分析中,我们利用的是TCGAbiolinks包中的TCGAanalyze_DEA函数进行差异表达分析,我们也提到可以选择基于limma或edgeR包进行分析,TCGA数据挖掘(三):表达差异分析这一讲中我们利用的是edgeR包,之后我们在文章:TCGA数据挖掘(四):表达差异分析(2)和TCGA数据挖掘(四):表达差异分析(3)中分别也介绍了其他方法的差异分析,包括edgeR和DESeq包,今天这一讲,我们就利用TCGAbiolinks包中的TCGAanalyze_DEA函数基于limma包进行差异分析。
数据下载
基因表达数据的下载
数据下载代码和之前的一样,这里再提供一次。避免出错不知道原因。
setwd("E:\BioInfoCloud\TCGABiolinks\case_study1") # 加载相应的包,可能会需要其他包,提示错误就安装缺少的包。 # 因为每个人已经安装的包不一样。 library(TCGAbiolinks) library(plyr) library(limma) library(biomaRt) library(SummarizedExperiment) # 请求数据。 query <- GDCquery(project = "TCGA-BRCA", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "HTSeq - Counts") # 从query中获取结果表,它可以选择带有cols参数的列,并使用rows参数返回若干行。 # 1222个barcode samplesDown <- getResults(query,cols=c("cases")) # samplesDown中筛选出TP(主要实体瘤)样本的barcode # 1102个barcode dataSmTP <- TCGAquery_SampleTypes(barcode = samplesDown, typesample = "TP") # 113个barcode dataSmNT <- TCGAquery_SampleTypes(barcode = samplesDown, typesample = "NT") # 选择100个正常组织和100个肿瘤组织样本作为研究对象 dataSmTP_short <- dataSmTP[1:100] dataSmNT_short <- dataSmNT[1:100] # 根据前面的筛选,再次请求数据 queryDown <- GDCquery(project = "TCGA-BRCA", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "HTSeq - Counts", barcode = c(dataSmTP_short, dataSmNT_short)) #读取下载的数据并将其准备到R对象中,在工作目录生成brca_case1.rda文件, #同时还产生Human_genes__GRCh38_p12_.rda文件 GDCdownload(query = queryDown) dataPrep1 <- GDCprepare(query = queryDown, save = TRUE, save.filename = "brca_case1.rda")
数据处理
# 去除dataPrep1中的异常值,dataPrep数据中含有肿瘤组织和正常组织的数据 dataPrep <- TCGAanalyze_Preprocessing(object = dataPrep1, cor.cut = 0.6, datatype = "HTSeq - Counts") #使用EDASeq进行标准化 dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep, geneInfo = geneInfoHT, method = "gcContent") #根据分位数一处count较低的基因。 dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm, method = "quantile", qnt.cut = 0.25) # 将计数数据转换为log2计数/百万(LogCPM),估计均值-方差关系并使用此关系计算适当的观测级别权重。 # 然后,数据即可用于线性建模。 v.dataFilt<-voom(dataFilt) #in order to have 2 batches with multiple samples and avoid batches with one sample #the user need to call first the get_IDs function on the top of this script if this has not been done already c1<-which(get_IDs(dataPrep)$tss=="E9") c2<-which(get_IDs(dataPrep)$tss=="E2") #taking log transformed data for exploration of batch effects TCGAbatch_Correction(tabDF = v.dataFilt$E[,c(c1,c2)], batch.factor="TSS", adjustment=NULL) #if you get error in plot.new():figure margins too large - please use: #par(mar=c(1,1,1,1)) #and then rerun the command above

#### After exploration of the batches we can continue to work on the original gene matrix for DEA ### ###############Tumor purity filtering########### ###vector containing all TCGA barcodes that hhave 60% tumor purity or more Purity.BRCA<-TCGAtumor_purity(colnames(dataPrep1), 0, 0, 0, 0, 0.6)$pure_barcodes ################DEA with Molecular subtypes############ ####Molecular subtypes: ####Filtering data so all samples have a pam50 subtype for BRCA: #diff contains TCGA samples that have an available molecular subtype ###Also Applying Tumor purity filtering by making a set difference #documentation for TCGA_MolecularSubtype is available ##########Important######## ####in this example no change happens because all the TCGA barcodes ####have available subtype info ###setdiff() is here used to remove samples with no subtype info from the TCGA BRCA samples that satisfy ###tumor purity criteria diff<-setdiff(Purity.BRCA, TCGA_MolecularSubtype(colnames(dataPrep[,dataSmTP_short]))$filtered) dataFilt.brca.cancer<-dataPrep[,diff] dataFilt.brca.normal<-dataPrep[,dataSmNT_short] dataFilt.brca<-cbind(dataFilt.brca.normal, dataFilt.brca.cancer) mol_subtypes<-c(rep("Normal", 100), TCGA_MolecularSubtype(colnames(dataFilt.brca.cancer))$subtypes$subtype) # All barcodes have available molecular subtype info mol_subtypes<-make.names(mol_subtypes) # to convert ENSEMBL id to gene name rownames(dataFilt.brca)<-rowData(dataPrep1)$external_gene_name # we redo here the normalization and filtering steps dataNorm.brca <- TCGAanalyze_Normalization(tabDF = dataFilt.brca, geneInfo = geneInfo, method = "gcContent") # 将标准化后的数据再过滤,得到最终的数据 dataFilt.brca.final <- TCGAanalyze_Filtering(tabDF = dataNorm.brca, method = "quantile", qnt.cut = 0.25)
表达差异分析
这里利用的是TCGAbiolinks包中的TCGAanalyze_DEA函数,是基于limma包的差异分析。
DEG.brca.TSS <- TCGAanalyze_DEA(MAT=dataFilt.brca.final, pipeline="limma", batch.factors = c("TSS"), Cond1type = "Normal", Cond2type = "Tumor", fdr.cut = 0.01 , logFC.cut = 1, voom = TRUE, Condtypes = mol_subtypes, contrast.formula = "Mycontrast = (BRCA.LumA+BRCA.LumB)/2 -Normal") #in this DEA we use Normal as a reference, thus genes with LogFC > 1 are up regulated in the subtypes #with respect to the normal samples and viceversa for the LogFC < -1. write.csv(DEG.brca.TSS, "DEGsBRCA_limma_091018.csv", quote = FALSE) #3241 genes identified #to check how many with logFC > 5 (for plotting purposes) DEG.brca.filt.TSS<-DEG.brca.TSS$Mycontrast[which(abs(DEG.brca.TSS$Mycontrast$logFC) >= 6), ] #47 genes identified
可视化 利用火山图进行可视化。 TCGAVisualize_volcano(DEG.brca.TSS$Mycontrast$logFC, DEG.brca.TSS$Mycontrast$adj.P.Val, filename = "LuminalABvsNormal_FC6.TSS.pdf", xlab = "logFC", names = rownames(DEG.brca.TSS$Mycontrast), show.names = "highlighted", x.cut = 1, y.cut = 0.01, highlight = rownames(DEG.brca.TSS$Mycontrast)[which(abs(DEG.brca.TSS$Mycontrast$logFC) >= 6)], highlight.color = "orange")

在之前利用edgeR包获得的结果图如下,我们简单比较一下2中方法的差异。

很明显看到2中方法得到的结果是有差异的,这有时候我们会纠结那种方法好,这个就看你自己的研究啦,什么结果符合自己的用什么,当然这有点投机取巧的感觉,最好是2中方法得到后取交集。 不过下面我们对这2种方法得到的结果进行一个相关性评估。 limma和edgeR结果的相关性比较library(ggplot2) DEGs_limma <- read.csv("DEGsBRCA_limma_091018.csv", row.names = 1) DEGs_edgeR <- read.csv("DEGsBRCA_edgeR_091018.csv", row.names = 1) genes_toTest <- rownames(DEGs_limma)[1:1000] genes_common <- intersect(genes_toTest, rownames(DEGs_edgeR)) dataset <- subset(DEGs_limma, rownames(DEGs_limma) %in% genes_common, select = "Mycontrast.logFC") dataset <- cbind(DEGs_edgeR$Mycontrast.logFC[match(rownames(dataset),rownames(DEGs_edgeR))], dataset) colnames(dataset) <- c("logFC_edgeR", "logFC_limma") pdf("scatterplot_logFC_limma_edgeR_top1000.pdf", width=5, height=3) ggplot(dataset, aes(x=logFC_limma,y=logFC_edgeR))+geom_point()+ xlab("logFC_limma")+ylab("logFC_edgeR")+ theme(legend.title=element_blank(),axis.title=element_text(size=10),axis.text=element_text(size=10), legend.text=element_text(size=20)) dev.off() cor(dataset$logFC_edgeR, dataset$logFC_limma) #0.9936 这2种方法得到的结果虽然有差异,但很小得到的大多数差异基因是一样的。
