TCGA数据挖掘（四）：表达差异分析（4）

2019 年 10 月 4 日
笔记

在之前我们的文章：TCGA数据挖掘（三）：表达差异分析中，我们利用的是TCGAbiolinks包中的TCGAanalyze_DEA函数进行差异表达分析，我们也提到可以选择基于limma或edgeR包进行分析，TCGA数据挖掘（三）：表达差异分析这一讲中我们利用的是edgeR包，之后我们在文章：TCGA数据挖掘（四）：表达差异分析（2）和TCGA数据挖掘（四）：表达差异分析（3）中分别也介绍了其他方法的差异分析，包括edgeR和DESeq包，今天这一讲，我们就利用TCGAbiolinks包中的TCGAanalyze_DEA函数基于limma包进行差异分析。

数据下载

基因表达数据的下载

数据下载代码和之前的一样，这里再提供一次。避免出错不知道原因。

setwd("E:\BioInfoCloud\TCGABiolinks\case_study1")  # 加载相应的包，可能会需要其他包，提示错误就安装缺少的包。  # 因为每个人已经安装的包不一样。  library(TCGAbiolinks)  library(plyr)  library(limma)  library(biomaRt)  library(SummarizedExperiment)  # 请求数据。  query <- GDCquery(project = "TCGA-BRCA",                    data.category = "Transcriptome Profiling",                    data.type = "Gene Expression Quantification",                    workflow.type = "HTSeq - Counts")  # 从query中获取结果表，它可以选择带有cols参数的列，并使用rows参数返回若干行。  # 1222个barcode  samplesDown <- getResults(query,cols=c("cases"))  # samplesDown中筛选出TP(主要实体瘤)样本的barcode  # 1102个barcode  dataSmTP <- TCGAquery_SampleTypes(barcode = samplesDown,                                    typesample = "TP")  # 113个barcode  dataSmNT <- TCGAquery_SampleTypes(barcode = samplesDown,                                    typesample = "NT")    # 选择100个正常组织和100个肿瘤组织样本作为研究对象  dataSmTP_short <- dataSmTP[1:100]  dataSmNT_short <- dataSmNT[1:100]  # 根据前面的筛选，再次请求数据  queryDown <- GDCquery(project = "TCGA-BRCA",                        data.category = "Transcriptome Profiling",                        data.type = "Gene Expression Quantification",                        workflow.type = "HTSeq - Counts",                        barcode = c(dataSmTP_short, dataSmNT_short))  #读取下载的数据并将其准备到R对象中，在工作目录生成brca_case1.rda文件，  #同时还产生Human_genes__GRCh38_p12_.rda文件  GDCdownload(query = queryDown)  dataPrep1 <- GDCprepare(query = queryDown, save = TRUE, save.filename = "brca_case1.rda")

数据处理

# 去除dataPrep1中的异常值，dataPrep数据中含有肿瘤组织和正常组织的数据  dataPrep <- TCGAanalyze_Preprocessing(object = dataPrep1,                                        cor.cut = 0.6,                                        datatype = "HTSeq - Counts")  #使用EDASeq进行标准化  dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,                                        geneInfo = geneInfoHT,                                        method = "gcContent")  #根据分位数一处count较低的基因。  dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,                                    method = "quantile",                                      qnt.cut =  0.25)  # 将计数数据转换为log2计数/百万(LogCPM)，估计均值-方差关系并使用此关系计算适当的观测级别权重。  # 然后，数据即可用于线性建模。  v.dataFilt<-voom(dataFilt)  #in order to have 2 batches with multiple samples and avoid batches with one sample  #the user need to call first the get_IDs function on the top of this script if this has not been done already  c1<-which(get_IDs(dataPrep)$tss=="E9")  c2<-which(get_IDs(dataPrep)$tss=="E2")  #taking log transformed data for exploration of batch effects  TCGAbatch_Correction(tabDF = v.dataFilt$E[,c(c1,c2)], batch.factor="TSS", adjustment=NULL)  #if you get error in plot.new():figure margins too large - please use:  #par(mar=c(1,1,1,1))  #and then rerun the command above

#### After exploration of the batches we can continue to work on the original gene matrix for DEA ###  ###############Tumor purity filtering###########  ###vector containing all TCGA barcodes that hhave 60% tumor purity or more  Purity.BRCA<-TCGAtumor_purity(colnames(dataPrep1), 0, 0, 0, 0, 0.6)$pure_barcodes  ################DEA with Molecular subtypes############  ####Molecular subtypes:  ####Filtering data so all samples have a pam50 subtype for BRCA:    #diff contains TCGA samples that have an available molecular subtype  ###Also Applying Tumor purity filtering by making a set difference  #documentation for TCGA_MolecularSubtype is available  ##########Important########  ####in this example no change happens because all the TCGA barcodes  ####have available subtype info  ###setdiff() is here used to remove samples with no subtype info from the TCGA BRCA samples that satisfy  ###tumor purity criteria  diff<-setdiff(Purity.BRCA, TCGA_MolecularSubtype(colnames(dataPrep[,dataSmTP_short]))$filtered)  dataFilt.brca.cancer<-dataPrep[,diff]  dataFilt.brca.normal<-dataPrep[,dataSmNT_short]  dataFilt.brca<-cbind(dataFilt.brca.normal, dataFilt.brca.cancer)  mol_subtypes<-c(rep("Normal", 100), TCGA_MolecularSubtype(colnames(dataFilt.brca.cancer))$subtypes$subtype)  # All barcodes have available molecular subtype info  mol_subtypes<-make.names(mol_subtypes)  # to convert ENSEMBL id to gene name  rownames(dataFilt.brca)<-rowData(dataPrep1)$external_gene_name  # we redo here the normalization and filtering steps  dataNorm.brca <- TCGAanalyze_Normalization(tabDF = dataFilt.brca,                                             geneInfo = geneInfo,                                             method = "gcContent")  # 将标准化后的数据再过滤，得到最终的数据  dataFilt.brca.final <- TCGAanalyze_Filtering(tabDF = dataNorm.brca,                                         method = "quantile",                                         qnt.cut =  0.25)

表达差异分析

这里利用的是TCGAbiolinks包中的TCGAanalyze_DEA函数，是基于limma包的差异分析。

DEG.brca.TSS <- TCGAanalyze_DEA(MAT=dataFilt.brca.final,                              pipeline="limma",                              batch.factors = c("TSS"),                              Cond1type = "Normal",                              Cond2type = "Tumor",                              fdr.cut = 0.01 ,                              logFC.cut = 1,                              voom = TRUE,                              Condtypes = mol_subtypes,                              contrast.formula = "Mycontrast = (BRCA.LumA+BRCA.LumB)/2 -Normal")  #in this DEA we use Normal as a reference, thus genes with LogFC > 1 are up regulated in the subtypes  #with respect to the normal samples and viceversa for the LogFC < -1.    write.csv(DEG.brca.TSS, "DEGsBRCA_limma_091018.csv", quote = FALSE)  #3241 genes identified    #to check how many with logFC > 5 (for plotting purposes)  DEG.brca.filt.TSS<-DEG.brca.TSS$Mycontrast[which(abs(DEG.brca.TSS$Mycontrast$logFC) >= 6), ]  #47 genes identified

可视化利用火山图进行可视化。 TCGAVisualize_volcano(DEG.brca.TSS$Mycontrast$logFC, DEG.brca.TSS$Mycontrast$adj.P.Val, filename = "LuminalABvsNormal_FC6.TSS.pdf", xlab = "logFC", names = rownames(DEG.brca.TSS$Mycontrast), show.names = "highlighted", x.cut = 1, y.cut = 0.01, highlight = rownames(DEG.brca.TSS$Mycontrast)[which(abs(DEG.brca.TSS$Mycontrast$logFC) >= 6)], highlight.color = "orange")

在之前利用edgeR包获得的结果图如下，我们简单比较一下2中方法的差异。

很明显看到2中方法得到的结果是有差异的，这有时候我们会纠结那种方法好，这个就看你自己的研究啦，什么结果符合自己的用什么，当然这有点投机取巧的感觉，最好是2中方法得到后取交集。不过下面我们对这2种方法得到的结果进行一个相关性评估。 limma和edgeR结果的相关性比较library(ggplot2) DEGs_limma <- read.csv("DEGsBRCA_limma_091018.csv", row.names = 1) DEGs_edgeR <- read.csv("DEGsBRCA_edgeR_091018.csv", row.names = 1) genes_toTest <- rownames(DEGs_limma)[1:1000] genes_common <- intersect(genes_toTest, rownames(DEGs_edgeR)) dataset <- subset(DEGs_limma, rownames(DEGs_limma) %in% genes_common, select = "Mycontrast.logFC") dataset <- cbind(DEGs_edgeR$Mycontrast.logFC[match(rownames(dataset),rownames(DEGs_edgeR))], dataset) colnames(dataset) <- c("logFC_edgeR", "logFC_limma") pdf("scatterplot_logFC_limma_edgeR_top1000.pdf", width=5, height=3) ggplot(dataset, aes(x=logFC_limma,y=logFC_edgeR))+geom_point()+ xlab("logFC_limma")+ylab("logFC_edgeR")+ theme(legend.title=element_blank(),axis.title=element_text(size=10),axis.text=element_text(size=10), legend.text=element_text(size=20)) dev.off() cor(dataset$logFC_edgeR, dataset$logFC_limma) #0.9936 这2种方法得到的结果虽然有差异，但很小得到的大多数差异基因是一样的。