一文看懂WGCNA 分析(2019更新版)

2019 年 10 月 11 日
筆記

發現我這個4年前的WGCNA分析教程可以排在自己最受歡迎的前10個教程裡面了，而且直接以我這個授課程式碼出的SCI文章就有38篇了，當然不排除很多學員使用我的程式碼卻不告知我，也不會致謝我。

不過，我這點戰績根本就算不上什麼，其實這個WGCNA包已經是十多年前發表的了，仍然是廣受好評及引用量一直在增加，破萬也是指日可待。

大家首先可以看到3個教程：

2016-WGCNA-HCC-hub-gene.pdf 中文文章範例）
WGCNA_GBMTutorialHorvath.pdf
WGCNA_YeastTutorialHorvath.pdf

其中第一個是我4年前的WGCNA分析教程最主要的參考文獻，後面兩個是英文教程，我相信你大概率是不會去看的，不過，我還是放在這裡了。（還是需要強調，這兩個英文教程完整的展現了WGCNA的全部用法）

然後你只需要簡單瀏覽本文檔，就可以在rstudio裡面打開後綴是proj的文件，打開R程式碼，一步步跟著學習啦！

基本概念

WGCNA其譯為加權基因共表達網路分析。該分析方法旨在尋找協同表達的基因模組(module)，並探索基因網路與關注的表型之間的關聯關係，以及網路中的核心基因。

適用於複雜的數據模式，推薦5組(或者15個樣品)以上的數據。一般可應用的研究方向有：不同器官或組織類型發育調控、同一組織不同發育調控、非生物脅迫不同時間點應答、病原菌侵染後不同時間點應答。

基本原理

從方法上來講，WGCNA分為表達量聚類分析和表型關聯兩部分，主要包括基因之間相關係數計算、基因模組的確定、共表達網路、模組與性狀關聯四個步驟。

第一步計算任意兩個基因之間的相關係數（Person Coefficient）。為了衡量兩個基因是否具有相似表達模式，一般需要設置閾值來篩選，高於閾值的則認為是相似的。但是這樣如果將閾值設為0.8，那麼很難說明0.8和0.79兩個是有顯著差別的。因此，WGCNA分析時採用相關係數加權值，即對基因相關係數取N次冪，使得網路中的基因之間的連接服從無尺度網路分布(scale-freenetworks)，這種演算法更具生物學意義。

第二步通過基因之間的相關係數構建分層聚類樹，聚類樹的不同分支代表不同的基因模組，不同顏色代表不同的模組。基於基因的加權相關係數，將基因按照表達模式進行分類，將模式相似的基因歸為一個模組。這樣就可以將幾萬個基因通過基因表達模式被分成了幾十個模組，是一個提取歸納資訊的過程。

WGCNA術語

權重(weghted)

基因之間不僅僅是相關與否，還記錄著它們的相關性數值，數值就是基因之間的聯繫的權重(相關性)。

Module

模組(module)：表達模式相似的基因分為一類，這樣的一類基因成為模組；

Eigengene

Eigengene（eigen +‎ gene）：基因和樣本構成的矩陣，https://en.wiktionary.org/wiki/eigengene

Adjacency Matrix

鄰近矩陣：是圖的一種存儲形式，用一個一維數組存放圖中所有頂點數據；用一個二維數組存放頂點間關係（邊或弧）的數據，這個二維數組稱為鄰接矩陣；在WGCNA分析裡面指的是基因與基因之間的相關性係數矩陣。 如果用了閾值來判斷基因相關與否，那麼這個鄰近矩陣就是0/1矩陣，只記錄基因相關與否。但是WGCNA沒有用閾值來卡基因的相關性，而是記錄了所有基因之間的相關性。

Topological Overlap Matrix (TOM)

WGNA認為基因之間的簡單的相關性不足以計算共表達，所以它利用上面的鄰近矩陣，又計算了一個新的鄰近矩陣。一般來說，TOM就是WGCNA分析的最終結果，後續的只是對TOM的下游注釋。

下游分析

得到模組之後的分析有：

1.模組的功能富集

2.模組與性狀之間的相關性

3.模組與樣本間的相關係數

挖掘模組的關鍵資訊：

1.找到模組的核心基因

2.利用關係預測基因功能

程式碼示例

其中第一步數據準備反而是最複雜的，取決於大家的R語言水平，這個數據GSE48213-wgcna-input.RData我已經保存下來咯，如果大家不會做，又想體驗一下這個WGCNA流程，就可以直接load我保存好的數據文件即可。

step1: 輸入數據的準備

這裡主要是表達矩陣，如果是晶片數據，那麼常規的歸一化矩陣即可，如果是轉錄組數據，最好是RPKM/TPM值或者其它歸一化好的表達量。然後就是臨床資訊或者其它表型，總之就是樣本的屬性。

為了保證後續腳本的統一性，表達矩陣統一用datExpr標識，臨床資訊統一用datTraits標識。(PS: 如果你R語言很差，變數名不要輕易修改)

library(WGCNA)  RNAseq_voom <- fpkm  ## 因為WGCNA針對的是基因進行聚類，而一般我們的聚類是針對樣本用hclust即可，所以這個時候需要轉置。  WGCNA_matrix = t(RNAseq_voom[order(apply(RNAseq_voom,1,mad), decreasing = T)[1:5000],])  datExpr0 <- WGCNA_matrix  ## top 5000 mad genes  datExpr <- datExpr0    ## 下面主要是為了防止臨床表型與樣本名字對不上  sampleNames = rownames(datExpr);  traitRows = match(sampleNames, datTraits$gsm)  rownames(datTraits) = datTraits[traitRows, 1]

上面程式碼裡面的rpkm就是我們的轉錄組數據的表達矩陣，以rpkm為單位。而datTraits就是所有樣本對應的表型資訊。需要自己製作，這個是學習WGCNA的基礎，本次實例程式碼都是基於這兩個數據。至於如何做出上面程式碼的兩個例子，取決於大家自己的項目，我這裡給出自己的程式碼，僅供參考哈！

setwd('WGCNA/')  #     56 breast cancer cell lines were profiled to identify patterns of gene expression associated with subtype and response to therapeutic compounds.  if(F){    ## https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE48213    #wget -c ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE48nnn/GSE48213/suppl/GSE48213_RAW.tar    #tar -xf GSE48213_RAW.tar    #gzip -d *.gz    ## 首先在GSE48213_RAW目錄裡面生成tmp.txt文件，使用shell腳本    # awk '{print FILENAME"t"$0}' GSM*.txt |grep -v EnsEMBL_Gene_ID >tmp.txt    #  其實也可以直接使用R來讀取GSE48213_RAW.tar裡面的gz文件，這裡就不演示了    # 可以參考：https://mp.weixin.qq.com/s/OLc9QmfN0YcT548VAYgOPA 裡面的教程    ## 然後把tmp.txt導入R語言裡面用reshape2處理即可    # 這個 tmp.txt 文件應該是100M左右大小哦。    a=read.table('GSE48213_RAW/tmp.txt',sep = 't',stringsAsFactors = F)    library(reshape2)    fpkm <- dcast(a,formula = V2~V1)    rownames(fpkm)=fpkm[,1]    fpkm=fpkm[,-1]    colnames(fpkm)=sapply(colnames(fpkm),function(x) strsplit(x,"_")[[1]][1])      library(GEOquery)    a=getGEO('GSE48213')    metadata=pData(a[[1]])[,c(2,10,12)]    datTraits = data.frame(gsm=metadata[,1],               cellline=trimws(sapply(as.character(metadata$characteristics_ch1),function(x) strsplit(x,":")[[1]][2])),               subtype=trimws(sapply(as.character(metadata$characteristics_ch1.2),function(x) strsplit(x,":")[[1]][2]))               )  save(fpkm,datTraits,file = 'GSE48213-wgcna-input.RData')  }

很明顯，這個數據GSE48213-wgcna-input.RData我已經保存下來咯，如果大家不會做，又想體驗一下這個WGCNA流程，那麼可以找我email求取這個數據哦。我的郵箱是[email protected]

我給大家演示的示例數據大概是下面這個樣子：

> head(datTraits)  ## 56 個細胞系的分類資訊，表型                    gsm cellline       subtype  GSM1172844 GSM1172844    184A1 Non-malignant  GSM1172845 GSM1172845    184B5 Non-malignant  GSM1172846 GSM1172846    21MT1         Basal  GSM1172847 GSM1172847    21MT2         Basal  GSM1172848 GSM1172848     21NT         Basal  GSM1172849 GSM1172849     21PT         Basal  > fpkm[1:4,1:4]  ## 56個細胞系的36953個基因的表達矩陣                  GSM1172844 GSM1172845 GSM1172846  GSM1172847  ENSG00000000003   95.21255   95.69868   19.99467  65.6863763  ENSG00000000005    0.00000    0.00000    0.00000   0.1492021  ENSG00000000419  453.20831  243.64804  142.05818 200.4131493  ENSG00000000457   18.10439   26.56661   16.12776  12.0873135  >

這個數據集裡面的56種細胞系被分成了5組，如果要分開兩兩做差異分析，有10種組合，也就是說需要做10次差異分析，每個差異分析結果都需要去注釋，會比較麻煩，這個時候WGCNA就派上用場啦。當然，如果你一定要去做差異分析，我也給你程式碼：https://github.com/jmzeng1314/my-R/blob/master/10-RNA-seq-3-groups/hisat2_mm10_htseq.R

實際上多個分組，差異分析策略是非常個性化的，比如：https://mp.weixin.qq.com/s/hc6JkKxyelc7b1M1MRiHRQ

step2:確定最佳beta值

選擇合適「軟閥值（soft thresholding power）」beta，同樣的，也是使用教程標準程式碼即可：

powers = c(c(1:10), seq(from = 12, to=20, by=2))  # Call the network topology analysis function  sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)  #設置網路構建參數選擇範圍，計算無尺度分布拓撲矩陣      # Plot the results:    ##sizeGrWindow(9, 5)    par(mfrow = c(1,2));    cex1 = 0.9;    # Scale-free topology fit index as a function of the soft-thresholding power    plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],         xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",         main = paste("Scale independence"));    text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],         labels=powers,cex=cex1,col="red");    # this line corresponds to using an R^2 cut-off of h    abline(h=0.90,col="red")    # Mean connectivity as a function of the soft-thresholding power    plot(sft$fitIndices[,1], sft$fitIndices[,5],         xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",         main = paste("Mean connectivity"))    text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")

關鍵就是理解pickSoftThreshold函數及其返回的對象，最佳的beta值就是sft$powerEstimate

最佳beta值

參數beta取值默認是1到30，上述圖形的橫軸均代表權重參數β，左圖縱軸代表對應的網路中log(k)與log(p(k))相關係數的平方。相關係數的平方越高，說明該網路越逼近無網路尺度的分布。右圖的縱軸代表對應的基因模組中所有基因鄰接函數的均值。最佳的beta值就是sft$powerEstimate，已經被保存到變數了，不需要知道具體是什麼，後面的程式碼都用這個即可，在本例子裡面是6。

即使你不理解它，也可以使用程式碼拿到合適「軟閥值（soft thresholding power）」beta進行後續分析。

step3：一步法構建共表達矩陣

有了表達矩陣和估計好的最佳beta值，就可以直接構建共表達矩陣了。

net = blockwiseModules(                   datExpr,                   power = sft$powerEstimate,                   maxBlockSize = 6000,                   TOMType = "unsigned", minModuleSize = 30,                   reassignThreshold = 0, mergeCutHeight = 0.25,                   numericLabels = TRUE, pamRespectsDendro = FALSE,                   saveTOMs = F,                   verbose = 3   )   table(net$colors)

所有的核心就在這一步，把輸入的表達矩陣的幾千個基因組歸類成了幾十個模組。大體思路：計算基因間的鄰接性，根據鄰接性計算基因間的相似性，然後推出基因間的相異性係數，並據此得到基因間的系統聚類樹。然後按照混合動態剪切樹的標準，設置每個基因模組最少的基因數目為30。

根據動態剪切法確定基因模組後，再次分析，依次計算每個模組的特徵向量值，然後對模組進行聚類分析，將距離較近的模組合併為新的模組。

step4: 模組可視化

這裡用不同的顏色來代表那些所有的模組，其中灰色默認是無法歸類於任何模組的那些基因，如果灰色模組裡面的基因太多，那麼前期對表達矩陣挑選基因的步驟可能就不太合適。

# Convert labels to colors for plotting  mergedColors = labels2colors(net$colors)  table(mergedColors)  # Plot the dendrogram and the module colors underneath  plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]],                      "Module colors",                      dendroLabels = FALSE, hang = 0.03,                      addGuide = TRUE, guideHang = 0.05)  ## assign all of the gene to their corresponding module  ## hclust for the genes.

基因的模組可視化

這裡的重點就是plotDendroAndColors函數，它接受一個聚類的對象，以及該對象裡面包含的所有個體所對應的顏色。比如對表達矩陣進行hclust之後，加上表達矩陣裡面所有樣本的分組資訊對應的顏色，也是可以用plotDendroAndColors函數可視化的，比如下面的樣品圖：

#明確樣本數和基因數  nGenes = ncol(datExpr)  nSamples = nrow(datExpr)  #首先針對樣本做個系統聚類樹  datExpr_tree<-hclust(dist(datExpr), method = "average")  par(mar = c(0,5,2,0))  plot(datExpr_tree, main = "Sample clustering", sub="", xlab="", cex.lab = 2,       cex.axis = 1, cex.main = 1,cex.lab=1)  ## 如果這個時候樣本是有性狀，或者臨床表型的，可以加進去看看是否聚類合理  #針對前面構造的樣品矩陣添加對應顏色  sample_colors <- numbers2colors(as.numeric(factor(datTraits$Tumor.Type)),                                  colors = c("white","blue","red","green"),signed = FALSE)  ## 這個給樣品添加對應顏色的程式碼需要自行修改以適應自己的數據分析項目。  #  sample_colors <- numbers2colors( datTraits ,signed = FALSE)  ## 如果樣品有多種分類情況，而且 datTraits 裡面都是分類資訊，那麼可以直接用上面程式碼，當然，這樣給的顏色不明顯，意義不大。  #構造10個樣品的系統聚類樹及性狀熱圖  par(mar = c(1,4,3,1),cex=0.8)  plotDendroAndColors(datExpr_tree, sample_colors,                      groupLabels = colnames(sample),                      cex.dendroLabels = 0.8,                      marAll = c(1, 4, 3, 1),                      cex.rowText = 0.01,                      main = "Sample dendrogram and trait heatmap")

上面給樣本進行聚類的程式碼可以不運行，其實跟WGCNA本身關係不大。

樣本的聚類可視化

可以看到這些乳腺癌的細胞系的表達譜聚類情況並不是完全與其分類匹配，所以僅僅是根據樣本的分組資訊做差異分析並不完全準確。

step5:模組和性狀的關係

## step 5 (最重要的) 模組和性狀的關係  ## 這一步主要是針對於連續變數，如果是分類變數，需要轉換成連續變數方可使用  table(datTraits$subtype)  if(T){    nGenes = ncol(datExpr)    nSamples = nrow(datExpr)    design=model.matrix(~0+ datTraits$subtype)    colnames(design)=levels(datTraits$subtype)    moduleColors <- labels2colors(net$colors)    # Recalculate MEs with color labels    MEs0 = moduleEigengenes(datExpr, moduleColors)$eigengenes    MEs = orderMEs(MEs0); ##不同顏色的模組的ME值矩 (樣本vs模組)    moduleTraitCor = cor(MEs, design , use = "p");    moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples)      sizeGrWindow(10,6)    # Will display correlations and their p-values    textMatrix = paste(signif(moduleTraitCor, 2), "n(",                       signif(moduleTraitPvalue, 1), ")", sep = "");    dim(textMatrix) = dim(moduleTraitCor)    png("step5-Module-trait-relationships.png",width = 800,height = 1200,res = 120)    par(mar = c(6, 8.5, 3, 3));    # Display the correlation values within a heatmap plot    labeledHeatmap(Matrix = moduleTraitCor,                   xLabels = colnames(design),                   yLabels = names(MEs),                   ySymbols = names(MEs),                   colorLabels = FALSE,                   colors = greenWhiteRed(50),                   textMatrix = textMatrix,                   setStdMargins = FALSE,                   cex.text = 0.5,                   zlim = c(-1,1),                   main = paste("Module-trait relationships"))    dev.off()      # 除了上面的熱圖展現形狀與基因模組的相關性外    # 還可以是條形圖,但是只能是指定某個形狀    # 或者自己循環一下批量出圖。    Luminal = as.data.frame(design[,3]);    names(Luminal) = "Luminal"    y=Luminal    GS1=as.numeric(cor(y,datExpr, use="p"))    GeneSignificance=abs(GS1)    # Next module significance is defined as average gene significance.    ModuleSignificance=tapply(GeneSignificance,                              moduleColors, mean, na.rm=T)    sizeGrWindow(8,7)    par(mfrow = c(1,1))    # 如果模組太多，下面的展示就不友好    # 不過，我們可以自定義出圖。    plotModuleSignificance(GeneSignificance,moduleColors)    }

通過模組與各種表型的相關係數，可以很清楚的挑選自己感興趣的模組進行下游分析了。這個圖就是把moduleTraitCor這個矩陣給用熱圖可視化一下。

模組和性狀的關係

因為一些歷史遺留問題，這個圖片缺乏X軸的標記。

從上圖已經可以看到跟乳腺癌分類相關的基因模組了，包括"Basal" "Claudin-low" "Luminal" "Non-malignant" "unknown" 這5類所對應的不同模組的基因列表。可以看到每一種乳腺癌都有跟它強烈相關的模組，可以作為它的表達signature，模組裡面的基因可以拿去做下游分析。我們看到Luminal表型跟棕色的模組相關性高達0.86，而且極其顯著的相關，所以值得我們挖掘，這個模組裡面的基因是什麼，為什麼如此的相關呢？

step6:感興趣性狀的模組的具體基因分析

性狀跟模組雖然求出了相關性，可以挑選最相關的那些模組來分析，但是模組本身仍然包含非常多的基因，還需進一步的尋找最重要的基因。所有的模組都可以跟基因算出相關係數，所有的連續型性狀也可以跟基因的表達值算出相關係數。主要參考資料：PDF document, R script 如果跟性狀顯著相關基因也跟某個模組顯著相關，那麼這些基因可能就非常重要。

首先計算模組與基因的相關性矩陣

# names (colors) of the modules  modNames = substring(names(MEs), 3)  geneModuleMembership = as.data.frame(cor(datExpr, MEs, use = "p"));  ## 算出每個模組跟基因的皮爾森相關係數矩陣  ## MEs是每個模組在每個樣本裡面的值  ## datExpr是每個基因在每個樣本的表達量  MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples));  names(geneModuleMembership) = paste("MM", modNames, sep="");  names(MMPvalue) = paste("p.MM", modNames, sep="");

再計算性狀與基因的相關性矩陣

  ## 只有連續型性狀才能只有計算    ## 這裡把是否屬於 Luminal 表型這個變數用0,1進行數值化。    Luminal = as.data.frame(design[,3]);    names(Luminal) = "Luminal"    geneTraitSignificance = as.data.frame(cor(datExpr, Luminal, use = "p"));    GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples));    names(geneTraitSignificance) = paste("GS.", names(Luminal), sep="");    names(GSPvalue) = paste("p.GS.", names(Luminal), sep="");

最後把兩個相關性矩陣聯合起來,指定感興趣模組進行分析

 module = "brown"    column = match(module, modNames);    moduleGenes = moduleColors==module;    sizeGrWindow(7, 7);    par(mfrow = c(1,1));    verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),                       abs(geneTraitSignificance[moduleGenes, 1]),                       xlab = paste("Module Membership in", module, "module"),                       ylab = "Gene significance for Luminal",                       main = paste("Module membership vs. gene significancen"),                       cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)

模組和性狀裡面的指定基因的相關性比較

可以看到這些基因不僅僅是跟其對應的模組高度相關，而且是跟其對應的性狀高度相關，進一步說明了基因值得深度探究。

step7:網路的可視化

主要參考資料：PDF document, R script

首先針對所有基因畫熱圖

# 主要是可視化 TOM矩陣，WGCNA的標準配圖  # 然後可視化不同 模組 的相關性 熱圖  # 不同模組的層次聚類圖  # 還有模組診斷，主要是 intramodular connectivity  if(T){    nGenes = ncol(datExpr)    nSamples = nrow(datExpr)    geneTree = net$dendrograms[[1]];    dissTOM = 1-TOMsimilarityFromExpr(datExpr, power = 6);    plotTOM = dissTOM^7;    diag(plotTOM) = NA;    #TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes")    nSelect = 400    # For reproducibility, we set the random seed    set.seed(10);    select = sample(nGenes, size = nSelect);    selectTOM = dissTOM[select, select];    # There』s no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.    selectTree = hclust(as.dist(selectTOM), method = "average")    selectColors = moduleColors[select];    # Open a graphical window    sizeGrWindow(9,9)    # Taking the dissimilarity to a power, say 10, makes the plot more informative by effectively changing    # the color palette; setting the diagonal to NA also improves the clarity of the plot    plotDiss = selectTOM^7;    diag(plotDiss) = NA;      png("step7-Network-heatmap.png",width = 800,height = 600)    TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes")    dev.off()      # Recalculate module eigengenes    MEs = moduleEigengenes(datExpr, moduleColors)$eigengenes    ## 只有連續型性狀才能只有計算    ## 這裡把是否屬 Luminal 表型這個變數0,1進行數值化    Luminal = as.data.frame(design[,3]);    names(Luminal) = "Luminal"    # Add the weight to existing module eigengenes    MET = orderMEs(cbind(MEs, Luminal))    # Plot the relationships among the eigengenes and the trait    sizeGrWindow(5,7.5);      par(cex = 0.9)    png("step7-Eigengene-dendrogram.png",width = 800,height = 600)    plotEigengeneNetworks(MET, "", marDendro = c(0,4,1,2), marHeatmap = c(3,4,1,2), cex.lab = 0.8, xLabelsAngle                          = 90)    dev.off()      # Plot the dendrogram    sizeGrWindow(6,6);    par(cex = 1.0)    ## 模組的進化樹    png("step7-Eigengene-dendrogram-hclust.png",width = 800,height = 600)    plotEigengeneNetworks(MET, "Eigengene dendrogram", marDendro = c(0,4,2,0),                          plotHeatmaps = FALSE)    dev.off()    # Plot the heatmap matrix (note: this plot will overwrite the dendrogram plot)    par(cex = 1.0)    ## 性狀與模組熱      png("step7-Eigengene-adjacency-heatmap.png",width = 800,height = 600)    plotEigengeneNetworks(MET, "Eigengene adjacency heatmap", marHeatmap = c(3,4,2,2),                          plotDendrograms = FALSE, xLabelsAngle = 90)    dev.off()    }

這個非常消耗計算資源和時間，所以建議選取其中部分基因作圖即可，我就沒有畫，而且根據下面的程式碼選取部分基因來作圖！

然後隨機選取部分基因作圖

nSelect = 400  # For reproducibility, we set the random seed  set.seed(10);  select = sample(nGenes, size = nSelect);  selectTOM = dissTOM[select, select];  # There』s no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.  selectTree = hclust(as.dist(selectTOM), method = "average")  selectColors = moduleColors[select];  # Open a graphical window  sizeGrWindow(9,9)  # Taking the dissimilarity to a power, say 10, makes the plot more informative by effectively changing  # the color palette; setting the diagonal to NA also improves the clarity of the plot  plotDiss = selectTOM^7;  diag(plotDiss) = NA;  TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes")

模組熱圖

這個圖湊數的意義居多，如果是把全部基因畫上去，可以很清楚的看到各個區塊顏色差異。

最後畫模組和性狀的關係

 # Recalculate module eigengenes    MEs = moduleEigengenes(datExpr, moduleColors)$eigengenes    ## 只有連續型性狀才能只有計算    ## 這裡把是否屬於 Luminal 表型這個變數用0,1進行數值化。    Luminal = as.data.frame(design[,3]);    names(Luminal) = "Luminal"    # Add the weight to existing module eigengenes    MET = orderMEs(cbind(MEs, Luminal))    # Plot the relationships among the eigengenes and the trait    sizeGrWindow(5,7.5);    par(cex = 0.9)    plotEigengeneNetworks(MET, "", marDendro = c(0,4,1,2), marHeatmap = c(3,4,1,2), cex.lab = 0.8, xLabelsAngle                          = 90)    # Plot the dendrogram    sizeGrWindow(6,6);    par(cex = 1.0)    ## 模組的聚類圖    plotEigengeneNetworks(MET, "Eigengene dendrogram", marDendro = c(0,4,2,0),                          plotHeatmaps = FALSE)    # Plot the heatmap matrix (note: this plot will overwrite the dendrogram plot)    par(cex = 1.0)    ## 性狀與模組熱圖    plotEigengeneNetworks(MET, "Eigengene adjacency heatmap", marHeatmap = c(3,4,2,2),                          plotDendrograms = FALSE, xLabelsAngle = 90)

性狀與模組熱圖

step8:提取指定模組的基因名

## step 8  # 主要是關心具體某個模組內部的基因  if(T){    # Select module    module = "brown";    # Select module probes    probes = colnames(datExpr) ## 我們例子裡面的probe就是基因    inModule = (moduleColors==module);    modProbes = probes[inModule];    head(modProbes)      # 如果使用WGCNA包自帶的熱圖就很醜。    which.module="brown";    dat=datExpr[,moduleColors==which.module ]    plotMat(t(scale(dat)),nrgcols=30,rlabels=T,            clabels=T,rcols=which.module,            title=which.module )    datExpr[1:4,1:4]    dat=t(datExpr[,moduleColors==which.module ] )    library(pheatmap)    pheatmap(dat ,show_colnames =F,show_rownames = F) #對那些提取出來的1000個基因所在的每一行取出，組合起來為一個新的表達矩陣    n=t(scale(t(log(dat+1)))) # 'scale'可以對log-ratio數值進行歸一化    n[n>2]=2    n[n< -2]= -2    n[1:4,1:4]    pheatmap(n,show_colnames =F,show_rownames = F)    group_list=datTraits$subtype    ac=data.frame(g=group_list)    rownames(ac)=colnames(n)    pheatmap(n,show_colnames =F,show_rownames = F,             annotation_col=ac )    # 可以很清晰的看到，所有的形狀相關的模組基因    # 其實未必就不是差異表達基因。  }

有了基因資訊，下游分析就很簡單了。包括GO/KEGG等功能資料庫的注釋

Step9: 模組的導出

主要模組裡面的基因直接的相互作用關係資訊可以導出到cytoscape,VisANT等網路可視化軟體。

# Recalculate topological overlap  TOM = TOMsimilarityFromExpr(datExpr, power = 6);  # Select module  module = "brown";  # Select module probes  probes = colnames(datExpr) ## 我們例子裡面的probe就是基因名  inModule = (moduleColors==module);  modProbes = probes[inModule];  ## 也是提取指定模組的基因名  # Select the corresponding Topological Overlap  modTOM = TOM[inModule, inModule];  dimnames(modTOM) = list(modProbes, modProbes)  ## 模組對應的基因關係矩陣

首先是導出到VisANT

vis = exportNetworkToVisANT(modTOM,  file = paste("VisANTInput-", module, ".txt", sep=""),  weighted = TRUE,  threshold = 0)

然後是導出到cytoscape

  cyt = exportNetworkToCytoscape(         modTOM,        edgeFile = paste("CytoscapeInput-edges-", paste(module, collapse="-"), ".txt", sep=""),        nodeFile = paste("CytoscapeInput-nodes-", paste(module, collapse="-"), ".txt", sep=""),        weighted = TRUE,        threshold = 0.02,        nodeNames = modProbes,        nodeAttr = moduleColors[inModule]                                  );

如果模組包含的基因太多，網路太複雜，還可以進行篩選，比如：

nTop = 30;  IMConn = softConnectivity(datExpr[, modProbes]);  top = (rank(-IMConn) <= nTop)  filter <- modTOM[top, top]

後面就是cytoscape自身的教程了，這裡不再贅述，我部落格有比較詳盡的介紹。