單細胞分析實錄(6): 去除批次效應/整合數據

2021 年 1 月 6 日
筆記
生物信息

上一篇已經講解了Seurat標準流程，推文的最後，注意到了不同樣本之間的表達數據是存在批次效應的，就像下圖這樣，有些是可以聚到一起的亞群，卻出現了不同樣本分開/偏移的情況，比如第3群，這種就是批次效應：

接下來我會介紹Seurat v3的標準整合流程、Seurat結合Harmony 的整合流程，仍然使用上一個數據集

1. Seurat v3的標準整合流程

對於不同樣本先分別運行Seurat的標準流程到找Variable基因這一步

library(Seurat)
library(tidyverse)
### testA ----
testA.seu=CreateSeuratObject(counts = testA)
testA.seu <- NormalizeData(testA.seu, normalization.method = "LogNormalize", scale.factor = 10000)
testA.seu <- FindVariableFeatures(testA.seu, selection.method = "vst", nfeatures = 2000)
### testB ----
testB.seu=CreateSeuratObject(counts = testB)
testB.seu <- NormalizeData(testB.seu, normalization.method = "LogNormalize", scale.factor = 10000)
testB.seu <- FindVariableFeatures(testB.seu, selection.method = "vst", nfeatures = 2000)

然後就是主要的整合步驟，object.list參數是由多個Seurat對象構成的列表，如下：

### Integration ----
testAB.anchors <- FindIntegrationAnchors(object.list = list(testA.seu,testB.seu), dims = 1:20)
testAB.integrated <- IntegrateData(anchorset = testAB.anchors, dims = 1:20)

需要注意的是：上面的整合步驟相對於harmony整合方法，對於較大的數據集（幾萬個細胞），非常消耗內存和時間；當存在某一個Seurat對象細胞數很少（印象中200以下這樣子），會報錯，這時建議用第二種整合方法

這一步之後就多了一個整合後的assay（原先有一個RNA的assay），整合前後的數據分別存儲在這兩個assay中

> testAB.integrated
An object of class Seurat 
35538 features across 6746 samples within 2 assays 
Active assay: integrated (2000 features)
 1 other assay present: RNA
 
 > dim(testAB.integrated[["RNA"]]@counts)
[1] 33538  6746
> dim(testAB.integrated[["RNA"]]@data)
[1] 33538  6746

> dim(testAB.integrated[["integrated"]]@counts) #因為是從RNA這個assay的data矩陣開始整合的，所以這個矩陣為空
[1] 0 0
> dim(testAB.integrated[["integrated"]]@data)
[1] 2000 6746

後續仍然是標準流程，基於上面得到的整合data矩陣

DefaultAssay(testAB.integrated) <- "integrated"
# Run the standard workflow for visualization and clustering
testAB.integrated <- ScaleData(testAB.integrated, features = rownames(testAB.integrated))
testAB.integrated <- RunPCA(testAB.integrated, npcs = 50, verbose = FALSE)
testAB.integrated <- FindNeighbors(testAB.integrated, dims = 1:30)
testAB.integrated <- FindClusters(testAB.integrated, resolution = 0.5)
testAB.integrated <- RunUMAP(testAB.integrated, dims = 1:30)
testAB.integrated <- RunTSNE(testAB.integrated, dims = 1:30)

看一下去除批次效應之後的結果

library(cowplot)
testAB.integrated$patient=str_replace(testAB.integrated$orig.ident,"_.*$","")
p1 <- DimPlot(testAB.integrated, reduction = "tsne", group.by = "patient", pt.size=0.5)+theme(
  axis.line = element_blank(),
  axis.ticks = element_blank(),axis.text = element_blank()
)
p2 <- DimPlot(testAB.integrated, reduction = "tsne", group.by = "ident",   pt.size=0.5, label = TRUE,repel = TRUE)+theme(
  axis.line = element_blank(),
  axis.ticks = element_blank(),axis.text = element_blank()
)
fig_tsne <- plot_grid(p1, p2, labels = c('patient','ident'),align = "v",ncol = 2)
ggsave(filename = "tsne2.pdf", plot = fig_tsne, device = 'pdf', width = 27, height = 12, units = 'cm')

可以看到，不同樣本的細胞基本都分散均勻了，結果還不錯

2. Seurat結合Harmony的整合流程

這種整合方法很簡單，而且占內存少，速度快

library(harmony)
testdf=cbind(testA,testB)
test.seu <- CreateSeuratObject(counts = testdf) %>%
  Seurat::NormalizeData() %>%
  FindVariableFeatures(selection.method = "vst", nfeatures = 2000) %>% 
  ScaleData()
test.seu <- RunPCA(test.seu, npcs = 50, verbose = FALSE)
[email protected]$patient=str_replace(test.seu$orig.ident,"_.*$","")

先運行Seurat標準流程到PCA這一步，然後就是Harmony整合，可以簡單把這一步理解為一種新的降維

test.seu=test.seu %>% RunHarmony("patient", plot_convergence = TRUE)

> test.seu
An object of class Seurat 
33538 features across 6746 samples within 1 assay 
Active assay: RNA (33538 features)
 2 dimensional reductions calculated: pca, harmony

接着就是常規聚類降維，都是基於Harmony的Embeddings矩陣

test.seu <- test.seu %>% 
  RunUMAP(reduction = "harmony", dims = 1:30) %>% 
  FindNeighbors(reduction = "harmony", dims = 1:30) %>% 
  FindClusters(resolution = 0.5) %>% 
  identity()

test.seu <- test.seu %>% 
  RunTSNE(reduction = "harmony", dims = 1:30)

看看效果

p3 <- DimPlot(test.seu, reduction = "tsne", group.by = "patient", pt.size=0.5)+theme(
  axis.line = element_blank(),
  axis.ticks = element_blank(),axis.text = element_blank()
)
p4 <- DimPlot(test.seu, reduction = "tsne", group.by = "ident",   pt.size=0.5, label = TRUE,repel = TRUE)+theme(
  axis.line = element_blank(),
  axis.ticks = element_blank(),axis.text = element_blank()
)
fig_tsne <- plot_grid(p3, p4, labels = c('patient','ident'),align = "v",ncol = 2)
ggsave(filename = "tsne3.pdf", plot = fig_tsne, device = 'pdf', width = 27, height = 12, units = 'cm')

看起來也挺好的~

對於異質性很大的不同數據集，比如不同病人的腫瘤細胞，考慮到生物學差異已經遠遠大於批次效應，這時不應該再進行批次效應的去除，不然內在的生物學差異也會抹掉。儘管也在文獻裏面看到過對腫瘤細胞去批次的，但我始終覺得這樣做是錯的。不知道大家是怎麼看的？

最後還有一個有用的小技巧，在降維之後，我們可以用DimPlot()函數把結果畫在二維平面上，那這些坐標信息存儲在哪兒呢？存儲在Embeddings(test.seu,”pca”)矩陣中，harmony/tsne/umap同理。有時候我們可以把坐標和meta@data的信息提取出來，用ggplot2畫圖，方便修改。

> Embeddings(test.seu,"tsne")[1:4,1:2]
                       tSNE_1    tSNE_2
A_AAACCCAAGGGTCACA  -6.774032 -25.08772
A_AAACCCAAGTATAACG  -5.294804  39.77550
A_AAACCCAGTCTCTCAC  30.600943  31.28015
A_AAACCCAGTGAGTCAG -32.432794 -14.62455

因水平有限，有錯誤的地方，歡迎批評指正！

Tags: 生物信息