跟着大神学单细胞数据分析
- 2020 年 3 月 30 日
- 笔记
前言
这是 Tang Ming 大神分享的单细胞分析的seurat流程。今天我们来理一下大致的分析思路,当然里面好多细节的部分还需要自己下功夫慢慢研究。
原文链接如下: https://crazyhottommy.github.io/scRNA-seq-workshop-Fall-2019/scRNAseq_workshop_1.html
下载数据
我们将下载来自10x Genomics的公共 5k pbmc (外周血单核细胞)数据集。然后用R分析
wget http://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_v3/5k_pbmc_v3_filtered_feature_bc_matrix.tar.gz tar xvzf 5k_pbmc_v3_filtered_feature_bc_matrix.tar.gz
安装所需的R包
install.packages("tidyverse") install.packages("rmarkdown") install.packages('Seurat')
如果你已经安装过这写R包,你可以忽略这一步。如果还没有安装或者安装R包有问题,可以参考下面的教程:
rstudio软件无需联网但是 BiocManger无法安装R包 批量安装R包小技巧大放送
读入数据
# 读取PBMC数据集 pbmc.data <- Read10X(data.dir = "filtered_feature_bc_matrix/") # 使用原始数据(未归一化处理)初始化Seurat对象 pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc5k", min.cells = 3, min.features = 200) pbmc
An object of class Seurat 18791 features across 4962 samples within 1 assay Active assay: RNA (18791 features)
如果你想了解更多Seurat对象的详细信息,你可以参考这个网站:https://github.com/satijalab/seurat/wiki
注:读入数据这一步使用的Seurat包应该是 Seurat V3版本。因为我用Seurat V2创建的对象和文中所给的结果不一致
## 使用Srurat V2 创建对象 pbmc <- CreateSeuratObject(raw.data = pbmc.data, project = "pbmc5k", min.cells = 3, min.features = 200) pbmc An object of class seurat in project pbmc5k 18791 genes across 5025 samples.
质量控制
## check at metadata head([email protected]) # The [[ operator can add columns to object metadata. This is a great place to stash QC stats pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") [email protected] %>% head() ##将质量控制指标可视化为小提琴图 VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3) #我们根据上面的可视化设置了截止值。这个截止值是相当主观的。 pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 5000 & percent.mt < 25)

Normalization
通常情况下,我们采用全局缩放的归一化方法"LogNormalize"
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
不过,现在Seurat也有一个新的标准化的方法,称为SCTransform . 详细了解可以查看:https://satijalab.org/seurat/v3.0/sctransform_vignette.html
特征选择
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000) # Identify the 10 most highly variable genes top10 <- head(VariableFeatures(pbmc), 10) # plot variable features with and without labels plot1 <- VariableFeaturePlot(pbmc) plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE) CombinePlots(plots = list(plot1, plot2), ncol =1)

Scaling the data
ScaleData函数:
- Shifts the expression of each gene, so that the mean expression across cells is 0
- Scales the expression of each gene, so that the variance across cells is 1
我们一般将平均值为0,方差值为1的数据认为是标准数据
all.genes <- rownames(pbmc) pbmc <- ScaleData(pbmc, features = all.genes)
如果数据量很大,这一步可能需要较长时间
在scale前后检查数据
## 检查前后数据的区别 #### raw counts, same as pbmc@assays$RNA@counts[1:6, 1:6] pbmc[["RNA"]]@counts[1:6, 1:6] ### library size normalized and log transformed data pbmc[["RNA"]]@data[1:6, 1:6] ### scaled data pbmc[["RNA"]]@scale.data[1:6, 1:6]

scale是Seurat工作流程中必不可少的一步。但结果仅限于用作PCA分析的输入。
ScaleData中默认设置是仅对先前标识的变量特征执行降维(默认为2000).因此,在上一个函数调用中应省略features参数。
pbmc <- ScaleData(pbmc, vars.to.regress = "percent.mt")
PCA
主成分分析(PCA)是一种线性降维技术
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc), verbose = FALSE) p1<- DimPlot(pbmc, reduction = "pca") p1

如果想了解更多PCA相关的,可以在YouTube观看StatQuest的: https://www.youtube.com/watch?v=HMOI_lkzW08
或者看下面的教程: 聚类分析和主成分分析
或者原作者的博客:
https://divingintogeneticsandgenomics.rbind.io/post/pca-in-action/ https://divingintogeneticsandgenomics.rbind.io/post/permute-test-for-pca-components/
当然你也可以用ggplot2画出各种好看的PCA图,网上搜索的话,画图代码有很多。这里不再论述。
确定PCs数
为了克服scRNA序列数据单一特征中的广泛技术噪音,Seurat根据其PCA分数对细胞进行聚类,每个PC基本上表示一个“元特征”,该特征结合了相关特征集上的信息。因此,最主要的主成分代表了数据集的强大压缩。但是,我们应该选择包括多少个PC?10个?20?还是100?
可以用如下方法来大致判定:
pbmc <- JackStraw(pbmc, num.replicate = 100, dims = 50) pbmc <- ScoreJackStraw(pbmc, dims = 1:50) JackStrawPlot(pbmc, dims = 1:30)

ElbowPlot(pbmc, ndims = 50)

variance explained by each PC
mat <- pbmc[["RNA"]]@scale.data pca <- pbmc[["pca"]] # Get the total variance: total_variance <- sum(matrixStats::rowVars(mat)) eigValues = (pca@stdev)^2 ## EigenValues varExplained = eigValues / total_variance varExplained %>% enframe(name = "PC", value = "varExplained" ) %>% ggplot(aes(x = PC, y = varExplained)) + geom_bar(stat = "identity") + theme_classic() + ggtitle("scree plot")

### this is what Seurat is plotting: standard deviation pca@stdev %>% enframe(name = "PC", value = "Standard Deviation" ) %>% ggplot(aes(x = PC, y = `Standard Deviation`)) + geom_point() + theme_classic()

细胞分群
pbmc <- FindNeighbors(pbmc, dims = 1:20) pbmc <- FindClusters(pbmc, resolution = 0.5) # Look at cluster IDs of the first 5 cells head(Idents(pbmc), 5)

运行非线性降维(UMAP/tSNE)
pbmc <- RunUMAP(pbmc, dims = 1:20) pbmc<- RunTSNE(pbmc, dims = 1:20) ## after we run UMAP and TSNE, there are more entries in the reduction slot str(pbmc@reductions) DimPlot(pbmc, reduction = "umap", label = TRUE)

## now let's visualize in the TSNE space DimPlot(pbmc, reduction = "tsne")

tSNE相关视频: https://www.youtube.com/watch?v=NEaUSP4YerM
## now let's label the clusters in the PCA space DimPlot(pbmc, reduction = "pca")

查找差异表达特征(集群生物标记)
# find all markers of cluster 1 cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25) head(cluster1.markers, n = 5) # find all markers distinguishing cluster 5 from clusters 0 and 3 cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25) head(cluster5.markers, n = 5) # find markers for every cluster compared to all remaining cells, report only the positive ones pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25) pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
这一步很费时间,如果你觉得慢,Seurat V3.0.2 为FindALLMarkers在内的一些步骤提供了并行支持。 更多了解:https://satijalab.org/seurat/v3.0/future_vignette.html
# we only have 2 CPUs reserved for each one. plan("multiprocess", workers = 2) pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
可视化marker基因
VlnPlot
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))

## understanding the matrix of data slots pbmc[["RNA"]]@data[c("MS4A1", "CD79A"), 1:30] pbmc[["RNA"]]@scale.data[c("MS4A1", "CD79A"), 1:30] pbmc[["RNA"]]@counts[c("MS4A1", "CD79A"), 1:30] # you can plot raw counts as well VlnPlot(pbmc, features = c("MS4A1", "CD79A"), slot = "counts", log = TRUE)

VlnPlot(pbmc, features = c("MS4A1", "CD79A"), slot = "scale.data")

FeaturePlot plot the expression intensity overlaid on the Tsne/UMAP plot.
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))

p<- FeaturePlot(pbmc, features = "CD14") ## before reordering p

p_after<- p ### after reordering p_after$data <- p_after$data[order(p_after$data$CD14),] CombinePlots(plots = list(p, p_after))

DoHeatmap
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC) DoHeatmap(pbmc, features = top10$gene) + NoLegend()
