RNA-seq data is more than just expression levels

  • November 20, 2019
  • Notes

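RNA-seq is without doubt the most frequently used data type in the NGS field today, yet most researchers still understand it only in terms of expression levels, and gene-level expression at that: split the samples into groups, run a differential expression test, draw a volcano plot and a heatmap of the differentially expressed genes, and list the up- and down-regulated pathways. I have recorded all of the learning material as videos and shared it for free on Bilibili.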

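Leaving aside whether the standard analysis people run on RNA-seq data is even correct, such a bare-bones analysis wastes most of what the data has to offer!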

First, you can analyze differential transcripts and alternative splicing

A paper published in May 2019 in Molecular Neurodegeneration, "TREM2 brain transcript-specific studies in AD and TREM2 mutation carriers", mines ordinary RNA-seq data in light of the authors' own biological question. For background, look up the gene Triggering Receptor Expressed in Myeloid cells 2 (TREM2) and its three transcripts.

All participants are European-Americans. The sequencing data cover:

  • AD cases with TREM2 variants (n = 33)
  • AD cases (n = 195)
  • healthy controls (n = 118)

The samples come from three different institutions:

  • Washington University in St. Louis Knight-ADRC Brain Bank (51 participants)
  • MSBB-BM36 (132 participants)
  • MCBB (162 participants)

The average sequencing yield per sample is 134.9 million reads, using a 2 × 101 bp strategy.

Data from two of the institutions already existed and were downloaded as follows:

  • Mayo Clinic Brain Bank RNA-seq data was downloaded from the AMP-AD portal (synapse ID = syn5550404; accessed January 2017).
  • Mount Sinai Brain Bank RNA-seq data was downloaded from the AMP-AD portal (synapse ID = https://www.synapse.org/#!Synapse:syn3157743; accessed January 2017).

The transcriptome analysis pipeline, mainly the choice of software and the reference genome version:

  • FastQC [49] was used to assess sequencing quality.
  • The RNA-seq was aligned to the human GRCh37 primary assembly using STAR (ver 2.5.2b).
  • Read alignments were further evaluated by using PICARD CollectRnaSeqMetrics (ver 2.8.2) to examine read distribution across the genome.
  • We employed Kallisto (v0.42.5) and tximport to determine the read count for each transcript and quantified transcript abundance as transcripts per kilobase per million reads mapped (TPM), using gene annotation of Homo sapiens reference genome (GENCODE GRCh37) for each participant from Knight-ADRC, MCB and MSBB-BM36 independently, with the following parameters: -t 10 -b 100.
  • Then we summed the read counts and TPM of all alternative splicing transcripts of a gene to obtain gene expression levels. Due to the positive skewness of TPM values, we calculate their logarithm10 (log10TPM) for further analysis.
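The last step of the pipeline above (summing transcript TPM per gene, then taking log10) can be sketched roughly as follows. This is only an illustration, not the authors' code: the function names, the TPM values, and the placeholder transcript `TX_HYPOTHETICAL` / gene `GENE_X` are made up, and the `pseudocount` parameter is an assumption, since the quoted methods do not say how zero TPM values were handled before the log transform.

```python
import math
from collections import defaultdict


def gene_level_tpm(transcript_tpm, transcript_to_gene):
    """Sum the TPM of every transcript that belongs to the same gene."""
    totals = defaultdict(float)
    for transcript_id, tpm in transcript_tpm.items():
        totals[transcript_to_gene[transcript_id]] += tpm
    return dict(totals)


def log10_tpm(gene_tpm, pseudocount=0.0):
    """Return log10(TPM + pseudocount); genes whose value is 0 map to None."""
    out = {}
    for gene, tpm in gene_tpm.items():
        value = tpm + pseudocount
        # log10 is undefined at 0, so such genes are reported as None.
        out[gene] = math.log10(value) if value > 0 else None
    return out


# Toy TPM values (made up) for the three TREM2 transcripts named in the paper,
# plus one hypothetical unexpressed transcript.
toy_tpm = {
    "ENST00000373113": 6.0,
    "ENST00000373122": 3.0,
    "ENST00000338469": 1.0,
    "TX_HYPOTHETICAL": 0.0,
}
tx2gene = {
    "ENST00000373113": "TREM2",
    "ENST00000373122": "TREM2",
    "ENST00000338469": "TREM2",
    "TX_HYPOTHETICAL": "GENE_X",
}

gene_tpm = gene_level_tpm(toy_tpm, tx2gene)
gene_log10 = log10_tpm(gene_tpm)
print(gene_tpm)
print(gene_log10)
```

In a real analysis the transcript-to-gene map would come from the GENCODE annotation, and tximport performs the gene-level summarization in R.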

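For differential transcript analysis, we shared a salmon + DRIMSeq workflow in a post a few days ago; see: "Monthly bioinformatics pipeline: rnaseqDTU (differential transcript usage)".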
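To make the idea of differential transcript usage concrete, here is a minimal sketch of the quantity such workflows look at: the proportion of a gene's reads assigned to each transcript, compared between groups. It is not the DRIMSeq method itself (DRIMSeq fits a Dirichlet-multinomial model and performs a statistical test); the function names and the counts below are hypothetical.

```python
def isoform_proportions(counts):
    """Convert per-transcript counts for one gene in one sample to proportions."""
    total = sum(counts.values())
    if total == 0:
        return {tx: 0.0 for tx in counts}
    return {tx: c / total for tx, c in counts.items()}


def mean_proportions(samples):
    """Average the isoform proportions over the samples of one group."""
    props = [isoform_proportions(s) for s in samples]
    return {tx: sum(p[tx] for p in props) / len(props) for tx in props[0]}


def usage_shift(group_a, group_b):
    """Mean proportion in group_b minus mean proportion in group_a, per transcript."""
    mean_a = mean_proportions(group_a)
    mean_b = mean_proportions(group_b)
    return {tx: mean_b[tx] - mean_a[tx] for tx in mean_a}


# Hypothetical counts for a gene with two transcripts, two samples per group.
controls = [{"tx1": 80, "tx2": 20}, {"tx1": 60, "tx2": 40}]
cases = [{"tx1": 40, "tx2": 60}, {"tx1": 50, "tx2": 50}]

shift = usage_shift(controls, cases)
print(shift)
```

A gene can show a clear usage shift like this even when its total expression is unchanged, which is exactly what a gene-level analysis misses.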

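The paper's introduction describes the TREM2 gene and its three transcripts at length. The authors then look at the expression of these three transcripts in each of the three cohorts.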

We were able to detect and quantify the levels of three TREM2 transcripts ENST00000373113, ENST00000373122 and ENST00000338469 using RNA-seq data from AD and control brains from three different, independent studies.
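As a rough sketch of how these three transcripts could be pulled out of a quantification result, the snippet below parses text in the layout of a Kallisto `abundance.tsv` file (columns `target_id`, `length`, `eff_length`, `est_counts`, `tpm`). The function name, the numeric values, the version suffixes, and the extra `ENST00000000001` row are all made up for illustration; the assumption that target IDs carry a version suffix or `|`-separated fields depends on how the index was built.

```python
import csv
import io

TREM2_TRANSCRIPTS = {"ENST00000373113", "ENST00000373122", "ENST00000338469"}


def select_transcripts(abundance_tsv_text, wanted_ids):
    """Return {transcript_id: tpm} for the wanted transcripts only."""
    reader = csv.DictReader(io.StringIO(abundance_tsv_text), delimiter="\t")
    selected = {}
    for row in reader:
        # Strip "|"-separated annotation fields and the ".N" version suffix, if present.
        base_id = row["target_id"].split("|")[0].split(".")[0]
        if base_id in wanted_ids:
            selected[base_id] = float(row["tpm"])
    return selected


# Made-up example rows; the numbers are not from the paper.
example_tsv = (
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "ENST00000373113.1\t1000\t850.0\t120.0\t6.5\n"
    "ENST00000373122.1\t900\t750.0\t40.0\t2.5\n"
    "ENST00000338469.1\t800\t650.0\t10.0\t0.7\n"
    "ENST00000000001.1\t1500\t1350.0\t300.0\t10.2\n"
)

trem2_tpm = select_transcripts(example_tsv, TREM2_TRANSCRIPTS)
print(trem2_tpm)
```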

image-20191118111906902

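Still, this analysis is one-sided, because the authors only looked at the gene relevant to their own biological background. A summary table of the global comparison, like the one below, is really indispensable.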

image-20191118222328335

Next, you can analyze fusion genes

Another example is the Blood Adv paper [Transcriptome analysis offers a comprehensive illustration of the genetic background of pediatric acute myeloid leukemia](https://ashpublications.org/bloodadvances/article-lookup/doi/10.1182/bloodadvances.2019000404), in which a Japanese research team performed RNA-seq in 139 of the 369 patients with de novo pediatric AML. The focus of the paper is gene fusion events: 54 in-frame gene fusions and 1 RUNX1 out-of-frame fusion in 53 of 139 patients.
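The in-frame versus out-of-frame distinction can be illustrated with a deliberately simplified check. The sketch assumes both breakpoints fall inside coding sequence and ignores everything else a real fusion caller considers (UTR breakpoints, stop codons created at the junction, alternative transcripts); the function and parameter names are hypothetical and this is not the method used in the paper.

```python
def is_in_frame(coding_bases_from_5prime, coding_bases_skipped_in_3prime):
    """Simplified reading-frame check for a fusion transcript.

    coding_bases_from_5prime: coding bases the 5' partner contributes
        before the junction.
    coding_bases_skipped_in_3prime: coding bases of the 3' partner that lie
        upstream of the junction and are therefore lost.

    The 3' partner keeps its original reading frame when both numbers leave
    the same remainder modulo 3.
    """
    return coding_bases_from_5prime % 3 == coding_bases_skipped_in_3prime % 3


# Hypothetical breakpoints.
print(is_in_frame(300, 150))  # both are multiples of 3, so the frame is kept
print(is_in_frame(301, 150))  # one extra base shifts the frame
```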

In the full patient cohort, they provide experimental validation of 258 gene fusions in 369 patients (70%).

image-20191118110449107

Because only 139 patients have RNA-seq data, the mutation landscape looks like this:

image-20191118110604155

The detected gene fusion events can even be treated as a kind of patient phenotype and analyzed as such:

image-20191118111159029
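One simple way to treat fusion status as a phenotype is to cross-tabulate it against a clinical variable and run Fisher's exact test. The sketch below does this with the standard library only. The patient records, the fusion label `FUSION_A`, and the `relapsed` field are toy data invented for illustration, not results from the paper, and the paper's own statistical approach may differ.

```python
from math import comb


def fusion_by_trait_table(patients, fusion, trait):
    """Count patients in a 2x2 table: fusion present/absent vs trait true/false."""
    a = b = c = d = 0
    for patient in patients:
        has_fusion = fusion in patient["fusions"]
        has_trait = bool(patient[trait])
        if has_fusion and has_trait:
            a += 1
        elif has_fusion:
            b += 1
        elif has_trait:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return sum(p for p in map(prob, range(low, high + 1)) if p <= observed * (1 + 1e-9))


# Toy cohort: every value here is invented.
toy_patients = [
    {"id": "P1", "fusions": {"FUSION_A"}, "relapsed": True},
    {"id": "P2", "fusions": {"FUSION_A"}, "relapsed": True},
    {"id": "P3", "fusions": {"FUSION_A"}, "relapsed": True},
    {"id": "P4", "fusions": set(), "relapsed": False},
    {"id": "P5", "fusions": set(), "relapsed": False},
    {"id": "P6", "fusions": set(), "relapsed": False},
]

table = fusion_by_trait_table(toy_patients, "FUSION_A", "relapsed")
p_value = fisher_exact_two_sided(*table)
print(table, p_value)
```

With real cohorts, the same table could feed a survival analysis or a multivariable model instead of a single test.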