What to do when TCGA does not have enough normal samples



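The cancer you want to mine has data in TCGA, but it has too few normal samples (adjacent normal tissue or blood), so differential analysis and similar work runs into a sample imbalance problem. Can the normal-tissue RNA-seq data from GTEx be brought in to fill the gap?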
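Before deciding whether to bring in GTEx, it helps to quantify how unbalanced the TCGA cohort actually is. The sketch below is a minimal illustration, not part of the original post. It assumes sample-level TCGA barcodes, where the first two digits of the fourth field are the sample-type code (01-09 tumor, 10-19 normal). The barcodes in `example_barcodes` are made up for illustration.

```python
from collections import Counter


def sample_group(barcode: str) -> str:
    """Classify a sample-level TCGA barcode by its sample-type code."""
    fields = barcode.split("-")
    if len(fields) < 4 or not fields[3][:2].isdigit() or len(fields[3]) < 2:
        raise ValueError(f"not a sample-level TCGA barcode: {barcode!r}")
    code = int(fields[3][:2])
    if 1 <= code <= 9:
        return "tumor"
    if 10 <= code <= 19:
        return "normal"
    return "other"  # e.g. control samples


def count_groups(barcodes):
    """Count how many barcodes fall into each group."""
    return Counter(sample_group(b) for b in barcodes)


# Hypothetical barcodes, for illustration only.
example_barcodes = [
    "TCGA-AA-0001-01A",
    "TCGA-AA-0002-01A",
    "TCGA-AA-0003-01A",
    "TCGA-AA-0003-11A",
]
print(count_groups(example_barcodes))
```

If the normal count is a small fraction of the tumor count, the imbalance described above applies.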

  • GTEx, the Genotype-Tissue Expression (GTEx) project, was first described in 2013, when more than a hundred scientists co-authored a Nature Genetics paper introducing the project and established the GTEx Consortium. The GTEx has catalogued gene expression in >9,000 samples across 53 tissues from 544 healthy individuals.
  • TCGA, The Cancer Genome Atlas, https://cancergenome.nih.gov/ , is a cancer research project run jointly by the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) that collects and curates the various kinds of omics data for cancer. The Cancer Genome Atlas (TCGA) has quantified gene expression levels in >12000 samples from >33 cancer types.

There is no simple answer to whether TCGA and GTEx can be integrated, or how to combine them. The statistics behind it are somewhat involved and go beyond batch effects. The Sci Data (2018) paper "Unifying cancer and normal RNA sequencing data from different sources" describes in some detail the inherent differences between the TCGA and GTEx transcriptome data:

  • sequencing platform and chemistry, personnel, details in the analysis pipeline, etc
  • Range of gene expression values: 4-10 (log2 of normalized_count) for TCGA, and 0-4 (log2 of RPKM) for GTEx

All of the code is shared on GitHub (https://github.com/mskcc/RNAseqDB).
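The different value ranges above come partly from the units themselves. As a small illustration (not the method of the cited paper), the sketch below converts per-sample RPKM values to TPM with the standard identity TPM_i = RPKM_i / sum(RPKM) * 1e6, then takes log2(TPM + 1). The gene names and values are made up. TCGA's normalized_count is a different quantity and is not covered by this conversion, and changing units does nothing about platform, pipeline, or batch differences.

```python
import math


def rpkm_to_tpm(rpkm):
    """Rescale one sample's RPKM values so that they sum to one million."""
    total = sum(rpkm.values())
    if total <= 0:
        raise ValueError("RPKM values must have a positive sum")
    return {gene: value / total * 1e6 for gene, value in rpkm.items()}


def log2_plus_one(values):
    """Apply log2(x + 1), a common transform for expression values."""
    return {gene: math.log2(value + 1.0) for gene, value in values.items()}


# Toy values for a single sample, for illustration only.
toy_rpkm = {"GENE_A": 10.0, "GENE_B": 30.0, "GENE_C": 60.0}
toy_tpm = rpkm_to_tpm(toy_rpkm)
print(toy_tpm)
print(log2_plus_one(toy_tpm))
```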


Unifying the TCGA and GTEx quantification pipelines

A paper published in Scientific Reports on 17 February 2020, "Variability in estimated gene expression among commonly used RNA-seq pipelines", compared how common RNA-seq analysis pipelines affect the expression matrix obtained from quantification:

  • We compared gene expression values from common samples (4,800 tumor samples from TCGA and 1,890 normal-tissue samples from GTEx) processed by the pipelines to understand how gene expression quantification is impacted by differences in data processing.

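TCGA and GTEx are two very large projects with RNA-seq data. TCGA covers 33 cancer types and more than 10,000 samples, while GTEx has close to 10,000 samples from more than 50 tissues of over 500 individuals. The two projects process their RNA-seq data differently, and several newer pipelines have since tried to harmonize the RNA-seq results from both databases. The five better-known pipelines include: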

  • TOPMed pipeline (https://github.com/broadinstitute/gtex-pipeline)
  • recount2 pipeline (https://jhubiostatistics.shinyapps.io/recount/)

The authors applied these five pipelines to TCGA and GTEx, yielding 10 datasets from the different combinations:

  • GDC (GDC-Xena/Toil, GDC-Piccolo, GDC-Recount2, GDC-MSKCC and GDC-MSKCC Batch).
  • GTEx (GTEx-Xena/Toil, GTEx-Recount2, GTEx-MSKCC, GTEx-MSKCC Batch)

They carried out a very thorough comparison and released all of the code at https://github.com/sonali-bioc/UncertaintyRNA
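One simple way to see how much two pipelines disagree on the same sample is to correlate their log-transformed expression values over the shared genes. The sketch below is only an illustration of that idea, not the analysis in the cited paper. The function names, gene names, and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def pipeline_agreement(expr_a, expr_b):
    """Correlate log2(x + 1) values of one sample over genes both pipelines report."""
    shared = sorted(set(expr_a) & set(expr_b))
    log_a = [math.log2(expr_a[g] + 1.0) for g in shared]
    log_b = [math.log2(expr_b[g] + 1.0) for g in shared]
    return pearson(log_a, log_b)


# Toy expression values of the same sample from two hypothetical pipelines.
toy_pipeline_a = {"GENE_A": 0.0, "GENE_B": 3.0, "GENE_C": 15.0, "GENE_D": 7.0}
toy_pipeline_b = {"GENE_A": 1.0, "GENE_B": 3.0, "GENE_C": 31.0, "GENE_E": 2.0}
print(pipeline_agreement(toy_pipeline_a, toy_pipeline_b))
```

A high correlation alone does not show that two pipelines are interchangeable, since systematic shifts for individual genes can still distort a tumor-versus-normal comparison.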


Figure: five commonly used transcriptome quantification pipelines
 

Literature that integrates TCGA and GTEx

There is a great deal of it!

Many fairly basic data-mining studies also combine TCGA with GTEx, for example the paper "Identification of four hub genes associated with adrenocortical carcinoma progression by WGCNA", published in the Bioinformatics and Genomics section of PeerJ.

The first step is to download the TPM expression matrices for TCGA and GTEx:

Gene transcripts per million (TPM) data were downloaded from the UCSC Xena database, which included ACC (The Cancer Genome Atlas, n = 77) and normal samples (Genotype Tissue Expression, n = 128).

The differential analysis then proceeds as follows:

  • Of the 60,498 genes in each sample, we removed genes with a mean TPM ≤ 2.5 (>1 is a common cutoff for determining if an isoform is expressed or not) in the cancer and normal samples and thus retained 13,987 genes.

  • For those genes in the samples that showed significant changes, we used analysis of variance (ANOVA) in R to determine the variance in genes between the two groups. ANOVA is a collection of statistical models useful for DEG analysis.

  • We obtained 2,953 significant DEGs (Table S2) in ACC with a p < 0.001 and |log2 (fold-change)| > 1 cutoff.

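The differential analysis yields 1,181 up-regulated and 1,772 down-regulated genes.

As you can see, the authors assume that TPM, as a normalized form of RNA-seq expression data, is inherently comparable across platforms and databases. They therefore do not consider batch effects and simply apply the most blunt option, an ANOVA test!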
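The per-gene steps quoted above (expression filter, fold change, ANOVA) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' R code. It assumes the filter uses the mean TPM over all samples, that the fold change is the difference of mean log2(TPM + 1) between groups, and that ANOVA is run on the log-transformed values. The function names and toy values are hypothetical. Only the F statistic is computed here. A p-value would come from the F distribution with (1, n - 2) degrees of freedom, for example via `scipy.stats.f.sf`.

```python
import math


def mean(values):
    return sum(values) / len(values)


def one_way_anova_f(group_a, group_b):
    """F statistic of a one-way ANOVA with two groups."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two samples")
    mean_a, mean_b = mean(group_a), mean(group_b)
    grand = (sum(group_a) + sum(group_b)) / (n_a + n_b)
    ss_between = n_a * (mean_a - grand) ** 2 + n_b * (mean_b - grand) ** 2
    ss_within = (sum((x - mean_a) ** 2 for x in group_a)
                 + sum((x - mean_b) ** 2 for x in group_b))
    df_between, df_within = 1, n_a + n_b - 2
    if ss_within == 0:
        return math.inf
    return (ss_between / df_between) / (ss_within / df_within)


def screen_gene(tumor_tpm, normal_tpm, min_mean_tpm=2.5, pseudocount=1.0):
    """Return (log2 fold change, F statistic), or None if the gene is filtered out."""
    # Assumption: the expression filter uses the mean over all samples.
    if mean(list(tumor_tpm) + list(normal_tpm)) <= min_mean_tpm:
        return None
    log_tumor = [math.log2(v + pseudocount) for v in tumor_tpm]
    log_normal = [math.log2(v + pseudocount) for v in normal_tpm]
    log2_fc = mean(log_tumor) - mean(log_normal)
    return log2_fc, one_way_anova_f(log_tumor, log_normal)


# Toy TPM values for one gene, for illustration only.
print(screen_gene([15.0, 15.0, 31.0, 31.0], [3.0, 3.0, 7.0, 7.0]))
print(screen_gene([1.0, 2.0, 1.0, 2.0], [0.0, 1.0, 0.0, 1.0]))
```

Note that this test carries the same assumption discussed above: it treats TCGA and GTEx values as directly comparable and does nothing to separate biological differences from differences between the two sources.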

 

What about methylation data

As we all know, TCGA is currently the most comprehensive omics database for cancer patients, covering:

  • DNA Sequencing
  • miRNA Sequencing
  • Protein Expression array
  • mRNA Sequencing
  • Total RNA Sequencing
  • Array-based Expression
  • DNA Methylation
  • Copy Number array

Well-known cancer research institutions each have their own tools for exploring TCGA data, for example:

  • FireBrowse portal, The Broad Institute
  • cBioPortal for Cancer Genomics, Memorial Sloan-Kettering Cancer Center

For transcript-level expression, the best option is of course to integrate TCGA and GTEx. But for methylation data, is there a very large cohort comparable to GTEx?

I have not come across one so far. In an earlier post ("This is what a good diagnostic model looks like"), the authors of the study I covered downloaded the TCGA colorectal cancer methylation signal matrix:

  • Tissue DNA methylation data were obtained from the TCGA (TCGA, TCGA-COAD, and TCGA-READ).

and used methylation signal values from the blood of healthy individuals as the control:

  • Whole-blood DNA methylation profiles from healthy donors were generated in an aging study (GSE40279)

These two cohorts were used to identify colorectal cancer-specific methylation sites through differential analysis, which yielded the top 1000 methylation markers.
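The text above does not say how the markers were ranked, so the following is only a sketch of one plausible approach: rank CpG sites by the absolute difference in mean beta value between tumor tissue and the blood controls, and keep the top N. The function name, CpG identifiers, and beta values are hypothetical.

```python
def top_differential_cpgs(tumor_beta, control_beta, top_n=1000):
    """Rank shared CpG sites by |mean tumor beta - mean control beta|."""
    ranked = []
    for cpg in set(tumor_beta) & set(control_beta):
        tumor_mean = sum(tumor_beta[cpg]) / len(tumor_beta[cpg])
        control_mean = sum(control_beta[cpg]) / len(control_beta[cpg])
        ranked.append((cpg, tumor_mean - control_mean))
    # Largest absolute difference first; ties are broken by CpG name.
    ranked.sort(key=lambda item: (-abs(item[1]), item[0]))
    return ranked[:top_n]


# Toy beta values (fraction methylated, 0 to 1), for illustration only.
toy_tumor = {"cg_a": [0.9, 0.8], "cg_b": [0.5, 0.5], "cg_c": [0.1, 0.2]}
toy_control = {"cg_a": [0.2, 0.1], "cg_b": [0.5, 0.4], "cg_c": [0.7, 0.8]}
top_two = top_differential_cpgs(toy_tumor, toy_control, top_n=2)
print(top_two)
```

Because the control here is blood rather than normal colon or rectal tissue, sites ranked this way can reflect tissue-of-origin differences as well as cancer-specific changes.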

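A reasonable guess is that methylation data for the various normal human tissues were not available to them, which is why they fell back on blood methylation values from healthy individuals as the control.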
