简单的snptest要不要学

2020 年 3 月 18 日
笔记

For the GWAS datasets, we calculated per-allele ORs and SEs using logistic regression analysis with the SNPTEST software (version 2.5.4) based on a probabilistic dosage model. 使用基于概率剂量模型的SNPTEST软件(版本2.5.4)使用Logistic回归分析计算每个等位基因的ORs和Se.

根据这句话，尝试做点学习，记录点笔记。

Logit笔记

先补充学习了Logit的相关知识，手记笔记：

Logistic回归适用于二值响应型变量（0和1），模型假设响应亦是Y服从二项分布。通过一系列连续和/类别型预测变量来预测二值型结果时使用。《R语言实战》（一本初看没有多少头绪，慢慢才有些感觉的书。最近还读了《R语言轻松入门与提高/达人迷》觉得这本书浅显易懂，虽然有点老，推荐阅读。

软件下载和准备

下载地址在这个网站 https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html

#我是黑苹果，下载的mac版，虽然两天死机一次，仍在坚持用，因为实在受不了bug10的咖喱风   wget -c http://www.well.ox.ac.uk/~gav/resources/snptest_v2.5.4-beta3_MacOSX_x86_64.tgz   #ubuntu，动态的比较推荐，适应更多平台   wget -c http://www.well.ox.ac.uk/~gav/resources/snptest_v2.5.4-beta3_linux_x86_64_dynamic.tgz   #Cent   wget -c http://www.well.ox.ac.uk/~gav/resources/snptest_v2.5.4-beta3_CentOS6.6_x86_64_dynamic.tgz   ##解压安装   tar zxvf snptest_v2.5.4-beta3_MacOSX_x86_64.tgz   #测试  snptest_v2.5.4-beta3_MacOSX_x86_64/snptest_v2.5.4-beta3 -help  Welcome to SNPTEST v2.5.3 (revision 8a0c171055ff6ac4c69079501f52ac93c563ed20)  © University of Oxford 2008-2015  https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html  Read LICENCE file for conditions of use.    Usage: snptest_v2.5.3 <options>   #好多信息，此处省略n百字符

使用

1.先确定哪个选项完成这个功能

官网的文档中，有几个功能作了介绍，第一个Computing summary statistics，应该是汇总统计，主要功能介绍为：计算每个基因型的计数，频率，SNP缺失率和OR值。显然不是这个功能，顺便用搜索logistic 确认了下，没有。第二个是Frequentist Association Tests，频率关联测试，没找到相关内容。第三个，Bayesian Tests (Bayes Factors)，贝叶斯测试，找到了相关内容。后面两个选项也是有相关内容的，但是在这里又看到了这句话的引文，来自2007年的一篇nature,A new multipoint method for genome-wide association studies by imputation of genotypes，发现这篇文章中涉及的两个方法是 frequentist和bayesian，所以断定应该是后者了。

简单看下这篇引文，会不会有什么收获，基本上是关于detect causal variants that have not been directly genotyped的，翻译一下就是检测尚未直接进行基因分型的因果变异。所推断的基因型有时(适当地)是不确定的，所以我们发现有必要开发关联性检验(包括频度检验和贝叶斯检验)，以考虑到我们推定的基因型的不确定性。贝叶斯因子在某种程度上类似于频率P值，它们的使用开始出现在文献中，作为经典关联检验的一种更强大和更容易解释的选择。

还是来自引文的内容：使用贝叶斯因子比频率测试统计量或P值有几个优点。正确解释P值需要了解所用测试的威力。非正式地，小的P值可能在NULL下偶然出现或来自真正的关联。在没有对可能的效应大小的研究的威力了解的情况下，评估其中的哪一种可能是困难的(例如，对于动力不足的研究，我们预计最重要的P值会偶然出现)。贝叶斯因子的计算，就像幂计算一样，需要关于效应大小的假设，但贝叶斯因子本身具有自然的解释，作为根据数据改变我们先前的关联概率的因子。贝叶斯因子可以在给定的SNP下通过不同的关联模型自然地组合。例如，我们可以用加性模型、显性模型、隐性模型和一般模型求贝叶斯因子的平均值，以避免必须指定在一个位点使用的单个模型。可以使用类似的思想来组合区域内跨SNP的贝叶斯因子。根据最近关于贝叶斯方法获得的能力的证据，我们重点研究了基于贝叶斯因子的测试统计，并在使用的两组测试统计中对方法进行了比较，以便将结果集中在每种方法预测因果变量的能力上，而不是集中在不同测试统计数据的能力差异上。我们分别使用非共轭和共轭先验(补充方法)对2型糖尿病数据集和模拟数据集进行分析，以反映我们对适合这些数据集的遗传效应大小的信念。

然后，这里有带了个参考文献，顺藤摸瓜，找到了，一个大牛的教程，但是是nature!!! 厉害吧。简单扫了一眼，好像看不懂，这充分告诉我们，学好数学是多么地重要，回归今天的主题吧！

2.用示例数据学习下Bayesian Tests (Bayes Factors)

-bayesian选项，此选项控制您希望在每个SNP上测试与模型没有关联的SNP。这五个不同的模型被编码为1 =Additive，2 =显性，3 =隐性，4 =常规和5 =杂合子。使用此选项时，输出文件将为每个测试包含一列，其中包含该测试的log10贝叶斯因子以及模型参数（β值）及其标准误差的后验均值。SNPTEST将allele_A编码为0，将allele_B编码为1，这定义了beta的含义以及se的含义。例如，当使用加性模型时，β估计对数几率的增加，这可以归因于allele_B的每个副本。贝叶斯因子将始终以每个SNP计算。

-method选项还用于控制贝叶斯模型拟合的方式，但并非所有选项都有效。

如果表型是二进制，那么有效的选项只有 threshold, expected, score 和 ml. score选项使用单个牛顿-拉夫森迭代来估计后验模式，而ml选项使用多次迭代。
如果是数量表型，那么有效的选项只有threshold和expected。

贝叶斯模型的先验

下表说明了所使用的Logistic回归的线性预测值、模型参数使用的先验的形式、SNPTEST中使用的默认先验以及可用于更改先验的命令行选项。

Model	Linear Predictor	Priors	Default	Coding	Command line option
Additive	log(pi/(1-pi)) = µ + ßGi	µ~N(a0, a12) ß~N(b0, b12)	a0=0, a1=1 b0=0, b1=0.2	Gi is the additive coding of the SNP i.e. AA -> 0, AB ->1, BB -> 2.	-prior_add a0 a1 b0 b1
Dominant	log(pi/(1-pi)) = µ + ßDi	µ~N(a0, a12) ß~N(b0, b12)	a0=0, a1=1 b0=0, b1=0.5	Di is the dominant coding of the SNP i.e. AA -> 0, AB -> 1, BB -> 1.	-prior_dom a0 a1 b0 b1
Recessive	log(pi/(1-pi)) = µ + ßRi	µ~N(a0, a12) ß~N(b0, b12)	a0=0, a1=1 b0=0, b1=0.5	Ri is the recessive coding of the SNP i.e. AA -> 0, AB -> 0, BB -> 1.	-prior_rec a0 a1 b0 b1
General	log(pi/(1-pi)) = µ + ßGi + qHi	µ~N(a0, a12) ß~N(b0, b12) q~N(c0, c12)	a0=0, a1=1 b0=0, b1=0.2 c0=0, c1=0.5	Gi is the additive coding of the SNP i.e. AA -> 0, AB ->1, BB -> 2. Hi is the heterozygote coding of the SNP i.e. AA -> 0, AB ->1, BB -> 0.	-prior_gen a0 a1 b0 b1 c0 c1
Heterozygote	log(pi/(1-pi)) = µ + ßHi	µ~N(a0, a12) ß~N(b0, b12)	a0=0, a1=1 b0=0, b1=0.5	Hi is the heterozygote coding of the SNP i.e. AA -> 0, AB ->1, BB -> 0.	-prior_het a0 a1 b0 b1

T分布先验

在SNPTEST v2中，有一个新的选项来指定在遗传效应上使用t分布先验。t分布的较粗尾部允许更灵活地指定关于遗传效应大小的信念。此选项由以下两个选项控制。

-t_prior	详细说明了t-分布先验在遗传效应上的应用。该选项有效地修改了上表中描述的先验，即t-分布的均值和方差由上表中给出的选项指定，但是正态分布被t-分布代替。注：t分布仅用于遗传效应，即上述模型中的参数？和q。例如，-Bayesian add-t_previous将指定线性预测器log(pi/(1-pi))=µ+？Gi，并且先验将是µ~N(a0，a12)和？t(b0，b12，df=3)。
-t_df	t分布的自由度参数。默认值为3。当此参数设置得非常大时，先验收敛到正态分布先验。

-t_prior

详细说明了t-分布先验在遗传效应上的应用。该选项有效地修改了上表中描述的先验，即t-分布的均值和方差由上表中给出的选项指定，但是正态分布被t-分布代替。注：t分布仅用于遗传效应，即上述模型中的参数？和q。例如，-Bayesian add-t_previous将指定线性预测器log(pi/(1-pi))=µ+？Gi，并且先验将是µ~N(a0，a12)和？t(b0，b12，df=3)。

**-t_df **

t分布的自由度参数。默认值为3。当此参数设置得非常大时，先验收敛到正态分布先验。

例子-Bayesian病例-对照test

使用Bayesian addictive 模型对二值型表型bin1计算，使用默认参数

snptest_v2.5.4-beta3_MacOSX_x86_64/snptest_v2.5.4-beta3    -data snptest_v2.5.4-beta3_MacOSX_x86_64//example/cohort1.gen snptest_v2.5.4-beta3_MacOSX_x86_64//example/cohort1.sample snptest_v2.5.4-beta3_MacOSX_x86_64//example/cohort2.gen snptest_v2.5.4-beta3_MacOSX_x86_64//example/cohort2.sample   -o snptest_v2.5.4-beta3_MacOSX_x86_64//example/ex.out   -bayesian 1   -method score   -pheno bin1  #输出了一些信息  #这里显示2.5.3版本了  Welcome to SNPTEST v2.5.3 (revision 8a0c171055ff6ac4c69079501f52ac93c563ed20)  © University of Oxford 2008-2015  https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html  Read LICENCE file for conditions of use.    ==============    Data Files :   -gen files : snptest_v2.5.4-beta3_MacOSX_x86_64//example/cohort1.gen snptest_v2.5.4-beta3_MacOSX_x86_64//example/cohort2.gen   -sample files : snptest_v2.5.4-beta3_MacOSX_x86_64//example/cohort1.sample snptest_v2.5.4-beta3_MacOSX_x86_64//example/cohort2.sample    Tests :   -bayesian : 1   -method score    Inspecting data (this may take some time)...  Sample and exclusions summary :#样本量500-500   - Number of individuals in : (cohort 1)   (cohort 2)                                500          500      Reading sample files :  Summary of covariates and phenotypes   # discrete variables : 4 离散变量    cov1 : type = D (Discrete covariate)    cov2 : type = D (Discrete covariate)    sex : type = D (Discrete covariate)    bin3 : type = D (Discrete covariate)   # continuous variables : 2 连续变量    cov3 : type = C (Continuous covariate)    cov4 : type = C (Continuous covariate)   # phenotypes : 4 表型    pheno1 : type = P (Continuous phenotype)    pheno2 : type = P (Continuous phenotype)    bin1 : type = B (Binary phenotype)    bin2 : type = B (Binary phenotype)  Covariate summary :    (no covariates in use.)  Phenotype summary :    bin1    : missing  levels              2        0(499) 1(499)    You have specified the following model: 模型    bin1 ~ (baseline) + (genotype)    Phenotype being used : bin1    Data Summaries :   -number of SNPs = (unknown)    Data with missing genotype data threshold and exclusion list applied :   snptest_v2.5.4-beta3_MacOSX_x86_64//example/cohort1.gen : 500   snptest_v2.5.4-beta3_MacOSX_x86_64//example/cohort2.gen : 500    --------------------------------------------------------------------------    SinglePhenotypeTest #检验  --------------------------------------------------------------------------    Analyzing Data :   scanning... read chunk [1 of (unknown)]... done.   scanning... read chunk [2 of (unknown)]... done.   scanning... no more data.    finito

来看看结果文件，都有什么内容：

less snptest_v2.5.4-beta3_MacOSX_x86_64/example/ex.out  #有点不好分清列，还是用excel，数据-分列-字符-空格

看看表头，好多信息，分了n行才列出来，第一行分别是染色体号，位置，等位基因，最大分型率

chromosome	position	alleleA	alleleB	index	average_maximum_posterior_call	info	cohort_1_AA
INSERTION_1	RSID_1	NA	0	A	AGTGCTA	1	0.992234

第二行是两个队列的相关信息。

cohort_1_AB	cohort_1_BB	cohort_1_NULL	cohort_2_AA	cohort_2_AB	cohort_2_BB	cohort_2_NULL
0.970846	0	0	0	499	176.815	250.974

第三行是总体基因型汇总信息

all_AA	all_AB	all_BB	all_NULL	all_total	cases_AA	cases_AB	cases_BB
71.2104	0	176.815	250.974	71.2104	499	998	0

第四行是样本对照统计信息

cases_NULL	cases_total	controls_AA	controls_AB	controls_BB	controls_NULL	controls_total
0	499	499	176.815	250.974	71.2104	0

第五行是接着上面的，还有数据缺失情况和OR值信息。

all_maf	cases_maf	controls_maf	missing_data_proportion	het_OR	het_OR_lower
499	0.394184	0	0.394184	0.25	NA

第六行还是上一行的OR值信息

het_OR_upper	hom_OR	hom_OR_lower	hom_OR_upper	all_OR	all_OR_lower
NA	NA	NA	NA	NA	NA

第七行是本次分析得到的目标参数，log10(bf)贝叶斯因子？，beta值，标准差。

all_OR_upper	bayesian_add_log10_bf	bayesian_add_beta_1	bayesian_add_se_1	comment
NA	-0.0174105	2.02E-17	0.192135	NA

大概到这里，这句话要获得的信息OR值和se就得到了，前路漫漫，依然有好多不懂的，持续学习！