全基因组haplotype基因型分析软件：GHap

程序员文章站 2024-03-02 20:02:46

...

首先，emm，”G“是”Genome-Wide"的缩写。

R语言 GHap 包用于从"定相后（phased）的SNP数据"构建全基因组haplotype，及对其频率、基因型等进行分析。Haploview和Plink等软件可以分析haplotype block，但似乎无法输出各个样本的haplotype基因型，也就无法进行很多下游的分析，如基于haplotype-based GWAS，GHap包填补了这个空白。

对Haplotype的定义方法有很多，常用的方法有以下两种：

使用LD进行定义，参考 ”Gabriel S B, Schaffner S F, Nguyen H, et al. The structure of haplotype blocks in the human genome[J]. Science, 2002, 296(5576): 2225-2229“
使用滑窗(sliding-windows)进行定义，有固定的也有可变长度的，有重叠的也有非重叠的。

还有其它，如用diversity进行定义的方法。

具体哪种方法更好，没有定论，但总的认为都比单SNP能提供更多的信息，即作关联分析（GWAS）或基因组选择（GS）的power更高。

参考：

Lorenz A J, Hamblin M T, Jannink J L. Performance of single nucleotide polymorphisms versus haplotypes for genome-wide association analysis in barley[J]. PloS one, 2010, 5(11): e14079

安装



library(BiocManager)


BiocManager::install('GHap')

使用

输入文件格式

需要三个文件，后缀分别为.samples,.markers,.phase。

因为从V2.0开始，软件要先转换为二进制，所以最好文件前缀一致，方便后续操作，如：test.samples, test.markers, test.phase。

三个文件都是空格分隔，没有表头:

test.samples：两列， Population和ID。Population可以是家系信息或其它分组信息。

test.markers：五列，Chromosome, Marker ID, Position(bp), Ref Allele(A0) and Alt Allele (A1)。需要按物理位置排序。



A01 A01_1022 1022 C T

A01 A01_1331 1331 G A

A01 A01_1493 1493 A G

A01 A01_1535 1535 A G

A01 A01_1957 1957 T G

test.phase：m行（标记），2n列（样本）。与plink的ped文件相似，但基因型用0和1编码。必须无缺失。



0 1 0 1 0 1 0 1 0 1

1 0 1 0 1 0 0 0 1 0

1 0 1 0 1 0 0 0 1 0

1 0 1 0 1 0 0 0 1 0

0 1 0 0 0 0 0 1 0 0

0 1 0 0 0 0 0 0 0 0

0 0 0 1 0 0 0 0 0 0

这三个文件可以先用beagle,shapeit，eagle等软件进行phased，然后从输出文件重建。

压缩文件，构GHap phase 对象

library(GHap)

#压缩文件,输出一个test.phaseb文件到当前文件夹


 ghap.compress(input.file='test',out.file='test')


#读入文件，仅用文件前缀


phase <- ghap.loadphase("test")

过滤标记

这一步一般通过plink、vcftools软件完成了。

#过滤掉maf <0.05的标记
maf <- ghap.freq(phase, type="maf", ncores = 1)
markers <- names(maf)[which(maf > 0.05)]
phase <- ghap.subsetphase(phase, unique(phase$id), markers)

haplotyping

软件支持用sliding-window方法进行单倍型分型，但也支持自定义的分型，如用plink或haplowview软件的LD block进行分型。

sliding-window


# 用marker个数进行sliding

blocks.mkr10 <- ghap.blockgen(phase, windowsize = 10, slide = 10, unit = "marker")

# 根据物理距离进行sliding
#locks.kb <- ghap.blockgen(phase, windowsize = 100, slide = 100, unit = "kbp")
# 自定义haplotype，如用plink计算LD定义haplotype

nohup plink --bfile ../snp.utx.imputated --blocks no-pheno-req --blocks-max-kb 2000 --blocks-strong-lowci 0.7 --blocks-strong-highci 0.98 --allow-extra-chr -out snp.utx.imputated &



#重构为GHap需要的格式

cat chr|while read id;do grep $id snp.utx.imputated.blocks.det|nl|awk '{print "CHR"$1"_B"$1"\t"$2"\t"$3"\t"$4"\t"$4-$3+1"\t"$6}' - >>blocks_LD.txt;done

sed -i "1i BLOCK\tCHR\tBP1\tBP2\tSIZE\tNSNP" blocks_LD.txt 

#读入blocks信息

blocks.ld <- read.table('blocks_LD.txt',header=1,stringsAsFactors=F)



# 构建基于haplotype的基因型矩阵

ghap.haplotyping(phase, blocks.mkr10, outfile = "test", binary = T, ncores = 100,batchsize=300000)

outfile：输出文件前缀
binary：以二进制编码输出。也可以不输出二进制，但要注意的是不能读入，因为只能读入二进制。
ncores：线程
batchsize：批次大小

Haplotype 统计分析


#读入haplotype 基因型，仅使用文件前缀


haplo <- ghap.loadhaplo("test")


#haplotype 等位基因统计分析
hapstats <- ghap.hapstats(haplo, ncores = 150)

write.table(hapstats,'test.hapstats',quote=F,row.names=F)

#haplotype block统计分析

blockstats <- ghap.blockstats(hapstats, ncores = 150)

write.table(blockstats,'test_ld.blockstats',quote=F,row.names=F)

其它应用

GHap 的功能还有很多，例如计算PCA，单倍型多态性（Fst），GWAS分析等等，可参考说明。

输出haplotype

当数据量很大时，用R处理就很耗时。因此GHap可以将haplotype输出为其它软件可接受的格式，如Plink格式。

ghap.hap2plink(haplo, outfile = "test")

全基因组haplotype基因型分析软件：GHap

安装

使用

输入文件格式

压缩文件，构GHap phase 对象

过滤标记

haplotyping

sliding-window

Haplotype 统计分析

其它应用

输出haplotype

全基因组haplotype基因型分析软件：GHap

全基因组关联分析（GWAS）软件：Tassel5

全基因组关联分析（GWAS）软件：mrMLM