0079-【生信软件】-人类基因组hg19、hg38构建bwa索引
程序员文章站
2024-03-02 19:40:34
...
在临床数据分析时,使用的人类参考基因组为UCSC上面的hg19,hg39版本,且常常将除1-22,X,Y,M,以外的染色体去除,避免数据的干扰。
hg19索引构建
- 进入UCSC的官网,hg19的ftp网址 http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/
- 参考基因组版本说明,与NCBI的对应版本
This directory contains the Feb. 2009 assembly of the human genome (hg19,
GRCh37 Genome Reference Consortium Human Reference 37 (GCA_000001405.1)),
as well as repeat annotations and GenBank sequences.
The Feb. 2009 human reference sequence (GRCh37) was produced by the
Genome Reference Consortium:
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/
- 需要下载的文件
chromFa.tar.gz - The assembly sequence in one file per chromosome.
Repeats from RepeatMasker and Tandem Repeats Finder (with period
of 12 or less) are shown in lower case; non-repeating sequence is
shown in upper case.
md5sum.txt - checksums of files in this directory
hg19.chrom.sizes - Two-column tab-separated text file containing assembly
sequence names and sizes.
- 检测文件完整性
# 打开码文件
cat md5sum.txt
## 将chromFa.tar.gz的码写入一个新文件
echo "ec3c974949f87e6c88795c17985141d3 chromFa.tar.gz" > check_md5sum.txt
## 检测
md5sum -c check_md5sum.txt
##显示 chromFa.tar.gz: OK
- 查看染色体,看染色体命名规律
cat hg19.chrom.sizes |head -n 25
## 显示
chr1 249250621
chr2 243199373
chr3 198022430
chr4 191154276
chr5 180915260
chr6 171115067
chr7 159138663
chrX 155270560
chr8 146364022
chr9 141213431
chr10 135534747
chr11 135006516
chr12 133851895
chr13 115169878
chr14 107349540
chr15 102531392
chr16 90354753
chr17 81195210
chr18 78077248
chr20 63025520
chrY 59373566
chr19 59128983
chr22 51304566
chr21 48129895
chr6_ssto_hap7 4928567
- 解压参考基因组,显示所有染色体单个文件
tar -zxf chromFa.tar.gz
- 将不要的染色体,移动到暂存的文件夹
mv chrUn_gl0002* chr*_random.fa chr*ctg* chr*apd* chr*cox* chr*hap* temp
#mkdir chrfa
#mv chromFa.tar.gz chrfa/
当面目录只剩下
chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa
- 将需要的染色体进行合并,并移动到单个fa文件暂存文件夹
for i in chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa
do
echo $i;
#cat $i >> hg19.fa
mv $i chrfa
done
- bwa进行索引构建
bwa index -a bwtsw -p hg19.fa hg19.fa
索引构建完后显示
[BWTIncConstructFromPacked] 670 iterations done. 6154206942 characters processed.
[BWTIncConstructFromPacked] 680 iterations done. 6177547390 characters processed.
[bwt_gen] Finished constructing BWT in 687 iterations.
[bwa_index] 2423.30 seconds elapse.
[bwa_index] Update BWT... 19.05 sec
[bwa_index] Pack forward-only FASTA... 16.77 sec
[bwa_index] Construct SA from BWT and Occ... 1192.17 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index -a bwtsw -p hg19.fa hg19.fa
[main] Real time: 3814.250 sec; CPU: 3676.297 sec
文件生成查看
[email protected]:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg19$ ls -lht
total 8.0G
-rwxrwxrwx 1 toucan toucan 3.0G Sep 27 09:53 hg19.fa
-rwxrwxrwx 1 toucan toucan 1.5G Sep 27 11:17 hg19.fa.sa
-rwxrwxrwx 1 toucan toucan 6.6K Sep 27 10:57 hg19.fa.amb
-rwxrwxrwx 1 toucan toucan 944 Sep 27 10:57 hg19.fa.ann
-rwxrwxrwx 1 toucan toucan 739M Sep 27 10:57 hg19.fa.pac
-rwxrwxrwx 1 toucan toucan 2.9G Sep 27 10:57 hg19.fa.bwt
-rwxrwxrwx 1 toucan toucan 3.5K Sep 27 13:16 work_index.log
-rwxrwxrwx 1 toucan toucan 685 Sep 27 10:06 work_index.sh
drwxrwxrwx 1 toucan toucan 4.0K Sep 27 09:58 chrfa
drwxrwxrwx 1 toucan toucan 4.0K Sep 27 09:17 temp
drwxrwxrwx 1 toucan toucan 4.0K Sep 27 09:00 download
- 索引构建完成
hg38构建索引
- 进入UCSC官网 http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
- 查看下载文件
hg38.fa.gz - "Soft-masked" assembly sequence in one file.
Repeats from RepeatMasker and Tandem Repeats Finder (with period
of 12 or less) are shown in lower case; non-repeating sequence is
shown in upper case.
hg38.fa.gz - "Soft-masked" assembly sequence in one file.
Repeats from RepeatMasker and Tandem Repeats Finder (with period
of 12 or less) are shown in lower case; non-repeating sequence is
shown in upper case.
md5sum.txt - checksums of files in this directory
hg38.chrom.sizes - Two-column tab-separated text file containing assembly
sequence names and sizes.
- 检查文件完整性
[email protected]:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38/genome/downloads$ echo "1c9dcaddfa41027f17cd8f7a82c7293b hg38.fa.gz" >> check_md5sum.txt
[email protected]:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38/genome/downloads$ echo "a5aa5da14ccf3d259c4308f7b2c18cb0 hg38.chromFa.tar.gz" >> check_md5sum.txt
[email protected]:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38/genome/downloads$ cat check_md5sum.txt
1c9dcaddfa41027f17cd8f7a82c7293b hg38.fa.gz
a5aa5da14ccf3d259c4308f7b2c18cb0 hg38.chromFa.tar.gz
[email protected]:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38/genome/downloads$ md5sum -c check_md5sum.txt
hg38.fa.gz: OK
hg38.chromFa.tar.gz: OK
- 查看染色体命名规则
[email protected]:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38/genome/downloads$ cat hg38.chrom.sizes |head -n 25
chr1 248956422
chr2 242193529
chr3 198295559
chr4 190214555
chr5 181538259
chr6 170805979
chr7 159345973
chrX 156040895
chr8 145138636
chr9 138394717
chr11 135086622
chr10 133797422
chr12 133275309
chr13 114364328
chr14 107043718
chr15 101991189
chr16 90338345
chr17 83257441
chr18 80373285
chr20 64444167
chr19 58617616
chrY 57227415
chr22 50818468
chr21 46709983
- 解压染色体单个文件
tar -zxf hg38.chromFa.tar.gz
cd chroms/
- 合并需要的染色体文件
mkdir chrfa
cat work.sh
for i in chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa
do
echo $i;
cat $i >> ../hg38.fa
mv $i ../chrfa
done
[email protected]:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38$ rm -rf chroms/
[email protected]:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38$ mv hg38.chromFa.tar.gz chrfa/
查看结果
[email protected]:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38$ cat hg38.fa |grep ">"|uniq -c
1 >chr1
1 >chr2
1 >chr3
1 >chr4
1 >chr5
1 >chr6
1 >chr7
1 >chr8
1 >chr9
1 >chr10
1 >chr11
1 >chr12
1 >chr13
1 >chr14
1 >chr15
1 >chr16
1 >chr17
1 >chr18
1 >chr19
1 >chr20
1 >chr21
1 >chr22
1 >chrX
1 >chrY
1 >chrM
- bwa构建索引
bwa index -a bwtsw -p hg38.fa hg38.fa
输出显示:
[BWTIncConstructFromPacked] 630 iterations done. 6018708242 characters processed.
[BWTIncConstructFromPacked] 640 iterations done. 6055482898 characters processed.
[BWTIncConstructFromPacked] 650 iterations done. 6088164258 characters processed.
[BWTIncConstructFromPacked] 660 iterations done. 6117207522 characters processed.
[BWTIncConstructFromPacked] 670 iterations done. 6143017202 characters processed.
[BWTIncConstructFromPacked] 680 iterations done. 6165952882 characters processed.
[bwt_gen] Finished constructing BWT in 686 iterations.
[bwa_index] 2564.67 seconds elapse.
[bwa_index] Update BWT... 19.27 sec
[bwa_index] Pack forward-only FASTA... 16.36 sec
[bwa_index] Construct SA from BWT and Occ... 1230.34 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index -a bwtsw -p hg38.fa hg38.fa
[main] Real time: 3969.074 sec; CPU: 3855.406 sec
输出文件
-rwxrwxrwx 1 toucan toucan 3150052305 Oct 5 15:44 hg38.fa*
-rwxrwxrwx 1 toucan toucan 16843 Oct 5 16:34 hg38.fa.amb*
-rwxrwxrwx 1 toucan toucan 954 Oct 5 16:34 hg38.fa.ann*
-rwxrwxrwx 1 toucan toucan 3088286508 Oct 5 16:33 hg38.fa.bwt*
-rwxrwxrwx 1 toucan toucan 772071602 Oct 5 16:34 hg38.fa.pac*
-rwxrwxrwx 1 toucan toucan 1544143256 Oct 5 16:54 hg38.fa.sa*
其他索引构建
fai结尾
samtools faidx hg19.fa
## 显示生成
hg19.fa.fai
samtools faidx hg38.fa
dict结尾
gatk CreateSequenceDictionary -R hg19.fa -O hg19.dict