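Using hifiasm | Genome assembly from HiFi sequencing data
Three assemblers are currently in mainstream use for PacBio HiFi sequencing data: FALCON, hifiasm, and HiCanu.
Using hifiasm
Introduction
Hifiasm is a fast haplotype-resolved de novo assembler for PacBio HiFi reads. It can assemble a human genome in several hours and works with the California redwood genome, one of the most complex genomes sequenced so far. Hifiasm can produce primary/alternate assemblies of quality competitive with the best assemblers. It also introduces a new graph binning algorithm and achieves the best haplotype-resolved assembly given trio data.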
Installation
# Install with conda
conda install -c bioconda hifiasm
# Or build hifiasm from source (requires g++ and zlib)
git clone https://github.com/chhylp123/hifiasm
cd hifiasm && make
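Format conversion
The HiFi reads here come as BAM, so they first need to be converted to FASTA.
# bam --> fasta (reads.bam and reads.fasta are placeholder names)
samtools view reads.bam | awk '{print ">"$1"\n"$10}' > reads.fasta
# Other format conversions, for reference
## sam --> fasta (skip the @ header lines)
awk '!/^@/ {print ">"$1"\n"$10}' in.sam > out.fasta
## fasta --> sam (by aligning paired reads with bowtie2; -f marks the input as FASTA)
bowtie2 -f -p 16 -x prefix -1 reads_1.fa -2 reads_2.fa -S out.sam
## sam --> bam
# -@: extra threads  -b: output BAM  -S: input is SAM (ignored by samtools 1.x, which auto-detects the format)  -o: output file
samtools view -@ 16 -b -S final.sam -o final.bam
## bam --> sam (-h keeps the header)
samtools view -h -O SAM -o out.sam in.bam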
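After a conversion it is worth checking that the FASTA file holds the expected number of reads before starting a long assembly. The following is a minimal sketch using only awk; the function name fasta_stats is hypothetical and not part of samtools or hifiasm.

```shell
# fasta_stats FILE: print "<number of records> <total bases>" for a FASTA file.
# Multi-line FASTA is handled as well, since every non-header line is summed.
fasta_stats() {
    awk '
        /^>/ { n++; next }            # header line: count one record
             { bases += length($0) }  # sequence line: add its length
        END  { printf "%d %d\n", n, bases }
    ' "$1"
}

# Usage (reads.fasta is a placeholder file name):
# fasta_stats reads.fasta
```

The read count should match the number of records in the source BAM, which `samtools view -c` reports.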
Parameters
$ hifiasm
Usage: hifiasm [options] <in_1.fq> <in_2.fq> <...>
Options:
Input/Output:
-o STR prefix of output files [hifiasm.asm]
-i ignore saved read correction and overlaps
-t INT number of threads [1]
-z INT length of adapters that should be removed [0]
--version show version number
Overlap/Error correction:
-k INT k-mer length (must be <64) [51]
-w INT minimizer window size [51]
-f INT number of bits for bloom filter; 0 to disable [37]
-D FLOAT drop k-mers occurring >FLOAT*coverage times [5.0]
-N INT consider up to max(-D*coverage,-N) overlaps for each oriented read [100]
-r INT round of correction [3]
Assembly:
-a INT round of assembly cleaning [4]
-m INT pop bubbles of <INT in size in contig graphs [10000000]
-p INT pop bubbles of <INT in size in unitig graphs [100000]
-n INT remove tip unitigs composed of <=INT reads [3]
-x FLOAT max overlap drop ratio [0.8]
-y FLOAT min overlap drop ratio [0.2]
-u disable post join contigs step which may improve N50
--lowQ INT
output contig regions with >=INT% inconsistency in BED format; 0 to disable [70]
Trio-partition:
-1 FILE hap1/paternal k-mer dump generated by "yak count" []
-2 FILE hap2/maternal k-mer dump generated by "yak count" []
-c INT lower bound of the binned k-mer's frequency [2]
-d INT upper bound of the binned k-mer's frequency [5]
-3 FILE list of hap1/paternal read names []
-4 FILE list of hap2/maternal read names []
Purge-dups:
-l INT purge level. 0: no purging; 1: light; 2: aggressive [0 for trio; 2 for unzip]
-s FLOAT similarity threshold for duplicate haplotigs [0.75]
-O INT min number of overlapped reads for duplicate haplotigs [1]
--purge-cov INT
coverage upper bound of Purge-dups [auto]
--high-het enable this mode for high heterozygosity sample [experimental, not stable]
Example: ./hifiasm -o NA12878.asm -t 32 NA12878.fq.gz
See `man ./hifiasm.1' for detailed description of these command-line options.
Usage
A typical hifiasm command line looks like this:
hifiasm -o <outputPrefix> -t <nThreads> <HiFi-reads.fasta>
#eg:
hifiasm -o NA12878.asm -t 32 NA12878.fq.gz
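Here NA12878.fq.gz provides the input reads, -t sets the number of CPU threads to use, and -o sets the prefix of the output files.
When parental short reads are available, hifiasm can also generate a pair of haplotype-resolved assemblies with trio binning. To do this, first count k-mers in the parental reads with yak, then run the assembly: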
yak count -b37 -t <nThreads> -o <pat.yak> <paternal-short-reads.fastq>
yak count -b37 -t <nThreads> -o <mat.yak> <maternal-short-reads.fastq>
#eg:
yak count -k31 -b37 -t16 -o pat.yak paternal.fq.gz
yak count -k31 -b37 -t16 -o mat.yak maternal.fq.gz
The following command then produces the paternal assembly and the maternal assembly:
hifiasm -o <outputPrefix> -t <nThreads> -1 <pat.yak> -2 <mat.yak> <HiFi-reads.fasta>
#eg:
hifiasm -o NA12878.asm -t 20 -1 pat.yak -2 mat.yak NA12878.fq.gz
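Because the trio run depends on both yak dumps and the HiFi reads, a small guard can save a failed launch. This is only a sketch of one way to do it; check_inputs is a hypothetical helper, not something hifiasm or yak provides.

```shell
# check_inputs FILE...: succeed only if every listed file exists and is non-empty.
# The first missing or empty file is reported on stderr.
check_inputs() {
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty input: $f" >&2
            return 1
        fi
    done
    return 0
}

# Usage, with the file names from the example above:
# check_inputs pat.yak mat.yak NA12878.fq.gz &&
#     hifiasm -o NA12878.asm -t 20 -1 pat.yak -2 mat.yak NA12878.fq.gz
```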
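Output files
For a non-trio assembly, hifiasm generates the following files:
prefix.r_utg.gfa (haplotype-resolved raw unitig graph in GFA format): keeps all haplotype information produced by the assembly, including somatic mutations and recurrent sequencing errors.
prefix.p_utg.gfa (haplotype-resolved processed unitig graph without small bubbles): small bubbles caused by somatic mutations and background noise in the data, which do not represent real haplotype information, are popped. Prefer this file for highly heterozygous genomes.
prefix.p_ctg.gfa (primary assembly contig graph): prefer this file for species with low heterozygosity; for highly heterozygous species it represents one of the haplotypes.
prefix.a_ctg.gfa (alternate assembly contig graph): the assembly of the other haplotype.
For a trio assembly, hifiasm generates the following files:
prefix.r_utg.gfa (haplotype-resolved raw unitig graph in GFA format): keeps all haplotype information.
prefix.hap1.p_ctg.gfa (phased paternal/haplotype 1 contig graph): the phased paternal/haplotype 1 assembly.
prefix.hap2.p_ctg.gfa (phased maternal/haplotype 2 contig graph): the phased maternal/haplotype 2 assembly.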
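The contig files listed above are in GFA format, while most downstream tools expect FASTA. In GFA, each sequence sits on a tab-separated S line (record type, segment name, sequence, optional tags), so the sequences can be pulled out with awk. A minimal sketch follows; the function name gfa_to_fasta is hypothetical.

```shell
# gfa_to_fasta FILE: print every segment (S line) of a GFA file as a FASTA record.
# Field 2 of an S line is the segment name and field 3 is its sequence.
gfa_to_fasta() {
    awk -F'\t' '$1 == "S" { print ">" $2; print $3 }' "$1"
}

# Usage (prefix is whatever was passed to hifiasm -o):
# gfa_to_fasta prefix.p_ctg.gfa > prefix.p_ctg.fa
```

All other GFA record types are skipped, so the same function works on the unitig graphs and on the hap1/hap2 contig graphs.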