欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

小鼠全基因组数据分析

程序员文章站 2024-03-02 20:32:52
...

小鼠全基因组数据分析

文章是:2017 Unexpected mutations after CRISPR–Cas9 editing in vivo

We performed WGS on a CRISPR–Cas9-edited mouse to identify all off-target mutations and found an unexpectedly high number of SNVs compared with the widely accepted assumption that CRISPR causes mostly indels at regions homologous to the sgRNA.

发表于 Nature Methods volume14, pages547–548 (2017) 作者已经撤稿,但是数据仍然在!

  • This article was retracted on 27 April 2018

DNA was isolated from two CRISPR-repaired mice (F03 and F05) and one uncorrected control

数据上传了:Bioproject (accession PRJNA382177). Sequencing data available at SRA:

有点类似于肿瘤外显子的数据分析流程:

As additional controls, each of the variants was compared with the FVB/NJ genome in the mouse dbSNP database (v138), and each of the SNVs was also compared with all 36 strains in the Mouse Genome Project (v3).

首先下载数据

根据作者给出的ID号

下载:

cat srr.list |while read id;do (nohup ~/biosoft/sratoolkit/sratoolkit.2.8.2-1-centos_linux64/bin/prefetch $id -X 100G  & );done

下载得到的sra文件如下:

 67G Jun 29 18:58 SRR5450996.sra
 69G Jun 29 18:51 SRR5450997.sra
 39G Jun 29 18:15 SRR5450998.sra

sra转换为fastq后如下:

45G Jun 30 13:29 control_1.fastq.gz
47G Jun 30 13:29 control_2.fastq.gz
46G Jun 30 13:26 F03_1.fastq.gz
48G Jun 30 13:26 F03_2.fastq.gz
27G Jun 30 09:24 F05_1.fastq.gz
28G Jun 30 09:24 F05_2.fastq.gz

简单的使用trim_galore进行过滤后如下;

43G Jul  1 10:22 control_1_val_1.fq.gz
44G Jul  1 10:22 control_2_val_2.fq.gz
43G Jul  1 10:29 F03_1_val_1.fq.gz
45G Jul  1 10:29 F03_2_val_2.fq.gz
26G Jul  1 03:18 F05_1_val_1.fq.gz
27G Jul  1 03:18 F05_2_val_2.fq.gz

看起来没有啥太多数据被过滤,说明作者此次测序数据质量还不错!

小鼠WGS数据分析准备工作

一般来说,可以选择最新版小鼠参考基因组(mm10)了,如果你实在有其它需求,也可以自行选择其它版本。

遗憾的是GATK官网不提供小鼠相关资源:ftp://ftp.broadinstitute.org/bundle

所以只能去illumina官网下载咯: https://support.illumina.com/sequencing/sequencing_software/igenome.html

mkdir -p ~/biosoft/GATK/resources/bundle/mm10 
cd ~/biosoft/GATK/resources/bundle/mm10 
wget ftp://igenome:[email protected]/Mus_musculus/UCSC/mm10/Mus_musculus_UCSC_mm10.tar.gz 
nohup tar zxvf Mus_musculus_UCSC_mm10.tar.gz &

下载全部成功后,应该如下:

这个 Mus_musculus_UCSC_mm10.tar.gz &压缩包里面的数据文件实在是太多,我没办法列出来秀给大家了。

如果有要做 RealignerTargetCreator 需要下载dbsnp文件,参考: https://www.biostars.org/p/182917/

这个教程里面提到的文件非常之多,可以ftp登录看看,用户名就用guest即可。

cd ~/biosoft/GATK/resources/bundle/mm10 
mkdir dbsnp
cd dbsnp 
ftp ftp-mouse.sanger.ac.uk
cd current_indels/strain_specific_vcfs/
ls -lh  
lcd ./
prompt off
mget **

可以下载得到个小鼠品系的vcf文件如下:

 225.0M May  1  2015 129P2_OlaHsd.mgp.v5.snps.dbSNP142.vcf.gz
 201.7M May  1  2015 129S1_SvImJ.mgp.v5.snps.dbSNP142.vcf.gz
 206.7M May  1  2015 129S5SvEvBrd.mgp.v5.snps.dbSNP142.vcf.gz
 190.8M May  1  2015 A_J.mgp.v5.snps.dbSNP142.vcf.gz
 193.9M May  1  2015 AKR_J.mgp.v5.snps.dbSNP142.vcf.gz
 177.7M May  1  2015 BALB_cJ.mgp.v5.snps.dbSNP142.vcf.gz
 165.8M May  1  2015 BTBR_T__Itpr3tf_J.mgp.v5.snps.dbSNP142.vcf.gz
 199.1M May  1  2015 BUB_BnJ.mgp.v5.snps.dbSNP142.vcf.gz
 167.2M May  1  2015 C3H_HeH.mgp.v5.snps.dbSNP142.vcf.gz
 194.8M May  1  2015 C3H_HeJ.mgp.v5.snps.dbSNP142.vcf.gz
  16.5M May  1  2015 C57BL_10J.mgp.v5.snps.dbSNP142.vcf.gz
   6.8M May  1  2015 C57BL_6NJ.mgp.v5.snps.dbSNP142.vcf.gz
 101.6M May  1  2015 C57BR_cdJ.mgp.v5.snps.dbSNP142.vcf.gz
 100.2M May  1  2015 C57L_J.mgp.v5.snps.dbSNP142.vcf.gz
 108.4M May  1  2015 C58_J.mgp.v5.snps.dbSNP142.vcf.gz
 749.4M May  1  2015 CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz
 198.0M May  1  2015 CBA_J.mgp.v5.snps.dbSNP142.vcf.gz
 192.9M May  1  2015 DBA_1J.mgp.v5.snps.dbSNP142.vcf.gz
 197.9M May  1  2015 DBA_2J.mgp.v5.snps.dbSNP142.vcf.gz
 196.8M May  1  2015 FVB_NJ.mgp.v5.snps.dbSNP142.vcf.gz
 206.4M May  1  2015 I_LnJ.mgp.v5.snps.dbSNP142.vcf.gz
 216.4M May  1  2015 KK_HiJ.mgp.v5.snps.dbSNP142.vcf.gz
 234.6M May  1  2015 LEWES_EiJ.mgp.v5.snps.dbSNP142.vcf.gz
 206.0M May  1  2015 LP_J.mgp.v5.snps.dbSNP142.vcf.gz
 698.2M May  1  2015 MOLF_EiJ.mgp.v5.snps.dbSNP142.vcf.gz
 204.6M May  1  2015 NOD_ShiLtJ.mgp.v5.snps.dbSNP142.vcf.gz
 201.3M May  1  2015 NZB_B1NJ.mgp.v5.snps.dbSNP142.vcf.gz
 202.8M May  1  2015 NZO_HlLtJ.mgp.v5.snps.dbSNP142.vcf.gz
 212.4M May  1  2015 NZW_LacJ.mgp.v5.snps.dbSNP142.vcf.gz
 728.8M May  1  2015 PWK_PhJ.mgp.v5.snps.dbSNP142.vcf.gz
 203.2M May  1  2015 RF_J.mgp.v5.snps.dbSNP142.vcf.gz
 181.4M May  1  2015 SEA_GnJ.mgp.v5.snps.dbSNP142.vcf.gz
   1.5G May  1  2015 SPRET_EiJ.mgp.v5.snps.dbSNP142.vcf.gz
 197.1M May  1  2015 ST_bJ.mgp.v5.snps.dbSNP142.vcf.gz
 271.7M May  1  2015 WSB_EiJ.mgp.v5.snps.dbSNP142.vcf.gz
 267.7M May  1  2015 ZALENDE_EiJ.mgp.v5.snps.dbSNP142.vcf.gz

这些vcf文件的理解,需要对小鼠这个实验动物背景有一点了解,实际上这个时候我们需要的vcf文件应该是来自于dbSNP数据库的,需要需要的是dbsnp的rs ID号。 ftp://ftp.ncbi.nih.gov/snp/

cd ~/biosoft/GATK/resources/bundle/mm10/dbsnp
wget ftp://ftp.ncbi.nih.gov/snp/organisms/archive/mouse_10090/VCF/00-All.vcf.gz 
wget ftp://ftp.ncbi.nih.gov/snp/organisms/archive/mouse_10090/VCF/00-All.vcf.gz.tbi

下载的vcf文件也需要仔细理解哦:

##fileformat=VCFv4.0
##source=dbSNP
##dbSNP_BUILD_ID=150
##reference=GCF_000001635.24
##variationPropertyDocumentationUrl=ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf
##INFO=<ID=RSPOS,Number=1,Type=Integer,Description="Chr position reported in dbSNP">
##INFO=<ID=RV,Number=0,Type=Flag,Description="RS orientation is reversed">
##INFO=<ID=VP,Number=1,Type=String,Description="Variation Property.  Documentation is at ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf">
##INFO=<ID=GENEINFO,Number=.,Type=String,Description="Pairs each of gene symbol:gene id.  The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)">
##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="First dbSNP Build for RS">
##INFO=<ID=SAO,Number=1,Type=Integer,Description="Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both">
##INFO=<ID=VC,Number=1,Type=String,Description="Variation Class">

GATK流程

gatk只是上游流程,

touch config.sra config.raw config.clean config.bam readme.txt 
mkdir -p  {sra,raw_fastq,clean_fastq,alignment,somatic,germline,cnv,logs,qc,ccf,qualimap}
mkdir -p somatic/{mutect2,varscan,muse}
mkdir -p cnv/{cnvkit,gatk4,sequenza,gistic2}
mkdir -p ccf/{ABSOLUTE,THetA2,PureCN,PyClone,BubbleTree} 
mkdir -p qc/{align,clean,counts,raw,recal,trim,somatic,germline,cnv,qualimap}

比对以及GATK中间流程的数据太耗费空间,就不保留了,而且非常好内存。

首先看比对成sam后的bam文件,大小是:

 138G Jul  6 23:55 control.bam
 139G Jul  7 00:12 F03.bam
 83G Jul  6 04:47 F05.bam

接下来 25G的内存已经是不够用了,报错如下:

To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Runtime.totalMemory()=19129171968
[Sat Jul 07 05:45:29 CST 2018] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 85.82 minutes.
slurmstepd: *** JOB 164819 ON compute-2-5 CANCELLED AT 2018-07-07T05:44:49 ***
slurmstepd: Exceeded job memory limit
slurmstepd: Job 164819 exceeded memory limit (26735988 > 26214400), being killed

调整到35G的内存,终于可以成功运行,得到的bam文件如下:

123G Jul  9 20:00 control_marked_fixed.bam
125G Jul  9 21:20 F03_marked_fixed.bam
80G Jul  9 16:44 F05_marked_fixed.bam

接下来的重头戏是 GATK的BaseRecalibratorHaplotypeCaller,可以参考:https://www.jianshu.com/p/49d035b121b8

for sample in {control,F03,F05};
do 
echo $sample
ls -lh ${sample}_marked_fixed.bam
$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./"  BaseRecalibrator \
    -R $ref  \
    -I ${sample}_marked_fixed.bam  \
    --known-sites $snp \
    --known-sites $indel \
    -O ${sample}_recal.table \
    1>${sample}_log.recal 
 
$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./"   ApplyBQSR \
    -R $GENOME  \
    -I ${sample}_marked_fixed.bam  \
    -bqsr ${sample}_recal.table \
    -O ${sample}_bqsr.bam \
    1>${sample}_log.ApplyBQSR 
    
## 使用GATK的HaplotypeCaller命令
$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./" HaplotypeCaller \
     -R $ref  \
     -I ${sample}_bqsr.bam \
      --dbsnp $snp \
      -O ${sample}_raw.vcf \
      1>${sample}_log.HC 

done  

其实这样的shell脚本是很烂的, 因为这个小鼠全基因组数据太大,如果这样一个步骤接着一个步骤,一个样本接着一个样本的运行,大半个月就过去了。

而且上面的数据都是非常大的,可以选择小批量数据进行测试脚本:

mkdir test 
samtools view -h control_marked_fixed.bam |head -100000 |samtools sort -o test/control_marked_fixed.bam -
cd test 
module load java/1.8.0_91
GENOME=/home/jianmingzeng/biosoft/GATK/resources/bundle/mm10/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa
ref=$GENOME
INDEX=/home/jianmingzeng/biosoft/GATK/resources/bundle/mm10/Mus_musculus/UCSC/mm10/Sequence/BWAIndex/genome.fa
GATK=/home/jianmingzeng/biosoft/GATK/gatk-4.0.2.1/gatk
dbsnpPath=/home/jianmingzeng/biosoft/GATK/resources/bundle/mm10/dbsnp
snp1=${dbsnpPath}/129S5SvEvBrd.mgp.v5.snps.dbSNP142.vcf.gz
snp2=${dbsnpPath}/C57BL_6NJ.mgp.v5.snps.dbSNP142.vcf.gz
snp3=${dbsnpPath}/I_LnJ.mgp.v5.snps.dbSNP142.vcf.gz
indel1=${dbsnpPath}/129S5SvEvBrd.mgp.v5.indels.dbSNP142.normed.vcf.gz
indel2=${dbsnpPath}/C57BL_6NJ.mgp.v5.indels.dbSNP142.normed.vcf.gz
indel3=${dbsnpPath}/I_LnJ.mgp.v5.indels.dbSNP142.normed.vcf.gz
sample=control
$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./"  BaseRecalibrator \
    -R $ref  \
    -I ${sample}_marked_fixed.bam  \
    --known-sites $snp1 --known-sites $snp2 --known-sites $snp3 \
    --known-sites $indel1 --known-sites $indel2 --known-sites $indel3 \
    -O ${sample}_recal.table \
    1>${sample}_log.recal 

尤其值得注意的是染色体标记错误:

A USER ERROR has occurred: Input files reference and features have incompatible contigs: No overlapping contigs found.
  reference contigs = [chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY]
  features contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 3, 4, 5, 6, 7, 8, 9, MT, X, Y]

也就是说我们给的vcf文件里面的染色体是没有chr这个前缀,可是我们给的参考基因组里面却有这个前缀。

GATK的gvcf模式

这里有3只小鼠需要进行基因型比较,所以gvcf模式比较合适,使用 GATK的 Joint Calling , 过滤参考:https://mp.weixin.qq.com/s/W8Vfv1WmW6M7U0tIcPtlng

因为学校服务器最近总是出问题,所以这个流程只能是半成品,待续。

module load java/1.8.0_91
GENOME=/home/jianmingzeng/biosoft/GATK/resources/bundle/mm10/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa
INDEX=/home/jianmingzeng/biosoft/GATK/resources/bundle/mm10/Mus_musculus/UCSC/mm10/Sequence/BWAIndex/genome.fa
GATK=/home/jianmingzeng/biosoft/GATK/gatk-4.0.2.1/gatk

# for bam in  /home/jianmingzeng/data/project/human_BRCA/alignment/*_recal.bam
cat config |while read bam;
do
  if((i%$1==$2))
        then
sample=$(basename "$bam" _marked_fixed.bam)
echo $sample
$GATK  --java-options "-Xmx15G -Djava.io.tmpdir=./"   HaplotypeCaller  \
-ERC GVCF   -R $GENOME -I $bam -O  ${sample}_raw.gvcf

        fi
        i=$((i+1))
done

得到的gvcf文件如下:

12G Aug 24 09:07 control_raw.gvcf
12G Aug 24 12:35 F03_raw.gvcf
37G Aug 23 03:55 F05_raw.gvcf
40G Aug 22 13:58 S1081_raw.gvcf
33G Aug 22 04:14 S1082_raw.gvcf
34G Aug 22 10:29 S1083_raw.gvcf
32G Aug 22 03:10 S1084_raw.gvcf
32G Aug 23 19:05 S1085_raw.gvcf
29G Aug 24 00:18 S1086_raw.gvcf
36G Aug 26 10:00 S1089_raw.gvcf

然后使用GenomicsDBImport加上GenotypeGVCFs命令合并gvcf信息,得到最后的vcf文件记录着所有小鼠跟参考基因组不一样的位点信息。

用Delly检测SV

代码很简单,如下:

export OMP_NUM_THREADS=5
excl=/home/jianmingzeng/biosoft/delly/delly/excludeTemplates/human.hg38.excl.tsv
genome=/home/jianmingzeng/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta
cat bam_path.txt |while read bam
do
echo $bam
sample=$(basename $bam "_marked_fixed.bam" )
echo $sample
    if((i%$2==$3))
    then
        ~/biosoft/delly/delly_v0.7.6_linux_x86_64bit call -t DEL -g $genome -o ${sample}.DEL.bcf -x $excl $bam
        ~/biosoft/delly/delly_v0.7.6_linux_x86_64bit call -t DUP -g $genome -o ${sample}.DUP.bcf -x $excl $bam
        ~/biosoft/delly/delly_v0.7.6_linux_x86_64bit call -t INV -g $genome -o ${sample}.INV.bcf -x $excl $bam
        ~/biosoft/delly/delly_v0.7.6_linux_x86_64bit call -t TRA -g $genome -o ${sample}.TRA.bcf -x $excl $bam
        ~/biosoft/delly/delly_v0.7.6_linux_x86_64bit call -t INS -g $genome -o ${sample}.INS.bcf -x $excl $bam
    fi
    i=$((i+1))
done
# ls *bcf|while read id ;do bcftools view $id >${id%%.*}.vcf ;done

但是运行时间很长。

scalpel软件找indel