
GATK4 basic usage --20201213

Programmer Article Station 2022-07-10 17:36:46

Contents

  1. Java command basics
  2. Using the gatk wrapper script (recommended)
  3. Adding GATK arguments
  4. Adding Java arguments
  5. Adding Spark arguments
  6. Examples of real commands

1. Java command basics

GATK follows the basic Java command-line syntax:

java -jar program.jar [program arguments]

The core of the command is java -jar program.jar, which starts up the program in a Java Virtual Machine (JVM).

2. Using the gatk wrapper script (recommended)

We provide a launch script that encapsulates the java -jar program.jar part of the command in a single invocation, gatk. There are several reasons for this that we don’t go into in this article (including that there are now two jars included in the package you download), but the upshot is that it makes it possible to add GATK to your PATH variable, and it allows us to build in some autocomplete functionality for convenience.

So the basic command is now:

gatk [program arguments]

3. Adding GATK arguments

The only universally required argument is the name of the GATK tool you want to run. It is a positional argument, so you specify it directly after the gatk bit, like this:

gatk ToolName [tool arguments]

After the tool name, you can specify any arguments in any order, with the appropriate argument name as follows:

gatk ToolName --argument-name value

3.1. Argument naming conventions
The overwhelming majority of argument names follow a "kebab" convention, where the name is prefixed by two dashes (--) and, where applicable, words are separated by single dashes (-). A minority of very commonly used arguments also accept a short name prefixed by a single dash (-). The short name is often a single capital letter.
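
As a small illustration of the two naming styles, the sketch below builds the same command twice, once with full kebab-case names and once with short names. The short names (-R, -I, -O) are the ones used in the examples later in this article; the long names (--reference, --input, --output) are assumed to be their full equivalents, and the file names are placeholders. The script only prints the commands, so it runs without GATK installed.

```shell
#!/bin/sh
# Placeholder file names; substitute your own.
ref=reference.fasta
bam=sample1.bam
out=variants.vcf

# Full kebab-case argument names, each prefixed by two dashes.
long_cmd="gatk HaplotypeCaller --reference $ref --input $bam --output $out"

# Short names for the same arguments, each prefixed by a single dash.
short_cmd="gatk HaplotypeCaller -R $ref -I $bam -O $out"

# Print only; nothing is executed here.
printf '%s\n' "$long_cmd" "$short_cmd"
```

Long names are easier to read in scripts that other people maintain, while short names keep interactive commands compact.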

3.2. Ordering
The ordering of GATK arguments is not important, but we recommend passing required arguments first for consistency. It is also a good idea to consistently order arguments by some kind of logic in order to make it easy to compare different commands over the course of a project. It’s up to you to choose what that logic should be.

3.3. Flags
Flags are arguments that have boolean values, i.e. TRUE or FALSE. They are typically used to enable or disable specific features; for example, --QUIET will suppress some log output. To activate a flag that is set to FALSE by default, all you need to do is add the flag name to the command (no need to specify an actual value). To deactivate a flag that is set to TRUE by default, you need to specify the value as FALSE; for example --create-output-variant-index FALSE will disable automatic variant indexing.
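
The two flag behaviours described above can be sketched in a short script that assembles the optional part of a command from two on/off switches. The variable names quiet and make_index are hypothetical helpers for this example, not GATK settings; the flags themselves are the two named in this section. The script prints the resulting argument string rather than running GATK.

```shell
#!/bin/sh
# Hypothetical switches for this sketch.
quiet=true          # --QUIET is FALSE by default, so the bare name turns it on
make_index=false    # --create-output-variant-index is TRUE by default

flags=""
if [ "$quiet" = true ]; then
    # Activating a default-FALSE flag: the name alone is enough.
    flags="$flags --QUIET"
fi
if [ "$make_index" = false ]; then
    # Deactivating a default-TRUE flag: the value FALSE must be given explicitly.
    flags="$flags --create-output-variant-index FALSE"
fi

# Drop the leading space left by the concatenation above.
flags="${flags# }"
printf '%s\n' "$flags"
```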

4. Adding Java arguments

Normally you would insert any Java-specific arguments (such as -Xmx to specify memory allocation) between the java and -jar bits of the basic Java command like this:

java -Xmx4G -jar program.jar [program arguments]

When you’re using the gatk wrapper syntax (which we strongly recommend), you have to do it a bit differently, like this:

gatk --java-options "-Xmx4G" [program arguments]

To specify multiple Java arguments, just add them to the quoted string like this:

gatk --java-options "-Xmx4G -XX:+PrintGCDetails" [program arguments]

The order of Java arguments inside the quoted string is not important.
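
The quotes around the Java arguments matter: they make the whole string arrive as the single value of --java-options. The sketch below shows this by counting arguments with and without the quotes. The helper count_args and the variable names are hypothetical and exist only for this demonstration; nothing here calls GATK.

```shell
#!/bin/sh
# Hypothetical helper: prints how many arguments it received.
count_args() { printf '%s\n' "$#"; }

mem=4G
java_opts="-Xmx${mem} -XX:+PrintGCDetails"

# Quoted: both JVM options travel together as one value (4 arguments in total).
quoted=$(count_args gatk --java-options "$java_opts" HaplotypeCaller)

# Unquoted: the shell splits the string, giving 5 arguments, so the second
# JVM option would no longer be part of the --java-options value.
unquoted=$(count_args gatk --java-options $java_opts HaplotypeCaller)

printf 'quoted: %s, unquoted: %s\n' "$quoted" "$unquoted"
```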

5. Adding Spark arguments

When you run Spark-capable tools, you may need to specify Spark-specific parameters. These must be appended to the end of your GATK command, after a -- separator, like this:

gatk [GATK arguments] -- [Spark arguments]
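
As a rough sketch of what this looks like in practice, the script below assembles a command for a Spark-capable tool running locally. The tool name PrintReadsSpark, the parameters --spark-runner and --spark-master, and the file names are not taken from this article; treat them as assumptions to check against the documentation of your GATK version. The script only prints the command.

```shell
#!/bin/sh
# Everything before the bare "--" is a GATK argument.
gatk_args="PrintReadsSpark -I input.bam -O output.bam"

# Everything after it is handed to Spark. The single quotes around local[4]
# stop the shell from treating the square brackets as a glob pattern.
spark_args="--spark-runner LOCAL --spark-master 'local[4]'"

cmd="gatk $gatk_args -- $spark_args"
printf '%s\n' "$cmd"
```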

6. Examples of real commands

This is a very simple command that runs HaplotypeCaller in default mode on a single input BAM file containing sequence data and outputs a VCF file containing variant calls.

gatk HaplotypeCaller -R reference.fasta -I sample1.bam -O variants.vcf

Now let’s switch to running HaplotypeCaller in GVCF mode so that we can add multiple samples to our analysis in a scalable way:

gatk HaplotypeCaller -R reference.fasta -I sample1.bam -O variants.g.vcf -ERC GVCF

We can write this same command on multiple lines to make it more readable by using backslashes at the ends of lines:

gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O variants.g.vcf \
    -ERC GVCF

We can add the common Java memory argument -Xmx like this:

gatk --java-options "-Xmx4G" HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O variants.g.vcf \
    -ERC GVCF

If the data is from exome sequencing, we should additionally provide the exome targets using the -L argument:

gatk --java-options "-Xmx4G" HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O variants.g.vcf \
    -ERC GVCF \
    -L exome_intervals.list

Now let’s say we want to add a read filter that deals with some problems in our data:

gatk --java-options "-Xmx4G" HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O variants.g.vcf \
    -ERC GVCF \
    -L exome_intervals.list \
    --read-filter OverclippedReadFilter

If we want to reduce the amount of chatter in the logs, we can turn on the --QUIET setting like this:

gatk --java-options "-Xmx4G" HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O variants.g.vcf \
    -ERC GVCF \
    -L exome_intervals.list \
    --read-filter OverclippedReadFilter \
    --QUIET

And finally, if we want to turn off automatic variant index creation:

gatk --java-options "-Xmx4G" HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O variants.g.vcf \
    -ERC GVCF \
    -L exome_intervals.list \
    --read-filter OverclippedReadFilter \
    --QUIET \
    --create-output-variant-index FALSE
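
Since GVCF mode is meant for analyses with multiple samples, one way to apply the command to several samples is a small loop. The sketch below assumes hypothetical sample names, each with a matching <name>.bam file, and a helper function gvcf_cmd invented for this example. It prints one command per sample instead of running GATK, so the output can be inspected first.

```shell
#!/bin/sh
# Hypothetical helper: prints the GVCF-mode command for one sample name.
gvcf_cmd() {
    printf 'gatk --java-options "-Xmx4G" HaplotypeCaller -R reference.fasta -I %s.bam -O %s.g.vcf -ERC GVCF\n' "$1" "$1"
}

# Hypothetical sample names; replace with your own.
for s in sample1 sample2 sample3; do
    gvcf_cmd "$s"
done
```

To execute the commands rather than print them, the output can be piped to sh once it looks right.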

For more examples of commands and for specific tool command recommendations, see the tool index.