# TOIL-VG
## University of California, Santa Cruz Genomics Institute
### Please contact us on [GitHub with any issues](https://github.com/BD2KGenomics/toil-vg/issues/new)
[vg](https://github.com/vgteam/vg) is a toolkit for DNA sequence analysis using variation graphs. Toil-vg is a [toil](https://github.com/BD2KGenomics/toil)-based framework for running common vg pipelines at scale, either locally or on a distributed computing environment:
* `toil-vg construct`: Create vg graph from FASTA and VCF, constructing contigs in parallel.
* `toil-vg run`: Given input vg graph(s), create indexes, map reads, then produce VCF variant calls.
* `toil-vg index`: Produce a GCSA and/or XG index from input graph(s).
* `toil-vg map`: Produce a graph alignment (GAM) for each chromosome from input reads and index.
* `toil-vg call`: Produce VCF from input XG index and GAM(s).
## Installation
### Local TOIL-VG Pip Installation
Installation requires Python and Toil. We recommend installing within a virtualenv as follows:

    virtualenv toilvenv
    source toilvenv/bin/activate
    pip install toil[aws,mesos]==3.13.0
    pip install --pre toil-vg
## WIKI
See the [Wiki](https://github.com/vgteam/toil-vg/wiki) in addition to below for examples.
### Docker
toil-vg can run vg, along with some other tools, via [Docker](http://www.docker.com). Docker can be installed locally (not required when running via cgcloud), as follows.
* [**Linux Docker Installation**](https://docs.docker.com/engine/installation/linux/): If running `docker version` doesn't work, try adding your user to the `docker` group with `sudo usermod -aG docker $USER`, then log out and back in.
* [**Mac Docker Installation**](https://docs.docker.com/docker-for-mac/): If running `docker version` doesn't work, try adding docker environment variables: `docker-machine start; docker-machine env; eval "$(docker-machine env default)"`
* **Running Without Docker**: If Docker is not installed or is disabled with `--container None`, toil-vg requires the following command line tools to be installed on the system: `vg, pigz, bcftools, tabix`. `jq, samtools and rtg vcfeval` are also necessary for certain tests.
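
When running without Docker, it can help to confirm up front that the required tools are on the `PATH`. The following is a minimal sketch and not part of toil-vg; the function names `missing_tools` and `check_toil_vg_tools` are hypothetical, and the tool list is the one given above.

```shell
# Print the name of every tool given as an argument that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the tools listed above as required when Docker is not used.
# Returns non-zero and reports on stderr if any of them is missing.
check_toil_vg_tools() {
    missing="$(missing_tools vg pigz bcftools tabix)"
    if [ -n "$missing" ]; then
        printf 'Missing required tools:\n%s\n' "$missing" >&2
        return 1
    fi
}
```

Calling `check_toil_vg_tools` before a run with `--container None` gives an early, readable failure instead of an error partway through a pipeline.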
## Configuration
A configuration file can be used as an alternative to most command line options. A default configuration file can be generated using:

    toil-vg generate-config > config.yaml
Pass this file to `toil-vg` commands using the `--config` option.
For non-trivial inputs, care must be taken to specify the resource requirements for the different pipeline phases (via the command line or by editing the config file), as they all default to a single core and 4 GB of RAM.
To generate a default configuration for running at genome scale on a cluster with 32-core worker nodes, use:

    toil-vg generate-config --whole_genome > config_wg.yaml
## Testing
    make test

A faster test to see if toil-vg runs on the current machine (replace `myname` with a unique prefix):

    ./scripts/bakeoff.sh -f myname f1.tsv

Or on a Toil cluster:

    ./scripts/bakeoff.sh -fm myname f1.tsv

In both cases, verify that f1.tsv contains a number (should be approx. 0.9). Note that this script will create some directories (or S3 buckets) of the form `myname-bakeoff-out-store-brca1` and `myname-bakeoff-job-store-brca1`. These will have to be manually removed.
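
The check on `f1.tsv` and the names of the leftover stores can be scripted. This is a sketch under stated assumptions: it assumes the score is the first whitespace-separated field on the first line of the file (the actual layout may differ), and both function names are hypothetical.

```shell
# Succeed if the first field on line 1 of FILE ($1) is a plain decimal
# number that is >= THRESHOLD ($2).
# ASSUMPTION: the score sits in the first field of the first line.
f1_at_least() {
    awk -v threshold="$2" '
        NR == 1 && $1 ~ /^[0-9]+(\.[0-9]+)?$/ && $1 + 0 >= threshold + 0 { ok = 1 }
        END { exit !ok }
    ' "$1"
}

# Print the two store names the bakeoff script creates for prefix $1,
# following the naming pattern described above.
bakeoff_store_names() {
    printf '%s-bakeoff-out-store-brca1\n' "$1"
    printf '%s-bakeoff-job-store-brca1\n' "$1"
}
```

For example, `f1_at_least f1.tsv 0.85 && echo "F1 looks reasonable"` (0.85 is an arbitrary stand-in for "approx. 0.9"), and `bakeoff_store_names myname` lists the directories or buckets to remove afterwards.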
## A Note on IO conventions
The jobStore and outStore arguments to toil-vg are directories that will be created if they do not already exist. When starting a new job, Toil will complain if the jobStore already exists, so run `toil clean <jobStore>` first. When running on Mesos, these stores should be S3 buckets, specified in the format `aws:region:bucket` (see examples below).
All other input files can be either local (an absolute path is best) or URLs specified in the usual manner, e.g. `http://address/input_file` or `s3://bucket/input_file`. The config file must always be local. When using an S3 jobStore, it is preferable to pass input files from S3 as well, since they load much faster and less cluster time is wasted importing data.
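
These conventions are easy to get wrong when assembling long command lines, so a couple of small helpers may be useful. This is only an illustrative sketch; `aws_store` and `input_kind` are hypothetical names and are not provided by toil-vg or Toil.

```shell
# Build a store locator in the aws:region:bucket format.
#   $1 = AWS region, $2 = bucket name (without any s3:// prefix)
aws_store() {
    printf 'aws:%s:%s\n' "$1" "$2"
}

# Classify an input argument as a URL, an absolute local path,
# or a relative local path (absolute paths are the safer choice).
input_kind() {
    case "$1" in
        *://*) echo url ;;
        /*)    echo local-absolute ;;
        *)     echo local-relative ;;
    esac
}
```

For example, `aws_store us-west-2 my-job-store` prints `aws:us-west-2:my-job-store`, and `input_kind reads.fastq.gz` prints `local-relative`, a hint to switch to an absolute path.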
## Running on Amazon EC2 with Toil
### Install Toil
Please read Toil's [installation documentation](http://toil.readthedocs.io/en/latest/install/basic.html).
Install Toil locally. This can be done with virtualenv as follows:

    virtualenv ~/toilvenv
    . ~/toilvenv/bin/activate
    pip install toil[aws,mesos]
### Create a leader node
    wget https://raw.githubusercontent.com/BD2KGenomics/toil-vg/master/scripts/create-ec2-leader.sh
    ./create-ec2-leader.sh <leader-name> <keypair-name>

Log into the leader with:

    toil ssh-cluster <leader-name> --zone us-west-2a

In order to log onto a worker node instead of the leader, find its public IP from the EC2 Management Console or command line, and log in using the core username: `ssh core@public-ip`
Destroy the leader when finished with it. After logging out with `exit`:

    toil destroy-cluster <leader-name>
### Small AWS Test
Run a small test from the leader node as follows:

    wget https://raw.githubusercontent.com/BD2KGenomics/toil-vg/master/scripts/bakeoff.sh
    chmod u+x ./bakeoff.sh
    ./bakeoff.sh -fm <NAME>
### Processing a Whole Genome
From the leader node, begin by making a toil-vg configuration file suitable for processing whole genomes, then customize it as necessary:

    toil-vg generate-config --whole_genome > wg.yaml

Toil-vg can be used to construct vg graphs as, for example, [described here](https://github.com/vgteam/vg/wiki/working-with-a-whole-genome-variation-graph). Files will be written to the S3 bucket OUT_STORE, and the S3 bucket JOB_STORE will be used by Toil. Both buckets are created automatically if necessary; do not prefix OUT_STORE or JOB_STORE with `s3://`.

    REF=ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
    VCF=ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz
    MASTER_IP=`ifconfig eth0 |grep "inet addr" |awk '{print $2}' |awk -F: '{print $2}'`
    toil-vg construct aws:us-west-2:JOB_STORE aws:us-west-2:OUT_STORE --fasta $REF --vcf $VCF --config wg.yaml --out_name hs37d5 --batchSystem=mesos --mesosMaster=${MASTER_IP}:5050 --nodeTypes r3.8xlarge:0.85 --maxNodes 8 --provisioner aws --realTimeLogging --logInfo --defaultPreemptable --logFile construct.log --retryCount 3 --regions $(for i in $(seq 1 22; echo X; echo Y); do echo $i; done)
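
The `MASTER_IP` line relies on a specific `ifconfig` output layout. The sketch below wraps the same grep/awk steps in a function that reads from standard input, which makes the parsing easy to try on sample text; the function name `parse_inet_addr` is hypothetical.

```shell
# Read ifconfig-style text on stdin and print the IPv4 address taken from
# the "inet addr:<ip>" line, using the same steps as the MASTER_IP command.
parse_inet_addr() {
    grep "inet addr" | awk '{print $2}' | awk -F: '{print $2}'
}
```

Usage would look like `MASTER_IP=$(ifconfig eth0 | parse_inet_addr)`. If the local `ifconfig` prints the address in a different layout (for instance `inet 10.0.0.5` with no `addr:` label), the result is empty, so it is worth checking that `MASTER_IP` is non-empty before passing it to `--mesosMaster`.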
Indexes can be created above using the `--xg_index` and `--gcsa_index` options (and switching to i2.8xlarge nodes), or by running `toil-vg index` as below:

    MASTER_IP=`ifconfig eth0 |grep "inet addr" |awk '{print $2}' |awk -F: '{print $2}'`
    toil-vg index aws:us-west-2:JOB_STORE aws:us-west-2:OUT_STORE --batchSystem=mesos --mesosMaster=${MASTER_IP}:5050 --graphs $(for i in $(seq 22; echo X; echo Y); do echo s3://OUT_STORE/hs37d5-${i}; done) --chroms $(for i in $(seq 22; echo X; echo Y); do echo $i; done) --realTimeLogging --logInfo --config wg.yaml --index_name my_index --defaultPreemptable --nodeTypes i2.8xlarge:1.00 --maxNodes 5 --provisioner aws 2> index.log

Note that the spot request node type (i2.8xlarge) and amount ($1.00) can be adjusted in the above command. Keep in mind that indexing is very memory and disk intensive.
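
The `$(for i in $(seq 22; echo X; echo Y); do ...; done)` idiom in the commands above expands to one argument per chromosome. As a sketch of what it produces, the same lists can be generated by two small functions; `chrom_names` and `graph_urls` are hypothetical names, and the URL pattern simply mirrors the `s3://OUT_STORE/hs37d5-${i}` form used above.

```shell
# Print chromosome names 1..22, X and Y, one per line.
chrom_names() {
    seq 1 22
    echo X
    echo Y
}

# Print one graph URL per chromosome.
#   $1 = bucket name, $2 = the --out_name used with toil-vg construct
graph_urls() {
    for chrom in $(chrom_names); do
        printf 's3://%s/%s-%s\n' "$1" "$2" "$chrom"
    done
}
```

With these, `--chroms $(chrom_names)` and `--graphs $(graph_urls OUT_STORE hs37d5)` would pass the same 24 arguments in the same order, which keeps each graph aligned with its chromosome name.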
If successful, this will produce four files in s3://OUT_STORE/:

    my_index.xg
    my_index.gcsa
    my_index.gcsa.lcp
    my_index_id_ranges.tsv

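
Before moving on, it may be worth confirming that all of the index outputs are present. The sketch below only generates the expected URLs from the bucket and `--index_name`; the function name `index_outputs` is hypothetical, and the suffix list is taken from the file names above.

```shell
# Print the index file URLs expected in the output store.
#   $1 = bucket name, $2 = the --index_name passed to toil-vg index
index_outputs() {
    for suffix in .xg .gcsa .gcsa.lcp _id_ranges.tsv; do
        printf 's3://%s/%s%s\n' "$1" "$2" "$suffix"
    done
}
```

Each URL printed by `index_outputs OUT_STORE my_index` could then be checked with an S3 client, for example `aws s3 ls <url>`, assuming the AWS CLI is installed and configured on the leader.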
We can now align reads and produce a VCF in a single call to `toil-vg run` (see `toil-vg map` and `toil-vg call` to run these steps separately). The invocation is similar to the above, except that we use r3.8xlarge instances, as we do not need as much disk and memory.
toil-vg run aws:us-west-2:JOB_STORE READ_LOCATION/reads.fastq.gz SAMPLE_NAME aws:us-west-2:OUT_STORE --batchSystem=mesos --mesosMaster=${MASTER_IP}:5050 --gcsa_index s3://OUT_STORE/my_index.gcsa --xg_index s3://