
Sequence Data Conversion for ALLPATHS-LG

Before you can use ALLPATHS-LG, your sequencing data must be converted and have metadata added to it. The package includes several conversion tools to assist with this; they handle data in the following formats:

  1. BAM format
  2. FASTQ format (uncompressed or .gz compressed)
  3. FASTA format with accompanying QUALA-formatted files

Depending on the genome size and the specific format that the files are in, it can take a substantial amount of time to convert and prepare the data for ALLPATHS-LG.

The sequence data conversion process is discussed in detail in the ALLPATHS-LG manual. The sections below describe how to reduce the time spent converting your data on the PSC Blacklight system by using Lustre file striping, $SCRATCH_RAMDISK, and parallel processing.

Preparing Data on Blacklight

First, stripe the read data on $SCRATCH

Place your read data in your $SCRATCH directory in a "striped" manner. (By default, this "file striping" will not happen.) To stripe your files, you must tell the system how to lay out your file data with the lfs setstripe command. The easiest way to do this is to:

  1. Create a new $SCRATCH subdirectory, e.g.
    mkdir $SCRATCH/my_genome_reads
  2. Use the lfs setstripe command to set the layout of all the data that will be placed in the new $SCRATCH subdirectory, e.g.
    lfs setstripe $SCRATCH/my_genome_reads -s 1m -c 16
  3. Place the read data into the striped directory (either through cp, scp, or sftp). All the files and new subdirectories created in the striped directory will inherit the striping layout of the parent directory.
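
Putting the three steps together might look like the following; the source directory $HOME/raw_reads and the file names are placeholders for your own data:

mkdir $SCRATCH/my_genome_reads
lfs setstripe $SCRATCH/my_genome_reads -s 1m -c 16
# Files copied into the directory inherit its 1 MB, 16-stripe layout.
cp $HOME/raw_reads/*.fastq.gz $SCRATCH/my_genome_reads/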

To check that your data has been striped after you have placed it in the striped directory, use the lfs getstripe command, e.g.

lfs getstripe $SCRATCH/my_genome_reads/read1.fastq

The returned parameter lmm_stripe_count should be the same as the stripe count (-c parameter) used to stripe the directory. Using a stripe size of 1 MB (-s 1m) and a stripe count of 16 (-c 16) is a good starting point for most read data.
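
To spot-check every file in the directory at once, a small loop such as the following can be used (the directory name matches the example above):

for f in $SCRATCH/my_genome_reads/*
do
    echo "$f"
    lfs getstripe "$f" | grep lmm_stripe_count
done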

 

Next, use $SCRATCH_RAMDISK if possible for the data conversion

Using Blacklight's RAMDISK feature can substantially speed up the writing of the reformatted data. However, because RAMDISK space is limited, the data formatting strategy to use ultimately depends on the size of the read data.

After your read data has been striped, determine how much space it consumes in the $SCRATCH directory. One way to do this is with the du command. For example,

du -s --block-size=1G $SCRATCH/my_genome_reads

will tell you how many GB of data is in the $SCRATCH/my_genome_reads directory.
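
The output is a single line giving the size in GB followed by the directory path; for example (hypothetical size and path):

453     /path/to/scratch/my_genome_reads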

 

If the number reported by the du command is sufficiently small (less than roughly 480 GB of compressed FASTQ files), a one-step strategy using $SCRATCH_RAMDISK and the "PrepareAllPathsInputs.pl" script can be used. This strategy consists of three basic steps:

  1. Read the raw sequence read data that has been striped (per above) from $SCRATCH
  2. Use $SCRATCH_RAMDISK for writing all files generated by the scripts called during the reformatting procedure
  3. Copy the data from $SCRATCH_RAMDISK to a new striped directory on $SCRATCH

To do the file conversion, run the "PrepareAllPathsInputs.pl" script in parallel by setting the parameter "HOSTS" (i.e., the number of processes forked on the local computer, which in this case is Blacklight) to a number greater than one, but no greater than 15. In general, using more than 15 hosts for the data conversion does not improve, and in some cases degrades, the overall performance of the conversion procedure.

The following table can be used as a set of general guidelines for parameters in your PBS job file and for the "PrepareAllPathsInputs.pl" script based on the total size of the data that you are converting:

Data Size      PBS     PBS        dplace -c
(.fastq.gz)    ncpus   walltime   value
------------   -----   --------   ---------
      < 30 GB     16     60 min   0-15:1
 30 -  60 GB      32    120 min   0-31:2
 60 - 120 GB      64    240 min   0-63:4
120 - 240 GB     128    480 min   0-127:8
240 - 480 GB     256    960 min   0-255:16

The actual parameters required for your data conversion may vary from these guidelines. For example, the walltime needed for the reformatting procedure will vary from run to run depending on the total amount of I/O being performed on Blacklight at the time the reformatting takes place. You may therefore wish to add a margin of safety by requesting the next-highest walltime listed above, particularly if your dataset is near the upper end of a range in the table.

An example PBS script implementing this one-step strategy, running the "PrepareAllPathsInputs.pl" script in parallel, is below:

#!/bin/bash
#PBS -l ncpus=16
#PBS -l walltime=60:00
#PBS -j oe
#PBS -q batch
#PBS -N APFMT1
#
ja
set -x
source /usr/share/modules/init/bash
module load allpaths-lg
ulimit -s 3000000
#
# $SCRATCH/my_APformat_reads: is the striped directory where we
# are going to copy the allpaths reformatted reads
#
# $SCRATCH_RAMDISK/my_APformat_reads: is the ramdisk directory
# where we are going to be doing the reformatting.
#
mkdir $SCRATCH/my_APformat_reads
lfs setstripe $SCRATCH/my_APformat_reads -s 1m -c 16
mkdir $SCRATCH_RAMDISK/my_APformat_reads
cd $SCRATCH_RAMDISK/my_APformat_reads
mkdir data
mkdir data/read_cache
#
# Stage in_groups.csv and in_libs.csv to ramdisk
#
cp $SCRATCH/in_groups.csv .
cp $SCRATCH/in_libs.csv .
#
dplace -c 0-15:1 PrepareAllPathsInputs.pl\
 DATA_DIR=$PWD/data\
 PLOIDY=2\
 IN_GROUPS_CSV=in_groups.csv\
 IN_LIBS_CSV=in_libs.csv\
 GENOME_SIZE=2500000000\
 OVERWRITE=True\
 HOSTS=15\
 | tee prepare.out
#
# See how much data was created, copy data from ramdisk, and print ja
# report
#
du -s --block-size=1G $SCRATCH_RAMDISK
cp -r $SCRATCH_RAMDISK/my_APformat_reads/* $SCRATCH/my_APformat_reads
ja -clst
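
Assuming the script above has been saved to a file, say prepare_reads.job (a placeholder name), it can be submitted to the batch system and monitored with the usual PBS commands:

qsub prepare_reads.job
qstat -u $USER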

If the number reported by the du command is too large for $SCRATCH_RAMDISK, using the one-step "PrepareAllPathsInputs.pl" script is not recommended. Instead, the recommended approach is the three-script procedure described in more detail in the "ALLPATHS Cache for power users" section of the manual. The general three-step procedure is as follows:

  1. Run the "CacheLibs.pl" script; this generally takes a trivial amount of time.
  2. Run the "CacheGroups.pl" script; this generally takes a substantial amount of time but can be run in parallel.
  3. Run the "CacheToAllPathsInputs.pl" script; this generally takes a moderate amount of time.

This three-script procedure is equivalent to the one-script "PrepareAllPathsInputs.pl" procedure outlined above. An example PBS script implementing it is below:

#!/bin/bash
#PBS -l ncpus=16
#PBS -l walltime=30:00
#PBS -j oe
#PBS -q batch
#PBS -N APFMT3
#
ja
set -x
source /usr/share/modules/init/bash
module load allpaths-lg
ulimit -s 3000000
#
# $SCRATCH/my_APformat_reads: is the striped directory where we
# are going to copy the allpaths reformatted reads
#
# $SCRATCH_RAMDISK/my_APformat_reads: is the ramdisk directory
# where we are going to be doing the reformatting.
#
mkdir $SCRATCH/my_APformat_reads
lfs setstripe $SCRATCH/my_APformat_reads -s 1m -c 16
mkdir $SCRATCH_RAMDISK/my_APformat_reads
cd $SCRATCH_RAMDISK/my_APformat_reads
mkdir data
mkdir data/read_cache
#
# Stage in_groups.csv and in_libs.csv to ramdisk
#
cp $SCRATCH/in_groups.csv .
cp $SCRATCH/in_libs.csv .
#
#
date
CacheLibs.pl\
 CACHE_DIR=$PWD/data/read_cache\
 ACTION=Add\
 IN_LIBS_CSV=in_libs.csv\
 OVERWRITE=1\
 DRY_RUN=0\
 VERBOSE=1\
 | tee prepare1.out
#
date
CacheGroups.pl\
 CACHE_DIR=$PWD/data/read_cache\
 ACTION=Add\
 PICARD_TOOLS_DIR=\
 IN_GROUPS_CSV=in_groups.csv\
 INCLUDE_NON_PF_READS=1\
 PHRED_64=0\
 FORCE_PHRED=0\
 OVERWRITE=1\
 SAVE_INTERMEDIATES=0\
 TMP_DIR=\
 HOSTS=15\
 JAVA_MEM_GB=8\
 DRY_RUN=0\
 VERBOSE=1\
 | tee prepare2.out
date
CacheToAllPathsInputs.pl\
 CACHE_DIR=$PWD/data/read_cache\
 IN_GROUPS_CSV=in_groups.csv\
 PLOIDY=2\
 DATA_DIR=$PWD/data\
 GENOME_SIZE=2500000000\
 FRAG_COVERAGE=\
 FRAG_FRAC=\
 JUMP_COVERAGE=\
 JUMP_FRAC=\
 LONG_JUMP_COVERAGE=\
 LONG_JUMP_FRAC=\
 LONG_JUMP_MIN_SIZE=20000\
 LONG_READ_MIN_LEN=500\
 DRY_RUN=0\
 VERBOSE=1\
 | tee prepare3.out
date
#
# See how much data was created, copy data from ramdisk, and print ja
# report
#
du -s --block-size=1G $SCRATCH_RAMDISK
cp -r $SCRATCH_RAMDISK/my_APformat_reads/* $SCRATCH/my_APformat_reads
ja -clst

The advantage of the three-script procedure is that each of the steps can be performed separately, and data can be staged into and out of $SCRATCH_RAMDISK before and after each of the three scripts, minimizing the amount of $SCRATCH_RAMDISK space needed for the data conversion.

For performance and RAMDISK usage savings on large datasets, the "CacheGroups.pl" and "CacheToAllPathsInputs.pl" scripts should be set to read striped data from $SCRATCH and write to $SCRATCH_RAMDISK. Running each of these two scripts in its own independent PBS job is also recommended.
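
As a rough illustration, the sketch below runs the "CacheGroups.pl" step as its own PBS job: the read cache produced by the earlier "CacheLibs.pl" step is staged from striped $SCRATCH into $SCRATCH_RAMDISK, "CacheGroups.pl" runs there, and the updated cache is copied back out. The directory names, ncpus, and walltime are placeholders to be adjusted for your data (see the table above); the script parameters shown are the same ones used in the full example above.

#!/bin/bash
#PBS -l ncpus=64
#PBS -l walltime=240:00
#PBS -j oe
#PBS -q batch
#PBS -N APCGRP
#
source /usr/share/modules/init/bash
module load allpaths-lg
ulimit -s 3000000
#
# Stage the read cache written by the earlier CacheLibs.pl job from the
# striped $SCRATCH directory into the ramdisk working directory.
#
mkdir $SCRATCH_RAMDISK/my_APformat_reads
cd $SCRATCH_RAMDISK/my_APformat_reads
mkdir -p data/read_cache
cp -r $SCRATCH/my_APformat_reads/data/read_cache/* data/read_cache
cp $SCRATCH/in_groups.csv .
#
CacheGroups.pl\
 CACHE_DIR=$PWD/data/read_cache\
 ACTION=Add\
 IN_GROUPS_CSV=in_groups.csv\
 INCLUDE_NON_PF_READS=1\
 PHRED_64=0\
 OVERWRITE=1\
 HOSTS=15\
 JAVA_MEM_GB=8\
 DRY_RUN=0\
 VERBOSE=1\
 | tee cachegroups.out
#
# Copy the updated cache back to the striped $SCRATCH directory.
#
cp -r data/read_cache/* $SCRATCH/my_APformat_reads/data/read_cache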

It is also possible to reformat data without using RAMDISK. In general, reformatting that does not use $SCRATCH_RAMDISK performs poorly, so this method should be reserved for unusual data that cannot be formatted any other way, and it should generally be limited to running the "CacheToAllPathsInputs.pl" script reading from and writing to striped $SCRATCH directories.
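
For reference, a minimal sketch of this no-RAMDISK approach is below; the directory names are placeholders, and both the read cache and the output DATA_DIR sit on striped $SCRATCH:

mkdir $SCRATCH/my_APformat_reads_nr
lfs setstripe $SCRATCH/my_APformat_reads_nr -s 1m -c 16
cd $SCRATCH/my_APformat_reads_nr
mkdir data
CacheToAllPathsInputs.pl\
 CACHE_DIR=$SCRATCH/my_APformat_reads/data/read_cache\
 IN_GROUPS_CSV=$SCRATCH/in_groups.csv\
 PLOIDY=2\
 DATA_DIR=$PWD/data\
 GENOME_SIZE=2500000000\
 DRY_RUN=0\
 VERBOSE=1\
 | tee cache_to_inputs.out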
