Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files, which can optionally be compressed with gzip.
To see what other modules are needed, what commands are available, and how to get additional help, type
module help seqtk
To see what versions of Seqtk are available type
module avail seqtk
To use Seqtk, include a command like this in your batch script to load the Seqtk module:
module load seqtk
Be sure you also load any other modules needed, as listed by the module help seqtk command.
Terms & Conditions for Use of MATLAB Software on PSC Systems
Carnegie Mellon University, acting through the Pittsburgh Supercomputing Center (PSC) is prepared to grant to the person or entity named below ("you"), permission to use the MATLAB Programs solely for the purpose(s) ("Purposes") identified below, subject to the limitations set forth below. To be granted access to MATLAB, complete this form and click "I AGREE".
This is a legal agreement between you, the user of MATLAB at the Pittsburgh Supercomputing Center ("PSC"), and Carnegie Mellon University. MATLAB Programs are made available for use at the PSC under license, which license carries conditions with which all PSC MATLAB users must comply. By clicking on "I AGREE" you are indicating your understanding and acceptance of the following terms.
- The permission granted to use MATLAB Programs at the PSC is personal and is not to be extended by the undersigned to others.
- Usage of MATLAB Programs at the PSC under this agreement is solely for the purpose of academic course work and teaching, noncommercial academic research, and personal use which is not for any commercial or other organizational use.
- Downloading any MATLAB Programs onto your computer(s) is expressly prohibited.
Trinity Usage on Blacklight
The information in this page was taken directly from the Trinity-use-on-Blacklight wiki page formerly hosted on Wikispaces. It was created by Brian Couger of Oklahoma State University, and we thank him for his time, his expertise, and his permission to use it.
Trinity, developed at the Broad Institute and the Hebrew University of Jerusalem, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-Seq data. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-Seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes. Briefly, the process works like so:
- Inchworm assembles the RNA-Seq data into the unique sequences of transcripts, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts.
- Chrysalis clusters the Inchworm contigs into clusters and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptional complexity for a given gene (or sets of genes that share sequences in common). Chrysalis then partitions the full read set among these disjoint graphs.
- Butterfly then processes the individual graphs in parallel, tracing the paths that reads and pairs of reads take within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that correspond to paralogous genes.
The Blacklight resource is hosted by the Pittsburgh Supercomputing Center (www.psc.edu).
Blacklight is an SGI Altix UV 1000 supercomputer designed for memory-limited scientific applications in fields as different as biology, chemistry, cosmology, machine learning and economics. Funded by the National Science Foundation (NSF), Blacklight carries out this mission with partitions with as much as 16 Terabytes of coherent shared memory.
Blacklight's unique architecture allows computational jobs that require a large amount of memory overhead, such as de novo transcriptome/genomic assemblies, to be completed. The very large amount of addressable RAM allows for very high read density assemblies, many of which would be outside the computational scope of many other HPC systems.
A complete description of Blacklight can be found at:
Obtaining an account
Blacklight is part of the XSEDE program (https://www.xsede.org/), the successor to the TeraGrid. XSEDE is a National Science Foundation-funded collection of HPC resources, services and expertise that allows users to use national HPC infrastructure resources remotely. Instructions for obtaining a user account can be found here: https://www.xsede.org/web/guest/allocations. To qualify, you or a member of your group must be a current researcher in the United States of America, or have a research partner who is currently working in the United States.
Logging on to Blacklight
There are three options for logging on to Blacklight once you have established an XSEDE user account.
1. GSI-SSHTerm (All Systems): This allows you to access and use all XSEDE resources as well as transfer files to the desired resource
2. Putty/WinSCP (Windows) SSH (Linux): Allows usage/file transfer remotely through two separate programs
Putty: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html allows remote log ons
WinSCP: http://winscp.net/eng/download.php allows file transfer
3. XSEDE website (Web Browser): Allows usage/file transfer remotely through a web browser
The host name to use is: blacklight.psc.teragrid.org
Upon log-in you will be prompted to enter a user name and password. Use the XSEDE username and password given to you when you received your XSEDE account.
Running Jobs on Blacklight
A highly detailed explanation of executing computational jobs on Blacklight and advanced usage can be found here: http://www.psc.edu/index.php/computing-resources/blacklight
This section gives a brief overview of the basics for usage of Blacklight and a step by step how-to guide on using Trinity on Blacklight.
Blacklight OS Structure
Blacklight uses a custom Linux based kernel structure for the OS and a PBS-Torque like system for scheduling and managing jobs. Users who have experience with either should be in familiar territory.
Helpful for Linux related questions: http://www.linuxquestions.org/questions/
Blacklight Queue Structure
There are 2 basic queues on Blacklight, the debug queue and the batch queue.
The debug queue has a limit of 30 minutes of wall time and 16 cores maximum, good for ensuring your command line execution arguments are correct. The debug queue is NOT to be used for production runs.
The batch queue is broken into subqueues based on the number of cores and the wall time requested. You submit jobs to the batch queue and they are automatically slotted into the appropriate subqueue based on the resources requested.
- Jobs that ask for 256 or fewer cores can ask for a maximum wall-time of 96 hours.
- Jobs that ask for more than 256 cores, to a maximum of 1440 cores, can ask for a maximum wall-time of 48 hours.
Jobs requesting more than 1440 cores are sent to a separate queue where they receive special handling.
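The batch queue limits above can be written as a small shell function, for example to sanity-check a request before submitting it. This is only a sketch: max_walltime_hours is a hypothetical helper name, not a Blacklight or PBS command, and the thresholds are simply copied from the list above.

```shell
# Hypothetical helper (not part of PBS or Blacklight): print the maximum
# wall-time, in hours, that a batch job may request for a given core count.
max_walltime_hours() {
    ncpus=$1
    if [ "$ncpus" -le 256 ]; then
        echo 96
    elif [ "$ncpus" -le 1440 ]; then
        echo 48
    else
        # Jobs above 1440 cores go to a separate queue with special handling.
        echo "special"
    fi
}

max_walltime_hours 32     # prints 96
max_walltime_hours 512    # prints 48
```

A job script could call such a function before qsub to catch a wall-time request that the scheduler would not accept.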
Memory allocation
The amount of memory that is allocated to your job is determined by the number of cores requested. The 16 cores on each blade share 128 Gbytes of RAM, so a job has access to (cores requested / 16) × 128 Gbytes of RAM: for example, 16 cores give 128 Gbytes and 32 cores give 256 Gbytes. Because there are 16 cores on a blade, and blades cannot be shared among jobs, you must request cores in multiples of 16.
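The relationship between cores and RAM can be checked with a few lines of shell arithmetic. The function below is a sketch under the figures stated above (16 cores and 128 Gbytes per blade); ram_gb_for_cores is a hypothetical name, not an installed command.

```shell
# Hypothetical helper: RAM (in GB) available to a job that requests
# the given number of cores, assuming 16 cores and 128 GB per blade.
ram_gb_for_cores() {
    ncpus=$1
    if [ $((ncpus % 16)) -ne 0 ]; then
        echo "ncpus must be a multiple of 16" >&2
        return 1
    fi
    echo $((ncpus / 16 * 128))
}

ram_gb_for_cores 16   # prints 128
ram_gb_for_cores 32   # prints 256
```

A request that is not a multiple of 16 is rejected here rather than rounded, since the text requires whole blades.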
On Blacklight, Service Unit charges (SUs) are based on the number of cores a job uses. One core-hour is one SU. Because jobs do not share blades, and there are 16 cores on a blade, a one hour job that uses one blade will be charged 16 SUs.
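As a rough illustration of the charging rule, the sketch below multiplies cores by hours. The name su_charge is hypothetical, and the function assumes whole hours for simplicity.

```shell
# Hypothetical helper: Service Units for a job, at one SU per core-hour.
# Arguments: number of cores, number of (whole) hours.
su_charge() {
    ncpus=$1
    hours=$2
    echo $((ncpus * hours))
}

su_charge 16 1    # prints 16   (one blade for one hour)
su_charge 32 95   # prints 3040 (32 cores for 95 hours)
```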
Jobs are executed on Blacklight using the Portable Batch System (PBS/Torque). Users submit jobs to a scheduler, which determines when each job is executed based on a number of factors, including the resources required for the job, the number of jobs the user currently has in the queue, the job's specified wall-time, and how many jobs are currently running. For quickest turnaround, jobs should request only the resources they need.
You must create a job script and submit it to run a job. A number of things are required. The following template script is an example of running Trinity. Each #COMMENT line provides an explanation of the next line of the script.
#!/bin/csh
#COMMENT ncpus must be a multiple of 16; the total RAM for the job is ncpus/16*128 GB
#PBS -l ncpus=32
#COMMENT The duration of time requested for the job, in this case 95 hours
#PBS -l walltime=95:00:00
#COMMENT Combines stdout and stderr in one file
#PBS -j oe
#COMMENT Specifies the queue. Change this to 'debug' to access the debug queue (limit of ncpus=16 and walltime=00:30:00)
#PBS -q batch
#COMMENT Needed to load the module command
source /usr/share/modules/init/csh
#COMMENT Set stacksize to unlimited
limit stacksize unlimited
#COMMENT Move to your $SCRATCH directory, this directory should be where your read files are located
cd $SCRATCH
#COMMENT Load most recent version of Trinity
#COMMENT Run 'module avail trinity' on Blacklight command line to find name of latest Trinity module
#COMMENT (unless need to continue a run started with a different version -- don't switch versions in the middle of an assembly!)
module load trinity
#COMMENT Load latest versions of supporting modules required by Trinity
module load bowtie
module load samtools
#COMMENT Run the Trinity command
Trinity --seqType fq --JM 100G --left reads.left.fq --right reads.right.fq --SS_lib_type RF --CPU 16 > trinity_output.log
MAKE SURE TO REDIRECT TRINITY OUTPUT TO A LOG FILE AS SHOWN ABOVE (> trinity_output.log) OR YOUR JOB WILL LIKELY GET KILLED!!!
If the output goes through the batch system, the job will be killed if the output exceeds 20 MB (which it usually does with Trinity).
Once you have copied the above template script and made the appropriate changes, you can create a job submission file and submit the job to the queue. You can use any text editor you are familiar with to do this.
As an example, here we use the vi editor to create the job submission script. If you don't know how to use vi, see here:
You can create a new file (or open an existing one) with vi by typing this on the Blacklight command line:
vi <your-script-name>
Copy and paste the above script into the vi file and save it on Blacklight: copy the entire script, switch to the open vi session, press i to enter insert mode, right-click to paste, press Esc to leave insert mode, then hold Shift and press the Z key twice to save and exit.
Now you are ready to submit the job by typing
qsub <your-script-name>
To ensure that the job was submitted properly use the command
qstat -f <pbsJobnumber> or qstat -f -u <your-username>
The output will look something like this:
Note that successful submission does not guarantee successful completion. An exit status will be given at the end of the job to designate how the job completed.
A detailed explanation of exit status values can be found here: http://www.clusterresources.com/torquedocs21/2.7jobexitstatus.shtml
During job run-time, the qstat command can be used to check on the status of a job, how much RAM is being used and how close the job is to reaching wall-time:
qstat -f <pbsjobnumber> or
qstat -f -u <your-username>
If at any time you want to cancel a running job, use the qdel command:
qdel <pbsjobnumber>
All files that are needed for execution should be loaded to your $SCRATCH directory. Upon logging in type:
cd $SCRATCH
pwd
The directory given will be the path of your $SCRATCH directory. This directory can store and use large files, unlike your home directory.
user@tg-login1:~> cd $SCRATCH
user@tg-login1:~> pwd
/brashear/user
user@tg-login1:~>
If using GSI-SSHTerm to transfer files, upon logging in go to Tools > SFTP Session.
In the Address box, type in the full path to your $SCRATCH directory where you want to store the files. (Your screen will have your username rather than 'mbcougar'.)
The batch script above requests 32 CPUs (cores) with 256 GB of RAM for 95 hours. This should be enough to run most small to medium Trinity jobs. If your job is small, you may consider using 16 CPUs (cores), which allocates 128 GB of RAM for your job, but be warned that only a limited number of 16 core jobs are allowed to run on the system at once, so turnaround may actually be slower than for 32 core jobs. You can check if your 16 core job is held up by other 16 core jobs by running qstat -s <pbsjobnumber>:
user@tg-login1:~> qstat -s 208539

tg-login1.blacklight.psc.teragrid.org:
                                                                    Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID  NDS  TSK  Memory Time  S Time
-------------------- -------- -------- ---------- ------- ---- ---- ------ ----- - -----
208539.tg-login1     user     batch_r  myjob      --      --   16   --     00:10 Q --
   host bl0.psc.teragrid.org has 7 16 core jobs running...limit is 7
Only 16 core jobs are limited in this fashion. Other jobs will run based on available cores and the number of jobs ahead of yours in the queue.
If your job is large, consider altering the parameters as necessary to accommodate the data. If you believe that you need more wall-time remember that Butterfly can be run separately from Inchworm and Chrysalis (recommended for large data-sets on Blacklight).
Using Interactive Access
Interactive access on Blacklight is possible; however, it should only be used for short debugging jobs. This command will request an interactive session with 16 cores (allocating 128 GB of RAM) and 30 minutes of wall-time:
qsub -I -l ncpus=16 -l walltime=00:30:00 -q debug
This job uses the debug queue, which has a limit of 16 cores for 30 minutes. Larger jobs must be run with a batch script as in the Job Submission section above.
If your job is killed
If you encounter the following error (or one with slightly different numerical values) that causes the job to stop, you did not ask for enough memory. Request more memory (by requesting more cores) and resubmit the job. See the Memory allocation section of this document for details.
PBS: Job killed: cpuset memory_pressure 10562 reached/exceeded limit 1 (numa memused is 134200964 kb)
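In the example message above, the reported memused value of 134200964 kb is roughly 128 GB, which is the RAM of a single 16-core blade. One way to pick a new core count is to round the memory you expect to need up to whole blades. The sketch below assumes 128 GB and 16 cores per blade as stated earlier in this document; cores_for_ram_gb is a hypothetical helper name.

```shell
# Hypothetical helper: smallest core request (a multiple of 16) whose
# blades provide at least the given amount of RAM in GB.
cores_for_ram_gb() {
    gb=$1
    blades=$(( (gb + 127) / 128 ))   # round up to whole 128 GB blades
    echo $((blades * 16))
}

cores_for_ram_gb 128   # prints 16
cores_for_ram_gb 200   # prints 32
```

The result can then be used as the ncpus value in the job script.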
PSC has installed the module software on Blacklight. You can load Trinity and all its dependencies with the module command and execute it anywhere as if it were contained in your path. To see what versions of Trinity are currently installed, type:
module available trinity
user@tg-login1:~> module avail trinity

-------------------------- /usr/local/opt/modulefiles --------------------------
trinity/r2013-08-14   trinity/r2014-04-13p1
trinity/r2013-11-10   trinity/r2014-07-17
user@tg-login1:~>
Choose the version you want, then load it using its specific version number.
module load trinity/version-number
Note: When using interactive access you must load these modules after you have started your Interactive PBS access.
For a look at all programs that can be loaded with module, type:
module avail
After you load the trinity module, all the Trinity commands are available for you to use.
user@tg-login1:~> Trinity
Trinity: Command not found.
user@tg-login1:~> module load trinity
user@tg-login1:~> Trinity
###############################################################################
#
#   [Trinity ASCII-art banner]
#
###############################################################################
#
# Required:
#
#  --seqType :type of reads: ( fa, or fq )
#
#  --JM :(Jellyfish Memory) number of GB of system memory to use for
#        k-mer counting by jellyfish (eg. 10G) *include the 'G' char
#
#  If paired reads:
#      --left :left reads, one or more (separated by space)
#      --right :right reads, one or more (separated by space)
#
#  Or, if unpaired reads:
#      --single :single reads, one or more (note, if single file contains pairs, can use flag: --run_as_paired )
#
#
More information on the module command is found here: http://www.psc.edu/index.php/module
Before running Trinity, set stacksize to unlimited
If you are using bash, type:
ulimit -s unlimited
If you are using csh, type:
limit stacksize unlimited
Move to your $SCRATCH space
Your scratch directory is where all assembly files should be uploaded and where all large outputs should be kept on Blacklight (your $HOME space has a 5 GB quota). To move to your scratch space type:
cd $SCRATCH
If you need the location of this directory to transfer files with either WinSCP or GSI-SSHTerm, type pwd. This will bring up the directory, which should be:
/brashear/<your Blacklight User Name>
Just remember to back up any data on $SCRATCH, either to $HOME (if it is on the order of megabytes) or to the archival system (if it is GBs or larger).
Execute Trinity Specific Commands
The following are examples of Trinity commands that can be used. A full list of options is available on Trinity's main site: http://trinityrnaseq.sourceforge.net/. We highly recommend that you read this list to see the correct usage for these commands.
Note: you must substitute your specific values for <Variable> in these examples. Do not include the '< >' symbols in your command. Be sure you are in your $SCRATCH directory and that your input files are located there also.
Strand Specific Sequencing (Preferred Library Method typical of the dUTP/UDG sequencing method) :
Trinity.pl --seqType fq --kmer_method meryl --left <YourReads1.fq> --right <YourReads2.fq> --output <DirNameForOutput> --SS_lib_type RF --min_contig_length <contigLengthMinCutoff> --CPU 16 --bflyCPU 16 --bflyGCThreads 16 > trinity_output.log
Note: other methods of strand-specific library generation may require FR orientation. Please review the Trinity website for a full explanation.
Non Strand Specific Library
Trinity.pl --seqType fq --kmer_method meryl --left <YourReads1.fq> --right <YourReads2.fq> --output <DirNameForOutput> --min_contig_length <contigLengthMinCutoff> --CPU 16 --bflyCPU 16 --bflyGCThreads 16 > trinity_output.log
Other Options for consideration
Some additional Trinity options are given here. For a complete list of advanced options and guide for Trinity use see: http://trinityrnaseq.sourceforge.net/advanced_trinity_guide.html
- --paired_fragment_length <int>
- This is the insert size for paired end reads, default is 300
- Requires the bowtie module to be loaded. Only recommended if you are assembling a transcriptome from a gene dense genome such as a fungal genome. If you have paired end reads, Trinity uses Bowtie to determine that consistent pairing is used; this is not recommended for large genomes. Ensure that your read names are properly labeled by ending with "/1" and "/2".
- --kmer_method (required) <meryl> <jellyfish> or <inchworm>
- These are the different methods that can be used for kmer creation with inchworm. More documentation can be found on the Trinity website or the meryl website listed above. For large to very large assemblies these parameters can be adjusted for improved performance at a trade off for the amount of RAM used.
- --CPU <int>
- Number of CPUs. This should be equal to the number of CPUs (cores) requested for the job
- --bflyCPU <int>
- Number of CPUs to use for Butterfly. This should be equal to the number of CPUs (cores) requested for the job
- --bflyHeapSpaceInit <string>
- This value is the amount of RAM each thread will initially use in the Butterfly job. The product of this value and the thread count cannot exceed the amount of RAM allocated for the job. An example of an acceptable value is 3G for 3 GB of initial Java heap space
- --bflyHeapSpaceMax <string>
- This is the amount of heap space butterfly will attempt to use if the initial amount is insufficient, if a job does not complete and exits with an error.
- Only Run Inchworm, can be useful when dealing with very large jobs that require a large amount of wall time
- The maximum amount of reads Chrysalis will anchor for any given graph
- Maximum amount of reads to read into memory at once for Chrysalis
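The --bflyHeapSpaceInit constraint in the list above (initial heap per thread times thread count must not exceed the job's RAM) can be checked before submitting. The sketch below assumes the 128 GB per 16 cores rule given earlier in this document and takes the heap size as a plain number of GB; heap_fits is a hypothetical helper, not a Trinity option.

```shell
# Hypothetical helper: succeed if HEAP_GB per thread times THREADS fits in
# the RAM of a job that requested NCPUS cores (128 GB per 16 cores).
heap_fits() {
    heap_gb=$1
    threads=$2
    ncpus=$3
    job_ram_gb=$((ncpus / 16 * 128))
    [ $((heap_gb * threads)) -le "$job_ram_gb" ]
}

# 3 GB x 16 threads = 48 GB, which fits in the 128 GB of a 16 core job.
if heap_fits 3 16 16; then echo "fits"; else echo "too large"; fi    # prints fits
# 10 GB x 16 threads = 160 GB, which does not fit.
if heap_fits 10 16 16; then echo "fits"; else echo "too large"; fi   # prints too large
```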
More information on Trinity
- Download Trinity (includes Inchworm, Chrysalis and Butterfly) at http://sourceforge.net/projects/trinityrnaseq/
- Trinity Website: http://trinityrnaseq.sourceforge.net/
- Trinity FAQ: http://trinityrnaseq.sourceforge.net/trinity_faq.html
- Trinity Forum: http://sourceforge.net/mailarchive/forum.php?forum_name=trinityrnaseq-users
Sickle is a windowed adaptive trimming tool for FASTQ files using quality.
Most modern sequencing technologies produce reads that have deteriorating quality towards the 3′-end and some towards the 5′-end as well. Incorrectly called bases in both regions negatively impact assemblies, mapping, and downstream bioinformatics analyses.
Sickle is a tool that uses sliding windows along with quality and length thresholds to determine when quality is sufficiently low to trim the 3′-end of reads and also determines when the quality is sufficiently high to trim the 5′-end of reads. It will also discard reads based upon the length threshold. It takes the quality values and slides a window across them whose length is 0.1 times the length of the read. If this length is less than 1, then the window is set to be equal to the length of the read. Otherwise, the window slides along the quality values until the average quality in the window rises above the threshold, at which point the algorithm determines where within the window the rise occurs and cuts the read and quality there for the 5′-end cut. Then when the average quality in the window drops below the threshold, the algorithm determines where in the window the drop occurs and cuts both the read and quality strings there for the 3′-end cut. However, if the length of the remaining sequence is less than the minimum length threshold, then the read is discarded entirely.
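The windowed trimming described above can be illustrated with a small shell sketch that works on a list of numeric quality values. This is a simplified reading of the description, not Sickle's actual source code: the function names are hypothetical, qualities are passed as plain integers rather than FASTQ characters, and the sketch assumes that a window average at or above the threshold counts as a rise.

```shell
# qual_at INDEX Q1 Q2 ...: print the INDEX-th (1-based) quality value.
qual_at() { shift "$1"; echo "$1"; }

# window_sum START LEN Q1 Q2 ...: sum of LEN quality values from START.
window_sum() (
    start=$1; len=$2; shift 2
    shift $((start - 1))
    sum=0
    while [ "$len" -gt 0 ]; do
        sum=$((sum + $1)); shift; len=$((len - 1))
    done
    echo "$sum"
)

# trim_sketch THRESHOLD MIN_LEN Q1 Q2 ...
# Prints "FIRST LAST" (1-based positions of the kept region) or "discard".
trim_sketch() (
    threshold=$1; min_len=$2; shift 2
    n=$#
    if [ "$n" -eq 0 ]; then echo discard; exit 0; fi
    win=$((n / 10))                        # window is 0.1 x read length
    if [ "$win" -lt 1 ]; then win=$n; fi   # too short: use the whole read
    first=0; last=$n; s=1; finished=0
    while [ "$finished" -eq 0 ] && [ $((s + win - 1)) -le "$n" ]; do
        sum=$(window_sum "$s" "$win" "$@")
        if [ "$first" -eq 0 ]; then
            # 5' end: average reached the threshold (assumed: >=), so the
            # window must contain a base at or above it; cut at that base.
            if [ "$sum" -ge $((threshold * win)) ]; then
                i=$s
                while [ "$(qual_at "$i" "$@")" -lt "$threshold" ]; do
                    i=$((i + 1))
                done
                first=$i
            fi
        elif [ "$sum" -lt $((threshold * win)) ]; then
            # 3' end: average dropped; cut before the first low base that
            # lies after the 5' cut.
            i=$s
            if [ "$i" -le "$first" ]; then i=$((first + 1)); fi
            while [ "$i" -lt $((s + win)) ]; do
                if [ "$(qual_at "$i" "$@")" -lt "$threshold" ]; then
                    last=$((i - 1)); finished=1; break
                fi
                i=$((i + 1))
            done
        fi
        s=$((s + 1))
    done
    if [ "$first" -eq 0 ] || [ $((last - first + 1)) -lt "$min_len" ]; then
        echo discard
    else
        echo "$first $last"
    fi
)

# 20 bases: two low-quality bases at each end, threshold 20, minimum length 5.
trim_sketch 20 5 2 2 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 2 2   # prints 3 18
```

Comparing window sums against threshold times window length avoids floating-point averages, which plain shell arithmetic does not support.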
To see what other modules are needed, what commands are available, and how to get additional help, type
module help sickle
To see what versions of Sickle are available type
module avail sickle
To use Sickle, include a command like this in your batch script to load the Sickle module:
module load sickle
Be sure you also load any other modules needed, as listed by the module help sickle command.