GlimmerHMM

GlimmerHMM is a eukaryotic gene-finding system based on a Generalized Hidden Markov Model (GHMM).

Documentation

General usage 

To see which versions of GlimmerHMM are available, type:

module avail glimmerhmm

To see which other modules are needed, which commands are available, and how to get additional help, type:

module help glimmerhmm

To use GlimmerHMM, include a command like this in your batch script or interactive session to load the glimmerhmm module:

module load glimmerhmm

Be sure you also load any other modules needed, as listed by the module help glimmerhmm command.

 

Command line usage

glimmerhmm <genome1-file> <training-dir-for-genome1> [options]

Options:

-p file_name If protein domain searches are available, read them from file file_name
-d dir_name Training directory is specified by dir_name (introduced for compatibility with earlier versions)
-o file_name Print output to file_name; if the top n>1 predictions are requested (see -n), output goes to file_name.1, file_name.2, ..., file_name.n
-n n Print top n best predictions
-g Print output in gff format
-v Don't use SVM splice site predictions
-f Don't make partial gene predictions
-h Display the options of the program
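
The file naming rule for -o combined with -n can be sketched as a small shell helper. The function name expected_outputs is hypothetical and not part of GlimmerHMM, and the single-file case for n = 1 is an assumption read off the option description above.

```shell
# Hypothetical helper: list the output files expected from "-o <file> -n <n>".
# Assumption: with n = 1 the output goes to <file> itself.
expected_outputs() {
    out=$1
    n=${2:-1}
    if [ "$n" -gt 1 ]; then
        i=1
        while [ "$i" -le "$n" ]; do
            echo "$out.$i"      # file_name.1, file_name.2, ..., file_name.n
            i=$((i + 1))
        done
    else
        echo "$out"
    fi
}
```

Calling expected_outputs pred.gff 3 lists the names to look for after a run such as glimmerhmm <genome1-file> <training-dir-for-genome1> -g -n 3 -o pred.gff.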

Training datasets

To use GlimmerHMM, you must either train the program or use a precompiled training set.

Using pre-compiled training datasets

A number of precompiled training sets are included in the GlimmerHMM release. To see what is available, type:

module load glimmerhmm 
ls $GLIMMERHMM_TRAIN/trained_dir

If your genome (or a close relative of it) is listed, you may use the precompiled training set by giving the -d option followed by the directory containing that training set. The precompiled training sets can be found in the directory $GLIMMERHMM_HOME/trained_dir

For example, to use the precompiled set for the human genome on a set of sequences contained in the file fasta.file, you would use the following on the command line:

% glimmerhmm fasta.file -d $GLIMMERHMM_HOME/trained_dir/human
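
When several sequence files share one training set, the same command can be wrapped in a loop. The sketch below is one possible way to do this; predict_all is a hypothetical helper name, and writing GFF output next to each input file (with the extension replaced by .gff) is an assumption made for illustration.

```shell
# Hypothetical helper: run glimmerhmm on several FASTA files with one training set.
# Usage: predict_all <training_dir> <fasta_file>...
predict_all() {
    train_dir=$1
    shift
    for fa in "$@"; do
        # -d names the training directory, -g selects GFF output,
        # -o names the output file (input name with its extension replaced by .gff)
        glimmerhmm "$fa" -d "$train_dir" -g -o "${fa%.*}.gff"
    done
}
```

For example, predict_all $GLIMMERHMM_HOME/trained_dir/human chr1.fa chr2.fa would run two predictions; the file names chr1.fa and chr2.fa are placeholders.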

Compiling your own training dataset

Use the trainGlimmerHMM module.

To train, use the command trainGlimmerHMM with the parameters specified below.

trainGlimmerHMM <mfasta_file> <exon_file> [optional_parameters]

<mfasta_file> is a multifasta file containing the sequences for training with the usual format:

>seq1
AGTCGTCGCTAGCTAGCTAGCATCGAGTCTTTTCGATCGAGGACTAGACTT
CTAGCTAGCTAGCATAGCATACGAGCATATCGGTCATGAGACTGATTGGGC
>seq2
TTTAGCTAGCTAGCATAGCATACGAGCATATCGGTAGACTGATTGGGTTTA
TGCGTTA

<exon_file> is a file with the exon coordinates relative to the sequences contained in the <mfasta_file>. Different genes are separated by a blank line. The expected format is shown below:

seq1 5 15
seq1 20 34

seq1 50 48
seq1 45 36

seq2 17 20

In this example seq1 has two genes: one on the direct strand and another on the complementary strand, whose exon coordinates are listed with the start greater than the end.
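
Before training, it can help to check that an exon file parses the way you expect. The following is a minimal sketch, not part of GlimmerHMM: summarize_exons is a hypothetical helper that assumes one "<seq> <start> <end>" line per exon, a blank line between genes, and start greater than end for genes on the complementary strand.

```shell
# Hypothetical helper: print one line per gene (sequence name, strand, exon count).
# Usage: summarize_exons <exon_file>
summarize_exons() {
    awk 'BEGIN { RS = "" }     # paragraph mode: each blank-line-separated block is one gene
         {
             # Fields come in triples (seq, start, end); the first exon decides the strand.
             strand = ($2 + 0 <= $3 + 0) ? "+" : "-"
             printf "%s %s %d\n", $1, strand, NF / 3
         }' "$1"
}
```

Each output line describes one gene, so the number of lines should match the number of genes you intended to provide.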

Optional parameters

-i i1,i2,...,in Isochores to be considered (e.g. if two isochores are desired, one for 0-40% GC content and one for 40-100%, the option should be -i 0,40,100; the default is -i 0,100)
-f val val = average value of upstream UTR region if known
-l val val = average value of downstream UTR region if known
-n val val = average value of intergenic region if known
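
Choosing isochore boundaries for -i requires some idea of the GC content of the training sequences. As a rough aid, the sketch below reports the GC percentage of each sequence in a multifasta file; gc_per_sequence is a hypothetical helper, not part of GlimmerHMM, and it counts every character on a sequence line toward the length.

```shell
# Hypothetical helper: print "<name> <GC percent>" for each sequence in a multifasta file.
# Usage: gc_per_sequence <mfasta_file>
gc_per_sequence() {
    awk '
        function report() {
            if (name != "" && len > 0) printf "%s %.1f\n", name, 100 * gc / len
        }
        /^>/ { report(); name = substr($1, 2); gc = 0; len = 0; next }
        {
            seq = toupper($0)
            len += length(seq)
            gc  += gsub(/[GC]/, "", seq)   # gsub returns the number of G/C characters replaced
        }
        END { report() }
    ' "$1"
}
```

If the reported values fall into two clear groups on either side of 40%, for instance, the -i 0,40,100 setting from the option description above would split them into two isochores.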

 

After running trainGlimmerHMM, a new directory is created in the directory from which you ran the training procedure. This directory is named TrainGlimmM[date][time], where [date] and [time] specify the date and time at which it was created, and it contains the training parameters GlimmerHMM needs to run. A log file named after the directory is also created, recording some of the default parameters set for GlimmerHMM. Once training is complete, run GlimmerHMM with your training set.
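
Because the training directory name includes a timestamp, a script usually has to look it up before running predictions. The sketch below is one way to do that; latest_training_dir is a hypothetical helper, and it assumes directory names without newlines, since it parses the output of ls.

```shell
# Hypothetical helper: print the most recently modified TrainGlimmM* directory.
# Usage: latest_training_dir [parent_dir]   (defaults to the current directory)
latest_training_dir() {
    # -t sorts by modification time (newest first); -d lists the directories themselves.
    # The trailing slash in the pattern restricts matches to directories.
    ls -td "${1:-.}"/TrainGlimmM*/ 2>/dev/null | head -n 1
}
```

The result can then be passed as the training directory, for example glimmerhmm fasta.file "$(latest_training_dir)" -g. The helper prints nothing if no matching directory exists, so check for an empty result first.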

Technical questions:

Send mail to remarks@psc.edu or call the PSC hotline: 412-268-6350.