Pittsburgh Supercomputing Center 

Advancing the state-of-the-art in high-performance computing, communications and informatics.

BEDTools

 

BEDTools is a collection of utilities for comparing, summarizing, and intersecting genomic features in the UCSC Genome Browser BED format.

BEDTools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF.

Installed on blacklight, biou

Other resources that may be helpful include:

Quinlan AR and Hall IM, 2010.
BEDTools: a flexible suite of utilities for comparing genomic features.
Bioinformatics. 26, 6, pp. 841–842.
Website: http://code.google.com/p/bedtools/ Manual: http://bedtools.readthedocs.org/en/latest/

Running BEDTools

1) Make the BEDTools commands availiable for use
a) blacklight:
The BedTools programs will be made availiable for use through the module command. To load the SNAP module enter:

module load BEDTools

b) biou:
The BEDTools programs are availiable through the Galaxy instance on biou.

To make the BEDTools programs availiable through the command line, csh users should enter the following command:

% source /packages/bin/SETUP_BIO_SOFTWARE

To make the BEDTools programs availiable through the command line, bash users should enter the following command:

% source /packages/bin/SETUP_BIO_SOFTWARE

2) Programs in the BEDTools package:

intersectBed --- returns overlaps between two BED files.
peIntersectBed --- returns overlaps between a paired-end BED file and a regular BED file.
closestBed --- returns the closest feature to each entry in a BED file.
subtractBed --- removes the portion of an interval that is overlapped by another feature.
windowBed --- returns overlaps between two BED files based on a user-defined window
mergeBed --- merges overlapping features into a single feature.
complementBed --- returns all intervals _not_ spanned by the features in a BED file.
fastaFromBed --- creates FASTA sequences based on intervals in a BED file.
coverageBed --- creates a BED graph file based on the the overlaps of two BED files.
genomeCoverageBed --- creates either a histogram or a "per base" report of genome coverage.
sortBed --- sorts a BED file by chrom, then start position.
linksBed --- creates an HTML file of links to the UCSC or a custom browser.

3) Usage Examples:

a) intersectBed. Reports overlaps between features in two BED files.

NOTE: When intersecting SNPs, make sure the coordinate conform to the UCSC format. That is, the start position for each SNP should be SNP position - 1 and the end position should be SNP position. E.g. chr7 10000001 10000002 rs123464

1. Report the base-pair overlap between the features in two BED files.

$ intersectBed -a reads.bed -b genes.bed

2. Report whether each entry in A overlaps ANY entry in B.

$ intersectBed -a reads.bed -b genes.bed -u

3. Report those entries in A that overlap NO entries in B. Like "grep -v"

$ intersectBed -a reads.bed -b genes.bed -v

4. Report the count of entries in B that A overlaps.

$ intersectBed -a reads.bed -b genes.bed -c

5. Report the entire, original A entry for each overlap with B.

$ intersectBed -a reads.bed -b genes.bed -wa

6. Report the entire, original B entry for each overlap.

$ intersectBed -a reads.bed -b genes.bed -wb

7. Read BED A from stdin. Useful for stringing together commands.
NOTE: -a must be "stdin".

For example, find genes that overlap LINEs but not SINEs.

$ intersectBed -a genes.bed -b LINES.bed | intersectBed -a stdin -b SINEs.bed -v

b) peIntersectBed. Reports overlaps between features in a "paired-end" BEDPE file and a regular BED feature file.

OVERLAPS WITH PAIRED-END "ENDS"

1. Which paired-end reads overlap with a repeat on EITHER end?

$ peIntersectBed -a reads.bedpe -b repeatMasked.bed -type either

2. Which structural variations overlap with a repeat on BOTH ends?

$ peIntersectBed -a reads.bedpe -b repeatMasked.bed -type both

3. Which paired-end reads DO NOT overlap with a repeat on either end?

$ peIntersectBed -a reads.bedpe -b repeatMasked.bed -type NEITHER

4. Which structural variations overlap with a repeat on ONE and ONLY ONE end?

$ peIntersectBed -a reads.bedpe -b repeatMasked.bed -type xor

OVERLAPS WITH THE "SPAN" OF THE PAIRED-END READ OR EVENT

1. Which pair-end reads overlap span a gap based on the "inner-span"?

    E.g. Pair =====....................=====
    Gap 0000000000000000
 
    $ peIntersectBed -a reads.bedpe -b gaps.bed -type inspan

2. Which pair-end reads overlap span a gene based on the "inner-span"?

    E.g. Pair =====....................=====
    Gene 00000000000000000000000
    $ peIntersectBed -a reads.bedpe -b genes.bed -type outspan

c) closestBed. Very similar to intersectBed, but finds the closest (not necessarily overlapping) feature in B to each feature in A. If multiple features in B overlap a feature in A, the one with the highest overlap with respect to A is reported.

1. Find the closest ALU to each gene.

$ closestBed -a genes.bed -b ALUs.bed

Allows input from stdin in the same manner as intersectBed (see #7).

d) subtractBed. For each feature in A, it returns the fraction of that feature that is not overlapped by B. If a feature in A is entirely "spanned" by any feature in B, it will not be reported.

1. Remove introns from genes.

$ subtractBed -a genes.bed -b introns.bed

Allows input from stdin in the same manner as intersectBed (see #7).

e) windowBed. Very similar to intersectBed, but allows one to look for overlaps within a window flanking each A entry

1. Report all genes that are within 10000 bp UPSTREAM and DOWNSTREAM of CNVs.

$ windowBed -a CNVs.bed -b genes.bed -w 10000

Allows input from stdin in the same manner as intersectBed (see #7).

f) mergeBed. Merges overlapping or "book-ended" entries into a single entry.

1. Merge overlapping repetitive elements into a single entry

$ mergeBed -i repeatMasker.bed

2. Merge overlapping repetitive elements into a single entry, returning the number of entries merged

$ mergeBed -i repeatMasker.bed -n

3. Merge _nearby_ repetitive elements into a single entry, so long as they are within 1000 bp of one another

$ mergeBed -i repeatMasker.bed -d 1000

g) fastaFromBed. Extracts sequences in FASTA format based for each of the intervals defined in a BED file. The headers in the input FASTA file must exactly match the chrom column (column 1) in the BED file. If a strand is provided, the sequence will be reported accordingly (i.e. "-" returns a reverse complement). Otherwise, everything is reported on the "+" strand.

1. The example below will report those intervals in the human genome that do not contain an annotated repeat.

$ fastaFromBed -db hg18.fa -bed segDups.hg18.bed -fo segDups.hg18.fasta

h) sortBed. Sort a BED file in ascending order, first by chrom, then by start position.

$ sortBed scrambled.bed
Accepts input from stdin:
$ cat scrambled.bed | sortBed -i stdin > sorted.bed

i) linksBed. Creates HTML links from a BED file.

$ linksBed myRegions.bed > myRegions.html

j) coverageBed. Reports the number of entries in A that overlap with each feature in B. For example, one could use BED B as say 10Kb "windows" across a genome. If BED A contained the coordinates of sequence alignments, one could use coverageBed to count the number of reads that map to each window. This would be a preliminary step copy-number analysis. The output is in BEDGRAPH format and will allow you to plot the result on the UCSC Genome Browser.

1. For example, column four is the number of reads in "A" that overlapped each "window" in "B" :

$ coverageBed -a illuminaReads.bed -b windows.bed | head
chr1 0 10000 0
chr1 10001 20000 33
chr1 20001 30000 42
chr1 30001 40000 71

k) genomeCoverageBed. Computes a histogram of sequence coverage (aligned sequences in BED format) for a given genome. Optionally, (-d) it will also report the depth of coverage at each base on each chromosome in the genome file (-g) supplied. The (-max) option will "lump" all positions with a depth >= max into a single bin. I thank Royden Clark for his help in developing the speedy algorithm to compute coverage. By default, the output format is chrom, depth, number of bases at that depth, size of chrom, fraction of bases at depth.

1. For example, here's a look at "Joe's genome" coverage on chromosome 1, binning everything with a depth >= 10 together:

$ genomeCoverageBed -g hg18.genome -i joe.bed -max 10 | grep -P "chr1\t"
chr1 0 88757187 247249719 0.358978
chr1 1 26410800 247249719 0.106818
chr1 2 21892230 247249719 0.088543
chr1 3 18653191 247249719 0.0754427
chr1 4 16000409 247249719 0.0647136
chr1 5 13564576 247249719 0.0548618
chr1 6 11582171 247249719 0.046844
chr1 7 9722031 247249719 0.0393207
chr1 8 8168742 247249719 0.0330384
chr1 9 6765343 247249719 0.0273624
chr1 10 25733039 247249719 0.104077

l) complementBed. Returns those intervals in a given genome that do not overlap the entries in a BED file.

For example, report all intervals in the human genome that are not covered by repetitive elements.

$ complementBed -i repeatMasker.bed -g hg18.genome

That is:
RepeatMasker: ============ ==== =========== ========
Output: **** ******** ************ ************* *******

NOTE: Genome files must contain a single line for each chromosome in the following tab-delimited format. If you wish to create a genome file for a different species, you can easily do this by using the UCSC Genome Browser's MySQL interface or use the Table Browser (table chromInfo). Here's an example for how to collect the genome info for hg18.

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg18.chromInfo" > hg18.genome

NOTE: Be sure to remove the header line (chrom size) in the resulting file prior to running genomeCoverageBed

For example, here is hg18.genome.

chr1 247249719
chr1_random 1663265
chr10 135374737
chr10_random 113275
chr11 134452384
chr11_random 215294
chr12 132349534
chr13 114142980
chr13_random 186858
chr14 106368585
chr15 100338915
chr15_random 784346
chr16 88827254
chr16_random 105485
chr17 78774742
chr17_random 2617613
chr18 76117153
chr18_random 4262
chr19 63811651
chr19_random 301858
chr2 242951149
chr2_random 185571
chr20 62435964
chr21 46944323
chr21_random 1679693
chr22 49691432
chr22_random 257318
chr22_h2_hap1 63661
chr3 199501827
chr3_random 749256
chr4 191273063
chr4_random 842648
chr5 180857866
chr5_random 143687
chr5_h2_hap1 1794870
chr6 170899992
chr6_random 1875562
chr6_cox_hap1 4731698
chr6_qbl_hap2 4565931
chr7 158821424
chr7_random 549659
chr8 146274826
chr8_random 943810
chr9 140273252
chr9_random 1146434
chrM 16571
chrX 154913754
chrX_random 1719168
chrY 57772954

Common Problems:

1. Why am I not finding overlap with my SNPs?

When intersecting SNPs, make sure the coordinate conform to the UCSC format. That is, the start position for each SNP should be SNP position - 1 and the end position should be SNP position. For example:

chr7 10000001 10000002 rs123464

Stay Connected

Stay Connected with PSC!

facebook 32 twitter 32 google-Plus-icon