A community dataset space allows Bridges users from different grants to share data in a common space. Bridges hosts both public and private datasets, providing rapid access for individuals, collaborations and communities with appropriate protections.
Community datasets are appropriate when data will be shared among multiple Bridges groups. Data that only one group should access belongs in that group's pylon5 space.
If you have a dataset for use by multiple groups on Bridges, request that it be stored in the community dataset space by completing the Community Dataset Request form. If your data collection has security or compliance requirements, indicate this on the form or contact email@example.com.
Some data collections are available to anyone with a Bridges account. They include:
ImageNet
ImageNet is an image dataset organized according to the WordNet hierarchy. See the ImageNet website for complete information.
Available on Bridges at /pylon5/datasets/community/imagenet
Natural Language Toolkit (NLTK) Data
NLTK comes with many corpora, toy grammars, trained models, etc. A complete list of the available data is posted at: http://nltk.org/nltk_data/
Available on Bridges at /pylon5/datasets/community/nltk
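One way to have NLTK search the community copy is to add its directory to NLTK's data search path. The following is a minimal sketch, not a confirmed Bridges procedure: it assumes the directory is laid out like a standard nltk_data folder, which this page does not state. It uses the NLTK_DATA environment variable, which NLTK reads when it is first imported; the helper name add_nltk_data_dir is hypothetical.

```python
import os

# Path taken from the entry above; its internal layout is an assumption.
NLTK_COMMUNITY_DIR = "/pylon5/datasets/community/nltk"


def add_nltk_data_dir(path=NLTK_COMMUNITY_DIR, environ=None):
    """Put `path` first in the NLTK_DATA search list and return the new value.

    NLTK_DATA holds directories separated by os.pathsep (":" on Linux).
    """
    environ = os.environ if environ is None else environ
    current = [p for p in environ.get("NLTK_DATA", "").split(os.pathsep) if p]
    # Move `path` to the front without duplicating it.
    entries = [path] + [p for p in current if p != path]
    environ["NLTK_DATA"] = os.pathsep.join(entries)
    return environ["NLTK_DATA"]


# Usage (requires NLTK to be installed):
#   add_nltk_data_dir()
#   import nltk  # import only after NLTK_DATA has been set
```

If NLTK has already been imported, appending the directory to the list nltk.data.path should have the same effect.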
MNIST
A dataset of handwritten digits used to train image processing systems.
Available on Bridges at /pylon5/datasets/community/mnist
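Before pointing a job at one of these locations, it can help to confirm that the directory is visible from the node you are on. The sketch below checks the three paths listed above using only the Python standard library; the function name dataset_status is hypothetical, and nothing is assumed about the contents of each directory.

```python
from pathlib import Path

# Locations quoted in the entries above.
COMMUNITY_ROOT = Path("/pylon5/datasets/community")
DATASETS = ("imagenet", "nltk", "mnist")


def dataset_status(root=COMMUNITY_ROOT, names=DATASETS):
    """Return {name: bool}, True when root/name is an existing directory."""
    root = Path(root)
    return {name: (root / name).is_dir() for name in names}


# Usage:
#   for name, present in dataset_status().items():
#       print(name, "found" if present else "missing")
```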
Several genomics datasets are publicly available.
- The BLAST databases can be accessed through the environment variable $BLASTDB after loading the BLAST module.
- CAMI (Critical Assessment of Metagenome Interpretation) is a community-led initiative designed to help tackle challenges in metagenome assembly and analysis by aiming for an independent, comprehensive and bias-free evaluation of methods. Data from the first CAMI challenge is available at /pylon5/datasets/community/genomics/cami.
- Repbase is the most commonly used database of repetitive DNA elements. Before using the Repbase database, you must register with Repbase at http://www.girinst.org and send proof of registration to firstname.lastname@example.org.
- The University of California, Santa Cruz (UCSC) reference genomes are available at /pylon5/datasets/community/genomics/UCSC. The collection includes human, mouse and Drosophila genomes.
- Other genomics datasets: other available datasets are typically used with a particular genomics package.
Other useful datasets
The datasets listed below may be useful. They are not currently installed on Bridges, but you can copy them to your pylon5 space. If you think a dataset would be useful to many Bridges users, you can request that it be installed in a public space.
Keras Datasets for Import
These datasets are available from https://keras.io/datasets/
- CIFAR10 small image classification
- CIFAR100 small image classification
- IMDB Movie reviews sentiment classification
- Reuters newswire topics classification
- MNIST database of handwritten digits
- Fashion-MNIST database of clothing
- Boston housing price regression dataset (from CMU)
- The PASCAL Visual Object Classes Homepage
- Open Images Dataset V5
- The Street View House Numbers (SVHN)
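The Keras datasets listed above (CIFAR10 through Boston housing) each have a module under keras.datasets with a load_data() function. The sketch below is one way to load them by name, assuming the standalone keras package is installed; with TensorFlow's bundled Keras the equivalent import would be from tensorflow.keras.datasets. The helper load_keras_dataset is hypothetical, and the first call for a dataset downloads it, so it needs network access.

```python
import importlib

# Module names under keras.datasets matching the Keras list above.
KERAS_DATASETS = {
    "cifar10",
    "cifar100",
    "imdb",
    "reuters",
    "mnist",
    "fashion_mnist",
    "boston_housing",
}


def load_keras_dataset(name):
    """Return ((x_train, y_train), (x_test, y_test)) for a Keras dataset."""
    if name not in KERAS_DATASETS:
        raise ValueError(f"unknown Keras dataset: {name!r}")
    # Imported lazily so the name check works without Keras installed.
    module = importlib.import_module(f"keras.datasets.{name}")
    return module.load_data()


# Usage (requires Keras and network access on first call):
#   (x_train, y_train), (x_test, y_test) = load_keras_dataset("cifar10")
```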
Natural Language Processing
- Twenty Newsgroups
- Yelp Reviews
- The Wikipedia Corpus
- The Blog Authorship Corpus
- Natural Language Toolkit — NLTK 3.4.5
Audio and Audio-Visual
- Free Spoken Digit Dataset
- Free Music Archive (FMA)
- Million Song Dataset
Scikit-Learn Datasets for Import
- Wisconsin breast cancer - binary classification
- Iris - PCA, LDA, multi-class classification
- Wine - multi-class classification
- Boston house prices - regression
- Diabetes - regression
- Handwritten digits - image classification
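These datasets ship with scikit-learn, so loading them does not require a download. The sketch below maps short names (hypothetical, chosen for this example) to the corresponding loader functions in sklearn.datasets. Boston house prices is left out because load_boston was removed in scikit-learn 1.2; on older versions it can be added to the mapping.

```python
# Short name (hypothetical) -> loader function in sklearn.datasets.
SKLEARN_LOADERS = {
    "breast_cancer": "load_breast_cancer",
    "iris": "load_iris",
    "wine": "load_wine",
    "diabetes": "load_diabetes",
    "digits": "load_digits",
}


def load_xy(name):
    """Return (X, y) arrays for one of the bundled scikit-learn datasets."""
    if name not in SKLEARN_LOADERS:
        raise ValueError(f"unknown scikit-learn dataset: {name!r}")
    # Imported lazily so the name check works without scikit-learn installed.
    from sklearn import datasets

    loader = getattr(datasets, SKLEARN_LOADERS[name])
    return loader(return_X_y=True)


# Usage (requires scikit-learn):
#   X, y = load_xy("iris")
#   print(X.shape, y.shape)
```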
Multi-class classification and clustering
Univariate Time Series
Multivariate Time Series