The Data Exacell (DXC) was a research pilot project to create, deploy, and test software and hardware building blocks that enable data analytics in scientific research. The DXC coupled analytic resources with innovative storage configurations to allow communities that traditionally have not used HPC to create valuable new applications.


Pilot Applications

Pilot applications in data-intensive research areas were used to motivate, test, demonstrate, and improve the DXC building blocks. These projects span research areas from radio astronomy to the digital humanities, including genomics, data-intensive frameworks, biology, and epidemiology.

The pilot applications were selected according to their ability to advance research through some combination of:

  • High data volume, variety, and/or velocity
  • Novel approaches to data management or organization
  • Novel approaches to data integration or fusion
  • Integration of data analytic components into workflows
  • Complementarity with existing DXC pilot applications

For each pilot application, a PSC specialist worked closely with the research group to formulate and implement an effective solution.



Community: Genomics and Bioinformatics

With the Pittsburgh Genome Resource Repository (PGRR), researchers in the Department of Biomedical Informatics (DBMI) at the University of Pittsburgh aim to create gold-standard pipelines for analyzing cancer genome data, enable large-scale profiling and integrative analysis of genomic and phenotypic data, and make these analytical resources accessible to both computationally advanced researchers and those with little computational experience. The DXC makes The Cancer Genome Atlas (TCGA) data accessible to both Pitt computational resources and DXC analytical engines at PSC through a SLASH2 wide-area filesystem. A PostgreSQL database within the DXC stores TCGA metadata and manages access to these data. Researchers with little computational experience can analyze the TCGA data through commercial applications such as CLCBio, which is deployed at Pitt.

Researchers with greater computational experience can analyze the data using command-line tools on Pitt’s Frank and HTC clusters or DXC computational systems at PSC. In addition, a wide-area SLASH2 filesystem will soon make all of these data available to Pitt and CMU researchers on Bridges. Pitt and CMU scientists are actively using the data and DXC systems for their research even as we continue to improve existing capabilities and add new ones.
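
To give a concrete flavor of this arrangement, the sketch below shows how an analysis script might locate TCGA files through the DXC metadata database and then work with them on the SLASH2 mount. The host, table, and column names are illustrative assumptions, not the actual PGRR schema.

    # Minimal sketch: query a hypothetical TCGA metadata table in the DXC
    # PostgreSQL database, then work with the matching files on the SLASH2
    # mount. Host, table, and column names are illustrative assumptions.
    import psycopg2

    conn = psycopg2.connect(host="dxc-db.example.edu",   # hypothetical host
                            dbname="tcga_metadata",
                            user="pgrr_user")
    cur = conn.cursor()

    # Find aligned-read files for a given cancer type and platform.
    cur.execute("""
        SELECT file_path, sample_barcode
        FROM tcga_files
        WHERE disease = %s AND platform = %s
    """, ("BRCA", "IlluminaHiSeq_RNASeqV2"))

    for file_path, barcode in cur.fetchall():
        # file_path points into the wide-area SLASH2 filesystem, so the
        # same path resolves from Pitt clusters and from PSC systems.
        print(barcode, file_path)

    cur.close()
    conn.close()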

University of Pittsburgh

Galaxy (http://galaxyproject.org) provides a powerful and flexible platform for creating and running reproducible scientific workflows, focused mostly on bioinformatics applications but also adopted by other domains. Even within the bioinformatics community, many specialized Galaxy instances have been configured at sites around the world for both private and public use. Some of these excel in supporting specific research subdomains, for example metagenomics or proteomics. The ability to federate Galaxy sites would facilitate the work of scientists wishing to take advantage of the strengths of various sites or collaborate with scientists working in related fields. In addition, many of these sites have significant investment in software and workflows for particular types of analyses but lack the computing power to run their analyses on very large data sets. Even the popular main Galaxy site (“Galaxy Main”), with over 50,000 users, has modest computational resources available, limiting the types of analyses that can be pursued. The DXC can help address these issues by connecting Galaxy instances to powerful data analytics platforms like Greenfield at PSC through the SLASH2 wide-area filesystem. Multiple Galaxy instances can then potentially connect to each other through the DXC, linking analyses across distributed sites.

As a specific example, the Galaxy Main site, hosted at the Texas Advanced Computing Center (TACC), does not currently support de novo RNA-Seq assembly work (e.g., using the popular Trinity software) because greater computational and memory capabilities are required. RNA-Seq is extremely popular with genomics researchers because it enables communities or individual labs to study non-model organisms without a reference genome at very modest expense. With the Galaxy team, we are exploring using the Data Exacell to enable large-memory de novo assembly projects, especially RNA-Seq assembly, from Galaxy Main. Bringing this capability to Galaxy Main will greatly benefit the research being done there and will also be a first step toward allowing many distributed Galaxy instances to federate and do large-scale data analysis through the DXC.
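
As a rough sketch of the kind of large-memory job involved, the snippet below wraps a typical Trinity de novo assembly command in Python. The input files, memory limit, and core count are placeholders, and a production setup would submit such jobs through Galaxy's own job runners rather than a standalone script.

    # Illustrative sketch of a large-memory Trinity de novo RNA-Seq assembly,
    # the kind of job Galaxy Main cannot currently run. Paths, memory, and
    # CPU counts are placeholders for what a DXC large-memory node provides.
    import subprocess

    cmd = [
        "Trinity",
        "--seqType", "fq",              # FASTQ input
        "--left", "reads_1.fq.gz",      # paired-end reads (placeholder files)
        "--right", "reads_2.fq.gz",
        "--max_memory", "500G",         # large shared memory is the point
        "--CPU", "32",
        "--output", "trinity_out_dir",  # Trinity requires "trinity" in the name
    ]
    subprocess.run(cmd, check=True)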

Pennsylvania State University

GenePattern is a platform supporting hundreds of bioinformatic analysis methods with the ability to chain analytical steps together to develop reproducible workflows. GenePattern’s web-based interface brings the sophistication of powerful bioinformatics methods to all genomics researchers, regardless of their level of programming experience. The GenePattern community consists of over 20,000 researchers and is especially strong in cancer research. We are working with the Broad Institute to establish a GenePattern server at PSC. The establishment of GenePattern at PSC will help meet immediate capacity and capability needs of the GenePattern community, with large-memory and compute-intensive jobs being directed to PSC. Establishing an initial implementation of GenePattern on DXC has enabled strong, early GenePattern support on Bridges.

Broad Institute

The Minnesota Supercomputing Institute (MSI) and PSC will work together to test and implement PSC’s wide-area SLASH2 filesystem to support distributed workflows between MSI and PSC. A key application of this work will be to support the overwhelming and increasing demand for de novo assembly projects using third-generation, long-read sequencing technologies by allowing these workflows to run on either MSI or PSC resources according to their computational requirements. The MSI team is also examining memory usage by their research groups to determine which groups might benefit from directing jobs that require large amounts of shared memory to DXC large-memory nodes.

Minnesota Supercomputing Institute
Community: Radio Astronomy

The Robert C. Byrd Green Bank Telescope (GBT) at the National Radio Astronomy Observatory in Green Bank, West Virginia, is the world’s largest fully steerable single-aperture antenna, with a dish diameter of 100 meters and wavelength sensitivity from 3 m down to 2.6 mm. Thanks to new focal-plane receivers and back-end equipment, the volume of data produced by the GBT is rising rapidly, requiring increased analytic capability to keep up with observations. Data velocity and volume can be high, with projects ranging from roughly 15 to 350 GB/hour. The GBT Mapping Pipeline is a new software tool intended to ease the production of sky maps from this massive data stream. Mapping of large patches of sky is one of the main uses of the GBT and is complementary to the highly focused studies from facilities such as the Expanded Very Large Array (EVLA). Challenges include a complex software environment (Python-based, with legacy elements including AIPS, Obit, and ParselTongue), high data velocity, and demanding reliability and throughput requirements.

National Radio Astronomy Observatory
Community: Machine Learning and Big Data

This project, led by Brian Primack in the University of Pittsburgh’s School of Medicine, seeks to understand and monitor trends in alcohol, tobacco, and other drug use by tracking public discourse on the Twitter social media platform. Using well-defined search and filtering strategies, they will monitor relevant content occurring in the real-time stream of Twitter data. By leveraging these data, the researchers will be able to conduct a variety of analyses including longitudinal and geographical tracking of trends and sentiment analysis of content (using Natural Language Processing).
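
A minimal sketch of the filtering and sentiment-scoring step, using NLTK's VADER analyzer, is shown below. The keyword list and example tweets are invented stand-ins for the project's actual search strategies and live Twitter stream.

    # Minimal sketch: keyword filtering plus VADER sentiment scoring in NLTK.
    # The keywords and example tweets are invented; the real project applies
    # well-defined search strategies to the live Twitter stream.
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)

    KEYWORDS = {"hookah", "vaping", "e-cig"}          # hypothetical search terms
    tweets = [
        "Trying out a new hookah lounge downtown tonight",
        "Quit vaping three months ago and feeling great",
    ]

    sia = SentimentIntensityAnalyzer()
    for text in tweets:
        tokens = {t.lower() for t in text.split()}
        if tokens & KEYWORDS:                         # simple keyword filter
            scores = sia.polarity_scores(text)        # "compound" is in [-1, 1]
            print(f"{scores['compound']:+.2f}  {text}")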

University of Pittsburgh

This project will develop algorithms, infrastructure, and techniques to obtain new insights from biomedical Big Data. It involves approximately 11 PIs in Carnegie Mellon’s Center for Machine Learning and Health, who will collaborate with researchers at the University of Pittsburgh and the University of Pittsburgh Medical Center. In particular, Eric Xing’s group at CMU has developed Petuum, a highly efficient framework for iterative-convergent machine learning (http://www.cs.cmu.edu/~seunghak/petuum-13-weidai.pdf) that they are testing and refining on the DXC.
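
Here, "iterative-convergent" refers to computations that repeatedly refine model parameters until a convergence criterion is met, a pattern that frameworks like Petuum parallelize across many workers with relaxed consistency. The toy sketch below illustrates that pattern with plain NumPy gradient descent; it is not Petuum's API.

    # Generic sketch of an iterative-convergent computation: least-squares
    # regression by gradient descent in NumPy. This only illustrates the
    # repeated-refinement pattern; it is not Petuum code.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    true_w = rng.normal(size=10)
    y = X @ true_w + 0.01 * rng.normal(size=1000)

    w = np.zeros(10)
    lr, tol = 0.01, 1e-8
    for step in range(10_000):
        grad = X.T @ (X @ w - y) / len(y)    # gradient of mean squared error
        w_new = w - lr * grad
        if np.linalg.norm(w_new - w) < tol:  # convergence criterion
            break
        w = w_new
    print(step, np.linalg.norm(w - true_w))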

Carnegie Mellon University
Community: History

Col*Fusion (Collaborative data Fusion), developed primarily by Evgeny Karataev in the University of Pittsburgh’s School of Information Sciences (SIS), is a sophisticated software infrastructure for the systematic accumulation and utilization of global heterogeneous datasets based on the collective intelligence of research communities. To the user, it appears as a gateway in a web browser. It is being applied to enable crowdsourcing of data integration and fusion for the World History Data Center and the Collaborative for Historical Information and Analysis (CHIA), led by Patrick Manning in the University of Pittsburgh’s Department of History. Col*Fusion aims to support large-scale interdisciplinary research, where a comprehensive picture of the subject requires large amounts of historical data from disparate data sources from a variety of disciplines. As an example, consider the task of exploring long-term and short-term social changes, which requires consolidation of a comprehensive set of data on social-scientific, health, and environmental dynamics. While there are numerous historical data sets available from various groups worldwide, the existing data sources are principally oriented toward regional comparative efforts rather than global applications. They vary widely both in content and format, and cannot be easily integrated and maintained by small groups of developers. Devising efficient and scalable methods for integration of the existing and emerging historical data sources is a considerable research challenge.
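
To make the integration challenge concrete, the sketch below reconciles two tiny, hypothetical historical sources that report the same quantity under different schemas, codings, and units; real Col*Fusion sources are far larger and messier.

    # Toy data-fusion example: two hypothetical historical sources report
    # population under different column names, country codings, and units,
    # and must be normalized before they can be combined for global analysis.
    import pandas as pd

    source_a = pd.DataFrame({
        "country": ["France", "Japan"],
        "year": [1900, 1900],
        "population_millions": [40.7, 44.8],
    })
    source_b = pd.DataFrame({
        "iso3": ["FRA", "JPN"],
        "yr": [1910, 1910],
        "pop": [41_500_000, 49_200_000],
    })

    # Normalize the second source to the first source's schema and units.
    iso_to_name = {"FRA": "France", "JPN": "Japan"}
    source_b = source_b.assign(
        country=source_b["iso3"].map(iso_to_name),
        year=source_b["yr"],
        population_millions=source_b["pop"] / 1e6,
    )[["country", "year", "population_millions"]]

    fused = pd.concat([source_a, source_b], ignore_index=True)
    print(fused.sort_values(["country", "year"]))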

University of Pittsburgh
Community: Digital Humanities

This project maintains a web interface to digital humanities analysis resources on the DXC. The VMs, certificates, and related infrastructure are now being transitioned to Bridges.

University of Pittsburgh

This project maintains a web interface to digital humanities analysis resources on the DXC. The VMs, certificates, and related infrastructure are now being transitioned to Bridges.

University of Pittsburgh – Greensburg

This project maintains a web interface to digital humanities analysis resources on the DXC. The VMs, certificates, and related infrastructure are now being transitioned to Bridges.

Carnegie Mellon University

Matthew Burton, a postdoctoral researcher working in Digital Scholarship, is exploring the use of Jupyter Notebooks and Docker containers in teaching computational literacy in libraries and the digital humanities. This project runs workshops on various computational tasks such as Python programming, data preparation, web scraping, and accessing APIs.

University of Pittsburgh
Other Pilot Projects

PyLing (Pitt Python Linguistics Group), led by Na-Rae Han, is an undergraduate student group at the University of Pittsburgh interested in Python, linguistics, and natural language processing. The group’s early goals include leveraging Python and NLTK (the Natural Language Toolkit) to build chatbots and to develop an online corpus resource for beginning students in linguistics. This pilot application is primarily pedagogical and benefits the DXC by introducing analytic software specific to linguistics.
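
As a flavor of the kind of exercise involved, the sketch below uses NLTK's simple pattern-matching chat utility; the patterns are invented for illustration and are not the group's actual materials.

    # Toy pattern-matching chatbot with NLTK, in the spirit of the group's
    # early goals. The patterns are illustrative, not PyLing's own materials.
    from nltk.chat.util import Chat, reflections

    pairs = [
        (r"hi|hello|hey", ["Hello! Ask me about linguistics."]),
        (r"what is a (\w+)\??", ["Good question. A %1 is something we study in class."]),
        (r"bye", ["Goodbye!"]),
    ]

    bot = Chat(pairs, reflections)
    print(bot.respond("hello"))
    print(bot.respond("what is a morpheme?"))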

University of Pittsburgh

This material is based upon work supported by the National Science Foundation under Grant No. 1261721. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.