Thinking BIG DATA
$7.6-Million NSF Grant to Fund the Data Exacell, PSC's Next-Generation System for Storing, Analyzing Big Data
The term “Big Data” has become a buzzword.
Like any buzzword, its definition is fairly malleable, carrying different meanings in research, technology, medicine, business and government.
One common thread, though, is that Big Data represents volumes of data that are so large that they are outgrowing the available infrastructure for handling them. In many cases, research can’t be done because the tools don’t yet exist for managing and analyzing the data in a reasonable amount of time. Ultimately, we need to develop both tools and an overall strategy to make Big Data fulfill its promise in fields as disparate as biomedicine, the humanities, public health, astronomy and more.
PSC is taking the next step in developing both tools and direction for harnessing Big Data. A new National Science Foundation (NSF) grant will fund a PSC project to develop a prototype Data Exacell (DXC), a next-generation system for storing, handling and analyzing vast amounts of data. The $7.6-million, four-year grant will allow PSC to design, build, test and refine DXC in collaboration with selected scientific research projects that face unique challenges in working with and analyzing Big Data.
50 to 100 thousand gigabytes of data generated by a single astronomy project on the new generation of telescopes
38 million video-hours being uploaded annually to YouTube
2.2-mile height of the stack of DVDs to store DNA sequence data being generated annually
5 million million total gigabytes of data generated worldwide annually
“We are very pleased with this opportunity to continue working cooperatively to advance the state of the art based on our historical strengths in information technologies,” says Subra Suresh, the president of Carnegie Mellon University.
“The Data Exacell holds promise to provide advances in a wide range of important scientific research,” says Mark Nordenberg, chancellor of the University of Pittsburgh.
Big Data is a broad field that encompasses traditional high-performance computing as well as other fields of technology and research. Increasingly, though, these fields share a focus on data collection and analysis, that is, on handling and understanding unprecedented amounts of data, rather than on computation. They also require access methods and performance beyond the capability of traditional large data stores. The DXC project will directly address these required enhancements.
“The focus of this project is Big Data storage, retrieval and analysis,” says Michael Levine, PSC scientific director. “The Data Exacell prototype builds on our successful, innovative activities with a variety of data storage and analysis systems.”
The core of DXC will be SLASH2, PSC’s production software for managing and moving data volumes that otherwise would be unmanageable.
“What’s needed is a distributed, integrated system that allows researchers to collaboratively analyze cross-domain data without the performance roadblocks that are typically associated with Big Data,” says Nick Nystrom, director of strategic applications at PSC. “One result of this effort will be a robust, multifunctional system for Big Data analytics that will be ready for expansion into a large, production system.”
DXC will concentrate primarily on enhancing support for data-intensive research. PSC’s external collaborators from a variety of fields will work closely with the center’s scientists to ensure the system’s applicability to existing problems and its ability to serve as a model for future systems. The collaborating fields are expected to include genomics, radio astronomy and analysis of multimedia data, among others. (See below.)
“The Data Exacell will have a heavy focus on how the system will be used,” says J. Ray Scott, PSC director of systems and operations. “We’ll start with a targeted set of users who will get results but who are experienced enough to help us work through the challenges of making it production quality.”
Initial DXC Partners
• National Radio Astronomy Observatory, Green Bank, WV
• Event Detection in Multimedia Project, Carnegie Mellon University
• Galaxy Genome Project, Pennsylvania State University
• Department of Biomedical Informatics, University of Pittsburgh
• World History Data Center, University of Pittsburgh