Data Supercell at PSC

Data Supercell at Pittsburgh Supercomputing Center

A patent is pending for PSC’s innovative disk-based data-storage system.

PITTSBURGH, August 21, 2012 — The Pittsburgh Supercomputing Center (PSC) has developed and deployed a cost-effective, disk-based file repository and data-management system called the Data Supercell. This innovative technology, developed by a PSC team of scientists, provides major advantages over traditional tape-based archiving for large-scale datasets.

The PSC team exploited the increasing cost-effectiveness of commodity disk technologies and adapted sophisticated PSC-developed file system software, called SLASH2, to create a new class of integrated storage services. A patent application is under review.

The Data Supercell is intended especially to serve users of large scientific datasets, including users of XSEDE (the Extreme Science and Engineering Discovery Environment), the National Science Foundation cyberinfrastructure program that is the world’s largest collection of integrated digital resources and services.

“The Data Supercell is a unique technology, building on the increasing cost-effectiveness of disk storage and the capabilities of PSC’s SLASH2 file system,” said Michael Levine and Ralph Roskies, PSC co-scientific directors. “It will go far to enable more efficient, flexible analyses of very large-scale datasets.”

The initial capacity of the Data Supercell is four petabytes (four quadrillion bytes), and the system is designed so that capacity can be added as needed. In comparison with cumbersome tape-based archiving, sometimes referred to as “write once, read never,” the Data Supercell’s disk-based technology allows much faster data access and transfer: its latency is 10,000 times lower than that of tape, and its bandwidth is 24 times higher than that of PSC’s previous tape archiving system. It also incorporates high-reliability and security features for optimized data replication, safety and movement.

Deployment of the Data Supercell aims to meet the expanded data-storage needs posed by the rapid evolution, frequently termed “big data,” toward ever larger quantities of data stored and transferred in many kinds of applications, including astrophysics, genomics and the “mining” of vast amounts of Internet data for commercial purposes.

Various departments at the University of Pittsburgh, Carnegie Mellon University and Drexel University are using the Data Supercell. Researchers with large genomic datasets, produced through Galaxy, a web-based platform for bioinformatics research at Penn State, are currently using 470 terabytes of Data Supercell storage.
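
For a rough sense of scale, the capacity and usage figures above can be compared with a few lines of Python. This is an illustrative sketch only, assuming decimal (SI) units in which one terabyte is 10^12 bytes and one petabyte is 10^15 bytes; the function and variable names are hypothetical and are not part of any PSC software.

```python
# Illustrative arithmetic only; assumes decimal (SI) storage units.
TERABYTE = 10**12  # bytes
PETABYTE = 10**15  # bytes


def share_of_capacity(used_bytes: int, capacity_bytes: int) -> float:
    """Return used storage as a percentage of total capacity."""
    if capacity_bytes <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * used_bytes / capacity_bytes


# Figures quoted in the release; the names are hypothetical.
initial_capacity = 4 * PETABYTE   # "four petabytes"
genomics_usage = 470 * TERABYTE   # storage used by the genomics researchers

print(f"Initial capacity: {initial_capacity:,} bytes")
print(f"Genomics share: {share_of_capacity(genomics_usage, initial_capacity):.2f}%")
```

Under these assumptions, the genomics datasets account for a little under 12 percent of the initial capacity; the share would shrink as capacity is added.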

The Data Supercell was developed by this team of PSC scientists: Paul Nowoczynski, Jared Yanovich, Zhihui Zhang, Jason Sommerfield, J. Ray Scott, and Michael Levine.