Johns Hopkins University’s Alex Szalay will lead a two-year national effort to begin building a network allowing scientists to more efficiently store and analyze huge caches of data and share them with other researchers.
The National Science Foundation today announced a $1.8 million grant to a nationwide team, led by Szalay as principal investigator, to start developing the Open Storage Network.
“The goal is to create a robust, industrial-strength national storage substrate that can impact 80 percent of the NSF research community,” said Szalay, a Bloomberg Distinguished Professor at Johns Hopkins and director of its Institute for Data Intensive Engineering and Science.
The eventual buildout of the network may cost between $20 million and $30 million in hardware and software, a relatively modest investment that “could completely change the academic big data landscape,” Szalay said.
Even under a conservative projection of how many universities would eventually join, OSN would reach about 200 petabytes, making it one of the largest distributed data storage networks dedicated to science in the world, with economies of scale that would make managing huge datasets cheaper for all involved, Szalay said.
Szalay is an astrophysicist whose work on galaxies led him to a deep interest in how all of science copes with the “data deluge.” That’s the term he and others use to describe the avalanche of data that advanced scientific methods make available to researchers studying complex questions as diverse as the origin of the universe, climate change, and the genetic origins of disease.
Szalay now holds appointments as a professor both in physics and astronomy in the university’s Krieger School of Arts and Sciences and in computer science in the Whiting School of Engineering.
Other members of Szalay’s OSN team are from the National Data Service and each of the four NSF-funded Big Data Regional Innovation Hubs: the west hub, in California; the Midwest hub, in Illinois; the south hub, based in North Carolina and Georgia; and the northeast hub, located in Massachusetts and Pennsylvania.
NSF said in its announcement of the grant that the new OSN will at first be a pilot project for researchers at the participating institutions. During this phase, the developers will work to ensure that it is easy to use, performs well, can be reached efficiently from various parts of the internet, is secure and reliable, has good privacy protections, and preserves data well.
Additional software and service layers will be added to OSN as it is developed, NSF said. Several pieces of the system will rely on Globus, which scientists worldwide widely use for large-scale data transfers, Szalay said.
“We are excited to support OSN to help meet the needs of researchers in today’s era of data-driven discovery and innovation,” said Erwin Gianchandani, acting assistant director of the NSF’s Computer and Information Science and Engineering Directorate. “The OSN team and their supporting collaborators will build a community to multiply the impact of previous and current NSF investments and anchor comprehensive data infrastructure that will be vital to the future of our nation’s scientific and engineering enterprise.”
The NSF grant builds on a $1 million seed grant awarded in 2017 by Schmidt Futures, an initiative by former Google chairman and Alphabet executive chairman Eric Schmidt to advance society through technology.
The Schmidt Futures grant supports building the first prototypes of low-cost, large-capacity data transfer systems for the new network, designed to match the speed of a 100-gigabit network connection with only a small number of nodes. This data transfer system will help to ensure that OSN can eventually be deployed at many universities across the United States, the NSF said.
“We are excited to support Professor Szalay’s promising work designing and testing these impressive storage devices, and want many such open-design petabyte units to be assembled and deployed in and for universities,” said Stuart Feldman, chief scientist at Schmidt Futures. “We applaud NSF’s investment in the Open Storage Network as a key step toward enabling research requiring truly massive amounts of data.”