Whether one is a high energy physicist or geneticist, research across many different disciplines rely increasingly on more distributed cyberinfrastructure — where data is stored and produced in several locations, analyzed in others, and shared or published in yet another location. This has supported new discoveries at an exciting rate, but also creates a new problem: how to locate, organize, and manage data that is now scattered.

To solve this problem, researchers from the Department of Computer Science at the University of Chicago created a non-profit cloud platform, called Globus, that allows researchers to manage and access their data in the cloud. Globus currently has over 2,600 connected institutions and 604,000 users across over 80 countries and has been used to transfer several exabytes of data.

However, while Globus has made it easier for researchers to manage their data across these different locations, the cloud service still requires that people have prior knowledge of where the data resides. To address this, Research Associate Professor Kyle Chard, Arthur Holly Compton Distinguished Service Professor of Computer Science Ian Foster, and researcher Ben Blaiszik were awarded an NSF Grant that plans to address this issue.

The team will be building Globus search, a new capability that will be integrated into the existing Globus platform. Globus Search will allow researchers to search for data across multiple locations, and also develop high level capabilities to index these distributed locations to access metadata and content. This eliminates the need for prior knowledge from the researcher. Globus Search will be able to index not only the file system metadata — which is the file names and paths— but also be able to index the contents of the file itself.

“Increasingly, we see people not just finding data and using ordinary search but relying on machine learning models that extract features from data automatically,” Foster states. “We want to be able to allow people to construct and apply machine learning classifiers to data that may be distributed over many file systems, and find data that is similar to the data they have at hand, but perhaps that they didn’t realize was similar.”

Although indexing and searching is not new, the challenge the team faces is having to deal with extremely specialized, scientific data. In addition to the sheer scale of this project (Globus Search must keep up with the rate of change of billions of files stored in supercomputers), the system must also be created with a common mechanism that will work across different storage systems, from typical file systems to archival tape storage, specialized parallel file systems, and the many cloud object stores available.

Chard and Foster also have to develop an authorization model to go with Globus Search’s distributed indexing algorithm. Globus has an outstanding reputation for data security, and much of the world’s most important scientific data is managed through the platform. For instance, biomedical and protected data are both managed with Globus. While developing the authentication and authorization methods in Globus took decades of research and development to get to this point, Globus Search will move towards that same level of assurance and ability to handle protected data. For now, Chard and Foster plan to implement a rich authorization model that is layered on top of Globus Auth and existing technologies from Amazon Web Services and Elastic Search.

“It’s very important to manage who’s allowed to access metadata, because the metadata will often have important scientific information in it,” Chard states. “That could be an individual scientist, or it could be a large collaboration. We have metadata collected from these hundreds of petabytes of data, and Globus will implement careful control over who’s allowed to access each piece of metadata, and that can be changed over time as collaborations develop.”

Millions of scientists around the world are generating vast amounts of data every day, but much of this valuable information is never reused by other researchers. Chard and Foster hope that the creation of Globus search will make it easier to find and access data that is created by all sorts of different fields. Currently, it’s already becoming easier over time to find and access data, but no one knows the full extent of the data that is available, and therefore don’t access it or do research with it. By making it easy to index and catalog data, Chard and Foster hope that it will become much simpler to find and use that information, which in turn can accelerate the pace of scientific progress.

Foster remarks, “One question that NSF is asking, and we don’t really know how to answer, is how do you evaluate whether you’ve succeeded in increasing the rate of scientific discovery? I don’t know how to do that yet, except by seeing who might cite our work when they announce a result. However, I think it’s clear that people can discover information more easily with a tool like Globus Search. They will be able to answer questions that they couldn’t have answered before.”

To learn more about Globus and its impact in the research community, please visit the Globus website here. To learn more about the people that advance the Globus platform, please visit the Globus labs page here.

Related News

More UChicago CS stories from this research area.
UChicago CS News

UChicago Researchers Demonstrate the Quantifiable Uniqueness of Former President Donald Trump’s Language Use

Sep 30, 2024
UChicago CS News

Five UChicago CS students named to Siebel Scholars class of 2025

Sep 20, 2024
UChicago CS News

NSF and Simons Foundation launch $20 million National AI Research Institute in Astronomy

Sep 18, 2024
In the News

Data Ecology: A Socio-Technical Approach to Controlling Dataflows

Sep 18, 2024
UChicago CS News

Ph.D. Student Shawn Shan Named MIT Technology Review’s 35 Innovators Under 35 and Innovator of the Year

Sep 16, 2024
UChicago CS News

Ben Zhao Named to TIME Magazine’s TIME100 AI List

Sep 05, 2024
UChicago CS News

Ian Foster and Rick Stevens Named to HPCwire’s 35 Legends List

Aug 28, 2024
UChicago CS News

University of Chicago to Develop Software for Effort to Create a National Quantum Virtual Laboratory

Aug 28, 2024
UChicago CS News

New Classical Algorithm Enhances Understanding of Quantum Computing’s Future

Aug 27, 2024
UChicago CS News

Decoding Content Moderation: Analyzing Policy Variations Across Top Online Platforms

Aug 26, 2024
UChicago CS News

Get to Know Our Newest Faculty Members

Aug 21, 2024
In the News

Big Brains Podcast: Fighting back against AI piracy, with Ben Zhao and Heather Zheng

Aug 15, 2024
arrow-down-largearrow-left-largearrow-right-large-greyarrow-right-large-yellowarrow-right-largearrow-right-smallbutton-arrowclosedocumentfacebookfacet-arrow-down-whitefacet-arrow-downPage 1CheckedCheckedicon-apple-t5backgroundLayer 1icon-google-t5icon-office365-t5icon-outlook-t5backgroundLayer 1icon-outlookcom-t5backgroundLayer 1icon-yahoo-t5backgroundLayer 1internal-yellowinternalintranetlinkedinlinkoutpauseplaypresentationsearch-bluesearchshareslider-arrow-nextslider-arrow-prevtwittervideoyoutube