CATALYST was formed by The Ohio State University College of Medicine to advance T3 research at Ohio State. As part of that mission, CATALYST has developed this DataCore for researchers across the College of Medicine to streamline and expand access to data that accelerates discovery and leads to funded research.


Data Curation


CATALYST seeks to curate selected data to serve as a strategic resource for the college. The term "curation" describes the process the DataCore follows to provide this resource to the community. The need for data curation is significant in health care-related research.


There are many data sets available in health care, but securing and using them can pose a significant challenge. For example:

  • While acquiring a single year of data might be straightforward, merging data across multiple years – even when secured from the same source – generally requires some level of data management, and cleaning, unifying, and troubleshooting a data set can be demanding;
  • Different organizations require unique bureaucratic processes to legally access the data for research; and
  • Licensing data can be just as costly for one person as it is for an institutional purchase and, as such, it is in the institution's best interest to standardize access.
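
To illustrate the first point, the following is a minimal sketch of the data management involved in merging multiple years from the same source. It assumes the source renamed its columns between years; the column names, years, and values are made up for illustration and are not taken from any real dataset or from the DataCore itself.

```python
# Hypothetical yearly extracts: the same source renamed its columns between years.
RAW_BY_YEAR = {
    2018: [{"pt_age": 64, "los": 3}],
    2019: [{"age": 71, "length_of_stay": 5}],
}

# Hypothetical per-year mapping from source column names to unified names.
COLUMN_MAP = {
    2018: {"pt_age": "age", "los": "length_of_stay"},
    2019: {"age": "age", "length_of_stay": "length_of_stay"},
}


def harmonize(raw_by_year, column_map):
    """Rename each year's columns to the unified names and stack the rows."""
    merged = []
    for year in sorted(raw_by_year):
        mapping = column_map[year]
        for row in raw_by_year[year]:
            unified = {mapping[name]: value for name, value in row.items()}
            unified["year"] = year  # keep track of where each row came from
            merged.append(unified)
    return merged


merged_rows = harmonize(RAW_BY_YEAR, COLUMN_MAP)
```

Even in this toy case, someone has to build and maintain the per-year mapping; doing that once for the whole college is the kind of work curation is meant to centralize.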

In version 1 of the DataCore, CATALYST seeks to streamline this process by facilitating access to data as a strategic resource. Upcoming data and technologies will be integrated into this resource as they are developed. For more information about how the DataCore works, check out the CoreFAQs.

If you would like to receive emails regarding updates, you can send a request to



About the DataCore

The DataCore is a shared resource available to researchers in The Ohio State University College of Medicine (COM) that brings large-scale clinical datasets into an analytic platform that is easy to access and simple to use to facilitate outcomes research.


The CDC will be a tool that streamlines research on secondary datasets. It is a shared resource available to researchers in COM that will reduce the costs associated with data licensing and the time associated with data acquisition and processing. The CDC contains large-scale clinical datasets such as MarketScan, the Healthcare Cost and Utilization Project (HCUP), and Centers for Medicare & Medicaid Services (CMS) claims data. It’s an analytic platform that will empower researchers to ask their question in one data source and get answers from many. The CDC will be easy to access: researchers will use an application to select the data they want, and automation will pull it into the statistical analysis software of their preference. The CDC is a source of truth for all the data held within it and includes clear instructions on how to use the data, which are already cleaned, merged, and harmonized.


The CDC will facilitate outcomes research, empowering researchers to ask their question once and investigate it in many data sources, over many time periods, automatically. 


Taken together, the CDC helps streamline, simplify, and automate scholarship and discovery. 


  • All work with the data is covered by a master exempt IRB, meaning researchers will not need to fill out a unique IRB for every project they work on.
  • All data are harmonized, which allows researchers to focus on discovery rather than converting the data to a consistent format.
  • Different datasets are mapped to each other, which allows researchers to find alternative datasets in which to perform their analysis.
  • All of the rules, restrictions, and costs for datasets are clearly spelled out in a single location, saving researchers from having to hunt down or inquire specifically about every dataset. This also reduces the cost for the researcher by using the data license purchased by the DataCore rather than necessitating the individual purchase of the data.



The DataCore will streamline informatics around secondary data by acting as a one-stop shop for the entire process of obtaining data.

A repository within the DataCore will detail which years of which datasets are available. This repository also will contain all related documentation and data dictionaries for these datasets. As these datasets are loaded into the data commons and used, documentation will be added to the repository to detail the methods used for harmonizing the data as well as all updates and corrections that have been made to the dataset.

Individuals may have varying levels of access based on their appointment. The DataCore will act as a source of truth to detail which individuals have permission to access which datasets for a given project. Additionally, the DataCore will provide a step-by-step walkthrough of what is required to access both currently available and additional data. The DataCore will provide a list of restrictions and rules as detailed by:


  • The data use agreement with each dataset,
  • The cost of access to each dataset, and
  • A blanket IRB.


Once the access to data is established, users will be granted access to query the data commons. Users will be given a file that will allow them to access the data with its related metadata in their preferred statistical software.




The DataCore will simplify informatics of secondary data by empowering users to ask one question but query many times.


The data commons will contain the following deidentified data:

  • Healthcare Cost and Utilization Project (HCUP) from the Agency for Healthcare Research and Quality (AHRQ): admission-level data about hospital admissions, readmissions, and emergency department use
  • Centers for Medicare & Medicaid Services (CMS) claims data: claims-level data about all claims filed through Medicare
  • American Hospital Association (AHA) Annual Survey (AHAAS) with Information Technology supplement (AHAIT): annualized survey about hospital demographics
  • Health Information National Trends Survey (HINTS) from the National Cancer Institute (NCI): person-level survey that asks people general thoughts on cancer and cancer-related topics
  • Patient-Centered Outcomes Research Institute (PCORI) PCORNET: patient EHR data along with PCORI studies in a unified data model
  • Epic Systems’ Cosmos: patient EHR data
  • IBM’s Truven: a set of clinical datasets, one of which is a patient EHR dataset

The data commons is the one data source to rule them all.

The data commons is a single SQL database structured to allow for intra- and inter-dataset analysis and streamlined for ease of access. Each data source within the data commons has been merged across its available years, corrected, and validated. The data commons also contains metadata that explains what questions each data source asks and what the responses to those questions mean.
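
As a rough sketch of what this could look like, the example below builds a tiny in-memory SQLite database with one data table and one metadata table. The table names, column names, and rows are hypothetical and made up for illustration; they do not describe the actual schema or contents of the data commons.

```python
import sqlite3

# In-memory database standing in for the data commons (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE admissions (
        source TEXT, year INTEGER, age INTEGER, length_of_stay INTEGER
    );
    CREATE TABLE variable_metadata (
        source TEXT, variable TEXT, description TEXT
    );
    -- Made-up rows, not real patient or claims data.
    INSERT INTO admissions VALUES
        ('HCUP', 2018, 64, 3),
        ('HCUP', 2019, 71, 5),
        ('CMS',  2019, 80, 4);
    INSERT INTO variable_metadata VALUES
        ('HCUP', 'length_of_stay', 'Days between admission and discharge');
    """
)

# Intra-dataset analysis: one source compared across its years.
intra = conn.execute(
    "SELECT year, AVG(length_of_stay) FROM admissions "
    "WHERE source = 'HCUP' GROUP BY year ORDER BY year"
).fetchall()

# Inter-dataset analysis: several sources compared within one year.
inter = conn.execute(
    "SELECT source, AVG(length_of_stay) FROM admissions "
    "WHERE year = 2019 GROUP BY source ORDER BY source"
).fetchall()

# Metadata lookup: what a variable means in a given source.
description = conn.execute(
    "SELECT description FROM variable_metadata "
    "WHERE source = ? AND variable = ?",
    ("HCUP", "length_of_stay"),
).fetchone()[0]
```

Storing every source and year in one table with `source` and `year` columns is only one possible design; it is used here because it lets both kinds of comparison be written as a simple `GROUP BY`.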




The DataCore will empower users to automatically determine apples-to-apples comparisons with alternate dataset variables and automatically query the data commons once new data is available.

When a user makes a request of the data commons, a query will be procedurally generated to fulfill the request. Using this information, the DataCore will automatically determine equivalent questions that could be asked of other data sources contained within the data commons and provide those matches to the requestor. This will empower the requestor to validly compare their findings in one dataset with findings in others.
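
One way to picture this is a crosswalk that maps a single research concept to the table and column that record it in each data source, from which one query per source can be generated. The sketch below is an assumption about how such a mapping might work, not a description of the DataCore's implementation; the table and column names are hypothetical.

```python
# Hypothetical crosswalk: one research concept mapped to the table and
# column that hold it in each data source.
CROSSWALK = {
    "length_of_stay": {
        "HCUP": ("hcup_admissions", "los"),
        "CMS": ("cms_claims", "stay_days"),
    },
}


def build_queries(concept, crosswalk):
    """Generate one SELECT statement per data source that records the concept."""
    queries = {}
    for source, (table, column) in sorted(crosswalk[concept].items()):
        # Identifiers come from the curated crosswalk, never from user input,
        # so interpolating them into the SQL text is acceptable in this sketch.
        queries[source] = f"SELECT {column} AS {concept} FROM {table}"
    return queries


queries = build_queries("length_of_stay", CROSSWALK)
```

Because every generated query returns the concept under the same name, results from different sources can be lined up side by side for comparison.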

The DataCore will be able to automatically update the requestor’s data once new information is available. Requestors also will be able to upload their code to the DataCore, which will translate it so the procedure can be performed automatically in the future. The DataCore will run the researcher’s code in the backend, automatically querying all equivalent datasets across all available years and re-querying them as new data becomes available.


Example use case


You’re a clinician with an idea.


You want to investigate a question in the most recent National Inpatient Sample (NIS) dataset; however, you don’t have the data and you don’t know the process you need to go through to get it. That’s where the DataCore comes in. The DataCore:

  • Eases the process by being a single location for every step of data acquisition. 
  • Tells you which data sets are available and provides a data dictionary to help you determine what data you need to answer your question.
  • Tells you what you need to do to access the NIS.
  • Guides you through the process of gaining access to the data set along with the cost, rules, and requirements surrounding its use.
  • Allows you to use the statistical software of your preference to investigate the data.
  • Automatically pushes new data to your working dataset as it becomes available.


Once you have completed your analysis, you will be able to upload your code into the DataCore; it will be able to automatically run the code for you against all related datasets. The DataCore will then be able to reach out as new data becomes available.

The DataCore is a different approach to doing informatics with secondary data. It empowers researchers to ask whether their findings remain valid across other years of data or in other sources of data. By changing these data sources, we are changing the possible outcomes. This approach to inducing intersubjective reliability is methodologically novel. Furthermore, the DataCore is designed with ease of use in mind. It helps researchers navigate the labyrinth of processes and requirements for using these datasets.


It is our intent to have a blanket IRB.


As updates are made to version 1 of the DataCore, an update log will become available.


If you would like to receive emails regarding updates, you can send a request to