Abstract
In search of more efficient data sharing:
We recently asked a colleague to share a dataset they published along with their paper at one of the ACM conferences. The paper had the "Artifacts available" badge in the ACM Digital Library, highlighting the research in the paper as reproducible.
Yet, the instructions to get the dataset required several steps rather than just a link: log in, find the paper, click on a tab, scroll, get to the dataset. It was much better than receiving the dataset by email.
But in many other research disciplines (biology, geophysics, biodiversity, social sciences, cultural heritage), sharing data and other research artifacts is streamlined and is the cultural norm. Computer science (CS) is pretty good at sharing software. How did CS researchers fall behind many other sciences in how we think about sharing data?
Let's start by distinguishing three different aspects of data sharing: open data, data required for reproducibility of published research, and data as a first-class citizen in scientific discourse. All three aspects are related, but they are not the same: a dataset can be open but not citable or easily discoverable, for example. Or a dataset may be findable and interoperable, but not open.
Of the three aspects of data sharing we mentioned, open data, or data that is available for free under appropriate licenses, is probably most familiar to many CS researchers: most of us are steeped in open source software and understand and appreciate the value of sharing our software in an open way. Open data is just as important and is the bedrock of data-driven research and innovation as practiced by, for example, modern bioscience.
Original language | English |
---|---|
Pages (from-to) | 36-38 |
Journal | Communications of the ACM |
Volume | 66 |
Issue number | 1 |
DOIs | |
Publication status | Published - 1 Jan 2023 |