Open data and open code advance scientific progress but must be rewarded in researchers’ careers

19.5.2023 Jaakko Kuha & Laura Niemi

Maria J. Ribeiro is a researcher in human neuroscience at the Coimbra Institute for Biomedical Imaging and Translational Research (CIBIT), ICNAS and Faculty of Medicine at the University of Coimbra, Portugal. Maria completed her BSc in physics in 1996 at the University of Porto, Portugal, and obtained her DPhil in neuroscience in 2003 from the University of Sussex, UK.

Her research contributes towards unveiling the neural mechanisms behind perception, decision making and actions in humans using non-invasive neuroimaging techniques and physiological recordings. Since 2015, Maria has been working in the field of ageing. She is interested in understanding the neural changes that occur in ageing, and how they cause cognitive decline and contribute to the onset of neurodegenerative diseases associated with the ageing process.

Website: http://neuroscience.pt/
ORCID-ID: 0000-0001-6422-3279
GoogleScholar
Twitter: @Ribeiro_neurosc

You do research in the field of human neuroscience. How do you view the value of open data in your field?

I think there are two important aspects to opening data. One is to verify the results that are published and try to tackle fraud in science. Once the data is published, at least we can verify that the data exists and that the findings were not fabricated. Then we can also reanalyze the data and make sure that there are no errors in the analysis. In fact, it is not just that you can publish the dataset, but you can also make all the analysis code publicly available. That is something that I have done for my latest publication. That means that if someone else wants to repeat my analysis they will have the analysis code to do it. Also, if they want to use some part of the analysis for their own study, that can be done too. I’ve done that myself. I’ve gone through other people’s papers and adapted their code for my own research. That is a big help, as then we don’t need to keep reinventing the wheel. But that also ensures that errors in the analysis may be caught by other people, and then, the study may be reformulated. It’s something that I think is still not done enough.

The other reason why opening datasets can be very important is to maximize the usefulness of the data itself. In neuroimaging, the data are very rich. There’s so much that we can study with one single dataset. And usually, in the original study, the researchers will only look at one aspect of the data. For example, if we have data that covers the whole of the brain, or EEG data that covers the cortex, we can look at the visual cortex, we can look at the frontal cortex, we can look at frequency data, we can look at connectivity, we can look at so much. New methods are also constantly evolving that permit you to go back to old datasets and reanalyze them. So, it’s important to have the datasets available so we can go back and forth and have different people analysing them from different angles.

As you already mentioned, it does take some effort to open data. Could you explain the process of opening the data from the perspective of your research?

I did, obviously, some research on the internet to try to find out what would be the most appropriate place to deposit my data. Then I followed the suggestions of the EEGLAB toolbox. They have a plugin that helps you to format the data into the BIDS format. I read whatever was available about how you should format the data. EEGLAB also has a YouTube channel where they have videos explaining how to organize your EEG data for sharing it. They also suggest publishing it in OpenNeuro. So, I just followed their advice, I used their MATLAB code, and adapted the code to do it. It took me one week of work to finish the process, to organize it and to make it work. The data has a little bit of a strange format, because it has two groups of participants, each group has three different tasks, and the tasks have different numbers of runs. I had to adapt some of the code that was available to make it work and organize the data as it should be organized. The reason why I could justify taking the time and effort to do this is because I wanted to publish my study in a journal that values open data, and I think this is an important incentive for researchers. If we want the data to be useful and be reused by other researchers, we want to make sure that all the relevant information is with the dataset. It’s obviously something that takes time, you must sit down, think about it, and have all the information there.
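As an illustration of the kind of organization described above, here is a minimal Python sketch of BIDS-style file naming for EEG data with several tasks and a different number of runs per task. It is not the EEGLAB plugin or the MATLAB code used for the dataset. The task names, run counts, group labels and helper function names are hypothetical, chosen only to show the pattern. The path layout (sub-<label>/eeg/sub-<label>_task-<label>_run-<index>_eeg.<ext>) and the participants.tsv file with a participant_id column follow the BIDS convention; the group column is an optional addition.

```python
from pathlib import PurePosixPath

# Hypothetical task names and run counts; the interview does not give
# the real values for the dataset.
RUNS_PER_TASK = {"passive": 1, "simple": 2, "gonogo": 4}


def bids_eeg_path(subject, task, run, ext=".set"):
    """Build a BIDS-style relative path for one EEG recording.

    ".set" is the EEGLAB file extension; other formats are possible.
    """
    name = f"sub-{subject}_task-{task}_run-{run}_eeg{ext}"
    return str(PurePosixPath(f"sub-{subject}") / "eeg" / name)


def subject_files(subject, runs_per_task=RUNS_PER_TASK):
    """List every recording path for one participant, task by task."""
    return [
        bids_eeg_path(subject, task, run)
        for task, n_runs in runs_per_task.items()
        for run in range(1, n_runs + 1)
    ]


def participants_tsv(groups):
    """Render participants.tsv content from a {subject: group} mapping.

    Group membership is stored as a column here, not in the file names.
    """
    lines = ["participant_id\tgroup"]
    lines += [f"sub-{subject}\t{group}" for subject, group in sorted(groups.items())]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for path in subject_files("01"):
        print(path)
    print(participants_tsv({"01": "young", "02": "older"}), end="")
```

Keeping the group in participants.tsv rather than in the folder structure means both groups share one naming scheme, and the varying number of runs per task is handled by the run index alone.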

Was it your personal interest to share your data?

It was my personal interest. I think within my local research community, I was the first one to open data although we have been talking about it for a while. There are others who are now also interested in doing it. Obviously, it is easier with EEG data, because it’s easy to make it non-identifiable, to anonymize it. With MRI data, we have the problem that the person can be identified, because you can identify the face of the participant. So, with MRI data it is a bit more challenging to actually make it completely anonymous. You will need to deface the data, and there are some challenges there.

What were the reasons that made you think about it in the first place? How did you come up with the idea?

It was also because I don’t want this data to be lost. I know there’s so much potential to it, and I don’t have time to do everything. So, if someone else can get interested and pick it up I’m really happy to make it available for the community. For me, it is quite rewarding to know that people are interested in my work and the dataset that I produced. That is one reason. The other incentive to actually spend the time and do it was that I knew that with an open dataset, the study would be viewed as stronger for publication. So, if the journal values open data, that is obviously another incentive to do it.

Have you had any concerns over opening your data?

My dataset is published in OpenNeuro. We’ve published three papers already with that dataset, and we are now working on a fourth paper. But the data is now open, and it has been downloaded over 200 times. However, I don’t know what the other researchers are doing with it. I’m not worried that they are going to publish the same thing that I’m working on, but I’m curious. I know, in some cases, the datasets are made open, but the researchers require that you contact them and tell them how you are going to use the data. But in OpenNeuro, you don’t have that option. The idea is that it is open, and anyone can download it and do whatever they want. Just this week I asked them whether an email could be sent to me if someone publishes something with the dataset. I know that I can always look for it online but sometimes you might miss it. It would be good to have an easy link from the dataset to all the publications that resulted from someone using that dataset, and that is not yet implemented in the repository.

How do you view the role of data management and data management planning in relation to opening research data?

I think it’s important that the laboratories and the institutions have a clear policy on open data. It is a fairly new idea, and a lot of people will find it strange. You know, one may wonder, what if someone else publishes before me, or what are the other researchers going to do with the data? So, it is important that these kinds of issues are discussed with the researchers and that there is some sort of local policy on what should be done: how to organize your data, how to anonymize the data and how to make sure that it’s not identifiable. I think that the researchers need to be supported here, because it’s not trivial to do it well the first time.

Do you find that open data and its benefits are discussed enough?

Not as much as I would like. I think that one of the big advantages, which I already mentioned at the beginning, would be to have the published studies double-checked. So, you publish a study, and someone else goes and checks the data, checks your analysis, and then finds any mistakes. This is something that I don’t think is done enough. I don’t think we are there yet. I don’t see this being done in a systematic way to catch mistakes and studies that are not replicable.

Your data has received quite a lot of downloads already. Did you take some specific actions to increase its visibility?

I haven’t done anything, actually. Now thinking about it, maybe I could have disseminated it a bit more. So, maybe it is just the dataset that catches people’s attention because it has a relatively large number of participants, EEG, ECG and pupil data. It also has a passive condition that people can use for resting state studies, and that is something that is also quite popular. I hope we captured people’s attention for the right reason, and they’re doing nice research with it.

What do you think are the biggest barriers for those not opening their data?

Researchers are focused on whatever output is valued for their career progression. So, if it’s publications, that’s what they’re focused on. And if publishing a dataset is not seen as important, then it will be something in the back of their minds. Then there may be concerns about privacy and the identifiability of the data that can be problematic. Another concern may be people finding errors in your studies. Not everyone would be happy to publish something and then have someone else come along and say that it’s wrong. Finally, if you are still going to keep using the data, there may be concerns about someone else publishing the same results.

Could you share any tips with researchers who are considering taking the effort of opening their data for the first time?

It is really important to have the data well documented. Try to put yourself in someone else’s shoes and think what it is that they need to know in order to use the data. I also recommend using a standard format that has been validated by the international community. That makes it easy for others to use it, and also helps to structure all the information, also for ourselves. Also, don’t think that you are going to do it in your free time. You will need to assign time to do it. That is also why I think it is important that the researchers feel that this time that they are investing is actually valued for their careers. It’s very easy for us to say that it’s good for science. But then, if we don’t have a job or if we cannot progress in our careers, that’s also going to have a big impact. So, it is important that the decision-makers send a clear message if it is important to share data or if it is not. I don’t think that this message is very clear yet.

Acknowledgements: This interview was conducted as part of the RI4C2 project, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101035802.

Publication information: 3/2023, Open Up! blog, ISSN 2814-8967