The data tsunami—the Environmental Sciences perspective

Annemarie Eckes, a PhD student at Cambridge University, is one of the Jisc Research Data Champions. She is also a Research Data Champion at Cambridge. She recently participated in the RDA EU-ENVRI summer school on data management and data science in Espoo, Finland. These are her reflections on the experience.

 


Like most disciplines, Environmental Sciences are experiencing a data tsunami, with automated measurements generating vast amounts of data. But we are currently incapable of efficiently harvesting all the “Scientific Energy” from this tsunami to drive scientific findings. The reason is that we cannot yet channel the data into a format that is easily re-usable by everyone. We are simply being overrun by this tsunami and see data passing by that some have looked at, but that others have no chance of double-checking or re-using. Ultimately, the data does end up somewhere. Like water into lakes, ponds, little streams or small cavities in the soil, data trickles into repositories, spreadsheets and local databases, stored in the cloud, on individuals’ computers, as supplementary materials or on hard drives… lost, forgotten, a lot of it dead data.

Research Data Alliance summer school

Jozef Mišutka introducing the repository ClarinDspace on the first day

By now most scientists should be aware of this issue. But as a research community we face the huge challenge of managing a lot of data together, so that everyone can find it and access it, and so that it is both human and machine readable and can ultimately be reused. In short, the data should be FAIR: Findable, Accessible, Interoperable and Reusable (more about the FAIR data principles in Wilkinson et al., 2016; DOI: 10.1038/sdata.2016.18). There are efforts on different fronts to address these challenges: to describe data, transform metadata, change habits. Bottom-up or top-down? At the RDA EU-ENVRI summer school we were exposed to both bottom-up and top-down examples and tried some tools hands-on. I suppose the aim of this workshop was to pass on some of the current developments, as well as the FAIR principles, to scientists interested in data management and data analysis. I will try to briefly summarise what we could take away from this workshop.

Why share Data?

The first hurdle to making data available is often to convince people to share their data in the first place. How to incentivise people, and how to make them realise the usefulness of making data open and documenting it in a standardised and structured manner, is still an issue, I find. But scientists live on citations and acknowledgements. So if proper acknowledgement of the data creators is ensured, through Creative Commons licenses and proper citation of the data, this might be catered for. Besides this carrot approach, the ‘stick’ is coming from the funding agencies and the publishers, who nowadays often require scientists to deposit their data in public repositories. But are repositories ready for this? Have we agreed on all the necessary domain-specific standards and minimal requirements for metadata, and come up with sensible data formats that make repositories and analysis tools interoperable? And how can we ensure we are capturing the right information to keep data meaningful in the future? The best principle might be: shared data is better than unshared data, and shared data that is well described is better than shared data that is badly described.

Repositories

How and where to share data? During the workshop, we were introduced to repositories. The example case for this workshop was Clarin Dspace (http://www.dspace.org/, https://github.com/ufal/clarin-dspace).

Trustworthy repositories are repositories that fulfil certain data management related standards. For example, they commit to making their data openly available, to being sustainable and to publishing their own data management policy. They can even be externally certified for it (e.g. the Data Seal of Approval, DSA). To make your own repository more visible, you can suggest it at http://www.re3data.org/, http://www.duraspace.org/ or https://www.openaire.eu/content-acquisition-policy.

Metadata

Metadata provides context to data, aids data discovery, and can be used as a workflow description in data processing. Scientists generally do record a lot of metadata about their data. The issue, though, is that it might not be structured and is therefore not machine-readable.

When trying to create structured metadata, there are many different standards to choose from. One that has been around for a while is Dublin Core (http://dublincore.org/), a very basic set of 15 terms. Another, ISO 19115 (https://www.iso.org/standard/32579.html), is quite relevant for spatial data and therefore helpful for describing environmental data. With so many standards around, a repository that wants to be truly findable and interoperable nowadays needs to expose its metadata through multiple schemas, possibly by transforming one format into another (see metadata transformations: https://www.ncddc.noaa.gov/metadata-standards/metadata-xml/). Once these are in place, metadata from the repository can be exposed to harvesters (read more: https://www.openarchives.org/pmh/tools/). An example of a harvester is B2FIND. Such a harvester is useful, as it finds scientific data “irrespective of their origin, discipline or community” (https://www.eudat.eu/services/b2find), provided the repositories expose their metadata in a suitable format that allows harvesting protocols to extract the (meta)data.
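To give a feel for what "structured" means here, the following is a minimal sketch of a Dublin Core style record built with Python's standard library. The dataset description is invented for illustration, and the plain `record` wrapper element is an assumption of this sketch; a real repository would wrap the terms according to its own schema.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15-term Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dublin_core(fields):
    """Build a flat XML record with one dc:<term> element per entry."""
    record = ET.Element("record")  # wrapper name is an assumption
    for term, value in fields.items():
        element = ET.SubElement(record, f"{{{DC_NS}}}{term}")
        element.text = value
    return record


# Hypothetical dataset description, invented for this example.
example = {
    "title": "Hourly soil moisture, example site",
    "creator": "Example, A.",
    "date": "2017-06-15",
    "type": "Dataset",
    "format": "text/csv",
}

record = to_dublin_core(example)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because every value sits in a named, namespaced element, a harvester can pick out the title or the date without guessing, which is exactly what free-text notes in a lab book do not allow.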

OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting) is a well-defined protocol for exchanging metadata. It transfers metadata records from a source (the repository) to a destination (the harvester). OAI-ORE is a related standard that describes aggregations of web resources, so it can be used in a similar manner to exchange the actual data.
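As a rough sketch of what a harvester does, the snippet below builds an OAI-PMH `ListRecords` request URL and pulls identifiers and titles out of a response. The base URL `https://repository.example.org/oai` is hypothetical, and the response is a hand-written, heavily trimmed sample rather than output from a real repository, so no network access is needed to run it.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request; oai_dc is the Dublin Core format
    that every OAI-PMH repository is required to support."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


# Hand-written sample response (assumption: trimmed for illustration).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:123</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def extract_titles(xml_text):
    """Return (identifier, title) pairs for each record in a response."""
    root = ET.fromstring(xml_text.strip())
    results = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        identifier = record.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        title = record.findtext(f".//{{{DC_NS}}}title")
        results.append((identifier, title))
    return results


url = list_records_url("https://repository.example.org/oai")  # hypothetical endpoint
titles = extract_titles(SAMPLE_RESPONSE)
print(url)
print(titles)
```

A real harvester would also follow the resumption tokens that repositories use to page through large result sets; that part is left out here to keep the sketch short.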

But by web standards OAI-PMH is a relatively old way of harvesting metadata, one that emerged before REST became common. ResourceSync is another framework, which works well for synchronising metadata in the current web architecture.

Permanent identifiers

Permanent (or persistent) identifiers, PIDs for short, can be assigned to point at metadata and data. Assigning them to data and metadata is an important contribution to meeting the FAIR principles.

There are differences between PIDs in general and DOIs. PIDs are not persistent in themselves, but “persistable”. They are identifiers that point to some digital object. The object can even be moved to a new location; as long as the PID administrator updates the reference URL, the PID will still point to the data. DOIs are one type of persistent identifier. According to DataCite, getting a DOI requires the data to be “stable”, that is, never changed or moved once a DOI is assigned. The data also needs to be stored sustainably so that access is never lost.

We had a session on how to mint PIDs. We derived them ourselves, and then added metadata to the PIDs themselves. If you need to assign PIDs for your repository, have a look at handle.net (https://www.handle.net/, http://www.pidconsortium.eu/).
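To illustrate the idea that a PID is just a pointer whose target can be updated, here is a small sketch of looking up where a handle currently points. It assumes the JSON lookup interface of the public Handle proxy at hdl.handle.net; the handle `21.T99999/example-dataset` and the sample response are made up for this example, so the code runs without contacting any server.

```python
import json


def handle_api_url(handle, proxy="https://hdl.handle.net"):
    """URL for looking up a handle record as JSON via the Handle proxy."""
    return f"{proxy}/api/handles/{handle}"


# Invented sample of a handle record (assumption: real records carry
# more fields and often more values than shown here).
SAMPLE_RECORD = json.dumps({
    "responseCode": 1,
    "handle": "21.T99999/example-dataset",
    "values": [
        {
            "index": 1,
            "type": "URL",
            "data": {
                "format": "string",
                "value": "https://repository.example.org/datasets/123",
            },
        }
    ],
})


def landing_url(record_text):
    """Return the location stored in the first URL-typed value, if any."""
    record = json.loads(record_text)
    for value in record.get("values", []):
        if value.get("type") == "URL":
            return value["data"]["value"]
    return None


lookup = handle_api_url("21.T99999/example-dataset")  # hypothetical handle
target = landing_url(SAMPLE_RECORD)
print(lookup)
print(target)
```

If the dataset moves, only the stored URL value has to change; the handle that papers cite stays the same.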

Broken links to the data that scientific papers cite reduce the value of those papers, because they inhibit reproducibility. That this is already happening has been shown by Klein et al. (2014), who found that ONE IN FIVE (!) articles contains references with one or more broken links (https://doi.org/10.1371/journal.pone.0115253). I find this an eye-opening number and hope that PIDs and DOIs will help prevent link rot in science.

How does all this work together?

The most interesting bit was seeing how all of the above fits together:

We learned in a hands-on session how to get from a single PID handle straight to downloading a particular dataset and its metadata. From there one can manipulate the dataset locally in Python, R or any other language of preference. For example, one could convert a date column that recorded dates as free text into a standard datetime format, which is then also readable by machines. Metadata can be added, and the result can be programmatically resubmitted as a new entry in a collection of a repository.
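The date clean-up step could look roughly like the sketch below, which uses only Python's standard library. The sample data, the column name `date` and the assumed source format of day/month/year are all invented for illustration; the datasets used in the hands-on session are not reproduced here.

```python
import csv
import io
from datetime import datetime

# Invented sample data: dates recorded as day/month/year text.
RAW_CSV = """date,temperature_c
15/06/2017,14.2
16/06/2017,15.8
"""


def normalise_dates(csv_text, column="date", source_format="%d/%m/%Y"):
    """Rewrite one text date column as ISO 8601 (YYYY-MM-DD)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        parsed = datetime.strptime(row[column], source_format)
        row[column] = parsed.date().isoformat()
        rows.append(row)

    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return output.getvalue()


cleaned = normalise_dates(RAW_CSV)
print(cleaned)
```

ISO 8601 is a sensible target because it sorts correctly as text and removes the day/month ambiguity that makes dates like 05/06/2017 unreadable for machines.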

Take home message

Of course, I could not go into great detail here, and I have left out some of the content, such as data analysis and the problems of interoperability: differences between the data models of repositories, ontologies, and how to represent digital objects and data types.

But you can learn more, and have a look at upcoming RDA events yourselves, at https://www.rd-alliance.org/events.html. The take home message for me from this course is that on many levels there is still so much work to be done to get the most out of this data tsunami. But I think that courses like this one, and the people who taught and attended it, will hopefully help not only to manage, tidy, describe, rescue, analyse and share (meta)data, but also to carry the awareness that this is a good idea out to our fellow scientists.
