At the beginning of this month we hosted our first Research Data Alliance meeting in the UK. The Research Data Alliance (RDA) is a unique community-driven organisation of over 4,500 volunteers and over 44 organisations from 115 countries, with a shared interest in discussing and developing ‘data bridges’ to enable open data sharing across technologies, disciplines and countries. Plenary meetings are held twice a year across the globe and are well attended. However, this workshop was based in the UK with the aim of bringing the RDA to the wider UK-related community.
Slides are available from the Jisc repository.
The RDA is made up of many interest and working groups that, by its own admission, can seem quite overwhelming at first. At our workshop the morning presentations gave an overview of the RDA and its work, but the group sessions allowed for discussion and feedback from the audience on the outputs of the following areas: Trust and Certification, Metadata Standards, Publishing Data, and Data Citation.
Rachel Bruce (Jisc) explained that the purpose of the day was to raise awareness of the RDA, show how it works, and for the audience to learn about some of its outputs and how they could be applied in their organisations. A sustainable research infrastructure requires a global community to sustain it and one way to do this is through the RDA. Jisc is working with the UK Research Councils to support this RDA work in the UK.
Mark Parsons, the Secretary General of the RDA, gave an overview of the RDA and the direction it is heading. Lessons learned from other industries show that sustainability requires flexibility, adaptation and response to rapid change. The RDA is trying to develop a data infrastructure based on this type of structure. The “build it and they will come” model doesn’t work. Instead the RDA focuses on connections – building the social and technical bridges. Their vision resonates with people, as they have grown to over 4,500 members in 3.5 years. This includes both individuals and organisations helping to build an infrastructure. Open problem solving is the key to ensuring that interest and working groups are successful, and the groups are allowed to work on whatever they feel is important. They are community driven and non-profit, with no favoured technology solution. This type of workshop is important as you have to learn from local, specific issues and bring these to the global groups.
Juan Bicarregui (STFC and also part of RDA Europe) described how open data policy has developed over the last 10 years. Looking at policies from OECD, the EC and G8 there has been a focus on openness, interoperability and sustainability. The G8+5 group looked at the FAIR principles and recommended that an international forum on data interoperability should be set up. This was one impetus for the formation of the RDA. One difficulty overcome has been getting EU, US and AU funders working together, and this success has shown how these are global issues. The recent H2020, EU Data Infrastructure and EU Open Science Cloud programmes have shown that the interoperability required for data sharing is lacking, with deep-rooted walls between disciplines. There are lots of things we need to do to reduce these barriers: for example, develop specifications for interoperability and cloud-based services for open science, and enlarge the scientific user base. The RDA is dealing with many of these issues and has hit the nail on the head for research funders. However, there is a lot to deliver. The RDA structure provides the coordinating framework with 27 working groups and 46 interest groups, some at the coal face, others working at higher levels.
The workshop then split out into breakout groups to discuss the outputs and their adoption from four separate areas:
- Trust and Certification
- Data Citation
- Metadata Standards
- Publishing Data
Here I will give a brief overview of the sessions, but more information is provided in the workshop report.
Group 1 – Trust and Certification
Lesley stressed the importance of having trustworthy digital repositories with the overall aim that data is reusable. There are a number of different certification schemes but there is a need for a core foundation level for repositories. This working group was set up with members of the Data Seal of Approval and the ICSU World Data System to come up with a set of core requirements and common procedures. This inspires trust, which is at the heart of sharing and archiving of data. The output produced is the Catalogue of Common Requirements, which contains 16 core requirements. This is one step towards a more coherent set of stringent and compatible standards for repository certification.
Ingrid illustrated what Lesley had talked about with her experience of certification at DANS. She stressed that you MUST have management level commitment to get the job done and set up a core certification team to plan, discuss, monitor and partly execute the work. Addressing all 16 guidelines takes about 2 weeks for the self-assessment, but this could rise if work is needed to meet them. Start at the core base level and the effort will rise as you go up the levels. You do not certify for life, but have to pass a test regularly. It took 250 hours to do the renewal and most of this was on technical development, then writing the self-assessment.
In 2015 DANS went for the extended NESTOR seal of certification. This required around 1500 hours of work, with nearly half of the organisation involved in some way. Although this is a large commitment it builds stakeholder confidence in the repository, raises awareness about digital preservation, improves communication within the repository and processes, and ensures transparency. This is important to DANS as it builds trust with stakeholders but is also a big stick to further develop and professionalise their core services. As the audience response demonstrated, this is a big commitment, but some of the benefits to working practice mean it has been worth it for DANS.
Group 2 – Data Citation
Unfortunately, Ari Asmi was unable to make the session due to bad weather in Helsinki delaying his flight. However, Carlo Zwölf, who was presenting as an adopter, kindly presented Ari’s slides.
The Data Citation group has produced guidelines to support dynamic data citation. Citation is needed for scientific and career development purposes. The group agreed to look at: dynamic data, arbitrary subsets, stability across tech, machine actionable, scalable to very large and dynamic datasets. A query of a dynamic dataset is given a Persistent Identifier (PID), which is time stamped and stored in a database. This means any query can be re-executed by others.
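The idea of a time-stamped, re-executable query can be illustrated with a small sketch. This is not the working group's reference implementation: the class names, the `hypothetical-pid:` prefix and the single `min_value` filter are all assumptions made for illustration, and a real system would use a proper PID service and database. The sketch only shows the principle that data is versioned rather than overwritten, so that a stored query can later return the subset exactly as it was when cited.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Record:
    key: str
    value: float
    valid_from: datetime                  # when this version was added
    valid_to: Optional[datetime] = None   # when it was superseded (None = current)


class VersionedStore:
    """Toy versioned dataset with a query store (illustrative only)."""

    def __init__(self):
        self.records = []
        self.queries = {}  # PID -> (query parameter, timestamp)

    def upsert(self, key, value, when):
        # Old versions are closed off, never deleted.
        for r in self.records:
            if r.key == key and r.valid_to is None:
                r.valid_to = when
        self.records.append(Record(key, value, when))

    def run(self, min_value, as_of):
        # Return the rows that were current at `as_of` and match the filter.
        return sorted(
            (r.key, r.value)
            for r in self.records
            if r.valid_from <= as_of
            and (r.valid_to is None or as_of < r.valid_to)
            and r.value >= min_value
        )

    def cite(self, min_value, when):
        # Store the query with its timestamp under a made-up identifier.
        digest = hashlib.sha256(f"{min_value}|{when.isoformat()}".encode()).hexdigest()[:12]
        pid = f"hypothetical-pid:{digest}"
        self.queries[pid] = (min_value, when)
        return pid

    def resolve(self, pid):
        # Re-execute the stored query against the data as of its timestamp.
        min_value, when = self.queries[pid]
        return self.run(min_value, when)


t1 = datetime(2016, 1, 1, tzinfo=timezone.utc)
t2 = datetime(2016, 2, 1, tzinfo=timezone.utc)
t3 = datetime(2016, 3, 1, tzinfo=timezone.utc)

store = VersionedStore()
store.upsert("a", 1.0, t1)
store.upsert("b", 5.0, t1)
pid = store.cite(min_value=2.0, when=t2)   # cite the subset as it stood at t2
store.upsert("a", 9.0, t3)                 # the dataset changes afterwards

print(store.resolve(pid))    # the cited subset, unaffected by the later update
print(store.run(2.0, t3))    # the same query against the current data
```

Resolving the identifier returns only the row that matched at the time of citation, while running the same filter against the current data also picks up the later update.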
There are mandatory and optional fields. The guidelines are being tested but the group is still looking for use cases. The Virtual Atomic and Molecular Data Centre (VAMDC) is adapting the way it does data citation to these new paradigms. The VAMDC federates 30 heterogeneous databases; it is virtual and contains no data itself, but is a wrapper for these databases. Therefore, all are formatted, and can be queried, in the same way.
Citation enhances trust and gives credit, but the approach used for papers does not work for data. Change can be rapid and not reported systematically. There is no continuum of change reported through papers and it needs to be provided by a database. The volume of digital data is large and constantly growing, and the problem is more anthropological than technical. The solution is to track the versioning of the data and employ mechanisms to speed up the process of citation. The possibility was raised of creating an exemplar in the UK on how this process can be implemented and whether Jisc could support an institution in implementing these guidelines.
Group 3 – Metadata Standards
The RDA Metadata Standards Directory Working Group stores its outputs in a standards directory on GitHub at http://rd-alliance.github.io/metadata-directory/. Details of the second work package of the Metadata Standards Catalogue Working Group are in the following Google Doc – http://goo.gl/7bdSiz
Alex Ball talked about the past, present and future of the metadata standards catalogue. The vast majority of scientists use in-house lab schemes and a massive amount of data has no metadata at all. This is a major problem as data may be useless without the metadata description. It lowers impact, future research questions can’t be answered and the data isn’t discoverable.
The group looked at existing work, including the DCC Disciplinary metadata catalogue, and undertook a survey of researchers to find what they were using. As a result, several new standards, profiles and tools have been added to the catalogue. The RDA catalogue is kept in sync with the DCC Disciplinary catalogue and both can be fed into. The ambition is to make data easy to find, encourage adoption and adaptation, but also be searchable, painless to contribute to and machine accessible. They are looking at endorsement of the scheme from different groups and will provide mappings to and from schemes that already exist. The next steps are on finalising the model, converting data, selecting tools, new interfaces and prototypes.
Sarah Jones described DMPOnline and how this will be importing data from the metadata standards directory. The hope is that the directory will allow relevant standards to be identified. The benefits of integration included an increased researcher awareness of what’s available. It’s also the intention to produce DMPs that can be used for machine analysis. The key benefit is demonstrating the use of catalogues in DMPs.
Dom Fripp has been engaging with the RDA Community, in particular to look at shared issues and common elements with regard to metadata. Everyone seems to face very similar challenges and so there are lessons that can be learnt and shared. With new schemas, such as schema.org, coming up there will be more opportunity to bring research data more usefully into search results. Dom has been looking at the ELIXIR Bridging Force Interest Group and how this impacts on Jisc’s research data Discovery Service and Shared Service work; certainly lessons from their work can be applied to the solutions we need. Jisc is working to update the SWORD deposit protocol, in particular so it can handle research data use cases. At the Denver RDA plenary Dom discussed SWORD4DATA with the RDA community; since such protocols are international, this is an invaluable forum for feedback. The work includes developing SWORD case studies, implementations and challenges, and updating the protocol.
Group 4 – Publishing Data
Jonathan Tedds described some of the issues around publishing data dealt with by the group. In major facilities such as the Hubble Space Telescope, more papers are based on reuse of original data than on original research. This shows that if you publish in a standardised way it promotes reuse. There is a growing set of unmanaged and non-public datasets from individual scientists, labs, etc. Increasingly they are required to make these datasets available. The scale of the problem can be shown in a zoology example: of 516 papers published between 1991 and 2011, only 37% of the data from the 2011 papers is still findable and/or retrievable, compared with 7% for the 1991 papers.
The RDA/WDS Publishing Data IG focussed on publishing workflows, services and bibliometrics.
Fiona Murphy described some of the outputs of the WDS/RDA Data Publishing Workflows working group. Its output Key Components of Data Publishing looks at using current best practices to develop a reference model for data publishing. This pulled together the key components for a data publishing workflow including core components required for a published product and additional elements for increased context, quality and visibility/accessibility.
Ian Bruno described the benefits of linking data and literature, which include increased visibility and discoverability. It also places research data and the articles in the right context to enable proper interpretation and re-use and support credit attribution. The Scholix framework (Scholarly Link Exchange) represents a set of aspirational principles and practical guidelines to support a global information ecosystem around links between scholarly literature and research data. It’s not an architecture but organisations are already developing services based on this framework. Future work will look at agreement on metadata and schema for capturing data-article links, developing hubs, and outreach to those with data/article links including publishers/repositories.
Finally, the Data Bibliometrics working group undertook a survey to look at the methods used to measure impact. Although everyone understands the value of sharing data, there is no quantitative way to say why data should be shared. The working group will conceptualise data metrics and corresponding services that are suitable for overcoming existing barriers to sharing.
How to engage with the RDA
Panel: Mark Parsons, Rachel Bruce (Chair) and Juan Bicarregui
At the end of the day delegates discussed how we can engage more effectively with the RDA and, given its international scope, how we translate the RDA into a local setting. It was agreed that having a hierarchy such as RDA Global, RDA Europe and RDA UK can be confusing, making it difficult for people to know at what level they should engage. In Mark Parsons’s experience it is useful to have a national data forum where national issues are discussed, which can then be brought to the RDA.
There is a need to reach out to practitioners who are never going to attend RDA plenaries, and a national group, such as the RDA UK, should allow them to voice their concerns and ask questions. Very few researchers go to RDA plenaries, so practitioners need to have a way of engaging with the research community. The national group should be a channel for linking people who are interested in the working groups and the work that is going on, enabling everyone to feed into the groups and learn from them.
There are people who won’t engage with the RDA but will want to adopt the recommendations, so the outputs need to be promoted and reach a maximum audience. Future workshops could look at implementations of the recommendations.
One benefit of attending RDA events is the serendipitous relationships formed within and across groups. The RDA is a window to different communities, a place to find solutions. Although the RDA can do advocacy to UK institutions there is no obvious way of feeding this back into the RDA. Organisations join the RDA and as members can represent their country’s institutions. The audience asked if Jisc could provide this voice. The UK has an informal open research data forum that Jisc does engage in. There is a need to think how this could become formalised. Jisc has classic affiliations to HEIs and some research institutes, but the audience for the RDA is bigger than this. It is possible that the data forum could be a conduit and that Jisc, working with others, could help to feed back to the RDA.
The research data community is developing and there are pockets of practice, but in a lot of HEIs it’s still a problem. There is a lot still to think about on how to implement practice from theory. Perhaps UK use cases could feed into the RDA and then come back with toolkits to help with these problems. It’s difficult to know what outputs are available so some RDA outputs, which might have a particular use in UK, could be consolidated into packages for a UK audience.
The workshop ended with thanks to all presenters and attendees. Feedback from the audience was positive, with many noting that there was a lot to think about. Further information about the RDA and its outputs is available from their website – https://www.rd-alliance.org/. You can join the RDA UK group at https://www.rd-alliance.org/groups/rda-united-kingdom.