A Festival for Persistent Identifiers - Research infrastructure and data

Introduction

“Open identifiers deserve their own festival” say the organisers of PIDapalooza and if you have any interest in Persistent Identifiers (PIDs) then this is the event for you. With a unique style roughly modelled on a music festival (nail painting and tattoos were optional), the focus is on discussions and questions in short, parallel sessions rather than sitting through a load of presentations. This was the second PIDapalooza (the first kicked off back in November 2016 in Reykjavik), and was held in Girona on 23/24 January 2018. With a sell-out crowd and plans for the next festival being discussed, it looks set to become an annual event and, who knows, by drawing a disparate community together, might even help to solve some of the issues around PIDs.

The festival was opened with the lighting of the persistent “flame”, although this wasn’t as persistent as hoped when the machine broke down. There was an active Twitter stream over the two days, although the switching between the hashtags #PIDapalooza and #PIDapalooza18 showed that even PID experts can have problems with persistence.

Although my main reason for attending PIDapalooza was my ongoing work on organisation identifiers, there were many sessions relevant to other Jisc work. I’ve selected a number of sessions for this post but there were plenty of others of interest. You can view all the slides at https://pidapalooza.figshare.com.

Data publishing: PID adoption stories

Data publishing is increasingly important for research, so there needs to be consistency in how PIDs are used around data: DOIs assigned, data citations, ORCID authentication/login, etc. To value data publications we need to promote PIDs, for example by citing data (with DOIs) in publications, CVs and talks. Adoption is often difficult because the language or intention is confusing, and the solutions are often confusing for researchers. Figshare is looking at citation style, versioning, DOIs, other identifiers and existing identifiers, API endpoints, and getting credit. Provenance is important.

We need trust in the connections between the things we’re trying to find. How do we incentivise researchers to use PIDs to link their data publications to all research outputs? What can we learn from the PID community that could help drive adoption in the data community? In chemistry, for example, linking a large PDF of data to an article ticks the box, but isn’t useful or trustworthy without any metadata or useful information.

RCUK progress on OrgIDs

Ashley Moore from RCUK gave an update on the work they have been doing around organisation identifiers (OrgIDs). Jisc has been involved in some of the meetings between the British Library and RCUK and we’ve shared progress with them on our work with OrgIDs. Understanding all the organisations RCUK work with is a big challenge for them as it includes institutions, project partners, spin-off organisations and those related to grant funding. Their grant application system (Je-S) is 15 years old and has problems with duplications, mergers, de-mergers and keeping up to date with the changing world of organisations. They use UKPRN but have worked with the BL on a pilot project to apply ISNIs to fundable organisations. With planning for a new grants system they want to make it use PIDs if available.

RCUK continues to look at how to improve their organisation information, how to manage multiple locations for a single organisation and what PIDs they should use or adopt in the future. The decisions they make are relevant to UK research as they could potentially mandate the use of PIDs, although there are issues around a publicly funded organisation mandating their use. Some of the plans have been put on hold due to the transition from RCUK to UK Research and Innovation. In the discussion session reference was made to a register of organisation lists (http://org-id.guide/, https://github.com/org-id/register), the open Persistent Institution Identifier Registry being set up by DataCite/Crossref/ORCID, work in Germany where a database of national organisations was set up 12 years ago, and work in Portugal where they have adopted an international registry based on the ISNI+ and Ringgold approach suggested by the Jisc/CASRAI OrgID working group.

Using PIDs to enable biologists to do their research

Jo McEntyre from EMBL-EBI talked about her role as a biologist and the importance of reusing and remixing data so that she can do her research: her adventures with PIDs in the life sciences. There are many data resources at EMBL-EBI and, just like biology itself, the data is all connected. The types of identifiers used include accession numbers, DOIs (journals and generic data resources, occasional use for high-level datasets as secondary PIDs) and ORCID IDs (5M articles, 650K published authors in Europe PMC). They like ORCID IDs and they are well used. Examples of databases used include ProteomeXchange, BioStudies and Europe PMC. PIDs are a technical necessity. Metadata is critical for building services and for discoverability, and services are critical for uptake. Mention was made of the FREYA project, which started in December 2017 and is a follow-on from THOR. With a variety of international partners, it aims to produce a PID conceptual graph that will improve discovery, navigation, retrieval and access. It’s being done in three ways – PID forum (discuss), PID graph (implement), and PID commons (consolidate).

All about BASE

According to Christian Pietsch, in his overview of BASE, the main difference between Google Scholar and BASE (Bielefeld Academic Search Engine) is that the BASE team answer emails. Having managed the Research Data Discovery Service (RDDS), I found it interesting to hear about their harvesting, aggregation, enrichment and exposure of OAI metadata, mainly from institutional repositories, subject repositories and OA journals. It’s been running since 2001 and in 2017 it integrated with ORCID search and link. It contains 122M documents from 6,103 content sources and 127 source countries. They believe 70% are OA, although only 42.5% are confirmed as OA. There are three ways to use BASE – web interface, search API, and an OAI-PMH API. Christian listed the pros of OAI-PMH as simple, time-tested, widely spread and stable. The cons are that it’s slow for large repositories, the technical implementation varies, misconfigurations are common, it’s HTTP-based but not quite RESTful, and there is not enough transparency regarding availability. They are looking at alternatives such as web harvesting (not their speciality), ResourceSync (interesting but not offered by many repositories), and Linked Open Data. Dublin Core is often misconfigured and DOIs are stored in different Dublin Core elements, which makes identifying DOIs non-trivial. ORCID IDs are not recorded consistently in Dublin Core either. As we found with the RDDS, it’s the inconsistencies that can make harvesting and aggregation into a discovery service a challenging process.
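To illustrate why DOIs scattered across Dublin Core elements are awkward for a harvester, here is a minimal Python sketch. It is not BASE’s code: the record layout (a dictionary of element names to lists of values), the list of elements to search and the simplified DOI pattern are all assumptions made for the example.

```python
import re

# Simplified DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This is an illustrative pattern, not a complete DOI grammar.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")

# Dublin Core elements where a DOI might turn up (an assumed list).
CANDIDATE_ELEMENTS = ("identifier", "relation", "source")


def find_dois(record):
    """Return the distinct DOIs found in a harvested Dublin Core record.

    `record` maps a Dublin Core element name to a list of string values.
    """
    found = []
    for element in CANDIDATE_ELEMENTS:
        for value in record.get(element, []):
            for match in DOI_PATTERN.findall(value):
                # Drop trailing punctuation and normalise case for comparison.
                doi = match.rstrip(".,;)").lower()
                if doi not in found:
                    found.append(doi)
    return found


# A made-up record: the DOI appears in two elements, in two different forms.
example_record = {
    "identifier": ["https://repository.example.org/id/eprint/123"],
    "relation": ["https://doi.org/10.1234/ABC.567"],
    "source": ["Example Journal 12(3), doi:10.1234/abc.567."],
}
print(find_dois(example_record))
```

Even in this toy case the same DOI arrives as a URL in one element and as free text in another, so an aggregator has to search several fields and normalise what it finds.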

Jisc runs its own service, CORE (COnnecting REpositories), aggregating open access content from UK and worldwide repositories and open access journals. It provides a range of services including discovery, analytics, and text mining access. This is helped by doing a lot of work in cleaning up/normalising the data for the search portal and the API.

OrgID Working Group update

The ORCID/DataCite/CrossRef Organization Identifier Working Group ran through 2017 to “refine the structure, principles, and technology specifications for an open, independent, non-profit organization identifier registry to facilitate the disambiguation of researcher affiliations.” As a member of this working group and through my involvement in an earlier Jisc-CASRAI working group looking at OrgIDs, I am particularly interested in the outcome from this group. The group had produced recommendations on registry governance and product principles before issuing a Request for Information in October 2017 to solicit comment and expressions of interest from the broader research community in developing the Registry. You can read the RFI responses in this document. This session summarised work so far and gave some feedback on the stakeholder meeting held before PIDapalooza. This meeting had the goal to establish an initial governing board, review the options for a registry host, determine the decision-making process and the next steps. These will be:

Steering group (Laurel Haak, Trisha Cruse) to set up an Interim Executive Committee (IEC) tasked with developing a proposal for setting up the registry (define guiding principles, partner roles and responsibilities, decisions for the Governing Council, and host criteria).
Set up Interim community Governing Council (IGC) consisting of working group members, interested parties, and stakeholder meeting attendees.
IEC to float the proposal with the IGC for comments by end of February, then take steps to finalise the host organisation.

The plan is to add a neutral organisation into the decision making process to ensure fairness. From the start the working group has been transparent and open. I’ll provide an update on Jisc’s involvement with this group in a later post.

Using PIDs to measure use of scientific equipment

Jisc is helping universities, colleges and the research community to share equipment with each other, and with industry, through the equipment.data project. This session was particularly relevant to that project as it looked at capturing the use of research facilities with PIDs. Oak Ridge National Laboratory (ORNL) and the Advanced Photon Source have unique equipment facilities but it has been difficult to make an accurate assessment of return on investment. ORNL started collecting the ORCID IDs of researchers using their neutron sources in early 2016. This acknowledged use and allowed the user facility to be connected to a researcher’s publications through ORCID IDs.
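A facility that collects ORCID IDs from its users will want to catch mistyped ones at the point of entry. The final character of an ORCID iD is a check character calculated with the ISO 7064 MOD 11-2 algorithm, so a basic check is easy to sketch in Python. The function names here are my own, and this only tests that an iD is well formed, not that it is registered or belongs to the person entering it.

```python
import re


def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed_orcid(orcid):
    """Return True if the string looks like an ORCID iD with a valid check character."""
    compact = orcid.replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", compact):
        return False
    return orcid_check_character(compact[:15]) == compact[15]


# 0000-0002-1825-0097 is the iD of ORCID's fictitious example researcher.
print(is_well_formed_orcid("0000-0002-1825-0097"))  # True
print(is_well_formed_orcid("0000-0002-1825-0098"))  # False
```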

In January 2017 the US Department of Energy, national labs, SSURF (representing scientists that use any of the facilities across the US), CHORUS, publishers and ORCID established a User Facilities and Publications Working Group to define user stories and information flows that leverage open identifier infrastructure. Its report, published in November 2017, outlined findings and recommendations, with a series of proof of concept pilot projects launching in 2018. The working group established workflows around Step 1 – proposal submission and Steps 2-3 – proposal accepted and facility use. They are now looking for people to pilot these steps at their facilities. ORNL has implemented several of the workflows outlined in steps 1 and 2. Wiley was represented in this session and publisher workflows were presented. Although publishers receive information from authors about facilities in the manuscript, no PIDs or standards are applied, which makes the information difficult to analyse. They’ve been working on a pilot project to integrate award and facility IDs into the manuscript publication process.

The working group is interested in anyone who wants to join the group or consider participating in a pilot project (just email community@orcid.org). Jisc will be joining and reviewing how this work could help with equipment.data. This also links to the work of the RDA working group on the Persistent Identification of Instruments.

Making molecular data FAIR

Henry Rzepa from Imperial College came to this event to get some questions answered and most of these had been, either through presentations or by talking to people. It’s the unknowns, the questions he should have asked, that concern him, and he is always collecting more examples of use cases. His work is in molecular chemistry and there are 136M molecules with at least 7.6 billion property values known, probably a lot more. He demonstrated the difference between FAIR and unFAIR data through the visualisation of a molecule in 3D, where one is a difficult-to-understand printed image and the other has a DOI, which downloads a molecule that can be rotated – a whole toolkit deployed with the molecule. The molecule name is a PID that goes back 150 years. Most of the identification of molecular data is closed so is not FAIR. At Imperial they’ve used the International Chemical Identifier created in 2000 for web use. They’ve stored this in the AlternateIdentifier field in DataCite schema v3. They’ve created entries in repositories – SPECTRa from 2005, figshare from 2011 (only in custom institutional version) and Zenodo from 2013. In DataCite v4.1 the SubjectScheme property was exploited as a better way forward for NCI. Metadata recommendations for DataCite Registration are being launched in February 2018. Wanting more flexibility they created a new repository (built in a remarkably short time) allowing for ownership of DOIs, deep collection hierarchy and workflows. This allows data discovery and reuse and the data mining of registered metadata.
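As a rough illustration of what carrying a chemical identifier in DataCite metadata might look like, the Python sketch below builds a small XML fragment using the DataCite element names alternateIdentifier and subject. It is not Imperial’s actual record: the attribute values ("InChI", "inchi") are assumptions, the fragment leaves out the namespace and all mandatory DataCite properties, and the identifier used is the standard InChI for water.

```python
import xml.etree.ElementTree as ET


def inchi_metadata_fragment(inchi):
    """Build an illustrative, incomplete DataCite-style fragment for one InChI."""
    resource = ET.Element("resource")

    # Option 1: record the InChI as an alternate identifier of the dataset.
    alt_ids = ET.SubElement(resource, "alternateIdentifiers")
    alt_id = ET.SubElement(
        alt_ids,
        "alternateIdentifier",
        {"alternateIdentifierType": "InChI"},  # type label is an assumption
    )
    alt_id.text = inchi

    # Option 2: record it as a subject, naming the scheme it comes from.
    subjects = ET.SubElement(resource, "subjects")
    subject = ET.SubElement(
        subjects,
        "subject",
        {"subjectScheme": "inchi"},  # scheme label is an assumption
    )
    subject.text = inchi

    return resource


fragment = inchi_metadata_fragment("InChI=1S/H2O/h1H2")
print(ET.tostring(fragment, encoding="unicode"))
```

Either way, the point is that once the identifier is in registered metadata it can be searched and mined, which is what makes the data discoverable.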

Control of Dutch National Research Information

Jisc and SURF have collaborated and worked closely not just through the Knowledge Exchange, but through shared interests in many areas. This was another session related to work within Jisc, in particular around OA and research information management.

Within the Netherlands 100% of universities used METIS, a home-grown CRIS tailored to Dutch standards. The market evolved and universities moved on to solutions such as Pure, Converis and (later) Symplectic. These are aggregated at a national level in NARCIS, the gateway to scholarly information in the Netherlands. The issues they have are that there is no easy way to adapt each CRIS to store the preferred information (no flexibility), no easy way to connect databases for (national) analysis (no interoperability), and discrepancies in total national research output (and % of OA) when different sources are compared (no reliability). PIDs can help to solve these issues by reducing dependency on external sources and collecting linked IDs at a national level. This will allow them to always have access to this information and make use of it in research intelligence.

10 out of 14 Research Universities participated in the pilot ORCID-NL consortium. Most were also implementing new CRIS systems. A group of 25 universities of applied science are still in discussion. The new consortium has 8 research universities, but the goal is all 14. CRIS integration has been sub-optimal, vendors haven’t been so responsive and integration issues remain. They are looking at the research information reality they want to achieve and how to get there. This involves producing a centralised ID resolver, which is being developed in 3 phases:

Phase 1, the co-author disambiguation: API interface (pilot underway)
Phase 2, non-CRIS institutions: Web interface
Phase 3, non-CRIS institutions: federated authentication

Future possibilities of this project include being able to collect/link object and ORG IDs, integrate with the national repository (NARCIS), monitor Open Access and monitor output from funded research. One important question is whether this ID resolver would be a useful infrastructure for open science. Jisc and SURF are currently working with ORCID on producing a PID vision piece due to be published online soon.

Other highlights

Metadata 2020 is a collaboration that advocates richer, connected, and reusable open metadata, for all research outputs, which will advance scholarly pursuits for the benefit of society. The challenge is getting important information referenced and they want to encourage richer metadata to fuel discoverability. There’s a core team and others involved in the working group, consisting of 19 advisors and 70 individuals contributing. The group has come up with challenges, issues and solutions from the perspective of librarians, funders, publishers, service providers/platforms and tools, data publisher and repositories, and researchers. Proposed projects for 2018 are:

Communicate reasons for and importance of metadata for researchers
Metadata recommendation and element mappings
Metadata vocabulary
Business cases and examples
Standardisation / Principles
Metadata Evaluation and Guidance

The Initiative for Open Citations (I4OC) is a collaboration between scholarly publishers, researchers, and other interested parties to promote the unrestricted availability of scholarly citation data. There is a need to look at identifiers as relationships, or to view them as data entities carrying more information. A citation can be both a link and an entity. To turn a citation into a first class data entity it must have four features:

Definable in a machine-readable manner
Storable, searchable and retrievable in a well-structured open database
Identifiable using a global persistent identifier scheme
Resolvable through a web-based resolution service

OpenCitations has been set up as an open scholarly infrastructure organisation whose primary purpose is to host and build the Open Citations Corpus (OCC), an RDF database of scholarly citation data that now contains almost 13 million citation links.

In biomedicine, an in-line data citation can be just a reference, e.g. “Protein Data Bank 2gc4”. Although this informal expression makes sense to the reader, it needs to be formalised to make it machine readable, e.g. “pdb:2gc4”. This use of a PID provides access and metadata and is a new practice for this community. Currently there are some “ugly” identifiers for data where the URL often reflects the underlying implementation. A solution is to have meta-resolvers sitting at a stable URL, which redirect to the right resolver based on the collection name (the prefix). The advantage is a centralised and maintained common registry. Examples of these meta-resolvers are n2t.net and identifiers.org. In June 2016 n2t.net and identifiers.org harmonised and agreed to use the same set of unique prefixes and the same resolution format.
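The idea of a compact identifier plus a meta-resolver can be sketched in a few lines of Python. The function names are my own and the code only builds the URL a reader would follow; it doesn’t contact either service, and it assumes the simple prefix:accession form used in the example above.

```python
# Stable base URLs of the two meta-resolvers mentioned above.
META_RESOLVERS = {
    "identifiers.org": "https://identifiers.org/",
    "n2t.net": "https://n2t.net/",
}


def parse_compact_identifier(compact_id):
    """Split 'prefix:accession' into its two parts."""
    prefix, separator, accession = compact_id.partition(":")
    if not separator or not prefix or not accession:
        raise ValueError(f"not a compact identifier: {compact_id!r}")
    # The prefix names the collection; the accession is kept exactly as given.
    return prefix.lower(), accession


def resolver_url(compact_id, resolver="identifiers.org"):
    """Build the meta-resolver URL for a compact identifier."""
    prefix, accession = parse_compact_identifier(compact_id)
    return f"{META_RESOLVERS[resolver]}{prefix}:{accession}"


print(resolver_url("pdb:2gc4"))
print(resolver_url("pdb:2gc4", resolver="n2t.net"))
```

Because both services agreed on the same prefixes and format, the same compact identifier works at either base URL, and the reader never sees the URL of the underlying database.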

Summary

An interesting comment at the event was that many of the issues raised here were already around 20 years ago. The response was: why have they not been fixed in all that time? Because they haven’t, and because PIDs are essential in linking so many outputs, objects, people and organisations, the need for an event such as PIDapalooza will continue, even if only to bring this community together to highlight the issues and the work being done internationally.