Research at Risk Research data metrics

Data Citation – what is the state of the art in 2016?

The Jisc Research Data Usage Metrics project recently held a project meeting to discuss possible future work in the emerging area of research data metrics. We saw a presentation from Cameron Neylon of Curtin University which, amongst other topics (including a fascinating section on defining a citation, which is covered on the project blog), introduced us to the many current initiatives around the world that are considering the citation of research datasets. As Jisc work continues in this area we hope to build links with a number of these initiatives to ensure that we can make the most valuable possible contribution. We asked Cameron to contribute the post below to introduce current work on data citation to our readers and to help set the scene. (David Kernohan)

Interest in Data Citation is growing rapidly, motivated by widely held interests in improving the quality of information on the relationships between data and articles, and in gaining or assigning credit for data contributors. Researchers and others are more interested in formal citation as a means of showing the usage and impact of published data than in other potential measures of usage. Many of the systems and much of the functionality for usable data citation are in place, but practice differs amongst publishers and repositories, and uptake by researchers is currently limited.

Work on harmonising practice and systems is proceeding rapidly under the auspices of the NIH-funded Data Citation Implementation Pilot, a program of FORCE11. This group is currently the main driver of technical implementation and best practice development. Progress is being made, but the currently limited scale of data citation means that it will be some time before a critical mass of high-quality information is available.

Community Initiatives

As a new practice for a new class of published objects, data citation benefits from not being tightly tied to the existing sustainability models of publishers and other stakeholders (including secondary publishers and citation aggregators). It is therefore a space that, at least to date, sustains strong collaboration amongst many stakeholders. A set of initiatives has developed at the community level. Other relevant initiatives include the NISO Altmetrics Initiative, which touches on some of the same issues, the Making Data Count project (a collaboration of PLOS, California Digital Library, and DataONE), and work at Europe PubMed Central on identifying references to biomedical data resources in the full text of scholarly literature.

Joint Declaration of Data Citation Principles (JDDCP)

A set of principles for Data Citation arose from a Birds of a Feather session at the Beyond the PDF meeting in Amsterdam in 2013. The initial set of principles that came out of this session was circulated, and a broader group was convened through the FORCE11 organisation to harmonise a broader set of recommendations from the many activities occurring in this space. The Data Citation Principles have been endorsed by 107 organisations and 236 individuals, making them the best consensus statement on data citation.

The Principles are framed as recommendations rather than prescriptions and are being implemented across a wide range of contexts by many stakeholders. They cover the importance of data, credit and attribution, identifiers, access to data, persistence, specificity and verifiability, and interoperability and flexibility.

Data Citation Implementation Pilot

The Joint Declaration of Data Citation Principles (JDDCP) raises a series of implementation issues that were discussed by Starr et al (2015). FORCE11 (Future of Research Communications and eScholarship, a collaborative project established in 2011 to help facilitate change toward improved knowledge creation and sharing) has developed a follow-on program from the JDDCP in the form of an NIH-funded pilot implementation project. This group is working with publishers and repositories to develop implementation pathways for the principles. At a meeting in March several publishers gave updates on current implementation of data citation and working groups were formed to tackle implementation issues including: developing a FAQ, issues with identifiers, publisher early adopters, repository early adopters, implementation in JATS, and landing pages.

The work on the Journal Article Tag Suite (JATS, a format for exchanging information for scholarly articles) has already produced recommendations on adding elements to the JATS standard, and these were incorporated into the v1.1 release. Specifically, two new elements, <data-title> and <version>, were added, alongside a new set of attributes that make it possible to identify the authorities assigning identifiers (i.e. data repositories) and to record data curators as contributors, as well as new types of object identifier (ARKs, accession numbers, handles). Finally, and crucially, “data” was added as a value for the attribute @publication-type.
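As a rough illustration of how these additions might fit together, the sketch below builds a dataset citation with Python's standard library. Only <data-title>, <version> and the "data" value for @publication-type come from the description above; the wrapping <element-citation>, the <source>, <year> and <pub-id> elements, the assigning-authority attribute name and all of the values (including the DOI) are assumptions made for the example, and the output has not been validated against the JATS schema.

```python
import xml.etree.ElementTree as ET


def build_data_citation(title, version, source, year, doi, authority):
    """Build a JATS-style citation element for a dataset (illustrative only)."""
    # "data" as a publication type is the change highlighted in the text.
    cite = ET.Element("element-citation", {"publication-type": "data"})
    # <data-title> and <version> are the two new elements named in the text.
    ET.SubElement(cite, "data-title").text = title
    ET.SubElement(cite, "version").text = version
    # The remaining elements and attributes are assumptions for this sketch.
    ET.SubElement(cite, "source").text = source
    ET.SubElement(cite, "year").text = year
    pub_id = ET.SubElement(
        cite,
        "pub-id",
        {"pub-id-type": "doi", "assigning-authority": authority},
    )
    pub_id.text = doi
    return cite


# Hypothetical dataset and placeholder DOI, invented for illustration.
citation = build_data_citation(
    title="Example field survey measurements",
    version="2",
    source="Example Data Repository",
    year="2016",
    doi="10.9999/example.12345",
    authority="example-repository",
)
xml_text = ET.tostring(citation, encoding="unicode")
print(xml_text)
```

Marking the citation as data in a machine-readable way is what would let downstream services distinguish a dataset reference from an article reference.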

DataCite, Crossref and Identifier Providers

While not community projects in the sense of the above, DataCite and Crossref are crucial community-based organisations in this space, alongside other identifier providers. Crossref has a long-standing role both in providing DOIs for published member outputs and in capturing from new publications the citations to other works, both those with Crossref DOIs and those without. DataCite has emerged more recently as an important provider of DOIs for data repositories and for other publishers and outputs not traditionally covered by Crossref. Both organisations have strong communities around them and have worked closely together.

Current state of implementation

Kratz and Strasser (2015) note that there is substantial functionality available for data citation and that citation is the measure of most interest to those wishing to track the use of, attention to, and performance of published data. However, while some publishers are making progress, implementation is inconsistent and lags behind the functionality that is available.

Implementation by Repositories

Strasser et al (2015), in a survey of a wide range of repositories, found that few (23%) track citations to datasets. More repositories track downloads or page views than citations, understandably given that these are proxies the repository can track directly; however, relatively few make that information available. While references to DataCite DOIs in formal literature with Crossref DOIs could be tracked through Crossref Cited-by information, this capability has not been widely used. This is in part due to inconsistent implementation of data citation by publishers of formally published literature with Crossref DOIs, but also due to weak data citation practice by researchers. Work by Elizabeth Hull at Dryad showed that, of articles from 2011-14 referencing data in Dryad, only 6% included the DOI in the works cited, with 75% including the DOI in the running text.
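The distinction between a DOI in the works cited and a DOI in the running text can be made concrete with a small sketch. This is not the method used in the Dryad study, which is not described here; it simply checks where a Dryad-style DOI appears in an article. The articles are invented, and the pattern assumes a simplified DOI suffix after the 10.5061/dryad prefix.

```python
import re
from collections import Counter

# Dryad DOIs share the 10.5061/dryad prefix; the suffix character class
# is a simplifying assumption for this sketch.
DRYAD_DOI = re.compile(r"10\.5061/dryad\.[A-Za-z0-9]+", re.IGNORECASE)


def doi_placement(body_text, reference_list):
    """Report where a Dryad DOI appears, preferring the reference list."""
    if any(DRYAD_DOI.search(ref) for ref in reference_list):
        return "works cited"
    if DRYAD_DOI.search(body_text):
        return "running text"
    return "not found"


# Hypothetical articles, invented for illustration.
articles = [
    {
        "body": "Data are available from Dryad (doi:10.5061/dryad.abc12).",
        "refs": ["Smith J (2012) An article. Journal 1: 1-10."],
    },
    {
        "body": "We reused a published dataset [2].",
        "refs": ["Lee K (2013) Data from: a study. Dryad. doi:10.5061/dryad.xyz99"],
    },
    {
        "body": "Data are available from the authors on request.",
        "refs": [],
    },
]

placements = Counter(doi_placement(a["body"], a["refs"]) for a in articles)
for place, count in placements.items():
    print(f"{place}: {count}")
```

Only the second article would be picked up by a service that reads reference lists, which is why placement matters for citation tracking.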

Because formal data citation is still uncommon, efforts to track data citation more comprehensively through repositories have largely relied on text mining and search strategies. Kafkas et al (2015) report on an effort to mine documents in the Open Access corpus of Europe PubMed Central for references to datasets. These focussed on biological databases with accession identifiers and a tradition of referencing data sources in running text. They note that many references are found not in the articles themselves but in the supplementary information, which is generally not indexed by secondary publishers and not captured through Crossref services.
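A minimal sketch of this kind of text mining is shown below. It is not the pipeline used by Kafkas et al; the two databases, the simplified identifier patterns and the sample text are all assumptions chosen for illustration, and a real system would need context checks to avoid false matches.

```python
import re

# Simplified, illustrative patterns; real accession formats vary more.
ACCESSION_PATTERNS = {
    "GEO series": re.compile(r"\bGSE\d+\b"),
    "ArrayExpress": re.compile(r"\bE-[A-Z]{4}-\d+\b"),
}


def find_accessions(text):
    """Return (database, identifier) pairs found in a piece of text."""
    hits = []
    for database, pattern in ACCESSION_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((database, match.group(0)))
    return hits


# Hypothetical article sections with made-up identifiers.
sections = {
    "main text": "Expression data were deposited in GEO under accession GSE12345.",
    "supplementary": "Raw arrays are available from ArrayExpress (E-MTAB-678).",
}

found = {name: find_accessions(text) for name, text in sections.items()}
for name, hits in found.items():
    print(name, hits)
```

Scanning the supplementary text separately reflects the observation above that many data references never appear in the article itself.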

Implementation by Journal Publishers

Several publishers (Springer Nature, Elsevier, Frontiers, PLOS) are directly engaged in the Data Citation Implementation Pilot and have started to implement the new JATS standard. Deborah Lapeyre has provided a valuable set of examples of implementation for a wide range of data objects.

Amongst publishers, BioMed Central (and the GigaScience journal in particular) and Nature Publishing Group (the journal Scientific Data) have led the implementation of practice. Both also have a strong history of dealing well with references to data in biomedical databases by capturing structured references to accession numbers. These, however, have not traditionally been surfaced as citations per se.

While substantial progress is being made, it is biased towards large players. It is worth remembering that many publishers do not use JATS XML within their pipelines, particularly outside STM disciplines. Progress has therefore been biased towards publishers and use cases in the biomedical sciences.

Progress on implementation

With the Data Citation Implementation Pilot (DCIP) in progress, this is a fast-moving field. Both major repositories and publishers are moving rapidly to harmonise and improve data citation systems and practice. The major players here are those organisations with strong technical capacity: the larger repositories and publishers. There is a risk of smaller players being left behind unless the development of implementation best practice is backed by infrastructure and support systems that make implementation feasible for them.

The capture of data citations remains weak across the scholarly communications ecosystem. Much of the machinery is in place but there are disagreements on the extent to which data citation should simply adopt existing document citation practice. In the longer term the issue may become moot. General citation practice on the ground is shifting and diversifying rapidly, with steadily more non-traditional objects being cited: we estimate that up to 15% of objects cited from academic papers may be plain URLs. (In 2012 Geoff Bilder suggested that this was 1 in 15 in a presentation at the CrossRef members meeting, at 25’30” in this video.)

The challenge for building a coherent data citation infrastructure is to harmonise and develop best practice fast enough to stay ahead of this curve. The Data Citation Implementation Pilot is a good example of moving fast enough to do exactly that.


About the author: Cameron Neylon is an advocate for open access and Professor of Research Communications at the Centre for Culture and Technology at Curtin University. You can find out more about his work and get in touch with Cameron via his personal page Science in the Open.


Featured image: “Metric Mania” by Josep Ma Rosell [Batega]. Shared on Flickr under a CC-BY licence.