As well as seeking to demonstrate ways in which research data management in UK Universities can be improved, the JISCMRD programme seeks to demonstrate the benefits of making research data as openly available as possible.
The JISC Call for Projects 14/09 Strand A sought to fund projects demonstrating the innovative potential, for research and scholarly communications, of improved methods for citing, linking and integrating research data.
The Call took as its premise that, to form a reliable, transparent and reproducible basis for scientific conclusions, research data should be readily accessible and citable. Information concerning the provenance of the data (how raw data have been analysed, adjusted to account for errors or other influences, and so on) should be available, so that findings can be reproduced, tested and verified. The data that underpin research can be made more usable and valuable by linking them not just to publications and creators, but also to related concepts and other data.
Proposals were asked to demonstrate how the project would benefit research that produces and consumes data, and to indicate where possible the broader applicability of the approach and technology used.
Six projects, covering a range of subject areas, were funded, representing an investment of nearly £660,000. All of the projects run for a year from August 2010. The project summaries that follow are drawn from the project proposals. The contact details provided are all in the ac.uk domain.
King’s College London (with the University of Edinburgh and Humboldt University Berlin)
SPQR – Supporting Productive Queries for Research
The SPQR project will make existing datasets relating to the Roman Empire more accessible and reusable for analysis by providing mechanisms for linking concepts and terms.
The SPQR project addresses the integration of heterogeneous datasets in the humanities, specifically data relating to classical antiquity, using a linked data approach and based on the recently published Europeana Data Model. The researcher requirements addressed by SPQR are driven by the outcomes of the JISC ENGAGE-funded project LaQuAT, which concluded that a relational database model for integrating these datasets is often too inflexible. It is in precisely this regard that Semantic Web and Linked Data approaches have great potential, as they allow researchers to formalise resources and the links between them more flexibly, and to create, explore and query these linked resources. Closely allied to Linked Data has been work on ontologies for providing agreed meanings for both links and the resources they connect. Ontologies can thus act as the semantic mediator between heterogeneous databases, enabling researchers to explore, understand and extend these datasets more productively and so improve the contributions that the data can make to their research. Using the Europeana Data Model as an ontology has the additional advantage of easy integration of the humanities resources into the Europeana portal.
The datasets relating to antiquity have features that make them a particular challenge and opportunity: they are “hand-crafted”, resulting from much individual effort; they are often incomplete or ambiguous, and may contain errors or contradictions; and they contain much embedded, implicit semantics, which is difficult for researchers to use or comprehend. Such datasets occur across the humanities, and also in the sciences where research centres on the work of individual scientists.
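The mediating role of an ontology described above can be sketched in a few lines. This is a purely illustrative example: the dataset fields and ontology terms (e.g. `edm:findspot`) are invented for the sketch, not taken from the project's actual vocabularies.

```python
# Illustrative sketch: a shared ontology term mediating between two
# heterogeneous datasets, represented here as simple RDF-style triples.
# All predicates and identifiers are hypothetical.

# Two datasets describe inscriptions using different local vocabularies.
dataset_a = [("insc:001", "hgv:provenienz", "Oxyrhynchus")]
dataset_b = [("insc:apis-42", "apis:origin", "Oxyrhynchus")]

# An ontology mapping declares that both local predicates denote the
# same shared concept (here, a hypothetical "findspot" property).
ontology_map = {
    "hgv:provenienz": "edm:findspot",
    "apis:origin": "edm:findspot",
}

def normalise(triples, mapping):
    """Rewrite local predicates to their shared ontology terms."""
    return [(s, mapping.get(p, p), o) for s, p, o in triples]

# After mediation, a single query over the merged graph reaches both
# datasets, even though neither used the shared term natively.
merged = normalise(dataset_a, ontology_map) + normalise(dataset_b, ontology_map)
findspots = [(s, o) for s, p, o in merged if p == "edm:findspot"]
```

In a real Linked Data deployment the triples would live in an RDF store and the mapping would be expressed in the ontology itself, but the mediation step is the same in spirit.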
Among the project deliverables will be Case Studies and a browsable corpus of linked data relating to important data sets of inscriptions and documents from the early Roman Empire. The project will also provide reviews of Linked Data tools and a set of online tutorials providing practical guidance to the approach used.
Project Director: Mark Hedges mark.hedges at kcl
Project Manager: Tobias Blanke tobias.blanke at kcl
Newcastle University (with the University of Manchester)
Data linking with knowledge blogging
The Data linking with knowledge blogging project will explore ways of linking raw and analysed data to a new form of rapid publication for scientific research: peer-reviewed blogs.
Coining the term ‘Knowledge Blogs’ or ‘K-Blogs’, the project will extend existing blogging tools for use as a lightweight, semantically linked publication environment. Bi-directional links will be maintained both between blog publications and to other forms of data, enabling researchers to create a hub in the linked-data environment. K-Blogs are convenient and straightforward for authors to use, integrating into researchers’ existing work practices and tools. The blogs provide readers with distributed feedback and commenting mechanisms, allowing comments on both interpretative posts and data. A lightweight editorial/peer-review process will be scoped and established.
The project will support three communities (microarray, public health and workflow), providing immediate benefit, in addition to the long-term benefit of the platform as a whole. Working closely with these communities, the development approach will be user-centric, while showcasing the platform as the basis for next generation research publishing.
As well as developing the semantically enhanced K-Blog platform, with bi-directional links to underlying data, the project will run four workshops, each focusing on a particular blog and research community.
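The bi-directional linking the project proposes can be illustrated with a minimal in-memory sketch. The identifiers below are invented, and the real platform would of course persist links in the blog infrastructure rather than in Python dictionaries.

```python
from collections import defaultdict

# Minimal sketch of bi-directional links between blog posts and data:
# every forward link is paired with a reverse entry, so either end can
# be traversed. All URIs are hypothetical.

links = defaultdict(set)      # forward: post -> linked resources
backlinks = defaultdict(set)  # reverse: resource -> citing posts

def add_link(post_uri, resource_uri):
    """Record a link and its reverse in one step."""
    links[post_uri].add(resource_uri)
    backlinks[resource_uri].add(post_uri)

add_link("kblog:post/17", "data:microarray/GSE1234")
add_link("kblog:post/23", "data:microarray/GSE1234")

# Starting from the dataset, a reader can discover every post that
# discusses it, which is what makes the blog a linked-data hub.
citing_posts = sorted(backlinks["data:microarray/GSE1234"])
```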
Project Director and Manager: Phillip Lord phillip.lord at newcastle
UKOLN, University of Bath (with the University of Manchester and the British Library)
SageCite: citing large-scale predictive network models of disease
The SageCite project will explore ways of citing and representing in publication the complex analytical processes used in predictive disease modelling (including sources of data, complex statistical and computer-aided analysis, and modelling). By establishing a Citation Model, the project will enhance researchers’ ability to identify, cite and reuse network models.
Network models, or bionetworks, used for predictive disease modelling, are the outputs of an analysis (e.g. a code or workflow) of prior results, which may be other networks or base, primary data generated through observations, instrumentation or predictions. For such complex network models of disease and associated data, SageCite will develop and test a Citation Framework, covering Data, Process and Publication, allowing the various analytical steps to be cited, their provenance tracked, and so on. Process represents the methods that are applied to data materials to generate data, combine data and produce new insights, as well as new, citable scientific research objects. Such data analysis methods include workflows, for example those built with Taverna. Data analyses produce provenance (a history and dependency graph) that links citable results to the citable processes and citable source data from which they arise.
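The history-and-dependency graph described above can be sketched as a simple recursive walk: each citable result points back to the process that produced it and the inputs that process consumed. All identifiers here are hypothetical, standing in for the DOIs or workflow URIs a real Citation Framework would use.

```python
# Hypothetical provenance graph for a network model: each citable
# object records the process and inputs it arose from. Names invented.
provenance = {
    "model:disease-net-v2": {
        "process": "workflow:taverna/coexpression-analysis",
        "inputs": ["data:expression-matrix", "model:disease-net-v1"],
    },
    "model:disease-net-v1": {
        "process": "workflow:taverna/network-inference",
        "inputs": ["data:raw-microarray"],
    },
}

def cite_chain(obj, graph):
    """Walk the dependency graph, collecting every citable object
    and process that contributed to the given result."""
    citations = [obj]
    record = graph.get(obj)
    if record:
        citations.append(record["process"])
        for source in record["inputs"]:
            citations.extend(cite_chain(source, graph))
    return citations

# Citing the final model pulls in its whole evidential chain,
# down to the primary data.
full_citation = cite_chain("model:disease-net-v2", provenance)
```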
The SageCite project will explore approaches, options and requirements for citing large-scale predictive network models of disease (and other, analogous compound research objects). A citation-enabled workflow demonstrator, built as an extension of myExperiment and the Sage Commons data infrastructure, and using Taverna workflows and DataCite services, will serve as a case study to investigate issues relating to citation services (e.g. attribution and granularity in complex objects). The project will work with two leading journals – Nature Genetics and PLoS Computational Biology – to demonstrate accreditation and attribution by embedding citations of network models. A benefits evaluation will build on analysis developed by the JISC-funded Keeping Research Data Safe 2 Report.
Project Director: Liz Lyon e.lyon at ukoln
Project Manager: Monica Duke m.duke at ukoln
University of East Anglia (with the STFC e-Science Centre)
ACRID – Advanced Climate Research Infrastructure for Data
A collaboration between the University of East Anglia Climatic Research Unit and the STFC e-Science Centre, the Advanced Climate Research Infrastructure for Data will improve ways of exposing climate data for re-use, making it easier to cite the data and to understand the provenance and validity of the data.
Past data management practices in many fields of natural science, including climate research, have focussed primarily on the final research output – the research publication – with less attention paid to the chain of intermediate data results and their associated metadata (including provenance). Data were often regarded merely as an adjunct to the publication, rather than a scientific resource in their own right. Data publication occurred with a lower priority, with little regard to the requirements for re-use.

A case in point is climate research data held by the Climatic Research Unit (CRU) at the University of East Anglia, where the lack of data provenance has been highlighted through a recent hacking of university emails. While the House of Commons Science and Technology Committee noted that CRU’s “(data sharing) actions were in line with common practice in the climate science community”, they went on to suggest “…that climate scientists should take steps to make available all the data that support their work (including raw data) and full methodological workings (including the computer codes)”. The Commons also noted that, even so, “it is not standard practice in climate science to publish the raw data and the computer code in academic papers”.

This project aims to address this issue directly, by developing an approach to exposing climate research data for re-use, through the adoption of linked-data principles for the data themselves. The data to be exposed within the project will be three major CRU datasets, but the methodology will be deployable elsewhere too. Best practice data citation techniques will enable a seamless link with research publications. Mechanisms will be developed for capturing key provenance metadata, and for adapting previously developed climate science data models to integrate with data re-use standards (OAI-ORE) and emerging Cabinet Office guidelines for public data.
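The kind of provenance metadata the project proposes to capture alongside a derived dataset can be sketched as a simple machine-readable record. The field names and values below are invented for illustration; the project itself would express this in its adapted climate data models and OAI-ORE aggregations rather than ad hoc JSON.

```python
import json

# Illustrative provenance record for a derived climate dataset,
# capturing raw inputs, the code used, and adjustments applied.
# All names, versions and dates are hypothetical.
provenance_record = {
    "dataset": "regional-temperature-means-v1",
    "derived_from": ["station-records-raw"],
    "code": {"repository": "example-repo/gridding-code", "version": "1.2.0"},
    "adjustments": ["homogenisation", "outlier-removal"],
    "created": "2010-08-01",
}

# Serialised and published alongside the data, such a record makes the
# chain from raw station records to published results explicit.
serialised = json.dumps(provenance_record, indent=2)
```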
See the Press Release about this project and the rest of the programme.
Principal Investigator: Tim Osborn t.osborn at uea
Project Manager and Technical Lead: Andrew Woolf andrew.woolf at stfc
University of Manchester (with the Freshwater Biological Association, King’s College London, Queen Mary University of London)
The Fish.Link project will support the sharing and integration of diverse data relating to freshwater biology (hydrology and systematic observational data created by government agencies and by professional and amateur researchers).
Motivated by the large quantity of diverse data in the freshwater biology community, Fish.Link will provide a demonstrator of the benefits of publishing data, illustrating how data can be combined, repurposed and reused with attribution and provenance information to promote data sharing. In particular, the project intends to support the sharing and integration of research data through the application of lightweight vocabularies and vocabulary mapping. The project will demonstrate how a variety of tools can be used together to support the exposure, citing and linking of research data. The use of vocabulary mapping will facilitate the integration of datasets, moving towards the Web of Data that forms the current Linked Open Data vision. The resulting workflow will involve the use of annotation tools to support the recording of ownership and provenance, ensuring appropriate citation and facilitating understanding of the data; query tools to enable discovery of and access to data; and mapping tools to support the construction of mappings between vocabularies, and thereby between the datasets themselves. Fish.Link will use a research case study addressing a pertinent scientific question, concerning the impact of environmental conditions on freshwater biology, that cannot be answered with any one dataset, but which can be addressed when existing datasets are combined using Linked Data approaches.
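The vocabulary-mapping step at the heart of this workflow can be sketched simply: two datasets record the same field under different names, and a lightweight mapping aligns them so they can be joined to answer a cross-dataset question. All field names, sites and values below are invented.

```python
# Illustrative sketch of vocabulary mapping between two freshwater
# datasets. Field names and records are hypothetical.

observations = [{"site": "Windermere", "taxon": "Salmo trutta", "count": 12}]
hydrology = [{"station_name": "Windermere", "flow_m3s": 3.4}]

# A lightweight vocabulary mapping aligns the two location fields.
vocab_map = {"station_name": "site"}

def align(records, mapping):
    """Rename fields according to the vocabulary mapping."""
    return [{mapping.get(k, k): v for k, v in r.items()} for r in records]

# Once both datasets share a vocabulary, they can be joined to ask a
# question neither answers alone (species counts against river flow).
combined = [
    {**obs, **hyd}
    for obs in observations
    for hyd in align(hydrology, vocab_map)
    if hyd["site"] == obs["site"]
]
```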
Project Lead: Sean Bechhofer sean.bechhofer at manchester
University of Southampton (with the STFC e-Science Centre)
WebTracks: web-scale link tracking for research data and publications
The WebTracks Project will explore ways of making research data more accessible and easy to use by enabling researchers to establish links which represent the evidential stages between data, analysis, scientific conclusions and publication.
Modern research requires the coordination of a large number of different digital objects and communication mechanisms. A researcher will typically generate raw data, through observation or experiment, generate analysis data using software, discuss the results using private email or public forums, make presentations, and publish a formal description of the results in journals. Each stage of this process typically involves support systems, including repositories, websites and data archives, each independently managed. However, to maximise the value of the scientific work, the connections between these stages need to be exposed, so that the evidential basis of the conclusions presented in publications can be accessed, and the usage of digital objects can be traced. Research objects are typically spread across a number of different sources which need to be linked together. Established techniques such as OAI-PMH and the emerging Linked Web of Data provide tools to publish data for linking. However, identifying and linking the appropriate objects, in an open and easily accessible manner, has not yet been addressed.
The objective of the WebTracks project is to provide a mechanism to identify and construct linked graphs of dependency, citation and provenance between research outputs from different data sources by implementing a peer-to-peer protocol which propagates semantic links between data resources without the need for centralised services. This protocol, named InteRCom, will enable the building of added value services for aggregation, provenance, citation tracking etc. that will demonstrate the value of this approach to the construction of a linked web of data for researchers, publishers and research evaluators.
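The decentralised propagation the InteRCom protocol is intended to provide can be illustrated with a toy model: when one repository asserts a link to a resource held by a peer, it notifies that peer directly, so both ends hold the link without a central registry. This is only a sketch of the idea; the class, method names and URIs are invented, not the actual protocol.

```python
# Toy model of peer-to-peer semantic link propagation, in the spirit of
# the InteRCom protocol described above. All identifiers are invented.

class Repository:
    def __init__(self, name):
        self.name = name
        self.links = []  # (subject, predicate, object) semantic links

    def assert_link(self, subject, predicate, obj, peer):
        """Record a link locally, then propagate it to the peer
        repository that holds the target resource."""
        self.links.append((subject, predicate, obj))
        peer.receive_link(subject, predicate, obj)

    def receive_link(self, subject, predicate, obj):
        # The peer stores the incoming link, so the connection can be
        # traversed from either end without consulting a central service.
        self.links.append((subject, predicate, obj))

pubs = Repository("publications")
data = Repository("crystallography-data")
pubs.assert_link("pub:10.1000/xyz", "cito:citesAsEvidence",
                 "data:ncs/5678", data)
```

Because each repository ends up holding the same link, aggregation and citation-tracking services can be built over the repositories themselves rather than over a central index.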
The WebTracks project will provide case studies around chemical science data involving the ISIS and Diamond facilities and the National Crystallography Service. These scenarios will link the records of raw data, secondary analysis and publication data within the chemical sciences. Further, the project will develop and deploy instances of the services in collaboration with both conventional scholarly publishers and the growing ecosystem of web-based publication in its various forms (LabLogs, citation publications, etc.).
Principal Investigator: Simon Coles s.j.coles at soton
STFC Lead: Brian Matthews brian.matthews at stfc