The overarching aim of this programme area is to contribute to an increase in research data management skills in UK higher education and research organisations. This will be achieved by providing high quality training materials which will serve the needs of a variety of roles and stakeholders requiring research data management skills.
There is a recognised need to increase skills in managing research data among staff in HEIs, including researchers, librarians and research support staff. Important work was accomplished by projects in the first Managing Research Data Programme, which developed discipline-specific training materials, but this work was limited to certain research areas.
The present strand aims to build on previous achievements by addressing remaining gaps in availability of discipline-focussed training materials, targeting disciplines not covered by previous JISC projects or other work. In particular, there is a need for training for subject or research librarians whose role will increasingly include providing support for researchers in making best use of the research data infrastructure and services which may be available (inter-)nationally or at an institutional level.
The JISC Managing Research Data Programme 2011-13 has funded four projects to design, pilot and test training materials for research data management adapted for the needs of discipline-focussed post-graduate courses and for subject or discipline liaison librarians.
RDMRose, University of Sheffield
RDMRose will develop and adapt learning materials about RDM to meet the specific needs of liaison librarians in university libraries, both for practitioners’ CPD and for embedding into the postgraduate taught (PGT) curriculum. Its deliverables will include OER materials suitable for learning in multiple modes, including face to face and self-directed learning.
Project Website: http://www.shef.ac.uk/is/research/projects/rdmrose
RDMTPA: Research Data Management Training for the whole project lifecycle in Physics & Astronomy research, University of Hertfordshire
The RDMTPA project will build on existing JISCMRD work, both within and outwith the University of Hertfordshire, and will work with the Centre for Astrophysics Research (CAR) and the Centre for Atmospheric & Instrumentation Research (CAIR) to develop a short course in Research Data Management, directed at Post-Graduate and early career researchers in the physical sciences.
SoDaMaT: Sound Data Management Training
SoDaMaT will develop discipline-specific research data management training materials for postgraduate research students, researchers and academics working in the area of digital music and audio research.
Project Blog: http://rdm.c4dm.eecs.qmul.ac.uk/category/project/sodamat
TraD: Training for Data Management at UEL
TraD will embed good practice in data management (DM) at UEL by developing disciplinary training material for postgraduate curricula, adapting existing materials in the area of psychology and developing new materials for computer science. The project will provide training opportunities for research staff and a learning module for library support staff.
Project Blog: http://datamanagementuel.wordpress.com/
A fifth project – DaMSSI-ABC – will provide a support function, to assist projects in following best practice, ensure reusability, engage stakeholders and synthesise outcomes.
DaMSSI-ABC, DCC, University of Glasgow, University of Manchester, RIN and Vitae
DaMSSI will support and improve coherence in the development, dissemination and reuse of research data management training materials developed by the JISC RDMTrain projects. Specifically, DaMSSI will:
- work with relevant professional bodies / learned societies and funders, to endorse and promote good data management practice;
- classify course offerings by ensuring that the anticipated outcomes of training interventions are clearly set out, to allow participants to select the training that best meets their learning objectives;
- identify and agree benchmarks on learning outcomes and means of assessment so that courses from a range of training providers can be effectively compared.
Project Website: http://www.dcc.ac.uk/training/damssi-abc
Project Blog: http://damssiabc.jiscinvolve.org/wp/
Manage locally, discover (inter-)nationally: research data management lessons from Australia at OR2012
What to keep and why; how to support research data management through the lifecycle; and how to make the data citable, discoverable and reusable: these are core questions in research data management. They are questions with both human and technical aspects. These are the issues which Exeter is addressing through advocacy and training, its draft RDM policy and plans for a sustainable service; and which Oxford is seeking to tackle through DataFinder and ‘just enough metadata’.
With a sizeable national investment and an impressive coordinated approach, Australia – in the form of the Australian National Data Service and a host of institutional projects – is providing useful examples of how these questions may be answered.
Natasha Simons, Griffith University: Enabling data capture, management, aggregation, discovery and reuse
Natasha described the development of the Griffith University Research Hub, a metadata store solution designed as far as possible to automate the collation of new research data held in the university.
Metadata relating to research data created by Griffith researchers and largely curated in the Griffith research data repository is exposed by the Research Hub for harvesting by the ANDS Research Data Australia service, ‘a set of web pages describing data collections produced by or relevant to Australian researchers. Research Data Australia is designed to promote visibility of research data collections in search discovery engines such as Google and Yahoo, to encourage their re-use.’
Metadata is exposed using RIF-CS (Registry Interchange Format – Collections and Services), a high-level schema structured around four classes of objects: collections, parties, activities and services.
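To make the shape of such records concrete, the following is a minimal sketch of parsing a RIF-CS feed of the kind harvested into Research Data Australia. The element names follow the public RIF-CS schema, but the group, key and title in the sample record are invented for illustration:

```python
import xml.etree.ElementTree as ET

RIFCS_NS = "http://ands.org.au/standards/rif-cs/registryObjects"

# A hypothetical RIF-CS record: one registry object describing a
# 'collection' class object with a primary name.
sample = f"""
<registryObjects xmlns="{RIFCS_NS}">
  <registryObject group="Example University">
    <key>example.edu.au/collection/001</key>
    <originatingSource>example.edu.au</originatingSource>
    <collection type="dataset">
      <name type="primary">
        <namePart>Hypothetical survey dataset</namePart>
      </name>
    </collection>
  </registryObject>
</registryObjects>
"""

def summarise(xml_text):
    """Return (class, type, title) for each registry object in a RIF-CS feed."""
    ns = {"rif": RIFCS_NS}
    root = ET.fromstring(xml_text)
    records = []
    for obj in root.findall("rif:registryObject", ns):
        # Each registry object describes exactly one of the four
        # RIF-CS classes: collection, party, activity or service.
        for cls in ("collection", "party", "activity", "service"):
            el = obj.find(f"rif:{cls}", ns)
            if el is not None:
                title = el.findtext("rif:name/rif:namePart", "", ns)
                records.append((cls, el.get("type"), title.strip()))
    return records

print(summarise(sample))  # → [('collection', 'dataset', 'Hypothetical survey dataset')]
```

In a real harvest the XML would arrive via an OAI-PMH endpoint rather than an inline string, but the class-per-registry-object structure is the same.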
The Griffith Research Hub metadata store is based on VIVO, a triple-store solution, and uses the ANDS-VITRO ontology for describing research activity. VIVO-VITRO is one of a number of metadata store solutions encouraged by ANDS and being implemented by ANDS-funded projects. For more detail about the Griffith implementation of VITRO see the D-Lib Magazine article by Wolski et al. (2011), Building an Institutional Discovery Layer for Virtual Research Collections.
As well as contributing to ANDS’s broader objectives in Research Data Australia, the Griffith Research Hub provides a platform of linked information about the university’s research activities – potentially a rich and valuable resource for the management of research information and grant-funded projects, and for the development of collaborations and new initiatives.
Just as the Griffith Research Hub exposes information about researchers, projects and research data, so Research Data Australia provides a platform to discover information about research data created by Australian researchers. These are early days – analytics do not yet exist to show to what extent this platform is assisting discovery and reuse – but the potential is clear.
Anthony Beitz, Monash University, Institutional infrastructure for research data management
Anthony described an integrated and strategic approach to supporting researchers’ eResearch and data management needs. Fundamental to the Monash approach is the recognition that researchers, for good reason, tend to prefer more bespoke, community-developed solutions to blunt and generic platforms that are often the wares of centralised IT services. Anthony was unequivocal: ‘If a research community already has an RDM solution, or an emerging one, then it is this which should be adopted and supported.’
One suspects that few would disagree with this in principle… But at a time when, in the UK, IT support is being withdrawn from research departments, the cry from IT directors will be: ‘Great, but how is this to be resourced?’ A good and pertinent question. But equally, real attention needs to be paid to researchers’ needs. There is little point in providing generic solutions if these do not respond sufficiently to researchers’ requirements and are scarcely fit for purpose.
I took Anthony’s point to be that it is of fundamental importance to be sensitive to the objectives and requirements of specific research areas.
For a RDM platform to be effective and have high utility, it must fit in with researchers’ tools, workflows, instrumentation, methodologies, environment, and most importantly culture. As most of these features vary from discipline to discipline, it is unrealistic to believe that a singular approach to RDM will consistently meet researchers’ needs. Indeed, research institutions should expect that a range of RDM platforms will be required in order to accommodate their researchers.
Monash uses a team of developers and an agile software development methodology to support this, and the emphasis is upon engaging with specific research groups and communities. The Monash approach is to work along a decision tree: if possible, adopt a third-party product; if necessary, adapt a product to disciplinary or local needs; and if these options fail, develop the product locally.
The focus on the requirements of research communities applies both to the support of research activity (data capture, analysis etc; the active data phase) and to the curation and archiving of data which is in some sense complete (the data publication phase). For the archiving and publication phase, the Monash approach is to manage locally and promote discovery (inter-)nationally by propagating metadata to national registries such as Research Data Australia, or to such disciplinary hubs as may exist. Once again, this seems to push a great deal of responsibility for curation and archiving the way of the institution. The Monash response is to meet this challenge and ‘form a separate specialised support group for RDM infrastructure’.
A lot of institutions will find this approach daunting. But many of the arguments about utility and the need for products that are fit for purpose are fundamentally persuasive. It will be important to understand more about and to learn from the Monash model.
The issue of how to fund a research data management infrastructure on a sustainable basis while only partially relying on cost-recovery from grant funded research projects is a matter of concern for all JISCMRD projects and all institutions, including Open Exeter… In relation to this issue, and others, Open Exeter is paying particular attention to how the university can best support the RDM requirements of post-graduate students.
Hannah Lloyd-Jones, Open Exeter Project, University of Exeter: post-graduate research data, a new challenge for repositories?
Hannah gave a clear and comprehensive overview of the work of Open Exeter. The presentation is available from the first data management session on the OR2012 Conference website.
The project is divided into four areas of work:
Technical development: focussing on providing a DSpace instance for research data, with underlying storage, and ensuring integration of document and data repositories.
Creation of training materials and guidance: to support researchers and research support staff in the use of the data infrastructure. Exeter’s guidance pages are currently under construction.
Advocacy and governance: to establish the institutional policies around the management, retention and publication of research data.
The fourth strand of the project is a distinctive feature of the Open Exeter project. ‘Follow the Data’ describes the detailed work the project is doing to understand researcher requirements. This has involved research based on the DCC’s data asset framework methodology (comprising an online survey and follow-up interviews). A report summarising findings has recently been published.
Open Exeter is also working closely with a cohort of post-graduate research students: this approach has the dual benefit of helping the project understand research practice and RDM requirements, while also assisting advocacy and dissemination of project objectives.
This focus also emanates from a widespread concern – prevalent at Exeter and other institutions – with what happens to PGR research data at completion. At the moment, Exeter requires the deposit of post-graduate theses in the institutional repository, but – surprisingly – not the data substantiating the theses’ findings. This is a matter of concern – potentially of frustration and consternation – in departments where the research data may form part of the ongoing research initiatives, part of the department’s research assets, its institutional memory.
The Open Exeter project has prepared separate draft RDM policies for researchers and for PGR students. The draft policy for PGR research data notes: ‘The security of PhD students’ data is of particular importance when it is embedded in a larger research project and will need to be accessed after the completion of the students’ degree.’
To support the objectives of these draft policies, the Open Exeter project will offer an infrastructure allowing deposit of data with the thesis through a simple deposit mechanism, with the repository assigning a persistent ID linking the data to the thesis. The project is also focussing on awareness raising and embedding cultural change in the research community through a PGR-focussed support network.
The Open Exeter Summary of Findings from the Open Exeter Data Asset Framework Survey provides some interesting insights. The overwhelming message is that the university cannot just provide an RDM service for those researchers with externally funded research. In all schools and at all career stages, there is a substantial amount of research being conducted which does not have an external funder and is funded by the university itself. Non-grant-funded research at Exeter includes research involving commercially or personally sensitive data, and some post-graduate research data also. For an institution that endorses the view that ‘good practice in research data management is a key part of research excellence’ it is scarcely conceivable that an RDM service and infrastructure could be limited to those researchers and projects with external sources of funding. The data produced by internally funded research is an institutional asset requiring careful management and, where appropriate, archiving, publication and dissemination.

However, the challenging conclusion from this observation is that ‘there could only ever be partial cost recovery from grants (via direct or indirect costs) for future staffing and infrastructure for research data management.’ [p.4] Following from this, the report observes that ‘new responsibilities will need to be accepted into central and college teams’. Sustainability models for institutional RDM services ‘are likely to include recommendations for additional dedicated staffing to help manage and monitor institutional research data management policy and practice.’ [p.6]
The Exeter report provides some grounds for the view that costs of an RDM service may be offset by indirect means: avoiding the loss of research income [p.16]; reducing data loss [p.32]; cost and efficiency savings through better management and more effective data disposal [p.35]. Most importantly, the costs of the RDM service might be controlled – and good practice made more effective – by providing ‘clarification regarding when to archive and what to archive (criteria for retention or disposal)’. [p.35]
Arguments in favour of research data sharing stress the need for verification and reproducibility. It is fundamental to the scientific method and to good research practice for other researchers to be able to test the evidence underpinning the hypotheses and interpretations presented in a given scholarly publication.
In recognition of this a number of journals have recommended or mandated that research data be deposited in appropriate data repositories prior to publication. Parallel to this, there are a growing number of initiatives that explicitly link journal articles with the underlying data or that may be characterised as data journals, championing the publication of research data sets with commentary, analysis and visualisation.
Technical, procedural and cultural challenges exist around the use of identifiers, exchange of metadata, effective linking and data citation. There is also a need to establish sustainable partnerships between journals, data centres and research organisations which are necessary to underpin innovative forms of data publication.
Innovative data publications are likely to provide researchers with recognition and reward for making datasets available, and thus encourage data to be viewed as a first-class research output and data publication to be considered an essential part of the scholarly process. Likewise, it seems likely that, as well as making it easier for researchers to locate and access datasets, linking between publications and supporting data will provide a means for established data centres, or even institutional data repositories, to enhance and draw attention to well-curated research outputs.
For partnerships around data publication to become established, there are important questions to be considered:
- What policies are required on the part of journals’ editorial boards to achieve greater levels of data sharing, citation and linkage between publications and datasets?
- What partnerships between journals, data centres and research organisations are necessary to establish sustainable solutions, and what business models are appropriate?
- How may the costs of long-term data archiving be met and appropriately distributed in models that stress the importance of publishing data and linking datasets to published outputs?
- What characterises a suitable repository, and what criteria of quality and assurance are required of the data archive underpinning such collaborations?
- What, if any, peer review of data is appropriate before publication?
The JISC Managing Research Data Programme 2011-13 has therefore funded two projects to design and implement innovative technical models and organisational partnerships to encourage and enable publication of research data. These projects will also explore the questions listed above and thereby shed light on solutions which will enable the further development of data publication.
PREPARDE: Peer REview for Publication & Accreditation of Research Data in the Earth sciences
PREPARDE will capture the processes and procedures required to publish a scientific dataset, ranging from ingestion into a data repository through to formal publication in a data journal. It will also address key issues arising in the data publication paradigm, namely: how does one peer-review a dataset? What criteria are needed for a repository to be considered objectively trustworthy? And how can datasets and journal publications be effectively cross-linked for the benefit of the wider research community?
Project website: http://proj.badc.rl.ac.uk/preparde
PRIME: Publisher, Repository and Institutional Metadata Exchange
PRIME will enable the automated exchange of metadata between publishers, subject-based and institutional repositories. A partnership between UCL, the Archaeology Data Service and Ubiquity Press, a campus-based open access publisher located at UCL, PRIME will ensure that each stakeholder has a record of content relevant to them, even when the data itself is held elsewhere.
As previously noted, scholarly journals are increasingly recommending or requiring as a condition of publication that research data should be made available in an appropriate repository. A service to collate and summarise journal research data policies would provide researchers, managers of research data services and other stakeholders with an easy source of reference for understanding the requirements and recommendations made by journal editorial boards with regard to data sharing. Such a service would provide a useful information and advocacy tool for a variety of stakeholders in this area (including exponents of open data, research data infrastructure providers, institutional managers with responsibilities for research data management etc). It is also likely to provide a helpful incentive for the increasing systematisation and codification of such policies and for their more regular review.
JISC and other stakeholders need to understand precisely what is required in such a service and what business models are available to maintain a sustainable service, including a consideration of sources of funding and cost recovery.
The third project funded by the JISC Managing Research Data Programme in the area of data publication is a feasibility study for a service to collate and summarise journal data policies, which will consider requirements and present possible business models.
JoRD: Journal Research Data Policy Bank
The Journal Research Data Policy Bank (JoRD) project will conduct a feasibility study into the scope and shape of a sustainable service to collate and summarise journal policies on Research Data. The aim of this service will be to provide researchers, managers of research data and other stakeholders with an easy source of reference to understand and comply with Research Data policies.
Project website: http://crc.nottingham.ac.uk/projects/jord.php
Continuing a series of posts on research data issues at OR2012…
Another theme over the week, picked up in Peter Burnhill’s closing summary, was the importance of ‘linking research inputs and outputs’. A number of JISC Managing Research Data projects are taking a holistic view, seeking to ensure the joined-up exchange of information between research information management systems, institutional repositories and research data archives and catalogues. This was stressed in Cathy Pink’s presentation on the Research 360 project. And it was given forceful expression in Sally Rumsey’s account of Oxford’s Data Management Rollout project and the DataFinder system which will play an important role in ‘pulling it all together’.
Sally Rumsey, DaMaRO Project, University of Oxford
Sally’s presentation can be downloaded from the OR2012 website; a video of the second research data session is available on the OR2012 YouTube channel.
In her presentation on the DaMaRO project, Sally underlined the need for a cross-university, multi-stakeholder approach to addressing the RDM challenge. The DaMaRO project has four components: fostering the adoption of RDM policies and strategies; design and provision of training, support and guidance; technical development and implementation of an RDM infrastructure; and preparation of a business plan for sustainability.
Development of RDM policies, training materials and a strategy and plan for sustainability form overarching activities. The RDM infrastructure itself is to have three components, corresponding to phases in the research data lifecycle:
- Data creation and management of active data: this will be done locally in departments and research groups, taking advantage of the DataStage and VIDaaS platforms as well as other bespoke tools.
- Archival storage and curation: provided centrally, using the DataBank repository, as well as a software store, but also drawing on community, national and international infrastructure (for storage and curation) where this exists.
- Data discovery and dissemination: provided principally by the Oxford DataFinder.
Sally described DataFinder, which is being developed by the DaMaRO Project, as the keystone of Oxford’s Research Data Infrastructure. Alternatively, DataFinder could be described as the connective tissue and nerves, linking all the other elements of the RDM infrastructure together. To the researcher DataFinder will provide a catalogue of research data produced by Oxford projects, whether these are internally or externally funded, whether the data is held in the Oxford DataBank or elsewhere. It is a platform for both discovery and dissemination of research outputs and assets. DataFinder provides a mechanism for assigning DOIs to ensure proper identification and encourage appropriate citation. This service will ensure that Oxford can be seen to comply with funder requirements, by interoperating with research management systems to show what data assets have been generated from which grant and whether they are available for further analysis or reuse.
To this end, a three tier metadata approach is envisaged, comprising:
- a minimal mandatory metadata set providing core information (this starts with the DataCite kernel but includes other fields such as location, access terms and conditions and any embargo information).
- a second mandatory layer with ‘contextual’, administrative information (ideally, much of this will be automatically harvested, or passed on by administrative systems).
- and finally optional metadata (the rich, specific and discipline related information required for reuse).
The Oxford project has currently identified 12 fields for the set of minimal mandatory metadata, which will be regarded as the bare minimum to be provided in relation to any research dataset. The contextual metadata will include any information mandated by the funder: for example, the identity of the funder, the name of the project or programme, and the grant number or identifier.
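The layering can be sketched as a simple validation rule: a record must carry every field in the first two tiers before registration, while tier-three fields remain optional. The field names below are placeholders only – a mix of DataCite kernel elements and the extras mentioned above (location, access terms, embargo) – and are not Oxford's actual 12-field set:

```python
# Indicative sketch of a three-tier metadata check; field names are
# illustrative placeholders, not the real Oxford or DataCite field list.
TIER1_MANDATORY = {  # minimal descriptive metadata (DataCite-kernel-like)
    "identifier", "creator", "title", "publisher", "publication_year",
    "location", "access_terms", "embargo",
}
TIER2_MANDATORY = {  # contextual / administrative metadata
    "funder", "project_name", "grant_number",
}
# Tier 3 (rich, discipline-specific metadata) is optional, so it is
# deliberately absent from the mandatory sets.

def missing_fields(record):
    """Return the mandatory (tier 1 + tier 2) fields absent from a record."""
    required = TIER1_MANDATORY | TIER2_MANDATORY
    return sorted(required - record.keys())

record = {
    "identifier": "doi:10.0000/example",  # hypothetical DOI
    "creator": "A. Researcher",
    "title": "Example dataset",
    "publisher": "University of Oxford",
    "publication_year": 2012,
    "location": "https://databank.example/datasets/1",
    "access_terms": "CC-BY",
    "embargo": None,  # no embargo applies
    "funder": "EPSRC",
    "methods": "free-text, discipline-specific notes",  # tier 3: optional
}

print(missing_fields(record))  # → ['grant_number', 'project_name']
```

The point of the sketch is only that tiers one and two gate registration, while the presence or absence of tier-three metadata does not.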
There seems to be a growing level of consensus among JISCMRD projects and elsewhere around the broad contours of such a three tier metadata approach. I intend to revisit this in a future post to understand how much alignment there is in the detail of the first two layers of this broadly accepted structure.
Sally stressed repeatedly the size of the task ahead. DaMaRO performs an important role by pulling together a number of previous initiatives. However, it is not realistic to think that this will be a complete RDM service. By project end, in March 2013, Oxford will have a policy and a plan for sustainability; a body of training materials for researchers and research support staff; and two core services run by the Bodleian, DataBank and DataFinder. Significant progress, to be sure, but foundations nonetheless. Sustainability and cost recovery, for example, are significant challenges. It will be necessary, of course, to recover costs against research grants – and Sally urged the need for transparency in this regard, ideally a hypothecated line in grant proposals for RDM infrastructure.
However, it must be recognised that a significant amount of research – producing important data outputs – is conducted at Oxford that is not externally funded. Like many other institutions, Oxford currently needs to examine very carefully how an infrastructure which provides a service for non-funded projects can be sustained when only partial cost-recovery from funded projects is possible.
It has been remarked in a number of places that OR 2012 saw the arrival of research data in the repository world.
In his account of his Edinburgh experience, Peter Sefton observed that we are now talking about Research Data Repositories, not just Institutional Publication Repositories. And Angus Whyte has described this as bringing data into the open repository fold. It has also been remarked that research data management can make ‘a significant contribution to an institution’s research performance’, but only if services are based upon a robust understanding of researchers’ needs.
Using a wordle of #or2012 tweets in his closing summary, Peter Burnhill noted that ‘Data is the big arrival. There is a sense in which data is now mainstream.’ (See Peter’s summary on the OR2012 YouTube channel.) Peter also remarked on the presence of #jiscmrd, the tag for JISC’s Managing Research Data programme, in the word cloud! A number of the current JISCMRD projects gave presentations about their work over the week.
None of these observations should be a surprise. Research data has been climbing the agenda, pushed by a variety of well-known imperatives. So what was significant in the various presentations on research data issues at OR2012?
Cathy Pink, Research 360 Project, University of Bath
In the DCC workshop, Cathy Pink addressed the question of why universities should be engaging with research data and, in dealing with the ‘how’, considered important use cases at Bath. The drivers come in large part from funder policies, themselves reflecting public-good principles about research integrity and making the most of research investment – this is well known. At Bath, as elsewhere, these policies coincide with the university’s interest in better managing its research activity and outputs more generally. For this reason, as RDM solutions are developed and implemented (the project is evaluating Sakai and DataStage during the project, and ePrints and DataBank post-project) they must be integrated with the CRIS and publications repository. These relationships are important because of the need to link research inputs to research outputs, and publications to research data. As the name suggests, the Bath project aims to ensure that the various aspects of research management and research data management are joined up across the lifecycle. This is a sizeable challenge, and it is well worth taking a look at Cathy’s presentation to consider the range of questions being addressed.
The Research 360 Project has been directly involved in the development of Bath’s EPSRC Roadmap. Like other institutions, to meet EPSRC’s expectations, Bath is developing a catalogue of research data holdings. Cathy stressed that Bath relies particularly heavily on commercial and industrial partnerships – this focus is written into the university’s charter – and therefore the challenge of managing commercial confidentiality is pressing. It is worth stressing plainly: it is precisely because commercial and sensitive data are concerned that the University considers it important to have a robust data management infrastructure in place. Ideally, the data catalogue will list all data assets, even where these are embargoed – but it is possible that commercial partners may require the metadata also to be restricted. Cathy also raised the interesting question of whether DOIs should be assigned to embargoed data. This would depend at least on whether the minimal metadata required by DataCite was considered sensitive or not, but there may be other considerations…
Chris Awre, History DMP Project, University of Hull
Hull has been using the institutional Hydra repository to curate and publish datasets for some time. Inter alia, this includes curating a collection of datasets for the History of Marine Animal Populations (HMAP) initiative. For an international initiative – a collaboration which may rely on a series of project grants, in a research area where there may be no established and appropriate national or international data archive – an institutional repository thus provides an important service. Moreover, the University of Hull has particular expertise in various aspects of maritime history, including expertise in the preparation, collation and other processing of datasets forming part of the HMAP Project. The repository has also curated data from the University’s projects around the Domesday Book. This was an AHRC-funded project and the data was deposited with the History Data Service. However, the project lead wanted the data to be available to the general public without the need for a login or registration.
With the demise of the AHDS, the History Data Service collection policy has narrowed in scope. In such cases, where international and national services (Tier One and Tier Two) do not exist, institutional repositories clearly have a role to play as a sustainable, trusted repository (see my previous post for arguments around the role of institutional repositories in the curation hierarchy). One can see the attraction for leading departments of developing research and data expertise in parallel, partnering with the institutional repository service to ensure that key data assets are published and preserved for the long term, as with these example datasets at Hull. The History DMP project built on this partnership and existing expertise to understand better the data management needs of researchers in the Department of History, to prepare a departmental data management plan, and to adapt the DCC’s DMP checklist to the needs of historians.
Cathy and Chris were joined in the DCC workshop by Sally Rumsey from the University of Oxford. Sally gave two presentations during the week at OR2012. In the DCC workshop she asked what ‘Just Enough Metadata’ might be: what is sufficient metadata from the perspective of an institutional repository and data service? Metadata was also an important theme in her presentation in the general Research Data Management session, in which she described the JISCMRD Data Management Rollout Project… of which more soon…
Institutional Data Repositories and the Curation Hierarchy: reflections on the DCC-ICPSR workshop at OR2012 and the Royal Society’s Science as an Open Enterprise report
At Open Repositories 2012 the Digital Curation Centre and ICPSR (the Inter-university Consortium for Political and Social Research) organised a workshop to consider issues around the roles and responsibilities of institutional data repositories.
Graham Pryor opened the workshop with an overview of the DCC’s programme of institutional engagements. The workshop featured three presentations relating to projects in the JISC Managing Research Data Programme and other institutional activities (from Cathy Pink on the progress made by the University of Bath’s Research 360 Project; from Sally Rumsey on the work of Oxford’s DaMaRO Project to establish a schema for minimal mandatory metadata; and from Chris Awre on the use of Hull’s institutional repository for curating research data and the History DMP project): on these, more in a forthcoming post. There were also presentations from Ann Green and Jared Lyle of ICPSR on the results of a survey looking at how institutions might work with national data centres (and in what areas institutional repository managers would seek to develop greater expertise through such relationships) and from Gregg Gordon on SSRN, the Social Science Research Network, which has a growing interest in joining up research data assets as well as articles, pre-prints and grey literature.
Angus Whyte has already written a very useful and comprehensive overview of the workshop. I want to focus here on the relationship between emerging institutional research data services and more established national and international data archives. This theme was central to the workshop and was introduced in Graham Pryor’s presentation using the ‘Data Pyramid’ as described in the Royal Society’s Science as an Open Enterprise Report.
The ‘Data Pyramid’ suggests that the management and preservation of research data may currently be considered as happening in four tiers, forming a hierarchy of increasing value and permanence.
The Data Pyramid – a hierarchy of rising value and permanence, taken from the Royal Society’s Science as an Open Enterprise report, p.60.
Tier One comprises major international resources such as the Worldwide Protein Data Bank. In Tier Two we find the national data centres, such as the UK Data Archive and the British Atmospheric Data Centre. Universities’ institutional data repositories and research data services, such as those being piloted in the JISC Managing Research Data programme, are found in Tier Three. And in Tier Four come the data collections of individual researchers or research groups: as likely as not, these are unsystematically structured and described, reside on temporary storage, are shared only with collaborators, and are not subject to any plan for longer-term preservation. The data pyramid is a useful model, of course, but it does not accurately describe the current state of affairs, largely because Tier Three is underdeveloped (and Tier Four forms a far broader base to the pyramid than any diagram can allow…)
Graham used the data pyramid to ask some searching questions of workshop participants:
- What responsibility should academic institutions have for supporting the data curation needs of their researchers?
- What responsibilities should academic institutions have for curating the data they produce?
- Should academic institutions engage with these questions only where there is no tier 1 or tier 2 service available?
A challenging and provocative way of putting the final question is to ask, as Angus did so neatly in his post, whether universities and other research institutions really want to be ‘lenders of last resort’, providing ‘a home for orphaned data to fill gaps left by national and international disciplinary repositories’?
These are extremely important and pertinent questions and discussions in the workshop were constructive for exploring these issues. Short of proposing precise answers to these questions, what I would like to do here is reconsider the data pyramid and note arguments raised in the Royal Society report as a way of discussing the role, responsibility and purpose of institutional research data services in relation to national and international data archives and collections.
It should be recognised from the start that the tiers of the curation hierarchy are porous rather than entirely independent. The challenge with which we are faced, in my view, is ensuring that as much research data as possible rises up the pyramid, to the degree appropriate for the data in question. In particular, this means encouraging potentially valuable and reusable research data to be unlocked from the disaggregated storage and scarcely managed collections that characterise ‘Tier Four’. This objective points to the need for collaboration and coordination between institutional, national and international data services in a number of areas:
- to capture, preserve and then bring together dispersed datasets, adding value through discovery, curation and analysis functions when a critical mass is achieved;
- to promote a research culture that encourages the curation and preservation of research data;
- to help develop and cultivate the skills and services which enable these steps to happen.
In the OR2012 workshop, Ann Green and Jared Lyle made similar points neatly. They recalled Chris Rusbridge’s argument that digital preservation ‘is like a relay race, with different parties taking responsibility for a limited period and then “passing the baton”’, in order to show how partnerships between institutions and data centres may be helpful to universities seeking to offer services for the long-term preservation of selected research data. Ann and Jared also cited the recommendation from a 2007 report that domain-specific archives should partner with institutional repositories ‘to provide expertise, tools, guidelines, and best practices to the research communities they serve’. Data centres like the UKDA and BADC are important centres of expertise, with already impressive outreach activities. Nevertheless, anything that can be done to amplify such work and to build up specific partnerships for the propagation of expertise and skills should be encouraged.
Tier Three services in universities have an extremely important role to play in a joined-up research data ecosystem. At present, a cavernous gulf separates the relatively small proportion of research data preserved in national and international data services (Tiers One and Two) from the vast amounts of research data that are of significance and value for verification and reuse, but are effectively lost (in Tier Four).
The Royal Society report recognises this and is unequivocal in its view that institutional research data services (Tier Three) need to be developed and that this is an area of ‘particular concern … in the [curation] hierarchy’ [Science as an Open Enterprise, p.63]. The reason for this is the crucial role of research institutions in propagating the skills, culture and policies which are necessary to respond to the growing imperative to make the most of research data assets. It is by means of institutional policies and services that research data currently lost and inaccessible in the individual collections that form the base of the data pyramid can be made available and reusable.
Much important data, with considerable reuse potential, is also lost, particularly when researchers retire or move institution. This report suggests that institutions should create and implement strategies designed to avoid this loss. Ideally data that has been selected as having potential value, but for which there is no Tier 1 or Tier 2 database available, and which can no longer be maintained by scientists actively using the data, should be curated at institutional (Tier 3) level. [Science as an Open Enterprise, p.63]
Institutional data services can form an elevator by means of which important data collections may emerge from the temporary and inaccessible storage of Tier Four. The Science as an Open Enterprise report makes the point that coherent, more highly curated datasets, answering very specific research questions, will emerge from the heterogeneous collections of services like the Dryad data repository or institutional data repositories.
Data collections often migrate between the tiers particularly when data from a small community of users become of wider significance and move up the hierarchy or are absorbed by other databases. The catch-all life sciences database, Dryad, acts as a repository for data that needs to be accessible as part of journals’ data policies. This has led to large collections in research areas that have no previous repository for data. When a critical mass is reached, groups of researchers sometimes decide to spin out a more highly curated specialised database. [Science as an Open Enterprise, p.62]
The JISC Managing Research Data Programme is helping universities develop policies, strategies and curation services which will allow this role to be performed in the broader data ecology. As already noted on this blog, the Science as an Open Enterprise report recognises the importance of this activity and recommended that it ‘should be expanded beyond the pilot 17 institutions within the next five years.’ [Science as an Open Enterprise, p.73] However, if the most is to be made of such investment, lessons should be learnt from the approach taken by the Australian National Data Service (ANDS). Recognising that a significant amount of research data management must happen in institutions – because it is in institutions that the systemic change must happen which will allow the capture of ‘the wide range of data produced by the majority of scientists not working in partnership with a data centre’ – ANDS has also developed an infrastructure, the Australian Research Data Commons, which allows institutional data collections to feed into national and disciplinary collections and discovery portals.
The Australian Research Data Commons, taken from the Royal Society’s Science as an Open Enterprise report, p.69.
Along similar lines, Gregg Gordon described the value-added and connecting services which SSRN could offer as ‘glue for data repositories’. Such services can be built on the data assets curated by Tier Three institutional data repositories.
To my mind, such arguments and examples make a strong strategic case for investment in the development of Tier Three data services in research institutions, and suggest that such data repositories can and should contribute in ways that go beyond being a repository of last resort. They also underline the need to develop services which allow data to be aggregated easily, and which allow more highly curated collections to be constructed in response, on the one hand, to the critical mass of data achieved in a given area and, on the other, to the emergence of new research activities ready to exploit this asset.