heiDATA Preservation Policy
heiDATA is the research data repository of Heidelberg University. As a publication platform, heiDATA's task is to keep research data sustainably available over long periods of time. For this purpose, data authors are contractually guaranteed a retention period of 10 years for their data. The mission of the repository, however, extends beyond this period, and an open-ended holding period for the data is foreseen.
To achieve this goal, heiDATA pursues a forward-looking long-term preservation strategy that combines systematic data curation with a reliable technical infrastructure.
Data Curation
Every data publication on heiDATA undergoes a data curation workflow that aims to ensure data integrity and authenticity, enhance documentation and metadata, and prepare data sets as well as possible for long-term preservation. To achieve these goals, every publication process is moderated by a data curator.
Before files are ingested into the repository, the curator examines their file formats with regard to long-term preservability. heiDATA prefers open, non-proprietary and well-documented formats, so the curator first checks whether the formats used meet these criteria. If they do not, it is investigated whether the files can be converted into such a format without loss of relevant information or of usability for the community. If conversion is not possible, other formats are also accepted; in these cases, the documentation of the file format is, where openly available, stored in an internal knowledge base so that subsequent use of the data remains possible, at least in principle, even after the format has become obsolete. Where possible, format recognition and format validation are also carried out: format recognition is done with FIDO and DROID, validation with JHOVE, veraPDF and ExifTool.
Data integrity is further supported by the functionality of the Dataverse software, which records file checksums at the bit level (MD5) and, for tabular data, at the variable level (UNF), allowing curators, authors and third parties to check file integrity. For published datasets, a transparent versioning system makes changes traceable.
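As an illustration of how such an integrity check can be reproduced by authors or third parties, the following Python sketch recomputes a downloaded file's MD5 checksum and compares it with the value published alongside the file in heiDATA. The file name and the checksum value are placeholders for illustration, not actual repository values.

import hashlib
from pathlib import Path

def md5_of_file(path: Path, chunk_size: int = 8192) -> str:
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder values: a downloaded file and the MD5 checksum
# displayed for it on the heiDATA landing page.
downloaded_file = Path("survey_data.tab")
published_md5 = "9e107d9d372bb6826bd81d3542a419d6"  # hypothetical value

if md5_of_file(downloaded_file) == published_md5:
    print("Checksum matches: file integrity confirmed.")
else:
    print("Checksum mismatch: the file may be corrupted or altered.")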
Another task of the curator is to enhance documentation and metadata in coordination with the authors. This step is integrated into the review process for every data publication: authors cannot publish their datasets without submitting them for review by a data curator. Additional appraisal can take place, e.g. by editors and reviewers of the journals to which related manuscripts are submitted; via heiDATA's Private URL feature, they can access the data prior to publication.
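The Private URL feature is provided by the underlying Dataverse software. As a minimal sketch, the following Python snippet shows how a curator with a suitable API token could create such a URL for an unpublished dataset via the Dataverse native API; the server address, dataset ID and API token are placeholder assumptions, and the exact structure of the JSON response may differ between Dataverse versions.

import requests

# Placeholder values -- replace with a real dataset ID and API token.
SERVER = "https://heidata.uni-heidelberg.de"
DATASET_ID = 12345            # hypothetical internal dataset database ID
API_TOKEN = "xxxxxxxx-xxxx"   # hypothetical curator API token

# Dataverse native API call to create a Private URL for an unpublished dataset.
response = requests.post(
    f"{SERVER}/api/datasets/{DATASET_ID}/privateUrl",
    headers={"X-Dataverse-key": API_TOKEN},
    timeout=30,
)
response.raise_for_status()

# The response is expected to contain a link that can be shared with
# journal editors or reviewers prior to publication.
print(response.json())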
In this respect, heiDATA follows the structure of the OAIS model for digital preservation. The moderated publication process ensures that the data ingested into heiDATA, the so-called Submission Information Packages (SIPs), are already prepared as well as possible for long-term preservation. After authors have submitted data and metadata for review, these are checked by the staff and optimized in consultation with the data providers. This process creates the final Archival Information Package (AIP), which in the case of heiDATA is identical to the Dissemination Information Package (DIP) that users can download.
Technical Infrastructure
The current technical infrastructure for heiDATA is as follows: the repository uses dedicated, scalable virtual infrastructure of the university's private cloud heiCLOUD. As storage backend, two independent systems are used to provide high availability for the heiDATA service. Data is stored on the "Large Scale Data Facility" (LSDF) within the service SDS@hd, a dedicated online storage platform for research data that is managed by the computing centre of Heidelberg University for researchers of the federal state of Baden-Württemberg; this system uses IBM Storage Scale as software and file system. The second online storage system in place for heiDATA is the cloud storage of heiCLOUD. By design, both systems allow capacity and performance to scale and form solid, secure and highly available storage environments. Using both systems together yields particularly high availability of the overall service and of the data.
The integrity of the heiDATA system is monitored by a Checkmk monitoring server, which continuously executes a set of checks on the heiDATA server and stores the check results in a local round-robin database.
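To illustrate how such monitoring can be extended, the following Python sketch is a minimal Checkmk local check that queries the standard Dataverse /api/info/version endpoint and reports the result in the Checkmk local check output format ("<state> <service_name> <metrics> <details>"). The service name, timeout and URL are assumptions for illustration, not the actual production check set.

#!/usr/bin/env python3
# Minimal Checkmk local check: report whether the heiDATA (Dataverse) API responds.
# Output format: "<state> <service_name> <metrics> <details>", state 0=OK, 2=CRIT.
import time
import urllib.request

URL = "https://heidata.uni-heidelberg.de/api/info/version"  # standard Dataverse endpoint

try:
    start = time.monotonic()
    with urllib.request.urlopen(URL, timeout=10) as resp:
        status = resp.status
    elapsed = time.monotonic() - start
    if status == 200:
        print(f"0 heiDATA_API response_time={elapsed:.3f} API reachable (HTTP {status})")
    else:
        print(f"2 heiDATA_API - unexpected HTTP status {status}")
except Exception as exc:  # network errors, timeouts, TLS problems, ...
    print(f"2 heiDATA_API - API not reachable: {exc}")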
Backups of the PostgreSQL database are made every night and saved on a heiCLOUD volume, which is synchronized to the online storage volume each night. In addition, a backup client (IBM Storage Protect) runs regularly and backs up data that is not located on the heiCLOUD volume or the LSDF, e.g. operating system data and configuration files in the /etc directory. This backup strategy, with several backup storage targets, allows the system to be restored completely in the event of a disaster recovery.
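The following Python sketch shows, in simplified form, what such a nightly database backup could look like: it writes a timestamped dump with pg_dump to a backup volume and removes dumps older than a retention window. Database name, paths and retention period are placeholder assumptions and do not reflect the actual production configuration.

#!/usr/bin/env python3
# Simplified nightly PostgreSQL backup sketch (e.g. triggered by cron).
import subprocess
import time
from datetime import datetime
from pathlib import Path

DB_NAME = "dvndb"                           # hypothetical Dataverse database name
BACKUP_DIR = Path("/backup/heidata/pgsql")  # hypothetical heiCLOUD backup volume
RETENTION_DAYS = 30                         # hypothetical retention window

BACKUP_DIR.mkdir(parents=True, exist_ok=True)
dump_file = BACKUP_DIR / f"{DB_NAME}_{datetime.now():%Y-%m-%d}.dump"

# pg_dump in custom format (-Fc), which allows selective restore with pg_restore.
subprocess.run(
    ["pg_dump", "-Fc", "-f", str(dump_file), DB_NAME],
    check=True,
)

# Remove dumps older than the retention window.
cutoff = time.time() - RETENTION_DAYS * 86400
for old_dump in BACKUP_DIR.glob(f"{DB_NAME}_*.dump"):
    if old_dump.stat().st_mtime < cutoff:
        old_dump.unlink()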
Although the concept of two dedicated storage systems as backend plus nightly backups on an additional heiCLOUD volume already provides a secure and highly available storage environment, heiDATA will in the future also make use of the university's long-term preservation system heiARCHIVE. heiARCHIVE is designed as a dark archive and follows the concepts of the OAIS reference model (Open Archival Information System). It is based on an in-house development and offers features such as format recognition and validation and the extraction of metadata from files. The heiARCHIVE software makes use of open community standards, software tools and libraries: a storage abstraction based on the open source data management software iRODS manages data copies and geo-replication, and the BagIt file packaging format is used for structuring and naming directories and files. The METS standard (https://www.loc.gov/standards/mets/) is used to define a container for descriptive, administrative and structural metadata. The PREMIS standard (https://www.loc.gov/standards/premis/) defines the metadata for the preservation of the data objects and their long-term usability. The DataCite metadata schema is used to represent the descriptive metadata. heiARCHIVE provides an API that is used to carry out long-term preservation of the data published in heiDATA.
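To illustrate the BagIt packaging step, the following Python sketch uses the open source bagit-python library to turn a dataset directory into a bag with payload checksums and minimal bag-info metadata. The directory path, DOI and metadata values are assumptions for illustration and do not show heiARCHIVE's actual ingest code.

# Sketch of BagIt packaging with the bagit-python library (pip install bagit).
import bagit

# Hypothetical staging directory containing the files of one dataset version.
dataset_dir = "/staging/heidata/dataset-version-1"

# Turn the directory in place into a BagIt bag with SHA-256 payload checksums
# and some descriptive bag-info metadata.
bag = bagit.make_bag(
    dataset_dir,
    bag_info={
        "Source-Organization": "Heidelberg University",
        "External-Identifier": "doi:10.11588/data/XXXXX",  # placeholder identifier
    },
    checksums=["sha256"],
)

# Validation recomputes the checksums and verifies the bag's completeness.
if bag.is_valid():
    print("Bag is complete and all checksums match.")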
The repository heiDATA and the related long-term archiving service heiARCHIVE are operated on systems on the premises of the university. Appropriately designed facilities are available and used for these services to guarantee a reliable, stable and highly available core infrastructure. This includes, e.g., the power supply (incl. an uninterruptible power supply), cooling capacity and the network connection (including two independent uplinks to the Internet). Access to the data center premises is restricted for reasons of data protection. For the technical infrastructure and the infrastructure-related processes, efforts are currently underway to have the data center premises certified according to DIN EN 50600. All servers, storage systems and further IT components are operated and managed by professional administrative staff with many years of experience in operating such systems. These systems are maintained permanently as central university IT infrastructure. Cyclical renewals and expansions are planned for the long term and are funded by grants, university funds or other financing models.