
6.2. Long-Term Data Archiving

Besides data storage, long-term data archiving is a necessary step in the research data life cycle, ensuring that your data are well preserved for you and other researchers. While data storage primarily concerns keeping data during ongoing work in the project period, data archiving is concerned with how the data can be made available in as reusable a form as possible after the project has been completed. A distinction is often made between data storage in a repository and data archiving in the sense of long-term archiving (LTA). However, in many places, including the DFG's “Guidelines for Safeguarding Good Research Practice” from 2019 (“Guideline 17: Archiving”), both terms are used interchangeably. In the following, when we speak of preservation or data retention, we mean the storage of data in a research data repository; when we speak of archiving, we mean long-term archiving. The differences between the two variants are the subject of this chapter.
With today's rapidly evolving digital technology, the older data become, the more likely it is that they can no longer be opened, read, or understood in the future. There are several reasons for this: the necessary hardware and/or software is missing, or scientific methods have changed so much that data are now collected in other ways with other parameters. Modern computers and notebooks, for example, now almost always do without a CD or DVD drive, which means that these storage media can no longer be read on most devices. Long-term archiving therefore aims to ensure that data remain usable for an unspecified period of time, beyond the limits of media wear and technical innovation. This includes both the provision of the technical infrastructure and organisational measures. In doing so, LTA aims to preserve the authenticity, integrity, accessibility, and comprehensibility of the data.
To enable long-term archiving, the data must be provided with meta-information relevant to LTA, such as the collection method used, the hardware of the system used to collect the data, software, coding, metadata standards including version, and possibly a migration history. In addition, the datasets should comply with the FAIR principles as far as possible. This includes preferring non-proprietary, openly documented data formats over proprietary ones. Open formats need to be migrated less often and are characterised by a longer lifespan and wider dissemination. Also, make sure that the files to be archived are unencrypted, patent-free, and uncompressed. In principle, file formats can be converted losslessly, lossily, or semantically (preserving the meaning rather than the exact representation). Lossless conversion is usually preferable, as all information is retained. However, if smaller file sizes are preferred, information loss must often be accepted. For example, if you convert audio files from WAV to MP3, information is lost through compression and the sound quality decreases, but the resulting file is smaller. The following table gives a first basic overview of which formats are suitable and which are less suitable for a given data type:
Recommended and non-recommended data formats by data type

| Data type | Recommended formats | Less suitable or unsuitable formats |
| Audio | | .mp3 |
| Computer-aided design (CAD) | | - |
| Databases | | .accdb / .mdb |
| Raster graphics & images | .tif (uncompressed) / .jp2 / .jpg2 / .png | .gif / .jpeg / .jpg / .psd |
| Statistical data | | .sav (SPSS) |
| Tables | | .xls / .xlsx / .xlx |
| Texts | .odf / .rtf / .txt / PDF/A | .docx / .doc / PDF |
| Vector graphics | | .cdr |
| Video | .mp4 / .mkv / .mj2 / .avi (uncompressed) | .mov / .wmv |

The list of formats in the column labelled "less suitable or unsuitable" does not imply that you cannot use these formats for long-term data storage. Rather, the aim is to raise awareness of questions of long-term availability as a first step. Be clear about which advantages and disadvantages each format offers. If you want to delve further, you will find what you are looking for on the website of NESTOR – the German competence network for long-term archiving and long-term availability of digital resources. Under NESTOR - Topic you will find current short articles from the field, e.g. on TIFF or PDF formats. If you put these and other overviews side by side, you will notice that the recommendations on file formats differ from each other, because there is not yet enough experience in this field. If you are uncertain about formats, another good option is to ask a specialised data centre or a research data network, if one exists. If you want to store your data there, this approach is even more advisable. You may then find that your data will be accepted even if the chosen data format is not the first choice from an LTA perspective. Operators of repositories or research data centres work close to science and always try to find a way of dealing with formats that are widely used in the respective fields, e.g. Excel files. As an example of this, you can take a look at the guidelines of the Association for Research Data Education.
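As a first rough check, the table above can be encoded as a small lookup. The following Python sketch is only an illustration under stated assumptions: the function name `classify_format` and the two sets are hypothetical, they contain only extensions that appear in the table, and the result "unknown" means that the table makes no statement, not that a format is unsuitable. Qualifiers such as "uncompressed" or the difference between PDF/A and plain PDF cannot be decided from the file extension alone, which is why `.pdf` is deliberately left out.

```python
from pathlib import Path

# Extensions taken from the table above. The sets are intentionally incomplete
# and should be extended with the recommendations of your own data centre.
# Note: .tif and .avi are only recommended when stored uncompressed, which an
# extension check cannot verify.
RECOMMENDED = {
    ".tif", ".jp2", ".jpg2", ".png",      # raster graphics & images
    ".odf", ".rtf", ".txt",               # texts
    ".mp4", ".mkv", ".mj2", ".avi",       # video
}
LESS_SUITABLE = {
    ".mp3",                               # audio
    ".accdb", ".mdb",                     # databases
    ".gif", ".jpeg", ".jpg", ".psd",      # raster graphics & images
    ".sav",                               # statistical data (SPSS)
    ".xls", ".xlsx", ".xlx",              # tables
    ".docx", ".doc",                      # texts
    ".cdr",                               # vector graphics
    ".mov", ".wmv",                       # video
}


def classify_format(filename: str) -> str:
    """Return 'recommended', 'less suitable', or 'unknown' for a file name."""
    extension = Path(filename).suffix.lower()
    if extension in RECOMMENDED:
        return "recommended"
    if extension in LESS_SUITABLE:
        return "less suitable"
    return "unknown"
```

A call such as `classify_format("survey.sav")` flags an SPSS file as less suitable, which can serve as a prompt to discuss alternatives with the repository rather than as a final verdict.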
To decide for yourself which formats are suitable for your project, consider the following criteria when making your selection (according to Harvey/Weatherburn 2018: 131):
  • Extent of dissemination of the data format
  • Dependence on other technologies
  • Public accessibility of the file format specifications
  • Transparency of the file format
  • Metadata support
  • Reusability/Interoperability
  • Robustness/complexity/profitability
  • Stability
  • Rights that can complicate data storage
LTA currently uses two strategies for long-term data preservation: emulation and migration.
Emulation means that a current system imitates an older, often obsolete system in as many respects as possible. Programs that do this are called emulators. A prominent example is DOSBox, which emulates an old MS-DOS system, including almost all of its functionality, on current computers. This makes it possible to keep using software written for MS-DOS that would most likely no longer run natively on a current system.
Migration or data migration means the transfer of data to another system or another data carrier. In the area of LTA, the aim is to ensure that the data can still be read and viewed on the target system. For this, it is necessary that the data are not inseparably linked to the data carrier on which they were originally collected. Remember that metadata must also be migrated!
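A single migration step can be sketched in a few lines of Python. The sketch below is a minimal illustration, not a prescribed LTA workflow. It assumes that each data file is accompanied by a JSON metadata file named `<filename>.meta.json` (a hypothetical convention chosen for this example), copies the file to the new location, verifies with a SHA-256 checksum that the copy is bit-identical, and records the step in a `migration_history` list so that the metadata are migrated together with the data.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(source: Path, target_dir: Path) -> Path:
    """Copy a file and its metadata sidecar to target_dir and log the step."""
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)  # copies content and file timestamps

    # Integrity check: the copy must be bit-identical to the source.
    checksum = sha256_of(source)
    if sha256_of(target) != checksum:
        raise IOError(f"checksum mismatch after copying {source}")

    # Hypothetical sidecar convention: "<filename>.meta.json" next to the file.
    old_sidecar = source.with_name(source.name + ".meta.json")
    metadata = {}
    if old_sidecar.exists():
        metadata = json.loads(old_sidecar.read_text(encoding="utf-8"))

    # Append this step to the migration history and write the new sidecar.
    metadata.setdefault("migration_history", []).append({
        "from": str(source),
        "to": str(target),
        "sha256": checksum,
        "migrated_at": datetime.now(timezone.utc).isoformat(),
    })
    new_sidecar = target.with_name(target.name + ".meta.json")
    new_sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return target
```

This sketch only covers a bit-identical transfer to a new location. A migration that also converts the file format would need a format-specific comparison instead of a checksum of the whole file.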
When choosing a suitable storage location for long-term archiving, you should consider the following points:
  • Technical requirements – The service provider should have a data conversion, migration and/or emulation strategy. In addition, a readability check of the files and a virus check should be carried out at regular intervals, and all steps should be documented. Copies of the data should also be kept in several locations (on-site, near-site, off-site).
  • Seals for trustworthy long-term archives – Various seals have been developed to assess whether a long-term archive is trustworthy, e.g. the nestor seal, which was developed on the basis of DIN 31644 “Criteria for trustworthy digital long-term archives”, the ISO 16363 standard based on the Reference Model for an Open Archival Information System (OAIS), or the CoreTrustSeal.
  • Costs – The operation of servers as well as the implementation of technical standards are associated with costs, which is why some service providers charge for their services. The price depends above all on the amount of data.
  • Making the data accessible – Before choosing the storage location, you should decide whether the data should be accessible or only stored.
  • Service provider longevity – Economic and political factors influence the longevity of service providers.
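The regular check of copies held in several locations, mentioned under the technical requirements above, can be illustrated with a simple checksum comparison. The following Python sketch is an assumption-laden example and not a description of how any particular archive works: the function `check_copies`, the location labels, and the idea of storing one reference SHA-256 checksum at deposit time are all hypothetical choices made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_copies(copies: dict, expected_sha256: str) -> dict:
    """Compare each copy against the checksum recorded at deposit time.

    copies maps a location label (e.g. "on-site") to the Path of that copy.
    Returns a report mapping each label to "ok", "corrupted", or "missing".
    """
    report = {}
    for location, path in copies.items():
        if not path.is_file():
            report[location] = "missing"
        elif sha256_of(path) == expected_sha256:
            report[location] = "ok"
        else:
            report[location] = "corrupted"
    return report
```

A matching checksum only shows that the bits are unchanged. Whether the file can still be opened and understood with current software is a separate question that such a check does not answer.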

