eArchiving Documentation
Use eArchiving
eArchiving is arguably the most complex of the CEF Building Blocks and this "Use eArchiving" section provides an in-depth, comprehensive guide to the whole eArchiving end-to-end workflow, with use cases as examples.
The aim of eArchiving is to provide the core specifications, software, training and knowledge to help data creators, software developers and digital archives tackle the challenge of short-, medium- and long-term data management and reuse in a sustainable, authentic, cost-efficient, manageable and interoperable way. The core of eArchiving is formed by the Information Package specifications, which describe a common format for storing bulk data and metadata in a platform-independent, authentic and long-term understandable way. The specifications are well suited to migrating valuable long-term data between generations of information systems, transferring data to dedicated long-term repositories (i.e. digital archives), and preserving and reusing data over shorter or extended periods of time and across generations of software systems. Alongside the specifications, eArchiving offers a set of sample software that demonstrates the format in different scenarios and business environments, as well as consultancy on long-term digital preservation risks and their mitigation.
How to get started?
Using eArchiving starts with mapping your digital preservation problem to the eArchiving format specifications and tool portfolio, that is, selecting the format specifications and tools that best address your problem. Finding the right eArchiving components is not always easy. You have to understand the logic behind the eArchiving elements and have some knowledge of the eArchiving use cases, specifications and tools.
This section aims to help newcomers to digital archiving or to the eArchiving Building Block find the best solution. We guide you through the eArchiving concepts, approaches and elements:
- The OAIS Reference Model of a digital archive and its information package and process concepts
- E-ARK use cases and processes
- Understanding eArchiving specifications and tools
- Finding solutions to your digital archiving problems
Main Standards and references
The eArchiving specifications are based on common, international standards for transmitting, describing and preserving digital data. The main standard is the Reference Model for an Open Archival Information System (OAIS), which has Information Packages as its basis. The main standard for transmitting Information Packages is the Metadata Encoding and Transmission Standard (METS), and the main standard for preserving Information Packages is PREservation Metadata: Implementation Strategies (PREMIS).
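To give a feel for how METS describes the content of a package, the following is a minimal sketch in Python. The element and attribute names (`mets`, `fileSec`, `fileGrp`, `file`, `FLocat`, `structMap`, `div`, `fptr`, `OBJID`) come from the METS schema; the function name, the `file-N` identifiers and the example paths are hypothetical. The result is only a skeleton and is not claimed to satisfy the eArchiving specifications, which place further requirements on the METS file.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def build_minimal_mets(package_id: str, files: Iterable[str]) -> ET.Element:
    """Build a bare METS skeleton listing the given files (illustrative only)."""
    # Prefixes are only used when the tree is serialised.
    ET.register_namespace("mets", METS_NS)
    ET.register_namespace("xlink", XLINK_NS)

    root = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": package_id})

    # fileSec: inventory of the files contained in the package.
    file_sec = ET.SubElement(root, f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": "data"})

    # structMap: logical structure that points back to the files.
    struct_map = ET.SubElement(root, f"{{{METS_NS}}}structMap")
    div = ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"LABEL": package_id})

    for number, path in enumerate(files, start=1):
        file_id = f"file-{number}"  # hypothetical ID scheme
        file_el = ET.SubElement(file_grp, f"{{{METS_NS}}}file", {"ID": file_id})
        ET.SubElement(
            file_el,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": path},
        )
        ET.SubElement(div, f"{{{METS_NS}}}fptr", {"FILEID": file_id})

    return root
```

Calling `ET.tostring(build_minimal_mets("example-ip", ["data/a.txt"]), encoding="unicode")` returns the skeleton as an XML string that can be inspected or written to disk.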
OAIS Reference Model
The conceptual starting point of the information package specifications, use cases and process of the E-ARK project was the Open Archival Information System (OAIS) Reference Model (https://public.ccsds.org/pubs/650x0m2.pdf). The OAIS Reference Model is designed as a conceptual framework of a digital archive. The model defines three types of information packages and a set of electronic archival processes.
OAIS Functional Entities (source: public.ccsds.org)
An information package, according to the OAIS model, contains the archival content along with descriptive and technical metadata. The three information package types are:
- Submission Information Package (SIP), i.e. the input of the archive,
- Dissemination Information Package (DIP), the output of the archive and
- Archival Information Package (AIP), the internal format managed by the archive during long-term preservation.
The processes of an OAIS archive are:
- Ingest
- Archival Storage
- Preservation Planning
- Data Management
- Access
- Administration
The above list is often extended with a Pre-Ingest process. Pre-Ingest covers the data and metadata assessment and compilation into the Submission Information Package. The Pre-Ingest process is usually performed by the data producer institution (Producer).
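The relationship between the package types and the processes that transform them can be sketched as a small data model. This is only an illustration of the description above: the names `PackageType`, `PACKAGE_FLOW` and `lifecycle` are hypothetical, and only the three processes that create a new package type are modelled.

```python
from enum import Enum


class PackageType(Enum):
    SIP = "Submission Information Package"
    AIP = "Archival Information Package"
    DIP = "Dissemination Information Package"


# process name -> (package type consumed, package type produced)
# None stands for the producer's source data, which is not yet a package.
PACKAGE_FLOW = {
    "Pre-Ingest": (None, PackageType.SIP),
    "Ingest": (PackageType.SIP, PackageType.AIP),
    "Access": (PackageType.AIP, PackageType.DIP),
}


def lifecycle():
    """Follow the flow from the producer's source data to the DIP."""
    current = None
    steps = []
    while True:
        candidates = [
            (process, produced)
            for process, (consumed, produced) in PACKAGE_FLOW.items()
            if consumed is current
        ]
        if not candidates:
            break
        process, current = candidates[0]
        steps.append((process, current))
    return steps
```

Iterating over `lifecycle()` yields the processes in order together with the package type each one produces, which mirrors the path of content through the archive from Producer to end user.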
E-ARK use cases and processes
Within the scope of the E-ARK project (a predecessor of the eArchiving Building Block, which ran from 2014 to 2017), the E-ARK team
- identified the E-ARK use cases and detailed the related OAIS processes,
- developed a set of format specifications (including a detailed structure for all three types of OAIS information packages),
- and developed or modified a set of tools to process the information packages.
Use cases identified by the E-ARK project
- Pre-Ingest and Ingest use cases
  - Export and ingest relational database(s) based on SIARD
  - Export and ingest electronic records based on MoReq2010
  - Package and ingest simple files from a file system
  - Package and ingest geodata related to other digital content in the package
- Access use cases
  - Access relational database(s) based on SIARD
  - Access relational database(s) via SOLR (not SQL)
  - Access single electronic records/files (ingested from an ERMS or from a file system)
  - Access data via OLAP (data cube) technology
  - Access geodata related to other digital content in the package
Understanding eArchiving specifications and tools
The following tables show the digital archiving components resulting from the E-ARK project layered according to the OAIS processes. The columns of the (source and intermediate) formats are left white while the columns containing the tools – performing the transition from one format to the other – are drawn in amber.
Pre-Ingest and Ingest
Data Source | Export tool | Content type format | SIP creation tool | Submission Information Package | Ingest tool | Archival Information Package | Archival Repository |
Database | DBVTK | SIARD 2.0 | RODA-In; ESS ETP; SIP Creator (E-ARK Web) | E-ARK SIP | RODA; ESS ETA; SIP2AIP Converter (E-ARK Web) | E-ARK AIP | RODA Repository; ESS Preservation Platform; HDFS Storage; SOLR Index (E-ARK Web) |
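ERMS | ERMS export module | ERMS content type | (as above) | (as above) | (as above) | (as above) | (as above) |
Files | | | (as above) | (as above) | (as above) | (as above) | (as above) |
Geodata | QGIS* | Geodata content type | (as above) | (as above) | (as above) | (as above) | (as above) |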
*QGIS is not an E-ARK product. Some freely available and (almost) industry standard tools were integrated into and tested together with the E-ARK toolset during some pilot scenarios in the E-ARK project.
Access
Archival Repository | Archival Information Package | Search and Order tools | DIP creation tool | Dissemination Information Package | Viewer | Output Format |
RODA Repository; ESS Preservation Platform; HDFS Storage; SOLR Index (E-ARK Web) | E-ARK AIP | Search & Display; Order Management Tool; Lily Ingest*; E-ARK Web Search | RODA; ESS EPP; AIP2DIP Converter (E-ARK Web) | E-ARK DIP | DBVTK | Relational Database |
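(as above) | (as above) | (as above) | (as above) | (as above) | SOLR | SOLR Database |
(as above) | (as above) | (as above) | (as above) | (as above) | CMIS Portal Viewer | ERMS record |
(as above) | (as above) | (as above) | (as above) | (as above) | IP Viewer | Simple files |
(as above) | (as above) | (as above) | (as above) | (as above) | OLAP* Viewer | OLAP Data |
(as above) | (as above) | (as above) | (as above) | (as above) | QGIS*, Peripleo* | Geodata |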
*QGIS, Peripleo, Lily Ingest, Oracle OLAP are not E-ARK products. Some freely available and (almost) industry standard tools were integrated into and tested together with the E-ARK toolset during some pilot scenarios in the E-ARK project.
As the above tables show, the format specifications indicate the connection points between the processing steps as the process progresses. If the format specifications can be standardized, they automatically bring compatibility between the consecutive process steps. That is exactly the reason why detailed format specifications were desperately needed. The OAIS model doesn’t specify the internal structure of the information packages. One of the main goals of the E-ARK project was to provide the archival community with detailed format specifications.
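The idea that formats act as connection points between tools can be sketched as a small graph, where formats are nodes and tools are the edges between them. The sketch below covers only the relational database use case and names only one tool per step as read from the tables above; the names `STEPS` and `find_chain` are hypothetical, and the mapping is an illustration rather than a complete inventory of the toolset.

```python
from collections import deque

# (input format, output format) -> example tools performing the transition
STEPS = {
    ("Database", "SIARD 2.0"): ["DBVTK"],
    ("SIARD 2.0", "E-ARK SIP"): ["RODA-In"],
    ("E-ARK SIP", "E-ARK AIP"): ["RODA"],
    ("E-ARK AIP", "E-ARK DIP"): ["RODA"],
    ("E-ARK DIP", "Relational Database"): ["DBVTK"],
}


def find_chain(source, target):
    """Breadth-first search for a chain of (tools, format) hops."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, path = queue.popleft()
        if fmt == target:
            return path
        for (src, dst), tools in STEPS.items():
            if src == fmt and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [(tools, dst)]))
    return None  # no sequence of tools connects the two formats
```

Because each hop only depends on the format it receives, any tool that reads and writes the agreed formats could be swapped in at a given step without affecting the rest of the chain.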
The E-ARK project has defined the following format specifications:
For OAIS information packages
- Common Specification for Information Packages
- SIP Specification
- AIP Specification
- DIP Specification
For content types (to store data of specific types within the information package)
- SIARD 2.0 format for databases
- ERMS format for electronic records from records management systems
- Geodata format to store geographic information along with other data or content types
Every tool developed or modified in the scope of the E-ARK project is compatible with all the above format specifications.
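As a rough illustration of what a package following the Common Specification looks like on disk, the sketch below creates an empty folder skeleton. The folder and file names (`METS.xml`, `metadata`, `representations`, `schemas`, `documentation`) are an assumption recalled from the published Common Specification and are not stated on this page; the function name and the representation name `rep1` are hypothetical. Consult the specification itself for the authoritative structure and its mandatory elements.

```python
from pathlib import Path


def create_package_skeleton(root: Path, representation: str = "rep1") -> Path:
    """Create an empty information package folder layout (illustrative only)."""
    subfolders = (
        "metadata/descriptive",   # descriptive metadata about the content
        "metadata/preservation",  # preservation metadata, e.g. PREMIS
        f"representations/{representation}/data",  # the content itself
        "schemas",                # XML schemas used in the package
        "documentation",          # additional documentation
    )
    for sub in subfolders:
        (root / sub).mkdir(parents=True, exist_ok=True)
    # Placeholder only: a real package needs a populated METS file here.
    (root / "METS.xml").touch()
    return root
```

A call such as `create_package_skeleton(Path("example-ip"))` produces the empty layout, which a SIP creation tool would normally populate and describe for you.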
The E-ARK Web solution was developed as a reference implementation. Although it is not a mature tool set (it is currently under further development), all components were well tested and tried in cooperation with the specifications and other tools in some of the more than twenty real-world E-ARK pilot scenarios.
You can find a basic description of every component, along with links to more detailed information, on the Library page of the General Model (http://kc.dlmforum.eu/gm3).
The General Model provides information about all E-ARK components from different aspects. The cross-reference view shows the connected elements of a selected component. The components are divided into four groups: format specifications, use cases and processes, tools and pilot scenarios.
The above product portfolio of the E-ARK project is considered the initial release of the eArchiving Building Block services. (Please note that the General Model is being redesigned according to the service-oriented approach of the eArchiving Building Block.)
Finding solutions to your digital archiving problems
Finding an eArchiving solution that meets your requirements means mapping your problem to the eArchiving format specifications and the Sample Software Portfolio tools, that is, finding the specifications and tools that best match your needs.
Finding the right eArchiving components is not always easy. eArchiving follows a modular approach, so a particular digital archiving task can often be solved in more than one way, sometimes with overlapping tools from different vendors. To help you find your way around the Sample Software Portfolio, we recommend consulting the General Model. Its versatile views help you find the information you need to select the appropriate components.
Probably the most informative section of the General Model is the Map view. It shows all E-ARK elements organized according to the OAIS processes. The Map view has four subviews:
- format specifications,
- use cases and processes,
- tools,
- and pilots.
The format specification subview highlights the format specifications (along with the source and output content formats) in white.
To the left you can find the source formats corresponding to the pre-ingest use cases. Then each input/output element (content types, SIP, AIP, DIP) corresponds to an E-ARK format specification. At the rightmost part you can find the output file formats after a successful access process.
We would recommend using one of the eArchiving format specifications if you can. If you decide to use your local formats it is not guaranteed that the eArchiving tools can process them.
The Map view also presents the tools processing the input formats into the output formats.
The tools subview shows the names of the tools in orange, written on the arrows pointing from the input to the output. For example, RODA-In creates an E-ARK SIP from any of the content types to the left.
As you can see, there can be more than one tool for the same purpose (e.g. creating a SIP can be performed by four different tools). We recommend experimenting a little with the candidate tools to find out which one best suits your requirements, infrastructure and archival environment. You can find more information about how other institutions have used the selected tools and specifications in the E-ARK pilot documentation (explained below). Although in theory all modules of the Sample Software Portfolio are compatible with each other, using tools from the same vendor is usually the safer choice, because they are constantly tested with each other in many digital archiving environments and scenarios.
The process subview shows the high-level process diagrams of the selected OAIS process.
As with the information packages, the OAIS model only names the required processes but doesn’t define the internal structure or give any detail. The E-ARK project defined the processes at an appropriate detail level in order to design the tools.
The pilots subview summarizes the pilot scenarios executed by the project.
In order to test the tools and specifications of the E-ARK use cases, the project has carefully planned and executed a set of more than twenty real-world pilot scenarios at archival institutions in seven European countries. The view presents the scenarios of each pilot site showing the tools and specifications they have tested along the process map from pre-ingest to access.
As the E-ARK project focused on testing the cooperation and interactions of the different components, these pilot scenarios can be very useful when planning your own digital archival scenarios. If you can find pilot scenarios resembling your own, the pilot documentation will help you implement your own solution. A pilot scenario can be helpful if it implements the same use case (like archiving databases, or geodata along with your content), or uses the same tools you are planning to try out.
You can find detailed information about the pilots in the D2.3 Detailed Pilot Specification and D2.4 Pilot Documentation documents.