aspace_preservica_db
A reporting database for collating and doing stuff with ArchivesSpace and Preservica metadata
Tables
The tables in the database and how they are generated
archival_object
This table is populated by running the archival_object_table.sql query against the ArchivesSpace database.
| Column | Description | Type | | ———————— | —————————— | ———— | | id | The archival object id | int | | repo_id | The repository id | int | | root_record_id | The parent resource id | int | | parent_id | The parent archival object | int | | ref_id | The ref id | varchar(255) | | component_id | The component unique id | varchar(255) | | title | The archival object title | varchar(8704)| | publish | 1 published, 0 unpublished | int | | level | Hierarchical level (i.e. file) | varchar(255) | | preservica_collection_id | Found in the note field | varchar(255) | | extent | Concatenated extent value | varchar(255) | | physical_containers | Concatenated container data | mediumtext | | create_time | ArchivesSpace creation time | timestamp | | m_time | ArchivesSpace last modified | timestamp |
digital_object
This table is populated by running the digital_object_table.sql query against the ArchivesSpace database.
| Column | Description | Type | | ———————— | —————————— | ———— | | id | The digital object id | int | | digital_object_id | The Preservica DelUnit id | varchar(255) | | archival_object_id | The linked record id | int | | title | The digital object title | varchar(255) | | publish | 1 published, 0 unpublished | int | | has_content_link | 1 has link. 0 no link | int | | create_time | ArchivesSpace creation time | timestamp | | m_time | ArchivesSpace creation time | timestamp |
digital_object_component
This table is populated by running the digital_object_component_table.sql query against the ArchivesSpacee database.
| Column | Description | Type | | ———————— | —————————— | ———— | | id | The digital object component id| int | | repo_id | The repository id | int | | root_record_id | The root digital object id | int | | parent_id | The parent digital object id | int | | component_id | The component unique id | varchar(255) | | title | The component title | varchar(8704)| | publish | 1 published, 0 unpublished | int | | create_time | ArchivesSpace creation time | timestamp | | m_time | ArchivesSpace creation time | timestamp |
restriction
This table is populated by running the restrictions_table.sql query against the ArchivesSpace database. This query can only be run on a local/test version of YUL ArchivesSpace, as it requires SQL common table expressions (CTEs) that only exist in MySQL 8+. The production version of YUL ArchivesSpace should be upgraded to MySQL 8 in early 2021.
| Column | Description | Type | | ———————— | —————————— | ———— | | id | The archival object id | int | | resource_note_text | The resource-level note | longtext | | lvl | The note level | int | | path | The restriction text hierarchy | longtext | | type_path | The restriction type hierarchy | longtext | | end_path | The restriction date hierarchy | longtext |
hierarchy
This table is populated by running the hierarchies.sql query against the ArchivesSpace database. The hierarchies table is a view that only exists in a local/test version of YUL ArchivesSpace, as it requires SQL common table expressions (CTEs) that only exist in MySQL 8+. The production version of YUL ArchivesSpace should be upgraded to MySQL 8 in early 2021.
| Column | Description | Type | | ———————— | —————————— | ———— | | id | The archival object id | int | | root_record_id | The resource id | int | | repo_id | The repository id | int | | full_path | The full hierarchy | longtext | | lvl | The hierarchical level | int |
collection
This table is populated by extracting data from Preservica collection XML files. The collection IDs are stored in notes within ArchivesSpace archival object records, and are extracted by running the get_collection_ids.sql query against the ArchivesSpace database.
| Column | Description | Type | | ———————— | —————————— | ———— | | id | Preservica collection id | varchar(255) | | parent_collection_id | Preservica parent collection | varchar(255) | | collection_code | Preservica collection code | varchar(255) | | security_tag | Preservica security tag | varchar(255) | | title | Preservica collection title | varchar(8704)| | create_time | Local database create time | timestamp | | m_time | Local database last modified | timestamp |
deliverable_unit
This table is populated by extracting data from Preservica deliverable unit XML files. The deliverable unit IDs are stored in digital object records in ArchivesSpace, and are extracted by running the get_deliverable_unit_ids.sql query against the ArchivesSpace database.
| Column | Description | Type | | ————————- | —————————— | ———— | | id | The digital object id | int | | collection_id | The Preservica collection id | varchar(255) | | parent_deliverable_unit | The parent deliverable unit id | varchar(255) | | root_parent_del_unit | The root deliverable unit id | varchar(255) | | deliverable_unit_level | The deliverable unit level | int | | digital_surrogate | yes or no | varchar(255) | | coverage_from | Start date | timestamp | | coverage_to | End date | timestamp | | security_tag | Security tag | varchar(255) | | create_time | Local database create time | timestamp | | m_time | Local database last modified | timestamp |
manifestation
This table is populated by extracting data from Preservica deliverable unit XML files.
| Column | Description | Type | | ———————— | —————————— | ———— | | id | Preservica manifestation id | varchar(255) | | deliverable_unit_id | Deliverable unit id | varchar(255) | | typeref | The manifestation type | varchar(255) | | summary | Preservica summary | varchar(8704)| | create_time | Local database create time | timestamp | | m_time | Local database last modified | timestamp |
digital_file
This table is populated by extracting data from Preservica digital file XML files. The list of digital files is derived from Preservica deliverable unit XML files.
| Column | Description | Type | | ———————— | —————————— | ———— | | id | The digital file id | int | | manifestation_id | The manifestation id | int | | file_set_id | The file set id | int | | filesize | The file size in bytes | int | | filemoddate | The file modification date | timestamp | | filename | The name of the file | varchar(255) | | working_path | The relative path to the file | varchar(255) | | format_puid | The format unique id | varchar(255) | | format_name | The format name | mediumtext | | mimetype | The format mimetype | varchar(255) | | create_time | Local database creation time | timestamp | | m_time | Local database last modified | timestamp |
Extracting Data from Source Databases
How to extract data from ArchivesSpace and Preservica
ArchivesSpace
Data is extracted from ArchivesSpace via several SQL queries
Preservica
Data is extracted from Preservica by calling the Preservica API, as access to the Preservica database is restricted.
Initializing the aspace_preservica_db Database
How to create and populate the database using extracted data from ArchivesSpace and Preservica
Run process_xml_for_db.py
The first step in initializing the aspace_preservica_db database is running the process_xml_for_db.py script on extracted Preservica XML files.
Run db_manager.py
Once the CSV spreadsheets are prepared for ingest, it is time to run the db_manager.py script to create and populate the database.
Run SQL updates to fix field formatting
After the data is ingested, a little bit of clean-up is necessary.
Derive parent deliverable unit data
Updating the aspace_preservica_db Database
Checking for new data
Adding new data to the database
Design Notes and To-Dos
Reconciling missing data
Some data was missed during the initial load:
Collection data
Child deliverable unit data
Digital object component data
Restriction data
Consider re-designing manifestation table, since there are duplicate manifestation IDs; or at least sort out those issues.