ArchivesSpace Beyond the Basics Skillshare: Data Auditing for the ArchivesSpace PUI

In August 2017, Yale University Libraries’ special collections repositories initiated a library-wide project in preparation for the rollout of the ArchivesSpace Public User Interface (PUI) in early 2018. The project is comprised of six subgroups focusing on different aspects of the implementation:

  • Accessibility

  • User experience and design

  • Technical integrations

  • Training and documentation

  • Branding and promotion

  • Data cleanup and enhancements (that’s us!)

Yale’s special collections repositories create an enormous amount of metadata, and the ArchivesSpace PUI represents a radical change from our current finding aid database (YFAD). Among other things, this means that we have a lot of data clean-up work to do before we can “go live” with the PUI. The Data Cleanup and Enhancements Workgroup has been charged with identifying the nature and extent of our data problems, and coming up with in-house or outsourced solutions.

In our initial meetings, we identified a laundry list of things that could be fixed. But, given the short-term nature of this project, we’ve had to do some hard thinking about which of our data quality issues will have the greatest impact on our users and on the security of our restricted metadata (i.e. student records, donor-imposed restrictions) once the PUI is implemented, and to stay focused on just those issues.

Our first task, after narrowing the scope of our work, was to identify the nature and extent of our problems in each of the eight areas on which we decided to focus:

  • Publication status

  • Restrictions

  • Dates

  • Note Labels

  • Shared records/controlled value lists + Extents

  • Preferred Citations

  • URLs

  • Containers

The demos that follow outline the steps we took to audit our data in each of these areas. We share them in the hopes that they can be of use to other institutions seeking to undertake similar work. We will continue to update this repository as we evaluate our results and implement our solutions.

Requirements

  • ArchivesSpace 2.x+

  • MySQL client

  • Python 3

    • Highly Recommended: Anaconda (Python distribution and package manager)

    • pandas module (included with Anaconda)

    • dateutil module (included with Anaconda)

  • OpenRefine

  • LibreOffice or Excel

Demo 1: Publication Status

Objectives

  • Query the ArchivesSpace database to find the publication status of accessions, resources, archival objects, and notes.

  • Analyze results

  • Prepare for ArchivesSpace update

Query Database

get_note_pub_status.sql

  • Returns note text (JSON format), publication status, parent URI, and persistent ID

get_component_pub_status.sql

  • Returns archival object publication status, archival object display string, archival object level, parent resource publication status, parent EAD ID, and URI

get_resource_pub_status.sql

  • Returns resource title, EAD ID, publication status, and URI

  • Can create similar queries for digital objects and accessions if desired

Analyze Results

pandas-toolbox.py

  • Group by resource, publication status. Get counts of published and unpublished records

Prepare Data for ArchivesSpace Update

  • Review query outputs and make changes to publication status column - 1s and 0s

update_pub_status.py

  • Updates publication status using updated query outputs as input

Demo 2: Access Restrictions

Objectives

  • Query the ArchivesSpace database to retrieve all conditions governing access and conditions governing use notes for resources, archival objects, and digital objects.

  • Analyze results

  • Clean data, prepare for ArchivesSpace update

  • Troubleshoot errors

Query Database

get_all_access_restricts.sql

  • Returns URI, JSON-formatted note text, restriction note type, local restriction type, restriction begin date, restriction end date, title(s), persistent ID

get_all_use_restricts.sql

  • Returns URI, JSON-formatted note text, restriction note type, local restriction type, restriction begin date, restriction end date, title(s), persistent ID

Extract Text

extract_multipart_text.py

Extracts restriction note text from JSON output from database and appends to a new copy of access restriction query ouput

Analyze and Clean Data in OpenRefine

Sorting, Faceting and Filtering

  • Find records with potentially machine-actionable restrictions in free text fields, which do not have any text in machine-actionable fields

  • Identify records which do not have restrictions

  • Keyword searches - ‘open’, ‘closed’, etc.

Regular expressions to find dates and restrictions in strings:

  • [^a-zA-z0-9]\d\d\d\d[^a-zA-Z0-9] - [1|2]\d\d\d

    • Broadest formulation, will return all 4 digit numbers directly in between any 2 non-letter, non-number characters. May also return accession numbers which contain years

  • \b[r|R](estrict)\w*\b

    • Broad search - will return ‘restricted’, ‘Restricted’, ‘restriction’, ‘Restriction’; but may also return ‘not restricted’, etc.

    • Narrower searches/further searches on filtered data - ‘restricted fragile’, ‘donor’

    • Won’t necessarily capture everything, but can be iterative

Reg Ex/OpenRefine Resources:

Prepare Data for ArchivesSpace Update

Adding columns in OpenRefine

  • Can take machine actionable dates from free text fields and move them to a new column, which can then be used as input for making updates to notes via the ArchivesSpace API

Troubleshooting

The rights restriction tables in the ArchivesSpace database are less-than-intuitive, and documentation is lacking. These scripts explicate the differences between result sets and identify duplicates, and may help identify the cause of inconsistent results.

find_diffs.py

find_dupes.py

Demo 3: Dates

Objectives

  • Query the ArchivesSpace database to return all dates for accessions, resources, archival objects, and digital objects.

  • Analyze results

  • Clean data, prepare for ArchivesSpace update

Query Database

accession_dates.sql

  • Returns accession title, URI, date expression, begin date, end date, date type, date label, date certainty, date calendar, date era

resource_dates.sql

  • Returns resource title, identifier, URI, EAD ID, date expression, begin date, end date, date type, date label, date certainty, date calendar, date era

component_dates.sql

  • Returns archival object title, archival object URI, parent resource title, parent URI, parent EAD ID, component level, date expression, begin date, end date, date type, date label, date certainty, date calendar, date era. Also can be limited by repository ID and by amount of records.

Analyze Results

archival_object_date_expression

  • Returns archival object id, archival object title, parent resource record title, EAD ID, resource identifier, repository ID, date expression, date begin, and date end. Limits returns to records where date expression has data and empty machine-readable begin and end dates. Can also be limited by repository ID, and with limits for amount of records to return.

pandas-toolbox.py

Analyzes date type usage, era, certainty, etc.

OpenRefine

  • See Demo 2: Restrictions for guidance on using regular expressions to parse free text date fields

Timetwister + subprocess module

  • Thanks to Alex Duryee!

timetwister_parse.py

Loops through a spreadsheet containing unstructured data expressions and parses them into machine-readable dates using timetwister

Timewalk plugin

update_dates.py

Can request all URIs for dates you want to parse, then POST without making any changes. Dates associated with those URIs should parse.

dateutil.parser module

date_parse.py

Loops through a spreadsheet containing unstructured data expressions and parses them into machine-readable dates using dateutil Python module

Much less effective than timetwister, but does not require installation of Ruby or timetwister

Prepare for ArchivesSpace Update

Manipulating timetwister output, appending to original query output

Demo 4: Labels

Objectives

  • Query the ArchivesSpace database to retrieve all user-supplied labels attached to resource-, archival object-, or digital-object level notes.

  • Analyze results

Query Database

get_all_notes.sql

  • Returns note text (JSON format), URI, persistent ID

Note: this query gets all notes. You can run this query one time and run numerous demos from this presentation on the same dataset. Can also split by repository for quicker analysis. Might be a good idea to make copies of the output.

Extract Text

extract_labels.py

Extract notes type and label into a list, along with URI and persistent ID

Analyze Data

pandas-toolbox.py

Get counts of label usage relative to note type

Demo 5: Managing Shared Records (Controlled Value Lists, Agents, Subjects, Container Profiles, etc) + Extents

Objectives

  • Query the ArchivesSpace database to retrieve all records linked to controlled values

  • Analyze results

  • Prepare for updates

Query Database

extent_type_rec_links.sql

  • Returns repository name, resource identifier, resource title, coponent title, extent type ID, extent type, URI

agents_linked_recs.sql

  • Returns repository name, resource identifier, resource title, component title, accession title, record URI, agent, agent URI

subjects_linked_recs.sql

  • Returns repository name, resource identifier, resource title, component title, subject, URI

container_profiles_linked_recs.sql

  • Returns repository name, resource identifier, resource title, component title, container profile, URI

get_extents.sql

  • Returns URI, title(s), container summary, portion, number, extent type, physical details, dimensions

get_extents_plus_top_containers.sql

  • Returns extent data + container data for comparison

get_enum_positions.sql

  • Returns URI, enumeration value, position

Analyze Results

pandas-toolbox.py

Group results by physical description, etc.

Preparing for ArchivesSpace Update

  • Contacting repositories, making policy decisions

update_enum_val_positions.py

  • Updates position of enumeration values for a particular enumeration using get_enumeration_value_positions.sql output as input

Demo 6: Citations

Objectives

  • Query the ArchivesSpace database to retrieve all preferred citation notes

  • Analyze results

  • Clean data, prepare for ArchivesSpace update

Query Database

get_all_notes.sql

  • Returns note text (JSON format), URI, persistent ID

Note: this query gets all notes. You can run this query one time and run numerous demos from this presentation on the same dataset. Can also split by repository for quicker analysis. Might be a good idea to make copies of the output.

Extract Text

extract_text.py

Extracts text from JSON output.

ID_missing_call_numbers.py

Identifies notes which do not contain non-word patterns. Can also do this in OpenRefine

Clean Data

insert_call_numbers.py

Inserts call numbers into preferred citation notes

Demo 7: URLs

Objectives

  • Query ArchivesSpace database to retrieve URLs attached to digital objects and notes

  • Test validity of links

Query Database

Thanks to Youn Noh for her work on this portion of the demo

Analyze Data

Thanks to Youn Noh for her work on this portion of the demo

notes.py

check_url.py

Checks whether a link is active or broken

digital_object_id.py

file_uri.py

A Note About Containers

get_top_containers.sql

Lessons Learned

  • Prioritize tasks by potential impact on users, data security

  • Clean-up is an iterative process

  • Use SQL to retrieve data

  • Use OpenRefine and Python to manipulate, analyze, and update data

    • Output as input - don’t forget your URIs and persistent IDs!

  • Some manual review necessary, but most tasks can be automated

Next Steps

  • Evaluation by repositories

  • Finish writing and testing update scripts

  • Add update scripts to data_cleanup_workgroup repository

  • Agents and Subjects Task Force Work