ArchivesSpace Beyond the Basics Skillshare: Data Auditing for the ArchivesSpace PUI
In August 2017, Yale University Libraries’ special collections repositories initiated a library-wide project in preparation for the rollout of the ArchivesSpace Public User Interface (PUI) in early 2018. The project is comprised of six subgroups focusing on different aspects of the implementation:
Accessibility
User experience and design
Technical integrations
Training and documentation
Branding and promotion
Data cleanup and enhancements (that’s us!)
Yale’s special collections repositories create an enormous amount of metadata, and the ArchivesSpace PUI represents a radical change from our current finding aid database (YFAD). Among other things, this means that we have a lot of data clean-up work to do before we can “go live” with the PUI. The Data Cleanup and Enhancements Workgroup has been charged with identifying the nature and extent of our data problems, and coming up with in-house or outsourced solutions.
In our initial meetings, we identified a laundry list of things that could be fixed. But, given the short-term nature of this project, we’ve had to do some hard thinking about which of our data quality issues will have the greatest impact on our users and on the security of our restricted metadata (i.e. student records, donor-imposed restrictions) once the PUI is implemented, and to stay focused on just those issues.
Our first task, after narrowing the scope of our work, was to identify the nature and extent of our problems in each of the eight areas on which we decided to focus:
Publication status
Restrictions
Dates
Note Labels
Shared records/controlled value lists + Extents
Preferred Citations
URLs
Containers
The demos that follow outline the steps we took to audit our data in each of these areas. We share them in the hopes that they can be of use to other institutions seeking to undertake similar work. We will continue to update this repository as we evaluate our results and implement our solutions.
Requirements
ArchivesSpace 2.x+
MySQL client
Python 3
Highly Recommended: Anaconda (Python distribution and package manager)
pandasmodule (included with Anaconda)dateutilmodule (included with Anaconda)
LibreOffice or Excel
Demo 1: Publication Status
Objectives
Query the ArchivesSpace database to find the publication status of accessions, resources, archival objects, and notes.
Analyze results
Prepare for ArchivesSpace update
Query Database
get_note_pub_status.sql
Returns note text (JSON format), publication status, parent URI, and persistent ID
get_component_pub_status.sql
Returns archival object publication status, archival object display string, archival object level, parent resource publication status, parent EAD ID, and URI
get_resource_pub_status.sql
Returns resource title, EAD ID, publication status, and URI
Can create similar queries for digital objects and accessions if desired
Analyze Results
pandas-toolbox.py
Group by resource, publication status. Get counts of published and unpublished records
Prepare Data for ArchivesSpace Update
Review query outputs and make changes to publication status column - 1s and 0s
update_pub_status.py
Updates publication status using updated query outputs as input
Demo 2: Access Restrictions
Objectives
Query the ArchivesSpace database to retrieve all conditions governing access and conditions governing use notes for resources, archival objects, and digital objects.
Analyze results
Clean data, prepare for ArchivesSpace update
Troubleshoot errors
Query Database
get_all_access_restricts.sql
Returns URI, JSON-formatted note text, restriction note type, local restriction type, restriction begin date, restriction end date, title(s), persistent ID
get_all_use_restricts.sql
Returns URI, JSON-formatted note text, restriction note type, local restriction type, restriction begin date, restriction end date, title(s), persistent ID
Extract Text
extract_multipart_text.py
Extracts restriction note text from JSON output from database and appends to a new copy of access restriction query ouput
Analyze and Clean Data in OpenRefine
Sorting, Faceting and Filtering
Find records with potentially machine-actionable restrictions in free text fields, which do not have any text in machine-actionable fields
Identify records which do not have restrictions
Keyword searches - ‘open’, ‘closed’, etc.
Regular expressions to find dates and restrictions in strings:
[^a-zA-z0-9]\d\d\d\d[^a-zA-Z0-9]- [1|2]\d\d\dBroadest formulation, will return all 4 digit numbers directly in between any 2 non-letter, non-number characters. May also return accession numbers which contain years
For more specific formulations, see: Regular Expression Library
\b[r|R](estrict)\w*\bBroad search - will return ‘restricted’, ‘Restricted’, ‘restriction’, ‘Restriction’; but may also return ‘not restricted’, etc.
Narrower searches/further searches on filtered data - ‘restricted fragile’, ‘donor’
Won’t necessarily capture everything, but can be iterative
Reg Ex/OpenRefine Resources:
OpenRefine Cheat Sheet (Arcadia Falcone)
Prepare Data for ArchivesSpace Update
Adding columns in OpenRefine
Can take machine actionable dates from free text fields and move them to a new column, which can then be used as input for making updates to notes via the ArchivesSpace API
Troubleshooting
The rights restriction tables in the ArchivesSpace database are less-than-intuitive, and documentation is lacking. These scripts explicate the differences between result sets and identify duplicates, and may help identify the cause of inconsistent results.
find_diffs.py
find_dupes.py
Demo 3: Dates
Objectives
Query the ArchivesSpace database to return all dates for accessions, resources, archival objects, and digital objects.
Analyze results
Clean data, prepare for ArchivesSpace update
Query Database
accession_dates.sql
Returns accession title, URI, date expression, begin date, end date, date type, date label, date certainty, date calendar, date era
resource_dates.sql
Returns resource title, identifier, URI, EAD ID, date expression, begin date, end date, date type, date label, date certainty, date calendar, date era
component_dates.sql
Returns archival object title, archival object URI, parent resource title, parent URI, parent EAD ID, component level, date expression, begin date, end date, date type, date label, date certainty, date calendar, date era. Also can be limited by repository ID and by amount of records.
Analyze Results
archival_object_date_expression
Returns archival object id, archival object title, parent resource record title, EAD ID, resource identifier, repository ID, date expression, date begin, and date end. Limits returns to records where date expression has data and empty machine-readable begin and end dates. Can also be limited by repository ID, and with limits for amount of records to return.
pandas-toolbox.py
Analyzes date type usage, era, certainty, etc.
OpenRefine
See Demo 2: Restrictions for guidance on using regular expressions to parse free text date fields
Timetwister + subprocess module
Thanks to Alex Duryee!
timetwister_parse.py
Loops through a spreadsheet containing unstructured data expressions and parses them into machine-readable dates using timetwister
Timewalk plugin
update_dates.py
Can request all URIs for dates you want to parse, then POST without making any changes. Dates associated with those URIs should parse.
dateutil.parser module
date_parse.py
Loops through a spreadsheet containing unstructured data expressions and parses them into machine-readable dates using dateutil Python module
Much less effective than timetwister, but does not require installation of Ruby or timetwister
Prepare for ArchivesSpace Update
Manipulating timetwister output, appending to original query output
Demo 4: Labels
Objectives
Query the ArchivesSpace database to retrieve all user-supplied labels attached to resource-, archival object-, or digital-object level notes.
Analyze results
Query Database
get_all_notes.sql
Returns note text (JSON format), URI, persistent ID
Note: this query gets all notes. You can run this query one time and run numerous demos from this presentation on the same dataset. Can also split by repository for quicker analysis. Might be a good idea to make copies of the output.
Extract Text
extract_labels.py
Extract notes type and label into a list, along with URI and persistent ID
Analyze Data
pandas-toolbox.py
Get counts of label usage relative to note type
Demo 5: Managing Shared Records (Controlled Value Lists, Agents, Subjects, Container Profiles, etc) + Extents
Objectives
Query the ArchivesSpace database to retrieve all records linked to controlled values
Analyze results
Prepare for updates
Query Database
extent_type_rec_links.sql
Returns repository name, resource identifier, resource title, coponent title, extent type ID, extent type, URI
agents_linked_recs.sql
Returns repository name, resource identifier, resource title, component title, accession title, record URI, agent, agent URI
subjects_linked_recs.sql
Returns repository name, resource identifier, resource title, component title, subject, URI
container_profiles_linked_recs.sql
Returns repository name, resource identifier, resource title, component title, container profile, URI
get_extents.sql
Returns URI, title(s), container summary, portion, number, extent type, physical details, dimensions
get_extents_plus_top_containers.sql
Returns extent data + container data for comparison
get_enum_positions.sql
Returns URI, enumeration value, position
Analyze Results
pandas-toolbox.py
Group results by physical description, etc.
Preparing for ArchivesSpace Update
Contacting repositories, making policy decisions
update_enum_val_positions.py
Updates position of enumeration values for a particular enumeration using
get_enumeration_value_positions.sqloutput as input
Demo 6: Citations
Objectives
Query the ArchivesSpace database to retrieve all preferred citation notes
Analyze results
Clean data, prepare for ArchivesSpace update
Query Database
get_all_notes.sql
Returns note text (JSON format), URI, persistent ID
Note: this query gets all notes. You can run this query one time and run numerous demos from this presentation on the same dataset. Can also split by repository for quicker analysis. Might be a good idea to make copies of the output.
Extract Text
extract_text.py
Extracts text from JSON output.
ID_missing_call_numbers.py
Identifies notes which do not contain non-word patterns. Can also do this in OpenRefine
Clean Data
insert_call_numbers.py
Inserts call numbers into preferred citation notes
Demo 7: URLs
Objectives
Query ArchivesSpace database to retrieve URLs attached to digital objects and notes
Test validity of links
Query Database
Thanks to Youn Noh for her work on this portion of the demo
Analyze Data
Thanks to Youn Noh for her work on this portion of the demo
notes.py
check_url.py
Checks whether a link is active or broken
digital_object_id.py
file_uri.py
A Note About Containers
get_top_containers.sql
Lessons Learned
Prioritize tasks by potential impact on users, data security
Clean-up is an iterative process
Use SQL to retrieve data
Use OpenRefine and Python to manipulate, analyze, and update data
Output as input - don’t forget your URIs and persistent IDs!
Some manual review necessary, but most tasks can be automated
Next Steps
Evaluation by repositories
Finish writing and testing update scripts
Add update scripts to data_cleanup_workgroup repository
Agents and Subjects Task Force Work