March 25, 2015
Spanish Artists from the Fourth to the Twentieth Century: A Critical Dictionary (also known as the Spanish Artists Dictionary) is one of the Frick Art Reference Library’s core reference sources. Originally a print publication, the dictionary was converted to a FileMaker database in the early 1990s and published online as part of the Frick’s research portal. More than 5,000 records document Spanish artists, including alternative names, dates of artistic activity, media, bibliographic references, and Frick Photoarchive holdings.
The Spanish Artists Dictionary was made available to us as a CSV spreadsheet for a project that formed part of the Pratt SILS course, “Programming for Cultural Heritage.” Taught by Matt Miller of NYPL Labs, the course introduced students to Python, a general-purpose programming language. We used Python to manipulate and extract data from CSV, JSON, and XML files.
In the first portion of our project, we focused on manipulating the Spanish Artists Dictionary to create dynamic visualizations. One area that interested us was artist dates. Whenever possible, the artist dates provided in the Spanish Artists Dictionary are birth and death dates. Since such information is not always available, however, some start and end dates consist of a confirmed date of activity. For example, if the researchers who compiled the Spanish Artists Dictionary found an exhibition catalog mentioning the artist — and there was no additional information — that exhibition date was used as the start or end date for the artist. Dates that are not birth or death dates are qualified with an abbreviation ("exh." for exhibition dates).
Our first step was to write a Python script that read the CSV data and checked the “activity date qualifier” column. The script selected the rows in which the activity date qualifier was “exh” and wrote a new CSV file containing each artist’s name and the data from the activity year(s) column. To normalize the data, it also used a regular expression to strip a square bracket character from the values.
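The filtering step described above can be sketched as follows. This is a minimal illustration, not the script we actually ran: the column headers, the sample rows, and the choice to strip both opening and closing brackets are all assumptions made for the example.

```python
import csv
import io
import re

# Hypothetical column headers; the real spreadsheet's headers may differ.
NAME_COL = "name"
QUALIFIER_COL = "activity date qualifier"
YEARS_COL = "activity years"

# Assumption: both opening and closing square brackets should be removed.
BRACKETS = re.compile(r"[\[\]]")


def extract_exhibition_rows(infile, outfile):
    """Copy name and cleaned years for rows qualified as 'exh'; return the row count."""
    reader = csv.DictReader(infile)
    writer = csv.writer(outfile)
    writer.writerow([NAME_COL, YEARS_COL])
    written = 0
    for row in reader:
        # Tolerate case differences and a trailing period ("exh" vs. "exh.").
        qualifier = (row.get(QUALIFIER_COL) or "").strip().rstrip(".").lower()
        if qualifier == "exh":
            years = BRACKETS.sub("", row.get(YEARS_COL) or "").strip()
            writer.writerow([row.get(NAME_COL, ""), years])
            written += 1
    return written


# Made-up sample rows standing in for the real CSV file.
sample = io.StringIO(
    "name,activity date qualifier,activity years\n"
    "Artist A,exh.,[1900]-1905\n"
    "Artist B,,1850-1910\n"
)
output = io.StringIO()
count = extract_exhibition_rows(sample, output)
print(count)
print(output.getvalue())
```

With real files, the same function could be passed two file objects opened with `newline=""`, which is what Python's csv module recommends for reading and writing CSV.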
Once we had cleaned and distilled the data, we wanted to provide a way for people to see and interact with it. We chose Tableau Public, a free data visualization tool. The distilled data, now in a separate CSV spreadsheet, was edited by hand to give each year of activity its own row so that Tableau would properly understand the data as date ranges. Then, Diana created three visualizations.
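The reshaping into one row per year was done by hand. As a rough sketch of how it might be scripted instead, the example below assumes that each activity value contains one or more four-digit years and gives each listed year its own row. The function names and sample values are hypothetical, and the sketch does not fill in the years between the two ends of a range, since that would depend on how the ranges are meant to be read.

```python
import re


def listed_years(value):
    """Return every four-digit year that appears in an activity value."""
    return [int(year) for year in re.findall(r"\d{4}", value)]


def one_row_per_year(rows):
    """Turn (name, activity value) pairs into (name, year) pairs, one per listed year."""
    return [(name, year) for name, value in rows for year in listed_years(value)]


# Made-up sample values standing in for the distilled CSV data.
rows = [("Artist A", "1900-1905"), ("Artist B", "1944")]
reshaped = one_row_per_year(rows)
for name, year in reshaped:
    print(name, year)
```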
The first visualization displays all the artists whose activity dates were qualified with the term “exh.” (exhibited). It shows the number of years each artist exhibited, sorted in descending order. It is interactive and can be searched by artist name and exhibition date.
This visualization plots the exhibition years on a line graph. There are noticeable spikes in the number of artists exhibiting in 1944 and 1951. The bulk of the artists in the dataset exhibited around 1900.
This visualization counts the number of years an artist exhibited. It clusters the artists by size and color. The darker the bubble, the more years an artist exhibited.
One of the great advantages of digitizing and visualizing data such as the Spanish Artists Dictionary is that doing so provides an exciting opportunity to arrange the information in a new way.
In addition to the visualizations, we also wrote a script to address a problem with linking from the Spanish Artists Dictionary records to the library catalog. When a user clicked the link to the Frick Art Reference Library’s online catalog, the search box that appeared in the catalog was populated from a field in the Spanish Artists Dictionary labeled “LC Name.” However, many entries had incorrect or outdated names in the “LC Name” field. Others did not include a name in the “LC Name” field, so the name was populated from the main name field, and this produced faulty results.
To address this problem, we wrote a script to compare the main name entry in the Spanish Artists Dictionary database with a list of Spanish names pulled from the online catalog’s subject heading list. For testing purposes, we decided to work with the 374 female artist names in the Spanish Artists Dictionary. We used the token_set_ratio function from FuzzyWuzzy, a string matching library for Python. The function breaks two strings (in this case, an artist name from the Spanish Artists Dictionary and a name from the subject heading list) into words and scores their similarity on a scale from 0 to 100. We treated any pair that scored above 90 as a match. The idea was for these matches to then be used to populate the search box when the library catalog links were clicked. The script found 49 matches, including two false matches that probably resulted from one name’s words being a subset of the other’s. For example, “Romero López, José María, 1815-1880” matched falsely with “López, María” and “Rodríguez de Losada, José María, 1826-1896” matched falsely with “Rodríguez, María” because the longer name contains every word of the shorter one.
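To show why those false matches occur, here is a simplified stand-in for a token-set comparison, written with only Python's standard library. It is not FuzzyWuzzy's implementation and its scores will not always agree with token_set_ratio; the function names are hypothetical, and “Goya, Francisco” is an invented heading added for contrast.

```python
import re
from difflib import SequenceMatcher


def tokens(name):
    """Lowercase a name and split it into a set of word tokens."""
    return set(re.findall(r"\w+", name.lower()))


def ratio(a, b):
    """Similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_similarity(a, b):
    """Compare the shared words of two names with each name's full word set."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0
    common = " ".join(sorted(ta & tb))
    full_a = (common + " " + " ".join(sorted(ta - tb))).strip()
    full_b = (common + " " + " ".join(sorted(tb - ta))).strip()
    # If one name's words are a subset of the other's, one of the first
    # two comparisons is between identical strings and scores 100.
    return max(ratio(common, full_a), ratio(common, full_b), ratio(full_a, full_b))


def find_matches(artist_names, headings, threshold=90):
    """Return (artist name, heading, score) for every pair scoring above the threshold."""
    matches = []
    for name in artist_names:
        for heading in headings:
            score = token_set_similarity(name, heading)
            if score > threshold:
                matches.append((name, heading, score))
    return matches


artists = ["Romero López, José María, 1815-1880"]
headings = ["López, María", "Goya, Francisco"]
matches = find_matches(artists, headings)
print(matches)
```

Because every word of “López, María” also appears in the longer name, the pair receives the top score even though the two headings refer to different people. One possible refinement, which we did not test, would be to require that the shared words cover most of the longer name as well.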
We hope this post has been effective in introducing some of the ways Python can be used to make data accessible in a new way. We would like to thank Anastasia Levadas, Digital Photoarchive Coordinator, and the rest of the Photoarchive staff for providing access to the Spanish Artists Dictionary — working with it was a great way to gain a hands-on understanding of how programming can be used in a cultural heritage context.
Alex and Diana are earning their MSLIS (Library and Information Science) from the Pratt Institute.