In my internship, I had two main objectives. The first was to create another machine-actionable dataset to share on the Library’s GitHub account, using the pilot study as a starting point. The second goal was to creatively reuse this data to show an example of what can be done when cultural heritage institutions share their collections as open data. Concretely, this took the form of creating an online, interactive map of Leuven through the centuries. In this blog post, I will explain the steps I took to achieve these goals.
Data Cleaning and Enriching
Thanks to Mariana’s work on the pilot study, I was able to follow many of her steps for the data cleaning, transformation and refinement. For a more detailed explanation, give Mariana’s blog post a read. I also chose to do the data cleaning in OpenRefine, since it is a free, open-source tool that lets users easily clean and transform data.
Once I had done the basic data cleaning and transforming, I began to enrich the data with location information. To use the dataset to map the images, I needed coordinates for each image. Before adding coordinates, however, I added another column, which I called Place Name. Although the metadata already included information about the place pictured in each image, it wasn’t always complete or consistent, especially when more than one location was featured in an image. While I had to fill in the Place Name column manually, it was a valuable use of time: the result was a consistent name for each place, which allowed me to take advantage of OpenRefine’s text facet feature to sort the records and add the coordinates en masse.
Figure 1: Screenshot from OpenRefine: Abdij Keizersberg, which has 5 records, is selected from the text facet
I decided to use OpenStreetMap (OSM) to get the coordinates and as the base layer for the map because it is open source and widely used for various web applications.
Figure 2: View of the OpenStreetMap interface. By right clicking and choosing “Show address” at the location of Abdij Keizersberg, we get the coordinates in the search results window in decimal degree format (here highlighted in yellow).
The latitude is copied from OSM and pasted into OpenRefine by clicking the edit button on the latitude cell. The “Fill down” feature is then applied to add the latitude to the rest of the records with the same Place Name. The same steps are repeated for the longitude column. The text facet feature made adding the coordinate information a relatively quick process, and it ensured that all records with the same Place Name have exactly the same coordinates.
Figure 3: Three views of the OpenRefine interface. The first image shows the latitude being pasted into the corresponding cell of the first record. In the second image, the fill down function is applied, resulting in the third image which shows that the latitude has been added to the remaining records.
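This fill-down pattern can also be reproduced outside OpenRefine. A minimal pandas sketch, where the column names and coordinate values are illustrative assumptions, not the actual dataset:

```python
import pandas as pd

# Illustrative records: coordinates entered once per place, blank elsewhere
# (column names and coordinate values are invented for this example)
df = pd.DataFrame({
    "Place Name": ["Abdij Keizersberg"] * 3 + ["Groot Begijnhof"] * 2,
    "Latitude":  [50.8930, None, None, 50.8687, None],
    "Longitude": [4.6983, None, None, 4.6961, None],
})

# Equivalent of OpenRefine's "Fill down" within a text facet: propagate the
# first value within each Place Name group, so every record of a place
# ends up with identical coordinates
df[["Latitude", "Longitude"]] = (
    df.groupby("Place Name")[["Latitude", "Longitude"]].ffill()
)
```

Because the fill runs per group, a place whose coordinates were never entered simply stays blank rather than inheriting a neighbour’s values.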
Creative Reuse: Georeferencing Images
With the data cleaning and enrichment complete, I moved on to the next step of my project. As part of my interactive map of Leuven, I wanted to feature the historical maps in the collection so I used them as overlays over a base layer map. In order to do so, I first needed to georeference the historical maps. Georeferencing is the process of adding coordinate information to raster files (the historical maps) so that they can be aligned to a map with a coordinate system. It works by assigning ground control points (GCPs) to the raster image for which the coordinates are known.
In order to georeference all 18 of the historical maps in the collection, I turned to QGIS, an open-source GIS application. QGIS has a georeferencer feature which allows you to select GCPs and assign them to coordinates on a base map. The first step is to add the OpenStreetMap tile as the base map and zoom in to the correct location (in this case, Leuven), and subsequently to upload the raster image, i.e. the historical map.
Figure 4: Views of the QGIS interface, adding the OSM tile (left image) and the raster file (the historical map in jpg format; right image).
Next, within the QGIS georeferencer, a point on the historical map is selected to add a GCP. When a point is chosen, a popup box lets you either manually enter the coordinates or choose them from the map canvas, in this case, the OpenStreetMap which was added in the first step. The same spot is selected on the OSM map, the coordinates get filled in and are assigned to that GCP. The same steps are then repeated to add more GCPs.
Figure 5: Views of the QGIS georeferencer showing a popup box after selecting a point on the historical map through which to add a GCP (top image), the OSM on which to select the same point (bottom left image) and the assignation of the coordinates to the GCP on the raster image (bottom right image).
A minimum of four GCPs, evenly distributed across the image, should be added. The more GCPs, the more accurately the georeferenced image will align with the base map. Once a sufficient number of GCPs have been added, the georeferencer is run and the historical map is aligned over the base map. Depending on how accurately the GCPs were chosen, and on how accurate the historical map itself is, there will be some warping and distortion of the historical map.
Figure 6: The QGIS interface shows the historical map of Leuven georeferenced on top of the base map.
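QGIS supports several transformation types; the simplest, a first-order (affine) transform, can be written down directly. The sketch below shows how GCPs determine such a transform by least squares; the pixel and map coordinates are invented for illustration:

```python
import numpy as np

def fit_affine(pixels, coords):
    """Least-squares first-order (affine) georeferencing transform.

    pixels: (N, 2) array of (col, row) image positions of the GCPs
    coords: (N, 2) array of (lon, lat) map positions of the same GCPs
    Returns a function mapping pixel positions to map coordinates.
    """
    pixels = np.asarray(pixels, float)
    # Design matrix [col, row, 1]: six unknown parameters in total
    A = np.hstack([pixels, np.ones((len(pixels), 1))])
    M, *_ = np.linalg.lstsq(A, np.asarray(coords, float), rcond=None)
    def to_map(p):
        p = np.atleast_2d(np.asarray(p, float))
        return np.hstack([p, np.ones((len(p), 1))]) @ M
    return to_map

# Four invented GCPs: image corners matched to map coordinates around Leuven
gcps_px  = [(0, 0), (1000, 0), (1000, 800), (0, 800)]
gcps_map = [(4.690, 50.895), (4.710, 50.895), (4.710, 50.875), (4.690, 50.875)]
to_map = fit_affine(gcps_px, gcps_map)
print(to_map([(500, 400)]))  # pixel centre maps to roughly the map centre
```

With more than the minimum number of GCPs the system is overdetermined, and the least-squares residuals give exactly the kind of accuracy feedback the QGIS georeferencer reports.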
Creative Reuse: Making the Interactive Online Map
With the map pages set up and the historical map overlays in place, the next step was to add all of the places to the map. The d3 library let me parse the CSV file and add each location (using the coordinates added during the data enrichment stage) as a point on the map. Using the Leaflet popup feature, I added the name of the place and a Bootstrap carousel of the thumbnail images to each popup.
Figure 8: Map with locations as points added. Popup includes the place name and a carousel of thumbnail images of that place.
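On the page itself this work is done in JavaScript by d3’s CSV parser and Leaflet; the grouping step, one marker per place carrying all of that place’s thumbnails for the carousel, can be sketched in Python (the column names and file names are invented):

```python
import csv
import io
from collections import defaultdict

# Invented rows mirroring the enriched dataset
data = """Place Name,Latitude,Longitude,Thumbnail
Abdij Keizersberg,50.8930,4.6983,keizersberg_1.jpg
Abdij Keizersberg,50.8930,4.6983,keizersberg_2.jpg
Groot Begijnhof,50.8687,4.6961,begijnhof_1.jpg
"""

# One map feature per place, collecting every thumbnail for its popup carousel
features = defaultdict(lambda: {"coords": None, "thumbnails": []})
for row in csv.DictReader(io.StringIO(data)):
    f = features[row["Place Name"]]
    f["coords"] = (float(row["Latitude"]), float(row["Longitude"]))
    f["thumbnails"].append(row["Thumbnail"])

for name, f in features.items():
    print(name, f["coords"], len(f["thumbnails"]), "thumbnail(s)")
```

Because all records of a place were given identical coordinates during enrichment, the grouping needs no fuzzy matching: the Place Name alone is a reliable key.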
For some final touches, the Leaflet Marker Cluster plug-in was used to cluster markers that were close to each other, so that the map wasn’t overwhelmed with markers. The Leaflet Extra Markers plug-in was used to change the colors and add icons and numbers to the markers, allowing the category of each place to be identified at a glance.
Figure 9: Map with the customized, clustered markers showing the layers that are available.
This post presents the publication of an open-access spectral dataset which, besides other areas of application, provides insights and opportunities for better fine-tuned digitisation results for documentary heritage.
The Imaging Lab of KU Leuven Libraries has a strong focus on the digitisation and imaging of documentary heritage. A crucial step in the digitisation process is the image creation: the transformation of the physical object into a bitmap, or array of pixels. During this process the colors of the photographed object have to be translated into particular colorimetric values at pixel level. The better that translation is fine-tuned, the more accurate the final result will be, and the more the viewer can be assured that the digital presentation matches the original object as it would appear under similar lighting conditions (=reliable data). To fine-tune the translation, a reference color target is used in combination with specific software to build a bespoke ICC profile (=translator). This reference color target consists of an array of colored patches whose colorimetric values are known.
… the color patches on the applied color targets do not always represent all the tints, tones, shades and hues typical of documentary heritage. So what is the problem … ?
… the digitisation of historical documentary heritage is not the same as wedding photography, product photography or even digitising the colorful artwork of Kandinsky. Most of these materials, especially the historical ones, have a limited gamut (range of colors) and contain, in particular, less pronounced colors.
… and yet the colors selected for most reference color targets are quite the opposite: they have a wide gamut (well spread across the theoretical color space) and they include the more pronounced colors.
Consequently, when these reference color targets are used to calibrate colors, they do not necessarily calibrate the colors photographers encounter when digitising documentary heritage. And that is unfortunate, as the whole effort of profiling the colors for specific lighting conditions to obtain a faithful digital representation documenting the original might be in jeopardy! Or, less dramatically: we are aware this is a challenge, and the job can be done better!
How to profile colors
To further explore and understand the issue, let’s first take a step back: how is color profiled? Cameras (ranging from professional models to smartphones) have embedded software or algorithms which interpret the incoming reflected light and, for each pixel, ‘translate’ it into a colorimetric value. These algorithms are predetermined, to ensure the images look good. In fact, the standard algorithms will accentuate and shift some colors, focusing on pleasing color as opposed to accurate color. Color scientists know that, the sales managers of camera manufacturers know that, and such processed images are appreciated by the customers buying those cameras. Thus, a perfect world!
When the digitisation process aims at creating a digital surrogate that resembles the original (including its color) as closely as possible, this strategy falls short. A solution is to capture the object in a raw format and process the data through dedicated raw-processing software (Capture One, Lightroom, Phocus, RawTherapee, …). In such software the color profile that comes with the camera can be disabled and replaced by a custom, tailor-made profile.
Why custom-made? Well, the appearance of the surface, and thus its color, changes through the reflection of light when the lighting conditions change, for instance a change in the position of the light, the color temperature, … In a digitisation studio all those parameters can be controlled and need to be kept the same throughout a digitisation process. In such a controlled environment, it is possible to obtain close-to-perfect digital representations if the correct procedures are followed and adequate hardware and software are used.
To create an in situ color profile for the specific and standardized (lighting) conditions a reference color target is used. The most popular and widely used of these targets is the ColorChecker Digital SG. The basic idea of such manual color calibration is straightforward:
1. a standardized illumination set-up for imaging documentary heritage is established (=photo studio)
2. an image is taken of a reference color target whose colorimetric values are known (=calibration target)
3. based on the result obtained in step 2 under the illumination conditions of step 1, software can estimate how the colors in the image data should be interpreted in order to obtain a correct representation, disregarding the color translation of the camera’s own profile. As such, an in situ color profile is calculated. (=color profiling)
4. the color profile calculated in step 3 is applied to all images taken with the same illumination set-up as in step 1 (=color-calibrated digitisation).
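A real ICC profile is built with look-up tables in a device-independent color space, but the core of the profiling step can be illustrated with a simplified linear stand-in: fitting a correction that maps measured patch values onto their known reference values. All numbers below are invented for the illustration:

```python
import numpy as np

# Reference colorimetric values of the target patches (invented, linear RGB)
reference = np.array([[0.9, 0.1, 0.1],
                      [0.1, 0.9, 0.1],
                      [0.1, 0.1, 0.9],
                      [0.5, 0.5, 0.5]])

# Simulate what the camera measured for the same patches under studio
# lighting: the true response applies an unknown linear color shift
M_true = np.array([[0.80, 0.10, 0.00],
                   [0.05, 0.90, 0.05],
                   [0.00, 0.10, 0.85]])
measured = reference @ M_true.T

# "Profiling": solve measured @ C ≈ reference for a 3x3 correction matrix,
# the simplified analogue of building a profile from the target image
C, *_ = np.linalg.lstsq(measured, reference, rcond=None)
corrected = measured @ C

assert np.allclose(corrected, reference, atol=1e-6)
```

As in the workflow above, the fitted correction is only valid for the lighting conditions it was derived from; change the set-up and the fit must be redone.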
Central to this process is the reference color target of which an image is taken. This target consists of a number of different neutral and colored patches. For each patch the colorimetric values are known or measured, and stored in a text file. A target representing all the colors in the visual spectrum is impossible. To overcome this, manufacturers of reference color targets try to include a selection of colors more or less evenly spread across the visual spectrum. At the same time, they try to include a sufficient number of tones that frequently appear in photographs, such as skin tones, sky, green vegetation, …
When the photographer has set up the equipment (camera and lighting) (1.), the target is photographed and its reflection, or response, in relation to the specific conditions is registered by the camera (2.). At this moment, the response (measured energy) is uninterpreted! Color profiling software measures the response of the patches in the image and creates a table (LUT) linking these values to the reference values of the target (the text file). An ICC color profile translates the measured values into the reference values (3.). The ICC color profile is stored and the photographer applies it to all the images made under the same conditions (light position, …) (4.). If these conditions change (which, by the way, includes the position of the camera or the specific lens), a new color profile needs to be made.
Thus, with the help of a reference color target, software (in a camera or computer) can calculate how the registered (observed) energy captured by the camera’s sensor should be translated to reproduce the colors of a surface as they are in reality.
Profiling the colors that matter
One has to quantify the discrepancy between the colors on the reference color target and the colors of the documentary heritage we digitise on a day-to-day basis. This should provide the insight needed to understand whether this discrepancy (the gap) is a purely theoretical problem or a real-life issue. At the Imaging Lab of KU Leuven Libraries we decided to make that effort. In collaboration with our colleagues of Special Collections we defined which original historical materials are representative of library heritage and archive collections, and started measuring their spectral responses with a standard reflective spectrometer (Eye-One (i1) Pro Photo) (see below: ‘The Open Access Spectral Data’). In this way, insight has been obtained into the spectral responses and corresponding colors, with their attested tints, tones, shades and hues typical of documentary heritage.
When the spectral data is inspected, it is immediately apparent that the attested colors lie in the yellow, brown and slightly red regions of the color space (below, A & B). Secondly, when this cluster of measured colors is compared with the spread of patches on one of the most popular color calibration targets (the ColorChecker Digital SG), very interesting insights are revealed (below, C & D).
Visualizing and comparing the spectral data shows that the measured historical materials fall in a zone represented by very few color patches on the ColorChecker Digital SG (see also the video below for a more in-depth discussion by Don Williams). That means that this zone, which contains precisely the colors that are common in historical documentary heritage, will remain poorly profiled. The spectral dataset illustrates this very well.
A gap is identified between the colors that are commonly profiled and the colors that should be profiled.
The colors profiled by ‘standard’ calibration targets are not a perfect match for the colors that should be profiled: a gap can be observed. Consequently, even when the current digitisation standards for documentary heritage, such as Metamorfoze and FADGI, are followed, it is unclear how accurately a number of specific, frequently occurring heritage colors, the ones that matter most, are registered. As such, subtle variations and changes in their materiality (for example due to time, light exposure or conservation interventions), which in theory should be observable from the colorimetric values, can remain undetected or be poorly represented.
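The size of such a gap can be made concrete with a color-difference metric. A toy sketch using the simple CIE76 ΔE (practice often prefers CIEDE2000), with invented Lab values for widely spread target patches and for a yellow-brown heritage cluster:

```python
import numpy as np

def delta_e76(lab, labs):
    """CIE76 color difference: Euclidean distance in CIELAB."""
    return np.linalg.norm(np.asarray(lab, float) - np.asarray(labs, float), axis=-1)

# Invented Lab values: target patches spread widely across the color space,
# heritage colors clustered in the yellow-brown region (mid L*, small +a*, +b*)
target_patches  = np.array([[50, 60, -60], [50, -60, 60], [95, 0, 0], [20, 0, 0]])
heritage_colors = np.array([[60, 8, 25], [55, 12, 30], [65, 5, 20]])

# For each heritage color: distance to the nearest profiled patch.
# Large minimum distances mean the heritage colors sit far from any patch,
# i.e. the zone that matters most is poorly covered by the target.
gaps = np.array([delta_e76(c, target_patches).min() for c in heritage_colors])
print(gaps)
```

A target with extra ‘heritage colors’ would add patches inside that cluster and drive these minimum distances down.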
Further study and action are needed. The KU Leuven Libraries spectral data shows there is still room for improvement, and this needs to be explored further. The spectral dataset used for the above conclusions comprises 433 measurement points, selected on typical historical documents in a Belgian special collections library. In the broader international context, this exercise should be repeated to establish extended spectral insights into historical documentary heritage across cultural and material traditions.
Based on the above, it seems wise to populate color calibration targets with other or extra patches more closely related to the type of imaged materials. This is not new: for the DT Next Generation Target (v2) a similar exercise has already been carried out, leading to a calibration target with extra ‘heritage colors’. The KU Leuven data has also already been matched against the selection of color patches on the new FADGI ISO 19264 target (for a comparison, see the video above). These new targets will need to be tested, not only on their ability to calibrate standardised colors in general, but more importantly on their added value in color-profiling documentary heritage materials.
To facilitate and support future activities and research with spectral data of historical (documentary) heritage, this data should be made available for the broad community of heritage scientists.
THE OPEN ACCESS SPECTRAL DATA
The entire KU Leuven Libraries spectral dataset has been published online as open data. Together with the needed documentation, this gives everyone the opportunity to use the data for any future work in which a spectral characterisation of documentary heritage materials is wanted. The dataset is published on zenodo.org as Hameeuw Hendrik, Vandermeulen Bruno, Van Cutsem Frédéric, Smets An & Snijders Tjamke (2021): KU Leuven Libraries Open access Spectral data of historical paper, parchment/vellum, leather, inks and pigments (Version 1.0) [Data set]. http://doi.org/10.5281/zenodo.3965419. Feel free to work with the dataset. When you do, do not hesitate to reach out to us if you have any questions. If you have feedback and/or are interested in collaborating on the topic: email@example.com.
KU Leuven Libraries is creating data-level access to Belgian historical censuses. This blog post gives some context and a brief overview. To know more about the digitisation process, you can read our colleague André Davids’ article “Die Texterkennung als Herausforderung bei der Digitalisierung von Tabellen”, in O-Bib. Das Offene Bibliotheksjournal / Herausgeber VDB, 7(2), 1-13. https://doi.org/10.5282/o-bib/5584 (in German).
Censuses have been carried out for some 5,000 years for the purposes of tax collection and the military. Soon after the foundation of Belgium, however, the Belgian sociologist and statistician Adolphe Quetelet led the way to the first censuses which, from the start, were also intended for research. These 1846 censuses covered not only population but also agriculture and industry. They were highly acclaimed across Europe and followed by many subsequent censuses.
These Belgian censuses are a true treasure trove for socio-economic research. Their analysis, however, is very time-consuming due to their extent and format. Complex questions such as ‘What impacts salary levels more: the rise of industry or rather the location?’ are very difficult to answer when working with the originals.
Statistique de la Belgique: Industrie, recensement de 1880 (1887). The open pages reveal the complexity of the census’ content.
Converting the many volumes into digital format would open up a whole range of new possibilities, so KU Leuven Libraries Economics and Business’ project ‘Belgian Historical Censuses’ (website in Dutch) is currently digitising the physical volumes in order to create data-level access to the tables in spreadsheet format. Based on research needs, the primary focus is on the industrial censuses organised between 1846 and 1947.
The basis, of course, is transforming the physical copies into digital format. Depending on their physical state, the KU Leuven collection items are either scanned (modern materials in good condition) or photographically digitised (heritage and fragile materials), using presets and processing techniques, respectively, that improve the success rate of OCR.
Text recognition starts with layout analysis. ABBYY FineReader assigns categories to the various parts of the image files: image, text, table. Manual checks and adjustments ensure a correct interpretation of the table structure.
Layout analysis in ABBYY FineReader
Once the layout analysis is successfully completed, ABBYY FineReader executes the actual text recognition. The result is, again, manually checked.
Executing the actual text recognition
A searchable PDF file as well as an Excel spreadsheet can now be exported. To ensure the layout of the spreadsheets correctly reflects the tables, the exported files are intensively edited.
Exporting the digitised object as Excel still requires intensive editing of the spreadsheet.
OCR’ing numerical data is particularly challenging. Contrary to text, where a single incorrectly OCR’ed letter still allows the word to be interpreted correctly, any error at the level of a single figure (e.g. a 1 being read as a 7) has major consequences when working with the data.
The only possible solution is manually checking and adjusting the numerical data, a very intensive step in the quality assurance process, especially for the older censuses due to the font used in these documents. QA operators recalculate the totals of individual rows and columns and compare these with the totals included in the censuses. When the corresponding totals of the original and the digital object differ, the individual numbers in the row or column are corrected based on the original volume.
Checking and adjusting the numerical data through comparison of totals in the digitised object (Excel) and the original object.
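The totals check itself is mechanical and easy to automate as a first pass before manual correction. A minimal sketch, with an invented row structure:

```python
def check_row_totals(rows):
    """Flag OCR'ed census rows whose cells do not add up to the printed total.

    Each row is (label, [cell values...], printed_total). A mismatch usually
    means a digit was misread (e.g. a 1 OCR'ed as a 7) and needs correction
    against the original volume.
    """
    return [label for label, cells, total in rows if sum(cells) != total]

# Invented example rows from an OCR'ed table
rows = [
    ("Antwerpen", [120, 340, 55], 515),   # 120 + 340 + 55 = 515: consistent
    ("Brabant",   [210, 130, 80], 490),   # sums to 420, not 490: flag it
]
print(check_row_totals(rows))  # -> ['Brabant']
```

The check only localizes errors; deciding which cell in a flagged row is wrong still requires the human comparison against the original described above.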
Once a searchable PDF and a computationally ready Excel file have been created, structural metadata is generated. This metadata and both files are ingested into the library’s preservation environment Teneo. The metadata makes it possible to link the digital objects to the correct descriptive metadata in the library catalogue Alma and to present the objects to the public as open data. The data is now ready to be used!
Teneo viewer showing the machine-readable PDF (in the centre) and partial metadata (on the right). The computationally-ready spreadsheet (indicated on the left) may be downloaded too.
As a result of the Corona crisis, museums and other heritage institutions today offer little or no physical access, both in Belgium and abroad. This puts the consultation of objects and the study of our past under severe pressure. In part, we can fall back on digitised objects, notes and old publications, but these only represent part of the information, which means that important details can be overlooked. Fortunately, the sector, in collaboration with engineers, has devised solutions to remedy this.
In the heritage sector, the digitisation of objects has long been a focus of attention and experimentation. For the public, this usually results in an online photo that can be zoomed in on, or on which the contrast can be adjusted. These are purely colour images; standard digital photographs hold no extra information. However, different types of image scanners register many more characteristics of a surface than just the colour. Being able to visualize this information in a handy online tool therefore offers new possibilities for anyone working with heritage objects. Think, for example, of the KBR drawings by Pieter Bruegel the Elder that were recently examined by KU Leuven. The researchers were able to study the paper down to the fibre using their Portable Light Dome (PLD) scanner. They also got a much better view of the extensive range of techniques used by the old master.
Software is the key
Over the past 15 years, KU Leuven researchers, together with various partners from the heritage sector, have developed digital techniques that can visualise objects to an unprecedented level of detail: the PLD scanner. “With this method, we illuminate an object from a large number of angles and take photos of it, the so-called ‘single-camera, multi-light recording’,” says Hendrik Hameeuw, co-coordinator of the project at KU Leuven. “The way in which this recording is subsequently processed determines which characteristics of the surface, such as relief or texture, the software can show and thus how the user experiences the object.”
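The Leuven processing pipeline is more sophisticated, but the principle behind single-camera, multi-light recording can be illustrated with classic Lambertian photometric stereo: when the light directions are known, the surface orientation at each pixel follows from a least-squares fit over the photos. A sketch for a single pixel with synthetic data (not the project’s actual algorithm):

```python
import numpy as np

# Known light directions (unit vectors), one per photo in the recording
L = np.array([[0.0, 0.0, 1.0],
              [0.7, 0.0, 0.714],
              [0.0, 0.7, 0.714]])
L /= np.linalg.norm(L, axis=1, keepdims=True)

# Synthetic Lambertian intensities at one pixel: I = albedo * (L @ normal)
true_normal = np.array([0.3, 0.1, 0.95])
true_normal /= np.linalg.norm(true_normal)
albedo = 0.8
I = albedo * (L @ true_normal)

# Recover the scaled normal by least squares, then split off albedo
g, *_ = np.linalg.lstsq(L, I, rcond=None)
recovered_albedo = float(np.linalg.norm(g))
recovered_normal = g / recovered_albedo

assert np.allclose(recovered_normal, true_normal, atol=1e-6)
```

With the per-pixel normals in hand, relief can be mapped, relighting simulated from arbitrary directions, and a 3D impression of the surface built up, which is exactly the kind of interaction such viewers expose.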
New universal file format
“To be entirely complete, we actually have to look at the file types of these interactive datasets,” says Hameeuw. Most heritage institutions calculate and store these types of images of their heritage in a specific image format, usually RTI/HSH. The software developed in Leuven works with PLD files (ZUN, CUN) that have extra functionalities compared to those RTI/HSH files. Pixel+ now makes this way of calculating available to the whole world, not only by offering it online, but also by introducing a new kind of container file for it: glTF. “Compare it with an ordinary photo on your computer. It will probably be a JPEG or GIF file. But if you want to work with it in Photoshop, the program will turn the same image into a PSD file.” These glTF files are compatible with both the Leuven PLD and the RTI/HSH files. “With this we offer a new universal standard for this kind of image, and we also open them up immediately via the online Pixel+ viewer, a kind of free Photoshop for ‘single-camera, multi-light recording’ images.” This allows both RTI/HSH and PLD files to be studied and compared within the same program for the first time.
A new world
Pixel+ extracts a lot of extra information from the available data. Objects such as old coins, miniatures or paintings suddenly acquire extra dimensions after hundreds of years, which researchers can use to gain new insights into these objects. Especially in the field of 3D (geometry) and the correct understanding of the reflections of light on an object, the Leuven software takes major steps forward.
“The technology is interesting for many objects, from clay tablets and coins to paintings or medieval manuscripts,” explains Hameeuw. “The software allows, among other things, the objects to be viewed virtually under different angles of incident light, the relief to be mapped at pixel level, or a 3D visualisation to be generated.” Frédéric Lemmers of the KBR Digitisation Department adds: “By combining it with multi-spectral imaging, researchers recently discovered that the heads of some figures in KBR’s 13th-century Rijmbijbel were painted over at a later date.” At the Art & History Museum, the technology was used to make heavily weathered texts on almost 4,000-year-old Egyptian figurines readable again.
Institutions from all over the world, from the Metropolitan Museum of Art in New York (USA) to the Regionaal Archeologisch Museum a/d Schelde in Avelgem (Belgium), will be able to upload, consult and study their own datasets or files in pixel+. The software converts the information according to various new standards and allows users to access the virtual heritage objects interactively. “This development really is a milestone for the heritage sector”, emphasises Chris Vastenhoud, promoter of the project from the Art & History Museum. “A whole new world will open up for heritage institutions worldwide. They will be able to document and share a lot of additional information in order to communicate about the objects in their collections”.
At the beginning of April 2020, the Pixel+ project staff presented their results during the SPIE conference, held online as a result of Corona. The accompanying paper was published:
Vincent Vanweddingen, Hendrik Hameeuw, Bruno Vandermeulen, Chris Vastenhoud, Lieve Watteeuw, Frédéric Lemmers, Athena Van der Perre, Paul Konijn, Luc Van Gool, Marc Proesmans 2020: Pixel+: integrating and standardizing of various interactive pixel-based imagery, in: Peter Schelkens, Tomasz Kozacki (eds.) Optics, Photonics and Digital Technologies for Imaging Applications VI, Proc. of SPIE Vol. 11353, 113530G. (DOI: 10.1117/12.2555685)
One of the current hot topics in the GLAM (Galleries, Libraries, Archives, Museums) sector is that of presenting collections as data for use and reuse by as diverse a public as possible. Having adopted an open data policy and working towards FAIR data, a previous blog post described our implementation of images as open data. Digitisation does not only create images, however, so we have started down the exciting road of disclosing other data too. During 2019, we were delighted that Mariana Ziku, at the time an Advanced MSc in Digital Humanities student at KU Leuven, took up an internship at our Digitisation Department and set out to investigate how to start navigating this road. Here is her story!
Hi there! I’m Mariana, an art historian and curator. During Spring 2019, I had the opportunity to join the Digitisation Department at KU Leuven Libraries, as the department set the scene for extending its open digital strategy. The goal: investigating new ways for sharing and using the libraries’ digitised collections as data. A Digital Humanities (DH) traineeship and research position was opened under the title “Open Access Publishing in Digital Cultural Heritage” to examine the data aspect of heritage collections in the context of open access.
This blog post gives a brief insight into the research, aspirations and practical outcomes of this internship, which resulted in the pilot study ‘Open access documentary heritage: the development of git and machine actionable data for digitised heritage collections’.
Figure 1: The pilot stack for creating machine-actionable documentary heritage data for KU Leuven Libraries’ Portrait Collection. CC-BY 4.0. Vector icons and graphics used for the KU Leuven Pilot Stack infographic: at flaticon.com by Freepik and at Freepik by Katemangostar.
Open-access development in GLAMs
First on the agenda: delving into KU Leuven Libraries’ digital ecosystem. This included looking into the library’s digitisation workflows, imaging techniques and R&D projects (like this one), as well as discovering the elaborate back-end architecture and metadata binding it all together. At that time, two anticipated open-access projects were going public: the ‘Digital Heritage Online’ browsing environment for the library’s digitised heritage collections, and the enhanced image viewer, offering a more functional user interface with clear communication of the licensing terms and with full object- and file-level download functions for the public-domain collections.
With this view of KU Leuven Libraries’ open digital strategy in mind, we explored the current open-access publishing landscape among other GLAM institutions. We got an indicative overview through McCarthy and Wallace’s actively running project “Survey of GLAM open access policy and practice”. The survey is an informal and open Google spreadsheet, allowing GLAM institutions to list themselves and index their openness spectrum. By running a Tableau data analysis and visualisation on selected survey facets, we outlined the open-access digital ecosystem of approximately 600 GLAM institutions. This revealed, among other things, that museums are the most active cultural institution type in the field of open access, followed by libraries. The countries with the most open institutions are Germany, the U.S.A. and Sweden.
Figure 2: A data visualisation board of instances in the Survey of GLAM open access policy and practice, an ongoing project by McCarthy and Wallace indexing the openness spectrum of GLAM institutions for enabling data re-use of their collections. (Data captured April 2019. The list has grown since then.)
The survey provides valuable insight into the instances of open data that each GLAM institution makes available online. The data is currently organised in 17 columns, including links to the Wikidata and GitHub accounts of the listed GLAMs. Although the majority of indexed institutions had a Wikidata account, the number of GLAMs with an active GitHub account was low (approximately 50 out of 600 institutions). Nevertheless, looking for good practices in open-access publishing, we started to explore existing GLAM activity on GitHub, examining, among others, the data repositories of prominent institutions that have long been active on GitHub (e.g. MoMA, MET, NYPL), wiki-documented projects (e.g. AMNH) and various curated lists. GitHub seemed a time- and cost-effective way to provide access to GLAM open data, which convinced us to further explore the potential of collaborative Git platforms such as GitHub for digital cultural heritage.
The question for the internship pilot was set: how can Git be used within the framework of GLAM digital practices and could it become part of a strategy for creating open data-level access to digitised GLAM collections?
Towards the computational transformation of documentary heritage as open data: a pilot study for KU Leuven Libraries
Documentary heritage metadata created and curated within libraries is considered to be well structured and consistent. In many ways this is indeed the case, thanks to the use of internationally recognised metadata standards and the creation of description formats for specific material types. Even so, this metadata as-is poses many challenges for computational use, because libraries primarily create metadata for human-readable cataloguing purposes rather than for digital research. In addition, library metadata is created over a long period of time by many people who make different interpretational choices and, occasionally, errors. Even the metadata formats themselves may change over time, creating further inconsistencies. As a result, the data may become “dirty” on top of its structural challenges for computational use.
As digital scholarship becomes increasingly prevalent, the development of new systematic and reliable approaches to transform GLAM data into machine-actionable data is critical. In this context, the concept of Git can be helpful. Git is an open-source version control system that maintains a complete, fully functional history of changes to uploaded data or code. It also enables simultaneous collaborative work, for example by keeping a trustworthy record of issues and of contributions from a wider community of users. However, data-level access to CSV and other spreadsheet files through Git is not quite there yet for the wider user community: Git platforms like GitHub do not (yet) support collaborative, hands-on work on datasets contained in CSV files with the option to track changes in real time. Nevertheless, GitHub can be useful for publishing open GLAM data, as it can foster engagement, help build a knowledgeable community and generate feedback.
The pilot set out to provide a test case by publishing the digitised Portrait Collection as open, computational-fluent data in a so-called ‘optimal’ dataset. Approximately 10.000 graphic prints from KU Leuven Libraries’ Special Collections had been digitised several years earlier and provided with descriptive and structural metadata. Now, the dataset would be prepared to function as an open-access documentary heritage resource, specifically designed for digital humanities research and made available via a new KU Leuven Libraries GitHub repository.
Open GLAM data: good practices, frameworks and guidelines
As we were looking into data-specific open access for documentary heritage, the pilot study first investigated practices, standards and technical deployments that reframe cultural heritage (CH) as open data. This included the analysis of standardised policy frameworks for creating GLAM data repositories on the one hand and research infrastructures supporting CH data-driven scholarship on the other.
We investigated good practices for creating trustworthy GLAM data repositories and digital resources that move towards open access, improved discoverability and active reusability. Interestingly, these were based on, amongst others, the OpenGLAM principles of the Open Knowledge Foundation (OKF), an early advocate of open access in the GLAM sector, which provided working concepts of machine-readable data and data discoverability. In addition, the publications of the “Collections as Data” project (Thomas Padilla et al., May 2019) were particularly helpful for gaining a better understanding of the challenges that GLAM institutions encounter in providing resources for digital scholarship and in developing the necessary digital expertise.
A minimal-computing approach to developing git and machine-actionable data
Turning to the dataset preparation for publication, it was essential to develop both a process that required only minimal programming knowledge and a step-by-step guide for GLAM data processing that would be as accessible as possible within a humanities context. That way, over time more machine-actionable datasets from the library’s heritage collections could be developed as part of the general working processes of the library.
The Metadata Services Department took on the metadata export from the metadata repository. Using MarcEdit and FileMaker, both tools that are already part of the library’s general workflows, the initial MARCXML was transformed into a CSV format.
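For readers without access to MarcEdit or FileMaker, the same kind of export can be sketched in plain Python. This is an illustrative sketch rather than the library’s actual workflow; the field/subfield selection passed in is hypothetical:

```python
import csv
import xml.etree.ElementTree as ET

# MARCXML records live in the MARC 21 slim namespace
MARC_NS = "{http://www.loc.gov/MARC21/slim}"

def marcxml_to_rows(path, fields):
    """Extract selected MARC datafields from a MARCXML file.

    `fields` maps a CSV column name to a (tag, subfield code) pair,
    e.g. {"Main title": ("245", "a")}.
    """
    rows = []
    for record in ET.parse(path).iter(MARC_NS + "record"):
        row = {}
        for column, (tag, code) in fields.items():
            cell = record.find(
                MARC_NS + "datafield[@tag='" + tag + "']/" +
                MARC_NS + "subfield[@code='" + code + "']")
            row[column] = cell.text if cell is not None else ""
        rows.append(row)
    return rows

def rows_to_csv(rows, out_path):
    """Write the extracted rows to a CSV file."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
```

A tool-based workflow remains preferable in a minimal-computing context; the sketch only shows that the transformation itself is mechanical.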
Next, we needed to gain insight into the content of the dataset through simple data analysis. We used Jupyter Notebooks hosted in Colaboratory to ensure real-time shared use of the findings and to allow everyone to look at and interact with the data analysis outcome. Analysing the CSV file with short Python scripts revealed basic information on the data quality inside the columns. Furthermore, columns were identified as containing categorical, qualitative or numerical data, as columns with unique values, or as almost empty columns to be discarded. Using Colaboratory requires knowledge of the Python programming language, however. The search for a minimal-computing approach to GLAM data processing therefore meant looking for tools that are easier to use.
Figure 3: Click on the link above to go to Google Colab and see the notebook with the first set of data explorations of the initial CSV file (before data processing for its computational transformation). The short Python scripts can be copied and easily adjusted for the exploration of other datasets.
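The kind of first-pass column exploration performed in the notebook can be sketched with the standard library alone. The column names below are hypothetical, and the thresholds for flagging “almost empty” and facet-like columns are arbitrary choices, not values from the pilot:

```python
import csv
from io import StringIO

def profile_columns(rows, empty_threshold=0.05, facet_threshold=50):
    """First-pass profile of a collection CSV: per column, count
    filled cells and distinct values, and flag columns that are
    almost empty or facet-like (few distinct values).
    `rows` is a list of dicts, as produced by csv.DictReader."""
    total = len(rows)
    profile = {}
    for column in rows[0]:
        values = [r[column] for r in rows if r[column]]
        distinct = set(values)
        profile[column] = {
            "filled": len(values),
            "distinct": len(distinct),
            "almost_empty": len(values) / total < empty_threshold,
            "facet_like": 0 < len(distinct) <= facet_threshold,
        }
    return profile

# A tiny illustrative CSV (column names are made up)
sample = StringIO(
    "Main title,Description of graphic material,Notes\n"
    "Portrait A,Black-and-white,\n"
    "Portrait B,Black-and-white,\n"
    "Portrait C,Colour,\n"
)
print(profile_columns(list(csv.DictReader(sample))))
```

The same questions the notebook answers (which columns to discard, which to facet) fall straight out of such a profile.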
Following the advice of Packed (the Belgian center of expertise in digital heritage, now part of Meemoo), we selected OpenRefine, a free, open-source stand-alone desktop application for data processing. OpenRefine can be used without advanced programming skills, although it offers elaborate capabilities as well. Basic data processes can be performed through “push-button” actions and simple text filters written in the General Refine Expression Language (GREL).
The three principal stages of data processing could be executed in OpenRefine: data cleaning, data transforming and data refining. The result was an optimal CSV file containing machine-actionable data. This data could now be used for data analysis and visualisation with minimal friction, in the context of data-driven scholarship as well as for other creative computational reuses.
Let’s take a closer look at what data processing actually does. With a clearer view of the dataset and a minimal-computing tool in hand, we could now start processing the data. Transforming the CSV file into machine-actionable data required over 500 individual data processing tasks in OpenRefine. Below follows a single example for each data processing stage: cleaning, transforming and refining.
Data cleaning is primarily based on faceting and clustering. Faceting is a useful feature of OpenRefine that allows browsing of columns and rows. Within these facets the data can be filtered into subsets (‘categories’), which can in turn be changed in bulk.
Figure 4: Screenshot of the OpenRefine tool showing a list of the 32 categories for the facet ‘Description of graphic material’. One of these categories is used in 10.323 out of the total number of 10.512 records.
Inspecting the column “Description of graphic material”, as shown in the image above, reveals a total of 32 different data categories into which all 10.512 records (i.e. the complete Portrait Collection) are sorted. The number of categories can be reduced because some categories arise from inconsistencies in data documentation, like the dot behind ‘$oBlack-and-white’ in one of the categories highlighted in green in the example below.
Figure 5: Screenshot of OpenRefine showing two inconsistencies in data documentation. Highlighted in green is an unintended category duplication due to a faulty value (a dot added to the value ‘Black-and-white’). Highlighted in yellow is a seemingly faulty category that, upon inspection of the image file described in this record, turned out to be correct.
However, category integration is a critical process that requires detailed inspection and consideration of data nuances that may otherwise be lost. In the image above, for example, a category with a double entry of “Black and White” is not a mistake. It merely implies that the image contains two portraits, both of which are black and white.
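Conceptually, a text facet is little more than a frequency count over a column’s cells. A minimal Python sketch, with hypothetical cell values mirroring the trailing-dot inconsistency discussed above:

```python
from collections import Counter

def text_facet(values):
    """Count identical non-empty cell values, most frequent first:
    the essence of an OpenRefine text facet."""
    return Counter(v for v in values if v).most_common()

# Hypothetical cells: the trailing dot creates a spurious category
cells = ["$oBlack-and-white", "$oBlack-and-white.",
         "$oBlack-and-white", "$oColour"]
print(text_facet(cells))
```

The spurious third category shows up immediately, which is exactly how faceting surfaces candidates for category integration.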
The image below shows another example: a column with 248 different categories that need to be reduced in number in order to become computationally amenable. The metadata coding for this content type (Typographical terms) is not based on the MARC 21 standard, but refers to a local coding system used in conjunction with MARC 21. Here too, the number of different categories can be reduced with category integration, in combination with splitting, which is explained below.
Figure 6: On the left, the initial column of the unprocessed CSV file containing approx. 250 categories of typographical terms, which made the column not fit for computational use. On the right, the outcome of data cleaning in the new CSV file: 20 major categories of typographical terms could be identified encompassing all the Portrait Collection’s artworks (Typographical terms 1) and a second column was created (Typographical terms 2) in order to include more specific information which complements the first column.
Another critical issue with data not fit for computational use is inconsistency arising from different spellings and name variants of the same entity. The clustering feature of OpenRefine groups values that look similar, with the option to choose between different levels of sensitivity. The image below shows how data entries referring to cities were clustered in order to surface patterns that represent the same thing but are documented in different ways.
Figure 7: Performing clustering in OpenRefine inside the column “Place of manufacture”: an automatic identification and grouping of various spelling forms of the cities where the graphic prints were created.
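OpenRefine’s default clustering method, fingerprint key collision, can be approximated in a few lines of Python (the real implementation also normalises diacritics and special characters); the place-name variants below are made up:

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Rough approximation of OpenRefine's 'fingerprint' key:
    lowercase, strip punctuation, sort the unique tokens."""
    cleaned = value.lower().translate(
        str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))

def cluster(values):
    """Group spelling variants that share the same fingerprint;
    only keys with more than one variant are returned."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return {k: sorted(vs) for k, vs in groups.items() if len(vs) > 1}

# Hypothetical place-of-manufacture variants
places = ["Antwerpen", "antwerpen.", "Leuven", "Leuven "]
print(cluster(places))
```

As in OpenRefine, the grouping only proposes merges; deciding which variant becomes the canonical value remains a human judgement.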
Going a step further with data processing, an extended transformation of many columns is recommended to further refine the repository and make the data more amenable to computational use. Typically, improvements include splitting columns, renaming entities and replacing strings.
Splitting columns can be useful when a single column holds two or more variables, as in Figure 6 above. The data becomes computationally responsive when the variables are separated into two or more columns.
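As a sketch, such a split might look as follows in Python; the semicolon separator and the example terms are assumptions, since the actual delimiter depends on the local coding system:

```python
def split_column(cell, sep=";"):
    """Split a cell holding up to two variables into a
    (primary, secondary) pair; the secondary part may be empty."""
    parts = [p.strip() for p in cell.split(sep, 1)]
    return (parts[0], parts[1] if len(parts) > 1 else "")

# Hypothetical 'Typographical terms' cells
column = ["Engraving; Etching", "Woodcut"]
print([split_column(c) for c in column])
```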
Renaming entities can likewise make the data more computationally usable. Often this involves replacing strings, for example when a single column contains URLs in various formats that direct to different webpages. In the following example, the column “link to digital object 1” contains differently formatted URLs that display an image preview of each portrait. The URLs are mainly “delivery links” that redirect to an image display within a viewer environment: http://depot.lias.be/delivery/DeliveryManagerServlet?dps_pid=IE4849156. However, delivery links are not fit for image harvesting, as they point to a viewer and not to a file. In order to enable image harvesting, we transformed the link into a direct image display of the file in thumbnail form: http://resolver.libis.be/IE4849156/thumbnail. This way, a quick image preview is available, while the embedded thumbnail preview will be useful for more elaborate data visualisations.
Figure 8: First task from a set of successive data transformations, performed for creating a unified URL format for the image-display link. The resulting link directs to a thumbnail image preview, which can be integrated more easily into data visualisations. Data transformation was performed through a GREL expression.
In many cases, as in the above data transformation instance, successive data transformations might be needed in order to make a single data entry consistent and functional.
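For illustration, the link rewriting performed by the GREL expression can be sketched in Python with a regular expression, built around the example URLs above:

```python
import re

# The IE identifier pattern is taken from the example delivery link
DELIVERY = re.compile(r"dps_pid=(IE\d+)")

def to_thumbnail(url):
    """Rewrite a viewer 'delivery link' into a direct thumbnail URL;
    URLs in other formats are returned unchanged."""
    m = DELIVERY.search(url)
    if m:
        return "http://resolver.libis.be/" + m.group(1) + "/thumbnail"
    return url

print(to_thumbnail(
    "http://depot.lias.be/delivery/DeliveryManagerServlet?dps_pid=IE4849156"))
```

Returning unmatched URLs unchanged keeps the transformation safe to run over the whole column, after which any remaining non-standard links can be inspected separately.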
Data refining starts with fixing encoding issues in strings containing diacritic characters. For example, the diacritic character “é” is not properly represented; instead, an arbitrary character chain appears (“eÌ”). In order to fix this across all columns, we performed a transformation through the “All” drop-down menu with a GREL expression. OpenRefine subsequently shows the number of transformations automatically applied to each column.
Figure 9: Refining data by reinterpreting diacritic characters throughout all columns with a GREL expression. The top right image shows the high number of character modifications within column ‘Main title’, with 427 cells containing the diacritic character at least once.
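The garbling described above looks like classic mojibake: UTF-8 bytes that were at some point decoded as Latin-1. Rather than replacing each damaged character chain individually, the mis-decoding can often be reversed wholesale. A sketch, assuming that particular encoding mix-up (which should of course be verified against the actual data):

```python
import unicodedata

def fix_mojibake(text):
    """Attempt to undo UTF-8 text mis-decoded as Latin-1
    (e.g. a decomposed 'é' showing up as 'eÌ' plus an invisible
    byte). If the round trip fails, return the text unchanged."""
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
    # Collapse combining marks into precomposed characters
    return unicodedata.normalize("NFC", repaired)
```

Because already-correct strings typically fail the round trip, they pass through untouched, which makes the function safe to apply across all columns.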
To further support the use of GitHub as an open GLAM repository, a number of useful applications and features were integrated. These include the extension for the open research repository Zenodo as well as the “All Contributors” bot extension.
The integration of Zenodo makes the data repository citable through the minting of a Digital Object Identifier (DOI). Zenodo’s support of DOI versioning makes it possible to cite a specific dataset and each of its successive versions. Zenodo hooks into GitHub, allowing us to choose a suitable license and to set a number of other useful aspects of the dataset record (e.g. export formats, citation style, recorded versions).
“All Contributors” is a framework developed within GitHub for attributing, through an “emoji key”, everyone who engages with the dataset. Contributors include not only people who work on code but also everyone who asks and answers questions, designs, provides examples, translates, etc. The bot is configured accordingly and can also be used to open up data modelling inquiries to the broader community, enabling attribution within the library’s GitHub page to literally everyone with a GitHub account who engages with the dataset.
Figure 11: The “All Contributors” extension integrated in the GitHub GLAM account can acknowledge everyone engaging with the related data (not just coders). A set of emojis (shown below the avatar pictures) represents specific contribution types (e.g. reviewing, translating, asking questions, sharing ideas).
The pilot served as a test case to develop a process for preparing more machine-actionable datasets through minimal-computing processes at KU Leuven Libraries. It also set up the GitHub profile of KU Leuven Libraries, initiating Git by applying the Zenodo DOI and “All Contributors” extensions, creating the readme files and uploading the machine-actionable dataset to its Git repository, while committing and proposing file changes using Git. The data preparation process has been documented on a step-by-step basis, in order to create a blueprint for data processing of documentary heritage intended for computational use and to offer critical insights, identify omissions and missing information, and flag process steps that could be improved in the future.
The pilot was KU Leuven Libraries’ first step towards creating a Git repository of open, ready-to-analyse datasets of documentary heritage from the library’s collections, openly available to (amongst others) researchers and students in (digital) humanities looking to use computational-fluent open data that displays minimal discrepancies.
The pilot study for developing git and open documentary heritage data for computational use was conducted in the Digitisation Department of KU Leuven Libraries by Mariana Ziku as part of her thesis in the Advanced MSc in Digital Humanities, under the supervision of Nele Gabriels, with the guidance of Bruno Vandermeulen, training by Diederik Lanoye, and the advice and good company of Hendrik Hameeuw and Mark Verbrugge. Thanks also to Bart Magnus and Sam Donvil from Packed (now Meemoo) for sharing their expertise on digital cultural heritage.
Here’s a piece of exciting news: KU Leuven Libraries has adopted an open data policy for its digitised collections! This means that nearly 42.000 digital representations of originals from the library’s public domain holdings may be freely shared and (re-)used by all. And of course this number continues to grow.
How it works
The IE viewer displays copyright status information and links to the general terms and conditions.
The information pane of the IIIF-based Mirador viewer displays the same copyright status information and link to the general terms and conditions as the IE viewer.
While we strive to keep our collections as open as possible, some items are available under a license, e.g. when the public domain originals are not part of KU Leuven Libraries’ holdings or when permission was given for copyrighted materials to be digitized and made available to the public under specific conditions. Visitors can consult the licensing conditions via the viewer.
When images are made available under a licence, the viewers (here: Mirador) link to specific conditions for use for the digital object in view.
Having checked the status of a digital object and the conditions for use, online visitors can use the direct download possibilities included in the Teneo viewer. These offer single-file or full-object download.
Images may be downloaded as single files or as full object in the IE viewer.
The downloaded images are jpg or jp2 and allow perfect readability of even the smallest written text. Visitors are now ready to become active users of the digitised collection!
How we got there
Easy as it may seem, implementing an open data policy required significant effort on various levels to ensure that online visitors would be clearly informed about the legal status of both the physical originals and the digital representations, and about what this means for their use. KU Leuven Libraries currently presents its digitised collections either in the Rosetta IE viewer (with an ‘Intellectual Entity’ generally equalling an individual collection item) or, for bespoke collections, in a IIIF-based Mirador viewer. Systems and processes had to be adjusted to show this information on a single-item level in these viewing environments.
First, data can only be freely used and shared if it can be both accessed and acquired. To this end, easy-to-use download functions were first created within the IE viewer. This viewer now offers both single-file and full-object download (see the images above). Mirador too will include download options before long.
Second, covering the bases, our team collected the rights information from legacy project descriptions and agreements, both internal to KU Leuven Libraries and (for digital objects based on originals held by other institutions) with external partners. Unclear phrasing was clarified and the images resulting from digitisation projects were assigned one of three possible legal statuses: public domain/open data, available under a license, or copyrighted.
Third, for each of these three statuses, terms & conditions for use of the digital objects were designed in close collaboration with KU Leuven’s Legal Department. Furthermore, an overview page was created detailing, for each of the digitisation projects, the licensing conditions for those digitized items that are available under a license.
Fourth, a copyright status was assigned to each of the individual digitised objects, more specifically to both the physical (public domain or in copyright) and digital (content as open data, under a license or in copyright) objects. For the originals, the descriptive metadata model was modified to include the copyright status; for the digital objects, status information would become part of the metadata in the information package presented for ingest into the preservation environment.
Terms and conditions for the use of images as open data.
Fifth and finally, we turned to metadata visualisation in the viewer environments. The metadata shown in both the IE and the Mirador viewer comes neither from the public search environment Limo nor from the metadata repository Alma, but from the digital asset preservation system, Rosetta; hence the choice to include the copyright status of the digital representations in the ingest information packages. By nature, the metadata (just like any data) in the preservation system is unchangeable. Including the extra information for those digital objects already in Rosetta was not something to be done lightly, but implementing an open data policy justified the decision.
Rather than adding copyright statuses to the existing metadata, we decided to create a standard mapping between Alma and Rosetta and replace the existing descriptive metadata in Rosetta. That way, two other issues could be addressed: the inconsistent and the static nature of the descriptive information shown in the digital representation viewers. The inconsistency was a legacy of the early digitisation projects at KU Leuven Libraries (with some projects generating extensive descriptions and others hardly any) while the static nature of the metadata is inherent to its extraction from the preservation environment.
This chart shows the four main elements of our architecture and the flow of data between them. While the viewer (both IE and Mirador) is accessed via a link in Limo, the actual images and metadata shown are retrieved from Rosetta.
The new standard mapping between Alma and Rosetta provides a direct answer to the first issue. The viewers display a uniform metadata set consisting of title, material type, genre, location of the original item and copyright status information. A link to the full descriptive record in Limo gives users access to the most up-to-date information about the digital item on view. And of course both viewers’ metadata panes display all the object-level information required to implement an open data policy. Together with the download functions, this enables KU Leuven Libraries to offer its digitised collections as open data.
The road to ‘open’
Meanwhile, we invite everyone to visit the collections at Digital Heritage Online (read all about DHO in a previous blog post) and to actively use, reuse and share our digitised collections!
An interview in the KU Leuven university journal Campuskrant with prof. Lieve Watteeuw highlights at length the KU Leuven contributions to the FINGERPRINT project. The Campuskrant article focuses on the work with the original drawings by Pieter Bruegel the Elder and on how the imaging efforts of the FINGERPRINT team have made the difference in better understanding the virtuosity of Bruegel the artist. These results will also be presented at the KBR exhibition The World of Bruegel in Black and White.
The Imaging Lab was responsible for the digitisation of all drawings and prints and also carried out the advanced imaging enabling in-depth visualisations.
KU Leuven Libraries presents a new platform: Digital Heritage Online. It gathers all digitized heritage objects from its collections, with objects dating from the 9th up to the 20th century, in one viewing interface. The platform enables users to browse these digitized objects in an open and visually appealing way. It also provides a search environment within the Digital Heritage Online collection.
The user may browse the collections and items based on content theme, material type, or location of the physical object. A fourth entry shows which objects were made digitally available during the previous three months.
Start window of Digital Heritage Online platform with four entries to discover the database.
By clicking on an object thumbnail within the collection, the user can consult the extended bibliographic information in the library catalogue and view the full object online.
Catalogue item description with ‘Teneo’ link to view the full object and detailed bibliographic information.
Additionally, the extended object description names all collections to which the object belongs and shows other related items. This allows for contextualized browsing of the digitized collection in the library catalogue environment.
Catalogue item description with ‘Collection path’ showing the collections to which a single item belongs as well as thumbnails of related collection items, two features supporting contextualised browsing.
One can also execute specific searches within the Digital Heritage Online collection, excluding non-digitized collection items in the search results. Searches may be performed either in Digital Heritage Online as a whole or within its subcollections. Searches via the blue search button will include items in collections outside of the Digital Heritage Online platform.
Search results for ‘Leuven’ in the Digital Heritage Online collection, employing the search box on the left.
Digital Heritage Online may be accessed either directly via this link or through the homepage of KU Leuven Libraries’ integrated search interface and catalogue: Limo, by clicking on ‘Curated Collections’.
Limo homepage view with access to Digital Heritage Online as one of the ‘curated collections’.
Over the past ten years, KU Leuven Libraries has worked intensively on digitizing its documentary heritage. At the time of writing this blog post, the digitized collection held almost 88.000 objects, and it is continuously expanding week by week.
Since 2016, we have focused our attention on opening up our digitized collection in a clearer and more user-friendly way. While we keep in mind the FAIR principles as a long-term goal, the most pressing needs were increasing the findability and accessibility of the collection. Parts of the collection were already available through aggregator platforms such as Europeana and Flandrica or KU Leuven platforms such as Magister Dixit and Lovaniensia. A selection of heritage objects figures in virtual exhibitions on our EXPO site. And whenever copyright and intellectual property rights allow consultation and use, digitized collection items can be consulted through Limo, the library’s integrated search interface mentioned above, where users can search the full library catalogue and view items that have a digital representation online.
The Digital Heritage Online platform now provides a clear access point and a search environment for our digitized heritage. It enables a unified view on all digitized content and a true visual browsing environment. It is, naturally, only a first step into opening up the digitized collection. Opening up the data itself (images, metadata and content) for use and reuse is firmly positioned on our agenda. But for now: happy exploring!
Digital Heritage Online is the result of a close collaboration between different services of KU Leuven Libraries. LIBIS took care of the design and technical development of the Collection Discovery platform in KU Leuven Libraries’ catalogue and search interface. Digital Heritage Online was the pilot project for this new Limo implementation. Together with the various collection curators and thanks to the many digitisation projects, the Digitisation Department designed a structure for the various sub-collections. Technical operations for bringing together the many heritage object descriptions in the correct Alma collections were carried out in collaboration with the Document Processing Department. The content coordination of Digital Heritage Online is in the hands of the Digitisation Department.
KU Leuven Libraries is happy to present a new online platform for accessing its fragile heritage collections online: EXPO.
EXPO offers virtual exhibitions and a gallery of individual collection items, and informs about digitization and imaging projects at the library.
The virtual exhibitions connect collection items with other works and may or may not be a continuation of actual library exhibitions. Exceptional works from the KU Leuven Libraries’ collections are highlighted in the gallery, which gathers both recognized masterpieces and other fascinating works with a unique story. The project pages introduce the website visitor to ongoing and past activities in the field of digitization, presenting both initiatives aimed at the digital disclosure of the library collections and projects in the context of research and technical imaging.
EXPO is the result of a close collaboration between various departments. Exhibition curators, collection keepers, and other heritage collaborators and partners create the exhibitions and fill the gallery with collection highlights. LIBIS carried out the technical implementation of the site as part of the Heron (Heritage Online) service. The site is based on the open source web publishing platform Omeka. For the site’s development, LIBIS created a direct connection between Omeka, the library management system Alma and the preservation system Teneo/Rosetta, allowing curators to work within a single, integrated environment. The coordination of EXPO is taken on by KU Leuven Libraries Digitisation & Document Delivery. This department manages digitisation projects, executes digitisation in its Imaging Lab and supports research projects with bespoke digitisation techniques and the specific expertise required for scientific imaging.