Multimodal data viewing platform: the Mirador 3-based 3Pi viewer

Hendrik Hameeuw, Dirk Kinnaes, Stephan Pauls, Bruno Vandermeulen, Nele Gabriëls, Lieve Watteeuw

Intro

Heritage artifacts and collections are crossroads where historical context, materiality, content and technology join. In the field of heritage science, scholars deploy methodologies from the humanities and applied sciences to increase their understanding of these heritage artifacts and/or the collections they are part of. This is achieved through the balanced and well-thought-out use of research strategies with advanced technologies and infrastructures specifically adapted to or applied for this purpose. The data collecting strategy depends on the research or conservation questions raised for each specific case and leads to an array of potential research infrastructures and techniques to be combined into an ad hoc integrated approach. Techniques such as high-end digitisation, advanced imaging (e.g. MLR, NBMSI, IRR, …), spectral (e.g. FORS spectroscopy, Raman, …) and chemical (e.g. XRF, …) characterization produce various types of data, both visual and analytical. Each type of data generates specific information, facilitates in-depth research and provides insight into the material genesis and state of the object. At the same time, managing, aligning and publishing these multimodal and multi-layered datasets online is challenging.

A solution is to integrate and curate all these various datasets into an easy-to-use, interoperable consultation platform. Preferably, this virtual environment:

A. presents the heritage object as it appears to the human eye;
B. provides access to the full captured and calibrated data;
C. aligns that with the processed data and results (visual and analytical) supporting the predetermined research questions;
D. tags the data with technical metadata and paradata;
E. groups this data under a persistent ID;
F. and allows for easy interaction with all the included multi-layered information.

IIIF’s Mirador 3 offers the basic features necessary to allow such a complexly structured integration. Annotations are used to describe fragments (smaller images represented as layers) and other data files associated with the overall image. Fragments contain processed visual data and are linked to rectangular areas, while analytical data is represented by points. We enhanced the Mirador 3 viewer in such a way that layers representing fragments are displayed in register with the overall image. The fragments can be rotated by any angle, not just multiples of 90°.

At the University of Leuven’s Core Facility VIEW, within the framework of the 3Pi Project (Diagnosis of Papyrus-Paper-Parchment manuscripts through advanced Imaging), heritage artefacts are studied and documented in such a multimodal manner. The newly engineered Mirador 3 implementation has been executed by LIBIS in collaboration with the Digitisation Department of KU Leuven Libraries and the Book Heritage Lab of the Faculty of Theology and Religious Studies. The result is a dedicated workflow implemented within the existing end-to-end digitisation process worked out at KU Leuven Libraries. This guarantees that the curated, published dataset is embedded in the institutional long-term preservation strategy. All this makes it possible to share, preserve, explore and analyze this type of complex research data, and acts as an important accelerator facilitating the research, documentation and publication of heritage objects.

Data acquisition and processing

Data acquisition and processing are integrated. However, the complexity and extensiveness of the final data cannot always be understood in advance, as research projects apply all sorts of infrastructures and produce interim datasets as data needs are defined during their course. From this point of view, the acquisition can proceed independently of the ultimately intended curated, published dataset to be embedded in the institutional long-term preservation strategy.

The types of data which can be included and viewed in the final result can be diverse: besides image files (.tiff, .jpeg), various other file formats can be included (.pdf, .xlsx, .csv, .txt, …). These latter formats enable the inclusion of the raw data of analytic measuring devices, transformed into common file formats, which makes the outcome more FAIR.

Curation

One of the main incentives of the 3Pi viewer is to present the available complex of rich multimodal data to the end-user in a structured manner. This requires a well-thought-out approach on the part of those who created the data and used it for research. Each curated dataset can support one or more aspects that were central to that research. It is therefore the researcher who is in charge of this crucial step. This means that for each dataset, which is assigned a unique ID after ingest (based on a concordance table), it must be decided which data is added to it, how it is structured, and which metadata must be assigned, both per individual data file and for the dataset in its entirety.

The curation is prepared in a structured manner, which not only allows a flexible response to the challenge of bringing the variety of data together as one whole, but also enables smooth execution of an automated ingest process. The curation involves three major steps: composition, structuring and mapping.

First, all data files are compiled into a spreadsheet (the ingest-list) in which contextual and structural metadata (including viewer metadata and ingest guidelines) are added per file. Second, structure is added to the set of data files: the data types are labeled and each file is given an index. Third, since the 3Pi viewer is mainly built according to a layered architecture in which both images and measurement points are annotated within a canvas, the exact location of the data files is mapped. This is done in the layered tiff file of the ‘_BasicRGB’ image (part of the file name) with the largest dimensions. Whether other images have the same dimensions (in which case they are labeled as in register during structuring) or are smaller, their exact location is mapped with a black quadrangle (rotated or not) in this layered tiff. The exact position of every other document in the dataset is mapped in a similar manner with a black dot. Each mapped layer in the tiff receives the same name as the corresponding file in the overall dataset.

Ingest

Ingest scheme

The SIP dataset for ingest includes all the separate data files, plus an XXX_INGEST.xlsx file (the ingest-list) with contextual and structural metadata on these files and an XXX_METADATA.xlsx file with metadata on the entire dataset. All of this data remains available via the final Mirador 3 viewer implementation. An important intermediate step during ingest consists of extracting the coordinates of the mapped zones and measurement points in the layered tiff in order to generate their annotations in the Mirador 3 implementation. Furthermore, a distinction is made between the images (these are directed towards the IIIF Image Server) and any of the other files in the dataset (these remain in the data repository); this data streaming path to the viewer is recorded in the generated IIIF manifest.
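To give an idea of what this coordinate-extraction step involves, here is a minimal, hypothetical Python sketch. It assumes each mapped layer of the curated tiff has been exported as a separate black-on-white mask image named after the data file it locates; the actual extraction code used by LIBIS is not published here, and the canvas URI and file names are placeholders.

```python
# Hypothetical sketch: derive IIIF-style annotation targets from exported layer masks.
# Assumes each mapped layer is available as a separate black-on-white mask image
# named after the data file it locates (e.g. "masks/XXX_XRF_point_12.png").
import glob
import json

import numpy as np
from PIL import Image

CANVAS_ID = "https://example.org/iiif/IE00000000/canvas/1"  # placeholder canvas URI

annotations = []
for mask_path in glob.glob("masks/*.png"):
    mask = np.array(Image.open(mask_path).convert("L")) < 128  # True where the marker is drawn
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        continue  # nothing mapped in this layer
    x, y = int(xs.min()), int(ys.min())
    w, h = int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1)
    annotations.append({
        "type": "Annotation",
        "motivation": "linking",
        # the annotation body would point to the data file corresponding to this layer
        "body": {"id": mask_path.split("/")[-1].rsplit(".", 1)[0]},
        # a black dot collapses to a small box; a quadrangle yields its bounding box
        "target": f"{CANVAS_ID}#xywh={x},{y},{w},{h}",
    })

print(json.dumps(annotations, indent=2))
```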

Viewing

After ingest in the preservation system and with the generated manifest, the persistent URL gives access to the full curated dataset with a unique ID (IE or Intellectual Entity number) via the Mirador 3 implementation of the 3Pi viewer. It brings all data together into one interactive environment. Visual data can be viewed, zoomed and rotated to explore all details through the standard tools implemented in Mirador 3. This interaction also allows the user to switch between the different layers and to adjust their order and opacity. All activated annotations are mapped in the viewing canvas and, where applicable, linked to the non-visual analytical data files. All images are listed in the layer panel, and links to the data behind the mapped annotations are grouped in the annotation panel. Interactive data for which Mirador 3 does not support visualization can be consulted through a link in the annotation panel, which automatically loads the targeted file from the curated dataset into an external dedicated viewer. In the 3Pi viewer this has been implemented for RTI and ZUN data files, two types of Multi-Light Reflectance images, which are pushed towards the Pixel+ viewer.

Ancient Egyptian coffin fragment: https://lib.is/IE17414261/representation
Bible of Anjou, GSM Cod 1, fol. 3v: https://lib.is/IE17930549/representation
Two examples of curated datasets for the 3Pi viewer

The number of files that can be added to the curated datasets is, theoretically, limitless. This is of course accompanied by an increase in complexity and a possibly high data load per set. Combined with images, whether high resolution or not (even as pyramid tiffs), the result can become highly demanding for local hardware systems. In that regard, it is important to consider whether it is always useful to load the complete dataset into the Mirador 3 viewer. Another challenge is the question as to how a user of this data can communicate certain parts of it to other users via this viewer. Having the option to specifically address certain files from the dataset in a simple way offers a useful solution here. Therefore, each separate file in the curated dataset has been attributed a simple index number, included in the ingest-list. The permalink of the curated dataset gives the option to address each of the data files (or a selection of them) separately (see illustration above and example below): only the addressed files are loaded into the viewer. To select the data files of interest, the original ingest-list can easily be downloaded via the information panel in the Mirador 3 viewer. In the permalink, the index numbers (column 1 in the ingest-list) are listed by appending ‘?referer=’ followed by the index numbers separated only by commas; thus: [permalink]+[?referer=]+[index-number,index-number,index-number,…].

Example of an individually addressed dataset (in this case to support the analysis of the blue pigments on the ‘Ancient Egyptian coffin fragment’): https://lib.is/IE17414261/representation?referer=1,28,34,35,37,39,51,52,53,54,55,67,68,69,70,71,72,73,74,75,76,77

This approach allows users to manipulate the original curated dataset in favor of the narrative in which it is used for documentation or study purposes. As the basis for the addressable link is the permalink of the unique ID (i.e. the IE number), the results can be included in any type of (scientific) communication.

Celebrating World Digital Preservation Day: A short story about creating collections as data for both present and future use

By Nele Gabriëls, Dirk Kinnaes, Hendrik Hameeuw and Bruno Vandermeulen

November 2nd, 2023 is World Digital Preservation Day! To honour the day, let’s have a look at some exciting partnerships we have been engaged in. Because yes, digital preservation (and all that it enables) most definitely is a concerted effort!

Physical and digital: does one replace the other?

At KU Leuven Libraries, we often (though not solely) work with heritage materials. Our aim: making collections readily available for as wide a range of use cases as possible. Of course we are proud to have an open data policy. We also strive for uniformity and standardisation when creating, structuring and preserving the digitised collections for both present-day and future use. 

When describing digitisation as the transformation of a physical object to a digital medium, the resulting digital surrogate is regularly considered a virtual replacement of the original. In certain cases, rendering an accurate representation of an original physical work through digitisation might indeed be the primary goal. To achieve this, digitisation must follow guidelines such as FADGI or Metamorfoze to capture the original as accurately as possible. We have been working on this for many years and continue along this path (see here), though importantly, when the materials require adjustments to the guidelines, we create specific implementations like we did for the digitisation of KU Leuven’s Charter collection (see here).

Digitisation can also add new information about the original object to its digital counterpart, thereby significantly broadening its potential for use. By using specific techniques like raking light, 3D capture, or other enrichment techniques, specific features (even invisible ones) of the physical object are captured into digital data. And what happens when a large set of data is considered, researched and worked with as a whole? Exactly: a totally different type of interaction with the materials and entirely new possibilities for usage emerge.

Creating, structuring and preserving our collections as data

But first the basics: data creation and preservation. For the creation of trustworthy and high-quality data that represents the physical appearance of the original objects, we apply international standards as mentioned above. The master files (.tiff) are of course grouped into digital objects that represent physical collection items. These are accompanied not only by descriptive metadata (a set of fields taken from our metadata repository is included in the object’s Submission Information Package or SIP), but also by extensive structural metadata describing the position of each of the files created in the full object. 

Various checks are performed on the submitted SIPs in order to guarantee the completeness and consistency of metadata and files before ingesting the SIPs into our repository (Teneo).  During the ingest procedure, the submitted files undergo several checks (including virus check, format identification and validation), technical metadata are extracted and consultation copies are generated.

The core of Teneo is a long-term preservation system (Rosetta). The repository has a layered data model based on PREMIS (intellectual entities (IEs), representations, files). Access rights can be defined at each level. The role of each file and its relations with other files within the same digital object are described in the structural metadata, which is essential for viewing and for using the metadata/files in other applications. Some of these use cases and applications may not yet be known at the time of ingest. The metadata and the files can be delivered in various ways, including offline export (METS + files), download (files, complete IEs), and via viewers. Also, standard IIIF manifest files (JSON) are available, so that objects can be viewed in external IIIF-compliant viewers as well. METS and IIIF manifest files are also useful in scenarios where the metadata/files are used as data rather than for consultation.
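As a minimal illustration of such machine reuse, the sketch below fetches a IIIF Presentation manifest and lists its canvases. The manifest URL is a placeholder (the exact manifest endpoint pattern of Teneo is not given here), and the snippet accepts both Presentation API 2.x ("sequences") and 3.0 ("items") layouts.

```python
# Minimal sketch: read a IIIF Presentation manifest and list its canvases.
# The manifest URL is a placeholder; substitute the manifest link exposed by the repository.
import requests

MANIFEST_URL = "https://example.org/iiif/IE17414261/manifest"  # placeholder

manifest = requests.get(MANIFEST_URL, timeout=30).json()

# IIIF Presentation 2.x nests canvases under "sequences"; 3.0 lists them under "items".
if "sequences" in manifest:
    canvases = manifest["sequences"][0].get("canvases", [])
else:
    canvases = manifest.get("items", [])

print(manifest.get("label"))
for canvas in canvases:
    print(canvas.get("label"), canvas.get("@id") or canvas.get("id"))
```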

Marketing (and testing!) our wares: KU Leuven Libraries’ BiblioTech 2023 Hackathon

With preservation and consultation of the individual objects in place (even if ever evolving), KU Leuven Libraries is searching its way toward assembling its collections into datasets amenable to computational use. Users don’t always seem to find their way to our digital collections, and if they do, the digital items are often only used in ways similar to the interaction with the original items. In order to market the digital collections (attract attention to their existence) and to provide showcases of the kinds of actions enabled by giving access to extensive datasets, our digitisation department partnered with the Faculty of Arts and the KU Leuven Libraries Artes Research team to organise the BiblioTech 2023 hackathon. Our partners were just as thrilled by the opportunity to promote the value of digital skills for research and creative projects amongst students and staff as we were.

The event was a great success. Seven teams of seven to eight participants (students, PhD and early career researchers) set about a digital project, each having been assigned one dataset to work with. Having been introduced to their team and to the allotted datasets during the ‘Meet the data, meet the people’ pre-event and having worked on their project for ten days during the actual hackathon period, they presented their work at the closing event in the form of a presentation and a poster. (You can look at the posters here.) The datasets consisted of images, descriptive and structural metadata and documentation on both the dataset structure and how to interpret the metadata. 

A collection curator giving background information on one of the collections on which the datasets were based during the ‘Meet the Data, Meet the People’ pre-event to the hackathon.

Creating the sets was a concerted effort of various departments of KU Leuven Libraries (Metadata, Digitisation, and library IT) in collaboration with the Faculty of Arts. It was exciting to take the first steps towards presenting our collections as datasets for computational use, after having worked on metadatasets for various collections during previous years with several students of the master programme in Digital Humanities (e.g. here, here and here). Work on the datasets was performed in parallel with co-writing this article (to be published in the journal ‘Global Knowledge, Memory and Communication’) on how to create collection datasets for computational use.

Datasets were accessible through the brand-new internal KU Leuven platform for active research data, ManGO, developed by the university’s IT department. Through this platform, participants also had access to the high-performance computing power of the Flemish Supercomputing Center.

An event like the BiblioTech 2023 hackathon (the first hackathon to be organised by KU Leuven Libraries!) created a great momentum for working with data. Since it has taken place, the uptake of the library’s digital collections has risen significantly and the idea of making full datasets available is taking root. We are excited to continue exploring this route!

Interoperable multi-modal datasets

Research projects generate loads of different types of data. Preserving them is one aspect; ensuring that users also gain insight into them is an additional challenge. The VIEW Core Facility at KU Leuven produces both visual and analytical data of heritage objects, all derived from a variety of research infrastructures. Together with LIBIS, VIEW has developed a solution to preserve such multimodal, layered datasets and present them in one viewing environment. Both raw and processed datasets, visual and analytical data and different file types (tif, jpg2000, pdf, xlsx, interactive data formats, …) are brought together into one package. The solution allowed us to structure the data, link visual and analytical data and map the metadata. Additionally, a flexible way to curate datasets in the viewer was developed, so that one can select the layers or data one wants to present, e.g. in the framework of an article or blog post to highlight specific features or findings.

The layered tiff file is the base for mapping all data, both visual and analytical, in their correct spatial position

The solution is based on the IIIF Mirador 3 framework, in which layered data can be presented in their spatial context. At the base is a layered tiff image in which different images of various sources and resolutions are mapped onto each other in register, and zones are pinpointed for inclusion of the measuring points of analytical data such as XRF (X-ray Fluorescence) or FORS (Fiber Optic Reflectance Spectroscopy) in their spatial context. The specific metadata for each layer is brought together in a tabular file format to be imported and presented as annotations of the layers in the IIIF viewer.

A tabular file is prepared to link data files with metadata. This metadata is presented as annotations in the IIIF Mirador viewer

After ingesting the data in our preservation system, the Mirador 3 viewer brings all data together into one interactive viewing environment. Visual data can be viewed and zoomed to explore the minute details, the user can interact with the different layers, the context of the data is presented as annotations linked to the specific layers, and non-visual analytical data or interactive data can be consulted through links in the annotations.

The IIIF Mirador viewing environment bringing together multimodal datasets (visual and analytical) in a mapped spatial environment.

The preservation of multimodal datasets (raw, processed, visual and analytical) in one package has many advantages. All data of one object is brought together, enabling publication of the set in one environment and linking the different types of data in one mapped spatial context. It allows both researchers and the wider public to interact with the data and explore minute details in context.

Anjou Bible: https://lib.is/IE17930549/representation, more info about the Anjou Bible can be found here

Egyptian Coffin Fragment: https://lib.is/IE17414261/representation

A library story with our own data on Google Arts & Culture

by Nele Gabriëls, Zhuo Li & Hendrik Hameeuw

KU Leuven Libraries creates, year after year, large quantities of data. A big portion of it consists of visual media accompanied by a variety of predetermined and structured metadata. It is all brought together in its own Unified Resource Management System ALMA and preservation environment Teneo (back office), where that data is respectively managed and stored for long-term preservation, and from where it is harvested and made searchable via the online library catalog LIMO (front office). Within its own biotope, that same data can also be structured and presented differently, for example in virtual exhibitions via EXPO or via topic-specific platforms such as Digital Heritage Online. Equally, but outside its biotope, periodically selected sub-collections are made available via the portal site EUROPEANA.EU.

KU Leuven Libraries keeps exploring how this structured and well-preserved data can be further reused via or on other new or established platforms, all in support of its open data policy. Several interns have already worked on this challenge and came up with refreshing examples. You can read all about the work by Mariana, Allison, and Luna. This post presents the work by Zhuo.

screenshot of the stories header on the Google Arts & Culture platform (desktop version)

Zhuo Li joined the Department of Digitisation as an intern student of the Cultural Studies Master’s program at KU Leuven. She took on the challenge to investigate how a standard digitised collection, as it is created and managed at KU Leuven Libraries, can be transferred and curated into a Google Arts & Culture story. Zhuo assessed the interoperability of the available data and subsequently mapped out the steps to make it fit the Google Arts & Culture architecture. First, a selection of the available metadata per image was uploaded in batch to the Google data management dashboard. The uploaded metadata included a field with a URL linking back to the original digital representation (cross-platform) of the item in the KU Leuven Libraries repository as shown in the library’s viewer. Next, the images were uploaded to KU Leuven Libraries’ Google Arts & Culture collection. The Google data management dashboard linked metadata with images.

With this data, Zhuo could create her story which gives an overview of the various locations of the KU Leuven Libraries facilities across Flanders & Brussels. For each location Zhuo selected one or two items from the KU Leuven Libraries digitized collections to illustrate the history of these locations decades or centuries back in time.

All the materials used in the story, besides the Google Street View and YouTube plugins, are from KU Leuven Libraries Digital Heritage Online and include digitized items dating to the Modern Age.

Happy exploring!

Wikidata and Wikimedia Commons as Linked Data Hubs: Dissemination of KU Leuven Libraries’ digitized collections

By Luna Beerden

Hi! I’m Luna, a former student in Egyptology and Digital Humanities at KU Leuven. During my internship at KU Leuven Libraries in spring to summer 2022, I worked on enhancing the data dissemination of the library’s digitized collections. Because of my background in Egyptology, the collection of 3485 glass diapositives with an Egyptological theme (part of a much larger collection of glass slides) was chosen as a case study. The topics of these diapositives vary greatly: they range from daily life and travel photography, through educational slides with views of monuments and museum artifacts, to slides depicting excavations that were ongoing at the time. Little is known about the origin of the collection or the exact age of the slides, although they were certainly used to teach Egyptology at the university well into the 1970s.

Considering the sheer size and variety of the collection, not all glass diapositives were pushed to the Wiki Environment at once. The primary focus was on the slides depicting the French excavations in El-Tod (Louvre Museum, 1936) and Medamud (IFAO, 1925-1932). The selection of these batches was based on their perceived uniqueness, with excavation slides estimated to be rarer than educational slides. Further, a third batch of slides depicting museum artifacts was prepared for upload, chosen to highlight the connection with several holding institutions of important Egyptological collections and so further improve the visibility of the collection of KU Leuven Libraries. This is not to say, however, that the other glass diapositives are perceived as in any way less important.


Figure 1: Example of a glass diapositive from KU Leuven Libraries’ digitized collection, depicting the excavation of the southern part of the courtyard of the Temple of Montu in Medamud (Egypt) (‘File: Medamud. Temple of Montu. Excavation of the southern part of the courtyard’, consulted 21st of February 2022 via Wikimedia Commons or via the Teneo viewer of KU Leuven).

Considering the Library’s open data policy and the work performed by previous Digital Humanities interns within this setting, I did not have to start from scratch, but could rather build upon their work and experiences. This blog post takes a closer look at the research process and practicalities that eventually led to the creation of the pilot study ‘Wikidata and Wikimedia Commons as Linked Data Hubs: Dissemination of KU Leuven Libraries’ digitized collections through the Wiki Environment’.

Metadata Dissemination – the Wiki Environment

Before any work was started on the data itself, it was essential to first establish whether the Wiki Environment was a good fit for the digitized collections of KU Leuven Libraries. An extensive literature review was performed in which both platforms, Wikidata and Wikimedia Commons, were critically examined and their added value to the library’s existing workflows was determined. The result of this review proved positive, with the main advantages being the platforms’ multilingual and centralised character, their position as a linked data hub, the high level of engagement compared to the library’s own database PRIMO, and the existence of specialised query and visualisation tools.

Although uploading files to Wikimedia Commons has been going on for some time, there has been a recent shift towards Wikidata because of its linked open data structure, which is hidden behind a user-friendly interface, data models, and tools, all concealing these technological complexities[1]. In short, Wikidata builds upon RDF triples, each consisting of a subject, a property, and an object, which all receive a unique identifier (QIDs and PIDs). Qualifiers can be added to specify or provide additional information to a statement, and references to back up the information given[2]. Due to this linked data structure, different data sources can be queried simultaneously with the Wikidata Query Service (WDQS), which allows users to pose more complex questions and essentially transforms Wikidata into an authority hub centralising and connecting metadata from collections worldwide[3].
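As an illustration of what this linked structure enables, the sketch below sends a simple SPARQL query to the Wikidata Query Service from Python. The collection QID is a placeholder, not the actual identifier of the KU Leuven Libraries collection; P195 (collection) and P18 (image) are standard Wikidata properties.

```python
# Sketch: query the Wikidata Query Service (WDQS) for items in a given collection.
# Q00000000 is a placeholder QID; replace it with the collection item of interest.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?item ?itemLabel ?image WHERE {
  ?item wdt:P195 wd:Q00000000 .          # P195 = collection (placeholder QID)
  OPTIONAL { ?item wdt:P18 ?image . }     # P18 = image on Wikimedia Commons
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,nl". }
}
LIMIT 20
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "example-collection-query/0.1"},  # polite self-identification
    timeout=60,
)
for row in response.json()["results"]["bindings"]:
    print(row["itemLabel"]["value"], row.get("image", {}).get("value", ""))
```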

One of the key drawbacks of the Wiki Environment that has to be addressed is the absence of standardization, combined with the lack of documentation on the ‘Wiki way’ to structure metadata, especially for Wikidata. Therefore, a lot of research and comparison was performed during this internship to understand best practices and to create guidelines to be implemented in future work by KU Leuven Libraries, and by extension the GLAM sector.

Case study – Glass diapositives Egyptology, KU Leuven Libraries

Data handling was performed in OpenRefine, a free, open source tool that offers the possibility to clean and transform large amounts of data with minimal effort and that is closely linked to the Wiki Environment. Prior to data cleaning, the metadata model and the precise content of the raw metadata had to be closely examined and matched with the Wiki Environment. The entities corresponding to the metadata values to be pushed were determined by going through all Wikidata properties and Wikimedia Commons templates, and by investigating uploads of glass diapositives by other libraries and holding institutions. One of the main difficulties of mapping the data to both platforms was the limited range of Wikidata properties and Wikimedia Commons templates to choose from. In essence, the metadata of the collection was so rich that not all information could be easily mapped. Instead of leaving certain data out, an individual solution was sought for each field, and the metadata was modified to enhance the usability and readability of the data.

With data mapping and cleaning completed, it was possible to start data reconciliation: a semi-automated process in which metadata is matched to data from an external source, in this case Wikidata. All matches had to be checked manually, as the pattern matching performed by OpenRefine does not deliver perfect results. Reconciliation was carried out for all columns in the dataset, excluding unique metadata such as inventory numbers and URLs. Pushing the metadata to Wikidata required a Wikibase schema to be prepared, which can be done in the Wikidata extension of OpenRefine (see figure 2). Fundamentally, a Wikibase schema represents the structure of the Wikidata items to be created and consists of a combination of terms and statements. All relationships could easily be mapped by selecting the previously determined Wikidata properties with their associated values and qualifiers as statements, and by dragging in the associated, reconciled columns of the OpenRefine project from the top of the page. Once the schema was completed, it was saved for export to Quickstatements and for potential reuse during future uploads. The Quickstatements tool was selected for metadata upload as it allows more control over batch uploads, with a list of previous uploads being available, errors being flagged, and reversion being easily executable. An export to Quickstatements can be started through the Wikidata extension of OpenRefine, creating a .TXT file with V1 commands that is copy-pasted into the Quickstatements tool (see figure 3).

Figure 2: Excerpt of the Wikibase scheme in OpenRefine with all columns of the OpenRefine project present on top.
Figure 3: Excerpt of batch upload glass diapositives of El-Tod to Wikidata using Quickstatements.
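To give an idea of what such a V1 command file looks like, here is a hedged Python sketch that writes Quickstatements V1 lines for a single new item. The label text is a placeholder and the instance-of value is an illustrative assumption; the real statements were generated from the Wikibase schema built in OpenRefine, not from this script.

```python
# Hedged sketch: write a Quickstatements V1 command file for new Wikidata items.
# Labels are placeholders; the instance-of QID is an illustrative assumption.
records = [
    {"label_en": "El-Tod. Temple of Montu. Excavation view",
     "label_nl": "El-Tod. Tempel van Montu. Zicht op de opgraving"},
]

lines = []
for rec in records:
    lines.append("CREATE")                             # create a new item
    lines.append(f'LAST\tLen\t"{rec["label_en"]}"')    # English label
    lines.append(f'LAST\tLnl\t"{rec["label_nl"]}"')    # Dutch label
    lines.append("LAST\tP31\tQ125191")                  # instance of: photograph (assumed QID)

with open("quickstatements_v1.txt", "w", encoding="utf-8") as fh:
    fh.write("\n".join(lines) + "\n")
```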

A main issue that occurred when pushing the metadata to Wikidata was the absence of titles for the created Wikidata items when querying the platform with any language setting other than Dutch. Not only did this decrease readability for users, it also hampered the items’ findability, as Dutch search terms were required for the glass diapositives to appear in the user’s search results. To solve this, English translations of the titles of all glass diapositives were created using the translators 5.4.1 library in Python (see figure 4) and added to Wikidata via Quickstatements. An example of a Wikidata item created is shown in figure 5.

Figure 4: Python code used to translate the titles of the glass diapositives in Jupyter Notebook (Beerden 2022, ‘glass-diapositives-egyptology-jupyter-notebook-translations’, consulted 15th of August 2022 via https://github.com/lunabeerden/glass-diapositives-egyptology-ku-leuven-libraries).
Figure 5: Result of the first batch upload of glass diapositives from El-Tod to Wikidata. Detail of Wikidata item Q113103931 after manual data correction of misinterpreted accented characters and addition of English translations (Digitalisering 22/07/2022, ‘El-Tod. Temple of Montu. Archaeologist and worker in front of the façade of the pronaos’, consulted 29th of July 2022 via https://www.wikidata.org/wiki/Q113103931).
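The translation step can be sketched roughly as follows. This assumes the translators package exposes a Google-backed helper as in its 5.x releases; the exact function name and signature may differ between versions, and the title below is a placeholder rather than an actual record.

```python
# Hedged sketch of the title-translation step with the `translators` package (5.x).
# The helper name/signature is an assumption and may differ per version; the title is a placeholder.
import translators as ts

dutch_titles = [
    "El-Tod. Tempel van Montu. Archeoloog en arbeider voor de pronaos",
]

english_titles = []
for title in dutch_titles:
    # ts.google() was the Google-backed helper in the 5.x releases (assumption).
    english_titles.append(ts.google(title, from_language="nl", to_language="en"))

for nl, en in zip(dutch_titles, english_titles):
    print(nl, "->", en)
```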

Now that the metadata of the glass diapositives had been uploaded to Wikidata, it was time to push the photographs and their metadata to Wikimedia Commons as well. First, the exifdata present in the images had to be complemented with information from the descriptive metadata as a safety measure against misuse and misattribution. The choice of XMP (Extensible Metadata Platform) tags, i.e. the metadata to add, was inspired by the webinar on exifdata organised by meemoo and a case study on Wikimedia Commons by the Groeningemuseum in Bruges.[4] ExifTool, a Perl library and command-line application to adapt metadata, was chosen to overwrite the exifdata already present in the images (see figure 6).

Figure 6: Update of exifdata glass diapositives El-Tod and Medamud using ExifTool.
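A minimal sketch of that embedding step is shown below; it calls the exiftool command line from Python and writes a few XMP Dublin Core tags. The specific tag selection, values and file name are illustrative, not the exact set chosen during the internship.

```python
# Minimal sketch: embed descriptive metadata as XMP (Dublin Core) tags with ExifTool.
# Requires exiftool on the PATH; the file name, tag selection and values are illustrative.
import subprocess

image = "ElTod_Tempel_van_Montu_01.jpg"  # placeholder file name
subprocess.run(
    [
        "exiftool",
        "-overwrite_original",                      # do not keep *_original backup copies
        "-XMP-dc:Title=El-Tod. Temple of Montu",
        "-XMP-dc:Creator=KU Leuven Libraries",
        "-XMP-dc:Rights=Public domain",
        "-XMP-dc:Source=https://lib.is/...",        # link back to the repository record (placeholder)
        image,
    ],
    check=True,
)
```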

Using the Pattypan tool, the batch upload of all images with their updated exifdata was started, following a custom template based on the photography template offered by Wikimedia Commons (see figure 7). All data was uploaded under the Wikimedia Commons category Glass diapositives Egyptology, KU Leuven Libraries, with parent categories Digitised collections of KU Leuven Libraries and Historical Photographs of Egypt, allowing easy future uploads of the library’s collections. Other categories to which the collection was linked include Image sources of Belgium and Images from libraries in Belgium, with each glass diapositive individually assigned to more specific categories related to its content, such as Tod Temple of Montu in Tod and Senusret III. The attribution of categories to the glass diapositives aims to bring various data sources together centrally and as such allows a larger public, consisting of both expert and non-expert users, to be reached.

Figure 7: Filled in photography template for file on Wikimedia Commons (Digitalisering, ‘File: El-Tod. Tempel van Montu. Laagreliëf van Sesostris III 01’, consulted 24th of  July 2022 via https://commons.wikimedia.org/w/index.php?title=File:ElTod._Tempel_van_Montu._Laagreli%C3%ABf_van_Sesostris_III_01.jpg&action=edit&section=1).

Finally, the link between Wikimedia Commons and Wikidata had to be established prior to pushing the data to Wikimedia. To this end, the Wikidata QID of all items was extracted in OpenRefine using the reconciliation function and added to a custom Wikimedia Commons template using a regular expression. Not only was this done for the title of the glass diapositives; information such as the collection or current location of the slide was connected as well. Small errors that occurred during upload were solved using the Cat-a-lot gadget of Wikimedia Commons. An example of a successful upload to Wikimedia Commons can be seen in figure 8. As this connection is bilateral, in addition to adding the Wikidata identifiers to Wikimedia Commons, the Wikidata entries had to be enhanced with the Wikimedia Commons filenames in the Wikidata property Image (P18). Although it is possible to update the previously created Wikibase schema and push the data once more, as explained by meemoo in their webinar[5], a more efficient way of adding metadata to the Wiki Environment is proposed: by creating an .XLSX file with Quickstatements and running this instead, only newly created metadata is considered.

Figure 8: Descriptive metadata as present on Wikimedia Commons (Digitalisering, ‘File: El-Tod. Tempel van Montu. Laagreliëf van Sesostris III 01’, consulted 24th of July 2022 via https://commons.wikimedia.org/w/index.php?title=File:ElTod._Tempel_van_Montu._Laagreli%C3%ABf_van_Sesostris_III_01.jpg&action=edit&section=1).

To further improve the accessibility, searchability, and quality of the information offered, structured data should be added on Wikimedia Commons. This allows the metadata to be both human- and machine-readable. Although it is not yet possible to add structured data unique to each slide in batch, and it is too time-consuming and error-prone to do this manually for each entry, multiple projects are currently ongoing and will hopefully allow this in the future. For now, the AC/DC (Add to Commons/Descriptive Claims) tool was used to add fixed values to large selections of files, such as information on copyright status or collection details (see figure 9).

Figure 9: Structured data tab on Wikimedia Commons for glass slide added during first batch upload (Digitalisering, ‘File: El-Tod. Tempel van Montu. Laagreliëf van Sesostris III 01’, consulted 24th of July 2022 via https://commons.wikimedia.org/w/index.php?title=File:ElTod._Tempel_van_Montu._Laagreli%C3%ABf_van_Sesostris_III_01.jpg&action=edit&section=1).

To prepare the metadata of KU Leuven Libraries for these advancements, Named Entity Recognition (NER) was performed, in which keywords describing the content of each glass diapositive were extracted from the title and subtitle of all records. Multiple experiments were done to determine the best way to carry out NER for this collection. Eventually, the Dandelion API service was chosen, as it can easily be implemented in OpenRefine using the named entity recognition extension developed by the Free Your Metadata project[6]. The best results were achieved on the English translations created earlier (see figure 10).

Figure 10: Excerpt of English translation (left) and named entities extracted from these translations using Dandelion API in OpenRefine (right).
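The same extraction can be sketched as a direct call to the Dandelion entity-extraction endpoint, mirroring what the OpenRefine extension does behind the scenes. The endpoint and parameter names below are quoted from memory of the public API documentation and should be verified; the token and title are placeholders.

```python
# Hedged sketch: call the Dandelion entity-extraction endpoint directly from Python.
# Endpoint and parameter names are assumptions based on the public docs; token/title are placeholders.
import requests

DANDELION_NEX = "https://api.dandelion.eu/datatxt/nex/v1"  # entity extraction endpoint (assumed)
TOKEN = "YOUR_API_TOKEN"                                    # placeholder

title = "El-Tod. Temple of Montu. Archaeologist and worker in front of the pronaos"
resp = requests.get(
    DANDELION_NEX,
    params={"text": title, "lang": "en", "token": TOKEN},
    timeout=30,
)
for entity in resp.json().get("annotations", []):
    print(entity.get("spot"), "->", entity.get("title"))
```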

Proposed Workflow

Next to the upload of a selection of glass diapositives to the Wiki Environment, the most important outcome of this internship was the creation of a guideline for data upload to be implemented in KU Leuven Libraries’ existing workflows, illustrated in the flowchart below (see figure 11) and briefly discussed here. All actions proposed need to be considered part of an iterative process in which different phases can be repeated multiple times and revisited if necessary. Figure 12 provides an overview of all files created and their relation to each other. More information and recommendations can be found in the report written during the course of this internship.

Prior to data publication, the metadata quality and data consistency should be assessed and improved, using OpenRefine for data cleaning. The quality of the data mapping performed, with the attribution of the correct Wikidata properties and Wikimedia Commons templates, strongly influences the level of data dissemination achieved and therefore plays a key role in the process. Once mapping is completed, the collection has to be prepared accordingly through the creation of a Wikibase schema in OpenRefine, an export to Quickstatements, and an eventual push to Wikidata. To ease this process, prepared columns should be placed at the end of the OpenRefine project, following the structure of the metadata after upload and separated by platform and/or use. Using OpenRefine, the exifdata present in the images must be expanded and prepared to be pushed with ExifTool, after which upload to Wikimedia Commons can start. To do so, the metadata is extracted from the images into an XLSX file using Pattypan. After updating the XLSX file with the preferred information, the metadata is validated and uploaded to Wikimedia Commons with the same tool. The last step of the process entails the release of a VRTS statement on the copyright status and license of the uploaded images, which can be achieved through the interactive release generator.

Figure 11: Proposed workflow for upload to Wikidata and Wikimedia Commons.
Figure 12: Overview of files created during this internship with their interrelationships.

Lessons learned

By uploading the images and their associated metadata to the Wiki Environment, not only is the goal of data dissemination reached, but KU Leuven Libraries also responds to the ever-increasing importance of linked data usage in the GLAM sector. As mentioned previously, Wikidata and Wikimedia Commons act in this regard as linked data or authority hubs, where data reconciliation and the existence of templates ensure the use of a single structured vocabulary across the Wiki Environment by institutions worldwide, contrary to the use of library-specific protocols. This not only aids the searchability of the data through machine-readability and data linkage, but also greatly enhances the usability for the (general) public, who can now search across a large number of collections simultaneously using a single uniform search term.

My internship demonstrated not only the (dis)advantages of using the Wiki Environment and the value of programming languages for metadata enrichment and improvement, but also showed that optimizing the data upload requires a substantial amount of time. It is therefore crucial that KU Leuven Libraries weighs the results achieved against the time invested, amongst others by keeping track of user statistics over prolonged periods of time. The workflow briefly introduced here can be applied to other digitized collections, and the automatic translation of metadata and named entity extraction performed during this internship can equally be reused, whether or not upload to the Wiki Environment is intended.

Link to the collection

https://commons.wikimedia.org/wiki/Category:Glass_diapositives_Egyptology,_KU_Leuven_Libraries_upload


[1] Tharani, K. ‘Much More than a Mere Technology: A Systematic Review of Wikidata in Libraries’, The Journal of Academic Librarianship 47 (2021): 102326.

[2] Wikidata 10/07/2022, ‘Wikidata: Identifiers’, consulted 13th of August 2022 via https://www.wikidata.org/wiki/Wikidata:Identifiers.

[3] Europeana 17/09/2020, ‘Why Data Partners should Link their Vocabulary to Wikidata: A New Case Study’, consulted 23rd of March 2022 via https://pro.europeana.eu/post/why-data-partners-should-link-their-vocabulary-to-wikidata-a-new-case-study; van Veen, T., ‘Wikidata: From “an” Identifier to “the” Identifier’, Information Technology and Libraries 38 (2019): 72.

[4] Meemoo, Vlaams instituut voor het archief 27/04/2021, ‘Wikimedia upload 4: metadata embedden met exiftool’, consulted 15th of March 2022 via https://youtu.be/W0v0Iwde86I; Saenko A., Donvil S. and Vanderperren N. 09/03/2022, ‘Publicatie: Upload van reproducties van kunstwerken uit het Groeningemuseum op Wikimedia Commons’, consulted 15th of March 2022 via https://www.projectcest.be/wiki/Publicatie:Upload_van_reproducties_van_kunstwerken_uit_het_Groeningemuseum_op_Wikimedia_Commons.

[5] Meemoo 27/04/2021 34:00, ‘Wikimedia upload 5: beelden uploaden met Pattypan en koppelen met Wikidata’, consulted 20th of March 2022 via https://www.youtube.com/watch?v=vkY41FVhmxk.

[6] Free Your Metadata 2016, ‘Named entity recognition’, consulted 5th of July 2022 via https://freeyourmetadata.org/named-entity-extraction/; Ruben Verborgh 21/06/2017, ‘Refine-NER-Extension’, consulted 5th of July 2022 via https://github.com/RubenVerborgh/Refine-NER-Extension.

Views of Louvain: the making of an interactive map

By Allison Bearly

Time to introduce you to the work of Allison Bearly! Allison spent the spring 2020 semester working as an intern for the Special Collections and Digitisation Departments of KU Leuven Libraries, as part of her Advanced Master’s in Digital Humanities programme. We were delighted she took on the challenge of working creatively with one of our open data collections. Here’s her story.

Hello there! I’m Allison. During my internship at KU Leuven Libraries, I worked with the digitized Views of Leuven collection from the library’s Digital Heritage Online collection, which contains 352 images of Leuven and the surrounding areas dating from the 16th to the 20th centuries. The Library had recently adopted an open data policy for its digitized collections and a previous intern had carried out a pilot study on sharing the collections as computationally amenable data; I built upon this foundation.

In my internship, I had two main objectives. The first was to create another machine-actionable dataset to share on the Library’s GitHub account, using the pilot study as a starting point. The second goal was to creatively reuse this data to show an example of what can be done when cultural heritage institutions share their collections as open data. Concretely, this took the form of creating an online, interactive map of Leuven through the centuries. In this blog post, I will explain the steps I took to achieve these goals. 

Data Cleaning and Enriching

Thanks to Mariana’s work on the pilot study, I was able to follow many of her steps for the data cleaning, transforming and refining. For a more detailed explanation, give Mariana’s blog post a read. I also chose to do the data cleaning in OpenRefine since it is a free, open source tool that easily lets users clean and transform data. 

Once I had done basic data cleaning and transforming, I began to enrich the data with location information. To use the dataset to map the images, I needed coordinates for each image. Before adding coordinates, however, I added another category, which I called Place Name. Although the metadata already included information about the place pictured in the image, it wasn’t always complete or consistent, especially when more than one location was featured in an image. Although I had to add the information for the Place Name column manually, it was a valuable use of time because the result was a consistent name for each place, which allowed me to take advantage of OpenRefine’s text facet feature to sort the records and add the coordinates en masse.

Figure 1: Screenshot from OpenRefine: Abdij Keizersberg, which has 5 records, is selected from the text facet

I decided to use OpenStreetMap (OSM) to get the coordinates and as the base layer for the map because it is open source and widely used for various web applications. 

Figure 2: View of the OpenStreetMap interface. By right clicking and choosing “Show address” at the location of Abdij Keizersberg, we get the coordinates in the search results window in decimal degree format (here highlighted in yellow).

The latitude is copied from OSM and pasted into OpenRefine by clicking on the edit button on the latitude cell. Then the “Fill down” feature is applied to add the latitude to the rest of the records with the same Place Name. The same steps were then repeated with the longitude column. The text facet feature made adding the coordinate information a relatively quick process and it ensured that all records with the same Place Name have the exact same coordinates. 

To see the complete cleaned and enriched dataset, check out the repository on GitHub.  

Figure 3: Three views of the OpenRefine interface. The first image shows the latitude being pasted into the corresponding cell of the first record. In the second image, the fill down function is applied, resulting in the third image which shows that the latitude has been added to the remaining records.  

Creative Reuse: Georeferencing Images

With the data cleaning and enrichment complete, I moved on to the next step of my project. As part of my interactive map of Leuven, I wanted to feature the historical maps in the collection so I used them as overlays over a base layer map. In order to do so, I first needed to georeference the historical maps. Georeferencing is the process of adding coordinate information to raster files (the historical maps) so that they can be aligned to a map with a coordinate system. It works by assigning ground control points (GCPs) to the raster image for which the coordinates are known. 

In order to georeference all 18 of the historical maps in the collection, I turned to QGIS, an open source GIS application. QGIS has a georeferencer feature which allows you to select GCPs and assign them to coordinates on a base map. The first step is to add the OpenStreetMap tile as the base map and zoom in to the correct location (in this case, Leuven), and subsequently to upload the raster image, i.e. the historical map.

Figure 4: Views of the QGIS interface, adding the OSM tile (left image) and the raster file (the historical map in jpg format; right image).

Next, within the QGIS georeferencer, a point on the historical map is selected to add a GCP. When a point is chosen, a popup box lets you either manually enter the coordinates or choose them from the map canvas, in this case, the OpenStreetMap which was added in the first step. The same spot is selected on the OSM map, the coordinates get filled in and are assigned to that GCP. The same steps are then repeated to add more GCPs. 

Figure 5: Views of the QGIS georeferencer showing a popup box after selecting a point on the historical map through which to add a GCP (top image), the OSM on which to select the same point (bottom left image) and the assignation of the coordinates to the GCP on the raster image (bottom right image).  

A minimum of 4 GCPs should be added that are evenly distributed around the image. The more GCPs, the more accurately the georeferenced image will align with the base map. Once a sufficient number of GCPs are added, the georeferencer is run and the historical map is aligned over the base map. Depending on how accurately the GCPs were chosen and how accurate the historical map is, there will be some warping and distortion of the historical map. 

Figure 6: The QGIS interface shows the historical map of Leuven georeferenced on top of the base map. 
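For readers who prefer scripting over the QGIS graphical workflow I used, the same georeferencing step can be approximated with GDAL’s Python bindings, as sketched below. The GCP pixel/line positions and longitude/latitude values are placeholders, and this is an alternative illustration rather than the method actually applied in the project.

```python
# Alternative sketch (not the QGIS workflow used here): georeference a scanned map with GDAL.
# The GCP pixel/line and lon/lat values below are placeholders.
from osgeo import gdal

gcps = [
    # gdal.GCP(lon, lat, elevation, pixel, line)
    gdal.GCP(4.7009, 50.8823, 0, 512, 388),
    gdal.GCP(4.7102, 50.8795, 0, 2710, 455),
    gdal.GCP(4.6981, 50.8701, 0, 600, 2899),
    gdal.GCP(4.7115, 50.8688, 0, 2850, 2960),
]

# Attach the GCPs to the scanned map, then warp it into a georeferenced GeoTIFF.
gdal.Translate("map_with_gcps.tif", "historical_map.jpg", GCPs=gcps, outputSRS="EPSG:4326")
gdal.Warp("map_georeferenced.tif", "map_with_gcps.tif", dstSRS="EPSG:4326")
```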

Creative Reuse: Making the interactive online map

The Leaflet JavaScript library was used to make the interactive online map. Leaflet is open source, has a lot of mapping functionalities, and is well documented. Following the Leaflet Quick Start Guide, I was easily able to write the code to set up the basic map. Instead of using Mapbox (which has a limited number of free tile views) as the tile layer for the map, as suggested in the quick start guide, I used CartoDB, a stylized OpenStreetMap tile layer.

Since there are 18 historical maps in the Views of Leuven collection, I set up 18 HTML pages, one for each map. Each of the HTML map pages uses the same core JavaScript file; however, some custom JavaScript is embedded in each HTML file, namely the snippet which indicates the historical map used as the overlay, shown below. Following the documentation for the image overlay, two pieces of information are needed: the URL to the image and the image bounds, which are coordinates that indicate where the corners of the overlaid image should be. The coordinates for the image bounds were captured from the georeferenced map in QGIS.

Figure 7: JavaScript code which adds historical map as on overlay to the base map. The imageUrl gives the link to the historical map and the image bounds are the coordinates for the northeast and southwest corners of the overlaid image.
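The project itself used Leaflet’s JavaScript API directly, but the same overlay idea can be sketched in Python with the folium wrapper around Leaflet. The image URL and corner coordinates below are placeholders, not the actual values taken from the georeferenced maps.

```python
# Illustration in Python (folium wraps Leaflet); the project itself used Leaflet's JS API directly.
# Image URL and bounds are placeholders; real bounds come from the georeferenced map in QGIS.
import folium

m = folium.Map(location=[50.879, 4.701], zoom_start=14, tiles="CartoDB positron")

folium.raster_layers.ImageOverlay(
    image="https://example.org/historical_map_leuven.jpg",  # placeholder overlay image
    bounds=[[50.865, 4.685], [50.893, 4.717]],               # [[south, west], [north, east]]
    opacity=0.7,
).add_to(m)

m.save("leuven_historical_map.html")
```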

With the map pages set up with the historical map overlay, the next step was to add all of the places to the map. The d3 library let me parse the CSV file and add all of the locations (by using the coordinates that were added in the data enrichment stage) as points to the map. Using the Leaflet popup feature, I added the name of the place and a Bootstrap carousel of the thumbnail images to the popup.

Figure 8: Map with locations as points added. Popup includes the place name and a carousel of thumbnail images of that place.

From there, in the JavaScript file, I used loops with conditional statements (if/else logic) to add the places to different layers using Leaflet’s layer functionality, based on the category and century taken from the Local Uniform Title Category and the Rounded Date in the CSV file. The map includes 12 layers: Leuven Places (which includes all places), Religious Buildings, Public Place, University Buildings, Panoramic Views, Parks & Waterways, Objects, 16th Century, 17th Century, 18th Century, 19th Century, and 20th Century Places. In order for the if/else logic to work properly, the Local Uniform Title Category could not be multi-valued. If an image belonged to two categories, for example religious buildings and panoramic views, the record was duplicated in the CSV file: in one row the Local Uniform Title Category was listed as religious buildings and in the second row as panoramic views. This is why the modified dataset has more rows than the original dataset.

For some final touches on the map, the Leaflet Marker Cluster plug-in was used to cluster the markers that were close to each other, so the map wasn’t overwhelmed with markers. The Leaflet Extra Markers plug-in was used to change the color and add icons and numbers to the markers, allowing for easy identification of the category of each place.

Enjoy exploring our selected Views of Leuven!

Figure 9: Map with the customized, clustered markers showing the layers that are available.

For a more in-depth look, the modified CSV file and samples of the JavaScript, HTML, and CSS source files are available on the Mapping the Views of Leuven GitHub repository. Are you interested in creatively reusing digitized cultural heritage data? Check out the KU Leuven Libraries’ GitHub account to see all of the cleaned collection-based datasets which are available as open data. And to see the finished product of this work, check out the Views of Leuven online map here.

Mind the gap? Open Access spectral data for documentary heritage digitisation


by Hendrik Hameeuw & Bruno Vandermeulen

This post presents the publication of an open access spectral dataset which, besides other areas of application, provides insights and opportunities for better fine-tuned digitisation results for documentary heritage.

The Imaging Lab of KU Leuven Libraries has a strong focus on digitisation and imaging of documentary heritage. A crucial step in the digitisation process is the image creation: the transition of the physical object into a bitmap or array of pixels. During this process, the colors of the object to be photographed have to be translated into a particular colorimetric value per pixel. The better that translation is fine-tuned, the more accurate the final result will be and the more the viewer can be assured that the digital presentation matches the original object as it would appear in similar lighting conditions (=reliable data). To fine-tune the translation, a reference color target is used in combination with specific software to build a bespoke ICC profile (=translator). This reference color target consists of an array of colored patches for which the colorimetric values are known.

BUT …

… the color patches on the applied color targets do not always represent all the tints, tones, shades and hues typical for documentary heritage. So what is the problem … ?

… the digitisation of historical documentary heritage is not the same as wedding photography, product photography or even digitising the colorful art work of Kandinsky. Most of these materials, especially the historical ones, have a limited gamut (range of colors), and in particular contain less pronounced colors.

… and thus, the colors selected for most reference color targets are quite the opposite: they have a wide gamut (well spread across the theoretical color space) and they include the more pronounced colors.

Consequently, when these reference color targets are used to calibrate colors, they do not necessarily calibrate the colors photographers encounter when digitising documentary heritage. And that is unfortunate, as the whole effort of profiling the colors for specific lighting conditions to obtain a faithful digital representation documenting the original might be in jeopardy! Or, less dramatically: we are aware this is a challenge and this job can be done better!

HOW TO PROFILE COLORS

To further explore and understand the issue, let’s first take a step back: how is color profiled? Cameras (ranging from professional to smartphone) have embedded software or algorithms which interpret the incoming reflected light and, for each pixel, ‘translate’ it into colorimetric values. These algorithms are predetermined, to assure the images look good. In fact, the standard algorithms will accentuate and shift some colors, focusing more on pleasing color as opposed to accurate color. Color scientists know that, the sales managers of camera manufacturers know that, and such processed images are appreciated by the customers buying those cameras. Thus, a perfect world!

When the digitisation process aims at creating a digital surrogate resembling the original as closely as possible (including its color), this strategy falls short. A solution is to capture the object in a raw format and process the data with dedicated raw processing software (Capture One, Lightroom, Phocus, RawTherapee, …). In such software the color profile that comes with the camera can be disabled and replaced by a custom, tailor-made profile.

Why custom-made? Well, the appearance of a surface, and thus its color as reflected light, changes when the lighting conditions change: the position of the lights, their color temperature, … In a digitisation studio all those parameters can be controlled and need to be kept the same throughout a digitisation process. In such a controlled environment it is possible to obtain close-to-perfect digital representations if the correct procedures are followed and adequate hard- and software is used.

To create an in situ color profile for the specific and standardized (lighting) conditions, a reference color target is used. The most popular and widely used of these targets is the ColorChecker Digital SG. The basic idea of such manual color calibration is straightforward:

  1. a standardized illumination set-up for imaging documentary heritage is established (=photo-studio)
  2. an image is taken from a reference color target for which the colorimetric values are known (=calibration target)
  3. based on the result obtained in step 2 under the illumination conditions of step 1, software can estimate how the colors in the image data should be interpreted in order to obtain a correct representation, without taking into account the color translation of the camera’s own profile. As such, an in situ color profile is calculated. (=color profiling)
  4. the color profile calculated in step 3 is applied to all images taken with the same illumination set-up as in step 1 (=color-calibrated digitisation).

Central in this process is the reference color target of which an image is taken. This target consists of a number of different neutral and colored patches. For each patch the colorimetric values are known or measured and stored in a text file. Using a target representing all the colors in the visible spectrum is impossible. To overcome this, manufacturers of reference color targets try to include a selection of colors more or less evenly spread across the visible spectrum. At the same time they try to include a sufficient number of tones that frequently appear in photographs, such as skin tones, sky, green vegetation, …

When the photographer has set up the equipment (camera and lighting) (1.), the target is photographed and its reflection or response under these specific conditions is registered by the camera (2.). At this moment, the response (measured energy) is uninterpreted! Color profiling software measures the response of the patches in the image and creates a table (LUT) linking these values to the reference values of the target (the text file). An ICC color profile translates the measured values into the reference values (3.). The ICC color profile is stored and the photographer applies it to all the images made under the same conditions (light position, …) (4.). If these conditions change (which, by the way, includes the position of the camera or the specific lens), a new color profile needs to be made.

Thus, with the help of a reference color target, software (in a camera or computer) can calculate how the registered (observed) energy captured by the camera’s sensor should be translated to reproduce the colors of a surface as they are in reality.
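To make that principle concrete, here is a minimal sketch in Python of the kind of translation such profiling software derives. It is not an ICC profile: it simply fits a linear correction that maps the camera’s measured patch responses onto the known reference values, and all numbers are invented for illustration.

  import numpy as np

  # Raw, uninterpreted camera responses for four patches (invented values).
  measured = np.array([
      [0.20, 0.12, 0.08],
      [0.55, 0.48, 0.40],
      [0.10, 0.30, 0.45],
      [0.80, 0.75, 0.70],
  ])
  # Known reference values for the same patches (the target's text file).
  reference = np.array([
      [0.25, 0.10, 0.05],
      [0.60, 0.50, 0.38],
      [0.08, 0.32, 0.50],
      [0.85, 0.78, 0.72],
  ])

  # Least-squares fit of a 3x3 matrix M so that measured @ M ≈ reference.
  M, _, _, _ = np.linalg.lstsq(measured, reference, rcond=None)

  def correct(pixels_rgb):
      """Apply the fitted correction to an (N, 3) array of raw pixel values."""
      return np.clip(pixels_rgb @ M, 0.0, 1.0)

  print(correct(measured))  # lands close to the reference values

An actual ICC profile is far more sophisticated (it works in a device-independent colour space and uses many more patches), but the step from measured to reference values is the same idea.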

Profiling the colors that matter

One has to define the discrepancy between the colors on the reference color target and the colors of the documentary heritage we digitise on a day-to-day basis. This should provide the insight needed to understand whether this discrepancy – the gap – is a purely theoretical problem or a real-life issue. At the Imaging Lab of KU Leuven Libraries we decided to make that effort. In collaboration with our colleagues of Special Collections we defined which original historical materials are representative of library heritage and archive collections and started measuring their spectral responses with a standard reflective spectrometer (Eye-One (i1) Pro Photo) (see below: ‘The Open Access Spectral Data’). The result provides the spectral responses and corresponding colors, with the attested tints, tones, shades and hues typical of documentary heritage.

When the spectral data is inspected, it can immediately be observed that the attested colors lie in the yellow, brown and slightly red regions of the color space (below, A & B). And secondly, when this cluster of measured colors is compared with the spread of patches on one of the most popular color calibration targets (ColorChecker Digital SG), very interesting insights are revealed (below, C & D).

Visualizing and comparing the spectral data shows that the measured historical materials fall in a zone represented by very few color patches on the ColorChecker Digital SG (see also the video below for a more in-depth discussion by Don Williams). That means that this zone, which contains precisely the colors that are common on historical documentary heritage, will remain poorly profiled. The spectral dataset accentuates this very well.
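Such a comparison is easy to reproduce once both sets of colours are expressed as CIELAB values. The sketch below plots measured heritage colours against the patches of a colour target in the a*/b* plane; the file and column names are assumptions for illustration, not the actual layout of the published dataset.

  import pandas as pd
  import matplotlib.pyplot as plt

  # Hypothetical input files: one CSV with the measured heritage colours,
  # one with the reference values of the colour target patches.
  heritage = pd.read_csv("heritage_spectral_measurements.csv")
  target = pd.read_csv("colorchecker_sg_reference.csv")

  fig, ax = plt.subplots(figsize=(6, 6))
  ax.scatter(heritage["a_star"], heritage["b_star"], s=10,
             label="measured heritage colours")
  ax.scatter(target["a_star"], target["b_star"], s=40, marker="s",
             label="colour target patches")
  ax.axhline(0, color="grey", lw=0.5)
  ax.axvline(0, color="grey", lw=0.5)
  ax.set_xlabel("a* (green to red)")
  ax.set_ylabel("b* (blue to yellow)")
  ax.legend()
  plt.show()

A plot like this makes the cluster of yellows and browns, and the sparsity of target patches in that region, immediately visible.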

A gap is identified between the colors that are commonly profiled and the colors that should be profiled.

The colors profiled on ‘standard’ calibration targets are no perfect match for the colors that should be profiled: a gap can be observed. Consequently, even when the current digitisation standards for documentary heritage, such as metamorfoze and FADGI, are followed, it is unclear how accurately a number of specific, frequently occurring heritage colors – the ones that matter most – are registered. As such, subtle variations and changes in their materiality (e.g. due to time, light exposure, conservation interventions, …), which in theory should be observable based on the colorimetric values, can remain undetected or will be poorly represented.

What NOW?

  • Further study and action are possible. The KU Leuven Libraries spectral data shows there is still room for improvement, and this needs to be explored further. The spectral dataset used for the conclusions above counts 433 measurement points, selected on typical historical documents in a Belgian Special Collections library. In the broader international context, this exercise should be repeated to establish extended spectral insights into historical documentary heritage across cultural and material traditions.
  • Based on the above, it seems wise to populate color calibration targets with other/extra patches more closely related to the type of imaged materials. This is not new: for the DT Next Generation Target (v2) a similar exercise has already been done, leading to a calibration target with extra ‘heritage colors’. The KU Leuven data has also already been matched with the selection of color patches on the new FADGI ISO 19264 target (for a comparison, see the video above). These new targets will need to be tested, not only on their ability to calibrate standardised colors in general, but more importantly on their added value in color profiling documentary heritage materials.
  • To facilitate and support future activities and research with spectral data of historical (documentary) heritage, this data should be made available to the broad community of heritage scientists.

THE OPEN ACCESS SPECTRAL DATA

The entire KU Leuven Libraries spectral dataset has been published online as open data. Together with the necessary documentation, this gives the opportunity to use the data for any future work in which spectral characterisation of documentary heritage materials is wanted.
The dataset is published on zenodo.org as Hameeuw Hendrik, Vandermeulen Bruno, Van Cutsem Frédéric, Smets An & Snijders Tjamke (2021): KU Leuven Libraries Open access Spectral data of historical paper, parchment/vellum, leather, inks and pigments (Version 1.0) [Data set]. http://doi.org/10.5281/zenodo.3965419.
Feel free to work with the dataset. When you do, do not hesitate to reach out to us if you have any questions. If you have feedback and/or are interested in collaborating on the topic: digitalisering@kuleuven.be.

Data-level access to Belgian historical censuses

By André Davids and Nele Gabriels

KU Leuven Libraries is creating data-level access to Belgian historical censuses. This blog post gives some context and a brief overview. To know more about the digitisation process, you can read our colleague André Davids’ article “Die Texterkennung als Herausforderung bei der Digitalisierung von Tabellen”, in O-Bib. Das Offene Bibliotheksjournal / Herausgeber VDB, 7(2), 1-13. https://doi.org/10.5282/o-bib/5584 (in German).

Censuses have been carried out for some 5000 years for the purposes of tax collection and the military. Soon after the foundation of Belgium, however, the Belgian sociologist and statistician Adolphe Quetelet led the way for the first censuses which, from the start, were also intended for research. These 1846 censuses covered not only population but also agriculture and industry. They were highly acclaimed across Europe and followed by many subsequent censuses.

These Belgian censuses are indeed a true treasure trove for socio-economic research. Their analysis, however, is very time-consuming due to their extent and format. Complex questions such as ‘What impacts salary levels more: the rise of industry or rather the location?’ are very difficult to answer when working with the originals.

Statistique de la Belgique: Industrie, recensement de 1880 (1887). The open pages reveal the complexity of the census’ content.

Converting the many volumes into digital format would open up a whole range of new possibilities, so KU Leuven Libraries Economics and Business’ project ‘Belgian Historical Censuses’ (website in Dutch) is currently digitising the physical volumes in order to create data-level access to the tables in a spreadsheet format. Based on research needs, the primary focus is on the industrial censuses organised between 1846 and 1947.

The basis, of course, is transforming the physical copies into digital format. Based on their physical state, the KU Leuven collection items are either scanned (modern materials in good condition) or photographically digitized (heritage and fragile materials), using presets and processing techniques, respectively, that improve the success rate of OCR.

Text recognition starts with layout analysis. ABBYY FineReader assigns categories to the various parts of the image files: image, text, table. Manual checks and adjustments ensure a correct interpretation of the table structure.

Layout analysis in ABBYY FineReader

Once the layout analysis is successfully completed, ABBYY FineReader executes the actual text recognition. The result is, again, manually checked.

Executing the actual text recognition

A searchable PDF file as well as an Excel spreadsheet can now be exported. To ensure that the layout of the spreadsheet correctly reflects the tables, the spreadsheets are intensively edited.

Exporting the digitised object as Excel still requires intensive editing of the spreadsheet.

OCR’ing numerical data is particularly challenging. Contrary to text (where a single incorrectly OCR’ed letter still allows the word to be interpreted correctly), any error at the level of a single figure (e.g. 1 being read as 7) has major consequences when working with the data.

The only possible solution is manually checking and adjusting the numerical data – a very intensive step in the quality assurance process, especially for the older censuses due to the font used in these documents. QA operators recalculate the totals of separate rows/columns and compare these with the totals included in the censuses. When the corresponding totals of the original and the digital object differ, the individual numbers in the row/column are corrected based on the original volume.

Checking and adjusting the numerical data through comparison of totals in the digitised object (Excel) and the original object.
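The cross-check itself is simple to automate. Below is a minimal pandas sketch, assuming hypothetical column names (a ‘Commune’ label column, several value columns and a printed ‘Total’ column), that flags the rows whose recomputed sum disagrees with the printed total.

  import pandas as pd

  # Hypothetical export of one OCR'ed census table.
  table = pd.read_excel("census_1880_industry.xlsx")

  value_cols = [c for c in table.columns if c not in ("Commune", "Total")]
  recomputed = table[value_cols].sum(axis=1)

  # Rows where the recomputed sum differs from the printed total
  # need to be checked manually against the original volume.
  mismatch = table[recomputed != table["Total"]]
  print(f"{len(mismatch)} rows to verify against the original")
  print(mismatch[["Commune", "Total"]].assign(recomputed=recomputed[mismatch.index]))

Any flagged row still has to be corrected against the original volume, as described above; the sketch only narrows down where to look.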

Once a searchable PDF and a computationally ready Excel file have been created, structural metadata is added. This metadata and both files are ingested into the library’s preservation environment Teneo. The metadata allows the digital objects to be linked to the correct descriptive metadata in the library catalogue Alma and the object to be presented to the public as open data. The data is now ready to be used!

Teneo viewer showing the machine-readable PDF (in the centre) and partial metadata (on the right). The computationally-ready spreadsheet (indicated on the left) may be downloaded too.

Launch pixel+ viewer: New dimensions take a deeper look at heritage

press releases in English, Dutch, French

Together with the Art & History Museum and the Royal Library of Belgium (KBR), KU Leuven is launching an online open access application to view heritage objects dynamically and interactively online. This pixel+ viewer allows you to view centuries-old objects in a different light and reveal hidden details.  

Japanese print on paper (© KU Leuven Libraries collections) in the pixel+ viewer

As a result of the Corona crisis, museums and other heritage institutions, both in Belgium and abroad, currently allow little or no physical access. This puts the consultation of objects and the study of our past under severe pressure. In part, we can fall back on digitised objects, notes and old publications, but these only represent part of the information, which means that important details can be overlooked. Fortunately, the sector, in collaboration with engineers, has devised solutions to remedy this.

In the heritage sector, the digitisation of objects has long been the focus of attention and experimentation. For the public, this usually results in an online photo that can be zoomed in on or whose contrast can be adjusted. These are purely colour images; standard digital photographs hold no extra information. However, different types of image scanners register many more characteristics of a surface than just the colour. Being able to visualize this information in a handy online tool therefore offers new possibilities for anyone working with heritage objects. Think, for example, of the KBR drawings by Pieter Bruegel the Elder that were recently examined by KU Leuven. The researchers were able to study the paper down to the fibre using their Portable Light Dome (PLD) scanner. They also got a much better view of the extensive range of techniques used by the old master.

Detail of an original Pieter Bruegel the Elder drawing from 1557 (KBR: II132816, Luxuria); without colour, the imprinted stylus traces of the engraver become visible (© Fingerprint, KBR and KU Leuven).

Software is the key

Over the past 15 years, KU Leuven researchers, together with various partners from the heritage sector, have developed digital techniques that can visualise objects to an unprecedented level of detail: the PLD scanner. “With this method, they illuminate an object from a large number of angles and take photos of it, the so-called ‘single-camera, multi-light recording’,” says Hendrik Hameeuw, co-coordinator of the project at KU Leuven. “The way in which this recording is subsequently processed determines which characteristics of the surface, such as relief or texture, the software can show and thus how the user experiences the object.”

New universal file format

“To be entirely complete, we actually have to look at the file types of these interactive datasets,” says Hameeuw. Most heritage institutions calculate and store these types of images of their heritage in a specific image format, usually RTI/HSH. The software developed in Leuven works with PLD files (ZUN, CUN) that have extra functionalities compared to those RTI/HSH files. Pixel+ now makes this calculation method available to the whole world, not only by offering it online, but also by introducing a new kind of container file for it: glTF. “Compare it with an ordinary photo on your computer. It will probably be a JPEG or GIF file. But if you want to work with it in Photoshop, the program will turn the same image into a PSD file.” These glTFs are compatible with both the Leuven PLD and the RTI/HSH files. “With this we offer a new universal standard for this kind of images and we also open them up immediately via the online pixel+ viewer, a kind of free Photoshop for ‘single-camera, multi-light recording’ images.” This allows both RTI/HSH and PLD files to be studied and compared within the same program for the first time.

A new world

Pixel+ extracts a lot of extra information from the available data. The objects, such as old coins, miniatures or paintings, suddenly acquire extra dimensions after hundreds of years, which can be used for research on these objects to gain new insights. Especially in the field of 3D (geometry) and the correct understanding of the reflections of light on an object, the Leuven software is taking major steps forward.

“The technology is interesting for many objects, from clay tablets and coins to paintings or medieval manuscripts,” explains Hameeuw. “The software allows, among other things, the objects to be viewed virtually under different incidences of light, the relief to be mapped at pixel level or a 3D visualisation to be generated.” Frédéric Lemmers of the KBR Digitisation Department adds: “By combining it with multi-spectral imaging, researchers recently discovered that the heads of some figures in KBR’s 13th-century Rijmbijbel were painted over at a later date.” At the Art & History Museum, the technology was used to make heavily weathered texts on almost 4,000-year-old Egyptian figurines readable again.

Institutions from all over the world, from the Metropolitan Museum of Art in New York (USA) to the Regionaal Archeologisch Museum a/d Schelde in Avelgem (Belgium), will be able to upload, consult and study their own datasets or files in pixel+. The software converts the information according to various new standards and allows users to access the virtual heritage objects interactively. “This development really is a milestone for the heritage sector”, emphasises Chris Vastenhoud, promoter of the project from the Art & History Museum. “A whole new world will open up for heritage institutions worldwide. They will be able to document and share a lot of additional information in order to communicate about the objects in their collections”.

Pixel+ is available to everyone at http://www.heritage-visualisation.org with examples of objects from the collections of the Art & History Museum, KBR and KU Leuven.


The online pixel+ viewer with an example of a cuneiform tablet from the collection of the Museum Art & History, Brussels. (© Art & History Museum and KU Leuven).

The project is a collaboration between Art & History Museum, KU Leuven Department of Electrical Engineering, KU Leuven Illuminare, KU Leuven Libraries Digitisation and KBR; and was funded by the Federal Science Policy Office (BELSPO) through the BRAIN-be programme (Pioneer projects).


At the beginning of April 2020, the pixel+ project staff presented their results during the SPIE conference, which was held online as a result of the Corona crisis. The paper below was subsequently published:

Vincent Vanweddingen, Hendrik Hameeuw, Bruno Vandermeulen, Chris Vastenhoud, Lieve Watteeuw, Frédéric Lemmers, Athena Van der Perre, Paul Konijn, Luc Van Gool, Marc Proesmans 2020: Pixel+: integrating and standardizing of various interactive pixel-based imagery, in: Peter Schelkens, Tomasz Kozacki (eds.) Optics, Photonics and Digital Technologies for Imaging Applications VI, Proc. of SPIE Vol. 11353, 113530G. (DOI: 10.1117/12.2555685)

read paper – see presentation

Additional examples can be viewed and created at http://www.heritage-visualisation.org/examples.html

Opening up a little more: a minimal-computing approach for developing Git and machine-actionable GLAM open data

by Mariana Ziku and Nele Gabriels

One of the current hot topics in the GLAM (Galleries, Libraries, Archives, Museums) sector is that of presenting collections as data for use and reuse by as diverse a public as possible. Having adopted an open data policy and working towards FAIR data, a previous blog post described our implementation of images as open data. Digitisation does not only create images, however, so we have started the exciting road to disclosing other data too. During 2019, we were delighted that Mariana Ziku, at the time an Advanced MSc in Digital Humanities student at KU Leuven, took up an internship at our Digitisation Department and set out to investigate how to start navigating this road. Here is her story!

Hi there! I’m Mariana, an art historian and curator. During Spring 2019, I had the opportunity to join the Digitisation Department at KU Leuven Libraries, as the department set the scene for extending its open digital strategy. The goal: investigating new ways for sharing and using the libraries’ digitised collections as data. A Digital Humanities (DH) traineeship and research position was opened under the title “Open Access Publishing in Digital Cultural Heritage” to examine the data aspect of heritage collections in the context of open access.

This blog post gives a brief insight into the research, aspirations and practical outcomes of this internship, which resulted in the pilot study ‘Open access documentary heritage: the development of git and machine actionable data for digitised heritage collections’. 

Figure 1: The pilot stack for creating machine-actionable documentary heritage data for KU Leuven Libraries’ Portrait Collection. CC-BY 4.0. Vector icons and graphics used for the KU Leuven Pilot Stack infographic: at flaticon.com by Freepik and at Freepik by Katemangostar.

Open-access development in GLAMs

First on the agenda: delving into KU Leuven Libraries’ digital ecosystem. This included looking into the library’s digitisation workflows, imaging techniques and R&D projects (like this one) as well as discovering the elaborate back-end architecture and metadata binding it all together. At that time, two anticipated open-access projects were going public: the ‘Digital Heritage Online’ browsing environment for the library’s digitised heritage collections, and the enhanced image viewer, offering a more functional user interface with clear communication of the licensing terms and with full object- and file-level download functions for the public domain collections.

With this view of KU Leuven Libraries’ open digital strategy in mind, we explored the current open-access publishing landscape among other GLAM institutions. We got an indicative overview through McCarthy and Wallace’s ongoing project “Survey of GLAM open access policy and practice”. The survey is an informal and open Google spreadsheet, allowing GLAM institutions to list themselves and index their openness spectrum. By running a Tableau data analysis and visualisation on selected survey facets, we outlined the open access digital ecosystem of approximately 600 GLAM institutions. This revealed, among other things, that museums are the most active type of cultural institution in the field of open access, followed by libraries. The countries with the most open institutions are Germany, the U.S.A. and Sweden.

Figure 2: A data visualisation board of instances in the Survey of GLAM open access policy and practice, an ongoing project by McCarthy and Wallace indexing the openness spectrum of GLAM institutions for enabling data re-use of their collections. (Data captured April 2019. The list has grown since then.)

The survey provides a valuable insight into instances of open data that each GLAM institution makes available online. The data is currently organised in 17 columns, including links to the Wikidata and GitHub accounts of the listed GLAMs. Although the majority of indexed institutions had a Wikidata account, the number of GLAMs with an active GitHub account was low (approximately 50 out of 600 institutions). Nevertheless, looking for good practices in open-access publishing, we started to explore existing GLAM activity on GitHub, examining, among others, data repositories of prominent institutions that have long been active on GitHub (e.g. MoMA, MET, NYPL), wiki-documented projects (e.g. AMNH) and various curated lists. GitHub seemed a time- and cost-effective way to provide access to GLAM open data, which convinced us to further explore the potential of collaborative Git platforms such as GitHub for digital cultural heritage.

The question for the internship pilot was set: how can Git be used within the framework of GLAM digital practices and could it become part of a strategy for creating open data-level access to digitised GLAM collections?

Towards the computational transformation of documentary heritage as open data: a pilot study for KU Leuven Libraries

Documentary heritage metadata created and curated within libraries is considered to be well-structured and consistent. In many ways this is indeed the case, due to the use of internationally recognised metadata standards and the creation of description formats for specific material types. Even so, the structure of this metadata as-is poses many challenges for computational use, because libraries primarily create metadata for human-readable cataloguing purposes rather than for digital research. In addition, library metadata is created over a long period of time by many people who make different interpretational choices and, occasionally, errors. Even the metadata formats may change over time, thus creating inconsistencies. As a result, the data may become “dirty” on top of its structural challenges for computational use.

As digital scholarship becomes increasingly prevalent, the development of new systematic and reliable approaches to transform GLAM data into machine-actionable data is critical. In this context, the concept of Git can be helpful. Git is an open-source version control system which maintains a history of fully functional changes to uploaded data or code. It also enables simultaneous collaborative work, for example by keeping a trustworthy record of coding issues and of contributions from a wider community of users. However, using Git to access CSV and other spreadsheet files on a data level is not quite there yet for the wider user community: Git platforms like GitHub do not (yet) support collaborative, hands-on work on datasets contained in CSV files with the option to track changes in real time with Git. Nevertheless, GitHub can be useful for publishing open GLAM data, as it can foster engagement, help build a knowledgeable community and generate feedback.

The pilot set out to provide a test case by publishing the digitised Portrait Collection as open, computational-fluent data in a so-called ‘optimal’ dataset. Approximately 10.000 graphic prints from KU Leuven Libraries’ Special Collections had been digitised several years ago and provided with descriptive and structural metadata. Now, the dataset would be prepared to function as an open-access documentary heritage resource, specifically designed for digital humanities research and made available via a new KU Leuven Libraries GitHub repository.

Open GLAM data: good practices, frameworks and guidelines

As we were looking into data-specific open access for documentary heritage, the pilot study first investigated practices, standards and technical deployments that reframe cultural heritage (CH) as open data. This included the analysis of standardised policy frameworks for creating GLAM data repositories on the one hand and research infrastructures supporting CH data-driven scholarship on the other.

We investigated good practices for creating trustworthy GLAM data repositories and digital resources that move towards open-access, improved discoverability and active reusability. Interestingly, these were based on (amongst others):

In addition, the publications of the “Collections as Data” project (Thomas Padilla et al, May 2019) were particularly helpful for gaining a better understanding of the challenges that GLAM institutions encounter in providing resources for digital scholarship and in developing the necessary digital expertise. 

A minimal-computing approach to developing git and machine-actionable data 

Turning to the dataset preparation for publication, it was essential to develop both a process that required only minimal programming knowledge and a step-by-step guide for GLAM data processing that would be as accessible as possible within a humanities context. That way, over time more machine-actionable datasets from the library’s heritage collections could be developed as part of the general working processes of the library. 

The Metadata Services Department took on the metadata export from the metadata repository. Using MarcEdit and FileMaker, both tools that are already part of the library’s general workflows, the initial MARCXML was transformed into a CSV format. 

Next, we needed to gain insight into the content of the dataset through simple data analysis. We used Jupyter Notebooks hosted in Colaboratory to ensure real-time shared use of the findings and to allow everyone to look at and interact with the data analysis outcome. Analysing the CSV file with short Python scripts revealed basic information on the data quality inside the columns. Furthermore, columns were identified as containing categorical, qualitative or numerical data, as columns with unique values or as almost empty columns to be discarded. Using Colaboratory requires knowledge of the Python programming language, however. The search for a minimal-computing approach to GLAM data processing required looking for tools that are easier to use.

https://colab.research.google.com/drive/1aiiB0KUxowDwlg9gg_nB-5T-_966-Z7M

Figure 3: Click on the link above to go to Google Colab and see the notebook with the first set of data explorations of the initial CSV file (before data processing for its computational transformation). The short Python scripts can be copied and easily adjusted for the exploration of other datasets.
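As an indication of what such a first pass looks like, here is a generic sketch along the lines of the notebook (a simplified stand-in, not the notebook itself; the file name is hypothetical):

  import pandas as pd

  df = pd.read_csv("portrait_collection_export.csv")

  print(df.shape)                        # number of records and of columns
  print(df.isna().mean().sort_values())  # share of empty cells per column

  # Columns that are almost empty are candidates for removal; columns with
  # few distinct values behave like facets; columns where every value is
  # unique are identifier-like.
  almost_empty = [c for c in df.columns if df[c].isna().mean() > 0.95]
  facet_like = [c for c in df.columns if df[c].nunique(dropna=True) <= 50]
  identifier_like = [c for c in df.columns if df[c].nunique(dropna=True) == len(df)]

  print("almost empty:", almost_empty)
  print("facet-like:", facet_like)
  print("identifier-like:", identifier_like)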

Following the advice of Packed (the Belgian centre of expertise in digital heritage, now part of Meemoo), we selected OpenRefine, a free, open-source stand-alone desktop application for data processing. OpenRefine can be used without advanced programming skills, although it has elaborate capabilities as well. Basic data processes can be performed through “push-button” actions and simple text filters written in the General Refine Expression Language (GREL).

The three principal stages of data processing could be executed in OpenRefine: data cleaning, data transforming and data refining. The result was an optimal CSV file containing machine-actionable data. This data could now be used for data analysis and visualisation with minimal friction, in the context of data-driven scholarship as well as for other creative computational reuses.

Data processing

Let’s take a closer look at what data processing actually does. Having a clearer view of the dataset and a minimal-computing tool, we could now start processing the data. Transforming the CSV file into machine-actionable data required over 500 individual data processing tasks in OpenRefine. Below follows a single example for each data processing stage: cleaning, transforming and refining.

>>>Data cleaning

Data cleaning is primarily based on faceting and clustering. Faceting is a useful feature of OpenRefine that allows the browsing of columns and rows. Within these facets the data can be filtered based on subsets (‘categories’) which in their turn can be changed in bulk. 

Figure 4: Screenshot of the OpenRefine tool showing a list of the 32 categories for the facet ‘Description of graphic material’. One of these categories is used in 10.323 out of the total number of 10.512 records. 

Inspecting the column “Description of graphic material”, as shown in the image above, reveals a total of 32 different data categories into which all of the 10.512 records (i.e. the complete portrait collection) are sorted. The number of categories can be reduced, because some categories exist only due to inconsistencies in data documentation, like the dot behind ‘$oBlack-and-white’ in one of the categories highlighted in green in the example below.

Figure 5: Screenshot of OpenRefine showing two inconsistencies in data documentation. Highlighted in green is an unintended category duplication due to a faulty value (a dot added to the value ‘Black-and-white’). Highlighted in yellow is a seemingly faulty category that, upon inspection of the image file described in this record, turned out to be correct.

However, category integration is a critical process that requires detailed inspection and consideration of data nuances that may otherwise be lost. In the image above, for example, a category with a double entry of “Black and White” is not a mistake. It merely implies that the image contains two portraits, both of which are black and white.
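Outside OpenRefine, the same faceting step can be approximated with a few lines of pandas. The sketch below lists the categories of the column discussed above and merges the duplicate caused by a trailing dot; the file name is hypothetical, and OpenRefine itself does this through its facet interface rather than through code.

  import pandas as pd

  df = pd.read_csv("portrait_collection_export.csv")
  col = "Description of graphic material"

  # Equivalent of a text facet: every category and its record count.
  print(df[col].value_counts(dropna=False))

  # Merge the unintended duplicate: strip whitespace and a trailing dot.
  df[col] = df[col].astype("string").str.strip().str.rstrip(".")
  print(df[col].nunique(), "categories after cleaning")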

The image below shows another example: a column with 248 different categories that need to be reduced in number in order to become computationally amenable. The metadata coding for this content type (Typographical terms) is not based on the MARC 21 standard, but refers to a local coding system used in conjunction with MARC 21. Here too, the number of different categories can be reduced with category integration, in combination with splitting, which is explained below.

Figure 6: On the left, the initial column of the unprocessed CSV file containing approx. 250 categories of typographical terms, which made the column not fit for computational use. On the right, the outcome of data cleaning in the new CSV file: 20 major categories of typographical terms could be identified encompassing all the Portrait Collection’s artworks (Typographical terms 1) and a second column was created (Typographical terms 2) in order to include more specific information which complements the first column.

Another critical issue with data not fit for computational use is inconsistencies arising from different spellings and name variants of the same entity. The clustering feature of OpenRefine groups values that look similar, with the option to choose between different levels of sensitivity. The image below shows how data entries referring to cities were clustered in order to surface patterns that are representations of the same thing but are documented in different ways.

Figure 7: Performing clustering in OpenRefine inside the column “Place of manufacture”: an automatic identification and grouping of various spelling forms of the cities where the graphic prints were created.
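The idea behind this kind of clustering (key collision) can be illustrated with a small Python sketch: values that reduce to the same normalised key are probably variants of the same name. OpenRefine’s own fingerprint method is more elaborate; the place names below are invented examples.

  import re
  import unicodedata
  from collections import defaultdict

  def fingerprint(value: str) -> str:
      # Strip accents, lowercase, drop punctuation, sort the unique tokens.
      v = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode()
      v = re.sub(r"[^\w\s]", " ", v.lower())
      return " ".join(sorted(set(v.split())))

  places = ["Antwerpen", "antwerpen.", " ANTWERPEN", "Bruxelles", "Brussel"]
  clusters = defaultdict(list)
  for place in places:
      clusters[fingerprint(place)].append(place)

  for key, variants in clusters.items():
      if len(variants) > 1:
          print(key, "->", variants)   # candidate cluster to merge

Whether a candidate cluster is actually merged remains a human decision, just as it does in OpenRefine.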

>>>Data transformation

Going a step further with data processing, an extended transformation of many columns is recommended to further refine the repository and make the data more amenable to computational use. Typically, improvements include splitting columns, renaming entities and replacing strings.

Splitting columns can be useful when a single column holds two or more variables, as in Figure 6 above. The data becomes computationally responsive when separated and split into two or more columns.
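In pandas terms, the split shown in Figure 6 could look roughly like the sketch below; the separator and the example values are assumptions about the source data.

  import pandas as pd

  df = pd.DataFrame({"Typographical terms": [
      "etching; portrait",
      "engraving",
      "woodcut; allegory",
  ]})

  # Split on the first separator only: one main term, one optional extra term.
  split = df["Typographical terms"].str.split(";", n=1, expand=True)
  df["Typographical terms 1"] = split[0].str.strip()
  df["Typographical terms 2"] = split[1].str.strip()   # NaN when absent

  print(df[["Typographical terms 1", "Typographical terms 2"]])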

Renaming entities, too, can enhance their computational usability. Often this involves replacing strings, for example when a single column contains various URL formats that direct to different webpages. In the following example, the column “link to digital object 1” contains differently formatted URLs that display an image preview of each portrait. The URLs are mainly “delivery links” that redirect to an image display within a viewer environment: http://depot.lias.be/delivery/DeliveryManagerServlet?dps_pid=IE4849156. However, delivery links are not fit for image harvesting as they link to a viewer and not to a file. In order to enable image harvesting, we transformed the link into a direct image display of the file in thumbnail form: http://resolver.libis.be/IE4849156/thumbnail. This way, a quick image preview will be available, while the embedded thumbnail preview will be useful for more elaborate data visualisations.

Figure 8: First task from a set of successive data transformations, performed for creating a unified URL format for the image-display link. The resulting link directs to a thumbnail image preview, which can be integrated more easily into data visualisations. Data transformation was performed through a GREL expression.

The link was transformed using a GREL expression, by appending text to the string: cells['Link to digital object 1'].value + '/thumbnail'. In this case the column's name is 'Link to digital object 1' and the added text is '/thumbnail'. We can preview the resulting link on the right side of the table: http://depot.lias.be/delivery/DeliveryManagerServlet?dps_pid=FL4856542/thumbnail

In many cases, as in the above data transformation instance, successive data transformations might be needed in order to make a single data entry consistent and functional.
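Taken together, the set of transformations amounts to extracting the pid from the delivery link and building the resolver thumbnail link from it. A Python reconstruction of that overall result might look like the sketch below (an assumption about the remaining steps, not the actual OpenRefine recipe):

  import re
  import pandas as pd

  df = pd.DataFrame({"Link to digital object 1": [
      "http://depot.lias.be/delivery/DeliveryManagerServlet?dps_pid=IE4849156",
  ]})

  def to_thumbnail(url: str) -> str:
      # Pull the pid (e.g. IE4849156) out of the delivery link and
      # rebuild it as a resolver thumbnail link.
      match = re.search(r"dps_pid=([A-Z]{2}\d+)", url)
      return f"http://resolver.libis.be/{match.group(1)}/thumbnail" if match else url

  df["Link to digital object 1"] = df["Link to digital object 1"].map(to_thumbnail)
  print(df.iloc[0, 0])   # http://resolver.libis.be/IE4849156/thumbnail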

>>>Data refining

Data refining starts with fixing encoding issues in strings by repairing diacritic characters. For example, the diacritic character “é” is not properly represented and instead an arbitrary character chain appears (“eÌ”). In order to refine this correctly across all columns, we performed a transformation through the “All” arrow-down menu with a GREL expression. OpenRefine subsequently shows the number of transformations automatically applied to each column.

Figure 9: Refining data by reinterpreting diacritic characters throughout all columns with a GREL expression. The top right image shows the high number of character modifications within column ‘Main title’, with 427 cells containing  the diacritic character at least once.
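This kind of damage (‘é’ showing up as ‘eÌ’ plus a stray byte) is typical of UTF-8 text that was decoded with the wrong encoding. The GREL expression used in OpenRefine is not reproduced here, but the Python sketch below illustrates the same repair: undo the wrong decoding and compose the accents into single characters.

  import unicodedata

  def repair(text: str) -> str:
      try:
          # Undo a UTF-8 byte sequence that was mis-read as Latin-1 ...
          text = text.encode("latin-1").decode("utf-8")
      except (UnicodeEncodeError, UnicodeDecodeError):
          pass                      # text was already fine; leave it untouched
      # ... and compose 'e' + combining accent into a single 'é'.
      return unicodedata.normalize("NFC", text)

  # Simulate the mojibake on a name with diacritics and repair it.
  broken = "De\u0301sire\u0301".encode("utf-8").decode("latin-1")
  print(broken)          # shows 'e' + 'Ì' plus an invisible control byte
  print(repair(broken))  # Désiré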

GitHub Setup

With the dataset now ready for publication, we set up a simple GitHub environment to enable access to and reuse of the data. We created an institutional profile and prepared two readme files: one with a general introduction to the library’s objectives with regard to open access publishing, and another with a detailed description of the uploaded dataset and its technical aspects.

Useful applications and features have been integrated that further support the use of GitHub as an open GLAM repository. These include the integration of the open research repository Zenodo as well as the “All Contributors” bot extension.

The integration of Zenodo makes the data repository citable through the minting of a Digital Object Identifier (DOI). Zenodo’s support of DOI versioning allows a specific dataset and its potentially different versions to be cited. Zenodo is mounted on GitHub, allowing us to choose a suitable license and a number of other useful settings for the dataset (e.g. export formats, citation style, recorded versions).

Figure 10: The Portraits Collection dataset is available to download and cite from the open research repository Zenodo, where a DOI has been especially minted for the dataset. Zenodo is accessible through an ORCID or GitHub account, and the Zenodo DOI badge within GitHub has been easily integrated with the following Markdown snippet: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3460785.svg)](https://doi.org/10.5281/zenodo.3460785)

“All Contributors” is a framework developed within GitHub to attribute all possible contributors that engage with the dataset by utilising an “emoji key”. Contributors include not only people who work on code but also everyone who may ask and answer questions, design, provide examples, translate, etc. The bot is configured accordingly and can also be used to open up inquiries for data modelling to the broader community, enabling attribution within the library’s GitHub page to literally everyone with a GitHub account engaging with the dataset.

Figure 11: The “All Contributors” extension integrated in the GitHub GLAM account can acknowledge everyone engaging with the related data (not just coders). A set of emojis (shown below the avatar pictures) represents specific contribution types (e.g. reviewing, translating, asking questions, sharing ideas).

https://github.com/KU-Leuven-Libraries

—-

Looking ahead

The pilot served as a test case to develop a process for the preparation of more machine-actionable datasets through minimal-computing processes at KU Leuven Libraries. It also set up the GitHub profile of KU Leuven Libraries, initiating Git by applying the Zenodo DOI and All Contributors extensions, creating the readme files and uploading the machine-actionable dataset to its Git repository, while committing and proposing file changes by using Git. The data preparation process has been documented on a step-by-step basis, in order to create a blueprint for data processing of documentary heritage intended for computational use and to offer critical insights, identify omissions and missing information, and point to process steps that could be improved in the future.

The pilot was KU Leuven Libraries’ first step towards creating a Git repository of open and ready-to-analyse datasets of documentary heritage from the library’s collections, openly available to (amongst others) researchers and students in (digital) humanities looking to use computational-fluent open data that displays minimal discrepancies.

The pilot study for developing git and open documentary heritage data for computational use was conducted in the Digitisation Department of KU Leuven Libraries by Mariana Ziku as part of her thesis in the Advanced MSc in Digital Humanities, under the supervision of Nele Gabriels, the guidance of Bruno Vandermeulen, the training with Diederik Lanoye and the advice and good company of Hendrik Hameeuw and Mark Verbrugge. Thanks also to Bart Magnus and Sam Donvil from Packed (now Meemoo) for sharing their expertise on digital cultural heritage.

We are OPEN! Share and reuse our public domain collections (and read about how we got there)

Here’s a piece of exciting news: KU Leuven Libraries has adopted an open data policy for its digitised collections! This means that nearly 42.000 digital representations of originals from the library’s public domain holdings may be freely shared and (re-)used by all. And of course this number continues to grow.

How does it work?

It’s easy. Online visitors can check the copyright status of the images when viewing digitised collection items online. A direct link guides them to our terms of use page, where they will read that everyone is free to use and reuse images offered as open data at will, mentioning KU Leuven as the owner of the collection item where possible.

The IE viewer displays copyright status information and links to the general terms and conditions.

The information pane of the IIIF-based Mirador viewer displays the same copyright status information and link to the general terms and conditions as the IE viewer.

While we strive to keep our collections as open as possible, some items are available under a license, e.g. when the public domain originals are not part of KU Leuven Libraries’ holdings or when permission was given for copyrighted materials to be digitized and made available to the public under specific conditions. Visitors can consult the licensing conditions via the viewer.

When images are made available under a licence, the viewers (here: Mirador) link to specific conditions for use for the digital object in view.

Having checked the status of a digital object and the conditions for use, online visitors can use the direct download possibilities included in the Teneo viewer. These offer single-file or full-object download.

Images may be downloaded as single files or as full object in the IE viewer.

The downloaded images are jpg or jp2 and allow perfect readability of even the smallest written text. Visitors are now ready to become active users of the digitised collection!

How we got there

Easy as it may seem, implementing an open data policy required significant effort on various levels to ensure that online visitors would be clearly informed about the legal status of both the physical originals and the digital representations, and about what this means for their use. KU Leuven Libraries currently presents its digitised collections either in the Rosetta IE viewer (with an ‘Intellectual Entity’ generally equalling an individual collection item) or – for bespoke collections – in a IIIF-based Mirador viewer. Systems and processes had to be adjusted to show this information on a single-item level in these viewing environments.

First, data can only be freely used and shared if it can be both accessed and acquired. To this end, easy-to-use download functions were first created within the IE viewer. This viewer now offers both single-file and full-object download (see the images above). Mirador too will include download options before long.

Second, covering the bases, our team collected the rights information from legacy project descriptions and agreements, both internal to KU Leuven Libraries and (for digital objects based on originals held by other institutions) with external partners. Unclear phrasing was clarified and the images resulting from digitisation projects were assigned one of three possible legal statuses: public domain/open data, available under a license, or copyrighted.

Third, for each of these three statuses, terms & conditions for use of the digital objects were designed in close collaboration with KU Leuven’s Legal Department. Furthermore, an overview page was created detailing, for each of the digitisation projects, the licensing conditions for those digitized items that are available under a license.

Fourth, a copyright status was assigned to each of the individual digitised objects, more specifically to both the physical (public domain or in copyright) and digital (content as open data, under a license or in copyright) objects. For the originals, the descriptive metadata model was modified to include the copyright status; for the digital objects, status information would become part of the metadata in the information package presented for ingest into the preservation environment.

Terms and conditions for the use of images as open data.

And finally, fifth, we turned to metadata visualisation in the viewer environments. The metadata shown in both the IE and the Mirador viewer comes neither from the public search environment Limo nor from the metadata repository Alma, but from the digital asset preservation system, Rosetta; hence the choice to include the copyright status of the digital representations in the ingest information packages. By nature, the metadata (just like any data) in the preservation system is unchangeable. Including the extra information for those digital objects already in Rosetta was not something to be done lightly, but implementing an open data policy justified the decision.

Rather than adding copyright statuses to the existing metadata, we decided to create a standard mapping between Alma and Rosetta and replace the existing descriptive metadata in Rosetta. That way, two other issues could be addressed: the inconsistent and the static nature of the descriptive information shown in the digital representation viewers. The inconsistency was a legacy of the early digitisation projects at KU Leuven Libraries (with some projects generating extensive descriptions and others hardly any) while the static nature of the metadata is inherent to its extraction from the preservation environment.

This chart shows the four main elements of our architecture and the flow of data between them. While the viewer (both IE and Mirador) is accessed via a link in Limo, the actual images and metadata shown are retrieved from Rosetta.

The new standard mapping between Alma and Rosetta provides a direct answer to the first issue. The viewers display a uniform metadata set consisting of title, material type, genre, location of the original item and copyright status information. A link to the full descriptive record in Limo gives users access to the most up-to-date information about the digital item on view. And of course both viewers’ metadata panes display all the object-level information required to implement an open data policy. Together with the download functions, this enables KU Leuven Libraries to offer its digitised collections as open data.

The road to ‘open’

KU Leuven Libraries is fully committed to opening up its digitised collections in depth and to as high a standard as possible. Presenting the images of ca. 42.000 – out of nearly 95.000 – digitised library collection items as open data is an important first step in this direction. While we promise to keep improving the user experience in the viewers, with enhanced download functionalities and easier access to terms of use and licensing conditions, you will hear about our first endeavours into opening up metadata as freely available data sets in a next blogpost.

Meanwhile, we invite everyone to visit the collections at Digital Heritage Online (read all about DHO in a previous blog post) and to actively use, reuse and share our digitised collections!