Data-level access to Belgian historical censuses

By André Davids and Nele Gabriels

KU Leuven Libraries is creating data-level access to Belgian historical censuses. This blog post gives some context and a brief overview. To learn more about the digitisation process, you can read our colleague André Davids’ article “Die Texterkennung als Herausforderung bei der Digitalisierung von Tabellen”, O-Bib. Das Offene Bibliotheksjournal / Herausgeber VDB, 7(2), 1-13, https://doi.org/10.5282/o-bib/5584 (in German).

Censuses have been carried out for some 5000 years for the purposes of taxation and the military. Soon after the foundation of Belgium, however, the Belgian sociologist and statistician Adolphe Quetelet led the way for the first censuses that, from the start, were also intended for research. These 1846 censuses covered not only population but also agriculture and industry. They were highly acclaimed across Europe and followed by many subsequent censuses.

These Belgian censuses are a true treasure trove for socio-economic research. Their analysis, however, is very time-consuming because of their extent and format. Complex questions such as ‘What affects salary levels more: the rise of industry or the location?’ are very difficult to answer when working with the originals.

Statistique de la Belgique: Industrie, recensement de 1880 (1887). The open pages reveal the complexity of the census’ content.

Converting the many volumes into digital format would open up a whole range of new possibilities. KU Leuven Libraries Economics and Business’ project ‘Belgian Historical Censuses’ (website in Dutch) is therefore currently digitising the physical volumes in order to create data-level access to the tables in spreadsheet format. Based on research needs, the primary focus is on the industrial censuses organised between 1846 and 1947.

The basis, of course, is transforming the physical copies into digital format. Depending on their physical state, the KU Leuven collection items are either scanned (modern materials in good condition) or photographically digitised (heritage and fragile materials), using presets and processing techniques, respectively, that improve the success rate of OCR.

Text recognition starts with layout analysis. ABBYY FineReader assigns categories to the various parts of the image files: image, text, table. Manual checks and adjustments ensure a correct interpretation of the table structure.

Layout analysis in ABBYY FineReader

Once the layout analysis is successfully completed, ABBYY FineReader executes the actual text recognition. The result is, again, manually checked.

Executing the actual text recognition

A searchable PDF file as well as an Excel spreadsheet can now be exported. To ensure that the layout of the spreadsheet correctly reflects the original tables, the exported spreadsheets are edited intensively.

Exporting the digitised object as Excel still requires intensive editing of the spreadsheet.

OCR’ing numerical data is particularly challenging. Unlike text (where a single incorrectly OCR’ed letter usually still allows the word to be interpreted correctly), an error in a single digit (e.g. a 1 being read as a 7) has major consequences when working with the data.

The only possible solution is manually checking and correcting the numerical data – a very labour-intensive step in the quality assurance process, especially for the older censuses due to the fonts used in these documents. QA operators recalculate the totals of individual rows and columns and compare these with the totals printed in the censuses. When the totals of the original and the digital object differ, the individual numbers in that row or column are corrected against the original volume.

Checking and adjusting the numerical data through comparison of totals in the digitised object (Excel) and the original object.
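To illustrate the check described above, here is a minimal sketch of how row totals in an exported spreadsheet could be compared against the totals printed in the original volume. The file name and column names are hypothetical, not the project’s actual layout, and the sketch assumes the value columns were exported as numbers:

```python
import pandas as pd

# Load one exported census table (hypothetical file and column names).
df = pd.read_excel("census_1880.xlsx")

# All columns except the label column and the printed total hold individual figures.
value_columns = [c for c in df.columns if c not in ("Province", "Printed total")]

# Recompute each row total from the individual figures and compare it
# with the total printed in the original volume.
df["Recomputed total"] = df[value_columns].sum(axis=1)
mismatches = df[df["Recomputed total"] != df["Printed total"]]

# Rows listed here must be checked digit by digit against the original volume.
print(mismatches[["Province", "Printed total", "Recomputed total"]])
```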

Once a searchable PDF and a computationally ready Excel file have been created, structural metadata is added. This metadata and both files are ingested into the library’s preservation environment Teneo. The metadata makes it possible to link the digital objects to the correct descriptive metadata in the library catalogue Alma and to present the objects to the public as open data. The data is now ready to be used!

Teneo viewer showing the machine-readable PDF (in the centre) and partial metadata (on the right). The computationally-ready spreadsheet (indicated on the left) may be downloaded too.

Launch pixel+ viewer: New dimensions take a deeper look at heritage

press releases in English, Dutch, French

Together with the Art & History Museum and the Royal Library of Belgium (KBR), KU Leuven is launching an online open access application to view heritage objects dynamically and interactively online. This pixel+ viewer allows you to view centuries-old objects in a different light and reveal hidden details.  

Japanese print on paper (© KU Leuven Libraries collections) in the pixel+ viewer

As a result of the corona crisis, museums and other heritage institutions in Belgium and abroad currently offer little or no physical access. This puts the consultation of objects and the study of our past under severe pressure. We can partly fall back on digitised objects, notes and old publications, but these capture only part of the information, which means that important details can be overlooked. Fortunately, the sector, in collaboration with engineers, has devised solutions to remedy this.

In the heritage sector, the digitisation of objects has long been a focus of attention and experimentation. For the public, this usually results in an online photo that can be zoomed in on or whose contrast can be adjusted. These are purely colour images: standard digital photographs contain no extra information. However, different types of image scanners register many more characteristics of a surface than just its colour. Being able to visualise this information in a handy online tool therefore offers new possibilities for anyone working with heritage objects. Think, for example, of the KBR drawings by Pieter Bruegel the Elder that were recently examined by KU Leuven. Using their Portable Light Dome (PLD) scanner, the researchers were able to study the paper down to the fibre. They also got a much better view of the extensive range of techniques used by the old master.

Detail of an original Pieter Bruegel the Elder drawing from 1557 (KBR: II132816, Luxuria); without colour, the stylus traces imprinted by the engraver become visible (© Fingerprint, KBR and KU Leuven).

Software is the key

Over the past 15 years, KU Leuven researchers, together with various partners from the heritage sector, have developed digital techniques that can visualise objects to an unprecedented level of detail: the PLD scanner. “With this method, they illuminate an object from a large number of angles and take photos of it, the so-called ‘single-camera, multi-light recording’,” says Hendrik Hameeuw, co-coordinator of the project at KU Leuven. “The way in which this recording is subsequently processed determines which characteristics of the surface, such as relief or texture, the software can show and thus how the user experiences the object.”

New universal file format

“To be entirely complete, we actually have to look at the file types of these interactive datasets,” says Hameeuw. Most heritage institutions calculate and store these types of images of their heritage in a specific image format, usually RTI/HSH. The software developed in Leuven works with PLD files (ZUN, CUN), which have extra functionalities compared to those RTI/HSH files. Pixel+ now makes this method of calculation available to the whole world, not only by offering it online, but also by introducing a new kind of container file for it: glTF. “Compare it with an ordinary photo on your computer. It will probably be a JPEG or GIF file. But if you want to work with it in Photoshop, the program will turn the same image into a PSD file.” These glTF files are compatible with both the Leuven PLD and the RTI/HSH files. “With this we offer a new universal standard for this kind of images and we also open them up immediately via the online pixel+ viewer, a kind of free Photoshop for ‘single-camera, multi-light recording’ images.” This allows both RTI/HSH and PLD files to be studied and compared within the same program for the first time.

A new world

Pixel+ extracts a lot of extra information from the available data. Objects such as old coins, miniatures or paintings suddenly acquire extra dimensions after hundreds of years, which can be used in research to gain new insights into these objects. Especially in the field of 3D (geometry) and the correct understanding of how light reflects off an object, the Leuven software takes major steps forward.

“The technology is interesting for many objects, from clay tablets and coins to paintings or medieval manuscripts,” explains Hameeuw. “The software allows, among other things, the objects to be viewed virtually under different incidences of light, the relief to be mapped at pixel level or a 3D visualisation to be generated.” Frédéric Lemmers of the KBR Digitisation Department adds: “By combining it with multispectral imaging, researchers recently discovered that the heads of some figures in KBR’s 13th-century Rijmbijbel were painted over at a later date.” At the Art & History Museum, the technology was used to make heavily weathered texts on almost 4,000-year-old Egyptian figurines readable again.

Institutions from all over the world, from the Metropolitan Museum of Art in New York (USA) to the Regionaal Archeologisch Museum a/d Schelde in Avelgem (Belgium), will be able to upload, consult and study their own datasets or files in pixel+. The software converts the information according to various new standards and allows users to access the virtual heritage objects interactively. “This development really is a milestone for the heritage sector”, emphasises Chris Vastenhoud, promoter of the project from the Art & History Museum. “A whole new world will open up for heritage institutions worldwide. They will be able to document and share a lot of additional information in order to communicate about the objects in their collections”.

Pixel+ is available to everyone at http://www.heritage-visualisation.org with examples of objects from the collections of the Art & History Museum, KBR and KU Leuven.


The online pixel+ viewer with an example of a cuneiform tablet from the collection of the Museum Art & History, Brussels. (© Art & History Museum and KU Leuven).

The project is a collaboration between Art & History Museum, KU Leuven Department of Electrical Engineering, KU Leuven Illuminare, KU Leuven Libraries Digitisation and KBR; and was funded by the Federal Science Policy Office (BELSPO) through the BRAIN-be programme (Pioneer projects).


At the beginning of April 2020, the pixel+ project staff presented their results during the SPIE conference, which was held online as a result of the corona crisis. The paper below was published on that occasion:

Vincent Vanweddingen, Hendrik Hameeuw, Bruno Vandermeulen, Chris Vastenhoud, Lieve Watteeuw, Frédéric Lemmers, Athena Van der Perre, Paul Konijn, Luc Van Gool, Marc Proesmans 2020: Pixel+: integrating and standardizing of various interactive pixel-based imagery, in: Peter Schelkens, Tomasz Kozacki (eds.) Optics, Photonics and Digital Technologies for Imaging Applications VI, Proc. of SPIE Vol. 11353, 113530G. (DOI: 10.1117/12.2555685)

read paper – see presentation

Additional examples can be viewed and created at http://www.heritage-visualisation.org/examples.html

Opening up a little more: a minimal-computing approach for developing Git and machine-actionable GLAM open data

by Mariana Ziku and Nele Gabriels

One of the current hot topics in the GLAM (Galleries, Libraries, Archives, Museums) sector is presenting collections as data for use and reuse by as diverse a public as possible. Having adopted an open data policy and working towards FAIR data, we described our implementation of images as open data in a previous blog post. Digitisation does not only create images, however, so we have started down the exciting road of disclosing other data too. During 2019, we were delighted that Mariana Ziku, at the time an Advanced MSc in Digital Humanities student at KU Leuven, took up an internship at our Digitisation Department and set out to investigate how to start navigating this road. Here is her story!

Hi there! I’m Mariana, an art historian and curator. During Spring 2019, I had the opportunity to join the Digitisation Department at KU Leuven Libraries, as the department set the scene for extending its open digital strategy. The goal: investigating new ways for sharing and using the libraries’ digitised collections as data. A Digital Humanities (DH) traineeship and research position was opened under the title “Open Access Publishing in Digital Cultural Heritage” to examine the data aspect of heritage collections in the context of open access.

This blog post gives a brief insight into the research, aspirations and practical outcomes of this internship, which resulted in the pilot study ‘Open access documentary heritage: the development of git and machine actionable data for digitised heritage collections’. 

Figure 1: The pilot stack for creating machine-actionable documentary heritage data for KU Leuven Libraries’ Portrait Collection. CC-BY 4.0. Vector icons and graphics used for the KU Leuven Pilot Stack infographic: at flaticon.com by Freepik and at Freepik by Katemangostar.

Open-access development in GLAMs

First on the agenda: delving into KU Leuven Libraries’ digital ecosystem. This included looking into the library’s digitisation workflows, imaging techniques and R&D projects (like this one), as well as discovering the elaborate back-end architecture and metadata binding it all together. At that time, two anticipated open-access projects were going public: the ‘Digital Heritage Online’ browsing environment for the library’s digitised heritage collections and the enhanced image viewer, offering a more functional user interface, clear communication of the licensing terms, and full object- and file-level download functions for the public domain collections.

With this view of KU Leuven Libraries’ open digital strategy in mind, we explored the current open-access publishing landscape among other GLAM institutions. We got an indicative overview through McCarthy and Wallace’s ongoing project “Survey of GLAM open access policy and practice”. The survey is an informal, open Google spreadsheet that allows GLAM institutions to list themselves and index where they sit on the openness spectrum. By running a Tableau data analysis and visualisation on selected survey facets, we outlined the open-access digital ecosystem of approximately 600 GLAM institutions. This revealed, among other things, that museums are the most active type of cultural institution in the field of open access, followed by libraries, and that the countries with the most open institutions are Germany, the U.S.A. and Sweden.

Figure 2: A data visualisation board of instances in the Survey of GLAM open access policy and practice, an ongoing project by McCarthy and Wallace indexing the openness spectrum of GLAM institutions for enabling data re-use of their collections. (Data captured April 2019. The list has grown since then.)

The survey provides valuable insight into the instances of open data that each GLAM institution makes available online. The data is currently organised in 17 columns, including links to the Wikidata and GitHub accounts of the listed GLAMs. Although the majority of indexed institutions had a Wikidata account, the number of GLAMs with an active GitHub account was low (approximately 50 out of 600 institutions). Nevertheless, looking for good practices in open-access publishing, we started to explore existing GLAM activity on GitHub, examining, among others, data repositories of prominent institutions that have long been active on GitHub (e.g. MoMA, MET, NYPL), wiki-documented projects (e.g. AMNH) and various curated lists. GitHub seemed a time- and cost-effective way to provide access to GLAM open data, which convinced us to further explore the potential of collaborative Git platforms such as GitHub for digital cultural heritage.

The question for the internship pilot was set: how can Git be used within the framework of GLAM digital practices and could it become part of a strategy for creating open data-level access to digitised GLAM collections?

Towards the computational transformation of documentary heritage as open data: a pilot study for KU Leuven Libraries

Documentary heritage metadata created and curated within libraries is considered to be well structured and consistent. In many ways this is indeed the case, thanks to the use of internationally recognised metadata standards and the creation of description formats for specific material types. Even so, the structure of this metadata as-is poses many challenges for computational use, because libraries primarily create metadata for human-readable cataloguing purposes rather than for digital research. In addition, library metadata has been created over a long period of time by many people who make different interpretational choices and, occasionally, errors. Even the metadata formats may change over time, creating inconsistencies. As a result, the data may become “dirty” on top of its structural challenges for computational use.

As digital scholarship becomes increasingly prevalent, developing new systematic and reliable approaches to transform GLAM data into machine-actionable data is critical. In this context, the concept of Git can be helpful. Git is an open-source version control system that maintains a history of fully functional changes to uploaded data or code. It also enables simultaneous collaborative work, for example by keeping a trustworthy record of issues and of contributions from a wider community of users. However, data-level access to CSV and other spreadsheet files through Git is not quite there yet for the wider user community: Git platforms like GitHub do not (yet) support collaborative, hands-on work on datasets contained in CSV files with the option to track changes in real time. Nevertheless, GitHub can be useful for publishing open GLAM data, as it can foster engagement, help build a knowledgeable community and generate feedback.

The pilot set out to provide a test case by publishing the digitised Portrait Collection as open, computationally fluent data in a so-called ‘optimal’ dataset. Approximately 10.000 graphic prints from KU Leuven Libraries’ Special Collections had been digitised several years earlier and provided with descriptive and structural metadata. Now, the dataset would be prepared to function as an open-access documentary heritage resource, specifically designed for digital humanities research and made available via a new KU Leuven Libraries GitHub repository.

Open GLAM data: good practices, frameworks and guidelines

As we were looking into data-specific open access for documentary heritage, the pilot study first investigated practices, standards and technical deployments that reframe cultural heritage (CH) as open data. This included the analysis of standardised policy frameworks for creating GLAM data repositories on the one hand and research infrastructures supporting CH data-driven scholarship on the other.

We investigated good practices for creating trustworthy GLAM data repositories and digital resources that move towards open-access, improved discoverability and active reusability. Interestingly, these were based on (amongst others):

In addition, the publications of the “Collections as Data” project (Thomas Padilla et al., May 2019) were particularly helpful for gaining a better understanding of the challenges that GLAM institutions encounter in providing resources for digital scholarship and in developing the necessary digital expertise.

A minimal-computing approach to developing git and machine-actionable data 

Turning to the dataset preparation for publication, it was essential to develop both a process that required only minimal programming knowledge and a step-by-step guide for GLAM data processing that would be as accessible as possible within a humanities context. That way, over time more machine-actionable datasets from the library’s heritage collections could be developed as part of the general working processes of the library. 

The Metadata Services Department took on the metadata export from the metadata repository. Using MarcEdit and FileMaker, both tools that are already part of the library’s general workflows, the department transformed the initial MARCXML export into CSV format.
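The export itself was done with MarcEdit and FileMaker. Purely as an illustration of the same step, here is a minimal sketch of how a MARCXML export could be flattened to CSV in Python, assuming the pymarc library, a hypothetical input file and an illustrative (not the project’s actual) field mapping:

```python
import csv
from pymarc import parse_xml_to_array

# Read all MARC records from a hypothetical MARCXML export.
records = parse_xml_to_array("portraits.xml")

with open("portraits.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["Record ID", "Main title", "Place of manufacture"])
    for record in records:
        # The chosen tags/subfields are examples only.
        record_id = record["001"].value() if record["001"] else ""
        title = record["245"]["a"] if record["245"] else ""
        place = record["264"]["a"] if record["264"] and record["264"]["a"] else ""
        writer.writerow([record_id, title, place])
```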

Next, we needed to gain insight into the content of the dataset through simple data analysis. We used Jupyter notebooks hosted on Colaboratory to share the findings in real time and to allow everyone to look at and interact with the outcome of the data analysis. Analysing the CSV file with short Python scripts revealed basic information on the quality of the data in each column. Furthermore, columns were identified as containing categorical, qualitative or numerical data, as columns with unique values, or as almost empty columns to be discarded. Using Colaboratory requires knowledge of the Python programming language, however, so the search for a minimal-computing approach to GLAM data processing meant looking for tools that are easier to use.

https://colab.research.google.com/drive/1aiiB0KUxowDwlg9gg_nB-5T-_966-Z7M

Figure 3: Click on the link above to go to Google Colab and see the notebook with the first set of data explorations of the initial CSV file (before data processing for its computational transformation). The short Python scripts can be copied and easily adjusted for the exploration of other datasets.
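The notebook linked above contains the actual exploration. As a rough sketch of the kind of checks involved (the file name is hypothetical; the facet column name is taken from the examples further below), a few lines of pandas are enough to spot almost empty columns, count unique values and list the categories within a facet:

```python
import pandas as pd

# Hypothetical initial export of the Portrait Collection metadata.
df = pd.read_csv("portraits_initial.csv")

# How full is each column? Almost empty columns are candidates for removal.
print(df.notna().mean().sort_values())

# How many distinct values does each column hold? A handful of values
# suggests categorical data; one value per row suggests identifiers.
print(df.nunique().sort_values())

# Category counts for a single column, comparable to an OpenRefine text facet.
print(df["Description of graphic material"].value_counts())
```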

Following the advice of Packed (the Belgian centre of expertise in digital heritage, now part of Meemoo), we selected OpenRefine, a free, open-source, stand-alone desktop application for data processing. OpenRefine can be used without advanced programming skills, although it offers more elaborate capabilities as well. Basic data processes can be performed through “push-button” actions and simple text filters written in the General Refine Expression Language (GREL).

The three principal stages of data processing could be executed in OpenRefine: data cleaning, data transforming and data refining. The result was an optimal CSV file containing machine-actionable data. This data could now be used for data analysis and visualisation with minimal friction, in the context of data-driven scholarship as well as for other creative computational reuses.

Data processing

Let’s take a closer look at what data processing actually involves. Having a clearer view of the dataset and a minimal-computing tool, we could now start processing the data. Transforming the CSV file into machine-actionable data required over 500 individual data processing tasks in OpenRefine. Below follows a single example for each data processing stage: cleaning, transforming and refining.

>>>Data cleaning

Data cleaning is primarily based on faceting and clustering. Faceting is a useful OpenRefine feature that allows columns and rows to be browsed. Within these facets the data can be filtered into subsets (‘categories’), which in turn can be changed in bulk.

Figure 4: Screenshot of the OpenRefine tool showing a list of the 32 categories for the facet ‘Description of graphic material’. One of these categories is used in 10.323 out of the total number of 10.512 records. 

Inspecting the column “Description of graphic material”, as shown in the image above, reveals a total of 32 different data categories into which all 10.512 records (i.e. the complete Portrait Collection) are sorted. The number of categories can be reduced, because some categories exist only due to inconsistencies in data documentation, like the dot behind ‘$oBlack-and-white’ in one of the categories highlighted in green in the example below.

Figure 5: Screenshot of OpenRefine showing two inconsistencies in data documentation. Highlighted in green is an unintended category duplication due to a faulty value (a dot added to the value ‘Black-and-white’). Highlighted in yellow is a seemingly faulty category that, upon inspection of the image file described in this record, turned out to be correct.

However, category integration is a critical process that requires detailed inspection and consideration of data nuances that might otherwise be lost. In the image above, for example, a category with a double entry of “Black and White” is not a mistake. It merely indicates that the image contains two portraits, both of which are black and white.

The image below shows another example: a column with 248 different categories that need to be reduced in number in order to become computationally amenable. The metadata coding for this content type (typographical terms) is not based on the MARC 21 standard, but refers to a local coding system used in conjunction with MARC 21. Here too, the number of different categories can be reduced through category integration, in combination with splitting, which is explained below.

Figure 6: On the left, the initial column of the unprocessed CSV file containing approx. 250 categories of typographical terms, which made the column unfit for computational use. On the right, the outcome of data cleaning in the new CSV file: 20 major categories of typographical terms were identified, encompassing all the Portrait Collection’s artworks (Typographical terms 1), and a second column was created (Typographical terms 2) to include more specific information complementing the first column.

Another critical issue making data unfit for computational use is inconsistency arising from different spellings and name variants of the same entity. The clustering feature of OpenRefine groups values that look similar, with the option to choose between different levels of sensitivity. The image below shows how data entries referring to cities were clustered in order to surface entries that represent the same thing but are documented in different ways.

Figure 7: Performing clustering in OpenRefine inside the column “Place of manufacture”: an automatic identification and grouping of various spelling forms of the cities where the graphic prints were created.
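OpenRefine’s default ‘key collision’ clustering relies on a fingerprint of each value. As a rough sketch of the same idea (not the tool’s own code), spelling variants of place names can be grouped by normalising accents, case, punctuation and word order:

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Simplified take on OpenRefine's fingerprint keying method."""
    value = unicodedata.normalize("NFKD", value)
    value = value.encode("ascii", "ignore").decode("ascii")   # strip diacritics
    value = value.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(value.split()))                        # ignore word order
    return " ".join(tokens)

# Illustrative spelling variants, not actual project data.
places = ["Bruxelles", "bruxelles.", "BRUXELLES", "Anvers", "Anvers "]

clusters = defaultdict(list)
for place in places:
    clusters[fingerprint(place)].append(place)

# Each cluster gathers variants that most likely denote the same place.
for key, variants in clusters.items():
    print(key, "->", variants)
```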

>>>Data transformation

Going a step further with data processing, an extended transformation of many columns is recommended to further refine the repository and produce data that is more amenable to computational use. Typical improvements include splitting columns, renaming entities and replacing strings.

Splitting columns can be useful when a single column holds two or more variables, as in Figure 6 above. The data becomes computationally responsive when separated into two or more columns, as in the sketch below.
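A minimal sketch of such a split in pandas, assuming (hypothetically) that the combined values are separated by a semicolon; the delimiter, values and column names are illustrative only:

```python
import pandas as pd

# Illustrative data: one column holding two variables separated by ';'.
df = pd.DataFrame({"Typographical terms": ["etching; portrait", "engraving", "woodcut; portrait"]})

# Split the combined column into a main term and an optional complement,
# mirroring the 'Typographical terms 1' / 'Typographical terms 2' columns above.
split = df["Typographical terms"].str.split(";", n=1, expand=True)
df["Typographical terms 1"] = split[0].str.strip()
df["Typographical terms 2"] = split[1].str.strip()  # NaN where there is no second term

print(df)
```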

Renaming entities can also enhance computational usability. This often involves replacing strings, for example when a single column contains URLs in various formats that direct to different web pages. In the following example, the column “Link to digital object 1” contains differently formatted URLs that display an image preview of each portrait. The URLs are mainly “delivery links” that redirect to an image display within a viewer environment: http://depot.lias.be/delivery/DeliveryManagerServlet?dps_pid=IE4849156. However, delivery links are not fit for image harvesting, as they link to a viewer and not to a file. In order to enable image harvesting, we transformed the link into a direct image display of the file in thumbnail form: http://resolver.libis.be/IE4849156/thumbnail. This way, a quick image preview is available, while the embedded thumbnail preview is useful for more elaborate data visualisations.

Figure 8: First task from a set of successive data transformations, performed for creating a unified URL format for the image-display link. The resulting link directs to a thumbnail image preview, which can be integrated more easily into data visualisations. Data transformation was performed through a GREL expression.

The link was transformed using a GREL expression that appends text to the string: cells['Link to digital object 1'].value + '/thumbnail'. In this case the column’s name is ‘Link to digital object 1’ and the appended text is ‘/thumbnail’. The resulting link can be previewed on the right side of the table: http://depot.lias.be/delivery/DeliveryManagerServlet?dps_pid=FL4856542/thumbnail

In many cases, as in the instance above, successive data transformations are needed in order to make a single data entry consistent and functional.
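Sketched in Python for illustration (the project performed this in OpenRefine with successive GREL steps), the combined transformation could look as follows, assuming all delivery links follow the dps_pid pattern shown above:

```python
import re

def to_thumbnail_link(delivery_link: str) -> str:
    """Turn a viewer delivery link into a direct thumbnail link (illustrative)."""
    # Step 1: extract the persistent identifier from the delivery URL.
    match = re.search(r"dps_pid=([A-Z]{2}\d+)", delivery_link)
    if not match:
        return delivery_link  # leave unrecognised formats untouched
    pid = match.group(1)
    # Step 2: rebuild the link against the resolver and append '/thumbnail'.
    return f"http://resolver.libis.be/{pid}/thumbnail"

print(to_thumbnail_link(
    "http://depot.lias.be/delivery/DeliveryManagerServlet?dps_pid=IE4849156"
))
# -> http://resolver.libis.be/IE4849156/thumbnail
```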

>>>Data refining

Data refining starts with fixing encoding issues in strings by repairing mis-represented diacritic characters. For example, the diacritic character “é” is not properly represented; instead, an arbitrary character chain appears (“eÌ”). In order to refine this correctly across all columns, we performed a transformation through the “All” drop-down menu with a GREL expression. OpenRefine subsequently shows the number of transformations automatically applied to each column.

Figure 9: Refining data by reinterpreting diacritic characters throughout all columns with a GREL expression. The top right image shows the high number of character modifications within column ‘Main title’, with 427 cells containing  the diacritic character at least once.
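The project handled this with a GREL expression in OpenRefine. As a hedged illustration of the underlying repair, this kind of damage can often be reversed in Python by re-encoding the text and normalising the result, assuming the garbling stems from UTF-8 bytes having been read as Latin-1:

```python
import unicodedata

def fix_encoding(value: str) -> str:
    """Attempt to repair text whose UTF-8 bytes were decoded as Latin-1."""
    try:
        repaired = value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value  # leave values that do not fit this pattern untouched
    # Fold combining marks into precomposed characters, e.g. 'e' + U+0301 -> 'é'.
    return unicodedata.normalize("NFC", repaired)

# 'Ã©' is what 'é' looks like after its UTF-8 bytes are read as Latin-1.
print(fix_encoding("liÃ©geois"))  # -> liégeois
```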

GitHub Setup

With the dataset now ready for publication, we set up a simple GitHub environment to enable access to and reuse of the data. We created an institutional profile and prepared two readme files: one with a general introduction to the library’s objectives with regard to open-access publishing, and another with a detailed description of the uploaded dataset and its technical aspects.

Useful applications and features were integrated to further support the use of GitHub as an open GLAM repository. These include integration with the open research repository Zenodo as well as the “All Contributors” bot extension.

The integration of Zenodo makes the data repository citable through the minting of a Digital Object Identifier (DOI). Zenodo’s support for DOI versioning allows a specific dataset and its different versions to be cited. Zenodo is linked to the GitHub repository, allowing us to choose a suitable license and a number of other useful settings for the dataset (e.g. export formats, citation style, recorded versions).

Figure 10: The Portraits Collection dataset is available to download and cite from the open research repository Zenodo, where a DOI has been minted especially for the dataset. Zenodo is accessible through an ORCID or GitHub account, and the Zenodo DOI badge within GitHub was easily integrated with the following Markdown snippet: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3460785.svg)](https://doi.org/10.5281/zenodo.3460785)

“All Contributors” is a framework developed within GitHub to attribute every possible contributor who engages with the dataset, using an “emoji key”. Contributors include not only people who work on code but also everyone who asks and answers questions, designs, provides examples, translates, and so on. The bot is configured accordingly and can also be used to open up data modelling inquiries to the broader community, enabling attribution within the library’s GitHub page to literally everyone with a GitHub account who engages with the dataset.

Figure 11: The “All Contributors” extension integrated in the GitHub GLAM account can acknowledge everyone engaging with the related data (not just coders). A set of emojis (shown below the avatar pictures) represents specific contribution types (e.g. reviewing, translating, asking questions, sharing ideas).

https://github.com/KU-Leuven-Libraries


Looking ahead

The pilot served as a test case to develop a process for preparing more machine-actionable datasets through minimal-computing processes at KU Leuven Libraries. It also set up the GitHub profile of KU Leuven Libraries: initiating the use of Git by applying the Zenodo DOI and All Contributors extensions, creating the readme files, uploading the machine-actionable dataset to the Git repository, and committing and proposing file changes using Git. The data preparation process has been documented step by step, in order to create a blueprint for processing documentary heritage data intended for computational use, and to offer critical insights, identify omissions and missing information, and highlight process steps that could be improved in the future.

The pilot was KU Leuven Libraries’ first step towards creating a Git repository of open, ready-to-analyse datasets of documentary heritage from the library’s collections, openly available to (amongst others) researchers and students in the (digital) humanities looking for computationally fluent open data with minimal discrepancies.

The pilot study for developing Git and open documentary heritage data for computational use was conducted in the Digitisation Department of KU Leuven Libraries by Mariana Ziku as part of her thesis in the Advanced MSc in Digital Humanities, under the supervision of Nele Gabriels, with the guidance of Bruno Vandermeulen, training by Diederik Lanoye, and the advice and good company of Hendrik Hameeuw and Mark Verbrugge. Thanks also to Bart Magnus and Sam Donvil from Packed (now Meemoo) for sharing their expertise on digital cultural heritage.

We are OPEN! Share and reuse our public domain collections (and read about how we got there)

Here’s a piece of exciting news: KU Leuven Libraries has adopted an open data policy for its digitised collections! This means that nearly 42.000 digital representations of originals from the library’s public domain holdings may be freely shared and (re-)used by all. And of course this number continues to grow.

How does it work?

It’s easy. Online visitors can check the copyright status of the images when viewing digitised collection items online. A direct link guides them to our terms of use page, where they will read that everyone is free to use and reuse images offered as open data at will, mentioning KU Leuven as the owner of the collection item where possible.

The IE viewer displays copyright status information and links to the general terms and conditions.

The information pane of the IIIF-based Mirador viewer displays the same copyright status information and link to the general terms and conditions as the IE viewer

While we strive to keep our collections as open as possible, some items are available under a license, e.g. when the public domain originals are not part of KU Leuven Libraries’ holdings or when permission was given for copyrighted materials to be digitized and made available to the public under specific conditions. Visitors can consult the licensing conditions via the viewer.

When images are made available under a licence, the viewers (here: Mirador) link to specific conditions for use for the digital object in view.

Having checked the status of a digital object and the conditions for use, online visitors can use the direct download possibilities included in the Teneo viewer. These offer single-file or full-object download.

Images may be downloaded as single files or as full object in the IE viewer.

The downloaded images are JPG or JP2 files and offer perfect readability of even the smallest written text. Visitors are now ready to become active users of the digitised collection!

How we got there

Easy as it may seem, implementing an open data policy required significant effort on various levels to ensure that online visitors would be clearly informed about the legal status of both the physical originals and the digital representations, and about what this means for their use. KU Leuven Libraries currently presents its digitised collections either in the Rosetta IE viewer (with an ‘Intellectual Entity’ generally equalling an individual collection item) or – for bespoke collections – in a IIIF-based Mirador viewer. Systems and processes had to be adjusted to show this information at single-item level in these viewing environments.

First, data can only be freely used and shared if it can be both accessed and acquired. To this end, easy-to-use download functions were created within the IE viewer, which now offers both single-file and full-object download (see the images above). Mirador too will include download options before long.

Second, covering the bases, our team collected the rights information from legacy project descriptions and agreements, both internal to KU Leuven Libraries and (for digital objects based on originals held by other institutions) with external partners. Unclear phrasing was clarified and the images resulting from digitisation projects were assigned one of three possible legal statuses: public domain/open data, available under a license, or copyrighted.

Third, for each of these three statuses, terms & conditions for use of the digital objects were designed in close collaboration with KU Leuven’s Legal Department. Furthermore, an overview page was created detailing, for each of the digitisation projects, the licensing conditions for those digitized items that are available under a license.

Fourth, a copyright status was assigned to each individual digitised object, more specifically to both the physical object (public domain or in copyright) and the digital object (content as open data, under a license or in copyright). For the originals, the descriptive metadata model was modified to include the copyright status; for the digital objects, status information would become part of the metadata in the information package presented for ingest into the preservation environment.

Terms and conditions for the use of images as open data.

And finally, fifth, we turned to metadata visualisation in the viewer environments. The metadata shown in both the IE and the Mirador viewer comes not from the public search environment Limo nor from the metadata repository Alma, but from the digital asset preservation system Rosetta; hence the choice to include the copyright status of the digital representations in the ingest information packages. By nature, the metadata (just like any data) in the preservation system is unchangeable. Including the extra information for those digital objects already in Rosetta was therefore not something to be done lightly, but implementing an open data policy justified the decision.

Rather than adding copyright statuses to the existing metadata, we decided to create a standard mapping between Alma and Rosetta and replace the existing descriptive metadata in Rosetta. That way, two other issues could be addressed: the inconsistent and the static nature of the descriptive information shown in the digital representation viewers. The inconsistency was a legacy of the early digitisation projects at KU Leuven Libraries (with some projects generating extensive descriptions and others hardly any) while the static nature of the metadata is inherent to its extraction from the preservation environment.

This chart shows the four main elements of our architecture and the flow of data between them. While the viewers (both IE and Mirador) are accessed via a link in Limo, the actual images and metadata shown are retrieved from Rosetta.

The new standard mapping between Alma and Rosetta provides a direct answer to the first issue. The viewers display a uniform metadata set consisting of title, material type, genre, location of the original item and copyright status information. A link to the full descriptive record in Limo gives users access to the most up-to-date information about the digital item on view. And of course both viewers’ metadata panes display all the object-level information required to implement an open data policy. Together with the download functions, this enables KU Leuven Libraries to offer its digitised collections as open data.

The road to ‘open’

KU Leuven Libraries is fully committed to opening up its digitised collections in depth and to as high a standard as possible. Presenting the images of ca. 42.000 – out of nearly 95.000 – digitised library collection items as open data is an important first step in this direction. While we promise to keep improving the user experience in the viewers, with enhanced download functionalities and easier access to terms of use and licensing conditions, you will hear about our first endeavours in opening up metadata as freely available datasets in a future blog post.

Meanwhile, we invite everyone to visit the collections at Digital Heritage Online (read all about DHO in a previous blog post) and to actively use, reuse and share our digitised collections!

FINGERPRINT project in KU Leuven Campuskrant

Cartoon by Joris Snaert © on the use of the Portable Light Dome when digitizing a Bruegel drawing (page 2 of Campuskrant 31/1) 

In an interview in the KU Leuven university journal Campuskrant with Prof. Lieve Watteeuw, the KU Leuven contributions to the FINGERPRINT project are highlighted at length. The article focuses on the work with the original drawings by Pieter Bruegel the Elder and on how the imaging efforts by the FINGERPRINT team have made a difference in better understanding the virtuosity of Bruegel the artist. These results will also be presented in the KBR exhibition The World of Bruegel in Black and White.

The Imaging Lab was responsible for the digitisation of all drawings and prints and also carried out the advanced imaging enabling in-depth visualisations.

PDF of the complete campuskrant edition

More info on the Fingerprint Blog

Digital Heritage Online: KU Leuven Libraries implements its new discovery platform

KU Leuven Libraries presents a new platform: Digital Heritage Online. It gathers all digitized heritage objects from its collections, with objects dating from the 9th up to the 20th century, in one viewing interface. The platform enables users to browse these digitized objects in an open and visually appealing way. It also provides a search environment within the Digital Heritage Online collection.

The user may browse the collections and items based on content theme, material type, or location of the physical object. A fourth entry shows which objects were made digitally available during the previous three months.

Start window of Digital Heritage Online platform with four entries to discover the database.

By clicking on an object thumbnail within the collection, the user can consult the extended bibliographic information in the library catalogue and view the full object online.

Catalogue item description with ‘Teneo’ link to view the full object and detailed bibliographic information.

Additionally,  the extended object description names all collections to which the object belongs and shows other related items. This allows for contextualized browsing of the digitized collection in the library catalogue environment.

Catalogue item description with ‘Collection path’ showing the collections to which a single item belongs as well as thumbnails of related collection items, two features supporting contextualised browsing.

One can also execute specific searches within the Digital Heritage Online collection, excluding non-digitized collection items in the search results. Searches may be performed either in Digital Heritage Online as a whole or within its subcollections. Searches via the blue search button will include items in collections outside of the Digital Heritage Online platform.

Search results for ‘Leuven’ in the Digital Heritage Online collection, employing the search box on the left.

Digital Heritage Online may be accessed either directly via this link or through the homepage of KU Leuven Libraries’ integrated search interface and catalogue: Limo, by clicking on ‘Curated Collections’.

Limo homepage view with access to Digital Heritage Online as one of the ‘curated collections’.

Over the past ten years, KU Leuven Libraries has worked intensively on digitising its documentary heritage. At the time of writing this blog post, the digitised collection held almost 88.000 objects, and it continues to expand week by week.

Since 2016, we have focused our attention on opening up our digitised collection in a clearer and more user-friendly way. While we keep the FAIR principles in mind as a long-term goal, the most pressing needs were increasing the findability and accessibility of the collection. Parts of the collection were already available through aggregator platforms such as Europeana and Flandrica or KU Leuven platforms such as Magister Dixit and Lovaniensia. A selection of heritage objects figures in virtual exhibitions on our EXPO site. And whenever copyright and intellectual property rights allow consultation and use, digitised collection items can be consulted through Limo, the library’s integrated search interface mentioned above, where users can search the full library catalogue and view items that have an online digital representation.

The Digital Heritage Online platform now provides a clear access point and a search environment for our digitised heritage. It enables a unified view of all digitised content and a true visual browsing environment. It is, naturally, only a first step towards opening up the digitised collection. Opening up the data itself (images, metadata and content) for use and reuse is firmly on our agenda. But for now: happy exploring!

Credits

Digital Heritage Online is the result of a close collaboration between different services of KU Leuven Libraries. LIBIS took care of the design and technical development of the Collection Discovery platform in KU Leuven Libraries’ catalogue and search interface. Digital Heritage Online was the pilot project for this new Limo implementation. Together with the various collection curators and thanks to the many digitisation projects, the Digitisation Department designed a structure for the various sub-collections. Technical operations for bringing together the many heritage object descriptions in the correct Alma collections were carried out in collaboration with the Document Processing Department. The content coordination of Digital Heritage Online is in the hands of the Digitisation Department.

Introducing EXPO


KU Leuven Libraries is happy to present EXPO, a new online platform for accessing its fragile heritage collections.

EXPO offers virtual exhibitions and a gallery of individual collection items, and informs about digitization and imaging projects at the library.

The virtual exhibitions connect collection items with other works and may or may not be a continuation of physical library exhibitions. Exceptional works from the KU Leuven Libraries collections are highlighted in the gallery, which gathers both recognised masterpieces and other fascinating works with a unique story. The project pages introduce the website visitor to ongoing and past activities in the field of digitisation, presenting both initiatives aimed at the digital disclosure of the library collections and projects in the context of research and technical imaging.


EXPO is the result of a close collaboration between various departments. Exhibition curators, collection keepers, and other heritage collaborators and partners create the exhibitions and fill the gallery with collection highlights. LIBIS carried out the technical implementation of the site as part of the Heron (Heritage Online) service. The site is based on the open source web publishing platform Omeka. For the site’s development, LIBIS created a direct connection between Omeka, the library management system Alma and the preservation system Teneo/Rosetta, allowing curators to work within a single, integrated environment. The coordination of EXPO is taken on by KU Leuven Libraries Digitisation & Document Delivery. This department manages digitisation projects, executes digitisation in its Imaging Lab and supports research projects with bespoke digitisation techniques and the specific expertise required for scientific imaging.


Imaging Bruegel’s Drawings


Bruegel’s original Luxuria drawing (KBR SII132816) under the Multispectral Portable Light Dome

In the framework of the Belspo BRAIN-be FINGERPRINT project, the Imaging Lab performed a two-day scientific and technical photography session in March 2018 at the Print Room of the Royal Library of Belgium, working on 4+1 unique Bruegel drawings kept at the Royal Library and the Royal Museums of Fine Arts of Belgium. For a full account of this event, follow this link.

Introducing ArtGarden


The Imaging Lab is a partner in the ArtGarden research project. The project aims to test and develop an efficient ‘best practices’ matrix (tool and protocol) for the monitoring, imaging and art-technical documentation of fragile historic mixed-media objects. This matrix is intended to facilitate decision-making during conservation and preservation practice.

The Imaging Lab is involved in investigating the historical materials and techniques through scientific imaging tools such as multispectral imaging and the Microdome (developed within the RICH project).

The project focuses on guiding and evaluating conservation treatment as well as the transportation, museum display and long-term storage of complex, degraded historic mixed-media artefacts. Until now, guidelines have concentrated on a single material characteristic; the complex nature of the many historic mixed-media artefacts in museum collections is more challenging and less well covered. The ArtGarden project combines documentation, conservation and preservation protocols (terminology defined by ICOM-CC, New Delhi, 2008) to create an innovative tool supporting the care, maintenance, display and valorisation of complex historic collection artefacts.

Royal Institute for Cultural Heritage, KIK/IRPA
KU Leuven
University of Antwerp

A project funded by Belspo/BRAIN.

FINGERPRINT: the toolbox

In a previous blogpost, we introduced the FINGERPRINT project. FINGERPRINT is an interdisciplinary collection and data management project on the exceptional collection of graphic works by Pieter Bruegel the Elder (ca. 1520-1569). It involves the collection and processing of a large amount of visual and material data. To obtain that visual data, we have an extensive toolbox at our disposal: a high-resolution medium-format digital back, a motorised repro stand, a Nikon DSLR modified for multispectral imaging with a collection of multispectral filters, the RICH microdomes and much more. A brief overview.