The objective of metadata curatorship is to ensure that users can effectively and efficiently access objects of interest from a repository, digital library, catalogue, etc. using well-assigned metadata values aligned with an appropriately chosen schema. However, we are often facing problems related to the low quality of metadata used for the description of digital resources, for example wrong definitions, inconsistencies, or resources with incomplete descriptions. There may be many reasons for that, all completely valid, e.g, in many cases those who host a digital repository have few human resources to work on improving metadata, and often data providers are not themselves the metadata creators.
In this paper we present our ongoing work aiming at defining computable metrics to assess metadata quality and automatize metadata quality check process.
State of the art
The curation framework developed by Bruce and Hillmann (2004) is considered as a benchmark in the pursuit of quality assessment of digital repository. This framework defines seven dimensions to measure the quality of the metadata:
Conformance to Expectations
Logical Consistency and Coherence
. In the digital cultural heritage domain, these dimensions are fundamental for the evaluation of metadata quality. The evaluation helps various curators to systematically identify metadata problems, and could be straightforwardly applied to Europeana Digital Library
or the Ariadne
project as well. From the literature analysis it can be inferred that the problem of metadata curation is one of the most debated topic in the domain of humanities (Dangerfield et al 2013). Few attempts (e.g., Ochoa, X. and Duval, E. (2009)) were made to automatically compute quality metrics: however, they either focus on one dimension or are specific of some repository or metadata schema/profile.
Proposal for automatic quality check
The solution we propose is a framework which aims at automatically checking metadata quality of a repository along the dimensions proposed by Bruce and Hillmann (2004). It would also enable reporting to metadata curators a detailed analysis of the metadata situation and suggestions for possible improvements. To develop such framework, two main activities have to be carried out:
To define metadata quality metrics, capturing the status of the metadata both at object level (i.e., how good are the metadata of a single entry in the repository) and, aggregated, at repository level;
To define algorithms to compute the aforementioned quality metrics, and (possibly) return suggestions on how to fix low-quality metadata;
We already identified some strategies on how to perform these activities along some of the dimensions proposed by Bruce and Hillmann (2004), and in particular:
: computed by statistical approach; ratio of filled fields with respect to metadata profile. The resulting value is a real number between 0 and 1: the closer this value is to 1, the more complete the metadata description of the object.
: computed using Natural Language Processing applied to description field; several linguistic features (e.g., text length, syntactic complexity, conceptual density, entity mentions) are extracted from the description text, and, using part of the gold standard data as training set, a binary classifier will be trained to distinguish between “good” and “bad” descriptions based on such features. Each description is represented as a vector of features.
Logical consistency & Coherence
: computed using semantic web technologies; leverage ontological background knowledge on artists, objects, dates, art periods, etc., to assess the coherence of metadata values (e.g., the dating of a metadata object should be compatible with the birthdate/deathdate of its author(s)); enforce use of reference ontologies / vocabularies on controlled metadata.
To validate the quality of the developed metrics and algorithms, we will rely on a “gold standard” dataset manually collected from the available repositories. Part of this dataset of good quality data (i.e., positive examples), will also be complemented with bad metadata quality objects (i.e., negative examples) to support the training of the envisioned supervised approaches.
Work done so far
We collected an initial “gold standard” dataset of 100 high-quality records, in terms of metadata description, from the repository of the italian digital library Cultura Italia
, spanning different object typologies (paintings, drawings, sculptures, etc.), to test the effectiveness of the metrics and methods to be developed.
We started implementing the metrics and methods for the
dimension. The idea is to count the available metadata, dividing this number over all expected metadata. More precisely, to capture that some metadata may be more important (e.g., compulsory fields) than others when estimating the completeness of an object description, we associate a weight to each metadata: e.g., compulsory fields may have weight 2, while non-compulsory ones weight 1. Hence, the completeness ratio results by dividing the sum of the weights of the available metadata over the sum of the weights of the expected metadata. The resulting value is a real number between 0 and 1: the closer this value is to 1, the more complete the metadata description of the object.
We computed the completeness ratio for all the ~2,500 records of the “Palazzo Pitti” dataset in CulturaItalia. Fig. 1 shows a breakdown of the records in 5 interval categories according to their completeness ratio (e.g., 5% of the records have a completeness ration between 0.2 and 0.4), while Fig.2 plots the frequency of each specific metadata in the dataset.
Fig.1 Percentage of completeness ratio
Fig.2 Frequency of the metadata elements
Considering the amount of digital archives, problems related to metadata curation becomes evident. Reasons may be different: There is no curation task force, the metadata curation activity is delegated to the content providers or the metadata curation activity is made by hand. The development of an automatic process will enable the curators to not only obtain snapshots of the quality of a repository, but also to constantly monitor its evolution and how different events affect it without the need to run costly human effort. This could lead to the creation of innovative applications based on metadata quality that would improve the final user experience.
Bruce, T.R. and Hillmann, D. (2004) The Continuum of Metadata Quality: Defining, Expressing, Exploiting.
Metadata in Practice, Hillmann and Westbrooks.
Dangerfield, M.C. and Kalshoven, L. (2013)
Report and Recommendations from the Task Force on Metadata Quality
Ochoa, X. and Duval, E. (2009) Automatic evaluation of metadata quality in digital repositories.
International Journal on Digital Libraries, 10, pp. 67–91.
Paulheim,H. (2017) Knowledge graph refinement: A survey of approaches and evaluation methods.
Semantic Web 8(3) pp. 489-508.
Radulovic, F., Mihindukulasooriya, N., Garca-Castro, R. and Gomez-Prez, A. (2018) A comprehensive quality model for Linked Data.
Semantic Web, 9, pp. 3-24.
If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.
Hosted at Utrecht University
July 9, 2019 - July 12, 2019
436 works by 1162 authors indexed
Conference website: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/index.html
Series: ADHO (14)