Digital Scholarly Editions in the Humanities are the result of critical editing processes, in which analog documents are transcribed into a digital form, mostly with the help of markup languages. An essential step of these processes is the recognition and markup of entities such as people, places and organizations in the digital texts. The software
NEISS TEI Entity Enricher (NTEE) assists this step in two ways: It provides the possibility to train custom entity taggers based on state-of-the-art neural networks and it supports manual identification of the recognized entities. The input and output format is TEI-XML (TEI Consortium, 2021).
NTEE is an open source software. The program and its documentation are available at:
Description of functionality
The core feature of the software is the fine-tuning of pre-trained neural language models for an edition-specific entity recognition task. The resulting entity tagger can then be used to mark up entities in one or more TEI files. In total, the NTEE workflow includes five subtasks: configuration, groundtruth building, training, prediction and postprocessing. They all take place in a single user interface.
A training data overview in the Groundtruth Builder screen
As a prerequisite for training, a groundtruth must be created in the
Groundtruth Builder (see Fig. 1), i.e. a set of already annotated data that can be used to train an entity tagger. For this purpose, TEI files are applied as input data. Also, three configurations need to be done: The
Entity Definition is simply a TEI-independent list of those entity classes to be tagged later. A
TEI Read Entity Mapping (see Fig. 2) then defines which XML elements (including attribute values) correspond to these entities in the input TEI files. Finally, the
TEI Reader Config can be used to exclude the content of certain XML elements from processing, as a restriction on the basic behavior of NTEE to process the complete content of the ‘tei:text’ element mandatory in the TEI standard.
Example of mapping XML elements to entities for the purpose of groundtruth building
Configurations for training an entity tagger
With a given groundtruth it is possible to start the training process in the
NER Trainer screen (see Fig. 3) by selecting the groundtruth and a pre-trained language model. Different models (English, French, German, Spanish, Multilingual) published on the
Hugging Face hub (Wolf et al., 2020) can be selected in NTEE out of the box. After training, the resulting entity tagger is ready to be applied to multiple TEI files for entity prediction. Finally, a manual assessment of the prediction results and the identification of the entities may be carried out in the
Postprocessing screen (see Fig. 4 and 5).
Display and edit recognized entities in the Postprocessing screen
Named Entity Recognition (NER)
The neural networks used in NTEE are based on so-called
Bidirectional Encoder Representations from Transformers (BERTs) (Devlin et al., 2019). This is currently one of the most widespread network architecture used to train language models that solve tasks like NER in the field of Natural Language Processing. Classically, a BERT is first pre-trained with large amounts of unlabeled text to obtain a robust language model and then fine-tuned to a downstream task like NER. In our paper (Zöllner et al., 2021), we have extensively investigated various pre-training and fine-tuning techniques to train such a BERT model as optimally as possible for entity enrichment in Digital Scholarly Editions. In NTEE, we have taken care to ensure that the provided models are usable even with limited computational resources.
Named Entity Linking (NEL)
The procedure of identifying recognized entities is called
Named Entity Linking due to the fact that it is done by providing a Uniform Resource Identifier (URI). For this, NTEE supports the users by delivering identification suggestions based on the entity element string using queries to a local Entity Library or the Wikidata Knowledge Graph (Bayer, 2013).
The latter delivers an ontology with which it is possible to accelerate the identification process: By using Semantic Web technology the set of potential candidates is narrowed.
The former is a JSON file, in which entities can be saved locally. On the one hand, this also accelerates the identification process due to the omission of a web query to the Wikidata Knowledge Graph in these cases. On the other hand, this enables the user to apply edition-specific knowledge in the NEL process such as unusual nicknames of persons or entities at all that are not stored in the Wikidata database.
Identity suggestions for an entity in the Postprocessing screen
This work was funded by the European Social Fund (ESF) and the Ministry of Education, Science, and Culture of Mecklenburg-Western Pomerania (Germany) within the project NEISS (Neural Extraction of Information, Structure, and Symmetry in Images) under grant no ESF/14-BM-A55-0006/19.
Bayer, T. (2013). The Wikidata revolution is here: enabling structured data on Wikipedia https://diff.wikimedia.org/2013/04/25/the-wikidata-revolution/.
Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, pp. 4171–86 doi:10.18653/v1/N19-1423. https://aclanthology.org/N19-1423.
TEI Consortium (2021). TEI P5: Guidelines for Electronic Text Encoding and Interchange https://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., et al. (2020). HuggingFace’s Transformers: State-of-the-art Natural Language Processing.
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45.
Zöllner, J., Sperfeld, K., Wick, C. and Labahn, R. (2021). Optimizing Small BERTs Trained for German NER.
12(11) doi:10.3390/info12110443. https://www.mdpi.com/2078-2489/12/11/443.
If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.
July 25, 2022 - July 29, 2022
361 works by 945 authors indexed
Held in Tokyo and remote (hybrid) on account of COVID-19
Conference website: https://dh2022.adho.org/
Contributors: Scott B. Weingart, James Cummings
Series: ADHO (16)