Bootstrapping Project-specific Spell-checkers

Long paper
Authorship
  1. C. M. Sperberg-McQueen, Black Mesa Technologies LLC
  2. Claus Huitfeldt, University of Bergen

Spell-checking software is well established in consumer applications but often unexploited by data-creation projects in the digital humanities. We argue that spell-checking provides a relatively straightforward way to find (some) transcription errors even in texts written in idiosyncratic or inconstant spelling.

Working hypotheses
We believe that spell checking is feasible, useful, and underused in DH projects.
More specifically:

Fewer than half of DH projects transcribing existing materials use spell-checking technology.
Standard word-in-isolation spell checking can find transcription errors.
Project-specific spelling dictionaries can do better than off-the-shelf dictionaries for

writing by idiosyncratic or inconstant spellers
older language forms no longer supported in off-the-shelf dictionaries
non-standard and minority languages which lack off-the-shelf dictionaries

Project-specific filters may be necessary to create a checkable alpha text ([Huitfeldt 2006]) but are feasible.

Modeling spell-checking, modeling languages
In conventional word-in-isolation spell-checking ([Earnest 2011], [Damerau 1964], [McIlroy 1982], [Bentley 1986], [Peterson 1980], [Kuenning 2018], [Atkinson 2017], [Németh 2018]), the language model is trivial: all acceptable forms are equiprobable, a form is acceptable if and only if listed in the dictionary, unknown forms have probability zero, and any token with probability zero is a probable error. To find alternative spellings, a Levenshtein distance of one ([Norvig 2007]) or more ([Garbe 2012], [Garbe 2015]) is sometimes used. A combination of phonetic encoding and Levenshtein distance can sometimes be helpful ([Atkinson 2017]).
Recent work on spelling correction (e.g. [Choudhury et al. 2018], [Dashti et al. 2018]) uses more elaborate models to detect ‘real-word’ errors (e.g. “be” for “he”).
In this paper, however, we assume the simple model of text as a sequence of equiprobable known forms.
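As a concrete illustration of this simple model, here is a minimal sketch (ours, not code from any of the tools cited above): every token absent from the dictionary is flagged, and suggestions are the known forms at Levenshtein distance one, in the manner of [Norvig 2007].

    # Word-in-isolation checking under the trivial language model:
    # a form is acceptable iff it is listed in the dictionary.
    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def edits1(word):
        """All strings at Levenshtein distance one from word."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
        inserts = [a + c + b for a, b in splits for c in ALPHABET]
        return set(deletes + transposes + replaces + inserts)

    def check(tokens, dictionary):
        """Yield (token, suggestions) for each token not in the dictionary."""
        for t in tokens:
            if t.lower() not in dictionary:
                yield t, sorted(edits1(t.lower()) & dictionary)

    for flagged, hints in check(["teh", "cat"], {"the", "cat", "sat"}):
        print(flagged, "->", hints)    # prints: teh -> ['the']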

Challenges in using spell-checking
For spell-checking in DH projects, some complications arise.

Transcribers normally seek to reproduce the spelling of the exemplar, not to correct it. When standard spelling dictionaries are used to check material which consistently violates orthographic norms (idiosyncratic spelling), they will erroneously flag some correctly transcribed misspellings and miss unconscious corrections by the transcriber.
Off-the-shelf dictionaries reflect current norms for widely spoken languages. Older language varieties and under-resourced languages often lack dictionaries.
The language transcribed may have no standardized orthography; spelling may vary by scribe or line-by-line (inconstant spelling).
XML documents may contain material not to be spell-checked (markup, project-internal comments, etc.), or material in different languages or varieties (e.g. 21st-century English in the header and 17th-century English in the body).

We believe these complications can be addressed.
For idiosyncratic spelling, the solution is to use a project-specific dictionary, not a standard dictionary, so that correctly transcribed misspellings will be accepted and inadvertent corrections flagged.
Inconstant spelling makes spell-checkers miss transcription errors that substitute one accepted form for another. But spell-checking can still catch other errors. (Similarly, an English spelling dictionary with both British and American spellings won't catch "colour" in American texts, but it will catch the typo "teh".)
Producing project-specific dictionaries from scratch requires some work, but our experiments suggest that even modest effort can produce spelling dictionaries that will detect existing transcription errors without excessive noise.
For dealing with XML, it's helpful to write filters to extract the desired word forms. Fortunately, this is normally straightforward.
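Such a filter might look like the following sketch; the excluded element names (teiHeader, note, sic) and the file name are illustrative placeholders, not either project's actual configuration.

    import re
    import xml.etree.ElementTree as ET

    # Elements whose content should not be spell-checked
    # (illustrative names only).
    SKIP = {"teiHeader", "note", "sic"}

    def words(elem):
        """Recursively yield word tokens, skipping excluded subtrees."""
        if elem.tag.split("}")[-1] not in SKIP:    # drop namespace prefix, if any
            if elem.text:
                yield from re.findall(r"\w+", elem.text)
            for child in elem:
                yield from words(child)
        if elem.tail:                              # text after an element still counts
            yield from re.findall(r"\w+", elem.tail)

    alpha_text = list(words(ET.parse("transcription.xml").getroot()))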

A small pilot study
Several practical questions arise:

How can project-specific dictionaries be constructed?
What should they contain (and exclude)?
How much work is involved? How big must the dictionary be:

to catch as many actual errors as possible?
not to flag correctly transcribed words erroneously?

How much project data is necessary to obtain a dictionary of that size?

We have explored these questions using material from the Wittgenstein Archives at the University of Bergen and from Liam Quin's digital version of Alexander Chalmers's General Biographical Dictionary ([Quin 2017]).

For each project, we selected test material: for Wittgenstein, two small texts taken from non-final versions of the corpus; for Chalmers, 10,000 words from volume 25.
For Wittgenstein, we checked the normalized-spelling text, identifying word forms which violate German orthographic norms.
The Wittgenstein project defined standard orthography as that of Duden's Rechtschreibung, but admitted some idiosyncratic spellings consistently used by Wittgenstein.

For Chalmers, we proofread the sample against the page scans.
With programmatic filters we extracted an alpha text (a list of words for spell-checking).
We constructed dictionaries of various sizes by compiling lists of correct forms from different subsets of the project corpora.
In principle, project-specific dictionaries should be built by proofreading texts one-by-one; to streamline the pilot project, we took the shortcut of checking wordlists against off-the-shelf dictionaries. This does not visibly affect the statistical results shown later, but it does mean that for Chalmers some mistranscriptions were missed and some bad flags thrown.

We checked the test samples against those dictionaries.
For each test, we counted the number of correct and incorrect tokens in the sample flagged or left unflagged by the spell checker.
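The bookkeeping for each run is thus a two-by-two tally over tokens; a sketch, assuming the positions of actual transcription errors are known from proofreading:

    # Each token is either correctly or incorrectly transcribed,
    # and either flagged or left unflagged: four outcomes in all.
    def tally(tokens, error_positions, dictionary):
        counts = {"good flag": 0, "bad flag": 0, "missed error": 0, "clean pass": 0}
        for i, t in enumerate(tokens):
            flagged = t.lower() not in dictionary
            wrong = i in error_positions
            if flagged and wrong:
                counts["good flag"] += 1      # detected transcription error
            elif flagged:
                counts["bad flag"] += 1       # correct form, erroneously flagged
            elif wrong:
                counts["missed error"] += 1   # e.g. a real-word error
            else:
                counts["clean pass"] += 1
        return counts

The ratio of good flags to bad flags is the signal-to-noise ratio discussed below.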

Results of the pilot study

Constructing project-specific dictionaries
The simplest (not fastest) method is to start with an empty dictionary and spell-check texts from the project's corpus one by one. For each word flagged by the spell checker, either add it to the dictionary or correct it in the text. (More on this below.)
With an empty dictionary, the spell-checker will at first flag every form in the text. To avoid the tedium of dealing with so many bad flags, it may be worthwhile to list the most frequent forms in whatever part of the corpus is available for consultation, check them manually, and seed the dictionary with them. For Wittgenstein, a dictionary of 1300 forms covers about 90% of the running tokens in the text, flagging only one token in ten.
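Seeding then amounts to taking the head of a frequency table and vetting it by hand; a sketch, reusing the alpha_text list from the filter above:

    from collections import Counter

    # Candidate seed forms are the most frequent forms in the corpus;
    # they still need a manual check before being adopted.
    freq = Counter(w.lower() for w in alpha_text)
    seed = [form for form, n in freq.most_common(1300)]
    coverage = sum(freq[f] for f in seed) / sum(freq.values())
    print(f"{len(seed)} forms cover {coverage:.0%} of running tokens")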

What to include and exclude
Ideally, the dictionary should include all forms which actually occur correctly in the corpus and no forms which are transcription errors. This ideal is unattainable for two reasons. First, the same form may occur both as a correct transcription and as a mistranscription (real-word errors); it cannot be both included and excluded. Second, as the corpus grows, there will always be some correct forms not yet found in the dictionary, and thus always some erroneous flags.
The optimal solution is to balance the relative inconvenience of undetected errors and false flags against the relative frequency with which each form is correct or mistranscribed. If undetected errors and bad flags have equal weight, then a form should be included in the dictionary if any occurrence is more likely correct than not. If we would rather see ten bad flags than miss one mistranscription, then a form should be included only if it is ten times more likely to be correct than incorrect. The project's preferences determine the threshold to be met.
If spelling habits vary from document to document, it can be useful to make both a project-wide dictionary and document-specific dictionaries for texts with distinctive usage.
When forms intentionally excluded from the dictionary do occur correctly transcribed, they can be marked with sic or similar markup and excluded from the alpha text, to avoid throwing bad flags for them.
With these complications, the rule for forms flagged by the spell-checker becomes:

If the form is correctly transcribed and meets the project's correctness threshold, add it to the project dictionary.
If the form is correctly transcribed and meets the threshold in the current document but not elsewhere, then add it to the document-specific dictionary.
If the form is correctly transcribed but does not meet the threshold, then tag it with sic or the equivalent.
If the form is incorrectly transcribed, correct it.
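In code, the threshold test and this four-way rule might be expressed as follows (a sketch: the per-form counts of correct and erroneous occurrences are assumed to come from proofreading, and the cost ratio is the project's choice):

    def include(n_correct, n_error, cost_ratio=10):
        """Admit a form only if its occurrences are cost_ratio times more
        likely to be correct than mistranscribed; cost_ratio=10 encodes
        'ten bad flags are as bad as one undetected error'."""
        return n_correct > cost_ratio * n_error

    def triage(corpus_counts, doc_counts, is_correct, cost_ratio=10):
        """Apply the four-way rule to one form flagged by the checker."""
        if not is_correct:
            return "correct the transcription"
        if include(*corpus_counts, cost_ratio):
            return "add to the project dictionary"
        if include(*doc_counts, cost_ratio):
            return "add to the document-specific dictionary"
        return "tag with sic and exclude from the alpha text"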

Dictionary size
Error detection does not require a big dictionary. Indeed, because of real-word errors, bigger dictionaries often find fewer actual errors than smaller dictionaries, as shown below for Chalmers.

Small dictionaries, however, throw too many bad flags. Fortunately, the number of bad flags falls dramatically as dictionary size rises, as shown below for Wittgenstein (red, two samples) and Chalmers (blue).

Whether spell-checking feels useful or pointless depends (we think) on its signal-to-noise ratio; in our data, dictionaries of about 15,000 forms reach a (bearable?) ratio of 1:10 (ten erroneous flags for each detected error).

How big a corpus must be processed to produce a dictionary of suitable size? It varies, but as the plots below show, a corpus of more than 200,000 tokens is needed for a dictionary of 15,000 forms.

Conclusions and future work
Spell checking can find transcription errors in real-world data, even though transcription errors are logically distinct from misspellings and even when the spelling is idiosyncratic or inconstant.
Developing a project-specific dictionary takes little time and can be expected to improve the results of proofreading. Even very small project-specific dictionaries can be useful.
Work remains to be done to extend the pilot study to more materials, to support interactive correction of texts, to improve on the XML support offered by existing spell-checkers, and to explore the application of more sophisticated models of text to support word-in-context spell-checking in lieu of word-in-isolation spell-checking.

Bibliography
[Atkinson 2017] Atkinson, Kevin. “GNU Aspell.” http://aspell.net (Last rev. 30 January 2017).

[Bentley 1986] Bentley, Jon. “A spelling checker.” In Programming Pearls. Reading, Mass.: Addison-Wesley, 1986, pp. 139-150. Reprinted from Communications of the ACM, May 1985.

[Charniak 1993] Charniak, Eugene. Statistical Language Learning. Cambridge, Mass.: MIT Press, 1993.

[Choudhury et al. 2018] Choudhury, Ranjan, Nabamita Deb, and Kishore Kashyap. “Context-Sensitive Spelling Checker for Assamese Language.” In Recent Developments in Machine Learning and Data Analytics, ed. Jugal Kalita, Valentina Emilia Balas, Samarjeet Borah, and Ratika Pradhan (= Advances in Intelligent Systems and Computing 740). New York, etc.: Springer, 2018, pp. 177-188.

[Damerau 1964] Damerau, Fred J. “A technique for computer detection and correction of spelling errors.” Communications of the ACM 7.3 (March 1964): 171-176.

[Dashti et al. 2018] Dashti, Seyed MohammedSadegh, Amid Khatibi Bardsiri, and Vahid Khatibi Bardsiri. “Correcting real-word spelling errors: A new hybrid approach.” Digital Scholarship in the Humanities 33.3 (2018): 488-499.

[Earnest 2011] Earnest, Les. “The three first spelling checkers.” Unpublished sketch, May 2011. On the Web at https://web.archive.org/web/20121022091418/http://www.stanford.edu/~learnest/spelling.pdf, archived from http://www.stanford.edu/~learnest/spelling.pdf.

[Garbe 2012] Garbe, Wolf. “1000x Faster Spelling Correction algorithm.” Blog post, June 2012, originally at http://blog.faroo.de/2012/06/07/improved-edit-distance-based-spelling-correction/ and now at https://medium.com/@wolfgarbe/1000x-faster-spelling-correction-algorithm-2012-8701fcd87a5f.

[Garbe 2015] Garbe, Wolf. “Fast approximate string matching with large edit distances in Big Data.” Blog post, 2015.

[Huitfeldt 2006] Huitfeldt, Claus. “Philosophy Case Study.” In Electronic Textual Editing, ed. Lou Burnard, Katherine O'Brien O'Keeffe, and John Unsworth. New York: MLA, 2006, pp. 181-96.

[Kuenning 2018] Kuenning, Geoff. “International ispell [v 3.4.00].” (Last rev. 26 March 2018).

[McIlroy 1982] McIlroy, Douglas. “Development of a spelling list.” IEEE Transactions on Communications 30.1 (January 1982): 91-99.

[Németh 2018] Németh, László. “Hunspell.” http://hunspell.github.io (Last rev. 6 July 2018).

[Norvig 2007] Norvig, Peter. “How to Write a Spelling Corrector.” Blog post, February 2007 (with periodic revisions to August 2016).

[Peterson 1980] Peterson, James L. “Computer programs for detecting and correcting spelling errors.” Communications of the ACM 23.12 (December 1980): 676-687.

[Quin 2017] Quin, Liam. “Improving text quality with automatic majority editions: How shall I count the ways?” In XML Prague 2017 Conference Proceedings, University of Economics, Prague, February 9-11, 2017, pp. 33-45. On the Web at http://archive.xmlprague.cz/2017/files/xmlprague-2017-proceedings.pdf. The digitized edition of Chalmers is on the Web.
