Developing a text readability system for Sesotho based on classical readability metrics

paper, specified "short paper"
Authorship
  1. Johannes Sibeko

    Nelson Mandela University, South Africa

  2. Menno van Zaanen

    South African Centre for Digital Language Resources, South Africa

Work text


Sesotho, one of South Africa’s (SA) eleven official languages, is the home language of about eight percent of SA inhabitants and 98 percent of the population of Lesotho (Reid et al., 2019). Like many Asian languages, Sesotho is an under-resourced language (Wills et al., 2020). The repository of the South African Centre for Digital Language Resources (SADiLaR) hosts the limited Sesotho resources that are available (see https://repo.sadilar.org/).
This project aims to develop a readability tool for Sesotho texts. When additional language resources are required, these will also be developed. For readers (especially learners) to select texts suitable for their reading level, a measure of readability for texts is essential.
Existing text readability investigations in the context of SA have focused mainly on health documents (Joubert and Githinji, 2014; Krige and Reid, 2017; Leopeng, 2019; De Wet, 2021) and textbooks (Sibanda, 2013; Wissing et al., 2016). Krige and Reid (2017) used three English metrics to measure the readability of medical pamphlets in Sesotho, an approach that does not account for differences between the two languages. Language-specific readability metrics should be developed before proper conclusions can be drawn. To our knowledge, no language-specific readability metrics exist for any African language apart from Afrikaans (Jansen et al., 2017). Unfortunately, no implementations of these metrics could be found.
To develop readability metrics, texts with known readability levels are needed. Unfortunately, for Sesotho, copyright restrictions limit access to texts with (expected) known readability levels, such as textbooks. However, in SA, Sesotho is examined in high school at two levels: home language (HL) and first additional language (FAL). We expect these exam texts to have consistent readability over the years, with HL texts being more difficult to read than FAL texts. To test this expectation, we analyzed the readability of SA English HL and FAL exam texts (Sibeko and Zaanen, 2021) using existing metrics. The analysis showed that the readability of the texts is consistent over time and differs between the two levels.
If we assume that the development of the exam texts for Sesotho (and the other SA languages) follows the same process as that for English texts (Sibeko and Zaanen, 2021), Sesotho exam texts should also show clear differences in levels of readability. They can then be used for the development of readability metrics for Sesotho.
We are currently building on the text properties used in nine readability metrics for English (Sibeko and Zaanen, 2021): Flesch-Kincaid Grade Level (Kincaid) (Kincaid et al., 1975), Flesch Reading Ease (Flesch) (Flesch, 1948), Simple Measure of Gobbledygook (SMOG) (Mc Laughlin, 1969), Gunning Fog index (Fog) (Gunning, 1952; Gunning, 1969), läsbarhetsindex (LIX) (Björnsson, 1968), Rate index (RIX) (Anderson, 1983), Automated Readability Index (ARI) (Senter and Smith, 1967; Kincaid and Delionbach, 1973), Coleman-Liau index (Coleman and Liau, 1975), and the Dale-Chall index (Dale and Chall, 1948).
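For orientation, four of these metrics are commonly stated as shown below. These are the standard published forms for English, not formulas proposed in this work. The symbols are our own shorthand: W is the number of words, S the number of sentences, Y the number of syllables, and L the number of long words (more than six characters).

```latex
\begin{align*}
\text{Flesch Reading Ease} &= 206.835 - 1.015\,\frac{W}{S} - 84.6\,\frac{Y}{W} \\
\text{Flesch-Kincaid Grade Level} &= 0.39\,\frac{W}{S} + 11.8\,\frac{Y}{W} - 15.59 \\
\text{LIX} &= \frac{W}{S} + 100\,\frac{L}{W} \\
\text{RIX} &= \frac{L}{S}
\end{align*}
```

The constants in the first two formulas were fitted to English data, which is one reason they cannot be assumed to carry over to Sesotho.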
The readability metrics rely on text properties such as word and sentence length. Due to differences in language structure, these properties cannot be applied readily to other languages. To this end, we are re-conceptualising the following properties to reflect Sesotho’s context: long words (words of more than six characters in the LIX and RIX metrics), difficult words (words that do not appear among the 3000 most frequently used English words in the Dale-Chall index), and complex words (words of more than two syllables in the Gunning Fog index). In particular, features such as syllables and frequently used words are language specific.
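The character-based properties can be illustrated with a minimal Python sketch that counts sentences, words, and long words, and derives LIX and RIX from them. The function names, the simple tokenisation, and the `long_word_min_chars` parameter are assumptions made for illustration; the default of seven characters mirrors the English "more than six characters" definition and is exactly the kind of value that would need to be re-examined for Sesotho.

```python
import re


def text_properties(text, long_word_min_chars=7):
    """Count sentences, words and long words in a text.

    Tokenisation is deliberately naive (an assumption of this sketch):
    sentences end at '.', '!' or '?', and words are runs of letters.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)
    long_words = [w for w in words if len(w) >= long_word_min_chars]
    return {
        "sentences": len(sentences),
        "words": len(words),
        "long_words": len(long_words),
    }


def lix(props):
    """LIX: average sentence length plus percentage of long words."""
    return (props["words"] / props["sentences"]
            + 100 * props["long_words"] / props["words"])


def rix(props):
    """RIX: long words per sentence."""
    return props["long_words"] / props["sentences"]


# Illustrative two-sentence input, not taken from the exam corpus.
sample = "Ke rata ho bala dibuka. Sesotho se monate!"
props = text_properties(sample)
print(props, lix(props), rix(props))
```

Making the threshold a parameter keeps the metric definition separate from the language-specific decision about what counts as a long word.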
To resolve these issues, we are currently developing automated Sesotho syllabification systems, including a rule-based system based on Guma’s (1982) description and a pattern-based system that uses TeX’s hyphenation system (Liang, 1983). Additionally, we are investigating the concepts of long, difficult, and complex words in Sesotho. To make matters more complex, Sesotho has two orthographies, the Lesotho orthography (LS) and the SA orthography (SAS) (Motjope-Mokhali et al., 2020). We currently use SAS because the SA high school exam texts are written in it.
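To give a sense of what a rule-based syllabifier involves, the following Python sketch splits a word using two simplified assumptions: a syllable closes at each vowel, and a nasal or a doubled l, m, or n at the start of a syllable stands on its own as a syllabic consonant. These rules are a toy approximation chosen for illustration; they are not the rules described by Guma (1982), they ignore diacritics and the LS orthography, and the function name `syllabify` is hypothetical.

```python
VOWELS = set("aeiou")


def syllabify(word):
    """Split a word into syllables with two toy rules (assumptions):

    1. every vowel closes the current syllable (open syllables);
    2. a syllable-initial l/m/n followed by the same letter, or a
       syllable-initial m/n followed by another consonant (except the
       digraphs 'ng' and 'ny'), forms a syllable on its own.
    Any trailing consonants (for example a final 'ng') become the
    last syllable.
    """
    word = word.lower()
    syllables = []
    current = ""
    for i, ch in enumerate(word):
        current += ch
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in VOWELS:
            syllables.append(current)
            current = ""
            continue
        starts_syllable = current == ch
        doubled = ch in "lmn" and nxt == ch
        nasal_before_consonant = (
            ch in "mn"
            and nxt != ""
            and nxt not in VOWELS
            and not (ch == "n" and nxt in "gy")
        )
        if starts_syllable and (doubled or nasal_before_consonant):
            syllables.append(current)
            current = ""
    if current:
        syllables.append(current)
    return syllables


for w in ["sesotho", "mme", "moreng", "hlompho"]:
    print(w, syllabify(w))
```

A count of the returned syllables is what a Sesotho counterpart of the "complex word" property would be computed from, once the rules have been validated against a linguistic description.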
Once the different text properties are defined, they can be applied to the Sesotho exam texts. The resulting values can then be combined in linear regression models, which will yield mathematical formulas that assign a readability level to Sesotho texts.
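The regression step can be sketched as follows, assuming each text is represented by a few numeric properties and a known level. The sketch fits ordinary least squares through the normal equations in plain Python; the function names and all numbers are placeholders invented for illustration, not values from the exam texts, and a statistics library would normally be used instead.

```python
def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows    -- one tuple of text properties per text
    targets -- the known readability level of each text
    Returns [intercept, coef_1, ..., coef_k].
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]
    n = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        if abs(A[col][col]) < 1e-12:
            raise ValueError("properties are collinear; cannot fit")
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(coefs, row):
    """Apply the fitted formula to the properties of one text."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Placeholder data: (average sentence length, share of long words).
rows = [(10, 0.10), (15, 0.20), (20, 0.15), (25, 0.30)]
# Placeholder levels generated from 1 + 0.5 * x1 + 10 * x2.
levels = [7.0, 10.5, 12.5, 16.5]
coefs = fit_linear(rows, levels)
print(coefs, predict(coefs, (18, 0.25)))
```

The fitted coefficients play the same role as the constants in the English formulas: they turn measured text properties into a single readability score.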
This contribution describes progress in the development of the first automated text readability analysis tool for an SA language (Sesotho). Given the limited availability of computational resources for Sesotho, we also describe the language resources developed within the project. To aid the development of digital language resources, all Sesotho resources developed will be published in open repositories, such as GitHub and SADiLaR’s repository.

Bibliography

Anderson, J. (1983). Lix and rix: Variations on a little-known readability index. Journal of Reading, 26(6): 490–96.

Björnsson, C. (1968). Läsbarhet. (Pedagogiskt Utvecklingsarbete Vid Stockholms Skolor. 6). Liber: Solna, Seelig.

Coleman, M. and Liau, T. L. (1975). A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60(2): 283–84.

Dale, E. and Chall, J. S. (1948). A formula for predicting readability: Instructions. Educational Research Bulletin: 37–54.

De Wet, A. (2021). The development of a contextually appropriate measure of individual recovery for mental health service users in a South African context. PhD thesis, Stellenbosch University, Stellenbosch, South Africa.

Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32(3): 221.

Guma, S. (1982). An Outline Structure of Southern Sotho. 2nd ed. Pietermaritzburg, South Africa: Shuter and Shooter Publishers.

Gunning, R. (1952). Technique of Clear Writing. McGraw-Hill.

Gunning, R. (1969). The fog index after twenty years. Journal of Business Communication, 6(2): 3–13.

Jansen, C., Richards, R. and Van Zyl, L. (2017). Evaluating four readability formulas for Afrikaans. Stellenbosch Papers in Linguistics Plus, 53: 149–66.

Joubert, K. and Githinji, E. (2014). Quality and readability of information pamphlets on hearing and paediatric hearing loss in the Gauteng province, South Africa. International Journal of Pediatric Otorhinolaryngology, 78: 354–58.

Kincaid, J. P. and Delionbach, L. J. (1973). Validation of the automated readability index: A follow-up. Human Factors, 15(1): 17–20.

Kincaid, J. P., Fishburne Jr, R. P., Rogers, R. L. and Chissom, B. S. (1975). Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. Naval Technical Training Command Millington TN Research Branch.

Krige, D. and Reid, M. (2017). A pilot investigation into the readability of Sesotho health information pamphlets. Communitas, 22: 113–23.

Leopeng, M. T. (2019). Translations of informed consent documents for clinical trials in South Africa: Are they readable? Master’s thesis, University of Cape Town, Cape Town, South Africa.

Liang, F. (1983). Word hy-phen-a-tion by com-put-er. PhD thesis, Stanford University, Stanford, USA.

Mc Laughlin, G. H. (1969). SMOG grading - a new readability formula. Journal of Reading, 12(8): 639–46.

Motjope-Mokhali, T., Kosch, I. and Mafela, M. J. (2020). Sethantso sa Sesotho and Sesuto-English dictionary: A comparative analysis of their designs and entries. Lexikos, 30: 1–17.

Reid, M., Nel, M. and Janse van Rensburg-Bonthuyzen, E. (2019). Development of a Sesotho health literacy test in a South African context. African Journal of Primary Health Care and Family Medicine, 11(1): 1–13.

Senter, R. and Smith, E. A. (1967). Automated Readability Index. Cincinnati University, OH.

Sibanda, L. (2013). A case study of the readability of two grade 4 natural sciences textbooks currently used in South African schools. Master’s thesis, Rhodes University, Grahamstown, South Africa.

Sibeko, J. and Zaanen, M. van (2021). An analysis of readability metrics on English exam texts. In Proceedings of the International Conference of the Digital Humanities Association of Southern Africa (DHASA).

Wills, S., Uys, P., Heerden, C. J. van and Barnard, E. (2020). Language modeling for speech analytics in under-resourced languages. In Proceedings of the 21st Annual Conference of the International Speech Communication Association (Interspeech 2020), Shanghai, China. International Speech Communication Association, pp. 4941–45.

Wissing, G.-J., Blignaut, A. S. and Van den Berg, K. (2016). Using readability, comprehensibility and lexical coverage to evaluate the suitability of an introductory accountancy textbook to its readership. Stellenbosch Papers in Linguistics, 46: 155–79.


Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022


Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO