Multilinguals Write Back: Modeling Language, Politics and Identity in Philippine Social Media

paper, specified "long paper"
Authorship
  1. Frances Antoinette Cruz

    University of Antwerp/University of the Philippines Diliman

  2. Mike Kestemont

    Universität Antwerpen (University of Antwerp)

Work text

Research problem statement
This paper documents the intersections of language and the public sphere through a cross-sectional model of comments on the public Facebook (FB) pages of selected newspapers from 2015 to 2019. The analysis examines language use in discussions of current events, differences between national and regional newspapers, and what social media reveals about the conduct of public discourse. A common observation in the Philippines is that, while English is used for official documents, tertiary education, and national broadsheets, oral discussions tend to involve Filipino or regional languages (Gonzalez, 1998). Class and political differences are also strongly associated with language use and media preferences (Kusaka, 2017). Social media, with its informal written language yet socially and politically relevant content, may thus offer insights into contemporary language use and public engagement with media.
Methodology
Comments on the public FB pages of selected national and regional newspapers (Manila Bulletin, Manila Times, Philippine Daily Inquirer, Philippine Star, Cebu Daily News, Mindanews, Mindanao Times, Sun Star Cebu, Sun Star Davao, and The Freeman) were captured through Facepager (Jünger & Keyling, 2019). The corpus is taken from a separate project on Muslim identities in the Philippines and consists of data from selected months in 2015, 2017, and 2019.

While language identification is a common task in natural language processing, most available software casts it as a multi-class classification task, in which only a single language can be assigned to a document. In the present case, where code-switching is a signature feature, multiple languages can be present in a document simultaneously, which effectively turns the task into a multilabel classification problem. We fine-tuned a bespoke multilabel classifier on top of a pretrained BERT (Devlin et al., 2019) feature extractor (the ‘bert-base-multilingual-uncased’ model). We only considered messages in which the target language(s) could be identified, and divided these into train, validation and test sets of 10,000 social media messages (containing 7,304 and 2 x 913 instances respectively). Each message was manually annotated for the presence or absence of Tagalog/Filipino, Cebuano, and English. We compared the performance of this state-of-the-art approach to a simpler baseline: a conventional multitarget classifier in the form of a random forest (RF) (Pedregosa et al., 2011), built on 16,156 manually annotated entries (with train, validation and test sets of 12,998 entries and 2 x 2,294 instances) and trained on top of a TF-IDF representation of a vocabulary of character n-grams (for 2 ≤ n ≤ 6) (see Table 2 below for details). Finally, we applied the BERT language detector (with the weights that optimized validation performance) to the unseen data.
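The difference between the multi-class and multilabel framings, and the character n-gram vocabulary used by the baseline, can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function names, the 0.5 decision threshold, the lowercasing step, and the example scores are all assumptions made for illustration, and no BERT or random forest model is involved.

```python
import math

# The three target languages named in the text; the identifiers are illustrative.
LANGUAGES = ("tagalog_filipino", "cebuano", "english")


def char_ngrams(text, n_min=2, n_max=6):
    """Return all character n-grams of `text` for n_min <= n <= n_max.

    Lowercasing is an assumption, not a documented preprocessing step.
    """
    text = text.lower()
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def sigmoid(x):
    """Map a raw score to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_multiclass(scores):
    """Multi-class view: exactly one language, the highest-scoring one."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [LANGUAGES[best]]


def decode_multilabel(scores, threshold=0.5):
    """Multilabel view: every language whose own probability passes the threshold."""
    return [lang for lang, s in zip(LANGUAGES, scores) if sigmoid(s) >= threshold]


# Hypothetical raw scores for a Tagalog-English code-switched comment.
example_scores = [2.1, -3.0, 1.4]
single = decode_multiclass(example_scores)
multi = decode_multilabel(example_scores)
```

Under the multi-class rule the English portion of the comment is lost, whereas the multilabel rule scores each language independently and so can report both languages for a single message.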

Table 1 – Number of entries per language in Test, Train and Validation Sets

Table 2 – Test Accuracies

Findings
Both classifiers showed similar general trends: Tagalog/Filipino entries were the most common overall (Figures 1 and 2). Regionally, more diverse language use is noticeable. Cebuano comments were prominent in at least two Cebu-based newspapers, while the Mindanao Times and Sun Star Davao featured Tagalog/Filipino as the most-used language, followed by English and Cebuano (Figures 3 and 4), suggesting that Tagalog/Filipino is used even where Visayan languages are prominent. Despite the presence of monolingual English-language newspapers, the fact that current events are written about and responded to by a multilingual Philippine public sphere is often obscured. Social media thus not only offers opportunities for re-thinking monolingual norms in media, but may also act as a forum for revitalizing written forms of regional languages, while serving as a rich corpus of code-switching and informal written language for automatic language identifiers.
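A per-newspaper breakdown of the kind summarized in Figures 3 and 4 could be tallied from multilabel predictions along the following lines. This is a sketch under stated assumptions: the rows are invented placeholders rather than results from the corpus, the function name is hypothetical, and counting a code-switched comment once for each language it contains is one possible convention, not necessarily the one used for the figures.

```python
from collections import Counter, defaultdict

# Invented placeholder rows: (newspaper, languages detected in one comment).
# These are NOT results from the study.
predictions = [
    ("Sun Star Cebu", {"cebuano"}),
    ("Sun Star Cebu", {"cebuano", "english"}),
    ("Mindanao Times", {"tagalog_filipino"}),
    ("Mindanao Times", {"tagalog_filipino", "english"}),
    ("Mindanao Times", {"cebuano"}),
]


def languages_per_newspaper(rows):
    """Count, for each newspaper, how many comments contain each language.

    A code-switched comment contributes one count to every language it contains.
    """
    counts = defaultdict(Counter)
    for newspaper, languages in rows:
        counts[newspaper].update(languages)
    return counts


counts = languages_per_newspaper(predictions)
# Most frequent language for one newspaper, as a (language, count) pair.
top_mindanao = counts["Mindanao Times"].most_common(1)[0]
```

Because each comment may add to several languages, the per-newspaper totals can exceed the number of comments, which is worth keeping in mind when reading such counts as shares.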

Fig. 1 – Language Use (BERT)

Fig. 2 – Language Use (RF)

Fig. 3 – Languages per Newspaper (BERT)

Fig. 4 – Languages per Newspaper (RF)

Bibliography

Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT, Minneapolis, MN, June 2019, pp. 4171–4186.

Gonzalez, A. (1998). The Language Planning Situation in the Philippines. Journal of Multilingual and Multicultural Development, 19(5 & 6): 487–525.

Jünger, J. and Keyling, T. (2019). Facepager: An application for automated data retrieval on the web. https://github.com/strohne/Facepager/ (accessed April 7 2022).

Kusaka, W. (2017). Moral Politics in the Philippines. Singapore/Japan: NUS Press & Kyoto University Press.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12: 2825–2830.

Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO