Introduction to Text Analysis and Topic Modeling with R

workshop / tutorial
Authorship
  1. Matthew Jockers

    University of Nebraska–Lincoln

Work text

General Description:
Introduction to Text Analysis and Topic Modeling with R is a sequence of two workshops that will provide a practical introduction to text analysis with a special emphasis on topic modeling. Taken together, the workshops will cover basic text processing, data ingestion, data preparation, and topic modeling. The main computing environment for the workshops will be R: "the open source programming language and software environment for statistical computing and graphics." While no programming experience is required, students must have basic computer skills, must be familiar with their computer's file system, and must be comfortable entering commands in a command line environment. Though the two workshops are designed to stand alone, the second one is more advanced and assumes some basic familiarity with topic modeling. Participants might want to visit The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors for a general overview.

Suggested Workshop Preparation:
While not required, participants are encouraged to work through at least the first two of the seven basic R lessons available at R Code School prior to taking this workshop.

In advance of the workshop, students should:
Download the current version of R (at the time of this writing, version 3.0.0) from the CRAN website by clicking on the link appropriate to your operating system (see cran.at.r-project.org):
If you use MS Windows, click on "base" and then on the link to the executable (i.e., ".exe") setup file.
If you are running Mac OS X, choose the link to the most current package.
If you use Linux, choose your distribution and then the installer file.
Follow the instructions for installing R on your system in the standard or "default" directory. You will now have the base installation of R on your system.
If you are on a Windows or Macintosh computer, you will find the R application in the directory where Programs (Windows) or Applications (Macintosh) are stored. You can double-click its icon to launch the R GUI, but we will not be using the R GUI in the workshop. We will use RStudio (see below).
If you are on a Linux/Unix system, simply type "R" at the command line to enter the R program environment.
Download and Install RStudio
The R GUI application is fine for a lot of simple programming, but RStudio offers a much nicer environment for writing and running R programs. RStudio is an IDE (an "Integrated Development Environment") for R, and it runs on Windows, Mac, and Linux. After you have installed R (by following the instructions above), download the "Desktop" version (i.e., not the Server version) of RStudio from www.rstudio.com. Follow the installation instructions and then launch RStudio just as you would any other program or application. When you launch RStudio, you do not also have to launch the R program: RStudio accesses the R installation from the first step.
Workshop Syllabus:
Important: It is critical that you arrive on time to every session and be ready to roll with RStudio installed and running. The workshop will begin on schedule, and if you miss the first few minutes of any session you'll be lost!

Workshop One: Introduction to Text Analysis with Applications in R
Summary: In this workshop you will be introduced to the R programming language while learning the basics of computational text analysis. You will learn basic R syntax and be introduced to the RStudio programming environment. Text analysis topics covered will include text ingestion and tokenization, word frequency analysis, dispersion plots, and, if time permits, correlation analysis.
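
As a rough illustration of the tokenization and word frequency steps named above, the following is a minimal sketch in base R. It is not taken from the workshop materials: the sample sentence and all variable names are assumptions made for the example, and a real session would read a full text from a file instead.

```r
# Hypothetical sample text; a workshop session would load a full text from disk.
text <- "The whale swam. The whale dove, and the ship followed the whale."

# Tokenize: lowercase the text, then split on runs of non-word characters.
tokens <- unlist(strsplit(tolower(text), "\\W+"))
tokens <- tokens[tokens != ""]  # drop any empty strings left by the split

# Raw word counts, sorted from most to least frequent.
freqs <- sort(table(tokens), decreasing = TRUE)

# Relative frequencies, expressed as a percentage of all tokens.
rel_freqs <- 100 * freqs / sum(freqs)

# Token positions at which "whale" occurs: the raw data behind a dispersion plot.
whale_positions <- which(tokens == "whale")

print(head(freqs, 3))
print(whale_positions)
```

Plotting `whale_positions` against the full token sequence would show where in the text the word clusters, which is the idea behind a dispersion plot.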

Session one (9:00-10:15)*
The R computing environment
R console vs. RStudio
Basic text manipulation in R
Word Frequency
Break (10:15-10:30)
Session two (10:30-12:00)
Dispersion Plots
Correlation
* Times are for illustration only.

Workshop Two: Introduction to Topic Modeling with Applications in R
Summary: In this workshop, you will be introduced to topic modeling and learn how to analyze and visualize topic model output in R. For this work, we will use the R implementation of MALLET that was developed by David Mimno. Students will also learn how to parse TEI-based XML and how to segment large texts into chunks. We will discuss various text pre-processing procedures, including how to do part-of-speech tagging in R using the openNLP package. Though this will be a hands-on workshop, some techniques explored here are quite advanced, and those unfamiliar with such things as XML document structure and basic text analysis may find it better to observe and then use the included documentation to practice the techniques at home.
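
One plausible way to segment a large text into chunks before topic modeling is sketched below in base R. This is an assumption-laden illustration, not the workshop's own code: the function name `chunk_tokens`, the chunk size, and the sample tokens are all hypothetical, and realistic chunk sizes would typically run to hundreds or thousands of words.

```r
# Split a token vector into consecutive chunks of at most chunk_size words.
# chunk_tokens is a hypothetical helper written for this example.
chunk_tokens <- function(tokens, chunk_size) {
  # Assign each token a group number: 1 for the first chunk, 2 for the next, ...
  group <- ceiling(seq_along(tokens) / chunk_size)
  split(tokens, group)
}

# Hypothetical, already-tokenized text.
tokens <- c("call", "me", "ishmael", "some", "years", "ago", "never", "mind")

chunks <- chunk_tokens(tokens, 3)

# Rejoin each chunk into a single string, so that each chunk can be
# treated as one "document" by a topic model.
docs <- vapply(chunks, paste, character(1), collapse = " ")

print(docs)
```

Treating each chunk as its own document lets a topic model pick up themes that are local to one part of a long text and would otherwise be averaged away.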

Session three (13:00-14:30)*
Loading a corpus
Preparing files for Topic Modeling
Break (14:30-14:45)
Session four (14:45-16:00)
Running the Model
Exploring topic coherence with term clouds
Topic data analysis
* Times are for illustration only.


Conference Info

Complete

ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

377 works by 898 authors indexed

XML available from https://github.com/elliewix/DHAnalysis (needs to replace plaintext)

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO

Tags
  • Keywords: None
  • Language: English
  • Topics: None