Natural Language Processing Tools for the Digital Humanities
Manning, Christopher, Stanford University, manning@stanford.edu
Large and ever-increasing amounts of text are now available digitally from many sources. Beyond raw text, there are also growing troves of text annotated with various kinds of metadata and analysis. This data provides new opportunities in the humanities to do different kinds of analyses at different scales, some of which blur the boundary between the traditional analytical and critical methods of the humanities and the empirical and quantitative approaches common in the social sciences. Since texts are central to the humanities, a key opportunity is “text mining” – using computers to analyze texts – and it is here that tools from Natural Language Processing have much to offer. Over the last two decades, the field of Natural Language Processing has refocused on processing and analyzing the huge amounts of available digital speech and text, partly through the use of new probabilistic and machine learning methods. This has led to the development of many robust methods and tools for text processing, many of which are within reach of the ambitious practitioner and are often available for free as open-source software.
This tutorial will survey what you can do with digital texts, starting from word counts and working up through deeper forms of analysis: collocations, named entities, parts of speech, constituency and dependency parses, detecting relations, events, and semantic roles, coreference resolution, and clustering and classification for various purposes, including theme, genre, and sentiment analysis. It will provide a high-level, not-too-technical presentation of what these tools do and how, together with concrete information on what kinds of tools are available, how they are used, what options they offer, examples of their use, and some idea of their reliability, their limitations, and whether they can be customized. The emphasis will be on what techniques exist and what you can and can’t do with them. The hope is to empower participants to envision how these tools might be employed in humanities research.
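As a concrete taste of the first of these techniques, the sketch below counts words, extracts n-grams, and ranks collocations. It assumes the open-source Python library NLTK, chosen here purely as an illustration (the tutorial does not prescribe any particular toolkit), with a short Dickens quotation as sample text.

    # Word counts, n-grams, and collocations with NLTK (assumed library).
    # One-time setup: nltk.download('punkt')
    from collections import Counter

    import nltk
    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    text = ("It was the best of times, it was the worst of times, "
            "it was the age of wisdom, it was the age of foolishness.")

    # Tokenize and lowercase, keeping alphabetic tokens only; real corpora
    # also require decisions about spelling variation and morphology.
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]

    # Word counts: the starting point of most quantitative text analysis.
    print(Counter(tokens).most_common(5))

    # Bigrams (adjacent word pairs) ranked by raw frequency.
    print(Counter(nltk.bigrams(tokens)).most_common(3))

    # Collocations: bigrams ranked by pointwise mutual information (PMI),
    # which rewards pairs that co-occur more often than chance predicts.
    finder = BigramCollocationFinder.from_words(tokens)
    print(finder.nbest(BigramAssocMeasures.pmi, 3))

PMI is only one of several association measures NLTK implements; likelihood-ratio and chi-squared scores are common alternatives, and each makes different trade-offs for rare words.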
The rough plan of the tutorial is as follows. It spends a bit more time on the things that people are most likely to be able to take away and use, such as part-of-speech tagging, named entity recognition (NER), and parsing.
Introduction, digital text corpora, markup, metadata, and search. Issues of spelling, tokenization, and morphology (30 mins)
Counting words, counting n-grams, collocations (20 mins)
Part-of-speech tagging and named entity recognition (40 mins; see the first sketch after this outline)
Parsing: constituency and dependencies and their applications (30 mins; see the second sketch below)
Briefer survey of methods for extracting deeper semantics: relations, events, semantic roles, and coreference resolution (20 mins)
Clustering and classification: applications including authorship attribution, topic models, word sense disambiguation, and sentiment analysis (30 mins; see the third sketch below)
Wrap up (10 mins)
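Since part-of-speech tagging and NER are flagged above as the techniques people are most likely to take away and use, here is a minimal sketch of both, again assuming NLTK’s bundled models purely for illustration; the example sentence is invented.

    # Part-of-speech tagging and named entity recognition with NLTK
    # (an assumed, illustrative toolkit choice).
    # One-time setup: nltk.download() for 'punkt',
    # 'averaged_perceptron_tagger', 'maxent_ne_chunker', and 'words'.
    import nltk

    sentence = "Christopher Manning teaches at Stanford University in California."
    tokens = nltk.word_tokenize(sentence)

    # Tagging assigns each token a Penn Treebank part-of-speech tag,
    # e.g. NNP (proper noun), VBZ (3rd-person singular present verb).
    tagged = nltk.pos_tag(tokens)
    print(tagged)

    # NER chunks the tagged tokens into labeled subtrees such as
    # PERSON, ORGANIZATION, and GPE (geo-political entity).
    for subtree in nltk.ne_chunk(tagged).subtrees():
        if subtree.label() != 'S':
            print(subtree.label(), ' '.join(word for word, tag in subtree.leaves()))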
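For the parsing segment, a toy constituency example can show why parsing matters: structural ambiguity. The hand-written grammar below is invented for illustration; real applications would use a trained, broad-coverage parser.

    # Constituency parsing of an ambiguous sentence with a toy
    # context-free grammar and NLTK's chart parser (illustrative only).
    import nltk

    grammar = nltk.CFG.fromstring("""
        S  -> NP VP
        NP -> Det N | NP PP
        VP -> V NP | VP PP
        PP -> P NP
        Det -> 'the' | 'a'
        N  -> 'scholar' | 'manuscript' | 'library'
        V  -> 'read'
        P  -> 'in'
    """)

    parser = nltk.ChartParser(grammar)
    sentence = "the scholar read a manuscript in the library".split()

    # "in the library" can modify the reading (VP attachment) or the
    # manuscript (NP attachment); the parser returns both trees.
    for tree in parser.parse(sentence):
        print(tree)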
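Finally, the clustering-and-classification segment rests on supervised text classification. The toy sketch below trains NLTK’s Naive Bayes classifier for sentiment on four invented one-line “reviews” – far too little data for real use, but enough to show the shape of the workflow.

    # Supervised document classification (here, toy sentiment analysis)
    # with NLTK's Naive Bayes classifier; all training data is invented.
    import nltk

    def features(text):
        # Bag-of-words features: record which words occur in the text.
        return {word: True for word in text.lower().split()}

    train = [
        (features("a moving and delightful story"), "positive"),
        (features("witty warm and wise"), "positive"),
        (features("dull plodding and lifeless"), "negative"),
        (features("a tedious disappointing mess"), "negative"),
    ]

    classifier = nltk.NaiveBayesClassifier.train(train)
    print(classifier.classify(features("a warm and witty tale")))
    classifier.show_most_informative_features(3)

The same recipe, with richer features and much more data, underlies authorship attribution and genre classification; clustering and topic models differ in requiring no labeled training data.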
If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.
Complete
Hosted at Stanford University
Stanford, California, United States
June 19, 2011 - June 22, 2011
151 works by 361 authors indexed
XML available from https://github.com/elliewix/DHAnalysis (still needs to be added)
Conference website: https://dh2011.stanford.edu/
Series: ADHO (6)
Organizers: ADHO