- what Natural Language Processing is and how it is relevant to History and the Humanities
- faster ways to identify place names in text files and match them to lat/long coordinates
- the pitfalls these automatic approaches can have (especially for non-English languages)
- Natural language - human-generated language, as opposed to formal computing languages (or plain tabular data)
- Natural Language Processing - a research field spanning linguistics, computer science, and artificial intelligence that works to allow machines to process natural language
- Rule-based vs. statistical NLP - writing explicit rules vs. using machine learning algorithms to analyze a corpus
- Corpus - a set of texts (hand-keyed, OCR'd, or born digital)
- Corpus linguistics - the study of language through large, structured collections of real-world text
- Geographic information science - the study of geographic data and of the systems (GIS) used to capture, analyze, and visualize it
- Gazetteer - a structured list of place names, usually with coordinates and other metadata
- Lexicon - a structured list of words or terms used to support text processing
GTA helps us answer questions like:
- Where is [a corpus] talking about?
- What is [a corpus] saying about these places?
- What is [a corpus] saying about specific themes in proximity to these places?
GTA allows scholars to create re-usable geo-spatial data from natural language texts.
This is a crucial departure from many early spatial humanities projects that were based on geocoding lists of place names. Geocoding could be done by hand (manually finding a lat/long for a place name) or automatically (scanning pre-existing gazetteers for perfect or near perfect string matches to place names).
Geoparsing has evolved from depending on string matches to using a combination of rule-based methods and machine learning algorithms to parse texts and georesolve locations. While it is possible to perform these tasks separately, geoparsing software combines the NLP and GIS tasks.
Have a look at the geoparser `pipeline_scheme.png` file in this repo. Most GTA projects use a similar pipeline that combines tools originally developed for corpus linguistics and geographic information science.
- Tokenization
- Part-of-speech tagging
- Lemmatization
- Named Entity Recognition
- Lookup of Entities in Digital Gazetteers
- Disambiguation of Gazetteer record 'matches'
- Georesolution to a single set of coordinates
In this workshop, we will be using the Edinburgh Geoparser, developed by the Language Technology Group at the University of Edinburgh. It is one of the first geoparsers geared towards humanities research applications. It was designed to work with running text in relatively modern English.
Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin. 2015. Adapting the Edinburgh Geoparser for Historical Georeferencing. International Journal of Humanities and Arts Computing, 9(1), pp. 15-35.
Claire Grover, Richard Tobin, Kate Byrne, Matthew Woollard, James Reid, Stuart Dunn, and Julian Ball. 2010. Use of the Edinburgh Geoparser for georeferencing digitised historical collections. Philosophical Transactions of the Royal Society A, 368(1925), pp. 3875-3889.
The Edinburgh Geoparser, or EG for short, is free for research.
Let's take a break for questions before we get started with the tutorial.
- Download the Edinburgh Geoparser
- Decompress the file
- Save the EG folder to your local drive (any location that suits you, close to your home folder)
- Open the `scripts` > `setup` file in any text editor (don't open it by double-clicking on the file; use CTRL+Open and then select your text editor)
- Replace `Darwin?1[012345]*)` with `Darwin?1[0-9]*)`
- Open your terminal and `cd` to your EG folder
- List the EG folder contents: `ls ./geoparser-v1.1`
- `cd` to the `scripts` folder
- Test that EG works with this command, using a sample input file provided with the program:
cat ../in/172172.txt | ./run -t plain -g geonames -o ../out 172172
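If the command completes without errors, the results should land in the `out` directory. A quick way to check (the file names follow the output pattern described later in this document):

```
# list the results written by the test run
ls ../out
# open the interactive map/text view (macOS; on Linux try xdg-open)
open ../out/172172.display.html
```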
- Install Docker
- Create Docker account
- Start Docker app
- Open up shell to run Docker commands
- We will be using a Docker container of the Edinburgh Geoparser
- Download the `172172.txt` file from this GitHub repo and put it in a local folder
- `cd` to that folder in your shell
- (Note: I am aware that the following command doesn't work as written; it is in the process of being adapted for Windows.) Run the following command to test:
cat 172172.txt | docker run -i -v cd:/out kmcdono2/eg:latest ./run -t plain -g geonames -o /out 172172
(Alternate Windows solution: MobaXterm)
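A sketch of one likely fix, assuming the problem is the `-v cd:/out` volume mount (`cd` is not a path): mount the current working directory instead. On macOS/Linux shells that looks like the command below; in Windows PowerShell, `${PWD}` plays the same role.

```
# mount the current directory into the container as /out, so results land locally
cat 172172.txt | docker run -i -v "$(pwd)":/out kmcdono2/eg:latest ./run -t plain -g geonames -o /out 172172
```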
- README: run instructions
- in: files you want to parse and geo-resolve
- out: results
- lib: processing libraries
- resolve: programs for geo-resolution
- scripts: scripts to run EG
More on EG file structure and pipeline.
EG combines all of the parsing and georesolution steps into one pipeline and produces a set of output files that include some basic map visualizations.
EG works well with plain text input. It can also accept Google Books files and XML, under certain conditions.
- `-t plain` - plain text
- `-t gb` - Google Books files
- `-t ltgxml` - XML (LTG's XML format)
EG can query a few commonly used gazetteers. You can also specify a local gazetteer (if you have one specific to your time and place).
- `-g geonames` - GeoNames
- `-g os` - UK Ordnance Survey
- `-g deep` - historical place names in England
- `-g plplus` - Pleiades+, the ancient Greek and Roman world
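For example, to run the same sample file against Pleiades+ rather than GeoNames, only the `-g` value changes (a sketch based on the test command above; the output name `172172_plplus` is just illustrative):

```
cat ../in/172172.txt | ./run -t plain -g plplus -o ../out 172172_plplus
```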
EG results include a range of files. Define the output directory and file name in the run command: `-o /out 172172`.
- `172172.out.xml` - XML file containing the text and all word-level metadata produced during processing
- `172172.gaz.xml` - ranked list of geo-resolution candidates for each extracted place entity
- `172172.display.html` - visualization of the geoparsed text, a map, and a list of geo-coordinates for each place
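For a rough, command-line sense of how many place entities were extracted, you can count the location-typed entities in the output XML (this assumes the attribute appears as `type="location"`, the same attribute the TSV-extraction command later in this tutorial queries):

```
# count entity elements typed as locations in the output XML
grep -o 'type="location"' ../out/172172.out.xml | wc -l
```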
There are other ways to limit your results:
- `-top` - this parameter creates `172172.display-top.html`, which only maps the top-ranked locations
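A sketch of adding this flag to the earlier test run (flag placement mirrors the other options; adjust if EG complains):

```
cat ../in/172172.txt | ./run -t plain -g geonames -top -o ../out 172172
```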
You can also express a preference for results within a geographical area:
- `-l lat long radius score` - bounding circle (see the sketch after this block for an illustrative example)
- `-lb W N E S score` - bounding box (W, N, E, S are decimal degrees)

Example using a bounding box:

`cat ../in/172172.txt | ./run -t plain -g geonames -lb -141.002701 83.110619 -52.620201 41.681019 2 -o ../out 172172`

(A score of 2 means that a gazetteer result inside the bounding box or circle is given twice the weight of a result outside that area.)
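And a minimal sketch of the bounding-circle variant, with values I have chosen purely for illustration (a circle roughly centred on Edinburgh; check the EG documentation for the units the radius expects):

```
cat ../in/172172.txt | ./run -t plain -g geonames -l 55.95 -3.19 100 2 -o ../out 172172
```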
Do you want to run multiple files through in one go? Download the `run-multiple-files.sh` script from this repo and place it in the `scripts` directory. Run the following command to make it executable:

`chmod u+x run-multiple-files.sh`

Then, still from the `scripts` directory, run:

`./run-multiple-files.sh -i ../in -o ../out`

Here, `-i` specifies the input directory and `-o` the output directory.
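If you are curious what such a script boils down to, a loop along these lines does something similar (a sketch under my own assumptions, not the repo's actual script; it expects plain-text `.txt` files in the input directory and reuses each file's base name as the output name):

```bash
#!/bin/bash
# Run the Edinburgh Geoparser over every .txt file in ../in,
# writing results to ../out. Run this from the scripts directory.
for f in ../in/*.txt; do
  name=$(basename "$f" .txt)   # e.g. ../in/origin.txt -> origin
  cat "$f" | ./run -t plain -g geonames -o ../out "$name"
done
```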
Let's run EG on Darwin's *The Origin of Species*.
Download the `origin.txt` file from the repo and put it in your `in` directory.
Be sure you've navigated to the `scripts` directory:
`cd scripts`
Then run the following (your output directory, e.g. `out`, must already exist):
cat ../in/origin.txt | ./run -t plain -g geonames -o ../out origin
If you are using the Docker container, run:
cat origin.txt | docker run -i -v cd:/out kmcdono2/eg:latest ./run -t plain -g geonames -o /out origin
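As with the earlier Docker test, the `-v cd:/out` mount is unlikely to work as written; the same hedged fix applies here (mount the current directory):

```
cat origin.txt | docker run -i -v "$(pwd)":/out kmcdono2/eg:latest ./run -t plain -g geonames -o /out origin
```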
Let's take a look at one of the out.xml files to see what is happening to the text.
- `<text>` - the text itself, which contains:
  - `<p>` - paragraphs
  - `<s>` - sentences
  - `<w>` - words
- `<standoff>` - annotations stored separately from the running text and linked back to it
- `<ents>` - this element occurs twice because the NER process has two runs: 1) a rule-based run (`ner-rb`) to identify and classify entities; EG entities can be classified as date, location, person, or organization; 2) a run over verbs and verb phrases for detecting events. We are mostly interested in the first pass, which identifies and classifies entities.
- `<part>` - links an entity element back to its position in the text (the "start word" [sw] and "end word" [ew] attributes both carry word ids, e.g. w26)
It is possible to export specific metadata fields (place name in text, gazetteer record ID, country of location, lat/long, feature type, etc.).
Run this command to transform your XML results into a TSV (tab separated values) file:
./bin/sys-i386-snow-leopard/lxprintf -e "ent[@type='location']" "%s\t%s\t%s\t%s\t%s\n" "normalize-space(parts)" "@gazref" "@in-country" "@lat" "@long" < ./out/burtons.out.xml > ./out/burtons.out.tsv
You can also download this script as a file and run it just like we did the multiple files script.
chmod u+x extract-to-tsv.sh
Then: ./extract-to-tsv.sh < ../out/burtons.out.xml > ../out/burtons.out.tsv
You can edit the script (either in the command line or in the file) to include or exclude metadata fields. The script above, for example, does not include @feat-type.
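If you do want the feature type, the change is one more `%s` column in the format string and one more field at the end (a sketch based on the command above; `@feat-type` is the attribute named in the previous paragraph):

```
./bin/sys-i386-snow-leopard/lxprintf -e "ent[@type='location']" "%s\t%s\t%s\t%s\t%s\t%s\n" "normalize-space(parts)" "@gazref" "@in-country" "@lat" "@long" "@feat-type" < ./out/burtons.out.xml > ./out/burtons.out.tsv
```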
Check out the full EG documentation.
Questions?
Now let's try using some texts that interest you.
If you have plain text files in English, put them into the `in` directory. Use the same run scripts as above, but be sure to edit the input and output file names or directories to match your new files.
If you don't have any plain text, try downloading something from the Internet Archive or use one of the sample texts in the `in` directory.
What kinds of problems do you encounter?
Why do you think these problems occur?
How would you need to adapt the geoparser to solve these problems?
EG is an excellent tool, with developers who are interested in creating versions for specific research needs.
However, there are lots of other fish in the GTA sea. If you want to try your hand at other NLP packages that integrate machine learning into Named Entity Recognition, or if you have texts in non-English languages, check out the following resources:
- NLTK (Python)
- udpipe (NLP for non-English languages in R)
- spaCy (also Python)
Interested in the challenge of evaluating different pipelines for your corpus? There is a brand-new article by Milan Gritta et al. that can guide you through this process.
Before you leave, please take a moment to fill out this quick evaluation form. It will help us adapt content for the future!