ovlaere/placing-text

Framework code for automated georeferencing of textual data


General info

This project provides a framework for automated georeferencing of textual (meta)data. The framework was developed during my PhD research at Ghent University, Belgium, for the dissertation Georeferencing Text Using Social Media.

The code in this framework covers aspects that have been published in scientific papers. Therefore, if you use code from this framework, please cite the following paper:

If you use or refer to the spatially aware feature ranking method based on the Ripley K implementation, please cite the following paper:

If you make use of the geographical spread score feature ranking method, please cite the authors of that scoring method:

This code has been used to participate in the 2010, 2011 and 2012 editions of the MediaEval Placing Task, so an example is also provided of how to run a baseline submission for this task. The framework uses only textual metadata; no visual features are used.

How to get started

  • git clone git@github.com:ovlaere/placing-text.git
  • Locate a suitable training and test file and add them to the folder (see below)
  • Run mvn package in the root folder of the project
  • Run one of the examples, for instance:

java -cp target/placing-text-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
-Xms2g -Xmx4g -ea -Dfile.encoding=UTF-8 \
be.ugent.intec.ibcn.examples.FeatureExample

Please note that this example assumes that you have a file called training in the same folder. See the source code for more examples.

The code in this framework expects two files to be present for your georeferencing problem: a training file and a test file. The file format is up to you, as you can implement your own parsers. The examples, however, use a parser that expects these files to have a format such as:

<number_of_items>
ID,...,latitude,longitude,tag tag tag tag
ID,...,latitude,longitude,tag tag tag tag
...
ID,...,latitude,longitude,tag tag tag tag

That is: an ID, an obsolete field (in my case, working with Flickr data, this was the owner, which is unused in this implementation), the latitude and longitude of the item, and a space-separated list of tags.
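
For illustration, here is a minimal, standalone sketch of parsing one line of this format. This is not the framework's parser (the actual interface definitions and implementations are pointed out below); the five-field layout is an assumption based on the description above.

// Standalone sketch of parsing one data line of the form
// ID,<obsolete>,latitude,longitude,tag tag tag
// This is NOT the framework's parser; see
// be.ugent.intec.ibcn.geo.common.io.parsers for the real implementations.
public class LineParseSketch {

    static final class Item {
        final long id;
        final double latitude;
        final double longitude;
        final String[] tags;

        Item(long id, double latitude, double longitude, String[] tags) {
            this.id = id;
            this.latitude = latitude;
            this.longitude = longitude;
            this.tags = tags;
        }
    }

    // Split into at most 5 comma-separated fields:
    // ID, obsolete (owner), latitude, longitude, space-separated tags.
    static Item parse(String line) {
        String[] fields = line.split(",", 5);
        return new Item(
                Long.parseLong(fields[0]),
                Double.parseDouble(fields[2]),
                Double.parseDouble(fields[3]),
                fields[4].split(" "));
    }

    public static void main(String[] args) {
        Item item = parse("42,someowner,51.05,3.72,ghent belgium castle");
        System.out.println(item.id + " @ (" + item.latitude + ", " + item.longitude + ")");
    }
}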

Important: to speed things up, the framework tries to read the number of lines in the training and test file from the first line. If you put the number of items on that line, it is parsed and used by the code. If you forget this, or omit it on purpose, the framework will loop through the file to determine the number of lines. With training files of, for instance, 32 million lines, this can hurt performance.
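
The idea behind this header convention can be sketched as follows (a simplified, hypothetical reader, not the framework's actual I/O code):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: try to read the item count from the first line,
// otherwise fall back to scanning the whole file to count the lines.
public class ItemCountSketch {

    static long itemCount(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            String first = reader.readLine();
            if (first == null) {
                return 0; // empty file
            }
            try {
                // Fast path: the first line holds the number of items.
                return Long.parseLong(first.trim());
            } catch (NumberFormatException e) {
                // Slow path: no header, so count the remaining lines.
                long count = 1; // the line we already read
                while (reader.readLine() != null) {
                    count++;
                }
                return count;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(itemCount(Paths.get("training")));
    }
}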

Most parts of the code are parallelized for execution on multiple cores, so the more cores you have on the system running this code, the better.

As for memory requirements: the language models are built in memory, so the more memory you can spare for the virtual machine (-Xmx parameter), the better. If you run a model with many classes and many features, the code will work through it in batches (e.g. 1000 classes at a time when using 1M features), whereas more memory or fewer features would allow, say, 3000 classes per batch. This is nothing to worry about at this point: just start running the examples and see where you get.

The workflow to obtain location predictions for a set of test items, starting from the training data, is the following:

  1. The training data is clustered.
  2. The training data is analyzed for features, which are ranked (one way or another, preferably according to the geographic clues they provide).
  3. A Naive Bayes classifier is trained, using the classes (clusters) discovered and a given number of features.
  4. The classifier returns the most likely class for each test item.
  5. For each test item, the training data within its predicted class is searched for the most similar training item; the location of that item is returned as the location estimate (see the sketch after this list).
  6. The predicted locations are compared to the ground truth, after which detailed statistics about the results over the entire test collection are presented.
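
To make step 5 concrete, here is a toy sketch of a Jaccard-based similarity lookup over tag sets. The data and helper names are hypothetical; the framework's actual version is the similarity based conversion listed under the implementation details below.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy sketch of step 5: among the training items of the predicted class,
// pick the one whose tag set has the highest Jaccard overlap with the
// test item, and return its location as the estimate.
public class JaccardSketch {

    static final class Candidate {
        final Set<String> tags;
        final double latitude;
        final double longitude;

        Candidate(Set<String> tags, double latitude, double longitude) {
            this.tags = tags;
            this.latitude = latitude;
            this.longitude = longitude;
        }
    }

    // Jaccard similarity: |intersection| / |union| of the two tag sets.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    static Set<String> tags(String... tags) {
        return new HashSet<>(Arrays.asList(tags));
    }

    public static void main(String[] args) {
        Set<String> testItem = tags("ghent", "castle", "belgium");
        // Hypothetical training items within the predicted class.
        List<Candidate> predictedClass = Arrays.asList(
                new Candidate(tags("ghent", "castle"), 51.057, 3.721),
                new Candidate(tags("bruges", "belfry", "belgium"), 51.208, 3.225));

        Candidate best = null;
        double bestScore = -1.0;
        for (Candidate c : predictedClass) {
            double score = jaccard(testItem, c.tags);
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        System.out.println("Estimated location: " + best.latitude + "," + best.longitude);
    }
}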

For each of these steps, a documented example is available. Also take a look at the classes used by these examples, as they are all well documented.

If you would like to implement parsers for your own input files, have a look at the interface definitions in be.ugent.intec.ibcn.geo.common.interfaces. Different implementations can be found in be.ugent.intec.ibcn.geo.common.io.parsers.

Examples

  • Clustering be.ugent.intec.ibcn.examples.ClusteringExample
  • Feature ranking be.ugent.intec.ibcn.examples.FeatureExample
  • Classification be.ugent.intec.ibcn.examples.Classifier
  • Georeferencing be.ugent.intec.ibcn.examples.ReferencingExample
  • Analyzing be.ugent.intec.ibcn.examples.AnalyzerExample

Additionally, the following example provides a workflow that combines all the necessary steps, ending with a file that predicts the locations for a given test collection from the MediaEval Placing Task, given a certain training file.

  • MediaEval 2012 Placing Task workflow be.ugent.intec.ibcn.examples.MediaEval2012PlacingExample
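
As with the earlier example, this workflow can be run from the project root after mvn package, assuming the same JVM settings as before:

java -cp target/placing-text-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
-Xms2g -Xmx4g -ea -Dfile.encoding=UTF-8 \
be.ugent.intec.ibcn.examples.MediaEval2012PlacingExample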

Implementation details

  • Clustering algorithms
    • Grid clustering (see the sketch after this list)
    • Partitioning Around Medoids
  • Feature ranking
    • Chi Square
    • Max-Chi Square
    • Information Gain
    • Log-Likelihood
    • Most Frequently Used
    • Geographic spread score
    • Ripley-K based spatially aware ranking
  • Location prediction
    • Medoid based conversion
    • Similarity based conversion (Jaccard)
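
As a flavor of the simplest of these, here is a minimal sketch of the grid clustering idea: items are assigned to fixed-size latitude/longitude cells, and each non-empty cell becomes a class. The 1-degree cell size and the naming are illustrative assumptions, not the framework's actual parameters.

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of grid clustering: the world is divided into fixed-size
// latitude/longitude cells, and every non-empty cell becomes a cluster.
public class GridClusterSketch {

    static final double CELL_DEGREES = 1.0;

    // Map a coordinate to a cell identifier such as "51:3".
    static String cell(double latitude, double longitude) {
        long row = (long) Math.floor(latitude / CELL_DEGREES);
        long col = (long) Math.floor(longitude / CELL_DEGREES);
        return row + ":" + col;
    }

    public static void main(String[] args) {
        double[][] trainingLocations = {
                {51.05, 3.72},   // Ghent
                {51.21, 3.22},   // Bruges
                {40.71, -74.01}  // New York
        };
        Map<String, Integer> clusterSizes = new HashMap<>();
        for (double[] loc : trainingLocations) {
            clusterSizes.merge(cell(loc[0], loc[1]), 1, Integer::sum);
        }
        // The two Belgian cities land in the same 1-degree cell; New York does not.
        System.out.println(clusterSizes);
    }
}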

External libraries

This code makes extensive use of KD-trees, for which we would like to acknowledge the implementation from http://home.wlu.edu/~levys/software/kd/.
