CLIFF is a lightweight server to allow HTTP requests to the Stanford Named Entity
Recognizer and a modified CLAVIN 2.0.0 geoparser.
It allows you to submit unstructured text over HTTP and receive in reply JSON
results with information about organizations mentioned, locations mentioned,
people mentioned, and countries the text is "about". The geoparsing is tuned
to identify cities, states and countries.
You can try CLIFF out on our public website: http://cliff.mediameter.org. We don't host a public installation of CLIFF for you to use. If you want to install and use CLIFF, @ahalterman created an awesome Vagrant script that will install it to a virtual host you can use. Follow that script's instructions to get CLIFF installed.
To test it out, open this URL in a browser and you should get some JSON back:
http://localhost:8080/CLIFF-2.1.1/parse/text?q=This is some text about New York City, and maybe about Accra as well, and maybe Boston as well.
Of course, when you use this in a script you should do an HTTP POST, not a GET!
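As a rough sketch of what that POST might look like from a script, the Python below builds a form-encoded request for /parse/text using only the standard library. It assumes the server accepts `q` and `replaceAllDemonyms` as form fields in the POST body, and that CLIFF is deployed at the localhost URL used in this README. `build_parse_request` and `parse_text` are illustrative names, not part of CLIFF.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: same base URL as the example queries in this README.
CLIFF_BASE = "http://localhost:8080/CLIFF-2.1.1"


def build_parse_request(text, replace_all_demonyms=False):
    """Build (but do not send) a POST request for /parse/text."""
    params = {
        "q": text,
        "replaceAllDemonyms": "true" if replace_all_demonyms else "false",
    }
    # Assumption: parameters are sent as a form-encoded body.
    body = urlencode(params).encode("utf-8")
    return Request(
        CLIFF_BASE + "/parse/text",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def parse_text(text, replace_all_demonyms=False):
    """Send the request and decode the JSON reply (needs a running CLIFF)."""
    with urlopen(build_parse_request(text, replace_all_demonyms)) as reply:
        return json.loads(reply.read().decode("utf-8"))
```

Keeping the request builder separate from the network call makes it possible to inspect the encoded body without a server running.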
### /parse/text
The reason CLIFF exists! This parses some text and returns the entities mentioned (people, places and organizations).
Parameter | Default | Notes |
---|---|---|
q | (required) | Raw text of a news story that you want to parse |
replaceAllDemonyms | false | "true" if you want to count things like "Chinese" as a mention of the country China |
Example Query:
http://localhost:8080/CLIFF-2.1.1/parse/text?q=Some%20clever%20text%20mentioning%20places%20like%20New%20Delhi,%20and%20people%20like%20Einstein.%20Perhaps%20also%20we%20want%20mention%20an%20organization%20like%20the%20United%20Nations?
Response:
{
"results": {
"organizations": [
{
"count": 1,
"name": "United Nations"
}
],
"places": {
"focus": {
"cities": [
{
"id": 1261481,
"lon": 77.22445,
"name": "New Delhi",
"score": 1,
"countryGeoNameId": "1269750",
"countryCode": "IN",
"featureCode": "PPLC",
"featureClass": "P",
"stateCode": "07",
"lat": 28.63576,
"stateGeoNameId": "1273293",
"population": 317797
}
],
"states": [
{
"id": 1273293,
"lon": 77.1,
"name": "National Capital Territory of Delhi",
"score": 1,
"countryGeoNameId": "1269750",
"countryCode": "IN",
"featureCode": "ADM1",
"featureClass": "A",
"stateCode": "07",
"lat": 28.6667,
"stateGeoNameId": "1273293",
"population": 16787941
}
],
"countries": [
{
"id": 1269750,
"lon": 79,
"name": "Republic of India",
"score": 1,
"countryGeoNameId": "1269750",
"countryCode": "IN",
"featureCode": "PCLI",
"featureClass": "A",
"stateCode": "00",
"lat": 22,
"stateGeoNameId": "",
"population": 1173108018
}
]
},
"mentions": [
{
"id": 1261481,
"lon": 77.22445,
"source": {
"charIndex": 40,
"string": "New Delhi"
},
"name": "New Delhi",
"countryGeoNameId": "1269750",
"countryCode": "IN",
"featureCode": "PPLC",
"featureClass": "P",
"stateCode": "07",
"confidence": 1,
"lat": 28.63576,
"stateGeoNameId": "1273293",
"population": 317797
}
]
},
"people": [
{
"count": 1,
"name": "Einstein"
}
]
},
"status": "ok",
"milliseconds": 36,
"version": "2.1.1"
}
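To illustrate how a script might read this response, here is a minimal Python sketch that pulls out the focus place names and checks each mention against the submitted text. It assumes `charIndex` is a zero-based character offset into the original text, which is consistent with the sample above ("New Delhi" begins at offset 40 of the example query) but is an inference, not documented behavior. The helper names are hypothetical.

```python
def focus_names(response):
    """Collect the names of the focus cities, states and countries."""
    focus = response["results"]["places"]["focus"]
    return {
        kind: [place["name"] for place in focus.get(kind, [])]
        for kind in ("cities", "states", "countries")
    }


def mention_spans(response, text):
    """Pair each mention's reported string with the slice of `text` it points at."""
    spans = []
    for mention in response["results"]["places"]["mentions"]:
        source = mention["source"]
        start = source["charIndex"]  # assumed: zero-based offset into `text`
        end = start + len(source["string"])
        spans.append((source["string"], text[start:end]))
    return spans


# Trimmed copy of the sample response above (only the fields used here).
text = "Some clever text mentioning places like New Delhi, and people like Einstein."
sample = {
    "results": {
        "places": {
            "focus": {
                "cities": [{"id": 1261481, "name": "New Delhi"}],
                "states": [{"id": 1273293, "name": "National Capital Territory of Delhi"}],
                "countries": [{"id": 1269750, "name": "Republic of India"}],
            },
            "mentions": [
                {"id": 1261481, "name": "New Delhi",
                 "source": {"charIndex": 40, "string": "New Delhi"}},
            ],
        }
    },
    "status": "ok",
}

print(focus_names(sample))
print(mention_spans(sample, text))
```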
### /geonames
A convenience method to help you look up places by their geonames IDs.
Parameter | Default | Notes |
---|---|---|
id | (required) | The unique id that identifies a place in the geonames.org database |
Example Query:
http://localhost:8080/CLIFF-2.1.1/geonames?id=4930956
Response:
{
"results": {
"id": 4930956,
"lon": -71.05977,
"name": "Boston",
"countryGeoNameId": "6252001",
"countryCode": "US",
"featureCode": "PPLA",
"featureClass": "P",
"stateCode": "MA",
"lat": 42.35843,
"stateGeoNameId": "6254926",
"population": 617594
},
"status": "ok",
"version": "2.1.1"
}
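Because each place carries `stateGeoNameId` and `countryGeoNameId`, one plausible use of this endpoint is to look up a place's parent state and country. The Python sketch below only builds the lookup URLs. It assumes those ids can be passed back to /geonames as the `id` parameter, and it skips empty ids such as the `""` state id in the country entry of the /parse/text sample. `geonames_url` and `parent_lookups` are made-up helper names.

```python
from urllib.parse import urlencode

# Assumption: same base URL as the example queries in this README.
CLIFF_BASE = "http://localhost:8080/CLIFF-2.1.1"


def geonames_url(geoname_id):
    """Build the /geonames lookup URL for one geonames id."""
    return CLIFF_BASE + "/geonames?" + urlencode({"id": geoname_id})


def parent_lookups(place):
    """Map 'state' and 'country' to lookup URLs, skipping missing or empty ids."""
    parents = {
        "state": place.get("stateGeoNameId"),
        "country": place.get("countryGeoNameId"),
    }
    return {level: geonames_url(gid) for level, gid in parents.items() if gid}


# Fields copied from the Boston response above.
boston = {"id": 4930956, "stateGeoNameId": "6254926", "countryGeoNameId": "6252001"}
for level, url in parent_lookups(boston).items():
    print(level, url)
```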
### /extract
A convenience method to help you get the raw text of the story from a URL. This uses the boilerpipe library.
Parameter | Default | Notes |
---|---|---|
url | (required) | The URL of a news story whose text you want to extract |
Example Query:
http://localhost:8080/CLIFF-2.1.1/extract?url=http://www.theonion.com/articles/woman-thinks-she-can-just-waltz-back-into-work-aft,38349/
Response:
{
"results": {
"text": "Woman Thinks She Can Just Waltz Back Into Work After Maternity Leave Without Bringing Baby To Office\nNEWS IN BRIEF\nVol 51 Issue 13 \u00b7 Local \u00b7 Workplace \u00b7 Parents \u00b7 Kids \u00b7 After Birth\nKENWOOD, OH\u2014Saying she has a lot of nerve to try and pull something like this, employees of insurance agency Boland & Sons told reporters Wednesday that coworker Emily Nelson seems to believe she can just waltz back into work after her maternity leave without once bringing her baby into the office. \u201cI don\u2019t know where she gets off thinking she doesn\u2019t need to come in here with that baby strapped around her in a bjorn,\u201d said Greg Sheldrick, adding that Nelson is out of her goddamn mind if she seriously believes showing off a few measly pictures of the newborn on her cell phone is an adequate substitute for bringing him around to meet everyone in their department. \u201cShe\u2019s been back for three weeks already, so the grace period is over. She needs to come in with that baby in a stroller, roll it by my desk, and say \u2018Somebody wants to say hello,\u2019 or, frankly, she might as well never show her face here again. Seriously, every single person here better get a chance to lean in and smile at that baby, and God help her if she shows up the rest of this week empty-handed.\u201d Sheldrick reportedly expressed equal astonishment that Nelson\u2019s husband thinks he can get away with not once arriving with the infant to pick up his wife from work.\nShare This Story:\n",
"title": "Woman Thinks She Can Just Waltz Back Into Work After Maternity Leave Without Bringing Baby To Office - The Onion - America's Finest News Source",
"url": "http:\/\/www.theonion.com\/articles\/woman-thinks-she-can-just-waltz-back-into-work-aft,38349\/"
},
"status": "ok",
"milliseconds": 651,
"version": "2.1.1"
}
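The example query above passes the story URL unescaped, which works for simple addresses but would break if the story URL had its own query string. A hedged Python sketch of percent-encoding the `url` parameter follows. `extract_url` and `story_text` are illustrative names, and the base URL is the localhost deployment assumed throughout this README.

```python
from urllib.parse import urlencode

# Assumption: same base URL as the example queries in this README.
CLIFF_BASE = "http://localhost:8080/CLIFF-2.1.1"


def extract_url(story_url):
    """Build an /extract query with the story URL percent-encoded."""
    return CLIFF_BASE + "/extract?" + urlencode({"url": story_url})


def story_text(response):
    """Pull the extracted story text out of an /extract response."""
    return response["results"]["text"]


# A story URL with its own query string stays a single `url` parameter.
print(extract_url("http://example.com/story?id=7&page=2"))
```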
You can configure how CLIFF runs by editing the following properties in the `src/main/resources/cliff.properties` file.
Controls which Stanford NER Model to use while extracting entities:
Value | Default | Model | Notes |
---|---|---|---|
ENGLISH_ALL_3CLASS | * | english.all.3class.distsim.crf | Quick, but doesn't catch all demonyms |
ENGLISH_CONLL_4CLASS | | english.conll.4class.distsim.crf | Catches most demonyms, but is about 30% slower |
You need maven and java (1.7). We develop in Eclipse Kepler: Java EE.
You need to download and install CLAVIN 2.0.0 in order to build the Geonames Gazetteer Index
for geoparsing. The idea is that you build all that, and then create a symlink at
`/etc/cliff2/IndexDirectory` to the CLAVIN index you just built.
CLIFF is set up to be run inside a Java servlet container (e.g. Tomcat7). For development
we use the Maven Tomcat plugin. To deploy,
add this to your `%TOMCAT_PATH%/conf/tomcat-users.xml` file:
<role rolename="manager"/>
<role rolename="manager-gui"/>
<role rolename="manager-script"/>
<user username="cliff" password="beer" roles="manager,manager-gui,manager-script"/>
Also add this to your `~/.m2/settings.xml`:
<servers>
<server>
<id>CliffTomcatServer</id>
<username>cliff</username>
<password>beer</password>
</server>
</servers>
That lets the Maven Tomcat plugin upload the WAR it builds through the Tomcat manager web application.
First make sure Tomcat is running (e.g. `catalina run`). Now run `mvn tomcat7:deploy -DskipTests`
to deploy the app, or `mvn tomcat7:redeploy -DskipTests` to redeploy once you've already got
the app deployed.
We have a number of unit tests that can be run with `mvn test`.
To build a release:
- first update the version number in the `pom.xml` file
- also update the version number in `org.mediameter.cliff.ParseManager`
- to create the WAR file, run `mvn package -DskipTests`
- update the examples in `README.md`
- tag the release with the version number `vX.Y.Z`
- author a new release for that tag on GitHub, write a description of the changes, and upload the .war
We run our servers on Ubuntu. Here are some tips for deploying to that type of server:
- First make sure you have Java 7: `sudo apt-get install openjdk-7-jdk`
- Then install Tomcat 7: `sudo apt-get install tomcat7`
- Point Tomcat at the correct Java: open `/etc/default/tomcat7`, then uncomment and change the `JAVA_HOME` var to `/usr/lib/jvm/java-7-openjdk-amd64`
- Increase Tomcat's memory: open `/etc/default/tomcat7` and change the `JAVA_OPTS` var to include something like `-Xmx4024m`
- Put your WAR in the right place: on Ubuntu this is `/var/lib/tomcat7/webapps/`
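
Once the WAR is in place, a quick sanity check is to request any endpoint and confirm the reply reports `"status": "ok"` and the version you expect. The Python sketch below only inspects an already-decoded JSON reply. `deployment_problems` is a hypothetical helper, and the failing payload in the last line is invented for illustration, not a documented CLIFF error format.

```python
def deployment_problems(payload, expected_version):
    """Return a list of problems found in a reply; an empty list looks healthy."""
    problems = []
    if payload.get("status") != "ok":
        problems.append("status is %r, expected 'ok'" % payload.get("status"))
    if payload.get("version") != expected_version:
        problems.append(
            "version is %r, expected %r" % (payload.get("version"), expected_version)
        )
    return problems


# Shape copied from the /geonames example response in this README.
reply = {"results": {"id": 4930956, "name": "Boston"}, "status": "ok", "version": "2.1.1"}
print(deployment_problems(reply, "2.1.1"))

# Invented payload, used only to show what a failed check reports.
print(deployment_problems({"status": "error"}, "2.1.1"))
```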