
Just the facts -- web page content extraction


eristoddle/dragnet

 
 


dragnet

Dragnet isn't interested in the shiny chrome or boilerplate dressing of a webpage. It's interested in... 'just the facts.'

It is meant to become a collection of reference implementations of various dechroming / content extraction algorithms.

Each algorithm is implemented as a class of static methods that can be imported from the top level of dragnet. Every such class implements a method, analyze, which accepts a string of HTML and returns a string representing the content.
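To make that interface concrete, here is a minimal sketch of a class with the same shape. LongestBlock is a hypothetical name and is not part of dragnet, and its heuristic (return the longest run of text between tags) is a deliberately naive stand-in, not one of the published algorithms.

```python
import re


class LongestBlock(object):
    """Hypothetical extractor, not part of dragnet: keeps the longest text run."""

    @staticmethod
    def analyze(html):
        # Remove script and style elements, including their contents.
        html = re.sub(r'(?is)<(script|style)\b.*?</\1\s*>', ' ', html)
        # Split on the remaining tags and normalize whitespace in each block.
        blocks = [re.sub(r'\s+', ' ', block).strip()
                  for block in re.split(r'<[^>]+>', html)]
        # Treat the longest block as the content (assumption: a naive heuristic).
        return max(blocks, key=len)
```

Because analyze is a static method, it is called on the class itself, for example LongestBlock.analyze(html), with no instance required.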

Running

Fill a directory named documents with one folder per site, each holding the HTML sources of documents from that site. Then run run.py, which iterates through each of the input files and produces a corresponding file in output with just the content. For example,

documents/
    wired.com/
        latest-higgs-rumors
    seomoz.org/
        8-attributes-of-content-that-inspire-action
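The loop described above could look roughly like the following sketch. It is not the actual run.py: extract_tree is a hypothetical helper, and it assumes the inputs are UTF-8 text, that the output tree mirrors the per-site folder layout, and that the extractor is passed in as a callable.

```python
import io
import os


def extract_tree(analyze, src='documents', dst='output'):
    """Mirror src/<site>/<page> into dst/<site>/<page>, storing analyze(html)."""
    written = []
    for site in sorted(os.listdir(src)):
        site_dir = os.path.join(src, site)
        if not os.path.isdir(site_dir):
            continue  # skip stray files at the top level
        out_dir = os.path.join(dst, site)
        if not os.path.isdir(out_dir):
            os.makedirs(out_dir)
        for name in sorted(os.listdir(site_dir)):
            # Assumption: every input file is UTF-8 encoded HTML.
            with io.open(os.path.join(site_dir, name), encoding='utf-8') as f:
                html = f.read()
            out_path = os.path.join(out_dir, name)
            with io.open(out_path, 'w', encoding='utf-8') as f:
                f.write(analyze(html))
            written.append(out_path)
    return written
```

Under these assumptions, calling extract_tree with any function that maps an HTML string to a content string would write output/wired.com/latest-higgs-rumors and output/seomoz.org/8-attributes-of-content-that-inspire-action for the layout shown.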

Arias et al.

Based on "Language Independent Content Extraction from Web Pages".

from dragnet import Arias
import requests

# Fetch the page, then pass its HTML to the extractor.
r = requests.get(
    'http://www.wired.com/wiredscience/2012/06/latest-higgs-rumors/')
print(Arias.analyze(r.content))

Kohlschütter et al.

Based on "Boilerplate Detection using Shallow Text Features".

from dragnet import Kohlschuetter
import requests

# Fetch the page, then pass its HTML to the extractor.
r = requests.get(
    'http://www.wired.com/wiredscience/2012/06/latest-higgs-rumors/')
print(Kohlschuetter.analyze(r.content))
