dragnet

Dragnet isn't interested in the shiny chrome or boilerplate dressing of a webpage. It's interested in... 'just the facts.'

It is meant to become a collection of reference implementations of various dechroming / content extraction algorithms.

Each algorithm is implemented as a class of static methods that can be imported from the top level of dragnet. Each class implements a method, analyze, which accepts a string of HTML and returns a string representing the content.
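The interface described above can be sketched with a stand-in class. TagStripper below is hypothetical and not part of dragnet. It only strips tags using the standard library's html.parser, so it illustrates the shape of the analyze contract (HTML string in, content string out) rather than any of the extraction algorithms. It assumes Python 3.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping <script> and <style> bodies."""

    SKIPPED = {'script', 'style'}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


class TagStripper(object):
    """Hypothetical extractor shaped like the classes described above."""

    @staticmethod
    def analyze(html):
        # Accept bytes (such as requests' r.content) as well as str.
        if isinstance(html, bytes):
            html = html.decode('utf-8', 'replace')
        collector = _TextCollector()
        collector.feed(html)
        collector.close()
        return '\n'.join(collector.chunks)
```

Because analyze is a static method, no instance is needed: TagStripper.analyze('<p>Hello</p>') is called directly on the class, the same way Arias.analyze and Kohlschuetter.analyze are called in the sections below.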

Running

Fill a directory named documents with per-site folders containing the HTML sources of documents from each site. Then run run.py, which iterates through each of the input files and produces a corresponding file in output containing just the content. For example,

documents/
    wired.com/
        latest-higgs-rumors
    seomoz.org/
        8-attributes-of-content-that-inspire-action
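The loop described above might look roughly like the following. This is a minimal sketch and not the actual contents of run.py: the function name extract_all, its parameters, and the choice to mirror the per-site folder structure under output are all assumptions made for illustration. It assumes Python 3.

```python
from pathlib import Path


def extract_all(analyze, source_dir='documents', output_dir='output'):
    """Run analyze over every file under source_dir.

    Hypothetical helper: writes each result to the same relative path
    under output_dir and returns the list of files written.
    """
    source_root = Path(source_dir)
    output_root = Path(output_dir)
    written = []
    for path in sorted(source_root.rglob('*')):
        if not path.is_file():
            continue  # skip the per-site folders themselves
        html = path.read_text(encoding='utf-8', errors='replace')
        target = output_root / path.relative_to(source_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(analyze(html), encoding='utf-8')
        written.append(target)
    return written
```

Any callable that takes an HTML string and returns a string can be passed in, so under these assumptions extract_all(Arias.analyze) would process the example tree above.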

Arias et al.

Based on Language Independent Content Extraction from Web Pages

from dragnet import Arias
import requests

# Fetch the raw HTML of the page
r = requests.get(
    'http://www.wired.com/wiredscience/2012/06/latest-higgs-rumors/')
# Print only the extracted content
print(Arias.analyze(r.content))

Kohlschütter et al.

Based on Boilerplate Detection using Shallow Text Features

from dragnet import Kohlschuetter
import requests

# Fetch the raw HTML of the page
r = requests.get(
    'http://www.wired.com/wiredscience/2012/06/latest-higgs-rumors/')
# Print only the extracted content
print(Kohlschuetter.analyze(r.content))
