This repository has been archived by the owner on Aug 24, 2023. It is now read-only.

peremenov/seize


seize

Seize is a lightweight web-page content extractor for Node and the browser, inspired by Arc90 Readability and Safari Reader.

Install

npm i --save seize

Usage

Seize works with DOM libraries such as jsdom. It only extracts the main content DOM node and prepares it for further use.

Example

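// Note: require('jsdom').jsdom is the legacy jsdom API (jsdom versions before 10).
var Seize = require('seize'),
    jsdom = require('jsdom').jsdom;

// Build a document from an HTML string and hand it to Seize.
var window = jsdom('<your html here>').defaultView,
    seize  = new Seize(window.document);

seize.content(); // returns DOM-node
seize.text();    // returns only text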

Browser usage

For browser usage you should clone your DOM object or create it from an HTML string:

/**
 * Converts an HTML string to a Document
 * @param  {String} html  HTML document string
 * @return {Document}     detached document built from the string
 */
function HTMLParser(html){
  // Create an empty document that is not attached to the current page
  var doc = document.implementation.createHTMLDocument("example");
  doc.documentElement.innerHTML = html;
  return doc;
}
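
For the cloning route, a minimal sketch might look like the following. It assumes the clone is wanted because extraction may modify the tree it is given, which this README does not state explicitly. The helper name extractText is hypothetical, and the Seize constructor is passed in as an argument because how it is loaded in a browser is not covered here.

```javascript
/**
 * Runs Seize on a deep copy of a document so the live page stays untouched.
 * extractText is a hypothetical helper, not part of Seize itself.
 * @param  {Function} Seize  the Seize constructor
 * @param  {Document} doc    document to extract from (e.g. window.document)
 * @return {String}          extracted text
 */
function extractText(Seize, doc) {
  // cloneNode(true) makes a deep copy, including all descendants
  var copy = doc.cloneNode(true);
  return new Seize(copy).text();
}
```

In a page this would be called as extractText(Seize, document), or with the result of HTMLParser when starting from a string.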

How it works

Here is a simple outline of the algorithm:

  • Collect the HTML tags expected to be text or content containers, such as p, table, img, etc.
  • Filter out unnecessary tags, by content and by tag name, that definitely can't be in a content container
  • Score each container by the tags it contains
  • Set the score by class name, id name, tag XPath score, and text score
  • Sort candidates by score
  • Take the first candidate
  • Clean up the article
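
The scoring and sorting steps can be illustrated with a small sketch on plain objects. This is not Seize's actual code: the weights, the keyword lists, and the names scoreCandidate and pickBest are assumptions made for illustration, and the XPath score is left out.

```javascript
// Hypothetical keyword lists; the README does not say which names Seize uses.
var POSITIVE = /article|content|post|text|body/i;
var NEGATIVE = /comment|footer|sidebar|menu/i;

// Scores one candidate described as { id, className, paragraphs, text }.
function scoreCandidate(c) {
  var score = 0;
  var names = (c.className || '') + ' ' + (c.id || '');
  if (POSITIVE.test(names)) score += 25;   // class/id hints at content
  if (NEGATIVE.test(names)) score -= 25;   // class/id hints at page chrome
  score += (c.paragraphs || 0) * 5;        // contained content tags
  // Text score: one point per 100 characters, capped at 3
  score += Math.min(Math.floor((c.text || '').length / 100), 3);
  return score;
}

// Sorts candidates by score, highest first, and returns the top one (or null).
function pickBest(candidates) {
  return candidates
    .map(function (c) { return { candidate: c, score: scoreCandidate(c) }; })
    .sort(function (a, b) { return b.score - a.score; })[0] || null;
}

var best = pickBest([
  { id: 'sidebar', className: 'menu', paragraphs: 1, text: 'Links' },
  { id: 'main', className: 'article-body', paragraphs: 4, text: new Array(401).join('x') }
]);
console.log(best.candidate.id); // prints "main"
```

Keeping the score as a single number makes the final selection a plain sort, which matches the "sort, then take the first candidate" steps in the list.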

Todo

  • Improve readme
  • Detect pages which can't be extracted
  • More tests
  • More examples

Contributing

You are welcome to improve this small piece of software :)

Author
