This repository has been archived by the owner on Aug 24, 2023. It is now read-only.

Init commit
peremenov committed Apr 22, 2016
0 parents commit 0440bfb
Showing 23 changed files with 25,855 additions and 0 deletions.
8 changes: 8 additions & 0 deletions .editorconfig
@@ -0,0 +1,8 @@
root = true

[*]
indent_style = space
indent_size = 2
trim_trailing_whitespace = true
charset = utf-8
insert_final_newline = true
38 changes: 38 additions & 0 deletions .gitignore
@@ -0,0 +1,38 @@

# Created by https://www.gitignore.io/api/node

### Node ###
# Logs
logs
*.log
npm-debug.log*

# Runtime data
pids
*.pid
*.seed

# Directory for instrumented libs generated by jscoverage/JSCover
lib-cov

# Coverage directory used by tools like istanbul
coverage

# Grunt intermediate storage (http://gruntjs.com/creating-plugins#storing-task-files)
.grunt

# node-waf configuration
.lock-wscript

# Compiled binary addons (http://nodejs.org/api/addons.html)
build/Release

# Dependency directories
node_modules
jspm_packages

# Optional npm cache directory
.npm

# Optional REPL history
.node_repl_history
2 changes: 2 additions & 0 deletions .npmignore
@@ -0,0 +1,2 @@
.editorconfig
.travis.yml
7 changes: 7 additions & 0 deletions .travis.yml
@@ -0,0 +1,7 @@
language: node_js
node_js:
- "5"
before_script:
- npm i

script: npm test
23 changes: 23 additions & 0 deletions package.json
@@ -0,0 +1,23 @@
{
"name": "seize",
"version": "0.1.0",
"description": "Seize is light Node or Browser content extractor inspired by arc90 readability and Safari Reader",
"main": "seize.js",
"scripts": {
"test": "./node_modules/mocha/bin/mocha -R spec"
},
"repository": {
"type" : "git",
"url" : "https://github.com/peremenov/seize.git"
},
"author": "Kir Peremenov",
"license": "MIT",
"dependencies": {
"lodash": "4.0.8"
},
"devDependencies": {
"chai": "^3.5.0",
"jsdom": "^8.4.0",
"mocha": "^2.4.5"
}
}
73 changes: 73 additions & 0 deletions readme.md
@@ -0,0 +1,73 @@
# seize

<!-- ## [по-русски](./readme.ru.md) -->

Seize is a lightweight content extractor for Node.js and the browser, inspired by [arc90 readability](http://www.arc90.com/work/readability/) and Safari Reader.

## Install

```bash
npm i --save seize
```

## Usage

Seize works with DOM libraries such as [jsdom](https://github.com/tmpvar/jsdom). It only extracts and prepares a DOM node for further use.

### Example

```javascript
var Seize = require('seize'),
    jsdom = require('jsdom').jsdom;

// jsdomOptions is your own jsdom configuration object
var window = jsdom('<your html here>', jsdomOptions).defaultView,
    seize  = new Seize(window.document);

seize.content(); // returns a DOM node
seize.text();    // returns plain text without formatting
```


## Browser usage

For browser usage, you should clone your DOM object or create one from an HTML string:

```javascript
/**
 * Converts an HTML string to a Document
 * @param  {String} html HTML document string
 * @return {Document} parsed document
 */
function HTMLParser(html) {
  var doc = document.implementation.createHTMLDocument("example");
  doc.documentElement.innerHTML = html;
  return doc;
}
```
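
The other option, cloning the DOM object, could look like the minimal sketch below. It relies on the standard `cloneNode(true)` DOM method; the helper name `cloneDocument` is hypothetical, and the commented usage assumes `Seize` is already available in the page. Cloning is presumably advised because the cleanup step modifies the nodes it works on, though the source does not say so explicitly.

```javascript
/**
 * Returns a deep copy of a document so that extraction
 * works on the copy instead of the live page.
 * (cloneDocument is a hypothetical helper, not part of seize.)
 * @param  {Document} doc document to copy
 * @return {Document} deep clone of the document
 */
function cloneDocument(doc) {
  // `true` asks for a deep clone: the node and all of its descendants
  return doc.cloneNode(true);
}

// In a browser, assuming Seize is loaded on the page:
// var seize = new Seize(cloneDocument(document));
// seize.text();
```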

## How it works

Here is a simple outline of the algorithm:

* Collect the HTML tags expected to hold text or content, such as `p`, `table`, `img`, etc.
* Filter out unnecessary tags, by content and by tag names that definitely can't be part of a content container
* Score each container by the tags it contains
* Adjust the score by class name, id name, tag XPath score and text score
* Sort candidates by score
* Take the first candidate
* Clean up the article
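
The scoring and sorting steps above can be illustrated with a small sketch. This is not the library's actual implementation: the weights, the regular expressions, and the names `scoreCandidate` and `pickBest` are all assumptions made for illustration, candidates are plain objects rather than DOM nodes, and the XPath score is left out.

```javascript
// Hypothetical patterns and weights, chosen only for illustration.
var POSITIVE = /article|content|post|text|body/i;
var NEGATIVE = /comment|footer|sidebar|nav|menu|banner/i;
var TAG_WEIGHTS = { p: 3, img: 2, table: 1 };

/**
 * Scores one candidate container described as a plain object:
 * { id: String, className: String, text: String, childTags: [String] }
 */
function scoreCandidate(candidate) {
  var score = 0;

  // Score by the tags the container holds
  candidate.childTags.forEach(function (tag) {
    if (Object.prototype.hasOwnProperty.call(TAG_WEIGHTS, tag)) {
      score += TAG_WEIGHTS[tag];
    }
  });

  // Score by class name and id name
  var names = candidate.className + ' ' + candidate.id;
  if (POSITIVE.test(names)) score += 25;
  if (NEGATIVE.test(names)) score -= 25;

  // Score by text: one point per 100 characters
  score += Math.floor(candidate.text.length / 100);

  return score;
}

/**
 * Sorts candidates by descending score and returns the first one,
 * or null when there are no candidates.
 */
function pickBest(candidates) {
  var sorted = candidates
    .map(function (c) { return { candidate: c, score: scoreCandidate(c) }; })
    .sort(function (a, b) { return b.score - a.score; });
  return sorted.length ? sorted[0].candidate : null;
}
```

With these made-up weights, a container with class `article-body` holding several paragraphs outranks a short `sidebar` block, so `pickBest` returns the former.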

## Todo

- Improve readme
- Detect pages which can't be extracted
- More tests
- More examples

## Contributing

You are welcome to improve this small piece of software :)

## Author

- [Kir Peremenov](mailto:[email protected])