This repository has been archived by the owner on Aug 24, 2023. It is now read-only.
0 parents, commit 0440bfb
Showing 23 changed files with 25,855 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
root = true | ||
|
||
[*] | ||
indent_style = space | ||
indent_size = 2 | ||
trim_trailing_whitespace = true | ||
charset = utf-8 | ||
insert_final_newline = true |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
|
||
# Created by https://www.gitignore.io/api/node | ||
|
||
### Node ### | ||
# Logs | ||
logs | ||
*.log | ||
npm-debug.log* | ||
|
||
# Runtime data | ||
pids | ||
*.pid | ||
*.seed | ||
|
||
# Directory for instrumented libs generated by jscoverage/JSCover | ||
lib-cov | ||
|
||
# Coverage directory used by tools like istanbul | ||
coverage | ||
|
||
# Grunt intermediate storage (http://gruntjs.com/creating-plugins#storing-task-files) | ||
.grunt | ||
|
||
# node-waf configuration | ||
.lock-wscript | ||
|
||
# Compiled binary addons (http://nodejs.org/api/addons.html) | ||
build/Release | ||
|
||
# Dependency directories | ||
node_modules | ||
jspm_packages | ||
|
||
# Optional npm cache directory | ||
.npm | ||
|
||
# Optional REPL history | ||
.node_repl_history |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
.editorconfig | ||
.travis.yml |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
language: node_js | ||
node_js: | ||
- "5" | ||
before_script: | ||
- npm i | ||
|
||
script: npm test |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
{ | ||
"name": "seize", | ||
"version": "0.1.0", | ||
"description": "Seize is light Node or Browser content extractor inspired by arc90 readability and Safari Reader", | ||
"main": "seize.js", | ||
"scripts": { | ||
"test": "./node_modules/mocha/bin/mocha -R spec" | ||
}, | ||
"repository": { | ||
"type" : "git", | ||
"url" : "https://github.com/peremenov/seize.git" | ||
}, | ||
"author": "Kir Peremenov", | ||
"license": "MIT", | ||
"dependencies": { | ||
"lodash": "4.0.8" | ||
}, | ||
"devDependencies": { | ||
"chai": "^3.5.0", | ||
"jsdom": "^8.4.0", | ||
"mocha": "^2.4.5" | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
# seize | ||
|
||
<!-- ## [по-русски](./readme.ru.md) --> | ||
|
||
Seize is a lightweight content extractor for Node.js and the browser, inspired by [arc90 readability](http://www.arc90.com/work/readability/) and Safari Reader. | ||
|
||
## Install | ||
|
||
```bash | ||
npm i --save seize | ||
``` | ||
|
||
## Usage | ||
|
||
Seize works with DOM libraries such as [jsdom](https://github.com/tmpvar/jsdom). It only extracts and prepares a single DOM node for further use. | ||
|
||
### Example | ||
|
||
```javascript | ||
var Seize = require('seize'), | ||
jsdom = require('jsdom').jsdom; | ||
|
||
// `jsdomOptions` is a placeholder for your own jsdom configuration object | ||
var window = jsdom('<your html here>', jsdomOptions).defaultView, | ||
seize = new Seize(window.document); | ||
|
||
seize.content(); // returns a DOM node | ||
seize.text(); // returns plain text without formatting | ||
``` | ||
|
||
|
||
## Browser usage | ||
|
||
For browser usage you should clone your DOM object or create a new one from an HTML string: | ||
|
||
```javascript | ||
/** | ||
 * Converts an HTML string to a Document | ||
 * @param {String} html HTML document string | ||
 * @return {Document} parsed document | ||
 */ | ||
function HTMLParser(html) { | ||
  var doc = document.implementation.createHTMLDocument("example"); | ||
  doc.documentElement.innerHTML = html; | ||
  return doc; | ||
} | ||
``` | ||
|
||
## How it works | ||
|
||
Here is a simple outline of the algorithm: | ||
|
||
* Getting HTML tags that we expect to be text or content containers, such as `p`, `table`, `img`, etc. | ||
* Filtering out unnecessary tags, by content and by tag names which definitely can't be in a content container | ||
* Setting a score for each container based on the tags it contains | ||
* Adjusting the score by class name, id name, tag XPath score and text score | ||
* Sorting candidates by score | ||
* Taking the first candidate | ||
* Cleaning up the article | ||
|
||
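The scoring and sorting steps above can be sketched roughly as follows. This is only an illustration and not Seize's actual implementation: the weights, the regular expressions, the candidate object shape (`childTags`, `className`, `id`, `textLength`) and the function names `scoreCandidate` and `pickBest` are all assumptions made for the example, and the XPath score is left out.

```javascript
// Hypothetical per-tag weights; the real values used by Seize may differ.
var TAG_SCORES = { p: 3, img: 1, table: 1 };

// Hypothetical hints looked up in class and id names.
var POSITIVE = /article|content|post|body/i;
var NEGATIVE = /comment|sidebar|footer|nav|promo/i;

// Scores one candidate container described by a plain object.
function scoreCandidate(candidate) {
  var score = 0;

  // Score by the tags the container holds.
  candidate.childTags.forEach(function (tag) {
    score += TAG_SCORES[tag] || 0;
  });

  // Score by class name and id name.
  var hint = candidate.className + ' ' + candidate.id;
  if (POSITIVE.test(hint)) score += 25;
  if (NEGATIVE.test(hint)) score -= 25;

  // Text score: one point per 100 characters, capped at 10.
  score += Math.min(Math.floor(candidate.textLength / 100), 10);

  return score;
}

// Sorts candidates by score (highest first) and takes the first one.
// Returns undefined when the list is empty.
function pickBest(candidates) {
  return candidates
    .map(function (c) {
      return { candidate: c, score: scoreCandidate(c) };
    })
    .sort(function (a, b) {
      return b.score - a.score;
    })[0];
}
```

Called with a list of candidate descriptions, `pickBest` returns an object holding the winning candidate and its score. In a real extractor the candidates would be DOM nodes and the chosen node would then go through the clean-up step.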
## Todo | ||
|
||
- Improve readme | ||
- Detect pages which can't be extracted | ||
- More tests | ||
- More examples | ||
|
||
## Contributing | ||
|
||
You are welcome to improve this small piece of software :) | ||
|
||
## Author | ||
|
||
- [Kir Peremenov](mailto:[email protected]) |