ExtractContent

ExtractContent is a module that extracts the main body text from HTML. It is a Python rewrite of the Ruby module of the same name.

Usage

::

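    import extractcontent

    extractor = extractcontent.ExtractContent()

    # Set option values
    opt = {"threshold": 50}
    extractor.set_default(opt)

    # Read the HTML to analyse
    with open("index.html") as f:
        html = f.read()

    extractor.analyse(html)
    text, title = extractor.as_text()
    html, title = extractor.as_html()
    title = extractor.extract_title(html)

    r"""
    Options: name / default value

    threshold / 100
        Score threshold above which a block is treated as body text.

    min_length / 80
        Minimum length of a block for it to be evaluated.

    decay_factor / 0.73
        Decay factor. The smaller the value, the higher the score of
        blocks near the top of the page.

    continuous_factor / 1.62
        Continuous block factor. The larger the value, the less likely
        blocks are to be judged as continuous.

    punctuation_weight / 10
        Score given to punctuation. The larger the value, the more likely
        a block containing punctuation is to be judged as body text.

    punctuations / r"(?is)(\343\200[\201\202]|\357\274[\201\214\216\237]|\.[^A-Za-z0-9]|,[^0-9]|!|\?)"
        Regular expression used to extract punctuation.

    debug / False
        If True, block information is printed.
    """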
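To make the option descriptions more concrete, here is a minimal sketch of how ``threshold``, ``min_length``, ``decay_factor`` and ``punctuation_weight`` could interact when scoring text blocks. It is an illustration only, not the module's actual implementation: ``score_blocks``, ``DEFAULTS`` and ``PUNCTUATIONS`` are hypothetical names, the scoring formula is an assumption, and the pattern is a simplified Unicode stand-in for the ``punctuations`` option.

```python
import re

# Default values copied from the option list above.
DEFAULTS = {
    "threshold": 100,
    "min_length": 80,
    "decay_factor": 0.73,
    "punctuation_weight": 10,
}

# Simplified stand-in for the `punctuations` option (assumption):
# Japanese and full-width marks, plus ASCII marks in sentence position.
PUNCTUATIONS = re.compile(r"[、。！，．？]|\.[^A-Za-z0-9]|,[^0-9]|!|\?")


def score_blocks(blocks, opt=None):
    """Return a (score, is_body) pair for each text block, in order.

    Hypothetical scoring rule: block length plus a bonus per punctuation
    mark, scaled by a factor that decays after every evaluated block.
    """
    opt = {**DEFAULTS, **(opt or {})}
    factor = 1.0
    results = []
    for block in blocks:
        # Blocks shorter than min_length are not evaluated at all.
        if len(block) < opt["min_length"]:
            results.append((0.0, False))
            continue
        hits = len(PUNCTUATIONS.findall(block))
        score = (len(block) + hits * opt["punctuation_weight"]) * factor
        # Later blocks are scaled down, so earlier blocks score higher.
        factor *= opt["decay_factor"]
        results.append((score, score > opt["threshold"]))
    return results
```

Under these assumptions, calling ``score_blocks([first, second], {"decay_factor": 0.5})`` penalises the second block more heavily than the default does, which matches the description of ``decay_factor`` above: smaller values favour blocks near the top.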

Acknowledgements: Thanks to the author of the original version and to everyone who improved it in their forks.

Copyright of the original implementation:
(c) 2007/2008/2009 Nakatani Shuyo / Cybozu labs Inc. All rights reserved.

- http://rubyforge.org/projects/extractcontent/
- http://labs.cybozu.co.jp/blog/nakatani/2007/09/web_1.html
- https://github.com/petitviolet/python-extractcontent
- https://github.com/yono/python-extractcontent

kanjirz50/python-extractcontent3