CHANGES.txt

0.11.3
------

 - Adjust to relocation of module for container abstract base classes

0.11.2
------

 - Bump lxml from 4.6.3 to 4.6.5 to fix security vulnerability

0.11.1
------

 - Bump lxml from 4.6.2 to 4.6.3 to fix security vulnerability
 - Retire support for Python 3.5 due to lxml 4.6.3 incompatibility

0.11.0
------

 - Retire support for Python 2
 - Retire Travis CI build and enable Python 3.8 and 3.9 in GitHub Actions
 - Enable local file access for wkhtmltopdf to fix failure in embedding images in PDFs

0.10.2
------

 - Bump lxml from 4.3.0 to 4.6.2 for security patch
 - Remove support for Python 3.4 as not supported in latest lxml version

0.10.1
------

 - Bugfix: TypeError when attempting to hash unencoded Unicode-objects

0.10.0
------

 - Test python 3.7 and 3.8 in Travis CI/GitHub Actions
 - Replace cgi.escape with html.escape in Python 3 due to removal of cgi.escape in 3.8
 - Reformat using Black

0.9.15
------

 - travis CI does not support 3.7 yet, removing that version from build

0.9.14
------

 - added versions 3.6 and 3.7 to travis CI build, removed 2.6 and 3.3
 - 2.6 and 3.3 deprecated by lxml

0.9.13
------

 - 3.7 added as supported version in setup
 - Updated LICENSE and requirements.txt

0.9.12
------

 - 3.6 added as supported version in setup
 - Updated LICENSE

0.9.11
------

 - Bugfix: MissingSchema during requests get
 - Bugfix: Check for Python 2 should have been for Python 3

0.9.10
------

 - More refactoring

0.9.9
------

 - Converted markdown README to rst

0.9.8
------

 - Changed Utility classifier to Utilities

0.9.7
------

 - Replaced compat.py with six module
 - Made imports relative rather than from PATH
 - More refactoring

0.9.6
------

 - Bugfix: Remove non-links through filtering by protocol
 - Refactorings

0.9.5
------

 - Bugfix: Properly join internal and base URLs for crawling

0.9.4
------

 - Retired support for 3.2 as tldextract doesn't support it

0.9.3
------

 - Moved crawling functions into a Crawler class
 - General refactorings to docstrings, function names, etc.
 - Consolidated max_pages and max_links arguments as max_crawls
 - Added tldextract module for getting URL domain, suffixes

0.9.2
------

 - Added compat.py file
 - Moved compatible builtin definitions to __init__
 - Added requests cache

0.9.1
------

 - Updated version in requirements and setup keywords
 - Removed --use-mirrors for 3.5 support

0.9.0
------

- Bugfix: Fixed comparison of duplicate URLs when crawling

0.8.11
------

 - Bugfix: Improper check of domain when being restrictive

0.8.10
------

 - Strip '/' from end of urls when crawling

0.8.9
------

 - Added argument for cache link size & fixed up others

0.8.8
------

 - Updated README and setup

0.8.7
------

 - added CSV as a format

0.8.6
------

 - added environ variable SCRAPE_DISABLE_IMGS to not save images

0.8.5
------

 - warn user that saving images during crawling is slow

0.8.4
------

 - moved print_text() from crawl.py back to scrape.py

0.8.3
------

 - fixed bad formatting in readme usage

0.8.2
------

 - ignore-load-errors removed from wkhtmltopdf executable

0.8.1
------

 - removed extra schema adding

0.8.0
------

 - fixed bug where added url schema not reflected in query

0.7.9
------

 - moved file crawling to new file
 - avoid overwrite prompt in tests

0.7.8
------

 - updated program description
 - removed overwriting test due to issues with it

0.7.7
------

 - no longer defaults to overwriting files, added program flags/a prompt
 - adding renaming mechanism if choosing to not overwrite a file
 - some function reorganizing

0.7.6
------

 - added print text to stdout option
 - removed extra newline appended in re_filter
 - wrapped pdfkit import in try/except as it isnt essential

0.7.5
------

 - removed extra urlparse import

0.7.4
------

 - added option to not save images
 - images are now only saved if saving to HTML or PDF
 - checks if outfilename has extension before adding new one
 - fixed domains being sometimes mismatched to urls
 - fixed extension being unnecessary appended to urls (for the most part)

0.7.3
------

 - development status reverted to beta

0.7.2
------

 - now saves images with PART.html files (but not css yet)
 - added module level docstrings

0.7.1
------

 - added EOFError handling

0.7.0
------

 - fixed crawl not returning filenames to add to infilenames
 - fixed re_filter adding duplicate matches
 - fixed domain unboundlocalerror

0.6.9
------

 - fixed bug where query not found in urls due to trailing /

0.6.8
------

 - updated program usage

0.6.7
------

 - fixed bounds check on out file names

0.6.6
------

 - added out file names as a program argument
 - fixed bug where re-writing multiple files
 - fixed bug where writing only the first file when writing single file

0.6.5
------

 - major improvement to remove_whitespace()

0.6.4
------

 - more docstring improvements

0.6.3
------

 - began process of making docstrings conform to pep257
 - increased size of link cache from 10 to 100
 - remove the newline at start of text files
 - add newlines between lines filtered by regex
 - remove_whitespace now removes newlines that are 3 in a row or more

0.6.2
------

 - stylistic changes
 - files are now read in 1K chunks

0.6.1
------

 - remove consecutive whitespace before writing text files
 - empty text files no longer written

0.6.0
------

 - fixed bug where single out file name wasn't properly constructed
 - out file names are all returned as lowercase now

0.5.9
------

 - fixed bug where text wouldn't write unless xpath specified

0.5.8
------

 - can now parse HTML using XPath and save to all formats
 - remove carriage returns in scraped text files

0.5.7
------

 - added maximum out file name length of 24 characters

0.5.6
------

 - fixed urls not being properly added under file_types

0.5.5
------

 - fixed UnboundLocalError in write_single_file

0.5.4
------

 - fixed redefinition of out_file_name in write_to_text

0.5.3
------

 - fixed IndexError in write_to_text

0.5.2
------

 - small fix for finding single out file name

0.5.1
------

 - remade method to find single out file name

0.5.0
------

 - can now save to single or multiple output files/directories
 - added tests for writing to single or multiple files
 - preserves original lines/newlines when parsing/writing files

0.4.11
------

 - changed generator.next() to next(generator) for python 3 compatibility

0.4.10
------

 - forgot to remove all occurrences of xrange

0.4.9
------

 - changed unicode decode to ascii decode when writing html to disk

0.4.8
------

 - added missing python 3 compatibilities

0.4.7
------

 - fixed urlparse importerror in utils.py for python 3 users

0.4.6
------

 - fixed html => text
 - all conversions fixed, test_scrape.py added to keep it this way
 - added pdfkit to requirements.txt

0.4.5
------

 - added docstrings to all functions
 - fixed IOError when trying to convert local html to html
 - fixed IOError when trying to convert local html to pdf
 - fixed saving scraped files to text, was saving PART filenames instead

0.4.4
------

 - prompts for filetype from user if none entered
 - modularized a couple functions

0.4.3
------

 - fixed out_file naming
 - pep8 and pylint reformatting

0.4.2
------

 - removed read_part_files in place of get_part_files as pdfkit reads filenames

0.4.1
------

 - fixed bug preventing writing scraped urls to pdf

0.4.0
------

 - can now read in text and filter it
 - recognizes local files, no need for user to enter special flag
 - moved html/ files to testing/ and added a text file to it
 - added better distinction between input and output files
 - changed instances of file to f_name in utils
 - pep8 reformatting

0.3.9
------

 - add scheme to urls if none present
 - fixed bug where raw_html was calling get_html rather than get_raw_html

0.3.8
------

 - made distinction between links and pages with multiple links on them
 - use --maxpages to set the maximum number of pages to get links from
 - use --maxlinks to set the maximum number of links to parse
 - improved the argument help messages
 - improved notes/description in README

0.3.7
------

 - fixes to page caching and writing PART files
 - use --local to read in local html files
 - use --max to indicate max number of pages to crawl
 - changed program description and keywords

0.3.6
------

 - cleanup using pylint as reference

0.3.5
------

- updated long program description in readme
- added pypi monthly downloads image in readme

0.3.4
------

 - updated description header in readme

0.3.3
------

 - added file conversion to program description

0.3.2
------

 - added travis-ci build status to readme

0.3.1
------

 - updated program description and added extra installation instructions
 - added .travis.yml and requirements.txt

0.3.0
------

 - added read option for user inputted html files, currently writes files individually and not grouped, to do next is add grouping option
 - added html/ directory containing test html files
 - made relative imports explicit using absolute_import
 - added proxies to utils.py

0.2.10
------

 - moved OrderedSet class to orderedset.py rather than utils.py

0.2.9
------

 - updated program description and keywords in setup.py

0.2.8
------

 - restricts crawling to seed domain by default, changed --strict to --nonstrict for crawling outside given website

0.2.5
------

 - added requests to install_requires in setup.py

0.2.4
------

 - added attributes flag which specifies which tag attributes to extract from a given page, such as text, href, etc.

0.2.3
------

 - updated flags and flag help messages
 - verbose now by default and reduced number of messages, use --quiet to silence messages
 - changed name of --files flag to --html for saving output as html
 - added --text flag, default is still text

0.2.2
------

 - fixed character encoding issue, all unicode now

0.2.1
------

 - improvements to exception handling for proper PART file removal

0.2.0
------

 - pages are now saved as they are crawled to PART.html files and processed/removed as necessary, this greatly saves on program memory
 - added a page cache with a limit of 10 for greater duplicate protection
 - added --files option for keeping webpages as PART.html instead of saving as text or pdf, this also organizes them into a subdirectory named after the seed url's domain
 - changed --restrict flag to --strict for restricting the domain to the seed domain while crawling
 - more --verbose messages being printed

0.1.10
------

 - now compares urls scheme-less before updating links to prevent http:// and https:// duplicates and replaced set_scheme with remove_scheme in utils.py
 - renamed write_pages to write_links

0.1.9
------

 - added behavior for --crawl keywords in crawl method
 - added a domain check before outputting crawled message or adding to crawled links
 - domain key in args is now set to base domain for proper --restrict behavior
 - clean_url now rstrips / character for proper link crawling
 - resolve_url now rstrips / character for proper out_file writing
 - updated description of --crawl flag

0.1.8
------

 - removed url fragments
 - replaced set_base with urlparse method urljoin
 - out_file name construction now uses urlparse 'path' member
 - raw_links is now an OrderedSet to try to eliminate as much processing as possible
 - added clear method to OrderedSet in utils.py

0.1.7
------

 - removed validate_domain and replaced it with a lambda instead
 - replaced domain with base_url in set_base as should have been done before
 - crawled message no longer prints if url was a duplicate

0.1.6
------

 - uncommented import __version__

0.1.5
------

 - set_domain was replaced by set_base, proper solution for links that are relative
 - fixed verbose behavior
 - updated description in README

0.1.4
------

 - fixed output file generation, was using domain instead of base_url
 - minor code cleanup

0.1.3
------

 - blank lines are no longer written to text unless as a page separator
 - style tags now ignored alongside script tags when getting text

0.1.2
------

 - added shebang

0.1.1
------

 - uncommented import __version__

0.1.0
------

 - reformatting to conform with PEP 8
 - added regexp support for matching crawl keywords and filter text keywords
 - improved url resolution by correcting domains and schemes
 - added --restrict option to restrict crawler links to only those with seed domain
 - made text the default write option rather than pdf, can now use --pdf to change that
 - removed page number being written to text, separator is now just a single blank line
 - improved construction of output file name

0.0.11
------

 - fixed missing comma in install_requires in setup.py
 - also labeled now as beta as there are still some kinks with crawling

0.0.10
------

 - now ignoring pdfkit load errors only if more than one link to try to prevent an empty pdf being created in case of error

0.0.9
------

 - pdfkit now ignores load errors and writes as many pages as possible

0.0.8
------

 - better implementation of crawler, can now scrape entire websites
 - added OrderedSet class to utils.py

0.0.7
------

 - changed --keywords to --filter and positional arg url to urls

0.0.6
------

 - use --keywords flag for filtering text
 - can pass multiple links now
 - will not write empty files anymore

0.0.5
------

 - added --verbose argument for use with pdfkit
 - improved output file name processing

0.0.4
------

 - accepts 0 or 1 url's, allowing a call with just --version

0.0.3
------

 - Moved utils.py to scrape/

0.0.2
------

 - First entry