-
Notifications
You must be signed in to change notification settings - Fork 17
/
Copy pathCHANGES.txt
687 lines (454 loc) · 13.3 KB
/
CHANGES.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
0.11.3
------
- Adjust to relocation of module for container abstract base classes
0.11.2
------
- Bump lxml from 4.6.3 to 4.6.5 to fix security vulnerability
0.11.1
------
- Bump lxml from 4.6.2 to 4.6.3 to fix security vulnerability
- Retire support for Python 3.5 due to lxml 4.6.3 incompatibility
0.11.0
------
- Retire support for Python 2
- Retire Travis CI build and enable Python 3.8 and 3.9 in GitHub Actions
- Enable local file access for wkhtmltopdf to fix failure in embedding images in PDFs
0.10.2
------
- Bump lxml from 4.3.0 to 4.6.2 for security patch
- Remove support for Python 3.4 as not supported in latest lxml version
0.10.1
------
- Bugfix: TypeError when attempting to hash unencoded Unicode-objects
0.10.0
------
- Test python 3.7 and 3.8 in Travis CI/GitHub Actions
- Replace cgi.escape with html.escape in Python 3 due to removal of cgi.escape in 3.8
- Reformat using Black
0.9.15
------
- travis CI does not support 3.7 yet, removing that version from build
0.9.14
------
- added versions 3.6 and 3.7 to travis CI build, removed 2.6 and 3.3
- 2.6 and 3.3 deprecated by lxml
0.9.13
------
- 3.7 added as supported version in setup
- Updated LICENSE and requirements.txt
0.9.12
------
- 3.6 added as supported version in setup
- Updated LICENSE
0.9.11
------
- Bugfix: MissingSchema during requests get
- Bugfix: Check for Python 2 should have been for Python 3
0.9.10
------
- More refactoring
0.9.9
------
- Converted markdown README to rst
0.9.8
------
- Changed Utility classifier to Utilities
0.9.7
------
- Replaced compat.py with six module
- Made imports relative rather than from PATH
- More refactoring
0.9.6
------
- Bugfix: Remove non-links through filtering by protocol
- Refactorings
0.9.5
------
- Bugfix: Properly join internal and base URLs for crawling
0.9.4
------
- Retired support for 3.2 as tldextract doesn't support it
0.9.3
------
- Moved crawling functions into a Crawler class
- General refactorings to docstrings, function names, etc.
- Consolidated max_pages and max_links arguments as max_crawls
- Added tldextract module for getting URL domain, suffixes
0.9.2
------
- Added compat.py file
- Moved compatible builtin definitions to __init__
- Added requests cache
0.9.1
------
- Updated version in requirements and setup keywords
- Removed --use-mirrors for 3.5 support
0.9.0
------
- Bugfix: Fixed comparison of duplicate URLs when crawling
0.8.11
------
- Bugfix: Improper check of domain when being restrictive
0.8.10
------
- Strip '/' from end of urls when crawling
0.8.9
------
- Added argument for cache link size & fixed up others
0.8.8
------
- Updated README and setup
0.8.7
------
- added CSV as a format
0.8.6
------
- added environ variable SCRAPE_DISABLE_IMGS to not save images
0.8.5
------
- warn user that saving images during crawling is slow
0.8.4
------
- moved print_text() from crawl.py back to scrape.py
0.8.3
------
- fixed bad formatting in readme usage
0.8.2
------
- ignore-load-errors removed from wkhtmltopdf executable
0.8.1
------
- removed extra schema adding
0.8.0
------
- fixed bug where added url schema not reflected in query
0.7.9
------
- moved file crawling to new file
- avoid overwrite prompt in tests
0.7.8
------
- updated program description
- removed overwriting test due to issues with it
0.7.7
------
- no longer defaults to overwriting files, added program flags/a prompt
- adding renaming mechanism if choosing to not overwrite a file
- some function reorganizing
0.7.6
------
- added print text to stdout option
- removed extra newline appended in re_filter
- wrapped pdfkit import in try/except as it isnt essential
0.7.5
------
- removed extra urlparse import
0.7.4
------
- added option to not save images
- images are now only saved if saving to HTML or PDF
- checks if outfilename has extension before adding new one
- fixed domains being sometimes mismatched to urls
- fixed extension being unnecessary appended to urls (for the most part)
0.7.3
------
- development status reverted to beta
0.7.2
------
- now saves images with PART.html files (but not css yet)
- added module level docstrings
0.7.1
------
- added EOFError handling
0.7.0
------
- fixed crawl not returning filenames to add to infilenames
- fixed re_filter adding duplicate matches
- fixed domain unboundlocalerror
0.6.9
------
- fixed bug where query not found in urls due to trailing /
0.6.8
------
- updated program usage
0.6.7
------
- fixed bounds check on out file names
0.6.6
------
- added out file names as a program argument
- fixed bug where re-writing multiple files
- fixed bug where writing only the first file when writing single file
0.6.5
------
- major improvement to remove_whitespace()
0.6.4
------
- more docstring improvements
0.6.3
------
- began process of making docstrings conform to pep257
- increased size of link cache from 10 to 100
- remove the newline at start of text files
- add newlines between lines filtered by regex
- remove_whitespace now removes newlines that are 3 in a row or more
0.6.2
------
- stylistic changes
- files are now read in 1K chunks
0.6.1
------
- remove consecutive whitespace before writing text files
- empty text files no longer written
0.6.0
------
- fixed bug where single out file name wasn't properly constructed
- out file names are all returned as lowercase now
0.5.9
------
- fixed bug where text wouldn't write unless xpath specified
0.5.8
------
- can now parse HTML using XPath and save to all formats
- remove carriage returns in scraped text files
0.5.7
------
- added maximum out file name length of 24 characters
0.5.6
------
- fixed urls not being properly added under file_types
0.5.5
------
- fixed UnboundLocalError in write_single_file
0.5.4
------
- fixed redefinition of out_file_name in write_to_text
0.5.3
------
- fixed IndexError in write_to_text
0.5.2
------
- small fix for finding single out file name
0.5.1
------
- remade method to find single out file name
0.5.0
------
- can now save to single or multiple output files/directories
- added tests for writing to single or multiple files
- preserves original lines/newlines when parsing/writing files
0.4.11
------
- changed generator.next() to next(generator) for python 3 compatibility
0.4.10
------
- forgot to remove all occurrences of xrange
0.4.9
------
- changed unicode decode to ascii decode when writing html to disk
0.4.8
------
- added missing python 3 compatibilities
0.4.7
------
- fixed urlparse importerror in utils.py for python 3 users
0.4.6
------
- fixed html => text
- all conversions fixed, test_scrape.py added to keep it this way
- added pdfkit to requirements.txt
0.4.5
------
- added docstrings to all functions
- fixed IOError when trying to convert local html to html
- fixed IOError when trying to convert local html to pdf
- fixed saving scraped files to text, was saving PART filenames instead
0.4.4
------
- prompts for filetype from user if none entered
- modularized a couple functions
0.4.3
------
- fixed out_file naming
- pep8 and pylint reformatting
0.4.2
------
- removed read_part_files in place of get_part_files as pdfkit reads filenames
0.4.1
------
- fixed bug preventing writing scraped urls to pdf
0.4.0
------
- can now read in text and filter it
- recognizes local files, no need for user to enter special flag
- moved html/ files to testing/ and added a text file to it
- added better distinction between input and output files
- changed instances of file to f_name in utils
- pep8 reformatting
0.3.9
------
- add scheme to urls if none present
- fixed bug where raw_html was calling get_html rather than get_raw_html
0.3.8
------
- made distinction between links and pages with multiple links on them
- use --maxpages to set the maximum number of pages to get links from
- use --maxlinks to set the maximum number of links to parse
- improved the argument help messages
- improved notes/description in README
0.3.7
------
- fixes to page caching and writing PART files
- use --local to read in local html files
- use --max to indicate max number of pages to crawl
- changed program description and keywords
0.3.6
------
- cleanup using pylint as reference
0.3.5
------
- updated long program description in readme
- added pypi monthly downloads image in readme
0.3.4
------
- updated description header in readme
0.3.3
------
- added file conversion to program description
0.3.2
------
- added travis-ci build status to readme
0.3.1
------
- updated program description and added extra installation instructions
- added .travis.yml and requirements.txt
0.3.0
------
- added read option for user inputted html files, currently writes files individually and not grouped, to do next is add grouping option
- added html/ directory containing test html files
- made relative imports explicit using absolute_import
- added proxies to utils.py
0.2.10
------
- moved OrderedSet class to orderedset.py rather than utils.py
0.2.9
------
- updated program description and keywords in setup.py
0.2.8
------
- restricts crawling to seed domain by default, changed --strict to --nonstrict for crawling outside given website
0.2.5
------
- added requests to install_requires in setup.py
0.2.4
------
- added attributes flag which specifies which tag attributes to extract from a given page, such as text, href, etc.
0.2.3
------
- updated flags and flag help messages
- verbose now by default and reduced number of messages, use --quiet to silence messages
- changed name of --files flag to --html for saving output as html
- added --text flag, default is still text
0.2.2
------
- fixed character encoding issue, all unicode now
0.2.1
------
- improvements to exception handling for proper PART file removal
0.2.0
------
- pages are now saved as they are crawled to PART.html files and processed/removed as necessary, this greatly saves on program memory
- added a page cache with a limit of 10 for greater duplicate protection
- added --files option for keeping webpages as PART.html instead of saving as text or pdf, this also organizes them into a subdirectory named after the seed url's domain
- changed --restrict flag to --strict for restricting the domain to the seed domain while crawling
- more --verbose messages being printed
0.1.10
------
- now compares urls scheme-less before updating links to prevent http:// and https:// duplicates and replaced set_scheme with remove_scheme in utils.py
- renamed write_pages to write_links
0.1.9
------
- added behavior for --crawl keywords in crawl method
- added a domain check before outputting crawled message or adding to crawled links
- domain key in args is now set to base domain for proper --restrict behavior
- clean_url now rstrips / character for proper link crawling
- resolve_url now rstrips / character for proper out_file writing
- updated description of --crawl flag
0.1.8
------
- removed url fragments
- replaced set_base with urlparse method urljoin
- out_file name construction now uses urlparse 'path' member
- raw_links is now an OrderedSet to try to eliminate as much processing as possible
- added clear method to OrderedSet in utils.py
0.1.7
------
- removed validate_domain and replaced it with a lambda instead
- replaced domain with base_url in set_base as should have been done before
- crawled message no longer prints if url was a duplicate
0.1.6
------
- uncommented import __version__
0.1.5
------
- set_domain was replaced by set_base, proper solution for links that are relative
- fixed verbose behavior
- updated description in README
0.1.4
------
- fixed output file generation, was using domain instead of base_url
- minor code cleanup
0.1.3
------
- blank lines are no longer written to text unless as a page separator
- style tags now ignored alongside script tags when getting text
0.1.2
------
- added shebang
0.1.1
------
- uncommented import __version__
0.1.0
------
- reformatting to conform with PEP 8
- added regexp support for matching crawl keywords and filter text keywords
- improved url resolution by correcting domains and schemes
- added --restrict option to restrict crawler links to only those with seed domain
- made text the default write option rather than pdf, can now use --pdf to change that
- removed page number being written to text, separator is now just a single blank line
- improved construction of output file name
0.0.11
------
- fixed missing comma in install_requires in setup.py
- also labeled now as beta as there are still some kinks with crawling
0.0.10
------
- now ignoring pdfkit load errors only if more than one link to try to prevent an empty pdf being created in case of error
0.0.9
------
- pdfkit now ignores load errors and writes as many pages as possible
0.0.8
------
- better implementation of crawler, can now scrape entire websites
- added OrderedSet class to utils.py
0.0.7
------
- changed --keywords to --filter and positional arg url to urls
0.0.6
------
- use --keywords flag for filtering text
- can pass multiple links now
- will not write empty files anymore
0.0.5
------
- added --verbose argument for use with pdfkit
- improved output file name processing
0.0.4
------
- accepts 0 or 1 url's, allowing a call with just --version
0.0.3
------
- Moved utils.py to scrape/
0.0.2
------
- First entry