Change Log

Full Changelog

Closed issues:

  • Add support for the SFTP protocol #319

3.4.0-20200518 (2020-05-18)

Full Changelog

Closed issues:

  • Cannot find class [ExtractorYoutubeDL] #322
  • Checkpoints 'spoiled' when used to resume crawls #277

Merged pull requests:

  • Fix match result is always false in MatchesListRegexDecideRule #328 (morokosi)
  • Add real crawlStatus in the crawlReport #326 (clawia)
  • youtube-dl: request best medium-ish size format #325 (galgeek)
  • Add parsing for HTML tags (data-*) #323 (clawia)
  • Add support for the SFTP protocol #320 (bnfleb)

3.4.0-20200304 (2020-03-04)

Full Changelog

Fixed bugs:

  • exception logged when opening/saving crawler-beans.cxml via web interface editor #305
  • Java interface text editor error when saving crawler-beans.cxml #293
  • Unable to upload crawler-beans.cxml with curl #282
  • CookieStoreTest.testConcurrentLoad fails randomly #274

Closed issues:

  • Contrib project has a maven dependency with an older version of guava library. #311
  • BloomFilter64bitTest is slow #299
  • ObjectIdentityBdbManualCacheTest is slow #297
  • HTTPS console inaccessible via browser #279
  • JDK11 support: ssl errors from console #275
  • JDK11 support: FetchHTTPTest: ssl handshake_failure #268
  • JDK11 support: org.archive.util.ObjectIdentityBdbCacheTest failures #267
  • JDK11 support: ClassNotFoundException: javax.transaction.xa.Xid #266
  • JDK11 support: tools.jar #265
  • JDK11 support: jaxb #264

Merged pull requests:

  • Use the Wayback Machine to repair a link to Oracle docs. #315 (anjackson)
  • Utilize the d parameter #314 (hennekey)
  • Exclude hbase-client's guava 12 transitive dependency #312 (ato)
  • Fix stream closed exception for Paged view #308 (ldko)
  • Fix stream closed exception by not closing output stream #306 (ato)
  • Replace custom Base32 encoding #304 (hennekey)
  • Replace constant with accessor methods #303 (hennekey)
  • limit ExtractorYoutubeDL heap usage #302 (nlevitt)
  • fix logging config #301 (nlevitt)
  • Use Guava instead of custom bloom filter implementation #300 (hennekey)
  • Speed up ObjectIdentityBdbManualCacheTest #298 (hennekey)
  • Set JUnit version to latest #296 (hennekey)
  • Disable test that connects to wwwb-dedup.us.archive.org #295 (ato)
  • Fix 'Method Not Allowed' on POST of config editor form #294 (ato)
  • Crawltrap regex timeout #290 (csrster)
  • Bdb frontier access #289 (csrster)
  • Attempt to filter out embedded images. #288 (csrster)
  • change trough dedup date type to varchar. #287 (nlevitt)
  • Add support for forced queue assignment and parallel queues #286 (adam-miller)
  • Warc writer chain #285 (nlevitt)
  • Fix jobdir PUT #283 (ato)
  • Upgrade BDB JE to version 7.5.11 - IMPORTANT CHANGE #281 (anjackson)
  • Mitigate random CookieStore.testConcurrentLoad test failures #280 (ato)
  • JDK11 support: upgrade to Jetty 9.4.19, Restlet 2.4.0 and drop JDK 7 support #276 (ato)
  • JDK11 support: remove unused class ObjectIdentityBdbCache and tests #273 (ato)
  • JDK11 support: upgrade maven-surefire-plugin to 2.22.2 #272 (ato)
  • JDK11 support: exclude tools.jar from hbase-client dependency #271 (ato)
  • Travis fixes #270 (ato)
  • WIP: ExtractorYoutubeDL #257 (nlevitt)
  • Update README and add LICENSE.txt #256 (ruebot)

3.4.0-20190418 (2019-04-18)

Full Changelog

Fixed bugs:

  • Invalid format exception in scanJobLog #239
  • Domain name lookup failures get cached forever #234
  • Allow failed lookups to expire, for #234. #235 (anjackson)

Closed issues:

  • Failed DNS requests remain enqueued #252
  • SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder" #236
  • Make FetchHistoryProcessor 304 handler more robust #229
  • ToeThread death when using HighestUriPrecedenceProvider #221
  • Google Drive robots.txt broken #193

Merged pull requests:

  • set of frontier management changes to support CrawlHQ module #253 (dvanduzer)
  • fix some trough dedup bugs #251 (nlevitt)
  • Remove suffix from warcWriter since it is no longer used. #249 (ruebot)
  • Revert "Upgrade httpclient to 4.5.7 and handle cookies more compliantly" #248 (ato)
  • Upgrade httpclient to 4.5.7 and handle cookies more compliantly #246 (anjackson)
  • Update README.md #244 (mikeizbicki)
  • Handle commas more compliantly when parsing srcset #243 (ato)
  • Trough dedup #242 (nlevitt)
  • Ensure we start parsing full lines, for #239. #240 (anjackson)
  • Add CHANGELOG; address #233. #238 (ruebot)

3.4.0-20190207 (2019-02-07)

Full Changelog

Fixed bugs:

  • Add checks to guard against server sending 304 in error #230 (anjackson)

3.4.0-20190205 (2019-02-05)

Full Changelog

Fixed bugs:

  • HTML extractor does not handle the base href correctly when it's relative #208
  • Heritrix3 (including pre-built binaries) Fails to Bootstrap with Java8 due to Changes in Java stdlib #176
  • Heritrix3 Fails to Build from Source #175
  • Missing OneLineSimpleLayout class file #173

Closed issues:

  • BdbFrontier thread safety #212
  • HTTP response only results in garbage bytes #206
  • Possibly stalled crawl #203
  • Where do i find the crawled information (Contents) after crawling is completed #199
  • -j option cannot handle spaces in directory names? #182
  • heritrix doesn't scrape rewrite srcset urls correctly #177
  • Possible race-condition when first using the WARC writers? #167
  • can you integration with spring boot #162
  • Noisy alerts about 401s without auth challenge #158
  • Can't see all beans in scripts #157
  • How to configure warcWriter with MirrorWriter? #156
  • Requesting inaccurate paths from js causes routing errors #155

Merged pull requests:

  • JDK11 support: explicitly depend on JAXB #269 (ato)
  • do not checkpoint if crawl job has not started #227 (nlevitt)
  • namespace scope log logger to crawl job #226 (nlevitt)
  • un-threadlocal the HConnection #224 (nlevitt)
  • reset HBaseAdmin on error #223 (nlevitt)
  • keep trying to start up hbase dedup forever #222 (nlevitt)
  • implement PredicatedDecideRule.onlyDecision() #220 (nlevitt)
  • use non-deprecated hbase api #219 (nlevitt)
  • Correct spelling mistakes. #218 (EdwardBetts)
  • Update API with note about checkpoint launching. #217 (anjackson)
  • Extend API to simplify using the latest checkpoint #215 (anjackson)
  • Ensure frontier work queues are updated safely across threads. #213 (anjackson)
  • fix exception starting DecideRuleSequence logging #210 (nlevitt)
  • HtmlExtractor: allow relative hrefs in the base element #209 (anjackson)
  • Fix link to User Guide #207 (maurice-schleussinger)
  • Add parameter to allow even distribution for parallel queues. #205 (adam-miller)
  • catch exceptions scoping outlinks to stop them from derailing process… #197 (nlevitt)
  • fix for test failures in a workspace on NFS-mounted filesystem #196 (kngenie)
  • limit max size of form input #194 (galgeek)
  • Enforce robots.txt character limit per char not per line #192 (ato)
  • Allow JavaDNS to be disabled as part of resolving outstanding build and test issues #190 (anjackson)
  • WARCLimitEnforcer.java - Add support for multiple warc writers. #189 (adam-miller)
  • treat a failed fetch (e.g. socket timeout) of robots.txt the same way… #187 (nlevitt)
  • reduce batch size to 400 and avoid ridiculously long log lines #186 (nlevitt)
  • escape strings in sql posted to trough #185 (nlevitt)
  • trough feed #180 (nlevitt)
  • Add parsing for srcset attributes #179 (BitBaron)
  • KafkaCrawlLogFeed had been using lots of heap because each callback i… #178 (nlevitt)
  • AMQP fine control #171 (anjackson)
  • fix for race-condition when first using the WARC writers https://gith… #168 (nlevitt)
  • Don't wait to receive Umbra urls if Heritrix sends no url to Umbra #166 (galgeek)
  • AMQP URL Waiter #165 (galgeek)
  • Fixes for apparent build errors (extends #154) #164 (nlevitt)
  • Kafka 0.9 #163 (nlevitt)
  • No link extraction on URI not successfully downloaded #161 (kris-sigur)
  • Fixes issue #158 : Noisy alerts about 401s without auth challenge #159 (kris-sigur)
  • Fixes for apparent build errors #154 (anjackson)
  • Switch to Java 7 #152 (anjackson)
  • Make Content-Location header url INFERRED not REFER hop type since C… #151 (vonrosen)
  • various changes to amqp publish and receive #150 (nlevitt)
  • Update to ExtractorHTML.java for cond. comments #149 (eleclerc)
  • Don't canonicalize source tag so that SourceSeedDecideRule will work.… #148 (vonrosen)
  • More fixes for multipart form submission #146 (vonrosen)
  • Make some urls with whitespace acceptable to JavaScript extractor. #145 (vonrosen)
  • run received urls through the candidates processor, to check scope an… #144 (nlevitt)
  • handle login forms with <input type="text"> fields in addition to use… #143 (nlevitt)
  • Form login multipart #142 (nlevitt)
  • Disable SNI for a request if that request failed due to an SNI error … #141 (vonrosen)
  • handle multiple clauses for same user agent in robots.txt #139 (nlevitt)
  • crawl level and host level limits on *novel* (not deduplicated) bytes and urls #138 (nlevitt)
  • SourceSeedDecideRule, SeedLimitsEnforcer #137 (nlevitt)
  • Register seeds sent in via AMQP #136 (anjackson)
  • Allow KnowledgableExtractorJS to parse out youtube watch from youtube… #135 (vonrosen)
  • Add maximum to number of cookies to store for domain to BdbCookieStore #133 (vonrosen)
  • try very hard to start url consumer, and therefore bind the queue to … #132 (nlevitt)
  • set isRunning=true so that stop() gets called to avoid leaking connec… #131 (nlevitt)
  • catch exceptions and log error in StatisticsTracker.run(), to make su… #130 (nlevitt)
  • load keytool utility main class dynamically, trying both the old and … #129 (nlevitt)
  • AMQPUrlReceiver changes to support RabbitMQ >= 3.3 #128 (anjackson)
  • 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-compiler-plugin is missing #126 (caofangkun)
  • Amqp declarations fix #125 (ldko)
  • Allow realm to be set by server for basic auth. #124 (vonrosen)
  • Hosts report #123 (kris-sigur)
  • only submit checkbox and radio button form fields if they are on by d… #122 (nlevitt)
  • new contrib module KnowledgableExtractorJS, a subclass of ExtractorJS th... #121 (nlevitt)
  • for ARI-4267 accept possible uris with two dots in the filename part if ... #120 (nlevitt)
  • Fix for HER-2082 #119 (adam-miller)
  • Fix for ServerNotModified WARC revisit records incorrectly record WARC-Payload-Digest #118 (kris-sigur)
  • avoid java.lang.NullPointerException at org.archive.modules.writer.Write... #117 (nlevitt)
  • make sure log4j is configured when running unit tests, to avoid log4j er... #116 (nlevitt)
  • Set character set to UTF-8 when passing through files. #115 (kris-sigur)
  • remove RecordingOutputStreamTest.java (moving to webarchive-commons) #114 (nlevitt)
  • Amqp receiver deadlock #112 (nlevitt)
  • somewhat ugly fix to handle exceptions from the bean browser like java.l... #111 (nlevitt)
  • Upgrade to HttpClient 4.3.6 #110 (kris-sigur)
  • so that it can appear in the crawl log, add contentSize to CrawlURI extr... #109 (nlevitt)
  • kafka crawl log feed #108 (nlevitt)
  • Handle case where form does not have an action defined. #107 (vonrosen)
  • seriously, fix extraInfo handling in AMQPCrawlLogFeed #106 (nlevitt)
  • fix extraInfo handling in AMQPCrawlLogFeed #105 (nlevitt)
  • change field names to match new druid config #104 (nlevitt)
  • CandidatesProcessor.java #103 (adam-miller)
  • avoid deadlock in AMQPUrlReceiver hopefully #102 (nlevitt)
  • Remove forcefetch for AMQP received urls so they don't get crawled twice... #101 (vonrosen)
  • Allow discovery of urls in content attribute of meta tags. #100 (vonrosen)
  • AMQPCrawlLogFeed, DecideRuleSequenceWithAMQPFeed, DecideRuleSequence.logExtraInfo #99 (nlevitt)
  • Fix for HER-2074 #97 (kris-sigur)
  • new cookie store system to address HER-2070 "cookie monster" bug #96 (nlevitt)
  • FIX corner-case of bean browser failing due to an exception from hashCode() #95 (kngenie)
  • do not require "+" (plus sign) before @OPERATOR_CONTACT_URL@ in user-age... #94 (nlevitt)
  • Allow urls in JavaScript between unicode quotes to be detected. #93 (vonrosen)
  • remove more unused classes #92 (nlevitt)
  • FetchHTTP.java #91 (adam-miller)
  • Move Wayback-dedup module to heritrix-contrib #90 (kngenie)
  • Don’t let exception from property getter fail entire bean-browser. #89 (kngenie)
  • fix bug in CrawlURI.compare() discovered by Kenji, add unit test CrawlUR... #88 (nlevitt)
  • Allow xml extractor to handle urls in CDATA. #87 (vonrosen)
  • remove unused Transform* classes #86 (nlevitt)
  • switch to mainline iipc webarchive-commons latest release #84 (nlevitt)
  • oops! count novel urls/bytes for hosts report, etc #83 (nlevitt)
  • Fix for HER-2071 #82 (kris-sigur)
  • Hbase cdh5 #81 (nlevitt)
  • ExtractorHTML when a/@href links include the attribute data-remote="true... #80 (nlevitt)
  • Revisit redux #79 (nlevitt)
  • treat content as html and extract links if it looks like html, even if m... #78 (nlevitt)
  • Force urls received from AMQP to be recrawled so custom http headers can... #77 (vonrosen)
  • HER-2039 remove class Link, use CrawlURI #76 (nlevitt)
  • in CrawlURI.createCrawlURI(), avoid clobbering inherited data with data ... #75 (nlevitt)
  • Fix for https://webarchive.jira.com/browse/ARI-3943 #74 (vonrosen)
  • Treat codebase as link hops, not embeds #73 (kris-sigur)
  • add A_ANNOTATIONS to persistentKeys so that CrawlURI doesn't lose its an... #72 (nlevitt)
  • avoid calling CheckpointService.hasAvailableCheckpoints() when crawl not... #71 (nlevitt)
  • for ARI-3712, add extracted links relative to both via and base, and annotate with "extractorSWFRelToVia", "extractorSWFRelToBase", or "extractorSWFRelToBoth" if resulting link is the same whether relative to base or via #70 (nlevitt)
  • For https://webarchive.jira.com/browse/ARI-3865 #69 (vonrosen)
  • handle exception determining whether to apply overlay #68 (nlevitt)
  • don't log severe with stack trace on normal amqp shutdown #67 (nlevitt)
  • oops, make "exit java process" button work again #66 (nlevitt)
  • shut down the starter-restarter thread at crawl finish!! #65 (nlevitt)
  • Via surt prefixed decide rule #64 (adam-miller)
  • Contrib - ExtractorPDFContent #63 (adam-miller)
  • Ari 3765 gracefully handle amqp server going up and down #62 (nlevitt)
  • HER-2065 synchronize on inactiveQueuesByPrecedence inside of synchronize... #61 (nlevitt)
  • Cosmetics #60 (nlevitt)
  • fix unit test now that we accept speculative urls with query params with... #59 (nlevitt)
  • for ARI-3723, accept speculative urls with query params with no value #58 (nlevitt)
  • AMQPUrlReceiver - improve handling of case where rabbitmq is unreachable... #57 (nlevitt)
  • fix FormLoginProcessor checkpointing #56 (nlevitt)
  • oops, update test to expect post data as url-encoded query string #54 (nlevitt)
  • Fix form login #53 (nlevitt)
  • Implicitly add the ${} around groovyExpression. When cxml contains ${}, ... #52 (nlevitt)
  • Expression deciderule #51 (nlevitt)
  • Replace deprecated routines in guava #50 (shriphani)
  • Youtube march 2014 #49 (nlevitt)
  • Umbra #48 (nlevitt)
  • Adjusting Youtube itag priority #47 (adam-miller)
  • switch dependency from ia-web-commons 1.1.1-SNAPSHOT to webarchive-commo... #46 (nlevitt)
  • Update youtube itags #45 (nlevitt)
  • update httpcomponents, should address NPE we've seen https://issues.apac... #44 (nlevitt)
  • fix job.log file handler was left open when jobdir is removed #43 (martinsbalodis)
  • Adding the queue declaration and binding to the UrlReceiver #42 (eldondev)
  • Fix slow cookies #41 (nlevitt)
  • For https://webarchive.jira.com/browse/HER-2064 #40 (vonrosen)
  • progress and formatting changes #39 (nlevitt)
  • Umbra - AMQPUrlReceiver.java receive urls via amqp and add to frontier, related changes #38 (nlevitt)
  • fix HER-2063 - omit port in Host request header when it is default for t... #37 (nlevitt)
  • Avoid the exception below by handling bad charsets in FetchHTTP. Restore... #36 (nlevitt)
  • whoops! send escaped path+query on http request line; had been sending r... #35 (nlevitt)
  • fix NullPointerException in case of 401 with no auth challenge (includes... #34 (nlevitt)
  • First pass at a processor to publish crawluris to AMQP channels #33 (eldondev)
  • Switch to BasicHttpClientConnectionManager instead of #32 (nlevitt)
  • make http proxy port configurable in cxml, avoiding this: org.springfram... #31 (nlevitt)
  • Fix bdb cookie store #30 (nlevitt)
  • HER-2062 Fix for WorkQueueFrontier.deleteURIs handling of retired queues #29 (kris-sigur)
  • switch to httpcomponents, get rid of archive-overlay-commons-httpclient #28 (nlevitt)
  • rename dist/README.md to dist/README.txt so that maven bundles it in the... #27 (nlevitt)

3.2.0 (2014-01-10)

Full Changelog

Merged pull requests:

  • update readme for 3.2.0 release #26 (nlevitt)
  • bump version number to 3.2.0 for release #25 (nlevitt)
  • for url-agnostic dedup, follow "Proposal for Standardizing the Recording... #24 (nlevitt)
  • fix HER-1979 so heritrix can run on windows xp #23 (nlevitt)
  • HER-1726: Templatize HTML #21 (adam-miller)
  • Her 2031 - Improve login-form submission options #20 (gojomo)
  • BeanLookupBindings for simpler script access to beans #19 (travisfw)
  • Fix for HER-2018: XML representation for /engine/job/<jobName>/beans returns incorrect url for named beans #17 (adam-miller)
  • Fix for HER-2017 XML representation of beans uses root node of type "script" #16 (adam-miller)
  • Reuse htmllinkcontext #15 (kngenie)
  • suppress unused warnings for serialVersionUid #14 (travisfw)
  • have TooManyPathSegmentsDecideRule count path segments only #13 (travisfw)
  • generics warnings fixes #12 (travisfw)
  • New reports #11 (travisfw)
  • ScriptedDecideRule#getEngine() rewrite for better synchronization and thread local mgmt #10 (travisfw)

3.1.1 (2012-05-02)

Full Changelog

3.0.0 (2009-12-05)

* This Change Log was automatically generated by github_changelog_generator