Closed issues:
- Add support for the SFTP protocol #319
Merged pull requests:
- Fixes extractor multiple regex matcher recycle #335 (adam-miller)
- Remove deprecated sudo setting. #333 (dengliming)
3.4.0-20200518 (2020-05-18)
Merged pull requests:
- Fix match result is always false in MatchesListRegexDecideRule #328 (morokosi)
- Add real crawlStatus in the crawlReport #326 (clawia)
- youtube-dl: request best medium-ish size format #325 (galgeek)
- Add parsing for HTML tags (data-*) #323 (clawia)
- Add support for the SFTP protocol #320 (bnfleb)
3.4.0-20200304 (2020-03-04)
Fixed bugs:
- exception logged when opening/saving crawler-beans.cxml via web interface editor #305
- Java interface text editor error when saving crawler-beans.cxml #293
- Unable to upload crawler-beans.cxml with curl #282
- CookieStoreTest.testConcurrentLoad fails randomly #274
Closed issues:
- Contrib project has a maven dependency with an older version of guava library. #311
- BloomFilter64bitTest is slow #299
- ObjectIdentityBdbManualCacheTest is slow #297
- HTTPS console inaccessible via browser #279
- JDK11 support: ssl errors from console #275
- JDK11 support: FetchHTTPTest: ssl handshake_failure #268
- JDK11 support: org.archive.util.ObjectIdentityBdbCacheTest failures #267
- JDK11 support: ClassNotFoundException: javax.transaction.xa.Xid #266
- JDK11 support: tools.jar #265
- JDK11 support: jaxb #264
Merged pull requests:
- Use the Wayback Machine to repair a link to Oracle docs. #315 (anjackson)
- Utilize the `d` parameter #314 (hennekey)
- Exclude hbase-client's guava 12 transitive dependency #312 (ato)
- Fix stream closed exception for Paged view #308 (ldko)
- Fix stream closed exception by not closing output stream #306 (ato)
- Replace custom Base32 encoding #304 (hennekey)
- Replace constant with accessor methods #303 (hennekey)
- limit ExtractorYoutubeDL heap usage #302 (nlevitt)
- fix logging config #301 (nlevitt)
- Use Guava instead of custom bloom filter implementation #300 (hennekey)
- Speed up ObjectIdentityBdbManualCacheTest #298 (hennekey)
- Set JUnit version to latest #296 (hennekey)
- Disable test that connects to wwwb-dedup.us.archive.org #295 (ato)
- Fix 'Method Not Allowed' on POST of config editor form #294 (ato)
- Crawltrap regex timeout #290 (csrster)
- Bdb frontier access #289 (csrster)
- Attempt to filter out embedded images. #288 (csrster)
- change trough dedup `date` type to varchar. #287 (nlevitt)
- Add support for forced queue assignment and parallel queues #286 (adam-miller)
- Warc writer chain #285 (nlevitt)
- Fix jobdir PUT #283 (ato)
- Upgrade BDB JE to version 7.5.11 - IMPORTANT CHANGE #281 (anjackson)
- Mitigate random CookieStore.testConcurrentLoad test failures #280 (ato)
- JDK11 support: upgrade to Jetty 9.4.19, Restlet 2.4.0 and drop JDK 7 support #276 (ato)
- JDK11 support: remove unused class ObjectIdentityBdbCache and tests #273 (ato)
- JDK11 support: upgrade maven-surefire-plugin to 2.22.2 #272 (ato)
- JDK11 support: exclude tools.jar from hbase-client dependency #271 (ato)
- Travis fixes #270 (ato)
- WIP: ExtractorYoutubeDL #257 (nlevitt)
- Update README and add LICENSE.txt #256 (ruebot)
3.4.0-20190418 (2019-04-18)
Fixed bugs:
- Invalid format exception in scanJobLog #239
- Domain name lookup failures get cached forever #234
- Allow failed lookups to expire, for #234. #235 (anjackson)
Closed issues:
- Failed DNS requests remain enqueued #252
- SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder" #236
- Make FetchHistoryProcessor 304 handler more robust #229
- ToeThread death when using HighestUriPrecedenceProvider #221
- Google Drive robots.txt broken #193
Merged pull requests:
- set of frontier management changes to support CrawlHQ module #253 (dvanduzer)
- fix some trough dedup bugs #251 (nlevitt)
- Remove suffix from warcWriter since it is no longer used. #249 (ruebot)
- Revert "Upgrade httpclient to 4.5.7 and handle cookies more compliantly" #248 (ato)
- Upgrade httpclient to 4.5.7 and handle cookies more compliantly #246 (anjackson)
- Update README.md #244 (mikeizbicki)
- Handle commas more compliantly when parsing srcset #243 (ato)
- Trough dedup #242 (nlevitt)
- Ensure we start parsing full lines, for #239. #240 (anjackson)
- Add CHANGELOG; address #233. #238 (ruebot)
3.4.0-20190207 (2019-02-07)
Merged pull requests:
- Add synchronized statements for internetarchive#221. #231 (anjackson)
3.4.0-20190205 (2019-02-05)
Fixed bugs:
- HTML extractor does not handle the base href correctly when it's relative #208
- Heritrix3 (including pre-built binaries) Fails to Bootstrap with Java8 due to Changes in Java stdlib #176
- Heritrix3 Fails to Build from Source #175
- Missing OneLineSimpleLayout class file #173
Closed issues:
- BdbFrontier thread safety #212
- HTTP response only results in garbage bytes #206
- Possibly stalled crawl #203
- Where do i find the crawled information (Contents) after crawling is completed #199
- `-j` option cannot handle spaces in directory names? #182
- heritrix doesn't scrape rewrite srcset urls correctly #177
- Possible race-condition when first using the WARC writers? #167
- can you integration with spring boot #162
- Noisy alerts about 401s without auth challenge #158
- Can't see all beans in scripts #157
- How to configure warcWriter with MirrorWriter? #156
- Requesting inaccurate paths from js causes routing errors #155
Merged pull requests:
- JDK11 support: explicitly depend on JAXB #269 (ato)
- do not checkpoint if crawl job has not started #227 (nlevitt)
- namespace scope log logger to crawl job #226 (nlevitt)
- un-threadlocal the HConnection #224 (nlevitt)
- reset HBaseAdmin on error #223 (nlevitt)
- keep trying to start up hbase dedup forever #222 (nlevitt)
- implement PredicatedDecideRule.onlyDecision() #220 (nlevitt)
- use non-deprecated hbase api #219 (nlevitt)
- Correct spelling mistakes. #218 (EdwardBetts)
- Update API with note about checkpoint launching. #217 (anjackson)
- Extend API to simplify using the latest checkpoint #215 (anjackson)
- Ensure frontier work queues are updated safely across threads. #213 (anjackson)
- fix exception starting DecideRuleSequence logging #210 (nlevitt)
- HtmlExtractor: allow relative hrefs in the base element #209 (anjackson)
- Fix link to User Guide #207 (maurice-schleussinger)
- Add parameter to allow even distribution for parallel queues. #205 (adam-miller)
- catch exceptions scoping outlinks to stop them from derailing process… #197 (nlevitt)
- fix for test failures in a workspace on NFS-mounted filesystem #196 (kngenie)
- limit max size of form input #194 (galgeek)
- Enforce robots.txt character limit per char not per line #192 (ato)
- Allow JavaDNS to be disabled as part of resolving outstanding build and test issues #190 (anjackson)
- WARCLimitEnforcer.java - Add support for multiple warc writers. #189 (adam-miller)
- treat a failed fetch (e.g. socket timeout) of robots.txt the same way… #187 (nlevitt)
- reduce batch size to 400 and avoid ridiculously long log lines #186 (nlevitt)
- escape strings in sql posted to trough #185 (nlevitt)
- trough feed #180 (nlevitt)
- Add parsing for srcset attributes #179 (BitBaron)
- KafkaCrawlLogFeed had been using lots of heap because each callback i… #178 (nlevitt)
- AMQP fine control #171 (anjackson)
- fix for race-condition when first using the WARC writers https://gith… #168 (nlevitt)
- Don't wait to receive Umbra urls if Heritrix sends no url to Umbra #166 (galgeek)
- AMQP URL Waiter #165 (galgeek)
- Fixes for apparent build errors (extends #154) #164 (nlevitt)
- Kafka 0.9 #163 (nlevitt)
- No link extraction on URI not successfully downloaded #161 (kris-sigur)
- Fixes issue #158 : Noisy alerts about 401s without auth challenge #159 (kris-sigur)
- Fixes for apparent build errors #154 (anjackson)
- Switch to Java 7 #152 (anjackson)
- Make Content-Location header url INFERRED not REFER hop type since C… #151 (vonrosen)
- various changes to amqp publish and receive #150 (nlevitt)
- Update to ExtractorHTML.java for cond. comments #149 (eleclerc)
- Don't canonicalize source tag so that SourceSeedDecideRule will work.… #148 (vonrosen)
- More fixes for multipart form submission #146 (vonrosen)
- Make some urls with whitespace acceptable to JavaScript extractor. #145 (vonrosen)
- run received urls through the candidates processor, to check scope an… #144 (nlevitt)
- handle login forms with <input type="text"> fields in addition to use… #143 (nlevitt)
- Form login multipart #142 (nlevitt)
- Disable SNI for a request if that request failed due to an SNI error … #141 (vonrosen)
- handle multiple clauses for same user agent in robots.txt #139 (nlevitt)
- crawl level and host level limits on *novel* (not deduplicated) bytes and urls #138 (nlevitt)
- SourceSeedDecideRule, SeedLimitsEnforcer #137 (nlevitt)
- Register seeds sent in via AMQP #136 (anjackson)
- Allow KnowledgableExtractorJS to parse out youtube watch from youtube… #135 (vonrosen)
- Add maximum to number of cookies to store for domain to BdbCookieStore #133 (vonrosen)
- try very hard to start url consumer, and therefore bind the queue to … #132 (nlevitt)
- set isRunning=true so that stop() gets called to avoid leaking connec… #131 (nlevitt)
- catch exceptions and log error in StatisticsTracker.run(), to make su… #130 (nlevitt)
- load keytool utility main class dynamically, trying both the old and … #129 (nlevitt)
- AMQPUrlReceiver changes to support RabbitMQ >= 3.3 #128 (anjackson)
- 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-compiler-plugin is missing #126 (caofangkun)
- Amqp declarations fix #125 (ldko)
- Allow realm to be set by server for basic auth. #124 (vonrosen)
- Hosts report #123 (kris-sigur)
- only submit checkbox and radio button form fields if they are on by d… #122 (nlevitt)
- new contrib module KnowledgableExtractorJS, a subclass of ExtractorJS th... #121 (nlevitt)
- for ARI-4267 accept possible uris with two dots in the filename part if ... #120 (nlevitt)
- Fix for HER-2082 #119 (adam-miller)
- Fix for ServerNotModified WARC revisit records incorrectly record WARC-Payload-Digest #118 (kris-sigur)
- avoid java.lang.NullPointerException at org.archive.modules.writer.Write... #117 (nlevitt)
- make sure log4j is configured when running unit tests, to avoid log4j er... #116 (nlevitt)
- Set character set to UTF-8 when passing through files. #115 (kris-sigur)
- remove RecordingOutputStreamTest.java (moving to webarchive-commons) #114 (nlevitt)
- Amqp receiver deadlock #112 (nlevitt)
- somewhat ugly fix to handle exceptions from the bean browser like java.l... #111 (nlevitt)
- Upgrade to HttpClient 4.3.6 #110 (kris-sigur)
- so that it can appear in the crawl log, add contentSize to CrawlURI extr... #109 (nlevitt)
- kafka crawl log feed #108 (nlevitt)
- Handle case where form does not have an action defined. #107 (vonrosen)
- seriously, fix extraInfo handling in AMQPCrawlLogFeed #106 (nlevitt)
- fix extraInfo handling in AMQPCrawlLogFeed #105 (nlevitt)
- change field names to match new druid config #104 (nlevitt)
- CandidatesProcessor.java #103 (adam-miller)
- avoid deadlock in AMQPUrlReceiver hopefully #102 (nlevitt)
- Remove forcefetch for AMQP received urls so they don't get crawled twice... #101 (vonrosen)
- Allow discovery of urls in content attribute of meta tags. #100 (vonrosen)
- AMQPCrawlLogFeed, DecideRuleSequenceWithAMQPFeed, DecideRuleSequence.logExtraInfo #99 (nlevitt)
- Fix for HER-2074 #97 (kris-sigur)
- new cookie store system to address HER-2070 "cookie monster" bug #96 (nlevitt)
- FIX corner-case of bean browser failing due to an exception from hashCode() #95 (kngenie)
- do not require "+" (plus sign) before @OPERATOR_CONTACT_URL@ in user-age... #94 (nlevitt)
- Allow urls in JavaScript between unicode quotes to be detected. #93 (vonrosen)
- remove more unused classes #92 (nlevitt)
- FetchHTTP.java #91 (adam-miller)
- Move Wayback-dedup module to heritrix-contrib #90 (kngenie)
- Don’t let exception from property getter fail entire bean-browser. #89 (kngenie)
- fix bug in CrawlURI.compare() discovered by Kenji, add unit test CrawlUR... #88 (nlevitt)
- Allow xml extractor to handle urls in CDATA. #87 (vonrosen)
- remove unused Transform* classes #86 (nlevitt)
- switch to mainline iipc webarchive-commons latest release #84 (nlevitt)
- oops! count novel urls/bytes for hosts report, etc #83 (nlevitt)
- Fix for HER-2071 #82 (kris-sigur)
- Hbase cdh5 #81 (nlevitt)
- ExtractorHTML when a/@href links include the attribute data-remote="true... #80 (nlevitt)
- Revisit redux #79 (nlevitt)
- treat content as html and extract links if it looks like html, even if m... #78 (nlevitt)
- Force urls received from AMQP to be recrawled so custom http headers can... #77 (vonrosen)
- HER-2039 remove class Link, use CrawlURI #76 (nlevitt)
- in CrawlURI.createCrawlURI(), avoid clobbering inherited data with data ... #75 (nlevitt)
- Fix for https://webarchive.jira.com/browse/ARI-3943 #74 (vonrosen)
- Treat codebase as link hops, not embeds #73 (kris-sigur)
- add A_ANNOTATIONS to persistentKeys so that CrawlURI doesn't lose its an... #72 (nlevitt)
- avoid calling CheckpointService.hasAvailableCheckpoints() when crawl not... #71 (nlevitt)
- for ARI-3712, add extracted links relative to both via and base, and annotate with "extractorSWFRelToVia", "extractorSWFRelToBase", or "extractorSWFRelToBoth" if resulting link is the same whether relative to base or via #70 (nlevitt)
- For https://webarchive.jira.com/browse/ARI-3865 #69 (vonrosen)
- handle exception determining whether to apply overlay #68 (nlevitt)
- don't log severe with stack trace on normal amqp shutdown #67 (nlevitt)
- oops, make "exit java process" button work again #66 (nlevitt)
- shut down the starter-restarter thread at crawl finish!! #65 (nlevitt)
- Via surt prefixed decide rule #64 (adam-miller)
- Contrib - ExtractorPDFContent #63 (adam-miller)
- Ari 3765 gracefully handle amqp server going up and down #62 (nlevitt)
- HER-2065 synchronize on inactiveQueuesByPrecedence inside of synchronize... #61 (nlevitt)
- Cosmetics #60 (nlevitt)
- fix unit test now that we accept speculative urls with query params with... #59 (nlevitt)
- for ARI-3723, accept speculative urls with query params with no value #58 (nlevitt)
- AMQPUrlReceiver - improve handling of case where rabbitmq is unreachable... #57 (nlevitt)
- fix FormLoginProcessor checkpointing #56 (nlevitt)
- oops, update test to expect post data as url-encoded query string #54 (nlevitt)
- Fix form login #53 (nlevitt)
- Implicitly add the ${} around groovyExpression. When cxml contains ${}, ... #52 (nlevitt)
- Expression deciderule #51 (nlevitt)
- Replace deprecated routines in guava #50 (shriphani)
- Youtube march 2014 #49 (nlevitt)
- Umbra #48 (nlevitt)
- Adjusting Youtube itag priority #47 (adam-miller)
- switch dependency from ia-web-commons 1.1.1-SNAPSHOT to webarchive-commo... #46 (nlevitt)
- Update youtube itags #45 (nlevitt)
- update httpcomponents, should address NPE we've seen https://issues.apac... #44 (nlevitt)
- fix job.log file handler was left open when jobdir is removed #43 (martinsbalodis)
- Adding the queue declaration and binding to the UrlReceiver #42 (eldondev)
- Fix slow cookies #41 (nlevitt)
- For https://webarchive.jira.com/browse/HER-2064 #40 (vonrosen)
- progress and formatting changes #39 (nlevitt)
- Umbra - AMQPUrlReceiver.java receive urls via amqp and add to frontier, related changes #38 (nlevitt)
- fix HER-2063 - omit port in Host request header when it is default for t... #37 (nlevitt)
- Avoid the exception below by handling bad charsets in FetchHTTP. Restore... #36 (nlevitt)
- whoops! send escaped path+query on http request line; had been sending r... #35 (nlevitt)
- fix NullPointerException in case of 401 with no auth challenge (includes... #34 (nlevitt)
- First pass at a processor to publish crawluris to AMQP channels #33 (eldondev)
- Switch to BasicHttpClientConnectionManager instead of #32 (nlevitt)
- make http proxy port configurable in cxml, avoiding this: org.springfram... #31 (nlevitt)
- Fix bdb cookie store #30 (nlevitt)
- HER-2062 Fix for WorkQueueFrontier.deleteURIs handling of retired queues #29 (kris-sigur)
- switch to httpcomponents, get rid of archive-overlay-commons-httpclient #28 (nlevitt)
- rename dist/README.md to dist/README.txt so that maven bundles it in the... #27 (nlevitt)
3.2.0 (2014-01-10)
Merged pull requests:
- update readme for 3.2.0 release #26 (nlevitt)
- bump version number to 3.2.0 for release #25 (nlevitt)
- for url-agnostic dedup, follow "Proposal for Standardizing the Recording... #24 (nlevitt)
- fix HER-1979 so heritrix can run on windows xp #23 (nlevitt)
- HER-1726: Templatize HTML #21 (adam-miller)
- Her 2031 - Improve login-form submission options #20 (gojomo)
- BeanLookupBindings for simpler script access to beans #19 (travisfw)
- Fix for HER-2018: XML representation for /engine/job/<jobName>/beans returns incorrect url for named beans #17 (adam-miller)
- Fix for HER-2017 XML representation of beans uses root node of type "script" #16 (adam-miller)
- Reuse htmllinkcontext #15 (kngenie)
- suppress unused warnings for serialVersionUid #14 (travisfw)
- have TooManyPathSegmentsDecideRule count path segments only #13 (travisfw)
- generics warnings fixes #12 (travisfw)
- New reports #11 (travisfw)
- ScriptedDecideRule#getEngine() rewrite for better synchronization and thread local mgmt #10 (travisfw)
3.1.1 (2012-05-02)
Merged pull requests:
- Publicsuffixes2 #9 (kngenie)
- Ip address set decide rule #7 (travisfw)
- HER-2001: Use the CodeMirror editor for crawl config and script console #6 (ato)
- HER-1998 #5 (adam-miller)
- sort script engines in script console #4 (travisfw)
3.0.0 (2009-12-05)
* This Change Log was automatically generated by github_changelog_generator