Change Log

Unreleased

Full Changelog

Closed issues:

Add support for the SFTP protocol #319

Merged pull requests:

Fixes extractor multiple regex matcher recycle #335 (adam-miller)
Remove deprecated sudo setting. #333 (dengliming)

3.4.0-20200518 (2020-05-18)

Full Changelog

Closed issues:

Cannot find class [ExtractorYoutubeDL] #322
Checkpoints 'spoiled' when used to resume crawls #277

Merged pull requests:

Fix match result is always false in MatchesListRegexDecideRule #328 (morokosi)
Add real crawlStatus in the crawlReport #326 (clawia)
youtube-dl: request best medium-ish size format #325 (galgeek)
Add parsing for HTML tags (data-*) #323 (clawia)
Add support for the SFTP protocol #320 (bnfleb)

3.4.0-20200304 (2020-03-04)

Full Changelog

Fixed bugs:

exception logged when opening/saving crawler-beans.cxml via web interface editor #305
Java interface text editor error when saving crawler-beans.cxml #293
Unable to upload crawler-beans.cxml with curl #282
CookieStoreTest.testConcurrentLoad fails randomly #274

Closed issues:

Contrib project has a maven dependency with an older version of guava library. #311
BloomFilter64bitTest is slow #299
ObjectIdentityBdbManualCacheTest is slow #297
HTTPS console inaccessible via browser #279
JDK11 support: ssl errors from console #275
JDK11 support: FetchHTTPTest: ssl handshake_failure #268
JDK11 support: org.archive.util.ObjectIdentityBdbCacheTest failures #267
JDK11 support: ClassNotFoundException: javax.transaction.xa.Xid #266
JDK11 support: tools.jar #265
JDK11 support: jaxb #264

Merged pull requests:

Use the Wayback Machine to repair a link to Oracle docs. #315 (anjackson)
Utilize the d parameter #314 (hennekey)
Exclude hbase-client's guava 12 transitive dependency #312 (ato)
Fix stream closed exception for Paged view #308 (ldko)
Fix stream closed exception by not closing output stream #306 (ato)
Replace custom Base32 encoding #304 (hennekey)
Replace constant with accessor methods #303 (hennekey)
limit ExtractorYoutubeDL heap usage #302 (nlevitt)
fix logging config #301 (nlevitt)
Use Guava instead of custom bloom filter implementation #300 (hennekey)
Speed up ObjectIdentityBdbManualCacheTest #298 (hennekey)
Set JUnit version to latest #296 (hennekey)
Disable test that connects to wwwb-dedup.us.archive.org #295 (ato)
Fix 'Method Not Allowed' on POST of config editor form #294 (ato)
Crawltrap regex timeout #290 (csrster)
Bdb frontier access #289 (csrster)
Attempt to filter out embedded images. #288 (csrster)
change trough dedup date type to varchar. #287 (nlevitt)
Add support for forced queue assignment and parallel queues #286 (adam-miller)
Warc writer chain #285 (nlevitt)
Fix jobdir PUT #283 (ato)
Upgrade BDB JE to version 7.5.11 - IMPORTANT CHANGE #281 (anjackson)
Mitigate random CookieStore.testConcurrentLoad test failures #280 (ato)
JDK11 support: upgrade to Jetty 9.4.19, Restlet 2.4.0 and drop JDK 7 support #276 (ato)
JDK11 support: remove unused class ObjectIdentityBdbCache and tests #273 (ato)
JDK11 support: upgrade maven-surefire-plugin to 2.22.2 #272 (ato)
JDK11 support: exclude tools.jar from hbase-client dependency #271 (ato)
Travis fixes #270 (ato)
WIP: ExtractorYoutubeDL #257 (nlevitt)
Update README and add LICENSE.txt #256 (ruebot)

3.4.0-20190418 (2019-04-18)

Full Changelog

Fixed bugs:

Invalid format exception in scanJobLog #239
Domain name lookup failures get cached forever #234
Allow failed lookups to expire, for #234. #235 (anjackson)

Closed issues:

Failed DNS requests remain enqueued #252
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder" #236
Make FetchHistoryProcessor 304 handler more robust #229
ToeThread death when using HighestUriPrecedenceProvider #221
Google Drive robots.txt broken #193

Merged pull requests:

set of frontier management changes to support CrawlHQ module #253 (dvanduzer)
fix some trough dedup bugs #251 (nlevitt)
Remove suffix from warcWriter since it is no longer used. #249 (ruebot)
Revert "Upgrade httpclient to 4.5.7 and handle cookies more compliantly" #248 (ato)
Upgrade httpclient to 4.5.7 and handle cookies more compliantly #246 (anjackson)
Update README.md #244 (mikeizbicki)
Handle commas more compliantly when parsing srcset #243 (ato)
Trough dedup #242 (nlevitt)
Ensure we start parsing full lines, for #239. #240 (anjackson)
Add CHANGELOG; address #233. #238 (ruebot)

3.4.0-20190207 (2019-02-07)

Full Changelog

Fixed bugs:

Add checks to guard against server sending 304 in error #230 (anjackson)

Merged pull requests:

Add synchronized statements for internetarchive#221. #231 (anjackson)

3.4.0-20190205 (2019-02-05)

Full Changelog

Fixed bugs:

HTML extractor does not handle the base href correctly when it's relative #208
Heritrix3 (including pre-built binaries) Fails to Bootstrap with Java8 due to Changes in Java stdlib #176
Heritrix3 Fails to Build from Source #175
Missing OneLineSimpleLayout class file #173

Closed issues:

BdbFrontier thread safety #212
HTTP response only results in garbage bytes #206
Possibly stalled crawl #203
Where do i find the crawled information (Contents) after crawling is completed #199
-j option cannot handle spaces in directory names? #182
heritrix doesn't scrape rewrite srcset urls correctly #177
Possible race-condition when first using the WARC writers? #167
can you integration with spring boot #162
Noisy alerts about 401s without auth challenge #158
Can't see all beans in scripts #157
How to configure warcWriter with MirrorWriter? #156
Requesting inaccurate paths from js causes routing errors #155

Merged pull requests:

JDK11 support: explicitly depend on JAXB #269 (ato)
do not checkpoint if crawl job has not started #227 (nlevitt)
namespace scope log logger to crawl job #226 (nlevitt)
un-threadlocal the HConnection #224 (nlevitt)
reset HBaseAdmin on error #223 (nlevitt)
keep trying to start up hbase dedup forever #222 (nlevitt)
implement PredicatedDecideRule.onlyDecision() #220 (nlevitt)
use non-deprecated hbase api #219 (nlevitt)
Correct spelling mistakes. #218 (EdwardBetts)
Update API with note about checkpoint launching. #217 (anjackson)
Extend API to simplify using the latest checkpoint #215 (anjackson)
Ensure frontier work queues are updated safely across threads. #213 (anjackson)
fix exception starting DecideRuleSequence logging #210 (nlevitt)
HtmlExtractor: allow relative hrefs in the base element #209 (anjackson)
Fix link to User Guide #207 (maurice-schleussinger)
Add parameter to allow even distribution for parallel queues. #205 (adam-miller)
catch exceptions scoping outlinks to stop them from derailing process… #197 (nlevitt)
fix for test failures in a workspace on NFS-mounted filesystem #196 (kngenie)
limit max size of form input #194 (galgeek)
Enforce robots.txt character limit per char not per line #192 (ato)
Allow JavaDNS to be disabled as part of resolving outstanding build and test issues #190 (anjackson)
WARCLimitEnforcer.java - Add support for multiple warc writers. #189 (adam-miller)
treat a failed fetch (e.g. socket timeout) of robots.txt the same way… #187 (nlevitt)
reduce batch size to 400 and avoid ridiculously long log lines #186 (nlevitt)
escape strings in sql posted to trough #185 (nlevitt)
trough feed #180 (nlevitt)
Add parsing for srcset attributes #179 (BitBaron)
KafkaCrawlLogFeed had been using lots of heap because each callback i… #178 (nlevitt)
AMQP fine control #171 (anjackson)
fix for race-condition when first using the WARC writers https://gith… #168 (nlevitt)
Don't wait to receive Umbra urls if Heritrix sends no url to Umbra #166 (galgeek)
AMQP URL Waiter #165 (galgeek)
Fixes for apparent build errors (extends #154) #164 (nlevitt)
Kafka 0.9 #163 (nlevitt)
No link extraction on URI not successfully downloaded #161 (kris-sigur)
Fixes issue #158 : Noisy alerts about 401s without auth challenge #159 (kris-sigur)
Fixes for apparent build errors #154 (anjackson)
Switch to Java 7 #152 (anjackson)
Make Content-Location header url INFERRED not REFER hop type since C… #151 (vonrosen)
various changes to amqp publish and receive #150 (nlevitt)
Update to ExtractorHTML.java for cond. comments #149 (eleclerc)
Don't canonicalize source tag so that SourceSeedDecideRule will work.… #148 (vonrosen)
More fixes for multipart form submission #146 (vonrosen)
Make some urls with whitespace acceptable to JavaScript extractor. #145 (vonrosen)
run received urls through the candidates processor, to check scope an… #144 (nlevitt)
handle login forms with <input type="text"> fields in addition to use… #143 (nlevitt)
Form login multipart #142 (nlevitt)
Disable SNI for a request if that request failed due to an SNI error … #141 (vonrosen)
handle multiple clauses for same user agent in robots.txt #139 (nlevitt)
crawl level and host level limits on *novel* (not deduplicated) bytes and urls #138 (nlevitt)
SourceSeedDecideRule, SeedLimitsEnforcer #137 (nlevitt)
Register seeds sent in via AMQP #136 (anjackson)
Allow KnowledgableExtractorJS to parse out youtube watch from youtube… #135 (vonrosen)
Add maximum to number of cookies to store for domain to BdbCookieStore #133 (vonrosen)
try very hard to start url consumer, and therefore bind the queue to … #132 (nlevitt)
set isRunning=true so that stop() gets called to avoid leaking connec… #131 (nlevitt)
catch exceptions and log error in StatisticsTracker.run(), to make su… #130 (nlevitt)
load keytool utility main class dynamically, trying both the old and … #129 (nlevitt)
AMQPUrlReceiver changes to support RabbitMQ >= 3.3 #128 (anjackson)
'build.plugins.plugin.version' for org.apache.maven.plugins:maven-compiler-plugin is missing #126 (caofangkun)
Amqp declarations fix #125 (ldko)
Allow realm to be set by server for basic auth. #124 (vonrosen)
Hosts report #123 (kris-sigur)
only submit checkbox and radio button form fields if they are on by d… #122 (nlevitt)
new contrib module KnowledgableExtractorJS, a subclass of ExtractorJS th... #121 (nlevitt)
for ARI-4267 accept possible uris with two dots in the filename part if ... #120 (nlevitt)
Fix for HER-2082 #119 (adam-miller)
Fix for ServerNotModified WARC revisit records incorrectly record WARC-Payload-Digest #118 (kris-sigur)
avoid java.lang.NullPointerException at org.archive.modules.writer.Write... #117 (nlevitt)
make sure log4j is configured when running unit tests, to avoid log4j er... #116 (nlevitt)
Set character set to UTF-8 when passing through files. #115 (kris-sigur)
remove RecordingOutputStreamTest.java (moving to webarchive-commons) #114 (nlevitt)
Amqp receiver deadlock #112 (nlevitt)
somewhat ugly fix to handle exceptions from the bean browser like java.l... #111 (nlevitt)
Upgrade to HttpClient 4.3.6 #110 (kris-sigur)
so that it can appear in the crawl log, add contentSize to CrawlURI extr... #109 (nlevitt)
kafka crawl log feed #108 (nlevitt)
Handle case where form does not have an action defined. #107 (vonrosen)
seriously, fix extraInfo handling in AMQPCrawlLogFeed #106 (nlevitt)
fix extraInfo handling in AMQPCrawlLogFeed #105 (nlevitt)
change field names to match new druid config #104 (nlevitt)
CandidatesProcessor.java #103 (adam-miller)
avoid deadlock in AMQPUrlReceiver hopefully #102 (nlevitt)
Remove forcefetch for AMQP received urls so they don't get crawled twice... #101 (vonrosen)
Allow discovery of urls in content attribute of meta tags. #100 (vonrosen)
AMQPCrawlLogFeed, DecideRuleSequenceWithAMQPFeed, DecideRuleSequence.logExtraInfo #99 (nlevitt)
Fix for HER-2074 #97 (kris-sigur)
new cookie store system to address HER-2070 "cookie monster" bug #96 (nlevitt)
FIX corner-case of bean browser failing due to an exception from hashCode() #95 (kngenie)
do not require "+" (plus sign) before @OPERATOR_CONTACT_URL@ in user-age... #94 (nlevitt)
Allow urls in JavaScript between unicode quotes to be detected. #93 (vonrosen)
remove more unused classes #92 (nlevitt)
FetchHTTP.java #91 (adam-miller)
Move Wayback-dedup module to heritrix-contrib #90 (kngenie)
Don’t let exception from property getter fail entire bean-browser. #89 (kngenie)
fix bug in CrawlURI.compare() discovered by Kenji, add unit test CrawlUR... #88 (nlevitt)
Allow xml extractor to handle urls in CDATA. #87 (vonrosen)
remove unused Transform* classes #86 (nlevitt)
switch to mainline iipc webarchive-commons latest release #84 (nlevitt)
oops! count novel urls/bytes for hosts report, etc #83 (nlevitt)
Fix for HER-2071 #82 (kris-sigur)
Hbase cdh5 #81 (nlevitt)
ExtractorHTML when a/@href links include the attribute data-remote="true... #80 (nlevitt)
Revisit redux #79 (nlevitt)
treat content as html and extract links if it looks like html, even if m... #78 (nlevitt)
Force urls received from AMQP to be recrawled so custom http headers can... #77 (vonrosen)
HER-2039 remove class Link, use CrawlURI #76 (nlevitt)
in CrawlURI.createCrawlURI(), avoid clobbering inherited data with data ... #75 (nlevitt)
Fix for https://webarchive.jira.com/browse/ARI-3943 #74 (vonrosen)
Treat codebase as link hops, not embeds #73 (kris-sigur)
add A_ANNOTATIONS to persistentKeys so that CrawlURI doesn't lose its an... #72 (nlevitt)
avoid calling CheckpointService.hasAvailableCheckpoints() when crawl not... #71 (nlevitt)
for ARI-3712, add extracted links relative to both via and base, and annotate with "extractorSWFRelToVia", "extractorSWFRelToBase", or "extractorSWFRelToBoth" if resulting link is the same whether relative to base or via #70 (nlevitt)
For https://webarchive.jira.com/browse/ARI-3865 #69 (vonrosen)
handle exception determining whether to apply overlay #68 (nlevitt)
don't log severe with stack trace on normal amqp shutdown #67 (nlevitt)
oops, make "exit java process" button work again #66 (nlevitt)
shut down the starter-restarter thread at crawl finish!! #65 (nlevitt)
Via surt prefixed decide rule #64 (adam-miller)
Contrib - ExtractorPDFContent #63 (adam-miller)
Ari 3765 gracefully handle amqp server going up and down #62 (nlevitt)
HER-2065 synchronize on inactiveQueuesByPrecedence inside of synchronize... #61 (nlevitt)
Cosmetics #60 (nlevitt)
fix unit test now that we accept speculative urls with query params with... #59 (nlevitt)
for ARI-3723, accept speculative urls with query params with no value #58 (nlevitt)
AMQPUrlReceiver - improve handling of case where rabbitmq is unreachable... #57 (nlevitt)
fix FormLoginProcessor checkpointing #56 (nlevitt)
oops, update test to expect post data as url-encoded query string #54 (nlevitt)
Fix form login #53 (nlevitt)
Implicitly add the ${} around groovyExpression. When cxml contains ${}, ... #52 (nlevitt)
Expression deciderule #51 (nlevitt)
Replace deprecated routines in guava #50 (shriphani)
Youtube march 2014 #49 (nlevitt)
Umbra #48 (nlevitt)
Adjusting Youtube itag priority #47 (adam-miller)
switch dependency from ia-web-commons 1.1.1-SNAPSHOT to webarchive-commo... #46 (nlevitt)
Update youtube itags #45 (nlevitt)
update httpcomponents, should address NPE we've seen https://issues.apac... #44 (nlevitt)
fix job.log file handler was left open when jobdir is removed #43 (martinsbalodis)
Adding the queue declaration and binding to the UrlReceiver #42 (eldondev)
Fix slow cookies #41 (nlevitt)
For https://webarchive.jira.com/browse/HER-2064 #40 (vonrosen)
progress and formatting changes #39 (nlevitt)
Umbra - AMQPUrlReceiver.java receive urls via amqp and add to frontier, related changes #38 (nlevitt)
fix HER-2063 - omit port in Host request header when it is default for t... #37 (nlevitt)
Avoid the exception below by handling bad charsets in FetchHTTP. Restore... #36 (nlevitt)
whoops! send escaped path+query on http request line; had been sending r... #35 (nlevitt)
fix NullPointerException in case of 401 with no auth challenge (includes... #34 (nlevitt)
First pass at a processor to publish crawluris to AMQP channels #33 (eldondev)
Switch to BasicHttpClientConnectionManager instead of #32 (nlevitt)
make http proxy port configurable in cxml, avoiding this: org.springfram... #31 (nlevitt)
Fix bdb cookie store #30 (nlevitt)
HER-2062 Fix for WorkQueueFrontier.deleteURIs handling of retired queues #29 (kris-sigur)
switch to httpcomponents, get rid of archive-overlay-commons-httpclient #28 (nlevitt)
rename dist/README.md to dist/README.txt so that maven bundles it in the... #27 (nlevitt)

3.2.0 (2014-01-10)

Full Changelog

Merged pull requests:

update readme for 3.2.0 release #26 (nlevitt)
bump version number to 3.2.0 for release #25 (nlevitt)
for url-agnostic dedup, follow "Proposal for Standardizing the Recording... #24 (nlevitt)
fix HER-1979 so heritrix can run on windows xp #23 (nlevitt)
HER-1726: Templatize HTML #21 (adam-miller)
Her 2031 - Improve login-form submission options #20 (gojomo)
BeanLookupBindings for simpler script access to beans #19 (travisfw)
Fix for HER-2018: XML representation for /engine/job/<jobName>/beans returns incorrect url for named beans #17 (adam-miller)
Fix for HER-2017 XML representation of beans uses root node of type "script" #16 (adam-miller)
Reuse htmllinkcontext #15 (kngenie)
suppress unused warnings for serialVersionUid #14 (travisfw)
have TooManyPathSegmentsDecideRule count path segments only #13 (travisfw)
generics warnings fixes #12 (travisfw)
New reports #11 (travisfw)
ScriptedDecideRule#getEngine() rewrite for better synchronization and thread local mgmt #10 (travisfw)

3.1.1 (2012-05-02)

Full Changelog

Merged pull requests:

Publicsuffixes2 #9 (kngenie)
Ip address set decide rule #7 (travisfw)
HER-2001: Use the CodeMirror editor for crawl config and script console #6 (ato)
HER-1998 #5 (adam-miller)
sort script engines in script console #4 (travisfw)

3.0.0 (2009-12-05)

* This Change Log was automatically generated by github_changelog_generator
