Commit 41765ca

DupeFilter: add setting for verbose logging + stats counter for filtered requests
1 parent: 42dc34f

2 files changed: +35, -14 lines

docs/topics/settings.rst (+19, -9)
@@ -47,7 +47,7 @@ These mechanisms are described in more detail below.
 
 Global overrides are the ones that take most precedence, and are usually
 populated by command-line options. You can also override one (or more) settings
-from command line using the ``-s`` (or ``--set``) command line option.
+from command line using the ``-s`` (or ``--set``) command line option.
 
 For more information see the :attr:`~scrapy.settings.Settings.overrides`
 Settings attribute.
@@ -115,7 +115,7 @@ Built-in settings reference
 ===========================
 
 Here's a list of all available Scrapy settings, in alphabetical order, along
-with their default values and the scope where they apply.
+with their default values and the scope where they apply.
 
 The scope, where available, shows where the setting is being used, if it's tied
 to any particular component. In that case the module of that component will be
@@ -303,7 +303,7 @@ orders. For more info see :ref:`topics-downloader-middleware-setting`.
 DOWNLOADER_MIDDLEWARES_BASE
 ---------------------------
 
-Default::
+Default::
 
     {
         'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
@@ -376,7 +376,7 @@ See `DOWNLOAD_HANDLERS_BASE` for example format.
 DOWNLOAD_HANDLERS_BASE
 ----------------------
 
-Default::
+Default::
 
     {
         'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
@@ -387,7 +387,7 @@ Default::
 
 A dict containing the request download handlers enabled by default in Scrapy.
 You should never modify this setting in your project, modify
-:setting:`DOWNLOAD_HANDLERS` instead.
+:setting:`DOWNLOAD_HANDLERS` instead.
 
 .. setting:: DOWNLOAD_TIMEOUT
 
@@ -410,7 +410,17 @@ The class used to detect and filter duplicate requests.
 The default (``RFPDupeFilter``) filters based on request fingerprint using
 the ``scrapy.utils.request.request_fingerprint`` function.
 
-.. setting:: EDITOR
+.. setting:: DUPEFILTER_DEBUG
+
+DUPEFILTER_DEBUG
+----------------
+
+Default: ``False``
+
+By default, ``RFPDupeFilter`` only logs the first duplicate request.
+Setting :setting:`DUPEFILTER_DEBUG` to ``True`` will make it log all duplicate requests.
+
+.. setting:: EDITOR
 
 EDITOR
 ------
@@ -428,7 +438,7 @@ EXTENSIONS
 
 Default:: ``{}``
 
-A dict containing the extensions enabled in your project, and their orders.
+A dict containing the extensions enabled in your project, and their orders.
 
 .. setting:: EXTENSIONS_BASE
 
@@ -452,7 +462,7 @@ Default::
 
 The list of available extensions. Keep in mind that some of them need to
 be enabled through a setting. By default, this setting contains all stable
-built-in extensions.
+built-in extensions.
 
 For more information See the :ref:`extensions user guide <topics-extensions>`
 and the :ref:`list of available extensions <topics-extensions-ref>`.
@@ -869,7 +879,7 @@ USER_AGENT
 
 Default: ``"Scrapy/VERSION (+http://scrapy.org)"``
 
-The default User-Agent to use when crawling, unless overridden.
+The default User-Agent to use when crawling, unless overridden.
 
 .. _Amazon web services: http://aws.amazon.com/
 .. _breadth-first order: http://en.wikipedia.org/wiki/Breadth-first_search
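(In the hunks above, any removed/added pair that looks identical differs only in trailing whitespace, which this commit strips in passing.)

For context, the new setting is enabled like any other Scrapy setting. A minimal sketch (the project layout and spider name here are illustrative, not part of this commit):

    # settings.py, turn on verbose duplicate-request logging
    DUPEFILTER_DEBUG = True

It can also be toggled for a single run with the ``-s`` option that the settings docs mention:

    scrapy crawl myspider -s DUPEFILTER_DEBUG=1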

scrapy/dupefilter.py (+16, -5)
@@ -28,17 +28,21 @@ def log(self, request, spider): # log that a request has been filtered
 class RFPDupeFilter(BaseDupeFilter):
     """Request Fingerprint duplicates filter"""
 
-    def __init__(self, path=None):
+    _log_level = log.DEBUG
+
+    def __init__(self, path=None, verbose_log=False):
         self.file = None
         self.fingerprints = set()
         self.logdupes = True
+        self.verbose_log = verbose_log
         if path:
             self.file = open(os.path.join(path, 'requests.seen'), 'a+')
             self.fingerprints.update(x.rstrip() for x in self.file)
 
     @classmethod
     def from_settings(cls, settings):
-        return cls(job_dir(settings))
+        verbose_log = settings.getbool('DUPEFILTER_DEBUG')
+        return cls(job_dir(settings), verbose_log)
 
     def request_seen(self, request):
         fp = self.request_fingerprint(request)
@@ -56,7 +60,14 @@ def close(self, reason):
             self.file.close()
 
     def log(self, request, spider):
-        if self.logdupes:
-            fmt = "Filtered duplicate request: %(request)s - no more duplicates will be shown (see DUPEFILTER_CLASS)"
-            log.msg(format=fmt, request=request, level=log.DEBUG, spider=spider)
+        if self.verbose_log:
+            fmt = "Filtered duplicate request: %(request)s"
+            log.msg(format=fmt, request=request, level=self._log_level, spider=spider)
+        elif self.logdupes:
+            fmt = ("Filtered duplicate request: %(request)s"
+                   " - no more duplicates will be shown"
+                   " (see DUPEFILTER_DEBUG to show all duplicates)")
+            log.msg(format=fmt, request=request, level=self._log_level, spider=spider)
             self.logdupes = False
+
+        spider.crawler.stats.inc_value('dupefilter/filtered', spider=spider)
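With the patch applied, every filtered request bumps the ``dupefilter/filtered`` stats key in addition to being logged. A minimal sketch of reading that counter back, assuming the spider-level ``closed()`` shortcut and the ``spider.crawler`` attribute that Scrapy of this era provides (the spider name and URL are made up):

    from scrapy.spider import BaseSpider

    class ExampleSpider(BaseSpider):
        name = 'example'
        start_urls = ['http://example.com/']

        def parse(self, response):
            pass

        def closed(self, reason):
            # 'dupefilter/filtered' is incremented once per dropped duplicate,
            # so at close time it holds the total number of filtered requests.
            filtered = self.crawler.stats.get_value('dupefilter/filtered', 0)
            self.log("dupefilter dropped %s duplicate requests" % filtered)

Since the counter lives in the regular stats collector, it also shows up in the end-of-crawl stats dump when :setting:`STATS_DUMP` is enabled, so no extra code is needed just to see the number.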
