| SEP | 19 |
| Title | Per-spider settings |
| Author | Pablo Hoffman, Nicolás Ramirez, Julia Medina |
| Created | 2013-03-07 |
| Status | Final (implemented with minor variations) |
This is a proposal to add support for overriding settings on a per-spider basis in a consistent way, while taking the chance to refactor the settings population and the whole crawl workflow.

In short, you will be able to override settings per spider by implementing a class method in your spider:
```python
class MySpider(Spider):

    @classmethod
    def custom_settings(cls):
        return {
            "DOWNLOAD_DELAY": 5.0,
            "RETRY_ENABLED": False,
        }
```
- a new `custom_settings` class method will be added to spiders, to give them a chance to override settings before they're used to instantiate the crawler
- spider managers will keep their spider-class loading functionality (with a new `load` method that will return a spider class given its name), but spider initialization will be delegated to crawlers (with a new `from_crawler` class method in spiders, that will allow them access to crawlers directly)
- the spider manager will be stripped out of the Crawler class, as it will no longer need it
- the `SPIDER_MODULES` and `SPIDER_MANAGER_CLASS` settings will be removed and replaced by entries in `scrapy.cfg`, so spider managers won't need project settings to configure themselves
- CrawlerProcess will be removed, since crawlers will be created independently with a required spider class and an optional `SettingsReader` instance
- the Settings class will be split into two classes, `SettingsLoader` and `SettingsReader`, and a new concept of "setting priority" will be added
The Settings class will be split into two classes, `SettingsLoader` and `SettingsReader`. The former will be used to collect all the different levels of settings across the project, and the latter will be a frozen version of the already loaded settings and the preferred way to access them. This will avoid the current possible misconception that you can change settings after they have been populated. There will be a new concept of settings priorities, and `settings.overrides` will be deprecated in favor of explicitly loaded settings with priorities, which makes settings overriding independent of the order in which settings are set.
Because of this, `CrawlerSettings` (with its overrides, settings_module and defaults) will be removed, but its interface could be maintained for backward compatibility in `SettingsReader` (it cannot live in `SettingsLoader`, since an `overrides` dictionary and settings with priorities cannot be implemented consistently together). Maintaining these attributes and their functionality is not advisable, though, since it breaks the read-only nature of the class.
With the new per-spider settings, there's a need for a helper function that takes a spider and returns a `SettingsReader` instance populated with the default, project and given spider's settings. The motive behind this is that `get_project_settings` can no longer be used to get a settings instance for crawler usage when using the API directly, as the project is not the only source of settings anymore. `get_project_settings` will become an internal function because of that.
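A minimal sketch of such a helper, assuming the `SettingsLoader`/`SettingsReader` classes described below; the helper name and `project_settings_dict()` are illustrative, not part of the proposal:

```python
from scrapy.settings import default_settings

def build_spider_settings(spidercls):
    """Return a frozen SettingsReader populated with default, project and
    per-spider settings (hypothetical helper name)."""
    loader = SettingsLoader()
    loader.setdict(
        {name: getattr(default_settings, name)
         for name in dir(default_settings) if name.isupper()},
        priority=0)                                            # global defaults
    loader.setdict(project_settings_dict(), priority=20)       # settings.py (assumed helper)
    loader.setdict(spidercls.custom_settings(), priority=30)   # per-spider settings
    return loader.freeze()
```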
`SettingsLoader` is going to populate settings at startup; it will then be converted into a `SettingsReader` instance and discarded. It is supposed to be write-only, but many already loaded settings need to be read before freezing. For example, the `COMMANDS_MODULE` setting allows loading additional command default settings. Another example is that the `LOG_*` settings must be read early, because we must be able to log errors during the settings loading process itself. `ScrapyCommands` may also be configured based upon current settings, as users can plug in custom commands. These are some of the reasons why this class needs read-write access.
- It will have a `set(name, value, priority)` method to register a setting with a given priority. A `setdict(dict, priority)` method may come in handy for loading project and per-spider settings.
- It will have the current Settings getter functions (`get`, `getint`, `getfloat`, `getdict`, etc.). (See above for the reasons behind this.)
- It will have a `freeze` method that returns an instance of `SettingsReader`, with a copy of the current state of the settings (already prioritized).
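A minimal sketch of how `SettingsLoader` could keep track of priorities; the internal attributes are assumptions, and `SettingsReader` is the read-only class sketched in the next section:

```python
class SettingsLoader(object):
    """Write-mostly container used only while settings are being populated."""

    def __init__(self):
        self._values = {}      # setting name -> value
        self._priorities = {}  # setting name -> priority it was stored with

    def set(self, name, value, priority=20):
        # Keep the value only if its priority is at least as high as the one
        # already stored, which makes population order-independent.
        if priority >= self._priorities.get(name, -1):
            self._values[name] = value
            self._priorities[name] = priority

    def setdict(self, values, priority=20):
        for name, value in values.items():
            self.set(name, value, priority)

    def get(self, name, default=None):
        # Read access is still needed during startup (COMMANDS_MODULE,
        # LOG_* settings, command configuration), hence the getters are kept.
        return self._values.get(name, default)

    def freeze(self):
        # Hand a prioritized copy over to the read-only SettingsReader.
        return SettingsReader(dict(self._values))
```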
`SettingsReader` is intended to be the class used by the core, extensions, and all components that read settings without modifying them. Because there are objects that do change settings, such as `ScrapyCommands`, the use cases of each settings class need to be comprehensively explained.
New crawlers will be created with an instance of this class (the one returned by the `freeze` method of the already populated `SettingsLoader`), because they are not expected to alter the settings.

It will be read-only, keeping the same getter methods as the current Settings class (`get`, `getint`, `getfloat`, `getdict`, etc.). There could be a `set` method that throws a descriptive error for debugging compatibility, preventing its inadvertent usage.
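Continuing the sketch above, a frozen `SettingsReader` might look like the following (illustrative only; the real class would mirror all the getters of the current `Settings`):

```python
class SettingsReader(object):
    """Read-only view of the settings, returned by SettingsLoader.freeze()."""

    def __init__(self, values):
        self._values = dict(values)  # already prioritized by the loader

    def get(self, name, default=None):
        return self._values.get(name, default)

    def getint(self, name, default=0):
        return int(self.get(name, default))

    def getfloat(self, name, default=0.0):
        return float(self.get(name, default))

    def getdict(self, name, default=None):
        return dict(self.get(name, default or {}))

    def set(self, name, value, priority=None):
        # Fail loudly instead of silently ignoring writes, so inadvertent
        # usage is easy to spot while debugging.
        raise TypeError("settings are frozen; use SettingsLoader before "
                        "freeze() to modify them")
```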
There will be 5 setting priorities used by default (illustrated right after this list):

- 0: global defaults (those in `scrapy.settings.default_settings`)
- 10: per-command defaults (for example, the shell command runs with `KEEP_ALIVE=True`)
- 20: project settings (those in `settings.py`)
- 30: per-spider settings (those returned by the `Spider.custom_settings` class method)
- 40: command line arguments (those passed in the command line)
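For example, with the classes sketched above, the value that survives is the one registered with the highest priority, regardless of the order of the `set` calls:

```python
loader = SettingsLoader()
loader.set("DOWNLOAD_DELAY", 5.0, priority=30)  # per-spider custom_settings
loader.set("DOWNLOAD_DELAY", 0, priority=0)     # global default, set later
loader.set("DOWNLOAD_DELAY", 2.5, priority=20)  # project settings.py

settings = loader.freeze()
settings.get("DOWNLOAD_DELAY")  # -> 5.0: the highest priority wins
```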
There are a couple of issues here:

- The `SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE` and `SCRAPY_{settings}` environment variables need to be deprecated. They can be kept, with a new or an existing priority.
- We could have different priorities for settings given with the `-s` argument and other named arguments in the command line (for example, `-s LOG_ENABLE=False --loglevel=ERROR` will set `LOG_ENABLE` to True, because named options are overridden later in the current implementation). But since the processing of command line options is done in one place, we could leave them with the same priority and depend on the order of the `set` calls just for this case.
The `scrapy.conf.settings` singleton is a deprecated way of loading and accessing settings. It could be maintained as it is, but the singleton should implement the new `SettingsReader` interface in order to keep working.
Currently, the spider manager is part of the crawler, which creates a cyclic dependency between settings and spiders, and it shouldn't belong there. Spiders should be loaded outside and passed to the crawler object, which will require a spider class to be instantiated.

This new spider manager will not have access to the settings (they won't be loaded yet), so it will use `scrapy.cfg` to configure itself.
The `scrapy.cfg` would look like this:

```ini
[settings]
default = myproject.settings

[spiders]
manager = scrapy.spidermanager.SpiderManager
modules = myproject.spiders
```
- `manager` replaces the `SPIDER_MANAGER_CLASS` setting and will, if omitted, default to `scrapy.spidermanager.SpiderManager` (a sketch of how it could be resolved follows this list)
- `modules` replaces the `SPIDER_MODULES` setting and will be required
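A possible sketch of how the manager class could be resolved from `scrapy.cfg`; the parsing details are assumptions, and only `scrapy.utils.misc.load_object` is existing Scrapy API:

```python
from ConfigParser import SafeConfigParser  # configparser on Python 3

from scrapy.utils.misc import load_object

def get_spider_manager_class_from_scrapycfg(path='scrapy.cfg'):
    parser = SafeConfigParser()
    parser.read([path])
    # The [spiders] manager entry is optional and falls back to the stock manager
    if parser.has_section('spiders') and parser.has_option('spiders', 'manager'):
        clspath = parser.get('spiders', 'manager')
    else:
        clspath = 'scrapy.spidermanager.SpiderManager'
    return load_object(clspath)
```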
These ideas translate into the following changes on the `SpiderManager` class (a sketch follows the list):

- `__init__(spider_modules)` -> `__init__()`. `spider_modules` will be looked up in `scrapy.cfg`.
- `create('spider_name', **spider_kwargs)` -> `load('spider_name')`. This will return a spider class, not an instance. It's basically a `__get__` on `self._spiders`.
- All remaining functions should be deprecated or removed accordingly, since a crawler reference is no longer needed.
- New helper `get_spider_manager_class_from_scrapycfg` in `scrapy/utils/spidermanager.py`.
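With those changes, the manager could be reduced to something like the following sketch (`walk_modules` and `iter_spider_classes` are existing Scrapy helpers; the scrapy.cfg lookup helper is an assumption):

```python
from scrapy.utils.misc import walk_modules
from scrapy.utils.spider import iter_spider_classes

class SpiderManager(object):

    def __init__(self):
        # Spider modules now come from the [spiders] section of scrapy.cfg
        # instead of the SPIDER_MODULES setting (helper name is hypothetical).
        self._spiders = {}
        for modname in get_spider_modules_from_scrapycfg():
            for module in walk_modules(modname):
                for spidercls in iter_spider_classes(module):
                    self._spiders[spidercls.name] = spidercls

    def load(self, spider_name):
        """Return the spider *class* registered under the given name."""
        try:
            return self._spiders[spider_name]
        except KeyError:
            raise KeyError("Spider not found: %s" % spider_name)
```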
A new class method, `custom_settings`, is proposed; it could be used to override project and default settings before they're used to instantiate the crawler:
```python
class MySpider(Spider):

    @classmethod
    def custom_settings(cls):
        return {
            "DOWNLOAD_DELAY": 5.0,
            "RETRY_ENABLED": False,
        }
```
This will only involve a `set` call with the corresponding priority when populating the `SettingsLoader`.
Contributing to the API changes, a new `from_crawler` class method will be added to spiders to give them a chance to access settings, stats, or the crawler core components themselves. This should be the new way to create a spider from now on (instead of instantiating it directly, as is currently done).
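A minimal sketch of what the base implementation could look like (the attributes set on the spider are assumptions):

```python
class Spider(object):

    name = None

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Build the spider and hand it the crawler, so settings, stats and
        # signals are reachable from the spider itself.
        spider = cls(*args, **kwargs)
        spider.crawler = crawler
        spider.settings = crawler.settings
        return spider
```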
As already stated, `ScrapyCommands` modify the settings, so they need a reference to a `SettingsLoader` instance in order to do that.

The present `process_options` implementations on the base and other commands read and override settings. These overrides should be changed to `set` calls with the allocated priority.
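For instance, a command's option processing could register every command-line override as an explicit `set` call at the command-line priority instead of writing to `settings.overrides` (a sketch; the option names follow the current base command, but the exact wiring is an assumption):

```python
from scrapy.command import ScrapyCommand  # renamed to scrapy.commands in later versions

class Command(ScrapyCommand):

    def process_options(self, args, opts):
        # self.settings is assumed to be a SettingsLoader at this point.
        if opts.loglevel:
            self.settings.set('LOG_LEVEL', opts.loglevel, priority=40)
        for assignment in opts.set or []:            # each -s NAME=VALUE pair
            name, _, value = assignment.partition('=')
            self.settings.set(name, value, priority=40)
```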
Each command with a custom `run` method should be modified to reflect the new refactored API (particularly the `crawl` command).
`CrawlerProcess` should be removed because the Scrapy crawl command no longer supports running multiple spiders. The preferred way of doing that is using the API directly, instantiating a separate Crawler for each spider, so `CrawlerProcess` has lost its utility.

This change is not directly related to the project (it's not focused on settings), but it fits the API clean-up task, and it's a great opportunity to consider it since we're changing the crawling startup flow anyway.
This class will be deleted and its attributes and methods will be merged into `Crawler`. To that effect, these are the specific merges and removals:

- `self.crawlers` doesn't make sense in this new setup; each reference will be replaced with `self`
- `create_crawler` will become the `__init__` of `Crawler`
- `_start_crawler` will be merged with `Crawler.start`
- `start` will be merged with `Crawler.crawl`, but the latter will need an extra boolean parameter, `start_reactor` (default: True), to crawl with or without starting the twisted reactor (this is required in `commands.shell` in order to start the reactor in another thread)
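A rough sketch of the `Crawler` surface after the merge, based on the flow below (the internal `_start_crawling` step is an assumption):

```python
from twisted.internet import reactor

class Crawler(object):

    def __init__(self, spidercls, settings):
        self.spidercls = spidercls
        self.settings = settings   # a frozen SettingsReader
        self.configure()           # was CrawlerProcess.create_crawler/_start_crawler

    def crawl(self, start_reactor=True, **spider_kwargs):
        spider = self.spidercls.from_crawler(self, **spider_kwargs)
        self._start_crawling(spider)   # engine start, scheduling, etc. (assumed)
        if start_reactor:              # commands.shell passes False and starts
            reactor.run()              # the reactor in another thread itself
```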
This summarizes the current and the newly proposed mechanisms for starting up a Scrapy crawler; the first block below shows the current flow and the second the proposed one. Imports and non-representative functions are omitted for brevity.
```python
# execute in cmdline
# loads settings.py, returns CrawlerSettings(settings_module)
settings = get_project_settings()
settings.defaults.update(cmd.default_settings)
cmd.crawler_process = CrawlerProcess(settings)
cmd.run  # (in a _run_print_help call)

# Command.run in commands/crawl.py
self.crawler_process.create_crawler()
spider = crawler.spiders.create(spider_name, **spider_kwargs)
crawler.crawl(spider)
self.crawler_process.start()  # starts crawling spider

# CrawlerProcess._start_crawler in crawler.py
crawler.configure()
```
```python
# execute in cmdline
smcls = get_spider_manager_class_from_scrapycfg()
sm = smcls()  # loads spiders from module defined in scrapy.cfg
spidercls = sm.load(spider_name)  # returns spider class, not instance
settings = get_project_settings()  # loads settings.py
settings.setdict(cmd.default_settings, priority=40)
settings.setdict(spidercls.custom_settings(), priority=30)
settings = settings.freeze()
cmd.crawler = Crawler(spidercls, settings=settings)
# Crawler.__init__ in crawler.py
self.configure()

cmd.run  # (in a _run_print_help call)

# Command.run in commands/crawl.py
self.crawler.crawl(**spider_kwargs)
# Crawler.crawl in crawler.py
spider = self.spidercls.from_crawler(self, **spider_kwargs)
# starts crawling spider
```