=======  ====================================
SEP      9
Title    Singleton removal
Author   Pablo Hoffman
Created  2009-11-14
Status   Document in progress (being written)
=======  ====================================
This SEP proposes a refactoring of Scrapy to get rid of singletons, which will result in a cleaner API and will allow us to implement the library API proposed in :doc:`sep-004`.
Scrapy 0.7 has the following singletons:
- Execution engine (``scrapy.core.engine.scrapyengine``)
- Execution manager (``scrapy.core.manager.scrapymanager``)
- Extension manager (``scrapy.extension.extensions``)
- Spider manager (``scrapy.spider.spiders``)
- Stats collector (``scrapy.stats.stats``)
- Logging system (``scrapy.log``)
- Signals system (``scrapy.xlib.pydispatcher``)
The proposed architecture is to have one "root" object called ``Crawler``
(which will replace the current Execution Manager) and make all current
singletons members of that object, as explained below (a code sketch follows
the list):
- ``crawler``: ``scrapy.crawler.Crawler`` instance (replaces current
  ``scrapy.core.manager.ExecutionManager``)

  - instantiated with a ``Settings`` object

- ``crawler.settings``: ``scrapy.conf.Settings`` instance (passed in the
  constructor)
- ``crawler.extensions``: ``scrapy.extension.ExtensionManager`` instance
- ``crawler.engine``: ``scrapy.core.engine.ExecutionEngine`` instance

  - ``crawler.engine.scheduler``
  - ``crawler.engine.scheduler.middleware`` - to access scheduler middleware
  - ``crawler.engine.downloader``
  - ``crawler.engine.downloader.middleware`` - to access downloader middleware
  - ``crawler.engine.scraper``
  - ``crawler.engine.scraper.spidermw`` - to access spider middleware

- ``crawler.spiders``: ``SpiderManager`` instance (concrete class given in
  the ``SPIDER_MANAGER_CLASS`` setting)
- ``crawler.stats``: ``StatsCollector`` instance (concrete class given in
  the ``STATS_CLASS`` setting)
- ``crawler.log``: Logger class with methods replacing the current
  ``scrapy.log`` functions. Logging would be started (if enabled) on
  ``Crawler`` constructor, so no log starting functions are required.

  - ``crawler.log.msg``

- ``crawler.signals``: signal handling

  - ``crawler.signals.send()`` - same as ``pydispatch.dispatcher.send()``
  - ``crawler.signals.connect()`` - same as ``pydispatch.dispatcher.connect()``
  - ``crawler.signals.disconnect()`` - same as ``pydispatch.dispatcher.disconnect()``
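To make the composition concrete, here is a minimal sketch of what such a
``Crawler`` root object could look like. This is an illustration only: the
stub classes and constructor signatures are assumptions, not the proposed
implementation::

    #!python
    # Illustrative sketch: minimal stubs standing in for the real
    # components, to show how Crawler would own what are currently
    # singletons in Scrapy 0.7.

    class Settings(dict):
        """Stub for scrapy.conf.Settings."""
        def getbool(self, name, default=False):
            return bool(self.get(name, default))

    class SignalManager(object):
        """Stub wrapper around pydispatch.dispatcher."""
        def send(self, signal, **kwargs): pass
        def connect(self, receiver, signal): pass
        def disconnect(self, receiver, signal): pass

    class ExtensionManager(object):
        def __init__(self, crawler):
            self.crawler = crawler

    class ExecutionEngine(object):
        def __init__(self, crawler):
            self.crawler = crawler
            # the real engine would also expose .scheduler, .downloader
            # and .scraper, each with its middleware

    class Crawler(object):
        def __init__(self, settings):
            self.settings = settings
            self.signals = SignalManager()
            self.extensions = ExtensionManager(self)
            self.engine = ExecutionEngine(self)
            # SPIDER_MANAGER_CLASS and STATS_CLASS would be resolved from
            # settings here to build self.spiders and self.stats; logging
            # (self.log) would also be started here if enabled

        def crawl(self, *args):
            """Start crawling the given URLs or domains."""
            pass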
All components (extensions, middlewares, etc.) will receive this ``Crawler``
object in their constructors, and this will be the only mechanism for accessing
any other component (as opposed to importing each singleton from its
respective module). This will also serve to stabilize the core API, something
we haven't documented so far (partly because of this).
So, for a typical middleware constructor, instead of this::

    #!python
    from scrapy.core.exceptions import NotConfigured
    from scrapy.conf import settings

    class SomeMiddleware(object):
        def __init__(self):
            if not settings.getbool('SOMEMIDDLEWARE_ENABLED'):
                raise NotConfigured
We'd write this::

    #!python
    from scrapy.core.exceptions import NotConfigured

    class SomeMiddleware(object):
        def __init__(self, crawler):
            if not crawler.settings.getbool('SOMEMIDDLEWARE_ENABLED'):
                raise NotConfigured
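For completeness, the loading side could then build each component chain by
passing the crawler to every constructor. The ``build_middlewares`` helper
below is hypothetical, but it shows the mechanism and how ``NotConfigured``
would keep working::

    #!python
    # Hypothetical sketch: building a middleware chain by passing the
    # crawler to each constructor, skipping middlewares that raise
    # NotConfigured based on their settings.
    from scrapy.core.exceptions import NotConfigured

    def build_middlewares(crawler, middleware_classes):
        enabled = []
        for mwcls in middleware_classes:
            try:
                enabled.append(mwcls(crawler))
            except NotConfigured:
                pass  # middleware disabled itself via its settings
        return enabled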
When running from the command line (the only mechanism supported so far), the
``scrapy.command.cmdline`` module will (see the sketch below):

1. instantiate a ``Settings`` object and populate it with the values in
   ``SCRAPY_SETTINGS_MODULE``, plus per-command overrides
2. instantiate a ``Crawler`` object with the ``Settings`` object (the
   ``Crawler`` instantiates all its components based on the given settings)
3. run ``Crawler.crawl()`` with the URLs or domains passed in the command
   line
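A condensed sketch of these three steps. The ``execute()`` signature and the
dict-like override API are assumptions; only the numbered steps come from this
SEP::

    #!python
    # Illustrative sketch of the proposed scrapy.command.cmdline flow.
    from scrapy.conf import Settings    # the class crawler.settings would hold
    from scrapy.crawler import Crawler  # the proposed root object

    def execute(urls_or_domains, cmd_overrides=None):
        # 1. instantiate Settings, populated from SCRAPY_SETTINGS_MODULE,
        #    then apply per-command overrides (override API assumed here)
        settings = Settings()
        for name, value in (cmd_overrides or {}).items():
            settings[name] = value
        # 2. the Crawler instantiates all its components from these settings
        crawler = Crawler(settings)
        # 3. crawl the URLs or domains passed on the command line
        crawler.crawl(*urls_or_domains)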
When using Scrapy through the library API, the programmer will (again,
sketched below):

1. instantiate a ``Settings`` object (which contains only the default
   settings, by default) and override the desired settings
2. instantiate a ``Crawler`` object with the ``Settings`` object
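In code, that library usage could look roughly like this (the override style
is an assumption)::

    #!python
    # Illustrative sketch of the library API usage described above.
    from scrapy.conf import Settings
    from scrapy.crawler import Crawler

    settings = Settings()                 # default settings only
    settings['LOG_ENABLED'] = False       # override the desired settings
    crawler = Crawler(settings)           # builds all components
    crawler.crawl('http://example.com/')  # URLs or domains, as with cmdline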
Open issues to resolve:

- Should we pass the ``Settings`` object to ``ScrapyCommand.add_options()``?
- How should spiders access settings? (both options are sketched below)

  - Option 1. Pass the ``Crawler`` object to spider constructors too

    - pro: one way to access all components (settings and signals being the
      most relevant ones to spiders)
    - con?: spider code can access (and control) any crawler component -
      since we don't want to support spiders messing with the crawler (write
      an extension or spider middleware if you need that)

  - Option 2. Pass the ``Settings`` object to spider constructors, which
    would then be accessed through ``self.settings``, like logging which is
    accessed through ``self.log``

    - con: would need a way to access stats too
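For comparison, the two options would look like this from the spider's side
(hypothetical constructors, for illustration only)::

    #!python
    # Option 1: the spider receives the whole Crawler, giving it access
    # to every component (settings, signals, stats, engine, ...).
    class SpiderOption1(object):
        def __init__(self, crawler):
            self.crawler = crawler
            self.settings = crawler.settings

    # Option 2: the spider receives only the Settings object; anything
    # else (e.g. stats) would need a separate access channel.
    class SpiderOption2(object):
        def __init__(self, settings):
            self.settings = settings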