SEP      9
Title    Singletons removal
Author   Pablo Hoffman
Created  2009-11-14
Status   Document in progress (being written)

SEP-009 - Singletons removal

This SEP proposes a refactoring of Scrapy to get rid of singletons, which will result in a cleaner API and will allow us to implement the library API proposed in :doc:`sep-004`.

Current singletons

Scrapy 0.7 has the following singletons (the sketch after the list shows the import pattern this SEP removes):

  • Execution engine (scrapy.core.engine.scrapyengine)
  • Execution manager (scrapy.core.manager.scrapymanager)
  • Extension manager (scrapy.extension.extensions)
  • Spider manager (scrapy.spider.spiders)
  • Stats collector (scrapy.stats.stats)
  • Logging system (scrapy.log)
  • Signals system (scrapy.xlib.pydispatcher)
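
For illustration, a hedged sketch of the current pattern, using only modules named above: every component that needs settings or logging imports the singletons directly, hard-wiring itself to process-global state:

#!python
# Scrapy 0.7 style: components import the singletons they need directly
# from their modules, so all state is process-global.
from scrapy.conf import settings
from scrapy import log

def report_bot_name():
    # any code doing this cannot run against two different configurations
    # in the same process -- the root obstacle to a library API
    log.msg("bot name: %s" % settings['BOT_NAME'])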

Proposed API

The proposed architecture is to have one "root" object called Crawler (which will replace the current Execution Manager) and make all current singletons members of that object, as explained below (a minimal sketch of the resulting object follows the list):

  • crawler: scrapy.crawler.Crawler instance (replaces current scrapy.core.manager.ExecutionManager) - instantiated with a Settings object

    • crawler.settings: scrapy.conf.Settings instance (passed in the constructor)

    • crawler.extensions: scrapy.extension.ExtensionManager instance

    • crawler.engine: scrapy.core.engine.ExecutionEngine instance
      • crawler.engine.scheduler
        • crawler.engine.scheduler.middleware - to access scheduler middleware
      • crawler.engine.downloader
        • crawler.engine.downloader.middleware - to access downloader middleware
      • crawler.engine.scraper
        • crawler.engine.scraper.spidermw - to access spider middleware
    • crawler.spiders: SpiderManager instance (concrete class given in SPIDER_MANAGER_CLASS setting)

    • crawler.stats: StatsCollector instance (concrete class given in STATS_CLASS setting)

    • crawler.log: Logger class with methods replacing the current scrapy.log functions. Logging would be started (if enabled) in the Crawler constructor, so no log-starting functions are required.

      • crawler.log.msg
    • crawler.signals: signal handling
      • crawler.signals.send() - same as pydispatch.dispatcher.send()
      • crawler.signals.connect() - same as pydispatch.dispatcher.connect()
      • crawler.signals.disconnect() - same as pydispatch.dispatcher.disconnect()
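
To make the shape concrete, here is a minimal, hypothetical sketch of the proposed root object; the attribute names follow the list above, but everything else (the stub classes, the constructor body) is illustrative, not actual Scrapy code:

#!python
# Hypothetical sketch of the proposed Crawler root object.
from pydispatch import dispatcher

class SignalManager(object):
    # thin wrappers so components never import pydispatch directly
    def send(self, *args, **kwargs):
        return dispatcher.send(*args, **kwargs)

    def connect(self, *args, **kwargs):
        return dispatcher.connect(*args, **kwargs)

    def disconnect(self, *args, **kwargs):
        return dispatcher.disconnect(*args, **kwargs)

class Crawler(object):
    def __init__(self, settings):
        self.settings = settings        # scrapy.conf.Settings, passed in
        self.signals = SignalManager()
        # the remaining members would be the real components named above
        # (ExtensionManager, ExecutionEngine, the SPIDER_MANAGER_CLASS and
        # STATS_CLASS instances, and a Logger started here if enabled);
        # stubbed as None to keep this sketch self-contained
        self.extensions = None
        self.engine = None
        self.spiders = None
        self.stats = None
        self.log = None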

Required code changes after singletons removal

All components (extensions, middlewares, etc.) will receive this Crawler object in their constructors, and this will be the only mechanism for accessing any other component (as opposed to importing each singleton from its respective module). This will also serve to stabilize the core API, something we haven't documented so far (partly because of this).

So, for a typical middleware constructor, instead of this:

#!python
from scrapy.core.exceptions import NotConfigured
from scrapy.conf import settings

class SomeMiddleware(object):
    def __init__(self):
        if not settings.getbool('SOMEMIDDLEWARE_ENABLED'):
            raise NotConfigured

We'd write this:

#!python
from scrapy.core.exceptions import NotConfigured

class SomeMiddleware(object):
    def __init__(self, crawler):
        if not crawler.settings.getbool('SOMEMIDDLEWARE_ENABLED'):
            raise NotConfigured
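
On the instantiation side, whatever builds the middleware chain now has to pass the crawler in. A hedged sketch of that idea (build_middlewares and the SOMEMIDDLEWARES setting are hypothetical names, not actual Scrapy code):

#!python
from scrapy.core.exceptions import NotConfigured
from scrapy.utils.misc import load_object

def build_middlewares(crawler):
    # instantiate every configured middleware class, passing the crawler
    # instead of letting each one import singletons
    middlewares = []
    for clspath in crawler.settings.getlist('SOMEMIDDLEWARES'):
        try:
            middlewares.append(load_object(clspath)(crawler))
        except NotConfigured:
            pass  # disabled middlewares simply drop out of the chain
    return middlewares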

Running from command line

When running from the command line (the only mechanism supported so far), the scrapy.command.cmdline module will (sketched after the list):

  1. instantiate a Settings object and populate it with the values in SCRAPY_SETTINGS_MODULE, plus any per-command overrides
  2. instantiate a Crawler object with the Settings object (the Crawler instantiates all its components based on the given settings)
  3. run Crawler.crawl() with the URLs or domains passed on the command line
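
A hedged sketch of those three steps (the function and attribute names are illustrative, not the actual scrapy.command.cmdline code; Crawler is the sketch from above):

#!python
from scrapy.conf import Settings

def execute(args, opts):
    settings = Settings()            # 1. populated from SCRAPY_SETTINGS_MODULE
    settings.overrides.update(opts)  #    plus per-command overrides
    crawler = Crawler(settings)      # 2. components built from these settings
    crawler.crawl(args)              # 3. args: URLs or domains from the command line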

Using Scrapy as a library

When using Scrapy through the library API, the programmer will (sketched after the list):

  1. instantiate a Settings object (which, by default, contains only the default settings) and override the desired settings
  2. instantiate a Crawler object with the Settings object
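
The same objects, used as a library (a hedged sketch; USER_AGENT is just an example setting, and Crawler is the sketch from above):

#!python
from scrapy.conf import Settings

settings = Settings()                       # 1. defaults only, by default
settings.overrides['USER_AGENT'] = 'mybot'  #    override the desired settings
crawler = Crawler(settings)                 # 2. no globals involved, so several
                                            #    crawlers can coexist in a process
crawler.crawl(['http://example.com/'])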

Open issues to resolve

  • Should we pass the Settings object to ScrapyCommand.add_options()?

  • How should spiders access settings? (both options are sketched after this list)
    • Option 1. Pass Crawler object to spider constructors too
      • pro: one way to access all components (settings and signals being the most relevant to spiders)
      • con?: spider code can access (and control) any crawler component, and we don't want to support spiders messing with the crawler (write an extension or spider middleware if you need that)
    • Option 2. Pass the Settings object to spider constructors, to be accessed through self.settings, just as logging is accessed through self.log

      • con: would need a way to access stats too
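
For comparison, a hedged sketch of both options (the class names are illustrative, not actual Scrapy code):

#!python
class OptionOneSpider(object):
    # Option 1: the whole Crawler comes in; settings, signals and stats
    # are all reachable -- including parts spiders shouldn't touch
    def __init__(self, crawler):
        self.crawler = crawler
        self.settings = crawler.settings

class OptionTwoSpider(object):
    # Option 2: only the Settings object comes in; stats (or anything
    # else) would need a separate channel
    def __init__(self, settings):
        self.settings = settings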