Merge pull request apache#238 from mistercrunch/docs
New doc entry for Pools and Connections
mistercrunch committed Jun 3, 2015
2 parents 161db60 + 6f77f66 commit 2670600
Showing 4 changed files with 117 additions and 7 deletions.
3 changes: 1 addition & 2 deletions TODO.md
@@ -1,7 +1,7 @@
TODO
-----
#### UI
* Run button / backfill wizard
* Backfill form
* Add templating to adhoc queries
* Charts: better error handling

@@ -16,7 +16,6 @@ TODO
#### Backend
* Add a run_only_latest flag to BaseOperator, runs only most recent task instance where deps are met
* Pickle all the THINGS!
* Add priority_weight(Int) to BaseOperator, +@property subtree_priority
* Distributed scheduler
* Add decorator to timeout imports on master process [lib](https://github.com/pnpnpn/timeout-decorator)
* Raise errors when setting dependencies on task in foreign DAGs
68 changes: 67 additions & 1 deletion airflow/configuration.py
@@ -30,38 +30,104 @@

DEFAULT_CONFIG = """\
[core]
# The home folder for airflow, default is ~/airflow
airflow_home = {AIRFLOW_HOME}
# The folder where your airflow pipelines live, most likely a
# subfolder in a code repository
dags_folder = {AIRFLOW_HOME}/dags
# The folder where airflow should store its log files
base_log_folder = {AIRFLOW_HOME}/logs
# The executor class that airflow should use. Choices include
# SequentialExecutor, LocalExecutor, CeleryExecutor
executor = SequentialExecutor
# The SqlAlchemy connection string to the metadata database.
# SqlAlchemy supports many different database engines; more information
# on their website
sql_alchemy_conn = sqlite:///{AIRFLOW_HOME}/airflow.db
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32
# Whether to load the examples that ship with Airflow. They're good for
# getting started, but you probably want to set this to False in a production
# environment
load_examples = True
[webserver]
# The base url of your website as airflow cannot guess what domain or
# cname you are using. This is used in automated emails that
# airflow sends to point links to the right web server
base_url = http://localhost:8080
# The ip specified when starting the web server
web_server_host = 0.0.0.0
# The port on which to run the web server
web_server_port = 8080
[smtp]
# If you want airflow to send emails on retries and failures, and you want
# to use the airflow.utils.send_email function, you have to configure an smtp
# server here
smtp_host = localhost
smtp_user = airflow
smtp_port = 25
smtp_password = airflow
smtp_mail_from = [email protected]
[celery]
# This section only applies if you are using the CeleryExecutor in
# [core] section above
# The app name that will be used by celery
celery_app_name = airflow.executors.celery_executor
# The concurrency that will be used when starting workers with the
# "airflow worker" command. This defines the number of task instances that
# a worker will take, so size up your workers based on the resources on
# your worker box and the nature of your tasks
celeryd_concurrency = 16
# When you start an airflow worker, airflow starts a tiny web server
# subprocess to serve the workers' local log files to the airflow main
# web server, which then builds pages and sends them to users. This defines
# the port on which the logs are served. It needs to be unused and
# reachable from the main web server to connect to the workers.
worker_log_server_port = 8793
# The Celery broker URL. Celery supports RabbitMQ, Redis and experimentally
# a sqlalchemy database. Refer to the Celery documentation for more
# information.
broker_url = sqla+mysql://airflow:airflow@localhost:3306/airflow
# The Celery result backend, where task state is stored
celery_result_backend = db+mysql://airflow:airflow@localhost:3306/airflow
# Celery Flower is a sweet UI for Celery. Airflow has a shortcut to start
# it `airflow flower`. This defines the port that Celery Flower runs on
flower_port = 8383
# Default queue that tasks get assigned to and that workers listen on.
default_queue = default
[scheduler]
# Task instances listen for an external kill signal (when you clear tasks
# from the CLI or the UI); this defines the frequency at which they should
# listen (in seconds).
job_heartbeat_sec = 5
scheduler_heartbeat_sec = 60
# The scheduler constantly tries to trigger new tasks (look at the
# scheduler section in the docs for more information). This defines
# how often the scheduler should run (in seconds).
scheduler_heartbeat_sec = 5
"""

TEST_CONFIG = """\
Expand Down
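
The ``DEFAULT_CONFIG`` template above appears to be what gets rendered (with
``{AIRFLOW_HOME}`` filled in) and written out as ``airflow.cfg`` on first run.
A minimal sketch of reading the resulting settings back at runtime, assuming
the ``airflow.configuration`` module exposes module-level ``get``/``getint``
helpers in this version:

    from airflow import configuration

    # Section and key names mirror the template above; values come from
    # the generated airflow.cfg (or the defaults if unchanged).
    executor = configuration.get('core', 'executor')           # e.g. SequentialExecutor
    parallelism = configuration.getint('core', 'parallelism')  # e.g. 32
    print('Executor: %s, up to %d concurrent task instances'
          % (executor, parallelism))
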
49 changes: 47 additions & 2 deletions docs/concepts.rst
@@ -4,7 +4,7 @@ Concepts
Operators
'''''''''

Operators allows to generate a certain type of task on the graph. There
Operators allow for generating a certain type of task on the graph. There
are 3 main types of operators:

- **Sensor:** Waits for events to happen; it could be a file appearing
@@ -26,7 +26,7 @@ values when calling the abstract operator. A task could be waiting for a
specific partition in Hive, or triggering a specific DML statement in
Oracle.
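
For instance, a minimal sketch of parameterizing an abstract operator into a
concrete task; the DAG name, dates and command are hypothetical, and the
import paths assume this era of the codebase::

    from datetime import datetime
    from airflow import DAG
    from airflow.operators import BashOperator

    # A hypothetical DAG to attach tasks to.
    dag = DAG('example_dag', default_args={
        'owner': 'airflow',
        'start_date': datetime(2015, 6, 1)})

    # The abstract BashOperator becomes a concrete task once it is given
    # specific values (here, the shell command to run) and a parent DAG.
    print_date = BashOperator(
        task_id='print_date',
        bash_command='date',
        dag=dag)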

Task instances
Task Instances
''''''''''''''

A task instance represents a task run for a specific point in time.
@@ -48,3 +48,48 @@ information out of pipelines, centralized in the metadata database.
Hooks are also very useful on their own to use in Python scripts,
in Airflow's airflow.operators.PythonOperator, and in interactive environments
like iPython or Jupyter Notebook.
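
For example, a minimal sketch of using a hook from an interactive session;
the hook class is just one of those available, and the ``local_mysql``
connection and table names are hypothetical::

    from airflow.hooks import MySqlHook

    # Assumes a connection with conn_id 'local_mysql' exists in the
    # metadata database (Menu -> Admin -> Connections).
    hook = MySqlHook(mysql_conn_id='local_mysql')
    records = hook.get_records('SELECT COUNT(*) FROM some_table')
    print(records)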

Pools
'''''

Some systems can get overwhelmed when too many processes hit them at the same
time. Airflow pools can be used to **limit the execution parallelism** on
arbitrary sets of tasks. The list of pools is managed in the UI
(``Menu -> Admin -> Pools``) by giving the pools a name and assigning
them a number of worker slots. Tasks can then be associated with
one of the existing pools by using the ``pool`` parameter when
creating tasks (instantiating operators).
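
A minimal sketch, assuming a pool named ``ephemeral_db_pool`` has already been
created in the UI and a ``dag`` object exists as in the operator example above::

    from airflow.operators import BashOperator

    # This task only runs when a slot is free in 'ephemeral_db_pool';
    # the pool name and command are hypothetical.
    aggregate = BashOperator(
        task_id='aggregate_db_message_job',
        bash_command='run_aggregation.sh',
        pool='ephemeral_db_pool',
        dag=dag)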

The ``pool`` parameter can
be used in conjunction with ``priority_weight`` to define priorities
in the queue, and which tasks get executed first as slots open up in the
pool. The default ``priority_weight`` is ``1``, and can be bumped to any
number. When sorting the queue to evaluate which task should be executed
next, we use the ``priority_weight``, summed up with the ``priority_weight``
of all the tasks downstream from this task. This way you can
bump a specific important task and the whole path to that task gets
prioritized accordingly.
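
As a hedged illustration of that summing, with hypothetical task names and the
same assumed ``dag`` and pool as above, bumping the final task pulls its whole
upstream path ahead in the pool's queue::

    from airflow.operators import BashOperator

    extract = BashOperator(task_id='extract', bash_command='do_extract.sh',
                           pool='ephemeral_db_pool', dag=dag)
    transform = BashOperator(task_id='transform', bash_command='do_transform.sh',
                             pool='ephemeral_db_pool', dag=dag)
    load = BashOperator(task_id='load', bash_command='do_load.sh',
                        pool='ephemeral_db_pool', priority_weight=5, dag=dag)
    extract.set_downstream(transform)
    transform.set_downstream(load)

    # Effective priorities when sorting the queue (own weight plus all
    # downstream weights): extract = 1 + 1 + 5 = 7, transform = 1 + 5 = 6,
    # load = 5, so the path leading to `load` is favored as slots open up.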

Tasks will be scheduled as usual while the slots fill up. Once capacity is
reached, runnable tasks get queued and their state will show as such in the
UI. As slots free up, queued up tasks start running based on the
``priority_weight`` (of the task and its descendants).

Note that by default tasks aren't assigned to any pool and their
execution parallelism is only limited by the executor's setting.

Connections
'''''''''''

The connection information to external systems is stored in the Airflow
metadata database and managed in the UI (``Menu -> Admin -> Connections``).
A ``conn_id`` is defined there, with hostname / login / password / schema
information attached to it. Then Airflow pipelines can simply refer
to the centrally managed ``conn_id`` without having to hard code any
of this information anywhere.
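
For instance, a hedged sketch of a task that only references a centrally
managed ``conn_id``; the connection name, table and operator choice are
hypothetical, and ``dag`` is assumed as before::

    from airflow.operators import MySqlOperator

    # 'my_reporting_db' is assumed to be defined under Menu -> Admin ->
    # Connections; no host, login or password appears in the pipeline code.
    load_stats = MySqlOperator(
        task_id='load_stats',
        mysql_conn_id='my_reporting_db',
        sql='INSERT INTO stats SELECT * FROM staging_stats',
        dag=dag)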

Many connections with the same ``conn_id`` can be defined and, when that
is the case and the **hooks** use the ``get_connection`` method
from ``BaseHook``, Airflow will choose one connection randomly, allowing
for some basic load balancing and fault tolerance when used in
conjunction with retries.
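
A rough sketch of what that resolution amounts to (not the exact ``BaseHook``
code, whose signature may differ in this version); the ``conn_id`` is
hypothetical::

    import random

    from airflow import settings
    from airflow.models import Connection

    def pick_connection(conn_id):
        # Roughly what BaseHook.get_connection does: fetch every Connection
        # row sharing this conn_id and pick one at random, so repeated calls
        # spread work across the configured hosts.
        session = settings.Session()
        candidates = session.query(Connection).filter(
            Connection.conn_id == conn_id).all()
        session.close()
        return random.choice(candidates)

    conn = pick_connection('my_reporting_db')
    print(conn.host, conn.login)
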
4 changes: 2 additions & 2 deletions docs/profiling.rst
@@ -1,7 +1,7 @@
Data Profiling
==============

Part of being a productive data ninja is about having the right weapons to
Part of being productive with data is about having the right weapons to
profile the data you are working with. Airflow provides a simple query
interface to write sql and get results quickly, and a charting application
letting you visualize data.
@@ -24,7 +24,7 @@ You can even use the same templating and macros available when writing
airflow pipelines, parameterizing your queries and modifying parameters
directly in the URL.

These charts ain't Tableau, but they're easy to create, modify and share.
These charts are basic, but they're easy to create, modify and share.

Chart Screenshot
................
