Merge pull request apache#238 from mistercrunch/docs
New doc entry for Pools and Connections
mistercrunch committed Jun 3, 2015
2 parents 161db60 + 6f77f66 commit 2670600
Showing 4 changed files with 117 additions and 7 deletions.
3 changes: 1 addition & 2 deletions TODO.md
@@ -1,7 +1,7 @@
TODO
-----
#### UI
* Run button / backfill wizard
* Backfill form
* Add templating to adhoc queries
* Charts: better error handling

@@ -16,7 +16,6 @@ TODO
#### Backend
* Add a run_only_latest flag to BaseOperator, runs only most recent task instance where deps are met
* Pickle all the THINGS!
* Add priority_weight(Int) to BaseOperator, +@property subtree_priority
* Distributed scheduler
* Add decorator to timeout imports on master process [lib](https://github.com/pnpnpn/timeout-decorator)
* Raise errors when setting dependencies on task in foreign DAGs
68 changes: 67 additions & 1 deletion airflow/configuration.py
@@ -30,38 +30,104 @@

DEFAULT_CONFIG = """\
[core]
# The home folder for airflow, default is ~/airflow
airflow_home = {AIRFLOW_HOME}
# The folder where your airflow pipelines live, most likely a
# subfolder in a code repository
dags_folder = {AIRFLOW_HOME}/dags
# The folder where airflow should store its log files
base_log_folder = {AIRFLOW_HOME}/logs
# The executor class that airflow should use. Choices include
# SequentialExecutor, LocalExecutor, CeleryExecutor
executor = SequentialExecutor
# The SqlAlchemy connection string to the metadata database.
# SqlAlchemy supports many different database engines; more information
# on their website
sql_alchemy_conn = sqlite:///{AIRFLOW_HOME}/airflow.db
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32
# Whether to load the examples that ship with Airflow. They're good for
# getting started, but you probably want to set this to False in a production
# environment
load_examples = True
[webserver]
# The base url of your website as airflow cannot guess what domain or
# cname you are using. This is used in automated emails that
# airflow sends to point links to the right web server
base_url = http://localhost:8080
# The ip specified when starting the web server
web_server_host = 0.0.0.0
# The port on which to run the web server
web_server_port = 8080
[smtp]
# If you want airflow to send emails on retries and failures, and you want
# to use the airflow.utils.send_email function, you have to configure an smtp
# server here
smtp_host = localhost
smtp_user = airflow
smtp_port = 25
smtp_password = airflow
smtp_mail_from = [email protected]
[celery]
# This section only applies if you are using the CeleryExecutor in
# [core] section above
# The app name that will be used by celery
celery_app_name = airflow.executors.celery_executor
# The concurrency that will be used when starting workers with the
# "airflow worker" command. This defines the number of task instances that
# a worker will take, so size up your workers based on the resources on
# your worker box and the nature of your tasks
celeryd_concurrency = 16
# When you start an airflow worker, airflow starts a tiny web server
# subprocess to serve the workers' local log files to the airflow main
# web server, which then builds pages and sends them to users. This defines
# the port on which the logs are served. It needs to be unused and
# reachable from the main web server to connect to the workers.
worker_log_server_port = 8793
# The Celery broker URL. Celery supports RabbitMQ, Redis and experimentally
# a sqlalchemy database. Refer to the Celery documentation for more
# information.
broker_url = sqla+mysql://airflow:airflow@localhost:3306/airflow
# The Celery result backend, where task state is stored
celery_result_backend = db+mysql://airflow:airflow@localhost:3306/airflow
# Celery Flower is a sweet UI for Celery. Airflow has a shortcut to start
# it `airflow flower`. This defines the port that Celery Flower runs on
flower_port = 8383
# Default queue that tasks get assigned to and that workers listen on.
default_queue = default
[scheduler]
# Task instances listen for an external kill signal (when you clear tasks
# from the CLI or the UI); this defines the frequency at which they should
# listen (in seconds).
job_heartbeat_sec = 5
scheduler_heartbeat_sec = 60
# The scheduler constantly tries to trigger new tasks (look at the
# scheduler section in the docs for more information). This defines
# how often the scheduler should run (in seconds).
scheduler_heartbeat_sec = 5
"""

TEST_CONFIG = """\
Expand Down
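
The ``DEFAULT_CONFIG`` template above appears to be what gets rendered (with
``{AIRFLOW_HOME}`` filled in) and written out as ``airflow.cfg`` on first run.
A minimal sketch of reading the resulting settings back at runtime, assuming
the ``airflow.configuration`` module exposes module-level ``get``/``getint``
helpers in this version:

    from airflow import configuration

    # Section and key names mirror the template above; values come from
    # the generated airflow.cfg (or the defaults if unchanged).
    executor = configuration.get('core', 'executor')           # e.g. SequentialExecutor
    parallelism = configuration.getint('core', 'parallelism')  # e.g. 32
    print('Executor: %s, up to %d concurrent task instances'
          % (executor, parallelism))
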
49 changes: 47 additions & 2 deletions docs/concepts.rst
@@ -4,7 +4,7 @@ Concepts
Operators
'''''''''

Operators allows to generate a certain type of task on the graph. There
Operators allow for generating a certain type of task on the graph. There
are 3 main types of operators:

- **Sensor:** Waits for events to happen; it could be a file appearing
@@ -26,7 +26,7 @@ values when calling the abstract operator. A task could be waiting for a
specific partition in Hive, or triggering a specific DML statement in
Oracle.
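
For instance, a minimal sketch of parameterizing an abstract operator into a
concrete task; the DAG name, dates and command are hypothetical, and the
import paths assume this era of the codebase::

    from datetime import datetime
    from airflow import DAG
    from airflow.operators import BashOperator

    # A hypothetical DAG to attach tasks to.
    dag = DAG('example_dag', default_args={
        'owner': 'airflow',
        'start_date': datetime(2015, 6, 1)})

    # The abstract BashOperator becomes a concrete task once it is given
    # specific values (here, the shell command to run) and a parent DAG.
    print_date = BashOperator(
        task_id='print_date',
        bash_command='date',
        dag=dag)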

Task instances
Task Instances
''''''''''''''

A task instance represents a task run for a specific point in time.
@@ -48,3 +48,48 @@ information out of pipelines, centralized in the metadata database.
Hooks are also very useful on their own to use in Python scripts,
in Airflow's airflow.operators.PythonOperator, and in interactive environments
like iPython or Jupyter Notebook.
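
For example, a minimal sketch of using a hook from an interactive session;
the hook class is just one of those available, and the ``local_mysql``
connection and table names are hypothetical::

    from airflow.hooks import MySqlHook

    # Assumes a connection with conn_id 'local_mysql' exists in the
    # metadata database (Menu -> Admin -> Connections).
    hook = MySqlHook(mysql_conn_id='local_mysql')
    records = hook.get_records('SELECT COUNT(*) FROM some_table')
    print(records)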

Pools
'''''

Some systems can get overwhelmed when too many processes hit them at the same
time. Airflow pools can be used to **limit the execution parallelism** on
arbitrary sets of tasks. The list of pools is managed in the UI
(``Menu -> Admin -> Pools``) by giving the pools a name and assigning
them a number of worker slots. Tasks can then be associated with
one of the existing pools by using the ``pool`` parameter when
creating tasks (instantiating operators).
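
A minimal sketch, assuming a pool named ``ephemeral_db_pool`` has already been
created in the UI and a ``dag`` object exists as in the operator example above::

    from airflow.operators import BashOperator

    # This task only runs when a slot is free in 'ephemeral_db_pool';
    # the pool name and command are hypothetical.
    aggregate = BashOperator(
        task_id='aggregate_db_message_job',
        bash_command='run_aggregation.sh',
        pool='ephemeral_db_pool',
        dag=dag)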

The ``pool`` parameter can
be used in conjunction with ``priority_weight`` to define priorities
in the queue, and which tasks get executed first as slots open up in the
pool. The default ``priority_weight`` is ``1``, and can be bumped to any
number. When sorting the queue to evaluate which task should be executed
next, we use the ``priority_weight``, summed up with the ``priority_weight``
of all the tasks downstream from this task. This way you can
bump a specific important task and the whole path to that task gets
prioritized accordingly.
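
As a hedged illustration of that summing, with hypothetical task names and the
same assumed ``dag`` and pool as above, bumping the final task pulls its whole
upstream path ahead in the pool's queue::

    from airflow.operators import BashOperator

    extract = BashOperator(task_id='extract', bash_command='do_extract.sh',
                           pool='ephemeral_db_pool', dag=dag)
    transform = BashOperator(task_id='transform', bash_command='do_transform.sh',
                             pool='ephemeral_db_pool', dag=dag)
    load = BashOperator(task_id='load', bash_command='do_load.sh',
                        pool='ephemeral_db_pool', priority_weight=5, dag=dag)
    extract.set_downstream(transform)
    transform.set_downstream(load)

    # Effective priorities when sorting the queue (own weight plus all
    # downstream weights): extract = 1 + 1 + 5 = 7, transform = 1 + 5 = 6,
    # load = 5, so the path leading to `load` is favored as slots open up.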

Tasks will be scheduled as usual while the slots fill up. Once capacity is
reached, runnable tasks get queued and their state will show as such in the
UI. As slots free up, queued up tasks start running based on the
``priority_weight`` (of the task and its descendants).

Note that by default tasks aren't assigned to any pool and their
execution parallelism is only limited by the executor's setting.

Connections
'''''''''''

The connection information to external systems is stored in the Airflow
metadata database and managed in the UI (``Menu -> Admin -> Connections``).
A ``conn_id`` is defined there, with hostname / login / password / schema
information attached to it. Then Airflow pipelines can simply refer
to the centrally managed ``conn_id`` without having to hard code any
of this information anywhere.
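
For instance, a hedged sketch of a task that only references a centrally
managed ``conn_id``; the connection name, table and operator choice are
hypothetical, and ``dag`` is assumed as before::

    from airflow.operators import MySqlOperator

    # 'my_reporting_db' is assumed to be defined under Menu -> Admin ->
    # Connections; no host, login or password appears in the pipeline code.
    load_stats = MySqlOperator(
        task_id='load_stats',
        mysql_conn_id='my_reporting_db',
        sql='INSERT INTO stats SELECT * FROM staging_stats',
        dag=dag)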

Many connections with the same ``conn_id`` can be defined and, when that
is the case and the **hooks** use the ``get_connection`` method
from ``BaseHook``, Airflow will choose one connection randomly, allowing
for some basic load balancing and fault tolerance when used in
conjunction with retries.
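
A rough sketch of what that resolution amounts to (not the exact ``BaseHook``
code, whose signature may differ in this version); the ``conn_id`` is
hypothetical::

    import random

    from airflow import settings
    from airflow.models import Connection

    def pick_connection(conn_id):
        # Roughly what BaseHook.get_connection does: fetch every Connection
        # row sharing this conn_id and pick one at random, so repeated calls
        # spread work across the configured hosts.
        session = settings.Session()
        candidates = session.query(Connection).filter(
            Connection.conn_id == conn_id).all()
        session.close()
        return random.choice(candidates)

    conn = pick_connection('my_reporting_db')
    print(conn.host, conn.login)
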
4 changes: 2 additions & 2 deletions docs/profiling.rst
@@ -1,7 +1,7 @@
Data Profiling
==============

Part of being a productive data ninja is about having the right weapons to
Part of being productive with data is about having the right weapons to
profile the data you are working with. Airflow provides a simple query
interface to write sql and get results quickly, and a charting application
letting you visualize data.
@@ -24,7 +24,7 @@ You can even use the same templating and macros available when writing
airflow pipelines, parameterizing your queries and modifying parameters
directly in the URL.

These charts ain't Tableau, but they're easy to create, modify and share.
These charts are basic, but they're easy to create, modify and share.

Chart Screenshot
................
