How To Work With Apache Airflow
Parameters
• bucket_name (string) – The name of the bucket.
• storage_class (string) – This defines how objects in the bucket are stored and
determines the SLA and the cost of storage. Values include
– MULTI_REGIONAL
– REGIONAL
– STANDARD
– NEARLINE
– COLDLINE.
If this value is not specified when the bucket is created, it will default to STANDARD.
• location (string) – The location of the bucket. Object data for objects in the bucket
resides in physical storage within this region. Defaults to US.
See also:
https://developers.google.com/storage/docs/bucket-locations
• project_id (string) – The ID of the GCP Project.
• labels (dict) – User-provided labels, in key/value pairs.
Returns If successful, it returns the id of the bucket.
exists(bucket, object)
Checks for the existence of a file in Google Cloud Storage.
Parameters
• bucket (string) – The Google cloud storage bucket where the object is.
• object (string) – The name of the object to check in the Google cloud storage bucket.
get_conn()
Returns a Google Cloud Storage service object.
get_crc32c(bucket, object)
Gets the CRC32c checksum of an object in Google Cloud Storage.
Parameters
• bucket (string) – The Google cloud storage bucket where the object is.
• object (string) – The name of the object to check in the Google cloud storage bucket.
get_md5hash(bucket, object)
Gets the MD5 hash of an object in Google Cloud Storage.
Parameters
• bucket (string) – The Google cloud storage bucket where the object is.
• object (string) – The name of the object to check in the Google cloud storage bucket.
get_size(bucket, object)
Gets the size of a file in Google Cloud Storage.
Parameters
• bucket (string) – The Google cloud storage bucket where the object is.
• object (string) – The name of the object to check in the Google cloud storage bucket.
insert_bucket_acl(bucket, entity, role, user_project)
Creates a new ACL entry on the specified bucket. See: https://cloud.google.com/storage/docs/json_api/v1/bucketAccessControls/insert
Parameters
• bucket (str) – Name of a bucket.
• entity (str) – The entity holding the permission, in one of the following forms:
user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId,
allUsers, allAuthenticatedUsers. See: https://cloud.google.com/storage/docs/access-control/lists#scopes
• role (str) – The access permission for the entity. Acceptable values are: “OWNER”,
“READER”, “WRITER”.
• user_project (str) – (Optional) The project to be billed for this request. Required
for Requester Pays buckets.
insert_object_acl(bucket, object_name, entity, role, generation, user_project)
Creates a new ACL entry on the specified object. See: https://cloud.google.com/storage/docs/json_api/v1/objectAccessControls/insert
Parameters
• bucket (str) – Name of a bucket.
• object_name (str) – Name of the object. For information about how to URL encode object
names to be path safe, see: https://cloud.google.com/storage/docs/json_api/#encoding
• entity (str) – The entity holding the permission, in one of the following forms:
user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId,
allUsers, allAuthenticatedUsers. See: https://cloud.google.com/storage/docs/access-control/lists#scopes
• role (str) – The access permission for the entity. Acceptable values are: “OWNER”,
“READER”.
• generation (str) – (Optional) If present, selects a specific revision of this object (as
opposed to the latest version, the default).
• user_project (str) – (Optional) The project to be billed for this request. Required
for Requester Pays buckets.
is_updated_after(bucket, object, ts)
Checks if an object is updated in Google Cloud Storage.
Parameters
• bucket (string) – The Google cloud storage bucket where the object is.
• object (string) – The name of the object to check in the Google cloud storage bucket.
• ts (datetime) – The timestamp to check against.
list(bucket, versions=None, maxResults=None, prefix=None, delimiter=None)
List all objects from the bucket with the given string prefix in the object name.
Parameters
• bucket (string) – bucket name
• versions (boolean) – if true, list all versions of the objects
• maxResults (integer) – max count of items to return in a single page of responses
• prefix (string) – prefix string which filters objects whose names begin with this prefix
• delimiter (string) – filters objects based on the delimiter (e.g. '.csv')
Returns a stream of object names matching the filtering criteria
rewrite(source_bucket, source_object, destination_bucket, destination_object=None)
Has the same functionality as copy, except that it will work on files over 5 TB, as well as when copying
between locations and/or storage classes.
destination_object can be omitted, in which case source_object is used.
Parameters
• source_bucket (string) – The bucket of the object to copy from.
• source_object (string) – The object to copy.
• destination_bucket (string) – The destination bucket the object is copied to.
• destination_object – The (renamed) path of the object if given. Can be omitted;
then the same name is used.
upload(bucket, object, filename, mime_type='application/octet-stream', gzip=False, multipart=False,
num_retries=0)
Uploads a local file to Google Cloud Storage.
Parameters
• bucket (string) – The bucket to upload to.
• object (string) – The object name to set when uploading the local file.
• filename (string) – The local file path to the file to be uploaded.
• mime_type (str) – The MIME type to set when uploading the file.
• gzip (bool) – Option to compress file for upload
• multipart (bool or int) – If True, the upload will be split into multiple HTTP
requests. The default size is 256MiB per request. Pass a number instead of True to specify
the request size, which must be a multiple of 262144 (256KiB).
• num_retries (int) – The number of times to attempt to re-upload the file (or individual
chunks, in the case of multipart uploads). Retries are attempted with exponential backoff.
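For example, a minimal usage sketch combining the methods above (this assumes the contrib hook class airflow.contrib.hooks.gcs_hook.GoogleCloudStorageHook, a configured google_cloud_default connection, and illustrative bucket and object names):

from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook

hook = GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default')

# Upload a local file, then verify it exists and inspect its size and checksum.
hook.upload(bucket='my-bucket', object='data/report.csv',
            filename='/tmp/report.csv', mime_type='text/csv')
if hook.exists(bucket='my-bucket', object='data/report.csv'):
    print(hook.get_size(bucket='my-bucket', object='data/report.csv'))
    print(hook.get_crc32c(bucket='my-bucket', object='data/report.csv'))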
GCPTransferServiceHook
class airflow.contrib.hooks.gcp_transfer_hook.GCPTransferServiceHook(api_version='v1',
gcp_conn_id='google_cloud_default', delegate_to=None)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Hook for GCP Storage Transfer Service.
get_conn()
Retrieves connection to Google Storage Transfer service.
Returns Google Storage Transfer service object
Return type dict
GKEClusterCreateOperator
GKEClusterDeleteOperator
GKEPodOperator
3.16.6 Qubole
Apache Airflow has a native operator and hooks to talk to Qubole, which lets you submit your big data jobs directly
to Qubole from Apache Airflow.
3.16.6.1 QuboleOperator
3.16.6.2 QubolePartitionSensor
3.16.6.3 QuboleFileSensor
3.16.6.4 QuboleCheckOperator
3.16.6.5 QuboleValueCheckOperator
3.17 Metrics
3.17.1 Configuration
[scheduler]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
3.17.2 Counters
Name – Description
<job_name>_start – Number of started <job_name> jobs, e.g. SchedulerJob, LocalTaskJob
<job_name>_end – Number of ended <job_name> jobs, e.g. SchedulerJob, LocalTaskJob
operator_failures_<operator_name> – Operator <operator_name> failures
operator_successes_<operator_name> – Operator <operator_name> successes
ti_failures – Overall task instance failures
ti_successes – Overall task instance successes
zombies_killed – Zombie tasks killed
scheduler_heartbeat – Scheduler heartbeats
3.17.3 Gauges
Name – Description
collect_dags – Seconds taken to scan and import DAGs
dagbag_import_errors – DAG import errors
dagbag_size – DAG bag size
3.17.4 Timers
Name – Description
dagrun.dependency-check.<dag_id> – Seconds taken to check DAG dependencies
3.18 Kubernetes
The Kubernetes executor was introduced in Apache Airflow 1.10.0. The Kubernetes executor will create a new pod for
every task instance.
Example helm charts are available at scripts/ci/kubernetes/kube/{airflow,volumes,postgres}.yaml in the source
distribution. The volumes are optional and depend on your configuration. There are two volumes available:
• Dags: by storing all the DAGs on a persistent disk, all the workers can read the DAGs from there. Another
option is using git-sync: before starting the container, a git pull of the DAGs repository is performed and that
checkout is used throughout the lifecycle of the pod.
• Logs: by storing the logs on a persistent disk, all the logs will be available to all the workers and the webserver
itself. If you don't configure this, the logs will be lost after the worker pods shut down. Another option is to
use S3/GCS/etc. to store the logs.
from airflow.contrib.kubernetes.volume import Volume

volume_config = {
    'persistentVolumeClaim': {
        'claimName': 'test-volume'
    }
}
volume = Volume(name='test-volume', configs=volume_config)
affinity = {
'nodeAffinity': {
'preferredDuringSchedulingIgnoredDuringExecution': [
{
    "weight": 1,
    "preference": {
        "matchExpressions": [{
            "key": "disktype",
            "operator": "In",
            "values": ["ssd"]
        }]
    }
}
]
},
"podAffinity": {
"requiredDuringSchedulingIgnoredDuringExecution": [
{
"labelSelector": {
"matchExpressions": [
{
"key": "security",
"operator": "In",
"values": ["S1"]
}
]
},
"topologyKey": "failure-domain.beta.kubernetes.io/zone"
}
]
},
"podAntiAffinity": {
"requiredDuringSchedulingIgnoredDuringExecution": [
{
"labelSelector": {
"matchExpressions": [
{
"key": "security",
"operator": "In",
"values": ["S2"]
}
]
},
"topologyKey": "kubernetes.io/hostname"
}
]
}
}
tolerations = [
{
'key': "key",
'operator': 'Equal',
'value': 'value'
}
]
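The operator call below also references secret_file, secret_env and volume_mount, whose definitions are not included in this excerpt. A minimal sketch of how they might be created, assuming the airflow.contrib.kubernetes Secret and VolumeMount classes and illustrative secret and mount names:

from airflow.contrib.kubernetes.secret import Secret
from airflow.contrib.kubernetes.volume_mount import VolumeMount

# Expose one Kubernetes secret as a mounted file and another as an environment variable.
secret_file = Secret('volume', '/etc/sql_conn', 'airflow-secrets', 'sql_alchemy_conn')
secret_env = Secret('env', 'SQL_CONN', 'airflow-secrets', 'sql_alchemy_conn')

# Mount the persistent volume declared above into the pod.
volume_mount = VolumeMount('test-volume', mount_path='/root/mount_file',
                           sub_path=None, read_only=True)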
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

k = KubernetesPodOperator(namespace='default',
                          image="ubuntu:16.04",
                          cmds=["bash", "-cx"],
                          arguments=["echo", "10"],
                          labels={"foo": "bar"},
                          secrets=[secret_file, secret_env],
                          volumes=[volume],
                          volume_mounts=[volume_mount],
                          name="test",
                          task_id="task",
                          affinity=affinity,
                          is_delete_operator_pod=True,
                          hostnetwork=False,
                          tolerations=tolerations
                          )
3.19 Lineage
Airflow can help track origins of data, what happens to it and where it moves over time. This can aid in building audit
trails and data governance, and it also helps with debugging data flows.
Airflow tracks data by means of inlets and outlets of the tasks. Let’s work from an example and see how it works.
import airflow
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.lineage.datasets import File
from airflow.models import DAG
from datetime import timedelta
args = {
'owner': 'airflow',
'start_date': airflow.utils.dates.days_ago(2)
}
dag = DAG(
dag_id='example_lineage', default_args=args,
schedule_interval='0 0 * * *',
dagrun_timeout=timedelta(minutes=60))
f_final = File("/tmp/final")
run_this_last = DummyOperator(task_id='run_this_last', dag=dag,
inlets={"auto": True},
outlets={"datasets": [f_final,]})
f_in = File("/tmp/whole_directory/")
Tasks take the parameters inlets and outlets. Inlets can be defined manually as a list of datasets ({"datasets": [dataset1,
dataset2]}), can be configured to look for outlets from upstream tasks ({"task_ids": ["task_id1", "task_id2"]}), can
be configured to pick up outlets from direct upstream tasks ({"auto": True}), or a combination of them. Outlets are
defined as a list of datasets ({"datasets": [dataset1, dataset2]}). Any fields of the datasets are templated with the context
when the task is being executed.
Note: Operators can add inlets and outlets automatically if the operator supports it.
In the example DAG, the task run_me_first is a BashOperator that takes three inlets (CAT1, CAT2, CAT3) generated
from a list; its definition is sketched below. Note that execution_date is a templated field and will be rendered when the task is running.
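The code excerpt above ends before run_me_first is defined. A minimal sketch of how the remainder of the example might look (the file paths and bash_command are illustrative assumptions):

outlets = []
for category in ["CAT1", "CAT2", "CAT3"]:
    # execution_date is a templated field and is rendered at run time.
    f_out = File("/tmp/{}/{{{{ execution_date }}}}".format(category))
    outlets.append(f_out)

run_me_first = BashOperator(task_id='run_me_first', dag=dag,
                            bash_command='echo 1',
                            inlets={"datasets": [f_in]},
                            outlets={"datasets": outlets})
run_me_first.set_downstream(run_this_last)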
Note: Behind the scenes Airflow prepares the lineage metadata as part of the pre_execute method of a task. When
the task has finished execution, post_execute is called and lineage metadata is pushed into XCom. Thus if you are
creating your own operators that override these methods, make sure to decorate them with prepare_lineage and
apply_lineage respectively.
Airflow can send its lineage metadata to Apache Atlas. You need to enable the atlas backend and configure it properly,
e.g. in your airflow.cfg:
[lineage]
backend = airflow.lineage.backend.atlas
[atlas]
username = my_username
password = my_password
host = host
port = 21000
3.20 Changelog
3.20.1.2 Improvements
• [AIRFLOW-3191] Fix not being able to specify execution_date when creating dagrun (#4037)
• [AIRFLOW-3657] Fix zendesk integration (#4466)
• [AIRFLOW-3605] Load plugins from entry_points (#4412)
• [AIRFLOW-3646] Rename plugins_manager.py to test_xx to trigger tests (#4464)
• [AIRFLOW-3655] Escape links generated in model views (#4463)
• [AIRFLOW-3662] Add dependency for Enum (#4468)
• [AIRFLOW-3630] Cleanup of GCP Cloud SQL Connection (#4451)
• [AIRFLOW-1837] Respect task start_date when different from dag’s (#4010)
• [AIRFLOW-2829] Brush up the CI script for minikube
• [AIRFLOW-3519] Fix example http operator (#4455)
• [AIRFLOW-2811] Fix scheduler_ops_metrics.py to work (#3653)
• [AIRFLOW-2751] add job properties update in hive to druid operator.
• [AIRFLOW-2918] Remove unused imports
• [AIRFLOW-2918] Fix Flake8 violations (#3931)
• [AIRFLOW-2771] Add except type to broad S3Hook try catch clauses
• [AIRFLOW-2918] Fix Flake8 violations (#3772)
• [AIRFLOW-2099] Handle getsource() calls gracefully
• [AIRFLOW-3397] Fix integrety error in rbac AirflowSecurityManager (#4305)
• [AIRFLOW-3281] Fix Kubernetes operator with git-sync (#3770)
• [AIRFLOW-2615] Limit DAGs parsing to once only
• [AIRFLOW-2952] Fix Kubernetes CI (#3922)
• [AIRFLOW-2933] Enable Codecov on Docker-CI Build (#3780)
• [AIRFLOW-2082] Resolve a bug in adding password_auth to api as auth method (#4343)
• [AIRFLOW-3612] Remove incubation/incubator mention (#4419)
• [AIRFLOW-3581] Fix next_ds/prev_ds semantics for manual runs (#4385)
• [AIRFLOW-3527] Update Cloud SQL Proxy to have shorter path for UNIX socket (#4350)
• [AIRFLOW-3316] For gcs_to_bq: add missing init of schema_fields var (#4430)
• [AIRFLOW-3583] Fix AirflowException import (#4389)
• [AIRFLOW-3578] Fix Type Error for BigQueryOperator (#4384)
• [AIRFLOW-2755] Added kubernetes.worker_dags_folder configuration (#3612)
• [AIRFLOW-2655] Fix inconsistency of default config of kubernetes worker
• [AIRFLOW-2645][AIRFLOW-2617] Add worker_container_image_pull_policy
• [AIRFLOW-2661] fix config dags_volume_subpath and logs_volume_subpath
• [AIRFLOW-3550] Standardize GKE hook (#4364)
• [AIRFLOW-2863] Fix GKEClusterHook catching wrong exception (#3711)
• [AIRFLOW-3271] Fix issue with persistence of RBAC Permissions modified via UI (#4118)
• [AIRFLOW-3141] Handle duration View for missing dag (#3984)
• [AIRFLOW-2766] Respect shared datetime across tabs
• [AIRFLOW-1413] Fix FTPSensor failing on error message with unexpected (#2450)
• [AIRFLOW-3378] KubernetesPodOperator does not delete on timeout failure (#4218)
• [AIRFLOW-3245] Fix list processing in resolve_template_files (#4086)
• [AIRFLOW-2703] Catch transient DB exceptions from scheduler’s heartbeat it does not crash (#3650)
• [AIRFLOW-1298] Clear UPSTREAM_FAILED using the clean cli (#3886)
3.20.2.2 Improvements
• [AIRFLOW-839] docker_operator.py attempts to log status key without first checking existence
• [AIRFLOW-1104] Concurrency check in scheduler should count queued tasks as well as running
• [AIRFLOW-1163] Add support for x-forwarded-* headers to support access behind AWS ELB
• [AIRFLOW-1195] Cleared tasks in SubDagOperator do not trigger Parent dag_runs
• [AIRFLOW-1508] Skipped state not part of State.task_states
• [AIRFLOW-1762] Use key_file in SSHHook.create_tunnel()
• [AIRFLOW-1837] Differing start_dates on tasks not respected by scheduler.
• [AIRFLOW-1874] Support standard SQL in Check, ValueCheck and IntervalCheck BigQuery operators
• [AIRFLOW-1917] print() from python operators end up with extra new line
• [AIRFLOW-1970] Database cannot be initialized if an invalid fernet key is provided
• [AIRFLOW-2145] Deadlock after clearing a running task
• [AIRFLOW-2216] Cannot specify a profile for AWS Hook to load with s3 config file
• [AIRFLOW-2574] initdb fails when mysql password contains percent sign
• [AIRFLOW-2707] Error accessing log files from web UI
• [AIRFLOW-2716] Replace new Python 3.7 keywords
• [AIRFLOW-2744] RBAC app doesn’t integrate plugins (blueprints etc)
• [AIRFLOW-2772] BigQuery hook does not allow specifying both the partition field name and table name at the
same time
• [AIRFLOW-2778] Bad Import in collect_dag in DagBag
• [AIRFLOW-2786] Variables view fails to render if a variable has an empty key
• [AIRFLOW-2799] Filtering UI objects by datetime is broken
• [AIRFLOW-2800] Remove airflow/ low-hanging linting errors
• [AIRFLOW-2825] S3ToHiveTransfer operator may not may able to handle GZIP file with uppercase ext in S3
• [AIRFLOW-2848] dag_id is missing in metadata table “job” for LocalTaskJob
• [AIRFLOW-2860] DruidHook: time variable is not updated correctly when checking for timeout
• [AIRFLOW-2865] Race condition between on_success_callback and LocalTaskJob’s cleanup
• [AIRFLOW-2893] Stuck dataflow job due to jobName mismatch.
• [AIRFLOW-2895] Prevent scheduler from spamming heartbeats/logs
• [AIRFLOW-2900] Code not visible for Packaged DAGs
• [AIRFLOW-2905] Switch to regional dataflow job service.
• [AIRFLOW-2907] Sendgrid - Attachments - ERROR - Object of type ‘bytes’ is not JSON serializable
• [AIRFLOW-2938] Invalid ‘extra’ field in connection can raise an AttributeError when attempting to edit
• [AIRFLOW-2979] Deprecated Celery Option not in Options list
• [AIRFLOW-2981] TypeError in dataflow operators when using GCS jar or py_file
• [AIRFLOW-2984] Cannot convert naive_datetime when task has a naive start_date/end_date
• [AIRFLOW-2994] flatten_results in BigQueryOperator/BigQueryHook should default to None
• [AIRFLOW-3002] ValueError in dataflow operators when using GCS jar or py_file
• [AIRFLOW-3012] Email on sla miss is send only to first address on the list
• [AIRFLOW-3046] ECS Operator mistakenly reports success when task is killed due to EC2 host termination
• [AIRFLOW-3064] No output from airflow test due to default logging config
• [AIRFLOW-3072] Only admin can view logs in RBAC UI
• [AIRFLOW-3079] Improve initdb to support MSSQL Server
• [AIRFLOW-3089] Google auth doesn’t work under http
• [AIRFLOW-3099] Errors raised when some blocs are missing in airflow.cfg
• [AIRFLOW-3109] Default user permission should contain ‘can_clear’
• [AIRFLOW-3111] Confusing comments and instructions for log templates in UPDATING.md and default_airflow.cfg
• [AIRFLOW-3124] Broken webserver debug mode (RBAC)
• [AIRFLOW-3136] Scheduler Failing the Task retries run while processing Executor Events
• [AIRFLOW-3138] Migration cc1e65623dc7 creates issues with postgres
• [AIRFLOW-3161] Log Url link does not link to task instance logs in RBAC UI
• [AIRFLOW-3162] HttpHook fails to parse URL when port is specified
• [AIRFLOW-3183] Potential Bug in utils/dag_processing/DagFileProcessorManager.max_runs_reached()
• [AIRFLOW-3203] Bugs in DockerOperator & Some operator test scripts were named incorrectly
• [AIRFLOW-3238] Dags, removed from the filesystem, are not deactivated on initdb
• [AIRFLOW-3268] Cannot pass SSL dictionary to mysql connection via URL
• [AIRFLOW-3277] Invalid timezone transition handling for cron schedules
• [AIRFLOW-3295] Require encryption in DaskExecutor when certificates are configured.
• [AIRFLOW-2611] Fix wrong dag volume mount path for kubernetes executor
• [AIRFLOW-2562] Add Google Kubernetes Engine Operators
• [AIRFLOW-2630] Fix classname in test_sql_sensor.py
• [AIRFLOW-2534] Fix bug in HiveServer2Hook
• [AIRFLOW-2586] Stop getting AIRFLOW_HOME value from config file in bash operator
• [AIRFLOW-2605] Fix autocommit for MySqlHook
• [AIRFLOW-2539][AIRFLOW-2359] Move remaing log config to configuration file
• [AIRFLOW-1656] Tree view dags query changed
• [AIRFLOW-2617] add imagePullPolicy config for kubernetes executor
• [AIRFLOW-2429] Fix security/task/sensors/ti_deps folders flake8 error
• [AIRFLOW-2550] Implements API endpoint to list DAG runs
• [AIRFLOW-2512][AIRFLOW-2522] Use google-auth instead of oauth2client
• [AIRFLOW-2429] Fix operators folder flake8 error
• [AIRFLOW-2585] Fix several bugs in CassandraHook and CassandraToGCSOperator
• [AIRFLOW-2597] Restore original dbapi.run() behavior
• [AIRFLOW-2590] Fix commit in DbApiHook.run() for no-autocommit DB
• [AIRFLOW-1115] fix github oauth api URL
• [AIRFLOW-2587] Add TIMESTAMP type mapping to MySqlToHiveTransfer
• [AIRFLOW-2591][AIRFLOW-2581] Set default value of autocommit to False in DbApiHook.run()
• [AIRFLOW-59] Implement bulk_dump and bulk_load for the Postgres hook
• [AIRFLOW-2533] Fix path to DAG’s on kubernetes executor workers
• [AIRFLOW-2581] Fix DbApiHook autocommit
• [AIRFLOW-2578] Add option to use proxies in JiraHook
• [AIRFLOW-2575] Make gcs to gcs operator work with large files
• [AIRFLOW-437] Send TI context in kill zombies
• [AIRFLOW-2566] Change backfill to rerun failed tasks
• [AIRFLOW-1021] Fix double login for new users with LDAP
• [AIRFLOW-XXX] Typo fix
• [AIRFLOW-2561] Fix typo in EmailOperator
• [AIRFLOW-2573] Cast BigQuery TIMESTAMP field to float
• [AIRFLOW-2560] Adding support for internalIpOnly to DataprocClusterCreateOperator
• [AIRFLOW-2565] templatize cluster_label
• [AIRFLOW-83] add mongo hook and operator
• [AIRFLOW-2558] Clear task/dag is clearing all executions
• [AIRFLOW-XXX] Fix doc typos
• [AIRFLOW-2513] Change bql to sql for BigQuery Hooks & Ops
• [AIRFLOW-1575] Add AWS Kinesis Firehose Hook for inserting batch records
• [AIRFLOW-2266][AIRFLOW-2343] Remove google-cloud-dataflow dependency
• [AIRFLOW-2370] Implement –use_random_password in create_user
• [AIRFLOW-2348] Strip path prefix from the destination_object when source_object contains a wildcard[]
• [AIRFLOW-2391] Fix to Flask 0.12.2
• [AIRFLOW-2381] Fix the flaky ApiPasswordTests test
• [AIRFLOW-2378] Add Groupon to list of current users
• [AIRFLOW-2382] Fix wrong description for delimiter
• [AIRFLOW-2380] Add support for environment variables in Spark submit operator.
• [AIRFLOW-2377] Improve Sendgrid sender support
• [AIRFLOW-2331] Support init action timeout on dataproc cluster create
• [AIRFLOW-1835] Update docs: Variable file is json
• [AIRFLOW-1781] Make search case-insensitive in LDAP group
• [AIRFLOW-2042] Fix browser menu appearing over the autocomplete menu
• [AIRFLOW-XXX] Remove wheelhouse files from travis not owned by travis
• [AIRFLOW-2336] Use hmsclient in hive_hook
• [AIRFLOW-2041] Correct Syntax in python examples
• [AIRFLOW-74] SubdagOperators can consume all celeryd worker processes
• [AIRFLOW-2369] Fix gcs tests
• [AIRFLOW-2365] Fix autocommit attribute check
• [AIRFLOW-2068] MesosExecutor allows optional Docker image
• [AIRFLOW-1652] Push DatabricksRunSubmitOperator metadata into XCOM
• [AIRFLOW-2234] Enable insert_rows for PrestoHook
• [AIRFLOW-2208][Airflow-22208] Link to same DagRun graph from TaskInstance view
• [AIRFLOW-1153] Allow HiveOperators to take hiveconfs
• [AIRFLOW-775] Fix autocommit settings with Jdbc hook
• [AIRFLOW-2364] Warn when setting autocommit on a connection which does not support it
• [AIRFLOW-2357] Add persistent volume for the logs
• [AIRFLOW-766] Skip conn.commit() when in Auto-commit
• [AIRFLOW-2351] Check for valid default_args start_date
• [AIRFLOW-1433] Set default rbac to initdb
• [AIRFLOW-2270] Handle removed tasks in backfill
• [AIRFLOW-2344] Fix connections -l to work with pipe/redirect
• [AIRFLOW-2300] Add S3 Select functionarity to S3ToHiveTransfer
• [AIRFLOW-1314] Cleanup the config
• [AIRFLOW-1314] Polish some of the Kubernetes docs/config
• [AIRFLOW-2113] Address missing DagRun callbacks Given that the handle_callback method belongs to the
DAG object, we are able to get the list of task directly with get_task and reduce the communication with the
database, making airflow more lightweight.
• [AIRFLOW-2112] Fix svg width for Recent Tasks on UI.
• [AIRFLOW-2116] Set CI Cloudant version to <2.0
• [AIRFLOW-XXX] Add PMC to list of companies using Airflow
• [AIRFLOW-2100] Fix Broken Documentation Links
• [AIRFLOW-1404] Add ‘flatten_results’ & ‘maximum_bytes_billed’ to BQ Operator
• [AIRFLOW-800] Initialize valid Google BigQuery Connection
• [AIRFLOW-1319] Fix misleading SparkSubmitOperator and SparkSubmitHook docstring
• [AIRFLOW-1983] Parse environment parameter as template
• [AIRFLOW-2095] Add operator to create External BigQuery Table
• [AIRFLOW-2085] Add SparkJdbc operator
• [AIRFLOW-1002] Add ability to clean all dependencies of removed DAG
• [AIRFLOW-2094] Jinjafied project_id, region & zone in DataProc{*} Operators
• [AIRFLOW-2092] Fixed incorrect parameter in docstring for FTPHook
• [AIRFLOW-XXX] Add SocialCops to Airflow users
• [AIRFLOW-2088] Fix duplicate keys in MySQL to GCS Helper function
• [AIRFLOW-2091] Fix incorrect docstring parameter in BigQuery Hook
• [AIRFLOW-2090] Fix typo in DataStore Hook
• [AIRFLOW-1157] Fix missing pools crashing the scheduler
• [AIRFLOW-713] Jinjafy {EmrCreateJobFlow,EmrAddSteps}Operator attributes
• [AIRFLOW-2083] Docs: Use “its” instead of “it’s” where appropriate
• [AIRFLOW-2066] Add operator to create empty BQ table
• [AIRFLOW-XXX] add Karmic to list of companies
• [AIRFLOW-2073] Make FileSensor fail when the file doesn’t exist
• [AIRFLOW-2078] Improve task_stats and dag_stats performance
• [AIRFLOW-2080] Use a log-out icon instead of a power button
• [AIRFLOW-2077] Fetch all pages of list_objects_v2 response
• [AIRFLOW-XXX] Add TM to list of companies
• [AIRFLOW-1985] Impersonation fixes for using run_as_user
• [AIRFLOW-2018][AIRFLOW-2] Make Sensors backward compatible
• [AIRFLOW-XXX] Fix typo in concepts doc (dag_md)
• [AIRFLOW-2069] Allow Bytes to be uploaded to S3
• [AIRFLOW-2074] Fix log var name in GHE auth
• [AIRFLOW-1927] Convert naive datetimes for TaskInstances
• [AIRFLOW-1760] Password auth for experimental API
• [AIRFLOW-681] homepage doc link should pointing to apache repo not airbnb repo
• [AIRFLOW-705][AIRFLOW-706] Fix run_command bugs
• [AIRFLOW-990] Fix Py27 unicode logging in DockerOperator
• [AIRFLOW-963] Fix non-rendered code examples
• [AIRFLOW-969] Catch bad python_callable argument
• [AIRFLOW-984] Enable subclassing of SubDagOperator
• [AIRFLOW-997] Update setup.cfg to point to Apache
• [AIRFLOW-994] Add MiNODES to the official airflow user list
• [AIRFLOW-995][AIRFLOW-1] Update GitHub PR Template
• [AIRFLOW-989] Do not mark dag run successful if unfinished tasks
• [AIRFLOW-903] New configuration setting for the default dag view
• [AIRFLOW-979] Add GovTech GDS
• [AIRFLOW-933] Replace eval with literal_eval to prevent RCE
• [AIRFLOW-974] Fix mkdirs race condition
• [AIRFLOW-917] Fix formatting of error message
• [AIRFLOW-770] Refactor BaseHook so env vars are always read
• [AIRFLOW-900] Double trigger should not kill original task instance
• [AIRFLOW-900] Fixes bugs in LocalTaskJob for double run protection
• [AIRFLOW-932][AIRFLOW-932][AIRFLOW-921][AIRFLOW-910] Do not mark tasks removed when backfilling
• [AIRFLOW-961] run onkill when SIGTERMed
• [AIRFLOW-910] Use parallel task execution for backfills
• [AIRFLOW-967] Wrap strings in native for py2 ldap compatibility
• [AIRFLOW-958] Improve tooltip readability
• [AIRFLOW-959] Cleanup and reorganize .gitignore
• [AIRFLOW-960] Add .editorconfig file
• [AIRFLOW-931] Do not set QUEUED in TaskInstances
• [AIRFLOW-956] Get docs working on readthedocs.org
• [AIRFLOW-954] Fix configparser ImportError
• [AIRFLOW-941] Use defined parameters for psycopg2
• [AIRFLOW-943] Update Digital First Media in users list
• [AIRFLOW-942] Add mytaxi to Airflow users
• [AIRFLOW-939] add .swp to gitginore
• [AIRFLOW-719] Prevent DAGs from ending prematurely
• [AIRFLOW-938] Use test for True in task_stats queries
• [AIRFLOW-937] Improve performance of task_stats
• [AIRFLOW-933] use ast.literal_eval rather eval because ast.literal_eval does not execute input.
• [AIRFLOW-925] Revert airflow.hooks change that cherry-pick picked
• [AIRFLOW-919] Running tasks with no start date shouldn’t break a DAGs UI
• [AIRFLOW-802][AIRFLOW-1] Add spark-submit operator/hook
• [AIRFLOW-725] Use keyring to store credentials for JIRA
• [AIRFLOW-916] Remove deprecated readfp function
• [AIRFLOW-911] Add coloring and timing to tests
• [AIRFLOW-906] Update Code icon from lightning bolt to file
• [AIRFLOW-897] Prevent dagruns from failing with unfinished tasks
• [AIRFLOW-896] Remove unicode to 8-bit conversion in BigQueryOperator
• [AIRFLOW-899] Tasks in SCHEDULED state should be white in the UI instead of black
• [AIRFLOW-895] Address Apache release incompliancies
• [AIRFLOW-893][AIRFLOW-510] Fix crashing webservers when a dagrun has no start date
• [AIRFLOW-880] Make webserver serve logs in a sane way for remote logs
• [AIRFLOW-889] Fix minor error in the docstrings for BaseOperator
• [AIRFLOW-809][AIRFLOW-1] Use __eq__ ColumnOperator When Testing Booleans
• [AIRFLOW-875] Add template to HttpSensor params
• [AIRFLOW-866] Add FTPSensor
• [AIRFLOW-881] Check if SubDagOperator is in DAG context manager
• [AIRFLOW-885] Add change.org to the users list
• [AIRFLOW-836] Use POST and CSRF for state changing endpoints
• [AIRFLOW-862] Fix Unit Tests for DaskExecutor
• [AIRFLOW-887] Support future v0.16
• [AIRFLOW-886] Pass result to post_execute() hook
• [AIRFLOW-871] change logging.warn() into warning()
• [AIRFLOW-882] Remove unnecessary dag>>op assignment in docs
• [AIRFLOW-861] make pickle_info endpoint be login_required
• [AIRFLOW-869] Refactor mark success functionality
• [AIRFLOW-877] Remove .sql template extension from GCS download operator
• [AIRFLOW-826] Add Zendesk hook
• [AIRFLOW-842] do not query the DB with an empty IN clause
• [AIRFLOW-834] change raise StopIteration into return
• [AIRFLOW-832] Let debug server run without SSL
• [AIRFLOW-862] Add DaskExecutor
• [AIRFLOW-858] Configurable database name for DB operators
• [AIRFLOW-863] Example DAGs should have recent start dates
• [AIRFLOW-1142] SubDAG Tasks Not Executed Even Though All Dependencies Met
• [AIRFLOW-1138] Add licenses to files in scripts directory
• [AIRFLOW-1127] Move license notices to LICENSE instead of NOTICE
• [AIRFLOW-1124] Do not set all task instances to scheduled on backfill
• [AIRFLOW-1120] Update version view to include Apache prefix
• [AIRFLOW-1062] DagRun#find returns wrong result if external_trigger=False is specified
• [AIRFLOW-1054] Fix broken import on test_dag
• [AIRFLOW-1050] Retries ignored - regression
• [AIRFLOW-1033] TypeError: can’t compare datetime.datetime to None
• [AIRFLOW-1017] get_task_instance should return None instead of throw an exception for non-existent TIs
• [AIRFLOW-1011] Fix bug in BackfillJob._execute() for SubDAGs
• [AIRFLOW-1004] airflow webserver -D runs in foreground
• [AIRFLOW-1001] Landing Time shows "unsupported operand type(s) for -: 'datetime.datetime' and 'NoneType'" on example_subdag_operator
• [AIRFLOW-933] use ast.literal_eval rather eval because ast.literal_eval does not execute input.
• [AIRFLOW-925] Revert airflow.hooks change that cherry-pick picked
• [AIRFLOW-919] Running tasks with no start date shouldn’t break a DAGs UI
• [AIRFLOW-802] Add spark-submit operator/hook
• [AIRFLOW-897] Prevent dagruns from failing with unfinished tasks
• [AIRFLOW-861] make pickle_info endpoint be login_required
• [AIRFLOW-853] use utf8 encoding for stdout line decode
• [AIRFLOW-856] Make sure execution date is set for local client
• [AIRFLOW-830][AIRFLOW-829][AIRFLOW-88] Reduce Travis log verbosity
• [AIRFLOW-831] Restore import to fix broken tests
• [AIRFLOW-794] Access DAGS_FOLDER and SQL_ALCHEMY_CONN exclusively from settings
• [AIRFLOW-694] Fix config behaviour for empty envvar
• [AIRFLOW-365] Set dag.fileloc explicitly and use for Code view
• [AIRFLOW-931] Do not set QUEUED in TaskInstances
• [AIRFLOW-899] Tasks in SCHEDULED state should be white in the UI instead of black
• [AIRFLOW-895] Address Apache release incompliancies
• [AIRFLOW-893][AIRFLOW-510] Fix crashing webservers when a dagrun has no start date
• [AIRFLOW-793] Enable compressed loading in S3ToHiveTransfer
• [AIRFLOW-863] Example DAGs should have recent start dates
• [AIRFLOW-869] Refactor mark success functionality
• [AIRFLOW-856] Make sure execution date is set for local client
• [AIRFLOW-814] Fix Presto*CheckOperator.__init__
• [AIRFLOW-844] Fix cgroups directory creation
• [AIRFLOW-816] Use static nvd3 and d3
• [AIRFLOW-821] Fix py3 compatibility
• [AIRFLOW-817] Check for None value of execution_date in endpoint
• [AIRFLOW-822] Close db before exception
• [AIRFLOW-815] Add prev/next execution dates to template variables
• [AIRFLOW-813] Fix unterminated unit tests in SchedulerJobTest
• [AIRFLOW-813] Fix unterminated scheduler unit tests
• [AIRFLOW-806] UI should properly ignore DAG doc when it is None
• [AIRFLOW-812] Fix the scheduler termination bug.
• [AIRFLOW-780] Fix dag import errors no longer working
• [AIRFLOW-783] Fix py3 incompatibility in BaseTaskRunner
• [AIRFLOW-810] Correct down_revision dag_id/state index creation
• [AIRFLOW-807] Improve scheduler performance for large DAGs
3.21 FAQ
There are many reasons why your task might not be getting scheduled. Here are some of the common causes:
• Does your script "compile"? Can the Airflow engine parse it and find your DAG object? To test this, you can
run airflow list_dags and confirm that your DAG shows up in the list. You can also run airflow
list_tasks foo_dag_id --tree and confirm that your task shows up in the list as expected. If you
use the CeleryExecutor, you may want to confirm that this works both where the scheduler runs as well as where
the worker runs.
• Does the file containing your DAG contain the string “airflow” and “DAG” somewhere in the contents? When
searching the DAG directory, Airflow ignores files not containing “airflow” and “DAG” in order to prevent the
DagBag parsing from importing all python files collocated with user’s DAGs.
• Is your start_date set properly? The Airflow scheduler triggers the task soon after the start_date +
schedule_interval is passed.
• Is your schedule_interval set properly? The default schedule_interval is one day (datetime.
timedelta(1)). You must specify a different schedule_interval directly to the DAG ob-
ject you instantiate, not as a default_param, as task instances do not override their parent DAG’s
schedule_interval.
• Is your start_date beyond where you can see it in the UI? If you set your start_date to some time, say 3
months ago, you won't be able to see it in the main view in the UI, but you should be able to see it in Menu
-> Browse -> Task Instances.
• Are the dependencies for the task met? The task instances directly upstream from the task need to be in a
success state. Also, if you have set depends_on_past=True, the previous task instance needs to have
succeeded (except if it is the first run for that task). Also, if wait_for_downstream=True, make sure you
understand what it means. You can view how these properties are set from the Task Instance Details
page for your task.
• Are the DagRuns you need created and active? A DagRun represents a specific execution of an entire DAG and
has a state (running, success, failed, ...). The scheduler creates new DagRuns as it moves forward, but never goes
back in time to create new ones. The scheduler only evaluates running DagRuns to see what task instances
it can trigger. Note that clearing task instances (from the UI or CLI) does set the state of a DagRun back to
running. You can bulk view the list of DagRuns and alter states by clicking on the schedule tag for a DAG.
• Is the concurrency parameter of your DAG reached? concurrency defines how many running task
instances a DAG is allowed to have, beyond which point things get queued.
• Is the max_active_runs parameter of your DAG reached? max_active_runs defines how many
running concurrent instances of a DAG there are allowed to be.
You may also want to read the Scheduler section of the docs and make sure you fully understand how it proceeds.
Check out the Trigger Rule section in the Concepts section of the documentation.
3.21.3 Why are connection passwords still not encrypted in the metadata db after I
installed airflow[crypto]?
Check out the Connections section in the Configuration section of the documentation.
start_date is partly legacy from the pre-DagRun era, but it is still relevant in many ways. When creating a new
DAG, you probably want to set a global start_date for your tasks using default_args. The first DagRun to
be created will be based on the min(start_date) for all your tasks. From that point on, the scheduler creates new
DagRuns based on your schedule_interval and the corresponding task instances run as your dependencies are
met. When introducing new tasks to your DAG, you need to pay special attention to start_date, and may want to
reactivate inactive DagRuns to get the new task onboarded properly.
We recommend against using dynamic values as start_date, especially datetime.now() as it can be quite
confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour
after now as now() moves along.
Previously we also recommended using rounded start_date in relation to your schedule_interval. This
meant an @hourly would be at 00:00 minutes:seconds, a @daily job at midnight, a @monthly job on
the first of the month. This is no longer required. Airflow will now auto align the start_date and the
schedule_interval, by using the start_date as the moment to start looking.
You can use any sensor or a TimeDeltaSensor to delay the execution of tasks within the schedule interval. While
schedule_interval does allow specifying a datetime.timedelta object, we recommend using the macros
or cron expressions instead, as it enforces this idea of rounded schedules.
When using depends_on_past=True it’s important to pay special attention to start_date as the past depen-
dency is not enforced only on the specific schedule of the start_date specified for the task. It’s also important to
watch DagRun activity status in time when introducing new depends_on_past=True, unless you are planning
on running a backfill for the new task(s).
Also important to note is that the task's start_date, in the context of a backfill CLI command, gets overridden by
the backfill command's start_date. This allows a backfill on tasks that have depends_on_past=True to
actually start; if that weren't the case, the backfill just wouldn't start.
Airflow looks in your DAGS_FOLDER for modules that contain DAG objects in their global namespace, and adds
the objects it finds to the DagBag. Knowing this, all we need is a way to dynamically assign variables in the global
namespace. This is easily done in Python using the globals() function from the standard library, which behaves
like a simple dictionary.
for i in range(10):
dag_id = 'foo_{}'.format(i)
globals()[dag_id] = DAG(dag_id)
# or better, call a function that returns a DAG object!
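As a minimal sketch of the function-based variant suggested by the comment above (the dag_id pattern, start date and schedule are illustrative):

from datetime import datetime
from airflow.models import DAG

def create_dag(dag_id):
    dag = DAG(dag_id, start_date=datetime(2018, 1, 1), schedule_interval='@daily')
    # add tasks to the dag here
    return dag

for i in range(10):
    globals()['foo_{}'.format(i)] = create_dag('foo_{}'.format(i))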
3.21.6 What are all the airflow run commands in my process list?
There are many layers of airflow run commands, meaning it can call itself.
• Basic airflow run: fires up an executor and tells it to run an airflow run --local command. If using
Celery, this means it puts a command in the queue for it to run remotely on the worker. If using the LocalExecutor,
that translates into running it in a subprocess pool.
• Local airflow run --local: starts an airflow run --raw command (described below) as a subprocess
and is in charge of emitting heartbeats, listening for external kill signals and ensuring some cleanup takes
place if the subprocess fails.
• Raw airflow run --raw: runs the actual operator's execute method and performs the actual work.
There are three variables we can control to improve Airflow DAG performance:
• parallelism: This variable controls the number of task instances that the Airflow worker can run simultaneously.
Users can increase the parallelism variable in airflow.cfg.
• concurrency: The Airflow scheduler will run no more than concurrency task instances for your DAG
at any given time. Concurrency is defined on your Airflow DAG. If you do not set the concurrency on your DAG,
the scheduler will use the default value from the dag_concurrency entry in your airflow.cfg.
• max_active_runs: the Airflow scheduler will run no more than max_active_runs DagRuns of your
DAG at a given time. If you do not set max_active_runs on your DAG, the scheduler will use the default
value from the max_active_runs_per_dag entry in your airflow.cfg. The two DAG-level settings are illustrated in the sketch below.
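A minimal sketch of setting the DAG-level knobs above (dag_id, dates and values are illustrative):

from datetime import datetime
from airflow.models import DAG

dag = DAG(
    dag_id='tuned_dag',
    start_date=datetime(2018, 1, 1),
    schedule_interval='@daily',
    concurrency=16,       # at most 16 running task instances of this DAG at any time
    max_active_runs=1)    # at most one running DagRun of this DAG at a time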
If your DAG takes a long time to load, you could reduce the value of the default_dag_run_display_number
configuration in airflow.cfg to a smaller value. This setting controls the number of DAG runs to show in the UI,
with a default value of 25.
This means explicit_defaults_for_timestamp is disabled in your MySQL server and you need to enable it
by:
1. Set explicit_defaults_for_timestamp = 1 under the mysqld section in your my.cnf file.
2. Restart the MySQL server.
• max_threads: The scheduler will spawn multiple threads in parallel to schedule DAGs. This is controlled by
max_threads, with a default value of 2. Users should increase this to a larger value (e.g. the number of CPUs
where the scheduler runs minus 1) in production.
• scheduler_heartbeat_sec: Users should consider increasing the scheduler_heartbeat_sec config
to a higher value (e.g. 60 seconds); it controls how frequently the Airflow scheduler emits its heartbeat and updates
the job's entry in the database.
3.22.1 Operators
Operators allow for generation of certain types of tasks that become nodes in the DAG when instantiated. All op-
erators derive from BaseOperator and inherit many attributes and methods that way. Refer to the BaseOperator
documentation for more details.
There are 3 main types of operators:
• Operators that perform an action, or tell another system to perform an action
• Transfer operators move data from one system to another
• Sensors are a certain type of operator that will keep running until a certain criterion is met. Examples include
a specific file landing in HDFS or S3, a partition appearing in Hive, or a specific time of the day. Sensors are
derived from BaseSensorOperator and run a poke method at a specified poke_interval until it returns
True.
3.22.1.1 BaseOperator
All operators are derived from BaseOperator and acquire much functionality through inheritance. Since this is
the core of the engine, it’s worth taking the time to understand the parameters of BaseOperator to understand the
primitive features that can be leveraged in your DAGs.
class airflow.models.BaseOperator(**kwargs)
Bases: airflow.utils.log.logging_mixin.LoggingMixin
Abstract base class for all operators. Since operators create objects that become nodes in the dag, BaseOperator
contains many recursive methods for dag crawling behavior. To derive this class, you are expected to override
the constructor as well as the ‘execute’ method.
Operators derived from this class should perform or trigger certain tasks synchronously (wait for comple-
tion). Example of operators could be an operator that runs a Pig job (PigOperator), a sensor operator that
waits for a partition to land in Hive (HiveSensorOperator), or one that moves data from Hive to MySQL
(Hive2MySqlOperator). Instances of these operators (tasks) target specific operations, running specific scripts,
functions or data transfers.
This class is abstract and shouldn’t be instantiated. Instantiating a class derived from this one results in the
creation of a task object, which ultimately becomes a node in DAG objects. Task dependencies should be set by
using the set_upstream and/or set_downstream methods.
Parameters
• task_id (string) – a unique, meaningful id for the task
• owner (string) – the owner of the task, using the unix username is recommended
• retries (int) – the number of retries that should be performed before failing the task
• retry_delay (timedelta) – delay between retries
• retry_exponential_backoff (bool) – allow progressively longer waits between retries
by using an exponential backoff algorithm on the retry delay (the delay will be converted into
seconds)
• max_retry_delay (timedelta) – maximum delay interval between retries
• start_date (datetime) – The start_date for the task, determines the
execution_date for the first task instance. The best practice is to have the start_date
rounded to your DAG’s schedule_interval. Daily jobs have their start_date some
day at 00:00:00, hourly jobs have their start_date at 00:00 of a specific hour. Note that Airflow
simply looks at the latest execution_date and adds the schedule_interval
to determine the next execution_date. It is also very important to note that different
tasks' dependencies need to line up in time. If task A depends on task B and their
start_dates are offset in a way that their execution_dates don't line up, A's dependencies will
never be met. If you are looking to delay a task, for example running a daily task at 2AM,
look into the TimeSensor and TimeDeltaSensor. We advise against using dynamic
start_date and recommend using fixed ones. Read the FAQ entry about start_date for
more information.
• end_date (datetime) – if specified, the scheduler won’t go beyond this date
• depends_on_past (bool) – when set to true, task instances will run sequentially while
relying on the previous task’s schedule to succeed. The task instance for the start_date is
allowed to run.
• wait_for_downstream (bool) – when set to true, an instance of task X will wait
for tasks immediately downstream of the previous instance of task X to finish successfully
before it runs. This is useful if the different instances of a task X alter the same asset, and
this asset is used by tasks downstream of task X. Note that depends_on_past is forced to
True wherever wait_for_downstream is used.
• queue (str) – which queue to target when running this job. Not all executors implement
queue management, the CeleryExecutor does support targeting specific queues.
• dag (DAG) – a reference to the dag the task is attached to (if any)
• priority_weight (int) – priority weight of this task against other task. This allows
the executor to trigger higher priority tasks before others when things get backed up.
• weight_rule (str) – weighting method used for the effective total priority weight
of the task. Options are: { downstream | upstream | absolute } default is
downstream When set to downstream the effective weight of the task is the aggregate
sum of all downstream descendants. As a result, upstream tasks will have higher weight and
will be scheduled more aggressively when using positive weight values. This is useful when
you have multiple dag run instances and desire to have all upstream tasks to complete for all
runs before each dag can continue processing downstream tasks. When set to upstream
the effective weight is the aggregate sum of all upstream ancestors. This is the opposite,
where downstream tasks have higher weight and will be scheduled more aggressively when
using positive weight values. This is useful when you have multiple dag run instances and
prefer to have each dag complete before starting upstream tasks of other dags. When set to
absolute, the effective weight is the exact priority_weight specified without additional
weighting. You may want to do this when you know exactly what priority weight
each task should have. Additionally, when set to absolute, there is a bonus effect of
significantly speeding up the task creation process for very large DAGs. Options can be set as
a string or using the constants defined in the static class airflow.utils.WeightRule
• pool (str) – the slot pool this task should run in, slot pools are a way to limit concurrency
for certain tasks
• sla (datetime.timedelta) – time by which the job is expected to succeed. Note that
this represents the timedelta after the period is closed. For example if you set an SLA
of 1 hour, the scheduler would send an email soon after 1:00AM on the 2016-01-02 if
the 2016-01-01 instance has not succeeded yet. The scheduler pays special attention to
jobs with an SLA and sends alert emails for SLA misses. SLA misses are also recorded in the
database for future reference. All tasks that share the same SLA time get bundled in a single
email, sent soon after that time. SLA notifications are sent once and only once for each task
instance.
• execution_timeout (datetime.timedelta) – max time allowed for the execution
of this task instance; if it goes beyond it, the task will raise and fail.
• on_failure_callback (callable) – a function to be called when a task instance of
this task fails. A context dictionary is passed as a single parameter to this function. Context
contains references to objects related to the task instance and is documented under the
macros section of the API.
• on_retry_callback (callable) – much like the on_failure_callback except
that it is executed when retries occur.
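# Example executor_config: run this task in a specific Docker image via the KubernetesExecutor.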
MyOperator(...,
executor_config={
"KubernetesExecutor":
{"image": "myCustomDockerImage"}
}
)
clear(**kwargs)
Clears the state of task instances associated with the task, following the parameters specified.
dag
Returns the Operator’s DAG if set, otherwise raises an error
deps
Returns the list of dependencies for the operator. These differ from execution context dependencies in that
they are specific to tasks and can be extended/overridden by subclasses.
downstream_list
@property: list of tasks directly downstream
execute(context)
This is the main method to derive when creating an operator. Context is the same dictionary used as when
rendering jinja templates.
Refer to get_template_context for more context.
get_direct_relative_ids(upstream=False)
Get the direct relative ids to the current task, upstream or downstream.
get_direct_relatives(upstream=False)
Get the direct relatives to the current task, upstream or downstream.
get_flat_relative_ids(upstream=False, found_descendants=None)
Get a flat list of relatives’ ids, either upstream or downstream.
get_flat_relatives(upstream=False)
Get a flat list of relatives, either upstream or downstream.
get_task_instances(session, start_date=None, end_date=None)
Get the set of task instances related to this task for a specific date range.
has_dag()
Returns True if the Operator has been assigned to a DAG.
on_kill()
Override this method to cleanup subprocesses when a task instance gets killed. Any use of the threading,
subprocess or multiprocessing module within an operator needs to be cleaned up or it will leave ghost
processes behind.
post_execute(context, *args, **kwargs)
This hook is triggered right after self.execute() is called. It is passed the execution context and any results
returned by the operator.
pre_execute(context, *args, **kwargs)
This hook is triggered right before self.execute() is called.
prepare_template()
Hook that is triggered after the templated fields get replaced by their content. If you need your operator to
alter the content of the file before the template is rendered, it should override this method to do so.
render_template(attr, content, context)
Renders a template either from a file or directly in a field, and returns the rendered result.
render_template_from_field(attr, content, context, jinja_env)
Renders a template from a field. If the field is a string, it will simply render the string and return the result.
If it is a collection or nested set of collections, it will traverse the structure and render all strings in it.
run(start_date=None, end_date=None, ignore_first_depends_on_past=False, ignore_ti_state=False,
mark_success=False)
Run a set of task instances for a date range.
schedule_interval
The schedule interval of the DAG always wins over individual tasks so that tasks within a DAG always
line up. The task still needs a schedule_interval as it may not be attached to a DAG.
set_downstream(task_or_task_list)
Set a task or a task list to be directly downstream from the current task.
set_upstream(task_or_task_list)
Set a task or a task list to be directly upstream from the current task.
upstream_list
@property: list of tasks directly upstream
xcom_pull(context, task_ids=None, dag_id=None, key=u’return_value’, include_prior_dates=None)
See TaskInstance.xcom_pull()
xcom_push(context, key, value, execution_date=None)
See TaskInstance.xcom_push()
3.22.1.2 BaseSensorOperator
All sensors are derived from BaseSensorOperator. All sensors inherit the timeout and poke_interval on
top of the BaseOperator attributes.
class airflow.sensors.base_sensor_operator.BaseSensorOperator(**kwargs)
Bases: airflow.models.BaseOperator, airflow.models.SkipMixin
Sensor operators are derived from this class and inherit these attributes.
Sensor operators keep executing at a time interval and succeed when a criteria is met and fail if and when they
time out.
Parameters
• soft_fail (bool) – Set to true to mark the task as SKIPPED on failure
• poke_interval (int) – Time in seconds that the job should wait in between each try
• timeout (int) – Time, in seconds before the task times out and fails.
• mode (str) – How the sensor operates. Options are: { poke | reschedule }, default
is poke. When set to poke the sensor takes up a worker slot for its whole execution
time and sleeps between pokes. Use this mode if the expected runtime of the sensor is short
or if a short poke interval is required. When set to reschedule the sensor task frees the
worker slot when the criteria is not yet met and it is rescheduled at a later time. Use this
mode if the time before the criteria is met is expected to be long. The poke interval should be more than
one minute to prevent too much load on the scheduler.
deps
Adds one additional dependency for all sensor operators that checks if a sensor task instance can be
rescheduled.
poke(context)
Function that the sensors defined while deriving this class should override.
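As a sketch, a minimal custom sensor that overrides poke() might look as follows (the class name and file path are illustrative):

import os
from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.utils.decorators import apply_defaults

class LocalFileSensor(BaseSensorOperator):
    @apply_defaults
    def __init__(self, filepath, *args, **kwargs):
        super(LocalFileSensor, self).__init__(*args, **kwargs)
        self.filepath = filepath

    def poke(self, context):
        # Return True when the criteria is met; the sensor keeps poking until then.
        return os.path.exists(self.filepath)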
Operators
class airflow.operators.bash_operator.BashOperator(**kwargs)
Bases: airflow.models.BaseOperator
Execute a Bash script, command or set of commands.
Parameters
• bash_command (string) – The command, set of commands or reference to a bash script
(must be ‘.sh’) to be executed. (templated)
• xcom_push (bool) – If xcom_push is True, the last line written to stdout will also be
pushed to an XCom when the bash command completes.
• env (dict) – If env is not None, it must be a mapping that defines the environment vari-
ables for the new process; these are used instead of inheriting the current process environ-
ment, which is the default behavior. (templated)
• output_encoding (str) – Output encoding of bash command
execute(context)
Execute the bash command in a temporary directory which will be cleaned afterwards
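For example, a minimal BashOperator sketch using a templated command, a custom environment and xcom_push (dag_id, dates and values are illustrative):

from datetime import datetime
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(dag_id='bash_example', start_date=datetime(2018, 1, 1),
          schedule_interval='@daily')

echo_date = BashOperator(
    task_id='echo_date',
    bash_command='echo "run date is {{ ds }}"',
    env={'MY_VAR': 'value'},   # replaces the inherited environment
    xcom_push=True,            # the last line written to stdout becomes the XCom value
    dag=dag)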
class airflow.operators.python_operator.BranchPythonOperator(**kwargs)
Bases: airflow.operators.python_operator.PythonOperator, airflow.models.
SkipMixin
Allows a workflow to “branch” or follow a single path following the execution of this task.
It derives from the PythonOperator and expects a Python function that returns the task_id to follow. The task_id
returned should point to a task directly downstream from {self}. All other "branches" or directly downstream
tasks are marked with a state of skipped so that these paths can't move forward. The skipped states are
propagated downstream to allow the DAG state to fill up and the DAG run's state to be inferred.
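A minimal branching sketch (the task ids and the choice function are illustrative):

from datetime import datetime
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator

dag = DAG(dag_id='branch_example', start_date=datetime(2018, 1, 1),
          schedule_interval='@daily')

def choose_branch(**context):
    # Return the task_id of the branch to follow.
    return 'path_a' if context['execution_date'].day % 2 == 0 else 'path_b'

branch = BranchPythonOperator(task_id='branch', python_callable=choose_branch,
                              provide_context=True, dag=dag)
path_a = DummyOperator(task_id='path_a', dag=dag)
path_b = DummyOperator(task_id='path_b', dag=dag)
branch.set_downstream([path_a, path_b])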
• partition (dict) – target partition as a dict of partition columns and values. (templated)
• delimiter (str) – field delimiter in the file
• mssql_conn_id (str) – source Microsoft SQL Server connection
• hive_conn_id (str) – destination hive connection
• tblproperties (dict) – TBLPROPERTIES of the hive table being created
class airflow.operators.pig_operator.PigOperator(**kwargs)
Bases: airflow.models.BaseOperator
Executes pig script.
Parameters
• pig (string) – the pig latin script to be executed. (templated)
• pig_cli_conn_id (string) – reference to the Hive database
• pigparams_jinja_translate (boolean) – when True, pig params-type templating
${var} gets translated into jinja-type templating {{ var }}. Note that you may want to use
this along with the DAG(user_defined_macros=myargs) parameter. View the DAG
object documentation for more details.
class airflow.operators.python_operator.PythonOperator(**kwargs)
Bases: airflow.models.BaseOperator
Executes a Python callable
Parameters
• python_callable (python callable) – A reference to an object that is callable
• op_kwargs (dict) – a dictionary of keyword arguments that will get unpacked in your
function
• op_args (list) – a list of positional arguments that will get unpacked when calling your
callable
• provide_context (bool) – if set to true, Airflow will pass a set of keyword arguments
that can be used in your function. This set of kwargs correspond exactly to what you can
use in your jinja templates. For this to work, you need to define **kwargs in your function
header.
• templates_dict (dict of str) – a dictionary where the values are templates that
will get templated by the Airflow engine sometime between __init__ and execute
takes place and are made available in your callable’s context after the template has been
applied. (templated)
• templates_exts (list(str)) – a list of file extensions to resolve while processing
templated fields, for example ['.sql', '.hql']
class airflow.operators.python_operator.PythonVirtualenvOperator(**kwargs)
Bases: airflow.operators.python_operator.PythonOperator
Allows one to run a function in a virtualenv that is created and destroyed automatically (with certain caveats).
The function must be defined using def, and not be part of a class. All imports must happen inside the function
and no variables outside of the scope may be referenced. A global scope variable named virtualenv_string_args
will be available (populated by string_args). In addition, one can pass things through op_args and op_kwargs, and
one can use a return value. Note that if your virtualenv runs in a different Python major version than Airflow,
you cannot use return values, op_args, or op_kwargs. You can use string_args though.
Parameters
• python_callable (python callable) – A Python function with no references to outside variables,
defined with def, which will be run in the virtualenv
• requirements (list(str)) – A list of requirements as specified in a pip install com-
mand
• python_version (str) – The Python version to run the virtualenv with. Note that both
2 and 2.7 are acceptable forms.
• use_dill (bool) – Whether to use dill to serialize the args and result (pickle is default).
This allows more complex types but requires you to include dill in your requirements.
• system_site_packages (bool) – Whether to include system_site_packages in your
virtualenv. See virtualenv documentation for more information.
• op_args – A list of positional arguments to pass to python_callable.
• op_kwargs (dict) – A dict of keyword arguments to pass to python_callable.
• string_args (list(str)) – Strings that are present in the global var vir-
tualenv_string_args, available to python_callable at runtime as a list(str). Note that args
are split by newline.
• templates_dict (dict of str) – a dictionary where the values are templates that
will get templated by the Airflow engine sometime between __init__ and execute
takes place and are made available in your callable’s context after the template has been
applied
• templates_exts (list(str)) – a list of file extensions to resolve while processing
templated fields, for example ['.sql', '.hql']
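A hedged sketch of a self-contained callable run in its own virtualenv (the requirement pin, task id and dag object are assumptions for this example):

    def summarize():
        # all imports must happen inside the function
        import simplejson as json
        # virtualenv_string_args is injected at runtime from string_args
        return json.dumps({'args': virtualenv_string_args})

    virtualenv_task = PythonVirtualenvOperator(
        task_id='virtualenv_task',
        python_callable=summarize,
        requirements=['simplejson==3.16.0'],  # illustrative pin
        string_args=['first', 'second'],
        system_site_packages=False,
        dag=dag)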
class airflow.operators.s3_file_transform_operator.S3FileTransformOperator(**kwargs)
Bases: airflow.models.BaseOperator
Copies data from a source S3 location to a temporary location on the local filesystem. Runs a transformation on
this file as specified by the transformation script and uploads the output to a destination S3 location.
The locations of the source and the destination files on the local filesystem are provided as the first and second
arguments to the transformation script. The transformation script is expected to read the data from source,
transform it and write the output to the local destination file. The operator then takes over control and uploads
the local destination file to S3.
S3 Select is also available to filter the source contents. Users can omit the transformation script if an S3 Select
expression is specified.
Parameters
• source_s3_key (str) – The key to be retrieved from S3. (templated)
• source_aws_conn_id (str) – source s3 connection
• source_verify (bool or str) – Whether or not to verify SSL certificates for the S3
connection. By default SSL certificates are verified. You can provide the following values:
– False: do not validate SSL certificates. SSL will still be used (unless use_ssl is
False), but SSL certificates will not be verified.
– path/to/cert/bundle.pem: A filename of the CA cert bundle to use. You
can specify this argument if you want to use a different CA cert bundle than the one
used by botocore.
This is also applicable to dest_verify.
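A hedged sketch of the copy-transform-upload flow described above; the bucket keys, script path and connection id are illustrative, and dest_s3_key, transform_script and replace come from the operator’s full signature rather than the truncated parameter list shown here:

    transform_task = S3FileTransformOperator(
        task_id='s3_transform',
        source_s3_key='s3://my-source-bucket/raw/{{ ds }}/data.csv',
        dest_s3_key='s3://my-dest-bucket/clean/{{ ds }}/data.csv',
        transform_script='/usr/local/airflow/scripts/transform.py',
        source_aws_conn_id='aws_default',
        source_verify=True,
        replace=True,
        dag=dag)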
class airflow.operators.python_operator.ShortCircuitOperator(**kwargs)
Bases: airflow.operators.python_operator.PythonOperator, airflow.models.
SkipMixin
Allows a workflow to continue only if a condition is met. Otherwise, the workflow “short-circuits” and down-
stream tasks are skipped.
The ShortCircuitOperator is derived from the PythonOperator. It evaluates a condition and short-circuits the
workflow if the condition is False. Any downstream tasks are marked with a state of “skipped”. If the condition
is True, downstream tasks proceed as normal.
The condition is determined by the result of python_callable.
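A minimal sketch of the short-circuit pattern (the weekday condition, task ids and dag object are assumptions for this example):

    def is_weekday(**kwargs):
        # downstream tasks run only when this returns True
        return kwargs['execution_date'].weekday() < 5

    check_weekday = ShortCircuitOperator(
        task_id='check_weekday',
        python_callable=is_weekday,
        provide_context=True,
        dag=dag)
    check_weekday >> downstream_task  # downstream_task is assumed to exist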
class airflow.operators.http_operator.SimpleHttpOperator(**kwargs)
Bases: airflow.models.BaseOperator
Calls an endpoint on an HTTP system to execute an action
Parameters
• http_conn_id (string) – The connection to run the sensor against
• endpoint (string) – The relative part of the full url. (templated)
• method (string) – The HTTP method to use, default = “POST”
• data (For POST/PUT, depends on the content-type parameter,
for GET a dictionary of key/value string pairs) – The data to pass.
POST-data in POST/PUT and params in the URL for a GET request. (templated)
• headers (a dictionary of string key/value pairs) – The HTTP headers
to be added to the GET request
• response_check (A lambda or defined function.) – A check against the
‘requests’ response object. Returns True for ‘pass’ and False otherwise.
• extra_options (A dictionary of options, where key is string
and value depends on the option that's being modified.) – Extra
options for the ‘requests’ library, see the ‘requests’ documentation (options to modify
timeout, ssl, etc.)
• xcom_push (bool) – Push the response to Xcom (default: False)
• log_response (bool) – Log the response (default: False)
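A hedged sketch posting templated data to an endpoint and validating the response (the connection id, endpoint path and payload are assumptions for this example):

    import json

    post_event = SimpleHttpOperator(
        task_id='post_event',
        http_conn_id='http_default',
        endpoint='api/v1/events',
        method='POST',
        data=json.dumps({'run_date': '{{ ds }}'}),
        headers={'Content-Type': 'application/json'},
        response_check=lambda response: 'ok' in response.text,
        xcom_push=True,
        dag=dag)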
class airflow.operators.sqlite_operator.SqliteOperator(**kwargs)
Bases: airflow.models.BaseOperator
Executes sql code in a specific Sqlite database
Parameters
• sqlite_conn_id (string) – reference to a specific sqlite database
• sql (string or string pointing to a template file; the file must
have a '.sql' extension) – the sql code to be executed. (templated)
class airflow.operators.subdag_operator.SubDagOperator(**kwargs)
Bases: airflow.models.BaseOperator
class airflow.operators.dagrun_operator.TriggerDagRunOperator(**kwargs)
Bases: airflow.models.BaseOperator
Triggers a DAG run for a specified dag_id
Parameters
• trigger_dag_id (str) – the dag_id to trigger (templated)
• python_callable (python callable) – a reference to a python function that will
be called while passing it the context object and a placeholder object obj for your
callable to fill and return if you want a DagRun created. This obj object contains a run_id
and payload attribute that you can modify in your function. The run_id should be a
unique identifier for that DAG run, and the payload has to be a picklable object that will be
made available to your tasks while executing that DAG run. Your function header should
look like def foo(context, dag_run_obj):
• execution_date (str or datetime.datetime) – Execution date for the dag
(templated)
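A sketch of the controller pattern described above; the target dag id, the condition and the payload contents are assumptions for this example:

    def conditionally_trigger(context, dag_run_obj):
        # return the dag_run_obj to create a DagRun, or None to skip it
        if context['params'].get('should_trigger', True):
            dag_run_obj.payload = {'source_execution_date': context['ds']}
            return dag_run_obj

    trigger = TriggerDagRunOperator(
        task_id='trigger_target_dag',
        trigger_dag_id='target_dag',
        python_callable=conditionally_trigger,
        dag=dag)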
class airflow.operators.check_operator.ValueCheckOperator(**kwargs)
Bases: airflow.models.BaseOperator
Performs a simple value check using sql code.
Note that this is an abstract class and get_db_hook needs to be defined, where get_db_hook is a hook that
returns a single record from an external source.
Parameters sql (string) – the sql to be executed. (templated)
Sensors
class airflow.sensors.external_task_sensor.ExternalTaskSensor(**kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a task to complete in a different DAG
Parameters
• external_dag_id (string) – The dag_id that contains the task you want to wait for
• external_task_id (string) – The task_id that contains the task you want to wait for
• allowed_states (list) – list of allowed states, default is ['success']
• execution_delta (datetime.timedelta) – time difference with the previous ex-
ecution to look at, the default is the same execution_date as the current task. For yesterday,
use [positive!] datetime.timedelta(days=1). Either execution_delta or execution_date_fn
can be passed to ExternalTaskSensor, but not both.
• execution_date_fn (callable) – function that receives the current execution
date and returns the desired execution dates to query. Either execution_delta or execu-
tion_date_fn can be passed to ExternalTaskSensor, but not both.
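For example, a hedged sketch that waits on yesterday’s run of another DAG (the dag ids, task ids and dag object are assumptions for this example):

    from datetime import timedelta

    wait_for_upstream = ExternalTaskSensor(
        task_id='wait_for_upstream',
        external_dag_id='upstream_dag',
        external_task_id='final_task',
        allowed_states=['success'],
        execution_delta=timedelta(days=1),  # positive delta looks one day back
        dag=dag)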
poke(**kwargs)
Function that the sensors defined while deriving this class should override.
class airflow.sensors.hive_partition_sensor.HivePartitionSensor(**kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a partition to show up in Hive.
Note: Because partition supports general logical operators, it can be inefficient. Consider using Named-
HivePartitionSensor instead if you don’t need the full flexibility of HivePartitionSensor.
Parameters
• table (string) – The name of the table to wait for, supports the dot notation
(my_database.my_table)
• partition (string) – The partition clause to wait for. This is passed as is to the metas-
tore Thrift client get_partitions_by_filter method, and apparently supports SQL
like notation as in ds='2015-01-01' AND type='value' and comparison opera-
tors as in "ds>=2015-01-01"
• metastore_conn_id (str) – reference to the metastore thrift service connection id
poke(context)
Function that the sensors defined while deriving this class should override.
class airflow.sensors.http_sensor.HttpSensor(**kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Executes an HTTP GET request and returns False on failure: 404 Not Found or the response_check function re-
turned False
Parameters
• http_conn_id (string) – The connection to run the sensor against
• method (string) – The HTTP request method to use
• endpoint (string) – The relative part of the full url
• request_params (a dictionary of string key/value pairs) – The pa-
rameters to be added to the GET url
• headers (a dictionary of string key/value pairs) – The HTTP headers
to be added to the GET request
• response_check (A lambda or defined function.) – A check against the
‘requests’ response object. Returns True for ‘pass’ and False otherwise.
• extra_options (A dictionary of options, where key is string
and value depends on the option that's being modified.) – Extra
options for the ‘requests’ library, see the ‘requests’ documentation (options to modify
timeout, ssl, etc.)
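A hedged sketch that polls an endpoint until it reports readiness (the connection id, endpoint and response text are assumptions for this example):

    wait_for_api = HttpSensor(
        task_id='wait_for_api',
        http_conn_id='http_default',
        endpoint='api/v1/status',
        request_params={'date': '{{ ds }}'},
        response_check=lambda response: 'ready' in response.text,
        poke_interval=60,  # poke_interval comes from BaseSensorOperator
        dag=dag)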
poke(context)
Function that the sensors defined while deriving this class should override.
class airflow.sensors.metastore_partition_sensor.MetastorePartitionSensor(**kwargs)
Bases: airflow.sensors.sql_sensor.SqlSensor
An alternative to the HivePartitionSensor that talks directly to the MySQL db. This was created as a result of
observing suboptimal queries generated by the Metastore thrift service when hitting subpartitioned tables. The
Thrift service’s queries were written in a way that wouldn’t leverage the indexes.
Parameters
• schema (str) – the schema
• table (str) – the table
poke(context)
Function that the sensors defined while deriving this class should override.
class airflow.sensors.s3_prefix_sensor.S3PrefixSensor(**kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a prefix to exist. A prefix is the first part of a key, thus enabling checking of constructs similar to glob
airfl* or SQL LIKE ‘airfl%’. You can specify a delimiter to indicate the hierarchy of keys,
meaning that the match will stop at that delimiter. Current code accepts sane delimiters, i.e. characters that are
NOT special characters in the Python regex engine.
Parameters
• bucket_name (str) – Name of the S3 bucket
• prefix (str) – The prefix being waited on. Relative path from bucket root level.
• delimiter (str) – The delimiter intended to show hierarchy. Defaults to ‘/’.
• aws_conn_id (str) – a reference to the s3 connection
• verify (bool or str) – Whether or not to verify SSL certificates for the S3 connection.
By default SSL certificates are verified. You can provide the following values:
– False: do not validate SSL certificates. SSL will still be used (unless use_ssl is
False), but SSL certificates will not be verified.
– path/to/cert/bundle.pem: A filename of the CA cert bundle to use.
poke(context)
Function that the sensors defined while deriving this class should override.
class airflow.sensors.sql_sensor.SqlSensor(**kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Runs a sql statement until a criteria is met. It will keep trying while the sql returns no row, or if the first cell is in (0,
‘0’, ‘’).
Parameters
• conn_id (string) – The connection to run the sensor against
• sql – The sql to run. To pass, it needs to return at least one cell that contains a non-zero /
non-empty string value.
poke(context)
Function that the sensors defined while deriving this class should override.
class airflow.sensors.time_sensor.TimeSensor(**kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits until the specified time of the day.
Parameters target_time (datetime.time) – time after which the job succeeds
poke(context)
Function that the sensors defined while deriving this class should override.
class airflow.sensors.time_delta_sensor.TimeDeltaSensor(**kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a timedelta after the task’s execution_date + schedule_interval. In Airflow, the daily task stamped with
execution_date 2016-01-01 can only start running on 2016-01-02. The timedelta here represents the time
after the execution period has closed.
Parameters delta (datetime.timedelta) – time length to wait after execution_date before
succeeding
poke(context)
Function that the sensors defined while deriving this class should override.
class airflow.sensors.web_hdfs_sensor.WebHdfsSensor(**kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a file or folder to land in HDFS
poke(context)
Function that the sensors defined while deriving this class should override.
Operators
class airflow.contrib.operators.aws_athena_operator.AWSAthenaOperator(**kwargs)
Bases: airflow.models.BaseOperator
An operator that submits a Presto query to Athena.
Parameters
• query (str) – Presto query to be run on Athena. (templated)
• database (str) – Database to select. (templated)
• output_location (str) – s3 path to write the query results into. (templated)
• aws_conn_id (str) – aws connection to use
• sleep_time (int) – Time to wait between two consecutive calls to check query status on
Athena
execute(context)
Run Presto Query on Athena
on_kill()
Cancel the submitted athena query
class airflow.contrib.operators.awsbatch_operator.AWSBatchOperator(**kwargs)
Bases: airflow.models.BaseOperator
Execute a job on AWS Batch Service
Parameters
• job_name (str) – the name for the job that will run on AWS Batch
• job_definition (str) – the job definition name on AWS Batch
• job_queue (str) – the queue name on AWS Batch
• overrides (dict) – the same parameter that boto3 will receive on con-
tainerOverrides (templated): http://boto3.readthedocs.io/en/latest/reference/services/batch.
html#submit_job
• max_retries (int) – exponential backoff retries while waiter is not merged, 4200 = 48
hours
• aws_conn_id (str) – connection id of AWS credentials / region name. If None, the
default boto3 credential strategy will be used (http://boto3.readthedocs.io/en/latest/guide/configuration.
html).
• region_name (str) – region name to use in AWS Hook. Override the region_name in
connection (if provided)
class airflow.contrib.operators.bigquery_check_operator.BigQueryCheckOperator(**kwargs)
Bases: airflow.operators.check_operator.CheckOperator
Performs checks against BigQuery. The BigQueryCheckOperator expects a sql query that will return a
single row. Each value on that first row is evaluated using python bool casting. If any of the values return
False the check is failed and errors out.
Note that Python bool casting evals the following as False:
• False
• 0
• Empty string ("")
• Empty list ([])
• Empty dictionary or set ({})
Given a query like SELECT COUNT(*) FROM foo, it will fail only if the count == 0. You can craft much
more complex queries that could, for instance, check that the table has the same number of rows as the source
table upstream, or that the count of today’s partition is greater than yesterday’s partition, or that a set of metrics
are less than 3 standard deviations from the 7-day average.
This operator can be used as a data quality check in your pipeline, and depending on where you put it in your
DAG, you have the choice to stop the critical path, preventing it from publishing dubious data, or to run it on the side and
receive email alerts without stopping the progress of the DAG.
Parameters
• sql (string) – the sql to be executed
• bigquery_conn_id (string) – reference to the BigQuery database
• use_legacy_sql (boolean) – Whether to use legacy SQL (true) or standard SQL
(false).
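For instance, a hedged sketch of a data quality gate on today’s partition (the table name, partition column and connection id are assumptions for this example):

    check_rows_today = BigQueryCheckOperator(
        task_id='check_rows_today',
        sql="SELECT COUNT(*) FROM `my_project.my_dataset.events` WHERE ds = '{{ ds }}'",
        use_legacy_sql=False,
        bigquery_conn_id='bigquery_default',
        dag=dag)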
class airflow.contrib.operators.bigquery_check_operator.BigQueryValueCheckOperator(**kwargs)
Bases: airflow.operators.check_operator.ValueCheckOperator
Performs a simple value check using sql code.
Parameters
• sql (string) – the sql to be executed
• use_legacy_sql (boolean) – Whether to use legacy SQL (true) or standard SQL
(false).
class airflow.contrib.operators.bigquery_check_operator.BigQueryIntervalCheckOperator(**kwargs)
Bases: airflow.operators.check_operator.IntervalCheckOperator
Checks that the values of metrics given as SQL expressions are within a certain tolerance of the ones from
days_back before.
This operator constructs and runs a query that compares each metric for the current day (ds) against its value from days_back before, and fails if a ratio exceeds its threshold.
Parameters
• table (str) – the table name
• days_back (int) – number of days between ds and the ds we want to check against.
Defaults to 7 days
• metrics_threshold (dict) – a dictionary of ratios indexed by metrics, for example
‘COUNT(*)’: 1.5 would require a 50 percent or less difference between the current day, and
the prior days_back.
• use_legacy_sql (boolean) – Whether to use legacy SQL (true) or standard SQL
(false).
class airflow.contrib.operators.bigquery_get_data.BigQueryGetDataOperator(**kwargs)
Bases: airflow.models.BaseOperator
Fetches the data from a BigQuery table (alternatively fetch data for selected columns) and returns data in a
python list. The number of elements in the returned list will be equal to the number of rows fetched. Each
element in the list will again be a list, where each element represents the column values for that row.
Example Result: [['Tony', '10'], ['Mike', '20'], ['Steve', '15']]
Note: If you pass fields to selected_fields which are in different order than the order of columns already
in BQ table, the data will still be in the order of BQ table. For example if the BQ table has 3 columns as
[A,B,C] and you pass ‘B,A’ in the selected_fields the data would still be of the form 'A,B'.
Example:

    get_data = BigQueryGetDataOperator(
        task_id='get_data_from_bq',
        dataset_id='test_dataset',
        table_id='Transaction_partitions',
        max_results='100',
        selected_fields='DATE',
        bigquery_conn_id='airflow-service-account'
    )
Parameters
• dataset_id (string) – The dataset ID of the requested table. (templated)
• table_id (string) – The table ID of the requested table. (templated)
• max_results (string) – The maximum number of records (rows) to be fetched from
the table. (templated)
• selected_fields (string) – List of fields to return (comma-separated). If unspeci-
fied, all fields are returned.
• bigquery_conn_id (string) – reference to a specific BigQuery hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
class airflow.contrib.operators.bigquery_operator.BigQueryCreateEmptyTableOperator(**kwargs)
Bases: airflow.models.BaseOperator
Creates a new, empty table in the specified BigQuery dataset, optionally with schema.
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly pass
the schema fields in, or you may point the operator to a Google cloud storage object name. The object in Google
cloud storage must be a JSON file with the schema fields in it. You can also create a table without schema.
Parameters
• project_id (string) – The project to create the table into. (templated)
• dataset_id (string) – The dataset to create the table into. (templated)
• table_id (string) – The Name of the table to be created. (templated)
• schema_fields (list) – If set, the schema field list as defined here: https://cloud.
google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema
Example (schema stored as a JSON object in Google Cloud Storage):

    CreateTable = BigQueryCreateEmptyTableOperator(
        task_id='BigQueryCreateEmptyTableOperator_task',
        dataset_id='ODS',
        table_id='Employees',
        project_id='internal-gcp-project',
        gcs_schema_object='gs://schema-bucket/employee_schema.json',
        bigquery_conn_id='airflow-service-account',
        google_cloud_storage_conn_id='airflow-service-account'
    )

Corresponding schema file (employee_schema.json):

    [
        {
            "mode": "NULLABLE",
            "name": "emp_name",
            "type": "STRING"
        },
        {
            "mode": "REQUIRED",
            "name": "salary",
            "type": "INTEGER"
        }
    ]

Example (schema passed in directly via schema_fields):

    CreateTable = BigQueryCreateEmptyTableOperator(
        task_id='BigQueryCreateEmptyTableOperator_task',
        dataset_id='ODS',
        table_id='Employees',
        project_id='internal-gcp-project',
        schema_fields=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
                       {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}],
        bigquery_conn_id='airflow-service-account',
        google_cloud_storage_conn_id='airflow-service-account'
    )
class airflow.contrib.operators.bigquery_operator.BigQueryCreateExternalTableOperator(**kwargs)
Bases: airflow.models.BaseOperator
Creates a new external table in the dataset with the data in Google Cloud Storage.
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly
pass the schema fields in, or you may point the operator to a Google cloud storage object name. The object in
Google cloud storage must be a JSON file with the schema fields in it.
Parameters
• bucket (string) – The bucket to point the external table to. (templated)
• source_objects (list) – List of Google cloud storage URIs to point table to. (tem-
plated) If source_format is ‘DATASTORE_BACKUP’, the list must only contain a single
URI.
• destination_project_dataset_table (string) – The dotted
(<project>.)<dataset>.<table> BigQuery table to load data into (templated). If <project> is
not included, project will be the project defined in the connection json.
• schema_fields (list) – If set, the schema field list as defined here: https://cloud.
google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema
Example:
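The following is a hedged sketch only; the bucket, object paths and schema are illustrative, and source_format plus the connection ids come from the operator’s full signature rather than the truncated parameter list above:

    create_external_table = BigQueryCreateExternalTableOperator(
        task_id='create_external_table',
        bucket='my-data-bucket',
        source_objects=['exports/2018/*.csv'],
        destination_project_dataset_table='my_project.my_dataset.events_external',
        schema_fields=[{'name': 'event_id', 'type': 'STRING', 'mode': 'REQUIRED'},
                       {'name': 'event_ts', 'type': 'TIMESTAMP', 'mode': 'NULLABLE'}],
        source_format='CSV',
        bigquery_conn_id='bigquery_default',
        google_cloud_storage_conn_id='google_cloud_default',
        dag=dag)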
class airflow.contrib.operators.bigquery_operator.BigQueryCreateEmptyDatasetOperator(**kwargs)
Bases: airflow.models.BaseOperator
This operator is used to create a new dataset for your project in BigQuery. https://cloud.google.com/bigquery/
docs/reference/rest/v2/datasets#resource
Parameters
• project_id (str) – The name of the project where we want to create the dataset. Not
needed if projectId is provided in dataset_reference.
• dataset_id (str) – The id of the dataset. Not needed if datasetId is provided in
dataset_reference.
• dataset_reference – Dataset reference that could be provided with request body.
More info: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource
class airflow.contrib.operators.bigquery_operator.BigQueryOperator(**kwargs)
Bases: airflow.models.BaseOperator
Executes BigQuery SQL queries in a specific BigQuery database.
Parameters
• priority (string) – Specifies a priority for the query. Possible values include INTER-
ACTIVE and BATCH. The default value is INTERACTIVE.
• time_partitioning (dict) – configure optional time partitioning fields i.e. partition
by field, type and expiration as per API specifications.
• cluster_fields (list of str) – Request that the result of this query be stored
sorted by one or more columns. This is only available in conjunction with time_partitioning.
The order of columns given determines the sort order.
• location (str) – The geographic location of the job. Required except for US and EU.
See details at https://cloud.google.com/bigquery/docs/locations#specifying_your_location
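As a hedged sketch (the sql, destination_dataset_table and write_disposition arguments belong to the operator’s full signature and, like the table names, are assumptions relative to the truncated parameter list above):

    aggregate_events = BigQueryOperator(
        task_id='aggregate_events',
        sql="SELECT ds, COUNT(*) AS n FROM `my_project.my_dataset.events` GROUP BY ds",
        destination_dataset_table='my_project.my_dataset.daily_counts',
        write_disposition='WRITE_TRUNCATE',
        use_legacy_sql=False,
        priority='BATCH',
        dag=dag)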
class airflow.contrib.operators.bigquery_table_delete_operator.BigQueryTableDeleteOperator(**kwargs)
Bases: airflow.models.BaseOperator
Deletes BigQuery tables
Parameters
• deletion_dataset_table (string) – A dotted
(<project>.|<project>:)<dataset>.<table> that indicates which table will be deleted.
(templated)
• bigquery_conn_id (string) – reference to a specific BigQuery hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• ignore_if_missing (boolean) – if True, then return success even if the requested
table does not exist.
class airflow.contrib.operators.bigquery_to_bigquery.BigQueryToBigQueryOperator(**kwargs)
Bases: airflow.models.BaseOperator
Copies data from one BigQuery table to another.
See also:
For more details about these parameters: https://cloud.google.com/bigquery/docs/reference/v2/jobs#
configuration.copy
Parameters
• source_project_dataset_tables (list|string) – One or more dotted
(project:|project.)<dataset>.<table> BigQuery tables to use as the source data. If <project>
is not included, project will be the project defined in the connection json. Use a list if there
are multiple source tables. (templated)
• destination_project_dataset_table (string) – The destination BigQuery
table. Format is: (project:|project.)<dataset>.<table> (templated)
• write_disposition (string) – The write disposition if the table already exists.
• create_disposition (string) – The create disposition if the table doesn’t exist.
• bigquery_conn_id (string) – reference to a specific BigQuery hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery
class airflow.contrib.operators.bigquery_to_gcs.BigQueryToCloudStorageOperator(**kwargs)
Bases: airflow.models.BaseOperator
Transfers a BigQuery table to a Google Cloud Storage bucket.
See also:
For more details about these parameters: https://cloud.google.com/bigquery/docs/reference/v2/jobs
Parameters
• source_project_dataset_table (string) – The dotted (<project>.
|<project>:)<dataset>.<table> BigQuery table to use as the source data. If
<project> is not included, project will be the project defined in the connection json. (tem-
plated)
• destination_cloud_storage_uris (list) – The destination Google Cloud Stor-
age URI (e.g. gs://some-bucket/some-file.txt). (templated) Follows convention defined here:
https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple
• compression (string) – Type of compression to use.
• export_format (string) – File format to export.
• field_delimiter (string) – The delimiter to use when extracting to a CSV.
• print_header (boolean) – Whether to print a header for a CSV file extract.
• bigquery_conn_id (string) – reference to a specific BigQuery hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery
class airflow.contrib.operators.databricks_operator.DatabricksSubmitRunOperator(**kwargs)
Bases: airflow.models.BaseOperator
Submits a Spark job run to Databricks using the api/2.0/jobs/runs/submit API endpoint.
There are two ways to instantiate this operator.
In the first way, you can take the JSON payload that you typically use to call the api/2.0/jobs/runs/
submit endpoint and pass it directly to our DatabricksSubmitRunOperator through the json param-
eter. For example
    json = {
        'new_cluster': {
            'spark_version': '2.1.0-db3-scala2.11',
            'num_workers': 2
        },
        'notebook_task': {
            'notebook_path': '/Users/[email protected]/PrepareData',
        },
    }
    notebook_run = DatabricksSubmitRunOperator(task_id='notebook_run', json=json)
Another way to accomplish the same thing is to use the named parameters of the
DatabricksSubmitRunOperator directly. Note that there is exactly one named parameter for
each top level parameter in the runs/submit endpoint. In this method, your code would look like this:
    new_cluster = {
        'spark_version': '2.1.0-db3-scala2.11',
        'num_workers': 2
    }
    notebook_task = {
        'notebook_path': '/Users/[email protected]/PrepareData',
    }
    notebook_run = DatabricksSubmitRunOperator(
        task_id='notebook_run',
        new_cluster=new_cluster,
        notebook_task=notebook_task)
In the case where both the json parameter AND the named parameters are provided, they will be merged together.
If there are conflicts during the merge, the named parameters will take precedence and override the top level
json keys.
Currently the named parameters that DatabricksSubmitRunOperator supports are
• spark_jar_task
• notebook_task
• new_cluster
• existing_cluster_id
• libraries
• run_name
• timeout_seconds
Parameters
• json (dict) – A JSON object containing API parameters which will be passed directly
to the api/2.0/jobs/runs/submit endpoint. The other named parameters (i.e.
spark_jar_task, notebook_task..) to this operator will be merged with this json
dictionary if they are provided. If there are conflicts during the merge, the named parameters
will take precedence and override the top level json keys. (templated)
See also:
For more information about templating see Jinja Templating. https://docs.databricks.com/
api/latest/jobs.html#runs-submit
• spark_jar_task (dict) – The main class and parameters for the JAR task. Note
that the actual JAR is specified in the libraries. EITHER spark_jar_task OR
notebook_task should be specified. This field will be templated.
See also:
https://docs.databricks.com/api/latest/jobs.html#jobssparkjartask
• notebook_task (dict) – The notebook path and parameters for the notebook task.
EITHER spark_jar_task OR notebook_task should be specified. This field will
be templated.
See also:
https://docs.databricks.com/api/latest/jobs.html#jobsnotebooktask
• new_cluster (dict) – Specs for a new cluster on which this task will be run. EITHER
new_cluster OR existing_cluster_id should be specified. This field will be
templated.
See also:
https://docs.databricks.com/api/latest/jobs.html#jobsclusterspecnewcluster
• existing_cluster_id (string) – ID for existing cluster on which to run this task.
EITHER new_cluster OR existing_cluster_id should be specified. This field
will be templated.
• libraries (list of dicts) – Libraries which this run will use. This field will be
templated.
See also:
https://docs.databricks.com/api/latest/libraries.html#managedlibrarieslibrary
• run_name (string) – The run name used for this task. By default this will be set
to the Airflow task_id. This task_id is a required parameter of the superclass
BaseOperator. This field will be templated.
• timeout_seconds (int32) – The timeout for this run. By default a value of 0 is used
which means to have no timeout. This field will be templated.
• databricks_conn_id (string) – The name of the Airflow connection to use. By
default and in the common case this will be databricks_default. To use token based
authentication, provide the key token in the extra field for the connection.
• polling_period_seconds (int) – Controls the rate at which we poll for the result of
this run. By default the operator will poll every 30 seconds.
• databricks_retry_limit (int) – Number of times to retry if the Databricks backend
is unreachable. Its value must be greater than or equal to 1.
• databricks_retry_delay (float) – Number of seconds to wait between retries (it
might be a floating point number).
• do_xcom_push (boolean) – Whether we should push run_id and run_page_url to xcom.
class airflow.contrib.operators.dataflow_operator.DataFlowJavaOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Java Cloud DataFlow batch job. The parameters of the operation will be passed to the job.
See also:
For more detail on job submission have a look at the reference: https://cloud.google.com/dataflow/pipelines/
specifying-exec-params
Parameters
• jar (string) – The reference to a self executing DataFlow jar.
• dataflow_default_options (dict) – Map of default job options.
• options (dict) – Map of job specific options.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Plat-
form for the dataflow job status while the job is in the JOB_STATE_RUNNING state.
• job_class (string) – The name of the dataflow job class to be executed, it is often
not the main class configured in the dataflow jar file.
Both jar and options are templated so you can use variables in them.
Note that both dataflow_default_options and options will be merged to specify pipeline execution
parameter, and dataflow_default_options is expected to save high-level options, for instances, project
and zone information, which apply to all dataflow operators in the DAG.
It’s a good practice to define dataflow_* parameters in the default_args of the dag like the project, zone and
staging location.
    default_args = {
        'dataflow_default_options': {
            'project': 'my-gcp-project',
            'zone': 'europe-west1-d',
            'stagingLocation': 'gs://my-staging-bucket/staging/'
        }
    }
You need to pass the path to your dataflow as a file reference with the jar parameter, the jar needs to
be a self executing jar (see documentation here: https://beam.apache.org/documentation/runners/dataflow/
#self-executing-jar). Use options to pass on options to your job.
    t1 = DataFlowJavaOperator(
        task_id='dataflow_example',
        jar='{{var.value.gcp_dataflow_base}}pipeline/build/libs/pipeline-example-1.0.jar',
        options={
            'autoscalingAlgorithm': 'BASIC',
            'maxNumWorkers': '50',
            'start': '{{ds}}',
            'partitionType': 'DAY',
            'labels': {'foo': 'bar'}
        },
        gcp_conn_id='gcp-airflow-service-account',
        dag=my_dag)
class airflow.contrib.operators.dataflow_operator.DataflowTemplateOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Templated Cloud DataFlow batch job. The parameters of the operation will be passed to the job.
Parameters
• template (string) – The reference to the DataFlow template.
• dataflow_default_options (dict) – Map of default job environment options.
• parameters (dict) – Map of job specific parameters for the template.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Plat-
form for the dataflow job status while the job is in the JOB_STATE_RUNNING state.
It’s a good practice to define dataflow_* parameters in the default_args of the dag like the project, zone and
staging location.
See also:
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters https://cloud.google.
com/dataflow/docs/reference/rest/v1b3/RuntimeEnvironment
    default_args = {
        'dataflow_default_options': {
            'project': 'my-gcp-project',
            'zone': 'europe-west1-d',
            'tempLocation': 'gs://my-staging-bucket/staging/'
        }
    }
You need to pass the path to your dataflow template as a file reference with the template parameter. Use
parameters to pass on parameters to your job. Use environment to pass on runtime environment variables
to your job.
    t1 = DataflowTemplateOperator(
        task_id='dataflow_example',
        template='{{var.value.gcp_dataflow_base}}',
        parameters={
            'inputFile': "gs://bucket/input/my_input.txt",
            'outputFile': "gs://bucket/output/my_output.txt"
        },
        gcp_conn_id='gcp-airflow-service-account',
        dag=my_dag)
template, dataflow_default_options and parameters are templated so you can use variables in
them.
Note that dataflow_default_options is expected to save high-level options for project information,
which apply to all dataflow operators in the DAG.
See also:
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/RuntimeEnvironment
For more detail on job template execution have a look at the reference: https://cloud.google.com/dataflow/docs/
templates/executing-templates
class airflow.contrib.operators.dataflow_operator.DataFlowPythonOperator(**kwargs)
Bases: airflow.models.BaseOperator
Launches Cloud Dataflow jobs written in Python. Note that both dataflow_default_options and options will
be merged to specify pipeline execution parameters, and dataflow_default_options is expected to save high-level
options, for instance, project and zone information, which apply to all dataflow operators in the DAG.
See also:
For more detail on job submission have a look at the reference: https://cloud.google.com/dataflow/pipelines/
specifying-exec-params
Parameters
• py_file (string) – Reference to the python dataflow pipeline file.py, e.g.,
/some/local/file/path/to/your/python/pipeline/file.
execute(context)
Execute the python dataflow job.
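A minimal sketch mirroring the Java example above (the pipeline path, options and connection id are assumptions for this example):

    t2 = DataFlowPythonOperator(
        task_id='dataflow_python_example',
        py_file='/home/airflow/dags/pipeline/wordcount.py',
        options={
            'input': 'gs://my-bucket/input/*.txt',
            'output': 'gs://my-bucket/output/counts'
        },
        dataflow_default_options={
            'project': 'my-gcp-project',
            'staging_location': 'gs://my-staging-bucket/staging/'
        },
        gcp_conn_id='gcp-airflow-service-account',
        dag=dag)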
class airflow.contrib.operators.dataproc_operator.DataprocClusterCreateOperator(**kwargs)
Bases: airflow.models.BaseOperator
Create a new cluster on Google Cloud Dataproc. The operator will wait until the creation is successful or an
error occurs in the creation process.
The parameters allow you to configure the cluster. Please refer to
https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters
for a detailed explanation on the different parameters. Most of the configuration parameters detailed in the link
are available as a parameter to this operator.
Parameters
• cluster_name (string) – The name of the DataProc cluster to create. (templated)
• project_id (str) – The ID of the google cloud project in which to create the cluster.
(templated)
• num_workers (int) – The # of workers to spin up. If set to zero will spin up cluster in a
single node mode
• storage_bucket (string) – The storage bucket to use, setting to None lets dataproc
generate a custom one for you
• init_actions_uris (list[string]) – List of GCS URIs containing dataproc ini-
tialization scripts
• init_action_timeout (string) – Amount of time executable scripts in
init_actions_uris have to complete
• metadata (dict) – dict of key-value google compute engine metadata entries to add to
all instances
• image_version (string) – the version of software inside the Dataproc cluster
• custom_image (string) – custom Dataproc image; for more info see https://cloud.google.com/
dataproc/docs/guides/dataproc-images
• properties (dict) – dict of properties to set on config files (e.g. spark-defaults.conf),
see https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#
SoftwareConfig
• master_machine_type (string) – Compute engine machine type to use for the mas-
ter node
• master_disk_type (string) – Type of the boot disk for the master node (de-
fault is pd-standard). Valid values: pd-ssd (Persistent Disk Solid State Drive) or
pd-standard (Persistent Disk Hard Disk Drive).
• master_disk_size (int) – Disk size for the master node
• worker_machine_type (string) – Compute engine machine type to use for the
worker nodes
• worker_disk_type (string) – Type of the boot disk for the worker node (de-
fault is pd-standard). Valid values: pd-ssd (Persistent Disk Solid State Drive) or
pd-standard (Persistent Disk Hard Disk Drive).
• worker_disk_size (int) – Disk size for the worker nodes
• num_preemptible_workers (int) – The # of preemptible worker nodes to spin up
• labels (dict) – dict of labels to add to the cluster
• zone (string) – The zone where the cluster will be located. (templated)
• network_uri (string) – The network uri to be used for machine communication, can-
not be specified with subnetwork_uri
• subnetwork_uri (string) – The subnetwork uri to be used for machine communica-
tion, cannot be specified with network_uri
• internal_ip_only (bool) – If true, all instances in the cluster will only have internal
IP addresses. This can only be enabled for subnetwork enabled networks
• tags (list[string]) – The GCE tags to add to all instances
• region – leave as ‘global’, might become relevant in the future. (templated)
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• service_account (string) – The service account of the dataproc instances.
• service_account_scopes (list[string]) – The URIs of service account scopes
to be included.
• idle_delete_ttl (int) – The longest duration that cluster would keep alive while
staying idle. Passing this threshold will cause cluster to be auto-deleted. A duration in
seconds.
• auto_delete_time (datetime.datetime) – The time when cluster will be auto-
deleted.
• auto_delete_ttl (int) – The life duration of cluster, the cluster will be auto-deleted
at the end of this duration. A duration in seconds. (If auto_delete_time is set this parameter
will be ignored)
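A hedged sketch pulling together a few of the parameters above (the project, zone, machine types and bucket are assumptions for this example):

    create_cluster = DataprocClusterCreateOperator(
        task_id='create_dataproc_cluster',
        cluster_name='analysis-{{ ds_nodash }}',
        project_id='my-gcp-project',
        num_workers=2,
        zone='europe-west1-d',
        master_machine_type='n1-standard-4',
        worker_machine_type='n1-standard-4',
        storage_bucket='my-dataproc-staging-bucket',
        labels={'env': 'dev'},
        dag=dag)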
class airflow.contrib.operators.dataproc_operator.DataprocClusterScaleOperator(**kwargs)
Bases: airflow.models.BaseOperator
Scale, up or down, a cluster on Google Cloud Dataproc. The operator will wait until the cluster is re-scaled.
Example:
    t1 = DataprocClusterScaleOperator(
        task_id='dataproc_scale',
        project_id='my-project',
        cluster_name='cluster-1',
        num_workers=10,
        num_preemptible_workers=10,
        graceful_decommission_timeout='1h',
        dag=dag)
See also:
For more detail about scaling clusters have a look at the reference: https://cloud.google.com/dataproc/docs/
concepts/configuring-clusters/scaling-clusters
Parameters
• cluster_name (string) – The name of the cluster to scale. (templated)
• project_id (string) – The ID of the google cloud project in which the cluster runs.
(templated)
• region (string) – The region for the dataproc cluster. (templated)
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• num_workers (int) – The new number of workers
• num_preemptible_workers (int) – The new number of preemptible workers
• graceful_decommission_timeout (string) – Timeout for graceful YARN de-
commissioning. Maximum value is 1d
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
class airflow.contrib.operators.dataproc_operator.DataprocClusterDeleteOperator(**kwargs)
Bases: airflow.models.BaseOperator
Delete a cluster on Google Cloud Dataproc. The operator will wait until the cluster is destroyed.
Parameters
• cluster_name (string) – The name of the cluster to delete. (templated)
• project_id (string) – The ID of the google cloud project in which the cluster runs.
(templated)
• region (string) – leave as ‘global’, might become relevant in the future. (templated)
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
class airflow.contrib.operators.dataproc_operator.DataProcPigOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Pig query Job on a Cloud DataProc cluster. The parameters of the operation will be passed to the cluster.
It’s a good practice to define dataproc_* parameters in the default_args of the dag like the cluster name and
UDFs.
    default_args = {
        'cluster_name': 'cluster-1',
        'dataproc_pig_jars': [
            'gs://example/udf/jar/datafu/1.2.0/datafu.jar',
            'gs://example/udf/jar/gpig/1.2/gpig.jar'
        ]
    }
You can pass a pig script as string or file reference. Use variables to pass on variables for the pig script to be
resolved on the cluster or use the parameters to be resolved in the script as template parameters.
Example:
    t1 = DataProcPigOperator(
        task_id='dataproc_pig',
        query='a_pig_script.pig',
        variables={'out': 'gs://example/output/{{ds}}'},
        dag=dag)
See also:
For more detail about job submission have a look at the reference: https://cloud.google.com/dataproc/
reference/rest/v1/projects.regions.jobs
Parameters
• query (string) – The query or reference to the query file (pg or pig extension). (tem-
plated)
• query_uri (string) – The uri of a pig script on Cloud Storage.
• variables (dict) – Map of named parameters for the query. (templated)
• job_name (string) – The job name used in the DataProc cluster. This name by default
is the task_id appended with the execution date, but can be templated. The name will always
be appended with a random number to avoid name clashes. (templated)
• cluster_name (string) – The name of the DataProc cluster. (templated)
• dataproc_pig_properties (dict) – Map for the Pig properties. Ideal to put in
default arguments
• dataproc_pig_jars (list) – URIs to jars provisioned in Cloud Storage (example:
for UDFs and libs) and are ideal to put in default arguments.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• region (str) – The specified region where the dataproc cluster is created.
• job_error_states (list) – Job states that should be considered error states. Any
states in this list will result in an error being raised and failure of the task. Eg, if
the CANCELLED state should also be considered a task failure, pass in ['ERROR',
'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but
could change in the future. Defaults to ['ERROR'].
Variables dataproc_job_id (string) – The actual “jobId” as submitted to the Dataproc API.
This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as
the actual “jobId” submitted to the Dataproc API is appended with an 8 character random string.
class airflow.contrib.operators.dataproc_operator.DataProcHiveOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Hive query Job on a Cloud DataProc cluster.
Parameters
• query (string) – The query or reference to the query file (q extension).
• query_uri (string) – The uri of a hive script on Cloud Storage.
• variables (dict) – Map of named parameters for the query.
• job_name (string) – The job name used in the DataProc cluster. This name by default
is the task_id appended with the execution date, but can be templated. The name will always
be appended with a random number to avoid name clashes.
• cluster_name (string) – The name of the DataProc cluster.
• dataproc_hive_properties (dict) – Map for the Hive properties. Ideal to put in
default arguments
• dataproc_hive_jars (list) – URIs to jars provisioned in Cloud Storage (example:
for UDFs and libs) and are ideal to put in default arguments.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• region (str) – The specified region where the dataproc cluster is created.
• job_error_states (list) – Job states that should be considered error states. Any
states in this list will result in an error being raised and failure of the task. Eg, if
the CANCELLED state should also be considered a task failure, pass in ['ERROR',
'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but
could change in the future. Defaults to ['ERROR'].
Variables dataproc_job_id (string) – The actual “jobId” as submitted to the Dataproc API.
This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as
the actual “jobId” submitted to the Dataproc API is appended with an 8 character random string.
class airflow.contrib.operators.dataproc_operator.DataProcSparkSqlOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Spark SQL query Job on a Cloud DataProc cluster.
Parameters
• query (string) – The query or reference to the query file (q extension). (templated)
• query_uri (string) – The uri of a spark sql script on Cloud Storage.
• variables (dict) – Map of named parameters for the query. (templated)
• job_name (string) – The job name used in the DataProc cluster. This name by default
is the task_id appended with the execution date, but can be templated. The name will always
be appended with a random number to avoid name clashes. (templated)
• cluster_name (string) – The name of the DataProc cluster. (templated)
• dataproc_spark_properties (dict) – Map for the Spark SQL properties. Ideal to put in
default arguments