How To Work With Apache Airflow
Parameters
• bucket_name (string) – The name of the bucket.
• storage_class (string) – This defines how objects in the bucket are stored and
determines the SLA and the cost of storage. Values include
– MULTI_REGIONAL
– REGIONAL
– STANDARD
– NEARLINE
– COLDLINE.
If this value is not specified when the bucket is created, it will default to STANDARD.
• location (string) – The location of the bucket. Object data for objects in the bucket
resides in physical storage within this region. Defaults to US.
See also:
https://developers.google.com/storage/docs/bucket-locations
• project_id (string) – The ID of the GCP Project.
• labels (dict) – User-provided labels, in key/value pairs.
Returns If successful, it returns the id of the bucket.
exists(bucket, object)
Checks for the existence of a file in Google Cloud Storage.
Parameters
• bucket (string) – The Google cloud storage bucket where the object is.
• object (string) – The name of the object to check in the Google cloud storage bucket.
get_conn()
Returns a Google Cloud Storage service object.
get_crc32c(bucket, object)
Gets the CRC32c checksum of an object in Google Cloud Storage.
Parameters
• bucket (string) – The Google cloud storage bucket where the object is.
• object (string) – The name of the object to check in the Google cloud storage bucket.
get_md5hash(bucket, object)
Gets the MD5 hash of an object in Google Cloud Storage.
Parameters
• bucket (string) – The Google cloud storage bucket where the object is.
• object (string) – The name of the object to check in the Google cloud storage bucket.
get_size(bucket, object)
Gets the size of a file in Google Cloud Storage.
Parameters
• bucket (string) – The Google cloud storage bucket where the object is.
• object (string) – The name of the object to check in the Google cloud storage bucket.
insert_bucket_acl(bucket, entity, role, user_project)
Creates a new ACL entry on the specified bucket. See: https://cloud.google.com/storage/docs/json_api/v1/bucketAccessControls/insert
Parameters
• bucket (str) – Name of a bucket.
• entity (str) – The entity holding the permission, in one of the following forms:
user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId,
allUsers, allAuthenticatedUsers. See: https://cloud.google.com/storage/docs/access-control/lists#scopes
• role (str) – The access permission for the entity. Acceptable values are: “OWNER”,
“READER”, “WRITER”.
• user_project (str) – (Optional) The project to be billed for this request. Required
for Requester Pays buckets.
insert_object_acl(bucket, object_name, entity, role, generation, user_project)
Creates a new ACL entry on the specified object. See: https://cloud.google.com/storage/docs/json_api/v1/objectAccessControls/insert
Parameters
• bucket (str) – Name of a bucket.
• object_name (str) – Name of the object. For information about how to URL encode object
names to be path safe, see: https://cloud.google.com/storage/docs/json_api/#encoding
• entity (str) – The entity holding the permission, in one of the following forms:
user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId,
allUsers, allAuthenticatedUsers. See: https://cloud.google.com/storage/docs/access-control/lists#scopes
• role (str) – The access permission for the entity. Acceptable values are: “OWNER”,
“READER”.
• generation (str) – (Optional) If present, selects a specific revision of this object (as
opposed to the latest version, the default).
• user_project (str) – (Optional) The project to be billed for this request. Required
for Requester Pays buckets.
is_updated_after(bucket, object, ts)
Checks if an object is updated in Google Cloud Storage.
Parameters
• bucket (string) – The Google cloud storage bucket where the object is.
• object (string) – The name of the object to check in the Google cloud storage bucket.
• ts (datetime) – The timestamp to check against.
list(bucket, versions=None, maxResults=None, prefix=None, delimiter=None)
List all objects from the bucket with the given string prefix in the object name.
Parameters
• bucket (string) – bucket name
• versions (boolean) – if true, list all versions of the objects
• maxResults (integer) – max count of items to return in a single page of responses
• prefix (string) – prefix string which filters objects whose names begin with this prefix
• delimiter (string) – filters objects based on the delimiter (e.g. '.csv')
Returns a stream of object names matching the filtering criteria
rewrite(source_bucket, source_object, destination_bucket, destination_object=None)
Has the same functionality as copy, except that it will work on files over 5 TB, as well as when copying
between locations and/or storage classes.
destination_object can be omitted, in which case source_object is used.
Parameters
• source_bucket (string) – The bucket of the object to copy from.
• source_object (string) – The object to copy.
• destination_bucket (string) – The destination bucket the object is copied to.
• destination_object – The (renamed) path of the object if given. Can be omitted;
then the same name is used.
upload(bucket, object, filename, mime_type='application/octet-stream', gzip=False, multipart=False,
num_retries=0)
Uploads a local file to Google Cloud Storage.
Parameters
• bucket (string) – The bucket to upload to.
• object (string) – The object name to set when uploading the local file.
• filename (string) – The local file path to the file to be uploaded.
• mime_type (str) – The MIME type to set when uploading the file.
• gzip (bool) – Option to compress file for upload
• multipart (bool or int) – If True, the upload will be split into multiple HTTP
requests. The default size is 256MiB per request. Pass a number instead of True to specify
the request size, which must be a multiple of 262144 (256KiB).
• num_retries (int) – The number of times to attempt to re-upload the file (or individual
chunks, in the case of multipart uploads). Retries are attempted with exponential backoff.
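For example, a minimal usage sketch combining the methods above (this assumes the contrib hook class airflow.contrib.hooks.gcs_hook.GoogleCloudStorageHook, a configured google_cloud_default connection, and illustrative bucket and object names):

from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook

hook = GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default')

# Upload a local file, then verify it exists and inspect its size and checksum.
hook.upload(bucket='my-bucket', object='data/report.csv',
            filename='/tmp/report.csv', mime_type='text/csv')
if hook.exists(bucket='my-bucket', object='data/report.csv'):
    print(hook.get_size(bucket='my-bucket', object='data/report.csv'))
    print(hook.get_crc32c(bucket='my-bucket', object='data/report.csv'))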
GCPTransferServiceHook
class airflow.contrib.hooks.gcp_transfer_hook.GCPTransferServiceHook(api_version='v1',
gcp_conn_id='google_cloud_default', delegate_to=None)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Hook for GCP Storage Transfer Service.
get_conn()
Retrieves connection to Google Storage Transfer service.
Returns Google Storage Transfer service object
Return type dict
GKEClusterCreateOperator
GKEClusterDeleteOperator
GKEPodOperator
3.16.6 Qubole
Apache Airflow has a native operator and hooks to talk to Qubole, which lets you submit your big data jobs directly
to Qubole from Apache Airflow.
3.16.6.1 QuboleOperator
3.16.6.2 QubolePartitionSensor
3.16.6.3 QuboleFileSensor
3.16.6.4 QuboleCheckOperator
3.16.6.5 QuboleValueCheckOperator
3.17 Metrics
3.17.1 Configuration
[scheduler]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
3.17.2 Counters
Name – Description
<job_name>_start – Number of started <job_name> jobs, e.g. SchedulerJob, LocalTaskJob
<job_name>_end – Number of ended <job_name> jobs, e.g. SchedulerJob, LocalTaskJob
operator_failures_<operator_name> – Operator <operator_name> failures
operator_successes_<operator_name> – Operator <operator_name> successes
ti_failures – Overall task instance failures
ti_successes – Overall task instance successes
zombies_killed – Zombie tasks killed
scheduler_heartbeat – Scheduler heartbeats
3.17.3 Gauges
Name – Description
collect_dags – Seconds taken to scan and import DAGs
dagbag_import_errors – DAG import errors
dagbag_size – DAG bag size
3.17.4 Timers
Name – Description
dagrun.dependency-check.<dag_id> – Seconds taken to check DAG dependencies
3.18 Kubernetes
The Kubernetes executor was introduced in Apache Airflow 1.10.0. The Kubernetes executor will create a new pod for
every task instance.
Example helm charts are available at scripts/ci/kubernetes/kube/{airflow,volumes,postgres}.yaml in the source
distribution. The volumes are optional and depend on your configuration. There are two volumes available:
• Dags: by storing all the DAGs on a persistent disk, all the workers can read the DAGs from there. Another
option is using git-sync: before starting the container, a git pull of the DAGs repository is performed and that
checkout is used throughout the lifecycle of the pod.
• Logs: by storing the logs on a persistent disk, all the logs will be available to all the workers and the webserver
itself. If you don't configure this, the logs will be lost after the worker pods shut down. Another option is to
use S3/GCS/etc. to store the logs.
from airflow.contrib.kubernetes.volume import Volume

volume_config = {
    'persistentVolumeClaim': {
        'claimName': 'test-volume'
    }
}
volume = Volume(name='test-volume', configs=volume_config)
affinity = {
'nodeAffinity': {
'preferredDuringSchedulingIgnoredDuringExecution': [
{
    "weight": 1,
    "preference": {
        "matchExpressions": [{
            "key": "disktype",
            "operator": "In",
            "values": ["ssd"]
        }]
    }
}
]
},
"podAffinity": {
"requiredDuringSchedulingIgnoredDuringExecution": [
{
"labelSelector": {
"matchExpressions": [
{
"key": "security",
"operator": "In",
"values": ["S1"]
}
]
},
"topologyKey": "failure-domain.beta.kubernetes.io/zone"
}
]
},
"podAntiAffinity": {
"requiredDuringSchedulingIgnoredDuringExecution": [
{
"labelSelector": {
"matchExpressions": [
{
"key": "security",
"operator": "In",
"values": ["S2"]
}
]
},
"topologyKey": "kubernetes.io/hostname"
}
]
}
}
tolerations = [
{
'key': "key",
'operator': 'Equal',
'value': 'value'
}
]
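The operator call below also references secret_file, secret_env and volume_mount, whose definitions are not included in this excerpt. A minimal sketch of how they might be created, assuming the airflow.contrib.kubernetes Secret and VolumeMount classes and illustrative secret and mount names:

from airflow.contrib.kubernetes.secret import Secret
from airflow.contrib.kubernetes.volume_mount import VolumeMount

# Expose one Kubernetes secret as a mounted file and another as an environment variable.
secret_file = Secret('volume', '/etc/sql_conn', 'airflow-secrets', 'sql_alchemy_conn')
secret_env = Secret('env', 'SQL_CONN', 'airflow-secrets', 'sql_alchemy_conn')

# Mount the persistent volume declared above into the pod.
volume_mount = VolumeMount('test-volume', mount_path='/root/mount_file',
                           sub_path=None, read_only=True)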
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

k = KubernetesPodOperator(namespace='default',
                          image="ubuntu:16.04",
                          cmds=["bash", "-cx"],
                          arguments=["echo", "10"],
                          labels={"foo": "bar"},
                          secrets=[secret_file, secret_env],
                          volumes=[volume],
                          volume_mounts=[volume_mount],
                          name="test",
                          task_id="task",
                          affinity=affinity,
                          is_delete_operator_pod=True,
                          hostnetwork=False,
                          tolerations=tolerations
                          )
3.19 Lineage
Airflow can help track origins of data, what happens to it and where it moves over time. This can aid in building audit
trails and data governance, and it also helps with debugging data flows.
Airflow tracks data by means of inlets and outlets of the tasks. Let’s work from an example and see how it works.
import airflow
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.lineage.datasets import File
from airflow.models import DAG
from datetime import timedelta
args = {
'owner': 'airflow',
'start_date': airflow.utils.dates.days_ago(2)
}
dag = DAG(
dag_id='example_lineage', default_args=args,
schedule_interval='0 0 * * *',
dagrun_timeout=timedelta(minutes=60))
f_final = File("/tmp/final")
run_this_last = DummyOperator(task_id='run_this_last', dag=dag,
inlets={"auto": True},
outlets={"datasets": [f_final,]})
f_in = File("/tmp/whole_directory/")
Tasks take the parameters inlets and outlets. Inlets can be defined manually as a list of datasets ({"datasets": [dataset1,
dataset2]}), can be configured to look for outlets from upstream tasks ({"task_ids": ["task_id1", "task_id2"]}), can
be configured to pick up outlets from direct upstream tasks ({"auto": True}), or a combination of them. Outlets are
defined as a list of datasets ({"datasets": [dataset1, dataset2]}). Any fields of the datasets are templated with the context
when the task is being executed.
Note: Operators can add inlets and outlets automatically if the operator supports it.
In the example DAG, the task run_me_first is a BashOperator that takes three inlets (CAT1, CAT2, CAT3) generated
from a list; its definition is sketched below. Note that execution_date is a templated field and will be rendered when the task is running.
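The code excerpt above ends before run_me_first is defined. A minimal sketch of how the remainder of the example might look (the file paths and bash_command are illustrative assumptions):

outlets = []
for category in ["CAT1", "CAT2", "CAT3"]:
    # execution_date is a templated field and is rendered at run time.
    f_out = File("/tmp/{}/{{{{ execution_date }}}}".format(category))
    outlets.append(f_out)

run_me_first = BashOperator(task_id='run_me_first', dag=dag,
                            bash_command='echo 1',
                            inlets={"datasets": [f_in]},
                            outlets={"datasets": outlets})
run_me_first.set_downstream(run_this_last)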
Note: Behind the scenes Airflow prepares the lineage metadata as part of the pre_execute method of a task. When
the task has finished execution, post_execute is called and lineage metadata is pushed into XCom. Thus if you are
creating your own operators that override these methods, make sure to decorate them with prepare_lineage and
apply_lineage respectively.
Airflow can send its lineage metadata to Apache Atlas. You need to enable the atlas backend and configure it properly,
e.g. in your airflow.cfg:
[lineage]
backend = airflow.lineage.backend.atlas
[atlas]
username = my_username
password = my_password
host = host
port = 21000
3.20 Changelog
3.20.1.2 Improvements
• [AIRFLOW-3191] Fix not being able to specify execution_date when creating dagrun (#4037)
• [AIRFLOW-3657] Fix zendesk integration (#4466)
• [AIRFLOW-3605] Load plugins from entry_points (#4412)
• [AIRFLOW-3646] Rename plugins_manager.py to test_xx to trigger tests (#4464)
• [AIRFLOW-3655] Escape links generated in model views (#4463)
• [AIRFLOW-3662] Add dependency for Enum (#4468)
• [AIRFLOW-3630] Cleanup of GCP Cloud SQL Connection (#4451)
• [AIRFLOW-1837] Respect task start_date when different from dag’s (#4010)
• [AIRFLOW-2829] Brush up the CI script for minikube
• [AIRFLOW-3519] Fix example http operator (#4455)
• [AIRFLOW-2811] Fix scheduler_ops_metrics.py to work (#3653)
• [AIRFLOW-2751] add job properties update in hive to druid operator.
• [AIRFLOW-2918] Remove unused imports
• [AIRFLOW-2918] Fix Flake8 violations (#3931)
• [AIRFLOW-2771] Add except type to broad S3Hook try catch clauses
• [AIRFLOW-2918] Fix Flake8 violations (#3772)
• [AIRFLOW-2099] Handle getsource() calls gracefully
• [AIRFLOW-3397] Fix integrety error in rbac AirflowSecurityManager (#4305)
• [AIRFLOW-3281] Fix Kubernetes operator with git-sync (#3770)
• [AIRFLOW-2615] Limit DAGs parsing to once only
• [AIRFLOW-2952] Fix Kubernetes CI (#3922)
• [AIRFLOW-2933] Enable Codecov on Docker-CI Build (#3780)
• [AIRFLOW-2082] Resolve a bug in adding password_auth to api as auth method (#4343)
• [AIRFLOW-3612] Remove incubation/incubator mention (#4419)
• [AIRFLOW-3581] Fix next_ds/prev_ds semantics for manual runs (#4385)
• [AIRFLOW-3527] Update Cloud SQL Proxy to have shorter path for UNIX socket (#4350)
• [AIRFLOW-3316] For gcs_to_bq: add missing init of schema_fields var (#4430)
• [AIRFLOW-3583] Fix AirflowException import (#4389)
• [AIRFLOW-3578] Fix Type Error for BigQueryOperator (#4384)
• [AIRFLOW-2755] Added kubernetes.worker_dags_folder configuration (#3612)
• [AIRFLOW-2655] Fix inconsistency of default config of kubernetes worker
• [AIRFLOW-2645][AIRFLOW-2617] Add worker_container_image_pull_policy
• [AIRFLOW-2661] fix config dags_volume_subpath and logs_volume_subpath
• [AIRFLOW-3550] Standardize GKE hook (#4364)
• [AIRFLOW-2863] Fix GKEClusterHook catching wrong exception (#3711)
• [AIRFLOW-3271] Fix issue with persistence of RBAC Permissions modified via UI (#4118)
• [AIRFLOW-3141] Handle duration View for missing dag (#3984)
• [AIRFLOW-2766] Respect shared datetime across tabs
• [AIRFLOW-1413] Fix FTPSensor failing on error message with unexpected (#2450)
• [AIRFLOW-3378] KubernetesPodOperator does not delete on timeout failure (#4218)
• [AIRFLOW-3245] Fix list processing in resolve_template_files (#4086)
• [AIRFLOW-2703] Catch transient DB exceptions from scheduler’s heartbeat it does not crash (#3650)
• [AIRFLOW-1298] Clear UPSTREAM_FAILED using the clean cli (#3886)
3.20.2.2 Improvements
• [AIRFLOW-839] docker_operator.py attempts to log status key without first checking existence
• [AIRFLOW-1104] Concurrency check in scheduler should count queued tasks as well as running
• [AIRFLOW-1163] Add support for x-forwarded-* headers to support access behind AWS ELB
• [AIRFLOW-1195] Cleared tasks in SubDagOperator do not trigger Parent dag_runs
• [AIRFLOW-1508] Skipped state not part of State.task_states
• [AIRFLOW-1762] Use key_file in SSHHook.create_tunnel()
• [AIRFLOW-1837] Differing start_dates on tasks not respected by scheduler.
• [AIRFLOW-1874] Support standard SQL in Check, ValueCheck and IntervalCheck BigQuery operators
• [AIRFLOW-1917] print() from python operators end up with extra new line
• [AIRFLOW-1970] Database cannot be initialized if an invalid fernet key is provided
• [AIRFLOW-2145] Deadlock after clearing a running task
• [AIRFLOW-2216] Cannot specify a profile for AWS Hook to load with s3 config file
• [AIRFLOW-2574] initdb fails when mysql password contains percent sign
• [AIRFLOW-2707] Error accessing log files from web UI
• [AIRFLOW-2716] Replace new Python 3.7 keywords
• [AIRFLOW-2744] RBAC app doesn’t integrate plugins (blueprints etc)
• [AIRFLOW-2772] BigQuery hook does not allow specifying both the partition field name and table name at the
same time
• [AIRFLOW-2778] Bad Import in collect_dag in DagBag
• [AIRFLOW-2786] Variables view fails to render if a variable has an empty key
• [AIRFLOW-2799] Filtering UI objects by datetime is broken
• [AIRFLOW-2800] Remove airflow/ low-hanging linting errors
• [AIRFLOW-2825] S3ToHiveTransfer operator may not may able to handle GZIP file with uppercase ext in S3
• [AIRFLOW-2848] dag_id is missing in metadata table “job” for LocalTaskJob
• [AIRFLOW-2860] DruidHook: time variable is not updated correctly when checking for timeout
• [AIRFLOW-2865] Race condition between on_success_callback and LocalTaskJob’s cleanup
• [AIRFLOW-2893] Stuck dataflow job due to jobName mismatch.
• [AIRFLOW-2895] Prevent scheduler from spamming heartbeats/logs
• [AIRFLOW-2900] Code not visible for Packaged DAGs
• [AIRFLOW-2905] Switch to regional dataflow job service.
• [AIRFLOW-2907] Sendgrid - Attachments - ERROR - Object of type ‘bytes’ is not JSON serializable
• [AIRFLOW-2938] Invalid ‘extra’ field in connection can raise an AttributeError when attempting to edit
• [AIRFLOW-2979] Deprecated Celery Option not in Options list
• [AIRFLOW-2981] TypeError in dataflow operators when using GCS jar or py_file
• [AIRFLOW-2984] Cannot convert naive_datetime when task has a naive start_date/end_date
• [AIRFLOW-2994] flatten_results in BigQueryOperator/BigQueryHook should default to None
• [AIRFLOW-3002] ValueError in dataflow operators when using GCS jar or py_file
• [AIRFLOW-3012] Email on sla miss is send only to first address on the list
• [AIRFLOW-3046] ECS Operator mistakenly reports success when task is killed due to EC2 host termination
• [AIRFLOW-3064] No output from airflow test due to default logging config
• [AIRFLOW-3072] Only admin can view logs in RBAC UI
• [AIRFLOW-3079] Improve initdb to support MSSQL Server
• [AIRFLOW-3089] Google auth doesn’t work under http
• [AIRFLOW-3099] Errors raised when some blocs are missing in airflow.cfg
• [AIRFLOW-3109] Default user permission should contain ‘can_clear’
• [AIRFLOW-3111] Confusing comments and instructions for log templates in UPDATING.md and default_airflow.cfg
• [AIRFLOW-3124] Broken webserver debug mode (RBAC)
• [AIRFLOW-3136] Scheduler Failing the Task retries run while processing Executor Events
• [AIRFLOW-3138] Migration cc1e65623dc7 creates issues with postgres
• [AIRFLOW-3161] Log Url link does not link to task instance logs in RBAC UI
• [AIRFLOW-3162] HttpHook fails to parse URL when port is specified
• [AIRFLOW-3183] Potential Bug in utils/dag_processing/DagFileProcessorManager.max_runs_reached()
• [AIRFLOW-3203] Bugs in DockerOperator & Some operator test scripts were named incorrectly
• [AIRFLOW-3238] Dags, removed from the filesystem, are not deactivated on initdb
• [AIRFLOW-3268] Cannot pass SSL dictionary to mysql connection via URL
• [AIRFLOW-3277] Invalid timezone transition handling for cron schedules
• [AIRFLOW-3295] Require encryption in DaskExecutor when certificates are configured.
• [AIRFLOW-2611] Fix wrong dag volume mount path for kubernetes executor
• [AIRFLOW-2562] Add Google Kubernetes Engine Operators
• [AIRFLOW-2630] Fix classname in test_sql_sensor.py
• [AIRFLOW-2534] Fix bug in HiveServer2Hook
• [AIRFLOW-2586] Stop getting AIRFLOW_HOME value from config file in bash operator
• [AIRFLOW-2605] Fix autocommit for MySqlHook
• [AIRFLOW-2539][AIRFLOW-2359] Move remaing log config to configuration file
• [AIRFLOW-1656] Tree view dags query changed
• [AIRFLOW-2617] add imagePullPolicy config for kubernetes executor
• [AIRFLOW-2429] Fix security/task/sensors/ti_deps folders flake8 error
• [AIRFLOW-2550] Implements API endpoint to list DAG runs
• [AIRFLOW-2512][AIRFLOW-2522] Use google-auth instead of oauth2client
• [AIRFLOW-2429] Fix operators folder flake8 error
• [AIRFLOW-2585] Fix several bugs in CassandraHook and CassandraToGCSOperator
• [AIRFLOW-2597] Restore original dbapi.run() behavior
• [AIRFLOW-2590] Fix commit in DbApiHook.run() for no-autocommit DB
• [AIRFLOW-1115] fix github oauth api URL
• [AIRFLOW-2587] Add TIMESTAMP type mapping to MySqlToHiveTransfer
• [AIRFLOW-2591][AIRFLOW-2581] Set default value of autocommit to False in DbApiHook.run()
• [AIRFLOW-59] Implement bulk_dump and bulk_load for the Postgres hook
• [AIRFLOW-2533] Fix path to DAG’s on kubernetes executor workers
• [AIRFLOW-2581] Fix DbApiHook autocommit
• [AIRFLOW-2578] Add option to use proxies in JiraHook
• [AIRFLOW-2575] Make gcs to gcs operator work with large files
• [AIRFLOW-437] Send TI context in kill zombies
• [AIRFLOW-2566] Change backfill to rerun failed tasks
• [AIRFLOW-1021] Fix double login for new users with LDAP
• [AIRFLOW-XXX] Typo fix
• [AIRFLOW-2561] Fix typo in EmailOperator
• [AIRFLOW-2573] Cast BigQuery TIMESTAMP field to float
• [AIRFLOW-2560] Adding support for internalIpOnly to DataprocClusterCreateOperator
• [AIRFLOW-2565] templatize cluster_label
• [AIRFLOW-83] add mongo hook and operator
• [AIRFLOW-2558] Clear task/dag is clearing all executions
• [AIRFLOW-XXX] Fix doc typos
• [AIRFLOW-2513] Change bql to sql for BigQuery Hooks & Ops
• [AIRFLOW-1575] Add AWS Kinesis Firehose Hook for inserting batch records
• [AIRFLOW-2266][AIRFLOW-2343] Remove google-cloud-dataflow dependency
• [AIRFLOW-2370] Implement –use_random_password in create_user
• [AIRFLOW-2348] Strip path prefix from the destination_object when source_object contains a wildcard[]
• [AIRFLOW-2391] Fix to Flask 0.12.2
• [AIRFLOW-2381] Fix the flaky ApiPasswordTests test
• [AIRFLOW-2378] Add Groupon to list of current users
• [AIRFLOW-2382] Fix wrong description for delimiter
• [AIRFLOW-2380] Add support for environment variables in Spark submit operator.
• [AIRFLOW-2377] Improve Sendgrid sender support
• [AIRFLOW-2331] Support init action timeout on dataproc cluster create
• [AIRFLOW-1835] Update docs: Variable file is json
• [AIRFLOW-1781] Make search case-insensitive in LDAP group
• [AIRFLOW-2042] Fix browser menu appearing over the autocomplete menu
• [AIRFLOW-XXX] Remove wheelhouse files from travis not owned by travis
• [AIRFLOW-2336] Use hmsclient in hive_hook
• [AIRFLOW-2041] Correct Syntax in python examples
• [AIRFLOW-74] SubdagOperators can consume all celeryd worker processes
• [AIRFLOW-2369] Fix gcs tests
• [AIRFLOW-2365] Fix autocommit attribute check
• [AIRFLOW-2068] MesosExecutor allows optional Docker image
• [AIRFLOW-1652] Push DatabricksRunSubmitOperator metadata into XCOM
• [AIRFLOW-2234] Enable insert_rows for PrestoHook
• [AIRFLOW-2208][Airflow-22208] Link to same DagRun graph from TaskInstance view
• [AIRFLOW-1153] Allow HiveOperators to take hiveconfs
• [AIRFLOW-775] Fix autocommit settings with Jdbc hook
• [AIRFLOW-2364] Warn when setting autocommit on a connection which does not support it
• [AIRFLOW-2357] Add persistent volume for the logs
• [AIRFLOW-766] Skip conn.commit() when in Auto-commit
• [AIRFLOW-2351] Check for valid default_args start_date
• [AIRFLOW-1433] Set default rbac to initdb
• [AIRFLOW-2270] Handle removed tasks in backfill
• [AIRFLOW-2344] Fix connections -l to work with pipe/redirect
• [AIRFLOW-2300] Add S3 Select functionarity to S3ToHiveTransfer
• [AIRFLOW-1314] Cleanup the config
• [AIRFLOW-1314] Polish some of the Kubernetes docs/config
• [AIRFLOW-2113] Address missing DagRun callbacks Given that the handle_callback method belongs to the
DAG object, we are able to get the list of task directly with get_task and reduce the communication with the
database, making airflow more lightweight.
• [AIRFLOW-2112] Fix svg width for Recent Tasks on UI.
• [AIRFLOW-2116] Set CI Cloudant version to <2.0
• [AIRFLOW-XXX] Add PMC to list of companies using Airflow
• [AIRFLOW-2100] Fix Broken Documentation Links
• [AIRFLOW-1404] Add ‘flatten_results’ & ‘maximum_bytes_billed’ to BQ Operator
• [AIRFLOW-800] Initialize valid Google BigQuery Connection
• [AIRFLOW-1319] Fix misleading SparkSubmitOperator and SparkSubmitHook docstring
• [AIRFLOW-1983] Parse environment parameter as template
• [AIRFLOW-2095] Add operator to create External BigQuery Table
• [AIRFLOW-2085] Add SparkJdbc operator
• [AIRFLOW-1002] Add ability to clean all dependencies of removed DAG
• [AIRFLOW-2094] Jinjafied project_id, region & zone in DataProc{*} Operators
• [AIRFLOW-2092] Fixed incorrect parameter in docstring for FTPHook
• [AIRFLOW-XXX] Add SocialCops to Airflow users
• [AIRFLOW-2088] Fix duplicate keys in MySQL to GCS Helper function
• [AIRFLOW-2091] Fix incorrect docstring parameter in BigQuery Hook
• [AIRFLOW-2090] Fix typo in DataStore Hook
• [AIRFLOW-1157] Fix missing pools crashing the scheduler
• [AIRFLOW-713] Jinjafy {EmrCreateJobFlow,EmrAddSteps}Operator attributes
• [AIRFLOW-2083] Docs: Use “its” instead of “it’s” where appropriate
• [AIRFLOW-2066] Add operator to create empty BQ table
• [AIRFLOW-XXX] add Karmic to list of companies
• [AIRFLOW-2073] Make FileSensor fail when the file doesn’t exist
• [AIRFLOW-2078] Improve task_stats and dag_stats performance
• [AIRFLOW-2080] Use a log-out icon instead of a power button
• [AIRFLOW-2077] Fetch all pages of list_objects_v2 response
• [AIRFLOW-XXX] Add TM to list of companies
• [AIRFLOW-1985] Impersonation fixes for using run_as_user
• [AIRFLOW-2018][AIRFLOW-2] Make Sensors backward compatible
• [AIRFLOW-XXX] Fix typo in concepts doc (dag_md)
• [AIRFLOW-2069] Allow Bytes to be uploaded to S3
• [AIRFLOW-2074] Fix log var name in GHE auth
• [AIRFLOW-1927] Convert naive datetimes for TaskInstances
• [AIRFLOW-1760] Password auth for experimental API
• [AIRFLOW-681] homepage doc link should pointing to apache repo not airbnb repo
• [AIRFLOW-705][AIRFLOW-706] Fix run_command bugs
• [AIRFLOW-990] Fix Py27 unicode logging in DockerOperator
• [AIRFLOW-963] Fix non-rendered code examples
• [AIRFLOW-969] Catch bad python_callable argument
• [AIRFLOW-984] Enable subclassing of SubDagOperator
• [AIRFLOW-997] Update setup.cfg to point to Apache
• [AIRFLOW-994] Add MiNODES to the official airflow user list
• [AIRFLOW-995][AIRFLOW-1] Update GitHub PR Template
• [AIRFLOW-989] Do not mark dag run successful if unfinished tasks
• [AIRFLOW-903] New configuration setting for the default dag view
• [AIRFLOW-979] Add GovTech GDS
• [AIRFLOW-933] Replace eval with literal_eval to prevent RCE
• [AIRFLOW-974] Fix mkdirs race condition
• [AIRFLOW-917] Fix formatting of error message
• [AIRFLOW-770] Refactor BaseHook so env vars are always read
• [AIRFLOW-900] Double trigger should not kill original task instance
• [AIRFLOW-900] Fixes bugs in LocalTaskJob for double run protection
• [AIRFLOW-932][AIRFLOW-932][AIRFLOW-921][AIRFLOW-910] Do not mark tasks removed when backfilling
• [AIRFLOW-961] run onkill when SIGTERMed
• [AIRFLOW-910] Use parallel task execution for backfills
• [AIRFLOW-967] Wrap strings in native for py2 ldap compatibility
• [AIRFLOW-958] Improve tooltip readability
• [AIRFLOW-959] Cleanup and reorganize .gitignore
• [AIRFLOW-960] Add .editorconfig file
• [AIRFLOW-931] Do not set QUEUED in TaskInstances
• [AIRFLOW-956] Get docs working on readthedocs.org
• [AIRFLOW-954] Fix configparser ImportError
• [AIRFLOW-941] Use defined parameters for psycopg2
• [AIRFLOW-943] Update Digital First Media in users list
• [AIRFLOW-942] Add mytaxi to Airflow users
• [AIRFLOW-939] add .swp to gitginore
• [AIRFLOW-719] Prevent DAGs from ending prematurely
• [AIRFLOW-938] Use test for True in task_stats queries
• [AIRFLOW-937] Improve performance of task_stats
• [AIRFLOW-933] use ast.literal_eval rather eval because ast.literal_eval does not execute input.
• [AIRFLOW-925] Revert airflow.hooks change that cherry-pick picked
• [AIRFLOW-919] Running tasks with no start date shouldn’t break a DAGs UI
• [AIRFLOW-802][AIRFLOW-1] Add spark-submit operator/hook
• [AIRFLOW-725] Use keyring to store credentials for JIRA
• [AIRFLOW-916] Remove deprecated readfp function
• [AIRFLOW-911] Add coloring and timing to tests
• [AIRFLOW-906] Update Code icon from lightning bolt to file
• [AIRFLOW-897] Prevent dagruns from failing with unfinished tasks
• [AIRFLOW-896] Remove unicode to 8-bit conversion in BigQueryOperator
• [AIRFLOW-899] Tasks in SCHEDULED state should be white in the UI instead of black
• [AIRFLOW-895] Address Apache release incompliancies
• [AIRFLOW-893][AIRFLOW-510] Fix crashing webservers when a dagrun has no start date
• [AIRFLOW-880] Make webserver serve logs in a sane way for remote logs
• [AIRFLOW-889] Fix minor error in the docstrings for BaseOperator
• [AIRFLOW-809][AIRFLOW-1] Use __eq__ ColumnOperator When Testing Booleans
• [AIRFLOW-875] Add template to HttpSensor params
• [AIRFLOW-866] Add FTPSensor
• [AIRFLOW-881] Check if SubDagOperator is in DAG context manager
• [AIRFLOW-885] Add change.org to the users list
• [AIRFLOW-836] Use POST and CSRF for state changing endpoints
• [AIRFLOW-862] Fix Unit Tests for DaskExecutor
• [AIRFLOW-887] Support future v0.16
• [AIRFLOW-886] Pass result to post_execute() hook
• [AIRFLOW-871] change logging.warn() into warning()
• [AIRFLOW-882] Remove unnecessary dag>>op assignment in docs
• [AIRFLOW-861] make pickle_info endpoint be login_required
• [AIRFLOW-869] Refactor mark success functionality
• [AIRFLOW-877] Remove .sql template extension from GCS download operator
• [AIRFLOW-826] Add Zendesk hook
• [AIRFLOW-842] do not query the DB with an empty IN clause
• [AIRFLOW-834] change raise StopIteration into return
• [AIRFLOW-832] Let debug server run without SSL
• [AIRFLOW-862] Add DaskExecutor
• [AIRFLOW-858] Configurable database name for DB operators
• [AIRFLOW-863] Example DAGs should have recent start dates
• [AIRFLOW-1142] SubDAG Tasks Not Executed Even Though All Dependencies Met
• [AIRFLOW-1138] Add licenses to files in scripts directory
• [AIRFLOW-1127] Move license notices to LICENSE instead of NOTICE
• [AIRFLOW-1124] Do not set all task instances to scheduled on backfill
• [AIRFLOW-1120] Update version view to include Apache prefix
• [AIRFLOW-1062] DagRun#find returns wrong result if external_trigger=False is specified
• [AIRFLOW-1054] Fix broken import on test_dag
• [AIRFLOW-1050] Retries ignored - regression
• [AIRFLOW-1033] TypeError: can’t compare datetime.datetime to None
• [AIRFLOW-1017] get_task_instance should return None instead of throw an exception for non-existent TIs
• [AIRFLOW-1011] Fix bug in BackfillJob._execute() for SubDAGs
• [AIRFLOW-1004] airflow webserver -D runs in foreground
• [AIRFLOW-1001] Landing Time shows "unsupported operand type(s) for -: 'datetime.datetime' and 'NoneType'" on example_subdag_operator
• [AIRFLOW-933] use ast.literal_eval rather eval because ast.literal_eval does not execute input.
• [AIRFLOW-925] Revert airflow.hooks change that cherry-pick picked
• [AIRFLOW-919] Running tasks with no start date shouldn’t break a DAGs UI
• [AIRFLOW-802] Add spark-submit operator/hook
• [AIRFLOW-897] Prevent dagruns from failing with unfinished tasks
• [AIRFLOW-861] make pickle_info endpoint be login_required
• [AIRFLOW-853] use utf8 encoding for stdout line decode
• [AIRFLOW-856] Make sure execution date is set for local client
• [AIRFLOW-830][AIRFLOW-829][AIRFLOW-88] Reduce Travis log verbosity
• [AIRFLOW-831] Restore import to fix broken tests
• [AIRFLOW-794] Access DAGS_FOLDER and SQL_ALCHEMY_CONN exclusively from settings
• [AIRFLOW-694] Fix config behaviour for empty envvar
• [AIRFLOW-365] Set dag.fileloc explicitly and use for Code view
• [AIRFLOW-931] Do not set QUEUED in TaskInstances
• [AIRFLOW-899] Tasks in SCHEDULED state should be white in the UI instead of black
• [AIRFLOW-895] Address Apache release incompliancies
• [AIRFLOW-893][AIRFLOW-510] Fix crashing webservers when a dagrun has no start date
• [AIRFLOW-793] Enable compressed loading in S3ToHiveTransfer
• [AIRFLOW-863] Example DAGs should have recent start dates
• [AIRFLOW-869] Refactor mark success functionality
• [AIRFLOW-856] Make sure execution date is set for local client
• [AIRFLOW-814] Fix Presto*CheckOperator.__init__
• [AIRFLOW-844] Fix cgroups directory creation
• [AIRFLOW-816] Use static nvd3 and d3
• [AIRFLOW-821] Fix py3 compatibility
• [AIRFLOW-817] Check for None value of execution_date in endpoint
• [AIRFLOW-822] Close db before exception
• [AIRFLOW-815] Add prev/next execution dates to template variables
• [AIRFLOW-813] Fix unterminated unit tests in SchedulerJobTest
• [AIRFLOW-813] Fix unterminated scheduler unit tests
• [AIRFLOW-806] UI should properly ignore DAG doc when it is None
• [AIRFLOW-812] Fix the scheduler termination bug.
• [AIRFLOW-780] Fix dag import errors no longer working
• [AIRFLOW-783] Fix py3 incompatibility in BaseTaskRunner
• [AIRFLOW-810] Correct down_revision dag_id/state index creation
• [AIRFLOW-807] Improve scheduler performance for large DAGs
3.21 FAQ
There are many reasons why your task might not be getting scheduled. Here are some of the common causes:
• Does your script "compile"? Can the Airflow engine parse it and find your DAG object? To test this, you can
run airflow list_dags and confirm that your DAG shows up in the list. You can also run airflow
list_tasks foo_dag_id --tree and confirm that your task shows up in the list as expected. If you
use the CeleryExecutor, you may want to confirm that this works both where the scheduler runs as well as where
the worker runs.
• Does the file containing your DAG contain the string “airflow” and “DAG” somewhere in the contents? When
searching the DAG directory, Airflow ignores files not containing “airflow” and “DAG” in order to prevent the
DagBag parsing from importing all python files collocated with user’s DAGs.
• Is your start_date set properly? The Airflow scheduler triggers the task soon after the start_date +
schedule_interval is passed.
• Is your schedule_interval set properly? The default schedule_interval is one day (datetime.
timedelta(1)). You must specify a different schedule_interval directly to the DAG ob-
ject you instantiate, not as a default_param, as task instances do not override their parent DAG’s
schedule_interval.
• Is your start_date beyond where you can see it in the UI? If you set your start_date to some time, say 3
months ago, you won't be able to see it in the main view in the UI, but you should be able to see it in Menu
-> Browse -> Task Instances.
• Are the dependencies for the task met? The task instances directly upstream from the task need to be in a
success state. Also, if you have set depends_on_past=True, the previous task instance needs to have
succeeded (except if it is the first run for that task). Also, if wait_for_downstream=True, make sure you
understand what it means. You can view how these properties are set from the Task Instance Details
page for your task.
• Are the DagRuns you need created and active? A DagRun represents a specific execution of an entire DAG and
has a state (running, success, failed, ...). The scheduler creates new DagRuns as it moves forward, but never goes
back in time to create new ones. The scheduler only evaluates running DagRuns to see what task instances
it can trigger. Note that clearing task instances (from the UI or CLI) does set the state of a DagRun back to
running. You can bulk view the list of DagRuns and alter states by clicking on the schedule tag for a DAG.
• Is the concurrency parameter of your DAG reached? concurrency defines how many running task
instances a DAG is allowed to have, beyond which point things get queued.
• Is the max_active_runs parameter of your DAG reached? max_active_runs defines how many
running concurrent instances of a DAG there are allowed to be.
You may also want to read the Scheduler section of the docs and make sure you fully understand how it proceeds.
Check out the Trigger Rule section in the Concepts section of the documentation.
3.21.3 Why are connection passwords still not encrypted in the metadata db after I
installed airflow[crypto]?
Check out the Connections section in the Configuration section of the documentation.
start_date is partly legacy from the pre-DagRun era, but it is still relevant in many ways. When creating a new
DAG, you probably want to set a global start_date for your tasks using default_args. The first DagRun to
be created will be based on the min(start_date) for all your tasks. From that point on, the scheduler creates new
DagRuns based on your schedule_interval and the corresponding task instances run as your dependencies are
met. When introducing new tasks to your DAG, you need to pay special attention to start_date, and may want to
reactivate inactive DagRuns to get the new task onboarded properly.
We recommend against using dynamic values as start_date, especially datetime.now() as it can be quite
confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour
after now as now() moves along.
Previously we also recommended using rounded start_date in relation to your schedule_interval. This
meant an @hourly would be at 00:00 minutes:seconds, a @daily job at midnight, a @monthly job on
the first of the month. This is no longer required. Airflow will now auto align the start_date and the
schedule_interval, by using the start_date as the moment to start looking.
You can use any sensor or a TimeDeltaSensor to delay the execution of tasks within the schedule interval. While
schedule_interval does allow specifying a datetime.timedelta object, we recommend using the macros
or cron expressions instead, as it enforces this idea of rounded schedules.
When using depends_on_past=True it’s important to pay special attention to start_date as the past depen-
dency is not enforced only on the specific schedule of the start_date specified for the task. It’s also important to
watch DagRun activity status in time when introducing new depends_on_past=True, unless you are planning
on running a backfill for the new task(s).
Also important to note is that the task's start_date, in the context of a backfill CLI command, gets overridden by
the backfill command's start_date. This allows a backfill on tasks that have depends_on_past=True to
actually start; if that weren't the case, the backfill just wouldn't start.
Airflow looks in your DAGS_FOLDER for modules that contain DAG objects in their global namespace, and adds
the objects it finds to the DagBag. Knowing this, all we need is a way to dynamically assign variables in the global
namespace. This is easily done in Python using the globals() function from the standard library, which behaves
like a simple dictionary.
for i in range(10):
dag_id = 'foo_{}'.format(i)
globals()[dag_id] = DAG(dag_id)
# or better, call a function that returns a DAG object!
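As a minimal sketch of the function-based variant suggested by the comment above (the dag_id pattern, start date and schedule are illustrative):

from datetime import datetime
from airflow.models import DAG

def create_dag(dag_id):
    dag = DAG(dag_id, start_date=datetime(2018, 1, 1), schedule_interval='@daily')
    # add tasks to the dag here
    return dag

for i in range(10):
    globals()['foo_{}'.format(i)] = create_dag('foo_{}'.format(i))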
3.21.6 What are all the airflow run commands in my process list?
There are many layers of airflow run commands, meaning it can call itself.
• Basic airflow run: fires up an executor and tells it to run an airflow run --local command. If using
Celery, this means it puts a command in the queue for it to run remotely on the worker. If using the LocalExecutor,
that translates into running it in a subprocess pool.
• Local airflow run --local: starts an airflow run --raw command (described below) as a subprocess
and is in charge of emitting heartbeats, listening for external kill signals and ensuring some cleanup takes
place if the subprocess fails.
• Raw airflow run --raw: runs the actual operator's execute method and performs the actual work.
There are three variables we can control to improve Airflow DAG performance:
• parallelism: This variable controls the number of task instances that the Airflow worker can run simultaneously.
Users can increase the parallelism variable in airflow.cfg.
• concurrency: The Airflow scheduler will run no more than concurrency task instances for your DAG
at any given time. Concurrency is defined on your Airflow DAG. If you do not set the concurrency on your DAG,
the scheduler will use the default value from the dag_concurrency entry in your airflow.cfg.
• max_active_runs: the Airflow scheduler will run no more than max_active_runs DagRuns of your
DAG at a given time. If you do not set max_active_runs on your DAG, the scheduler will use the default
value from the max_active_runs_per_dag entry in your airflow.cfg. The two DAG-level settings are illustrated in the sketch below.
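A minimal sketch of setting the DAG-level knobs above (dag_id, dates and values are illustrative):

from datetime import datetime
from airflow.models import DAG

dag = DAG(
    dag_id='tuned_dag',
    start_date=datetime(2018, 1, 1),
    schedule_interval='@daily',
    concurrency=16,       # at most 16 running task instances of this DAG at any time
    max_active_runs=1)    # at most one running DagRun of this DAG at a time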
If your DAG takes a long time to load, you could reduce the value of the default_dag_run_display_number
configuration in airflow.cfg to a smaller value. This setting controls the number of DAG runs to show in the UI,
with a default value of 25.
This means explicit_defaults_for_timestamp is disabled in your MySQL server and you need to enable it
by:
1. Set explicit_defaults_for_timestamp = 1 under the mysqld section in your my.cnf file.
2. Restart the MySQL server.
• max_threads: The scheduler will spawn multiple threads in parallel to schedule DAGs. This is controlled by
max_threads, with a default value of 2. Users should increase this to a larger value (e.g. the number of CPUs
where the scheduler runs minus 1) in production.
• scheduler_heartbeat_sec: Users should consider increasing the scheduler_heartbeat_sec config
to a higher value (e.g. 60 seconds); it controls how frequently the Airflow scheduler emits its heartbeat and updates
the job's entry in the database.
3.22.1 Operators
Operators allow for generation of certain types of tasks that become nodes in the DAG when instantiated. All op-
erators derive from BaseOperator and inherit many attributes and methods that way. Refer to the BaseOperator
documentation for more details.
There are 3 main types of operators:
• Operators that perform an action, or tell another system to perform an action
• Transfer operators move data from one system to another
• Sensors are a certain type of operator that will keep running until a certain criterion is met. Examples include
a specific file landing in HDFS or S3, a partition appearing in Hive, or a specific time of the day. Sensors are
derived from BaseSensorOperator and run a poke method at a specified poke_interval until it returns
True.
3.22.1.1 BaseOperator
All operators are derived from BaseOperator and acquire much functionality through inheritance. Since this is
the core of the engine, it’s worth taking the time to understand the parameters of BaseOperator to understand the
primitive features that can be leveraged in your DAGs.
class airflow.models.BaseOperator(**kwargs)
Bases: airflow.utils.log.logging_mixin.LoggingMixin
Abstract base class for all operators. Since operators create objects that become nodes in the dag, BaseOperator
contains many recursive methods for dag crawling behavior. To derive this class, you are expected to override
the constructor as well as the ‘execute’ method.
Operators derived from this class should perform or trigger certain tasks synchronously (wait for comple-
tion). Example of operators could be an operator that runs a Pig job (PigOperator), a sensor operator that
waits for a partition to land in Hive (HiveSensorOperator), or one that moves data from Hive to MySQL
(Hive2MySqlOperator). Instances of these operators (tasks) target specific operations, running specific scripts,
functions or data transfers.
This class is abstract and shouldn’t be instantiated. Instantiating a class derived from this one results in the
creation of a task object, which ultimately becomes a node in DAG objects. Task dependencies should be set by
using the set_upstream and/or set_downstream methods.
Parameters
• task_id (string) – a unique, meaningful id for the task
• owner (string) – the owner of the task, using the unix username is recommended
• retries (int) – the number of retries that should be performed before failing the task
• retry_delay (timedelta) – delay between retries
• retry_exponential_backoff (bool) – allow progressively longer waits between retries
by using an exponential backoff algorithm on the retry delay (the delay will be converted into
seconds)
• max_retry_delay (timedelta) – maximum delay interval between retries
• start_date (datetime) – The start_date for the task, determines the
execution_date for the first task instance. The best practice is to have the start_date
rounded to your DAG’s schedule_interval. Daily jobs have their start_date some
day at 00:00:00, hourly jobs have their start_date at 00:00 of a specific hour. Note that Airflow
simply looks at the latest execution_date and adds the schedule_interval
to determine the next execution_date. It is also very important to note that different
tasks' dependencies need to line up in time. If task A depends on task B and their
start_dates are offset in a way that their execution_dates don't line up, A's dependencies will
never be met. If you are looking to delay a task, for example running a daily task at 2AM,
look into the TimeSensor and TimeDeltaSensor. We advise against using dynamic
start_date and recommend using fixed ones. Read the FAQ entry about start_date for
more information.
• end_date (datetime) – if specified, the scheduler won’t go beyond this date
• depends_on_past (bool) – when set to true, task instances will run sequentially while
relying on the previous task’s schedule to succeed. The task instance for the start_date is
allowed to run.
• wait_for_downstream (bool) – when set to true, an instance of task X will wait
for tasks immediately downstream of the previous instance of task X to finish successfully
before it runs. This is useful if the different instances of a task X alter the same asset, and
this asset is used by tasks downstream of task X. Note that depends_on_past is forced to
True wherever wait_for_downstream is used.
• queue (str) – which queue to target when running this job. Not all executors implement
queue management, the CeleryExecutor does support targeting specific queues.
• dag (DAG) – a reference to the dag the task is attached to (if any)
• priority_weight (int) – priority weight of this task against other task. This allows
the executor to trigger higher priority tasks before others when things get backed up.
• weight_rule (str) – weighting method used for the effective total priority weight
of the task. Options are: { downstream | upstream | absolute } default is
downstream When set to downstream the effective weight of the task is the aggregate
sum of all downstream descendants. As a result, upstream tasks will have higher weight and
will be scheduled more aggressively when using positive weight values. This is useful when
you have multiple dag run instances and desire to have all upstream tasks to complete for all
runs before each dag can continue processing downstream tasks. When set to upstream
the effective weight is the aggregate sum of all upstream ancestors. This is the opposite,
where downstream tasks have higher weight and will be scheduled more aggressively when
using positive weight values. This is useful when you have multiple dag run instances and
prefer to have each dag complete before starting upstream tasks of other dags. When set to
absolute, the effective weight is the exact priority_weight specified without additional
weighting. You may want to do this when you know exactly what priority weight
each task should have. Additionally, when set to absolute, there is a bonus effect of
significantly speeding up the task creation process for very large DAGs. Options can be set as
a string or using the constants defined in the static class airflow.utils.WeightRule
• pool (str) – the slot pool this task should run in, slot pools are a way to limit concurrency
for certain tasks
• sla (datetime.timedelta) – time by which the job is expected to succeed. Note that
this represents the timedelta after the period is closed. For example if you set an SLA
of 1 hour, the scheduler would send an email soon after 1:00AM on the 2016-01-02 if
the 2016-01-01 instance has not succeeded yet. The scheduler pays special attention to
jobs with an SLA and sends alert emails for SLA misses. SLA misses are also recorded in the
database for future reference. All tasks that share the same SLA time get bundled in a single
email, sent soon after that time. SLA notifications are sent once and only once for each task
instance.
• execution_timeout (datetime.timedelta) – max time allowed for the execution
of this task instance; if it goes beyond it, the task will raise and fail.
• on_failure_callback (callable) – a function to be called when a task instance of
this task fails. A context dictionary is passed as a single parameter to this function. Context
contains references to objects related to the task instance and is documented under the
macros section of the API.
• on_retry_callback (callable) – much like the on_failure_callback except
that it is executed when retries occur.
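# Example executor_config: run this task in a specific Docker image via the KubernetesExecutor.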
MyOperator(...,
executor_config={
"KubernetesExecutor":
{"image": "myCustomDockerImage"}
}
)
clear(**kwargs)
Clears the state of task instances associated with the task, following the parameters specified.
dag
Returns the Operator’s DAG if set, otherwise raises an error
deps
Returns the list of dependencies for the operator. These differ from execution context dependencies in that
they are specific to tasks and can be extended/overridden by subclasses.
downstream_list
@property: list of tasks directly downstream
execute(context)
This is the main method to derive when creating an operator. Context is the same dictionary used as when
rendering jinja templates.
Refer to get_template_context for more context.
get_direct_relative_ids(upstream=False)
Get the direct relative ids to the current task, upstream or downstream.
get_direct_relatives(upstream=False)
Get the direct relatives to the current task, upstream or downstream.
get_flat_relative_ids(upstream=False, found_descendants=None)
Get a flat list of relatives’ ids, either upstream or downstream.
get_flat_relatives(upstream=False)
Get a flat list of relatives, either upstream or downstream.
get_task_instances(session, start_date=None, end_date=None)
Get the set of task instances related to this task for a specific date range.
has_dag()
Returns True if the Operator has been assigned to a DAG.
on_kill()
Override this method to cleanup subprocesses when a task instance gets killed. Any use of the threading,
subprocess or multiprocessing module within an operator needs to be cleaned up or it will leave ghost
processes behind.
post_execute(context, *args, **kwargs)
This hook is triggered right after self.execute() is called. It is passed the execution context and any results
returned by the operator.
pre_execute(context, *args, **kwargs)
This hook is triggered right before self.execute() is called.
prepare_template()
Hook that is triggered after the templated fields get replaced by their content. If you need your operator to
alter the content of the file before the template is rendered, it should override this method to do so.
render_template(attr, content, context)
Renders a template either from a file or directly in a field, and returns the rendered result.
render_template_from_field(attr, content, context, jinja_env)
Renders a template from a field. If the field is a string, it will simply render the string and return the result.
If it is a collection or nested set of collections, it will traverse the structure and render all strings in it.
run(start_date=None, end_date=None, ignore_first_depends_on_past=False, ignore_ti_state=False,
mark_success=False)
Run a set of task instances for a date range.
schedule_interval
The schedule interval of the DAG always wins over individual tasks so that tasks within a DAG always
line up. The task still needs a schedule_interval as it may not be attached to a DAG.
set_downstream(task_or_task_list)
Set a task or a task list to be directly downstream from the current task.
set_upstream(task_or_task_list)
Set a task or a task list to be directly upstream from the current task.
upstream_list
@property: list of tasks directly upstream
xcom_pull(context, task_ids=None, dag_id=None, key=u’return_value’, include_prior_dates=None)
See TaskInstance.xcom_pull()
xcom_push(context, key, value, execution_date=None)
See TaskInstance.xcom_push()
3.22.1.2 BaseSensorOperator
All sensors are derived from BaseSensorOperator. All sensors inherit the timeout and poke_interval on
top of the BaseOperator attributes.
class airflow.sensors.base_sensor_operator.BaseSensorOperator(**kwargs)
Bases: airflow.models.BaseOperator, airflow.models.SkipMixin
Sensor operators are derived from this class and inherit these attributes.
Sensor operators keep executing at a time interval and succeed when a criteria is met and fail if and when they
time out.
Parameters
• soft_fail (bool) – Set to true to mark the task as SKIPPED on failure
• poke_interval (int) – Time in seconds that the job should wait in between each try
• timeout (int) – Time, in seconds before the task times out and fails.
• mode (str) – How the sensor operates. Options are: { poke | reschedule }, default
is poke. When set to poke the sensor takes up a worker slot for its whole execution
time and sleeps between pokes. Use this mode if the expected runtime of the sensor is short
or if a short poke interval is required. When set to reschedule the sensor task frees the
worker slot when the criteria is not yet met and it is rescheduled at a later time. Use this
mode if the time before the criteria is met is expected to be long. The poke interval should be more than
one minute to prevent too much load on the scheduler.
deps
Adds one additional dependency for all sensor operators that checks if a sensor task instance can be
rescheduled.
poke(context)
Function that the sensors defined while deriving this class should override.
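As a sketch, a minimal custom sensor that overrides poke() might look as follows (the class name and file path are illustrative):

import os
from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.utils.decorators import apply_defaults

class LocalFileSensor(BaseSensorOperator):
    @apply_defaults
    def __init__(self, filepath, *args, **kwargs):
        super(LocalFileSensor, self).__init__(*args, **kwargs)
        self.filepath = filepath

    def poke(self, context):
        # Return True when the criteria is met; the sensor keeps poking until then.
        return os.path.exists(self.filepath)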
Operators
class airflow.operators.bash_operator.BashOperator(**kwargs)
Bases: airflow.models.BaseOperator
Execute a Bash script, command or set of commands.
Parameters
• bash_command (string) – The command, set of commands or reference to a bash script
(must be ‘.sh’) to be executed. (templated)
• xcom_push (bool) – If xcom_push is True, the last line written to stdout will also be
pushed to an XCom when the bash command completes.
• env (dict) – If env is not None, it must be a mapping that defines the environment vari-
ables for the new process; these are used instead of inheriting the current process environ-
ment, which is the default behavior. (templated)
• output_encoding (str) – Output encoding of bash command
execute(context)
Execute the bash command in a temporary directory which will be cleaned afterwards
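For example, a minimal BashOperator sketch using a templated command, a custom environment and xcom_push (dag_id, dates and values are illustrative):

from datetime import datetime
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(dag_id='bash_example', start_date=datetime(2018, 1, 1),
          schedule_interval='@daily')

echo_date = BashOperator(
    task_id='echo_date',
    bash_command='echo "run date is {{ ds }}"',
    env={'MY_VAR': 'value'},   # replaces the inherited environment
    xcom_push=True,            # the last line written to stdout becomes the XCom value
    dag=dag)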
class airflow.operators.python_operator.BranchPythonOperator(**kwargs)
Bases: airflow.operators.python_operator.PythonOperator, airflow.models.
SkipMixin
Allows a workflow to “branch” or follow a single path following the execution of this task.
It derives from the PythonOperator and expects a Python function that returns the task_id to follow. The task_id
returned should point to a task directly downstream from {self}. All other "branches" or directly downstream
tasks are marked with a state of skipped so that these paths can't move forward. The skipped states are
propagated downstream to allow the DAG state to fill up and the DAG run's state to be inferred.
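A minimal branching sketch (the task ids and the choice function are illustrative):

from datetime import datetime
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator

dag = DAG(dag_id='branch_example', start_date=datetime(2018, 1, 1),
          schedule_interval='@daily')

def choose_branch(**context):
    # Return the task_id of the branch to follow.
    return 'path_a' if context['execution_date'].day % 2 == 0 else 'path_b'

branch = BranchPythonOperator(task_id='branch', python_callable=choose_branch,
                              provide_context=True, dag=dag)
path_a = DummyOperator(task_id='path_a', dag=dag)
path_b = DummyOperator(task_id='path_b', dag=dag)
branch.set_downstream([path_a, path_b])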
• partition (dict) – target partition as a dict of partition columns and values. (templated)
• delimiter (str) – field delimiter in the file
• mssql_conn_id (str) – source Microsoft SQL Server connection
• hive_conn_id (str) – destination hive connection
• tblproperties (dict) – TBLPROPERTIES of the hive table being created
class airflow.operators.pig_operator.PigOperator(**kwargs)
Bases: airflow.models.BaseOperator
Executes pig script.
Parameters
• pig (string) – the pig latin script to be executed. (templated)
• pig_cli_conn_id (string) – reference to the Hive database
• pigparams_jinja_translate (boolean) – when True, pig params-type templating
${var} gets translated into jinja-type templating {{ var }}. Note that you may want to use
this along with the DAG(user_defined_macros=myargs) parameter. View the DAG
object documentation for more details.
class airflow.operators.python_operator.PythonOperator(**kwargs)
Bases: airflow.models.BaseOperator
Executes a Python callable
Parameters
• python_callable (python callable) – A reference to an object that is callable
• op_kwargs (dict) – a dictionary of keyword arguments that will get unpacked in your
function
• op_args (list) – a list of positional arguments that will get unpacked when calling your
callable
• provide_context (bool) – if set to true, Airflow will pass a set of keyword arguments
that can be used in your function. This set of kwargs correspond exactly to what you can
use in your jinja templates. For this to work, you need to define **kwargs in your function
header.
• templates_dict (dict of str) – a dictionary where the values are templates that
will get templated by the Airflow engine sometime between __init__ and execute
takes place and are made available in your callable’s context after the template has been
applied. (templated)
• templates_exts (list(str)) – a list of file extensions to resolve while processing
templated fields, for example ['.sql', '.hql']
class airflow.operators.python_operator.PythonVirtualenvOperator(**kwargs)
Bases: airflow.operators.python_operator.PythonOperator
Allows one to run a function in a virtualenv that is created and destroyed automatically (with certain caveats).
The function must be defined using def, and not be part of a class. All imports must happen inside the function
and no variables outside of the scope may be referenced. A global scope variable named virtualenv_string_args
will be available (populated by string_args). In addition, one can pass things through op_args and op_kwargs, and
one can use a return value. Note that if your virtualenv runs in a different Python major version than Airflow,
you cannot use return values, op_args, or op_kwargs. You can use string_args though.
Parameters
• python_callable (python callable) – A Python function with no references to outside variables,
defined with def, which will be run in the virtualenv
• requirements (list(str)) – A list of requirements as specified in a pip install com-
mand
• python_version (str) – The Python version to run the virtualenv with. Note that both
2 and 2.7 are acceptable forms.
• use_dill (bool) – Whether to use dill to serialize the args and result (pickle is default).
This allows more complex types but requires you to include dill in your requirements.
• system_site_packages (bool) – Whether to include system_site_packages in your
virtualenv. See virtualenv documentation for more information.
• op_args – A list of positional arguments to pass to python_callable.
• op_kwargs (dict) – A dict of keyword arguments to pass to python_callable.
• string_args (list(str)) – Strings that are present in the global var vir-
tualenv_string_args, available to python_callable at runtime as a list(str). Note that args
are split by newline.
• templates_dict (dict of str) – a dictionary where the values are templates that
will get templated by the Airflow engine sometime between __init__ and execute
takes place and are made available in your callable’s context after the template has been
applied
• templates_exts (list(str)) – a list of file extensions to resolve while processing
templated fields, for example ['.sql', '.hql']
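A hedged sketch of a self-contained callable run in its own virtualenv (the requirement pin, task id and dag object are assumptions for this example):

    def summarize():
        # all imports must happen inside the function
        import simplejson as json
        # virtualenv_string_args is injected at runtime from string_args
        return json.dumps({'args': virtualenv_string_args})

    virtualenv_task = PythonVirtualenvOperator(
        task_id='virtualenv_task',
        python_callable=summarize,
        requirements=['simplejson==3.16.0'],  # illustrative pin
        string_args=['first', 'second'],
        system_site_packages=False,
        dag=dag)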
class airflow.operators.s3_file_transform_operator.S3FileTransformOperator(**kwargs)
Bases: airflow.models.BaseOperator
Copies data from a source S3 location to a temporary location on the local filesystem. Runs a transformation on
this file as specified by the transformation script and uploads the output to a destination S3 location.
The locations of the source and the destination files on the local filesystem are provided as the first and second
arguments to the transformation script. The transformation script is expected to read the data from source,
transform it and write the output to the local destination file. The operator then takes over control and uploads
the local destination file to S3.
S3 Select is also available to filter the source contents. Users can omit the transformation script if an S3 Select
expression is specified.
Parameters
• source_s3_key (str) – The key to be retrieved from S3. (templated)
• source_aws_conn_id (str) – source s3 connection
• source_verify (bool or str) – Whether or not to verify SSL certificates for the S3
connection. By default SSL certificates are verified. You can provide the following values:
– False: do not validate SSL certificates. SSL will still be used (unless use_ssl is
False), but SSL certificates will not be verified.
– path/to/cert/bundle.pem: A filename of the CA cert bundle to use. You
can specify this argument if you want to use a different CA cert bundle than the one
used by botocore.
This is also applicable to dest_verify.
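A hedged sketch of the copy-transform-upload flow described above; the bucket keys, script path and connection id are illustrative, and dest_s3_key, transform_script and replace come from the operator’s full signature rather than the truncated parameter list shown here:

    transform_task = S3FileTransformOperator(
        task_id='s3_transform',
        source_s3_key='s3://my-source-bucket/raw/{{ ds }}/data.csv',
        dest_s3_key='s3://my-dest-bucket/clean/{{ ds }}/data.csv',
        transform_script='/usr/local/airflow/scripts/transform.py',
        source_aws_conn_id='aws_default',
        source_verify=True,
        replace=True,
        dag=dag)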
class airflow.operators.python_operator.ShortCircuitOperator(**kwargs)
Bases: airflow.operators.python_operator.PythonOperator, airflow.models.
SkipMixin
Allows a workflow to continue only if a condition is met. Otherwise, the workflow “short-circuits” and down-
stream tasks are skipped.
The ShortCircuitOperator is derived from the PythonOperator. It evaluates a condition and short-circuits the
workflow if the condition is False. Any downstream tasks are marked with a state of “skipped”. If the condition
is True, downstream tasks proceed as normal.
The condition is determined by the result of python_callable.
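A minimal sketch of the short-circuit pattern (the weekday condition, task ids and dag object are assumptions for this example):

    def is_weekday(**kwargs):
        # downstream tasks run only when this returns True
        return kwargs['execution_date'].weekday() < 5

    check_weekday = ShortCircuitOperator(
        task_id='check_weekday',
        python_callable=is_weekday,
        provide_context=True,
        dag=dag)
    check_weekday >> downstream_task  # downstream_task is assumed to exist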
class airflow.operators.http_operator.SimpleHttpOperator(**kwargs)
Bases: airflow.models.BaseOperator
Calls an endpoint on an HTTP system to execute an action
Parameters
• http_conn_id (string) – The connection to run the sensor against
• endpoint (string) – The relative part of the full url. (templated)
• method (string) – The HTTP method to use, default = “POST”
• data (For POST/PUT, depends on the content-type parameter,
for GET a dictionary of key/value string pairs) – The data to pass.
POST-data in POST/PUT and params in the URL for a GET request. (templated)
• headers (a dictionary of string key/value pairs) – The HTTP headers
to be added to the GET request
• response_check (A lambda or defined function.) – A check against the
‘requests’ response object. Returns True for ‘pass’ and False otherwise.
• extra_options (A dictionary of options, where key is string
and value depends on the option that's being modified.) – Extra
options for the ‘requests’ library, see the ‘requests’ documentation (options to modify
timeout, ssl, etc.)
• xcom_push (bool) – Push the response to Xcom (default: False)
• log_response (bool) – Log the response (default: False)
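A hedged sketch posting templated data to an endpoint and validating the response (the connection id, endpoint path and payload are assumptions for this example):

    import json

    post_event = SimpleHttpOperator(
        task_id='post_event',
        http_conn_id='http_default',
        endpoint='api/v1/events',
        method='POST',
        data=json.dumps({'run_date': '{{ ds }}'}),
        headers={'Content-Type': 'application/json'},
        response_check=lambda response: 'ok' in response.text,
        xcom_push=True,
        dag=dag)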
class airflow.operators.sqlite_operator.SqliteOperator(**kwargs)
Bases: airflow.models.BaseOperator
Executes sql code in a specific Sqlite database
Parameters
• sqlite_conn_id (string) – reference to a specific sqlite database
• sql (string or string pointing to a template file; the file must
have a '.sql' extension) – the sql code to be executed. (templated)
class airflow.operators.subdag_operator.SubDagOperator(**kwargs)
Bases: airflow.models.BaseOperator
class airflow.operators.dagrun_operator.TriggerDagRunOperator(**kwargs)
Bases: airflow.models.BaseOperator
Triggers a DAG run for a specified dag_id
Parameters
• trigger_dag_id (str) – the dag_id to trigger (templated)
• python_callable (python callable) – a reference to a python function that will
be called while passing it the context object and a placeholder object obj for your
callable to fill and return if you want a DagRun created. This obj object contains a run_id
and payload attribute that you can modify in your function. The run_id should be a
unique identifier for that DAG run, and the payload has to be a picklable object that will be
made available to your tasks while executing that DAG run. Your function header should
look like def foo(context, dag_run_obj):
• execution_date (str or datetime.datetime) – Execution date for the dag
(templated)
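A sketch of the controller pattern described above; the target dag id, the condition and the payload contents are assumptions for this example:

    def conditionally_trigger(context, dag_run_obj):
        # return the dag_run_obj to create a DagRun, or None to skip it
        if context['params'].get('should_trigger', True):
            dag_run_obj.payload = {'source_execution_date': context['ds']}
            return dag_run_obj

    trigger = TriggerDagRunOperator(
        task_id='trigger_target_dag',
        trigger_dag_id='target_dag',
        python_callable=conditionally_trigger,
        dag=dag)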
class airflow.operators.check_operator.ValueCheckOperator(**kwargs)
Bases: airflow.models.BaseOperator
Performs a simple value check using sql code.
Note that this is an abstract class and get_db_hook needs to be defined, where get_db_hook is a hook that
returns a single record from an external source.
Parameters sql (string) – the sql to be executed. (templated)
Sensors
class airflow.sensors.external_task_sensor.ExternalTaskSensor(**kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a task to complete in a different DAG
Parameters
• external_dag_id (string) – The dag_id that contains the task you want to wait for
• external_task_id (string) – The task_id that contains the task you want to wait for
• allowed_states (list) – list of allowed states, default is ['success']
• execution_delta (datetime.timedelta) – time difference with the previous ex-
ecution to look at, the default is the same execution_date as the current task. For yesterday,
use [positive!] datetime.timedelta(days=1). Either execution_delta or execution_date_fn
can be passed to ExternalTaskSensor, but not both.
• execution_date_fn (callable) – function that receives the current execution
date and returns the desired execution dates to query. Either execution_delta or execu-
tion_date_fn can be passed to ExternalTaskSensor, but not both.
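For example, a hedged sketch that waits on yesterday’s run of another DAG (the dag ids, task ids and dag object are assumptions for this example):

    from datetime import timedelta

    wait_for_upstream = ExternalTaskSensor(
        task_id='wait_for_upstream',
        external_dag_id='upstream_dag',
        external_task_id='final_task',
        allowed_states=['success'],
        execution_delta=timedelta(days=1),  # positive delta looks one day back
        dag=dag)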
poke(**kwargs)
Function that the sensors defined while deriving this class should override.
class airflow.sensors.hive_partition_sensor.HivePartitionSensor(**kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a partition to show up in Hive.
Note: Because partition supports general logical operators, it can be inefficient. Consider using Named-
HivePartitionSensor instead if you don’t need the full flexibility of HivePartitionSensor.
Parameters
• table (string) – The name of the table to wait for, supports the dot notation
(my_database.my_table)
• partition (string) – The partition clause to wait for. This is passed as is to the metas-
tore Thrift client get_partitions_by_filter method, and apparently supports SQL
like notation as in ds='2015-01-01' AND type='value' and comparison opera-
tors as in "ds>=2015-01-01"
• metastore_conn_id (str) – reference to the metastore thrift service connection id
poke(context)
Function that the sensors defined while deriving this class should override.
class airflow.sensors.http_sensor.HttpSensor(**kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Executes an HTTP GET request and returns False on failure: 404 Not Found or the response_check function re-
turned False
Parameters
• http_conn_id (string) – The connection to run the sensor against
• method (string) – The HTTP request method to use
• endpoint (string) – The relative part of the full url
• request_params (a dictionary of string key/value pairs) – The pa-
rameters to be added to the GET url
• headers (a dictionary of string key/value pairs) – The HTTP headers
to be added to the GET request
• response_check (A lambda or defined function.) – A check against the
‘requests’ response object. Returns True for ‘pass’ and False otherwise.
• extra_options (A dictionary of options, where key is string
and value depends on the option that's being modified.) – Extra
options for the ‘requests’ library, see the ‘requests’ documentation (options to modify
timeout, ssl, etc.)
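A hedged sketch that polls an endpoint until it reports readiness (the connection id, endpoint and response text are assumptions for this example):

    wait_for_api = HttpSensor(
        task_id='wait_for_api',
        http_conn_id='http_default',
        endpoint='api/v1/status',
        request_params={'date': '{{ ds }}'},
        response_check=lambda response: 'ready' in response.text,
        poke_interval=60,  # poke_interval comes from BaseSensorOperator
        dag=dag)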
poke(context)
Function that the sensors defined while deriving this class should override.
class airflow.sensors.metastore_partition_sensor.MetastorePartitionSensor(**kwargs)
Bases: airflow.sensors.sql_sensor.SqlSensor
An alternative to the HivePartitionSensor that talks directly to the MySQL db. This was created as a result of
observing suboptimal queries generated by the Metastore thrift service when hitting subpartitioned tables. The
Thrift service’s queries were written in a way that wouldn’t leverage the indexes.
Parameters
• schema (str) – the schema
• table (str) – the table
poke(context)
Function that the sensors defined while deriving this class should override.
class airflow.sensors.s3_prefix_sensor.S3PrefixSensor(**kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a prefix to exist. A prefix is the first part of a key, thus enabling checking of constructs similar to glob
airfl* or SQL LIKE ‘airfl%’. You can specify a delimiter to indicate the hierarchy of keys,
meaning that the match will stop at that delimiter. Current code accepts sane delimiters, i.e. characters that are
NOT special characters in the Python regex engine.
Parameters
• bucket_name (str) – Name of the S3 bucket
• prefix (str) – The prefix being waited on. Relative path from bucket root level.
• delimiter (str) – The delimiter intended to show hierarchy. Defaults to ‘/’.
• aws_conn_id (str) – a reference to the s3 connection
• verify (bool or str) – Whether or not to verify SSL certificates for the S3 connection.
By default SSL certificates are verified. You can provide the following values:
– False: do not validate SSL certificates. SSL will still be used (unless use_ssl is
False), but SSL certificates will not be verified.
– path/to/cert/bundle.pem: A filename of the CA cert bundle to use.
poke(context)
Function that the sensors defined while deriving this class should override.
class airflow.sensors.sql_sensor.SqlSensor(**kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Runs a sql statement until a criteria is met. It will keep trying while the sql returns no row, or if the first cell is in (0,
‘0’, ‘’).
Parameters
• conn_id (string) – The connection to run the sensor against
• sql – The sql to run. To pass, it needs to return at least one cell that contains a non-zero /
non-empty string value.
poke(context)
Function that the sensors defined while deriving this class should override.
class airflow.sensors.time_sensor.TimeSensor(**kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits until the specified time of the day.
Parameters target_time (datetime.time) – time after which the job succeeds
poke(context)
Function that the sensors defined while deriving this class should override.
class airflow.sensors.time_delta_sensor.TimeDeltaSensor(**kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a timedelta after the task’s execution_date + schedule_interval. In Airflow, the daily task stamped with
execution_date 2016-01-01 can only start running on 2016-01-02. The timedelta here represents the time
after the execution period has closed.
Parameters delta (datetime.timedelta) – time length to wait after execution_date before
succeeding
poke(context)
Function that the sensors defined while deriving this class should override.
class airflow.sensors.web_hdfs_sensor.WebHdfsSensor(**kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a file or folder to land in HDFS
poke(context)
Function that the sensors defined while deriving this class should override.
Operators
class airflow.contrib.operators.aws_athena_operator.AWSAthenaOperator(**kwargs)
Bases: airflow.models.BaseOperator
An operator that submits a Presto query to Athena.
Parameters
• query (str) – Presto query to be run on Athena. (templated)
• database (str) – Database to select. (templated)
• output_location (str) – s3 path to write the query results into. (templated)
• aws_conn_id (str) – aws connection to use
• sleep_time (int) – Time to wait between two consecutive calls to check query status on
Athena
execute(context)
Run Presto Query on Athena
on_kill()
Cancel the submitted athena query
class airflow.contrib.operators.awsbatch_operator.AWSBatchOperator(**kwargs)
Bases: airflow.models.BaseOperator
Execute a job on AWS Batch Service
Parameters
• job_name (str) – the name for the job that will run on AWS Batch
• job_definition (str) – the job definition name on AWS Batch
• job_queue (str) – the queue name on AWS Batch
• overrides (dict) – the same parameter that boto3 will receive on con-
tainerOverrides (templated): http://boto3.readthedocs.io/en/latest/reference/services/batch.
html#submit_job
• max_retries (int) – exponential backoff retries while waiter is not merged, 4200 = 48
hours
• aws_conn_id (str) – connection id of AWS credentials / region name. If None, the
default boto3 credential strategy will be used (http://boto3.readthedocs.io/en/latest/guide/configuration.
html).
• region_name (str) – region name to use in AWS Hook. Override the region_name in
connection (if provided)
class airflow.contrib.operators.bigquery_check_operator.BigQueryCheckOperator(**kwargs)
Bases: airflow.operators.check_operator.CheckOperator
Performs checks against BigQuery. The BigQueryCheckOperator expects a sql query that will return a
single row. Each value on that first row is evaluated using python bool casting. If any of the values return
False the check is failed and errors out.
Note that Python bool casting evals the following as False:
• False
• 0
• Empty string ("")
• Empty list ([])
• Empty dictionary or set ({})
Given a query like SELECT COUNT(*) FROM foo, it will fail only if the count == 0. You can craft much
more complex queries that could, for instance, check that the table has the same number of rows as the source
table upstream, or that the count of today’s partition is greater than yesterday’s partition, or that a set of metrics
are less than 3 standard deviations from the 7-day average.
This operator can be used as a data quality check in your pipeline, and depending on where you put it in your
DAG, you have the choice to stop the critical path, preventing it from publishing dubious data, or to run it on the side and
receive email alerts without stopping the progress of the DAG.
Parameters
• sql (string) – the sql to be executed
• bigquery_conn_id (string) – reference to the BigQuery database
• use_legacy_sql (boolean) – Whether to use legacy SQL (true) or standard SQL
(false).
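For instance, a hedged sketch of a data quality gate on today’s partition (the table name, partition column and connection id are assumptions for this example):

    check_rows_today = BigQueryCheckOperator(
        task_id='check_rows_today',
        sql="SELECT COUNT(*) FROM `my_project.my_dataset.events` WHERE ds = '{{ ds }}'",
        use_legacy_sql=False,
        bigquery_conn_id='bigquery_default',
        dag=dag)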
class airflow.contrib.operators.bigquery_check_operator.BigQueryValueCheckOperator(**kwargs)
Bases: airflow.operators.check_operator.ValueCheckOperator
Performs a simple value check using sql code.
Parameters
• sql (string) – the sql to be executed
• use_legacy_sql (boolean) – Whether to use legacy SQL (true) or standard SQL
(false).
class airflow.contrib.operators.bigquery_check_operator.BigQueryIntervalCheckOperator(**kwargs)
Bases: airflow.operators.check_operator.IntervalCheckOperator
Checks that the values of metrics given as SQL expressions are within a certain tolerance of the ones from
days_back before.
This operator constructs and runs a query that compares each metric for the current day (ds) against its value from days_back before, and fails if a ratio exceeds its threshold.
Parameters
• table (str) – the table name
• days_back (int) – number of days between ds and the ds we want to check against.
Defaults to 7 days
• metrics_threshold (dict) – a dictionary of ratios indexed by metrics, for example
‘COUNT(*)’: 1.5 would require a 50 percent or less difference between the current day, and
the prior days_back.
• use_legacy_sql (boolean) – Whether to use legacy SQL (true) or standard SQL
(false).
class airflow.contrib.operators.bigquery_get_data.BigQueryGetDataOperator(**kwargs)
Bases: airflow.models.BaseOperator
Fetches the data from a BigQuery table (alternatively fetch data for selected columns) and returns data in a
python list. The number of elements in the returned list will be equal to the number of rows fetched. Each
element in the list will again be a list, where each element represents the column values for that row.
Example Result: [['Tony', '10'], ['Mike', '20'], ['Steve', '15']]
Note: If you pass fields to selected_fields which are in different order than the order of columns already
in BQ table, the data will still be in the order of BQ table. For example if the BQ table has 3 columns as
[A,B,C] and you pass ‘B,A’ in the selected_fields the data would still be of the form 'A,B'.
Example:

    get_data = BigQueryGetDataOperator(
        task_id='get_data_from_bq',
        dataset_id='test_dataset',
        table_id='Transaction_partitions',
        max_results='100',
        selected_fields='DATE',
        bigquery_conn_id='airflow-service-account'
    )
Parameters
• dataset_id (string) – The dataset ID of the requested table. (templated)
• table_id (string) – The table ID of the requested table. (templated)
• max_results (string) – The maximum number of records (rows) to be fetched from
the table. (templated)
• selected_fields (string) – List of fields to return (comma-separated). If unspeci-
fied, all fields are returned.
• bigquery_conn_id (string) – reference to a specific BigQuery hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
class airflow.contrib.operators.bigquery_operator.BigQueryCreateEmptyTableOperator(**kwargs)
Bases: airflow.models.BaseOperator
Creates a new, empty table in the specified BigQuery dataset, optionally with schema.
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly pass
the schema fields in, or you may point the operator to a Google cloud storage object name. The object in Google
cloud storage must be a JSON file with the schema fields in it. You can also create a table without schema.
Parameters
• project_id (string) – The project to create the table into. (templated)
• dataset_id (string) – The dataset to create the table into. (templated)
• table_id (string) – The Name of the table to be created. (templated)
• schema_fields (list) – If set, the schema field list as defined here: https://cloud.
google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema
Example (schema stored as a JSON object in Google Cloud Storage):

    CreateTable = BigQueryCreateEmptyTableOperator(
        task_id='BigQueryCreateEmptyTableOperator_task',
        dataset_id='ODS',
        table_id='Employees',
        project_id='internal-gcp-project',
        gcs_schema_object='gs://schema-bucket/employee_schema.json',
        bigquery_conn_id='airflow-service-account',
        google_cloud_storage_conn_id='airflow-service-account'
    )

Corresponding schema file (employee_schema.json):

    [
        {
            "mode": "NULLABLE",
            "name": "emp_name",
            "type": "STRING"
        },
        {
            "mode": "REQUIRED",
            "name": "salary",
            "type": "INTEGER"
        }
    ]

Example (schema passed in directly via schema_fields):

    CreateTable = BigQueryCreateEmptyTableOperator(
        task_id='BigQueryCreateEmptyTableOperator_task',
        dataset_id='ODS',
        table_id='Employees',
        project_id='internal-gcp-project',
        schema_fields=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
                       {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}],
        bigquery_conn_id='airflow-service-account',
        google_cloud_storage_conn_id='airflow-service-account'
    )
class airflow.contrib.operators.bigquery_operator.BigQueryCreateExternalTableOperator(**kwargs)
Bases: airflow.models.BaseOperator
Creates a new external table in the dataset with the data in Google Cloud Storage.
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly
pass the schema fields in, or you may point the operator to a Google cloud storage object name. The object in
Google cloud storage must be a JSON file with the schema fields in it.
Parameters
• bucket (string) – The bucket to point the external table to. (templated)
• source_objects (list) – List of Google cloud storage URIs to point table to. (tem-
plated) If source_format is ‘DATASTORE_BACKUP’, the list must only contain a single
URI.
• destination_project_dataset_table (string) – The dotted
(<project>.)<dataset>.<table> BigQuery table to load data into (templated). If <project> is
not included, project will be the project defined in the connection json.
• schema_fields (list) – If set, the schema field list as defined here: https://cloud.
google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema
Example:
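The following is a hedged sketch only; the bucket, object paths and schema are illustrative, and source_format plus the connection ids come from the operator’s full signature rather than the truncated parameter list above:

    create_external_table = BigQueryCreateExternalTableOperator(
        task_id='create_external_table',
        bucket='my-data-bucket',
        source_objects=['exports/2018/*.csv'],
        destination_project_dataset_table='my_project.my_dataset.events_external',
        schema_fields=[{'name': 'event_id', 'type': 'STRING', 'mode': 'REQUIRED'},
                       {'name': 'event_ts', 'type': 'TIMESTAMP', 'mode': 'NULLABLE'}],
        source_format='CSV',
        bigquery_conn_id='bigquery_default',
        google_cloud_storage_conn_id='google_cloud_default',
        dag=dag)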
class airflow.contrib.operators.bigquery_operator.BigQueryCreateEmptyDatasetOperator(**kwargs)
Bases: airflow.models.BaseOperator
This operator is used to create a new dataset for your project in BigQuery. https://cloud.google.com/bigquery/
docs/reference/rest/v2/datasets#resource
Parameters
• project_id (str) – The name of the project where we want to create the dataset. Not
needed if projectId is provided in dataset_reference.
• dataset_id (str) – The id of the dataset. Not needed if datasetId is provided in
dataset_reference.
• dataset_reference – Dataset reference that could be provided with request body.
More info: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource
class airflow.contrib.operators.bigquery_operator.BigQueryOperator(**kwargs)
Bases: airflow.models.BaseOperator
Executes BigQuery SQL queries in a specific BigQuery database.
Parameters
• priority (string) – Specifies a priority for the query. Possible values include INTER-
ACTIVE and BATCH. The default value is INTERACTIVE.
• time_partitioning (dict) – configure optional time partitioning fields i.e. partition
by field, type and expiration as per API specifications.
• cluster_fields (list of str) – Request that the result of this query be stored
sorted by one or more columns. This is only available in conjunction with time_partitioning.
The order of columns given determines the sort order.
• location (str) – The geographic location of the job. Required except for US and EU.
See details at https://cloud.google.com/bigquery/docs/locations#specifying_your_location
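As a hedged sketch (the sql, destination_dataset_table and write_disposition arguments belong to the operator’s full signature and, like the table names, are assumptions relative to the truncated parameter list above):

    aggregate_events = BigQueryOperator(
        task_id='aggregate_events',
        sql="SELECT ds, COUNT(*) AS n FROM `my_project.my_dataset.events` GROUP BY ds",
        destination_dataset_table='my_project.my_dataset.daily_counts',
        write_disposition='WRITE_TRUNCATE',
        use_legacy_sql=False,
        priority='BATCH',
        dag=dag)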
class airflow.contrib.operators.bigquery_table_delete_operator.BigQueryTableDeleteOperator(**kwargs)
Bases: airflow.models.BaseOperator
Deletes BigQuery tables
Parameters
• deletion_dataset_table (string) – A dotted
(<project>.|<project>:)<dataset>.<table> that indicates which table will be deleted.
(templated)
• bigquery_conn_id (string) – reference to a specific BigQuery hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• ignore_if_missing (boolean) – if True, then return success even if the requested
table does not exist.
class airflow.contrib.operators.bigquery_to_bigquery.BigQueryToBigQueryOperator(**kwargs)
Bases: airflow.models.BaseOperator
Copies data from one BigQuery table to another.
See also:
For more details about these parameters: https://cloud.google.com/bigquery/docs/reference/v2/jobs#
configuration.copy
Parameters
• source_project_dataset_tables (list|string) – One or more dotted
(project:|project.)<dataset>.<table> BigQuery tables to use as the source data. If <project>
is not included, project will be the project defined in the connection json. Use a list if there
are multiple source tables. (templated)
• destination_project_dataset_table (string) – The destination BigQuery
table. Format is: (project:|project.)<dataset>.<table> (templated)
• write_disposition (string) – The write disposition if the table already exists.
• create_disposition (string) – The create disposition if the table doesn’t exist.
• bigquery_conn_id (string) – reference to a specific BigQuery hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery
class airflow.contrib.operators.bigquery_to_gcs.BigQueryToCloudStorageOperator(**kwargs)
Bases: airflow.models.BaseOperator
Transfers a BigQuery table to a Google Cloud Storage bucket.
See also:
For more details about these parameters: https://cloud.google.com/bigquery/docs/reference/v2/jobs
Parameters
• source_project_dataset_table (string) – The dotted (<project>.
|<project>:)<dataset>.<table> BigQuery table to use as the source data. If
<project> is not included, project will be the project defined in the connection json. (tem-
plated)
• destination_cloud_storage_uris (list) – The destination Google Cloud Stor-
age URI (e.g. gs://some-bucket/some-file.txt). (templated) Follows convention defined here:
https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple
• compression (string) – Type of compression to use.
• export_format (string) – File format to export.
• field_delimiter (string) – The delimiter to use when extracting to a CSV.
• print_header (boolean) – Whether to print a header for a CSV file extract.
• bigquery_conn_id (string) – reference to a specific BigQuery hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery
class airflow.contrib.operators.databricks_operator.DatabricksSubmitRunOperator(**kwargs)
Bases: airflow.models.BaseOperator
Submits a Spark job run to Databricks using the api/2.0/jobs/runs/submit API endpoint.
There are two ways to instantiate this operator.
In the first way, you can take the JSON payload that you typically use to call the api/2.0/jobs/runs/
submit endpoint and pass it directly to our DatabricksSubmitRunOperator through the json param-
eter. For example
    json = {
        'new_cluster': {
            'spark_version': '2.1.0-db3-scala2.11',
            'num_workers': 2
        },
        'notebook_task': {
            'notebook_path': '/Users/[email protected]/PrepareData',
        },
    }
    notebook_run = DatabricksSubmitRunOperator(task_id='notebook_run', json=json)
Another way to accomplish the same thing is to use the named parameters of the
DatabricksSubmitRunOperator directly. Note that there is exactly one named parameter for
each top level parameter in the runs/submit endpoint. In this method, your code would look like this:
    new_cluster = {
        'spark_version': '2.1.0-db3-scala2.11',
        'num_workers': 2
    }
    notebook_task = {
        'notebook_path': '/Users/[email protected]/PrepareData',
    }
    notebook_run = DatabricksSubmitRunOperator(
        task_id='notebook_run',
        new_cluster=new_cluster,
        notebook_task=notebook_task)
In the case where both the json parameter AND the named parameters are provided, they will be merged together.
If there are conflicts during the merge, the named parameters will take precedence and override the top level
json keys.
Currently the named parameters that DatabricksSubmitRunOperator supports are
• spark_jar_task
• notebook_task
• new_cluster
• existing_cluster_id
• libraries
• run_name
• timeout_seconds
Parameters
• json (dict) – A JSON object containing API parameters which will be passed directly
to the api/2.0/jobs/runs/submit endpoint. The other named parameters (i.e.
spark_jar_task, notebook_task..) to this operator will be merged with this json
dictionary if they are provided. If there are conflicts during the merge, the named parameters
will take precedence and override the top level json keys. (templated)
See also:
For more information about templating see Jinja Templating. https://docs.databricks.com/
api/latest/jobs.html#runs-submit
• spark_jar_task (dict) – The main class and parameters for the JAR task. Note
that the actual JAR is specified in the libraries. EITHER spark_jar_task OR
notebook_task should be specified. This field will be templated.
See also:
https://docs.databricks.com/api/latest/jobs.html#jobssparkjartask
• notebook_task (dict) – The notebook path and parameters for the notebook task.
EITHER spark_jar_task OR notebook_task should be specified. This field will
be templated.
See also:
https://docs.databricks.com/api/latest/jobs.html#jobsnotebooktask
• new_cluster (dict) – Specs for a new cluster on which this task will be run. EITHER
new_cluster OR existing_cluster_id should be specified. This field will be
templated.
See also:
https://docs.databricks.com/api/latest/jobs.html#jobsclusterspecnewcluster
• existing_cluster_id (string) – ID for existing cluster on which to run this task.
EITHER new_cluster OR existing_cluster_id should be specified. This field
will be templated.
• libraries (list of dicts) – Libraries which this run will use. This field will be
templated.
See also:
https://docs.databricks.com/api/latest/libraries.html#managedlibrarieslibrary
• run_name (string) – The run name used for this task. By default this will be set
to the Airflow task_id. This task_id is a required parameter of the superclass
BaseOperator. This field will be templated.
• timeout_seconds (int32) – The timeout for this run. By default a value of 0 is used
which means to have no timeout. This field will be templated.
• databricks_conn_id (string) – The name of the Airflow connection to use. By
default and in the common case this will be databricks_default. To use token based
authentication, provide the key token in the extra field for the connection.
• polling_period_seconds (int) – Controls the rate at which we poll for the result of
this run. By default the operator will poll every 30 seconds.
• databricks_retry_limit (int) – Number of times to retry if the Databricks backend
is unreachable. Its value must be greater than or equal to 1.
• databricks_retry_delay (float) – Number of seconds to wait between retries (it
might be a floating point number).
• do_xcom_push (boolean) – Whether we should push run_id and run_page_url to xcom.
class airflow.contrib.operators.dataflow_operator.DataFlowJavaOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Java Cloud DataFlow batch job. The parameters of the operation will be passed to the job.
See also:
For more detail on job submission have a look at the reference: https://cloud.google.com/dataflow/pipelines/
specifying-exec-params
Parameters
• jar (string) – The reference to a self executing DataFlow jar.
• dataflow_default_options (dict) – Map of default job options.
• options (dict) – Map of job specific options.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Plat-
form for the dataflow job status while the job is in the JOB_STATE_RUNNING state.
• job_class (string) – The name of the dataflow job class to be executed, it is often
not the main class configured in the dataflow jar file.
Both jar and options are templated so you can use variables in them.
Note that both dataflow_default_options and options will be merged to specify pipeline execution
parameter, and dataflow_default_options is expected to save high-level options, for instances, project
and zone information, which apply to all dataflow operators in the DAG.
It’s a good practice to define dataflow_* parameters in the default_args of the dag like the project, zone and
staging location.
    default_args = {
        'dataflow_default_options': {
            'project': 'my-gcp-project',
            'zone': 'europe-west1-d',
            'stagingLocation': 'gs://my-staging-bucket/staging/'
        }
    }
You need to pass the path to your dataflow as a file reference with the jar parameter, the jar needs to
be a self executing jar (see documentation here: https://beam.apache.org/documentation/runners/dataflow/
#self-executing-jar). Use options to pass on options to your job.
    t1 = DataFlowJavaOperator(
        task_id='dataflow_example',
        jar='{{var.value.gcp_dataflow_base}}pipeline/build/libs/pipeline-example-1.0.jar',
        options={
            'autoscalingAlgorithm': 'BASIC',
            'maxNumWorkers': '50',
            'start': '{{ds}}',
            'partitionType': 'DAY',
            'labels': {'foo': 'bar'}
        },
        gcp_conn_id='gcp-airflow-service-account',
        dag=my_dag)
class airflow.contrib.operators.dataflow_operator.DataflowTemplateOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Templated Cloud DataFlow batch job. The parameters of the operation will be passed to the job.
Parameters
• template (string) – The reference to the DataFlow template.
• dataflow_default_options (dict) – Map of default job environment options.
• parameters (dict) – Map of job specific parameters for the template.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Plat-
form for the dataflow job status while the job is in the JOB_STATE_RUNNING state.
It’s a good practice to define dataflow_* parameters in the default_args of the dag like the project, zone and
staging location.
See also:
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters https://cloud.google.
com/dataflow/docs/reference/rest/v1b3/RuntimeEnvironment
    default_args = {
        'dataflow_default_options': {
            'project': 'my-gcp-project',
            'zone': 'europe-west1-d',
            'tempLocation': 'gs://my-staging-bucket/staging/'
        }
    }
You need to pass the path to your dataflow template as a file reference with the template parameter. Use
parameters to pass on parameters to your job. Use environment to pass on runtime environment variables
to your job.
    t1 = DataflowTemplateOperator(
        task_id='dataflow_example',
        template='{{var.value.gcp_dataflow_base}}',
        parameters={
            'inputFile': "gs://bucket/input/my_input.txt",
            'outputFile': "gs://bucket/output/my_output.txt"
        },
        gcp_conn_id='gcp-airflow-service-account',
        dag=my_dag)
template, dataflow_default_options and parameters are templated so you can use variables in
them.
Note that dataflow_default_options is expected to save high-level options for project information,
which apply to all dataflow operators in the DAG.
See also:
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/RuntimeEnvironment
For more detail on job template execution have a look at the reference: https://cloud.google.com/dataflow/docs/
templates/executing-templates
class airflow.contrib.operators.dataflow_operator.DataFlowPythonOperator(**kwargs)
Bases: airflow.models.BaseOperator
Launches Cloud Dataflow jobs written in Python. Note that both dataflow_default_options and options will
be merged to specify pipeline execution parameters, and dataflow_default_options is expected to save high-level
options, for instance, project and zone information, which apply to all dataflow operators in the DAG.
See also:
For more detail on job submission have a look at the reference: https://cloud.google.com/dataflow/pipelines/
specifying-exec-params
Parameters
• py_file (string) – Reference to the python dataflow pipeline file.py, e.g.,
/some/local/file/path/to/your/python/pipeline/file.
execute(context)
Execute the python dataflow job.
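A minimal sketch mirroring the Java example above (the pipeline path, options and connection id are assumptions for this example):

    t2 = DataFlowPythonOperator(
        task_id='dataflow_python_example',
        py_file='/home/airflow/dags/pipeline/wordcount.py',
        options={
            'input': 'gs://my-bucket/input/*.txt',
            'output': 'gs://my-bucket/output/counts'
        },
        dataflow_default_options={
            'project': 'my-gcp-project',
            'staging_location': 'gs://my-staging-bucket/staging/'
        },
        gcp_conn_id='gcp-airflow-service-account',
        dag=dag)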
class airflow.contrib.operators.dataproc_operator.DataprocClusterCreateOperator(**kwargs)
Bases: airflow.models.BaseOperator
Create a new cluster on Google Cloud Dataproc. The operator will wait until the creation is successful or an
error occurs in the creation process.
The parameters allow you to configure the cluster. Please refer to
https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters
for a detailed explanation on the different parameters. Most of the configuration parameters detailed in the link
are available as a parameter to this operator.
Parameters
• cluster_name (string) – The name of the DataProc cluster to create. (templated)
• project_id (str) – The ID of the google cloud project in which to create the cluster.
(templated)
• num_workers (int) – The # of workers to spin up. If set to zero will spin up cluster in a
single node mode
• storage_bucket (string) – The storage bucket to use, setting to None lets dataproc
generate a custom one for you
• init_actions_uris (list[string]) – List of GCS URIs containing dataproc ini-
tialization scripts
• init_action_timeout (string) – Amount of time executable scripts in
init_actions_uris have to complete
• metadata (dict) – dict of key-value google compute engine metadata entries to add to
all instances
• image_version (string) – the version of software inside the Dataproc cluster
• custom_image (string) – custom Dataproc image; for more info see https://cloud.google.com/
dataproc/docs/guides/dataproc-images
• properties (dict) – dict of properties to set on config files (e.g. spark-defaults.conf),
see https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#
SoftwareConfig
• master_machine_type (string) – Compute engine machine type to use for the mas-
ter node
• master_disk_type (string) – Type of the boot disk for the master node (de-
fault is pd-standard). Valid values: pd-ssd (Persistent Disk Solid State Drive) or
pd-standard (Persistent Disk Hard Disk Drive).
• master_disk_size (int) – Disk size for the master node
• worker_machine_type (string) – Compute engine machine type to use for the
worker nodes
• worker_disk_type (string) – Type of the boot disk for the worker node (de-
fault is pd-standard). Valid values: pd-ssd (Persistent Disk Solid State Drive) or
pd-standard (Persistent Disk Hard Disk Drive).
• worker_disk_size (int) – Disk size for the worker nodes
• num_preemptible_workers (int) – The # of preemptible worker nodes to spin up
• labels (dict) – dict of labels to add to the cluster
• zone (string) – The zone where the cluster will be located. (templated)
• network_uri (string) – The network uri to be used for machine communication, can-
not be specified with subnetwork_uri
• subnetwork_uri (string) – The subnetwork uri to be used for machine communica-
tion, cannot be specified with network_uri
• internal_ip_only (bool) – If true, all instances in the cluster will only have internal
IP addresses. This can only be enabled for subnetwork enabled networks
• tags (list[string]) – The GCE tags to add to all instances
• region – leave as ‘global’, might become relevant in the future. (templated)
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• service_account (string) – The service account of the dataproc instances.
• service_account_scopes (list[string]) – The URIs of service account scopes
to be included.
• idle_delete_ttl (int) – The longest duration that cluster would keep alive while
staying idle. Passing this threshold will cause cluster to be auto-deleted. A duration in
seconds.
• auto_delete_time (datetime.datetime) – The time when cluster will be auto-
deleted.
• auto_delete_ttl (int) – The life duration of cluster, the cluster will be auto-deleted
at the end of this duration. A duration in seconds. (If auto_delete_time is set this parameter
will be ignored)
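A hedged sketch pulling together a few of the parameters above (the project, zone, machine types and bucket are assumptions for this example):

    create_cluster = DataprocClusterCreateOperator(
        task_id='create_dataproc_cluster',
        cluster_name='analysis-{{ ds_nodash }}',
        project_id='my-gcp-project',
        num_workers=2,
        zone='europe-west1-d',
        master_machine_type='n1-standard-4',
        worker_machine_type='n1-standard-4',
        storage_bucket='my-dataproc-staging-bucket',
        labels={'env': 'dev'},
        dag=dag)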
class airflow.contrib.operators.dataproc_operator.DataprocClusterScaleOperator(**kwargs)
Bases: airflow.models.BaseOperator
Scale, up or down, a cluster on Google Cloud Dataproc. The operator will wait until the cluster is re-scaled.
Example:
    t1 = DataprocClusterScaleOperator(
        task_id='dataproc_scale',
        project_id='my-project',
        cluster_name='cluster-1',
        num_workers=10,
        num_preemptible_workers=10,
        graceful_decommission_timeout='1h',
        dag=dag)
See also:
For more detail about scaling clusters have a look at the reference: https://cloud.google.com/dataproc/docs/
concepts/configuring-clusters/scaling-clusters
Parameters
• cluster_name (string) – The name of the cluster to scale. (templated)
• project_id (string) – The ID of the google cloud project in which the cluster runs.
(templated)
• region (string) – The region for the dataproc cluster. (templated)
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• num_workers (int) – The new number of workers
• num_preemptible_workers (int) – The new number of preemptible workers
• graceful_decommission_timeout (string) – Timeout for graceful YARN de-
commissioning. Maximum value is 1d
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
class airflow.contrib.operators.dataproc_operator.DataprocClusterDeleteOperator(**kwargs)
Bases: airflow.models.BaseOperator
Delete a cluster on Google Cloud Dataproc. The operator will wait until the cluster is destroyed.
Parameters
• cluster_name (string) – The name of the cluster to delete. (templated)
• project_id (string) – The ID of the google cloud project in which the cluster runs.
(templated)
• region (string) – leave as ‘global’, might become relevant in the future. (templated)
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
class airflow.contrib.operators.dataproc_operator.DataProcPigOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Pig query Job on a Cloud DataProc cluster. The parameters of the operation will be passed to the cluster.
It’s a good practice to define dataproc_* parameters in the default_args of the dag like the cluster name and
UDFs.
    default_args = {
        'cluster_name': 'cluster-1',
        'dataproc_pig_jars': [
            'gs://example/udf/jar/datafu/1.2.0/datafu.jar',
            'gs://example/udf/jar/gpig/1.2/gpig.jar'
        ]
    }
You can pass a pig script as string or file reference. Use variables to pass on variables for the pig script to be
resolved on the cluster or use the parameters to be resolved in the script as template parameters.
Example:
    t1 = DataProcPigOperator(
        task_id='dataproc_pig',
        query='a_pig_script.pig',
        variables={'out': 'gs://example/output/{{ds}}'},
        dag=dag)
See also:
For more detail about job submission have a look at the reference: https://cloud.google.com/dataproc/
reference/rest/v1/projects.regions.jobs
Parameters
• query (string) – The query or reference to the query file (pg or pig extension). (tem-
plated)
• query_uri (string) – The uri of a pig script on Cloud Storage.
• variables (dict) – Map of named parameters for the query. (templated)
• job_name (string) – The job name used in the DataProc cluster. This name by default
is the task_id appended with the execution date, but can be templated. The name will always
be appended with a random number to avoid name clashes. (templated)
• cluster_name (string) – The name of the DataProc cluster. (templated)
• dataproc_pig_properties (dict) – Map for the Pig properties. Ideal to put in
default arguments
• dataproc_pig_jars (list) – URIs to jars provisioned in Cloud Storage (example:
for UDFs and libs) and are ideal to put in default arguments.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• region (str) – The specified region where the dataproc cluster is created.
• job_error_states (list) – Job states that should be considered error states. Any
states in this list will result in an error being raised and failure of the task. Eg, if
the CANCELLED state should also be considered a task failure, pass in ['ERROR',
'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but
could change in the future. Defaults to ['ERROR'].
Variables dataproc_job_id (string) – The actual “jobId” as submitted to the Dataproc API.
This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as
the actual “jobId” submitted to the Dataproc API is appended with an 8 character random string.
class airflow.contrib.operators.dataproc_operator.DataProcHiveOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Hive query Job on a Cloud DataProc cluster.
Parameters
• query (string) – The query or reference to the query file (q extension).
• query_uri (string) – The uri of a hive script on Cloud Storage.
• variables (dict) – Map of named parameters for the query.
• job_name (string) – The job name used in the DataProc cluster. This name by default
is the task_id appended with the execution date, but can be templated. The name will always
be appended with a random number to avoid name clashes.
• cluster_name (string) – The name of the DataProc cluster.
• dataproc_hive_properties (dict) – Map for the Hive properties. Ideal to put in
default arguments
• dataproc_hive_jars (list) – URIs to jars provisioned in Cloud Storage (example:
for UDFs and libs) and are ideal to put in default arguments.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• region (str) – The specified region where the dataproc cluster is created.
• job_error_states (list) – Job states that should be considered error states. Any
states in this list will result in an error being raised and failure of the task. Eg, if
the CANCELLED state should also be considered a task failure, pass in ['ERROR',
'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but
could change in the future. Defaults to ['ERROR'].
Variables dataproc_job_id (string) – The actual “jobId” as submitted to the Dataproc API.
This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as
the actual “jobId” submitted to the Dataproc API is appended with an 8 character random string.
class airflow.contrib.operators.dataproc_operator.DataProcSparkSqlOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Spark SQL query Job on a Cloud DataProc cluster.
Parameters
• query (string) – The query or reference to the query file (q extension). (templated)
• query_uri (string) – The uri of a spark sql script on Cloud Storage.
• variables (dict) – Map of named parameters for the query. (templated)
• job_name (string) – The job name used in the DataProc cluster. This name by default
is the task_id appended with the execution date, but can be templated. The name will always
be appended with a random number to avoid name clashes. (templated)
• cluster_name (string) – The name of the DataProc cluster. (templated)
• dataproc_spark_properties (dict) – Map for the Spark SQL properties. Ideal to put in
default arguments