Airflow - Interview - Question - Answers - Manual 1
7. What is the Airflow Web UI, and what are its key features?
Answer: The Airflow Web UI is the browser-based interface for monitoring and managing workflows. Key features include the DAGs list with pause and trigger controls, the Graph and Grid (or Tree, in older versions) views for inspecting task status, per-task-instance logs, Gantt and task duration charts, and admin pages for managing connections, variables, and pools.
Answer:
o A Task Instance is a specific run of a task in a DAG. It represents
the state of a task for a given execution of the DAG. Each task
instance has a status (e.g., success, failed, running, skipped) and
may contain logs and other metadata related to that specific run.
12. How does Apache Airflow handle retries for failed tasks?
Answer: Airflow retries a failed task automatically according to the task's retries and retry_delay parameters: retries sets how many additional attempts are made after the first failure, and retry_delay sets how long to wait between attempts.
from datetime import timedelta
from airflow.operators.python import PythonOperator

task = PythonOperator(
    task_id='example_task',
    python_callable=my_function,  # assumed to be defined elsewhere
    retries=3,                    # up to 3 retry attempts after a failure
    retry_delay=timedelta(minutes=5),
    dag=dag
)
13. What is the difference between a DAG and a SubDag in Apache
Airflow?
Answer: A DAG is a standalone workflow: a collection of tasks and dependencies that the scheduler runs on its own schedule. A SubDag is a DAG embedded inside another DAG via the SubDagOperator, so it runs as a single task of the parent DAG. SubDAGs are useful for grouping related tasks, but they can cause scheduler deadlocks and performance problems and are deprecated in newer Airflow releases in favor of TaskGroups.
14. What are the different ways to trigger a DAG in Apache Airflow?
Answer: A DAG can be triggered on its schedule_interval by the scheduler, manually from the UI ("Trigger DAG"), from the CLI (airflow dags trigger), through the REST API, from another DAG via the TriggerDagRunOperator, or for historical dates via a backfill.
17. What is the difference between a Task and a DAG Run in Apache Airflow?
Answer: A Task is a single unit of work defined in a DAG (an instance of an operator). A DAG Run is one execution of the entire DAG for a particular logical date; it contains a task instance for every task in the DAG.
Answer:
o The Airflow Web Server provides the user interface (UI) for
managing and monitoring DAGs and tasks. It allows users to view
the status of DAG runs, task instances, logs, and other metadata
related to their workflows. The web server is typically accessed
via a web browser and is an essential tool for debugging and
interacting with Airflow.
24. What are Sensors in Apache Airflow, and when would you use
them?
Answer: Sensors are a special type of operator that wait for a condition to become true before letting downstream tasks run, for example a file landing (FileSensor), a task in another DAG finishing (ExternalTaskSensor), or an HTTP endpoint responding. Use them whenever a workflow depends on an external event. A minimal sketch follows.
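A minimal sensor sketch, assuming Airflow 2.x; the file path and timing values are placeholders:
python
from datetime import datetime

from airflow import DAG
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id='wait_for_file',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    wait_for_input = FileSensor(
        task_id='wait_for_input',
        filepath='/data/input.csv',  # placeholder path
        poke_interval=60,            # re-check every 60 seconds
        timeout=60 * 60,             # fail if not found within an hour
    )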
25. What is the TriggerDagRunOperator, and when would you use it?
Answer: The TriggerDagRunOperator is an operator that, when executed, starts a run of another DAG. Use it to chain otherwise independent DAGs, for example kicking off a reporting DAG once an ingestion DAG finishes.
28. What are the different Executor types in Apache Airflow, and
what are their use cases?
Answer: The executor determines how task instances are actually run. The SequentialExecutor runs one task at a time (the default with SQLite, suitable only for testing); the LocalExecutor runs tasks in parallel processes on a single machine; the CeleryExecutor distributes tasks to a pool of worker machines through a message broker, suiting high-throughput production setups; the KubernetesExecutor launches each task in its own Kubernetes pod, giving per-task isolation and elastic scaling.
o Example:
python
task = PythonOperator(
    task_id='my_task',
    python_callable=my_function,
    retries=3,                          # up to 3 retry attempts
    retry_delay=timedelta(minutes=10),  # wait 10 minutes between attempts
    dag=dag,
)
Answer:
o When a task fails in Airflow, the system typically attempts to
retry the task based on the configured retries and retry_delay
parameters. If a task fails beyond the number of allowed retries,
its status is marked as failed. You can also use trigger rules to
customize behavior for failed tasks, such as running downstream
tasks only if upstream tasks fail (a sketch follows).
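A hedged sketch of such a trigger rule, assuming Airflow 2.x; cleanup_fn is a placeholder callable:
python
from airflow.operators.python import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

# Runs as soon as at least one upstream task has failed,
# instead of the default all_success behavior.
cleanup = PythonOperator(
    task_id='cleanup_on_failure',
    python_callable=cleanup_fn,  # placeholder: your cleanup logic
    trigger_rule=TriggerRule.ONE_FAILED,
    dag=dag,
)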
33. How can you trigger a task within a DAG using the
TriggerDagRunOperator?
Answer: Add a TriggerDagRunOperator task to the DAG and point its trigger_dag_id at the DAG you want to start; when the task runs, it creates a new run of the target DAG.
o Example:
python
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

trigger = TriggerDagRunOperator(
    task_id='trigger_another_dag',
    trigger_dag_id='another_dag',  # the DAG to start when this task runs
    dag=dag,
)
Answer:
o Task logs in Airflow can be viewed from the Airflow Web UI. To
view logs: open the DAG, click the relevant task instance in the
Graph or Grid view, then click the Log button to view detailed
logs for that specific task instance.
2. What is the role of the Airflow Scheduler, and how does it work?
Answer: The Scheduler is the component that decides what runs and when. It continuously parses the DAG files, creates DAG runs for each schedule interval, checks which task instances have all of their dependencies met, and hands those tasks to the executor. It also enforces retry, concurrency, and pool limits.
Answer: A DAG can be triggered manually using the Airflow UI, CLI, or
API:
o UI: Go to the DAGs page and click the "Trigger DAG" button for
the desired DAG.
o CLI: Run airflow dags trigger <dag_id> (Airflow 2.x syntax).
o API: You can send a POST request to the Airflow REST API to
trigger a DAG (a sketch follows).
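A minimal sketch of the API route, assuming Airflow 2.x's stable REST API with basic auth enabled; the host, credentials, and DAG id are placeholders:
python
import requests

response = requests.post(
    'http://localhost:8080/api/v1/dags/example_dag/dagRuns',
    auth=('admin', 'admin'),  # placeholder credentials
    json={'conf': {}},        # optional run configuration
)
print(response.status_code, response.json())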
13. What is the Airflow web server, and how does it interact with
other components?
Answer: The web server hosts the Airflow UI. It reads DAG definitions and
task state from the metadata database, which the scheduler and workers
keep up to date, and lets users trigger, pause, and inspect DAGs; it does
not execute tasks itself.
What do the start_date and end_date parameters of a DAG define?
Answer: The start_date parameter defines when the DAG should start
running. It doesn't mean the DAG will start at that exact time, but it
indicates the first date the scheduler should consider for scheduling
the DAG. The end_date specifies the date when the DAG should stop
running; a short sketch follows.
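A short sketch showing both parameters on a DAG; the dates are placeholders:
python
from datetime import datetime
from airflow import DAG

dag = DAG(
    'bounded_dag',
    start_date=datetime(2024, 1, 1),   # first date considered for scheduling
    end_date=datetime(2024, 12, 31),   # no runs are scheduled after this date
    schedule_interval='@daily',
)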
18. What are task states in Airflow, and what do they mean?
Answer: Each task instance moves through states such as none, scheduled,
and queued (waiting to run), running, success, failed, up_for_retry
(failed but will be retried), skipped (bypassed by branching or trigger
rules), and upstream_failed (an upstream task failed, so it never ran).
o Unique Task IDs: Ensure that task IDs are unique within a
DAG.
21. What are Airflow Connections and how do you define them?
Answer: A Connection stores the credentials and host details needed to talk to an external system (databases, cloud services, APIs). Connections are identified by a conn_id and can be defined in the UI under Admin > Connections, with the CLI (airflow connections add), or as environment variables named AIRFLOW_CONN_<CONN_ID> containing a connection URI; a sketch follows.
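A sketch of the environment-variable form; the conn id, host, and credentials are placeholders (in practice the variable is usually set in the shell or deployment config rather than in Python):
python
import os

# AIRFLOW_CONN_<CONN_ID> holds a connection URI
os.environ['AIRFLOW_CONN_MY_POSTGRES'] = (
    'postgresql://user:password@db.example.com:5432/analytics'
)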
o retry_delay: Specifies the fixed amount of time to wait between
retry attempts.
25. How would you schedule a DAG to run every 10 minutes, but
only on weekdays?
python
from datetime import datetime
from airflow import DAG

dag = DAG(
    'my_dag',
    schedule_interval='*/10 * * * 1-5',
    start_date=datetime(2024, 1, 1),
)
This cron expression */10 * * * 1-5 will run the DAG every 10 minutes on
Monday through Friday.
python
for item in items:  # 'items' is assumed to be a list defined in the DAG file
    task = PythonOperator(
        task_id=f'task_{item}',
        python_callable=my_function,
        op_args=[item],
        dag=dag,
    )
This would dynamically create a task for each item in the list.
Answer: You can limit how long a task may run with the execution_timeout parameter:
python
from datetime import timedelta
from airflow.operators.python import PythonOperator

task = PythonOperator(
    task_id='my_task',
    python_callable=my_function,
    execution_timeout=timedelta(minutes=30),  # hard cap on runtime
    dag=dag,
)
If the task exceeds the specified time, Airflow will terminate it and mark it as
failed.
o Up for Retry: Task has failed but is awaiting retry based on retry
parameters.
Answer:
o SubDagOperator:
Advantages: Groups a set of related tasks into a reusable unit
and shows them as a single node in the parent DAG's Graph view.
Disadvantages: The sub-DAG runs inside the parent as a single
task, which can cause scheduler deadlocks and worker-slot
starvation; SubDAGs are deprecated in newer Airflow releases in
favor of TaskGroups.
o TriggerDagRunOperator:
Advantages:
Triggers an entirely separate DAG, allowing you to
decouple workflows and run them independently.
Disadvantages: The triggered DAG runs outside the parent, so
cross-DAG status tracking and dependency handling require
extra work (e.g., an ExternalTaskSensor).
6. How would you monitor and log Airflow's performance and task
execution?
Answer: Use the Web UI's Gantt chart and task duration views to spot slow tasks, ship task logs to a central store, export metrics to StatsD or Prometheus for dashboards, and attach SLAs and failure callbacks for alerting. A minimal SLA sketch follows.
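A minimal SLA sketch; my_function is assumed to be defined elsewhere, and the 30-minute window is a placeholder:
python
from datetime import timedelta
from airflow.operators.python import PythonOperator

# SLA misses appear under Browse > SLA Misses in the UI and can
# trigger email alerts if email settings are configured.
slow_task = PythonOperator(
    task_id='monitored_task',
    python_callable=my_function,  # placeholder callable
    sla=timedelta(minutes=30),
    dag=dag,
)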
13. How does Airflow handle DAG and task dependency resolution in
case of failure or retries?
Answer: When a task fails, the scheduler keeps retrying it up to its retries limit, and downstream tasks wait until their trigger rule is satisfied. Under the default all_success rule, a permanently failed task marks its downstream tasks as upstream_failed so they never run, while custom trigger rules (e.g., all_done, one_failed) let downstream tasks proceed anyway.
14. How would you implement and manage complex branching logic
in Airflow?
Answer: Use the BranchPythonOperator, whose callable returns the
task_id (or list of task_ids) of the branch to follow; all other branches
are skipped. For complex flows, combine branching with a trigger rule
such as none_failed_min_one_success on the join task so the DAG
converges cleanly. A minimal sketch follows.
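A minimal branching sketch, assuming Airflow 2.3+ (EmptyOperator replaced the older DummyOperator); the branch condition is a placeholder:
python
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator

def choose_branch(**context):
    # Placeholder condition: branch on the logical date string
    return 'small_load' if context['ds'].endswith('01') else 'big_load'

branch = BranchPythonOperator(
    task_id='branch',
    python_callable=choose_branch,
    dag=dag,
)
small = EmptyOperator(task_id='small_load', dag=dag)
big = EmptyOperator(task_id='big_load', dag=dag)
branch >> [small, big]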
15. How do you create a custom operator in Airflow?
Answer: Subclass BaseOperator, accept custom parameters in __init__, and implement execute:
python
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class MyCustomOperator(BaseOperator):
    @apply_defaults
    def __init__(self, param1, param2, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.param1 = param1
        self.param2 = param2

    def execute(self, context):
        # Do the actual work here using self.param1 / self.param2
        self.log.info("Running with %s and %s", self.param1, self.param2)
python
custom_task = MyCustomOperator(
    task_id='my_custom_task',
    param1='value1',
    param2='value2',
    dag=dag,
)
16. What are some common pitfalls in Airflow when scaling for high
throughput, and how can they be avoided?
o Task Queue Overload: If too many tasks are queued up and there
aren't enough workers, task execution can be delayed. This can
be mitigated by adding workers (or enabling autoscaling), raising
parallelism and per-DAG concurrency limits appropriately, and
using pools and priority_weight to keep critical tasks moving.
17. How does Airflow handle time zones, and how would you ensure
consistency across different environments?
Answer: Airflow stores all datetimes internally in UTC. Use timezone-aware start_date values (e.g., via pendulum) so schedules are interpreted in the intended zone, set default_timezone in airflow.cfg consistently across environments, and keep the servers themselves on UTC. A sketch follows.
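A sketch of a timezone-aware DAG using pendulum; the zone and schedule are placeholders:
python
import pendulum
from airflow import DAG

# Airflow still stores times in UTC internally; the aware start_date
# makes the scheduler interpret the cron schedule in Berlin time.
dag = DAG(
    'tz_aware_dag',
    start_date=pendulum.datetime(2024, 1, 1, tz='Europe/Berlin'),
    schedule_interval='0 6 * * *',  # 06:00 Berlin time daily
)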
o KubernetesExecutor: Runs each task instance in its own
Kubernetes pod, giving per-task resource isolation, custom
images per task, and elastic scaling; well suited to spiky or
heterogeneous workloads.
19. How do you deal with DAGs that have many tasks and complex
interdependencies in terms of maintainability and performance?
Answer: Keep large DAGs maintainable by grouping related tasks with TaskGroups, generating repetitive tasks in loops or from configuration, factoring shared logic into modules, and bounding load with max_active_tasks and pools. A TaskGroup sketch follows.
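A TaskGroup sketch, assuming Airflow 2.x; the task names are placeholders:
python
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.task_group import TaskGroup

with DAG('grouped_dag', start_date=datetime(2024, 1, 1),
         schedule_interval=None) as dag:
    start = EmptyOperator(task_id='start')
    with TaskGroup('transform') as transform:
        # These tasks collapse into a single node in the Graph view
        clean = EmptyOperator(task_id='clean')
        enrich = EmptyOperator(task_id='enrich')
        clean >> enrich
    end = EmptyOperator(task_id='end')
    start >> transform >> end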
22. What is the role of the "DAGrun" in Airflow, and how does it
relate to task execution?
Answer: A DagRun represents one execution of a DAG for a specific logical (execution) date. When a DagRun is created, whether by the scheduler, a manual trigger, or a backfill, a task instance is created for every task in the DAG and run as its dependencies are met; the DagRun's state (running, success, failed) summarizes those task instances.
24. How would you handle DAG scheduling when a DAG has a large
number of tasks with complex dependencies, and you need to
ensure that it does not overwhelm the scheduler?
Answer:
o DAG and Task Modularity: Break up the DAG into smaller sub-DAGs
using the SubDagOperator (or TaskGroups in newer releases) to
reduce complexity and manage dependencies more easily.
o Concurrency Limits: Use max_active_runs, max_active_tasks, and
pools so a single large DAG cannot saturate the scheduler and
workers.
25. How does Airflow handle backfilling, and what happens if a DAG
run is missed or delayed?
Answer: With catchup=True (the default), the scheduler backfills by creating a DAG run for every schedule interval between start_date and now that has not yet run, so missed or delayed runs are created automatically once the scheduler catches up, each keeping its original logical date. You can also backfill explicitly from the CLI with airflow dags backfill -s <start> -e <end> <dag_id>, or set catchup=False to skip missed intervals; a sketch follows.
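A short sketch of the catchup flag; the dates are placeholders:
python
from datetime import datetime
from airflow import DAG

# With catchup=True (the default), the scheduler creates a run for every
# daily interval between start_date and now that has not yet run.
dag = DAG(
    'backfill_example',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=True,
)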
26. What are Airflow’s mechanisms for task state management, and
how would you handle task failures in production?
Answer: Task state lives in the metadata database and is visible in the UI; task instances can be retried, cleared, or marked success/failed manually. In production, combine per-task retries with on_failure_callback (or email_on_failure) for alerting, and use SLAs to catch tasks that hang rather than fail; a callback sketch follows.
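A hedged sketch of a failure callback; the alerting body and my_function are placeholders:
python
from airflow.operators.python import PythonOperator

def alert_on_failure(context):
    # Placeholder alerting hook: swap in Slack, PagerDuty, email, etc.
    ti = context['task_instance']
    print(f"Task {ti.task_id} failed in DAG {ti.dag_id} at {context['ts']}")

task = PythonOperator(
    task_id='critical_task',
    python_callable=my_function,  # placeholder callable
    retries=2,
    on_failure_callback=alert_on_failure,
    dag=dag,
)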
27. What are some advanced strategies for dealing with data skew
when running Airflow in distributed environments?
Answer: Partition the work so no single task receives a disproportionate share of the data, fan out one task per partition (dynamic task mapping in Airflow 2.3+), route heavy partitions to dedicated pools or queues with more resources, and push the heavy computation down to the processing engine (e.g., Spark) rather than into the Airflow worker itself.
Answer:
o Role-based Access Control (RBAC): Airflow provides RBAC to
restrict access to DAGs and their components. This ensures only
authorized users can trigger, pause, or edit DAGs.