What Is Apache Airflow

Apache Airflow is an open-source platform for orchestrating complex workflows and data processing pipelines, allowing users to programmatically author, schedule, and monitor workflows. It utilizes Directed Acyclic Graphs (DAGs) to manage task dependencies and offers a web-based UI for monitoring and troubleshooting. Airflow supports various functionalities including task execution, error handling, and integration with external systems through operators and hooks.


1. What is Apache Airflow?

Apache Airflow is an open-source platform used for orchestrating complex computational workflows and data processing
pipelines. It's designed to programmatically author, schedule, and monitor workflows with ease.
2. How does Apache Airflow help in workflow management?
Airflow helps in managing workflows by allowing data engineers to script complex data pipelines as Directed Acyclic Graphs
(DAGs). It provides an intuitive interface to schedule, monitor, and troubleshoot these workflows.
3. What is a DAG in Airflow?
A Directed Acyclic Graph (DAG) in Airflow is a collection of all the tasks you want to run, organized to reflect their
dependencies and relationships.
4. How does the Airflow Scheduler work?
The Airflow Scheduler monitors all tasks and DAGs, then triggers the task instances once their dependencies are complete. It
schedules jobs based on time or external triggers.
5. What is the Airflow Meta Database?
The Airflow Meta Database is where Airflow stores its metadata. This includes information about the status of tasks, DAGs,
variables, connections, and historical data about the workflow execution.
6. Can you explain what an operator is in Airflow?
An operator in Airflow represents a single task, or a unit of work, within a DAG. Each operator determines what actually happens
in a task.
7. What are hooks in Airflow?
Hooks in Airflow are interfaces to external platforms and databases, such as MySQL, PostgreSQL, or HTTP services. They are used
to manage connections and interact with external systems.
8. How do you use Python scripts in Airflow?
Python scripts in Airflow are used to define the logic of operators, DAGs, and plugins. They are written as standard Python files
and allow for extensive customization and control over your workflows.
9. What is the Airflow UI?
The Airflow UI is a web-based interface provided by Apache Airflow that allows users to manage and monitor their workflows,
view logs, track DAGs' progress, and troubleshoot issues.
10. How do you define dependencies in Airflow?
Dependencies in Airflow are defined by setting the relationships between tasks using the set_upstream and set_downstream
methods, or the >> and << bitwise operators in Python.
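As an illustration, here is a minimal sketch of a DAG that wires two tasks together with the bitwise operators (the dag_id, dates, and bash commands are made up for the example):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dependency_demo",            # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    load = BashOperator(task_id="load", bash_command="echo load")

    # 'load' runs only after 'extract' succeeds; equivalent to extract.set_downstream(load)
    extract >> load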
11. What is the role of the Airflow Executor?
The Airflow Executor is responsible for running the tasks within a DAG. There are different types of executors, such as the
LocalExecutor, CeleryExecutor, and KubernetesExecutor, each suited for different use cases.
12. How do you monitor a workflow in Airflow?
Workflows in Airflow are monitored using the Airflow UI, which provides information about the execution status of tasks, logs, and
allows rerunning of tasks in case of failures.
13. Can you explain how XComs work in Airflow?
XComs, or "Cross-communications", are a mechanism in Airflow that allows tasks to exchange messages or data. They are stored
in Airflow's metadata database and can be used to pass information between tasks within the same DAG.
14. What is the purpose of Airflow Variables?
Airflow Variables are used to store dynamic values that can be accessed and used in DAGs and tasks. They offer a way to avoid
hard-coding and to manage configuration settings.
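For example, a short sketch of reading Variables in DAG code (the variable names and defaults here are made up):

from airflow.models import Variable

# Plain string variable with a fallback if it is not defined in the metadata DB
bucket = Variable.get("reporting_bucket", default_var="my-default-bucket")

# Variables can also hold JSON and be deserialized on read
config = Variable.get("pipeline_config", default_var={}, deserialize_json=True)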
15. How do you test an Airflow DAG?
Testing an Airflow DAG involves checking its correctness and behavior. This can be done by running individual tasks using the
Airflow CLI, using unit tests to test task logic, and checking DAG structure and dependencies.
16. What is SubDAG and when would you use it?
A SubDAG is a DAG used as a task in another parent DAG. It's useful for repeating patterns within a DAG and to modularize
complex workflows.
17. How do you handle errors and retries in Airflow?
Errors and retries in Airflow are handled by setting the retries and retry_delay parameters in task definitions. Airflow will
automatically retry a failed task according to these settings.
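A minimal sketch of a task configured with retries (the DAG, task name, and command are illustrative):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="retry_demo", start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
    # Retry up to 3 times, waiting 5 minutes between attempts, before the task is marked failed
    flaky_task = BashOperator(
        task_id="call_flaky_service",
        bash_command="curl --fail https://example.com/health",
        retries=3,
        retry_delay=timedelta(minutes=5),
    )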
18. Can you describe a scenario where you used the CeleryExecutor in Airflow?
The CeleryExecutor is used in distributed environments where you need to run tasks on multiple machines. I used it in a project
where tasks were resource-intensive and required to be distributed across different nodes to balance the load.
19. How do you secure sensitive information in Airflow?
Sensitive information in Airflow can be secured using Airflow Connections for external systems and Airflow Variables for internal
configurations, both of which can be encrypted with Fernet keys.
20. What is Airflow's Branch Python Operator?
The BranchPythonOperator is a way to run different tasks based on the logic encoded in a Python function. It's used to control the
flow of a DAG execution dynamically.
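A minimal branching sketch, assuming Airflow 2.3+ (EmptyOperator and the logical_date context key); the branch logic and task names are illustrative:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator


def choose_path(**context):
    # Return the task_id (or list of task_ids) to follow; the other branches are skipped
    return "weekday_path" if context["logical_date"].weekday() < 5 else "weekend_path"


with DAG(dag_id="branch_demo", start_date=datetime(2024, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    branch = BranchPythonOperator(task_id="branch", python_callable=choose_path)
    weekday = EmptyOperator(task_id="weekday_path")
    weekend = EmptyOperator(task_id="weekend_path")

    branch >> [weekday, weekend]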
21. How do you schedule DAGs in Airflow?
DAGs in Airflow are scheduled by setting the start_date, end_date, and schedule_interval parameters in the DAG definition. These
parameters determine when and how often the DAG should run.
22. Can you use Airflow for ETL processes? How?
Yes, Airflow is commonly used for ETL processes. It orchestrates the extraction, transformation, and loading of data by scheduling
and managing the tasks that comprise these processes.
23. What is the difference between a DAG and a task in Airflow?
In Airflow, a DAG is a collection of tasks organized with dependencies and relationships to define a workflow. A task, on the other
hand, is a single operation or step within a DAG, defined by an operator.
24. How does Airflow manage dependencies between tasks?
Airflow manages dependencies using task relationships. When a task is set as downstream of another, it will only run once the
upstream task has successfully completed.
25. Explain the concept of Airflow Plugins.
Airflow Plugins are a way to extend the functionality of Airflow. They allow you to add new operators, hooks, and interfaces to
integrate with new systems or perform specific tasks that are not available in the standard Airflow installation.
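As a rough sketch, a plugin is a Python module in the plugins/ folder that subclasses AirflowPlugin; the plugin name and macro below are made up:

# plugins/my_plugin.py
from airflow.plugins_manager import AirflowPlugin


def business_date(ds: str) -> str:
    # Illustrative custom macro; real logic would transform the run's ds string
    return ds


class MyPlugin(AirflowPlugin):
    name = "my_plugin"
    # Exposes the macro to templated fields, e.g. {{ macros.my_plugin.business_date(ds) }}
    macros = [business_date]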
26. How do you ensure high availability in Airflow?
High availability in Airflow can be achieved by setting up a multi-node cluster with a database like PostgreSQL or MySQL that
supports high availability and using a distributed executor like the CeleryExecutor.
27. What are Task Instances in Airflow?
A Task Instance in Airflow is a specific run of a task. It represents a task's execution at a particular point in time, with its own logs,
state, and context.
28. How do you manage data lineage in Airflow?
Data lineage in Airflow can be managed using XComs to pass metadata between tasks, and by using task and DAG
documentation to describe the flow and transformations of data.
29. Can you use Airflow for non-ETL workflows?
Yes, Airflow can be used for non-ETL workflows. It is a versatile tool that can orchestrate any type of task that can be executed in
a Python environment, including data analysis, machine learning model training, and more.
30. How do you handle task dependencies from external systems in Airflow?
Task dependencies from external systems can be handled in Airflow using Sensors. Sensors are a special kind of operator that
wait for a certain condition or event to occur in an external system before proceeding.
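For example, a sketch of waiting on a file dropped by an external system (the path, connection id, and intervals are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(dag_id="sensor_demo", start_date=datetime(2024, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    # Wait for the upstream system to drop a file before processing it
    wait_for_file = FileSensor(
        task_id="wait_for_export",
        filepath="/data/incoming/export.csv",
        fs_conn_id="fs_default",
        poke_interval=60,        # check every 60 seconds
        timeout=60 * 60,         # give up after an hour
    )
    process = BashOperator(task_id="process_file", bash_command="echo processing")

    wait_for_file >> process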
31. Explain how you would use Airflow in a microservices architecture.
In a microservices architecture, Airflow can be used to orchestrate the interactions between different services. It can schedule
and manage tasks that involve multiple microservices, ensuring the right order of operations and handling failures.
32. What are the best practices for scaling Airflow?
Best practices for scaling Airflow include using a distributed executor like CeleryExecutor, ensuring your database is optimized
and can handle the load, splitting your DAGs into smaller, more manageable pieces, and monitoring your Airflow instances to
understand the resource usage.
33. How do you manage configuration changes in Airflow?
Configuration changes in Airflow can be managed by using Airflow Variables and Connections, which can be set and modified
either via the UI or the command line interface.
34. What is the role of the Airflow Webserver?
The Airflow Webserver provides the web UI for Airflow. It allows users to visualize DAGs, monitor task progress, view logs, manage
Airflow configuration, and troubleshoot issues.
35. How do you automate deployment of Airflow DAGs?
Automating the deployment of Apache Airflow DAGs (Directed Acyclic Graphs) can be achieved through a combination of version
control tools like Git, CI/CD (Continuous Integration/Continuous Deployment) pipelines, and proper configuration of the Airflow
environment.
Airflow UI
1- What information can you typically find on the main page of the Airflow UI? (Select all that apply) A. The status of
DAG runs (success, failure, running, etc.) B. A list of all the DAGs C. Logs for each task instance D. The Airflow configuration file
contents
2- Which of the following can be managed from the Airflow UI? (Select all that apply) A. Viewing the DAG run history B.
Triggering DAG runs C. Editing DAG files directly D. Managing connections
Airflow CLI
3- Which of the following commands would you use to start the Airflow web server? A. airflow webserver B. airflow
scheduler C. airflow worker D. airflow initdb
4- What command is used to initialize the Airflow database? A. airflow initdb B. airflow db init C. airflow db create D.
airflow database setup
5- Which command can be used to pause a DAG? A. airflow dags pause B. airflow dags stop C. airflow dags disable D.
airflow dags suspend
6- How do you restart the Airflow scheduler from the CLI? A. airflow scheduler restart B. airflow restart scheduler C. airflow
stop scheduler && airflow start scheduler D. airflow scheduler
7- Which command can be used to list all DAGs in your Airflow environment? A. airflow dags list B. airflow dags show C.
airflow list dags D. airflow show dags
8- Which command can be used to list the tasks in a specific DAG? A. airflow tasks list B. airflow dags tasks list C. airflow
list tasks D. airflow tasks show
9- How do you start an Airflow worker? A. airflow worker B. airflow start worker C. airflow run worker D. airflow worker start
Airflow use-cases
10- Which of the following are common use-cases for using Apache Airflow? (Select all that apply) A. Data pipeline
orchestration B. Web application development C. ETL (Extract, Transform, Load) processes D. Machine learning model training
and deployment
11- What is a primary advantage of using Apache Airflow for ETL processes? A. It provides a graphical user interface for
data manipulation B. It supports distributed task execution and scheduling C. It is optimized for real-time data processing D. It is
designed specifically for web development
Writing, scheduling, and debugging DAGs
12- Which of the following Python decorators is used to define a DAG in
Airflow? A. @dag B. @task C. @schedule D. @flow
13- How do you set the schedule interval for a DAG? A. By setting the start_date parameter B. By setting the
schedule_interval parameter C. By setting the end_date parameter D. By setting the catchup parameter
14- What happens if the catchup parameter is set to False in a DAG? A. Past DAG runs are not executed if the scheduler
is started after the start_date B. The DAG will not be scheduled at all C. The DAG will run immediately upon being triggered D.
The DAG will only run once regardless of the schedule
15- What is the purpose of the max_active_runs parameter in a DAG? A. It limits the number of active DAG runs B. It sets
the maximum number of tasks in a DAG C. It specifies the maximum runtime for a DAG D. It determines the maximum number of
retries for a task
16- How can you set a task dependency where task_2 should run after task_1? (Select all that apply) A. task_1 >>
task_2 B. task_2 << task_1 C. Both A and B D. Neither A nor B
17- How can you specify that a task should run only after multiple other tasks have successfully completed? A.
Using set_downstream B. Using all_success C. Using trigger_rule D. Using task_concurrency
18- What is the default trigger rule for tasks in Airflow? A. all_success B. all_failed C. one_success D. none_failed
19- Which method can you use to set dependencies between multiple tasks in a single statement? A. set_upstream B.
set_downstream C. set_dependencies D. set_task
DAG runs
20- Which of the following statements about DAG runs in Airflow is true? A. A DAG run is a single execution of a DAG B.
DAG runs can only be triggered manually C. DAG runs do not support retries D. A DAG run is the same as a task instance
21- How can you trigger a DAG run manually from the Airflow UI? A. By clicking the ‘Trigger Dag’ button B. By editing the
DAG file C. By restarting the web server D. By creating a new DAG
22- How do you configure a DAG to run every day at 8 AM? A. schedule_interval=’0 8 * * *’ B. start_date=datetime(2021,
1, 1, 8, 0) C. schedule_interval=’@daily’ D. start_date=’0 8 * * *’
Airflow architecture components
23- Which component of the Airflow architecture is responsible for scheduling the execution of tasks? A. Web Server
B. Scheduler C. Worker D. Metadata Database
24- What does the Airflow Worker do? A. It schedules DAGs B. It runs task instances C. It stores the state and history of DAGs
D. It provides a user interface
25- What role does the Web Server play in Airflow architecture? A. It schedules tasks B. It executes tasks C. It provides a
user interface to interact with Airflow D. It stores the state and history of task instances and DAG runs
26- How does the Scheduler interact with the Metadata Database? A. It reads DAG definitions and writes task instance
states B. It only reads DAG definitions C. It only writes task instance states D. It does not interact with the Metadata Database
27- What is the primary function of the Metadata Database in Airflow? A. To store logs for each task instance B. To
execute tasks C. To store configurations and the state of DAGs and task instances D. To provide a user interface
28- What is the purpose of the Airflow Executor? A. To execute tasks B. To schedule tasks C. To provide a user interface D.
To manage user permissions
Basic Operators
29- Which operator would you use to run a Python function in an Airflow DAG? A. BashOperator B. PythonOperator C.
DummyOperator D. BranchPythonOperator
30- Which operator would you use to run a shell command in an Airflow DAG? A. PythonOperator B. BashOperator C.
DummyOperator D. BranchPythonOperator
31- Which operator would you use to wait for a specific file to be available in a directory? A. FileSensor B.
PythonOperator C. BashOperator D. ExternalTaskSensor
32- Which operator would you use to run a SQL query in an Airflow DAG? A. PythonOperator B. BashOperator C.
SqlOperator D. SqlQueryOperator
33- How can you pass parameters to a Python function when using a PythonOperator? A. Using the params parameter
B. Using the op_args and op_kwargs parameters C. Using the arguments parameter D. Using the func_args parameter
XCom Basics
34- How do you push a value to XCom from within a PythonOperator? A. Using the push method on the task instance B.
Returning the value from the Python function C. Using the xcom_push method on the task instance D. Using the xcom_set method
on the task instance
35- How do you pull a value from XCom in a PythonOperator? A. Using the xcom_pull method B. Returning the value from
the Python function C. Using the xcom_get method D. Using the pull method
36- What is a key advantage of using XComs in Airflow? A. They allow task logs to be shared across DAGs B. They enable
data sharing between tasks C. They automate the execution of tasks D. They provide a user interface for task management
Connection Basics
37- In Airflow, what is a connection? A. A way to store environment variables B. A method to define dependencies between
tasks C. A configuration that defines how Airflow can connect to external systems D. A way to trigger DAGs
38- How do you create a new connection in Airflow? A. By editing the airflow.cfg file B. By defining it in the DAG file C. By
using the Airflow UI under the ‘Admin’ tab D. By using the airflow connections command
Variable Basics
39- What is a common use of variables in Airflow? A. To store task logs B. To define dynamic values that can be used across
multiple DAGs C. To manage user permissions D. To schedule DAG runs
40- How can you access a variable’s value in an Airflow DAG? A. Using the Variable.get() method B. Using the vars
parameter C. Using the variable.get() method D. Using the var parameter
Sensor Basics
41- What is a primary use case for sensors in Airflow? A. To execute tasks B. To monitor and wait for a certain condition to
be met before proceeding C. To schedule DAGs D. To manage connections
42- Which sensor would you use to check for the existence of a file in a specific location? A. FileSensor B. HttpSensor
C. ExternalTaskSensor D. TimeSensor
Advanced Topics
43- What is the role of the Executor in Airflow’s architecture? A. To schedule tasks B. To execute tasks C. To provide a user
interface D. To manage the metadata database
44- How can you scale out Airflow to handle more tasks? A. By adding more workers B. By increasing the max_active_runs
parameter C. By reducing the schedule interval D. By optimizing DAG definitions
45- What is the purpose of the retry_delay parameter in a task? A. To set the delay before the first attempt B. To set the
delay between retries in case of a task failure C. To set the overall runtime limit for the task D. To set the time before a task is
considered failed
46- How can you enable task logging in Airflow? A. By configuring the logging section in airflow.cfg B. By setting the
log_enabled parameter in the DAG C. By using the LogOperator D. By enabling logs in the Airflow UI
47- Which of the following is a built-in Airflow executor? A. LocalExecutor B. RemoteExecutor C. ClusterExecutor D.
DistributedExecutor
Monitoring and Maintenance
48- How do you view the logs for a specific task instance in Airflow? A. By clicking on the task in the Graph view of the
Airflow UI B. By checking the airflow.cfg file C. By running the airflow logs command D. By inspecting the DAG file
49- What is the purpose of the Airflow REST API? A. To provide an interface for managing Airflow configurations B. To
interact programmatically with Airflow for DAG and task management C. To run tasks directly from the command line D. To
provide a web interface for Airflow
50- How can you backfill a DAG in Airflow? A. Using the airflow dags backfill command B. By enabling the catchup parameter
C. By running the airflow backfill command D. By setting the start_date parameter
51- Which Airflow view allows you to see a detailed breakdown of task durations? A. Graph View B. Tree View C.
Duration View D. Gantt Chart View
DAG design best practices
52- What is the recommended way to handle task retries in Airflow? A. Set a high retries parameter and a reasonable
retry_delay B. Avoid task retries to prevent DAG delays C. Use the retry method in the PythonOperator D. Manually trigger failed
tasks
53- How can you modularize your DAGs for better maintainability? A. By defining all tasks in a single file B. By using
Python functions and modules to define tasks C. By using only BashOperator D. By avoiding task dependencies
Miscellaneous
54- What does the depends_on_past parameter do when set to True in a task? A. It makes the task run only if the
previous run succeeded B. It makes the task run only if all upstream tasks have succeeded C. It makes the task depend on the
outcome of the previous task instance D. It makes the task run only once
55- What is the function of the catchup parameter in a DAG? A. To run all missed DAG runs when the scheduler starts B. To
limit the number of concurrent runs C. To ensure the DAG does not run more than once D. To delay the start of the DAG
56- Which parameter allows you to specify the maximum number of active task instances across DAG runs? A.
max_active_runs B. max_active_tasks C. max_concurrency D. task_concurrency
57- How can you ensure that a task in a DAG does not run concurrently with itself across multiple DAG runs? A. By
setting max_active_runs to 1 B. By setting concurrency to 1 C. By using the depends_on_past parameter D. By setting
task_concurrency to 1
58- What is the purpose of the SLA parameter in a task? A. To set a Service Level Agreement for task completion time B. To
set the task’s priority C. To limit the number of retries D. To specify the task’s schedule
59- What is the default web server port for Airflow? A. 8000 B. 8080 C. 8888 D. 9000
60- Which command can be used to test a single task instance in a DAG? A. airflow tasks test B. airflow tasks run C.
airflow tasks execute D. airflow tasks start
61- How can you pass configuration options to a DAG? A. Using the dag_run configuration parameter B. Using environment
variables C. Using the conf parameter in the DAG definition D. Using the params parameter in the DAG definition
62- Which executor is recommended for a production environment with multiple workers? A. SequentialExecutor B.
LocalExecutor C. CeleryExecutor D. DebugExecutor
63- What is a key benefit of using the CeleryExecutor? A. It supports distributed execution of tasks B. It runs tasks
sequentially C. It is designed for debugging D. It provides a graphical user interface
64- How do you enable task-level retries in Airflow? A. By setting the retries parameter in the task B. By configuring the
retry_enabled parameter in airflow.cfg C. By using the RetryOperator D. By setting the retry_count parameter in the DAG
65- What happens if a task fails and has retries enabled? A. The task will be retried based on the retry_delay and retries
parameters B. The task will not run again C. The DAG will stop running D. The task will be marked as skipped
66- Which command can be used to clear the status of a task instance? A. airflow tasks clear B. airflow tasks reset C.
airflow tasks purge D. airflow tasks refresh
67- How do you disable a DAG in Airflow? A. By pausing the DAG in the Airflow UI B. By deleting the DAG file C. By setting
the active parameter to False in the DAG definition D. By stopping the scheduler
68- Which Airflow component is responsible for tracking the state of task instances and DAG runs? A. Scheduler B.
Metadata Database C. Worker D. Web Server
Security
69- How can you secure the Airflow web interface? A. By enabling authentication in the airflow.cfg file B. By disabling the
web interface C. By using a firewall to block access D. By running Airflow in a private network
70- What is the purpose of the Airflow Role-Based Access Control (RBAC) system? A. To provide different levels of
access to the Airflow web interface B. To manage task concurrency C. To control DAG scheduling D. To handle task dependencies
Explain how Airflow security works.
Airflow builds the role-based access control (RBAC) of its web console on top of Flask AppBuilder (FAB). A user can have
multiple roles, and different users can share the same role. Each role can have multiple permissions, and each permission allows
the user to perform an action against a resource.
Airflow provides 5 default roles with predefined permissions:
Admin, User, Op, Viewer, and Public.
Airflow also allows you to have DAG-level roles.
dag = DAG(
    dag_id="demo_dag_access",
    default_args=DEFAULT_ARGS,
    schedule_interval="0 0 * * *",
    access_control={
        "project_A_role": {"can_dag_read", "can_dag_edit"},
        "basic_user": {"can_dag_read"},
    },
)
In addition, Airflow provides an interface that allows users to define their own SECURITY_MANAGER_CLASS, so we can implement
our own auth logic in a custom security manager.
Since Airflow UI is based on FAB, it also supports 5 auth types:
Database | OAuth | OpenID | LDAP | REMOTE_USER
How do you handle sensitive variables in Airflow?
By default, Airflow considers a variable sensitive if its name contains a string like the following:
access_token | api_key | apikey | authorization | passphrase | passwd | password | private_key | secret | token
We can also add our own custom names that will be considered sensitive by Airflow by setting the following env variable.
AIRFLOW__CORE__SENSITIVE_VAR_CONN_NAMES — a comma-separated list of extra sensitive keywords to look for in variable
names or in a connection's extra JSON.
Explain Airflow metadata. How can we access this?
In Airflow, metadata refers to the information and configuration details about workflows, tasks, connections, and other
components within the Airflow system. This metadata is typically stored in a metadata database, which is usually backed by a
SQL-based database such as SQLite, MySQL, or PostgreSQL. Airflow uses this metadata to manage and orchestrate workflows,
track task statuses, handle retries, and provide visibility into workflow execution.
We can access this data by connecting to the backend database.
In many managed solutions (MWAA) we might not have direct access to the database. We can still access this metadata
using airflow.models.
from airflow.decorators import dag, task
from airflow.models import DagModel, DagRun, TaskFail, TaskInstance, Variable
from airflow.utils.dates import days_ago
from airflow.utils.session import create_session
from datetime import datetime

# Fetch only the last MAX_AGE_IN_DAYS of data
MAX_AGE_IN_DAYS = 10

# Airflow models to export, with the date column used to filter each one (None = no filter)
OBJECTS_TO_EXPORT = [
    [DagModel, None],
    [Variable, None],
    [DagRun, DagRun.execution_date],
    [TaskInstance, TaskInstance.execution_date],
    [TaskFail, TaskFail.start_date],
]


@dag(schedule="@daily", start_date=datetime(2023, 5, 31), catchup=False)
def extract_mwaa_metadata() -> None:
    @task(task_id="extract_metadata")
    def export_db_data() -> None:
        with create_session() as session:
            for table, execution_date in OBJECTS_TO_EXPORT:
                if execution_date is not None:
                    query = session.query(table).filter(execution_date >= days_ago(MAX_AGE_IN_DAYS))
                else:
                    query = session.query(table)
                all_rows = query.all()
                tablename = table.__name__.upper()
                lst = [vars(row) for row in all_rows]
                ...
                # process the data

    export_db_data()


extract_mwaa_metadata()
Have you encountered an error with the message “Return code -9”? What does it mean & how to handle this?
Return code -9 is mostly associated with an out-of-memory error. The task instance is taking more memory than configured. We
can configure the worker’s memory setting to fix the issue.
This error might also arise due to large XCOM objects. We should use remote storage (object store or shared file system) for
intermediate data and pass its reference to the consumer task.
What is a zombie task in Airflow?
A zombie task is one that the Airflow database thinks is running but that hasn't emitted a heartbeat for a certain amount of time.
In other words, the task has finished (or its process has died), but the worker never marked it as "success" or "failed", so it
remains in the "running" state and becomes a zombie.
There are a lot of things that can cause zombies. Some examples include:
 the worker process dies (e.g. runs out of memory) before it can mark the task as "success" or "failed";
 the network connection between the worker and the Airflow database is severed;
 long-running tasks do not send any signal/log to prove their liveness.
To avoid this:
 increase the scheduler configuration scheduler_zombie_task_threshold from the default 5 minutes to somewhere around
20–30 minutes;
 set AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD to a value higher than the DAG file parsing interval;
 keep emitting logs from Kubernetes pods at an interval shorter than scheduler_zombie_task_threshold; the logging actions
confirm that the pod is still alive.
How can you ignore files from scheduler parsing in Airflow?
We can instruct the Airflow scheduler to ignore certain files in our DAG folder by creating a .airflowignore file inside the DAG
folder itself (i.e. /dags/.airflowignore).
Overall it works like a .gitignore file: we can use it to list directories or files in the DAG folder. Each line
in .airflowignore specifies a regex pattern, which means that any directory or file whose name matches any of the
patterns specified will be ignored.
As an example, let’s assume that we have created a .airflowignore file with the content outlined below.
# .airflowignore
helper
dbt_[\d]
Files like helper_functions.py , utils_HeLpEr.py , dbt_1.py , dbt_2.py , helper/utils.py will be completely ignored. Be careful though,
since if a directory’s name matches any of the patterns, this directory and all its subfolders would not be scanned by Airflow at
all.
How do you get access to the current context in task instances?
By using the following code we can access the context:
from airflow.operators.python import get_current_context
context = get_current_context()

foo = context["dag_run"].conf["foo"]
Explain the architecture of Airflow.
Airflow is an open-source workflow management tool. It includes a web server for UI interaction, a scheduler to manage job
execution, a metadata database to store workflow definitions and task metadata, and a worker to execute tasks. The web server
provides a user interface for DAG (Directed Acyclic Graph) management. The scheduler determines when to execute tasks based
on DAG schedules. Airflow supports two main distributed executors, namely the CeleryExecutor and the KubernetesExecutor.

How to handle tasks that get stuck in a queued state?


Sometimes, tasks get stuck in the queued state. This can be intentional for expected reasons, such as Airflow's parallelism limit
having been reached or pool slots being full. It can also happen unintentionally when something fails between the executor and
the scheduler. We can track such tasks by querying the Airflow database:
SELECT dag_id, run_id, task_id, queued_dttm
FROM task_instance
WHERE queued_dttm < current_timestamp - interval '30 minutes' AND state = 'queued'
For versions older than 2.6, we have three configuration properties that control this behavior:
kubernetes.worker_pods_pending_timeout
celery.stalled_task_timeout
celery.task_adoption_timeout
Since 2.6 we have a unified config (scheduler.task_queued_timeout) that we can use to solve problematic tasks stuck in a
queued state.
Explain XCom and its limitations.
XComs, short for cross-communications, are a mechanism for tasks to exchange small pieces of data. They are used to share
metadata and configuration parameters, and can also pass small amounts of data between tasks.
# In the source task (task_id: ti_1)
kwargs['ti'].xcom_push(key='data_key', value='data_value')

# In the sink / consumer task:
extracted_data = ti.xcom_pull(task_ids='ti_1', key='data_key')

Important things to remember:

 XComs are stored in the Airflow metadata database, so we should only pass small amounts of data through XCom or we risk
overloading the database.
 XComs are stored in the Airflow database without encryption, so sharing sensitive information via XCom is a risk.
 XComs are serialized for exchange. By default, Airflow serializes XCom values as JSON (pickling can optionally be enabled), so
we are limited to objects that can be serialized.
 XComs are designed for "write once, read many times". But if we use a mutable data structure as the container, multiple tasks
can mutate the XCom.
 On every retry, XComs are cleared and new XComs are created for the task instance.
 When exchanging data between two DAGs, both DAG runs must have the same execution date.
Explain Sensors in Airflow. In which scenario should we use poke vs reschedule mode?
A sensor is a type of operator that waits for a condition, checking it at a specific interval (poke_interval). If the condition is met,
the task is marked successful; otherwise, the sensor waits for another interval before checking again. A sensor has two modes:
poke and reschedule.
In poke mode, the sensor occupies a worker slot for its entire execution time and sleeps between pokes. In reschedule mode, if
the condition is not fulfilled, the sensor releases its worker slot and reschedules the next check for a later time.
So if our poke_interval is very short (less than about 5 minutes), we should use poke mode; using reschedule mode in this case
can overload the scheduler. For long-running sensors, we should opt for reschedule mode.
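A small sketch contrasting the two modes with PythonSensor (the check function, intervals, and dag_id are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.sensors.python import PythonSensor


def table_is_ready() -> bool:
    # Illustrative check; a real implementation would query the upstream system
    return False


with DAG(dag_id="sensor_modes_demo", start_date=datetime(2024, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    # Short wait: keep the worker slot and sleep between pokes
    fast_check = PythonSensor(
        task_id="fast_check",
        python_callable=table_is_ready,
        poke_interval=30,
        mode="poke",
    )

    # Long wait: release the worker slot between checks and let the scheduler requeue the task
    slow_check = PythonSensor(
        task_id="slow_check",
        python_callable=table_is_ready,
        poke_interval=10 * 60,
        timeout=6 * 60 * 60,
        mode="reschedule",
    )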
How is parallelism controlled or managed in Airflow?
Parallelism in Airflow is controlled at multiple levels. At the system/installation level, core parallelism decides how many tasks
can run at the same time per scheduler. At the DAG level, we control how many runs of the same DAG can be active at once, and
how many tasks of a particular DAG can run in parallel. The following configurations control the various aspects of parallelism in
Airflow (see the sketch after this list):
core.parallelism (× number of schedulers)
core.max_active_tasks_per_dag
core.max_active_runs_per_dag
core.default_pool_task_slot_count
max_active_tis_per_dag (task level)
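A sketch of how these limits appear in DAG code, assuming Airflow 2.2+ (max_active_tasks and max_active_tis_per_dag); the names and values are illustrative:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="parallelism_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    max_active_runs=1,       # only one run of this DAG at a time
    max_active_tasks=4,      # at most 4 tasks of this DAG running concurrently
) as dag:
    # Task-level limit: at most one instance of this task across all runs of the DAG
    load = BashOperator(
        task_id="load_table",
        bash_command="echo load",
        max_active_tis_per_dag=1,
    )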
How do we define dependencies between tasks?
Dependencies decide the flow of operations in a DAG; they control the order in which tasks are executed. The main ways to define
task dependencies in Airflow are:
set_upstream() [<<]
set_downstream() [>>]
chain()
chain_linear() (available since Airflow 2.7)
cross_downstream()
inferred dependencies via the TaskFlow API
depends_on_past  # the task depends on the success of the previous run of the same task

t0.set_downstream(t1)  # equivalent to t1.set_upstream(t0)
t0 >> t1               # equivalent to t1 << t0

chain(*list_of_tasks)
chain_linear(*list_of_tasks)

demo_function_1(demo_function_2())  # inferred via the TaskFlow API

What are hooks and why do we need them?

Hooks are abstractions over the APIs of external systems like Hive, S3, MySQL, Postgres, HDFS, and Pig. Hooks implement a
common interface where possible and act as building blocks for operators. They use Connections to configure authentication for
the external system, and operators call hooks whenever they need to interact with external systems. Hooks allow code reusability
and modularity; we can remove a lot of boilerplate code by leveraging them.
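As a rough sketch, a hook used inside a TaskFlow task (assumes the apache-airflow-providers-postgres package; the connection id and query are made up):

from airflow.decorators import task
from airflow.providers.postgres.hooks.postgres import PostgresHook


@task
def copy_recent_orders() -> int:
    # The hook reads host and credentials from the "orders_db" Airflow Connection
    hook = PostgresHook(postgres_conn_id="orders_db")
    rows = hook.get_records("SELECT id, amount FROM orders WHERE created_at >= CURRENT_DATE")
    # ... hand the rows to the next step (keep the payload small if returning via XCom)
    return len(rows)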

Oracle to GCP cloud migration steps


1. Assessment and Planning
a. Assess the Oracle Environment
 Inventory all Oracle databases, applications, and dependencies.
 Evaluate the Oracle workload size, version, and configuration (e.g., Oracle RAC, Data Guard).
 Identify any applications tightly coupled to the Oracle database and plan for their migration as well.
 Determine data sensitivity, compliance requirements, and network dependencies.
b. Choose the Right Google Cloud Services
 Compute Engine: For running Oracle on VMs, which is closest to an on-premise setup.
 Bare Metal Solution: For performance-sensitive Oracle workloads requiring near-native hardware performance (e.g.,
Oracle RAC).
 Cloud SQL (PostgreSQL or MySQL): If migrating from Oracle to another open-source database.
 Cloud Storage: For backup, archival, and data transfer.
 Google Cloud Networking (VPC, Cloud VPN, Interconnect): To set up networking between on-premise and Google
Cloud.
c. Choose a Migration Approach
 Lift-and-Shift: Rehost Oracle workloads on GCP with minimal changes (e.g., migrating to Compute Engine or Bare Metal
Solution).
 Replatform: Move the Oracle database to a different platform (e.g., Oracle to PostgreSQL or MySQL on Cloud SQL).
 Refactor: Redesign applications and data for cloud-native benefits (e.g., using managed database services).
 Hybrid Strategy: Some workloads may stay on-premise, while others move to GCP.

2. Proof of Concept (PoC)


a. Test the Migration on a Smaller Scale
 Create a PoC to test the Oracle database migration with a small, non-critical workload.
 Evaluate performance, latency, scalability, and compatibility with GCP.
 Test networking, security, data transfer, and high availability.
b. Select a Migration Tool
 Oracle GoldenGate: For near-zero downtime migration using real-time replication.
 Data Guard: If you're migrating using physical/logical standby databases.
 Google Cloud Database Migration Service: If migrating from Oracle to PostgreSQL/MySQL.
 Manual Export/Import: Use Oracle Data Pump or RMAN for smaller databases.
3. Set Up GCP Environment
a. Provision Compute Resources
 If migrating to Compute Engine, create VMs with the required CPU, memory, and disk configurations.
 If using Bare Metal Solution, work with GCP to provision bare metal servers that support Oracle workloads.
b. Set Up Networking
 Use Cloud Interconnect or VPN for secure connectivity between your on-premise data center and Google Cloud.
 Design VPCs, subnets, and firewall rules to support Oracle workloads.
 Configure DNS, load balancing, and global routing policies if needed.
c. Configure Storage
 Use Persistent Disks on Compute Engine for Oracle data files.
 Use Cloud Filestore or NFS for shared storage if needed for Oracle RAC.
d. Set Up Security and IAM
 Use IAM roles to control access to Oracle resources on GCP.
 Set up encryption (both in transit and at rest) for Oracle databases.
 Use VPC Service Controls for enhanced security in multi-tenant environments.

4. Perform Data Migration


a. Migrate the Oracle Database
 Backup and Restore: Use RMAN for full or incremental backups, and restore them on GCP (Compute Engine).
 Oracle Data Pump: Export data from your source database and import it into GCP.
expdp system/password@source_database full=y directory=data_pump_dir dumpfile=expdb.dmp
impdp system/password@target_database full=y directory=data_pump_dir dumpfile=expdb.dmp
 GoldenGate: Set up real-time replication to minimize downtime during migration.
 Data Guard: Create a physical/logical standby database on GCP and switch over once the database is synchronized.
b. Data Validation
 Run data integrity checks (e.g., comparing row counts, data validation scripts) between the source and target databases
to ensure accurate migration.
c. Application and Schema Migration
 Migrate Oracle-specific PL/SQL scripts, triggers, stored procedures, and application dependencies.
 Consider refactoring applications to be cloud-optimized (if migrating to a cloud-native database).
d. Testing
 Test the database performance, transaction throughput, and latency on GCP.
 Run application testing (functional and integration testing) with the new database on GCP.

5. Cutover
a. Perform Final Synchronization
 If using a tool like GoldenGate, perform the final sync to minimize downtime.
 Shut down writes to the on-premise Oracle database to ensure consistency.
b. Switch Over to the New Oracle Database
 If using Data Guard or replication tools, perform a switch over to the new Oracle instance in GCP.
 Test application connectivity to the new database.
c. Monitor for Issues
 Use Google Cloud’s Operations Suite (formerly Stackdriver) to monitor the health of the Oracle database and
applications post-cutover.
 Ensure backup and disaster recovery plans are in place.

6. Post-Migration Optimization
a. Optimize the Environment
 Adjust the instance type, CPU, memory, and disk based on post-migration performance metrics.
 Use GCP features like autoscaling or cloud-native services to reduce costs.
b. Database Tuning
 Perform Oracle database performance tuning on GCP by adjusting parameters, indexing, and using GCP's high-
performance storage options (e.g., SSDs).
 Implement cloud-native backup, restore, and high availability strategies, such as multi-region setups or snapshots.
c. Security and Compliance
 Regularly audit access, encryption, and data security measures in the GCP environment.
 Ensure compliance with industry-specific regulations (e.g., GDPR, HIPAA).
d. Cost Management
 Monitor costs using Google Cloud Billing reports and optimize by selecting the right storage, compute, and networking
options.
 Use Sustained Use Discounts or Committed Use Contracts to lower costs.
7. Ongoing Management and Monitoring
a. Backup and Recovery
 Schedule regular backups using Oracle RMAN, and store backups on Google Cloud Storage.
 Set up disaster recovery with replication and failover options (e.g., Oracle Data Guard in multiple GCP regions).
b. Database Monitoring
 Use Google Cloud Monitoring for real-time insights into Oracle database performance, availability, and security.
 Set up alerts for CPU, memory, storage, and network thresholds.
c. Scaling
 Use Google Cloud Load Balancer for horizontal scaling of applications.
 Set up Compute Engine autoscaling for dynamic scaling based on demand.

How are Informatica transformations migrated to GCP?


1. Assess the Existing Informatica Environment
a. Inventory the ETL Pipelines
 Identify all existing Informatica mappings, workflows, and transformations in use.
 Classify transformations based on complexity (e.g., simple filters vs. complex lookups, joins, and aggregates).
 Determine data sources (e.g., on-premise databases, flat files) and destinations (e.g., cloud storage, data warehouses).
b. Identify GCP Equivalent Services
GCP provides several services that can replicate the functionality of Informatica transformations:
 Data Integration: Google Cloud Dataflow or Google Cloud Data Fusion can replace Informatica for ETL/ELT processes.
 Data Storage: Use BigQuery for data warehousing or Cloud Storage for flat files and unstructured data.
 Workflow Orchestration: Use Cloud Composer (managed Apache Airflow) or Google Cloud Functions for
scheduling and orchestration.
 Data Movement: Use Google Transfer Service or Pub/Sub for batch and streaming data integration.

2. Choose a Migration Strategy


a. Lift and Shift (Rehost)
 Description: Move Informatica PowerCenter or Informatica Cloud directly to GCP without modifying ETL logic.
 Approach:
o Rehost the Informatica PowerCenter server in GCP using Google Compute Engine.
o Move the Informatica Cloud instances to GCP and reconfigure the data pipelines to connect to GCP services
(e.g., BigQuery, Cloud Storage).
 Pros: Minimal changes to existing ETL workflows.
 Cons: Doesn't leverage native GCP services and may be less cost-effective in the long run.
b. Replatform (Hybrid Approach)
 Description: Migrate some Informatica transformations to GCP services like Dataflow while retaining parts of the
Informatica platform.
 Approach:
o Migrate specific parts of ETL workloads to Google Cloud Data Fusion or Dataflow for better scalability.
o Use Informatica Intelligent Cloud Services (IICS), which has native GCP connectors for BigQuery, Cloud
Storage, Pub/Sub, and Cloud SQL.
o Gradually refactor complex transformations or workflows to utilize GCP-native services.
 Pros: Balances leveraging cloud-native tools with a gradual migration.
 Cons: May require more effort in designing the hybrid system.
c. Refactor (Full Transformation to GCP Services)
 Description: Rebuild ETL pipelines using GCP-native services to fully leverage cloud scalability and efficiency.
 Approach:
o Rewrite Informatica transformations as Google Cloud Dataflow jobs or Data Fusion pipelines.
o Migrate data storage to BigQuery for analytics, Cloud SQL for OLTP, or Cloud Storage for unstructured data.
o Orchestrate workflows using Cloud Composer (managed Airflow) or Cloud Functions.
o For real-time transformations, utilize Google Cloud Pub/Sub with Dataflow or BigQuery Streaming.
 Pros: Full benefit from GCP’s cloud-native features, improved scalability, and reduced costs.
 Cons: Requires more effort in rewriting pipelines and understanding GCP’s data services.

3. Migrating Informatica Transformations to GCP Services


Here’s how specific Informatica transformations can be mapped to GCP services:
a. Source/Target Systems Migration
 Informatica Sources/Targets: Flat files, relational databases, cloud storage.
 GCP Equivalents:
o BigQuery for data warehouses.
o Cloud Storage for flat files.
o Cloud SQL/Cloud Spanner for relational databases.
o Pub/Sub for streaming data ingestion.
b. Data Transformation Mappings
Informatica Transformation → GCP Equivalent (GCP Service)
 Filter → Dataflow or Data Fusion filter transforms
 Aggregator → BigQuery SQL or Dataflow GroupBy
 Joiner → Dataflow join transforms or BigQuery JOIN operations
 Lookup → BigQuery or Dataflow lookup transforms
 Sorter → BigQuery SQL ORDER BY or Dataflow sorting
 Expression/Function → Dataflow or Data Fusion expression transforms, BigQuery SQL functions
 Router → Dataflow branching transforms
 Union → Dataflow UNION or BigQuery UNION
 Update Strategy → Dataflow or BigQuery UPDATE/MERGE operations
c. Workflow Orchestration
 Replace Informatica workflows with Cloud Composer (Apache Airflow) or Cloud Functions for orchestration,
automation, and task scheduling.
d. Error Handling
 Set up error handling in Google Cloud Dataflow by defining side outputs for bad records and logging errors to
Stackdriver Logging.
 Use Cloud Functions for lightweight event-driven error handling.
e. Real-Time Data Processing
 For real-time ETL tasks, replace Informatica real-time processing with Google Cloud Pub/Sub (for message
ingestion) and Google Cloud Dataflow for streaming transformations.
 Integrate with BigQuery Streaming for real-time analytics.

4. Execution of Migration
a. Set Up the GCP Environment
 Provision necessary resources (e.g., Compute Engine for lift-and-shift, BigQuery for storage, and Cloud Storage for
raw data).
 Configure IAM roles and ensure security best practices are followed (encryption, access controls, etc.).
b. Configure Connectivity
 Ensure connectivity between your on-premise systems and GCP (if in a hybrid scenario) using Google Cloud
Interconnect or VPN.
 Set up connectors for Informatica Intelligent Cloud Services (IICS) to integrate with GCP services like BigQuery,
Cloud Storage, and Cloud SQL.
c. Test the Migration
 Run pilot projects to test the migrated transformations in the new GCP environment.
 Validate data integrity, performance, and transformation logic against expected results.
d. Optimization
 After migration, fine-tune BigQuery queries, Dataflow jobs, and overall performance using GCP-native monitoring tools
like Cloud Monitoring (formerly Stackdriver).
 Adjust compute resources (e.g., autoscaling with Dataflow) and storage configurations for cost optimization.

5. Post-Migration Considerations
a. Monitoring and Logging
 Use Google Cloud Monitoring and Logging for performance tracking, alerting, and logging errors within Dataflow,
BigQuery, or Cloud Composer pipelines.
 Enable cost management and billing alerts to monitor and control expenses post-migration.
b. Security and Compliance
 Ensure that data security, encryption, and compliance requirements (e.g., HIPAA, GDPR) are maintained in the cloud.
 Use VPC Service Controls and IAM roles to secure access to data and cloud resources.
c. Ongoing Optimization
 Continuously optimize data transformation pipelines, query performance, and resource usage on GCP using tools like
BigQuery BI Engine and Dataflow Autoscaling.

6. Tools for Migration


 Informatica Intelligent Cloud Services (IICS): Provides native connectors to Google Cloud services (BigQuery, Cloud
Storage, Pub/Sub, etc.) and can help in hybrid migration scenarios.
 Google Cloud Data Fusion: Managed ETL tool to design, build, and manage data pipelines, with a similar visual
interface to Informatica.
 Google Cloud Dataflow: Fully managed service for real-time and batch data processing, can replace complex
Informatica transformations.
 Cloud Composer: For workflow orchestration and job scheduling (similar to Informatica workflow orchestration).
By following this approach, you can effectively migrate Informatica transformations to GCP, leveraging the scalability, flexibility,
and cost-efficiency of Google Cloud's ecosystem.
Email alerts:
Setting up job failure notifications in Apache Airflow is crucial for monitoring the health of your data pipelines. It allows you
to react promptly to issues, ensuring that data workflows run smoothly. Below are the steps and best practices for handling job
failures and setting up notification alerts in Airflow.
1. Set Up Email Notifications in Airflow
Airflow supports email notifications out of the box. Here's how to configure it:
a. Configure Email Settings
1. Edit airflow.cfg: locate the airflow.cfg file in your Airflow configuration directory (usually ~/airflow/).
2. Update Email Settings: find the [smtp] section and configure your SMTP server settings.
b. Set Up Email Alerts in Your DAG
You can specify email notifications in the default_args or at the task level (see the sketch after this list).
2. Using Callbacks for Custom Notifications
If you need more flexibility, you can implement custom alerting logic using task callbacks.
a. Define Callbacks: define a function that will be called upon task failure.
b. Attach Callbacks to Your DAG or Tasks: attach the on_failure_callback to your DAG or individual tasks.
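A hedged sketch combining both approaches (the email address, callback logic, and DAG details are illustrative, and the SMTP settings in airflow.cfg are assumed to be configured):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator


def notify_failure(context):
    # Custom failure callback: receives the task context (dag_run, task_instance, exception, ...)
    ti = context["task_instance"]
    print(f"Task {ti.task_id} in DAG {ti.dag_id} failed on {context['ds']}")


default_args = {
    "email": ["oncall@example.com"],        # illustrative address
    "email_on_failure": True,               # send an email when a task fails
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": notify_failure,  # custom alerting logic
}

with DAG(
    dag_id="alerting_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    BashOperator(task_id="might_fail", bash_command="exit 1")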
SQL - get a report comparing the sales amount for each month of the current year to the same month of the previous year:
Eg:
sales_table:
sale_id, date , amount(USD)
1 , Jan-2023, 123
2 , Feb-2023, 234
3 , Jan-2024, 125
4 , Feb-2024, 236
WITH MonthlySales AS (
SELECT
EXTRACT(YEAR FROM date) AS year,
EXTRACT(MONTH FROM date) AS month,
SUM(amount) AS total_sales
FROM
sales_table
GROUP BY
year, month
)

SELECT
    COALESCE(curr.year, prev.year) AS year,
    COALESCE(curr.month, prev.month) AS month,
    curr.total_sales AS current_year_sales,
    prev.total_sales AS previous_year_sales,
    curr.total_sales - COALESCE(prev.total_sales, 0) AS sales_difference
FROM
    (SELECT year, month, total_sales FROM MonthlySales
     WHERE year = EXTRACT(YEAR FROM CURRENT_DATE())) AS curr
FULL OUTER JOIN
    (SELECT year, month, total_sales FROM MonthlySales
     WHERE year = EXTRACT(YEAR FROM CURRENT_DATE()) - 1) AS prev
ON
    curr.month = prev.month
ORDER BY
    COALESCE(curr.year, prev.year),
    COALESCE(curr.month, prev.month);
Clustering Columns:
 You can specify up to 4 clustering columns for a clustered table. BigQuery will sort the data in the table based on the
values in these columns.
 The order of columns in clustering matters; queries that filter by the first clustering column will benefit the most from
clustering.
Airflow CLI Commands to View Job Status
1. List All DAGs:
o Show all the available DAGs in your Airflow environment:
airflow dags list
2. View DAG Runs:
o See the history of DAG runs for a specific DAG (replace <dag_id> with the ID of your DAG):
airflow dags list-runs --dag-id <dag_id>
3. View Task Instances:
o Show all the tasks in the specified DAG:
airflow tasks list <dag_id>
4. Get Task Instance Details:
o Get the details of a specific task instance, including its status (replace <execution_date> with the execution date of the DAG run you want to check):
airflow tasks stats <dag_id> --execution-date <execution_date>
5. View the State of a Specific Task Instance:
o Check the state of a specific task instance (replace <dag_id>, <task_id>, and <execution_date> with the appropriate values):
airflow tasks state <dag_id> <task_id> <execution_date>
6. View All Task Instances:
o Get a tree view of all task instances for a DAG:
airflow tasks list <dag_id> --tree
7. Check Task Logs:
o View the logs for a specific task instance, which can help in troubleshooting:
airflow tasks logs <dag_id> <task_id> <execution_date>
Example Commands
Here's how you might use these commands:
 List all DAGs:
airflow dags list
 Check runs for a specific DAG:
airflow dags list-runs --dag-id my_dag_id
 Get the status of a specific task instance:
airflow tasks state my_dag_id my_task_id 2023-09-26T00:00:00+00:00
 View the logs for a specific task:
airflow tasks logs my_dag_id my_task_id 2023-09-26T00:00:00+00:00
The --use_legacy_sql=false flag in Google BigQuery is used when you want to run SQL queries using Standard SQL instead of
the legacy SQL dialect. Standard SQL is the default and more powerful query language in BigQuery, offering a richer feature set
and better compatibility with ANSI SQL standards.
What are the limitations of partitions?
Partitioning in Google BigQuery is a powerful feature that allows for the efficient management and querying of large datasets.
However, there are certain limitations and considerations to keep in mind when using partitions. Here are some of the key
limitations of partitioning in BigQuery:
1. Partition Limits
 Number of Partitions:
o BigQuery supports a maximum of 4,000 partitions in a table. If you exceed this limit, queries may fail or
performance may degrade.
2. Partition Types
 Supported Partition Types:
o BigQuery supports partitioning by date (using a TIMESTAMP or DATE column) or integer range. However, certain
partitioning types, like string or categorical partitioning, are not supported.
3. Query Limitations
 Partition Filter Requirement:
o Queries that target partitioned tables must include a filter on the partitioning column for optimal performance.
Without this filter, BigQuery may need to scan all partitions, negating the benefits of partitioning.
4. Partitioning Costs
 Storage Costs:
o Each partition incurs storage costs. If a partition has little to no data, it might not be cost-effective to maintain.
5. Data Manipulation Limitations
 No Direct Modification of Partitions:
o You cannot directly modify a partition. For example, you cannot delete or update data in a specific partition
directly; instead, you must delete or update the entire table or use partitioned queries.
6. Ingestion Limitations
 Load and Streaming:
o When loading data into partitioned tables, especially via streaming, you must specify the correct partition.
Misaligned partitioning during data ingestion can lead to data being placed in the wrong partition.
7. Partition Expiration
 Automatic Expiration:
o You can set partition expiration to automatically delete old partitions, but once set, this cannot be easily
reverted. Data in expired partitions cannot be recovered.
8. Time Partitioning vs. Integer Range Partitioning
 Different Constraints:
o Time-based partitioning has specific constraints, such as a maximum date range of 1,000 days for the time
partitioning feature, while integer range partitions may have different performance characteristics based on the
range specified.
9. Performance Considerations
 Small Partitions:
o Very small partitions (e.g., a few rows) can lead to increased query latency due to overhead in managing many
partitions.
10. Cross-Partition Queries
 Cross-partition scans:
o Querying across many partitions without effective filtering can result in high scan costs and degraded
performance.
SQL
Explain CTE usage: calculating product sales within each category and its month-on-month variation.
WITH monthly_sales AS (
    SELECT
        DATE_TRUNC(sale_date, MONTH) AS sale_month, -- truncate the date to the first day of the month
        category_id,
        SUM(amount) AS total_sales
    FROM
        sales_data
    GROUP BY
        sale_month, category_id
)
SELECT
    sale_month,
    category_id,
    total_sales,
    -- month-on-month variation within each category
    total_sales - LAG(total_sales) OVER (PARTITION BY category_id ORDER BY sale_month) AS mom_change
FROM monthly_sales
ORDER BY category_id, sale_month;
How do you calculate year-on-year sales growth and cumulative monthly sales?
WITH monthly_sales AS (
  SELECT
    DATE_TRUNC(sale_date, MONTH) AS sale_month,
    SUM(amount) AS total_sales
  FROM sales_data
  GROUP BY sale_month
),

yoy_growth AS (
  SELECT
    cur.sale_month,
    cur.total_sales AS current_year_sales,
    COALESCE(prev.total_sales, 0) AS previous_year_sales,
    cur.total_sales - COALESCE(prev.total_sales, 0) AS sales_difference,
    CASE
      WHEN prev.total_sales IS NULL OR prev.total_sales = 0 THEN NULL
      ELSE (cur.total_sales - prev.total_sales) / prev.total_sales * 100
    END AS yoy_growth_percentage
  FROM monthly_sales AS cur           -- CURRENT is a reserved word, so use a different alias
  LEFT JOIN monthly_sales AS prev
    ON cur.sale_month = DATE_ADD(prev.sale_month, INTERVAL 1 YEAR)  -- same month, previous year
),

cumulative_sales AS (
  SELECT
    sale_month,
    current_year_sales,
    previous_year_sales,
    sales_difference,
    yoy_growth_percentage,
    SUM(current_year_sales) OVER (ORDER BY sale_month) AS cumulative_sales
  FROM yoy_growth
)

-- Final result
SELECT
  sale_month,
  current_year_sales,
  previous_year_sales,
  sales_difference,
  yoy_growth_percentage,
  cumulative_sales
FROM cumulative_sales
ORDER BY sale_month;

Calculate the highest sales per department from the orders table

WITH department_sales AS (
SELECT
department_id,
SUM(amount) AS total_sales
FROM
orders
GROUP BY
department_id
)
SELECT
d.department_name,
ds.total_sales
FROM
department_sales ds
JOIN
departments d ON ds.department_id = d.department_id
WHERE
ds.total_sales = (
SELECT MAX(total_sales)
FROM department_sales
)
Lambda Architecture
Lambda Architecture is a design pattern used for big data systems that aims to provide a balance between real-time data
processing and batch processing. It combines both methods to enable a comprehensive view of data while ensuring scalability
and fault tolerance. Here’s an overview of its components, architecture, advantages, and use cases.
Key Components
1. Batch Layer:
o Function: Responsible for managing the master dataset and computing batch views.
o Data Processing: Processes large volumes of data in batches using batch processing frameworks (e.g.,
Apache Hadoop, Apache Spark).
o Output: Produces batch views, which are pre-computed aggregations that can be queried efficiently.
2. Speed Layer (or Real-Time Layer):
o Function: Deals with real-time data processing, providing low-latency insights.
o Data Processing: Processes data in real-time using streaming frameworks (e.g., Apache Kafka, Apache Storm,
Apache Flink).
o Output: Produces real-time views based on incoming data.
3. Serving Layer:
o Function: Merges outputs from both the batch and speed layers to provide a unified view of the data.
o Data Storage: Stores the final views (batch views and real-time views) in a queryable format (e.g., databases,
NoSQL stores).
o Querying: Allows users to perform queries against the combined views for analytics.
Architecture Overview
New data is fed to both the batch layer and the speed layer in parallel. The batch layer recomputes batch views over the master dataset, the speed layer maintains real-time views over recent data, and the serving layer merges the two kinds of views to answer queries.
Advantages
1. Scalability: Can handle large volumes of data by leveraging batch processing while providing real-time insights.
2. Fault Tolerance: The architecture is robust; failures can be isolated in one layer without affecting the others.
3. Flexibility: Supports both batch and real-time processing, allowing organizations to analyze data as needed.
4. Data Freshness: Users can access up-to-date data while also having the benefit of historical data through batch
processing.
Disadvantages
1. Complexity: Implementing and maintaining a Lambda Architecture can be complex due to the integration of different
processing layers.
2. Latency: The batch processing layer introduces some latency, which may not be suitable for applications requiring
immediate results.
3. Data Duplication: Since both layers process data, there can be redundancy, which requires careful management.
Use Cases
 E-Commerce: Analyzing real-time customer behavior and purchasing patterns while also processing historical sales
data.
 Social Media Analytics: Monitoring trends and user interactions in real time while analyzing user behavior over time.
 IoT Applications: Collecting data from sensors in real time and processing historical data for predictive maintenance.
Conclusion
Lambda Architecture is a powerful approach for handling big data applications that require both real-time processing and batch
processing. While it offers significant advantages in terms of scalability and flexibility, organizations should weigh its complexity
and potential latency against their specific use cases to determine if it’s the right fit for their data processing needs.
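As a toy illustration only (plain Python with made-up data, no real batch or streaming framework), the sketch below shows how the three layers fit together: a batch view recomputed from the master dataset, a speed view updated incrementally, and a serving function that merges both at query time.
from collections import defaultdict

# Historical events form the master dataset; incoming events stand in for a stream.
historical_events = [("user_a", 3), ("user_b", 5), ("user_a", 2)]
incoming_events = [("user_a", 1), ("user_c", 4)]

def batch_layer(events):
    """Recompute the batch view from the full master dataset."""
    view = defaultdict(int)
    for key, value in events:
        view[key] += value
    return dict(view)

speed_view = defaultdict(int)

def speed_layer(event):
    """Incrementally update the real-time view as each event arrives."""
    key, value = event
    speed_view[key] += value

def serving_layer(key, batch_view):
    """Merge the batch view and the real-time view to answer a query."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

batch_view = batch_layer(historical_events)
for event in incoming_events:
    speed_layer(event)

print(serving_layer("user_a", batch_view))  # 3 + 2 from batch, plus 1 from the speed layer = 6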

1. Basic GCP Setup:
 Authenticate with GCP:
gcloud auth login
 Set a GCP project:
gcloud config set project [PROJECT_ID]
2. VM Instances:
 List VM instances:
gcloud compute instances list
 Create a new VM instance:
gcloud compute instances create [INSTANCE_NAME] --zone=[ZONE]
 SSH into a VM instance:
gcloud compute ssh [INSTANCE_NAME] --zone=[ZONE]
3. Cloud Storage:
 List buckets:
gsutil ls
 Create a new bucket:
gsutil mb gs://[BUCKET_NAME]
 Upload a file to a bucket:
gsutil cp [LOCAL_FILE_PATH] gs://[BUCKET_NAME]
4. BigQuery:
 Run a query:
bq query --use_legacy_sql=false 'SELECT * FROM [DATASET].[TABLE] LIMIT 10;'
 List datasets:
bq ls
5. Cloud Functions:
 Deploy a function:
gcloud functions deploy [FUNCTION_NAME] --runtime [RUNTIME] --trigger-http --allow-unauthenticated
 Invoke a deployed function:
gcloud functions call [FUNCTION_NAME] --data '{"key":"value"}'
6. Cloud Pub/Sub:
 Create a topic:
gcloud pubsub topics create [TOPIC_NAME]
 Publish a message:
gcloud pubsub topics publish [TOPIC_NAME] --message "Hello, world!"
7. Kubernetes (GKE):
 Get Kubernetes cluster credentials:
gcloud container clusters get-credentials [CLUSTER_NAME] --zone [ZONE]
 View GKE clusters:
gcloud container clusters list

PIN POINTS

 Denormalization allows you to use nested and repeated fields in BigQuery, which optimizes the way related data is
stored and queried.
 Denormalization in BigQuery can significantly impact performance, often in a positive way, due to the architectural
differences between traditional relational databases and columnar data warehouses like BigQuery.
 Denormalization in BigQuery generally improves query performance by reducing joins, simplifying the schema, and
enabling more efficient use of BigQuery’s columnar storage. The trade-off is higher storage costs due to data
redundancy and potentially slower update operations. However, in read-heavy analytical workloads—which BigQuery is
designed for—denormalization is often the best approach for optimizing performance
 ackDeadline: the time a subscriber has to acknowledge an outstanding message. Once the deadline passes, the message is no longer considered outstanding and Pub/Sub will attempt to redeliver it.
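For example, the acknowledgement deadline is set when the subscription is created. This is a minimal sketch assuming the google-cloud-pubsub client and placeholder project, topic, and subscription names.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path("my-project", "my-topic")               # placeholder names
subscription_path = subscriber.subscription_path("my-project", "my-sub")   # placeholder names

# ack_deadline_seconds is the ackDeadline: if the subscriber does not ack a
# delivered message within this window, it stops being outstanding and
# Pub/Sub redelivers it.
subscriber.create_subscription(
    request={
        "name": subscription_path,
        "topic": topic_path,
        "ack_deadline_seconds": 60,
    }
)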

 A PCollection is immutable, so you can apply multiple transforms to the same one (see the Beam sketch after this list).
 You can disable Dataflow autoscaling by explicitly specifying the flag --autoscaling_algorithm=NONE.
 You can run up to 25 concurrent Dataflow jobs per Google Cloud project.
 A single Dataflow job can use a maximum of 1,000 Compute Engine instances.
 Catchup: by default, Airflow will run any past scheduled intervals that have not been run. To avoid catchup, explicitly pass catchup=False in the DAG definition.
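Here is a small Apache Beam sketch (local runner, toy data) illustrating the PCollection point above: the same immutable collection feeds two independent transforms. Running it on Dataflow, and flags such as --autoscaling_algorithm=NONE, would be supplied through pipeline options.
import apache_beam as beam

# The same immutable PCollection ("numbers") feeds two independent transforms;
# neither branch modifies it. With the local runner no extra options are needed;
# on Dataflow you would pass runner/project options (and, optionally,
# --autoscaling_algorithm=NONE) via PipelineOptions.
with beam.Pipeline() as p:
    numbers = p | "Create" >> beam.Create([1, 2, 3, 4])

    doubled = numbers | "Double" >> beam.Map(lambda x: x * 2)
    evens = numbers | "KeepEvens" >> beam.Filter(lambda x: x % 2 == 0)

    doubled | "PrintDoubled" >> beam.Map(print)
    evens | "PrintEvens" >> beam.Map(print)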
 Wrap up!
 Well done! You've successfully created your first data pipeline with Airflow 😎

 Now you are able to:
 Define a Data Pipeline
 Execute a SQL request with the PostgresOperator
 Execute a Python function with the PythonOperator
 Execute an HTTP request against an API
 Wait for something to happen with Sensors
 Use Hooks to access secret methods
 Exchange data between tasks
 Here is the final code 👇
from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.providers.http.sensors.http import HttpSensor
from airflow.providers.http.operators.http import SimpleHttpOperator
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

import json
from pandas import json_normalize
from datetime import datetime


def _process_user(ti):
    user = ti.xcom_pull(task_ids="extract_user")
    user = user['results'][0]
    processed_user = json_normalize({
        'firstname': user['name']['first'],
        'lastname': user['name']['last'],
        'country': user['location']['country'],
        'username': user['login']['username'],
        'password': user['login']['password'],
        'email': user['email']})
    processed_user.to_csv('/tmp/processed_user.csv', index=None, header=False)


def _store_user():
    hook = PostgresHook(postgres_conn_id='postgres')
    hook.copy_expert(
        sql="COPY users FROM stdin WITH DELIMITER as ','",
        filename='/tmp/processed_user.csv'
    )


with DAG('user_processing', start_date=datetime(2022, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:

    create_table = PostgresOperator(
        task_id='create_table',
        postgres_conn_id='postgres',
        sql='''
            CREATE TABLE IF NOT EXISTS users (
                firstname TEXT NOT NULL,
                lastname TEXT NOT NULL,
                country TEXT NOT NULL,
                username TEXT NOT NULL,
                password TEXT NOT NULL,
                email TEXT NOT NULL
            );
        '''
    )

    is_api_available = HttpSensor(
        task_id='is_api_available',
        http_conn_id='user_api',
        endpoint='api/'
    )

    extract_user = SimpleHttpOperator(
        task_id='extract_user',
        http_conn_id='user_api',
        endpoint='api/',
        method='GET',
        response_filter=lambda response: json.loads(response.text),
        log_response=True
    )

    process_user = PythonOperator(
        task_id='process_user',
        python_callable=_process_user
    )

    store_user = PythonOperator(
        task_id='store_user',
        python_callable=_store_user
    )

    create_table >> is_api_available >> extract_user >> process_user >> store_user
 This process is known as Backfill. You may want to backfill data even when catchup is disabled. This can be done through the CLI:
airflow dags backfill --start-date START_DATE --end-date END_DATE dag_id
 Backfilling lets you run a DAG for a period in the past. To prevent a DAG from automatically running old intervals, set the catchup parameter to False in the DAG definition.

Datasets are amazing, but they have limitations as well:

 DAGs can only use Datasets in the same Airflow instance. A DAG cannot wait for
a Dataset defined in another Airflow instance.

 Consumer DAGs are triggered every time a task that updates datasets
completes successfully. Airflow doesn't check whether the data has been
effectively updated.

 You can't combine different schedules like datasets with cron expressions.

 If two tasks update the same dataset, as soon as one is done, that triggers the
Consumer DAG immediately without waiting for the second task to complete.

 Airflow monitors datasets only within the context of DAGs and Tasks. If an
external tool updates the actual data represented by a Dataset, Airflow has no
way of knowing that.
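A minimal producer/consumer sketch, assuming Airflow 2.4+ (where Dataset scheduling is available) and a hypothetical dataset URI, shows how these limitations play out: the consumer below is triggered purely by the producer task succeeding, not by any check of the underlying data.
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.python import PythonOperator

# Hypothetical URI: Airflow treats it as an opaque identifier and never
# inspects the data it points to.
orders_dataset = Dataset("s3://data-lake/orders.parquet")

with DAG("orders_producer", start_date=datetime(2023, 1, 1),
         schedule="@daily", catchup=False):
    PythonOperator(
        task_id="update_orders",
        python_callable=lambda: print("orders updated"),
        outlets=[orders_dataset],   # marks the dataset as updated when this task succeeds
    )

# Triggered every time update_orders succeeds, whether or not the file really changed.
with DAG("orders_consumer", start_date=datetime(2023, 1, 1),
         schedule=[orders_dataset], catchup=False):
    PythonOperator(
        task_id="process_orders",
        python_callable=lambda: print("processing orders"),
    )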

 By default, up to 16 tasks run concurrently for a given DAG (max_active_tasks_per_dag).

 Remove DAG examples
 To keep our Airflow instance nice and clean, we are going to remove the DAG examples from the UI
 To do this
 Open the file docker-compose.yaml
 Replace the value 'true' with 'false' for the AIRFLOW__CORE__LOAD_EXAMPLES environment variable

 Save the file


 Restart Airflow by typing docker-compose down && docker-compose up -d

 Once it's done, go back to localhost:8080 and the example DAGs should no longer appear in the UI

 Concurrency, the parameters you must know!


 Airflow has several parameters to tune your tasks and DAGs concurrency.
 Concurrency defines the number of tasks and DAG Runs that you can execute at the same time (in
parallel)

 Starting from the configuration settings
 parallelism / AIRFLOW__CORE__PARALLELISM
 This defines the maximum number of task instances that can run in Airflow per scheduler. By
default, you can execute up to 32 tasks at the same time. If you have 2 schedulers: 2 x 32 = 64 tasks.
 What value to define here depends on the resources you have and the number of schedulers running.

 max_active_tasks_per_dag / AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG
 This defines the maximum number of task instances allowed to run concurrently in each DAG. By
default, you can execute up to 16 tasks at the same time for a given DAG across all DAG Runs.

 max_active_runs_per_dag / AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG
 This defines the maximum number of active DAG runs per DAG. By default, you can have up to 16
DAG runs per DAG running at the same time.
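These global settings also have per-DAG counterparts. A minimal sketch, assuming Airflow 2.3+ (for EmptyOperator) and a hypothetical DAG, overriding them at the DAG level:
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Per-DAG counterparts of the settings above:
#   max_active_tasks -> caps concurrent task instances for this DAG (default 16)
#   max_active_runs  -> caps concurrent DAG runs for this DAG (default 16)
with DAG(
    "concurrency_demo",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
    max_active_tasks=8,
    max_active_runs=2,
):
    EmptyOperator(task_id="placeholder")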

The main limitation of SQLite is its lack of concurrency support, which makes it unsuitable for high-write, multi-user, or highly
concurrent environments.
Key Limitations Related to Concurrency:
1. Single-Writer Limitation:
o SQLite allows only one write operation at a time. While it supports multiple readers concurrently, only one
process or thread can write to the database at a time. This can create a bottleneck in scenarios where there are
frequent write operations, as other write requests must wait for the current write transaction to finish.
2. Database Locking:
o SQLite uses database-level locking rather than row-level locking, meaning that when a write operation is in
progress, the entire database is locked, preventing other writes. This limits performance in multi-user
environments where simultaneous writes are common.
3. Limited Scalability:
o Due to its concurrency limitations, SQLite is not designed for high-traffic applications or systems requiring high
scalability. It is more suited to applications with fewer concurrent users or lower write volume, such as mobile
apps, embedded systems, or local file-based databases.
Other Limitations:
 Database Size: While SQLite can handle databases up to 281 terabytes in theory, its performance tends to degrade
with very large databases. It's not as efficient for handling extremely large datasets compared to more robust database
systems like PostgreSQL or MySQL.
 No Built-in User Management: SQLite doesn't have user and permission management, making it less suitable for
multi-user systems with role-based access control.
 Limited Advanced Features: SQLite lacks many advanced features found in other relational databases, such as stored
procedures, complex triggers, and full-text search capabilities (without extensions).
 Serverless Architecture: SQLite operates directly on disk, which is great for simplicity but lacks the centralized
management features of server-based databases like load balancing, clustering, or replication.
In summary, SQLite is best suited for low-concurrency, lightweight, or embedded applications but is limited in high-concurrency,
large-scale, or multi-user environments.
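The single-writer behaviour is easy to reproduce with Python's built-in sqlite3 module. This toy sketch (local demo.db file, short lock timeout) shows a second writer failing while the first still holds the write lock.
import sqlite3

# Two connections to the same file; autocommit mode (isolation_level=None) so the
# explicit BEGIN statements below fully control the transactions.
conn1 = sqlite3.connect("demo.db", timeout=0.1, isolation_level=None)
conn2 = sqlite3.connect("demo.db", timeout=0.1, isolation_level=None)

conn1.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")

conn1.execute("BEGIN IMMEDIATE")          # first writer takes the write lock
conn1.execute("INSERT INTO t VALUES (1)")

try:
    conn2.execute("BEGIN IMMEDIATE")      # second writer cannot acquire the lock
except sqlite3.OperationalError as exc:
    print("second writer blocked:", exc)  # -> "database is locked"

conn1.execute("COMMIT")                   # releasing the lock lets other writers proceed
conn1.close()
conn2.close()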
2. Structs: You can use STRUCTs to represent complex data types. For instance, you can define a STRUCT type and use it in your table schema (the field names here are just examples):
CREATE TABLE person (id INT64, name STRING, address STRUCT<street STRING, city STRING>);
This table has a column named "address" with a STRUCT type that contains multiple fields.
3. Querying Nested Data: You can query nested and repeated fields using dot notation and the UNNEST function to flatten arrays when necessary. For example, to query all elements in the "numbers" array:
SELECT id, num FROM my_table, UNNEST(numbers) AS num;

1. Create an Internal (Native) Table: To create a native table in BigQuery with the bq tool, define a schema and specify the dataset in which to create it. Example:
bq mk --table --schema "field1:STRING,field2:INTEGER,field3:FLOAT" project_id:dataset_name.table_name
Replace field1, field2, field3 with your actual column names and data types, and project_id, dataset_name, table_name with your Google Cloud project ID, dataset name, and desired table name.
2. Create an External Table: To create an external table with the bq tool, specify the schema, the source format, and the external data location (for example, a Cloud Storage URI) in the external table definition. Example:
bq mk --external_table_definition=field1:STRING,field2:INTEGER,field3:FLOAT@CSV=gs://bucket_name/path/file.csv project_id:dataset_name.external_table_name
Replace the schema, source format, and URI with your own values.
- BigQuery provides robust security features, including access control lists (ACLs), encryption at rest and in transit, and
integration with Google Cloud Identity and Access Management (IAM).

Regarding the "staging" period, BigQuery doesn't have a specific staging area like some other data warehouses. When you load
data into BigQuery, it is ingested and processed almost immediately. There isn't a defined staging duration. Data is available for
querying as soon as it is successfully loaded. However, keep in mind that data loading times may vary depending on the volume
of data being loaded and other factors. It's generally a good practice to verify that the data is available for querying before
running important queries to ensure data consistency.
