BIG DATA
Big Data refers to extremely large and complex datasets that are difficult to store,
process, and analyze using traditional data processing tools and methods. These
datasets can come from various sources, such as social media, sensors, transactions,
and more, and often exceed the capabilities of conventional database systems. Big
data is commonly characterized by the following five "V"s:
1. Volume: The sheer amount of data generated every second. It can be terabytes,
petabytes, or even more, making it challenging to store and process.
2. Velocity: The speed at which data is generated and needs to be processed. For
example, real-time data from social media, sensors, and other devices.
3. Variety: The different types and formats of data—structured (like databases),
semi-structured (like logs or XML), and unstructured (like text, images, or
videos).
4. Veracity: The trustworthiness or quality of the data. With large datasets, data
quality can vary, leading to challenges in ensuring accurate and reliable
insights.
5. Value: The usefulness and insights that can be derived from analyzing big
data to make informed decisions.
HADOOP
Hadoop is an open-source framework used for processing and storing large datasets
in a distributed computing environment. It is designed to handle big data and
provide a scalable, fault-tolerant, and efficient way to store and analyze vast amounts
of data across a network of computers.
HDFS (Hadoop Distributed File System) is the primary storage system used
by Apache Hadoop for storing large datasets across distributed environments.
It is designed to store vast amounts of data reliably, efficiently, and in a scalable
manner across a cluster of machines. HDFS is optimized for handling big data
applications that require high throughput and fault tolerance.
Key Features of HDFS:
1. Distributed Storage:
a. HDFS is a distributed file system, meaning it divides data into blocks
and stores these blocks across multiple machines in a cluster. This
enables it to scale easily as the amount of data grows, with each node
storing a part of the total data.
2. Fault Tolerance:
a. One of the core features of HDFS is data replication. Each block of
data is typically replicated across multiple nodes in the cluster (often 3
replicas by default). This ensures that if one node fails, the data can
still be accessed from other nodes where the replica is stored,
providing high availability and data durability.
3. Large Data Files:
a. HDFS is optimized for storing large files rather than small files. It is
especially designed to efficiently handle large-scale datasets typical in
big data applications (such as terabytes or petabytes of data).
4. Block-based Storage:
a. In HDFS, files are split into fixed-size blocks (typically 128MB or
256MB) for storage. These blocks are stored across the cluster, and
the file metadata is managed by the NameNode (explained below).
The block size can be adjusted based on the application's needs.
5. High Throughput:
a. HDFS is designed for high throughput, which is ideal for
applications that need to read and write large amounts of data
sequentially. However, it is not optimized for low-latency access or
real-time queries, as it focuses more on batch processing of large
datasets.
6. Write Once, Read Many:
a. HDFS is designed for a write once, read many model. This means
that data is written once into the system and then read multiple times,
which is typical for big data processing scenarios like MapReduce
jobs or analytics workloads.
7. Scalability:
a. HDFS can scale out horizontally by adding more machines (or nodes)
to the cluster, which automatically increases the storage and
computing capacity of the system. It can handle massive amounts of
data by distributing it across many machines.
8. Data Integrity:
a. HDFS ensures data integrity by performing checksums on data blocks.
If a block is corrupted, the system can automatically detect the issue
and attempt to recover the data by retrieving the replica from another
node.
1. NameNode:
a. The NameNode is the master node in the HDFS architecture. It
manages the metadata of the file system, such as the file-to-block
mapping, block locations, and permissions. However, the NameNode
does not store the actual data but holds the information about where
the data blocks are stored across the cluster. The NameNode is crucial
for managing the overall file system structure.
b. Failure Recovery: If the NameNode fails, the entire HDFS system
can become unavailable. To mitigate this risk, a Secondary
NameNode or Checkpoint Node is often used for periodic
checkpoints to recover the NameNode’s state.
2. DataNode:
a. The DataNodes are the worker nodes in the HDFS cluster. They store
the actual data blocks that make up the files in HDFS. DataNodes are
responsible for reading and writing data to the storage disks, and they
report the status of blocks (health, replication count, etc.) to the
NameNode periodically.
b. Data Replication: The DataNodes also handle replication, ensuring
that the number of replicas of each block is maintained across
different nodes in the cluster.
3. Block:
a. Files in HDFS are split into blocks of fixed size (typically 128MB or
256MB). These blocks are distributed across multiple DataNodes in
the cluster. The block size is designed to optimize for large data
transfers and reduce overhead when accessing large datasets.
b. Block Replication: By default, each block is replicated three times
across different DataNodes. This replication provides redundancy and
fault tolerance.
4. Client:
a. The client is the application or user that interacts with the HDFS. The
client initiates file operations such as reading or writing data to the
HDFS. The client communicates with the NameNode to get metadata
(e.g., which DataNode stores which block) and then directly
communicates with the DataNodes to read or write the data.
Advantages of HDFS:
1. Scalability:
a. HDFS can scale horizontally by adding more nodes to the cluster,
enabling it to handle petabytes of data.
2. Fault Tolerance:
a. Through data replication, HDFS ensures high availability and fault
tolerance. Even if individual nodes fail, data is still accessible from
other nodes.
3. High Throughput:
a. HDFS is optimized for high-throughput access to large datasets,
making it suitable for big data analytics and batch processing.
4. Cost-Effective:
a. Since HDFS uses commodity hardware for storing data, it is more
cost-effective compared to traditional relational databases and other
proprietary storage systems.
5. Data Locality:
a. HDFS strives to store data close to where it will be processed (data
locality), which improves performance in distributed computing tasks,
such as MapReduce.
Disadvantages of HDFS:
2. Write Once:
a. HDFS follows a write-once, read-many model, meaning that once a
file is written, it cannot be modified. This limits its use in scenarios
where frequent updates or random writes are required.
3. Not Optimized for Low-Latency Access:
a. HDFS is designed for batch processing and high-throughput access,
not for real-time, low-latency data access or interactive queries.
Conclusion:
Replication factor in HDFS
• By default, the replication factor in HDFS is set to 3. This means that each
data block will be replicated three times and stored on three different
DataNodes.
• This default replication provides a good balance between data reliability
and storage efficiency for most use cases. However, it can be adjusted
depending on the specific requirements of your cluster.
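In practice the default comes from the dfs.replication property in hdfs-site.xml, and
the replication of an existing file can also be changed per file. A minimal sketch
using the Hadoop FileSystem API from Scala (path and value are illustrative
assumptions):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object SetReplicationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()          // picks up core-site.xml / hdfs-site.xml
    val fs = FileSystem.get(conf)

    // Change the replication factor of an existing file to 2 (illustrative path)
    fs.setReplication(new Path("/data/input/file.txt"), 2.toShort)

    fs.close()
  }
}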
----> Hadoop has several daemons (background processes) that run on cluster nodes.
These include the NameNode, DataNode, ResourceManager, NodeManager, and more, which
collectively manage data storage and processing in the Hadoop cluster.
2. MapReduce:
a. MapReduce is the processing layer of Hadoop. It is a programming
model used to process large datasets in parallel across multiple nodes.
b. It works in two phases:
i. Map: The input data is processed in parallel by "mapper" tasks
to create key-value pairs.
ii. Reduce: The key-value pairs are grouped and processed by
"reducer" tasks to produce the final output.
Key functions of the Mapper:
• The Mapper reads the input data, which can be stored in files (like HDFS), and
applies a transformation to it.
• It processes data in parallel (in a distributed manner), working on small chunks
of data at a time.
• The output from the Mapper is typically a set of key-value pairs (often referred
to as "intermediate key-value pairs"). These pairs are the result of the mapping
operation.
• The Mapper doesn't perform any aggregation. It simply takes input, applies a
function (map operation), and produces an output.
• Input: A text file containing the following sentence: Hello World Hello Hadoop
• Mapper Function: The Mapper will read each word from the input text and emit
a key-value pair where the key is the word and the value is 1 (indicating that the
word occurred once).
(Hello,1)
(World,1)
(Hello,1)
(Hadoop, 1)
• The Reducer takes the intermediate key-value pairs produced by the Mapper
and groups them by key. All values associated with the same key are processed
together.
• The Reducer performs the actual aggregation, such as summing, averaging, or
applying other operations to the values associated.
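The classic WordCount job is normally written in Java against the Hadoop MapReduce
API; as a hedged illustration of the same map and reduce phases described above, here
is the equivalent logic in Scala using Spark's RDD API (sample sentence taken from
the example):

import org.apache.spark.sql.SparkSession

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()
    val lines = spark.sparkContext.parallelize(Seq("Hello World Hello Hadoop"))

    // "Map" phase: emit a (word, 1) pair for every word
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // "Reduce" phase: sum the values for each key
    val counts = pairs.reduceByKey(_ + _)

    counts.collect().foreach(println)   // e.g. (Hello,2), (World,1), (Hadoop,1)
    spark.stop()
  }
}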
6. HBase
HBase is a distributed, scalable, and NoSQL database built on top of the
Hadoop ecosystem. It is designed to store large amounts of sparse data in a
fault-tolerant and highly available manner. HBase is modelled after Google’s
Bigtable and is often used for applications that require fast access to large
volumes of structured or semi-structured data.
2. Spark
Spark (often referred to as Apache Spark) is a unified, open-source computing
framework for distributed data processing. It was developed by the UC Berkeley
AMPLab and later donated to the Apache Software Foundation. Spark is designed
to be fast, scalable, and highly efficient for big data workloads and analytics. It can
process large datasets, both in real-time (streaming) and in batch, across many
machines in a distributed environment.
Data Serialization and Deserialization
Key Points:
• Object to byte stream: When you serialize data, you're converting the data
(such as an object, array, or dataset) into a sequence of bytes or a standardized
format so that it can be saved in files, sent over a network, or shared between
different applications or systems.
• Usage: Serialization is used in scenarios like storing data in a database,
sending data over the network, or saving data to files.
Deserialization
Key Points:
• Byte stream to object: When you deserialize data, you're reconstructing the
original object, data structure, or state from the byte stream or data format that
was serialized.
• Usage: Deserialization is used when you need to access or manipulate the data
after it has been transmitted or stored in a serialized format.
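As a small, hedged illustration of the two directions using plain Java object
serialization from Scala (the class and values are illustrative assumptions):

import java.io._

// A serializable case class (illustrative)
case class User(name: String, age: Int) extends Serializable

object SerializationSketch {
  def main(args: Array[String]): Unit = {
    // Serialization: object -> byte stream
    val bytesOut = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytesOut)
    out.writeObject(User("Alice", 25))
    out.close()
    val bytes = bytesOut.toByteArray     // could be written to a file or sent over a network

    // Deserialization: byte stream -> object
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    val user = in.readObject().asInstanceOf[User]
    println(user)                        // User(Alice,25)
  }
}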
Avro:
• A binary serialization format often used with Apache Hadoop and Apache
Kafka.
• Provides compact storage and fast data transmission.
• It has support for schema evolution.
A Sequence File is a flat file format used in the Hadoop ecosystem to store data in
a key-value pair structure. It is primarily designed for use within the Hadoop
MapReduce framework and is particularly optimized for binary storage of data.
Sequence Files are used for storing large datasets in a compact and efficient way,
making them suitable for high-performance data processing. They are commonly
used with frameworks like HDFS (Hadoop Distributed File System) and HBase
to store data that is accessed in parallel by multiple nodes in a distributed system.
RC File
RCFile (Record Columnar File) is a columnar storage format used in the Apache Hive
ecosystem. It partitions table data into row groups and then stores each column of a
row group contiguously, which improves compression and scan performance over plain
row-based formats. RCFile has largely been superseded by ORC (Optimized Row
Columnar), which is much more widely used and standardized in modern big data
ecosystems like Apache Hive and Apache Spark for storing large datasets.
• ORC stores data in a columnar format, which means that it stores all values
for each column in contiguous blocks. This is in contrast to row-based
storage formats (like CSV or JSON), where data is stored in rows.
• Columnar storage allows for more efficient compression, as similar values
within each column can be stored together, reducing storage size.
Parquet File
Parquet is an open-source columnar storage file format designed for efficient data
storage and retrieval. It is optimized for big data processing frameworks like
Apache Hadoop, Apache Spark, and Apache Hive, and is particularly useful for
analytical workloads.
Here’s an in-depth explanation of Parquet files:
• Parquet stores data in a columnar format, meaning that the data for each
column is stored separately. This is in contrast to row-based formats (like
CSV or JSON), where all values for a row are stored together.
• The columnar format allows for better compression, because similar data
values (typically within the same column) are stored together. This leads to
more efficient data storage and faster query performance.
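A minimal sketch of writing and reading Parquet from Spark in Scala (paths and data
are illustrative; assumes an active SparkSession named spark):

// Assumes an active SparkSession `spark`
import spark.implicits._

val df = Seq(("John", 28), ("Sara", 25)).toDF("name", "age")
df.write.mode("overwrite").parquet("/tmp/people.parquet")

// Reading back: only the requested column is scanned, thanks to columnar storage
val people = spark.read.parquet("/tmp/people.parquet")
people.select("name").show()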
Presto
Presto is a distributed SQL query engine designed for running fast, interactive
queries on large datasets. It was originally developed by Facebook to address the
need for running fast analytic queries across a variety of data sources. Presto is
open-source and is widely used in big data environments for querying data stored in
various types of databases, data lakes, and other storage systems.
JSON
JSON (JavaScript Object Notation) is a lightweight, text-based data-interchange
format. A JSON document is built from:
1. Objects
2. Arrays
3. Key-Value Pairs
4. Data Types
JSON Syntax Rules:
{
"name": "Alice",
"age": 25,
"isActive": true,
"address": {
"street": "456 Oak St",
"city": "Los Angeles",
"zip": "90001"
},
"languages": ["English", "Spanish"],
"isMarried": null,
"score": 95.5
}
Parquet file:
The Parquet file format is a columnar storage format designed for efficient data
processing and storage. It is widely used in big data ecosystems like Apache Spark,
Apache Hive, and Apache Drill due to its efficiency, performance, and ability to
handle complex data types.
• Schema: Parquet files include metadata that describes the schema of the data,
making it self-describing.
• Support for Nested Data Structures: Parquet can store complex data types
like arrays, maps, and structs.
• Splitting: Parquet supports splitting large files into smaller parts, enabling
parallel processing.
1. File Header: The Parquet file begins with a magic number to identify it as a
Parquet file. The magic number is the 4-byte string PAR1, and it appears at
both the beginning and the end of the file.
2. Row Groups:
o A row group is a collection of rows, and the data for each column in the
row group is stored separately (this is why it is called columnar storage).
3. Column Chunks:
o Each column chunk contains data for a single column in the row group.
4. Pages:
o Parquet organizes data into pages to optimize for I/O operations. Each
page can be stored in a compressed format.
5. File Footer:
o The footer is located at the end of the Parquet file and contains critical
metadata, including:
ORC
6. Indexing: ORC provides built-in support for indexing the data, which speeds
up query execution by reducing the amount of data that needs to be scanned.
7. Predicate Filtering: ORC allows predicate filtering and has the ability to
perform queries that filter data at the storage layer.
8. Efficient Storage for Complex Data Types: ORC efficiently stores complex
data types such as maps, arrays, and structs, providing better support for
non-flat schemas.
9. ACID Support: ORC files support transactions and are compatible with
ACID (Atomicity, Consistency, Isolation, Durability) properties in systems
like Apache Hive.
AVRO
An Avro file is divided into several sections. Here’s a breakdown of its structure:
1. File Header:
o Every Avro file begins with a magic number (the string Obj in ASCII)
to identify it as an Avro file. This helps to ensure that the file is correctly
interpreted.
2. Schema:
o Avro files embed the schema used to serialize the data within the file.
This allows consumers to understand how to deserialize the data
correctly.
o The schema is stored in JSON format and defines the structure of the
data, including the fields, data types, and whether a field is optional.
o The schema is typically defined at the time of writing data, and it can
evolve as the data structure changes.
3. Data Blocks:
o The data itself is stored in blocks that contain the actual serialized
records. These blocks are divided into record batches, and each block
contains a sequence of records of the same schema.
o Each record in the data block is serialized in Avro's binary format. Blocks can
be compressed (using a codec such as deflate or snappy recorded in the header),
making Avro efficient for storing large datasets.
o Each data block is followed by the file's sync marker, which lets readers locate
block boundaries and split the file for parallel processing.
4. Compression:
o The compression codec (e.g., null, deflate, snappy) is specified once in the file
header and applies to all data blocks.
5. Sync Marker:
o Unlike Parquet and ORC, an Avro container file does not have a footer. Instead, a
16-byte sync marker, generated when the file is created and recorded in the
header, is written after every data block.
o The sync marker allows readers to detect block boundaries and to split the file,
which is what makes Avro files suitable for parallel processing.
|---------------------------------|
| Magic Number: 'Obj' + version   |
|---------------------------------|
| File Metadata (schema as JSON,  |
| compression codec)              |
|---------------------------------|
| 16-byte Sync Marker             |
|---------------------------------|
| Data Block 1                    |
| (record count, size, data)      |
|---------------------------------|
| Sync Marker                     |
|---------------------------------|
| Data Block 2                    |
|---------------------------------|
| Sync Marker                     |
|---------------------------------|
| ...                             |
|---------------------------------|
Difference between Spark and MapReduce
• Real-Time Processing: not supported in MapReduce; supported in Spark with Spark
Streaming.
• Iterative Processing: difficult in MapReduce (requires multiple MapReduce jobs);
supported in Spark (RDDs allow iterative algorithms).
• Libraries and Ecosystem: limited in MapReduce (basic MapReduce tasks); Spark has a
rich ecosystem (MLlib, GraphX, Spark SQL, etc.).
• Cluster Management: MapReduce runs on Hadoop YARN; Spark runs on YARN, Mesos,
Kubernetes, or standalone.
MapReduce:
• A programming model for processing and generating large datasets that can
be parallelized across a distributed cluster of computers.
• It involves two main steps: the Map phase (where data is split and processed)
and the Reduce phase (where results are aggregated).
• Primarily designed for batch processing.
Spark:
• An open-source, distributed computing system designed to handle both
batch processing and real-time streaming data.
• Uses in-memory processing, which improves speed and performance,
making it much faster than traditional MapReduce.
• Offers more complex APIs, including support for machine learning (MLlib),
graph processing (GraphX), and SQL (Spark SQL).
2. Performance
• MapReduce:
o Works with data stored in HDFS (Hadoop Distributed File System),
and its operations involve reading and writing to disk during each stage
of computation (Map and Reduce).
o Disk-based processing leads to slower execution compared to Spark.
o I/O bound, meaning it can be slower when handling large amounts of
data, as each operation requires writing intermediate data to disk.
• Spark:
o In-memory processing allows it to store intermediate data in memory
(RAM) between operations, reducing the need to repeatedly read and
write to disk.
o This makes Spark faster than MapReduce, often up to 100 times faster
for in-memory
3. Ease of Use
• MapReduce:
o Has a low-level API, meaning developers must write more code to
accomplish simple tasks, making it harder to program.
o It requires a good understanding of the MapReduce programming
model.
• Spark:
o Provides high-level APIs in multiple languages like Java, Scala,
Python, and R, making it more user-friendly and easier to program.
o Spark provides higher-level operations like DataFrame (similar to a
table) and Dataset for SQL-like operations, reducing the amount of
code developers need to write.
Data Processing Model
• MapReduce:
o Primarily used for batch processing. Data is processed in large chunks,
and each job (Map and Reduce) runs independently without sharing
data between jobs.
o Does not have built-in support for real-time processing.
• Spark:
o Supports both batch processing and real-time streaming (with Spark
Streaming).
o Enables interactive queries and can handle more complex workloads,
including machine learning, graph processing, and SQL-based
querying (via Spark SQL).
Fault Tolerance
• MapReduce:
o Achieves fault tolerance through data replication in HDFS. If a task
fails, it can be re-executed from a backup replica.
o Tasks are retried in case of failures, but it involves additional overhead.
• Spark:
o Achieves fault tolerance through a feature called lineage. Each RDD
(Resilient Distributed Dataset) tracks how it was derived from other
datasets. If a partition of an RDD is lost, Spark can recompute it from
the lineage information rather than relying on replication.
o This makes Spark more efficient in handling failures.
Programming Model
• MapReduce:
o Has a two-step process:
▪ Map: Processes input data in parallel and produces key-value
pairs.
▪ Reduce: Aggregates results based on the keys.
o Works well for simple map-reduce tasks, but does not support more
complex operations like joins or iterative algorithms without additional
coding.
• Spark:
o RDDs (Resilient Distributed Datasets) and DataFrames/Datasets
form the core data structures in Spark. RDDs allow more advanced
operations such as map, filter, reduce, join,
Real-Time Processing
• MapReduce:
o Does not support real-time streaming. It is focused on batch jobs,
where data is processed in large chunks after being accumulated.
• Spark:
o With Spark Streaming, it can process real-time data streams (e.g.,
from Kafka or Flume), allowing Spark to handle use cases like real-
time analytics or streaming machine learning.
Libraries and Ecosystem
• MapReduce:
o MapReduce itself is just a programming model. For more complex
tasks like machine learning or graph processing, you would need to use
other libraries (e.g., Mahout for machine learning).
• Spark:
o Spark provides a rich ecosystem with integrated libraries for:
▪ Machine Learning (MLlib)
▪ Graph Processing (GraphX)
▪ SQL queries (Spark SQL)
▪ Real-time Streaming (Spark Streaming)
Use Cases
• MapReduce:
o Best suited for batch processing tasks like ETL (Extract, Transform,
Load), large-scale log processing, or simple word count applications.
• Spark:
o Ideal for interactive queries, real-time analytics, machine learning,
graph processing, and other complex workloads. It is used in scenarios
like real-time event processing, recommendation systems, and big
data analytics
Cluster Management
• MapReduce:
o Runs on Hadoop, and uses YARN (Yet Another Resource Negotiator)
or MapReduce JobTracker for resource management and job
scheduling.
• Spark:
o Spark can run on Hadoop YARN, Mesos, Kubernetes, or standalone
mode. It has its own cluster manager, making it more flexible in terms
of deployment options.
Apache Spark vs Apache Hive
• Purpose: Spark is a fast, distributed data processing engine; Hive provides data
warehousing with SQL-like querying on Hadoop.
• Processing Model: Spark is in-memory and fast, handling both batch and stream
processing; Hive is disk-based batch processing using MapReduce.
• Speed: Spark is fast due to in-memory computation; Hive is slower due to MapReduce
execution.
• Data Storage: Spark can connect to various data sources (HDFS, S3, etc.); Hive
primarily uses HDFS for storage.
• Query Language: Spark uses Spark SQL and supports multiple languages (Python, R);
Hive uses HiveQL (SQL-like).
• Performance: Spark offers better performance, especially for real-time and complex
tasks; Hive has lower performance and is suited for batch jobs.
• Use Case: Spark is used for real-time streaming, machine learning, and batch
processing; Hive is used for SQL querying of batch jobs on Hadoop.
• Integration: Spark integrates with many other big data tools; Hive is integrated
with the Hadoop ecosystem.
• Real-time Processing: Spark supports real-time processing with Spark Streaming; in
Hive, real-time processing is limited and complex.
• Flexibility: Spark is highly flexible for complex and advanced analytics; Hive is
primarily for SQL-like batch jobs.
Spark core
What is Spark?
Spark is an open-source distributed data processing framework designed for big data
processing and analytics. It was developed to overcome the limitations of the
traditional Hadoop MapReduce model.
Apache Spark has a distributed architecture designed to provide fast, scalable, and
fault-tolerant processing of large datasets. Below is a breakdown of the main
components of the Apache Spark architecture:
1. Driver Program
2. Cluster Manager
• Role: The Cluster Manager is responsible for managing resources across the
cluster and scheduling the tasks of the Spark jobs.
• Responsibilities:
o Decides where the jobs will run and allocates resources like memory
and CPU to each job.
o There are several types of cluster managers that Spark can use:
▪ Standalone Cluster Manager (Simple, used for small clusters)
▪ YARN (Hadoop's cluster manager)
▪ Mesos (A more advanced, fine-grained resource manager)
3. Worker Nodes
4. Executors
• Role: Executors are the core computation units that run on the worker nodes.
• Responsibilities:
o They execute tasks and store data for the duration of the job.
o Each executor runs in its own JVM (Java Virtual Machine) and
operates independently.
o Executors are responsible for managing data locality (i.e., placing data
as close to the computation as possible) and storing data in RDDs or
DataFrames.
o RDDs or DataFrames are stored either in memory or on disk based on
data partitioning.
5. Tasks
• Role: Tasks are the smallest units of work in Spark and are executed on each
partition of the data.
• Responsibilities:
o A Spark job is divided into multiple tasks that are distributed across the
available executor nodes.
o These tasks are the units of work that perform actual computation (e.g.,
map, filter, reduce operations).
6. Resilient Distributed Datasets (RDDs)
7. DAG Scheduler
• Role: The DAG Scheduler is responsible for breaking up a Spark job into
smaller stages and scheduling them for execution.
• Responsibilities:
o Stages in Spark correspond to tasks that can be executed in parallel.
Each stage is separated by wide transformations (like groupByKey or
join).
o After a stage is completed, the DAG scheduler sends tasks to the
available worker nodes.
o It ensures fault tolerance by recomputing lost data through lineage (as
RDDs hold metadata about how they were created).
8. Task Scheduler
• Role: The Task Scheduler schedules tasks that are distributed across the
worker nodes.
• Responsibilities:
o Divides stages into tasks and allocates tasks to different worker nodes
based on the availability of resources (CPU, memory, etc.).
o It also takes care of task locality, ensuring that the tasks are placed on
the node where the data resides to avoid unnecessary data shuffling.
9. Spark Context (SparkContext)
• Role: The SparkContext is the entry point of a Spark application; it represents
the connection between the application and the cluster.
• Responsibilities:
o Connects to the cluster manager to request resources (CPU, memory) for the
application.
o Creates RDDs, broadcast variables, and accumulators, and submits jobs for
execution on the cluster.
Workflow in Spark:
Spark Architecture Diagram Overview:
Summary:
Apache Spark API
The Apache Spark API is a set of programming interfaces that allows developers
to interact with and utilize Apache Spark for distributed data processing. Spark
provides APIs in multiple programming languages like Java, Scala, Python, and R,
enabling developers to write applications for large-scale data processing.
Data partitioning in HDFS refers to the way data is divided into smaller chunks and
distributed across multiple machines in a Hadoop cluster. Partitioning is an essential
concept because it enables parallel processing and efficient storage of data across
different nodes in the cluster.
In the context of HDFS, data partitioning specifically means splitting large files into
blocks, which are the basic units of data storage and management in HDFS.
groupByKey() vs reduceByKey() in Apache Spark
groupByKey is a transformation operation that groups the data based on the keys in a
(key, value) pair RDD. It takes an RDD of key-value pairs and groups the values by
key, returning a new RDD where the values are collected into an iterable per key.
reduceByKey, by contrast, combines the values for each key using an associative
reduce function and performs partial aggregation on each partition before the
shuffle, which generally makes it more efficient than groupByKey for aggregations. A
short example contrasting the two is shown below.
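A short sketch contrasting the two (assumes sc is an existing SparkContext, as in
spark-shell; the data is illustrative):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// groupByKey: shuffles all values, then collects them per key
val grouped = pairs.groupByKey()        // ("a", Iterable(1, 3)), ("b", Iterable(2))

// reduceByKey: combines values on each partition before the shuffle, then merges
val summed = pairs.reduceByKey(_ + _)   // ("a", 4), ("b", 2)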
Caching and persisting are techniques in Apache Spark used to store intermediate
data in memory (or on disk) to optimize performance during iterative or repeated
computations. Both methods help avoid recomputing the same data multiple times,
which can be expensive, especially in complex algorithms or iterative machine
learning tasks.
Persisting is similar to caching, but with more control over how and where the data
is stored. cache() uses a default storage level (MEMORY_ONLY for RDDs, and
MEMORY_AND_DISK for DataFrames/Datasets), while persist() allows you to specify the
storage level explicitly.
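A small illustration (assumes sc is an existing SparkContext; the path is an
illustrative assumption):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///data/logs")                  // illustrative path
val errors = logs.filter(_.contains("ERROR"))

// cache() uses the default storage level (MEMORY_ONLY for RDDs)
errors.cache()

// persist() lets you pick the storage level explicitly (done on a separate RDD,
// since an RDD's storage level cannot be changed once it has been set)
val warnings = logs.filter(_.contains("WARN")).persist(StorageLevel.MEMORY_AND_DISK)

// Subsequent actions reuse the stored data instead of re-reading the file
println(errors.count())
println(errors.count())   // the second count hits the cache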
Shared Variables in Apache Spark
In Apache Spark, shared variables are variables that can be used across multiple
tasks and nodes in a distributed environment. They are often used when you need to
share information between different tasks or across different stages of a computation.
However, since Spark runs in a distributed environment, managing variables that are
shared across tasks and nodes requires careful handling to avoid conflicts and
inconsistency.
1. Broadcast Variables
2. Accumulator Variables
Broadcast variables are a mechanism for sharing read-only data across all worker
nodes in a distributed computation. These variables are cached and efficiently
distributed to each worker node so that the same data is not repeatedly sent during
each task execution. This helps improve performance, particularly when working
with large datasets that are referenced multiple times during the computation.
Accumulator variables are a special type of shared variable that can be used to
accumulate values (such as counts or sums) across multiple tasks in parallel.
Accumulators are designed to support associative and commutative operations,
meaning that the order of the accumulation does not matter (i.e., addition or
multiplication operations).
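A hedged sketch showing both kinds of shared variables (assumes sc is an existing
SparkContext; the data and names are illustrative):

// Broadcast: ship a read-only lookup table to every worker node once
val countryCodes = sc.broadcast(Map("US" -> "United States", "IN" -> "India"))

val users = sc.parallelize(Seq(("alice", "US"), ("bob", "IN"), ("carol", "US")))
val named = users.map { case (user, code) =>
  (user, countryCodes.value.getOrElse(code, "Unknown"))
}

// Accumulator: count bad records across all tasks in parallel
val badRecords = sc.longAccumulator("badRecords")
val parsed = sc.parallelize(Seq("1", "2", "oops")).flatMap { s =>
  try Some(s.toInt)
  catch { case _: NumberFormatException => badRecords.add(1); None }
}
parsed.count()                 // an action triggers the computation
println(badRecords.value)      // 1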
Classification of Transformations in Apache Spark
Transformations in Spark are broadly classified as narrow or wide. Narrow
transformations (such as map, filter, and flatMap) produce output partitions that
each depend on a single input partition, so no data movement is needed. Wide
transformations (such as groupByKey, reduceByKey, and join) require a shuffle of
data across partitions and are therefore more expensive.
INTERACTIVE DATA ANALYSIS WITH SPARK SHELL
1. Read: The REPL reads the input (code or expressions) from the user.
2. Eval: It evaluates the input code or expression, which means it executes it.
3. Print: It prints the result of the execution to the screen.
4. Loop: The process repeats, allowing continuous interaction with the
environment.
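An illustrative spark-shell session (the shell pre-creates sc and spark; the results
in the comments are what the REPL would print for this sample data):

scala> val nums = sc.parallelize(1 to 100)
scala> nums.filter(_ % 2 == 0).count()   // Read + Eval, then Print: res1: Long = 50
scala> nums.map(_ * 2).sum()             // the Loop continues with the next expression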
Common Log Locations in Windows:
Command Line Tools are programs that allow users to interact with their operating
system or software by typing text commands into a terminal or command prompt.
These tools are essential for system administration, development, automation,
debugging, and troubleshooting tasks.
Command line tools are preferred by many developers, system administrators, and
power users due to their efficiency, scriptability, and ability to handle complex tasks
quickly. Below are key aspects and examples of command line tools.
Why Use Command Line Tools?
ls
ls -l # Lists files with details like permissions and size
cp source.txt destination.txt
mv old_name.txt new_name.txt
rm file.txt
rm -r directory/ # Remove directory recursively
dir
top
htop
tasklist
3. Network Tools
ping google.com
netstat
• curl: Transfers data to/from a server using various protocols (HTTP, FTP, etc.).
curl http://example.com
traceroute google.com
unzip archive.zip
• chkdsk (Windows): Checks the integrity of the file system and disk.
chkdsk C:
7. Log and Text Processing Tools
• sed (Unix/Linux): Stream editor for modifying text in files or input streams.
1. Faster Execution: Command line tools are generally quicker than graphical
alternatives because they don't require rendering of a user interface.
2. Automation: Commands can be scripted and scheduled to automate repetitive
tasks, such as backups or system updates.
3. Remote Administration: Many servers do not have a graphical interface.
Command line tools allow remote administration via SSH or other remote
access protocols.
4. Precision: Command line tools often offer more granular control over the
system compared to graphical tools.
5. Resource Efficiency: Command line tools use fewer system resources (CPU,
memory) than graphical tools.
3. Use log view applications
Log view applications are specialized tools or software that allow users to view,
analyze, and manage log files generated by systems, applications, or services. These
tools are essential for troubleshooting, monitoring, and debugging purposes because
logs often contain detailed information about the operations, errors, and performance
of systems and applications.
Log view applications provide an easier and more efficient way to search, filter, and
analyze logs compared to manually viewing raw log files. They often come with
additional features like real-time log monitoring, log aggregation, and visualization
to help users identify issues quickly.
1. Splunk
• Overview: Splunk is one of the most popular log management and analysis
platforms. It collects, indexes, and analyzes machine data (logs) from various
sources. It provides powerful search capabilities, real-time monitoring, and
visualizations.
• Key Features:
o Centralized log aggregation
o Real-time alerting
o Dashboards and visualizations
o Machine learning for anomaly detection
• Use Case: Monitoring enterprise-level infrastructure, security event analysis,
and application performance monitoring.
Example: You can use Splunk to visualize web server logs and detect trends like
traffic spikes, downtime, or errors in a user-friendly dashboard.
2. Loggly
Example: An organization might use ELK to monitor server logs and create a
Kibana dashboard that shows error trends, request rates, and system health metrics.
4. Graylog
5. Papertrail
Example: Papertrail can be used to monitor logs from applications running in AWS
or Heroku, providing real-time insights and troubleshooting capabilities.
6. Logstash
Example: A company might use Logstash to parse and filter Apache access logs
before sending them to Elasticsearch for indexing and visualization in Kibana.
Writing spark applications
To run a simple count program using Scala in Apache Spark, you can follow these
steps. Below is a minimal example of a Spark application in Scala that counts the
number of elements in a dataset (an RDD or DataFrame).
1. Set up Spark: Ensure you have Apache Spark installed and properly set up.
If you're using a cluster or local mode, the program can be run accordingly.
2. Scala Program: The Scala code to perform the count operation will look like
the following.
import org.apache.spark.sql.SparkSession

object SimpleCountApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SimpleCountApp").getOrCreate()

    // Create a small RDD of sample elements (illustrative data)
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))

    // Perform the count operation
    val count = rdd.count()
    println(s"Count = $count")

    spark.stop()
  }
}
3. Run the Application:
a. If you're using SBT, you can use the command: sbt run
b. If you are running it on a Spark cluster, you would package the code
into a JAR and submit it using spark-submit: spark-submit --class
SimpleCountApp --master local[*] target/scala-2.12/simple-count-
app_2.12-1.0.jar
Output:
This is a simple example of how you can use Spark with Scala to perform a basic
count operation on an RDD.
Understanding these components will allow you to write, run, and optimize Spark
programs in Scala effectively.
Simple Build Tool (SBT)
SBT (Simple Build Tool) is the most commonly used build tool in the Scala
ecosystem. It is designed to handle project builds, dependency management, and
packaging tasks for Scala and Java applications. It's similar to tools like Maven or
Gradle in the Java world.
Spark Submit
spark-submit is a command-line interface that allows you to submit and run Spark
applications on a cluster. It is used to submit a precompiled Spark application
(usually packaged as a JAR file) to a Spark cluster (or in local mode). This tool
handles the distribution of the application across the cluster and manages resources.
1. Submit Jobs to a Cluster: You can submit a job to various cluster managers
such as YARN, Mesos, Kubernetes, or run it locally.
2. Specify Configurations: You can configure resource requirements like the
number of cores, memory, etc.
3. Submit JARs and Dependencies: You can specify JAR files, Python files,
or other dependencies for your job.
Common spark-submit Options:
spark-submit --class <main-class> --master <cluster-manager> --deploy-mode
<deploy-mode> <path-to-jar> <application-arguments>
spark-submit --class SimpleSparkApp --master local[*] target/scala-2.12/simple-
spark-app_2.12-1.0.jar
spark-submit --class SimpleSparkApp --master yarn --deploy-mode cluster
target/scala-2.12/simple-spark-app_2.12-1.0.jar
a. --master yarn: Specifies that the job will run on a YARN-managed
cluster.
b. --deploy-mode cluster: Indicates that the driver will run inside the
cluster (not locally).
3. Submit with Dependencies (If you have additional JAR files or libraries):
spark-submit --class SimpleSparkApp --master local[*] --jars /path/to/extra-lib.jar
target/scala-2.12/simple-spark-app_2.12-1.0.jar
Spark streaming
5. Fault Tolerance:
a. Spark Streaming provides fault tolerance via the RDD lineage. If a
node fails, the system can recompute lost data from the source using the
lineage information.
b. It can also checkpoint data and processing state periodically to provide
additional reliability in case of failure.
6. Integrations with Data Sources:
a. Spark Streaming supports integration with many data sources,
including Kafka, Flume, Kinesis, Socket, HDFS, and Amazon S3,
allowing you to ingest real-time data from these systems.
7. Output Sinks:
a. Spark Streaming can output processed results to various sinks such as
files (HDFS, S3), databases, dashboards, or other messaging
systems like Kafka, depending on your application needs.
Example of a Simple Spark Streaming Program:
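The original listing is not reproduced here; the sketch below is reconstructed from
the breakdown that follows (1-second batches, socket source on localhost:9999, word
count with reduceByKey):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // Streaming context with a 1-second batch interval
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Read lines from a socket (use `nc -lk 9999` to feed data)
    val lines = ssc.socketTextStream("localhost", 9999)

    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)

    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}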
Breakdown of the Example:
1. Spark Streaming Context: The StreamingContext is created with the
specified batch interval (in this case, 1 second).
2. Socket Stream: The data is being read from a socket on localhost at port 9999.
You can use nc (Netcat) to simulate a data stream.
3. Transformations:
a. flatMap splits each line into words.
b. map converts each word into a tuple (word, 1).
c. reduceByKey aggregates counts for each word.
4. Output: The print() action outputs the word counts to the console.
5. Start and Await: The streaming computation is started with ssc.start(), and
the program waits for the streaming job to finish with ssc.awaitTermination().
1. Unified API: Spark Streaming leverages the same API as Spark Core, which
makes it easier to use and transition between batch processing and real-time
processing.
2. Scalability: Built on top of Spark, it can scale easily to handle large streams
of data.
3. Fault Tolerance: It ensures fault tolerance through the lineage of RDDs,
allowing the recovery of lost data.
4. Integration: It integrates well with popular messaging systems and file
systems like Kafka, HDFS, S3, Flume, etc.
5. Complex Processing: It supports advanced operations such as windowed
computations, stateful processing, and aggregations over time.
In addition to the classic Spark Streaming API (which is based on DStreams), Spark
introduced Structured Streaming in Spark 2.x as a more modern, high-level API
that simplifies stream processing.
• It provides better consistency, lower latency, and more expressive stream
processing capabilities.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("StructuredStreamingExample").getOrCreate()

// Read from a Kafka stream (assumes a local Kafka broker and an existing topic)
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "my_topic")
  .load()

// Write the key/value pairs to the console (illustrative sink)
val query = kafkaStream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .start()
query.awaitTermination()
Conclusion:
Cogroup
In Apache Spark, cogroup is a transformation that is used to combine two RDDs (or
Datasets) based on their keys. It performs a grouped join: elements from both RDDs
that share the same key are grouped together, and a function can then be applied to
the grouped values.
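A small sketch (assumes sc is an existing SparkContext; keys and values are
illustrative):

val orders  = sc.parallelize(Seq(("u1", "order-1"), ("u1", "order-2"), ("u2", "order-3")))
val refunds = sc.parallelize(Seq(("u2", "refund-1"), ("u3", "refund-2")))

// cogroup: for every key, collect the values from both RDDs side by side
val grouped = orders.cogroup(refunds)
// ("u1", (Iterable(order-1, order-2), Iterable()))
// ("u2", (Iterable(order-3),          Iterable(refund-1)))
// ("u3", (Iterable(),                 Iterable(refund-2)))
grouped.collect().foreach(println)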
updateStateByKey
updateStateByKey is a stateful DStream transformation that maintains running state
for each key across batches (for example, a cumulative count per word) by applying
an update function to the new values in each batch and the previously stored state.
foreachRDD
foreachRDD is an action that allows you to apply a custom function to each RDD
in the DStream as it is processed. This can be used for various purposes such as
saving data to external storage, updating a database, or performing custom logging.
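A minimal sketch, assuming the wordCounts DStream from the streaming example above:

// For every micro-batch, apply a custom function to the underlying RDD
wordCounts.foreachRDD { rdd =>
  // e.g. save to external storage, update a database, or log; here we just print
  rdd.collect().foreach { case (word, count) => println(s"$word -> $count") }
}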
WINDOW
Key Concepts:
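The original notes for this section are not reproduced here. As a hedged sketch of a
windowed computation, assuming the pairs DStream of (word, 1) tuples from the
streaming example above (1-second batch interval):

import org.apache.spark.streaming.Seconds

// Word counts over a sliding window: 30-second window, recomputed every 10 seconds
val windowedCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()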
Spark SQL
Spark SQL is a component of Apache Spark that allows you to run SQL queries
on structured and semi-structured data. It provides a programming interface for
working with structured data and integrates relational databases and data
warehouses with Spark. Spark SQL enables the execution of SQL queries, and also
includes a DataFrame API and Dataset API for handling data in a more expressive
and optimized manner.
4. Query Optimization:
a. Catalyst Optimizer: Spark SQL uses the Catalyst optimizer, which applies rules
such as constant folding, predicate pushdown, and join optimizations to
improve query performance.
b. Tungsten Execution Engine: This execution engine focuses on
memory management and data serialization, providing performance
improvements like code generation and memory management.
5. Hive Integration:
a. Spark SQL integrates with Apache Hive to read data from and write
data to Hive tables. It can also execute Hive UDFs (User Defined
Functions) and queries.
6. Support for Structured Streaming:
a. Spark SQL also supports structured streaming, allowing you to run
SQL queries over streaming data, enabling real-time analytics with the
same interface as batch processing.
7. Built-in Functions:
a. Spark SQL includes a rich set of built-in functions for data
manipulation, transformation, and aggregation, similar to the functions
found in SQL databases, such as count(), sum(), avg(), min(), max(),
and more.
2. SQL Context: To use Spark SQL, you need to create a SQLContext (or
SparkSession in Spark 2.x), which provides an interface for running SQL
queries.
3. Executing SQL Queries: Once you have a DataFrame or Dataset, you can
use SQL to query it. Spark SQL allows you to register a DataFrame as a
temporary table and then run SQL queries on it.
import org.apache.spark.sql.functions._
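A minimal sketch of registering a temporary view and querying it (illustrative data;
assumes an active SparkSession named spark):

import spark.implicits._

val df = Seq(("Alice", 34), ("Bob", 28)).toDF("name", "age")
df.createOrReplaceTempView("people")

// Run a SQL query against the registered temporary view
val adults = spark.sql("SELECT name, age FROM people WHERE age > 30")
adults.show()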
Spark SQL APIs:
1. SQL Queries:
a. You can use SQL queries to interact with DataFrames, as shown earlier.
Spark SQL allows SQL-like operations on DataFrames directly.
b. SQL queries can also be used on tables in Hive if Spark is connected
to a Hive metastore.
2. DataFrame API:
a. DataFrames provide a programmatic interface for working with
structured data. DataFrame operations are optimized via Spark’s
Catalyst query optimizer.
// Example of a DataFrame operation (select and filter)
val df2 = df.select("name", "age").filter("age > 30")
df2.show()
3. Dataset API:
a. Datasets are a type-safe, object-oriented version of DataFrames. They
allow you to work with strongly typed data, making it easier to catch
errors at compile time.
case class Person(name: String, age: Int)
import spark.implicits._   // required for the .as[Person] conversion
val ds = spark.read.json("people.json").as[Person]
ds.filter(_.age > 30).show()
Example: Spark SQL Query Execution
Here is an example of how Spark SQL can be used to process structured data from
a CSV file and perform some operations:
import org.apache.spark.sql.SparkSession

object SparkSQLExample {
  def main(args: Array[String]): Unit = {
    // Create Spark session
    val spark = SparkSession.builder.appName("Spark SQL Example").getOrCreate()

    // Read a CSV file into a DataFrame (illustrative path; assumes a header row)
    val df = spark.read.option("header", "true").csv("people.csv")
    df.createOrReplaceTempView("people")

    // Run a SQL query on the temporary view
    spark.sql("SELECT Name, Age FROM people WHERE Age > 30").show()

    spark.stop()
  }
}
Integration with Other Systems:
1. Hive Integration:
a. Spark SQL can query data stored in Hive, which is commonly used in
data warehouses.
b. You can use HiveQL (the SQL dialect for Hive) along with the Spark
SQL engine to run SQL queries.
2. JDBC Integration:
a. Spark SQL can connect to external relational databases using JDBC to
read and write data.
3. Other Formats:
a. Spark SQL supports a variety of file formats, including Parquet, ORC,
Avro, JSON, CSV, and more. This allows you to query data from these
formats without needing to load them into a traditional relational
database.
5. Interoperability:
a. Spark SQL integrates with many other components of the Spark
ecosystem, such as Spark Streaming, MLlib, and GraphX, allowing
users to combine real-time data processing with machine learning,
graph processing, and more.
Conclusion:
Spark SQL is a powerful tool that combines the ease of SQL with the scalability and
speed of Apache Spark. It simplifies querying and processing structured and semi-
structured data, providing an optimized and unified interface for big data processing.
Whether you're using SQL queries, DataFrames, or Datasets, Spark SQL is a
versatile tool for data analysis and integration across many data sources.
MLLIB
MLlib is Spark's scalable machine learning library. It provides common learning
algorithms (classification, regression, clustering, collaborative filtering) and
utilities for feature engineering, pipelines, and model evaluation.
Spark Streaming and Structured Streaming are two key components of Apache Spark for
processing real-time data. While both are designed for real-time stream processing,
there are significant differences between them in terms of architecture, programming
model, and ease of use. Here's a detailed comparison and explanation of both:
1. Spark Streaming
2. Structured Streaming
SPARK ML
GRAPH FRAMES
GeoSpark (now known as Apache Sedona)
Koalas
These interfaces make Spark SQL highly flexible and powerful for both batch and
real-time processing, enabling users to work with structured data in a variety of ways
while benefiting from Spark's distributed processing capabilities.
ETL
ETL stands for Extract, Transform, Load, and it refers to the process of moving
data from one or more sources to a target system, typically a data warehouse or data
lake, for further analysis and processing. ETL is a critical component of data
integration, data warehousing, and big data workflows.
SQLContext
SQLContext is a part of Spark's SQL module that provides the entry point to interact
with structured data through SQL queries, DataFrames, and Datasets. It allows you
to use Spark SQL to execute SQL queries on Spark's distributed data and facilitates
integration with external data sources like Hive, HDFS, JSON, Parquet, JDBC, and
more.
DataFrame
Key Features of DataFrame:
• Schema: DataFrames have a schema (a structure that defines the names and
types of columns), which provides better optimization opportunities than raw
RDDs.
• Optimized Execution: DataFrames benefit from Spark's Catalyst optimizer
for query optimization and Tungsten execution engine for efficient
computation and memory management.
• Ease of Use: DataFrames can be manipulated using a variety of high-level
operations like select(), filter(), groupBy(), join(), and more, without needing
to write complex Spark transformations.
• Supports Multiple Formats: You can read data from multiple formats such
as Parquet, JSON, CSV, JDBC, and more.
Here are the steps to convert an RDD to a DataFrame:
If your RDD consists of a collection of tuples or lists, you can directly convert it into
a DataFrame by providing the column names.
Example:
import org.apache.spark.sql.SparkSession

// Create a SparkSession
val spark = SparkSession.builder().appName("RDD to DataFrame Example").getOrCreate()
import spark.implicits._

// RDD of tuples (illustrative data matching the output below)
val rdd = spark.sparkContext.parallelize(
  Seq(("John", 28, "M"), ("Sara", 25, "F"), ("Mike", 30, "M")))

// Convert the RDD to a DataFrame by supplying column names
val df = rdd.toDF("Name", "Age", "Gender")
df.show()
Output:
+----+---+------+
|Name|Age|Gender|
+----+---+------+
|John| 28| M |
|Sara| 25| F |
|Mike|30| M |
+----+---+------+
Explanation:
• toDF("Name", "Age", "Gender") assigns column names to the tuple fields, and Spark
infers the column types from the tuple element types.
If you want to provide a specific schema (i.e., types for each column), you can define
a StructType schema and apply it while converting the RDD to a DataFrame.
Example:
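The original listing is not reproduced here; the sketch below is reconstructed from
the explanation and output that follow (illustrative data; assumes an active
SparkSession named spark):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// RDD of Row objects (illustrative data matching the output below)
val rowRDD = spark.sparkContext.parallelize(
  Seq(Row("John", 28, "M"), Row("Sara", 25, "F"), Row("Mike", 30, "M")))

// Explicit schema: column names and types
val schema = StructType(Seq(
  StructField("Name", StringType, nullable = true),
  StructField("Age", IntegerType, nullable = true),
  StructField("Gender", StringType, nullable = true)))

val df = spark.createDataFrame(rowRDD, schema)
df.show()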
Output:
+----+---+------+
|Name|Age|Gender|
+----+---+------+
|John| 28| M|
|Sara| 25| F |
|Mike|30| M|
+----+---+------+
Explanation:
1. RDD of Rows: The data is represented as Row objects (which are like tuples)
in the RDD.
2. Schema Definition: The schema is defined using StructType, which is an
array of StructField objects. Each StructField defines a column's name and its
type.
3. createDataFrame: The createDataFrame method is used with both the RDD
and the schema to create the DataFrame.
If your RDD consists of case classes, you can leverage Spark's built-in support for
case classes to convert it into a DataFrame. Case classes automatically define a
schema based on the fields of the class.
Example:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RDD to DataFrame Example").getOrCreate()

// Case class defines the schema (name, age, gender); data is illustrative
case class Person(name: String, age: Int, gender: String)
val data = spark.sparkContext.parallelize(
  Seq(Person("John", 28, "M"), Person("Sara", 25, "F"), Person("Mike", 30, "M")))

// Convert the RDD of case class objects to DataFrame
val df = spark.createDataFrame(data)
df.show()
Output:
+----+---+------+
|Name|Age|Gender|
+----+---+------+
|John| 28| M|
|Sara| 25| F |
|Mike|30| M|
+----+---+------+
Explanation:
1. Case Class: A case class is defined for structured data with name, age, and
gender fields.
2. RDD of Case Classes: An RDD of Person objects is created using parallelize.
3. DataFrame Conversion: The createDataFrame method is used to convert the
RDD of case class objects into a DataFrame. Spark automatically infers the
schema based on the case class fields.
3. Case Classes: If your data is represented as case classes, Spark can
automatically infer the schema when converting the RDD to a DataFrame.
4. RDD vs DataFrame: DataFrames are optimized (using Catalyst optimizer)
and provide a more user-friendly API than RDDs, making them better suited
for structured data processing in Spark.
Conclusion:
A temporary table in Apache Spark is a table that exists for the duration of the
session or until it is explicitly dropped. It is a way to register a DataFrame or SQL
query result within Spark's SQL engine, making it accessible through SQL queries.
Temporary tables are often used for interactive queries or intermediate results in data
processing.
In Apache Spark, you can easily add a column to a DataFrame using the
withColumn method. The withColumn method allows you to add a new column to
an existing DataFrame by specifying the name of the new column and the expression
to compute its values.
Adding a Column to a DataFrame
If you want to add a new column with a constant value for all rows, you can use lit()
to create a literal value.
Example:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
// Sample DataFrame (assumes spark.implicits._ is in scope, e.g. in spark-shell)
val data = Seq(("John", 28), ("Sara", 25), ("Mike", 30))
val df = data.toDF("Name", "Age")

// Add a constant column using lit()
val dfWithCountry = df.withColumn("Country", lit("USA"))
dfWithCountry.show()
Output:
+----+---+-------+
|Name|Age|Country|
+----+---+-------+
|John| 28| USA|
|Sara| 25| USA|
|Mike| 30| USA|
+----+---+-------+
Explanation:
• lit("USA"): The lit function creates a literal value ("USA") to be added to
each row of the DataFrame as a new column called "Country".
You can use existing columns and apply transformations to create a new column.
Example:
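The original listing is not shown; a minimal sketch consistent with the output below,
reusing the df with Name and Age columns from the previous example:

// Add a column computed from an existing column (Age + 5)
val dfWithAgeIn5Years = df.withColumn("AgeIn5Years", col("Age") + 5)
dfWithAgeIn5Years.show()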
Output:
+----+---+----------+
|Name|Age|AgeIn5Years|
+----+---+----------+
|John| 28| 33|
|Sara| 25| 30|
|Mike| 30| 35|
+----+---+----------+
Explanation:
• col("Age") + 5 computes the new column's value from the existing Age column, and
withColumn stores it in the new "AgeIn5Years" column.
You can also create a new column based on a condition (e.g., if a person is above a
certain age, add a flag).
Example:
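The original listing is not shown; a minimal sketch consistent with the output below,
again reusing the df with Name and Age columns:

// Add a column based on a condition using when / otherwise
val dfWithAgeGroup =
  df.withColumn("AgeGroup", when(col("Age") >= 30, "Old").otherwise("Young"))
dfWithAgeGroup.show()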
Output:
+----+---+--------+
|Name|Age|AgeGroup|
+----+---+--------+
|John| 28| Young|
|Sara| 25| Young|
|Mike| 30| Old|
+----+---+--------+
Explanation:
• when(col("Age") >= 30, "Old").otherwise("Young") assigns "Old" to ages of 30 and
above and "Young" to everything else.
If you need more complex logic, you can use a UDF (User Defined Function) to
create a new column. This approach is useful if the transformation logic cannot be
expressed with Spark's built-in functions.
Example:
import org.apache.spark.sql.functions.udf
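Continuing from the import above, a hedged sketch consistent with the explanation
below (the suffix and column names are taken from that explanation):

// UDF that appends a suffix to a name
val appendSuffix = udf((name: String) => name + " Sr.")

val dfWithSuffix = df.withColumn("NameWithSuffix", appendSuffix(col("Name")))
dfWithSuffix.show()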
Explanation:
• The UDF appendSuffix takes a String and appends "Sr." to it.
• The withColumn method uses the UDF to create a new column called
"NameWithSuffix".
You can also add multiple columns at once by chaining withColumn() calls.
Example:
val dfWithMultipleColumns = df
.withColumn("AgeIn5Years", col("Age") + 5)
.withColumn("AgeGroup", when(col("Age") >= 30, "Old").otherwise("Young"))
Output:
+----+---+----------+--------+
|Name|Age|AgeIn5Years|AgeGroup|
+----+---+----------+--------+
|John| 28| 33| Young|
|Sara| 25| 30| Young|
|Mike| 30| 35| Old|
+----+---+----------+--------+
Explanation:
• You can add multiple columns by chaining withColumn() calls. Here, we add
both the "AgeIn5Years" and "AgeGroup" columns in one operation.
Conclusion:
In Apache Spark, handling null values is an important part of data processing. Spark
provides a number of built-in functions to handle null values in DataFrames. Here
are some common techniques and functions used to manage missing or null data in
Spark.
1. Check for Null Values
You can check for null values in a DataFrame using the isNull() and isNotNull()
functions from the org.apache.spark.sql.functions package.
Example:
import org.apache.spark.sql.functions._
// Sample DataFrame (java.lang.Integer is used so the Age column can hold null;
// assumes spark.implicits._ is in scope, e.g. in spark-shell)
val data = Seq(
  ("John", Integer.valueOf(28)),
  ("Sara", null.asInstanceOf[Integer]),
  ("Mike", Integer.valueOf(30)),
  (null.asInstanceOf[String], Integer.valueOf(25))
)
val df = data.toDF("Name", "Age")

// Rows where Age is null
df.filter(col("Age").isNull).show()
Output:
+----+----+
|Name| Age|
+----+----+
|Sara|null|
+----+----+
Explanation:
• df.filter(col("Age").isNull) keeps only the rows whose Age column is null;
isNotNull() does the opposite.
To remove rows with null values, you can use the dropna() method, which drops
rows containing null values in one or more columns.
Example:
// Drop rows containing null values in any column
val dfNoNulls = df.na.drop()
dfNoNulls.show()
Output:
+----+---+
|Name|Age|
+----+---+
|John| 28|
|Mike| 30|
+----+---+
Explanation:
• df.na.drop() removes every row that contains a null in any column, which is why
only John and Mike remain.
3. Fill Null Values with a Default Value
You can fill null values with a default value using the fill() or fillna() method.
Example:
// Fill null values in the "Age" column with a default value (e.g., 0)
val dfFilled = df.na.fill(Map("Age" -> 0))
dfFilled.show()
Output:
+----+---+
|Name|Age|
+----+---+
|John| 28|
|Sara| 0|
|Mike| 30|
|null| 25|
+----+---+
Explanation:
• na.fill(Map("Age" -> 0)): This fills null values in the "Age" column with the
default value 0.
• The fill() method can take a map, where you specify the column names as
keys and the values you want to fill as the corresponding values.
You can also fill all null values across all columns with a single value like this:
// Fill all null values in the entire DataFrame with a specific value
val dfAllFilled = df.na.fill("Unknown")
dfAllFilled.show()
Output:
+-------+----+
|   Name| Age|
+-------+----+
|   John|  28|
|   Sara|null|
|   Mike|  30|
|Unknown|  25|
+-------+----+
(Note: a string fill value such as "Unknown" is applied only to string columns, so
the numeric Age column keeps its null.)
You can use the when and otherwise functions to replace null values based on
custom logic.
Example:
// Replace null ages with 0, otherwise keep the original value
val dfWithCustomLogic = df.withColumn("Age",
  when(col("Age").isNull, 0).otherwise(col("Age")))
dfWithCustomLogic.show()
Output:
+----+---+
|Name|Age|
+----+---+
|John| 28|
|Sara| 0|
|Mike| 30|
|null| 25|
+----+---+
Explanation:
• The when function checks if the column Age is null and replaces it with 0 if
true, otherwise it keeps the original value.
If you want to remove duplicates from a DataFrame and ignore rows that contain
null values, you can use the dropDuplicates() method, which removes rows that have
the same values across all columns.
Example:
// Remove duplicate rows (all columns are considered by default)
val dfWithoutDuplicates = df.dropDuplicates()
dfWithoutDuplicates.show()
Output:
+----+---+
|Name|Age|
+----+---+
|John| 28|
|Sara| 0|
|Mike| 30|
|null| 25|
+----+---+
Explanation:
You can also drop rows that contain null values in specific columns using dropna()
with the subset parameter.
Example:
// Drop rows that have a null value in the "Name" column
val dfNoNullName = df.na.drop(Seq("Name"))
dfNoNullName.show()
Output:
+----+---+
|Name|Age|
+----+---+
|John| 28|
|Sara| 0|
|Mike| 30|
+----+---+
Explanation:
Summary of Common Methods for Handling Nulls in Spark:
1. Check for nulls: Use isNull() and isNotNull() to check for null values.
2. Remove rows with nulls: Use dropna() to remove rows containing null values.
3. Fill null values: Use fill() or fillna() to fill null values with a constant value.
4. Replace null values with custom logic: Use when and otherwise to replace
nulls with computed values.
5. Drop duplicates: Use dropDuplicates() to remove duplicate rows.
6. Drop rows with nulls in specific columns: Use dropna(subset=...) to drop
rows with null values in specific columns.
By using these functions, you can handle null values in Spark DataFrames efficiently
and tailor the behavior to your data processing needs.
In Apache Spark, you can save a DataFrame to different formats and storage
systems such as HDFS, local file system, Amazon S3, Hive, or databases like
JDBC. Spark provides various methods for saving DataFrames, and the choice of
format depends on the use case, such as whether you want to store data as parquet,
CSV, JSON, or ORC, etc.
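A minimal sketch of the common write paths (assumes an existing DataFrame df and an
active SparkSession spark; the paths are illustrative):

// Save the DataFrame in different formats
df.write.mode("overwrite").parquet("hdfs:///output/people_parquet")
df.write.mode("overwrite").option("header", "true").csv("hdfs:///output/people_csv")
df.write.mode("overwrite").json("hdfs:///output/people_json")

// Read one of them back later
val reloaded = spark.read.parquet("hdfs:///output/people_parquet")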
The dropDuplicates() method removes rows that have the same values in all columns by
default, but it also allows you to specify particular columns to consider when
removing duplicates. For example (illustrative data with one duplicated row):
val df = Seq(("John", 28), ("Sara", 25), ("Mike", 30), ("John", 28)).toDF("Name", "Age")
val dfWithoutDuplicates = df.dropDuplicates()
dfWithoutDuplicates.show()
In Apache Spark, joins are used to combine rows from two or more DataFrames
based on a related column between them. Spark supports several types of joins,
which allow you to handle different data relationships and conditions. Below are the
different types of joins available in Spark:
1. Inner Join (default join)
An inner join returns rows when there is a match in both DataFrames. If no match
is found, the row is excluded from the result.
Syntax:
Example:
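A possible call, assuming the df1 and df2 sketched above (the variable name is illustrative):
val innerDF = df1.join(df2, Seq("Name"), "inner")   // "inner" is the default join type
innerDF.show()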
Output:
+----+---+------+
|Name|Age|Gender|
+----+---+------+
|John| 28|     M|
|Sara| 25|     F|
+----+---+------+
Explanation:
• Only rows with matching "Name" values in both DataFrames are returned.
2. Left Join (Left Outer Join)
A left join returns all rows from the left DataFrame and the matching rows from
the right DataFrame. If there is no match in the right DataFrame, the result will
contain null for the columns of the right DataFrame.
Syntax:
Example:
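A possible call, assuming the df1 and df2 sketched above:
val leftDF = df1.join(df2, Seq("Name"), "left")   // "left" and "left_outer" are equivalent
leftDF.show()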
Output:
+----+---+------+
|Name|Age|Gender|
+----+---+------+
|John| 28|     M|
|Sara| 25|     F|
|Mike| 30|  null|
+----+---+------+
Explanation:
• All rows from the left DataFrame (df1) are included, but for Mike, who doesn't
have a corresponding entry in df2, the Gender column is null.
3. Right Join (Right Outer Join)
A right join returns all rows from the right DataFrame and the matching rows from
the left DataFrame. If there is no match in the left DataFrame, the result will contain
null for the columns of the left DataFrame.
Syntax:
Example:
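A possible call, assuming the df1 and df2 sketched above:
val rightDF = df1.join(df2, Seq("Name"), "right")   // "right" and "right_outer" are equivalent
rightDF.show()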
Output:
+----+----+------+
|Name| Age|Gender|
+----+----+------+
|John|  28|     M|
|Sara|  25|     F|
|null|null|  null|
+----+----+------+
Explanation:
• All rows from the right DataFrame (df2) are included, and if there is no match
in the left DataFrame (df1), the columns from df1 are filled with null.
4. Full Outer Join
A full outer join returns all rows when there is a match in either left or right
DataFrame. If there is no match, null will be returned for the columns of the
DataFrame that doesn't have a matching row.
Syntax:
Example:
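A possible call, assuming the df1 and df2 sketched above:
val fullDF = df1.join(df2, Seq("Name"), "full_outer")   // "outer" and "full" also work
fullDF.show()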
Output:
+----+----+------+
|Name| Age|Gender|
+----+----+------+
|John|  28|     M|
|Sara|  25|     F|
|Mike|  30|  null|
|null|null|  null|
+----+----+------+
Explanation:
• All rows from both DataFrames are returned. If there is no match, null is used
for missing values in the respective DataFrame columns.
5. Left Semi Join
A left semi join returns all rows from the left DataFrame where there is a match in
the right DataFrame, but it does not include any columns from the right DataFrame
in the result.
Syntax:
Example:
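A possible call, assuming the df1 and df2 sketched above:
val semiDF = df1.join(df2, Seq("Name"), "left_semi")
semiDF.show()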
Output:
+----+---+
|Name|Age|
+----+---+
|John| 28|
|Sara| 25|
+----+---+
Explanation:
• Only the rows from the left DataFrame (df1) that have a corresponding match
in the right DataFrame (df2) are returned. The columns from the right
DataFrame are not included in the result.
6. Left Anti Join
A left anti join returns all rows from the left DataFrame where there is no match
in the right DataFrame. This join is useful for filtering rows from the left
DataFrame that don't have any matching rows in the right DataFrame.
Syntax:
Example:
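A possible call, assuming the df1 and df2 sketched above:
val antiDF = df1.join(df2, Seq("Name"), "left_anti")
antiDF.show()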
Output:
+----+---+
|Name|Age|
+----+---+
|Mike| 30|
+----+---+
Explanation:
• Only the rows from the left DataFrame (df1) that do not have a matching
row in the right DataFrame (df2) are returned.
7. Cross Join (Cartesian Join)
A cross join produces the Cartesian product of the two DataFrames. It returns all
possible combinations of rows from both DataFrames. Cross joins can be very
expensive for large DataFrames, as the number of resulting rows is the product of
the row counts in the two DataFrames.
Syntax:
Example:
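A possible call; a two-row people DataFrame and a small gender lookup table are assumed here so the result matches the four-row output below:
val dfPeople = Seq(("John", 28), ("Sara", 25)).toDF("Name", "Age")
val dfGender = Seq(("M", "Male"), ("F", "Female")).toDF("Gender", "Gender_Type")
val crossDF = dfPeople.crossJoin(dfGender)
crossDF.show()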
Output:
+----+---+------+-----------+
|Name|Age|Gender|Gender_Type|
+----+---+------+-----------+
|John| 28|     M|       Male|
|John| 28|     F|     Female|
|Sara| 25|     M|       Male|
|Sara| 25|     F|     Female|
+----+---+------+-----------+
Explanation:
• Each row from the first DataFrame is combined with every row from the
second DataFrame.
These join operations are essential for combining data from different sources, and
choosing the right join type depends on the data and the business logic you're trying
to implement.
In Apache Spark, when reading from or writing to data sources such as files (e.g.,
CSV, Parquet, JSON) or databases, you can specify various modes that control the
behavior of how data is read or written. These modes allow you to handle different
scenarios such as overwriting existing data, appending new data, or handling errors.
Read Modes in Spark
When reading data in Spark, you can specify how to handle corrupt or missing
records. These are typically controlled by the mode option in the read method.
Write Modes in Spark
When writing data in Spark, you can control how existing data is handled in the
target location using different write modes. These modes determine what happens
when data already exists at the target location (e.g., overwriting, appending, or
failing on existing data).
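A minimal sketch of both kinds of mode (the paths are placeholders):
// Read mode: controls how malformed records are handled while parsing
val dfIn = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")   // alternatives: PERMISSIVE (default), FAILFAST
  .csv("/path/to/input")

// Write mode: controls what happens if data already exists at the target location
dfIn.write
  .mode("overwrite")                 // alternatives: append, ignore, errorIfExists (default)
  .parquet("/path/to/output")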
1. Aggregation Functions
import org.apache.spark.sql.functions._
df.groupBy("column_name").agg(count("*").alias("count"))
df.groupBy("column_name").agg(sum("numeric_column").alias("total"))
df.groupBy("column_name").agg(avg("numeric_column").alias("average"))
df.groupBy("column_name").agg(max("numeric_column").alias("max_value"))
df.groupBy("column_name").agg(min("numeric_column").alias("min_value"))
df.groupBy("column_name").agg(first("column_name").alias("first_value"))
df.groupBy("column_name").agg(last("column_name").alias("last_value"))
df.groupBy("column_name").agg(collect_list("column_name").alias("values_list")
)
df.groupBy("column_name").agg(collect_set("column_name").alias("unique_value
s"))
105
2. String Functions
df.withColumn("upper_case", upper(col("column_name")))
df.withColumn("lower_case", lower(col("column_name")))
df.withColumn("length", length(col("column_name")))
df.withColumn("trimmed", trim(col("column_name")))
• rpad(): Pads the right side of a string with a given character.
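A matching call, following the pattern of the other string-function snippets above (the padding length and character are illustrative):
df.withColumn("right_padded", rpad(col("column_name"), 10, "*"))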
3. Date/Time Functions
Spark provides a wide range of functions to work with date and time data.
df.withColumn("current_date", current_date())
df.withColumn("current_timestamp", current_timestamp())
df.withColumn("new_date", date_sub(col("date_column"), 5))
df.withColumn("date", to_date(col("date_string")))
df.withColumn("timestamp", to_timestamp(col("timestamp_string")))
df.withColumn("year", year(col("date_column")))
df.withColumn("month", month(col("date_column")))
df.withColumn("day", dayofmonth(col("date_column")))
df.withColumn("hour", hour(col("timestamp_column")))
4. Mathematical Functions
df.withColumn("abs_value", abs(col("numeric_column")))
df.withColumn("sqrt_value", sqrt(col("numeric_column")))
df.withColumn("log_value", log(col("numeric_column")))
df.withColumn("exp_value", exp(col("numeric_column")))
df.withColumn("ceil_value", ceil(col("numeric_column")))
• floor(): Rounds a number down to the nearest integer.
df.withColumn("floor_value", floor(col("numeric_column")))
5. Conditional Functions
• nullif(): Returns null if two columns are equal; otherwise, returns the first
column.
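An equivalent of nullif() expressed with when/otherwise (the column names col_a and col_b are hypothetical):
df.withColumn("nullif_value",
  when(col("col_a") === col("col_b"), lit(null)).otherwise(col("col_a")))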
6. Window Functions
Window functions are used to perform operations across a set of rows related to
the current row.
import org.apache.spark.sql.expressions.Window
val windowSpec =
Window.partitionBy("group_column").orderBy("value_column")
df.withColumn("row_num", row_number().over(windowSpec))
• rank(): Assigns a rank to each row within a partition of a result set, with
gaps in the rank.
df.withColumn("rank", rank().over(windowSpec))
df.withColumn("dense_rank", dense_rank().over(windowSpec))
These are just some of the built-in functions in Apache Spark. The full list includes
many more functions that allow for advanced data manipulations and processing,
including working with arrays, maps, and other complex data types.
In Apache Spark, reading and writing JSON files is a common operation. Spark
provides built-in functions to work with JSON data, allowing you to load JSON data
into a DataFrame, perform transformations, and then save it back in JSON format.
Here's how to read and write JSON files in Spark:
To read JSON files, use the read.json() function. You can specify the path to the
JSON file or a directory containing JSON files.
Once you have a DataFrame, you can save it back to a JSON file using the
write.json() function. You can specify the path where you want to save the file.
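A minimal sketch (the paths are placeholders):
// Read JSON files into a DataFrame
val jsonDF = spark.read.json("/path/to/input_json")
// Write the DataFrame back out as JSON
jsonDF.write.mode("overwrite").json("/path/to/output_json")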
In Apache Spark, partitioning and bucketing are both techniques used to organize
data in distributed storage (like HDFS or S3) to optimize query performance.
However, they have different purposes and implementation details. Here's a
breakdown of the key differences between partitioning and bucketing:
1. Partitioning
Partitioning is a technique where large datasets are divided into smaller, manageable
chunks based on the values of one or more columns (referred to as partition keys).
Each partition corresponds to a directory on disk. When Spark reads data from a
partitioned table, it only reads the relevant partitions, which helps improve
performance by reducing the amount of data that needs to be processed.
• Data Distribution: The data is physically divided into partitions based on the
values of one or more columns.
• Directory Structure: Partitioned data is stored in separate directories on disk,
with each directory corresponding to a unique value of the partition key (or a
set of partition keys).
• Efficient Filtering: Partitioning is useful when queries filter based on the
partitioned columns. Spark can skip reading irrelevant partitions during query
execution, improving performance (known as partition pruning).
• Dynamic: Partitioning is determined dynamically when writing the data (i.e.,
Spark will decide where to place data based on the partition key).
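A small sketch of writing partitioned data in Spark, following the points above (the column and path are illustrative):
// Each distinct value of "country" becomes its own directory, e.g. .../country=US/
df.write.partitionBy("country").parquet("/path/to/partitioned_output")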
2. Bucketing
Bucketing is a technique that divides the data into a fixed number of buckets (files)
based on the hash value of one or more columns. Each bucket contains a subset of
data, and the number of buckets is predefined. Bucketing helps with join operations,
as data from different tables can be bucketed on the same column(s), ensuring that
matching records are in the same bucket.
Key Points about Bucketing:
• Data Distribution: The data is divided into a fixed number of buckets based
on the hash of a column or a set of columns. The number of buckets is
specified in advance.
• File Structure: Data is stored in a fixed number of files (buckets), and each
file contains data based on the hash of the bucket column(s). The number of
buckets does not change dynamically.
• Efficient Joins: Bucketing is particularly useful when performing joins on the
bucketed columns. If both tables are bucketed on the same column and have
the same number of buckets, Spark can optimize the join by reading only the
matching buckets from each table.
• Use Case: Bucketing is useful when there is no natural partitioning column but
you want to optimize operations like joins. It can also improve query
performance when the data has a skewed distribution (see the sketch below).
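A small sketch of writing a bucketed table from Spark (bucketBy requires saveAsTable; the table name, column, and bucket count are illustrative):
df.write
  .bucketBy(8, "customer_id")
  .sortBy("customer_id")
  .saveAsTable("bucketed_customers")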
Key Differences Between Partitioning and Bucketing
• How data is split: partitioning creates a directory per distinct value of the partition key, while bucketing hashes the column value into a fixed, predefined number of files (buckets).
• What it speeds up: partitioning helps queries that filter on the partition column (partition pruning), while bucketing helps joins and aggregations on the bucketed column.
• Flexibility: the number of partitions grows dynamically with the data, whereas the number of buckets is fixed when the table is written.
HIVE
Hive is a data warehouse system built on top of Hadoop that provides a higher-level
abstraction for querying and managing large datasets in Hadoop's HDFS (Hadoop
Distributed File System). It was developed by Facebook and is now an Apache
project. Hive allows users to query large datasets using a familiar SQL-like language
called HiveQL (or HQL), which is like traditional SQL, but tailored for big data
processing in a distributed environment.
Hive provides a query language called HiveQL, which is similar to SQL, allowing
users to express queries using a familiar syntax. However, HiveQL is designed to
work with the large-scale distributed nature of Hadoop, so it's optimized for batch
processing of large datasets rather than interactive querying like traditional databases.
Hive is designed to work with data stored in Hadoop's HDFS. The data is typically
stored in tables, and these tables are managed by Hive. Tables in Hive are analogous
to tables in a traditional relational database.
• Partitioning: Hive allows data to be partitioned by certain columns, like date,
to improve query performance. This helps with organizing data into more
manageable parts.
• Bucketing: Like partitioning, bucketing in Hive splits the data into multiple
files, but it’s based on the hash of a column, which is useful for certain query
patterns, such as joins.
3. Execution Engines
Originally, Hive queries were translated into MapReduce jobs. However, as Spark
and Tez became more popular, Hive began supporting these engines for more
efficient query execution.
4. Hive Metastore
The Hive Metastore is a central repository that stores metadata about Hive tables (schemas, partitions, and the location of the underlying data), typically backed by a relational database such as MySQL, PostgreSQL, or Derby.
5. Hive Data Types
Hive supports various data types for storing data. These include primitive types like
STRING, INT, FLOAT, and BOOLEAN, as well as complex types like ARRAY,
MAP, and STRUCT.
6. Common Use Cases
• Batch Processing: Hive is designed for batch processing, making it ideal for
ETL (Extract, Transform, Load) operations over large datasets.
• Data Warehousing: It is often used as a data warehouse solution for large-
scale data analytics, as it allows users to run SQL-like queries over data stored
in Hadoop.
• Integration with BI Tools: Hive integrates with business intelligence (BI)
tools like Tableau, Power BI, and others, through JDBC/ODBC connections,
making it easier to query big data with familiar interfaces.
• Scalability: Since Hive is built on top of Hadoop, it can scale horizontally and
handle very large datasets across multiple machines.
7. Hive Architecture
• Hive Driver: The driver is responsible for managing the lifecycle of a HiveQL
query and the execution process.
• Compiler: The compiler parses the HiveQL query, performs semantic
analysis, and generates an execution plan in terms of MapReduce, Tez, or
Spark jobs.
• Execution Engine: This component is responsible for running the query plan.
Depending on the chosen execution engine (MapReduce, Tez, or Spark), it
manages the actual data processing.
• Hive Metastore: Stores metadata about tables, partitions, and the schema of
data stored in Hive.
8. Hive Advantages
9. Limitations of Hive
• Latency: Hive was originally designed for batch processing, which can result
in high query latency. It is not optimized for low-latency, real-time querying.
• Not Suitable for OLTP: Hive is designed for OLAP (Online Analytical
Processing) rather than OLTP (Online Transaction Processing), meaning it’s
not well-suited for transactional or real-time applications.
• Lack of Fine-Grained Control: Unlike relational databases, Hive does not
support full ACID transactions, though newer versions are adding limited
ACID support (for example, in transactional tables).
Both Hive and SparkSQL are used for querying large datasets in the Hadoop
ecosystem, but they have differences:
• Hive typically relies on MapReduce for query execution (though it can also
use Tez or Spark for faster performance), while SparkSQL uses Spark for
query execution, which is faster due to Spark's in-memory processing.
• Hive is optimized for batch processing, while SparkSQL can handle both
batch and real-time stream processing.
Conclusion
Hive is a data warehouse solution for Hadoop that enables users to query and analyze
large datasets using an SQL-like language (HiveQL). It is particularly useful for
batch processing, ETL tasks, and data warehousing in the Hadoop ecosystem. Hive's
architecture allows it to scale to handle massive datasets, and its SQL-like interface
makes it accessible to people familiar with traditional relational databases, even
though it operates in a distributed environment.
In Hive, data flow refers to the movement of data from its source to its destination
in the Hadoop ecosystem. This flow typically involves several steps, from data
ingestion to querying and processing, with transformations and data storage in
between. Below is a general overview of the typical data flow in Hive:
1. Data Ingestion (Loading Data into Hive): Data can be ingested into Hive in
various ways:
a. From HDFS (Hadoop Distributed File System): Data is typically
loaded into Hive tables from files stored in HDFS, such as text files,
CSV, JSON, or Parquet files.
b. External Data Sources: Data can also come from external sources
such as HBase, Local File System, SQL databases (through
connectors), or even streaming systems like Kafka.
In this step, the data is often raw and unstructured, and it might need to be processed
and transformed before being ingested into Hive.
2. Creating Hive Tables: Data is stored in Hive as tables. The structure of the table
(columns, data types) must be defined when the table is created. A table can be:
• Internal (Managed): Hive manages both the data and the table metadata. If
the table is dropped, the data is also deleted.
• External: Hive manages the metadata only; the data remains in its original
location and is not deleted if the table is dropped.
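A minimal HiveQL sketch of the two table types, issued here through a Hive-enabled SparkSession (the same DDL can be run in the Hive CLI; table names and the location are illustrative):
// Managed (internal) table: the warehouse manages both data and metadata
spark.sql("CREATE TABLE managed_sales (id INT, amount DOUBLE)")
// External table: only the metadata is managed; the files stay where they are
spark.sql("""
  CREATE EXTERNAL TABLE external_sales (id INT, amount DOUBLE)
  LOCATION '/data/external_sales'
""")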
3. Data Processing: Data processing in Hive typically occurs through
HiveQL queries. When a query is executed, Hive translates it into
MapReduce jobs (or Tez or Spark jobs, depending on the chosen execution
engine). These jobs process the data in parallel across the Hadoop cluster.
Hive supports a variety of data types for defining the structure of the data in its
tables. These data types are classified into several categories:
d. Date and Time Types:
i. DATE: Stores date values (year, month, day).
ii. TIMESTAMP: Stores date and time values (year, month, day,
hour, minute, second).
iii. INTERVAL: Stores time intervals (e.g., months, days).
FEATURES OF HIVE
Apache Hive is a data warehouse system built on top of Hadoop that facilitates
querying and managing large datasets stored in Hadoop’s HDFS (Hadoop
Distributed File System). Here are the key features of Hive:
2. Scalability
3. Hive Metastore
• The Hive Metastore is a central repository that stores metadata about the
structure of Hive tables (e.g., column names, data types, partitioning
information) and the location of data in HDFS or other storage systems.
• The metastore is typically stored in a relational database like MySQL,
PostgreSQL, or Derby.
• This central metadata store ensures that users and applications can access and
query data in a consistent manner.
• Hive supports complex data types such as ARRAY, MAP, STRUCT, and
UNIONTYPE, which allow users to store and query nested, semi-structured,
and multi-dimensional data.
• These complex types can be used for advanced data transformations and data
modeling.
7. Integration with Hadoop Ecosystem
• Hive is tightly integrated with the Hadoop ecosystem, making it easy to read
and write data from and to other Hadoop tools and systems like HDFS, HBase,
Pig, Spark, and Flume.
• It also supports HDFS, HBase, Kudu, and other storage formats like ORC,
Parquet, Avro, and RCFile.
• This integration allows for flexible and efficient data storage, processing, and
management across different components of the Hadoop ecosystem.
10. Cost-Based Optimizer (CBO)
• Hive provides tools for importing and exporting data from and to different
systems. It can import data from local files, HDFS, HBase, or other sources.
• It can also export query results to different file formats or to external systems.
This flexibility allows for easy data integration with other tools in the
ecosystem.
14. Support for External Tables
• Hive allows the creation of external tables where data is stored outside of
Hive's control. This allows data to be queried in place without being moved
into the Hive warehouse.
• External tables are useful for integrating Hive with data stored in other
systems (e.g., HBase, S3, HDFS, or relational databases).
• Hive integrates with Business Intelligence (BI) tools such as Tableau, Power
BI, and QlikView via JDBC or ODBC drivers. This allows users to run SQL-
like queries on large datasets and visualize the results using familiar BI tools.
16. Security
Conclusion
Hive is a powerful and scalable data warehouse solution built on Hadoop that
facilitates querying large datasets using SQL-like syntax. Its features, such as
support for complex data types, integration with Hadoop and other big data tools,
scalability, batch processing capabilities, and the ability to handle structured and
semi-structured data, make it an essential tool for big data analytics and ETL
operations.
Summary of the Five Hive Architecture Components:
• Hive User Interface: Provides the interface (CLI, web, JDBC/ODBC) for users to interact with Hive and submit queries.
• HiveQL: SQL-like query language used to interact with data in Hive.
• Hive Metastore: Central repository for storing metadata about tables, partitions, and storage locations.
• Hive Execution Engine: Converts HiveQL queries into low-level execution plans (MapReduce, Tez, or Spark).
• Hive Driver: Manages the lifecycle of a query, including parsing, compiling, and executing queries.
These five components form the core of Hive's architecture, enabling it to perform
large-scale data processing and querying on the Hadoop ecosystem in a user-friendly
and scalable manner.
BUCKETING
Bucketing is a technique used in Apache Hive (and other data systems) to divide
large datasets into smaller, more manageable chunks, called buckets. This technique
is often applied to tables that are too large to be efficiently processed in a single
operation. Bucketing helps improve the performance of queries, especially those that
involve equality joins, by ensuring that the data is distributed evenly across the
cluster.
In Hive, bucketing is done based on the hash of one or more columns. The idea is
to use a hash function on a column's value to determine which bucket the record will
go into. This method ensures that rows with the same column value end up in the
same bucket.
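A possible bucketed-table definition in HiveQL, issued here through a Hive-enabled SparkSession (names and the bucket count are illustrative):
spark.sql("""
  CREATE TABLE customers_bucketed (customer_id INT, name STRING)
  CLUSTERED BY (customer_id) INTO 8 BUCKETS
  STORED AS ORC
""")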
Methods of bucketing
Types of tables in Hive
• Managed (Internal) Tables: Hive manages both the data and metadata; this is the default table type. The data is deleted when the table is dropped.
• External Tables: Hive manages only the metadata; the data is stored externally, independent of Hive, and is not deleted when the table is dropped.
• Partitioned Tables: Data is divided into partitions based on column values, improving query performance. The data is stored in different directories.
• Bucketed Tables: Data is divided into a fixed number of buckets using a hash of one or more columns, optimizing joins. The data is distributed into multiple bucket files.
• Transactional Tables: Support ACID operations for reliable updates, deletes, and inserts, typically with the ORC file format. ACID-compliant, with full transaction support.
• Views: Virtual tables based on stored queries. No physical data is stored; only query results are available.
• Materialized Views: Similar to views but store the query results physically for performance optimization.
Each type of table serves a specific purpose in Hive, and the choice of table type
depends on your specific use case, data size, performance requirements, and data
management needs.
SQOOP
Apache Sqoop is a tool for transferring bulk data between Hadoop (HDFS, Hive, HBase) and structured data stores such as relational databases.
Sqoop Commands and Operations:
• Importing Data:
o Sqoop provides the sqoop import command to import data from a
relational database into HDFS, Hive, or HBase. It can perform bulk
imports, handling large datasets efficiently.
Example:
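A representative import command, assuming a MySQL source; the connection string, credentials, table, and target directory are placeholders:
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales_db \
  --username dbuser -P \
  --table customers \
  --target-dir /data/customers \
  --num-mappers 4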
• Exporting Data:
o The sqoop export command is used to export data from HDFS back into
a relational database.
Example:
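A representative export command, with the same placeholder connection details:
sqoop export \
  --connect jdbc:mysql://dbhost:3306/sales_db \
  --username dbuser -P \
  --table daily_summary \
  --export-dir /data/daily_summary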
Example Workflow in Sqoop:
Benefits of Sqoop:
Common Use Cases for Sqoop:
Conclusion:
Sqoop is a crucial tool for bridging the gap between relational databases and Hadoop,
making it easier to transfer data between the two environments. By supporting both
import and export operations, Sqoop enables the movement of data to and from
Hadoop-based systems like HDFS, Hive, and HBase, offering an efficient and
scalable solution for big data workflows.
Apache Sqoop is a powerful tool for efficiently transferring data between Hadoop
and relational databases. Below are some of the basic commands in Sqoop, along
with their uses and explanations.
1. sqoop import
The sqoop import command is used to import data from a relational database
(RDBMS) into Hadoop's distributed storage system, such as HDFS, Hive, or HBase.
2. sqoop export
The sqoop export command is used to export data from HDFS to a relational
database. This is useful when you want to push processed data back into an RDBMS.
sqoop list-databases
The sqoop list-databases command lists all the databases in a specified relational
database management system.
sqoop list-tables
The sqoop list-tables command lists all the tables in a specific database of a relational
database management system.
sqoop create-hive-table
The sqoop create-hive-table command is used to create a Hive table when importing
data from a relational database. This command generates the Hive table structure
based on the relational database schema.
sqoop import-all-tables
The sqoop import-all-tables command imports all tables from a relational database
into HDFS or Hive. This command imports each table into its own directory in
HDFS or creates corresponding Hive tables.
sqoop job
The sqoop job command is used to create, list, or execute jobs in Sqoop. A job in
Sqoop is a predefined data transfer operation that can be scheduled and run later.
Summary of basic Sqoop commands:
• sqoop import: Import data from a relational database to HDFS, Hive, or HBase.
• sqoop export: Export data from HDFS to a relational database.
• sqoop list-databases: List databases in a relational database server.
• sqoop list-tables: List tables in a database.
• sqoop create-hive-table: Create a Hive table while importing data.
• sqoop import-all-tables: Import all tables from a relational database into HDFS or Hive.
• sqoop job: Create, list, or run a Sqoop job.
• sqoop eval: Execute SQL queries directly against a relational database.
• sqoop import --split-by: Split the import operation into multiple chunks for parallel execution.
• sqoop eval --batch: Execute multiple SQL queries in a single call.
Layers
1. Raw Layer (or Bronze Layer)
• Definition: The Raw Layer is the first stage in the data pipeline where raw,
unprocessed data is ingested from various sources into the system. This layer
stores data in its original form as it was collected, without any
modifications or transformations.
• Characteristics:
o Untransformed Data: Data is stored in the same format as it was
received (e.g., JSON, CSV, log files, etc.).
o Data Integrity: This layer is used primarily to ensure that the data is
ingested correctly and is available for further processing.
o Durability and Retention: The Raw Layer serves as a raw data
archive, where the original data is preserved, enabling traceability
and auditing. It allows data engineers to go back to the original data if
needed.
o Scalability: The raw layer should be highly scalable to handle large
volumes of data coming from various sources like logs, IoT devices,
transaction systems, etc.
• Example: Data from a streaming source, like web logs or sensor data, is
ingested into the raw layer without being modified.
2. Transformed Layer (or Silver Layer)
• Characteristics:
o Data Aggregation: Aggregations, summarizations, and other
operations (e.g., joining data from different sources) are performed in
this layer to provide more meaningful, structured data.
o Data Integration: Data from multiple sources is integrated into a
common format, such as converting data types, aligning time zones,
or handling schema changes.
o Quality Checks: The transform layer often includes validation to
ensure data quality and that it conforms to predefined standards or
business rules.
• Example: Raw web logs are cleaned, timestamped, and formatted into a
structured format (e.g., removing invalid entries, creating user sessions, or
calculating daily page views).
3. Golden Layer (or Gold Layer)
• Definition: The Golden Layer represents the final, cleanest, and most
refined version of the data. This layer is used for analytical purposes,
reporting, and business decision-making. The data in the Golden Layer is
often fully aggregated, consistent, and business-ready.
• Characteristics:
o High-Quality Data: The Golden Layer contains the final version of
data that is considered to be high-quality, trustworthy, and ready for
consumption by end-users.
o Business Insights: Data in this layer is usually transformed into the
key metrics and dimensions that are important for the business, such
as financial KPIs, customer behavior metrics, or product performance.
o Optimized for Reporting and Analytics: The Golden Layer is
typically optimized for consumption by business intelligence tools,
dashboards, and other analytics systems.
o Historical Data: It often contains historical data aggregated at various
time intervals, making it useful for trend analysis and long-term
reporting.
• Example: The transformed sales data from various regions might be
aggregated into monthly reports showing total revenue, customer growth,
and product category performance, ready for executive review.
Layered Architecture Example: From Raw to Golden
Summary of Layers:
These layers are common in data engineering practices and are used across various
modern data platforms, including data lakes, data warehouses, and data marts.
What is AWS?
AWS (Amazon Web Services) is a comprehensive and widely adopted cloud
computing platform provided by Amazon. It offers a broad set of on-demand
services for computing, storage, databases, networking, machine learning,
analytics, security, and more. AWS allows businesses and developers to access
scalable resources over the internet, without the need to invest in or maintain
physical infrastructure.
Key features:
• Integration
• Automation
• Scalability
• Security
• Pay-as-you-go pricing
What is AZURE?
Azure is Microsoft's cloud computing platform, also known as Microsoft Azure,
offering a wide range of cloud services for computing, analytics, storage, and
networking. These services can be used by businesses and developers to build,
deploy, and manage applications through Microsoft's global network of data
centers. Azure provides a platform for services like virtual machines, databases,
AI, machine learning, and much more, similar to other cloud platforms such as
Amazon Web Services (AWS) and Google Cloud.
Features:
• Scalability
• Security
• Pay-as-you-go pricing
• Global reach
• Data analytics and big data
• CI/CD
What is GCP?
GCP (Google Cloud Platform) is Google's suite of cloud computing services for compute, storage, networking, data analytics (e.g., BigQuery), and machine learning.
What is Docker?
Docker is a platform for packaging applications and their dependencies into lightweight, portable containers that run consistently across environments.
What is Apache Kafka?
Apache Kafka is a distributed event streaming platform used to publish, store, and process real-time streams of records, commonly used for building data pipelines and streaming applications.
SNOWFLAKE
DATA WAREHOUSE
SPARK
SQOOP
HIVE
AIRFLOW
Apache Airflow is an open-source platform for authoring, scheduling, and monitoring data pipelines (ETL workflows).
In the context of data engineering and data architecture, particularly in modern data
platforms, "fabric" often refers to a data fabric—a unified architecture that
integrates, manages, and orchestrates data from various sources. In this
architecture, data is usually processed and stored in multiple layers that help in
transforming, organizing, and enriching the data for consumption by business
users, analysts, and data scientists.
DATABRICKS
7. Collaborative Development: Teams can collaborate seamlessly with
version control, shared workspaces, and dashboards. The notebooks allow
for easy sharing and visualization of results in real-time.
Use Cases:
• Data Analytics: Business analysts and data scientists can use Databricks to
query large datasets, analyze trends, and visualize data insights.
• Data Engineering: It helps in building complex ETL pipelines for
transforming and moving data to different storage systems or databases.
• Machine Learning: Databricks is widely used for developing, training, and
deploying machine learning models at scale.
1. Databricks Overview
• What is Databricks?
Understand that Databricks is a cloud-based platform for big data analytics
and machine learning, primarily built around Apache Spark. It integrates
tightly with Azure and AWS, providing an environment for running data
processing jobs and creating machine learning models.
• Core Components:
o Databricks Workspace: A web-based interface to create notebooks,
dashboards, jobs, and clusters.
o Clusters: Virtual machines running Apache Spark. You need to know
how to create and manage clusters for processing data.
o Notebooks: Interactive documents where you can run Spark code,
visualize data, and document findings.
o Jobs: Automated workflows for running notebooks or JAR files.
2. Apache Spark
• Introduction to Apache Spark: Databricks is built on Apache Spark, so you
need to understand how Spark works. Learn about Spark's architecture and
its key components:
o Spark Core: The foundation of Spark, handling the execution of
distributed tasks.
o Spark SQL: A component for querying structured data with SQL.
o Spark DataFrames: Data structures for distributed data processing.
o RDDs (Resilient Distributed Datasets): The lower-level data
structure Spark uses for distributed processing.
• Key Concepts:
o Distributed Data Processing: Understanding how Spark distributes
data and computations across a cluster.
o Transformations & Actions: The two types of operations in Spark
(Transformations modify data, Actions return results).
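A tiny Scala sketch of the transformation/action distinction described above (assumes an active SparkSession named spark):
val nums = spark.range(1, 6)                          // a distributed dataset of 1..5
val doubled = nums.selectExpr("id * 2 AS doubled")    // transformation: lazy, nothing runs yet
doubled.show()                                        // action: triggers execution and prints results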
3. Data Engineering Skills
• Data Ingestion: Learn how to read data from various sources, including:
o CSV, Parquet, JSON files (from local or cloud storage).
o Databases (like JDBC connections).
o Delta Lake: Understand this open-source storage layer that brings
ACID transactions to Apache Spark and big data workloads.
• ETL (Extract, Transform, Load): Learn how to create ETL pipelines in
Databricks using Spark.
• Data Transformation & Cleansing: You should know how to manipulate
large datasets using Spark's DataFrame and SQL API for tasks like filtering,
aggregation, and joins.
4. Machine Learning with MLlib and MLflow
• MLlib: Spark's scalable machine learning library. Learn basic algorithms,
including classification, regression, clustering, and recommendation.
• MLflow: Learn to use MLflow for managing the entire machine learning
lifecycle, including tracking experiments, packaging models, and deploying
them.
5. Delta Lake
• ACID Transactions: Understand Delta Lake’s ability to support ACID
transactions, which provides consistency and reliability for big data
workloads.
• Time Travel: Learn how to query previous versions of the data using Delta
Lake.
• Schema Evolution: Understand how Delta Lake handles changes in data
schema over time.
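A small sketch of Delta Lake time travel (assumes the Delta Lake library is available on the cluster; the path is a placeholder):
// Read version 0 of a Delta table (time travel by version number)
val dfV0 = spark.read.format("delta").option("versionAsOf", "0").load("/delta/events")
dfV0.show()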
6. Databricks Notebooks
• Creating Notebooks: Learn how to create, organize, and share notebooks.
• Languages: Databricks supports multiple languages:
o Python: The most common language for data science and machine
learning.
o SQL: For querying structured data.
o Scala: For advanced Spark applications.
o R: For data science and statistical analysis.
• Visualization: Learn how to create visualizations (e.g., bar charts,
histograms, scatter plots) to explore and interpret your data.
7. Collaborating and Sharing
• Collaborative Workflows: Learn how to collaborate with other users in a
Databricks workspace.
• Sharing Notebooks: Learn how to export and share notebooks with others,
either through links or by exporting to formats like HTML or PDF.
• Version Control: Understand how to use Git integration for version control
within notebooks.
8. Automation and Scheduling Jobs
• Job Scheduler: Learn to create and schedule jobs to run notebooks
automatically at specified intervals.
• Cluster Management: Learn how to manage and scale clusters to optimize
cost and performance.
• Databricks REST API: For automating tasks like job scheduling or cluster
management programmatically.
9. Security and Permissions
• Access Control: Learn about role-based access control (RBAC) for
managing user permissions in Databricks.
• Workspace Permissions: Understand the difference between workspace
admins, users, and contributors, and how to manage their permissions.
• Cluster Security: Learn about securing clusters and setting up encryption.
10. Integrating with Other Services
• Cloud Integration: Understand how Databricks integrates with cloud
services like Azure (Azure Databricks) and AWS (Databricks on AWS).
• Data Storage: Learn how to integrate with cloud storage services (e.g.,
Amazon S3, Azure Blob Storage).
• Data Lakes and Warehouses: Understand integration with data lakes, data
warehouses, and other big data platforms.
11. Best Practices
• Optimization: Learn best practices for optimizing Spark jobs for
performance and cost efficiency (e.g., partitioning, caching, and
broadcasting).
• Monitoring and Debugging: Learn how to monitor Spark jobs, view logs,
and troubleshoot errors.
Learning Path
• Start by setting up a Databricks account (via AWS or Azure) and
familiarize yourself with the interface.
• Try running simple notebooks with basic Spark transformations.
• Work through sample ETL pipelines and understand how to ingest data
from various sources.
• Move on to more advanced concepts, such as Delta Lake, MLflow, and job
scheduling.
1. CPU (Central Processing Unit)
• What is CPU?
CPUs are the traditional, general-purpose processors found in most
computers and virtual machines (VMs). In Databricks, when you create
clusters, you usually get CPUs as the default processing unit. CPUs are
suitable for tasks that involve parallel processing of smaller workloads or
tasks that don’t require heavy parallelization.
• When to Use CPUs in Databricks:
o Data Engineering & ETL: CPUs are sufficient for traditional ETL
(Extract, Transform, Load) jobs and data preprocessing tasks. Spark
jobs that involve SQL queries, data cleaning, and transformation
usually run on CPU-based clusters.
o Large-Scale Data Processing: CPUs are generally used when you
need to process large datasets using Apache Spark, especially when
the task doesn’t involve complex computations that benefit from
massive parallelization.
o Cost Efficiency: CPU clusters are typically more cost-effective than
GPU clusters, especially for standard data processing tasks.
• Advantages:
o Cost-Effective: CPU-based instances are often cheaper compared to
GPU instances, making them ideal for less intensive tasks.
o Versatile: Suitable for a broad range of workloads, including basic
machine learning, analytics, and data transformation.
• Limitations:
o Not Ideal for Heavy Computation: CPUs are not well-suited for
very computationally intensive tasks, such as training large deep
learning models, where parallelization and vectorized computations
are crucial.
2. GPU (Graphics Processing Unit)
• What is GPU?
GPUs are specialized hardware designed for highly parallel processing.
They are particularly useful in tasks that require large-scale matrix
operations or high-speed data processing, such as training machine learning
models or running deep learning algorithms. GPUs excel in handling
computations that require thousands or millions of parallel calculations at
once.
• When to Use GPUs in Databricks:
o Deep Learning/Training Neural Networks: GPUs are essential for
deep learning tasks (e.g., training large neural networks, such as
Convolutional Neural Networks (CNNs) or Recurrent Neural
Networks (RNNs)) using frameworks like TensorFlow, PyTorch, or
Keras. GPUs accelerate these computations significantly due to their
ability to handle massive parallel operations.
o Machine Learning with Large Models: For models that involve
large amounts of matrix multiplication (like linear regression, decision
trees, or random forests), GPUs can offer faster performance than
CPUs.
o Big Data Processing with Complex Algorithms: When working
with complex algorithms such as clustering, large-scale matrix
factorization, or other linear algebra-heavy operations, GPUs can
speed up the processing significantly.
• Advantages:
o Parallel Processing Power: GPUs can handle thousands of parallel
tasks simultaneously, making them ideal for computationally intensive
workloads like deep learning and complex mathematical
computations.
o Faster Model Training: For deep learning tasks, GPUs reduce the
time required to train models by orders of magnitude compared to
CPUs.
o High Throughput: GPUs provide high throughput for batch
processing, making them suitable for real-time or high-speed data
analysis.
• Limitations:
o Cost: GPU-based instances are generally more expensive than CPU-
based instances, so they may not be cost-effective for simple tasks or
smaller-scale operations.
o Limited Usage Outside of ML/DL: GPUs excel at specific tasks
(e.g., training neural networks) but are not always necessary for
general-purpose data processing or traditional SQL-based tasks,
making them an overkill for simpler workloads.
3. Databricks - Integration with GPUs
• Cluster Configuration: When creating a cluster in Databricks, you can
choose between CPU-based or GPU-based instances. To use GPUs, you
typically select instances that are equipped with NVIDIA GPUs (e.g., Tesla
T4, V100, A100, etc.).
• Libraries & Frameworks for GPUs:
o CUDA (Compute Unified Device Architecture): This is a parallel
computing platform and application programming interface (API)
model created by NVIDIA. It enables software developers to use
GPUs for general-purpose processing (GPGPU). Databricks integrates
with CUDA-enabled libraries, such as TensorFlow, PyTorch, and
XGBoost, to make use of GPUs.
o Deep Learning Libraries: Databricks supports the use of popular
deep learning frameworks like TensorFlow, PyTorch, and Keras that
are GPU-accelerated. These frameworks take advantage of GPU
capabilities to speed up training and inference for large-scale deep
learning models.
o Databricks Runtime for Machine Learning (DBR ML): This
runtime includes optimizations and pre-installed libraries that support
GPU usage for machine learning and deep learning tasks.
4. Cluster Types and GPUs in Databricks
• GPU-enabled clusters: In Databricks, you can choose GPU-powered virtual
machines (VMs) for specific machine learning tasks. These clusters will
automatically configure the environment to use the GPU for training models.
• Types of GPUs in Databricks:
o Tesla K80: Older generation GPU, generally used for basic deep
learning and machine learning tasks.
o Tesla V100/A100: High-performance GPUs suitable for training
large-scale deep learning models.
o Tesla T4: A mid-range GPU optimized for machine learning
inference workloads.
5. Cost Considerations
• CPUs: Generally cheaper for general-purpose workloads and data
processing. Ideal for day-to-day data engineering, SQL queries, and other
non-intensive computations.
• GPUs: More expensive but necessary for deep learning, complex machine
learning tasks, or highly parallelizable computational workloads. The cost
may be justified by the significant performance improvement in these
specialized tasks.
6. Hybrid Use Case: CPU + GPU
• In some complex workflows, Databricks allows you to use both CPU and
GPU resources within the same cluster. For example, you can use CPUs for
data preprocessing and Spark-based tasks, while offloading the heavy model
training or inference to GPUs.
Summary Comparison
• Use Case: CPUs suit data processing, ETL, and SQL workloads; GPUs suit deep learning and complex ML tasks.
Conclusion
In Databricks:
• Use CPUs for traditional data engineering, processing large datasets with
Spark, ETL, and other general-purpose tasks.
• Use GPUs for specialized, intensive tasks like deep learning model training,
high-performance machine learning, and tasks that require massive parallel
computation.
DATA STRUCTURE
1. Primitive Data Structures: These are the basic building blocks of data
storage. They directly represent data and include:
a. Integer: Whole numbers (e.g., 1, 2, 3)
b. Float: Decimal numbers (e.g., 3.14, 2.71)
c. Character: A single letter or symbol (e.g., 'a', 'b')
d. Boolean: Represents two states, typically True or False
e. String: A sequence of characters (e.g., "hello", "world")
2. Non-Primitive Data Structures: These are more complex structures that
can store multiple values, and they are often built from primitive data types.
Key examples include:
a. Arrays: A collection of elements, all of the same type, stored in
contiguous memory locations. Each element is accessed by its index
(e.g., a list of integers [1, 2, 3, 4]).
b. Linked Lists: A linear collection of elements called nodes, where
each node contains data
Types of data structures
1.1 Arrays
• Operations:
• Use Case: When you need fast access to elements by index and have a
fixed-size collection of elements.
1.2 Linked Lists
• Definition: A linked list is a linear data structure where each element (called
a node) contains data and a reference (link) to the next node in the sequence.
• Operations:
• Use Case: When you need dynamic memory allocation and need to
frequently insert or remove elements.
• Example: A playlist where each song points to the next song in the list.
1.3 Stacks
• Operations:
• Use Case: When you need to manage data in a reverse order or need
backtracking, such as undo functionality.
• Example: A stack of plates in a restaurant, where the plate on top is the one
to be served next.
1.4 Queues
• Definition: A queue is a collection that follows the First In, First Out
(FIFO) principle, where the first element added is the first to be removed.
• Operations:
• Use Case: When processing items in the order they were added, such as
managing tasks in a printer queue or tasks to be processed by a server.
• Example: A line at a checkout counter where the first person to get in line is
the first one to be served.
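A tiny Scala sketch contrasting the LIFO and FIFO behavior described above:
import scala.collection.mutable
val stack = mutable.Stack(1, 2)
stack.push(3)
println(stack.pop())      // 3: last in, first out
val queue = mutable.Queue(1, 2)
queue.enqueue(3)
println(queue.dequeue())  // 1: first in, first out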
These structures are more complex and are often built from basic structures.
2.1 Trees
• Types of Trees:
o Binary Search Tree (BST): A binary tree with the property that for
each node, the left subtree contains only nodes with values less than
the node's value, and the right subtree contains only nodes with values
greater than the node's value.
• Operations:
• Use Case: Used for hierarchical data, searching, sorting, and indexing.
2.2 Heaps
• Operations:
2.3 Hash Tables
• Definition: A hash table stores key-value pairs, where each key is hashed
into an index in an array. The hash function computes the index based on the
key.
• Operations:
• Use Case: When you need fast lookups, insertions, and deletions.
2.4 Graphs
• Operations:
3. Time Complexity (Big O Notation)
• O(log n): Logarithmic time – typical of binary search and balanced trees.
• O(n log n): Log-linear time – typical for sorting algorithms (e.g., Merge
Sort).
• O(n^2): Quadratic time – typical for nested loops or inefficient sorting (e.g.,
Bubble Sort).
4. Summary
• Stacks and Queues: LIFO and FIFO order for processing elements.
Difference between OLAP and OLTP
OLAP (Online Analytical Processing): optimized for analyzing large volumes of historical data to support business decisions.
OLTP (Online Transaction Processing): optimized for fast, frequent transaction processing.
• Deals with smaller data volumes per transaction, but at high frequency.
• Examples: ATM systems, online banking, point-of-sale systems, order processing.
Key Differences:
1. Purpose: OLAP is designed for analyzing large datasets and making business
decisions, while OLTP is focused on transaction processing.
DATA WAREHOUSE
A Data Warehouse is a centralized repository that stores structured, processed data organized for reporting and analysis.
Characteristics:
• Historical Data: Primarily stores historical data for reporting and trend
analysis.
• Data Consistency: High consistency and data integrity due to the structured
nature of the data.
Use Cases:
DATA LAKE
A Data Lake is a large storage repository that can handle a vast amount of raw,
unstructured, semi-structured, and structured data. It allows for the storage of all
types of data without predefined schema constraints, making it highly flexible.
Characteristics:
• Raw Data: Can store raw, untransformed data, including log files, sensor data,
images, audio, video, social media data, and more.
• Schema-on-Read: Data is stored in its raw form, and the schema is applied
when the data is read (during analytics or querying).
• Flexibility: Suitable for data exploration and machine learning projects that
require working with varied data types.
• Unstructured Data: Can handle unstructured data like text, multimedia, etc.
Use Cases:
Example:
• Apache Hadoop
• Amazon S3 (as a Data Lake)
• Azure Data Lake Storage
DATA LAKEHOUSE
A Data Lakehouse combines the flexible, low-cost storage of a data lake with data warehouse features such as ACID transactions and schema enforcement (for example, via Delta Lake).
Characteristics:
Use Cases:
Example:
ETL (Extract, Transform, Load) vs. ELT (Extract, Load, Transform)
ETL and ELT are both data integration processes used to move data from various
sources to a data storage system, such as a data warehouse or data lake, for analysis.
The key difference between them lies in the order and approach in which the data
transformation occurs.
ETL (Extract, Transform, Load)
In ETL, data is:
1. Extracted from the source systems.
2. Transformed (cleaned, validated, and reshaped) outside the target system.
3. Loaded into the target data storage system, such as a data warehouse, where
it is ready for querying and analysis.
• Data Transformation happens before loading the data into the target system.
• Often used in data warehouses that are optimized for structured data.
Advantages of ETL:
• Data quality and consistency are high because transformations happen before
the data is loaded.
• Data can be validated and enriched before being loaded into the data
warehouse.
Disadvantages of ETL:
Use Case:
• Legacy Systems: When dealing with traditional data storage systems that
require preprocessing before analysis.
ELT (Extract, Load, Transform)
1. Extracted data is moved from the source system to the target storage system
(like a data warehouse or data lake) before any transformation.
2. Loaded into the target system in its raw (or lightly processed) form.
3. Transformed after the data is loaded using the computational power of the
target system (often in the cloud, like with Google BigQuery, Amazon
Redshift, or Snowflake).
• Data Loading happens before any transformation, allowing the target system
to perform the transformations using its processing capabilities.
Advantages of ELT:
• ELT is more scalable and can handle larger data volumes because
transformations leverage the computational power of modern cloud-based
systems.
• Faster data loading since transformation is deferred until after the data is
stored.
• Better suited for real-time analytics, as transformations can occur on-demand
using the raw data stored in the data warehouse.
Disadvantages of ELT:
• More complex data transformations can increase the processing load on the
target system, potentially affecting performance if not optimized.
• May require more advanced skills to manage the transformation process after
the data is loaded.
Example:
• AWS Redshift: A cloud-based data warehouse where data is loaded first, and
SQL-based queries are used to process the data.
Use Case:
• Big Data & Analytics: When large volumes of data need to be ingested
quickly, processed in real-time, and analyzed on demand, such as with
customer behavior analytics or IoT data.
ELT key points:
• Transformation occurs within the target system (e.g., a cloud data warehouse).
• Typical uses: big data, real-time analytics, machine learning, and data lakes.
Difference between schema on write and schema on read
Schema on write
1. "Schema on Write" system, the schema (structure of data) is defined before the
data is written to the storage system. The data must conform to this predefined
schema at the time of writing.
• When the Schema is Applied: The schema is applied during the process of
writing data, meaning the data must meet the structure and type requirements
of the schema before it is stored.
Advantages:
• Data is stored in a structured and predictable way, making queries faster and
more efficient.
Disadvantages:
• Less flexibility since data must conform to the schema before it is written.
Schema on read
Definition: In a "Schema on Read" system, the schema is applied when the data is
read (queried), not when it is written. The data is stored in its raw or unstructured
form, and the schema is defined dynamically at the time of data retrieval.
When the Schema is Applied: The schema is applied during the process of reading
or querying data. The structure is often inferred or defined on the fly based on the
user's query or data processing.
Advantages:
Disadvantages:
• Data retrieval can be slower because the schema must be applied dynamically
at the time of reading.
• Data integrity and consistency are not enforced when the data is initially
written, which can lead to messy or inconsistent datasets.
Snowflake schema:
Similar to the star schema, but the dimension tables are further broken down (normalized) into related sub-dimension tables.
Example: a central fact table linked to dimension tables, which are in turn split into sub-dimension tables.
Star schema:
A star schema is a type of database schema used in data warehouses.
The diagram resembles a star, with a central fact table surrounded by dimension tables.
Comparison: Snowflake Schema vs Star Schema
• Redundancy: the star schema has higher redundancy, since denormalization leads to repeated data and can increase storage requirements; the snowflake schema has lower redundancy, since normalization reduces data repetition and optimizes storage.
• Use case: the star schema is ideal for data marts and small to medium-sized datasets; the snowflake schema is ideal for large-scale data warehouses.
Fact table:
1. Fewer attributes (columns)
2. More records (rows)
Dimension table:
1. More attributes (columns)
2. Fewer records (rows)
3. Mostly text (descriptive) attributes
Dimension table: contains descriptive attributes.
Dimension: textual context describing what, when, and which.
Measure: a numeric value that is analyzed (stored in the fact table).
Fact table:
A fact table is a central table in a data warehouse schema that stores quantitative
data for analysis and reporting purposes. It contains numerical metrics, measures,
and facts that are the focal point of business intelligence queries. Fact tables are used
to track business processes, events, or transactions and are typically used to answer
analytical questions like "What is the total sales revenue?" or "How many products
were sold?"
Types of dimension table
1. Conformed Dimension
• Example: A Time dimension used in both a Sales fact table and an Inventory
fact table would be considered conformed if it has the same structure (e.g.,
Day, Month, Quarter, Year) across both fact tables.
2. Slowly Changing Dimension (SCD)
A slowly changing dimension (SCD) refers to a dimension table that changes over time, but at a slower rate. Several SCD types (commonly Types 1 through 6) are distinguished by how historical data is managed:
• Type 1 (Overwrite): The old value is overwritten with the new value, so no history is kept.
• Type 2 (Full Historical Tracking): A new row is added for each change, which is
useful when tracking changes over time is important (e.g., tracking a
customer's address at different points in time). Typically, a start date and end
date are used to indicate the period the record is valid.
• Type 3 (Limited Historical Tracking): This approach stores only the current
value and the previous value of an attribute. It's useful when only a limited
history is needed (e.g., storing only the previous and current addresses of a
customer).
• When to Use: Type 1 is used when historical changes are not important, and
only the most current data is needed.
Example:
• Before Update:
• After Update:
Example:
• Before Update:
• After Update:
• Pros: Full historical tracking; captures the state of data at different times.
• Cons: Requires more storage and can make queries more complex.
• When to Use: Type 3 is useful when only limited history is required, such
as storing the current and previous values. For example, tracking the current
and previous addresses of a customer.
• Granularity: Stores current and only one previous version of the dimension.
Example:
• Before Update:
• After Update:
• Pros: Retains limited history (e.g., current and previous values) and is
relatively simple to implement.
• Cons: Limited to only two versions of the data, not suitable for tracking full
history.
• When to Use: Type 4 is used when you want to keep the dimension table
lean (only storing the current values) but still track historical changes
separately.
• Granularity: Current data is stored in the dimension table, and historical data
is stored in a different table.
Example:
• Current Table:
• Historical Table:
• Pros: Keeps the main dimension table clean and efficient; historical data can
be managed separately.
• Cons: Requires managing and joining two separate tables, which may add
complexity.
• When to Use: Type 5 is used when you want to capture both current and
historical data but need to separate the frequently changing attributes (mini-
dimension) from the rest of the data.
• Granularity: Combines historical tracking with the use of a surrogate key for
efficient querying.
Example:
• Mini-Dimension Table:
• Pros: Allows you to store both current and historical data efficiently using
surrogate keys.
• When to Use: Type 6 is used when you need to manage various types of
changes in a single dimension, depending on the nature of the attribute.
• Granularity: Flexible, depending on the attribute; allows for multiple
tracking mechanisms.
Example:
• Before Update:
• After Update:
Summary of SCD Types:
• Type 1: Overwrite the old value. Used when history is not needed. Keeps only the current value.
• Type 2: Add a new row to track changes. Used when full historical tracking is needed. Keeps full history (a new row per change).
• Type 3: Keep the previous value in an extra column. Used when only limited history is needed. Keeps the current and one previous value.
• Type 4: Store historical data in a separate table. Used when you need to keep the dimension table lean. Keeps current data in the dimension table and history in a separate table.
• Type 5: Split frequently changing attributes into a mini-dimension. Used when current and historical data must be tracked efficiently with surrogate keys.
• Type 6: Hybrid of Types 1, 2, and 3. Used when different attributes in one dimension need different kinds of change tracking.
3. Junk Dimension
4. Degenerate Dimension
A degenerate dimension is a dimension that does not have its own dedicated
dimension table but is instead stored directly in the fact table. These are typically
transactional identifiers that don't require their own dimension table since they don't
have descriptive attributes.
5. Role-Playing Dimension
6. Standard Dimension
A standard dimension is a typical, non-specialized dimension that contains simple
descriptive attributes without any special handling like that of SCDs, junk
dimensions, or role-playing dimensions. These dimensions store static or slowly
changing descriptive information.
7. Shrunken Dimension
8. Time Dimension
While technically a type of dimension, the Time Dimension is often singled out due
to its importance and ubiquity in data warehousing. It is a specialized dimension
used to track time-based attributes, such as date, month, quarter, and year.
9. Hierarchical Dimension
A hierarchical dimension contains attributes that form a natural hierarchy, allowing users to drill down or roll up across levels of detail.
• Example: A Geography Dimension could have a hierarchy such as Country
→ Region → State → City. This allows users to analyze data at various levels
of geography.
Dimension types at a glance (Dimension Type – Description – Example):
• Slowly Changing Dimension (SCD): Tracks changes over time (Types 1, 2, and 3). Example: Customer Address (Type 2), Product Name (Type 1).
• Degenerate Dimension: Does not have its own dimension table; stored in the fact table. Example: Invoice Number, Transaction ID.
• Role-Playing Dimension: A dimension that plays multiple roles. Example: Date Dimension (Order Date, Ship Date, etc.).
• Standard Dimension: Regular dimension with descriptive attributes. Example: Product, Customer, Region.
• Shrunken Dimension: A reduced version of a dimension table for performance optimization. Example: Time Dimension with Year and Month only.
Conclusion:
The choice of dimension type depends on how much history needs to be preserved, how often attributes change, and how the dimension will be used in reporting and analysis.
Types of facts table
In a data warehouse, a fact table is the central table that stores quantitative data
(measures or metrics) related to business events or transactions. These tables are
used to track performance metrics and support reporting and analytical queries.
Depending on the nature of the data and the business needs, fact tables can be
classified into several types.
1. Transactional Fact Table
• Definition: A transactional fact table stores one row per business event or transaction at the most detailed level.
• Usage: It is used to record the granular details of each business event, such as individual sales, purchases, or orders.
• Key Points:
o Each record is highly granular, typically containing one row for each
event.
• Usage: This type of fact table is used for capturing and analyzing key
performance indicators (KPIs) at regular intervals, such as daily sales totals
or monthly profit.
• Granularity: Lower granularity (captures data at set time intervals like daily,
weekly, or monthly).
Example: A monthly sales snapshot fact table where each row represents the
total sales for a particular month.
Schema:
• Key Points:
• Definition: A cumulative fact table stores data that is aggregated over time
and continuously updated, often showing the cumulative value of a metric up
to a specific point.
• Usage: This type of fact table is typically used to track cumulative measures
like total sales or total profit over time, which are continuously updated to
reflect the total up to a specific period.
• Key Points:
• Usage: This type of fact table is used when querying large transactional fact
tables would be slow, and pre-aggregated summaries can provide quicker
insights.
• Key Points:
• Definition: A factless fact table does not contain any measurable data (i.e.,
no numeric metrics such as sales, quantity, or revenue). Instead, it records
events or conditions that are of interest and that can be used to count
occurrences or track specific events.
• Usage: Factless fact tables are typically used to track events or conditions,
such as whether a specific event occurred or if a certain condition was met
during a given period.
• Key Points:
• Usage: Periodic snapshots are used to capture and store the state of certain
business metrics (e.g., inventory levels, account balances) at periodic intervals.
• Key Points:
o Captures aggregate data for each period, often for KPIs or performance
measures.
Fact table types at a glance (Fact Table Type – Description – Granularity – Usage):
• Cumulative Fact Table: Stores cumulative data (e.g., running totals). Lower granularity (aggregated). Used for tracking cumulative measures over time.
• Factless Fact Table: Tracks events or conditions without storing numeric measures. Event-based granularity. Used for tracking occurrences or conditions.
Conclusion:
Different types of fact tables serve different purposes in a data warehouse, ranging
from detailed transaction tracking to high-level aggregated data for performance
optimization. The choice of which type of fact table to use depends on the nature of
the data, business requirements, and the level of granularity needed for reporting and
analysis.
Tools STACK
ETL:
• Talend
ELT:
• ADF (Azure Data Factory)
• ADL
• AS
Hybrid tools:
• Apache Spark
• Databricks
• KNIME
Storage:
• Amazon S3
• ADLS (Azure Data Lake Storage)
• HDFS
Data ingestion:
• Apache Kafka
• ADF
• AWS Glue
• Apache NiFi
Data processing:
• Apache Spark
• Databricks
• Azure HDInsight
Data cataloging and governance:
• Azure Purview
• Apache Atlas
Data querying:
• Presto
• Amazon Athena
Security:
• AWS IAM
• Apache Ranger
Jira is a popular project management and issue tracking software developed
by Atlassian. It is widely used for tracking and managing software development
tasks, as well as other types of work such as business processes and service
management. Jira is commonly employed in Agile and Scrum methodologies, where
teams can plan, track, and release software in iterative cycles.
1. Issue Tracking: Jira allows users to create, assign, and track issues (such as
bugs, tasks, stories, or improvements). Each issue can be customized with
fields, priorities, statuses, and assignees.
3. Agile Support: Jira supports Agile frameworks such as Scrum and Kanban, with boards, backlogs, and sprint planning tools that let teams plan and track iterative work.
5. Reporting and Dashboards: Jira offers a variety of built-in reports and the
ability to create custom dashboards. These help teams track progress, monitor
key performance indicators (KPIs), and identify bottlenecks or areas that need
improvement.
6. Collaboration Tools: Jira integrates with communication tools (e.g., Slack,
Confluence, Microsoft Teams) to facilitate collaboration. Teams can
comment on issues, @mention team members, and link issues to documents
or other issues.
7. Automation: Jira includes powerful automation features that can help reduce
manual tasks. For example, you can set up rules that automatically assign
issues, transition them based on specific actions, or notify team members of
updates.
8. Integration with Other Tools: Jira integrates well with many third-party
tools and services, such as Bitbucket (for Git-based version control),
Confluence (for knowledge sharing), Trello (for task management), and other
CI/CD tools.
9. Permissions and Access Control: Jira has a robust permission model that
enables fine-grained control over who can view, edit, or manage different
aspects of projects and issues.
3. Jira Work Management: A project management solution that is more
focused on business teams, providing tools for task tracking, process
management, and reporting.
A simple Jira workflow for a development task might look like this:
3. Code Review: The task is completed and under review by another developer.
2. Improved Visibility: Managers and team members can quickly check the
status of tasks, understand the overall project progress, and identify potential
issues.
5. Scalability: Jira scales from small teams to large enterprises, and can be
integrated with a variety of tools and systems, making it suitable for different
business needs.
Common Use Cases for Jira:
• Bug Tracking: Reporting and tracking bugs and issues in the software
development life cycle.
2. Sprint Planning: From the backlog, select tasks to work on in the current
sprint.
4. Code Review: Tasks are reviewed by another team member for quality.
• Jira vs. Trello: While both are owned by Atlassian, Jira is focused on
detailed project tracking and issue management (ideal for software
development teams), while Trello is simpler, providing a more flexible, visual
kanban-style board for general project management.
• Jira vs. Asana: Asana is designed for team collaboration and task
management, while Jira is more tailored for complex issue tracking and
software development workflows.
Conclusion:
Jira is a powerful and flexible tool for managing software development projects,
tracking issues, and implementing Agile methodologies. Whether you're working in
software development, IT service management, or business project management,
Jira's robust feature set and customizability make it a valuable tool for teams of all
sizes.
SNOWFLAKE
4. Data Sharing: Snowflake supports secure and efficient data sharing between
organizations. Data sharing allows users to access and query data from
another Snowflake account without the need to copy the data, which enhances
collaboration and data accessibility.
6. SQL-Based: Snowflake supports SQL (Structured Query Language), making
it compatible with various BI tools and applications that already use SQL for
querying databases.
7. Data Types and Integration: Snowflake can handle a wide variety of data
types, including structured, semi-structured (like JSON, XML, Avro,
Parquet), and unstructured data. It integrates easily with data lakes and third-
party tools like Tableau, Power BI, Apache Spark, and more.
Snowflake Architecture:
1. Database Storage Layer: This is where all the data is stored. Snowflake
stores data in a centralized repository that can scale automatically as data
volume increases. The data is stored in a compressed, optimized format.
Benefits of Snowflake:
4. Ease of Use: Snowflake uses SQL for querying, which is familiar to most data
analysts and developers. It also provides a user-friendly interface for
managing data and workloads.
Use Cases:
2. Business Intelligence (BI): With its high performance and compatibility with
BI tools, Snowflake is ideal for companies that need fast, scalable data
analytics and reporting.
3. Data Lake: Snowflake can be used as a data lake, handling both structured
and semi-structured data, and allowing for easy integration with data lakes
and data pipelines.
4. Data Sharing: Snowflake’s data-sharing capabilities make it useful for
businesses that need to share data between different departments, vendors, or
external stakeholders.
Conclusion:
Snowflake offers a scalable, SQL-based cloud data platform that supports data warehousing, data lake, data sharing, and BI workloads with minimal administration.
PySpark OPTIONS (reading CSV files):
Example: Reading a CSV File with PySpark
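A minimal sketch of the basic read call (the SparkSession setup and file path are assumptions for illustration); the individual options are explained below.
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point for the DataFrame API
spark = SparkSession.builder.appName("read_csv_example").getOrCreate()

# Read a CSV file with a header row and schema inference
df = spark.read.option("header", True).option("inferSchema", True).csv("path/to/file.csv")
df.show()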
Now let's break down each of the options you can use while reading a CSV file in
PySpark:
1. Header (header)
• header=True: This option tells PySpark that the first row in the CSV file
contains the column names. If set to False, PySpark will assign default column
names (_c0, _c1, etc.).
2. InferSchema
• inferSchema=True: This option tells PySpark to scan the data and automatically assign an appropriate data type to each column instead of reading everything as strings.
Example: For a CSV file with integers and strings, PySpark will automatically determine which columns are integers and which are strings.
3. Delimiter (sep)
• sep="|": This option specifies the delimiter used in the file. By default, CSV
files use a comma (,), but if your file uses another delimiter, such as a pipe (|),
you can specify it using sep.
• PySpark will read the file with a pipe as the delimiter and create the dataframe.
4. Date Format (dateFormat)
• dateFormat="yyyy-MM-dd": This option specifies how date values in the file are formatted so they can be parsed correctly.
• Example: If your CSV file contains dates in the format yyyy-MM-dd, PySpark will automatically interpret them correctly.
5. Null Value (nullValue)
• nullValue="NULL": This option specifies which string in the file should be treated as a null value.
• PySpark will interpret "NULL" as actual null values in the age column.
6. Quote (quote)
• quote='"': This option defines the quote character used in the file. By default,
the quote character is a double quote ("). It is typically used to wrap text fields
that contain delimiters (e.g., commas or pipes).
• Example: If your CSV file contains values like "New York" to wrap the field,
PySpark will correctly interpret these as single fields even if they contain a
delimiter.
7. Escape (escape)
• escape='"': The escape option specifies the escape character used to escape
special characters within quoted text. For instance, if your file contains double
quotes within quoted fields, you can specify an escape character to correctly
handle them.
• The escape character (") ensures that the quotes inside the field are handled
properly, and they won't be treated as delimiters.
Full Example with All Parameters
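A sketch combining the options covered above into one read call (the file path and the exact column layout are assumptions for illustration):
from pyspark.sql import SparkSession

# Initialize a SparkSession
spark = SparkSession.builder.appName("csv_options_example").getOrCreate()

df = (
    spark.read
    .option("header", True)              # first row contains column names
    .option("inferSchema", True)         # let Spark infer each column's data type
    .option("sep", "|")                  # pipe-delimited file
    .option("dateFormat", "yyyy-MM-dd")  # how date values are written in the file
    .option("nullValue", "NULL")         # treat the literal string NULL as a null value
    .option("quote", '"')                # quote character wrapping text fields
    .option("escape", '"')               # escape character inside quoted fields
    .csv("path/to/file.csv")             # hypothetical path
)
df.show()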
Key Points Recap:
• escape: The escape character used to handle special characters inside quoted
fields.
These options allow you to customize how PySpark reads and interprets data,
making it flexible and adaptable to various file formats.
Sample data used in the examples above:
id,name,age
1,John,30
2,Alice,NULL

id|name|age
1|John|30
2|Alice|25
In PySpark, the withColumn function is used to add a new column or replace an existing column in a DataFrame.
1. Adding a New Column
You can use withColumn to add a new column to the DataFrame based on some transformation or calculation.
Example:
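A minimal sketch, assuming a DataFrame df that already has an age column:
from pyspark.sql.functions import col

# Create double_age by multiplying the existing age column by 2
df = df.withColumn("double_age", col("age") * 2)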
In this example, the double_age column is created by multiplying the existing age
column by 2.
Example:
You can apply PySpark's built-in SQL functions to modify columns. Functions like
lit, when, count, min, max, and many others can be used with withColumn.
Example:
from pyspark.sql.functions import when, col, lit

df = df.withColumn("status", when(col("age") > 18, lit("adult")).otherwise(lit("minor")))
In this example, the status column is created based on the condition applied to the
age column. If age is greater than 18, it assigns "adult", otherwise "minor".
You can apply a User Defined Function (UDF) to a column to perform custom
operations. UDFs allow you to use your own logic to manipulate data.
Example:
# Define a simple UDF that adds 'Hello ' before the name
def greet(name):
    return f"Hello {name}"
Here, a UDF named greet is applied to the name column to create a new greeting
column.
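A sketch of that registration and application step (assuming the greet function above and a DataFrame df with a name column):
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# Wrap the Python function as a Spark UDF and apply it to the name column
greet_udf = udf(greet, StringType())
df = df.withColumn("greeting", greet_udf(col("name")))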
5. Renaming a Column
You can rename a column using withColumn by creating a new column with the
desired name and dropping the old one.
Example:
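A sketch of the rename-and-drop pattern (old_name and new_name are placeholder column names):
from pyspark.sql.functions import col

# Copy the values into a column with the new name, then drop the old column
df = df.withColumn("new_name", col("old_name")).drop("old_name")
# df.withColumnRenamed("old_name", "new_name") achieves the same result more directly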
Here, the column old_name is replaced with new_name using withColumn and the
old column is dropped with the drop method.
6. Changing Data Type of a Column
You can use cast to change the data type of an existing column in the DataFrame.
Example:
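A sketch, assuming df has an integer age column:
from pyspark.sql.functions import col

# Cast the age column from integer to string
df = df.withColumn("age", col("age").cast("string"))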
This changes the age column from its original type (say, integer) to string.
8. Handling Missing Data
You can use withColumn to handle missing data (null values) by using functions like fillna, coalesce, or when combined with isNull.
Example:
df = df.withColumn("age", coalesce(col("age"), lit(30)))
Here, coalesce replaces the null values in the age column with 30.
9. Concatenating Columns
You can concatenate two or more columns into a new column using the concat
function.
Example:
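A sketch, assuming df has first_name and last_name columns (hypothetical names):
from pyspark.sql.functions import concat, lit, col

# Build full_name by joining two string columns with a space in between
df = df.withColumn("full_name", concat(col("first_name"), lit(" "), col("last_name")))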
10. String Functions
You can use string functions like substr, upper, lower, length, etc., to manipulate string columns.
Example:
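A sketch applying a couple of the built-in string functions to an assumed name column:
from pyspark.sql.functions import upper, length, col

df = df.withColumn("name_upper", upper(col("name")))    # upper-case version of name
df = df.withColumn("name_length", length(col("name")))  # number of characters in name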
11. Working with Dates and Timestamps
You can use PySpark's date and timestamp functions like to_date, current_date,
date_add, etc., to manipulate date columns.
Example:
# Add 5 days to the current date and create a new column 'date_plus_5'
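A sketch completing that example (df is an assumed existing DataFrame):
from pyspark.sql.functions import current_date, date_add

df = df.withColumn("date_plus_5", date_add(current_date(), 5))
df.show()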
• Using built-in functions (SQL functions like when, lit, concat, etc.)
• Renaming columns (by creating a new column and dropping the old one)
These examples showcase the flexibility and power of the withColumn method,
allowing you to perform a wide variety of operations on DataFrame columns in
PySpark.
1. What is Apache Spark?
Apache Spark is an open-source, distributed computing system that provides
an interface for programming entire clusters with implicit data parallelism and fault
tolerance. It is designed to process large-scale data efficiently.
o Spark Core: The foundational engine for large-scale parallel and distributed
data processing.
o Spark SQL: For structured data processing.
o Spark Streaming: For real-time data processing.
o MLlib: A library for scalable machine learning.
o GraphX: For graph and graph-parallel computation.
5. Which languages does Apache Spark support?
Apache Spark supports:
o Scala
o Python
o Java
o R
o SQL
7. What are the different methods to run Spark over Apache Hadoop?
Spark can run over Hadoop in standalone mode, on YARN, or via SIMR (Spark in MapReduce).
Spark's main structured data abstractions include:
• DataFrames
• Datasets
What is Write-Ahead Log (WAL) in Spark?
Write-Ahead Log is a fault-tolerance mechanism in which all received data is first
written to a log file on disk before processing, ensuring no data loss.
List commonly used Machine Learning Algorithms.
Common algorithms in Spark MLlib include:
• Linear Regression
• Logistic Regression
• Decision Trees
• Random Forests
• Gradient-Boosted Trees
• K-Means Clustering
What are the benefits of Spark lazy evaluation?
Benefits include:
• Reducing the number of passes over data.
• Optimizing the computation process.
• Decreasing execution time.
What is Speculative Execution in Apache Spark?
Speculative execution is a mechanism to detect slow-running tasks and run
duplicates on other nodes to speed up the process.
What is the role of the Spark Driver in Spark applications?
The Spark Driver is responsible for converting the user's code into tasks,
scheduling them on executors, and collecting the results.
How to identify that a given operation is a Transformation or Action in
your program?
Transformations return RDDs (e.g., map, filter), while actions return non-RDD values (e.g., collect, count).
Name the two types of shared variables available in Apache Spark.
• Broadcast Variables
• Accumulators
What are the common faults of developers while using Apache Spark?
Common faults include:
• Inefficient data partitioning.
• Excessive shuffling and data movement.
• Inappropriate use of transformations and actions.
• Not leveraging caching and persistence properly.
Commonly used compression formats in Spark include:
• Snappy
• Gzip
• Bzip2
• LZ4
• Zstandard (Zstd)
66.Explain foreach() operation in Apache Spark.
foreach() applies a function to each element in the RDD, typically used for
side effects like updating an external data store.
72.Explain createOrReplaceTempView() API.
createOrReplaceTempView() registers a DataFrame as a temporary table in Spark
SQL, allowing it to be queried using SQL.
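A short sketch (df and spark are assumed to exist, and df is assumed to have name and age columns):
# Register the DataFrame as a temporary SQL view and query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age >= 18").show()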
• textFile(): Reads a text file and creates an RDD of strings, each representing
a line.
• wholeTextFiles(): Reads entire files and creates an RDD of (filename,
content) pairs.
78.Explain Spark coalesce() operation.
coalesce() reduces the number of partitions in an RDD, useful for
minimizing shuffling when reducing the data size.
• leftOuterJoin(): Returns all key-value pairs from the left RDD and
matching pairs from the right, filling with null where no match is found.
• rightOuterJoin(): Returns all key-value pairs from the right RDD and
matching pairs
from the left, filling with null where no match is found.
82.Explain Spark join() operation.
join() returns an RDD with all pairs of elements with matching keys from
both RDDs.
1. Explain top() and takeOrdered() operations.
o top(): Returns the top n elements from an RDD in descending order.
o takeOrdered(): Returns the top n elements from an RDD in
ascending order.
90.Explain reduceByKey() Spark operation.
reduceByKey() applies a reducing function to the elements with the same
key, reducing them to a single element per key.
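A word-count style sketch (sc is an assumed SparkContext):
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
counts = pairs.reduceByKey(lambda x, y: x + y)   # one element per key
print(counts.collect())                          # e.g. [('a', 2), ('b', 1)]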
What is Spark SQL?
Spark SQL is a Spark module for structured data processing, providing a
DataFrame API and allowing SQL queries to be executed.
What is the Starvation scenario in Spark Streaming?
Starvation occurs when all tasks are waiting for resources that are occupied by other long-running tasks, leading to delays or deadlocks.
How do you parse data in XML? Which kind of class do you use with
Java to parse data?
To parse XML data in Java, you can use classes from the javax.xml.parsers
package, such as:
• DocumentBuilder: Used with the Document Object Model (DOM) for in-memory tree representation.
• SAXParser: Used with the Simple API for XML (SAX) for event-driven
parsing.
What are the roles and responsibilities of worker nodes in the Apache
Spark cluster? Is the Worker Node in Spark the same as the Slave Node?
• Worker Nodes: Execute tasks assigned by the Spark Driver, manage
executors, and store data in memory or disk as required.
• Slave Nodes: Worker nodes in Spark are commonly referred to as slave
nodes. Both terms are used interchangeably.
When reading from HDFS, Spark by default creates one partition per HDFS block. You can
also use the repartition() method to explicitly specify a different number of partitions.
On what basis can you differentiate RDD, DataFrame, and DataSet?
• RDD: Low-level, unstructured data; provides functional programming
APIs.
• DataFrame: Higher-level abstraction with schema; optimized for SQL
queries and transformations.
• Dataset: Combines features of RDDs and DataFrames; offers type safety and object-oriented programming.
Spark Intro:
1. Spark: In-memory processing engine.
2. Why Spark is fast: Fewer disk I/O reads and writes.
3. RDD: The basic data structure used to hold data in Spark.
4. When an RDD fails: The lineage graph is used to track which RDD failed so it can be recomputed.
5. Why RDDs are immutable: So they can be recomputed after a failure and lineage can track exactly which RDD failed.
6. Operations in Spark: Transformations and Actions.
7. Transformation: Changes data from one form to another; transformations are lazy.
8. Action: Triggers the processing of the transformations; actions are not lazy. Spark builds a DAG to remember the sequence of steps.
9. Broadcast Variables: Read-only data distributed to all executors. Similar to a map-side join in Hive.
10. Accumulators: A shared copy kept on the driver; executors can update it but not read it. Similar to counters in MapReduce.
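A brief sketch of both shared variables (sc and rdd are assumed to exist; the values are illustrative):
# Broadcast variable: read-only lookup data shipped once to every executor
country_lookup = sc.broadcast({"IN": "India", "US": "United States"})

# Accumulator: executors add to it, only the driver reads the final value
bad_records = sc.accumulator(0)

def check(line):
    if not line:
        bad_records.add(1)   # count empty records on the executors
    return line

rdd.map(check).count()
print(bad_records.value, country_lookup.value["IN"])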
11. MapReduce before YARN: Job Tracker (scheduling & monitoring) and Task Tracker (manages tasks on its node).
12. Limitations of MapReduce 1: Limited scalability (hard to add new nodes), resource under-utilization, and only MapReduce jobs could be run.
13. YARN: Resource Manager (scheduling), Application Master (monitoring & resource negotiation), Node Manager (manages tasks on its node).
14. Uberization: Tasks run on the Application Master itself if they are very small.
SPARK DATAFRAMES:
SPARK OPTIMIZATIONS
1. Spark optimization:
1. Cluster Configuration : To configure resources to the cluster so that
spark jobs can process well.
2. Code configuration: To apply optimization techniques at code level so that
processing will be fast.
3. Thin executors: Many executors, each with few resources (e.g., 1 executor with 2 CPU cores and 1 GB RAM). Multithreading is limited and too many copies of broadcast variables are needed.
4. Fat executors: Few executors, each with a large amount of resources (e.g., 1 executor with 16 CPU cores and 32 GB RAM). System performance drops and garbage collection takes more time.
5. Garbage collection: To remove unused objects from memory.
6. Off heap memory: Memory stored outside of executors/ jvm. It takes less
time to clean objects than garbage collector, used for java overheads (extra
memory which directly doesn’t add to performance but required by system
to carry out its operation)
7. Static allocation: Resources are fixed at first and will remain the same till the
job ends.
8. Dynamic Allocation: Resources are allocated dynamically based on the job
requirement and released during job stages if they are no longer required.
9. Edge node: Also called a gateway node; it is the machine a client uses to enter the Hadoop cluster and access the NameNode.
10.How to increase parallelism :
1. Salting : To increase no. of distinct keys so that work can be
distributed across many tasks which in turn increase parallelism.
2. Increase no. of shuffle partitions
3. Increase the resources of the cluster (more cpu cores)
11. Execution memory: Used for computations such as shuffle, sort, and join.
12. Storage memory: Used to store cached data.
13. User memory: Used to store the user's data structures, metadata, etc.
14. Reserved memory: Reserved for running the executors.
15. Kryo serializer: Used to store data in serialized form so that it occupies less space.
16. Broadcast join: Sends a copy of the small table to all executors; used when only one of the tables is big and the other is small enough to broadcast (see the sketch below).
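A sketch of a broadcast join (large_df and small_df are hypothetical DataFrames; the small one is shipped to every executor so the large table is not shuffled):
from pyspark.sql.functions import broadcast

# Hint Spark to broadcast the small lookup table to all executors
joined = large_df.join(broadcast(small_df), on="id", how="inner")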
17. Prefer coalesce() over repartition() when reducing the number of partitions, since coalesce() avoids a full shuffle.
18. Join optimizations:
1. To avoid or minimize shuffling of data
2. To increase parallelism
1. How to avoid/minimize shuffling?
1. Filter and aggregate data before shuffling
2. Use optimization methods which require less shuffling
( coalesce() )
19. How to increase parallelism?
1. Min (total cpu cores, total shuffle partitions, total distinct keys)
2. Use salting to increase no. of distinct keys
3. Increase default no. of shuffle partitions
4. Increase resources to inc total cpu cores
20. Skew partitions: Partitions in which data is unevenly distributed. Bucketing, partitioning, and salting can be used to handle them (see the salting sketch below).
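A minimal sketch of salting a skewed key before an aggregation (df, key, and value are hypothetical; the salt range of 10 is an assumption):
from pyspark.sql.functions import col, rand, floor, concat_ws
from pyspark.sql.functions import sum as spark_sum

# Append a random salt (0-9) to the skewed key so its rows spread across more tasks
salted = (
    df.withColumn("salt", floor(rand() * 10).cast("string"))
      .withColumn("salted_key", concat_ws("_", col("key").cast("string"), col("salt")))
)

# Aggregate on the salted key first, then roll the partial results up to the original key
partial = salted.groupBy("key", "salted_key").agg(spark_sum("value").alias("partial_sum"))
result = partial.groupBy("key").agg(spark_sum("partial_sum").alias("total"))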
21. Sort aggregate: Data is sorted by key and then aggregated; takes more processing time.
22. Hash aggregate: A hash table is created and rows with the same key are combined under the same hash entry; takes less processing time.
23. Stages of the execution plan:
1. Parsed logical plan (unresolved logical plan): Catches syntax errors.
2. Analyzed logical plan (resolved logical plan): Checks column and table names against the catalog.
3. Optimized logical plan (Catalyst optimization): Optimizations applied based on built-in rules.
4. Physical plan: The actual execution plan is selected based on a cost model.
5. Conversion into RDDs: The plan is converted into RDDs and sent to the executors for processing.
**Note:
1 HDFS block = 1 RDD partition = 128 MB
1 HDFS block in local mode = 1 RDD partition in a local Spark cluster = 32 MB
1 RDD can have n partitions in it
1 cluster = 1 machine
N cores = N blocks can run in parallel on each cluster/machine
Number of stages = number of wide transformations + 1 (N stages correspond to N - 1 wide transformations)
N tasks in each stage = N partitions in that stage for that RDD/DataFrame
3. Calculate the moving average over a window of 3 rows.
Scenario: For a stock price dataset, calculate a moving average over the last 3 days.
from pyspark.sql import Window
from pyspark.sql.functions import avg

windowSpec = Window.orderBy("date").rowsBetween(-2, 0)
df_with_moving_avg = df.withColumn("moving_avg", avg("price").over(windowSpec))
df_with_moving_avg.show()
7. Join two DataFrames on a specific condition.
Scenario: You have two DataFrames: one for customer data and one for orders. Join these DataFrames on the customer ID.
df_joined = df_customers.join(df_orders, df_customers.customer_id == df_orders.customer_id, "inner")
df_joined.show()
10.Find the top N records from a DataFrame based on a column. Scenario:
You need to find the top 5 highest-selling products.
df.orderBy(col("sales").desc()).limit(5).show()
Parse a JSON string column into separate fields:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([StructField("name", StringType()), StructField("age", IntegerType())])
df = df.withColumn("json_data", from_json(col("json_column"), schema))
df.select("json_data.name", "json_data.age").show()
19.Write a PySpark code to group data based on multiple columns and
calculate aggregate functions. Scenario: Group data by "product" and
"region" and calculate the average sales for each group.
df.groupBy("product", "region").agg({"sales": "avg"}).show()
21.Write PySpark code to read a CSV file and infer its schema.
Scenario: You need to load a CSV file into a DataFrame, ensuring the schema is inferred.
df = spark.read.option("header", "true").option("inferSchema", "true").csv("path_to_csv")
df.show()
22.Write PySpark code to merge multiple small files into a single file.
Scenario: You have multiple small files in HDFS, and you want to
consolidate them into one large file.
df.coalesce(1).write.mode("overwrite").csv("output_path")
Calculate a running (cumulative) sum of sales ordered by date:
from pyspark.sql import Window
from pyspark.sql.functions import sum as spark_sum

windowSpec = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, 0)
df_with_cumsum = df.withColumn("cumulative_sum", spark_sum("sales").over(windowSpec))
df_with_cumsum.show()
Hadoop vs. Spark Architecture
• Storage: Hadoop uses HDFS for storage; Spark uses in-memory processing for speed.
• Processing: Hadoop's MapReduce is disk-based; Spark's in-memory processing improves performance.
• Integration: Hadoop runs independently or with its ecosystem; Spark can run on top of Hadoop and is more flexible.
• Complexity: Hadoop has a more complex setup and deployment; Spark is simpler to deploy and configure.
• Performance: Hadoop is slower for iterative tasks due to disk I/O; Spark gives better performance for iterative tasks.
RDD vs. DataFrame vs. Dataset
• API Level: RDD is low-level with more control; DataFrame is high-level and optimized with Catalyst; Dataset is high-level and type-safe.
• Schema: RDD has no schema (unstructured); DataFrame uses a schema for structured data; Dataset is strongly typed with compile-time type safety.
• Optimization: RDD has no built-in optimization; DataFrame is optimized using Catalyst; Dataset is optimized using Catalyst with type safety.
• Type Safety: RDD has no type safety; DataFrame has no compile-time type safety; Dataset provides compile-time type safety.
• Performance: RDD is less optimized for performance; DataFrame performs better due to optimizations; Dataset combines type safety with optimization.
ACTION VS TRANSFORMATION
• Execution: An action triggers execution of the Spark job; a transformation builds up a logical plan of data operations.
• Return Type: An action returns results or output; a transformation returns a new RDD/DataFrame.
• Evaluation: Actions are eagerly evaluated and execute immediately; transformations are lazily evaluated and run only when an action is triggered.
• Computation: Actions involve actual computation (e.g., collect()); transformations only define data operations (e.g., map()).
• Performance: Actions cause data processing and affect performance; transformations do not affect performance until an action is called.
GroupByKey vs ReduceByKey
• Operation: groupByKey groups all values by key; reduceByKey aggregates values with the same key.
• Efficiency: groupByKey can lead to heavy shuffling; reduceByKey is more efficient due to partial (map-side) aggregation.
• Data Movement: groupByKey requires shuffling of all values; reduceByKey minimizes data movement through local aggregation.
• Use Case: groupByKey is useful for simple grouping; reduceByKey is preferred for aggregations and reductions.
• Performance: groupByKey is less efficient with large datasets; reduceByKey gives better performance for large datasets.
Cache vs Persist
• Storage Level: cache defaults to MEMORY_ONLY; persist can use various storage levels (e.g., MEMORY_AND_DISK).
• Flexibility: cache is simplified with a default storage level; persist offers more options for storage levels.
• Use Case: cache suits simple caching scenarios; persist suits complex scenarios requiring different storage levels.
• Implementation: cache is easier to use (shorthand for persist with MEMORY_ONLY); persist is more flexible and allows custom storage options.
• Performance: cache is suitable when memory suffices; persist is more efficient for larger datasets and limited memory.
Narrow vs. Wide Transformations
• Examples: narrow – map(), filter(); wide – groupByKey(), join().
• Complexity: narrow transformations are simpler and faster; wide transformations are more complex and slower due to data movement.

Collect vs Take
• Output: collect retrieves all data from the RDD/DataFrame; take retrieves a specified number of elements.
• Memory Usage: collect can be expensive and use a lot of memory; take is more memory-efficient.
• Use Case: collect is used when you need the entire dataset; take is useful for sampling or debugging.
• Performance: collect can cause performance issues with large data; take is faster and more controlled.
• Action Type: collect triggers full data retrieval; take triggers partial data retrieval.
Spark SQL vs DataFrame API
• Interface: Spark SQL executes SQL queries; the DataFrame API provides a programmatic interface.
• Syntax: Spark SQL uses SQL-like syntax; the DataFrame API uses function-based syntax.
• Optimization: Both are optimized with Catalyst.
• Use Case: Spark SQL is preferred for complex queries and legacy SQL code; the DataFrame API is preferred for programmatic data manipulations.
• Integration: Spark SQL can integrate with Hive and other SQL databases; the DataFrame API provides a unified interface for different data sources.
Shuffle vs MapReduce
• Operation: Shuffle is data reorganization across partitions; MapReduce is a data processing model for distributed computing.
• Efficiency: Shuffle can be costly due to data movement; MapReduce is designed for batch processing with high I/O.
• Performance: Shuffle affects performance based on the amount of data movement; MapReduce is optimized for large-scale data processing but less efficient for iterative tasks.
• Use Case: Shuffle is used in Spark for data redistribution; MapReduce is used in Hadoop for data processing tasks.
• Implementation: Shuffle is integrated into Spark operations; MapReduce is a core component of the Hadoop ecosystem.
Executor vs Driver
• Role: Executors execute tasks and process data; the driver coordinates and manages the Spark application.
• Memory: Executor memory is allocated per executor for data processing; driver memory is used for managing application execution.
• Lifecycle: Executors exist throughout the application's execution; the driver starts and stops the Spark application.
• Tasks: Executors run the tasks assigned by the driver; the driver schedules and coordinates tasks and jobs.
• Parallelism: Multiple executors run in parallel; a single driver coordinates the executors.
ReduceByKey vs AggregateByKey
• Operation: reduceByKey combines values with the same key using a single reducing function; aggregateByKey performs custom aggregation and combination operations.
• Efficiency: reduceByKey is more efficient for simple aggregations; aggregateByKey is flexible for complex aggregation scenarios.
• Shuffling: reduceByKey involves shuffling but can be optimized; aggregateByKey can be more complex due to custom aggregation.
• Use Case: reduceByKey is suitable for straightforward aggregations; aggregateByKey is ideal for advanced and custom aggregations.
• Performance: reduceByKey is generally faster for simple operations; aggregateByKey's performance varies with complexity.
SQLContext vs HiveContext vs SparkSession
• Configuration: SQLContext is older and less flexible to configure; HiveContext requires Hive setup and configuration; SparkSession is modern and flexible and manages configurations.
• Capabilities: SQLContext is limited to SQL queries; HiveContext extends SQL capabilities with Hive integration; SparkSession gives comprehensive access to all Spark features.
Spark Context vs Spark Session
• Purpose: SparkContext is the entry point for core Spark functionality; SparkSession is the unified entry point for all Spark functionalities.
• Lifecycle: SparkContext is created before Spark jobs start; SparkSession manages the Spark application lifecycle.
• Functionality: SparkContext provides access to RDDs and basic Spark functionality; SparkSession provides access to the RDD, DataFrame, SQL, and Streaming APIs.
• Configuration: SparkContext configuration is less flexible; SparkSession is more flexible and easier to configure.
• Usage: SparkContext is older and used for legacy applications; SparkSession is modern and recommended for new applications.
Structured Streaming vs Spark Streaming
• Processing: Structured Streaming supports micro-batch and continuous processing; Spark Streaming uses micro-batch processing.
• API: Structured Streaming offers a SQL-based API with DataFrame/Dataset support; Spark Streaming uses an RDD-based API.
• Complexity: Structured Streaming is simplified and high-level; Spark Streaming is more complex and low-level.
• Consistency: Structured Streaming provides stronger consistency guarantees; Spark Streaming can be less consistent due to micro-batches.
• Performance: Structured Streaming performs better thanks to built-in optimizations; Spark Streaming can be slower for complex queries.
Partitioning vs Bucketing
• Purpose: Partitioning divides data into multiple partitions based on a key; bucketing divides data into buckets based on a hash function.
• Usage: Partitioning optimizes queries by reducing the data scanned; bucketing improves join performance and maintains sorted data.
• Shuffling: Partitioning reduces shuffling by placing related data together; bucketing reduces shuffling during joins and aggregations.
• Data Layout: Partitioned data is physically separated based on the partition key; bucketed data is organized into a fixed number of buckets.
• Performance: Partitioning improves performance for queries involving partition keys; bucketing enhances performance for join operations.
DBT
1. What is dbt?
DBT (Data Build Tool) is a command-line tool that enables data analysts and
engineers to transform raw data into meaningful insights through SQL. It is
primarily used to manage the transformation layer in a modern data stack.
DBT allows for building and running SQL-based data models, testing data
quality, and documenting data transformations in a standardized and
maintainable manner.
2. Why dbt?
DBT simplifies the ETL (Extract, Transform, Load) process by focusing on the
"Transform" step, allowing users to:
• Write SQL queries to transform data in the data warehouse.
• Easily manage and organize SQL code.
• Automate testing and documentation of transformations.
• Version control through integration with Git.
• Use software engineering best practices for managing data transformation
workflows.
This makes dbt very useful for teams managing complex data transformations at
scale.
3. DBT Products: DBT offers several products for different use cases:
• dbt Core: The open-source version of DBT that handles data
transformation.
• dbt Cloud: A cloud-based version with added features for collaboration,
scheduling, and deployment, often used in enterprise environments.
• dbt Labs: The company behind the dbt product, providing solutions for data
transformation, analytics, and support.
4. Key Concepts with Examples:
• Models: SQL files that define transformations. For example, a model could aggregate sales data by region.
select
    region,
    sum(amount) as total_sales
from raw_sales
group by region;
• Run: A command to execute the dbt models and run the transformations.
• Sources: Represent the raw data that dbt transforms. E.g., a raw_sales table.
• Tests: dbt allows you to write tests to ensure data quality, e.g., checking if there are any null values in a column.
version: 2
models:
  - name: my_model
    columns:
      - name: id
        tests:
          - not_null
• Docs: Documentation that helps describe how each model works, the lineage
of data, etc.
5. Uses of dbt:
• Data Transformation: Transforming raw data into analytics-ready
datasets.
• Data Quality Assurance: Ensuring the correctness of data using tests.
• Version Control: Managing data models using Git integration.
• Automated Workflows: Scheduling and running transformations
automatically.
• Documentation: Creating and maintaining data documentation for
stakeholders.
6. How Many Data Warehouses Present? There are many data warehouses
available, some of the most common include:
• Amazon Redshift
• Google BigQuery
• Snowflake
• Azure Synapse Analytics
• Teradata
• Databricks
7. Life Cycle of dbt: The dbt lifecycle involves the following steps:
1. Development: Writing models and tests in SQL.
2. Version Control: Pushing changes to a Git repository.
3. Execution: Running the dbt commands to execute transformations.
4. Testing: Running automated tests to ensure data integrity.
5. Documentation: Generating and sharing documentation on data models.
6. Deployment: Scheduling the execution of models in a production
environment.
8. Key Features with Examples:
• Modularity: dbt enables you to organize SQL code into reusable models.
Example: Creating modular models for different parts of your data pipeline
(e.g., one for sales, one for marketing).
• Version Control: Allows you to version your models using Git, ensuring
collaboration and traceability.
• Automated Testing: Testing data quality through built-in test functions
(e.g., checking if a column contains NULL values).
• Data Documentation: dbt automatically generates data documentation
based on your models.
9. Versions of dbt: The main versions of dbt are:
• dbt Core: The open-source version.
• dbt Cloud: The enterprise version, which offers features like scheduling,
collaboration, and deployment in the cloud.
• dbt CLI: A command-line interface version of dbt, primarily used for
running and testing models.
10. Types of dbt:
• dbt Core (open-source)
• dbt Cloud (paid, cloud-based service)
11. DBT Cloud Architecture: DBT Cloud architecture includes:
• Cloud Scheduler: Schedules the execution of dbt jobs.
• Data Warehouse Connection: DBT Cloud connects to a cloud data
warehouse like Snowflake, Redshift, or BigQuery.
• User Interface: Provides a web-based interface to manage models, logs, and
visualizations.
• Version Control Integration: Git integration for version control.
• Logging and Monitoring: Tracks job statuses, errors, and job history for
analysis.
12. DBT Commands Full with Examples: Here are common dbt commands:
• dbt init <project_name>: Initializes a new dbt project. Example: dbt init
my_project
• dbt run: Runs all models defined in the project. Example: dbt run
• dbt test: Runs tests defined in the project. Example: dbt test
• dbt docs generate: Generates the documentation for the models. Example:
dbt docs generate
• dbt seed: Loads static data from CSV files into the data warehouse.
Example: dbt seed
• dbt snapshot: Captures historical data changes over time. Example: dbt
snapshot
• dbt debug: Diagnoses any issues in the dbt setup or configuration. Example:
dbt debug
• dbt run --models <model_name>: Runs a specific model. Example: dbt run
--models sales_by_region
Additional Insights:
• Integration with CI/CD: dbt can be integrated with CI/CD pipelines to
automate testing and deployment.
• Custom Macros: You can define your own reusable SQL snippets using dbt
macros.
• Collaboration: DBT Cloud enhances collaboration with team members by
offering shared environments, documentation, and version control features.
DBT is a powerful tool for modern data teams, enabling better workflows, data
governance, and collaboration.
DBT (Data Build Tool): A Comprehensive Overview from Beginner to
Advanced
DBT (Data Build Tool) is an open-source tool that enables data analysts and
engineers to transform raw data in a structured and organized manner. It helps in
the transformation (T) step of the ETL (Extract, Transform, Load) pipeline,
focusing on transforming data in a data warehouse through SQL. DBT empowers
data teams to apply software engineering best practices to data transformation,
making it easy to manage and automate complex data pipelines.
• dbt run: This command is used to execute the transformations defined in your models.
dbt run
• dbt init: Initializes a new DBT project in your directory.
dbt init my_project
• Macros: Macros allow you to write reusable SQL logic (i.e., a custom SQL function) that can be used in multiple models or queries. For example, a macro to calculate the total sales might look like:
{% macro calculate_total_sales() %}
    sum(amount)
{% endmacro %}
• Snapshots: Snapshots in DBT allow you to track changes in your data over
time, which is useful for slowly changing dimensions (SCD). For example,
if a product's price changes, DBT can keep a historical record of those price
changes.
• Materializations: DBT allows you to control how models are stored in the
database (e.g., as tables, views, or incremental models). The materialized
parameter determines this.
o view: Creates a view (i.e., a virtual table) for each model.
o table: Creates a physical table in the data warehouse.
o incremental: Only inserts or updates the rows that have changed.
• Data Documentation: DBT makes it easy to create a data dictionary and
document your models, tests, and sources. This is essential for transparency
and collaboration within teams.
dbt docs generate
dbt docs serve
This generates a website where you can view all of your data models and their
descriptions.
• CI/CD (Continuous Integration/Continuous Deployment): You can
integrate DBT with CI/CD pipelines to automate testing, deployment, and
monitoring of your transformations. For example, running tests on every pull
request before merging code into the main branch.
• Scheduling and Orchestration: DBT Cloud or third-party orchestration
tools (like Airflow) allow you to schedule your transformations to run on a
regular basis, such as daily or hourly.
4. DBT Cloud vs. DBT Core
• DBT Core: This is the open-source version of DBT, which you can run on
your own infrastructure. It provides all the core features of DBT, but you
will need to set up your own scheduling, orchestration, and monitoring.
• DBT Cloud: A hosted service provided by DBT Labs that provides
additional features like:
o Web-based interface: An intuitive dashboard for managing models, jobs,
and documentation.
o Collaboration: Multiple team members can work together in the same
environment with role-based access control.
o Scheduling and Monitoring: Easily schedule and monitor the status of your
dbt jobs.
Conclusion
DBT is a powerful tool for transforming data in a modern data stack. It offers a
streamlined way to write, test, and document SQL transformations, following
software engineering principles to improve collaboration, scalability, and
maintainability. Whether you're just starting with data transformations or you're
managing large-scale projects, DBT provides the tools to make the process more
efficient, standardized, and organized.
By mastering DBT, from beginner to advanced levels, you'll be able to handle
complex data transformation workflows with ease, while ensuring data quality,
version control, and effective team collaboration.
a. Install DBT Core Locally:
DBT Core is the open-source version and can be installed with pip (Python's
package installer). Here's how to install it:
1. Install Python: DBT requires Python 3.7 or later. You can download
Python from python.org.
2. Set up a virtual environment (optional but recommended):
python -m venv dbt_env
source dbt_env/bin/activate # For Mac/Linux
dbt_env\Scripts\activate # For Windows
For Redshift:
pip install dbt-redshift
For BigQuery:
pip install dbt-bigquery
This creates a new folder called my_project with the default DBT project
structure.
2. Navigate into the project:
cd my_project
Ensure you have the correct credentials and data warehouse information for your
setup.
5. Write Your First DBT Model
Now that you have DBT set up and connected to your data warehouse, you can
start writing your first models.
1. Navigate to the models directory: In your DBT project folder, find the
models directory. This is where all your SQL transformation files will go.
2. Create a simple SQL model: Create a file named my_first_model.sql inside
the models folder and add a SQL query:
-- models/my_first_model.sql
select
id,
name,
amount
from raw_sales
This model will select data from the raw_sales table and transform it.
3. Run the model: Now, run the model to execute the SQL query:
dbt run
This command will execute all models in the project, and you'll see the results of
your SQL query stored in your data warehouse.
6. Test and Document Your Models
DBT provides ways to test your data and document your models.
a. Adding Tests
You can add simple data tests to ensure that your models meet certain conditions
(e.g., no NULL values). To test the id column in my_first_model.sql, for example,
add a test in the schema.yml file:
version: 2
models:
  - name: my_first_model
    columns:
      - name: id
        tests:
          - not_null
b. Documenting Models
DBT allows you to generate documentation for your models. To do this, use the
docs feature:
1. Create a docs file to describe your models, like this:
version: 2
models:
  - name: my_first_model
    description: "This model aggregates the raw sales data by region."
2. Generate and serve the documentation:
dbt docs generate
dbt docs serve
This will start a local web server where you can view the documentation.
7. Running DBT in a Production Environment
Once you're comfortable with running DBT locally, you can start to automate your
workflows and use DBT in a production environment. You can use DBT Cloud
for a managed solution or set up a cron job to schedule your DBT runs on your
own infrastructure.
1. DBT Cloud: DBT Cloud provides a fully-managed service with scheduling,
monitoring, and collaboration features. You can sign up for a free account
on DBT Cloud, connect it to your data warehouse, and start using it.
2. Scheduling with Cron: If you prefer to run DBT locally, you can set up
cron jobs to run DBT at regular intervals (e.g., daily, weekly).
Example of a cron job:
0 3 * * * cd /path/to/my_project && dbt run
Architecture of dbt
DBT Architecture: A Detailed Overview
DBT (Data Build Tool) follows a modular and flexible architecture designed to
manage, transform, and test data efficiently within modern data stacks. It
emphasizes collaboration, version control, and testing to ensure high-quality data
transformation pipelines.
The architecture of DBT consists of several components that work together to
provide an end-to-end solution for data transformation, documentation, and testing.
Here's a breakdown of the key components and how they interact within DBT's
ecosystem.
1. Core Components of DBT Architecture
1.1. DBT Project
A DBT Project is the foundation of the DBT architecture. It consists of directories
and files that define how data will be transformed in the data warehouse.
• Models Directory: Contains SQL files where you define the data
transformation logic. Each SQL file in the models directory is a
transformation that DBT will execute. These transformations might include
creating tables, views, or performing aggregations, etc.
• Target Directory: This is where DBT places the output of your runs,
including any compiled SQL files.
• Macros: Reusable pieces of SQL code that can be used across multiple
models.
• Seeds: Static CSV files that you can load into your data warehouse.
• Tests: SQL-based tests to check data integrity, for example, checking for
nulls or uniqueness.
• Documentation: Markdown-based files or schema files to document your
models and provide explanations about your data pipeline.
1.2. DBT CLI (Command Line Interface)
The DBT CLI is a command-line tool that interacts with the project and runs the
transformations. It is the primary way to execute DBT commands, which include:
• dbt run – Runs the models (transforms) you’ve defined.
• dbt test – Runs data quality tests to validate your data.
• dbt docs generate – Generates the project’s documentation.
The CLI interacts with both the project files and the data warehouse where
transformations are executed.
1.3. Data Warehouse
DBT is primarily used for transforming data inside a data warehouse. It connects
to cloud data warehouses like:
• Snowflake
• Google BigQuery
• Amazon Redshift
• Databricks
DBT connects to these data warehouses through a connection adapter that defines
how DBT interacts with the specific data warehouse.
1.4. Adapter Layer
The Adapter Layer is responsible for providing DBT's connectivity to different
data warehouses. It is a crucial part of the architecture, as it ensures that DBT can
communicate with various cloud-based databases. This adapter layer provides the
underlying connection logic to:
• Authenticate and connect to the data warehouse.
• Execute SQL commands.
• Fetch results.
DBT includes specific adapters for different data warehouses:
• dbt-snowflake for Snowflake
• dbt-bigquery for Google BigQuery
• dbt-redshift for Amazon Redshift
• dbt-databricks for Databricks
1.5. DBT Cloud (Optional)
While DBT Core is the open-source command-line version of DBT, DBT Cloud
is the fully managed, cloud-based version. DBT Cloud adds additional features for
enterprise users:
• Web Interface: A user-friendly interface for managing your DBT projects,
setting up jobs, monitoring runs, and viewing logs.
• Scheduling: Schedule DBT runs (e.g., daily, weekly) to automate your
transformation pipelines.
• Collaboration: Provides tools for version control (Git integration), team
collaboration, and deployment management.
• Integrated Logging & Monitoring: Cloud provides tools for monitoring
your DBT runs and getting detailed logs in case of errors.
1.6. Version Control (Git Integration)
DBT projects are designed to integrate with Git, allowing for version control of all
your data models, transformations, and configurations. This helps teams
collaborate effectively, track changes, and manage codebases.
GitHub/GitLab integration is a key feature for maintaining version-controlled
DBT projects, where each change to your models and transformations is tracked,
making collaboration seamless.
2. DBT Workflow
Here's a breakdown of the DBT workflow from beginning to end:
1. Initialize the DBT Project:
a. You create a new project using dbt init command. This generates a project
structure with directories for models, seeds, macros, and tests.
2. Write SQL Models:
a. You define your transformations using SQL inside the models directory. For
example, a model could aggregate sales data or filter out invalid records.
3. Run Models:
a. You execute your transformations by running the dbt run command. DBT
compiles your SQL files, executes them on the data warehouse, and
materializes the results in tables or views (depending on the configuration).
4. Test the Data:
a. Data tests (such as checking for null values, uniqueness, etc.) can be added
to models using the dbt test command. This ensures data integrity and
validates that the transformations are correct.
5. Document Models:
a. You can document your models using the schema.yml file and generate
HTML-based documentation using the dbt docs generate command. DBT
automatically associates your models with their descriptions and other
metadata.
6. Schedule Jobs:
a. If using DBT Cloud, you can schedule jobs to run transformations
automatically at specific intervals. Alternatively, in DBT Core, you can use
tools like cron jobs or orchestration platforms (e.g., Airflow) to schedule
DBT runs.
7. Collaboration and Version Control:
a. Developers and data analysts work together on a Git-based repository, where
they can pull, push, and merge changes to models and configurations.
3. DBT Components in Action: Architecture Flow
1. User Interaction:
a. Data engineers and analysts write SQL queries (models) to define
transformations, data sources, tests, and documentation in the DBT project.
2. DBT CLI:
a. When a user runs a command like dbt run or dbt test, the CLI compiles the
models and interacts with the Adapter Layer to execute the SQL
transformations in the data warehouse.
3. Data Warehouse:
a. The data warehouse (e.g., Snowflake, BigQuery, Redshift) performs the
actual transformations and stores the results, such as tables or views, based
on the models defined.
4. DBT Cloud (Optional):
a. In a managed environment, DBT Cloud offers scheduling, monitoring,
logging, and collaboration features. It runs jobs automatically, handles user
permissions, and provides a user-friendly interface for managing the
project.
4. DBT Architecture Diagram
Here's a high-level view of how DBT components interact:
+----------------------------+
|  DBT Models                |
|  (SQL transformations)     |
+----------------------------+
              |
              v
+----------------------------+        +----------------------------+
|  DBT CLI / DBT Cloud       | <----> |  Version Control (Git)     |
|  (compiles & runs models)  |        |  (e.g., GitHub, GitLab)    |
+----------------------------+        +----------------------------+
              |
              v
+----------------------------+
|  DBT Adapter Layer         |
|  (connects to the DB)      |
+----------------------------+
              |
              v
+----------------------------+
|  Data Warehouse            |
|  (tables / views)          |
+----------------------------+
In this diagram:
• DBT Models (SQL transformations) are written by data engineers/analysts.
• The DBT CLI or DBT Cloud interacts with the Adapter Layer to send the
SQL transformations to the data warehouse.
• Version Control (Git) tracks changes and enables collaboration.
Conclusion
DBT’s architecture is designed for simplicity, modularity, and scalability in
managing data transformation workflows. The core components of DBT—the
DBT Project, CLI, Adapter Layer, Data Warehouse, and DBT Cloud—work
together to facilitate efficient data transformations. With DBT, data teams can
streamline their ETL processes, ensure data quality through testing, and collaborate
effectively using version control.
MICROSOFT FABRIC
• https://signup.live.com
Step 3.
Join the Microsoft 365 Developer Program.
Step 4.
Go to the Fabric page and start the free trial:
https://app.fabric.microsoft.com
What is Microsoft Fabric?
Microsoft Fabric is a comprehensive data platform introduced by Microsoft in
2023. It is designed to provide a unified and seamless experience for data
engineering, data science, data analytics, and business intelligence (BI) tasks,
allowing organizations to manage and analyze large-scale data in real-time.
Microsoft Fabric integrates various data and analytics tools under a single unified
architecture to handle the end-to-end data lifecycle, from ingestion and storage to
advanced analytics and reporting.
Fabric aims to simplify the complexity of working with data by providing an all-in-
one platform that unifies multiple data services into a cohesive ecosystem. It
combines Microsoft’s data products, including Azure Synapse Analytics, Power
BI, Azure Data Factory, and Data Lake Storage, with new features, into one
cohesive offering.
4. Data Lakes: Integration with Azure Data Lake Storage allows
organizations to store and access large datasets in their raw form, which can
be processed and analyzed as needed.
5. Real-Time Analytics: It provides real-time analytics and stream processing,
enabling businesses to make decisions based on live data.
6. AI and Machine Learning Integration: Microsoft Fabric integrates
advanced machine learning and AI capabilities, making it easier to build,
deploy, and manage AI models within the platform.
7. Cloud-Native Architecture: Built on Microsoft Azure’s cloud platform,
Fabric offers scalability, flexibility, and enterprise-grade security, making it
suitable for large and complex data environments.
6. Stream Analytics: Capabilities for real-time data analytics and stream
processing, enabling users to process and analyze data as it flows into the
system.
7. Unified Data Governance: Integrated data governance features for
managing and securing data across the platform, ensuring compliance and
protecting sensitive information.
8. Data Integration: Integration with Azure Data Factory for data movement
and orchestration, as well as integration with external data sources (e.g.,
APIs, databases, external services).
Microsoft Fabric vs. Azure Synapse Analytics
While Azure Synapse Analytics was Microsoft's previous unified data platform
for analytics, Microsoft Fabric is an evolved and expanded offering that goes
beyond the capabilities of Synapse.
• Azure Synapse Analytics: Focuses primarily on data warehousing, big data
analytics, and integration with Power BI.
• Microsoft Fabric: Includes all the features of Synapse but adds additional
tools for data engineering, real-time analytics, machine learning, data
lakes, and more, providing a more comprehensive solution for managing the
full data lifecycle.
Microsoft Fabric introduces several key concepts that define how resources are
allocated, how users interact with the platform, and how data is organized and
accessed.
Let’s explore these terms in the context of Microsoft Fabric, specifically from a
data engineering perspective:
1. Capacity
Capacity in Microsoft Fabric refers to the amount of computational and storage
resources available within the platform to execute data tasks. It includes how
resources are provisioned for different workloads (such as data transformation,
machine learning, or real-time analytics) and is critical for performance
optimization.
• Types of Capacity: There are different types of capacity in Microsoft
Fabric, including dedicated capacity (where resources are allocated
specifically to your workspace) and shared capacity (where resources are
shared among multiple users or workspaces).
• Scaling: Data engineers need to understand how to scale capacity based on
workloads. For example, more capacity is required during data-heavy
transformations or large-scale model training.
• Performance Monitoring: Understanding the capacity limits is essential for
optimizing the performance of data pipelines and queries. Improperly
provisioned resources can lead to slow data processing or system
bottlenecks.
Key Takeaway: As a data engineer, you must manage capacity to ensure that the
required resources are available for processing large datasets efficiently. If your
workloads become too heavy, you might need to upgrade or scale out the capacity.
2. Experience
In the context of Microsoft Fabric, Experience refers to how users interact with
the platform based on their roles and tasks. This term often relates to how data
engineers, scientists, and analysts use different features of Fabric to interact with
data.
• End-User Experience: This involves the user interface, including the
workspace and tools available (e.g., notebooks, pipelines, dashboards). A
data engineer’s experience might be centered around working with data
engineering pipelines, while others (like BI analysts) focus more on
reporting.
• Productivity Tools: Microsoft Fabric provides multiple experiences
depending on the specific tools you need, such as the Data Engineering
Experience (for building data pipelines) and the Data Science Experience
(for running machine learning models).
• User Personalization: Depending on the tools you use, the experience can
be customized. A data engineer may spend most of their time using data
pipelines, stream analytics, and monitoring data workflows.
Key Takeaway: As a data engineer, the experience is about how you interact with
tools in the workspace, so understanding the interface and workflow optimization
is essential for maximizing productivity.
3. Item
In Microsoft Fabric, Item generally refers to an individual object or resource that
is used within the system. This could be any entity, such as a dataset, pipeline,
model, or report.
• Data Items: This includes all entities that store or process data, such as
tables, views, and datasets within your workspace.
• Pipeline Items: When creating data pipelines, individual tasks or
transformations can be considered "items" that contribute to the overall
process.
• Model Items: In data science or machine learning workflows, an item could
refer to a machine learning model or its components.
Key Takeaway: A data engineer must understand that items in Microsoft Fabric
can represent the building blocks of a data pipeline or transformation process.
Managing these items efficiently is essential for building scalable and maintainable
systems.
4. Workspace
A Workspace in Microsoft Fabric is a collaborative environment where data
engineers, analysts, and data scientists can work on data projects together. It
provides a space for storing, managing, and processing data, as well as for
collaboration across teams.
• Data Engineering Workspaces: These are specifically set up for teams
focused on data ingestion, transformation, and orchestration. Workspaces
house data pipelines, datasets, and scripts used for ETL processes.
• Collaborative Environment: Teams can collaborate on data models,
transformations, and machine learning projects in the same workspace. For
example, a data engineer might create data pipelines in a workspace, while a
data scientist might develop models on the same data.
• Workspace Resources: Within a workspace, you can configure data
models, notebooks, compute resources, and schedules. A workspace also
contains data sets, pipelines, and jobs.
Key Takeaway: As a data engineer, workspaces are where you spend much of
your time. You need to understand how to organize and optimize data processing
tasks within the workspace, collaborate with other roles, and allocate resources
effectively.
5. Tenant
A Tenant in Microsoft Fabric refers to a logical container for all the resources in
your organization. It represents the overarching instance of Microsoft Fabric and is
associated with your organization’s subscription or Azure Active Directory
(AAD).
• Tenant Isolation: Each tenant has isolated resources, meaning your data and
resources are segregated from other organizations or tenants. This provides
security and data privacy.
• Role-Based Access Control (RBAC): Tenants are important for managing
user access. Users within a tenant can be assigned roles that govern their
ability to view or modify data, run tasks, or interact with resources.
• Cross-Tenant Collaboration: In some scenarios, data from different tenants
can be shared or accessed via external data connections or APIs, enabling
cross-tenant collaboration.
Key Takeaway: Understanding the tenant model is essential for managing access,
security, and data governance across your organization. As a data engineer, you’ll
be concerned with managing resources within a tenant, setting up data access
permissions, and ensuring that security policies are applied correctly.
Additional Important Concepts in Fabric for Data Engineers
1. Lakehouse Architecture: As a data engineer, you’ll need to understand
how Lakehouse architecture integrates structured and unstructured data in a
unified storage layer. It provides the flexibility of data lakes while
supporting efficient analytics like a data warehouse.
2. Data Pipelines: You will create, monitor, and manage data pipelines in
Microsoft Fabric. Pipelines are crucial for automating ETL workflows,
moving data from sources to the warehouse or lakehouse, and processing
data in stages.
3. Real-Time Analytics: Understanding how to handle real-time data
streams and use stream analytics is important for building solutions that
require up-to-date information (e.g., IoT data processing, fraud detection).
4. Power BI Integration: While primarily for BI analysts, data engineers need
to ensure seamless integration between data processing workflows in
Microsoft Fabric and Power BI for reporting and dashboard creation.
5. MLOps: If you're also handling machine learning workflows, understanding
MLOps for automating the lifecycle of ML models within Fabric (from
training to deployment) will be essential for managing complex AI models.
visualization. The roles within a workspace define what actions a user can
perform, what resources they can access, and what data they can modify.
Here are the roles in a workspace in Microsoft Fabric, along with their
responsibilities and permissions:
1. Workspace Admin
Responsibilities:
• A Workspace Admin has full control over the workspace, managing both
resources and user permissions.
• This role is typically responsible for the overall configuration of the
workspace, including setting up workspaces, allocating capacity, and
managing access controls.
• They can create and delete workspaces, manage workspace-level resources
(e.g., pipelines, data sets), and add/remove users.
• Managing security and access: Workspace admins configure role-based
access control (RBAC) to grant other users the appropriate access to
resources.
Permissions:
• Create, modify, and delete workspaces.
• Configure roles and access controls for other users.
• Assign and manage resources like compute capacity, data sources, and
notebooks.
• Full access to all data, datasets, pipelines, and reports within the workspace.
2. Data Engineer
Responsibilities:
• Data Engineers are primarily responsible for building and managing data
pipelines, data transformation, and data workflows within the workspace.
• They work on designing and managing ETL processes (Extract, Transform,
Load), ensuring data flows smoothly from source systems to the storage
layer (data lake or warehouse).
• They may also be involved in data quality monitoring, scheduling data
jobs, and troubleshooting data issues.
Permissions:
• Access to and control over data pipelines and other data transformation
tools.
• Ability to create, modify, and delete datasets and data workflows.
• Limited access to notebooks for building scripts or data transformations.
• Can execute data transformations, run jobs, and monitor their progress.
3. Data Scientist
Responsibilities:
• Data Scientists focus on analyzing data and building machine learning
(ML) models.
• They typically use Python, R, or SQL to analyze datasets and build
predictive models, leveraging tools like notebooks and integrated machine
learning frameworks.
• They may also be responsible for model training, testing, and deployment
within the workspace.
Permissions:
• Full access to notebooks (for creating and running models).
• Ability to access datasets for model building.
• Ability to create, edit, and run scripts for data analysis and machine learning
experiments.
• Can interact with ML models and potentially deploy models depending on
the workspace’s configuration.
4. Business Analyst
Responsibilities:
• Business Analysts primarily work with Power BI and other data
visualization tools within Microsoft Fabric.
• They are responsible for transforming raw data into actionable insights by
creating dashboards, reports, and data visualizations.
• They interpret data, create KPIs, and communicate data insights to business
stakeholders for decision-making.
Permissions:
• Can access and view datasets and data models.
• Ability to create and modify Power BI reports, dashboards, and
visualizations.
• Limited access to data transformation tasks but can request data from
engineers or scientists for analysis.
• Cannot typically alter data pipelines or datasets.
5. Contributor
Responsibilities:
• Contributors can collaborate on creating and managing data pipelines,
datasets, and reports, but they don’t have administrative privileges to
manage access control or workspace settings.
• They typically perform tasks such as data modeling, creating reports, and
running queries but are not responsible for user management or configuring
workspace resources.
Permissions:
• Create, modify, and run data pipelines and data models.
• Can create, modify, and view reports and dashboards.
• Cannot manage access, delete workspaces, or modify security settings.
6. Reader
Responsibilities:
• Readers have the most restricted role. They are mainly consumers of data
and insights.
• They are limited to viewing data, reports, and dashboards but cannot modify
or create new resources.
• Readers typically interact with data in the form of reports, dashboards, and
visualizations created by others.
Permissions:
• View only access to datasets, reports, and dashboards.
• Cannot edit, delete, or create new resources like data pipelines or models.
• Ideal for users who need insights but do not need to make changes to the
workspace or the data itself.
7. Machine Learning Operations (MLOps)
Responsibilities:
• MLOps professionals focus on automating and managing the lifecycle of
machine learning models, from development to deployment and monitoring.
• In Microsoft Fabric, MLOps might be responsible for model training,
deployment, integration, and monitoring, ensuring that machine learning
models perform optimally in production environments.
Permissions:
• Access to datasets and machine learning models.
• Ability to deploy and monitor models in production.
• Collaborates with data engineers and data scientists to manage model
pipelines.
• Can trigger training jobs or set up model monitoring pipelines.
Additional Notes on Roles in Microsoft Fabric
1. Granular Permissions: Workspace admins can assign granular permissions
to users or groups within the workspace, allowing them to only access
specific data or components. This is useful for managing sensitive data and
ensuring that users only interact with the data relevant to their tasks.
2. Collaboration: Multiple roles can collaborate within the same workspace.
For example, data engineers and data scientists can work together on the
same dataset, while business analysts can create visualizations from that
data. This fosters cross-functional collaboration on data projects.
3. Custom Roles: Depending on the needs of the organization, Microsoft
Fabric allows the creation of custom roles with fine-grained access to
resources and data. This is particularly useful for organizations with unique
workflows or data security requirements.
OneLake is a unified data lake offering within Microsoft Fabric that consolidates
storage and analytics. It acts as a centralized repository where all data—structured,
semi-structured, and unstructured—can be stored and managed in a scalable and
efficient manner. OneLake aims to provide a more simplified and integrated
approach to data storage, eliminating the need for multiple, disparate storage
solutions.
Here’s a breakdown of OneLake and its importance in the context of Microsoft
Fabric:
Key Features and Benefits of OneLake:
1. Unified Data Storage:
a. OneLake brings together various types of data (e.g., transactional, log files,
machine data) in a single repository.
b. It supports structured data (tables with rows and columns), semi-structured
data (like JSON, XML, and Parquet files), and unstructured data (like logs,
images, and documents).
c. Users can store raw data directly in OneLake and process it later for
analytics, transformations, or machine learning.
2. Centralized Management:
a. OneLake provides centralized management for all your organization’s data,
eliminating the need to manage multiple data lakes or data warehouses.
b. This unified platform allows you to govern, secure, and optimize data
storage more effectively.
3. Integration with Microsoft Fabric:
a. As part of Microsoft Fabric, OneLake integrates seamlessly with other
components like Power BI, Data Engineering, and Machine Learning
services.
b. It ensures smooth data pipelines between data storage and the tools used for
analysis, transformation, and visualization.
4. Scalability:
a. OneLake is designed to scale according to the needs of your organization.
As data grows, the platform can accommodate the increased volume without
compromising performance or reliability.
b. You can store massive amounts of data without worrying about the
infrastructure, as OneLake automatically handles scalability.
5. High Performance and Cost Efficiency:
a. OneLake utilizes the power of Azure Data Lake and Azure Synapse to
deliver high-performance queries and analysis.
b. With optimized storage and access layers, it can reduce storage costs and
ensure that data is accessed efficiently, depending on the workload.
6. Data Governance and Security:
a. OneLake includes robust data governance features, such as role-based
access control (RBAC), audit logs, and data lineage tracking.
b. Organizations can set fine-grained access policies to ensure sensitive data is
only accessed by authorized users.
7. Support for Multiple Data Formats:
a. OneLake supports a wide variety of file formats, including CSV, Parquet,
ORC, Avro, and others. This flexibility allows organizations to work with
different types of data and tools without being constrained by format
compatibility.
8. Data Sharing and Collaboration:
a. OneLake facilitates data sharing between different teams and departments
within an organization. It provides collaborative features where different
users (data engineers, scientists, analysts) can work together on the same
data sets without redundancy.
Benefits for Data Engineers:
As a data engineer, working with OneLake simplifies the workflow:
• Single Source of Truth: No need to manage multiple storage systems.
OneLake serves as the single repository for all data across the organization.
• Cost and Performance Optimization: OneLake’s integration with Azure
services helps manage cost-effective storage and high-performance
analytics.
• Seamless Pipelines: OneLake integrates with Microsoft Fabric’s data
pipelines, enabling smooth ETL (Extract, Transform, Load) processes from
raw storage to consumable insights.
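To make the last point concrete, here is a minimal sketch of reading a raw file from OneLake in a Microsoft Fabric notebook with PySpark. It assumes the notebook is attached to a Lakehouse (which exposes the relative Files/ path and a ready-made Spark session); the file path and options shown are hypothetical.

```python
# Minimal sketch: reading raw data from the Lakehouse "Files" area in a Fabric notebook.
# Assumes the notebook is attached to a Lakehouse; "Files/raw/sales.csv" is a hypothetical path.
from pyspark.sql import SparkSession

# In a Fabric notebook a `spark` session already exists; getOrCreate() simply reuses it.
spark = SparkSession.builder.getOrCreate()

raw_sales = (
    spark.read
    .option("header", True)       # first row contains column names
    .option("inferSchema", True)  # let Spark guess column types for this quick look
    .csv("Files/raw/sales.csv")
)

raw_sales.printSchema()  # inspect the inferred schema before building a pipeline
raw_sales.show(5)        # preview a few rows, similar to the explorer's data preview
```

From here, the same DataFrame can be cleaned and written back into the Lakehouse as a table for downstream analytics.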
b. Ctrl + X: Cut selected file or folder.
c. Ctrl + V: Paste the copied or cut file into the current directory.
d. Delete: Delete the selected file or folder.
3. Previewing Files:
a. Ctrl + P: Preview the selected file (if supported).
4. Selection:
a. Shift + Click: Select multiple files or folders in a list.
b. Ctrl + Click: Select non-contiguous files or folders.
These shortcuts can help you perform common tasks like moving files, searching
for specific datasets, or organizing files much faster.
2. ACID Transactions:
a. The lakehouse architecture often supports ACID (Atomicity, Consistency,
Isolation, Durability) transactions, ensuring data integrity and consistency
across large datasets and complex data operations, which was historically a
limitation in data lakes.
3. Efficient Querying and Analytics:
a. Lakehouse enables querying of large datasets (including both raw and
structured data) without the need for data replication, thanks to Delta Lake
or other similar technologies that support versioned and optimized storage.
b. It provides support for SQL-based querying, making it easier for analysts
to interact with data, even if the underlying storage is semi-structured or
unstructured.
4. Support for Multiple Data Types:
a. In a lakehouse, you can store all kinds of data: raw data, data models,
structured data (e.g., tables, SQL), and unstructured data (e.g., images,
logs, video).
b. Delta Lake on Azure, which is part of Microsoft Fabric, offers the
functionality to manage such diverse datasets within the lakehouse.
5. Real-Time Data Processing:
a. Lakehouses are designed to support real-time streaming data and batch
data processing. This makes them ideal for businesses that need up-to-date
information in their analytics (such as streaming IoT data or customer
transactions).
6. Data Governance and Security:
a. Lakehouses provide strong data governance capabilities, ensuring that data
access is controlled, tracked, and compliant with internal security policies.
b. Integration with Azure Active Directory (AAD) ensures fine-grained role-
based access control (RBAC) and data auditing.
Lakehouse in Microsoft Fabric
In Microsoft Fabric, the Lakehouse architecture integrates with OneLake to
provide a seamless experience for storing and analyzing data:
1. Unified Data Storage:
a. OneLake acts as a centralized repository, allowing users to store data in its
raw form while providing the capability to process it with the tools available
in Microsoft Fabric (such as data engineering, machine learning, and
Power BI).
2. Delta Lake Integration:
a. Delta Lake is a storage layer that provides ACID transactions, schema
enforcement, and time travel on top of OneLake. This enables users to
manage, version, and perform SQL analytics on both batch and streaming
data (a minimal code sketch follows this list).
3. Simplified Data Pipeline:
a. Users can build data pipelines that read from OneLake, process and
transform the data, and then load it into a data warehouse or use it directly
in analytics workflows. The combination of data lakes and data warehouses
within the lakehouse simplifies the architecture by removing the need for
redundant data storage.
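As referenced in item 2 above, the sketch below shows the Delta behavior in a Fabric Lakehouse notebook: a small DataFrame is saved as a managed Delta table and then queried with SQL. The table and column names are made up for illustration, and the snippet assumes the notebook is attached to a Lakehouse.

```python
# Minimal sketch: writing and querying a Delta table from a Fabric notebook.
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()  # reuses the session provided by Fabric

# Hypothetical sample data standing in for ingested records.
orders = spark.createDataFrame([
    Row(order_id=1, customer="A", amount=250.0),
    Row(order_id=2, customer="B", amount=99.5),
])

# Saving as a managed table stores it in Delta format under the Lakehouse "Tables" area,
# which is what provides ACID transactions, schema enforcement, and time travel.
orders.write.format("delta").mode("overwrite").saveAsTable("orders")

# The same table is immediately queryable with SQL, for batch jobs or downstream tools.
spark.sql("SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer").show()
```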
Key Features of the Main View:
1. File/Folder Navigation:
a. On the left-hand side of the main view, you’ll typically find a folder-based
hierarchy that displays your data stored in the Lakehouse.
b. You can browse through folders and subfolders to explore datasets or
specific files (e.g., CSVs, Parquet files, or logs).
c. This view makes it easy to organize and navigate between different types of
datasets.
2. Data Preview:
a. When you click on a specific file or dataset in the folder structure, the main
view will display a preview of the data, showing a limited set of records.
b. This allows users to check the contents of the dataset before making any
decisions on how to process or analyze it.
3. Metadata Overview:
a. The main view displays metadata associated with each dataset, such as:
i. File size
ii. Last modified date
iii. Schema (e.g., data types, columns)
iv. Creation date
b. This helps users quickly understand the data they are dealing with, without
having to dig deeper into each file.
4. Action Buttons:
a. The main view contains various action buttons (e.g., Upload, Delete, Move,
Download, Preview), allowing users to perform file management tasks
directly within the interface.
b. You can also perform SQL-based queries directly from this view to explore
the data further.
Ribbon View in Lakehouse Explorer
The Ribbon View in Lakehouse Explorer is a toolbar that provides easy access to
the most common actions and options available within the explorer. It is typically
located at the top of the interface and consists of different tabs or groups of
buttons, offering shortcuts for tasks like data navigation, querying, and file
management.
Key Features of the Ribbon View:
1. File and Folder Management:
a. The Ribbon View contains buttons for uploading, downloading,
renaming, moving, and deleting files or folders within the Lakehouse.
b. It may also offer options to create new folders to organize your data better.
2. Search and Filter:
a. There is often a search bar in the Ribbon View, allowing users to quickly
locate files or datasets by name or metadata attributes.
b. Users can also apply filters to narrow down data results based on specific
criteria (e.g., file type, size, creation date).
3. Query and Analysis:
a. The Ribbon may include options for running SQL queries on datasets
directly within the Lakehouse Explorer. This lets you analyze data without
needing to use separate tools.
b. There may be a run query button or a SQL editor within the ribbon for
users to enter and execute queries against the data in the Lakehouse.
4. Preview and Visualize:
a. If available, the Ribbon View may provide buttons to preview data,
including the ability to open a visualization or preview a specific dataset in
a tabular or graphical format.
b. Users can access summary statistics (e.g., count, average) from the Ribbon
for quick data insights.
5. Collaboration:
a. The Ribbon View often includes options to share data or invite others to
collaborate on the data exploration process, making it easier to work as part
of a team.
6. Integration with Other Tools:
a. The Ribbon View may also provide quick access to integrations with other
tools in the Microsoft Fabric ecosystem, such as Power BI, Data
Engineering, or Notebooks.
7. Settings and Configuration:
a. There may be a settings section within the Ribbon that allows users to
configure their Lakehouse environment or customize preferences related to
their data or file explorer experience.
Steps to Create a Fabric Workspace:
1. Sign in to Microsoft Fabric:
a. Open a browser and navigate to the Microsoft Fabric portal.
b. Sign in with your Microsoft account associated with Microsoft Fabric or
your organization’s Azure Active Directory credentials.
2. Navigate to the Fabric Home Page:
a. After signing in, you'll land on the Fabric home page where you can see the
dashboard and various resources available to you.
3. Access the Workspaces Section:
a. On the left-hand navigation pane, locate the "Workspaces" tab (you may
need to click on "Fabric" or "Resources" depending on your setup).
b. Alternatively, you can search for "Workspaces" from the main search bar.
4. Create a New Workspace:
a. Once you’re in the Workspaces area, look for a button or option that says
“Create Workspace”. Typically, this will be a large button on the top right
or at the bottom of the workspace list.
b. Click on “Create Workspace” to begin setting up a new workspace.
5. Provide Workspace Information:
a. Workspace Name: Enter a name for your new workspace. Choose a name
that reflects the purpose of the workspace (e.g., “Sales Data Analysis” or
“Marketing Insights”).
b. Description (optional): You can optionally provide a brief description to
explain the workspace’s purpose or scope.
c. Region/Location: Select the data center region for your workspace. The
region affects where the data and services associated with the workspace are
stored and processed, so choose a region close to your team or users.
6. Choose Permissions:
a. Access Control: You can specify which users or groups should have access
to the workspace.
i. Set up roles like Admin, Member, or Viewer depending on the level of
access and control you want to provide.
ii. You may be able to link Azure Active Directory (AAD) groups or invite
individual users to the workspace.
7. Create the Workspace:
a. After entering all required details (name, region, permissions), click on
“Create” or “Create Workspace” to finalize the process.
b. The workspace will be created, and you’ll be taken to the workspace
environment where you can start adding datasets, data pipelines, models, and
other resources.
8. Verify and Access the Workspace:
a. After creating the workspace, you should see it listed on the Workspaces
dashboard.
b. You can now open the workspace, configure data sources, start building
pipelines, set up Power BI reports, and work with data in collaboration with
your team.
Optional Workspace Configuration:
• Adding Resources: Once your workspace is created, you can begin adding
resources such as:
o Data Engineering jobs and pipelines
o Power BI reports and dashboards
o Machine Learning models
o Data Lake and Data Warehouse connections
• Set up Collaborators: Invite other users from your organization to join the
workspace, giving them specific roles to manage and collaborate on different
tasks.
LAKEHOUSE
A Lakehouse is a modern data architecture that combines the features of a Data
Lake and a Data Warehouse. It aims to provide the best of both worlds: the
flexibility, scalability, and cost-effectiveness of a data lake with the performance,
reliability, and structure of a data warehouse.
Key Characteristics of a Lakehouse:
1. Unified Storage:
a. A Lakehouse uses data lakes for storage, but it structures the data in a way
that allows it to be easily accessed for both operational and analytical
purposes. Unlike traditional data lakes that store raw, unstructured data, the
Lakehouse model allows structured, semi-structured, and unstructured data
to be managed in a unified manner.
2. Data Engineering and Analytics:
a. A Lakehouse typically supports both data engineering and analytics
workloads. It combines data storage with the ability to run ETL (Extract,
Transform, Load) processes, data transformations, and analysis directly
within the same platform.
b. It provides a single source of truth for business intelligence, machine
learning, and other advanced analytics.
3. Transactional Data Management:
a. One of the key features of a Lakehouse is that it incorporates transactional
support (like ACID transactions) for ensuring data consistency and
reliability. This feature is more commonly associated with data warehouses
but is made available in the Lakehouse architecture by using technologies
such as Delta Lake or Apache Hudi.
4. Cost Efficiency:
a. Since Lakehouses typically use cloud-based data lakes for storage (e.g.,
Azure Data Lake, Amazon S3, or Google Cloud Storage), they provide
highly scalable storage at a lower cost compared to traditional data
warehouses.
5. Flexibility:
a. Lakehouses allow schema-on-read and schema-on-write, making it easier
to store and analyze data with flexible schemas. This means you can ingest
raw data and define the structure when you need to query it.
b. You can use SQL, Python, or R for querying and processing data in a
Lakehouse, and it is compatible with other data processing frameworks like
Apache Spark.
6. Modern Analytics:
a. Lakehouses support real-time analytics and machine learning pipelines. By
leveraging the power of data lakes for massive storage and the structure of a
data warehouse for optimized querying, a Lakehouse can be used for big
data analytics, streaming analytics, and machine learning models.
4. Data Science & Machine Learning Layer:
a. A Lakehouse enables machine learning workflows, including training
models, batch processing, and real-time inference, by leveraging the data
stored in the lake.
5. Business Intelligence and Reporting:
a. Since Lakehouses provide structured, clean data, they can easily integrate
with business intelligence tools (like Power BI or Tableau) for generating
reports and dashboards.
Advantages of a Lakehouse:
1. Unified Platform:
a. A Lakehouse combines the strengths of both data lakes and data warehouses
into a single platform. This reduces the need for separate systems for
different types of data processing.
2. Cost-Effective:
a. By using cloud storage, Lakehouses offer a cost-effective way to store
massive amounts of data, including raw or semi-structured data, which can
then be processed as needed.
3. Improved Performance:
a. Lakehouses provide optimized query performance by using data
management technologies (such as Delta Lake) that offer indexing,
caching, and query optimizations.
4. Flexibility:
a. They support a wide variety of data formats (e.g., CSV, Parquet, JSON)
and data types (structured, semi-structured, unstructured), making them
flexible for various use cases.
5. Advanced Analytics:
a. Lakehouses can support advanced analytics, real-time data processing, and
machine learning workflows directly on large datasets stored in the lake,
without the need to move data to separate systems.
• Lakehouse: Combines the best of both—using a data lake's flexibility for
storing large volumes of raw data and a data warehouse’s optimizations for
query performance, analytics, and transactional support.
AZURE
What is cloud computing?
Cloud computing refers to the delivery of computing services—such as storage,
processing power, databases, networking, software, and analytics—over the
internet, or the "cloud," rather than from a local server or personal computer. In
other words, cloud computing allows you to access and use computing resources
via the internet, often on a pay-as-you-go basis, rather than maintaining physical
hardware and software infrastructure.
Here's a side-by-side comparison between AWS (Amazon Web Services) and
Azure (Microsoft Azure) based on various factors:
• Developer Tools
o AWS: CodeBuild, CodeDeploy, CodePipeline, AWS SDKs and APIs
o Azure: Azure DevOps, Visual Studio integration, Azure SDKs
• Storage Models
o AWS: Object, Block, and File storage
o Azure: Object, Block, and File storage
• AI & Machine Learning
o AWS: Amazon Rekognition (image and video analysis), Amazon Lex (chatbots), SageMaker (ML)
o Azure: Azure Cognitive Services (vision, language, speech, and decision-making), Azure Machine Learning
• Databases
o AWS: Amazon RDS (Relational Database), DynamoDB (NoSQL), Redshift (Data Warehouse)
o Azure: Azure SQL Database, Cosmos DB (NoSQL), Synapse Analytics (Data Warehouse)
• Integration with Existing Tools
o AWS: Works with open-source technologies and third-party tools
o Azure: Deep integration with Microsoft products (e.g., Windows Server, SQL Server, Office 365)
• Identity Management
o AWS: AWS IAM (Identity and Access Management)
o Azure: Azure Active Directory (Azure AD)
• Container Services
o AWS: Amazon ECS (Elastic Container Service), EKS (Elastic Kubernetes Service)
o Azure: Azure Kubernetes Service (AKS), Azure Container Instances
• Serverless Computing
o AWS: AWS Lambda (running code without provisioning servers)
o Azure: Azure Functions (running event-driven code)
• Backup and Disaster Recovery
o AWS: AWS Backup, Amazon Glacier for archiving
o Azure: Azure Site Recovery, Azure Backup
Data Concepts in Cloud Computing (Azure Context)
Understanding key data concepts is essential for working with data on platforms
like Microsoft Azure. These concepts are related to how data is stored, processed,
and managed, and the various technologies and services available to handle data
efficiently.
1. Relational Data (Structured Data)
• Relational Data refers to data that is stored in tables with predefined
relationships between them. The tables are structured with rows and
columns. This is typically organized in databases that adhere to the
Relational Database Management System (RDBMS) model.
• Key features:
o Data is stored in tables (rows and columns).
o Strong consistency and data integrity.
o Supports SQL (Structured Query Language) for querying and managing
data.
o Good for transactional data and business applications that require complex
queries and relationships.
Examples of relational data: Customer databases, financial records, and
inventory systems.
Relational Data in Azure:
o Azure SQL Database: Fully managed relational database service built on
Microsoft SQL Server.
o Azure SQL Managed Instance: A fully managed database engine that
provides near 100% compatibility with SQL Server.
o Azure Database for PostgreSQL: A managed relational database service
for PostgreSQL.
o Azure Database for MySQL: A fully managed database service for
MySQL.
Use cases:
o Data modeling (e.g., ER diagrams).
o Data transactions (CRUD operations).
o Business reporting and analysis.
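As a small illustration of working with relational data in Azure, the sketch below connects to an Azure SQL Database from Python using pyodbc and runs a simple query. The server, database, and credential values are placeholders, and the ODBC driver name assumes Microsoft's "ODBC Driver 18 for SQL Server" is installed locally.

```python
# Minimal sketch: querying Azure SQL Database with pyodbc (pip install pyodbc).
# Server, database, user, and password below are placeholders, not real values.
import pyodbc

connection_string = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=tcp:<your-server>.database.windows.net,1433;"
    "DATABASE=<your-database>;"
    "UID=<user>;PWD=<password>;"
    "Encrypt=yes;"
)

conn = pyodbc.connect(connection_string)
cursor = conn.cursor()

# List a few tables in the database as a simple connectivity check.
cursor.execute("SELECT TOP 5 name FROM sys.tables ORDER BY name")
for row in cursor.fetchall():
    print(row.name)

conn.close()
```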
o Azure Data Lake Storage Gen2: A scalable, secure data lake for big data
analytics, designed for storing large amounts of unstructured data.
Use cases:
o Big Data processing and analytics.
o Real-time web applications.
o Internet of Things (IoT) data.
4. Data Services in Azure
Azure provides various data services for managing, processing, and analyzing data,
whether it's structured, semi-structured, or unstructured.
• Azure SQL Database: A fully managed relational database service based
on Microsoft SQL Server. It supports both transactional and analytical
workloads.
• Azure Synapse Analytics: Combines big data and data warehousing into a
single service. It allows querying data from relational and non-relational
sources using SQL.
• Azure Databricks: An Apache Spark-based analytics platform that is
designed for large-scale data engineering and data science tasks.
• Azure Data Lake Storage Gen2: A hierarchical file system built on top of
Azure Blob Storage designed for analytics workloads.
• Azure Cosmos DB: A globally distributed, multi-model NoSQL database
service that provides low-latency and scalable data access.
• Azure Data Factory: A data integration service for creating, scheduling,
and orchestrating ETL (Extract, Transform, Load) workflows across various
data sources.
• Azure Stream Analytics: A real-time analytics service designed for
processing streaming data from IoT devices, social media, and other real-
time sources.
• Azure Blob Storage: Object storage for storing massive amounts of
unstructured data, such as images, videos, and backups.
5. Modern Data Warehouses in Azure
A modern data warehouse is a centralized repository that allows businesses to
store and analyze large volumes of data from various sources, often in real-time.
• Azure Synapse Analytics (formerly SQL Data Warehouse): This is Azure's
modern data warehouse service. It integrates with big data technologies and
provides capabilities for real-time analytics, querying data at scale, and
integrating with AI and machine learning models.
o Key Features:
▪ Combines big data and data warehousing into a single platform.
▪ Real-time analytics.
▪ Built-in security and scalability.
▪ Deep integration with other Azure services, like Power BI for reporting.
o Use Case: Analyzing large datasets to uncover trends, patterns, and insights
for business intelligence (BI).
o Cosmos DB can be used for real-time data applications such as IoT
devices, mobile apps, and gaming platforms where high performance and
low latency are critical.
o It offers multi-region replication to ensure low-latency access to data from
anywhere in the world.
Key Features:
o Global distribution: Data can be replicated globally across multiple Azure
regions.
o Automatic scaling: Cosmos DB automatically scales based on demand.
o Consistency models: Offers multiple consistency models, such as strong
consistency, eventual consistency, and bounded staleness, to suit different
use cases.
o Real-time analytics: Ideal for applications requiring instant processing of
large volumes of data, such as real-time event streaming and IoT data.
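To show what this looks like in code, here is a minimal sketch using the azure-cosmos Python SDK to upsert a document and run a SQL-like query. The account endpoint, key, database, container, and item fields are placeholders; in a real container the partition key (assumed here to be /deviceId) must match the container's configuration.

```python
# Minimal sketch: writing and querying Azure Cosmos DB (pip install azure-cosmos).
# Endpoint, key, database, and container names are placeholders.
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<account-key>")
container = client.get_database_client("telemetry").get_container_client("readings")

# Upsert a JSON document; "id" plus the container's partition key are required.
container.upsert_item({"id": "device-42-0001", "deviceId": "device-42", "temperature": 21.7})

# Query with SQL-like syntax; cross-partition queries must be enabled explicitly.
results = container.query_items(
    query="SELECT c.id, c.temperature FROM c WHERE c.deviceId = 'device-42'",
    enable_cross_partition_query=True,
)
for item in results:
    print(item)
```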
Summary of Key Concepts
• Modern Data Warehouse: A centralized system for storing large volumes of
structured data for analytics. Azure service: Azure Synapse Analytics.
Cloud Computing Models
Cloud computing models refer to the different ways that cloud services are
delivered and consumed by organizations. These models define how resources are
provided and managed by the cloud service provider, and how users access and
interact with them.
1. Service Models of Cloud Computing
Cloud computing is primarily categorized into three service models based on the
level of control, management, and flexibility provided to users:
• Infrastructure as a Service (IaaS):
o Description: Provides virtualized computing resources over the internet,
such as virtual machines, storage, and networking.
o User Responsibility: Users manage operating systems, applications, and
data, while the provider manages the infrastructure.
o Examples: Amazon Web Services (AWS), Microsoft Azure, Google Cloud
Platform (GCP).
• Platform as a Service (PaaS):
o Description: Provides a platform that allows users to develop, run, and
manage applications without dealing with the complexity of infrastructure
management.
o User Responsibility: Users focus on application development, while the
provider manages runtime, middleware, databases, and infrastructure.
o Examples: Google App Engine, Microsoft Azure App Services, Heroku.
• Software as a Service (SaaS):
o Description: Provides fully managed software applications that users can
access over the internet.
o User Responsibility: Users only interact with the software, while the
provider manages the underlying infrastructure, platform, and application
updates.
o Examples: Microsoft Office 365, Google Workspace, Salesforce.
2. Deployment Models of Cloud Computing
Cloud deployment models define the type of cloud environment used based on
ownership, location, and access control. These models are categorized as:
• Public Cloud
• Private Cloud
• Hybrid Cloud
• Community Cloud
Public Cloud
• Description: A public cloud is a cloud computing model where the
infrastructure and services are owned and operated by a third-party cloud
service provider. These services are made available to the general public
over the internet.
• Characteristics of Public Cloud:
o Shared Resources: Multiple customers (tenants) share the same
infrastructure, but data and workloads are logically separated.
o Scalability: High scalability and on-demand resource provisioning, with
resources being available as needed.
o Cost-Effective: Typically operates on a pay-as-you-go pricing model,
meaning customers only pay for what they use.
o Maintenance-Free: Cloud provider is responsible for maintaining and
upgrading hardware, software, and services.
o Accessibility: Services are available over the internet, allowing users to
access them from anywhere.
o Examples: Amazon Web Services (AWS), Microsoft Azure, Google Cloud
Platform (GCP).
• Use Cases:
o Small to medium-sized businesses looking for cost-effective IT
infrastructure.
o Web applications, websites, and scalable workloads.
Private Cloud
• Description: A private cloud is a cloud computing model in which the
infrastructure is used exclusively by one organization. It can be hosted on-
premises or by a third-party provider, but it is not shared with other
organizations.
• Characteristics of Private Cloud:
o Exclusive Access: The infrastructure is dedicated solely to a single
organization, offering greater control over the environment.
o Enhanced Security: Since the cloud is private, it provides more robust
security and privacy controls, which is particularly beneficial for industries
with stringent regulations.
o Customization: Organizations have more flexibility to customize the
infrastructure to meet specific needs, such as specific hardware or software
configurations.
o Limited Scalability: Unlike public clouds, private clouds may have more
limited scalability, as resources are fixed and the organization is responsible
for managing growth.
o Cost: Typically more expensive than public cloud due to the dedicated
infrastructure and maintenance costs.
o Examples: VMware Private Cloud, Microsoft Azure Stack, OpenStack.
• Use Cases:
o Large enterprises with specific compliance or security needs.
o Applications that require control over the entire infrastructure and data, such
as sensitive government data or financial systems.
Hybrid Cloud
• Description: A hybrid cloud is a cloud computing model that combines
elements of both public and private clouds. It allows data and applications to
be shared between them, offering more flexibility and deployment options.
• Characteristics of Hybrid Cloud:
o Flexibility: Organizations can take advantage of the scalability and cost-
efficiency of public clouds, while maintaining control and security over
sensitive workloads in a private cloud.
o Seamless Integration: A hybrid cloud allows integration between on-
premises infrastructure and public cloud resources, creating a unified
environment.
o Workload Portability: Organizations can move workloads between public
and private clouds based on demand, security requirements, or compliance
issues.
o Cost Optimization: The model allows organizations to use the public cloud
for non-sensitive workloads and the private cloud for sensitive workloads,
balancing costs and security.
o Complexity: Hybrid clouds can be more complex to set up and manage due
to the need for coordination between multiple environments.
o Examples: Microsoft Azure Hybrid Cloud, AWS Outposts, Google Anthos.
• Use Cases:
o Organizations looking for a balance between control (private cloud) and
scalability (public cloud).
o Applications that need to scale based on demand but also need to store
sensitive data privately.
Community Cloud
• Description: A community cloud is a cloud computing model that is shared
by several organizations with common goals, requirements, or regulations.
The infrastructure is shared among the organizations, which may be from the
same industry or with similar regulatory needs.
• Characteristics of Community Cloud:
o Shared Infrastructure: Multiple organizations share the same cloud
infrastructure, but it is customized to meet the specific needs of the
community.
o Cost Sharing: The cost of the infrastructure is shared among the
organizations, making it more affordable than a private cloud.
o Collaborative Environment: Ideal for industries or organizations that need
to collaborate and share data while maintaining control over their
infrastructure.
o Security and Compliance: Organizations in the community share common
security, compliance, and regulatory requirements, such as those in
healthcare, education, or government.
o Customization: The cloud infrastructure may be customized to meet the
specific needs of the community, including specialized software or security
protocols.
o Examples: Government clouds, healthcare clouds, or industry-specific
clouds (e.g., financial services or research communities).
• Use Cases:
o Organizations within the same industry or with similar regulatory
requirements.
o Government agencies or healthcare organizations that need to maintain high
levels of security and compliance.
• Associate: Intermediate certifications that cover more specific roles or
solutions and are intended for professionals who have some hands-on
experience with Azure.
• Expert: Advanced certifications for experienced professionals who want to
prove their deep knowledge of Azure services and architecture.
• Specialty: Focused certifications that cover niche areas, like AI, IoT, or
security.
o Target Audience: Beginners looking to learn about data concepts in Azure.
2. Azure Associate Certifications
These certifications are for individuals who have hands-on experience and want to
specialize in specific Azure roles or services.
• Microsoft Certified: Azure Administrator Associate
o Exam: AZ-104
o Topics Covered: Managing Azure subscriptions, resources, storage,
network, virtual machines, and identity.
o Target Audience: Azure administrators managing cloud resources.
• Microsoft Certified: Azure Developer Associate
o Exam: AZ-204
o Topics Covered: Developing applications, using Azure SDKs, APIs,
managing Azure resources, and cloud-native apps.
o Target Audience: Azure developers building and maintaining applications
on Azure.
• Microsoft Certified: Azure Security Engineer Associate
o Exam: AZ-500
o Topics Covered: Azure security tools, identity management, platform
protection, data and application security, and security operations.
o Target Audience: Security engineers responsible for securing Azure
environments.
• Microsoft Certified: Azure AI Engineer Associate
o Exam: AI-102
o Topics Covered: AI solutions, machine learning, computer vision, natural
language processing, and integrating AI solutions with Azure services.
o Target Audience: AI engineers and those focused on AI solutions in Azure.
• Microsoft Certified: Azure Data Engineer Associate
o Exam: DP-203
o Topics Covered: Designing and implementing data storage, managing and
developing data pipelines, integrating data solutions.
o Target Audience: Data engineers working with big data, analytics, and data
storage solutions.
3. Azure Expert Certifications
These certifications are for professionals with deep experience in Azure services.
They typically require several years of hands-on experience.
• Microsoft Certified: Azure Solutions Architect Expert
o Exam: AZ-305 (Designing Microsoft Azure Infrastructure Solutions), which
replaced the earlier AZ-303 and AZ-304 exams
o Topics Covered: Design and implement Azure infrastructure, security,
business continuity, governance, and hybrid cloud solutions.
o Target Audience: Azure solutions architects.
• Microsoft Certified: Azure DevOps Engineer Expert
o Exam: AZ-400
o Topics Covered: DevOps principles, source control, continuous integration,
delivery, security, and automation.
o Target Audience: DevOps professionals working with Azure DevOps tools
and processes.
4. Azure Specialty Certifications
These are more specialized certifications for specific Azure technologies.
• Microsoft Certified: Azure IoT Developer Specialty
o Exam: AZ-220
o Topics Covered: IoT solutions, device management, data processing, and
cloud integration.
o Target Audience: IoT developers using Azure IoT services.
• Microsoft Certified: Azure Virtual Desktop Specialty
o Exam: AZ-140
o Topics Covered: Configuring and managing Azure Virtual Desktop
environments.
o Target Audience: Professionals working with Azure Virtual Desktop
(formerly Windows Virtual Desktop).
About ₹2 may be held on the card for verification.
-> Can we create an Azure account for free without a credit card?
Students can, through Azure for Students (https://azure.microsoft.com/en-in/free/students/).
You need a school email account; you will receive $100 of credit for free and can
create up to a 250 GB SQL database and use 75 hours of Windows virtual machines.
How to cancel a subscription:
Cost Management -> go to Subscriptions -> Cancel subscription
Microsoft Azure offers a free account to new users with access to a limited set of
services for free and credits that can be used for exploring Azure services. Here's
how you can create one:
Steps to Create a Free Azure Account:
1. Go to the Azure Free Account Page:
a. Visit the official Azure Free Account page.
2. Sign Up:
a. Click on the "Start for free" or "Sign Up" button.
b. You will need a Microsoft account (Outlook, Hotmail, etc.). If you don't
have one, you can create a new Microsoft account during the sign-up
process.
3. Provide Personal Information:
a. Enter your personal information, including name, country/region, and
phone number.
b. You will need to verify your phone number through a text message or phone
call.
4. Add Payment Information:
a. While the account is free, you will need to provide a valid credit card for
verification purposes. You won’t be charged unless you exceed the free
usage limits or choose to upgrade to a paid plan.
b. Note: Microsoft may place a temporary hold of a small amount (usually $1)
to verify the card, but it will not be charged.
5. Get Your Free Credits:
a. Once the account is set up, you will receive $200 in free credits to explore
Azure services for the first 30 days. These credits can be used on any Azure
services.
b. After 30 days, you will continue to have access to more than 25 services that
are always free, and you can continue using them without incurring any
charges.
6. Start Using Azure:
a. You can now start using Azure, creating resources such as virtual machines,
databases, and other cloud services.
With these certifications, Azure users can build a strong foundation of knowledge,
increase their career opportunities, and gain expertise in specific areas of cloud
computing.
Can we create free azure account
Yes, you can create a free Azure account with a credit card. However, there are
some important things to know about how this works:
Key Points for Creating a Free Azure Account with a Credit Card:
1. Credit Card for Verification:
a. Microsoft requires you to enter a valid credit card during the sign-up
process for verification purposes. This is not for charging you immediately.
b. Why a credit card is needed: It's used to verify your identity and ensure
that you are not a robot or fraudulent user. Microsoft may perform a small
temporary authorization of around $1 USD to verify the card, but this
amount is not charged.
2. Free $200 Credit:
a. Upon successfully signing up for an Azure free account, you will receive
$200 in free credits that you can use to explore Azure services within the
first 30 days.
b. You can use the $200 credit on most Azure services without any additional
charges during this period.
3. Always Free Services:
a. After the $200 credits are exhausted or the 30 days expire, you'll still have
access to more than 25 Azure services that are always free with certain
usage limits. Examples of such services include Azure Functions, Azure
Blob Storage, and Azure Active Directory.
b. These services are always free up to a certain level of usage. If you exceed
the usage limits for any of these services, you'll need to upgrade to a paid
plan.
4. No Automatic Charges:
a. If you do not upgrade to a paid subscription and you are using the free-tier
services, you will not be charged. Microsoft will not charge you
automatically unless you manually upgrade your account to a paid
subscription or exceed the free-tier usage limits.
b. Important: You will need to monitor your usage to ensure that you stay
within the free limits if you are not ready to pay for additional services.
Steps to Create the Free Azure Account with a Credit Card:
1. Go to the Azure Free Account page:
Visit the official Azure Free Account page.
2. Click "Start for Free":
Click on the "Start for free" button to begin the sign-up process.
3. Sign in with a Microsoft Account:
If you don’t have one, you will need to create a new Microsoft account (Outlook,
Hotmail, etc.).
4. Provide Personal Information:
Fill in personal details such as your name, country, and phone number. You’ll also
need to verify your phone number via text or a phone call.
5. Enter Credit Card Information:
Provide your valid credit card information. Microsoft will use this only for
verification and billing purposes after the free credit is exhausted.
6. Receive Free Credits:
Once your account is set up, you’ll receive $200 in free credits that are valid for
the first 30 days. After 30 days, you'll still have access to free services with usage
limits.
7. Start Using Azure:
You can now start using Azure services. Make sure to monitor your usage so that
you don’t exceed the free-tier limits.
Summary:
• You can create a free Azure account with a credit card, but no charges will
occur unless you exceed the free-tier limits or choose to upgrade to a paid
plan.
• $200 in free credits are available to try out Azure services for the first 30
days.
• After the credits are used up, you'll still have access to 25+ always free
services with usage limits.
If you prefer, you can avoid any future charges to the credit card by relying only on
the always-free services and monitoring your usage carefully.
1. Compute
• Virtual Machines
• Containers / Kubernetes Service
• Cloud Services
• Mobile Services
2. Network
• Virtual Network
• Load Balancing
• Azure DNS
3. Storage
• Azure Disk Storage
• Blob Storage
• Azure Backup
• Queue Storage
Azure Interface
The Azure Interface refers to the various ways users interact with and manage
their Azure resources. It provides a user-friendly environment for configuring,
monitoring, and controlling all Azure services and resources. Azure offers several
interfaces that cater to different types of users, including developers,
administrators, and business professionals.
Below are the main interfaces provided by Microsoft Azure:
1. Azure Portal
• Description: The Azure Portal is the most common and comprehensive
web-based interface for managing Azure resources. It is a graphical interface
that provides an intuitive and user-friendly experience to create, configure,
and manage resources within the Azure cloud.
• Key Features:
o Dashboard: The portal offers a customizable dashboard where users can pin
and view key resources and metrics.
o Resource Management: Create, configure, and monitor various Azure
resources such as virtual machines (VMs), storage accounts, databases, and
networking components.
o Search: Quickly search and access services, resources, or documentation.
o Templates and Automation: You can deploy Azure resources using pre-
built templates or through automation tools.
o Monitoring and Alerts: Set up alerts, view metrics, and logs for resource
monitoring.
o Security and Access Control: Manage roles, permissions, and policies for
users and resources through Azure Active Directory.
• Use Case: The Azure Portal is ideal for administrators, developers, and IT
professionals who prefer a visual interface to manage Azure resources.
• Access: You can access the Azure Portal at https://portal.azure.com.
3. Azure PowerShell
• Description: Azure PowerShell is a set of cmdlets (commandlets) designed
specifically for managing Azure resources in a PowerShell environment.
PowerShell is a scripting language and shell that allows users to automate
administrative tasks and manage Azure resources programmatically.
• Key Features:
o Cmdlets: PowerShell provides cmdlets that can be used to interact with
Azure services and resources.
o Automation: You can use PowerShell scripts to automate the creation,
configuration, and management of resources.
o Integration: Works well with other tools in the Microsoft ecosystem, like
System Center, Windows Server, and Active Directory.
• Use Case: PowerShell is commonly used by IT administrators and advanced
users who prefer the PowerShell scripting environment to manage resources
and automate tasks.
• Access: Azure PowerShell can be run on Windows, Linux, or macOS. It can
also be used in Azure Cloud Shell in the portal.
• Use Case: It is ideal for users who need to manage Azure resources without
having to set up any local tools or configurations. It's especially useful for
quick management tasks or when you don’t have access to a local machine
with Azure tools installed.
• Access: Azure Cloud Shell can be accessed directly within the Azure Portal
by clicking the "Cloud Shell" icon in the top-right corner.
6. Azure REST API
• Description: The Azure REST API provides programmatic access to Azure
resources and services through HTTP requests. This is ideal for developers
who want to build custom applications that interact with Azure.
• Key Features:
o RESTful: The API follows RESTful principles, allowing for easy
interaction with Azure resources using standard HTTP methods.
o Full control: Developers can perform all management tasks, such as
creating and configuring resources, using the REST API.
o Integration with Other Systems: It allows Azure to be integrated into other
custom applications or systems.
• Use Case: Developers and system integrators who need to interact with
Azure services at a low level or want to integrate Azure with other
applications.
• Access: The Azure REST API is documented in the Azure REST API
documentation.
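As a small, hedged example of the REST API in practice, the sketch below lists the resource groups in a subscription by calling Azure Resource Manager with the requests library. It assumes the azure-identity package is available for acquiring a token (for example via an Azure CLI sign-in), the subscription ID is a placeholder, and the api-version shown may differ from the latest one.

```python
# Minimal sketch: calling the Azure Resource Manager REST API.
# Requires: pip install requests azure-identity, plus a credential that
# DefaultAzureCredential can pick up (e.g. an Azure CLI login).
import requests
from azure.identity import DefaultAzureCredential

subscription_id = "<your-subscription-id>"  # placeholder

# Acquire a bearer token scoped to the ARM endpoint.
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

resp = requests.get(
    f"https://management.azure.com/subscriptions/{subscription_id}/resourcegroups",
    params={"api-version": "2021-04-01"},  # version used at the time of writing; may change
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()

for rg in resp.json().get("value", []):
    print(rg["name"], "-", rg["location"])
```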
• Use Case: Developers and operations teams use Application Insights to
monitor their applications and services running on Azure, ensuring they
perform optimally.
• Access: Application Insights can be accessed through the Azure Portal,
where you can configure monitoring and view logs.
Azure products and services
Microsoft Azure offers a wide range of products and services across different
categories to support various cloud computing needs such as infrastructure,
platform, and software as a service (IaaS, PaaS, SaaS). Azure's offerings help
organizations to build, deploy, and manage applications through its globally
distributed data centers. Below is an overview of some of the most common Azure
products and services, categorized by their functionality.
1. Compute Services
Azure Compute services provide on-demand computing resources for running
applications and workloads.
• Azure Virtual Machines (VMs): Provides scalable, on-demand compute
power for applications. You can choose different sizes of VMs for different
workloads.
o Use Case: Hosting websites, running development environments, and
enterprise applications.
• Azure App Services: Platform as a Service (PaaS) for building, deploying,
and managing web apps and APIs.
o Use Case: Hosting websites, mobile apps, and RESTful APIs.
• Azure Kubernetes Service (AKS): Managed Kubernetes service for
deploying and managing containerized applications.
o Use Case: Deploying and orchestrating containers in a cloud environment.
• Azure Functions: Serverless compute service that automatically scales
based on demand.
o Use Case: Running event-driven applications and automating workflows.
• Azure Virtual Desktop: A service that enables users to create a scalable
desktop and application virtualization environment.
o Use Case: Remote work scenarios, virtual desktops, and application
hosting.
2. Storage Services
Azure provides scalable storage solutions for various data types such as
unstructured, structured, and big data.
• Azure Blob Storage: Object storage for storing unstructured data like
documents, images, and video files.
o Use Case: Storing large amounts of unstructured data.
• Azure Disk Storage: Managed disk storage for virtual machines.
o Use Case: Attaching persistent storage to virtual machines.
• Azure File Storage: Fully managed file shares in the cloud that can be
mounted on Windows or Linux VMs.
o Use Case: Shared storage for applications that require file system access.
• Azure Queue Storage: Messaging service for storing messages that can be
retrieved by other applications.
o Use Case: Decoupling applications for better performance and scalability.
• Azure Data Lake Storage: Scalable storage for big data analytics.
o Use Case: Storing and analyzing large datasets.
3. Networking Services
Azure provides networking services for secure and efficient communication
between Azure resources and on-premises infrastructure.
• Azure Virtual Network (VNet): A private network within Azure that
enables communication between Azure resources securely.
o Use Case: Isolating resources and controlling traffic flow in the cloud.
• Azure Load Balancer: Distributes incoming network traffic across multiple
virtual machines.
o Use Case: Ensuring high availability and load distribution.
• Azure VPN Gateway: Securely connects an on-premises network to an
Azure virtual network.
o Use Case: Extending on-premises infrastructure to the cloud.
• Azure Application Gateway: A web traffic load balancer that enables you
to manage traffic to your web applications.
o Use Case: Managing and securing HTTP(S) traffic for applications.
• Azure Content Delivery Network (CDN): A global content delivery
service for distributing content like images, videos, and web pages with low
latency.
o Use Case: Accelerating the delivery of static content globally.
• Azure ExpressRoute: A private, high-throughput connection between on-
premises infrastructure and Azure.
o Use Case: Establishing private, secure, and high-performance connectivity
to Azure.
o Use Case: Building highly responsive, globally distributed apps that require
low-latency access.
• Azure Database for MySQL/PostgreSQL: Managed services for running
MySQL and PostgreSQL databases on Azure.
o Use Case: Hosting MySQL/PostgreSQL databases with minimal
management.
• Azure Synapse Analytics: A comprehensive analytics platform that
combines data warehousing and big data analytics.
o Use Case: Analyzing large datasets from various data sources.
• Azure Data Factory: A cloud-based data integration service to orchestrate
and automate data movement and transformation.
o Use Case: Building ETL (Extract, Transform, Load) pipelines for data
processing.
• Azure HDInsight: A fully managed cloud service for big data analytics
using frameworks like Hadoop, Spark, and Hive.
o Use Case: Running big data analytics workloads in the cloud.
7. Developer Tools and DevOps
Azure provides tools for software developers and DevOps teams to manage code,
continuous integration, and continuous delivery (CI/CD) pipelines.
• Azure DevOps Services: A suite of development tools for version control,
build automation, release management, and project management.
o Use Case: Managing software development life cycles and CI/CD pipelines.
• Azure DevTest Labs: A service for quickly creating development and test
environments in Azure.
o Use Case: Setting up environments for development, testing, and
experimentation.
• Azure Logic Apps: A service for automating workflows and business
processes using a no-code interface.
o Use Case: Integrating applications, automating tasks, and setting up business
workflows.
• Azure Container Registry: A service for storing and managing Docker
container images.
o Use Case: Storing and managing containerized application images.
• Azure Container Instances: A service for running Docker containers
without needing to manage infrastructure.
o Use Case: Running containerized applications on demand.
8. Internet of Things (IoT)
Azure offers a range of services for connecting, managing, and analyzing IoT
devices.
• Azure IoT Hub: A service for connecting, monitoring, and managing IoT
devices.
o Use Case: Building IoT solutions to connect and manage millions of
devices.
• Azure Digital Twins: A service for creating digital representations of
physical environments.
o Use Case: Building digital models of real-world environments for IoT
applications.
• Azure IoT Central: A fully managed app platform for building IoT
solutions.
o Use Case: Quickly building and deploying IoT solutions without deep
development.
• Azure Cost Management and Billing: A tool for tracking and managing
Azure costs.
o Use Case: Managing and optimizing cloud spending.
Use Case: If one data center in an availability zone goes down, services in other
availability zones remain available, ensuring high availability and resilience.
2. Azure Resources Azure resources are the individual components that make
up your solution or application in the cloud. These can be virtual machines,
databases, storage accounts, etc. Every resource you create on Azure is part
of a resource group, which is a logical container for managing related
resources.
Use Case: If you create a web app, a storage account, and a database, all these
resources can be grouped together within a single resource group for easy
management.
c. Network Security Groups (NSGs): Manage inbound and outbound traffic
to Azure resources.
Use Case: VNets are essential for ensuring secure communication between your
cloud resources (such as virtual machines) and external systems, like on-premises
data centers.
6. Storage Services Azure Storage offers a range of services for storing data
in the cloud. It includes services for managing unstructured and structured
data. Key storage resources in Azure are:
a. Azure Blob Storage: Object storage for large amounts of unstructured data
such as text, images, and video.
b. Azure Disk Storage: Persistent storage for virtual machines (VMs).
c. Azure File Storage: Managed file shares that can be mounted on VMs.
d. Azure Data Lake Storage: Scalable storage for big data analytics.
Use Case: If you need to store files, you would use Azure Blob Storage, while
Azure Disk Storage would be used to store VM data.
9. Monitoring and Management Azure provides services for managing,
monitoring, and optimizing the performance of your applications and
resources in the cloud.
a. Azure Monitor: A comprehensive monitoring solution for tracking the
performance and health of applications and infrastructure.
b. Azure Log Analytics: Collects and analyzes logs from Azure resources.
c. Azure Automation: Automates repetitive tasks like patch management,
configuration, and VM provisioning.
Use Case: Azure Monitor can be used to track the health of your applications,
while Azure Automation helps in automating resource management tasks.
Azure Architecture Example (End-to-End Solution)
Let’s consider an example of deploying a web application on Azure:
1. Virtual Network (VNet): Set up a secure virtual network to isolate the web
application and database.
2. Azure App Services: Deploy the web application on Azure App Services
for automatic scaling and management.
3. Azure SQL Database: Use Azure SQL Database to store relational data for
your application.
4. Azure Blob Storage: Store media files and logs in Azure Blob Storage.
5. Azure Load Balancer: Distribute incoming traffic across multiple instances
of the web application.
6. Azure Monitor and Application Insights: Set up monitoring and logging
to ensure the web app is running optimally.
7. Azure Active Directory (AAD): Use AAD for user authentication and role-
based access control (RBAC).
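As a rough illustration of automating the first step of such a deployment, the Python sketch below uses the azure-identity and azure-mgmt-resource packages to create the resource group that would hold these components; the subscription ID, group name, and region are assumptions made for the example.

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Hypothetical subscription ID; DefaultAzureCredential picks up whatever
# credentials are available (environment variables, managed identity, CLI login).
subscription_id = "<subscription-id>"
credential = DefaultAzureCredential()

resource_client = ResourceManagementClient(credential, subscription_id)

# Create (or update) a resource group to hold the web app, database, and storage.
group = resource_client.resource_groups.create_or_update(
    "webapp-demo-rg", {"location": "eastus"}
)
print("Provisioned resource group:", group.name)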
Azure Storage
Azure Storage is Microsoft's family of cloud storage services. It allows for scalability, easy management, and access from anywhere.
• Azure Disk Storage:
o Managed disks that offer high performance and scalability.
o Supports different disk types such as:
▪ Premium SSD: High-performance SSD storage.
▪ Standard SSD: Balanced SSD storage for workloads.
▪ Standard HDD: Economical storage for less demanding workloads.
• Use Case: Store VM operating systems, application data, and high-
performance databases.
• Azure Queue Storage:
o Useful for task scheduling, load balancing, and asynchronous processing.
o Messages can be stored for up to 7 days.
• Use Case: Queuing tasks for background processing, decoupling services,
and handling messages in microservices architecture.
Azure Storage Access Methods
Azure Storage provides several ways to access and manage data in the cloud.
These methods can be used programmatically, via the Azure portal, or using
different tools:
1. Azure Portal: A web-based interface to create and manage storage
accounts, containers, files, and other storage resources.
2. Azure CLI: Command-line tools that allow you to interact with Azure
resources and manage storage through shell commands.
3. Azure PowerShell: A set of cmdlets that let you automate and manage
Azure resources, including storage.
4. Azure SDKs: Software Development Kits (SDKs) for different
programming languages like Python, .NET, Java, and Node.js that allow
developers to interact with Azure Storage (see the Python sketch after this list).
5. REST APIs: Azure Storage also exposes a REST API that allows
developers to perform storage operations such as uploading, downloading,
and deleting files.
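As a small illustration of the SDK route, the Python sketch below uses the azure-storage-blob package to upload a local file as a block blob; the connection string, container name, and file name are placeholders for this example.

from azure.storage.blob import BlobServiceClient

# Hypothetical connection string, found under "Access keys" for the storage account.
connection_string = "<storage-account-connection-string>"

service = BlobServiceClient.from_connection_string(connection_string)
container = service.get_container_client("documents")

# Upload a local file as a block blob named "report.pdf".
with open("report.pdf", "rb") as data:
    container.upload_blob(name="report.pdf", data=data, overwrite=True)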
a. Shared Access Signatures (SAS): Temporary tokens that grant restricted
access to specific Azure Storage resources without needing to share the
account keys (a Python sketch follows this list).
b. Azure Active Directory (AAD) Integration: Allows for authentication and
role-based access control (RBAC) for Azure Storage.
c. Access Control Lists (ACLs): Set permissions on specific blobs or
containers for fine-grained access control.
3. Firewall & Virtual Network Integration: Restrict access to Azure Storage
resources to specific IP ranges or Virtual Networks.
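The following Python sketch (using azure-storage-blob, with hypothetical account details) generates a read-only SAS token for a single blob that expires after one hour, which is the pattern described in point (a) above.

from datetime import datetime, timedelta, timezone
from azure.storage.blob import BlobSasPermissions, generate_blob_sas

# Hypothetical account name and key; the key comes from the storage account's
# "Access keys" blade and should never be shared directly with clients.
account_name = "mystorageacct"
account_key = "<account-key>"

# Grant read-only access to one blob for one hour.
sas_token = generate_blob_sas(
    account_name=account_name,
    container_name="documents",
    blob_name="report.pdf",
    account_key=account_key,
    permission=BlobSasPermissions(read=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)

url = f"https://{account_name}.blob.core.windows.net/documents/report.pdf?{sas_token}"
print(url)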
• Content Delivery: Use Azure Blob Storage to store and serve large files
such as images, videos, and static web content.
• Data Archiving: Use Archive Storage tier for storing infrequently accessed
data for long-term retention at a lower cost.
• Enterprise Applications: Store and manage data for enterprise applications
that require high availability, scalability, and reliability.
Types of Blobs in Azure Blob Storage
1. Block Blob
• Purpose: Block blobs are optimized for storing text and binary data. They
are ideal for storing large files such as documents, images, videos, backups,
and log files.
• Features:
o Composed of blocks of data that can be managed independently.
o Ideal for streaming media and large files.
o Each block can be up to 100 MB in size (in practice, the total size of a block
blob can be up to 5 TB).
o Supports parallel uploads of data, making it efficient to upload large files in
smaller chunks.
• Use Case: Storing media files, application backups, website content, and
data for analytics.
Example: A video file, a large image, or a backup file.
2. Append Blob
• Purpose: Append blobs are optimized for scenarios where data is added to
an existing blob, rather than replacing it. They are specifically designed for
logging and append-only operations.
• Features:
o Made up of blocks like block blobs, but each block can only be appended
(new data can only be added to the end of the blob).
o Ideal for situations where data is continuously added over time, such as
logging or tracking events.
o Append blobs allow you to perform efficient writes for continuous data
streams.
• Use Case: Logging data (e.g., system logs, event logs), or continuously
collecting data (e.g., IoT sensor data).
Example: A log file that is constantly updated with new entries.
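A minimal Python sketch of this append-only pattern, assuming the azure-storage-blob package and a hypothetical "logs" container:

from azure.storage.blob import BlobServiceClient

# Hypothetical connection string; the blob is created once and then only
# ever appended to, which suits log-style data.
service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")
blob = service.get_blob_client(container="logs", blob="app.log")

if not blob.exists():
    blob.create_append_blob()

# Each call adds a new block to the end of the blob.
blob.append_block(b"2024-01-01 12:00:00 INFO application started\n")
blob.append_block(b"2024-01-01 12:00:05 INFO user logged in\n")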
3. Page Blob
• Purpose: Page blobs are optimized for random read/write operations and
are used primarily for storing virtual machine (VM) disks and other
scenarios where frequent, random access to data is required.
• Features:
o Composed of 512-byte pages, allowing for efficient random access to large
data files.
o Supports efficient updates to small parts of large data, making it ideal for
VM disk storage.
o Page blobs can grow up to 8 TB in size.
o Ideal for workloads that require frequent, small updates (as opposed to entire
blobs).
• Use Case: Storing VHD (Virtual Hard Disk) files for Azure virtual
machines, database files, and other random read/write workloads.
Example: A virtual machine disk or a database that requires frequent and efficient
random writes.
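For comparison, here is a minimal Python sketch of page-blob usage (azure-storage-blob, hypothetical container and blob names); note that page blobs have a fixed capacity and every write must be 512-byte aligned.

from azure.storage.blob import BlobServiceClient

# Hypothetical connection string and names.
service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")
blob = service.get_blob_client(container="disks", blob="scratch.vhd")

# Create a 1 MiB page blob, then overwrite just its first 512-byte page.
blob.create_page_blob(size=1024 * 1024)
page = b"\x01" * 512
blob.upload_page(page, offset=0, length=512)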
2. Cool Tier:
a. Use Case: For data that is infrequently accessed but still needs to be stored
for long periods (e.g., backups, archives, and older documents).
b. Performance: Slightly higher latency than the Hot tier but still suitable for
occasional access.
c. Cost: Lower storage cost than Hot, but higher access costs.
3. Archive Tier:
a. Use Case: For data that is rarely accessed but must be retained for long-term
storage (e.g., regulatory compliance, historical data).
b. Performance: Very high latency (retrieval can take hours), optimized for
cost-efficient long-term storage.
c. Cost: The lowest storage cost, but retrieval costs are high.
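Tiers can also be changed per blob after upload. The Python sketch below (azure-storage-blob, hypothetical names) moves an existing block blob to the Archive tier for low-cost long-term retention:

from azure.storage.blob import BlobServiceClient

# Hypothetical connection string and blob; archiving trades retrieval latency for cost.
service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")
blob = service.get_blob_client(container="backups", blob="2023-archive.tar.gz")

# Valid tier names are "Hot", "Cool", and "Archive".
blob.set_standard_blob_tier("Archive")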
c. Storage accounts come with different performance options, like Standard
and Premium, based on the use case and the performance needs of your
application.
4. Access Control:
a. You can control access to blobs using Azure Active Directory (AAD),
Shared Access Signatures (SAS), or Access Keys.
b. Azure Blob Storage supports RBAC (Role-Based Access Control) and
Access Control Lists (ACLs) for granular control over who can access what
data.
o Storing backups and archives.
Conclusion
Azure Blob Storage is a powerful and scalable solution for storing unstructured
data in the cloud. It offers various types of blobs suited for different workloads:
• Block Blobs for large files like media, backups, and datasets.
• Append Blobs for scenarios that require appending data (e.g., logs).
• Page Blobs for high-performance random access data, such as virtual
machine disks.
By choosing the right type of blob and storage tier, you can optimize both
performance and costs based on the frequency and needs of your data access.
Azure Blob Storage: Ideal Use Cases
Azure Blob Storage is a highly scalable, cost-effective, and secure cloud storage
service that is ideal for storing unstructured data. Unstructured data includes
anything that doesn't fit neatly into a relational database model, such as text,
images, audio files, videos, log files, backups, and more. Blob storage is suited for
both large-scale data and frequent data access, offering a wide variety of
scenarios for data storage.
Here are the key use cases where Azure Blob Storage is ideal:
8. Data Sharing (Collaborative Workspaces)
• Ideal For: Storing files that need to be shared or collaborated on across
multiple parties or teams.
• Why Blob Storage: Blob Storage is well-suited for collaboration by
allowing multiple users to access the same blob data. Through the use of
Shared Access Signatures (SAS), permissions can be granted securely to
different users for read, write, or delete operations without exposing the
storage account keys.
• Example:
o Sharing large files (e.g., project documents, CAD files) between
departments, clients, or partners.
o A company storing and managing shared project files for its teams.
10. Machine Learning and AI Data Storage
• Ideal For: Storing large datasets for training machine learning models and
artificial intelligence applications.
• Why Blob Storage: Blob Storage can handle large unstructured datasets
(e.g., image data, text, and structured data) which are commonly used for
machine learning training. It integrates seamlessly with Azure’s AI and ML
tools, enabling fast data processing and model training.
• Example:
o Storing image data for a deep learning model for image classification.
o Storing training datasets for AI-driven analytics.
Summary
Azure Blob Storage is highly flexible and can be used in a variety of scenarios due
to its scalability, security, and cost-effectiveness. Here are some key areas where
Azure Blob Storage is ideal:
1. Storing unstructured data (media files, logs, backups, etc.)
2. Big data analytics and data lakes for processing large datasets.
3. Web and mobile application storage for user files.
4. Archiving and disaster recovery for long-term data retention.
5. Logging and event data collection for monitoring and troubleshooting.
6. Data sharing and collaborative workspaces for team-based storage needs.
7. IoT data storage for sensor and device-generated information.
8. Machine learning and AI applications requiring large datasets for
training.
Azure Storage Overview
Azure offers a variety of storage solutions, each designed to address specific types
of data storage needs. Among these are Azure File Storage, Azure Queue
Storage, Azure Table Storage, and Azure Disk Storage (managed disks). Here's an
overview of each:
• Migrating existing on-premises file shares to the cloud for a hybrid cloud
solution.
• An e-commerce website where the order processing system places orders in
a queue, and inventory systems or billing systems asynchronously process
those orders.
• Managing background tasks or delayed jobs in a web application.
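A minimal Python sketch of this producer/consumer pattern, assuming the azure-storage-queue package and a hypothetical "orders" queue:

from azure.core.exceptions import ResourceExistsError
from azure.storage.queue import QueueClient

# Hypothetical connection string; the queue decouples the order-taking front
# end from slower back-end processing.
queue = QueueClient.from_connection_string(
    "<storage-account-connection-string>", queue_name="orders"
)

try:
    queue.create_queue()
except ResourceExistsError:
    pass  # the queue already exists

# Producer side: enqueue a message describing the work to do.
queue.send_message('{"order_id": 1001, "action": "charge_and_ship"}')

# Consumer side: read messages, process them, then delete them.
for message in queue.receive_messages():
    print("processing", message.content)
    queue.delete_message(message)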
• Persistent Data Storage: For applications that require fast and persistent
storage solutions.
Example:
• Attaching a Premium SSD disk to an Azure VM running a high-
performance database or web application.
• Using Standard SSD for a less critical application running on an Azure VM
that requires moderate performance.
In Summary:
• Azure File Storage: Great for managing file shares and providing
SMB/NFS file access over the cloud.
• Azure Queue Storage: Useful for decoupling and managing message
queues in distributed systems.
• Azure Table Storage: Best for storing semi-structured data in key-value
pairs, often used in NoSQL scenarios.
• Azure Disk Storage (managed disks): Ideal for providing persistent disk storage for
virtual machines, databases, and other applications that require scalable,
high-performance storage.
Explain the types of Azure storage accounts
Azure provides different types of storage accounts to cater to various needs, each
offering a specific set of features and performance levels. The type of storage
account you choose depends on factors like your performance needs, access
patterns, and the kind of data you’re storing. Here's a detailed explanation of the
types of Azure storage accounts:
• Supports Blob, File, Queue, and Table services.
• Offers Hot, Cool, and Archive access tiers for blobs.
• Supports Azure Blob Storage, Azure Disk Storage, and Azure Data Lake
Storage Gen2 (for big data and analytics).
• Provides support for advanced data management and access control.
Ideal Use Cases:
• Storing data that needs to be accessed frequently (Hot), less frequently
(Cool), or rarely (Archive).
• Storing unstructured data (e.g., images, videos) or semi-structured data (e.g.,
logs, metadata).
• Storing data for web applications, mobile apps, and cloud-native
applications.
Example:
• A web application that stores media files (images, videos) and logs.
• A mobile app where data is stored in various access tiers depending on how
often it is accessed.
Ideal Use Cases:
• Storing images, videos, and other media files.
• Storing backup data or data lake storage for analytics.
• Storing files for web and mobile applications.
Example:
• A video streaming application that stores and serves video content to users.
• A data analytics pipeline where raw data is stored before it is processed.
5. Table Storage Account
The Table Storage account is designed for storing NoSQL key-value pairs. It is a
highly scalable solution for storing large amounts of semi-structured data that
doesn’t require relational database capabilities.
Key Features:
• Key-value pairs: Supports a schema-less data model where entities are
identified by a PartitionKey and RowKey.
• Scalability: Provides high scalability and performance for large datasets.
• Optimized for read-heavy workloads.
Ideal Use Cases:
• Storing metadata, application logs, or sensor data.
• Storing large amounts of data that doesn't need complex querying.
Example:
• A mobile app storing user preferences and app data as key-value pairs.
• A data analytics platform storing logs or metadata related to processing
jobs.
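A minimal Python sketch of this key-value pattern, assuming the azure-data-tables package and hypothetical table and entity names:

from azure.data.tables import TableServiceClient

# Hypothetical connection string; Table Storage stores schema-less entities
# addressed by a PartitionKey plus a RowKey.
service = TableServiceClient.from_connection_string("<storage-account-connection-string>")
table = service.create_table_if_not_exists("UserPreferences")

entity = {
    "PartitionKey": "user-42",   # groups related rows together
    "RowKey": "theme",           # unique within the partition
    "value": "dark",
}
table.upsert_entity(entity)

# A point lookup by PartitionKey + RowKey is the cheapest query pattern.
print(table.get_entity(partition_key="user-42", row_key="theme"))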
• A SQL database running on Standard SSD for more affordable but still
reliable disk storage.
General-purpose v2 (GPv2): Supports Blob, File, Queue, and Table services; the most flexible account type, suited to most general use cases, including unstructured data storage.
Azure SQL Overview
Azure SQL is a family of fully managed, relational database services provided by
Microsoft Azure. It is based on SQL Server, offering a cloud-based solution for
building, deploying, and managing databases. Azure SQL provides various
services tailored to different needs, such as Azure SQL Database, Azure SQL
Managed Instance, and SQL Server on Azure Virtual Machines.
Azure SQL Database
Azure SQL Database is a Platform-as-a-Service (PaaS) offering that provides
fully managed relational database services. It allows users to build and run
applications without having to manage database infrastructure. Azure SQL
Database automatically handles database management functions like backups,
patching, scaling, and high availability.
Key Features of Azure SQL Database:
1. Fully Managed: Azure SQL Database removes the need to manage the
underlying hardware and database infrastructure. Azure handles backups,
patching, security, and scaling automatically.
2. Scalability: You can scale up or down based on your workload needs. Azure
offers DTU (Database Transaction Units) and vCore models for
scalability.
3. High Availability: Built-in high availability with auto-failover groups,
ensuring business continuity and reducing downtime.
4. Security: Includes features like transparent data encryption (TDE),
advanced threat protection, firewall rules, and always encrypted data.
5. Automatic Backups: Azure SQL Database automatically takes backups
with up to 35 days of retention.
6. Integrated with Azure Services: It integrates seamlessly with other Azure
services like Azure App Services, Power BI, and Azure Functions.
Deployment Options:
• Single Database: A standalone SQL database designed for most general-
purpose applications.
• Elastic Pool: A pool of databases that share resources. This is useful for
SaaS applications with varying usage patterns.
• Managed Instance: A fully managed instance of SQL Server that provides
near 100% compatibility with SQL Server on-premises, making it easier to
migrate SQL Server workloads to Azure.
Access Models:
• DTU Model: Based on a blended measure of CPU, memory, and I/O
throughput.
• vCore Model: Provides more flexibility in performance, allowing you to
choose the number of cores, memory, and storage.
Ideal Use Cases:
• Web and mobile applications.
• Enterprise applications with high-availability needs.
• Analytics and reporting solutions using Power BI.
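From an application's point of view, connecting to Azure SQL Database looks like connecting to any SQL Server. The Python sketch below uses pyodbc with hypothetical server, database, and login names (it assumes the ODBC Driver 18 for SQL Server is installed locally):

import pyodbc

# Hypothetical connection details for an Azure SQL Database logical server.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=mydb;"
    "Uid=sqladmin;Pwd=<password>;"
    "Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
)

cursor = conn.cursor()
cursor.execute("SELECT @@VERSION")   # simple round-trip to verify connectivity
print(cursor.fetchone()[0])
conn.close()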
2. Full Control: More control over configuration, database settings, and
instance-level features.
3. Built-in High Availability: Offers auto-failover groups and zone-
redundant deployments to ensure high availability.
4. Hybrid Capabilities: Supports on-premises SQL Server migrations with
SQL Server Always On.
5. Security and Compliance: Includes transparent data encryption,
managed identity, and advanced threat protection.
Ideal Use Cases:
• Migration from SQL Server to Azure without changing application code.
• Enterprise applications that require the full SQL Server feature set and
need minimal changes during migration.
5. Scaling: Azure VMs provide more flexibility in sizing, scaling, and
managing performance.
Ideal Use Cases:
• Legacy applications requiring SQL Server with full control.
• SQL Server instances requiring complex configurations, third-party
applications, or custom extensions.
b. Transparent Data Encryption (TDE): Data is automatically encrypted
when stored, without needing to change your application.
c. Always Encrypted: Keeps sensitive data encrypted on the client side, so it
remains protected in transit and at rest, even from database administrators.
4. Automatic Backup and Restore:
a. Azure SQL Database includes automatic backups with up to 35 days of
retention. You can restore databases to any point within the retention
period.
5. Geo-Replication:
a. Active Geo-Replication: Enables you to replicate databases to different
regions around the world for high availability and disaster recovery.
b. Auto-failover Groups: Used for automatically failing over to a secondary
server in case of primary database failure, ensuring business continuity.
6. Serverless:
a. Serverless SQL Database: This feature automatically scales compute
resources based on demand and pauses during inactivity, which is ideal for
intermittent workloads.
7. Data Migration:
a. Azure offers various tools like Azure Database Migration Service (DMS)
and SQL Data Sync for seamless migration from on-premises SQL Server
or other databases to Azure SQL Database.
Summary: Key Offerings of Azure SQL
PURCHASING MODEL
The purchasing model for cloud services defines how a customer is billed for the
resources they use, whether it's based on actual consumption, a reserved capacity,
or a specific subscription. Choosing the right model depends on factors like the
predictability of workloads, the need for flexibility, and the cost optimization
goals of an organization. Popular models include Pay-As-You-Go, Reserved
Instances, Spot Pricing, and Subscription Models, among others. Each has its
own set of features, and organizations must select the best model based on their
specific needs.
DATABASE TRANSACTION UNIT (DTU)
A Database Transaction Unit (DTU) is a performance unit used by Microsoft
Azure SQL Database to measure the combined resources of a database. The DTU
model is a blended measure of three key resources that impact the performance of
your database:
1. CPU (Central Processing Unit)
2. Memory (RAM)
3. I/O (Input/Output) throughput (storage and data transfer)
In other words, DTUs represent a pre-configured combination of compute power,
memory, and I/O resources, which are optimized for general-purpose workloads in
Azure SQL Database.
DTU Model: Components Breakdown
1. CPU: The processing power (CPU) required for database operations such as
query execution and processing.
2. Memory: The amount of RAM required for storing data, indexes, query
execution plans, and other in-memory objects.
3. I/O Throughput: The speed at which data is read from or written to disk,
affecting data retrieval and storage performance.
DTU and Service Tiers in Azure SQL Database
Azure SQL Database offers different performance tiers, each of which specifies a
certain number of DTUs. These tiers determine the amount of compute, memory,
and I/O resources allocated to your database.
• Basic: Suitable for light workloads with minimal requirements. Low DTU
allocation.
• Standard: Offers a balanced performance for most business workloads with
moderate DTU allocation.
• Premium: For high-performance, mission-critical applications with high
DTU requirements.
For example:
• Basic Tier might provide a maximum of 5 DTUs.
• Standard Tier might provide 10 to 300 DTUs, depending on the selected
performance level.
• Premium Tier offers higher DTUs, such as 200 DTUs, 400 DTUs, or more.
Choosing the Right Number of DTUs
When selecting the number of DTUs, you are essentially deciding how much CPU,
memory, and I/O throughput you want for your database. More DTUs mean more
resources, which translates to better performance but also higher costs.
1. Low Workloads: If your database has simple requirements (e.g., low
transaction volume), fewer DTUs are needed.
2. Moderate Workloads: If your database has average transaction volume,
moderate DTU levels (e.g., 50–100) may be sufficient.
3. High Performance Workloads: For mission-critical applications, large
databases, or high-traffic websites, you may need higher DTUs (e.g., 200 or
more).
DTU vs. vCore Models
Azure SQL Database also offers a vCore-based model, which is an alternative to
the DTU-based model. The vCore model allows customers to choose the number
of virtual cores (vCores) and other resources (like memory and storage), giving
more granular control over the configuration of their database.
• DTU model is simpler and easier to manage for users who do not need fine-
grained control.
• vCore model is more flexible and is ideal for customers who want to
configure individual components like CPU, memory, and storage
separately.
DTU in Practice
For example, let’s say you choose a Standard S2 tier in Azure SQL Database,
which provides 50 DTUs. This would allocate a specific amount of CPU, memory,
and I/O resources to your database, optimized for general-purpose workloads.
• If you choose a higher tier like Standard S3 with 100 DTUs, it means more
resources and a better ability to handle higher traffic or more intensive
queries.
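One way to move a database between DTU tiers is the T-SQL ALTER DATABASE ... MODIFY (SERVICE_OBJECTIVE = ...) statement. The sketch below issues it from Python through pyodbc with hypothetical server, database, and login names; autocommit is enabled because the statement cannot run inside a transaction.

import pyodbc

# Hypothetical connection details; connect to the logical server's master
# database and scale the target database by name.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=master;Uid=sqladmin;Pwd=<password>;Encrypt=yes;",
    autocommit=True,  # ALTER DATABASE cannot run inside a transaction
)

# Move the database from its current tier to Standard S3 (100 DTUs).
conn.execute("ALTER DATABASE [mydb] MODIFY (SERVICE_OBJECTIVE = 'S3');")
conn.close()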
Explain SQL Server on Azure VM, Azure SQL Managed Instance, and Azure SQL Database
SQL Server on Azure VM, Azure SQL Managed Instance, and Azure SQL
Database: Side-by-Side Comparison
Microsoft Azure offers different solutions for hosting SQL Server-based
workloads. These solutions can be broadly categorized into three options:
1. SQL Server on Azure Virtual Machine (VM)
2. Azure SQL Managed Instance
3. Azure SQL Database
Each of these solutions has its own strengths, capabilities, and use cases. Here's a
comparison of the three:
SQL Version Support:
• SQL Server on Azure VM: Full support for all SQL Server versions.
• Azure SQL Managed Instance: Supports SQL Server 2008 and later, with full compatibility.
• Azure SQL Database: Supports recent SQL Server versions (e.g., SQL Server 2016, 2017, 2019).
Control Over SQL Server:
• SQL Server on Azure VM: Full control over the SQL Server instance and OS.
• Azure SQL Managed Instance: Managed instance with limited control, but still allows configuration and tuning.
• Azure SQL Database: No access to the underlying server or OS; SQL database only.
High Availability:
• SQL Server on Azure VM: Requires configuration of HA (High Availability) solutions (e.g., Always On).
• Azure SQL Managed Instance: Built-in high availability with automatic failover and multi-region support.
• Azure SQL Database: Built-in high availability with geo-replication and automatic failover.
Backup and Recovery:
• SQL Server on Azure VM: Requires manual configuration for backups.
• Azure SQL Managed Instance: Automated backups with point-in-time restore.
• Azure SQL Database: Automated backups and point-in-time restore (up to 35 days).
Performance Tuning:
• SQL Server on Azure VM: Full control over tuning, indexes, and SQL Server settings.
• Azure SQL Managed Instance: Managed with limited tuning options, but highly optimized.
• Azure SQL Database: Managed with limited control over performance tuning and configuration.
Cost Structure:
• SQL Server on Azure VM: Pay-as-you-go for VM size, storage, and SQL Server license (or bring-your-own license - BYOL).
• Azure SQL Managed Instance: Pay-as-you-go with pricing based on instance size (vCores) and storage.
• Azure SQL Database: Pay-as-you-go based on database tier (Basic, Standard, Premium) or vCore model.
Detailed Breakdown:
1. SQL Server on Azure VM (IaaS)
• Use Case: Ideal for customers who want to lift-and-shift their on-premises
SQL Server workloads to the cloud without needing significant changes to
their application or database. This solution offers full control over both the
operating system and SQL Server instance.
• Control: Full control over SQL Server configuration, OS settings, patches,
and updates. You manage the installation, tuning, and scaling of the system,
which means you have more flexibility but also more responsibility.
• Management Overhead: You must manage patching, backups, high
availability (HA) configurations, and security updates. Azure does not
handle this automatically for you.
• Pricing: You pay for the virtual machine size (vCPU, memory), storage, and
SQL Server licensing (or bring your own license).
• Pros:
o Full flexibility and control over your SQL Server environment.
o Useful for legacy applications that require compatibility with specific SQL
Server features.
• Cons:
o More management overhead and responsibility.
o Requires expertise in configuring high availability, backups, and disaster
recovery.
2. Azure SQL Managed Instance (PaaS)
• Control: Managed instance gives you full compatibility with SQL Server,
but Microsoft manages most of the underlying infrastructure, including OS,
backups, patching, and updates. You can configure database settings but not
the underlying infrastructure.
• Management Overhead: Minimal management is needed. Patching,
backups, high availability, and disaster recovery are handled by Azure,
making it more convenient than managing SQL Server on a VM.
• Pricing: Azure SQL Managed Instance is priced based on compute (vCores)
and storage. It also offers a more predictable pricing structure compared to
the SQL Server on Azure VM model.
• Pros:
o High compatibility with SQL Server features like SQL Agent, Linked
Servers, and full-text search.
o Automated management of patches, backups, and high availability.
o Easier migration for on-premises SQL Server databases with minimal code
changes.
• Cons:
o Limited control compared to SQL Server on an Azure VM.
o More complex than Azure SQL Database for cloud-native applications.
Summary Table:
High Availability:
• SQL Server on Azure VM: Requires configuration (e.g., Always On).
• Azure SQL Managed Instance: Built-in HA with automatic failover.
• Azure SQL Database: Built-in HA with geo-replication and auto-failover.
Ideal For:
• SQL Server on Azure VM: Lift-and-shift, full control over the environment.
• Azure SQL Managed Instance: SQL Server migrations with minimal changes.
• Azure SQL Database: Cloud-native applications, lightweight workloads.