Introduction to Big Data
LESSON 1:
Characteristics of Data
1. Accuracy
2. Validity
3. Reliability
4. Timeliness
5. Relevance
6. Completeness
Data can be broadly classified as structured or unstructured.
Structured Data: data organized in a predefined, fixed format (such as rows and columns), which makes it meaningful and easy to process.
Unstructured Data: data that cannot readily be classified and fit into a neat box, such as free text, images, audio, and video.
Big Data
1. Cost Savings: Big Data tools like Apache Hadoop and Spark bring cost-saving benefits to businesses that have to store large amounts of data. These tools also help organizations identify more effective ways of doing business.
2. Time Savings: in-memory and real-time tools can quickly gather data from many sources, including unstructured data, so businesses can analyze it and act on it immediately.
Types of Big Data
Now that we are on track with what big data is, let's have a look at the types of big data:
o Volume
o Veracity
o Variety
o Value
o Velocity
Volume
Variety
Example: Web server logs, i.e., the log file is created and
maintained by some server that contains a list of activities.
Veracity
Value
Velocity
Big data velocity deals with the speed at which data flows from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.
Applications of Big Data:
- Web analytics
- Big data and marketing
- Fraud and big data
- Risk and big data
- Credit risk management
- Big data and algorithmic trading
- Big data and healthcare
- Big data in medicine
- Advertising
Three types of big data that are a big deal for marketing
Customer: The big data category most familiar to marketing
may include behavioral, attitudinal and transactional metrics
from such sources as marketing campaigns, points of sale,
websites, customer surveys, social media, online
communities and loyalty programs.
Unstructured Data
Alternate Data
Non-traditional data sources could include mobile phone
usage, utility bill payments, and geolocation data. These
data sources are mainly helpful for individuals with little or
no traditional credit history.
Data Preprocessing
Big data has the potential to improve health care for the
better. Here are some of the most common benefits of using
big data in health care:
Better patient care: More patient data means an
opportunity to understand the patient experience better and
improve the care they receive.
Improved research: Big data gives medical researchers
unprecedented access to a large volume of data and
methods of collecting data. In turn, this data can drive
important medical breakthroughs that save lives.
Smarter treatment plans: Analyzing the treatment plans
that helped patients (and those that didn't) can help
providers design more effective treatments for future patients.
Introduction to Hadoop
Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. Its programming model is easy.
4. It offers huge, flexible storage.
5. It is low cost.
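Hadoop's core programming model, MapReduce, can be sketched in a few lines of plain Python (a toy single-machine simulation, not the Hadoop API): the mapper emits key/value pairs, a shuffle step groups them by key, and the reducer aggregates each group.

```python
from collections import defaultdict

# Toy simulation of MapReduce: map -> shuffle -> reduce.

def mapper(line):
    # Emit a (word, 1) pair for every word in the line.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Aggregate the grouped values for one key.
    return (key, sum(values))

def word_count(lines):
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big ideas", "data everywhere"])
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'everywhere': 1}
```

On a real cluster the same mapper and reducer run in parallel across many nodes, with Hadoop handling the shuffle, fault tolerance, and storage.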
Benefits: comprehensive view of network traffic; identifies encrypted traffic; detects threats within lateral movement.
Limitations: requires specialized hardware and software; requires multiple firewalls and efficient data exchange; raises privacy concerns.
LESSON 2:
Introduction to NoSQL
BDA UNIT-2
LESSON 3:
Data formats, analyzing data with Hadoop
(Scalability)
(Data Locality)
Components of Hadoop
Lossless Compression
Lossy Compression
Serialization in Hadoop
Serialization is the process of converting in-memory data structures or
objects into a format suitable for storage or transmission. In Apache
Hadoop, serialization is used both for interprocess communication
between nodes and for writing data to persistent storage.
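As a plain-Python illustration of the idea (not Hadoop's Writable interface; the field layout and names here are invented for the example), the sketch below packs a record into a compact byte string and restores it:

```python
import struct

# Serialize an employee record to bytes: 4-byte int id, 4-byte float
# salary, 4-byte name length, then the UTF-8 name bytes.

def serialize(rec: dict) -> bytes:
    name = rec["name"].encode("utf-8")
    header = struct.pack(">ifI", rec["id"], rec["salary"], len(name))
    return header + name

def deserialize(data: bytes) -> dict:
    rec_id, salary, n = struct.unpack_from(">ifI", data)
    name = data[12:12 + n].decode("utf-8")  # header is 12 bytes
    return {"id": rec_id, "salary": salary, "name": name}

rec = {"id": 1, "salary": 50000.0, "name": "John Doe"}
blob = serialize(rec)
print(len(blob))              # 20 bytes on the wire
print(deserialize(blob) == rec)  # True: lossless round trip
```

Compact, schema-driven binary layouts like this are why Hadoop favors its own serialization formats over verbose text formats for internal traffic.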
LESSON 4
Introduction to Hive
Features of Hive
Hive Client
Hive Services
Hive Commands
DDL commands are used to define and manage the structure of tables
and databases in Hive.
Examples:
Create Database:
CREATE DATABASE my_database;
Create Table:
CREATE TABLE employees (
id INT,
name STRING,
salary FLOAT
);
Drop Table:
DROP TABLE employees;
DML commands are used to query, insert, or modify data in Hive tables.
Examples:
Insert Data (directly):
INSERT INTO employees VALUES (1, 'John Doe', 50000.0);
Load Data:
LOAD DATA INPATH '/path/to/employees_data.csv' INTO TABLE employees;
Export Table:
EXPORT TABLE employees TO '/backup/employees/';
DQL commands are used to retrieve and query data from Hive tables.
Examples:
Select Statement:
SELECT name, salary FROM employees WHERE salary > 30000;
Group By:
SELECT department, AVG(salary) FROM employees GROUP BY department;
Order By:
SELECT name, salary FROM employees ORDER BY salary DESC;
Join:
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.id;
DCL commands are used to manage permissions and access control in Hive.
Examples:
Grant Permissions:
GRANT SELECT ON TABLE employees TO USER 'john';
Revoke Permissions:
REVOKE SELECT ON TABLE employees FROM USER 'john';
5. Utility Commands
Examples:
Show Databases:
SHOW DATABASES;
Show Tables:
SHOW TABLES IN my_database;
Describe Table:
DESCRIBE FORMATTED employees;
Explain Query:
EXPLAIN SELECT name FROM employees WHERE salary > 50000;
6. Transaction Commands
Examples:
Enable Transactions:
SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Insert Data:
INSERT INTO employees PARTITION (department='HR') VALUES (2, 'Jane Doe', 60000.0);
Update Data:
UPDATE employees SET salary = 70000 WHERE id = 1;
Delete Data:
DELETE FROM employees WHERE salary < 30000;
DML (Data Manipulation): query and modify data (INSERT, LOAD, EXPORT).
1. Inner Join
Returns only the rows that have matching values in both tables.
Syntax:
SELECT a.column1, b.column2
FROM table1 a
JOIN table2 b
ON a.common_column = b.common_column;
Example:
SELECT e.name, d.department_name
FROM employees e
JOIN departments d
ON e.department_id = d.id;
2. Left Outer Join
Returns all rows from the left table and the matching rows from the right
table. Rows in the left table without a match will have NULL in the
columns from the right table.
Syntax:
SELECT a.column1, b.column2
FROM table1 a
LEFT OUTER JOIN table2 b
ON a.common_column = b.common_column;
Example:
SELECT e.name, d.department_name
FROM employees e
LEFT OUTER JOIN departments d
ON e.department_id = d.id;
3. Right Outer Join
Returns all rows from the right table and the matching rows from the left
table. Rows in the right table without a match will have NULL in the
columns from the left table.
Syntax:
SELECT a.column1, b.column2
FROM table1 a
RIGHT OUTER JOIN table2 b
ON a.common_column = b.common_column;
Example:
SELECT e.name, d.department_name
FROM employees e
RIGHT OUTER JOIN departments d
ON e.department_id = d.id;
4. Full Outer Join
Returns all rows from both tables; rows with no match in the other table have NULL in that table's columns.
Syntax:
SELECT a.column1, b.column2
FROM table1 a
FULL OUTER JOIN table2 b
ON a.common_column = b.common_column;
Example:
SELECT e.name, d.department_name
FROM employees e
FULL OUTER JOIN departments d
ON e.department_id = d.id;
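The join semantics above can be checked with an in-memory SQLite database standing in for Hive (table and column names mirror the examples; the sample rows are invented): the inner join drops the employee with no department, while the left outer join keeps that row with NULL.

```python
import sqlite3

# Compare INNER vs LEFT OUTER JOIN on the same sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER, department_name TEXT);
    CREATE TABLE employees (name TEXT, department_id INTEGER);
    INSERT INTO departments VALUES (1, 'HR'), (2, 'Engineering');
    INSERT INTO employees VALUES ('Alice', 1), ('Bob', 2), ('Carol', NULL);
""")

# Inner join: Carol has no department, so her row is dropped.
inner = sorted(conn.execute("""
    SELECT e.name, d.department_name
    FROM employees e JOIN departments d ON e.department_id = d.id
""").fetchall())
print(inner)  # [('Alice', 'HR'), ('Bob', 'Engineering')]

# Left outer join: Carol is kept, with NULL (None) for the department.
left = sorted(conn.execute("""
    SELECT e.name, d.department_name
    FROM employees e LEFT OUTER JOIN departments d ON e.department_id = d.id
""").fetchall())
print(left)  # [('Alice', 'HR'), ('Bob', 'Engineering'), ('Carol', None)]
```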
Window functions
1. ROW_NUMBER()
Assigns a unique sequential number to each row within a window partition.
Syntax:
SELECT emp_id, sale_amount, dept_id,
ROW_NUMBER() OVER (PARTITION BY dept_id
ORDER BY sale_amount DESC) AS rank
FROM sales_data;
2. RANK()
Assigns a rank to each row within a window. Ties are given the same
rank, skipping subsequent ranks.
Syntax:
SELECT emp_id, sale_amount, dept_id,
RANK() OVER (PARTITION BY dept_id
ORDER BY sale_amount DESC) AS rank
FROM sales_data;
3. DENSE_RANK()
Syntax:
SELECT emp_id, sale_amount, dept_id,
DENSE_RANK() OVER (PARTITION BY dept_id
ORDER BY sale_amount DESC) AS rank
FROM sales_data;
4. NTILE(n)
Divides rows into n buckets and assigns a bucket number to each row.
Syntax:
SELECT emp_id, sale_amount, dept_id,
NTILE(4) OVER (PARTITION BY dept_id
ORDER BY sale_amount DESC) AS bucket
FROM sales_data;
5. LEAD()
Accesses data from the next row in the same result set.
Syntax:
SELECT emp_id, sale_amount, dept_id,
LEAD(sale_amount) OVER (PARTITION BY
dept_id ORDER BY sale_amount) AS next_sale
FROM sales_data;
6. LAG()
Accesses data from the previous row in the same result set.
Syntax:
SELECT emp_id, sale_amount, dept_id,
LAG(sale_amount) OVER (PARTITION BY
dept_id ORDER BY sale_amount) AS prev_sale
FROM sales_data;
7. SUM() as a Window Function
Computes a cumulative (running) total within each partition.
Syntax:
SELECT emp_id, sale_amount, dept_id,
SUM(sale_amount) OVER (PARTITION BY
dept_id ORDER BY sale_amount) AS cumulative_sum
FROM sales_data;
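These window functions are not Hive-specific: SQLite, which ships with Python's standard library (version >= 3.25 assumed), implements the same ones, so their behavior can be checked locally on an invented sales_data table. Note how RANK() jumps to 4 after the tie while DENSE_RANK() continues with 3, and how tied rows share the same running total:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_data (emp_id INTEGER, dept_id INTEGER, sale_amount INTEGER);
    INSERT INTO sales_data VALUES
        (1, 10, 500), (2, 10, 300), (3, 10, 300), (4, 20, 700), (5, 10, 100);
""")

# RANK leaves a gap after ties; DENSE_RANK does not.
ranks = conn.execute("""
    SELECT emp_id,
           RANK()       OVER (PARTITION BY dept_id ORDER BY sale_amount DESC) AS rnk,
           DENSE_RANK() OVER (PARTITION BY dept_id ORDER BY sale_amount DESC) AS dense_rnk
    FROM sales_data ORDER BY emp_id
""").fetchall()
print(ranks)  # [(1, 1, 1), (2, 2, 2), (3, 2, 2), (4, 1, 1), (5, 4, 3)]

# Cumulative SUM: tied rows (300 and 300) share one running total.
cumulative = conn.execute("""
    SELECT emp_id,
           SUM(sale_amount) OVER (PARTITION BY dept_id ORDER BY sale_amount) AS cum
    FROM sales_data ORDER BY emp_id
""").fetchall()
print(cumulative)  # [(1, 1200), (2, 700), (3, 700), (4, 700), (5, 100)]
```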
Optimization
1. Partitioning:
2. Bucketing:
6. Indexing:
a. Map Join:
Convert large joins into map-side joins when one table is small.
set hive.auto.convert.join=true;
b. Broadcast Join:
LESSON 5
APACHE SPARK
What is Spark?
Apache Spark is an open-source, distributed computing framework designed to process large volumes of data quickly and efficiently.
Hadoop vs. Spark
Latency: Hadoop is a high-latency computing framework without an interactive mode; Spark is a low-latency framework that can process data interactively.
Data: With Hadoop MapReduce, a developer can only process data in batch mode; Spark can process real-time data from real-time events like Twitter and Facebook.
Machine Learning: Data fragments in Hadoop can be too large and can create bottlenecks, so it is slower than Spark; Spark is much faster as it uses MLlib for computations and has in-memory processing.
Scalability: Hadoop is easily scalable by adding nodes and disk storage, and supports tens of thousands of nodes; Spark is harder to scale because it relies on RAM for computations, and supports thousands of nodes in a cluster.
LAZY EVALUATION
3. Distributed Memory:
o Spark divides datasets into partitions and stores them in the
memory of worker nodes across the cluster.
4. Transformation and Action Model:
o Transformations (like map, filter, flatMap) are lazy;
they don't execute until an action (like collect, count,
save) is performed.
o Intermediate results from transformations can be stored in
memory for subsequent operations.
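Lazy evaluation is easy to mimic with Python generators (a sketch of the concept, not Spark code): building up the "transformation" pipeline touches no data, and only the final "action" pulls records through it.

```python
# Spark-style lazy evaluation with generators: transformations build a
# pipeline; the action at the end forces computation.

log = []  # records when the data source is actually read

def numbers():
    for n in range(1, 6):
        log.append(f"read {n}")
        yield n

# "Transformations": nothing has been read yet.
doubled = (n * 2 for n in numbers())          # like map
evens_over_four = (n for n in doubled if n > 4)  # like filter
assert log == []  # still lazy: no data touched

# "Action": sum() drives the whole pipeline in one pass.
total = sum(evens_over_four)
print(total)     # 6 + 8 + 10 = 24
print(len(log))  # all 5 reads happened only at action time
```

Spark works the same way: map and filter merely extend the execution plan, and collect or count triggers the actual job.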
DAG
SPARK CONTEXT
SparkContext is the entry point to the Apache Spark
cluster, and it is used to configure and coordinate the
execution of Spark jobs.
It is typically created once per Spark application and
is the starting point for interacting with Spark.
SPARK SESSION
RDD
What is RDD?
An RDD (Resilient Distributed Dataset) is Spark's fundamental data abstraction: an immutable, fault-tolerant collection of records partitioned across the nodes of the cluster and processed in parallel.
Spark creates an operator graph. When the user runs an action (like
collect), the graph is submitted to a DAG Scheduler.
The DAG scheduler divides the operator graph into (map and reduce)
stages. A stage is comprised of tasks based on partitions of the
input data. The DAG scheduler pipelines operators together to
optimize the graph; for example, many map operators can be scheduled
in a single stage.
Transformation Description
Action
CATALYST OPTIMIZER
2. Extracting Components
3. Date Arithmetic
5. Filtering/Comparing
current_date(): returns the current system date. Example: current_date() -> 2024-11-17
current_timestamp(): returns the current system timestamp. Example: current_timestamp() -> 2024-11-17 12:00:00
to_date(): converts a string or timestamp to a date. Example: to_date('2024-11-17 12:00:00') -> 2024-11-17
to_timestamp(): converts a string to a timestamp. Example: to_timestamp('2024-11-17 12:00:00')
date_format(): formats a date or timestamp as a string according to a pattern.
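The Spark SQL date functions above map onto familiar operations; here is a plain-Python sketch with the datetime module (an illustration of what each function does, not Spark code; the output format pattern is an arbitrary example):

```python
from datetime import date, datetime

ts_string = "2024-11-17 12:00:00"

# to_timestamp(): parse a string into a timestamp.
ts = datetime.strptime(ts_string, "%Y-%m-%d %H:%M:%S")

# to_date(): reduce a timestamp to its date part.
d = ts.date()
print(d)  # 2024-11-17

# date_format(): render a timestamp using a pattern.
print(ts.strftime("%d/%m/%Y"))  # 17/11/2024

# current_date() / current_timestamp() analogues:
today = date.today()
now = datetime.now()
```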
1. Identifying Nulls
2. Replacing Nulls
3. Removing Nulls
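The three steps above can be demonstrated against an in-memory SQLite table (the sample data is invented; COALESCE works the same way in Hive, which also offers NVL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary REAL);
    INSERT INTO employees VALUES ('Alice', 50000), ('Bob', NULL), ('Carol', 70000);
""")

# 1. Identifying nulls with IS NULL.
missing = conn.execute(
    "SELECT name FROM employees WHERE salary IS NULL").fetchall()
print(missing)  # [('Bob',)]

# 2. Replacing nulls with a default via COALESCE.
filled = conn.execute(
    "SELECT name, COALESCE(salary, 0) FROM employees ORDER BY name").fetchall()
print(filled)   # [('Alice', 50000.0), ('Bob', 0), ('Carol', 70000.0)]

# 3. Removing rows that contain nulls with IS NOT NULL.
kept = conn.execute(
    "SELECT name FROM employees WHERE salary IS NOT NULL ORDER BY name").fetchall()
print(kept)     # [('Alice',), ('Carol',)]
```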
LESSON 6