Module 1: Introduction to Big Data Analytics
Theory (PP): 30 hours
Practical (PR): 15 hours
Total: 45 hours
Module – I
Data analytics: Introduction to big data analytics, classification of analytics, terminologies used in the big data environment, use cases of big data analytics, data analytics using Python.
Introduction to Hadoop: Hadoop overview, Hadoop history, importance of Hadoop, RDBMS versus Hadoop, distributed computing challenges, use cases of Hadoop.
Module – II [9 hours]
HDFS (Hadoop Distributed File System), processing data with Hadoop, mapping resources and applications with Hadoop YARN, big data environment, Hadoop ecosystem, components of the Hadoop ecosystem, HDFS commands, Linux commands, Hadoop environment setup, YARN, introduction to MapReduce.
Introduction to NoSQL: advantages of NoSQL, SQL vs NoSQL.
Module – III [10 hours]
Introduction to MongoDB: Introduction to MongoDB, importance of MongoDB, data types in MongoDB, terms used in RDBMS and MongoDB, installation of MongoDB, MongoDB query language, hands-on with MongoDB.
Introduction to Cassandra: Introduction to Cassandra, features of Apache Cassandra, CQL data types, CQLSH, CRUD operations, TTL, the ALTER command, import and export, and querying system tables.
Module – IV [8 hours]
Data Ingestion and Processing in Hadoop: Data ingestion tools Sqoop, Flume, and Kafka. Basics of Hive, Hive architecture, Hive data types, Hive file formats, Hive environment setup, HQL (Hive Query Language), parsing file implementation, SerDe.
Introduction to graph databases.
Module – V
Apache Spark: Introduction to Apache Spark, characteristics of Apache Spark, use cases of Apache Spark, Apache Spark architecture, Hadoop vs Spark, Apache Spark environment setup, introduction to SparkContext, Spark RDD, Spark caching, Spark DataFrame operations, Spark functions in a distributed environment.
02/23/2025 Module I : Introduction to Big Data and Analytics 9
Textbooks
1. Seema Acharya, Subhashini Chellappan, "Big Data and Analytics", 2nd edition, Wiley, 2023. ISBN 978-81-265-7951-8
2. Jeffrey Aven, "Data Analytics with Spark Using Python", Pearson, 2018
Reference books
1. Mayank Bhushan, "Big Data and Hadoop", BPB Publications, 1st edition, 2018
2. "Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization", Dreamtech Press, 2016
3. Ivan Marin, Ankit Shukla, Sarang VK, "Big Data Analysis with Python", Packt, 2019
4. Balamurugan Balusamy, Nandhini Abirami R, Seifedine Kadry, Amir Gandomi, "Big Data: Concepts, Technology and Architecture", Wiley, 2021
5. Parag Kulkarni, Sarang Joshi, Meta S. Brown, "Big Data Analytics", PHI Learning Private Limited, 2016
6. Raj Kamal and Preeti Saxena, "Big Data Analytics: Introduction to Hadoop, Spark and Machine Learning", McGraw Hill Education (India) Private Limited, 2019
1. IBM Introduction to Big Data with Spark and Hadoop [15 hours]
https://www.coursera.org/learn/introduction-to-big-data-with-spark-hadoop
2. IBM Introduction to NoSQL Databases [18 hours]
https://www.coursera.org/learn/introduction-to-nosql-databases?specialization=nosql-big-data-and-spark-foundations
Data: Quantitative vs Qualitative
• Quantitative data: measurable things that have a fixed reality; collected through measuring; close-ended.
• Qualitative data: descriptive; collected through observation, field work, focus groups, interviews, recording or filming conversations; open-ended.
Traditional computer science: for example, a busy web server's access logs.
[Figure: graph of the access logs]
Big Data, what is it?
[Diagram: overlap of traditional computer science and statistics]
Data with a large number of observations and/or features: a non-traditional sample size (i.e. > 300 subjects) that can't be analyzed in standard statistics tools (Excel).
https://www.facebook.com/itspark.in/videos/what-is-big-data-and-how-does-it-work/597613787402602/
Old model: few companies generate data; everyone else consumes it.
New model: all of us generate data, and all of us consume it.
• When the business asks you to perform a data science project, first prepare a project charter that includes a timetable and deliverables.
• The project charter already states which data you need and where you can find it.
• Ensure you can use the data in your program, which means checking the existence of, quality of, and access to the data.
• Enhance the quality of the data and prepare it for use in subsequent steps.
• Results can take many forms, ranging from presentations to research reports.
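The data checks above (existence, quality, access) can be sketched in plain Python. This is a minimal sketch: the records and field names are hypothetical examples, not part of the course material.

```python
# Minimal data-quality check: count missing and wrongly typed values.
# The records and field names here are hypothetical examples.
records = [
    {"id": 1, "age": 34, "city": "Pune"},
    {"id": 2, "age": None, "city": "Delhi"},    # missing value
    {"id": 3, "age": "n/a", "city": "Mumbai"},  # wrong type
]

def quality_report(rows, field):
    """Count missing and non-numeric values for one field."""
    missing = sum(1 for r in rows if r.get(field) is None)
    bad_type = sum(1 for r in rows
                   if r.get(field) is not None
                   and not isinstance(r[field], (int, float)))
    return {"total": len(rows), "missing": missing, "bad_type": bad_type}

report = quality_report(records, "age")
print(report)  # {'total': 3, 'missing': 1, 'bad_type': 1}
```

A report like this tells you before modeling whether the data needs cleaning or whether access to a better source is required.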
Data → Insights (ChatGPT, January 2023)
One of the main challenges of big data is the ability to store, process, and analyze it effectively. Traditional data
processing methods and technologies are often not able to handle the volume, velocity, and variety of big data.
As a result, new technologies and approaches, such as distributed computing and machine learning, have been
developed to help organizations make sense of their big data.
Big data can have a wide range of applications, from improving business operations and customer service to enabling new scientific discoveries and advancements in healthcare. For example, in business, big data can be used to gain insights into customer behavior, identify new market opportunities, and optimize supply chain operations. In healthcare, big data can be used to improve patient outcomes and develop personalized treatment plans. (ChatGPT, January 2023)
Overall, big data is a rapidly growing field with many potential benefits for organizations and individuals, but it also raises privacy and security concerns. Therefore, it is important for organizations to have a robust data governance framework and for individuals to understand the implications of data collection and use.
Big Data, in demand?
• Short Answer:
[Diagram: overlap of computation science and statistics]
Big Data, What is it?
• Short Answer: CSE545 focuses on:
  - How to analyze data that is mostly too large for main memory.
  - Analyses only possible with a large number of observations or features.
Big Data, What is it?
Goal: Generalizations (a model or summarization of the data)
E.g.
● Google’s PageRank: summarizes web pages by a single
number.
● Twitter financial market predictions: Models the stock
market according to shifts in sentiment in Twitter.
● Distinguish tissue type in medical images: Summarizes
millions of pixels into clusters.
● Mental health diagnosis in social media: Models presence of diagnosis as a distribution (a summary) of linguistic patterns.
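Google's PageRank, listed above, is the classic "one number per page" summarization. A minimal power-iteration sketch of the idea follows; the three-page link structure is a made-up example, not from the course material.

```python
# Power-iteration sketch of PageRank on a tiny 3-page web graph.
# The link structure is a hypothetical example.
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0]}  # links[i] = pages that page i links to
n, d = 3, 0.85                       # number of pages, damping factor

# Column-stochastic transition matrix M: M[j, i] = 1/outdegree(i) if i links to j
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

rank = np.full(n, 1.0 / n)
for _ in range(100):  # iterate r = (1-d)/n + d*M@r until it stabilizes
    rank = (1 - d) / n + d * M @ rank

print(rank)  # page 2, linked to by both other pages, scores highest
```

The resulting vector sums to 1 and summarizes each page's importance as a single number, which is exactly the "generalization" the slide describes.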
Big Data, What is it?
1. Descriptive analytics
   Describes (generalizes) the data itself.
2. Predictive analytics
   Creates something generalizable to new data.
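The distinction can be shown with a small numeric sketch: descriptive analytics summarizes the values we already have, while predictive analytics fits a model and applies it to new inputs. The sales figures below are hypothetical.

```python
import numpy as np

# Hypothetical monthly sales figures for months 0..4
sales = np.array([10.0, 12.0, 13.0, 15.0, 16.0])

# Descriptive: summarize the observed data
mean_sales = sales.mean()          # 13.2

# Predictive: fit a trend line, then generalize to a new month
months = np.arange(len(sales))
slope, intercept = np.polyfit(months, sales, 1)  # least-squares line
forecast = slope * 5 + intercept   # estimate for unseen month 5

print(mean_sales, round(forecast, 1))  # 13.2 17.7
```

The mean describes the past; the fitted line makes a claim about data not yet seen, which is what makes it predictive.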
Deep Learning Frameworks
Big Data Analytics, The Class
Goal: Generalizations (a model or summarization of the data)
Descriptive | Predictive | Prescriptive
Techniques in Descriptive Analytics
• Data Aggregation: Collecting data from multiple sources and summarizing it to provide a cohesive view. Instrumental in generating company reports and KPIs, enabling businesses to track performance metrics over time.
• Data Mining: Helps identify past events and uncover patterns that may not be immediately apparent. Important for historical analysis, allowing businesses to understand trends and behaviors that have shaped their performance.
Methods Used in Descriptive Analytics
• Observations: Collecting data based on observed behaviors or events within the organization.
• Case Studies: Conducting in-depth examinations of specific instances to gain detailed insights.
• Surveys: Using questionnaires to gather data and understand trends and patterns among stakeholders.
• You often need to move data from one source to another, and this is where data integration frameworks such as Apache Sqoop and Apache Flume excel.
NoSQL technology types:
• Column databases
• Document stores
• Streaming data
• Key-value stores
• SQL on Hadoop
• NewSQL
• Graph databases
Column databases: Data is stored in columns, which allows algorithms to perform much faster queries. Newer technologies use cell-wise storage. Table-like structures are still important.
Document stores: Document stores no longer use tables, but store every observation in a document. This allows for a much more flexible data scheme.
Streaming data: Data is collected, transformed, and aggregated not in batches but in real time.
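The flexible schema of a document store can be illustrated with plain Python dicts standing in for JSON documents. This is a minimal sketch; the collection and its fields are hypothetical, not the API of any particular database.

```python
import json

# Two "documents" in the same collection with different fields --
# a flexibility that a fixed relational table schema does not allow.
docs = [
    {"_id": 1, "name": "Asha", "email": "asha@example.com"},
    {"_id": 2, "name": "Ravi", "phones": ["98x", "99x"], "city": "Pune"},
]

# A simple query: find every document that has a "city" field
with_city = [d for d in docs if "city" in d]
print(json.dumps(with_city))
```

In a real document store such as MongoDB the same idea appears as a collection of JSON-like documents queried by field presence and value.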
• For example, if you can gain 10% on a cluster of 100 servers, you
save the cost of 10 servers.
• You want to allow others to use the predictions made by your application.
• Service tools excel here by exposing big data applications to other applications
as a service.
• The best-known example is the REST service; REST stands for representational state transfer.
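Exposing predictions as a service can be sketched in plain Python. Everything here is hypothetical: the /predict endpoint, the toy model, and the dispatch helper are illustrations only; a real deployment would mount this logic behind an HTTP server (e.g. Python's http.server or a web framework).

```python
import json

def predict(x):
    """Hypothetical model: returns a score for input x."""
    return {"input": x, "score": round(0.5 * x + 1.0, 2)}

def handle_get(path):
    """Minimal REST-style dispatch: GET /predict/<x> returns JSON."""
    parts = path.strip("/").split("/")
    if len(parts) == 2 and parts[0] == "predict":
        body = json.dumps(predict(float(parts[1])))
        return 200, body
    return 404, json.dumps({"error": "not found"})

status, body = handle_get("/predict/4")
print(status, body)
```

The point of the REST pattern is that other applications only need HTTP and JSON to consume the prediction, not access to the big data application itself.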
• Big data security tools allow you to have central and fine-grained
control over access to the data.
• Manufacturing
• Retail
• Healthcare
• Oil and gas
• Telecommunications
• Financial services
Oil exploration and discovery : Exploring for oil and gas can
be expensive. But companies can make use of the vast
amount of data generated in the drilling and production
process to make informed decisions about new drilling sites.
Data generated from seismic monitors can be used to find
new oil and gas sources by identifying traces that were
previously overlooked.
For the past few years, the oil and gas industry has been leveraging big data to find new ways to innovate. The industry has long made use of data sensors to track and monitor the performance of oil wells, machinery, and operations. Oil and gas companies have been able to harness this data to monitor well activity, create models of …
Oil production optimization: Unstructured sensor and historical data can be used to optimize oil well production. By creating predictive models, companies can measure well production to understand usage rates. With deeper data analysis, engineers can determine why actual well outputs aren't tallying with their predictions.
V. Telecommunications
The popularity of smartphones and other mobile devices has given telecommunications companies tremendous growth opportunities. But there are challenges as well, as organizations work to keep pace with customer demands for new offerings.
Optimize network capacity: Optimal network performance is essential for a telecom's success. Network usage analytics can help companies identify areas with excess capacity and reroute bandwidth as needed. Big data analytics can help them plan for infrastructure investments and design new services that meet customer demands. With new insights, telecoms are able to maintain customer loyalty and avoid losing revenue to competitors.
Telecom customer churn: By analyzing the data telecoms already have about service quality, convenience, and other factors, telecoms can predict overall customer satisfaction. And they can set up alerts when customers are at risk of churning, and take action with proactive retention campaigns.
New products and offerings: Big data provides valuable insights to help companies design new products and features. An improved understanding of customer behavior enables companies to tailor services to different customer segments for future offerings.
VI. Financial services
Forward-thinking banks and financial services firms are capitalizing on big data. From capturing new market opportunities to reducing fraud, financial services organizations have been able to convert big data into a competitive advantage.
Fraud and compliance: Companies can identify patterns that indicate fraud and aggregate large volumes of information to streamline regulatory reporting.
Drive innovation: Big data offers valuable insights that help organizations innovate. Big data analytics makes the interdependencies between humans, institutions, entities, and processes more apparent. With better understanding of market trends and customer needs, organizations can improve decision-making about new products and services.
Anti-money laundering: Financial services firms are under more pressure than ever before from governments passing anti-money laundering laws. These laws require that banks show proof of proper diligence and submit suspicious activity reports. Analytics can help companies identify potential fraud patterns.
Financial regulatory and compliance analytics: Financial services companies must be in compliance with a wide variety of requirements concerning risk, conduct, and transparency. At the same time, banks must comply with the Dodd-Frank Act, Basel III, and other regulations that …
Data Analytics Using Python
import numpy as np
from scipy import stats

# Sample data
arr = [5, 6, 11]
# Mean
mean = np.mean(arr)
print("Mean = ", mean)

# Sample data
arr = [1, 2, 3, 4]
# Median
median = np.median(arr)
print("Median = ", median)

# Sample data
arr = [1, 2, 2, 3]
# Mode
mode = stats.mode(arr)
print("Mode = ", mode)

• Variance
• Standard deviation
Range
• Range = largest data value - smallest data value

# Sample data
arr = [1, 2, 3, 4, 5]
# Finding max and min
Maximum = max(arr)
Minimum = min(arr)
# Range
Range = Maximum - Minimum
print("Range = ", Range)
Variance
import statistics
# Sample data
arr = [1, 2, 3, 4, 5]
# Variance
print("Var = ", statistics.variance(arr))
Standard Deviation
import statistics
# Sample data
arr = [1, 2, 3, 4, 5]
# Standard deviation
print("Std = ", statistics.stdev(arr))
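As a cross-check on the two measures above: the standard deviation is the square root of the (sample) variance, so the two snippets should agree. A minimal sketch using the same sample data:

```python
import math
import statistics

# Same sample data as the variance and stdev snippets above
arr = [1, 2, 3, 4, 5]

var = statistics.variance(arr)  # sample variance (n-1 denominator) -> 2.5
std = statistics.stdev(arr)     # sample standard deviation

# stdev is defined as sqrt(variance)
assert math.isclose(std, math.sqrt(var))
print(var, std)
```

Note that Python's statistics module computes the sample statistics (dividing by n-1); statistics.pvariance and statistics.pstdev give the population versions (dividing by n).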