Module 1 Introduction to Big Data Analytics

The document outlines a course on Big Data Analytics, covering topics such as the characteristics and types of big data, data analytics, and the Hadoop ecosystem. It includes practical applications using tools like MongoDB, Cassandra, and Apache Spark, alongside assessments and course objectives. The syllabus is structured into modules, each focusing on different aspects of big data technologies and methodologies.


6CS1039 - BIG DATA ANALYTICS

02/23/2025 Module I : Introduction to Big Data and Analytics 1


OUTLINE
CREDIT: 3 (L: 2; P: 2)

Theory (PP): 30 hours
Practical (PR): 15 hours
Total: 45 hours


SYLLABUS (Contact Hours: 45)

Module – I (9 hours)
Introduction to big data: Types of digital data, classification of digital data, structured data, semi-structured data and unstructured data. Characteristics of big data, evolution of big data, challenges of big data, definition of data, traditional business intelligence (BI) versus Big Data, traditional versus Big Data approach.
Data analytics: Introduction to big data analytics, classification of analytics, terminologies used in big data environment, use cases of big data analytics, data analytics using Python.

Module – II (9 hours)
Introduction to Hadoop: Hadoop overview, Hadoop history, importance of Hadoop, RDBMS versus Hadoop, distributed computing challenges, use cases of Hadoop.
HDFS (Hadoop Distributed File System), processing data with Hadoop, mapping resources and applications with Hadoop YARN, big data environment, Hadoop ecosystem, components of the Hadoop ecosystem, HDFS commands, Linux commands, Hadoop environment setup, YARN, introduction to MapReduce.

Module – III (10 hours)
Introduction to NoSQL: Advantages of NoSQL, SQL vs NoSQL.
Introduction to MongoDB: Importance of MongoDB, data types in MongoDB, terms used in RDBMS and MongoDB, installation of MongoDB, MongoDB query language. Hands-on with MongoDB.
Introduction to Cassandra: Features of Apache Cassandra, CQL data types, CQLSH, CRUD operations, TTL, ALTER command, import and export, and querying system tables.

Module – IV (8 hours)
Data ingestion and processing in Hadoop: Data ingestion tools Sqoop, Flume, Kafka. Basics of Hive, Hive architecture, Hive data types, Hive file formats, Hive environment setup, HQL (Hive Query Language), parsing file implementation, SerDe. Introduction to graph databases.

Module – V (9 hours)
Apache Spark: Introduction to Apache Spark, characteristics of Apache Spark, use cases of Apache Spark, Apache Spark architecture, Hadoop vs Spark, Apache Spark environment setup, introduction to SparkContext, Spark RDD, Spark caching, Spark DataFrame operations, Spark functions in a distributed environment.
Textbooks
1. Seema Acharya, Subhashini Chellappan, "Big Data and Analytics", 2nd edition, Wiley, 2023. ISBN 978-81-265-7951-8
2. Jeffrey Aven, "Data Analytics with Spark Using Python", Pearson, 2018

Reference books
1. Mayank Bhushan, "Big Data and Hadoop", BPB Publications, 1st edition, 2018
2. "Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization", Dreamtech Press, 2016
3. Ivan Marin, Ankit Shukla, Sarang VK, "Big Data Analysis with Python", Packt, 2019
4. Balamurugan Balusamy, Nandhini Abirami R, Seifedine Kadry, Amir Gandomi, "Big Data: Concepts, Technology and Architecture", Wiley, 2021
5. Parag Kulkarni, Sarang Joshi, Meta S. Brown, "Big Data Analytics", PHI Learning Private Limited, 2016
6. Raj Kamal and Preeti Saxena, "Big Data Analytics: Introduction to Hadoop, Spark and Machine Learning", McGraw Hill Education (India) Private Limited, 2019


COURSE OBJECTIVES

1. Explain the characteristics of Big Data and analytics, distinguish between traditional and Big Data approaches, and identify the challenges and infrastructure required for effective data management.
2. Understand data processing with Hadoop and interact with its ecosystem.
3. Implement MapReduce programming constructs such as mappers, reducers, combiners, and partitioners.
4. Implement Hadoop tools (MongoDB, Cassandra, and Hive) in an HDFS environment.
5. Understand Apache Spark and create and manipulate Spark DataFrames.



COURSE OUTCOMES

CO 1: Analyse the fundamentals of Big Data and applications of analytics. (Bloom's Taxonomy Level: BTL3)
CO 2: Set up a Hadoop environment and interact with its ecosystem. (BTL3)
CO 3: Implement NoSQL databases (MongoDB, Cassandra) on Hadoop. (BTL4)
CO 4: Perform data processing on a data warehouse (Hive) in an HDFS environment. (BTL4)
CO 5: Create and manipulate Spark DataFrames in Spark for data processing. (BTL4)



COURSERA COURSES

1. IBM Introduction to Big Data with Spark and Hadoop [15 hours]
   https://www.coursera.org/learn/introduction-to-big-data-with-spark-hadoop
2. IBM Introduction to NoSQL Databases [18 hours]
   https://www.coursera.org/learn/introduction-to-nosql-databases?specialization=nosql-big-data-and-spark-foundations



DSA COMPONENTS & ASSESSMENT PATTERN

Internal Assessment (50 marks):
• Quiz — max 5, scaled 5, frequency 1
• Blended Learning (Coursera Certification) — max 5, scaled 5, frequency 1
• Mini Project — max 10, scaled 10, frequency 1
• Attendance — max 5, scaled 5
• Mid Semester Exam — max 40, scaled 10, frequency 1
• MSE Lab — max 50, scaled 15, frequency 1

Semester End Examination: 50 marks
Total marks: 100
MINI PROJECT
• CASE STUDY: Fitness data — what do we do with it?
• MINI PROJECT: Data Processing and Transformation Using Hive in a Hadoop Environment



INTRODUCTION TO BIG DATA
Module – I



To learn
• Introduction to big data: Types of digital data, classification of digital data, structured data, semi-structured data and unstructured data. Characteristics of big data, evolution of big data, challenges of big data, definition of data, traditional business intelligence (BI) versus Big Data, traditional versus Big Data approach.
• Data analytics: Introduction to big data analytics, classification of analytics, terminologies used in big data environment, use cases of big data analytics, data analytics using Python.



Types of Data

Data is broadly of two types:

• Quantitative data: measurable things that have a fixed reality; collected through measuring; typically close-ended.
• Qualitative data: descriptive; collected through observation, field work, focus groups, interviews, and recording or filming conversations; typically open-ended.


Big Data Definition
• Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.

What data can be defined as Big Data?
• Data that is too large or too complex to be managed using traditional data processing, analysis, and storage techniques.


Big Data, what is it?

• Traditional computer science view: data that will not fit in main memory.
  For example: a busy web server's access logs, or the graph of the web.

• Statistics view: data with a large number of observations and/or features.
  – Tall data: the edge list of a large graph; RGB values per pixel location in large images.
  – Wide data: mobile app usage statistics of 100 people.
  – A non-traditional sample size (i.e. > 300 subjects); can't be analyzed in standard stats tools (e.g. Excel).
https://www.facebook.com/itspark.in/videos/what-is-big-data-and-how-does-it-work/597613787402602/



Gartner's Definition for Big Data
• The "big" in big data also refers to several other characteristics of a big data source. These aspects include not just increased volume but increased velocity and increased variety.
The Evolution of Big Data



6V Characteristics of Big Data



The Model Has Changed…
• The model of generating/consuming data has changed.

Old model: few companies generate data; all others consume data.

New model: all of us generate data, and all of us consume data.
HOW IS BIG DATA DIFFERENT?
1. Often automatically generated by a machine (instead of a person being involved in creating new data).
2. Big data is typically an entirely new source of data. Example: experience with online shopping.
3. Many big data sources are not designed to be friendly.
4. Large swaths of big data streams may not have much value.
What Structure Does Big Data Have?
• Actually semi-structured or multi-structured, not unstructured.
• Taming semi-structured data is largely a matter of putting in the extra time and effort to figure out the best way to process it.
Structured Data

Semi-Structured Data

Unstructured Data
Facets of data
• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming



A) STRUCTURED DATA



B) UNSTRUCTURED DATA



Data Science Process



The data science process consists of six steps



1. SETTING THE RESEARCH GOAL
• Data science is mostly applied in the context of an organization.

• When the business asks you to perform a data science project, first prepare a project charter.

• The charter contains information such as:
  – what you're going to research
  – how the company benefits from that
  – what data and resources you need
  – a timetable, and
  – deliverables



2. RETRIEVING DATA
• The second step is to collect data.

• The project charter already states which data you need and where you can find it.

• Ensure you can use the data in your program, which means checking the existence of, quality of, and access to the data.

• Data can also be delivered by third-party companies and takes many forms, ranging from Excel spreadsheets to different types of databases.



3. DATA PREPARATION
• Data collection is an error-prone process

• Enhance the quality of the data and prepare it for use in subsequent
steps.

• This phase consists of three subphases:


• data cleansing removes false values from a data source and inconsistencies
across data sources
• data integration enriches data sources by combining information from multiple
data sources, and
• data transformation ensures that the data is in a suitable format for use in your
models.
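The three subphases can be illustrated with a small pandas sketch. The data, the column names, and the "NY" / "New York" inconsistency are all invented for this example:

```python
import pandas as pd

# Two hypothetical sources with a false value (a negative amount) and an
# inconsistency ("NY" vs "New York" for the same city).
sales = pd.DataFrame({"cust_id": [1, 2, 2, 3],
                      "city": ["NY", "New York", "New York", "Boston"],
                      "amount": [100.0, -5.0, 250.0, 80.0]})
customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "segment": ["retail", "retail", "corporate"]})

# 1. Data cleansing: drop false values and resolve the spelling inconsistency.
sales = sales[sales["amount"] > 0].replace({"city": {"NY": "New York"}})

# 2. Data integration: enrich the sales records with customer information.
merged = sales.merge(customers, on="cust_id", how="left")

# 3. Data transformation: aggregate into a format suitable for modeling.
per_segment = merged.groupby("segment")["amount"].sum()
print(per_segment.to_dict())
```

Note how each subphase maps to one line of ordinary pandas: filtering/replacing, merging, and grouping.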
4. DATA EXPLORATION
• Concerned with building a deeper understanding of data.

• Try to understand how variables interact with each other, the distribution of the data, and whether there are outliers.

• To achieve this, use descriptive statistics, visual techniques, and simple modeling.

• Often goes by the abbreviation EDA, for Exploratory Data Analysis.



5. DATA MODELING OR MODEL BUILDING
• Use models, domain knowledge, and insights about the data you found in the previous steps to answer the research question.

• Select a technique from the fields of statistics, machine learning, operations research, and so on.

• Building a model is an iterative process that involves selecting the variables for the model, executing the model, and model diagnostics.



6. PRESENTATION AND AUTOMATION
• Present the results to your business.

• Results can take many forms, ranging from presentations to research reports.

• Sometimes you need to automate the execution of the process, because the business will want to use the insights you gained in another project or enable an operational process to use the outcome from your model.



FILTERING BIG DATA EFFECTIVELY
• The extract, transform, and load (ETL) process is the process of taking a raw feed of data, reading it, and producing a usable set of output.
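A minimal sketch of that ETL idea in plain Python. The raw feed and its field layout are invented for illustration:

```python
import csv
import io

# Hypothetical raw feed: two good records, one with a non-numeric
# reading, and one malformed line.
raw_feed = """\
2025-01-01,sensor_a,21.5
2025-01-01,sensor_b,n/a
bad line without commas
2025-01-02,sensor_a,22.0
"""

def extract(feed):
    """Extract: read the raw feed one record at a time."""
    yield from feed.splitlines()

def transform(lines):
    """Transform: keep only well-formed rows with a numeric reading."""
    for line in lines:
        parts = line.split(",")
        if len(parts) != 3:
            continue
        date, sensor, value = parts
        try:
            yield date, sensor, float(value)
        except ValueError:
            continue

def load(rows):
    """Load: write the usable rows out as CSV text."""
    out = io.StringIO()
    csv.writer(out).writerows(rows)
    return out.getvalue()

clean = list(transform(extract(raw_feed)))
print(load(clean))
```

The same three-stage shape scales up: in a big data setting each stage would be distributed, but the raw-feed-in, usable-output-out contract is the same.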
TODAY'S BIG DATA IS NOT TOMORROW'S BIG DATA
• Big data will continue to evolve.
• Transactional data in the retail, telecommunications, and banking industries was very big and hard to handle even a decade ago.
• In fact, such data wasn't widely available for analytics and reporting in many organizations in the late 1990s.
• Today, such data is considered a necessary and fundamental asset.
• Virtually every company of any size has access to it.
WRAP-UP ABOUT BIG DATA
– Data that exceeds the capability of commonly used hardware environments and software tools to capture, manage, and process.
– Not just in terms of volume, but also in terms of variety, velocity, and complexity.
– Often automatically generated by a machine of some sort.
– Usually not in a user-friendly format.
– Just the next wave of new, bigger data that pushes today's limits.
– Many big data sources are semi-structured; they can also be unstructured, or even structured in some cases.
What is Big Data Analytics?
• https://www.youtube.com/watch?v=aeHqYLgZP84&t=15s
• https://youtu.be/aeHqYLgZP84?si=p0Dd1qX7c4silq-X



Big Data, what is it? Government View (2016)


Big Data, what is it? Industry View

Big Data, a type of analytics: Data → Insights!

Big Data, what is it? (ChatGPT, January 2023)


Big Data, what is it? (ChatGPT, January 2023)

Big data refers to the large and complex sets of data that are generated and collected by organizations and individuals on a daily basis. These data sets can come from a variety of sources, such as social media, online transactions, and sensor data, and can be structured or unstructured.

One of the main challenges of big data is the ability to store, process, and analyze it effectively. Traditional data processing methods and technologies are often not able to handle the volume, velocity, and variety of big data. As a result, new technologies and approaches, such as distributed computing and machine learning, have been developed to help organizations make sense of their big data.

Big data can have a wide range of applications, from improving business operations and customer service to enabling new scientific discoveries and advancements in healthcare. For example, in business, big data can be used to gain insights into customer behavior, identify new market opportunities, and optimize supply chain operations. In healthcare, big data can be used to improve patient outcomes and develop personalized treatment plans.

Overall, big data is a rapidly growing field with many potential benefits for organizations and individuals, but also has the potential for privacy and security concerns. Therefore, it is important for organizations to have a robust data governance framework and for individuals to understand the implications of data collection and use.
Big Data, in demand?
By the requirements in job ads (Muenchen, 2019): tools primarily for big data vs. tools used extensively in big data.


Big Data, What is it?

• Libraries, tools and architectures for working with large datasets quickly.

• Short Answer: Big Data ≈ Data Mining ≈ Predictive Analytics ≈ Data Science (Leskovec et al., 2017)

• It sits at the intersection of computation and statistics (Conway).

• CSE545 focuses on: how to analyze data that is mostly too large for main memory, and analyses only possible with a large number of observations or features.
Big Data, What is it?

Goal: Generalizations — a model or summarization of the data.
• How to analyze data that is mostly too large for main memory.
• Analyses only possible with a large number of observations or features.

E.g.
● Google's PageRank: summarizes web pages by a single number.
● Twitter financial market predictions: models the stock market according to shifts in sentiment on Twitter.
● Distinguishing tissue type in medical images: summarizes millions of pixels into clusters.
● Mental health diagnosis in social media: models presence of diagnosis as a distribution (a summary) of linguistic patterns.
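The PageRank entry above can be sketched as power iteration on a tiny link graph. The four-page graph and the damping factor are made up for illustration:

```python
# page -> pages it links to (an invented four-page web)
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
pages = list(links)
rank = {p: 1 / len(pages) for p in pages}   # start uniform
damping = 0.85

for _ in range(50):                          # iterate until ranks stabilize
    new = {p: (1 - damping) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:                       # p shares its rank among its outlinks
            new[q] += damping * rank[p] / len(outs)
    rank = new

# Each page is now "summarized by a single number".
print({p: round(r, 3) for p, r in sorted(rank.items())})
```

Page "c" ends up highest (three pages link to it) and "d" lowest (nothing links to it) — the summary captures the link structure in one number per page.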
Big Data, What is it?

Goal: Generalizations — a model or summarization of the data.

1. Descriptive analytics: describes (generalizes) the data itself.

2. Predictive analytics: creates something generalizable to new data.
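The contrast can be made concrete with a toy series (the sales figures are invented): descriptive analytics summarizes the data we have, while predictive analytics fits a trend in order to say something about a value we have not seen:

```python
import statistics

# Toy monthly sales figures, invented for illustration.
sales = [10, 12, 13, 15, 16, 18]

# 1. Descriptive analytics: generalize the data itself with summary statistics.
summary = {"mean": statistics.mean(sales),
           "stdev": round(statistics.pstdev(sales), 2)}

# 2. Predictive analytics: generalize to NEW data -- fit a least-squares
#    trend line by hand and forecast the next (unseen) month.
xs = list(range(len(sales)))
mx, my = statistics.mean(xs), statistics.mean(sales)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, sales)) / \
        sum((x - mx) ** 2 for x in xs)
forecast = my + slope * (len(sales) - mx)

print(summary, round(forecast, 1))
```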


Big Data Analytics, The Class

Goal: Generalizations — a model or summarization of the data.

Data/Workflow Frameworks:
- Hadoop File System
- MapReduce
- Spark
- Streaming
- Deep Learning Frameworks

Analytics and Algorithms:
- Similarity Search
- Hypothesis Testing
- Transformers/Self-Supervision
- Recommendation Systems
- Link Analysis
Types of Analytics

- Descriptive
- Predictive
- Prescriptive


I. DESCRIPTIVE ANALYTICS
 Descriptive analytics stands as the bedrock for understanding and communicating historical data.
 Helps organizations make sense of past events, offering a clear snapshot of their operations, finances, and overall performance.
 Seeks to prepare historical data in a format that is easily communicable, benefiting a broad business audience.
 Often involves creating company reports and Key Performance Indicators (KPIs) that provide a comprehensive overview of various aspects of a company, such as operations, revenue, finances, customers, and stakeholders.
 By summarizing this data, businesses can convey complex information in a way that is understandable to diverse audiences, from executives to frontline employees.
Techniques in Descriptive Analytics

Data aggregation:
• Collecting data from multiple sources and summarizing it to provide a cohesive view.
• Instrumental in generating company reports and KPIs, enabling businesses to track performance metrics over time.

Data mining:
• Helps identify past events and uncover patterns that may not be immediately apparent.
• Important for historical analysis, allowing businesses to understand trends and behaviors that have shaped their performance.
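Data aggregation in miniature: records from two hypothetical sources are combined into a single KPI view. All names and figures below are invented:

```python
from collections import defaultdict

# Hypothetical (region, amount) records from two different sources.
web_orders = [("north", 120.0), ("south", 80.0), ("north", 60.0)]
store_orders = [("south", 40.0), ("west", 95.0)]

# Aggregate both sources into one KPI: total revenue per region.
revenue = defaultdict(float)
for region, amount in web_orders + store_orders:
    revenue[region] += amount

kpis = dict(sorted(revenue.items()))
print(kpis)
```

The same summarize-by-key pattern underlies real reporting pipelines; only the volume and the tooling change.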
Methods Used in Descriptive Analytics

• Observations: collecting data based on observed behaviors or events within the organization.
• Case studies: conducting in-depth examinations of specific instances to gain detailed insights.
• Surveys: using questionnaires to gather data and understand trends and patterns among stakeholders.


The big data ecosystem and data science



Big data technologies can be classified into a few main components

• Currently many big data tools and frameworks exist, and it's easy to get lost because new technologies appear rapidly.
• The big data ecosystem can be grouped into technologies that have similar goals and functionalities.



1. DISTRIBUTED FILE SYSTEMS
• A distributed file system runs on multiple servers at once and can store files larger than any one computer's disk.

Advantages:
• Files get automatically replicated across multiple servers for redundancy or parallel operations, while hiding the complexity of doing so from the user.
• The system scales easily: you're no longer bound by the memory or storage restrictions of a single server.

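A toy model of the replication idea: a file is split into blocks and each block is copied to several "servers" (plain dicts here), so losing one server loses no data. The block size, replica count, and server names are invented; real distributed file systems do this transparently and at vastly larger scale:

```python
BLOCK_SIZE, REPLICAS = 4, 2
servers = {name: {} for name in ("s1", "s2", "s3")}

def put(filename, data):
    """Split data into blocks and place each block on REPLICAS servers."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    names = list(servers)
    for i, block in enumerate(blocks):
        for r in range(REPLICAS):                 # round-robin placement
            target = names[(i + r) % len(names)]
            servers[target][(filename, i)] = block
    return len(blocks)

def get(filename, n_blocks, down=()):
    """Reassemble the file, skipping any servers that are 'down'."""
    out = []
    for i in range(n_blocks):
        for name, store in servers.items():
            if name not in down and (filename, i) in store:
                out.append(store[(filename, i)])
                break
    return "".join(out)

n = put("log.txt", "abcdefghij")
print(get("log.txt", n, down={"s1"}))   # the read survives a server failure
```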


2. DISTRIBUTED PROGRAMMING FRAMEWORK
• Once you have the data stored on the distributed file system, you can exploit it.
• One important aspect of working on a distributed hard disk is that you won't move your data to your program; rather, you'll move your program to the data.
• You need to deal with the complexities that come with distributed programming, such as restarting jobs that have failed, tracking the results from the different subprocesses, and so on.
• Luckily, the open source community has developed many frameworks to handle this and to provide a better experience working with distributed data and dealing with many of the challenges it carries.

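The programming model such frameworks distribute can be simulated on one machine. This word-count sketch (with invented documents) shows the three phases a framework would run across many nodes: map emits (key, value) pairs, the shuffle groups pairs by key, and reduce aggregates each group:

```python
from itertools import groupby
from operator import itemgetter

documents = ["big data is big", "data moves to the program"]

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word."""
    for word in doc.split():
        yield word, 1

def reduce_phase(word, counts):
    """Reduce: aggregate all counts emitted for one key."""
    return word, sum(counts)

# Shuffle: bring all pairs with the same key together (sorting groups them).
pairs = sorted(p for doc in documents for p in map_phase(doc))

result = dict(reduce_phase(w, (c for _, c in grp))
              for w, grp in groupby(pairs, key=itemgetter(0)))
print(result)
```

In a real framework the map calls run where each data block lives (moving the program to the data), and the shuffle happens over the network.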


• Previously, scale was increased by moving everything to a server with more memory, storage, and a better CPU (vertical scaling).
• Nowadays you can add another small server instead (horizontal scaling).
• This principle makes the scaling potential virtually limitless.
• The best-known distributed file system is the Hadoop Distributed File System (HDFS), an open-source implementation of the Google File System.
• Other distributed file systems exist:
  – Red Hat Cluster File System
  – Ceph File System
  – Tachyon File System


3. DATA INTEGRATION FRAMEWORK
• With a distributed file system in place, you need to add data.

• You need to move data from one source to another, and this is where data integration frameworks such as Apache Sqoop and Apache Flume excel.

• The process is similar to an extract, transform, and load process in a traditional data warehouse.



4. MACHINE LEARNING FRAMEWORKS
• Before World War II everything needed to be calculated by hand, which severely limited the possibilities of data analysis.
• After World War II, computers and scientific computing were developed; a single computer could do all the counting and calculations, and a world of opportunities opened.
• With the enormous amount of data available nowadays, one computer can no longer handle the workload by itself.
• One of the biggest issues with the old algorithms is that they don't scale well. With the amount of data we need to analyze today, this becomes problematic, and specialized frameworks and libraries are required to deal with this amount of data.
• The most popular machine learning library for Python is Scikit-learn.
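One illustration (not from the textbook) of what "scaling well" means for an algorithm: a single-pass, constant-memory reformulation. Welford's online algorithm computes the mean and variance of a stream without ever materializing the dataset, the style that big-data libraries rely on:

```python
def running_stats(stream):
    """Single-pass mean and population variance (Welford's algorithm)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)       # note: uses the updated mean
    return mean, m2 / n if n else 0.0

# Works directly on a generator, so the "dataset" never exists in memory.
mean, var = running_stats(x * 0.5 for x in range(100_001))
print(mean, var)
```

A naive two-pass formula would need the whole list in memory first; the streaming version needs three numbers regardless of input size.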
Other Python Libraries

• PyBrain for neural networks: neural networks are learning algorithms that mimic the human brain in learning mechanics and complexity; they are often regarded as advanced and black box.
• NLTK (Natural Language Toolkit): bundled with a number of text corpora to help you model your own data.
• Pylearn2: another machine learning toolbox, but a bit less mature than Scikit-learn.
• TensorFlow: a Python library for deep learning provided by Google.



5. NOSQL DATABASES
• Managing and querying data has traditionally been the domain of relational databases such as Oracle SQL, MySQL, Sybase IQ, and others.
• The "No" in NoSQL stands for "Not Only" SQL.
• Traditional databases had shortcomings that didn't allow them to scale well:
  – storage or processing power can't scale beyond a single node
  – no way to handle streaming, graph, or unstructured forms of data
• NoSQL databases allow for a virtually endless growth of data.
• Many different types of databases have arisen.

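A toy document store in plain Python illustrates the flexible-schema idea that distinguishes NoSQL document databases from relational tables. The `insert`/`find` helpers are invented stand-ins for this sketch, not a real NoSQL API:

```python
collection = {}   # _id -> document

def insert(doc):
    """Store a document under its _id; documents need not share fields."""
    collection[doc["_id"]] = doc

def find(**criteria):
    """Return documents whose fields match all the given criteria."""
    return [d for d in collection.values()
            if all(d.get(k) == v for k, v in criteria.items())]

# Same collection, different shapes -- no ALTER TABLE needed.
insert({"_id": 1, "name": "alice", "city": "Pune"})
insert({"_id": 2, "name": "bob", "tags": ["vip"], "age": 31})

print(find(name="bob"))
```

In a relational table, adding the `tags` and `age` fields would require a schema change; a document store simply accepts the new shape.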


NoSQL databases can be categorized into the following types:

• Column databases: data is stored in columns, which allows algorithms to perform much faster queries. Newer technologies use cell-wise storage. Table-like structures are still important.
• Document stores: no longer use tables, but store every observation in a document. This allows for a much more flexible data scheme.
• Streaming data: data is collected, transformed, and aggregated not in batches but in real time.
• Key-value stores
• SQL on Hadoop
• New SQL
• Graph databases


6. SCHEDULING TOOLS
• Scheduling tools help to automate repetitive tasks and trigger jobs based on events, such as adding a new file to a folder.

• They are similar to tools such as cron on Linux, but are specifically developed for big data.

• You can use them, for instance, to start a MapReduce task whenever a new dataset is available in a directory.



7. BENCHMARKING TOOLS
• Developed to optimize big data installations by providing standardized profiling suites.

• A profiling suite is taken from a representative set of big data jobs.

• Benchmarking and optimizing the big data infrastructure and configuration aren't often jobs for data scientists themselves, but for a professional specialized in setting up IT infrastructure.

• For example, if you can gain 10% on a cluster of 100 servers, you save the cost of 10 servers.



8. SYSTEM DEPLOYMENT
• Setting up a big data infrastructure isn't an easy task, and this is where system deployment tools shine.

• They assist engineers in deploying new applications into the big data cluster.

• They largely automate the installation and configuration of big data components.

02/23/2025 Module I : Introduction to Big Data and Analytics 95


9. SERVICE PROGRAMMING
• Suppose a world-class soccer prediction application is built on Hadoop
• You want to allow others to use the predictions made by your application
• However, you have no idea of the architecture or technology of everyone keen on using your predictions
• Service tools excel here by exposing big data applications to other applications as a service
• Data scientists sometimes need to expose their models through services
• The best-known example is the REST service - REST stands for representational state transfer
• Often used to feed websites with data
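As a sketch of how a model might be exposed as a REST service, the example below uses only Python's standard library (`http.server`). The endpoint path and the hard-coded probability are hypothetical stand-ins for a real trained model; production services would use a proper framework behind the big data cluster.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class PredictionHandler(BaseHTTPRequestHandler):
    """Serve a (hypothetical) soccer prediction as JSON over a REST-style GET."""

    def do_GET(self):
        # In a real service the path would select a fixture and the probability
        # would come from the trained model, not a constant.
        result = {"match": self.path.lstrip("/"), "home_win_probability": 0.65}
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet

if __name__ == "__main__":
    # Any HTTP client (browser, curl, another application) can now consume
    # the prediction without knowing anything about the Hadoop backend.
    HTTPServer(("localhost", 8000), PredictionHandler).serve_forever()
```

This is the point of service programming: the consumer only sees JSON over HTTP, never the architecture behind it.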
10. SECURITY
• Do you want everybody to have access to all of your data?
• You probably need fine-grained control over access to data, but don’t want to manage this on an application-by-application basis
• Big data security tools allow you to have central and fine-grained control over access to the data
USE CASES OF BIG DATA ANALYTICS

• Manufacturing
• Retail
• Healthcare
• Oil and gas
• Telecommunications
• Financial services
I. MANUFACTURING

Predictive maintenance : Big data can help predict equipment failure. Potential issues can be discovered by analyzing both structured data (equipment year, make, and model) and multi-structured data (log entries, sensor data, error messages, engine temperature, and other factors). With this data, manufacturers can maximize parts and equipment uptime and deploy maintenance more cost effectively. Challenges : Companies must integrate data coming from different formats and identify the signals that will lead to optimizing maintenance.

Operational efficiency : Manufacturers can analyze and assess production processes, proactively respond to customer feedback, and anticipate future demands.

Production optimization : Big data can help manufacturers understand the flow of items through their production lines and see which areas can benefit. Data analysis will reveal which steps lead to increased production time and which areas are causing bottlenecks.
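The predictive maintenance idea can be reduced to a minimal Python illustration: flag any machine whose sensor readings cross a limit. The machine names, the temperature readings, and the 90 °C threshold are all invented for the example; real systems combine many signals and learned models rather than a single fixed limit.

```python
# Hypothetical engine-temperature readings (°C) per machine
readings = {
    "press_01": [71, 73, 72, 95, 74],
    "press_02": [70, 71, 69, 72, 70],
}

def flag_for_maintenance(readings, limit=90):
    """Return machines whose sensors ever exceeded the limit."""
    return sorted(m for m, temps in readings.items() if max(temps) > limit)

print(flag_for_maintenance(readings))  # ['press_01']
```

Only press_01 is flagged, because one of its readings spiked above the limit; scheduling its maintenance early is what maximizes uptime.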
II. Retail

Competition is fierce in retail. To stay ahead, companies strive to differentiate themselves. Big data is being used across all stages of the retail process, from product predictions to demand forecasting to in-store optimization. Using big data, retailers are finding new ways to innovate.

Product development

Customer experience : Challenges : Integrating a high volume of data from various sources can be difficult. Once the data is integrated, path analysis can be used to identify experience paths and correlate them with various sets of behavior.

Customer lifetime value : Challenges : To identify your high-value customers, you will need to analyze a high volume of customer transaction data and create sophisticated models that examine past behavior and predict future actions.

The in-store shopping experience : Challenges : Complex graphs and path analyses are required to identify customer paths and behavior. This data must then be correlated and joined with multiple datasets to correctly analyze store behavior.

Pricing analytics and optimization
III. Healthcare

Healthcare organizations are using big data for everything from improving profitability to helping save lives. Healthcare companies, hospitals, and researchers collect massive amounts of data. But all of this data isn’t useful in isolation. It becomes important when the data is analyzed to highlight trends and threats in patterns and create predictive models.

Genomic research : To identify disease genes and biomarkers to help patients pinpoint health issues they may face in the future; to design personalized treatments.

Patient experience and outcomes : To provide better treatment and improved quality of care without increasing costs. With big data, healthcare organizations can create a 360-degree view of patient care as the patient moves through various treatments and departments.

Claims fraud : There can be hundreds of associated reports in a variety of different formats, which makes it extremely difficult to verify the accuracy of insurance incentive programs and find the patterns that indicate fraudulent activity. Big data helps healthcare organizations detect potential fraud by flagging certain behaviors for further examination.

Healthcare billing analytics : By analyzing billing and claims data, organizations can discover lost revenue opportunities and places where payment cash flows can be improved. This use case requires integrating billing data from various payers, analyzing a large volume of that data, and then identifying activity patterns in the billing data.
IV. Oil and gas

For the past few years, the oil and gas industry has been leveraging big data to find new ways to innovate. The industry has long made use of data sensors to track and monitor the performance of oil wells, machinery, and operations. Oil and gas companies have been able to harness this data to monitor well activity and create models of their operations.

Predictive equipment maintenance : Big data can help by providing insight so companies can predict the remaining optimal life of their systems and components, ensuring that their assets operate at optimum production efficiency.

Oil exploration and discovery : Exploring for oil and gas can be expensive. But companies can make use of the vast amount of data generated in the drilling and production process to make informed decisions about new drilling sites. Data generated from seismic monitors can be used to find new oil and gas sources by identifying traces that were previously overlooked.

Oil production optimization : Unstructured sensor and historical data can be used to optimize oil well production. By creating predictive models, companies can measure well production to understand usage rates. With deeper data analysis, engineers can determine why actual well outputs aren’t tallying with their predictions.
V. Telecommunications

The popularity of smartphones and other mobile devices has given telecommunications companies tremendous growth opportunities. But there are challenges as well, as organizations work to keep pace with customer demands for new products and offerings.

Optimize network capacity : Optimal network performance is essential for a telecom’s success. Network usage analytics can help companies identify areas with excess capacity and reroute bandwidth as needed. Big data analytics can help them plan for infrastructure investments and design new services that meet customer demands. With new insights, telecoms are able to maintain customer loyalty and avoid losing revenue to competitors.

Telecom customer churn : By analyzing the data telecoms already have about service quality, convenience, and other factors, telecoms can predict overall customer satisfaction. They can set up alerts when customers are at risk of churning, and take action with retention campaigns and proactive offers.

New products and offerings : Big data provides valuable insights to help companies design new products and features. An improved understanding of customer behavior enables companies to tailor services to different customer segments for future offerings.
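A churn-risk alert of the kind described above can be caricatured in a few lines of Python. The weights and threshold below are invented for the example; a real telecom would learn them from historical churn data rather than hand-pick them.

```python
def churn_risk(dropped_call_rate, support_tickets, months_tenure, threshold=0.3):
    """Toy churn flag: poor service quality raises the score, long tenure lowers it.

    All coefficients are illustrative, not from a fitted model.
    """
    score = (0.5 * dropped_call_rate
             + 0.1 * support_tickets
             - 0.02 * months_tenure)
    return score > threshold

# A subscriber with many dropped calls and a short tenure looks at risk
print(churn_risk(0.8, 3, 6))    # True
# A long-tenured subscriber with good service quality does not
print(churn_risk(0.1, 0, 24))   # False
```

Customers for whom the flag fires would be the ones routed to retention campaigns and proactive offers.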
VI. Financial services

Forward-thinking banks and financial services firms are capitalizing on big data. From capturing new market opportunities to reducing fraud, financial services organizations have been able to convert big data into a competitive advantage.

Fraud and compliance : Companies can identify patterns that indicate fraud and aggregate large volumes of information to streamline regulatory reporting.

Drive innovation : Big data offers valuable insights that help organizations innovate. Big data analytics makes the interdependencies between humans, institutions, entities, and processes more apparent. With better understanding of market trends and customer needs, organizations can improve decision-making about new products and services.

Anti-money laundering : Financial services firms are under more pressure than ever before from governments passing anti-money laundering laws. These laws require that banks show proof of proper diligence and submit suspicious activity reports. Big data can help companies identify potential fraud patterns.

Financial regulatory and compliance analytics : Financial services companies must be in compliance with a wide variety of requirements concerning risk, conduct, and transparency. At the same time, banks must comply with the Dodd-Frank Act, Basel III, and other regulations.
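One very simple fraud-pattern check is to flag transactions whose amount is far from the norm. The sketch below uses a z-score with a deliberately low threshold for a tiny sample; the amounts and the 1.5 cutoff are illustrative, and real AML systems combine many such signals with models and case review rather than a single rule.

```python
import statistics

def flag_outliers(amounts, z_limit=1.5):
    """Flag amounts whose z-score exceeds z_limit (toy anomaly check)."""
    mean = statistics.mean(amounts)
    stdev = statistics.stdev(amounts)
    if stdev == 0:
        return []
    return [a for a in amounts if abs(a - mean) / stdev > z_limit]

# One transaction is wildly out of line with the rest
print(flag_outliers([100, 110, 95, 105, 5000]))  # [5000]
```

Flagged amounts would then be escalated for further examination, as the suspicious activity reporting requirements demand.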
Data Analytics Using Python
Methods Used in Descriptive Analytics

Observations : Collecting data based on observed behaviors or events within the organization.

Case Studies : Conducting in-depth examinations of specific instances to gain detailed insights.

Surveys : Using questionnaires to gather data and understand trends and patterns among stakeholders.
Find the mean of a data set

import numpy as np

# Sample data
arr = [5, 6, 11]

# Mean
mean = np.mean(arr)
print("Mean = ", mean)
Median

import numpy as np

# Sample data
arr = [1, 2, 3, 4]

# Median
median = np.median(arr)
print("Median = ", median)
Mode

from scipy import stats

# Sample data
arr = [1, 2, 2, 3]

# Mode (SciPy returns the most frequent value and its count)
mode = stats.mode(arr)
print("Mode = ", mode)
Measure of Variability

• Also called measures of dispersion, as they help to gain insights about the spread of the observations at hand.
• Some of the measures used to quantify the dispersion in the observations of a variable are as follows:
  • Range
  • Variance
  • Standard deviation
Range

Range = Largest data value – Smallest data value

# Sample data
arr = [1, 2, 3, 4, 5]

# Finding max and min
Maximum = max(arr)
Minimum = min(arr)

# Difference of max and min
Range = Maximum - Minimum
print("Maximum = {}, Minimum = {} and Range = {}".format(Maximum, Minimum, Range))
Variance

import statistics

# Sample data
arr = [1, 2, 3, 4, 5]

# Variance
print("Var = ", statistics.variance(arr))
Standard Deviation

import statistics

# Sample data
arr = [1, 2, 3, 4, 5]

# Standard deviation
print("Std = ", statistics.stdev(arr))
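Worth noting alongside the two snippets above: `statistics.variance` and `statistics.stdev` use the sample formula (dividing by n − 1), while `statistics.pvariance` and `statistics.pstdev` give the population versions (dividing by n), and the standard deviation is simply the square root of the variance. A quick check on the same sample data:

```python
import math
import statistics

arr = [1, 2, 3, 4, 5]

# Sample statistics (divide by n - 1)
sample_var = statistics.variance(arr)   # 2.5
sample_std = statistics.stdev(arr)

# The standard deviation is the square root of the variance
assert math.isclose(sample_std, math.sqrt(sample_var))

# Population statistics (divide by n)
print(statistics.pvariance(arr))  # 2.0
```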
Practical Application: Using Python for Descriptive Analytics

• Create/Load a dataset, provide a summary, and calculate key metrics
• JupyterLab and Jupyter Notebooks
• https://jupyter.org/try-jupyter/lab/
• Create a Python notebook : descriptiveanalysis.ipynb
• Create a dataset:
  'Month': ['January', 'January', 'February', 'February', 'March', 'March', 'April', 'April', 'May', 'May']
  'Revenue': [10000, 15000, 12000, 13000, 14000, 16000, 17000, 18000, 19000, 20000]
  'Customers': [100, 120, 110, 115, 130, 140, 150, 160, 170, 180]
Practical Application: Using Python for Descriptive Analytics (Contd.)

• Summarize the dataset
• Calculate the mean revenue
• Print the monthly aggregated data sorted by month number
descriptiveanalysis.ipynb

import pandas as pd

# Sample dataset
data_dict = {'Month': ['January', 'January', 'February', 'February', 'March', 'March', 'April', 'April', 'May', 'May'],
             'Revenue': [10000, 15000, 12000, 13000, 14000, 16000, 17000, 18000, 19000, 20000],
             'Customers': [100, 120, 110, 115, 130, 140, 150, 160, 170, 180]}

# Convert the dictionary into a DataFrame
data = pd.DataFrame(data_dict)

# Mapping of month names to month numbers
month_mapping = {'January': 1, 'February': 2, 'March': 3, 'April': 4, 'May': 5}
descriptiveanalysis.ipynb (Contd.)

# Add a month number column
data['Month_Number'] = data['Month'].map(month_mapping)

# Summarize the dataset
summary = data.describe()
print(summary)

# Calculate the mean revenue
mean_revenue = data['Revenue'].mean()
print(f"Mean Revenue: {mean_revenue}")
# Aggregate data by month
monthly_data = data.groupby(['Month', 'Month_Number']).agg({'Revenue': 'sum', 'Customers': 'count'}).sort_values(by='Month_Number')

# Drop the Month_Number column as it's no longer needed after sorting
monthly_data = monthly_data.reset_index(drop=False).drop(columns='Month_Number').set_index('Month')

# Print the monthly aggregated data sorted by month number
print(monthly_data)
Descriptive statistics for the columns in the dataset : Revenue and Customers

25th Percentile (25%) : The value below which 25% of the observations fall. For Revenue, 25% of the data is below 13,250. For Customers, 25% is below 116.25.

Table 2 : Aggregated monthly data for Revenue and Customers
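These percentiles can be computed directly with NumPy on the same sample dataset. Note that the exact value of a percentile depends on the interpolation method; NumPy's default linearly interpolates between order statistics, which matches what pandas' describe() reports in its 25% row, while other methods yield slightly different numbers.

```python
import numpy as np

revenue = [10000, 15000, 12000, 13000, 14000, 16000, 17000, 18000, 19000, 20000]
customers = [100, 120, 110, 115, 130, 140, 150, 160, 170, 180]

# Default linear interpolation between order statistics
print(np.percentile(revenue, 25))    # 13250.0
print(np.percentile(customers, 25))  # 116.25
```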