Data Analytics All 5 Units

ITECH WORLD AKTU


Subject Name: Data Analytics (DA)
Subject Code: BCS052

Unit 1:

Syllabus:

1. Introduction to Data Analytics:

• Sources and nature of data


• Classification of data (structured, semi-structured, unstructured)
• Characteristics of data
• Introduction to Big Data platform
• Need of data analytics
• Evolution of analytic scalability
• Analytic process and tools
• Analysis vs reporting
• Modern data analytic tools
• Applications of data analytics

2. Data Analytics Lifecycle:

• Need for data analytics lifecycle


• Key roles for successful analytic projects
• Various phases of data analytics lifecycle:
(a) Discovery
(b) Data preparation
(c) Model planning
(d) Model building
(e) Communicating results
(f) Operationalization

0.1 Introduction to Data Analytics


0.1.1 Definition of Data Analytics
Data Analytics is the process of examining raw data to uncover trends, patterns, and
insights that can assist in informed decision-making. It involves the use of statistical
and computational techniques to clean, transform, and model data.


Key Points:

• Objective: Transform data into actionable insights.

• Methods: Involves data cleaning, processing, and analysis.

• Outcome: Generates insights for strategic decisions in various domains like busi-
ness, healthcare, and technology.

• Tools: Includes Python, R, Excel, and specialized tools like Tableau, Power BI.

Example: A retail store uses data analytics to identify customer buying patterns
and optimize inventory management, ensuring popular products are always in stock.
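
To make this concrete, here is a minimal Python sketch of the retail example. It assumes a hypothetical sales.csv file with 'product' and 'quantity' columns; the file name and columns are illustrative, not part of the original text:

    # Rank products by units sold to spot high-demand items (illustrative sketch).
    import pandas as pd

    sales = pd.read_csv("sales.csv")                # raw transactional data (assumed file)
    top_products = (
        sales.groupby("product")["quantity"].sum()  # total units sold per product
             .sort_values(ascending=False)
             .head(10)                              # ten best-selling products
    )
    print(top_products)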

0.1.2 Sources and Nature of Data


Data originates from various sources, primarily categorized as social, machine-generated,
and transactional data. Below is a detailed explanation of these sources:

1. Social Data:

• User-Generated Content: Posts, likes, and comments on platforms like


Facebook, Twitter, and Instagram.
• Reviews and Ratings: Feedback on platforms such as Amazon and Yelp
that reflect customer opinions.
• Social Network Analysis: Connections and interactions between users that
reveal behavioral patterns.
• Trending Topics: Real-time topics gaining popularity, aiding in sentiment
and trend analysis.

2. Machine-Generated Data:

• Sensors and IoT Devices: Data from devices like thermostats, smart-
watches, and industrial sensors.
• Log Data: Records of system activities, such as server logs and application
usage.
• GPS Data: Location information generated by devices like smartphones and
vehicles.
• Telemetry Data: Remote data transmitted from devices, such as satellites
and drones.

3. Transactional Data:

• Sales Data: Information about products sold, quantities, and revenues.


• Banking Transactions: Records of deposits, withdrawals, and payments.
• E-Commerce Transactions: Online purchases, customer behavior, and cart
abandonment rates.
• Invoices and Receipts: Structured records of financial exchanges between
businesses or customers.


Example:
• A social media platform like Twitter generates vast amounts of social data from
tweets, hashtags, and mentions.

• Machine-generated data from GPS in delivery trucks helps optimize routes and
reduce costs.

• A retail store’s transactional data tracks customer purchases and identifies high-
demand products.

0.1.3 Classification of Data


Data can be classified into three main categories: structured, semi-structured, and un-
structured. Below is a detailed explanation of each type:

• Structured Data: Data that is organized in a tabular format with rows and
columns. It follows a fixed schema, making it easy to query and analyze.

– Examples: Excel sheets, relational databases (e.g., SQL).


– Common Tools: SQL, Microsoft Excel.

• Semi-Structured Data: Data that does not have a rigid structure but contains
tags or markers to separate elements. It lies between structured and unstructured
data.

– Examples: JSON files, XML files.


– Common Tools: NoSQL databases, tools like MongoDB.

• Unstructured Data: Data without a predefined format or organization. It re-


quires advanced tools and techniques for analysis.

– Examples: Images, videos, audio files, and text documents.


– Common Tools: Machine Learning models, Hadoop, Spark.

Example: Email metadata (e.g., sender, recipient, timestamp) is semi-structured,


while the email body is unstructured.

Comparison Table:

Aspect | Structured Data | Semi-Structured Data | Unstructured Data
Definition | Organized in rows and columns with a fixed schema. | Contains elements with tags or markers but lacks strict structure. | Lacks any predefined format or schema.
Examples | SQL databases, Excel sheets. | JSON, XML, NoSQL databases. | Images, videos, audio files, text documents.
Storage | Stored in relational databases. | Stored in NoSQL databases or files. | Stored in data lakes or object storage.
Ease of Analysis | Easy to query and analyze using traditional tools. | Moderate difficulty due to partial structure. | Requires advanced techniques and tools for analysis.
Schema Dependency | Follows a predefined and fixed schema. | Partially structured with flexible schema. | Does not follow any schema.
Data Size | Typically smaller in size compared to others. | Moderate size, often larger than structured data. | Usually the largest in size due to diverse formats.
Processing Tools | SQL, Excel, and BI tools. | MongoDB, NoSQL, and custom parsers. | Hadoop, Spark, and AI/ML tools.

Table 1: Comparison of Structured, Semi-Structured, and Unstructured Data

0.1.4 Characteristics of Data

The key characteristics of data, often referred to as the 4Vs, include:

• Volume: Refers to the sheer amount of data generated. Modern data systems
must handle terabytes or even petabytes of data.

– Example: A social media platform like Facebook generates billions of user
interactions daily.

• Velocity: Refers to the speed at which data is generated and processed. Real-time
data processing is crucial for timely insights.

– Example: Stock market systems process millions of trades per second to
provide real-time updates.

• Variety: Refers to the different types and formats of data, including structured,
semi-structured, and unstructured data.

– Example: A company might analyze customer reviews (text), social media


posts (images/videos), and sales transactions (structured data).

• Veracity: Refers to the quality and reliability of the data. High veracity ensures
data accuracy, consistency, and trustworthiness.

– Example: Data from unreliable sources or with missing values can lead to
incorrect insights.

Real-Life Scenario: Social media platforms like Twitter deal with high Volume
(millions of tweets daily), high Velocity (real-time updates), high Variety (text, images,
videos), and mixed Veracity (authentic and fake information).

0.1.5 Introduction to Big Data Platform


Big Data platforms are specialized frameworks and technologies designed to handle the
processing, storage, and analysis of massive datasets that traditional systems cannot effi-


ciently manage. These platforms enable businesses and organizations to derive meaningful
insights from large-scale and diverse data.
Key Features of Big Data Platforms:

• Scalability: Ability to handle growing volumes of data efficiently.

• Distributed Computing: Processing data across multiple machines to improve per-


formance.

• Fault Tolerance: Ensuring reliability even in the event of hardware failures.

• High Performance: Providing fast data access and processing speeds.

Common Tools in Big Data Platforms:

• Hadoop:

– A distributed computing framework that processes and stores large datasets


using the MapReduce programming model.
– Components include:
∗ HDFS (Hadoop Distributed File System): For distributed storage.
∗ YARN: For resource management and job scheduling.
– Example: A telecom company uses Hadoop to analyze call records for iden-
tifying network issues.

• Spark:

– A fast and flexible in-memory processing framework for Big Data.


– Offers support for a wide range of workloads such as batch processing, real-
time streaming, machine learning, and graph computation.
– Compatible with Hadoop for storage and cluster management.
– Example: A financial institution uses Spark for fraud detection by analyzing
transaction data in real time.

• NoSQL Databases:

– Designed to handle unstructured and semi-structured data at scale.


– Types of NoSQL databases:
∗ Document-based (e.g., MongoDB).
∗ Key-Value stores (e.g., Redis).
∗ Columnar databases (e.g., Cassandra).
∗ Graph databases (e.g., Neo4j).
– Example: An e-commerce platform uses MongoDB to store customer profiles,
product details, and purchase history.

Applications of Big Data Platforms:

• Personalized marketing by analyzing customer preferences.


• Real-time analytics for monitoring industrial equipment using IoT sensors.

• Enhancing healthcare diagnostics by analyzing patient records and medical images.

• Predictive maintenance in manufacturing by identifying patterns in machine per-


formance data.

Example in Action: Hadoop processes petabytes of clickstream data from a large


online retailer to optimize website navigation and improve the user experience.
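
As a rough illustration of how such a platform is used in code, the following PySpark sketch counts page visits in a clickstream file. The HDFS path and the column name 'page' are assumptions made only for this example:

    # Minimal PySpark sketch: counting page visits in a clickstream log.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ClickstreamDemo").getOrCreate()

    clicks = spark.read.csv("hdfs:///data/clickstream.csv", header=True)  # distributed read
    page_counts = clicks.groupBy("page").count().orderBy("count", ascending=False)
    page_counts.show(10)                                                  # most visited pages

    spark.stop()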

1 Need of Data Analytics


Data analytics has become essential in modern organizations for the following reasons:

• Data-Driven Decision Making: Organizations increasingly rely on data-driven


insights to make informed decisions, improve performance, and predict future trends.

• Optimization of Operations: Analytics helps organizations identify inefficien-


cies, optimize processes, and improve resource allocation.

• Competitive Advantage: By leveraging data analytics, companies can better


understand customer preferences, market trends, and competitor behavior, giving
them a competitive edge.

• Personalization and Customer Insights: Data analytics enables organizations


to personalize products and services according to customer needs by analyzing data
such as preferences and buying behavior.

• Risk Management: By analyzing historical data, companies can predict potential


risks and take proactive measures to mitigate them.

Example: A retail company uses data analytics to predict customer demand for
products, enabling them to stock inventory more efficiently.

2 Evolution of Analytic Scalability


The scalability of analytics has evolved over time, allowing organizations to handle larger
and more complex datasets efficiently. The key stages in this evolution include:

• Early Stages (Manual and Small Data): In the past, analytics was performed
manually with small datasets, often using spreadsheets or simple statistical tools.

• Relational Databases and SQL: With the rise of structured data, relational
databases and SQL-based querying became more prevalent, offering better scala-
bility for handling larger datasets.

• Big Data and Distributed Computing: The advent of big data technolo-
gies such as Hadoop and Spark allowed for the processing and analysis of massive
datasets across distributed systems.


• Cloud Computing: Cloud-based platforms like AWS, Google Cloud, and Azure
have made scaling analytics infrastructure easier by providing on-demand resources,
reducing the need for physical hardware.
• Real-Time Data Analytics: Technologies such as Apache Kafka and stream
processing frameworks have enabled the processing of data in real-time, further
enhancing scalability.

3 Analytic Process and Tools


The analytic process involves several stages, each requiring different tools and techniques
to effectively analyze and extract valuable insights from data. The process can typically
be broken down into the following steps:

• Data Collection: Gathering raw data from various sources such as databases,
APIs, or sensors.
• Data Cleaning: Identifying and correcting errors or inconsistencies in the dataset
to improve the quality of the data.
• Data Exploration: Visualizing and summarizing data to understand patterns and
distributions.
• Model Building: Selecting and applying statistical or machine learning models
to predict or classify data.
• Evaluation and Interpretation: Evaluating the accuracy and effectiveness of
models, and interpreting the results for actionable insights.

Tools:
• Statistical Tools: R, Python (with libraries like Pandas, NumPy), SAS
• Machine Learning Frameworks: TensorFlow, Scikit-learn, Keras
• Big Data Tools: Hadoop, Apache Spark
• Data Visualization: Tableau, Power BI, Matplotlib (Python)

4 Analysis vs Reporting
The difference between analysis and reporting lies in their purpose and approach to data:

• Analysis: Involves deeper insights into data, such as identifying trends, patterns,
and correlations. It often requires complex statistical or machine learning methods.
• Reporting: Focuses on summarizing data into a readable format, such as charts,
tables, or dashboards, to provide stakeholders with easy-to-understand summaries.

Example: A report might display sales numbers for the last quarter, while analysis
might uncover reasons behind those numbers, such as customer buying behavior or market
conditions.


5 Modern Data Analytic Tools


Modern tools have revolutionized data analytics, making it easier to handle vast amounts
of data and perform sophisticated analyses. Some of the most popular modern tools
include:

• Apache Hadoop: A framework for processing large datasets in a distributed


computing environment.

• Apache Spark: A fast, in-memory data processing engine for big data analytics.

• Power BI: A powerful business analytics tool that allows users to visualize data
and share insights.

• Tableau: A data visualization tool that enables users to create interactive dash-
boards and visual reports.

• Python with Libraries: Libraries like Pandas, Matplotlib, and Scikit-learn enable
efficient data analysis and visualization.

6 Applications of Data Analytics


Data analytics is used in various industries and domains to solve complex problems and
enhance decision-making. Some common applications include:

• Healthcare: Analyzing patient data for better diagnosis, treatment plans, and
management of healthcare resources.

• Finance: Fraud detection, risk assessment, and portfolio optimization through the
analysis of financial data.

• Retail: Predicting customer behavior, optimizing inventory, and personalizing


marketing campaigns.

• Manufacturing: Predictive maintenance, quality control, and process optimiza-


tion to improve production efficiency.

• Telecommunications: Network optimization, customer churn prediction, and


fraud detection.

6.0.1 Need for Data Analytics Lifecycle


What is Data Analytics Lifecycle?
The Data Analytics Lifecycle refers to a series of stages or steps that guide the process
of analyzing data from initial collection to final insights and decision-making. It is a
structured framework designed to ensure systematic execution of analytics projects, which
helps in producing accurate and actionable results. The lifecycle consists of multiple
phases, each with specific tasks, and is essential for managing complex data projects.
The key stages of the Data Analytics Lifecycle typically include:

• Discovery: Understanding the project objectives and data requirements.


• Data Preparation: Collecting, cleaning, and transforming data into usable for-
mats.

• Model Planning: Identifying suitable analytical techniques and models.

• Model Building: Developing models to extract insights.

• Communicating Results: Presenting insights and findings to stakeholders.

• Operationalization: Implementing the model or results into a business process.

Need for Data Analytics Lifecycle


A structured approach to managing data analytics projects is crucial for several rea-
sons. The following points highlight the importance of adopting the Data Analytics
Lifecycle:

• Ensures Systematic Approach: The lifecycle provides a systematic framework


for managing projects. It ensures that every step is accounted for, avoiding ran-
domness in execution and ensuring that tasks are completed in the correct order.

• Minimizes Errors: By following a predefined process, the risk of errors is reduced.


Each stage builds upon the previous one, ensuring accuracy and reliability in data
processing and analysis.


• Optimizes Resource Usage: The lifecycle ensures efficient use of resources, such
as time, tools, and personnel. By organizing tasks in a structured way, projects are
completed more efficiently, avoiding wasted effort and resources.

• Increases Efficiency: With a clear workflow in place, tasks are completed in a


more streamlined manner, making the entire process more efficient. The structured
approach ensures that insights can be derived quickly and accurately.

• Improves Communication: Clear milestones and stages help teams stay aligned
and facilitate communication about the progress of the project. This clarity is
especially useful when different teams or departments are involved.

• Better Decision-Making: The lifecycle ensures that all steps are thoroughly exe-
cuted, leading to high-quality insights. This improves decision-making by providing
businesses with reliable and actionable data.

• Scalable: The lifecycle framework is adaptable to projects of different sizes. Whether


it’s a small-scale analysis or a large, complex dataset, the process can scale accord-
ing to the project requirements.

6.0.2 Key Roles in Analytics Projects


In data analytics projects, various roles contribute to the successful execution and delivery
of insights. Each role plays a vital part in the project lifecycle, ensuring that the right
data is collected, processed, analyzed, and interpreted for decision-making. The key roles
typically include:

• Data Scientist:

– A data scientist is responsible for analyzing and interpreting complex data to


extract meaningful insights.
– They design and build models to forecast trends, make predictions, and identify
patterns within data.
– Data scientists use machine learning algorithms, statistical models, and ad-
vanced analytics techniques to solve business problems.
– Example: A data scientist develops a predictive model to forecast customer
churn based on historical data and trends.

• Data Engineer:

– A data engineer is responsible for designing, constructing, and maintaining the


systems and infrastructure that collect, store, and process data.
– They ensure that data pipelines are efficient, scalable, and capable of handling
large volumes of data.
– Data engineers work closely with data scientists to ensure the availability of
clean and well-structured data for analysis.
– Example: A data engineer designs and implements a data pipeline that ex-
tracts real-time transactional data from an e-commerce platform and stores it
in a data warehouse.


• Business Analyst:

– A business analyst bridges the gap between the technical team (data scientists
and engineers) and business stakeholders.
– They are responsible for understanding the business problem and translating
it into actionable data-driven solutions.
– Business analysts also interpret the results of data analysis and communicate
them in a way that is understandable for non-technical stakeholders.
– Example: A business analyst analyzes customer feedback data and interprets
the results to help the marketing team refine their targeting strategy.

• Project Manager:

– A project manager oversees the overall execution of an analytics project, en-


suring that it stays on track and is completed within scope, time, and budget.
– They coordinate between teams, manage resources, and resolve any issues that
may arise during the project.
– Project managers also ensure that the project delivers business value and meets
stakeholder expectations.
– Example: A project manager ensures that the data engineering team delivers
clean data on time, while also coordinating with the data scientists to make
sure the model development phase proceeds smoothly.

6.0.3 Phases of Data Analytics Lifecycle


The phases of the Data Analytics Lifecycle are critical to successfully executing an ana-
lytics project. Each phase ensures the project follows a systematic approach from start
to finish:

1. Discovery:

• Identify the business problem or goal.


• Understand the data requirements and sources.
• Define the scope and objectives of the project.

2. Data Preparation:

• Collect and consolidate relevant data.


• Clean the data by handling missing values, duplicates, and errors.
• Transform the data into a suitable format for analysis (e.g., normalization,
encoding).

3. Model Planning:

• Choose the appropriate analytical methods (e.g., regression, clustering).


• Select suitable algorithms based on the business needs.
• Define evaluation metrics (e.g., accuracy, precision, recall).


4. Model Building:

• Implement the selected models using tools like Python, R, or machine learning
libraries (e.g., Scikit-learn, TensorFlow).
• Train the model on the prepared dataset.
• Tune hyperparameters to improve model performance.

5. Communicating Results:

• Visualize findings using tools like Tableau, Power BI, or matplotlib.


• Present insights to stakeholders in a clear, understandable format.
• Provide actionable recommendations based on the results.

6. Operationalization:

• Deploy the model into a production environment for real-time analysis or batch
processing.
• Integrate the model with existing business systems (e.g., CRM, ERP).
• Monitor and maintain the model’s performance over time.

Example: A retail company builds a model to predict customer churn and integrates
it into their CRM system.


ITECH WORLD AKTU


Subject Name: Data Analytics (DA)
Subject Code: BCS052

Unit 2: Data Analysis

Syllabus
1. Regression modeling.

2. Multivariate analysis.

3. Bayesian modeling, inference, and Bayesian networks.

4. Support vector and kernel methods.

5. Analysis of time series:

• Linear systems analysis.


• Nonlinear dynamics.

6. Rule induction.

7. Neural networks:

• Learning and generalisation.


• Competitive learning.
• Principal component analysis and neural networks.

8. Fuzzy logic:

• Extracting fuzzy models from data.


• Fuzzy decision trees.

9. Stochastic search methods.


Detailed Notes
1 Regression Modeling
Regression modeling is a fundamental statistical technique used to examine the relation-
ship between one dependent variable (outcome) and one or more independent variables
(predictors or features). It helps in understanding, modeling, and predicting the depen-
dent variable based on the behavior of independent variables.

Objectives of Regression Modeling


• To identify and quantify relationships between variables.

• To predict future outcomes based on historical data.

• To understand the influence of independent variables on a dependent variable.

• To identify trends and make informed decisions in various fields such as economics,
medicine, engineering, and marketing.

Types of Regression Models


1. Linear Regression:

• Establishes a linear relationship between dependent and independent variables


using the equation y = mx + c, where y is the dependent variable, x is the
independent variable, m is the slope, and c is the intercept.
• Assumes that the relationship between variables is linear and that residuals
(errors) are normally distributed.
• Suitable for predicting continuous outcomes.

Example: Predicting house prices based on size, number of rooms, and location.

2. Multiple Linear Regression:

• Extends linear regression to include multiple independent variables.


• The equation becomes y = b0 + b1x1 + b2x2 + · · · + bnxn + ϵ, where b0 is the
intercept, b1, b2, . . . , bn are the coefficients, and ϵ is the error term.

Example: Predicting a company’s sales revenue based on advertising spend, num-


ber of salespeople, and seasonal effects.

3. Logistic Regression:

• Used for binary classification problems where the outcome is categorical (e.g.,
0 or 1, Yes or No).
• Employs the sigmoid function σ(x) = 1 / (1 + e^(−x)) to model probabilities.
• Suitable for predicting binary or categorical outcomes.

Example: Classifying whether a patient has a disease based on medical test results.
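
The sketch below shows, on small synthetic data, how linear and logistic regression models can be fitted with scikit-learn; the numbers are invented purely for illustration:

    # Hedged scikit-learn sketch of linear and logistic regression (synthetic data).
    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    # Linear regression: predict a continuous outcome (e.g., price vs. size).
    X = np.array([[50], [80], [120], [160]])     # house size in square metres (assumed)
    y = np.array([150, 210, 300, 390])           # price in thousands (assumed)
    lin = LinearRegression().fit(X, y)
    print(lin.predict([[100]]))                  # predicted price for a 100 m^2 house

    # Logistic regression: predict a binary outcome (e.g., disease yes/no).
    X2 = np.array([[1.2], [2.5], [3.1], [4.8]])  # a medical test result (assumed)
    y2 = np.array([0, 0, 1, 1])                  # 0 = healthy, 1 = disease
    log = LogisticRegression().fit(X2, y2)
    print(log.predict_proba([[3.0]]))            # class probabilities via the sigmoid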


Steps in Regression Modeling

(a) Data Collection: Gather data relevant to the problem, ensuring accuracy
and completeness.
(b) Data Preprocessing: Handle missing values, scale variables, and identify
outliers.
(c) Feature Selection: Identify the most significant predictors using methods
like correlation analysis or stepwise selection.
(d) Model Building: Fit the regression model using statistical software or pro-
gramming languages like Python or R.
(e) Model Evaluation: Assess the model’s performance using metrics such as
R2 , Mean Squared Error (MSE), or Mean Absolute Error (MAE).
(f) Prediction: Use the model to make predictions on new or unseen data.

Applications of Regression Modeling

• Business: Forecasting sales, revenue, and market trends.


• Healthcare: Predicting disease outcomes or treatment effectiveness.
• Engineering: Modeling system reliability and performance.
• Finance: Estimating stock prices or credit risks.

Example: Predicting Sales Revenue A retail company wants to predict


its monthly sales revenue based on advertising spend and the number of active
customers. Using multiple linear regression, the dependent variable is sales revenue,
and the independent variables are advertising spend and customer count. The
fitted model could help optimize the allocation of marketing budgets for maximum
revenue.

2 Multivariate Analysis
Multivariate analysis is a statistical technique used to analyze data involving multi-
ple variables simultaneously. It helps in understanding the relationships, patterns,
and structure within datasets where more than two variables are interdependent.

Objectives of Multivariate Analysis

• To identify relationships and dependencies among multiple variables.


• To reduce the dimensionality of datasets while retaining important informa-
tion.
• To classify or group data into meaningful categories.
• To predict outcomes based on multiple predictors.


Types of Multivariate Analysis Techniques

(a) Factor Analysis:


• Identifies underlying factors or latent variables that explain the observed
data.
• Reduces a large set of variables into smaller groups based on their corre-
lations.
• Uses methods like Principal Axis Factoring (PAF) or Maximum Likelihood
Estimation (MLE).
Example: In psychology, factor analysis is used to identify latent traits like
intelligence or personality from observed behavior.
(b) Cluster Analysis:
• Groups similar data points into clusters based on their characteristics.
• Common algorithms include K-Means, Hierarchical Clustering, and DB-
SCAN.
• Does not require pre-defined labels and is used for exploratory data anal-
ysis.
Example: Customer segmentation in marketing to classify customers into
groups like high-value, low-value, or occasional buyers.
(c) Principal Component Analysis (PCA):
• A dimensionality reduction technique that transforms data into a set of
linearly uncorrelated components (principal components).
• Retains as much variance as possible while reducing the number of vari-
ables.
• Helps visualize high-dimensional data in 2D or 3D spaces.
Example: Simplifying genome data by reducing thousands of genetic vari-
ables to a manageable number of principal components.

Applications of Multivariate Analysis

• Marketing: Customer segmentation, product positioning, and preference


analysis.
• Finance: Risk assessment, portfolio optimization, and fraud detection.
• Healthcare: Analyzing patient data to predict disease outcomes or treatment
responses.
• Psychology: Identifying personality traits or cognitive factors using survey
data.
• Environment: Studying the impact of multiple environmental factors on
ecosystems.


Steps in Multivariate Analysis

(a) Define the Problem: Clearly identify the objectives and variables to be
analyzed.
(b) Collect Data: Gather accurate and relevant data for all variables.
(c) Preprocess Data: Handle missing values, standardize variables, and detect
outliers.
(d) Choose the Method: Select an appropriate multivariate technique based on
the objective.
(e) Apply the Method: Use statistical software (e.g., Python, R, SPSS) to
conduct the analysis.
(f) Interpret Results: Understand the output, identify patterns, and draw ac-
tionable insights.

Advantages of Multivariate Analysis

• Handles complex datasets with multiple interdependent variables.


• Reduces dimensionality while retaining essential information.
• Enhances predictive accuracy in machine learning models.
• Provides deeper insights for decision-making.

Limitations of Multivariate Analysis

• Requires a large sample size to achieve reliable results.


• Sensitive to multicollinearity among variables.
• Interpretation of results can be challenging for non-experts.

Example: Customer Segmentation in Marketing A retail com-


pany wants to segment its customer base to improve targeted marketing campaigns.
Using cluster analysis, customer data such as age, income, purchase frequency, and
product preferences are grouped into clusters. The company identifies three main
segments:

(a) High-income, frequent buyers.


(b) Middle-income, occasional buyers.
(c) Low-income, infrequent buyers.

The insights help the company design personalized offers and allocate marketing
budgets effectively.
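
A minimal Python sketch of such a segmentation is shown below, using K-Means on a small synthetic customer table; the feature values are invented for illustration:

    # Sketch of cluster analysis for customer segmentation (synthetic values).
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    # columns: [age, annual income (k), purchase frequency] -- assumed features
    customers = np.array([
        [25, 30, 2], [40, 85, 12], [35, 60, 6],
        [50, 120, 15], [28, 35, 3], [45, 95, 10],
    ])
    scaled = StandardScaler().fit_transform(customers)   # standardize before clustering
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
    print(labels)                                        # cluster id per customer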


3 Bayesian Modeling, Inference, and Bayesian Networks

1. Bayesian Modeling

• Bayesian modeling is a statistical approach that applies Bayes’ theorem to


update probabilities as new evidence or information becomes available.
• It incorporates prior knowledge (prior probabilities) along with new evidence
(likelihood) to compute updated probabilities (posterior probabilities).
• Bayes’ theorem is expressed as:

  P(A|B) = P(B|A) · P(A) / P(B)

where:
– P (A|B): Posterior probability (probability of A given B).
– P (B|A): Likelihood (probability of observing B given A).
– P (A): Prior probability (initial belief about A).
– P (B): Evidence (probability of observing B).
• Bayesian modeling is particularly useful in situations with uncertainty or in-
complete data.
• Applications:
– Forecasting in finance, weather, and sports.
– Fraud detection in transactions.
– Medical diagnosis based on symptoms and test results.
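
The short sketch below works through Bayes' theorem numerically for a medical-test scenario; the prevalence, sensitivity, and false-positive figures are assumed values chosen only to illustrate the formula:

    # Worked Bayes' theorem example with assumed numbers: 1% prevalence,
    # 95% test sensitivity, 10% false-positive rate (illustrative only).
    def posterior(prior, likelihood, false_positive_rate):
        # P(A|B) = P(B|A) P(A) / P(B), with P(B) expanded over A and not-A
        evidence = likelihood * prior + false_positive_rate * (1 - prior)
        return likelihood * prior / evidence

    print(posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.10))
    # approx. 0.088: even after a positive test, the disease probability stays under 9%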

2. Inference in Bayesian Modeling

• Bayesian inference involves the process of deducing likely outcomes based on


prior knowledge and new evidence.
• It answers questions such as:
– What is the probability of a hypothesis being true given the observed
data?
– How should we update our belief about a hypothesis when new data is
observed?
• Types of Bayesian inference:
(a) Point Estimation: Finds the single best estimate of a parameter (e.g.,
Maximum A Posteriori (MAP)).
(b) Interval Estimation: Provides a range of values (credible intervals)
where a parameter likely lies.
(c) Posterior Predictive Checks: Validates models by comparing predic-
tions to observed data.
• Advantages:
– Allows for dynamic updates as new data becomes available.
– Handles uncertainty effectively by integrating prior information.


3. Bayesian Networks

• Bayesian networks are graphical models that represent a set of variables and
their probabilistic dependencies using directed acyclic graphs (DAGs).
• Components of a Bayesian network:
– Nodes: Represent variables.
– Edges: Represent dependencies between variables.
– Conditional Probability Tables (CPTs): Quantify the relationships
between connected variables.
• Applications:
– Diagnosing diseases based on symptoms and test results.
– Predicting equipment failures in industrial systems.
– Understanding causal relationships in data.

4. Advantages of Bayesian Methods

• Incorporates prior knowledge into the analysis, making it robust for decision-
making.
• Handles uncertainty and incomplete data effectively.
• Supports dynamic updating of models as new evidence becomes available.

5. Limitations of Bayesian Methods

• Computationally intensive for large datasets or complex models.


• Requires careful selection of prior probabilities, which can introduce bias if
chosen incorrectly.

4 Support Vector and Kernel Methods


Support Vector Machines (SVM) and Kernel Methods are powerful techniques used
in machine learning for classification and regression tasks. Below is a detailed
breakdown:

4.1 Support Vector Machines (SVM)

• Definition: SVM is a supervised learning algorithm that identifies the best


hyperplane to separate different classes in the dataset.
• Key Features:
– Maximizes the margin between data points of different classes.
– Works well for both linearly separable and non-linear data.
– Robust to high-dimensional spaces and effective in scenarios with many
features.


• Objective: The objective of SVM is to find the hyperplane that maximizes the
margin between the nearest data points of different classes, known as support
vectors.
Maximize: 2 / ∥w∥
subject to: yi (w · xi + b) ≥ 1 for all i
where:
– w: Weight vector defining the hyperplane.
– xi : Input data points.
– yi : Class labels (+1 or −1).
– b: Bias term.
• Soft Margin SVM: In cases where perfect separation is not possible, SVM
introduces slack variables ξi to allow misclassification:

yi (w · xi + b) ≥ 1 − ξi , ξi ≥ 0

The optimization problem becomes:


Minimize: (1/2) ∥w∥² + C Σ(i=1..n) ξi

where C is a regularization parameter controlling the trade-off between maximizing
the margin and minimizing the classification error.
• Applications:
– Spam email detection.
– Image classification.
– Sentiment analysis.
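
A hedged scikit-learn sketch of a soft-margin SVM on synthetic 2-D points is given below; the data and the value of C are illustrative:

    # Soft-margin SVM with a linear kernel; C is the trade-off parameter above.
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])  # assumed points
    y = np.array([-1, -1, -1, 1, 1, 1])                             # class labels

    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    print(clf.support_vectors_)                     # the support vectors found
    print(clf.predict([[4, 4]]))                    # classify a new point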

4.2 Kernel Methods

• Definition: Kernel methods enable SVM to handle non-linearly separable


data by transforming it into a higher-dimensional space.
• Key Features:
– Uses kernel functions to compute relationships between data points in
higher dimensions.
– Avoids explicit computation in high-dimensional space, reducing computational complexity (the kernel trick).
– Common kernel functions:
∗ Linear Kernel: K(xi, xj) = xi · xj
∗ Polynomial Kernel: K(xi, xj) = (xi · xj + c)^d
∗ RBF (Gaussian) Kernel: K(xi, xj) = exp(−γ ∥xi − xj∥²)
• Applications:
– Face recognition.
– Medical diagnosis.
– Stock price prediction.

4.3 Advantages of SVM and Kernel Methods:

• Effective in high-dimensional spaces.


• Works well with small datasets due to the use of support vectors.
• Kernel methods allow handling complex, non-linear relationships.

4.4 Limitations of SVM and Kernel Methods:

• Computationally intensive for large datasets.


• Performance depends on the choice of kernel and its parameters.
• Not well-suited for datasets with significant noise or overlapping classes.

5 Analysis of Time Series


Time series analysis involves examining data points collected or recorded at spe-
cific time intervals to identify patterns, trends, and insights. It is widely used for
forecasting and understanding temporal behaviors.


1. Linear Systems Analysis

• Definition: Linear systems analysis examines the linear relationships between


variables in a time series to predict future trends.
• Characteristics:
– Assumes a linear relationship between past values and future observations.
– Uses techniques such as moving averages and autoregression.
• Common Techniques:
– Autoregressive Models: Use past values of the time series to predict future
values.
– Moving Average Models: Use past error terms for predictions.
– ARMA Models: Combine autoregressive and moving average approaches
for better accuracy.
• Applications:
– Stock price prediction based on historical price trends.
– Economic forecasting for GDP or inflation rates.
– Electricity demand prediction.
• Example:
– A financial analyst uses historical price and volatility data to forecast
future stock prices using a combination of linear models.
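
As a small illustration of an autoregressive (linear) model, the sketch below fits an AR(1) coefficient by least squares on a synthetic series; a real analysis would normally use a dedicated time-series library, but the idea is the same:

    # Fit x_t = phi * x_{t-1} + noise by least squares on a synthetic series.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.zeros(200)
    for t in range(1, 200):                       # generate an AR(1) series with phi = 0.7
        x[t] = 0.7 * x[t - 1] + rng.normal(scale=0.5)

    phi = np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1])   # least-squares AR(1) estimate
    print(phi)                                    # should be close to 0.7
    print(phi * x[-1])                            # one-step-ahead forecast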

2. Nonlinear Dynamics

• Definition: Nonlinear dynamics analyze time series data that exhibit chaotic
or nonlinear behaviors, which cannot be captured by linear models.
• Characteristics:
– Relationships between variables are complex and not proportional.
– Small changes in initial conditions can lead to significant differences in
outcomes (sensitive dependence on initial conditions).
• Common Techniques:
– Delay Embedding: Reconstructs a system’s phase space from a time series
to analyze its dynamics.
– Fractal Dimension Analysis: Measures the complexity of the data.
– Lyapunov Exponent: Quantifies the sensitivity to initial conditions.
• Applications:
– Modeling weather systems, which involve chaotic dynamics.
– Predicting heart rate variability in medical diagnostics.
– Analyzing financial markets where nonlinear dependencies exist.
• Example:
– Meteorologists use nonlinear dynamics to predict weather patterns, ac-
counting for the chaotic interactions of atmospheric variables.


3. Combining Linear and Nonlinear Models

• In practice, time series data often exhibit both linear and nonlinear patterns.
• Hybrid models, such as combining traditional time series models with machine
learning techniques, are used to capture both types of behaviors for improved
accuracy.

6 Rule Induction
Rule induction extracts rules from data to create interpretable models.

• Definition: Rule induction is a method that automatically generates decision


rules from data.
• Key Features:
– Produces easy-to-understand rules for decision-making.
– Used for classification tasks in machine learning.
– Helps uncover hidden patterns in data.
• Applications:
– Credit risk analysis.
– Medical diagnosis.
– Customer segmentation.
• Example: In credit risk analysis, rules are induced to predict whether a
customer will default on a loan based on features such as income, credit score,
and loan amount.
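
One common way to induce readable rules is to fit a shallow decision tree and print its if-then structure, as in the hedged scikit-learn sketch below; the income and credit-score values are synthetic:

    # Rule induction via a shallow decision tree on assumed credit data.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    # columns: [income (k), credit score]; label: 1 = default, 0 = repaid
    X = np.array([[20, 550], [35, 600], [60, 700], [80, 760], [25, 580], [90, 800]])
    y = np.array([1, 1, 0, 0, 1, 0])

    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=["income", "credit_score"]))  # if-then rules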

7 Neural Networks
Neural networks are computational models inspired by the human brain, used for
pattern recognition and predictive tasks.

1. Learning and Generalisation

• Definition: Neural networks learn from historical data and generalize pat-
terns to make predictions on new, unseen data.
• Key Features:
– Learn complex relationships in data.
– Generalize well to unseen data if properly trained.
• Example: A neural network trained on a set of images of handwritten digits
can generalize and classify new, unseen digits.


2. Competitive Learning

• Definition: Competitive learning is a type of unsupervised learning where


neurons compete to represent the input data.
• Key Features:
– No target output is provided.
– Clusters similar data points by competition between neurons.
• Example: A competitive learning network used for clustering customer data
based on purchasing behavior.

3. Principal Component Analysis (PCA) and Neural Networks

• Definition: PCA reduces the dimensionality of data while retaining most of


the variance, which is then used as input for neural networks.
• Key Features:
– PCA helps in reducing the computational complexity.
– Neural networks can be trained more efficiently with reduced dimension-
ality.
• Example: Handwriting recognition, where PCA is used to reduce the number
of features (pixels), followed by neural network training for classification.

4. Supervised and Unsupervised Learning

• Supervised Learning: Involves training a model using labeled data (input-


output pairs) to make predictions.
• Unsupervised Learning: Involves learning patterns and structures from
unlabeled data without predefined output labels.

5. Comparison Between Supervised and Unsupervised Learning

Criteria | Supervised Learning | Unsupervised Learning | Examples
Data Type | Labeled data | Unlabeled data | Image classification, Spam detection
Output | Predicted output for new data | Hidden patterns or clusters | Market basket analysis, Clustering
Goal | Learn mapping from input to output | Discover structure or distribution | Stock price prediction, Customer segmentation
Algorithms | Regression, classification, etc. | Clustering, association, etc. | Decision trees, k-means clustering
Performance Evaluation | Accuracy, Precision, Recall | In terms of clusters or patterns discovered | Silhouette score, Davies-Bouldin index
Use Case | Predict outcomes for unseen data | Find hidden structure in data | Fraud detection, Topic modeling

Table 1: Comparison between Supervised and Unsupervised Learning

8 Multilayer Perceptron Model with Its Learning Algorithm

The Multilayer Perceptron (MLP) is a type of artificial neural network consisting
of an input layer, one or more hidden layers, and an output layer. MLP is used for
both classification and regression tasks.

1. Structure of the Multilayer Perceptron (MLP)

• The MLP consists of multiple layers of neurons:

– Input Layer: Receives the input features.
– Hidden Layers: One or more layers where the actual computation happens.
– Output Layer: Produces the final prediction.


• Each neuron in a layer is connected to every neuron in the next layer, and
each connection has a weight.
• Non-linear activation functions (e.g., Sigmoid, ReLU) are used in the hidden
and output layers.
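
A minimal scikit-learn sketch of an MLP with one hidden layer is shown below, trained on a toy XOR-style dataset; the layer size, solver, and data are illustrative choices, not part of the original text:

    # One hidden layer of 8 ReLU units trained on repeated XOR examples.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 50)   # repeated XOR inputs
    y = np.array([0, 1, 1, 0] * 50)                       # XOR labels

    mlp = MLPClassifier(hidden_layer_sizes=(8,), activation="relu",
                        solver="lbfgs", max_iter=2000, random_state=0)
    mlp.fit(X, y)                                         # weights learned by training
    print(mlp.predict([[0, 1], [1, 1]]))                  # predictions for new inputs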


9 Fuzzy Logic
1. Extracting Fuzzy Models from Data

• Definition: Fuzzy logic translates real-world, uncertain, or imprecise data


into fuzzy models that are human-readable.
• Process: Fuzzy models are built by identifying patterns in data and convert-
ing them into fuzzy rules using linguistic variables.
• Key Features:
– Handles vague or ambiguous information.
– Rules are expressed as ”if-then” statements, such as ”if temperature is
high, then likelihood of rain is high.”
– Allows reasoning with degrees of truth rather than binary true/false logic.
• Applications:
– Control systems (e.g., temperature control).
– Decision-making in uncertain environments.
– Expert systems and diagnostics.

2. Fuzzy Decision Trees

• Definition: Combines fuzzy logic with decision trees, providing a robust


framework for decision-making under uncertainty.
• Process: In fuzzy decision trees, data is split into branches based on fuzzy
conditions (e.g., ”low”, ”medium”, ”high” values) rather than exact thresholds.
• Key Features:
– Nodes represent fuzzy sets, and edges represent fuzzy conditions.
– Each branch can handle uncertainty, allowing a more nuanced decision
process.
– Fuzzy decision trees work well for classification tasks where exact data
values are difficult to interpret.
• Applications:
– Medical diagnosis based on symptoms.
– Classification problems with imprecise data.

Example: Fuzzy-based Climate Prediction

• Scenario: Predicting climate or weather conditions based on fuzzy logic rules.


• Process: Use fuzzy variables such as temperature, humidity, and wind speed
to create rules like:
– ”If temperature is high and humidity is low, then it is likely to be sunny.”
– ”If wind speed is high and humidity is medium, then there is a chance of
rain.”


• Outcome: The fuzzy model produces a prediction with a degree of certainty


(e.g., 70% chance of rain).
• Applications: Weather forecasting, climate modeling, and environmental
monitoring.
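
A bare-bones Python sketch of one such fuzzy rule is given below, using triangular membership functions; the membership ranges are assumed values, not taken from the text:

    # Triangular membership functions and one illustrative rule from the climate example.
    def triangular(x, a, b, c):
        """Degree of membership of x in a triangular fuzzy set (a, b, c)."""
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

    def chance_of_rain(wind_speed, humidity):
        # Rule: IF wind speed is high AND humidity is medium THEN chance of rain.
        wind_high = triangular(wind_speed, 20, 40, 60)        # km/h, assumed ranges
        humidity_medium = triangular(humidity, 30, 50, 70)    # percent, assumed ranges
        return min(wind_high, humidity_medium)                # fuzzy AND = minimum

    print(chance_of_rain(wind_speed=35, humidity=55))         # degree of truth in [0, 1]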

10 Stochastic Search Methods


Stochastic search methods are algorithms that rely on probabilistic approaches to
explore a solution space. These methods are particularly useful for solving opti-
mization problems where traditional deterministic methods may be ineffective due
to complex, large, or poorly understood solution spaces.

1. Genetic Algorithms (GAs)

• Definition: Genetic Algorithms (GAs) are search heuristics inspired by the


process of natural selection. They are used to find approximate solutions to
optimization and search problems.
• Key Concepts:
– Population: A set of potential solutions (individuals), each represented
by a chromosome.
– Selection: A process where individuals are chosen based on their fitness
(how good they are at solving the problem).
– Crossover (Recombination): Combines two selected individuals to
produce offspring by exchanging parts of their chromosomes.
– Mutation: Introduces small random changes to an individual’s chromo-
some to maintain diversity within the population.
– Fitness Function: A function that evaluates the quality of the solutions.
The better the solution, the higher its fitness score.
• Steps in GA:
– Initialize a population of random solutions.
– Evaluate the fitness of each solution.
– Select pairs of solutions to mate and create offspring.
– Apply crossover and mutation to create new individuals.
– Repeat the process for multiple generations.
• Applications:
– Optimization problems, such as finding the best parameters for a machine
learning model.
– Engineering design, such as the design of aerodynamic shapes.
– Game strategies and route planning.


2. Simulated Annealing (SA)

• Definition: Simulated Annealing (SA) is a probabilistic technique for ap-


proximating the global optimum of a given function. It mimics the process of
annealing in metallurgy, where a material is heated and then slowly cooled to
remove defects.
• Key Concepts:
– Temperature: A parameter that controls the probability of accepting
worse solutions as the algorithm explores the solution space. Initially
high, it decreases over time.
– Acceptance Probability: The algorithm may accept a worse solution
with a certain probability to escape local minima and search for a better
global minimum. The probability decreases as the temperature lowers.
– Neighborhood Search: At each iteration, the algorithm randomly ex-
plores neighboring solutions (mutates the current solution).
• Steps in Simulated Annealing:
– Initialize a random solution and set an initial temperature.
– Iteratively explore neighboring solutions and calculate the change in en-
ergy (cost or objective function).
– Accept the new solution with a certain probability, which is a function of
the temperature and the energy difference.
– Gradually decrease the temperature according to a cooling schedule.
– Repeat until the system reaches equilibrium or a stopping condition is
met.
• Applications:
– Solving combinatorial optimization problems, such as the traveling sales-
man problem.
– Circuit design, such as the placement of components in a chip.
– Machine learning hyperparameter tuning.

Example: Optimization in Route Planning

• Problem: Finding the optimal route for delivery trucks that minimizes travel
distance or time.
• Solution:
– Genetic Algorithms: Can be used to evolve a population of possible
routes, selecting and combining the best routes through crossover and
mutation to find an optimal or near-optimal solution.
– Simulated Annealing: Can be used to explore the space of possible
routes, accepting less optimal routes in the short term (to escape local
minima) and gradually converging to an optimal route as the temperature
decreases.


ITECH WORLD AKTU


Subject Name: Data Analytics
Subject Code: BCS052

UNIT 3: Mining Data Streams


Syllabus
1. Introduction to streams concepts

2. Stream data model and architecture

3. Stream computing

4. Sampling data in a stream

5. Filtering streams

6. Counting distinct elements in a stream

7. Estimating moments

8. Counting ones in a window

9. Decaying window

10. Real-time Analytics Platform (RTAP) applications

11. Case studies:

• Real-time sentiment analysis


• Stock market predictions

1. Introduction to Streams Concepts

Definition: A data stream is a continuous and real-time flow of data elements made
available sequentially over time. Unlike traditional static datasets, data streams are
dynamic and require real-time processing and analysis to extract actionable insights.
Key Characteristics:

• Continuous Flow: Data streams are generated and processed continuously,


often without a defined start or end point.

• High Volume: Streams can produce a large amount of data per second, re-
quiring scalable systems to handle the load.

• Real-Time Processing: Due to their continuous nature, streams demand


real-time or near real-time analysis.

• Transient Data: Data in streams may not be stored permanently and could
be processed in memory or with sliding windows.

• Heterogeneity: Data streams can come from diverse sources and in varying
formats (structured, semi-structured, or unstructured).

Applications of Data Streams:

• Social Media Analytics: Monitoring platforms like Twitter or Instagram for


trending topics and sentiment analysis.

• IoT (Internet of Things): Devices like smart sensors in a factory transmit-


ting data about temperature, pressure, or performance in real-time.

• E-commerce: Streaming customer interactions on websites to offer dynamic


recommendations.

• Finance: Real-time stock price analysis for investment decisions.

• Transportation: Monitoring traffic flows using data from GPS devices.

Example 1: Social Media Feeds

• Social media platforms like Twitter continuously generate streams of tweets.


These streams can be processed in real-time to identify trending hashtags,
analyze user sentiments, or detect breaking news.

Example 2: Sensor Data from IoT Devices

• Consider a smart home system where sensors monitor temperature, humidity,


and energy usage. These sensors send continuous data streams to a central
server, enabling real-time decisions like adjusting the thermostat or sending
alerts for anomalies.

Challenges in Handling Data Streams:

• Scalability: Systems need to scale dynamically to handle spikes in data vol-


ume.

• Latency: Minimizing the time between data arrival and actionable insight is
crucial for applications like fraud detection.

• Data Quality: Ensuring the accuracy and reliability of streaming data, which
might have noise or incomplete values.

2. Stream Data Model and Architecture



Architecture of Stream Data Processing (Based on Diagram):

1. Data Streams:

• Continuous flow of data from various sources, such as sensors, social media,
or log files.

2. Stream Manager:

• Manages incoming data streams and forwards them to the processing com-
ponents.

3. System Catalog:

• Stores metadata about the system, such as stream schemas and resources
used for processing.

4. Scheduler:

• Allocates tasks and resources efficiently to process the incoming data


streams.

5. Router:

• Directs data streams to appropriate processing units or queues for further


operations.

6. Queue Manager:

• Manages buffers and organizes incoming data to ensure orderly processing.

7. Query Processor:

• Executes queries on the stream data and provides results in real-time.

8. Query Optimizer:

• Optimizes query execution plans for efficiency in processing large volumes


of streaming data.

9. QoS Monitoring (Quality of Service):

• Ensures data processing meets predefined performance metrics, such as


latency and throughput.

10. Storage Manager:

• Handles temporary or permanent storage of processed data, typically in


databases or file systems.

11. Secondary Storage:

• Used for long-term storage of data streams and historical data.



3. Stream Computing

Definition: Stream computing refers to the continuous and real-time processing of


data streams as they arrive, without storing them for batch processing.
Key Features of Stream Computing:

• Continuous Processing: Data is processed on the fly as it streams in.

• Low Latency: Enables near real-time decision-making by minimizing process-


ing delays.

• Scalable: Handles large and variable volumes of data streams efficiently.

• Fault Tolerance: Ensures the system can recover from failures and continue
processing seamlessly.

Advantages of Stream Computing:

• Real-Time Insights: Enables immediate response to events, such as fraud


detection or anomaly identification.

• Dynamic Scalability: Adapts to fluctuating data volumes in applications


like IoT and e-commerce.

• Improved Resource Utilization: Processes only the required data without


storing entire datasets.

• Continuous Analytics: Provides ongoing analysis rather than waiting for


batch processes to complete.

• Supports Complex Workflows: Handles advanced operations like filtering,


aggregations, and joins in real-time.

Applications of Stream Computing:

• Fraud Detection: Monitors transactions in real-time for unusual patterns.

• IoT Applications: Processes sensor data from smart devices continuously.

• Healthcare: Analyzes patient vitals in real-time for critical care alerts.

• E-commerce: Tracks customer behavior to offer real-time recommendations.

• Social Media Analytics: Monitors and analyzes trends and user sentiments.

Example: Real-Time Traffic Monitoring

• GPS devices in vehicles stream location data to a central system.

• Stream computing processes this data to identify traffic congestion and suggest
alternative routes in real-time.

Example: Fraud detection in banking: Transaction streams are analyzed to detect


unusual patterns in real-time.

4. Sampling Data in a Stream

Definition: Sampling in data streams involves selecting a representative subset of


the stream for analysis, especially when processing the entire stream is infeasible due
to high volume or resource constraints.
Importance of Sampling:

• Reduces computational load by working on a smaller subset of data.

• Helps in estimating patterns or trends without analyzing the full stream.

• Useful in scenarios where storage or processing capacity is limited.

Techniques for Sampling:

• Random Sampling:

– Data points are chosen randomly from the stream.


– Simple to implement but may not ensure uniform representation.

• Reservoir Sampling:

– Maintains a fixed-size sample from the stream by dynamically replacing


elements as new data arrives.
– Ensures uniform probability for each data point in the stream.

• Systematic Sampling:

– Picks every k-th element from the stream after selecting a random starting
point.
– Useful when the stream has a repetitive structure or pattern.

• Stratified Sampling:

– Divides the stream into strata (subgroups) and samples proportionally


from each stratum.
– Ensures representation of all subgroups in the data.

• Priority Sampling:

– Assigns a priority to each data point based on importance or weight and


selects data with higher priorities.

Advantages of Sampling:

• Reduces memory and processing requirements.

• Allows real-time analytics on high-velocity streams.

• Enables quicker insights by focusing on relevant portions of the data.



Applications of Sampling:

• Social Media Analysis: Sampling a subset of tweets to estimate trending


topics or sentiments.

• Network Traffic Monitoring: Analyzing a fraction of packets to detect


anomalies or patterns.

• Financial Data Streams: Sampling stock price streams to estimate market


trends without analyzing all data.

Example 1: Random Sampling in Sentiment Analysis

• From a stream of 1 million tweets, randomly select 10% to estimate the senti-
ment trends for a product launch.

Example 2: Reservoir Sampling in IoT

• A smart thermostat collects temperature readings continuously. Reservoir sam-


pling is used to maintain a fixed-size sample for quick diagnostics.
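
The following is a minimal Python sketch of reservoir sampling (the technique used in Example 2); the stream source and the sample size k are illustrative assumptions.

import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = random.randint(0, i)   # random index in [0, i]
            if j < k:
                sample[j] = item       # replace with probability k/(i+1)
    return sample

# Illustration: keep 5 readings from a simulated temperature stream
readings = (round(random.uniform(20, 40), 1) for _ in range(10_000))
print(reservoir_sample(readings, 5))

Each arriving element ends up in the sample with equal probability, no matter how long the stream runs, which is what makes the technique suitable for unbounded streams.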

5. Filtering Streams

Definition: Filtering streams involves selecting or discarding elements from a data


stream based on specific criteria or conditions.
Techniques:

• Content-based Filtering: Filters data based on attributes or content (e.g.,


filter tweets with specific hashtags).

• Time-based Filtering: Filters data within a specific time range (e.g., log
entries in the last 24 hours).

• Probabilistic Filtering: Uses approximate methods like Bloom Filters to test
membership efficiently (a sketch follows the example below).

Example:

• Filter sensor data to only include temperature readings above 30°C.
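
Below is a minimal sketch of probabilistic filtering with a Bloom filter in plain Python; the bit-array size, hash count, and watch-list IDs are illustrative assumptions, not part of any particular library.

import hashlib

class BloomFilter:
    """Simple Bloom filter: may give false positives, never false negatives."""
    def __init__(self, size=10_000, num_hashes=4):
        self.size, self.num_hashes = size, num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

# Illustration: keep only stream events from user IDs on a watch list
watched = BloomFilter()
for uid in ["u42", "u99"]:
    watched.add(uid)

stream = ["u17", "u42", "u88", "u99"]
print([uid for uid in stream if watched.might_contain(uid)])  # ["u42", "u99"]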

6. Counting Distinct Elements in a Stream

Definition: Counting distinct elements identifies the number of unique items in a


data stream.
Challenges:

• High data volume makes storing all elements impractical.

• Efficient algorithms are needed to approximate the count.



Techniques:

• Hashing: Hash elements to reduce memory usage.

• HyperLogLog Algorithm: Estimates the count using probabilistic data


structures with minimal memory.

Example:

• Count the number of unique IP addresses in network traffic logs.
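
As a rough sketch of the idea behind such estimators, the Flajolet-Martin approach (the building block that HyperLogLog refines) hashes each element and tracks the maximum number of trailing zero bits seen; the hash function and the IP addresses below are illustrative assumptions.

import hashlib

def trailing_zeros(n, width=32):
    """Number of trailing zero bits in an n-bit hash value."""
    if n == 0:
        return width
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

def fm_estimate(stream):
    """Flajolet-Martin style estimate of the number of distinct elements."""
    max_zeros = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) & 0xFFFFFFFF
        max_zeros = max(max_zeros, trailing_zeros(h))
    return 2 ** max_zeros   # coarse estimate; real systems average many hash functions

# Illustration: estimate unique IP addresses in a log stream (50 distinct here)
ips = [f"10.0.0.{i % 50}" for i in range(5_000)]
print(fm_estimate(ips))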

7. Estimating Moments

Definition: Moments are statistical measures used to understand the properties of


a data distribution, such as variance or skewness.
Types of Moments:

• First Moment: The sum of elements in a stream.

• Second Moment: The sum of squares of elements, used for measuring vari-
ance.

Techniques:

• Alon-Matias-Szegedy (AMS) Algorithm: Efficiently estimates moments


using random sampling.

Example:

• Estimate the variance in transaction amounts in a financial data stream.
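
A sketch of the AMS estimator for the second moment is shown below; it stores the stream as a list purely for readability (a genuine one-pass version would update counters as elements arrive), and the bucketed transaction amounts are made-up values.

import random

def ams_second_moment(stream, num_vars=50):
    """Alon-Matias-Szegedy estimate of the second moment (sum of squared counts)."""
    n = len(stream)
    positions = random.sample(range(n), min(num_vars, n))
    estimates = []
    for p in positions:
        element = stream[p]
        c = sum(1 for x in stream[p:] if x == element)   # occurrences from p onward
        estimates.append(n * (2 * c - 1))
    return sum(estimates) / len(estimates)

# Illustration: variance-related moment of bucketed transaction amounts
amounts = [random.choice([10, 20, 20, 30, 30, 30]) for _ in range(1_000)]
exact = sum(amounts.count(v) ** 2 for v in set(amounts))
print("AMS estimate:", round(ams_second_moment(amounts)), " exact:", exact)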

8. Counting Ones in a Window

Definition: Counting ones in a window involves tracking the number of ones (or
specific events) in a fixed-size window of the stream.
Techniques:

• Sliding Window: Maintains a fixed-size window and updates counts as new


data arrives.

• Exponential Decay: Older data has less influence on the count over time.

Example:

• Count the number of "likes" in the last 5 minutes on a live video stream.
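
A small sketch of the exact sliding-window approach (not the memory-saving DGIM algorithm) is shown below; the window size and the 0/1 "like" events are illustrative.

from collections import deque

class SlidingWindowCounter:
    """Exact count of ones in the last `window` items of a 0/1 stream."""
    def __init__(self, window):
        self.window = window
        self.buffer = deque()
        self.ones = 0

    def add(self, bit):
        self.buffer.append(bit)
        self.ones += bit
        if len(self.buffer) > self.window:
            self.ones -= self.buffer.popleft()   # drop the oldest bit
        return self.ones

# Illustration: "likes" arriving as 0/1 events, window of the last 10 events
counter = SlidingWindowCounter(window=10)
for bit in [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1]:
    current = counter.add(bit)
print("ones in the last 10 events:", current)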

9. Decaying Window

Definition: A decaying window assigns weights to elements in a stream based on


their age, giving more importance to recent data.
Advantages:

• Keeps analytics relevant to recent trends.

• Efficiently handles infinite streams by discarding older, less important data.

Techniques:

• Exponential Decay Function: Applies a decay factor e^(-λt), where λ is the
decay rate and t is the age of the data.

Example:

• Monitor CPU usage, giving higher priority to recent data while reducing older
data’s impact.
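
A minimal sketch of an exponentially decaying aggregate is given below; the decay rate and the CPU usage samples are illustrative assumptions.

import math

class DecayingCounter:
    """Exponentially decayed sum: an item of age t contributes x * e^(-lambda * t)."""
    def __init__(self, decay_lambda=0.2):
        self.factor = math.exp(-decay_lambda)   # per-time-step decay factor
        self.value = 0.0

    def add(self, x):
        # Multiplying by the factor ages every earlier contribution by one step
        self.value = self.value * self.factor + x
        return self.value

# Illustration: CPU usage samples; recent samples dominate the decayed score
cpu = DecayingCounter(decay_lambda=0.2)
for usage in [30, 35, 90, 92, 40, 38]:
    score = cpu.add(usage)
print("decayed usage score:", round(score, 2))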

10. Real-time Analytics Platform (RTAP) Applications

Definition: RTAP refers to platforms designed for processing and analyzing real-
time data streams.
Applications:

• Sentiment Analysis: Analyze social media streams to determine public opin-


ion in real-time.

• Stock Market Predictions: Process stock trade data to predict price move-
ments and trends.

• Fraud Detection: Monitor transactions to detect fraudulent activities in-


stantly.

• IoT Monitoring: Analyze sensor data from smart devices for immediate ac-
tion.

Key Features:

• Low latency for real-time decision-making.

• Scalable to handle high-velocity data streams.

• Integration with distributed systems (e.g., Apache Kafka, Spark Streaming).

Example:

• A bank uses RTAP to monitor millions of credit card transactions to detect


and block fraud in real-time.

Case Studies
1. Real-Time Sentiment Analysis

Objective: Analyze real-time social media streams to understand and quantify


public sentiment.
Applications:

• Monitoring political sentiment during elections.

• Tracking customer feedback during product launches.

• Real-time reaction analysis to breaking news or crises.

Steps:

1. Data Collection:

• Use APIs such as Twitter API, Reddit API, or Facebook Graph API to
stream live data.
• Filter data using keywords, hashtags, geolocation, or user metadata.
• Example: Collect tweets containing hashtags like #Election2024 or #New-
ProductLaunch.

2. Data Preprocessing:

• Cleaning: Remove stopwords, special characters, emojis, URLs, and un-


necessary whitespace.
• Tokenization: Break text into individual words or phrases for analysis.
• Normalization: Convert text to lowercase, and apply stemming or lemma-
tization to unify word forms.
• Example: Convert ”Running” and ”Runs” to ”Run.”

3. Sentiment Analysis:

• Apply sentiment scoring algorithms to classify text as positive, negative,


or neutral.
• Popular Tools:
– VADER (Valence Aware Dictionary for Sentiment Reason-
ing): Ideal for analyzing short, informal text like tweets.
– TextBlob: Simple library for sentiment scoring and polarity detec-
tion.
– Deep Learning Models: Pre-trained models like BERT or GPT-3
for high accuracy in complex sentiment detection.
• Example: A tweet like "This product is amazing!" is classified as positive
sentiment (a scoring sketch follows these steps).

4. Real-Time Processing:

• Use stream processing platforms such as Apache Kafka, Apache Flink, or


Spark Streaming for low-latency data processing.
• Continuously calculate sentiment scores and update visualizations or dash-
boards.

5. Visualization and Insights:

• Use visualization tools like Tableau, Power BI, or custom dashboards to


display trends.
• Example: Real-time sentiment heatmaps during a political debate or prod-
uct launch.
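
A minimal sketch of the sentiment-scoring step (step 3) is shown below. It assumes the third-party vaderSentiment package is installed, the example tweets are invented, and the +/-0.05 compound-score thresholds follow a common VADER convention rather than a fixed rule.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
tweets = [
    "This product is amazing!",
    "Worst update ever, nothing works.",
    "Launch event starts at 5 pm.",
]

for text in tweets:
    scores = analyzer.polarity_scores(text)   # keys: neg, neu, pos, compound
    if scores["compound"] > 0.05:
        label = "positive"
    elif scores["compound"] < -0.05:
        label = "negative"
    else:
        label = "neutral"
    print(f"{label:8s} {scores['compound']:+.2f}  {text}")

In a streaming deployment the same scoring function would be applied inside the stream processor (step 4) and the aggregated scores pushed to a dashboard (step 5).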

2. Stock Market Predictions

Objective: Analyze and predict stock price trends in real-time using market data
and news feeds.
Applications:

• Predicting stock price fluctuations based on breaking news.

• Identifying trading opportunities using historical and real-time data.

• Risk management through early detection of market volatility.

Steps:

1. Data Collection:

• Gather real-time stock data from APIs such as Yahoo Finance, Alpha
Vantage, or Bloomberg.
• Collect relevant financial news or tweets using web scraping tools or news
APIs.
• Example: Monitor the stock prices of companies like Apple and Tesla
while analyzing related news articles.

2. Data Preprocessing:

• Cleaning: Remove irrelevant data like stopwords, duplicate records, or


incomplete stock logs.
• Feature Engineering: Generate features such as moving averages, trad-
ing volume, and sentiment scores from news articles.
• Normalization: Scale stock prices and volumes to improve the perfor-
mance of prediction models.

3. Modeling and Prediction:



• Use machine learning or deep learning models to predict price movements.


• Popular Models:
– Linear Regression: For modeling price trends over time.
– ARIMA (AutoRegressive Integrated Moving Average): For
time-series forecasting.
– LSTM (Long Short-Term Memory): Deep learning model for
sequential data like stock prices.
• Example: Predict whether a stock’s price will rise or fall based on the past
24 hours’ trends.

4. Real-Time Processing:

• Stream data using platforms like Apache Kafka or Spark Streaming.


• Continuously update predictions based on new stock prices or news feeds.

5. Visualization and Insights:

• Display stock trends, predicted price movements, and trading signals on


a dashboard.
• Example: A real-time graph showing Tesla’s price prediction for the next
15 minutes.

ITECH WORLD AKTU


Subject Name: Data Analytics (DA)
Subject Code: BCS052

UNIT 4
Syllabus
• Frequent Item sets and Clustering: Mining frequent item sets, market-based mod-
elling, Apriori algorithm.

• Handling large data sets in main memory, limited pass algorithm, counting frequent
item sets in a stream.

• Clustering techniques: hierarchical, K-means, clustering high dimensional data,


CLIQUE and ProCLUS.

• Frequent pattern-based clustering methods, clustering in non-Euclidean space.

• Clustering for streams and parallelism.



Mining Frequent Item Sets


Frequent item sets are groups of items that appear together in a dataset frequently.
They are widely used in:

• Market-basket analysis.

• Association rule mining.

• Pattern recognition in large datasets.

Key Concepts
1. Support: The proportion of transactions in which an item set appears.

2. Confidence: The likelihood that an item appears in a transaction given


another item is already present.

3. Lift: Measures how much more likely the occurrence of items together is than
if they were independent.

Market-Based Modelling
This method analyzes customer purchasing behavior by identifying relationships
among items. It uses frequent item sets to predict buying patterns.

Example:
• Transaction Data:

1. T1: {Bread, Milk}


2. T2: {Bread, Diaper, Beer, Eggs}
3. T3: {Milk, Diaper, Beer, Cola}
4. T4: {Bread, Milk, Diaper, Beer}

• Frequent Item Sets (with Support):

1. {Bread, Milk}: Appears in 2 out of 4 transactions (Support = 50%).


2. {Diaper, Beer}: Appears in 3 out of 4 transactions (Support = 75%).

Apriori Algorithm

The Apriori algorithm is used to identify frequent item sets and derive association rules.
It works by:

• Iteratively generating candidate item sets.

• Pruning item sets that do not meet the minimum support threshold.

Steps in the Apriori Algorithm


1. Generate 1-item Frequent Sets:

• Identify individual items that meet the minimum support threshold.

2. Generate Candidate 2-Item Sets:

• Combine frequent 1-item sets to create candidate pairs.

3. Prune Infrequent Item Sets:

• Remove pairs that do not meet the minimum support threshold.

4. Repeat for Higher-Order Item Sets:

• Continue combining and pruning until no new frequent item sets can be gen-
erated.

Key Optimizations
• Use the ”downward closure property”: All subsets of a frequent item set must also
be frequent.

• Reduce computational cost by generating only candidate sets from previously fre-
quent sets.

Example:
• Input Transactions:

1. T1: {Bread, Milk}


2. T2: {Bread, Diaper, Beer}
3. T3: {Milk, Diaper, Beer}
4. T4: {Bread, Milk, Diaper}

• Step 1: Generate 1-Item Frequent Sets:

– {Bread}, {Milk}, {Diaper}, {Beer}.

• Step 2: Generate 2-Item Candidate Sets:

– {Bread, Milk}, {Bread, Diaper}, {Milk, Diaper}, etc.

• Step 3: Prune Infrequent Sets:

– Retain sets like {Bread, Milk} and {Diaper, Beer}.

• Step 4: Generate Higher-Order Sets:

– Combine 2-item frequent sets to form 3-item candidates and repeat.

Output:
Frequent item sets and their support counts.
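
The steps above can be condensed into a short, self-contained Python sketch. It uses the four example transactions listed earlier, a 50% minimum support, and a simplified candidate-generation (join) step without the full subset-pruning of a production Apriori implementation.

def apriori(transactions, min_support):
    """Return frequent item sets (frozensets) with support >= min_support."""
    n = len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    while candidates:
        # Count the support of the current candidates
        support = {c: sum(1 for t in transactions if c <= t) / n for c in candidates}
        survivors = {c: s for c, s in support.items() if s >= min_support}
        frequent.update(survivors)
        # Join step: build (k+1)-item candidates from surviving k-item sets
        candidates = {a | b for a in survivors for b in survivors
                      if len(a | b) == len(a) + 1}
    return frequent

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer"},
    {"Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper"},
]
for itemset, s in sorted(apriori(transactions, 0.5).items(), key=lambda x: -x[1]):
    print(set(itemset), f"support = {s:.2f}")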

Handling Large Datasets in Main Memory

As datasets grow, handling them efficiently in main memory becomes critical. Memory
optimization techniques allow for processing large volumes of data without overwhelming
system resources.

Challenges in Handling Large Datasets


• Memory Constraints: Limited RAM compared to dataset size.

• Processing Speed: Ensuring computations are performed within a reasonable


time.

• Data Streams: Continuously incoming data that cannot be stored entirely in


memory.

Techniques for Efficient Handling


1. Limited Pass Algorithm:

• Processes data in small, manageable chunks, reducing memory usage.


• Suitable for datasets that cannot be fully loaded into memory.
• Example Applications:
– Market-basket analysis where transactions are processed one batch at a
time.
– Summarizing web traffic logs incrementally.

2. Counting Frequent Item Sets in a Stream:

• Tracks the frequency of items in data streams.


• Uses techniques like:
– Hashing: Compresses data into hash tables to save space.
– Approximation Algorithms: Provides estimates for item frequencies
(e.g., Count-Min Sketch).
• Example Applications:
– Tracking frequently accessed webpages in real time.
– Monitoring high-traffic product sales in e-commerce platforms.

3. Sampling:

• Processes only a representative subset of the data.


• Reduces computational overhead while retaining insights.
• Example: Analyzing a random sample of user transactions instead of the entire
dataset.

4. Data Partitioning:

• Divides the dataset into smaller partitions for separate processing.


• Example: Dividing data by regions, time periods, or user groups for analysis.

Example Scenario
• Scenario: An online retailer processes real-time sales data to identify frequent item
sets.

• Approach:

1. Use a Limited Pass Algorithm to process transactions in batches of 1,000


records.
2. Apply a Count-Min Sketch to estimate item frequencies in the stream.
3. Store only the most frequent 100 item sets in memory for immediate reporting.

• Benefits:

– Reduces memory usage by avoiding storage of the entire dataset.


– Provides real-time insights without waiting for full data processing.
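
A compact sketch of the Count-Min Sketch used in the scenario above is given below; the table width, depth, and product names are illustrative assumptions.

import hashlib

class CountMinSketch:
    """Approximate item frequencies for a stream using fixed memory."""
    def __init__(self, width=1000, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate counts, so the minimum across rows is used
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

# Illustration: approximate product counts from a transaction stream
cms = CountMinSketch()
for product in ["milk", "bread", "milk", "beer", "milk", "bread"]:
    cms.add(product)
print(cms.estimate("milk"), cms.estimate("bread"), cms.estimate("beer"))  # ~3 ~2 ~1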

Real-World Applications
• E-Commerce: Tracking high-demand products during sales events.

• Social Media: Identifying trending hashtags or topics in real time.

• Network Traffic Monitoring: Detecting frequent IP addresses or URLs in net-


work streams.

• IoT Devices: Analyzing sensor data streams for anomalies or patterns.

Clustering Techniques

Clustering is an unsupervised learning technique used to group data into clusters based on
similarity. Each cluster contains data points that are more similar to each other than to
those in other clusters.

Applications of Clustering
• Customer Segmentation: Grouping customers based on purchasing behavior.

• Document Categorization: Clustering similar documents or articles.

• Image Segmentation: Dividing an image into meaningful regions for analysis.

• Anomaly Detection: Identifying outliers that do not belong to any cluster.

1. Hierarchical Clustering
Hierarchical clustering creates a tree-like structure (dendrogram) to represent nested
clusters.

• Types:

– Agglomerative (Bottom-Up): Starts with individual points as clusters and


merges them iteratively.

– Divisive (Top-Down): Starts with one cluster containing all points and
splits them iteratively.

• Steps:

1. Compute the distance (similarity) matrix.


2. Merge the closest clusters.
3. Update the distance matrix.
4. Repeat until all points are in one cluster.

• Linkage Methods:

– Single Linkage: Distance between the nearest points in two clusters.


– Complete Linkage: Distance between the farthest points in two clusters.
– Average Linkage: Average distance between all points in two clusters.

• Example:

– Dataset: Documents about sports, technology, and finance.


– Approach: Hierarchical clustering groups documents into clusters by topics.
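
A short sketch of agglomerative clustering with SciPy is shown below, assuming numpy and scipy are available; the 2-D points stand in for document feature vectors, which would normally come from a vectorization step such as TF-IDF.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8], [9.0, 0.2]])

Z = linkage(points, method="average")            # average-linkage merge tree
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters
print(labels)   # e.g. two tight groups plus a singleton outlier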

2. K-Means Clustering
K-Means is a partition-based clustering method that divides data into k clusters, where
k is predefined.

• Steps:

1. Initialize k cluster centroids randomly.


2. Assign each data point to the nearest centroid.
3. Update centroids as the mean of assigned points.
4. Repeat until centroids stabilize or a maximum number of iterations is reached.

• Example:

– Dataset: Customer data including age, income, and spending score.


– Approach: K-Means groups customers into clusters like low spenders, moder-
ate spenders, and high spenders.

• Advantages:

– Simple and scalable.


– Works well for spherical clusters.

• Limitations:

– Requires the number of clusters (k) to be predefined.


– Sensitive to initial centroid placement.
– May fail for non-spherical or overlapping clusters.
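
An illustrative K-Means sketch with scikit-learn is shown below, assuming the library is installed; the customer features (age, income, spending score) are made-up values.

import numpy as np
from sklearn.cluster import KMeans

customers = np.array([
    [22, 25000, 80], [25, 27000, 75],   # young, high spending score
    [45, 80000, 40], [48, 85000, 35],   # middle-aged, moderate spenders
    [60, 30000, 10], [63, 32000, 15],   # older, low spenders
], dtype=float)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(customers)
print(labels)                    # cluster id assigned to each customer
print(kmeans.cluster_centers_)   # centroid of each segment

In practice the features would usually be standardized first, since the income column otherwise dominates the Euclidean distance.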

3. High-Dimensional Data Clustering


Clustering high-dimensional data presents challenges like increased sparsity and compu-
tational cost. Methods include:

• CLIQUE (Clustering In QUEst):

– Combines grid-based and density-based clustering.


– Suitable for large, high-dimensional datasets.
– Example: Segmenting gene expression data in bioinformatics.

• ProCLUS (Projected Clustering):

– Identifies clusters in relevant dimensions of high-dimensional space.


– Example: Grouping user behavior data in multiple dimensions like clicks, time
spent, and purchases.

Comparison of Clustering Methods

Method | Advantages | Limitations
Hierarchical Clustering | Visualizes nested clusters | Computationally expensive for large datasets
K-Means Clustering | Fast and efficient | Requires k and struggles with non-spherical clusters
CLIQUE | Handles high dimensions | May miss complex cluster shapes
ProCLUS | Identifies subspace clusters | Sensitive to parameter selection

Table 1: Comparison of Clustering Methods



Advanced Clustering Methods


Clustering High Dimensional Data
High-dimensional data is common in many applications such as genomics, text mining,
and image recognition. Clustering in high-dimensional spaces requires specialized tech-
niques to manage the curse of dimensionality.

Techniques for High-Dimensional Clustering


• CLIQUE (CLustering In QUEst): A density-based subspace clustering algo-
rithm that identifies clusters in high-dimensional spaces by dividing the data into
cells and considering both density and the correlation between dimensions.

• ProCLUS (Projected Clustering): Focuses on finding subspace clusters by pro-


jecting the high-dimensional data into a lower-dimensional subspace and performing
clustering in the subspace.

• PCA-based Clustering: Principal Component Analysis (PCA) is often used to


reduce the dimensionality of the data, making it easier to identify clusters in a
lower-dimensional space.

Example
In a dataset with 100 features, clustering can be challenging. ProCLUS and CLIQUE
can reduce the dimensionality and identify clusters in subspaces where the data points
are most densely packed.

Frequent Pattern-Based Clustering


This technique combines the principles of frequent pattern mining and clustering. Fre-
quent pattern-based clustering involves finding patterns that appear frequently in a
dataset and using them to form clusters.

Approach
• Frequent Itemset Mining: The process begins by identifying frequent itemsets
using algorithms like Apriori or FP-growth.

• Cluster Formation: Once frequent itemsets are identified, they are used to form
clusters of data points that share similar patterns.

Example
In market basket analysis, frequent itemsets like {Milk, Bread} or {Diapers, Beer} can be
used to form customer segments that frequently purchase those items together.

Clustering in Non-Euclidean Space


Traditional clustering techniques like K-means assume Euclidean distances between points.
However, many real-world data, such as text, graphs, or images, do not fit the Euclidean
space. Non-Euclidean spaces require different distance measures, such as cosine similarity,
Jaccard similarity, or Minkowski distance.

Methods for Non-Euclidean Clustering


• Cosine Similarity-based Clustering: This is commonly used in text mining
where the cosine of the angle between two vectors represents their similarity.

• Manhattan Distance (L1): Used for applications involving grid-based data or


data points with absolute differences.

• Graph-based Clustering: Uses graphs to model data where the vertices represent
data points, and the edges represent relationships or similarities between them.

Example
Text clustering can be done using cosine similarity to group documents that share com-
mon terms or topics, even when they do not share exact matching words.
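
A small sketch of cosine-similarity comparison of documents is given below, assuming scikit-learn is installed; the three short documents are invented examples.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "stock markets rally as tech shares rise",
    "tech shares lead the stock market higher",
    "new smartphone camera impresses reviewers",
]

tfidf = TfidfVectorizer().fit_transform(docs)   # documents -> TF-IDF vectors
sim = cosine_similarity(tfidf)                  # pairwise cosine similarities
print(sim.round(2))
# The two market-related documents are far more similar to each other
# than either is to the smartphone review, so they would be grouped together.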

Clustering for Streams and Parallelism


Real-time data streams, like sensor data or social media posts, present unique challenges
for clustering algorithms. Traditional clustering methods are not suitable for stream data
due to the continuous flow and large volume of data.

Approach
• Stream Clustering Algorithms: These algorithms process data in a single pass,
and must handle data that arrives in real-time. Examples include the CluStream
and DenStream algorithms.

• Parallel Clustering: This method uses multiple processors to perform clustering


on subsets of the data simultaneously. Distributed computing platforms like Hadoop
and Spark are often used for parallel clustering.

• Incremental Learning: Clustering models that update as new data arrives with-
out needing to recompute the clusters from scratch.

Example
In social media analytics, clustering algorithms need to identify trends in real-time as
millions of posts are generated every minute. Streaming clustering algorithms such as
CluStream can update clusters as new data flows in, ensuring that the model remains
relevant.

ITECH WORLD AKTU


Subject: Data Analytics (BCS052)

UNIT 5: Frameworks and


Visualization

Syllabus
• Frameworks: MapReduce, Hadoop, Pig, Hive, HBase, MapR, Sharding, NoSQL
Databases, S3, Hadoop Distributed File Systems.

• Visualization: Visual data analysis techniques, interaction techniques, systems,


and applications.

• Introduction to R: R graphical user interfaces, data import and export, attribute


and data types, descriptive statistics, exploratory data analysis, visualization before
analysis, analytics for unstructured data.


Frameworks
Frameworks are essential tools in data analytics that provide the infrastructure to manage,
process, and analyze large datasets. They enable scalable, efficient, and fault-tolerant
operations, making them ideal for distributed systems.

MapReduce
Definition: MapReduce is a programming model used for processing and generating
large datasets. It splits the data into chunks, processes it in parallel, and reduces it to
meaningful results.

Steps:
1. Input: A large dataset is split into smaller chunks.

2. Map Phase: Each chunk of data is processed independently. The map function
converts each item into a key-value pair.

3. Shuffling and Sorting: After the map phase, key-value pairs are grouped by their
keys.

4. Reduce Phase: The reduce function takes the grouped key-value pairs and aggre-
gates them into meaningful results.

5. Output: The result of the aggregation is the final output.


Example:
Input: [1, 2, 3, 4]
Map: [("sum", 1), ("sum", 2), ("sum", 3), ("sum", 4)]
Reduce: [("sum", 10)] (Sum of all numbers)
Explanation:
• Input: A list of integers [1, 2, 3, 4].

• Map Phase: The map function processes each number and emits a key-value pair
with the common key "sum" and the number itself as the value.

• Reduce Phase: The reduce function groups the key-value pairs by key and sums
the values.

• Output: The result is a single pair ("sum", 10), which is the sum of all the numbers.
Applications:
• Word Count: Counting word frequencies in large datasets.

• Sorting: Sorting large datasets distributed across many nodes.

• Log Processing: Analyzing logs from large systems.

• Machine Learning: Distributed computation in algorithms like k-means clustering.


Advantages:

• Scalability: Can handle large datasets by distributing tasks across many machines.

• Parallelism: Data is processed in parallel, reducing the overall time required.

• Fault Tolerance: The system can recover from task failures by retrying the failed
tasks.
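
The phases described above can be simulated on a single machine with a few lines of Python; this is only an illustration of the data flow, not a distributed implementation.

from collections import defaultdict

documents = ["big data tools", "big data analytics", "data streams"]

# Map phase: emit (word, 1) for every word in every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle and sort phase: group the emitted values by key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate the values of each key
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)   # {'big': 2, 'data': 3, 'tools': 1, 'analytics': 1, 'streams': 1}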

Hadoop
Definition: Hadoop is an open-source framework for storing and processing large datasets
in a distributed manner across clusters of computers. It allows for the efficient processing
of large datasets in a fault-tolerant and scalable way.

Components:

• Hadoop Distributed File System (HDFS): A distributed file system that stores
data across multiple machines in a cluster, ensuring redundancy and fault tolerance.

• MapReduce: A programming model and processing engine that allows for parallel
processing of data across nodes in a cluster.

• YARN (Yet Another Resource Negotiator): A resource management layer


that manages and schedules computing resources across all nodes in the Hadoop
cluster.

• Hadoop Common: A set of shared libraries and utilities that support the other
Hadoop modules.


Example: Netflix uses Hadoop to analyze user data for recommendations. By pro-
cessing large volumes of user viewing data, Hadoop helps generate personalized recom-
mendations for each user, ensuring better engagement and user experience.

Pig
Definition: Pig is a high-level platform developed on top of Hadoop for creating MapRe-
duce programs. It simplifies the process of writing MapReduce programs by providing
a more user-friendly, procedural language called Pig Latin. Pig is designed to handle
both batch processing and data transformation jobs, making it easier for analysts and
programmers to process large datasets without having to deal with low-level MapReduce
code directly.

Features:

• High-level language: Pig Latin is a simple, procedural language that abstracts


the complexities of MapReduce.

• Extensibility: Pig allows for the addition of custom functions, making it extensible
for specific use cases.


• Optimization: Pig automatically optimizes queries, minimizing the need for man-
ual performance tuning.

• Support for complex data types: Pig can handle complex data types, including
nested data structures.

Pig Latin Syntax: Pig Latin offers SQL-like operations such as loading, filtering, and
grouping, but it is a procedural data-flow language tailored to the MapReduce paradigm.
Here is an example of a Pig Latin query:

A = LOAD ’data.txt’ USING PigStorage(’,’) AS (name, age);


B = FILTER A BY age > 30;
STORE B INTO ’output’;

Explanation of Example:

• A = LOAD ’data.txt’ USING PigStorage(’,’) AS (name, age);: This state-


ment loads data from a file called ’data.txt’, assuming that the fields in the file are
separated by commas. It assigns the fields to the variables ‘name‘ and ‘age‘.

• B = FILTER A BY age > 30;: This statement filters the loaded data and keeps
only the records where the age is greater than 30.

• STORE B INTO ’output’;: Finally, the filtered data (‘B‘) is stored in the output
directory.

Execution Flow:

1. Loading Data: Pig reads data from sources like HDFS, local files, or relational
databases.

2. Transforming Data: Pig supports various transformations such as filtering, grouping,
joining, and sorting.

3. Storing Data: The transformed data is stored back into HDFS, a database, or another
storage system.
Applications:

• Data Transformation: Cleaning, transforming, and manipulating large datasets.

• Data Analysis: Aggregating and analyzing large volumes of data.

• Log Analysis: Processing log files to extract insights or generate reports.

Advantages:

• Simplicity: Pig Latin is simpler and easier to write compared to traditional


MapReduce code.

• Performance: Pig optimizes the execution of queries, making it more efficient


than writing raw MapReduce code.

• Flexibility: Supports a wide range of data processing tasks, including complex


data transformations.


Hive
Definition: Hive is a data warehousing and SQL-like query language system built on top
of Hadoop. It is used for managing and querying large datasets stored in Hadoop’s HDFS.
Hive abstracts the complexities of writing MapReduce jobs and provides a more user-
friendly interface for querying large datasets using a SQL-like language called HiveQL.

Components:
• Metastore: A central repository that stores metadata about the data stored in
HDFS, such as table structures and partitions.

• HiveQL: A query language similar to SQL that enables users to perform data
analysis and querying tasks.

• Driver: The component responsible for receiving queries and sending them to the
execution engine for processing.

• Execution Engine: The component that executes the MapReduce jobs generated
from HiveQL queries on the Hadoop cluster.
Query Execution Flow:
1. Writing Queries: Users write queries using HiveQL, which is a SQL-like language.

2. Compiling Queries: The queries are compiled by the Hive driver, which translates
them into MapReduce jobs.

3. Executing Queries: The execution engine runs the compiled jobs on the Hadoop
cluster to process the data.

4. Storing Results: Results can be stored back into HDFS or in other storage systems
like HBase.
Applications:
• Data Analysis: Analyzing large datasets using SQL-like queries.

• ETL Operations: Extracting, transforming, and loading large datasets.

• Data Warehousing: Storing and querying structured data in HDFS.


Advantages:
• Ease of Use: HiveQL is similar to SQL, making it easier for those familiar with
relational databases to use.

• Scalability: Hive can scale to handle large datasets on a Hadoop cluster.

• Extensibility: Users can add custom UDFs (User Defined Functions) to extend
Hive’s capabilities.

Comparison: Pig, Hive, and SQL


Difference Table:


Feature | Pig | Hive | SQL
Data Model | Semi-structured or unstructured | Structured data | Structured data
Language | Pig Latin (Procedural) | HiveQL (Declarative, SQL-like) | SQL (Declarative)
Processing Model | Data flows in a pipeline | SQL-based queries transformed to MapReduce jobs | Declarative queries processed by the RDBMS
Use Case | Complex data transformation, ETL tasks | Data warehousing, querying large datasets | OLTP, OLAP, and general database management
Performance Tuning | Allows manual performance tuning | Automatic performance optimization | Manual optimization via indexing, query optimization
Extensibility | Supports user-defined functions (UDFs) | Supports UDFs and custom scripts | Can support UDFs in some systems
Fault Tolerance | Built-in fault tolerance through Hadoop | Built-in fault tolerance through Hadoop | Fault tolerance depends on the database system
Ease of Use | Requires knowledge of scripting (Pig Latin) | Easier for SQL users due to HiveQL | Easy to use with a standard SQL interface
Storage Format | Works with HDFS, HBase, local file systems | Primarily works with HDFS | Works with relational databases
Scalability | Highly scalable due to Hadoop's distribution | Scalable on top of Hadoop's HDFS | Limited scalability (depends on RDBMS)

Table 1: Comparison of Pig, Hive, and SQL

HBase, MapR, Sharding, NoSQL Databases, S3, Hadoop Dis-


tributed File Systems
HBase:
• Open-source, distributed NoSQL database.
• Runs on top of Hadoop’s HDFS.
• Stores data in a columnar format, suitable for sparse data.
• Supports real-time read/write access.
• Commonly used for real-time analytics and large-scale data processing.
MapR:
• Data platform integrating Hadoop, NoSQL, and big data technologies.
• Provides distributed storage and analytics with high performance.
• Offers a unified solution for data storage, access, and analytics.


• Used in industries like finance, healthcare, and telecommunications.

Sharding:

• Distributes data across multiple servers or databases (shards).

• Enhances scalability and performance by splitting large datasets.

• Requests are routed to appropriate shards based on a shard key.

• Essential for horizontally scaling databases handling massive datasets.

NoSQL Databases:

• Handle unstructured, semi-structured, or large-scale data.

• Types of NoSQL databases:

– Document Databases: Store data as documents (e.g., MongoDB).


– Key-Value Stores: Store data as key-value pairs (e.g., Redis).
– Column-family Stores: Store data in columns (e.g., Cassandra).
– Graph Databases: Store data as graphs (e.g., Neo4j).

• Used for applications requiring scalability, flexibility, and real-time data.

S3:

• Amazon’s cloud-based object storage service.

• Scalable and offers high durability (99.999999999%, i.e., eleven nines).

• Supports encryption, versioning, and lifecycle management.

• Commonly used for storing backups, media files, and big data.

Hadoop Distributed File System (HDFS):

• Primary storage system for Hadoop.

• Stores large files across multiple nodes.

• Divides files into blocks (e.g., 128MB, 256MB) for distribution.

• Provides fault tolerance through data replication.

• Works with Hadoop’s MapReduce framework for distributed data processing.


HDFS Architecture
Metadata:

• Metadata in HDFS refers to information about the structure of data stored in the
system (e.g., file names, file locations, permissions).

• Managed by the NameNode.

• Metadata is stored in memory for faster access and operation.

• It includes information like file-to-block mapping and block locations.

Read Data:

• Client requests data from HDFS by providing the file path.

• NameNode provides the list of DataNodes where the file’s blocks are stored.

• Client communicates directly with DataNodes to read data in blocks.

• Data is read in parallel from different DataNodes for faster access.

Write Data:

• Client requests to write a file to HDFS.

• NameNode checks file permissions and availability of blocks.

• Data is split into blocks and written to multiple DataNodes.

• Each block is replicated to ensure fault tolerance (default replication factor is 3).

• DataNodes store the data blocks and confirm back to the client.

Metadata Manipulation:


• NameNode is responsible for maintaining and manipulating metadata.

• It stores the metadata in memory and on the local disk as a persistent storage.

• When a file is created, deleted, or modified, NameNode updates the metadata


accordingly.

• Metadata includes block locations, file names, and the replication factor.

NameNode:

• NameNode is the master server in HDFS that manages metadata.

• It keeps the directory tree of all files in the system.

• NameNode maintains information about file blocks and where they are stored.

• It does not store the actual data but handles the file system namespace and block
management.

• In case of failure, HDFS ensures fault tolerance using secondary NameNode or


backup mechanisms.

DataNode Rack 1:

• DataNodes are worker nodes in HDFS responsible for storing actual data blocks.

• They are distributed across multiple racks for redundancy and high availability.

• Each DataNode in Rack 1 stores replicas of data blocks as per the replication factor.

• DataNodes periodically send heartbeat signals and block reports to NameNode.

DataNode Rack 2:

• Similar to DataNode Rack 1, DataNodes in Rack 2 store replicated blocks.

• HDFS ensures data redundancy by replicating data blocks across different racks.

• This improves data availability and fault tolerance in case of rack failure.

• DataNodes in Rack 2 store data blocks based on the replication factor defined by
NameNode.


Visualization
Visualization is the graphical representation of data to identify patterns, trends, and
insights. It helps in understanding complex data by presenting it in charts, graphs, or
other visual forms. Common tools include Tableau, Power BI, and D3.js.

Visual Data Analysis Techniques


Techniques:

• Line Charts:

– Line charts are used to visualize trends over time or continuous data.
– They are ideal for showing changes in data at evenly spaced intervals, such as
stock prices or temperature.
– The X-axis represents time or the continuous variable, while the Y-axis repre-
sents the values of the data points.

• Bar Charts:

– Bar charts are used to compare different categories or groups.


– The X-axis typically represents the categories, and the Y-axis shows the cor-
responding values.
– Bar charts are great for showing relative sizes or differences between categories,
such as sales by region or number of items sold.

• Scatter Plots:

– Scatter plots display data points on a two-dimensional plane, with one variable
on the X-axis and the other on the Y-axis.
– They are useful for showing the relationship between two continuous variables,
helping to identify correlations or trends.
– Scatter plots can help detect patterns, clusters, or outliers in the data.

• Heatmaps:

– Heatmaps represent data using color gradients to indicate the magnitude of


values.
– They are ideal for visualizing the density or intensity of data over a specific
area or over time.
– Heatmaps are often used in applications like geospatial data analysis, where
the intensity of events (e.g., crime rates, temperature) is mapped.

Example: A heatmap showing temperature variations over a year might use color
gradients to represent temperature changes over different months or days. This visual
representation allows quick identification of periods with extreme heat or cold.
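
A brief matplotlib sketch of a line chart and a one-row heatmap is shown below, assuming matplotlib and numpy are installed; the monthly temperature values are invented for illustration.

import numpy as np
import matplotlib.pyplot as plt

months = np.arange(1, 13)
temps = np.array([5, 7, 11, 16, 21, 26, 29, 28, 23, 17, 10, 6])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))

ax1.plot(months, temps, marker="o")              # line chart: trend over the year
ax1.set_xlabel("Month")
ax1.set_ylabel("Avg temp (°C)")

ax2.imshow(temps.reshape(1, -1), aspect="auto", cmap="coolwarm")   # heatmap strip
ax2.set_yticks([])
ax2.set_xticks(months - 1)
ax2.set_xticklabels(months)
ax2.set_xlabel("Month")

plt.tight_layout()
plt.show()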


Interaction Techniques
Types:

• Brushing and Linking:

– Brushing and linking is a technique that allows users to highlight data points
in one visualization and see the corresponding data in other visualizations.
– For example, in a dashboard, brushing a region on a scatter plot could highlight
the same points on a related bar chart or line chart.
– This interaction helps users explore relationships and patterns across multiple
views of the data.

• Zooming and Panning:

– Zooming and panning techniques enable users to explore data at different levels
of detail by adjusting the view.
– Zooming allows users to focus on a specific portion of the data, such as exam-
ining a particular time period in a time series.
– Panning enables users to move across large datasets to explore different sec-
tions of the data, such as navigating through geographic data or large tables.

• Filtering:

– Filtering allows users to view subsets of data based on specific criteria.


– It is often used in interactive dashboards to narrow down large datasets by
selecting specific categories, ranges, or conditions (e.g., filtering sales data by
region or by year).
– Filtering helps users focus on relevant data, making the analysis more man-
ageable and meaningful.

Example: Interactive dashboards in Tableau often allow users to apply brushing and
linking techniques, zoom into specific regions on maps, and filter data by different criteria
to create dynamic visualizations tailored to the user’s needs.


Aspect | Data Visualization | Data Analytics
Definition | Graphical representation of data | Statistical analysis of data for insights
Purpose | To simplify data understanding visually | To derive actionable insights from data
Focus | Creating visual aids (charts, graphs) | Statistical techniques and predictive modeling
Tools | Tableau, Power BI, D3.js, Matplotlib | Python, R, SPSS, SAS, Excel
Output | Graphs, charts, heatmaps, dashboards | Models, reports, predictions, insights
Audience | Non-technical stakeholders, managers | Data scientists, analysts, researchers
Nature | Descriptive: shows data trends | Inferential: analyzes and predicts trends
Skills | Design principles, visualization tools | Statistical analysis, programming, machine learning
Time Sensitivity | Focuses on current or real-time data | Analyzes past data to predict future trends

Table 2: Difference between Data Visualization and Data Analytics

Introduction to R
R is a powerful language for statistical computing and data analysis. It provides a
wide variety of statistical techniques and graphical methods, making it popular for data
analysis, data visualization, and statistical computing.

R Graphical User Interfaces (GUIs)


Popular GUIs:

• RStudio: RStudio is the most widely used integrated development environment


(IDE) for R. It offers a rich user interface with powerful features such as code
completion, syntax highlighting, and integrated plotting.

– Multiple panes for console, script editor, environment, and plotting.


– Support for version control, debugging, and package management.


– Extensible with plugins.

• R Commander: R Commander is a GUI for R that is accessible for beginners and


non-programmers. It provides a menu-based interface to perform various statistical
operations and analyses.

– Simple point-and-click interface.


– Suitable for basic data manipulation, statistical analyses, and plotting.
– Useful for those who prefer not to write code directly.

Data Import and Export


Import:

• read.csv(): Used to read CSV files into R as data frames.

• read.table(): Reads general text files into R. This function allows more flexibility
with delimiters and other file formats.

Export:

• write.csv(): Writes data from R to a CSV file.

• write.table(): Writes data to a general text file, with more options for formatting
the output.

Example:

# Importing data
my_data <- read.csv("data.csv")

# Performing a simple analysis (e.g., viewing the structure of the data)


str(my_data)

# Exporting the data to a new CSV file


write.csv(my_data, "output.csv")

