Data Analytics All 5 Units
Unit 1:
Syllabus:
Key Points:
• Outcome: Generates insights for strategic decisions in various domains like busi-
ness, healthcare, and technology.
• Tools: Includes Python, R, Excel, and specialized tools like Tableau, Power BI.
Example: A retail store uses data analytics to identify customer buying patterns
and optimize inventory management, ensuring popular products are always in stock.
1. Social Data:
2. Machine-Generated Data:
• Sensors and IoT Devices: Data from devices like thermostats, smart-
watches, and industrial sensors.
• Log Data: Records of system activities, such as server logs and application
usage.
• GPS Data: Location information generated by devices like smartphones and
vehicles.
• Telemetry Data: Remote data transmitted from devices, such as satellites
and drones.
3. Transactional Data:
Example:
• A social media platform like Twitter generates vast amounts of social data from
tweets, hashtags, and mentions.
• Machine-generated data from GPS in delivery trucks helps optimize routes and
reduce costs.
• A retail store’s transactional data tracks customer purchases and identifies high-
demand products.
• Structured Data: Data that is organized in a tabular format with rows and
columns. It follows a fixed schema, making it easy to query and analyze.
• Semi-Structured Data: Data that does not have a rigid structure but contains tags or markers to separate elements. It lies between structured and unstructured data.
• Unstructured Data: Data with no predefined schema or organization, such as free text, images, audio, and video.
Comparison Table:
• Volume: Refers to the sheer amount of data generated. Modern data systems
must handle terabytes or even petabytes of data.
• Velocity: Refers to the speed at which data is generated and processed. Real-time
data processing is crucial for timely insights.
– Example: Stock market systems process millions of trades per second to pro-
vide real-time updates.
• Variety: Refers to the different types and formats of data, including structured,
semi-structured, and unstructured data.
• Veracity: Refers to the quality and reliability of the data. High veracity ensures
data accuracy, consistency, and trustworthiness.
– Example: Data from unreliable sources or with missing values can lead to
incorrect insights.
Real-Life Scenario: Social media platforms like Twitter deal with high Volume
(millions of tweets daily), high Velocity (real-time updates), high Variety (text, images,
videos), and mixed Veracity (authentic and fake information).
Big data platforms are technologies designed to store, process, and analyze volumes of data that traditional systems cannot efficiently manage. These platforms enable businesses and organizations to derive meaningful insights from large-scale and diverse data.
Key Features of Big Data Platforms:
• Hadoop:
• Spark:
• NoSQL Databases:
Example: A retail company uses data analytics to predict customer demand for
products, enabling them to stock inventory more efficiently.
• Early Stages (Manual and Small Data): In the past, analytics was performed
manually with small datasets, often using spreadsheets or simple statistical tools.
• Relational Databases and SQL: With the rise of structured data, relational
databases and SQL-based querying became more prevalent, offering better scala-
bility for handling larger datasets.
• Big Data and Distributed Computing: The advent of big data technolo-
gies such as Hadoop and Spark allowed for the processing and analysis of massive
datasets across distributed systems.
• Cloud Computing: Cloud-based platforms like AWS, Google Cloud, and Azure
have made scaling analytics infrastructure easier by providing on-demand resources,
reducing the need for physical hardware.
• Real-Time Data Analytics: Technologies such as Apache Kafka and stream
processing frameworks have enabled the processing of data in real-time, further
enhancing scalability.
• Data Collection: Gathering raw data from various sources such as databases,
APIs, or sensors.
• Data Cleaning: Identifying and correcting errors or inconsistencies in the dataset
to improve the quality of the data.
• Data Exploration: Visualizing and summarizing data to understand patterns and
distributions.
• Model Building: Selecting and applying statistical or machine learning models
to predict or classify data.
• Evaluation and Interpretation: Evaluating the accuracy and effectiveness of
models, and interpreting the results for actionable insights.
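The flow from collection to exploration can be sketched in a few lines of Python with Pandas; the file name sales.csv and its columns are assumptions made purely for illustration.

# Minimal sketch of data collection, cleaning, and exploration with Pandas
import pandas as pd

# Data Collection: load raw data from a CSV source (assumed file)
df = pd.read_csv("sales.csv")

# Data Cleaning: drop duplicates and fill missing numeric values with column means
df = df.drop_duplicates()
df = df.fillna(df.mean(numeric_only=True))

# Data Exploration: summarize distributions and inspect correlations
print(df.describe())
print(df.corr(numeric_only=True))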
Tools:
• Statistical Tools: R, Python (with libraries like Pandas, NumPy), SAS
• Machine Learning Frameworks: TensorFlow, Scikit-learn, Keras
• Big Data Tools: Hadoop, Apache Spark
• Data Visualization: Tableau, Power BI, Matplotlib (Python)
4 Analysis vs Reporting
The difference between analysis and reporting lies in their purpose and approach to data:
• Analysis: Involves deeper insights into data, such as identifying trends, patterns,
and correlations. It often requires complex statistical or machine learning methods.
• Reporting: Focuses on summarizing data into a readable format, such as charts,
tables, or dashboards, to provide stakeholders with easy-to-understand summaries.
Example: A report might display sales numbers for the last quarter, while analysis
might uncover reasons behind those numbers, such as customer buying behavior or market
conditions.
• Apache Spark: A fast, in-memory data processing engine for big data analytics.
• Power BI: A powerful business analytics tool that allows users to visualize data
and share insights.
• Tableau: A data visualization tool that enables users to create interactive dash-
boards and visual reports.
• Python with Libraries: Libraries like Pandas, Matplotlib, and Scikit-learn enable
efficient data analysis and visualization.
• Healthcare: Analyzing patient data for better diagnosis, treatment plans, and
management of healthcare resources.
• Finance: Fraud detection, risk assessment, and portfolio optimization through the
analysis of financial data.
• Data Preparation: Collecting, cleaning, and transforming data into usable for-
mats.
• Optimizes Resource Usage: The lifecycle ensures efficient use of resources, such
as time, tools, and personnel. By organizing tasks in a structured way, projects are
completed more efficiently, avoiding wasted effort and resources.
• Improves Communication: Clear milestones and stages help teams stay aligned
and facilitate communication about the progress of the project. This clarity is
especially useful when different teams or departments are involved.
• Better Decision-Making: The lifecycle ensures that all steps are thoroughly exe-
cuted, leading to high-quality insights. This improves decision-making by providing
businesses with reliable and actionable data.
• Data Scientist:
• Data Engineer:
• Business Analyst:
– A business analyst bridges the gap between the technical team (data scientists
and engineers) and business stakeholders.
– They are responsible for understanding the business problem and translating
it into actionable data-driven solutions.
– Business analysts also interpret the results of data analysis and communicate
them in a way that is understandable for non-technical stakeholders.
– Example: A business analyst analyzes customer feedback data and interprets
the results to help the marketing team refine their targeting strategy.
• Project Manager:
1. Discovery:
2. Data Preparation:
3. Model Planning:
4. Model Building:
• Implement the selected models using tools like Python, R, or machine learning
libraries (e.g., Scikit-learn, TensorFlow).
• Train the model on the prepared dataset.
• Tune hyperparameters to improve model performance.
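A minimal scikit-learn sketch of this phase, using synthetic data and an assumed parameter grid: it trains a model and tunes its hyperparameters with cross-validated grid search.

# Model Building sketch: train a classifier and tune hyperparameters (synthetic data)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tune hyperparameters with 3-fold cross-validation over an assumed grid
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
                    cv=3)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Test accuracy  :", grid.score(X_test, y_test))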
5. Communicating Results:
6. Operationalization:
• Deploy the model into a production environment for real-time analysis or batch
processing.
• Integrate the model with existing business systems (e.g., CRM, ERP).
• Monitor and maintain the model’s performance over time.
Example: A retail company builds a model to predict customer churn and integrates
it into their CRM system.
UNIT 2
Syllabus
1. Regression modeling.
2. Multivariate analysis.
6. Rule induction.
7. Neural networks:
8. Fuzzy logic:
Detailed Notes
1 Regression Modeling
Regression modeling is a fundamental statistical technique used to examine the relation-
ship between one dependent variable (outcome) and one or more independent variables
(predictors or features). It helps in understanding, modeling, and predicting the depen-
dent variable based on the behavior of independent variables.
• To identify trends and make informed decisions in various fields such as economics,
medicine, engineering, and marketing.
Example: Predicting house prices based on size, number of rooms, and location.
3. Logistic Regression:
• Used for binary classification problems where the outcome is categorical (e.g.,
0 or 1, Yes or No).
• Employs the sigmoid function σ(x) = 1 / (1 + e^(−x)) to model probabilities.
• Suitable for predicting binary or categorical outcomes.
Example: Classifying whether a patient has a disease based on medical test results.
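A tiny Python illustration of the sigmoid turning a raw score into a probability; the score value 2.0 stands in for w · x + b computed from test results and is an arbitrary assumption.

import math

def sigmoid(x):
    # maps any real-valued score to a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-x))

score = 2.0                       # assumed value of w · x + b for one patient
probability = sigmoid(score)      # about 0.88
prediction = 1 if probability >= 0.5 else 0
print(probability, prediction)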
(a) Data Collection: Gather data relevant to the problem, ensuring accuracy
and completeness.
(b) Data Preprocessing: Handle missing values, scale variables, and identify
outliers.
(c) Feature Selection: Identify the most significant predictors using methods
like correlation analysis or stepwise selection.
(d) Model Building: Fit the regression model using statistical software or pro-
gramming languages like Python or R.
(e) Model Evaluation: Assess the model's performance using metrics such as R², Mean Squared Error (MSE), or Mean Absolute Error (MAE).
(f) Prediction: Use the model to make predictions on new or unseen data.
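A minimal scikit-learn sketch of steps (d)-(f) on synthetic data (not data from the notes): fit a linear model, evaluate it with R² and MSE, and predict on unseen observations.

# Regression workflow sketch: build, evaluate, and predict (synthetic data)
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # (d) Model Building
y_pred = model.predict(X_test)                     # (f) Prediction on unseen data

print("R^2:", r2_score(y_test, y_pred))            # (e) Model Evaluation
print("MSE:", mean_squared_error(y_test, y_pred))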
2 Multivariate Analysis
Multivariate analysis is a statistical technique used to analyze data involving multi-
ple variables simultaneously. It helps in understanding the relationships, patterns,
and structure within datasets where more than two variables are interdependent.
(a) Define the Problem: Clearly identify the objectives and variables to be
analyzed.
(b) Collect Data: Gather accurate and relevant data for all variables.
(c) Preprocess Data: Handle missing values, standardize variables, and detect
outliers.
(d) Choose the Method: Select an appropriate multivariate technique based on
the objective.
(e) Apply the Method: Use statistical software (e.g., Python, R, SPSS) to
conduct the analysis.
(f) Interpret Results: Understand the output, identify patterns, and draw ac-
tionable insights.
The insights help the company design personalized offers and allocate marketing
budgets effectively.
3. Bayesian Networks
• Bayesian networks are graphical models that represent a set of variables and
their probabilistic dependencies using directed acyclic graphs (DAGs).
• Components of a Bayesian network:
– Nodes: Represent variables.
– Edges: Represent dependencies between variables.
– Conditional Probability Tables (CPTs): Quantify the relationships
between connected variables.
• Applications:
– Diagnosing diseases based on symptoms and test results.
– Predicting equipment failures in industrial systems.
– Understanding causal relationships in data.
• Incorporates prior knowledge into the analysis, making it robust for decision-
making.
• Handles uncertainty and incomplete data effectively.
• Supports dynamic updating of models as new evidence becomes available.
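A minimal two-node network (Disease → Symptom) can be evaluated with Bayes' rule in plain Python; the prior and the conditional probability table below are illustrative assumptions, not real medical statistics.

# Two-node Bayesian network sketch: prior P(Disease) and CPT P(Symptom | Disease)
p_disease = {True: 0.01, False: 0.99}                 # assumed prior
p_symptom_given_disease = {True: 0.90, False: 0.10}   # assumed CPT for Symptom = yes

def posterior_disease_given_symptom():
    # Bayes' rule: P(D | S) = P(S | D) P(D) / sum over d of P(S | d) P(d)
    numerator = p_symptom_given_disease[True] * p_disease[True]
    evidence = sum(p_symptom_given_disease[d] * p_disease[d] for d in (True, False))
    return numerator / evidence

print("P(Disease | Symptom) =", round(posterior_disease_given_symptom(), 4))
# 0.009 / (0.009 + 0.099) ≈ 0.083: observing the symptom raises the probability from 1% to about 8%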
• Objective: The objective of SVM is to find the hyperplane that maximizes the
margin between the nearest data points of different classes, known as support
vectors.
Maximize: 2 / ∥w∥ (equivalently, minimize ∥w∥² / 2)
subject to:
yi (w · xi + b) ≥ 1 ∀i
where:
– w: Weight vector defining the hyperplane.
– xi : Input data points.
– yi : Class labels (+1 or −1).
– b: Bias term.
• Soft Margin SVM: In cases where perfect separation is not possible, SVM
introduces slack variables ξi to allow misclassification:
yi (w · xi + b) ≥ 1 − ξi , ξi ≥ 0
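A short scikit-learn sketch of a soft-margin SVM on synthetic data; the value of C, which controls how strongly the slack variables are penalized, is an arbitrary assumption.

# Soft-margin SVM sketch (synthetic data)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="linear", C=1.0)    # smaller C -> softer margin, more slack allowed
clf.fit(X_train, y_train)
print("Support vectors per class:", clf.n_support_)
print("Test accuracy:", clf.score(X_test, y_test))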
2. Nonlinear Dynamics
• Definition: Nonlinear dynamics analyze time series data that exhibit chaotic
or nonlinear behaviors, which cannot be captured by linear models.
• Characteristics:
– Relationships between variables are complex and not proportional.
– Small changes in initial conditions can lead to significant differences in
outcomes (sensitive dependence on initial conditions).
• Common Techniques:
– Delay Embedding: Reconstructs a system’s phase space from a time series
to analyze its dynamics.
– Fractal Dimension Analysis: Measures the complexity of the data.
– Lyapunov Exponent: Quantifies the sensitivity to initial conditions.
• Applications:
– Modeling weather systems, which involve chaotic dynamics.
– Predicting heart rate variability in medical diagnostics.
– Analyzing financial markets where nonlinear dependencies exist.
• Example:
– Meteorologists use nonlinear dynamics to predict weather patterns, ac-
counting for the chaotic interactions of atmospheric variables.
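The logistic map is a standard toy model of chaotic behaviour (it is not part of these notes); the short sketch below shows sensitive dependence on initial conditions: two almost identical starting values quickly produce very different trajectories.

# Logistic map x_{n+1} = r * x_n * (1 - x_n) with two nearby starting points
def logistic_trajectory(x0, r=3.9, steps=30):
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = logistic_trajectory(0.400000)
b = logistic_trajectory(0.400001)   # differs by one part in a million
for n in (0, 10, 20, 30):
    print(f"step {n}: |difference| = {abs(a[n] - b[n]):.6f}")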
• In practice, time series data often exhibit both linear and nonlinear patterns.
• Hybrid models, such as combining traditional time series models with machine
learning techniques, are used to capture both types of behaviors for improved
accuracy.
6 Rule Induction
Rule induction is a data mining technique that extracts explicit if-then rules from data, producing models that are easy to interpret and explain.
7 Neural Networks
Neural networks are computational models inspired by the human brain, used for
pattern recognition and predictive tasks.
• Definition: Neural networks learn from historical data and generalize pat-
terns to make predictions on new, unseen data.
• Key Features:
– Learn complex relationships in data.
– Generalize well to unseen data if properly trained.
• Example: A neural network trained on a set of images of handwritten digits
can generalize and classify new, unseen digits.
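A minimal sketch of this digits example using scikit-learn's small built-in digits dataset and an MLP classifier (the network size is an assumption); the held-out test accuracy shows how well the model generalizes to unseen digits.

# Neural network generalization sketch on handwritten digits
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
net.fit(X_train, y_train)                                         # learn from training images
print("Accuracy on unseen digits:", net.score(X_test, y_test))    # generalization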
2. Competitive Learning
9 Fuzzy Logic
1. Extracting Fuzzy Models from Data
• Problem: Finding the optimal route for delivery trucks that minimizes travel
distance or time.
• Solution:
– Genetic Algorithms: Can be used to evolve a population of possible
routes, selecting and combining the best routes through crossover and
mutation to find an optimal or near-optimal solution.
– Simulated Annealing: Can be used to explore the space of possible
routes, accepting less optimal routes in the short term (to escape local
minima) and gradually converging to an optimal route as the temperature
decreases.
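A hedged Python sketch of simulated annealing on a toy routing problem is given below; the city coordinates and the cooling schedule are made-up assumptions, not values from the notes.

import math
import random

cities = [(0, 0), (2, 5), (5, 2), (6, 6), (8, 3), (1, 7)]   # assumed coordinates

def route_length(route):
    # total length of the closed tour visiting the cities in the given order
    return sum(math.dist(cities[route[i]], cities[route[(i + 1) % len(route)]])
               for i in range(len(route)))

def simulated_annealing(temp=10.0, cooling=0.995, steps=5000):
    current = list(range(len(cities)))
    random.shuffle(current)
    best = current[:]
    for _ in range(steps):
        i, j = sorted(random.sample(range(len(cities)), 2))
        candidate = current[:i] + current[i:j + 1][::-1] + current[j + 1:]   # reverse a segment
        delta = route_length(candidate) - route_length(current)
        # accept worse routes with a probability that shrinks as the temperature drops
        if delta < 0 or random.random() < math.exp(-delta / temp):
            current = candidate
            if route_length(current) < route_length(best):
                best = current[:]
        temp *= cooling
    return best, route_length(best)

route, dist = simulated_annealing()
print("Best route found:", route, "length:", round(dist, 2))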
UNIT 3
Syllabus
3. Stream computing
5. Filtering streams
7. Estimating moments
9. Decaying window
Definition: A data stream is a continuous and real-time flow of data elements made
available sequentially over time. Unlike traditional static datasets, data streams are
dynamic and require real-time processing and analysis to extract actionable insights.
Key Characteristics:
• High Volume: Streams can produce a large amount of data per second, re-
quiring scalable systems to handle the load.
• Transient Data: Data in streams may not be stored permanently and could
be processed in memory or with sliding windows.
• Heterogeneity: Data streams can come from diverse sources and in varying
formats (structured, semi-structured, or unstructured).
• Latency: Minimizing the time between data arrival and actionable insight is
crucial for applications like fraud detection.
• Data Quality: Ensuring the accuracy and reliability of streaming data, which
might have noise or incomplete values.
1. Data Streams:
• Continuous flow of data from various sources, such as sensors, social media,
or log files.
2. Stream Manager:
• Manages incoming data streams and forwards them to the processing com-
ponents.
3. System Catalog:
• Stores metadata about the system, such as stream schemas and resources
used for processing.
4. Scheduler:
5. Router:
6. Queue Manager:
7. Query Processor:
8. Query Optimizer:
3. Stream Computing
• Fault Tolerance: Ensures the system can recover from failures and continue
processing seamlessly.
• Social Media Analytics: Monitors and analyzes trends and user sentiments.
• Stream computing processes this data to identify traffic congestion and suggest
alternative routes in real-time.
• Random Sampling:
• Reservoir Sampling:
• Systematic Sampling:
– Picks every k-th element from the stream after selecting a random starting point.
– Useful when the stream has a repetitive structure or pattern.
• Stratified Sampling:
• Priority Sampling:
Advantages of Sampling:
Applications of Sampling:
• From a stream of 1 million tweets, randomly select 10% to estimate the senti-
ment trends for a product launch.
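A minimal Python sketch of reservoir sampling (Algorithm R), which keeps a uniform random sample of k items from a stream of unknown length using only O(k) memory.

import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)      # keep the new item with probability k / (i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), 10))   # 10 items sampled from a million-element stream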
5. Filtering Streams
• Time-based Filtering: Filters data within a specific time range (e.g., log
entries in the last 24 hours).
Example:
Techniques:
Example:
7. Estimating Moments
• Second Moment: The sum of the squares of the occurrence counts of the distinct elements; it measures how uneven (skewed) the distribution of values in the stream is.
Techniques:
Example:
Definition: Counting ones in a window involves tracking the number of ones (or
specific events) in a fixed-size window of the stream.
Techniques:
• Exponential Decay: Older data has less influence on the count over time.
Example:
• Count the number of "likes" in the last 5 minutes on a live video stream.
9. Decaying Window
Techniques:
Example:
• Monitor CPU usage, giving higher priority to recent data while reducing older
data’s impact.
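A minimal Python sketch of an exponentially decaying count; the decay constant c = 0.01 is an arbitrary assumption.

def decaying_count(events, c=0.01):
    total = 0.0
    for happened in events:           # events is a stream of 0/1 observations
        total = total * (1 - c)       # older observations lose influence at every step
        if happened:
            total += 1.0
    return total

stream = [1] * 50 + [0] * 200 + [1] * 5   # an old burst followed by a few recent events
print(round(decaying_count(stream), 2))    # recent events dominate the score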
Definition: RTAP refers to platforms designed for processing and analyzing real-
time data streams.
Applications:
• Stock Market Predictions: Process stock trade data to predict price move-
ments and trends.
• IoT Monitoring: Analyze sensor data from smart devices for immediate ac-
tion.
Key Features:
Example:
Case Studies
1. Real-Time Sentiment Analysis
Steps:
1. Data Collection:
• Use APIs such as Twitter API, Reddit API, or Facebook Graph API to
stream live data.
• Filter data using keywords, hashtags, geolocation, or user metadata.
• Example: Collect tweets containing hashtags like #Election2024 or #New-
ProductLaunch.
2. Data Preprocessing:
3. Sentiment Analysis:
4. Real-Time Processing:
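A simplified stand-in for steps 2-4 in plain Python: each incoming post is cleaned and scored against a tiny word list. The word lists are illustrative assumptions; a real pipeline would use a trained model or a sentiment library such as TextBlob or VADER.

import re

POSITIVE = {"great", "love", "excellent", "win"}     # assumed toy lexicon
NEGATIVE = {"bad", "hate", "terrible", "lose"}

def preprocess(text):
    text = re.sub(r"http\S+|[@#]\w+", "", text.lower())   # strip URLs, mentions, hashtags
    return re.findall(r"[a-z']+", text)

def sentiment(text):
    words = preprocess(text)
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

for post in ["I love the new launch! #NewProductLaunch", "Terrible update, I hate it"]:
    print(sentiment(post))   # in a live system this loop would run over the incoming stream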
Objective: Analyze and predict stock price trends in real-time using market data
and news feeds.
Applications:
Steps:
1. Data Collection:
• Gather real-time stock data from APIs such as Yahoo Finance, Alpha
Vantage, or Bloomberg.
• Collect relevant financial news or tweets using web scraping tools or news
APIs.
• Example: Monitor the stock prices of companies like Apple and Tesla
while analyzing related news articles.
2. Data Preprocessing:
4. Real-Time Processing:
UNIT 4
Syllabus
• Frequent Item sets and Clustering: Mining frequent item sets, market-based mod-
elling, Apriori algorithm.
• Handling large data sets in main memory, limited pass algorithm, counting frequent
item sets in a stream.
• Market-basket analysis.
Key Concepts
1. Support: The proportion of transactions in which an item set appears.
2. Confidence: For a rule A → B, the proportion of transactions containing A that also contain B.
3. Lift: Measures how much more likely items occur together than would be expected if they were independent.
Market-Based Modelling
This method analyzes customer purchasing behavior by identifying relationships
among items. It uses frequent item sets to predict buying patterns.
Example:
• Transaction Data:
Apriori Algorithm
The Apriori algorithm is used to identify frequent item sets and derive association rules. It works by:
• Generating candidate item sets from the frequent item sets found in the previous pass.
• Pruning candidate item sets that do not meet the minimum support threshold.
• Combining and pruning repeatedly until no new frequent item sets can be generated.
Key Optimizations
• Use the "downward closure property": all subsets of a frequent item set must also be frequent.
• Reduce computational cost by generating only candidate sets from previously fre-
quent sets.
Example:
• Input Transactions:
Output:
Frequent item sets and their support counts.
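A compact pure-Python sketch of the Apriori combine-and-prune loop; the transactions and the minimum support threshold are assumed for illustration.

transactions = [{"milk", "bread"}, {"milk", "diapers", "beer"},
                {"bread", "diapers", "beer"}, {"milk", "bread", "diapers"}]
min_support = 2   # assumed minimum number of supporting transactions

def apriori(transactions, min_support):
    items = {i for t in transactions for i in t}
    frequent, k = {}, 1
    current = [frozenset([i]) for i in items]
    while current:
        counts = {c: sum(c <= t for t in transactions) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}   # prune
        frequent.update(survivors)
        # combine surviving k-item sets into (k + 1)-item candidates
        current = list({a | b for a in survivors for b in survivors if len(a | b) == k + 1})
        k += 1
    return frequent

for itemset, count in sorted(apriori(transactions, min_support).items(),
                             key=lambda kv: len(kv[0])):
    print(set(itemset), count)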
Handling Large Datasets in Main Memory
As datasets grow, handling them efficiently in main memory becomes critical. Memory optimization techniques allow for processing large volumes of data without overwhelming system resources.
3. Sampling:
4. Data Partitioning:
Example Scenario
• Scenario: An online retailer processes real-time sales data to identify frequent item
sets.
• Approach:
• Benefits:
Real-World Applications
• E-Commerce: Tracking high-demand products during sales events.
Applications of Clustering
• Customer Segmentation: Grouping customers based on purchasing behavior.
1. Hierarchical Clustering
Hierarchical clustering creates a tree-like structure (dendrogram) to represent nested
clusters.
• Types:
– Agglomerative (Bottom-Up): Starts with each point as its own cluster and merges the closest clusters iteratively.
– Divisive (Top-Down): Starts with one cluster containing all points and splits them iteratively.
• Steps:
• Linkage Methods:
• Example:
2. K-Means Clustering
K-Means is a partition-based clustering method that divides data into k clusters, where
k is predefined.
• Steps:
• Example:
• Advantages:
• Limitations:
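A minimal K-Means sketch with scikit-learn on synthetic two-dimensional data; k = 3 is an assumed choice.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)            # assign each point to the nearest centroid

print("Cluster centers:\n", km.cluster_centers_)
print("First 10 labels:", labels[:10])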
Example
In a dataset with 100 features, clustering can be challenging. ProCLUS and CLIQUE
can reduce the dimensionality and identify clusters in subspaces where the data points
are most densely packed.
Approach
• Frequent Itemset Mining: The process begins by identifying frequent itemsets
using algorithms like Apriori or FP-growth.
• Cluster Formation: Once frequent itemsets are identified, they are used to form
clusters of data points that share similar patterns.
Example
In market basket analysis, frequent itemsets such as {Milk, Bread} or {Diapers, Beer} can be used to form customer segments that frequently purchase those items together.
• Graph-based Clustering: Uses graphs to model data where the vertices represent
data points, and the edges represent relationships or similarities between them.
Example
Text clustering can be done using cosine similarity to group documents that share com-
mon terms or topics, even when they do not share exact matching words.
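A short scikit-learn sketch of cosine similarity over TF-IDF vectors; the three example sentences are made up for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the stock market rallied today",
        "shares rose in the market",
        "the football team won the final match"]

tfidf = TfidfVectorizer().fit_transform(docs)   # documents -> TF-IDF vectors
similarity = cosine_similarity(tfidf)           # pairwise cosine similarities

print(similarity.round(2))   # the two finance sentences are more similar to each other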
Approach
• Stream Clustering Algorithms: These algorithms process data in a single pass,
and must handle data that arrives in real-time. Examples include the CluStream
and DenStream algorithms.
• Incremental Learning: Clustering models that update as new data arrives with-
out needing to recompute the clusters from scratch.
Example
In social media analytics, clustering algorithms need to identify trends in real-time as
millions of posts are generated every minute. Streaming clustering algorithms such as
CluStream can update clusters as new data flows in, ensuring that the model remains
relevant.
UNIT 5
Syllabus
• Frameworks: MapReduce, Hadoop, Pig, Hive, HBase, MapR, Sharding, NoSQL
Databases, S3, Hadoop Distributed File Systems.
Frameworks
Frameworks are essential tools in data analytics that provide the infrastructure to manage,
process, and analyze large datasets. They enable scalable, efficient, and fault-tolerant
operations, making them ideal for distributed systems.
MapReduce
Definition: MapReduce is a programming model used for processing and generating
large datasets. It splits the data into chunks, processes it in parallel, and reduces it to
meaningful results.
Steps:
1. Input: A large dataset is split into smaller chunks.
2. Map Phase: Each chunk of data is processed independently. The map function
converts each item into a key-value pair.
3. Shuffling and Sorting: After the map phase, key-value pairs are grouped by their
keys.
4. Reduce Phase: The reduce function takes the grouped key-value pairs and aggre-
gates them into meaningful results.
• Map Phase: The map function processes each number and creates key-value pairs
with the number as the key and ‘1‘ as the value.
• Reduce Phase: The reduce function groups the key-value pairs by the key and
sums the values.
• Output: The result is a single pair (1, 4), meaning the key 1 appeared in four input records.
Applications:
• Word Count: Counting word frequencies in large datasets.
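The same phases can be mimicked for word counting in a few lines of plain Python; this is a toy sketch of the idea, not Hadoop code.

from collections import defaultdict

documents = ["big data is big", "data is everywhere"]   # assumed tiny input

# Map phase: emit a (word, 1) pair for every word
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle and sort: group the emitted values by key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate the values for each key
result = {key: sum(values) for key, values in grouped.items()}
print(result)   # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}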
Advantages:
• Scalability: Can handle large datasets by distributing tasks across many machines.
• Fault Tolerance: The system can recover from task failures by retrying the failed
tasks.
Hadoop
Definition: Hadoop is an open-source framework for storing and processing large datasets
in a distributed manner across clusters of computers. It allows for the efficient processing
of large datasets in a fault-tolerant and scalable way.
Components:
• Hadoop Distributed File System (HDFS): A distributed file system that stores
data across multiple machines in a cluster, ensuring redundancy and fault tolerance.
• MapReduce: A programming model and processing engine that allows for parallel
processing of data across nodes in a cluster.
• Hadoop Common: A set of shared libraries and utilities that support the other
Hadoop modules.
Example: Netflix uses Hadoop to analyze user data for recommendations. By pro-
cessing large volumes of user viewing data, Hadoop helps generate personalized recom-
mendations for each user, ensuring better engagement and user experience.
Pig
Definition: Pig is a high-level platform developed on top of Hadoop for creating MapRe-
duce programs. It simplifies the process of writing MapReduce programs by providing
a more user-friendly, procedural language called Pig Latin. Pig is designed to handle
both batch processing and data transformation jobs, making it easier for analysts and
programmers to process large datasets without having to deal with low-level MapReduce
code directly.
Features:
• Extensibility: Pig allows for the addition of custom functions, making it extensible
for specific use cases.
• Optimization: Pig automatically optimizes queries, minimizing the need for man-
ual performance tuning.
• Support for complex data types: Pig can handle complex data types, including
nested data structures.
Pig Latin Syntax: Pig Latin is similar to SQL in its structure but is tailored for
the MapReduce paradigm. Here is an example of a Pig Latin query:
Explanation of Example:
• B = FILTER A BY age > 30;: This statement filters the loaded data and keeps only the records where age is greater than 30.
• STORE B INTO 'output';: Finally, the filtered data (B) is stored in the output directory.
Execution Flow:
1. Loading Data: Pig reads data from sources like HDFS, local files, or relational databases.
2. Transforming Data: Pig supports various transformations such as filtering, grouping, joining, and sorting.
3. Storing Data: The transformed data is stored back into HDFS, a database, or another storage system.
Applications:
Advantages:
Hive
Definition: Hive is a data warehousing and SQL-like query language system built on top
of Hadoop. It is used for managing and querying large datasets stored in Hadoop’s HDFS.
Hive abstracts the complexities of writing MapReduce jobs and provides a more user-
friendly interface for querying large datasets using a SQL-like language called HiveQL.
Components:
• Metastore: A central repository that stores metadata about the data stored in
HDFS, such as table structures and partitions.
• HiveQL: A query language similar to SQL that enables users to perform data
analysis and querying tasks.
• Driver: The component responsible for receiving queries and sending them to the
execution engine for processing.
• Execution Engine: The component that executes the MapReduce jobs generated
from HiveQL queries on the Hadoop cluster.
Query Execution Flow:
1. Writing Queries: Users write queries using HiveQL, which is a SQL-like language.
2. Compiling Queries: The queries are compiled by the Hive driver, which translates
them into MapReduce jobs.
3. Executing Queries: The execution engine runs the compiled jobs on the Hadoop
cluster to process the data.
4. Storing Results: Results can be stored back into HDFS or in other storage systems
like HBase.
Applications:
• Data Analysis: Analyzing large datasets using SQL-like queries.
• Extensibility: Users can add custom UDFs (User Defined Functions) to extend
Hive’s capabilities.
Sharding:
NoSQL Databases:
S3:
• Commonly used for storing backups, media files, and big data.
HDFS Architecture
Metadata:
• Metadata in HDFS refers to information about the structure of data stored in the
system (e.g., file names, file locations, permissions).
Read Data:
• NameNode provides the list of DataNodes where the file’s blocks are stored.
Write Data:
• Each block is replicated to ensure fault tolerance (default replication factor is 3).
• DataNodes store the data blocks and confirm back to the client.
Metadata Manipulation:
• It stores the metadata in memory and on the local disk as persistent storage.
• Metadata includes block locations, file names, and the replication factor.
NameNode:
• NameNode maintains information about file blocks and where they are stored.
• It does not store the actual data but handles the file system namespace and block
management.
DataNode Rack 1:
• DataNodes are worker nodes in HDFS responsible for storing actual data blocks.
• They are distributed across multiple racks for redundancy and high availability.
• Each DataNode in Rack 1 stores replicas of data blocks as per the replication factor.
DataNode Rack 2:
• HDFS ensures data redundancy by replicating data blocks across different racks.
• This improves data availability and fault tolerance in case of rack failure.
• DataNodes in Rack 2 store data blocks based on the replication factor defined by
NameNode.
Visualization
Visualization is the graphical representation of data to identify patterns, trends, and
insights. It helps in understanding complex data by presenting it in charts, graphs, or
other visual forms. Common tools include Tableau, Power BI, and D3.js.
• Line Charts:
– Line charts are used to visualize trends over time or continuous data.
– They are ideal for showing changes in data at evenly spaced intervals, such as
stock prices or temperature.
– The X-axis represents time or the continuous variable, while the Y-axis repre-
sents the values of the data points.
• Bar Charts:
• Scatter Plots:
– Scatter plots display data points on a two-dimensional plane, with one variable
on the X-axis and the other on the Y-axis.
– They are useful for showing the relationship between two continuous variables,
helping to identify correlations or trends.
– Scatter plots can help detect patterns, clusters, or outliers in the data.
• Heatmaps:
Example: A heatmap showing temperature variations over a year might use color
gradients to represent temperature changes over different months or days. This visual
representation allows quick identification of periods with extreme heat or cold.
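A small matplotlib sketch of two of the chart types described above: a line chart for a trend over time and a scatter plot for the relationship between two variables. All plotted values are illustrative assumptions.

import matplotlib.pyplot as plt

months = list(range(1, 13))
temperature = [5, 7, 11, 15, 20, 25, 28, 27, 22, 16, 10, 6]   # assumed values
ad_spend = [10, 12, 15, 18, 20, 22, 25, 27, 30, 32, 35, 40]
sales = [12, 14, 18, 20, 24, 25, 29, 30, 33, 36, 38, 45]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(months, temperature, marker="o")     # line chart: trend over time
ax1.set(title="Monthly temperature", xlabel="Month", ylabel="Temperature")
ax2.scatter(ad_spend, sales)                  # scatter plot: relationship between two variables
ax2.set(title="Ad spend vs sales", xlabel="Ad spend", ylabel="Sales")
plt.tight_layout()
plt.show()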
Interaction Techniques
Types:
• Brushing and Linking:
– Brushing and linking is a technique that allows users to highlight data points in one visualization and see the corresponding data in other visualizations.
– For example, in a dashboard, brushing a region on a scatter plot could highlight
the same points on a related bar chart or line chart.
– This interaction helps users explore relationships and patterns across multiple
views of the data.
• Zooming and Panning:
– Zooming and panning techniques enable users to explore data at different levels of detail by adjusting the view.
– Zooming allows users to focus on a specific portion of the data, such as exam-
ining a particular time period in a time series.
– Panning enables users to move across large datasets to explore different sec-
tions of the data, such as navigating through geographic data or large tables.
• Filtering:
Example: Interactive dashboards in Tableau often allow users to apply brushing and
linking techniques, zoom into specific regions on maps, and filter data by different criteria
to create dynamic visualizations tailored to the user’s needs.
Introduction to R
R is a powerful language for statistical computing and data analysis. It provides a
wide variety of statistical techniques and graphical methods, making it popular for data
analysis, data visualization, and statistical computing.
• read.table(): Reads general text files into R. This function allows more flexibility
with delimiters and other file formats.
Export:
• write.table(): Writes data to a general text file, with more options for formatting
the output.
Example:
# Importing data
my_data <- read.csv("data.csv")
# Exporting data
write.table(my_data, "output.txt", sep = "\t", row.names = FALSE)