Data Analytics Notes
Digital Notes
Contents
Unit-1(Introduction) ................................................................................................................. 5
1.1 Why Data Analytics ................................................................................................... 5
1.2 IMPORTANCE DATA ANALYTICS .................................................................................. 5
1.3 CHARACTERISTICS OF DATA ANALYTICS...................................................................... 6
1.4 CLASSIFICATION OF DATA (STRUCTURED, SEMI-STRUCTURED, UNSTRUCTURED)......... 8
1.5 WHAT COMES UNDER BIG DATA? .............................................................................. 8
1.6 Types of Data ............................................................................................................. 9
1.7 STAGES OF BIG DATA BUSINESS ANALYTICS .............................................. 12
1.8 Key Computing Resources for Big Data .................................................... 13
1.9 BENEFITS OF BIG DATA ................................................................................. 15
1.10 Big Data Technologies ......................................................................... 16
1.11 BIG DATA CHALLENGES ............................................................................... 17
1.12 ANALYSIS VS REPORTING .............................................................................. 17
1.12 DATA ANALYTICS LIFECYCLE ......................................................................... 19
1.13 Common Tools for the Data Preparation Phase ........................................ 21
1.15 COMMON TOOLS FOR THE MODEL BUILDING PHASE .................................................. 22
KEY ROLES FOR A SUCCESSFUL ANALYTICS PROJECT .................................................. 23
Unit – 2 (Data Analysis) .................................................................................... 23
2.1 What is Regression Analysis? .................................................................. 24
2.2 APPLICATION OF REGRESSION ANALYSIS IN RESEARCH ............................ 25
2.3 USE OF REGRESSION IN ORGANIZATIONS ............................................... 26
2.4 MULTIVARIATE (LINEAR) REGRESSION......................................................... 26
2.4.1 Polynomial Regression ............................................................................ 26
2.4.2 NONLINEAR REGRESSION ................................................................ 27
2.5 Introduction to Bayesian Modeling........................................................... 27
2.6 WHAT IS A BAYESIAN NETWORK? ...................................................................... 29
2.7 Limitations of Bayesian Networks ..................................................................... 31
2.8 INTRODUCTION TO SVM ................................................................................ 32
2.9 TIME SERIES ANALYSIS ................................................................................. 34
2.10 Nonlinear vs linear..................................................................................... 35
2.11 RULE INDUCTION ....................................................................................... 35
2.12 WHAT ARE NEURAL NETWORKS?...................................................................... 36
2.13 Types of Neural Networks ............................................................................ 37
2.14 Principal Component Analysis ............................................................... 37
2.15 WHAT IS FUZZY LOGIC? ........................................................................... 39
2.16 BENEFITS OF USING FUZZY LOGIC ............................................................. 40
Unit-3 (Mining Data Streams) ............................................................................. 42
3.1 What is Data Stream Mining ........................................................................... 42
3.2 Characteristics of data stream model ................................................................... 42
3.3 Data Streaming Architecture .................................................................. 43
3.4 Methodologies for Stream Data Processing ............................................................ 45
3.5 Stream Data Processing Methods....................................................................... 46
3.6 FILTERING AND STREAMING............................................................................. 47
3.7 RTAP .................................................................................................... 49
3.8 BENEFITS OF REAL-TIME ANALYTICS .................................................................... 51
3.9 CHALLENGES ............................................................................................ 52
3.10 Use cases for real-time analytics in customer experience management ............................. 52
3.11 EXAMPLES OF REAL-TIME ANALYTICS INCLUDE: ...................................................... 53
3.12 TYPES OF REAL-TIME ANALYTICS ...................................................................... 54
3.13 Generic Design of an RTAP........................................................................... 54
3.14 Stock Market Predictions .............................................................................. 55
3.15 Real Time stock Predictions........................................................................... 55
Unit-4 (Frequent Itemsets and Clustering) ........................................................................ 56
4.1 MINING FREQUENT PATTERNS, ASSOCIATION AND CORRELATIONS: .................................. 56
4.2 WHY IS FREQ. PATTERN MINING IMPORTANT? ......................................................... 56
4.3 INTRODUCTION TO MARKET BASKET ANALYSIS ....................................................... 56
4.4 Market Basket Benefits ................................................................................. 57
4.5 ASSOCIATION RULE MINING ............................................................................. 58
4.6 Example Association Rule .............................................................................. 59
4.7 Definition: Frequent Itemset ............................................................................ 60
4.8 SUPPORT AND CONFIDENCE ............................................................................. 61
4.9 Mining Association Rules .............................................................................. 62
4.10 FREQUENT ITEMSET GENERATION ..................................................................... 63
4.11 APRIORI ALGORITHM ................................................................................... 63
4.12 LIMITATIONS ........................................................................................ 66
4.13 METHODS TO IMPROVE APRIORI’S EFFICIENCY ............................................. 66
4.14 WHAT IS CLUSTER ANALYSIS? ........................................................................ 66
4.15 GENERAL APPLICATIONS OF CLUSTERING ............................................................. 67
4.16 Requirements of Clustering in Data Mining .......................................................... 67
4.17 Similarity and Dissimilarity Measures ................................................................ 67
4.18 MAJOR CLUSTERING APPROACHES .................................................................... 68
4.19 PARTITIONING ALGORITHMS: BASIC CONCEPT ........................................................ 68
4.20 K-MEANS CLUSTERING ................................................................................ 69
4.21 K-means Algorithms .................................................................................... 69
4.22 Weaknesses of K-Mean Clustering ................................................................... 70
4.23 APPLICATIONS OF K-MEAN CLUSTERING ............................................................. 70
4.24 CONCLUSION ........................................................................................ 70
4.25 CLIQUE (CLUSTERING IN QUEST).................................................................... 71
4.26 Strength and Weakness of CLIQUE .................................................................. 71
4.27 FREQUENT PATTERN-BASED APPROACH .............................................................. 71
Unit-5 (Frame Works and Visualization) ......................................................................... 72
5.1 WHAT IS HADOOP? ...................................................................................... 72
5.2 HADOOP DISTRIBUTED FILE SYSTEM ................................................................... 72
5.3 MAPREDUCE ............................................................................................. 73
5.4 Why MapReduce is so popular ......................................................................... 73
5.5 Understanding Map and Reduce........................................................................ 74
5.6 Benefits of MapReduce ................................................................................. 75
5.7 HDFS: Hadoop Distributed File System ............................................................... 76
5.8 HDFS ARCHITECTURE .................................................................................. 77
5.9 WHAT IS HIVE? .......................................................................................... 81
5.10 Hive Vs Relational Databases ......................................................................... 82
5.11 HIVE ARCHITECTURE .................................................................................. 84
5.12 WHAT IS PIG ............................................................................................ 85
5.13 HBASE .................................................................................................. 87
5.14 WHAT IS NOSQL? ..................................................................................... 87
5.15 HBASE VS. HDFS ...................................................................................... 88
5.16 HBase and RDBMS ................................................................................... 89
5.17 NoSQL Databases ..................................................................................... 90
5.18 VISUAL DATA ANALYTICS ............................................................................. 91
5.19 Visual Analytics Process .............................................................................. 92
5.21 VISUALIZATION TOOLS ................................................................................. 93
5.22 VISUAL DATA ANALYTICS APPLICATIONS ............................................................ 94
6. Question Bank Unit Wise....................................................................................... 95
6. 1 QUESTION BANK OF UNIT 1 ............................................................................. 95
6. 2 Question Bank of Unit 2 ............................................................................... 95
6.3 QUESTION BANK OF UNIT 3 ............................................................................. 96
6.4 QUESTION BANK OF UNIT 4 ............................................................................. 96
6.5 QUESTION BANK OF UNIT 5 ............................................................................. 97
7.Multiple Choice Question Unit Wise ........................................................................... 97
7.1 MCQ’s of Unit 1 ........................................................................................ 97
7.2 MCQ’s of Unit 2 ......................................................................................... 98
7.3 MCQ’s of Unit 3 ....................................................................................... 100
7.4 MCQ’s of Unit 4 ....................................................................................... 101
7.5 MCQ’s of Unit 5 ...................................................................................... 102
8. Previous year Question papers ...................................................................... 103
8.1 Year -2021 ............................................................................................ 103
9. NPTEL Lectures Link ........................................................................................ 105
9.1 Link 1 Introduction to Data Analytics ................................................................ 105
9.2 Link 2 Supervised Learning .......................................................................... 106
9.3 Link 3 Logistic Regression ............................................................................ 106
9.4 Link 4 Support Vector Machines ..................................................................... 106
9.5 Link 5 Artificial Neural Network..................................................................... 106
Unit-1(Introduction)
1.1 Why Data Analytics
Examples
• E-Promotions: Based on your current location, your purchase history, and what
you like, send promotions right now for the store next to you.
• Healthcare monitoring: sensors monitor your activities and body; any
abnormal measurements require immediate reaction.
• Variety: This refers to the large variety of input data, which in turn generates a large
amount of data as output.
a. Various formats, types, and structures
b. Text, numerical, images, audio, video, sequences, time
series, social media data, multi-dim arrays, etc…
c. Static data vs. streaming data
Social media sources such as Facebook and Twitter generate tremendous amounts of
comments and tweets. This data can be captured and analyzed to understand, for
example, what people think about new product introductions.
Machines, such as smart meters, generate data. These meters continuously stream data
about electricity, water, or gas consumption that can be shared with customers and
combined with pricing plans to motivate customers to move some of their energy
consumption, such as for washing clothes, to non-peak hours. There is a tremendous
amount of geospatial (e.g., GPS) data, such as that created by cell phones, that can
be used by applications like Foursquare to help you know the locations of friends
and to receive offers from nearby stores and restaurants. Image, voice, and audio
data can be analyzed for applications such as facial recognition systems in security
systems.
Big data involves the data produced by different devices and applications. Given below are
some of the fields that come under the umbrella of Big Data.
• Black Box Data: It is a component of helicopters, airplanes, jets, etc. It captures
the voices of the flight crew, recordings of microphones and earphones, and the
performance information of the aircraft.
• Social Media Data: Social media such as Facebook and Twitter hold information and
the views posted by millions of people across the globe.
• Stock Exchange Data: Stock exchange data holds information about the ‘buy’
and ‘sell’ decisions made by customers on shares of different companies.
• Power Grid Data: Power grid data holds information about the power consumed by a
particular node with respect to a base station.
• Transport Data: Transport data includes model, capacity, distance and availability of
a vehicle.
• Search Engine Data: Search engines retrieve lots of data from different databases.
Thus Big Data includes huge volume, high velocity, and extensible variety of data.
Structured Data
3. Predictive Analytics:
o This stage involves predicting possible future events based on the
information obtained from the Descriptive and/or Discovery Analytics stages.
In this stage, possible risks involved can also be identified. E.g.: What will the
sales improvement be next year (making insights for the future)?
4. Prescriptive Analytics:
o It involves planning actions or making decisions to improve the business
based on the predictive analytics. E.g.: How much material should
be procured to increase production?
What is needed for Big Data?
• Massive Parallelism
• Huge Data Volumes Storage
• Data Distribution
• High-Speed Networks
• High-Performance Computing
• Task and Thread Management
• Data Mining and Analytics
• Data Retrieval
• Machine Learning
• Data Visualization
• Commodity hardware
Big data is really critical to our lives and is emerging as one of the most important
technologies in the modern world. Following are just a few benefits which are very well known
to all of us:
• Using the information kept in social networks like Facebook, marketing
agencies are learning about the response to their campaigns, promotions, and other
advertising media.
• Using information in social media, such as the preferences and product perception of
their consumers, product companies and retail organizations are planning their
production.
• Using the data regarding the previous medical history of patients, hospitals are
providing better and quicker service.
• Accumulation of raw data captured from various sources (i.e. discussion boards,
emails, exam logs, chat logs in e-learning systems) can be used to identify fruitful
patterns and relationships
• By itself, stored data does not generate business value, and this is true of traditional
databases, data warehouses, and new technologies such as Hadoop for storing big
data. Once the data is appropriately stored, however, it can be analyzed, which can
create tremendous value.
• A variety of analysis technologies, approaches, and products have emerged that are
especially applicable to big data, such as in-memory analytics, in-database analytics, and
appliances.
Big data technologies are important in providing more accurate analysis, which may lead to
more concrete decision-making resulting in greater operational efficiencies, cost reductions,
and reduced risks for the business.
There are various technologies in the market from different vendors including Amazon, IBM,
Microsoft, etc., to handle big data. While looking into the technologies that handle big data,
we examine the following two classes of technology:
• Operational Big Data: This includes systems like MongoDB that provide operational
capabilities for real-time, interactive workloads where data is primarily captured and stored.
• NoSQL Big Data systems are designed to take advantage of new cloud computing
architectures that have emerged over the past decade to allow massive computations
to be run inexpensively and efficiently. This makes operational big data workloads
much easier to manage, cheaper, and faster to implement.
• Some NoSQL systems can provide insights into patterns and trends based on real-time
data with minimal coding and without the need for data scientists and additional
infrastructure.
• Analytical Big Data: This includes systems like Massively Parallel Processing (MPP)
database systems and MapReduce that provide analytical capabilities for retrospective
and complex analysis that may touch most or all of the data.
• MapReduce provides a new method of analyzing data that is complementary to the
capabilities provided by SQL, and a system based on MapReduce can be scaled
up from single servers to thousands of high- and low-end machines.
• These two classes of technology are complementary and frequently deployed
together.
1.11 BIG DATA CHALLENGES
o Capturing data
o Curation
o Storage
o Searching
o Sharing
o Transfer
o Analysis
o Presentation
To address the above challenges, organizations normally take the help of enterprise servers.
It’s important that we differentiate the two, because some organizations might be selling
themselves short in one area and not reap the benefits which web analytics can bring to the
table. The first core component of web analytics, reporting, is merely organizing data into
summaries. Analysis, on the other hand, is the process of inspecting, cleaning, transforming,
and modeling these summaries (reports) with the goal of highlighting useful information.
Simply put, reporting translates data into information, while analysis turns information into
insights. Also, reporting should enable users to ask “What?” questions about the information,
whereas analysis should answer “Why?” and “What can we do about it?”
1. Purpose
Reporting helps companies monitor their data even before digital technology boomed.
Various organizations have been dependent on the information it brings to their business, as
reporting extracts that and makes it easier to understand.
Analysis interprets data at a deeper level. While reporting can link cross-channels of
data, provide comparisons, and make information easier to understand (think of a dashboard,
charts, and graphs, which are reporting tools and not analysis reports), analysis interprets this
information and provides recommendations on actions.
2. Tasks
As reporting and analysis have a very fine line dividing them, sometimes it’s easy to confuse
tasks that have analysis labeled on top of them when all it does is reporting. Hence, ensure
that your analytics team has a healthy balance doing both.
Here’s a great differentiator to keep in mind if what you’re doing is reporting or analysis:
3. Outputs
Reporting and analysis have a push and pull effect on their users through their outputs.
Reporting has a push approach, as it pushes information to users and outputs come in the
forms of canned reports, dashboards, and alerts.
Analysis has a pull approach, where a data analyst draws information to further probe and to
answer business questions. Outputs from such can be in the form of ad hoc responses and
analysis presentations. Analysis presentations are composed of insights, recommended
actions, and a forecast of their impact on the company—all in a language that’s easy to
understand at the level of the user who’ll be reading and deciding on it.
This is important for organizations to truly realize the value of data: a standard
report is not the same as meaningful analysis.
4. Delivery
Analysis requires a more custom approach, with human minds doing superior reasoning and
analytical thinking to extract insights, and technical skills to provide efficient steps towards
accomplishing a specific goal. This is why data analysts and scientists are in demand these
days, as organizations depend on them to come up with recommendations that help leaders
and business executives make decisions about their businesses.
5. Value
This isn’t about identifying which one brings more value, rather understanding that both are
indispensable when looking at the big picture. It should help businesses grow, expand, move
forward, and make more profit or increase their value.
Reporting                      Analysis
Provides data                  Provides answers
Provides what is asked for     Provides what is needed
Is typically standardized      Is typically customized
Does not involve a person      Involves a person
Is fairly inflexible           Is extremely flexible
Phases-of-data-analytics-lifecycle
Phase 1: Discovery
In this phase,
• The data science team must learn and investigate the problem,
• Develop context and understanding, and
• Learn about the data sources needed and available for the project.
• In addition, the team formulates initial hypotheses that can later be tested with data.
The team should perform five main activities during this step of the discovery
phase:
• Identify data sources: Make a list of data sources the team may need to test
the initial hypotheses outlined in this phase.
o Make an inventory of the datasets currently available and those that
can be purchased or otherwise acquired for the tests the team wants to perform.
• Capture aggregate data sources: This is for previewing the data and providing
high-level understanding.
o It enables the team to gain a quick overview of the data and perform
further exploration on specific areas.
• Review the raw data: Begin understanding the interdependencies among the
data attributes.
o Become familiar with the content of the data, its quality, and its
limitations.
• Evaluate the data structures and tools needed: The data type and structure dictate
which tools the team can use to analyze the data.
• Scope the sort of data infrastructure needed for this type of problem: In addition
to the tools needed, the data influences the kind of infrastructure that's required, such
as disk storage and network capacity.
• Unlike many traditional stage-gate processes, in which the team can advance only
when specific criteria are met, the Data Analytics Lifecycle is intended to
accommodate more ambiguity.
• For each phase of the process, it is recommended to pass certain checkpoints as a way
of gauging whether the team is ready to move to the next phase of the Data Analytics
Lifecycle.
The simple linear regression model:
y = β0 + β1x + ε
where β0 + β1x is the linear component and ε is the random error component.
The estimated simple linear regression equation:
ŷi = b0 + b1xi
where ŷi is the estimated (predicted) y value and xi is the value of the independent variable.
• The coefficients b0 and b1 will be found using computer software, such as Excel’s
data analysis add-in or MegaStat
• Other regression measures will also be computed as part of computer-based
regression analysis
Simple Linear Regression Example
A real estate agent wishes to examine the relationship between the selling price of a
home and its size (measured in square feet)
A random sample of 10 houses is selected
• Dependent variable (y) = house price in $1000
• Independent variable (x) = square feet
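The coefficients can also be computed directly in Python rather than with Excel or MegaStat. The sketch below fits the least squares line for the house-price example; the ten (square feet, price) pairs are invented for illustration and are not the data from the original example.

```python
# A minimal least-squares sketch for the house-price example.
# The ten data points below are made-up illustration values;
# price is in $1000s, as in the example setup.
import numpy as np

square_feet = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

# Least-squares estimates: b1 = cov(x, y) / var(x), b0 = mean(y) - b1 * mean(x)
b1 = np.cov(square_feet, price, bias=True)[0, 1] / np.var(square_feet)
b0 = price.mean() - b1 * square_feet.mean()
print(f"estimated line: price = {b0:.2f} + {b1:.4f} * square_feet")

# Predict the price of a hypothetical 2000 sq. ft. house
print("predicted price for 2000 sq ft:", b0 + b1 * 2000)
```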
LEAST SQUARES REGRESSION
• In situations where
– a latent construct cannot be appropriately represented as a
continuous variable,
– ordinal or discrete indicators do not reflect underlying continuous variables,
– the latent variables cannot be assumed to be normally distributed,
traditional Gaussian modeling is clearly not appropriate.
• In addition, normal distribution analysis sets minimum requirements for the number
of observations, and the measurement level of variables should be continuous.
• A priori probability
• Conditional probability
• Posteriori probability
Bayes’ Theorem
Why does it matter? If 1% of a population have cancer, then for a screening test with 80%
sensitivity and 95% specificity:
P[Test +ve | Cancer] = 80%
P[Test +ve] / P[Cancer] = 5.75
P[Cancer | Test +ve] ≈ 14%
... i.e. most positive results are actually false alarms.
Mixing up P[A | B] with P[B | A] is the Prosecutor’s Fallacy; a small probability of evidence
given innocence need NOT mean a small probability of innocence given evidence.
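The screening-test numbers above follow directly from Bayes’ theorem. A small Python check, using the stated 1% prevalence, 80% sensitivity and 95% specificity, is shown below.

```python
# Bayes' theorem for the screening-test example:
# P(Cancer | +) = P(+ | Cancer) * P(Cancer) / P(+)
prevalence = 0.01      # P(Cancer): 1% of the population
sensitivity = 0.80     # P(Test +ve | Cancer)
specificity = 0.95     # P(Test -ve | No Cancer)

# Total probability of a positive test
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
print(p_positive)                      # 0.0575, so P(+) / P(Cancer) = 5.75

posterior = sensitivity * prevalence / p_positive
print(round(posterior, 3))             # ~0.139, i.e. about 14%
```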
So BN = (DAG, CPD): a Bayesian network consists of a directed acyclic graph (DAG)
together with conditional probability distributions (CPDs).
Each node in the graph represents a random variable.
Inference answers queries of the form P(X | E),
where X is the query variable and E is the evidence variable.
Summary
• Bayesian methods provide sound theory and framework for implementation of
classifiers
• Bayesian networks are a natural way to represent conditional independence
information. Qualitative information is in the links, quantitative information in the tables.
• It is NP-complete or NP-hard to compute exact values; it is typical to make simplifying
assumptions or use approximate methods.
• Many Bayesian tools and systems exist
• Bayesian Networks: an efficient and effective representation of the joint
probability distribution of a set of random variables
• Efficient:
o Local models
o Independence (d-separation)
• Effective:
o Algorithms take advantage of structure to
o Compute posterior probabilities
o Compute most probable instantiation
o Support decision making
2.8 INTRODUCTION TO SVM
Support vector machines (SVMs) are powerful yet flexible supervised machine learning
algorithms which are used both for classification and regression. But generally, they are
used in classification problems. SVMs were first introduced in the 1960s but were later
refined in the 1990s. SVMs have their unique way of implementation as compared to other
machine learning algorithms. Lately, they are extremely popular because of their ability to
handle multiple continuous and categorical variables.
Working of SVM
SVM Kernels
• In practice, the SVM algorithm is implemented with a kernel that transforms an input data
space into the required form. SVM uses a technique called the kernel trick, in which the
kernel takes a low-dimensional input space and transforms it into a higher-dimensional
space. In simple words, the kernel converts non-separable problems into
separable problems by adding more dimensions. This makes SVM more powerful,
flexible and accurate. The following are some of the types of kernels used by SVM.
Linear Kernel
• It can be used as a dot product between any two observations. The formula of the linear
kernel is as below:
• K(x, xi) = sum(x ∗ xi)
• From the above formula, we can see that the product between two vectors x and xi
is the sum of the multiplication of each pair of input values.
Polynomial Kernel
• It is a more generalized form of the linear kernel and can distinguish curved or nonlinear
input space. Following is the formula for the polynomial kernel:
• K(X, Xi) = 1 + sum(X ∗ Xi)^d
• Here d is the degree of the polynomial, which we need to specify manually in the
learning algorithm.
RBF Kernel
• The RBF kernel, mostly used in SVM classification, maps the input space into an
infinite-dimensional space. The following formula explains it mathematically:
• K(x, xi) = exp(−gamma ∗ sum((x − xi)^2))
• Here, gamma ranges from 0 to 1. We need to specify it manually in the learning
algorithm. A good default value of gamma is 0.1.
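As a rough illustration, the sketch below trains SVM classifiers with the linear, polynomial and RBF kernels using scikit-learn (assumed to be available); the dataset is synthetic and is generated only to show the workflow.

```python
# Train SVM classifiers with the three kernels described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    # degree only matters for the polynomial kernel; gamma for poly/rbf
    clf = SVC(kernel=kernel, degree=3, gamma=0.1)
    clf.fit(X_train, y_train)
    print(kernel, "accuracy:", clf.score(X_test, y_test))
```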
2.9 TIME SERIES ANALYSIS
• Aim:
– To collect and analyze the past observations to develop an appropriate model which can
then be used to generate future values for the series.
• Time Series Forecasting is based on the idea that the history of occurrences over time
can be used to predict the future
Application
• Business
• Economics
• Finance
• Science and Engineering
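As a minimal illustration of using past observations to generate future values, the sketch below implements a simple moving-average forecaster; the monthly sales series is invented and the window size is an arbitrary choice.

```python
# A simple moving-average forecaster: each future value is the average of the
# last `window` observations, and forecasts are rolled forward step by step.
def moving_average_forecast(history, window=3, steps=4):
    series = list(history)
    forecasts = []
    for _ in range(steps):
        next_value = sum(series[-window:]) / window
        forecasts.append(next_value)
        series.append(next_value)   # treat the forecast as the newest observation
    return forecasts

sales = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]   # invented monthly sales
print(moving_average_forecast(sales, window=3, steps=4))
```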
• Some rule induction systems induce more complex rules, in which values of attributes
may be expressed by negation of some values or by a value subset of the attribute
domain
• Data from which rules are induced are usually presented in a form similar to a table
in which cases (or examples) are labels (or names) for rows and variables are labeled
as attributes and a decision. We will restrict our attention to rule induction which
belongs to supervised learning:
• all cases are preclassified by an expert. In different words, the decision value is assigned
by an expert to each case. Attributes are independent variables and the decision is a
dependent variable.
• A very simple example of such a table is presented as Table 1.1, in which the attributes
are Temperature, Headache, Weakness, Nausea, and the decision is Flu. The set of all
cases labeled by the same decision value is called a concept. For Table 1.1, the case set
{1, 2, 4, 5} is a concept of all cases affected by flu (for each case from this set the
corresponding value of Flu is yes).
• E[X] = 0
• where E is the statistical expectation operator. If X does not have zero mean, we first
subtract the mean from X before we proceed with the rest of the analysis.
• Let q denote a unit vector, also of dimension m, onto which the vector X is to be
projected. This projection is defined by the inner product of the vectors X and q:
• A = X^T q = q^T X
• subject to ||q|| = (q^T q)^(1/2) = 1
• The projection A is a random variable with a mean and variance related to the
statistics of the vector X. Assuming that X has zero mean, we can calculate the mean
value of the projection A:
• E[A] = q^T E[X] = 0
• The variance of A is therefore the same as its mean-square value and so we can
write:
• σ² = E[A²] = E[(q^T X)(X^T q)] = q^T E[X X^T] q = q^T R q
• The m-by-m matrix R is the correlation matrix of the random vector X, formally
defined as the expectation of the outer product of the vector X with itself, as shown:
• R = E[X X^T]
• We observe that the matrix R is symmetric, which means that R^T = R.
• The projections of x onto the m (column) eigenvectors q1, q2, …, qm of R can be
collected into a single vector:
• a = [a1, a2, …, am]^T
•   = [x^T q1, x^T q2, …, x^T qm]^T
•   = Q^T x
• Where Q is the matrix which is constructed by the (column) eigenvectors of R.
• From the above we see that:
• x=Q a
• This is nothing more than a coordinate
transformation from the input space, of vector x, to the feature space of the vector a.
• From the perspective of the pattern recognition the usefulness of the PCA method is
that it provides an effective technique for dimensionality reduction.
• In particular we may reduce the number of features needed for effective data
representation by discarding those linear combinations in the previous formula that
have small variances and retain only these terms that have large variances.
• Let λ1, λ2, …, λl denote the largest l eigenvalues of R. We may then approximate the
vector x by keeping only the terms that correspond to these l eigenvalues:
x ≈ a1 q1 + a2 q2 + … + al ql
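The computation described above can be sketched directly with NumPy: form the matrix R from zero-mean data, take its eigenvectors Q, project with a = Q^T x, and keep only the components with the largest eigenvalues. The data below is synthetic and the choice of l = 2 is arbitrary.

```python
# PCA via eigendecomposition of R = E[x x^T], following the derivation above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))           # toy data: 500 observations, m = 5 features
X = X - X.mean(axis=0)                  # subtract the mean so E[X] = 0

R = (X.T @ X) / len(X)                  # sample estimate of R = E[x x^T]
eigvals, Q = np.linalg.eigh(R)          # R is symmetric, so eigh applies

# Sort eigenvectors by decreasing eigenvalue and keep the largest l of them
order = np.argsort(eigvals)[::-1]
l = 2
Q_l = Q[:, order[:l]]

A = X @ Q_l                             # a = Q^T x for every observation (projections)
X_approx = A @ Q_l.T                    # approximate reconstruction from l components
print("retained variance fraction:", eigvals[order[:l]].sum() / eigvals.sum())
```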
Definition of fuzzy
Fuzzy – “not clear, distinct, or precise; blurred”
Definition of fuzzy logic
A form of knowledge representation suitable for notions that
cannot be defined precisely, but which depend upon their
contexts.
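A tiny sketch of this idea: membership in a vague notion such as "hot" is a degree between 0 and 1 rather than a strict true/false. The temperature breakpoints below are arbitrary illustration values.

```python
# A fuzzy membership function for the vague notion "hot".
def hot_membership(temperature_c):
    """Degree (0..1) to which a temperature counts as 'hot'."""
    if temperature_c <= 20:
        return 0.0
    if temperature_c >= 35:
        return 1.0
    return (temperature_c - 20) / 15.0   # linear ramp between 20 and 35 degrees C

for t in (15, 25, 30, 40):
    print(t, "->", round(hot_membership(t), 2))
```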
Stochastic search
Stochastic search and optimization techniques are used in a vast number of areas,
including aerospace, medicine, transportation, and finance, to name but a few. Whether
the goal is refining the design of a missile or aircraft, determining the effectiveness of a
new drug, developing the most efficient timing strategies for traffic signals, or making
investment decisions in order to increase profits, stochastic algorithms can help
researchers and practitioners devise optimal solutions to countless real-world problems.
Data Stream Mining (also known as stream learning) is the process of extracting
knowledge structures from continuous, rapid data records. A data stream is an ordered
sequence of instances that in many applications of data stream mining can be read only once
or a small number of times using limited computing and storage capabilities.
In many data stream mining applications, the goal is to predict the class or value of new
instances in the data stream given some knowledge about the class membership or values of
previous instances in the data stream. Machine learning techniques can be used to learn this
prediction task from labeled examples in an automated fashion. Often, concepts from the
field of incremental learning are applied to cope with structural changes, on-line learning and
real-time demands. In many applications, especially operating within non-stationary
environments, the distribution underlying the instances or the rules underlying their labeling
may change over time, i.e. the goal of the prediction, the class to be predicted or the target
value to be predicted, may change over time. This problem is referred to as concept drift.
Detecting concept drift is a central issue to data stream mining. Other challenges that arise
when applying machine learning to streaming data include: partially and delayed labeled
data, recovery from concept drifts, and temporal dependencies.
Real-time analytics on data streams is needed to manage the data currently generated at an
ever-increasing rate by applications such as the following.
•Examples:
•Financial
•Network monitoring
•Security
•Telecommunications data management
•Web applications
•Manufacturing
•Sensor networks
•Email
•blogging
Data-streaming architectures are used to process data that's continuously produced as streams
of events over time, instead of static datasets.
▪ Compared to the traditional centralized "state of the world" databases and data warehouses,
data streaming applications work on the streams of events and on
application-specific local state that is an aggregate of the history of events. Some of
the advantages of streaming data processing are:
▪ Decreased latency from signal to decision.
▪ Unified way of handling real-time and historic data.
▪ Time travel queries.
▪ Real-time analysis of streaming data can empower you to react to events and insights as
they happen.
▪ Streaming data does not need to be discarded: data persistence pays off in a variety of ways.
With the right technologies, it’s possible to replicate streaming data to geodistributed data
centers.
▪ An effective message-passing system is much more than a queue for a real-time application:
it is the heart of an effective design for an overall big data architecture.
▪ The most disruptive idea presented here is that streaming architecture should not be limited
to specialized real-time applications.
▪ Lambda architecture is a data-processing architecture designed to handle massive quantities
of data by taking advantage of both batch- and stream-processing methods.
▪ This approach to architecture attempts to balance latency, throughput, and fault tolerance by
using batch processing to provide comprehensive and accurate views of batch data, while
simultaneously using real-time stream processing to provide views
of online data.
▪ Lambda architecture describes a system consisting of three layers: batch processing, speed
(or real-time) processing, and a serving layer for responding to queries
▪ The processing layers ingest from an immutable master copy of the entire data set.
▪ Batch layer:
▪ The batch layer precomputes results using a distributed processing system that can handle
very large quantities of data.
▪ The batch layer aims at perfect accuracy by being able to process all available data when
generating views.
▪ This means it can fix any errors by re-computing based on the complete data set, then
updating existing views.
▪ Output is typically stored in a read-only database, with updates completely replacing
existing precomputed views.
▪ Apache Hadoop is the de facto standard batch-processing system used in most
high-throughput architectures.
Figure: Flow of data through the processing and serving layers of a generic lambda architecture.
Speed layer:
▪ The speed layer processes data streams in real time and without the requirements of
fix-ups or completeness.
▪ This layer sacrifices throughput as it aims to minimize latency by providing real-time
views into the most recent data.
▪ Essentially, the speed layer is responsible for filling the "gap" caused by the batch
layer's lag in providing views based on the most recent data.
▪ This layer's views may not be as accurate or complete as the ones eventually
produced by the batch layer, but they are available almost immediately after data is
received, and can be replaced when the batch layer's views for the same data become
available.
▪ Stream-processing technologies typically used in this layer include Apache
Storm, SQLstream and Apache Spark. Output is typically stored on fast NoSQL
databases.
▪ Serving layer:
▪ Output from the batch and speed layers are stored in the serving layer, which responds to
ad-hoc queries by returning precomputed views or building views from the processed data.
▪ Examples of technologies used in the serving layer include Druid, which provides a single
cluster to handle output from both layers.
▪ Dedicated stores used in the serving layer include Apache Cassandra or Apache HBase for
speed-layer output, and Elephant DB or Cloudera Impala for batch-layer output.
▪ Criticism of lambda architecture has focused on its inherent complexity and its
limiting influence.
▪ The batch and streaming sides each require a different code base that must be
maintained and kept in sync so that processed data produces the same result from
both paths.
Major challenges
Keep track of a large universe, e.g., pairs of IP addresses, not ages
Methodology
Synopses (trade-off between accuracy and storage)
Use synopsis data structures, much smaller (O(log^k N) space) than their base
data set (O(N) space)
Compute an approximate answer within a small error range (factor ε of the
actual answer)
Major methods
Random sampling
Histograms
Sliding windows
Multi-resolution model
Sketches
Randomized algorithms
Sketches
Histograms and wavelets require multiple passes over the
data, but sketches can operate in a single pass
Frequency moments of a stream A = {a1, …, aN}, Fk:
Fk = ∑ (mi)^k, summed over i = 1, …, v
where v is the universe or domain size and mi is the frequency
of i in the sequence
Given N elements and v values, sketches can approximate F0,
F1, F2 in O(log v + log N) space
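To make the definition concrete, the snippet below computes F0, F1 and F2 exactly for a small example stream; real stream sketches approximate these values in small space instead of storing all frequencies.

```python
# Exact (non-streaming) computation of the frequency moments F0, F1, F2.
from collections import Counter

stream = ["a", "b", "a", "c", "b", "a", "d", "a"]
freq = Counter(stream)                       # mi: frequency of each value i

F0 = sum(m ** 0 for m in freq.values())      # number of distinct elements
F1 = sum(m ** 1 for m in freq.values())      # length of the stream, N
F2 = sum(m ** 2 for m in freq.values())      # "surprise number" / repeat rate
print(F0, F1, F2)                            # 4, 8, 22
```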
Randomized algorithms
Monte Carlo algorithm: bound on running time but may
not return correct result
Bloom Filters
Whenever a list or set is used and space is a consideration, a Bloom filter should be
considered. When using a Bloom filter, consider the potential effects of false positives.
• It is a randomized data structure that is used to represent a set.
• It answers membership queries.
• It can give FALSE POSITIVES while answering membership queries
(with a very small probability).
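A minimal Bloom filter sketch is shown below: k hash positions are set per inserted item, so membership queries may return false positives but never false negatives. The parameter choices (1024 bits, 3 hashes) are arbitrary illustration values.

```python
# A small Bloom filter: k hash functions set k bits per inserted item.
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k bit positions by salting a cryptographic hash with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for word in ("bread", "milk", "diaper"):
    bf.add(word)
print(bf.might_contain("milk"))    # True
print(bf.might_contain("beer"))    # False (or, rarely, a false positive)
```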
Definition
• A data stream consists of a universe of elements chosen from a set of size N.
• Maintain a count of the number of distinct items seen so far.
Let us consider a stream:
• How to estimate (in an unbiased manner) the number of distinct elements seen?
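One classic answer is a Flajolet-Martin style sketch, shown below as a rough illustration (not the only method): hash each element, track the maximum number of trailing zero bits R seen, and estimate the distinct count as roughly 2^R. A single hash function gives a very noisy estimate, so practical systems average many such sketches.

```python
# A Flajolet-Martin style estimate of the number of distinct elements.
import hashlib

def trailing_zeros(n):
    if n == 0:
        return 0
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

def estimate_distinct(stream):
    max_r = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        max_r = max(max_r, trailing_zeros(h))
    return 2 ** max_r

stream = [i % 500 for i in range(10_000)]   # 10,000 items, 500 distinct values
print(estimate_distinct(stream))            # a rough power-of-two estimate of 500
```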
3.7 RTAP
Definition
• Use of all available data and resources when they are needed.
• Consists of dynamic analysis and reporting based on data entered into a system less
than one minute before the actual time of use.
• an analytics engine that analyzes the data, correlates values and blends streams
together.
The system that receives and sends data streams and executes the application and real-time
analytics logic is called the stream processor.
In order for the real-time data to be useful, the real-time analytics applications being used
should have high availability and low response times. These applications should also feasibly
manage large amounts of data, up to terabytes. This should all be done while returning
answers to queries within seconds.
The term real-time also includes managing changing data sources -- something that may arise
as market and business factors change within a company. As a result, the real-time analytics
applications should be able to handle big data. The adoption of real-time big data analytics
can maximize business returns, reduce operational costs and introduce an era where machines
can interact over the internet of things using real-time information to make decisions on their
own.
Different technologies exist that have been designed to meet these demands, including the
growing quantities and diversity of data. Some of these new technologies are based on
specialized appliances -- such as hardware and software systems. Other technologies utilize a
special processor and memory chip combination, or a database with analytics capabilities
embedded in its design
Businesses that utilize real-time analytics greatly reduce risk throughout their company since
the system uses data to predict outcomes and suggest alternatives rather than relying on the
collection of speculations based on past events or recent scans -- as is the case with historical
data analytics. Real-time analytics provides insights into what is going on in the moment.
• Faster results. The ability to instantly classify raw data allows queries to more
efficiently collect the appropriate data and sort through it quickly. This, in turn, allows for
faster and more efficient trend prediction and decision making.
3.9 CHALLENGES
One major challenge faced in real-time analytics is the vague definition of real time and the
inconsistent requirements that result from the various interpretations of the term. As a result,
businesses must invest a significant amount of time and effort to collect specific and detailed
requirements from all stakeholders in order to agree on a specific definition of real time, what
is needed for it and what data sources should be used.
Once the company has unanimously decided on what real time means, it faces the challenge
of creating an architecture with the ability to process data at high speeds. Unfortunately, data
sources and applications can cause processing-speed requirements to vary from milliseconds
to minutes, making creation of a capable architecture difficult. Furthermore, the architecture
must also be capable of handling quick changes in data volume and should be able to scale up
as the data grows.
Finally, companies may find that their employees are resistant to the change when
implementing real-time analytics. Therefore, businesses should focus on preparing their staff
by providing appropriate training and fully communicating the reasons for the change to real-
time analytics.
Here are some examples of how enterprises are tapping into real-time analytics:
• Managing location data. Real-time analytics can be used to determine what data
sets are relevant to a particular geographic location and signal the appropriate updates.
• Real-time credit scoring. Instant updates of individuals' credit scores allow financial
institutions to immediately decide whether or not to extend the customer's credit.
• Financial trading. Real-time big data analytics is being used to support decision-
making in financial trading. Institutions use financial databases, satellite weather stations
and social media to instantaneously inform buying and selling decisions.
Example Applications
• Financial services
• Government
• E-Commerce Sites
• Insurance Industry
Steps
• Live data from Yahoo is read and processed
• Data is stored in memory (Apache Geode) for fast access
• Using the live data, a Spark MLlib application creates and trains a model
• Results of the machine learning model are pushed to other interested applications
• As data ages and starts to become cool, it is moved from Apache Geode to Apache
HAWQ and eventually lands in Apache Hadoop
• Example:
Coocurrences
• 80% of all customers purchase items X, Y and Z together.
Association rules
• 60% of all customers who purchase X and Y also buy Z.
Sequential patterns
• 60% of customers who first buy X also purchase Y within three weeks.
• It is an important data mining model studied extensively by the database and data
mining community.
• Initially used for Market Basket Analysis to find how items purchased by customers
are related.
• Given a set of transactions, find rules that will predict the occurrence of an item based
on the occurrences of other items in the transaction.
• Market-Basket transactions
•
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
{Diaper}→{Beer},
{Milk,Bread}→{Eggs,Coke},
{Beer, Bread} → {Milk},
Basket Data
Retail organizations, e.g., supermarkets, collect and store massive amounts of sales data, called
basket data.
A record consists of
the transaction date
the items bought
Or, basket data may consist of items bought by a customer over a period.
• Items frequently purchased together:
Bread ⇒ PeanutButter
90% of transactions that purchase bread and butter also purchase milk
“IF” part = antecedent
“THEN” part = consequent
“Item set” = the items (e.g., products) comprising the antecedent or consequent
• Antecedent and consequent are disjoint (i.e., have no items in common)
Antecedent: bread and butter
Consequent: milk
Confidence factor: 90%
• Itemset
• A collection of one or more items
• Example: {Milk, Bread, Diaper}
• k-itemset
• An itemset that contains k items
• Support count (σ)
• Frequency of occurrence of an itemset
• E.g. σ({Milk, Bread,Diaper}) = 2
• Support
• Fraction of transactions that contain an itemset
• E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
• An itemset whose support is greater than or equal to a minsup threshold
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
The model: data
X → Y, where X, Y ⊂ I, and X ∩Y = ∅
• An itemset is a set of items.
• E.g., X = {milk, bread, cereal} is an itemset.
• A k-itemset is an itemset with k items.
• E.g., {milk, bread, cereal} is a 3-itemset
• Support count: The support count of an itemset X, denoted by X.count, in a data set
T is the number of transactions in T that contain X. Assume T has n transactions.
Then,
support = (X ∪ Y).count / n
confidence = (X ∪ Y).count / X.count
Goal: Find all rules that satisfy the user-specified minimum support (minsup)
and minimum confidence (minconf).
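These two formulas can be checked directly on the five market-basket transactions above. The sketch below computes support and confidence for the rule {Milk, Diaper} → {Beer}.

```python
# Support and confidence for {Milk, Diaper} -> {Beer} over the example transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

X = {"Milk", "Diaper"}
Y = {"Beer"}

count_X = sum(1 for t in transactions if X <= t)          # X.count
count_XY = sum(1 for t in transactions if (X | Y) <= t)   # (X ∪ Y).count

support = count_XY / len(transactions)     # 2/5 = 0.4
confidence = count_XY / count_X            # 2/3 ≈ 0.67
print(support, confidence)
```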
Association Rule
An implication expression of the form X → Y, where X and Y are itemsets
Example:
{Milk, Diaper} → {Beer}
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
• For the minimum support, it all depends on the dataset. Usually, one may start with a
high value and then decrease the value until finding a value that generates enough
patterns.
• For the minimum confidence, it is a little bit easier because it represents the
confidence that you want in the rules. So usually, something like 60% is used. But it
also depends on the data.
• In terms of performance, when minsup is higher you will find fewer patterns and the
algorithm is faster. For minconf, when it is set higher, there will be fewer patterns but it
may not be faster because many algorithms don't use minconf to prune the search
space. So obviously, setting these parameters also depends on how many rules you
want.
4.9 Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
if an itemset is frequent, each of its subsets is frequent as well.
This property belongs to a special category of properties called
antimonotonicity in the sense that if a set cannot pass a test, all of its supersets
will fail the same test as well.
2. Rule Generation
– Generate high confidence rules from each frequent itemset, where each
rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is still computationally expensive
• For every itemset Ik, generate all itemsets Ik+1 s.t. Ik ⊂ Ik+1
• Scan all transactions and compute supp(Ik+1) for all itemsets Ik+1
• Drop itemsets Ik+1 with support < minsupp
• Until no new frequent itemsets are found
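A compact sketch of this loop is shown below: candidates are generated by joining frequent k-itemsets, their support is counted with a scan, and those below minsup are dropped. It is a bare-bones illustration and omits the subset-based pruning optimizations of the full Apriori algorithm.

```python
# A minimal Apriori-style frequent itemset generator.
def apriori(transactions, minsup):
    n = len(transactions)
    items = {item for t in transactions for item in t}

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
    all_frequent = list(frequent)
    k = 1
    while frequent:
        # Join step: form candidate (k+1)-itemsets from pairs of frequent k-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Scan step: keep only candidates that meet minimum support
        frequent = [c for c in candidates if support(c) >= minsup]
        all_frequent.extend(frequent)
        k += 1
    return all_frequent

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
for itemset in apriori(transactions, minsup=0.6):
    print(set(itemset))
```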
Association Rules
Finally, construct all rules X → Y s.t.
• XY has high support
• Supp(XY)/Supp(X) > min-confidence
4.12 LIMITATIONS
• Apriori algorithm can be very slow and the bottleneck is candidate generation.
• For example, if the transaction DB has 10^4 frequent 1-itemsets, they will generate 10^7
candidate 2-itemsets even after employing the downward closure.
• To compute those with support more than minsup, the database needs to be scanned at every
level. It needs (n + 1) scans, where n is the length of the longest pattern.
4.13 METHODS TO IMPROVE APRIORI’S EFFICIENCY
• Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is
below the threshold cannot be frequent
• Transaction reduction: A transaction that does not contain any frequent k-itemset is useless
in subsequent scans
• Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one
of the partitions of DB.
• Sampling: mining on a subset of given data, lower support threshold + a method to
determine the completeness
• Dynamic itemset counting: add new candidate itemsets only when all of their subsets are
estimated to be frequent.
4.14 WHAT IS CLUSTER ANALYSIS?
•Cluster: a collection of data objects
•Similar to one another within the same cluster
•Typical applications
•As a stand-alone tool to get insight into data distribution
• Model-based: A model is hypothesized for each of the clusters and the idea is to find
the best fit of that model to the data
• Partitional Clustering
• A division data objects into non-overlapping subsets (clusters) such that each
data object is in exactly one subset
• Hierarchical clustering
1. When the number of data points is small, the initial grouping will determine the clusters
significantly.
2. The number of clusters, K, must be determined beforehand. Another disadvantage is that it
does not yield the same result with each run, since the resulting clusters depend on the initial
random assignments.
3. We never know the real clusters: using the same data, if it is input in a different
order it may produce different clusters when the number of data points is small.
4. It is sensitive to the initial conditions. Different initial conditions may produce different
clustering results, and the algorithm may be trapped in a local optimum.
• It is relatively efficient and fast. It computes the result in O(tkn), where n is the number of
objects or points, k is the number of clusters and t is the number of iterations.
• k-means clustering can be applied to machine learning or data mining
• Used on acoustic data in speech understanding to convert waveforms into one of k
categories (known as Vector Quantization or Image Segmentation).
• Also used for choosing color palettes on old fashioned graphical display devices and Image
Quantization.
4.24 CONCLUSION
• K-means algorithm is useful for undirected knowledge discovery and is relatively simple.
• K-means has found wide spread usage in lot of fields, ranging from unsupervised learning
of neural network, Pattern recognitions, Classification analysis, Artificial intelligence, image
processing, machine vision, and many others.
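As a concrete complement to the points above, here is a short scikit-learn sketch of k-means on synthetic 2-D points (assuming scikit-learn is available), showing the usual fit-then-assign workflow.

```python
# k-means on synthetic 2-D points with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three synthetic blobs around different centres
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print("cluster centres:\n", kmeans.cluster_centers_)
print("label of a new point:", kmeans.predict([[4.8, 5.1]])[0])
```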
• The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming
models.
• It was released by the Apache Software Foundation in 2011.
• Written in JAVA.
o Hadoop is open source software.
o Framework
o Massive Storage
o Processing Power
• Big data is a term used to define the very large amounts of unstructured and
semi-structured data a company creates.
• The term is used when talking about petabytes and exabytes of data.
• That much data would take too much time and cost too much to load into a relational
database for analysis.
• Facebook has almost 10 billion photos, taking up about 1 petabyte of storage.
So what is the problem??
1. Processing that large data is very difficult in a relational database.
2. It would take too much time and cost too much to process the data.
Hadoop addresses this by tying many small and reasonably priced machines together into a
single cost-effective computer cluster.
5.3 MAPREDUCE
MapReduce is a programming model for processing and generating large data sets with a
parallel, distributed algorithm on a cluster.
It is an associated implementation for processing and generating large data sets.
A MAP function processes a key/value pair to generate a set of intermediate key/value pairs.
A REDUCE function merges all intermediate values associated with the same
intermediate key.
• Now that we have described how Hadoop stores data, lets turn our attention to how it
processes data
• We typically process data in Hadoop using MapReduce
• MapReduce is not a language, it’s a programming model
• MapReduce is a method for distributing a task across multiple nodes. Each node
processes data stored on that node.
• MapReduce consists of two functions:
map (K1, V1) -> (K2, V2)
reduce (K2, list(V2)) -> list(K3, V3)
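The classic word-count example, written here as a Python sketch in the Hadoop Streaming style, shows these two signatures in action; in a real cluster the mapper and reducer would run as separate tasks on different nodes, while here they are simply chained in-process to show the flow of (K, V) pairs.

```python
# Word count in the MapReduce style: the mapper emits (word, 1) pairs and the
# reducer sums the values for each word.
from itertools import groupby

def mapper(lines):
    # map(K1, V1) -> (K2, V2): for each input line, emit (word, 1)
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    # reduce(K2, list(V2)) -> (K3, V3): sum the counts for each word
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "the quick dog"]
    for word, count in reducer(mapper(sample)):
        print(word, count)
```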
Terminology
• The client program submits a job to Hadoop.
• The job consists of a mapper, a reducer, and a list of inputs.
• The job is sent to the JobTracker process on the Master Node.
• Each Slave Node runs a process called the TaskTracker.
• The JobTracker instructs TaskTrackers to run and monitor tasks.
• A Map or Reduce over a piece of data is a single task.
• A task attempt is an instance of a task running on a slave node.
MapReduce Failure Recovery
Example:
The NameNode holds metadata for the two files:
• Foo.txt (300MB) and Bar.txt (200MB)
• Assume HDFS is configured for 128MB blocks
The DataNodes hold the actual blocks:
• Each block is 128MB in size
• Each block is replicated three times on the cluster
• Block reports are periodically sent to the NameNode
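A quick arithmetic check of this example, assuming the 128MB block size above and the default replication factor of three:

```python
# Block counts for the Foo.txt / Bar.txt example.
import math

block_mb, replication = 128, 3
for name, size_mb in (("Foo.txt", 300), ("Bar.txt", 200)):
    blocks = math.ceil(size_mb / block_mb)
    print(f"{name}: {blocks} blocks, {blocks * replication} block replicas on the cluster")
# Foo.txt: 3 blocks (128 + 128 + 44 MB) -> 9 replicas
# Bar.txt: 2 blocks (128 + 72 MB)       -> 6 replicas
```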
HDFS Architecture
Role of NameNode
The namenode is the commodity hardware that contains the GNU/Linux operating
system and the namenode software.
It is a software that can be run on commodity hardware.
The system having the namenode acts as the master server.
‒ It stores all metadata: filenames, locations of each block on Data Nodes, file
attributes, etc...
‒ Block and Replica management
‒ Health of Data Nodes through block reports
‒ Keeps metadata in RAM for fast lookup
‒ Regulates client’s access to files.
‒ It also executes file system operations such as renaming, closing, and opening
files and directories
Functionalities of NameNode
Running on a single machine, the NameNode daemon determines and tracks where
the various blocks of a data file are stored.
If a client application wants to access a particular file stored in HDFS, the
application contacts the NameNode.
NameNode provides the application with the locations of the various blocks for
that file.
For performance reasons, the NameNode resides in a machine’s memory.
Because the NameNode is critical to the operation of HDFS, any unavailability or
corruption of the NameNode results in a data unavailability event on the cluster.
Thus, the NameNode is viewed as a single point of failure in the Hadoop
environment.
To minimize the chance of a NameNode failure and to improve performance, the
NameNode is typically run on a dedicated machine.
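As a hedged illustration of the client/NameNode interaction described above, the sketch below uses the Hadoop FileSystem Java API to read a file from HDFS. The NameNode URI (hdfs://namenode-host:8020) and the file path /data/Foo.txt are placeholders, not values from these notes.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS tells the client where the NameNode is (hypothetical host/port).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/Foo.txt"))))) {
            // open() asks the NameNode for the block locations of the file;
            // the bytes themselves are then streamed from the DataNodes holding those blocks.
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}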
Role of DataNode
A DataNode is commodity hardware running the GNU/Linux operating system and the
DataNode software.
For every node (commodity hardware/system) in a cluster, there is a DataNode.
‒ The DataNode daemon manages the data stored on each machine.
‒ It stores file contents as blocks.
‒ Different blocks of the same file are stored on different DataNodes.
‒ The same block is replicated across several DataNodes for redundancy.
Functionalities of DataNode
• DataNodes perform read-write operations on the file system, as per client requests.
• They also perform operations such as block creation, deletion, and replication
according to the instructions of the NameNode.
• Each DataNode periodically builds a report about the blocks stored on that
DataNode and sends the report to the NameNode.
Role of Secondary NameNode
• The Secondary NameNode performs housekeeping tasks on behalf of the NameNode.
Such tasks include updating the file system image with the contents of the file system
edit logs.
• In the event of a NameNode outage, the NameNode must be restarted and initialized
with the last file system image file and the contents of the edit logs.
• The Secondary NameNode periodically combines a prior file system snapshot and the
edit log into a new snapshot. The new snapshot is sent back to the NameNode.
NameNode Failure
• Losing the NameNode is equivalent to losing all the files on the file system.
• The files that make up the persistent state of the file system are therefore backed up
(to local disk or an NFS mount).
DataNode Failure
• The NameNode finds other copies of the 'lost' blocks and re-replicates them to other
nodes.
HIVE
• Hive is a data warehouse system designed for managing and querying only structured data
that is stored in tables.
• While dealing with structured data, MapReduce itself does not provide optimization and
usability features, but the Hive framework does.
• Query optimization refers to an effective way of executing a query in terms of
performance.
• Hive's SQL-inspired language separates the user from the complexity of MapReduce
programming.
• It reuses familiar concepts from the relational database world, such as tables, rows,
columns, and schemas, for ease of learning.
• Hive can use directory structures to "partition" data to improve performance on
certain queries.
• Hive follows a "schema on read" model. Row-level updates and modifications traditionally
did not work, because a Hive query in a typical cluster runs on multiple DataNodes, and it
was not possible to update and modify data across multiple nodes. In the latest Hive
versions, however, a table can be updated after data has been inserted.
Hive Components
Two main components:
• High-level language (HiveQL)
• Set of commands
Two execution modes:
• Local: reads/writes to the local file system
• MapReduce: connects to a Hadoop cluster and reads/writes to HDFS
Two modes of use:
• Interactive mode (console)
• Batch mode (submit a script)
Hive deals with structured data.
• Hive Data Models:
Partitions:
• Partitioning means dividing a table into coarse-grained parts based on the value
of a partition column, such as a date. This makes it faster to run queries on
slices of the data.
• The partition keys determine how the data is stored: each unique value of a
partition key defines a partition of the table. Partitions are often named after
dates for convenience. It is similar in spirit to block splitting in HDFS.
Buckets:
• Buckets give extra structure to the data that may be used for more efficient queries.
• Data is split based on the hash of a column, mainly for parallelism.
• Data in each partition may in turn be divided into buckets based on the value
of a hash function of some column of the table (a sketch of declaring partitions
and buckets follows this list).
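As a rough sketch of how partitions and buckets are declared, the Java snippet below uses the Hive JDBC driver to issue HiveQL statements. The table name page_views, its columns, the HiveServer2 URL, and the credentials are all assumptions made for illustration; the Hive JDBC driver (hive-jdbc) must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HivePartitionBucketExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint; adjust host, port, database, and credentials.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {

            // Partition by date ('dt') so queries on one day scan only that directory;
            // bucket by user_id so rows are hashed into a fixed number of files.
            stmt.execute("CREATE TABLE IF NOT EXISTS page_views ("
                    + " user_id INT, url STRING)"
                    + " PARTITIONED BY (dt STRING)"
                    + " CLUSTERED BY (user_id) INTO 32 BUCKETS"
                    + " STORED AS ORC");

            // A filter on the partition column lets Hive prune all other partitions.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT COUNT(*) FROM page_views WHERE dt = '2021-01-01'")) {
                if (rs.next()) {
                    System.out.println("rows on 2021-01-01: " + rs.getLong(1));
                }
            }
        }
    }
}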
Hive Clients
• Hive provides different drivers for communication with different types of applications.
For Thrift-based applications, it provides a Thrift client for communication.
• These clients and drivers in turn communicate with the Hive server in the Hive
services.
Hive Services
• Client interactions with Hive are performed through Hive services.
• If the client wants to perform any query-related operations in Hive, it has to
communicate through Hive services.
• The CLI (command-line interface) acts as a Hive service for DDL (Data Definition
Language) operations.
• All drivers communicate with the Hive server and then with the main driver in the Hive
services.
• The driver present in the Hive services is the main driver; it communicates with all
types of client-specific applications, such as Thrift, JDBC, and ODBC applications.
• The driver passes those requests from the different applications on to the metastore
and file systems for further processing.
• Query results and the data loaded into tables are stored on the Hadoop cluster in
HDFS.
PIG
• Developed by Yahoo!
• Open-source language
Pig Engine
• Pig provides an execution engine on top of Hadoop.
Pig Components
Two main components:
• High-level language (Pig Latin)
• Set of commands
Two execution modes:
• Local: reads/writes to the local file system
• MapReduce: connects to a Hadoop cluster and reads/writes to HDFS
Two modes of use:
• Interactive mode (console)
• Batch mode (submit a script)
Pig Latin provides:
• Operators such as LOAD, FILTER, FOREACH ... GENERATE, GROUP BY, STORE, JOIN, DISTINCT,
ORDER BY, ...
• Aggregations
• Schemas
• UDFs (user-defined functions)
5.13 HBASE
• HBase is a distributed, column-oriented data store built on top of HDFS.
• HBase is an Apache open-source project whose goal is to provide storage for Hadoop
distributed computing.
• It is a part of the Hadoop ecosystem that provides random, real-time read/write access to
data in the Hadoop file system.
• Data is logically organized into tables, rows, and columns.
Difference between Hive and HBase
• Hive and HBase are two different Hadoop-based technologies:
• Hive is an SQL-like engine that runs MapReduce jobs, and
• HBase is a NoSQL key/value database on Hadoop.
• Just as Google can be used for search and Facebook for social networking, Hive can be
used for analytical queries while HBase is used for real-time querying.
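A minimal sketch of HBase's random, real-time read/write access using the standard HBase Java client: the table name "users", the column family "info", and the row key are hypothetical, the table is assumed to already exist, and the cluster location is read from hbase-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWriteExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath to locate ZooKeeper and the cluster.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "u1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("u1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random real-time read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("u1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}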
NoSQL Categories
There are four general types (most common categories) of NoSQL databases. Each of these
categories has its own specific attributes and limitations.
There is no single solution that is better than all the others; however, some databases
are better suited to solving specific problems.
To clarify the NoSQL landscape, let's discuss the most common categories:
• Key-value stores
• Column-oriented
• Graph
• Document oriented
Visual analytics is more than just visualization. It can rather be seen as an integral approach
combining visualization, human factors, and data analysis. Visualization and visual analytics
both integrate methodology from information analytics, geospatial analytics, and scientific
analytics.
5.19 Visual Analytics Process
The visual analytics process is a combination of automatic and visual analysis methods with a
tight coupling through human interaction in order to gain knowledge from data.
In many visual analytics scenarios, heterogeneous data sources need to be integrated before
visual or automatic analysis methods can be applied.
Therefore, the first step is often to preprocess and transform the data in order to extract
meaningful units of data for further processing. Typical preprocessing tasks are data cleaning,
normalization, grouping, or integration of heterogeneous data into a common schema.
Continuing with this meaningful data, the analyst can choose between visual and automatic
analysis methods. After mapping the data, the analyst may obtain the desired knowledge
directly, but it is more likely that an initial visualization alone is not sufficient for the
analysis.
In contrast to traditional information visualization, findings from the visualization can be
reused to build a model for automatic analysis.
Once a model is created the analyst has the ability to interact with the automatic methods by
modifying parameters or selecting other types of analysis algorithms.
Model visualization can then be used to verify the findings of these models. Alternating
between visual and automatic methods is characteristic of the visual analytics process and
leads to continuous refinement and verification of preliminary results.
Misleading results in an intermediate step can thus be discovered at an early stage, which
leads to more confidence in the final results.
5.20 DATA VISUALIZATION METHODS
Many conventional data visualization methods are often used. They are: table, histogram,
scatter plot, line chart, bar chart, pie chart, area chart, flow chart, bubble chart, multiple data
series or combination of charts, time line, Venn diagram, data flow diagram, and entity
relationship diagram, etc.
In addition, some data visualization methods are used that are less well known than the
methods above: parallel coordinates, treemaps, cone trees, and semantic networks.
• Parallel coordinates are used to plot individual data elements across many dimensions.
Parallel coordinates are very useful for displaying multidimensional data.
• A treemap is an effective method for visualizing hierarchies. The size of each sub-rectangle
represents one measure, while color is often used to represent another measure of the data.
• A cone tree is another method for displaying hierarchical data, such as an organizational
structure, in three dimensions. The branches grow in the form of a cone.
• A semantic network is a graphical representation of the logical relationships between
different concepts. It generates a directed graph: a combination of nodes or vertices, edges
or arcs, and a label over each edge.
Visualizations are not only static; they can be interactive. Interactive
visualization can be performed through approaches such as zooming (zoom in and zoom out),
overview and detail, zoom and pan, and focus and context (fish-eye).
The steps for interactive visualization are as follows
1. Selecting: Interactive selection of data entities or subset or part of whole data or whole
data set according to the user interest.
2. Linking: It is useful for relating information among multiple views.
3. Filtering: It helps users adjust the amount of information for display. It decreases
information quantity and focuses on information of interest.
4. Rearranging or Remapping: Because the spatial layout is the most important visual
mapping, rearranging the spatial layout of the information is very effective in producing
different insights.
Big data visualization can be performed through a number of approaches, such as more than
one view per representation display, dynamic changes in the number of factors, and filtering
(dynamic query filters, star-field display, and tight coupling), etc.
Several visualization methods were analyzed and classified according to data criteria: (1)
large data volume, (2) data variety, and (3) data dynamics.
• Treemap: It is based on a space-filling visualization of hierarchical data.
• Circle Packing: It is a direct alternative to the treemap; its primitive shape is the circle,
and circles can in turn be nested inside circles from a higher hierarchy level.
• Sunburst: It uses the treemap visualization converted to a polar coordinate system. The
main difference is that the variable parameters are not width and height, but radius and arc
length.
• Parallel Coordinates: It allows visual analysis to be extended with multiple data factors for
different objects.
• Streamgraph: It is a type of stacked area graph that is displaced around a central axis,
resulting in a flowing, organic shape.
• Circular Network Diagram: Data objects are placed around a circle and linked by curves
according to how closely related they are. Line width or color saturation is usually
used to indicate the strength of the relationship.
7.2MCQ’s of Unit 2
1. Which of the following is true about regression analysis?
a. answering yes/no questions about the data
b. estimating numerical characteristics of the data
c. modeling relationships within the data
d. describing associations within the data
2. What is a hypothesis?
1. A statement that the researcher wants to test through the data collected in a study.
2. A research question the results will answer.
3. A theory that underpins the study.
4. A statistical method for calculating the extent to which the results could have
happened by chance.
3. What is the cyclical process of collecting and analysing data during a single research study
called?
1. Interim Analysis
2. Inter analysis
3. inter item analysis
4. constant analysis
4. Which of the following is not a major data analysis approach?
1. Data Mining
2. Predictive Intelligence
3. Business Intelligence
4. Text Analytics
5. The Process of describing the data that is huge and complex to store and process is known
as
1. Analytics
2. Data mining
3. Big data
4. Data warehouse
7. Which of the following is a widely used and effective machine learning algorithm based on
the idea of bagging?
A. Decision Tree
B. Regression
C. Classification
D. Random Forest
8. PCA is a ________.
A. Eigen value
B. Eigen vector
C. Linear value
D. None of these
10. __________ is a dimensionality reduction technique commonly used for supervised
classification problems.
A. Value analysis
B. Function Analysis
C. Pure analysis
D. None of these
11.The predictions for generative learning algorithms are made using _______ .
A. Naive Theorem
B. Bayes Theorem
C. Naive Bayes Theorem
D. None of these
12. ________is an important factor in predictive modeling
A. Dimensionality Reduction
B. feature selection
C. feature extraction
D. None of these
13 . What would you do in PCA to get the same projection as SVD?
A. transform data to zero mean
B. transform data to zero median
C. not possible
D. none of these
14. What is the full form of BN in Neural Networks?
A. Bayesian Networks
B. Belief Networks
C. Bayes Nets
D. All of the above
7.3 MCQ's of Unit 3
3. In filtering streams, we ____________
a. Accept those tuples in the stream that meet a criterion.
b. Accept data in the stream that meet a criterion.
c. Accept those class in the stream that meet a criterion.
d. Accept rows in the stream that meet a criterion
7.4MCQ’s of Unit 4
1. The number of iterations in Apriori ________
1. increases with the size of the data
2. decreases with the increase in size of the data
3. increases with the size of the maximum frequent set
4. decreases with increase in size of the maximum frequent set
2. Which of the following are interestingness measures for association rules?
1. Recall
2. Lift
3. Accuracy
4. All of Above
3. _______ is an example of case-based learning
1. Decision trees
2. Neural networks
3. Genetic algorithm
4. K-nearest neighbor
4. Which of the following is finally produced by Hierarchical Clustering?
a) final estimate of cluster centroids
b) tree showing how close things are to each other
c) assignment of each point to clusters
d) all of the mentioned
5. Which of the following is required by K-means clustering?
a) defined distance metric
b) number of clusters
c) initial guess as to cluster centroids
d) all of the mentioned
6._______is the task of dividing the population or data points into a number of groups.
A) Unsupervised learning
B) clustering
C) semi supervised
D) classification
9. In a _____ model, we fit the data based on the probability that it belongs to
the same distribution.
A) Centroid based methods
B) distribution based model
C) Connectivity based methods
D) None of these
10._______is basically a type of unsupervised learning method
A) Unsupervised learning
B) clustering
C) semi supervised
D) classification
7.5 MCQ’s of Unit 5
1.What license is Hadoop distributed under?
a) Apache License 2.0
b) Mozilla Public License
c) Shareware
d) Commercial
2. Which of the following platforms does Hadoop run on?
a) Bare metal
b) Debian
c) Cross-platform
d) Unix-like
3. IBM and ________ have announced a major initiative to use Hadoop to support university
courses in distributed computer programming.
a) Google Latitude
b) Android (operating system)
c) Google Variations
d) Google
4. HDFS works in a __________ fashion.
a) master-slave
b) worker/slave
c) None of the above
d) all of the mentioned
5. Which of the following scenario may not be a good fit for HDFS?
a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file
b) HDFS is suitable for storing data related to applications requiring low latency data access
c) HDFS is not suitable for storing data related to applications requiring low latency data access
d) None of the mentioned
6. The need for data replication can arise in various scenarios like ____________
a) Replication Factor is changed
b) DataNode goes down
c) Data Blocks get corrupted
d) All of the mentioned
7. The minimum amount of data that HDFS can read or write is called a _____________.
a) Datanode
b) Namenode
c) Block
d) None of the above
10. Which of the following is the most popular high-level Java API in Hadoop Ecosystem?
a) Cascading
b) Scalding
c) Hcatalog
d) Cascalog
B TECH (SEM-V) THEORY EXAMINATION 2020-21
DATA ANALYTICS
Time: 3 Hours                                             Total Marks: 100
Note: 1. Attempt all Sections. If you require any missing data, then choose suitably.
SECTION A
1. Attempt all questions in brief. 2 x 10 = 20