DA Full
Strictly for internal circulation (within KIIT) and reference only. Not for outside circulation without permission
Business · Data Science · Data Analytics · Real-time · Job Market Usability
CO1 (Introduction to Big Data): Understand and classify the characteristics, concepts and principles of big data.
CO2 (Data Analysis): Apply the data analytics techniques and models.
CO3 (Mining Data Streams): Implement and analyze the data analysis techniques for mining data streams.
CO4 (Frequent Itemsets and Clustering): Examine the techniques of clustering and frequent itemsets.
CO5 (Frameworks and Visualization): Analyze and evaluate the framework and visualization for big data analytics.
CO6 (Applications of all units): Formulate the concepts, principles and techniques focusing on the applications to industry and real world experience.
Prerequisites
NIL
Textbook
Data Analytics, Radha Shankarmani, M. Vijayalaxmi, Wiley India Private Limited,
ISBN: 9788126560639.
Reference Books
Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and
Presenting Data by EMC Education Services (Editor), Wiley, 2014
Bill Franks, Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data
Streams with Advanced Analytics, John Wiley & Sons, 2012.
Glenn J. Myatt, Making Sense of Data, John Wiley & Sons, 2007.
Pete Warden, Big Data Glossary, O'Reilly, 2011.
Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Second
Edition, Elsevier, Reprinted 2008.
Stephan Kudyba, Thomas H. Davenport, Big Data, Mining, and Analytics, Components
of Strategic Decision Making, CRC Press, Taylor & Francis Group. 2014
Big Data, Black Book, DT Editorial Services, Dreamtech Press, 2015
Data
Human-readable refers to information that only humans can interpret and study,
such as an image or the meaning of a block of text. If it requires a person to
interpret it, that information is human-readable.
Machine-readable refers to information that computer programs can process. A
program is a set of instructions for manipulating data. Such data can be
automatically read and processed by a computer, such as CSV, JSON, XML, etc.
Non-digital material (for example, printed or hand-written documents) is by its non-digital
nature not machine-readable. But even digital material need not be machine-readable.
For example, a PDF document containing tables of data is definitely digital but is not
machine-readable, because a computer would struggle to access the tabular information,
even though it is very human-readable. The equivalent tables in a format such as a
spreadsheet would be machine-readable. As another example, scans (photographs) of text
are not machine-readable (but are human-readable!), whereas the equivalent text in a
format such as a simple ASCII text file is machine-readable and processable.
It is defined as data that has a defined repeating pattern, and this pattern makes it
easier for any program to sort, read, and process the data.
This data is in an organized form (e.g., in rows and columns) and can be easily
used by a computer program.
Relationships exist between entities of the data.
Structured data:
Organize data in a pre-defined format
Is stored in a tabular form
Is the data that resides in fixed fields within a record or file
Is formatted data that has entities and their attributes mapped
Is used to query and report against predetermined data types
Sources of structured data: relational databases, multidimensional databases, flat files, legacy databases.
Ease with Structured Data
Sources of semi-structured data: web data in the form of cookies, XML, JSON, and other markup languages.
Characteristics of semi-structured data:
Inconsistent structure
Self-describing (label/value pairs)
Schema information is blended with the data values
Unstructured data is a set of data that might or might not have any logical or
repeating patterns and is not recognized in a pre-defined manner.
About 80 percent of enterprise data consists of unstructured content.
Unstructured data:
Typically consists of metadata, i.e., additional information related to the data.
Comprises inconsistent data, such as data obtained from files, social media
websites, satellites, etc.
Consists of data in different formats such as e-mails, text, audio, video, or
images.
Sources of unstructured data: body of emails; chats and text messages; text both internal and external to the organization; mobile data; social media data; images, audio, and videos.
Challenges associated with Unstructured Data
Working with unstructured data poses certain challenges, which are as follows:
Identifying the unstructured data that can be processed
Sorting, organizing, and arranging unstructured data in different sets and
formats
Combining and linking unstructured data in a more structured format to derive
any logical conclusions out of the available information
Costing, in terms of the storage space and human resources needed to deal with the
exponential growth of unstructured data
Data Analysis of Unstructured Data
The complexity of unstructured data lies within the language that created it. Human
language is quite different from the language used by machines, which prefer
structured information. Unstructured data analysis refers to the process of
analyzing data objects that don't follow a predefined data model and/or are
unorganized. It is the analysis of any data that is stored over time within an
organizational data repository without any intent for its orchestration, pattern, or
categorization.
Think of the following spectrum (refer to the Appendix for data volumes): Structured Data → Semi-structured Data → Unstructured Data → Big Data, with more data at each step.
The main challenge in the traditional approach is for computing systems to manage
'Big Data' because of the immense speed and volume at which it is generated. Some of
the challenges are:
The traditional approach cannot work on unstructured data efficiently
The traditional approach is built on top of the relational data model; relationships
between the subjects of interest are created inside the system and the
analysis is done based on them. This approach is not adequate for big data
The traditional approach is batch oriented, and one needs to wait for nightly ETL
(extract, transform and load) and transformation jobs to complete before
the required insight is obtained
Traditional data management, warehousing, and analysis systems fail to
analyze this type of data. Due to its complexity, big data is processed with
parallelism. Parallelism in a traditional system is achieved through costly
hardware like MPP (Massively Parallel Processing) systems
Inadequate support for aggregated summaries of data
Process challenges
Capturing Data
Aligning data from different sources
Transforming data into suitable form for data analysis
Modeling data (mathematically, via simulation)
Management Challenges:
Security
Privacy
Governance
Ethical issues
Web Data
Hadoop is an open-source framework that allows one to store and process big data in a
distributed environment across clusters of computers using simple programming
models.
It is designed to scale up from single servers to thousands of machines, each offering
local computation and storage.
It provides massive storage for any kind of data, enormous processing power and
the ability to handle virtually limitless concurrent tasks or jobs.
Importance:
Ability to store and process huge amounts of any kind of data, quickly.
Computing power: Its distributed computing model processes big data fast.
Fault tolerance: Data and application processing are protected against
hardware failure.
Flexibility: Unlike traditional relational databases, data does not require
preprocessing before it is stored.
Low cost: The open-source framework is free and uses commodity hardware to
store large quantities of data.
Scalability: System can easily grow to handle more data simply by adding
nodes. Little administration is required.
Evolution of Analytics Scalability
In the traditional architecture, data is extracted from Database 1, Database 2, Database 3, … Database n into a separate analytic server, where the processing happens. In the in-database architecture, the analytic server only submits a request, and the databases consolidate and process the data themselves.
In an in-database environment, the processing stays in the database where the data
has been consolidated. The user’s machine just submits the request; it doesn’t do
heavy lifting.
Massively parallel processing (MPP) database systems are the most mature, proven, and
widely deployed mechanism for storing and analyzing large amounts of data. An MPP
database spreads data out into independent pieces managed by independent storage
and central processing unit (CPU) resources. Conceptually, it is like having pieces of
data loaded onto multiple network connected personal computers around a house.
The data in an MPP system gets split across a variety of disks managed by a variety of
CPUs spread across a number of servers.
For example, a one-terabyte table is split into 100-gigabyte chunks. An MPP system breaks the job into pieces and allows the different sets of CPU and disk to run the process concurrently, turning a single-threaded process into a parallel one.
OLTP vs. MPP vs. Hadoop
OLTP vs. MPP:
Examples — OLTP: Oracle, DB2, SQL Server, etc.; MPP: Netezza, Teradata, Vertica, etc.
OLTP needs to read data from disk into memory before processing starts, so in-memory calculation is very fast; MPP takes the processing as close as possible to the data, so there is less data movement.
OLTP is good for smaller transaction operations and maintains a very high level of data integrity; MPP is good for batch processing, and some MPP systems (Netezza, Vertica) relax integrity constraints such as enforcing unique keys for the sake of batch performance.
MPP vs. Hadoop:
MPP stores data in a mature internal structure, so data loading and data processing are efficient; Hadoop has no such structured architecture for stored data, so accessing and loading data is not as efficient as in conventional MPP systems.
MPP supports only relational models; Hadoop supports virtually any kind of data.
However, the main objective of MPP and Hadoop is the same: process data in parallel, near the storage.
Only Hadoop:
All data is heavily unstructured (documents, audio, video, etc.)
Need to process in batch
Fault tolerance refers to the ability of a system (computer, network, cloud cluster,
etc.) to continue operating without interruption when one or more of its
components fail.
The objective of creating a fault-tolerant system is to prevent disruptions arising
from a single point of failure, ensuring the high availability and business continuity
of mission-critical applications or systems.
Fault-tolerant systems use backup components that automatically take the place of
failed components, ensuring no loss of service. These include:
Hardware systems that are backed up by identical or equivalent systems. For
example, a server can be made fault tolerant by using an identical server
running in parallel, with all operations mirrored to the backup server.
Software systems that are backed up by other software instances. For example,
a database with customer information can be continuously replicated to
another machine. If the primary database goes down, operations can be
automatically redirected to the second database.
Power sources that are made fault tolerant using alternative sources. For
example, many organizations have power generators that can take over in case
main line electricity fails.
Analytic Processes and Tools
Points to cover
Spreadsheets and Analytics Tool
Analytics Engine
CRM and Online Marketing Solutions
Reporting uses data to track the performance of your business, while an analysis
uses data to answer strategic questions about your business. Though they are
distinct, reporting and analysis rely on each other. Reporting sheds light on what
questions to ask, and an analysis attempts to answer those questions.
Simply put,
Data Reporting Reveals The Right Questions.
Data Analysis Helps Find Answers.
Approach — Explanation
Descriptive: What's happening in my business? Comprehensive, accurate and historical data; effective visualisation.
Diagnostic: Why is it happening? Ability to drill down to the root cause; ability to isolate all confounding information.
Predictive: What's likely to happen? Decisions are automated using algorithms and technology; historical patterns are used to predict specific outcomes using algorithms.
Prescriptive: What do I need to do? Recommended actions and strategies based on champion/challenger strategy outcomes; applying advanced analytical algorithms to make specific recommendations.
Mapping of Big Data's Vs to Analytics Focus
Historical data can be quite large. There might be a need to process a huge amount of data many times a
day as it gets updated continuously; therefore, volume is mapped to history. Variety is pervasive:
input data, insights, and decisions can span a variety of forms, hence it is mapped to all three. High
velocity data might have to be processed to help real-time decision making and plays across
descriptive, predictive, and prescriptive analytics when they deal with present data. Predictive and
prescriptive analytics create data about the future; that data is uncertain by nature and its veracity
is in doubt. Therefore, veracity is mapped to prescriptive and predictive analytics when they deal with the
future.
Big Data Analytics
Big data analytics is the process of extracting useful information by analysing different
types of big data sets. It is used to discover hidden patterns, outliers, trends,
unknown correlations, and other useful information for the benefit of faster decision making.
Big Data has applications in different industries.
Big Data Analytics isn't:
'One-size-fits-all' traditional RDBMS built on shared disk and memory
Only used by huge online companies
Meant to replace the data warehouse
Detailed Lessons
Introduction to Data, Big Data Characteristics, Types of Big Data, Challenges of
Traditional Systems, Web Data, Evolution of Analytic Scalability, OLTP, MPP, Grid
Computing, Cloud Computing, Fault Tolerance, Analytic Processes and Tools, Analysis
Versus Reporting, Statistical Concepts, Types of Analytics.
Data Mining: Data mining is the process of looking for hidden, valid, and
potentially useful patterns in huge data sets. Data Mining is all about
discovering unsuspected/previously unknown relationships amongst the
data. It is a multi-disciplinary skill that uses machine learning, statistics,
AI and database technology.
Natural Language Processing (NLP): NLP gives the machines the ability
to read, understand and derive meaning from human languages.
Text Analytics (TA): TA is the process of extracting meaning out of text.
For example, this can be analyzing text written by customers in a
customer survey, with the focus on finding common themes and trends.
The idea is to be able to examine the customer feedback to inform the
business on taking strategic action, in order to improve customer
experience.
Noisy text analytics: It is a process of information extraction whose goal
is to automatically extract structured or semi-structured information from
noisy unstructured text data.
Appendix cont…
One of the fundamental tasks in data analysis is to find how different variables
are related to each other, and one of the central tools for learning about such
relationships is regression.
Let's take a simple example: suppose your manager asked you to predict annual
sales. There can be factors (drivers) that affect sales, such as competitive
pricing, product quality, shipping time and cost, online reviews, easy return policy,
loyalty rewards, word-of-mouth recommendations, ease of checkout, etc. In this
case, sales is your dependent variable, and the factors affecting sales are independent
variables.
Regression analysis would help to solve this problem. In simple words,
regression analysis is used to model the relationship between a dependent
variable and one or more independent (predictors) variables and then use the
relationships to make predictions about the future.
Regression analysis helps to answer the following questions:
Which of the drivers have a significant impact on sales?
Which is the most important driver of sales?
How do the drivers interact with each other?
Regression Modelling Techniques cont…
Regression analysis allows one to model the dependent variable as a function of its
predictors, i.e. Y = f(Xi, β) + ei, where Y is the dependent variable, f is the function, Xi are the
independent variables, β are the unknown parameters, ei is the error term, and i varies from 1
to n.
Terminologies
Outliers: Suppose there is an observation in the dataset which is having a very high or
very low value as compared to the other observations in the data, i.e. it does not belong to
the population, such an observation is called an outlier. In simple words, it is extreme
value. An outlier is a problem because many times it hampers the results we get.
Multicollinearity: When the predictors are highly correlated to each other, the
variables are said to be multicollinear. Many types of regression techniques assume that
multicollinearity should not be present in the dataset, because it causes problems in
ranking variables based on their importance, or makes the job of selecting the most
important independent variable (factor) difficult.
Heteroscedasticity: When dependent variable's variability is not equal across values of
an independent variable, it is called heteroscedasticity. Example -As one's income
increases, the variability of food consumption will increase. A poorer person will spend a
rather constant amount by always eating inexpensive food; a wealthier person may
occasionally buy inexpensive food and at other times eat expensive meals. Those with
higher incomes display a greater variability of food consumption.
Terminologies cont…
To fit the regression line, a statistical approach known as the least squares method is used.
If b > 0, then x (predictor) and y (target) have a positive relationship; that is, an increase
in x will increase y.
If b < 0, then x (predictor) and y (target) have a negative relationship; that is, an increase
in x will decrease y.
If the sum of squared errors is taken as the metric to evaluate the model, then the goal is to obtain
a line that best reduces the error.
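As a small illustration of the least squares fit described above, the sketch below (Python, assuming NumPy is available) fits a straight line to a hypothetical predictor/target series; the data values are made up for demonstration only.

import numpy as np

# Hypothetical (x, y) observations, e.g. discount offered vs. increase in sales.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# np.polyfit with degree 1 fits y = b*x + a by the least squares method.
b, a = np.polyfit(x, y, deg=1)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")

# b > 0 => positive relationship (increase in x increases y);
# b < 0 => negative relationship (increase in x decreases y).
y_hat = a + b * x
sse = np.sum((y - y_hat) ** 2)   # sum of squared errors used to judge the fit
print(f"sum of squared errors = {sse:.3f}")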
Multiple linear regression refers to a statistical technique that is used to predict the
outcome of a variable based on the value of two or more variables. It is sometimes
known simply as multiple regression, and it is an extension of linear regression.
Example:
Do age and intelligence quotient (IQ) scores predict grade point average (GPA)?
Do weight, height, and age explain the variance in cholesterol levels?
Do height, weight, age, and hours of exercise per week predict blood pressure?
The formula for a multiple linear regression is:
y = β0+ β1x1 + β2x2 + β3x3 + β4x4+ … … … … … … + βnxn + e
where, y = the predicted value of the dependent variable.
β0 = the y-intercept (value of y when all other parameters are set to 0)
β1x1= the regression coefficient (β1) of the first independent variable (x1)
βnxn= the regression coefficient (βn) of the last independent variable (xn)
e = model error
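A minimal multiple linear regression sketch, assuming scikit-learn is available; the age/IQ/GPA numbers are hypothetical and only illustrate how β0 and the other β coefficients are estimated and then used for prediction.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical rows: [age, IQ] and the observed GPA for each student
# (illustrative numbers only, not taken from the slides).
X = np.array([[18, 110], [19, 120], [20, 105], [21, 130], [22, 115], [23, 125]])
y = np.array([2.9, 3.4, 2.7, 3.8, 3.0, 3.6])

model = LinearRegression().fit(X, y)          # estimates beta_0 ... beta_n
print("intercept (beta_0):", model.intercept_)
print("coefficients (beta_1, beta_2):", model.coef_)

# Predict GPA for a new student: age 20, IQ 118.
print("prediction:", model.predict(np.array([[20, 118]])))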
The value of r is 0.97 which indicates a very strong, almost perfect, positive correlation,
and the data value appears to form a slight curve.
Non-Linear Regression cont…
Polynomials are the equations that involve powers of the independent variables. A second
degree (quadratic), third degree (cubic), and n degree polynomial functions:
Second degree: y = β0+ β1x + β2x2 + e
Third degree: y = β0+ β1x + β2x2 + β3x3 + e
n degree: y = β0+ β1x + β2x2 + β3x3 + … … + βnxn + e
Where:
β0 is the intercept of the regression model
β1, β2, β3 are the coefficient of the predictors.
How to find the right degree of the equation?
As we increase the degree of the model, its performance on the training data tends to
increase. However, increasing the degree of the model also increases the risk of over-
fitting or under-fitting the data. So, one of the following approaches can be adopted:
Forward Selection: This method increases the degree until it is significant enough to
define the best possible model.
Backward Elimination: This method decreases the degree until it is significant
enough to define the best possible model.
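The following sketch illustrates the degree-selection idea: it fits polynomials of increasing degree with NumPy and compares the in-sample error. The x/y values are hypothetical and only for demonstration.

import numpy as np

# Hypothetical curved data, e.g. discount (%) vs. increase in sales.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1.2, 3.9, 9.1, 16.2, 24.8, 36.1, 48.9, 64.2])

# Fit polynomials of increasing degree and compare the fit error.
for degree in (1, 2, 3):
    coeffs = np.polyfit(x, y, deg=degree)       # beta_n ... beta_1, beta_0
    y_hat = np.polyval(coeffs, x)
    sse = np.sum((y - y_hat) ** 2)
    print(f"degree {degree}: SSE = {sse:.3f}")

# A higher degree always lowers the in-sample error, so in practice the degree is
# increased (forward selection) or decreased (backward elimination) only while the
# added terms remain significant / improve the validation error.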
Class work
Define the second-order polynomial model with two independent variables.
Define the second-order polynomial model with three independent
variables.
Define the third-order polynomial model with two independent variables.
Tools exist in software such as SAS and Excel, and in languages such as Python and R,
to estimate the values of the coefficients of the predictors such as β0, β1, etc.,
and to fit a curve in a non-linear fashion to the given data.
The following figure depicts the graph of increase in sales vs. discount (a curve).
An R² of 1 indicates that the regression model perfectly fits the data, while an
R² of 0 indicates that the model does not fit the data at all.
R² is calculated as R² = 1 − (SSres / SStot), where SSres is the sum of squared
residuals and SStot is the total sum of squares of the dependent variable.
In the example, a value of 0.99 for R² indicates that a quadratic model is a good
fit for the data.
Another preferable way to perform non-linear regression is to try to
transform the data in order to make the relationship between the two
variables more linear and then use a regression model rather than a
polynomial one. Transformations aim to make a non-linear relationship
between two variables more linear so that it can be described by a linear
regression model.
Three most popular transformations are the:
Square root (√X)
Logarithm (log X)
Negative reciprocal (- 1/ X)
where Sb is the sigmoid function with base b. However, in some cases it can be easier
to communicate results by working in base 2, base 10, or the exponential constant e.
In reference to the students example, solving the equation with a software tool and
taking the base as e, the coefficients are β0 = -4.0777 and β1 = 1.5046.
For a student who studies 4 hours, the estimated probability of passing the exam is
p = 1 / (1 + e^-(β0 + β1·4)) ≈ 0.87.
Following table shows the probability of passing the exam for several values of
hours studying.
Hours of study Probability of passing the exam
1 0.07
2 0.26
3 0.61
5 0.97
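A short sketch that reproduces the table above from the fitted coefficients β0 = -4.0777 and β1 = 1.5046, assuming the base-e logistic (sigmoid) form p = 1/(1 + e^-(β0 + β1x)).

import math

# Coefficients from the example above (base e).
b0, b1 = -4.0777, 1.5046

def pass_probability(hours):
    # Logistic model: p = 1 / (1 + e^-(b0 + b1*hours))
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * hours)))

for hours in (1, 2, 3, 4, 5):
    print(hours, round(pass_probability(hours), 2))
# Prints approximately 0.07, 0.26, 0.61, 0.87, 0.97, matching the table above.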
The rule states that if the probability of an event is unknown, it can be calculated
using the known probabilities of several distinct events. Consider three events A, B, and C,
where B and C are distinct (mutually exclusive) from each other while event A intersects
with both. We do not know the probability of event A; however, we know the probability of
A under condition B and the probability of A under condition C. The total probability rule
states that by using the two conditional probabilities, we can find the probability of
event A. Mathematically, the total probability rule can be written as
P(A) = Σ P(A | Bi) · P(Bi), summed over i = 1 to n, where n is the number of events and the Bi are the distinct events.
Bayes’ theorem: P(H | E) = ( P(H) * P(E | H) ) / P(E), where:
H is the hypothesis whose probability is affected by data.
E is the evidence i.e. the unseen data which was not used in computing the
prior probability
P(H) is the prior probability i.e. it is the probability of H before E is
observed
P(H | E) is the posterior probability i.e. the probability of H given E and after
E is observed.
P(E | H) is the probability of observing E given H. It indicates the
compatibility of the evidence with the given hypothesis.
P(E) is the marginal likelihood or model evidence.
Bayesian Inference cont…
Bayes’ theorem can also be written as P(H | E) = ( P(H) * P(E | H) ) * λ, where λ = 1/P(E)
is the normalizing constant ensuring that P(H | E), summed over all states of H, equals 1 for each state of E.
Class Exercise
Consider the use of online dating sites by age group:
Now, the values for each term can be obtained by looking at the dataset and substituting them into the
equation. For all entries in the dataset, the denominator does not change; it remains static.
Therefore, the denominator can be removed and a proportionality introduced.
In the example, the class variable(y) has only two outcomes, yes or no. There could be cases
where the classification could be multivariate. Therefore, the need is to find the class y with
maximum probability.
Using the above function, we can obtain the class, given the predictors.
P(Y) = 9/ 14 and P(N) = 5/14 where Y stands for Yes and N stands for No.
The outlook probability is: P(sunny | Y) = 2/9, P(overcast | Y) = 4/9, P(rain | Y) = 3/9, P(sunny |
N) = 3/5, P(overcast | N) = 0, P(rain | N) = 2/5
The temperature probability is: P(hot | Y) = 2/9, P(mild | Y) = 4/9, P(cool | Y) = 3/9, P(hot | N) =
2/5, P(mild | N) = 2/5, P(cool | N) = 1/5
The humidity probability is: P(high | Y) = 3/9, P(normal | Y) = 6/9, P(high | N) = 4/5, P(normal |
N) = 2/5.
The windy probability is: P(true | Y) = 3/9, P(false | Y) = 6/9, P(true | N) = 3/5, P(false | N) =
2/5
Now we want to predict “Enjoy Sport” on a day with the conditions: <outlook = sunny;
temperature = cool; humidity = high; windy = true (strong)>
P(Y) P(sunny | Y) P(cool | Y) P(high | Y) P(true | Y) ≈ .005 and P(N) P(sunny | N) P(cool | N)
P(high | N) P(true | N) ≈ .021
Since the probability of No is the larger, we predict “Enjoy Sport” to be No on that day.
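The same prediction can be checked programmatically; the sketch below simply multiplies the conditional probabilities listed above (it is a check of this worked example, not a general Naive Bayes implementation).

# Prior and conditional probabilities taken from the slides above (Enjoy Sport dataset).
p_yes, p_no = 9/14, 5/14
cond_yes = {"sunny": 2/9, "cool": 3/9, "high": 3/9, "windy_true": 3/9}
cond_no  = {"sunny": 3/5, "cool": 1/5, "high": 4/5, "windy_true": 3/5}

score_yes = p_yes
for p in cond_yes.values():
    score_yes *= p
score_no = p_no
for p in cond_no.values():
    score_no *= p

print(round(score_yes, 3), round(score_no, 3))             # ~0.005 vs ~0.021
print("Enjoy Sport =", "Yes" if score_yes > score_no else "No")   # No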
Types of Naive Bayes Classifier
Multinomial Naive Bayes: This is mostly used for document classification problem, i.e.
whether a document belongs to the category of sports, politics, technology etc. The
features/predictors used by the classifier are the frequency of the words present in the
document.
Bernoulli Naive Bayes: This is similar to the multinomial naive bayes but the predictors are
boolean variables. The parameters that we use to predict the class variable take up only values
yes or no, for example if a word occurs in the text or not.
Gaussian Naive Bayes: The predictors take up a continuous value and are not discrete.
Pros
It is easy and fast to predict the class of a test data set. It also performs well in multi-class
prediction.
When the assumption of independence holds, a Naive Bayes classifier performs better
compared to other models like logistic regression, and you need less training data.
It performs well with categorical input variables compared to numerical
variable(s). For numerical variables, a normal distribution is assumed (bell curve, which
is a strong assumption).
Cons
The assumption of independent predictors. In real life, it is almost impossible to get a
set of predictors which are completely independent.
If a categorical variable has a category (in the test data set) which was not observed in the
training data set, then the model will assign a 0 (zero) probability and will be unable to
make a prediction. This is often known as "zero frequency". To solve this, we can use
a smoothing technique; one of the simplest smoothing techniques is called Laplace
estimation.
Seasonal component
These are the rhythmic forces which operate in a regular and periodic
manner over a span of less than a year. They have the same or almost the
same pattern during a period of 12 months. This variation will be present in
a time series if the data are recorded hourly, daily, weekly, quarterly, or
monthly.
These variations come into play either because of natural forces or man-made
conventions. The various seasons or climatic conditions play an important role in
seasonal variations: for example, the production of crops depends on seasons, the sale
of umbrellas and raincoats rises in the rainy season, and the sale of electric fans and
air conditioners shoots up in the summer season.
The effect of man-made conventions such as festivals, customs, habits, fashions, and
occasions like marriage is easily noticeable. They recur year after year. An upswing in a
season should not be taken as an indicator of better business conditions.
Cyclical component
The variations in a time series which repeat over a span of more than one year are the
cyclic variations. This oscillatory movement has a period of oscillation of more than a
year; one complete period is a cycle. This cyclic movement is sometimes called the
'business cycle'.
It is a four-phase cycle comprising the phases of prosperity, recession, depression,
and recovery. The cyclic variation may be regular but is not periodic. The upswings and
the downswings in business depend upon the joint nature of the economic forces and the
interaction between them.
Irregular component
These are not regular variations; they are purely random or irregular. Such
fluctuations are unforeseen, uncontrollable, unpredictable, and erratic. Examples of
these forces are earthquakes, wars, floods, famines, and other disasters.
Mixed model
Different assumptions lead to different combinations of additive and multiplicative
models as Yt = Tt + St + Ct * It
The time series analysis can also be done using the model as:
Yt = Tt + St * Ct * It
Yt = Tt * St + Ct * It
Home Work
How to determine if a time series has a trend component?
How to determine if a time series has a seasonal component?
How to determine if a time series has both a trend and seasonal component?
MA(3), MA(5) and MA(12) are commonly used for monthly data, and MA(4) is
normally used for quarterly data.
MA(4) and MA(12) would average out the seasonality factors in quarterly and
monthly data respectively.
The advantage of the MA method is that the data requirement is very small.
The major disadvantage is that it assumes the data to be stationary.
MA is also called the simple moving average.
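A minimal moving-average sketch; the demand series below is the one used in the exponential smoothing example later in this section, reused here only to illustrate MA(3) and MA(4).

def moving_average(series, window):
    # Simple moving average MA(window): mean of the last `window` observations.
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

demand = [42, 45, 49, 55, 57, 60, 62, 58, 54, 50, 44, 40]
print(moving_average(demand, 3))    # MA(3), often used for monthly data
print(moving_average(demand, 4))    # MA(4), often used for quarterly data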
Moving Averages (MAs) cont…
Example data (periods 8–12 of the example table): 286, 212, 275, 188, 312.
Exponential Smoothing Model
Error calculation
The error is calculated as Et = yt – St (i.e., the difference between the actual and the smoothed value at time t).
Then the squared error is calculated, i.e. ESt = Et * Et.
Then the sum of the squared errors (SSE) is calculated, i.e. SSE = ΣESi for i = 1 to n, where
n is the number of observations.
Then the mean of the squared errors is calculated, i.e. MSE = SSE/(n-1).
The best value for α is chosen as the one which results in the smallest MSE.
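A sketch of simple exponential smoothing and the α-selection idea described above. The recursion S_t = α·y_t + (1−α)·S_(t−1) with S_1 = y_1 is one common convention (conventions vary); the series is the actual-demand data from the table below.

def exp_smooth(y, alpha):
    # Simple exponential smoothing: S_t = alpha*y_t + (1-alpha)*S_(t-1), S_1 = y_1.
    s = [y[0]]
    for t in range(1, len(y)):
        s.append(alpha * y[t] + (1 - alpha) * s[t - 1])
    return s

def mse(y, s):
    errors = [(yt - st) ** 2 for yt, st in zip(y, s)]
    return sum(errors) / (len(errors) - 1)     # divisor n-1, as defined above

y = [42, 45, 49, 55, 57, 60, 62, 58, 54, 50, 44, 40]
# Try several alphas and keep the one with the smallest MSE.
best = min((mse(y, exp_smooth(y, a)), a) for a in (0.1, 0.3, 0.5, 0.7, 0.9))
print("best alpha:", best[1], "MSE:", round(best[0], 3))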
Month:              1   2   3   4   5   6   7   8   9   10  11  12
Actual Demand:      42  45  49  55  57  60  62  58  54  50  44  40
Forecasted Demand:  44  46  48  50  55  60  64  60  53  48  42  38
Error:              -2  -1  1   5   2   0   -2  -2  1   2   2   2
Squared Error:      4   1   1   25  4   0   4   4   1   4   4   4
Sum of Squared Error = 56 and MSE = 56 / 12 = 4.6667
For the same data, RMSE = SQRT(MSE) = SQRT(4.6667) ≈ 2.2.
MAPE = (100/n) · Σ |X(t) − X'(t)| / X(t), where the sum runs over t = 1 to n. Here, X'(t) represents the forecasted
data value of point t and X(t) represents the actual data value of point t. Calculate MAPE for the actual and
forecasted demand values in the table above.
MAPE is commonly used because it’s easy to interpret and easy to explain. For
example, a MAPE value of 11.5% means that the average difference between the
forecasted value and the actual value is 11.5%.
The lower the value for MAPE, the better a model is able to forecast values e.g. a
model with a MAPE of 2% is more accurate than a model with a MAPE of 10%.
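A quick way to compute MAPE for the class exercise above (the actual vs. forecasted demand table), following the |actual − forecast| / actual definition:

actual     = [42, 45, 49, 55, 57, 60, 62, 58, 54, 50, 44, 40]
forecasted = [44, 46, 48, 50, 55, 60, 64, 60, 53, 48, 42, 38]

# MAPE = (100/n) * sum(|actual - forecast| / actual)
n = len(actual)
mape = 100.0 / n * sum(abs(a - f) / a for a, f in zip(actual, forecasted))
print(round(mape, 2))   # roughly 3.64 (%), i.e. forecasts are off by ~3.6% on average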
Data Stream – A large volume of data, likely both unstructured and structured, arriving at a very
high rate, which requires real-time/near-real-time analysis for effective decision making.
q It is basically continuously generated data that arrives in a stream (a sequence of data
elements made available over time). It is generally time-stamped and geo-tagged (in
the form of latitude and longitude).
q A stream is composed of a synchronized sequence of elements or events.
q If it is not processed immediately, it is lost forever.
q In general, such data is generated as part of application logs or events, or collected
from a large pool of devices continuously generating events, such as ATMs or PoS terminals.
Example:
Data Center: Large network deployment of a data center with hundreds of servers,
switches, routers and other devices in the network. The event logs from all these devices
at real time create a stream of data. This data can be used to prevent failures in the data
center and automate triggers so that the complete data center is fault tolerant.
Stock Market: The data generated here is a stream where a lot of events are
happening in real time. Stock prices are continuously varying. These are large
continuous data streams which need analysis in real time for better trading
decisions.
Basic Model of Stream Data
q Input data arrives rapidly, and the streams needn't have the same data rates or data types.
q The system cannot store the data entirely.
q Queries tend to ask for information about recent data.
q The scan never turns back (data is processed in a single pass).
Queries (commands) are issued to the processor, which reads multiple input streams
(e.g., … 1, 5, 2, 7, 0, 9, 3; … a, r, v, t, y, h, b; … 0, 0, 1, 0, 1, 1, 0) and produces
output while using only limited storage.
Stream processing contrasts data-at-rest with data-in-motion; computation on each stream is independent, and arrivals may be sporadic (driven by major events).
In summary, streaming data:
q Is unbounded in size, i.e. it is continually generated and cannot all be processed at once
q Has unpredictable size and frequency due to human behavior
q Must be processed relatively fast and simply
A stream query processor sits between the user/application and multiple incoming streams, answering stream queries while using scratch space (main memory and/or disk) as working storage.
Data Stream Management Systems
q Traditional relational databases store and retrieve records of data that are static in
nature and do not perceive a notion of time unless time is added as an attribute
during the schema design.
q The model is adequate for legacy applications and older repositories of information,
but many current and emerging applications require support for online analysis of
rapidly arriving and changing data streams.
q This has resulted in data stream management systems (DSMS), with an emphasis on
continuous query languages and query evaluation.
q There are two complementary techniques for end-to-end stream processing: Data
Stream Management Systems (DSMSs) and Streaming Data Warehouses (SDWs).
q Comparison of DSMS and SDW with
traditional database and warehouse
systems, wherein data rates are on the y-
axis, and query complexity and available
storage on the x-axis.
                    DBMS                         DSMS
Data                Persistent relations         Streams, time windows
Data access         Random                       Sequential, one-pass
Updates             Arbitrary                    Append-only
Update rates        Relatively low               High, bursty
Processing model    Query driven (pull-based)    Data driven (push-based)
Queries             One-time                     Continuous
Query plans         Fixed                        Adaptive
Query optimization  One query                    Multi-query
Query answers       Exact                        Exact or approximate
Latency             Relatively high              Low
q The traffic flowing through the network is itself a high-speed data stream, with each data
packet containing fields such as a timestamp, the source and destination IP addresses, and
ports.
q Other network monitoring data streams include real-time system and alert logs produced by
routers, routing and configuration updates, and periodic performance measurements.
q However, it is not feasible to perform complex operations on high-speed streams or to keep
transmitting terabytes of raw data to a data management system.
q Instead, there is a need for scalable and flexible end-to-end data stream management solutions,
ranging from real-time low-latency alerting and monitoring, ad-hoc analysis and early data
reduction on raw streaming data, to long-term analysis of processed data.
Network monitoring – DBMS, DSMS, SDW
q The input buffer captures the streaming inputs. Optionally, an input monitor may collect
various statistics such as inter-arrival times or drop some incoming data in a controlled fashion
(e.g., via random sampling) if the system cannot keep up.
q The working storage component temporarily stores recent portions of the stream and/or
various summary data structures needed by queries. Depending on the arrival rates, this ranges
from a small number of counters in fast RAM to memory-resident sliding windows.
q Local storage may be used for metadata such as foreign key mappings, e.g., translation from
numeric device IDs that accompany router performance data to more user-friendly router
names. Users may directly update the metadata in the local storage, but the working storage is
used only for query processing.
q Continuous queries are registered in the query repository and converted into execution plans;
similar queries may be grouped for shared processing. While superficially similar to relational
query plans, continuous query plans also require buffers, inter-operator queues and scheduling
algorithms to handle continuously streaming data. Conceptually, each operator consumes a data
stream and returns a modified stream for consumption by the next operator in the pipeline.
q The query processor may communicate with the input monitor and may change the query plans
in response to changes in the workload and the input rates.
q Finally, results may be streamed to users, to alerting or event-processing applications, or to an SDW
for permanent storage and further analysis.
Sensors feed data over the Internet into processing, storage, mining, and analysis components.
Online analysis of stock prices and making hold-or-sell decisions requires quick
identification of correlations and fast-changing trends, and to an extent
forecasting future valuations, as data is constantly arriving from several
sources like news feeds, current stock movement, etc. Typical queries include:
q Find the stocks priced between $1 and $200, which is showing very large
buying in the last one hour based on some federal bank news about tax
rates for a particular industry.
q Find all the stocks trading above their 100 day moving average by more
than 10% and also with volume exceeding a million shares.
Online mining of web usage logs, telephone call records, and ATM transactions are
examples of data streams, since they continuously output data and are
potentially infinite. The goal is to find interesting customer behavior patterns,
identify suspicious spending behavior that could indicate fraud, etc. Typical
queries include:
queries include:
q Examine current buying pattern of users at a website and potentially plan
advertising campaigns and recommendations.
q Continuously monitor location, average spends etc of credit card
customers and identify potential frauds.
q Stream queries are similar to SQL in that one can specify which data to
include in the stream, any conditions that the data has to match, etc.
q Stream queries are composed in the following format: SELECT <select
criteria> WHERE <where criteria> HAVING <having criteria>
q Two types of queries can be identified as typical over data streams:
one-time queries and continuous queries.
q One-time queries: These are queries that are evaluated once over a
point-in-time snapshot of the dataset, with the answers returned to the
user. For example, a stock price checker may alert the user when a
stock price crosses a particular price point.
q Continuous queries: These are evaluated continuously as data streams
continue to arrive. The answer to a continuous query is produced over
time, always reflecting the stream data seen so far. It may be stored and
updated as new data arrives, or it may be produced as a data stream
itself. Typically, aggregation queries such as finding the maximum,
average, count, etc. are continuous queries whose values are stored.
q In a sliding window, tuples are grouped within a window that slides across the data
stream according to a specified interval. A time-based sliding window with a length of
ten seconds and a sliding interval of five seconds contains tuples that arrive within a ten-
second window. The set of tuples within the window are evaluated every five seconds.
Sliding windows can contain overlapping data; an event can belong to more than one
sliding window.
q In the following image, the first window (w1, in the box with dashed lines) contains
events that arrived between the zeroth and tenth seconds. The second window (w2, in
the box with solid lines) contains events that arrived between the fifth and fifteenth
seconds. Note that events e3 through e6 are in both windows. When window w2 is
evaluated at time t = 15 seconds, events e1 and e2 are dropped from the event queue.
An example would be to compute the moving average of a stock price across the last five
minutes, triggered every second.
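A small sketch of a time-based sliding window, assuming events arrive as (timestamp-in-seconds, price) pairs; it averages each 10-second window, sliding every 5 seconds, in the spirit of the moving-average example above.

def sliding_window_average(events, window_len=10, slide=5):
    # Group (timestamp, price) events into overlapping time-based windows of
    # `window_len` seconds, evaluated every `slide` seconds; average each window.
    if not events:
        return []
    start, end = events[0][0], events[-1][0]
    results, t = [], start + window_len
    while t <= end + slide:
        window = [price for ts, price in events if t - window_len <= ts < t]
        if window:
            results.append((t, sum(window) / len(window)))
        t += slide
    return results

# Hypothetical stream of (second, stock price) events.
stream = [(0, 10.0), (2, 10.2), (4, 10.1), (6, 10.4), (9, 10.3), (12, 10.6), (14, 10.5)]
print(sliding_window_average(stream, window_len=10, slide=5))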
q In a tumbling window, tuples are grouped in a single window based on time or count. A
tuple belongs to only one window.
q For example, consider a time-based tumbling window with a length of five seconds. The
first window (w1) contains events that arrived between the zeroth and fifth seconds,
the second window (w2) contains events that arrived between the fifth and tenth
seconds, and the third window (w3) contains events that arrived between the tenth and
fifteenth seconds. The tumbling window is evaluated every five seconds, and none of the
windows overlap; each segment represents a distinct time segment.
An example would be to compute the average price of a stock over the last five minutes,
computed every five minutes.
q A blocking query operator is a query operator that is unable to produce the first
tuple of its output until it has seen its entire input. Sorting is an example of a
blocking operator, as are aggregation operators such as SUM, COUNT, MIN, MAX, and
AVG.
q If one thinks about evaluating continuous stream queries using a traditional tree of
query operators, where data streams enter at the leaves and final query answers are
produced at the root, then the incorporation of blocking operators into the query
tree poses problems.
q Since continuous data streams may be infinite, a blocking operator that has a data
stream as one of its inputs will never see its entire input, and therefore it will never
be able to produce any output.
q Clearly, blocking operators are not very suitable to the data stream computation
model, but aggregate queries are extremely common, and sorted data is easier to
work with and can often be processed more efficiently than unsorted data.
q Doing away with blocking operators altogether would be problematic, but dealing
with them effectively is one of the more challenging aspects of data stream
computation.
Need for and approach to sampling – the system cannot store the entire stream conveniently, so:
q how do we make critical calculations about the stream using a limited amount of
(primary or secondary) memory?
q we don't know how long the stream is, so when and how often should we sample?
Three solutions are discussed, namely (i) reservoir sampling, (ii) biased reservoir sampling, and (iii) concise sampling.
Example: n = 8 elements seen in the stream, reservoir size k = 4.
The key idea behind reservoir sampling is to create a 'reservoir' from a big ocean of data.
Each element of the population has an equal probability of being present in the sample,
and that probability is (k/n). With this key idea, a subsample is created. It has to be
noted that when a sample is created, the distributions should be identical not only row-wise
but also column-wise, wherein the columns are the features.
0. Start
1. Create an array reservoir[0..k-1] and copy the first k items of stream[] to it.
2. Iterate from k to n−1. In each iteration i:
   2.1. Generate a random number from 0 to i. Let the generated random number be j.
   2.2. If j is in the range 0 to k-1, replace reservoir[j] with stream[i].
3. Stop
Illustration
Input:
The list of integer stream: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, and the value of k = 6
Output:
k-selected items in the given array: 8 2 7 9 12 6
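A minimal reservoir sampling sketch following the steps above; because the replacement index is random, the selected items will differ from run to run (the output in the illustration is just one possible result).

import random

def reservoir_sample(stream, k):
    # Keep a uniform random sample of k items from a stream of unknown length.
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)            # the first k items fill the reservoir
        else:
            j = random.randint(0, i)          # random index in 0..i
            if j < k:
                reservoir[j] = item           # replaced with probability k/(i+1)
    return reservoir

stream = range(1, 13)                          # the stream 1..12 from the illustration
print(reservoir_sample(stream, k=6))           # e.g. [8, 2, 7, 9, 12, 6] (random)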
In many cases, the stream data may evolve over time, and the corresponding data
mining or query results may also change over time. Thus, the results of a query over a
more recent window may be quite different from the results of a query over a more
distant window. Similarly, the entire history of the data stream may not be relevant for use
in a repetitive data mining application such as classification. The simple reservoir
sampling algorithm can be adapted to sample from a moving window over data
streams. This is useful in many data stream applications where a small amount of recent
history is more relevant than the entire previous stream. However, this can sometimes
be an extreme solution, since for some applications we may need to sample from
varying lengths of the stream history. While recent queries may be more frequent, it is
also not possible to completely disregard queries over more distant horizons in the data
stream. Biased reservoir sampling uses a bias function to regulate the sampling from the
stream. This bias gives a higher probability of selecting data points from recent parts of
the stream as compared to the distant past. The bias function is quite effective since it
regulates the sampling in a smooth way, so that queries over recent horizons are
more accurately resolved.
The size of the reservoir is often restricted by the available main memory, and it is
desirable to increase the sample size within the available main memory restrictions.
For this purpose, the technique of concise sampling is quite effective. Concise
sampling exploits the fact that the number of distinct values of an attribute is often
significantly smaller than the size of the data stream. In many applications, sampling is
performed based on a single attribute of multi-dimensional data. For example, for customer
data at an e-commerce site, sampling may be done based only on customer IDs. The
number of distinct customer IDs is definitely much smaller than n, the size of the entire
stream.
The repeated occurrence of the same value can be exploited in order to increase the
sample size beyond the relevant space restrictions. Note that when the number of
distinct values in the stream is smaller than the main memory limitation, the entire
stream can be maintained in main memory, and therefore sampling may not even be
necessary. For current systems, in which memory sizes may be of the order of several
gigabytes, very large sample sizes can be main-memory resident as long as the number of
distinct values does not exceed the memory constraints.
q Inverse Sampling
q Weighted Sampling
q Biased Sampling
q Priority Sampling
q Dynamic Sampling
q Chain Sampling
An empty Bloom filter is a bit array of n bits, all set to zero, like below:
0 0 0 0 0 0 0 0 0 0
0 1 2 3 4 5 6 7 8 9
We need k hash functions to calculate the hashes for a given input. When we want to
add an item to the filter, the bits at the k indices h1(x), h2(x), …, hk(x) are set, where
the indices are calculated using the hash functions.
Example – Suppose we want to enter “geeks” in the filter, we are using 3
hash functions and a bit array of length 10, all set to 0 initially. First we’ll
calculate the hashes as following :
h1(“geeks”) % 10 = 1, h2(“geeks”) % 10 = 4, and h3(“geeks”) % 10 = 7
Note: These outputs are random for explanation only.
0 1 0 0 1 0 0 1 0 0
0 1 2 3 4 5 6 7 8 9
Now suppose we also add "nerd", whose hashes happen to set bits 3 and 5 (any bits
already set stay set); the filter becomes:
0 1 0 1 1 1 0 1 0 0
0 1 2 3 4 5 6 7 8 9
h1("cat") % 10 = 1
h2("cat") % 10 = 3
h3("cat") % 10 = 7
If we check the bit array, the bits at these indices are set to 1, but we know
that "cat" was never added to the filter. The bits at indices 1 and 7 were set when
we added "geeks", and bit 3 was set when we added "nerd".
0 1 0 1 1 1 0 1 0 0
0 1 2 3 4 5 6 7 8 9
So, because the bits at the calculated indices are already set by some other items,
the Bloom filter erroneously claims that "cat" is present, generating a false
positive result. Depending on the application, this could be a huge downside
or relatively okay.
We can control the probability of getting a false positive by controlling the size of
the Bloom filter: more space means fewer false positives. If we want to decrease the
probability of a false positive result, we have to use more hash functions and a larger
bit array. This adds latency to item insertion and membership checking.
Bloom Filter Algorithm
Insertion — Data: e is the element to insert into the Bloom filter.
insert(e)
begin
    for i = 1 to k do
        A[hi(e)] := 1
end
Lookup — Data: x is the element for which membership is tested.
bool isMember(x)  /* returns true or false to the membership test */
begin
    for i = 1 to k do
        if A[hi(x)] = 0 then return false
    return true
end
A Bloom filter requires space O(n) and can answer membership queries
in O(1) time, where n is the number of items inserted in the filter. Although the
asymptotic space complexity of a Bloom filter is the same as that of a hash map,
O(n), a Bloom filter is more space efficient.
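A toy Bloom filter sketch in Python. It derives the k indices from Python's built-in hash with different salts, which is only an assumption for illustration; a real deployment would use proper independent hash functions (e.g. murmur).

class BloomFilter:
    """Minimal Bloom filter: a bit array of n bits and k hash functions."""

    def __init__(self, n_bits, k_hashes):
        self.n = n_bits
        self.k = k_hashes
        self.bits = [0] * n_bits

    def _indices(self, item):
        # k indices derived from Python's built-in hash with different salts.
        return [hash((i, item)) % self.n for i in range(self.k)]

    def insert(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def is_member(self, item):
        # "Possibly present" (may be a false positive) or "definitely absent".
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter(n_bits=10, k_hashes=3)
bf.insert("geeks")
bf.insert("nerd")
print(bf.is_member("geeks"), bf.is_member("cat"))  # True, and usually False for "cat"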
Class Exercise
q An empty Bloom filter is of size 11, with 4 hash functions, namely
q h1(x) = (3x + 3) mod 6
q h2(x) = (2x + 9) mod 2
q h3(x) = (3x + 7) mod 8
q h4(x) = (2x + 3) mod 5
Illustrate Bloom filter insertion with 7 and then 8.
Perform a Bloom filter lookup/membership test with 10 and 48.
0 0 0 0 0 0 0 0 0 0 0 0
0 1 2 3 4 5 6 7 8 9 10 11
K = 4 (number of hash functions)
INSERT(x1), x1 = 7: h1(x1) = 0, h2(x1) = 1, h3(x1) = 4, h4(x1) = 2
INSERT(x2), x2 = 8: h1(x2) = 3, h2(x2) = 1, h3(x2) = 7, h4(x2) = 4
State of the Bloom filter after the insertion of x1 and x2:
1 1 1 1 1 0 0 1 0 0 0 0
0 1 2 3 4 5 6 7 8 9 10 11
False Positive in Bloom Filters cont'd
1 1 1 1 1 0 0 1 0 0 0 0
0 1 2 3 4 5 6 7 8 9 10 11
LOOKUP(x3), x3 = 10: h1(x3) = 3, h2(x3) = 1, h3(x3) = 5 → bit 5 is 0, so x3 does not exist.
LOOKUP(x4), x4 = 48: h1(x4) = 3, h2(x4) = 1, h3(x4) = 7, h4(x4) = 4 → all bits are set, so x4 is reported present: a case of FALSE POSITIVE.
Optimum number of hash functions
For a bit array of length n and m expected input elements, the optimal number of hash functions is k = (n/m) · ln 2.
Class Exercise
Calculate the optimal number of hash functions for a 10-bit Bloom filter having 3 input elements.
(Plot: false positive probability on the y-axis versus the number of hash functions on the x-axis.)
Let n be the size of the bit array, k be the number of hash functions, and m be
the number of expected elements to be inserted in the filter. Then the
probability of a false positive p can be calculated as p ≈ (1 − e^(−k·m/n))^k.
Class Exercise
Calculate the probability of false positives with a table of size 10 when the number of
items to be inserted is 3.
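The two class exercises above can be checked with the standard formulas (k = (n/m)·ln 2 and p ≈ (1 − e^(−km/n))^k), assuming n = 10 bits and m = 3 items:

import math

def optimal_k(n_bits, m_items):
    # k = (n/m) * ln 2 -- number of hash functions minimising false positives.
    return (n_bits / m_items) * math.log(2)

def false_positive_probability(n_bits, m_items, k_hashes):
    # p ~ (1 - e^(-k*m/n))^k
    return (1 - math.exp(-k_hashes * m_items / n_bits)) ** k_hashes

n, m = 10, 3                      # class-exercise values: 10-bit array, 3 items
k = optimal_k(n, m)
print(round(k, 2))                # ~2.31, so 2 (or 3) hash functions in practice
print(round(false_positive_probability(n, m, round(k)), 3))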
(1, 2, 2, 1, 3, 1, 5, 1, 3, 3, 3, 2, 2)
Number of distinct elements = 4
How to calculate?
1. Initialize the hashtable (a large binary array) of size m with all zeros.
2. Choose the hash functions hi : i ∈ {1, …, k}.
3. For each flow label (element) f in the stream, compute h(f) and mark that
position in the hashtable with 1.
4. Count the number of positions in the hashtable with 1 and call it c.
5. The number of distinct items is estimated as m · ln( m / (m − c) ).
Class Exercise
Count the distinct elements in a data stream of elements {1, 2, 2, 1, 3, 1, 5,
1, 3, 3, 3, 2, 2} with the hash function h(x) = (5x+1) mod 6 of size 11.
q The 0th order moment, f0, is the number of distinct elements in the stream.
q The 1st order moment, f1, is the length of the stream.
q The 2nd order moment, f2, is an important quantity which represents how "skewed"
the distribution of the elements in the stream is.
Example
Consider the stream a, b, c, b, d, a, c, d, a, b, d, c, a, a, b, wherein na = 5, nb = 4, nc = 3 and nd = 3. In this case:
q f0 = number of distinct elements = 4
q f1 = length of the stream = na + nb + nc + nd = 5 + 4 + 3 + 3 = 15
q f2 = na² + nb² + nc² + nd² = 5² + 4² + 3² + 3² = 59
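The moments for this example can be verified with a few lines of Python:

from collections import Counter

stream = list("abcbdacdabdcaab")          # the stream a, b, c, b, d, ... from above
counts = Counter(stream)                   # n_a=5, n_b=4, n_c=3, n_d=3

f0 = len(counts)                           # number of distinct elements
f1 = sum(counts.values())                  # length of the stream
f2 = sum(c * c for c in counts.values())   # how "skewed" the distribution is
print(f0, f1, f2)                          # 4 15 59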
q Real-time analytics makes use of all available data and resources when they
are needed, and it consists of dynamic analysis and reporting, based on data
entered into a system before the actual time of use.
q Real-time denotes the ability to process data as it arrives, rather than
storing the data and retrieving it at some point in the future.
q For example, consider an e-merchant like Flipkart or Snapdeal; real time
means the time elapsed from when a customer enters the website to when
the customer logs out. Any analytics procedure, like providing the
customer with recommendations or offering a discount based on the current
value of the shopping cart, etc., will have to be done within this timeframe,
which may be about 15 minutes to an hour.
q But from the point of view of a military application where there is constant
monitoring, say of the airspace, the time needed to analyze a potential threat
pattern and make a decision may be a few milliseconds.
Let I = {i1, ..., ik} be a set of items. Let D be a set of transactions where each transaction T
is a set of items such that T ⊆ I. Each transaction is associated with an identifier, called a
TID. Let A be a set of items. A transaction T is said to contain A if and only if A ⊆ T.
Transaction  TID  Items bought
T1           10   Beer, Nuts, Diaper
T2           20   Beer, Coffee, Diaper
T3           30   Beer, Diaper, Eggs
T4           40   Nuts, Eggs, Milk
T5           50   Nuts, Coffee, Diaper, Eggs, Milk
What is a k-itemset? An itemset containing k items: when k = 1 it is a 1-itemset, when k = 2 a 2-itemset, when k = 3 a 3-itemset, and so on.
q For example, you are in a supermarket to buy milk. Referring to the below example,
there are nine baskets containing varying combinations of milk, cheese, apples, and
bananas.
q Question - are you more likely to buy apples or cheese in the same transaction than
somebody who did not buy milk?
q The next step is to determine the relationships and the rules. So, association rule
mining is applied in this context. It is a procedure which aims to observe frequently
occurring patterns, correlations, or associations from datasets found in various kinds
of databases such as relational databases, transactional databases, and other forms of
repositories.
Market-Basket Model cont…
q The association rule has three measures that express the degree of confidence in the
rule, i.e. Support, Confidence, and Lift. Since the market-basket has its origin in retail
application, it is sometimes called transaction.
q Support: The number of transactions that include items in the {A} and {B} parts of
the rule as a percentage of the total number of transactions. It is a measure of how
frequently the collection of items occur together as a percentage of all transactions.
Example: Referring to the earlier dataset, Support(milk) = 6/9, Support(cheese) = 7/9,
Support(milk & cheese) = 6/9. This is often expressed as milk => cheese i.e. bought
milk and cheese together.
q Confidence: It is the ratio of the number of transactions that include all items in both
{A} and {B} to the number of transactions that include all items in {A}. Example:
referring to the earlier dataset, Confidence(milk => cheese) = (milk & cheese)/(milk) = 6/6.
q Lift: The lift of the rule A=>B is the confidence of the rule divided by the expected
confidence, assuming that the itemsets A and B are independent of each other.
Example: Referring to the earlier dataset, Lift(milk => cheese) = [(milk &
cheese)/(milk) ]/[cheese/Total] = [6/6] / [7/9] = 1/0.777.
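A small sketch of these three measures. The nine baskets are hypothetical but chosen so the counts match the example above (6 baskets with milk and cheese, 7 with cheese in total):

# Hypothetical baskets consistent with the counts from the slides:
# 6 contain milk and cheese, 1 contains cheese only, 2 contain neither.
baskets = [{"milk", "cheese"}] * 6 + [{"cheese"}] + [{"apple"}, {"banana"}]

def support(baskets, items):
    items = set(items)
    return sum(1 for b in baskets if items <= b) / len(baskets)

def confidence(baskets, lhs, rhs):
    return support(baskets, set(lhs) | set(rhs)) / support(baskets, lhs)

def lift(baskets, lhs, rhs):
    return confidence(baskets, lhs, rhs) / support(baskets, rhs)

print(support(baskets, {"milk"}))                 # 6/9 ≈ 0.667
print(confidence(baskets, {"milk"}, {"cheese"}))  # 6/6 = 1.0
print(lift(baskets, {"milk"}, {"cheese"}))        # 1 / (7/9) ≈ 1.286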
Fill in the missing values:
Itemsets: Milk; Cheese; (Apple, Milk); (Apple, Cheese) — compute the support count and support of each.
Rule 1: Milk => Cheese — compute support, confidence, and lift.
Rule 2: (Apple, Milk) => Cheese — compute support, confidence, and lift.
q Support(Grapes)?
q Confidence({Grapes, Apple} => {Mango})?
q Lift({Grapes, Apple} => {Mango})?
q Imagine you’re at the supermarket, and in your mind, you have the items you
wanted to buy. But you end up buying a lot more than you were supposed to.
This is called impulsive buying and brands use the Apriori algorithm to
leverage this phenomenon.
q The Apriori algorithm uses frequent itemsets to generate association rules,
and it is designed to work on databases that contain transactions.
q With the help of these association rules, it determines how strongly or how
weakly two objects are connected.
q It is an iterative process for finding the frequent itemsets in a large
dataset.
q This algorithm uses a breadth-first search and Hash Tree to calculate the
itemset associations efficiently.
q It is mainly used for market basket analysis and helps to find those products
that can be bought together. It can also be used in the healthcare field to find
drug reactions for patients.
q A hash tree is a data structure used for data verification and synchronization.
q It is a tree data structure where each non-leaf node is a hash of its child
nodes. All the leaf nodes are at the same depth and are as far left as possible.
q It is also known as a Merkle tree.
For leaves A, B, C, D: hA+B = hash(A, B), hC+D = hash(C, D), and the top hash is
hA+B+C+D = hash(hA+B + hC+D).
C1:
Itemset  Support Count
A        6
B        7
C        6
D        2
E        1
Now, we will take out all the itemsets that have a support count greater than or equal to
the minimum support (2). This gives us the table for the frequent itemset L1. Since all the
itemsets have a support count greater than or equal to the minimum support except E,
the E itemset will be removed.
School of Computer Engineering
Apriori Algorithm Example cont…
18
L1
Itemset      Support Count
A            6
B            7
C            6
D            2
Step-2: Candidate Generation C2, and L2: In this step, we generate C2 with the help of
L1. In C2, we create pairs of the itemsets of L1 in the form of subsets. After creating the
subsets, we again find the support count from the main transaction table of the dataset,
i.e., how many times these pairs have occurred together in the given dataset. This gives
the below table for C2:
C2
Itemset      Support Count
{A, B}       4
{A, C}       4
{A, D}       1
{B, C}       4
{B, D}       2
{C, D}       0
Again, we compare the C2 support counts with the minimum support count, and after
comparing, the itemsets with a smaller support count are eliminated from table C2. This
gives us the below table for L2; in this case, the {A, D} and {C, D} itemsets are removed.
School of Computer Engineering
Apriori Algorithm Example cont…
19
L2
Itemset      Support Count
{A, B}       4
{A, C}       4
{B, C}       4
{B, D}       2
Step-3: Candidate Generation C3, and L3: For C3, we repeat the same two processes,
but now we form the C3 table with subsets of three items together, and calculate the
support count from the dataset. It gives the below table:
C3
Itemset      Support Count
{A, B, C}    2
{B, C, D}    0
{A, C, D}    0
{A, B, D}    0
Now we create the L3 table. As we can see from the above C3 table, there is only one
combination of itemsets that has a support count equal to the minimum support count.
So, L3 will have only one combination, i.e., {A, B, C}.
L3
Itemset      Support Count
{A, B, C}    2
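The level-wise procedure used in this worked example can be sketched in a few lines of Python; the transaction list T below is hypothetical (the slide's transaction table is not reproduced here), and the sketch omits the full subset-pruning step of the real Apriori algorithm.

from itertools import combinations

def frequent_itemsets(transactions, min_count):
    # Level-wise search: start from single items, keep the itemsets that meet the
    # minimum support count, and grow the survivors by one item per level.
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    frequent, k = {}, 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        if not level:
            break
        frequent.update(level)
        keys = list(level)
        # candidate generation: unions of frequent k-itemsets that form (k+1)-itemsets
        current = list({a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1})
        k += 1
    return frequent

# Hypothetical transactions (the slide's transaction table is not reproduced here)
T = [{'A', 'B', 'C'}, {'A', 'B'}, {'A', 'C'}, {'B', 'C'}, {'A', 'B', 'C', 'D'}, {'B', 'D'}]
for itemset, count in frequent_itemsets(T, min_count=2).items():
    print(set(itemset), count)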
q Clustering is the task of dividing the population or data points into a number
of groups such that data points in the same group are more similar to each
other than to the data points in other groups.
q It is basically a grouping of objects on the basis of the similarity and
dissimilarity between them.
q Following is an example of finding clusters of population based on their
income and debt.
The data points clustered together can be classified into one single group. The
clusters can be distinguished, and we can identify that there are 3 clusters.
Now, based on the similarity of these clusters, the most similar clusters are combined
together, and this process is repeated until only a single cluster is left.
Proximity Matrix
Roll No 1 2 3 4 5
1 0 3 18 10 25
2 3 0 21 13 28
3 18 21 0 8 7
4 10 13 8 0 15
5 25 28 7 15 0
The diagonal elements of this matrix are always 0, as the distance of a point from itself is
always 0. The Euclidean distance formula is used to calculate the rest of the distances;
here the data is one-dimensional (the marks of the five students are 10, 7, 28, 20 and 35).
So, to calculate the distance between
Point 1 and 2: √((10-7)²) = √9 = 3
Point 1 and 3: √((10-28)²) = √324 = 18, and so on…
Similarly, all the distances are calculated and the proximity matrix is filled.
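The same proximity matrix can be reproduced with a short sketch; since the data is one-dimensional, the Euclidean distance reduces to an absolute difference.

import math

# Marks taken from the example: roll numbers 1..5 hold the values 10, 7, 28, 20 and 35.
marks = [10, 7, 28, 20, 35]
n = len(marks)
proximity = [[math.sqrt((marks[i] - marks[j]) ** 2) for j in range(n)] for i in range(n)]
for row in proximity:
    print(row)   # diagonal is 0; e.g. d(1,2) = 3.0 and d(1,3) = 18.0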
Step 1: First, each point is assigned to an individual cluster. Different colors here
represent different clusters. Hence, there are 5 different clusters for the 5 points in the data.
Step 2: Next, look at the smallest distance in the proximity matrix and merge the points
with the smallest distance. Then the proximity matrix is updated.
Roll No 1 2 3 4 5
1 0 3 18 10 25
2 3 0 21 13 28
3 18 21 0 8 7
4 10 13 8 0 15
5 25 28 7 15 0
Let’s look at the updated clusters and accordingly update the proximity matrix. Here, we
have taken the maximum of the two marks (7, 10) to replace the marks for this cluster.
Instead of the maximum, the minimum value or the average values can also be
considered.
Roll No Mark
(1, 2) 10
3 28
4 20
5 35
Step 3: Step 2 is repeated until only a single cluster is left. So, look at the minimum
distance in the proximity matrix and then merge the closest pair of clusters. We will get
the merged clusters after repeating these steps:
q To get the number of clusters for hierarchical clustering, we make use of the concept
called a Dendrogram.
q A dendrogram is a tree-like diagram that records the sequences of merges or splits.
q Let's get back to the earlier example. Whenever we merge two clusters, the
dendrogram records the distance between these clusters and represents it in graph
form.
Here, we can see that we have merged sample 1 and 2. The vertical line represents the
distance between these samples.
School of Computer Engineering
Dendrogram cont…
43
Similarly, we plot all the steps where we merged the clusters and finally, we get a
dendrogram like this:
We can clearly visualize the steps of hierarchical clustering. The longer the vertical
line in the dendrogram, the larger the distance between the clusters it joins.
Now, we can set a threshold distance and draw a horizontal line (Generally, the threshold
is set in such a way that it cuts the tallest vertical line). Let’s set this threshold as 12 and
draw a horizontal line:
The number of clusters will be the number of vertical lines which are being intersected
by the line drawn using the threshold. In the above example, since the red line intersects
2 vertical lines, we will have 2 clusters. One cluster will have a sample (1,2,4) and the
other will have a sample (3,5).
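For reference, a sketch using SciPy (assumed to be available) performs the same kind of merge-and-cut procedure; note that SciPy's built-in linkage rules differ slightly from the slide's simplified "replace with the maximum mark" update, so the exact grouping at a given threshold may differ.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

marks = np.array([[10.0], [7.0], [28.0], [20.0], [35.0]])   # values from the proximity-matrix example
Z = linkage(marks, method='complete')                # merge history (one of several linkage choices)
labels = fcluster(Z, t=12, criterion='distance')     # cut the tree at a distance threshold of 12
print(Z)        # each row: the two clusters merged and the distance at which they merged
print(labels)   # cluster label per point
# dendrogram(Z) would draw the tree itself when a matplotlib backend is available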
School of Computer Engineering
Hierarchical Clustering closeness of two clusters
46
The decision of merging two clusters is taken on the basis of the closeness of these
clusters. There are multiple metrics for deciding the closeness of two clusters; the
primary ones are:
q Euclidean distance
q Squared Euclidean distance
q Manhattan distance
q Maximum distance
q Mahalanobis distance
The below diagram explains the working of the K-means Clustering Algorithm:
1. Begin
2. Step-1: Select the number K to decide the number of clusters.
3. Step-2: Select random K points or centroids. (It can be other from the input
dataset).
4. Step-3: Assign each data point to their closest centroid, which will form the
predefined K clusters.
5. Step-4: Calculate the variance and place a new centroid for each cluster.
6. Step-5: Repeat the third step, which means reassign each data point to the new
closest centroid of each cluster.
7. Step-6: If any reassignment occurs, then go to step-4 else go to step-7.
8. Step-7: The model is ready.
9. End
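A bare-bones sketch of these steps in Python/NumPy (illustrative only, with toy data) is given below.

import numpy as np

def kmeans(points, k, iters=100, seed=0):
    # Bare-bones K-means following the steps above: pick K random points as centroids,
    # assign every point to its nearest centroid, recompute centroids, and repeat
    # until the assignments stop changing.
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                     # Step-3: closest centroid
        new_centroids = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])  # Step-4
        if np.allclose(new_centroids, centroids):         # Step-6: no reassignment
            break
        centroids = new_centroids
    return labels, centroids

data = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
print(kmeans(data, k=2))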
Suppose we have two variables x and y. The x-y axis scatter plot of these two variables is
given below:
q Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them
into different clusters. It means here we will try to group these datasets into
two different clusters.
q We need to choose some random K points or centroid to form the cluster. These
points can be either the points from the dataset or any other point. So, here we
are selecting the below two points as K points, which are not the part of dataset.
Consider the below image:
q Now we will assign each data point of the scatter plot to its closest K-point or
centroid. We will compute it by calculating the distance between two points. So,
we will draw a median between both the centroids. Consider the below image:
q From the image, it is clear that the points on the left side of the line are near to K1, the
blue centroid, and the points on the right of the line are close to the yellow centroid,
i.e. K2. Let's color them blue and yellow for clear visualization.
q As we need to find the closest cluster, we repeat the process by choosing new
centroids. To choose the new centroids, we compute the center of gravity of the points
in each cluster, and the new centroids are found as below:
q Next, we will reassign each datapoint to the new centroid. For this, we will
repeat the same process of finding a median line. The median will be like below
image:
From the above image, we can see, one yellow point is on the left side of the line, and
two blue points are right to the line. So, these three points will be assigned to new
centroids.
School of Computer Engineering
Working of K-Means Algorithm cont…
56
q We will repeat the process by finding the center of gravity of each cluster, so the
new centroids will be as shown in the below image:
q As we got the new centroids so again will draw the median line and reassign the
data points. So, the image will be:
q We can see in the previous image that no data points have changed sides of the
line, which means our model is formed. Consider the below image:
q As our model is ready, so we can now remove the assumed centroids, and the
two final clusters will be as shown in the below image:
Hadoop
Apache open-source software framework
Inspired by:
- Google MapReduce
- Google File System
A few statistics to get an idea of how much data gets generated every day, every minute, and
every second.
q Every day
q NYSE generates 1.5 billion shares and trade data
q Facebook stores 2.7 billion comments and likes
q Google processes about 24 petabytes of data
q Every minute
q Facebook users share nearly 2.5 million pieces of content.
q Amazon generates over $80,000 in online sales.
q Twitter users tweet nearly 300,000 times.
q Instagram users post nearly 220,000 new photos
q Apple users download nearly 50,000 apps.
q Email users send over 2000 million messages
q YouTube users upload 72 hrs of new video content
q Every second
q Banking applications process more than 10,000 credit card transactions.
School of Computer Engineering
Data Challenges
6
To process, analyze and make sense of these different kinds of data, a system is
needed that scales and addresses the challenges shown below:
Hadoop was created by Doug Cutting, the creator of Apache Lucene (a text search
library). Hadoop began as part of Apache Nutch (an open-source web search engine),
which was itself a part of the Lucene project. The name Hadoop is not an
acronym; it's a made-up name.
School of Computer Engineering
Key Aspects of Hadoop
9
Figure: the Hadoop ecosystem stack - HDFS (Distributed Storage) at the base, MapReduce/YARN
(Distributed Processing) above it, and higher-level tools layered on top: Hive (query), Pig (script),
HBase (NoSQL database), HCatalog (metadata services), Oozie (workflow scheduling) and ZooKeeper
(coordination), with Sqoop and Flume for data ingestion. The components are commonly grouped into
Data Storage (HDFS, YARN), Data Processing (MapReduce), Data Access (Hive, Pig, HCatalog) and
Data Management (Oozie, Flume, Sqoop).
Figure: an HDFS cluster - a single NameNode managing the file system metadata and multiple
DataNodes on the cluster nodes storing the actual data blocks.
q The Hadoop Distributed File System (HDFS) is the primary data storage
system used by Hadoop applications.
q HDFS holds very large amounts of data and employs a NameNode and
DataNode architecture to implement a distributed file system that provides
high-performance access to data across highly scalable Hadoop clusters.
q To store such huge data, the files are stored across multiple machines.
q These files are stored in redundant fashion to rescue the system from
possible data losses in case of failure.
q It runs on commodity hardware.
q Unlike other distributed systems, HDFS is highly fault-tolerant and designed
using low-cost hardware.
1. Metadata stored about the file consists of file name, file path, number of
blocks, block Ids, replication level.
2. This metadata information is stored on the local disk. Namenode uses two
files for storing this metadata information.
q FsImage q EditLog
3. The NameNode in HDFS also keeps in its memory the locations of the DataNodes
that store the blocks for any given file. Using that information, the NameNode
can reconstruct the whole file by getting the locations of all the blocks of a
given file.
Example
(File Name, numReplicas, rack-ids, machine-ids, block-ids, …)
/user/in4072/data/part-0, 3, r:3, M3, {1, 3}, …
/user/in4072/data/part-1, 3, r:2, M1, {2, 4, 5}, …
/user/in4072/data/part-2, 3, r:1, M2, {6, 9, 8}, …
Figure: interaction between the NameNode and the Secondary NameNode (steps 1-3).
With Hadoop 2.0, built into the platform, HDFS now has automated failover
with a hot standby, with full stack resiliency.
1. Automated Failover: Hadoop pro-actively detects NameNode host and
process failures and will automatically switch to the standby NameNode to
maintain availability for the HDFS service. There is no need for human
intervention in the process – System Administrators can sleep in peace!
2. Hot Standby: Both Active and Standby NameNodes have up to date HDFS
metadata, ensuring seamless failover even for large clusters – which means
no downtime for your HDP cluster!
3. Full Stack Resiliency: The entire Hadoop stack (MapReduce, Hive, Pig,
HBase, Oozie etc.) has been certified to handle a NameNode failure scenario
without losing data or the job progress. This is vital to ensure long running
jobs that are critical to complete on schedule will not be adversely affected
during a NameNode failure scenario.
All machines in rack are connected using the same network switch and if that
network goes down then all machines in that rack will be out of service. Thus
the rack is down. Rack Awareness was introduced by Apache Hadoop to
overcome this issue. In Rack Awareness, NameNode chooses the DataNode
which is closer to the same rack or nearby rack. NameNode maintains Rack ids
of each DataNode to achieve rack information. Thus, this concept chooses
DataNodes based on the rack information. The NameNode in Hadoop ensures
that all the replicas are not stored on the same or a single rack. The default
replication factor is 3. Therefore, according to the Rack Awareness algorithm:
q When the Hadoop framework creates a new block, it places the first replica on the
local node, the second one on a node in a different rack, and the third one on a
different node in that same remote rack (see the sketch after this list).
q When re-replicating a block, if the number of existing replicas is one, place
the second on a different rack.
q When number of existing replicas are two, if the two replicas are in the
same rack, place the third one on a different rack.
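A simplified sketch of this placement rule (not HDFS source code; the rack and node names below are made up) could look like the following.

import random

def place_replicas(racks, local_rack, local_node):
    # First replica on the writer's node, second on a node in a different rack,
    # third on another node in that same remote rack (replication factor 3).
    replicas = [(local_rack, local_node)]
    remote_rack = random.choice([r for r in racks if r != local_rack])
    remote_nodes = random.sample(racks[remote_rack], k=min(2, len(racks[remote_rack])))
    replicas += [(remote_rack, node) for node in remote_nodes]
    return replicas

racks = {'rack1': ['dn1', 'dn2'], 'rack2': ['dn3', 'dn4'], 'rack3': ['dn5', 'dn6']}
print(place_replicas(racks, local_rack='rack1', local_node='dn1'))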
School of Computer Engineering
Rack Awareness & Replication
34
Figure: rack awareness and replication - blocks B1, B2 and B3 each replicated on DataNodes
(DN 1 - DN 4) spread across three racks.
At the crux of MapReduce are two functions: Map and Reduce. They are
sequenced one after the other.
q The Map function takes input from the disk as <key,value> pairs, processes
them, and produces another set of intermediate <key,value> pairs as output.
q The Reduce function also takes inputs as <key,value> pairs, and produces
<key,value> pairs as output.
The types of keys and values differ based on the use case. All inputs and outputs
are stored in the HDFS. While the map is a mandatory step to filter and sort the
initial data, the reduce function is optional.
<k1, v1> -> Map() -> list(<k2, v2>)
<k2, list(v2)> -> Reduce() -> list(<k3, v3>)
Mappers and Reducers are the Hadoop servers that run the Map and Reduce
functions respectively. It doesn’t matter if these are the same or different servers.
q Map: The input data is first split into smaller blocks. Each block is then
assigned to a mapper for processing. For example, if a file has 100 records
to be processed, 100 mappers can run together to process one record each.
Or maybe 50 mappers can run together to process two records each. The
Hadoop framework decides how many mappers to use, based on the size of
the data to be processed and the memory block available on each mapper
server.
q Reduce: After all the mappers complete processing, the framework shuffles
and sorts the results before passing them on to the reducers. A reducer
cannot start while a mapper is still in progress. All the map output values
that have the same key are assigned to a single reducer, which then
aggregates the values for that key.
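A tiny word-count sketch in plain Python (real Hadoop jobs implement Mapper and Reducer classes, typically in Java) illustrates the map, shuffle/sort and reduce phases; the input lines are illustrative.

from collections import defaultdict

def map_fn(record):
    # <k1, v1> -> list(<k2, v2>): emit (word, 1) for every word in a line
    return [(word, 1) for word in record.split()]

def reduce_fn(key, values):
    # <k2, list(v2)> -> <k3, v3>: aggregate all counts for one word
    return key, sum(values)

lines = ["dog cat rat", "car car rat", "dog car rat"]   # illustrative input lines
intermediate = defaultdict(list)
for line in lines:                      # map phase
    for k, v in map_fn(line):
        intermediate[k].append(v)       # shuffle & sort: group values by key
print([reduce_fn(k, vs) for k, vs in sorted(intermediate.items())])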
Class Exercise 1: Draw the MapReduce process to count the number of words for the input:
Dog Cat Rat
Car Car Rat
Dog car Rat
Rat Rat Rat
Class Exercise 2: Draw the MapReduce process to find the maximum electrical consumption
for each year (input: a table of readings by Year).
PIG
q It was developed by Yahoo and works on the Pig Latin language, which is a query-based
language similar to SQL.
q It is a platform for structuring the data flow, processing and analyzing huge data sets.
q Pig does the work of executing commands and in the background, all the activities of
MapReduce are taken care of. After the processing, pig stores the result in HDFS.
q Pig Latin language is specially designed for this framework which runs on Pig
Runtime. Just the way Java runs on the JVM.
q Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop Ecosystem.
Hbase
q It's a NoSQL database which supports all kinds of data and is thus capable of handling
anything within a Hadoop database. It provides the capabilities of Google's BigTable, and is
thus able to work on big data sets effectively.
q At times when we need to search for or retrieve the occurrences of something small in a
huge database, the request must be processed within a very short span of time. At
such times, HBase comes in handy, as it gives us a fault-tolerant way of storing limited data.
HIVE
q With the help of SQL methodology and interface, HIVE performs reading and writing
of large data sets. However, its query language is called HQL (Hive Query
Language).
q It is highly scalable as it allows real-time processing and batch processing both. Also,
all the SQL datatypes are supported by Hive thus, making the query processing easier.
q Similar to the Query Processing frameworks, HIVE too comes with two components:
JDBC Drivers and HIVE Command Line. JDBC, along with ODBC drivers work on
establishing the data storage permissions and connection whereas HIVE Command
line helps in the processing of queries.
Oozie
q It simply performs the task of a scheduler, thus scheduling jobs and binding them
together as a single unit.
q There are two kinds of jobs, i.e. Oozie workflow jobs and Oozie coordinator jobs. Oozie
workflow jobs are those that need to be executed in a sequentially ordered manner,
whereas Oozie coordinator jobs are those that are triggered when some data or an
external stimulus is given to them.
Zookeeper
q There was a huge issue of management of coordination and synchronization among the
resources or the components of Hadoop which resulted in inconsistency, often.
q Zookeeper overcame all the problems by performing synchronization, inter-component based
communication, grouping, and maintenance.
Mahout
q It adds machine learning ability to a system or application. Machine Learning helps the
system to develop itself based on some patterns, user/environmental interaction or on the
basis of algorithms.
q It provides various libraries or functionalities such as collaborative filtering, clustering, and
classification which are nothing but concepts of Machine learning. It allows invoking algorithms
as per our need with the help of its own libraries.
Spark
q It's a platform that handles all the process-intensive tasks like batch processing, interactive
or iterative real-time processing, graph conversions, and visualization, etc.
q It uses in-memory resources and is hence faster than the earlier tools in terms of
optimization.
q It is best suited for real-time data whereas Hadoop is best suited for structured data or batch
processing, hence both are used in most of the companies interchangeably.
Sqoop
q It is a tool designed to transfer data between Hadoop and relational database.
q It is used to import data from relational databases such as MySQL, Oracle to Hadoop
HDFS, and export from Hadoop file system to relational databases.
HCatalog
q It is a table storage management tool for Hadoop that exposes the tabular data of
HIVE metastore to other Hadoop applications.
q It enables users with different data processing tools (Pig, MapReduce) to easily write
data onto a grid.
q It ensures that users don’t have to worry about where or in what format their data is
stored.
Solr, Lucene
q These are the services that perform the task of searching and indexing built on top of
Lucene (full text search engine).
q As Hadoop handles a large amount of data, Solr & Lucene helps in finding the
required information from such a large source.
q It is a scalable, ready to deploy, search/storage engine optimized to search large
volumes of text-centric data.
School of Computer Engineering
Hadoop Limitations
48
q Not fit for small data: Hadoop does not suit for small data. HDFS lacks the ability to
efficiently support the random reading of small files because of its high capacity
design. The solution to this drawback of Hadoop to deal with small file issue is
simple. Just merge the small files to create bigger files and then copy bigger files to
HDFS.
q Security concerns: Managing a complex application such as Hadoop is challenging. If
the user managing the platform does not know how to enable its security features, data
can be at huge risk. At the storage and network levels, Hadoop is missing encryption,
which is a major point of concern. Hadoop supports Kerberos authentication, which
is hard to manage. Spark provides a security bonus to overcome these limitations of
Hadoop.
q Vulnerable by nature: Hadoop is written entirely in Java, one of the most widely
used languages; Java has been heavily exploited by cyber criminals and, as a result,
implicated in numerous security breaches.
q No caching: Hadoop is not efficient for caching. In Hadoop, MapReduce cannot cache
the intermediate data in memory for a further requirement which diminishes the
performance of Hadoop. Spark can overcome this limitation.
Databases can be compared along two dimensions: RDBMS vs NoSQL, and OLTP vs OLAP.
RDBMS | NoSQL
Relational database | Non-relational, distributed database
Relational model | Model-less approach
Pre-defined schema | Dynamic schema for unstructured data
Table-based databases | Document-based, graph-based, wide-column store or key-value pair databases
Vertically scalable (by increasing system resources) | Horizontally scalable (by creating a cluster of commodity machines)
Uses SQL | Uses UnQL (Unstructured Query Language)
Not preferred for large datasets | Largely preferred for large datasets
Not a best fit for hierarchical data | Best fit for hierarchical storage as it follows the key-value pair style of storing data, similar to JSON
Emphasis on ACID properties | Follows Brewer's CAP theorem
Excellent support from vendors | Relies heavily on community support
Supports complex querying and data-keeping needs | Does not have good support for complex querying
Can be configured for strong consistency | A few support strong consistency (e.g., MongoDB); a few others can be configured for eventual consistency (e.g., Cassandra)
Examples: Oracle, DB2, MySQL, MS SQL, PostgreSQL, etc. | Examples: MongoDB, HBase, Cassandra, Redis, Neo4j, CouchDB, Couchbase, Riak, etc.
OLTP | OLAP
Many short transactions | Long transactions, complex queries
Examples: update account balance; add book to shopping cart; enroll in course | Examples: count the classes with fewer than 10 students; report total sales for each dept in each month
Queries touch small amounts of data (few records) | Queries touch large amounts of data
Updates are frequent | Updates are infrequent
Concurrency is the biggest performance problem | Individual queries can require lots of resources
CAP Theorem: In the past, when there was a need to store more data or increase
processing power, the common option was to scale vertically (get more powerful
machines) or further optimize the existing code base. However, with the
advances in parallel processing and distributed systems, it is more common to
expand horizontally, i.e. to have more machines doing the same task in parallel.
However, in order to effectively pick a tool of choice, such as Spark, Hadoop, Kafka,
Zookeeper or Storm in the Apache project, a basic idea of the CAP theorem is
necessary. The CAP theorem is also called Brewer's theorem. It states that a
distributed computing environment can only have 2 of the 3 properties: Consistency,
Availability and Partition Tolerance - one must be sacrificed.
q Consistency implies that every read fetches the last write
q Availability implies that reads and write always succeed. In other words,
each non-failing node will return a response in a reasonable amount of time
q Partition Tolerance implies that the system will continue to function when
network partition occurs
Next, the client requests that v1 be written to S1. Since the system is
available, S1 must respond. Since the network is partitioned, however, S1
cannot replicate its data to S2. This phase of execution is called α1.
Figure: three snapshots of servers S1 and S2 - initially both hold v0; the client writes v1
to S1 and receives "done"; S1 now holds v1 while S2 still holds v0.
Next, the client issues a read request to S2. Again, since the system is available,
S2 must respond, and since the network is partitioned, S2 cannot update its
value from S1. It returns v0. This phase of execution is called α2.
Figure: S1 holds v1 while S2 still holds v0; the client's read from S2 returns v0.
S2 returns v0 to the client after the client had already written v1 to S1. This is
inconsistent.
We assumed a consistent, available, partition tolerant system existed, but we
just showed that there exists an execution for any such system in which the
system acts inconsistently. Thus, no such system exists.
q It’s more than rows in tables —NoSQL systems store and retrieve data from many
formats: key-value stores, graph databases, column-family (Bigtable) stores,
document stores, and even rows in tables.
q It’s free of joins —NoSQL systems allow you to extract your data using simple
interfaces without joins.
q It’s schema-free — NoSQL systems allow to drag-and-drop data into a folder and
then query it without creating an entity-relational model.
q It works on many processors — NoSQL systems allow you to store your database
on multiple processors and maintain high-speed performance.
q It uses shared-nothing commodity computers — Most (but not all) NoSQL
systems leverage low-cost commodity processors that have separate RAM and disk.
q It supports linear scalability — When you add more processors, you get a
consistent increase in performance.
q It’s innovative — NoSQL offers options to a single way of storing, retrieving, and
manipulating data. NoSQL supporters (also known as NoSQLers) have an inclusive
attitude about NoSQL and recognize SQL solutions as viable options. To the NoSQL
community, NoSQL means “Not only SQL.”
Why: In today’s time data is becoming easier to access and capture through
third parties such as Facebook, Google+ and others. Personal user information,
social graphs, geo location data, user-generated content and machine logging
data are just a few examples where the data has been increasing exponentially.
To provide the above services properly, it is required to process huge amounts of data,
which SQL databases were never designed for. NoSQL databases evolved to
handle this huge data properly.
Uses:
Log analysis
Time-based data
Enterprises today need highly reliable, scalable and available data storage
across a configurable set of systems that act as storage nodes. The needs of
organizations are changing rapidly. Many organizations operating with single-
CPU and relational database management systems (RDBMS) were not able to
cope with the speed at which information needs to be extracted.
have to capture and analyze a large amount of variable data, and make
immediate changes in their business based on their findings.
The figure shows how the demands of
volume, velocity, variability, and agility
play a key role in the emergence of
NoSQL solutions. As each of these
drivers applies pressure to the single-
processor relational model, its
foundation becomes less stable and in
time no longer meets the organization’s
needs.
School of Computer Engineering
NoSQL Business Drivers cont…
62
Volume
q Without a doubt, the key factor pushing organizations to look at alternatives
to their current RDBMSs is a need to query big data using clusters of
commodity processors.
q Until around 2005, performance concerns were resolved by purchasing
faster processors. In time, the ability to increase processing speed was no
longer an option. As chip density increased, heat could no longer dissipate
fast enough without chip overheating. This phenomenon, known as the
power wall, forced systems designers to shift their focus from increasing
speed on a single chip to using more processors working together.
q The need to scale out (also known as horizontal scaling), rather than scale
up (faster processors), moved organizations from serial to parallel
processing where data problems are split into separate paths and sent to
separate processors to divide and conquer the work.
Velocity
q Though big data problems are a consideration for many organizations
moving away from RDBMSs, the ability of a single processor system to
rapidly read and write data is also key.
q Many single-processor RDBMSs are unable to keep up with the demands of
real-time inserts and online queries to the database made by public-facing
websites.
q RDBMSs frequently index many columns of every new row, a process which
decreases system performance.
q When single-processor RDBMSs are used as a back end to a web store front,
the random bursts in web traffic slow down response for everyone, and
tuning these systems can be costly when both high read and write
throughput is desired.
Variability
q Companies that want to capture and report on exception data struggle when
attempting to use rigid database schema structures imposed by RDBMSs.
For example, if a business unit wants to capture a few custom fields for a
particular customer, all customer rows within the database need to store
this information even though it doesn’t apply.
q Adding new columns to an RDBMS requires the system be shut down and
ALTER TABLE commands to be run.
q When a database is large, this process can impact system availability, costing
time and money.
Agility
q The most complex part of building applications using RDBMSs is the process
of putting data into and getting data out of the database.
q If the data has nested and repeated subgroups of data structures, one needs
to include an object-relational mapping layer. The responsibility of this layer
is to generate the correct combination of INSERT, UPDATE, DELETE, and
SELECT SQL statements to move object data to and from the RDBMS
persistence layer.
q This process isn’t simple and is associated with the largest barrier to rapid
change when developing new or modifying existing applications.
There are mainly four categories of NoSQL data stores. Each of these categories
has its unique attributes and limitations.
Document-Oriented NoSQL DB stores and retrieves data as a key value pair but
the value part is stored as a document. The document is stored in JSON or XML
formats. The document type is mostly used for CMS (Content Management
Systems), blogging platforms, real-time analytics & e-commerce applications. It
should not be used for complex transactions which require multiple operations or
queries against varying aggregate structures.
SQL (row in a table):
ID: 1, Name: John, Age: 27, State: California
NoSQL - Document-Based (key-value pair, with the value stored as a JSON document):
Key (ID): 1
Value (JSON): { "Name": "John", "Age": 27, "State": "California" }
A graph-type database stores entities as well as the relations amongst those
entities. An entity is stored as a node, with the relationships as edges. An edge
gives a relationship between nodes. Every node and edge has a unique identifier.
Graph databases are mostly used for social networks, logistics and spatial data.
q The key-value store, column family store, document store and graph store
patterns can be modified based on different aspects of the system and its
implementation. Database architecture could be distributed (manages single
database distributed in multiple servers located at various sites) or federated
(manages independent and heterogeneous databases at multiple sites).
q The variations in architecture are based on system requirements like agility,
availability (anywhere, anytime), intelligence, scalability, collaboration and low
latency. Various technologies support the architectural strategies to satisfy the
above requirement. For example, agility is given as a service using virtualization
or cloud computing; availability is the service given by internet and mobility;
intelligence is given by machine learning and predictive analytics; scalability
(flexibility of using commodity machines) is given by Big Data
Technologies/cloud platforms; collaboration is given by (enterprise-wide) social
network application; and low latency (event driven) is provided by in-memory
databases.
q NoSQL solution is used to handle and manage big data. NoSQL with their
inherently horizontal scale out architectures solves big data problems by moving
data to queries, uses hash rings to distribute the load, replicates the scale reads,
and allows the database to distribute queries evenly in order to make systems
run fast.
q In a distributed computing architecture, resources can be shared in two ways, or not
shared at all: the memory can be shared, or the disk can be shared (by the CPUs), or no
resources are shared. The three options can be considered as shared memory,
shared disk, and shared-nothing. Each of these architectures
works with different types of data to solve big data problems. In shared memory,
many CPUs access a single shared memory over a high-speed bus. This system is
ideal for large computation and also for graph stores; for graph traversals to be
fast, the entire graph should be in main memory. In a shared-disk system,
processors have independent memory but share disk space using a storage
area network (SAN). Big data uses commodity machines which share
nothing (share no resources).
Figure: the shared memory (SM) and shared disk (SD) architectures.
In a shared nothing (SN) architecture, neither memory nor disk is shared among
multiple processors.
Advantages:
q Fault Isolation: provides the benefit of isolating fault. A fault in a single
machine or node is contained and confined to that node exclusively and
exposed only through messages.
q Scalability: If the disk is a shared resource, synchronization will have to
maintain a consistent shared state, which means that different nodes will
have to take turns accessing the critical data. This imposes a limit on how
many nodes can be added to a distributed shared-disk system, thus
compromising scalability.
Disadvantages of NoSQL:
q No standardization rules
q Limited query capabilities
q RDBMS databases and tools are comparatively mature
q It does not offer any traditional database capabilities, like consistency when
multiple transactions are performed simultaneously.
q When the volume of data increases it is difficult to maintain unique values as
keys become difficult
q Doesn't work as well with relational data
q The learning curve is stiff for new developers
q Open source options so not so popular for enterprises.
Infographics are:
q Best for telling a premeditated story and offer subjectivity.
q Best for guiding the audience to conclusions and point out
relationships.
q Created manually for one specific dataset.
It is used for Marketing content, Resumes, Blog posts, and Case studies
etc.
Data visualizations are:
q Best for allowing the audience to draw their own conclusions, and
offer objectivity
q Ideal for understanding data at a glance
q Automatically generated for arbitrary datasets
It is used for Dashboards, Scorecards, Newsletters, Reports, and
Editorials etc.
School of Computer Engineering
Data Visualization Purpose
88
q Map
q Parallel Coordinate Plot
q Venn Diagram
q Timeline
q Euler Diagram
q Hyperbolic Trees
q Cluster Diagram
q Ordinogram
q Isoline
q Isosurface
q Streamline
q Direct Volume Rendering (DVR)
q List your friends that play Soccer OR Tennis.
q List your friends that play Soccer AND Tennis.
q Draw the Venn Diagram to show people that play Soccer but NOT Tennis
q Draw the Venn Diagram to show people that play Soccer or play Tennis, but
not the both.
School of Computer Engineering
Timeline
93
Source: datavizcatalogue.com
School of Computer Engineering
Timeline cont’d
94
Source: officetimeline.com
School of Computer Engineering
Euler Diagram
95
Source: wikipedia
Class Exercise
Draw the Euler diagram of the sets, X = {1, 2, 5, 8}, Y = {1, 6, 9} and Z={4, 7, 8 ,
9}. Then draw the equivalent Venn Diagram.
School of Computer Engineering
Hyperbolic Trees
96
• “You know you have a distributed system when the crash of a computer
you’ve never heard of stops you from getting any work done.” –Leslie
Lamport
Partial failure
Failure of a single component must not cause the failure of the entire
system, only a degradation of the application's performance
Failure should not result in the loss of any data
Component Recovery
• If a component fails, it should be able to recover without
restarting the entire system
• Component failure or recovery during a job must not affect the
final output
Scalability
• Increasing resources should increase load capacity
• Increasing the load on the system should result in a graceful decline
in performance for all jobs
• Not system failure
Where Did Hadoop Come From?
• Based on work done by Google in the early 2000s
• Google's objective was to index the entire World Wide Web
• Google had reached the limits of scalability of RDBMS technology
• “The Google File System” in 2003
• “MapReduce: Simplified Data Processing on Large Clusters” in 2004
• A developer by the name of Doug Cutting (at Yahoo!) was wrestling with
many of the same problems in the implementation of his own open-source
search engine,
• He started an open-source project based on Google’s research and created
Hadoop in 2005.
• Hadoop was named after his son’s toy elephant.
• The core idea was to distribute the data as it is initially stored
• Each node can then perform computation on the data it stores without moving the
data for the initial processing
Uses for Hadoop
• Data-intensive text processing
• Graph mining
• many more...
The Hadoop Ecosystem
Figure: the Hadoop ecosystem - HDFS (Distributed Storage), MapReduce (Distributed
Processing), with Hive (Query), Pig (Script), HCatalog, HBase (NoSQL database) and a
coordination service layered on top.
Hadoop Use Case
Figure: a typical Hadoop pipeline combining ZooKeeper, HCatalog, Hive, Pig, Flume,
MapReduce and Sqoop.
The Hadoop App Store
Core Hadoop Concepts
• Applications are written in a high-level programming language
• No network programming or temporal dependency.
• Nodes should communicate as little as possible
• A “shared nothing” architecture.
• Data is spread among the machines in advance
• Perform computation where the data is already stored as often as
possible.
Hadoop Core Components
HDFS - Hadoop Distributed File System (storage)
MapReduce (processing)
Figure: anatomy of an HDFS file read - (1) the client asks the NameNode for the file and
receives the block ids and the DataNodes holding them, then (2, 3) reads the data directly
from those DataNodes; the Secondary NameNode and cluster membership are shown alongside.
• The client then communicates directly with the DataNodes to read the data
Data Retrieval
• The latest versions of Hadoop provide an HDFS High
Availability (HA) feature.
TaskTracker
◦ Keeps track of the performance of an individual mapper or reducer
MapReduce:
The Big
Picture
Map Process

Figure: two linearly separable classes (Class 1 and Class 2) separated by a line with margin m.
f(xi) = sign(wTxi + b)
• So, the first condition basically says that all the points with y = +1 lie on the non-origin
side and all the points with y = -1 lie on the origin side.
Find w, b such that:
wTxi + b > 0, if yi = +1    (1)
wTxi + b < 0, if yi = -1    (2)
So, combining the two inequalities of Equations (1) & (2) (and scaling, as shown later) we get
yi(wTxi + b) ≥ 1 for all xi.    (Condition-1)
So, what we want to do is pick one of these infinitely many separating lines which we
consider good. Let us see how to pick one among these infinite lines.
Margin of Line
• Define a quantity called the margin of a line.
• What is the margin of a line? Take the closest point to that line.
• How do you find the closest point? Draw a perpendicular from every point Xi to the
line and see which point has the smallest perpendicular distance; that smallest
distance is the margin of the line.
Perpendicular distance from a point to a line (from coordinate geometry):
d(Xi) = |w1·x1i + w2·x2i + b| / √(w1² + w2²)
Margin of Line
d(Xi) = |w1·x1i + w2·x2i + b| / √(w1² + w2²)
where W = [w1, w2] and Xi = [x1i, x2i] in two dimensions; in higher dimensions there
will also be w3, w4, ... and x3i, x4i, ...
Introducing vector notation again, this becomes
d(Xi) = |WTXi + b| / ||W||,  where ||W|| is the norm of W and, as per the vector
representation, ||W|| = √(WTW).
What is the margin?
• The margin is the smallest of the d(Xi) values: the closest point to the line is the one
with the smallest perpendicular distance.
• So, the margin is nothing but the minimum of these distances, taken over all the Xi's:
Margin = min over Xi of d(Xi) = min over Xi of |WTXi + b| / ||W||
Choosing the best margin
• Examine these 3 lines: 2x1 + 3x2 + 4 = 0, 4x1 + 6x2 + 8 = 0 and 6x1 + 9x2 + 12 = 0.
They have different values of w1, w2 and b, but they are not really different lines; if
you plot them they turn out to be the same line, and they have the same margin.
• So which value of w and b should we take?
• We will scale w1, w2 and b by multiplying them by a constant chosen so that the
smallest value of |WTXi + b| over all the Xi's becomes 1:
min over Xi of |WTXi + b| = 1
Margin = min over Xi of d(Xi) = min over Xi of |WTXi + b| / ||W||
So it can be rewritten as: Margin = 1 / ||W||
Choosing the best margin
• Find w, b such that the margin 1/||W|| is maximized (Objective).
• It has to satisfy 2 conditions: maximize the margin and linearly separate the data.
• Equivalently, minimize ||w||² (often written as (1/2)||w||²), such that
yi(wTxi + b) ≥ 1 for all i = 1..N (Constraint). This is the primal optimization problem.
Width of the margin: take a point x1 on the plane w·x1 + b = -1 and a point x2 on the
plane w·x2 + b = +1. Subtracting the two equations gives w·(x2 - x1) = 2, so the width
to be maximized is
Width = (x2 - x1)·(w/||w||) = 2/||w||
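As a quick numeric check, for an illustrative weight vector w = [3, 4] the margin width 2/||w|| evaluates as follows.

import numpy as np

w = np.array([3.0, 4.0])          # illustrative weight vector
print(2 / np.linalg.norm(w))      # margin width 2/||w|| = 0.4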
SVM – Optimization
• Learning the SVM can be formulated as an optimization:
• Or equivalently
When solving SVM problems, there are some useful equations to keep in mind:
Support vector
guideline
Soft vs Hard Margin SVMs
Formulating the Optimization Problem
“Soft” margin solution
The optimization problem becomes
minimize (1/2)||w||² + C·Σi ξi
subject to yi(wTxi + b) ≥ 1 - ξi and ξi ≥ 0 for all i = 1..N
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk},
“Item set” = the items (e.g., products) comprising the antecedent or consequent
8
Definition: Frequent Itemset
• Itemset
• A collection of one or more items
• Example: {Milk, Bread, Diaper}
• k-itemset
• An itemset that contains k items
• Support count (σ)
• Frequency of occurrence of an itemset
• E.g. σ({Milk, Bread, Diaper}) = 2
• Support
• Fraction of transactions that contain an itemset
• E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
• An itemset whose support is greater than or equal to a minsup threshold
The model: data
• I = {i1, i2, …, im}: a set of items.
• Transaction t:
• t is a set of items, and t ⊆ I.
• Transaction Database T: a set of transactions T = {t1, t2, …, tn}.
• I: itemset
{cucumber, parsley, onion, tomato, salt, bread, olives, cheese, butter}
l T: set of transactions
1 {{cucumber, parsley, onion, tomato, salt, bread},
2 {tomato, cucumber, parsley},
3 {tomato, cucumber, olives, onion, parsley},
4 {tomato, cucumber, onion, bread},
5 {tomato, salt, onion},
6 {bread, cheese}
7 {tomato, cheese, cucumber}
8 {bread, butter}} 10
The model: Association rules
• A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
• An association rule is an implication of the form:
X → Y, where X, Y ⊂ I, and X ∩ Y = ∅
11
Rule strength measures
• Support: The rule holds with support sup in T (the transaction data set) if
sup% of transactions contain X ∪ Y.
• sup = probability that a transaction contains X ∪ Y, i.e. Pr(X ∪ Y)
(percentage of transactions that contain X ∪ Y)
• Confidence: The rule holds in T with confidence conf if conf% of transactions
that contain X also contain Y.
• conf = conditional probability that a transaction having X also contains Y, i.e.
Pr(Y | X)
(ratio of the number of transactions that contain X ∪ Y to the number that
contain X)
• An association rule is a pattern that states when X occurs, Y occurs with
certain probability.
12
Support and Confidence
• Support count: The support count of an itemset X, denoted by X.count, in a
data set T is the number of transactions in T that contain X. Assume T has n
transactions.
• Then,
support = (X ∪ Y).count / n
confidence = (X ∪ Y).count / X.count
Goal: Find all rules that satisfy the user-specified minimum support (minsup)
and minimum confidence (minconf).
13
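A quick check of these formulas on the transaction set T listed earlier, for the illustrative rule {cucumber} → {tomato}:

T = [
    {"cucumber", "parsley", "onion", "tomato", "salt", "bread"},
    {"tomato", "cucumber", "parsley"},
    {"tomato", "cucumber", "olives", "onion", "parsley"},
    {"tomato", "cucumber", "onion", "bread"},
    {"tomato", "salt", "onion"},
    {"bread", "cheese"},
    {"tomato", "cheese", "cucumber"},
    {"bread", "butter"},
]
X, Y = {"cucumber"}, {"tomato"}
xy_count = sum(1 for t in T if X | Y <= t)   # (X ∪ Y).count
x_count = sum(1 for t in T if X <= t)        # X.count
print("support    =", xy_count / len(T))     # 5/8
print("confidence =", xy_count / x_count)    # 5/5 = 1.0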
Definition: Association Rule
• Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}
16
Basic Concept: Association Rules
Transaction-id | Items bought
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F

§ Let min_support = 50%, min_conf = 50%:
§ A → C (50%, 66.7%)
§ C → A (50%, 100%)

Frequent pattern | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A, C} | 50%

Figure: customer buys beer / customer buys diaper / customer buys both.
Association Rule Mining Task
• Given a set of transactions T, the goal of association rule mining is to find all
rules having
• support ≥ minsup threshold
• confidence ≥ minconf threshold
• Brute-force approach:
• List all possible association rules
• Compute the support and confidence for each rule
• Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
Frequent Itemset Generation
• Brute-force approach:
• Each itemset in the lattice is a candidate frequent itemset
• Count the support of each candidate by scanning the database
Figure: the itemset lattice for the five items A, B, C, D, E, from the single items up to
{A, B, C, D, E}.
Computational Complexity
• Given d unique items:
• Total number of itemsets = 2^d
• Total number of possible association rules:
R = Σ(k=1..d-1) [ C(d, k) × Σ(j=1..d-k) C(d-k, j) ] = 3^d - 2^(d+1) + 1
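A small numeric check of this rule-count expression (the closed form 3^d - 2^(d+1) + 1 is the standard result from Tan, Steinbach and Kumar):

from math import comb

def rule_count(d):
    # Direct evaluation of the double sum above
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

for d in (3, 4, 5, 6):
    print(d, rule_count(d), 3 ** d - 2 ** (d + 1) + 1)   # the two expressions agree; d = 6 gives 602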
1. Rule Generation
– Generate high confidence rules from each frequent itemset, where
each rule is a binary partitioning of a frequent itemset
25
The Apriori algorithm
• The algorithm uses a level-wise search, where k-itemsets are used to explore
(k+1)-itemsets
• In this algorithm, frequent subsets are extended one item at a time (this step
is known as candidate generation process)
• Then groups of candidates are tested against the data.
• It identifies the frequent individual items in the database and extends them
to larger and larger item sets as long as those itemsets appear sufficiently
often in the database.
• Apriori algorithm determines frequent itemsets that can be used to
determine association rules which highlight general trends in the database.
The Apriori algorithm
• The Apriori algorithm takes advantage of the fact that any subset of a
frequent itemset is also a frequent itemset.
• i.e., if {l1,l2} is a frequent itemset, then {l1} and {l2} should be frequent itemsets.
• Let us assume:
• minimum confidence threshold is 60%
Association Rules with confidence
• R1 : 1,3 -> 5
– Confidence = sc{1,3,5}/sc{1,3} = 2/3 = 66.66% (R1 is selected)
• R2 : 1,5 -> 3
– Confidence = sc{1,5,3}/sc{1,5} = 2/2 = 100% (R2 is selected)
• R3 : 3,5 -> 1
– Confidence = sc{3,5,1}/sc{3,5} = 2/3 = 66.66% (R3 is selected)
• R4 : 1 -> 3,5
– Confidence = sc{1,3,5}/sc{1} = 2/3 = 66.66% (R4 is selected)
• R5 : 3 -> 1,5
– Confidence = sc{3,1,5}/sc{3} = 2/4 = 50% (R5 is REJECTED)
• R6 : 5 -> 1,3
– Confidence = sc{5,1,3}/sc{5} = 2/4 = 50% (R6 is REJECTED)
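The same six rules can be generated mechanically; the sketch below hard-codes the support counts implied by the confidences above and filters at the 60% minimum confidence.

from itertools import combinations

sc = {frozenset(s): c for s, c in [
    ({1}, 3), ({3}, 4), ({5}, 4),            # denominators of R4-R6
    ({1, 3}, 3), ({1, 5}, 2), ({3, 5}, 3),   # denominators of R1-R3
    ({1, 3, 5}, 2),
]}
itemset = frozenset({1, 3, 5})
for r in range(1, len(itemset)):
    for antecedent in combinations(sorted(itemset), r):
        a = frozenset(antecedent)
        conf = sc[itemset] / sc[a]
        status = "selected" if conf >= 0.6 else "REJECTED"
        print(f"{set(a)} -> {set(itemset - a)}: {conf:.2%} ({status})")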
How to efficiently generate rules?
• In general, confidence does not have an anti-monotone property
c(ABC→D) can be larger or smaller than c(AB →D)
• But confidence of rules generated from the same itemset has an anti-
monotone property
• e.g., L= {A,B,C,D}
c(ABC→D) ≥ c(AB→CD) ≥ c(A→BCD)
Figure: the rule lattice with low-confidence rules pruned.
Rule generation for Apriori Algorithm
• A candidate rule is generated by merging two rules that share the same prefix
in the rule consequent

       | Coffee | No Coffee | Total
Tea    | 15     | 5         | 20
No Tea | 75     | 5         | 80
Total  | 90     | 10        | 100
Association Rule: Tea → Coffee
• The lift value of an association rule is the ratio of the confidence of the rule
and the expected confidence of the rule.
• This says how likely item Y is purchased when item X is purchased, while
controlling for how popular item Y is.
Correlation Concepts
• Lift is easier to understand when written in terms of probabilities (probability focuses
on events, support focuses on how often items occur together):
Lift = Support(X ∪ Y) / (Support(X) · Support(Y)) = P(X, Y) / (P(X) · P(Y))
• The lift measures the probability of X and Y occurring together, divided by the
probability of X and Y occurring together if they were independent events.
• If the lift is equal to 1, it means that the two item sets A and B are independent (the
occurrence of A is independent of the occurrence of item set B);
in this case, support(A ∪ B) = support(A) × support(B)
• If the lift is higher than 1, it means that A and B are positively correlated.
• If the lift is lower than 1, it means that A and B are negatively correlated.
8
Correlation Concepts [Cont.]
9
Statistical Independence (Probabilistic)
• Population of 1000 students
• 600 students know how to swim (S)
• 700 students know how to bike (B)
• 420 students know how to swim and bike (S,B)
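A one-line check of independence for this example:

p_s, p_b, p_sb = 600 / 1000, 700 / 1000, 420 / 1000
print(p_sb / (p_s * p_b))   # lift = 1.0, so swimming and biking are statistically independent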
• play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower
support and confidence
• Measure of dependent/correlated events: lift
(From the underlying contingency table: of the 3750 students who eat cereal, 2000 play
basketball and 1750 do not.)
(If the relative support of an itemset I satisfies a prespecified minimum support threshold (i.e., the absolute
support of I satisfies the corresponding minimum support count threshold), then I is a frequent itemset)
• Definition (Frequent Itemset Mining): Given a set of transactions T = {T1 . . . Tn}, where each
transaction Ti is a subset of items from U, determine all itemsets I that occur as a subset of at
least a predefined fraction minsup of the transactions in T.
• Definition (Maximal Frequent Itemsets): A frequent itemset is maximal at a given minimum
support level minsup, if it is frequent, and no superset of it is frequent.
The Frequent Pattern Mining Model
• Definition (Association Rules) Let X and Y be two sets of items. Then, the rule
X⇒Y is said to be an association rule at a minimum support of minsup and
minimum confidence of minconf, if it satisfies both the following criteria:
1. The support of the itemset X ∪ Y is at least minsup.
2. The confidence of the rule X ⇒ Y is at least minconf.
• Property 4.3.1 (Confidence Monotonicity) Let X1, X2, and I be itemsets such that
X1 ⊂ X2 ⊂ I. Then the confidence of X2 ⇒ I − X2 is at least that of X1 ⇒ I − X1.
conf(X2 ⇒ I − X2) ≥ conf(X1 ⇒ I − X1)
Introduction to DATA MINING, Vipin Kumar, P N Tan, Michael Steinbach
Data Mining: Clustering
Cluster Analysis
• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• A Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Grid-Based Methods
• Model-Based Clustering Methods
• Outlier Analysis
What is Cluster Analysis?
• Cluster: a collection of data objects
• Similar to one another within the same cluster
• Dissimilar to the objects in other clusters
• Cluster analysis
• Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined classes
• Typical applications
• As a stand-alone tool to get insight into data distribution
• As a preprocessing step for other algorithms
General Applications of Clustering
• Pattern Recognition
• Spatial Data Analysis
• create thematic maps in GIS by clustering feature spaces
• detect spatial clusters and explain them in spatial data mining
• Image Processing
• Economic Science (especially market research)
• WWW
• Document classification
• Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their customer bases, and
then use this knowledge to develop targeted marketing programs
• Land use: Identification of areas of similar land use in an earth observation
database
• Insurance: Identifying groups of motor insurance policy holders with a high
average claim cost
• City-planning: Identifying groups of houses according to their house type, value,
and geographical location
• Earth-quake studies: Observed earth quake epicenters should be clustered along
continent faults
What is not Cluster Analysis?
• Supervised classification
• Have class label information
• Simple segmentation
• Dividing students into different registration groups alphabetically, by last name
• Results of a query
• Groupings are a result of an external specification
• Graph partitioning
• Some mutual relevance and synergy, but areas are not identical
What Is Good Clustering?
• A good clustering method will produce high quality clusters with
• high intra-class similarity
• low inter-class similarity
• Data matrix (two modes):
[ x11 ... x1f ... x1p ]
[ ...  ... ... ... ... ]
[ xi1 ... xif ... xip ]
[ ...  ... ... ... ... ]
[ xn1 ... xnf ... xnp ]
• Dissimilarity matrix (one mode):
[ 0                           ]
[ d(2,1)  0                   ]
[ d(3,1)  d(3,2)  0           ]
[ :       :       :           ]
[ d(n,1)  d(n,2)  ...  ...  0 ]
Measure the Quality of Clustering
• Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance
function, which is typically metric: d(i, j)
• There is a separate “quality” function that measures the “goodness” of a
cluster.
• The definitions of distance functions are usually very different for interval-
scaled, boolean, categorical, ordinal and ratio variables.
• Weights should be associated with different variables based on applications
and data semantics.
• It is hard to define “similar enough” or “good enough”
• the answer is typically highly subjective.
Type of data in clustering analysis
• Interval-scaled variables:
• Binary variables:
• Partitional Clustering
• A division data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset
• Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree
Partitioning Algorithms: Basic Concept
• Partitioning method: Construct a partition of a database D of n objects into
a set of k clusters
• Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
• Global optimal: exhaustively enumerate all partitions
• Heuristic methods: k-means and k-medoids algorithms
• k-means (MacQueen’67): Each cluster is represented by the center of the cluster
• k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each
cluster is represented by one of the objects in the cluster
K-Means Clustering
• Simple Clustering: K-means
Figure: example scatter plot (x values roughly 165-190, y values 0-80) used to illustrate
simple K-means clustering.
Hierarchical Clustering
Typical Alternatives to Calculate the Distance between Clusters
• Single link: smallest distance between an element in one cluster and an element in the
other, i.e., dis(Ki, Kj) = min(tip, tjq)
• Complete link: largest distance between an element in one cluster and an element in
the other, i.e., dis(Ki, Kj) = max(tip, tjq)
• Average: avg distance between an element in one cluster and an element in the other,
i.e., dis(Ki, Kj) = avg(tip, tjq)
• Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
• Medoid: one chosen, centrally located object in the cluster
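A small sketch comparing these linkage distances on two illustrative one-dimensional clusters (the values are made up, not from the slides):

from itertools import product

def pairwise(ki, kj):
    return [abs(a - b) for a, b in product(ki, kj)]

ki, kj = [1.0, 2.0, 3.0], [6.0, 8.0]
print("single   =", min(pairwise(ki, kj)))                        # smallest pairwise distance
print("complete =", max(pairwise(ki, kj)))                        # largest pairwise distance
print("average  =", sum(pairwise(ki, kj)) / (len(ki) * len(kj)))  # mean pairwise distance
print("centroid =", abs(sum(ki) / len(ki) - sum(kj) / len(kj)))   # distance between centroids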
Hierarchical Clustering
Figure: example dendrogram over six items (leaf order 1, 3, 2, 5, 4, 6) with merge
distances on the vertical axis (up to about 0.05).
Hierarchical Clustering
• Use distance matrix as clustering criteria. This method does not require
the number of clusters k as an input, but needs a termination condition
Agglomerative (bottom-up) (AGNES): start with each document being a single cluster;
eventually all documents belong to the same cluster.
Divisive (top-down) (DIANA): start with all documents belonging to the same cluster;
eventually each node forms a cluster of its own.
Figure: steps 0-4 of agglomerative merging of objects a, b, c, d, e (a + b → ab,
d + e → de, c + de → cde, ab + cde → abcde), and the same diagram read from
step 4 back to step 0 for divisive splitting.
Dendrogram: Shows How the Clusters are Merged
Pair-group centroid.
The distance between two clusters is determined
as the distance between centroids.
Single Link Agglomerative Clustering
• Use the maximum similarity of pairs:
sim(ci, cj) = max over x in ci, y in cj of sim(x, y)
• Can result in "straggly" (long and thin) clusters due to the chaining effect.
• Appropriate in some domains, such as clustering islands: "Hawai'i clusters"
• After merging ci and cj, the similarity of the resulting cluster to another
cluster, ck, is:
sim(ci ∪ cj, ck) = max(sim(ci, ck), sim(cj, ck))
Item E A C B D
E 0 1 2 2 3
A 1 0 2 5 3
C 2 2 0 1 6
B 2 5 1 0 3
D 3 3 6 3 0
Another Example
• Use the single link technique to find clusters in the given database:
Point | X    | Y
1     | 0.4  | 0.53
2     | 0.22 | 0.38
3     | 0.35 | 0.32
4     | 0.26 | 0.19
5     | 0.08 | 0.41
6     | 0.45 | 0.3
Plot the given data.
Identify the two nearest clusters.
Repeat the process until all objects are in the same cluster.
Average link
• Average distance matrix
Construct a distance matrix:
   1     2     3     4  5  6
1  0
2  0.24  0
3  0.22  0.15  0
Figure: a small example graph with nodes A, B, D, E and edge weights 1, 2, 1, 3.
Difficulties in Hierarchical Clustering
• Difficulties regarding the selection of merge or split points
• This decision is critical because the further merge or split decisions are
based on the newly formed clusters
• Method does not scale well
• So hierarchical methods are integrated with other clustering techniques to
form multiple-phase clustering
DATA ANALYTICS
Lecture-2
Dr. H.K.Tripathy
What is Data Analytics?
Ø The increase in the size of data has led to a rising need for inspecting, cleaning,
transforming and modelling data, with the goal of discovering useful information,
gaining insights from the data, suggesting conclusions and supporting decision-making.
Ø Intelligent data analysis (IDA) uses concepts from artificial intelligence, information
retrieval, machine learning, pattern recognition, visualization, distributed programming.
Ø The process of IDA typically consists of the following three stages:
Ø Data preparation
Ø Data mining and rule finding
Ø Result validation and interpretation
Ø It has multiple facets and approaches, encompassing diverse techniques under a
variety of names, in different business, science and social science domains.
What is Data Analytics?
In statistical applications, business analytics can be divided into two types: Exploratory Data Analysis (EDA) and Confirmatory Data Analysis (CDA).
Exploratory Data Analysis (EDA)
‒ Focuses on discovering new features in the data.
‒ Involves establishing the data’s underlying structure, identifying mistakes and missing data, establishing the key variables, spotting anomalies, checking assumptions and testing hypotheses in relation to a specific model, and estimating parameters.
Confirmatory Data Analysis (CDA)
‒ Focuses on confirming or falsifying existing hypotheses using traditional statistical tools such as significance, inference, and confidence.
‒ Involves processes like testing hypotheses, producing estimates, regression analysis (estimating the relationship between variables) and variance analysis (evaluating the difference between the planned and actual outcome).
Importance of Data Analysis
Ø Data analysis offers the following benefits:
Ø Structuring the findings from survey research or other means of data collection
Ø Provides a picture of data at several levels of granularity from a macro picture into a
micro one
Ø Acquiring meaningful insights from the data set which can be effectively exploited to
take some critical decisions to improve productivity
Ø Helps to remove human bias in decision making, through proper statistical treatment
Ø With the advent of big data, it is even more vital for organizations to find a way to analyze the
ever faster-growing, disparate data coursing through their environments and give it meaning
Data Analytics Applications
Ø Understanding and targeting customers
Ø Understanding and optimizing business processes
Ø Personal quantification and performance optimization
Ø Improving healthcare and public health
Ø Improving sports performance
Ø Improving science and research
Ø Optimizing machine and device performance
Ø Improving security and law enforcement
Ø Improving and optimizing cities and countries
Ø Financial trading
Data Analysis Process
The process moves through six phases: Business Understanding, Data Exploration, Data Preparation, Data Modeling, Data Evaluation and Deployment.
Business Understanding ‒ We need to determine the business objective, assess the situation, determine data mining goals and then produce the project plan as per the requirement. Business objectives are defined in this phase.
Data Exploration ‒ We need to gather initial data, describe and explore the data, and lastly verify data quality to ensure it contains the data we require. Data collected from the various sources is described in terms of its application and the need for the project in this phase.
Data Preparation ‒ We need to select data as per the need, clean it, construct it to get useful information and then integrate it all. Finally, we need to format the data to get the appropriate data. Data is selected, cleaned, and integrated into the format finalized for the analysis in this phase.
Data Modeling ‒ We need to select a modeling technique, generate a test design, build a model and assess the model built. The data model is built to analyze relationships between the various selected objects in the data. Test cases are built for assessing the model, and the model is tested and implemented on the data in this phase.
Data Evaluation ‒ We evaluate the results from the last step, review the scope of error, and determine the next steps to perform. We evaluate the results of the test cases and review the scope of errors in this phase.
Deployment ‒ We need to plan the deployment, monitoring and maintenance, produce a final report and review the project. In this phase, we deploy the results of the analysis. This is also known as reviewing the project.
Characteristics of Data Analysis
The characteristics of data analysis depend on different aspects such as volume, velocity, and variety.
Programmatic (P) ‒ There might be a need to write a program for data analysis, using code to manipulate the data or do any kind of exploration, because of the scale of the data.
Data-driven (D) ‒ A lot of data scientists depend on a hypothesis-driven approach to data analysis. For appropriate data analysis, one can also let the data itself drive the analysis. This can be of significant advantage when there is a large amount of data.
Attributes usage (A) ‒ For proper and accurate analysis of data, a lot of attributes can be used. In the past, analysts dealt with hundreds of attributes or characteristics of the data source. With Big Data, there are now thousands of attributes and millions of observations.
Iterative (I) ‒ As the whole data is broken into samples and the samples are then analyzed, data analytics can be iterative in nature. Better compute power enables iteration of the models until data analysts are satisfied.
How to Get a Better Analysis?
In order to have a great analysis, it is necessary to ask the right question, gather the right data to address it, and design the right analysis to answer the question. Only after careful analysis can we define it as correct.
Business Importance ‒ How the problem is related to the business and its importance. We assign the results in the business context as part of the final process of validation.
Statistical Significance ‒
Null hypothesis: There is no relationship between gender and age-14 test score. This is the default assumption (even if you do not think it is true!).
Alternative hypothesis: There is a relationship between gender and age-14 test score.
Statistical Significance - What is a P-value?
• A p-value is a probability. It is usually expressed as a proportion which can also be
easily interpreted as a percentage.
• P-values become important when we are looking to ascertain how confident we can
be in accepting or rejecting our hypotheses.
• Because we only have data from a sample of individual cases and not the entire
population we can never be absolutely (100%) sure that the alternative hypothesis is
true.
• However, by using the properties of the normal distribution we can compute the
probability that the result we observed in our sample could have occurred by chance.
• The way that the p-value is calculated varies subtly between different statistical
tests, each of which generates a test statistic (called, for example, t, F or χ² depending on
the particular test).
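A minimal sketch of how a p-value comes out of a test statistic in practice, using SciPy's two-sample t-test. The score data here are invented for illustration; the gender/test-score setting above is only the motivating example.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
scores_group_a = rng.normal(loc=50, scale=10, size=100)  # hypothetical test scores, group A
scores_group_b = rng.normal(loc=53, scale=10, size=100)  # hypothetical test scores, group B

t_stat, p_value = stats.ttest_ind(scores_group_a, scores_group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# Compare the p-value with a chosen significance level (here 5%)
if p_value < 0.05:
    print("Reject the null hypothesis of no difference at the 5% level")
else:
    print("Fail to reject the null hypothesis at the 5% level")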
Sampling Distribution
More precisely, sampling distributions are probability distributions and used to describe the
variability of sample statistics.
The probability distribution of the sample mean (hereafter denoted x̄) is called the sampling
distribution of the mean (also referred to as the distribution of the sample mean).
Using the values of x̄ and s² computed from different random samples of a population, we make
inferences about the parameters μ and σ² of the population.
Sampling Distribution
Example 5.1:
Consider five identical balls numbered (and weighing) 1, 2, 3, 4 and 5. Consider an experiment consisting of
drawing two balls, replacing the first before drawing the second, and then computing the mean of the values of
the two balls.
The following table lists all 25 possible samples and their means.
Sample   Mean  | Sample   Mean  | Sample   Mean
[1,1]    1.0   | [2,4]    3.0   | [4,2]    3.0
[1,2]    1.5   | [2,5]    3.5   | [4,3]    3.5
[1,3]    2.0   | [3,1]    2.0   | [4,4]    4.0
[1,4]    2.5   | [3,2]    2.5   | [4,5]    4.5
[1,5]    3.0   | [3,3]    3.0   | [5,1]    3.0
[2,1]    1.5   | [3,4]    3.5   | [5,2]    3.5
[2,2]    2.0   | [3,5]    4.0   | [5,3]    4.0
[2,3]    2.5   | [4,1]    2.5   | [5,4]    4.5
                                | [5,5]    5.0
Sampling Distribution
Sampling distribution of means:
x̄       1.0    1.5    2.0    2.5    3.0    3.5    4.0    4.5    5.0
P(x̄)    1/25   2/25   3/25   4/25   5/25   4/25   3/25   2/25   1/25
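A short Python sketch that reproduces Example 5.1 by enumerating all 25 samples with replacement and tallying the sampling distribution of the mean.

from itertools import product
from collections import Counter
from fractions import Fraction

balls = [1, 2, 3, 4, 5]
samples = list(product(balls, repeat=2))        # all 25 ordered samples with replacement
means = [(a + b) / 2 for a, b in samples]

dist = Counter(means)                           # frequency of each sample mean
for m in sorted(dist):
    print(m, Fraction(dist[m], len(samples)))   # e.g. 1.0 -> 1/25, 3.0 -> 5/25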
Issues with Sampling Distribution
1. In a practical situation, for a large population, it is infeasible to enumerate all
possible samples and hence to obtain the probability distribution of the sample statistic.
2. The sampling distribution also depends on the method of choosing the samples.
The Idea of Statistical Significance
Ø Because sampling is imperfect
• Samples may not ideally match the population
Ø Because hypotheses cannot be directly tested
• Inference is subject to error
Significance level: the degree of risk that you are willing to take of rejecting the null hypothesis when it is actually true.
Lecture-3
Dr. H.K.Tripathy
What is Regression?
A way of predicting the value of one variable from another.
‒ It is a hypothetical model of the relationship between two variables.
‒ The model used is a linear one.
‒ Regression is a statistical procedure that determines the equation for the
straight line that best fits a specific set of data.
• Any straight line can be represented by an equation of the form Y = bX + a, where
b and a are constants.
• The value of b is called the slope constant and determines the direction and
degree to which the line is tilted.
• The value of a is called the Y-intercept and determines the point where the line
crosses the Y-axis.
Main Objectives
Two main objectives:
Ø Establish if there is a relationship between two variables
‒ Specifically, establish if there is a statistically significant relationship
between the two.
‒ Example: Income and expenditure, wage and gender, etc.
Ø Forecast new observations.
‒ Can we use what we know about the relationship to forecast unobserved
values?
‒ Example: What will our sales be over the next quarter?
Variable’s Roles
Variables
Dependent
‒ This is the variable whose values we want to explain or forecast.
‒ Its values depend on something else.
‒ We denote it as Y.
Independent
‒ This is the variable that explains the other one.
‒ Its values are independent.
‒ We denote it as X.
Y = mX + c
A Linear Equation
You may remember one of these.
‒ y = a + bx
‒ y = mx + b
• In this regression discussion, we just use a different notation:
‒ y = β0 + β1x,
• where β0 is called the intercept and β1 is called the coefficient or slope
• The values of the regression parameters β0 and β1 are not known.
• We estimate them from data.
• β1 indicates the change in the mean response per unit increase in X.
Example:
• The weekly advertising expenditure (x) and weekly sales (y) are presented in the following
table (first rows shown):
y      x
1250   41
1380   54
1575   64
1650   71
• From the data table we have: n = 10, Σx = 564, Σx² = 32604, so x̄ = 56.4, and ȳ = 1436.5.
• With the estimated slope b₁ = 10.8, the intercept is b₀ = ȳ − b₁x̄ = 1436.5 − 10.8(56.4) ≈ 828.
Point Estimation of Mean Response
• The estimated regression function is:
ŷ = 828 + 10.8x
Sales = 828 + 10.8 × Expenditure
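A sketch of fitting such a line with NumPy least squares. Only the four visible rows of the example table are used (the remaining six rows are not reproduced above), so the fitted coefficients will not exactly match b₀ ≈ 828 and b₁ = 10.8.

import numpy as np

# Visible rows of the advertising example (the full table has 10 rows)
x = np.array([41, 54, 64, 71], dtype=float)           # weekly advertising expenditure
y = np.array([1250, 1380, 1575, 1650], dtype=float)   # weekly sales

b1, b0 = np.polyfit(x, y, deg=1)        # least-squares slope and intercept
print(f"y_hat = {b0:.1f} + {b1:.2f} x")

# Point estimate of the mean response at a chosen expenditure level
x_new = 60.0
print("predicted sales:", b0 + b1 * x_new)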
y = β0 + β1x
y = 4 + 2x
β1= 2
What happens if we change the intercept?
y = 4 + 2x
y = 9 + 2x
y = -2 + 2x
What happens if we change the slope?
y = 4 + 2x
y = 4 + 5x
y = 4 + 0x = 4
y = 4 - 3x
But the world is not linear!
The fitted line y = 4 + 2x only approximates the true values; the true model includes an error term:
y = β0 + β1x + ε
Simple Linear Regression Model
Y = β0 + β1X + ε, where the error ε has mean 0 and constant variance σ²
Data for Linear Regression Example [table not reproduced]
Logistic Regression
• Logistic regression can be used to model and solve problems where the outcome takes
only two values, also called binary classification problems.
• Logistic Regression is one of the most commonly used Machine Learning
algorithms that is used to model a binary variable that takes only 2 values –
0 and 1.
• The objective of Logistic Regression is to develop a mathematical equation
that can give us a score in the range of 0 to 1.
• This score gives us the probability of the variable taking the value 1.
Why not linear regression?
When the response variable has only 2 possible values, it is desirable to have a model that
predicts the value either as 0 or 1 or as a probability score that ranges between 0 and 1.
Linear regression does not have this capability, because if you use linear regression to model
a binary response variable, the resulting model may not restrict the predicted Y values to the
range 0 to 1.
which is equivalent to
Probability(event) = Odds(event) / (1 + Odds(event))
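A tiny Python sketch of the odds-to-probability conversions stated above; the 0.40 example value is just for illustration (it matches the treatment-group probability used below).

def odds_from_prob(p: float) -> float:
    """Odds(event) = p / (1 - p)."""
    return p / (1 - p)

def prob_from_odds(odds: float) -> float:
    """Probability(event) = Odds / (1 + Odds)."""
    return odds / (1 + odds)

p = 0.40
odds = odds_from_prob(p)            # approximately 0.667
print(odds, prob_from_odds(odds))   # converts back to 0.40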
Odds Ratios
Dichotomous Predictor
Consider a dichotomous
predictor (X) which represents
the presence of risk (1 = present)
Odds Ratios
Definition of Odds Ratio: Ratio of two odds estimates.
So, if Pr(response | trt) = 0.40 and Pr(response | placebo) = 0.20
Then:
Odds(response | trt group) = 0.40 / (1 − 0.40) = 0.667
Odds(response | placebo group) = 0.20 / (1 − 0.20) = 0.25
OR (Trt vs. Placebo) = 0.667 / 0.25 = 2.67
Logistic Regression Example
Consider an example dataset which maps the number of hours of study with the
result of an exam. The result can take only two values, namely passed(1) or failed(0):
The dataset has ‘p’ feature variables and ‘n’ observations. The feature matrix X is
represented as an n × p matrix (one row per observation, one column per feature).
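A sketch of the hours-of-study example using scikit-learn's LogisticRegression (an assumed tool). The hours and pass/fail labels below are invented for illustration, since the slide's table is not reproduced here.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs. exam result (1 = passed, 0 = failed)
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(hours, passed)

# Predicted probability of passing (a score between 0 and 1) for new students
new_hours = np.array([[1.0], [2.75], [4.5]])
print(model.predict_proba(new_hours)[:, 1])  # probability of class 1 (passed)
print(model.predict(new_hours))              # hard 0/1 predictions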
Counting Distinct Elements in a Stream (Flajolet–Martin estimate)
Conversion to binary (hash values of the stream elements):
{010, 100, 011, 010, 011, 100, 000, 100, 010, 011, 100, 010}
Trailing zeros — computing r(a), the number of trailing zeros of each hash value:
{1, 2, 0, 1, 0, 2, 0, 2, 1, 0, 2, 1}
Distinct elements estimate:
R = max r(a) = 2
Estimate = 2^R = 2² = 4
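A short Python sketch that reproduces the trailing-zero bookkeeping above on the given binary strings. The hashing step that produced these strings is not shown on the slide and is omitted here; the all-zero string is counted as 0 to match the slide's values.

def trailing_zeros(bits: str) -> int:
    """Number of trailing zeros in a binary string, e.g. '100' -> 2, '011' -> 0."""
    stripped = bits.rstrip("0")
    return len(bits) - len(stripped) if stripped else 0  # treat all-zero strings as 0 here

hashes = ["010", "100", "011", "010", "011", "100",
          "000", "100", "010", "011", "100", "010"]

r = [trailing_zeros(h) for h in hashes]
R = max(r)
print(r)        # [1, 2, 0, 1, 0, 2, 0, 2, 1, 0, 2, 1]
print(2 ** R)   # estimate of the number of distinct elements = 4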