DA Unit II - II

Data Analytics
(Subject Code: 410243)

(Class: TE Computer Engineering)
2019 Pattern
Designed By: Prof Balaji Bodkhe

Objectives and outcomes
• Course Objectives
– To understand and apply different design methods and techniques
– To understand architectural design and modeling
– To understand and apply testing techniques
– To implement design and testing using current tools and
techniques in distributed, concurrent and parallel
environments
• Course Outcomes
– To present a survey on design techniques for software system
– To present a design and model using UML for a given software
system
– To present a design of test cases and implement automated testing
for client server, distributed, mobile applications
Concepts
UNIT-II
Basic Data Analytic Methods
Syllabus
Statistical Methods for Evaluation: Hypothesis testing, difference
of means, Wilcoxon rank-sum test, type I and type II errors, power and
sample size, ANOVA. Advanced Analytical Theory and Methods:
Clustering: Overview, K-means: Use cases, overview of the method,
determining the number of clusters, diagnostics, reasons to choose and
cautions.
Statistical Methods for Evaluation
Statistics can help answer the following questions for data
analytics:
• Model Building and Planning
– What are the best input variables for the model?
– Can the model predict the outcome given the input?
• Model Evaluation
– Is the model accurate?
– Does the model perform better than an obvious guess?
– Does the model perform better than another candidate model?
• Model Deployment
– Is the prediction sound?
– Does the model have the desired effect (such as reducing the
cost)?
This section discusses some useful statistical tools that may answer
these questions.
Hypothesis Testing
When comparing populations, such as evaluating the difference
of the means from two samples of data, a common technique to
assess the difference or the significance of the difference is
hypothesis testing.
The basic idea is to form an assertion and test it with data.
When performing hypothesis tests, the common assumption
is that there is no difference between two samples.
This assumption is the null hypothesis (H0). The alternative
hypothesis (Ha) is that there is a difference.
For example, if the task is to identify the effect of drug A
compared to drug B on patients, the null hypothesis and
alternative hypothesis would be as follows.
• H0: Drug A and drug B have the same effect on patients.
• Ha: Drug A has a greater effect than drug B on patients.
If the task is to identify whether advertising Campaign C is
effective in reducing customer churn, the null hypothesis
and alternative hypothesis would be as follows.
• H0: Campaign C does not reduce customer churn better
than the current campaign method.
• Ha: Campaign C does reduce customer churn better than
the current campaign.
It is important to state the null hypothesis and alternative
hypothesis, because misstating them is likely to undermine
the subsequent steps of the hypothesis testing process. A
hypothesis test leads to either rejecting the null hypothesis
in favour of the alternative or not rejecting the null
hypothesis.
Example Null Hypotheses and
Alternative Hypotheses
Difference of Means
Two hypothesis tests, Student's t-test and Welch's t-test,
compare the means of the respective populations based on
samples randomly drawn from each population. Both consider
the following hypotheses.
• H0: μ1 = μ2
• Ha: μ1 ≠ μ2
The μ1 and μ2 denote the population means of pop1
and pop2, respectively.

Student's t-test
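Student's t-test assumes that the distributions of the two
populations have equal but unknown variances.
Suppose n1 and n2 samples are randomly and independently
selected from two populations, pop1 and pop2, respectively.
If each population is normally distributed with the same mean
(μ1 = μ2) and with the same variance, then T (the t-statistic),
given in the equation below, follows a t-distribution with
n1 + n2 - 2 degrees of freedom (df).
T = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \quad S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}
Here \bar{X}_i and S_i^2 are the sample mean and sample variance
of the sample drawn from population i, and S_p^2 is the pooled
sample variance.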
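The t-statistic described above can be sketched in a few lines. The slides rely on R output; this is a minimal Python sketch using only the standard library, and it assumes the usual pooled-variance form of Student's statistic. The function name student_t and the two samples are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def student_t(x, y):
    """Return (T, df) for Student's two-sample t-test (equal variances assumed)."""
    n1, n2 = len(x), len(y)
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    # Numerator: difference of sample means. Denominator: its estimated standard error.
    t = (mean(x) - mean(y)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Hypothetical samples, not taken from the slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, df = student_t(x, y)
print(f"T = {t_stat:.4f}, df = {df}")
```

The sign of T follows the order of the arguments: it is negative here because the mean of x is smaller than the mean of y.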
The shape of the t-distribution is similar to the normal
distribution. In fact, as the degrees of freedom approach
30 or more, the t-distribution is nearly identical to the
normal distribution.
Because the numerator of T is the difference of the sample
means, if the observed value of T is far enough from zero
that such a value would be unlikely when the population
means are equal, one would reject the null hypothesis that
the population means are equal.
Thus, for a small probability, say α = 0.05, T* is determined
such that P(|T| ≥ T*) = 0.05. After the samples are collected
and the observed value of T is calculated from the t-statistic
equation, the null hypothesis (μ1 = μ2) is rejected if |T| ≥ T*.
In hypothesis testing, in general, the small probability, α, is
known as the significance level of the test.
The significance level of the test is the probability of
rejecting the null hypothesis when the null hypothesis is
actually TRUE.
From the R output, the observed value of T is t = -1.7828.
The negative sign is due to the fact that the sample mean of x
is less than the sample mean of y.
Using the qt() function in R, a T value of 2.0484 corresponds
to a 0.05 significance level.
Because the magnitude of the observed T statistic is less than
the T value corresponding to the 0.05 significance level
(|-1.7828| < 2.0484), the null hypothesis is not rejected.
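The decision rule can be checked directly with the two numbers quoted above. This is a small Python sketch; the variable names are illustrative, and the values are copied from the text rather than recomputed.

```python
# Values quoted in the text (R output and qt()); not recomputed here.
t_observed = -1.7828
t_critical = 2.0484  # T* for a 0.05 significance level, two-sided

# Reject H0 only when the magnitude of T reaches the critical value.
reject_null = abs(t_observed) >= t_critical
print("reject H0" if reject_null else "do not reject H0")
```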
Welch's t-test
In statistics, Welch's t-test, or unequal variances
t-test, is a two-sample location test which is used
to test the hypothesis that two populations have
equal means.
Welch's t-test is an adaptation of Student's t-test,
and is more reliable when the two samples have
unequal variances and unequal sample sizes.
In Welch's test, under the remaining assumptions of
random samples from two normal populations with
the same mean, the distribution of T is approximated
by the t-distribution.
The R code performs the Welch's t-test on the same
set of data analysed in the earlier Student's t-test
example.
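For comparison with Student's statistic, Welch's statistic and its approximate degrees of freedom can be sketched as follows. This is a Python sketch using only the standard library, not the R code from the slides. It assumes the standard Welch-Satterthwaite approximation for the degrees of freedom; the function name welch_t and the samples are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Return (T, df) for Welch's t-test (variances not assumed equal)."""
    n1, n2 = len(x), len(y)
    # Each sample contributes its own variance estimate; nothing is pooled.
    a = variance(x) / n1
    b = variance(y) / n2
    t = (mean(x) - mean(y)) / sqrt(a + b)
    # Welch-Satterthwaite approximation; usually not an integer.
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df


# Hypothetical samples with clearly different spreads.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, df = welch_t(x, y)
print(f"T = {t_stat:.4f}, df = {df:.4f}")
```

With equal sample sizes the Welch and pooled statistics coincide, so the two tests differ only in the degrees of freedom used to find the critical value.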
In both the Student's and Welch's t-test examples,
the R output provides 95% confidence intervals on
the difference of the means.
Type I and Type II Errors
A hypothesis test may result in two types of errors,
depending on whether the test accepts or rejects the null
hypothesis. These two errors are known as type I and type II
errors.
• A type I error is the rejection of the null hypothesis when
the null hypothesis is TRUE. The probability of the type I
error is denoted by the Greek letter α.
• A type II error is the acceptance of a null hypothesis when
the null hypothesis is FALSE. The probability of the type II
error is denoted by the Greek letter β.
Four possible states of a hypothesis test, including
the two types of errors.
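The four states follow directly from the two definitions above and can be written as a small lookup. This is an illustrative Python sketch; the function name classify_outcome is an assumption, not something from the slides.

```python
def classify_outcome(null_is_true: bool, null_rejected: bool) -> str:
    """Name the state of a hypothesis test given the truth and the decision."""
    if null_is_true and null_rejected:
        return "type I error"      # rejected a true H0; probability alpha
    if not null_is_true and not null_rejected:
        return "type II error"     # failed to reject a false H0; probability beta
    return "correct decision"


# Enumerate all four combinations of truth and decision.
for truth in (True, False):
    for rejected in (True, False):
        print(f"H0 true={truth}, rejected={rejected}: {classify_outcome(truth, rejected)}")
```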
Power and Sample Size
The power of a test is the probability of correctly rejecting
the null hypothesis. It is denoted by 1 - β, where β is the
probability of a type II error.
Because the power of a test improves as the sample size
increases, power is used to determine the necessary sample size.
In the difference of means, the power of a hypothesis test
depends on the true difference of the population means.
In general, the magnitude of the difference is known as the
effect size.
A larger sample size better identifies a fixed effect size.
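The relationship between power, effect size, and sample size can be illustrated numerically. The Python sketch below makes a simplifying assumption that is not in the slides: it treats the common standard deviation sigma as known and uses a two-sided z-test (normal approximation) instead of the t-test. The function name and the parameter values are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample_z(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test with n points per group.

    delta is the true difference of the population means (the effect size in
    raw units); sigma is the common, assumed-known standard deviation.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the alternative the test statistic is centred at delta / standard error.
    shift = delta / (sigma * sqrt(2 / n))
    # Probability that the statistic lands in either rejection region.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


# Fixed effect size, growing sample size: power rises toward 1.
for n in (10, 20, 40, 80, 160):
    print(n, round(power_two_sample_z(delta=0.5, sigma=1.0, n=n), 3))
```

When delta is zero the function returns alpha, which matches the definition of the significance level as the probability of rejecting a true null hypothesis.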
ANOVA
The hypothesis tests presented in the previous sections are
good for analysing means between two populations.
What if there are more than two populations? Consider an
example of testing the impact of nutrition and exercise on
60 candidates between age 18 and 50.
The candidates are randomly split into six
groups, each assigned with a different weight loss strategy,
and the goal is to determine which strategy is the most
effective.
Group 1 only eats junk food.
Group 2 only eats healthy food.
Group 3 eats junk food and does cardio exercise every other
day.
Group 4 eats healthy food and does cardio exercise every
other day.
Group 5 eats junk food and does both cardio and strength
training every other day.
Group 6 eats healthy food and does both cardio and
strength training every other day.
Multiple t-tests could be applied to each pair of weight loss
strategies. With six groups, a total of 15 t-tests would be performed.
However, multiple t-tests may not perform well on several
populations for two reasons.
First, because the number of t-tests increases as the number
of groups increases, analysis using the multiple t-tests
becomes cognitively more difficult.
Second, by doing a greater number of analyses, the
probability of committing at least one type I error
somewhere in the analysis greatly increases.
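Both reasons can be made concrete with a short calculation. The Python sketch below counts the pairwise tests for six groups and estimates the chance of at least one type I error, under the simplifying assumption (not stated in the slides) that the tests are independent and each uses α = 0.05.

```python
from math import comb

groups = 6
alpha = 0.05

# Number of distinct pairs of groups, i.e. the number of t-tests needed.
pairs = comb(groups, 2)

# Probability of at least one false rejection across all tests,
# assuming the tests are independent (an approximation for pairwise t-tests).
familywise_error = 1 - (1 - alpha) ** pairs

print(f"{pairs} t-tests, P(at least one type I error) = {familywise_error:.3f}")
```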
Analysis of Variance (ANOVA) is designed to address these
issues. ANOVA is a generalization of the hypothesis testing
of the difference of two population means.
ANOVA tests if any of the population means differ from the
other population means.
The null hypothesis of ANOVA is that all the population
means are equal.
The alternative hypothesis is that at least one pair of the
population means is not equal. In other words, for n populations:
• H0: μ1 = μ2 = ... = μn
• Ha: μi ≠ μj for at least one pair i, j
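ANOVA compares the variation between group means with the variation within groups. A minimal Python sketch of the one-way F-statistic follows, using only the standard library; the function name and the three small groups are made up for illustration and are not the weight loss data from the example.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean([v for g in groups for v in g])
    group_means = [mean(g) for g in groups]
    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of the values around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    return f_stat, k - 1, n - k


# Hypothetical measurements for three groups.
data = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_between, df_within = one_way_anova_f(data)
print(f"F = {f_stat:.2f} on ({df_between}, {df_within}) degrees of freedom")
```

A large F means the group means differ by more than the within-group spread would explain. It does not say which pair of means differs.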
Overview of Clustering
Clustering is the use of unsupervised techniques for
grouping similar objects.
In machine learning, unsupervised refers to the
problem of finding hidden structure within
unlabelled data.
The structure of the data describes the objects of
interest and determines how best to group the
objects.
For example, based on customers' personal income, it
is straightforward to divide the customers into three
groups depending on arbitrarily selected values.
The customers could be divided into three groups as
follows:
• Earn less than $10,000
• Earn between $10,000 and $99,999
• Earn $100,000 or more
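Grouping by fixed thresholds like this is a rule chosen by hand, not something learned from the data. A small Python sketch of the three groups listed above makes that explicit; the function name income_group and the sample incomes are hypothetical.

```python
def income_group(income: float) -> str:
    """Assign a customer to one of the three hand-picked income groups."""
    if income < 10_000:
        return "less than $10,000"
    if income < 100_000:
        return "$10,000 to $99,999"
    return "$100,000 or more"


# Made-up incomes for illustration.
for income in (8_500, 10_000, 64_000, 250_000):
    print(income, "->", income_group(income))
```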
K-means
Given a collection of objects each with n measurable
attributes, k-means is an analytical technique that, for a
chosen value of k, identifies k clusters of objects based on
the objects' proximity to the centers of the k groups.
The center is determined as the arithmetic average (mean)
of each cluster's n-dimensional vector of attributes.
The figure illustrates three clusters of objects with two
attributes. Each object in the dataset is represented by a
small dot color-coded to the closest large dot, the mean of
the cluster.
Use Cases
Image Processing
Video is one example of the growing volumes of
unstructured data being collected. Within each frame of a
video, k-means analysis can be used to identify objects in
the video.
For each frame, the task is to determine which pixels are
most similar to each other. The attributes of each pixel can
include brightness, color, and location (the x and y
coordinates in the frame).
With security video images, for example, successive frames
are examined to identify any changes to the clusters. These
newly identified clusters may indicate unauthorized access
to a facility.
Medical
Patient attributes such as age, height, weight, systolic and
diastolic blood pressures, cholesterol level, and other
attributes can be used to identify naturally occurring clusters.
Clustering, in general, is useful in biology for the
classification of plants and animals as well as in the field of
human genetics.
Customer Segmentation
Marketing and sales groups use k-means to better identify
customers who have similar behaviours and spending
patterns.
For example, a wireless provider may look at the following
customer attributes: monthly bill, number of text messages,
data volume consumed, minutes used during various daily
periods, and years as a customer.
The wireless company could then look at the naturally
occurring clusters and consider tactics to increase sales or
reduce the customer churn rate, the proportion of customers
who end their relationship with a particular company.
Overview of the Method
To illustrate the method to find k clusters from a collection
of M objects with n attributes, the two-dimensional case
(n = 2) is examined. It is much easier to visualize the k-means
method in two dimensions.
Because each object in this example has two attributes, it is
useful to consider each object as corresponding to the point
(xi, yi), where xi and yi denote the two attributes and
i = 1, 2, ..., M.
For a given cluster of m points (m < M), the point that
corresponds to the cluster's mean is called a centroid.
k-means algorithm
1. Choose the value of k and the k initial guesses for the
centroids. In this example, k = 3, and the initial centroids are
indicated by the points shaded in red, green, and blue in
Figure 4-2.
2. Compute the distance from each data point (xi, yi) to each
centroid. Assign each point to the closest centroid. This
association defines the first k clusters.
The distance, d, between any two points, (x1, y1) and (x2, y2),
in the Cartesian plane is typically expressed by using the
Euclidean distance measure:
d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}
3. Compute the centroid, the center of mass, of each newly
defined cluster from Step 2.
In two dimensions, the centroid (xc, yc) of the m points in a
k-means cluster is calculated as follows:
(x_c, y_c) = \left( \frac{1}{m} \sum_{i=1}^{m} x_i, \; \frac{1}{m} \sum_{i=1}^{m} y_i \right)
Thus, (xc, yc) is the ordered pair of the arithmetic means of
the coordinates of the m points in the cluster. In this step, a
centroid is computed for each of the k clusters.
4. Repeat Steps 2 and 3 until the algorithm converges to an
answer.
a. Assign each point to the closest centroid computed
in Step 3.
b. Compute the centroid of newly defined clusters.
c. Repeat until the algorithm reaches the final answer.
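The four steps can be sketched in Python for the two-dimensional case. This is a minimal illustration under stated assumptions, not a production implementation. It stops when the centroids no longer change, keeps the old centroid if a cluster becomes empty, and uses made-up points with k = 2 instead of the k = 3 example in the figure. The function names are hypothetical.

```python
from math import dist


def centroid(points):
    """Arithmetic mean of the x and y coordinates of a non-empty list of points."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def k_means(points, initial_centroids, max_iter=100):
    """Return (centroids, clusters) after iterating Steps 2 and 3."""
    centroids = list(initial_centroids)  # Step 1: k and the initial guesses
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid (Euclidean distance).
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid; an empty cluster keeps its old centroid.
        new_centroids = [
            centroid(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Made-up data: two visibly separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, initial_centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

The result depends on the initial centroids, so it is common to run the method from several starting guesses and compare the outcomes.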
