BI Unit 2
Table of Contents
1 Data Science: the domain of study that deals with vast volumes of data using modern tools, techniques, and machine learning algorithms to derive meaningful information and make business decisions.
2 Data Collection
3 Data Processing
4 Data Understanding (EDA)
5 Model Building and Deployment
Data Collection: Data collection is the next stage in the data science life cycle, in which raw data is gathered from relevant sources. The data captured can be in either structured or unstructured form. The data may be collected from website logs, social media, online repositories, data streamed from online sources via APIs, web scraping, or data present in Excel files or any other source.
Data Science Process (cont.)
In the histogram, the bars show the count of values in each range.
The histogram shows that the center of the data is somewhere around
45 and the spread of the data is from about 30 to 65. It also shows the
shape of the data as roughly mound-shaped. This shape is a visual clue
that the data is likely to be from a normal distribution.
► How extreme data values are observed in histograms
Histograms are affected by extreme values, or outliers.
Data Science Process (cont.)
► How skewness is observed in histograms
Not all histograms are symmetrical. Histograms display the distribution
of your data, and there are many common types of distributions. For
example, data is often nonsymmetrical. In statistics, this is called skewed
data. For example, the battery life for a phone is often skewed, with some
phones having a much longer battery life than most.
Data Science Process (cont.)
► The relationships between variables. Scatter plots show relationships: they show how two continuous variables are related by putting one variable on the x-axis and a second variable on the y-axis.
► Whether or not your data have outliers or unusual points that may indicate data quality issues or lead to interesting insights.
► Whether or not your data have patterns over time.
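As a small illustration, a scatter plot of two continuous variables can be drawn in R with plot(); the built-in mtcars data set (also used later in this unit) and the choice of variables are assumptions made only for this example.
# Scatter plot of two continuous variables from R's built-in mtcars data
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Scatter plot: weight vs. fuel economy")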
Data Science Process (cont.)
User Based CF Example: Consider a matrix that shows four users, Alice, U1, U2 and U3, rating different news apps. Ratings range from 1 to 5 according to how much the user likes the app. A blank (missing) entry indicates that the user has not rated the app.
Examples of Algorithms- Recommendations (cont.)
Step 1: Calculating the similarity between Alice and all the other
users
At first we calculate the average rating of each user, excluding I5 as it is not rated by Alice:
$\bar{r}_i = \dfrac{\sum_{p} r_{ip}}{|P_i|}$, where $P_i$ is the set of items considered for user $i$.
$\bar{r}_{Alice} = 3.5$
$\bar{r}_{U1} = 2.25$
$\bar{r}_{U2} = 3.5$
$\bar{r}_{U3} = 3$
and calculate the new (mean-centred) ratings as $r'_{ip} = r_{ip} - \bar{r}_i$.
Examples of Algorithms- Recommendations (cont.)
Now we calculate the similarity between Alice and all the other users using these mean-centred ratings.
$\mathrm{Sim}(Alice, U1) = \dfrac{(1.5)(0.75) + (0.5)(-1.25) + (-2.5)(-0.25) + (0.5)(0.75)}{\sqrt{1.5^2 + 0.5^2 + 2.5^2 + 0.5^2}\,\sqrt{0.75^2 + 1.25^2 + 0.25^2 + 0.75^2}} = 0.301$
Computed in the same way, $\mathrm{Sim}(Alice, U2) = -0.33$ and $\mathrm{Sim}(Alice, U3) = 0.707$.
Alice's predicted rating for I5 is her average rating plus the similarity-weighted average of the other users' mean-centred ratings of I5:
$r(Alice, I5) = 3.5 + \dfrac{(0.301)(0.75) + (-0.33)(1.5) + (0.707)(1)}{|0.301| + |-0.33| + |0.707|} = 3.83$
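The same steps can be sketched in R. The small ratings matrix below is hypothetical (it is not the matrix from the example, which is not reproduced here); the code only illustrates the mean-centring, similarity, and prediction steps.
# A generic sketch of user-based CF. NA marks an app the user has not rated.
ratings <- rbind(Alice = c(5, 3, 4, 4, NA),
                 U1    = c(3, 1, 2, 3, 3),
                 U2    = c(4, 3, 4, 3, 5),
                 U3    = c(3, 3, 1, 5, 4))
# Average each user over the items Alice has rated (columns 1-4), as in the example
user.means <- rowMeans(ratings[, 1:4], na.rm = TRUE)
centered   <- sweep(ratings, 1, user.means)          # r'_ip = r_ip - mean_i
# Similarity between Alice and every other user on the co-rated items
sim.to.alice <- apply(centered[-1, 1:4], 1, function(u)
  sum(u * centered["Alice", 1:4]) /
    (sqrt(sum(u^2)) * sqrt(sum(centered["Alice", 1:4]^2))))
# Predicted rating of Alice for item 5: her mean plus the weighted average deviation
pred <- user.means["Alice"] +
  sum(sim.to.alice * centered[-1, 5]) / sum(abs(sim.to.alice))
pred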
Examples of Algorithms - Validation
Validation in machine learning (ML) is an essential step to assess how
well a model will generalize to new, previously unseen data. It provides a
mechanism to tune, compare, and select models. Without validation,
there’s a risk of overfitting, where a model performs well on the training
data but poorly on new data.
In simple hold-out validation, the data are split once into a training set and a validation/test set, so the performance estimate can depend heavily on which data points end up in the training set and which ones in the validation/test set.
Random subsampling, also known as repeated hold-out validation, is
an extension of the hold-out validation method. Instead of splitting the
data into training and test sets once, the process is repeated multiple
times with different random splits. This method helps address the
variability issue seen in the standard hold-out validation method by
averaging performance over multiple random splits.
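A minimal sketch of this idea in R is shown below; the small synthetic data set and the use of test RMSE as the performance measure are assumptions made only for illustration.
# Repeated hold-out (random subsampling) on a made-up data set
set.seed(123)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)
errors <- replicate(20, {
  idx  <- sample(nrow(dat), size = 0.7 * nrow(dat))   # a fresh random 70/30 split
  fit  <- lm(y ~ x, data = dat[idx, ])
  test <- dat[-idx, ]
  sqrt(mean((predict(fit, test) - test$y)^2))          # test RMSE for this split
})
mean(errors)   # average performance over the repeated random splits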
K-fold cross-validation: In this technique, the whole dataset is partitioned into k parts of equal size, and each part is called a fold. It is known as k-fold since there are k parts, where k can be any integer - 3, 4, 5, etc.
One fold is used for validation and the other k-1 folds are used for training the model. So that every fold serves as a validation set and the remaining folds as the training set, the procedure is repeated k times until each fold has been used
Examples of Algorithms - Validation (cont.)
once. To get the final accuracy, you average the accuracies obtained on the k validation folds.
This validation technique is not considered suitable for imbalanced datasets, because a random split may not preserve the proper ratio of each class's data in every fold, so the model will not get trained properly.
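A minimal sketch of k-fold cross-validation with the caret package (which is used later in this unit) is shown below; the formula and the income.data data set follow the regression example in this unit, and 5 folds are chosen arbitrarily.
# k-fold cross-validation with caret (assumes income.data with columns income, happiness)
library(caret)
set.seed(32984)
ctrl <- trainControl(method = "cv", number = 5)          # 5-fold cross-validation
cv.fit <- train(happiness ~ income, data = income.data,
                method = "lm", trControl = ctrl)
cv.fit$results                                           # performance averaged over the folds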
Examples of Algorithms - Validation (cont.)
many times - 100, 400, 500, or even higher - and take the average of all the test errors to conclude how well your model performs.
Examples of Algorithms - Validation (cont.)
# Check whether the dependent variable is roughly normally distributed
hist(income.data$happiness)
Linear Regression (cont.)
Using the estimates for the coefficients provided in the output above, we can now build out the equation for our model.
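The output referred to here is the coefficient table produced by summary() of a fitted model. A minimal sketch, assuming the model regresses happiness on income in income.data (consistent with the code shown later in this section):
# Fit the simple linear regression and inspect the coefficient table
income.happiness.lm <- lm(happiness ~ income, data = income.data)
summary(income.happiness.lm)   # Estimate, Std. Error, t value, Pr(>|t|), F-statistic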
Coefficients — Std. Error
The standard error of the coefficient is an estimate of the standard
deviation of the coefficient. In effect, it is telling us how much uncertainty
there is with our coefficient.
Linear Regression (cont.)
Coefficients — t value
The t-statistic is simply the coefficient divided by the standard error. In
general, we want our coefficients to have large t-statistics, because it
indicates that our standard error is small in comparison to our coefficient.
Coefficients — Pr(> |t|) and Signif. codes
The p-value is calculated using the t-statistic from the T distribution. The p-value, in association with the t-statistic, helps us to understand how significant our coefficient is to the model. In practice, any p-value below 0.05 is usually deemed significant.
Residual Standard Error
The residual standard error is a measure of how well the model fits the
data.
Linear Regression (cont.)
The null hypothesis of the F-test is that all of the coefficients of the variables in your model are zero. The alternative hypothesis is that at least one of them is not zero. The F-statistic and overall p-value help us determine the result of this test.
For smaller models, a larger F-statistic generally indicates that the null hypothesis should be rejected. A better approach, however, is to use the p-value that is associated with the F-statistic. Again, in practice, a p-value below 0.05 generally indicates that you have at least one coefficient in your model that isn't zero.
Step 4: Check for homoscedasticity
We can run plot(income.happiness.lm) to check whether the observed data meet our model assumptions:
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2 x 2 grid
plot(income.happiness.lm)
Linear Regression (cont.)
par(mfrow = c(1, 1))   # reset the plotting window to a single panel
Linear Regression (cont.)
What is Clustering in R?
Clustering is a technique of data segmentation that partitions the data
into several groups based on their similarity.
Applications of R clustering are as follows:
1 Marketing – In the area of marketing, we use clustering to explore and select customers that are potential buyers of the product. This differentiates the customers most likely to buy from those with the least tendency to purchase the product. After the clusters have been developed, businesses can keep track of their customers and make the necessary decisions to retain them in that cluster.
2 Retail – Retail industries make use of clustering to group customers
based on their preferences, style, choice of wear as well as store prefer-
ences. This allows them to manage their stores in a much more efficient
manner.
Clustering Using R (cont.)
3 Medical Science – Medicine and health industries make use of clustering algorithms to facilitate efficient diagnosis and treatment of their patients as well as the discovery of new medicines. Based on the age group and genetic coding of the patients, these organisations are better able to arrive at a diagnosis through robust clustering.
4 Sociology – Clustering is used in Data Mining operations to divide
people based on their demographics, lifestyle, socioeconomic status,
etc. This can help the law enforcement agencies to group potential
criminals and even identify them with an efficient implementation of
the clustering algorithm.
Clustering Using R (cont.)
For binary attributes, the dissimilarity between two objects i and j is
$d(i, j) = \dfrac{p - m}{p}$
where $p$ is the total number of attributes and $m$ is the number of attributes on which i and j match.
We can calculate similarity as
$s(i, j) = 1 - d(i, j)$
Let $s$ be the number of cases where the matched attributes are both zero; ignoring such matches,
$d(i, j) = \dfrac{p - m}{p - s}$
$s(i, j) = 1 - d(i, j)$
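A small R sketch of these formulas for two binary records is shown below; the example vectors are made up.
# Matching-based dissimilarity and similarity for two binary records
i <- c(1, 0, 1, 1, 0)
j <- c(1, 1, 1, 0, 0)
p  <- length(i)                 # total number of attributes
m  <- sum(i == j)               # number of matching attributes
d1 <- (p - m) / p               # d(i, j)
s0 <- sum(i == 0 & j == 0)      # matches where both attributes are zero
d2 <- (p - m) / (p - s0)        # d(i, j) when 0-0 matches are ignored
c(s_simple = 1 - d1, s_ignoring_00 = 1 - d2)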
Clustering Using R (cont.)
K-Means Clustering in R
K-Means is an iterative hard clustering technique that uses an unsupervised learning algorithm. The total number of clusters is pre-defined by the user, and each data point is assigned to a cluster based on its similarity to that cluster. The algorithm also finds the centroid of each cluster.
Algorithm:
1 Specify the number of clusters (K): let us take an example of k = 2 and 5 data points.
2 Randomly assign each data point to a cluster: in the example, the red and green colors show 2 clusters with their respective randomly assigned data points.
3 Calculate the cluster centroids
4 Re-allocate each data point to its nearest cluster centroid
5 Re-compute the cluster centroids, repeating steps 4 and 5 until the cluster assignments no longer change
Clustering Using R (cont.)
R Code:
# Library required for the fviz_cluster() function
install.packages("factoextra")
library(factoextra)
# Loading dataset
df <- mtcars
# Omitting any NA values
df <- na.omit(df)
Clustering Using R (cont.)
# Scaling dataset
df <- scale(df)
# Output to be saved as a PNG file
png(file = "KMeansExample.png")
# Fit k-means with 4 clusters and 25 random starts
km <- kmeans(df, centers = 4, nstart = 25)
# Visualize the clusters
fviz_cluster(km, data = df)
# Saving the file
dev.off()
Clustering Using R (cont.)
Text Analytics
► Sourcing: Determine where your text data is coming from, which could
include websites, databases, social media platforms, customer reviews,
or other text-rich sources.
► Acquisition: Use methods like web scraping, database queries, or APIs
to collect the data.
2 Data Preprocessing:
► Cleaning: Remove noise such as HTML tags, URLs, non-textual con-
tent, or any irrelevant text portions.
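A small base-R sketch of this cleaning step is shown below; the example strings are made up for illustration.
# Basic text cleaning with base R regular expressions
raw <- c("<p>Great phone! See https://example.com</p>",
         "Battery life is poor &amp; the screen scratches")
clean <- gsub("<[^>]+>", " ", raw)                  # strip HTML tags
clean <- gsub("http\\S+|www\\.\\S+", " ", clean)    # strip URLs
clean <- gsub("&amp;", "&", clean)                  # decode a common HTML entity
clean <- gsub("\\s+", " ", trimws(clean))           # collapse extra whitespace
clean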
Text Analytics (cont.)
Advantages:
1 Weighting Important Terms: TF-IDF inherently weighs terms based on
their importance. Words that are frequent in a single document but rare
across documents get a higher weight, potentially emphasizing unique
or more relevant information.
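This weighting behaviour can be sketched in base R on a toy two-document corpus (the documents below are made up); note how the term shared by both documents ends up with zero weight.
# Minimal TF-IDF over a toy corpus
docs  <- list(c("love", "play", "football"),
              c("love", "read", "books"))
vocab <- unique(unlist(docs))
tf    <- t(sapply(docs, function(d) table(factor(d, levels = vocab)) / length(d)))
idf   <- log(length(docs) / colSums(tf > 0))   # terms rare across documents get higher idf
tfidf <- sweep(tf, 2, idf, `*`)                # term frequency weighted by idf
round(tfidf, 3)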
Text Analytics (cont.)
Example:
Let’s consider the sentence:
”I love to play football.”
Unigrams: ”I”, ”love”, ”to”, ”play”, ”football”
Bigrams: ”I love”, ”love to”, ”to play”, ”play football”
Trigrams: ”I love to”, ”love to play”, ”to play football”
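The same n-grams can be generated with a few lines of base R; the helper function below is only an illustrative sketch.
# Generating n-grams from a tokenized sentence
tokens <- c("I", "love", "to", "play", "football")
ngrams <- function(tokens, n) {
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}
ngrams(tokens, 2)   # "I love" "love to" "to play" "play football"
ngrams(tokens, 3)   # "I love to" "love to play" "to play football"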
► Evaluation:
1. After building models, assess their performance using appropriate
evaluation metrics like accuracy, F1-score, recall, precision, etc.
2. Use methods like cross-validation to ensure model generalizability.
► Deployment: If the goal is a real-world application, such as a rec-
ommendation system, chatbot, or sentiment analysis tool, deploy the
trained model to production.
► Feature Engineering: Derive new features based on the textual content, like sentiment scores, length of text, readability scores, etc.
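A small sketch of deriving such features in R is shown below; the data frame and its column names are hypothetical.
# Simple text-derived features
reviews <- data.frame(text = c("I love to play football.",
                               "Terrible battery life, would not recommend!"),
                      stringsAsFactors = FALSE)
reviews$n_char  <- nchar(reviews$text)                       # length of text
reviews$n_words <- lengths(strsplit(reviews$text, "\\s+"))   # simple word count
reviews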
Text Analytics Using R
# Use caret to create a 70%/30% stratified split. Set the random seed for reproducibility.
library(caret)
set.seed(32984)
indexes <- createDataPartition(spam.raw$Label, times = 1, p = 0.7, list = FALSE)
train <- spam.raw[indexes, ]    # Create training set
test  <- spam.raw[-indexes, ]   # Create test set
Text Analytics Using R (cont.)
# Verify proportions.
prop.table(table(train$Label))
prop.table(table(test$Label))
Text Analytics Using R (cont.)
# Inspect the tokens of a single (the 357th) training document
train.tokens[[357]]
The best way to get an accurate analysis when using inferential statistics involves identifying the population being measured or studied, drawing a sample from that population, and using analysis to account for any sampling error.
Types of Inferential Statistics
Inferential statistics employ four different methodologies or types:
1 Parameter Estimation: Analysts take a statistic from the sample