
Business Intelligence - Unit-II

Table of Contents

Data Science: The Concept


Data Science Process
Typical Tools in Data Science
Examples of algorithms
► Recommendations
► Validation
► Classification
Histograms using R and Excel
Regression Using R
Clustering Using R
Text Analysis Using R
Statistical Analysis
Question Bank
Data Science

Data Science is the domain of study that deals with vast volumes of data, using modern tools, techniques, and machine learning algorithms to derive meaningful information and support business decisions.

Why Data Science now?


Data science has gained significant prominence in recent years due to
several factors:
Data Abundance: The digital transformation has led to an explosion of data from various sources, including social media, sensors, transactions, and more. This abundance of data presents opportunities for extracting valuable insights.
Technological Advancements: The growth of computing power, storage capabilities, and data processing tools has enabled the handling and analysis of large datasets that were previously impossible to manage.
Data Science (cont.)

Business Value: Organizations have recognized that data-driven in-


sights can provide a competitive edge by improving decision-making,
customer understanding, and operational efficiency.
Personalization: Data science allows companies to personalize their
products, services, and marketing efforts by understanding individual
customer preferences and behavior.
Predictive Analytics: Data science enables predictive modeling, al-
lowing businesses to forecast trends, demand, and potential outcomes,
leading to better planning and resource allocation.

How is data generated? Data is generated through various processes and activities in the digital and physical world. It can come from a wide range of sources, including human interactions, sensors, machines, and more. Here are some common ways data is generated:
Data Science (cont.)

1 Human Activities: People generate data through their interactions


with digital devices, such as smartphones, computers, and wearable de-
vices. This includes text messages, social media posts, emails, browsing
history, and online purchases.
2 Sensors and IoT Devices: Internet of Things (IoT) devices and sen-
sors collect data from the physical environment. These devices can
measure things like temperature, humidity, motion, light levels, and
more. This data is often used for monitoring and control purposes.
3 Machine-generated Data: Automated systems and machines gener-
ate data as they operate. This can include logs, error reports, per-
formance metrics, and other operational data that helps in diagnosing
issues and improving efficiency.
4 Digital Transactions: Financial transactions, online purchases, and
electronic payments generate data related to spending patterns, trans-
action amounts, locations, and more.
Data Science (cont.)

5 Scientific Instruments: Scientific experiments and research generate


data from instruments such as telescopes, microscopes, particle accel-
erators, and more. This data is used for advancing scientific knowledge
and understanding.
6 Social Media and Online Platforms: Social media platforms, web-
sites, and online applications generate data from user interactions, such
as likes, shares, comments, and clicks. This data is used for user en-
gagement analysis and content optimization.
7 Surveys and Questionnaires: Data is collected through surveys and
questionnaires to gather information from individuals on specific topics
or subjects.
8 Medical Devices: Medical equipment like MRI machines, heart rate
monitors, and glucose monitors generate data related to patients’ health
conditions.
Data Science (cont.)

9 Audio and Video Recordings: Audio and video recordings capture


data in the form of sound and visuals. This can include videos, pod-
casts, music recordings, and more.
10 Satellites and Remote Sensing: Satellites and remote sensing tech-
nologies collect data about the Earth’s surface, atmosphere, oceans,
and more. This data is used for environmental monitoring, weather
forecasting, and geographical analysis.
11 Log Files: Computers, servers, and network devices generate log files
that record various events and activities, helping in troubleshooting and
security analysis.
12 Geolocation Data: GPS-enabled devices generate data about the lo-
cation and movement of people and vehicles, which is used for naviga-
tion and location-based services.
Data Science (cont.)

The field of data science focuses on extracting valuable information from


various data sources to make informed decisions and gain a deeper
understanding of the world around us.
Data Science Process
The data science process involves a series of steps that guide the
transformation of raw data into meaningful insights and actionable
outcomes.
1 Business Understanding

2 Data Collection
3 Data Processing
4 Data Understanding (EDA)
5 Model building and Deployment

Business Understanding: There are two main tasks addressed in this


stage:
Define objectives: Work with your customer and other stakeholders to
understand and identify the business problems. Formulate questions
that define the business goals that the data science techniques can
target.
Data Science Process (cont.)

Identify KPIs: For any data science project, key performance indicators define the performance or success of the project.
Identify data sources: Find the relevant data that helps you answer the questions that define the objectives of the project.
Goals: Determine whether the customer wishes to make predictions, optimise the company's overall spending, improve sales, minimise losses, or optimise a particular process, etc.

Data Collection: Data collection is the next stage in the data science life cycle, in which raw data is gathered from relevant sources. The data captured can be either in structured or unstructured form. The data might come from website logs, social media, online repositories, data streamed from online sources via APIs or web scraping, or data present in Excel or any other source.
Data Science Process (cont.)

Data Preprocessing: There are four major tasks in data preprocessing: data cleaning, data integration, data reduction, and data transformation.
Data Cleaning
1 Data cleaning is the process of removing incorrect data, incomplete
data, and inaccurate data from the datasets, and it also replaces the
missing values.
Here are some techniques for data cleaning:
► Handling Missing Values: Missing values can be handled by many
techniques, such as removing rows/columns containing NULL values and
imputing NULL values using mean, mode, regression, etc.
2 Data Integration: Data integration can be defined as combining data from multiple sources.
A few of the issues to be considered during data integration include the following:
Data Science Process (cont.)

► Entity Identification Problem - It can be defined as identifying ob-


jects/features from multiple databases that correspond to the same
entity. For example, in database A customer-id, and in database B
customer-number belong to the same entity.
► Schema Integration: - It is used to merge two or more database
schema/metadata into a single schema. It essentially takes two or more
schema as input and determines a mapping between them. For example,
entity type CUSTOMER in one schema may have CLIENT in another
schema.
► Detecting and Resolving Data Value Concepts - The data can be
stored in various ways in different databases, and it needs to be taken care
of while integrating them into a single dataset. For example, dates can
be stored in various formats such as DD/MM/YYYY, YYYY/MM/DD,
or MM/DD/YYYY, etc.
Data Science Process (cont.)

3 Data Reduction: This process helps reduce the volume of the data, which makes the analysis easier yet produces the same or almost the same result. This reduction also helps to reduce storage space. One of the data reduction techniques is dimensionality reduction.

► Dimensionality reduction: It is the process of reducing the number of


features (or dimensions) in a dataset while retaining as much information
as possible. This can be done for a variety of reasons, such as to reduce
the complexity of a model, to improve the performance of a learning
algorithm, or to make it easier to visualize the data.
There are several techniques for dimensionality reduction, including
principal component analysis (PCA), singular value decomposition (SVD),
and linear discriminant analysis (LDA). Each technique uses a different
method to project the data onto a lower-dimensional space while pre-
serving important information.
4 Data Transformation: Data transformation involves the following steps:
Data Science Process (cont.)

► Discretization: Continuous data is split into intervals. Discretization reduces the data size. For example, rather than specifying the exact class time, we can set an interval like 3 pm-5 pm or 6 pm-8 pm.
► Normalization: This is the method of scaling the data so that it can be represented in a smaller range, for example from -1.0 to 1.0.
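As a brief illustration of two of the steps above, the following minimal R sketch (on a small hypothetical vector) imputes missing values with the mean and then applies min-max normalization to the range [0, 1]:
# Hypothetical sample with missing values
x <- c(12, 15, NA, 20, 18, NA, 25)
# Data cleaning: impute NA values with the mean of the observed values
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
# Data transformation: min-max normalization to the range [0, 1]
x_scaled <- (x_imputed - min(x_imputed)) / (max(x_imputed) - min(x_imputed))
x_scaled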
Data Understanding (Exploratory Data Analysis):

Exploratory data analysis (EDA) involves using graphics and visualizations


to explore and analyze a data set. The goal is to explore, investigate and
learn, as opposed to confirming statistical hypotheses.

The process of using numerical summaries and visualizations to explore


your data and to identify potential relationships between variables is called
exploratory data analysis, or EDA. Because EDA involves exploring, it is
iterative. You are likely to learn different aspects about your data from
different graphs.
Data Science Process (cont.)

Typical goals are understanding:


The distribution of variables in your data set. That is, what is the
shape of your data? Is the distribution skewed? Bimodal?
► Histograms show the shape of your data. The horizontal axis shows your
data values, where each bar includes a range of values. The vertical axis
shows how many points in your data have values in the specified range
for the bar.
Data Science Process (cont.)

In the histogram, the bars show the count of values in each range.

The histogram shows that the center of the data is somewhere around
45 and the spread of the data is from about 30 to 65. It also shows the
shape of the data as roughly mound-shaped. This shape is a visual clue
that the data is likely to be from a normal distribution.
► How are extreme data values observed in histograms? Histograms are affected by extreme values, or outliers.
Data Science Process (cont.)
► How skewness is observed in histograms?
Not all histograms are symmetrical. Histograms display the distribution
of your data, and there are many common types of distributions. For
example, data is often nonsymmetrical. In statistics, this is called skewed
data. For example, the battery life for a phone is often skewed, with some
phones having a much longer battery life than most.
Data Science Process (cont.)
The relationships between variables.
► Scatter plots show relationships. Scatter plots show how two continuous
variables are related by putting one variable on the x-axis and a second
variable on the y-axis.

Example 1: Increasing relationship


Data Science Process (cont.)
Example 2: Decreasing relationship
Data Science Process (cont.)
Example 3: Curved relationship
Data Science Process (cont.)

Example 4: Outliers in scatter plots. Unusual points, or outliers, in the data stand out in scatter plots.

Whether or not your data have outliers or unusual points that may
indicate data quality issues or lead to interesting insights.
Whether or not your data have patterns over time.
Data Science Process (cont.)

Model Building: This step involves selecting an appropriate model based on the problem type (classification, regression, clustering, etc.) and training the model on the processed data. It also involves tuning the hyperparameters to find the optimal parameter settings for the model.

Model Evaluation: After the model is built, it is important to evaluate its


performance. This usually involves splitting the data into a training set
and a test set, and comparing the model’s predictions on the test set with
the actual values. Metrics such as accuracy, precision, recall, F1-score,
ROC curve, etc. are used for evaluation.
Examples of Algorithms- Recommendations
Recommendation System: A recommendation system (or recommender
system) is a class of machine learning that uses data to help predict, narrow
down, and find what people are looking for among an exponentially growing
number of options.
Types of Recommendation Systems:
Collaborative Filtering: Collaborative Filtering recommends items based
on similarity measures between users and/or items. The basic assumption
behind the algorithm is that users with similar interests have common pref-
erences.
For example: after a user buys a ceiling fan, the system starts recommending a light. This is because many people who buy ceiling fans also buy lights, not because the light and the ceiling fan are inherently related; this information is generally extracted from users' transaction records.
Content-based recommendation refers to a recommendation system ap-
proach that leverages the characteristics of items and a profile of the user’s
Examples of Algorithms- Recommendations (cont.)
preferences to suggest items. These systems primarily utilize descriptions of
items to recommend additional items similar to what the user likes, based
on their previous actions or explicit feedback.
For example: when a user buys a Canon D450 camera, the system starts recommending lenses and other similar camera models. These recommendations are based only on products related to the main item through attributes such as model or compatible lenses, and these product details are taken from the stored item data.
Collaborative filtering based recommender systems can be:
Memory-Based
Model-Based
Hybrid
Deep Learning
Examples of Algorithms- Recommendations (cont.)
Memory Based Collaborative filtering categories:
User-based collaborative filtering is a technique used in recommenda-
tion systems where items are recommended to a user based on the
preferences of similar users.
Item-based collaborative filtering, also known as item-item collabora-
tive filtering, is a technique used in recommendation systems where
the similarities between items are used to recommend items. Rather
than looking for users who are similar to the target user, item-based
collaborative filtering looks for items that are similar to items the user
has interacted with.
Similarity is typically measured with techniques like cosine similarity, Pearson correlation, or the Jaccard index.
Examples of Algorithms- Recommendations (cont.)

Steps for User-Based Collaborative Filtering:


Step 1: Finding the similarity of users to the target user U. The similarity between any two users a and b can be calculated as the cosine of their mean-centered rating vectors:
sim(a, b) = Σ_p (r_ap − r̄_a)(r_bp − r̄_b) / ( sqrt(Σ_p (r_ap − r̄_a)²) · sqrt(Σ_p (r_bp − r̄_b)²) )

Step 2: Prediction of missing rating of an item. Now, the target user


might be very similar to some users and may not be much similar to others.
Hence, the ratings given to a particular item by the more similar users should
be given more weightage than those given by less similar users and so on.
This problem can be solved by using a weighted-average approach. In this approach, we multiply the (mean-centered) rating of each user by the similarity factor calculated using the above-mentioned formula. The missing rating can then be calculated as
r(U, i) = r̄_U + Σ_v sim(U, v) · (r(v, i) − r̄_v) / Σ_v |sim(U, v)|
where the sum runs over the users v who have rated item i.
Examples of Algorithms- Recommendations (cont.)

User Based CF Example: Consider a matrix that shows four users (Alice, U1, U2 and U3) rating different news apps. The rating range is from 1 to 5, based on how much the user likes the news app. A blank entry indicates that the user has not rated the app.
Examples of Algorithms- Recommendations (cont.)
Step 1: Calculating the similarity between Alice and all the other users
First, we calculate the average rating r̄_i of each user, excluding item I5 since it is not rated by Alice:
r̄_i = ( Σ_p r_ip ) / p
This gives r̄_Alice = 3.5, r̄_U1 = 2.25, r̄_U2 = 3.5, r̄_U3 = 3.
We then calculate the mean-centered ratings as r′_ip = r_ip − r̄_i.
Examples of Algorithms- Recommendations (cont.)

Hence, we get the following matrix,


Examples of Algorithms- Recommendations (cont.)

Now, we calculate the similarity between Alice and all the other users:
Sim(Alice, U1) = ((1.5)(0.75) + (0.5)(−1.25) + (−2.5)(−0.25) + (0.5)(0.75)) / ( sqrt(1.5² + 0.5² + 2.5² + 0.5²) · sqrt(0.75² + 1.25² + 0.25² + 0.75²) ) = 0.301
Sim(Alice, U2) = ((1.5)(0.5) + (0.5)(−0.5) + (−2.5)(0.5) + (0.5)(−0.5)) / ( sqrt(1.5² + 0.5² + 2.5² + 0.5²) · sqrt(0.5² + 0.5² + 0.5² + 0.5²) ) = −0.33
Sim(Alice, U3) = ((1.5)(0) + (0.5)(0) + (−2.5)(−2) + (0.5)(2)) / ( sqrt(1.5² + 0.5² + 2.5² + 0.5²) · sqrt(0² + 0² + 2² + 2²) ) = 0.707

Step 2: Predicting the rating of the app not rated by Alice
Now, we predict Alice's rating for the BBC News app (I5):
r(Alice, I5) = r̄_Alice + [ sim(Alice, U1)·(r(U1, I5) − r̄_U1) + sim(Alice, U2)·(r(U2, I5) − r̄_U2) + sim(Alice, U3)·(r(U3, I5) − r̄_U3) ] / ( |sim(Alice, U1)| + |sim(Alice, U2)| + |sim(Alice, U3)| )
r(Alice, I5) = 3.5 + ((0.301)(0.75) + (−0.33)(1.5) + (0.707)(1)) / ( |0.301| + |−0.33| + |0.707| ) = 3.83
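The computation above can be reproduced with a few lines of R. The sketch below assumes the mean-centered ratings (r′ = r − r̄) implied by the formulas in the worked example; it is an illustration of the calculation, not code from the original slides.
# Mean-centered ratings for items I1-I4, taken from the worked example above
alice <- c(1.5, 0.5, -2.5, 0.5)
u1 <- c(0.75, -1.25, -0.25, 0.75)
u2 <- c(0.5, -0.5, 0.5, -0.5)
u3 <- c(0, 0, -2, 2)
# Cosine similarity of mean-centered ratings
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
sims <- c(cos_sim(alice, u1), cos_sim(alice, u2), cos_sim(alice, u3))  # 0.301, -0.33, 0.707
# Mean-centered ratings the three users gave to I5, and Alice's average rating
i5_centered <- c(0.75, 1.5, 1)
r_alice_bar <- 3.5
# Weighted-average prediction of Alice's rating for I5
r_alice_bar + sum(sims * i5_centered) / sum(abs(sims))   # approximately 3.83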
Examples of Algorithms - Validation
Validation in machine learning (ML) is an essential step to assess how
well a model will generalize to new, previously unseen data. It provides a
mechanism to tune, compare, and select models. Without validation,
there’s a risk of overfitting, where a model performs well on the training
data but poorly on new data.

Different Validation Methods:


In the holdout method, the data set is partitioned so that most of the data belongs to the training set and the remaining data belongs to the test set.
Pros:
1. It is a straightforward and easy method to implement.
2. It is computationally less expensive compared to other methods like k-fold cross-validation.
Cons:
1. The performance evaluation might have high variance, especially if the dataset is small. The reason is that the evaluation heavily depends
Examples of Algorithms - Validation (cont.)

on which data points end up in the training set and which ones in the
validation/test set.
Random subsampling, also known as repeated hold-out validation, is
an extension of the hold-out validation method. Instead of splitting the
data into training and test sets once, the process is repeated multiple
times with different random splits. This method helps address the
variability issue seen in the standard hold-out validation method by
averaging performance over multiple random splits.
K-fold cross-validation: In this technique, the whole dataset is partitioned into k parts of equal size, and each partition is called a fold. It is known as k-fold since there are k parts, where k can be any integer: 3, 4, 5, etc.
One fold is used for validation and the other k-1 folds are used for training the model. So that every fold serves as a validation set and the remaining folds as a training set, this procedure is repeated k times until each fold is used
Examples of Algorithms - Validation (cont.)

once. To get the final accuracy, you average the accuracy obtained on the validation data of the k models.
This validation technique is not considered suitable for imbalanced datasets, as the folds may not preserve the ratio of each class's data, so the model will not get trained properly.
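A minimal R sketch of k-fold cross-validation using the caret package (which is also used later in this unit); the model and dataset here are placeholders chosen only to make the example self-contained:
library(caret)
set.seed(123)
# 5-fold cross-validation of a decision tree on the built-in iris data
ctrl <- trainControl(method = "cv", number = 5)
model <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)
model$resample   # accuracy measured on each of the 5 validation folds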
Examples of Algorithms - Validation (cont.)

Stratified k-fold cross-validation As seen above, k-fold validation


can’t be used for imbalanced datasets because data is split into k-folds
with a uniform probability distribution. Not so with stratified k-fold,
which is an enhanced version of the k-fold cross-validation technique.
Although it too splits the dataset into k equal folds, each fold has the
same ratio of instances of target variables that are in the complete
dataset. This enables it to work perfectly for imbalanced datasets,
but not for time-series data. In the stratified k-fold cross-validation
technique, this ratio of instances of the target variable is maintained in
all the folds.
Examples of Algorithms - Validation (cont.)
Examples of Algorithms - Validation (cont.)

Leave-p-out cross-validation: An exhaustive cross-validation tech-


nique, where p samples are used as the validation set and n-p samples
are used as the training set if a dataset has n samples. The process
is repeated until the entire dataset containing n samples gets divided
on the validation set of p samples and the training set of n-p samples.
This continues till all samples are used as a validation set.
The technique, which has a high computation time, produces good results. However, it is not considered ideal for an imbalanced dataset and is deemed computationally infeasible. This is because if the training set ends up with all samples of one class, the model will not be able to generalize properly and will become biased toward one of the classes.
Examples of Algorithms - Validation (cont.)

Leave-one-out cross-validation In this technique, only 1 sample point


is used as a validation set and the remaining n-1 samples are used in
the training set. Think of it as a more specific case of the leave-p-out
cross-validation technique with P=1.
To understand this better, consider this example: There are 1000 in-
stances in your dataset. In each iteration, 1 instance will be used for
the validation set and the remaining 999 instances will be used as the
training set. The process repeats itself until every instance from the
dataset is used as a validation sample.
Examples of Algorithms - Validation (cont.)

Monte Carlo cross-validation: Also known as shuffle-split cross-validation and repeated random subsampling cross-validation, the Monte Carlo technique involves splitting the whole data into training data and test data. Splitting can be done in the percentage of 70-30% or 60-40%, or anything you prefer. The only condition for each iteration is to keep the train-test split percentage different. Repeat these iterations
Examples of Algorithms - Validation (cont.)

many times - 100,400,500 or even higher - and take the average of all
the test errors to conclude how well your model performs.
Examples of Algorithms - Validation (cont.)

Time series (rolling cross-validation / forward chaining method)


Time series is the type of data collected at different points in time. This
kind of data allows one to understand what factors influence certain
variables from period to period. Some examples of time series data are
weather records, economic indicators, etc.
To begin, start the training with a small subset of the data. Perform forecasting for the later data points and check their accuracy. The forecasted data points are then included as part of the next training dataset, and the subsequent data points are forecasted. The process goes on in this rolling fashion.
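A hedged sketch of rolling-origin (forward-chaining) splits using caret's createTimeSlices(); the 20-point series is a stand-in for real time-ordered data:
library(caret)
y <- 1:20   # placeholder for 20 time-ordered observations
slices <- createTimeSlices(y, initialWindow = 10, horizon = 2, fixedWindow = FALSE)
slices$train[[1]]   # first training window: observations 1..10
slices$test[[1]]    # forecast the next 2 points: observations 11..12
slices$train[[2]]   # the window then grows: observations 1..11, and so on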
Examples of Algorithms - Validation (cont.)
Examples of Algorithms- Classification

Naïve Bayes for data with nominal attributes
Given the training data in the table below (Buy Computer data), predict the class of the following new example using Naïve Bayes classification:
age <= 30, income = medium, student = yes, credit-rating = fair
Examples of Algorithms- Classification (cont.)
Examples of Algorithms- Classification (cont.)
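The training table and the worked solution appear on slides not reproduced in this text. As a hedged illustration of the mechanics only, the sketch below trains e1071's naiveBayes() on a few made-up rows in the style of the Buy Computer data and scores the query instance; the rows are placeholders, not the original table, so the predicted class may differ from the slide solution.
library(e1071)
# Hypothetical rows for illustration only (not the original Buy Computer table)
buy <- data.frame(
  age = c("<=30", "<=30", "31..40", ">40", ">40", "31..40"),
  income = c("high", "medium", "high", "medium", "low", "low"),
  student = c("no", "yes", "no", "no", "yes", "yes"),
  credit_rating = c("fair", "excellent", "fair", "fair", "fair", "excellent"),
  buys_computer = c("no", "no", "yes", "yes", "yes", "yes"),
  stringsAsFactors = TRUE
)
model <- naiveBayes(buys_computer ~ ., data = buy)
# The new example from the question
newx <- data.frame(age = "<=30", income = "medium", student = "yes", credit_rating = "fair")
predict(model, newx)                 # predicted class
predict(model, newx, type = "raw")   # class probabilities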
Histograms Using R

A histogram represents the frequencies of values of a variable bucketed into ranges. A histogram is similar to a bar chart, but the difference is that it groups the values into continuous ranges. Each bar in a histogram represents the count of values present in that range.
R creates histograms using the hist() function. This function takes a vector as input and uses some more parameters to plot histograms.
Syntax
The basic syntax for creating a histogram using R is
hist(v,main,xlab,xlim,ylim,breaks,col,border)
Following is the description of the parameters used
v is a vector containing numeric values used in histogram.
main indicates title of the chart.
Histograms Using R (cont.)

col is used to set color of the bars.


border is used to set border color of each bar.
xlab is used to give description of x-axis.
xlim is used to specify the range of values on the x-axis.
ylim is used to specify the range of values on the y-axis.
breaks is used to control the number of bins (break points) in the histogram.
Example
A simple histogram is created using input vector, label, col and border
parameters.
# Create data for the graph.
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)
# Give the chart file a name.
png(file = "histogram.png")
Histograms Using R (cont.)

# Create the histogram.
hist(v, xlab = "Weight", col = "yellow", border = "blue")
# Save the file.
dev.off()
# To specify the range of values allowed on the X axis and Y axis, we can use the xlim and ylim parameters.
hist(v, xlab = "Weight", col = "green", border = "red", xlim = c(0, 40), ylim = c(0, 5), breaks = 5)
Histograms Using R (cont.)
Linear Regression
Linear regression is a regression model that uses a straight line to describe
the relationship between variables. It finds the line of best fit through your
data by searching for the value of the regression coefficient(s) that
minimizes the total error of the model.
There are two main types of linear regression:
Simple linear regression uses only one independent variable
Multiple linear regression uses two or more independent variables
Linear Regression (cont.)

Steps to implement Linear Regression in R


Step 1: Load the data into R
Step 2: Make sure your data meet the assumptions
Step 3: Perform the linear regression analysis
Step 4: Check for homoscedasticity
Step 5: Visualize the results with a graph
Step 1: Load the data into R
In RStudio, go to File > Import Dataset > From Text (base).
Choose the data file, and an Import Dataset window pops up.
Click on the Import button and the file should appear in your Environment tab on the upper right side of the RStudio screen.
Step 2: Make sure your data meet the assumptions Simple linear
regression is a parametric test, meaning that it makes certain assumptions
about the data. These assumptions are:
Linear Regression (cont.)

Homogeneity of variance (homoscedasticity): the prediction error does not change significantly over the range of prediction of the model.
Independence of observations: the independent variables are not highly correlated with each other (no multicollinearity).
Normality: the data follows a normal distribution.
Linearity: the relationship between the independent and dependent variable is linear; the line of best fit through the data points is a straight line (rather than a curve or some sort of grouping factor).
Checking normality: to check whether the dependent variable follows a normal distribution, use the hist() function.

hist(income.data$happiness)
Linear Regression (cont.)

Linearity The relationship between the independent and dependent


variable must be linear. We can test this visually with a scatter plot to see
if the distribution of data points could be described with a straight line.

plot(happiness ~ income, data = income.data)


Linear Regression (cont.)
Linear Regression (cont.)

Independence of observations (aka no autocorrelation) Use the cor()


function to test the relationship between your independent variables and
make sure they aren’t too highly correlated.
cor(heart.data$biking, heart.data$smoking)
Step 3: Perform the linear regression analysis Now that you’ve
determined your data meet the assumptions, you can perform a linear
regression analysis to evaluate the relationship between the independent
and dependent variables.
income.happiness.lm <- lm(happiness ~ income, data = income.data)
summary(income.happiness.lm)
Linear Regression (cont.)
Linear Regression (cont.)

Interpretation of Regression Output


Residuals
The residuals are the difference between the actual values and the
predicted values.
Coefficients
Coefficients — Estimate

Using the coefficient estimates provided in the output above, we can now build out the equation for our model.
Coefficients — Std. Error
The standard error of the coefficient is an estimate of the standard
deviation of the coefficient. In effect, it is telling us how much uncertainty
there is with our coefficient.
Linear Regression (cont.)

Coefficients — t value
The t-statistic is simply the coefficient divided by the standard error. In
general, we want our coefficients to have large t-statistics, because it
indicates that our standard error is small in comparison to our coefficient.
Coefficients — Pr(> |t|) and Signif. codes
The p-value is calculated using the t-statistic from the t-distribution. The p-value, in association with the t-statistic, helps us understand how significant our coefficient is to the model. In practice, any p-value below 0.05 is usually deemed significant.
Residual Standard Error
The residual standard error is a measure of how well the model fits the
data.
Linear Regression (cont.)

Multiple R-squared and Adjusted R-squared


The Multiple R-squared value is most often used for simple linear
regression (one predictor). It tells us what percentage of the variation
within our dependent variable that the independent variable is explaining.
In other words, it’s another method to determine how well our model is
fitting the data.
The Adjusted R-squared value shows what percentage of the variation
within our dependent variable that all independent variables are explaining.
F-statistic and p-value
When running a regression model, either simple or multiple, a hypothesis
test is being run on the global model. The null hypothesis is that there is
no relationship between the dependent variable and the independent
variable(s) and the alternative hypothesis is that there is a relationship.
Said another way, the null hypothesis is that the coefficients for all of the
Linear Regression (cont.)

variables in your model are zero. The alternative hypothesis is that at least
one of them is not zero. The F-statistic and overall p-value help us
determine the result of this test
However, for smaller models, a larger F-statistic generally indicates that
the null hypothesis should be rejected. A better approach is to utilize the
p-value that is associated with the F-statistic. Again, in practice, a p-value
below 0.05 generally indicates that you have at least one coefficient in
your model that isn’t zero.
Step 4: Check for homoscedasticity
We can run plot(income.happiness.lm) to check whether the observed data
meets our model assumptions:
par(mfrow=c(2,2))
plot(income.happiness.lm)
Linear Regression (cont.)

par(mfrow=c(1,1))
Linear Regression (cont.)

Step 5: Visualize the results with a graph
Plot the data points on a graph (the ggplot2 package provides ggplot() and the geoms):
library(ggplot2)
income.graph <- ggplot(income.data, aes(x = income, y = happiness)) + geom_point()
Add the linear regression line to the plotted data:
income.graph <- income.graph + geom_smooth(method = "lm", col = "black")
income.graph
Linear Regression (cont.)
Clustering Using R

What is Clustering in R?
Clustering is a technique of data segmentation that partitions the data
into several groups based on their similarity.
Applications of R clustering are as follows:
1 Marketing – In the area of marketing, we use clustering to explore

and select customers that are potential buyers of the product. This differentiates the customers most likely to buy from those with the least tendency to purchase the product. After the clusters have
been developed, businesses can keep a track of their customers and
make necessary decisions to retain them in that cluster.
2 Retail – Retail industries make use of clustering to group customers
based on their preferences, style, choice of wear as well as store prefer-
ences. This allows them to manage their stores in a much more efficient
manner.
Clustering Using R (cont.)
3 Medical Science – Medicine and health industries make use of clus-
tering algorithms to facilitate efficient diagnosis and treatment of their
patients as well as the discovery of new medicines. Based on the age group and genetic coding of the patients, these organisations are better able to understand diagnoses through robust clustering.
4 Sociology – Clustering is used in Data Mining operations to divide
people based on their demographics, lifestyle, socioeconomic status,
etc. This can help the law enforcement agencies to group potential
criminals and even identify them with an efficient implementation of
the clustering algorithm.
Clustering Using R (cont.)

Methods for Measuring Distance between Objects

1 Proximity measures for Nominal Attributes


Nominal attributes can have two or more different states, e.g. an attribute 'color' can have values like 'Red', 'Green', 'Yellow', etc. Dissimilarity for nominal attributes is calculated as the ratio of the total number of mismatches between two data tuples to the total number of attributes.
Let M be the total number of states of a nominal attribute. Then the states can be numbered from 1 to M. However, the numbering does not denote any kind of ordering and cannot be used for any mathematical operations.
Let m be the total number of matches between two tuples' attributes and p be the total number of attributes; then the dissimilarity can be calculated as
Clustering Using R (cont.)

d(i, j) = (p − m) / p
We can calculate similarity as
s(i, j) = 1 − d(i, j)

2 Proximity measures for Binary Attributes: Since binary attributes are similar to nominal attributes, proximity measures for binary attributes are also similar to those for nominal attributes. For symmetric binary attributes, the process is the same, i.e.
d(i, j) = (p − m) / p
However, for asymmetric binary attributes, we drop the number of matched zeros (where the attribute of both tuples is zero).
Clustering Using R (cont.)

Let s be the number of cases where the matched attributes are both zero; then
d(i, j) = (p − m) / (p − s)
We can calculate similarity as
s(i, j) = 1 − d(i, j)

3 Proximity Measures for Numeric Data (Minkowski Distance): Distance or dissimilarity between two numeric attributes is commonly measured using the Minkowski, Manhattan or Euclidean distance. It is important to scale the data points to a common range, usually [0, 1] or [−1, 1]. This is to avoid attributes having high values outweighing those with lower values.
Clustering Using R (cont.)

Euclidean distance is the most popular distance metric for numeric attributes, also known as straight-line distance. It can be calculated as
d(i, j) = sqrt( |x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|² )
Another popular distance measure is the Manhattan or city-block distance, which is calculated as
d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|
(An R sketch of these proximity measures appears after this list.)

4 Proximity measures for ordinal attributes: Ordinal attributes have a meaningful order among their attribute values; therefore, they are treated similarly to numeric attributes. However, to do so, it is important to convert the states to numbers, where each state of an ordinal attribute is assigned a number corresponding to the order of the attribute
Clustering Using R (cont.)

values. For example, if a grading system has grades A, B and C, then the numbers can be given as C = 1, B = 2 and A = 3.
Since the number of states can be different for different ordinal attributes, it is required to scale the values to a common range, e.g. [0, 1]. This can be done using the formula
z_if = (r_if − 1) / (M_f − 1)

where M_f is the maximum number assigned to the states of attribute f and r_if is the rank (numeric value) of a particular object.
After the scaling is done, we can simply apply the same distance metrics as given for numeric attributes. The similarity can be calculated as:
s(i, j) = 1 − d(i, j)
Clustering Using R (cont.)

5 Proximity Measures for Mixed Attribute Types: Real-world data is often described by a mixture of different types of attributes, so it is important to define a proximity measure for such data.
The approach is to combine all the attributes into a single dissimilarity matrix, bringing all meaningful attributes onto a common scale of [0, 1].
Clustering Using R (cont.)
Clustering Using R (cont.)
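A short R sketch of the nominal and numeric proximity measures described above, using hypothetical tuples; dist() is base R:
# Nominal attributes: d(i, j) = (p - m) / p for two hypothetical tuples
i <- c(color = "Red", shape = "Round", size = "Small")
j <- c(color = "Red", shape = "Square", size = "Small")
p <- length(i)       # total number of attributes
m <- sum(i == j)     # number of matching attributes
d_nominal <- (p - m) / p
s_nominal <- 1 - d_nominal
c(dissimilarity = d_nominal, similarity = s_nominal)
# Numeric attributes: scale each column to [0, 1], then compute distances
x <- rbind(c(2, 10, 4), c(8, 4, 6), c(5, 7, 9))
x <- apply(x, 2, function(col) (col - min(col)) / (max(col) - min(col)))
dist(x, method = "euclidean")   # straight-line distance between the objects
dist(x, method = "manhattan")   # city-block distance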

K-Means Clustering in R
K-Means is an iterative hard-clustering technique that uses an unsupervised learning algorithm. The total number of clusters is pre-defined by the user, and data points are assigned to clusters based on their similarity. The algorithm also finds the centroid of each cluster.
Algorithm:
1 Specify the number of clusters (K): let us take an example of k = 2 and 5 data points.
2 Randomly assign each data point to a cluster: in the example, the red and green colours show 2 clusters with their respective random data points assigned to them.
3 Calculate the cluster centroids.
4 Re-allocate each data point to its nearest cluster centroid.
5 Re-compute the cluster centroids, and repeat steps 4-5 until the assignments no longer change.
Clustering Using R (cont.)

Syntax: kmeans(x, centers, nstart)


where,
x represents numeric matrix or data frame object
centers represents the K value or distinct cluster centers
nstart represents number of random sets to be chosen

R Code:
# Library required for the fviz_cluster function
install.packages("factoextra")
library(factoextra)
# Loading dataset
df <- mtcars
# Omitting any NA values
df <- na.omit(df)
Clustering Using R (cont.)

# Scaling dataset
df <- scale(df)
# Output to be saved as a PNG file
png(file = "KMeansExample.png")
km <- kmeans(df, centers = 4, nstart = 25)
# Visualize the clusters
fviz_cluster(km, data = df)
# Saving the file
dev.off()
Clustering Using R (cont.)
Text Analytics

Text mining, also known as text analytics, is the process of extracting


valuable insights, patterns, and information from a vast amount of
unstructured text data using different algorithms and techniques. It’s a
subset of data mining that focuses specifically on textual information.
Text analysis is a comprehensive process that involves several stages to
transform raw textual data into meaningful insights. The stages in text
analysis can be broken down as follows:
1 Data Collection:

► Sourcing: Determine where your text data is coming from, which could
include websites, databases, social media platforms, customer reviews,
or other text-rich sources.
► Acquisition: Use methods like web scraping, database queries, or APIs
to collect the data.
2 Data Preprocessing:
► Cleaning: Remove noise such as HTML tags, URLs, non-textual con-
tent, or any irrelevant text portions.
Text Analytics (cont.)

► Normalization: Make text consistent, e.g., converting everything to


lowercase.
► Tokenization: Split the text into individual words or terms. Example:
”Text mining is fun!” → [”Text”, ”mining”, ”is”, ”fun!”]
► Stopword Removal: Eliminate common words that might not have
significant meaning in analysis, such as ”and”, ”the”, and ”is”.
► Stemming and Lemmatization: Convert words to their base or root
form (e.g., ”running” to ”run”).
3 Exploratory Data Analysis (EDA): Understand the basic properties
of the text data, such as most common words, length distribution of
the documents, and word frequency distributions. Visualize the data
through word clouds, bar plots, histograms, etc.
4 Feature Extraction and Engineering:
Vectorization: Convert text into numerical format using techniques
like Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency
(TF-IDF), or word embeddings like Word2Vec and GloVe.
Text Analytics (cont.)

► A Document-Term Matrix (DTM) represents a collection of text data in a matrix format, where rows correspond to documents and columns correspond to terms (typically words or n-grams). Each cell in the matrix represents the frequency (or an importance weight like TF-IDF) of a term in a particular document.
Example:
Suppose we have the following three short documents:
”apple banana”
”apple orange apple”
”banana orange”
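The resulting matrix appears on a slide image; as a sketch, the same document-term matrix can be built with quanteda (the package used later in this unit):
library(quanteda)
docs <- c("apple banana", "apple orange apple", "banana orange")
dtm <- dfm(tokens(docs))
as.matrix(dtm)
#        apple banana orange
# text1      1      1      0
# text2      2      0      1
# text3      0      1      1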
Text Analytics (cont.)

► TF-IDF, which stands for Term Frequency-Inverse Document Fre-


quency, is a numerical statistic that reflects how important a word is to
a document in a collection or corpus. It’s one of the most popular tech-
niques for transforming text into a meaningful representation of numbers
which can be used for various machine learning tasks.
The formula for TF-IDF is:
TF-IDF = TF × IDF
where:
TF (Term Frequency) is the number of times a term appears in a document divided by the total number of terms in that document.
IDF (Inverse Document Frequency) is the logarithm of the total number of documents divided by the number of documents containing the term.
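A hand-rolled sketch of these definitions applied to the three toy documents above (quanteda's dfm_tfidf() offers a similar weighting, though with slightly different defaults):
docs <- list(c("apple", "banana"),
             c("apple", "orange", "apple"),
             c("banana", "orange"))
terms <- sort(unique(unlist(docs)))
# TF: term count divided by the number of terms in the document
tf <- t(sapply(docs, function(d) table(factor(d, levels = terms)) / length(d)))
# IDF: log of (number of documents / documents containing the term)
df <- colSums(tf > 0)
idf <- log(length(docs) / df)
# TF-IDF = TF * IDF
round(sweep(tf, 2, idf, `*`), 3)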
Text Analytics (cont.)
Text Analytics (cont.)

Advantages :
1 Weighting Important Terms: TF-IDF inherently weighs terms based on
their importance. Words that are frequent in a single document but rare
across documents get a higher weight, potentially emphasizing unique
or more relevant information.
Text Analytics (cont.)

2 Downweighting Common Terms: Common words (like ”and”, ”the”,


”is”) appear in many documents. In a DTM, these terms will have high
frequencies. TF-IDF naturally downweights such terms due to the IDF
component.
3 Improved Performance for Certain Tasks: For some machine learning
and information retrieval tasks, TF-IDF representations can lead to bet-
ter performance compared to raw frequency counts in a DTM.
N-grams are contiguous sequences of n items (words, characters, sym-
bols) from a given sample of text or speech. They’re useful in many
natural language processing (NLP) tasks because they capture the local
structure or context within the text, which can be missed when consid-
ering words individually.
The term ’N-gram’ is a general one; when n is specified, you get:
1 gram (or unigram): A single item
2 gram (or bigram): Two consecutive items

3 gram (or trigram): Three consecutive items


Text Analytics (cont.)

Example:
Let’s consider the sentence:
”I love to play football.”
Unigrams: ”I”, ”love”, ”to”, ”play”, ”football”
Bigrams: ”I love”, ”love to”, ”to play”, ”play football”
Trigrams: ”I love to”, ”love to play”, ”to play football”
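A small sketch of n-gram generation with quanteda's tokens_ngrams() (terms are joined with an underscore by default):
library(quanteda)
toks <- tokens("I love to play football", remove_punct = TRUE)
tokens_ngrams(toks, n = 2)   # bigrams: "I_love", "love_to", "to_play", "play_football"
tokens_ngrams(toks, n = 3)   # trigrams: "I_love_to", "love_to_play", "to_play_football"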

► Modeling (For Predictive Tasks):


1. Depending on the problem (e.g., text classification, sentiment analy-
sis, or topic modeling), you might apply various algorithms.
2. Use machine learning or deep learning models to train on your features
to make predictions or derive insights.
Text Analytics (cont.)

► Evaluation:
1. After building models, assess their performance using appropriate
evaluation metrics like accuracy, F1-score, recall, precision, etc.
2. Use methods like cross-validation to ensure model generalizability.
► Deployment: If the goal is a real-world application, such as a rec-
ommendation system, chatbot, or sentiment analysis tool, deploy the
trained model to production.
Feature Engineering: Derive new features based on the textual
content, like sentiment scores, length of text, readability scores, etc.
Text Analytics Using R

# Install all required packages.
install.packages(c("ggplot2", "e1071", "caret", "quanteda"))
# Load up the .CSV data and explore in RStudio.
spam.raw <- read.csv("spam.csv", stringsAsFactors = FALSE, fileEncoding = "UTF-16")
View(spam.raw)
Text Analytics Using R (cont.)
Text Analytics Using R (cont.)

# Check data to see if there are missing values.
length(which(!complete.cases(spam.raw)))
# Convert our class label into a factor.
spam.raw$Label <- as.factor(spam.raw$Label)
# The first step, as always, is to explore the data.
# First, let's take a look at the distribution of the class labels (i.e., ham vs. spam).
prop.table(table(spam.raw$Label))
Text Analytics Using R (cont.)

# Use caret to create a 70%/30% stratified split. Set the random seed for reproducibility.
library(caret)
set.seed(32984)
indexes <- createDataPartition(spam.raw$Label, times = 1, p = 0.7, list = FALSE)
train <- spam.raw[indexes, ]    # Create train set
test <- spam.raw[-indexes, ]    # Create test set
Text Analytics Using R (cont.)

# Verify proportions.
prop.table(table(train$Label))
prop.table(table(test$Label))
Text Analytics Using R (cont.)

# Text analytics requires a lot of data exploration, data pre-processing


and data wrangling. Let’s explore some examples.
train$Text[21]
Text Analytics Using R (cont.)

# There are many packages in the R ecosystem for performing text analytics. One of the newer packages is quanteda. The quanteda package has many useful functions for quickly and easily working with text data.
library(quanteda)
help(package = "quanteda")
Text Analytics Using R (cont.)
Text Analytics Using R (cont.)
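The tokenization call for the training text appears to be on a slide image not captured here; the sketch below mirrors the test-set tokenization shown later in this section and is assumed, not taken from the original slide:
# Tokenize the training text, dropping numbers, punctuation, symbols and hyphens
train.tokens <- tokens(train$Text, what = "word", remove_numbers = TRUE,
                       remove_punct = TRUE, remove_symbols = TRUE,
                       remove_hyphens = TRUE)
train.tokens[[357]]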

# Lower case the tokens.
train.tokens <- tokens_tolower(train.tokens)
train.tokens[[357]]
Text Analytics Using R (cont.)

# Remove the stop words.
train.tokens <- tokens_select(train.tokens, stopwords(), selection = "remove")
Text Analytics Using R (cont.)

train.tokens[[357]]
# Perform stemming on the tokens.
train.tokens <- tokens_wordstem(train.tokens, language = "english")
train.tokens[[357]]
Text Analytics Using R (cont.)
Text Analytics Using R (cont.)

# Create our first bag-of-words model.
train.tokens.dfm <- dfm(train.tokens, tolower = FALSE)
train.tokens.matrix <- as.matrix(train.tokens.dfm)
View(train.tokens.matrix[1:20, 1:100])
Text Analytics Using R (cont.)

# Set up the feature data frame with labels.
train.tokens.df <- cbind(Label = train$Label, data.frame(train.tokens.dfm))
# Clean up column names.
names(train.tokens.df) <- make.names(names(train.tokens.df))
# Use caret to create stratified folds for 10-fold cross-validation repeated 3 times (i.e., create 30 random stratified samples).
set.seed(48743)
cv.folds <- createMultiFolds(train$Label, k = 10, times = 3)
Text Analytics Using R (cont.)

cv.cntrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, index = cv.folds)
# Build a single decision tree using the training data set.
rpart.cv.1 <- train(Label ~ ., data = train.tokens.df, method = "rpart", trControl = cv.cntrl, tuneLength = 7)
Text Analytics Using R (cont.)

Preparation of Test Data
# Tokenization.
test.tokens <- tokens(test$Text, what = "word", remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_hyphens = TRUE)
Text Analytics Using R (cont.)

# Lower case the tokens.
test.tokens <- tokens_tolower(test.tokens)
# Stopword removal.
test.tokens <- tokens_select(test.tokens, stopwords(), selection = "remove")
# Stemming.
test.tokens <- tokens_wordstem(test.tokens, language = "english")
# Convert to a quanteda document-term frequency matrix.
test.tokens.dfm <- dfm(test.tokens, tolower = FALSE)
preds <- predict(rpart.cv.1, test.tokens.dfm)   # prediction
confusionMatrix(preds, test$Label)
Text Analytics Using R (cont.)
Statistical Analysis

What is statistical analysis? Statistical analysis is collecting and


analyzing data samples to uncover patterns and trends and predict what
could happen next to make better and more scientific decisions.
As an aspect of business intelligence, statistical analysis scrutinizes
business data and reports on trends using five key steps.

1 Describe the type of data that will be analyzed


2 Explore the relation of the data to the underlying population
3 Create a statistical model that summarizes the understanding of how the data relate to the underlying population
4 Prove or disprove the validity of the model
5 Use predictive analytics to run scenarios that will guide future actions
Statistical Analysis (cont.)

Importance of statistical analysis


Once the data is collected, statistical analysis can be used for many things
in your business. Some include:
1 Summarizing and presenting the data in a graph or chart to present
key findings
2 Discovering crucial measures within the data
3 Calculating if the data is slightly clustered or spread out, which also
determines similarities
4 Making future predictions based on past behavior
5 Testing a hypothesis from an experiment
Statistical Analysis (cont.)
Statistical analysis vs. data analysis
Statistical analysis applies specific statistical methods to a sample of
data to understand the total population. It allows for conclusions to be
drawn about particular markets, cohorts, and a general grouping to predict
the behavior and characteristics of others.
Data analysis is the process of inspecting and cleaning all available data
and transforming it into useful information that can be understood by
non-technical individuals. This is crucial when you consider that data can
be meaningless if it isn’t understood by those who make decisions.
Statistical Analysis (cont.)

Main types of statistical analysis


Descriptive Analysis
Descriptive statistical Analysis refers to a branch of statistics that involves
summarizing, organizing, and presenting data meaningfully and concisely.
It focuses on describing and analyzing a dataset’s main features and
characteristics without making any generalizations or inferences to a larger
population.
The primary goal of descriptive statistics is to provide a clear and
concise summary of the data, enabling researchers or analysts to
gain insights and understand patterns, trends, and distributions
within the dataset. This summary typically includes measures such
as central tendency (e.g., mean, median, mode), dispersion (e.g.,
range, variance, standard deviation), and shape of the distribution
(e.g., skewness, kurtosis).
Statistical Analysis (cont.)

Descriptive statistics also involves a graphical representation of data


through charts, graphs, and tables, which can further aid in
visualizing and interpreting the information. Common graphical
techniques include histograms, bar charts, pie charts, scatter plots,
and box plots.
By employing descriptive statistics, researchers can effectively summarize
and communicate the key characteristics of a dataset, facilitating a better
understanding of the data and providing a foundation for further statistical
analysis or decision-making processes.
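A minimal R sketch of these descriptive measures on a small hypothetical sample:
x <- c(23, 29, 20, 32, 27, 30, 20, 24, 26, 28)   # hypothetical sample
mean(x); median(x)        # central tendency
range(x); var(x); sd(x)   # dispersion
summary(x)                # five-number summary plus the mean
hist(x, main = "Distribution of x")   # graphical view of the distribution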
Inferential Analysis
The inferential statistical analysis takes a random data sample from a
portion of the population to make predictions, draw conclusions based on
that information, and generalize the results to represent the data on-hand.
Statistical Analysis (cont.)

The best way to get an accurate analysis when using inferential statistics
involves identifying the population being measured or studied, creating a
sample for that portion of the population, and using analysis to factor in
any sampling errors.
Types of Inferential Statistics
Inferential statistics employ four different methodologies or types:
1 Parameter Estimation: Analysts take a statistic from the sample

data and use it to make an informed guess about a population’s mean


parameter. It uses estimators such as probability plotting, Bayesian
estimation methods, rank regression, and maximum likelihood estima-
tion.
Statistical Analysis (cont.)

2 Confidence Intervals: Analysts use confidence intervals to get an interval estimate for the chosen parameters. They are used to discover the margin of error in research and to determine whether it will affect the testing.
3 Regression Analysis: Regression analysis is a series of statistical pro-
cesses that estimate the relationship between a dependent variable and
a set of independent variables. This analysis uses hypothesis tests to
determine if the relationships observed in the sample data actually exist
in the population.
4 Hypothesis Test: Analysts try to answer research questions by using sample data and making assumptions involving the population parameters. This test determines whether the measured population has a higher value than another data point in the analysis. In this method, the margin of error is found by multiplying the standard error of the mean by the z-score.
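As a small illustration, R's t.test() returns both a hypothesis test and a confidence interval for the mean; the sample and the hypothesized mean below are hypothetical:
x <- c(23, 29, 20, 32, 27, 30, 20, 24, 26, 28)   # hypothetical sample
t.test(x, mu = 25, conf.level = 0.95)   # one-sample t-test with a 95% confidence interval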
Question Bank

1 Explain the Data Science Process.
2 Discuss recommender systems.
3 Explain the pros and cons of various validation methods.
4 Explain proximity measures for nominal, binary, ordinal, numerical and mixed data.
5 Implement clustering using R.
6 Perform text analysis using DTM and TF-IDF methods with an example.
7 Explain the steps in the text analytics process.
8 Discuss statistical analysis.
9 How are the regression's underlying assumptions verified?
10 Explain linear and multiple linear regression models using R with an example.
11 How can text analysis be applied in real-world scenarios?
Question Bank (cont.)

12 Give examples of situations in which classification would be an appropriate data mining technique.
13 Give examples of situations in which cluster analysis would be an appropriate data mining technique.
14 Interpret data with the help of histograms.
15 Give examples of classification algorithms in various domains.
16 Give examples of segmentation algorithms in various domains.
The End
