UNIT-II

Data Science: The concept, process and typical tools in data science. Examples of different algorithms, i.e. segmentation, classification, validation, regressions, recommendations. Exercises using Excel and R to work on histograms, regression, clustering and text analysis. Correlation between Algorithm and Code in data science.

Overview of Data Science

Data Science can be described as the entire process of gathering actionable insights from raw data; it involves various concepts, including statistical analysis, data analysis, machine learning algorithms, data modeling, preprocessing of data, etc.

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms,
and systems to extract knowledge and insights from structured and unstructured data. It
involves the use of techniques from statistics, data analysis, machine learning, and computer
science to extract insights and knowledge from data. Data science can be applied in a wide
range of fields, including business, healthcare, finance, and government, among others. The
goal of data science is to turn raw data into actionable insights that can inform decision-making
and improve outcomes.

The sections below describe how data science actually works in practice.

History of Data Science

Data Science has evolved over the years and did not start out as the field we know today. Let's take a look at the timeline to understand how Data Science evolved over the years.

1. 1962 – Inception
a. Future of Data Analysis – In 1962, John W. Tukey wrote "The Future of Data Analysis", where he first highlighted the importance of data analysis with respect to science rather than mathematics.

2. 1974
a. Concise Survey of Computer Methods – In 1974, Peter Naur published the "Concise Survey of Computer Methods", which surveys the contemporary methods of data processing in various applications.
3. 1974 – 1980

a. International Association For Statistical Computing – In 1977, the committee was formed whose sole purpose was to link traditional statistical methodology with modern computer technology to extract useful information and knowledge from the data.

4. 1980-1990
a. Knowledge Discovery in Databases – In 1989, Gregory Piatetsky-Shapiro chaired the first Knowledge Discovery in Databases (KDD) workshop, which later went on to become the annual conference on knowledge discovery and data mining.
5. 1990-2000
a. Database Marketing – In 1994, BusinessWeek published a cover story explaining how big organizations were using customer data to predict the likelihood of a customer buying a specific product, much like how targeted ads work in the modern era of social media campaigns.

b. International Federation of Classification Societies – In 1996, the term "Data Science" was used for the first time in the title of a conference, held in Japan.

6. 2000-2010
a. Data Science – An Action Plan for Expanding the Technical Areas of the Field of Statistics – In 2001, William S. Cleveland published this action plan, which focused on the major areas of technical work in the field of statistics and coined the term Data Science.

b. Statistical Modeling – The Two Cultures – In 2001, Leo Breiman wrote “There are
two cultures in the use of statistical modeling to reach conclusions from data. One
assumes that the data are generated by a given stochastic data model. The other uses
algorithmic models and treats the data mechanism as unknown”.
c. Data Science Journal – April 2002 saw the launch of a journal that focused on
management of data and databases in science and technology.

7. 2010-Present
a. Data Everywhere – In February 2010, Kenneth Cukier wrote a special report for The Economist which said that a new professional has arrived – the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data.

How does Data Science Work?


The working of data science can be explained as follows:
1. Raw data is gathered from various sources that explain the business problem.
2. Using various statistical analysis and machine learning approaches, data modeling is performed to get the optimum solutions that best explain the business problem.
3. Actionable insights that serve as a solution to the business problem are gathered through data science.

As an example, a sales team trying to identify potential leads can follow this approach to get an optimal solution using Data Science:
1. Gather the previous data on the sales that were closed.
2. Use statistical analysis to find the patterns followed by the leads that were closed.
3. Use machine learning to get actionable insights for finding potential leads.
4. Use the new data on sales leads to segregate potential leads that are highly likely to be closed.

Data Science Life Cycle


The Data Science lifecycle comprises the following:

1. Formulating a Business Problem: Any data science project starts its journey from formulating a business problem. A business problem explains the issues that may be fixed with insights gathered from an efficient Data Science solution. A simple example of a business problem is – you have the past 1 year's sales data for a retail store. Using machine learning approaches, you have to predict or forecast the sales for the next 3 months, which will help the store create an inventory that reduces the wastage of products that have a shorter shelf life than the other products.

2. Data Extraction, Transformation, Loading: The next step in the data science life cycle is to create a data pipeline where the relevant data is extracted from the source, transformed into a machine-readable format, and eventually loaded into the program or the machine learning pipeline to get things started.
For the above example – to forecast the sales, we will need data from the store that will be useful for formulating an efficient machine learning model. Keeping this in mind, we would create separate data points that may or may not affect the sales for that particular store.

3. Data Preprocessing: The third step is where the magic happens. Using statistical
analysis, Exploratory data analysis, data wrangling and manipulation, we will create
meaningful data. The preprocessing is done to assess the various data points and
formulate hypotheses that best explain the relationship between the various features in the
data.
For example – The store sales problem will require the data to be in a time series format
to be able to forecast the sales. The hypothesis testing will test the stationarity of the
series and further computations will show various trends, seasonality and other
relationship patterns in the data.

4. Data Modeling: This step involves advanced machine learning concepts that will be used
for feature selection, feature transformation, standardization of the data, data
normalization, etc. Choosing the best algorithms based on evidence from the above steps
will help you create a model that will efficiently create a forecast for the said months in
the above example.
For example – we can use a Time Series forecasting approach for this business problem, where high-dimensional data may be present. We will use various dimensionality reduction techniques and create a forecasting model using an AR, MA, or ARIMA model to forecast the sales for the next quarter.

5. Gathering Actionable Insights: The final step from the data science life cycle is
gathering insights from the said problem statement. We create inferences and findings
from the entire process that would best explain the business problem.
For example – From the above Time series model, we will get the monthly or weekly
sales for the next 3 months. These insights will in turn help the professionals create a
strategy plan to overcome the problem at hand.

6. Solutions For the Business Problem: The solutions for the business problem are
nothing but actionable insights that will solve the problem using evidence based
information. For example – Our forecast from the Time series model will give an
efficient estimate for the store sales in the next 3 months. Using those insights, the store
can plan their inventory to reduce the wastage of perishable goods.

USES OF DATA SCIENCE:

Data science is a field that involves using scientific methods, processes, algorithms, and
systems to extract knowledge and insights from structured and unstructured data. It can be used
in a variety of industries and applications such as:

1. Business: Data science can be used to analyze customer data, predict market trends, and
optimize business operations.

2. Healthcare: Data science can be used to analyze medical data and identify patterns that
can aid in diagnosis, treatment, and drug discovery.

3. Finance: Data science can be used to identify fraud, analyze financial markets, and make
investment decisions.

4. Social Media: Data science can be used to understand user behavior, recommend content,
and identify influencers.

5. Internet of things: Data science can be used to analyze sensor data from IoT devices and
make predictions about equipment failures, traffic patterns, and more.

6. Natural Language Processing: Data science can be used to make computers understand
human language, process large amounts of text or speech data and make predictions.

Overall Data Science is a multidisciplinary field that involves the use of statistics, machine
learning, and computer science to extract insights and knowledge from data.

Applications of Data Science:

Following are some of the applications that make use of Data Science for their services:
 Internet Search Results (Google)
 Recommendation Engine (Spotify)
 Intelligent Digital Assistants (Google Assistant)
 Autonomous Driving Vehicle (Waymo)
 Spam Filter (Gmail)
 Abusive Content and Hate Speech Filter (Facebook)
 Robotics (Boston Dynamics)
 Automatic Piracy Detection (YouTube)

Those applications drive a wide variety of use cases in organizations, including the
following:
 customer analytics
 fraud detection
 risk management
 stock trading
 targeted advertising
 website personalization
 customer service
 predictive maintenance
 logistics and supply chain management
 image recognition
 speech recognition

 natural language processing


 cybersecurity
 medical diagnosis
Data science team

 Data engineer. Responsibilities include setting up data pipelines and aiding in data
preparation and model deployment, working closely with data scientists.
 Data analyst. This is a lower-level position for analytics professionals who don't have
the experience level or advanced skills that data scientists do.
 Machine learning engineer. This programming-oriented job involves developing the
machine learning models needed for data science applications.
 Data visualization developer. This person works with data scientists to create
visualizations and dashboards used to present analytics results to business users.
 Data translator. Also called an analytics translator, it's an emerging role that serves as
a liaison to business units and helps plan projects and communicate results.
 Data architect. A data architect designs and oversees the implementation of the
underlying systems used to store and manage data for analytics uses.

Data science tools and platforms

Numerous tools are available for data scientists to use in the analytics process, including both
commercial and open source options:

 data platforms and analytics engines, such as Spark, Hadoop and NoSQL databases;
 programming languages, such as Python, R, Julia, Scala and SQL;
 statistical analysis tools like SAS and IBM SPSS;
 machine learning platforms and libraries, including TensorFlow, Weka, Scikit-learn,
Keras and PyTorch;
 Jupyter Notebook, a web application for sharing documents with code, equations and
other information; and
 data visualization tools and libraries, such as Tableau, D3.js and Matplotlib.
Data Science Tools

Let’s take a look at those tools and their advantages now, placed in alphabetical order:

Algorithms.io.
This tool is a machine learning (ML) resource that takes raw data and shapes it into real-time insights and actionable events.

Advantages:

 It's on a cloud platform, so it has all the SaaS advantages of scalability, security, and infrastructure
 Makes machine learning simple and accessible to developers and companies

Apache Hadoop
This open-source framework creates simple programming models and distributes extensive data
set processing across thousands of computer clusters. Hadoop works equally well for research
and production purposes. Hadoop is perfect for high-level computations.

Advantages:

 Open-source
 Highly scalable
 It has many modules available
 Failures are handled at the application layer

Apache Spark
Also called "Spark," this is an all-powerful analytics engine and one of the most widely used data science tools. It is known for offering lightning-fast cluster computing. Spark accesses varied data sources such as Cassandra, HDFS, HBase, and S3. It can also easily handle large datasets.

Advantages:

 Over 80 high-level operators simplify the process of parallel app building


 Can be used interactively from the Scala, Python, and R shells
 Advanced DAG execution engine supports in-memory computing and acyclic data flow

BigML
This tool is another top-rated data science resource that provides users with a fully interactable,
cloud-based GUI environment, ideal for processing ML algorithms. You can create a free or
premium account depending on your needs, and the web interface is easy to use.

Advantages:

 An affordable resource for building complex machine learning solutions


 Takes predictive data patterns and turns them into intelligent, practical applications
usable by anyone
 It can run in the cloud or on-premises

D3.js
D3.js is an open-source JavaScript library that lets you make interactive visualizations on your
web browser. It emphasizes web standards to take full advantage of all of the features of modern
browsers, without being bogged down with a proprietary framework.

Advantages:

 D3.js is based on the very popular JavaScript


 Ideal for client-side Internet of Things (IoT) interactions
 Useful for creating interactive visualizations

Data Robot
This tool is described as an advanced platform for automated machine learning. Data scientists,
executives, IT professionals, and software engineers use it to help them build better quality
predictive models, and do it faster.

Advantages:

 With just a single click or line of code, you can train, test, and compare many different
models
 It features Python SDK and APIs
 It comes with a simple model deployment process

Excel
Yes, even this ubiquitous old spreadsheet workhorse gets some attention here, too! Originally developed by Microsoft for spreadsheet calculations, it has gained widespread use as a tool for data processing, visualization, and sophisticated calculations.

Advantages:

 You can sort and filter your data with one click
 Advanced Filtering function lets you filter data based on your favorite criteria
 Well-known and found everywhere

ForecastThis
If you're a data scientist who wants automated predictive model selection, then this is the tool for you! ForecastThis helps investment managers, data scientists, and quantitative analysts use their in-house data to optimize complex future objectives and create robust forecasts.

Advantages:

 Easily scalable to fit any size challenge

 Includes robust optimization algorithms


 Simple spreadsheet and API plugins

Google BigQuery
This is a very scalable, serverless data warehouse tool created for productive data analysis. It
uses Google's infrastructure-based processing power to run super-fast SQL queries against
append-only tables.

Advantages:

 Extremely fast
 Keeps costs down since users need only pay for storage and compute usage
 Easily scalable

Java
Java is the classic object-oriented programming language that's been around for years. It's simple, architecture-neutral, secure, platform-independent, and object-oriented.

Advantages:

 Suitable for large science projects if used with Java 8 with Lambdas
 Java has an extensive suite of tools and libraries that are perfect for machine learning and
data science
 Easy to understand

MATLAB
MATLAB is a high-level language coupled with an interactive environment for numerical
computation, programming, and visualization. MATLAB is a powerful tool, a language used in
technical computing, and ideal for graphics, math, and programming.

Advantages:

 Intuitive use
 It analyzes data, creates models, and develops algorithms
 With just a few simple code changes, it scales analyses to run on clouds, clusters, and
GPUs

MySQL

Another familiar tool that enjoys widespread popularity, MySQL is one of the most popular open-source databases available today. It's ideal for accessing data from databases.

Advantages:

 Users can easily store and access data in a structured manner


 Works with programming languages like Java
 It's an open-source relational database management system

NLTK

Short for Natural Language Toolkit, this open-source tool works with human language data and is a well-liked toolkit for building Python programs. NLTK is ideal for rookie data scientists and students.

Advantages:

 Comes with a suite of text processing libraries


 Offers over 50 easy-to-use interfaces to corpora and lexical resources
 It has an active discussion forum that provides a wealth of new information

Rapid Miner

This data science tool is a unified platform that incorporates data prep, machine learning, and
model deployment for making data science processes easy and fast. It enjoys heavy use in the
manufacturing, telecommunication, utility, and banking industries.

Advantages:

 All of the resources are located on one platform


 GUI is based on a block-diagram process, simplifying these blocks into a plug-and-play
environment
 Uses a visual workflow designer to model machine learning algorithms

SAS

This data science tool is designed especially for statistical operations. It is a closed-source proprietary software tool that specializes in handling and analyzing massive amounts of data for large organizations. It's well-supported by its company and very reliable. Still, it's a case of getting what you pay for, because SAS is expensive and best suited for large companies and organizations.

Advantages:

 Numerous analytics functions covering everything from social media to automated


forecasting to location data
 It features interactive dashboards and reports, letting the user go straight from reporting
to analysis
 Contains advanced data visualization techniques such as auto charting to present
compelling results and data

Tableau

Tableau is a Data Visualization software that is packed with powerful graphics to make
interactive visualizations. It is focused on industries working in the field of business intelligence.
The most important aspect of Tableau is its ability to interface with databases, spreadsheets, OLAP (Online Analytical Processing) cubes, etc. Along with these features, Tableau can visualize geographical data by plotting longitudes and latitudes on maps.

TensorFlow

TensorFlow has become a standard tool for Machine Learning. It is widely used for advanced
machine learning algorithms like Deep Learning. Developers named TensorFlow after Tensors
which are multidimensional arrays.
It is an open-source and ever-evolving toolkit which is known for its performance and high
computational abilities. TensorFlow can run on both CPUs and GPUs and has recently emerged
on more powerful TPU platforms.

This gives it an unprecedented edge in terms of the processing power of advanced machine
learning algorithms.

Weka

Weka or Waikato Environment for Knowledge Analysis is a machine learning software written
in Java. It is a collection of various Machine Learning algorithms for data mining. Weka consists

of various machine learning tools like classification, clustering, regression, visualization and data
preparation.
It is an open-source GUI software that allows easier implementation of machine learning
algorithms through an interactable platform.

Example of different algorithms, i.e. segmentation, classification, validation, regressions, recommendations

The implementation of Data Science to any problem requires a set of skills. Machine Learning is
an integral part of this skill set.
For doing Data Science, you must know the various Machine Learning algorithms used for solving different types of problems, as a single algorithm cannot be the best for all types of use cases. These algorithms find application in various tasks like prediction, classification, clustering, etc. on the dataset under consideration.

These algorithms can be categorized into 3 main categories.

1. Supervised Algorithms: The training data set has inputs as well as the desired output.
During the training session, the model will adjust its variables to map inputs to the
corresponding output.

2. Unsupervised Algorithms: In this category, there is no target outcome. The algorithms cluster the data set into different groups.

3. Reinforcement Algorithms: These algorithms are trained to take decisions. Based on the success or error of each decision, the algorithm trains itself further and, eventually, with experience, it is able to give good predictions.

Linear Regression

Linear regression method is used for predicting the value of the dependent variable by using the
values of the independent variable.

The linear regression model is suitable for predicting the value of a continuous quantity.

The linear regression model represents the relationship between the input variables (x) and the
output variable (y) of a dataset in terms of a line given by the equation,

y = m*x + c

where y is the dependent variable and x is the independent variable. Basic calculus is applied to find the values of m and c from the given data set. The main aim of this method is to find the values of m (the slope) and c (the intercept) that give the best-fit line, that is, the line that covers or is nearest to most of the data points.
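A minimal sketch of this idea in R, using the base lm() function on a small made-up dataset (the numbers below are purely illustrative):

# Made-up data: x is the independent variable, y the dependent variable
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(3.1, 4.9, 7.2, 9.1, 10.8, 13.2, 15.1, 16.9)

# Fit the best-fit line y = m*x + c by ordinary least squares
model <- lm(y ~ x)

# The estimated intercept (c) and slope (m)
coef(model)

# Predict y for a new value of x
predict(model, data.frame(x = 10))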

Logistic Regression

Linear Regression is used for representing the relationship between continuous values. Logistic Regression, on the other hand, works on discrete values.
Logistic regression finds the most common application in solving binary classification
problems, that is, when there are only two possibilities of an event, either the event will occur or
it will not occur (0 or 1).
Thus, in Logistic Regression, we convert the predicted values into such values that lie in the
range of 0 to 1 by using a non-linear transform function which is called a logistic function.

We generate this with the help of the logistic function:

1 / (1 + e^(-x))

Here, e represents the base of the natural log, and we obtain an S-shaped curve with values between 0 and 1. The equation for logistic regression is written as:

y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))


Here, b0 and b1 are the coefficients of the input x. These coefficients are estimated using the
data through “maximum likelihood estimation”.
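For illustration only, here is how the logistic function maps a value z = b0 + b1*x to a probability in R; the coefficient values below are made up (in practice they come from maximum likelihood estimation, e.g. via glm() as shown later in this unit):

# Hypothetical coefficients (assumed for illustration only)
b0 <- -4
b1 <- 0.8
x  <- 6                      # a single input value

z <- b0 + b1 * x             # linear part
p <- 1 / (1 + exp(-z))       # logistic function: a value between 0 and 1
p

# plogis() is R's built-in logistic function and gives the same result
plogis(z)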

Decision Trees

This algorithm categorizes the population into several sets based on some chosen properties (independent variables) of the population. Usually, this algorithm is used to solve classification problems. Categorization is done using techniques such as Gini, Chi-square, entropy, etc.

Decision trees help in solving both classification and prediction problems. They make it easy to understand the data, which leads to better accuracy of the predictions. Each node of the decision tree represents a feature or an attribute, each link represents a decision and each leaf node holds a class label, that is, the outcome.
The drawback of decision trees is that they suffer from the problem of overfitting.

Basically, these two Data Science algorithms are most commonly used for implementing the
Decision trees.

 ID3 ( Iterative Dichotomiser 3) Algorithm

This algorithm uses entropy and information gain as the decision metric.

 CART (Classification and Regression Tree) Algorithm

This algorithm uses the Gini index as the decision metric. The below image will help you to
understand things better.

Here's a decision tree that evaluates scenarios where people want to play football.
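A brief sketch of such a tree in R. This assumes the add-on package rpart (not named above; just one possible implementation) and a small made-up weather dataset:

library(rpart)

# Made-up play/no-play data in the spirit of the example above
weather <- data.frame(
  Outlook  = factor(c("Sunny", "Sunny", "Overcast", "Rain", "Rain",
                      "Overcast", "Sunny", "Rain")),
  Humidity = factor(c("High", "High", "High", "Normal", "High",
                      "Normal", "Normal", "Normal")),
  Play     = factor(c("No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"))
)

# Grow a classification tree (rpart uses the Gini index by default)
fit <- rpart(Play ~ Outlook + Humidity, data = weather, method = "class",
             control = rpart.control(minsplit = 2))

# Inspect the splits and predict the outcome for one of the days
print(fit)
predict(fit, weather[1, ], type = "class")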

Support Vector Machine (SVM)

Support Vector Machine or SVM comes under the category of supervised Machine Learning algorithms and finds application in both classification and regression problems. It is most commonly used for classification problems and classifies the data points by using a hyperplane.

The first step of this Data Science algorithm involves plotting all the data items as individual
points in an n-dimensional graph.

Here, n is the number of features and the value of each individual feature is the value of a
specific coordinate. Then we find the hyperplane that best separates the two classes for
classifying them.
Finding the correct hyperplane plays the most important role in classification. The data points
which are closest to the separating hyperplane are the support vectors.

Let us consider the following example to understand how you can identify the right hyperplane.
The basic principle for selecting the best hyperplane is that you have to choose the hyperplane
that separates the two classes very well.

In this case, the hyperplane B is classifying the data points very well. Thus, B will be the right
hyperplane.

All three hyperplanes are separating the two classes properly. In such cases, we have to select the
hyperplane with the maximum margin. As we can see in the above image, hyperplane B has the
maximum margin therefore it will be the right hyperplane.

In this case, the hyperplane B has the maximum margin but it is not classifying the two classes
accurately. Thus, A will be the right hyperplane.
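A minimal sketch in R, assuming the add-on package e1071 (one possible SVM implementation, not named above) and two made-up, well-separated classes:

library(e1071)

set.seed(1)
df <- data.frame(
  x1 = c(rnorm(20, mean = 0), rnorm(20, mean = 3)),
  x2 = c(rnorm(20, mean = 0), rnorm(20, mean = 3)),
  class = factor(rep(c("A", "B"), each = 20))
)

# Fit a linear SVM; the model keeps only the support vectors
fit <- svm(class ~ x1 + x2, data = df, kernel = "linear")

fit$tot.nSV                    # number of support vectors
table(fitted(fit), df$class)   # training predictions vs. actual classes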

Naive Bayes

The Naive Bayes algorithm helps in building predictive models. We use this Data Science
algorithm when we want to calculate the probability of the occurrence of an event in the future.
Here, we have prior knowledge that another event has already occurred.

The Naive Bayes algorithm works on the assumption that each feature is independent and has
an individual contribution to the final prediction.
The algorithm is based on Bayes' theorem, which is represented by:

P(A|B) = P(B|A) P(A) / P(B)


Where A and B are two events.

 P(A|B) is the posterior probability i.e. the probability of A given that B has already
occurred.
 P(B|A) is the likelihood i.e. the probability of B given that A has already occurred.
 P(A) is the class prior probability.
 P(B) is the predictor prior probability.
Example: Let's understand it using an example. Below we have a training data set of weather and the corresponding target variable 'Play'. Now, we need to classify whether players will play or not based on the weather conditions. Let's follow the steps below to perform it.

Step 1: Convert the data set to a frequency table.

Step 2: Create a Likelihood table by finding the probabilities like Overcast probability = 0.29 and
probability of playing is 0.64.

Step 3: Now, use the Naive Bayesian equation to calculate the posterior probability for each
class. The class with the highest posterior probability is the outcome of the prediction.

Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using the method discussed above: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)

Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64

Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher posterior probability, so the prediction is that players will play.

Naive Bayes uses a similar method to predict the probability of different classes based on
various attributes. This algorithm is mostly used in text classification and with problems
having multiple classes.
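A small sketch in R, assuming the add-on package e1071 (its naiveBayes() function is just one possible implementation) and a made-up weather/play table in the spirit of the example above:

library(e1071)

# Made-up training data: outlook of each day and whether players played
weather <- data.frame(
  Outlook = factor(c("Sunny", "Sunny", "Overcast", "Rain", "Rain",
                     "Overcast", "Sunny", "Rain", "Sunny", "Overcast")),
  Play    = factor(c("No", "No", "Yes", "Yes", "Yes",
                     "Yes", "Yes", "No", "No", "Yes"))
)

# The model stores the prior P(Play) and the likelihoods P(Outlook | Play)
fit <- naiveBayes(Play ~ Outlook, data = weather)

# Posterior probabilities P(No | Sunny) and P(Yes | Sunny) for a sunny day
new_day <- data.frame(Outlook = factor("Sunny", levels = levels(weather$Outlook)))
predict(fit, new_day, type = "raw")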

KNN

KNN stands for K-Nearest Neighbors. This Data Science algorithm can be employed for both classification and regression problems.

This is a simple algorithm which predicts an unknown data point from its k nearest neighbors. The value of k is a critical factor here regarding the accuracy of prediction. It determines the nearest neighbors by calculating the distance between points using basic distance functions like the Euclidean distance.

The KNN algorithm considers the complete dataset as the training dataset. After training the
model using the KNN algorithm, we try to predict the outcome of a new data point.

Here, the KNN algorithm searches the entire data set for identifying the k most similar or nearest
neighbors of that data point. It then predicts the outcome based on these k instances. For finding
the nearest neighbors of a data instance, we can use various distance measures like Euclidean
distance, Hamming distance, etc. To better understand, let us consider the following example.

Here we have represented the two classes A and B by the circle and the square respectively.

Let us assume the value of k is 3.


Now we will first find three data points that are closest to the new data item and enclose them in
a dotted circle. Here the three closest points of the new data item belong to class A. Thus, we can
say that the new data point will also belong to class A.

Now you might be wondering how we chose k = 3.

The selection of the value of k is a critical task. You should take a value of k that is neither too small nor too large. Another simple approach is to take k = √n, where n is the number of data points.
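A brief sketch in R, assuming the add-on package class (its knn() function is one common implementation) and made-up points for classes A and B:

library(class)

# Made-up training points in a 2-dimensional feature space
train  <- data.frame(x1 = c(1.0, 1.2, 0.8, 5.0, 5.3, 4.8),
                     x2 = c(1.1, 0.9, 1.3, 5.1, 4.9, 5.2))
labels <- factor(c("A", "A", "A", "B", "B", "B"))

# A new, unlabelled data point
new_point <- data.frame(x1 = 1.1, x2 = 1.0)

# Classify it from its k = 3 nearest neighbours (Euclidean distance)
knn(train, new_point, cl = labels, k = 3)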

K-Means Clustering

K-means clustering is a type of unsupervised Machine Learning algorithm.

Clustering basically means dividing the data set into groups of similar data items called clusters.
K means clustering categorizes the data items into k groups with similar data items.
For measuring this similarity, we use Euclidean distance which is given by,
D = √((x1 - x2)^2 + (y1 - y2)^2)
K means clustering is iterative in nature.

The basic steps followed by the algorithm are as follows:

 First, we select the value of k which is equal to the number of clusters into which we
want to categorize our data.
 Then we assign the random center values to each of these k clusters.
 Now we start searching for the nearest data points to the cluster centers by using the
Euclidean distance formula.

 In the next step, we calculate the mean of the data points assigned to each cluster.
 Again we search for the nearest data points to the newly created centers and assign them
to their closest clusters.
 We should keep repeating the above steps until there is no change in the data points
assigned to the k clusters.
 In other words: first, we randomly initialize and select the k points; these k points are the initial means (centroids).
 We use the Euclidean distance to find the data points that are closest to their cluster centre.
 Then we calculate the mean of all the points in each cluster, which gives its new centroid.
 We iteratively repeat steps 1, 2 and 3 until all the points are assigned to their respective clusters.

K-means clustering is one of the most popular unsupervised learning algorithms. It is easy to understand and implement.

The objective of K-means clustering is to minimize the Euclidean distance that each point has from the centroid of its cluster. To better understand, let us consider the following example.
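As a small illustration of these steps, the sketch below groups made-up two-dimensional points into k = 2 clusters using base R's kmeans() function (covered in more detail in the R exercises later in this unit):

# Made-up 2-dimensional data forming two loose groups
set.seed(42)
pts <- rbind(cbind(rnorm(10, mean = 1), rnorm(10, mean = 1)),
             cbind(rnorm(10, mean = 6), rnorm(10, mean = 6)))

km <- kmeans(pts, centers = 2)

km$cluster   # cluster assignment of each point
km$centers   # final cluster centroids

# Euclidean distance between the two centroids, D = sqrt((x1-x2)^2 + (y1-y2)^2)
sqrt(sum((km$centers[1, ] - km$centers[2, ])^2))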

Artificial Neural Networks

Neural Networks are modeled after the neurons in the human brain. They comprise many layers of neurons that are structured to transmit information from the input layer to the output layer. Between the input and the output layer, there are hidden layers present.
These hidden layers can be many or just one. The simplest neural network, with no hidden layer, is known as a perceptron.

In the diagram of a simple neural network, there is an input layer that takes the input in the form of a vector. Then, this input is passed to the hidden layer, which comprises various mathematical functions that perform computations on the given input.

For example, given the images of cats and dogs, our hidden layers perform various
mathematical operations to find the maximum probability of the class our input image falls in.
This is an example of binary classification where the class, that is, dog or cat, is assigned its
appropriate place.
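A minimal sketch of a network with a single hidden layer in R, assuming the add-on package nnet (one possible implementation, not mentioned above) and made-up binary data:

library(nnet)

# Made-up data: two numeric inputs and a 0/1 output
set.seed(1)
x1 <- rnorm(40)
x2 <- rnorm(40)
y  <- ifelse(x1 + x2 > 0, 1, 0)
df <- data.frame(x1, x2, y)

# One hidden layer with 3 neurons; the output is a value between 0 and 1
net <- nnet(y ~ x1 + x2, data = df, size = 3, maxit = 200, trace = FALSE)

# Predicted probabilities for the first few rows
head(predict(net, df))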

Principal Component Analysis (PCA)

PCA is basically a technique for performing dimensionality reduction of the datasets with the
least effect on the variance of the datasets. This means removing the redundant features but
keeping the important ones.
To achieve this, PCA transforms the variables of the dataset into a new set of variables. This new
set of variables represents the principal components.

The most important features of these principal components are:

 All the PCs are orthogonal (i.e. they are at a right angle to each other).
 They are created in such a way that with the increasing number of components, the amount
of variation that it retains starts decreasing.

 This means the 1st principal component retains the variation to the maximum extent as
compared to the original variables.
PCA is basically used for summarizing data. While dealing with a dataset, there might be some features related to each other. PCA helps you reduce such features and make predictions with fewer features, without compromising accuracy.
For example, consider the following diagram in which we have reduced a 3D space to a 2D
space.
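A quick sketch of such a reduction in R, using the base prcomp() function on a made-up dataset with three features, two of which are strongly related:

set.seed(7)
f1 <- rnorm(50)
f2 <- f1 + rnorm(50, sd = 0.1)   # almost a copy of f1 (a redundant feature)
f3 <- rnorm(50)
df <- data.frame(f1, f2, f3)

# PCA on the standardized features
pca <- prcomp(df, center = TRUE, scale. = TRUE)

# Proportion of variance retained by each principal component
summary(pca)

# Keep only the first two components: the 3D space reduced to a 2D space
head(pca$x[, 1:2])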

Random Forests

Random Forests overcome the overfitting problem of decision trees and help in solving both classification and regression problems. They work on the principle of ensemble learning.
Ensemble learning methods rely on the idea that a large number of weak learners can work together to give high-accuracy predictions.
Random Forests work in a similar way. They consider the predictions of a large number of individual decision trees to give the final outcome. The model counts the votes of the predictions of the different decision trees, and the prediction with the largest number of votes becomes the prediction of the model. Let us understand this with an example.

In the above image, there are two classes labeled as A and B. In this random forest consisting of 7 decision trees, 3 have voted for class A and 4 have voted for class B. As class B has received the maximum votes, the model's prediction will be class B.
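A brief sketch of this voting idea in R, assuming the add-on package randomForest (one commonly used implementation, not named above) and a made-up two-class dataset:

library(randomForest)

set.seed(3)
df <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
df$class <- factor(ifelse(df$x1 + df$x2 > 0, "A", "B"))

# Grow a small forest of 7 trees; each tree votes for a class
fit <- randomForest(class ~ x1 + x2, data = df, ntree = 7)

# The predicted class is the one with the most votes across the trees
predict(fit, data.frame(x1 = 0.5, x2 = 0.4))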

Exercises using Excel and R to work on histograms

How To Add A Histogram In Excel: Different Methods

As mentioned earlier, there are different ways of adding a histogram in Excel. Below are the
methods that work best for the 2016 and older versions of Microsoft Excel.

Method 1: Creating A Histogram In Excel 2016

This is the easiest method to add a histogram in your Excel worksheet, as it uses the recently
added in-built histogram option in the charts section. It works only for the 2016 and newer
versions.

Step 1: Select Dataset

The first step to add a histogram to the Excel sheet requires you to select the entire dataset that
you wish to represent using the histogram. To do this, simply click on any of the corner cells and
drag the pointer over the rest of the cells that you wish to select.

Step 2: Insert Tab

Once you have selected the dataset for which you wish to add a histogram in Excel, go to the menu bar and click on the Insert tab. The Insert tab is used to insert images, charts, filters, hyperlinks, etc.

Step 3: Charts → Insert Statistic Chart

Now click on the “Charts” option under the Insert tab and then select the “Insert statistic chart”
option, this will open a new dialog box under it.

Step 4: Histogram Chart Option

Inside the Insert Statistic Chart dialog box, click on the Histogram group and then on the Histogram chart option. This will add a histogram in Excel for the dataset selected in the first step.

After successfully adding the histogram in Excel using the above steps, you can then customize the histogram according to your choice. Some of the customization options include By Category, Bin Width, Automatic, Number of Bins, and others.

Method 2: Creating A Histogram In Excel Using Data Analysis Toolpack

The second method for adding histograms to your Excel sheet uses the Data Analysis Toolpack
and is applicable for the 2016 version as well as the older versions of Microsoft Excel.

To use this method, you must first enable the Analysis Toolpack. To do so, click on the File
tab→ Options→ Add-ins in the navigation pane→ Excel Add-ins→ Go→ Analysis toolpack in
the add-ins dialog box and then click OK.

Once this installation is done, follow the below-mentioned steps to add a Histogram in Microsoft
Excel using the Data Analysis Toolpack:

Step 1: Create Bins

To create a histogram in Excel using the provided data, you first need to create data intervals,
also known as bins. You need to specify bins separately in an additional column alongside the
provided data that you need to make the histogram for.

Step 2: Data Tab

The next step to add a histogram is to go to the Menu bar and click on the Data tab. The Data
Tab dialog box will open.

Step 3: Analysis Group→ Data Analysis.

Now, click on the Data Analysis icon inside the Analysis Group in the Data tab on the rightmost
corner of the tab.

Step 4: Select the histogram

After selecting the Data Analysis icon, a new dialog box will appear. Select histogram from the
list of the Analysis tools and click OK.

Step 5: Fill Histogram Dialog Box

Once you have selected the histogram from the list of analysis tools provided, a new dialog box
will appear, requiring you to fill in the Input Range, Bin Range, Output Range, and select the
Chart output. Leave the rest of the boxes unchecked, including Labels. You can check the Label
box if you wish to include the labels in the data section.

Step 6: Click OK

Click OK to confirm all the information you just filled in and add the histogram in Excel.

[NOTE: If you leave the Bin Range box empty, the histogram will still be created. The system will automatically create a set of equally spaced bins for your histogram.]

The above method works for Microsoft Excel versions up to and including the 2016 version, but as a backup, if this method doesn't work for you, then you can use the third and last method mentioned below to add a histogram in Microsoft Excel.

Method 3: Using The Frequency Function To Create A Histogram

The 3rd method for adding a histogram to your Excel sheet is using the FREQUENCY function
and is usually used to create a dynamic histogram in Excel. A dynamic histogram updates itself
when the data inside it is changed.

Step 1: Bins

The first step here as well is to create separate bins in a separate column. These bins or data
intervals are what you want to show the frequency of.

Step 2: Use The Formula

Use the formula below to calculate the frequency of each interval.

=FREQUENCY(B2:B41,D2:D8)

Here you can enter the marks of students in column B and the bins in column D. Note that this is an array formula, so you must press Ctrl+Shift+Enter instead of just pressing Enter.

To get an accurate result and to make sure you don't make any mistakes, follow the steps below:

1. Select the cells adjacent to the bins. For the formula used above, these will be the cells E2:E8.
2. Press F2 to get edit access for cell E2.
3. Enter the FREQUENCY formula.
4. Press Ctrl+Shift+Enter.

This will give you the frequency of the bins you created. With the frequency already calculated,
all you need to do is to create the histogram in Excel which will be nothing more than just a
column chart.

Histograms in R language
A histogram displays statistical information using rectangular bars whose heights are proportional to the frequency of a variable within successive numerical intervals, the widths of the bars marking those intervals. It is a graphical representation that organizes a group of data points into specified ranges. A special feature is that there are no gaps between the bars, which makes it look similar to a vertical bar graph.
R – Histograms
We can create histogram in R Programming Language using hist() function.
Syntax: hist(v, main, xlab, xlim, ylim, breaks, col, border)
Parameters:
 v: This parameter contains numerical values used in histogram.
 main: This parameter main is the title of the chart.
 col: This parameter is used to set color of the bars.
 xlab: This parameter is the label for horizontal axis.
 border: This parameter is used to set border color of each bar.
 xlim: This parameter specifies the range of values shown on the x-axis.

 ylim: This parameter specifies the range of values shown on the y-axis.

 breaks: This parameter controls the bins, either as the number of bars or as the break points between them.

Creating a simple Histogram in R

Creating a simple histogram chart by using the above parameter. This vector v is plot
using hist().
Example:

# Create data for the graph.
v <- c(19, 23, 11, 5, 16, 21, 32,
       14, 19, 27, 39)

# Create the histogram.
hist(v, xlab = "No. of Articles",
     col = "green", border = "black")

Output:

Range of X and Y values

To set the range of values shown on the axes, we need to do the following:

1. Use the xlim and ylim parameters for the X-axis and Y-axis respectively.
2. Pass all the other parameters required to make the histogram chart.
Example
# Create data for the graph.
v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39)

# Create the histogram.
hist(v, xlab = "No. of Articles", col = "green",
     border = "black", xlim = c(0, 50),
     ylim = c(0, 5), breaks = 5)

Output:

Using histogram return values for labels using text()

The hist() function returns a list whose components include the bin midpoints (mids) and the bin counts (counts); these return values can be passed to text() to label the bars, as shown below.
# Creating data for the graph.
v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19,
       27, 39, 120, 40, 70, 90)

# Creating the histogram.
m <- hist(v, xlab = "Weight", ylab = "Frequency",
          col = "darkmagenta", border = "pink",
          breaks = 5)

# Setting labels
text(m$mids, m$counts, labels = m$counts,
     adj = c(0.5, -0.5))
Output:

Histogram using non-uniform width

By passing a vector of break points to the breaks parameter, we can create a histogram whose bars have non-uniform widths, as in the example below.
Example

# Creating data for the graph.
v <- c(19, 23, 11, 5, 16, 21, 32, 14,
       19, 27, 39, 120, 40, 70, 90)

# Creating the histogram.
hist(v, xlab = "Weight", ylab = "Frequency",
     xlim = c(50, 100),
     col = "darkmagenta", border = "pink",
     breaks = c(5, 55, 60, 70, 75,
                80, 100, 140))
Output:

Regression, clustering and text analysis using the R language:

Clustering in R Programming Language is an unsupervised learning technique in which the data set is partitioned into several groups, called clusters, based on their similarity. Several clusters of data are produced after the segmentation of the data. All the objects in a cluster share common characteristics. During data mining and analysis, clustering is used to find similar data sets.
Applications of Clustering in R Programming Language
 Marketing: In R programming, clustering is helpful for the marketing field. It helps in
finding the market pattern and thus, helping in finding the likely buyers. Getting the
interests of customers using clustering and showing the same product of their interest can
increase the chance of buying the product.
 Medical Science: In the medical field, there is a new invention of medicines and
treatments on a daily basis. Sometimes, new species are also found by researchers and
scientists. Their category can be easily found by using the clustering algorithm based on
their similarities.
 Games: A clustering algorithm can also be used to show the games to the user based on his
interests.
 Internet: A user browses a lot of websites based on their interests. The browsing history can be aggregated to perform clustering on it, and based on the clustering results, a profile of the user is generated.
Methods of Clustering
There are 2 types of clustering in R programming:

 Hard clustering: In this type of clustering, the data point either belongs to the cluster
totally or not and the data point is assigned to one cluster only. The algorithm used for hard
clustering is k-means clustering.
 Soft clustering: In soft clustering, the probability or likelihood of a data point is assigned
in the clusters rather than putting each data point in a cluster. Each data point exists in all
the clusters with some probability. The algorithm used for soft clustering is the fuzzy
clustering method or soft k-means.

K-Means Clustering in R Programming language

K-Means is an iterative hard clustering technique that uses an unsupervised learning algorithm. In this, the total number of clusters is pre-defined by the user and, based on the similarity of each data point, the data points are clustered. This algorithm also finds the centroid of each cluster.
Algorithm:
 Specify number of clusters (K): Let us take an example of k =2 and 5 data points.
 Randomly assign each data point to a cluster: In the below example, the red and green
color shows 2 clusters with their respective random data points assigned to them.
 Calculate cluster centroids: The cross mark represents the centroid of the corresponding
cluster.
 Re-allocate each data point to their nearest cluster centroid: Green data point is
assigned to the red cluster as it is near to the centroid of red cluster.
 Re-figure cluster centroid

Syntax: kmeans(x, centers, nstart)


where,
 x represents numeric matrix or data frame object
 centers represents the K value or distinct cluster centers
 nstart represents number of random sets to be chosen

Example:

# Library required for fviz_cluster function


install.packages("factoextra")
library(factoextra)

# Loading dataset
df <- mtcars

# Omitting any NA values


df <- na.omit(df)

# Scaling dataset
df <- scale(df)

# output to be present as PNG file


png(file = "KMeansExample.png")

km <- kmeans(df, centers = 4, nstart = 25)

# Visualize the clusters


fviz_cluster(km, data = df)

# saving the file


dev.off()

# output to be present as PNG file


png(file = "KMeansExample2.png")

km <- kmeans(df, centers = 5, nstart = 25)

# Visualize the clusters


fviz_cluster(km, data = df)

# saving the file


dev.off()
Output:
When k = 4

When k = 5

Regression Analysis in R Programming

Logistic Regression is a model that takes response variables (dependent variable) and features
(independent variables) to determine the estimated probability of an event. A logistic model is
used when the response variable has categorical values such as 0 or 1. For example, a student will pass/fail, a mail is spam or not, classifying images, etc. In this section, we'll discuss regression analysis, types of regression, and the implementation of logistic regression in R programming.

Regression Analysis in R
Regression analysis is a group of statistical processes used in R programming and statistics to
determine the relationship between dataset variables. Generally, regression analysis is used to
determine the relationship between the dependent and independent variables of the dataset.
Regression analysis helps to understand how the dependent variable changes when one of the independent variables is changed and the other independent variables are kept constant. This helps in building a regression model and, further, helps in forecasting the values with respect to a change in one of the independent variables. On the basis of the type of dependent variable, the number of independent variables, and the shape of the regression line, there are 4 types of regression analysis techniques, i.e., Linear Regression, Logistic Regression, Multinomial Logistic Regression and Ordinal Logistic Regression.
Types of Regression Analysis
Linear Regression
Linear Regression is one of the most widely used regression techniques to model the
relationship between two variables. It uses a linear relationship to model the regression line.
There are 2 variables used in the linear relationship equation i.e., predictor variable and
response variable.

y = ax + b
where,
 y is the response variable
 x is the predictor variable
 a and b are the coefficients

The regression line created using this technique is a straight line. The response variable is derived from the predictor variables. The predictor variables are estimated using statistical experiments. Linear regression is widely used, but this technique is not capable of predicting probabilities.
Logistic Regression
On the other hand, logistic regression has an advantage over linear regression as it is
capable of predicting the values within the range. Logistic regression is used to predict
the values within the categorical range. For example, male or female, winner or loser,
etc.
Logistic regression uses the following sigmoid function:
Y = 1 / (1 + e^(-z))
where,
 y represents response variable
 z represents equation of independent variables or features

Multinomial Logistic Regression


Multinomial logistic regression is an extension of logistic regression that handles a response variable with more than 2 categories, unlike logistic regression, which handles only 2 categories. For example, a biology researcher finds a new type of species, and the type of species can be determined based on many factors such as size, shape, eye color, the environmental factors of its habitat, etc.
Ordinal Logistic Regression
Ordinal logistic regression is also an extension of logistic regression. It is used to predict values at different levels of an ordered category; in simple words, it predicts a rank. For example, a restaurant creates a survey of the taste quality of its food, and using ordinal logistic regression, the survey response variable can be placed on a scale of some interval such as 1-10, which helps in determining the customers' response to their food items.
Implementation of Logistic Regression in R programming
In the R language, a logistic regression model is created using the glm() function.
Syntax: glm(formula, family = binomial)
Parameters:
formula: represents an equation on the basis of which model has to be fitted.
family: represents the type of function to be used i.e., binomial for logistic regression.

To know about more optional parameters of the glm() function, use the command below in R:
help("glm")
Example:
Let us assume a vector of IQ level of students in a class. Another vector contains the result of
the corresponding student i.e., fail or pass (0 or 1) in an exam.
# Generate random IQ values with mean = 30 and sd =2
IQ <- rnorm(40, 30, 2)

# Sorting IQ level in ascending order


IQ <- sort(IQ)

# Generate vector with pass and fail values of 40 students


result <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
1, 1, 1, 0, 1, 1, 1, 1, 0, 1)

# Data Frame
df <- as.data.frame(cbind(IQ, result))

# Print data frame


print(df)

# output to be present as PNG file


png(file="LogisticRegressionGFG.png")

# Plotting IQ on x-axis and result on y-axis


plot(IQ, result, xlab = "IQ Level",
ylab = "Probability of Passing")

# Create a logistic model


g = glm(result~IQ, family=binomial, df)

# Create a curve based on prediction using the regression model


curve(predict(g, data.frame(IQ=x), type="resp"), add=TRUE)

# This Draws a set of points


# Based on fit to the regression model
points(IQ, fitted(g), pch = 4)

# Summary of the regression model


summary(g)

# saving the file


dev.off()

Output:
IQ result
1 25.46872 0
2 26.72004 0
3 27.16163 0
4 27.55291 1
5 27.72577 0
6 28.00731 0
7 28.18095 0
8 28.28053 0
9 28.29086 0
10 28.34474 1
11 28.35581 1
12 28.40969 0
13 28.72583 0
14 28.81105 0
15 28.87337 1
16 29.00383 1
17 29.01762 0
18 29.03629 0
19 29.18109 1
20 29.39251 0
21 29.40852 0
22 29.78844 0
23 29.80456 1
24 29.81815 0
25 29.86478 0

26 29.91535 1
27 30.04204 1
28 30.09565 0
29 30.28495 1
30 30.39359 1
31 30.78886 1
32 30.79307 1
33 30.98601 1
34 31.14602 0
35 31.48225 1
36 31.74983 1
37 31.94705 1
38 31.94772 1
39 33.63058 0
40 35.35096 1

Call:
glm(formula = result ~ IQ, family = binomial, data = df)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.1451 -0.9742 -0.4950 1.0326 1.7283

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -16.8093 7.3368 -2.291 0.0220 *
IQ 0.5651 0.2482 2.276 0.0228 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 55.352 on 39 degrees of freedom


Residual deviance: 48.157 on 38 degrees of freedom
AIC: 52.157

Number of Fisher Scoring iterations: 4

Text Analysis in R:
Text analytics is the process of examining unstructured data in the form of text to gather some
insights on patterns and topics of interest.

library(tm) # a text mining package

library(stopwords)

library(tidytext) # a tidyverse friendly text mining package

library(stringr) # a package for manipulating strings

library(gutenbergr)

library(SnowballC) # a package for stemming words

library(wordcloud) # another package for plotting text data

library(lubridate)

library(tidyverse)

library(harrypotter) # contains a labelled corpus of all harry potter books

library(sentimentr) # simple sentiment analysis function

library(textdata) # another package to support parsing

library(topicmodels) # specify, save and load topic models

library(LDAvis) # visualize the output of Latent Dirichlet Allocation

library(servr) # we use this library to access a data set

library(stringi) # natural language processing tools

library(ldatuning) # automatically specify LDA models

library(reshape2)

library(igraph)

library(ggraph)

devtools::install_github("bradleyboehmke/harrypotter")
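As a minimal sketch of what these packages are used for, the example below tokenizes two made-up sentences, removes stop words and counts word frequencies with tidytext and dplyr (loaded above via tidyverse); the sentences themselves are purely illustrative:

library(tidytext)
library(dplyr)

docs <- tibble(
  doc  = c(1, 2),
  text = c("Data science turns raw data into insights",
           "Text analytics finds patterns and topics in text")
)

# One word per row, drop common stop words, then count the remaining words
docs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)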

Correlation between Algorithm and Code in Data Science

The mutual relationship, covariation, or association between two or more variables is called correlation. It is not concerned with the changes in x or y individually, but with the measurement of simultaneous variation in both variables.

Covariance is a statistical tool that quantifies how two random variables vary together around their expected values (means). In simple words, it is a measure of the linear relationship between two random variables. It can take any positive or negative value.

 Positive Covariance: It indicates that two variables tend to move in the same direction, which means that if we increase the value of one variable, the value of the other variable will also tend to increase.
 Zero Covariance: It indicates that there is no linear relationship between them.

 Negative Covariance: It indicates that two variables tend to move in opposite directions, which means that if we increase the value of one variable, the value of the other variable will tend to decrease, and vice versa.

Covariance between two variables X and Y can be calculated using the following formula:

Cov(X, Y) = Σ (Xi - mean(X)) * (Yi - mean(Y)) / (n - 1)
The following figure illustrates the linear relationship graphically
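As a quick illustration in R (the correlation examples later in this section use Python's scipy instead), base R's cov() and cor() functions compute covariance and correlation directly on two made-up vectors:

# Two made-up variables that tend to move together
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 7, 9, 12)

cov(x, y)   # positive covariance: the variables move in the same direction
cor(x, y)   # Pearson correlation, a normalized (unit-free) version of covariance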

Types of Correlation Metrics

 Pearson Correlation
 Spearman's Rank Correlation
 Kendall Rank Correlation
 Point Biserial Correlation

Pearson Correlation
Pearson correlation is also known as the Pearson product-moment correlation coefficient and is a normalized measurement of the covariance. It measures the linear relationship between two variables and fails to capture a non-linear relationship. Pearson correlation assumes that both variables are normally distributed. It is used for continuous (interval or ratio) variables.

The Pearson correlation coefficient between two variables X and Y can be calculated by the following formula:

r = Cov(X, Y) / (σX * σY)

Limitation of Pearson Correlation

 It fails to capture a non-linear relationship between two variables.

 Usually, we do not use the Pearson correlation coefficient for ordinal variables (where the sequence matters).

Now let us calculate the Pearson correlation coefficient between two variables using the Python library scipy.

Importing the necessary modules

from scipy.stats import pearsonr


import numpy as np
Generating random dataset which is normally distributed

a = np.random.normal(size=10)
b = np.random.normal(size=10)
Calculating Pearson Correlation Coefficient between two variables

pearsonr(a,b)
Output

Here the Pearson correlation is -0.05, so we can say there is essentially no linear relationship between them.

Spearman's Rank Correlation

It is a nonparametric (no prior assumptions about the distribution) measure for calculating the correlation coefficient and is used for ordinal or continuous variables. Spearman's rank correlation can capture both linear and non-linear (monotonic) relationships.

Spearman's rank correlation coefficient between two variables X and Y can be calculated using the following formula:

rs = 1 - (6 * Σ di^2) / (n * (n^2 - 1)), where di is the difference between the ranks of the i-th pair of observations and n is the number of observations.

Now let us calculate Spearman's rank correlation coefficient between two variables using the Python library.

Importing the necessary modules

import numpy as np
from scipy.stats import spearmanr
Generating a random dataset (uniformly distributed)

a = np.random.rand(10)
b = np.random.rand(10)
Calculating Spearman's rank correlation coefficient between two variables

spearmanr(a,b)
Output

Here Spearman's correlation is 0.15, so we can say there is a weak positive correlation between them.

Kendall Rank Correlation

Kendall rank correlation, sometimes called the Kendall tau coefficient, is a nonparametric measure for calculating the rank correlation of ordinal variables. It can also capture both linear and non-linear (monotonic) relationships between two variables. There are three different flavours of Kendall tau, namely tau-a, tau-b and tau-c.

The generalized Kendall rank correlation coefficient between two variables X and Y can be calculated using the following formula:

τ = (number of concordant pairs - number of discordant pairs) / (n * (n - 1) / 2)

Concordant Pair: A pair is concordant if the observed rank is higher on one variable and is also
higher on another variable.

Discordant Pair: A pair is discordant if the observed rank is higher on one variable and is lower
on the other variable.

Now let us calculate the Kendall tau correlation coefficient between two variables using the Python library.

Importing the necessary modules

import numpy as np
from scipy.stats import kendalltau
Generating a random dataset (uniformly distributed)

a = np.random.rand(10)
b = np.random.rand(10)
Calculating the Kendall tau correlation coefficient between two variables

kendalltau(a,b)
Output

Here the Kendall correlation is -0.19, so we can say there is a weak negative correlation between them.

Point Biserial Correlation

Point biserial correlation is used when one variable is dichotomous (binary) and the other variable is continuous. It is essentially the Pearson correlation applied to this setting, so it measures the linear relationship between the two variables. It is denoted by rpb.

Dichotomous Variable: If a variable can take only binary values, like head or tail, or male or female, then such a variable is called a dichotomous variable.

The point biserial correlation coefficient between two variables X and Y can be calculated using the following formula:

rpb = ((M1 - M0) / s) * sqrt(p * q), where M1 and M0 are the means of the continuous variable for the two groups, s is the standard deviation of the continuous variable, and p and q are the proportions of observations in the two groups.

Now let us calculate the point biserial correlation coefficient between two variables using the Python library.

Importing the necessary modules

import numpy as np
from scipy.stats import pointbiserialr
Generating a random dataset: a dichotomous (0/1) variable and a continuous variable

a = np.random.randint(0, 2, size=10)
b = np.random.rand(10)
Calculating the point biserial correlation coefficient between the two variables

pointbiserialr(a,b)
Output

Here the point biserial correlation is 0.305, so we can say there is a positive correlation between them.
