Unit 2 Bi Unlocked Notes
UNIT-II
Data Science: The concept, process and typical tools in data science. Examples of
different algorithms, i.e. segmentation, classification, validation, regression,
recommendations. Exercises using Excel and R to work on histograms, regression, clustering
and text analysis. Correlation between Algorithm and Code in data science.
Data Science can be explained as the entire process of gathering actionable insights from raw
data, involving concepts such as statistical analysis, data analysis, machine learning algorithms,
data modeling, preprocessing of data, etc.
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms,
and systems to extract knowledge and insights from structured and unstructured data. It draws
on techniques from statistics, data analysis, machine learning, and computer science, and can
be applied in a wide range of fields, including business, healthcare, finance, and government,
among others. The goal of data science is to turn raw data into actionable insights that can
inform decision-making and improve outcomes.
Data Science has evolved over the years and did not start out in the form we know today.
Let's take a look at the timeline to understand how it evolved.
1. 1962 – Inception
a. Future of Data Analysis – In 1962, John W. Tukey wrote "The Future of Data Analysis",
where he first argued for the importance of data analysis as a science rather than as a branch
of mathematics.
2. 1974
a. Concise Survey of Computer Methods – In 1974, Peter Naur published the "Concise
Survey of Computer Methods", which surveys the contemporary methods of data processing
in various applications.
3. 1974 – 1980
4. 1980-1990
a. Knowledge Discovery in Databases – In 1989, Gregory Piatetsky-Shapiro chaired the first
Knowledge Discovery in Databases (KDD) workshop, which later went on to become the
annual conference on knowledge discovery and data mining.
5. 1990-2000
a. Database Marketing – In 1994, BusinessWeek published a cover story explaining how big
organizations were using customer data to predict how likely a customer is to buy a specific
product – much like how targeted ads work in modern social media campaigns.
b. International Federation of Classification Societies – In 1996, the term
"Data Science" was used for the first time at an IFCS conference held in Japan.
6. 2000-2010
a. Data Science – An Action Plan for Expanding the Technical Areas of the Field of
Statistics – In 2001, William S. Cleveland published this action plan, which focused on the
major areas of technical work in the field of statistics and coined the term Data Science.
b. Statistical Modeling – The Two Cultures – In 2001, Leo Breiman wrote “There are
two cultures in the use of statistical modeling to reach conclusions from data. One
assumes that the data are generated by a given stochastic data model. The other uses
algorithmic models and treats the data mechanism as unknown”.
c. Data Science Journal – April 2002 saw the launch of a journal that focused on
management of data and databases in science and technology.
7. 2010-Present
a. Data Everywhere – In February 2010, Kenneth Cukier wrote a special report for The
Economist declaring that a new kind of professional had arrived – the data scientist, who
combines the skills of software programmer, statistician and storyteller/artist to extract the
nuggets of gold hidden under mountains of data.
For example, a sales team working on closing leads can follow the approach below to get an optimal solution using Data Science:
1. Gather the previous data on the sales that were closed.
2. Use statistical analysis to find out the patterns that were followed by the leads that were
closed.
3. Use machine learning to get actionable insights for finding out potential leads.
4. Use the new data on sales leads to segregate potential leads that are highly likely to be
closed.
1. Formulating a Business Problem: Every data science project starts with formulating a
business problem. A business problem describes the issues that may be fixed with insights
gathered from an efficient Data Science solution. A simple example of a business problem
is – you have the past 1 year's sales data for a retail store. Using machine learning
approaches, you have to predict or forecast the sales for the next 3 months, which will help
the store plan an inventory that reduces the wastage of products with a shorter shelf life.
2. Data Extraction, Transformation, Loading: The next step in the data science life cycle
is to create a data pipeline where the relevant data is extracted from the source and
transformed into a machine-readable format, and eventually the data is loaded into the
program or the machine learning pipeline to get things started.
For the above example – to forecast the sales, we will need data from the store that is useful
for formulating an efficient machine learning model. Keeping this in mind, we would
identify separate data points that may or may not be affecting the sales for that
particular store.
3. Data Preprocessing: The third step is where the magic happens. Using statistical
analysis, exploratory data analysis, and data wrangling and manipulation, we create
meaningful data. Preprocessing is done to assess the various data points and formulate
hypotheses that best explain the relationships between the various features in the data.
For example – The store sales problem will require the data to be in a time series format
to be able to forecast the sales. The hypothesis testing will test the stationarity of the
series and further computations will show various trends, seasonality and other
relationship patterns in the data.
4. Data Modeling: This step involves advanced machine learning concepts that will be used
for feature selection, feature transformation, standardization of the data, data
normalization, etc. Choosing the best algorithms based on evidence from the above steps
will help you create a model that will efficiently create a forecast for the said months in
the above example.
For example – we can use a time series forecasting approach for this business problem,
applying dimensionality reduction techniques if the data is high-dimensional, and then
create a forecasting model using an AR, MA or ARIMA model to forecast the sales for
the next quarter.
5. Gathering Actionable Insights: The final step of the data science life cycle is
gathering insights for the stated problem. We create inferences and findings from the
entire process that best explain the business problem.
For example – From the above Time series model, we will get the monthly or weekly
sales for the next 3 months. These insights will in turn help the professionals create a
strategy plan to overcome the problem at hand.
6. Solutions For the Business Problem: The solutions for the business problem are
nothing but actionable insights that solve the problem using evidence-based
information. For example – our forecast from the time series model will give an
efficient estimate for the store sales in the next 3 months. Using those insights, the store
can plan their inventory to reduce the wastage of perishable goods.
Data science is a field that involves using scientific methods, processes, algorithms, and
systems to extract knowledge and insights from structured and unstructured data. It can be used
in a variety of industries and applications such as:
1. Business: Data science can be used to analyze customer data, predict market trends, and
optimize business operations.
2. Healthcare: Data science can be used to analyze medical data and identify patterns that
can aid in diagnosis, treatment, and drug discovery.
3. Finance: Data science can be used to identify fraud, analyze financial markets, and make
investment decisions.
4. Social Media: Data science can be used to understand user behavior, recommend content,
and identify influencers.
5. Internet of things: Data science can be used to analyze sensor data from IoT devices and
make predictions about equipment failures, traffic patterns, and more.
6. Natural Language Processing: Data science can be used to make computers understand
human language, process large amounts of text or speech data and make predictions.
Overall, Data Science is a multidisciplinary field that involves the use of statistics, machine
learning, and computer science to extract insights and knowledge from data.
Following are some of the applications that make use of Data Science for their services:
Internet Search Results (Google)
Recommendation Engine (Spotify)
Intelligent Digital Assistants (Google Assistant)
Autonomous Driving Vehicle (Waymo)
Spam Filter (Gmail)
Abusive Content and Hate Speech Filter (Facebook)
Robotics (Boston Dynamics)
Automatic Piracy Detection (YouTube)
Those applications drive a wide variety of use cases in organizations, including the
following:
customer analytics
fraud detection
risk management
stock trading
targeted advertising
website personalization
customer service
predictive maintenance
logistics and supply chain management
image recognition
speech recognition
A range of other roles work alongside data scientists on analytics projects:
Data engineer. Responsibilities include setting up data pipelines and aiding in data
preparation and model deployment, working closely with data scientists.
Data analyst. This is a lower-level position for analytics professionals who don't have
the experience level or advanced skills that data scientists do.
Machine learning engineer. This programming-oriented job involves developing the
machine learning models needed for data science applications.
Data visualization developer. This person works with data scientists to create
visualizations and dashboards used to present analytics results to business users.
Data translator. Also called an analytics translator, it's an emerging role that serves as
a liaison to business units and helps plan projects and communicate results.
Data architect. A data architect designs and oversees the implementation of the
underlying systems used to store and manage data for analytics uses.
Numerous tools are available for data scientists to use in the analytics process, including both
commercial and open source options:
data platforms and analytics engines, such as Spark, Hadoop and NoSQL databases;
programming languages, such as Python, R, Julia, Scala and SQL;
statistical analysis tools like SAS and IBM SPSS;
machine learning platforms and libraries, including TensorFlow, Weka, Scikit-learn,
Keras and PyTorch;
Jupyter Notebook, a web application for sharing documents with code, equations and
other information; and
data visualization tools and libraries, such as Tableau, D3.js and Matplotlib.
Data Science Tools
Let’s take a look at those tools and their advantages now, placed in alphabetical order:
Algorithms.io
This tool is a machine-learning (ML) resource that takes raw data and shapes it into real-time
insights and actionable events.
Advantages:
It's on a cloud platform, so it has all the SaaS advantages of scalability, security, and
infrastructure
Makes machine learning simple and accessible to developers and companies
Apache Hadoop
This open-source framework creates simple programming models and distributes extensive data
set processing across thousands of computer clusters. Hadoop works equally well for research
and production purposes. Hadoop is perfect for high-level computations.
Advantages:
Open-source
Highly scalable
It has many modules available
Failures are handled at the application layer
Apache Spark
Also called "Spark," this is a powerful analytics engine and one of the most widely used data
science tools. It is known for offering lightning-fast cluster computing. Spark
accesses varied data sources such as Cassandra, HDFS, HBase, and S3. It can also easily handle
large datasets.
Advantages:
BigML
This tool is another top-rated data science resource that provides users with a fully interactable,
cloud-based GUI environment, ideal for processing ML algorithms. You can create a free or
premium account depending on your needs, and the web interface is easy to use.
Advantages:
D3.js
D3.js is an open-source JavaScript library that lets you make interactive visualizations on your
web browser. It emphasizes web standards to take full advantage of all of the features of modern
browsers, without being bogged down with a proprietary framework.
Advantages:
Data Robot
This tool is described as an advanced platform for automated machine learning. Data scientists,
executives, IT professionals, and software engineers use it to help them build better quality
predictive models, and do it faster.
Advantages:
With just a single click or line of code, you can train, test, and compare many different
models
It features Python SDK and APIs
It comes with a simple model deployment process
Excel
Yes, even this ubiquitous old spreadsheet workhorse gets some attention here, too! Originally
developed by Microsoft for spreadsheet calculations, it has gained widespread use as a tool for
data processing, visualization, and sophisticated calculations.
Advantages:
You can sort and filter your data with one click
Advanced Filtering function lets you filter data based on your favorite criteria
Well-known and found everywhere
ForecastThis
If you're a data scientist who wants automated predictive model selection, then this is the tool for
you! ForecastThis helps investment managers, data scientists, and quantitative analysts to use
their in-house data to optimize their complex future objectives and create robust forecasts.
Advantages:
Google BigQuery
This is a very scalable, serverless data warehouse tool created for productive data analysis. It
uses Google's infrastructure-based processing power to run super-fast SQL queries against
append-only tables.
Advantages:
Extremely fast
Keeps costs down since users need only pay for storage and compute usage
Easily scalable
Java
Java is the classic object-oriented programming language that's been around for years. It's
simple, architecture-neutral, secure, platform-independent, and object-oriented.
Advantages:
Suitable for large data science projects, especially when used with Java 8's lambda expressions
Java has an extensive suite of tools and libraries that are perfect for machine learning and
data science
Easy to understand
MATLAB
MATLAB is a high-level language coupled with an interactive environment for numerical
computation, programming, and visualization. MATLAB is a powerful tool, a language used in
technical computing, and ideal for graphics, math, and programming.
Advantages:
Intuitive use
It analyzes data, creates models, and develops algorithms
With just a few simple code changes, it scales analyses to run on clouds, clusters, and
GPUs
MySQL
Another familiar tool, MySQL is one of the most popular open-source databases available
today. It's ideal for accessing and managing data stored in relational databases.
Advantages:
NLTK
Short for Natural Language Toolkit, this open-source Python library works with human language
data and is well liked for building NLP programs. NLTK is ideal for rookie data scientists and students.
Advantages:
RapidMiner
This data science tool is a unified platform that incorporates data prep, machine learning, and
model deployment for making data science processes easy and fast. It enjoys heavy use in the
manufacturing, telecommunication, utility, and banking industries.
Advantages:
SAS
This data science tool is designed especially for statistical operations. It is a closed-source
proprietary software tool that specializes in handling and analyzing massive amounts of data for
large organizations. It's well-supported by its company and very reliable. Still, it's a case of
getting what you pay for because SAS is expensive and best suited for large companies and
organizations.
Advantages:
Tableau
Tableau is a Data Visualization software that is packed with powerful graphics to make
interactive visualizations. It is focused on industries working in the field of business intelligence.
The most important aspect of Tableau is its ability to interface with databases, spreadsheets,
OLAP (Online Analytical Processing) cubes, etc. Along with these features, Tableau can
visualize geographical data by plotting longitudes and latitudes on maps.
TensorFlow
TensorFlow has become a standard tool for Machine Learning. It is widely used for advanced
machine learning algorithms like Deep Learning. Developers named TensorFlow after Tensors
which are multidimensional arrays.
It is an open-source and ever-evolving toolkit which is known for its performance and high
computational abilities. TensorFlow can run on both CPUs and GPUs and has recently emerged
on more powerful TPU platforms.
This gives it an unprecedented edge in terms of the processing power of advanced machine
learning algorithms.
Weka
Weka or Waikato Environment for Knowledge Analysis is a machine learning software written
in Java. It is a collection of various Machine Learning algorithms for data mining. Weka consists
of various machine learning tools like classification, clustering, regression, visualization and data
preparation.
It is an open-source GUI software that allows easier implementation of machine learning
algorithms through an interactable platform.
The implementation of Data Science to any problem requires a set of skills. Machine Learning is
an integral part of this skill set.
For doing Data Science, you must know the various Machine Learning algorithms used for
solving different types of problems, as a single algorithm cannot be the best for all types of use
cases. These algorithms find an application in various tasks
like prediction, classification, clustering, etc. from the dataset under consideration.
1. Supervised Algorithms: The training data set has inputs as well as the desired output.
During the training session, the model will adjust its variables to map inputs to the
corresponding output.
2. Unsupervised Algorithms: In this category, there is no target outcome. The algorithms
cluster the data set into different groups.
Linear Regression
The linear regression method is used for predicting the value of the dependent variable using
the values of the independent variable.
The linear regression model is suitable for predicting the value of a continuous quantity.
The linear regression model represents the relationship between the input variables (x) and the
output variable (y) of a dataset in terms of a line given by the equation,
y = m*x + c
where y is the dependent variable and x is the independent variable. Basic calculus (least
squares) is applied to find the values of m and c from the given data set. The main aim of this
method is to find the slope and intercept (often written b1 and b0) of the best-fit line, i.e. the
line that is nearest to most of the data points.
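As a minimal sketch in R (the built-in mtcars data set is an assumption here, since the notes do not name one), lm() estimates the intercept and slope of this best-fit line:

# Fit a simple linear regression: mpg (y) as a function of car weight wt (x)
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)                                      # intercept (c/b0) and slope (m/b1)
summary(fit)                                   # standard errors, R-squared, significance
predict(fit, newdata = data.frame(wt = 3.0))   # predicted mpg for a car weighing 3000 lbs

The fitted coefficients define the line y = m*x + c that lies nearest to the data points.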
Logistic Regression
Logistic regression is used for classification problems, where the output variable is categorical
(for example 0 or 1). It passes a linear combination of the inputs through the sigmoid (logistic)
function:
1 / (1 + e^-x)
Here, e represents the base of the natural logarithm, and we obtain the S-shaped curve with
values between 0 and 1. The equation for logistic regression is therefore written as:
p = 1 / (1 + e^-(b0 + b1*x))
Decision Trees
This algorithm categorizes the population into several sets based on some chosen properties
(independent variables). Usually, this algorithm is used to solve classification problems.
Categorization is done by using techniques such as Gini, Chi-square, entropy, etc.
Decision trees help in solving both classification and prediction problems. They make the data
easy to understand, which improves the accuracy of the predictions. Each internal node of the
decision tree represents a feature or an attribute, each link represents a decision and each leaf
node holds a class label, that is, the outcome.
The drawback of decision trees is that they suffer from the problem of overfitting.
Two Data Science algorithms are most commonly used for implementing decision trees:
ID3 – uses entropy and information gain as the decision metric.
CART – uses the Gini index as the decision metric.
The below image will help you to understand things better.
Here's a decision tree that evaluates scenarios where people want to play football.
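As a minimal sketch, assuming the rpart package is available (the play-football table above is not reproduced, so the built-in iris data stands in for it):

library(rpart)

# Grow a CART-style classification tree: Species is the class label, the other columns are features
tree <- rpart(Species ~ ., data = iris, method = "class")
print(tree)                                    # text view of the split chosen at each node
predict(tree, iris[1:5, ], type = "class")     # predicted class labels for a few rows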
Support Vector Machine (SVM)
Support Vector Machine or SVM comes under the category of supervised Machine Learning
algorithms and finds an application in both classification and regression problems. It is most
commonly used for classification problems, and it classifies the data points by using
a hyperplane.
The first step of this Data Science algorithm involves plotting all the data items as individual
points in an n-dimensional graph.
Here, n is the number of features and the value of each individual feature is the value of a
specific coordinate. Then we find the hyperplane that best separates the two classes for
classifying them.
Finding the correct hyperplane plays the most important role in classification. The data points
which are closest to the separating hyperplane are the support vectors.
Let us consider the following example to understand how you can identify the right hyperplane.
The basic principle for selecting the best hyperplane is that you have to choose the hyperplane
that separates the two classes very well.
In this case, the hyperplane B is classifying the data points very well. Thus, B will be the right
hyperplane.
All three hyperplanes are separating the two classes properly. In such cases, we have to select the
hyperplane with the maximum margin. As we can see in the above image, hyperplane B has the
maximum margin therefore it will be the right hyperplane.
In this case, the hyperplane B has the maximum margin but it is not classifying the two classes
accurately. Thus, A will be the right hyperplane.
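As a minimal sketch, assuming the e1071 package is available (the notes do not name a library), a linear-kernel SVM can be fitted in R as follows:

library(e1071)

# Keep two of the three iris species so a single separating hyperplane applies directly
two_class <- droplevels(subset(iris, Species != "virginica"))

# Fit an SVM with a linear kernel; the fitted hyperplane separates the two classes
model <- svm(Species ~ ., data = two_class, kernel = "linear")
model$index                                           # rows of the training data acting as support vectors
table(predict(model, two_class), two_class$Species)   # confusion matrix on the training data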
Naive Bayes
The Naive Bayes algorithm helps in building predictive models. We use this Data Science
algorithm when we want to calculate the probability of the occurrence of an event in the future.
Here, we have prior knowledge that another event has already occurred.
The Naive Bayes algorithm works on the assumption that each feature is independent and has
an individual contribution to the final prediction.
The Naive Bayes theorem is represented by:
P(A|B) = P(B|A) * P(A) / P(B)
where:
P(A|B) is the posterior probability, i.e. the probability of A given that B has already occurred.
P(B|A) is the likelihood, i.e. the probability of B given that A has already occurred.
P(A) is the class prior probability.
P(B) is the predictor prior probability.
Example: Let's understand it using an example. Below I have a training data set of
weather and the corresponding target variable 'Play'. Now, we need to classify whether
players will play or not based on the weather conditions. Let's follow the below steps to
perform it.
Step 1: Convert the data set into a frequency table of weather conditions against the target
variable.
Step 2: Create a Likelihood table by finding the probabilities like Overcast probability = 0.29 and
probability of playing is 0.64.
Step 3: Now, use the Naive Bayesian equation to calculate the posterior probability for each
class. The class with the highest posterior probability is the outcome of the prediction.
Problem: Players will play if the weather is sunny – is this statement correct?
We can solve it using above discussed method, so P(Yes | Sunny) = P( Sunny | Yes) *
P(Yes) / P (Sunny)
Here we have P (Sunny |Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P( Yes)= 9/14 = 0.64
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is higher than P(No | Sunny) = 0.40, so the prediction is that the players will play.
Naive Bayes uses a similar method to predict the probability of different classes based on
various attributes. This algorithm is mostly used in text classification and with problems
having multiple classes.
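As a minimal sketch, assuming the e1071 package is available and using a small invented weather/play table (not the 14-row data set described above):

library(e1071)

# Toy training data: weather outlook vs. whether the players played
weather <- data.frame(
  Outlook = c("Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Overcast", "Sunny", "Rainy", "Overcast"),
  Play    = c("No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes"),
  stringsAsFactors = TRUE
)

# Fit the model: prior probabilities P(Play) and conditional probabilities P(Outlook | Play)
nb <- naiveBayes(Play ~ Outlook, data = weather)
nb                                   # inspect the probability tables
predict(nb, weather, type = "raw")   # posterior probability of Yes/No for each row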
KNN
KNN stands for K-Nearest Neighbors. This Data Science algorithm can be applied to both
classification and regression problems.
This is a simple algorithm which predicts the label of an unknown data point from its k nearest
neighbors. The value of k is a critical factor for the accuracy of prediction. The nearest
neighbors are determined by calculating the distance using basic distance functions like the
Euclidean distance.
The KNN algorithm considers the complete dataset as the training dataset. After training the
model using the KNN algorithm, we try to predict the outcome of a new data point.
Here, the KNN algorithm searches the entire data set for identifying the k most similar or nearest
neighbors of that data point. It then predicts the outcome based on these k instances. For finding
the nearest neighbors of a data instance, we can use various distance measures like Euclidean
distance, Hamming distance, etc. To better understand, let us consider the following example.
Here we have represented the two classes A and B by the circle and the square respectively.
The selection of the value of k is a critical task: k should be neither too small nor too large. A
simple heuristic is to take k = √n, where n is the number of data points.
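As a minimal sketch, assuming the class package is available, knn() classifies each test point by a majority vote of its k nearest training points:

library(class)

set.seed(1)
train_idx <- sample(nrow(iris), 100)            # 100 rows for training, the rest for testing
train_x <- iris[train_idx, 1:4]
test_x  <- iris[-train_idx, 1:4]
train_y <- iris$Species[train_idx]

# k chosen near sqrt(n) as suggested above: sqrt(100) = 10
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 10)
table(pred, iris$Species[-train_idx])           # confusion matrix on the held-out rows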
K-Means Clustering
Clustering basically means dividing the data set into groups of similar data items called clusters.
K-means clustering categorizes the data items into k groups of similar items.
For measuring this similarity, we use the Euclidean distance, which is given by
D = √((x1 − x2)² + (y1 − y2)²)
K means clustering is iterative in nature.
First, we select the value of k which is equal to the number of clusters into which we
want to categorize our data.
Then we assign the random center values to each of these k clusters.
Now we start searching for the nearest data points to the cluster centers by using the
Euclidean distance formula.
In the next step, we calculate the mean of the data points assigned to each cluster.
Again we search for the nearest data points to the newly created centers and assign them
to their closest clusters.
We should keep repeating the above steps until there is no change in the data points
assigned to the k clusters.
In summary: first, we randomly initialise and select k points; these k points are the initial means.
We use the Euclidean distance to find the data points that are closest to each cluster centre.
Then we calculate the mean of all the points in each cluster, which gives its centroid.
We repeat these steps iteratively until the points no longer change their clusters.
K-means clustering is one of the most popular unsupervised learning algorithms. It is easy to
understand and implement.
The objective of the K-means clustering is to minimize the Euclidean distance that each point
has from the centroid of the cluster. To better understand, let us consider the following example.
Neural Networks
Neural Networks are modeled after the neurons in the human brain. A network comprises many
layers of neurons that are structured to transmit information from the input layer to the output
layer. Between the input and the output layer there are hidden layers, which can be many or just
one. A network with no hidden layer is the classic perceptron, while a network with one or more
hidden layers is a multilayer perceptron.
In the above diagram for a simple neural network, there is an input layer that takes the input in
the form of a vector. This input is then passed to the hidden layer, which comprises various
mathematical functions that perform computations on the given input.
For example, given the images of cats and dogs, our hidden layers perform various
mathematical operations to find the maximum probability of the class our input image falls in.
This is an example of binary classification where the class, that is, dog or cat, is assigned its
appropriate place.
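As a minimal sketch, assuming the nnet package is available, a feed-forward network with a single hidden layer can be trained in R (iris stands in for the cats-and-dogs images, which are not part of the notes):

library(nnet)

set.seed(1)
# One hidden layer with 4 neurons between the input and output layers
net <- nnet(Species ~ ., data = iris, size = 4, maxit = 200, trace = FALSE)
table(predict(net, iris, type = "class"), iris$Species)   # training-set confusion matrix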
Principal Component Analysis (PCA)
PCA is basically a technique for performing dimensionality reduction of a dataset with the
least effect on its variance. This means removing the redundant features but keeping the
important ones.
To achieve this, PCA transforms the variables of the dataset into a new set of variables. This new
set of variables represents the principal components.
All the PCs are orthogonal (i.e. they are at a right angle to each other).
They are created in such a way that, as the number of the component increases, the amount
of variation it retains decreases.
This means the 1st principal component retains the variation to the maximum extent as
compared to the original variables.
PCA is basically used for summarizing data. While dealing with a dataset, there might be some
features related to each other. PCA helps you reduce such features and make predictions with
fewer features without compromising accuracy.
For example, consider the following diagram in which we have reduced a 3D space to a 2D
space.
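As a minimal sketch using base R's prcomp() (the data set is an assumption), the principal components of a numeric dataset and the variation each one retains can be computed as follows:

# Principal components of the four numeric iris measurements, centred and scaled first
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)         # proportion of variance retained by each component
head(pca$x[, 1:2])   # the data projected onto the first two principal components (4D -> 2D)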
Random Forests
Random Forests overcome the overfitting problem of decision trees and help in solving both
classification and regression problems. They work on the principle of ensemble learning.
Ensemble learning methods are based on the idea that a large number of weak learners can work
together to give high-accuracy predictions.
Random Forests work in a similar way. They consider the predictions of a large number of
individual decision trees when giving the final outcome. The votes of the different decision trees
are counted, and the prediction with the largest number of votes becomes the prediction of the
model. Let us understand this with an example.
In the above image, there are two classes labeled as A and B. In this random forest consisting of
7 decision trees, 3 have voted for class A and 4 for class B. As class B has received the
maximum votes, the model's prediction will be class B.
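As a minimal sketch, assuming the randomForest package is available, an ensemble of trees can be grown and its majority-vote predictions inspected:

library(randomForest)

set.seed(1)
# 500 decision trees, each grown on a bootstrap sample of the data
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
rf                          # out-of-bag error estimate and confusion matrix
predict(rf, iris[1:5, ])    # majority-vote predictions for a few rows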
Histograms in Excel
There are different ways of adding a histogram in Excel. Below are methods that cover both
newer and older versions of Microsoft Excel.
Method 1: In-built histogram chart. This is the easiest method to add a histogram to your Excel
worksheet, as it uses the built-in histogram option in the Charts section. It works only in Excel
2016 and newer versions.
The first step to add a histogram to the Excel sheet requires you to select the entire dataset that
you wish to represent using the histogram. To do this, simply click on any of the corner cells and
drag the pointer over the rest of the cells that you wish to select.
Once you have selected the dataset for which you wish to add a histogram in Excel, go to the
menu bar and click on the Insert tab. The Insert tab is used to insert images, charts, filters,
hyperlinks, etc.
Now click on the "Charts" option under the Insert tab and then select the "Insert Statistic Chart"
option; this will open a new dialog box under it.
Inside the Insert Statistic Chart dialog box, click on the Histogram group and then the Histogram
chart option. This will add a histogram in Excel for the dataset selected in the first step.
After successfully adding the histogram in Excel using the above steps, you can then customize
the histogram as you like. The customization options include By Category, Bin Width,
Automatic and Number of Bins, among others.
Method 2: Data Analysis ToolPak. The second method for adding histograms to your Excel sheet
uses the Data Analysis ToolPak and works in the 2016 version as well as older versions of
Microsoft Excel.
To use this method, you must first enable the Analysis ToolPak. To do so, click the File tab →
Options → Add-ins in the navigation pane → Excel Add-ins → Go → tick Analysis ToolPak in
the Add-ins dialog box and then click OK.
Once the add-in is enabled, follow the steps below to add a histogram in Microsoft Excel using
the Data Analysis ToolPak:
To create a histogram in Excel using the provided data, you first need to create data intervals,
also known as bins. You need to specify bins separately in an additional column alongside the
provided data that you need to make the histogram for.
The next step to add a histogram is to go to the menu bar and click on the Data tab; the Data
tab ribbon will open.
Now, click on the Data Analysis icon inside the Analysis Group in the Data tab on the rightmost
corner of the tab.
After selecting the Data Analysis icon, a new dialog box will appear. Select histogram from the
list of the Analysis tools and click OK.
Once you have selected the histogram from the list of analysis tools provided, a new dialog box
will appear, requiring you to fill in the Input Range, Bin Range and Output Range and to select
the Chart Output option. Leave the rest of the boxes unchecked; check the Labels box only if
your input range includes labels.
Step 6: Click OK
Click OK to confirm all the information you just filled in and add the histogram in Excel.
[NOTE: If you leave the Bin Range box empty, the histogram will still be created. The system
will automatically create six equally spaced bins for your histogram.]
The above method works for all Microsoft Excel versions up to and including 2016. As a
backup, if this method doesn't work for you, you can use the third and last method mentioned
below to add a histogram in Microsoft Excel.
Method 3: FREQUENCY function. The third method for adding a histogram to your Excel sheet
uses the FREQUENCY function and is usually used to create a dynamic histogram in Excel. A
dynamic histogram updates itself when the underlying data is changed.
Step 1: Bins
The first step here as well is to create separate bins in a separate column. These bins or data
intervals are what you want to show the frequency of.
Step 2: Enter the FREQUENCY formula, for example:
=FREQUENCY(B2:B41,D2:D8)
Here the marks of the students are in column B and the bins are in column D. Note that this is
an array formula, so you must press Ctrl+Shift+Enter instead of just pressing 'Enter'.
For getting a more accurate result and to make sure you don't make any possible mistakes,
follow the below steps:
1. Select the cells that are adjacent to the bins. For the above formula used, they will be the
cells E2:E8
2. Press F2 to get the edit access for cell E2.
3. Enter the Frequency formula.
4. Press Ctrl+Shift+Enter
This will give you the frequency of the bins you created. With the frequency already calculated,
all you need to do is to create the histogram in Excel which will be nothing more than just a
column chart.
Histograms in R language
A histogram displays rectangular bars whose areas are proportional to the frequency of a
variable over successive numerical intervals; it is a graphical representation that groups data
points into specified ranges. A special feature is that there are no gaps between the bars, which
makes it look similar to a vertical bar graph.
R – Histograms
We can create a histogram in the R programming language using the hist() function.
Syntax: hist(v, main, xlab, xlim, ylim, breaks, col, border)
Parameters:
v: the vector of numerical values used in the histogram.
main: the title of the chart.
col: sets the colour of the bars.
xlab: the label for the horizontal axis.
border: sets the border colour of each bar.
xlim: the range of values shown on the x-axis.
ylim: the range of values shown on the y-axis.
breaks: controls the width (or number) of the bars.
Creating a simple histogram chart using the above parameters: a numeric vector v is plotted
using hist().
Example:
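A minimal sketch (the values in the vector v are invented for illustration):

# A numeric vector to plot
v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39)

# Simple histogram with a title, an axis label, a fill colour and a border colour
hist(v, main = "Simple Histogram", xlab = "Value",
     col = "green", border = "black", xlim = c(0, 40))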
Output:
Creating a histogram with the count displayed on top of each bar:
# Histogram saved in m so that its bin midpoints and counts can be reused for the labels
m <- hist(v, main = "Histogram with Labels", xlab = "Value",
          col = "green", border = "black", ylim = c(0, 6))
# Setting labels above each bar
text(m$mids, m$counts, labels = m$counts,
     adj = c(0.5, -0.5))
Output:
Creating histogram charts with different (non-uniform) bar widths by passing a vector of break
points to the breaks parameter.
Example:
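A minimal sketch, reusing the vector v from the first example; the break points are an invented illustration:

# Non-uniform bin widths supplied through the breaks parameter
hist(v, main = "Histogram with Non-uniform Width", xlab = "Value",
     col = "blue", border = "black",
     breaks = c(0, 5, 10, 20, 40))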
Clustering in R
Hard clustering: in this type of clustering, a data point either belongs to a cluster completely or
not at all, and each data point is assigned to exactly one cluster. A typical algorithm used for
hard clustering is k-means clustering.
Soft clustering: in soft clustering, instead of putting each data point into exactly one cluster, a
probability or likelihood of belonging to each cluster is assigned to the data point; each data
point exists in all the clusters with some probability. Algorithms used for soft clustering are the
fuzzy clustering method and soft k-means.
K-means is an iterative hard clustering technique that uses an unsupervised learning algorithm.
The total number of clusters is pre-defined by the user, and the data points are clustered based
on their similarity to each cluster. The algorithm also finds the centroid of each cluster.
Algorithm:
Specify number of clusters (K): Let us take an example of k =2 and 5 data points.
Randomly assign each data point to a cluster: In the below example, the red and green
color shows 2 clusters with their respective random data points assigned to them.
Calculate cluster centroids: The cross mark represents the centroid of the corresponding
cluster.
Re-allocate each data point to their nearest cluster centroid: Green data point is
assigned to the red cluster as it is near to the centroid of red cluster.
Re-figure cluster centroid
Example:
# Loading dataset
df <- mtcars
# Scaling dataset
df <- scale(df)
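The fitting call itself is not shown above; a minimal completion of the example is:

# Fitting k-means on the scaled data; nstart tries several random starting centroids
set.seed(123)
km <- kmeans(df, centers = 5, nstart = 25)
km$cluster    # cluster assignment of each car
km$centers    # centroid of each of the 5 clusters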
With k = 5, each observation in mtcars is assigned to one of five clusters.
Logistic Regression is a model that takes response variables (dependent variables) and features
(independent variables) to determine the estimated probability of an event. A logistic model is
used when the response variable has categorical values such as 0 or 1, for example whether a
student will pass or fail, whether a mail is spam or not, image classification, etc. Here we'll
discuss regression analysis, types of regression, and the implementation of logistic regression
in R programming.
Regression Analysis in R
Regression analysis is a group of statistical processes used in R programming and statistics to
determine the relationship between dataset variables. Generally, regression analysis is used to
determine the relationship between the dependent and independent variables of the dataset.
Regression analysis helps us understand how the dependent variable changes when one of the
independent variables is changed while the other independent variables are kept constant. This
helps in building a regression model and, further, helps in forecasting values with respect to a
change in one of the independent variables. On the basis of the type of dependent variable, the
number of independent variables, and the shape of the regression line, there are 4 types of
regression analysis techniques: Linear Regression, Logistic Regression, Multinomial Logistic
Regression and Ordinal Logistic Regression.
Types of Regression Analysis
Linear Regression
Linear Regression is one of the most widely used regression techniques to model the
relationship between two variables. It uses a linear relationship to model the regression line.
There are 2 variables used in the linear relationship equation i.e., predictor variable and
response variable.
y = ax + b
where,
y is the response variable
x is the predictor variable
a and b are the coefficients
The regression line created using this technique is a straight line. The response variable is
derived from the predictor variables, whose coefficients are estimated using statistical
procedures. Linear regression is widely used, but this technique is not capable of predicting
probabilities.
Logistic Regression
On the other hand, logistic regression has an advantage over linear regression, as it is
capable of predicting values within a bounded range (between 0 and 1). Logistic regression
is used to predict the values of a categorical response, for example male or female,
winner or loser, etc.
Logistic regression uses the following sigmoidal function:
Y = 1 / (1 + e^(-z))
where
y represents the response variable,
z represents the equation of the independent variables (features).
To learn about the other optional parameters of the glm() function, use the command below in R:
help("glm")
Example:
Let us assume a vector of IQ level of students in a class. Another vector contains the result of
the corresponding student i.e., fail or pass (0 or 1) in an exam.
# Generate random IQ values with mean = 30 and sd =2
IQ <- rnorm(40, 30, 2)
# Binary exam result (0 = fail, 1 = pass), generated at random here for illustration
result <- rbinom(40, 1, 0.5)
# Data Frame
df <- as.data.frame(cbind(IQ, result))
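The fitting step that produces the output below is the standard glm() call with a binomial family (a sketch consistent with the Call: line shown in the output):

# Fit the logistic regression model and inspect the data and the fit
model <- glm(result ~ IQ, family = binomial, data = df)
df
summary(model)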
Output:
IQ result
1 25.46872 0
2 26.72004 0
3 27.16163 0
4 27.55291 1
5 27.72577 0
6 28.00731 0
7 28.18095 0
8 28.28053 0
9 28.29086 0
10 28.34474 1
11 28.35581 1
12 28.40969 0
13 28.72583 0
14 28.81105 0
15 28.87337 1
16 29.00383 1
17 29.01762 0
18 29.03629 0
19 29.18109 1
20 29.39251 0
21 29.40852 0
22 29.78844 0
23 29.80456 1
24 29.81815 0
25 29.86478 0
26 29.91535 1
27 30.04204 1
28 30.09565 0
29 30.28495 1
30 30.39359 1
31 30.78886 1
32 30.79307 1
33 30.98601 1
34 31.14602 0
35 31.48225 1
36 31.74983 1
37 31.94705 1
38 31.94772 1
39 33.63058 0
40 35.35096 1
Call:
glm(formula = result ~ IQ, family = binomial, data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1451 -0.9742 -0.4950 1.0326 1.7283
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -16.8093 7.3368 -2.291 0.0220 *
IQ 0.5651 0.2482 2.276 0.0228 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Text Analysis in R:
Text analytics is the process of examining unstructured data in the form of text to gather some
insights on patterns and topics of interest.
library(stopwords)
library(gutenbergr)
library(lubridate)
library(tidyverse)
library(reshape2)
library(igraph)
library(ggraph)
devtools::install_github("bradleyboehmke/harrypotter")
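As a minimal sketch of what a basic text analysis step looks like (the sentence and the use of the stopwords package are assumptions for illustration):

library(stopwords)

text <- "Data science turns raw data into actionable insights and data driven decisions"

# Tokenise, lower-case, drop common English stop words, then count word frequencies
words <- tolower(unlist(strsplit(text, "\\s+")))
words <- words[!words %in% stopwords("en")]
sort(table(words), decreasing = TRUE)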
The mutual relationship, covariation, or association between two or more variables is called
Correlation. It is not concerned with either the changes in x or y individually, but with the
measurement of simultaneous variations in both variables.
Covariance is a statistical tool that quantifies how two random variables vary together around
their expected values (means). In simple words, it is a measure of the linear relationship
between two random variables. It can take any positive or negative value.
Positive Covariance: it indicates that the two variables tend to move in the same direction,
which means that if the value of one variable increases, the value of the other variable also
tends to increase.
Zero Covariance: it indicates that there is no linear relationship between them.
Negative Covariance: it indicates that the two variables tend to move in opposite directions,
which means that if the value of one variable increases, the value of the other variable tends
to decrease, and vice versa.
The covariance between two variables X and Y can be calculated using the following formula:
Cov(X, Y) = Σ (xi − x̄)(yi − ȳ) / (n − 1)
where x̄ and ȳ are the sample means and n is the number of observations (for a population, divide by n instead of n − 1).
The most commonly used correlation coefficients are:
Pearson Correlation
Spearman's Rank Correlation
Kendall Rank Correlation
Point Biserial Correlation
Pearson Correlation
Pearson correlation is also known as the Pearson product-moment correlation coefficient and is a
normalized measurement of the covariance. It measures the linear relationship between two
variables and fails to capture non-linear relationships. Pearson correlation assumes that both
variables are normally distributed, and it is appropriate for continuous (interval or ratio)
variables rather than nominal ones.
The Pearson correlation coefficient between two variables X and Y can be calculated by the
following formula:
r = Cov(X, Y) / (σX · σY) = Σ (xi − x̄)(yi − ȳ) / √( Σ (xi − x̄)² · Σ (yi − ȳ)² )
Now let us calculate the Pearson correlation coefficient between two variables using the
Python library SciPy.
import numpy as np
from scipy.stats import pearsonr

# Generating random datasets which are normally distributed
a = np.random.normal(size=10)
b = np.random.normal(size=10)

# Calculating the Pearson correlation coefficient between the two variables
pearsonr(a, b)
Output
Spearman's rank correlation coefficient between two variables X and Y can be calculated using
the following formula:
ρ = 1 − (6 Σ di²) / (n (n² − 1))
where di is the difference between the ranks of the i-th pair of observations and n is the number of observations.
Now let us calculate Spearman's rank correlation coefficient between two variables using the
Python library.
import numpy as np
from scipy.stats import spearmanr

# Generating random datasets distributed uniformly on [0, 1)
a = np.random.rand(10)
b = np.random.rand(10)

# Calculating Spearman's rank correlation coefficient between the two variables
spearmanr(a, b)
Output
Here Spearman's correlation is 0.15, so we can say there is a weak positive correlation between the two variables.
The Kendall rank correlation coefficient between two variables X and Y can be calculated (in
its simplest form) using the following formula:
τ = (C − D) / (n (n − 1) / 2)
where C is the number of concordant pairs, D is the number of discordant pairs and n is the number of observations.
Concordant Pair: a pair of observations is concordant if the observation that ranks higher on one
variable also ranks higher on the other variable.
Discordant Pair: a pair of observations is discordant if the observation that ranks higher on one
variable ranks lower on the other variable.
Now let us calculate the Kendall tau correlation coefficient between two variables using the
Python library.
import numpy as np
from scipy.stats import kendalltau

# Generating random datasets distributed uniformly on [0, 1)
a = np.random.rand(10)
b = np.random.rand(10)

# Calculating the Kendall tau correlation coefficient between the two variables
kendalltau(a, b)
Output
Here the Kendall correlation is -0.19, so we can say there is a weak negative correlation between the two variables.
Dichotomous Variable: If a variable can have only binary values like head or tail, male or
female then such variable is called a dichotomous variable.
The point biserial correlation coefficient between a continuous variable X and a dichotomous
variable Y can be calculated using the following formula:
rpb = ((M1 − M0) / sn) · √(p · q)
where M1 and M0 are the means of X for the two groups of Y, sn is the standard deviation of X, and p and q are the proportions of observations in the two groups.
Now let us calculate the point biserial correlation coefficient between two variables using the
Python library.
import numpy as np
from scipy.stats import pointbiserialr

# Generating a random dichotomous (0/1) variable and a random continuous variable
a = np.random.randint(0, 2, size=10)
b = np.random.rand(10)

# Calculating the point biserial correlation coefficient between the two variables
pointbiserialr(a, b)
Output
Here the point biserial correlation is 0.305, so we can say there is a moderate positive correlation between the two variables.