
What is Machine Learning

In the real world, we are surrounded by humans who can learn from their experiences,
and by computers or machines that simply work on our instructions. But can a machine
also learn from experience or past data the way a human does? This is where Machine
Learning comes in.

Machine Learning is a subset of artificial intelligence that is mainly concerned
with the development of algorithms which allow a computer to learn from data and
past experience on its own. The term machine learning was first coined by Arthur
Samuel in 1959. We can define it in a summarized way as:

Machine learning enables a machine to automatically learn from data, improve its performance with
experience, and make predictions without being explicitly programmed.

With the help of sample historical data, known as training data, machine
learning algorithms build a mathematical model that helps in making predictions or
decisions without being explicitly programmed. Machine learning brings computer
science and statistics together to create predictive models. Machine learning
constructs or uses algorithms that learn from historical data. The more information
we provide, the better the performance.

A machine has the ability to learn if it can improve its performance by gaining
more data.

How does Machine Learning work


A Machine Learning system learns from historical data, builds prediction
models, and, whenever it receives new data, predicts the output for it. The
accuracy of the predicted output depends on the amount of data: a larger amount
of data helps to build a better model that predicts the output more accurately.

Suppose we have a complex problem for which we need to make predictions. Instead
of writing code for it, we just feed the data to generic algorithms; with the help of
these algorithms, the machine builds the logic from the data and predicts the
output. Machine learning has changed the way we think about such problems. The
block diagram below explains the working of a Machine Learning algorithm:
Features of Machine Learning:
o Machine learning uses data to detect various patterns in a given dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining in that it also deals with huge amounts
of data.

Need for Machine Learning


The need for machine learning is increasing day by day. The reason is that machine
learning can do tasks that are too complex for a person to implement directly. As
humans, we have limitations: we cannot access and process huge amounts of data
manually, so we need computer systems, and this is where machine learning makes
things easy for us.

We can train machine learning algorithms by providing them with huge amounts of data
and letting them explore the data, construct the models, and predict the required output
automatically. The performance of a machine learning algorithm depends on the
amount of data, and it can be measured by the cost function. With the help of
machine learning, we can save both time and money.

The importance of machine learning can be easily understood by its use cases.
Currently, machine learning is used in self-driving cars, cyber fraud detection, face
recognition, friend suggestions by Facebook, and so on. Various top companies such
as Netflix and Amazon have built machine learning models that use a vast
amount of data to analyze user interests and recommend products accordingly.

Following are some key points which show the importance of Machine Learning:

o Rapid increase in the production of data
o Solving complex problems which are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from data

CLASSIFICATION OF MACHINE LEARNING


At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

1) Supervised Learning
Supervised learning is a type of machine learning method in which we provide sample
labeled data to the machine learning system in order to train it, and on that basis it
predicts the output.

The system creates a model using labeled data to understand the dataset and learn
about each example; once training and processing are done, we test the model
by providing sample data to check whether it predicts the correct output or not.

The goal of supervised learning is to map input data to output data. Supervised
learning is based on supervision, much like a student learning under the supervision
of a teacher. An example of supervised learning is spam filtering.

Supervised learning algorithms can be further grouped into two categories (a minimal classification sketch follows the list):


o Classification
o Regression
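
As an illustration (not from the original notes), a minimal supervised-classification sketch in Python with scikit-learn might look like the following; the Iris dataset and logistic regression are stand-in choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                # labeled data: features X, labels y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)        # a simple classification algorithm
model.fit(X_train, y_train)                      # learn the mapping from inputs to labels
print('test accuracy:', accuracy_score(y_test, model.predict(X_test)))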

2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any
supervision.

The machine is trained on a set of data that has not been labeled, classified, or
categorized, and the algorithm needs to act on that data without any supervision.
The goal of unsupervised learning is to restructure the input data into new features
or groups of objects with similar patterns.

In unsupervised learning, we don't have a predetermined result. The machine tries to
find useful insights from the huge amount of data. Unsupervised learning can be further
classified into the following categories of algorithms (a short clustering sketch follows
the list):

o Clustering
o Density Estimation
o Dimensionality Reduction
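
For illustration (a minimal sketch, not from the original notes), k-means clustering of unlabeled points with scikit-learn could look like this; the synthetic data and the choice of 3 clusters are assumptions:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # unlabeled data points
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)      # assume 3 groups
labels = kmeans.fit_predict(X)                                # group similar points together
print(labels[:10])                                            # cluster assigned to the first 10 points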

3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method in which a learning
agent gets a reward for each right action and a penalty for each wrong action.
The agent learns automatically from this feedback and improves its performance. In
reinforcement learning, the agent interacts with the environment and explores it. The
goal of the agent is to collect the maximum reward points, and in doing so it improves
its performance.

A robotic dog that automatically learns the movement of its limbs is an example
of reinforcement learning.
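
To make the reward/penalty idea concrete, here is a small, hedged sketch of tabular Q-learning on a made-up 5-state corridor; the environment, reward values, and hyperparameters are illustrative assumptions, not part of the original notes:

import random

n_states, actions = 5, [0, 1]              # actions: 0 = move left, 1 = move right
Q = [[0.0, 0.0] for _ in range(n_states)]  # Q-table of state-action values
alpha, gamma, eps = 0.5, 0.9, 0.2          # learning rate, discount, exploration rate

for _ in range(500):                       # training episodes
    s = 0
    while s != n_states - 1:
        a = random.choice(actions) if random.random() < eps else Q[s].index(max(Q[s]))
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else -0.01   # reward for the goal, small penalty per step
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print([q.index(max(q)) for q in Q])        # learned policy: should prefer moving right
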
APPLICATIONS OF MACHINE LEARNING
Machine learning is a buzzword in today's technology, and it is growing very rapidly day by
day. We use machine learning in our daily life even without knowing it, for example in Google
Maps, Google Assistant, Alexa, etc. Below are some of the most trending real-world applications of
Machine Learning:
Image Recognition: Image recognition is one of the most common applications of machine
learning. It is used to identify objects, persons, places, digital images, etc. A popular use case
of image recognition and face detection is automatic friend tagging suggestion:
Facebook provides a feature of automatic friend-tagging suggestions. Whenever we upload a photo
with our Facebook friends, we automatically get a tagging suggestion with names, and the
technology behind this is machine learning's face detection and recognition algorithm.
It is based on the Facebook project named "DeepFace," which is responsible for face
recognition and person identification in pictures.
Speech Recognition: When using Google, we get an option to "Search by voice"; this comes
under speech recognition and is a popular application of machine learning.
Speech recognition is the process of converting voice instructions into text; it is also known
as "speech to text" or "computer speech recognition." At present, machine learning algorithms
are widely used in speech recognition applications. Google Assistant, Siri, Cortana,
and Alexa use speech recognition technology to follow voice instructions.
Traffic prediction: If we want to visit a new place, we take the help of Google Maps, which shows
us the correct path with the shortest route and predicts the traffic conditions.
It predicts traffic conditions such as whether traffic is clear, slow-moving, or heavily
congested in two ways:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day
Everyone who uses Google Maps is helping to make the app better. It takes information
from users and sends it back to its database to improve performance.
Product recommendations: Machine learning is widely used by various e-commerce and
entertainment companies such as Amazon, Netflix, etc., for product recommendations to the
user. Whenever we search for a product on Amazon, we start getting advertisements
for the same product while surfing the internet in the same browser, and this is
because of machine learning.
Google understands the user's interests using various machine learning algorithms and suggests
products according to those interests.
Similarly, when we use Netflix, we get recommendations for entertainment series,
movies, etc., and this is also done with the help of machine learning.
Self-driving cars: One of the most exciting applications of machine learning is self-driving
cars. Machine learning plays a significant role in self-driving cars. Tesla, the most popular
car manufacturer in this space, is working on self-driving cars. It uses machine learning
methods to train the car's models to detect people and objects while driving.
Email Spam and Malware Filtering: Whenever we receive a new email, it is filtered
automatically as important, normal, or spam. We always receive important mail in our
inbox with the important symbol and spam emails in our spam box, and the technology behind
this is machine learning. Below are some spam filters used by Gmail:
o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
ISSUES IN MACHINE LEARNING.
a. Poor Quality of Data

Data plays a significant role in the machine learning process. One of the significant
issues that machine learning professionals face is the absence of good-quality data. Unclean
and noisy data can make the whole process extremely exhausting. We don't want our
algorithm to make inaccurate or faulty predictions, so the quality of data is essential for
a good output. We therefore need to ensure that data preprocessing, which includes
removing outliers, filtering missing values, and removing unwanted features, is done
with great care.
b. Underfitting of Training Data

Underfitting occurs when the model is unable to establish an accurate relationship between
the input and output variables. It is like trying to fit into undersized jeans: the model is too
simple to capture the precise relationship. To overcome this issue:
 Increase the training time of the model
 Increase the complexity of the model
 Add more features to the data
 Reduce the regularization parameters
c. Overfitting of Training Data

Overfitting occurs when a machine learning model is trained on a massive amount of noisy or
biased data and learns patterns that do not generalize, which negatively affects its performance.
It is like trying to fit into oversized jeans. Unfortunately, this is one of the significant issues
faced by machine learning professionals: an algorithm trained with noisy and biased data will
see its overall performance suffer.
Let's understand this with the help of an example. Consider a model trained to
differentiate between a cat, a rabbit, a dog, and a tiger. The training data contains 1000 cats,
1000 dogs, 1000 tigers, and 4000 rabbits. Then there is a considerable probability that it will
identify a cat as a rabbit. In this example, we had a vast amount of data, but it was biased,
hence the prediction was negatively affected.
It can be solved with (a short sketch of limiting model complexity follows this list):
 Analyzing the data with the utmost care
 Using data augmentation techniques
 Removing outliers from the training set
 Selecting a model with fewer features
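
As a minimal sketch of the "fewer features / lower complexity" remedy (an illustration, assuming scikit-learn and its built-in breast-cancer dataset), compare an unrestricted decision tree with a depth-limited one:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)                   # unrestricted tree: tends to overfit
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)   # restricted complexity

print('deep    train/test accuracy:', deep.score(X_tr, y_tr), deep.score(X_te, y_te))
print('shallow train/test accuracy:', shallow.score(X_tr, y_tr), shallow.score(X_te, y_te))

The unrestricted tree typically scores near 100% on training data but lower on the test set, which is the overfitting gap described above.
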
d. Machine Learning is a Complex Process

The machine learning industry is young and continuously changing. Rapid hit-and-trial
experiments are being carried out. The process keeps transforming, so there is a high
chance of error, which makes the learning complex. It includes analyzing the data, removing
data bias, training the data, applying complex mathematical calculations, and a lot more. Hence
it is a really complicated process, which is another big challenge for machine learning
professionals.
e. Lack of Training Data

The most important task in the machine learning process is to train the model on enough
data to achieve accurate output. Too little training data will produce inaccurate or overly
biased predictions. A machine learning algorithm needs a lot of data to learn to distinguish
between cases. For complex problems, it may even require millions of examples. Therefore we
need to ensure that machine learning algorithms are trained with sufficient amounts of data.
f. Slow Implementation
This is one of the common issues faced by machine learning professionals. Machine
learning models are highly efficient at providing accurate results, but this takes a
tremendous amount of time. Slow programs, data overload, and excessive requirements
mean it usually takes a lot of time to get accurate results. Further, the models require
constant monitoring and maintenance to deliver the best output.
g. Imperfections in the Algorithm When Data Grows

Suppose you have found quality data, trained the model well, and the predictions are
precise and accurate. The best model of the present may become inaccurate in the
future and require further adjustment. So we need regular monitoring and maintenance to
keep the algorithm working. This is one of the most exhausting issues faced by machine
learning professionals.
MACHINE LEARNING WORKFLOW | PROCESS STEPS

We have discussed-
 Machine learning is building machines that can adapt and learn from experience.
 Machine learning systems are not explicitly programmed.

In this article, we will discuss machine learning workflow.

Machine Learning Workflow

Machine learning workflow refers to the series of stages or steps involved in the
process of building a successful machine learning system.

The various stages involved in the machine learning workflow are-


1. Data Collection
2. Data Preparation
3. Choosing Learning Algorithm
4. Training Model
5. Evaluating Model
6. Predictions

Let us discuss each stage one by one.

1. Data Collection-

In this stage,
 Data is collected from different sources.
 The type of data collected depends upon the type of desired project.
 Data may be collected from various sources such as files, databases etc.
 The quality and quantity of the gathered data directly affect the accuracy of the desired
system.

2. Data Preparation-

In this stage,
 Data preparation is done to clean the raw data.
 Data collected from the real world is transformed to a clean dataset.
 Raw data may contain missing values, inconsistent values, duplicate instances etc.
 So, raw data cannot be directly used for building a model.

Different methods of cleaning the dataset are-


 Ignoring the missing values
 Removing instances having missing values from the dataset.
 Estimating the missing values of instances using mean, median or mode.
 Removing duplicate instances from the dataset.
 Normalizing the data in the dataset.

This is the most time-consuming stage in the machine learning workflow. A minimal pandas sketch of these cleaning steps is shown below.
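
The cleaning steps above might look roughly as follows in pandas; the file name and column names ('raw_data.csv', 'age', 'income') are hypothetical placeholders, not part of the original notes:

import pandas as pd

df = pd.read_csv('raw_data.csv')                          # hypothetical raw dataset
df = df.drop_duplicates()                                 # remove duplicate instances
df['age'] = df['age'].fillna(df['age'].median())          # estimate missing values with the median
df = df.dropna()                                          # or remove remaining instances with missing values
df['income'] = (df['income'] - df['income'].min()) / (df['income'].max() - df['income'].min())   # normalize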

3. Choosing Learning Algorithm-

In this stage,
 The best-performing learning algorithm is researched.
 It depends upon the type of problem that needs to be solved and the type of data we have.
 If the problem is to classify and the data is labeled, classification algorithms are used.
 If the problem is to perform a regression task and the data is labeled, regression
algorithms are used.
 If the problem is to create clusters and the data is unlabeled, clustering algorithms are
used.

The following chart provides the overview of learning algorithms-

4. Training Model-

In this stage,
 The model is trained to improve its ability.
 The dataset is divided into a training dataset and a testing dataset.
 The training/testing split is on the order of 80/20 or 70/30 (see the sketch after this list).
 It also depends upon the size of the dataset.
 The training dataset is used for training purposes.
 The testing dataset is used for testing purposes.
 The training dataset is fed to the learning algorithm.
 The learning algorithm finds a mapping between the input and the output and generates
the model.
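
An illustrative sketch of the split-and-train step with scikit-learn (the built-in wine dataset and random forest are placeholder choices, not part of the original notes):

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)                         # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)   # 80/20 split

model = RandomForestClassifier(random_state=42)           # placeholder learning algorithm
model.fit(X_train, y_train)                               # the algorithm learns the input-output mapping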

5. Evaluating Model-

In this stage,
 The model is evaluated to test whether it is any good.
 The model is evaluated using the kept-aside testing dataset.
 This allows us to test the model against data that has never been used for training.
 Metrics such as accuracy, precision, recall, etc. are used to measure performance (a
metrics sketch follows this list).
 If the model does not perform well, it is re-built using different hyperparameters.
 The accuracy may be further improved by tuning the hyperparameters.
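
Continuing the placeholder sketch above (model, X_test, and y_test come from the training step), the evaluation step could be written as:

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_pred = model.predict(X_test)
print('accuracy :', accuracy_score(y_test, y_pred))
print('precision:', precision_score(y_test, y_pred, average='macro'))
print('recall   :', recall_score(y_test, y_pred, average='macro'))
# If the scores are poor, rebuild the model with different hyperparameters (e.g. n_estimators, max_depth).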

6. Predictions-

In this stage,
 The built system is finally used to do something useful in the real world.
 Here, the true value of machine learning is realized.
Wine Quality Prediction.

Link: https://machinelearningprojects.net/wine-quality-prediction/

Step 1 – Importing libraries required for Wine Quality Prediction.


 Step 2 – Read input data.
 Step 3 – Describe the data.
 Step 4 – Take info from the data.
 Step 5 – Plot out the data.
 Step 6 – Count the no. of instances of each class.
 Step 7 – Make just 2 categories good and bad.
 Step 8 – Allotting 0 to bad and 1 to good.
 Step 9 – Again check counts.
 Step 10 – Balancing the two classes.
 Step 11 – Again check the counts of classes in the new dataframe.
 Step 12 – Checking the correlation between columns.
 Step 13 – Splitting the data into train and test.
 Step 14 – Finally training our Wine Quality Prediction model.
Step 1 – Importing libraries required for Wine Quality Prediction.

import numpy as np

import pandas as pd

import seaborn as sns

from sklearn.svm import SVC

import matplotlib.pyplot as plt

from sklearn.linear_model import SGDClassifier

from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import GridSearchCV

from sklearn.ensemble import RandomForestClassifier


from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

%matplotlib inline

Step 2 – Read input data.

wine = pd.read_csv('winequality-red.csv')

wine.head()

Input data
Step 3 – Describe the data.

wine.describe()

Description of our data


Step 4 – Take info from the data.

wine.info()

 From the output below, we can infer that there are no NULL values
in our data.

Info of our data


Step 5 – Plot out the data.

fig = plt.figure(figsize=(15,10))

plt.subplot(3,4,1)

sns.barplot(x='quality',y='fixed acidity',data=wine)

plt.subplot(3,4,2)

sns.barplot(x='quality',y='volatile acidity',data=wine)

plt.subplot(3,4,3)

sns.barplot(x='quality',y='citric acid',data=wine)

plt.subplot(3,4,4)

sns.barplot(x='quality',y='residual sugar',data=wine)

plt.subplot(3,4,5)

sns.barplot(x='quality',y='chlorides',data=wine)

plt.subplot(3,4,6)

sns.barplot(x='quality',y='free sulfur dioxide',data=wine)

plt.subplot(3,4,7)

sns.barplot(x='quality',y='total sulfur dioxide',data=wine)

plt.subplot(3,4,8)

sns.barplot(x='quality',y='density',data=wine)
plt.subplot(3,4,9)

sns.barplot(x='quality',y='pH',data=wine)

plt.subplot(3,4,10)

sns.barplot(x='quality',y='sulphates',data=wine)

plt.subplot(3,4,11)

sns.barplot(x='quality',y='alcohol',data=wine)

plt.tight_layout()

Step 6 – Count the no. of instances of each class.

wine['quality'].value_counts()

 We can see that we have 6 classes of quality: 3, 4, 5, 6, 7, and 8,
but we don't want it like this.
 So what we will do is mark every rating from 3 to 6 as
BAD and ratings of 7 and 8 as GOOD.

value counts
Step 7 – Make just 2 categories good and bad.

ranges = (2,6.5,8)

groups = ['bad','good']

wine['quality'] = pd.cut(wine['quality'],bins=ranges,labels=groups)

 Here we are cutting the quality column into 2 bins using pd.cut():
(2, 6.5] as BAD and (6.5, 8] as GOOD.
Step 8 – Allotting 0 to bad and 1 to good.

le = LabelEncoder()

wine['quality'] = le.fit_transform(wine['quality'])

wine.head()

 Replace BAD with 0.


 Replace GOOD with 1.
 For reference see the quality column in the image below.
Step 9 – Again check counts.

wine['quality'].value_counts()

 Now we have just 2 classes, 0 and 1, i.e. BAD and GOOD.
 But as we can see, the data is highly unbalanced, so we will
balance it in the next step.

Unequal value counts


Step 10 – Balancing the two classes.

good_quality = wine[wine['quality']==1]

bad_quality = wine[wine['quality']==0]

bad_quality = bad_quality.sample(frac=1)

bad_quality = bad_quality[:217]

new_df = pd.concat([good_quality,bad_quality])

new_df = new_df.sample(frac=1)

new_df

 In this step, we are simply balancing our dataset.
 We make a new data frame good_quality which contains only
good-quality wine, i.e. rows where the quality is 1.
 Similarly, we make one for bad_quality.
 Then we shuffle the bad-quality data
using df.sample(frac=1), which shuffles the data and returns a
100% fraction of it.
 Then we take out 217 samples of bad_quality because we
have just 217 samples of good_quality.
 Then we concatenate the 217 samples of each class, so our
final data frame has 217*2 = 434 rows.
 Finally, we shuffle the data again.

Step 11 – Again check the counts of classes in the new dataframe.

new_df['quality'].value_counts()

 Now we can see that both classes have 217 instances and
hence our data is balanced.

Equal value counts


Step 12 – Checking the correlation between columns.

new_df.corr()['quality'].sort_values(ascending=False)

 From the output below, we can infer that quality depends most
strongly on the alcohol content of the wine.
Correlations
Step 13 – Splitting the data into train and test.

from sklearn.model_selection import train_test_split

X = new_df.drop('quality',axis=1)

y = new_df['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

Step 14 – Finally training our Wine Quality Prediction model.

param = {'n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]}

grid_rf = GridSearchCV(RandomForestClassifier(), param, scoring='accuracy', cv=10)

grid_rf.fit(X_train, y_train)

print('Best parameters --> ', grid_rf.best_params_)

# Wine Quality Prediction

pred = grid_rf.predict(X_test)

print(confusion_matrix(y_test,pred))

print('\n')

print(classification_report(y_test,pred))

print('\n')

print(accuracy_score(y_test,pred))

 I also used some other algorithms like SVM and the SGD Classifier,
but Random Forest stood out as always.
 Here I used GridSearchCV with Random Forest to find the
best value of the 'n_estimators' parameter.
 Finally, we ended up with an accuracy of 83.9%, which is very
good for such a small dataset.
final metrics

White wine quality analysis

Loading the data set and libraries


Variable explanation

1) Fixed Acidity: — It represents the level of fixed acids in the wine-making
process and consists of tartaric, malic, citric, and succinic acids.
Fixed acidity gives the sour component in wine making.

2) Volatile Acidity: — It represents the level of acetic acid produced in the
wine-making process, together with acids like lactic, formic, butyric,
and propionic. Volatile acidity adds sharp, vinegar-like acids to the wine.

3) Citric Acid: — It is a component of fixed acidity, which gives a
sour, crisp flavor to the wine.

4) Residual Sugar: — It refers to any natural grape sugars that are
left over after fermentation ceases.

5) Chlorides: — It represents the amount of salt in the wine, which adds
saltiness during fermentation (356 mg/L).

6) Free Sulfur Dioxide: — 'Free' SO2 is SO2 that is unbound to
compounds in the wine and is therefore able to exert an
antioxidant/preservative action. 'Bound' SO2 has already
complexed with other compounds in the wine (such as sugars) and
has essentially been quenched, so that it no longer has
antioxidant/preservative activity. Total SO2 is the sum of both of these
forms.

7) Total Sulfur Dioxide: — It is the sum of the bound and free
forms of SO2 (about 0.8 mg/L of molecular SO2).
8) pH Value: — pH describes how acidic or basic a wine is on a scale
from 0 (very acidic) to 14 (very basic); most wines are between 3 and 3.4
on the pH scale.

9) Sulphates: — Sulphates are a wine additive that contributes to SO2
levels; sulfur dioxide is a component of sulphates. Levels in wine range
from about 5 mg/L (5 parts per million) to about 200 mg/L.

10) Alcohol: — It measures the level of ethanol in the wine, which affects its taste.

Types of data in the data set

In this data set, we have a total of 4898 instances and 11 features; the 12th
column is the class label. All 11 features are floats and are
numerical variables. The 12th variable is an integer, but it is a
categorical variable. This data set has no missing values, which means the
data set is in a standardized form, so we can proceed with further
exploratory analysis.
Class label and Frequency

Here the class represents the ranking of wine quality, measured on a
scale between 1 and 7. From the bar graph, we can see that quality
ratings of 3-4 are the most frequent. We can bin the class label into
three categories, as sketched below:

a) 1–3 = 'Poor'

b) 4–5 = 'Good'

c) 6–7 = 'Rich'
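
A hedged sketch of that binning with pandas, following the 1-7 scale described here; the file name, separator, and DataFrame name are assumptions rather than the article's exact code:

import pandas as pd

white = pd.read_csv('winequality-white.csv', sep=';')     # assumed file name and separator
bins = (0, 3, 5, 7)                                       # 1-3 -> Poor, 4-5 -> Good, 6-7 -> Rich
labels = ['Poor', 'Good', 'Rich']
white['class_level'] = pd.cut(white['quality'], bins=bins, labels=labels)
print(white['class_level'].value_counts())
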
Normal Distribution of all the feature

Most of the feature values in the given data set do not have a standard
unit of measurement, so making assumptions about them can give
false or irrelevant interpretations; values may be overvalued or
undervalued. Out of the 11 features, only pH has a standard
measuring unit; it varies between 3 and 3.4 (acidic) for wine.
The mean pH value is 3.21, which indicates a fairly acidic wine. The
standard ranges of the other features vary with location, legal norms,
pH value, type of wine, etc.

Objective

From this data set, we are trying to find which features impact the
class label of the wine. For that, we first have to look at the
relationship between the features and the class.

The best way to find this relationship is correlation, which shows how
strongly each feature is related to the class. From the correlation
figures, we can find out how many features are correlated with the class.
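
One common way to compute and visualize these relationships (a sketch assuming the white DataFrame from the previous snippet and a recent pandas/seaborn install):

import seaborn as sns
import matplotlib.pyplot as plt

corr = white.corr(numeric_only=True)                      # pairwise correlations of the numeric columns
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature correlations (white wine)')
plt.tight_layout()
plt.show()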

Negative Correlation

1) pH value has a negative relationship with fixed acidity

2) Alcohol has a negative relationship with density, residual sugar, and
total sulfur dioxide

3) Density has a negative relationship with the class

Positive Correlation
1) Density has a positive relationship with residual sugar.

2) Alcohol and sulfate have a positive relationship with Class.

3) Free sulfur and total sulfur also have a positive relationship.

Depth analysis of correlation

From the correlation report, we found the dependent and
independent variables. Alcohol and sulphates have a positive correlation
with the wine class, which will help us in preparing the
prediction model.

Correlation between alcohol and class


Correlation between sulfates and class level

Correlation between class_level and density


Correlation impact

From the three box plots, we can see that sulphates and alcohol are
positively related, and density negatively related, to higher wine
class labels.

The alcohol percentage increases from 10 to 13 as the class label goes
from poor to rich.

Sulphate units increase from 0.4 to 0.6 with the class labels.

Density decreases from 1.00 to 0.99 with the class labels.

Logistic regression

We are using logistic regression because the class label is a categorical
variable, so we have to use a classifier to predict the quality of the
wine. First, we split our data set into training and testing sets and
then create a logistic regression model using the training set.

After splitting the data set into train and test, we fit the classifier
and check the predicted values from the fitted model.
As we can see, y_pred classifies the test data set: the classifier
learned the classification pattern from the training data set. Our next
step is to test the accuracy of the classifier, i.e. what percentage of
the data set it can predict or classify correctly.
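
The original code cells appear only as images, so here is a minimal reconstruction of what the split-and-fit step might look like with scikit-learn; the variable and column names reuse the earlier sketch and are assumptions, not the article's exact code:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = white.dropna(subset=['class_level'])               # keep only rows that fell into a bin
X = data.drop(columns=['quality', 'class_level'])         # features
y = data['class_level']                                   # categorical class label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('accuracy:', accuracy_score(y_test, y_pred))        # the article reports about 88%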

So the accuracy of the predictor is 88%. For an ideal classifier, the
accuracy should be more than 90%. For that, we will use a non-linear
classifier: a neural network.
Neural Network

Another machine learning technique is the neural network. When logistic
regression is unable to separate the classes with a linear boundary, we
can use a neural network to deal with the non-linearity.
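
A hedged sketch of a simple neural-network classifier for the same data, using scikit-learn's MLPClassifier; the architecture and scaling choices are assumptions rather than the article's exact setup:

from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

nn = make_pipeline(
    StandardScaler(),                                     # neural networks train better on scaled inputs
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=42),
)
nn.fit(X_train, y_train)                                  # reuses the split from the previous sketch
print('neural network accuracy:', nn.score(X_test, y_test))
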
Red wine quality analysis

Dataset

The dataset used in this article is based on the ingredients or the
composition of (mostly) red wine. This means it includes:

1. Fixed Acidity: Most acids involved with wine are fixed or non-volatile
(they do not evaporate readily)

2. Volatile Acidity: The amount of acetic acid in wine, which at too
high a level can lead to an unpleasant, vinegar taste

3. Citric Acid: Often added to wines to increase acidity, complement


a specific flavor or prevent ferric hazes

4. Residual Sugar: From the natural grape sugars left in a wine after
the alcoholic fermentation finishes.

5. Chlorides: The amount of salt in the wine

6. Free Sulfur Dioxide: It prevents microbial growth and the


oxidation of wine

7. Total Sulfur Dioxide: The amount of free + bound forms of SO₂

8. Density: Sweeter wines have a higher density

9. pH: Describes the level of acidity on a scale of 0–14. Most wines
are between 3 and 4 on the pH scale

10. Alcohol: Available in varying quantities in wines; it is what makes
the drinkers sociable
11. Sulphates: A wine additive that contributes to SO₂ levels and
acts as an antimicrobial and antioxidant

12. Quality: the output variable (the value to be predicted)

The dataset is taken
from https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009.

You can also find it on my
GitHub: https://github.com/VireZee/Statistics-and-Probability/blob/Stable/winequality-red.csv

Importing the Python libraries that will be used

In this project, I used the following libraries:

Load & Read Dataset

After importing the Python libraries, I load and read the dataset
from the source given in the Kaggle link above (also available on my
GitHub).
Exploratory Data Analysis (EDA)

There are two methods to analyze the data correlation in Multiple
Linear Regression: Univariate Analysis and Multivariate Analysis.

Univariate Analysis

Univariate Analysis is the simplest form of analyzing data. "Uni"
means "one", so in other words, the data has only one variable. It
takes data, summarizes it, and finds patterns in it. In simple form,
we check the summary statistics for each column in the data set
(a summary of only one variable at a time).
Multivariate Analysis

Multivariate Analysis takes a whole host of variables into
consideration. This makes it a complicated as well as an essential tool.
The greatest virtue of such a model is that it takes as many factors
into consideration as possible. This results in a tremendous reduction
of bias and gives a result closest to reality. In simple form,
Multivariate Analysis is used for understanding the interactions
between more than two fields in the dataset.

From the correlation results, we see that the correlations with Quality,
ordered from strongest to weakest, are: Alcohol (0.48), Volatile Acidity
(-0.39), Sulphates (0.25), Citric Acid (0.23), Total Sulfur Dioxide
(-0.19), Density (-0.17), Chlorides (-0.13), Fixed Acidity (0.12), pH
(-0.058), Free Sulfur Dioxide (-0.051), and Residual Sugar (0.014).

Independent Variables & Dependent Variables


So we already know which are the independent variables and which is the
dependent variable. All the variables excluding Quality are assigned
as independent variables, while Quality itself is the dependent variable.

Splitting Dataset into Training Set & Test Set

After defining the independent variables and the dependent variable,
we need to split both into train and test sets in order to apply the
model. In this case, I use a 20% proportion for the testing data.

Applying Linear Regression

The next step is to train the Linear Regression model using the training
data that we just split in the previous step.

Prediction

Now we come to the prediction step. To make a prediction, we have to
pass the numbers in as parameters (input your independent variables to
predict the Quality, i.e. the dependent variable).
When I input the data one by one (from Fixed Acidity to Alcohol), the
predicted Quality is 9.82988592 (roughly 9 out of 10).
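
Since the notebook cells are shown only as images, the split / fit / predict flow described above might be reconstructed roughly as follows, assuming the usual UCI red-wine CSV and column names:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

red = pd.read_csv('winequality-red.csv')                  # assumed local file name
X = red.drop('quality', axis=1)                           # independent variables
y = red['quality']                                        # dependent variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)   # 20% for testing

reg = LinearRegression().fit(X_train, y_train)
sample = X_test.iloc[[0]]                                 # one row of independent variables
print('predicted quality:', reg.predict(sample)[0])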

Evaluation

After making the prediction, we need to evaluate our project to find
how accurate the prediction is.

Model Evaluation

To produce a model summary, we can use the Ordinary Least Squares (OLS)
principle.
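
A sketch of producing an OLS summary with statsmodels (assuming it is installed; this mirrors, but may not exactly match, the article's cell):

import statsmodels.api as sm

X_const = sm.add_constant(X_train)                        # OLS needs an explicit intercept column
ols_model = sm.OLS(y_train, X_const).fit()
print(ols_model.summary())                                # coefficients, R-squared, p-values, etc.
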
Evaluating Goodness of Fit using Mean Squared Error (MSE)

The formula for the Mean Squared Error (MSE), sometimes known as the
Mean Squared Deviation (MSD), is MSE = (1/n) * sum((y_i - y_hat_i)^2),
i.e. the average of the squared differences between the actual and
predicted values.

We can use the Mean Squared Error (MSE) to judge how accurate our
prediction is, by calculating how large the MSE is compared to the
range of our data. The lower the MSE compared to the data range, the
more accurate our prediction is.

To know the data range of Quality, we need to look at the unique values
listed in the Quality column.
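
A sketch of that check, reusing the fitted regression and data from the sketch above (scikit-learn's mean_squared_error computes the formula directly):

from sklearn.metrics import mean_squared_error

y_pred = reg.predict(X_test)                              # reg, X_test, y_test from the regression sketch
mse = mean_squared_error(y_test, y_pred)
quality_range = red['quality'].max() - red['quality'].min()   # data range of the Quality column
print('MSE:', mse)
print('Quality range:', quality_range)                        # compare the MSE against this range
print('Unique Quality values:', sorted(red['quality'].unique()))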
