Eda 5
In the real world, we are surrounded by humans who can learn from their experiences, and by computers or machines that simply follow our instructions. But can a machine also learn from experiences or past data the way a human does? This is where Machine Learning comes in.
Machine learning enables a machine to learn automatically from data, improve its performance with experience, and make predictions without being explicitly programmed.
With the help of sample historical data, known as training data, machine learning algorithms build a mathematical model that helps in making predictions or decisions without being explicitly programmed. Machine learning brings computer science and statistics together to create predictive models. It constructs or uses algorithms that learn from historical data: the more information we provide, the better the performance.
A machine has the ability to learn if it can improve its performance by gaining
more data.
We can train machine learning algorithms by providing them with huge amounts of data and letting them explore the data, construct models, and predict the required output automatically. The performance of a machine learning algorithm depends on the amount of data, and it can be assessed using a cost function. With the help of machine learning, we can save both time and money.
The importance of machine learning can be easily understood from its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions on Facebook, and more. Top companies such as Netflix and Amazon have built machine learning models that use vast amounts of data to analyse user interests and recommend products accordingly.
Machine learning can be broadly classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) Supervised Learning
Supervised learning is a type of machine learning in which we provide sample labeled data to the machine learning system in order to train it, and on that basis it predicts the output.
The system creates a model using the labeled data to understand the datasets and learn about each example. Once training and processing are done, we test the model by providing sample data to check whether it predicts the correct output.
The goal of supervised learning is to map input data to output data. Supervised learning is based on supervision, just as a student learns under the supervision of a teacher. An example of supervised learning is spam filtering.
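As a minimal sketch (assuming scikit-learn is installed; the tiny "spam-like" feature data below is invented for illustration, not taken from the text), a supervised classifier learns from labeled examples and then predicts labels for unseen inputs:
# Minimal supervised-learning sketch on invented toy data.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Each row is a labeled example: [suspicious words, links] -> spam (1) or not spam (0)
X = [[8, 5], [7, 3], [6, 4], [1, 0], [0, 1], [2, 0], [9, 6], [1, 1]]
y = [1, 1, 1, 0, 0, 0, 1, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression()
model.fit(X_train, y_train)      # learn a mapping from inputs to labels
print(model.predict(X_test))     # predict labels for unseen examples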
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any supervision.
The machine is trained with a set of data that has not been labeled, classified, or categorized, and the algorithm must act on that data without any supervision. The goal of unsupervised learning is to restructure the input data into new features or into groups of objects with similar patterns. Common unsupervised tasks include:
o Clustering
o Density Estimation
o Dimensionality Reduction
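For clustering, as a minimal sketch (assuming scikit-learn; the data points below are invented for illustration), k-means groups unlabeled points by similarity:
# Minimal unsupervised-learning sketch: k-means clustering on unlabeled points.
from sklearn.cluster import KMeans
# No labels are given; the algorithm discovers the grouping on its own.
points = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
          [8.0, 8.1], [7.9, 8.3], [8.2, 7.8]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)            # cluster assignment for each point
print(kmeans.cluster_centers_)   # centre of each discovered group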
3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method in which a learning agent gets a reward for each right action and a penalty for each wrong action.
The agent learns automatically from this feedback and improves its performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of the agent is to collect the maximum reward points, and in doing so it improves its performance.
A robotic dog that automatically learns the movement of its limbs is an example of reinforcement learning.
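A minimal illustrative sketch (not from the source; the tiny "corridor" environment, rewards, and parameters below are all invented for the example) of how an agent can learn from rewards using tabular Q-learning:
# Tabular Q-learning on a toy corridor: the agent starts in state 0 and is
# rewarded only when it reaches the last state by moving right.
import random
n_states, actions = 5, [0, 1]            # action 0 = move left, action 1 = move right
Q = [[0.0, 0.0] for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1    # learning rate, discount, exploration rate
for episode in range(200):
    state = 0
    while state != n_states - 1:
        # Explore occasionally, otherwise take the best known action.
        action = random.choice(actions) if random.random() < epsilon else (1 if Q[state][1] >= Q[state][0] else 0)
        next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0   # reward for the right action
        # Update the estimate of how good this action is in this state.
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state
print(Q)   # after training, moving right has the higher value in every state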
APPLICATIONS OF MACHINE LEARNING
Machine learning is a buzzword in today's technology, and it is growing very rapidly day by day. We use machine learning in our daily lives, often without knowing it, in tools such as Google Maps, Google Assistant, and Alexa. Below are some of the most trending real-world applications of Machine Learning:
Image Recognition: Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, and so on in digital images. A popular use case of image recognition and face detection is automatic friend-tagging suggestions:
Facebook provides a feature of automatic friend-tagging suggestions. Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with names, and the technology behind this is machine learning's face detection and recognition algorithms.
It is based on Facebook's project named "DeepFace", which is responsible for face recognition and person identification in pictures.
Speech Recognition: When using Google, we get a "Search by voice" option; this falls under speech recognition, a popular application of machine learning.
Speech recognition is the process of converting voice instructions into text, and it is also known as "speech to text" or "computer speech recognition". At present, machine learning algorithms are widely used in speech recognition applications. Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.
Traffic prediction: If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions.
It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, in two ways:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day
Everyone who uses Google Maps is helping to make the app better. It takes information from users and sends it back to its database to improve performance.
Product recommendations: Machine learning is widely used by various e-commerce and entertainment companies, such as Amazon and Netflix, for product recommendations to users. Whenever we search for a product on Amazon, we start getting advertisements for the same product while surfing the internet in the same browser, and this is because of machine learning.
Google understands the user's interests using various machine learning algorithms and suggests products according to customer interest.
Similarly, when we use Netflix, we find recommendations for series, movies, and so on, and this is also done with the help of machine learning.
Self-driving cars: One of the most exciting applications of machine learning is self-driving cars, in which machine learning plays a significant role. Tesla, a well-known car manufacturer, is working on self-driving cars and uses machine learning methods to train its car models to detect people and objects while driving.
Email Spam and Malware Filtering: Whenever we receive a new email, it is automatically filtered as important, normal, or spam. We receive important mail in our inbox with the important symbol, while spam emails go to the spam box, and the technology behind this is machine learning. Below are some spam filters used by Gmail (a toy illustration follows the list):
o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
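As a toy illustration only (the keywords, sender list, and threshold below are invented and are not Gmail's actual rules), a crude content/rules-based filter could look like this:
# Toy rules-based spam filter (purely illustrative; real filters are far more sophisticated).
BLACKLISTED_SENDERS = {"spam@example.com"}             # "general blacklists filter"
SPAM_KEYWORDS = {"lottery", "winner", "free money"}    # "content filter"
def classify_email(sender, subject, body):
    if sender in BLACKLISTED_SENDERS:
        return "spam"
    text = (subject + " " + body).lower()
    hits = sum(keyword in text for keyword in SPAM_KEYWORDS)
    return "spam" if hits >= 2 else "normal"           # "rules-based filter"
print(classify_email("spam@example.com", "Hello", "You are a winner"))        # spam
print(classify_email("friend@example.com", "Lunch?", "Free money lottery"))   # spam
print(classify_email("friend@example.com", "Lunch?", "See you at noon"))      # normal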
ISSUES IN MACHINE LEARNING.
a. Poor Quality of Data
Data plays a significant role in the machine learning process. One of the significant issues machine learning professionals face is the absence of good quality data. Unclean and noisy data can make the whole process extremely exhausting. We don't want our algorithm to make inaccurate or faulty predictions, so the quality of the data is essential for a good output. Therefore, we need to ensure that data preprocessing, which includes removing outliers, handling missing values, and removing unwanted features, is done as carefully as possible.
b. Underfitting of Training Data
Underfitting occurs when a model is unable to establish an accurate relationship between the input and output variables. It is like trying to fit into undersized jeans: the model is too simple to capture the precise relationship in the data. To overcome this issue:
o Increase the training time of the model
o Increase the complexity of the model
o Add more features to the data
o Reduce the regularization parameters
c. Overfitting of Training Data
Overfitting refers to a machine learning model that fits its training data too closely, including its noise and bias, which negatively affects its performance on new data. It is like trying to fit into oversized jeans. Unfortunately, this is one of the significant issues faced by machine learning professionals: the algorithm is trained with noisy or biased data, which affects its overall performance.
Let's understand this with the help of an example. Consider a model trained to differentiate between a cat, a rabbit, a dog, and a tiger. The training data contains 1000 cats, 1000 dogs, 1000 tigers, and 4000 rabbits. Then there is a considerable probability that it will identify a cat as a rabbit. In this example, we had a vast amount of data, but it was biased, so the prediction was negatively affected.
It can be addressed by (a small sketch follows this list):
o Analyzing the data as carefully as possible
o Using data augmentation techniques
o Removing outliers from the training set
o Selecting a model with fewer features
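The sketch below (illustrative only, assuming scikit-learn and NumPy; the synthetic sine data and the polynomial degrees are invented for the example) shows how a too-simple model underfits while an overly complex one chases noise, by comparing training and test errors:
# Under/overfitting illustration with polynomial regression on synthetic data.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
rng = np.random.default_rng(0)
X = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)   # noisy curved data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
for degree in (1, 4, 15):                 # too simple, reasonable, very complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(X_train)),   # training error
          mean_squared_error(y_test, model.predict(X_test)))     # test error
# Degree 1 typically has high error everywhere (underfitting), while degree 15
# typically has near-zero training error but a larger test error (overfitting).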
d. Machine Learning is a Complex Process
The machine learning industry is young and continuously changing, and rapid trial-and-error experiments are constantly being carried out. The process is still transforming, so there is a high chance of error, which makes learning complex. It includes analyzing the data, removing data bias, training the model, applying complex mathematical calculations, and much more, which makes it a really complicated process and another big challenge for machine learning professionals.
e. Lack of Training Data
The most important task in the machine learning process is to train the model on enough data to achieve accurate output. Too little training data will produce inaccurate or overly biased predictions. A machine learning algorithm needs a lot of data to distinguish between cases reliably; for complex problems, it may even require millions of examples. Therefore we need to ensure that machine learning algorithms are trained with sufficient amounts of data.
f. Slow Implementation
This is one of the common issues faced by machine learning professionals. Machine learning models can be highly effective at providing accurate results, but training and inference can take a tremendous amount of time. Slow programs, data overload, and excessive requirements usually mean it takes a long time to produce accurate results. Further, a model requires constant monitoring and maintenance to deliver the best output.
g. Imperfections in the Algorithm When Data Grows
So you have found quality data, trained the model well, and the predictions are precise and accurate. The best model of the present may become inaccurate in the future as the data grows and changes, and it will require further adjustment. So we need regular monitoring and maintenance to keep the algorithm working. This is one of the most exhausting issues faced in machine learning.
MACHINE LEARNING WORKFLOW | PROCESS STEPS
We have discussed-
Machine learning is building machines that can adapt and learn from experience.
Machine learning systems are not explicitly programmed.
Machine learning workflow refers to the series of stages or steps involved in the
process of building a successful machine learning system.
1. Data Collection-
In this stage,
Data is collected from different sources.
The type of data collected depends upon the type of desired project.
Data may be collected from various sources such as files, databases etc.
The quality and quantity of gathered data directly affects the accuracy of the desired
system.
2. Data Preparation-
In this stage,
Data preparation is done to clean the raw data.
Data collected from the real world is transformed to a clean dataset.
Raw data may contain missing values, inconsistent values, duplicate instances etc.
So, raw data cannot be directly used for building a model.
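As a small sketch of this stage (the toy data frame below is invented for illustration and is not part of the wine project), pandas can be used to turn raw data with duplicates and missing values into a clean dataset:
# Minimal data-preparation sketch: handle duplicates and missing values with pandas.
import numpy as np
import pandas as pd
# Toy raw data containing a duplicate row and a missing value.
raw = pd.DataFrame({
    'fixed acidity': [7.4, 7.8, 7.8, np.nan],
    'alcohol':       [9.4, 9.8, 9.8, 10.0],
    'quality':       [5, 5, 5, 6],
})
clean = raw.drop_duplicates()   # remove duplicate instances
clean = clean.dropna()          # drop rows with missing values
# Alternatively, missing numeric values can be filled instead of dropped:
# clean = raw.drop_duplicates().fillna(raw.mean(numeric_only=True))
print(clean)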
3. Choosing Learning Algorithm-
In this stage,
The best performing learning algorithm is researched.
It depends upon the type of problem that needs to be solved and the type of data we have.
If the problem is to classify and the data is labeled, classification algorithms are used.
If the problem is to perform a regression task and the data is labeled, regression
algorithms are used.
If the problem is to create clusters and the data is unlabeled, clustering algorithms are
used.
4. Training Model-
In this stage,
The model is trained to improve its ability.
The dataset is divided into training dataset and testing dataset.
The training-testing split is typically of the order of 80/20 or 70/30.
It also depends upon the size of the dataset.
Training dataset is used for training purpose.
Testing dataset is used for the testing purpose.
Training dataset is fed to the learning algorithm.
The learning algorithm finds a mapping between the input and the output and generates
the model.
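A minimal sketch of this stage (assuming scikit-learn; the Iris dataset and the decision tree are stand-ins chosen for illustration, while the 80/20 split follows the text):
# Split the dataset and feed the training part to a learning algorithm.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)    # stand-in dataset for illustration
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 80/20 split
model = DecisionTreeClassifier()
model.fit(X_train, y_train)          # the algorithm finds a mapping between inputs and outputs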
5. Evaluating Model-
In this stage,
The model is evaluated to test if the model is any good.
The model is evaluated using the kept-aside testing dataset.
It allows us to test the model against data that has never been used for training.
Metrics such as accuracy, precision, recall etc are used to test the performance.
If the model does not perform well, it is rebuilt using different hyperparameters.
The accuracy may be further improved by tuning the hyperparameters.
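Continuing the sketch from the training stage (same assumed stand-in dataset and model), the kept-aside testing dataset is used to compute metrics such as accuracy, precision, and recall:
# Evaluate the trained model on the held-out test set (continues the previous sketch).
from sklearn.metrics import accuracy_score, precision_score, recall_score
y_pred = model.predict(X_test)       # predictions on data never seen during training
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred, average='macro'))   # macro average for multi-class data
print(recall_score(y_test, y_pred, average='macro'))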
6. Predictions-
In this stage,
The built system is finally used to do something useful in the real world.
Here, the true value of machine learning is realized.
Wine Quality Prediction.
Link: https://machinelearningprojects.net/wine-quality-prediction/
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
wine = pd.read_csv('winequality-red.csv')
wine.head()
Step 3 – Describe the data.
wine.describe()
wine.info()
From the output of wine.info(), we can infer that there are no NULL values in our data.
fig = plt.figure(figsize=(15,10))
plt.subplot(3,4,1)
sns.barplot(x='quality',y='fixed acidity',data=wine)
plt.subplot(3,4,2)
sns.barplot(x='quality',y='volatile acidity',data=wine)
plt.subplot(3,4,3)
sns.barplot(x='quality',y='citric acid',data=wine)
plt.subplot(3,4,4)
sns.barplot(x='quality',y='residual sugar',data=wine)
plt.subplot(3,4,5)
sns.barplot(x='quality',y='chlorides',data=wine)
plt.subplot(3,4,6)
sns.barplot(x='quality',y='free sulfur dioxide',data=wine)
plt.subplot(3,4,7)
sns.barplot(x='quality',y='total sulfur dioxide',data=wine)
plt.subplot(3,4,8)
sns.barplot(x='quality',y='density',data=wine)
plt.subplot(3,4,9)
sns.barplot(x='quality',y='pH',data=wine)
plt.subplot(3,4,10)
sns.barplot(x='quality',y='sulphates',data=wine)
plt.subplot(3,4,11)
sns.barplot(x='quality',y='alcohol',data=wine)
plt.tight_layout()
wine['quality'].value_counts()
Step 7 – Make just 2 categories: good and bad.
from sklearn.preprocessing import LabelEncoder
ranges = (2,6.5,8)
groups = ['bad','good']
wine['quality'] = pd.cut(wine['quality'],bins=ranges,labels=groups)
le = LabelEncoder()
wine['quality'] = le.fit_transform(wine['quality'])
wine.head()
wine['quality'].value_counts()
good_quality = wine[wine['quality']==1]
bad_quality = wine[wine['quality']==0]
bad_quality = bad_quality.sample(frac=1)
bad_quality = bad_quality[:217]
new_df = pd.concat([good_quality,bad_quality])
new_df = new_df.sample(frac=1)
new_df
new_df['quality'].value_counts()
Now we can see that both classes have 217 instances, and the combined data has been shuffled.
new_df.corr()['quality'].sort_values(ascending=False)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

X = new_df.drop('quality',axis=1)
y = new_df['quality']
# Train/test split (an 80/20 ratio is assumed here)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

param = {'n_estimators':[100,200,300,400,500,600,700,800,900,1000]}
grid_rf = GridSearchCV(RandomForestClassifier(), param, scoring='accuracy', cv=10)
grid_rf.fit(X_train, y_train)

pred = grid_rf.predict(X_test)
print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))
print('\n')
print(accuracy_score(y_test,pred))
I also used some other algorithms, like SVM and the SGD Classifier, but Random Forest stood out as always.
Here I have used GridSearchCV with Random Forest to find the best value of the 'n_estimators' parameter.
Finally, we ended up with an accuracy of 83.9%, which is very good for such a small dataset.
10) Alcohol: measures the level of ethanol in the wine, which influences its taste.
a) 1–3 = ‘Poor’
b) 4–5 = ‘Good’
c) 6–7 = ‘Rich’
Normal Distribution of all the Features
Most of the feature values in the given dataset do not have a standard measuring unit. Making assumptions about them can therefore give false or irrelevant interpretations; values may be overvalued or undervalued. Out of the 11 features, only pH has a standard measuring unit; for wine it varies between about 3 and 3.4 (acidic values). The mean pH value is 3.21, which indicates the wines are on the acidic side. The standard measures of the rest of the features vary with location, legal norms, pH value, type of wine, etc.
Objective
From this dataset, we are trying to find which features impact the quality class of the wine. To do that, we first look at the relationship between the features and the class.
Negative Correlation
Positive Correlation
1) Density has a positive relationship with residual sugar.
From the three box plots, we find that sulphates and alcohol have a positive relationship, and density a negative relationship, with the higher wine quality classes; a sketch of how such plots can be produced follows.
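A hedged sketch of how such box plots and the feature-quality correlations can be produced (assuming a freshly loaded red-wine data frame in which the quality column is still numeric, and the seaborn/matplotlib libraries used earlier):
# Relationship between selected features and the quality class (illustrative sketch).
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
wine = pd.read_csv('winequality-red.csv')   # fresh copy, quality still numeric
plt.figure(figsize=(12, 4))
for i, feature in enumerate(['sulphates', 'alcohol', 'density'], start=1):
    plt.subplot(1, 3, i)
    sns.boxplot(x='quality', y=feature, data=wine)
plt.tight_layout()
plt.show()
# Positive vs negative correlation of each feature with quality.
print(wine.corr()['quality'].sort_values(ascending=False))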
Logistic regression
We split the dataset into train and test sets, and then fit a classifier on the training portion.
We then check the predicted values produced by the fitted model.
Our y_pred can classify the test dataset: the classifier learned the pattern of classification from the training set. The next step is to test the accuracy of the classifier, i.e. what percentage of the dataset it is able to predict or classify correctly. A minimal sketch is given below.
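A minimal sketch of these steps (assuming scikit-learn and the balanced new_df data frame built earlier, with quality encoded as 1 = good and 0 = bad; the 80/20 split ratio is an assumption):
# Logistic regression: split, fit, predict, and check accuracy (illustrative sketch).
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
X = new_df.drop('quality', axis=1)    # features from the balanced data frame built earlier
y = new_df['quality']                 # 1 = good, 0 = bad
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)             # the classifier learns the pattern from the training set
y_pred = clf.predict(X_test)          # classify the test set
print(accuracy_score(y_test, y_pred)) # fraction of test samples classified correctly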
Dataset
4. Residual Sugar: the natural grape sugars left in a wine after the alcoholic fermentation finishes.
After importing the Python libraries, we load and read the dataset, taken from the Kaggle link above (also available on my GitHub).
Exploratory Data Analysis (EDA)
Univariate Analysis
The next step is to train the Linear Regression Model using the train
data that we just split in the previous step.
Prediction
Evaluation
After making the predictions, we need to evaluate the project to find how accurate the predictions are.
Model Evaluation
We can use the Mean Squared Error (MSE) to measure how accurate our predictions are, by comparing the MSE to the range of our data: the lower the MSE relative to the data range, the more accurate the prediction. To know the data range of Quality, we look at the unique values listed in the Quality column. A sketch of this evaluation is given below.
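A hedged sketch of this evaluation (assuming scikit-learn, the same winequality-red.csv file used earlier, and an 80/20 split; the author's exact setup is not shown in the text):
# Train a linear regression model, predict quality, and compare MSE to the data range.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
wine = pd.read_csv('winequality-red.csv')
X = wine.drop('quality', axis=1)
y = wine['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(mse)
# Data range of the Quality column, from its unique values.
unique_quality = sorted(wine['quality'].unique())
print(unique_quality)                          # e.g. [3, 4, 5, 6, 7, 8]
quality_range = max(unique_quality) - min(unique_quality)
print(mse / quality_range)   # the lower this ratio, the more accurate the prediction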