
University of Mauritius

Assignment on Supervised
&
Unsupervised Machine Learning
Algorithms

Course: BSc Software Engineering (Level 3)


Module: Artificial Intelligence (SIS 3119Y)

Assignment done by:


Sandya Askoolum - 1812279
Khritish Bhoodhoo - 1812681
Rishab Dimlaye - 1812915
Cooshal Purmessur - 1811241

Lecturer: Mrs Baby Gobin


Table of Contents
1.0 Supervised learning algorithms
   1.1 Classification models
      1.1.1 Brief description of the scenario
      1.1.2 Dataset description
      1.1.3 Python codes
      16 Classification models
         16.1 Logistic Regression
         16.2 K-Nearest Neighbours (KNN)
         16.3 Support Vector Machine (SVM - Linear)
         16.4 Non-linear Support Vector Machine
         16.5 Naive Bayes
         16.6 Decision Tree Classification
         16.7 Random Forest Classification
      1.1.4 Conclusion
   1.2 Regression models
      1.2.1 Description of the scenario
      1.2.2 Dataset Description
      1.2.3 Python codes
2.0 Unsupervised Learning Algorithms
   2.1 Apriori Algorithms
      2.1.1 Description of scenario
      2.1.2 Dataset Description
      2.1.3 Python codes
   2.2 Neural Network (NN)
      2.2.1 Description of scenario
      2.2.2 Dataset Description
      2.2.3 Python codes
   2.3 Anomaly Detection Unsupervised Techniques
      2.3.1 Description of Scenario
      2.3.2 Dataset Description
      2.3.3 Python codes
   2.4 Clustering Algorithm
      2.4.1 Description of the scenario
      2.4.2 Description of the dataset
      2.4.3 Python code
   2.5 Hierarchical Clustering Model
      2.5.1 Python Codes
   2.6 Conclusion

1.0 Supervised learning algorithms

1.1 Classification models


1.1.1 Brief description of the scenario
In this assignment, to demonstrate the different classification models among supervised
machine learning algorithms, we attempt to predict the quality of wine given the
many factors that contribute to making red wine good.
1.1.2 Dataset description
To implement the supervised machine learning algorithms, the Red Wine Quality
dataset from Kaggle.com has been used. This dataset is related to red variants of
the Portuguese "Vinho Verde" wine.

Brief description about the dataset:

• It consists of 1,599 samples.


• There are 12 attributes.
• 11 of the 12 columns are independent variables.
• The 12th column is the quality of the red wine (the dependent variable).
• The dataset consists of the following attributes:

Attribute name – Description

Input variables:
fixed acidity – the acids involved with wine that are fixed or non-volatile (they do not evaporate readily).
volatile acidity – the amount of acetic acid in wine.
citric acid – found in small quantities, citric acid can add 'freshness' and flavour to wines.
residual sugar – the amount of sugar remaining after fermentation stops.
chlorides – the amount of salt in the wine.
free sulphur dioxide – the free form of SO2, which exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion.
total sulphur dioxide – the amount of free and bound forms of SO2.
density – the density of wine is close to that of water, depending on the percent alcohol and sugar content.
pH – describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic).
sulphates – a wine additive which can contribute to sulphur dioxide gas (SO2) levels.
alcohol – the percent alcohol content of the wine.

Output variable:
quality – output variable (based on sensory data, score between 0 and 10).

1.1.3 Python codes


1. Importing the libraries:
• NumPy library, which is required by Pandas.
• Pandas library to manipulate the CSV files.
• Matplotlib, used for visualization.
• Seaborn library for data visualization.
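A minimal sketch of this import block (the aliases are the conventional ones, assumed here):

    import numpy as np               # numerical arrays, required by pandas
    import pandas as pd              # CSV loading and manipulation
    import matplotlib.pyplot as plt  # plotting
    import seaborn as sns            # statistical data visualisation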

2. The CSV dataset file is read using pandas. The data.head() function shows the
first few rows of the dataset. It helps in choosing the columns needed to fit the data
to a machine learning model.

3. The dataset is checked to identify any missing value.
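A minimal sketch of steps 2 and 3, continuing from the imports above (the file name is an assumption):

    data = pd.read_csv('winequality-red.csv')  # load the Kaggle red wine dataset
    print(data.head())          # first few rows, to choose columns for the model
    print(data.isnull().sum())  # count of missing values per column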

1. Pre-processing data points.

4. Normalise the labels with the sklearn library. The data is plotted to visualise how
the quality of wine is affected by fixed acidity.
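One plausible reading of this step, sketched below: quality is binned into 'bad'/'good' and encoded with sklearn's LabelEncoder (the 6.5 cut-off is an assumption), then fixed acidity is plotted against quality.

    from sklearn.preprocessing import LabelEncoder

    # Bin quality into two classes; the cut-off value is an assumption
    data['quality'] = pd.cut(data['quality'], bins=(2, 6.5, 8),
                             labels=['bad', 'good'])
    data['quality'] = LabelEncoder().fit_transform(data['quality'])  # bad=0, good=1

    sns.barplot(x='quality', y='fixed acidity', data=data)
    plt.show()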

5. Visualisation of data for Volatile acidity.

6. Visualisation of data for citric acid affecting wine quality.

7. Visualisation of data for Residual Sugar affecting wine quality.

8. Visualisation of data for Chlorides affecting wine quality.

9. Visualisation of data for Residual Sugar affecting wine quality.

10. Visualisation of data density and quality of wine.

11. For the data visualization, a graph is plotted to demonstrate the relationship
between the pH and the quality of wine.

12. A graph is plotted to demonstrate the relationship between the sulphates and the
quality of wine.

13. A graph is plotted to demonstrate the relationship between the alcohol content
and the quality of wine.

14. The data.head() function shows the first few rows of the dataset. There are
more columns in the dataset; due to limited space, only a few are shown.

15. Splitting the data into training and testing sets.

The models will be trained on a training dataset and tested on a test dataset. The
dataset is divided into 80% training and 20% testing.
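A minimal sketch of the split, assuming the column names above:

    from sklearn.model_selection import train_test_split

    X = data.drop('quality', axis=1)  # the 11 independent variables
    y = data['quality']               # the dependent variable
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)  # 80% train / 20% test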

16 Classification models
16.1 Logistic Regression

1. Fitting Logistic Regression to the Training set.

2. Predicting Cross Validation score.

3. Calculating the accuracy score, which measures the accuracy of the model
against the training data.

4. The confusion matrix demonstrates the correct predictions and the incorrect
ones.
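A minimal sketch covering steps 1-4 (default hyperparameters are assumed):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.metrics import accuracy_score, confusion_matrix

    lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # step 1
    print(cross_val_score(lr, X_train, y_train, cv=10).mean())    # step 2: CV score
    print(accuracy_score(y_train, lr.predict(X_train)))           # step 3: train accuracy
    print(confusion_matrix(y_test, lr.predict(X_test)))           # step 4: confusion matrix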

16.2 K-Nearest Neighbours (KNN)

1. Fitting the classifier into the dataset.

2. Calculating accuracy.

3. Confusion matrix with correct and incorrect data.
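A minimal sketch of the KNN steps (k = 5 is an assumption):

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix

    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)  # step 1
    print(accuracy_score(y_test, knn.predict(X_test)))               # step 2
    print(confusion_matrix(y_test, knn.predict(X_test)))             # step 3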

16.3 Support Vector Machine (SVM - Linear)

1. Fitting classifier to the Training set.

2. Predicting Cross Validation Score.

3. Confusion matrix showing a summary of the number of correct and incorrect
predictions made by the classifier.
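A minimal sketch of the linear SVM steps:

    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score
    from sklearn.metrics import confusion_matrix

    svc = SVC(kernel='linear').fit(X_train, y_train)             # step 1
    print(cross_val_score(svc, X_train, y_train, cv=10).mean())  # step 2
    print(confusion_matrix(y_test, svc.predict(X_test)))         # step 3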

16.4 Non-linear Support Vector Machine

1. Fitting classifier to the Training set.

2. Predicting Cross Validation Score.

3. Confusion matrix showing a summary of the number of correct and incorrect
predictions made by the classifier.
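A minimal sketch, assuming the RBF kernel for the non-linear SVM:

    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score
    from sklearn.metrics import confusion_matrix

    svc_rbf = SVC(kernel='rbf').fit(X_train, y_train)                # step 1
    print(cross_val_score(svc_rbf, X_train, y_train, cv=10).mean())  # step 2
    print(confusion_matrix(y_test, svc_rbf.predict(X_test)))         # step 3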

16.5 Naive Bayes
1. Fitting classifier to the Training set.

2. Predicting Cross Validation Score.

3. The confusion matrix demonstrates the correct predictions and the incorrect ones.
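A minimal sketch, assuming the Gaussian variant of Naive Bayes:

    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score
    from sklearn.metrics import confusion_matrix

    nb = GaussianNB().fit(X_train, y_train)                     # step 1
    print(cross_val_score(nb, X_train, y_train, cv=10).mean())  # step 2
    print(confusion_matrix(y_test, nb.predict(X_test)))         # step 3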

16.6 Decision Tree Classification

1. Fitting classifier to the Training set.

2. Calculating the accuracy score, which measures the accuracy of the model
against the training data.

3. The confusion matrix demonstrates the correct predictions and the incorrect
ones.
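A minimal sketch of the decision tree steps (default parameters assumed):

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix

    dt = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # step 1
    print(accuracy_score(y_train, dt.predict(X_train)))                # step 2
    print(confusion_matrix(y_test, dt.predict(X_test)))                # step 3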

16.7 Random Forest Classification

1. Fitting Random Forest Classification to the Training set.

2. Calculating the accuracy score, which measures the accuracy of the model
against the training data.

3. The confusion matrix demonstrates the correct predictions and the incorrect
ones.
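A minimal sketch of the random forest steps (100 trees is an assumption):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix

    rf = RandomForestClassifier(n_estimators=100,
                                random_state=0).fit(X_train, y_train)  # step 1
    print(accuracy_score(y_train, rf.predict(X_train)))                # step 2
    print(confusion_matrix(y_test, rf.predict(X_test)))                # step 3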

17. Assembling and comparing all the models' results together.
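A sketch of one way to assemble the comparison, reusing the fitted models from the sketches above:

    from sklearn.metrics import accuracy_score

    models = {'Logistic Regression': lr, 'KNN': knn, 'SVM (linear)': svc,
              'SVM (non-linear)': svc_rbf, 'Naive Bayes': nb,
              'Decision Tree': dt, 'Random Forest': rf}
    results = pd.DataFrame({
        'Model': list(models),
        'Test accuracy': [accuracy_score(y_test, m.predict(X_test))
                          for m in models.values()],
    }).sort_values('Test accuracy', ascending=False)  # highest accuracy first
    print(results)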

18. Visualising the cross-validation scores of the models.

19. Visualising the accuracy of the training data and testing data for all the
algorithms, presenting the ones with the highest accuracy first.

20. Visualization of all the models using bar charts with their Root Mean Square
Error scores.

1.1.4 Conclusion
As shown clearly in the graphs above, it can be concluded that Naïve Bayes works
best for this scenario, followed by SVM (linear), logistic regression, decision tree
classification, KNN, SVM (non-linear) and random forest classification.

 The Naïve Bayes algorithm has the highest accuracy because it works very well
with small datasets compared to the other models; it is also a generative model.
 Linear SVM has very good accuracy because the solution we were searching for
was linear. SVM also handles outliers better than KNN; however, KNN would
achieve higher accuracy on a much larger dataset.
 Logistic regression supports only linear solutions and has a convex loss function,
which is why it has good accuracy.
 Decision trees normally find non-linear solutions and capture interactions between
independent variables, which is why the decision tree has only average accuracy here.
 KNN is relatively slow to run compared to the other models. It is a non-parametric
model that works well on non-linear datasets, which is not the case for this
solution since we are looking for a linear solution.
 Non-linear SVM has not given good accuracy because it supports only non-linear
solutions.
 Random forest has not been effective because it offers little interpretability;
random forest is normally used when high performance is needed with little
or no interpretation.

1.2 Regression models
Regression is the formulation of a functional relationship between a set of independent
or explanatory variables (X's) and a dependent or response variable (Y): Y = f(X).

 Linear Regression – Linear regression is a linear approach to modelling the
relationship between a scalar response and one or more explanatory variables.
 Support Vector Regression – Support Vector Regression (SVR) uses the same
principles as the SVM for classification, with only a few minor differences.
 Decision Tree Regression – It breaks down a dataset into smaller and smaller
subsets while, at the same time, an associated decision tree is incrementally
developed.
 Random Forest Regression – Random forest is an ensemble of decision trees;
that is, many trees constructed in a certain "random" way.

1.2.1 Description of the scenario


In the following scenario we are going to compare regression models such as Linear
Regression, Support Vector Regression, Decision Tree Regression and Random
Forest Regression with respect to their accuracy. The aim of this project is to use the
data provided by the health insurance dataset, apply all the models, and compare
them to identify which model is best at predicting the health insurance charges.

1.2.2 Dataset Description


Columns:

 age: age of the primary beneficiary.
 sex: insurance contractor gender (female, male).
 bmi: body mass index, providing an understanding of body weights that are
relatively high or low relative to height; an objective index of body weight (kg/m²)
using the ratio of weight to height, ideally 18.5 to 24.9.
 children: number of children covered by health insurance / number of dependents.
 smoker: smoking status.
 region: the beneficiary's residential area in the US (northeast, southeast,
southwest, northwest).
 charges: individual medical costs billed by the health insurance.

1.2.3 Python codes
1. Importing libraries
The libraries needed are imported and the import statements are executed.
 NumPy for data analysis.
 Pandas library to manipulate the CSV files.
 Matplotlib, used for data visualization.
 Sklearn library, used to import the models.

2. Reading the dataset and displaying the head

The top 5 rows are displayed with df.head().

3. Counting the number of rows in the dataset.

4. dataset.describe() is used to give a statistical description of the dataset.

5. Data Visualization and preprocessing

5.1 Check if the dataset contains null values

5.2 Generate the heat map

5.3 The Seaborn distplot() function depicts the variation in the data distribution for
ages. The Seaborn scatterplot() function plots two variables against each other, in
this case age and charges.

5.4 The Seaborn distplot() function depicts the variation in the data distribution for
BMI. The Seaborn scatterplot() function plots charges against BMI.

5.5 The Seaborn countplot() function shows the counts of observations in the sex
category using bars. The Seaborn boxenplot() function shows the distribution of
charges for each sex.

5.6 The Seaborn countplot() function shows the counts of observations for the
children category using bars. The Seaborn boxenplot() function shows the
distribution of charges against the number of children.

5.7 The Seaborn countplot() function shows the counts of observations for the smoker
category using bars. The Seaborn violinplot() function shows the distribution of
charges across smokers and non-smokers.

5.8 The Seaborn countplot() function shows the counts of observations for the region
category using bars. The swarmplot() function shows the relationship between region
and charges, with points arranged so that they resemble a swarm of bees.

6. pandas.get_dummies() converts the categorical variables into dummy/indicator
variables.

7.0 Splitting the dataset into 25% test and 75% training data.
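A minimal sketch of the split (assuming 'charges' is the target column):

    from sklearn.model_selection import train_test_split

    X = df.drop('charges', axis=1)  # features (after the get_dummies step)
    y = df['charges']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)  # 75% train / 25% test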
8.0 Applying Linear Regression.
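A minimal sketch of fitting the linear model:

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    lin = LinearRegression().fit(X_train, y_train)
    print(r2_score(y_test, lin.predict(X_test)))  # R2 score on the test set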

9.0 Applying Support Vector Regression.

Steps:
 Feature Scaling
 Creating the SVR regressor
 Applying Grid Search to find the best model and the best parameters

10. Predicting the cross-validation score, the R2 score on the training set, the R2
score on the test set, and the RMSE on the test set.
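A minimal sketch of these SVR steps (the parameter grid values are assumptions):

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.metrics import r2_score, mean_squared_error

    # Feature scaling: SVR is sensitive to feature magnitudes
    sc = StandardScaler()
    X_train_s = sc.fit_transform(X_train)
    X_test_s = sc.transform(X_test)

    # Grid search for the best model and parameters (grid values assumed)
    grid = GridSearchCV(SVR(), {'C': [1, 10, 100], 'gamma': ['scale', 0.1]},
                        cv=5).fit(X_train_s, y_train)
    svr = grid.best_estimator_

    print(cross_val_score(svr, X_train_s, y_train, cv=10).mean())      # CV score
    print(r2_score(y_train, svr.predict(X_train_s)))                   # train R2
    print(r2_score(y_test, svr.predict(X_test_s)))                     # test R2
    print(np.sqrt(mean_squared_error(y_test, svr.predict(X_test_s))))  # test RMSE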

11. Applying Decision Tree Regression (see the combined sketch after step 12's steps).

Steps:
 Creating the Decision Tree regressor.
 Applying Grid Search to find the best model and the best parameters.

12. Applying Random Forest Regression.

Steps:
 Creating the Random Forest regressor.
 Applying Grid Search to find the best model and the best parameters.
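A minimal sketch covering both tree-based regressors from steps 11 and 12 (parameter grids are assumptions):

    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    dt_reg = GridSearchCV(DecisionTreeRegressor(random_state=0),
                          {'max_depth': [3, 5, 10]}, cv=5).fit(X_train, y_train)
    rf_reg = GridSearchCV(RandomForestRegressor(random_state=0),
                          {'n_estimators': [100, 300], 'max_depth': [5, 10]},
                          cv=5).fit(X_train, y_train)
    print(dt_reg.best_params_, rf_reg.best_params_)  # best parameters found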

13. Predicting the cross-validation score, the R2 score on the training set, the R2
score on the test set, and the RMSE on the test set.
14. Displaying and comparing all the models against their training, test and
cross-validation accuracies.

15. Visualization of all the models using bar charts with their cross-validation scores.

16. Visualization of all the models using bar charts with their training scores.

17. Visualization of all the models using bar charts with their Root Mean Square Error
scores.

Model comparison for the regression models using accuracy

It can be concluded that Random Forest Regression works best for this scenario,
followed by Decision Tree Regression, Support Vector Regression and Linear Regression.

2.0 Unsupervised Learning Algorithms

2.1 Apriori Algorithms
The Apriori Algorithm is an influential unsupervised algorithm for mining frequent
itemsets for boolean association rules. Apriori uses a "bottom up" approach, where
frequent subsets are extended one item at a time (a step known as candidate
generation), and groups of candidates are tested against the data. Apriori is designed
to operate on databases containing transactions (for example, collections of items
bought by customers, or details of website visits).

APRIORI Advantages:
 Uses the large itemset property.
 Easily parallelized.

APRIORI Disadvantages:
 Assumes the transaction database is memory-resident.
 Requires many database scans.

2.1.1 Description of scenario


The aim is to build a Movie Recommendation System using the Apriori Algorithm,
to present a ranked list of movies for a given input. Typically, this ranking is based
on the similarity between the input object and the listed objects. The model will be
trained using the MovieLens dataset.

2.1.2 Dataset Description


The MovieLens dataset is a dataset for training recommendation models, obtained
from the GroupLens website. There are various versions; the one chosen for this
algorithm consists of 100,000 movie ratings by users (on a 1-5 scale). The main data
file consists of a tab-separated list with user-id (starting at 1), item-id (starting at 1),
rating, and timestamp as the four fields.

2.1.3 Python codes


1. Importing the libraries:
• NumPy library, which is required by Pandas.
• Pandas library to manipulate the CSV files.
• Sys library to provide access to some interpreter variables.
• Matplotlib, used for visualization.

2. Importing the datasets.

3. Loading of data, the rating data consist of column UserID, MovieID, and
Rating.

4. Exploration.

5. Checking for null values.

6. Visualization of the number of movies with the corresponding rating.


Number of ratings per movie.

7. Displaying the most rated movies by MovieID.

8. Computing summary statistics of the total ratings: count, mean, std, min, 25%,
50%, 75% and max.

9. Determining if a person is recommending a movie.

10. Selecting the top 200 users

11. Calculating the number of recommendations for a movie

12. Applying the Apriori Algorithm
• Creating a frequent itemset

13. Creating an association rule.
• The rules will be candidate rules until tested.

14. Calculating the confidence of each rule.

15. Setting the minimum confidence.

16. Rule Ordering
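The document builds these itemsets and rules manually; as a library-based alternative, a minimal sketch using mlxtend is shown below. The baskets frame is an assumption: one row per user, one boolean column per movie, True where that user recommended the movie.

    from mlxtend.frequent_patterns import apriori, association_rules

    itemsets = apriori(baskets, min_support=0.1, use_colnames=True)  # step 12
    rules = association_rules(itemsets, metric='confidence',
                              min_threshold=0.5)                     # steps 13-15
    rules = rules.sort_values('confidence', ascending=False)         # step 16
    print(rules[['antecedents', 'consequents', 'support', 'confidence']].head())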

17. Model Evaluation
• Creating a test set.
• Applying the Apriori Algorithm on the test set.

• Testing the test set results.

2.2 Neural Network (NN)


Artificial Neural Networks (ANN), or Neural Networks (NN), have provided an exciting
alternative method for solving a variety of problems in different fields of science and
engineering. A Neural Network is a system that consists of many simple processing
elements operating in parallel which can acquire, store, and use experiential
knowledge.
ANNs are biologically inspired by the human brain: nodes in an artificial neural
network are connected in much the same way as neurons are linked in the human brain.

An example of a simple neural network.

2.2.1 Description of scenario


Credit card fraud has become more prevalent in recent years as more people use
credit cards to pay for things. This is attributed to technological advancements and a
rise in online transactions, which have resulted in massive financial losses due to
fraud. As a result, effective loss-reduction methods are needed.
The aim is to predict the incidence of fraud using a variety of machine learning
algorithms, such as artificial neural networks (ANN).

2.2.2 Dataset Description


This dataset concerns credit card applications. To preserve the data's
confidentiality, all attribute names and values have been replaced with meaningless
symbols. The attributes in this dataset are a mix of continuous, nominal with small
numbers of values, and nominal with larger numbers of values.

2.2.3 Python codes
1. Importing our libraries

2. Importing and displaying our dataset: this dataset holds the bank data of 690
users. The last column, 'Class', tells us whether the user committed fraud or not:
0 represents that no fraud was committed, while 1 says that fraud was committed.

3. Checking how many values each column has, to ensure there are no missing
values.

4. Getting all the correlation data.

5. Splitting our data into training and test sets. In total we have the bank details of
690 users; we will train our Artificial Neural Network on 80% of the data and test the
network's accuracy on the remaining 20%.

6. Some plots from the dataset to see the skew of the multivariate features A2
and A3.

7. The most correlated feature is A8.

8. Feature scaling, so that no single feature has greater weight on the result.

9. Importing some libraries to build our ANN.

10. Adding the input layer and the first hidden layer with dropout (a sketch follows this list).
 Take the average of the input and output dimensions for the units/output_dim parameter in Dense.
 input_dim is necessary for the first layer, as the network has just been initialized.
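A minimal sketch of the network, assuming 15 encoded input features (so units ≈ (15 + 1) / 2 = 8) and a single sigmoid output for the fraud probability:

    from keras.models import Sequential
    from keras.layers import Dense, Dropout

    classifier = Sequential()
    # Input layer + first hidden layer; input_dim matches the feature count (assumed)
    classifier.add(Dense(units=8, activation='relu', input_dim=15))
    classifier.add(Dropout(rate=0.1))                  # dropout to reduce overfitting
    classifier.add(Dense(units=8, activation='relu'))  # second hidden layer
    classifier.add(Dropout(rate=0.1))
    classifier.add(Dense(units=1, activation='sigmoid'))  # fraud probability
    classifier.compile(optimizer='adam', loss='binary_crossentropy',
                       metrics=['accuracy'])
    classifier.fit(X_train, y_train, batch_size=10, epochs=100)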

11. Predicting the test set results.
Note that the outputs we get are the probabilities of potential fraud. Any probability
greater than 50 percent (0.5) will be considered as 1, and anything less will be
converted to 0.

12. Plotting our confusion matrix.
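A minimal sketch of steps 11 and 12, continuing from the model above:

    from sklearn.metrics import confusion_matrix, accuracy_score

    y_prob = classifier.predict(X_test).ravel()  # probabilities of potential fraud
    y_pred = (y_prob > 0.5).astype(int)          # threshold at 0.5: 1 = fraud
    print(confusion_matrix(y_test, y_pred))
    print(accuracy_score(y_test, y_pred))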

The model is able to predict fraud with 88% accuracy.



2.3 Anomaly Detection Unsupervised Techniques
Anomaly detection is the identification of data points, objects, observations or events
that do not adhere to the expected pattern of a given group. These anomalies occur
quite infrequently but can signal a major and serious danger such as cyber intrusions
or fraud. Anomaly detection is used extensively in behavioural analysis and other
forms of analysis to help detect such unexpected events.

2.3.1 Description of Scenario


In this example, anomaly detection has been implemented using an Isolation Forest
model to detect fraud.

2.3.2 Dataset Description


The dataset used to implement anomaly detection is from Kaggle.com. It is labelled
and has been chosen for this task. Due to confidentiality issues, features V1 to V28
have been transformed using PCA, and there are no missing values in the dataset.

Brief description:
 It consists of numerical input variables only, due to the PCA transformation.
 The dataset contains credit card transactions made in September 2013 by
European cardholders.
 The dataset was collected and used during a research collaboration between
Worldline and the Machine Learning Group.
 The training set is the first 198,365 (70%) instances and the test set is the
remaining 86,442 (30%) transactions.

Content of this kernel:

1. Data pre-processing
 Exploratory data analysis
 Feature transformation
 Feature selection
2. Modelling
 Isolation Forest.

2.3.3 Python codes

1. Importing libraries and preprocessing the data, so as to clean it before building
and training the machine learning model.

2. The CSV dataset file is read using pandas, and the first 5 rows of the dataframe
are displayed.

3. Using the seaborn library, plot the categorical (class) data in the dataset. The
plot shows that 0.17% of the cases in the dataset are fraud; these are the
anomalies to be detected.

4. Only Time and Amount have not been transformed with PCA. Time contains the
seconds elapsed between each transaction and the first transaction in the dataset.
This feature is transformed into hours to get a better understanding, which shows
that the hour of day has some impact on the number of fraud cases.
5. Transforming the remaining features using PCA.

6. Distribution of features. The data distributions of normal and fraud cases overlap
for some features, such as V18, V20 and V25, and look much the same. Such
features are therefore not good at differentiating between normal and fraud
transactions.

7. Feature selection using a Z-test: hypothesis testing is used to find statistically
significant features. The Z-test is performed with valid transactions as the
population. For each feature, we test whether the values of fraud transactions are
significantly different from those of normal transactions; the level of significance is
0.01 and it is a two-tailed test.
Scenario:
 Valid transactions as our population
 Fraud transactions as the sample
 Two-tailed Z-test
 Level of significance 0.01
 Corresponding critical value is 2.58
Hypotheses:
H0: There is no difference (insignificant)
H1: There is a difference (significant)
Formula for the z-score:
Z = (x̄ − μ) / SE
where x̄ is the sample mean, μ is the population mean and SE is the standard error.
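A minimal sketch of this test (the frame and column names are assumptions; valid and fraud are the two groups of transactions):

    import numpy as np

    significant = []
    for col in features:                       # the PCA feature columns, e.g. V1..V28
        pop, sample = valid[col], fraud[col]   # population vs. sample
        se = pop.std() / np.sqrt(len(sample))  # standard error of the sample mean
        z = (sample.mean() - pop.mean()) / se  # z-score for this feature
        if abs(z) >= 2.58:                     # two-tailed test at alpha = 0.01
            significant.append(col)            # reject H0: feature is significant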

8. As seen from the distribution plots, the distributions of normal and fraud data for
features V13, V15, V22, V23, V25 and V26 are almost the same; this is now
proven through hypothesis testing. These features will be eliminated from the
dataset as they are not necessary (data cleaning).

9. Split data into Inliers and Outliers: Inliers are values that are normal. Outliers
are values that don't belong to normal data and they are the anomalies.

10. Isolation Forest: Isolation Forest is an unsupervised anomaly detection
algorithm that uses the two properties of anomalies, "few" and "different", to
detect their existence. Since anomalies are few and different, they are more
susceptible to isolation. This algorithm isolates each point in the data and
splits the points into outliers or inliers. The split depends on how long it takes
to separate the points: if we try to separate a point which is obviously a
non-outlier, it will have many points around it, so it will be really difficult to
isolate; on the other hand, if the point is an outlier, it will be alone and we will
find it very easily.
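A minimal sketch using sklearn's IsolationForest, with contamination set to the observed 0.17% fraud rate (an assumption):

    from sklearn.ensemble import IsolationForest

    iso = IsolationForest(n_estimators=100, contamination=0.0017,
                          random_state=0).fit(X_train)
    pred = iso.predict(X_test)  # +1 = inlier (normal), -1 = outlier (anomaly)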

2.4 Clustering Algorithm


Clustering algorithms divide the population or data points into a number of groups
such that data points within the same cluster are more similar to each other than to
data points in other groups.
K-MEANS CLUSTERING MODEL

k-means clustering is a method of vector quantization, originally from signal
processing, that aims to partition n observations into k clusters in which each
observation belongs to the cluster with the nearest mean, serving as a prototype of
the cluster.
2.4.1 Description of the scenario

In this case, the clustering models have been implemented to predict whether a
person belongs to a specific category of mall customers based on their details such
as gender, age, annual income and their spending score.

2.4.2 Description of the dataset

The dataset used to implement this model is the mall_customer dataset from
Kaggle.com.

Format of data in the dataset:

VARIABLE – DESCRIPTION
CustomerID – Unique ID assigned to the customer
Gender – Gender of the customer
Age – Age of the customer
Annual Income – Annual income of the customer
Spending Score – Score assigned by the mall to the customer based on the
customer's behaviour and spending nature

2.4.3 Python code

1. Need to import the following libraries.

2. Loading the dataset.

These are features:

3. Data Visualization

Now, we need to find the most suitable value of K. To achieve this, we will calculate
the value of inertia for each value of k.
We choose k = 5, since this is where the curve 'elbows': the decrease in inertia
beyond k = 5 is much less drastic than the decrease from 6 to 5.
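A minimal sketch of the elbow search and the final fit (the two feature columns are assumptions based on the dataset description above):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    X = df[['Annual Income', 'Spending Score']].values  # column names assumed

    # Elbow method: inertia for k = 1..10
    inertia = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_
               for k in range(1, 11)]
    plt.plot(range(1, 11), inertia, marker='o')
    plt.xlabel('k'); plt.ylabel('inertia'); plt.show()

    # Fit the chosen model and plot the 5 clusters with their centroids
    km = KMeans(n_clusters=5, random_state=0)
    labels = km.fit_predict(X)
    plt.scatter(X[:, 0], X[:, 1], c=labels)
    plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
                c='red', marker='x')
    plt.show()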

We then plot the 5 clusters.

4. Conclusion: We identify 5 distinct classes of customers.

5. By carefully considering the position of the clusters with respect to the x and y
axes, we can attribute labels to them. This is possible because the k-means
algorithm has already sorted them out.

2.5 Hierarchical Clustering Model
Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm
that groups similar objects into groups called clusters. The endpoint is a set
of clusters, where each cluster is distinct from each other cluster, and the objects
within each cluster are broadly similar to each other.

2.5.1 Python Codes


1. First, we need to import the useful libraries.

2. Reading data from dataset.

3. Next, we need to check if we have any null or empty values.


As you can see, we have no issues in that sense.

4. Data visualization
 Displaying distribution of age and distribution of annual income.

5. Next, we want to know about the distribution of income.

6. Additionally, we want to know about the spending habits among various age
groups.

7. We can see that we have 5 distinct groups, hence 5 clusters.
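A minimal sketch of this step: a Ward-linkage dendrogram to confirm the group count, then agglomerative clustering with 5 clusters (X is assumed to hold the same two features as in section 2.4):

    import matplotlib.pyplot as plt
    import scipy.cluster.hierarchy as sch
    from sklearn.cluster import AgglomerativeClustering

    sch.dendrogram(sch.linkage(X, method='ward'))  # dendrogram of merges
    plt.show()

    hc = AgglomerativeClustering(n_clusters=5, linkage='ward')
    labels = hc.fit_predict(X)
    plt.scatter(X[:, 0], X[:, 1], c=labels)  # the 5 customer groups
    plt.show()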

8. Conclusion

2.6 Conclusion
Comparison of unsupervised learning

A neural network can perform tasks that a linear program cannot, since a linear
program lacks the ability to detect non-linear relations. As shown in the non-linear
solution implemented here, the neural network achieves very good accuracy, but it
is very slow to train.

On the other hand, the Apriori algorithm works well on representative subsets drawn
from a large dataset. In our implementation, we created a movie recommender that
takes only some data from a huge dataset, leading to a very good accuracy of 90%
when the user types a movie to be recommended.

Anomaly detection systems have the capability to detect all types of attack, as
shown in our credit card fraud detection, which achieved a very good accuracy of
89%. However, such a system is heavily dependent on how accurately the normal
behaviour is modelled and updated over time; any mistake in choosing features
could decrease the effectiveness of the algorithm.

K-Means clustering produces a single partitioning and is usually more efficient in
terms of run time.

Hierarchical clustering gives different partitions depending on the level of
resolution. It does not need the number of clusters to be specified.
