University of Mauritius: Assignment On Supervised & Unsupervised Machine Learning Algorithms
Assignment on Supervised & Unsupervised Machine Learning Algorithms
2.4.3 Python code
2.5 Hierarchical Clustering Model
2.5.1 Python Codes
2.3.4 Conclusion
1.0 Supervised learning algorithms
The target is the quality output variable (based on sensory data, a score between 0 and 10).
2. The CSV dataset file is read using pandas. The data.head() function shows the
first few rows of the dataset, which helps to choose the columns needed to fit the
data to the machine learning models (see the sketch below).
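A minimal sketch of this step, assuming the red-wine CSV is named winequality-red.csv (the file name and separator are assumptions about the dataset used):

    import pandas as pd

    # Read the wine-quality CSV (adjust the path/separator to the actual file,
    # e.g. the UCI version uses sep=';')
    data = pd.read_csv('winequality-red.csv')

    # head() returns the first five rows, which helps to choose the columns
    # to use as features for the machine learning models
    print(data.head())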
1. Pre-processing data points.
4. Normalise the labels with the sklearn library. The data is plotted to visualise how the
quality of wine changes with fixed acidity, as sketched below.
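A hedged sketch of this step. The use of LabelEncoder and of a seaborn bar plot is an assumption about the exact calls used; the column names assume the standard wine-quality dataset:

    from sklearn.preprocessing import LabelEncoder
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Encode the quality labels as consecutive integers (LabelEncoder assumed)
    le = LabelEncoder()
    data['quality'] = le.fit_transform(data['quality'])

    # Visualise how wine quality varies with fixed acidity
    plt.figure(figsize=(8, 5))
    sns.barplot(x='quality', y='fixed acidity', data=data)
    plt.xlabel('quality')
    plt.ylabel('fixed acidity')
    plt.show()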
5. Visualisation of data for Volatile acidity.
7. Visualisation of data for Residual Sugar affecting wine quality.
9. Visualisation of data for Residual Sugar affecting wine quality.
11. For the data visualization, a graph is plotted to demonstrate the relationship
between the pH and the quality of wine.
12. A graph is plotted to demonstrate the relationship between the sulphates and the
quality of wine.
13. A graph is plotted to demonstrate the relationship between the alcohol content and
the quality of wine.
14. The data.head() function shows the first few rows of the dataset. There are
more columns in the dataset; however, due to limited space only a few are shown.
16 Classification models
16.1 Logistic Regression
1. Fitting Logistic Regression to the Training set.
4. The confusion matrix shows the correct predictions and the incorrect ones.
2. Calculating accuracy (see the sketch below).
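A hedged sketch of these steps. The variable names X_train, X_test, y_train and y_test are assumptions (the wine features and the quality labels after the train/test split):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.metrics import confusion_matrix, accuracy_score

    # Fit Logistic Regression to the training set
    classifier = LogisticRegression(max_iter=1000)
    classifier.fit(X_train, y_train)

    # Predict the test set and build the confusion matrix
    y_pred = classifier.predict(X_test)
    print(confusion_matrix(y_test, y_pred))

    # Accuracy and cross-validation score
    print('Accuracy:', accuracy_score(y_test, y_pred))
    print('CV accuracy:', cross_val_score(classifier, X_train, y_train, cv=10).mean())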
16.3 Support Vector Machine (SVM - Linear)
1. Fitting classifier to the Training set.
2. Predicting Cross Validation Score.
16.4 Non-linear Support Vector Machine
1. Fitting classifier to the Training set.
2. Predicting Cross Validation Score.
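A hedged sketch covering both the linear SVM (16.3) and the non-linear SVM (16.4); using an RBF kernel for the non-linear case is an assumption, and X_train/y_train are the assumed training variables:

    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # Linear SVM: fit the classifier to the training set and cross-validate
    svc_linear = SVC(kernel='linear')
    svc_linear.fit(X_train, y_train)
    print('Linear SVM CV accuracy:',
          cross_val_score(svc_linear, X_train, y_train, cv=10).mean())

    # Non-linear SVM: same procedure with an RBF kernel
    svc_rbf = SVC(kernel='rbf')
    svc_rbf.fit(X_train, y_train)
    print('RBF SVM CV accuracy:',
          cross_val_score(svc_rbf, X_train, y_train, cv=10).mean())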
16.5 Naive Bayes
1. Fitting classifier to the Training set.
2. Predicting Cross Validation Score.
3. The confusion matrix shows the correct predictions and the incorrect ones.
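A hedged sketch of these steps, assuming a Gaussian Naive Bayes classifier and the same X_train/X_test/y_train/y_test variables as above:

    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score
    from sklearn.metrics import confusion_matrix

    # Fit a Gaussian Naive Bayes classifier to the training set
    nb = GaussianNB()
    nb.fit(X_train, y_train)

    # Cross-validation score and confusion matrix on the test set
    print('NB CV accuracy:', cross_val_score(nb, X_train, y_train, cv=10).mean())
    print(confusion_matrix(y_test, nb.predict(X_test)))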
16.6 Decision Tree Classification
1. Fitting classifier to the Training set.
16.7 Random Forest Classification
1. Fitting Random Forest Classification to the Training set.
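A hedged sketch covering the decision tree (16.6) and random forest (16.7) classifiers; the hyperparameters and variable names are assumptions:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Fit a decision tree classifier to the training set
    dt = DecisionTreeClassifier(random_state=0)
    dt.fit(X_train, y_train)
    print('Decision tree CV accuracy:',
          cross_val_score(dt, X_train, y_train, cv=10).mean())

    # Fit a random forest classifier to the training set
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_train, y_train)
    print('Random forest CV accuracy:',
          cross_val_score(rf, X_train, y_train, cv=10).mean())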
17. Assembling and comparing all the models' results together.
19. Visualising the accuracy of the training data and testing data for all the
algorithms, presenting the ones with the highest accuracy first.
20. Visualisation of all the models using bar charts with their Root Mean Square
Error scores (a hedged sketch of the comparison plot follows).
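A minimal sketch of the comparison plot. The accuracy values below are placeholders standing in for the scores computed in the previous steps; the DataFrame structure is an assumption:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Assemble the train/test accuracies of every model into one DataFrame
    # (replace the zeros with the scores computed above)
    results = pd.DataFrame({
        'Model': ['Naive Bayes', 'SVM (linear)', 'Logistic Regression',
                  'Decision Tree', 'KNN', 'SVM (non-linear)', 'Random Forest'],
        'Train accuracy': [0.0] * 7,
        'Test accuracy': [0.0] * 7,
    })

    # Sort so the most accurate models appear first, then plot as a bar chart
    results = results.sort_values('Test accuracy', ascending=False)
    results.plot(x='Model', y=['Train accuracy', 'Test accuracy'],
                 kind='bar', figsize=(10, 5))
    plt.ylabel('Accuracy')
    plt.tight_layout()
    plt.show()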
1.1.4 Conclusion
As shown in the graphs above, it can be concluded that Naïve Bayes works best for
this scenario, followed by SVM (linear), logistic regression, decision tree
classification, KNN, SVM (non-linear) and random forest classification.
The Naïve Bayes algorithm has the highest accuracy because it works very well with
small datasets compared to the other models. It is also a generative model.
The linear SVM has very good accuracy because the solution we were searching for
is linear. SVM also handles outliers better than KNN, although KNN would achieve
a higher accuracy on a much larger dataset.
Logistic regression supports only linear solutions and has a convex loss function,
which is why it achieves good accuracy.
Decision trees normally find non-linear solutions and capture interactions between
independent variables, which is why the decision tree achieves only average accuracy here.
KNN is relatively slow to run compared to the other models. It is a non-parametric
model that works well on non-linear datasets, which is not the case here since we
are looking for a linear solution.
The non-linear SVM has not given good accuracy because it targets non-linear
solutions.
Random forest has not been effective here; it is normally used when high
performance is needed with little or no interpretability.
1.2 Regression models
Regression models formulate a functional relationship between a set of independent or
explanatory variables (X's) and a dependent or response variable (Y): Y = f(X).
1.2.3 Python codes
1. Importing libraries
The libraries needed are imported and the import statements are executed (a sketch of the import cell follows):
NumPy for numerical analysis.
Pandas to manipulate the CSV files.
Matplotlib for data visualisation.
Sklearn to import the models.
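A hedged sketch of the import cell; the exact set of imports is an assumption based on the models and plots described in this section:

    import numpy as np                     # numerical analysis
    import pandas as pd                    # reading and manipulating the CSV file
    import matplotlib.pyplot as plt        # data visualisation
    import seaborn as sns                  # statistical plots used below
    from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
    from sklearn.linear_model import LinearRegression
    from sklearn.svm import SVR
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score, mean_squared_error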
3. Counting the number of rows in the dataset.
5.3 The seaborn distplot() function depicts the variation in the data distribution for age.
The seaborn scatterplot() function plots the scatter of two variables, in this case age
and charges.
5.4 The seaborn distplot() function depicts the variation in the data distribution for BMI.
The seaborn scatterplot() function plots the scatter of charges against BMI.
5.5 The seaborn countplot() function shows the counts of observations in the sex category
using bars. The seaborn boxenplot() function, which accepts data as lists, numpy arrays or
pandas Series objects, plots the distribution of charges for each sex.
5.6 The seaborn countplot() function shows the counts of observations in the children
category using bars. The seaborn boxenplot() function, which accepts lists, numpy arrays or
pandas Series objects, plots the number of children against charges.
5.7 The seaborn countplot() function shows the counts of observations in the smoker category
using bars. The seaborn violinplot() function shows the distribution of the quantitative data
across smokers and charges.
5.8 The seaborn countplot() function shows the counts of observations in the region category
using bars. The swarmplot() function shows the relationship between region and charges, with
the points resembling swarming bees. A hedged sketch of a few of these plots is shown below.
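A minimal sketch of some of the plots above; the file name insurance.csv and its column names (age, charges, smoker) are assumptions about the dataset used:

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    df = pd.read_csv('insurance.csv')

    # Distribution of ages and how charges vary with age
    sns.distplot(df['age'])
    plt.show()
    sns.scatterplot(x='age', y='charges', data=df)
    plt.show()

    # Counts per category and charges per category
    sns.countplot(x='smoker', data=df)
    plt.show()
    sns.violinplot(x='smoker', y='charges', data=df)
    plt.show()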
7.0 Splitting the dataset into train and test sets, 75% and 25% respectively.
8.0 Applying Linear Regression to the data (a hedged sketch of the split and the regression follows).
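A hedged sketch covering steps 7 and 8. Here X is assumed to hold the encoded features and y the charges column; the variable names and the random seed are assumptions:

    import numpy as np
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score, mean_squared_error

    # Step 7: split the data into train (75%) and test (25%) sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # Step 8: fit a linear regression model and evaluate it
    lin_reg = LinearRegression()
    lin_reg.fit(X_train, y_train)

    print('CV R2     :', cross_val_score(lin_reg, X_train, y_train, cv=10, scoring='r2').mean())
    print('Train R2  :', r2_score(y_train, lin_reg.predict(X_train)))
    print('Test R2   :', r2_score(y_test, lin_reg.predict(X_test)))
    print('Test RMSE :', np.sqrt(mean_squared_error(y_test, lin_reg.predict(X_test))))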
9. Applying Support Vector Regression (SVR)
Steps:
Feature Scaling
Creating the SVR regressor
Applying Grid Search to find the best model and the best parameters
10. Predicting the cross-validation score, the R2 score on the training set, the R2 score
on the test set and the RMSE on the test set (sketched below).
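A hedged sketch of the SVR steps; the parameter grid values are assumptions chosen only for illustration:

    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV

    # Feature scaling: SVR is sensitive to the scale of the inputs
    sc_X = StandardScaler()
    X_train_sc = sc_X.fit_transform(X_train)
    X_test_sc = sc_X.transform(X_test)

    # Create the SVR regressor and grid-search for the best kernel/parameters
    params = {'kernel': ['linear', 'rbf'], 'C': [1, 10, 100]}
    grid = GridSearchCV(SVR(), params, cv=5, scoring='r2')
    grid.fit(X_train_sc, y_train)

    svr = grid.best_estimator_
    print('Best parameters:', grid.best_params_)
    print('Test R2:', svr.score(X_test_sc, y_test))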
11. Applying the Decision Tree regressor to the data
Steps:
Creating the Decision Tree regressor.
Applying Grid Search to find the best model and the best parameters.
12. Applying the Random Forest regressor to the data
Steps:
Creating the Random Forest regressor.
Applying Grid Search to find the best model and the best parameters.
13. Predicting the cross-validation score, the R2 score on the training set, the R2 score
on the test set and the RMSE on the test set (see the sketch below).
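A hedged sketch covering steps 11 to 13; the parameter grids are assumptions for illustration only:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.metrics import r2_score, mean_squared_error

    # Decision tree regressor tuned with grid search
    dt_grid = GridSearchCV(DecisionTreeRegressor(random_state=0),
                           {'max_depth': [3, 5, 7, None]}, cv=5, scoring='r2')
    dt_grid.fit(X_train, y_train)

    # Random forest regressor tuned with grid search
    rf_grid = GridSearchCV(RandomForestRegressor(random_state=0),
                           {'n_estimators': [100, 300], 'max_depth': [5, 10, None]},
                           cv=5, scoring='r2')
    rf_grid.fit(X_train, y_train)

    # Step 13: cross-validation, R2 on train/test and RMSE on the test set
    for name, model in [('Decision Tree', dt_grid.best_estimator_),
                        ('Random Forest', rf_grid.best_estimator_)]:
        y_pred = model.predict(X_test)
        print(name,
              'CV R2:', cross_val_score(model, X_train, y_train, cv=5, scoring='r2').mean(),
              'Train R2:', r2_score(y_train, model.predict(X_train)),
              'Test R2:', r2_score(y_test, y_pred),
              'Test RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))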
14. Displaying and comparing all the models against their training, test and
cross-validation accuracy.
15. Visualisation of all the models using bar charts with their cross-validation scores.
15. Visualisation of all the models using bar charts with their training scores.
15.0 Visualisation of all the models using bar charts with their Root Mean Square Error
scores.
Model comparison for the regression models using their accuracy:
It can be concluded that Random Forest Regression works best for this scenario,
followed by Decision Tree, Support Vector Regression and Linear Regression.
2.1 Apriori Algorithm
The Apriori algorithm is an influential unsupervised algorithm for mining frequent
itemsets for boolean association rules. Apriori uses a "bottom up" approach, where
frequent subsets are extended one item at a time (a step known as candidate
generation), and groups of candidates are tested against the data. Apriori is designed
to operate on databases containing transactions (for example, collections of items
bought by customers, or details of website visits).
• The sys library provides access to some interpreter variables.
• Matplotlib is used for visualisation.
2. Importing the datasets.
3. Loading the data; the rating data consists of the columns UserID, MovieID and
Rating.
4. Exploration (a hedged loading and exploration sketch follows).
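A minimal sketch of steps 2 to 4; the file name ratings.csv is an assumption about the ratings dataset used:

    import pandas as pd

    # The ratings file holds the UserID, MovieID and Rating columns
    ratings = pd.read_csv('ratings.csv')

    # Exploration: shape, first rows and basic statistics
    print(ratings.shape)
    print(ratings.head())
    print(ratings.describe())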
5. Checking for null values.
8. Computing the summary statistics of the total ratings: count, mean, std, min, 25%,
50%, 75% and max.
9. Determining if a person is recommending a movie.
11. Calculating the number of recommendations for a movie
12. Applying the Apriori Algorithm
• Creating a frequent itemset
13. Creating an association rule.
• The rules will be candidate rules until tested.
14. Calculating the confidence of each rule.
15. Setting the minimum confidence.
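A hedged sketch of steps 12 to 15, showing the idea of frequent itemsets, candidate rules and minimum confidence. The toy transactions, variable names and thresholds below are assumptions for illustration only; in the assignment the transactions would be the sets of movies each user recommends:

    from collections import defaultdict
    from itertools import combinations

    transactions = [{'A', 'B', 'C'}, {'A', 'B'}, {'A', 'C'}, {'B', 'C'}, {'A', 'B', 'C'}]
    min_support, min_confidence = 0.4, 0.6

    # Count the support of every single item and every pair (candidate generation)
    support = defaultdict(int)
    for t in transactions:
        for item in t:
            support[frozenset([item])] += 1
        for pair in combinations(sorted(t), 2):
            support[frozenset(pair)] += 1

    n = len(transactions)
    frequent = {s: c / n for s, c in support.items() if c / n >= min_support}

    # Candidate rules X -> Y from frequent pairs; keep those above the confidence threshold
    for itemset, sup in frequent.items():
        if len(itemset) == 2:
            for antecedent in itemset:
                consequent = itemset - {antecedent}
                confidence = sup / frequent[frozenset([antecedent])]
                if confidence >= min_confidence:
                    print({antecedent}, '->', consequent, 'confidence =', round(confidence, 2))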
17. Model Evaluation
• Creating a test set.
• Applying the Apriori algorithm to the test set.
2.2.3 Python codes
1. Importing our libraries.
The dataset contains the bank details of 690 users. The last column, 'Class', tells us
whether the user committed fraud or not: 0 means that no fraud was committed while 1
means that fraud was committed.
4. Getting all the correlation data.
5. Splitting our data into train and test sets. In total we have the bank details of 690
users; we will train our artificial neural network on 80% of the data and test its
accuracy on the remaining 20%.
6. Plotting some of the multivariate features from the dataset (A2, A3) to inspect their
skew.
7. The most correlated feature is A8.
8. Feature scaling, so that no single feature carries a greater weight in the result.
9. Importing the libraries needed to build our ANN.
10. Adding the input layer and the first hidden layer with dropout.
Take the average of the input and output dimensions for the units/output_dim parameter of Dense.
input_dim is necessary for the first layer only, since the network has just been initialised (a hedged sketch follows).
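A hedged sketch of the network, assuming Keras is the library behind the Dense/Dropout calls mentioned above. The number of input features (15), the dropout rate and the training hyperparameters are assumptions:

    from keras.models import Sequential
    from keras.layers import Dense, Dropout

    classifier = Sequential()

    # Input layer + first hidden layer with dropout.
    # units = (inputs + outputs) / 2 = (15 + 1) / 2 = 8 (the rule of thumb mentioned above);
    # input_dim is required for the first layer only.
    classifier.add(Dense(units=8, activation='relu', input_dim=15))
    classifier.add(Dropout(rate=0.1))

    # Second hidden layer with dropout
    classifier.add(Dense(units=8, activation='relu'))
    classifier.add(Dropout(rate=0.1))

    # Output layer: one unit with a sigmoid giving the fraud probability
    classifier.add(Dense(units=1, activation='sigmoid'))

    # Compile and train on the 80% training split
    classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    classifier.fit(X_train, y_train, batch_size=10, epochs=100)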
11. Predicting the Test set results
Note that the outputs we obtain are the probabilities of potential fraud.
Any probability greater than 50 percent (0.5) will be considered as 1 and anything less
than that will be converted to 0.
12. Plotting our confusion matrix.
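A minimal sketch of the thresholding and confusion-matrix steps, assuming the Keras classifier and the X_test/y_test variables from above:

    from sklearn.metrics import confusion_matrix
    import seaborn as sns
    import matplotlib.pyplot as plt

    # The network outputs fraud probabilities; threshold them at 0.5 to get class labels
    y_prob = classifier.predict(X_test)
    y_pred = (y_prob > 0.5)

    # Confusion matrix of correct vs incorrect predictions, plotted as a heatmap
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()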
2.3 Anomaly Detection Unsupervised Techniques
Anomaly detection is the identification of data points, objects, observations or events
that do not adhere to the expected pattern of a given group. These anomalies occur
quite infrequently but can signal a major and serious danger such as cyber intrusions
or fraud. Anomaly detection is used extensively in behavioural analysis and other
forms of analysis to help detect such events.
Brief description:
The dataset consists of numerical input variables only, due to a PCA transformation.
The dataset includes credit card transactions made in September 2013 by
European cardholders.
The dataset was collected and used during a research collaboration between
Worldline and the Machine Learning Group.
The training set is the first 198,365 (70%) instances and the test set is the
remaining 86,442 (30%) transactions.
1. Importing libraries and pre-processing the data so as to clean it before building and
training the machine learning model.
2. The CSV dataset file is read using pandas and the first 5 rows of the dataframe are
displayed.
3. Using the seaborn library, the categorical data in the dataset is plotted and
displayed. The plot shows that 0.17% of the cases in the dataset are fraud, which are
the anomalies to be detected.
7. Feature selection using a Z-test: hypothesis testing is used to find statistically
significant features. The Z-test is performed with the valid transactions as the
population.
To determine, for every feature, whether the values of the fraud transactions differ
significantly from those of the normal transactions, a two-tailed test is performed at a
0.01 level of significance.
Scenario:
Valid transactions as our population
Fraud transactions as sample
Two tailed Z-test
Level of significance 0.01
Corresponding critical value is 2.58
Hypothesis:
H0: There is no difference (insignificant)
H1: There is a difference (significant)
Formula for the Z-score:
Z-score = (x̄ − μ) / S.E.
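A hedged sketch of the Z-test above. Here valid and fraud are assumed to be DataFrames holding the normal and fraudulent transactions, and features the list of PCA columns; these names are assumptions:

    import numpy as np

    critical_value = 2.58   # two-tailed test at a 0.01 level of significance

    for col in features:
        sample = fraud[col]
        mu = valid[col].mean()                     # population mean (valid transactions)
        se = sample.std() / np.sqrt(len(sample))   # standard error of the sample mean
        z = (sample.mean() - mu) / se              # Z-score = (x̄ − μ) / S.E.
        significant = abs(z) > critical_value
        print(col, 'z =', round(z, 2), 'significant' if significant else 'insignificant')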
8. As seen from the distribution plots, the distributions of the normal and fraud data for
the V13, V15, V22, V23, V25 and V26 features are almost the same; this is now
confirmed through hypothesis testing. These features are therefore eliminated from the
dataset as they are not needed (data cleaning).
9. Splitting the data into inliers and outliers: inliers are values that are normal, while
outliers are values that do not belong to the normal data; they are the anomalies.
10. Isolation Forest: Isolation Forest is an unsupervised anomaly detection
algorithm that uses the two properties of anomalies, "few" and "different", to
detect their existence. Since anomalies are few and different, they are more
susceptible to isolation. The algorithm isolates each point in the data and
splits the points into outliers or inliers; this split depends on how long it takes to
separate a point. If we try to separate a point which is obviously not an outlier, it
will have many points around it, so it will be difficult to isolate. On the other hand,
if the point is an outlier, it will be alone and we will find it very easily.
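A minimal sketch of this step using scikit-learn's IsolationForest; the contamination value reflects the ~0.17% fraud rate mentioned above, and X_train/X_test are the assumed feature matrices:

    from sklearn.ensemble import IsolationForest

    # Fit an Isolation Forest on the training transactions
    iso = IsolationForest(n_estimators=100, contamination=0.0017, random_state=42)
    iso.fit(X_train)

    # predict() returns +1 for inliers and -1 for outliers (anomalies)
    pred = iso.predict(X_test)
    print('Detected anomalies in the test set:', (pred == -1).sum())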
2.4 K-Means Clustering Model
K-means clustering is a method of vector quantization, originally from signal
processing, that aims to partition n observations into k clusters in which each
observation belongs to the cluster with the nearest mean, which serves as a prototype
of the cluster.
2.4.1 Description of the scenario
In this case, the clustering models have been implemented to predict whether a
person belongs to a specific category of mall customers based on their details such
as gender, age, annual income and their spending score.
The dataset used to implement this model is the mall_customer dataset from
Kaggle.com.
VARIABLE         DESCRIPTION
CustomerID       Unique ID assigned to the customer
Gender           Gender of the customer
Age              Age of the customer
Annual Income    Annual income of the customer
Spending Score   Score assigned by the mall to the customer based on the customer's behaviour and spending nature
2. Loading the dataset.
3. Data Visualisation
Now, we need to find the most suitable value of K. To achieve this, we calculate the
value of the inertia for each value of k (see the sketch below).
We choose k = 5, as the change to 4 is less drastic than that from 6 to 5.
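A minimal sketch of the elbow method; X is assumed to be the selected feature array (for example annual income and spending score):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    # Compute the inertia for k = 1..10
    inertia = []
    for k in range(1, 11):
        km = KMeans(n_clusters=k, random_state=42)
        km.fit(X)
        inertia.append(km.inertia_)

    # Plot inertia against k; the "elbow" of the curve suggests k = 5
    plt.plot(range(1, 11), inertia, marker='o')
    plt.xlabel('k')
    plt.ylabel('Inertia')
    plt.show()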
We then plot the 5 clusters (sketched below).
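A hedged sketch of fitting k-means with k = 5 and plotting the clusters; X is assumed to be a two-column numpy array of annual income and spending score:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    # Fit k-means with the chosen k = 5 and obtain the cluster labels
    kmeans = KMeans(n_clusters=5, random_state=42)
    labels = kmeans.fit_predict(X)

    # Scatter plot of the 5 clusters and their centroids
    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30)
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
                c='red', marker='x', s=100, label='centroids')
    plt.xlabel('Annual Income')
    plt.ylabel('Spending Score')
    plt.legend()
    plt.show()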
4. Conclusion: We conclude that there are 5 distinct classes of customers.
5. By carefully considering the position of the clusters with respect to the x and y
axes, we can attribute labels to them. This is possible because the k-means
algorithm has already sorted them out.
2.5 Hierarchical Clustering Model
Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm
that groups similar objects into groups called clusters. The endpoint is a set
of clusters, where each cluster is distinct from each other cluster, and the objects
within each cluster are broadly similar to each other.
2.5.1 Python Codes
4. Data visualisation
Displaying the distribution of age and the distribution of annual income.
5. Next, we want to know about the distribution of income.
6. Additionally, we want to know about the spending habits among various age
groups.
7. We can see that we have 5 distinct groups, hence 5 clusters (a hedged clustering sketch follows).
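A hedged sketch of the hierarchical clustering step, using a scipy dendrogram and scikit-learn's AgglomerativeClustering; X is the assumed feature array (annual income and spending score):

    import matplotlib.pyplot as plt
    import scipy.cluster.hierarchy as sch
    from sklearn.cluster import AgglomerativeClustering

    # Dendrogram to confirm the number of clusters
    sch.dendrogram(sch.linkage(X, method='ward'))
    plt.xlabel('Customers')
    plt.ylabel('Euclidean distance')
    plt.show()

    # Agglomerative clustering with the 5 clusters identified above
    hc = AgglomerativeClustering(n_clusters=5, linkage='ward')
    labels = hc.fit_predict(X)

    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30)
    plt.xlabel('Annual Income')
    plt.ylabel('Spending Score')
    plt.show()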
8. Conclusion
2.3.4 Conclusion
Comparison of the unsupervised learning algorithms
The neural network can perform tasks that a linear program cannot, since a linear
program is not able to detect non-linear relations. As shown in the non-linear solution
implemented, it achieves very good accuracy but is very slow to train.
On the other hand, the Apriori algorithm works well for mining frequent itemsets from a
large dataset. In our implementation, we created a movie recommender that uses only
part of a huge dataset, achieving a very good accuracy of 90% when the user types a
movie to be recommended.
Anomaly detection systems can detect all types of attack, as shown in our credit card
fraud detection with its very good accuracy of 89%. However, they depend heavily on
how accurately the normal behaviour is modelled and updated over time; any mistake
in choosing features can decrease the effectiveness of the algorithm.
K-Means clustering produces a single partitioning and is usually more efficient in
terms of run time.