IECH1103 Analytics Report
Data choice
Banknotes are essential to the smooth functioning of the monetary system, and preventing the
circulation of counterfeit banknotes is crucial for a well-functioning cash economy. Counterfeit
currency is a fake version of legal tender that has been produced illegally, and more counterfeit
bills are in circulation now than in previous years; the widespread production of fakes in every
denomination contributes in part to the poor state of a country's financial market. The dataset
chosen here addresses this problem directly. Within it, the quantity of each type is distributed
normally, and a note is authentic when the target class value is 0 and fake when it is 1.
A banknote printed without the proper authorization of the state or federal government is
counterfeit, and counterfeit (or fake) currency of this kind undermines one of a nation's most
valuable resources: its currency. Some dishonest participants in the financial market deliberately
flood it with counterfeit notes that are almost impossible to tell apart from the genuine article,
sowing confusion and discord among prospective investors. Because genuine and counterfeit banknotes
share so many similarities, it is difficult for people to distinguish between them, so a system is
needed that can accurately determine whether a particular bill is genuine. The long-term goal of
this study is to establish a framework for classifying the various approaches that can be used to
detect counterfeit banknotes and so forestall the further spread of fakes. This report examines
several machine learning algorithms that could be used in the near future to verify and analyse
banknotes: supervised methods such as Decision Trees and Linear Regression, together with
unsupervised methods such as Simple K-Means. With the assistance of these algorithms, it is
possible to learn the patterns that distinguish genuine notes from fakes.
Dataset Description
The dataset used to train the models was obtained from the UCI Machine Learning Repository. It was
compiled from images of both genuine and forged banknotes and contains 1,372 instances. There are
five attributes in total: four serve as features and one is the target. The proportion of items
belonging to each category in the dataset follows a normal distribution. The target class takes the
values 0 and 1, where a value of 0 indicates a genuine note and a value of 1 indicates a fake note.
The histogram of all five attributes in the dataset can be seen in the image below.
You can see the relationships between the various attributes in the scatter plot below.
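As a rough illustration, the sketch below shows how these plots could be produced in Python with pandas and matplotlib. The file name data_banknote_authentication.csv and the column labels are assumptions made for the example, not part of the original report.

# Minimal sketch: load the UCI banknote authentication data and reproduce
# the attribute histograms and pairwise scatter plots described above.
# File name and column names are illustrative assumptions.
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

columns = ["variance", "skewness", "curtosis", "entropy", "class"]
df = pd.read_csv("data_banknote_authentication.csv", header=None, names=columns)

df.hist(bins=30, figsize=(10, 8))                 # one histogram per attribute
plt.tight_layout()
plt.show()

scatter_matrix(df, figsize=(10, 10), alpha=0.4)   # pairwise relationships between attributes
plt.show()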
Data Preprocessing
Data cleansing is the process of removing redundant details and correcting errors in the data, and
it must be applied at several touchpoints while preparing the data warehouse. Data integration
refers to combining separate pieces of information into one coherent whole, and data transformation
is the conversion that must be carried out before information can be moved from one system to
another. Because this dataset has no missing values and has already been normalized, it is in
excellent condition for analysis, and no further preprocessing steps are required.
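The checks described above can be expressed in a few lines. The snippet below is a minimal sketch that reuses the DataFrame df from the previous example; the min-max scaling step is included only to show what normalization would look like if it were needed.

# Sketch of the preprocessing checks: confirm there are no missing values
# and illustrate how the four feature columns could be normalized.
# Reuses the DataFrame `df` loaded in the earlier snippet.
from sklearn.preprocessing import MinMaxScaler

print(df.isna().sum())        # expect zero missing values in every column

features = ["variance", "skewness", "curtosis", "entropy"]
scaler = MinMaxScaler()
df[features] = scaler.fit_transform(df[features])  # scale features to [0, 1]
print(df[features].describe())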
Data mining
Here, we have applied the Linear Regression model, the Random Forest algorithm (an ensemble of
decision trees), and the Simple K-Means clustering algorithm to mine this data.
Linear regression is a common technique for predictive analysis because it is so straightforward.
Regression is used mainly to answer two questions: 1) whether a particular set of independent
(predictor) variables can accurately predict the dependent (outcome) variable, and 2) which of the
predictor variables are the best indicators of the outcome variable and how they influence its
value, as shown by the magnitude and sign of the beta estimates. The regression estimates therefore
describe the relationship between the dependent variable and the independent variables. When the
analysis involves a single independent variable and a single dependent variable, the regression
equation takes the form

y = bx + c

In this equation, 'y' represents the predicted value of the dependent variable, 'x' the independent
variable score, 'b' the regression coefficient, and 'c' a constant (the intercept).
For the linear regression model, the mean absolute error is 12.95 percent, the root mean squared
error is 17.44 percent, the relative absolute error is 26.0624 percent, and the root relative
squared error is 34.8272 percent. All these figures originate from the same collection of data.
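As a hedged illustration of how such a model and its error measures could be computed, the sketch below fits a linear regression with scikit-learn on the same DataFrame df; the report's own figures came from a different tool, so the numbers produced here will not match exactly, and the 70/30 split is an assumption.

# Sketch: fit a linear regression on the four banknote features and report
# the same style of error measures quoted above. Figures will differ from
# the report, which used another tool chain. Reuses `df` from earlier.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

X = df[["variance", "skewness", "curtosis", "entropy"]].to_numpy()
y = df["class"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
baseline = np.full_like(y_test, y_test.mean(), dtype=float)   # naive "predict the mean" baseline
rae = np.abs(y_test - pred).sum() / np.abs(y_test - baseline).sum()
rrse = np.sqrt(((y_test - pred) ** 2).sum() / ((y_test - baseline) ** 2).sum())
print(f"MAE={mae:.4f} RMSE={rmse:.4f} RAE={rae:.2%} RRSE={rrse:.2%}")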
Random Forest is a supervised machine learning technique that has found widespread use in
classification and regression. It builds many decision trees from different samples of the data and
then combines their outputs, using majority voting for classification and averaging for regression.
According to the findings, the coefficient of correlation is 98.58 percent, the mean absolute error
is 2.37 percent, the root mean squared error is 8.45 percent, the relative absolute error is 4.767
percent, and the root relative squared error is 16.880 percent.
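A corresponding random forest sketch is shown below, reusing the train/test split from the previous snippet. The classifier variant is used here for illustration; the regression-style errors quoted above would come from the regressor variant instead.

# Sketch: random forest on the same train/test split. The classifier uses
# majority voting across trees; a RandomForestRegressor would average the
# tree outputs, which is how regression-style errors like those above arise.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
print(f"Random forest accuracy: {accuracy_score(y_test, rf_pred):.4f}")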
K-means
K-means is an unsupervised machine learning method for grouping data: it partitions an unlabeled
dataset into a predetermined number of clusters, where "K" denotes the number of clusters to be
found. The algorithm locates the cluster centres and repeats the assignment and update steps until
the best solution is found. Each data point is assigned to the cluster whose centre it is closest
to, i.e. the one for which its squared distance to the centre is minimal, and the less variation
there is within a cluster, the more alike its data points are.
At the cluster centroid, the five attributes of our dataset take the values: variance 0.541,
skewness 0.5878, curtosis 0.2, entropy 0.6703, and class 0.4366. The clustered instances are split
48 percent and 52 percent between the two clusters, corresponding to class 0 and class 1
respectively.
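The following sketch shows one way such a clustering could be reproduced with scikit-learn's KMeans over the four feature columns of df from the earlier snippets; the exact centroid values and cluster shares depend on the tool, the attributes included, and the random seed, so they will not match the figures above exactly.

# Sketch: SimpleKMeans-style clustering with k = 2 over the four features.
# Prints the cluster centres and the share of instances assigned to each
# cluster, analogous to the figures quoted above. Reuses `df` from earlier.
import numpy as np
from sklearn.cluster import KMeans

X_feat = df[["variance", "skewness", "curtosis", "entropy"]].to_numpy()
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_feat)

print("Cluster centres:\n", kmeans.cluster_centers_)
counts = np.bincount(kmeans.labels_)
print("Instance share per cluster:", counts / counts.sum())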
Conclusion
This study investigated banknote authentication, the task of distinguishing genuine banknotes from
counterfeits, using two supervised learning techniques and one unsupervised learning technique.
After reviewing approaches that have previously been used to detect counterfeit banknotes, the
study examined additional approaches. Each model was evaluated on the banknote dataset to determine
which would be the most effective choice for classifying the banknotes, and the results were
compared in order to select the model that showed the most promise for further research.