Final Report
Abstract—I have applied five machine learning models to three datasets from different domains. The first dataset concerns credit card fraud detection; principal component analysis (PCA) had already been applied to it because it contains confidential data related to financial accounts. As this is a classification problem, I applied SVM with various kernels and the Naïve Bayes algorithm. The dataset has about 300 thousand records with imbalanced classes, so I used various sampling techniques. Some SVM kernels performed better than others. The second dataset is about crop production, where I predicted production based on area for different districts using linear regression and regression trees; this dataset has many categorical variables, so dummy coding was used. The regression tree model was found to be more satisfactory than the linear one. The third dataset contains the transactions of an online store over a certain period, which I analyzed and to which I applied a clustering technique (unsupervised learning) to predict consumer behavior based on items purchased together.

I. INTRODUCTION
The objective of this project is to apply various machine learning models to three large datasets. The reason for choosing each dataset is different.

• Crop Production: Food is one of the fundamental necessities for everyone on this planet. As the population grows, the demand for food also increases, so agriculture plays an important role. Predicting production would benefit farmers and help them make future decisions. I have analysed previous years' crop production data for India and try to predict production. The dataset for this analysis is available at
https://data.gov.in/resources/district-wise-season-wise-crop-production-statistics-1997

• Credit Card Transactions: People make online transactions ranging from one euro to lakhs of euros every second. Sometimes the payment system becomes vulnerable and can be misused. Fraudulent transactions cost individuals, companies, and governments a lot, not just in money but also in peace of mind. The best way to deal with this is to identify those transactions by building a system that helps reduce this kind of fraud in a cost-effective way that does not involve any legal consequences. Here I have analysed the data and built a model to predict the fraudulent transactions among all transactions. I evaluated the accuracy of this model using confusion matrices. The dataset for this analysis is available at
https://data.world/vlad/credit-card-fraud-detection

• Online Retail Dataset: Market trends change from time to time. It is important to know what customers want and how easily a store can provide it (be it a physical store or an online store). The price, the availability, and how customers buy items tell us about their behaviour, and the shop owner can use this to increase sales and manage the store better. Similarly, I have taken the data of a UK online store and, using clustering techniques, tried to predict consumer behavior, for example prediction of sales and segmentation of items. The dataset for this analysis is available at
https://archive.ics.uci.edu/ml/datasets/Online%20Retail

II. RELATED WORK
A. Crop Production Dataset
Liakos et al. [1] explain how machine learning can be used in the agriculture sector, which is referred to as digital agriculture. Data is very important and useful in building models that can bring revolutionary changes to the field. The main focus of that research was on crop management, water management, soil management, and livestock production. It explains how machine learning works for supervised and unsupervised learning. Yield prediction has been achieved by ML methods; one aim is to apply a cost-effective, low-price model for predicting coffee fruits on a branch, which helps coffee growers economically. This is similar to my work, as I am trying to predict yield for various crops on the basis of area. They have also used a regression tree model.

V. Shah and P. Shah [2] also used a multiple linear regression model for predicting groundnut yield for a few states. A strength of their work is that they considered other environmental factors such as rain, biotic factors like pH value, and area when building the model, unlike my model, which depends mainly on the area factor; that is a limitation of the dataset I used.

S. Mishra et al. [3] describe similar research in which models were built for predicting the corn crop using ANN (decision tree) and multiple linear regression, and the two models were compared. Regression analysis of the dataset showed that corn production depends less on environmental factors and more on planting practices, which is very useful information. The correlation between dependent and independent variables was also established using statistical analysis. Time series analysis was also done to compare previous years' data and train the model accordingly.

B. Sitienei et al. [4] studied tea crop production in the Kenya region; tea is produced in around 58 countries. Multiple regression models have been created for wheat yield, and also for maize, in the past, and a similar approach was taken here for tea. Climate variables were considered, and a contingency table was used to verify the model. Several statistical methods were used for trend analysis, such as the t-test, correlation analysis, and multiple linear regression analysis. Results are shown season-wise (by rainfall), which can be very useful for taking further decisions related
III. DATA MINING AND METHODOLOGY

According to Fayyad et al. [20], knowledge discovery in databases (KDD) is an emerging field and one of the most in-demand methodologies at present. To make any system efficient we need meaningful data; because data can be available in any form, with a lot of noise and bias, we need to analyse it manually and preprocess it for further analysis. KDD is one such approach that can be used here. Some sectors require it even more, such as healthcare management and defence management, in which accuracy plays a very big role. The more noise-free the data is, the easier it is to use.

[Figure: Process of KDD]

For all the datasets I have used the KDD methodology: data selection, preprocessing, transformation, data mining, evaluation of the different models, and interpretation of the results.

A. Credit Card Fraud Detection Dataset
For the credit card dataset, PCA has already been done because users' credit card information and other user details are confidential. This is a classification problem of supervised learning.
Dependent Variable: Class, which has two values, False and True.
Data Preprocessing:
• Import the data available in CSV into RStudio. Load the necessary libraries, such as dplyr for data manipulation, and caret and caTools for data splitting.
• Check the head and structure of the dataset, which show that there are 32 variables in total and almost all of them are numeric. As PCA has already been done, there is no need to perform normalisation.
• There were no missing values in the dataset.
• The dataset contains 284807 records, which is very large. Before applying ML methods, I binned my data frame, changing all numerical variables into categorical bins of the same length. All the columns are now factors with 5 levels, which makes it easy to run the model. The OneR package, which provides the bin method, was used.
• As this is a classification problem, I checked for class imbalance in my dependent variable and found that, out of 284807 records, approximately 99.9% fall into False (non-fraudulent) and only the remainder into fraud. I split my data into training and test data in an 80:20 ratio using the split function of the caTools library and ran my model on this. After that I predicted against the complete test data with the trained model, which gives 99.9% accuracy ((TP + TN) / total number of cases), meaning 56961 cases have been correctly predicted. With classes this imbalanced, such a high accuracy shows that the model overfits to the majority class.
• To tackle the class imbalance I used two techniques: SMOTE and downsampling.
• SMOTE sampling is used for handling unbalanced classification. Two parameters, perc.over and perc.under, control the over-sampling of the minority class and the under-sampling of the majority class. This is a middle way: N. Chawla et al. [17] explain that it is a combination of up-sampling and down-sampling, which gives hybrid sampled data and handles the imbalance. SMOTE sampling is provided by the DMwR package in R.
• C. Tantithamthavorn et al. [18] explain that downsampling makes the majority class frequency the same as that of the minority class; it basically down-samples the majority class. It is available as the R function downSample provided by the caret package. I cannot use up-sampling, as it would raise the frequency of the minority class to that of the majority class, resulting in about 4 hundred thousand records, which might be very difficult to run.
• I first ran SVM on the SMOTE-sampled data using the kernel "vanilladot", which is a linear kernel and mainly used for sparse data; here I am using it to compare its efficiency with "rbfdot", which is a more generic kernel based on the Gaussian radial basis function. Then I ran SVM on the downsampled training data using vanilladot.
• After that, I trained a model using Naïve Bayes on the downsampled data, as the SMOTE data is still too large for Naïve Bayes.
• I predicted with the above models against the original split test data.
• The results are compared in the Evaluation section.
Data Modelling:
I chose the Naïve Bayes and support vector machine algorithms, as they are widely used for problems like these.
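The effect of the class imbalance and of downsampling described in the preprocessing steps can be illustrated with a small sketch. The work itself was done in R with caret's downSample; the Python below is only an assumed stand-in that uses the standard library, and the toy data, the `downsample` helper, and the 20-in-10,000 fraud rate are hypothetical values chosen for illustration, not figures from the real dataset.

```python
import random
from collections import Counter


def downsample(rows, label_of, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        # Sample without replacement down to the size of the rarest class.
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 20 fraudulent rows out of 10,000 (not the real dataset).
rows = [{"id": i, "Class": i % 500 == 0} for i in range(10_000)]
labels = [row["Class"] for row in rows]

# A "model" that always predicts non-fraud is still right on every genuine row.
baseline_accuracy = sum(1 for y in labels if not y) / len(labels)

balanced = downsample(rows, lambda row: row["Class"])
class_counts = Counter(row["Class"] for row in balanced)

print("accuracy of always predicting non-fraud:", baseline_accuracy)
print("class counts after downsampling:", dict(class_counts))
```

The baseline shows why a very high accuracy on imbalanced data says little about fraud detection, and the balanced sample is the kind of training set that the SVM and Naïve Bayes models were fitted on.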
B. Crop Production
I have mainly taken 4 crops: Rice, Banana, Turmeric, and Sugarcane.
Dependent Variable: Production (continuous data).
Data Preprocessing:
• Import the CSV file into R. It contains 7 variables and 246091 records.
• Check for NA values and omit them.
• I checked the correlation using GGally and corrplot, which are R packages. The Area field is highly correlated with my dependent variable.
• The structure of the dataset shows that there are a lot of categorical variables, such as District name and Crop, which create problems during prediction. To deal with this, I created dummy values using the dummyVars R function, which internally encodes categorical variables into 0 and 1.
• After that I checked the skewness of the continuous variables Area and Production, which was very high (11, 3.33) and made the model inefficient. I applied log1p to both variables, and the skewness reduced to (-0.5, -0.3).

[Figures: skewness before and after the log1p transformation]

• After that I split the data into training and test sets in a 70:30 ratio, trained a simple regression model, and made predictions. I drew various residual plots, which are discussed in the Evaluation section.
• Then I ran a regression tree model using rpart and plotted the model fit.
Data Modelling:
I am building a model for crop production using multiple linear regression and regression trees.

C. Online Store
Data Pre-processing:
• Import the xlsx file into R.
• Convert the data type of the invoice date into a date format using the anytime R function.
• Convert invoiceNo, StockCode, Description, and Country to factors to make them categorical and avoid any other values. At the same time, convert Quantity and Customer id from numeric to integer to avoid float and double values.
• colSums now gives the NA values column-wise: 9288 values in customer id are null, meaning no invoice has been generated for them. We cannot remove them if we want to find the top ten selling items of the store. So I added a new column, revenue (quantity * unit price), for finding the top ten items. I listed the top ten items and saw that Dotcom Postage is the best-selling item of the store.
• Now we can remove the NA values from customer ID and filter the records for the above top 10 selling items.
• I added a month column for visualising the transactions per month.
• I show the top 10 countries that produce the highest revenue. The highest revenue comes from customers in the UK.

Data preparation for clustering:
I created clusters by grouping on the customer id and country columns. I calculated customer regularity by counting the unique year and month combinations with transactions for every customer. As the data covers 13 months, I divided this count by 13, which gives the regularity rate. Then I created new item columns for every stock item in the new dataset for clustering. Then I performed PCA, as I have a very wide dataset. According to S. Mishra et al. [19], PCA is a method to analyze data with inter-correlated dependent variables; internally it works on eigenvalues and eigenvectors.

Data Modelling:
I created clusters, applied k-means, and also fitted one regression model just to see the outcomes. I discuss its limitations and effectiveness in the Evaluation section.
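The customer regularity rate used in the data preparation for clustering can be sketched as follows. This is an illustrative Python version under stated assumptions, not the R code used in the project: the `regularity` function, the `MONTHS_COVERED` constant, and the sample transactions are hypothetical, and only the rule itself (distinct active year-months divided by 13) comes from the text.

```python
from collections import defaultdict
from datetime import date

MONTHS_COVERED = 13  # the text states that the data spans 13 months


def regularity(transactions, months_covered=MONTHS_COVERED):
    """Map each customer id to (distinct active year-months) / months_covered."""
    active_months = defaultdict(set)
    for customer_id, invoice_date in transactions:
        # Several invoices in the same month count only once.
        active_months[customer_id].add((invoice_date.year, invoice_date.month))
    return {
        customer_id: len(months) / months_covered
        for customer_id, months in active_months.items()
    }


# Hypothetical transactions as (customer id, invoice date) pairs.
transactions = [
    (17850, date(2010, 12, 1)),
    (17850, date(2010, 12, 15)),
    (17850, date(2011, 1, 4)),
    (13047, date(2011, 6, 9)),
]

rates = regularity(transactions)
for customer_id, rate in rates.items():
    print(customer_id, round(rate, 3))
```

A customer who buys in every month of the period gets a rate of 1.0, so the value can be used directly as a numeric feature next to the per-item columns before PCA and k-means.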
IV. EVALUATION
After successfully running all the models for the three datasets, this section explains the model evaluation and its parameters.
A. Credit Card Fraud Detection
For evaluation I used confusion matrices and calculated the accuracy of the model. The confusion matrix is a good measure for classification models.
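As a minimal sketch of how accuracy is read off a confusion matrix, (TP + TN) / total number of cases, the Python below counts the four cells for a binary fraud label. The function names and the ten example labels are hypothetical; precision and recall are added here only as commonly used companions to accuracy on imbalanced data and are not results reported in this work.

```python
def confusion_matrix(actual, predicted):
    """Count TP, FP, TN, FN for boolean labels where True means fraud."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["TP"] += 1
        elif a and not p:
            counts["FN"] += 1
        elif not a and p:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts


def accuracy(cm):
    """(TP + TN) / total number of cases."""
    return (cm["TP"] + cm["TN"]) / sum(cm.values())


def precision(cm):
    """Share of predicted frauds that are real frauds (0.0 if none predicted)."""
    flagged = cm["TP"] + cm["FP"]
    return cm["TP"] / flagged if flagged else 0.0


def recall(cm):
    """Share of real frauds that were caught (0.0 if there are no frauds)."""
    frauds = cm["TP"] + cm["FN"]
    return cm["TP"] / frauds if frauds else 0.0


# Hypothetical labels for ten transactions.
actual = [True, True, False, False, False, False, False, False, False, False]
predicted = [True, False, False, False, False, False, False, False, False, True]

cm = confusion_matrix(actual, predicted)
print(cm, accuracy(cm), precision(cm), recall(cm))
```

In this toy case the accuracy is high even though only half of the frauds are caught, which is why the individual cells of the matrix are worth inspecting alongside the accuracy.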
C. Online Production
The root node error is 10, which is not good but also not too bad.
V. CONCLUSION AND FUTURE WORK
A. Credit Card Fraud Detection
Credit card fraud detection is a very popular area of study for researchers because its scope is very large and the patterns of fraud keep changing over time. In my research both supervised learning methods work very well. There was not much difference in accuracy when running SVM with different kernels, but overall the "vanilladot" kernel works slightly faster than the generic "rbfdot". The Naïve Bayes model, however, runs very quickly compared to SVM, and almost the same accuracy as with the SVM methods was found in the evaluation against the complete test data. As for limitations, this was not the real dataset but already PCA-transformed data, so it falls short for predicting real-time fraud in financial transactions. Also, as I am not a domain expert, I believe that in future, if we can work with a domain analyst and take real-time feedback, we can make the system much better. If we can also include more predictors, such as geolocation and device information, then a very efficient model can be created.

B. Crop Production
As crops are one of the basic necessities of human beings, scientists around the world study this topic regularly. I ran multiple linear regression, which underperformed with high bias; one of the reasons is the involvement of a high number of categorical variables with multiple levels. The regression tree model runs faster than the linear one. That there are not many predictor variables other than area is one of the limitations of this work. In future we can add additional features to the dataset and predict crop production taking geo-satellite variables and environment variables into account.

C. Online Production
There is a lot of scope in this dataset: we can visualise customer behaviour using an EDA process alone, and we can also predict sales using regression and clustering algorithms. There are some limitations in my proposed work; for example, I have not analyzed the clusters through visualisations. The clustering algorithm can be implemented in a more mature way by analysing it more thoroughly. With my limited knowledge of clustering algorithms I have not achieved much in prediction and training the model, but I thoroughly carried out the overall analysis of customer behavior through various visualisations. I look forward to extending this work to gain more understanding of clustering algorithms.

REFERENCES
[1] K. Liakos, P. Busato, D. Moshou, S. Pearson and D. Bochtis, "Machine Learning in Agriculture: A Review", Sensors, vol. 18, no. 8, 2018. Available: https://www.mdpi.com/1424-8220/18/8/2674/htm. [Accessed 13 December 2019].
[2] V. Shah and P. Shah, "Groundnut Crop Yield Prediction Using Machine Learning Techniques", vol. 3, no. 5, 2018. Available: https://www.researchgate.net/publication/326112319_Groundnut_Crop_Yield_Prediction_Using_Machine_Learning_Techniques. [Accessed 13 December 2019].
[3] S. Mishra, D. Mishra and G. Santra, "Applications of Machine Learning Techniques in Agricultural Crop Production: A Review Paper", Indian Journal of Science and Technology, vol. 9, no. 38, 2016. Available: http://www.indjst.org/index.php/indjst/article/viewFile/95032/74105. [Accessed 13 December 2019].
[4] B. Sitienei, S. Juma and E. Opere, "On the Use of Regression Models to Predict Tea Crop Yield Responses to Climate Change: A Case of Nandi East, Sub-County of Nandi County, Kenya", Climate, vol. 5, no. 3, 2017. Available: https://www.researchgate.net/publication/318496719_On_the_Use_of_Regression_Models_to_Predict_Tea_Crop_Yield_Responses_to_Climate_Change_A_Case_of_Nandi_East_Sub-County_of_Nandi_County_Kenya. [Accessed 13 December 2019].
[5] K. Olson and G. Olson, "Use of multiple regression analysis to estimate average corn yields using selected soils and climatic data", Agricultural Systems, vol. 20, no. 2, pp. 105-120, 1986. Available: https://www.sciencedirect.com/science/article/pii/0308521X86900624?via%3Dihub. [Accessed 14 December 2019].
[6] Ramakrishna Anil, A statistical approach to estimate seasonal crop production in India, Department of Computer Science, University of Southern California, Los Angeles, CA. [Online]. Available: https://pdfs.semanticscholar.org/5c74/91174c1d8b2029a8b273c6e04a95aa77cea4.pdf [Accessed on: Nov 2, 2019]
[7] Y. Cai et al., "Crop yield predictions - high resolution statistical model for intra-season forecasts applied to corn in the US", Gro Intelligence, 2019. Available: https://gro-intelligence.com/yield-model-pdf/us-corn. [Accessed 14 December 2019].
[8] C. Phua, V. Lee, K. Smith and R. Gayler, "Comprehensive Survey of Data Mining-based Fraud Detection Research", School of Business Systems, 2019.
[9] W. Lim, A. Sachan and V. Thing, "Conditional Weighted Transaction Aggregation for Credit Card Fraud Detection", Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pp. 3-16, 2014. Available: 10.1007/978-3-662-44952-3_1 [Accessed 15 December 2019].
[10] A. Thennakoon, C. Bhagyani, S. Premadasa and S. Mihiranga, "Real-time Credit Card Fraud Detection Using Machine Learning", Research Gate, 2019. Available: https://www.researchgate.net/publication/334761474_Real-time_Credit_Card_Fraud_Detection_Using_Machine_Learning. [Accessed 15 December 2019].
[11] F. Carcillo, Y. Le Borgne, O. Caelen, Y. Kessaci, F. Oblé and G. Bontempi, "Combining unsupervised and supervised learning in credit card fraud detection", Information Sciences, 2019. Available: https://www.sciencedirect.com/science/article/pii/S0020025519304451. [Accessed 15 December 2019].
[12] N. Carneiro, G. Figueira and M. Costa, "A data mining based system for credit-card fraud detection in e-tail", Research gate, vol. 95, pp. 91-101, 2017. Available: https://www.researchgate.net/publication/312255358_A_data_mining_based_system_for_credit-card_fraud_detection_in_e-tail. [Accessed 15 December 2019].
[13] R. Gupta and C. Pathak, "A Machine Learning Framework for Predicting Purchase by Online Customers based on Dynamic Pricing", Procedia Computer Science, vol. 36, 2014. Available: https://www.researchgate.net/publication/275541641_A_Machine_Learning_Framework_for_Predicting_Purchase_by_Online_Customers_based_on_Dynamic_Pricing. [Accessed 15 December 2019].
[14] F. Weber and R. Schütte, "A Domain-Oriented Analysis of the Impact of Machine Learning—The Case of Retailing", Big Data and Cognitive Computing, vol. 3, no. 1, p. 11, 2019. Available: 10.3390/bdcc3010011.
[15] G. Kronberger and M. Affenzeller, "Market Basket Analysis of Retail Data: Supervised Learning Approach", Research Gate, 2011. Available: https://www.researchgate.net/publication/221431835_Market_Basket_Analysis_of_Retail_Data_Supervised_Learning_Approach. [Accessed 15 December 2019].
[16] R. Liu, Y. Lee and H. Mu, "Customer Classification and Market Basket Analysis Using K-Means Clustering and Association Rules: Evidence from Distribution Big Data of Korean Retailing Company", Research Gate, 2019. Available: https://www.researchgate.net/publication/330506538_Customer_Classification_and_Market_Basket_Analysis_Using_K-Means_Clustering_and_Association_Rules_Evidence_from_Distribution_Big_Data_of_Korean_Retailing_Company. [Accessed 15 December 2019].
[17] N. Chawla, K. Bowyer, L. Hall and W. Kegelmeyer, "SMOTE: Synthetic Minority Over-sampling Technique", Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002. Available: https://www.researchgate.net/publication/220543125_SMOTE_Synthetic_Minority_Over-sampling_Technique. [Accessed 15 December 2019].
[18] C. Tantithamthavorn, A. Hassan and K. Matsumoto, "The Impact of Class Rebalancing Techniques on the Performance and Interpretation of Defect Prediction Models", IEEE Transactions on Software Engineering, pp. 7-7, 2018. Available: https://arxiv.org/pdf/1801.10269.pdf. [Accessed 15 December 2019].
[19] S. Mishra et al., "Principal Component Analysis", 2017. Available: https://www.researchgate.net/publication/316652806_Principal_Component_Analysis. [Accessed 17 December 2019].
[20] Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in Fayyad, Piatetsky-Shapiro, Smyth, Uthurusamy, Advances in Knowledge Discovery and Data Mining, AAAI Press / The MIT Press, Menlo Park, CA, 1996, pp. 1-34.