Final Report

Abstract: I have applied five machine learning models to three datasets from different domains. The first dataset concerns credit card fraud detection; principal component analysis has already been applied to it because it contains confidential data related to financial accounts. As this is a classification problem, I have applied SVM with various kernels and the Naïve Bayes algorithm. The dataset has about 300 thousand records with imbalanced classes, so various sampling techniques were used. Some SVM kernels performed better than others. The second dataset is about crop production, where I predict production based on Area for different districts using linear regression and regression tree methods; this dataset has many categorical variables, so dummy coding has been used. The regression tree model was found more satisfactory than the linear one. The third dataset contains online store transactions over a certain period, which I have analyzed with a clustering technique (unsupervised learning) to predict consumer behavior based on items purchased together.

I. INTRODUCTION
The objective of this project is to apply various machine learning models to three large datasets. The reason for choosing each dataset is different.

• Crop Production: Food is one of the fundamental necessities for everyone on this planet. With the growing population the demand for food is also increasing, so agriculture plays an important role. Predicting production would be beneficial for farmers and help them in taking future decisions. I have analysed previous years' crop production data of India and tried to predict production. The dataset for this analysis is available at https://data.gov.in/resources/district-wise-season-wise-crop-production-statistics-1997

• Credit Card Transactions: People make online transactions ranging from one euro to hundreds of thousands of euros every second. Sometimes the payment system becomes vulnerable and can be misused. Fraudulent transactions cost a lot to individuals, to companies and to governments, not just in terms of money but also in terms of peace of mind. The best way to deal with this is to build a system that singles out such transactions and helps reduce this kind of fraud in a cost-effective way that does not involve any legal consequences. Here I have analysed the data and built a model to predict the fraudulent transactions among all the transactions. I have evaluated the accuracy of this model using confusion matrices. The dataset for this analysis is available at https://data.world/vlad/credit-card-fraud-detection

• Online Retail Dataset: Market trends change from time to time. It is important to know what the customer wants and how easily a store can provide it (be it a physical store or an online store). The price, the availability and how items are bought by customers tell us about their behaviour, and the shop owner can use this to increase sales and manage the store in a better way. Similarly, I have taken the data of a UK online store and, using clustering techniques, tried to predict consumer behavior, for example prediction of sales and segmentation of items. The dataset for this analysis is available at https://archive.ics.uci.edu/ml/datasets/Online%20Retail

II. RELATED WORK
A. Crop Production Dataset
Liakos et al. [1] explain how machine learning can be used in the agriculture segment, which is referred to as digital agriculture. Data is very important and useful in building such models, which can bring revolutionary changes in the field. In this research the main focus was on crop management, water management, soil management and livestock production. It explained how machine learning works for supervised and unsupervised learning. Yield prediction has been achieved by ML methods; the aim is to apply a cost-effective, low-price model for predicting coffee fruits on a branch, which helps coffee growers economically. This is similar to my work, as I am trying to predict the yield of various crops on the basis of area. They have also used a regression tree model.

V. Shah and P. Shah [2] also used a multiple linear regression model for predicting groundnut yield for a few states in their research. The good thing is that they have considered other environmental factors such as rain, biotic factors like pH value, and Area in the model building, unlike the model I have used, which depends mainly on the Area factor; that is a limitation of the dataset I have used.

S. Mishra et al. [3] describe similar research in which models are built for the prediction of the "corn" crop using ANN (decision tree) and multiple linear regression. They have compared both models in the process. Regression analysis on the dataset showed that corn production depends less on environmental factors and more on planting practices, which is very useful information. The correlation of dependent and independent variables has also been established using statistical analysis. Time series analysis has also been done to compare the previous years' data and train the model accordingly.

B. Sitienei et al. [4] studied tea crop production in the Kenya region; tea is produced in around 58 countries. Multiple regression models have been created in the past for wheat yield (and also for maize), and a similar approach has been taken here for tea. Climate variables are considered, and a contingency table is used for verification of the model. Several statistical methods have been used for trend analysis, such as the t-test, correlation analysis and multiple linear regression analysis. Results have been shown season-wise (by rainfall), which can be very useful for taking further decisions related



to tea production. K. Olson and G. Olson [5] created a model using soil properties and climate factors. Here also multiple linear regression has been used to generate and analyze coefficients, and time series data of the corn crop for the last 43 years has been taken to build the linear equation.

Ramakrishna [6] has done similar work by applying multiple linear regression and regression trees. They have built a model on randomly selected data and evaluated it by calculating MPE values; linear regression underperformed with high bias. The next model used was a regression tree, which performed well.

In [7] a crop production model is built for corn, which is one of the most valuable crops. The data collected is US time series data combined with remote sensing data. Some specific US states were targeted. The goal was to produce a country-level prediction, so they performed aggregation and built the yield model on that. The results showed the forecasts season-wise.

B. Credit Card fraud Detection Dataset
The research in [8] tries to resolve challenges related to fraud detection methods and also covers new methods for intrusion detection, money laundering, spam detection etc. E-commerce has made it compulsory to monitor credit card transactions and detect fraud by finding trends and suspicious transactions. The SVM method is used for this supervised learning, similar to my model. Neural networks have also been used. Multiple supervised learning models have been used to build a hybrid fraud detection system. On the positive side, hybrid systems are more efficient because multiple techniques are involved, but on the other side the complexity of building such models on similar datasets also increases.

W. Lim et al. [9] explain how to build a fraud detection system using the transaction aggregation method, which calculates the sum of transaction amounts for each feature over the last few days. Every transaction is labeled as "fraudulent" or "legitimate", which is the dependent variable. A weighting system has been used. The positive aspect is the accuracy of the mathematical model they form, but the downside is the complexity of the weighted column, which is not easy to interpret and execute.

A. Thennakoon et al. [10] describe how to choose a suitable algorithm for fraud patterns and also compare various machine learning algorithms for the same. In the dataset used in this research, the credit card number is hashed for security reasons. The dataset is also highly imbalanced, like the one I am using in my research. Data cleaning, transformation, normalisation and reduction have been done prior to getting a final dataset. For handling the imbalanced classification, SMOTE and CNN sampling techniques were used, and 10-fold cross validation was applied before resampling. They have used SVM, Naïve Bayes, KNN and logistic classification models, and then made confusion matrices for the performance of the best model. The positive aspects of this research are its process for creating fraud detection models and the accuracy shown in the paper.

F. Carcillo et al. [11] suggest in their research paper that, instead of using only supervised learning methods, which are widely used for fraud detection, a hybrid of supervised and unsupervised learning can be used, which can handle anomalies and performs better by using outlier detection at different levels. It says that the combination was really efficient but there was a lack in the accuracy of the prediction. However in the research.

N. Carneiro et al. [12] suggest that the focus of research should be on more practical scenarios from the perspective of the e-commerce merchant. They have also explored automatic and manual classifiers, as well as the process of selecting and labelling the data. They have not just done statistical analysis but also conducted interviews with fraud analysts to understand the fundamentals and to identify patterns of fraudulent transactions. They transformed the data and trained support vector machine, logistic regression and random forest models on it. They have taken the key indicators of a transaction, like automation level, chargeback, rate of payments refused and processing speed. They calculate a suspicion score and accept or reject a transaction on the basis of a chosen threshold. All three supervised learning methods give good results. So far this is the best work I have found which also takes care of the e-merchant.

C. Online Store Dataset
R. Gupta and C. Pathak [13] try to predict sales based on dynamic pricing. They have done the EDA and grouped customers using the k-means clustering approach. They have followed the widely accepted RFM technique and applied logistic regression to find out which items are more likely to be bought by a customer. The positive aspect is that they build the model keeping both the customer and the organization in mind.

F. Weber and R. Schütte [14] explain the challenges faced in retailing and the machine learning initiatives which can be considered and categorized. It has been observed that most retailers want to predict their sales. They show, as a consolidated summary, all the ML techniques which can be used, with the complexity increasing as more products are added to a category, along with competition and store location. This article is good for understanding the overall concept of the online store and the ML methods used for predicting sales and customer behaviour.

The research in [15] uses supervised learning to predict customer behaviour and to identify the items bought frequently. This comes under market basket analysis, and they have used the Apriori and Prim algorithms. It is important to know this behaviour so that shop owners can change their strategies accordingly. It is very easy to run online campaigns and get feedback, which can also help in building the model. They have done a deep comparison between Apriori and Prim and found that both are suitable depending on the situation. One notable aspect is that Apriori is good for big datasets and Prim for continuous data. They have applied association rules on a subset of the data, not on the complete dataset. In general this process is quite efficient, as they have both approaches, which can handle multiple situations.

R. Liu et al. [16] segment customers into two classes, VIP and non-VIP, through the RFM technique (I am also using RFM in my model); to keep the relationship with the customer, it is required to understand their needs. For that they analysed their transaction data, they
used k-means clustering and association rules to assign an item to a particular cluster. After that they did market basket analysis to understand customer behaviour and created a rule record. The positive aspect is that they also suggest how items should be placed, for example men's wear should be one click away from women's wear. One drawback is that the history information of the organisation is too short to produce meaningful information.

III. DATA MINING AND METHODOLOGY
According to Fayyad et al. [20], knowledge discovery in databases (KDD) is an emerging field and one of the most in-demand methodologies at the current time. To make any system efficient we need meaningful data; as data can be available in any form, with lots of noise and bias, we need to analyse it manually and pre-process it for further analysis. KDD is one such approach which can be used here. Some sectors require it even more, like healthcare management and defence management, in which accuracy plays a very big role. The more noise-free the data is, the easier it is to use.

[Figure: Process of KDD]

For all the datasets I have used the KDD methodology: data selection, preprocessing, transformation, data mining, evaluation of different models and interpretation of the results.

A. Credit Card fraud Detection Dataset
For the credit card dataset, PCA has already been done because users' credit card information and other user details are confidential. This is a classification problem of supervised learning.
Dependent Variable : Class, which has two values, False and True.
Data Preprocessing :
• Import the data available in csv into R studio. Load the necessary libraries like dplyr for data manipulation, caret and caTools for data splitting.
• Check the head and structure of the dataset, which shows that there are 32 variables in total and almost all of them are numeric. As PCA has already been done, there is no need to perform normalisation.
• There were no missing values in the dataset.
Data Modelling :
I have chosen the Naïve Bayes and support vector machine algorithms as they are widely used for problems like these.
• Naïve Bayes Method : It is a machine learning classifier which uses the maximum posterior probability rule for classification. It works on past event data to predict future possibilities based on the Bayesian method.
• Support Vector Machine : It is a non-probabilistic classifier used in classification as well as regression. It uses a kernel to map its input into a high-dimensional feature space. I have used the vanilladot and rbfdot kernels in this research.
• The dataset contains 284807 records, which is very large. Before applying the ML methods, I binned the data frame, changing all numerical variables into categorical bins of equal length. All columns are now factors with 5 levels, which makes the model easier to run. The OneR package, which provides the bin method, has been used.
• As this is a classification problem I have checked for imbalanced classes in my dependent variable and found that out of 284807 records, approx. 99.9% fall into False (non-fraudulent) and only the remainder are fraud. I have split my data into training and test data in an 80:20 ratio using the split function of the caTools library, and run my model on this. After that I predicted with my trained model against the complete test data, which gives 99.9% accuracy ((TP+TN) / total number of cases) over the 56961 test cases. Such a high figure shows that the model is overfitting to the majority class.
• To tackle the imbalanced classification I have used two techniques: SMOTE and downsampling.
• SMOTE sampling is used for handling unbalanced classification. Two parameters, perc.over and perc.under, control the over-sampling of the minority class and the under-sampling of the majority class. This is the midway approach. N. Chawla et al. [17] explain that it is a combination of up-sampling and down-sampling which gives hybrid sampled data and handles the imbalance. SMOTE sampling is provided by the DMwR package in R.
• C. Tantithamthavorn et al. [18] explain that downsampling makes the majority class frequency the same as that of the minority class; it basically reduces the majority class frequency. It is available as the R function downSample provided by the caret package. I cannot use up-sampling, as it would raise the minority class frequency to that of the majority class, resulting in about four hundred thousand records, which might be very difficult to run.
• I have run SVM first on the SMOTE-sampled data using the kernel "vanilladot", which is a linear kernel mainly used for sparse data, but here I am using it to compare its efficiency with "rbfdot", which is a more generic kernel based on the Gaussian radial basis function. Then I ran SVM on the downsampled training data using vanilladot.
• After that, I trained the model using Naïve Bayes on the downsampled data, as the SMOTE data is still too large for Naïve Bayes.
• I have predicted with the above models against the original split test data.
• We compare the results in the evaluation section.
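The two rebalancing ideas used for the credit card data can be illustrated with a small sketch. The study itself relied on R (downSample from caret and SMOTE from DMwR); the Python below is only a simplified illustration of the underlying ideas, not those packages' implementations. The function names `downsample` and `smote_like`, the toy rows and the parameter `k` are assumptions made for this example.

```python
import random


def downsample(rows, label_of, seed=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced


def smote_like(minority, n_new, k=3, seed=0):
    """Create synthetic minority points by interpolating between a point
    and one of its k nearest minority neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position on the segment between base and other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Hypothetical toy data: (feature_1, feature_2, is_fraud), 95 legitimate and 5 fraud rows.
rows = [(float(i), float(i % 7), False) for i in range(95)]
rows += [(100.0 + i, 50.0 + i, True) for i in range(5)]

balanced = downsample(rows, label_of=lambda r: r[2])
fraud_points = [r[:2] for r in rows if r[2]]
extra = smote_like(fraud_points, n_new=10)
print(len(balanced), len(extra))
```

Downsampling discards majority rows, so it is cheap but loses information; the interpolation step adds synthetic minority rows instead, which is why the SMOTE-sampled training set ends up larger than the downsampled one.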

B. Crop Production
I have mainly taken 4 crops: Rice, Banana, Turmeric and Sugarcane.
Dependent Variable : Production (continuous data).
Data Preprocessing :
• Import the csv file into R. It contains 7 variables and 246091 records.
• Check for NA values and omit them.
• I have checked the correlation using GGally and corrplot, which are R packages. The Area field is highly correlated with my dependent variable.
• The structure of the dataset shows that there are a lot of categorical variables, such as District name and Crop, which create a problem during prediction. To deal with this problem I have created dummy values using the dummyVars R function, which internally encodes the categorical variables into 0 and 1.
• After that I checked the skewness of the continuous variables Area and Production, which is very high (11, 3.33) and makes the model inefficient. I applied log1p to both variables and the skewness reduced to (-0.5, -0.3).

[Figure: skewness before the transformation]

[Figure: skewness after the transformation]

• After that I split the data into training and test sets in a 70:30 ratio, trained a simple regression model and made predictions. I drew various residual plots, which we discuss in the evaluation part.
• Then I ran the regression tree model and plotted the model fit using rpart.

Data Modelling:
I am building a model for crop production using multiple linear regression and regression trees.

C. Online Store
Data Pre-processing:
• Import the xlsx file into R.
• Convert the data type of the invoice date into a date format using the anytime R function.
• Convert the InvoiceNo, StockCode, Description and Country variables to factors to make them categorical and avoid any other values. Simultaneously convert Quantity and Customer ID from numeric to integer to avoid float and double values.
• colSums gives the NA values column-wise: 9288 values in customer id are null, which means an invoice has not been generated for them. We cannot remove them if we want to find the top ten selling items of the store. So I added a new column, revenue (quantity * unit price), for finding the top ten items. I have shown the top ten items and see that Dotcom Postage is the top-selling item of the store.
• Now we can remove the NA values from customer ID and filter the records for the above top 10 selling items.
• I have added a month column for visualising the transactions per month.
• I have shown the top 10 countries which produce the highest revenue. We found that the highest revenue comes from customers in the UK.
• I have also shown the revenue per month and see that maximum sales occurred in the month of September.

Data preparation for clustering:
I have created clusters by grouping the customer id and country columns. I calculated customer regularity by counting the unique year-month combinations with a transaction for every customer. As the data covers 13 months, I divided this count by 13, which gives the regularity rate. Then I created new columns for every stock item in the new dataset for clustering. Then I performed PCA, as I have a very wide dataset. According to S. Mishra et al. [19], PCA is a method to analyze data with inter-correlated dependent variables. Internally it works on eigenvalues and eigenvectors.

Data Modelling :
I have created clusters by applying k-means, and also one regression model just to see the outcomes. I discuss its limitations and effectiveness in the evaluation part.

IV. EVALUATION
After successfully running all the models for the three datasets, let us explain the model evaluation and parameters.
A. Credit Card fraud Detection
For evaluation I have used confusion matrices and calculated the accuracy of the model. The confusion matrix is a good measure for classification models.
TABLE I. CONFUSION MATRIX FOR VANILLADOT SVM KERNEL

Confusion Matrix    Positive      Negative
Positive            56715 (TP)    148 (FN)
Negative            21 (FP)       77 (TN)

TABLE II. CONFUSION MATRIX FOR RBFDOT SVM KERNEL

Confusion Matrix    Positive      Negative
Positive            56777 (TP)    86 (FN)
Negative            20 (FP)       78 (TN)
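As a cross-check, accuracy and specificity can be recomputed directly from the counts in Tables I and II. The short Python sketch below is only illustrative (the analysis itself was carried out in R); the helper name `metrics` is an assumption, and the positive class is taken to be the non-fraudulent class, following the labelling used in the tables.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity and specificity from confusion matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,     # (TP + TN) / all cases
        "sensitivity": tp / (tp + fn),     # share of positives recovered
        "specificity": tn / (tn + fp),     # share of negatives recovered
    }


# Counts copied from Table I (vanilladot) and Table II (rbfdot).
vanilladot = metrics(tp=56715, fn=148, fp=21, tn=77)
rbfdot = metrics(tp=56777, fn=86, fp=20, tn=78)

for name, m in (("vanilladot", vanilladot), ("rbfdot", rbfdot)):
    print(name, {key: round(value, 4) for key, value in m.items()})
```

Because about 98 of the 56961 test cases are fraudulent, accuracy is dominated by the majority class, so specificity is the more informative figure for how well fraud is caught.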

Looking at the matrices above, the vanilladot SVM correctly predicted 56715 non-fraudulent transactions and incorrectly classified 148 transactions as fraud which are supposed to be in the positive (non-fraudulent) category. Similarly, it incorrectly predicted 21 cases and correctly predicted 77 negative (fraudulent) cases.
For vanilladot the accuracy, (TP+TN) / total number of cases, is 99.7% and the specificity, TN / (FP+TN), is 0.79 (0.0 is worst and 1 is perfect). For the rbfdot SVM the accuracy is about 99.8% and the specificity is 0.80, which is almost similar. By these figures the rbfdot kernel performed slightly better than the vanilladot kernel.

Now we evaluate the accuracy of the Naïve Bayes model. Its accuracy is 99.3%, which is not bad against the whole test data. I have also observed that Naïve Bayes took only a few seconds to run on the SMOTE-sampled data, which has about one hundred thousand rows in total.

B. Crop Production
For multiple linear regression, the model summary shows that after reducing the skewness the R-squared becomes 96%, which means that 96% of the variance in Production can be predicted from the independent variables. The F value is very high (845), which indicates that the independent variables are jointly statistically significant.
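The effect of the log1p transformation on skewness, which the crop model relies on, can be illustrated with a few lines of code. This is a minimal Python sketch rather than the R workflow used in the study; the `area` values are hypothetical, and `skewness` is a helper written for this example that uses the simple moment-based definition, so its values need not match those of a particular R package.

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical, strongly right-skewed "Area" values (a few very large districts).
area = [1, 2, 2, 3, 5, 8, 20, 150, 900, 12000]

# log1p(x) = log(1 + x); it stays defined when a value is exactly 0.
area_log = [math.log1p(a) for a in area]

print(round(skewness(area), 2), round(skewness(area_log), 2))
```

Using log1p rather than log matters here because zero values are mapped to 0 instead of negative infinity.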

Residuals tell us the difference between the observed values and the fitted values, that is, how much variance cannot be explained by our model.

The normal Q-Q plot indicates that the residuals come from a normal distribution if the points form almost a straight line; otherwise they do not. We can see that this model underperforms and is not the best model for this data. Also, the value of MAPE (mean absolute percentage error) is reported as "Inf", which typically happens when some observed values are zero, so it is not informative here.

Let us look at the regression tree plot.

The root node error is 10, which is not good but also not that bad.
R-squared is almost 0.9 (90%), so 90% of the variance in Production can be predicted from the independent variables.

C. Online Store
I have calculated the percentage of variance explained by each principal component; the first PC explains 15%, and so on.

I have considered regularity, customer id and net revenue for the top-selling stock items, and chose 3 centers (k value) as I have 13 columns in total.

Most of the data points clustered around cluster 2. Then I checked the silhouette plot, which tells how similar the points of a cluster are to their own cluster as compared to other clusters.

Its value is 0.96 (it should be in the range of -1 to +1), which says that most objects are well matched with their own cluster.
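The silhouette value quoted above can be made concrete with a small sketch. For each point, a is the mean distance to the other points of its own cluster, b is the smallest mean distance to the points of any other cluster, and the score is (b - a) / max(a, b). The Python code below is an illustrative implementation of that definition with hypothetical two-dimensional points; it is not the R code used in this work and it assumes at least two clusters.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:  # a singleton cluster gets a score of 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical points forming two well separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = silhouette(points, [0, 0, 1, 1])  # labels follow the real groups
bad = silhouette(points, [0, 1, 0, 1])   # labels mix the groups
print(round(good, 3), round(bad, 3))
```

Values near +1 mean that points sit much closer to their own cluster than to any other, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.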
V. CONCLUSIONS AND FUTURE WORK

A. Credit card Fraud Detection
Credit card fraud detection is a very popular subject for researchers because its scope is very large and the patterns of fraud keep changing over time. In my research both supervised learning methods work very well. There was not much difference in accuracy when running SVM with different kernels, but overall the "vanilladot" kernel works slightly faster than the generic "rbfdot". The Naïve Bayes model runs very quickly compared to SVM, and almost the same accuracy as the SVM methods was found in the evaluation against the complete test data. As for limitations, this was not the real dataset but already PCA-transformed data, so it falls short for predicting real-time fraud in financial transactions. Also, as I am not a domain expert, I believe that in future, if we can work with domain analysts and take real-time feedback, we can make the system much better. If we can include more predictors, like geolocation and device information, a very efficient model can be created.

B. Crop Production
As crops are one of the basic necessities of human beings, scientists around the world study this topic regularly. I have run multiple linear regression, which underperformed with high bias; one of the reasons is the involvement of a high number of categorical variables with multiple levels. The regression tree model ran faster than the linear one. One limitation of this work is that there are not many predictor variables other than Area. In future we can add additional features to the dataset and predict crop production using geo-satellite variables and environmental variables.

C. Online Store
There is a lot of scope in this dataset: we can visualise customer behaviour using the EDA process, and we can also predict sales using regression and clustering algorithms. There are some limitations in my proposed work, for example I have not analyzed the clusters through visualisations. The clustering algorithm can be implemented in a more mature way by analysing it more thoroughly. With my limited knowledge of clustering algorithms I have not achieved much in prediction and in training the model, but I have thoroughly done the overall analysis of customer behavior through various visualisations. I look forward to extending this work to gain a better understanding of clustering algorithms.

References

[1] K. Liakos, P. Busato, D. Moshou, S. Pearson and D. Bochtis, "Machine Learning in Agriculture: A Review", Sensors, vol. 18, no. 8, 2018. Available: https://www.mdpi.com/1424-8220/18/8/2674/htm. [Accessed 13 December 2019].
[2] V. Shah and P. Shah, "Groundnut Crop Yield Prediction Using Machine Learning Techniques", vol. 3, no. 5, 2018. Available: https://www.researchgate.net/publication/326112319_Groundnut_Crop_Yield_Prediction_Using_Machine_Learning_Techniques. [Accessed 13 December 2019].
[3] S. Mishra, D. Mishra and G. Santra, "Applications of Machine Learning Techniques in Agricultural Crop Production: A Review Paper", Indian Journal of Science and Technology, vol. 9, no. 38, 2016. Available: http://www.indjst.org/index.php/indjst/article/viewFile/95032/74105. [Accessed 13 December 2019].
[4] B. Sitienei, S. Juma and E. Opere, "On the Use of Regression Models to Predict Tea Crop Yield Responses to Climate Change: A Case of Nandi East, Sub-County of Nandi County, Kenya", Climate, vol. 5, no. 3, 2017. Available: https://www.researchgate.net/publication/318496719_On_the_Use_of_Regression_Models_to_Predict_Tea_Crop_Yield_Responses_to_Climate_Change_A_Case_of_Nandi_East_Sub-County_of_Nandi_County_Kenya. [Accessed 13 December 2019].
[5] K. Olson and G. Olson, "Use of multiple regression analysis to estimate average corn yields using selected soils and climatic data", Agricultural Systems, vol. 20, no. 2, pp. 105-120, 1986. Available: https://www.sciencedirect.com/science/article/pii/0308521X86900624?via%3Dihub. [Accessed 14 December 2019].
[6] Ramakrishna Anil, A statistical approach to estimate seasonal crop production in India, Department of Computer Science, University of Southern California, Los Angeles, CA. [Online]. Available: https://pdfs.semanticscholar.org/5c74/91174c1d8b2029a8b273c6e04a95aa77cea4.pdf [Accessed on: Nov 2, 2019]
[7] Y. Cai et al., "Crop yield predictions - high resolution statistical model for intra-season forecasts applied to corn in the US", Gro Intelligence, 2019. Available: https://gro-intelligence.com/yield-model-pdf/us-corn. [Accessed 14 December 2019].
[8] C. Phua, V. Lee, K. Smith and R. Gayler, "Comprehensive Survey of Data Mining-based Fraud Detection Research", School of Business Systems, 2019.
[9] W. Lim, A. Sachan and V. Thing, "Conditional Weighted Transaction Aggregation for Credit Card Fraud Detection", Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pp. 3-16, 2014. Available: 10.1007/978-3-662-44952-3_1 [Accessed 15 December 2019].
[10] A. Thennakoon, C. Bhagyani, S. Premadasa and S. Mihiranga, "Real-time Credit Card Fraud Detection Using Machine Learning", Research Gate, 2019. Available: https://www.researchgate.net/publication/334761474_Real-time_Credit_Card_Fraud_Detection_Using_Machine_Learning. [Accessed 15 December 2019].
[11] F. Carcillo, Y. Le Borgne, O. Caelen, Y. Kessaci, F. Oblé and G. Bontempi, "Combining unsupervised and supervised learning in credit card fraud detection", Information Sciences, 2019. Available: https://www.sciencedirect.com/science/article/pii/S0020025519304451. [Accessed 15 December 2019].
[12] N. Carneiro, G. Figueira and M. Costa, "A data mining based system for credit-card fraud detection in e-tail", Research Gate, vol. 95, pp. 91-101, 2017. Available: https://www.researchgate.net/publication/312255358_A_data_mining_based_system_for_credit-card_fraud_detection_in_e-tail. [Accessed 15 December 2019].
[13] R. Gupta and C. Pathak, "A Machine Learning Framework for Predicting Purchase by Online Customers based on Dynamic Pricing", Procedia Computer Science, vol. 36, 2014. Available: https://www.researchgate.net/publication/275541641_A_Machine_Learning_Framework_for_Predicting_Purchase_by_Online_Customers_based_on_Dynamic_Pricing. [Accessed 15 December 2019].
[14] F. Weber and R. Schütte, "A Domain-Oriented Analysis of the Impact of Machine Learning - The Case of Retailing", Big Data and Cognitive Computing, vol. 3, no. 1, p. 11, 2019. Available: 10.3390/bdcc3010011.
[15] G. Kronberger and M. Affenzeller, "Market Basket Analysis of Retail Data: Supervised Learning Approach", Research Gate, 2011. Available: https://www.researchgate.net/publication/221431835_Market_Basket_Analysis_of_Retail_Data_Supervised_Learning_Approach. [Accessed 15 December 2019].
[16] R. Liu, Y. Lee and H. Mu, "Customer Classification and Market Basket Analysis Using K-Means Clustering and Association Rules: Evidence from Distribution Big Data of Korean Retailing Company", Research Gate, 2019. Available: https://www.researchgate.net/publication/330506538_Customer_Classification_and_Market_Basket_Analysis_Using_K-Means_Clustering_and_Association_Rules_Evidence_from_Distribution_Big_Data_of_Korean_Retailing_Company. [Accessed 15 December 2019].
[17] N. Chawla, K. Bowyer, L. Hall and W. Kegelmeyer, "SMOTE: Synthetic Minority Over-sampling Technique", Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002. Available: https://www.researchgate.net/publication/220543125_SMOTE_Synthetic_Minority_Over-sampling_Technique. [Accessed 15 December 2019].
[18] C. Tantithamthavorn, A. Hassan and K. Matsumoto, "The Impact of Class Rebalancing Techniques on the Performance and Interpretation of Defect Prediction Models", IEEE Transactions on Software Engineering, pp. 7-7, 2018. Available: https://arxiv.org/pdf/1801.10269.pdf. [Accessed 15 December 2019].
[19] S. Mishra et al., "Principal Component Analysis", 2017. Available: https://www.researchgate.net/publication/316652806_Principal_Component_Analysis. [Accessed 17 December 2019].
[20] Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in Fayyad, Piatetsky-Shapiro, Smyth, Uthurusamy, Advances in Knowledge Discovery and Data Mining, AAAI Press / The MIT Press, Menlo Park, CA, 1996, pp. 1-34.
