Machine Learning Techniques Using Python for Data Analysis
J.V.N. Lakshmi
Acharya Institute of Management and Sciences,
MCA Department,
Bangalore, 560 058, India
Email: [email protected]
1 Introduction
2 Background
Figure 1 illustrates the r² calculated from the sum of squared errors and the covariance, giving the correlation between x and y. Figure 2 shows the plot representing the predictions: it describes the linear model, and the line shows the deviation of the observed values from the averages used for the predictions.
In the classification problem, the data take binary values 0 and 1, where 1 represents the positive class and 0 represents the negative class.
This kind of classification over discrete values can be handled with logistic regression, which trains a classifier to predict the label for a given x.
g(z) = \frac{1}{1 + e^{-z}}.  (4)
The sigmoid curve in Figure 3 separates the 0 and 1 classes for a given x_i and its corresponding label y_i.
The logistic regression model is derived from the sigmoid function and follows least squares regression. Least squares regression can itself be derived as the maximum likelihood estimator under a set of suppositions. These suppositions endow the stated classification model with a set of probabilistic assumptions, and the parameters are then fitted via maximum likelihood (Walisa and Wichian, 2013b).
In Snippet 1, the data are split into training and test sets, a cost function is assigned, and the resulting classification is shown by a plot of the decision regions.
Snippet 1 Logistic regression using python (see online version for colours)
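Snippet 1 is reproduced only in the online version. A minimal sketch of what it describes, assuming a scikit-learn pipeline and a synthetic two-feature dataset standing in for the paper's temperature data, might look like this:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the temperature dataset (two features, binary labels);
# the original data is not reproduced in this version of the paper.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)

# Fit the classifier on the training split and score it on the test split.
clf = LogisticRegression()
clf.fit(X_train, y_train)
print('test accuracy: %.3f' % clf.score(X_test, y_test))

# Predict over a grid to draw the decision regions (cf. Figures 3 and 4).
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolor='k')
plt.show()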
Figures 3 and 4 describe the logistic regression classification for the temperature dataset, showing the data points plotted in the decision regions for the test dataset.
w^T x_i + b \le -\rho/2 \quad \text{if } y_i = -1
w^T x_i + b \ge \rho/2 \quad \text{if } y_i = 1
\Leftrightarrow \; y_i (w^T x_i + b) \ge \rho/2.
Snippet 2 shows the regularised classifier, in which the parameter C controls the bias-variance trade-off of the model.
Snippet 2 Support vector machines using python (see online version for colours)
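The code of Snippet 2 likewise appears only online; a plausible sketch, assuming scikit-learn's SVC with a linear kernel on the same kind of synthetic two-feature data, is:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)

# Sweep C: small C means strong regularisation (higher bias, lower variance);
# large C penalises misclassified training points more heavily.
for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel='linear', C=C)
    svm.fit(X_train, y_train)
    print('C=%-7g train=%.3f test=%.3f'
          % (C, svm.score(X_train, y_train), svm.score(X_test, y_test)))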
The variable C controls the penalty for misclassification: large values of C correspond to large error penalties, whereas misclassification errors are tolerated more when the value of C is small. The bias and variance are therefore tuned through C to adjust the trade-off; decreasing C (i.e., strengthening the regularisation) increases the bias and lowers the variance of the model, as Snippet 2 illustrates.
Snippet 3 K nearest neighbour using python (see online version for colours)
In this snippet, the Minkowski distance, a generalisation of the Euclidean (p = 2) and Manhattan (p = 1) distances, is supplied through the metric parameter.
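A minimal version of such a snippet, assuming scikit-learn's KNeighborsClassifier and the bundled iris data in place of the paper's diabetes dataset (which is not distributed with scikit-learn), might be:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in for the diabetes data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)

# metric='minkowski' generalises both distances: p=2 is Euclidean, p=1 Manhattan.
for p in (1, 2):
    knn = KNeighborsClassifier(n_neighbors=5, p=p, metric='minkowski')
    knn.fit(X_train, y_train)
    print('p=%d test accuracy: %.3f' % (p, knn.score(X_test, y_test)))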
By implementing the snippet described above, Figure 5 shows the balance achieved by choosing the right K value for the data. This memory-based approach adapts rapidly to the training data collected for classification. The training data in this context are the diabetes dataset, from which very few features were chosen, as shown in Figure 5.
Figure 6 shows the axis-parallel decision boundaries of a decision tree with a maximum depth of three, using entropy as the impurity criterion; the tree itself is given in Figure 7.
Snippet 4 Decision tree using python (see online version for colours)
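A sketch of the training step behind Figures 6 and 7, assuming scikit-learn's DecisionTreeClassifier with the entropy criterion and maximum depth of three stated in the text (iris again standing in for the diabetes data):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)  # stand-in for the diabetes data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)

# Entropy as the impurity criterion, pruned at a maximal depth of three.
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=1)
tree.fit(X_train, y_train)
print('test accuracy: %.3f' % tree.score(X_test, y_test))
print(export_text(tree))  # textual view of the splits (cf. Figure 7)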
Figure 6 Decision tree representation of the dataset (see online version for colours)
Figure 7 Decision tree for a diabetes dataset using entropy (see online version for colours)
The decision tree algorithm splits the data at the root on the feature that yields the largest information gain. In an iterative process, the split is repeated at each child node until the leaves are pure, that is, until the samples at each node belong to the same class. To keep this from overfitting, the tree is pruned by setting a limit on its maximal depth.
Figure 8 Diabetes dataset plotted using random forest (see online version for colours)
Snippet 5 Random forest using python (see online version for colours)
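A corresponding sketch for Snippet 5, assuming scikit-learn's RandomForestClassifier (the number of trees is an illustrative choice, not taken from the paper):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # stand-in for the diabetes data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)

# An ensemble of entropy-based trees trained on bootstrap samples.
forest = RandomForestClassifier(n_estimators=100, criterion='entropy',
                                random_state=1, n_jobs=-1)
forest.fit(X_train, y_train)
print('test accuracy: %.3f' % forest.score(X_test, y_test))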
4 Unsupervised techniques
In the clustering problem, a training set {x(1), x(2), …, x(m)} is grouped into a few cohesive 'clusters'. The technique does not involve labels y(i); hence it is referred to as unsupervised learning.
4.1 K-means
This clustering algorithm repeatedly carries out two steps:
• assigning each training example x(i) to the closest cluster centroid μj
• moving each centroid μj to the mean of the points assigned to it.
For each i, set
c^{(i)} := \arg\min_j \left\| x^{(i)} - \mu_j \right\|^2.  (9)
and for each j, set

\mu_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}.  (10)
The cluster centroids are initialised by choosing k training examples at random and setting the centroids equal to the values of those examples.
Snippet 6 illustrates the distortion function used with K-means, which measures the sum of squared distances between each training example and its cluster centroid. The distortion function J, implemented here on the iris dataset, is non-convex, so coordinate descent on J is not guaranteed to converge to the global minimum; among the different clusterings found, the one with the lowest distortion is chosen. Figure 9 uses the diabetes dataset for executing the K-means clustering technique.
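Since Snippet 6 is available only online, a minimal NumPy sketch of the two update steps in equations (9) and (10), run on the iris data mentioned above, could read:

import numpy as np
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
k = 3
rng = np.random.default_rng(0)

# Initialise the centroids to k randomly chosen training examples.
mu = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(100):
    # Equation (9): assign each example to its closest centroid.
    c = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1), axis=1)
    # Equation (10): move each centroid to the mean of its assigned points
    # (assumes no cluster ends up empty, which holds for this seed).
    new_mu = np.array([X[c == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_mu, mu):
        break
    mu = new_mu

# Distortion: sum of squared distances to the assigned centroids.
c = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1), axis=1)
print('distortion:', ((X - mu[c]) ** 2).sum())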
E-step: w_j^{(i)} := p\left( z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma \right).  (11)
M-step: updating the parameters based on the full assignments. If you work out the math of choosing the best parameter values given the features of each data point in the dataset, it comes out to "take the mean of all the data points that were labelled as c" (Schwarz, 1978).
\phi_j := \frac{1}{m} \sum_{i=1}^{m} w_j^{(i)}.  (12)
The EM algorithm is an iterative maximum likelihood procedure whose updates are nearly identical to those for estimating the parameters of the Gaussian discriminant analysis model, except that here the latent z^{(i)} plays the role of the class labels; the procedure is illustrated in Snippet 7.
Snippet 7 Expectation and maximisation step using python (see online version for colours)
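A compact stand-in for Snippet 7, assuming scikit-learn's GaussianMixture, whose fit() alternates the E-step of equation (11) and the M-step of equation (12) internally:

from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X, _ = load_iris(return_X_y=True)

# EM for a mixture of Gaussians: the E-step computes the responsibilities
# w_j^(i) and the M-step re-estimates phi, mu and sigma from them.
gmm = GaussianMixture(n_components=3, covariance_type='full', max_iter=100,
                      random_state=1)
gmm.fit(X)

print('mixing weights phi_j:', gmm.weights_)  # equation (12)
print('responsibilities for the first point:', gmm.predict_proba(X[:1]))  # equation (11)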
5 Proposed methodology
When analysing data, time is a vital factor with a major impact on execution. The execution time of each node is read when the datasets fit into distributed memory. Almost all machine learning algorithms are iterative, so memory-based execution of jobs is crucial; execution from disk is less efficient.
When the dataset does not fit into distributed memory, the data are initially stored on local disks and are used from there until the execution terminates. The access pattern determines whether a given job is memory based or not. To improve the execution time, the machine learning techniques are implemented as in the following model.
The model in Figure 10 targets the least execution time for data analytics, as it covers both the supervised and unsupervised techniques of machine learning.
Figure 10 Model
As drafted in the proposed model, the dataset is pruned according to the algorithm chosen for analysis, and each supervised and unsupervised technique is then evaluated on it.
The model illustrated below has the following phases:
• reading the dataset as input
• classifying the data based on supervised and unsupervised learning methods
• extracting the features and evaluating the techniques
• comparing with the existing methods
• analysing the results in the final phase.
Features of each algorithm are extracted to obtain the best time efficiency using regression, classification and clustering on the dataset. The time efficiency is evaluated by applying the various ML techniques to the data. Each algorithm can be improved with additional parameters to obtain better results.
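As one hedged illustration of such an evaluation, the per-technique execution time can be measured with a simple timing loop; the models, dataset size and parameters below are placeholders rather than the paper's exact configuration:

import time
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)

# Time the fit() of each technique on the same input, mirroring the idea of
# comparing time efficiency across ML methods.
for name, model in [('linear regression', LinearRegression()),
                    ('logistic regression', LogisticRegression(max_iter=1000)),
                    ('SVM', SVC()),
                    ('K-means', KMeans(n_clusters=3, n_init=10))]:
    start = time.perf_counter()
    model.fit(X, y)  # KMeans accepts and ignores y, per scikit-learn's API
    print('%-20s %.3f s' % (name, time.perf_counter() - start))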
6 Evaluation of algorithms
This paper evaluates the supervised and unsupervised techniques of machine learning.
With the supervised methods, both linear and logistic regression are fitted to pairs of values, the target values falling on either side of the regression line. With the unsupervised methods, the points are scattered into clusters formed by K-means.
This evaluation reveals the impact on supervised techniques for data analytics.
Our method targets multiple jobs using different parameters. Since the input of the jobs is
the same, the job integration technique improves the performance of the jobs. Figure 11 illustrates the average of the linear regression methods alongside the existing average.
In Figure 12, the graph shows the average taken from the time complexities calculated for each machine learning technique. The straight line is the linear average and the blue line is the average of the time, space, read and write operations of each machine learning method.
Figure 11 Average and linear average (see online version for colours)
7 Result analysis
The machine learning techniques reveal the efficient use of time and space. These methods train the machine so that it adapts to the dataset. Figure 13 gives the gist of the techniques used for ML.
Our method optimises the assignments, using the advantages of such jobs to develop a deadline scheduling method. The existing deadline scheduling method maximises the number of jobs that can be run in the cluster while satisfying the deadlines of all jobs: jobs are scheduled using only the minimum number of nodes, so that the cluster keeps free nodes for later jobs. In contrast, our method uses the entire cluster to complete jobs as early as possible.
Executing data analysis jobs with various parameters is common in machine learning but time consuming. The proposed method optimises the job assignment for machine learning so as to minimise the total execution time. Our method can be extended to data analytics job execution, memory-based execution and job integration for machine learning, optimising the job assignment based on the execution. A model of the machine learning techniques is developed to predict the execution time of these jobs under the extended execution.
References
Asha, T. and Shravanthi (2013) 'Building machine learning algorithms on Hadoop for big data', IJET UK Journal, Vol. 3, No. 2, pp.143–147.
Brownlee, J. (2016) Master Machine Learning – How it Works, pp.1–5.
Chu, C-T., Lin, Y-A., Yu, Y., Bradski, G.R., Ng, A.Y. and Olukotun, K. (2006) 'Map-reduce for machine learning on multicore', NIPS, MIT Press, pp.281–288.
Hiroshi, T., Shinji, N. and Takuya, A. (2011) 'Optimizing multiple machine learning jobs on map reduce', IEEE–ICCCTS Conference at Japan, pp.59–66.
Lakshmi, J.V.N. (2016) 'Stochastic gradient descent using linear regression with python', IJA-ERA, Vol. 2, No. 8, December, pp.519–524.
Manar, A. and Stephane, P. (2015) 'Machine learning with Python', SIMUREX, October.
Michael, B. (2015) Machine Learning in Python: Essential Techniques for Predictive Analysis, Print ISBN: 9781118961742, Online ISBN: 9781119183600, DOI: 10.1002/9781119183600.
Pavlo, A. (2009) 'A comparison of approaches to large-scale data analysis', Proc. ACM SIGMOD, USA, pp.100–113.
Rich, C., Karampatziakis, N. and Yessenalina, A. (2008) 'An empirical evaluation of supervised learning in high dimensions', Proceedings of the 25th International Conference on Machine Learning, ACM, New York, USA, pp.96–103.
Schwarz, G. (1978) 'Estimating the dimension of a model', The Annals of Statistics, Vol. 6, No. 2, pp.461–464.
Stuart, R. and Harald, B. (2007) Beginning Python for Language Research, Vol. 2, pp.44–47.
Walisa, R. and Wichian, P. (2013a) 'An adaptive ML on map reduce for improving performance of large scale data analysis', EC2 IEEE, Bangkok, Thailand, pp.234–236.
Walisa, R. and Wichian, P. (2013b) 'An adaptive ML on map reduce for improving performance of large scale data analysis on EC2', IEEE 11th Conference on ICT and Knowledge Engineering 2013.
Website
https://pythonhosted.org/spyder/