
RESEARCH PROJECT ON APPLICATIONS OF MACHINE LEARNING

Report by Sirsha Dutta, M. P. Birla Foundation Higher Secondary School

Research work completed under Dr. Sudip Misra, IIT Kharagpur

Abstract

This research project explores the applications of Machine Learning in identifying celestial bodies, specifically stars, using algorithms like Naïve Bayes, Decision Tree, and KNN. The findings indicate that the Decision Tree algorithm yields the highest accuracy in both training and testing phases. Machine Learning's potential extends beyond astronomy, with applications in various fields such as healthcare, finance, and personalized recommendations.
1. Introduction

In the field of Astronomy, observation of celestial bodies through telescopes is a regular activity. Every time a new celestial body is observed, several factors need to be taken into account before declaring whether or not it is a star, and then to determine the type of the star. These observations contain a huge amount of information regarding the attributes of the given celestial body. In the case of stars, these attributes include luminosity, radius, absolute magnitude, star colour, and so on, and they can be used to determine the type of the star observed.
Clearly, there is an overwhelming amount of information to review. For human beings this becomes a long and tiring activity, and the results are highly subject to manual error.
Thus, to make the process of star identification more efficient and less time-consuming, Machine Learning can be utilised. Machine Learning is the process by which a computer (i.e. a machine) predicts an output for a given input, based on previously supplied information and its corresponding results. The process is known as Machine Learning because, apart from a base program and some initial inputs, the procedure is carried out by the machine independently.
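As an illustrative sketch of this idea (not the project's actual code), the following scikit-learn snippet fits a classifier on a few hypothetical star records and then predicts the type of an unseen one. The feature values and labels are invented for demonstration.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: each row describes one star by
# [luminosity (L/L_sun), radius (R/R_sun), absolute magnitude].
X_train = [
    [0.002, 0.17, 16.0],       # faint, small, dim
    [0.003, 0.15, 15.5],
    [1.0, 1.0, 4.8],           # Sun-like
    [1.2, 1.1, 4.5],
    [200000.0, 1000.0, -7.0],  # extremely bright and large
    [150000.0, 900.0, -6.5],
]
y_train = ["Red Dwarf", "Red Dwarf",
           "Main Sequence", "Main Sequence",
           "Supergiant", "Supergiant"]

# "Learning": the model infers decision rules from the labelled examples.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Prediction for a previously unseen star.
print(model.predict([[0.9, 1.05, 5.0]]))  # expected: ['Main Sequence']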
Alongside the technological developments of recent years, machines have come to play a major role in our lives. A great deal of data is gathered in every part of our lives, and the amount grows day by day. Although machines are often thought of as belonging only to the fields of engineering and computer science, they are encountered in every part of human life. Firms that recognized and invested in this area early are actively using the technology today and achieving success. In the future, machines will succeed at jobs that cannot be done by humans. In such an environment, the applications of machine learning keep growing: predicting the weather and predicting what disease a patient might have based on their symptoms are just two fields where machine learning can be utilized.
There are several kinds of Machine Learning algorithms that can be used to identify stars. In this project, we use three of them, Naïve Bayes, Decision Tree, and KNN (k-Nearest Neighbours), and compare the outputs yielded by the three algorithms to determine the most accurate method of identifying stars using Machine Learning.
2. Related Work

There are three algorithms discussed in this paper, and there has been a considerable amount of work on each of them.

In "Improved naive Bayes classification algorithm for traffic risk management" by Hong Chen, Songhua Hu, Rui Hua and Xiuju Zhao, it is noted that the naive Bayes classification algorithm is widely used in big data analysis and other fields because of its simple and fast algorithmic structure. Aiming at the shortcomings of the naive Bayes classification algorithm, the paper uses feature weighting and Laplace calibration to improve it, obtaining an improved naive Bayes classification algorithm. Through numerical simulation, it is found that when the sample size is large, the accuracy of the improved algorithm is more than 99%, and it is very stable; when the number of sample attributes is less than 400 and the number of categories is less than 24, its accuracy is more than 95%. Through empirical research, it is found that the improved algorithm can greatly improve the correct rate of discriminant analysis, from 49.5% to 92%. Robustness analysis further shows that the improved algorithm has higher accuracy. [1]

In "KNN Model-Based Approach in Classification" by Gongde Guo, Hui Wang, David Bell, Yaxin Bi, and Kieran Greer, it is noted that k-Nearest Neighbours (kNN) is a simple but effective method for classification. The major drawbacks of kNN are (1) its low efficiency, since being a lazy learning method prohibits its use in many applications such as dynamic web mining over large repositories, and (2) its dependency on the selection of a "good value" for k. The authors propose a novel kNN-type method for classification aimed at overcoming these shortcomings. Their method constructs a kNN model of the data, which replaces the data itself as the basis of classification. The value of k is determined automatically, varies for different data, and is optimal in terms of classification accuracy. The construction of the model reduces the dependency on k and makes classification faster. Experiments on public datasets from the UCI machine learning repository show that the kNN-based model compares well with C5.0 and kNN in terms of classification accuracy, but is more efficient than standard kNN. [2]

In "Random Forests and Decision Trees" by Jehad Ali, Rehanullah Khan, Nasir Ahmad, and Imran Maqsood, the authors compare the classification results of two models, Random Forest and Decision Tree (J48), on twenty versatile datasets taken from the UCI repository, containing between 148 and 20,000 instances. The classification parameters consist of correctly classified instances, incorrectly classified instances, F-measure, precision, accuracy, and recall. They discuss the pros and cons of using these models for large and small datasets. The classification results show that Random Forest gives better results for the same number of attributes on large datasets (a greater number of instances), while J48 is handy with small datasets (fewer instances). The results on the breast cancer dataset show that when the number of instances increased from 286 to 699, for a dataset with the same number of attributes, the percentage of correctly classified instances for Random Forest increased from 69.23% to 96.13%. [3]

In "Do we need hundreds of classifiers to solve real world classification problems?" by Amorim, D.G., Barro, S., Cernadas, E., and Delgado, M.F., the authors evaluate 179 classifiers arising from 17 families (discriminant analysis, Bayesian methods, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbours, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines, and other methods). They use 121 datasets from the UCI database to study classifier behaviour independently of the dataset collection. The winners are the random forest (RF) versions implemented in R (accessed via caret) and the SVM with Gaussian kernel implemented in C using LibSVM. [4]

In "Trends in extreme learning machines: a review" by Huang, G., Huang, G., Song, S., and You, K., the authors report the current state of theoretical research and practical advances on the Extreme Learning Machine (ELM). Apart from classification and regression, ELM has recently been extended to clustering, feature selection, representational learning, and many other learning tasks. Due to its remarkable efficiency, simplicity, and impressive generalization performance, ELM has been applied in a variety of domains, such as biomedical engineering, computer vision, system identification, and control and robotics. [5]
3. Methodology
The methodology followed consisted of nine main steps. The coding was done in Python and executed using Anaconda. The algorithm of the program was as follows (a code sketch corresponding to these steps is given after the list):
STEP 1: START
STEP 2: We import the sklearn package.
STEP 3: We split the data frame into categorical and numeric features to perform preprocessing.
STEP 4: We initialize the standard scaler for scaling.
STEP 5: We encode the categorical variables.
STEP 6: We combine the scaled numeric variables and the encoded categorical variables.
STEP 7: We split the data into training and testing sets.
STEP 8: We execute the KNN classifier with 3 neighbours.
STEP 9: We execute the Decision Tree classifier.
STEP 10: We execute the Naive Bayes classifier.
STEP 11: STOP
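Since the report does not reproduce the original source code, the following is a minimal sketch of how these steps could be implemented with scikit-learn. The dataset filename and the column names (e.g. "Star color", "Star type") are assumptions standing in for the actual star dataset used in the project.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# STEPS 2-3: load the data and split the frame into numeric and
# categorical features (file and column names are assumed).
df = pd.read_csv("stars.csv")
numeric_cols = ["Temperature", "Luminosity", "Radius", "Absolute magnitude"]
categorical_cols = ["Star color", "Spectral class"]
X = df[numeric_cols + categorical_cols]
y = df["Star type"]

# STEPS 4-6: scale the numeric features, one-hot encode the categorical
# ones, and combine the two blocks into a single feature matrix.
# sparse_threshold=0.0 forces a dense matrix, which GaussianNB requires.
preprocess = ColumnTransformer(
    [("num", StandardScaler(), numeric_cols),
     ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    sparse_threshold=0.0)
X_processed = preprocess.fit_transform(X)

# STEP 7: split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_processed, y, test_size=0.2, random_state=42)

# STEPS 8-10: fit the three classifiers and report their accuracies.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name,
          "- Train score:", model.score(X_train, y_train),
          "- Test score:", model.score(X_test, y_test))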

4. Results

The train score tells us how well the model fitted the training data. If the model fits too closely to data with a lot of variance, it over-fits: the model curves so much to fit the training data that it generalizes poorly, which results in a poor test score.

The test score is computed only once the model is ready. Up to that point the test data-set has not been touched, so it represents a real-life scenario. The higher the test score, the better the model generalizes.
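To illustrate the over-fitting effect described above, here is a small self-contained sketch (not part of the project's code) that compares an unconstrained decision tree with a depth-limited one on noisy synthetic data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A noisy synthetic dataset makes the effect easy to see.
X, y = make_classification(n_samples=300, n_features=10, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree memorizes the training data (train score 1.0)
# but typically generalizes worse, giving a visibly lower test score.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("deep tree - train:", deep.score(X_train, y_train),
      "test:", deep.score(X_test, y_test))

# Limiting the depth trades some training accuracy for generalization.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0)
shallow.fit(X_train, y_train)
print("shallow tree - train:", shallow.score(X_train, y_train),
      "test:", shallow.score(X_test, y_test))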

The results obtained were:

KNN - Train score: 0.9791666666666666, Test score: 0.9583333333333334
Naive Bayes - Train score: 0.9791666666666666, Test score: 0.9583333333333334
Decision Tree - Train score: 1.0, Test score: 1.0

Figure 1: Accuracy of Machine Learning Algorithms

It is observed from the above results (Figure 1) that, of the three machine learning algorithms, Decision Tree has the highest train and test scores.


5. Conclusion

From the findings of this project, it can be concluded that Decision Tree is the most accurate of the three algorithms, with Naïve Bayes and KNN tied behind it: both the train and test scores of Decision Tree are the highest.

Machine Learning also has great uses ahead of it. It is already used in internet search engines, in email filters to sort out spam, on websites to make personalised recommendations, in banking software to detect unusual transactions, and in many apps on our phones, such as voice recognition. It is effective because it easily identifies trends and patterns, requires little human intervention (automation), improves continuously, handles multi-dimensional and multi-variety data, and has wide applications.


6. References

1. Hong Chen, Songhua Hu, Rui Hua and Xiuju Zhao, "Improved naive Bayes classification algorithm for traffic risk management". [1]
2. Gongde Guo, Hui Wang, David Bell, Yaxin Bi and Kieran Greer, "KNN Model-Based Approach in Classification". [2]
3. Jehad Ali, Rehanullah Khan, Nasir Ahmad and Imran Maqsood, "Random Forests and Decision Trees". [3]
4. Amorim, D.G., Barro, S., Cernadas, E. and Delgado, M.F. (2014), "Do we need hundreds of classifiers to solve real world classification problems?", Journal of Machine Learning Research. [4]
5. Huang, G., Huang, G., Song, S. and You, K. (2015), "Trends in extreme learning machines: a review", Neural Networks. [5]
