Machine Learning Part: Domain Overview
INTRODUCTION
Domain overview
Machine learning aims to predict future outcomes from past data. Machine learning (ML) is a
branch of artificial intelligence (AI) that gives computers the ability to learn without
being explicitly programmed. It focuses on the development of computer programs that can
adapt when exposed to new data; this part of the report covers the basics of machine learning
and the implementation of a simple machine learning algorithm using Python. The process of
training and prediction involves specialized algorithms: the training data is fed to an
algorithm, and the algorithm uses this training data to make predictions on new test data.
Machine learning can be roughly separated into three categories: supervised learning,
unsupervised learning and reinforcement learning. In supervised learning, the program is given
both the input data and the corresponding labels, so the data has to be labeled beforehand,
usually by a human. In unsupervised learning, no labels are provided to the learning algorithm;
the algorithm has to discover the clustering or structure of the input data on its own. Finally,
in reinforcement learning, the program interacts dynamically with its environment and receives
positive or negative feedback to improve its performance.
Data scientists use many different kinds of machine learning algorithms to discover
patterns in data that lead to actionable insights. At a high level, these algorithms can be
classified into two groups based on the way they “learn” about data to make predictions:
supervised and unsupervised learning. Classification is the process of predicting the class of
given data points. Classes are sometimes called targets, labels or categories. Classification
predictive modeling is the task of approximating a mapping function from input variables (X)
to discrete output variables (y). In machine learning and statistics, classification is a
supervised learning approach in which the computer program learns from the input data given to
it and then uses this learning to classify new observations. The data set may simply be
bi-class (for example, identifying whether a person is male or female, or whether an email is
spam or not spam) or it may be multi-class. Some examples of classification problems are speech
recognition, handwriting recognition, biometric identification and document classification.
(Figure: the past dataset trains the machine learning model, which analyses the data and predicts the result.)
The work also has to find the accuracy on the training dataset, the accuracy on the testing
dataset, the specificity, the false positive rate, the precision and the recall by comparing
the algorithms using Python code. The steps involved are:
Define a problem
Prepare the data
Evaluate algorithms
Improve results
Predict results
This exploratory analysis is not meant to provide a final conclusion on the reasons leading to
delays in the railway sector, as it does not by itself involve any inferential statistics
techniques or machine learning algorithms. Supervised machine learning classification
algorithms will then be applied to the travel class dataset to extract patterns, which would
help in predicting the likely travel choice of an affected passenger, thereby helping the
railway operators make better decisions in the future. Multiple datasets from different sources
would be combined to form a generalized dataset, and then different machine learning
algorithms would be applied to extract patterns and to obtain results with maximum accuracy.
At the dawn of artificial intelligence it was discovered that problems which could be
formally described by a list of mathematical rules could easily be solved by machines. This
enabled computers to solve logical problems that were difficult for humans. The first
successes in artificial intelligence mostly took place in formal environments where the
program did not need much knowledge about the rest of the world. This showed that artificial
intelligence excelled at problems that could be described mathematically, while it would
struggle with less formal problems. The present work addresses railway passenger travel choice
prediction under delay faults based on machine learning algorithms.
Data Wrangling
In this section of the report the data is loaded, checked for cleanliness, and then the given
dataset is trimmed and cleaned for analysis. The steps are documented carefully and the
cleaning decisions are justified.
Data collection
The data set collected for predicting passengers is split into a training set and a test set.
Generally, a 7:3 ratio is applied to split the training set and the test set. The data models
created using the Random Forest, Logistic Regression, Decision Tree, K-Nearest Neighbors
(KNN) and Support Vector Classifier (SVC) algorithms are applied to the training set, and
based on the resulting accuracy, the prediction on the test set is done.
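A minimal sketch of this 7:3 split with scikit-learn is shown below; the file name
passenger_data.csv and the target column travel_class are placeholders assumed for
illustration, not the project's actual names.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Load the passenger dataset (file name is an assumed placeholder).
    df = pd.read_csv("passenger_data.csv")

    # 'travel_class' is an assumed name for the target column.
    X = df.drop(columns=["travel_class"])
    y = df["travel_class"]

    # 70/30 split of training and test data, as described above.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    print(X_train.shape, X_test.shape)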
Preprocessing
The data which was collected might contain missing values that may lead to inconsistency. To
gain better results, the data needs to be preprocessed so as to improve the efficiency of the
algorithm. Outliers have to be removed, and variable conversion also needs to be done. Based
on the correlation among attributes, it was observed that the attributes that are significant
individually include property area, education, loan amount and, lastly, credit history, which
is the strongest of all. Some variables, such as applicant income and co-applicant income, are
not significant alone, which is surprising since intuitively they are considered important.
The correlation among attributes can be identified using plot diagrams in the data
visualization process. Data preprocessing is the most time-consuming phase of a data mining
process. Data cleaning removed several attributes that have no significance for the behavior
of a customer. Data integration, data reduction and data transformation are also applicable to
the data. For easy analysis, the data is reduced to a minimum number of records. Initially,
the attributes that are critical for making a credible prediction are identified with
information gain as the attribute evaluator and Ranker as the search method.
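As a hedged sketch of the variable conversion and correlation check described above (the file
name is a placeholder and the encoding choice is one reasonable option, not the project's
fixed method):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    # Assumed placeholder file; df stands for the collected passenger data.
    df = pd.read_csv("passenger_data.csv")

    # Variable conversion: encode categorical (object) columns as integers.
    for col in df.select_dtypes(include="object").columns:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))

    # Inspect the correlation among attributes to spot individually significant ones.
    print(df.corr().round(2))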
Project Goals
Exploratory data analysis and variable identification
Loading the given dataset
Importing the required library packages
Analyzing the general properties
Finding duplicate and missing values
Checking unique and count values
(Figure: classification machine learning workflow: the input dataset goes through data pre-processing, then choose model, train model, test model and tune model, and the trained algorithm finally produces the prediction.)
General Properties
Create cells freely to explore the given data, but do not perform too many operations in each
cell. One option is to do a lot of exploration in an initial notebook. These cells do not have
to be organized, but enough comments should be used to make the purpose of each code cell
clear. Then, after the analysis is done, create a duplicate notebook, trim the excess and
organize the steps so that the result is a flowing, cohesive report that keeps the reader
informed about the steps taken in the investigation. Follow every code cell, or every set of
related code cells, with a markdown cell that describes to the reader what was found in the
preceding cell, and try to make it so that the reader can anticipate what they will see in the
following cell.
Project Requirements
General:
Requirements are the basic constraints that are required to develop a system. Requirements are
collected while designing the system. The following are the requirements that are to be
discussed.
1. Functional requirements
2. Non-Functional requirements
3. Environment requirements
A. Hardware requirements
B. Software requirements
Functional requirements:
Non-Functional Requirements:
1. Define the problem
2. Prepare the data
3. Evaluate algorithms
4. Improve results
5. Predict the result
Environmental Requirements:
1. Software Requirements:
2. Hardware requirements:
Library Packages:
Pandas
Numpy
Matplotlib
Seaborn
Scikit-learn
Tkinter
Software Description
Anaconda Navigator
Anaconda Navigator is a desktop graphical user interface (GUI) included in
Anaconda distribution that allows users to launch applications and manage conda packages,
environments and channels without using command-line commands. Navigator can search for
packages on Anaconda Cloud or in a local Anaconda Repository, install them in an
environment, run the packages and update them. It is available
for Windows, macOS and Linux.
The following applications are available by default in Navigator:
JupyterLab
Jupyter Notebook
QtConsole
Spyder
Glueviz
Orange
Rstudio
Visual Studio Code
Conda:
Conda is an open source, cross-platform, language-agnostic package manager and
environment management system that installs, runs and updates packages and their
dependencies. It was created for Python programs, but it can package and distribute software
for any language (e.g., R), including multi-languages. The Conda package and environment
manager is included in all versions of Anaconda, Miniconda, and Anaconda Repository.
Jupyter Notebook:
The Jupyter Notebook is an open-source web application that allows you to create and
share documents that contain live code, equations, visualizations and narrative text. Uses
include: data cleaning and transformation, numerical simulation, statistical modeling, data
visualization, machine learning, and much more.
Notebook document:
Notebook documents (or “notebooks”, all lower case) are documents produced by
the Jupyter Notebook App, which contain both computer code (e.g. Python) and rich text
elements (paragraphs, equations, figures, links, etc.). Notebook documents are both
human-readable documents containing the analysis description and the results (figures, tables,
etc.) and executable documents which can be run to perform the data analysis.
Notebook Dashboard:
The Notebook Dashboard is the component which is shown first when you
launch Jupyter Notebook App. The Notebook Dashboard is mainly used to open notebook
documents, and to manage the running kernels (visualize and shutdown). The Notebook
Dashboard has other features similar to a file manager, namely navigating folders and
renaming/deleting files.
Working Process:
Download and install Anaconda and get the most useful packages for machine learning in Python.
Load a dataset and understand its structure using statistical summaries and data
visualization.
Build machine learning models, pick the best one and build confidence that its accuracy is
reliable.
Python is a popular and powerful interpreted language. Unlike R, Python is a complete
language and platform that can be used both for research and development and for building
production systems. There are also a lot of modules and libraries to choose from, providing
multiple ways to do each task, which can feel overwhelming.
The best way to get started using Python for machine learning is to complete a project.
It will force you to install and start the Python interpreter (at the very least).
It will give you a bird’s eye view of how to step through a small project.
It will give you confidence, maybe to go on to your own small projects.
When you are applying machine learning to your own datasets, you are working on a
project. A machine learning project may not be linear, but it has a number of well-known
steps:
Define Problem.
Prepare Data.
Evaluate Algorithms.
Improve Results.
Present Results.
The best way to really come to terms with a new platform or tool is to work through a
machine learning project end-to-end and cover the key steps. Namely, from loading data,
summarizing data, evaluating algorithms and making some predictions.
(Figure: the source data is split into a training dataset and a testing dataset.)
Validation techniques in machine learning are used to estimate the error rate of the
machine learning (ML) model, which can be considered close to the true error rate on the
dataset. If the data volume is large enough to be representative of the population, the
validation techniques may not be needed. However, in real-world scenarios we work with samples
of data that may not be truly representative of the population of the given dataset. The work
therefore finds the missing values, the duplicate values and the description of each data
type, that is, whether it is a float or an integer variable. The validation set is the sample
of data used to provide an unbiased evaluation of a model fit on the training dataset while
tuning the model hyperparameters. The evaluation becomes more biased as skill on the
validation dataset is incorporated into the model configuration. The validation set is used to
evaluate a given model, but this is for frequent evaluation; machine learning engineers use
this data to fine-tune the model hyperparameters.
Data collection, data analysis, and the process of addressing data content, quality, and
structure can add up to a time-consuming to-do list. The process of data identification helps
in understanding the data and its properties, and this knowledge helps in choosing which
algorithm to use to build the model. For example, time series data can be analyzed by
regression algorithms, while classification algorithms can be used to analyze discrete data;
the data type format of the given dataset can be shown directly. A number of different data
cleaning tasks are carried out using Python's Pandas library, focusing on probably the biggest
data cleaning task, missing values, so that the data can be cleaned more quickly and less time
is spent cleaning and more time exploring and modeling. Some missing values are just simple
random mistakes; other times there is a deeper reason why the data is missing. It is important
to understand these different types of missing data from a statistics point of view, because
the type of missing data influences how the missing values are detected, how they are filled
in, and whether a basic imputation or a more detailed statistical approach is used for dealing
with them. Before jumping into code, it is important to understand the sources of missing
data. Here are some typical reasons why data is missing:
Users chose not to fill out a field tied to their beliefs about how the results would be used
or interpreted.
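As a hedged Pandas sketch of detecting and imputing these missing values (the file name is a
placeholder and the median/mode strategy is only one reasonable choice):

    import pandas as pd

    df = pd.read_csv("passenger_data.csv")   # assumed placeholder file

    # Detect data types, duplicate rows and missing values.
    print(df.dtypes)
    print("duplicate rows:", df.duplicated().sum())
    print(df.isnull().sum())

    # Basic imputation: median for numeric columns, mode for the rest.
    for col in df.columns:
        if df[col].dtype.kind in "if":        # integer or float column
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])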
Import the libraries needed for access and functional purposes and read the given dataset.
General properties of the given dataset to analyze (a Pandas sketch of these steps follows the list):
Display the given dataset in the form of a data frame
Show the columns
Show the shape of the data frame
Describe the data frame
Check the data type and information about the dataset
Check for duplicate data
Check the missing values of the data frame
Check the unique values of the data frame
Check the count values of the data frame
Rename and drop columns of the given data frame
Specify the type of values
Create extra columns
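The checklist above corresponds roughly to the following Pandas calls; the file name and the
commented column names (travel_class, age, delay_minutes) are hypothetical placeholders, not
the project's real schema.

    import pandas as pd

    # Read the given dataset (file name is an assumed placeholder).
    df = pd.read_csv("passenger_data.csv")

    print(df.head())               # display the data frame
    print(df.columns)              # show the columns
    print(df.shape)                # shape of the data frame
    print(df.describe())           # describe the data frame
    df.info()                      # data types and information
    print(df.duplicated().sum())   # duplicate rows
    print(df.isnull().sum())       # missing values
    print(df.nunique())            # unique values per column
    # Count values of a single (hypothetical) column:
    # print(df["travel_class"].value_counts())

    # Rename and drop columns, fix types, create an extra column (all names hypothetical):
    # df = df.rename(columns={"old_name": "new_name"}).drop(columns=["unused_column"])
    # df["age"] = df["age"].astype(int)
    # df["is_delayed"] = (df["delay_minutes"] > 0).astype(int)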
Sometimes data does not make sense until you look at it in a visual form, such as charts and
plots. Being able to quickly visualize data samples is an important skill both in applied
statistics and in applied machine learning. This section covers the many types of plots that
you need to know when visualizing data in Python and how to use them to better understand
your own data.
How to chart time series data with line plots and categorical quantities with bar charts.
How to summarize data distributions with histograms and box plots.
How to summarize the relationship between variables with scatter plots.
There are many excellent plotting libraries in Python, and it is worth exploring them in order
to create presentable graphics. For quick and dirty plots intended for your own use, the
matplotlib library is recommended. It is the foundation for many other plotting libraries and
for plotting support in higher-level libraries such as Pandas. Matplotlib provides a context
in which one or more plots can be drawn before the image is shown or saved to file, and this
context can be accessed via functions on pyplot.
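A small matplotlib sketch of these plot types, using hypothetical columns (delay_minutes,
fare and travel_class are assumptions for illustration only):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("passenger_data.csv")   # assumed placeholder file

    fig, axes = plt.subplots(2, 2, figsize=(10, 8))

    # Histogram and box plot to summarize a distribution.
    axes[0, 0].hist(df["delay_minutes"].dropna(), bins=30)
    axes[0, 0].set_title("Histogram of delay")
    axes[0, 1].boxplot(df["delay_minutes"].dropna())
    axes[0, 1].set_title("Box plot of delay")

    # Scatter plot to summarize the relationship between two variables.
    axes[1, 0].scatter(df["fare"], df["delay_minutes"], s=5)
    axes[1, 0].set_title("Fare vs. delay")

    # Bar chart for a categorical quantity.
    df["travel_class"].value_counts().plot(kind="bar", ax=axes[1, 1], title="Travel class counts")

    plt.tight_layout()
    plt.show()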
Many machine learning algorithms are sensitive to the range and distribution of
attribute values in the input data. Outliers in input data can skew and mislead the training
process of machine learning algorithms resulting in longer training times, less accurate
models and ultimately poorer results.
Even before predictive models are prepared on training data, outliers can result in
misleading representations and in turn misleading interpretations of collected data. Outliers
can skew the summary distribution of attribute values in descriptive statistics like mean and
standard deviation and in plots such as histograms and scatterplots, compressing the body of
the data. Finally, outliers can represent examples of data instances that are relevant to the
problem such as anomalies in the case of fraud detection and computer security.
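A common way to remove such outliers before training (a sketch that assumes a numeric column,
here the hypothetical delay_minutes) is the interquartile range rule:

    import pandas as pd

    df = pd.read_csv("passenger_data.csv")    # assumed placeholder file
    col = "delay_minutes"                     # hypothetical numeric column

    # Interquartile-range rule: drop rows far outside the middle 50% of the data.
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    df_clean = df[mask]
    print("removed", len(df) - len(df_clean), "outlier rows")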
False Positives (FP): a person who will pay is predicted as a defaulter; the actual class is
no and the predicted class is yes. E.g. the actual class says this passenger did not survive,
but the predicted class tells you that this passenger will survive.
False Negatives (FN): a person who defaults is predicted as a payer; the actual class is yes
but the predicted class is no. E.g. the actual class value indicates that this passenger
survived, but the predicted class tells you that the passenger will die.
True Positives (TP): a person who will not pay is predicted as a defaulter. These are the
correctly predicted positive values, which means that the value of the actual class is yes and
the value of the predicted class is also yes. E.g. the actual class value indicates that this
passenger survived and the predicted class tells you the same thing.
True Negatives (TN): a person who will pay is predicted as a payer. These are the correctly
predicted negative values, which means that the value of the actual class is no and the value
of the predicted class is also no. E.g. the actual class says this passenger did not survive
and the predicted class tells you the same thing.
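These four quantities and the metrics derived from them can be computed with scikit-learn as
in the sketch below; the tiny label lists are illustrative stand-ins for the true test labels
and the model's predictions.

    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 precision_score, recall_score)

    # Illustrative labels (1 = positive class, 0 = negative class); in the project
    # these would be y_test and the classifier's predictions on the test set.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)

    print("accuracy :", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall   :", recall_score(y_true, y_pred))
    print("f1 score :", f1_score(y_true, y_pred))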
Logistic Regression
Random Forest
K-Nearest Neighbors
Decision tree
Support Vector Machines
The k-fold cross validation procedure is used to evaluate each algorithm, importantly
configured with the same random seed to ensure that the same splits of the training data are
performed and that each algorithm is evaluated in precisely the same way. Before comparing the
algorithms, a machine learning model is built using the installed Scikit-Learn library. With
this library package, the preprocessing is done, a linear model with the logistic regression
method, cross validation with the KFold method, an ensemble with the random forest method and
a tree with the decision tree classifier are set up. Additionally, the train set and test set
are split, and the result is predicted by comparing the accuracies.
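A hedged sketch of this comparison is given below; the synthetic data from make_classification
is only a stand-in for the cleaned passenger features, and the 10-fold setting is an
assumption.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in data; in the project X, y come from the cleaned dataset.
    X, y = make_classification(n_samples=300, n_features=10, random_state=7)

    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=7),
        "Random Forest": RandomForestClassifier(random_state=7),
        "KNN": KNeighborsClassifier(),
        "SVC": SVC(),
    }

    # The same KFold splits (same random seed) so every algorithm sees identical folds.
    cv = KFold(n_splits=10, shuffle=True, random_state=7)
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")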
Accuracy: the proportion of the total number of predictions that are correct; in other words,
how often the model correctly predicts defaulters and non-defaulters overall.
Accuracy calculation: Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision: the proportion of positive predictions that are actually correct (when the model
predicts default, how often is it correct?).
Precision is the ratio of correctly predicted positive observations to the total predicted
positive observations: Precision = TP / (TP + FP). The question that this metric answers is:
of all passengers labeled as survived, how many actually survived? High precision relates to a
low false positive rate. We obtained a precision of 0.788, which is pretty good.
Recall: the proportion of positive observed values that are correctly predicted (the
proportion of actual defaulters that the model will correctly predict): Recall = TP / (TP + FN).
F1 Score is the weighted average of Precision and Recall. Therefore, this score takes both
false positives and false negatives into account. Intuitively it is not as easy to understand as
accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class
distribution. Accuracy works best if false positives and false negatives have similar cost. If
the cost of false positives and false negatives are very different, it’s better to look at both
Precision and Recall.
General Formula: F-measure = 2 × (Precision × Recall) / (Precision + Recall)
F1-Score Formula: F1 = 2 × TP / (2 × TP + FP + FN)
Algorithm Explanation
sklearn:
In python, sklearn is a machine learning package which include a lot of ML
algorithms.
Here, we are using some of its modules like train_test_split,
DecisionTreeClassifier, LogisticRegression and accuracy_score.
NumPy:
It is a numeric Python module which provides fast math functions for
calculations.
It is used to read data into NumPy arrays and for manipulation purposes.
Pandas:
Used to read and write different files.
Data manipulation can be done easily with data frames.
Matplotlib:
Data visualization is a useful way to help identify patterns in the given
dataset.
Logistic Regression
It is a statistical method for analysing a data set in which there are one or more
independent variables that determine an outcome. The outcome is measured with a dichotomous
variable (one in which there are only two possible outcomes). The goal of logistic regression
is to find the best-fitting model to describe the relationship between the dichotomous
characteristic of interest (the dependent variable, also called the response or outcome
variable) and a set of independent (predictor or explanatory) variables. Logistic regression
is a machine learning classification algorithm that is used to predict the probability of a
categorical dependent variable. In logistic regression, the dependent variable is a binary
variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).
For a binary regression, factor level 1 of the dependent variable should represent the desired
outcome.
The independent variables should be independent of each other; that is, the model should have
little or no multicollinearity.
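A minimal sketch of fitting this classifier with scikit-learn follows; the synthetic data is
only a stand-in for the project's features.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic binary-labelled data as a stand-in for the passenger features.
    X, y = make_classification(n_samples=500, n_features=8, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # Predicted class labels and class-1 probabilities for the test set.
    y_pred = model.predict(X_test)
    print("accuracy:", accuracy_score(y_test, y_pred))
    print("first class-1 probabilities:", model.predict_proba(X_test)[:3, 1])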
Decision Tree
It is one of the most powerful and popular algorithms. The decision-tree algorithm falls under
the category of supervised learning algorithms. It works for both continuous and categorical
output variables. Assumptions of the decision tree:
At the beginning, the whole training set is considered as the root.
Feature values are preferred to be categorical; if the values are continuous, they are
discretized before building the model.
Records are distributed recursively on the basis of attribute values.
The attribute to place as the root or as an internal node of the tree is chosen using a
statistical approach.
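A short scikit-learn sketch of this classifier (synthetic stand-in data; the depth limit is an
assumed choice to curb overfitting):

    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic data as a stand-in for the passenger dataset.
    X, y = make_classification(n_samples=500, n_features=8, random_state=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

    # Limit the depth so the tree does not simply memorize the training set.
    tree = DecisionTreeClassifier(max_depth=5, random_state=1)
    tree.fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, tree.predict(X_test)))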
Random Forest
Random forests or random decision forests are an ensemble learning method for
classification, regression and other tasks that operates by constructing a multitude of
decision trees at training time and outputting the class that is the mode of the classes
(classification) or the mean prediction (regression) of the individual trees. Random decision
forests correct for decision trees' habit of overfitting to their training set. Random forest
is a type of supervised machine learning algorithm based on ensemble learning. Ensemble
learning is a type of learning where you join different types of algorithms, or the same
algorithm multiple times, to form a more powerful prediction model. The random forest
algorithm combines multiple algorithms of the same type, i.e. multiple decision trees,
resulting in a forest of trees, hence the name "Random Forest". The random forest algorithm
can be used for both regression and classification tasks.
The following are the basic steps involved in performing the random forest algorithm:
Pick N random records from the dataset.
Build a decision tree based on these N records.
Choose the number of trees you want in your algorithm and repeat steps 1 and 2.
In case of a regression problem, for a new record, each tree in the forest predicts a value
for Y (the output), and the final value is calculated by taking the average of all the values
predicted by all the trees in the forest. In case of a classification problem, each tree in
the forest predicts the category to which the new record belongs, and the new record is
finally assigned to the category that wins the majority vote.
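The steps above are what scikit-learn's RandomForestClassifier does internally; a hedged
sketch with synthetic stand-in data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic data as a stand-in for the passenger dataset.
    X, y = make_classification(n_samples=500, n_features=8, random_state=2)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

    # n_estimators is the number of trees (step 3 above); each tree is built on a bootstrap sample.
    forest = RandomForestClassifier(n_estimators=100, random_state=2)
    forest.fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, forest.predict(X_test)))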
Support Vector Machine (SVM)
A support vector machine is a classifier that categorizes the data set by setting an optimal
hyperplane between the data. This classifier was chosen because it is incredibly versatile in
the number of different kernel functions that can be applied, and the model can yield a high
predictability rate. Support Vector Machines are perhaps one of the most popular and
talked-about machine learning algorithms. They were extremely popular around the time they
were developed in the 1990s and continue to be a go-to method for a high-performing algorithm
with little tuning.
How to disentangle the many names used to refer to support vector machines.
The representation used by SVM when the model is actually stored on disk.
How a learned SVM model representation can be used to make predictions for new
data.
How to learn an SVM model from training data.
How to best prepare your data for the SVM algorithm.
Where you might look to get more information on SVM.
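A minimal sketch of training an SVM classifier with scikit-learn (synthetic stand-in data; the
RBF kernel and C value are assumed defaults, and other kernel functions can be swapped in):

    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Synthetic data as a stand-in for the passenger dataset.
    X, y = make_classification(n_samples=500, n_features=8, random_state=3)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

    # The RBF kernel is one of several kernel functions that can be applied here.
    svc = SVC(kernel="rbf", C=1.0)
    svc.fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, svc.predict(X_test)))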