Project Report

Uploaded by Mouhamadou DEME

Multilayer Perceptron

Copyright © 2017 Innodatatics Inc. All Rights Reserved


Machine Learning: Objective

Predict whether the annual income of an individual exceeds $50K/yr based on census data. The classification goal is to predict whether a person's income is over $50,000 a year (>50K) or not (<=50K).
Project Architecture / Project Flow

1. Pre-processing the data
2. EDA: Exploratory Data Analysis
3. Model building
4. Evaluate the model
5. Data visualizations
6. Deployment frame
Exploratory Data Analysis (EDA) and Feature Engineering

Data set details

1) The dataset has 45211 observations: 45211 rows and 17 columns.
2) There are no missing values in the data set.
3) The data set contains a mix of categorical and numeric values, so the categorical values need to be converted to numeric.
4) There are no duplicate values in the data set.
5) The target column is the dependent variable, with values yes or no.
6) The top 5 rows of the data set are shown below.
7) Following are the columns in the data set:

Dependent variable: 'Target'
Independent variables: 'age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome'
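The checks listed above (shape, missing values, duplicates, a peek at the top rows) can be sketched in pandas. This is a minimal illustration on a tiny made-up stand-in frame; the real project would load the full CSV instead.

```python
import pandas as pd

# Tiny stand-in frame; the project would load the real file, e.g.
# df = pd.read_csv("data.csv")  # hypothetical path
df = pd.DataFrame({
    "age":    [30, 45, 30],
    "job":    ["admin.", "technician", "admin."],
    "Target": ["no", "yes", "no"],
})

n_rows, n_cols = df.shape                 # observations and columns
n_missing = int(df.isnull().sum().sum())  # total missing values
n_dupes = int(df.duplicated().sum())      # fully duplicated rows

print(n_rows, n_cols, n_missing, n_dupes)
print(df.head())                          # top 5 rows of the data set
```

The same three calls (`shape`, `isnull`, `duplicated`) back the claims made in points 1, 2 and 4 above.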
Data set details
1) Age: continuous.
2) Workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
3) Fnlwgt: continuous.
4) Education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
5) Education-num: continuous.
6) Marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
7) Occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
8) Relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
9) Race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
10) Sex: Female, Male.
11) Capital-gain: continuous.
12) Capital-loss: continuous.
13) Hours-per-week: continuous.
14) Native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
15) Income_class: >50K, <=50K.
Data set details

The table above shows the data set.

The income_class column is the dependent variable, with values <=50K or >50K. The target value counts are (<=50K: 24720, >50K: 7841).
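The counts above imply a noticeable class imbalance. A quick arithmetic check (pure Python, using only the counts reported above) shows the baseline accuracy of always predicting the majority class:

```python
# Class counts reported for the income_class target
counts = {"<=50K": 24720, ">50K": 7841}

total = sum(counts.values())              # labelled rows in total
majority_share = counts["<=50K"] / total  # accuracy of always predicting <=50K

print(total, round(majority_share, 3))
```

Roughly 76% of rows are <=50K, so any model should be judged against that baseline, not against 50%.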
Data visualization:

1) The above plot is the box plot for the numeric features in the data set. The main advantage of a box plot is that it shows outliers. An outlier is a data point that differs significantly from other observations.
Data visualization:

1) The above plot shows the pairplot for the numeric features in the data set. A pairplot plots pairwise relationships in a dataset: each numeric variable is plotted against every other numeric variable, so each panel is a scatter plot of one pair of features.
Data visualization:

1) The above plot represents the correlation plot for the numeric features in the data set. A correlation matrix is a table showing correlation coefficients between variables; each cell shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.
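Computing such a correlation matrix takes one pandas call. The sketch below uses made-up illustrative columns, not the project's data; `balance` is constructed to be perfectly correlated with `age` so the coefficient of 1.0 is visible in the output.

```python
import pandas as pd

# Illustrative numeric columns (made-up values); the project would call
# df.corr() on its real numeric features.
df = pd.DataFrame({
    "age":      [25, 35, 45, 55],
    "balance":  [100, 200, 300, 400],  # perfectly linear in age -> corr 1.0
    "duration": [40, 10, 30, 20],
})

corr = df.corr()  # Pearson correlation coefficients, one cell per pair
print(corr.round(2))
```

The diagonal is always 1.0 (each variable is perfectly correlated with itself), which is why heat maps of this matrix have a bright diagonal.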

1) The diagram shows the heat map for the numeric features in the data set. Heatmaps provide a visual approach to understanding numeric values: data values are represented as colours on a map or diagram. A heat map uses colour the way a bar graph uses height and width: as a data visualization tool.
Data visualization:

We have a few features that are categorical, so we have to convert them to numeric: 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome'. Some of these variables need to be converted to dummy variables. Below are the count plots for the categorical features, from which we can get insights.
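The dummy-variable conversion mentioned above is typically done with `pd.get_dummies`. A minimal sketch on a hypothetical toy frame (the column names only illustrate the pattern):

```python
import pandas as pd

# Small illustrative frame with two categorical features and one numeric one
df = pd.DataFrame({
    "job":     ["admin.", "technician", "admin."],
    "marital": ["married", "single", "single"],
    "age":     [30, 45, 52],
})

# One-hot encode the categorical columns; numeric columns pass through
encoded = pd.get_dummies(df, columns=["job", "marital"])
print(list(encoded.columns))
```

Each category becomes its own 0/1 column (e.g. `job_admin.`, `marital_single`), which is what lets models that expect numeric input consume these features.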
Data visualization :
Model Building

Following are the models used for Model building in this


project:
1)Logistic regression.
2)Decision tree classifier.
3) Random forest classifier.
4) Extra tree classifier.
5) Support Vector Machine(SVM) Classifier.
6)Neural Networks
7)Bagging classifier method.
8)Catboost Classifier
9)XGB Classifier
Logistic regression:
1) Logistic regression, often referred to as the logit model, is a technique used to predict the probability associated with each dependent variable category.
2) The logistic regression model is a generalized form of the linear regression model. It is a very good discrimination tool.
3) Logistic regression measures the relationship between the dependent variable (our label, what we want to predict) and one or more independent variables (our features) by estimating probabilities using its underlying logistic function.
4) The probability in logistic regression is given by the logistic (sigmoid) function: p = 1 / (1 + e^-(b0 + b1*x1 + ... + bn*xn)).
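The logistic function above can be sketched in a few lines of Python (a generic illustration, not the project's code):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# z stands for the linear combination b0 + b1*x1 + ... + bn*xn
print(sigmoid(0.0))  # 0.5: the decision boundary
print(sigmoid(4.0))  # close to 1: confident positive prediction
```

Thresholding this probability at 0.5 turns the continuous output into the binary <=50K / >50K decision.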
Logistic regression:
Advantages of Logistic Regression:

1. Logistic regression performs well when the dataset is linearly separable.
2. Logistic regression is less prone to over-fitting, but it can overfit in high-dimensional datasets. You should consider regularization (L1 and L2) techniques to avoid over-fitting in these scenarios.
3. Logistic regression not only gives a measure of how relevant a predictor is (coefficient size), but also its direction of association (positive or negative).
4. Logistic regression is easy to implement and interpret, and very efficient to train.

Disadvantages of Logistic Regression:

1. The main limitation of logistic regression is the assumption of linearity between the dependent variable and the independent variables. In the real world, data is rarely linearly separable; most of the time it is a jumbled mess.
2. If the number of observations is less than the number of features, logistic regression should not be used, as it may overfit.
3. Logistic regression can only be used to predict discrete outcomes, so its dependent variable is restricted to a discrete set of classes. This restriction is problematic when the goal is to predict continuous values.
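The L1/L2 regularization advice from the advantages above can be sketched with scikit-learn. This uses synthetic stand-in data rather than the project's encoded census frame, so it only illustrates the API shape:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded census features
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# L2 penalty (the default); C is the inverse regularization strength,
# so smaller C means stronger regularization
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
clf.fit(X_tr, y_tr)

acc = clf.score(X_te, y_te)  # held-out accuracy
print(round(acc, 3))
```

Swapping `penalty="l1"` (with a compatible solver such as `liblinear`) gives the sparsity-inducing variant mentioned above.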
Decision tree classifier:
1) It is a greedy algorithm and a supervised classification model.
2) A decision tree has a tree-like structure that combines a root node, branch nodes and leaf nodes.
3) The root node is chosen with the help of entropy and information gain.
4) The outcomes are the leaf nodes.
5) Overfitting is the main problem with decision tree classifiers, because the greedy splitting keeps growing the tree to fit the training data. We can use pruning to remove branches without losing much information.

Root node: The root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
Leaf node: Leaf nodes are the final output nodes; the tree cannot be split further after reaching a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.
Branch/sub-tree: A tree formed by splitting the tree.
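The entropy and information-gain criterion used to pick splits can be illustrated in pure Python (a generic sketch, not the project's implementation):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, splits):
    """Entropy reduction from splitting `parent` into the `splits` subsets."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in splits)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5       # maximally impure node: entropy 1.0
pure_split = [["yes"] * 5, ["no"] * 5]  # a perfect split: gain 1.0
print(entropy(parent), information_gain(parent, pure_split))
```

The tree greedily picks, at each node, the feature split with the highest information gain, which is exactly why the root node ends up being the most informative feature overall.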
Decision tree classifier:
Advantages of the Decision Tree:

1) It is simple to understand, as it follows the same process a human follows while making any decision in real life.
2) It can be very useful for solving decision-related problems.
3) It helps to think about all the possible outcomes for a problem.
4) It requires less data cleaning compared to other algorithms.

Disadvantages of the Decision Tree:

1) The decision tree can contain many layers, which makes it complex.
2) It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
3) With more class labels, the computational complexity of the decision tree may increase.
Neural Networks:
1) A neural network is a network or circuit of neurons, or, in a modern sense, an artificial neural network composed of artificial neurons or nodes. Thus a neural network is either a biological neural network, made up of real biological neurons, or an artificial neural network, used for solving artificial intelligence (AI) problems.
2) Neural network architecture:
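A multilayer perceptron of the kind this project's title refers to can be sketched with scikit-learn's `MLPClassifier`. The data is a synthetic stand-in and the layer sizes are illustrative assumptions; the slides do not specify the actual architecture used:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the encoded census features
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)  # MLPs train better on scaled inputs

# One hidden layer of 16 units with ReLU activation (illustrative choice)
mlp = MLPClassifier(hidden_layer_sizes=(16,), activation="relu",
                    max_iter=1000, random_state=0)
mlp.fit(X, y)

acc = mlp.score(X, y)  # training accuracy, just to show the fit worked
print(round(acc, 3))
```

The `hidden_layer_sizes` tuple is the architecture knob: `(16,)` is one hidden layer, `(32, 16)` would be two, and so on.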
Neural Networks :
Different types of Activation
functions:
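The common activation functions can be sketched in pure Python (generic definitions, not tied to any particular library):

```python
import math

def sigmoid(z):
    """Range (0, 1): often used for probability outputs."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Range (-1, 1): a zero-centred variant of the sigmoid."""
    return math.tanh(z)

def relu(z):
    """Range [0, inf): a common default for hidden layers."""
    return max(0.0, z)

for f in (sigmoid, tanh, relu):
    print(f.__name__, f(-1.0), f(0.0), f(1.0))
```

The non-linearity is the point: without it, stacking layers would collapse into a single linear transformation.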
Model Deployment Using Flask

Flask File Creation:

• Import Flask from the flask module
• Create an instance of the Flask class
• We use @app.route('/') to execute the home function and @app.route('/predict', methods=['POST']) to execute the predict function
• index.html is used to render the results page
• After executing the whole deployment code, Flask serves a link like http://127.0.0.1:5000; open this link to get results.
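The deployment steps above can be sketched as a minimal Flask app. The form markup and the `age` field are hypothetical placeholders; the real app renders index.html and calls the trained model inside `predict`:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def home():
    # The real app would render index.html here
    return "<form action='/predict' method='post'><input name='age'></form>"

@app.route("/predict", methods=["POST"])
def predict():
    age = request.form.get("age", "0")  # a real app would call model.predict
    return f"received age={age}"

# Exercise the routes without starting a server; calling app.run()
# instead would serve on http://127.0.0.1:5000 by default
client = app.test_client()
print(client.post("/predict", data={"age": "40"}).data.decode())
```

`test_client()` is also how such an app would be unit-tested before deployment.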
Thank you
