
MACHINE LEARNING MODEL FOR PREDICTING GDP USING PYTHON

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
BY
 S.AFIFA (18F71A0502)
 S.UMME SALMA (18F71A0531)
 S.TAHASEEN (18F71A0530)
 S.ABEEDA (18F71A0501)
Under the esteemed guidance of
Mr. G. Mohammad Rafi, M.Tech.
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SRI SAI INSTITUTE OF TECHNOLOGY AND SCIENCE
PREDICTING GDP OF A COUNTRY

BY USING
MACHINE LEARNING
INDEX
 ABSTRACT
 INTRODUCTION
 GROSS DOMESTIC PRODUCT (GDP)
 EXISTING SYSTEM
 PROPOSED SYSTEM
 ADVANTAGES OF PROPOSED SYSTEM
 LOAD LIBRARIES
 SYSTEM REQUIREMENTS
 ARCHITECTURE
 DATA WRANGLING
 EDA
 MODELLING
 PERFORMANCE METRICS
 DEPLOYMENT
 FUTURE SCOPE
 CONCLUSION
ABSTRACT
This project gathers GDP data from multiple data sources and
applies various machine learning algorithms to this data to extract
important information. The resulting model can be used for estimating the GDP
of a country. In this project, linear regression techniques are used
to predict the GDP of a country.
INTRODUCTION
GROSS DOMESTIC PRODUCT (GDP)
GDP stands for Gross Domestic Product and represents the total monetary
value of all final goods and services produced (and sold on the market)
within a period of time (typically one year). GDP is the most commonly used
measure of economic activity.
Existing System :

 GDP impacts personal finance, investments, and job growth.
 Investors look at a nation's growth rate to decide if they should adjust their asset
allocation, as well as compare country growth rates to find their best
international opportunities.
 They purchase shares of companies that are in rapidly growing countries.
 Gross Domestic Product (GDP) is calculated using five elements: Consumption (C),
Investment (I), Government Spending (G), Exports (X) and Imports (M):
GDP = C + I + G + (X − M).
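
As a worked illustration of this formula, the sketch below computes GDP in Python from made-up figures (all numbers are hypothetical, in billions of dollars):

# Expenditure approach: GDP = C + I + G + (X - M)
consumption = 13000   # C: household spending (hypothetical)
investment = 3500     # I: business investment (hypothetical)
government = 3000     # G: government spending (hypothetical)
exports = 2500        # X (hypothetical)
imports = 3100        # M (hypothetical)
gdp = consumption + investment + government + (exports - imports)
print(f"GDP = {gdp} billion")  # GDP = 18900 billion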
Disadvantages of Existing System :

 The failure to account for or represent the degree of income inequality in society.
 The failure to indicate whether the nation's rate of growth is sustainable or not.
 The exclusion of non-market transactions.
PROPOSED SYSTEM
 In the proposed system we build a machine learning model by training it
with a training dataset; this model will help business firms
decide where to invest and buy shares in that location.

 A country's data, such as population, literacy, birth rate, etc., is given as input to
the machine learning model, and our output label is GDP.

 Most economists, politicians and businesses like to see GDP rising
steadily.
ADVANTAGES OF PROPOSED SYSTEM

 GDP enables policymakers and central banks to judge whether the
economy is contracting or expanding and promptly take necessary
action.

 GDP helps government decide how much it can spend on public
services and how much it needs to raise in taxes.
SYSTEM REQUIREMENTS
HARDWARE REQUIREMENTS:
✓ SYSTEM : INTEL I5
✓ HARD DISK : 500 GB
✓ RAM : 4 GB
✓ OPERATING SYSTEM : WINDOWS 10, 11 and above
SOFTWARE REQUIREMENTS:
✓ WEB FRAMEWORK : FLASK
✓ TECHNOLOGY : PYTHON, HTML, CSS, BOOTSTRAP
✓ LIBRARIES USED : Pandas, NumPy, Seaborn, Matplotlib, Scikit-Learn
✓ HOSTING ENVIRONMENT : HEROKU
ARCHITECTURE:
Understand the problem → Data Gathering → Data Wrangling → EDA → Modelling → Performance Metrics → Deployment
MODULES:

This project contains four modules:

 EXPLORATORY DATA ANALYSIS
 MODELLING
 PERFORMANCE METRICS
 DEPLOYMENT
DATA WRANGLING
&
EXPLORATORY DATA ANALYSIS
LOAD LIBRARIES
We install and import all of these libraries in Python.

 Pandas is an open-source library that provides high-performance data
manipulation in Python.

 It provides a huge set of important commands and features which are used to
easily analyze your data.

 It is used for working with datasets. It has functions for analyzing, cleaning, exploring, and
manipulating data.

 We get insights about a dataset using functions in pandas such as
head(), tail(), info(), describe(), sample().

 There are several useful functions for detecting, removing and replacing null values in a pandas
data frame, such as isnull(), fillna(), replace(), drop().
LOAD LIBRARIES

NumPy (Numerical Python) is an open-source core Python library for scientific
computations. It is a general-purpose array and matrix processing package.
NumPy is compatible with, and used by, many other popular Python packages,
including pandas and matplotlib. NumPy makes many mathematical
operations used widely in scientific computing fast and easy to use.
LOAD LIBRARIES
 Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
 Matplotlib is open source and we can use it freely.
 Pyplot is a Matplotlib module that provides functions that interact with the figure, decorate the
plot with labels, and create the plotting area in a figure. Different plots can be drawn using this
library:

 Bar Graph
 Pie Chart
 Box Plot
 Histogram
 Line Chart
 Scatter Plot
LOAD LIBRARIES

Seaborn is a library mostly used for statistical plotting in Python. It is built on top of
Matplotlib and provides beautiful default styles and color palettes to make statistical
plots more attractive. Seaborn plots are basically used for visualizing the relationship
between variables. Those variables can either be completely numerical or a category
like a group, class or division. The different categories of plots in Seaborn are:
 RELATIONAL PLOTS
 CATEGORICAL PLOTS
 DISTRIBUTION PLOTS
 REGRESSION PLOTS
 MATRIX PLOTS
DATA GATHERING

 This dataset was built focusing on the factors that affect a country's GDP
per capita. We try to make a model using the data of
227 countries from the database (reference: Kaggle).
The dataset used in our model includes the following attributes:

 Country name
 Region
 Population
 Area (sq. mi.)
 Population density (per sq. mi.)
 Coastline (ratio of the coast per area)
 Net migration
 Infant mortality (per 1000 births)
 Literacy (%)
 Phones (per 1000)
 Arable (%)
 Crops (%)
 Other (%)
 Climate
 Birth rate
 Death rate
 Agriculture
 Industry
 Service
 Label: GDP ($ per capita)
READING THE DATASET AND GETTING INSIGHTS ABOUT THE DATA

We get insights about the data using the following functions:

 df.shape -- for the shape of the data
 df.describe() -- for the distribution of data
 df.info() -- for the columns and their data types
 df.head() -- for the first 5 rows
 df.tail() -- for the last 5 rows
 df.isnull().sum() -- to check the null values in columns
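
A minimal sketch of this step, assuming the Kaggle file is saved as countries_of_the_world.csv (the filename is an assumption):

import pandas as pd

df = pd.read_csv("countries_of_the_world.csv")  # assumed filename
print(df.shape)           # (rows, columns) -- 227 countries expected
print(df.info())          # column names and their data types
print(df.describe())      # distribution of the numerical columns
print(df.head())          # first 5 rows
print(df.isnull().sum())  # null counts per column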
DATA WRANGLING
One of the first steps is to make sure that the dataset we are using is accurate. The dataset
should not have any missing values, and if it does have missing values, they should be
replaced by an appropriate value.

Handling Missing Values:

There are three ways to fill missing values:

 Mean: df['column_name'].fillna(df['column_name'].mean(), inplace = True)
 Median: df['column_name'].fillna(df['column_name'].median(), inplace = True)
 Mode: df['column_name'].fillna(df['column_name'].mode()[0], inplace = True)

(mean() and median() return scalars, so no indexing is needed; mode() returns a Series, so [0] selects the first mode.)

Outlier Treatment:
Outliers can be treated by capping values at chosen percentiles/quantiles, as sketched below.
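
A minimal sketch of quantile-based capping; the column choice is a hypothetical example:

col = "Phones (per 1000)"  # hypothetical column choice
# Cap outliers at the 1st and 99th percentiles (winsorizing).
lower = df[col].quantile(0.01)
upper = df[col].quantile(0.99)
df[col] = df[col].clip(lower=lower, upper=upper)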
EXPLORATORY DATA ANALYSIS
DATA VISUALIZATION
Data Visualization is the process of analyzing data in the form of graphs or maps, making it easier to
understand patterns in the data.

There are various types of visualizations:

✓ UNIVARIATE ANALYSIS
✓ BI-VARIATE ANALYSIS
✓ MULTI-VARIATE ANALYSIS
EXPLORATORY DATA ANALYSIS
We will use the Matplotlib and Seaborn libraries for the data visualization.
Some commonly used graphs are:

UNIVARIATE: Bar Plot, Hist Plot, Box Plot
BIVARIATE: Bar Plot, Scatter Plot, Hist Plot, Box Plot
MULTIVARIATE: Pair Plot, Heat Map
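
A minimal plotting sketch, assuming df is the loaded dataframe; the column names follow the attribute list above:

import matplotlib.pyplot as plt
import seaborn as sns

# Univariate: distribution of the target
sns.histplot(df["GDP ($ per capita)"])
plt.show()

# Bivariate: literacy vs. GDP per capita
sns.scatterplot(x="Literacy (%)", y="GDP ($ per capita)", data=df)
plt.show()

# Multivariate: pairwise relationships of a few numeric columns
sns.pairplot(df[["GDP ($ per capita)", "Literacy (%)", "Phones (per 1000)"]])
plt.show()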
UNIVARIATE NUMERICAL FEATURE ANALYSIS:
Three univariate model classes are considered: ARIMA models (after Box and Jenkins
[1976]), ARIMAX models (a combination of ARIMA terms and additional predictors), and an
"ordinary multiple" approach with lagged terms of the variables.

 Univariate analysis is the simplest form of analyzing data.
 "Uni" means one, so in other words the data has only one variable.

UNIVARIATE CATEGORICAL ANALYSIS
BI-VARIATE NUMERICAL FEATURE ANALYSIS:
Bivariate analysis is a statistical analysis where two variables are observed.
One variable here is dependent while the other is independent.
BI-VARIATE CATEGORICAL ANALYSIS
Multivariate Analysis :

The statistical study of data where multiple measurements are made on each
experimental unit and where the relationships among the multivariate measurements
and their structure are important.

CORRELATION BETWEEN TWO VARIABLES

The heatmap shows the correlation between all numerical columns.
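
A minimal sketch of such a heatmap, assuming df is the cleaned dataframe:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numeric columns, rendered as a heatmap.
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()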


MODELING
FEATURE SELECTION

After performing the data cleaning and visualizations, we implemented our
machine learning algorithms on the features of the dataset.
X = the set of input features from the dataset.
y = the output feature from the dataset.
df = dataset
For example:
X = df.drop(columns = ['GDP ($ per capita)', 'Country', 'Region'])
y = df['GDP ($ per capita)']
SCIKIT-LEARN LIBRARY

 Scikit-learn is an open-source machine learning Python package that offers
functionality supporting supervised and unsupervised learning.
 Additionally, it provides tools for model development, selection and evaluation,
as well as many other utilities including data pre-processing functionality.

 Supervised Learning :
Supervised learning is an approach to creating Artificial Intelligence (AI) where a computer
algorithm is trained on input data that has been labelled for a particular output.
 Unsupervised Learning :
Unsupervised learning is an approach to creating Artificial Intelligence (AI) where a
computer algorithm is trained on input data that has not been labelled for a particular output.
Methods in the Scikit-Learn Package:

fit_transform()
It joins the fit() and transform() methods for the transformation of the dataset. It is used on the training
data so that we can scale the training data and also learn the scaling parameters.

fit()
This method calculates the parameters μ (mean) and σ (standard deviation) and saves them as internal
objects.

transform()
Using those same saved parameters, this method transforms a particular dataset. It is used for pre-
processing before modeling.

predict()
Uses the learned weights on the test data to make predictions.
SPLITTING

 The next step is building the machine learning model.
 While building the machine learning model, first we need to split our dataset into 2 parts,
i.e. training data and test data.

The syntax for splitting is given below (stratify is omitted because the target, GDP, is continuous):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
SCALING:

 Scaling is a technique to standardize the independent features present
in the data in a fixed range.
 It is performed during data pre-processing to handle highly varying
units.

Techniques to perform scaling are:

 Standard Scaler
 Min-Max Scaler

 STANDARD SCALER :
A very effective technique which rescales a feature value so that it
has a distribution with 0 mean and 1 variance.
 MIN-MAX SCALER :
This technique rescales a feature or observation value to between 0 and
1. Scikit-learn provides the implementation of scaling in the preprocessing
package. We import MinMaxScaler or StandardScaler from the preprocessing
package to perform scaling.

Syntax :
from sklearn.preprocessing import StandardScaler
my_scalar = StandardScaler()
X_train_scaled = my_scalar.fit_transform(X_train)
X_test_scaled = my_scalar.transform(X_test)
Linear Regression:
Linear Regression is a machine learning algorithm based on supervised learning. It
performs a regression task. Regression models a target prediction value based on
independent variables.
Linear Regression Formula:

y = θ1 + θ2·x

While training the model we are given:

x: input training data
y: labels to data (supervised learning)
θ1: intercept
θ2: coefficient of x
Gradient Descent
To update the θ1 and θ2 values in order to reduce the cost function (minimizing the RMSE
value) and achieve the best-fit line, the model uses gradient descent. The idea is to
start with random θ1 and θ2 values and then iteratively update them until the
minimum cost is reached.

Syntax:
from sklearn.linear_model import LinearRegression
model_linear = LinearRegression()
model_linear.fit(X_train_scaled, y_train)
linear = model_linear.score(X_test_scaled, y_test)
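
For intuition, a minimal NumPy sketch of gradient descent for simple linear regression (this is not the scikit-learn implementation above; the learning rate and epoch count are arbitrary choices):

import numpy as np

def gradient_descent(x, y, lr=0.01, epochs=1000):
    # Fit y = theta1 + theta2 * x by minimizing the mean squared error.
    theta1, theta2 = 0.0, 0.0  # start from arbitrary values
    n = len(x)
    for _ in range(epochs):
        error = (theta1 + theta2 * x) - y
        theta1 -= lr * (2 / n) * error.sum()        # d(MSE)/d(theta1)
        theta2 -= lr * (2 / n) * (error * x).sum()  # d(MSE)/d(theta2)
    return theta1, theta2

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])  # roughly y = 1 + 2x
print(gradient_descent(x, y))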
SUPPORT VECTOR MACHINE (SVM):
 In machine learning, support vector machines (SVMs, also support vector
networks) are supervised learning models with associated learning
algorithms that analyze data for classification and regression analysis.
 A Support Vector Machine (SVM) is a discriminative model formally
defined by a separating hyperplane.

There are two types:

 Linear SVM – used when the data is linearly separable.
 Non-linear SVM – used when the data is not linearly separable;
a kernel converts the lower-dimensional data into a higher-dimensional space.

Syntax:
from sklearn import svm
model_svm = svm.SVR()
model_svm.fit(X_train_scaled, y_train)
svm_score = model_svm.score(X_test_scaled, y_test)  # renamed to avoid shadowing the imported svm module
Decision Tree Regression

 A decision tree builds regression or
classification models in the form of a tree
structure.
 It breaks down a dataset into smaller and
smaller subsets while at the same time an
associated decision tree is incrementally
developed.
 The final result is a tree with decision
nodes and leaf nodes.
 In a decision tree, each split minimizes the impurity.
 To calculate the impurity of nodes (for classification trees; regression trees typically minimize variance/MSE instead):
 Gini = 1 − Σ(Pi)²
 Entropy = −Σ Pi log₂ Pi
 Information Gain = parent node entropy − Σ weighted entropy of child nodes
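
A small computational sketch of these impurity formulas, taking class probabilities as input:

import numpy as np

def gini(p):
    # Gini impurity: 1 - sum(p_i^2)
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # Entropy: -sum(p_i * log2(p_i)), ignoring zero probabilities
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]))     # 0.5 (maximally impure two-class node)
print(entropy([0.5, 0.5]))  # 1.0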

SYNTAX:

from sklearn.tree import DecisionTreeRegressor

model_decision = DecisionTreeRegressor(random_state = 100,
                                       max_depth = 3,
                                       min_samples_leaf = 3)
model_decision.fit(X_train, y_train)
decisiontree = model_decision.score(X_test, y_test)
RANDOM FOREST ALGORITHM

 Random Forest is a supervised ensemble machine learning algorithm
used in both classification and regression problems.
 It contains various decision trees, and an average of their outputs is taken to
give the final output.
 As decision trees are prone to overfitting, random forest is useful in
reducing the effect of overfitting and hence gives a more accurate
output.
RANDOM FOREST REGRESSION FLOWCHART
SYNTAX:

from sklearn.ensemble import RandomForestRegressor

model_rfr = RandomForestRegressor(n_estimators = 50,
                                  max_depth = 6,
                                  min_weight_fraction_leaf = 0.05,
                                  max_features = 0.8,
                                  random_state = 42)
model_rfr.fit(X_train, y_train)
PERFORMANCE METRICS:
 Evaluation metrics are a measure of how good a model's
performance is and how well it approximates the relationship.

 There are three error metrics that are commonly used for
evaluating and reporting the performance of a regression model.

They are :

 MEAN SQUARED ERROR (MSE)
 MEAN ABSOLUTE ERROR (MAE)
 R2 SCORE
Metrics:

• Performance metrics are a part of every machine learning
pipeline.

• They tell you if you're making progress, and put a number on it.

• All machine learning models, whether linear regression or Random Forest, are
evaluated with such metrics.

mean_squared_log_error:

• Mean squared logarithmic error (MSLE) can be interpreted as a
measure of the ratio between the true and predicted values.

• Mean squared logarithmic error is, as the name suggests, a
variation of the mean squared error.

Mean Squared Error:
The MSE tells you how close a regression line is to a set of points. It does this by
taking the distances from the points to the regression line (these distances are the
"errors") and squaring them.

The squaring is necessary to remove any negative signs.

mse = mean_squared_error(y_test, y_predict)
Steps to find the MSE:

1. Find the equation for the regression line.

2. Insert the X values into the equation found in step 1 to get the respective predicted Y values.

3. Subtract the predicted Y values from the original Y values. The resulting values are the error terms
(each is the vertical distance of the given point from the regression line).

4. Square the errors found in step 3.

5. Sum the squared errors and divide by the total number of observations.


Mean Absolute Error:
Mean absolute error (MAE) is a loss function used for regression. Use MAE when you are doing
regression and don't want outliers to play a big role. The loss is the mean over the absolute
differences between true and predicted values; deviations in either direction from the true value
are treated the same way.

mae = mean_absolute_error(y_test, y_predict)

Formula:
Mean Absolute Error = (1/n) * Σ|yi − xi|
where,
• Σ: Greek symbol for summation
• yi: Actual value for the ith observation
• xi: Calculated value for the ith observation
• n: Total number of observations
Method 1: Using the Actual Formula
Mean Absolute Error (MAE) is calculated by taking the summation of the absolute differences
between the actual and calculated values of each observation over the entire array and then
dividing the sum obtained by the number of observations in the array.

Method 2: Using sklearn

The sklearn.metrics module of Python contains functions for calculating errors for different purposes. It
provides a method named mean_absolute_error() to calculate the mean absolute error of the given
arrays.

Syntax:
mean_absolute_error(actual, calculated)
R2 score:
The coefficient of determination, also called the R² score, is used to evaluate the performance of a linear
regression model. It is the amount of the variation in the output dependent attribute which is predictable
from the input independent variable(s). It is used to check how well observed results are reproduced by
the model, depending on the ratio of the total deviation of results explained by the model.

Mathematical Formula :
R² = 1 − SSres / SStot

Where,

SSres is the sum of squares of the residual errors.

SStot is the total sum of squares (the total variation of the data).
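
A minimal sketch computing these metrics with scikit-learn, assuming y_test and y_predict come from one of the models above:

from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_squared_log_error, r2_score)

mse = mean_squared_error(y_test, y_predict)
mae = mean_absolute_error(y_test, y_predict)
r2 = r2_score(y_test, y_predict)
# MSLE requires non-negative values, so this assumes the predictions are non-negative.
msle = mean_squared_log_error(y_test, y_predict)
print(f"MSE: {mse:.2f}  MAE: {mae:.2f}  MSLE: {msle:.4f}  R2: {r2:.3f}")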


DEPLOYMENT
We developed a web application with a simple user interface that lets the end user use
our model to predict the GDP of a country.
TOOLS USED

Frontend Tools: HTML, CSS & BOOTSTRAP.

Backend Tools: Python; Web Framework: Flask.

Lastly, we wanted to predict the results for the values obtained from the
user, for which we made use of the FLASK framework to integrate the backend
and the frontend.
DEPLOYMENT
HTML:

✓ HTML stands for Hyper Text Markup Language.


✓ HTML is the standard markup language for creating Web pages.
✓ HTML describes the structure of a Web page.
✓ HTML consists of a series of elements.
✓ HTML elements tell the browser how to display the content.
✓ HTML elements label pieces of content such as "this is a
heading", "this is a paragraph", "this is a link", etc.
DEPLOYMENT
CSS:

WHAT IS CSS ?
✓ CSS stands for Cascading Style Sheets.
✓ CSS describes how HTML elements are to be displayed on screen, paper, or in
other media.
✓ CSS saves a lot of work. It can control the layout of multiple web pages all at once.
✓ External style sheets are stored in CSS files.
DEPLOYMENT
BOOTSTRAP:
Bootstrap is a free and open-source tool collection for creating responsive
websites and web applications. It is the most popular HTML, CSS, and
JavaScript framework for developing responsive, mobile-first websites.

Why do we use Bootstrap ?

• It is a faster and easier way to do web development.
• It creates platform-independent web pages.
• It creates responsive web pages.
• It designs responsive web pages for mobile devices too.
DEPLOYMENT
Flask – (Creating a first simple application):
Flask is a Python framework that allows us to build web applications. A web
application framework is a collection of modules and libraries that helps
the developer write applications without writing low-level code such
as protocols, thread management, etc. Flask is based on the WSGI (Web Server
Gateway Interface) toolkit and the Jinja2 template engine. Python 3 is
required for current versions of Flask.
DEPLOYMENT
SET UP PROJECT

Python projects live in virtual environments, so we need a virtual environment tool.

✓ To set up the project, first create the project directory, then in the terminal go to that directory and use the
following command to install virtualenv:

pip install virtualenv

✓ After installing virtualenv, create the virtual environment with the command below:

virtualenv my_venv (name of the virtual environment)

✓ Now we should activate the virtual environment; to activate, use the following command:

my_venv/scripts/activate

✓ After activating the virtual environment we should install Flask, so we can import the Flask module in Python:

pip install flask
DEPLOYMENT
 Now create an app object that hosts the application:
app = Flask(__name__)
✓ Then you need a route that calls a Python function. A route maps what you type
in the browser (the URL) to a Python function:
@app.route('/')
def index():
✓ The function should return something to the web browser, so use the return
statement.
✓ To run the application, use the code below:
app.run(debug=True)
✓ Now run the Python file in the terminal and you will get the URL.
✓ Enter the URL in your web browser to see the website. The pieces are put together in the sketch below.
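
A minimal runnable version of the application described above (the returned string is a placeholder for the real page):

from flask import Flask

app = Flask(__name__)

@app.route('/')
def index():
    # Placeholder response; the real app renders an HTML template.
    return "GDP Prediction App"

if __name__ == '__main__':
    app.run(debug=True)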


DEPLOYMENT
✓ It is possible to return the output of a function bound to a certain URL in the form of HTML.

✓ Generating HTML content from Python code is cumbersome.

✓ This is where one can take advantage of the Jinja2 template engine, on which Flask is based. Instead of
returning hardcoded HTML from the function, an HTML file can be rendered by the render_template()
function.

[Flow: app.py (Python/Flask) receives the input data and passes it to the Jinja template (index.html), which renders the page.]
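
A minimal render_template() sketch, assuming a templates/index.html file exists (the template name and the title variable are illustrative assumptions):

from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def index():
    # Renders templates/index.html; 'title' is available in the template as {{ title }}.
    return render_template('index.html', title='GDP Prediction')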
DEPLOYMENT
✓ A web application often requires static files such as a JavaScript file or
a CSS file supporting the display of a web page.
✓ Usually, the web server is configured to serve them for you, but
during development these files are served from the static folder in your
package, or next to your module, and they will be available at /static on the
application.
DEPLOYMENT
 To deploy the project to the CLOUD, we are using the HEROKU platform.
 Heroku is a platform as a service (PaaS) that enables developers to build, run, and operate
applications entirely in the cloud.
 To deploy the project using Heroku we need to follow these steps:
 In the terminal, go to the project location and install gunicorn, which is a Python
WSGI HTTP server: pip install gunicorn
 Now go to your folder and create a file named Procfile without any extension (see the sketch below).
 In the terminal use this command: pip freeze > requirements.txt
 Install Git and the Heroku CLI on your system.
 Now initialize git in the terminal. To initialize, use git init
 To add files, use git add .
 Commit with git commit -m "msg"
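
A typical Procfile for a Flask app served by gunicorn, assuming the Flask object is named app inside app.py (both names are assumptions):

web: gunicorn app:app

The requirements.txt generated by pip freeze tells Heroku which packages to install.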
DEPLOYMENT
 Log in to Heroku: use heroku login

 Now create the application: heroku create AppName

 Add a remote to your local repository:

heroku git:remote -a AppName

 Push the files to Heroku:

git push heroku master

Now the application is deployed to the cloud.


WELCOME PAGE
DEPLOYMENT

Form page for giving the input values


 After the user enters the values, we get the results of the model as
shown below.
FUTURE SCOPE

This application can be implemented as a mobile app and made
available free for businesses to download, which they can use to
predict the GDP of a country where they want to buy shares.
CONCLUSION:
 We made this project to bring business improvements by using
current technology instead of relying on old methods.

 Using machine learning algorithms, we trained the model with the
following 4 algorithms: Linear Regression, Decision Tree,
SVM & Random Forest. We got the highest accuracy of 83% with
the Random Forest model, so we selected it as the best model
for future prediction.
THANKS
FROM

 S.AFIFA
 S.ABEEDA
 S.TAHASEEN
 S.UMME SALMA
