Presentation 4

Uploaded by

afifashaik169
PREDICTING GDP ACROSS THE WORLD USING MACHINE LEARNING
MACHINE LEARNING MODEL FOR CROP
RECOMMENDATION USING PYTHON
BY

 S.AFIFA 18F71A0502
 S.TAHASEEN 18F71A0530
 S.ABEEDA 18F71A0501
 S.UMMESALMA 18F71A0531

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SRI SAI INSTITUTE OF TECHNOLOGY AND SCIENCE
INDEX
 ABSTRACT
 INTRODUCTION
 GROSS DOMESTIC PRODUCT (GDP)
 DISADVANTAGES OF GDP
 LOAD LIBRARIES
 DISADVANTAGES OF LOAD LIBRARIES
 SYSTEM REQUIREMENTS
 ARCHITECTURE
 DATA WRANGLING
 EDA
 MODELLING
 PERFORMANCE TUNING
 DEPLOYMENT
 FUTURE SCOPE
 CONCLUSION
ABSTRACT

The idea of this project is to gather GDP data from multiple data sources and to
apply various machine learning algorithms to this data to extract important information.
The resulting model can be used to estimate the GDP of a country.
In this project the Random Forest Regression technique is used to predict the GDP of a
country.
Random Forest Regression performs marginally better than the other models considered.
It uses variables such as literacy (%), net migration, infant mortality (per 1000 births),
phones (per 1000), arable (%), crops (%), other (%), birth rate, death rate, agriculture,
industry and service.
INTRODUCTION
GROSS DOMESTIC PRODUCT (GDP)
GDP stands for Gross Domestic Product and represents the total monetary
value of all final goods and services produced (and sold on the market)
within a country during a period of time (typically 1 year). GDP is the most
commonly used measure of economic activity.
DISADVANTAGES OF GDP:
•GDP does not incorporate any measures of welfare.

•GDP only includes market transactions.

•GDP does not describe income distribution.

•GDP does not describe what is being produced.

•GDP ignores externalities.

•Alternative measures such as the Social Progress Index try to address these gaps.


LOAD LIBRARIES
DISADVANTAGES OF LOAD LIBRARIES:

 Public libraries have operating hours; if an individual


arrives too late, she won't be able to access the library's
resources.

 People are restricted too by the amount of time they can spend
with a resource, since each must be returned to the library within
a set period of time.
SOFTWARE REQUIREMENTS

Python technology stack and system requirements:

• Python
• Pandas
• Seaborn
• Sklearn
• Flask
ARCHITECTURE:
NumPy:

NumPy is a library for the Python programming language, adding support for
large, multi-dimensional arrays and matrices, along with a large collection of
high-level mathematical functions to operate on these arrays.
Pandas:
 Pandas is an open-source library that is made mainly for working with
relational or labeled data both easily and intuitively. It provides various data
structures and operations for manipulating numerical data and time series.
 This library is built on top of the NumPy library. Pandas is fast and it has high
performance & productivity for users.
Seaborn:
 Seaborn is an open-source Python library built on top of matplotlib.
 It is used for data visualization and exploratory data analysis.
Seaborn works easily with data frames and the Pandas library.
The graphs created can also be customized easily.
Matplotlib:
 Matplotlib is a plotting library for the Python programming language and its
numerical mathematics extension NumPy. It provides an object-oriented API for
embedding plots into applications using general-purpose GUI toolkits like Tkinter,
wxPython, Qt, or GTK.

 There is also a procedural "pylab" interface based on a state machine (like
OpenGL), designed to closely resemble that of MATLAB, though its use is
discouraged.
Sklearn:
 Scikit-learn (formerly scikits.learn and also known as sklearn) is a
free software machine learning library for the Python programming language.

 It features various classification, regression and clustering algorithms including
support-vector machines, random forests, gradient boosting, k-means and
DBSCAN, and is designed to interoperate with the Python numerical and
scientific libraries NumPy and SciPy.
Bootstrap:
• Bootstrap is a free and open-source tool collection for creating
responsive websites and web applications.

• It is the most popular HTML, CSS and JavaScript framework for
developing responsive, mobile-first websites.

• It solves many problems that web developers once faced, one of which is the
cross-browser compatibility issue.


Flask:

• Flask is a web application framework written in Python.

• It is developed by Armin Ronacher, who leads an international group of Python
enthusiasts named Pocoo.

• Flask is a microframework: it does not bundle components such as a database
layer or form validation, but it can serve APIs and render dynamic HTML pages
through Jinja2 templates.


MODULES
• Data preprocessing:

The training dataset consists of various attributes. This dataset is processed,
unnecessary attributes are removed, and the resulting dataset keeps only the
relevant attributes.
Train-test split in machine learning

• The train-test split is used to estimate the performance of machine
learning algorithms that are applicable for prediction-based
algorithms/applications.

• The data is divided into a training set, used to fit the model, and a test
set, used only to evaluate it.

• This method is a fast and easy procedure to perform, so that we
can compare the results of our own machine learning models on data
they have not seen.
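
As a minimal sketch of the split described above, the example below uses scikit-learn's train_test_split on a small made-up array. The toy data, the 80/20 ratio and the random_state value are assumptions for illustration, not values taken from the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

The model is fitted on X_train and y_train only; X_test and y_test are kept aside to estimate how it behaves on unseen data.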
Linear Models

• The term linear model implies that the model is specified as a linear
combination of features.

• Based on training data, the learning process computes one weight for each
feature to form a model that can predict or estimate the target value.


Decision Tree Regression

• A decision tree builds regression or classification models in the form of a tree
structure.

• It breaks down a data set into smaller and smaller subsets while at the same
time an associated decision tree is incrementally developed.

• The final result is a tree with decision nodes and leaf nodes.
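
The following sketch shows a decision tree splitting a data set into subsets and predicting from its leaf nodes, using scikit-learn's DecisionTreeRegressor. The six data points and the max_depth value are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 1-D data: the target jumps from 5 to 20 between x=3 and x=10.
X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([5.0, 5.0, 5.0, 20.0, 20.0, 20.0])

tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, y)

# Each prediction is the mean target value of the leaf the sample falls into.
pred = tree.predict(np.array([[2.5], [11.5]]))
print(pred)
print("number of leaf nodes:", tree.get_n_leaves())
```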
Ensemble
• An ensemble is a machine learning model that combines the predictions from
two or more models.

• The models that contribute to the ensemble, referred to as ensemble
members, may be the same type or different types and may or may not be
trained on the same training data.
Random Forest Regression:

• Random forest regression is a supervised learning algorithm that uses the
ensemble learning method for regression.

• The ensemble learning method is a technique that combines predictions from
multiple machine learning algorithms to make a more accurate prediction
than a single model.
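
A minimal sketch of the ensemble idea, assuming scikit-learn's RandomForestRegressor: the forest's prediction is the average of the predictions of its individual trees. The synthetic data and the choice of 50 trees are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical synthetic data with three features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

# Compare the forest output with the mean of its member trees.
x_new = X[:5]
tree_preds = np.stack([t.predict(x_new) for t in forest.estimators_])
forest_pred = forest.predict(x_new)

print(forest_pred)
print(tree_preds.mean(axis=0))
```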
Metrics:

• Performance metrics are a part of every machine learning pipeline.

• They tell you if you’re making progress, and put a number on it.

• All machine learning models, whether Random Forest or any other, need a
metric to judge their performance.


MEAN_SQUARED_ERROR:

• The mean squared error (MSE) is perhaps the simplest and most common loss
function.
• It is often taught in introductory machine learning courses.
• To calculate the MSE, you take the differences between your model's
predictions and the ground truth, square them, and average them over the
whole dataset.
mean_squared_log_error

• Mean squared logarithmic error (MSLE) can be interpreted as a
measure of the ratio between the true and predicted values.

• Mean squared logarithmic error is, as the name suggests, a variation
of the mean squared error.
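
As a sketch of how MSLE relates to MSE, the example below computes it by hand as the mean squared difference of log(1 + value) and compares the result with scikit-learn's mean_squared_log_error. The four sample values are made up.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Hypothetical true and predicted values (must be non-negative for MSLE).
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# MSLE = mean of (log(1 + y_true) - log(1 + y_pred))^2
manual = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)
library = mean_squared_log_error(y_true, y_pred)

print(manual, library)
```

Because the difference of two logarithms is the logarithm of a ratio, the metric depends on the ratio between true and predicted values rather than on their absolute difference.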
overview
Country:

• A country is a distinct territorial body , a state , nation or other political


entity.

• It may be a sovereign state or part of a larger state.

Region:
• A region is an area of land that has common features .

• A region can be defined by natural or artificial features.


Population:
• The whole number of people or inhabitants in a country or region.
• The total of individuals occupying an area or making up a whole

Area:
• GDP estimates the value of the goods and services produced in an area.
• It can be used to compare the size and growth of country economies across the
nation.

Pop density:
• Population density is the number of people per km² of land area.
• To allow comparisons between countries and over time, GDP per capita is
adjusted for price differences between countries and adjusted for inflation; it is
measured in international-$.
• Per capita gross domestic product (GDP) measures a country's economic
output per person.


Coastline(coast/area ratio):
• It contributes to nearly 4% of the total GDP.
The people in the coastal areas rely on the coastal economy as it provides
them with their basic livelihood.
Net migration:
• Net migration is the difference between the number of people
moving into an area (a country, state, or county, for example) and
the number moving out.
• Between 2010 and 2019, more than 7.6 million more people moved into
the United States than left.
Infant mortality:
• Infant mortality is the death of an infant before his or her first
birthday.
• The infant mortality rate is the number of infant deaths for every 1,000
live births.
GDP($per capita):

• Per capita gross domestic product (GDP) is a financial metric that breaks down a
country's economic output per person and is calculated by dividing the GDP of a nation
by its population.
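
The division described above can be sketched in a few lines. The function name and the figures are hypothetical and only illustrate the arithmetic.

```python
def gdp_per_capita(gdp: float, population: int) -> float:
    """Return GDP divided by population (same currency unit as gdp)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return gdp / population


# Hypothetical country: GDP of 1 billion dollars and 2 million inhabitants.
value = gdp_per_capita(1_000_000_000, 2_000_000)
print(value)
```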
Literacy:
• Countries with a high literacy rate usually have a high GDP per capita.
• Nations with low GDP frequently have lower literacy rates since the people in that country have
less access to education, and children often have to work to help support the family.
Phones(per 1000):
• Number of mobile phone subscriptions, measured per 100 people versus gross domestic
product (GDP) per capita, measured in constant international-$.
Arable(%):
• Arable land (hectares per person) in India was reported at 0.11564 in 2018, according to the World Bank collection of development indicators, compiled from officially recognized sources.
Crop(%):

• The share of agriculture in GDP increased to 19.9 per cent in 2020-21 from 17.8 per cent in 2019-20.

• The last time the contribution of the agriculture sector in GDP was at 20 per cent was in 2003-04.
other(%):
• The services sector accounts for 53.89% of total India's GVA of 179.15 lakh crore Indian
rupees.
• With GVA of Rs. 46.44 lakh crore, the Industry sector contributes 25.92%
Climate:
• Over the past 10 years, storms, wildfires, and floods alone have caused losses of
around 0.3% of GDP per year globally, according to insurance firm Swiss Re.
• For most countries, exposure to, and costs from climate change are already increasing.
Birthrate:
• There is generally an inverse correlation between income and the total fertility rate within
and between nations.
• The higher the degree of education and GDP per capita of a human population, subpopulation
or social stratum, the fewer children are born in any developed country.
Deathrate:
• In 2021, the crude death rate for the world is 7.64 deaths per thousand population.
• The crude death rate and birth rate of the world have been declining at a moderating rate since
1950.
Agriculture:

The share of agriculture in GDP increased to 19.9 per cent in 2020-21 from 17.8 per cent in
2019-20.

Industry:

The services sector is the largest sector of India. Gross Value Added (GVA) at current prices for the
services sector is estimated at 96.54 lakh crore INR in 2020-21.

Service:

The services sector accounts for 53.89% of total India's GVA of 179.15 lakh crore Indian rupees.
With GVA of Rs. 46.44 lakh crore, the Industry sector contributes 25.92%. While Agriculture and
allied sector share 20.19%.
Data Preprocessing:
o It is a technique that is used to convert the raw data into a clean dataset.

Data preparation - fill in missing data: We noticed that there are some missing
data in the table. For simplicity, we fill the missing data using the median of the
region that a country belongs to, as countries that are close geographically are often
similar in many ways. For example, we can check the region median of 'GDP ($ per capita)',
'Literacy (%)' and 'Agriculture'. Note that for 'climate' we use the mode instead of the
median, as 'climate' appears to be a categorical feature here.
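
A minimal pandas sketch of this imputation is shown below. The tiny table, the 'Region' column name and the region labels are assumptions for illustration; only the idea (region median for numeric columns, region mode for the categorical climate column) comes from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of the country table.
df = pd.DataFrame({
    "Region": ["ASIA", "ASIA", "ASIA", "EUROPE", "EUROPE", "EUROPE"],
    "GDP ($ per capita)": [1000.0, np.nan, 3000.0, 20000.0, 30000.0, np.nan],
    "Climate": [2.0, 2.0, np.nan, 3.0, np.nan, 3.0],
})

# Numeric column: fill gaps with the median of the country's region.
col = "GDP ($ per capita)"
df[col] = df[col].fillna(df.groupby("Region")[col].transform("median"))

# Categorical column: fill gaps with the most frequent value in the region.
region_mode = df.groupby("Region")["Climate"].transform(lambda s: s.mode().iloc[0])
df["Climate"] = df["Climate"].fillna(region_mode)

print(df)
```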
EXPLORATORY DATA ANALYSIS
EDA is an approach to analyzing the data using visual techniques.

There are various types of EDA

 UNIVARIATE ANALYSIS
 BI-VARIATE ANALYSIS
 MULTI-VARIATE ANALYSIS
EDA
We will use the Matplotlib and Seaborn libraries for the data visualization.
Some commonly used graphs are:

UNIVARIATE: Bar Plot, Box Plot, Hist Plot
BIVARIATE: Bar Plot, Hist Plot, Scatter Plot, Box Plot
MULTIVARIATE: Pair Plot, Heat Map
UNIVARIATE:
Three univariate model classes are considered: ARIMA-models (after BOX AND
JENKINS [1976]), ARIMAX-models (combination of ARIMA-terms and additional
predictors) and an “ordinary multiple” approach with lagged terms of the variables.
 Univariate analysis is the simplest form of analyzing data.
UNIVARIATE ANALYSIS
 Uni means one, so in other words the data has only one variable.
BI-VARIATE ANALYSIS:
Bivariate analysis is one of the statistical analysis where two variables are observed.
One variable here is dependent while the other is independent.
MULTIVARIATE:
Multivariate analysis is defined as: The statistical study of data where multiple
measurements are made on each experimental unit and where the relationships
among multivariate measurements and their structure are important.
MODELING
Linear Regression:
Linear Regression is a supervised machine learning model that finds the best-fit
straight line between the independent and dependent variables, i.e. it finds the linear
relationship between them.

Linear Regression is of two types: Simple and Multiple.

Simple Linear Regression:
Only one independent variable is present and the model has
to find its linear relationship with the dependent variable.

Multiple Linear Regression:
There is more than one independent variable for the model to relate to the
dependent variable.

Equation of Simple Linear Regression: $y = b_0 + b_1 x$, where $b_0$ is the intercept, $b_1$ is the
coefficient or slope, $x$ is the independent variable and $y$ is the dependent variable.

Equation of Multiple Linear Regression: $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$, where $b_0$ is the
intercept, $b_1, b_2, \dots, b_n$ are the coefficients or slopes of the independent variables
$x_1, x_2, \dots, x_n$ and $y$ is the dependent variable.
A Linear Regression model’s main aim is to find the best-fit line and the optimal values of the intercept
and coefficients such that the error is minimized.
Error is the difference between the actual value and the predicted value, and the goal is to reduce this difference.
Mathematical Approach:
Residual/Error = Actual value – Predicted value
Sum of Residuals/Errors = Sum(Actual – Predicted values)
Sum of Squared Residuals/Errors = Sum((Actual – Predicted values)^2)
i.e. $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
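
As a small sketch of the fitting step, the example below fits scikit-learn's LinearRegression to made-up points that lie exactly on the line y = 1 + 2x, then reads back the intercept, the slope and the sum of squared residuals. The data are an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on y = 1 + 2x.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression().fit(X, y)
b0 = model.intercept_   # intercept
b1 = model.coef_[0]     # slope of the single feature

# Residual = actual value - predicted value.
residuals = y - model.predict(X)
sse = np.sum(residuals ** 2)

print(b0, b1, sse)
```

With noisy real data the residuals are not zero; the fitted line is the one that makes their squared sum as small as possible.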
PERFORMANCE TUNING
METRICS:
 Evaluation metrics are a measure of how well a model
performs and how well it approximates the relationship.

 There are three error metrics that are commonly used for
evaluating and reporting the performance of a regression model.
They are:
 Mean Squared Error (MSE)
 Mean Absolute Error (MAE)
 Root Mean Squared Error (RMSE)
Mean Squared Error:
The MSE tells you how close a regression line is to a set of points. It
does this by taking the distances from the points to the regression line (these
distances are the “errors”) and squaring them.
The squaring is necessary to remove any negative signs.

mse = mean_squared_error(y_test,y_predict)

Steps to find the MSE:

1. Find the equation for the regression line.

2. Insert the X values in the equation found in step 1 in order to get the respective predicted Y values.

3. Now subtract the predicted Y values from the original Y values. The values found are the error terms, also known as the vertical distances of the given points from the regression line.

4. Square the errors found in step 3.

5. Sum up all the squares.

6. Divide the value found in step 5 by the total number of observations.
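
The calculation can be sketched with NumPy and checked against scikit-learn's mean_squared_error. The four actual and predicted values are made up; in the project they would come from the test split and the fitted model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values.
y_test = np.array([3.0, -0.5, 2.0, 7.0])
y_predict = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_test - y_predict                # subtract predicted from actual
squared = errors ** 2                      # square each error
mse_manual = squared.sum() / len(y_test)   # sum the squares, divide by n

mse = mean_squared_error(y_test, y_predict)
print(mse_manual, mse)
```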


Mean Absolute Error:
Mean absolute error (MAE) is a loss function used for regression. Use MAE when you are doing regression
and don't want outliers to play a big role. The loss is the mean over the absolute differences between true
and predicted values, deviations in either direction from the true value are treated the same way.

mae= mean_absolute_error(y_test,y_predict)
Formula:
Mean Absolute Error = (1/n) * ∑|yi – xi|
where,
•Σ: Greek symbol for summation
•yi: Actual value for the ith observation
•xi: Calculated value for the ith observation
•n: Total number of observations
Method 1: Using Actual Formulae
Mean Absolute Error (MAE) is calculated by taking the summation of the absolute difference
between the actual and calculated values of each observation over the entire array and then
dividing the sum obtained by the number of observations in the array.

Method 2: Using sklearn


sklearn.metrics module of python contains functions for calculating errors for different purposes. It
provides a method named mean_absolute_error() to calculate the mean absolute error of the given
arrays.

Syntax:

mean_absolute_error (actual,calculated)
Where:

•actual- Array of actual values as first argument


•calculated – Array of predicted/calculated values as
second argument
It will return the mean absolute error of the given arrays .
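
Both methods can be sketched side by side. The arrays below are made-up values for illustration; the variable names actual and calculated follow the syntax shown above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and calculated values.
actual = np.array([3.0, -0.5, 2.0, 7.0])
calculated = np.array([2.5, 0.0, 2.0, 8.0])

# Method 1: apply the formula (1/n) * sum(|yi - xi|) directly.
mae_manual = np.sum(np.abs(actual - calculated)) / len(actual)

# Method 2: use sklearn.
mae = mean_absolute_error(actual, calculated)

print(mae_manual, mae)
```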
R2 score:
The coefficient of determination, also called the R² score, is used to evaluate the performance of a
regression model. It is the proportion of the variation in the dependent (output) attribute that is
predictable from the independent (input) variable(s). It is used to check how well observed results are
reproduced by the model, based on the proportion of the total variation of the results that the model explains.

Mathematical Formula:

$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$

Where,

$SS_{res}$ is the sum of squares of the residual errors.

$SS_{tot}$ is the total sum of squares, i.e. the sum of squared deviations of the actual values from their mean.
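
A short sketch of the formula, computed by hand and compared with scikit-learn's r2_score. The sample values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```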


Decision Tree:
As we know the target is not linear in many features, it is worth
trying some nonlinear models, for example the Decision Tree model.

 A decision tree is a flowchart-like structure in which
each internal node represents a test on a feature (e.g.
whether a coin flip comes up heads or tails).
 Each leaf node represents a class label (the decision taken after
computing all features).
K Nearest Neighbors:
The K-Nearest Neighbors (KNN) algorithm is a simple, easy-to-implement
supervised machine learning algorithm that can be used to solve both
classification and regression problems.

 K-nearest neighbours can be illustrated on sample random data using the
sklearn library.

Pre-requisites: NumPy, Pandas, Matplotlib, sklearn. We’ve been given a
random data set with one column holding the target classes. We’ll try to use KNN to
create a model that directly predicts a class for a new data point based on
the features.
Choosing a K Value:
Let’s go ahead and use the elbow method to pick a good K Value
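
One way to sketch this search is to fit a model for each candidate K and record the test error. The example below is an assumption-laden illustration: it uses KNeighborsRegressor on synthetic data, a range of K from 1 to 20, and mean squared error as the error measure.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical noisy 1-D data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Record the test error for each candidate K.
k_values = list(range(1, 21))
errors = []
for k in k_values:
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    errors.append(mean_squared_error(y_test, knn.predict(X_test)))

# Simple numeric stand-in for reading the elbow off a plot.
best_k = k_values[int(np.argmin(errors))]
print(best_k)
```

Plotting errors against k_values and picking the point where the curve stops improving noticeably is the visual form of the elbow method; taking the minimum is only a shortcut.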
K Nearest Neighbors Regression:
 K Nearest Neighbors Regression first stores the training examples.

 During prediction, when it encounters a new instance ( or test example ) to predict, it finds
the K number of training instances nearest to this new instance

 Then predicts the target value for this instance by calculating the mean of the target values of
these nearest neighbors.
Pseudocode:

1. Store all training examples.
2. Repeat steps 3, 4, and 5 for each test example.
3. Find the K training examples nearest to the
current test example.
4. y_pred for current test example = mean of the true target
values of these K neighbors.
5. Go to step 2.
knn.fit(X_train, y_train)
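
The pseudocode can be written out from scratch with NumPy as a sketch. The function name knn_regress, the Euclidean distance and the toy data are assumptions for illustration; in practice sklearn's KNeighborsRegressor does the same job.

```python
import numpy as np


def knn_regress(X_train, y_train, X_test, k):
    """Predict each test target as the mean target of its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    preds = []
    for x in np.asarray(X_test, dtype=float):
        # Euclidean distance from this test example to every training example.
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dists)[:k]
        preds.append(y_train[nearest].mean())
    return np.array(preds)


# Hypothetical training data with one feature.
X_train = [[1.0], [2.0], [3.0], [10.0]]
y_train = [10.0, 20.0, 30.0, 100.0]

y_pred = knn_regress(X_train, y_train, [[2.1]], k=2)
print(y_pred)
```

For the test point 2.1 the two nearest training points are 2.0 and 3.0, so the prediction is the mean of their targets.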
SUPPORT VECTOR MACHINE (SVM):
As we know the target is not linear in many features, it is worth trying some nonlinear
models, such as the Support Vector Machine.
 In machine learning, support vector machines (SVMs, also support vector
networks) are supervised learning models with associated learning
algorithms that analyze data used for classification and regression analysis.

 A Support Vector Machine (SVM) is a discriminative classifier formally
defined by a separating hyperplane.
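
For a regression target such as GDP, the regression variant of the SVM can be used. The sketch below assumes scikit-learn's SVR with an RBF kernel and feature scaling; the synthetic data and the value C=10.0 are illustrative assumptions, not settings from the project.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical nonlinear data with two features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

# Scaling the features first usually matters for kernel-based SVMs.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
svr.fit(X, y)

pred = svr.predict(X[:5])
print(pred)
```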
DEPLOYMENT
 In order to deploy the trained model so that it can be used to predict the GDP of a
country, we need an application with a simple user interface.
 Thus, we made a simple web interface using HTML, CSS, Bootstrap and Flask.

 We wanted to predict the results for the values obtained from the
user, for which we made use of the Flask framework to integrate the
backend and the frontend.

 We generated a pickle file of our model, which is used to generate the predictions for
the input data.
 In the next step, we built a UI for the user to enter the values of all the
inputs.
 The model then processes the data and returns the predicted GDP for those
inputs.
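
The pickle step can be sketched as follows. The model, the three input features and the input values are hypothetical; the example serialises to bytes in memory, whereas a deployed application would typically write those bytes to a file and load them when the web server starts.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data with three features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 3))
y = X.sum(axis=1)

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Serialise the fitted model and restore it, as the web backend would.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# Hypothetical values entered by the user in the UI.
user_input = np.array([[0.2, 0.5, 0.9]])
prediction = restored.predict(user_input)
print(prediction)
```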
HTML:
 HTML stands for Hyper Text Markup Language.
 It is used to design web pages using the markup language.
 Hypertext defines the link between the web pages and markup language
defines the text document within the tag that define the structure of web
pages.

Syntax:
<form> <!--form elements--> </form>
CSS:
 CSS (Cascading Style Sheets) is a stylesheet language used to design a
webpage to make it attractive.
 The reason for using this is to simplify the process of making web pages
presentable.
Basic Format:

 It is the basic structure of HTML webpage and we use CSS style inside webpage.

 In a web page, we use internal CSS (i.e. adding CSS code inside <head> tag of
HTML code).
Bootstrap:
Bootstrap is a free and open-source tool collection for creating responsive
websites and web applications. It is the most popular HTML, CSS, and JavaScript
framework for developing responsive, mobile-first websites.

Why do we use Bootstrap?

•It is a faster and easier way to do web development.
•It creates platform-independent web pages.
•It creates responsive web pages.
•It designs responsive web pages for mobile devices too.
•It is a free and open-source framework available on www.getbootstrap.com
Flask – (Creating first simple application):

Flask is a web application framework written in Python. Flask is based on the Werkzeug WSGI toolkit
and the Jinja2 template engine. Both are Pocoo projects.
To understand what Flask is, you have to understand a few general terms.

1.WSGI: Web Server Gateway Interface (WSGI) has been adopted as a standard for Python web
application development.

2.Werkzeug :It is a WSGI toolkit, which implements requests, response objects, and other utility
functions.

3.jinja2 :jinja2 is a popular templating engine for Python. A web templating system combines a
template with a certain data source to render dynamic web pages.
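
A first simple application can be sketched as below. The route, the function name and the returned text are illustrative assumptions; the example uses Flask's built-in test client to call the route, so no server needs to be started to see the response.

```python
from flask import Flask

app = Flask(__name__)


@app.route("/")
def hello():
    # The returned string becomes the body of the HTTP response.
    return "Hello, World!"


# Call the route through Flask's test client instead of starting a server.
client = app.test_client()
response = client.get("/")
print(response.status_code, response.get_data(as_text=True))
```

To serve the page in a browser, the same app object would be started with app.run() or the flask run command.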
