LLSPS - INT - 2831 - Predicting Life Expectancy Using Machine Learning
Project summary
The term "Life expectancy" refers to the number of years a person can expect to
live. By definition, life expectancy is based on an estimate of the average age that members
of a particular population group will be when they die.
Life expectancy is the statistical measure of the average time a human being is
expected to live. Life expectancy depends on factors such as regional variations,
economic circumstances, sex differences, mental illnesses, physical illnesses,
education, year of birth and other demographic factors. This problem statement
provides a way to predict the average life expectancy of people living in a country when
various factors such as year, GDP, education, alcohol intake of people in the country,
expenditure on the health care system and deaths related to some specific diseases in
the country are given.
Project Requirements:
1. Functional Requirements:
a. Accuracy of the algorithm used for the prediction should be more than 90%
b. User should enter the details in a proper format in order to get accurate results
c. Data should be trained well
d. Application should be able to predict life expectancy for the particular country with
maximum efficiency
e. Without compromising prediction accuracy, the model should be able to make predictions
quickly, automatically and systematically
f. Developer should be able to update database
g. Application should be easy for the user to use
2. Software Requirements:
a. Python(R)
b. IBM cloud platform
c. IBM Watson Services
d. IBM Watson Studio
e. Operating system: Windows XP/Windows7/Windows Vista
3. Project Team: Individual
Project Title : Predicting life expectancy
Name : Jyothi Burla
4. Project Schedule:
TASK2 : Set up the development environment
a. Git Hub Account
b. Slack Account
c. IBM Cloud Account
Node Red starter Application:
IBM Watson Use Cases:
The interviewed organizations had similar, though somewhat varied Watson Assistant
deployments. The artificial intelligence research and innovation manager of a financial
services organization was an early adopter of Watson Assistant. He shared with
Forrester: “The conversations began with IBM to understand Watson and see if we
could find a use case. And because artificial intelligence was such a new technology, we
didn’t want to have a use case that would be exposed directly to customers. So, we
found the internal use case for employees.” Other interviewees focused their initial
deployments on externally facing use cases. The three main categories of use cases
covered in this study are as follows:
● Agent assist: In the report “Stop Trying To Replace Your Agents With Chatbots,”
Forrester highlights agent assist as a preferred method for blending customer
service automation and humans: “Using chatbots internally first is a good
starting point for many firms just setting out on their chatbot journey. Your
agents make an ideal and captive test bed for a bot before you expose it to your
customers.”
● Customer self-service: This use case deploys a customer-facing chatbot that
can respond to and contain simple queries, search for complex answers from
content or a knowledge base, and properly route to a human.
● Employee self-service: This use case is also an internally facing utilization of
Watson and is aimed at answering employee questions. The organizations
interviewed for this study used Watson to augment HR and IT help desk.
Introduction to Machine Learning:
Machine learning is a growing technology which enables computers to learn
automatically from past data. Machine learning uses various algorithms for building
mathematical models and making predictions using historical data or information.
Currently, it is being used for various tasks such as image recognition, speech
recognition, email filtering, Facebook auto-tagging, recommender systems, and many
more.
What is Machine Learning:
In the real world, we are surrounded by humans who can learn everything from their
experiences with their learning capability, and we have computers or machines which
work on our instructions. But can a machine also learn from experiences or past data
like a human does? So here comes the role of Machine Learning.
A Machine Learning system learns from historical data, builds prediction models,
and whenever it receives new data, predicts the output for it. The accuracy of the predicted
output depends upon the amount of data, as a large amount of data helps to build a
better model which predicts the output more accurately.
Suppose we have a complex problem where we need to perform some predictions. Instead
of writing code for it, we just need to feed the data to generic algorithms, and
with the help of these algorithms, the machine builds the logic as per the data and predicts
the output. Machine learning has changed our way of thinking about the problem. The
block diagram below explains the working of a Machine Learning algorithm.
Features of Machine Learning:
○ Machine learning uses data to detect various patterns in a given dataset.
○ It can learn from past data and improve automatically.
○ It is a data-driven technology.
○ Machine learning is very similar to data mining, as it also deals with huge
amounts of data.
Following are some key points which show the importance of Machine Learning:
○ Rapid increment in the production of data
○ Solving complex problems, which are difficult for a human
○ Decision making in various sectors, including finance
○ Finding hidden patterns and extracting useful information from data.
Classification of Machine Learning:
At a broad level, machine learning can be classified into three types:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
1) Supervised learning
Supervised learning is a type of machine learning method in which we provide sample
labeled data to the machine learning system in order to train it, and on that basis, it
predicts the output.
The system creates a model using labeled data to understand the dataset and learn
about each data point. Once the training and processing are done, we test the model by
providing sample data to check whether it predicts the correct output.
The goal of supervised learning is to map input data to the output data.
Supervised learning is based on supervision, much as when a student learns
things under the supervision of a teacher. An example of supervised learning is spam
filtering.
Supervised learning can be grouped further in two categories of algorithms:
○ Classification
○ Regression
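The regression case can be sketched in a few lines of Python. This is only an illustration, not the project's code: the numbers are made-up toy values, and the helper names fit_line and predict are hypothetical. The labeled pairs are used to fit a straight line by ordinary least squares, and the fitted model is then tested on an input it has not seen.

```python
# Toy labeled data (made up for illustration): input value -> known output.
train_x = [8.0, 10.0, 12.0, 14.0]
train_y = [60.0, 65.0, 70.0, 75.0]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the fitted line to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Training: learn the mapping from the labeled examples.
model = fit_line(train_x, train_y)

# Testing: predict for an input that was not part of the training data.
prediction = predict(model, 16.0)
print(prediction)
```

A classification task such as spam filtering follows the same train-then-test pattern, with a category as the output instead of a number.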
2) Unsupervised learning
Unsupervised learning is a learning method in which a machine learns without any
supervision.
The machine is trained on a set of data that has not been labeled,
classified, or categorized, and the algorithm needs to act on that data without any
supervision. The goal of unsupervised learning is to restructure the input data into new
features or groups of objects with similar patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to
find useful insights from the huge amount of data. It can be further classified into two
categories of algorithms:
○ Clustering
○ Association
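As a small illustration of clustering, the sketch below runs plain k-means on a handful of unlabeled numbers. It is a simplified example under stated assumptions: the values and the starting centers are made up, and kmeans_1d is a hypothetical helper written for this report, not a library function.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on numbers: assign each value to its nearest center,
    then move each center to the mean of the values assigned to it."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Unlabeled toy values (made up): no expected output is supplied.
values = [51.0, 52.0, 53.0, 79.0, 80.0, 81.0]
centers, groups = kmeans_1d(values, centers=[50.0, 60.0])
print(centers)
print(groups)
```

The algorithm is never told which group a value belongs to; the two groups emerge only from the similarity between the values.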
3) Reinforcement learning
Reinforcement learning is a feedback-based learning method, in which a learning agent
gets a reward for each right action and gets a penalty for each wrong action. The agent
learns automatically from this feedback and improves its performance. In
reinforcement learning, the agent interacts with the environment and explores it. The
goal of an agent is to get the most reward points, and hence, it improves its
performance.
The robotic dog, which automatically learns the movement of its limbs, is an example of
Reinforcement learning.
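The reward and penalty idea can be sketched with a very small agent. This is an assumed toy setup, not part of the project: the agent chooses between two actions, action 1 is treated as the right action (reward +1) and action 0 as the wrong one (penalty -1), and an epsilon-greedy rule balances exploring and exploiting. The function name train_agent and all parameter values are hypothetical.

```python
import random


def train_agent(steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy agent choosing between two actions.

    Action 1 is the 'right' action (+1 reward); action 0 is the 'wrong'
    action (-1 penalty). The agent keeps a running average reward per action.
    """
    rng = random.Random(seed)
    value = [0.0, 0.0]  # estimated reward of each action
    count = [0, 0]      # how often each action was taken
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(2)  # explore: try a random action
        else:
            action = 0 if value[0] >= value[1] else 1  # exploit: best so far
        reward = 1.0 if action == 1 else -1.0
        count[action] += 1
        # Incremental update of the running average for the chosen action.
        value[action] += (reward - value[action]) / count[action]
    return value, count


value, count = train_agent()
print(value, count)
```

After training, the estimated value of the rewarded action is higher, so the agent chooses it most of the time.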
Automating ML model(For salary prediction):
Output for the Auto AI model(salary prediction):
Node Red for salary Prediction:
Output:
Predicting Life Expectancy Using Python:
Data set :
Created necessary IBM Services (Machine Learning and Watson Studio):
Watson Studio project Creation:
Creating notebook:
Importing dataset into Notebook:
Create Machine Learning Model and Create end points for Node Red Integration:
Build Node-RED flow to integrate ML Services :
OUTPUT :
Predicting Life Expectancy Without Python:
Create necessary IBM Cloud Services (Watson Studio and Machine Learning):
Imported Dataset and created Auto AI Experiment:
Build Node RED flow to integrate AutoAI :
OUTPUT :
Project Documentation
Project Report on
Under
Project by :
G. Hemant Kumar
B.E. 3rd year (Computer) Datta Meghe College Of Engineering, Airoli, Navi Mumbai
Email: [email protected]
1. INTRODUCTION:
● Overview:
Life expectancy is the statistical measure of the average time a human being is
expected to live. Life expectancy depends on factors such as regional variations,
economic circumstances, sex differences, mental illnesses, physical illnesses,
education, year of birth and other demographic factors. This problem statement
provides a way to predict the average life expectancy of people living in a country when
various factors such as year, GDP, education, alcohol intake of people in the country,
expenditure on the health care system and deaths related to some specific diseases in
the country are given.
● Purpose:
Life expectancy is one of the most important factors in end-of-life decision
making. Good prognostication, for example, helps to determine the course of treatment
and to anticipate the procurement of health care services and facilities, or, more
broadly, facilitates Advance Care Planning. Advance Care Planning improves the quality
of the final phase of life by stimulating doctors to explore the preferences for end-of-life
care with their patients and the people close to them. Physicians, however, tend to
overestimate life expectancy and miss the window of opportunity to initiate Advance
Care Planning.
The project uses the RandomForestRegressor algorithm. Regression is a measure of the
relation between the mean value of one variable (e.g. output) and corresponding values of
other variables. The dataset used for the training of the model was downloaded from
kaggle.com, and Python is used to write the code for the machine learning model.
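To give a feel for how a random forest regressor arrives at a prediction, the sketch below shows the core idea in plain Python: many small trees are each trained on a bootstrap sample of the data, and their predictions are averaged. This is a simplified illustration and not scikit-learn's RandomForestRegressor: each tree here is a single-split "stump" on one feature, the random feature selection of a real random forest is omitted, the data are made up, and the function names are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """One-split regression tree: pick the threshold on x that minimises
    squared error, and predict the mean of y on each side."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all x values in the sample are identical
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """The ensemble prediction is the average over all trees."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)


# Toy data (made up): one input feature and one numeric target.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [55.0, 56.0, 57.0, 58.0, 72.0, 73.0, 74.0, 75.0]
forest = fit_forest(xs, ys)
low, high = predict_forest(forest, 1.5), predict_forest(forest, 7.5)
print(low, high)
```

Averaging over many trees trained on slightly different samples makes the prediction less sensitive to any single unusual row in the training data.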
2. LITERATURE SURVEY:
● Existing problem:
It has been noted that data collection for predicting life/health using
machine learning/big data is a big challenge due to considerations relating to privacy
and government policy, which will require the collaboration of various health sector
bodies. Despite these challenges, life expectancy can be predicted by proposing a data
collection and application approach. As Artificial Intelligence and Machine Learning
technologies are developing and quickly being implemented, the ease of gathering
health data from the public as well as from current government agencies such as centralized
health servers could be increased.
● Proposed Solution:
I explored life expectancy and looked for a dataset covering the following aspects
(features) : "Country", "Year", "Status", "Adult Mortality", "infant deaths", "Alcohol",
"percentage expenditure", "Hepatitis B", "Measles ", " BMI ", "under-five deaths ",
"Polio", "Total expenditure", "Diphtheria ", " HIV/AIDS", "GDP", "Population", " thinness
1-19 years", " thinness 5-9 years", "Income composition of resources", "Schooling"
In summation, the dataset started with 21 unclean variables (including the
target) and has been pared down to 12 features to describe the target variable (Life
Expectancy). This is very likely only the beginning of the possible things that could be
done with this dataset, but nonetheless it serves as a solid foundation for modeling.
After performing data cleaning and data analysis using the statistical tools in python(R),
I selected the dependent and independent features and created a machine learning model
using RandomForestRegressor. When we give the inputs (features), the model gives a
prediction (life expectancy in years) as output. Finally, the model is deployed to
IBM Cloud so that it is useful for all people.
3. THEORETICAL ANALYSIS:
● Block Diagram:
● Hardware/Software Requirements:
Operating system: Windows XP/Windows7/Windows 10
IBM Cloud Services which includes
Watson Studio
Machine learning
Node RED
Cloud Foundry Service
4. EXPERIMENTAL INVESTIGATIONS:
With Python:
These are some graphs from the refined data analysis that help us understand
collinearity.
The following are very/extremely highly correlated (correlation > .7 or correlation < -.7):
● Infant Deaths/Under Five Deaths (drop Infant Deaths - Under Five Deaths is more
highly correlated to Life Expectancy)
● GDP/Percentage Expenditure (drop Percentage Expenditure - GDP is more highly
correlated to Life Expectancy)
● Polio/Diphtheria (drop Polio - Diphtheria is more highly correlated to Life
Expectancy)
● Thinness 5-9/Thinness 10-19 (drop Thinness 10-19 as correlations to other
variables are slightly higher)
● Income Composition of Resources/Schooling (drop Schooling - Income
Composition of Resources is more highly correlated with Life Expectancy)
● Developing/Developed (drop Developing - these two are the same just opposite
of one another)
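The rule applied in the list above can be sketched as follows: for each pair of features whose correlation exceeds 0.7 in absolute value, keep the one that is more strongly correlated with Life Expectancy and drop the other. The code is an illustrative sketch only; the column values are made-up toy numbers (not taken from the dataset), the helper names pearson and drop_collinear are hypothetical, and the columns are assumed not to be constant.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def drop_collinear(features, target, threshold=0.7):
    """For each pair of features with |correlation| > threshold, drop the
    one that is less correlated with the target."""
    kept = dict(features)
    names = list(features)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if first not in kept or second not in kept:
                continue  # one of the pair was already dropped
            if abs(pearson(features[first], features[second])) > threshold:
                first_score = abs(pearson(features[first], target))
                second_score = abs(pearson(features[second], target))
                weaker = first if first_score < second_score else second
                del kept[weaker]
    return list(kept)


# Toy columns (made up for illustration only).
features = {
    "under_five_deaths": [1.0, 2.0, 3.0, 4.0, 5.0],
    "infant_deaths": [1.0, 2.0, 3.0, 5.0, 4.0],
    "alcohol": [2.0, 1.0, 2.0, 1.0, 2.0],
}
life_expectancy = [80.0, 75.0, 70.0, 65.0, 60.0]

selected = drop_collinear(features, life_expectancy)
print(selected)
```

In this toy example the two mortality columns are highly correlated with each other, so only the one with the stronger relation to the target is kept.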
After giving inputs to the deployed machine learning model, the output is as given below:
Without Python(AutoAI):
The AutoAI graphical tool in Watson Studio quickly builds a model
and evaluates its accuracy, all without writing a single line of code. AutoAI guides us,
step by step, through building a machine learning model by uploading training data,
choosing a machine learning technique and algorithms, and training and evaluating
the model.
Relationship Map:
Progress Map:
Feature Importance:
Node RED Flow:
OUTPUT:
5. FLOW CHART:
6. RESULT:
Based on the given data, the AutoAI model or ML model learns which factors
affect the result we require, i.e. life expectancy. It
predicts the output based on the features on which it was trained. Then, based on the
input given to the trained model, it validates the features and gives accurate results with
maximum efficiency as output.
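The report does not state which measure of accuracy was used. As a hedged sketch, assuming accuracy is read as the coefficient of determination (R squared), it could be checked as shown below, together with the mean absolute error in years. The numbers are made-up toy values and are not results of this project; r2_score and mean_absolute_error here are small hand-written helpers.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - residual error / total variation."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    """Average size of the prediction error, in the target's units (years)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy values (made up): true life expectancy vs. model output, in years.
actual = [60.0, 65.0, 70.0, 75.0]
predicted = [61.0, 64.0, 71.0, 74.0]

score = r2_score(actual, predicted)
error = mean_absolute_error(actual, predicted)
print(round(score, 3), error)
```

Under this reading, a requirement of more than 90% accuracy would correspond to an R squared above 0.9 on data that was held out from training.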
7. ADVANTAGES AND DISADVANTAGES:
Advantages:
1. By using this application we will be able to predict the number of years a person
can expect to live.
2. It can help ease the gathering of health data from the public as well as
from current government agencies such as centralized health servers.
3. Good prognostication, for example, helps to determine the course of treatment
and to anticipate the procurement of health care services and facilities, or,
more broadly, facilitates Advance Care Planning.
Disadvantages:
1. Anomalies in the database can lead to wrong predictions
2. Analysis on the data should be correct in order to get the accurate results
3. Accuracy is not 100%
4. Fake entries in the dataset will give wrong predictions
8. APPLICATIONS:
1. This project/idea is useful for insurance companies as they consider age,
lifestyle choices, family medical history, and several other factors when
determining premium rates for individual life insurance policies. The principle of
life expectancy suggests that you should purchase a life insurance policy for an
individual.
2. This will also help increase life expectancy by considering the impact of a
specific factor on the average lifespan of people in a specific country.
9. CONCLUSION:
Thus, we have developed a model that will predict the life expectancy of a
specific demographic region based on the inputs provided. Various factors have a
significant impact on the life span such as Adult Mortality, Population, Under 5 Deaths,
Thinness 1-5 Years, Alcohol, HIV, Hepatitis B, GDP, Percentage Expenditure and many
more. Users can interact with the system via a simple graphical user interface, which is
a form with input fields that the user fills in with the specific inputs
before pressing the “Submit” button to get accurate results.
10. FUTURE SCOPE:
For future scope, we can connect the model to a database so that it can predict
the life expectancy not only of human beings but also of the plants and different
animals present on the earth. This will help us analyze the trends in life span. A
model with country-wise bifurcation can be made, which will help to segregate the data
demographically.
Big data and machine learning can help public health researchers
analyze thousands of variables to obtain data regarding life expectancy. We can use
demographics of selected regional areas and multiple behavioral health disorders
across regions to find correlations between individual behavior indicators and behavioral
health outcomes.
APPENDIX:
APP/UI Web page using python:
https://node-red-eoxga.eu-gb.mybluemix.net/ui/#!/0?socketid=xP8kRyUlzz1BLLkqAAAK
APP/UI Web page using AutoAI:
https://node-red-eoxga.eu-gb.mybluemix.net/ui/#!/0?socketid=mhs-CFMHAanq5_THAAAO
Dataset link : https://www.kaggle.com/kumarajarshi/life-expectancy-who
Source Code :
https://github.com/SmartPracticeschool/llSPS-INT-2831-Predicting-Life-Expectancy-using-Machine-Learning/blob/master/Predicting%20Life%20Expectancy%20using%20python.ipynb