
FINAL REPORT

ON

LOAN APPROVAL PREDICTION USING EXPLORATORY DATA ANALYSIS

ABSTRACT

Loan prediction is very helpful both for bank employees and for applicants. A loan prediction system can automatically calculate the weight of each feature taking part in loan processing, and on new test data the same features are processed with respect to their associated weights. A time limit can be set for the applicant to check whether his/her loan will be sanctioned or not. A loan prediction system also allows jumping to a specific application so that it can be checked on a priority basis. In this project we gain an idea of how real business problems are solved using EDA. In this study, we also develop a basic understanding of risk analytics in banking and financial services, and understand how data is used to minimise the risk of losing money while lending to customers. In India, the number of people applying for loans has increased for various reasons in recent years. Bank employees are not able to analyse or predict whether the customer can pay back the amount or not (good customer or bad customer) for the given interest rate. The aim of this paper is to find the nature of the customer applying for a personal loan. An exploratory data analysis approach is used to address this problem. The dataset used for EDA undergoes the processes of normalisation, missing value treatment, selection of important columns using filtering, derivation of new columns, identification of the target variables and visualisation of the data in graphical form. Python is used for easy and efficient processing of data. We used the pandas library available in Python to process and extract data from the given dataset. In particular, it provides data structures and operations for manipulating numerical tables and time series. The processed data is converted into suitable graphs for better visualisation of the results and for better understanding. For obtaining the graphs, the Matplotlib library is used. Matplotlib is a plotting library for Python and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt and so on. The company considered here is the largest online loan marketplace, facilitating personal loans, business loans, and financing of medical procedures. Borrowers can easily access lower interest rate loans through a fast online interface. Like most other lending companies, lending loans to 'risky' applicants is the largest source of financial loss (called credit loss). The credit loss is the amount of money lost by the lender when the borrower refuses to pay or runs away with the money owed. In other words, borrowers who default cause the largest amount of loss to the lenders. In this case, the customers labelled as 'charged-off' are the 'defaulters'. If one is able to identify these risky loan applicants, then such loans can be reduced, thereby cutting down the amount of credit loss. Identification of such applicants using EDA is the aim of this project. In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilise this knowledge for its portfolio and risk assessment.

CHAPTER 1

INTRODUCTION

A loan is an amount of money borrowed from a bank to help fund certain planned or unplanned events. The borrower is required to pay back the loan, including the interest charged, over a stipulated period. There are several types of loans for various financial requirements. A bank can grant a loan in the form of a secured or an unsecured loan. A secured loan is often a large sum of money that is needed to buy a house or car, and is the right choice for a home loan or car loan. An unsecured loan is preferable for student loans or personal loans, which typically involve smaller amounts of money.

TYPES OF LOANS

Banks provide various types of loans to ensure that they meet all your needs.

• Home Loan – The bank lends you money, and the house remains the property of the bank until the final instalment is made. Consumers are required to pay back the loan on a monthly basis, at the given interest rate and over a stipulated period, usually 20 years.

• Student Loan – Students who want to further their studies at a higher education institution and require financial help apply for student loans. The bank provides the money for the duration of their studies, and after the completion of their studies the student needs to pay back the money. The interest rates are usually low and there are flexible repayment options.

• Car Loan – Most banks provide car loans for both used and new vehicles. Consumers pay back the instalments on a monthly basis, and the car belongs to the bank until the final payment is made.

• Personal Loan – Banks provide different options when it comes to a personal loan. This is a financial solution designed to suit all your needs. The amount of money that you can borrow depends on the selected bank, your financial eligibility and your ability to repay the loan.

• Business Loan – A business loan provides you with the capital to start your business venture. The bank provides you with the money and you are required to make the repayments after an agreed period of time. The requirements vary from bank to bank, and whether you are a new business or have been in operation plays a major part in your loan application.

SECURED LOAN VS UNSECURED LOAN


A secured loan is a long-term loan which has a guarantee attached to it. It is the best way to obtain large amounts of money and buy property. Assets are used as security in case of a default. Large amounts of money will not be lent to you without assurance, which is why you place your home or assets as leverage to guarantee that you will pay off your loan on time. Secured loans come with low interest rates and longer repayment options.

An unsecured loan is a short-term loan and has no guarantee attached to it. It is usually given on the basis of your credit report and financial position. Unsecured loans include credit cards, personal loans and student loans. Because of the high risk of this type of loan, the interest rate is also higher. It is advisable to consult your preferred financial institution about the various options regarding both their secured and unsecured loans.

When a person applies for a loan, there are two types of decisions that could be taken by
the company:

Loan accepted: If the company approves the loan, there are 3 possible scenarios described
below:

Fully paid: Applicant has fully paid the loan (the principal and the interest)

Current: Applicant is in the process of paying the instalments, i.e. the tenure of the loan is not
yet completed. These candidates are not labelled as 'defaulted'.

Charged-off: Applicant has not paid the instalments in due time for a long period of time, i.e.
he/she has defaulted on the loan

Loan rejected: The company rejected the loan (because the candidate does not meet their
requirements, etc.). Since the loan was rejected, there is no transactional history of those
applicants with the company, and so this data is not available with the company (and thus not
in this dataset)
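To make this labelling concrete, a binary default flag can be derived from the status column. The sketch below is a minimal pandas illustration; the file name loan.csv and the column name loan_status are assumptions for the sketch:

import pandas as pd

df = pd.read_csv("loan.csv")  # assumed file name

# 'Current' loans have an unknown final outcome, so they are excluded
# from defaulter / non-defaulter analysis.
df = df[df["loan_status"] != "Current"]

# 'Charged Off' applicants are the defaulters; 'Fully Paid' are not.
df["defaulted"] = (df["loan_status"] == "Charged Off").astype(int)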

Nowadays, banks play a vital role in the market economy. The success or failure of an organization
largely depends on the industry's ability to evaluate credit risk. Before granting a loan
to a borrower, the bank decides whether the borrower is bad (a defaulter) or good (a non-defaulter).
Predicting the borrower's status, i.e. whether the borrower will be a defaulter or a non-defaulter in future, is a
challenging task for any organization or bank. Basically, loan defaulter prediction is a
binary classification problem: the loan amount and the customer's history govern his creditworthiness for
receiving a loan, and the problem is to classify the borrower as a defaulter or a non-defaulter. However,
developing such a model is a very challenging task due to the increasing demand for loans.

Exploratory Data Analysis

Exploratory data analysis (EDA) is an important step in any research analysis. The
primary aim of exploratory analysis is to examine the data for distribution, outliers
and anomalies in order to direct specific testing of your hypothesis. It also provides tools for
hypothesis generation by visualizing and understanding the data, usually through graphical
representation. EDA aims to support the analyst's natural pattern recognition. Consequently,
feature selection techniques often fall under EDA. Since the seminal work of Tukey in
1977, EDA has gained a large following as the gold-standard approach to investigating
a data set. According to Howard Seltman (Carnegie Mellon University), "loosely speaking,
any method of looking at data that does not include formal statistical modeling and
inference falls under the term exploratory data analysis". EDA is a fundamental early step
after data collection and pre-processing, where the data is simply visualized,
plotted and manipulated, without any assumptions, in order to help assess the quality of the
data and to build models. "Most EDA techniques are graphical in nature
with a few quantitative techniques. The reason for the heavy reliance on graphics is that by
its very nature the main role of EDA is to explore, and graphics gives the
analysts unparalleled power to do so, while being ready to gain insight into the
data. There are many ways to categorize the many EDA techniques".

Exploratory data analysis (EDA) is a very important step which takes place after
feature engineering and acquiring data, and it should be performed before any
modeling. This is because it is very important for a data scientist to be able to understand the
nature of the data without making assumptions.

The purpose of EDA is to use summary statistics and visualizations to better understand
the data, find clues about the tendencies of the data and its quality, and
formulate the assumptions and hypotheses of our analysis. EDA is not about
making fancy or even aesthetically pleasing visualizations; the goal is to try to answer
questions with data. Your aim should be to create a figure which a
person can look at for a couple of seconds and understand what is going on. If not,
the visualization is too complicated (or fancy) and something simpler should be used.

EDA is also very iterative: we first make assumptions based on our first
exploratory visualizations, then build some models. We then make visualizations of the
model results and tune our models.

EDA is an approach under data analysis used for gaining a better understanding of data
aspects like:

1. Main features of the data
2. Variables and relationships that hold between them
3. Identifying which variables are important for our problem

We shall look at various exploratory data analysis methods like:

➢ Descriptive Statistics, which is a way of giving a brief overview of the dataset we are
dealing with, including some measures and features of the sample

➢ Grouping data

➢ ANOVA, Analysis Of Variance, which is a computational method to divide variations
in an observation set into different components.

➢ Correlation and correlation methods

Descriptive Statistics

Descriptive statistics is a helpful way to understand the characteristics of your data and to get
a quick summary of it. Pandas in Python provides a convenient method, describe(). The describe()
function applies basic statistical computations to the dataset, such as extreme values, the count of
data points, the standard deviation and so on. Any missing or NaN value is automatically
skipped. describe() gives a good picture of the distribution of the data.
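A minimal sketch of describe() on the loan data follows; the file and column names are assumed for illustration:

import pandas as pd

df = pd.read_csv("loan.csv")  # assumed file name

# Count, mean, std, min, quartiles and max for every numeric column;
# missing (NaN) values are skipped automatically.
print(df.describe())

# describe() can also be restricted to columns of interest.
print(df[["loan_amnt", "annual_inc"]].describe())  # assumed column names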

Grouping data

Group by is an interesting measure available in pandas which can help us figure out the effect of
different categorical attributes on other data variables.
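For example, a hedged sketch of groupby over assumed 'grade' and 'loan_amnt' columns:

import pandas as pd

df = pd.read_csv("loan.csv")  # assumed file name

# Mean loan amount within each grade.
print(df.groupby("grade")["loan_amnt"].mean())

# Several aggregates at once for the same grouping.
print(df.groupby("grade")["loan_amnt"].agg(["count", "mean", "median"]))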

ANOVA

ANOVA stands for Analysis of Variance. It is performed to figure out the relation between
different groups of categorical data.
Under ANOVA we have two measures as results:
F-test score: shows the variation of the group means relative to the variation within the groups
p-value: shows the significance of the result
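A one-way ANOVA can be run with scipy.stats.f_oneway; a sketch, assuming 'grade' and 'loan_amnt' columns:

import pandas as pd
from scipy import stats

df = pd.read_csv("loan.csv")  # assumed file name

# One sample of loan amounts per grade, NaNs removed.
groups = [g["loan_amnt"].dropna() for _, g in df.groupby("grade")]

# F-test score and p-value; a small p-value suggests the group means differ.
f_score, p_value = stats.f_oneway(*groups)
print("F =", f_score, "p =", p_value)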

Correlation and Correlation computation



Correlation is a simple relationship between two variables in a context such that one variable
affects the other. Correlation is different from causation. One way to calculate the correlation
among variables is to find the Pearson correlation.
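In pandas this is a single call; a sketch with assumed column names:

import pandas as pd

df = pd.read_csv("loan.csv")  # assumed file name

# Pearson correlation between two numeric columns (Pearson is the default method).
print(df["loan_amnt"].corr(df["installment"]))

# Pairwise Pearson correlation matrix over all numeric columns.
print(df.select_dtypes("number").corr())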

What is the General Purpose of EDA?


There is no formal set of techniques which must be used in EDA. Remember, EDA is an approach to
how we analyse data, not a particular set of methods set in stone. It is a philosophy
and an art more so than a science.

Its purpose is to take a general view of some given data without making any
assumptions about it. We are trying to get a feel for the data and what it might
mean, rather than rejecting or accepting some kind of premise about it before we start its
exploration.

In other words, with EDA we let the data speak for itself as opposed to
trying to force the data into some kind of pre-determined model.

Nonetheless, some techniques are used to help us get a feel for the data. For example, we can
categorize data, quantify some of its basic components, or visualize it.

For instance, raw data may be plotted using histograms or other visualization
techniques. Sometimes, the data is juxtaposed in a manner that helps us spot important patterns
within or between data sets.

What is EDA Used For?


EDA is used for:

• Catching mistakes and anomalies


• Gaining new insights into data
• Detecting outliers in data
• Testing assumptions
• Identifying important factors in the data
• Understanding relationships

And perhaps, most importantly, EDA is used to help figure out our next steps with respect to
the data. For instance, we might have new questions we need answered or new research we
need to conduct.

PYTHON INTRODUCTION

Python is a high-level, interpreted, interactive and object-oriented scripting language. Python
is designed to be highly readable. It uses English keywords frequently, whereas other
languages use punctuation, and it has fewer syntactical constructions than other languages.

• Python is Interpreted − Python is processed at runtime by the interpreter. You do
not need to compile your program before executing it. This is similar to PERL and
PHP.

• Python is Interactive − You can actually sit at a Python prompt and interact with the
interpreter directly to write your programs.

• Python is Object-Oriented − Python supports an object-oriented style or technique of
programming that encapsulates code within objects.

• Python is a Beginner's Language − Python is a great language for beginner-level
programmers and supports the development of a wide range of applications, from
simple text processing to WWW browsers to games.

History of Python
Python was developed by Guido van Rossum in the late eighties and early nineties at the
National Research Institute for Mathematics and Computer Science in the Netherlands.

Python is derived from many other languages, including ABC, Modula-3, C, C++, Algol-68,
SmallTalk, and Unix shell and other scripting languages.

Python is copyrighted. Python source code is available under the Python Software
Foundation License, which is GPL-compatible.

Python is now maintained by a core development team at the institute, although Guido van
Rossum still holds a vital role in directing its progress.

Python Features
Python's features include −

• Easy-to-learn − Python has few keywords, simple structure, and a clearly defined
syntax. This allows the student to pick up the language quickly.

• Easy-to-read − Python code is more clearly defined and visible to the eyes.

• Easy-to-maintain − Python's source code is fairly easy-to-maintain.

• A broad standard library − The bulk of Python's library is very portable and cross-
platform compatible on UNIX, Windows, and Macintosh.

• Interactive Mode − Python has support for an interactive mode which allows
interactive testing and debugging of snippets of code.

• Portable − Python can run on a wide variety of hardware platforms and has the same
interface on all platforms.

• Extendable − You can add low-level modules to the Python interpreter. These
modules enable programmers to add to or customize their tools to be more efficient.

• Databases − Python provides interfaces to all major commercial databases.

• GUI Programming − Python supports GUI applications that can be created and
ported to many system calls, libraries and windows systems, such as Windows MFC,
Macintosh, and the X Window system of Unix.

• Scalable − Python provides a better structure and support for large programs than
shell scripting.

CHAPTER 2

LITERATURE SURVEY

Amira Kamil Ibrahim Hassan and Ajith Abraham (2008) use a prediction model that is
built using three different training algorithms to train a supervised two-layer feed-forward
network. The results show that the training algorithm improves the design of the loan
default prediction model.

Angelini (2008) used a neural network with a standard topology and a feed-forward neural
network with ad hoc connections. A neural network can be used as a prediction model.
This paper shows that the above models provide optimal results with less error.

Ngai (2009) uses a classification model for predicting the future behaviour of customers in
CRM. In the CRM domain, the commonly used model is the neural network. He identified 87
articles related to data mining applications and techniques between 2000 and 2006.

Dr. A. Chitra and S. Uma (2010) introduced an ensemble learning technique for the
prediction of time series based on Radial Basis Function networks (RBF),
k-Nearest Neighbour (KNN) and Self-Organizing Maps (SOM). They proposed a model,
namely PAPEM, which performs better than the individual models.

Akkoç (2012) used a model, namely a hybrid Adaptive Neuro-Fuzzy Inference model, combining
grouping of data with a Neuro-Fuzzy network. A 10-fold cross-validation is used for better results
and for comparison with other models.

Sarwesh Site and Dr. Sadhna K. Mishra (2013) proposed a method in which two or more
classifiers are combined together to produce an ensemble model for better prediction. They
used the bagging and boosting techniques and then applied the random forest method.

Maher Alaraj, Maysam Abbod and Ziad Hunaiti (2014) proposed a new ensemble
approach for the classification of customer loans. This ensemble method is based on
neural networks. They state that the proposed approach provides better results and
accuracy compared to a single classifier and any other model.

Alaraj M. and Abbod M. (2015) introduced models that are based on homogeneous and
heterogeneous classifiers: an ensemble model based on three classifiers, namely the
artificial neural network, logistic regression and the support vector machine.

CHAPTER 3

PROBLEM STATEMENT

Consider a consumer finance company which specialises in lending various types of loans to
urban customers. When the company receives a loan application, it has to make a
decision about loan approval based on the applicant's profile. Two types of risks are associated
with the company's decision:

If the applicant is likely to repay the loan, then not approving the loan results in a loss of
business to the company

If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving
the loan may lead to a financial loss for the company

The dataset used for this work contains information about past loan applicants and
whether they 'defaulted' or not. The aim is to identify patterns which indicate whether a person is
likely to default, which may be used for taking actions such as denying the loan, reducing the
loan amount, or lending (to risky applicants) at a higher interest rate.

OBJECTIVES

(1) Identification of variables that have an impact on loan status, i.e. the probability that the
loan will default or be fully paid.

(2) Data exploration, cleanup, and univariate, segmented univariate and bivariate analysis to
identify the parameters that impact the target variable, loan status.

(3) The following analysis is required to find the impacting variables:

✓ Identification of parameters – categorical and numerical variables
✓ Derived variables
✓ Analysis of various parameters

(4) Probability of charged-off or default = (Number of applicants who defaulted) / (Total number of applicants)

RESEARCH METHODOLOGY

[Flowchart: collection of data set → feature selection → train model → test model → result analysis]

1. Collection of data set: the data set contains the details of the loans given by the
company. During data collection, the system loads the Excel sheet containing the loan records.

2. Feature selection: select the right features from the data set to get good results.
Feature selection is one of the core concepts in machine learning and hugely influences
the performance of your model. The data features that you use to train your machine
learning models have a large influence on the performance you can achieve.

3. Train model: train your machine learning model on the basis of the selected features. The process of
training an ML model involves providing an ML algorithm (that is, the learning
algorithm) with training data to learn from. The term ML model refers to the
model artifact that is created by the training process. You can use the ML
model to get predictions on new data for which you do not know the target.

4. Test model: test your machine learning model by matching its predictions against the
held-back records in the database.

5. Result analysis: compare the predicted values with the real values to get the accuracy of the
machine model.
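Taken together, the five steps can be sketched end to end. The snippet below is only an illustrative outline using scikit-learn; the file name, the feature columns and the 0/1 'defaulted' target are assumptions for the sketch, not the exact columns of the dataset.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Collection of data set: load the sheet containing the loan records.
df = pd.read_excel("loan_data.xlsx")  # assumed file name

# 2. Feature selection: pick the features believed to drive default.
features = ["loan_amnt", "annual_inc", "int_rate"]  # assumed columns
X = df[features]
y = df["defaulted"]  # assumed 0/1 target column

# 3. Train model / 4. Test model: fit on one split, evaluate on the other.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 5. Result analysis: compare predicted values with the real values.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))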

CHAPTER 4

RESULTS

1. UNIVARIATE ANALYSIS (CONTINUOUS VARIABLES)

For continuous variables, we need to understand the central tendency and spread; hence the
following plots are beneficial (a plotting sketch follows the list):

• Box plot

• Violin plots

• Histogram / Distribution plots
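As an illustration, these three plots can be produced with matplotlib and seaborn; a sketch assuming a 'loan_amnt' column in loan.csv:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("loan.csv")  # assumed file name

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.boxplot(x=df["loan_amnt"], ax=axes[0])          # central tendency and outliers
sns.violinplot(x=df["loan_amnt"], ax=axes[1])       # distribution shape
sns.histplot(df["loan_amnt"], bins=40, ax=axes[2])  # frequency distribution
plt.show()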



Findings

➢ The majority of loans lie between 6K and 14K

➢ A few outliers exist

➢ Interest rates lie between 10% and 23%

➢ Annual income lies between 40K and 80K

➢ There are many outliers, which need to be removed before any further analysis

2. UNIVARIATE ANALYSIS (CATEGORICAL VARIABLES)

For categorical variables, frequency tables can be used to understand the distribution of each
category. The possible metrics are count and count%. The following are beneficial (see the
sketch after the list):

a. Count plot

b. Bar chart
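A count plot and the matching count% table can be sketched as follows, assuming a categorical 'grade' column:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("loan.csv")  # assumed file name

# Count plot: frequency of each category.
sns.countplot(x="grade", data=df, order=sorted(df["grade"].dropna().unique()))
plt.show()

# Frequency table as count%.
print(df["grade"].value_counts(normalize=True) * 100)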

Findings

1. Loans mostly come from applicants with employment lengths of 10+ years, < 1 year, 2 years, 3 years and 4 years

1. 22.4% – 10+ years

2. 11.4% – less than 1 year

3. 11% – 2 years

2. The majority of loans are of grades B, A and C

1. B – 30%

2. A – 25%

3. C – 20%

Findings

1. Purpose of loans

a. 54% of loans are for debt consolidation

b. 15% are for credit card

2. Loan Status

a. 83% of loans are fully paid

b. 14% of loans are charged off / defaults happen

c. 3% of loans are current

3. Home Ownership

a. 47% of borrowers stay in rented houses

b. 46% of borrowers have mortgaged their homes

Findings

1. Address States

a. CA and NY account for around 18% and 10% of the loans respectively

2. Issue Dates

a. There are 5 years of data

b. The number of loans issued increases year on year

c. 54.5% of loans were given in 2011, 29% in 2010

d. Loans issued are following an increasing trend every month

3. Multivariate Analysis (Correlation / heat map)

1. From the heat map, it is evident that loan amount, funded amount, funded amount (investors)
and instalment are highly correlated

2. The same is true for total payment and total payment (investors)

3. The loan amount to income ratio doesn't show any significant impact
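A correlation heat map of this kind can be drawn with seaborn; a minimal sketch over the numeric columns of an assumed loan.csv:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("loan.csv")  # assumed file name

# Pairwise correlations of the numeric columns, rendered as a heat map.
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()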



4. Analysis (Box Plot)

1. Purpose and loan amount w.r.t. loan status

2. Small business shows a large share of charge-offs/defaults, followed by medical and moving

1. Employment length v/s loan amount w.r.t. purpose

2. Small business shows a large variation and defaults
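Grouped box plots of this kind can be sketched with seaborn; 'purpose', 'loan_amnt' and 'loan_status' are assumed column names:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("loan.csv")  # assumed file name

# Loan amount per purpose, split by loan status.
plt.figure(figsize=(12, 5))
sns.boxplot(x="purpose", y="loan_amnt", hue="loan_status", data=df)
plt.xticks(rotation=45)
plt.show()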

5. Analysis (Loan Status v/s Probability of default)

1. Probability of charged-off or default = (Number of applicants who defaulted) / (Total number of applicants)

2. The probability of charge-off shows some variation across states; however, the variation is not significant

3. This can't be a predictor variable
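The ratio above is straightforward to compute per category in pandas, since the mean of a 0/1 default flag is exactly the number of defaulters divided by the total applicants; a sketch with assumed column names:

import pandas as pd

df = pd.read_csv("loan.csv")  # assumed file name
df = df[df["loan_status"] != "Current"]
df["defaulted"] = (df["loan_status"] == "Charged Off").astype(int)

# P(default) per purpose = applicants who defaulted / total applicants.
print(df.groupby("purpose")["defaulted"].mean().sort_values(ascending=False))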



1. Probability of charged-off or default = (Number of applicants who defaulted) / (Total number of applicants)

2. The probability of charge-off varies with loan purpose, and the variation is significant for
small business, medical and moving

3. This can be a predictor variable

6. Analysis (Grade/sub grade v/s probability of default)

1. Probability of charged-off or default = (Number of applicants who defaulted) / (Total number of applicants)

2. The probability of charge-off varies with grade, and it increases as the grade changes
from A to G

3. This can be a predictor variable

4. Sub-grade is a sub-parameter and follows a similar trend; however, we are not taking it as a
predictor variable, as grade serves the purpose.

7. Analysis (annual income slabs / interest rate slabs v/s probability of default)

1. The probability of default decreases with an increase in annual income: the lower the income,
the higher the probability of default

2. Interest rates of 10% and above have a probability of default of more than 12%

3. At interest rates of 15% and above, the probability of default increases to 23%

4. Interest rate can be considered as a predictor variable.
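Slab-style variables like these can be derived with pd.cut before computing the default ratio; a sketch, with 'int_rate' as an assumed numeric percentage column:

import pandas as pd

df = pd.read_csv("loan.csv")  # assumed file name
df["defaulted"] = (df["loan_status"] == "Charged Off").astype(int)

# Bin the interest rate into slabs, then compute P(default) per slab.
df["int_rate_slab"] = pd.cut(df["int_rate"], bins=[0, 10, 15, 20, 25],
                             labels=["0-10%", "10-15%", "15-20%", "20-25%"])
print(df.groupby("int_rate_slab", observed=True)["defaulted"].mean())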

8. Analysis (loan amount slabs / employment length v/s probability of default)

1. The probability of default increases with the loan amount slab, rising beyond a loan amount of 10K

2. Employees with 1 year of experience or who are self-employed (0 years) have the maximum default rate

3. Employment length can be considered as a predictor variable.



9. Analysis (loan term / principal amount received v/s probability of default)

1. As the loan term increases, the probability of default increases

2. When the loan principal amount received is small, the probability of default is high, at 60%

3. Loan term can be considered as a predictor variable



10. Analysis (loan payment slabs / interest amount received v/s probability of default)

1. The loan interest amount received shows no significant behaviour.

2. When the loan payment amount received is small, the probability of default is high, at 60%.

CONCLUSION AND FUTURE WORK

The following can be considered major predictor variables for the target variable, loan status,
and predict a higher probability of default:

1. Purpose of loan – Small business has the maximum probability of default, 26%. Medical,
moving and debt consolidation are next on the list

2. Employment length – Self-employed applicants and those with 1 year of experience have more chances of default

3. Grade – As the grades go from A to G, the probability of default increases

4. Interest rate – Interest rates of 10% and above have a higher probability of default

5. Term – The 60-month term has a higher probability of default

6. Loan payment amount – The lower the amount paid back, the higher the probability of default

Time series analysis can be performed using loan data from several years, for
prediction of the approximate time at which a customer may default. Future analysis may be
carried out on predicting the approximate interest rate that a loan applicant is expected to get
according to his profile if his loan is approved. This would be beneficial for loan
applicants, since some banks approve loans but give very high interest rates to the
customer. It would give customers a rough insight into the interest rates that
they should be getting for their profile, and it would ensure that they don't end up
paying a much greater amount in interest to the bank. An application can be
built that will take various inputs from the customer such as employment
length, salary, age, marital status, SSN, address, loan amount, loan period and so on,
and provide a prediction of whether or not their loan application will be approved by
the banks, based on their inputs, along with an approximate interest rate.

CHAPTER 5

PYTHON PROGRAMMING WITH PANDAS

import pandas as pd

df = pd.read_csv("stock_data.csv")
print(df)

# Skip the first row of the file.
df = pd.read_csv("stock_data.csv", skiprows=1)
print(df)

# header=1 uses the second row as the header; skiprows and header
# achieve much the same thing here.
df = pd.read_csv("stock_data.csv", header=1)
print(df)

# No header row in the file, so supply the column names explicitly.
df = pd.read_csv("stock_data.csv", header=None,
                 names=["ticker", "eps", "revenue", "people"])
print(df)

# Read only the first two data rows.
df = pd.read_csv("stock_data.csv", nrows=2)
print(df)

# Treat the listed tokens as NaN in every column.
df = pd.read_csv("stock_data.csv", na_values=["n.a.", "not available"])
print(df)

# Per-column NaN tokens.
df = pd.read_csv("stock_data.csv", na_values={
    'eps': ['not available'],
    'revenue': [-1],
    'people': ['not available', 'n.a.']
})
print(df)

# Write to CSV without the index column.
df.to_csv("new.csv", index=False)

df.columns

Df.to_csv("new.csv",header=false)

Df.to_csv("new.csv", columns=["tickers","price"], index=false)

Df = pd.read_excel("stock_data.xlsx","sheet1")

Print(df)

Def convert_people_cell(cell):

if cell=="n.a.":

return 'sam walton'

return cell

Def convert_price_cell(cell):

if cell=="n.a.":

return 50

return cell

Df = pd.read_excel("stock_data.xlsx","sheet1", converters= {

'people': convert_people_cell,

'price': convert_price_cell

})

Print(df)

Df.to_excel("new.xlsx", sheet_name="stocks", index=false, startrow=2, startcol=1)

df_stocks = pd.DataFrame({
    'tickers': ['GOOGL', 'WMT', 'MSFT'],
    'price': [845, 65, 64],
    'pe': [30.37, 14.26, 30.97],
    'eps': [27.82, 4.61, 2.12]
})

df_weather = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017'],
    'temperature': [32, 35, 28],
    'event': ['rain', 'sunny', 'snow']
})

# Write both frames into one Excel file, each on its own sheet.
with pd.ExcelWriter('stocks_weather.xlsx') as writer:
    df_stocks.to_excel(writer, sheet_name="stocks")
    df_weather.to_excel(writer, sheet_name="weather")

Pandas with NaN values

import pandas as pd
import numpy as np

df = pd.read_csv("weather_data.csv")
print(df)

# Replace the sentinel value -99999 with NaN.
new_df = df.replace(-99999, value=np.nan)
print(new_df)

# Replace several sentinel values with 0.
new_df = df.replace(to_replace=[-99999, -88888], value=0)
print(new_df)

# Per-column replacement rules.
new_df = df.replace({
    'temperature': -99999,
    'windspeed': -99999,
    'event': '0'
}, np.nan)
print(new_df)

# When windspeed is "6 mph", "7 mph" etc. and temperature is "32 F",
# "28 F" etc., strip the letter suffixes with a regex replacement.
new_df = df.replace({'temperature': '[A-Za-z]', 'windspeed': '[a-z]'}, '', regex=True)
print(new_df)

df = pd.DataFrame({
    'score': ['exceptional', 'average', 'good', 'poor', 'average', 'exceptional'],
    'student': ['rob', 'maya', 'parthiv', 'tom', 'julian', 'erica']
})
df

# Map the ordinal labels to numbers.
df.replace(['poor', 'average', 'good', 'exceptional'], [1, 2, 3, 4])

Creating DataFrames using pandas

import pandas as pd

india_weather = pd.DataFrame({
    "city": ["mumbai", "delhi", "banglore"],
    "temperature": [32, 45, 30],
    "humidity": [80, 60, 78]
})
india_weather

us_weather = pd.DataFrame({
    "city": ["new york", "chicago", "orlando"],
    "temperature": [21, 14, 35],
    "humidity": [68, 65, 75]
})
us_weather

# Stack the two frames vertically.
df = pd.concat([india_weather, us_weather])
df

# ignore_index=True rebuilds a fresh 0..n-1 index.
df = pd.concat([india_weather, us_weather], ignore_index=True)
df

# keys build a hierarchical index that tags each source frame.
df = pd.concat([india_weather, us_weather], keys=["india", "us"])
df

df.loc["us"]
df.loc["india"]

temperature_df = pd.DataFrame({
    "city": ["mumbai", "delhi", "banglore"],
    "temperature": [32, 45, 30],
}, index=[0, 1, 2])
temperature_df

windspeed_df = pd.DataFrame({
    "city": ["delhi", "mumbai"],
    "windspeed": [7, 12],
}, index=[1, 0])
windspeed_df

# axis=1 joins side by side, aligning rows by index label.
df = pd.concat([temperature_df, windspeed_df], axis=1)
df

s = pd.Series(["humid", "dry", "rain"], name="event")
df = pd.concat([temperature_df, s], axis=1)
df

Basic commands of pandas

import pandas as pd

weather_data = {
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
    'wind_speed': [6, 7, 2, 7, 4, 2],
    'event': ['rain', 'sunny', 'snow', 'snow', 'rain', 'sunny']
}

df = pd.DataFrame(weather_data)
print(df)

rows, columns = df.shape
print("rows:", rows)
print("columns:", columns)

print(df.head())     # only the first 5 records
print(df.head(2))    # only the first two
print(df.tail(2))    # only the last two
print(df[2:5])       # slicing by row position

print(df.columns)
print(df['event'])          # show the event column
print(type(df['event']))    # its data type (a pandas Series)
print(df[['event', 'wind_speed', 'day']])  # only these three columns

print(df['temperature'].max())
print(df['temperature'].min())
print(df.describe())  # summary statistics

print(df[df.temperature >= 32])                    # rows with temperature at or above 32
print(df[df.temperature == df.temperature.max()])  # the hottest day(s)

# inplace=True saves the change in df itself; here we make 'day' the index.
df.set_index('day', inplace=True)
print(df)
print(df.loc['1/4/2017'])     # look up a row by its index label

df.reset_index(inplace=True)  # back to the default integer index
print(df)

df.set_index('event', inplace=True)  # change the index to 'event'
print(df)
print(df.loc['snow'])         # rows whose event is 'snow'