
Data Munging in Python: Using Pandas

Data munging is the process of cleaning and preparing raw data for analysis by addressing issues like missing values, outliers, and non-numerical fields. This document discusses using Pandas in Python for data munging, noting that the example data set contains missing values in some variables, extreme values in income and loan amount fields, and non-numerical fields like gender, area, and education that should be analyzed for useful information. It recommends reading an article on Pandas techniques for data manipulation before proceeding with the munging.

4. Data Munging in Python: Using Pandas

For those who have been following along, here are the must-wear shoes you need to start running.

Data munging: recap of the need

During our exploration of the data, we found a few problems in the data set that need to be
solved before the data is ready for a good model. This exercise is typically referred to as Data
Munging. Here are the problems we are already aware of:

1. Some variables have missing values. We should estimate those values wisely,
depending on how many values are missing and how important each variable is expected to be.
2. While looking at the distributions, we saw that ApplicantIncome and LoanAmount seemed to
contain extreme values at either end. Though these might make intuitive sense, they should be
treated appropriately.
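
As a rough illustration of the two problems above, the sketch below counts missing values per column and then applies one possible treatment to each problem. The toy rows are made up for this example; only the column names come from the text. Filling with the median and taking a log transform are assumptions chosen for simplicity, not necessarily the treatment the author has in mind.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the values are invented, only the column names
# (ApplicantIncome, LoanAmount, Gender) come from the text.
df = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000, 81000, 2583],
    "LoanAmount": [np.nan, 128.0, 66.0, 700.0, np.nan],
    "Gender": ["Male", "Male", None, "Female", "Male"],
})

# Problem 1: count the missing values in each column.
missing_counts = df.isnull().sum()
print(missing_counts)

# One simple estimate (an assumption): fill missing LoanAmount values
# with the median of the observed values.
df["LoanAmount_filled"] = df["LoanAmount"].fillna(df["LoanAmount"].median())

# Problem 2: one common way to dampen extreme values (also an assumption)
# is a log transform, which compresses the long right tail.
df["LoanAmount_log"] = np.log(df["LoanAmount_filled"])
print(df[["LoanAmount", "LoanAmount_filled", "LoanAmount_log"]])
```

The median is used here rather than the mean because it is less sensitive to the extreme values described in the second problem.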

In addition to these problems with numerical fields, we should also look at the non-numerical
fields, i.e. Gender, Property_Area, Married, Education and Dependents, to see if they contain
any useful information.
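
A quick first look at the non-numerical fields is a frequency table for each one. The sketch below assumes a small made-up DataFrame; the category labels such as "Semiurban" or "3+" are hypothetical placeholders, and only the column names come from the text.

```python
import pandas as pd

# Hypothetical toy data: the category labels are invented for illustration.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", None],
    "Property_Area": ["Urban", "Rural", "Urban", "Semiurban"],
    "Married": ["Yes", "No", "Yes", "Yes"],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
    "Dependents": ["0", "1", "3+", "0"],
})

categorical_cols = ["Gender", "Property_Area", "Married", "Education", "Dependents"]

# Build one frequency table per field; dropna=False keeps missing values
# visible as their own row instead of silently dropping them.
frequency_tables = {
    col: df[col].value_counts(dropna=False) for col in categorical_cols
}

for col, table in frequency_tables.items():
    print(f"\nFrequency table for {col}:")
    print(table)
```

Keeping the missing values in the counts ties this check back to the first problem, since the non-numerical fields may have gaps too.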

If you are new to Pandas, I would recommend reading this article before moving on. It
details some useful techniques for data manipulation.
