Data Wrangling
Data wrangling is the process of transforming raw data, for example by merging, grouping, or concatenating,
so that it can be analysed or made ready for use with another set of data. Python, mainly through libraries
such as pandas, provides features for applying these wrangling methods to a data set to reach the
analytical goal. Data wrangling is also known as data munging.
Data wrangling is a very important step in a data science project. The example below illustrates its
importance:
A book-selling website wants to show the top-selling books in different domains, according to user preference.
For example, if a new user searches for motivational books, the website wants to show the motivational
books that sell the most or have a high rating.
But the website holds a large amount of raw data from different users. This is where data munging, or
data wrangling, is used. Data wrangling is not done by the system itself; it is done by data scientists.
The data scientist wrangles the data so that the motivational books are sorted by how many copies were
sold, by their rating, or by which other books users bought together with them. The new user then makes
a choice on that basis.
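The sorting described above can be sketched with pandas. This is only an illustration: the table, its
column names (TITLE, GENRE, COPIES_SOLD, RATING) and all values are hypothetical, not data from a real website.

```python
import pandas as pd

# Hypothetical catalogue; titles, genres and figures are made up for illustration.
books = pd.DataFrame({
    "TITLE": ["Book A", "Book B", "Book C", "Book D"],
    "GENRE": ["motivational", "fiction", "motivational", "motivational"],
    "COPIES_SOLD": [1200, 3000, 4500, 800],
    "RATING": [4.1, 4.5, 4.7, 3.9],
})

# Keep only the motivational books, then sort by copies sold and rating, highest first.
motivational = books[books["GENRE"] == "motivational"]
top = motivational.sort_values(["COPIES_SOLD", "RATING"], ascending=False)

print(top)
```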
Data wrangling deals with the following functionalities:
1. Data exploration: In this process, the data is studied, analyzed, and understood by visualizing
representations of it.
2. Dealing with missing values: Most large datasets contain missing values (NaN). They need to be
taken care of, either by replacing them with the mean or the mode (the most frequent value) of the
column, or simply by dropping the rows that have a NaN value.
3. Reshaping data: In this process, data is manipulated according to the requirements; new
data can be added or pre-existing data can be modified.
4. Filtering data: Sometimes datasets contain unwanted rows or columns that need to be removed
or filtered out.
5. Other: After treating the raw dataset with the above functionalities, we get an efficient dataset
that meets our requirements. It can then be used for the required purpose, such as data analysis,
machine learning, data visualization, or model training.
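The filtering and reshaping steps in the list above can be sketched as follows. The dataframe, the REMARKS
column and the pass mark of 50 are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical records; REMARKS stands in for an unwanted column.
df = pd.DataFrame({
    "NAME": ["Jai", "Princi", "Gaurav", "Anuj"],
    "MARKS": [90, 76, 45, 74],
    "REMARKS": ["", "", "", ""],
})

# Filtering columns: drop the column that is not needed.
df = df.drop(columns=["REMARKS"])

# Reshaping: add a new column derived from existing data (50 is an assumed pass mark).
df["PASSED"] = df["MARKS"] >= 50

# Filtering rows: keep only the students who passed.
passed = df[df["PASSED"]]

print(passed)
```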
Data exploration in Python: Here we load the data into a dataframe and then view the data in a
tabular format.
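A minimal sketch of this step, assuming pandas and NumPy are installed. The student records are
hypothetical; only the MARKS and GENDER column names come from the text, the rest is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical student records; np.nan marks a missing value.
data = {
    "NAME": ["Jai", "Princi", "Gaurav", "Anuj", "Ravi"],
    "AGE": [17, 17, 18, 17, 18],
    "GENDER": ["M", "F", "M", "M", "M"],
    "MARKS": [90, 76, np.nan, 74, np.nan],
}
df = pd.DataFrame(data)

print(df)               # tabular view of the whole dataframe
print(df.isna().sum())  # number of missing values in each column
```

Printing the dataframe shows NaN wherever a mark is missing, and `isna().sum()` gives a per-column count
of those gaps.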
In the example dataframe, the MARKS column contains NaN values, which are missing values. Data
wrangling takes care of them by replacing them with the column mean.
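Replacing the missing marks with the column mean could look like the sketch below. The records are
hypothetical and the dataframe is rebuilt here so that the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical records with two missing marks.
df = pd.DataFrame({
    "NAME": ["Jai", "Princi", "Gaurav", "Anuj", "Ravi"],
    "MARKS": [90, 76, np.nan, 74, np.nan],
})

# Series.mean() skips NaN by default, so this is (90 + 76 + 74) / 3 = 80.0.
mean_marks = df["MARKS"].mean()

# Replace every NaN in MARKS with that mean.
df["MARKS"] = df["MARKS"].fillna(mean_marks)

print(df)
```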
In the GENDER column, we can replace the values by encoding each category as a different
number.
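One way to do this encoding is with `Series.map` and a dictionary. The codes chosen here (0 for "M",
1 for "F") and the records are arbitrary assumptions for illustration.

```python
import pandas as pd

# Hypothetical records with a categorical GENDER column.
df = pd.DataFrame({
    "NAME": ["Jai", "Princi", "Gaurav", "Anuj", "Geeku"],
    "GENDER": ["M", "F", "M", "M", "F"],
})

# Assumed encoding; any consistent mapping from category to number would do.
gender_codes = {"M": 0, "F": 1}

# map() looks up each value in the dictionary; values not in it would become NaN.
df["GENDER"] = df["GENDER"].map(gender_codes)

print(df)
```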
EXAMPLE: Suppose that a teacher has two sets of data: the first contains the details of the
students, and the second contains the pending fees status taken from the accounts office.
The teacher can use the merge operation to combine the two sets into one meaningful table.
The merged data is easier to analyze, and the teacher saves the time and effort of merging
the records manually.
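The teacher's merge can be sketched with `pandas.merge`. Both tables, the shared ID key and the
PENDING_FEES column are hypothetical; the example assumes every student appears once in each table.

```python
import pandas as pd

# Hypothetical student details.
students = pd.DataFrame({
    "ID": [101, 102, 103, 104],
    "NAME": ["Jai", "Princi", "Gaurav", "Anuj"],
    "BRANCH": ["CSE", "ECE", "CSE", "ME"],
})

# Hypothetical pending fees from the accounts office, listed in a different order.
fees = pd.DataFrame({
    "ID": [103, 101, 104, 102],
    "PENDING_FEES": [0, 5000, 2500, 0],
})

# Match rows on the shared ID column (an inner join by default).
merged = pd.merge(students, fees, on="ID")

print(merged)
```

Because the rows are matched on ID, the order of the fees table does not matter; each student ends up
next to the right amount.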