
Lab Assignment 1

Title: Data Wrangling I

PROBLEM STATEMENT:

Perform the following operations using Python on any open source dataset (e.g., data.csv)
1. Import all the required Python Libraries.
2. Locate an open source dataset on the web (e.g., https://www.kaggle.com). Provide a clear description of the
data and its source (i.e., the URL of the web site).
3. Load the dataset into a pandas DataFrame.
4. Data Preprocessing: check for missing values in the data using the pandas isnull() function, and use the
describe() function to get some initial statistics. Provide variable descriptions and the types of variables, and
check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by checking the data types (i.e.,
character, numeric, integer, factor, and logical) of the variables in the data set. If variables are not in the
correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.

THEORY:

What is Data Wrangling?


Data Munging, commonly referred to as Data Wrangling, is the cleaning and transforming of data from one
form to another to make it more appropriate for analysis. Data wrangling involves processing data in various
formats, analyzing it, combining it with other data sets, and bringing it together into valuable insights. It further
includes data aggregation, data visualization, and training statistical models for prediction. Data wrangling is
one of the most important steps of the data science process: the quality of an analysis is only as good as the
quality of the data itself, so it is very important to maintain data quality.

NEED FOR WRANGLING:


Wrangling the data is crucial; it is considered the backbone of the entire analysis. The main purpose of
data wrangling is to make raw data usable, in other words, to get the data into shape. On average, data scientists
spend around 75% of their time wrangling data, which is not a surprise at all. The important needs of data
wrangling include:

• It ensures the quality of the data.

• It supports timely decision-making and speeds up data insights.

• Noisy, flawed, and missing data are cleaned.

• It makes sense of the resultant dataset, gathering data that acts as a preparation stage for the data
mining process.

• It helps in making concrete decisions by cleaning and structuring raw data into the required format.

• Raw data are pieced together into the required format.

• To create a transparent and efficient system for data management, the best solution is to have all data
in a centralized location so it can be used, which also improves compliance.

• Wrangling helps analysts make decisions promptly and helps the wrangler clean, enrich, and transform
the data into a complete picture.

DATA WRANGLING STEPS:

1. DISCOVERING:

Discovering is the first step of the analytic process and a good way to learn how the data can be used; it
brings out the best approach for analytics exploration. It is the step in which the data is understood more
deeply.

2. STRUCTURING:

Raw data arrives in no particular order; in most cases it has no structure, because raw data comes in many
formats and in different shapes and sizes. The data must be organized in a manner that lets the analyst use it
in the analysis.

3. CLEANING:

High-quality analysis depends on this step, where every piece of data is checked carefully and redundancies
that do not fit the analysis are removed. Data containing null values has to be changed, either to an empty
string or to zero, and formatting is standardized to make the data of higher quality. The goal of data cleaning,
or remediation, is to ensure that there is no way the final data taken for analysis could be unduly influenced.
4. ENRICHING:

Enriching means adding meaning to the data. In this step, new kinds of data are derived from the data that
already exists after cleaning and formatting. This is where you need a strategy for the data in hand, to make
sure you end up with the best-enriched data. Common ways to obtain refined data are to down-sample it,
upscale it, and finally augment it.

5. VALIDATING:

Data quality rules are used for analysis and evaluation of the quality of a specific data set. After processing
the data, its quality and consistency are verified, which establishes a strong defense against security issues.
These checks should be conducted along multiple dimensions and adhere to syntactic constraints.

6. PUBLISHING:

The final part of data wrangling is publishing, which serves the sole purpose of the entire wrangling process.
Analysts prepare the wrangled data for use further down the line; that is its purpose, after all. The finalized
data must match the format of its eventual target. Now the prepared data can be used for analytics.

DATA WRANGLING IN PYTHON:

Pandas is an open-source Python library mainly used for data analysis. Data wrangling with pandas deals with
the following functionalities:

• Data exploration: visualization of the data is used to analyze and understand it.

• Dealing with missing values: missing values are a common issue when dealing with large data sets,
and care must be taken to replace them. They can be replaced by the mean or mode, or simply labelled
as NaN values.

• Reshaping data: the data is modified and manipulated according to the requirements, starting from
the pre-existing data.

• Filtering data: unwanted rows and columns are filtered out and removed, which brings the data into a
compressed format.

• Others: after turning the raw data into an efficient dataset, it is brought into use for data visualization,
data analysis, training models, etc.
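The functionalities above can be sketched on a small pandas DataFrame. The data, column names, and the filtering threshold below are purely hypothetical, chosen only to illustrate each operation:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing score
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dia"],
    "score": [85.0, np.nan, 92.0, 74.0],
    "city": ["Pune", "Pune", "Mumbai", "Delhi"],
})

# Data exploration: dimensions, types, and summary statistics
print(df.shape)        # (4, 3)
print(df.dtypes)
print(df.describe())

# Dealing with missing values: fill the NaN score with the column mean
df["score"] = df["score"].fillna(df["score"].mean())

# Filtering data: keep only rows that meet a condition
high_scores = df[df["score"] > 80]

# Reshaping data: aggregate rows per city
rows_per_city = df.groupby("city").size()
```

Each step above corresponds to one bullet in the list; a real assignment would apply the same calls to the downloaded dataset instead of this toy frame.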

How is Data Preprocessing performed?

Data Preprocessing is carried out to remove the causes of unformatted real-world data which we discussed
above. First of all, let's explain how missing data can be handled during data preparation. Three different
approaches can be taken, given below:

• Ignoring the missing record - This is the simplest and most efficient method for handling missing data.
But this method should not be used when the number of missing values is immense, or when the
pattern of the missing data is related to an unrecognized root cause of the problem being studied.

• Filling the missing values manually - This is one of the best methods of the data preparation
process. But it has one limitation: when the data set is large and the missing values are significant,
this approach is not efficient, as it becomes a time-consuming task.

• Filling using computed values - The missing values can also be filled by computing the mean, mode,
or median of the observed values. Another method is to predict the missing values using machine
learning or deep learning tools and algorithms. One drawback of this approach is that it can introduce
bias into the data, as the computed values are not exactly accurate with respect to the observed values.
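The first and third approaches can be sketched with pandas as follows; the dataset and column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing entries in both columns
df = pd.DataFrame({
    "age": [25.0, np.nan, 30.0, np.nan, 35.0],
    "grade": ["A", "B", None, "B", "A"],
})

# 1. Ignoring the missing record: drop any row containing a missing value
dropped = df.dropna()

# 3. Filling using computed values: mean for numeric, mode for categorical
df["age"] = df["age"].fillna(df["age"].mean())           # mean of 25, 30, 35 = 30.0
df["grade"] = df["grade"].fillna(df["grade"].mode()[0])  # most frequent grade
```

(Approach 2, filling manually, has no single code form; it means inspecting records and typing in corrections by hand.)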

Data Formatting

• Incorrect data types

We should make sure that every column is assigned the correct data type. This can be checked through the
dtypes property:

df.dtypes

which gives the following output:

Tweet Id object
Tweet URL object
Tweet Posted Time (UTC) object
Tweet Content object
Tweet Type object
Client object
Retweets Received int64
Likes Received int64
Tweet Location object
Tweet Language object
User Id object
Name object
Username object
User Bio object
Verified or Non-Verified object
Profile URL object
Protected or Non-protected object
User Followers int64
User Following int64
User Account Creation Date object
Impressions int64
dtype: object

We can convert the column Tweet Location to string by using the function astype() as follows:

df['Tweet Location'] = df['Tweet Location'].astype('string')


Data Normalization with Pandas
Data normalization is a typical practice in machine learning which consists of transforming numeric
columns to a common scale. In machine learning, some feature values can be many times larger than others;
the features with larger values will then dominate the learning process.

Data Normalization involves adjusting values measured on different scales to a common scale.

Normalization applies only to columns containing numeric values. Normalization methods are:

• Simple feature scaling

• min max

• z-score

Simple feature scaling

x_new = x / x_max

Min-Max scaling

x_new = (x − x_min) / (x_max − x_min)

Z-score normalization

Z = (x − μ) / σ
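The three methods can be sketched in pandas; the income column and its values are hypothetical:

```python
import pandas as pd

# Hypothetical numeric column to normalize
df = pd.DataFrame({"income": [30000.0, 50000.0, 70000.0, 90000.0]})

# Simple feature scaling: x / x_max, values end up in (0, 1]
df["income_simple"] = df["income"] / df["income"].max()

# Min-max scaling: (x - x_min) / (x_max - x_min), values end up in [0, 1]
rng = df["income"].max() - df["income"].min()
df["income_minmax"] = (df["income"] - df["income"].min()) / rng

# Z-score: (x - mean) / std, centered at 0 with unit standard deviation
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()
```

Simple feature scaling only divides by the maximum; min-max additionally shifts by the minimum; z-score uses the mean and standard deviation (note that pandas' std() defaults to the sample standard deviation, ddof=1).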

Convert Categorical Variable to Numeric

When we look at categorical data, the first question that arises is how to handle it, because machine
learning algorithms deal best with numeric values. We cannot build machine learning models directly on raw
text data, so to make predictive models we have to convert categorical data into numeric form.

Method 1: Using replace() method


Replacing is one of the methods to convert categorical terms into numeric. For example, We will take a dataset
of people’s salaries based on their level of education. This is an ordinal type of categorical variable. We will
convert their education levels into numeric terms.

Syntax:

replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')
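A minimal sketch of this method, assuming a hypothetical salary dataset with an ordinal education column:

```python
import pandas as pd

# Hypothetical salary dataset; education is an ordinal categorical variable
df = pd.DataFrame({
    "education": ["High School", "Bachelor", "Master", "Bachelor"],
    "salary": [30000, 50000, 70000, 55000],
})

# Replace each education level with a rank that preserves its natural order
df["education"] = df["education"].replace(
    {"High School": 0, "Bachelor": 1, "Master": 2}
)
print(df["education"].tolist())  # [0, 1, 2, 1]
```

Because education levels have a natural order, this integer mapping is appropriate; for unordered (nominal) categories, one-hot encoding (Method 2) is usually the better choice.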

Method 2: Using get_dummies() / One Hot Encoding

Replacing the values is not the most efficient way to convert them. Pandas provides a method called
get_dummies which returns dummy-variable columns.

Syntax: pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None,
sparse=False, drop_first=False, dtype=None)

One-Hot Encoding: The Standard Approach for Categorical Data

One-hot encoding is the most widespread approach, and it works very well unless your categorical variable
takes on a large number of values. One-hot encoding creates new (binary) columns, indicating the presence of
each possible value from the original data. It uses the get_dummies() method.
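A minimal sketch, assuming a hypothetical city column:

```python
import pandas as pd

# Hypothetical nominal categorical column
df = pd.DataFrame({"city": ["Pune", "Mumbai", "Pune", "Delhi"]})

# One binary column per distinct value, named with the given prefix
dummies = pd.get_dummies(df["city"], prefix="city")
df = df.join(dummies)
print(list(dummies.columns))  # ['city_Delhi', 'city_Mumbai', 'city_Pune']
```

With drop_first=True, get_dummies omits the first category, which avoids perfectly collinear columns in models that include an intercept.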

Method 3: Using Label Encoding

Label Encoding refers to converting labels into numeric form so as to make them machine-readable.
Machine learning algorithms can then decide in a better way how those labels should be operated on.
It is an important preprocessing step for structured datasets in supervised learning.

Example:
Suppose we have a column Height in some dataset with the values tall, medium, and short.
After applying label encoding, the Height column is converted into integer codes. Note that sklearn's
LabelEncoder assigns codes in sorted label order, so medium becomes 0, short becomes 1, and tall becomes 2.

Example:

# Import label encoder
from sklearn import preprocessing

# label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()

# Encode labels in column 'Height'.
df['Height'] = label_encoder.fit_transform(df['Height'])

df['Height'].unique()

Procedure:

STEP 1: IMPORTING THE LIBRARIES

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

STEP 2: IMPORT THE DATASET

path = "C:/Users/Admin/Desktop/DYPIEMR Data/DSBDA Lab/wrangled_data.csv"
df = pd.read_csv(path)
print(df)

STEP 3: DATA PREPROCESSING: CHECK FOR MISSING VALUES IN THE DATA USING PANDAS ISNULL()

df.isnull()
df

STEP 4: DESCRIBE() FUNCTION TO GET SOME INITIAL STATISTICS

df.describe()
# Check the dimensions of the data frame
df.shape
# Total number of elements in the dataframe
df.size

STEP 5: DATA FORMATTING

df.dtypes
df["column_name"].astype("new_type")
df = df.astype({"engine-location": 'category', "horsepower": 'int64'})

PROGRAM:

CONCLUSION:
Students will understand how important data wrangling is and how, by using different
techniques, optimized results can be obtained. Hence, wrangle the data
before processing it for analysis.
