Lab Assignment 1 Title: Data Wrangling I: Problem Statement
PROBLEM STATEMENT:
Perform the following operations using Python on any open source dataset (e.g., data.csv)
1. Import all the required Python Libraries.
2. Locate an open-source dataset on the web (e.g., https://www.kaggle.com). Provide a clear description of the
data and its source (i.e., the URL of the web site).
3. Load the dataset into a pandas dataframe.
4. Data Preprocessing: check for missing values in the data using the pandas isnull() function, and use the
describe() function to get initial statistics. Provide variable descriptions and the types of variables, and check
the dimensions of the dataframe.
5. Data Formatting and Data Normalization: summarize the types of variables by checking the data types (i.e.,
character, numeric, integer, factor, and logical) of the variables in the dataset. If variables are not of the
correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.
THEORY:
• It makes the resulting dataset meaningful, gathering data as a preparation stage for the data
mining process.
• It helps to make concrete decisions by cleaning and structuring raw data into the required
format.
• To create a transparent and efficient system for data management, the best solution is to keep all data
in a centralized location so it can be used to improve compliance.
• Wrangling the data helps decisions be made promptly and helps the wrangler clean, enrich, and transform
the data into a clear picture.
1. DISCOVERING:
Discovering frames the entire analytic process: it is where you learn how the data can be used and explored,
and it brings out the best approach for the analysis. It is the step in which the data is understood
more deeply.
2. STRUCTURING:
Raw data arrives in no particular order and in most cases has no structure, because it comes in
many formats of different shapes and sizes. The data must be organized in such a manner that the analyst
can use it in the analysis.
3. CLEANING:
High-quality analysis depends on this step: every piece of data is checked carefully, and redundancies that
do not fit the analysis are removed. Null values are replaced, either with an empty string or with zero, and
formatting is standardized to raise the quality of the data. The goal of data cleaning, or remediation, is to
ensure that nothing can distort the final data taken forward for analysis.
4. ENRICHING:
Enriching means adding meaning to the data. In this step, new kinds of data are derived from the data that
already exists after cleaning and formatting. This is where you strategize about the data in hand and make
sure that what you have is the best-enriched data. Common ways to refine the data are to downsample it,
upscale it, and finally augment it.
5. VALIDATING:
Data quality rules are used to analyze and evaluate the quality of a specific dataset. After processing the
data, its quality and consistency are verified, which provides a solid basis for addressing security issues.
Validation should be conducted along multiple dimensions and should adhere to syntactic constraints.
6. PUBLISHING:
The final part of data wrangling is publishing, which serves the sole purpose of the entire wrangling process:
analysts prepare the wrangled data for use further down the line. The finalized data must match the format
expected by its eventual target. The processed data can now be used for analytics.
Pandas is an open-source Python library mainly used for data analysis. Data wrangling covers the following
functionalities.
• Data exploration: the data is visualized in order to analyze and understand it.
• Dealing with missing values: missing values are a common issue when dealing with large datasets,
and care must be taken to replace them. They can be replaced by the mean or mode, or simply
labelled as NaN.
• Reshaping data: the data is modified and manipulated according to the requirements, either by
changing how pre-existing data is addressed or by restructuring it.
• Filtering data: unwanted rows and columns are filtered out and removed, which brings the data into a
more compact form (see the sketch after this list).
• Others: after the raw data has been turned into an efficient dataset, it becomes useful for data
visualization, data analysis, training a model, etc.
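As a small illustration of the filtering functionality above, a sketch on a hypothetical dataframe (all column
names and values are placeholders):

import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C'],
                   'score': [55, 90, 72],
                   'unused': [1, 2, 3]})

# Keep only the rows with score above 60, then drop an unwanted column
df = df[df['score'] > 60].drop(columns=['unused'])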
Data preprocessing is carried out to remove the problems of unformatted real-world data that we discussed above.
First of all, let's explain how missing data can be handled during data preparation. Three different approaches
can be taken, given below; a code sketch of the third follows this list.
• Ignoring the missing record - This is the simplest and most efficient method for handling missing data.
However, it should not be used when the number of missing values is immense or when the pattern
of missingness is related to the unrecognized root cause of the problem statement.
• Filling the missing values manually - This is one of the best methods of the data preparation
process. Its limitation is that it is not efficient for large datasets with many missing values,
since it becomes a time-consuming task.
• Filling using computed values - The missing values can also be filled in by computing the mean, mode,
or median of the observed values. Alternatively, predicted values computed with machine learning or
deep learning tools and algorithms can be used. One drawback of this approach is that it can introduce
bias into the data, because the computed values are not exact with respect to the observed values.
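A minimal sketch of the third approach; the dataframe df and its columns are hypothetical:

import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [25, np.nan, 32, 41, np.nan],
                   'city': ['Pune', 'Mumbai', None, 'Pune', 'Delhi']})

# Count missing values per column
print(df.isnull().sum())

# Numeric column: fill with the mean; categorical column: fill with the mode
df['age'] = df['age'].fillna(df['age'].mean())
df['city'] = df['city'].fillna(df['city'].mode()[0])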
Data Formatting
We should make sure that every column is assigned the correct data type. This can be checked through the
dtypes property.
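Assuming the dataset has been loaded into a dataframe named df (the name is an assumption), the check is a
one-liner:

df.dtypes

For the Twitter dataset used in this example, it prints: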
Tweet Id object
Tweet URL object
Tweet Posted Time (UTC) object
Tweet Content object
Tweet Type object
Client object
Retweets Received int64
Likes Received int64
Tweet Location object
Tweet Language object
User Id object
Name object
Username object
User Bio object
Verified or Non-Verified object
Profile URL object
Protected or Non-protected object
User Followers int64
User Following int64
User Account Creation Date object
Impressions int64
dtype: object
We can convert the column Tweet Location to string by using the astype() function; a minimal sketch, assuming
the dataframe is named df, follows:
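# 'df' is an assumed name for the dataframe holding the Twitter dataset
df['Tweet Location'] = df['Tweet Location'].astype(str)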
Data Normalization involves adjusting values measured on different scales to a common scale.
Normalization applies only to columns containing numeric values. Two common normalization methods are:
• Min-max scaling: X' = (x − min(x)) / (max(x) − min(x)), which rescales the values to the range [0, 1].
• Z-score normalization: Z = (x − μ) / σ, where μ is the mean and σ is the standard deviation of the column.
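A minimal sketch of both methods, applied to the numeric Impressions column from the dataset above (the choice
of column is an assumption):

# Min-max scaling: rescale the values to the [0, 1] range
df['Impressions_minmax'] = (df['Impressions'] - df['Impressions'].min()) / (df['Impressions'].max() - df['Impressions'].min())

# Z-score normalization: subtract the mean, divide by the standard deviation
df['Impressions_zscore'] = (df['Impressions'] - df['Impressions'].mean()) / df['Impressions'].std()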
When we look at categorical data, the first question that arises is how to handle it, because machine learning
algorithms deal best with numeric values; predictive models cannot be built directly on raw text data. So, to
build predictive models we have to convert categorical data into numeric form. Three common methods are given below.
Method 1: Replacing values
Categorical values can be mapped to numbers directly, for example with the pandas replace() method. However,
replacing the values manually is not the most efficient way to convert them.
Method 2: One-hot encoding
One-hot encoding is the most widespread approach, and it works very well unless the categorical variable
takes on a large number of values. It creates a new binary column for each possible value in the original
data, indicating the presence of that value. Pandas provides this through the get_dummies() method, which
returns the dummy variable columns; a sketch follows.
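A minimal sketch of get_dummies(), assuming the dataframe df and using the categorical Tweet Type column from
the dataset above:

import pandas as pd

# One binary (0/1) column is created per distinct value of 'Tweet Type'
dummies = pd.get_dummies(df['Tweet Type'], prefix='Tweet Type')
df = pd.concat([df, dummies], axis=1)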
Method 3: Label encoding
Label encoding refers to converting the labels into numeric form so as to make them machine-readable.
Machine learning algorithms can then better decide how those labels should be operated on. It is an important
preprocessing step for structured datasets in supervised learning.
Example:
Suppose we have a column Height in some dataset, with the values tall, medium, and short. After applying
label encoding, the Height column is converted to numeric labels, where 0 is the label for tall, 1 is the
label for medium, and 2 is the label for short.
Example:
# Import the label encoder from scikit-learn
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
# Encode the labels in the 'Height' column and inspect the distinct codes
df['Height'] = label_encoder.fit_transform(df['Height'])
df['Height'].unique()
PROCEDURE:
PROGRAM:
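A minimal end-to-end sketch covering the steps in the problem statement; the file name data.csv comes from the
problem statement, while the commented-out column names are hypothetical placeholders:

# Step 1: import the required libraries
import pandas as pd
import numpy as np
from sklearn import preprocessing

# Step 3: load the dataset into a pandas dataframe
df = pd.read_csv('data.csv')

# Step 4: data preprocessing - missing values, initial statistics, dimensions
print(df.isnull().sum())   # missing values per column
print(df.describe())       # initial statistics for numeric columns
print(df.dtypes)           # types of the variables
print(df.shape)            # dimensions of the dataframe (rows, columns)

# Step 5: data formatting - apply a type conversion where needed
# ('some_column' is a hypothetical placeholder)
# df['some_column'] = df['some_column'].astype(str)

# Step 6: turn categorical variables into quantitative variables
# ('category_column' is a hypothetical placeholder)
# df = pd.get_dummies(df, columns=['category_column'])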
CONCLUSION:
They will understand how important data wrangling is for data and using
different techniques optimizedresults can be obtained. Hence wrangle the data,
before processing for analysis.