Exp 3

This document outlines various data transformation techniques using pandas, including merging data frames, reshaping data with hierarchical indexing, handling missing data, and data deduplication. It provides sample experiments to perform these operations, such as merging data frames, filling missing values, and replacing values in data frames. The document also emphasizes the importance and challenges of data transformation in data analysis.

Uploaded by

damisettilohitha
UNIT-3

(EXPERIMENTS-3)

Data Transformation: Merging database-style data frames, concatenating along
an axis, merging on index, Reshaping and pivoting, melting, Transformation
techniques, handling missing data, Mathematical operations with NaN, Filling
missing values, Discretization and binning, Outlier detection and filtering,
Permutation and random sampling, Benefits of data transformation, Challenges

Sample Experiments:
1. Perform the following operations
a) Merging Dataframes
b) Reshaping with Hierarchical Indexing
c) Data Deduplication
d) Replacing Values
2. Apply different Missing Data handling techniques
a) NaN values in mathematical Operations
b) Filling in missing data
c) Forward and Backward filling of missing values
d) Filling with index values
e) Interpolation of missing values
3. Apply different data transformation techniques
a) Renaming axis indexes
b) Discretization and Binning
c) Permutation and Random Sampling
d) Dummy variables
1. Perform the following operations
a) Merging Data frames
b) Reshaping with Hierarchical Indexing
c) Data Deduplication
d) Replacing Values

a) Merging Dataframes
Creating First Dataframe to Perform Merge Operation
# import module
import pandas as pd

# creating DataFrame for Student Details
details = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'NAME': ['Jagroop', 'Praveen', 'Harjot', 'Pooja', 'Rahul',
             'Nikita', 'Saurabh', 'Ayush', 'Dolly', 'Mohit'],
    'BRANCH': ['CSE', 'CSE', 'CSE', 'CSE', 'CSE', 'CSE', 'CSE',
               'CSE', 'CSE', 'CSE']})

# printing details
print(details)

Creating Second Dataframe to Perform Merge operation


# Import module
import pandas as pd

# Creating Dataframe for Fees_Status
fees_status = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'PENDING': ['5000', '250', 'NIL', '9000', '15000', 'NIL', '4500', '1800',
                '250', 'NIL']})

# Printing fees_status
print(fees_status)

Merge Operation
# Import module
import pandas as pd

# Creating the student details Dataframe
details = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105,
           106, 107, 108, 109, 110],
    'NAME': ['Jagroop', 'Praveen', 'Harjot',
             'Pooja', 'Rahul', 'Nikita',
             'Saurabh', 'Ayush', 'Dolly', 'Mohit'],
    'BRANCH': ['CSE', 'CSE', 'CSE', 'CSE', 'CSE',
               'CSE', 'CSE', 'CSE', 'CSE', 'CSE']})

# Creating the fees status Dataframe
fees_status = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105,
           106, 107, 108, 109, 110],
    'PENDING': ['5000', '250', 'NIL',
                '9000', '15000', 'NIL',
                '4500', '1800', '250', 'NIL']})

# Merging both Dataframes on the shared 'ID' column
print(pd.merge(details, fees_status, on='ID'))
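The merge above matches every ID in both frames, so the join type does not matter there. As a minimal sketch of what happens when the keys do not line up, the example below uses a smaller, hypothetical pair of frames (ID 104 has no fee record and ID 105 has no student record; these values are assumptions, not part of the experiment data) and compares the default inner join with an outer join.

```python
import pandas as pd

# Hypothetical data: ID 104 has no fee record, ID 105 has no student record
details = pd.DataFrame({'ID': [101, 102, 103, 104],
                        'NAME': ['Jagroop', 'Praveen', 'Harjot', 'Pooja']})
fees_status = pd.DataFrame({'ID': [101, 102, 103, 105],
                            'PENDING': ['5000', '250', 'NIL', '15000']})

# Default how='inner' keeps only the IDs present in both frames
inner = pd.merge(details, fees_status, on='ID')

# how='outer' keeps every ID; indicator=True adds a '_merge' column
# that records which frame each row came from
outer = pd.merge(details, fees_status, on='ID', how='outer', indicator=True)

print(inner)
print(outer)
```

Rows that exist in only one frame get NaN in the columns coming from the other frame, and the '_merge' column makes those rows easy to filter.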


Two Dataframes for Concatenation:


# importing pandas module
import pandas as pd

# Define a dictionary containing employee data
data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Age': [27, 24, 22, 32],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
         'Qualification': ['Msc', 'MA', 'MCA', 'Phd'],
         'Mobile No': [97, 91, 58, 76]}

# Define a second dictionary containing employee data
data2 = {'Name': ['Gaurav', 'Anuj', 'Dhiraj', 'Hitesh'],
         'Age': [22, 32, 12, 52],
         'Address': ['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
         'Qualification': ['MCA', 'Phd', 'Bcom', 'B.hons'],
         'Salary': [1000, 2000, 3000, 4000]}

# Convert the dictionaries into DataFrames with explicit index labels
df = pd.DataFrame(data1, index=[0, 1, 2, 3])
df1 = pd.DataFrame(data2, index=[2, 3, 6, 7])

# Concatenate row-wise; columns missing from one frame are filled with NaN
res = pd.concat([df, df1])
print(res)
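The call above stacks the two frames row-wise and keeps the original index labels. The sketch below, on a reduced and hypothetical pair of frames, shows a few other options of pd.concat that are often useful: renumbering the index, keeping only shared columns, and concatenating along the column axis.

```python
import pandas as pd

# Reduced, hypothetical frames; index label 1 appears in both
df = pd.DataFrame({'Name': ['Jai', 'Princi'], 'Age': [27, 24]}, index=[0, 1])
df1 = pd.DataFrame({'Name': ['Gaurav', 'Anuj'], 'Salary': [1000, 2000]},
                   index=[1, 2])

# Row-wise: union of columns, missing cells become NaN, index labels are kept
rows = pd.concat([df, df1])

# ignore_index=True renumbers the result 0..n-1
rows_renumbered = pd.concat([df, df1], ignore_index=True)

# join='inner' keeps only the columns common to both frames
common = pd.concat([df, df1], join='inner', ignore_index=True)

# axis=1 aligns on index labels and places the frames side by side
side = pd.concat([df, df1], axis=1)

print(rows)
print(side)
```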
b) Reshaping with Hierarchical Indexing
Hierarchical indexing is an important feature of pandas that enables you to have
multiple (two or more) index levels on an axis. It provides a way to work with
higher-dimensional data in a lower-dimensional form.
import numpy as np
import pandas as pd
data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data
a  1   -0.214941
   2    2.147522
   3    0.564280
b  1    1.059833
   3   -1.104780
c  1    0.210634
   2    1.423999
d  2   -1.256163
   3   -1.129026
dtype: float64
data.shape
(9,)
data.ndim
1
data.index
MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )
data['b']
1 1.059833
3 -1.104780
dtype: float64
data['b':'c']
b  1    1.059833
   3   -1.104780
c  1    0.210634
   2    1.423999
dtype: float64
data.loc[['b','d']]
b  1    1.059833
   3   -1.104780
d  2   -1.256163
   3   -1.129026
dtype: float64
data.loc[:,2] # inner level selection a2, c2, d2
a 2.147522
c 1.423999
d -1.256163
dtype: float64
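Inner-level selection can also be written with xs(), which names the level explicitly. The sketch below uses fixed values instead of the random ones above so that the result is reproducible; the values themselves are assumptions chosen for illustration.

```python
import pandas as pd

# Fixed, hypothetical values so the selection is reproducible
data = pd.Series([10, 20, 30, 40, 50, 60],
                 index=[['a', 'a', 'b', 'b', 'c', 'c'],
                        [1, 2, 1, 3, 2, 3]])

# Select every entry whose inner-level label is 2
by_loc = data.loc[:, 2]

# Same selection, with the level stated explicitly (0 = outer, 1 = inner)
by_xs = data.xs(2, level=1)

print(by_xs)
```

Both forms return a Series indexed by the remaining outer level; 'b' is absent because it has no entry with inner label 2.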
Hierarchical indexing plays an important role in reshaping data and in group-based
operations such as forming a pivot table. unstack() moves the inner index level into
columns, and stack() reverses it.
data.unstack()
          1         2         3
a -0.214941  2.147522  0.564280
b  1.059833       NaN -1.104780
c  0.210634  1.423999       NaN
d       NaN -1.256163 -1.129026
data.unstack().stack()
a 1 -0.214941
2 2.147522
3 0.564280
b 1 1.059833
3 -1.104780
c 1 0.210634
2 1.423999
d 2 -1.256163
3 -1.129026
dtype: float64
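The reshaping calls above accept a few options that are easy to miss. The following is a minimal sketch on a small Series with fixed, hypothetical values. It calls dropna() after stack() explicitly, on the assumption that some pandas versions keep the NaN rows when stacking while older ones drop them by default.

```python
import pandas as pd

# Small, hypothetical Series: ('a', 3) and ('b', 2) are missing
data = pd.Series([1.0, 2.0, 3.0, 4.0],
                 index=[['a', 'a', 'b', 'b'], [1, 2, 1, 3]])

# Inner level becomes the columns; missing combinations become NaN
wide = data.unstack()

# fill_value supplies a replacement for the missing combinations
wide_filled = data.unstack(fill_value=0)

# level=0 moves the outer level into the columns instead
outer_as_cols = data.unstack(level=0)

# Back to long form; dropna() removes the NaN rows regardless of version
round_trip = wide.stack().dropna()

print(wide)
print(round_trip)
```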

c) Data Deduplication
Removing duplicate data from the dataset
# import module
import pandas as pd

# initializing Data
student_data = {'Name': ['Amit', 'Praveen', 'Jagroop',
                         'Rahul', 'Vishal', 'Suraj',
                         'Rishab', 'Satyapal', 'Amit',
                         'Rahul', 'Praveen', 'Amit'],
                'Roll_no': [23, 54, 29, 36, 59, 38,
                            12, 45, 34, 36, 54, 23],
                'Email': ['[email protected]', '[email protected]',
                          '[email protected]', '[email protected]',
                          '[email protected]', '[email protected]',
                          '[email protected]', '[email protected]',
                          '[email protected]', '[email protected]',
                          '[email protected]',
                          '[email protected]']}

# creating dataframe
df = pd.DataFrame(student_data)

# df.duplicated('Roll_no') marks every repeated Roll_no (after its first
# occurrence) as True, so ~ (NOT) selects the non-duplicate rows.
non_duplicate = df[~df.duplicated('Roll_no')]

# printing non-duplicate values
print(non_duplicate)
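The boolean-mask approach above is equivalent to drop_duplicates() with its default settings. As a short sketch on a reduced, hypothetical subset of the student data, the keep parameter controls which of the repeated rows survives.

```python
import pandas as pd

# Reduced, hypothetical subset: Roll_no 36 and 23 each appear twice
df = pd.DataFrame({'Name': ['Amit', 'Praveen', 'Rahul', 'Rahul', 'Amit'],
                   'Roll_no': [23, 54, 36, 36, 23]})

# keep='first' (the default) keeps the first row of each repeated Roll_no
first_kept = df.drop_duplicates(subset='Roll_no')

# keep='last' keeps the last row instead
last_kept = df.drop_duplicates(subset='Roll_no', keep='last')

# keep=False drops every row whose Roll_no is repeated
none_kept = df.drop_duplicates(subset='Roll_no', keep=False)

print(first_kept)
print(last_kept)
print(none_kept)
```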


d) Replacing Values
import pandas as pd
df = {"Array_1": [49.50, 70], "Array_2": [65.1, 49.50]}
data = pd.DataFrame(df)
# Replace every occurrence of 49.50 with 60
print(data.replace(49.50, 60))
You can replace specific values in a DataFrame using the replace() method. Here’s a basic
example:
import pandas as pd
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
})
# Replace all occurrences of 1 with 100
df.replace(1, 100, inplace=True)
print(df)
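replace() also accepts a dictionary, which allows several substitutions in one call or limits the substitution to one column. The sketch below uses a small, hypothetical frame and leaves the original untouched, since replace() returns a new DataFrame unless inplace=True is passed.

```python
import pandas as pd

# Hypothetical frame: the value 1 appears in both columns
df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 5, 6]})

# Flat dict: old value -> new value, applied to every column
mapped = df.replace({1: 100, 6: 600})

# Nested dict: column name -> {old value: new value}, applied to 'A' only
only_a = df.replace({'A': {1: 100}})

print(mapped)
print(only_a)
```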
Replace Values in Pandas Dataframe
# importing pandas as pd
import pandas as pd
# Making data frame from the csv file
df = pd.read_csv("nba.csv")
# Printing the first 10 rows of the data frame for visualization
df[:10]

Replacing a Single Value


We are going to replace the team “Boston Celtics” with “Omega Warrior” in the ‘df’
DataFrame.
# this will replace "Boston Celtics" with "Omega Warrior"
df.replace(to_replace="Boston Celtics", value="Omega Warrior")
Replacing Two Values with a Single Value
To replace more than one value at a time, pass a Python list as the to_replace
argument. We are going to replace the teams “Boston Celtics” and “Texas” with
“Omega Warrior” in the ‘df’ DataFrame.
# importing pandas as pd
import pandas as pd
# Making data frame from the csv file
df = pd.read_csv("nba.csv")
# this will replace "Boston Celtics" and "Texas" with "Omega Warrior"
df.replace(to_replace=["Boston Celtics", "Texas"],
value="Omega Warrior")
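The calls above return a new DataFrame and leave df unchanged, so the result has to be assigned to keep it. The following sketch runs without nba.csv by using a small stand-in frame; the column names 'Name' and 'Team' and the rows are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for nba.csv; column names and rows are assumptions
df = pd.DataFrame({'Name': ['Player 1', 'Player 2', 'Player 3'],
                   'Team': ['Boston Celtics', 'Texas', 'Brooklyn Nets']})

# replace() returns a new DataFrame; df itself is not modified
replaced = df.replace(to_replace=['Boston Celtics', 'Texas'],
                      value='Omega Warrior')
teams_before = list(df['Team'])

# To keep the change, assign the result back; replacing on a single
# column avoids touching matching values in other columns
df['Team'] = df['Team'].replace(['Boston Celtics', 'Texas'], 'Omega Warrior')

print(replaced)
print(df)
```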
