Unit - 4 - Part 2

Data wrangling is a crucial step in data science that involves transforming raw data into a more suitable format for analysis, including tasks like data exploration, handling null values, and feature engineering. The document outlines various Python libraries such as NumPy, Pandas, and SciKit-Learn that facilitate data manipulation and analysis. It also covers methods for reading, filtering, grouping, and aggregating data using Pandas DataFrames.


DATA WRANGLING

Data wrangling (otherwise known as data munging or preprocessing) is a key component of any data science project. Wrangling is the process of transforming "raw" data to make it more suitable for analysis, improving the quality of your data. It typically involves:
• Data Exploration: checking feature data types, unique values, and describing the data.
• Null Values: counting null values and deciding what to do with them.
• Reshaping and Feature Engineering: transforming raw data into a more useful format. Examples of feature engineering include one-hot encoding, aggregation, joins, and grouping.
• Text Processing: BeautifulSoup and regular expressions (among other tools) are often used to clean and extract web-scraped text from HTML and XML documents.

AGENDA
• Reading data
• Selecting and filtering data
• Grouping
• Slicing
• Data manipulation
• loc, iloc
• Sorting
• Handling missing values
• Aggregation

Python Libraries for Data Science
NumPy:
▪ introduces objects for multidimensional arrays and matrices, as well as functions that make it easy to perform advanced mathematical and statistical operations on those objects
▪ provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance (a sketch follows below)
▪ many other Python libraries are built on NumPy

Link: http://www.numpy.org/
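A minimal sketch of the vectorization idea (the arrays here are illustrative):

In [ ]: #Element-wise operations run in compiled code, with no Python loop
import numpy as np
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])
a + b         # array([11., 22., 33.])
a * b         # array([10., 40., 90.])
np.sqrt(a)    # element-wise square root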

Python Libraries for Data Science
SciPy:
▪ collection of algorithms for linear algebra, differential equations, numerical integration, optimization, statistics and more
▪ part of the SciPy Stack
▪ built on NumPy

Link: https://www.scipy.org/scipylib/
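For example, a small numerical-integration sketch (the integrand is illustrative):

In [ ]: #Numerically integrate x^2 from 0 to 1 (exact answer: 1/3)
from scipy import integrate
value, error = integrate.quad(lambda x: x ** 2, 0, 1)
value    # approximately 0.3333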

Python Libraries for Data Science
Pandas:
▪ adds data structures and tools designed to work with table-like data (similar to Series and Data Frames in R)
▪ provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation etc.
▪ allows handling missing data

Link: http://pandas.pydata.org/
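A minimal sketch of the idea (the df_demo name and its columns are made up for illustration):

In [ ]: #A small table built from a dict of columns
import pandas as pd
df_demo = pd.DataFrame({'name': ['Alice', 'Bob'],
                        'salary': [85000, 92000]})
df_demo['salary'].mean()    # 88500.0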

Python Libraries for Data Science
SciKit-Learn:
▪ provides machine learning algorithms: classification, regression, clustering, model validation etc.
▪ built on NumPy, SciPy and matplotlib

Link: http://scikit-learn.org/
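A minimal sketch of the typical fit/score workflow, using the iris dataset bundled with SciKit-Learn:

In [ ]: #Fit a classifier and score it on held-out data
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
model.score(X_test, y_test)    # accuracy on the test split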

Python Libraries for Data Science
matplotlib:
▪ Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats
▪ a set of functionalities similar to those of MATLAB
▪ line plots, scatter plots, bar charts, histograms, pie charts etc.
▪ relatively low-level; some effort is needed to create advanced visualizations

Link: https://matplotlib.org/
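A minimal line-plot sketch (the data and file name are illustrative):

In [ ]: #A basic line plot saved to a hardcopy format
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.xlabel('x')
plt.ylabel('x squared')
plt.savefig('squares.png')    # PNG, PDF, SVG etc. are supported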

Python Libraries for Data Science
Seaborn:
▪ based on matplotlib
▪ provides a high-level interface for drawing attractive statistical graphics
▪ similar (in style) to the popular ggplot2 library in R

Link: https://seaborn.pydata.org/
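A minimal sketch using one of seaborn's bundled example datasets:

In [ ]: #One line of seaborn for a statistical plot
import seaborn as sns
tips = sns.load_dataset('tips')    # downloads a small sample dataset
sns.boxplot(x='day', y='total_bill', data=tips)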

Loading Python Libraries

In [ ]: #Import Python Libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import seaborn as sns

Press Shift+Enter to execute the Jupyter cell.

Reading data using pandas
In [ ]: #Read csv file
df = pd.read_csv("http://rcs.bu.edu/examples/python/data_analysis/Salaries.csv")

Note: the above command has many optional arguments to fine-tune the data import process.

There are a number of pandas commands to read other data formats:

pd.read_excel('myfile.xlsx',sheet_name='Sheet1',
index_col=None, na_values=['NA'])
pd.read_stata('myfile.dta')
pd.read_sas('myfile.sas7bdat')
pd.read_hdf('myfile.h5','df')

Exploring data frames

In [3]: #List first 5 records
df.head()

Data Frame data types

Pandas Type                Native Python Type    Description
object                     string                The most general dtype. Will be assigned to your column if the column has mixed types (numbers and strings).
int64                      int                   Numeric values. 64 refers to the number of bits allocated to hold the value.
float64                    float                 Numeric values with decimals. If a column contains numbers and NaNs (see below), pandas will default to float64, in case your missing value has a decimal.
datetime64, timedelta[ns]  N/A (see the datetime module in Python's standard library)    Values meant to hold time data. Look into these for time series experiments.

Data Frame data types

In [4]: #Check a particular column type
df['salary'].dtype
Out[4]: dtype('int64')

In [5]: #Check types for all the columns
df.dtypes
Out[5]: rank          object
        discipline    object
        phd            int64
        service        int64
        sex           object
        salary         int64
        dtype: object

Data Frames attributes

Python objects have attributes and methods.

df.attribute description
dtypes list the types of the columns
columns list the column names
axes list the row labels and column names
ndim number of dimensions
size number of elements
shape return a tuple representing the dimensionality
values numpy representation of the data
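For instance, checking a few of these attributes on the df loaded earlier (the printed values depend on the data):

In [ ]: #Attributes are accessed without parentheses
df.shape     # (number of rows, number of columns) as a tuple
df.columns   # the column names
df.ndim      # 2 for a data frame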

Data Frames methods

Unlike attributes, Python methods have parentheses.

All attributes and methods can be listed with the dir() function:
dir(df)

df.method() description
head( [n] ), tail( [n] ) first/last n rows

describe() generate descriptive statistics (for numeric columns only)

max(), min() return max/min values for all numeric columns

mean(), median() return mean/median values for all numeric columns

std() standard deviation

sample([n]) returns a random sample of the data frame

dropna() drop all the records with missing values

Selecting a column in a Data Frame
Method 1: Subset the data frame using the column name:
df['sex']

Method 2: Use the column name as an attribute:
df.sex

Note: there is a built-in rank() method on pandas data frames, so to select a column named "rank" we should use method 1, as the sketch below shows.

Data Frames groupby method

Using "group by" method we can:

• Split the data into groups based on some criteria


• Calculate statistics (or apply a function) to each group
• Similar to dplyr() function in R

In [ ]: #Group data using rank


df_rank = df.groupby(['rank'])

In [ ]:
#Calculate mean value for each numeric column per each
group
df_rank.mean()

Data Frames groupby method

Once the groupby object is created, we can calculate various statistics for each group:

In [ ]: #Calculate mean salary for each professor rank:
df.groupby('rank')[['salary']].mean()

Note: If single brackets are used to specify the column (e.g. salary), then the output is a Pandas Series object. When double brackets are used, the output is a Data Frame.
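A quick way to see the difference:

In [ ]: #Single vs. double brackets change the output type
type(df.groupby('rank')['salary'].mean())      # pandas Series
type(df.groupby('rank')[['salary']].mean())    # pandas DataFrame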
Data Frames groupby method

groupby performance notes:
- no grouping/splitting occurs until it's needed; creating the groupby object only verifies that you have passed a valid mapping
- by default the group keys are sorted during the groupby operation. You may want to pass sort=False for a potential speedup:

In [ ]: #Calculate mean salary for each professor rank:
df.groupby(['rank'], sort=False)[['salary']].mean()

Data Frame: filtering

To subset the data we can apply Boolean indexing. This indexing is commonly known as a filter. For example, if we want to subset the rows in which the salary value is greater than $120K:

In [ ]: #Subset the rows with salary above $120K:
df_sub = df[ df['salary'] > 120000 ]

Any Boolean operator can be used to subset the data:
> greater; >= greater or equal;
< less; <= less or equal;
== equal; != not equal;

In [ ]: #Select only those rows that contain female professors:
df_f = df[ df['sex'] == 'Female' ]
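Multiple conditions can be combined with & (and), | (or) and ~ (not), wrapping each condition in parentheses. A small sketch:

In [ ]: #Female professors earning more than $120K
df_fh = df[ (df['sex'] == 'Female') & (df['salary'] > 120000) ]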

Data Frames: Slicing

There are a number of ways to subset the Data Frame:
• one or more columns
• one or more rows
• a subset of rows and columns

Rows and columns can be selected by their position or label.

Data Frames: Slicing

When selecting one column, it is possible to use a single set of brackets, but the resulting object will be a Series (not a DataFrame):

In [ ]: #Select column salary:
df['salary']

When we need to select more than one column and/or want the output to be a DataFrame, we should use double brackets:

In [ ]: #Select columns rank and salary:
df[['rank','salary']]

Data Frames: Selecting rows

If we need to select a range of rows, we can specify the range using ":"

In [ ]: #Select rows by their position:
df[10:20]

Notice that the first row has position 0, and the last value in the range is omitted: for the range 0:10, the first 10 rows are returned, with positions starting at 0 and ending at 9.

Data Frames: method loc

If we need to select a range of rows using their labels, we can use the method loc:

In [ ]: #Select rows by their labels:
df_sub.loc[10:20, ['rank','sex','salary']]

Data Frames: method iloc

If we need to select a range of rows and/or columns using their positions, we can use the method iloc:

In [ ]: #Select rows and columns by their positions:
df_sub.iloc[10:20, [0, 3, 4, 5]]

Data Frames: method iloc (summary)

df.iloc[0]            # First row of a data frame
df.iloc[i]            # (i+1)th row
df.iloc[-1]           # Last row

df.iloc[:, 0]         # First column
df.iloc[:, -1]        # Last column

df.iloc[0:7]          # First 7 rows
df.iloc[:, 0:2]       # First 2 columns
df.iloc[1:3, 0:2]     # Second through third rows and first 2 columns
df.iloc[[0,5], [1,3]] # 1st and 6th rows and 2nd and 4th columns

Data Frames: Sorting

We can sort the data by the values in a column. By default the sort occurs in ascending order and a new data frame is returned.

In [ ]: #Create a new data frame from the original, sorted by the column service
df_sorted = df.sort_values(by='service')
df_sorted.head()

Data Frames: Sorting

We can sort the data using 2 or more columns:

In [ ]: df_sorted = df.sort_values(by=['service', 'salary'], ascending=[True, False])
df_sorted.head(10)

Missing Values

Missing values are marked as NaN.

In [ ]: #Read a dataset with missing values
flights = pd.read_csv("http://rcs.bu.edu/examples/python/data_analysis/flights.csv")

In [ ]: #Select the rows that have at least one missing value
flights[flights.isnull().any(axis=1)].head()

Missing Values
There are a number of methods to deal with missing values in the data frame (a usage sketch follows the table):

df.method()                 description
dropna()                    Drop rows with missing observations
dropna(how='all')           Drop rows where all cells are NA
dropna(axis=1, how='all')   Drop a column if all of its values are missing
dropna(thresh=5)            Drop rows that contain fewer than 5 non-missing values
fillna(0)                   Replace missing values with zeros
isnull()                    Returns True if the value is missing
notnull()                   Returns True for non-missing values
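For example, applied to the flights data frame read earlier (the delay column names are as in that file):

In [ ]: #Drop rows with any missing value, or fill the delays with zeros
flights_clean = flights.dropna()
flights_filled = flights.fillna({'dep_delay': 0, 'arr_delay': 0})
flights_clean.isnull().any().any()    # False: no missing values remain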

Missing Values

• When summing the data, missing values are treated as zero
• If all values are missing, the sum will be equal to NaN
• The cumsum() and cumprod() methods ignore missing values but preserve them in the resulting arrays
• Missing values in the GroupBy method are excluded (just like in R)
• Many descriptive statistics methods have a skipna option to control whether missing data should be excluded. It is set to True by default (unlike in R); see the illustration below
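A small illustration of these rules, using a hand-made Series:

In [ ]: #NaN handling in reductions
import numpy as np
s = pd.Series([1.0, np.nan, 3.0])
s.sum()                # 4.0 - the NaN is treated as zero
s.sum(skipna=False)    # nan - keep the missing value in the result
s.cumsum()             # 1.0, NaN, 4.0 - NaN preserved in place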

Aggregation Functions in Pandas

Aggregation - computing a summary statistic about each group, i.e.
• compute group sums or means
• compute group sizes/counts

Common aggregation functions:
min, max
count, sum, prod
mean, median, mode, mad
std, var

Aggregation Functions in Pandas

The agg() method is useful when multiple statistics are computed per column:

In [ ]: flights[['dep_delay','arr_delay']].agg(['min','mean','max'])

Basic Descriptive Statistics

df.method() description
describe Basic statistics (count, mean, std, min, quantiles, max)

min, max Minimum and maximum values

mean, median, mode Arithmetic average, median and mode

var, std Variance and standard deviation

sem Standard error of mean

skew Sample skewness

kurt Kurtosis
