05 Pandas

Pandas is a Python library designed for data analysis, providing flexible data structures like Series and DataFrame for easy manipulation of labeled data. It facilitates data wrangling, exploration, cleaning, and analysis through various functions and methods, including handling missing values and feature engineering techniques. The library also supports data visualization and statistical analysis, making it a fundamental tool for practical data analysis in Python.

Uploaded by

Rochit Limje

Pandas – Python’s Panel Data Analysis Library

• pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
• It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.
• It is very useful in data wrangling: the process of gathering, collecting, and transforming raw data into another format for better understanding, decision-making, access, and analysis in less time. Data wrangling is also known as data munging.
• Series and DataFrame are the two data structures available to enable intuitive data processing.
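The two structures can be sketched as follows (the values and labels are invented purely for illustration):

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
s = pd.Series([88, 92, 75], index=['math', 'reading', 'writing'])

# A DataFrame is a two-dimensional table of labeled columns.
df = pd.DataFrame({
    'gender': ['female', 'male'],
    'math_score': [72, 69],
})

print(s['math'])    # access a Series value by label
print(df.shape)     # (rows, columns) of the DataFrame
```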

Pandas as a data exploration tool
• Import the package
import pandas as pd
• Read file
df = pd.read_csv('C:\\Rekha\\InputFiles\\DAUP\\Student.csv')
• Explore/Analyze data
• df.size – Total values
• df.shape – Rows and Columns
• df.describe() – Gives statistical information on the columns. Descriptive Statistics
• df.info() - Displays the column names and their data types
• df.columns : Access column names
• df['gender'], df.gender – Access data by column name; df[0:3] – access rows by position
• df["Team"].value_counts() – Frequency Distribution
• df.Team.unique() - All unique values for the column
• Selection of data by applying conditions
• Using loc, iloc, query
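A minimal sketch of the exploration calls above, using a small invented stand-in for Student.csv (the column names here are assumptions):

```python
import pandas as pd

# Tiny stand-in for the Student.csv file read above.
df = pd.DataFrame({
    'gender': ['female', 'male', 'female', 'male'],
    'Team': ['A', 'B', 'A', 'A'],
    'math_score': [72, 69, 90, 47],
})

print(df.size)                    # total number of values
print(df.shape)                   # (rows, columns)
print(df.columns.tolist())        # column names
print(df['Team'].value_counts())  # frequency distribution
print(df.Team.unique())           # unique values in the column
```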

• Cleanup the data
• Rename columns
• Replace values in the dataframe
• Drop columns
• Remove duplicates
• Address null values : Drop
• Prepare the Data
• Add new columns if necessary
• Format the Date Column
• Apply selection, Filter to analyze a subset of data
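The preparation steps above might look like this (column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'math_score': [72, 69, 90],
    'reading_score': [74, 90, 95],
    'date': ['2021-01-05', '2021-02-10', '2021-03-01'],
})

df['date'] = pd.to_datetime(df['date'])               # format the date column
df['total'] = df['math_score'] + df['reading_score']  # add a new column
recent = df[df['date'] >= '2021-02-01']               # filter to a subset
```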

• Analyze the data
• Sorting data
• Aggregation using group by, pivot table, crosstab– Sum, Count, mean, max
• Visualization

Functions that we will be working on:
• shape, size, index, columns

• head, tail, info, describe

• Access columns : df['gender'], df.gender, or df[['column list']]

• Access rows : df[:3]['gender'], df[0:3][['gender','group']]

• df.loc[2:5,['gender','group']] (label-based, end inclusive)

• df.iloc[3:5,0:3] (position-based, end exclusive)

• Form filters and queries


• df[df.group.isin(['group A','group B'])]

• df.query('group == "group A" and math_score < 60')[:3]
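A sketch of the selection styles listed above on invented data; note that loc slices are end-inclusive while iloc slices are end-exclusive:

```python
import pandas as pd

df = pd.DataFrame({
    'gender': ['female', 'male', 'female', 'male', 'female'],
    'group': ['group A', 'group B', 'group A', 'group C', 'group B'],
    'math_score': [52, 69, 90, 47, 55],
})

sub1 = df[df.group.isin(['group A', 'group B'])]           # boolean filter
sub2 = df.query('group == "group A" and math_score < 60')  # query string
sub3 = df.loc[2:4, ['gender', 'group']]                    # labels, end inclusive
sub4 = df.iloc[3:5, 0:2]                                   # positions, end exclusive
```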

• Renaming : df.rename(columns={'gender':'Gender','group':'Group'}, inplace=True)

• Replace values : df.replace({'gender': {'female':'F', 'male':'M'}}, inplace=True)

• Handling Null values : df.isna().sum(axis=1) , df.isna().sum(axis=0)

• Drop rows or columns : df.dropna(axis=0)

• Drop duplicates : df.drop_duplicates(inplace=True)
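The null-handling and de-duplication calls combined, on a small invented frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'gender': ['female', 'male', 'male', 'male'],
    'math_score': [72, np.nan, 69, 69],
})

per_column = df.isna().sum(axis=0)   # null count per column
per_row = df.isna().sum(axis=1)      # null count per row
cleaned = df.dropna(axis=0)          # drop rows containing nulls
cleaned = cleaned.drop_duplicates()  # then drop exact duplicate rows
```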


• df.nunique() – To obtain the count of unique values for each column

• df.group.value_counts() – To obtain the count for each of the unique values in the column 'group'

• Fill null values : df1['math_score'].fillna(df1.math_score.median(), inplace=True)

• Grouping data :
df.groupby(by='gender').mean()
df.groupby(by=['group','gender']).sum()
df.groupby(['gender','group']).agg({'total' : ['min','max','mean','std'], 'math_score' : ['mean']})

• Adding a new column

df['total'] = df.sum(axis=1)
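These steps combined on invented scores, filling a missing value with the median before aggregating (assignment is used instead of inplace=True, which is the more current idiom):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'gender': ['F', 'M', 'F', 'M'],
    'group': ['group A', 'group A', 'group B', 'group B'],
    'math_score': [70.0, np.nan, 90.0, 60.0],
})

print(df.nunique())             # unique-value count per column
print(df.group.value_counts())  # counts of each group label

# Fill missing math_score with the column median (70.0 here).
df['math_score'] = df['math_score'].fillna(df['math_score'].median())

# Aggregate the scores by gender.
means = df.groupby('gender')['math_score'].mean()
```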

• df.sort_values(by=['math_score'], na_position='first',ascending=False)[:5]

• df.pivot_table(index=['group','gender'])

• pd.crosstab(df.gender,df.group,margins=True)

• pd.cut(df.Age,bins=bins, labels=bins[1:])

• df.corr()
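A sketch combining the calls above; the data and the Age bins are chosen purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'gender': ['F', 'M', 'F', 'M'],
    'group': ['group A', 'group A', 'group B', 'group B'],
    'math_score': [70, 85, 90, 60],
    'Age': [12, 15, 18, 22],
})

top = df.sort_values(by='math_score', ascending=False)[:2]   # top 2 scores
pivot = df.pivot_table(index=['group', 'gender'],
                       values='math_score')                  # mean by default
xtab = pd.crosstab(df.gender, df.group, margins=True)        # with row/col totals

# Bin Age into labeled intervals; each label names its bin's upper edge.
bins = [10, 15, 20, 25]
df['age_band'] = pd.cut(df.Age, bins=bins, labels=bins[1:])
```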

Data Cleaning
• Exploring Data
• Shape, info, columns, indexes, describe, head, tail
• Filter the data
• Handling Missing Data
• Drop the data (delete), or fill the values with the mean, median, or mode
• Use an ML algorithm such as regression to predict a highly probable replacement value
• Handling Outliers
• Use box plots and scatter plots to identify outliers, and handle them with the z-score or interquartile range (IQR) method
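A minimal interquartile-range (IQR) sketch on invented scores, flagging values outside the usual 1.5 × IQR fences:

```python
import pandas as pd

scores = pd.Series([45, 50, 52, 55, 58, 60, 62, 150])  # 150 is an outlier

q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr          # the IQR fences
outliers = scores[(scores < lower) | (scores > upper)]
```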

Feature Engineering
• Feature Encoding Techniques
• One-hot encoding, Label encoding, Ordinal encoding
• Feature Scaling
• Features have different ranges, magnitudes, and units. Scaling makes data measured on different scales, such as salary and age, comparable. This is known as feature normalization or feature scaling.
• Feature transformation:
• Converting numerical data to categorical data
• Splitting a categorical column into multiple columns

Feature encoding

One-hot encoding transforms a categorical column into multiple binary columns, one per category label.

Label encoding is also called integer encoding. Here, the unique values in a variable are replaced with a sequence of integer values.
For example:
categories : red, green, and blue
encoded values : red is 0, green is 1, and blue is 2

Ordinal encoding is similar to label encoding, except there is an order to the encoding.
Example category : low, medium, high

Library to use : sklearn.preprocessing

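A pandas-only sketch of the three encodings on invented values; the sklearn.preprocessing classes OneHotEncoder, LabelEncoder, and OrdinalEncoder provide the same operations:

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df['color'], prefix='color')

# Label encoding: each category becomes an integer code
# (pandas assigns codes in alphabetical order: blue=0, green=1, red=2).
df['color_label'] = df['color'].astype('category').cat.codes

# Ordinal encoding: integer codes that respect a stated order.
order = {'low': 0, 'medium': 1, 'high': 2}
risk = pd.Series(['low', 'high', 'medium']).map(order)
```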
Feature Scaling
• Standard scaling or z-score normalization derives each value from its z-score.
It is best suited for normally distributed data. Suppose μ is the mean and σ is the standard deviation of the feature column. Then the z-score is:
z = (x − μ) / σ

• Min-max scaling : This method linearly transforms the original data into a given
range, typically [0, 1], using x_scaled = (x − min) / (max − min). It preserves the
relationships between the scaled data and the original data. If the distribution is
not normal and the standard deviation is very small, the min-max scaler can work
better; note, however, that it is more sensitive to outliers than standard scaling.
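Both scalers expressed directly in pandas on an invented column (sklearn.preprocessing offers StandardScaler and MinMaxScaler equivalents); note that pandas' std() uses the sample standard deviation:

```python
import pandas as pd

x = pd.Series([10.0, 20.0, 30.0, 40.0])

# Standard scaling (z-score): z = (x - mean) / std.
z = (x - x.mean()) / x.std()

# Min-max scaling into [0, 1]: preserves relative spacing of the values.
mm = (x - x.min()) / (x.max() - x.min())
```

After standard scaling the column has mean 0 and (sample) standard deviation 1; after min-max scaling the smallest value maps to 0 and the largest to 1.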

