Python Basics Refresher

The document provides an overview of Numpy and Pandas, two essential Python libraries for numerical computing and data manipulation, respectively. It covers key features, data loading, cleaning, preprocessing, encoding categorical data, and scaling techniques.

Python Libraries: Numpy and Pandas

NIELIT Chandigarh/Ropar

"In God we trust, all others must bring data." – W. Edwards Deming
Numpy
Definition: A library for numerical computing in Python.
Key Features:
• Support for multi-dimensional arrays.
• Mathematical functions for fast operations.
Example:
import numpy as np

# Create a 1D array
data = np.array([1, 2, 3, 4])
print(data)        # [1 2 3 4]

# Perform operations
print(data.mean()) # Mean value: 2.5
print(data + 5)    # Element-wise addition: [6 7 8 9]

Pandas
Definition: A library for data manipulation and analysis.
Key Features:
• DataFrames: 2D labeled data structures.
• Easy handling of missing data.
• Integration with CSV, Excel, and databases.
Example:
import pandas as pd

# Create a DataFrame
data = {'Name': ['Raju', 'Priya'], 'Age': [25, 30]}
df = pd.DataFrame(data)
print(df)

# Inspect DataFrame
print(df.head())     # First few rows
print(df.describe()) # Summary statistics

Loading and Inspecting Datasets
Loading CSV Files
# Load a dataset
import pandas as pd
data = pd.read_csv('./data.csv')
print(data)

Inspecting Data
• View First 5 Rows: data.head()
• Shape of Data: data.shape
• Column Names: data.columns
• Basic Statistics: data.describe()

Loading and Inspecting Datasets
• Loading a CSV file:
df = pd.read_csv('./data.csv')  # Replace 'data.csv' with your file path

Inspecting Data:
• View first few rows:
print(df.head())
• Summary of data:
print(df.info())
• Descriptive statistics:
print(df.describe())
• Check for null values:
print(df.isnull().sum())
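The inspection calls above can also be exercised on a small in-memory DataFrame instead of a CSV file. A minimal sketch (the column names and values here are invented for illustration):

```python
import pandas as pd

# Toy DataFrame standing in for a loaded CSV
df = pd.DataFrame({'Name': ['Ajay', 'Vishal', 'Raj'],
                   'Age': [24, None, 19]})

print(df.head())          # first rows
df.info()                 # column dtypes and non-null counts
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # missing values per column: 'Age' has 1
```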

Data Cleaning and Preprocessing
Handling Missing Values
Why:
• Missing values can distort analysis, skew results, and lead to incorrect conclusions.
Tips:
1. Always check your data for missing values before using dropna():
print(data.isnull().sum())
2. Use inplace=False if you want to keep the original DataFrame intact.
Methods:

• Fill Missing Values:

df.fillna(value=0, inplace=True)  # Fill missing values with 0
print(df)
data['ColumnName'] = data['ColumnName'].fillna(value)
Example: data['Age'] = data['Age'].fillna(25)
# Replaces all NaN values in the 'Age' column with 25.

Data Cleaning and Preprocessing
import pandas as pd
# Load the dataset
df = pd.read_csv('./data.csv')
# Fill missing values with 0 (create a new modified DataFrame)
df = df.fillna(value=0)
# Print the DataFrame
print(df)
• If you prefer to modify the DataFrame in place, you can use:
df.fillna(value=0, inplace=True)
print(df)

Data Cleaning and Preprocessing
import pandas as pd

# Create the DataFrame
df = pd.DataFrame({'Name': ["Ajay", "Vishal", "Raj"],
                   'Age': [24, None, 19]})

# Modify the DataFrame directly
df.fillna(0, inplace=True)

# Print the DataFrame
print(df)

Data Cleaning and Preprocessing
Handle Missing Values
◦ Drop Missing Values: remove rows or columns with missing data.
• Drop rows with missing values:

data.dropna(inplace=True)

◦ Drop columns with missing values:

data = data.dropna(axis=1)

Data Cleaning and Preprocessing
• Parameters:
1. axis (default = 0): specifies whether to drop rows or columns.
   • axis=0: drop rows with missing values.
   • axis=1: drop columns with missing values.
Example:
data.dropna(axis=1, inplace=True)  # Drops columns with NaN values.
2. how (default = 'any'): defines the condition to drop rows or columns.
   • 'any': drops a row/column if any value is missing.
   • 'all': drops a row/column only if all values are missing.
Example:
data.dropna(how='all', inplace=True)  # Drops rows where all values are NaN.

Data Cleaning and Preprocessing
3. thresh:
• Requires a minimum number of non-NaN values to retain the row/column.
Example:
data.dropna(thresh=3, inplace=True)  # Keeps rows with at least 3 non-NaN values.
4. subset:
• Allows specifying columns to check for missing values instead of the entire DataFrame.
Example:
data.dropna(subset=['Column1', 'Column2'], inplace=True)  # Drops rows based on NaNs in specified columns.
5. inplace (default = False):
◦ If True, makes changes directly to the original DataFrame.
◦ If False, returns a new DataFrame with rows/columns dropped.
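As a sketch, the parameters above can be compared side by side on one toy DataFrame (the column names and values are invented for illustration; each call returns a new DataFrame since inplace defaults to False):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, np.nan],
                   'B': [4, 5, np.nan],
                   'C': [7, 8, np.nan]})

print(df.dropna())                  # how='any' (default): only row 0 has no NaN
print(df.dropna(how='all'))         # drops only row 2, where every value is NaN
print(df.dropna(thresh=2))          # keeps rows with at least 2 non-NaN values
print(df.dropna(axis=1, thresh=2))  # keeps columns with at least 2 non-NaN values
print(df.dropna(subset=['B']))      # checks NaNs only in column 'B'
```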

Data Cleaning and Preprocessing
• Handle Missing Values
Fill Missing Values
◦ With a constant value:

data['ColumnName'] = data['ColumnName'].fillna('Value')

◦ With the mean, median, or mode:

data['ColumnName'] = data['ColumnName'].fillna(data['ColumnName'].mean())

data['ColumnName'] = data['ColumnName'].fillna(data['ColumnName'].median())

data['ColumnName'] = data['ColumnName'].fillna(data['ColumnName'].mode()[0])
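A minimal sketch applying the mean, median, and mode fills to an invented 'Age' column:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({'Age': [24, np.nan, 19, 24]})

mean_filled   = data['Age'].fillna(data['Age'].mean())     # mean of 24, 19, 24
median_filled = data['Age'].fillna(data['Age'].median())   # median of 19, 24, 24
mode_filled   = data['Age'].fillna(data['Age'].mode()[0])  # most frequent value

print(mean_filled.tolist())
print(median_filled.tolist())
print(mode_filled.tolist())
```

Note that mode() returns a Series (there can be ties), which is why the first entry is selected with [0].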

Encoding Categorical Data
• Why: Machine learning models work with numerical data.
• Categorical data must be converted into numeric values for most
machine learning models. There are two common encoding
techniques:
• How:
1. Label Encoding (Simple Integer Mapping):
2. One-Hot Encoding
1. Label Encoding
• Assigns a unique integer to each category.
• Suitable for ordinal (ranked) categories.

Encoding Categorical Data
EXAMPLE:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Sample Data
data = {'Name': ['Ajay', 'Vishal', 'Raj'], 'Gender': ['Male', 'Male', 'Female']}
df = pd.DataFrame(data)

# Encode Gender
encoder = LabelEncoder()
df['Gender'] = encoder.fit_transform(df['Gender'])

print(df)
# Output: Male is encoded as 1, and Female as 0 (labels are assigned alphabetically).

Encoding Categorical Data
2. One-Hot Encoding
• Creates binary columns for each category.
• Suitable for nominal (unordered) categories.
• Example:
# One-hot encoding using pandas
df = pd.DataFrame({'Name': ['Ajay', 'Vishal', 'Raj'],
'Department': ['IT', 'HR', 'Finance']})
df_encoded = pd.get_dummies(df, columns=['Department'])

print(df_encoded)

Scaling Data
• Why: Models converge faster and perform better when data is scaled.
• Scaling ensures all features are in a similar range, which helps improve model performance.
1. Min-Max Scaling: scales data to a fixed range, typically [0, 1].
• Formula: X_scaled = (X - X_min) / (X_max - X_min)

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample Data
df = pd.DataFrame({'Age': [24, 19, 30], 'Salary': [50000, 40000, 70000]})

# Scale data
scaler = MinMaxScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])
print(df)

Scaling Data
2. Standard Scaling: standardizes data to have a mean of 0 and a
standard deviation of 1.
Formula: z = (x - μ) / σ

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Sample Data
df = pd.DataFrame({'Age': [24, 19, 30], 'Salary': [50000, 40000, 70000]})

# Scale data
scaler = StandardScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])
print(df)
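As a quick sanity check (a sketch on the same invented sample data), each standardized column should end up with mean approximately 0 and population standard deviation approximately 1, since StandardScaler divides by the population standard deviation (ddof=0):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Age': [24, 19, 30], 'Salary': [50000, 40000, 70000]})
scaled = StandardScaler().fit_transform(df)

print(scaled.mean(axis=0))  # approximately [0, 0]
print(scaled.std(axis=0))   # approximately [1, 1] (numpy's default ddof=0)
```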

Summary
• Numpy and Pandas are essential Python libraries for data analysis.
• Loading and inspecting datasets is the first step in any data science
workflow.
• Data cleaning ensures quality, while preprocessing prepares data for
machine learning.
Encoding Categorical Data:
• Label Encoding: Use for ordinal data.
• One-Hot Encoding: Use for nominal data.
Scaling Data:
• Min-Max Scaling: Scales between a range (e.g., [0, 1]).
• Standard Scaling: Standardizes to a mean of 0 and a standard deviation of 1.

Thank You! ☺

