Python Basics Refresher
NIELIT Chandigarh/Ropar
"In God we trust, all others must bring data." – W. Edwards Deming
Numpy
Definition: A library for numerical computing in Python.
Key Features:
• Support for multi-dimensional arrays.
• Mathematical functions for fast operations.
Example:
import numpy as np
# Create a 1D array
data = np.array([1, 2, 3, 4])
print(data)
# Perform operations
print(data.mean()) # Mean value
print(data + 5) # Element-wise addition
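As a further sketch of the multi-dimensional support listed above (the values are invented for illustration):
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])  # a 2x3 array
print(matrix.shape)        # (2, 3)
print(matrix.sum(axis=0))  # column sums: [5 7 9]
print(matrix * 2)          # element-wise multiplication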
Pandas
Definition: A library for data manipulation and analysis.
• Key Features:
• DataFrames: 2D labeled data structures.
• Easy handling of missing data.
• Integration with CSV, Excel, and databases.
Example:
import pandas as pd
# Create a DataFrame
data = {'Name': ['Raju', 'Priya'], 'Age': [25, 30]}
df = pd.DataFrame(data)
print(df)
# Inspect DataFrame
print(df.head()) # First few rows
print(df.describe()) # Summary statistics
Loading and Inspecting Datasets
Loading CSV Files
# Load a dataset
import pandas as pd
data = pd.read_csv('./data.csv')
print(data)
Inspecting Data
• View First 5 Rows: data.head()
• Shape of Data: data.shape
• Column Names: data.columns
• Basic Statistics: data.describe()
Loading and Inspecting Datasets
• Loading a CSV file:
df = pd.read_csv('./data.csv')  # Replace 'data.csv' with your file path
Inspecting Data:
• View first few rows:
print(df.head())
• Summary of data:
print(df.info())
• Descriptive statistics:
print(df.describe())
• Check for null values:
print(df.isnull().sum())
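To see these checks together, a self-contained sketch (the two-row DataFrame is invented for illustration):
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Raju', 'Priya'], 'Age': [25, np.nan]})
df.info()                 # column types and non-null counts
print(df.isnull().sum())  # Name: 0 nulls, Age: 1 null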
Data Cleaning and Preprocessing
Handling Missing Values
Why:
• Missing values can distort analysis and lead to incorrect conclusions.
Tips:
1. Always check your data for missing values before using dropna(): print(data.isnull().sum())
2. Use inplace=False if you want to keep the original DataFrame intact.
Methods:
Data Cleaning and Preprocessing
import pandas as pd
# Load the dataset
df = pd.read_csv('./data.csv')
# Fill missing values with 0 (create a new modified DataFrame)
df = df.fillna(value=0)
# Print the DataFrame
print(df)
• If you prefer to modify the DataFrame in place, you can use:
df.fillna(value=0, inplace=True)
print(df)
Data Cleaning and Preprocessing
Handle Missing Values
◦ Drop Missing Values: remove rows or columns with missing data (a runnable sketch follows these calls):
data.dropna(inplace=True)   # drop rows containing any missing value, in place
data = data.dropna(axis=1)  # drop columns containing any missing value
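A minimal sketch of both calls, on a DataFrame invented for illustration:
import pandas as pd
import numpy as np

data = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan], 'C': [7, 8, 9]})
print(data.dropna())        # keeps only the first row (the others contain NaN)
print(data.dropna(axis=1))  # keeps only column C (the only column without NaN)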
Data Cleaning and Preprocessing
• Parameters (a combined runnable sketch follows this list):
1. axis (default = 0): specifies whether to drop rows or columns.
• axis=0: drop rows with missing values.
• axis=1: drop columns with missing values.
Example:
data.dropna(axis=1, inplace=True) # Drops columns with NaN values.
2. how (default = 'any'):
• Defines the condition to drop rows or columns:
• 'any': Drops rows/columns if any value is missing.
• 'all': Drops rows/columns only if all values are missing.
Example:
data.dropna(how='all', inplace=True) # Drops rows where all values are NaN.
Data Cleaning and Preprocessing
3. thresh:
• Requires a minimum number of non-NaN values to retain the row/column.
Example:
data.dropna(thresh=3, inplace=True) # Keeps rows with at least 3 non-NaN values.
4. subset:
• Allows specifying columns to check for missing values instead of the entire DataFrame.
Example:
data.dropna(subset=['Column1', 'Column2'], inplace=True)  # Drops rows based on NaNs in the specified columns.
5. inplace (default = False):
◦ If True, makes changes directly to the original DataFrame.
◦ If False, returns a new DataFrame with rows/columns dropped.
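As flagged above, a combined sketch of how, thresh, and subset (the DataFrame is invented for illustration):
import pandas as pd
import numpy as np

data = pd.DataFrame({'Column1': [1, np.nan, np.nan],
                     'Column2': [4, 5, np.nan],
                     'Column3': [7, 8, np.nan]})
print(data.dropna(how='all'))           # drops only the last row (all values NaN)
print(data.dropna(thresh=2))            # keeps rows with at least 2 non-NaN values
print(data.dropna(subset=['Column1']))  # drops rows where Column1 is NaN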
Data Cleaning and Preprocessing
Fill Missing Values
◦ With a constant value:
data['ColumnName'] = data['ColumnName'].fillna('Value')
◦ With the column mean, median, or mode (a runnable sketch of the mean fill follows):
data['ColumnName'] = data['ColumnName'].fillna(data['ColumnName'].mean())
data['ColumnName'] = data['ColumnName'].fillna(data['ColumnName'].median())
data['ColumnName'] = data['ColumnName'].fillna(data['ColumnName'].mode()[0])
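As noted above, a runnable sketch of the mean-fill strategy (the small DataFrame is invented for illustration):
import pandas as pd
import numpy as np

data = pd.DataFrame({'Age': [25, np.nan, 35]})
data['Age'] = data['Age'].fillna(data['Age'].mean())  # NaN replaced by the mean, 30.0
print(data)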
Encoding Categorical Data
• Why: Most machine learning models work only with numerical data, so categorical data must be converted into numeric values.
• How: there are two common encoding techniques:
1. Label Encoding (simple integer mapping)
2. One-Hot Encoding

1. Label Encoding
• Assigns a unique integer to each category.
• Suitable for ordinal (ranked) categories.
Encoding Categorical Data
Example:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
# Sample Data
data = {'Name': ['Ajay', 'Vishal', 'Raj'], 'Gender': ['Male', 'Male', 'Female']}
df = pd.DataFrame(data)
# Encode Gender
encoder = LabelEncoder()
df['Gender'] = encoder.fit_transform(df['Gender'])
print(df)
Output: Male is encoded as 1, and Female as 0 (LabelEncoder assigns integers in alphabetical order of the categories).
Encoding Categorical Data
2. One-Hot Encoding
• Creates binary columns for each category.
• Suitable for nominal (unordered) categories.
• Example:
# One-hot encoding using pandas
df = pd.DataFrame({'Name': ['Ajay', 'Vishal', 'Raj'],
'Department': ['IT', 'HR', 'Finance']})
df_encoded = pd.get_dummies(df, columns=['Department'])
print(df_encoded)
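One optional refinement, not part of the original example: pd.get_dummies accepts drop_first=True, which drops one indicator column per feature; the dropped category becomes the implicit baseline, which some linear models prefer:
df_encoded = pd.get_dummies(df, columns=['Department'], drop_first=True)
print(df_encoded)  # Department_Finance (first alphabetically) is dropped as the baseline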
Scaling Data
• Why: Scaling ensures all features are in a similar range, which helps models converge faster and perform better.
1. Min-Max Scaling: Scales data to a fixed range, typically [0, 1].
• Formula: X_scaled = (X - X_min) / (X_max - X_min)
Example:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
df = pd.DataFrame({'Age': [24, 19, 30], 'Salary': [50000, 40000, 70000]})  # sample data
scaler = MinMaxScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])
print(df)
Scaling Data
2. Standard Scaling: standardizes data to have a mean of 0 and a standard deviation of 1.
Formula: X_scaled = (X - mean) / standard deviation
Example:
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Sample data
df = pd.DataFrame({'Age': [24, 19, 30], 'Salary': [50000, 40000, 70000]})
# Scale data
scaler = StandardScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])
print(df)
Summary
• Numpy and Pandas are essential Python libraries for data analysis.
• Loading and inspecting datasets is the first step in any data science
workflow.
• Data cleaning ensures quality, while preprocessing prepares data for
machine learning.
Encoding Categorical Data:
• Label Encoding: Use for ordinal data.
• One-Hot Encoding: Use for nominal data.
Scaling Data:
• Min-Max Scaling: Scales between a range (e.g., [0, 1]).
• Standard Scaling: Standardizes to a mean of 0 and a standard deviation of 1.
Thank You! ☺