
Subject Title : Data Science Using Python SEMESTER - III

Subject Ref. No. : MANC503

Chapter 2
Introduction to NumPy in Detail:

NumPy (Numerical Python) is a powerful Python library that provides support for efficient manipulation of multi-dimensional arrays and matrices of numerical data, along with a wide range of mathematical functions to operate on these arrays. It is a fundamental package for scientific computing and data analysis in Python. NumPy's key feature is the ndarray (n-dimensional array) data structure, which enables fast and vectorized operations on large datasets.

Key Features of NumPy:


Efficient Array Computing: NumPy arrays are more memory-efficient
and faster for numerical computations compared to Python's built-in
lists. NumPy operations are implemented in C and are therefore
significantly faster than equivalent Python loops.

Multi-Dimensional Arrays: NumPy arrays can have any number of dimensions, allowing you to work with multi-dimensional datasets like images, time series, and scientific data.

Universal Functions (ufuncs): NumPy provides a wide range of mathematical functions that operate element-wise on arrays, resulting in concise and efficient code.

Broadcasting: Broadcasting allows NumPy to perform operations on arrays of different shapes, intelligently applying the operation to elements without explicit looping.
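As a brief illustration (a minimal sketch, not from the original notes), broadcasting lets a 1-D array combine with each row of a 2-D array without a loop:

```python
import numpy as np

# A (2, 3) matrix and a (3,) row vector
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row = np.array([10, 20, 30])

# The row is broadcast across each row of the matrix;
# no explicit Python loop is required.
result = matrix + row
print(result)  # [[11 22 33]
               #  [14 25 36]]
```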

Asst Professor :- Ingle A.R

Indexing and Slicing: NumPy supports powerful indexing and slicing operations to access and manipulate elements within arrays.
Data Types: NumPy arrays have a consistent data type, enabling better
control over memory usage and efficient storage of homogeneous
data.
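For example (an illustrative sketch), the dtype of an array can be inspected, or fixed at creation time:

```python
import numpy as np

# All elements of a NumPy array share one data type.
ints = np.array([1, 2, 3])
floats = np.array([1, 2, 3], dtype=np.float64)  # force 64-bit floats

print(ints.dtype)       # an integer dtype, e.g. int64 (platform dependent)
print(floats.dtype)     # float64
print(floats.itemsize)  # 8 bytes per element
```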
Vectorization: NumPy encourages vectorized operations, where you
perform operations on entire arrays instead of looping over individual
elements. This leads to cleaner and more efficient code.
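A quick sketch of the difference: both forms below compute the same squares, but the vectorized one is a single expression over the whole array.

```python
import numpy as np

arr = np.arange(5)

# Loop version: element by element in Python
squares_loop = np.array([x ** 2 for x in arr])

# Vectorized version: the loop happens inside NumPy's C code
squares_vec = arr ** 2

print(squares_vec)  # [ 0  1  4  9 16]
```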
Integration with Other Libraries: NumPy is the foundation for many
other scientific and data-related libraries in Python, including libraries
like SciPy, pandas, scikit-learn, and more.
Creating NumPy Arrays:

You can create NumPy arrays using various methods:

import numpy as np
# Create an array from a Python list
arr = np.array([1, 2, 3, 4, 5])


# Create a 2D array from a nested list
matrix = np.array([[1, 2, 3], [4, 5, 6]])
# Create an array of zeros or ones
zeros = np.zeros((3, 4))
ones = np.ones((2, 3))
# Create an identity matrix
identity = np.eye(3)
# Create an array with a range of values
range_array = np.arange(0, 10, 2)
# Create an array of evenly spaced values
linspace_array = np.linspace(0, 1, 5)
Array Operations:
NumPy supports various mathematical and logical operations on
arrays:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Element-wise addition
result_add = a + b

# Element-wise multiplication
result_mul = a * b

# Dot product (for 1-D arrays, np.dot computes the inner product: 1*4 + 2*5 + 3*6 = 32)
result_matmul = np.dot(a, b)


Indexing and Slicing:
Accessing elements in a NumPy array using indexing and slicing:
arr = np.array([10, 20, 30, 40, 50])

print(arr[0])    # Access the first element
print(arr[1:4])  # Access elements from index 1 to 3
print(arr[:3])   # Access elements from the beginning to index 2
print(arr[2:])   # Access elements from index 2 to the end
NumPy Array
NumPy arrays, also known as ndarrays (n-dimensional arrays), are the
foundation of the NumPy library. They provide an efficient and
flexible way to store and manipulate large datasets of homogeneous
numerical data in Python. This guide will cover key concepts related
to NumPy arrays, including creation, indexing, slicing, operations,
and attributes.

Importing NumPy:
To use NumPy, you need to import the library:
import numpy as np
Creating NumPy Arrays:
NumPy arrays can be created in several ways:
From a Python List:
arr = np.array([1, 2, 3, 4, 5])
From a Nested List (2D Array):
matrix = np.array([[1, 2, 3], [4, 5, 6]])


Using Built-in Functions:
zeros = np.zeros((3, 4))  # Array of zeros with shape (3, 4)
ones = np.ones((2, 3))    # Array of ones with shape (2, 3)
identity = np.eye(3)      # 3x3 identity matrix

Using Range and Linspace:
range_array = np.arange(0, 10, 2)      # Array with values [0, 2, 4, 6, 8]
linspace_array = np.linspace(0, 1, 5)  # Array with 5 evenly spaced values between 0 and 1

Array Attributes:
NumPy arrays have several useful attributes:
shape = arr.shape  # Shape of the array (rows, columns)
dtype = arr.dtype  # Data type of array elements
ndim = arr.ndim    # Number of dimensions
size = arr.size    # Total number of elements
Array Indexing and Slicing:
NumPy arrays are indexed and sliced similarly to Python lists:
element = arr[0]      # Access the first element
sub_array = arr[1:4]  # Access elements from index 1 to 3


sub_matrix = matrix[:2, 1:] # Access rows 0 to 1, columns 1 to end


Array Operations:
NumPy supports element-wise operations:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

add_result = a + b            # Element-wise addition
mul_result = a * b            # Element-wise multiplication
matmul_result = np.dot(a, b)  # Dot product for 1-D arrays (inner product, 32)
Universal Functions (ufuncs):
NumPy provides numerous ufuncs for efficient element-wise operations:
sqrt_array = np.sqrt(arr)  # Square root of array elements
exp_array = np.exp(arr)    # Exponential of array elements
Broadcasting:
Broadcasting allows performing operations on arrays of different
shapes:
scalar_mul = arr * 2 # Multiply each element by 2

Quick Note on Array Indexing

Array indexing is a fundamental concept in programming that allows you to access individual elements within an array. In the context of NumPy arrays, indexing refers to the process of retrieving specific elements or subsets of elements from an array. Here's a quick overview of array indexing in NumPy:

1. Indexing Basics:
Indexing in NumPy arrays is 0-based, meaning the index of the first element is 0, the second element's index is 1, and so on. You can use square brackets [] to access elements at specific indices.
2. Single Element Indexing:
To access a single element of a 1D array, use the index within the square brackets:
import numpy as np
arr = np.array([10, 20, 30, 40, 50])
element = arr[2]  # Access the element at index 2 (30)
3. Multi-Dimensional Arrays:
For 2D or multi-dimensional arrays, use a comma-separated pair of
indices within the square brackets:
matrix = np.array([[1, 2, 3], [4, 5, 6]])
element = matrix[1, 2] # Access the element at row 1, column 2 (6)
4. Slicing:
Slicing allows you to extract a subset of elements from an array. The
syntax is 'start:end:step', where 'start' is the starting index, 'end' is the
ending index (exclusive), and 'step' specifies the interval between
elements.
arr = np.array([10, 20, 30, 40, 50])
subset = arr[1:4] # Extract elements from index 1 to 3: [20, 30, 40]
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

submatrix = matrix[:2, 1:] # Extract rows 0 to 1, columns 1 to end


5. Boolean Indexing:
Boolean indexing involves using a Boolean condition to extract
elements that satisfy the condition.
arr = np.array([10, 20, 30, 40, 50])
condition = arr > 30
filtered = arr[condition] # Elements greater than 30: [40, 50]
6. Fancy Indexing:
Fancy indexing allows you to access elements at specific indices
using an array of index values.
arr = np.array([10, 20, 30, 40, 50])
indices = np.array([0, 3])
selected = arr[indices] # Elements at indices 0 and 3: [10, 40]
7. Negative Indices:
Negative indices count from the end of the array.
arr = np.array([10, 20, 30, 40, 50])
last_element = arr[-1] # Access the last element (50)
print(arr[-1])
Understanding array indexing is crucial for effectively working with
data in NumPy arrays. It allows you to access, modify, and manipulate
individual elements and subsets of elements within arrays.

1. NumPy Operations:
NumPy operations provide powerful tools for performing
computations on arrays efficiently. Understanding these operations is
crucial for effective data manipulation and analysis.
Arithmetic Operations:

Addition: 'np.add(arr1, arr2)'
Subtraction: 'np.subtract(arr1, arr2)'
Multiplication: 'np.multiply(arr1, arr2)'
Division: 'np.divide(arr1, arr2)'

Element-wise Operations:
Square root: 'np.sqrt(arr)'
Exponential: 'np.exp(arr)'
Logarithm: 'np.log(arr)'
Trigonometric functions: 'np.sin(arr)', 'np.cos(arr)'

Aggregation Functions:
Sum: 'np.sum(arr)'
Mean: 'np.mean(arr)'
Median:' np.median(arr)'
Standard deviation: 'np.std(arr)'
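A short sketch combining the operations listed above:

```python
import numpy as np

arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([10, 20, 30, 40])

# Arithmetic operations work element-wise
added = np.add(arr1, arr2)       # [11 22 33 44]
divided = np.divide(arr2, arr1)  # [10. 10. 10. 10.]

# Aggregation functions reduce an array to a single value
total = np.sum(arr1)     # 10
average = np.mean(arr1)  # 2.5
spread = np.std(arr1)    # population standard deviation, about 1.118
print(total, average, spread)
```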
Reshaping and Transposing:
Reshaping in NumPy:
Reshaping refers to changing the dimensions (shape) of an array
without changing the data it contains. It's often used when you want to
convert a one-dimensional array into a two-dimensional matrix or
vice versa, or when you need to change the number of rows or
columns in a multidimensional array.
import numpy as np
# Create a one-dimensional array with 12 elements
arr = np.arange(12)
# Reshape it into a 3x4 matrix


reshaped_arr = arr.reshape(3, 4)
print("Original array:")
print(arr)
print("\nReshaped array:")
print(reshaped_arr)
In this example, we created a one-dimensional array with 12 elements
and then reshaped it into a 3x4 matrix using the reshape method. The
resulting array keeps the same data but has a different shape.

Transposing in NumPy:
Transposing involves swapping the rows and columns of a two-
dimensional array. This operation is useful when you want to change
the orientation of your data, for example, when you want to perform
matrix operations like matrix multiplication or when you want to
align data for different calculations.
import numpy as np
# Create a 2x3 matrix
matrix = np.array([[1, 2, 3],
[4, 5, 6]])
# Transpose the matrix (swap rows and columns)
transposed_matrix = matrix.T
print("Original matrix:")
print(matrix)
print("\nTransposed matrix:")
print(transposed_matrix)

Original matrix:
[[1 2 3]
[4 5 6]]
Transposed matrix:
[[1 4]
[2 5]
[3 6]]
In this example, we created a 2x3 matrix and then used the .T attribute
to transpose it. The resulting matrix swaps the rows and columns,
changing its orientation.

NumPy Exercises Solutions with Examples:

Solutions to exercises provide step-by-step guidance and validation of your understanding.

Solution 1: Array Creation and Indexing:

import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
element = arr[1, 2] # Access element at row 1, column 2 (6)

Solution 2: Element-wise Operations:

import numpy as np
arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])
result_add = np.add(arr1, arr2)  # Element-wise addition

Solution 3: Aggregation and Statistics:

import numpy as np
arr = np.array([10, 20, 30, 40, 50])
mean_value = np.mean(arr) # Calculate mean
std_deviation = np.std(arr) # Calculate standard deviation

Solution 4: Reshaping and Transposing:

import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
reshaped = arr.reshape(3, 2) # Reshape to 3 rows, 2 columns
transposed = arr.T # Transpose the matrix


Introduction to Pandas:

Pandas is a widely used open-source Python library for data manipulation and analysis. It provides versatile data structures and functions that simplify working with structured data, making it an essential tool in data science, analytics, and research. Pandas is built on top of NumPy and is particularly useful for handling tabular data, time series, and labeled data.

Features of Pandas:
DataFrame: The DataFrame is a two-dimensional, size-mutable, and
heterogeneous tabular data structure. It is the most commonly used
data structure in Pandas and can be thought of as a spreadsheet or
SQL table. DataFrames can hold data of various types, including
numeric, string, and categorical data.


Series: A Series is a one-dimensional labeled array that can hold data of any type. Series are used as columns in DataFrames and are often used to represent time series data.

Data Alignment: Pandas automatically aligns data when performing operations on Series and DataFrames, ensuring that operations are performed on corresponding elements, even when data is missing or misaligned.
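A brief sketch of alignment in action (illustrative, not from the original notes): two Series with partially overlapping indexes are added, and labels present in only one Series produce NaN rather than an error:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition aligns on the index labels, not on position.
result = s1 + s2
print(result)
# a     NaN   ('a' exists only in s1)
# b    12.0   (2 + 10)
# c    23.0   (3 + 20)
# d     NaN   ('d' exists only in s2)
```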

Data Cleaning and Preparation: Pandas provides a wide range of functions for cleaning and preparing data, including methods for handling missing data (NaN values), duplicate data, and data type conversions.

Indexing and Selection: Pandas supports powerful indexing and selection capabilities, allowing you to select data based on labels, positions, and boolean conditions. You can also perform multi-axis indexing and slicing.

Aggregation and Grouping: Pandas allows you to perform aggregation operations like sum, mean, count, and more on data sets. It also supports grouping data based on one or more criteria.

Time Series Data: Pandas has robust support for working with time
series data, including date and time indexing, resampling, and rolling
window calculations.
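For instance (a minimal sketch), a daily Series can be resampled to a coarser frequency:

```python
import pandas as pd

# Six days of values indexed by date
dates = pd.date_range('2023-01-01', periods=6, freq='D')
ts = pd.Series([1, 2, 3, 4, 5, 6], index=dates)

# Resample daily data into 2-day bins and sum each bin
resampled = ts.resample('2D').sum()
print(resampled)  # bins: 1+2=3, 3+4=7, 5+6=11
```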

Input/Output: Pandas provides functions to read data from various file formats, including CSV, Excel, SQL databases, JSON, and more. You can also write DataFrames to these formats.

Visualization: Pandas integrates with popular data visualization libraries like Matplotlib and Seaborn to create plots and charts directly from your data.

Merging and Joining: Pandas supports various methods for combining DataFrames through merging and joining operations, similar to SQL joins.

Data Transformation: You can perform data transformations, such as pivoting, melting, and stacking, to reshape your data for analysis.
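As a quick illustration (a sketch), melt turns wide-format columns into long-format rows:

```python
import pandas as pd

wide = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Math': [90, 80],
    'Science': [85, 95]
})

# Melt: each subject column becomes (variable, value) rows
long = pd.melt(wide, id_vars='Name', var_name='Subject', value_name='Score')
print(long)
```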

Pandas Data Structures:


1. Series:
A Pandas Series is a one-dimensional array-like object with labeled
data and an associated index. It's often used to represent a column of
data in a DataFrame.
import pandas as pd
# Creating a Series
data = pd.Series([10, 20, 30, 40, 50])

2. DataFrame:
A Pandas DataFrame is a two-dimensional table of data with rows and
columns. It's a powerful data structure for storing and manipulating
structured data.

# Creating a DataFrame from a dictionary


data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 28]}
df = pd.DataFrame(data)
print(df)
DataFrames - Part 1: Introduction to DataFrames

1. Introduction:
DataFrames are a core data structure in the Pandas library. They
provide a two-dimensional, labeled data structure that is highly
efficient for data manipulation and analysis.

2. Creating DataFrames:
You can create DataFrames using various methods:

From dictionaries: 'pd.DataFrame({'Column1': data1, 'Column2': data2})'
From lists: 'pd.DataFrame([data1, data2], columns=['Column1', 'Column2'])'
From external data sources (CSV, Excel, databases, etc.): pd.read_csv('data.csv')
Creating DataFrames is a fundamental task when working with data
analysis and manipulation in Python, especially when using libraries
like Pandas. A DataFrame is a two-dimensional, tabular data structure
that is commonly used to store and work with structured data.
import pandas as pd


# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

# Printing the DataFrame
print(df)
Importing Pandas: First, you need to import the Pandas library using
import pandas as pd. This is a common convention in data analysis to
use pd as an alias for Pandas.

Creating Data: In this example, we create a Python dictionary called data with three keys ('Name', 'Age', and 'City'). Each key is associated with a list of values. Each list represents a column in the DataFrame.

Creating the DataFrame: We use the pd.DataFrame() constructor to create a DataFrame from the data dictionary. The constructor takes the dictionary as input, and each key in the dictionary becomes a column in the DataFrame. The values in each list become the data in the corresponding column.

Printing the DataFrame: We print the DataFrame df to the console. The DataFrame is displayed in a tabular format, where each row represents an observation (in this case, a person), and each column represents a variable (in this case, 'Name', 'Age', and 'City').
3. Exploring DataFrames:
'df.head(n)': Display the first n rows of the DataFrame.
'df.tail(n)': Display the last n rows of the DataFrame.
'df.info()': Display information about the DataFrame, including data types and non-null counts.
import pandas as pd
# Creating a sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 22, 35],
'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

# Exploring the DataFrame

# 1. Displaying the first few rows of the DataFrame
print("First 3 rows of the DataFrame:")
print(df.head(3))

# 2. Getting basic information about the DataFrame
print("\nSummary information about the DataFrame:")
df.info()  # info() prints its report directly (it returns None)

# 3. Descriptive statistics of numeric columns
print("\nDescriptive statistics of numeric columns:")
print(df.describe())

# 4. Checking for missing values
print("\nChecking for missing values:")
print(df.isnull())

# 5. Counting unique values in a column
print("\nCount of unique values in the 'City' column:")
print(df['City'].value_counts())

# 6. Selecting specific columns
print("\nSelecting specific columns ('Name' and 'Age'):")
print(df[['Name', 'Age']])

Displaying Rows: We use the head() method to display the first few
rows of the DataFrame. In this case, we show the first 3 rows.

Summary Information: The info() method provides a summary of the DataFrame, including the number of rows and columns, column names, non-null counts, and data types.


Descriptive Statistics: The describe() method provides descriptive statistics for the numeric columns, including count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum.

Checking for Missing Values: The isnull() method checks for missing
values in the DataFrame. In this case, there are no missing values, so
all entries are False.

Counting Unique Values: We use value_counts() to count the unique values in the 'City' column. This is useful for understanding the distribution of categorical data.

Selecting Specific Columns: We select specific columns ('Name' and 'Age') by specifying their names within double square brackets.

DataFrames - Part 2: Indexing and Selection

1. Indexing and Selection: Indexing and selection in Pandas DataFrames are essential operations for retrieving specific data from your dataset. This allows you to access and manipulate the data you need for analysis.

import pandas as pd

# Creating a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}


df = pd.DataFrame(data)

# 1. Selecting a single column by column name
name_column = df['Name']
print("Name column:")
print(name_column)

# 2. Selecting multiple columns by column names
name_age_columns = df[['Name', 'Age']]
print("\nName and Age columns:")
print(name_age_columns)

# 3. Selecting rows by index (integer location)
second_row = df.iloc[1]
print("\nSecond row by integer location:")
print(second_row)

# 4. Selecting specific rows and columns by integer location
subset = df.iloc[1:3, 0:2]
print("\nSubset of rows and columns by integer location:")
print(subset)

# 5. Selecting rows based on a condition
young_people = df[df['Age'] < 30]
print("\nYoung people (age < 30):")
print(young_people)

# 6. Selecting rows based on multiple conditions
city_condition = (df['City'] == 'New York') | (df['City'] == 'Chicago')
selected_cities = df[city_condition]
print("\nSelected cities (New York or Chicago):")
print(selected_cities)
Selecting a Single Column: To select a single column by its name, use
square brackets and the column name as a string. In this case, we
selected the 'Name' column.

Selecting Multiple Columns: To select multiple columns, pass a list of column names within double square brackets. Here, we selected both the 'Name' and 'Age' columns.

Selecting Rows by Index (Integer Location): You can use the iloc
indexer to select specific rows by their integer location. In this
example, we selected the second row (index 1).

Selecting Specific Rows and Columns: By using iloc with row and
column indices, you can select a subset of rows and columns. Here,
we selected rows 1 and 2 and columns 0 and 1.

Selecting Rows Based on a Condition: You can filter rows based on a condition. In this case, we selected all rows where the 'Age' is less than 30.


Selecting Rows Based on Multiple Conditions: You can use logical operators (| for OR, & for AND) to filter rows based on multiple conditions. Here, we selected rows where the 'City' is either 'New York' or 'Chicago'.

Column selection: 'df['Column']' or 'df.Column'
Multiple columns: 'df[['Column1', 'Column2']]'
Row selection using boolean indexing: 'df[df['Column'] > value]'
Location-based selection: 'df.loc[row_label, column_label]'
Integer-based selection: 'df.iloc[row_index, column_index]'
2. Filtering Data:
Using boolean conditions to filter rows: 'df[df['Column'] > value]'
Combining multiple conditions: 'df[(df['Column1'] > value1) & (df['Column2'] < value2)]'

DataFrames - Part 3: Data Manipulation and Operations

Data manipulation and operations in Pandas DataFrames are essential for transforming and processing data to extract meaningful insights.
1. Adding and Modifying Data:
Adding columns: 'df['NewColumn'] = data'
Modifying values based on conditions: 'df.loc[df['Column'] > value, 'NewColumn'] = new_value'
import pandas as pd
# Creating a sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],

'Age': [25, 30, 22, 35],
'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
# 1. Adding a new column
df['Gender'] = ['Female', 'Male', 'Male', 'Male']
print("DataFrame with a new 'Gender' column:")
print(df)

# 2. Removing a column
df.drop(columns='Gender', inplace=True)
print("\nDataFrame with 'Gender' column removed:")
print(df)
# 3. Filtering rows based on a condition
young_people = df[df['Age'] < 30]
print("\nYoung people (age < 30):")
print(young_people)
# 4. Sorting the DataFrame by a column
sorted_df = df.sort_values(by='Age', ascending=False)
print("\nSorted DataFrame by 'Age' in descending order:")
print(sorted_df)
# 5. Aggregating data (calculating mean age)
mean_age = df['Age'].mean()
print("\nMean age of all people:")


print(mean_age)
# 6. Grouping and aggregating data
city_groups = df.groupby('City')['Age'].mean()
print("\nMean age of people in each city:")
print(city_groups)
# 7. Applying a function to a column
def is_adult(age):
    return age >= 18

df['IsAdult'] = df['Age'].apply(is_adult)
print("\nDataFrame with 'IsAdult' column:")
print(df)
Adding a New Column: We added a new 'Gender' column to the
DataFrame by assigning a list of values to it.

Removing a Column: We removed the 'Gender' column using the drop method with the columns parameter and set inplace=True to modify the DataFrame in place.

Filtering Rows Based on a Condition: We filtered rows where the 'Age' is less than 30 to create a DataFrame of young people.

Sorting the DataFrame: We sorted the DataFrame by the 'Age' column in descending order using sort_values. This helps in arranging data in a specific order.


Aggregating Data: We calculated the mean age of all people using the
mean method on the 'Age' column.

Grouping and Aggregating Data: We grouped the data by the 'City' column and calculated the mean age in each city using groupby and mean.

Applying a Function to a Column: We applied a custom function is_adult to the 'Age' column to create a new 'IsAdult' column based on the age condition.

Missing Data:

Introduction:
Missing data is a common challenge in data analysis. Pandas provides tools to handle and manage missing data effectively.

import pandas as pd
import numpy as np
# Create a sample DataFrame with missing data
data = {
'A': [1, 2, np.nan, 4, 5],
'B': [np.nan, 2, 3, 4, np.nan],
'C': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)
# 1. Checking for missing data
print("Checking for missing data:")
print(df.isnull())
# 2. Counting missing values in each column
missing_count = df.isnull().sum()
print("\nCount of missing values in each column:")
print(missing_count)

# 3. Dropping rows or columns with missing data
df_dropped_rows = df.dropna()        # Drop rows with any missing values
df_dropped_cols = df.dropna(axis=1)  # Drop columns with any missing values

# 4. Filling missing data
df_filled = df.fillna(0)  # Fill missing values with 0

# 5. Interpolating missing values
df_interpolated = df.interpolate()  # Interpolate missing values

# 6. Replacing missing values with a specific value
df_replaced = df.replace(np.nan, -1)  # Replace NaN with -1

# 7. Forward-fill missing values
df_forward_filled = df.ffill()

# 8. Backward-fill missing values
df_backward_filled = df.bfill()
# 9. Checking for any missing values left
print("\nChecking for any missing values left:")
print(df.isnull())
Checking for Missing Data: We use the isnull() method to create a
Boolean DataFrame where True indicates missing values (NaN) and
False indicates non-missing values.

Counting Missing Values: We count the missing values in each column using isnull().sum(), which gives the count of True values (True = missing) in each column.

Dropping Rows or Columns with Missing Data: We can remove rows or columns containing missing data using dropna(). The axis parameter specifies whether to drop rows (axis=0) or columns (axis=1). In this example, we create two DataFrames: one with rows containing missing values removed and one with columns containing missing values removed.

Filling Missing Data: We can fill missing values using fillna(). In this
example, we fill missing values with 0.

Interpolating Missing Data: Interpolation is used to estimate missing values based on surrounding data points. We use interpolate() to fill missing values based on linear interpolation.

Replacing Missing Data: We can replace missing values with a specific value using replace(). In this case, we replace NaN with -1.


Forward-Fill and Backward-Fill: Forward-fill (ffill()) replaces missing values with the previous non-missing value in the same column. Backward-fill (bfill()) replaces missing values with the next non-missing value in the same column.

Checking for Any Missing Values Left: After applying various methods to handle missing data, we check if any missing values are still present in the DataFrame.

Handling missing data is crucial because it ensures that your analysis is based on complete and accurate information. Depending on your specific dataset and analysis, you can choose the appropriate method for dealing with missing data, whether it's dropping, filling, interpolating, or replacing the missing values.

Groupby:
Introduction:
Groupby operation allows you to group data based on a column and
perform aggregate functions on the groups.
The groupby operation in Pandas is a powerful tool for splitting,
applying a function, and combining the results on a DataFrame based
on some criteria. It is often used for data aggregation and summary
statistics. Let's explore the groupby operation in Pandas:
import pandas as pd

# Create a sample DataFrame
data = {
'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
'Value': [10, 15, 12, 18, 9, 11],
'Quantity': [100, 150, 120, 180, 90, 110]
}

df = pd.DataFrame(data)

# Grouping by 'Category'
grouped = df.groupby('Category')

# 1. Applying an aggregation function (e.g., mean) to grouped data
mean_values = grouped.mean()
print("Mean values for each category:")
print(mean_values)

# 2. Applying multiple aggregation functions
agg_functions = {
'Value': 'mean',
'Quantity': 'sum'
}
aggregated = grouped.agg(agg_functions)
print("\nAggregated values (mean for 'Value' and sum for 'Quantity'):")
print(aggregated)

# 3. Applying a custom aggregation function
def custom_aggregation(arr):
    return arr.max() - arr.min()

custom_agg = grouped['Value'].agg(custom_aggregation)
print("\nCustom aggregation (max - min) for 'Value' in each category:")
print(custom_agg)

# 4. Iterating through groups
print("\nIterating through groups and displaying them:")
for category, group_data in grouped:
    print(f"Category: {category}")
    print(group_data)

# 5. Selecting a specific group
group_a = grouped.get_group('A')
print("\nData for 'Category' A:")
print(group_a)
Applying an Aggregation Function: We group the DataFrame by the
'Category' column using groupby. Then, we calculate the mean values
for each group using the mean() function. This gives us the average
'Value' and 'Quantity' for each category.

Applying Multiple Aggregation Functions: We define a dictionary of aggregation functions for specific columns and use the agg method to apply them to the grouped data. In this case, we calculate the mean of 'Value' and the sum of 'Quantity' for each category.


Applying a Custom Aggregation Function: We define a custom aggregation function custom_aggregation that computes the range (max - min) of a given array. We apply this function to the 'Value' column in each category and display the results.
Iterating Through Groups: We iterate through the groups created by
groupby and display each group's data. This can be useful for custom
processing or analysis on each group separately.

Selecting a Specific Group: We use get_group to retrieve data for a specific group, in this case, 'Category' A. This allows you to access and manipulate data for a specific category easily.
