Chapter 2
Introduction to NumPy in Detail:
Asst Professor :- Ingle A.R
Subject Title : Data Science Using Python SEMESTER - III
Subject Ref. No. : MANC503
import numpy as np
# Create an array from a Python list
arr = np.array([1, 2, 3, 4, 5])
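# Two example 2x2 matrices (illustrative values)
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
# Element-wise addition
result_add = a + b
# Element-wise multiplication
result_mul = a * b
# Matrix multiplication (for 2D arrays, np.dot(a, b) is equivalent to a @ b)
result_matmul = np.dot(a, b)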
Importing NumPy:
To use NumPy, you need to import the library:
import numpy as np
Creating NumPy Arrays:
NumPy arrays can be created in several ways:
From a Python List:
arr = np.array([1, 2, 3, 4, 5])
From a Nested List (2D Array):
matrix = np.array([[1, 2, 3], [4, 5, 6]])
Array Attributes:
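A minimal sketch of the attributes most often inspected on an array (ndim, shape, size, dtype). The 2x3 matrix is an illustrative assumption that reuses the nested-list example from earlier in these notes.

```python
import numpy as np

# Illustrative 2x3 matrix (values chosen only for demonstration)
matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(matrix.ndim)   # number of dimensions: 2
print(matrix.shape)  # (rows, columns): (2, 3)
print(matrix.size)   # total number of elements: 6
print(matrix.dtype)  # element data type (an integer type; exact width depends on the platform)
```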
1. Indexing Basics:
1. NumPy Operations:
NumPy operations provide powerful tools for performing
computations on arrays efficiently. Understanding these operations is
crucial for effective data manipulation and analysis.
Arithmetic Operations:
Aggregation Functions:
Sum: 'np.sum(arr)'
Mean: 'np.mean(arr)'
Median: 'np.median(arr)'
Standard deviation: 'np.std(arr)'
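The four functions listed above can be tried on one small array. This is a sketch with illustrative values. Note that np.std computes the population standard deviation by default (ddof=0); passing ddof=1 gives the sample standard deviation.

```python
import numpy as np

arr = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # illustrative values

total = np.sum(arr)               # 40
mean = np.mean(arr)               # 5.0
median = np.median(arr)           # 4.5 (average of the two middle values)
std = np.std(arr)                 # 2.0 (population standard deviation, ddof=0)
sample_std = np.std(arr, ddof=1)  # sample standard deviation, slightly larger

print(total, mean, median, std, sample_std)
```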
Reshaping and Transposing:
Reshaping in NumPy:
Reshaping refers to changing the dimensions (shape) of an array
without changing the data it contains. It's often used when you want to
convert a one-dimensional array into a two-dimensional matrix or
vice versa, or when you need to change the number of rows or
columns in a multidimensional array.
import numpy as np
# Create a one-dimensional array with 12 elements
arr = np.arange(12)
# Reshape it into a 3x4 matrix
reshaped_arr = arr.reshape(3, 4)
print("Original array:")
print(arr)
print("\nReshaped array:")
print(reshaped_arr)
In this example, we created a one-dimensional array with 12 elements
and then reshaped it into a 3x4 matrix using the reshape method. The
resulting array keeps the same data but has a different shape.
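Because the data is unchanged, the new shape must account for exactly the same number of elements. The sketch below shows two related points under that constraint: passing -1 lets NumPy infer one dimension, and an incompatible shape raises ValueError. The variable names are illustrative.

```python
import numpy as np

arr = np.arange(12)

# -1 asks NumPy to infer that dimension from the total element count
two_rows = arr.reshape(2, -1)   # shape (2, 6)
flat = two_rows.reshape(-1)     # back to one dimension, shape (12,)

# 12 elements cannot fill a 5x3 matrix, so this raises ValueError
try:
    arr.reshape(5, 3)
    reshape_failed = False
except ValueError:
    reshape_failed = True

print(two_rows.shape, flat.shape, reshape_failed)
```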
Transposing in NumPy:
Transposing swaps the rows and columns of a two-dimensional array.
This operation is useful when you want to change the orientation of
your data, for example to make the shapes of two matrices compatible
for matrix multiplication or to align data for other calculations.
import numpy as np
# Create a 2x3 matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
# Transpose the matrix (swap rows and columns)
transposed_matrix = matrix.T
print("Original matrix:")
print(matrix)
print("\nTransposed matrix:")
print(transposed_matrix)
Original matrix:
[[1 2 3]
 [4 5 6]]
Transposed matrix:
[[1 4]
 [2 5]
 [3 6]]
In this example, we created a 2x3 matrix and then used the .T attribute
to transpose it. The resulting matrix swaps the rows and columns,
changing its orientation.
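One place where the orientation matters is matrix multiplication, where the column count of the left matrix must equal the row count of the right one. A small sketch using the same 2x3 matrix:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])   # shape (2, 3)

# (2, 3) @ (3, 2) gives a (2, 2) result.
# matrix @ matrix would fail because the shapes (2, 3) and (2, 3) do not line up.
product = matrix @ matrix.T
print(product)
# [[14 32]
#  [32 77]]
```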
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
element = arr[1, 2] # Access element at row 1, column 2 (6)
import numpy as np
arr = np.array([10, 20, 30, 40, 50])
mean_value = np.mean(arr) # Calculate mean
std_deviation = np.std(arr) # Calculate standard deviation
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
reshaped = arr.reshape(3, 2) # Reshape to 3 rows, 2 columns
transposed = arr.T # Transpose the matrix
Introduction to Pandas:
Features of Pandas:
DataFrame: The DataFrame is a two-dimensional, size-mutable, and
heterogeneous tabular data structure. It is the most commonly used
data structure in Pandas and can be thought of as a spreadsheet or
SQL table. DataFrames can hold data of various types, including
numeric, string, and categorical data.
Time Series Data: Pandas has robust support for working with time
series data, including date and time indexing, resampling, and rolling
window calculations.
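The three time series features named above can be sketched on a small daily series. The dates and values are illustrative assumptions, not data from these notes.

```python
import pandas as pd

# Six daily observations (illustrative values) indexed by date
dates = pd.date_range("2024-01-01", periods=6, freq="D")
series = pd.Series([1, 2, 3, 4, 5, 6], index=dates)

# Date indexing: select a range of days by label (both ends included)
first_three = series.loc["2024-01-01":"2024-01-03"]

# Resampling: combine every two days into one bin and sum each bin
two_day_totals = series.resample("2D").sum()    # 3, 7, 11

# Rolling window: mean over the current day and the two previous days
rolling_mean = series.rolling(window=3).mean()  # NaN, NaN, 2.0, 3.0, 4.0, 5.0

print(two_day_totals)
print(rolling_mean)
```

The first two rolling values are NaN because a full three-day window is not yet available.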
2. DataFrame:
A Pandas DataFrame is a two-dimensional table of data with rows and
columns. It's a powerful data structure for storing and manipulating
structured data.
1. Introduction:
DataFrames are a core data structure in the Pandas library. They
provide a two-dimensional, labeled data structure that is highly
efficient for data manipulation and analysis.
2. Creating DataFrames:
You can create DataFrames using various methods:
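As a sketch, three common ways to build the same DataFrame are shown here. The names and ages are hypothetical sample records used only for illustration.

```python
import pandas as pd

# Hypothetical sample records; names and values are illustrative only

# 1. From a dictionary of lists (keys become column names)
df_from_dict = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meera"],
    "Age": [24, 31, 28],
})

# 2. From a list of dictionaries (each dictionary is one row)
df_from_records = pd.DataFrame([
    {"Name": "Asha", "Age": 24},
    {"Name": "Ravi", "Age": 31},
    {"Name": "Meera", "Age": 28},
])

# 3. From a list of rows with explicit column names
df_from_rows = pd.DataFrame(
    [["Asha", 24], ["Ravi", 31], ["Meera", 28]],
    columns=["Name", "Age"],
)

print(df_from_dict)
```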
df = pd.DataFrame(data)
df = pd.DataFrame(data)
Displaying Rows: We use the head() method to display the first few
rows of the DataFrame. In this case, we show the first 3 rows.
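A minimal sketch of head(), assuming a hypothetical five-row DataFrame (the original data for this example is not shown in these notes):

```python
import pandas as pd

# Hypothetical five-row DataFrame used only to demonstrate head()
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meera", "John", "Sara"],
    "Age": [24, 31, 28, 35, 22],
})

first_three = df.head(3)   # rows at positions 0, 1 and 2
default_head = df.head()   # with no argument, head() returns up to 5 rows
print(first_three)
```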
Checking for Missing Values: The isnull() method checks for missing
values in the DataFrame. In this case, there are no missing values, so
all entries are False.
df = pd.DataFrame(data)
Selecting Rows by Index (Integer Location): You can use the iloc
indexer to select specific rows by their integer location. In this
example, we selected the second row (index 1).
Selecting Specific Rows and Columns: By using iloc with row and
column indices, you can select a subset of rows and columns. Here,
we selected rows 1 and 2 and columns 0 and 1.
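The two selections described above can be sketched as follows. The DataFrame contents are hypothetical. Only the use of iloc with positions 1, 1 to 2, and 0 to 1 comes from the text.

```python
import pandas as pd

# Hypothetical data used only to demonstrate position-based selection
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meera", "John"],
    "Age": [24, 31, 28, 35],
    "City": ["Pune", "Mumbai", "Nagpur", "Delhi"],
})

second_row = df.iloc[1]                # Series holding the row at position 1
subset = df.iloc[1:3, 0:2]             # rows 1 and 2, columns 0 and 1 (end position excluded)
same_subset = df.iloc[[1, 2], [0, 1]]  # same selection using explicit position lists

print(second_row)
print(subset)
```

Unlike label-based slicing with loc, iloc slices exclude the end position, so 1:3 covers rows 1 and 2 only.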
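import pandas as pd
# Illustrative sample data (assumed; the original values are not shown here)
data = {
    'Name': ['Asha', 'Ravi', 'Meera', 'John'],
    'Age': [24, 31, 28, 35],
    'City': ['Pune', 'Mumbai', 'Pune', 'Mumbai']
}
df = pd.DataFrame(data)
# 1. Adding a new column (values assumed for illustration)
df['Gender'] = ['Female', 'Male', 'Female', 'Male']
print("\nDataFrame with 'Gender' column added:")
print(df)
# 2. Removing a column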
df.drop(columns='Gender', inplace=True)
print("\nDataFrame with 'Gender' column removed:")
print(df)
# 3. Filtering rows based on a condition
young_people = df[df['Age'] < 30]
print("\nYoung people (age < 30):")
print(young_people)
# 4. Sorting the DataFrame by a column
sorted_df = df.sort_values(by='Age', ascending=False)
print("\nSorted DataFrame by 'Age' in descending order:")
print(sorted_df)
# 5. Aggregating data (calculating mean age)
mean_age = df['Age'].mean()
print("\nMean age of all people:")
print(mean_age)
# 6. Grouping and aggregating data
city_groups = df.groupby('City')['Age'].mean()
print("\nMean age of people in each city:")
print(city_groups)
# 7. Applying a function to a column
def is_adult(age):
    return age >= 18
df['IsAdult'] = df['Age'].apply(is_adult)
print("\nDataFrame with 'IsAdult' column:")
print(df)
Adding a New Column: We added a new 'Gender' column to the
DataFrame by assigning a list of values to it.
Aggregating Data: We calculated the mean age of all people using the
mean method on the 'Age' column.
Missing Data:
Introduction:
Missing data is a common challenge in data analysis. Pandas provides
tools to handle and manage missing data effectively.
import pandas as pd
import numpy as np
# Create a sample DataFrame with missing data
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, np.nan],
    'C': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)
# 1. Checking for missing data
df_backward_filled = df.bfill()
# 9. Checking for any missing values left
print("\nChecking for any missing values left:")
print(df.isnull())
Checking for Missing Data: We use the isnull() method to create a
Boolean DataFrame where True indicates missing values (NaN) and
False indicates non-missing values.
Filling Missing Data: We can fill missing values using fillna(). In this
example, we fill missing values with 0.
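The checking and filling steps can be sketched end to end on the sample DataFrame defined above. Variable names such as filled_with_zero are illustrative. All methods used (isnull, fillna, dropna, ffill, bfill) are standard pandas calls.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, np.nan, 4, 5],
    "B": [np.nan, 2, 3, 4, np.nan],
    "C": [1, 2, 3, 4, 5],
})

missing_per_column = df.isnull().sum()  # A: 1, B: 2, C: 0
filled_with_zero = df.fillna(0)         # every NaN replaced by 0
rows_without_nan = df.dropna()          # keeps only rows 1 and 3
forward_filled = df.ffill()             # copies the previous valid value downward
backward_filled = df.bfill()            # copies the next valid value upward

# Directional filling can leave gaps at the edges:
# the last value in B has no later value to copy from.
print(backward_filled.isnull().sum())
```

These methods return new DataFrames and leave df unchanged, so a check for remaining gaps should be run on the filled result rather than on df.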
Groupby:
Introduction:
The groupby operation lets you group data by the values in a column and
apply aggregate functions to each group.
It follows a split-apply-combine pattern: the DataFrame is split into
groups based on some criteria, a function is applied to each group, and
the results are combined. It is often used for data aggregation and
summary statistics. Let's explore the groupby operation in Pandas:
import pandas as pd
df = pd.DataFrame(data)
# Grouping by 'Category'
grouped = df.groupby('Category')
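# Custom aggregation: range (max - min) of the values in each group
def custom_aggregation(x):
    return x.max() - x.min()
custom_agg = grouped['Value'].agg(custom_aggregation)
print("\nCustom aggregation (max - min) for 'Value' in each category:")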
print(custom_agg)
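A self-contained sketch of the same groupby workflow follows. Only the column names 'Category' and 'Value' come from these notes. The data values and the helper name value_range are hypothetical.

```python
import pandas as pd

# Hypothetical data; only the column names come from the notes
data = {
    "Category": ["A", "B", "A", "B", "A"],
    "Value": [10, 20, 30, 40, 50],
}
df = pd.DataFrame(data)
grouped = df.groupby("Category")


def value_range(values):
    """Spread of a group: largest value minus smallest value."""
    return values.max() - values.min()


group_sum = grouped["Value"].sum()               # A: 90, B: 60
group_mean = grouped["Value"].mean()             # A: 30.0, B: 30.0
group_range = grouped["Value"].agg(value_range)  # A: 40, B: 20

print(group_sum)
print(group_mean)
print(group_range)
```

Built-in aggregations such as sum and mean are usually faster than a custom Python function passed to agg, so a custom function is best kept for statistics that pandas does not provide directly.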