NumPy and Pandas Tutorial

Uploaded by

omvati343

NumPy and Pandas for Data Analysis AI ML Training

NumPy Tutorial
Introduction

NumPy (Numerical Python) is a library for the Python programming language, adding support
for large, multi-dimensional arrays and matrices, along with a large collection of high-level
mathematical functions to operate on these arrays.

Installation

To install NumPy, use the following command:

pip install numpy

Basic Operations

Importing NumPy

import numpy as np

Creating Arrays

# Create a 1D array
array_1d = np.array([1, 2, 3, 4, 5])
print(array_1d)

# Create a 2D array
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
print(array_2d)

# Create an array with zeros
zeros_array = np.zeros((3, 4))
print(zeros_array)

# Create an array with ones
ones_array = np.ones((2, 3))
print(ones_array)

# Create an identity matrix
identity_matrix = np.eye(3)
print(identity_matrix)

# Create an array with a range of values (start 10, stop 20 excluded, step 2)
range_array = np.arange(10, 20, 2)
print(range_array)

# Create an array with evenly spaced values (5 values from 0 to 1, both ends included)
linspace_array = np.linspace(0, 1, 5)
print(linspace_array)

Array Operations

# Arithmetic operations
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

LinkedIn: www.linkedin.com/in/nidhi-grover-raheja-904211138



print(a + b) # Addition
print(a - b) # Subtraction
print(a * b) # Element-wise multiplication
print(a / b) # Element-wise division

# Matrix multiplication
matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])
print(np.dot(matrix_a, matrix_b))

# Broadcasting
array_broadcast = np.array([1, 2, 3])
print(array_broadcast + 1) # Adds 1 to each element

# Statistical operations
print(np.mean(a)) # Mean
print(np.median(a)) # Median
print(np.std(a)) # Standard deviation
print(np.sum(a)) # Sum
print(np.min(a)) # Minimum
print(np.max(a)) # Maximum
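
Broadcasting is not limited to an array and a scalar; it also applies between arrays of different shapes. The following is a minimal sketch of that behavior. The names matrix, row, and col are made up for this illustration and are not part of the tutorial.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])     # shape (3,)
col = np.array([[100], [200]])   # shape (2, 1)

# The 1D row is stretched across both rows of the matrix.
row_sum = matrix + row
print(row_sum)

# The (2, 1) column is stretched across all three columns.
col_sum = matrix + col
print(col_sum)

# For 2D arrays, the @ operator gives the same result as np.dot.
matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])
print(np.array_equal(matrix_a @ matrix_b, np.dot(matrix_a, matrix_b)))
```

Shapes are compared from the last dimension backwards, and a dimension of size 1 (or a missing dimension) is stretched to match the other array.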

Indexing and Slicing

array = np.array([1, 2, 3, 4, 5, 6])

# Indexing
print(array[0]) # First element
print(array[-1]) # Last element

# Slicing
print(array[1:4]) # Elements from index 1 to 3
print(array[:3]) # First three elements
print(array[3:]) # Elements from index 3 to end
print(array[::2]) # Every second element

Reshaping Arrays

array = np.arange(1, 10)
reshaped_array = array.reshape((3, 3))
print(reshaped_array)

# Flattening arrays
flattened_array = reshaped_array.flatten()
print(flattened_array)
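
Two details of reshaping are easy to miss: a dimension given as -1 is inferred by NumPy, and flatten() returns a copy while ravel() may return a view. The sketch below illustrates both under simple assumptions; the variable names are hypothetical.

```python
import numpy as np

original = np.arange(1, 10)

# -1 asks NumPy to infer that dimension from the array size (9 / 3 = 3).
grid = original.reshape(3, -1)
print(grid.shape)

# flatten() always returns a copy, so editing it leaves grid untouched.
flat_copy = grid.flatten()
flat_copy[0] = 99
print(grid[0, 0])

# ravel() returns a view when the memory layout allows it, so editing
# the result can change the source array.
flat_view = grid.ravel()
flat_view[0] = -1
print(grid[0, 0])
```

When the source array must stay unchanged, flatten() is the safer choice; ravel() avoids the copy when that does not matter.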

Pandas Tutorial
Introduction

Pandas is a library providing high-performance, easy-to-use data structures and data analysis
tools for the Python programming language.

Installation


To install Pandas, use the following command:

pip install pandas

Basic Operations

Importing Pandas

import pandas as pd

Creating DataFrames

# Create a DataFrame from a dictionary
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
print(df)

# Create a DataFrame from a CSV file
# ('path_to_csv_file.csv' is a placeholder; replace it with the path to a real file)
df_from_csv = pd.read_csv('path_to_csv_file.csv')
print(df_from_csv)

Viewing Data

# Display the first few rows
print(df.head())

# Display the last few rows
print(df.tail())

# Display the data types of columns
print(df.dtypes)

# Display the shape of the DataFrame
print(df.shape)

# Display summary statistics
print(df.describe())

Selecting Data

# Select a single column
print(df['Name'])

# Select multiple columns
print(df[['Name', 'City']])

# Select rows by integer position
print(df.iloc[0]) # First row
print(df.iloc[0:2]) # First two rows (end position excluded)

# Select rows by label
print(df.loc[0]) # Row labeled 0
print(df.loc[0:2]) # Rows labeled 0 through 2 (end label included)


# Conditional selection
print(df[df['Age'] > 30])
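
With the default index 0, 1, 2, ... the results of iloc and loc look almost identical, which hides the difference between position and label. The sketch below uses a hypothetical frame named people with a non-default index to make the distinction visible.

```python
import pandas as pd

# Hypothetical frame with a non-default integer index.
people = pd.DataFrame(
    {'Name': ['John', 'Anna', 'Peter', 'Linda'], 'Age': [28, 24, 35, 32]},
    index=[10, 20, 30, 40],
)

first_by_position = people.iloc[0]   # position 0
first_by_label = people.loc[10]      # label 10
by_position = people.iloc[0:2]       # positions 0 and 1 (end excluded)
by_label = people.loc[10:30]         # labels 10, 20, 30 (end included)

# A boolean mask and a column label can be combined in one .loc call.
names_over_30 = people.loc[people['Age'] > 30, 'Name']
print(names_over_30.tolist())
```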

Adding and Dropping Columns

# Add a new column
df['Country'] = ['USA', 'France', 'Germany', 'UK']
print(df)

# Drop a column
df = df.drop('Country', axis=1)
print(df)

Handling Missing Data

# Create a DataFrame with missing values
data_with_nan = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, None, 35, 32],
    'City': ['New York', 'Paris', None, 'London']
}
df_nan = pd.DataFrame(data_with_nan)
print(df_nan)

# Drop rows with missing values
df_dropped_nan = df_nan.dropna()
print(df_dropped_nan)

# Fill missing values
df_filled_nan = df_nan.fillna({'Age': df_nan['Age'].mean(), 'City': 'Unknown'})
print(df_filled_nan)
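
Before dropping or filling, it usually helps to see how many values are missing and where. The following sketch rebuilds the same frame so that it runs on its own; the names missing_per_column and age_known are hypothetical.

```python
import pandas as pd

df_nan = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, None, 35, 32],
    'City': ['New York', 'Paris', None, 'London'],
})

# Count missing values per column.
missing_per_column = df_nan.isna().sum()
print(missing_per_column)

# Drop a row only when 'Age' is missing; a missing 'City' is tolerated.
age_known = df_nan.dropna(subset=['Age'])
print(age_known)
```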

Grouping and Aggregating Data

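# Group by a column and calculate the mean of the numeric column
# (mean() is not defined for text columns such as 'Name', so select 'Age' first)
print(df.groupby('City')['Age'].mean())

# Group by multiple columns and calculate sum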
print(df.groupby(['City', 'Name']).sum())
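
In the sample frame every city appears only once, so each group holds a single row. The sketch below uses hypothetical data (a frame named visitors) in which cities repeat, and computes several aggregations per group in one call.

```python
import pandas as pd

# Hypothetical data in which each city appears more than once.
visitors = pd.DataFrame({
    'City': ['Paris', 'Paris', 'Berlin', 'Berlin'],
    'Age': [24, 30, 35, 45],
})

# Several aggregations of one column, computed per group.
summary = visitors.groupby('City')['Age'].agg(['mean', 'min', 'max'])
print(summary)
```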

Merging DataFrames

# Create two DataFrames
df1 = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})
df2 = pd.DataFrame({'Name': ['Peter', 'Linda'], 'City': ['Berlin', 'London']})

# Concatenate DataFrames (rows are stacked; columns missing from one frame are filled with NaN)
df_concat = pd.concat([df1, df2], ignore_index=True)
print(df_concat)

# Merge DataFrames on a shared key
# (the result is empty here because df1 and df2 have no names in common)
df_merge = pd.merge(df1, df2, on='Name', how='inner')
print(df_merge)
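
When a merge returns fewer rows than expected, an outer join with indicator=True is one way to see which side each row came from. This is a small sketch using the same two frames; the names inner and outer are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})
df2 = pd.DataFrame({'Name': ['Peter', 'Linda'], 'City': ['Berlin', 'London']})

# No name occurs in both frames, so the inner join has no rows.
inner = pd.merge(df1, df2, on='Name', how='inner')
print(len(inner))

# An outer join keeps every row; indicator=True adds a '_merge' column
# recording whether each row came from the left frame, the right, or both.
outer = pd.merge(df1, df2, on='Name', how='outer', indicator=True)
print(outer)
```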


Exporting Data

# Export DataFrame to CSV
df.to_csv('output.csv', index=False)

# Export DataFrame to Excel (requires an Excel writer engine such as openpyxl)
df.to_excel('output.xlsx', index=False)

Advanced Pandas Tutorial


Handling Time Series Data

Pandas provides robust support for time series data. Here's how to work with it.

Creating Time Series Data

# Create a date range
date_range = pd.date_range(start='2023-01-01', periods=10, freq='D')
print(date_range)

# Create a DataFrame with time series data
time_series_data = {
    'Date': date_range,
    'Value': np.random.randn(10)
}
df_time_series = pd.DataFrame(time_series_data)
df_time_series.set_index('Date', inplace=True)
print(df_time_series)

Resampling Time Series Data

# Resample to weekly frequency and calculate the mean
df_resampled = df_time_series.resample('W').mean()
print(df_resampled)

# Resample to monthly frequency and calculate the sum
# (pandas 2.2 and later deprecate the alias 'M' in favor of 'ME' for month end)
df_resampled_monthly = df_time_series.resample('M').sum()
print(df_resampled_monthly)
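
Because the values above are random, it is hard to see how rows are assigned to weekly bins. The sketch below uses fixed values 1 to 10 (an assumption made for illustration) so that the grouping can be checked by hand.

```python
import pandas as pd

# Ten daily values starting on Sunday, 2023-01-01.
dates = pd.date_range(start='2023-01-01', periods=10, freq='D')
daily = pd.Series(range(1, 11), index=dates)

# 'W' bins end on Sunday and are labeled by that closing date.
weekly_mean = daily.resample('W').mean()
print(weekly_mean)
```

The first bin holds only 1 January, the second covers 2 to 8 January, and the third holds the remaining two days, so the bins are not equally filled.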

Working with Categorical Data


# Create a DataFrame with categorical data
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
    'Gender': ['Male', 'Female', 'Male', 'Female']
}
df_categorical = pd.DataFrame(data)

# Convert a column to categorical type
df_categorical['Gender'] = df_categorical['Gender'].astype('category')
print(df_categorical)

# Get the categories and codes
print(df_categorical['Gender'].cat.categories)


print(df_categorical['Gender'].cat.codes)
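
Categories can also carry an order, which affects sorting and comparisons. The following sketch assumes a hypothetical column of clothing sizes; it is not part of the tutorial's data.

```python
import pandas as pd

# Hypothetical column whose values have a natural order.
sizes = pd.Series(['medium', 'small', 'large', 'small'])
size_type = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
sizes = sizes.astype(size_type)

# Each code is the position of the value in the category list.
print(sizes.cat.codes.tolist())

# Sorting and comparisons follow the declared order, not the alphabet.
print(sizes.sort_values().tolist())
print((sizes > 'small').sum())
```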

Pivot Tables

# Create a DataFrame
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John', 'Anna'],
    'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar'],
    'Sales': [150, 200, 130, 210, 170, 220]
}
df_sales = pd.DataFrame(data)

# Create a pivot table
pivot_table = df_sales.pivot_table(values='Sales', index='Name', columns='Month', aggfunc='sum')
print(pivot_table)
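
The pivoted columns come out in alphabetical order (Feb, Jan, Mar), which is rarely what a report needs. The sketch below shows one way to restore calendar order and to add totals; the names pivot, calendar_order, and with_totals are hypothetical.

```python
import pandas as pd

df_sales = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Anna', 'John', 'Anna'],
    'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar'],
    'Sales': [150, 200, 130, 210, 170, 220],
})

pivot = df_sales.pivot_table(values='Sales', index='Name', columns='Month', aggfunc='sum')
print(list(pivot.columns))  # month labels are sorted as plain strings

# Select the columns explicitly to get calendar order.
calendar_order = pivot[['Jan', 'Feb', 'Mar']]
print(calendar_order)

# margins=True adds an 'All' row and column holding the totals.
with_totals = df_sales.pivot_table(
    values='Sales', index='Name', columns='Month', aggfunc='sum', margins=True
)
print(with_totals)
```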

Handling Large Datasets


# Read a large CSV file in chunks
chunk_size = 1000
chunks = pd.read_csv('large_dataset.csv', chunksize=chunk_size)

# Process each chunk
for chunk in chunks:
    # Perform operations on the chunk
    print(chunk.shape)
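
A common use of chunked reading is to accumulate a result without holding the whole file in memory. The sketch below stands in for a large file with ten rows of in-memory text (hypothetical data), so it runs without 'large_dataset.csv'.

```python
from io import StringIO

import pandas as pd

# Stand-in for a large file: ten rows held in memory (hypothetical data).
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

total = 0
row_count = 0
chunk_sizes = []

# Each chunk is an ordinary DataFrame holding at most `chunksize` rows.
for chunk in pd.read_csv(StringIO(csv_text), chunksize=4):
    total += chunk['value'].sum()
    row_count += len(chunk)
    chunk_sizes.append(len(chunk))

print(total, row_count, chunk_sizes)
```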

Applying Functions

Using apply()

# Create a DataFrame
data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6]
}
df = pd.DataFrame(data)

# Define a function
def add_one(x):
    return x + 1

# Apply the function to each element
# (pandas 2.1 and later rename applymap to DataFrame.map; applymap is deprecated there)
print(df.applymap(add_one))

# Apply the function to each column
print(df.apply(lambda x: x + 1))

# Apply the function to each row
print(df.apply(lambda x: x + 1, axis=1))

Joining DataFrames

# Create two DataFrames
df1 = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value': [1, 2, 3, 4]
})
df2 = pd.DataFrame({
    'key': ['B', 'D', 'E', 'F'],
    'value': [5, 6, 7, 8]
})

# Inner join
inner_joined = pd.merge(df1, df2, on='key', how='inner')
print(inner_joined)

# Left join
left_joined = pd.merge(df1, df2, on='key', how='left')
print(left_joined)

# Right join
right_joined = pd.merge(df1, df2, on='key', how='right')
print(right_joined)

# Outer join
outer_joined = pd.merge(df1, df2, on='key', how='outer')
print(outer_joined)
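
Both frames have a column named 'value', so pandas renames the two copies in the result. The sketch below shows the default names and how the suffixes parameter of pd.merge can replace them; the names default_names and named are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5, 6, 7, 8]})

# By default the overlapping columns become value_x (left) and value_y (right).
default_names = pd.merge(df1, df2, on='key', how='inner')
print(list(default_names.columns))

# suffixes replaces _x and _y with more descriptive endings.
named = pd.merge(df1, df2, on='key', how='outer', suffixes=('_left', '_right'))
print(named)
```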

Window Functions

# Create a DataFrame with time series data
data = {
    'Date': pd.date_range(start='2023-01-01', periods=10, freq='D'),
    'Value': np.random.randn(10)
}
df = pd.DataFrame(data)
df.set_index('Date', inplace=True)

# Calculate rolling mean
rolling_mean = df['Value'].rolling(window=3).mean()
print(rolling_mean)

# Calculate expanding sum
expanding_sum = df['Value'].expanding().sum()
print(expanding_sum)

# Calculate exponentially weighted mean
ewm_mean = df['Value'].ewm(span=3).mean()
print(ewm_mean)
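
With random input the window results cannot be verified by eye. The following sketch uses the fixed values 1 to 5 (an assumption for illustration) to show what rolling and expanding windows compute, and how min_periods changes the leading entries.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# A window of 3 needs three observations, so the first two results are NaN.
strict = values.rolling(window=3).mean()
print(strict.tolist())

# min_periods=1 allows partial windows at the start.
lenient = values.rolling(window=3, min_periods=1).mean()
print(lenient.tolist())

# An expanding window grows from the first row to the current one.
running_total = values.expanding().sum()
print(running_total.tolist())
```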

Handling JSON Data


from io import StringIO

# Create a JSON string
json_str = '''
[
    {"Name": "John", "Age": 28, "City": "New York"},
    {"Name": "Anna", "Age": 24, "City": "Paris"},
    {"Name": "Peter", "Age": 35, "City": "Berlin"}
]
'''

# Read JSON string into DataFrame
# (wrapped in StringIO because passing a literal JSON string is deprecated in pandas 2.1 and later)
df_json = pd.read_json(StringIO(json_str))
print(df_json)


# Export DataFrame to JSON
# (orient='records' with lines=True writes one JSON object per line)
df_json.to_json('output.json', orient='records', lines=True)
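
A file written with lines=True is not a single JSON array, so it has to be read back with lines=True as well. The sketch below demonstrates the round trip in memory, using a small hypothetical frame instead of 'output.json'.

```python
from io import StringIO

import pandas as pd

df_json = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})

# Without a path, to_json returns the text instead of writing a file.
json_lines = df_json.to_json(orient='records', lines=True)
print(json_lines)

# lines=True is needed again when reading, so each line is parsed as one record.
round_trip = pd.read_json(StringIO(json_lines), lines=True)
print(round_trip)
```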

Advanced Indexing with MultiIndex


# Create a MultiIndex DataFrame
arrays = [
    ['A', 'A', 'B', 'B'],
    ['one', 'two', 'one', 'two']
]
index = pd.MultiIndex.from_arrays(arrays, names=('first', 'second'))
df_multi = pd.DataFrame({'value': [1, 2, 3, 4]}, index=index)
print(df_multi)

# Accessing data in MultiIndex DataFrame
print(df_multi.loc['A'])
print(df_multi.loc[('A', 'one')])
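
Selecting with loc starts from the outer level. To select on the inner level, or to turn a level into columns, xs and unstack are two options. This is a minimal sketch on the same data; the names ones and wide are hypothetical.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B'], ['one', 'two', 'one', 'two']],
    names=('first', 'second'),
)
df_multi = pd.DataFrame({'value': [1, 2, 3, 4]}, index=index)

# Cross-section: every row whose 'second' level equals 'one'.
ones = df_multi.xs('one', level='second')
print(ones)

# Move the 'second' level from the index into the columns.
wide = df_multi['value'].unstack('second')
print(wide)
```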

Combining DataFrames with concat and append


# Create DataFrames
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']
})
df2 = pd.DataFrame({
    'A': ['A3', 'A4', 'A5'],
    'B': ['B3', 'B4', 'B5']
})

# Concatenate DataFrames
concatenated = pd.concat([df1, df2], ignore_index=True)
print(concatenated)

# Append DataFrames
# (DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0;
# pd.concat gives the same result)
appended = pd.concat([df1, df2], ignore_index=True)
print(appended)

Performance Tips

# Use vectorized operations instead of loops
data = pd.DataFrame({
    'A': range(1000000),
    'B': range(1000000)
})

# Inefficient way: Using loops
data['C'] = [x + y for x, y in zip(data['A'], data['B'])]

# Efficient way: Using vectorized operations
data['C'] = data['A'] + data['B']
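
The difference between the two approaches can be measured rather than assumed. The sketch below times both with the standard timeit module on a smaller frame of 100,000 rows (a size chosen here only to keep the run short); the function names are hypothetical and the timings depend on the machine.

```python
import timeit

import pandas as pd

data = pd.DataFrame({'A': range(100_000), 'B': range(100_000)})

def with_loop():
    # Python-level iteration over both columns.
    return [x + y for x, y in zip(data['A'], data['B'])]

def vectorized():
    # One column-wise addition carried out inside pandas.
    return data['A'] + data['B']

loop_seconds = timeit.timeit(with_loop, number=5)
vector_seconds = timeit.timeit(vectorized, number=5)
print(f"loop: {loop_seconds:.4f} s, vectorized: {vector_seconds:.4f} s")
```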
