BSc DS Sem III Data Analytics with Python Journal KES Shroff College

Data Analytics with Python Journal

Practical No: 1

Aim: Setting up Python and Jupyter notebooks, basic Python exercises

a) Setting Up Python

Python 3.12.4 Installation on Windows

Step 1) Go to https://www.python.org/downloads/ and select the latest version for Windows.

Step 2) Save the setup file in D:\, then right-click it and run it as administrator.

Step 3) Select Customize Installation

Roll No: Name: 1

Step 4) Click NEXT

Step 5) On the next screen, select all the advanced options except the first one, then click Install.

You will get the following screen once the setup is successful

Step 6) Click the Close button once the installation is done.

b) Jupyter Notebooks

Install Anaconda Distribution:

o Anaconda is a popular Python distribution that includes Jupyter Notebook and
many essential libraries for data science.
o Download Anaconda from the Anaconda Distribution page
(https://www.anaconda.com/download).
o Provide a valid email ID and select the checkbox.
o Click the Submit button.

o Click the Download button and save the setup file in D:\.

● Keep clicking Next, leaving the default settings unchanged, until the Install
button appears.
● Click Install.
● Verify with your Gmail, GitHub, or MS Office account.

Click Next again whenever it appears on the screen, and then click Finish.

Find and launch Jupyter Notebook in Anaconda Navigator.

Launching it opens the following screen, from which you can work with Jupyter Notebook.

Click Notebook on the Anaconda home screen; the following screen appears.

Click Select.

Install Additional Libraries (if needed):

● Anaconda comes with many libraries pre-installed, including Pandas. If you need
additional libraries (e.g., Matplotlib, Seaborn for visualization), you can install them using
Anaconda Navigator or via the command line with conda or pip.
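Before installing anything, it can help to check which libraries are already available. The following is a minimal sketch using only the standard library; the list of package names is an assumption based on the libraries used in this journal, and the helper name missing_packages is hypothetical.

```python
import importlib.util

# Libraries used later in this journal; edit the list to match your own needs.
required = ["numpy", "pandas", "matplotlib", "seaborn"]

def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(required)
if missing:
    print("Install these with conda or pip:", ", ".join(missing))
else:
    print("All listed libraries are available.")
```

Any name reported as missing can then be installed from Anaconda Navigator or from the command line.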

c) Basic Python exercise

Create a list of six integers and perform the following list operations: append a
new element, remove a specified element, sort the list in ascending order, and find
the largest and smallest elements using max and min.

# Original list

numbers = [5, 2, 9, 1, 5, 6]

print ('original list',numbers)

# 1. Append a new element to the list

new_element = 10

numbers.append(new_element)

print("List after appending ",new_element, ": ",numbers)

# 2. Remove an element from the list

element_to_remove = 9

if element_to_remove in numbers:
    numbers.remove(element_to_remove)
    print("List after removing ",element_to_remove,": ",numbers)
else:
    print("Element ",element_to_remove," not found in the list.")

# 3. Sort the list in ascending order

numbers.sort()

print("List after sorting: ",numbers)

# 4. Find the maximum and minimum values in the list

max_value = max(numbers)

min_value = min(numbers)

print("Maximum value: ",max_value)

print("Minimum value: ",min_value)
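Two details of these list operations are easy to miss: remove deletes only the first matching element, and sort changes the list in place while sorted returns a new list. A small sketch using the same starting list (the variable names here are illustrative only):

```python
numbers = [5, 2, 9, 1, 5, 6]

# remove() deletes only the first occurrence, so one 5 is left in the list
numbers.remove(5)
print(numbers)

# sorted() returns a new list and leaves the original order untouched
ordered = sorted(numbers)
print(ordered, numbers)

# remove() raises ValueError when the element is absent,
# which is why the exercise checks membership first
try:
    numbers.remove(42)
    removed = True
except ValueError:
    removed = False
print("42 removed:", removed)
```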

Output:

Practical No: 2

Aim: Data manipulation tasks with Pandas

a) Data cleaning and filtering:

Aim: Create a DataFrame from a given student-grade dataset. Clean the data by
filling the missing values in each column with that column's average and
removing duplicate rows. Then calculate the average grade for each student,
filter out low-performing students whose average grade is less than or equal
to 2, find the maximum value in the ‘Math’ column, and identify the student(s)
with the maximum ‘Math’ score.

Description:

import pandas as pd
import numpy as np

# Create DataFrame
df = pd.read_csv("student-grade.csv")

# Fill missing values in each subject with that column's own average

for col in ['Math', 'Science', 'English']:
    df[col] = df[col].fillna(df[col].mean())
print(df.to_string())

# Calculate the average grade for each student

df['Average'] = df[['Math', 'Science', 'English']].mean(axis=1)
print(df.to_string())

# Remove duplicate rows (drop_duplicates returns a new DataFrame)

df = df.drop_duplicates()

# Filter out low-performing students (average grade less than or equal to 2)

filtered_df = df[df['Average'] > 2]

# Display the cleaned and filtered DataFrame

print(filtered_df)

max_math = df['Math'].max()

# Find the rows where 'Math' is equal to the maximum value

max_math_students = df[df['Math'] == max_math]

# Display the result

print("The maximum value in the 'Math' column is:", max_math)

print("The student(s) with the maximum 'Math' score:")

print(max_math_students[['id', 'name', 'Math']])
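Because student-grade.csv is not reproduced in this journal, the same cleaning steps can be tried on a small stand-in dataset. The sketch below assumes hypothetical names and grades on a 0 to 5 scale; it drops duplicates before filling so that repeated rows do not distort the column averages.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for student-grade.csv (values are made up)
df = pd.DataFrame({
    "id": [1, 2, 3, 3, 4],
    "name": ["Asha", "Ravi", "Meena", "Meena", "Kiran"],
    "Math": [4.0, np.nan, 1.0, 1.0, 5.0],
    "Science": [3.0, 2.0, np.nan, np.nan, 4.0],
    "English": [5.0, 1.0, 2.0, 2.0, np.nan],
})
subjects = ["Math", "Science", "English"]

# Remove the repeated row, then fill each column with its own mean
df = df.drop_duplicates()
df[subjects] = df[subjects].fillna(df[subjects].mean())

# Per-student average, then keep only averages above 2
df["Average"] = df[subjects].mean(axis=1)
kept = df[df["Average"] > 2]

# Student(s) with the highest Math score
top_math = df[df["Math"] == df["Math"].max()]

print(kept[["name", "Average"]])
print(top_math[["id", "name", "Math"]])
```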

Output:

DataFrame

DataFrame with average

Filtered DataFrame (students with an average grade above 2)

Find the rows where 'Math' is equal to the maximum value

b) Data grouping

Aim: Given a sample dataset containing ‘Sales’ figures for different ‘Region’
and ‘Product’ values, along with the ‘Date’ of each sale, calculate the total
and average sales for each region, and sort the resulting DataFrame by total
sales in descending order.

import pandas as pd

# Sample data
data = {
'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'],
'Region': ['North', 'South', 'East', 'West', 'North'],
'Product': ['A', 'B', 'A', 'B', 'A'],
'Sales': [150, 200, 130, 170, 160]
}

# Create DataFrame
df = pd.DataFrame(data)
print(df)

# Group by 'Region' and calculate total and average sales

grouped_df = df.groupby('Region').agg(
    Total_Sales=pd.NamedAgg(column='Sales', aggfunc='sum'),
    Average_Sales=pd.NamedAgg(column='Sales', aggfunc='mean')
).reset_index()

# Sort the results by 'Total_Sales' in descending order

sorted_df = grouped_df.sort_values(by='Total_Sales', ascending=False)

# Display the grouped and sorted DataFrame

print(sorted_df)

Output:

Practical No: 3

Advanced data manipulation exercises using Pandas.

a) Aim: Customer Segmentation in Retail: A retail company wants to segment its customers based on
their purchase behaviour. Use the Mall_Customers.csv dataset, which contains customer
information and purchase history, to create customer segments.

Data source:

Mall_Customers.csv
https://www.kaggle.com/datasets/shrutimechlearn/customer-data

Code:
import pandas as pd
df = pd.read_csv("Mall_Customers.csv")
#NO OF ROWS AND COLUMNS
print("Shape of the dataset:")
print(df.shape)
# Data type and non-null count for each column
print("\nColumn information:")
print(df.info())

print(df.to_string())

# Summary statistics (mean, min, max, etc.)

print("\nSummary statistics:")
print(df.describe())

# Unique values in the 'Genre' column (this dataset's name for gender)

print("\nUnique values in 'Genre' column:")
print(df["Genre"].unique())

# Count missing values in each column

print("\nMissing values count:")
print(df.isnull().sum())

# Create customer segments by spending score: 1-35 low, 36-70 medium, 71-100 high

df["Segment"] = pd.cut(df["Spending_Score"],
                       bins=[1, 35, 70, 100],
                       labels=["Low Spender", "Medium Spender", "High Spender"],
                       include_lowest=True)  # keep a score of exactly 1 in the first bin
# Print the segment assigned to every customer
print("\nCustomer Segmentation:")
print(df[["CustomerID", "Spending_Score", "Segment"]].to_string())
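By default pd.cut builds intervals that are open on the left and closed on the right, so a value equal to the lowest bin edge falls outside every bin and becomes NaN unless include_lowest=True is passed. The sketch below shows the difference on a few hypothetical spending scores.

```python
import pandas as pd

scores = pd.Series([1, 35, 36, 70, 71, 99])  # hypothetical spending scores
labels = ["Low Spender", "Medium Spender", "High Spender"]
edges = [1, 35, 70, 100]

# Default: intervals are (1, 35], (35, 70], (70, 100], so the score 1 is left out
default_cut = pd.cut(scores, bins=edges, labels=labels)

# include_lowest=True closes the first interval on the left as well
inclusive_cut = pd.cut(scores, bins=edges, labels=labels, include_lowest=True)

print(pd.DataFrame({"score": scores,
                    "default": default_cut,
                    "include_lowest": inclusive_cut}))
```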

output:

b) aim: Consider a CSV file named sales_data.csv containing sales data for different products. Load
this data into a Pandas DataFrame and answer the following:
● Calculate the total sales for each product category (e.g., Product A, Product B, Product C).
● Identify the product category with the highest average sales.

Data source:

Code:

import pandas as pd
sales_df = pd.read_csv('sales_data.csv')
print(sales_df.shape)
# Calculate total Revenue by Product
total_Revenue_by_Product = sales_df.groupby('Product')['Revenue'].sum()
print(total_Revenue_by_Product)
# Identify the product with the highest average revenue
avg_revenue_by_product = sales_df.groupby('Product')['Revenue'].mean()
max_avg_revenue_product = avg_revenue_by_product.idxmax()
print(f"Highest average Revenue by the product: {max_avg_revenue_product}")
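Since sales_data.csv is not included here, the grouping logic can be checked on a small hypothetical table. This sketch computes the total and the average in a single agg call; the column names Product and Revenue follow the code above, while the values are made up.

```python
import pandas as pd

# Hypothetical sales records
sales_df = pd.DataFrame({
    "Product": ["A", "B", "A", "C", "B", "C"],
    "Revenue": [100, 250, 300, 400, 150, 200],
})

# One pass over the groups gives both the total and the average per product
summary = sales_df.groupby("Product")["Revenue"].agg(["sum", "mean"])
print(summary)

# idxmax returns the index label (the product) holding the largest mean
best = summary["mean"].idxmax()
print("Highest average revenue:", best)
```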

output:

Practical No: 4

Performing data analysis tasks with NumPy arrays.

a) Aim: Create a NumPy array of 20 elements using arange, shaped into 5 rows and 4 columns. Find the
index of the maximum element overall, along each row, and along each column; sort the entire array,
each row, and each column; find the mean of every 1D array (row) in the 2D array; and reverse the
array.

Code:
import numpy as np

# Create a 5x4 array

array = np.arange(20).reshape(5, 4)
print(array)

# Find the index of the maximum element (overall)

print(np.argmax(array))

# Find the index of the maximum element along each row

print(np.argmax(array, axis=1))

# Find the index of the maximum element along each column

print(np.argmax(array, axis=0))

# Sort the entire array (axis=None flattens it before sorting)

print(np.sort(array, axis=None))

# Sort along each row

print(np.sort(array, axis=1))

# Sort along each column

print(np.sort(array, axis=0))

# Find the mean of every 1D NumPy array (row) in the 2D array

for arr in array:
    print("Mean of array: ",np.mean(arr))

# Reverse the array

reversed_array = array[::-1]
print(reversed_array)
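Note that np.argmax without an axis returns a position in the flattened array, and array[::-1] reverses only the order of the rows. The sketch below shows how the flat position can be turned into a (row, column) pair, how the row means can be computed without a loop, and how both axes can be reversed; which kind of reversal the exercise intends is an assumption left to the reader.

```python
import numpy as np

array = np.arange(20).reshape(5, 4)

# argmax on a 2D array gives an index into the flattened array
flat_index = np.argmax(array)

# unravel_index converts that flat index into (row, column) coordinates
row, col = np.unravel_index(flat_index, array.shape)
print(flat_index, (row, col))

# axis=1 averages across the columns of each row in one call
row_means = array.mean(axis=1)
print(row_means)

# Reverse both the row order and the order of elements inside each row
fully_reversed = array[::-1, ::-1]
print(fully_reversed)
```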

Output:

b) Data indexing and selection

Define a DataFrame with columns ‘A’, ‘B’ and ‘C’, holding the values 1, 2, 3, 4 in ‘A’; 5, 6, 7, 8 in ‘B’;
and 9, 10, 11, 12 in ‘C’. Use ‘row1’, ‘row2’, ‘row3’ and ‘row4’ as the index labels of the four rows.
Perform the following selections on this DataFrame.

i) Select all rows and column 'A'

ii) Select 'row2' and 'row3' for column 'B'
iii) Select 'row2' and 'row4' for column 'B'
iv) Select all rows and columns 'A' to 'C'
v) select rows where the values in column 'A' are greater than 2
vi) select rows where values in 'A' are greater than 2 and values in 'B' are less than 8
vii) Select rows where column 'A' is greater than 3
viii) Select rows where column 'B' is equal to 7

code:
import pandas as pd

data = {
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8],
'C': [9, 10, 11, 12]
}
index = ['row1', 'row2', 'row3', 'row4']
df = pd.DataFrame(data, index=index)

df.loc[:, 'A']

df.loc['row2':'row3', 'B']

df.loc[['row2','row4'], 'B']

df.loc[:, 'A':'C']

df[df['A'] > 2]

df[(df['A'] > 2) & (df['B'] < 8)]

df[df['A'] > 3]

df[df['B'] == 7]
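A point worth remembering from these selections is that a label slice with loc includes both end points, whereas a positional slice with iloc excludes the end. The sketch below rebuilds the same DataFrame and compares the two; the query call is shown as one possible alternative to boolean masks, not as the method this practical requires.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]},
    index=["row1", "row2", "row3", "row4"],
)

# Label-based slice: both 'row2' and 'row3' are included
by_label = df.loc["row2":"row3", "B"]

# Position-based slice: rows 1 and 2, column 1; the end position 3 is excluded
by_position = df.iloc[1:3, 1]

# The same condition as the boolean mask, written as a query string
by_query = df.query("A > 2 and B < 8")

print(by_label.tolist(), by_position.tolist())
print(by_query)
```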

Output:

Practical No: 5

Creating basic plots with Matplotlib and Seaborn.

a) aim: Write Python code to create a plot with multiple variations of the sine function using
Matplotlib. Follow the steps and code provided below to complete the exercise.

Import the necessary libraries and create an array x of 100 points ranging from 0 to 10. Create a
figure and plot multiple sine waves with different line styles and colors. Add a legend, and change
the x-axis limits from (0 to 10) to (-1 to 11) and the y-axis limits from (-1.0 to 1.0) to (-1.5 to 1.5).
Label the x-axis 'x' and the y-axis 'sin(x)' in blue with font size 16, and add the title 'A sin(x) plot'
in blue with font size 22 and oblique style. Display the plot. Also save the figure, display it again
from the saved location, and list the supported file types.

Code:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
fig = plt.figure()
plt.plot(x, np.sin(x), '--', color='blue', label='sin(x)')
plt.plot(x, np.sin(x - 1), color='g',linestyle=':') # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), linestyle='-.', color='0.75') # Grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44') # Hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3)) # RGB tuple, values between 0 and 1
plt.plot(x, np.sin(x - 5), color='chartreuse') # all HTML color names supported
plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5)
plt.xlabel('x', color='blue', fontsize=16)
plt.ylabel('sin(x)', color= 'blue', fontsize=16)
plt.legend()
plt.title("A sin(x) plot", color= 'blue', fontsize=22, style="oblique")

fig.savefig('my_figure.png')
fig.savefig('d:/DAP - Data Analytics with Python/my_figure.png')
from IPython.display import Image
Image('my_figure.png')
fig.canvas.get_supported_filetypes()

output:

b) Aim: Write Python code to create a scatter plot with random data points using Matplotlib. Follow
the steps and code provided below to complete the exercise.

Import the necessary libraries. Create arrays x and y, each with 100 random values from a standard
normal distribution. Create an array colors with 100 random values and an array sizes with 100
random values scaled up to 1000. Create a scatter plot from these coordinates, colors, and sizes,
set the transparency and the color map, and add a colorbar.

code:
import matplotlib.pyplot as plt
import numpy as np
x = np.random.randn(100)

y = np.random.randn(100)
colors = np.random.rand(100)
sizes = 1000 * np.random.rand(100)
plt.scatter(x, y, c=colors, s=sizes, alpha=0.3,cmap='viridis')
#viridis, plasma, inferno, magma, cividis
#alpha controls the transparency of the points, while cmap controls
#the color mapping based on the data values.
plt.colorbar(); # show color scale

output:

c) seaborn plots

Aim:

Using the "flights" dataset from Seaborn, which contains the number of passengers per month over
several years, create the following visualizations:

1. A line plot showing the number of passengers over time, with a separate line for each year.
2. A bar plot comparing the total number of passengers for each year.
3. A box plot to visualize the distribution of passengers per month, colored by year.
4. A heatmap showing the number of passengers each month over the years.

d) Analysis

Provide an analysis of the seasonal trends and any noticeable changes over the years based on the
visualizations.

code:
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("flights")

# 1. Line plot: Number of passengers over time for each year

plt.figure(figsize=(12, 6))
sns.lineplot(x='month', y='passengers', hue='year', data=df, palette='viridis')
plt.title('Monthly Passengers Over Time')
plt.show()

# 2. Bar plot: Total number of passengers per year

plt.figure(figsize=(10, 6))
total_passengers_per_year = df.groupby('year')['passengers'].sum().reset_index()
sns.barplot(x='year', y='passengers', data=total_passengers_per_year, palette='plasma')
plt.title('Total Passengers per Year')
plt.show()

# 3. Box plot: Distribution of passengers per month by year

plt.figure(figsize=(12, 6))
sns.boxplot(x='month', y='passengers', hue='year', data=df)
plt.title('Distribution of Passengers per Month by Year')
plt.show()

# 4. Heatmap: Number of passengers each month over the years

flights_pivot = df.pivot(index="month", columns="year", values="passengers")
plt.figure(figsize=(12, 6))

sns.heatmap(flights_pivot, annot=True, fmt="d", cmap='YlGnBu')

plt.title('Number of Passengers Each Month Over the Years')
plt.show()

Output:

d)
Line Plot: Passenger numbers rise from one year to the next. Within each year they climb towards a peak
around July, a clear seasonal pattern in which the summer months see the highest passenger numbers.
Bar Plot: The total number of passengers has generally increased each year, reflecting growth in air travel.
Box Plot: There is significant variability in the number of passengers from month to month, with July and
August consistently being the peak months.
Heatmap: The heatmap clearly shows the seasonal trend, with higher numbers in the summer months
across all years.

Practical No: 6

Creating advanced and interactive plots.

a) Read a dataset named salesdata.csv containing monthly sales information for different products across
multiple regions. The dataset includes columns such as Product, Region, and Sales. Write Python code to
create an interactive bar chart using Plotly that shows the sales of different products in a selected region.
Use a dropdown menu to allow users to select different regions.

Dataset: salesdata.csv

Code:
import plotly.graph_objects as go
import pandas as pd

df = pd.read_csv("salesdata.csv")

# Filter the data for each region manually

df_c = df[df['Region'] == 'Central']

df_e = df[df['Region'] == 'East']

df_w = df[df['Region'] == 'West']
df_n = df[df['Region'] == 'North']
df_s = df[df['Region'] == 'South']

# Create traces for each region manually

trace1 = go.Bar(
x=df_c['Product'],
y=df_c['Sales'],
name='Central'
)
trace2 = go.Bar(
x=df_e['Product'],
y=df_e['Sales'],
name='East'
)
trace3 = go.Bar(
x=df_w['Product'],
y=df_w['Sales'],
name='West'
)
trace4 = go.Bar(
x=df_s['Product'],
y=df_s['Sales'],
name='South'
)

# Initialize the figure and add traces

fig = go.Figure(data=[trace1, trace2, trace3, trace4])

# Update layout with dropdown menu, specifying visibility for each trace
fig.update_layout(
updatemenus=[
dict(
buttons=[
dict(label='Central',
method='update',
args=[{'visible': [True, False, False, False]},
{'title': 'Sales by Product in Region: Central'}]),
dict(label='East',
method='update',
args=[{'visible': [False, True, False, False]},
{'title': 'Sales by Product in Region: East'}]),
dict(label='West',
method='update',
args=[{'visible': [False, False, True, False]},
{'title': 'Sales by Product in Region: West'}]),
dict(label='South',
method='update',
args=[{'visible': [False, False, False, True]},
{'title': 'Sales by Product in Region: South'}])
],
direction='down',
)
],
title="Sales by Product",

xaxis_title="Product",
yaxis_title="Total Sales",
barmode='group'
)

fig.show()
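Writing each button and its visibility list by hand becomes error-prone as regions are added, because every list must have exactly one entry per trace. The sketch below builds the same button dictionaries in a loop using plain Python only; the helper name region_buttons and the list of region names are assumptions for illustration.

```python
def region_buttons(regions):
    """Build one dropdown button per region; only that region's trace is visible."""
    buttons = []
    for i, region in enumerate(regions):
        # True at position i, False everywhere else
        visible = [j == i for j in range(len(regions))]
        buttons.append(dict(
            label=region,
            method="update",
            args=[{"visible": visible},
                  {"title": f"Sales by Product in Region: {region}"}],
        ))
    return buttons

regions = ["Central", "East", "West", "North", "South"]  # assumed region names
buttons = region_buttons(regions)
print(buttons[3]["label"], buttons[3]["args"][0]["visible"])
```

The resulting list can be passed as the buttons entry of updatemenus, provided the traces were added to the figure in the same order as the regions.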

output

b) Create a dataset named student_scores.csv from a dictionary, data, that holds students' scores in
various subjects. The dataset contains the columns StudentID, DAPScore, IFScore, DEScore, and Gender
for 10 students. Write Python code to create an interactive scatter plot using Plotly that visualizes
the relationship between DAPScore and IFScore. Add hover information so that hovering over a point
also displays the student's StudentID, DEScore, and Gender.

Code:
import pandas as pd
import plotly.express as px
# Creating the dataset
data = {
'StudentID': range(1, 11),
'DAPScore': [78, 85, 92, 88, 76, 95, 89, 84, 91, 87],
'IFScore': [82, 79, 94, 90, 80, 97, 85, 89, 88, 90],
'DEScore': [75, 80, 88, 85, 78, 93, 84, 81, 92, 87],
'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F']
}

df = pd.DataFrame(data)
df.to_csv('student_scores.csv', index=False)
fig = px.scatter(df, x="DAPScore", y="IFScore", color="Gender",
hover_data=["StudentID", "DEScore", "Gender"],
title="DAP Score vs. IF Score")

fig.show()

Practical No: 7

Performing EDA on a dataset.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#to ignore warnings
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv("used_cars.csv")

df.head()

df.tail()
df.info()

#counting unique values for each features

df.nunique()

#total missing values for each field

df.isnull().sum()

#missing values in perc

(df.isnull().sum()/(len(df)))*100

# Convert 'New_Price' from object to float64.
# The pattern (\d+\.\d+|\d+) matches either a decimal number or an integer.
df['New_Price'] = df['New_Price'].str.replace(',', '').str.extract(r'(\d+\.\d+|\d+)', expand=False).astype(float)
# The 'Price' column is already float64, so no conversion is necessary
# Group by 'Name'
grouped = df.groupby('Name')

# Fill NaN values in 'New_Price' and 'Price' with the mean of their respective groups
for name, group in grouped:
    new_price_mean = group['New_Price'].mean()
    price_mean = group['Price'].mean()
    df.loc[group.index, 'New_Price'] = group['New_Price'].fillna(new_price_mean)
    df.loc[group.index, 'Price'] = group['Price'].fillna(price_mean)

# View the cleaned dataframe

df.head()

#missing values in perc

(df.isnull().sum()/(len(df)))*100
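The extraction and group-wise filling steps can be tried on a few hypothetical rows without the full used_cars.csv file. The sketch below is one possible variant: it uses groupby with transform instead of an explicit loop over the groups. The sample names and the price format (such as "8.61 Lakh") are assumptions and may differ from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the real file may format these values differently
cars = pd.DataFrame({
    "Name": ["Maruti Swift", "Maruti Swift", "Honda City", "Honda City"],
    "New_Price": ["8.61 Lakh", None, "11 Lakh", "13.5 Lakh"],
    "Price": [5.5, np.nan, 7.0, 9.0],
})

# Pull the first decimal or integer out of each string and convert it to float
cars["New_Price"] = (
    cars["New_Price"].str.extract(r"(\d+\.\d+|\d+)", expand=False).astype(float)
)

# transform("mean") returns the group mean aligned to every row,
# so fillna can use it directly to fill gaps within each 'Name' group
for col in ["New_Price", "Price"]:
    cars[col] = cars[col].fillna(cars.groupby("Name")[col].transform("mean"))

print(cars)
```

A group whose values are all missing has no mean, so its gaps stay as NaN in either approach.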

#Data Reduction
# Remove S.No. column from data
# axis=1 – column
df = df.drop(['S.No.'], axis = 1)
df.info()

#Feature Engineering - Creating Features

from datetime import date
date.today().year
df['Car_Age'] = date.today().year - df['Year']
df.head()

# split the name and introduce new variables “Brand” and “Model”
df['Brand'] = df.Name.str.split().str.get(0)
df['Model'] = df.Name.str.split().str.get(1) + df.Name.str.split().str.get(2)
df[['Name','Brand','Model']]

#Data Cleaning
print(df.Brand.unique())
print(df.Brand.nunique())

# The brand names 'ISUZU' (a duplicate of 'Isuzu'), 'Mini' and 'Land' look incorrect and need to be corrected
searchfor = ['Isuzu' ,'ISUZU','Mini','Land']
df[df.Brand.str.contains('|'.join(searchfor))].head(5)
# 'Isuzu|ISUZU|Mini|Land' -> | means "or"

df["Brand"] = df["Brand"].replace({"ISUZU": "Isuzu", "Mini": "Mini Cooper","Land":"Land Rover"})

# Our data is ready for EDA
df.describe().T
# .T transposes the result, turning rows into columns and columns into rows

df.describe(include='all').T

#Before we do EDA, lets separate Numerical and categorical variables

#for easy analysis
cat_cols=df.select_dtypes(include=['object']).columns
num_cols = df.select_dtypes(include=np.number).columns.tolist()
print("Categorical Variables:")
print(cat_cols)
print("Numerical Variables:")
print(num_cols)

#EDA Univariate Analysis

for col in num_cols:
    print(col)
    print('Skew :', round(df[col].skew(), 2))
    plt.figure(figsize = (15, 4))
    # subplot(1, 2, 1): a grid of 1 row and 2 columns; the last 1 selects the left-hand plot
    plt.subplot(1, 2, 1)
    sns.histplot(df[col], bins=15, kde=True)
    plt.ylabel('count')
    plt.subplot(1, 2, 2) # the second plot is the right-hand one
    sns.boxplot(x=df[col])
    plt.show()

#EDA Bivariate Analysis

plt.figure(figsize=(13,17))
sns.pairplot(data=df.drop(['Kilometers_Driven','Price'],axis=1))
plt.show()

Practical No: 8
Time series data analysis using Pandas.
a) Create a dataset containing daily temperature readings for the year 2023. Set the date as the index of
the DataFrame, and resample the data to a monthly frequency to calculate the average temperature
for each month. Plot the original daily temperature data and the resampled monthly average
temperatures.

Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate a sample dataset of daily temperature readings for a year

date_rng = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
temperature = np.random.normal(loc=15, scale=10, size=len(date_rng))
df = pd.DataFrame(temperature, index=date_rng, columns=['Temperature'])
print (df)

# Resample the data to a monthly frequency and calculate the mean temperature
# ('M' means month-end; pandas 2.2 and later prefer the alias 'ME')
monthly_avg_temp = df.resample('M').mean()

# Plot the original daily data and the monthly averages

plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Temperature'], label='Daily Temperature', color='blue', alpha=0.5)
plt.plot(monthly_avg_temp.index, monthly_avg_temp['Temperature'], label='Monthly Avg Temperature',
color='red', marker='o')
plt.title('Daily and Monthly Average Temperatures in 2023')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.legend()
plt.show()
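With random temperatures it is hard to confirm that the monthly averages are right. The sketch below uses hypothetical constant readings so the result can be checked by eye, and it groups by calendar month with to_period, which is a substitute technique that gives the same monthly means as a monthly resample.

```python
import pandas as pd

# Two months of hypothetical readings: 10 degrees all January, 20 all February
dates = pd.date_range("2023-01-01", "2023-02-28", freq="D")
temps = pd.Series([10.0] * 31 + [20.0] * 28, index=dates)

# Group the daily values by calendar month and average each group
monthly = temps.groupby(temps.index.to_period("M")).mean()
print(monthly)
```

The index of the result holds monthly periods rather than month-end timestamps, which is the main visible difference from the resampled version.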

output:

b) Create a time series dataset of daily sales data for a retail store for the year 2023, calculate the 7-
day rolling average of the sales data. Plot the original sales data and the 7-day rolling average on the
same graph to visualize the smoothing effect.

Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate a sample dataset of daily sales data for a year

date_rng = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
sales = np.random.randint(100, 500, size=len(date_rng))
df = pd.DataFrame(sales, index=date_rng, columns=['Sales'])

# Calculate the 7-day rolling average of the sales data

df['7-Day Rolling Avg'] = df['Sales'].rolling(window=7).mean()

# Plotting the original sales data and the 7-day rolling average
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Sales'], label='Daily Sales', color='blue', alpha=0.5)
plt.plot(df.index, df['7-Day Rolling Avg'], label='7-Day Rolling Avg', color='orange', linewidth=2)
plt.title('Daily Sales and 7-Day Rolling Average in 2023')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.show()
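A rolling mean needs a full window before it can produce a value, so the first window - 1 entries are NaN and the smoothed line starts a few days late. The sketch below shows this on five hypothetical sales figures with a window of 3, along with the min_periods option that allows partial windows.

```python
import pandas as pd

sales = pd.Series([10, 20, 30, 40, 50])  # hypothetical daily sales

# Full windows only: the first two results are NaN
rolling = sales.rolling(window=3).mean()

# min_periods=1 averages whatever values are available so far
rolling_partial = sales.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({"sales": sales,
                    "rolling": rolling,
                    "rolling_partial": rolling_partial}))
```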

output:

Practical No: 9

Applying statistical analysis on a dataset.

Aim: a) You are given a dataset containing the heights (in cm) of male and female students in a class.
Perform the following tasks:

Calculate the mean, median, mode, and standard deviation of the heights for both males and females.
Perform an independent t-test to check if there is a significant difference between the heights of males
and females. Use a significance level of 0.05.

Description:

1. Define mean, median, mode and standard deviation.

2. t-test
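To see what the independent t-test measures, the t statistic can be worked out by hand with the standard library. This is a sketch under the equal-variance assumption (a pooled variance), using the heights from the code in this practical; the helper name pooled_t is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

male = [170, 175, 180, 169]
female = [160, 158, 162, 159]

def pooled_t(a, b):
    """Student's t statistic for two independent samples, assuming equal variances."""
    na, nb = len(a), len(b)
    # variance() is the sample variance (divides by n - 1)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))

t_stat = pooled_t(male, female)
dof = len(male) + len(female) - 2
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
```

The larger the difference between the two means relative to the spread within the groups, the larger the t statistic and the smaller the p-value.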

Code:
import pandas as pd
import numpy as np
from scipy import stats

data = {
'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
'Height': [170, 160, 175, 158, 180, 162, 169, 159]
}
df = pd.DataFrame(data)

male_heights = df[df['Gender'] == 'Male']['Height']

female_heights = df[df['Gender'] == 'Female']['Height']

# Function for descriptive statistics

def descriptive_stats(heights):
    # keepdims=True makes the result keep the same number of dimensions as the input array
    mode_result = stats.mode(heights, keepdims=True)
    # Take the first mode only if its count is greater than 0; otherwise report NaN
    mode_value = mode_result.mode[0] if mode_result.count[0] > 0 else np.nan

    return {
        'Mean': np.mean(heights),
        'Median': np.median(heights),
        'Mode': mode_value,
        'Standard Deviation': np.std(heights, ddof=1) # Sample standard deviation
    }

male_stats = descriptive_stats(male_heights)
female_stats = descriptive_stats(female_heights)

print("Male Stats:", male_stats)

print("Female Stats:", female_stats)

# 2. Independent T-test (Hypothesis Testing)

t_stat, p_value = stats.ttest_ind(male_heights, female_heights)

if p_value < 0.05:
    print(f"There is a significant difference between the heights of males and females (p-value = {p_value:.5f}).")
else:
    print(f"There is no significant difference between the heights of males and females (p-value = {p_value:.5f}).")

output:

b) A dataset contains information about the hours of study and marks scored by students in an exam.
Perform the following tasks:
1. Compute the correlation between the hours of study and marks scored.
2. Perform a simple linear regression analysis to predict marks based on hours of study.

Description:
1. Describe correlation
2. Linear regression analysis
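The correlation coefficient and the regression line can both be derived from three sums of squared deviations. The sketch below works them out in plain Python for the hours and marks used in this practical; it illustrates the least-squares formulas and is not a replacement for a library model.

```python
from math import sqrt

hours = [2, 3, 4, 5, 6, 7, 8, 9]
marks = [50, 55, 60, 65, 70, 75, 80, 85]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(marks) / n

# Sums of squared deviations and of cross products
sxx = sum((x - mean_x) ** 2 for x in hours)
syy = sum((y - mean_y) ** 2 for y in marks)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, marks))

r = sxy / sqrt(sxx * syy)            # Pearson correlation coefficient
slope = sxy / sxx                    # least-squares slope
intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)

print(f"r = {r:.2f}, Marks = {intercept:.2f} + {slope:.2f} * Hours of Study")
```

Because these sample marks rise by exactly 5 for every extra hour, the points lie on a straight line and the correlation is perfect.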

Code:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

data = {
'Hours of Study': [2, 3, 4, 5, 6, 7, 8, 9],
'Marks Scored': [50, 55, 60, 65, 70, 75, 80, 85]
}
df = pd.DataFrame(data)

# 1. Correlation
correlation = np.corrcoef(df['Hours of Study'], df['Marks Scored'])[0, 1]
print(f"Correlation between Hours of Study and Marks Scored: {correlation:.2f}")

# 2. Simple Linear Regression

X = df[['Hours of Study']]
y = df['Marks Scored']

# Create and fit the linear regression model

model = LinearRegression()
model.fit(X, y)

# Predict marks for the given hours of study

predicted_marks = model.predict(X)

# Plotting the regression line

plt.scatter(df['Hours of Study'], df['Marks Scored'], color='blue', label='Actual Data')
plt.plot(df['Hours of Study'], predicted_marks, color='red', label='Regression Line')
plt.xlabel('Hours of Study')
plt.ylabel('Marks Scored')
plt.title('Hours of Study vs Marks Scored')
plt.legend()
plt.show()

# Display the regression equation

print(f"Regression Equation: Marks = {model.intercept_:.2f} + {model.coef_[0]:.2f} * Hours of Study")

output:

Practical No: 10

Processing and analyzing text data.

Aim: You are working on a project to analyze customer reviews of a product. Your task is to preprocess the
text data to prepare it for further analysis, such as sentiment analysis or topic modeling. The Text is given as
follows:
"The product was excellent, I really liked it!",
"This is the worst purchase I have ever made. Absolutely disappointed!",

"The service was decent but the product did not meet expectations.",
"Great quality, fast delivery, will buy again!"

Code:
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string

nltk.download('punkt')
nltk.download('punkt_tab') # tokenizer tables that newer NLTK releases (3.9+) look for
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4') # For wordnet data

reviews = [
"The product was excellent, I really liked it!",
"This is the worst purchase I have ever made. Absolutely disappointed!",
"The service was decent but the product did not meet expectations.",
"Great quality, fast delivery, will buy again!"
]

def preprocess_text(text):
    # Lowercase, strip punctuation, then split into word tokens
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = word_tokenize(text)
    return words

def remove_stopwords(words):
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word not in stop_words]
    return filtered_words

def lemmatize_words(words):
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words] # 'v' stands for verb
    return lemmatized_words

for review in reviews:
    print("Original Review:", review)
    tokens = preprocess_text(review)
    cleaned_tokens = remove_stopwords(tokens)
    lemmatized_tokens = lemmatize_words(cleaned_tokens)
    print("Cleaned, Tokenized & Lemmatized Review:", lemmatized_tokens)
    print() # For spacing

Output:
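To see what the cleaning and stop-word steps do without downloading any NLTK data, here is a minimal standard-library sketch. It is an approximation under stated assumptions: `SAMPLE_STOP_WORDS` is a small hypothetical list (not NLTK's English stop-word list), tokens are split on whitespace rather than with `word_tokenize`, and lemmatization is left out because it needs the WordNet data.

```python
import string

# Hypothetical, deliberately small stop-word list for illustration only;
# NLTK's English list is much longer.
SAMPLE_STOP_WORDS = {
    "the", "was", "i", "it", "is", "this", "have",
    "but", "did", "not", "will", "a", "an",
}


def simple_preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [token for token in tokens if token not in SAMPLE_STOP_WORDS]


first = simple_preprocess("The product was excellent, I really liked it!")
print(first)  # ['product', 'excellent', 'really', 'liked']
```

With the NLTK pipeline above, the result can differ: its stop-word list is larger, and the lemmatizer also reduces verb forms to their base form.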