
BSc DS Sem III Data Analytics with Python Journal KES Shroff College

Data Analytics with Python Journal

Practical No: 1

Aim: Setting up Python and Jupyter notebooks, basic Python exercises

a) Setting Up Python

Python 3.12.4 Installation on Windows

Step 1) Go to https://www.python.org/downloads/ and select the latest version for Windows.

Step 2) Save the setup file in D:\, then right-click it and run the setup as administrator.

Step 3) Select Customize Installation


Step 4) Click Next.

Step 5) On the next screen, select all the advanced options except the first one and click Install.


The following screen appears once the setup completes successfully.


Step 6) Click the Close button once the installation is done.

b) Jupyter Notebooks

Install Anaconda Distribution:

o Anaconda is a popular distribution for Python and includes Jupyter Notebook and
many essential libraries for data science.
o Download Anaconda from the Anaconda Distribution page (https://www.anaconda.com/download).

o Provide your valid email id and select the checkbox.
o Click on the Submit button.


o Click on the Download button and save the setup file in D:\.

● Keep clicking Next with the default settings until the Install button appears.
● Click on Install.
● Verify with your Gmail, GitHub, or MS Office account.


Click Next whenever it appears on the screen, and then Finish.

Find and launch the Jupyter notebook in Anaconda Navigator.

Launching it opens the following Jupyter Notebook screen.

Or

Alternatively, click on Notebook on the Anaconda home screen to reach the following screen.


Roll No: Name: 7


BSc DS Sem III Data Analytics with Python Journal KES Shroff College

Install Additional Libraries (if needed):

● Anaconda comes with many libraries pre-installed, including Pandas. If you need
additional libraries (e.g., Matplotlib, Seaborn for visualization), you can install them using
Anaconda Navigator or via the command line with conda or pip.
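A quick way to confirm the environment is ready is to import the core libraries in a notebook cell and print their versions (a minimal check, not part of the original steps; the version numbers will vary by installation):

import pandas as pd
import numpy as np
import matplotlib
import seaborn as sns
print(pd.__version__, np.__version__, matplotlib.__version__, sns.__version__)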

c) Basic Python Exercises

Create a list of six integer numbers. Use append() to add a new element to the list, remove() to delete a specified element, and sort() to arrange the list in ascending order. Also use max() and min() to find the largest and smallest elements in the list.

# Original list
numbers = [5, 2, 9, 1, 5, 6]
print('Original list:', numbers)

# 1. Append a new element to the list
new_element = 10
numbers.append(new_element)
print("List after appending", new_element, ":", numbers)

# 2. Remove an element from the list
element_to_remove = 9
if element_to_remove in numbers:
    numbers.remove(element_to_remove)
    print("List after removing", element_to_remove, ":", numbers)
else:
    print("Element", element_to_remove, "not found in the list.")

# 3. Sort the list in ascending order
numbers.sort()
print("List after sorting:", numbers)

# 4. Find the maximum and minimum values in the list
max_value = max(numbers)
min_value = min(numbers)
print("Maximum value:", max_value)
print("Minimum value:", min_value)

Output:


Practical No: 2

Aim: Data manipulation tasks with Pandas

a) Data cleaning and filtering:

Aim: Create a DataFrame from a given student-grade dataset. Fill the missing values with the average of the column, calculate the average grade for each student, remove the duplicate rows, filter out low-performing students whose average grade is less than or equal to 2, find the maximum value in the 'Math' column, and identify the student(s) with the maximum 'Math' score.

Description:

import pandas as pd
import numpy as np

# Create DataFrame
df = pd.read_csv("student-grade.csv")

# Fill missing values in each column with that column's average
df['Math'].fillna(df['Math'].mean(), inplace=True)
df['Science'].fillna(df['Science'].mean(), inplace=True)
df['English'].fillna(df['English'].mean(), inplace=True)
print(df.to_string())

# Calculate the average grade for each student


df['Average'] = df[['Math', 'Science', 'English']].mean(axis=1)
print(df.to_string())

# Remove duplicate rows
df = df.drop_duplicates()

# Filter out low-performing students (average grade <= 2)
filtered_df = df[df['Average'] <= 2]

# Display the cleaned and filtered DataFrame


print(filtered_df)

max_math = df['Math'].max()

# Find the rows where 'Math' is equal to the maximum value


max_math_students = df[df['Math'] == max_math]

# Display the result


print("The maximum value in the 'Math' column is:", max_math)

print("The student(s) with the maximum 'Math' score:")


print(max_math_students[['id', 'name', 'Math']])

Output:

DataFrame

DataFrame with average

Filter out low-performing students (average grade <= 2)


Find the rows where 'Math' is equal to the maximum value

b) Data grouping

Aim: Given a sample dataset containing sales data for different regions and products, with the date of each sale, calculate the total and average sales for each region, and sort the resulting DataFrame by total sales in descending order.

import pandas as pd

# Sample data
data = {
'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'],
'Region': ['North', 'South', 'East', 'West', 'North'],
'Product': ['A', 'B', 'A', 'B', 'A'],
'Sales': [150, 200, 130, 170, 160]
}

# Create DataFrame
df = pd.DataFrame(data)
print(df)

# Group by 'Region' and calculate total and average sales



grouped_df = df.groupby('Region').agg(
Total_Sales=pd.NamedAgg(column='Sales', aggfunc='sum'),
Average_Sales=pd.NamedAgg(column='Sales', aggfunc='mean')
).reset_index()

# Sort the results by 'Total_Sales' in descending order


sorted_df = grouped_df.sort_values(by='Total_Sales', ascending=False)

# Display the grouped and sorted DataFrame


print(sorted_df)
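As a design note, the same grouped summary can be written more compactly by passing a list of aggregation functions (a minimal sketch; the resulting columns are then named 'sum' and 'mean' instead of Total_Sales and Average_Sales):

summary = df.groupby('Region')['Sales'].agg(['sum', 'mean'])
print(summary.sort_values('sum', ascending=False))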

Output:


Practical No: 3

Advanced data manipulation exercises using Pandas.

a) Aim: Customer Segmentation in Retail: A retail company wants to segment its customers based on their purchase behaviour. Use the Mall_Customers.csv dataset containing customer information and purchase history to create customer segments.

Data source:

Mall_Customers.csv
https://www.kaggle.com/datasets/shrutimechlearn/customer-data

Code:
import pandas as pd
df = pd.read_csv("Mall_Customers.csv")
#NO OF ROWS AND COLUMNS
print("Shape of the dataset:")
print(df.shape)
# Data type and non-null count for each column
print("\nColumn information:")
print(df.info())

print(df.to_string())

# Summary statistics (mean, min, max, etc.)


print("\nSummary statistics:")
print(df.describe())

# Unique values in the 'Genre' column


print("\nUnique values in 'Genre' column:")
print(df["Genre"].unique())

# Count missing values in each column


print("\nMissing values count:")
print(df.isnull().sum())

# Create customer segments (bins: 1-35, 36-70, 71-100)

df["Segment"] = pd.cut(df["Spending_Score"],
bins=[1, 35, 70, 100],
labels=["Low Spender", "Medium Spender", "High Spender"])
# Print the first few rows of the segmented data
print("\nCustomer Segmentation:")
print(df[["CustomerID", "Spending_Score", "Segment"]].to_string())

Output:


b) Aim: Consider a CSV file named sales_data.csv containing sales data for different products. Load this data into a Pandas DataFrame and answer the following:
● Calculate the total sales for each product category (e.g., Product A, Product B, Product C).
● Identify the product category with the highest average sales.

Data source:

Code:

import pandas as pd
sales_df = pd.read_csv('sales_data.csv')
print(sales_df.shape)
# Calculate total Revenue by Product
total_Revenue_by_Product = sales_df.groupby('Product')['Revenue'].sum()
print(total_Revenue_by_Product)
# Identify the category with the highest average revenue
total_Revenue_by_Product = sales_df.groupby('Product')['Revenue'].mean()
max_avg_revenue_product = total_Revenue_by_Product.idxmax()
print(f"Highest average Revenue by the product: {max_avg_revenue_product}")

Output:


Practical No: 4

Performing data analysis tasks with NumPy arrays.

a) Aim: Create a NumPy array of 20 elements with arange, reshaped into 5 rows and 4 columns. Find the index of the maximum element (overall), the index of the maximum element along each row, and the index of the maximum element along each column; sort the entire array, sort along each row, and sort along each column; find the mean of every 1D NumPy array in the given 2D array; and reverse the array.

Code:
import numpy as np

# Create a 5x4 array


array = np.arange(20).reshape(5, 4)
print(array)

# Find the index of the maximum element (overall)


print(np.argmax(array))

# Find the index of the maximum element along each row


print(np.argmax(array, axis=1))

# Find the index of the maximum element along each column


print(np.argmax(array, axis=0))
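# Sort the entire array (flattened), covering the "sort the entire array" step from the aim
print(np.sort(array, axis=None))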

# Sort along each row


print(np.sort(array, axis=1))

# Sort along each column


print(np.sort(array, axis=0))

#find the mean of every 1D NumPy array in a given 2D array


for arr in array:
print("Mean of array: ",np.mean(arr))

# Reverse the array (reverses the order of the rows)
reversed_array = array[::-1]
print(reversed_array)

Output:


b) Data indexing and selection

Define a DataFrame with columns 'A', 'B' and 'C', where 'A' holds the values [1, 2, 3, 4], 'B' holds [5, 6, 7, 8] and 'C' holds [9, 10, 11, 12], with the index labels 'row1', 'row2', 'row3' and 'row4'. Perform the following selections on this DataFrame.

i) Select all rows and column 'A'


ii) Select 'row2' and 'row3' for column 'B'
iii) Select 'row2' and 'row4' for column 'B'
iv) Select all rows and columns 'A' to 'C'
v) select rows where the values in column 'A' are greater than 2
vi) select rows where values in 'A' are greater than 2 and values in 'B' are less than 8
vii) Select rows where column 'A' is greater than 3
viii) Select rows where column 'B' is equal to 7

Code:
import pandas as pd

data = {
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [9, 10, 11, 12]
}
index = ['row1', 'row2', 'row3', 'row4']
df = pd.DataFrame(data, index=index)

# i) All rows, column 'A'
print(df.loc[:, 'A'])

# ii) 'row2' and 'row3' for column 'B' (label slices are inclusive)
print(df.loc['row2':'row3', 'B'])

# iii) 'row2' and 'row4' for column 'B'
print(df.loc[['row2', 'row4'], 'B'])

# iv) All rows, columns 'A' to 'C'
print(df.loc[:, 'A':'C'])

# v) Rows where column 'A' is greater than 2
print(df[df['A'] > 2])

# vi) Rows where 'A' > 2 and 'B' < 8
print(df[(df['A'] > 2) & (df['B'] < 8)])

# vii) Rows where column 'A' is greater than 3
print(df[df['A'] > 3])

# viii) Rows where column 'B' equals 7
print(df[df['B'] == 7])


Output:


Practical No: 5

Creating basic plots with Matplotlib and Seaborn.

a) Aim: Write Python code to create a plot with multiple variations of the sine function using Matplotlib. Follow the steps and the code provided below to complete the exercise.

Import the necessary libraries and generate data: create an array x with 100 points ranging from 0 to 10. Create a figure and plot multiple sine waves with different line styles and colors. Add a legend, change the x-axis limits from (0, 10) to (-1, 11) and the y-axis limits from (-1.0, 1.0) to (-1.5, 1.5). Label the x-axis as 'x' and the y-axis as 'sin(x)' in blue with font size 16, add the title 'A sin(x) plot' in blue, font size 22, oblique style, and display the plot. Also, save the figure, display it again from that location, and list the supported file types.

Code:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
fig = plt.figure()
plt.plot(x, np.sin(x), '--', color='blue', label='sin(x)')
plt.plot(x, np.sin(x - 1), color='g',linestyle=':') # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), linestyle='-.', color='0.75') # Grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44') # Hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3)) # RGB tuple, values 0 and 1
plt.plot(x, np.sin(x - 5), color='chartreuse'); # all HTML color names supported
plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5)
plt.xlabel('x', color='blue', fontsize=16)
plt.ylabel('sin(x)', color= 'blue', fontsize=16)
plt.legend()
plt.title("A sin(x) plot", color= 'blue', fontsize=22, style="oblique")

fig.savefig('my_figure.png')
fig.savefig('d:/DAP - Data Analytics with Python/my_figure.png')
from IPython.display import Image
Image('my_figure.png')
fig.canvas.get_supported_filetypes()

Output:

b) Aim: Write Python code to create a scatter plot with random data points using Matplotlib. Follow
the steps and code provided below to complete the exercise.

Import the necessary libraries. Create an array x with 100 random values from a standard normal distribution, an array y with 100 random values from a standard normal distribution, an array of 100 random color values, and an array of 100 random sizes scaled up to 1000. Create a scatter plot from these coordinates, colors, and sizes, setting the transparency, the color map, and a colorbar.

Code:
import matplotlib.pyplot as plt
import numpy as np
x = np.random.randn(100)

y = np.random.randn(100)
colors = np.random.rand(100)
sizes = 1000 * np.random.rand(100)
plt.scatter(x, y, c=colors, s=sizes, alpha=0.3,cmap='viridis')
#viridis, plasma, inferno, magma, cividis
#alpha controls the transparency of the points, while cmap controls
#the color mapping based on the data values.
plt.colorbar(); # show color scale

Output:

c) Seaborn plots

Aim:

Using the "flights" dataset from Seaborn, which contains the number of passengers per month over
several years, create the following visualizations:

1. A line plot showing the number of passengers over time, with a separate line for each year.
2. A bar plot comparing the total number of passengers for each year.
3. A box plot to visualize the distribution of passengers per month, colored by year.
4. A heatmap showing the number of passengers each month over the years.

d) Analysis

Provide an analysis of the seasonal trends and any noticeable changes over the years based on the
visualizations.

Code:
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("flights")

# 1. Line plot: Number of passengers over time for each year


plt.figure(figsize=(12, 6))
sns.lineplot(x='month', y='passengers', hue='year', data=df, palette='viridis')
plt.title('Monthly Passengers Over Time')
plt.show()

# 2. Bar plot: Total number of passengers per year


plt.figure(figsize=(10, 6))
total_passengers_per_year = df.groupby('year')['passengers'].sum().reset_index()
sns.barplot(x='year', y='passengers', data=total_passengers_per_year, palette='plasma')
plt.title('Total Passengers per Year')
plt.show()

# 3. Box plot: Distribution of passengers per month by year


plt.figure(figsize=(12, 6))
sns.boxplot(x='month', y='passengers', hue='year', data=df)
plt.title('Distribution of Passengers per Month by Year')
plt.show()

# 4. Heatmap: Number of passengers each month over the years


flights_pivot = df.pivot(index="month", columns="year", values="passengers")
plt.figure(figsize=(12, 6))

sns.heatmap(flights_pivot, annot=True, fmt="d", cmap='YlGnBu')


plt.title('Number of Passengers Each Month Over the Years')
plt.show()

Output:

d)
Line Plot: The number of passengers increases year over year, with a noticeable seasonal pattern in which the summer months typically see higher passenger numbers.
Bar Plot: The total number of passengers has generally increased each year, reflecting growth in air travel.
Box Plot: There is significant month-to-month variability in passenger numbers, with July and August consistently being the peak months.
Heatmap: The heatmap clearly shows the seasonal trends, with higher passenger numbers during the summer months across all years.

Practical No: 6

Creating advanced and interactive plots.


Read a dataset named salesdata.csv containing monthly sales information for different products across multiple regions. The dataset includes columns such as Product, Region, and Sales. Write Python code to create an interactive bar chart using Plotly that shows the sales of different products in a selected region. Use a dropdown menu to allow users to select different regions.

Dataset: salesdata.csv

Code:
import plotly.graph_objects as go
import pandas as pd

df = pd.read_csv("salesdata.csv")

# Filter the data for each region manually


df_c = df[df['Region'] == 'Central']

df_e = df[df['Region'] == 'East']


df_w = df[df['Region'] == 'West']
df_s = df[df['Region'] == 'South']  # only Central, East, West and South are plotted below

# Create traces for each region manually


trace1 = go.Bar(
x=df_c['Product'],
y=df_c['Sales'],
name='Central'
)
trace2 = go.Bar(
x=df_e['Product'],
y=df_e['Sales'],
name='East'
)
trace3 = go.Bar(
x=df_w['Product'],
y=df_w['Sales'],
name='West'
)
trace4 = go.Bar(
x=df_s['Product'],
y=df_s['Sales'],
name='South'
)

# Initialize the figure and add traces


fig = go.Figure(data=[trace1, trace2, trace3, trace4])

# Update layout with dropdown menu, specifying visibility for each trace
fig.update_layout(
updatemenus=[
dict(
buttons=[
dict(label='Central',
method='update',
args=[{'visible': [True, False, False, False]},
{'title': 'Sales by Product in Region: Central'}]),
dict(label='East',
method='update',
args=[{'visible': [False, True, False, False]},
{'title': 'Sales by Product in Region: East'}]),
dict(label='West',
method='update',
args=[{'visible': [False, False, True, False]},
{'title': 'Sales by Product in Region: West'}]),
dict(label='South',
method='update',
args=[{'visible': [False, False, False, True]},
{'title': 'Sales by Product in Region: South'}])
],
direction='down',
)
],
title="Sales by Product",

xaxis_title="Product",
yaxis_title="Total Sales",
barmode='group'
)

fig.show()

Output:

b) Create a dataset named student_scores.csv by defining a dictionary of students' scores in various subjects. The dataset contains columns StudentID, DAPScore, IFScore, DEScore, and Gender for 10 students. Write Python code to create an interactive scatter plot using Plotly to visualize the relationship between DAPScore and IFScore. Also, add hover information so that when a user hovers over a point, it displays the StudentID, DEScore, and Gender of the student.

Code:
import pandas as pd
import plotly.express as px
# Creating the dataset
data = {
'StudentID': range(1, 11),
'DAPScore': [78, 85, 92, 88, 76, 95, 89, 84, 91, 87],
'IFScore': [82, 79, 94, 90, 80, 97, 85, 89, 88, 90],
'DEScore': [75, 80, 88, 85, 78, 93, 84, 81, 92, 87],
'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F']
}

df = pd.DataFrame(data)
df.to_csv('student_scores.csv', index=False)
fig = px.scatter(df, x="DAPScore", y="IFScore", color="Gender",
hover_data=["StudentID", "DEScore", "Gender"],
title="DAP Score vs. IF Score")

fig.show()
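If the interactive chart needs to be viewed outside the notebook, a Plotly figure can also be exported as a standalone HTML file (the file name here is illustrative):

fig.write_html('student_scores_scatter.html')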

Practical No: 7

Performing EDA on a dataset.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#to ignore warnings
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv("used_cars.csv")

df.head()

df.tail()
df.info()

# Count unique values for each feature
df.nunique()

#total missing values for each field


df.isnull().sum()

# Missing values per column as a percentage
(df.isnull().sum()/(len(df)))*100

# Convert 'New_Price' from object to float64.
# The regex (\d+\.\d+|\d+) matches either a decimal number or an integer.
df['New_Price'] = df['New_Price'].str.replace(',', '').str.extract(r'(\d+\.\d+|\d+)').astype(float)
# The 'Price' column is already float64, so no conversion is necessary
# Group by 'Name'
grouped = df.groupby('Name')

# Fill NaN values in 'New_Price' and 'Price' with the mean of their respective groups
for name, group in grouped:
    new_price_mean = group['New_Price'].mean()
    price_mean = group['Price'].mean()
    df.loc[group.index, 'New_Price'] = group['New_Price'].fillna(new_price_mean)
    df.loc[group.index, 'Price'] = group['Price'].fillna(price_mean)

# View the cleaned dataframe


df.head()

# Missing values per column as a percentage
(df.isnull().sum()/(len(df)))*100

#Data Reduction
# Remove S.No. column from data
# axis=1 – column
df = df.drop(['S.No.'], axis = 1)
df.info()

#Feature Engineering - Creating Features


from datetime import date
date.today().year
df['Car_Age'] = date.today().year - df['Year']
df.head()

# split the name and introduce new variables “Brand” and “Model”
df['Brand'] = df.Name.str.split().str.get(0)
df['Model'] = df.Name.str.split().str.get(1) + df.Name.str.split().str.get(2)
df[['Name','Brand','Model']]

#Data Cleaning
print(df.Brand.unique())
print(df.Brand.nunique())

# The brand names 'ISUZU', 'Mini' and 'Land' look incorrect and need to be corrected
searchfor = ['Isuzu', 'ISUZU', 'Mini', 'Land']
df[df.Brand.str.contains('|'.join(searchfor))].head(5)
# 'Isuzu|ISUZU|Mini|Land' -> | means OR

df["Brand"].replace({"ISUZU": "Isuzu", "Mini": "Mini Cooper","Land":"Land Rover"}, inplace=True)


# Our data is ready for EDA
df.describe().T
# .T transposes: rows become columns and columns become rows
df.describe(include='all').T

# Before we do EDA, let's separate numerical and categorical variables for easy analysis
cat_cols=df.select_dtypes(include=['object']).columns
num_cols = df.select_dtypes(include=np.number).columns.tolist()
print("Categorical Variables:")
print(cat_cols)
print("Numerical Variables:")
print(num_cols)

#EDA Univariate Analysis


for col in num_cols:
    print(col)
    print('Skew :', round(df[col].skew(), 2))
    plt.figure(figsize=(15, 4))
    plt.subplot(1, 2, 1)
    # 1, 2 -> grid of 1 row and 2 columns; the trailing 1 selects the left-most plot
    sns.histplot(df[col], bins=15, kde=True)
    plt.ylabel('count')
    plt.subplot(1, 2, 2)  # the 2nd plot is the right-most
    sns.boxplot(x=df[col])
    plt.show()

#EDA Bivariate Analysis


plt.figure(figsize=(13,17))
sns.pairplot(data=df.drop(['Kilometers_Driven','Price'],axis=1))
plt.show()

Practical No: 8
Time series data analysis using Pandas.
a) Create a dataset containing daily temperature readings for the year 2023. Set the Date as the index of the DataFrame, and resample the data to a monthly frequency to calculate the average temperature for each month. Plot the original daily temperature data and the resampled monthly average temperatures.

Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate a sample dataset of daily temperature readings for a year


date_rng = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
temperature = np.random.normal(loc=15, scale=10, size=len(date_rng))
df = pd.DataFrame(temperature, index=date_rng, columns=['Temperature'])
print (df)

# Resample the data to a monthly frequency and calculate the mean temperature
monthly_avg_temp = df.resample('M').mean()
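# Note: newer pandas releases (2.2+) prefer the month-end alias 'ME' over 'M'; both produce the same monthly means here.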

# Plotting the original daily temperature data


plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Temperature'], label='Daily Temperature', color='blue', alpha=0.5)
plt.plot(monthly_avg_temp.index, monthly_avg_temp['Temperature'], label='Monthly Avg Temperature',
color='red', marker='o')
plt.title('Daily and Monthly Average Temperatures in 2023')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.legend()
plt.show()

Output:

b) Create a time series dataset of daily sales data for a retail store for the year 2023 and calculate the 7-day rolling average of the sales data. Plot the original sales data and the 7-day rolling average on the same graph to visualize the smoothing effect.

Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate a sample dataset of daily sales data for a year


date_rng = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
sales = np.random.randint(100, 500, size=len(date_rng))
df = pd.DataFrame(sales, index=date_rng, columns=['Sales'])

# Calculate the 7-day rolling average of the sales data


df['7-Day Rolling Avg'] = df['Sales'].rolling(window=7).mean()
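# The first six values are NaN until a full 7-day window is available;
# rolling(window=7, min_periods=1) would fill them (an optional tweak, not used in the original).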

# Plotting the original sales data and the 7-day rolling average
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Sales'], label='Daily Sales', color='blue', alpha=0.5)
plt.plot(df.index, df['7-Day Rolling Avg'], label='7-Day Rolling Avg', color='orange', linewidth=2)
plt.title('Daily Sales and 7-Day Rolling Average in 2023')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.show()

Output:

Practical No: 9

Applying statistical analysis on a dataset.

Aim: a) You are given a dataset containing the heights (in cm) of male and female students in a class.
Perform the following tasks:

Calculate the mean, median, mode, and standard deviation of the heights for both males and females.
Perform an independent t-test to check if there is a significant difference between the heights of males
and females. Use a significance level of 0.05.

Description:

1. Mean is the arithmetic average of the values; median is the middle value when the data are ordered; mode is the most frequently occurring value; and standard deviation measures how spread out the values are around the mean.

2. An independent t-test compares the means of two independent groups to check whether the observed difference between them is statistically significant.

Code:
import pandas as pd
import numpy as np
from scipy import stats

data = {
'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
'Height': [170, 160, 175, 158, 180, 162, 169, 159]
}
df = pd.DataFrame(data)

male_heights = df[df['Gender'] == 'Male']['Height']


female_heights = df[df['Gender'] == 'Female']['Height']

# Function for descriptive statistics


def descriptive_stats(heights):
    mode_result = stats.mode(heights, keepdims=True)
    # keepdims=True keeps the result with the same number of dimensions as the input array.
    # The first mode (mode_result.mode[0]) is used only if its count (mode_result.count[0]) is greater than 0.
    mode_value = mode_result.mode[0] if mode_result.count[0] > 0 else np.nan  # Check if a mode exists
    return {
        'Mean': np.mean(heights),
        'Median': np.median(heights),
        'Mode': mode_value,
        'Standard Deviation': np.std(heights, ddof=1)  # Sample standard deviation
    }

male_stats = descriptive_stats(male_heights)
female_stats = descriptive_stats(female_heights)

print("Male Stats:", male_stats)


print("Female Stats:", female_stats)

# 2. Independent T-test (Hypothesis Testing)


t_stat, p_value = stats.ttest_ind(male_heights, female_heights)
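# If the two groups cannot be assumed to have equal variances, Welch's variant is available (an optional alternative):
# t_stat, p_value = stats.ttest_ind(male_heights, female_heights, equal_var=False)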

if p_value < 0.05:
    print(f"There is a significant difference between the heights of males and females (p-value = {p_value:.5f}).")
else:
    print(f"There is no significant difference between the heights of males and females (p-value = {p_value:.5f}).")

Output:

b) A dataset contains information about the hours of study and marks scored by students in an exam.
Perform the following tasks:
1. Compute the correlation between the hours of study and marks scored.
2. Perform a simple linear regression analysis to predict marks based on hours of study.


Description:
1. Correlation measures the strength and direction of the linear relationship between two variables, ranging from -1 to +1.
2. Simple linear regression fits a straight line that predicts the dependent variable (marks scored) from the independent variable (hours of study).

Code:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

data = {
'Hours of Study': [2, 3, 4, 5, 6, 7, 8, 9],
'Marks Scored': [50, 55, 60, 65, 70, 75, 80, 85]
}
df = pd.DataFrame(data)

# 1. Correlation
correlation = np.corrcoef(df['Hours of Study'], df['Marks Scored'])[0, 1]
print(f"Correlation between Hours of Study and Marks Scored: {correlation:.2f}")

# 2. Simple Linear Regression


X = df[['Hours of Study']]
y = df['Marks Scored']

# Create and fit the linear regression model


model = LinearRegression()
model.fit(X, y)

# Predict marks for the given hours of study


predicted_marks = model.predict(X)

# Plotting the regression line


plt.scatter(df['Hours of Study'], df['Marks Scored'], color='blue', label='Actual Data')
plt.plot(df['Hours of Study'], predicted_marks, color='red', label='Regression Line')
plt.xlabel('Hours of Study')
plt.ylabel('Marks Scored')
plt.title('Hours of Study vs Marks Scored')
plt.legend()
plt.show()

# Display the regression equation


print(f"Regression Equation: Marks = {model.intercept_:.2f} + {model.coef_[0]:.2f} * Hours of Study")

Output:

Practical No: 10

Processing and analyzing text data.


Aim: You are working on a project to analyze customer reviews of a product. Your task is to preprocess the
text data to prepare it for further analysis, such as sentiment analysis or topic modeling. The Text is given as
follows:
"The product was excellent, I really liked it!",
"This is the worst purchase I have ever made. Absolutely disappointed!",

"The service was decent but the product did not meet expectations.",
"Great quality, fast delivery, will buy again!"

Code:
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4') # For wordnet data

reviews = [
"The product was excellent, I really liked it!",
"This is the worst purchase I have ever made. Absolutely disappointed!",
"The service was decent but the product did not meet expectations.",
"Great quality, fast delivery, will buy again!"
]

def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = word_tokenize(text)
    return words

def remove_stopwords(words):
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word not in stop_words]
    return filtered_words

def lemmatize_words(words):
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]  # 'v' stands for verb
    return lemmatized_words

for review in reviews:
    print("Original Review:", review)
    tokens = preprocess_text(review)
    cleaned_tokens = remove_stopwords(tokens)
    lemmatized_tokens = lemmatize_words(cleaned_tokens)
    print("Cleaned, Tokenized & Lemmatized Review:", lemmatized_tokens)
    print()  # For spacing

Output: