DAP Journal
Practical No: 1
a) Setting Up Python
Step 2) Save the setup file to D:\, then right-click it and choose Run as administrator.
Once the setup is successful, you will see the following screen.
b) Jupyter Notebooks
o Anaconda is a popular distribution for Python and includes Jupyter Notebook and
many essential libraries for data science.
o Download Anaconda from Anaconda Distribution.
(https://www.anaconda.com/download)
o Provide a valid email ID and select the checkbox.
o Click the Submit button.
● Keep clicking Next, leaving the default settings unchanged, until the Install
button appears.
● Click Install.
● Verify with your Gmail, GitHub, or MS Office account.
Click Next again whenever it appears on the screen, and then click Finish.
Launching Anaconda brings up the following screen, from which Jupyter Notebook can be
started
Or
Click Notebook on the Anaconda home screen to reach the following screen
Roll No: Name: 6
BSc DS Sem III Data Analytics with Python Journal KES Shroff College
OR
Click on select
● Anaconda comes with many libraries pre-installed, including Pandas. If you need
additional libraries (e.g., Matplotlib, Seaborn for visualization), you can install them using
Anaconda Navigator or via the command line with conda or pip.
Create a list of six integers and perform the following list operations: append a
new element to the list, remove a specified element, and sort the list in ascending
order. Also use max and min to find the largest and smallest elements in the list.
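# Original list
numbers = [5, 2, 9, 1, 5, 6]
print("Original list:", numbers)
# Append a new element
new_element = 10
numbers.append(new_element)
# Remove a specified element if it is present
element_to_remove = 9
if element_to_remove in numbers:
    numbers.remove(element_to_remove)
else:
    print(f"{element_to_remove} is not in the list")
# Sort the list in ascending order
numbers.sort()
# Find the largest and smallest elements
max_value = max(numbers)
min_value = min(numbers)
print("Final list:", numbers)
print("Max:", max_value, "Min:", min_value)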
Output:
Practical No: 2
Aim: Create a DataFrame from a given student-grade dataset and clean it: fill
the missing values with the average of the column, calculate the average
grade for each student, and remove the duplicate rows. Then filter out low-
performing students (those whose average grade is less than or equal to 2),
find the maximum value in the 'Math' column, and identify the student(s)
with the maximum 'Math' score.
Description:
import pandas as pd
import numpy as np
# Create DataFrame
df = pd.read_csv("student-grade.csv")
max_math = df['Math'].max()
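The listing above only loads the file and finds the maximum 'Math' value. The remaining steps of the aim could be sketched as follows. This is a minimal sketch, not the journal's own solution: the small DataFrame stands in for student-grade.csv, and the column names 'Name', 'Science' and 'English' as well as all the values are assumptions made for illustration. Duplicates are dropped first here so that a repeated row does not distort the column means.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for student-grade.csv; names and grades are made up.
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meera", "Ravi", "Kiran"],
    "Math": [4.0, 2.0, np.nan, 2.0, 1.0],
    "Science": [3.0, 1.0, 5.0, 1.0, np.nan],
    "English": [5.0, 3.0, 4.0, 3.0, 2.0],
})
subjects = ["Math", "Science", "English"]

# Remove duplicate rows so a repeated student does not skew the column means.
df = df.drop_duplicates().reset_index(drop=True)

# Fill missing values with the mean of each subject column.
df[subjects] = df[subjects].fillna(df[subjects].mean())

# Average grade per student (row-wise mean over the subject columns).
df["Average"] = df[subjects].mean(axis=1)

# Filter out low performers: keep only students whose average is above 2.
passed = df[df["Average"] > 2]

# Highest Math score and the student(s) who achieved it.
max_math = df["Math"].max()
top_math_students = df.loc[df["Math"] == max_math, "Name"].tolist()

print(passed)
print("Max Math:", max_math, "scored by", top_math_students)
```

With a real file, the same steps would apply after pd.read_csv, provided the subject columns are numeric.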
Output:
DataFrame
b) data grouping
Aim: Given a sample dataset containing sales data for different regions
and products, along with the date of each sale, calculate the total and
average sales for each region, and sort the resulting DataFrame by total
sales in descending order.
import pandas as pd
# Sample data
data = {
'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'],
'Region': ['North', 'South', 'East', 'West', 'North'],
'Product': ['A', 'B', 'A', 'B', 'A'],
'Sales': [150, 200, 130, 170, 160]
}
# Create DataFrame
df = pd.DataFrame(data)
print(df)
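# Total and average sales for each region
grouped_df = df.groupby('Region').agg(
    Total_Sales=pd.NamedAgg(column='Sales', aggfunc='sum'),
    Average_Sales=pd.NamedAgg(column='Sales', aggfunc='mean')
).reset_index()
# Sort by total sales in descending order
sorted_df = grouped_df.sort_values(by='Total_Sales', ascending=False)
print(sorted_df)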
Output:
Practical No: 3
a) Aim: Customer Segmentation in Retail: A retail company wants to segment its customers based on
their purchase behaviour. Consider a dataset Mall_Customers.csv file containing customer
information and purchase history to create customer segments.
Data source:
Mall_Customers.csv
https://www.kaggle.com/datasets/shrutimechlearn/customer-data
Code:
import pandas as pd
df = pd.read_csv("Mall_Customers.csv")
# Number of rows and columns
print("Shape of the dataset:")
print(df.shape)
# Data type and non-null count for each column
print("\nColumn information:")
print(df.info())
print(df.to_string())
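# Segment customers by spending score.
# include_lowest=True keeps a score of exactly 1 in the first bin.
df["Segment"] = pd.cut(df["Spending_Score"],
                       bins=[1, 35, 70, 100],
                       labels=["Low Spender", "Medium Spender", "High Spender"],
                       include_lowest=True)
# Print the segmented data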
print("\nCustomer Segmentation:")
print(df[["CustomerID", "Spending_Score", "Segment"]].to_string())
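By default pd.cut builds right-closed intervals, so a value equal to the lowest bin edge receives no label unless include_lowest=True is passed. The following minimal sketch shows the difference for the bin edges used in this practical; the scores are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Illustrative scores only; they are not taken from Mall_Customers.csv.
scores = pd.Series([1, 35, 36, 70, 71, 100])
bins = [1, 35, 70, 100]
labels = ["Low Spender", "Medium Spender", "High Spender"]

# Default behaviour: intervals are (1, 35], (35, 70], (70, 100], so 1 is left out.
default_cut = pd.cut(scores, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 1 is included.
inclusive_cut = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"score": scores,
                    "default": default_cut,
                    "inclusive": inclusive_cut}))
```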
output:
b) aim: Consider a CSV file named sales_data.csv containing sales data for different products. Load
this data into a Pandas DataFrame and answer the following:
● Calculate the total sales for each product category (e.g., Product A, Product B, Product C).
● Identify the product category with the highest average sales.
Data source:
Code:
import pandas as pd
sales_df = pd.read_csv('sales_data.csv')
print(sales_df.shape)
# Calculate total Revenue by Product
total_Revenue_by_Product = sales_df.groupby('Product')['Revenue'].sum()
print(total_Revenue_by_Product)
# Identify the product with the highest average revenue
avg_Revenue_by_Product = sales_df.groupby('Product')['Revenue'].mean()
max_avg_revenue_product = avg_Revenue_by_Product.idxmax()
print(f"Product with the highest average Revenue: {max_avg_revenue_product}")
output:
Practical No: 4
a) Aim: Create a NumPy array of 20 elements using arange, arranged in 5 rows and 4 columns. Find the
index of the maximum element overall, along each row, and along each column. Sort the entire array,
then sort along each row and along each column. Find the mean of every 1D NumPy array (row) in the
2D array, and reverse the array.
Code:
import numpy as np
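The listing stops after the import. One possible way to carry out the steps named in the aim is sketched below; the variable names are chosen for illustration, and "reverse the array" is read here as reversing the element order along both axes, which is one of several reasonable interpretations.

```python
import numpy as np

# 20 consecutive integers arranged as 5 rows x 4 columns
arr = np.arange(20).reshape(5, 4)

# Index of the overall maximum: flat index, then its (row, column) position
flat_idx = np.argmax(arr)
row_col = np.unravel_index(flat_idx, arr.shape)

# Index of the maximum within each row (axis=1) and within each column (axis=0)
max_idx_per_row = np.argmax(arr, axis=1)
max_idx_per_col = np.argmax(arr, axis=0)

# Sorting: flattened array, within each row, within each column
sorted_all = np.sort(arr, axis=None)
sorted_rows = np.sort(arr, axis=1)
sorted_cols = np.sort(arr, axis=0)

# Mean of every row (each row is a 1D array)
row_means = arr.mean(axis=1)

# Reverse the element order along both axes
reversed_arr = arr[::-1, ::-1]

print(arr)
print("Max at flat index", flat_idx, "position", row_col)
print("Row means:", row_means)
print(reversed_arr)
```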
Output:
b) Aim: Define a DataFrame with columns 'A', 'B' and 'C', holding the values 1, 2, 3, 4 for 'A';
5, 6, 7, 8 for 'B'; and 9, 10, 11, 12 for 'C'. Use 'row1', 'row2', 'row3' and 'row4' as the index
of the four rows. Perform the following selections on this DataFrame.
Code:
import pandas as pd
data = {
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8],
'C': [9, 10, 11, 12]
}
index = ['row1', 'row2', 'row3', 'row4']
df = pd.DataFrame(data, index=index)
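# All rows of column 'A'
df.loc[:, 'A']
# Rows 'row2' to 'row3' (both inclusive) of column 'B'
df.loc['row2':'row3', 'B']
# Rows 'row2' and 'row4' of column 'B'
df.loc[['row2','row4'], 'B']
# All rows of columns 'A' through 'C'
df.loc[:, 'A':'C']
# Rows where 'A' is greater than 2
df[df['A'] > 2]
# Rows where 'A' is greater than 3
df[df['A'] > 3]
# Rows where 'B' equals 7
df[df['B'] == 7]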
Output:
Practical No: 5
a) Aim: Write Python code to create a plot with multiple variations of the sine function using
Matplotlib. Follow the steps below to complete the exercise.
Import the necessary libraries and create an array x with 100 points ranging from 0 to 10. Create a
figure and plot multiple sine waves with different line styles and colors. Add a legend. Change the
x-axis limits from (0, 10) to (-1, 11) and the y-axis limits from (-1.0, 1.0) to (-1.5, 1.5). Label
the x-axis 'x' and the y-axis 'sin(x)' in blue with font size 16, and add the title 'A sin(x) plot'
in blue with font size 22 and oblique style. Display the plot. Also save the figure, display it
again from the saved location, and list the supported file types.
Code:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
fig = plt.figure()
plt.plot(x, np.sin(x), '--', color='blue', label='sin(x)')
plt.plot(x, np.sin(x - 1), color='g',linestyle=':') # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), linestyle='-.', color='0.75') # Grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44') # Hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3)) # RGB tuple, values between 0 and 1
plt.plot(x, np.sin(x - 5), color='chartreuse') # all HTML color names supported
plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5)
plt.xlabel('x', color= 'blue', fontsize=16)
plt.ylabel('sin(x)', color= 'blue', fontsize=16)
plt.legend()
plt.title("A sin(x) plot", color= 'blue', fontsize=22, style="oblique")
fig.savefig('my_figure.png')
fig.savefig('d:/DAP - Data Analytics with Python/my_figure.png')
from IPython.display import Image
Image('my_figure.png')
fig.canvas.get_supported_filetypes()
output:
b) Aim: Write Python code to create a scatter plot with random data points using Matplotlib. Follow
the steps below to complete the exercise.
Import the necessary libraries. Create arrays x and y, each with 100 random values from a standard
normal distribution. Create an array colors with 100 random values and an array sizes with 100
random values scaled up to 1000. Create a scatter plot from these coordinates, colors and sizes,
and set the transparency, color map and colorbar.
code:
import matplotlib.pyplot as plt
import numpy as np
x = np.random.randn(100)
y = np.random.randn(100)
colors = np.random.rand(100)
sizes = 1000 * np.random.rand(100)
plt.scatter(x, y, c=colors, s=sizes, alpha=0.3,cmap='viridis')
#viridis, plasma, inferno, magma, cividis
#alpha controls the transparency of the points, while cmap controls
#the color mapping based on the data values.
plt.colorbar(); # show color scale
output:
c) seaborn plots
Aim:
Using the "flights" dataset from Seaborn, which contains the number of passengers per month over
several years, create the following visualizations:
1. A line plot showing the number of passengers over time, with a separate line for each year.
2. A bar plot comparing the total number of passengers for each year.
3. A box plot to visualize the distribution of passengers per month, colored by year.
4. A heatmap showing the number of passengers each month over the years.
d) Analysis
Provide an analysis of the seasonal trends and any noticeable changes over the years based on the
visualizations.
code:
import seaborn as sns
import matplotlib.pyplot as plt
df = sns.load_dataset("flights")
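The bar plot and the heatmap in the aim both rest on simple reshapes of the long-format data (one row per year and month). The sketch below shows those two reshapes with pandas only. It uses a tiny made-up frame that mimics the year, month and passengers columns of the flights dataset; the numbers are placeholders, not the real data.

```python
import pandas as pd

# Made-up numbers in the same long format as the flights dataset
# (columns: year, month, passengers); NOT the real data.
demo = pd.DataFrame({
    "year": [1949, 1949, 1949, 1950, 1950, 1950],
    "month": ["Jan", "Jul", "Dec", "Jan", "Jul", "Dec"],
    "passengers": [100, 150, 110, 120, 170, 130],
})

# Input for the bar plot: total passengers per year.
yearly_totals = demo.groupby("year")["passengers"].sum()

# Input for the heatmap: months as rows, years as columns.
month_order = ["Jan", "Jul", "Dec"]
heatmap_table = (
    demo.pivot(index="month", columns="year", values="passengers")
        .reindex(month_order)
)

print(yearly_totals)
print(heatmap_table)
```

With the real dataset loaded into df, the same groupby and pivot calls would produce the tables that a bar plot and sns.heatmap can draw.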
Output:
d)
Line Plot: The number of passengers increases from year to year, with a noticeable seasonal pattern:
within each year the numbers climb towards July, and the summer months typically see the highest
passenger numbers.
Bar Plot: The total number of passengers has generally increased each year, reflecting growth in air travel.
Box Plot: There is significant variability in the number of passengers month-to-month, with July and
August consistently being peak months.
Heatmap: The heatmap clearly shows the seasonal trends, indicating higher numbers appearing during the
summer months across all years.
Practical No: 6
Dataset: salesdata.csv
Code:
import plotly.graph_objects as go
import pandas as pd
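df = pd.read_csv("salesdata.csv")
# Build one bar trace per region, in the same order as the dropdown buttons below.
# Assumption: the CSV has 'Region', 'Product' and 'Sales' columns.
fig = go.Figure()
for region in ['Central', 'East', 'West', 'South']:
    region_sales = df[df['Region'] == region].groupby('Product')['Sales'].sum()
    fig.add_trace(go.Bar(x=region_sales.index, y=region_sales.values, name=region))
# Update layout with dropdown menu, specifying visibility for each trace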
fig.update_layout(
updatemenus=[
dict(
buttons=[
dict(label='Central',
method='update',
args=[{'visible': [True, False, False, False]},
{'title': 'Sales by Product in Region: Central'}]),
dict(label='East',
method='update',
args=[{'visible': [False, True, False, False]},
{'title': 'Sales by Product in Region: East'}]),
dict(label='West',
method='update',
args=[{'visible': [False, False, True, False]},
{'title': 'Sales by Product in Region: West'}]),
dict(label='South',
method='update',
args=[{'visible': [False, False, False, True]},
{'title': 'Sales by Product in Region: South'}])
],
direction='down',
)
],
title="Sales by Product",
xaxis_title="Product",
yaxis_title="Total Sales",
barmode='group'
)
fig.show()
output
Code:
import pandas as pd
import plotly.express as px
# Creating the dataset
data = {
'StudentID': range(1, 11),
'DAPScore': [78, 85, 92, 88, 76, 95, 89, 84, 91, 87],
'IFScore': [82, 79, 94, 90, 80, 97, 85, 89, 88, 90],
'DEScore': [75, 80, 88, 85, 78, 93, 84, 81, 92, 87],
'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F']
}
df = pd.DataFrame(data)
df.to_csv('student_scores.csv', index=False)
fig = px.scatter(df, x="DAPScore", y="IFScore", color="Gender",
hover_data=["StudentID", "DEScore", "Gender"],
title="DAP Score vs. IF Score")
fig.show()
Practical No: 7
import pandas as pd
df = pd.read_csv("used_cars.csv")
df.head()
df.tail()
df.info()
# Convert 'New_Price' from object to float64.
# (\d+\.\d+|\d+) matches either a decimal number or an integer.
df['New_Price'] = df['New_Price'].str.replace(',', '').str.extract(r'(\d+\.\d+|\d+)', expand=False).astype(float)
# The 'Price' column is already float64, so no conversion is necessary
# Group by 'Name'
grouped = df.groupby('Name')
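# Fill NaN values in 'New_Price' and 'Price' with the mean of their respective groups
for name, group in grouped:
    new_price_mean = group['New_Price'].mean()
    price_mean = group['Price'].mean()
    df.loc[group.index, 'New_Price'] = group['New_Price'].fillna(new_price_mean)
    df.loc[group.index, 'Price'] = group['Price'].fillna(price_mean)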
#Data Reduction
# Remove S.No. column from data
# axis=1 – column
df = df.drop(['S.No.'], axis = 1)
df.info()
# split the name and introduce new variables “Brand” and “Model”
df['Brand'] = df.Name.str.split().str.get(0)
df['Model'] = df.Name.str.split().str.get(1) + df.Name.str.split().str.get(2)
df[['Name','Brand','Model']]
#Data Cleaning
print(df.Brand.unique())
print(df.Brand.nunique())
# The brand names 'Isuzu', 'ISUZU', 'Mini' and 'Land' look incorrect and need to be corrected
searchfor = ['Isuzu', 'ISUZU', 'Mini', 'Land']
# '|'.join(searchfor) builds the pattern 'Isuzu|ISUZU|Mini|Land', where | means "or"
df[df.Brand.str.contains('|'.join(searchfor))].head(5)
df.describe(include='all').T
Practical No: 8
Time series data analysis using Pandas.
a) Create a dataset containing daily temperature readings for the year 2023. Set the Date as the
index of the DataFrame, and resample the data to a monthly frequency to calculate the average
temperature for each month. Plot the original daily temperature data and the resampled monthly
average temperatures.
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Resample the data to a monthly frequency and calculate the mean temperature
monthly_avg_temp = df.resample('M').mean()
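The resampling line above assumes a DataFrame df with a date index already exists. A minimal sketch of how such a dataset might be built and resampled is given below. The temperatures are synthetic (a sine-shaped seasonal curve plus random noise), and the column name 'Temperature' is an assumption. The sketch uses the 'MS' (month start) alias, since recent pandas versions (2.2 and later) deprecate 'M' in favour of 'ME'.

```python
import numpy as np
import pandas as pd

# Synthetic daily temperatures for 2023: a seasonal sine curve plus noise.
# The values are illustrative, not real measurements.
rng = np.random.default_rng(seed=42)
dates = pd.date_range(start="2023-01-01", end="2023-12-31", freq="D")
day_of_year = np.arange(len(dates))
temperature = (25 + 8 * np.sin(2 * np.pi * day_of_year / 365)
               + rng.normal(0, 2, len(dates)))

df = pd.DataFrame({"Date": dates, "Temperature": temperature})
df = df.set_index("Date")  # resample() needs a DatetimeIndex

# Monthly mean temperature; "MS" labels each bucket with the month's first day.
monthly_avg_temp = df.resample("MS").mean()
print(monthly_avg_temp)
```

Both df and monthly_avg_temp can then be passed to plt.plot with their index on the x-axis to compare the daily and monthly series.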
output:
b) Create a time series dataset of daily sales data for a retail store for the year 2023, calculate the 7-
day rolling average of the sales data. Plot the original sales data and the 7-day rolling average on the
same graph to visualize the smoothing effect.
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Plotting the original sales data and the 7-day rolling average
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Sales'], label='Daily Sales', color='blue', alpha=0.5)
plt.plot(df.index, df['7-Day Rolling Avg'], label='7-Day Rolling Avg', color='orange', linewidth=2)
plt.title('Daily Sales and 7-Day Rolling Average in 2023')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.show()
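The plotting code above expects df to hold a 'Sales' column and a '7-Day Rolling Avg' column. One way these might be prepared is sketched below; the sales figures are random integers generated purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily sales for 2023 (illustrative values only).
rng = np.random.default_rng(seed=0)
dates = pd.date_range(start="2023-01-01", end="2023-12-31", freq="D")
df = pd.DataFrame({"Sales": rng.integers(100, 500, size=len(dates))}, index=dates)
df.index.name = "Date"

# 7-day rolling mean; the first six days have no full window and stay NaN.
df["7-Day Rolling Avg"] = df["Sales"].rolling(window=7).mean()
print(df.head(10))
```

Because each rolling value averages the current day with the six days before it, the smoothed line starts on 7 January and lags sudden changes in the daily series slightly.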
output:
Practical No: 9
Aim: a) You are given a dataset containing the heights (in cm) of male and female students in a class.
Perform the following tasks:
Calculate the mean, median, mode, and standard deviation of the heights for both males and females.
Perform an independent t-test to check if there is a significant difference between the heights of males
and females. Use a significance level of 0.05.
Description:
Code:
import pandas as pd
import numpy as np
from scipy import stats
data = {
'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
'Height': [170, 160, 175, 158, 180, 162, 169, 159]
}
df = pd.DataFrame(data)
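# Separate the heights by gender
male_heights = df[df['Gender'] == 'Male']['Height']
female_heights = df[df['Gender'] == 'Female']['Height']
def descriptive_stats(heights):
    # Series.mode() returns every most-frequent value in sorted order; take the first
    mode_value = heights.mode().iloc[0]
    return {
        'Mean': np.mean(heights),
        'Median': np.median(heights),
        'Mode': mode_value,
        'Standard Deviation': np.std(heights, ddof=1) # Sample standard deviation
    }
male_stats = descriptive_stats(male_heights)
female_stats = descriptive_stats(female_heights)
print("Male:", male_stats)
print("Female:", female_stats)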
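The aim also asks for an independent t-test at a significance level of 0.05, which the listing does not show. A minimal sketch using scipy.stats.ttest_ind follows. The heights are copied from the sample data above; the test uses SciPy's default assumption of equal variances, which is a choice made here and not something the journal states.

```python
from scipy import stats

# Heights (cm) copied from the sample data in this practical
male_heights = [170, 175, 180, 169]
female_heights = [160, 158, 162, 159]

alpha = 0.05  # significance level given in the aim

# Independent two-sample t-test (equal variances assumed, SciPy's default)
t_stat, p_value = stats.ttest_ind(male_heights, female_heights)

print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis: the mean heights differ significantly.")
else:
    print("Fail to reject the null hypothesis: no significant difference found.")
```

If equal variances cannot be assumed, passing equal_var=False switches to Welch's t-test.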
output:
b) A dataset contains information about the hours of study and marks scored by students in an exam.
Perform the following tasks:
1. Compute the correlation between the hours of study and marks scored.
2. Perform a simple linear regression analysis to predict marks based on hours of study.
Description:
1. Describe correlation
2. Linear regression analysis
Code:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
data = {
'Hours of Study': [2, 3, 4, 5, 6, 7, 8, 9],
'Marks Scored': [50, 55, 60, 65, 70, 75, 80, 85]
}
df = pd.DataFrame(data)
# 1. Correlation
correlation = np.corrcoef(df['Hours of Study'], df['Marks Scored'])[0, 1]
print(f"Correlation between Hours of Study and Marks Scored: {correlation:.2f}")
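The second task, simple linear regression, is not shown in the listing. The imports suggest that scikit-learn's LinearRegression was intended; the sketch below swaps in scipy.stats.linregress instead, which fits the same least-squares line with fewer steps. The 10-hour prediction is a hypothetical example input.

```python
from scipy import stats

# Data copied from the sample dataset in this practical
hours = [2, 3, 4, 5, 6, 7, 8, 9]
marks = [50, 55, 60, 65, 70, 75, 80, 85]

# Ordinary least-squares fit of marks = slope * hours + intercept
result = stats.linregress(hours, marks)
slope, intercept = result.slope, result.intercept

# Predict marks for a hypothetical student who studies 10 hours
predicted = slope * 10 + intercept

print(f"marks = {slope:.2f} * hours + {intercept:.2f}")
print(f"Predicted marks for 10 hours of study: {predicted:.1f}")
```

Because the sample data lie exactly on a straight line, the fitted line passes through every point and the correlation equals 1; real data would leave residuals.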
output:
Practical No: 10
Code:
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4') # For wordnet data
reviews = [
"The product was excellent, I really liked it!",
"This is the worst purchase I have ever made. Absolutely disappointed!",
"The service was decent but the product did not meet expectations.",
"Great quality, fast delivery, will buy again!"
]
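def preprocess_text(text):
    # Lowercase, strip punctuation, then split into word tokens
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = word_tokenize(text)
    return words
def remove_stopwords(words):
    # Drop common English words such as 'the', 'is' and 'was'
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word not in stop_words]
    return filtered_words
def lemmatize_words(words):
    # Reduce each word to its base form, treating it as a verb
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words] # 'v' stands for verb
    return lemmatized_words
# Run each review through the full pipeline
for review in reviews:
    words = lemmatize_words(remove_stopwords(preprocess_text(review)))
    print(words)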
Output: