FDS Lab Question Bank

DATA SCIENCE LAB QUESTION BANK

1. You are tasked with analyzing the distribution of exam scores in a class of 100 students. The scores are normally distributed with a mean of 75 and a standard deviation of 10. To better understand the distribution and visualize it, you decide to write a Python program to generate and plot the normal curve (probability density function).

Program:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

mean = 75
std_dev = 10

# Evaluate the PDF over mean ± 4 standard deviations
x = np.linspace(mean - 4*std_dev, mean + 4*std_dev, 1000)
pdf = norm.pdf(x, mean, std_dev)

plt.figure(figsize=(10, 6))
plt.plot(x, pdf, label=f'Normal Distribution PDF (μ={mean}, σ={std_dev})')
plt.xlabel('Score')
plt.ylabel('Probability Density')
plt.title('Normal Distribution Curve of Exam Scores')
plt.legend()
plt.grid(True)
plt.show()
Output: (plot: a bell-shaped normal curve centered at a score of 75)
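An optional extension (not part of the original question): with the same mean and standard deviation, the probability of scoring above a chosen cutoff can be read off the distribution. A minimal sketch, assuming a hypothetical cutoff of 90:

from scipy.stats import norm

mean = 75
std_dev = 10

# Survival function sf(x) = 1 - cdf(x): probability of a score above the cutoff
p_above_90 = norm.sf(90, mean, std_dev)
print(f"P(score > 90) = {p_above_90:.4f}")  # about 0.0668 for a cutoff 1.5 sigma above the mean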
2. You are a meteorologist studying the distribution of rainfall intensity across a region. Your objective
is to visualize the density and contour of rainfall measurements at various locations without using a
dataset.
Program:
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data
grid_size = 100
x = np.linspace(0, 10, grid_size)
y = np.linspace(0, 10, grid_size)
X, Y = np.meshgrid(x, y)
rainfall_intensity = np.random.uniform(0, 10, size=(grid_size, grid_size))

# Plotting the filled contour plot (2D Density Plot)
plt.figure(figsize=(10, 8))
plt.contourf(X, Y, rainfall_intensity, cmap='Blues')
plt.colorbar(label='Rainfall Intensity')
plt.title('2D Density Plot of Rainfall Intensity')
plt.xlabel('X-coordinate')
plt.ylabel('Y-coordinate')
plt.grid(True)
plt.show()

# Plotting the contour plot
plt.figure(figsize=(10, 8))
plt.contour(X, Y, rainfall_intensity, cmap='Blues')
plt.colorbar(label='Rainfall Intensity')
plt.title('Contour Plot of Rainfall Intensity')
plt.xlabel('X-coordinate')
plt.ylabel('Y-coordinate')
plt.grid(True)
plt.show()
Output: (two plots: a filled contour plot and a line contour plot of the random rainfall field)
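Because np.random.uniform draws each grid value independently, the contours above look like speckled noise. A minimal alternative sketch, assuming two hypothetical Gaussian "storm cells" so the contours show smooth structure:

import numpy as np
import matplotlib.pyplot as plt

grid_size = 100
x = np.linspace(0, 10, grid_size)
y = np.linspace(0, 10, grid_size)
X, Y = np.meshgrid(x, y)

# Hypothetical smooth field: two Gaussian storm cells centered at (3, 7) and (7, 3)
Z = 8 * np.exp(-((X - 3)**2 + (Y - 7)**2) / 4) + 5 * np.exp(-((X - 7)**2 + (Y - 3)**2) / 6)

plt.figure(figsize=(10, 8))
plt.contourf(X, Y, Z, levels=20, cmap='Blues')
plt.colorbar(label='Rainfall Intensity')
plt.title('Smoothed Rainfall Intensity (synthetic storm cells)')
plt.xlabel('X-coordinate')
plt.ylabel('Y-coordinate')
plt.show()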
3. You are a researcher studying the relationship between student performance and study hours. Your
objective is to analyze the correlation between study hours and exam scores among a group of
students and visualize this relationship using scatter plots.
Program:
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(0)  # For reproducibility
num_students = 50
study_hours = np.random.randint(1, 10, num_students)
exam_scores = np.random.randint(40, 100, num_students)

# Calculate and display the Pearson correlation coefficient
correlation_coefficient = np.corrcoef(study_hours, exam_scores)[0, 1]
print(f"Pearson Correlation Coefficient: {correlation_coefficient:.2f}")

# Plotting the scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(study_hours, exam_scores, color='blue', alpha=0.8)
plt.title('Relationship between Study Hours and Exam Scores')
plt.xlabel('Study Hours')
plt.ylabel('Exam Scores')
plt.grid(True)
plt.show()
Output:
Pearson Correlation Coefficient: 0.09
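The coefficient is near zero because study_hours and exam_scores are drawn independently, so there is no real relationship to detect. A minimal sketch, assuming a hypothetical linear model (score = 40 + 6 × hours + noise), showing what a strong positive correlation looks like:

import numpy as np

np.random.seed(0)
num_students = 50
study_hours = np.random.randint(1, 10, num_students)

# Assumed model: scores rise linearly with hours, plus Gaussian noise
exam_scores = 40 + 6 * study_hours + np.random.normal(0, 5, num_students)

r = np.corrcoef(study_hours, exam_scores)[0, 1]
print(f"Pearson Correlation Coefficient: {r:.2f}")  # strongly positive, near 1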
4. You are a data analyst working for a retail company analyzing customer purchase behavior. Your
task is to explore and visualize the distribution of purchase amounts made by customers using
histograms.
Program:
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data for purchase amounts
np.random.seed(0)  # For reproducibility
num_transactions = 500
purchase_amounts = np.random.randint(10, 500, num_transactions)

# Plotting the histogram
plt.figure(figsize=(10, 6))
plt.hist(purchase_amounts, bins=20, color='blue', alpha=0.7)
plt.xlabel('Purchase Amount ($)')
plt.ylabel('Frequency')
plt.title('Histogram of Purchase Amounts')
plt.grid(True)
plt.show()

# Analyzing and interpreting the histogram
mean_purchase = np.mean(purchase_amounts)
median_purchase = np.median(purchase_amounts)
variance_purchase = np.var(purchase_amounts)
std_dev_purchase = np.std(purchase_amounts)

print(f"Mean Purchase Amount: ${mean_purchase:.2f}")
print(f"Median Purchase Amount: ${median_purchase:.2f}")
print(f"Variance of Purchase Amounts: {variance_purchase:.2f}")
print(f"Standard Deviation of Purchase Amounts: {std_dev_purchase:.2f}")

# Discuss insights
print("\nInsights:")
print("- The histogram shows a distribution of purchase amounts.")
print("- The mean and median values provide insights into the central tendency of the purchases.")
print("- Variance and standard deviation indicate the spread or dispersion of purchase amounts around the mean.")
Output:
Mean Purchase Amount: $252.80
Median Purchase Amount: $261.00
Variance of Purchase Amounts: 19483.83
Standard Deviation of Purchase Amounts: 139.58

Insights:
- The histogram shows a distribution of purchase amounts.
- The mean and median values provide insights into the central tendency of the purchases.
- Variance and standard deviation indicate the spread or dispersion of purchase amounts around the mean.
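An optional extension (not part of the original question): marking the mean and median directly on the histogram makes the central-tendency discussion visual. A minimal sketch, reusing the same synthetic data:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
purchase_amounts = np.random.randint(10, 500, 500)

plt.figure(figsize=(10, 6))
plt.hist(purchase_amounts, bins=20, color='blue', alpha=0.7)

# Vertical reference lines at the mean and median
plt.axvline(np.mean(purchase_amounts), color='red', linestyle='--', label='Mean')
plt.axvline(np.median(purchase_amounts), color='green', linestyle='-.', label='Median')

plt.xlabel('Purchase Amount ($)')
plt.ylabel('Frequency')
plt.title('Histogram of Purchase Amounts with Mean and Median')
plt.legend()
plt.show()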

5. You are a geologist studying the topography of a volcanic island in the Pacific Ocean. Your task is to
visualize a three-dimensional representation of the island's volcanic crater to analyze its shape and
contours. This visualization will help you understand the geological structure and potentially predict
volcanic activity.
Program:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Function to simulate the shape of the volcanic crater
def volcanic_crater(x, y):
    r = np.sqrt(x**2 + y**2)
    Z = np.exp(-r**2)  # Gaussian function for crater shape
    return Z

# Define the 3D grid
x = np.linspace(-10, 10, 100)
y = np.linspace(-10, 10, 100)
X, Y = np.meshgrid(x, y)

# Calculate Z coordinates for the volcanic crater
Z = volcanic_crater(X, Y)

# Plotting the 3D surface
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
surface = ax.plot_surface(X, Y, Z, cmap='terrain')

# Customize labels and title
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
ax.set_zlabel('Z-axis')
ax.set_title('Three-Dimensional Plot of Volcanic Crater')

# Add color bar
cbar = fig.colorbar(surface, shrink=0.75, aspect=10)
cbar.set_label('Elevation')

# Show plot
plt.show()

Output: (a 3D surface plot of the simulated crater, terrain colormap with an elevation colorbar)
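An optional extension (not part of the original question): projecting contour lines onto the base of the 3D axes ties the surface back to the 2D contour view from question 2. A minimal sketch, reusing the same grid and crater shape:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 100)
y = np.linspace(-10, 10, 100)
X, Y = np.meshgrid(x, y)
Z = np.exp(-(X**2 + Y**2))  # same Gaussian crater shape as above

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z, cmap='terrain', alpha=0.8)

# Project contour lines onto the z = 0 plane below the surface
ax.contour(X, Y, Z, zdir='z', offset=0, cmap='terrain')
ax.set_title('Volcanic Crater with Projected Contours')
plt.show()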
6. You are a data scientist working for a healthcare research organization studying diabetes
prevalence among the Pima Indian population. You have been tasked with conducting a
comprehensive univariate analysis of the Pima Indians Diabetes dataset to derive statistical insights
into key variables related to diabetes risk factors.
Program:
import pandas as pd
import numpy as np

# Load the dataset (assuming 'diabetes.csv' is in the same directory)
df = pd.read_csv('diabetes.csv')

# List of variables to analyze
variables = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
             'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

# Perform univariate analysis for each variable
for var in variables:
    # Frequency
    frequency = df[var].value_counts().sort_index()
    print(f"\nVariable: {var}")
    print(f"Frequency:\n{frequency}")

    # Central tendency, dispersion, and shape statistics
    print(df.agg({
        var: ["mean", "median", "var", "std", "skew", "kurt"]
    }))

    # Mode
    print("Mode:", df[var].mode())

# Interpretation and insights
print("\nInsights:")
print("- Skewness measures the asymmetry of the distribution.")
print("- Kurtosis measures the tailedness or peakedness of the distribution.")
print("- Analyze each variable's statistics to understand its distribution and implications for diabetes risk factors.")

Output:
Variable: Pregnancies
Frequency:
Pregnancies
0 111
1 135
……..
15 1
17 1
Name: count, dtype: int64
Pregnancies
mean 3.845052
median 3.000000
var 11.354056
std 3.369578
skew 0.901674
kurt 0.159220
Mode: 0 1
Name: Pregnancies, dtype: int64

……. Continued for all features…….

Insights:
- Skewness measures the asymmetry of the distribution.
- Kurtosis measures the tailedness or peakedness of the distribution.
- Analyze each variable's statistics to understand its distribution and implications for diabetes risk factors.
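A compact alternative sketch (same statistics, one table instead of a per-variable loop), assuming the same diabetes.csv file:

import pandas as pd

df = pd.read_csv('diabetes.csv')
variables = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
             'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

# One row per variable: mean, median, variance, std, skewness, kurtosis
summary = df[variables].agg(['mean', 'median', 'var', 'std', 'skew', 'kurt']).T
print(summary)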

7. As a data scientist investigating health outcomes in the Pima Indian community, you aim to
understand how glucose levels and diabetes pedigree (DiabetesPedigreeFunction) influence each
other within the context of diabetes progression. Your objective is to perform a bivariate analysis
using linear regression to explore the relationship between glucose levels and the
DiabetesPedigreeFunction among individuals in the dataset.

Program:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

data = pd.read_csv("diabetes.csv")

# Predictor (Glucose) and response (DiabetesPedigreeFunction)
X = data[["Glucose"]]
y = data["DiabetesPedigreeFunction"]

# Fit a simple linear regression on the full dataset
model = LinearRegression()
model.fit(X, y)

# Evaluate the fit on the same data; sklearn metrics expect (y_true, y_pred)
y_pred = model.predict(X)
print("R-Squared:", r2_score(y, y_pred))
print("Mean Squared Error:", mean_squared_error(y, y_pred))

Output:
R-Squared: 0.018861533924148022
Mean Squared Error: 0.10756779952130735
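An optional extension (not part of the original question): the fitted slope and intercept make the (weak) relationship concrete. A minimal sketch, assuming the same diabetes.csv file:

import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_csv("diabetes.csv")
X = data[["Glucose"]]
y = data["DiabetesPedigreeFunction"]

model = LinearRegression().fit(X, y)

# Change in DiabetesPedigreeFunction per one-unit rise in glucose
print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)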
8. As a data scientist researching health outcomes in the Pima Indian community, you're interested in
understanding how body mass index (BMI) influences the progression of diabetes. Your objective is to
perform bivariate analysis using logistic regression to explore the relationship between BMI and
diabetes progression among individuals in the dataset.

Program:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

data = pd.read_csv("diabetes.csv")

# Predictor (BMI) and binary response (Outcome: 1 = diabetic)
X = data[["BMI"]]
y = data["Outcome"]

# Fit logistic regression on the full dataset
model = LogisticRegression()
model.fit(X, y)

# Evaluate on the same data; sklearn metrics expect (y_true, y_pred)
y_pred = model.predict(X)
print("Accuracy Score:", accuracy_score(y, y_pred))
print("Report:", classification_report(y, y_pred))

Output:
Accuracy Score: 0.6640625
Report:               precision    recall  f1-score   support

           0       0.68      0.90      0.78       500
           1       0.55      0.22      0.31       268

    accuracy                           0.66       768
   macro avg       0.61      0.56      0.55       768
weighted avg       0.64      0.66      0.62       768
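An optional extension (not part of the original question): logistic regression also yields probabilities, which are often more informative than hard 0/1 labels. A minimal sketch with hypothetical BMI values:

import pandas as pd
from sklearn.linear_model import LogisticRegression

data = pd.read_csv("diabetes.csv")
model = LogisticRegression().fit(data[["BMI"]], data["Outcome"])

# Predicted probability of diabetes for three hypothetical BMI values
sample = pd.DataFrame({"BMI": [25.0, 35.0, 45.0]})
for bmi, p in zip(sample["BMI"], model.predict_proba(sample)[:, 1]):
    print(f"BMI {bmi}: P(diabetes) = {p:.2f}")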

9. As a data scientist at a botanical research institute, you are investigating how floral characteristics distinguish the different species of iris flowers in the Iris dataset. Your task is to fit a multivariable model (multinomial logistic regression on the four floral measurements) to predict species.

Program:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

data = pd.read_csv("data/iris.csv").iloc[1:]
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
target = 'species'

# Encode the species labels as integers
map_data = {"setosa": 1, "versicolor": 2, "virginica": 3}
X = data[features]
y = data[target].map(map_data)

# Hold out 20% of the data for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)

# Multinomial logistic regression on all four floral measurements
model = LogisticRegression(max_iter=500)
model.fit(X_tr, y_tr)

y_pred = model.predict(X_te)

# sklearn metrics expect (y_true, y_pred)
print("Accuracy:", accuracy_score(y_te, y_pred) * 100)
print("Report:", classification_report(y_te, y_pred))
Output:
Accuracy: 96.66666666666667
Report:               precision    recall  f1-score   support

           1       1.00      1.00      1.00         9
           2       1.00      0.93      0.96        14
           3       0.88      1.00      0.93         7

    accuracy                           0.97        30
   macro avg       0.96      0.98      0.97        30
weighted avg       0.97      0.97      0.97        30
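An optional extension (not part of the original question): a confusion matrix shows exactly which species get confused. A minimal sketch, appended after the program above (it reuses y_te and y_pred):

from sklearn.metrics import confusion_matrix

# Rows are true classes (1, 2, 3), columns are predicted classes
print(confusion_matrix(y_te, y_pred))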
