The document outlines practical exercises in Excel and Python for data analysis, including conditional formatting, pivot tables, VLOOKUP, and what-if analysis in Excel. It also covers data pre-processing with pandas, feature scaling, dummification, hypothesis testing, and ANOVA. Each section provides detailed steps and code examples for executing the tasks effectively.

Practical No. 01

Aim: Introduction to Excel


 Perform conditional formatting on a dataset using various criteria.
 Create a pivot table to analyze and summarize data.
 Use VLOOKUP function to retrieve information from a different worksheet or table.
 Perform what-if analysis using Goal Seek to determine input values for desired output.

 Perform conditional formatting on a dataset using various criteria.


We perform conditional formatting on the "Profit" column to highlight cells with a profit greater than 800 using the following steps:

Steps:
1. Select the "Profit" column (Column E).

2. Go to the "Home" tab on the ribbon.


3. Click on "Conditional Formatting" in the toolbar.
4. Choose "Highlight Cells Rules" and then "Greater Than."

5. Enter the threshold value as 800.


6. Customize the formatting options (e.g., choose a fill color).
7. Click "OK" to apply the rule.

 Create a pivot table to analyze and summarize data.


Following are the steps to create a pivot table to analyze total sales by category.

Steps:
1. Select the entire dataset including headers.
2. Go to the "Insert" tab on the ribbon.
3. Click on "PivotTable."

4. Choose where you want to place the PivotTable (e.g., new worksheet).
5. Drag "Category" to the Rows area.
6. Drag "Sales" to the Values area, choosing the sum function.

 Use VLOOKUP function to retrieve information from a different worksheet or table.


Use the VLOOKUP function to retrieve the category of "Product M" from a separate worksheet named "Product Table" using the following steps:
Steps:
1. Assuming your "Product Table" is in a different worksheet.
2. In a cell in your main dataset, enter the formula:
=VLOOKUP("M", 'Product Table'!A:B, 2, FALSE)
 Perform what-if analysis using Goal Seek to determine input values for desired output.
Use Goal Seek to find the required sales for "Product P" to achieve a profit of 1000 using the following
steps.

Steps:
1. Identify the cell containing the formula for "Profit" for "Product P" (let's assume it's in cell E17).
2. Go to the "Data" tab on the ribbon.
3. Click on "What-If Analysis" and select "Goal Seek."

4. Set "Set cell" to the profit cell (E17), "To value" to 1000, and "By changing cell" to the sales cell (C17).
5. Click "OK" to let Excel determine the required sales.
Practical No. 02

Aim: Data Frames and Basic Data Pre-processing


 Read data from CSV and JSON files into a data frame.
 Perform basic data pre-processing tasks such as handling missing values and
outliers.
 Manipulate and transform data using functions like filtering, sorting, and grouping.

Data pre-processing:
Data pre-processing is a crucial step in the data analysis pipeline, encompassing tasks such as
reading data from various file formats, handling missing values, and managing outliers. This
practical guide explores how to execute these tasks using the pandas library in Python.

Steps:
Step 1: Reading from CSV and JSON Files
1. Utilize pandas to read data from a CSV file ('DATA SET.csv') into a data frame.
2. Use pandas to read data from a JSON file ('ds.json') into a data frame.
3. Display the first few rows of each data frame to inspect the data.
Step 2: Handling Missing Values
1. Drop rows with missing values from the CSV data frame.
2. Fill missing values with a specific value (e.g., 0) in the JSON data frame.
Step 3: Handling Outliers
1. Identify outliers in the 'Sales' column of the CSV data frame.
2. Replace outliers with the median value.
Step 4: Manipulating and Transforming Data
1. Filter the CSV data frame to include only rows where 'Sales' is greater than 10.
2. Sort the CSV data frame based on the 'Sales' column in descending order.
3. Group the CSV data frame by the 'Category' column and calculate the mean for numeric columns ('Sales',
'Cost', 'Profit').
Step 5: Displaying Results
1. Display the cleaned CSV data frame after handling missing values.
2. Display the JSON data frame after filling missing values.
3. Display the filtered CSV data frame.
4. Display the sorted CSV data frame.
5. Display the grouped CSV data frame showing the mean values for numeric columns.

Code:
import pandas as pd
# Read data from CSV file into a data frame
csv_file_path = 'DATA SET.csv'
df_csv = pd.read_csv(csv_file_path)
# Read data from JSON file into a data frame
json_file_path = 'ds.json'
df_json = pd.read_json(json_file_path)
# Display the first few rows of each data frame to inspect the data
print("CSV Data:")
print(df_csv.head())
print("\nJSON Data:")
print(df_json.head())
# Handling missing values
# Drop rows with missing values
df_csv_cleaned = df_csv.dropna()
# Fill missing values with a specific value (e.g., 0)
df_json_filled = df_json.fillna(0)
# Handling outliers
# Assume 'Sales' is the column with outliers
# Replace outliers with the median
median_value = df_csv['Sales'].median()
upper_threshold = df_csv['Sales'].mean() + 2 * df_csv['Sales'].std()
lower_threshold = df_csv['Sales'].mean() - 2 * df_csv['Sales'].std()
df_csv['Sales'] = df_csv['Sales'].apply(
    lambda x: median_value if x > upper_threshold or x < lower_threshold else x
)
# Manipulate and transform data
# Filtering
filtered_data = df_csv[df_csv['Sales'] > 10]
# Sorting
sorted_data = df_csv.sort_values(by='Sales', ascending=False)
# Grouping and calculating mean for numeric columns
numeric_columns = ['Sales', 'Cost', 'Profit']
grouped_data = df_csv.groupby('Category')[numeric_columns].mean()
# Display the results
print("\nCleaned CSV Data:")
print(df_csv_cleaned.head())
print("\nFilled JSON Data:")
print(df_json_filled.head())
print("\nFiltered Data:")
print(filtered_data.head())
print("\nSorted Data:")
print(sorted_data.head())
print("\nGrouped Data:")
print(grouped_data.head())
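As an alternative to the mean ± 2·std rule used above, a small sketch (reusing df_csv and median_value from the listing) shows the IQR rule, a common robust way to flag outliers:

# Alternative sketch: flag outliers with the IQR rule instead of mean +/- 2*std
q1 = df_csv['Sales'].quantile(0.25)
q3 = df_csv['Sales'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
# Replace values outside the IQR fences with the median
df_csv['Sales'] = df_csv['Sales'].mask((df_csv['Sales'] < lower_bound) | (df_csv['Sales'] > upper_bound), median_value)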
Output:
Practical No. 03

Aim: Feature Scaling and Dummification


 Apply feature-scaling techniques like standardization and normalization to
numerical features.
 Perform feature dummification to convert categorical variables into
numerical representations.

Feature Scaling:
Feature scaling is a preprocessing technique used to standardize the range of independent
variables or features of the data. It is essential for certain machine learning algorithms that are
sensitive to the scale of input features, ensuring that all features contribute equally to the
learning process.

Feature Dummification:
Feature dummification or one-hot encoding is a technique used to convert categorical
variables into numerical representations. This is necessary because many machine learning
algorithms require numerical input, and representing categorical variables as binary vectors
helps maintain their information.

Steps:
1. Load and Explore Data: Load the dataset and explore its structure, identify numeric and
categorical features.
2. Feature Scaling: Apply standardization and normalization to numeric features.
3. Feature Dummification: Convert categorical variables into numerical representations
using one-hot encoding.
4. Combine Features: Combine scaled numeric features with one-hot encoded categorical
features.
5. Display Resulting Dataset: Display the final dataset after both feature scaling and
dummification.

Code:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Define the data
data = {
'Product': ['Apple_Juice', 'Banana_Smoothie', 'Orange_Jam', 'Grape_Jelly', 'Kiwi_Parfait',
'Mango_Chutney', 'Pineapple_Sorbet', 'Strawberry_Yogurt', 'Blueberry_Pie', 'Cherry_Salsa'],
'Category': ['Apple', 'Banana', 'Orange', 'Grape', 'Kiwi', 'Mango', 'Pineapple', 'Strawberry',
'Blueberry', 'Cherry'],
'Sales': [1200, 1700, 2200, 1400, 2000, 1000, 1500, 1800, 1300, 1600],
'Cost': [600, 850, 1100, 700, 1000, 500, 750, 900, 650, 800],
'Profit': [600, 850, 1100, 700, 1000, 500, 750, 900, 650, 800]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Display the original dataset
print("Original Dataset:")
print(df)
# Step 1: Feature Scaling (Standardization and Normalization)
numeric_columns = ['Sales', 'Cost', 'Profit']
scaler_standardization = StandardScaler()
scaler_normalization = MinMaxScaler()
df_scaled_standardized = pd.DataFrame(scaler_standardization.fit_transform(df[numeric_columns]), columns=numeric_columns)
df_scaled_normalized = pd.DataFrame(scaler_normalization.fit_transform(df[numeric_columns]), columns=numeric_columns)
# Combine the scaled numeric features with the categorical features
df_scaled = pd.concat([df_scaled_standardized, df.drop(numeric_columns, axis=1)], axis=1)
# Display the dataset after feature scaling
print("\nDataset after Feature Scaling:")
print(df_scaled)
# Step 2: Feature Dummification
# Identify categorical columns
categorical_columns = ['Product', 'Category']
# Create a column transformer for dummification
preprocessor = ColumnTransformer(
    transformers=[
        ('categorical', OneHotEncoder(), categorical_columns)
    ],
    remainder='passthrough'
)
# Apply the column transformer to the dataset
df_dummified = pd.DataFrame(preprocessor.fit_transform(df))
# Display the dataset after feature dummification
print("\nDataset after Feature Dummification:")
print(df_dummified)
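The dummified frame above loses its column names; as an optional sketch (assuming a recent scikit-learn version that provides get_feature_names_out), readable names can be restored:

# Optional sketch: recover readable column names from the ColumnTransformer
df_dummified.columns = preprocessor.get_feature_names_out()
print(df_dummified.head())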
Output:
Practical No. 04

Aim: Hypothesis Testing


 Formulate null and alternative hypotheses for a given problem.
 Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-
square test).
 Interpret the results and draw conclusions based on the test outcomes.

Hypothesis Testing:
Hypothesis testing is a statistical method used to make inferences about population parameters based on
sample data. It involves the formulation of a null hypothesis (H0) and an alternative hypothesis (H1), and the
collection of sample data to assess the evidence against the null hypothesis. The goal is to determine whether
there is enough evidence to reject the null hypothesis in favor of the alternative hypothesis.

1. Formulate Hypotheses:
 Null Hypothesis (H0): The average caffeine content per serving is 80 mg (μ = 80).
 Alternative Hypothesis (H1): The average caffeine content per serving is different from 80 mg (μ ≠ 80).
2. Statistical Test:
 A t-test is appropriate since you are comparing a sample mean to a known population mean, and the
sample size is small.
3. Data Collection:
 Randomly select 30 cans of the energy drink and measure the caffeine content in each.
4. Conducting the Hypothesis Test:
a. Collect Data:
 Calculate the sample mean (x̄) and standard deviation (s) from the 30 samples.
b. Set Significance Level (α):
 Choose a significance level (commonly α = 0.05, 0.01, or 0.10).
c. Calculate the Test Statistic (t-value):
 Use the formula t = (x̄ − μ) / (s / √n).
d. Determine Degrees of Freedom:
 For a one-sample t-test, degrees of freedom (df) is n−1.
e. Find Critical Values or P-value:
 Use a t-table or statistical software to find the critical t-values for a two-tailed test at the chosen
significance level.
f. Make a Decision:
 If the t-value falls outside the critical region, reject the null hypothesis. If it falls inside, fail to reject.
g. Interpretation:
 If you reject the null hypothesis, there is enough evidence to suggest that the average caffeine
content per serving is different from 80 mg. If you fail to reject the null hypothesis, there is not
enough evidence to suggest a difference in the average caffeine content.
5. Conclusion:
 Draw conclusions about the energy drink's caffeine content, considering both statistical and practical
significance. Consider decisions relevant to the context of the problem.
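Before the full demonstration code below (which uses a two-sample t-test on simulated data), a minimal sketch of the one-sample test described above, assuming hypothetical caffeine measurements, could look like this:

import numpy as np
from scipy import stats

# Hypothetical caffeine measurements (mg) from 30 cans
np.random.seed(1)
caffeine = np.random.normal(loc=82, scale=5, size=30)

# One-sample t-test against the claimed mean of 80 mg
t_statistic, p_value = stats.ttest_1samp(caffeine, popmean=80)
print(f"t-statistic: {t_statistic:.3f}, p-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject H0: the mean caffeine content differs from 80 mg.")
else:
    print("Fail to reject H0: no evidence that the mean differs from 80 mg.")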

Code:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Generate two samples for demonstration purposes
np.random.seed(42)
sample1 = np.random.normal(loc=10, scale=2, size=30)
sample2 = np.random.normal(loc=12, scale=2, size=30)
# Perform a two-sample t-test
t_statistic, p_value = stats.ttest_ind(sample1, sample2)
# Set the significance level
alpha = 0.05
print("Results of Two-Sample t-test:")
print(f"t-statistic: {t_statistic}")
print(f"p-value: {p_value}")
print(f"Degrees of Freedom: {len(sample1) + len(sample2) - 2}")
# Plot the distributions
plt.figure(figsize=(10, 6))
plt.hist(sample1, alpha=0.5, label='Sample 1', color='blue')
plt.hist(sample2, alpha=0.5, label='Sample 2', color='orange')
plt.axvline(np.mean(sample1), color='blue', linestyle='dashed', linewidth=2)
plt.axvline(np.mean(sample2), color='orange', linestyle='dashed', linewidth=2)
plt.title('Distributions of Sample 1 and Sample 2')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.legend()
# Highlight the critical region if the null hypothesis is rejected
if p_value < alpha:
    critical_region = np.linspace(min(sample1.min(), sample2.min()), max(sample1.max(), sample2.max()), 1000)
    plt.fill_between(critical_region, 0, 5, color='red', alpha=0.3, label='Critical Region')
# Show the observed t-statistic
plt.text(11, 5, f'T-statistic: {t_statistic:.2f}', ha='center', va='center', color='black', backgroundcolor='white')
# Show the plot
plt.show()
# Drawing conclusions
if p_value < alpha:
    if np.mean(sample1) > np.mean(sample2):
        print("Conclusion: There is significant evidence to reject the null hypothesis.")
        print("Interpretation: The mean caffeine content of Sample 1 is significantly higher than that of Sample 2.")
        # Additional context and practical implications can be added here.
    else:
        print("Conclusion: There is significant evidence to reject the null hypothesis.")
        print("Interpretation: The mean caffeine content of Sample 2 is significantly higher than that of Sample 1.")
        # Additional context and practical implications can be added here.
else:
    print("Conclusion: Fail to reject the null hypothesis.")
    print("Interpretation: There is not enough evidence to claim a significant difference between the means.")

Output:
Practical No. 05

Aim- ANOVA (Analysis of Variance)


 Perform one-way ANOVA to compare means across multiple groups.
 Conduct post-hoc tests to identify significant differences between group means.

Acquire and Prepare Data:


Obtain a dataset with a categorical independent variable (factor) and a continuous
dependent variable.
Ensure the data meets the assumptions of ANOVA, including normality, homogeneity of
variances, and independence of observations.
Clean and preprocess the data as needed, handling missing values and outliers.

Perform One-Way ANOVA:


Set up the hypothesis test:
Null Hypothesis (H0): The means of all groups are equal.
Alternative Hypothesis (H1): At least one group mean is different from the others.
Conduct the one-way ANOVA test using software such as R, Python (with libraries like
SciPy or statsmodels), or statistical packages like SPSS.
Calculate the F-statistic and corresponding p-value to determine the statistical significance
of the differences between group means.
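The matplotlib listings that follow are general plotting examples; a minimal sketch of the one-way ANOVA and post-hoc test themselves, assuming three hypothetical groups and the SciPy/statsmodels libraries, could look like this:

import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Illustrative data: three hypothetical groups of a continuous measurement
np.random.seed(0)
group_a = np.random.normal(loc=50, scale=5, size=20)
group_b = np.random.normal(loc=55, scale=5, size=20)
group_c = np.random.normal(loc=52, scale=5, size=20)

# One-way ANOVA: H0 is that all group means are equal
f_statistic, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F-statistic: {f_statistic:.3f}, p-value: {p_value:.4f}")

# Post-hoc Tukey HSD test to identify which pairs of group means differ
values = np.concatenate([group_a, group_b, group_c])
labels = ['A'] * 20 + ['B'] * 20 + ['C'] * 20
tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey)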

from matplotlib import pyplot as plt


movies=["golmaal","annabelle","bhoot-uncle","bhoothnath","de dana dan"]
num_oscars=[5,10,3,6,8]
plt.bar(range(len(movies)),num_oscars)
plt.title("Horror Movies")
plt.ylabel("oscar award 2024")
plt.xticks(range(len(movies)),movies)
plt.show()

Output-
from matplotlib import pyplot as plt
years = [2020,2021,2022,2023,2024]
failurepercentrates = [60,70,50,10,0]
plt.plot(years,failurepercentrates,color = "green" ,marker ="o", linestyle="solid" )
plt.title("corona times success rates")
plt.ylabel("percentages rates")
plt.show()

Output-

from matplotlib import pyplot as plt


from collections import Counter
totalnumber =[83,95,91,67,70,100]
histogram=Counter(min(score // 10*10,90) for score in totalnumber)
plt.bar([x+5 for x in histogram.keys()],
histogram.values(),
10,
edgecolor=(0,0,0))
plt.axis([-5,105,0,5])
plt.xticks([10*i for i in range(11)])
plt.xlabel("total_score")
plt.ylabel("N no of student")
plt.title("disttibution of exam 1 marks")
plt.show()

Output-
Practical No. 06

Aim: Regression and Its Types


 Implement simple linear regression using a dataset.
 Explore and interpret the regression model coefficients and goodness-of-fit
measures.
 Extend the analysis to multiple linear regression and assess the impact of
additional predictors.

Acquire Dataset:
Obtain a dataset suitable for regression analysis. The dataset should contain variables that
you believe may have a linear relationship or can be used to predict another variable of
interest.

Explore the Dataset:


Load the dataset into your preferred data analysis environment (e.g., Python with libraries
like Pandas and NumPy, or R).
Visualize the data using scatter plots, histograms, and other relevant plots to identify
potential relationships between variables.

Implement Simple Linear Regression:


Choose a predictor variable (independent variable) and a target variable (dependent variable)
based on your analysis.
Implement simple linear regression using the selected variables. This involves fitting a linear
model to the data and estimating coefficients (slope and intercept).

Visualization and Interpretation:


Visualize the regression line overlaid on the scatter plot to visually assess how well the
model fits the data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Load the dataset


df = pd.read_csv('diabetes.csv')

# Simple Linear Regression


X_simple = df[['Age']]
y_simple = df['Pregnancies']
X_train_simple, X_test_simple, y_train_simple, y_test_simple = train_test_split(X_simple,
y_simple, test_size=0.2, random_state=0)

regressor_simple = LinearRegression()
regressor_simple.fit(X_train_simple, y_train_simple)

# Predictions on the test set


y_pred_simple = regressor_simple.predict(X_test_simple)

# Model evaluation
print('Simple Linear Regression:')
print('Intercept:', regressor_simple.intercept_)
print('Coefficient:', regressor_simple.coef_)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test_simple, y_pred_simple))
print('Mean Squared Error:', metrics.mean_squared_error(y_test_simple, y_pred_simple))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test_simple,
y_pred_simple)))
print('R-squared:', metrics.r2_score(y_test_simple, y_pred_simple))

# Visualization for Simple Linear Regression


plt.scatter(X_simple, y_simple, color='gray')
plt.plot(X_simple, regressor_simple.predict(X_simple), color='red', linewidth=2)
plt.title('Simple Linear Regression')
plt.xlabel('Age')
plt.ylabel('Pregnancies')
plt.show()

# Multiple Linear Regression


X_multi = df[['Glucose', 'BloodPressure', 'Insulin']]
y_multi = df['Outcome']
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(X_multi, y_multi,
test_size=0.2, random_state=0)

regressor_multi = LinearRegression()
regressor_multi.fit(X_train_multi, y_train_multi)

# Predictions on the test set


y_pred_multi = regressor_multi.predict(X_test_multi)
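The listing ends after generating predictions for the multiple model; a short evaluation sketch, reusing the same metrics as the simple model above, might look like this:

# Sketch: evaluate the multiple regression model with the same metrics as the simple model
print('\nMultiple Linear Regression:')
print('Intercept:', regressor_multi.intercept_)
print('Coefficients:', regressor_multi.coef_)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test_multi, y_pred_multi))
print('Mean Squared Error:', metrics.mean_squared_error(y_test_multi, y_pred_multi))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test_multi, y_pred_multi)))
print('R-squared:', metrics.r2_score(y_test_multi, y_pred_multi))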

Output-
Practical No. 07

Aim- Logistic Regression and Decision Tree


 Build a logistic regression model to predict a binary outcome.
 Evaluate the model's performance using classification metrics (e.g., accuracy, precision, recall).
 Construct a decision tree model and interpret the decision rules for
classification.
Acquire Dataset:
Obtain a dataset suitable for binary classification tasks. The dataset should contain predictor
variables and a binary outcome variable.

Explore the Dataset:


Load the dataset into your preferred data analysis environment.
Perform exploratory data analysis (EDA) to understand the distribution of variables, identify
any missing values, and assess potential relationships between variables.

Preprocess the Data:


Handle missing values and perform any necessary data cleaning steps.
Encode categorical variables if required.
Split the dataset into training and testing sets for model evaluation.

Build Logistic Regression Model:


Choose predictor variables (features) based on the analysis.
Implement logistic regression model using the chosen features to predict the binary outcome.
Train the model using the training dataset.
Assess the model's performance using classification metrics such as accuracy, precision, recall,
F1-score, and ROC-AUC.

Evaluate Logistic Regression Model:


Evaluate the model's performance on the testing dataset using the chosen classification metrics.
Interpret the results to understand how well the logistic regression model predicts the binary
outcome.

Construct Decision Tree Model:


Choose predictor variables based on the analysis and understanding of the dataset.
Implement a decision tree model to predict the binary outcome.
Train the decision tree model using the training dataset.
Visualize the decision tree to interpret the decision rules for classification.
Evaluate Decision Tree Model:
Evaluate the decision tree model's performance on the testing dataset using classification
metrics similar to logistic regression.
Interpret the decision rules generated by the decision tree model to understand how the model
makes predictions.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Load the dataset


df = pd.read_csv('diabetesUp.csv')

# Split the dataset into features (X) and target variable (y)
# Assumption: the file contains a binary 'Outcome' column (as in the standard diabetes dataset),
# which matches the aim of predicting a binary outcome
X = df.drop(columns=['Outcome'])
y = df['Outcome']

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the data


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression
log_reg_model = LogisticRegression(max_iter=1000)  # increase max_iter to avoid a convergence warning
log_reg_model.fit(X_train_scaled, y_train)
y_pred_log_reg = log_reg_model.predict(X_test_scaled)

# Decision Tree
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train_scaled, y_train)
y_pred_dt = dt_model.predict(X_test_scaled)

# Evaluation metrics for Logistic Regression


print("Logistic Regression:")
print("Accuracy:", accuracy_score(y_test, y_pred_log_reg))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_log_reg))
print("Classification Report:")
print(classification_report(y_test, y_pred_log_reg, zero_division=1))  # zero_division=1 handles the zero-division warning

# Evaluation metrics for Decision Tree


print("\nDecision Tree:")
print("Accuracy:", accuracy_score(y_test, y_pred_dt))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_dt))
print("Classification Report:")
print(classification_report(y_test, y_pred_dt, zero_division=1))

# Plot confusion matrices


plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.heatmap(confusion_matrix(y_test, y_pred_log_reg), annot=True, cmap='Blues', fmt='g')
plt.title('Confusion Matrix - Logistic Regression')
plt.xlabel('Predicted')
plt.ylabel('True')

plt.subplot(1, 2, 2)
sns.heatmap(confusion_matrix(y_test, y_pred_dt), annot=True, cmap='Blues', fmt='g')
plt.title('Confusion Matrix - Decision Tree')
plt.xlabel('Predicted')
plt.ylabel('True')

plt.tight_layout()
plt.show()
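The decision-rule visualization mentioned in the steps is not in the listing above; a minimal sketch, assuming the fitted dt_model and feature frame X from that listing, could be:

# Sketch: visualize the fitted decision tree to inspect its decision rules
from sklearn.tree import plot_tree

plt.figure(figsize=(16, 8))
plot_tree(dt_model, feature_names=list(X.columns), class_names=[str(c) for c in dt_model.classes_],
          filled=True, max_depth=3, fontsize=8)  # max_depth keeps the plot readable
plt.show()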
Output-
Practical No. 08

Aim- K-Means Clustering


 Apply the K-Means algorithm to group similar data points into clusters.
 Determine the optimal number of clusters using elbow method or silhouette
analysis.
 Visualize the clustering results and analyze the cluster characteristics

Apply K-Means Algorithm:


Choose the number of clusters (K) that you want to create.
Apply the K-Means algorithm to the preprocessed data to cluster similar
data points into K clusters.
Iterate the algorithm until convergence, where the cluster centroids no
longer change significantly.

Determine Optimal Number of Clusters:


Use the elbow method or silhouette analysis to determine the optimal
number of clusters.
Elbow Method: Plot the within-cluster sum of squares (WCSS)
against the number of clusters. Choose the number of clusters where
the decrease in WCSS starts to slow down (elbow point).
Silhouette Analysis: Compute the silhouette scores for different
numbers of clusters. Choose the number of clusters with the highest
average silhouette score, indicating well-separated clusters.

Visualize Clustering Results:


Visualize the clustering results to understand the structure of the clusters.
Plot the clusters using scatter plots for two or three-dimensional data.
Use dimensionality reduction techniques such as PCA or t-SNE to
visualize high-dimensional data in two or three dimensions.
Analyze Cluster Characteristics:
Analyze the characteristics of each cluster to understand the patterns and
differences between clusters.
Compute cluster centroids to determine the center of each cluster in
feature space.
Explore the distribution of data points within each cluster to identify
commonalities and differences.

Interpret Results:
Interpret the clustering results based on the characteristics of each cluster.
Analyze any meaningful patterns or insights discovered through clustering.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Generate sample data


X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# Determine the optimal number of clusters using silhouette analysis


silhouette_scores = []
for n_clusters in range(2, 11):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)  # explicitly setting n_init
    cluster_labels = kmeans.fit_predict(X)
    silhouette_avg = silhouette_score(X, cluster_labels)
    silhouette_scores.append(silhouette_avg)

# Plot silhouette scores to determine the optimal number of clusters


plt.plot(range(2, 11), silhouette_scores, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis')
plt.show()
# Choose the optimal number of clusters based on the silhouette score or elbow method
n_clusters = silhouette_scores.index(max(silhouette_scores)) + 2  # adjusted because the range starts at 2

# Apply K-Means clustering with the optimal number of clusters


kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)  # explicitly setting n_init
kmeans.fit(X)
cluster_labels = kmeans.labels_

# Visualize the clustering results


plt.scatter(X[:, 0], X[:, 1], c=cluster_labels, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='*', s=300, alpha=0.5)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Means Clustering')
plt.show()

# Analyze the characteristics of each cluster


cluster_df = pd.DataFrame(X, columns=['Feature 1', 'Feature 2'])
cluster_df['Cluster'] = cluster_labels
cluster_summary = cluster_df.groupby('Cluster').mean()
print(cluster_summary)
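The listing uses silhouette analysis; an elbow-method sketch on the same data (assuming the X generated above) would be:

# Elbow-method sketch: plot WCSS (inertia) against the number of clusters
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS (inertia)')
plt.title('Elbow Method')
plt.show()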
Output-
Practical No. 09

Aim- Principal Component Analysis (PCA)


 Perform PCA on a dataset to reduce dimensionality.
 Evaluate the explained variance and select the appropriate number of principal
components.
 Visualize the data in the reduced-dimensional space.

Acquire and Preprocess Data:


Obtain a dataset suitable for PCA. Ensure that the dataset contains numerical
features.
Preprocess the data by handling missing values, scaling the features if necessary,
and encoding categorical variables.

Standardize the Data:


Standardize the features by subtracting the mean and dividing by the standard
deviation. This step is essential for PCA as it ensures that all features have the same
scale.

Compute Covariance Matrix:


Compute the covariance matrix of the standardized data. The covariance matrix
represents the relationships between different features in the dataset.
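A minimal sketch of these two preparation steps, using the Iris data for illustration (note that the PCA listing below works on the unscaled features), could be:

# Sketch: standardize the Iris features and compute their covariance matrix
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_raw = load_iris().data
X_std = StandardScaler().fit_transform(X_raw)   # zero mean, unit variance per feature
cov_matrix = np.cov(X_std, rowvar=False)        # 4 x 4 covariance matrix of the standardized features
print(cov_matrix)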

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = iris.target

# Perform PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Calculate explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
print("Explained Variance Ratio:", explained_variance_ratio)

# Visualize the data in the reduced-dimensional space


plt.figure(figsize=(8, 6))
for i in range(len(iris.target_names)):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1], label=iris.target_names[i])

plt.title('PCA of Iris Dataset')


plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
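To select the number of components rather than fixing it at two, a small sketch based on cumulative explained variance (reusing the X loaded above) might be:

# Sketch: pick the smallest number of components explaining at least 95% of the variance
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print("Components needed for 95% of the variance:", n_components)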
Output-
Practical No. 10

Aim- Data Visualization and Storytelling


 Create meaningful visualizations using data visualization tools
 Combine multiple visualizations to tell a compelling data story.
 Present the findings and insights in a clear and concise manner.

Select Visualization Tools:


Choose appropriate data visualization tools based on your familiarity, dataset
complexity, and desired visualizations.
Common tools include Python libraries like Matplotlib, Seaborn, Plotly, and
ggplot2 in R. Alternatively, you can use BI tools like Tableau, Power BI, or
online platforms like Google Data Studio.

Explore and Clean the Data:


Perform exploratory data analysis (EDA) to understand the distribution,
relationships, and patterns within the dataset.
Clean the data by handling missing values, outliers, and inconsistencies.

Identify Key Insights:


Identify key insights or findings from the dataset that you want to communicate
through visualizations.
Prioritize insights based on relevance and significance to the audience.

Create Meaningful Visualizations:


Design and create visualizations that effectively communicate the key insights
identified.
Choose appropriate chart types (e.g., bar charts, line charts, scatter plots,
histograms) based on the nature of the data and the insights you want to convey.
Use color, size, and other visual cues effectively to enhance understanding and
highlight important information.

Combine Visualizations into a Story:


Organize your visualizations into a cohesive narrative or story.
Create a storyboard or outline to structure the flow of the story, including an
introduction, main points, and conclusion.
Use a combination of text, annotations, and visual transitions to guide the
audience through the story.
Present the Findings:
Prepare for the presentation by practicing your delivery and ensuring you can
effectively communicate the insights.
Use clear and concise language to explain the visualizations and the insights they
convey.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load sample dataset (you can replace this with your own dataset)
df = sns.load_dataset('tips')

# Explore the dataset


print(df.head())

# Data Visualization
# Visualization 1: Distribution of Total Bill Amount
plt.figure(figsize=(10, 6))
sns.histplot(df['total_bill'], kde=True)
plt.title('Distribution of Total Bill Amount')
plt.xlabel('Total Bill Amount')
plt.ylabel('Frequency')
plt.show()

# Visualization 2: Relationship between Total Bill and Tip Amount


plt.figure(figsize=(10, 6))
sns.scatterplot(x='total_bill', y='tip', data=df, hue='sex')
plt.title('Relationship between Total Bill and Tip Amount')
plt.xlabel('Total Bill Amount')
plt.ylabel('Tip Amount')
plt.legend(title='Sex')
plt.show()

# Visualization 3: Box plot of Total Bill Amount by Day


plt.figure(figsize=(10, 6))
sns.boxplot(x='day', y='total_bill', data=df)
plt.title('Box plot of Total Bill Amount by Day')
plt.xlabel('Day')
plt.ylabel('Total Bill Amount')
plt.show()

# Visualization 4: Count of Customers by Day and Time


plt.figure(figsize=(10, 6))
sns.countplot(x='day', hue='time', data=df)
plt.title('Count of Customers by Day and Time')
plt.xlabel('Day')
plt.ylabel('Count of Customers')
plt.legend(title='Time')
plt.show()

# Data Storytelling
print("\nInsights:")
print("1. The distribution of total bill amounts is right-skewed, with most bills falling between
$10 and $20.")
print("2. There is a positive relationship between total bill amount and tip amount, with some
variations based on gender.")
print("3. Total bill amounts tend to be higher on Saturdays compared to other days.")
print("4. The count of customers is higher during dinner time compared to lunchtime on all
days.")

# Conclusion
print("\nConclusion:")
print("Based on the analysis, we can infer that there is a strong relationship between the total
bill amount and tip amount, with variations based on factors such as day and time. Further
analysis can be conducted to explore these relationships in more detail.")
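To combine the individual charts into a single view for the story, a sketch reusing the same seaborn calls on a 2x2 grid could be:

# Sketch: combine the four views into one figure to support the narrative
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
sns.histplot(df['total_bill'], kde=True, ax=axes[0, 0])
axes[0, 0].set_title('Distribution of Total Bill Amount')
sns.scatterplot(x='total_bill', y='tip', data=df, hue='sex', ax=axes[0, 1])
axes[0, 1].set_title('Total Bill vs. Tip')
sns.boxplot(x='day', y='total_bill', data=df, ax=axes[1, 0])
axes[1, 0].set_title('Total Bill by Day')
sns.countplot(x='day', hue='time', data=df, ax=axes[1, 1])
axes[1, 1].set_title('Customers by Day and Time')
plt.tight_layout()
plt.show()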

Output-
Practical No. 01

Aim: Introduction to Excel


 Perform conditional formatting on a dataset using various criteria.
 Create a pivot table to analyze and summarize data.
 Use VLOOKUP function to retrieve information from a different worksheet or table.
 Perform what-if analysis using Goal Seek to determine input values for desired output.

 Perform conditional formatting on a dataset using various criteria.


We perform conditional formatting on the "Profit" column to highlight cells with a profit greater than 800 using
following steps:

Steps:
1. Select the "Profit" column (Column E).

2. Go to the "Home" tab on the ribbon.


3. Click on "Conditional Formatting" in the toolbar.
4. Choose "Highlight Cells Rules" and then "Greater Than."

5. Enter the threshold value as 800.


6. Customize the formatting options (e.g., choose a fill color).
7. Click "OK" to apply the rule.

 Create a pivot table to analyze and summarize data.


Following are the steps to create a pivot table to analyze total sales by category.

Steps:
1. Select the entire dataset including headers.
2. Go to the "Insert" tab on the ribbon.
3. Click on "PivotTable."

4. Choose where you want to place the PivotTable (e.g., new worksheet).
5. Drag "Category" to the Rows area.
6. Drag "Sales" to the Values area, choosing the sum function.

 Use VLOOKUP function to retrieve information from a different worksheet or table.


Use the VLOOKUP function to retrieve the category of "Product M" from a separate worksheet named "Product
Table" using following steps:
Steps:
1. Assuming your "Product Table" is in a different worksheet.
2. In a cell in your main dataset, enter the formula:
=VLOOKUP("M", 'Product Table'!A:B, 2, FALSE)
 Perform what-if analysis using Goal Seek to determine input values for desired output.
Use Goal Seek to find the required sales for "Product P" to achieve a profit of 1000 using the following
steps.

Steps:
1. Identify the cell containing the formula for "Profit" for "Product P" (let's assume it's in cell E17).
2. Go to the "Data" tab on the ribbon.
3. Click on "What-If Analysis" and select "Goal Seek."

4. Set "Set cell" to the profit cell (E17), "To value" to 1000, and "By changing cell" to the sales cell (C17).
5. Click "OK" to let Excel determine the required sales.
Practical No. 02

Aim: Data Frames and Basic Data Pre-processing


 Read data from CSV and JSON files into a data frame.
 Perform basic data pre-processing tasks such as handling missing values and
outliers.
 Manipulate and transform data using functions like filtering, sorting, and grouping.

Data pre-processing:
Data pre-processing is a crucial step in the data analysis pipeline, encompassing tasks such as
reading data from various file formats, handling missing values, and managing outliers. This
practical guide explores how to execute these tasks using the pandas library in Python.

Steps:
Step 1: Reading from CSV and JSON Files
1. Utilize pandas to read data from a CSV file ('DATA SET.csv') into a data frame.
2. Use pandas to read data from a JSON file ('ds.json') into a data frame.
3. Display the first few rows of each data frame to inspect the data.
Step 2: Handling Missing Values
1. Drop rows with missing values from the CSV data frame.
2. Fill missing values with a specific value (e.g., 0) in the JSON data frame.
Step 3: Handling Outliers
1. Identify outliers in the 'Sales' column of the CSV data frame.
2. Replace outliers with the median value.
Step 4: Manipulating and Transforming Data
1. Filter the CSV data frame to include only rows where 'Sales' is greater than 10.
2. Sort the CSV data frame based on the 'Sales' column in descending order.
3. Group the CSV data frame by the 'Category' column and calculate the mean for numeric columns ('Sales',
'Cost', 'Profit').
Step 5: Displaying Results
1. Display the cleaned CSV data frame after handling missing values.
2. Display the JSON data frame after filling missing values.
3. Display the filtered CSV data frame.
4. Display the sorted CSV data frame.
5. Display the grouped CSV data frame showing the mean values for numeric columns.

Code:
import pandas as pd
# Read data from CSV file into a data frame
csv_file_path = 'DATA SET.csv'
df_csv = pd.read_csv(csv_file_path)
# Read data from JSON file into a data frame
json_file_path = 'ds.json'
df_json = pd.read_json(json_file_path)
# Display the first few rows of each data frame to inspect the data
print("CSV Data:")
print(df_csv.head())
print("\nJSON Data:")
print(df_json.head())
# Handling missing values
# Drop rows with missing values
df_csv_cleaned = df_csv.dropna()
# Fill missing values with a specific value (e.g., 0)
df_json_filled = df_json.fillna(0)
# Handling outliers
# Assume 'Sales' is the column with outliers
# Replace outliers with the median
median_value = df_csv['Sales'].median()
upper_threshold = df_csv['Sales'].mean() + 2 * df_csv['Sales'].std()
lower_threshold = df_csv['Sales'].mean() - 2 * df_csv['Sales'].std()
df_csv['Sales'] = df_csv['Sales'].apply(lambda x: median_value if x > upper_threshold or x <
lower_threshold else x)
# Manipulate and transform data
# Filtering
filtered_data = df_csv[df_csv['Sales'] > 10]
# Sorting
sorted_data = df_csv.sort_values(by='Sales', ascending=False)
# Grouping and calculating mean for numeric columns
numeric_columns = ['Sales', 'Cost', 'Profit']
grouped_data = df_csv.groupby('Category')[numeric_columns].mean()
# Display the results
print("\nCleaned CSV Data:")
print(df_csv_cleaned.head())
print("\nFilled JSON Data:")
print(df_json_filled.head())
print("\nFiltered Data:")
print(filtered_data.head())
print("\nSorted Data:")
print(sorted_data.head())
print("\nGrouped Data:")
print(grouped_data.head())
Output:
Practical No. 03

Aim: Feature Scaling and Dummification


 Apply feature-scaling techniques like standardization and normalization to
numerical features.
 Perform feature dummification to convert categorical variables into
numerical representations.

Feature Scaling:
Feature scaling is a preprocessing technique used to standardize the range of independent
variables or features of the data. It is essential for certain machine learning algorithms that are
sensitive to the scale of input features, ensuring that all features contribute equally to the
learning process.

Feature Dummification:
Feature dummification or one-hot encoding is a technique used to convert categorical
variables into numerical representations. This is necessary because many machine learning
algorithms require numerical input, and representing categorical variables as binary vectors
helps maintain their information.

Steps:
1. Load and Explore Data: Load the dataset and explore its structure, identify numeric and
categorical features.
2. Feature Scaling: Apply standardization and normalization to numeric features.
3. Feature Dummification: Convert categorical variables into numerical representations
using one-hot encoding.
4. Combine Features: Combine scaled numeric features with one-hot encoded categorical
features.
5. Display Resulting Dataset: Display the final dataset after both feature scaling and
dummification.

Code:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Define the data
data = {
'Product': ['Apple_Juice', 'Banana_Smoothie', 'Orange_Jam', 'Grape_Jelly', 'Kiwi_Parfait',
'Mango_Chutney', 'Pineapple_Sorbet', 'Strawberry_Yogurt', 'Blueberry_Pie', 'Cherry_Salsa'],
'Category': ['Apple', 'Banana', 'Orange', 'Grape', 'Kiwi', 'Mango', 'Pineapple', 'Strawberry',
'Blueberry', 'Cherry'],
'Sales': [1200, 1700, 2200, 1400, 2000, 1000, 1500, 1800, 1300, 1600],
'Cost': [600, 850, 1100, 700, 1000, 500, 750, 900, 650, 800],
'Profit': [600, 850, 1100, 700, 1000, 500, 750, 900, 650, 800]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Display the original dataset
print("Original Dataset:")
print(df)
# Step 1: Feature Scaling (Standardization and Normalization)
numeric_columns = ['Sales', 'Cost', 'Profit']
scaler_standardization = StandardScaler()
scaler_normalization = MinMaxScaler()
df_scaled_standardized =
pd.DataFrame(scaler_standardization.fit_transform(df[numeric_columns]),
columns=numeric_columns)
df_scaled_normalized =
pd.DataFrame(scaler_normalization.fit_transform(df[numeric_columns]),
columns=numeric_columns)
# Combine the scaled numeric features with the categorical features
df_scaled = pd.concat([df_scaled_standardized, df.drop(numeric_columns, axis=1)], axis=1)
# Display the dataset after feature scaling
print("\nDataset after Feature Scaling:")
print(df_scaled)
# Step 2: Feature Dummification
# Identify categorical columns
categorical_columns = ['Product', 'Category']
# Create a column transformer for dummification
preprocessor = ColumnTransformer(
transformers=[
('categorical', OneHotEncoder(), categorical_columns)
],
remainder='passthrough'
)
# Apply the column transformer to the dataset
df_dummified = pd.DataFrame(preprocessor.fit_transform(df))
# Display the dataset after feature dummification
print("\nDataset after Feature Dummification:")
print(df_dummified)
Output:
Practical No. 04

Aim: Hypothesis Testing


 Formulate null and alternative hypotheses for a given problem.
 Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-
square test).
 Interpret the results and draw conclusions based on the test outcomes.

Hypothesis Testing:
Hypothesis testing is a statistical method used to make inferences about population parameters based on
sample data. It involves the formulation of a null hypothesis (H0) and an alternative hypothesis (H1), and the
collection of sample data to assess the evidence against the null hypothesis. The goal is to determine whether
there is enough evidence to reject the null hypothesis in favor of the alternative hypothesis.

1. Formulate Hypotheses:
 Null Hypothesis (H0​ ): The average caffeine content per serving is 80 mg (μ=80).
 Alternative Hypothesis (H1​ ): The average caffeine content per serving is different from 80 mg
(μ≠80).
2. Statistical Test:
 A t-test is appropriate since you are comparing a sample mean to a known population mean, and the
sample size is small.
3. Data Collection:
 Randomly select 30 cans of the energy drink and measure the caffeine content in each.
4. Conducting the Hypothesis Test:
a. Collect Data:
 Calculate the sample mean (�) and standard deviation (s) from the 30 samples.
b. Set Significance Level (α):
 Choose a significance level (α=0.05,0.01,0.10).
c. Calculate the Test Statistic (t-value):
 Use the formula t=s/n​ �−μ​ .
d. Determine Degrees of Freedom:
 For a one-sample t-test, degrees of freedom (df) is n−1.
e. Find Critical Values or P-value:
 Use a t-table or statistical software to find the critical t-values for a two-tailed test at the chosen
significance level.
f. Make a Decision:
 If the t-value falls outside the critical region, reject the null hypothesis. If it falls inside, fail to reject.
g. Interpretation:
 If you reject the null hypothesis, there is enough evidence to suggest that the average caffeine
content per serving is different from 80 mg. If you fail to reject the null hypothesis, there is not
enough evidence to suggest a difference in the average caffeine content.
5. Conclusion:
 Draw conclusions about the energy drink's caffeine content, considering both statistical and practical
significance. Consider decisions relevant to the context of the problem.

Code:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Generate two samples for demonstration purposes
np.random.seed(42)
sample1 = np.random.normal(loc=10, scale=2, size=30)
sample2 = np.random.normal(loc=12, scale=2, size=30)
# Perform a two-sample t-test
t_statistic, p_value = stats.ttest_ind(sample1, sample2)
# Set the significance level
alpha = 0.05
print("Results of Two-Sample t-test:")
print(f"t-statistic: {t_statistic}")
print(f"p-value: {p_value}")
print(f"Degrees of Freedom: {len(sample1) + len(sample2) - 2}")
# Plot the distributions
plt.figure(figsize=(10, 6))
plt.hist(sample1, alpha=0.5, label='Sample 1', color='blue')
plt.hist(sample2, alpha=0.5, label='Sample 2', color='orange')
plt.axvline(np.mean(sample1), color='blue', linestyle='dashed', linewidth=2)
plt.axvline(np.mean(sample2), color='orange', linestyle='dashed', linewidth=2)
plt.title('Distributions of Sample 1 and Sample 2')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.legend()
# Highlight the critical region if null hypothesis is rejected
if p_value < alpha:
critical_region = np.linspace(min(sample1.min(), sample2.min()), max(sample1.max(),
sample2.max()), 1000)
plt.fill_between(critical_region, 0, 5, color='red', alpha=0.3, label='Critical Region')
# Show the observed t-statistic
plt.text(11, 5, f'T-statistic: {t_statistic:.2f}', ha='center', va='center', color='black',
backgroundcolor='white')
# Show the plot
plt.show()
# Draw Conclusions
# Drawing Conclusions
if p_value < alpha:
if np.mean(sample1) > np.mean(sample2):
print("Conclusion: There is significant evidence to reject the null hypothesis.")
print("Interpretation: The mean caffeine content of Sample 1 is significantly higher than
that of Sample 2.")
# Additional context and practical implications can be added here.
else:
print("Conclusion: There is significant evidence to reject the null hypothesis.")
print("Interpretation: The mean caffeine content of Sample 2 is significantly higher than
that of Sample 1.")
# Additional context and practical implications can be added here.
else:
print("Conclusion: Fail to reject the null hypothesis.")
print("Interpretation: There is not enough evidence to claim a significant difference between
the means.")

Output:
Practical No. 05

Aim- ANOVA(Analysis of variance)


 Perform one-way ANOVA to compare means across multiple groups
 Conduct post-hoc tests to identify significant differences between groups
means.

Acquire and Prepare Data:


Obtain a dataset with a categorical independent variable (factor) and a continuous
dependent variable.
Ensure the data meets the assumptions of ANOVA, including normality, homogeneity of
variances, and independence of observations.
Clean and preprocess the data as needed, handling missing values and outliers.

Perform One-Way ANOVA:


Set up the hypothesis test:
Null Hypothesis (H0): The means of all groups are equal.
Alternative Hypothesis (H1): At least one group mean is different from the others.
Conduct the one-way ANOVA test using software such as R, Python (with libraries like
SciPy or statsmodels), or statistical packages like SPSS.
Calculate the F-statistic and corresponding p-value to determine the statistical significance
of the differences between group means.

from matplotlib import pyplot as plt


movies=["golmaal","annabelle","bhoot-uncle","bhoothnath","de dana dan"]
num_oscars=[5,10,3,6,8]
plt.bar(range(len(movies)),num_oscars)
plt.title("Horror Movies")
plt.ylabel("oscar award 2024")
plt.xticks(range(len(movies)),movies)
plt.show()

Output-
from matplotlib import pyplot as plt
years = [2020,2021,2022,2023,2024]
failurepercentrates = [60,70,50,10,0]
plt.plot(years,failurepercentrates,color = "green" ,marker ="o", linestyle="solid" )
plt.title("corona times success rates")
plt.ylabel("percentages rates")
plt.show()

Output-

from matplotlib import pyplot as plt


from collections import Counter
totalnumber =[83,95,91,67,70,100]
histogram=Counter(min(score // 10*10,90) for score in totalnumber)
plt.bar([x+5 for x in histogram.keys()],
histogram.values(),
10,
edgecolor=(0,0,0))
plt.axis([-5,105,0,5])
plt.xticks([10*i for i in range(11)])
plt.xlabel("total_score")
plt.ylabel("N no of student")
plt.title("disttibution of exam 1 marks")
plt.show()

Output-
Practical No. 06

Aim:- Regression and Its Types


 Implement simple linear regression using a dataset.
 Explore and interpret the regression model coefficients and goodness-of-fit
measures.
 Extend the analysis to multiple linear regression and assess the impact of
additional predictors.

Acquire Dataset:
Obtain a dataset suitable for regression analysis. The dataset should contain variables that
you believe may have a linear relationship or can be used to predict another variable of
interest.

Explore the Dataset:


Load the dataset into your preferred data analysis environment (e.g., Python with libraries
like Pandas and NumPy, or R).
Visualize the data using scatter plots, histograms, and other relevant plots to identify
potential relationships between variables.

Implement Simple Linear Regression:


Choose a predictor variable (independent variable) and a target variable (dependent variable)
based on your analysis.
Implement simple linear regression using the selected variables. This involves fitting a linear
model to the data and estimating coefficients (slope and intercept).

Visualization and Interpretation:


Visualize the regression line overlaid on the scatter plot to visually assess how well the
model fits the data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Load the dataset


df = pd.read_csv('diabetes.csv')

# Simple Linear Regression


X_simple = df[['Age']]
y_simple = df['Pregnancies']
X_train_simple, X_test_simple, y_train_simple, y_test_simple = train_test_split(X_simple,
y_simple, test_size=0.2, random_state=0)

regressor_simple = LinearRegression()
regressor_simple.fit(X_train_simple, y_train_simple)

# Predictions on the test set


y_pred_simple = regressor_simple.predict(X_test_simple)

# Model evaluation
print('Simple Linear Regression:')
print('Intercept:', regressor_simple.intercept_)
print('Coefficient:', regressor_simple.coef_)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test_simple, y_pred_simple))
print('Mean Squared Error:', metrics.mean_squared_error(y_test_simple, y_pred_simple))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test_simple,
y_pred_simple)))
print('R-squared:', metrics.r2_score(y_test_simple, y_pred_simple))

# Visualization for Simple Linear Regression


plt.scatter(X_simple, y_simple, color='gray')
plt.plot(X_simple, regressor_simple.predict(X_simple), color='red', linewidth=2)
plt.title('Simple Linear Regression')
plt.xlabel('Age')
plt.ylabel('Pregnancies')
plt.show()

# Multiple Linear Regression


X_multi = df[['Glucose', 'BloodPressure', 'Insulin']]
y_multi = df['Outcome']
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(X_multi, y_multi,
test_size=0.2, random_state=0)

regressor_multi = LinearRegression()
regressor_multi.fit(X_train_multi, y_train_multi)

# Predictions on the test set


y_pred_multi = regressor_multi.predict(X_test_multi)

Output-
Practical No. 07

Aim- Logistic Regression and Decision Tree


 Build a logistic regression model to predict a binary outcome.
 Evaluate the model's performance using classification metrics (e.g.,
accuracy,precision, recall).
 Construct a decision tree model and interpret the decision rules for
classification.
Acquire Dataset:
Obtain a dataset suitable for binary classification tasks. The dataset should contain predictor
variables and a binary outcome variable.

Explore the Dataset:


Load the dataset into your preferred data analysis environment.
Perform exploratory data analysis (EDA) to understand the distribution of variables, identify
any missing values, and assess potential relationships between variables.

Preprocess the Data:


Handle missing values and perform any necessary data cleaning steps.
Encode categorical variables if required.
Split the dataset into training and testing sets for model evaluation.

Build Logistic Regression Model:


Choose predictor variables (features) based on the analysis.
Implement logistic regression model using the chosen features to predict the binary outcome.
Train the model using the training dataset.
Assess the model's performance using classification metrics such as accuracy, precision, recall,
F1-score, and ROC-AUC.

Evaluate Logistic Regression Model:


Evaluate the model's performance on the testing dataset using the chosen classification metrics.
Interpret the results to understand how well the logistic regression model predicts the binary
outcome.

Construct Decision Tree Model:


Choose predictor variables based on the analysis and understanding of the dataset.
Implement a decision tree model to predict the binary outcome.
Train the decision tree model using the training dataset.
Visualize the decision tree to interpret the decision rules for classification.
Evaluate Decision Tree Model:
Evaluate the decision tree model's performance on the testing dataset using classification
metrics similar to logistic regression.
Interpret the decision rules generated by the decision tree model to understand how the model
makes predictions.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Load the dataset


df = pd.read_csv('diabetesUp.csv')

# Split the dataset into features (X) and target variable (y)
X = df.drop(columns=['BloodPressure'])
y = df['Age']

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the data


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression
log_reg_model = LogisticRegression(max_iter=1000) # Increase max_iter to avoid
convergence warning
log_reg_model.fit(X_train_scaled, y_train)
y_pred_log_reg = log_reg_model.predict(X_test_scaled)

# Decision Tree
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train_scaled, y_train)
y_pred_dt = dt_model.predict(X_test_scaled)

# Evaluation metrics for Logistic Regression


print("Logistic Regression:")
print("Accuracy:", accuracy_score(y_test, y_pred_log_reg))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_log_reg))
print("Classification Report:")
print(classification_report(y_test, y_pred_log_reg, zero_division=1)) # Set zero_division=1 to
handle zero division warning

# Evaluation metrics for Decision Tree


print("\nDecision Tree:")
print("Accuracy:", accuracy_score(y_test, y_pred_dt))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_dt))
print("Classification Report:")
print(classification_report(y_test, y_pred_dt, zero_division=1))

# Plot confusion matrices


plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.heatmap(confusion_matrix(y_test, y_pred_log_reg), annot=True, cmap='Blues', fmt='g')
plt.title('Confusion Matrix - Logistic Regression')
plt.xlabel('Predicted')
plt.ylabel('True')

plt.subplot(1, 2, 2)
sns.heatmap(confusion_matrix(y_test, y_pred_dt), annot=True, cmap='Blues', fmt='g')
plt.title('Confusion Matrix - Decision Tree')
plt.xlabel('Predicted')
plt.ylabel('True')

plt.tight_layout()
plt.show()
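The steps above also call for visualizing the decision tree and reporting ROC-AUC, neither of which the script produces. A minimal sketch of both, assuming the objects defined above (the fitted dt_model and log_reg_model, the feature DataFrame X, the scaled test set, and a binary 0/1 target):

from sklearn.tree import plot_tree
from sklearn.metrics import roc_auc_score

# Visualize the decision rules of the fitted tree (depth limited for readability)
plt.figure(figsize=(14, 7))
plot_tree(dt_model, feature_names=list(X.columns), class_names=[str(c) for c in dt_model.classes_],
          filled=True, max_depth=3, fontsize=8)
plt.title('Decision Tree - Top Levels')
plt.show()

# ROC-AUC for logistic regression, based on predicted probabilities of the positive class
y_prob_log_reg = log_reg_model.predict_proba(X_test_scaled)[:, 1]
print("ROC-AUC (Logistic Regression):", roc_auc_score(y_test, y_prob_log_reg))

Limiting max_depth in the plot keeps the displayed rules readable even when the full tree is deep.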
Output-
Practical No. 08

Aim- K-Means Clustering


 Apply the K-Means algorithm to group similar data points into clusters.
 Determine the optimal number of clusters using the elbow method or silhouette
analysis.
 Visualize the clustering results and analyze the cluster characteristics.

Apply K-Means Algorithm:


Choose the number of clusters (K) that you want to create.
Apply the K-Means algorithm to the preprocessed data to cluster similar
data points into K clusters.
Iterate the algorithm until convergence, where the cluster centroids no
longer change significantly.

Determine Optimal Number of Clusters:


Use the elbow method or silhouette analysis to determine the optimal
number of clusters.
Elbow Method: Plot the within-cluster sum of squares (WCSS)
against the number of clusters. Choose the number of clusters where
the decrease in WCSS starts to slow down (elbow point).
Silhouette Analysis: Compute the silhouette scores for different
numbers of clusters. Choose the number of clusters with the highest
average silhouette score, indicating well-separated clusters.

Visualize Clustering Results:


Visualize the clustering results to understand the structure of the clusters.
Plot the clusters using scatter plots for two or three-dimensional data.
Use dimensionality reduction techniques such as PCA or t-SNE to
visualize high-dimensional data in two or three dimensions.
Analyze Cluster Characteristics:
Analyze the characteristics of each cluster to understand the patterns and
differences between clusters.
Compute cluster centroids to determine the center of each cluster in
feature space.
Explore the distribution of data points within each cluster to identify
commonalities and differences.

Interpret Results:
Interpret the clustering results based on the characteristics of each cluster.
Analyze any meaningful patterns or insights discovered through clustering.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Generate sample data


X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# Determine the optimal number of clusters using silhouette analysis


silhouette_scores = []
for n_clusters in range(2, 11):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)  # explicitly set n_init
    cluster_labels = kmeans.fit_predict(X)
    silhouette_avg = silhouette_score(X, cluster_labels)
    silhouette_scores.append(silhouette_avg)

# Plot silhouette scores to determine the optimal number of clusters


plt.plot(range(2, 11), silhouette_scores, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis')
plt.show()
# Choose the optimal number of clusters based on the highest silhouette score
n_clusters = silhouette_scores.index(max(silhouette_scores)) + 2  # +2 because the candidate K values start at 2

# Apply K-Means clustering with the optimal number of clusters


kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)  # explicitly set n_init
kmeans.fit(X)
cluster_labels = kmeans.labels_

# Visualize the clustering results


plt.scatter(X[:, 0], X[:, 1], c=cluster_labels, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='*', s=300, alpha=0.5)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Means Clustering')
plt.show()

# Analyze the characteristics of each cluster


cluster_df = pd.DataFrame(X, columns=['Feature 1', 'Feature 2'])
cluster_df['Cluster'] = cluster_labels
cluster_summary = cluster_df.groupby('Cluster').mean()
print(cluster_summary)
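The script above picks K with silhouette analysis only. A minimal sketch of the elbow method described in the steps, reusing X, KMeans, and plt from above (the candidate range 1-10 is an assumption for illustration):

# Elbow method: plot the within-cluster sum of squares (WCSS, exposed as inertia_) against K
wcss = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    wcss.append(km.inertia_)

plt.plot(k_values, wcss, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('WCSS (Inertia)')
plt.title('Elbow Method')
plt.show()

The elbow point, where the curve flattens, suggests a reasonable K and should broadly agree with the silhouette result.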
Output-
Practical No. 09

Aim- Principal Component Analysis (PCA)


 Perform PCA on a dataset to reduce dimensionality.
 Evaluate the explained variance and select the appropriate number of principal
components.
 Visualize the data in the reduced-dimensional space.

Acquire and Preprocess Data:


Obtain a dataset suitable for PCA. Ensure that the dataset contains numerical
features.
Preprocess the data by handling missing values, scaling the features if necessary,
and encoding categorical variables.

Standardize the Data:


Standardize the features by subtracting the mean and dividing by the standard
deviation. This step is essential for PCA as it ensures that all features have the same
scale.

Compute Covariance Matrix:


Compute the covariance matrix of the standardized data. The covariance matrix
represents the relationships between different features in the dataset.
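As a hedged, self-contained sketch of the two steps above, standardization and the covariance matrix can be computed directly with scikit-learn and NumPy on the same Iris data the script below uses:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_raw = load_iris().data

# Standardize: subtract each feature's mean and divide by its standard deviation
X_std = StandardScaler().fit_transform(X_raw)

# Covariance matrix of the standardized features (4 x 4 for Iris)
cov_matrix = np.cov(X_std, rowvar=False)
print("Covariance matrix:\n", cov_matrix)

# Eigen decomposition of the covariance matrix gives the principal components
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
print("Eigenvalues (largest first):", eigenvalues[::-1])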

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = iris.target

# Perform PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Calculate explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
print("Explained Variance Ratio:", explained_variance_ratio)

# Visualize the data in the reduced-dimensional space


plt.figure(figsize=(8, 6))
for i in range(len(iris.target_names)):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1], label=iris.target_names[i])

plt.title('PCA of Iris Dataset')


plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
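The aim also asks for selecting an appropriate number of components. A minimal sketch reusing X, PCA, np, and plt from the script above, with a 95% variance threshold chosen purely for illustration:

# Fit PCA with all components and plot the cumulative explained variance
pca_full = PCA().fit(X)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

plt.plot(range(1, len(cum_var) + 1), cum_var, marker='o')
plt.axhline(y=0.95, color='red', linestyle='--', label='95% threshold')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Choosing the Number of Components')
plt.legend()
plt.show()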
Output-
Practical No. 10

Aim- Data Visualization and Storytelling


 Create meaningful visualizations using data visualization tools
 Combine multiple visualizations to tell a compelling data story.
 Present the findings and insights in a clear and concise manner.

Select Visualization Tools:


Choose appropriate data visualization tools based on your familiarity, dataset
complexity, and desired visualizations.
Common tools include Python libraries like Matplotlib, Seaborn, Plotly, and
ggplot2 in R. Alternatively, you can use BI tools like Tableau, Power BI, or
online platforms like Google Data Studio.

Explore and Clean the Data:


Perform exploratory data analysis (EDA) to understand the distribution,
relationships, and patterns within the dataset.
Clean the data by handling missing values, outliers, and inconsistencies.

Identify Key Insights:


Identify key insights or findings from the dataset that you want to communicate
through visualizations.
Prioritize insights based on relevance and significance to the audience.

Create Meaningful Visualizations:


Design and create visualizations that effectively communicate the key insights
identified.
Choose appropriate chart types (e.g., bar charts, line charts, scatter plots,
histograms) based on the nature of the data and the insights you want to convey.
Use color, size, and other visual cues effectively to enhance understanding and
highlight important information.

Combine Visualizations into a Story:


Organize your visualizations into a cohesive narrative or story.
Create a storyboard or outline to structure the flow of the story, including an
introduction, main points, and conclusion.
Use a combination of text, annotations, and visual transitions to guide the
audience through the story.
Present the Findings:
Prepare for the presentation by practicing your delivery and ensuring you can
effectively communicate the insights.
Use clear and concise language to explain the visualizations and the insights they
convey.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load sample dataset (you can replace this with your own dataset)
df = sns.load_dataset('tips')

# Explore the dataset


print(df.head())

# Data Visualization
# Visualization 1: Distribution of Total Bill Amount
plt.figure(figsize=(10, 6))
sns.histplot(df['total_bill'], kde=True)
plt.title('Distribution of Total Bill Amount')
plt.xlabel('Total Bill Amount')
plt.ylabel('Frequency')
plt.show()

# Visualization 2: Relationship between Total Bill and Tip Amount


plt.figure(figsize=(10, 6))
sns.scatterplot(x='total_bill', y='tip', data=df, hue='sex')
plt.title('Relationship between Total Bill and Tip Amount')
plt.xlabel('Total Bill Amount')
plt.ylabel('Tip Amount')
plt.legend(title='Sex')
plt.show()

# Visualization 3: Box plot of Total Bill Amount by Day


plt.figure(figsize=(10, 6))
sns.boxplot(x='day', y='total_bill', data=df)
plt.title('Box plot of Total Bill Amount by Day')
plt.xlabel('Day')
plt.ylabel('Total Bill Amount')
plt.show()

# Visualization 4: Count of Customers by Day and Time


plt.figure(figsize=(10, 6))
sns.countplot(x='day', hue='time', data=df)
plt.title('Count of Customers by Day and Time')
plt.xlabel('Day')
plt.ylabel('Count of Customers')
plt.legend(title='Time')
plt.show()

# Data Storytelling
print("\nInsights:")
print("1. The distribution of total bill amounts is right-skewed, with most bills falling between
$10 and $20.")
print("2. There is a positive relationship between total bill amount and tip amount, with some
variations based on gender.")
print("3. Total bill amounts tend to be higher on Saturdays compared to other days.")
print("4. The count of customers is higher during dinner time compared to lunchtime on all
days.")

# Conclusion
print("\nConclusion:")
print("Based on the analysis, we can infer that there is a strong relationship between the total
bill amount and tip amount, with variations based on factors such as day and time. Further
analysis can be conducted to explore these relationships in more detail.")
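To combine the individual charts into one visual story, the four plots could also be arranged in a single 2x2 figure; a minimal sketch reusing the tips DataFrame df loaded above:

# Arrange the four views as one story board
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
sns.histplot(df['total_bill'], kde=True, ax=axes[0, 0])
axes[0, 0].set_title('Distribution of Total Bill')
sns.scatterplot(x='total_bill', y='tip', data=df, hue='sex', ax=axes[0, 1])
axes[0, 1].set_title('Total Bill vs Tip')
sns.boxplot(x='day', y='total_bill', data=df, ax=axes[1, 0])
axes[1, 0].set_title('Total Bill by Day')
sns.countplot(x='day', hue='time', data=df, ax=axes[1, 1])
axes[1, 1].set_title('Customers by Day and Time')
fig.suptitle('Tips Dataset - Data Story Overview')
plt.tight_layout()
plt.show()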

Output-