Journal
1. Program on Data Wrangling: Combining and merging datasets, reshaping and pivoting, handling missing data and generating summary statistics
Data wrangling is a critical process in data analysis, where data is transformed into a
structured, usable format. This program demonstrates key operations in data wrangling:
combining and merging datasets, reshaping, pivoting, handling missing data, and generating
summary statistics.
• Combining and Merging Datasets
Combining and merging datasets are essential for integrating multiple data sources.
Merging: Combines datasets based on a common key using an inner join, retaining only
matching rows. This ensures focused analysis on shared data points.
Concatenation: Stacks datasets vertically, appending new records to create a unified
dataset.
• Reshaping Data with Melt
Reshaping is used to change the layout of a dataset to suit specific analytical needs. The melt
operation converts wide-format data into long format, turning columns into rows. This format is
ideal for grouping, filtering, and visualizing data across variables.
• Pivoting Data
Pivoting reverses the melting process, converting long-format data back into wide format.
Using one column as the index and another as the columns, it arranges the values in a
matrix-like layout that is easier to interpret and is particularly useful for certain
statistical analyses.
• Handling Missing Data
Missing values are replaced with column means to ensure completeness. This maintains
data integrity for further analysis.
• Summary Statistics
The program concludes by calculating summary statistics (e.g., mean, standard deviation,
min, max) for the filled dataset. Summary statistics provide insights into the dataset's central
tendency, dispersion, and overall distribution.
Source Code :
import pandas as pd
import numpy as np
sales_data_2 = pd.DataFrame({
'OrderID': [3, 4, 5, 6],
'Product': ['Headphones', 'Laptop', 'Smartwatch', 'Tablet'],
'Sales': [500, 300, 200, 900]
})
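# 1. Combining and Merging Datasets
# (sketch — the companion DataFrame and the merge/concatenation steps are not
# reproduced in this listing; the values below are illustrative)
sales_data_1 = pd.DataFrame({
    'OrderID': [1, 2, 3, 4],
    'Product': ['Laptop', 'Phone', 'Headphones', 'Laptop'],
    'Sales': [1000, 800, 500, 300]
})
print("Sales Data 1:\n", sales_data_1)
print("Sales Data 2:\n", sales_data_2)

# Merge on the common key with an inner join, keeping only matching OrderIDs
merged_data = pd.merge(sales_data_1, sales_data_2, on='OrderID', how='inner')
print("\nMerged Data (Inner Join):\n", merged_data)

# Concatenate the two DataFrames vertically
concatenated_data = pd.concat([sales_data_1, sales_data_2], ignore_index=True)
print("\nConcatenated Data:\n", concatenated_data)

# 2. Reshaping Data with Melt
# Small wide-format table used for the reshaping steps (illustrative values, with one
# missing entry so that step 4 can demonstrate mean imputation; consistent with the
# filled output shown below)
reshaping_data = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar'],
    'Product_A': [100, np.nan, 130],
    'Product_B': [90, 80, 85]
})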
# Melt the DataFrame to reshape it from wide to long format
melted_data = pd.melt(reshaping_data, id_vars=['Month'], var_name='Product', value_name='Sales')
print("\nMelted Data (Long Format):\n", melted_data)
# 3. Pivoting Data
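# (sketch — the pivot and missing-data code is not reproduced in this listing)
# Pivot the long-format data back to wide format: Month as index, Product as columns
pivoted_data = melted_data.pivot(index='Month', columns='Product', values='Sales')
print("\nPivoted Data (Wide Format):\n", pivoted_data)

# 4. Handling Missing Data
# Replace missing values with the mean of each column
filled_data = pivoted_data.fillna(pivoted_data.mean())
print("\nFilled Data (Missing Values Handled):\n", filled_data)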
# 5. Summary Statistics
print("\nSummary Statistics of Filled Data:\n", filled_data.describe())
Output :
Sales Data 1: Sales Data 2:
Order ID Product Sales Order ID Product Sales
Filled Data (Missing Values Handled):
Product Product_A Product_B
Month
Feb 115.0 80.0
Jan 100.0 90.0
Mar 130.0 85.0
2. Program on Data Transformation: String Manipulation, Regular Expressions
Data transformation is an essential step in preprocessing text data for analysis. The program
demonstrates two critical techniques: string manipulation and regular expressions (regex).
1. String Manipulation:
String manipulation involves performing operations on text data to clean or reformat it for
easier analysis. Common operations demonstrated include:
• Trimming Spaces: Removes leading and trailing spaces for cleaner text.
• Changing Case: Converts text to uppercase or lowercase to maintain consistency.
• Counting Substrings: Counts occurrences of specific characters or words.
• Replacing Text: Replaces specific words or patterns with desired text.
• Finding and Splitting: Locates words in a string and splits the text into individual words.
• Checking Prefix/Suffix: Verifies if a string starts or ends with specific content.
These operations are fundamental in cleaning and reformatting raw textual data.
Source Code :
import re
# String Manipulation:
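# (sketch — only step 10 of the original listing is reproduced below; the earlier
# string operations are illustrated here on the sample text shown in the output)
text = " Hello, World! Welcome to Python programming Language. "
print(f"Original Text: '{text}'")

# 1. Strip leading and trailing spaces
clean_text = text.strip()
print(f"Text after stripping spaces: '{clean_text}'")

# 2. Change case
print(f"Uppercase: '{clean_text.upper()}'")
print(f"Lowercase: '{clean_text.lower()}'")

# 3. Count occurrences of a substring
print(f"Occurrences of 'o': {clean_text.count('o')}")

# 4. Replace specific text
print(f"Text after replacing 'Python' with 'HTML': '{clean_text.replace('Python', 'HTML')}'")

# 5. Find a word and split the text into words
print(f"Index of 'Welcome': {clean_text.find('Welcome')}")
words = clean_text.split()
print(f"List of words in the text: {words}")

# 6. Join the words back into a sentence
print(f"Text after joining words: '{' '.join(words)}'")

# 7. Check if the text starts with a specific word
print(f"Does the text start with 'Hello'? {clean_text.startswith('Hello')}")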
# 10. Check if the text ends with a specific word (e.g., "programming.")
ends_with_programming = clean_text.endswith("programming.")
print(f"\nDoes the text end with 'programming.'? {ends_with_programming}")
Output :
Original Text: ' Hello, World! Welcome to Python programming Language. '
Text after stripping spaces: 'Hello, World! Welcome to Python programming Language.'
Text after replacing 'Python' with 'HTML': 'Hello, World! Welcome to Data HTML programming Language.'
List of words in the text: ['Hello,', 'World!', 'Welcome', 'to', 'Python', 'programming', 'Language.']
Text after joining words: 'Hello, World! Welcome to Python programming Language.'
#Regular Expressions :
# Sample text
text = """
John's email is [[email protected]]. He said, "Python is awesome!!" It's a great
language.
Another email: [[email protected]].
"""
Output :
Text after removing special characters:
3. Program on Time series: GroupBy Mechanics to display in data vector,
multivariate time series and forecasting formats
Time series analysis involves working with data collected over time, helping in
understanding patterns and making forecasts. The program demonstrates three key aspects:
GroupBy mechanics, data formats, and forecasting.
• GroupBy Mechanics:
Time series data can be grouped to summarize and analyze trends over specific intervals
(e.g., months). The program groups daily data by month using the resample method and calculates
the monthly mean. This helps identify patterns or trends at a higher granularity, such as seasonal
or monthly variations.
• Data Formats:
Vector Format: Displays a single variable (e.g., Value_A) as a sequence of values over time, useful for analyzing one aspect of the dataset.
Multivariate Time Series: Includes multiple variables (e.g., Value_A and Value_B), allowing for the analysis of relationships between variables over time.
• Forecasting:
Uses the Holt-Winters Exponential Smoothing model to predict future values based on
historical data. The program splits the data into training and testing sets, fits the model to
the training data, and forecasts over the test period. Results are visualized to compare actual
values with predictions, aiding decision-making.
Applications: Time series analysis is widely used in fields like finance, economics, and weather
forecasting.
Source Code :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# GroupBy Mechanics
def groupby_mechanics(data):
    print("\n--- GroupBy Mechanics ---")
    # Group data by month and calculate mean
    grouped = data.resample('M').mean()
    print(grouped)
    return grouped
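# Data Formats
# (sketch — the original definition is not reproduced in this listing; the output
# below shows the vector and multivariate views of the data)
def data_formats(data):
    print("\nVector Format:")
    print(data["Value_A"].head())   # a single variable over time
    print("\nMultivariate Time Series:")
    print(data.head())              # several variables observed together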
# Forecasting Example
def time_series_forecasting(data):
    print("\n--- Forecasting ---")
    # Select a single column for forecasting
    ts = data["Value_A"]
    # Train-Test Split
    train = ts[:int(0.8 * len(ts))]
    test = ts[int(0.8 * len(ts)):]
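    # Fit Holt-Winters Exponential Smoothing on the training data and forecast the
    # test period (sketch — the model settings are assumptions; the original listing
    # does not show them)
    model = ExponentialSmoothing(train, trend="add", seasonal=None)
    fit = model.fit()
    forecast = fit.forecast(len(test))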
    # Plot results
    plt.figure(figsize=(12, 6))
    plt.plot(train, label="Train")
    plt.plot(test, label="Test")
    plt.plot(forecast, label="Forecast")
    plt.legend()
    plt.title("Time Series Forecasting")
    plt.show()
# Main function
if __name__ == "__main__":
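    # Create a sample daily time series (sketch — the original data-generation code is
    # not shown; the seed, start date and value ranges are assumptions)
    np.random.seed(42)
    dates = pd.date_range(start="2023-04-12", periods=100, freq="D")
    data = pd.DataFrame({
        "Value_A": np.random.normal(loc=100, scale=10, size=100),
        "Value_B": np.random.normal(loc=200, scale=10, size=100),
    }, index=dates)
    data.index.name = "Date"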
    print("--- Time Series Data ---")
    print(data.head())
    # Grouping Mechanics
    monthly_data = groupby_mechanics(data)
    # Data Formats
    data_formats(data)
    # Forecasting
    time_series_forecasting(data)
Output :
--- Time Series Data ---
Value_A Value_B
Date
2023-04-12 104.967142 200.251848
2023-04-13 98.617357 201.953522
2023-04-14 106.476885 184.539804
2023-04-15 115.230299 200.490203
2023-04-16 97.658466 209.959966
Vector Format:
Date
2023-04-12 104.967142
2023-04-13 98.617357
2023-04-14 106.476885
2023-04-15 115.230299
2023-04-16 97.658466
Name: Value_A, dtype: float64
Multivariate Time Series:
Value_A Value_B
Date
2023-04-12 104.967142 200.251848
2023-04-13 98.617357 201.953522
2023-04-14 106.476885 184.539804
2023-04-15 115.230299 200.490203
2023-04-16 97.658466 209.959966
--- Forecasting ---
C:\Users\Lenovo\AppData\Local\Programs\Python\Python312\Lib\site-
packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning: No frequency information
was provided, so inferred frequency D will be used.
self._init_dates(dates, freq)
4. Program to measure central tendency and measures of dispersion: Mean,
Median, Mode, Standard Deviation, Variance, Mean deviation and Quartile
deviation for a frequency distribution/data.
The measures are essential for understanding the distribution and variability of data in a
systematic way.
1. Central Tendency: These measures help identify the "center" or typical value of a dataset:
• Mean: The average of the data values, showing the overall central value.
• Median: The middle value when the data is arranged in order, representing the midpoint of
the dataset.
• Mode: The most frequently occurring value in the data, showing the most common
observation.
2. Dispersion: These measures describe how spread out the data is:
• Variance: Shows how much the data values differ from the mean on average.
• Standard Deviation: The square root of variance, indicating the average distance of data from
the mean.
• Mean Deviation: The average of the absolute differences between data values and the mean.
• Quartile Deviation: Focuses on the variability of the middle 50% of the data.
Program Working:
• Input: The program takes two inputs: data values and their frequencies.
• Processing: It calculates the measures of central tendency (mean, median, mode) and
dispersion (variance, standard deviation, etc.) using Python libraries like NumPy and pandas.
• Output: The program provides all the computed measures, giving insights into the dataset's
characteristics.
Advantages of Computational Statistics:
Source Code :
import numpy as np
import pandas as pd
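# Sketch of the statistics routine — the function header and the weighted calculations
# are not reproduced in the original listing; the formulas below follow the description
# above (function and variable names are assumptions)
def compute_statistics(values, frequencies):
    df = pd.DataFrame({'Value': values, 'Frequency': frequencies})
    total = df['Frequency'].sum()
    # Expand the frequency distribution into raw observations for the quartiles
    data = np.repeat(df['Value'].to_numpy(), df['Frequency'].to_numpy())
    # Weighted mean, variance, standard deviation and mean deviation
    mean = (df['Value'] * df['Frequency']).sum() / total
    variance = (((df['Value'] - mean) ** 2) * df['Frequency']).sum() / total
    std_deviation = np.sqrt(variance)
    mean_deviation = ((df['Value'] - mean).abs() * df['Frequency']).sum() / total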
    cumulative_frequency = df['Frequency'].cumsum()
    median_index = cumulative_frequency.searchsorted(total / 2)
    median = df['Value'][median_index]
    mode = df['Value'][df['Frequency'].idxmax()]
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    quartile_deviation = (q3 - q1) / 2
    return {
        'Mean': mean,
        'Median': median,
        'Mode': mode,
        'Variance': variance,
        'Standard Deviation': std_deviation,
        'Mean Deviation': mean_deviation,
        'Quartile Deviation': quartile_deviation
    }
data_input = input("Enter the data values separated by commas (e.g., 10, 20, 30): ")
frequencies_input = input("Enter the corresponding frequencies separated by commas (e.g., 1, 2, 3): ")
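# Parse the inputs, compute and print the measures (sketch — the original driver code
# is not reproduced in this listing)
values = [float(x) for x in data_input.split(',')]
frequencies = [int(x) for x in frequencies_input.split(',')]
results = compute_statistics(values, frequencies)
for measure, value in results.items():
    print(f"{measure}: {value:.2f}")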
Output :
Enter the data values separated by commas (e.g., 10, 20, 30): 10,11,12,13,14
Enter the corresponding frequencies separated by commas (e.g., 1, 2, 3): 1,2,1,3,2
Mean: 12.33
Median: 13.00
Mode: 13.00
Variance: 1.78
Standard Deviation: 1.33
Mean Deviation: 1.19
Quartile Deviation: 1.00
5. Program to perform cross-validation for a given dataset to measure Root Mean
Squared Error (RMSE), Mean Absolute Error (MAE) and R² Error using Validation Set,
Leave-One-Out Cross-Validation (LOOCV) and K-fold Cross-Validation approaches.
Source Code :
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
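# Helper to display the three error metrics (sketch — the original definition is not
# reproduced in this listing)
def display_metrics(y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"  RMSE: {rmse:.4f}")
    print(f"  MAE : {mae:.4f}")
    print(f"  R2  : {r2:.4f}")

# Validation Set Approach (sketch of the part of the function not reproduced here)
def validation_set_approach(X, y):
    print("Validation Set Approach:")
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)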
    # Display metrics
    display_metrics(y_val, y_pred)
# Leave-One-Out Cross-Validation (LOOCV) Approach
def loocv_approach(X, y):
    print("Leave-One-Out Cross-Validation (LOOCV):")
    loo = LeaveOneOut()
    y_true, y_pred = [], []
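    # Fit the model on every leave-one-out split and collect the held-out predictions
    # (sketch — the loop body is not reproduced in the original listing)
    model = LinearRegression()
    for train_idx, test_idx in loo.split(X):
        model.fit(X[train_idx], y[train_idx])
        y_true.append(y[test_idx][0])
        y_pred.append(model.predict(X[test_idx])[0])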
    # Display metrics
    display_metrics(y_true, y_pred)
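# K-fold Cross-Validation Approach (sketch — only the closing lines of this function
# appear in the original listing)
def kfold_approach(X, y, k=5):
    print(f"{k}-Fold Cross-Validation:")
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    y_true, y_pred = [], []
    model = LinearRegression()
    for train_idx, test_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        y_true.extend(y[test_idx])
        y_pred.extend(model.predict(X[test_idx]))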
    # Display metrics
    display_metrics(y_true, y_pred)
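# Driver code (sketch — not reproduced in the original listing; a subset of the
# California housing data is used here to keep LOOCV fast)
X, y = fetch_california_housing(return_X_y=True)
X, y = X[:1000], y[:1000]
validation_set_approach(X, y)
loocv_approach(X, y)
kfold_approach(X, y)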
Output :
6. Program to display Normal, Binomial, Poisson and Bernoulli distributions for a given
frequency distribution and analyze the results.
Probability distributions describe how the values of a random variable are distributed. They
help in understanding the behavior of data and are essential in statistics and data analysis. The
program visualizes four key probability distributions for a given frequency distribution.
Distributions Covered
• Normal Distribution:
A continuous distribution forming a bell-shaped curve. It is symmetric about the mean, and
most data points cluster around the mean. Useful for modeling natural phenomena.
• Binomial Distribution:
A discrete distribution representing the number of successes in a fixed number of trials. It
depends on two parameters: the number of trials (n) and the probability of success (p). Common in
scenarios like flipping a coin or rolling a die.
• Poisson Distribution:
A discrete distribution that models the number of events in a fixed interval of time or space.
It is characterized by the average rate (λ) of occurrence. Useful for modeling rare events like system
failures or call arrivals.
• Bernoulli Distribution:
A discrete distribution representing a single trial with two outcomes: success or failure. It is
defined by the probability of success (p). Used in binary events like yes/no or true/false.
Purpose: The program accepts user input for data values and their frequencies, visualizes the
probability density function (PDF) or probability mass function (PMF) for each distribution, and
helps users compare how well each distribution fits the data.
Importance: Understanding Data: Helps identify patterns in data.
Modeling Real-World Scenarios: Simulates phenomena like natural variations or rare events.
Source Code :
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, binom, poisson, bernoulli
def get_user_data():
    data_input = input("Enter the data values separated by commas (e.g., 10, 20, 30): ")
    frequencies_input = input("Enter the corresponding frequencies separated by commas (e.g., 2, 3, 4): ")
# Normal distribution parameters: sample mean and standard deviation
mean = np.mean(data)
std_dev = np.std(data)
# Binomial distribution: n trials with success probability p estimated from the data
n = max(data)
p = np.mean(data) / n
x = np.arange(0, n + 1)
pmf = binom.pmf(x, n, p)
# Poisson distribution: rate lambda estimated as the sample mean
lam = np.mean(data)
x = np.arange(0, max(data) + 1)
pmf = poisson.pmf(x, lam)
# Bernoulli distribution over the two outcomes 0 and 1
# (success_prob is the probability of success computed in the full listing)
x = [0, 1]
pmf = bernoulli.pmf(x, success_prob)
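# Sketch of one of the plotting functions, showing the pattern the others follow
# (the full definitions are not reproduced in this listing; names are assumptions)
def plot_normal_distribution(data, freq):
    mean = np.mean(data)
    std_dev = np.std(data)
    x = np.linspace(min(data) - 3 * std_dev, max(data) + 3 * std_dev, 200)
    plt.plot(x, norm.pdf(x, mean, std_dev), label="Normal PDF")
    plt.title("Normal Distribution")
    plt.legend()
    plt.show()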
plot_poisson_distribution(data, freq)
Output :
Enter the data values separated by commas (e.g., 10, 20, 30): 8, 10, 12, 14, 16
Enter the corresponding frequencies separated by commas (e.g., 2, 3, 4): 1, 2, 1, 3, 1
7. Program to implement one sample, two sample and paired sample t-tests for a
sample data and analyse the results.
T-Tests are commonly used to assess whether there is a statistically significant difference
between groups or conditions. These tests help us make inferences about populations based on
sample data. Types of t-tests:
1. One-Sample T-Test:
This test compares the mean of a sample to a known value (often a population mean) to
determine if the sample mean is significantly different from this reference value. For example, in the
code, we compare the average exam scores of a group of students to a population mean of 85.
The null hypothesis assumes there is no difference, and the alternative hypothesis suggests a
difference in means.
2. Two-Sample T-Test:
This test is used to compare the means of two independent groups to determine if they
differ significantly. In the code, we compare the scores of two groups (Group A and Group B). The
null hypothesis suggests that there is no difference between the two groups, while the alternative
hypothesis indicates a significant difference.
3. Paired-Sample T-Test:
This test compares the means of two related groups, typically measuring the same subjects
before and after an intervention. In the code, we compare the scores of the same group of students
before and after a treatment. The null hypothesis assumes no difference between the two sets of
scores, while the alternative hypothesis suggests a significant change.
Results are Interpreted as:
• T-Statistic: This value tells us how much the sample mean differs from the hypothesized value
(or the mean of the second group in case of two-sample or paired tests) in terms of standard
error.
• P-Value: This value indicates the probability of observing the data if the null hypothesis were
true. If the p-value is smaller than the chosen significance level (usually 0.05), we reject the
null hypothesis and conclude there is a statistically significant difference.
Source Code :
import numpy as np
import pandas as pd
from scipy import stats
exam_scores = np.array([85, 87, 90, 78, 88, 95, 82, 79, 94, 91])
group_A = np.array([85, 89, 88, 90, 93, 85, 84, 79, 90, 87])
group_B = np.array([82, 86, 85, 87, 92, 80, 81, 78, 89, 85])
before_treatment = np.array([82, 84, 88, 78, 80, 85, 90, 79, 87, 83])
after_treatment = np.array([85, 87, 89, 81, 83, 88, 92, 82, 89, 86])
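# Sketch of the helper functions used below (their definitions are not reproduced in
# this listing); scipy.stats provides the three t-tests directly
def one_sample_ttest(sample, popmean):
    return stats.ttest_1samp(sample, popmean)

def two_sample_ttest(sample1, sample2):
    return stats.ttest_ind(sample1, sample2)

def paired_sample_ttest(before, after):
    return stats.ttest_rel(before, after)

def analyze_ttest_results(t_stat, p_value, alpha=0.05):
    print(f"T-statistic: {t_stat}")
    print(f"P-value: {p_value}")
    if p_value < alpha:
        print("Result: The null hypothesis is rejected (statistically significant difference).")
    else:
        print("Result: The null hypothesis cannot be rejected (no statistically significant difference).")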
print("One-Sample T-Test:")
t_stat, p_value = one_sample_ttest(exam_scores, 85)
analyze_ttest_results(t_stat, p_value)
print()
print("Two-Sample T-Test:")
t_stat, p_value = two_sample_ttest(group_A, group_B)
analyze_ttest_results(t_stat, p_value)
print()
print("Paired-Sample T-Test:")
t_stat, p_value = paired_sample_ttest(before_treatment, after_treatment)
analyze_ttest_results(t_stat, p_value)
Output :
One-Sample T-Test:
T-statistic: 1.0189950494649807
P-value: 0.3348142605778697
Result: The null hypothesis cannot be rejected (no statistically significant difference).
Two-Sample T-Test:
T-statistic: 1.3547090246981803
P-value: 0.19227122007981406
Result: The null hypothesis cannot be rejected (no statistically significant difference).
Paired-Sample T-Test:
T-statistic: -11.758942438532781
P-value: 9.151111215642479e-07
Result: The null hypothesis is rejected (statistically significant difference).
8. Program to implement One-way and Two-way ANOVA tests and analyze the
results
ANOVA (Analysis of Variance) is a statistical method used to test if there are significant
differences between the means of multiple groups.
1. One-Way ANOVA:
Used when comparing the means of more than two groups based on one factor. It checks if
the group means are significantly different. Null Hypothesis (H₀): All group means are equal.
Alternative Hypothesis (H₁): At least one group mean is different.
2. Two-Way ANOVA:
Used when there are two factors, and it tests the individual effects of each factor and their
interaction on the dependent variable. Null Hypothesis(H₀): Neither factor nor their interaction
significantly affects response. Alternative Hypothesis (H₁): At least one factor or their interaction
significantly affects the response.
Key Results:
• F-statistic: Indicates how much the group means differ.
• P-value: If less than 0.05, we reject the null hypothesis, suggesting a significant difference.
Source Code :
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
import statsmodels.api as sm
from statsmodels.formula.api import ols
if __name__ == "__main__":
    # Example dataset for One-way ANOVA
    data_one_way = pd.DataFrame({
        "Group": np.repeat(['A', 'B', 'C'], 10),
        "Score": np.concatenate([
            np.random.normal(loc=50, scale=5, size=10),
            np.random.normal(loc=55, scale=5, size=10),
            np.random.normal(loc=60, scale=5, size=10)
        ])
    })
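    # One-way ANOVA across the three groups (sketch — the analysis code is not
    # reproduced in the original listing)
    groups = [data_one_way[data_one_way["Group"] == g]["Score"] for g in ['A', 'B', 'C']]
    f_stat, p_value = f_oneway(*groups)
    print(f"One-way ANOVA: F = {f_stat:.3f}, p = {p_value:.4f}")

    # Two-way ANOVA: two factors and their interaction, fitted with an OLS model
    # (the example dataset below is an assumption)
    data_two_way = pd.DataFrame({
        "FactorA": np.repeat(['Low', 'High'], 20),
        "FactorB": np.tile(np.repeat(['X', 'Y'], 10), 2),
        "Response": np.random.normal(loc=50, scale=5, size=40)
    })
    model = ols("Response ~ C(FactorA) * C(FactorB)", data=data_two_way).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    print("\nTwo-way ANOVA:\n", anova_table)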
Output :
9. Program to compute correlation and rank correlation, fit a linear regression model for a given dataset, visualize and analyze the results.
Correlation:
• Pearson Correlation: Measures the linear relationship between two variables (X and Y). A
value close to 1 means a strong positive relationship, -1 means a strong negative
relationship, and 0 means no linear relationship. The program calculates this correlation
using the corr function in Pandas.
Rank Correlation (Spearman's Rank Correlation):
• This measures the strength of a monotonic (ordered) relationship between two variables,
using their ranks rather than actual values. It can detect non-linear relationships, and
values close to 1 or -1 indicate strong positive or negative relationships.
Linear Regression:
• Linear regression fits a straight line to the data, modeling the relationship between a
dependent variable (Y) and an independent variable (X). The program uses scikit-learn to
fit a regression line and calculates the Mean Squared Error (MSE) to evaluate the fit.
Visualizations:
• X-Y Scatter Plot: Displays the data points, with a red regression line showing the fitted
model.
• Heatmap: Visualizes the correlation matrix, showing the strength of relationships between
variables.
This program helps to understand relationships between variables using correlation, regression,
and visual tools.
Source Code :
# The imports and the sample data are not shown in this listing; data is assumed to be
# a pandas DataFrame with numeric columns 'X' and 'Y'
import pandas as pd
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Compute Correlation
pearson_corr = data.corr(method='pearson') # Pearson Correlation
spearman_corr, _ = spearmanr(data['X'], data['Y']) # Spearman Rank Correlation
# Linear Regression
X = data['X'].values.reshape(-1, 1) # Reshape for sklearn
Y = data['Y'].values
model = LinearRegression()
model.fit(X, Y)
Y_pred = model.predict(X)
regression_coeff = model.coef_[0] # Slope
regression_intercept = model.intercept_ # Intercept
mse = mean_squared_error(Y, Y_pred)
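# Print the results and draw the visualizations described above (sketch — the plotting
# code is not reproduced in the original listing)
import matplotlib.pyplot as plt
import seaborn as sns

print("Pearson Correlation Coefficient Matrix:\n", pearson_corr)
print(f"Spearman Rank Correlation: {spearman_corr:.4f}")
print(f"Regression line: Y = {regression_coeff:.4f} * X + {regression_intercept:.4f}")
print(f"Mean Squared Error: {mse:.4f}")

# X-Y scatter plot with the fitted regression line
plt.scatter(data['X'], data['Y'], label="Data")
plt.plot(data['X'], Y_pred, color="red", label="Regression line")
plt.legend()
plt.show()

# Heatmap of the correlation matrix
sns.heatmap(pearson_corr, annot=True, cmap="coolwarm")
plt.show()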
Output :
Pearson Correlation Coefficient Matrix:
X Y
X 1.000000 0.952966
Y 0.952966 1.000000
10. Program to implement PCA for Wisconsin dataset, visualize and analyze the
results.
This program demonstrates Principal Component Analysis (PCA) on the Wisconsin Breast
Cancer dataset to reduce the dimensionality of the data, visualize the results, and analyze the
explained variance of the components.
Principal Component Analysis (PCA):
PCA is a technique used to reduce the dimensionality of large datasets while preserving as
much information as possible. It transforms the original features into new, uncorrelated variables
called principal components. The goal is to project the data into fewer dimensions, typically 2 or
3, for easier visualization while retaining most of the data's variance.
Standardization:
Before applying PCA, the data is standardized using StandardScaler to ensure that each
feature has zero mean and unit variance. This is important because PCA is sensitive to the scale of
the data.
Applying PCA:
PCA is performed to reduce the data to 2 principal components for visualization. The
program then calculates the explained variance ratio, which tells us how much variance
(information) each principal component captures.
Visualization:
• PCA Scatter Plot: The program creates a scatter plot of the first two principal
components (PCA1 and PCA2) to visualize how the data points are distributed in the reduced
space. Points are colored based on the target variable (malignant or benign).
• Explained Variance: A bar plot shows how much variance each of the first two principal
components explains.
• Cumulative Variance: A line plot shows how much cumulative variance is explained as more
components are added.
Source Code :
# The imports and preprocessing are not reproduced in this listing; a minimal sketch:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Wisconsin Breast Cancer dataset and standardize the features
cancer = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)

# Apply PCA
pca = PCA(n_components=2) # Reduce to 2 dimensions for visualization
X_pca = pca.fit_transform(X_scaled)
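# Explained variance of the two retained components and cumulative variance of all
# components (sketch — the visualization code is not reproduced in this listing)
print("Explained Variance Ratio (first two components):", pca.explained_variance_ratio_)
pca_full = PCA().fit(X_scaled)
print("Cumulative Variance Explained by All Components:")
print(np.cumsum(pca_full.explained_variance_ratio_))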
Output :
Cumulative Variance Explained by All Components:
11. Program to implement the working of linear discriminant analysis using iris
dataset and visualize the results.
Linear Discriminant Analysis (LDA) is a technique used for dimensionality reduction and
classification. It aims to find the linear combinations of features that best separate the classes in
the dataset. Unlike Principal Component Analysis (PCA), which maximizes variance, LDA focuses
on maximizing class separability.
Key Steps in LDA:
• Data Standardization: Before applying LDA, the data is scaled so that each feature has zero
mean and unit variance. This ensures that all features contribute equally to the analysis.
• Compute Discriminants: LDA computes new axes (called discriminants) that maximize the
difference between classes.
• Dimensionality Reduction: LDA reduces the dataset to fewer dimensions while preserving
as much class separation as possible. In this case, we reduce it to 2 dimensions for easier
visualization.
Visualization: The transformed data is plotted in a 2D space, showing how well the classes (species
in the Iris dataset) are separated.
Application in the Iris Dataset:
• The Iris dataset has 4 features, and LDA reduces it to 2 components for visualization.
• LDA is useful in classification tasks, where the goal is to predict the class label of new data
points based on the transformed features.
Source Code :
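# Sketch of the steps leading up to the insights printed below (the full listing is
# not reproduced here): load the Iris data, standardize it and fit LDA with two
# components
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, iris.target)

# 2D scatter of the two discriminants, colored by species
plt.scatter(X_lda[:, 0], X_lda[:, 1], c=iris.target)
plt.xlabel("LDA1")
plt.ylabel("LDA2")
plt.title("LDA of the Iris dataset")
plt.show()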
# Print key insights
print("Linear Discriminant Analysis (LDA) Results")
print(" ")
print("Explained Variance Ratio by LDA Components:")
for i, ratio in enumerate(lda.explained_variance_ratio_, start=1):
    print(f" LDA{i}: {ratio:.4f}")
Output :
Linear Discriminant Analysis (LDA) Results
12. Program to Implement multiple linear regression using iris dataset, visualize
and analyze the results.
Multiple Linear Regression (MLR) is a technique used to predict a target variable based on
the relationship between multiple input variables. It helps in understanding how different features
affect the outcome.
Key Concepts:
• Prediction: MLR creates a model that predicts a target variable using multiple independent
variables.
• Training: The model learns from the training data by adjusting coefficients for each feature.
• Evaluation: The model’s accuracy is measured using metrics like Mean Squared Error (MSE)
and R-squared (R²).
Application to the Iris Dataset:
In this case, we predict the petal length based on other features like sepal length and petal width.
The dataset is split into a training set and a test set. The model is trained on the training set and
evaluated on the test set.
Steps:
1. Training: Fit the model using the training data.
2. Prediction: Make predictions on the test data.
3. Evaluation: Use MSE and R² to assess model performance.
4. Visualization: Compare the actual vs predicted values using a plot.
MLR is commonly used when there are multiple factors influencing the outcome and helps in
making predictions based on them.
Source Code :
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
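# Sketch of the modelling steps (not reproduced in the original listing): predict petal
# length from the remaining Iris features; the split settings are assumptions
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
X = df.drop(columns=["petal length (cm)"])
y = df["petal length (cm)"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Model Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef:.4f}")
print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred):.4f}")
print(f"R-squared: {r2_score(y_test, y_pred):.4f}")

# Actual vs predicted petal length
plt.scatter(y_test, y_pred)
plt.xlabel("Actual petal length (cm)")
plt.ylabel("Predicted petal length (cm)")
plt.title("Multiple Linear Regression: Actual vs Predicted")
plt.show()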
Output :
Model Coefficients:
sepal length (cm): 0.7228
sepal width (cm): -0.6358
petal width (cm): 1.4675