Journal
1. Program on Data Wrangling: Combining and merging datasets, reshaping and pivoting, handling missing data and generating summary statistics
Data wrangling is a critical process in data analysis, where data is transformed into a
structured, usable format. This program demonstrates key operations in data wrangling:
combining and merging datasets, reshaping, pivoting, handling missing data, and generating
summary statistics.
• Combining and Merging Datasets
Combining and merging datasets are essential for integrating multiple data sources.
Merging: Combines datasets based on a common key using an inner join, retaining only
matching rows. This ensures focused analysis on shared data points.
Concatenation: Stacks datasets vertically, appending new records to create a unified
dataset.
• Reshaping Data with Melt
Reshaping is used to change the layout of a dataset to suit specific analytical needs. The melt
operation converts wide-format data into long format, turning columns into rows. This format is
ideal for grouping, filtering, and visualizing data across variables.
• Pivoting Data
Pivoting reverses the melting process, converting long-format data back into wide format.
Using one column as the index and another as the columns, it arranges the values in a
matrix-like layout that is easier to interpret and is particularly useful for certain
statistical analyses.
• Handling Missing Data
Missing values are replaced with column means to ensure completeness. This maintains
data integrity for further analysis.
• Summary Statistics
The program concludes by calculating summary statistics (e.g., mean, standard deviation,
min, max) for the filled dataset. Summary statistics provide insights into the dataset's central
tendency, dispersion, and overall distribution.
Source Code :
import pandas as pd
import numpy as np
sales_data_2 = pd.DataFrame({
'OrderID': [3, 4, 5, 6],
'Product': ['Headphones', 'Laptop', 'Smartwatch', 'Tablet'],
'Sales': [500, 300, 200, 900]
})
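# 1. Combining and Merging Datasets
# (sketch — the companion DataFrame and the merge/concatenation steps are not
# reproduced in this listing; the values below are illustrative)
sales_data_1 = pd.DataFrame({
    'OrderID': [1, 2, 3, 4],
    'Product': ['Laptop', 'Phone', 'Headphones', 'Laptop'],
    'Sales': [1000, 800, 500, 300]
})
print("Sales Data 1:\n", sales_data_1)
print("Sales Data 2:\n", sales_data_2)

# Merge on the common key with an inner join, keeping only matching OrderIDs
merged_data = pd.merge(sales_data_1, sales_data_2, on='OrderID', how='inner')
print("\nMerged Data (Inner Join):\n", merged_data)

# Concatenate the two DataFrames vertically
concatenated_data = pd.concat([sales_data_1, sales_data_2], ignore_index=True)
print("\nConcatenated Data:\n", concatenated_data)

# 2. Reshaping Data with Melt
# Small wide-format table used for the reshaping steps (illustrative values, with one
# missing entry so that step 4 can demonstrate mean imputation; consistent with the
# filled output shown below)
reshaping_data = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar'],
    'Product_A': [100, np.nan, 130],
    'Product_B': [90, 80, 85]
})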
# Melt the DataFrame to reshape it from wide to long format
melted_data = pd.melt(reshaping_data, id_vars=['Month'], var_name='Product', value_name='Sales')
print("\nMelted Data (Long Format):\n", melted_data)
# 3. Pivoting Data
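# (sketch — the pivot and missing-data code is not reproduced in this listing)
# Pivot the long-format data back to wide format: Month as index, Product as columns
pivoted_data = melted_data.pivot(index='Month', columns='Product', values='Sales')
print("\nPivoted Data (Wide Format):\n", pivoted_data)

# 4. Handling Missing Data
# Replace missing values with the mean of each column
filled_data = pivoted_data.fillna(pivoted_data.mean())
print("\nFilled Data (Missing Values Handled):\n", filled_data)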
# 5. Summary Statistics
print("\nSummary Statistics of Filled Data:\n", filled_data.describe())
Output :
Sales Data 1: Sales Data 2:
Order ID Product Sales Order ID Product Sales
Filled Data (Missing Values Handled):
Product Product_A Product_B
Month
Feb 115.0 80.0
Jan 100.0 90.0
Mar 130.0 85.0
2. Program on Data Transformation: String Manipulation, Regular Expressions
Data transformation is an essential step in preprocessing text data for analysis. The program
demonstrates two critical techniques: string manipulation and regular expressions (regex).
1. String Manipulation:
String manipulation involves performing operations on text data to clean or reformat it for
easier analysis. Common operations demonstrated include:
• Trimming Spaces: Removes leading and trailing spaces for cleaner text.
• Changing Case: Converts text to uppercase or lowercase to maintain consistency.
• Counting Substrings: Counts occurrences of specific characters or words.
• Replacing Text: Replaces specific words or patterns with desired text.
• Finding and Splitting: Locates words in a string and splits the text into individual words.
• Checking Prefix/Suffix: Verifies if a string starts or ends with specific content.
These operations are fundamental in cleaning and reformatting raw textual data.
Source Code :
import re
# String Manipulation:
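# (sketch — only step 10 of the original listing is reproduced below; the earlier
# string operations are illustrated here on the sample text shown in the output)
text = " Hello, World! Welcome to Python programming Language. "
print(f"Original Text: '{text}'")

# 1. Strip leading and trailing spaces
clean_text = text.strip()
print(f"Text after stripping spaces: '{clean_text}'")

# 2. Change case
print(f"Uppercase: '{clean_text.upper()}'")
print(f"Lowercase: '{clean_text.lower()}'")

# 3. Count occurrences of a substring
print(f"Occurrences of 'o': {clean_text.count('o')}")

# 4. Replace specific text
print(f"Text after replacing 'Python' with 'HTML': '{clean_text.replace('Python', 'HTML')}'")

# 5. Find a word and split the text into words
print(f"Index of 'Welcome': {clean_text.find('Welcome')}")
words = clean_text.split()
print(f"List of words in the text: {words}")

# 6. Join the words back into a sentence
print(f"Text after joining words: '{' '.join(words)}'")

# 7. Check if the text starts with a specific word
print(f"Does the text start with 'Hello'? {clean_text.startswith('Hello')}")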
# 10. Check if the text ends with a specific word (e.g., "programming.")
ends_with_programming = clean_text.endswith("programming.")
print(f"\nDoes the text end with 'programming.'? {ends_with_programming}")
Output :
Original Text: ' Hello, World! Welcome to Python programming Language. '
Text after stripping spaces: 'Hello, World! Welcome to Python programming Language.'
Text after replacing 'Python' with 'HTML': 'Hello, World! Welcome to Data HTML programming Language.'
List of words in the text: ['Hello,', 'World!', 'Welcome', 'to', 'Python', 'programming', 'Language.']
Text after joining words: 'Hello, World! Welcome to Python programming Language.'
#Regular Expressions :
# Sample text
text = """
John's email is [[email protected]]. He said, "Python is awesome!!" It's a great
language.
Another email: [[email protected]].
"""
Output :
Text after removing special characters:
3. Program on Time series: GroupBy Mechanics to display in data vector,
multivariate time series and forecasting formats
Time series analysis involves working with data collected over time, helping in
understanding patterns and making forecasts. The program demonstrates three key aspects:
GroupBy mechanics, data formats, and forecasting.
• GroupBy Mechanics:
Time series data can be grouped to summarize and analyze trends over specific intervals
(e.g., months). The program groups daily data by month using the resample method and calculates
the monthly mean. This helps identify patterns or trends at a higher granularity, such as seasonal
or monthly variations.
• Data Formats:
Vector Format: Displays a single variable (e.g., Value_A) as a sequence of values over time, useful for analyzing one aspect of the dataset.
Multivariate Time Series: Includes multiple variables (e.g., Value_A and Value_B), allowing for the analysis of relationships between variables over time.
• Forecasting:
Uses the Holt-Winters Exponential Smoothing model to predict future values based on
historical data. The program splits the data into training and testing sets, fits the model to
the training data, and forecasts over the test period. Results are visualized to compare actual
values with predictions, aiding decision-making.
Applications: Time series analysis is widely used in fields like finance, economics, and weather
forecasting.
Source Code :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# GroupBy Mechanics
def groupby_mechanics(data):
    print("\n--- GroupBy Mechanics ---")
    # Group data by month and calculate mean
    grouped = data.resample('M').mean()
    print(grouped)
    return grouped
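# Data Formats
# (sketch — the original definition is not reproduced in this listing; the output
# below shows the vector and multivariate views of the data)
def data_formats(data):
    print("\nVector Format:")
    print(data["Value_A"].head())   # a single variable over time
    print("\nMultivariate Time Series:")
    print(data.head())              # several variables observed together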
# Forecasting Example
def time_series_forecasting(data):
    print("\n--- Forecasting ---")
    # Select a single column for forecasting
    ts = data["Value_A"]
    # Train-Test Split
    train = ts[:int(0.8 * len(ts))]
    test = ts[int(0.8 * len(ts)):]
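    # Fit Holt-Winters Exponential Smoothing on the training data and forecast the
    # test period (sketch — the model settings are assumptions; the original listing
    # does not show them)
    model = ExponentialSmoothing(train, trend="add", seasonal=None)
    fit = model.fit()
    forecast = fit.forecast(len(test))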
    # Plot results
    plt.figure(figsize=(12, 6))
    plt.plot(train, label="Train")
    plt.plot(test, label="Test")
    plt.plot(forecast, label="Forecast")
    plt.legend()
    plt.title("Time Series Forecasting")
    plt.show()
# Main function
if __name__ == "__main__":
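    # Create a sample daily time series (sketch — the original data-generation code is
    # not shown; the seed, start date and value ranges are assumptions)
    np.random.seed(42)
    dates = pd.date_range(start="2023-04-12", periods=100, freq="D")
    data = pd.DataFrame({
        "Value_A": np.random.normal(loc=100, scale=10, size=100),
        "Value_B": np.random.normal(loc=200, scale=10, size=100),
    }, index=dates)
    data.index.name = "Date"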
    print("--- Time Series Data ---")
    print(data.head())
    # Grouping Mechanics
    monthly_data = groupby_mechanics(data)
    # Data Formats
    data_formats(data)
    # Forecasting
    time_series_forecasting(data)
Output :
--- Time Series Data ---
Value_A Value_B
Date
2023-04-12 104.967142 200.251848
2023-04-13 98.617357 201.953522
2023-04-14 106.476885 184.539804
2023-04-15 115.230299 200.490203
2023-04-16 97.658466 209.959966
Vector Format:
Date
2023-04-12 104.967142
2023-04-13 98.617357
2023-04-14 106.476885
2023-04-15 115.230299
2023-04-16 97.658466
Name: Value_A, dtype: float64
Multivariate Time Series:
Value_A Value_B
Date
2023-04-12 104.967142 200.251848
2023-04-13 98.617357 201.953522
2023-04-14 106.476885 184.539804
2023-04-15 115.230299 200.490203
2023-04-16 97.658466 209.959966
--- Forecasting ---
C:\Users\Lenovo\AppData\Local\Programs\Python\Python312\Lib\site-
packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning: No frequency information
was provided, so inferred frequency D will be used.
self._init_dates(dates, freq)
4. Program to measure central tendency and measures of dispersion: Mean,
Median, Mode, Standard Deviation, Variance, Mean deviation and Quartile
deviation for a frequency distribution/data.
The measures are essential for understanding the distribution and variability of data in a
systematic way.
1. Central Tendency: These measures help identify the "center" or typical value of a dataset:
• Mean: The average of the data values, showing the overall central value.
• Median: The middle value when the data is arranged in order, representing the midpoint of
the dataset.
• Mode: The most frequently occurring value in the data, showing the most common
observation.
2. Dispersion: These measures describe how spread out the data is:
• Variance: Shows how much the data values differ from the mean on average.
• Standard Deviation: The square root of variance, indicating the average distance of data from
the mean.
• Mean Deviation: The average of the absolute differences between data values and the mean.
• Quartile Deviation: Focuses on the variability of the middle 50% of the data.
Program Working:
• Input: The program takes two inputs: data values and their frequencies.
• Processing: It calculates the measures of central tendency (mean, median, mode) and
dispersion (variance, standard deviation, etc.) using Python libraries like NumPy and pandas.
• Output: The program provides all the computed measures, giving insights into the dataset's
characteristics.
Advantages of Computational Statistics:
Source Code :
import numpy as np
import pandas as pd
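# Sketch of the statistics routine — the function header and the weighted calculations
# are not reproduced in the original listing; the formulas below follow the description
# above (function and variable names are assumptions)
def compute_statistics(values, frequencies):
    df = pd.DataFrame({'Value': values, 'Frequency': frequencies})
    total = df['Frequency'].sum()
    # Expand the frequency distribution into raw observations for the quartiles
    data = np.repeat(df['Value'].to_numpy(), df['Frequency'].to_numpy())
    # Weighted mean, variance, standard deviation and mean deviation
    mean = (df['Value'] * df['Frequency']).sum() / total
    variance = (((df['Value'] - mean) ** 2) * df['Frequency']).sum() / total
    std_deviation = np.sqrt(variance)
    mean_deviation = ((df['Value'] - mean).abs() * df['Frequency']).sum() / total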
    cumulative_frequency = df['Frequency'].cumsum()
    median_index = cumulative_frequency.searchsorted(total / 2)
    median = df['Value'][median_index]
    mode = df['Value'][df['Frequency'].idxmax()]
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    quartile_deviation = (q3 - q1) / 2
    return {
        'Mean': mean,
        'Median': median,
        'Mode': mode,
        'Variance': variance,
        'Standard Deviation': std_deviation,
        'Mean Deviation': mean_deviation,
        'Quartile Deviation': quartile_deviation
    }
data_input = input("Enter the data values separated by commas (e.g., 10, 20, 30): ")
frequencies_input = input("Enter the corresponding frequencies separated by commas (e.g., 1, 2, 3): ")
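# Parse the inputs, compute and print the measures (sketch — the original driver code
# is not reproduced in this listing)
values = [float(x) for x in data_input.split(',')]
frequencies = [int(x) for x in frequencies_input.split(',')]
results = compute_statistics(values, frequencies)
for measure, value in results.items():
    print(f"{measure}: {value:.2f}")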
Output :
Enter the data values separated by commas (e.g., 10, 20, 30): 10,11,12,13,14
Enter the corresponding frequencies separated by commas (e.g., 1, 2, 3): 1,2,1,3,2
Mean: 12.33
Median: 13.00
Mode: 13.00
Variance: 1.78
Standard Deviation: 1.33
Mean Deviation: 1.19
Quartile Deviation: 1.00
5. Program to perform cross-validation for a given dataset to measure Root Mean
Squared Error (RMSE), Mean Absolute Error (MAE) and R² Error using Validation Set,
Leave-One-Out Cross-Validation (LOOCV) and K-fold Cross-Validation approaches.
Source Code :
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
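# Helper to display the three error metrics (sketch — the original definition is not
# reproduced in this listing)
def display_metrics(y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"  RMSE: {rmse:.4f}")
    print(f"  MAE : {mae:.4f}")
    print(f"  R2  : {r2:.4f}")

# Validation Set Approach (sketch of the part of the function not reproduced here)
def validation_set_approach(X, y):
    print("Validation Set Approach:")
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)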
    # Display metrics
    display_metrics(y_val, y_pred)
# Leave-One-Out Cross-Validation (LOOCV) Approach
def loocv_approach(X, y):
    print("Leave-One-Out Cross-Validation (LOOCV):")
    loo = LeaveOneOut()
    y_true, y_pred = [], []
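    # Fit the model on every leave-one-out split and collect the held-out predictions
    # (sketch — the loop body is not reproduced in the original listing)
    model = LinearRegression()
    for train_idx, test_idx in loo.split(X):
        model.fit(X[train_idx], y[train_idx])
        y_true.append(y[test_idx][0])
        y_pred.append(model.predict(X[test_idx])[0])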
    # Display metrics
    display_metrics(y_true, y_pred)
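# K-fold Cross-Validation Approach (sketch — only the closing lines of this function
# appear in the original listing)
def kfold_approach(X, y, k=5):
    print(f"{k}-Fold Cross-Validation:")
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    y_true, y_pred = [], []
    model = LinearRegression()
    for train_idx, test_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        y_true.extend(y[test_idx])
        y_pred.extend(model.predict(X[test_idx]))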
    # Display metrics
    display_metrics(y_true, y_pred)
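# Driver code (sketch — not reproduced in the original listing; a subset of the
# California housing data is used here to keep LOOCV fast)
X, y = fetch_california_housing(return_X_y=True)
X, y = X[:1000], y[:1000]
validation_set_approach(X, y)
loocv_approach(X, y)
kfold_approach(X, y)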
Output :
6. Program to display Normal, Binomial, Poisson and Bernoulli distributions for a given
frequency distribution and analyze the results.
Probability distributions describe how the values of a random variable are distributed. They
help in understanding the behavior of data and are essential in statistics and data analysis. The
program visualizes four key probability distributions for a given frequency distribution.
Distributions Covered
• Normal Distribution:
A continuous distribution forming a bell-shaped curve. It is symmetric about the mean, and
most data points cluster around the mean. Useful for modeling natural phenomena.
• Binomial Distribution:
A discrete distribution representing the number of successes in a fixed number of trials. It
depends on two parameters: the number of trials (n) and the probability of success (p). Common in
scenarios like flipping a coin or rolling a die.
• Poisson Distribution:
A discrete distribution that models the number of events in a fixed interval of time or space.
It is characterized by the average rate (λ) of occurrence. Useful for modeling rare events like system
failures or call arrivals.
• Bernoulli Distribution:
A discrete distribution representing a single trial with two outcomes: success or failure. It is
defined by the probability of success (p). Used in binary events like yes/no or true/false.
Purpose: The program accepts user input for data values and their frequencies, visualizes the
probability density function (PDF) or probability mass function (PMF) for each distribution, and
helps users compare how well each distribution fits the data.
Importance: Understanding Data: Helps identify patterns in data.
Modeling Real-World Scenarios: Simulates phenomena like natural variations or rare events.
Source Code :
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, binom, poisson, bernoulli
def get_user_data():
    data_input = input("Enter the data values separated by commas (e.g., 10, 20, 30): ")
    frequencies_input = input("Enter the corresponding frequencies separated by commas (e.g., 2, 3, 4): ")
# Normal distribution parameters: sample mean and standard deviation
mean = np.mean(data)
std_dev = np.std(data)
# Binomial distribution: n trials with success probability p estimated from the data
n = max(data)
p = np.mean(data) / n
x = np.arange(0, n + 1)
pmf = binom.pmf(x, n, p)
# Poisson distribution: rate lambda estimated as the sample mean
lam = np.mean(data)
x = np.arange(0, max(data) + 1)
pmf = poisson.pmf(x, lam)
# Bernoulli distribution over the two outcomes 0 and 1
# (success_prob is the probability of success computed in the full listing)
x = [0, 1]
pmf = bernoulli.pmf(x, success_prob)
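# Sketch of one of the plotting functions, showing the pattern the others follow
# (the full definitions are not reproduced in this listing; names are assumptions)
def plot_normal_distribution(data, freq):
    mean = np.mean(data)
    std_dev = np.std(data)
    x = np.linspace(min(data) - 3 * std_dev, max(data) + 3 * std_dev, 200)
    plt.plot(x, norm.pdf(x, mean, std_dev), label="Normal PDF")
    plt.title("Normal Distribution")
    plt.legend()
    plt.show()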
plot_poisson_distribution(data, freq)
Output :
Enter the data values separated by commas (e.g., 10, 20, 30): 8, 10, 12, 14, 16
Enter the corresponding frequencies separated by commas (e.g., 2, 3, 4): 1, 2, 1, 3, 1
7. Program to implement one sample, two sample and paired sample t-tests for a
sample data and analyse the results.
T-Tests are commonly used to assess whether there is a statistically significant difference
between groups or conditions. These tests help us make inferences about populations based on
sample data. Types of t-tests:
1. One-Sample T-Test:
This test compares the mean of a sample to a known value (often a population mean) to
determine if the sample mean is significantly different from this reference value. For example, in the
code, we compare the average exam scores of a group of students to a population mean of 85.
The null hypothesis assumes there is no difference, and the alternative hypothesis suggests a
difference in means.
2. Two-Sample T-Test:
This test is used to compare the means of two independent groups to determine if they
differ significantly. In the code, we compare the scores of two groups (Group A and Group B). The
null hypothesis suggests that there is no difference between the two groups, while the alternative
hypothesis indicates a significant difference.
3. Paired-Sample T-Test:
This test compares the means of two related groups, typically measuring the same subjects
before and after an intervention. In the code, we compare the scores of the same group of students
before and after a treatment. The null hypothesis assumes no difference between the two sets of
scores, while the alternative hypothesis suggests a significant change.
Results are Interpreted as:
• T-Statistic: This value tells us how much the sample mean differs from the hypothesized value
(or the mean of the second group in case of two-sample or paired tests) in terms of standard
error.
• P-Value: This value indicates the probability of observing the data if the null hypothesis were
true. If the p-value is smaller than the chosen significance level (usually 0.05), we reject the
null hypothesis and conclude there is a statistically significant difference.
Source Code :
import numpy as np
import pandas as pd
from scipy import stats
exam_scores = np.array([85, 87, 90, 78, 88, 95, 82, 79, 94, 91])
group_A = np.array([85, 89, 88, 90, 93, 85, 84, 79, 90, 87])
group_B = np.array([82, 86, 85, 87, 92, 80, 81, 78, 89, 85])
before_treatment = np.array([82, 84, 88, 78, 80, 85, 90, 79, 87, 83])
after_treatment = np.array([85, 87, 89, 81, 83, 88, 92, 82, 89, 86])
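# Sketch of the helper functions used below (their definitions are not reproduced in
# this listing); scipy.stats provides the three t-tests directly
def one_sample_ttest(sample, popmean):
    return stats.ttest_1samp(sample, popmean)

def two_sample_ttest(sample1, sample2):
    return stats.ttest_ind(sample1, sample2)

def paired_sample_ttest(before, after):
    return stats.ttest_rel(before, after)

def analyze_ttest_results(t_stat, p_value, alpha=0.05):
    print(f"T-statistic: {t_stat}")
    print(f"P-value: {p_value}")
    if p_value < alpha:
        print("Result: The null hypothesis is rejected (statistically significant difference).")
    else:
        print("Result: The null hypothesis cannot be rejected (no statistically significant difference).")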
print("One-Sample T-Test:")
t_stat, p_value = one_sample_ttest(exam_scores, 85)
analyze_ttest_results(t_stat, p_value)
print()
print("Two-Sample T-Test:")
t_stat, p_value = two_sample_ttest(group_A, group_B)
analyze_ttest_results(t_stat, p_value)
print()
print("Paired-Sample T-Test:")
t_stat, p_value = paired_sample_ttest(before_treatment, after_treatment)
analyze_ttest_results(t_stat, p_value)
Output :
One-Sample T-Test:
T-statistic: 1.0189950494649807
P-value: 0.3348142605778697
Result: The null hypothesis cannot be rejected (no statistically significant difference).
Two-Sample T-Test:
T-statistic: 1.3547090246981803
P-value: 0.19227122007981406
Result: The null hypothesis cannot be rejected (no statistically significant difference).
Paired-Sample T-Test:
T-statistic: -11.758942438532781
P-value: 9.151111215642479e-07
Result: The null hypothesis is rejected (statistically significant difference).
8. Program to implement One-way and Two-way ANOVA tests and analyze the
results
ANOVA (Analysis of Variance) is a statistical method used to test if there are significant
differences between the means of multiple groups.
1. One-Way ANOVA:
Used when comparing the means of more than two groups based on one factor. It checks if
the group means are significantly different. Null Hypothesis (H₀): All group means are equal.
Alternative Hypothesis (H₁): At least one group mean is different.
2. Two-Way ANOVA:
Used when there are two factors, and it tests the individual effects of each factor and their
interaction on the dependent variable. Null Hypothesis(H₀): Neither factor nor their interaction
significantly affects response. Alternative Hypothesis (H₁): At least one factor or their interaction
significantly affects the response.
Key Results:
• F-statistic: Indicates how much the group means differ.
• P-value: If less than 0.05, we reject the null hypothesis, suggesting a significant difference.
Source Code :
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
import statsmodels.api as sm
from statsmodels.formula.api import ols
if __name__ == "__main__":
    # Example dataset for One-way ANOVA
    data_one_way = pd.DataFrame({
        "Group": np.repeat(['A', 'B', 'C'], 10),
        "Score": np.concatenate([
            np.random.normal(loc=50, scale=5, size=10),
            np.random.normal(loc=55, scale=5, size=10),
            np.random.normal(loc=60, scale=5, size=10)
        ])
    })
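    # One-way ANOVA across the three groups (sketch — the analysis code is not
    # reproduced in the original listing)
    groups = [data_one_way[data_one_way["Group"] == g]["Score"] for g in ['A', 'B', 'C']]
    f_stat, p_value = f_oneway(*groups)
    print(f"One-way ANOVA: F = {f_stat:.3f}, p = {p_value:.4f}")

    # Two-way ANOVA: two factors and their interaction, fitted with an OLS model
    # (the example dataset below is an assumption)
    data_two_way = pd.DataFrame({
        "FactorA": np.repeat(['Low', 'High'], 20),
        "FactorB": np.tile(np.repeat(['X', 'Y'], 10), 2),
        "Response": np.random.normal(loc=50, scale=5, size=40)
    })
    model = ols("Response ~ C(FactorA) * C(FactorB)", data=data_two_way).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    print("\nTwo-way ANOVA:\n", anova_table)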
Output :
9. Program to compute correlation and rank correlation, fit a linear regression model for a given dataset, visualize and analyze the results.
Correlation:
• Pearson Correlation: Measures the linear relationship between two variables (X and Y). A
value close to 1 means a strong positive relationship, -1 means a strong negative
relationship, and 0 means no linear relationship. The program calculates this correlation
using the corr function in Pandas.
Rank Correlation (Spearman's Rank Correlation):
• This measures the strength of a monotonic (ordered) relationship between two variables,
using their ranks rather than actual values. It can detect non-linear relationships, and
values close to 1 or -1 indicate strong positive or negative relationships.
Linear Regression:
• Linear regression fits a straight line to the data, modeling the relationship between a
dependent variable (Y) and an independent variable (X). The program uses scikit-learn to
fit a regression line and calculates the Mean Squared Error (MSE) to evaluate the fit.
Visualizations:
• X-Y Scatter Plot: Displays the data points, with a red regression line showing the fitted
model.
• Heatmap: Visualizes the correlation matrix, showing the strength of relationships between
variables.
This program helps to understand relationships between variables using correlation, regression,
and visual tools.
Source Code :
# The imports and the sample data are not shown in this listing; data is assumed to be
# a pandas DataFrame with numeric columns 'X' and 'Y'
import pandas as pd
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Compute Correlation
pearson_corr = data.corr(method='pearson') # Pearson Correlation
spearman_corr, _ = spearmanr(data['X'], data['Y']) # Spearman Rank Correlation
# Linear Regression
X = data['X'].values.reshape(-1, 1) # Reshape for sklearn
Y = data['Y'].values
model = LinearRegression()
model.fit(X, Y)
Y_pred = model.predict(X)
regression_coeff = model.coef_[0] # Slope
regression_intercept = model.intercept_ # Intercept
mse = mean_squared_error(Y, Y_pred)
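# Print the results and draw the visualizations described above (sketch — the plotting
# code is not reproduced in the original listing)
import matplotlib.pyplot as plt
import seaborn as sns

print("Pearson Correlation Coefficient Matrix:\n", pearson_corr)
print(f"Spearman Rank Correlation: {spearman_corr:.4f}")
print(f"Regression line: Y = {regression_coeff:.4f} * X + {regression_intercept:.4f}")
print(f"Mean Squared Error: {mse:.4f}")

# X-Y scatter plot with the fitted regression line
plt.scatter(data['X'], data['Y'], label="Data")
plt.plot(data['X'], Y_pred, color="red", label="Regression line")
plt.legend()
plt.show()

# Heatmap of the correlation matrix
sns.heatmap(pearson_corr, annot=True, cmap="coolwarm")
plt.show()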
Output :
Pearson Correlation Coefficient Matrix:
X Y
X 1.000000 0.952966
Y 0.952966 1.000000
10. Program to implement PCA for Wisconsin dataset, visualize and analyze the
results.
This program demonstrates Principal Component Analysis (PCA) on the Wisconsin Breast
Cancer dataset to reduce the dimensionality of the data, visualize the results, and analyze the
explained variance of the components.
Principal Component Analysis (PCA):
PCA is a technique used to reduce the dimensionality of large datasets while preserving as
much information as possible. It transforms the original features into new, uncorrelated variables
called principal components. The goal is to project the data into fewer dimensions, typically 2 or
3, for easier visualization while retaining most of the data's variance.
Standardization:
Before applying PCA, the data is standardized using StandardScaler to ensure that each
feature has zero mean and unit variance. This is important because PCA is sensitive to the scale of
the data.
Applying PCA:
PCA is performed to reduce the data to 2 principal components for visualization. The
program then calculates the explained variance ratio, which tells us how much variance
(information) each principal component captures.
Visualization:
• PCA Scatter Plot: The program creates a scatter plot of the first two principal
components (PCA1 and PCA2) to visualize how the data points are distributed in the reduced
space. Points are colored based on the target variable (malignant or benign).
• Explained Variance: A bar plot shows how much variance each of the first two principal
components explains.
• Cumulative Variance: A line plot shows how much cumulative variance is explained as more
components are added.
Source Code :
# The imports and preprocessing are not reproduced in this listing; a minimal sketch:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Wisconsin Breast Cancer dataset and standardize the features
cancer = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)

# Apply PCA
pca = PCA(n_components=2) # Reduce to 2 dimensions for visualization
X_pca = pca.fit_transform(X_scaled)
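# Explained variance of the two retained components and cumulative variance of all
# components (sketch — the visualization code is not reproduced in this listing)
print("Explained Variance Ratio (first two components):", pca.explained_variance_ratio_)
pca_full = PCA().fit(X_scaled)
print("Cumulative Variance Explained by All Components:")
print(np.cumsum(pca_full.explained_variance_ratio_))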
Output :
Cumulative Variance Explained by All Components:
11. Program to implement the working of linear discriminant analysis using iris
dataset and visualize the results.
Linear Discriminant Analysis (LDA) is a technique used for dimensionality reduction and
classification. It aims to find the linear combinations of features that best separate the classes in
the dataset. Unlike Principal Component Analysis (PCA), which maximizes variance, LDA focuses
on maximizing class separability.
Key Steps in LDA:
• Data Standardization: Before applying LDA, the data is scaled so that each feature has zero
mean and unit variance. This ensures that all features contribute equally to the analysis.
• Compute Discriminants: LDA computes new axes (called discriminants) that maximize the
difference between classes.
• Dimensionality Reduction: LDA reduces the dataset to fewer dimensions while preserving
as much class separation as possible. In this case, we reduce it to 2 dimensions for easier
visualization.
Visualization: The transformed data is plotted in a 2D space, showing how well the classes (species
in the Iris dataset) are separated.
Application in the Iris Dataset:
• The Iris dataset has 4 features, and LDA reduces it to 2 components for visualization.
• LDA is useful in classification tasks, where the goal is to predict the class label of new data
points based on the transformed features.
Source Code :
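# Sketch of the steps leading up to the insights printed below (the full listing is
# not reproduced here): load the Iris data, standardize it and fit LDA with two
# components
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, iris.target)

# 2D scatter of the two discriminants, colored by species
plt.scatter(X_lda[:, 0], X_lda[:, 1], c=iris.target)
plt.xlabel("LDA1")
plt.ylabel("LDA2")
plt.title("LDA of the Iris dataset")
plt.show()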
# Print key insights
print("Linear Discriminant Analysis (LDA) Results")
print(" ")
print("Explained Variance Ratio by LDA Components:")
for i, ratio in enumerate(lda.explained_variance_ratio_, start=1):
    print(f" LDA{i}: {ratio:.4f}")
Output :
Linear Discriminant Analysis (LDA) Results
12. Program to Implement multiple linear regression using iris dataset, visualize
and analyze the results.
Multiple Linear Regression (MLR) is a technique used to predict a target variable based on
the relationship between multiple input variables. It helps in understanding how different features
affect the outcome.
Key Concepts:
• Prediction: MLR creates a model that predicts a target variable using multiple independent
variables.
• Training: The model learns from the training data by adjusting coefficients for each feature.
• Evaluation: The model’s accuracy is measured using metrics like Mean Squared Error (MSE)
and R-squared (R²).
Application to the Iris Dataset:
In this case, we predict the petal length based on other features like sepal length and petal width.
The dataset is split into a training set and a test set. The model is trained on the training set and
evaluated on the test set.
Steps:
1. Training: Fit the model using the training data.
2. Prediction: Make predictions on the test data.
3. Evaluation: Use MSE and R² to assess model performance.
4. Visualization: Compare the actual vs predicted values using a plot.
MLR is commonly used when there are multiple factors influencing the outcome and helps in
making predictions based on them.
Source Code :
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
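# Sketch of the modelling steps (not reproduced in the original listing): predict petal
# length from the remaining Iris features; the split settings are assumptions
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
X = df.drop(columns=["petal length (cm)"])
y = df["petal length (cm)"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Model Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef:.4f}")
print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred):.4f}")
print(f"R-squared: {r2_score(y_test, y_pred):.4f}")

# Actual vs predicted petal length
plt.scatter(y_test, y_pred)
plt.xlabel("Actual petal length (cm)")
plt.ylabel("Predicted petal length (cm)")
plt.title("Multiple Linear Regression: Actual vs Predicted")
plt.show()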
Output :
Model Coefficients:
sepal length (cm): 0.7228
sepal width (cm): -0.6358
petal width (cm): 1.4675