Practical No. 01
Steps:
1. Select the "Profit" column (Column E).
Steps:
1. Select the entire dataset including headers.
2. Go to the "Insert" tab on the ribbon.
3. Click on "PivotTable."
4. Choose where you want to place the PivotTable (e.g., new worksheet).
5. Drag "Category" to the Rows area.
6. Drag "Sales" to the Values area, choosing the sum function.
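For reference, the same category-wise sales summary can be reproduced in pandas; a minimal sketch, assuming the worksheet has been exported to a CSV such as 'DATA SET.csv' (the file used in Practical No. 02) with 'Category' and 'Sales' columns:
import pandas as pd
# Hypothetical export of the worksheet used above
df = pd.read_csv('DATA SET.csv')
# Equivalent of the PivotTable: rows = Category, values = sum of Sales
pivot = pd.pivot_table(df, index='Category', values='Sales', aggfunc='sum')
print(pivot)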
Steps:
1. Identify the cell containing the formula for "Profit" for "Product P" (let's assume it's in cell E17).
2. Go to the "Data" tab on the ribbon.
3. Click on "What-If Analysis" and select "Goal Seek."
4. Set "Set cell" to the profit cell (E17), "To value" to 1000, and "By changing cell" to the sales cell (C17).
5. Click "OK" to let Excel determine the required sales.
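Goal Seek is essentially solving Profit(Sales) = 1000 for Sales; a minimal Python sketch of the same idea, assuming a simplified Profit = Sales − fixed cost relationship (the actual worksheet formula in E17 may differ):
from scipy.optimize import brentq
# Assumed profit formula for Product P (a stand-in for the actual worksheet formula in E17)
fixed_cost = 500.0
def profit(sales):
    return sales - fixed_cost
# Goal Seek equivalent: find the sales value at which profit(sales) equals the 1000 target
target_profit = 1000.0
required_sales = brentq(lambda s: profit(s) - target_profit, 0, 1_000_000)
print(f"Required sales: {required_sales:.2f}")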
Practical No. 02
Data pre-processing:
Data pre-processing is a crucial step in the data analysis pipeline, encompassing tasks such as
reading data from various file formats, handling missing values, and managing outliers. This
practical guide explores how to execute these tasks using the pandas library in Python.
Steps:
Step 1: Reading from CSV and JSON Files
1. Utilize pandas to read data from a CSV file ('DATA SET.csv') into a data frame.
2. Use pandas to read data from a JSON file ('ds.json') into a data frame.
3. Display the first few rows of each data frame to inspect the data.
Step 2: Handling Missing Values
1. Drop rows with missing values from the CSV data frame.
2. Fill missing values with a specific value (e.g., 0) in the JSON data frame.
Step 3: Handling Outliers
1. Identify outliers in the 'Sales' column of the CSV data frame.
2. Replace outliers with the median value.
Step 4: Manipulating and Transforming Data
1. Filter the CSV data frame to include only rows where 'Sales' is greater than 10.
2. Sort the CSV data frame based on the 'Sales' column in descending order.
3. Group the CSV data frame by the 'Category' column and calculate the mean for numeric columns ('Sales',
'Cost', 'Profit').
Step 5: Displaying Results
1. Display the cleaned CSV data frame after handling missing values.
2. Display the JSON data frame after filling missing values.
3. Display the filtered CSV data frame.
4. Display the sorted CSV data frame.
5. Display the grouped CSV data frame showing the mean values for numeric columns.
Code:
import pandas as pd
# Read data from CSV file into a data frame
csv_file_path = 'DATA SET.csv'
df_csv = pd.read_csv(csv_file_path)
# Read data from JSON file into a data frame
json_file_path = 'ds.json'
df_json = pd.read_json(json_file_path)
# Display the first few rows of each data frame to inspect the data
print("CSV Data:")
print(df_csv.head())
print("\nJSON Data:")
print(df_json.head())
# Handling missing values
# Drop rows with missing values
df_csv_cleaned = df_csv.dropna()
# Fill missing values with a specific value (e.g., 0)
df_json_filled = df_json.fillna(0)
# Handling outliers
# Assume 'Sales' is the column with outliers
# Replace outliers with the median
median_value = df_csv['Sales'].median()
upper_threshold = df_csv['Sales'].mean() + 2 * df_csv['Sales'].std()
lower_threshold = df_csv['Sales'].mean() - 2 * df_csv['Sales'].std()
df_csv['Sales'] = df_csv['Sales'].apply(lambda x: median_value if x > upper_threshold or x < lower_threshold else x)
# Manipulate and transform data
# Filtering
filtered_data = df_csv[df_csv['Sales'] > 10]
# Sorting
sorted_data = df_csv.sort_values(by='Sales', ascending=False)
# Grouping and calculating mean for numeric columns
numeric_columns = ['Sales', 'Cost', 'Profit']
grouped_data = df_csv.groupby('Category')[numeric_columns].mean()
# Display the results
print("\nCleaned CSV Data:")
print(df_csv_cleaned.head())
print("\nFilled JSON Data:")
print(df_json_filled.head())
print("\nFiltered Data:")
print(filtered_data.head())
print("\nSorted Data:")
print(sorted_data.head())
print("\nGrouped Data:")
print(grouped_data.head())
Output:
Practical No. 03
Feature Scaling:
Feature scaling is a preprocessing technique used to standardize the range of independent
variables or features of the data. It is essential for certain machine learning algorithms that are
sensitive to the scale of input features, ensuring that all features contribute equally to the
learning process.
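For reference, standardization rescales each value as z = (x − mean) / (standard deviation), giving each feature mean 0 and unit variance, while min-max normalization maps each value to the [0, 1] range via x' = (x − min) / (max − min).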
Feature Dummification:
Feature dummification or one-hot encoding is a technique used to convert categorical
variables into numerical representations. This is necessary because many machine learning
algorithms require numerical input, and representing categorical variables as binary vectors
helps maintain their information.
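As a minimal illustration with pandas, using a hypothetical three-row Category column:
import pandas as pd
# Hypothetical categorical column
df = pd.DataFrame({'Category': ['Apple', 'Banana', 'Apple']})
# One-hot encoding: each category value becomes its own binary indicator column
print(pd.get_dummies(df, columns=['Category']))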
Steps:
1. Load and Explore Data: Load the dataset, explore its structure, and identify the numeric and categorical features.
2. Feature Scaling: Apply standardization and normalization to numeric features.
3. Feature Dummification: Convert categorical variables into numerical representations
using one-hot encoding.
4. Combine Features: Combine scaled numeric features with one-hot encoded categorical
features.
5. Display Resulting Dataset: Display the final dataset after both feature scaling and
dummification.
Code:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Define the data
data = {
'Product': ['Apple_Juice', 'Banana_Smoothie', 'Orange_Jam', 'Grape_Jelly', 'Kiwi_Parfait',
'Mango_Chutney', 'Pineapple_Sorbet', 'Strawberry_Yogurt', 'Blueberry_Pie', 'Cherry_Salsa'],
'Category': ['Apple', 'Banana', 'Orange', 'Grape', 'Kiwi', 'Mango', 'Pineapple', 'Strawberry',
'Blueberry', 'Cherry'],
'Sales': [1200, 1700, 2200, 1400, 2000, 1000, 1500, 1800, 1300, 1600],
'Cost': [600, 850, 1100, 700, 1000, 500, 750, 900, 650, 800],
'Profit': [600, 850, 1100, 700, 1000, 500, 750, 900, 650, 800]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Display the original dataset
print("Original Dataset:")
print(df)
# Step 1: Feature Scaling (Standardization and Normalization)
numeric_columns = ['Sales', 'Cost', 'Profit']
scaler_standardization = StandardScaler()
scaler_normalization = MinMaxScaler()
df_scaled_standardized = pd.DataFrame(scaler_standardization.fit_transform(df[numeric_columns]), columns=numeric_columns)
df_scaled_normalized = pd.DataFrame(scaler_normalization.fit_transform(df[numeric_columns]), columns=numeric_columns)
# Combine the scaled numeric features with the categorical features
df_scaled = pd.concat([df_scaled_standardized, df.drop(numeric_columns, axis=1)], axis=1)
# Display the dataset after feature scaling
print("\nDataset after Feature Scaling:")
print(df_scaled)
# Step 2: Feature Dummification
# Identify categorical columns
categorical_columns = ['Product', 'Category']
# Create a column transformer for dummification
preprocessor = ColumnTransformer(
transformers=[
('categorical', OneHotEncoder(), categorical_columns)
],
remainder='passthrough'
)
# Apply the column transformer to the dataset
df_dummified = pd.DataFrame(preprocessor.fit_transform(df))
# Display the dataset after feature dummification
print("\nDataset after Feature Dummification:")
print(df_dummified)
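The dummified frame above keeps only positional column numbers; in scikit-learn 1.0 or later the encoded feature names can be recovered from the fitted transformer, for example:
# Recover readable column names from the fitted ColumnTransformer (scikit-learn >= 1.0)
df_dummified.columns = preprocessor.get_feature_names_out()
print(df_dummified.head())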
Output:
Practical No. 04
Hypothesis Testing:
Hypothesis testing is a statistical method used to make inferences about population parameters based on
sample data. It involves the formulation of a null hypothesis (H0) and an alternative hypothesis (H1), and the
collection of sample data to assess the evidence against the null hypothesis. The goal is to determine whether
there is enough evidence to reject the null hypothesis in favor of the alternative hypothesis.
1. Formulate Hypotheses:
Null Hypothesis (H0 ): The average caffeine content per serving is 80 mg (μ=80).
Alternative Hypothesis (H1 ): The average caffeine content per serving is different from 80 mg
(μ≠80).
2. Statistical Test:
A t-test is appropriate since you are comparing a sample mean to a known population mean, and the
sample size is small.
3. Data Collection:
Randomly select 30 cans of the energy drink and measure the caffeine content in each.
4. Conducting the Hypothesis Test:
a. Collect Data:
Calculate the sample mean (x̄) and standard deviation (s) from the 30 samples.
b. Set Significance Level (α):
Choose a significance level (commonly α = 0.05, 0.01, or 0.10).
c. Calculate the Test Statistic (t-value):
Use the formula t = (x̄ − μ) / (s / √n).
d. Determine Degrees of Freedom:
For a one-sample t-test, degrees of freedom (df) is n−1.
e. Find Critical Values or P-value:
Use a t-table or statistical software to find the critical t-values for a two-tailed test at the chosen
significance level.
f. Make a Decision:
If the t-value falls outside the critical region, reject the null hypothesis. If it falls inside, fail to reject.
g. Interpretation:
If you reject the null hypothesis, there is enough evidence to suggest that the average caffeine
content per serving is different from 80 mg. If you fail to reject the null hypothesis, there is not
enough evidence to suggest a difference in the average caffeine content.
5. Conclusion:
Draw conclusions about the energy drink's caffeine content, considering both statistical and practical
significance. Consider decisions relevant to the context of the problem.
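Before the two-sample demonstration in the code below, a minimal sketch of the one-sample test described in steps 1–4, using simulated caffeine measurements as a stand-in for real data:
import numpy as np
from scipy import stats
# Simulated caffeine measurements for 30 cans (a stand-in for real measurements)
np.random.seed(0)
caffeine = np.random.normal(loc=82, scale=5, size=30)
# One-sample t-test against the hypothesised mean of 80 mg
t_stat, p_val = stats.ttest_1samp(caffeine, popmean=80)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_val:.3f}")
alpha = 0.05
if p_val < alpha:
    print("Reject H0: the mean caffeine content differs from 80 mg.")
else:
    print("Fail to reject H0: no evidence of a difference from 80 mg.")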
Code:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Generate two samples for demonstration purposes
np.random.seed(42)
sample1 = np.random.normal(loc=10, scale=2, size=30)
sample2 = np.random.normal(loc=12, scale=2, size=30)
# Perform a two-sample t-test
t_statistic, p_value = stats.ttest_ind(sample1, sample2)
# Set the significance level
alpha = 0.05
print("Results of Two-Sample t-test:")
print(f"t-statistic: {t_statistic}")
print(f"p-value: {p_value}")
print(f"Degrees of Freedom: {len(sample1) + len(sample2) - 2}")
# Plot the distributions
plt.figure(figsize=(10, 6))
plt.hist(sample1, alpha=0.5, label='Sample 1', color='blue')
plt.hist(sample2, alpha=0.5, label='Sample 2', color='orange')
plt.axvline(np.mean(sample1), color='blue', linestyle='dashed', linewidth=2)
plt.axvline(np.mean(sample2), color='orange', linestyle='dashed', linewidth=2)
plt.title('Distributions of Sample 1 and Sample 2')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.legend()
# Highlight the critical region if null hypothesis is rejected
if p_value < alpha:
    critical_region = np.linspace(min(sample1.min(), sample2.min()), max(sample1.max(), sample2.max()), 1000)
    plt.fill_between(critical_region, 0, 5, color='red', alpha=0.3, label='Critical Region')
# Show the observed t-statistic
plt.text(11, 5, f'T-statistic: {t_statistic:.2f}', ha='center', va='center', color='black', backgroundcolor='white')
# Show the plot
plt.show()
# Draw conclusions
if p_value < alpha:
    if np.mean(sample1) > np.mean(sample2):
        print("Conclusion: There is significant evidence to reject the null hypothesis.")
        print("Interpretation: The mean caffeine content of Sample 1 is significantly higher than that of Sample 2.")
        # Additional context and practical implications can be added here.
    else:
        print("Conclusion: There is significant evidence to reject the null hypothesis.")
        print("Interpretation: The mean caffeine content of Sample 2 is significantly higher than that of Sample 1.")
        # Additional context and practical implications can be added here.
else:
    print("Conclusion: Fail to reject the null hypothesis.")
    print("Interpretation: There is not enough evidence to claim a significant difference between the means.")
Output:
Practical No. 05
from matplotlib import pyplot as plt
# Yearly failure percentage rates
years = [2020, 2021, 2022, 2023, 2024]
failure_percent_rates = [60, 70, 50, 10, 0]
# Line chart of failure rates over the years
plt.plot(years, failure_percent_rates, color="green", marker="o", linestyle="solid")
plt.title("Corona-time failure rates")
plt.xlabel("Year")
plt.ylabel("Failure rate (%)")
plt.show()
Output-
Practical No. 06
Acquire Dataset:
Obtain a dataset suitable for regression analysis. The dataset should contain variables that
you believe may have a linear relationship or can be used to predict another variable of
interest.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
# Synthetic stand-in data (assumption: replace with the acquired dataset)
rng = np.random.default_rng(42)
sales = rng.uniform(1000, 2000, 100)
cost = 0.5 * sales + rng.normal(0, 50, 100)
profit = sales - cost + rng.normal(0, 30, 100)
df = pd.DataFrame({'Sales': sales, 'Cost': cost, 'Profit': profit})
# Simple linear regression: one predictor
X_simple = df[['Sales']]
y = df['Profit']
X_train_simple, X_test_simple, y_train_simple, y_test_simple = train_test_split(X_simple, y, test_size=0.2, random_state=42)
regressor_simple = LinearRegression()
regressor_simple.fit(X_train_simple, y_train_simple)
y_pred_simple = regressor_simple.predict(X_test_simple)
# Model evaluation
print('Simple Linear Regression:')
print('Intercept:', regressor_simple.intercept_)
print('Coefficient:', regressor_simple.coef_)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test_simple, y_pred_simple))
print('Mean Squared Error:', metrics.mean_squared_error(y_test_simple, y_pred_simple))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test_simple, y_pred_simple)))
print('R-squared:', metrics.r2_score(y_test_simple, y_pred_simple))
# Multiple linear regression: several predictors, same workflow
X_multi = df[['Sales', 'Cost']]
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(X_multi, y, test_size=0.2, random_state=42)
regressor_multi = LinearRegression()
regressor_multi.fit(X_train_multi, y_train_multi)
Output-
Practical No. 07
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
# Load the dataset (assumption: a CSV such as the Pima diabetes data containing the columns used below)
df = pd.read_csv('diabetes.csv')  # hypothetical file name
# Split the dataset into features (X) and target variable (y)
X = df.drop(columns=['BloodPressure', 'Age'])  # also drop the target so it does not leak into the features
y = df['Age']
# Train/test split and feature scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Logistic Regression
log_reg_model = LogisticRegression(max_iter=1000)  # increase max_iter to avoid a convergence warning
log_reg_model.fit(X_train_scaled, y_train)
y_pred_log_reg = log_reg_model.predict(X_test_scaled)
# Decision Tree
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train_scaled, y_train)
y_pred_dt = dt_model.predict(X_test_scaled)
# Compare overall accuracy of the two models
print('Logistic Regression accuracy:', accuracy_score(y_test, y_pred_log_reg))
print('Decision Tree accuracy:', accuracy_score(y_test, y_pred_dt))
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.heatmap(confusion_matrix(y_test, y_pred_log_reg), annot=True, cmap='Blues', fmt='g')
plt.title('Confusion Matrix - Logistic Regression')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.subplot(1, 2, 2)
sns.heatmap(confusion_matrix(y_test, y_pred_dt), annot=True, cmap='Blues', fmt='g')
plt.title('Confusion Matrix - Decision Tree')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.tight_layout()
plt.show()
Output-
Practical No. 08
Interpret Results:
Interpret the clustering results based on the characteristics of each cluster.
Analyze any meaningful patterns or insights discovered through clustering.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
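The clustering step itself is not shown; a minimal sketch using the imports above, with synthetic blobs standing in for a real dataset:
# Synthetic data with three clusters (a stand-in for a real dataset)
X_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
# Fit K-Means and evaluate cluster quality with the silhouette score
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_blobs)
print("Silhouette Score:", silhouette_score(X_blobs, labels))
# Visualize the clusters and their centers
plt.scatter(X_blobs[:, 0], X_blobs[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='x')
plt.title('K-Means Clusters')
plt.show()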
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
# Load the iris dataset and perform PCA
iris = load_iris()
X = iris.data
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Calculate explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
print("Explained Variance Ratio:", explained_variance_ratio)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load sample dataset (you can replace this with your own dataset)
df = sns.load_dataset('tips')
# Data Visualization
# Visualization 1: Distribution of Total Bill Amount
plt.figure(figsize=(10, 6))
sns.histplot(df['total_bill'], kde=True)
plt.title('Distribution of Total Bill Amount')
plt.xlabel('Total Bill Amount')
plt.ylabel('Frequency')
plt.show()
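# Visualization 2 (illustrative sketch): total bill vs. tip by gender, the relationship
# referenced in the second insight printed below; column names follow the seaborn 'tips' dataset
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='total_bill', y='tip', hue='sex')
plt.title('Total Bill vs. Tip Amount by Gender')
plt.xlabel('Total Bill Amount')
plt.ylabel('Tip Amount')
plt.show()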
# Data Storytelling
print("\nInsights:")
print("1. The distribution of total bill amounts is right-skewed, with most bills falling between
$10 and $20.")
print("2. There is a positive relationship between total bill amount and tip amount, with some
variations based on gender.")
print("3. Total bill amounts tend to be higher on Saturdays compared to other days.")
print("4. The count of customers is higher during dinner time compared to lunchtime on all
days.")
# Conclusion
print("\nConclusion:")
print("Based on the analysis, we can infer that there is a strong relationship between the total
bill amount and tip amount, with variations based on factors such as day and time. Further
analysis can be conducted to explore these relationships in more detail.")
Output-