DW Lab File

Uploaded by

jadeanica

Index

S. No   Practical   Remarks
1   Write a program to read data from CSV files and display the content.
2   Write a program to read data from JSON files and display the content.
3   Load a dataset into a data structure (e.g., DataFrame) and perform basic data cleaning (e.g., handling missing values).
4   Solve a case study by filtering, grouping, and adding a new column to a dataset.
5   Design and implement a program to create a Data Mart.
6   Implementation of Data Cleansing using Python.
7   Develop a program to create metadata for a dataset, including relevant data descriptions.
8   Write Python code to perform data transformation tasks.
9   Write Python code for Data Discretization.
10  Create and visualize a graph from a dataset using a graph library.
11  Case Study I
12  Case Study II
13  Case Study III
14  Implement a k-Nearest Neighbour (k-NN) classifier and evaluate its performance on a given dataset.

1. Write a program to read data from CSV files and display the content.

Objective:- Read and display data from various file formats such as CSV and JSON.

Code:-
import csv

def read_csv(file_path):
    try:
        with open(file_path, mode='r', newline='', encoding='utf-8') as file:
            csv_reader = csv.reader(file)
            # The first row is treated as the header
            header = next(csv_reader)
            print(f"Header: {header}")
            print("\nData:")
            for row in csv_reader:
                print(row)
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' was not found.")
    except Exception as e:
        print(f"An error occurred: {e}")

file_path = 'Book1.csv'
read_csv(file_path)

Output:-
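
As a variation on the reader above, csv.DictReader maps every row to the header names, which can be easier to work with than positional lists. The following is a minimal sketch, assuming a small in-memory sample; the column names and values are hypothetical and are not taken from Book1.csv.

```python
import csv
import io

# Hypothetical sample standing in for a CSV file on disk.
sample = "name,age\nAlice,30\nBob,25\n"

def read_csv_as_dicts(text):
    """Return the rows of CSV text as a list of dicts keyed by the header."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)

rows = read_csv_as_dicts(sample)
for row in rows:
    print(row)
```

Note that the csv module returns every field as a string, so numeric columns such as age would still need an explicit conversion.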

2. Write a program to read data from JSON files and display the content.
Objective:- Load datasets into appropriate data structures (e.g., Pandas DataFrame)
for analysis.

Code:-

import json

def display_json_data():
    # Sample record held in memory as a Python dictionary
    json_data = {
        "name": "Alice",
        "age": 30,
        "city": "New York",
        "is_student": False,
        "courses": ["Math", "Science", "English"]
    }
    print("JSON Data Content:")
    print(json.dumps(json_data, indent=4))

display_json_data()

Output:-
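
The program above prints a dictionary that is already in memory. To read from an actual JSON file, as the practical title asks, json.load can be used on an open file handle. The sketch below is one possible way to do it; the file name sample.json and the record are assumptions, and the file is written to a temporary directory only so that the example is self-contained.

```python
import json
import os
import tempfile

# Hypothetical record, written to a temporary file so the example runs anywhere.
record = {"name": "Alice", "age": 30, "courses": ["Math", "Science"]}

def read_json(file_path):
    """Load a JSON file and return the parsed Python object."""
    with open(file_path, mode="r", encoding="utf-8") as file:
        return json.load(file)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")
    with open(path, mode="w", encoding="utf-8") as file:
        json.dump(record, file)
    loaded = read_json(path)

print("JSON Data Content:")
print(json.dumps(loaded, indent=4))
```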

3. Load a dataset into a data structure (e.g., Data Frame) and perform basic
data cleaning (e.g., handling missing values).
Objective:- Perform basic data cleaning by handling missing values and
inconsistencies.

Code:-

import pandas as pd
import numpy as np

# Create a dictionary of columns to hold the raw data
data = {
    'Customer ID': [1, 2, 3, 4, 5, 6],
    'Name': ['John Smith', 'jane Doe', 'jake Doe', 'john Smith', None, 'Alice Brown'],
    'Purchase Date': ['01/09/2024', '01/09/0024', '02/09/2024', '09/01/2024', '01/09/2024', '01/09/2024'],
    'Amount': [100, '$200', 300, 400, -500, 600],
    'Email': ['[email protected]', '[email protected]', 'N/A', '[email protected]', '', 'alice#example.com'],
    'Address': ['123 Maple St,NY', '456 Eim St,NY', '789 Pine St,NY', '123 Maple St,NY', '123 Maple St,NY', '']
}

# Create the DataFrame
df = pd.DataFrame(data)

# Parse dates as day/month/year; values that cannot be parsed become missing
df['Purchase Date'] = pd.to_datetime(df['Purchase Date'], format='%d/%m/%Y',
                                     errors='coerce').dt.strftime('%Y-%m-%d')

# Remove dollar signs and commas, then convert Amount to numeric
df['Amount'] = df['Amount'].astype(str).str.replace(r'[\$,]', '', regex=True).astype(float)

# Correct negative amounts (assume refund should be positive)
df['Amount'] = df['Amount'].abs()

# Replace N/A, Invalid, and empty emails with a missing value
df['Email'] = df['Email'].replace(['N/A', 'Invalid', ''], np.nan)
df['Email'] = df['Email'].replace('alice#example.com', '[email protected]')

# Correct lowercase names
df['Name'] = df['Name'].str.title()

df = df.drop_duplicates(subset=['Customer ID', 'Amount'])

# Treat empty addresses as missing, then fill them with a placeholder
df['Address'] = df['Address'].replace('', np.nan).fillna('Address not Available')

# Final cleansed data
print("\nCleansed Data")
print(df)
Output:-
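
The cleaning step above replaces one known bad email by hand. A more general option is to flag any value that does not look like an email address. The sketch below is only an illustration: the pattern is deliberately simple (an assumption, not a full email validator) and the sample values are hypothetical.

```python
import re
import pandas as pd

# Deliberately simple pattern (an assumption): text, '@', text, '.', text.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Hypothetical sample values, including typical placeholders for "missing".
emails = pd.Series(["alice@example.com", "N/A", "", "alice#example.com", None])

def is_valid_email(value):
    """Return True only for strings that match the simple pattern."""
    return isinstance(value, str) and EMAIL_PATTERN.match(value) is not None

valid_mask = emails.apply(is_valid_email)

# Keep valid entries; everything else becomes a missing value.
cleaned = emails.where(valid_mask)
print(cleaned)
```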
4. Solve a case study by filtering, grouping, and adding a new column to a
dataset.

Objective:- Implement advanced data analysis techniques, such as filtering,
grouping, and adding new calculated columns to datasets.

Code:-

import pandas as pd
file_path = 'sales_data.csv'
df = pd.read_csv(file_path)
# Display the original dataset
print("Original Dataset:")
print(df.head())
# Filter the dataset where Amount is greater than 800
filtered_df = df[df['Amount'] > 800]
# Display the filtered dataset
print("\nFiltered Dataset (Amount > 800):")
print(filtered_df)
# Group by Salesperson and total the sales amount and quantity
grouped_df = df.groupby('Salesperson').agg(
    total_sales=('Amount', 'sum'),
    total_quantity_sold=('Quantity', 'sum')
).reset_index()
# Display the grouped data
print("\nGrouped Data by Salesperson:")
print(grouped_df)
df['Total Sales after Discount'] = df['Amount'] * (1 - df['Discount'])
print("\nDataset with 'Total Sales after Discount' Column:")
print(df)

Output:-
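
Because sales_data.csv is not included with this file, the same three operations can be tried on a small in-memory table. The sketch below reuses the column names from the program above (Salesperson, Amount, Quantity, Discount); the values and the Net column name are hypothetical.

```python
import pandas as pd

# Hypothetical rows using the same column names as sales_data.csv above.
sales = pd.DataFrame({
    "Salesperson": ["Alice", "Bob", "Alice", "Bob"],
    "Amount": [1000, 500, 900, 1500],
    "Quantity": [10, 5, 9, 15],
    "Discount": [0.1, 0.0, 0.2, 0.5],
})

# Filtering: keep only rows where Amount is greater than 800.
high_value = sales[sales["Amount"] > 800]

# Grouping: total Amount per salesperson.
totals = sales.groupby("Salesperson")["Amount"].sum()

# New column: amount after applying the fractional discount.
sales["Net"] = sales["Amount"] * (1 - sales["Discount"])

print(high_value)
print(totals)
print(sales)
```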
5. Design and implement a program to create a Data Mart.

Objective:- Design and build a Data Mart for organizing and storing data for business
intelligence.

Code:-

import pandas as pd
data = {
'Order_ID': [101, 102, 103, 104, 105, 106],
'Salesperson': ['Alice', 'Bob', 'Alice', 'Charlie', 'Alice', 'Bob'],
'Region': ['East', 'West', 'East', 'East', 'West', 'East'],
'Amount': [1000, 1500, 800, 1200, 2000, 900],
'Quantity': [10, 15, 8, 12, 20, 9],
'Discount': [0.1, 0.05, 0.2, 0.15, 0.1, 0.1],
'Date': ['2024-01-10', '2024-01-12', '2024-01-13', '2024-01-14', '2024-01-15',
'2024-01-16']
}
# Convert the dictionary to a pandas DataFrame
df = pd.DataFrame(data)
# Data cleaning
print("\nMissing Values in Data:")
print(df.isnull().sum())
# Fill missing discounts with 0 (assign the result; in-place fillna on a column is unreliable)
df['Discount'] = df['Discount'].fillna(0)
# Add a calculated column for sales after discount
df['Total_Sales_After_Discount'] = df['Amount'] * (1 - df['Discount'])
# Aggregating data by Region and Salesperson
df_aggregated = df.groupby(['Region', 'Salesperson']).agg(
    total_sales=('Total_Sales_After_Discount', 'sum'),
    total_quantity=('Quantity', 'sum')
).reset_index()
# Display the transformed (aggregated) data
print("\nAggregated Data (Sales by Region and Salesperson):")
print(df_aggregated)
data_mart_path = 'sales_data_mart.csv'
df_aggregated.to_csv(data_mart_path, index=False)
# Confirm Data Mart creation
print(f"\nData Mart Created and Saved to: {data_mart_path}")
Output:-
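
The program above saves the data mart as a CSV file. If the mart needs to be queried with SQL, one alternative is to load the aggregated table into SQLite. The following is a minimal sketch under stated assumptions: the table name sales_mart is hypothetical, the database is in memory, and the rows are sample values in the same shape as the aggregated output.

```python
import sqlite3
import pandas as pd

# Sample rows in the same shape as the aggregated data mart (illustrative values).
mart = pd.DataFrame({
    "Region": ["East", "East", "West"],
    "Salesperson": ["Alice", "Bob", "Alice"],
    "total_sales": [1540.0, 810.0, 1800.0],
    "total_quantity": [18, 9, 20],
})

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
mart.to_sql("sales_mart", conn, index=False, if_exists="replace")

# Query the mart: sales in the East region, highest first.
east = pd.read_sql_query(
    "SELECT Salesperson, total_sales FROM sales_mart "
    "WHERE Region = ? ORDER BY total_sales DESC",
    conn,
    params=("East",),
)
conn.close()
print(east)
```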

6. Implementation of Data Cleansing using Python

Objective:- Apply data cleansing techniques to improve data quality.

Code:-

import pandas as pd
import numpy as np
data = {
'Customer_ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'George',
'Hannah', 'Ivan', 'Jack'],
'Age': [25, 30, np.nan, 22, 35, 29, np.nan, 40, 23, 25],
'Email': ['[email protected]', '[email protected]', '[email protected]', np.nan,
'[email protected]', '[email protected]', '[email protected]', '[email protected]',
'[email protected]', '[email protected]'],
'Purchase_Amount': [100, 200, 150, 300, 250, 400, 450, 100, 500, 200],
'Country': ['USA', 'USA', 'USA', 'Canada', 'USA', 'Canada', 'USA', 'USA',
'USA', 'USA'],
}
# Convert to DataFrame
df = pd.DataFrame(data)
print("Original Dataset:")
print(df)
df['Age'] = df['Age'].fillna(df['Age'].median())
# Fill NaN values in 'Email' with a placeholder
df['Email'] = df['Email'].fillna('[email protected]')
print("\nData after Handling Missing Values:")
print(df)
# Removing duplicates
# DataFrame.append was removed in pandas 2.0, so pd.concat adds the duplicate row
df_duplicate = pd.concat([df, df.iloc[[0]]], ignore_index=True)
df_no_duplicates = df_duplicate.drop_duplicates()
print("\nData after Removing Duplicates:")
print(df_no_duplicates)
# Inconsistent Formatting
df['Country'] = df['Country'].str.title()
# Stripping leading spaces in 'Name' column
df['Name'] = df['Name'].str.strip()
print("\nData after Inconsistent Formatting Handling:")
print(df)
Q1 = df['Purchase_Amount'].quantile(0.25)
Q3 = df['Purchase_Amount'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df_no_outliers = df[(df['Purchase_Amount'] >= lower_bound) & (df['Purchase_Amount']
<= upper_bound)]
print("\nData after Handling Outliers:")
print(df_no_outliers)

Output:-
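
With the sample Purchase_Amount values above, the IQR fences are wide enough that no row should be dropped, so the effect of the rule is hard to see. The sketch below applies the same 1.5 * IQR rule to a hypothetical series that contains one obvious outlier.

```python
import pandas as pd

# Hypothetical amounts with one extreme value.
amounts = pd.Series([100, 200, 150, 300, 250, 5000])

q1 = amounts.quantile(0.25)
q3 = amounts.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are treated as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = amounts[(amounts < lower) | (amounts > upper)]
kept = amounts[(amounts >= lower) & (amounts <= upper)]

print(f"Bounds: [{lower}, {upper}]")
print("Outliers:", outliers.tolist())
```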
7. Develop a program to create metadata for a dataset, including relevant
data descriptions.

Objective:- Generate metadata to describe the structure and content of datasets.

Code:-

import pandas as pd

# Load dataset
df = pd.read_csv('Book1.csv')
# Generate metadata
metadata = {
'columns': df.columns.tolist(),
'data_types': df.dtypes.to_dict(),
'missing_values': df.isnull().sum().to_dict(),
'descriptive_statistics': df.describe().to_dict()
}
# Output the metadata
print("Metadata for the dataset: \n")
print(metadata['columns'])
print(metadata['data_types'])
print(metadata['missing_values'])
print(metadata['descriptive_statistics'])

Output:-
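
Metadata is often stored alongside the dataset rather than only printed. The dictionary built above cannot be passed to json.dumps directly, because pandas dtype objects and NumPy integers are not JSON serializable. A possible approach, sketched here with a hypothetical two-column DataFrame, is to convert them to plain Python types first.

```python
import json
import pandas as pd

# Hypothetical dataset standing in for Book1.csv.
df_meta = pd.DataFrame({"id": [1, 2, 3], "city": ["Pune", None, "Delhi"]})

metadata = {
    "row_count": int(len(df_meta)),
    "columns": df_meta.columns.tolist(),
    # dtype objects are converted to their string names
    "data_types": df_meta.dtypes.astype(str).to_dict(),
    # NumPy integers are converted to built-in int
    "missing_values": {col: int(n) for col, n in df_meta.isnull().sum().items()},
}

metadata_json = json.dumps(metadata, indent=4)
print(metadata_json)
```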
8. Write Python code to perform data transformation tasks.

Objective:- Perform data transformation tasks like normalization, scaling, and
encoding.

Code:-

from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Example dataset
data = {'Age': [25, 30, 35, 40], 'Salary': [50000, 60000, 70000, 80000]}
df = pd.DataFrame(data)

# Min-Max Scaling of 'Age' and 'Salary' columns


scaler = MinMaxScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])

print("Transformed Data:")
print(df)

Output:-
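
The objective also mentions normalization and encoding, which the program above does not cover. The sketch below shows one way to do z-score standardization and one-hot encoding with pandas alone; the City column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data with one numeric and one categorical column.
people = pd.DataFrame({
    "Age": [25, 30, 35, 40],
    "City": ["Delhi", "Mumbai", "Delhi", "Pune"],
})

# z-score standardization: (x - mean) / standard deviation.
# ddof=0 uses the population standard deviation, as scikit-learn's StandardScaler does.
people["Age_z"] = (people["Age"] - people["Age"].mean()) / people["Age"].std(ddof=0)

# One-hot encoding: one 0/1 indicator column per category.
encoded = pd.get_dummies(people, columns=["City"], dtype=int)

print(encoded)
```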
9. Write Python code for Data Discretization.

Objective:- Apply data discretization to convert continuous data into discrete bins.

Code:-

import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [25, 30, 35, 40, 45, 50, 55, 60]})

# Define bins and labels


bins = [0, 30, 45, 100]
labels = ['Young', 'Middle-Aged', 'Old']

# Create a new 'Age_Group' column


df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels)

print("Discretized Data:")
print(df)

Output:-
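
pd.cut, used above, bins by fixed edges that the analyst chooses. When no natural edges exist, equal-frequency binning is a common alternative: pd.qcut picks the edges so that each bin holds roughly the same number of rows. A minimal sketch with the same ages, where the labels Q1 to Q4 are an assumption:

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40, 45, 50, 55, 60], name="Age")

# Equal-frequency binning into four quantile-based groups.
quartile = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

result = pd.DataFrame({"Age": ages, "Age_Quartile": quartile})
print(result)
```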
10. Create and visualize a graph from a dataset using a graph library.

Objective:- Visualize datasets by creating graphs using visualization libraries.

Code:-

import matplotlib.pyplot as plt


import pandas as pd

df = pd.DataFrame({
'Year': [2015, 2016, 2017, 2018, 2019],
'Sales': [100, 150, 200, 250, 300]
})

# Plotting
plt.plot(df['Year'], df['Sales'], marker='o')
plt.title('Sales Over the Years')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.grid(True)
plt.show()

Output:-
11. Case Study I

Objective:- Solve business case studies by analyzing the data and extracting
actionable insights.

Code:-

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
data = {'Age': [25, 30, 35, None, 45],
'Salary': [50000, 60000, None, 80000, 95000]}

df = pd.DataFrame(data)

# Handle missing values using mean imputation (each column is filled with its own mean)
imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])

# Feature scaling
scaler = StandardScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])

# Print the cleaned data


print("Cleaned Data:")
print(df)

Output:-
12. Case Study II

Objective:- Implement machine learning algorithms such as k-Nearest Neighbors (k-NN)
for classification and performance evaluation.

Code:-

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)

# Initialize the k-NN classifier with k=3


knn = KNeighborsClassifier(n_neighbors=3)

# Fit the model


knn.fit(X_train, y_train)

# Predict on the test set


y_pred = knn.predict(X_test)

# Evaluate the performance using accuracy score


accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy of k-NN classifier: {accuracy * 100:.2f}%')

Output:-
13. Case Study III

Objective:- Solve business case studies by analyzing the data and extracting
actionable insights.

Code:-

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Recreate the Iris train/test split from Case Study II so this program runs on its own
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Grid search for hyperparameter tuning
param_grid = {'n_neighbors': [1, 3, 5, 7, 9],
              'weights': ['uniform', 'distance'],
              'metric': ['euclidean', 'manhattan']}

grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best parameters from Grid Search
print(f"Best parameters: {grid_search.best_params_}")

# Evaluate the best model
best_knn = grid_search.best_estimator_
y_pred_best = best_knn.predict(X_test)

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_best))

Output:-
14. Implement a k-Nearest Neighbour (k-NN) classifier and evaluate its
performance on a given dataset.

Objective:- Implement machine learning algorithms such as k-Nearest Neighbors (k-NN)
for classification and performance evaluation.

Code:-

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Iris dataset


iris = load_iris()
X = iris.data # Features
y = iris.target # Target labels (species)

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)

knn = KNeighborsClassifier(n_neighbors=3)  # You can adjust k (the number of neighbors)

knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the k-NN classifier: {accuracy * 100:.2f}%")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)

# Classification Report (Precision, Recall, F1-Score)


print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Visualize the confusion matrix using seaborn


plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=iris.target_names,
yticklabels=iris.target_names)
plt.title("Confusion Matrix for k-NN Classifier")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()
Output:-
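
The program above relies on scikit-learn's KNeighborsClassifier. To show what the classifier does internally, the sketch below implements the basic idea with NumPy only: compute the Euclidean distance from a query point to every training point, take the k closest, and predict the majority label. It is an illustration, not scikit-learn's implementation; the function name knn_predict and the toy data are hypothetical.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Predict a label for each query point by majority vote of its k nearest neighbours."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    predictions = []
    for point in X_query:
        # Euclidean distance from this query point to every training point
        distances = np.sqrt(((X_train - point) ** 2).sum(axis=1))
        # Indices of the k smallest distances
        nearest = np.argsort(distances)[:k]
        # Majority vote among the labels of the nearest neighbours
        votes = Counter(y_train[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# Toy data: two well-separated clusters with labels 0 and 1.
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = [0, 0, 0, 1, 1, 1]

preds = knn_predict(X_toy, y_toy, [[1.1, 1.0], [5.0, 5.0]], k=3)
print(preds)
```

An odd k is commonly chosen for two-class problems so that the vote cannot tie.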
