DW Lab File
S. No  Practical  Remarks
1   Write a program to read data from CSV files and display the content.
2   Write a program to read data from JSON files and display the content.
3   Load a dataset into a data structure (e.g., Data Frame) and perform basic data cleaning (e.g., handling missing values).
4   Solve a case study by filtering, grouping, and adding a new column to the dataset.
5   Design and implement a program to create a Data Mart.
6   Implementation of Data Cleansing using Python
7   Develop a program to create metadata for a dataset, including relevant data descriptions.
8   Write Python code to perform data transformation tasks.
9   Write Python code for Data Discretization.
10  Create and visualize a graph from a dataset using a graph library.
11  Case Study I
12  Case Study II
13  Case Study III
14  Implement a k-Nearest Neighbour (k-NN) classifier and evaluate its performance on a given dataset.
1. Write a program to read data from CSV files and display the content.
Objective:- Read and display data from various file formats such as CSV and JSON.
Code:-
import csv

def read_csv(file_path):
    try:
        # newline='' lets the csv module handle line endings itself
        with open(file_path, mode='r', newline='', encoding='utf-8') as file:
            csv_reader = csv.reader(file)
            # The first row is treated as the header
            header = next(csv_reader)
            print(f"Header: {header}")
            print("\nData:")
            # Each remaining row is printed as a list of strings
            for row in csv_reader:
                print(row)
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' was not found.")
    except Exception as e:
        print(f"An error occurred: {e}")

file_path = 'Book1.csv'
read_csv(file_path)
Output:-
2. Write a program to read data from JSON files and display the content.
Objective:- Load datasets into appropriate data structures (e.g., Pandas DataFrame)
for analysis.
Code:-
import json

def display_json_data():
    # Sample JSON-style data held in a Python dictionary
    json_data = {
        "name": "Alice",
        "age": 30,
        "city": "New York",
        "is_student": False,
        "courses": ["Math", "Science", "English"]
    }
    print("JSON Data Content:")
    # Serialize the dictionary to an indented JSON string for display
    print(json.dumps(json_data, indent=4))

display_json_data()
Output:-
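The practical asks for data read from a JSON file, while the listing above keeps its data in a dictionary. A minimal sketch of the file-based version follows. The read_json helper, the sample.json file name, and the sample record are assumptions made for illustration; the file is written to a temporary directory so that the example runs on its own.

```python
import json
import os
import tempfile


def read_json(file_path):
    """Load a JSON file and return the parsed object, or None on failure."""
    try:
        with open(file_path, mode="r", encoding="utf-8") as file:
            return json.load(file)
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' was not found.")
    except json.JSONDecodeError as e:
        print(f"Error: '{file_path}' is not valid JSON: {e}")
    return None


# Hypothetical sample file, created here only so the demo is self-contained
sample = {"name": "Alice", "age": 30, "courses": ["Math", "Science", "English"]}
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(sample_path, "w", encoding="utf-8") as file:
    json.dump(sample, file)

loaded = read_json(sample_path)
print("JSON Data Content:")
print(json.dumps(loaded, indent=4))
```

To read an existing file instead, pass its path to read_json and drop the lines that create the sample.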
3. Load a dataset into a data structure (e.g., Data Frame) and perform basic
data cleaning (e.g., handling missing values).
Objective:- Perform basic data cleaning by handling missing values and
inconsistencies.
Code:-
import pandas as pd
import numpy as np

# Create a dictionary of columns to hold the raw data
data = {
    'Customer ID': [1, 2, 3, 4, 5, 6],
    'Name': ['John Smith', 'jane Doe', 'jake Doe', 'john Smith', None, 'Alice Brown'],
    'Purchase Date': ['01/09/2024', '01/09/0024', '02/09/2024', '09/01/2024', '01/09/2024', '01/09/2024'],
    'Amount': [100, '$200', 300, 400, -500, 600],
    'Email': ['[email protected]', '[email protected]', 'N/A', '[email protected]', '', 'alice#example.com'],
    'Address': ['123 Maple St,NY', '456 Eim St,NY', '789 Pine St,NY', '123 Maple St,NY', '123 Maple St,NY', '']
}

# Create the DataFrame
df = pd.DataFrame(data)

# Parse dates as day/month/year; values that cannot be parsed become missing
df['Purchase Date'] = pd.to_datetime(df['Purchase Date'], format='%d/%m/%Y',
                                     errors='coerce').dt.strftime('%Y-%m-%d')

# Remove dollar signs and commas, then convert Amount to numeric
df['Amount'] = df['Amount'].astype(str).str.replace(r'[\$,]', '', regex=True).astype(float)

# Correct negative amounts (assume refunds should be positive)
df['Amount'] = df['Amount'].abs()

# Replace 'N/A' and 'Invalid' emails with missing values
df['Email'] = df['Email'].replace(['N/A', 'Invalid'], np.nan)
df['Email'] = df['Email'].replace('alice#example.com', '[email protected]')

# Correct the capitalization of names
df['Name'] = df['Name'].str.title()

# Drop rows that repeat the same Customer ID and Amount
df = df.drop_duplicates(subset=['Customer ID', 'Amount'])

# Treat empty addresses as missing, then fill them with a placeholder
df['Address'] = df['Address'].replace('', np.nan).fillna('Address not Available')

# Final cleansed data
print("\nCleansed Data")
print(df)
Output:-
4. Solve a case study by filtering, grouping, and adding a new column to the
dataset.
Code:-
import pandas as pd

# Load the sales dataset
file_path = 'sales_data.csv'
df = pd.read_csv(file_path)

# Display the original dataset
print("Original Dataset:")
print(df.head())

# Filter the dataset where Amount is greater than 800
filtered_df = df[df['Amount'] > 800]

# Display the filtered dataset
print("\nFiltered Dataset (Amount > 800):")
print(filtered_df)

# Group by Salesperson and total the sales amount and quantity
grouped_df = df.groupby('Salesperson').agg(
    total_sales=('Amount', 'sum'),
    total_quantity_sold=('Quantity', 'sum')
).reset_index()

# Display the grouped data
print("\nGrouped Data by Salesperson:")
print(grouped_df)

# Add a new column: the amount after applying the fractional discount
df['Total Sales after Discount'] = df['Amount'] * (1 - df['Discount'])
print("\nDataset with 'Total Sales after Discount' Column:")
print(df)
Output:-
5. Design and implement a program to create a Data Mart.
Objective:- Design and build a Data Mart for organizing and storing data for business
intelligence.
Code:-
import pandas as pd

data = {
    'Order_ID': [101, 102, 103, 104, 105, 106],
    'Salesperson': ['Alice', 'Bob', 'Alice', 'Charlie', 'Alice', 'Bob'],
    'Region': ['East', 'West', 'East', 'East', 'West', 'East'],
    'Amount': [1000, 1500, 800, 1200, 2000, 900],
    'Quantity': [10, 15, 8, 12, 20, 9],
    'Discount': [0.1, 0.05, 0.2, 0.15, 0.1, 0.1],
    'Date': ['2024-01-10', '2024-01-12', '2024-01-13', '2024-01-14', '2024-01-15', '2024-01-16']
}

# Convert the dictionary to a pandas DataFrame
df = pd.DataFrame(data)

# Data cleaning: report missing values and treat a missing discount as 0
print("\nMissing Values in Data:")
print(df.isnull().sum())
df['Discount'] = df['Discount'].fillna(0)

# Derived measure: sales amount after the fractional discount
df['Total_Sales_After_Discount'] = df['Amount'] * (1 - df['Discount'])

# Aggregating data by Region and Salesperson
df_aggregated = df.groupby(['Region', 'Salesperson']).agg(
    total_sales=('Total_Sales_After_Discount', 'sum'),
    total_quantity=('Quantity', 'sum')
).reset_index()

# Display the transformed (aggregated) data
print("\nAggregated Data (Sales by Region and Salesperson):")
print(df_aggregated)

# Save the aggregated table as the Data Mart
data_mart_path = 'sales_data_mart.csv'
df_aggregated.to_csv(data_mart_path, index=False)

# Confirm Data Mart creation
print(f"\nData Mart Created and Saved to: {data_mart_path}")
Output:-
6. Implementation of Data Cleansing using Python
Code:-
import pandas as pd
import numpy as np

data = {
    'Customer_ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'George',
             'Hannah', 'Ivan', 'Jack'],
    'Age': [25, 30, np.nan, 22, 35, 29, np.nan, 40, 23, 25],
    'Email': ['[email protected]', '[email protected]', '[email protected]', np.nan,
              '[email protected]', '[email protected]', '[email protected]', '[email protected]',
              '[email protected]', '[email protected]'],
    'Purchase_Amount': [100, 200, 150, 300, 250, 400, 450, 100, 500, 200],
    'Country': ['USA', 'USA', 'USA', 'Canada', 'USA', 'Canada', 'USA', 'USA',
                'USA', 'USA'],
}

# Convert to DataFrame
df = pd.DataFrame(data)
print("Original Dataset:")
print(df)

# Fill NaN values in 'Age' with the median age
df['Age'] = df['Age'].fillna(df['Age'].median())
# Fill NaN values in 'Email' with a placeholder
df['Email'] = df['Email'].fillna('[email protected]')
print("\nData after Handling Missing Values:")
print(df)

# Append a copy of the first row to simulate a duplicate, then remove duplicates
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
df_duplicate = pd.concat([df, df.iloc[[0]]], ignore_index=True)
df_no_duplicates = df_duplicate.drop_duplicates()
print("\nData after Removing Duplicates:")
print(df_no_duplicates)

# Standardize the capitalization of 'Country'
df['Country'] = df['Country'].str.title()
# Strip leading and trailing spaces in the 'Name' column
df['Name'] = df['Name'].str.strip()
print("\nData after Inconsistent Formatting Handling:")
print(df)

# Remove outliers in 'Purchase_Amount' using the 1.5 * IQR rule
Q1 = df['Purchase_Amount'].quantile(0.25)
Q3 = df['Purchase_Amount'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df_no_outliers = df[(df['Purchase_Amount'] >= lower_bound) &
                    (df['Purchase_Amount'] <= upper_bound)]
print("\nData after Handling Outliers:")
print(df_no_outliers)
Output:-
7. Develop a program to create metadata for a dataset, including relevant
data descriptions.
Code:-
import pandas as pd
# Load dataset
df = pd.read_csv('Book1.csv')
# Generate metadata
metadata = {
    'columns': df.columns.tolist(),
    'data_types': df.dtypes.to_dict(),
    'missing_values': df.isnull().sum().to_dict(),
    'descriptive_statistics': df.describe().to_dict()
}
# Output the metadata
print("Metadata for the dataset: \n")
print(metadata['columns'])
print(metadata['data_types'])
print(metadata['missing_values'])
print(metadata['descriptive_statistics'])
Output:-
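The task also asks for data descriptions, which the metadata dictionary above does not include. The sketch below shows one possible way to attach a description to each column and export the result as JSON. The small DataFrame stands in for Book1.csv, whose contents are not shown in this file, and the column names and description texts are hypothetical.

```python
import json

import pandas as pd

# Hypothetical stand-in for the dataset loaded from Book1.csv
df = pd.DataFrame({
    'Order_ID': [101, 102, 103],
    'Region': ['East', 'West', None],
    'Amount': [1000.0, 1500.0, 800.0],
})

# Hypothetical, hand-written descriptions keyed by column name
descriptions = {
    'Order_ID': 'Unique identifier of the order',
    'Region': 'Sales region in which the order was placed',
    'Amount': 'Order value before discount',
}

# One metadata record per column; values are converted to plain Python
# types so that the structure can be serialized to JSON
metadata = {
    'row_count': int(len(df)),
    'columns': [
        {
            'name': col,
            'dtype': str(df[col].dtype),
            'missing_values': int(df[col].isnull().sum()),
            'description': descriptions.get(col, ''),
        }
        for col in df.columns
    ],
}

metadata_json = json.dumps(metadata, indent=4)
print(metadata_json)
```

Converting dtypes to strings and counts to int matters here because json.dumps cannot serialize NumPy dtypes or NumPy integers directly.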
8. Write Python code to perform data transformation tasks.
Code:-
import pandas as pd

# Example dataset
data = {'Age': [25, 30, 35, 40], 'Salary': [50000, 60000, 70000, 80000]}
df = pd.DataFrame(data)
print("Transformed Data:")
print(df)
Output:-
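The listing above builds the example dataset but shows no transformation step. A minimal sketch of three common transformations on the same data follows. The choice of min-max normalization, z-score standardization, and a log transform, as well as the new column names, are assumptions and are not taken from the original program.

```python
import numpy as np
import pandas as pd

# Example dataset, as in the listing above
data = {'Age': [25, 30, 35, 40], 'Salary': [50000, 60000, 70000, 80000]}
df = pd.DataFrame(data)

# Min-max normalization: rescale Age to the range [0, 1]
age_min, age_max = df['Age'].min(), df['Age'].max()
df['Age_MinMax'] = (df['Age'] - age_min) / (age_max - age_min)

# Z-score standardization: zero mean, unit (population) standard deviation
df['Salary_Z'] = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std(ddof=0)

# Log transform: compresses large values
df['Salary_Log'] = np.log(df['Salary'])

print("Transformed Data:")
print(df)
```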
9. Write Python code for Data Discretization.
Objective:- Apply data discretization to convert continuous data into discrete bins.
Code:-
import pandas as pd
import numpy as np
print("Discretized Data:")
print(df)
Output:-
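The listing above prints df without showing how the bins were produced. A minimal sketch of two discretization approaches follows, assuming the Age and Salary example data from Practical 8. The bin edges, labels, and new column names are hypothetical.

```python
import pandas as pd

# Assumed example data (same values as the Practical 8 dataset)
df = pd.DataFrame({'Age': [25, 30, 35, 40],
                   'Salary': [50000, 60000, 70000, 80000]})

# Fixed bin edges: intervals are (20, 30] and (30, 40]
age_bins = [20, 30, 40]
age_labels = ['21-30', '31-40']
df['Age_Group'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels)

# Equal-frequency binning: split Salary at the median into two bins
df['Salary_Band'] = pd.qcut(df['Salary'], q=2, labels=['Low', 'High'])

print("Discretized Data:")
print(df)
```

pd.cut suits bins with a business meaning (age brackets, price tiers), while pd.qcut keeps the number of rows per bin roughly equal.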
10. Create and visualize a graph from a dataset using a graph library.
Code:-
import pandas as pd
import matplotlib.pyplot as plt

# Example dataset: yearly sales figures
df = pd.DataFrame({
    'Year': [2015, 2016, 2017, 2018, 2019],
    'Sales': [100, 150, 200, 250, 300]
})
# Plotting
plt.plot(df['Year'], df['Sales'], marker='o')
plt.title('Sales Over the Years')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.grid(True)
plt.show()
Output:-
11. Case Study I
Objective:- Solve business case studies by analyzing the data and extracting
actionable insights.
Code:-
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
data = {'Age': [25, 30, 35, None, 45],
        'Salary': [50000, 60000, None, 80000, 95000]}
df = pd.DataFrame(data)
# Feature scaling
scaler = StandardScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])
Output:-
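The listing above imports SimpleImputer but never uses it, so the missing values stay missing after scaling. A minimal sketch that imputes before scaling is given below. Mean imputation is an assumption; the source does not state which strategy was intended.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

data = {'Age': [25, 30, 35, None, 45],
        'Salary': [50000, 60000, None, 80000, 95000]}
df = pd.DataFrame(data)
cols = ['Age', 'Salary']

# Fill each missing value with its column mean (assumed strategy)
imputer = SimpleImputer(strategy='mean')
df[cols] = imputer.fit_transform(df[cols])
age_fill_value = imputer.statistics_[0]  # mean of the observed ages

# Feature scaling: zero mean and unit variance per column
scaler = StandardScaler()
df[cols] = scaler.fit_transform(df[cols])

print(f"Age values imputed with: {age_fill_value}")
print(df)
```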
12. Case Study II
Code:-
Output:-
13. Case Study III
Objective:- Solve business case studies by analyzing the data and extracting
actionable insights.
Code:-
# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_best))
Output:-
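The fragment above refers to y_test and y_pred_best without showing how they were produced. One plausible setup, given purely as an assumption, is a model tuned by grid search whose best estimator makes the predictions. The synthetic dataset, the decision tree model, and the parameter grid below are all hypothetical and are not taken from the original case study.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic dataset standing in for the case-study data
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Hypothetical parameter grid, searched with 3-fold cross-validation
param_grid = {'max_depth': [2, 4, 6], 'min_samples_leaf': [1, 5]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3)
grid.fit(X_train, y_train)

# Predictions from the best model found by the search
y_pred_best = grid.best_estimator_.predict(X_test)

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_best))
```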
14. Implement a k-Nearest Neighbour (k-NN) classifier and evaluate its
performance on a given dataset.
Code:-
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the k-NN classifier: {accuracy * 100:.2f}%")
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)
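
The listing above starts at knn.fit, so the imports, the dataset, the train/test split, and the classifier definition are not shown. A self-contained sketch of the whole pipeline follows. The Iris dataset, the 70/30 stratified split, the feature scaling step, and k = 5 are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumed dataset: Iris (150 samples, 4 features, 3 classes)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# k-NN is distance based, so features are scaled using training statistics only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Assumed number of neighbours: k = 5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the k-NN classifier: {accuracy * 100:.2f}%")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)
```

Fitting the scaler on the training split alone avoids leaking test-set statistics into the model.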