DWM - Exp 1
:2103138 Div: C2
Steps:
1) Load the libraries and download the dataset from Kaggle or another source.
2) Read the file, selecting the file-read function appropriate to the file's format.
3) Describe the attributes: name, count of values, min, max, data type,
range, quartiles, percentiles, box plot, and outliers.
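Step 3 can be sketched as below. This is a minimal illustration, not the report's own code: the sample DataFrame and the helper name describe_column are hypothetical, and outliers are flagged with the usual box-plot rule (values beyond 1.5 times the interquartile range from the quartiles), which is an assumption about how outliers were meant to be identified.

```python
import pandas as pd

# Hypothetical sample standing in for the real dataset
df = pd.DataFrame({
    "Units Sold": [10, 12, 11, 13, 12, 95],
    "Unit Price": [2.5, 3.0, 2.75, 3.25, 3.0, 2.9],
})

def describe_column(col: pd.Series) -> dict:
    """Summarise one numeric attribute (hypothetical helper)."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    # Box-plot fences: points outside them are drawn as outliers
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "dtype": str(col.dtype),
        "count": int(col.count()),
        "min": col.min(),
        "max": col.max(),
        "range": col.max() - col.min(),
        "q1": q1,
        "median": col.median(),
        "q3": q3,
        "p90": col.quantile(0.90),  # 90th percentile
        "outliers": col[(col < lower) | (col > upper)].tolist(),
    }

# One summary dictionary per attribute name
summary = {name: describe_column(df[name]) for name in df.columns}
for name, stats in summary.items():
    print(name, stats)
```

For a quick overview of the same statistics, pandas also provides df.describe() and df.dtypes.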
Theory:
1. Data Preprocessing:
b. Data Cleaning:
Cleaning involves identifying and handling inconsistencies, missing values, and errors in
the data. Imputing missing values and rectifying anomalies ensures that downstream
analyses are not compromised by flawed data.
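A minimal sketch of imputing missing values is shown below. The sample data is hypothetical, and the choice of median for numeric columns and most frequent value for categorical columns is an assumption; other strategies may suit a given dataset better.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value in each column
raw = pd.DataFrame({
    "Region": ["Asia", "Europe", None, "Asia", "Asia"],
    "Units Sold": [120.0, np.nan, 95.0, 120.0, 80.0],
})

cleaned = raw.copy()

# Numeric attribute: fill gaps with the median, which is robust to outliers
cleaned["Units Sold"] = cleaned["Units Sold"].fillna(cleaned["Units Sold"].median())

# Categorical attribute: fill gaps with the most frequent value (the mode)
cleaned["Region"] = cleaned["Region"].fillna(cleaned["Region"].mode().iloc[0])

print(raw.isna().sum())      # missing values per column before cleaning
print(cleaned.isna().sum())  # missing values per column after cleaning
```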
c. Data Transformation:
d. Data Deduplication:
e. Data Reduction:
Outliers can skew analysis and modeling results. Identifying and dealing with outliers
through techniques like statistical tests or clustering can lead to more reliable insights.
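As an illustration of one statistical approach, the sketch below flags values whose z-score exceeds 3 in absolute value and shows two possible treatments: dropping the flagged rows, or capping them at the most extreme non-outlier values. The sample series and the threshold of 3 are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample: twenty typical values plus one extreme value
values = pd.Series([48, 49, 50, 51, 52] * 4 + [500], name="Units Sold")

# z-score: distance from the mean in units of standard deviation
z = (values - values.mean()) / values.std()
is_outlier = z.abs() > 3

# Treatment 1: drop the flagged values
kept = values[~is_outlier].reset_index(drop=True)

# Treatment 2: keep every row but cap flagged values at the non-outlier extremes
capped = values.clip(lower=kept.min(), upper=kept.max())

print(values[is_outlier].tolist())
print(len(kept), capped.max())
```

Note that the z-score test is unreliable on very small samples, because a single extreme value also inflates the standard deviation it is measured against.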
2. Data Transformation:
a. Normalisation:
Normalisation rescales numerical data to a common range, usually between 0 and 1. This
keeps features measured on large scales from dominating scale-sensitive algorithms, such
as those based on distances, so that each feature can contribute comparably to the analysis.
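The sketch below shows min-max normalisation alongside decimal scaling. The function names min_max_normalize and decimal_scaling_normalize and the sample values are hypothetical, chosen to sit next to the z_score_normalize function in the Code section; neither function guards against a constant column, where min-max would divide by zero.

```python
import numpy as np
import pandas as pd

def min_max_normalize(column: pd.Series) -> pd.Series:
    """Rescale to [0, 1]: x' = (x - min) / (max - min)."""
    return (column - column.min()) / (column.max() - column.min())

def decimal_scaling_normalize(column: pd.Series) -> pd.Series:
    """Divide by the smallest power of 10 that brings every |x| below 1."""
    max_abs = column.abs().max()
    j = int(np.floor(np.log10(max_abs))) + 1 if max_abs > 0 else 0
    return column / (10 ** j)

# Hypothetical sample values
profit = pd.Series([200.0, 450.0, 700.0, 950.0, 1200.0])

print(min_max_normalize(profit).tolist())
print(decimal_scaling_normalize(profit).tolist())
```

In the report's code these functions would be applied column by column to data_normalized_minmax and data_normalized_decimal, in the same way z_score_normalize is applied to data_normalized_zscore.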
c. Aggregation:
d. Feature Creation:
Feature creation involves generating new attributes from existing data. These new
features can capture patterns that are not immediately evident. For example, creating a
"customer loyalty score" based on purchase frequency and total spending can provide
valuable insights.
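The loyalty-score example could look like the sketch below. Everything in it is an assumption made for illustration: the column names, the sample orders, and the scoring rule (an equal-weight average of min-max-scaled purchase frequency and total spending).

```python
import pandas as pd

# Hypothetical order history: one row per purchase
orders = pd.DataFrame({
    "customer_id": ["A", "A", "A", "B", "C", "C"],
    "amount": [50.0, 30.0, 20.0, 400.0, 10.0, 15.0],
})

# Aggregate raw transactions into per-customer attributes
per_customer = orders.groupby("customer_id")["amount"].agg(
    purchase_frequency="count",
    total_spending="sum",
)

def min_max(col: pd.Series) -> pd.Series:
    """Scale a column to [0, 1] so both inputs are comparable."""
    return (col - col.min()) / (col.max() - col.min())

# New feature: equal-weight blend of the two scaled attributes (an assumed rule)
per_customer["loyalty_score"] = (
    0.5 * min_max(per_customer["purchase_frequency"])
    + 0.5 * min_max(per_customer["total_spending"])
)

print(per_customer)
```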
Binning involves dividing continuous data into discrete intervals or bins. This can
simplify complex datasets and reveal trends that might not be apparent in the raw data.
Binning can be particularly useful when dealing with numerical data that has a wide
range.
3. Data Discretization:
a. Benefits of Discretization:
b. Methods of Discretization:
There are various methods for discretizing data, including equal width binning, equal
frequency binning, and clustering-based binning. Equal width binning divides the data
range into intervals of equal width, while equal frequency binning ensures each interval
contains approximately the same number of data points. Clustering-based binning
groups data based on similarity measured by clustering algorithms.
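The three methods can be compared on a small skewed sample, as sketched below. pd.cut and pd.qcut are the pandas functions for equal width and equal frequency binning; the clustering-based variant is a hand-written one-dimensional k-means (the helper kmeans_1d_bins is hypothetical and is only one of many possible clustering choices).

```python
import numpy as np
import pandas as pd

# Hypothetical skewed sample: eight small values and one large value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)
labels3 = ["low", "mid", "high"]

# Equal width: three intervals of equal length across the range 1..100
equal_width = pd.cut(values, bins=3, labels=labels3)

# Equal frequency: interval edges placed at quantiles, about 3 points per bin
equal_freq = pd.qcut(values, q=3, labels=labels3)

def kmeans_1d_bins(x: np.ndarray, k: int, n_iter: int = 50) -> np.ndarray:
    """Assign each value to the nearest of k centres (simple 1-D k-means)."""
    # Deterministic start: centres spread over the quantiles of x
    centres = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        labels = np.abs(x[:, None] - centres[None, :]).argmin(axis=1)
        new_centres = np.array([
            x[labels == j].mean() if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels

cluster_bins = kmeans_1d_bins(values.to_numpy(), k=3)

print(equal_width.value_counts(sort=False).tolist())
print(equal_freq.value_counts(sort=False).tolist())
print(cluster_bins.tolist())
```

On this sample, equal width binning puts almost every point in the first bin because one large value stretches the range, while the other two methods spread the small values across bins.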
Discretization can affect the results of data analysis and modelling. While it simplifies
data, it may also lead to loss of information due to grouping similar values together. The
choice of binning method and the number of bins should be guided by the underlying
characteristics of the data and the objectives of the analysis.
4. Data Visualization:
a. Enhanced Understanding:
Data visualization transforms abstract data into visual cues, enabling viewers to quickly
grasp information and identify trends that might not be apparent in raw data. It provides
a clear overview of data distributions, correlations, and anomalies, making it easier to
derive insights.
b. Effective Communication:
Visualizations are powerful tools for communicating findings to both technical and
non-technical audiences. Complex data can be presented in a digestible manner,
allowing stakeholders to make informed decisions based on the visual representation of
information.
c. Visualisation Types:
There are various types of visualizations, each suited for different data characteristics
and objectives. Common types include bar charts, line graphs, scatter plots, histograms,
and heatmaps. Choosing the right visualization type depends on the data's nature and
the specific insights you want to convey.
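As a rough illustration of matching chart type to purpose, the sketch below draws a histogram (distribution of one attribute), a box plot (spread and outliers), and a scatter plot (relationship between two attributes) with matplotlib. The data is randomly generated for the example, the column names only mirror those used in the Code section, and the output filename is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data for illustration only
rng = np.random.default_rng(0)
units = rng.integers(100, 1000, size=200)
sample = pd.DataFrame({
    "Units Sold": units,
    "Total Profit": units * 12.5 + rng.normal(0, 500, size=200),
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram: how the values of one attribute are distributed
axes[0].hist(sample["Units Sold"], bins=20)
axes[0].set_title("Histogram: Units Sold")

# Box plot: median, quartiles, and points beyond the whiskers
axes[1].boxplot(sample["Total Profit"])
axes[1].set_title("Box plot: Total Profit")

# Scatter plot: relationship between two numeric attributes
axes[2].scatter(sample["Units Sold"], sample["Total Profit"], s=8)
axes[2].set_title("Scatter: Units Sold vs Total Profit")

fig.tight_layout()
fig.savefig("visualisation_sketch.png")
```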
Code:
Dataset: SalesRecords.csv
Preview:
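# Data selection and importing
import pandas as pd
# Load the dataset named above (assumes SalesRecords.csv is in the working directory)
data = pd.read_csv('SalesRecords.csv')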
# Removing Outliers: keep rows within 3 standard deviations on both columns
outlier_columns = ['Units Sold', 'Total Profit']
z_scores = (data[outlier_columns] - data[outlier_columns].mean()) / data[outlier_columns].std()
data = data[(z_scores.abs() < 3).all(axis=1)]
# Reset index
data = data.reset_index(drop=True)
def z_score_normalize(column):
    mean_value = column.mean()
    std_value = column.std()
    return (column - mean_value) / std_value
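# Numeric columns to process (assumption: every numeric column in the dataset)
numeric_columns = data.select_dtypes(include='number').columns
data_normalized_minmax = data.copy()
data_normalized_zscore = data.copy()
data_normalized_decimal = data.copy()
# Z-score Normalization, applied to each numeric column in turn
for column in numeric_columns:
    data_normalized_zscore[column] = z_score_normalize(data_normalized_zscore[column])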
data_normalized_minmax.to_csv('MinMaxNormalized.csv', index=False)
data_normalized_zscore.to_csv('ZScoreNormalized.csv', index=False)
data_normalized_decimal.to_csv('DecimalNormalized.csv', index=False)
# Define Discretization Function
def discretize_column(column, num_bins=5):
    # Equal width bins labelled 'Bin 1' .. 'Bin N'
    bin_labels = [f'Bin {i+1}' for i in range(num_bins)]
    return pd.cut(column, bins=num_bins, labels=bin_labels)
data_discretized = data.copy()
# Discretize numeric columns and save the result to a CSV file
for column in numeric_columns:
    data_discretized[column] = discretize_column(data_discretized[column])
data_discretized.to_csv('Discretized.csv', index=False)
# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns