
DWM - EXP 1

Roll No.: 2103138    Div: C2

Aim: To understand the data through exploratory data analysis, covering:

⬤ Data cleaning - handling missing values, removing outliers

⬤ Data transformation - min-max normalization, z-score normalization, decimal scaling

⬤ Data discretization - binning

⬤ Data analysis and visualization

Steps:

1) Load the required libraries and download the dataset from Kaggle or another source.

2) Read the file, selecting the appropriate read function for the file's data type.

3) Describe the attributes: name, number of values, min, max, data type, range, quartiles, percentiles, box plot and outliers.

4) Perform cleaning, transformation, discretization and analysis.

5) Visualize the statistical description of the data as histograms, scatter plots and pie charts, and give the correlation matrix.

Theory:

1. Data Preprocessing:

Data preprocessing is a crucial phase in data warehouse management (DWM), where raw data is transformed and refined into a structured format suitable for analysis and reporting. Effective data preprocessing sets the foundation for accurate insights and informed decision-making. In this context, here are seven essential points to consider when undertaking data preprocessing within a data warehouse management framework.

Points for Data Preprocessing in DWM:

a. Data Collection and Integration:

Data preprocessing begins with collecting data from various sources, including databases, APIs, and external files. Integrating this diverse data ensures a comprehensive view and minimizes data silos.

b. Data Cleaning:

Cleaning involves identifying and handling inconsistencies, missing values, and errors in
the data. Imputing missing values and rectifying anomalies ensures that downstream
analyses are not compromised by flawed data.
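
A minimal pandas sketch of both approaches, assuming a DataFrame named data like the one loaded in the Code section below (the imputed column is illustrative):

import pandas as pd

# Option 1: drop rows that contain any missing value
cleaned = data.dropna()

# Option 2: impute instead, filling missing values with the column mean
data['Units Sold'] = data['Units Sold'].fillna(data['Units Sold'].mean())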

c. Data Transformation:

Data often requires transformation to be usable. This could involve normalisation (scaling data to a standard range), encoding categorical variables, or aggregating data to a suitable granularity for analysis.

d. Data Deduplication:

Duplicate records can distort analysis results. Implementing deduplication techniques ensures that the same data point is not counted multiple times, improving accuracy.

e. Data Reduction:

In cases of large datasets, data reduction techniques like dimensionality reduction (PCA, t-SNE) can help retain essential information while reducing computational complexity.
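
A minimal PCA sketch, assuming scikit-learn is installed and using three numeric columns from the sales dataset in the Code section below:

from sklearn.decomposition import PCA

# Project three numeric features onto the two strongest principal components
X = data[['Units Sold', 'Unit Price', 'Total Revenue']]
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of variance each component retains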

f. Outlier Detection and Handling:

Outliers can skew analysis and modeling results. Identifying and dealing with outliers
through techniques like statistical tests or clustering can lead to more reliable insights.
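
Besides the z-score rule used in the Code section below, a common alternative is the IQR test; a minimal sketch on one column:

# Keep rows whose 'Total Profit' lies within 1.5 IQRs of the quartiles
q1 = data['Total Profit'].quantile(0.25)
q3 = data['Total Profit'].quantile(0.75)
iqr = q3 - q1
data_filtered = data[data['Total Profit'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]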

g. Data Formatting and Validation:

Ensuring data consistency and adherence to defined formats is crucial. Validation checks are applied to confirm that the data meets the required standards before it is loaded into the data warehouse.
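
A minimal sketch of such checks as assertions; the rules themselves are illustrative assumptions about the sales data used later in this report, and presume the date columns are already parsed:

# Quantities should never be negative
assert (data['Units Sold'] >= 0).all(), 'negative Units Sold found'

# An order cannot ship before it is placed
assert (data['Order Date'] <= data['Ship Date']).all(), 'Ship Date precedes Order Date'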

2. Data Transformation:

Data transformation is a critical process in data preprocessing that involves converting raw data into a more suitable format for analysis, reporting, and modeling. It enhances the usability of the data by standardizing, scaling, and reorganizing it. Data transformation enables better insights and more accurate decision-making by preparing the data to be compatible with various algorithms and analytical techniques.

Five Points on Data Transformation:

a. Normalisation:

Normalisation scales numerical data to a common range, usually between 0 and 1. This
eliminates the impact of differing scales on algorithms and ensures that each feature
contributes equally to analysis.
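
For example, for the values 10, 20 and 30, min-max normalization maps 20 to (20 - 10) / (30 - 10) = 0.5.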

b. Encoding Categorical Variables:

Categorical variables, such as gender or product category, need to be encoded into numerical values for analysis. Techniques like one-hot encoding or label encoding are used to represent categorical data in a format that algorithms can process.
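
A minimal pandas sketch of both techniques, using two categorical columns from the sales dataset in the Code section below:

# One-hot encoding: each category becomes its own 0/1 indicator column
one_hot = pd.get_dummies(data['Sales Channel'], prefix='Channel')

# Label encoding: each category is mapped to an integer code
priority_codes = data['Order Priority'].astype('category').cat.codes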

c. Aggregation:

Aggregating data involves summarizing information by grouping it based on certain attributes. For instance, sales data might be aggregated by month to analyze monthly trends. Aggregation simplifies complex datasets and makes them more manageable.
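
For example, monthly revenue can be computed with a pandas groupby, assuming the 'Order Date' column has been parsed to datetime as in the Code section below:

# Total revenue per month
monthly_revenue = data.groupby(data['Order Date'].dt.to_period('M'))['Total Revenue'].sum()
print(monthly_revenue.head())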

d. Feature Creation:

Feature creation involves generating new attributes from existing data. These new
features can capture patterns that are not immediately evident. For example, creating a
"customer loyalty score" based on purchase frequency and total spending can provide
valuable insights.
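
The sales dataset used later in this report has no customer identifier, so a loyalty score cannot be derived from it; two features its columns do support are sketched below, assuming the date columns are parsed as in the Code section:

# Profit margin per order, derived from two existing columns
data['Profit Margin'] = data['Total Profit'] / data['Total Revenue']

# Shipping delay in days, derived from the two date columns
data['Shipping Days'] = (data['Ship Date'] - data['Order Date']).dt.days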

e. Binning and Discretization:

Binning involves dividing continuous data into discrete intervals or bins. This can
simplify complex datasets and reveal trends that might not be apparent in the raw data.
Binning can be particularly useful when dealing with numerical data that has a wide
range.

3. Data Discretization:

Data discretization is a data transformation technique that involves converting continuous data into a discrete format by grouping values into intervals or bins. This process simplifies complex data, reduces noise, and can make it easier to uncover patterns and trends that might not be immediately apparent in the raw data.

Points on Data Discretization:

a. Benefits of Discretization:

Discretization can simplify analysis and modeling processes by converting continuous data into categories. This can make data more manageable, especially when dealing with large datasets. Additionally, it can reduce the impact of outliers and small fluctuations in the data.

b. Methods of Discretization:

There are various methods for discretizing data, including equal width binning, equal
frequency binning, and clustering-based binning. Equal width binning divides the data
range into intervals of equal width, while equal frequency binning ensures each interval
contains approximately the same number of data points. Clustering-based binning
groups data based on similarity measured by clustering algorithms.
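
In pandas, equal width and equal frequency binning correspond to pd.cut and pd.qcut; a minimal sketch on one column of the sales dataset used below:

# Equal width: 5 intervals of equal span across the value range
equal_width = pd.cut(data['Units Sold'], bins=5)

# Equal frequency: 5 intervals holding roughly equal numbers of rows
equal_freq = pd.qcut(data['Units Sold'], q=5, duplicates='drop')

print(equal_width.value_counts())
print(equal_freq.value_counts())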

c. Impact on Analysis and Interpretation:

Discretization can affect the results of data analysis and modelling. While it simplifies
data, it may also lead to loss of information due to grouping similar values together. The
choice of binning method and the number of bins should be guided by the underlying
characteristics of the data and the objectives of the analysis.

4. Data Visualization:

Data visualization is the practice of representing data in graphical or visual formats to aid understanding, analysis, and communication of insights. By translating raw data into intuitive visual representations, data visualization enhances the ability to identify trends, patterns, and relationships within the data, making complex information more accessible and actionable.

Points on Data Visualization:

a. Enhanced Understanding:

Data visualization transforms abstract data into visual cues, enabling viewers to quickly
grasp information and identify trends that might not be apparent in raw data. It provides
a clear overview of data distributions, correlations, and anomalies, making it easier to
derive insights.

b. Effective Communication:

Visualizations are powerful tools for communicating findings to both technical and
non-technical audiences. Complex data can be presented in a digestible manner,
allowing stakeholders to make informed decisions based on the visual representation of
information.

c. Visualisation Types:

There are various types of visualizations, each suited for different data characteristics
and objectives. Common types include bar charts, line graphs, scatter plots, histograms,
and heatmaps. Choosing the right visualization type depends on the data's nature and
the specific insights you want to convey.

Code:

Dataset: SalesRecords.csv
Preview:
# Data selection and importing
import pandas as pd

# Load the dataset
data = pd.read_csv('SalesRecords.csv')

# Print the first 5 rows of the data
print(data.head())

# Summary statistics for the numeric columns
print(data.describe())

# Data Cleaning and Transformation
# Drop rows with missing values
data = data.dropna()

# Trim stray whitespace from the text columns
data['Country'] = data['Country'].str.strip()
data['Item Type'] = data['Item Type'].str.strip()
data['Sales Channel'] = data['Sales Channel'].str.strip()
data['Order Priority'] = data['Order Priority'].str.strip()

# Parse the date columns into datetime objects
data['Order Date'] = pd.to_datetime(data['Order Date'])
data['Ship Date'] = pd.to_datetime(data['Ship Date'])

# Remove duplicate rows
data = data.drop_duplicates()

# Removing outliers: keep rows within 3 standard deviations on both columns
outlier_cols = ['Units Sold', 'Total Profit']
z_scores = (data[outlier_cols] - data[outlier_cols].mean()) / data[outlier_cols].std()
data = data[(z_scores.abs() < 3).all(axis=1)]

# Reset index
data = data.reset_index(drop=True)

# Save the cleaned dataset to a new CSV file
data.to_csv('CleanedSalesRecords.csv', index=False)
# Define Normalization Functions
def minmax_normalize(column):
    # Min-max: rescale values into the [0, 1] range
    min_value = column.min()
    max_value = column.max()
    return (column - min_value) / (max_value - min_value)

def z_score_normalize(column):
    # Z-score: center on the mean, scale by the standard deviation
    mean_value = column.mean()
    std_value = column.std()
    return (column - mean_value) / std_value

import math

def decimal_normalize(column):
    # Decimal scaling: divide by 10^j, where j is the smallest integer
    # for which every scaled absolute value falls below 1
    max_abs = column.abs().max()
    j = math.floor(math.log10(max_abs)) + 1 if max_abs > 0 else 0
    return column / (10 ** j)

# Numeric columns for normalization
numeric_columns = ['Units Sold', 'Unit Price', 'Unit Cost',
                   'Total Revenue', 'Total Cost', 'Total Profit']

data_normalized_minmax = data.copy()
data_normalized_zscore = data.copy()
data_normalized_decimal = data.copy()

# Apply normalization functions and save to separate CSV files
for column in numeric_columns:
    # Min-Max Normalization
    data_normalized_minmax[column] = minmax_normalize(data_normalized_minmax[column])

    # Z-score Normalization
    data_normalized_zscore[column] = z_score_normalize(data_normalized_zscore[column])

    # Decimal Scaling Normalization
    data_normalized_decimal[column] = decimal_normalize(data_normalized_decimal[column])

data_normalized_minmax.to_csv('MinMaxNormalized.csv', index=False)
data_normalized_zscore.to_csv('ZScoreNormalized.csv', index=False)
data_normalized_decimal.to_csv('DecimalNormalized.csv', index=False)
# Define Discretization Function
def discretize_column(column, num_bins=5):
    # Equal-width binning with readable labels
    bin_labels = [f'Bin {i+1}' for i in range(num_bins)]
    return pd.cut(column, bins=num_bins, labels=bin_labels)

# Numeric columns for discretization
numeric_columns = ['Units Sold', 'Unit Price', 'Unit Cost',
                   'Total Revenue', 'Total Cost', 'Total Profit']

data_discretized = data.copy()

# Discretize numeric columns and save to a separate CSV file
for column in numeric_columns:
    data_discretized[column] = discretize_column(data_discretized[column])

data_discretized.to_csv('Discretized.csv', index=False)

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Histograms for Numeric Columns
numeric_columns = ['Units Sold', 'Unit Price', 'Unit Cost',
                   'Total Revenue', 'Total Cost', 'Total Profit']
data[numeric_columns].hist(bins=20, figsize=(15, 10))
plt.suptitle('Histograms for Numeric Columns', y=1.02)
plt.show()

# Box Plots for Numeric Columns
plt.figure(figsize=(15, 10))
sns.boxplot(data=data[numeric_columns])
plt.title('Box Plots for Numeric Columns')
plt.show()

# Count Plot for Categorical Columns
categorical_columns = ['Item Type', 'Sales Channel', 'Order Priority']
for column in categorical_columns:
    plt.figure(figsize=(10, 6))
    sns.countplot(data=data, x=column, order=data[column].value_counts().index)
    plt.title(f'Count Plot for {column}')
    plt.xticks(rotation=45)
    plt.show()
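
Step 5 also calls for a correlation matrix; a minimal sketch using a seaborn heatmap over the same numeric columns:

# Correlation Matrix for Numeric Columns
corr = data[numeric_columns].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix for Numeric Columns')
plt.show()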
