
DWM - EXP 1

Roll No.: 2103138    Div: C2

Aim: To understand the data through exploratory data analysis, covering:

⬤ Data cleaning - handling missing values, removing outliers

⬤ Data transformation - min-max normalization, z-score normalization, decimal scaling

⬤ Data discretization - binning

⬤ Data analysis and visualization

Steps:

1) Load the required libraries and download the dataset from Kaggle or another source.

2) Read the file, selecting the appropriate read function for the file's data type.

3) Describe the attributes: name, number of values, min, max, data type, range, quartiles, percentiles, box plot and outliers.

4) Perform cleaning, transformation, discretization and analysis.

5) Visualize the statistical description of the data as histograms, scatter plots and pie charts, and give the correlation matrix.

Theory:

1. Data Preprocessing:

Data preprocessing is a crucial phase in data warehouse management (DWM), where raw data is transformed and refined into a structured format suitable for analysis and reporting. Effective data preprocessing sets the foundation for accurate insights and informed decision-making. In this context, here are seven essential points to consider when undertaking data preprocessing within a data warehouse management framework.

Points for Data Preprocessing in DWM:

a. Data Collection and Integration:

Data preprocessing begins with collecting data from various sources, including databases, APIs, and external files. Integrating this diverse data ensures a comprehensive view and minimizes data silos.

b. Data Cleaning:

Cleaning involves identifying and handling inconsistencies, missing values, and errors in
the data. Imputing missing values and rectifying anomalies ensures that downstream
analyses are not compromised by flawed data.
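
A minimal pandas sketch of both approaches, assuming a DataFrame named data like the one loaded in the Code section below (the imputed column is illustrative):

import pandas as pd

# Option 1: drop rows that contain any missing value
cleaned = data.dropna()

# Option 2: impute instead, filling missing values with the column mean
data['Units Sold'] = data['Units Sold'].fillna(data['Units Sold'].mean())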

c. Data Transformation:

Data often requires transformation to be usable. This could involve normalisation (scaling data to a standard range), encoding categorical variables, or aggregating data to a suitable granularity for analysis.

d. Data Deduplication:

Duplicate records can distort analysis results. Implementing deduplication techniques ensures that the same data point is not counted multiple times, improving accuracy.

e. Data Reduction:

In cases of large datasets, data reduction techniques like dimensionality reduction (PCA, t-SNE) can help retain essential information while reducing computational complexity.
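
A minimal PCA sketch, assuming scikit-learn is installed and using three numeric columns from the sales dataset in the Code section below:

from sklearn.decomposition import PCA

# Project three numeric features onto the two strongest principal components
X = data[['Units Sold', 'Unit Price', 'Total Revenue']]
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of variance each component retains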

f. Outlier Detection and Handling:

Outliers can skew analysis and modeling results. Identifying and dealing with outliers
through techniques like statistical tests or clustering can lead to more reliable insights.
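
Besides the z-score rule used in the Code section below, a common alternative is the IQR test; a minimal sketch on one column:

# Keep rows whose 'Total Profit' lies within 1.5 IQRs of the quartiles
q1 = data['Total Profit'].quantile(0.25)
q3 = data['Total Profit'].quantile(0.75)
iqr = q3 - q1
data_filtered = data[data['Total Profit'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]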

g. Data Formatting and Validation:

Ensuring data consistency and adherence to defined formats is crucial. Validation checks are applied to confirm that the data meets the required standards before it is loaded into the data warehouse.
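
A minimal sketch of such checks as assertions; the rules themselves are illustrative assumptions about the sales data used later in this report, and presume the date columns are already parsed:

# Quantities should never be negative
assert (data['Units Sold'] >= 0).all(), 'negative Units Sold found'

# An order cannot ship before it is placed
assert (data['Order Date'] <= data['Ship Date']).all(), 'Ship Date precedes Order Date'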

2. Data Transformation:

Data transformation is a critical process in data preprocessing that involves converting raw data into a more suitable format for analysis, reporting, and modeling. It enhances the usability of the data by standardizing, scaling, and reorganizing it. Data transformation enables better insights and more accurate decision-making by preparing the data to be compatible with various algorithms and analytical techniques.

Five Points on Data Transformation:

a. Normalisation:

Normalisation scales numerical data to a common range, usually between 0 and 1. This
eliminates the impact of differing scales on algorithms and ensures that each feature
contributes equally to analysis.
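
For example, for the values 10, 20 and 30, min-max normalization maps 20 to (20 - 10) / (30 - 10) = 0.5.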

b. Encoding Categorical Variables:

Categorical variables, such as gender or product category, need to be encoded into numerical values for analysis. Techniques like one-hot encoding or label encoding are used to represent categorical data in a format that algorithms can process.
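
A minimal pandas sketch of both techniques, using two categorical columns from the sales dataset in the Code section below:

# One-hot encoding: each category becomes its own 0/1 indicator column
one_hot = pd.get_dummies(data['Sales Channel'], prefix='Channel')

# Label encoding: each category is mapped to an integer code
priority_codes = data['Order Priority'].astype('category').cat.codes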

c. Aggregation:

Aggregating data involves summarizing information by grouping it based on certain attributes. For instance, sales data might be aggregated by month to analyze monthly trends. Aggregation simplifies complex datasets and makes them more manageable.
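
For example, monthly revenue can be computed with a pandas groupby, assuming the 'Order Date' column has been parsed to datetime as in the Code section below:

# Total revenue per month
monthly_revenue = data.groupby(data['Order Date'].dt.to_period('M'))['Total Revenue'].sum()
print(monthly_revenue.head())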

d. Feature Creation:

Feature creation involves generating new attributes from existing data. These new
features can capture patterns that are not immediately evident. For example, creating a
"customer loyalty score" based on purchase frequency and total spending can provide
valuable insights.
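
The sales dataset used later in this report has no customer identifier, so a loyalty score cannot be derived from it; two features its columns do support are sketched below, assuming the date columns are parsed as in the Code section:

# Profit margin per order, derived from two existing columns
data['Profit Margin'] = data['Total Profit'] / data['Total Revenue']

# Shipping delay in days, derived from the two date columns
data['Shipping Days'] = (data['Ship Date'] - data['Order Date']).dt.days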

e. Binning and Discretization:

Binning involves dividing continuous data into discrete intervals or bins. This can
simplify complex datasets and reveal trends that might not be apparent in the raw data.
Binning can be particularly useful when dealing with numerical data that has a wide
range.

3. Data Discretization:

Data discretization is a data transformation technique that involves converting continuous data into a discrete format by grouping values into intervals or bins. This process simplifies complex data, reduces noise, and can make it easier to uncover patterns and trends that might not be immediately apparent in the raw data.

Points on Data Discretization:

a. Benefits of Discretization:

Discretization can simplify analysis and modeling processes by converting continuous data into categories. This can make data more manageable, especially when dealing with large datasets. Additionally, it can reduce the impact of outliers and small fluctuations in the data.

b. Methods of Discretization:

There are various methods for discretizing data, including equal width binning, equal
frequency binning, and clustering-based binning. Equal width binning divides the data
range into intervals of equal width, while equal frequency binning ensures each interval
contains approximately the same number of data points. Clustering-based binning
groups data based on similarity measured by clustering algorithms.
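
In pandas, equal width and equal frequency binning correspond to pd.cut and pd.qcut; a minimal sketch on one column of the sales dataset used below:

# Equal width: 5 intervals of equal span across the value range
equal_width = pd.cut(data['Units Sold'], bins=5)

# Equal frequency: 5 intervals holding roughly equal numbers of rows
equal_freq = pd.qcut(data['Units Sold'], q=5, duplicates='drop')

print(equal_width.value_counts())
print(equal_freq.value_counts())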

c. Impact on Analysis and Interpretation:

Discretization can affect the results of data analysis and modelling. While it simplifies
data, it may also lead to loss of information due to grouping similar values together. The
choice of binning method and the number of bins should be guided by the underlying
characteristics of the data and the objectives of the analysis.

4. Data Visualization:

Data visualization is the practice of representing data in graphical or visual formats to aid understanding, analysis, and communication of insights. By translating raw data into intuitive visual representations, data visualization enhances the ability to identify trends, patterns, and relationships within the data, making complex information more accessible and actionable.

Points on Data Visualization:

a. Enhanced Understanding:

Data visualization transforms abstract data into visual cues, enabling viewers to quickly
grasp information and identify trends that might not be apparent in raw data. It provides
a clear overview of data distributions, correlations, and anomalies, making it easier to
derive insights.

b. Effective Communication:

Visualizations are powerful tools for communicating findings to both technical and
non-technical audiences. Complex data can be presented in a digestible manner,
allowing stakeholders to make informed decisions based on the visual representation of
information.

c. Visualisation Types:

There are various types of visualizations, each suited for different data characteristics
and objectives. Common types include bar charts, line graphs, scatter plots, histograms,
and heatmaps. Choosing the right visualization type depends on the data's nature and
the specific insights you want to convey.

Code:

Dataset: SalesRecords.csv
Preview:
# Data selection and importing
import pandas as pd

# Load the dataset
data = pd.read_csv('SalesRecords.csv')

# Print the first 5 rows of the data
print(data.head())

# Summary statistics for the numeric columns
print(data.describe())

# Data Cleaning and Transformation
# Drop rows with missing values
data = data.dropna()

# Trim stray whitespace from the text columns
data['Country'] = data['Country'].str.strip()
data['Item Type'] = data['Item Type'].str.strip()
data['Sales Channel'] = data['Sales Channel'].str.strip()
data['Order Priority'] = data['Order Priority'].str.strip()

# Parse the date columns into datetime objects
data['Order Date'] = pd.to_datetime(data['Order Date'])
data['Ship Date'] = pd.to_datetime(data['Ship Date'])

# Remove duplicate rows
data = data.drop_duplicates()

# Removing outliers: keep rows within 3 standard deviations on both columns
outlier_cols = ['Units Sold', 'Total Profit']
z_scores = (data[outlier_cols] - data[outlier_cols].mean()) / data[outlier_cols].std()
data = data[(z_scores.abs() < 3).all(axis=1)]

# Reset index
data = data.reset_index(drop=True)

# Save the cleaned dataset to a new CSV file
data.to_csv('CleanedSalesRecords.csv', index=False)
# Define Normalization Functions
def minmax_normalize(column):
    # Min-max: rescale values into the [0, 1] range
    min_value = column.min()
    max_value = column.max()
    return (column - min_value) / (max_value - min_value)

def z_score_normalize(column):
    # Z-score: center on the mean, scale by the standard deviation
    mean_value = column.mean()
    std_value = column.std()
    return (column - mean_value) / std_value

import math

def decimal_normalize(column):
    # Decimal scaling: divide by 10^j, where j is the smallest integer
    # for which every scaled absolute value falls below 1
    max_abs = column.abs().max()
    j = math.floor(math.log10(max_abs)) + 1 if max_abs > 0 else 0
    return column / (10 ** j)

# Numeric columns for normalization
numeric_columns = ['Units Sold', 'Unit Price', 'Unit Cost',
                   'Total Revenue', 'Total Cost', 'Total Profit']

data_normalized_minmax = data.copy()
data_normalized_zscore = data.copy()
data_normalized_decimal = data.copy()

# Apply normalization functions and save to separate CSV files
for column in numeric_columns:
    # Min-Max Normalization
    data_normalized_minmax[column] = minmax_normalize(data_normalized_minmax[column])

    # Z-score Normalization
    data_normalized_zscore[column] = z_score_normalize(data_normalized_zscore[column])

    # Decimal Scaling Normalization
    data_normalized_decimal[column] = decimal_normalize(data_normalized_decimal[column])

data_normalized_minmax.to_csv('MinMaxNormalized.csv', index=False)
data_normalized_zscore.to_csv('ZScoreNormalized.csv', index=False)
data_normalized_decimal.to_csv('DecimalNormalized.csv', index=False)
# Define Discretization Function
def discretize_column(column, num_bins=5):
    # Equal-width binning with readable labels
    bin_labels = [f'Bin {i+1}' for i in range(num_bins)]
    return pd.cut(column, bins=num_bins, labels=bin_labels)

# Numeric columns for discretization
numeric_columns = ['Units Sold', 'Unit Price', 'Unit Cost',
                   'Total Revenue', 'Total Cost', 'Total Profit']

data_discretized = data.copy()

# Discretize numeric columns and save to a separate CSV file
for column in numeric_columns:
    data_discretized[column] = discretize_column(data_discretized[column])

data_discretized.to_csv('Discretized.csv', index=False)

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Histograms for Numeric Columns
numeric_columns = ['Units Sold', 'Unit Price', 'Unit Cost',
                   'Total Revenue', 'Total Cost', 'Total Profit']
data[numeric_columns].hist(bins=20, figsize=(15, 10))
plt.suptitle('Histograms for Numeric Columns', y=1.02)
plt.show()

# Box Plots for Numeric Columns
plt.figure(figsize=(15, 10))
sns.boxplot(data=data[numeric_columns])
plt.title('Box Plots for Numeric Columns')
plt.show()

# Count Plot for Categorical Columns
categorical_columns = ['Item Type', 'Sales Channel', 'Order Priority']
for column in categorical_columns:
    plt.figure(figsize=(10, 6))
    sns.countplot(data=data, x=column, order=data[column].value_counts().index)
    plt.title(f'Count Plot for {column}')
    plt.xticks(rotation=45)
    plt.show()
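
Step 5 also calls for a correlation matrix; a minimal sketch using a seaborn heatmap over the same numeric columns:

# Correlation Matrix for Numeric Columns
corr = data[numeric_columns].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix for Numeric Columns')
plt.show()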
