CLC - Data Cleansing and Data Summary
We conduct a thorough assessment of data quality to identify any potential issues. This includes
checking the completeness, accuracy, consistency, and validity of the data. Data quality checks
involve examining missing values, outliers, inconsistencies, and data distributions.
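As an illustration, a minimal pandas sketch of such checks might look like the following; the file name and column handling are assumptions, not part of the original analysis:

import pandas as pd

# Load the dataset (file name is hypothetical)
df = pd.read_csv("survey_data.csv")

# Completeness: count missing values per column
print(df.isna().sum())

# Consistency: look for duplicate records
print("Duplicate rows:", df.duplicated().sum())

# Distributions and validity: summary statistics for numeric columns
print(df.describe())

# Accuracy spot-check: inspect the values used in categorical columns
for col in df.select_dtypes(include="object").columns:
    print(col, df[col].unique()[:10])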
We validate the sources of the data to ensure they are reputable and reliable. This involves verifying
the credibility of the data provider and conducting external research to confirm the accuracy and
authenticity of the data sources.
Cross-Referencing
We compare the data with other reliable sources or existing databases to check for consistency and
correctness. Cross-referencing the data with external sources helps us identify any discrepancies or
anomalies that need further investigation.
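For illustration, one way to sketch such a cross-reference in pandas is shown below; the external file, the customer_id key, and the region column are hypothetical:

import pandas as pd

df = pd.read_csv("survey_data.csv")          # our data (hypothetical file)
ref = pd.read_csv("external_reference.csv")  # trusted external source (hypothetical file)

# Join on a shared key and compare the overlapping field
merged = df.merge(ref, on="customer_id", how="left", suffixes=("", "_ref"))

# Rows where our value disagrees with the reference need further investigation
mismatches = merged[merged["region"] != merged["region_ref"]]
print(len(mismatches), "records disagree with the external source")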
We perform data sampling techniques to check the representativeness of the data. This involves
randomly selecting subsets of the data and analyzing them to assess whether the patterns and
relationships observed in the sample align with expectations. Additionally, performing statistical tests
and validation techniques can help confirm the reliability of the data.
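A simple version of this sampling check could be sketched as follows, assuming hypothetical purchase_amount and gender columns:

import pandas as pd

df = pd.read_csv("survey_data.csv")  # hypothetical file

# Draw a 20% random sample
sample = df.sample(frac=0.2, random_state=42)

# If the sample is representative, its summary statistics and category
# proportions should be close to those of the full dataset
print(df["purchase_amount"].describe())
print(sample["purchase_amount"].describe())
print(df["gender"].value_counts(normalize=True))
print(sample["gender"].value_counts(normalize=True))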
Expert Validation
We seek input from domain experts or stakeholders who have knowledge and expertise in the data
domain. They can provide insights and confirm the accuracy of the data based on their expertise and
experience.
By implementing these steps, we gain confidence in the reliability of the data and make informed
decisions about using it in our analysis. It is important to ensure data integrity and reliability to avoid
drawing incorrect conclusions or making flawed decisions based on unreliable data.
What problems did you find and how did you address them?
During the process of verifying the data's reliability, we identified several problems and issues.
Here are some common problems that arise and the corresponding steps we took to address them:
Missing Data
Missing data is a common problem in datasets. To address this issue, we used several techniques
depending on the extent and nature of the missingness. These include imputing missing values using
methods such as mean imputation and regression imputation, or using advanced imputation
techniques like multiple imputation and predictive modeling.
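A minimal sketch of simple imputation, assuming hypothetical income, occupation, age, and purchase_amount columns (more advanced multiple imputation would follow the same pattern with a different imputer):

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("survey_data.csv")  # hypothetical file

# Mean imputation for a numeric column
df["income"] = df["income"].fillna(df["income"].mean())

# Mode imputation for a categorical column
df["occupation"] = df["occupation"].fillna(df["occupation"].mode()[0])

# The same idea with scikit-learn, applied to several numeric columns at once
num_cols = ["age", "income", "purchase_amount"]
df[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])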
Outliers
Outliers are extreme values that deviate significantly from the majority of the data. They can impact
the analysis and interpretation of results. To address outliers, we applied various approaches:
identifying the cause of the outliers (e.g., data entry errors), validating their accuracy, and deciding
whether to remove them, transform them, or handle them separately in the analysis.
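One common rule of thumb for flagging outliers is the 1.5 x IQR criterion; a sketch with a hypothetical purchase_amount column:

import pandas as pd

df = pd.read_csv("survey_data.csv")  # hypothetical file

q1 = df["purchase_amount"].quantile(0.25)
q3 = df["purchase_amount"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside 1.5 * IQR of the quartiles for inspection
outliers = df[(df["purchase_amount"] < lower) | (df["purchase_amount"] > upper)]
print(outliers)

# Depending on the investigation, outliers can be dropped, capped, or kept
df["purchase_amount_capped"] = df["purchase_amount"].clip(lower, upper)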
Inconsistencies and Discrepancies
Inconsistencies and discrepancies in the data occur when there are conflicting values or errors in
data entry or data integration. We addressed these issues by carefully examining the data, cross-
referencing it with other reliable sources, and resolving any inconsistencies through data cleaning and
reconciliation processes.
Data Integrity
It is essential to verify the integrity and accuracy of the data to ensure that it aligns with expectations
and is reliable for analysis. This can involve conducting data audits, performing validation checks, and
comparing the data against known benchmarks or external sources. Addressing data integrity issues
may require data cleansing, data transformation, or obtaining additional data to fill gaps or correct
errors.
Skewed Distributions
Data that is highly skewed and does not follow a normal distribution can impact the validity of certain
statistical analyses. In such cases, data transformations or non-parametric approaches may be
employed to address the distributional issues and ensure appropriate analysis.
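For illustration, skewness can be measured and reduced with a log transform, or a non-parametric test can replace a t-test; the column and group names are assumptions:

import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("survey_data.csv").dropna(subset=["purchase_amount"])  # hypothetical file

print("Skewness before:", df["purchase_amount"].skew())
df["log_purchase"] = np.log1p(df["purchase_amount"])
print("Skewness after:", df["log_purchase"].skew())

# Non-parametric alternative to a two-sample t-test
male = df.loc[df["gender"] == "Male", "purchase_amount"]
female = df.loc[df["gender"] == "Female", "purchase_amount"]
print(stats.mannwhitneyu(male, female))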
Correlation Analysis
Correlation analysis measures the strength and direction of the linear relationship between two
continuous variables. We use it to determine whether there is a relationship between variables and
the degree to which they are associated. A positive correlation indicates that as one variable
increases, the other variable also tends to increase, while a negative correlation indicates an inverse
relationship.
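A sketch of this with scipy, using hypothetical age and purchase_amount columns:

import pandas as pd
from scipy import stats

df = pd.read_csv("survey_data.csv").dropna(subset=["age", "purchase_amount"])  # hypothetical file

# Pearson correlation between two continuous variables
r, p = stats.pearsonr(df["age"], df["purchase_amount"])
print("r =", round(r, 2), "p =", round(p, 3))

# Or a full correlation matrix for all numeric columns
print(df.corr(numeric_only=True))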
Regression Analysis
Regression analysis is used to analyze the relationship between a dependent variable and one or
more independent variables. It helps determine the nature and strength of the relationship and
allows for prediction or estimation based on the observed data. Simple linear regression examines the
relationship between two variables, while multiple regression can analyze the relationships between
multiple independent variables and a dependent variable.
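A minimal multiple regression sketch with statsmodels; the dependent and independent variable names are assumptions:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("survey_data.csv").dropna()  # hypothetical file

y = df["purchase_amount"]                   # dependent variable (assumed)
X = sm.add_constant(df[["age", "income"]])  # independent variables (assumed)

model = sm.OLS(y, X).fit()
print(model.summary())          # coefficients, R-squared, p-values
print(model.predict(X.head()))  # fitted values for the first few rows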
Chi-Square Test
The chi-square test is used to analyze the relationship between two categorical variables. It
determines whether there is a significant association or dependence between the variables and is
commonly used in cross-tabulation analysis.
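For example, with scipy the test can be run on a contingency table built with pd.crosstab; the column names are assumptions:

import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("survey_data.csv")  # hypothetical file

# Contingency table of two categorical variables
table = pd.crosstab(df["gender"], df["purchase_frequency"])

chi2, p, dof, expected = chi2_contingency(table)
print("chi2 =", round(chi2, 2), "p =", round(p, 3), "dof =", dof)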
Data Visualization
Visualizing data through graphs and charts provides insights into relationships. Scatter plots can
reveal the relationship between two continuous variables, while pie charts or stacked bar charts
display the relationship between categorical variables.
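A small matplotlib sketch of both chart types, with hypothetical columns:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("survey_data.csv")  # hypothetical file

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot: relationship between two continuous variables
ax1.scatter(df["age"], df["purchase_amount"], alpha=0.5)
ax1.set_xlabel("Age")
ax1.set_ylabel("Purchase amount")

# Bar chart: distribution of a categorical variable
df["location"].value_counts().plot(kind="bar", ax=ax2)
ax2.set_ylabel("Count")

plt.tight_layout()
plt.show()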
Are there any missing data?
Descriptive statistics provide a summary of the main characteristics of a dataset. They include
measures such as mean, median, mode, standard deviation, variance, minimum, maximum, and
quartiles. These statistics help us understand the central tendency, dispersion, and distribution of our
data.
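In pandas most of these measures are available in a single call; a short sketch:

import pandas as pd

df = pd.read_csv("survey_data.csv")  # hypothetical file

# Mean, standard deviation, min, max, and quartiles for every numeric column
print(df.describe())

# Median, mode, and variance have their own accessors
print(df["purchase_amount"].median())
print(df["purchase_amount"].mode())
print(df["purchase_amount"].var())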
Frequency tables summarize categorical data by displaying the frequency and count of each category.
They provide an overview of the distribution of categorical variables and help identify the most
common and rare categories.
Cross-tabulation, also known as a contingency table, is used to summarize the relationship between
two or more categorical variables. It presents the frequencies and proportions of each combination of
categories, allowing us to identify patterns and associations between variables.
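Both summaries are one-liners in pandas; the column names are assumptions:

import pandas as pd

df = pd.read_csv("survey_data.csv")  # hypothetical file

# Frequency table for a single categorical variable
print(df["occupation"].value_counts())

# Cross-tabulation of two categorical variables, shown as row proportions
print(pd.crosstab(df["gender"], df["purchase_frequency"], normalize="index"))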
Summary tables provide a comprehensive overview of the data by presenting key statistics for
different variables or groups. They can include measures like means, medians, standard deviations,
and counts for each variable, allowing us to compare and analyze different aspects of the data.
Visualizations such as bar charts, histograms, box plots, and scatter plots are effective in summarizing
data samples. They provide a visual representation of the data distribution, trends, and relationships,
making it easier to understand and interpret the findings.
Statistical tests are used to summarize and compare data samples. For example, t-tests or ANOVA
assess the differences between groups, chi-square tests can evaluate the relationship between
categorical variables, and correlation analysis can measure the strength of relationships between
continuous variables.
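A sketch of a two-sample t-test and a one-way ANOVA with scipy; the group and column names are assumptions:

import pandas as pd
from scipy import stats

df = pd.read_csv("survey_data.csv").dropna(subset=["purchase_amount"])  # hypothetical file

# t-test: compare mean purchase amount between two groups
male = df.loc[df["gender"] == "Male", "purchase_amount"]
female = df.loc[df["gender"] == "Female", "purchase_amount"]
print(stats.ttest_ind(male, female))

# ANOVA: compare means across more than two groups
groups = [g["purchase_amount"].values for _, g in df.groupby("location")]
print(stats.f_oneway(*groups))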
Analyze trends
What have you done to prevent the Simpson’s paradox?
Analyze and present data at the appropriate level of granularity. Simpson's paradox often arises
when data from different subgroups are combined without considering the underlying factors
that may be influencing the relationship. By analyzing and presenting data at a more granular
level, you can capture the nuances and potential confounding variables within each subgroup (a
small sketch of this subgroup check follows this list).
Consider and control for confounding variables. Confounding variables are factors that can affect
the relationship between the variables of interest. It is essential to identify and account for these
variables to ensure a more accurate analysis. This can be done through statistical techniques
such as stratification or regression analysis, where the effect of confounding variables is
controlled for.
Validate findings across subgroups. When analyzing data across different groups or categories, it
is important to validate the findings within each subgroup separately. By examining the trends
and relationships within each subgroup, you can assess whether the observed patterns hold true
consistently or whether there are any discrepancies.
Conduct sensitivity analyses. Sensitivity analyses involve testing the robustness of the results by
making adjustments or exploring alternative scenarios. This helps to evaluate the stability and
reliability of the findings and assess whether any changes in the data or assumptions could alter
the observed relationships.
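As a sketch of the subgroup check referred to above, the aggregated relationship can be compared with the relationship inside each segment; the column names are assumptions:

import pandas as pd

df = pd.read_csv("survey_data.csv").dropna(subset=["age", "purchase_amount"])  # hypothetical file

# Overall (aggregated) correlation
print("Overall:", round(df["age"].corr(df["purchase_amount"]), 2))

# Correlation within each subgroup; if the sign flips or changes markedly,
# the aggregated figure may be misleading (Simpson's paradox)
for location, group in df.groupby("location"):
    print(location, round(group["age"].corr(group["purchase_amount"]), 2))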
Descriptive analytics
Survey question                                                      N Valid   Missing
What is the age range of the customer?                               99        2
What is the gender of the customer?                                  99        2
Which location category best represents the customer's location?     99        2
What is the customer's occupation?                                   99        2
How frequently does the customer make purchases?
Segmenting the data can be helpful in understanding the behavior of different subgroups within the
dataset. It allows for a more detailed analysis and provides insights specific to each segment. If
needed, I would segment the data based on relevant variables such as customer demographics,
purchase behavior, or any other factors that are important to the business problem at hand.
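For instance, the data could be segmented by demographic variables and summarized per segment; the column names are illustrative assumptions:

import pandas as pd

df = pd.read_csv("survey_data.csv")  # hypothetical file

# Summarize purchase behavior for each demographic segment
segment_summary = (
    df.groupby(["gender", "location"])["purchase_amount"]
      .agg(["count", "mean", "median"])
)
print(segment_summary)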
Regarding redoing the sample, if there were specific issues or anomalies identified in the initial
sample, it might be necessary to revisit the sampling process and select a new sample that addresses
those concerns. This ensures that the data used for analysis is representative and reliable.
Identify and investigate outliers. Outliers are extreme values that significantly differ from other
data points. They can distort the analysis and affect the results. By identifying outliers and
examining their nature and potential causes, you can determine whether they are valid data
points or errors. Depending on the situation, outliers can be handled by either excluding them
from the analysis or transforming them to reduce their impact.
Validate data quality. Check for data inconsistencies, missing values, or incomplete records.
Validate the data against predefined rules or logical constraints to ensure accuracy and
completeness. If anomalies are found, appropriate actions such as data cleaning, imputation, or
data exclusion can be taken to address them.
Perform data quality checks. Conduct various data quality checks, such as cross-referencing data
with external sources, running consistency checks, and comparing data distributions or patterns.
This helps to identify any discrepancies or anomalies that may require further investigation or
correction.
Implement data validation rules. Establish and apply validation rules during data collection or
data entry processes to minimize errors. These rules can include range checks, format checks,
and logical checks to ensure the data is accurate and consistent.
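A few such rules expressed as simple checks in pandas; the ranges and allowed categories are assumptions for illustration:

import pandas as pd

df = pd.read_csv("survey_data.csv")  # hypothetical file

# Range check: ages should fall in a plausible interval
bad_age = df[~df["age"].between(18, 100)]

# Category check: only expected location categories are allowed
allowed = {"Urban", "Suburban", "Rural"}
bad_location = df[~df["location"].isin(allowed)]

# Logical check: purchase amounts cannot be negative
bad_amount = df[df["purchase_amount"] < 0]

for name, bad in [("age", bad_age), ("location", bad_location), ("amount", bad_amount)]:
    print(name + ":", len(bad), "records fail the rule")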