
DATA SCIENCE FUNDAMENTALS
DSC293
Lecture 3-4
Dr. Hufsa Mohsin
TYPES OF DATA
 Qualitative or Categorical Data describes the object under consideration using a finite set of discrete classes.
 Smartphone example: the current rating, the color of the phone, the category of the phone

 Nominal
 These are values that do not possess a natural ordering.
 The color of a smartphone can be considered a nominal data type, as we cannot compare one color with another.
 Mobile phone categories: midrange, budget segment, or premium smartphone

 Ordinal
 These values have a natural ordering while maintaining their class of values.
 Clothing sizes can easily be sorted according to their name tags in the order small < medium < large.
 The grading system used to mark candidates in a test is also ordinal: an A+ is definitely better than a B grade.
QUANTITATIVE DATA TYPE
 This data type quantifies things using numerical values, making the data countable or measurable in nature.
 The price of a smartphone, the discount offered, the number of ratings on a product, the processor frequency of a smartphone, or the RAM of that particular phone.
 Discrete
 Numerical values that are integers or whole numbers are placed under this category.
 The number of speakers in the phone, cameras, cores in the processor, or the number of SIMs supported.
 Discrete data cannot be measured; it can only be counted.
 Discrete data is often presented through charts, including bar charts, pie charts, and tally charts.

 Continuous
 Fractional numbers are considered continuous values.
 The operating frequency of the processors, the Android version of the phone, the Wi-Fi frequency, the temperature of the cores, and so on.
 Unlike discrete data, which takes whole, fixed values, continuous data can be broken down into smaller pieces and can take any value in a range.
 Continuous data is represented using a line graph whose highs and lows reflect value fluctuation over a certain period of time.
DATA FORMATS
 XLS/XLSX
 JSON
 XML
 MongoDB
 SQL
DATA IN R
 Structured
 Working with data from file
 The most common ready-to-go data formats are the family of tabular formats (e.g. CSV), as sketched below.
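A minimal sketch of loading a tabular file into R; the file name smartphones.csv and its columns are assumptions for illustration:

  # Read a tabular file and inspect its structure
  smartphones <- read.csv("smartphones.csv", stringsAsFactors = FALSE)
  str(smartphones)  # column names and types

  # Encode the qualitative columns from the earlier slides
  smartphones$color <- factor(smartphones$color)            # nominal: no natural order
  smartphones$size  <- factor(smartphones$size,
                              levels = c("small", "medium", "large"),
                              ordered = TRUE)               # ordinal: small < medium < large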
INCOMPLETE DATA
 The following slides cover the kinds of problems that can be found in a particular data set and how they can be fixed (or are expected to be fixed).
 Incomplete Data: This is by far the most common data quality issue in data sets. Key columns are missing information, causing a downstream analytics impact.
 How to fix this issue?
 The best way to fix this is to put a reconciliation framework control in place. The control checks the number of records passing through your analytical layers and alerts when records have gone missing, as in the sketch below.
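A minimal reconciliation sketch in base R; the data frame names (raw_layer, curated_layer) and the key column are assumptions:

  # Compare records between two analytical layers and alert on loss
  reconcile <- function(source_df, target_df, key) {
    missing_keys <- setdiff(source_df[[key]], target_df[[key]])
    if (length(missing_keys) > 0)
      warning(length(missing_keys), " records went missing between layers")
    missing_keys
  }
  lost <- reconcile(raw_layer, curated_layer, key = "transaction_id")

  # Completeness check: count missing values per column
  colSums(is.na(curated_layer))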
DEFAULT VALUES
 Ever analysed your data and found 01/01/1891 as a date for a transaction?
Unless your customer base comprises 130-year-old individuals, this is likely
a case of using default values. This is especially a problem if there is a lack
of documentation.
 How to fix this issue?
 The best way to fix this is to profile the data and understand the pattern of why default values were used. Usually, engineers insert a default when the real value is unavailable. A profiling sketch follows.
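A profiling sketch in base R; the txns data frame, its date column, and the specific default date are assumptions:

  # Frequency-profile the date column: a default value will dominate the counts
  date_counts <- sort(table(txns$transaction_date), decreasing = TRUE)
  head(date_counts)

  # Flag rows carrying the suspected default
  is_default <- txns$transaction_date == as.Date("1891-01-01")
  sum(is_default)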
DATA FORMAT INCONSISTENCIES
 String columns predominantly suffer from this problem, where the same data can be stored in many formats.
 For example, a customer’s first and last name stored in different cases, or an email address without the correct formatting.
 It occurs when multiple systems store information without an agreed data format.
 How to fix this issue?
 To fix this, the data needs to be homogenized (standardized) across the source systems, or at least in the data pipeline when fed to the data lake or warehouse, as sketched below.
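A standardization sketch in base R; the customers data frame and its columns are assumptions, and the email pattern is deliberately simple:

  # Normalize case and strip stray whitespace
  customers$first_name <- trimws(tolower(customers$first_name))
  customers$last_name  <- trimws(tolower(customers$last_name))
  customers$email      <- trimws(tolower(customers$email))

  # Flag email addresses that fail a basic format check
  bad_email <- !grepl("^[^@]+@[^@]+\\.[^@]+$", customers$email)
  customers[bad_email, "email"]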
DUPLICATE DATA
 Reasonably straightforward to spot, quite tricky to fix. If a critical attribute is populated with dirty duplicate data, it will break all the key downstream processes. It can also cause other data quality issues.
 How to fix this issue?
 To fix this, a master data management control needs to be implemented, even one as basic as a uniqueness check.
 This control will check for exact duplicate records and purge one of them.
 It can also send a notification about the other record to the data engineer or steward for investigation, as in the sketch below.
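A minimal uniqueness check in base R; the customers data frame is an assumption:

  # Find exact duplicate rows, keep the first occurrence, and set the rest aside
  dupes <- duplicated(customers)
  if (any(dupes)) {
    message(sum(dupes), " duplicate records found")
    for_review <- customers[dupes, ]   # hand these to the data steward
    customers  <- customers[!dupes, ]  # purge the duplicates
  }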
CROSS SYSTEM INCONSISTENCIES
 Very common in large organizations that have grown by acquisitions and mergers: multiple legacy source systems each have a slightly different view of the world, so customer name, address, or DOB carry inconsistent or incorrect information.
 How to fix this issue?
 As above, a master data management solution must be implemented to ensure all the different pieces of information are matched into a single record.
 This matching doesn’t need to be exact; it can be fuzzy, based on a threshold match percentage, as in the sketch below.
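A fuzzy-matching sketch in base R using Levenshtein edit distance; the 0.8 threshold is an assumption:

  # Similarity = 1 - edit distance / length of the longer string
  fuzzy_match <- function(a, b, threshold = 0.8) {
    dist <- mapply(adist, a, b)                  # pairwise Levenshtein distance
    sim  <- 1 - dist / pmax(nchar(a), nchar(b))
    sim >= threshold
  }

  fuzzy_match("Jonathan Smith", "Jonathon Smith")  # TRUE: likely the same customer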
ORPHANED DATA
 This data quality issue relates to data inconsistency problems where data exists in one system and not the other. If a customer exists in table A but their account doesn’t exist in table B, the customer would be classed as an orphan customer.
 On the other hand, if an account exists in table B but has no associated customer, it would be classed as an orphan account. A data quality rule that checks for consistency each time data is ingested into tables A and B will help spot the issue.
 How to fix this issue?
 To remediate this, the underlying cause of the inconsistency would need to be investigated in the source system. A consistency check like the one sketched below can flag the orphans.
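An orphan-check sketch in base R; the data frame and column names are assumptions:

  # Customers with no matching account, and accounts with no matching customer
  orphan_customers <- customers[!(customers$customer_id %in% accounts$customer_id), ]
  orphan_accounts  <- accounts[!(accounts$customer_id %in% customers$customer_id), ]

  nrow(orphan_customers)  # orphan customers in table A
  nrow(orphan_accounts)   # orphan accounts in table B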
IRRELEVANT DATA
 Nothing is more frustrating than capturing ALL the available information.
Besides the regulatory restrictions of data minimization, capturing all the
available data is more expensive and less sustainable.
 How to fix this issue?
 To fix this, data capturing principles need to be agreed upon: each data attribute should have an end goal; otherwise, it should not be captured.
REDUNDANT DATA
 Multiple teams across the organization capture the same data repeatedly. In an organization with both an online and a high street presence, capturing the same information numerous times leads to the data being available in various systems, i.e. data redundancy. Not only is this poor for the company’s bottom line, it is also a poor customer experience.
 How to fix this issue?
 To fix this, a single base system should be used from which all the organization’s agents receive their data; yet again, a master data management implementation.
OLD & STALE DATA
 Storing data beyond a certain period adds no value to your data stack. It costs more money, confuses the engineers, and impacts your ability to conduct analytics. It also makes the data irrelevant. A retention filter like the one sketched below keeps the stack lean.
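A retention-filter sketch in base R; the seven-year retention period and the txns data frame are assumptions:

  # Drop (or archive) records older than the agreed retention period
  cutoff <- Sys.Date() - 365 * 7
  txns   <- txns[txns$transaction_date >= cutoff, ]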
UNCLEAR DATA DEFINITIONS
 Speak to Sam in Finance and Jess in Customer Services, and both interpret the same data point differently; sound familiar? Clarity is a data quality dimension that is not discussed much; in the modern data stack it is handled by the business glossary or data catalogue.
 How to fix this issue?
 Fixing this requires aligning data definitions each time a new metric/data
point is created.
DYSFUNCTIONAL HISTORY MANAGEMENT
 History maintenance is critical for any data warehousing implementation. Suppose data is received chronologically and history is maintained using Slowly Changing Dimension (SCD) Type 2. If the wrong rows are opened and closed, the latest valid record is misrepresented, which in turn breaks the history maintenance method and downstream processes.
 How to fix this issue?
 To fix this, ensure the correct date column is used to drive the history maintenance, and sanity-check the open rows as sketched below.
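An SCD Type 2 sanity check in base R; the dim_customer table and its valid_from/valid_to columns (with NA marking the open row) are assumptions:

  # Each business key should have exactly one open row (valid_to is NA)
  open_rows    <- is.na(dim_customer$valid_to)
  open_per_key <- tapply(open_rows, dim_customer$customer_id, sum)
  bad_keys     <- names(open_per_key)[open_per_key != 1]
  length(bad_keys)  # keys with zero or multiple open rows indicate broken history

  # Open and close rows on the business-effective date, not the load date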
DATA RECEIVED TOO LATE
 Data needs to be timely enough to support the critical decisions of that period. If your marketing campaigns run weekly, you must receive the required data by the set day of the week to trigger them; otherwise, late data could lead to poor campaign responses.
 How to fix this issue?
 To fix this, agree on an appropriate time window with the engineering team and work backwards to ensure your source systems can adhere to those Service Level Agreements (SLAs). A timeliness check like the one below can monitor this.
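A timeliness-check sketch in base R; the feed file name and the SLA deadline are assumptions:

  # Alert when the weekly feed misses its SLA
  sla_deadline <- as.POSIXct("2024-01-08 09:00:00")        # agreed cut-off
  arrived_at   <- file.mtime("weekly_campaign_feed.csv")   # NA if the file never landed
  if (is.na(arrived_at) || arrived_at > sla_deadline)
    warning("Feed missing or received after the SLA deadline")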
