
DATA SCIENCE FUNDAMENTALS
DSC293
Lecture 3-4
Dr. Hufsa Mohsin
TYPES OF DATA
 Qualitative or Categorical Data describes the object under consideration using a finite set of discrete classes.
 Smartphone example: the current rating, the color of the phone, the category of the phone

 Nominal
 These are values that do not possess a natural ordering.
 The color of a smartphone can be considered a nominal data type, as we cannot compare one color with another.
 Mobile phone categories: midrange, budget segment, or premium smartphone

 Ordinal
 These values have a natural ordering while maintaining their class of values.
 Clothing sizes can easily be sorted according to their name tags in the order small < medium < large.
 The grading system used to mark candidates in a test is also ordinal: an A+ is definitely better than a B grade.
QUANTITATIVE DATA TYPE
 This data type quantifies things using numerical values, making the data countable or measurable in nature.
 The price of a smartphone, the discount offered, the number of ratings on a product, the processor frequency of a smartphone, or the RAM of that particular phone.
 Discrete
 Numerical values that are integers or whole numbers are placed under this category.
 The number of speakers in the phone, cameras, cores in the processor, or the number of SIMs supported.
 Discrete data cannot be measured; it can only be counted.
 Discrete data is often presented through charts, including bar charts, pie charts, and tally charts.

 Continuous
 Fractional numbers are considered continuous values.
 The operating frequency of the processors, the Android version of the phone, the Wi-Fi frequency, the temperature of the cores, and so on.
 Unlike discrete data, which takes whole, fixed values, continuous data can be broken down into smaller pieces and can take any value in a range.
 Continuous data is represented using a line graph whose highs and lows reflect value fluctuation over a certain period of time.
DATA FORMATS
 XLS/XLSX
 JSON
 XML
 MongoDB
 SQL
DATA IN R
 Structured
 Working with data from file
 The most common ready-to-go data formats are the family of tabular formats (e.g. CSV), as sketched below.
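A minimal sketch of loading a tabular file into R; the file name smartphones.csv and its columns are assumptions for illustration:

  # Read a tabular file and inspect its structure
  smartphones <- read.csv("smartphones.csv", stringsAsFactors = FALSE)
  str(smartphones)  # column names and types

  # Encode the qualitative columns from the earlier slides
  smartphones$color <- factor(smartphones$color)            # nominal: no natural order
  smartphones$size  <- factor(smartphones$size,
                              levels = c("small", "medium", "large"),
                              ordered = TRUE)               # ordinal: small < medium < large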
INCOMPLETE DATA
 The following slides cover the kinds of problems that can be found in a particular data set and how they can be fixed (or are expected to be fixed).
 Incomplete Data: This is by far the most common data quality issue in data sets. Key columns are missing information, causing a downstream analytics impact.
 How to fix this issue?
 The best way to fix this is to put a reconciliation framework control in place. The control checks the number of records passing through your analytical layers and alerts when records have gone missing, as in the sketch below.
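A minimal reconciliation sketch in base R; the data frame names (raw_layer, curated_layer) and the key column are assumptions:

  # Compare records between two analytical layers and alert on loss
  reconcile <- function(source_df, target_df, key) {
    missing_keys <- setdiff(source_df[[key]], target_df[[key]])
    if (length(missing_keys) > 0)
      warning(length(missing_keys), " records went missing between layers")
    missing_keys
  }
  lost <- reconcile(raw_layer, curated_layer, key = "transaction_id")

  # Completeness check: count missing values per column
  colSums(is.na(curated_layer))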
DEFAULT VALUES
 Ever analysed your data and found 01/01/1891 as a date for a transaction?
Unless your customer base comprises 130-year-old individuals, this is likely
a case of using default values. This is especially a problem if there is a lack
of documentation.
 How to fix this issue?
 The best way to fix this is to profile the data and understand the pattern of why default values were used. Usually, engineers insert a default when the real value is unavailable. A profiling sketch follows.
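A profiling sketch in base R; the txns data frame, its date column, and the specific default date are assumptions:

  # Frequency-profile the date column: a default value will dominate the counts
  date_counts <- sort(table(txns$transaction_date), decreasing = TRUE)
  head(date_counts)

  # Flag rows carrying the suspected default
  is_default <- txns$transaction_date == as.Date("1891-01-01")
  sum(is_default)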
DATA FORMAT INCONSISTENCIES
 String columns predominantly suffer from this problem, where the same data can be stored in many formats.
 For example, a customer’s first and last name stored in different cases, or an email address without the correct formatting.
 It occurs when multiple systems store information without an agreed data format.
 How to fix this issue?
 To fix this, the data needs to be homogenized (standardized) across the source systems, or at least in the data pipeline when fed to the data lake or warehouse, as sketched below.
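A standardization sketch in base R; the customers data frame and its columns are assumptions, and the email pattern is deliberately simple:

  # Normalize case and strip stray whitespace
  customers$first_name <- trimws(tolower(customers$first_name))
  customers$last_name  <- trimws(tolower(customers$last_name))
  customers$email      <- trimws(tolower(customers$email))

  # Flag email addresses that fail a basic format check
  bad_email <- !grepl("^[^@]+@[^@]+\\.[^@]+$", customers$email)
  customers[bad_email, "email"]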
DUPLICATE DATA
 Reasonably straightforward to spot, quite tricky to fix. If a critical attribute is populated with dirty duplicate data, it will break all the key downstream processes. It can also cause other data quality issues.
 How to fix this issue?
 To fix this, a master data management control needs to be implemented, even one as basic as a uniqueness check.
 This control will check for exact duplicate records and purge one of them.
 It can also send a notification about the other record to the data engineer or steward for investigation, as in the sketch below.
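A minimal uniqueness check in base R; the customers data frame is an assumption:

  # Find exact duplicate rows, keep the first occurrence, and set the rest aside
  dupes <- duplicated(customers)
  if (any(dupes)) {
    message(sum(dupes), " duplicate records found")
    for_review <- customers[dupes, ]   # hand these to the data steward
    customers  <- customers[!dupes, ]  # purge the duplicates
  }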
CROSS SYSTEM INCONSISTENCIES
 Very common in large organizations that have grown by acquisitions and mergers: multiple legacy source systems each have a slightly different view of the world, so customer name, address, or DOB carry inconsistent or incorrect information.
 How to fix this issue?
 As above, a master data management solution must be implemented to ensure all the different pieces of information are matched into a single record.
 This matching doesn’t need to be exact; it can be fuzzy, based on a threshold match percentage, as in the sketch below.
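A fuzzy-matching sketch in base R using Levenshtein edit distance; the 0.8 threshold is an assumption:

  # Similarity = 1 - edit distance / length of the longer string
  fuzzy_match <- function(a, b, threshold = 0.8) {
    dist <- mapply(adist, a, b)                  # pairwise Levenshtein distance
    sim  <- 1 - dist / pmax(nchar(a), nchar(b))
    sim >= threshold
  }

  fuzzy_match("Jonathan Smith", "Jonathon Smith")  # TRUE: likely the same customer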
ORPHANED DATA
 This data quality issue relates to data inconsistency problems where data exists in one system and not the other. If a customer exists in table A but their account doesn’t exist in table B, the customer would be classed as an orphan customer.
 On the other hand, if an account exists in table B but has no associated customer, it would be classed as an orphan account. A data quality rule that checks for consistency each time data is ingested into tables A and B will help spot the issue.
 How to fix this issue?
 To remediate this, the underlying cause of the inconsistency would need to be investigated in the source system. A consistency check like the one sketched below can flag the orphans.
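An orphan-check sketch in base R; the data frame and column names are assumptions:

  # Customers with no matching account, and accounts with no matching customer
  orphan_customers <- customers[!(customers$customer_id %in% accounts$customer_id), ]
  orphan_accounts  <- accounts[!(accounts$customer_id %in% customers$customer_id), ]

  nrow(orphan_customers)  # orphan customers in table A
  nrow(orphan_accounts)   # orphan accounts in table B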
IRRELEVANT DATA
 Nothing is more frustrating than capturing ALL the available information.
Besides the regulatory restrictions of data minimization, capturing all the
available data is more expensive and less sustainable.
 How to fix this issue?
 To fix this, data capturing principles need to be agreed upon: each data attribute should have an end goal; otherwise, it should not be captured.
REDUNDANT DATA
 Multiple teams across the organization capture the same data repeatedly. In an organization with both an online and a high street presence, capturing the same information numerous times leads to the data being available in various systems, i.e. data redundancy. Not only is this poor for the company’s bottom line, it is also a poor customer experience.
 How to fix this issue?
 To fix this, a single base system should be used from which all the organization’s agents receive their data; yet again, a master data management implementation.
OLD & STALE DATA
 Storing data beyond a certain period adds no value to your data stack. It costs more money, confuses the engineers, and impacts your ability to conduct analytics. It also makes the data irrelevant. A retention filter like the one sketched below keeps the stack lean.
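A retention-filter sketch in base R; the seven-year retention period and the txns data frame are assumptions:

  # Drop (or archive) records older than the agreed retention period
  cutoff <- Sys.Date() - 365 * 7
  txns   <- txns[txns$transaction_date >= cutoff, ]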
UNCLEAR DATA DEFINITIONS
 Speak to Sam in Finance and Jess in Customer Services, and both interpret the same data point differently; sound familiar? Clarity is a data quality dimension that is not discussed much; in the modern data stack it is handled by the business glossary or data catalogue.
 How to fix this issue?
 Fixing this requires aligning data definitions each time a new metric/data
point is created.
DYSFUNCTIONAL HISTORY MANAGEMENT
 History maintenance is critical for any data warehousing implementation. Suppose data is received chronologically and history is maintained using Slowly Changing Dimension (SCD) Type 2. If the wrong rows are opened and closed, the latest valid record is misrepresented, which in turn breaks the history maintenance method and downstream processes.
 How to fix this issue?
 To fix this, ensure the correct date column is used to drive the history maintenance, and sanity-check the open rows as sketched below.
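An SCD Type 2 sanity check in base R; the dim_customer table and its valid_from/valid_to columns (with NA marking the open row) are assumptions:

  # Each business key should have exactly one open row (valid_to is NA)
  open_rows    <- is.na(dim_customer$valid_to)
  open_per_key <- tapply(open_rows, dim_customer$customer_id, sum)
  bad_keys     <- names(open_per_key)[open_per_key != 1]
  length(bad_keys)  # keys with zero or multiple open rows indicate broken history

  # Open and close rows on the business-effective date, not the load date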
DATA RECEIVED TOO LATE
 Data needs to be timely enough to support the critical decisions of that period. If your marketing campaigns run weekly, you must receive the required data by the set day of the week to trigger them; otherwise, late data could lead to poor campaign responses.
 How to fix this issue?
 To fix this, agree on an appropriate time window with the engineering team and work backwards to ensure your source systems can adhere to those Service Level Agreements (SLAs). A timeliness check like the one below can monitor this.
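A timeliness-check sketch in base R; the feed file name and the SLA deadline are assumptions:

  # Alert when the weekly feed misses its SLA
  sla_deadline <- as.POSIXct("2024-01-08 09:00:00")        # agreed cut-off
  arrived_at   <- file.mtime("weekly_campaign_feed.csv")   # NA if the file never landed
  if (is.na(arrived_at) || arrived_at > sla_deadline)
    warning("Feed missing or received after the SLA deadline")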
