Types of Data
Data Science:
Data science involves drawing conclusions from data by applying methods such as statistical
modeling, machine learning, and big data tools. It covers every stage of the lifecycle, from gathering data to
implementing predictive models.
Examples:
• Fraud Detection: Banks use machine learning (e.g., Random Forests) to detect suspicious
transactions.
• Recommendation Systems: Netflix uses collaborative filtering to suggest movies based on user
behavior.
• Natural Language Processing (NLP): Analyzing social media sentiment to gauge brand perception.
Data Analytics:
The main goal of data analytics is to analyze historical data to answer specific questions.
This process frequently relies on descriptive statistics and visualization tools.
Examples:
• Customer Segmentation: Grouping users into "High-Value" or "Low-Value" categories using SQL.
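As a minimal sketch of the segmentation idea, the snippet below runs a SQL query through Python's built-in sqlite3 module. The table name, the sample purchases, and the 500 cut-off are all hypothetical values chosen for illustration; a real analysis would pick the threshold from the business context.

```python
import sqlite3

# Hypothetical purchase data: (customer, amount spent). Not taken from any real source.
purchases = [
    ("alice", 120.0), ("alice", 450.0),
    ("bob", 35.0),
    ("carol", 300.0), ("carol", 310.0),
    ("dave", 80.0),
]

THRESHOLD = 500.0  # assumed cut-off between the two segments

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", purchases)

# Sum the spending per customer, then label each customer with a CASE expression.
query = """
    SELECT customer,
           SUM(amount) AS total_spent,
           CASE WHEN SUM(amount) >= ? THEN 'High-Value' ELSE 'Low-Value' END AS segment
    FROM purchases
    GROUP BY customer
    ORDER BY customer
"""
segments = {row[0]: row[2] for row in conn.execute(query, (THRESHOLD,))}
conn.close()

for customer, segment in segments.items():
    print(customer, segment)
```

With this sample data, alice and carol land in "High-Value" and bob and dave in "Low-Value".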
Data analytics plays a crucial role in modern engineering by enabling engineers to make data-driven decisions,
optimize processes, and predict outcomes. Here are some key applications:
1. Predictive Maintenance:
Example:
Monitoring vibrations or temperature data from machines to predict wear and tear before a failure occurs.
2. Quality Control:
Example:
Analyzing sensor data during the production process to identify defects in materials or processes early.
3. Energy Efficiency:
Example:
Analyzing energy consumption in a building to optimize heating, ventilation, and air conditioning
(HVAC) systems.
4. Design Optimization:
Data analytics allows engineers to optimize designs based on performance data. This ensures that materials
are used efficiently, and designs are as cost-effective and functional as possible.
Example:
Analyzing data from wind tunnels or simulations to refine the design of an aircraft or vehicle.[4]
Data:
Data are pieces of information that can be numbers, facts, or symbols, used to describe things. They
can be discrete or continuous and help in understanding and decision-making.
Example:
• Employee names.
• Product names
Types of data:
1. Continuous Data
2. Discrete Data
3. Categorical Data
4. Qualitative Data[5]
Continuous Data:
Continuous data is the opposite of discrete data. It is a type of numerical data that can take any of
the infinitely many values that exist between two given points.
Example:
Discrete Data:
Discrete data consists of whole-number values, and only a limited number of values are possible. Values of
this kind cannot be divided into smaller components.
Example:
Categorical Data:
Categorical data consists of categorical variables that describe characteristics such as a person’s
gender or hometown.
Example:
• Birthdate
• Favorite sport[7]
Qualitative Data:
Qualitative data is descriptive information that focuses on concepts and characteristics, rather than
numbers and statistics. The data cannot be counted, measured or expressed numerically.
Example:
• Interviews.[8]
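To make the four categories above more concrete, here is a small illustrative sketch. The record, its field names, and its values are hypothetical, and the `describe` helper is only a rough heuristic: a Python type does not by itself determine the statistical type of a variable.

```python
# Hypothetical record; field names and values are illustrative only.
record = {
    "height_cm": 172.4,          # continuous: any value within a range is possible
    "children": 2,               # discrete: countable whole numbers
    "favorite_sport": "tennis",  # categorical: one label from a set of categories
    "interview_note": "Enjoys team projects and flexible hours.",  # qualitative: free-form description
}


def describe(value):
    """Rough heuristic: separate numerical values from non-numerical ones."""
    if isinstance(value, bool):
        return "non-numerical"
    if isinstance(value, int):
        return "numerical (discrete)"
    if isinstance(value, float):
        return "numerical (continuous)"
    return "non-numerical"


kinds = {name: describe(value) for name, value in record.items()}
for name, kind in kinds.items():
    print(f"{name}: {kind}")
```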
Dendrograms:
A dendrogram is a tree-like diagram used in hierarchical clustering to represent the arrangement of data points
or clusters. It visually shows how different items or groups are related, based on their similarity or distance.
Key Features:
• Branches: Indicate the similarity between clusters. Shorter branches mean higher similarity.
• Height: The height at which clusters merge shows the dissimilarity between them. Lower height means
greater similarity.
Applications:
• Bioinformatics: Used to visualize evolutionary relationships between species (e.g., phylogenetic trees).
• Marketing: Used for customer segmentation, grouping customers based on similar purchasing
behaviors.
• Machine Learning: Used in hierarchical clustering to group and classify data.
Purpose:
• Dendrograms visually represent how objects are grouped hierarchically based on their similarities or
distances.[9]
Fig.2 Dendrograms[9]
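The merge heights that a dendrogram displays can be sketched with a small agglomerative clustering routine. This is a minimal illustration, assuming one-dimensional points and single linkage (the distance between two clusters is the smallest distance between any pair of their members); the function name and sample points are hypothetical, and other linkage rules would give different heights.

```python
from itertools import combinations


def single_linkage(points):
    """Agglomerative clustering on 1-D points.

    Returns the merges in order as (cluster_a, cluster_b, height), where
    height is the smallest distance between any two members (single linkage).
    """
    clusters = [(p,) for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: min(abs(x - y) for x in pair[0] for y in pair[1]),
        )
        height = min(abs(x - y) for x in a for y in b)
        merges.append((a, b, height))
        # Replace the two merged clusters with their union.
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
    return merges


merges = single_linkage([1.0, 2.0, 6.0, 7.5, 15.0])
for a, b, height in merges:
    print(f"merge {a} + {b} at height {height}")
```

The points 1.0 and 2.0 merge first (height 1.0) because they are the most similar, while 15.0 joins last (height 7.5), which corresponds to the tallest branch of the tree.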
In data analysis and measurement, errors are typically categorized into two main types: Systematic Errors and
Random Errors.
1. Systematic Errors:
Errors that consistently occur in the same direction due to identifiable causes. They lead
to measurements that are consistently higher or lower than the true value.
Characteristics:
• Predictable and reproducible.
• Often arise from faulty equipment, calibration issues, or environmental factors.
• Affect the accuracy of measurements.
Example:
A scale that consistently reads 2 kilograms heavier than the actual weight due to miscalibration.
2. Random Errors:
Errors that occur unpredictably and vary in magnitude and direction, often due to uncontrollable factors.
Characteristics:
Example:
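As a rough illustration of both error types, the sketch below simulates the miscalibrated scale mentioned earlier with an added random component. The true weight, the noise level, and the number of readings are assumed values; only the 2 kg offset comes from the text.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

TRUE_WEIGHT = 70.0  # kg, hypothetical true value
BIAS = 2.0          # kg, systematic error of the miscalibrated scale
NOISE_SD = 0.5      # kg, assumed spread of the random error

# Each reading = true value + constant bias + unpredictable noise.
readings = [TRUE_WEIGHT + BIAS + random.gauss(0.0, NOISE_SD) for _ in range(1000)]

mean_error = statistics.mean(readings) - TRUE_WEIGHT  # reflects the systematic error
spread = statistics.stdev(readings)                   # reflects the random error

print(f"average error: {mean_error:.2f} kg")
print(f"spread of readings: {spread:.2f} kg")
```

Averaging many readings shrinks the influence of the random error, but the average error stays close to 2 kg: the systematic offset can only be removed by recalibrating the scale.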
Truncation Error:
Occurs when exact models are replaced with approximations (e.g., finite-difference derivatives instead of
analytical derivatives).
Round-off Error:
Results from the inability of computers to represent real numbers exactly (e.g., limited mantissa bits in
floating-point numbers).
Dependency on step size:
• Truncation error: Increases with larger step sizes (h).
• Round-off error: Decreases with larger step sizes (fewer computations).
Mitigation:
• Truncation error: Use higher-order approximations (e.g., more Taylor terms).
• Round-off error: Increase precision (e.g., double-precision arithmetic).[12]
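The opposite dependence of the two errors on the step size can be seen in a small experiment. The sketch below is an illustration only: it assumes a forward-difference approximation of the derivative of sin(x) at x = 1 in double precision, and compares it with the analytical derivative cos(1).

```python
import math


def forward_difference(f, x, h):
    """First-order finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x)) / h


x = 1.0
exact = math.cos(x)  # analytical derivative of sin at x

errors = {}
for exponent in range(1, 16, 2):
    h = 10.0 ** -exponent
    errors[exponent] = abs(forward_difference(math.sin, x, h) - exact)
    print(f"h = 1e-{exponent:02d}  absolute error = {errors[exponent]:.3e}")
```

For large h the truncation error dominates and the total error shrinks as h decreases. For very small h the subtraction f(x + h) - f(x) loses significant digits, so the round-off error takes over and the total error grows again.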
A confidence interval (CI) is a range of values used to estimate a population parameter, such as the population
mean, based on sample data. It helps assess the degree of uncertainty around the point estimate (e.g., sample
mean) and indicates the precision of the estimate.
Key Components:
Point Estimate:
A single value derived from sample data (like the sample mean or proportion) that estimates the population
parameter.
Margin of Error:
The range above and below the point estimate, reflecting the uncertainty in the estimate. It's calculated using
the sample's standard error and a critical value (e.g., Z-score for a specific confidence level).
Confidence Level:
The percentage of confidence that the true population parameter lies within the interval. Common levels are
90%, 95%, and 99%.
For a population mean where the population standard deviation \(\sigma\) is known, the confidence interval is
calculated as:
\[ CI = \bar{x} \pm z \times \frac{\sigma}{\sqrt{n}} \]
Where:
• \(\bar{x}\) is the sample mean,
• \(z\) is the Z-score corresponding to the desired confidence level (e.g., 1.96 for 95%),
• \(\sigma\) is the population standard deviation,
• \(n\) is the sample size.
Interpretation:
A 95% confidence interval means that if you were to repeatedly sample from the population and compute a
confidence interval from each sample, about 95% of those intervals would contain the true population
parameter. It’s important to understand that any single interval either contains the true parameter or it doesn’t;
the 95% refers to the reliability of the method.
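This repeated-sampling interpretation can be checked with a short simulation. The sketch below assumes a normally distributed population with hypothetical parameters (mean 170, standard deviation 10) and a known-sigma Z interval; the number of trials and the seed are arbitrary choices.

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(1)  # fixed seed so the run is repeatable

TRUE_MEAN = 170.0  # hypothetical population mean
SIGMA = 10.0       # population standard deviation, assumed known
N = 100            # sample size
TRIALS = 2000      # number of repeated samples

z = NormalDist().inv_cdf(0.975)     # two-sided 95% critical value, about 1.96
margin = z * SIGMA / math.sqrt(N)   # margin of error, identical for every sample

hits = 0
for _ in range(TRIALS):
    sample_mean = fmean(random.gauss(TRUE_MEAN, SIGMA) for _ in range(N))
    # Count the interval as a hit when it contains the true mean.
    if sample_mean - margin <= TRUE_MEAN <= sample_mean + margin:
        hits += 1

coverage = hits / TRIALS
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should be close to 0.95, even though each individual interval either contains 170 or does not.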
Example:
If a sample of 100 individuals has a mean height of 170 cm and a population standard deviation of 10 cm, the
95% confidence interval for the true population mean height would be:
\[ 170 \pm 1.96 \times \frac{10}{\sqrt{100}} = 170 \pm 1.96 \times 1 = (168.04, 171.96) \]
This means we are 95% confident that the true population mean height is between 168.04 cm and 171.96
cm.
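The same calculation can be sketched in code. The function name `z_confidence_interval` is a hypothetical helper, not part of any library; it assumes the population standard deviation is known and obtains the critical value from the standard normal distribution instead of hard-coding 1.96.

```python
import math
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Two-sided confidence interval for a mean with known population sigma."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * sigma / math.sqrt(n)               # margin of error
    return sample_mean - margin, sample_mean + margin


low, high = z_confidence_interval(sample_mean=170.0, sigma=10.0, n=100)
print(round(low, 2), round(high, 2))  # prints: 168.04 171.96
```

Passing `confidence=0.99` widens the interval, because a higher confidence level requires a larger critical value.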
Applications of Confidence Intervals:
• Estimating Population Parameters: CIs are widely used in statistics to estimate parameters like means and
proportions, providing a range of plausible values for the true parameter.
• Hypothesis Testing: CIs are used in hypothesis testing to determine if a sample statistic is significantly
different from a hypothesized value.
• Quality Control: In manufacturing, CI is used to assess if production processes are within acceptable limits.
Common Misunderstandings:
• A CI does not mean there is a 95% chance that the true parameter lies within the interval. It means that if the
sampling process were repeated many times, 95% of those intervals would contain the true parameter.
Fig.3 CI[14]
References:
[1] F. Provost and T. Fawcett, “Introduction: Data-Analytic Thinking,” Data Science for Business: What
You Need to Know About Data Mining and Data-Analytic Thinking, pp. 1–18, 2013. Accessed: Feb.
19, 2025. [Online]. Available:
https://www.researchgate.net/publication/256438799_Data_Science_for_Business
[2] “Data Analyst, Data Scientist, Data Engineer: Who’s the Real MVP of Data-Driven Decision
Making? | Rocket Recruiting Blog.” Accessed: Feb. 19, 2025. [Online]. Available:
https://www.getrocket.com/post/data-roles
[3] “Data Analyst Vs. Data Scientist: The Comparison To Distinguish The Best.” Accessed: Feb. 19,
2025. [Online]. Available:
https://www.digitalvidya.com/blog/data-science-career-data-analyst-vs-data-scientist/
[4] “Data Analysis in Engineering: Examples, Importance & Uses.” Accessed: Feb. 19, 2025. [Online].
Available:
https://www.studysmarter.co.uk/explanations/engineering/professional-engineering/data-analysis-in-engineering/
[5] T. H. Vines et al., “The availability of research data declines rapidly with article age,” Current
Biology, vol. 24, no. 1, pp. 94–97, Jan. 2014, doi: 10.1016/j.cub.2013.11.014.
[6] “Discrete vs. Continuous Data: What Is The Difference? | Whatagraph.” Accessed: Feb. 19, 2025.
[Online]. Available: https://whatagraph.com/blog/articles/discrete-vs-continuous-data
[7] “Categorical Data & Qualitative Data (Definition and Types).” Accessed: Feb. 19, 2025. [Online].
Available: https://byjus.com/maths/categorical-data/
[8] “What is Qualitative Data? | Definition from TechTarget.” Accessed: Feb. 19, 2025. [Online].
Available: https://www.techtarget.com/searchcio/definition/qualitative-data
[9] “Dendrogram - Wikipedia.” Accessed: Feb. 19, 2025. [Online]. Available:
https://en.wikipedia.org/wiki/Dendrogram
[10] “Errors in Measurement: Measurement, Gross Errors, Systematic Errors, Random Errors and FAQs.”
Accessed: Feb. 19, 2025. [Online]. Available:
https://byjus.com/physics/accuracy-precision-error-measurement/
[11] “Type I and type II errors - Wikipedia.” Accessed: Feb. 19, 2025. [Online]. Available:
https://en.wikipedia.org/wiki/Type_I_and_type_II_errors
[12] D. University, “Part 1 Chapter 4 Roundoff and Truncation Errors.”
[13] “Understanding Confidence Intervals | Easy Examples & Formulas.” Accessed: Feb. 19, 2025.
[Online]. Available: https://www.scribbr.com/statistics/confidence-interval/
[14] “Confidence Intervals and Z Score - Programmathically.” Accessed: Feb. 19, 2025. [Online].
Available: https://programmathically.com/confidence-intervals-and-z-score/