Data Types
Machine learning involves dealing with a wide variety of data types. The nature of the
problem at hand and the data you're working with determine which data types are most
relevant. Here are some of the most common data types in machine learning:
Numerical (or Quantitative) Data:
Continuous Data: Data that can take any value within a range. Examples
include height, weight, and temperature.
Discrete Data: Data that can take only specific and separated values. Examples
include the number of employees in a company or the number of cars in a
household.
Categorical (or Qualitative) Data:
Nominal Data: Data that doesn’t have a natural order. Examples include colours
(red, blue, green) or gender (male, female).
Ordinal Data: Data that has a clear ordering. Examples include ratings (low,
medium, high) or education level (high school, bachelor's, master's, PhD).
Temporal Data: Data with a time component. Examples include time series data like
stock prices over a period, or date and time entries like timestamps of user activity.
Text Data: Raw textual data which is typically processed using techniques from natural
language processing (NLP). Examples include tweets, news articles, and product
reviews.
Image Data: Consists of graphical or visual information. This is commonly used in
computer vision tasks such as image recognition, object detection, and image
generation.
Audio Data: Sound or voice data used for tasks like speech recognition, sound
classification, and music generation.
Video Data: Sequential images or frames that represent moving scenes. Used in video
classification, activity recognition, and video summarization.
Sequential Data: Data where the order matters. This can be found in time series
forecasting, natural language processing (like predicting the next word in a sentence),
or any task where the sequence of data points is significant.
Structured Data: Organized data with clear relationships, often stored in databases or
spreadsheets. An example is a table with columns for 'Name', 'Age', and 'Salary'.
Unstructured Data: Data without a predefined schema or model. Examples include raw
text, images, and logs.
Spatial/Geospatial Data: Data that represents information about the physical location
and shape of geometric objects. This can be points (like store locations), lines (like
roads), or polygons (like country boundaries).
Relational Data: Data with relationships between different entities or tables.
Examples can be found in relational databases, where tables are linked by foreign
keys.
Mixed Data Types: Sometimes called "Heterogeneous Data", this refers to datasets
that contain a mix of the above data types.
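As a rough illustration of a mixed (heterogeneous) dataset, the sketch below models a
single record whose fields cover several of the types listed above. The CustomerRecord
class, its field names, the EducationLevel enum, and all sample values are hypothetical
and chosen only for illustration.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from enum import IntEnum


class EducationLevel(IntEnum):
    """Ordinal data: the integer values encode the natural ordering."""
    HIGH_SCHOOL = 1
    BACHELORS = 2
    MASTERS = 3
    PHD = 4


@dataclass
class CustomerRecord:
    """Hypothetical record mixing several data types."""
    height_cm: float            # numerical, continuous
    cars_in_household: int      # numerical, discrete
    favourite_colour: str       # categorical, nominal
    education: EducationLevel   # categorical, ordinal
    last_active: datetime       # temporal
    review: str                 # text
    store_location: tuple       # spatial point as (latitude, longitude)


# Made-up sample values, for illustration only.
record = CustomerRecord(
    height_cm=172.5,
    cars_in_household=2,
    favourite_colour="blue",
    education=EducationLevel.MASTERS,
    last_active=datetime(2024, 1, 15, 9, 30),
    review="Fast delivery, would buy again.",
    store_location=(51.5074, -0.1278),
)

# Ordinal values support order comparisons; for nominal values such as
# colours, only equality checks are meaningful.
has_higher_degree = record.education > EducationLevel.BACHELORS

# Map each field name to the Python type that holds its value.
field_types = {f.name: type(getattr(record, f.name)).__name__ for f in fields(record)}
```

Using an IntEnum for the ordinal field keeps the ordering explicit in the type itself,
whereas the nominal colour stays a plain string because no ordering applies to it.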
It's important to understand these data types because the appropriate preprocessing,
feature engineering, and modelling techniques, as well as the choice of machine learning
algorithm, depend heavily on the type of data involved in the business case.
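To make the dependence on data type concrete, here is a minimal sketch of three common
preprocessing steps, each suited to a different type: min-max scaling for continuous
values, one-hot encoding for nominal categories, and integer encoding for ordinal
categories. The helper names and sample values are assumptions made for this example;
in practice a library would usually provide these steps.

```python
def min_max_scale(values):
    """Rescale continuous values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(values):
    """Encode nominal categories as binary indicator vectors (no order implied)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def ordinal_encode(values, order):
    """Map ordinal categories to integers that preserve the given order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


heights = [150.0, 175.0, 200.0]        # continuous
colours = ["red", "blue", "green"]     # nominal
ratings = ["low", "high", "medium"]    # ordinal

scaled_heights = min_max_scale(heights)
encoded_colours = one_hot_encode(colours)
encoded_ratings = ordinal_encode(ratings, order=["low", "medium", "high"])
```

One-hot encoding avoids implying an order among colours, while the integer encoding
deliberately keeps the low < medium < high ordering of the ratings.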