Data Science - Glossary
GLOSSARY
INDUSTRY TERMS
© 1993–2020 Pragmatic Institute, LLC. All rights reserved. Other company and product names mentioned in this manual are the intellectual property of
their respective companies and as such shall remain the sole property of those respective companies.
Pragmatic Institute®
DATA SCIENCE INDUSTRY TERMS
accuracy
In classification, accuracy is defined as the number of observations that are correctly labeled by the algorithm as a fraction of the total number of observations the algorithm attempted to label. Colloquially, it is the fraction of times the algorithm guessed “right.”

anomaly detection
Anomaly detection, also known as outlier detection, is the identification of rare items, events, observations, or patterns which raise suspicions by differing significantly from the majority of the data.

artificial intelligence (AI)
The ability to have machines act with apparent intelligence, although varying definitions of “intelligence” lead to a range of meanings for the artificial variety. In AI’s early days in the 1950s, researchers sought general principles of intelligence to implement, often using symbolic logic to

baseline
A model or heuristic used as a reference point for comparing how well a machine learning model is performing. A baseline helps model developers quantify the minimal, expected performance on a particular problem. Generally, baselines are set to simulate the performance of a model that doesn’t actually make use of our data to make predictions. This is called a naive benchmark.

batch
A set of observations that are fed into a machine learning model to train it. Batch training is a counterpart to online learning, in which data are fed sequentially instead of all at once.

bias
Bias is a source of error that emerges from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the
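
As an illustration of the accuracy and baseline entries, here is a minimal Python sketch that scores a model against a naive benchmark. It assumes the naive benchmark is "always predict the most common training label," which is one common choice and not the only one. The function names and all label data are hypothetical, invented for this example.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Fraction of observations whose predicted label matches the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def majority_class_baseline(train_labels, n_predictions):
    """Naive benchmark (assumed here): always predict the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return [most_common_label] * n_predictions


# Hypothetical labels, made up for illustration only.
train_labels = ["no", "no", "no", "yes"]
test_labels = ["no", "yes", "no", "no", "yes"]
model_predictions = ["no", "yes", "no", "yes", "yes"]  # output of some imagined model

baseline_predictions = majority_class_baseline(train_labels, len(test_labels))
model_accuracy = accuracy(model_predictions, test_labels)
baseline_accuracy = accuracy(baseline_predictions, test_labels)

print(model_accuracy, baseline_accuracy)  # prints: 0.8 0.6
```

In this toy case the model labels 4 of 5 observations correctly and the baseline 3 of 5, so the model is only modestly better than a predictor that ignores the features entirely.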
C

classification
Classification is one of the two major types of supervised learning models in which the labels we train the algorithm to predict are distinct categories. Usually these categories are binary (yes/no, innocent/guilty, 0/1) but classification algorithms can typically be extended to handle multiple classes (peach, plum, pear) or, in a more limited set of cases, multiple labels (an object can belong to more than one category). See also regression, supervised learning

cloud computing
A computing paradigm in which the storage and processing of data or the hosting of computing services such as databases or websites takes place on a remote system comprised of multiple individual computing units acting as one and typically owned by a cloud computing service provider.

clustering
An unsupervised learning technique that identifies group structures in data. Clusters are, loosely speaking, groups of observations that are similar to other observations in the same cluster and different from those belonging to different clusters. The center of each cluster is known by the excellent name “centroid.” Importantly, clustering algorithms only consider the relationships between features in the data mathematically and not conceptually; as such, the clusters identified by these algorithms

cross-validation
The name given to a set of techniques that split data into training sets and test sets when using data with an algorithm. The training set is given to the algorithm, along with the correct answers (labels), and becomes the set used to make predictions. The algorithm is then asked to make predictions for each item in the test set. The answers it gives are compared to the correct answers, and an overall score for how well the algorithm did is calculated. Cross-validation repeats this splitting procedure several times and computes an average score based on the scores from each split.

D

data cleansing
The act of reviewing and revising data to remove duplicate entries, correct misspellings, add missing data and provide more consistency.

data collection
Any process that captures any type of data.

data custodian
A person responsible for the database structure and the technical environment, including the storage of data.

data dictionary
A set of information describing the contents, format, and structure of a database and the relationship between its elements, used to control access to and manipulation of the database.

data-directed decision making
The use of data to support making crucial decisions.
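
The splitting-and-averaging procedure described in the cross-validation entry can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: it uses contiguous, unshuffled folds so the result is deterministic (practical tools usually shuffle first), accuracy as the score, and a toy one-nearest-neighbor classifier as a stand-in for "the algorithm." All function names and data values are hypothetical.

```python
def k_fold_indices(n_items, k):
    """Split range(n_items) into k contiguous folds of near-equal size."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(features, labels, fit, predict, k=3):
    """Return the per-fold accuracy scores and their average."""
    scores = []
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        # Training set: features plus the correct answers (labels).
        model = fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
        # Test set: predict each item, then compare with the correct answers.
        predictions = [predict(model, features[i]) for i in test_idx]
        correct = sum(p == labels[i] for p, i in zip(predictions, test_idx))
        scores.append(correct / len(test_idx))
    return scores, sum(scores) / len(scores)


# Hypothetical stand-in algorithm: 1-nearest-neighbor on a single numeric feature.
def fit_nearest_neighbor(train_features, train_labels):
    return list(zip(train_features, train_labels))


def predict_nearest_neighbor(model, x):
    _, label = min(model, key=lambda pair: abs(pair[0] - x))
    return label


# Made-up data for illustration only.
features = [1.0, 5.0, 1.2, 5.3, 3.2, 4.8]
labels = ["low", "high", "low", "high", "low", "high"]

scores, mean_score = cross_validate(
    features, labels, fit_nearest_neighbor, predict_nearest_neighbor, k=3
)
print(scores, mean_score)
```

Each observation lands in the test set exactly once, so the averaged score reflects performance on data the model did not see during training in that round.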
PragmaticInstitute.com/Data-Science
PragmaticInstitute.com