Data Science - Glossary


DATA SCIENCE
GLOSSARY
INDUSTRY TERMS

© 1993–2020 Pragmatic Institute, LLC. All rights reserved. Other company and product names mentioned in this manual are the intellectual property of their respective companies and as such shall remain the sole property of those respective companies.

Pragmatic Institute®
DATA SCIENCE INDUSTRY TERMS

A

A/B testing
A statistical way of comparing two (or more) techniques, typically an incumbent against a new rival. A/B testing aims to determine not only which technique performs better but also to understand whether the difference is statistically significant. A/B testing usually considers only two techniques using one measurement, but it can be applied to any finite number of techniques and measures.
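
As a rough illustration, the sketch below compares two invented page variants with a chi-squared test of independence using scipy; the counts are made up, and a chi-squared test is only one of several reasonable significance tests for an A/B comparison.

    # Hypothetical A/B test: did variant B convert significantly better than A?
    from scipy.stats import chi2_contingency

    # rows: variant A, variant B; columns: converted, did not convert
    observed = [[120, 880],   # variant A: 12.0% conversion
                [150, 850]]   # variant B: 15.0% conversion

    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(f"p-value = {p_value:.4f}")
    # A small p-value (say, below 0.05) suggests the observed difference is
    # unlikely to be explained by chance alone.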

accuracy
In classification, accuracy is defined as the number of observations that are correctly labeled by the algorithm as a fraction of the total number of observations the algorithm attempted to label. Colloquially, it is the fraction of times the algorithm guessed “right.”
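
A minimal worked example of the definition above, using invented labels:

    # Accuracy = correctly labeled observations / all labeled observations.
    y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # the correct labels
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # what the algorithm guessed

    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    print(f"accuracy = {accuracy:.2f}")  # 6 of 8 correct -> 0.75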

anomaly detection
Anomaly detection, also known as outlier detection, is the identification of rare items, events, observations, or patterns which raise suspicions by differing significantly from the majority of the data.

artificial intelligence (AI)
The ability to have machines act with apparent intelligence, although varying definitions of “intelligence” lead to a range of meanings for the artificial variety. In AI’s early days in the 1950s, researchers sought general principles of intelligence to implement, often using symbolic logic to automate reasoning. As the cost of computing resources dropped, the focus moved more toward statistical analysis of large amounts of data to drive decision making that gives the appearance of intelligence. See also machine learning, data mining, and expert systems.

B

backtesting
Periodic evaluation of a trained machine learning algorithm to check whether the predictions of the algorithm have degraded over time. Backtesting is a critical component of model maintenance.

baseline
A model or heuristic used as a reference point for comparing how well a machine learning model is performing. A baseline helps model developers quantify the minimal, expected performance on a particular problem. Generally, baselines are set to simulate the performance of a model that doesn’t actually make use of our data to make predictions. This is called a naive benchmark.

batch
A set of observations that are fed into a machine learning model to train it. Batch training is a counterpart to online learning, in which data are fed sequentially instead of all at once.

bias
Bias is a source of error that emerges from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and labels. Bias can be mitigated by adding additional features to the data or using a more flexible algorithm. See also variance, cross-validation.

C

classification
Classification is one of the two major types of supervised learning models in which the labels we train the algorithm to predict are distinct categories. Usually these categories are binary (yes/no, innocent/guilty, 0/1) but classification algorithms can typically be extended to handle multiple classes (peach, plum, pear) or, in a more limited set of cases, multiple labels (an object can belong to more than one category). See also regression, supervised learning.

cloud computing
A computing paradigm in which the storage and processing of data or the hosting of computing services such as databases or websites takes place on a remote system comprised of multiple individual computing units acting as one and typically owned by a cloud computing service provider.

clustering
An unsupervised learning technique that identifies group structures in data. Clusters are, loosely speaking, groups of observations that are similar to other observations in the same cluster and different from those belonging to different clusters. The center of each cluster is known by the excellent name “centroid.” Importantly, clustering algorithms only consider the relationships between features in the data mathematically and not conceptually; as such, the clusters identified by these algorithms may not reflect any grouping structure that would be sensible to a human being. See also classification, supervised learning, unsupervised learning, k-means clustering.
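
As a small sketch of the idea, the snippet below runs k-means (mentioned in the entry above) on a handful of invented two-dimensional points using scikit-learn:

    # Group six made-up points into two clusters with k-means.
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # one loose group
                  [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])  # another loose group

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)           # cluster assignment for each observation
    print(kmeans.cluster_centers_)  # the centroid of each cluster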

cross-validation
The name given to a set of techniques that split data into training sets and test sets when using data with an algorithm. The training set is given to the algorithm, along with the correct answers (labels), and is used to build the model that will make predictions. The algorithm is then asked to make predictions for each item in the test set. The answers it gives are compared to the correct answers, and an overall score for how well the algorithm did is calculated. Cross-validation repeats this splitting procedure several times and computes an average score based on the scores from each split.
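
A minimal sketch of this procedure with scikit-learn, using its bundled iris dataset and a logistic regression model purely as placeholders:

    # Five-fold cross-validation: five train/test splits, one score per split.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)

    scores = cross_val_score(model, X, y, cv=5)
    print(scores)         # accuracy on each held-out fold
    print(scores.mean())  # the averaged score described above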


D

data cleansing
The act of reviewing and revising data to remove duplicate entries, correct misspellings, add missing data and provide more consistency.

data collection
Any process that captures any type of data.

data custodian
A person responsible for the database structure and the technical environment, including the storage of data.

data dictionary
A set of information describing the contents, format, and structure of a database and the relationship between its elements, used to control access to and manipulation of the database.

data-directed decision making
The use of data to support making crucial decisions.

data engineer
A specialist in data wrangling.

data exhaust
The data that a person creates as a byproduct of a common activity—for example, a cell call log or web search history.

data feed
A means for a person to receive a stream of data. Examples of data feed mechanisms include RSS or Twitter.

data integrity
The measure of trust an organization has in the accuracy, completeness, timeliness and validity of the data.

data mart
The access layer of a data warehouse used to provide data to users.

data migration
The process of moving data between different storage types or formats, or between different computer systems.

data mining
The process of deriving patterns or knowledge from large data sets.

data model, data modeling
An agreed upon data structure. This structure is used to pass data from one individual, group, or organization to another, so that all parties know what the different data components mean. Often meant for both technical and non-technical users.

data profiling
The process of collecting statistics and information about data in an existing source.

data quality
The measure of data to determine its worthiness for decision making, planning or operations.

data replication
The process of sharing information to ensure consistency between redundant sources.

data repository
The location of permanently stored data.

data science
The discipline that incorporates statistics, data visualization, computer programming, data mining, machine learning and database engineering to solve complex problems.

data scientist
A practitioner of data science.

data security
The practice of protecting data from destruction or unauthorized access.

data steward
A person responsible for data stored in a data field.

data structure
A specific way of storing and organizing data.

data visualization
A visual abstraction of data designed for the purpose of deriving meaning or communicating information more effectively.

data warehouse
A place to store data for the purpose of reporting and analysis.


data wrangling
The process of transforming and cleaning data from raw formats to appropriate formats for later use. Also called data munging.

deep learning
A multilevel algorithm that gradually identifies things at higher levels of abstraction. For example, the first level identifies certain lines. The second identifies combinations of lines as shapes. Then the third identifies combinations of shapes as specific objects. Deep learning is popular for image classification. See also neural network.

data munging
See data wrangling.

E

errors-at-random
Errors-at-random are data errors such as missing or mismeasured data that are random with respect to the data we observe. Errors are not-at-random if the probability that an observation is missing or erroneous is correlated with the observed data. Errors-not-at-random are especially problematic if errors are correlated with labels.

ETL
ETL is short for extract, transform, load: three database functions that are combined into one tool to pull data from a primary source and place it into a database.

expert system
An expert system is a computer system that emulates the decision-making ability of a human expert. Expert systems are designed to solve complex problems by processing data describing the context of the decision being made and applying logic, mainly in the form of if-then rules.

H

Hadoop
Hadoop is a collection of software that facilitates using a network of many computers to solve problems involving large amounts of data and computation. It consists of two main functional components. One, the Hadoop Distributed File System (HDFS), is a utility that allows data to be stored over multiple networked machines in a failure-tolerant manner while still being treated as a single file system from the perspective of the user. The other, Hadoop MapReduce, is a programming paradigm that allows the user to process and analyze this data in parallel over large numbers of individual processing units located across multiple machines.

HiPPO
Highest Paid Person’s Opinion. A paradigm for decision-making within businesses that is inconsistent with data-driven cultures.

Hive
Hive is a data warehouse software project built on top of Hadoop for providing data query and analysis. Hive gives a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

I

Internet of Things (IoT)
The Internet of Things (IoT) is the extension of internet connectivity into physical devices and everyday objects. Embedded with electronics, internet connectivity, artificial intelligence, and other forms of hardware, these devices can communicate and interact with others over the Internet, and they can be remotely monitored and controlled.

J

Java
Java is a general-purpose, object-oriented, compiled programming language. While it is not among the most common languages used by data scientists, it and its close relative Scala are the native languages of many distributed computing frameworks such as Hadoop and Spark.

L

label
In supervised learning applications, labels are the components of the data that indicate the desired predictions or decisions we would like the machine learning algorithm to make for each observation we pass into the algorithm. Supervised learning algorithms learn to use other features in the data to predict labels so that these algorithms can learn to predict labels in other instances when the labels are not known or determined. In certain fields, labels are called targets. See also supervised learning, classification, regression.

leakage
Leakage is the introduction of information during training that will not be germane or available to the deployed algorithm.

length
Length measures the number of observations in our dataset.

linear regression
A technique to look for a linear relationship by starting with a set of data points that don’t necessarily line up nicely. This is accomplished by computing the “least squares” line: on an x-y graph, the line that has the smallest possible sum of squared distances to the actual data point y values. Statistical software packages and typical spreadsheet packages offer automated ways to calculate this.
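
One such automated way, sketched here with numpy on a handful of invented points:

    # Fit the least-squares line y = slope * x + intercept to made-up data.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly y = 2x

    slope, intercept = np.polyfit(x, y, deg=1)  # degree-1 polynomial fit
    print(f"y ~ {slope:.2f} * x + {intercept:.2f}")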

linear relationship
The relationship between two varying amounts, such as price and sales, that can be expressed with an equation that can be represented as a straight line on a graph.

M

machine learning
The use of data-driven algorithms that perform better as they have more data to work with, redefining their models or “learning” from this additional data. This involves cross-validation with training and test data sets. Studying the practical application of machine learning usually means researching which machine learning algorithms are best for which situations.

machine learning model
The model artifact that is created in the process of providing a machine learning algorithm with training data from which to learn.

MapReduce
MapReduce is a programming model and implementation designed to work with big data sets in parallel on a distributed cluster system. MapReduce programs consist of two steps. First, a map step takes chunks of data and processes them in some way (e.g. parsing text into words). Second, a reduce step takes the data that are generated by the map step and performs some kind of summary calculation (e.g. counting word occurrences). In between the map and reduce steps, data move between machines using a key-value pair system that guarantees that each reducer has the information it needs to complete its calculation (e.g. all of the occurrences of the word “Python” get routed to a single processor so they can be counted in aggregate).
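
The word-count sketch below mimics the shape of such a program in plain Python (on one machine rather than an actual cluster), with invented documents standing in for the chunks of data:

    # Map, shuffle, and reduce steps of a toy word count.
    from collections import defaultdict

    documents = ["python counts words", "words words words", "python again"]

    # Map step: emit a (word, 1) pair for every word in every chunk of data.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle step: group pairs by key so each "reducer" sees one word's pairs.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce step: summarize each group, here by summing the counts.
    totals = {word: sum(counts) for word, counts in grouped.items()}
    print(totals)  # {'python': 2, 'counts': 1, 'words': 4, 'again': 1}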
various patterns appropriate to the problem at hand,
minimum viable product (MVP)
The minimum viable product is the smallest complete unit of work that would be valuable in its own right, even if the rest of the project fizzled out.

model
The specification of mathematical or probabilistic relationships existing between different variables. Because “modeling” can mean many things, the term “statistical modeling” is often used to more accurately describe the kind of modeling that data scientists do.

N

natural language processing (NLP)
Natural Language Processing (NLP) is a branch of data science that applies machine learning techniques to help machines learn to interpret and process textual data consisting of human language. Applications of NLP include text classification (predicting what type of content a document contains), sentiment analysis (determining whether a statement is positive, negative, or neutral), and translation. NLP also comprises techniques to encode textual content numerically to use in machine learning applications.

Naive Bayes
A classification algorithm that predicts labels from data by assuming that the features of the data are statistically independent from each other. Due to this assumption, Naive Bayes models can be easily fit on distributed systems.
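
A minimal sketch with scikit-learn’s MultinomialNB, classifying tiny invented word-count vectors as “spam” or “ham”:

    # Naive Bayes on toy counts of the words ["free", "meeting", "prize"].
    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    X_train = np.array([[3, 0, 2],   # spam-like
                        [2, 0, 3],   # spam-like
                        [0, 2, 0],   # ham-like
                        [0, 3, 1]])  # ham-like
    y_train = ["spam", "spam", "ham", "ham"]

    model = MultinomialNB().fit(X_train, y_train)
    print(model.predict(np.array([[1, 0, 2]])))  # most likely ['spam']
    print(model.predict(np.array([[0, 2, 0]])))  # most likely ['ham']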

neural network
A machine learning method modeled after the brain. This method is extremely powerful and flexible, as it is created from an arbitrary number of artificial neurons that can be connected in various patterns appropriate to the problem at hand, and the strengths of those connections are adjusted during the training process. Neural networks are able to learn extremely complex relationships between data and output, at the cost of large computational needs. They have been used to great success in processing image, movie, and text data, and any situation with very large numbers of features.

non-stationarity
Non-stationarity occurs when the mapping between the features of our data and the label we’re trying to predict changes from the time our model was trained. Housing prices, for example, are non-stationary: a model fit in the 1930s would make exceptionally poor predictions today, as houses cost a lot less back then. Models fit on non-stationary data must be backtested and adjusted frequently to keep them relevant.

NoSQL
A database management system that uses any of several alternatives to the relational, table-oriented model used by SQL databases. Originally meant as “not SQL,” it has come to mean something closer to “not only SQL” due to the specialized nature of NoSQL database management systems. These systems often are tasked with playing specific roles in a larger system that may also include SQL and additional NoSQL systems.

O

online learning
Online learning is a learning paradigm by which machine learning models may be trained by passing them training data sequentially or in small groups (mini-batches). This is important in instances where the amount of data on hand exceeds the capacity of the RAM of the system on which a model is being developed. Online learning also allows models to be continually updated as new data are produced.
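
A rough sketch of the idea with scikit-learn’s SGDClassifier, whose partial_fit method accepts one mini-batch at a time; the batches here are randomly generated stand-ins for data arriving in pieces:

    # Train incrementally on small batches instead of all data at once.
    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(0)
    model = SGDClassifier()
    classes = np.array([0, 1])  # all possible labels, declared up front

    for _ in range(10):                      # pretend each loop is a new batch
        X_batch = rng.normal(size=(32, 3))
        y_batch = (X_batch[:, 0] > 0).astype(int)
        model.partial_fit(X_batch, y_batch, classes=classes)

    print(model.predict(np.array([[2.0, 0.0, 0.0]])))  # usually [1]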

overfitting
See variance.

P

Perl
An older scripting language with roots in pre-Linux UNIX systems. Perl has always been popular for text processing, especially data cleanup and enhancement tasks.

Pig
Apache Pig is a high-level platform for creating programs that run on Hadoop. Pig is designed to make it easier to create data processing and analysis workflows that can be executed in MapReduce, Spark, or other distributed frameworks.

precision
A performance measure for classification models. Precision measures the fraction of all of the observations that a classification algorithm flagged positively that were flagged correctly. For example, if our algorithm were judging suspects, precision would measure the percentage of all the suspects declared guilty by the algorithm who actually were guilty. See also recall.

predictive analytics
The analysis of data to predict future events, typically to aid in business planning. This incorporates predictive modeling and other techniques. Machine learning might be considered a set of algorithms to help implement predictive analytics.

predictive modeling
The development of statistical models to predict future events.

Python
A programming language available since 1994 that is popular with people doing data science. Python is noted for ease of use among beginners and great power when used by advanced users, especially when taking advantage of specialized libraries such as those designed for machine learning and graph generation.

R

R
An open-source programming language and environment for statistical computing and graph generation available for Linux, Windows and Mac. Along with Python, R is among the most popular software packages used by data scientists.

regression
Regression is one of the two major types of supervised learning models in which the labels we train the algorithm to predict are ordered quantities like prices or numerical amounts. One might use a regression, for instance, to predict temperatures over time or housing prices within a city. See also classification, supervised learning.

recall
A performance measure for classification models. Recall measures the fraction of all of the observations that a classification algorithm should have flagged positively that were actually flagged by the algorithm. For example, if our algorithm were judging suspects, recall would measure the percentage of all guilty suspects that the algorithm correctly identified as such. See also precision.
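
A worked example of both precision and recall, computed directly from their definitions on invented labels (1 = “guilty”/positive, 0 = “innocent”/negative):

    # Precision: of everything flagged positive, how much was right?
    # Recall: of everything actually positive, how much was flagged?
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 1, 0, 0, 1, 0]

    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    flagged = sum(y_pred)   # observations the algorithm declared positive
    actual = sum(y_true)    # observations that really were positive

    precision = true_pos / flagged
    recall = true_pos / actual
    print(f"precision = {precision:.2f}, recall = {recall:.2f}")  # 0.75, 0.75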

Ruby
A scripting language that first appeared in 1996. Ruby is popular in the data science community, but not as popular as Python, which has more specialized libraries available for data science tasks.

S

SAS
A commercial statistical software suite that includes a programming language also known as SAS.

Scala
Scala is a Java-like programming language commonly used by data scientists. It is the native language of Spark.

Scikit-Learn
The most common Python package for machine learning.

shell
A computer’s operating system when used from the command line. Along with scripting languages such as Perl and Python, Linux-based shell tools (included and available for Mac and Windows computers) such as grep, diff, split, comm, head and tail are popular for data wrangling. A series of shell commands stored in a file that lets you execute the series by entering the file’s name is known as a shell script.

Simpson’s paradox
Simpson’s paradox is a phenomenon in which a trend appears in several different groups of data but disappears or reverses when these groups are combined.
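
A small numeric illustration, with counts patterned after the well-known kidney-stone treatment example: treatment A has the higher success rate within each group, yet treatment B looks better once the groups are combined.

    # Simpson's paradox: per-group success rates vs. the combined rate.
    groups = {
        "small stones": {"A": (81, 87), "B": (234, 270)},   # (successes, trials)
        "large stones": {"A": (192, 263), "B": (55, 80)},
    }

    totals = {"A": [0, 0], "B": [0, 0]}
    for name, arms in groups.items():
        for arm, (ok, n) in arms.items():
            totals[arm][0] += ok
            totals[arm][1] += n
            print(f"{name:12s} {arm}: {ok / n:.0%}")   # A wins in each group

    for arm, (ok, n) in totals.items():
        print(f"combined     {arm}: {ok / n:.0%}")     # yet B wins overall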

Spark
Apache Spark is a high-level open-source distributed cloud computing framework. Spark is particularly valuable because it contains libraries that support the querying of distributed databases, distributed processing and wrangling, and distributed machine learning. As such, it provides end-to-end solutions that allow data scientists to take full advantage of cloud computing resources.

SPSS
A commercial statistical software package, or predictive analytics software, popular in the social sciences. It has been available since 1968 and was acquired by IBM in 2009.

SQL
Stands for “Structured Query Language.” The ISO standard query language for relational databases. This language is used to ask structured databases for information out of one or more data tables stored in the database. Variations of this extremely popular language are often available for data storage systems that aren’t strictly relational. Watch for the phrase “SQL-like.”
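
A tiny illustration using Python’s built-in sqlite3 module; the table and rows are invented, and the same basic SELECT syntax (with vendor variations) applies to most relational databases:

    # Create a throwaway in-memory database, then query it with SQL.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)",
                     [("east", 120.0), ("west", 80.0), ("east", 45.5)])

    # Ask the database for information out of the sales table.
    for row in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
        print(row)   # e.g. ('east', 165.5) and ('west', 80.0)
    conn.close()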


Stata
A commercial statistical software package commonly used by academics, particularly in the social sciences.

supervised learning
A type of machine learning algorithm in which a system learns to predict labels after being shown a set of training data and identifying statistical associations between features in the data and the labels it is given. The classic example is sorting email into spam versus ham. See also unsupervised learning, machine learning.

T

Tableau
A commercial data visualization package often used in data science projects.

U

unsupervised learning
A class of machine learning algorithms designed to identify (potentially) useful patterns or structures in data without being directed to perform a specific prediction or decision task.

V

variance
Variance is the amount that the estimate of the target function will change if different training data were used. Another way of saying this is that variance measures the degree to which a model picks up noise as opposed to signal. High variance is synonymous with overfitting.

W

width
Width measures the number of features in a dataset.

Every 2 days, we generate as much data as all of humanity did up to 2003.

Make sure none of that data goes to waste with help from the data experts at Pragmatic Institute. We deliver the most relevant and in-demand curriculum to students at every point in their data science careers. From data dabblers to Ph.D.s and professionals looking to increase their data skills, our expert instructors provide hands-on training with the latest data science tools and techniques.

DATA SCIENCE COURSES
Essential Data Tools
Practical Machine Learning
Advanced Machine Learning
Artificial Intelligence with TensorFlow

PragmaticInstitute.com/Data-Science



MARKET-DRIVEN. DATA-LED.
Get the skills, tools and techniques you need to make
a lasting impact on your organization, with help from
the world leader in product and data training.

PragmaticInstitute.com
