Module5_notes
Module 5: Unfolding Computational Graphs, Recurrent Neural Network, Bidirectional RNNs, Deep
Recurrent Networks, Recursive Neural Networks, The Long Short Term Memory and Other Gated RNNs,
Applications.
Textbook: Ian Goodfellow, Yoshua Bengio, Aaron Courville, “Deep Learning”, MIT Press, 2016.
Recurrent neural networks or RNNs (Rumelhart et al., 1986a) are a family of neural networks for processing
sequential data. Much as a convolutional network is a neural network that is specialized for processing a grid
of values X such as an image, a recurrent neural network is a neural network that is specialized for processing
a sequence of values x(1), . . . , x(τ). Just as convolutional networks can readily scale to images with large width
and height, and some convolutional networks can process images of variable size, recurrent networks can
scale to much longer sequences than would be practical for networks without sequence-based specialization.
Most recurrent networks can also process sequences of variable length.
• If we used a feedforward network designed for fixed-length sentences, it would require separate
parameters for each input feature, forcing it to learn the rules of the language independently for every
position in the sentence.
• In contrast, a recurrent neural network (RNN) shares the same weights across multiple time steps,
enabling it to generalize patterns regardless of their position within the sequence.
• A similar concept is the use of convolution over a one-dimensional temporal sequence, which forms the
foundation of time-delay neural networks (Lang and Hinton, 1988; Waibel et al., 1989; Lang et al., 1990).
• Convolution enables a network to share parameters across time but remains relatively shallow. Each
output in a convolutional network is determined by a small neighbourhood of inputs, with parameter
sharing achieved through the repeated application of the same convolution kernel at every time step.
• Recurrent networks, on the other hand, share parameters differently. In these networks, each output
depends on the preceding outputs, with the same update rule applied consistently across time steps.
• This recurrent mechanism results in parameter sharing across a much deeper computational graph,
enabling the network to capture long-term dependencies in the data.
• For simplicity, we describe RNNs as processing a sequence of vectors 𝑥 (𝑡) , where the time step index t
ranges from 1 to τ.
• In practice, however, recurrent networks typically operate on minibatches of sequences, where each
sequence in the batch may have a different length τ. To keep the notation simple, we omit the minibatch
indices.
• Additionally, the time step index t does not always represent the actual passage of time; it often simply
indicates the position within the sequence.
• RNNs can also be applied in two dimensions for spatial data, such as images. Even when working with
time-based data, the network can include connections that move backward in time, as long as the entire
sequence is available before being processed by the network.
• The next section extends the idea of a computational graph to include cycles. These cycles represent the
influence of the present value of a variable on its own value at a future time step.
• Such computational graphs allow us to define recurrent neural networks. We then describe many different
ways to construct, train, and use recurrent neural networks.
• Consider the classical form of a dynamical system, s(t) = f(s(t−1); θ), where s(t) is the state of the system. Each node represents the state at some time t, and the function f maps the state at time t to the state at time t + 1. The same parameters (the same value of θ used to parameterize f) are used for all time steps.
• As another example, let us consider a dynamical system driven by an external signal x(t): s(t) = f(s(t−1), x(t); θ), where we see that the state now contains information about the whole past sequence.
• Recurrent neural networks can be built in many different ways. Much as almost any function can be
considered a feedforward neural network, essentially any function involving recurrence can be
considered a recurrent neural network.
• To indicate that the state is the hidden units of the network, we now rewrite the above equation using the variable h to represent the state: h(t) = f(h(t−1), x(t); θ).
• A recurrent network with no outputs is illustrated in figure 2. This recurrent network just processes information from the input x by incorporating it into the state h that is passed forward through time. (Left) Circuit diagram: the black square indicates a delay of a single time step. (Right) The same network seen as an unfolded computational graph, where each node is now associated with one particular time instance.
• As illustrated in figure 2, typical RNNs will add extra architectural features such as output layers that
read information out of the state h to make predictions.
• When a recurrent network is trained for tasks that involve predicting the future from the past, it
typically learns to use h(t) as a condensed summary of the task-relevant aspects of the input sequence
up to time t. This summary is inherently lossy because it compresses a sequence of arbitrary length,
(x(t), x(t−1), x(t−2), . . . , x(2), x(1)), into a fixed-length vector h(t). The degree of detail retained in this
summary depends on the training objective; some parts of the sequence may be preserved with greater
precision than others.
• For instance, in statistical language modeling, where the goal is to predict the next word based on
previous words, the RNN does not need to store all information from the input sequence. Instead, it
retains just enough information to predict the continuation of the sentence effectively.
• In the most demanding situation, we ask h(t) to be rich enough to allow one to approximately recover the input sequence, as in autoencoder frameworks.
• Unfolding: from circuit diagram to computational graph:
• Equation h(t) =f (h(t-1), x(t); θ) can be written in two different ways: circuit diagram or an unfolded
computational graph. Unfolding is the operation that maps a circuit to a computational graph with
repeated pieces. The unfolded graph has a size dependent on the sequence length.
• We can represent the unfolded recurrence after t steps with a function g(t): h(t) = g(t)(x(t), x(t−1), x(t−2), . . . , x(2), x(1)) = f(h(t−1), x(t); θ). The function g(t) takes the whole past sequence (x(t), x(t−1), x(t−2), . . . , x(2), x(1)) as input and produces the current state, but the unfolded recurrent structure allows us to factorize g(t) into repeated application of a function f.
• The unfolding process thus introduces two major advantages:
i. Regardless of the sequence length, the learned model always has the same input size, because
it is specified in terms of transition from one state to another state, rather than specified in terms
of a variable-length history of states.
ii. It is possible to use the same transition function f with the same parameters at every time step.
• These two factors enable the learning of a single model f that works across all time steps and sequence
lengths, rather than requiring separate models g(t) for each time step. By sharing parameters, the model
can generalize to sequence lengths not seen during training and can be trained with significantly fewer
examples compared to models without parameter sharing.
• Both the recurrent graph and the unrolled graph have their uses. The recurrent graph is succinct. The
unfolded graph provides an explicit description of which computations to perform. The unfolded graph
also helps to illustrate the idea of information flow forward in time (computing outputs and losses) and
backward in time (computing gradients) by explicitly showing the path along which this information
flows.
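To make this concrete, below is a minimal NumPy sketch of the unfolded recurrence h(t) = f(h(t−1), x(t); θ). The tanh update and the parameter names W, U, b are illustrative assumptions; the point is only that the same f, with the same parameters, is applied at every time step, for any sequence length.

```python
import numpy as np

def f(h_prev, x_t, W, U, b):
    """One application of the shared transition function:
    h(t) = tanh(W h(t-1) + U x(t) + b)."""
    return np.tanh(W @ h_prev + U @ x_t + b)

def unroll(x_seq, h0, W, U, b):
    """Unfolded recurrence: the same f (same parameters theta = {W, U, b})
    is applied once per time step, regardless of sequence length."""
    h, states = h0, []
    for x_t in x_seq:              # t = 1, ..., tau
        h = f(h, x_t, W, U, b)
        states.append(h)
    return states

# Toy usage: hidden size 3, input size 2, sequence length 5.
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(3, 3)) * 0.1, rng.normal(size=(3, 2)) * 0.1, np.zeros(3)
states = unroll([rng.normal(size=2) for _ in range(5)], np.zeros(3), W, U, b)
print(len(states), states[-1].shape)   # 5 (3,)
```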
RNNs are specialized neural networks designed for sequence data. They process sequences step by step,
sharing parameters across time to handle inputs of varying lengths. The key concept to be captured in here
are:
• Unfolding Computational Graphs: RNNs use a repetitive structure that simplifies computations. Instead
of using a separate model for each time step, a single function with shared parameters is applied
repeatedly.
• Parameter Sharing: This allows RNNs to generalize across different sequence lengths and positions,
making them more efficient and requiring fewer training examples.
• Teacher Forcing in Recurrent Neural Networks (RNNs): Teacher forcing is a training technique used in
RNNs where the model is provided with the actual target output (ground truth) from the training data at
the previous timestep, rather than using the model's predicted output. This helps guide the model during
training and ensures faster convergence.
• How It Works:
• During training, at each time step t, the RNN receives:
• The current input x(t).
• The true target output y(t−1) from the previous time step as context.
• This contrasts with the inference phase, where the model uses its own predicted output ŷ(t−1) from the previous time step as input.
• Advantages
• Faster Training: Providing the ground truth reduces error propagation during training, leading to quicker
convergence.
• Stable Gradients: Teacher forcing stabilizes the training process, especially in long sequences, by
preventing compounded prediction errors.
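The contrast between the two regimes can be sketched as follows (a minimal illustration; the step function rnn_step, the extra recurrent weight R on the previous output, and the linear output layer are assumptions of this sketch, not a prescribed architecture):

```python
import numpy as np

def rnn_step(h_prev, x_t, y_prev, W, U, R, b):
    """State update that also conditions on the previous output y(t-1)."""
    return np.tanh(W @ h_prev + U @ x_t + R @ y_prev + b)

def teacher_forced_pass(x_seq, y_seq, h0, y0, W, U, R, b, V, c):
    """Training with teacher forcing: feed the TRUE target y(t-1) back in."""
    h, y_prev, preds = h0, y0, []
    for x_t, y_t in zip(x_seq, y_seq):
        h = rnn_step(h, x_t, y_prev, W, U, R, b)
        preds.append(V @ h + c)     # prediction for time t
        y_prev = y_t                # ground truth replaces the prediction
    return preds

def free_running_pass(x_seq, h0, y0, W, U, R, b, V, c):
    """Inference: feed the model's OWN previous prediction back in."""
    h, y_prev, preds = h0, y0, []
    for x_t in x_seq:
        h = rnn_step(h, x_t, y_prev, W, U, R, b)
        y_prev = V @ h + c          # prediction is reused at the next step
        preds.append(y_prev)
    return preds
```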
• Design Patterns in Recurrent Neural Networks (RNNs)
• RNNs can be designed in various ways depending on the specific task and the nature of the input and
output sequences. Here are the three main design patterns:
1. Produce Outputs at Every Time Step with Connections Between Hidden Units
• Description:
• At each time step, the RNN processes the input and produces an output.
• There are recurrent connections between the hidden units across time steps, allowing the
network to retain information from previous inputs.
• Applications:
• Time series prediction (e.g., stock price forecasting).
• Language models that predict the next word in a sequence.
• Illustration: x1→h1→o1, h1→h2; x2→h2→o2, h2→h3; …; xT→hT→oT
2. Use Outputs at One Time Step as Inputs for the Next
• Description:
• The output from one time step is used as an input for the next time step, in addition to the
original input at that step.
• These connections enable the model to utilize the output information from prior steps directly.
• Applications:
• Text generation, where each word predicted is fed back into the network to generate the next
word.
• Reinforcement learning environments requiring sequential decision-making.
• Illustration: x1→h1→o1, (o1, x2)→h2→o2, (o2, x3)→h3→o3, …
3. Read an Entire Sequence and Produce a Single Output
• Description:
• The RNN processes the entire input sequence and generates a single output at the end, summarizing the sequence into a fixed-size representation.
• This design is common for tasks where the sequence needs to be analyzed as a whole.
• Applications:
• Sequence classification tasks such as sentiment analysis of a review or document.
• Producing a fixed-size summary of a sequence that is then used as input for further processing.
• Forward Propagation
• RNNs update their hidden state at each time step using a combination of the current input and the previous hidden state. For the design with hidden-to-hidden recurrence and an output at every time step, the forward propagation equations are:
a(t) = b + W h(t−1) + U x(t)
h(t) = tanh(a(t))
o(t) = c + V h(t)
ŷ(t) = softmax(o(t))
• where b and c are bias vectors, and U, W and V are the input-to-hidden, hidden-to-hidden and hidden-to-output weight matrices, respectively.
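A minimal NumPy sketch of these forward-propagation equations (the softmax output and the parameter shapes are as assumed above; an illustration, not an optimized implementation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(x_seq, h0, U, W, V, b, c):
    """Forward propagation with hidden-to-hidden recurrence and an
    output at every time step:
        a(t) = b + W h(t-1) + U x(t)
        h(t) = tanh(a(t))
        o(t) = c + V h(t)
        yhat(t) = softmax(o(t))
    Returns the list of per-step output distributions yhat(t)."""
    h, outputs = h0, []
    for x_t in x_seq:
        a = b + W @ h + U @ x_t
        h = np.tanh(a)
        o = c + V @ h
        outputs.append(softmax(o))
    return outputs
```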
➢ Bidirectional RNNs:
• All the recurrent networks we have discussed so far have a “causal” structure, meaning the state at time t captures information only from the past inputs x(1), …, x(t−1) and the current input x(t).
• In some cases, these models also incorporate information from past y values into the current state,
provided those y values are accessible.
• In some tasks, the interpretation of the current input depends on the entire input sequence, not just past
or present information.
• For instance, in speech recognition, determining the correct phoneme for the current sound might rely
on the next few phonemes due to co-articulation or even on future words because of linguistic
dependencies.
• If multiple interpretations of a word are acoustically plausible, context from both the future and the
past may be needed to resolve ambiguity.
• This concept applies to other sequence-to-sequence tasks as well, such as handwriting recognition.
• Bidirectional Recurrent Neural Networks (Bidirectional RNNs) were introduced by Schuster and
Paliwal in 1997 to address the need for utilizing both past and future context when processing
sequential data. These networks process input sequences in both forward and backward directions,
enabling the model to capture information from the entire sequence.
• Bidirectional RNNs have proven highly effective in tasks requiring this capability, such as:
• Handwriting recognition (Graves et al., 2008; Graves and Schmidhuber, 2009).
• Speech recognition (Graves and Schmidhuber, 2005; Graves et al., 2013).
• Bioinformatics (Baldi et al., 1999).
• Their ability to integrate context from both directions makes them powerful for sequence-to-sequence
tasks.
• Bidirectional RNNs combine two RNNs: one processes the sequence forward in time (from the start
to the end), and the other processes it backward in time (from the end to the start). In this architecture:
• h(t): Represents the state of the forward-moving RNN.
• g(t): Represents the state of the backward-moving RNN.
• This setup allows the output units o(t) to create a representation that leverages information from both
the past and the future.
• Unlike feedforward networks, convolutional networks, or standard RNNs with fixed-size look-ahead
buffers, bidirectional RNNs do not require specifying a fixed-size window around t. Instead, they
naturally focus on the input values near t, making them highly effective for tasks that benefit from
bidirectional context.
• The concept of bidirectional RNNs can be extended to 2-dimensional inputs, such as images, by
introducing four RNNs that process the data in different directions: up, down, left, and right. At each
position (i,j) in a 2D grid, an output Oi,j can compute a representation that captures mostly local
information while also incorporating long-range dependencies, provided the RNN effectively learns
to carry such information.
• Compared to convolutional networks:
• RNNs applied to images are generally more computationally expensive.
• They enable long-range lateral interactions between features within the same feature map, which
convolutional networks cannot achieve as easily.
• The forward propagation equations for these RNNs show they include:
• A convolution operation to compute bottom-up input for each layer.
• Recurrent propagation across the feature map, incorporating lateral interactions.
• This approach allows RNNs to leverage both local and global context in image data, providing a more
comprehensive understanding of spatial relationships (Visin et al., 2015; Kalchbrenner et al., 2015).
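Returning to the one-dimensional case, a minimal sketch of a bidirectional RNN forward pass is given below. The tanh updates, the parameter names and the concatenation of h(t) and g(t) before the output layer are illustrative assumptions; the essential point is one recurrence running forward in time and one running backward, with each output seeing both.

```python
import numpy as np

def birnn_forward(x_seq, fwd, bwd, V, c):
    """Bidirectional RNN: h(t) comes from a forward-in-time recurrence,
    g(t) from a backward-in-time recurrence; o(t) depends on both."""
    Wf, Uf, bf = fwd
    Wb, Ub, bb = bwd
    T, n = len(x_seq), Wf.shape[0]

    h, h_states = np.zeros(n), []
    for x_t in x_seq:                         # t = 1, ..., T (forward)
        h = np.tanh(Wf @ h + Uf @ x_t + bf)
        h_states.append(h)

    g, g_states = np.zeros(n), [None] * T
    for t in reversed(range(T)):              # t = T, ..., 1 (backward)
        g = np.tanh(Wb @ g + Ub @ x_seq[t] + bb)
        g_states[t] = g

    # Each output combines past context (h) and future context (g).
    return [V @ np.concatenate([h_states[t], g_states[t]]) + c for t in range(T)]
```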
• A deep RNN is simply an RNN with multiple hidden layers stacked on top of each other. This stacking
allows the network to learn more complex patterns and representations from the data. Each layer in a deep
RNN can capture different levels of abstraction, making it more powerful than a single-layer RNN.
• Key Idea is that they process one step of a sequence at a time (like reading one word or one frame of a
video). Information flows not only through the sequence (via time) but also vertically through the layers
to capture deeper relationships.
• Example: Imagine analysing a video: A simple RNN might understand how objects in a single frame relate
to the previous one. A deep RNN could understand more abstract things, like the movement of a person
or the mood of the scene.
In the above figure 4, the deep RNN has three hidden layers. The arrows indicate the flow of information
through the network over time.
Let us now try to understand how a recurrent neural network can be made deep in many ways.
a) The hidden recurrent state can be broken down into groups organized hierarchically.
b) Deeper computation (e.g., an MLP) can be introduced in the input-to-hidden, hidden-to-hidden and
hidden-to-output parts. This may lengthen the shortest path linking different time steps.
c) The path-lengthening effect can be mitigated by introducing skip connections.
• As seen in figure 4(a), the hidden recurrent state is broken into groups or layers that are organized
hierarchically.
• The state h is updated at each time step, and it interacts with another intermediate layer z before
producing the final output y.
• Hierarchical organization means the recurrent hidden states (h) have additional layers of computation
(z) between them, creating a deeper network.
• The hierarchical nature of hidden states aligns with the idea that the hidden recurrent state can be
broken into groups.
• This hierarchical arrangement introduces multiple processing steps (hidden layers), allowing the RNN
to learn more abstract representations over time.
• As in figure 4(c), skip connections shorten the gradient path by allowing information to flow directly between layers, mitigating the path-lengthening effect caused by the deeper computation.
• The skip connections mentioned in the paragraph are explicitly shown here as shortcuts in the
network.
• These connections address the problem of long paths, improving gradient flow and enabling the RNN
to handle deeper architectures effectively.
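A minimal sketch of the simplest deep RNN, a stack of recurrent layers in which each layer's hidden states become the input sequence of the layer above (the layer parameters and the tanh update are illustrative assumptions):

```python
import numpy as np

def deep_rnn_forward(x_seq, layers):
    """Stacked (deep) RNN: information flows through time within each layer
    and vertically through the stack of layers at every time step.
    `layers` is a list of (W, U, b) tuples, one per recurrent layer."""
    seq = x_seq
    for W, U, b in layers:
        h, next_seq = np.zeros(W.shape[0]), []
        for x_t in seq:                       # recurrence within this layer
            h = np.tanh(W @ h + U @ x_t + b)
            next_seq.append(h)
        seq = next_seq                        # states feed the layer above
    return seq                                # top-layer states h_L(1..T)
```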
Recursive neural networks represent yet another generalization of recurrent networks, with a different kind
of computational graph, which is structured as a deep tree, rather than the chain-like structure of RNNs. The
typical computational graph for a recursive network is illustrated in figure 5.
• Recursive networks were first proposed by Pollack (1990), with their potential application for reasoning
highlighted by Bottou (2011). These networks have been effectively utilized for processing structured data
as inputs to neural networks (Frasconi et al., 1997, 1998), as well as in natural language processing and
computer vision.
• A notable advantage of recursive networks over recurrent networks is their ability to significantly reduce
the depth (measured by the number of compositions of nonlinear operations) from τ to O (log τ) for a
sequence of length τ, which can help address long-term dependencies. However, determining the optimal
tree structure remains an open question. One approach is to use a fixed tree structure, such as a balanced
binary tree, that does not depend on the data.
• In certain domains, external methods can provide suitable tree structures. For instance, in natural language
processing, the tree structure can be aligned with the parse tree of a sentence generated by a natural
language parser. Ideally, the learner itself would infer and discover the most appropriate tree structure for
a given input, as suggested by Bottou (2011).
• There are numerous possible variants of the recursive network concept. For instance, Frasconi et al. (1997,
1998) associate the data with a tree structure where inputs and targets are linked to individual nodes of the
tree.
• The computations performed at each node are not limited to the conventional artificial neuron operations
(an affine transformation followed by a monotonic nonlinearity). For example, Socher et al. (2013a)
propose the use of tensor operations and bilinear forms, which have previously been shown to effectively
model relationships between concepts (Weston et al., 2010; Bordes et al., 2012) when the concepts are
represented as continuous vector embeddings.
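A minimal sketch of a recursive network applied to a binary tree (the nested-tuple tree representation, the embedding lookup and the single shared composition function are illustrative assumptions):

```python
import numpy as np

def recursive_forward(node, W_left, W_right, b, embed):
    """Recursive network: the SAME composition function, with the same
    parameters, is applied at every internal node of the tree.
    A node is either a leaf token (str) or a pair (left, right)."""
    if isinstance(node, str):                     # leaf: embedding lookup
        return embed[node]
    left, right = node
    h_l = recursive_forward(left, W_left, W_right, b, embed)
    h_r = recursive_forward(right, W_left, W_right, b, embed)
    return np.tanh(W_left @ h_l + W_right @ h_r + b)

# Toy usage on the balanced tree ((the, cat), (sat, down)), hidden size 4.
rng = np.random.default_rng(0)
d = 4
embed = {w: rng.normal(size=d) * 0.1 for w in ["the", "cat", "sat", "down"]}
W_l, W_r = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
root = recursive_forward((("the", "cat"), ("sat", "down")), W_l, W_r, np.zeros(d), embed)
print(root.shape)   # (4,) — depth grows as O(log tau) for a balanced tree
```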
The introduction of self-loops to create paths that allow gradients to flow for extended durations is a key
innovation of the original Long Short-Term Memory (LSTM) model by Hochreiter and Schmidhuber
(1997). A significant improvement was later made by Gers et al. (2000), where the weight of the self-loop is
conditioned on the context rather than being fixed. By using a gated mechanism (controlled by another
hidden unit), the time scale of integration can be dynamically adjusted. This means that even with fixed
parameters, an LSTM can adapt the integration time scale based on the input sequence, as the time constants
are generated by the model itself.
The LSTM has achieved remarkable success across a variety of applications, including unconstrained handwriting recognition, speech recognition, handwriting generation, machine translation, image captioning, and parsing.
• Just like a simple RNN, an LSTM has a hidden state, where h(t−1) represents the hidden state of the previous time step and h(t) the hidden state of the current time step. In addition, the LSTM has a cell (internal) state, represented by s(t−1) and s(t) for the previous and current time steps, respectively.
• Example of LSTM working: Let us take an example to understand how an LSTM works. Here we have two sentences separated by a full stop. The first sentence is “Bob is a nice person,” and the second sentence is “Dan, on the other hand, is evil.” It is very clear that in the first sentence we are talking about Bob, and as soon as we encounter the full stop (.), we start talking about Dan.
• As we move from the first sentence to the second, the network should realize that we are no longer talking about Bob; our subject is now Dan. The forget gate is what allows the network to discard the information about Bob. Let us now look at the roles played by the gates in the LSTM architecture.
• Forget Gate: In an LSTM cell, the first step is to decide whether we should keep the information from the previous time step or forget it. The forget gate unit f_i(t) (for time step t and cell i) sets this weight to a value between 0 and 1 via a sigmoid unit:
f_i(t) = σ( b_i^f + Σ_j U_{i,j}^f x_j(t) + Σ_j W_{i,j}^f h_j(t−1) )
• where x(t) is the current input vector and h(t) is the current hidden layer vector, containing the outputs of all the LSTM cells, and b^f, U^f, W^f are respectively the biases, input weights and recurrent weights for the forget gates.
• The LSTM cell internal state is thus updated as follows, with a conditional self-loop weight f_i(t):
s_i(t) = f_i(t) s_i(t−1) + g_i(t) σ( b_i + Σ_j U_{i,j} x_j(t) + Σ_j W_{i,j} h_j(t−1) )
• where b, U and W respectively denote the biases, input weights and recurrent weights into the LSTM cell. The external input gate unit g_i(t) is computed similarly to the forget gate (with a sigmoid unit to obtain a gating value between 0 and 1), but with its own parameters:
g_i(t) = σ( b_i^g + Σ_j U_{i,j}^g x_j(t) + Σ_j W_{i,j}^g h_j(t−1) )
• The output h_i(t) of the LSTM cell can also be shut off, via the output gate q_i(t), which also uses a sigmoid unit for gating:
h_i(t) = tanh( s_i(t) ) q_i(t)
q_i(t) = σ( b_i^o + Σ_j U_{i,j}^o x_j(t) + Σ_j W_{i,j}^o h_j(t−1) )
• which has parameters b^o, U^o, W^o for its biases, input weights and recurrent weights, respectively. Among the variants, one can choose to use the cell state s_i(t) as an extra input (with its weight) into the three gates of the i-th unit, as shown in figure 6.
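A minimal NumPy sketch of a single LSTM step following the equations above, written in vectorized form (the parameter dictionary p and its key names are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, p):
    """One LSTM time step (vectorized over all cells i):
        f(t) = sigma(b_f + U_f x(t) + W_f h(t-1))               forget gate
        g(t) = sigma(b_g + U_g x(t) + W_g h(t-1))               external input gate
        s(t) = f(t)*s(t-1) + g(t)*sigma(b + U x(t) + W h(t-1))  internal (cell) state
        q(t) = sigma(b_o + U_o x(t) + W_o h(t-1))               output gate
        h(t) = tanh(s(t)) * q(t)                                cell output
    """
    f = sigmoid(p["b_f"] + p["U_f"] @ x_t + p["W_f"] @ h_prev)
    g = sigmoid(p["b_g"] + p["U_g"] @ x_t + p["W_g"] @ h_prev)
    q = sigmoid(p["b_o"] + p["U_o"] @ x_t + p["W_o"] @ h_prev)
    s = f * s_prev + g * sigmoid(p["b"] + p["U"] @ x_t + p["W"] @ h_prev)
    h = np.tanh(s) * q
    return h, s
```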
➢ Other Gated RNNs:
• Which pieces of the LSTM architecture are actually necessary? What other successful architectures could
be designed that allow the network to dynamically control the time scale and forgetting behaviour of
different units?
• Some answers to these questions are given with the recent work on gated RNNs, whose units are also
known as gated recurrent units.
• The main difference from the LSTM is that a single gating unit simultaneously controls the forgetting factor and the decision to update the state unit. The update equation is the following:
h_i(t) = u_i(t−1) h_i(t−1) + (1 − u_i(t−1)) σ( b_i + Σ_j U_{i,j} x_j(t) + Σ_j W_{i,j} r_j(t−1) h_j(t−1) )
• where u stands for the “update” gate and r for the “reset” gate. Their values are defined as usual:
u_i(t) = σ( b_i^u + Σ_j U_{i,j}^u x_j(t) + Σ_j W_{i,j}^u h_j(t) )
r_i(t) = σ( b_i^r + Σ_j U_{i,j}^r x_j(t) + Σ_j W_{i,j}^r h_j(t) )
• The reset and update gates can selectively "ignore" specific parts of the state vector. The update gates
function as conditional leaky integrators, allowing any dimension to be linearly gated.
• At one extreme of the sigmoid function, they can copy the current value directly, while at the other
extreme, they can completely discard it and replace it with the new target state value, which the leaky
integrator aims to reach.
• Meanwhile, the reset gates determine which parts of the state contribute to the computation of the next
target state. This introduces an additional nonlinear effect that influences the relationship between the
previous state and the future state.
• Many more variants around this theme can be designed. For example, the reset gate (or forget gate) output
could be shared across multiple hidden units. Alternately, the product of a global gate (covering a whole
group of units, such as an entire layer) and a local gate (per unit) could be used to combine global control
and local control.
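A minimal sketch of one gated recurrent unit step, written in the common per-step formulation in which the gates computed at time t are applied immediately (parameter names are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One gated-recurrent-unit step:
        u = sigma(b_u + U_u x + W_u h_prev)         update gate
        r = sigma(b_r + U_r x + W_r h_prev)         reset gate
        h_cand = tanh(b + U x + W (r * h_prev))     candidate state
        h = u * h_prev + (1 - u) * h_cand           conditional leaky integration
    The single update gate u controls both forgetting and updating."""
    u = sigmoid(p["b_u"] + p["U_u"] @ x_t + p["W_u"] @ h_prev)
    r = sigmoid(p["b_r"] + p["U_r"] @ x_t + p["W_r"] @ h_prev)
    h_cand = np.tanh(p["b"] + p["U"] @ x_t + p["W"] @ (r * h_prev))
    return u * h_prev + (1.0 - u) * h_cand
```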
Clipping gradients is a popular technique in optimization, especially for training recurrent neural networks
(RNNs) and their variants (like LSTMs and GRUs), to address the issue of exploding gradients that arise when
dealing with long-term dependencies.
• When training deep networks or RNNs over long sequences, the gradients of the loss function with
respect to the model parameters can sometimes grow exponentially during backpropagation through
time (BPTT).
• This occurs when the product of gradients of weight matrices over multiple time steps becomes
excessively large, leading to unstable updates in gradient-based optimization.
• Exploding gradients result in:
o Overshooting the optimal parameters,
o Poor convergence, or
o NaNs (numerical instability).
Gradient clipping is a method to stabilize training by preventing the gradients from growing too large. The
idea is simple:
• If the norm of the gradient exceeds a predefined threshold, the gradient is rescaled so that its norm
matches the threshold.
• This ensures that gradient updates remain controlled and do not explode.
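A minimal sketch of clipping by gradient norm (the threshold value in the example is arbitrary):

```python
import numpy as np

def clip_gradient_by_norm(grad, threshold):
    """If ||g|| exceeds the threshold, rescale g by threshold / ||g|| so that
    its norm equals the threshold; otherwise return it unchanged."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

# Toy usage: a gradient of norm 50 is rescaled to norm 5 before the update.
g = np.full(100, 5.0)                      # ||g|| = 50
print(np.linalg.norm(clip_gradient_by_norm(g, threshold=5.0)))   # 5.0
```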
➢ In tasks with long-term dependencies, the gradients must propagate through many time steps. While vanishing gradients hinder learning of long-term dependencies, exploding gradients can destabilize the network. Gradient clipping directly addresses this instability by rescaling overly large gradients before each parameter update, so that the step size stays bounded even on steep regions of the loss surface.
➢ When scaling such training to large models and long sequences, the following techniques are commonly combined:
• Distributed Training
o Data Parallelism: The dataset is divided into smaller chunks, and each chunk is processed by a
different machine or GPU. The gradients from all machines are then aggregated to update the
model weights.
o Model Parallelism: The model is split across different devices, with each device handling a subset
of the model's parameters.
• Gradient Clipping
o As previously discussed, gradient clipping is used to prevent exploding gradients when training
large models, particularly on long sequences (e.g., RNNs and LSTMs).
➢ Speech recognition: Speech recognition is the process of converting spoken language into written
text. It involves understanding, interpreting, and transcribing spoken words into a format that can be
processed by computers. It is a crucial component of various applications like voice assistants (e.g., Siri,
Google Assistant), transcription services, and voice-controlled devices.
➢ Key Features of Speech Recognition:
• Accuracy and Speed: Speech recognition systems can process speech in real time or near real time, providing quick responses to user inputs.
• Multi-Language Support: Support for multiple languages and dialects, allowing users from different
linguistic backgrounds to interact with technology in their native language.
• Background Noise Handling: The ability to filter out background noise and isolate the speaker's voice; this is crucial for voice-activated systems used in public or outdoor settings.
➢ Challenges in Speech Recognition
• Variability in Speech
▪ Accents and Dialects: Different accents, dialects, and regional variations in pronunciation
can make speech recognition more challenging.
▪ Background Noise: Speech recognition systems can struggle when background noise is
present, such as in crowded environments or when multiple speakers are involved.
▪ Speech Rate: Variations in the speed at which people speak can impact recognition
accuracy. Fast speech, slurred speech, or pauses between words can be problematic.
• Ambiguity in Speech
▪ Homophones: Words that sound the same but have different meanings (e.g., "see" and
"sea") can confuse the model.
▪ Contextual Meaning: Words can have different meanings depending on the context (e.g.,
"bank" as a financial institution or the side of a river).
▪ Out-of-Vocabulary Words: If a word has not been seen in the training data, the system may
struggle to recognize it correctly.
• Real-Time Processing
▪ For applications like voice assistants and real-time transcription, the model must recognize
speech instantaneously with minimal delay.
▪ Efficient latency management is crucial for keeping the user experience smooth and
responsive.
➢ Natural Language Processing (NLP): Natural Language Processing (NLP) is a branch of artificial
intelligence (AI) focused on enabling computers to understand, interpret, and generate human language.
NLP combines computational linguistics, machine learning, and deep learning to process and analyze large
amounts of natural language data, such as text or speech.
• Key Tasks in Natural Language Processing:
• NLP encompasses various tasks that are fundamental to understanding and processing human language.
Some of these tasks include:
• Tokenization: The process of breaking down text into smaller units, such as words, sentences, or subword units. For example, the sentence "Natural language processing is fun!" can be tokenized into the words "Natural," "language," "processing," "is," "fun," and "!" (see the sketch after this list).
• Part-of-Speech (POS) Tagging: This task involves identifying the grammatical components of each token
in a sentence. It labels words as nouns, verbs, adjectives, etc., to understand their roles within the sentence.
• Named Entity Recognition (NER): NER involves identifying and classifying entities (such as names of
people, organizations, locations, dates, etc.) in the text.
• Sentiment Analysis: This technique determines the sentiment or emotional tone of a piece of text. It helps
classify text as positive, negative, or neutral, and can be used to analyze social media, product reviews,
etc.
• Machine Translation: The process of automatically translating text from one language to another. Google Translate and DeepL are popular examples of machine translation systems.
• Text Summarization: This involves condensing a piece of text into a shorter summary while preserving its
essential meaning. There are two types of summarization:
o Extractive Summarization: Selects important sentences from the original text.
o Abstractive Summarization: Generates new sentences to convey the core ideas of the original text.
• Speech Recognition: Converting spoken language into written text. This task involves recognizing words
in spoken language and transcribing them into text format.
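As a small illustration of the tokenization task listed above, here is a naive word-level tokenizer (a regex-based sketch; real systems typically use trained subword tokenizers):

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens (naive word-level tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Natural language processing is fun!"))
# ['Natural', 'language', 'processing', 'is', 'fun', '!']
```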
• Challenges in NLP:
• Ambiguity: Natural language is inherently ambiguous. Words can have multiple meanings depending on
the context (e.g., "bank" can refer to a financial institution or the side of a river). Resolving this ambiguity
is a key challenge for NLP systems.
• Sarcasm and Irony: NLP models often struggle with understanding sarcasm, irony, and other figurative
language because they require nuanced understanding of context and tone.
• Multilingual NLP: NLP systems trained in one language may not perform well in others. Building
multilingual systems that can handle many languages is a complex challenge, but there has been progress
with multilingual models like mBERT (Multilingual BERT).
• Low-Resource Languages: Many languages still lack sufficient training data or resources, making it harder
to build high-performing models for these languages.
• Bias and Fairness: NLP models trained on large corpora may inherit biases present in the data, leading to
discriminatory or unfair outcomes in applications like hiring, law enforcement, or healthcare.