Documentation
TRACKING
INTRODUCED BY:
SUPERVISED BY:
Egypt/2024
Committee Report
We certify that, as an examining committee, we have read this graduation
project report and examined the student on its content, and
that in our opinion it is adequate as a project document for
“Radiation-Free Scoliosis Tracking”.
Chairman: Supervisor:
Name: Name: Dr. Nehal Khaled
Signature: Signature:
Date: / /2024 Date: / /2024
Examiner:
Name:
Signature:
Date: / /2024
Intellectual Property Right Declaration
This is to declare that the work carried out under the supervision of Dr. Nehal
Khaled, titled “Radiation-Free Scoliosis Tracking”, in
partial fulfillment of the requirements for the Bachelor of Science in
Computer Science, is the sole property of Ahram Canadian University
and the respective supervisor. It is protected under intellectual
property laws and conventions. It may be considered or used
for purposes such as extension for further enhancement, product
development, or adoption for commercial or organizational usage only
with the permission of the University and the respective supervisor. The
above statement applies to all students and faculty members.
Names:
Supervisor:
Names:
CHAPTER 1: INTRODUCTION
OVERVIEW
MOTIVATION
SCOPE
CONSTRAINTS
DOCUMENT ORGANIZATION
BACKGROUND
CHAPTER 2: BACKGROUND
Overview
Kyphosis: A rounding of the upper spine, often creating a visible hump on the
back.
Lordosis: An exaggerated forward curve in the lower back or neck area.
Scoliosis: A sideways, S- or C-shaped curve of the spine.
Ohio State Spine Care specialists in Columbus, Ohio, provide treatments for these
spinal curvature conditions.
What is Scoliosis?
Certain conditions, such as cerebral palsy and muscular dystrophy, can contribute to
scoliosis. Additionally, some birth defects and genetic disorders, like Marfan
syndrome and Down syndrome, are associated with scoliosis. Many individuals with
scoliosis may also have a family history of the condition.
Types of Scoliosis
Symptoms of Scoliosis
What is Kyphosis?
Types of Kyphosis
1. Postural Kyphosis: Often due to poor posture, this type is most common in
adolescents and young adults and can typically be improved through proper
posture and exercise.
2. Scheuermann’s Kyphosis: This type occurs when the front of the spine grows
slower than the back during growth spurts, like those in puberty. Stretching
and anti-inflammatory medications can be effective treatments.
3. Congenital Kyphosis: A birth defect in which the spine does not develop
properly in the womb. Treatment focuses on preventing further curvature.
Causes of Kyphosis
Symptoms of Kyphosis
The main sign of kyphosis is an excessive forward curve in the upper back, often
giving the appearance of rounded shoulders or a hump. Other symptoms may
include:
What is Lordosis?
Lordosis, also known as swayback, is a condition where the spine curves forward
excessively in the neck (cervical) or lower back (lumbar) regions. While some
forward curvature is normal, lordosis occurs when this curve is more pronounced.
The type of lordosis depends on the area of the spine affected:
1. Cervical Lordosis: The neck curves forward more than usual, positioning the
head further forward.
2. Lumbar Lordosis: The lower back curves excessively, causing the hips and
pelvis to shift forward. This may make a person appear to have their stomach
pushed forward or their buttocks more prominent.
Lordosis commonly affects people over 50, children going through growth spurts
and individuals who are pregnant. Doctors may suggest stretching and exercises to
help improve posture, and most people with lordosis don’t require further
treatment. Children often grow out of the condition.
Causes of Lordosis
While the cause of lordosis is often unknown, certain medical conditions can lead to
its development, including:
Achondroplasia
Spondylolisthesis
Osteoporosis
Kyphosis
Obesity
Symptoms of Lordosis
Many individuals with lordosis may not experience noticeable symptoms. However,
posture changes that may suggest lordosis include:
Lifestyle Modifications
1. Nonsurgical Treatments
2. Surgical Treatments
1. Pedicle Subtraction Osteotomy (PSO): Removes a wedge of bone from the spine
to decrease curvature.
Project Overview
Scoliosis is a condition that often requires regular monitoring through X-rays, which
expose patients to harmful radiation. The motivation behind this project is to
improve patient safety by significantly reducing radiation exposure during routine
check-ups. Additionally, the project aims to enhance monitoring accuracy by
utilizing an AI-trained model to provide precise tracking of scoliosis progression.
The application significantly reduces patients' exposure to harmful radiation and the
risk of potential diseases by minimizing the need for X-rays during follow-up
appointments.
SCOPE
Efficient measurement of the Cobb angle for scoliosis, as well as the degrees of
kyphosis and lordosis.
User guidance to position patients correctly in front of the camera for accurate
measurements.
AIM AND OBJECTIVE
The main objective of this project is to develop an AI-based system that helps scoliosis
patients by reducing their exposure to X-rays during routine follow-up. This program
will use advanced techniques to measure different angles of the spine and generate a
3D model, providing healthcare professionals with important information to
effectively monitor patient conditions and record progress.
CONSTRAINTS
Overweight patients: One challenge we face is that the device (Diers) used to
collect data cannot detect landmarks on the backs of overweight patients, because
doctors or device operators struggle to identify these landmarks accurately.
This reduces the accuracy of the data used to train the model, which in turn
affects overall accuracy.
Positioning: The doctor should take a photo of the patient’s back in a "neutral
posture" with their arms at their sides. Any other position could affect the
accuracy of the results.
Elimination: The AI models used in the application generate the spine and
eliminate the vertebrae, then calculate the slope and, from it, the Cobb
angle. In severe scoliosis the vertebrae are hard to eliminate, which
reduces accuracy.
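The slope-to-angle step mentioned under "Elimination" can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual implementation: it assumes the spine has already been reduced to an ordered list of (x, y) midline points, and it approximates the Cobb angle as the difference between the two most extreme segment tilts. The function names `segment_tilts_deg` and `cobb_angle_deg` are hypothetical.

```python
import math

def segment_tilts_deg(points):
    """Tilt of each spine segment from vertical, in degrees.

    `points` is an assumed input: (x, y) midline points ordered top to bottom.
    """
    tilts = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # atan2(dx, dy) measures the angle from the vertical axis and stays
        # finite even when a segment is perfectly vertical (dx == 0).
        tilts.append(math.degrees(math.atan2(x1 - x0, y1 - y0)))
    return tilts

def cobb_angle_deg(points):
    """Approximate Cobb angle: largest minus smallest segment tilt."""
    tilts = segment_tilts_deg(points)
    return max(tilts) - min(tilts)

# A straight spine has no curvature.
straight = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]
# A C-shaped curve that leans one way and then the other.
curved = [(0.0, 0.0), (1.0, 1.0), (1.0, 2.0), (0.0, 3.0)]
print(cobb_angle_deg(straight))
print(cobb_angle_deg(curved))
```

Working with angles from vertical rather than raw slopes avoids division by zero for upright segments, which is one reason this form may be preferable in practice.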
DOCUMENT ORGANIZATION
Chapter 2: Background
Reviews previous research efforts related to our topic and analyzes relevant
studies. It highlights the strengths and weaknesses of these solutions, setting
the stage for our proposed approach.
Presents the prototype system and describes the methodology applied in our
experiments.
BACKGROUND
What is Machine Learning?
There are various machine learning algorithms, such as linear regression, logistic
regression, decision trees, random forests, support vector machines (SVMs), k-
nearest neighbors (KNN), and clustering techniques. Each approach suits different
types of data and problems.
One prominent type of machine learning algorithm is the neural network, also
known as an artificial neural network. Inspired by the structure and function of
the human brain, neural networks consist of layers of interconnected nodes
(similar to neurons) that work together to process and interpret complex data.
They excel at recognizing intricate patterns and relationships in large datasets.
1. Supervised Learning
2. Unsupervised Learning
3. Semi-Supervised Learning
4. Reinforcement Learning
Reinforcement learning trains algorithms through trial and error, where the
model operates within a specific environment and receives feedback on each
outcome. Over time, it learns from successes and failures, optimizing actions to
reach a desired result. A common example of reinforcement learning is training
algorithms through games, such as repeatedly playing chess to improve
performance based on past outcomes.
1. Neural networks
2. Linear regression
3. Logistic regression
4. Clustering
5. Decision trees
6. Random forests
1. Neural networks
Neural networks are a type of computing model inspired by the human brain,
consisting of interconnected processing nodes that work together to recognize
patterns. They are highly effective in tasks such as natural language translation,
image and speech recognition, and even image generation.
2. Linear Regression
This algorithm predicts numerical outcomes by analyzing linear relationships
between variables. For instance, it can help predict house prices by examining
historical data for a specific area.
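As a small worked example of fitting a linear relationship, the sketch below implements ordinary least squares for a single input variable. The house areas and prices are made-up illustrative numbers, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: floor area against price.
areas = [50.0, 80.0, 100.0, 120.0]
prices = [150.0, 240.0, 300.0, 360.0]
slope, intercept = fit_line(areas, prices)
# Predict the price of an unseen 90-unit area from the fitted line.
print(slope * 90.0 + intercept)
```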
3. Logistic Regression
A supervised learning algorithm used to make predictions for categorical
outcomes, such as “yes” or “no.” Applications include spam filtering and quality
control on production lines.
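To make the "yes" or "no" output concrete, the sketch below shows how an already fitted logistic regression model could turn a feature vector into a decision. The weights, bias, and 0.5 threshold are illustrative assumptions, and training is deliberately left out.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def predict_label(features, weights, bias, threshold=0.5):
    """Convert the probability into a categorical answer."""
    return "yes" if predict_proba(features, weights, bias) >= threshold else "no"

# Hypothetical fitted parameters for a two-feature model.
weights, bias = [2.0, -1.0], -1.0
print(predict_label([3.0, 1.0], weights, bias))
print(predict_label([0.0, 2.0], weights, bias))
```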
4. Clustering
Clustering is an unsupervised learning method used to find patterns in data for
grouping purposes. This helps data scientists identify distinctions within data that
may be overlooked by humans.
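One common clustering method is k-means. The text above does not name a specific algorithm, so its use here is an assumption. The sketch groups one-dimensional values around caller-supplied starting centers; `kmeans_1d` is a hypothetical helper name.

```python
def kmeans_1d(values, centers, iterations=10):
    """k-means on 1-D data, starting from the given centers."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (unchanged if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centers, groups = kmeans_1d(values, centers=[0.0, 5.0])
print(centers)
print(groups)
```

No labels are supplied: the two groups emerge only from the distances between the values.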
5. Decision Trees
Decision trees can be used for both regression (predicting numbers) and
classification (categorizing data). They utilize a branching structure of decisions,
forming a "tree" of linked choices. Unlike neural networks, decision trees are
easier to validate, making them more transparent and interpretable.
6. Random Forests
Random forests predict values or categories by combining outcomes from
multiple decision trees, increasing the model’s accuracy and stability.
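The combining step of a random forest can be sketched with a majority vote. In this illustration, hand-written one-split trees stand in for trained decision trees, so the features, thresholds, and labels are all hypothetical, and the random sampling used to train a real forest is omitted.

```python
from collections import Counter

def stump(feature_index, threshold, low_label, high_label):
    """A one-split decision tree: compare one feature against a threshold."""
    def predict(sample):
        return low_label if sample[feature_index] <= threshold else high_label
    return predict

def forest_predict(trees, sample):
    """Classify by majority vote over the individual trees."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

trees = [
    stump(0, 30, "no", "yes"),
    stump(1, 5, "no", "yes"),
    stump(0, 45, "no", "yes"),
]
print(forest_predict(trees, (40, 3)))
print(forest_predict(trees, (50, 8)))
```

Using an odd number of trees for a two-class problem avoids tied votes.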
What is a Neural Network?
Input Layer
Hidden Layer(s)
Output Layer
Neural networks vary in complexity and structure, allowing for many unique
configurations. Common types include:
1. Perceptron Networks: Basic, shallow networks with only an input and
output layer.
6. Radial Basis Function (RBF) Networks: Nodes use radial basis functions for
processing.
The most common types of neural networks and their typical applications:
Convolutional neural networks are specialized feedforward networks designed
for tasks involving image and pattern recognition, especially in computer vision.
CNNs are particularly effective in processing grid-like data structures, such as
images, by applying principles from linear algebra (e.g., matrix multiplication) to
detect spatial hierarchies and patterns in the data.
CNNs are widely used in deep learning, particularly in the field of computer vision,
where they allow machines to interpret and analyze visual data. While artificial
neural networks are powerful tools for machine learning across various types of
data—such as images, audio, and text—CNNs are especially well-suited for image-
related tasks. For instance, while RNNs (and more specifically LSTMs) are effective
in predicting word sequences, CNNs excel in tasks like image classification.
scalability is essential.
One primary distinction between CNNs and regular neural networks is the use of
convolutions—a mathematical operation performed instead of standard matrix
multiplication in at least one CNN layer. Convolutions allow CNNs to apply filters
across the data and adapt these filters during training, fine-tuning results as they
process vast amounts of data, such as images.
Since CNNs adjust filters during training, they eliminate the need for handcrafted
filters, allowing for greater flexibility and a more extensive range of filters that are
dynamically tailored to the dataset. This adaptability makes CNNs well-suited for
complex tasks like facial recognition. CNNs perform best with large datasets,
though they can be trained with as few as around 10,000 data points. However,
having access to more data generally enhances their accuracy and effectiveness.
Convolutional neural networks (CNNs) stand out in handling data like images,
audio, and speech due to their specialized structure, featuring three main types
of layers:
Convolutional Layer
Pooling Layer
Fully-Connected (FC) Layer
The convolutional layer is the first layer in a CNN and performs the main
computations, which involve detecting patterns and features within input data.
This layer may be followed by additional convolutional or pooling layers, while the
fully-connected layer typically serves as the final layer. As data progresses
through these layers, the CNN gradually builds up complexity, enabling it to
identify more detailed aspects of the input. Initially, simpler features such as
edges or color gradients are identified, while deeper layers begin recognizing
larger, more complex shapes and elements within the data. Eventually, the CNN
can identify the entire object or concept it was trained to recognize.
1. Convolutional Layer
The convolutional layer is the core of a CNN, where most of the computation
occurs. It requires input data, filters, and a feature map. Let’s say the input is a
color image, represented as a 3D matrix of pixels across height, width, and depth
channels (RGB). The convolutional layer also uses a filter (or kernel), which moves
across the image's receptive fields to detect specific features, a process known as
convolution. This filter is a small, 2D array of weights (commonly a 3x3 matrix),
representing part of the image that it will analyze. As the filter moves, or
"strides," across the image, a dot product is calculated between the filter values
and the corresponding input pixels in the receptive field.
This result is recorded in an output array. The filter continues moving across the
entire image in this manner, creating a feature map or activation map from these
dot products. The use of a single set of weights for the filter as it moves across the
image, known as parameter sharing, is a key aspect of CNNs. While the filter
weights are adjusted during training via backpropagation and gradient descent,
three hyperparameters impact output volume size and must be set before
training begins:
1. Number of filters: This determines the depth of the output. For example,
three distinct filters yield three feature maps, giving a depth of three.
2. Stride: This hyperparameter controls the distance the filter moves across
the input matrix. A stride value of one means the filter shifts one pixel at
a time, whereas higher stride values reduce the output size, as the filter
jumps over more pixels.
3. Zero-padding: Used to adjust the input dimensions so that the filters align
with the image. Zero-padding sets all elements outside the image
boundary to zero, ensuring the filter can cover the entire image.
Types of padding include:
Valid Padding (No Padding): No extra pixels are added, so parts of the
image may be excluded if dimensions don’t perfectly align.
Same Padding: Ensures that the output layer maintains the same
dimensions as the input layer.
Full Padding: Expands the output by adding zeros around the edges
of the input image, creating a larger output.
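The effect of stride and zero-padding on the output size can be checked with a short sketch. For an input of width W, a filter of width F, padding P, and stride S, the output width is floor((W - F + 2P) / S) + 1. The function below is an illustrative pure-Python version that assumes a single-channel image and a square filter. Like most deep learning libraries, it slides the filter without flipping it.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over a 2-D `image` (lists of lists) and return the feature map."""
    # Zero-padding: surround the image with `padding` rows and columns of zeros.
    width = len(image[0]) + 2 * padding
    padded = [[0.0] * width for _ in range(padding)]
    for row in image:
        padded.append([0.0] * padding + list(row) + [0.0] * padding)
    padded += [[0.0] * width for _ in range(padding)]

    k = len(kernel)  # square k x k filter assumed
    out_h = (len(padded) - k) // stride + 1
    out_w = (width - k) // stride + 1
    feature_map = []
    for i in range(out_h):
        out_row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Dot product between the filter and the current receptive field.
            total = sum(
                kernel[a][b] * padded[top + a][left + b]
                for a in range(k)
                for b in range(k)
            )
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map

image = [[1, 1, 1, 1] for _ in range(4)]
kernel = [[1, 1, 1] for _ in range(3)]
valid = conv2d(image, kernel)                       # no padding: 2 x 2 output
same = conv2d(image, kernel, padding=1)             # padding of 1: 4 x 4 output
strided = conv2d(image, kernel, stride=2, padding=1)  # stride of 2: 2 x 2 output
print(len(valid), len(same), len(strided))
```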
After each convolution operation, the CNN applies a Rectified Linear Unit (ReLU)
activation function to the feature map. ReLU introduces nonlinearity into the
network, allowing it to capture more complex patterns in the data. This
combination of convolutional, pooling, and fully connected layers, combined with
ReLU, makes CNNs particularly effective in tasks like image and object
recognition, where detailed pattern recognition and analysis are essential.
2. Pooling Layer
The pooling layer, also called the downsampling layer, reduces the dimensions of
the data, decreasing the number of parameters the network must process. Like
the convolutional layer, the pooling layer moves a filter across the input, but
unlike convolution, it does not apply weights. Instead, it aggregates values within
each receptive field. The two most common pooling types are:
Max Pooling: As the filter traverses the input, it records the maximum pixel
value in each receptive field and sends that to the output.
Average Pooling: Here, the filter calculates the average value within the
receptive field and sends this result to the output.
Although pooling results in some loss of information, it brings key benefits to the
CNN, such as reducing complexity, enhancing efficiency, and helping prevent
overfitting.
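Both pooling types can be sketched with one small function. This is an illustrative version that assumes a single-channel feature map and a square window. The 2 x 2 window with stride 2 is a common choice rather than something the text above specifies.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Downsample a 2-D feature map with max or average pooling."""
    pooled = []
    for top in range(0, len(feature_map) - size + 1, stride):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, stride):
            # Collect the values inside the current receptive field.
            window = [
                feature_map[top + a][left + b]
                for a in range(size)
                for b in range(size)
            ]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
print(pool2d(feature_map, mode="max"))      # [[6, 4], [8, 9]]
print(pool2d(feature_map, mode="average"))  # [[3.75, 2.25], [4.5, 4.0]]
```

The 4 x 4 input shrinks to 2 x 2 in both cases, and no weights are learned.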
3. Fully-Connected Layer
The fully-connected layer (FC layer) connects each node in the output layer
directly to every node in the previous layer. This layer consolidates features
identified in the earlier layers and performs classification based on those features.
While ReLU is commonly applied after convolutional layers, the fully-connected
layer generally employs a softmax activation function to assign class
probabilities, with outputs ranging from 0 to 1 and summing to 1, allowing for
effective categorization.
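The softmax function mentioned above can be written in a few lines. The input scores in this sketch are arbitrary illustrative numbers standing in for the raw outputs of a fully-connected layer.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow in exp() without changing the result.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)
print(sum(probabilities))
```

The class with the highest score always receives the highest probability, so the predicted class is simply the position of the largest output.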
1. Image Classification
Image classification involves categorizing an image according to its content. It
serves as a fundamental component of computer vision, aiming to sort images
into specific predefined classes, such as identifying dogs, cats, or cars. This
process begins with training a model on a substantial dataset of labeled images,
allowing it to learn and make predictions on previously unseen images. To
illustrate, consider organizing a collection of 1,000 books into fiction and non-
fiction categories; image classification functions in a similar way by classifying
images into distinct groups.
2. Object Detection
Object detection enhances image classification by not only categorizing images
but also pinpointing the locations of specific objects within them. Think of it like
searching for a particular book in a library: rather than just scanning the titles on
the shelves, you can actually see each book's exact placement.
3. Image Segmentation
Image segmentation involves dividing an image into distinct segments, each
representing different objects or parts of the image. This technique is crucial in
scenarios where it’s important to isolate and analyze specific elements within an
image, such as in medical imaging for disease diagnosis. Image segmentation can
be performed using various methods, including contour detection, edge
detection, and region-based techniques. Imagine navigating through a dense
forest; image segmentation allows computer vision to help you distinguish
between trees, bushes, and other elements, guiding you on the best path
forward.
4. Facial Recognition
Facial recognition is a specialized area of computer vision focused on identifying
and verifying individuals through their facial features. Deep learning algorithms
analyze key aspects such as the eyes, nose, and mouth to create a unique facial
signature. This technology is commonly seen in smartphone unlocking features
and is also employed in security systems for identity verification and in social
media for automatically tagging friends in images.
Training a model involves repeatedly processing example images and adjusting
the model's weights to enhance the accuracy of its predictions. While I won't
delve into the specifics of deep learning or convolutional neural networks here,
it's important to note that you can utilize deep learning techniques like transfer
learning without needing to understand all the underlying algorithms. However,
there are key concepts to grasp for developing and maintaining an effective
model.
Once the training and validation processes have been completed over a set
number of iterations (and potentially stopped early if no improvement in accuracy
is observed), you’ll end up with a model and usually a report detailing its
accuracy. However, it's important to remember that this reported accuracy only
reflects the model's performance on the data that was provided.
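The early-stopping rule mentioned above can be sketched independently of any particular model. The function below is an illustration that takes a list of per-epoch validation accuracies as a stand-in for a real training run. The `patience` parameter, meaning the number of epochs without improvement that are tolerated, is an assumed convention and the accuracy values are made up.

```python
def train_with_early_stopping(val_accuracies, patience=2):
    """Return (best_epoch, best_accuracy), stopping once accuracy stalls."""
    best_epoch, best_acc, stale = -1, float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_epoch, best_acc, stale = epoch, acc, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, best_acc

history = [0.60, 0.72, 0.80, 0.79, 0.80, 0.91]
print(train_with_early_stopping(history))  # (2, 0.8)
```

In this example training stops after the fifth value, so the later 0.91 is never reached. A larger patience trades extra training time for the chance to get past such plateaus.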
Deep learning is a subset of machine learning that uses multilayered neural
networks, known as deep neural networks, to emulate the intricate decision-making
capabilities of the human brain. This technology underpins many of the
artificial intelligence (AI) applications we encounter in our daily lives.
The primary distinction between deep learning and traditional machine learning
lies in the architecture of the neural networks involved. Traditional machine
learning models typically consist of simple neural networks with one or two
layers, while deep learning models incorporate three or more layers, often
numbering in the hundreds or thousands, allowing for more complex training.
Deep learning plays a crucial role in data science and fuels a wide range of
applications and services that enhance automation, allowing analytical and
physical tasks to be performed without human intervention. This technology
enables a variety of everyday products and services, including digital assistants,
voice-activated TV remotes, credit card fraud detection systems, self-driving
vehicles, and generative AI.
predictions. It adjusts the weights and biases of the network by moving backward
through the layers, effectively training the model. The combination of forward
propagation and backpropagation allows a neural network to not only make
predictions but also to correct errors, gradually improving the algorithm's
accuracy over time.
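The interplay of forward propagation and backpropagation can be illustrated with a deliberately tiny network. The sketch below is a toy under stated assumptions, not the project's model: it has one hidden unit with a tanh activation and one output weight, uses a squared-error loss, and is trained with plain gradient descent. The starting weights, learning rate, and target function are arbitrary choices.

```python
import math

def train(samples, w1=0.5, w2=0.5, lr=0.1, epochs=200):
    """Train y_hat = w2 * tanh(w1 * x) and return the weights and loss history."""
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in samples:
            # Forward propagation: input -> hidden unit -> output.
            h = math.tanh(w1 * x)
            y_hat = w2 * h
            error = y_hat - y
            total += 0.5 * error ** 2
            # Backpropagation: apply the chain rule from the output backward.
            grad_w2 = error * h
            grad_h = error * w2
            grad_w1 = grad_h * (1.0 - h ** 2) * x
            # Gradient descent: move each weight against its gradient.
            w2 -= lr * grad_w2
            w1 -= lr * grad_w1
        losses.append(total / len(samples))
    return w1, w2, losses

samples = [(x, math.tanh(x)) for x in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)]
w1, w2, losses = train(samples)
print(losses[0], losses[-1])
```

The average loss printed for the last epoch is much smaller than for the first, which is the gradual error correction described above.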
Most deep learning applications are developed using one of three major
frameworks:
JAX
PyTorch
TensorFlow
Deep learning algorithms are highly sophisticated, and various types of neural
networks have been developed to tackle specific challenges or datasets. Below
are six prominent models, presented in the approximate order of their evolution,
with each new model designed to address limitations of its predecessors. A
common drawback among these models is their tendency to operate as "black
boxes," making it difficult to decipher their inner mechanisms, which can lead to
challenges in interpretability. However, this complexity is often outweighed by
their advantages in accuracy and scalability.
1. CNNs
Convolutional Neural Networks (CNNs or ConvNets) are primarily utilized in
computer vision and image classification tasks. They excel at detecting features
and patterns within images and videos, facilitating functions such as object
detection, image recognition, pattern recognition, and facial recognition. CNNs
employ concepts from linear algebra, particularly matrix multiplication, to
uncover patterns within visual data.
CNNs are recognized for their exceptional performance with image, speech, and
audio signal inputs. Prior to the advent of CNNs, the process of feature extraction
for object identification in images was manual and labor-intensive. Now, CNNs
offer a more scalable solution for image classification and object recognition tasks
while effectively processing high-dimensional data. Additionally, CNNs facilitate
data exchange between layers, enhancing data processing efficiency. Although
some information may be lost in the pooling layers, the advantages of CNNs—
such as reduced complexity, improved efficiency, and decreased risk of overfitting
—often outweigh this drawback.
However, CNNs also come with challenges. They are computationally intensive,
requiring significant time and resources, often necessitating multiple graphical
processing units (GPUs) for effective operation. Furthermore, they demand highly
skilled experts with cross-domain expertise and meticulous testing of
configurations and hyperparameters.
2. RNNs
Recurrent Neural Networks (RNNs) are commonly employed in natural language
processing and speech recognition applications, as they are designed to handle
sequential or time-series data. RNNs are characterized by their feedback loops,
which allow them to utilize past information to inform current inputs and outputs.
They are particularly useful for making predictions based on time-series data, with
applications including stock market forecasting, sales predictions, and temporal
issues like language translation and image captioning. These functions are
frequently integrated into popular technologies such as Siri, voice search, and
Google Translate.
RNNs leverage their "memory" capabilities, where the information from previous
inputs influences the current output. Unlike traditional deep neural networks that
treat inputs and outputs as independent, the output of RNNs is contingent on
preceding elements within the sequence. While considering future events could
enhance the accuracy of predictions, unidirectional recurrent neural networks
cannot incorporate these future elements into their outputs.
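The "memory" of an RNN comes from a hidden state that is carried from one time step to the next. The sketch below is a minimal one-unit recurrent cell with hypothetical fixed weights. It leaves out training and only shows how earlier inputs influence later outputs.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit recurrent cell over a sequence of numbers."""
    h = 0.0  # hidden state: the network's "memory"
    states = []
    for x in inputs:
        # The same weights are reused at every time step.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# The current input is 0 in both cases, but the histories differ.
print(rnn_forward([1.0, 0.0])[-1])
print(rnn_forward([0.0, 0.0])[-1])
```

The two printed values differ even though the final input is the same, because the first sequence left a trace in the hidden state. The loop only runs forward in time, which matches the unidirectional case described above.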
RNNs utilize shared parameters across all layers of the network, employing the
same weight parameters within each layer. These weights are adjusted through
backpropagation and gradient descent to support learning. To
compute gradients, RNNs use an algorithm known as backpropagation through
time (BPTT), which is tailored for sequential data and differs slightly from
traditional backpropagation. Like standard backpropagation, BPTT allows the
model to learn by calculating errors from the output layer back to the input layer.
However, BPTT accumulates errors at each time step, while feedforward
networks do not sum errors since they lack shared parameters across layers.
One advantage of RNNs over other types of neural networks is their ability to
handle both binary data processing and memory. RNNs can manage multiple
inputs and outputs, enabling them to produce various output types—such as one-
to-many, many-to-one, or many-to-many—rather than simply generating a single
result for a given input.
RNNs also come in different variations, with Long Short-Term Memory (LSTM)
networks being a notable example that outperforms simple RNNs by effectively
learning from and responding to longer-term dependencies. However, RNNs often
face two main challenges: exploding gradients and vanishing gradients, which are
defined by the size of the gradient—essentially the slope of the loss function
along the error curve.
Vanishing Gradients: This occurs when the gradient becomes too small,
continually diminishing and ultimately resulting in insignificant weight
updates, rendering the algorithm unable to learn effectively.
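A rough numerical illustration shows why gradient size becomes a problem over long sequences. As a simplifying assumption, the sketch treats backpropagation through time as multiplying the gradient by the same factor at every time step. Real networks have a different factor at each step, but the exponential trend is the same.

```python
def gradient_after(steps, factor):
    """Size of a gradient of 1.0 after being multiplied by `factor` at each step."""
    gradient = 1.0
    for _ in range(steps):
        gradient *= factor
    return gradient

print(gradient_after(50, 0.5))  # shrinks toward zero: vanishing gradient
print(gradient_after(50, 1.5))  # grows very large: exploding gradient
```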
Additionally, RNNs often require lengthy training periods and can be challenging
to apply to large datasets. The optimization of RNNs can become complicated,
especially when they contain numerous layers and parameters.
The primary advantage of autoencoders lies in their ability to efficiently process
large datasets, providing a compressed view of the data where essential patterns
are highlighted. This is particularly useful for tasks like anomaly detection and
classification, as well as for reducing storage and transmission needs. Moreover,
because autoencoders can learn from unlabeled data, they are valuable when
labeled data is limited or unavailable. This unsupervised training approach is also
time-saving, allowing the model to enhance accuracy independently, without
manual feature selection. Furthermore, variational autoencoders (VAEs) can
create synthetic data, enabling new possibilities in text and image generation.
4. GANs
Generative adversarial networks (GANs) are a class of neural networks designed
to generate new data that closely resembles the original training data, widely
used within and outside AI applications. For instance, GANs can create images
that look like human faces, although these images are artificially generated rather
than photographs of real individuals.
The "adversarial" aspect of GANs refers to the interaction between two main
components: the generator and the discriminator.
Training in GANs involves this dynamic between the generator and discriminator.
As the generator produces artificial data, the discriminator learns to identify the
differences between the real and generated samples. When the discriminator
accurately identifies a generated output, the generator is adjusted to improve its
results. This iterative process continues until the generator produces outputs
indistinguishable from real data.
GANs’ key advantage lies in their ability to create highly realistic outputs that are
often challenging to tell apart from genuine data, which can be valuable for
training other machine learning models. Training a GAN is relatively
straightforward since it primarily relies on unlabeled data or minimally labeled
datasets. However, GANs do have limitations. The competitive training process
between the generator and discriminator can be computationally intensive and
may require substantial data to produce high-quality results. Another challenge is
"mode collapse," a scenario where the generator repeatedly produces similar
outputs instead of generating diverse variations.
5. Diffusion models
Diffusion models are a type of generative model trained through a process of
adding and then removing noise, known as forward and reverse diffusion. They
typically generate data—often images—that resemble their training data but
ultimately overwrite the training data itself. During training, Gaussian noise is
incrementally added to the data until it becomes unrecognizable, and the model
learns a reverse “denoising” process that enables it to create realistic outputs
from random noise.
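The forward (noising) half of this process can be sketched in a few lines. The update rule used here, x_t = sqrt(1 - beta_t) * x_(t-1) + sqrt(beta_t) * noise, is one common formulation. The text above does not specify a noise schedule, so the constant beta of 0.02 and the 200 steps are assumptions. The learned reverse (denoising) model is not shown.

```python
import math
import random

def forward_diffusion(x0, betas, rng):
    """Add Gaussian noise step by step and return every intermediate sample."""
    x = list(x0)
    trajectory = [x]
    for beta in betas:
        # Shrink the previous sample slightly and mix in fresh Gaussian noise.
        x = [math.sqrt(1.0 - beta) * v + math.sqrt(beta) * rng.gauss(0.0, 1.0) for v in x]
        trajectory.append(x)
    return trajectory

def signal_fraction(betas):
    """Fraction of the original signal that survives all noising steps."""
    fraction = 1.0
    for beta in betas:
        fraction *= math.sqrt(1.0 - beta)
    return fraction

betas = [0.02] * 200
trajectory = forward_diffusion([1.0, -1.0], betas, random.Random(0))
print(trajectory[-1])
print(signal_fraction(betas))
```

After 200 steps well under a fifth of the original signal remains, so the final sample is dominated by noise, which is the "unrecognizable" state described above.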
The training goal for a diffusion model is to minimize the difference between its
generated samples and the desired output. This difference, or loss, is calculated,
and the model’s parameters are adjusted to bring the generated samples closer
to the target, making the final results nearly indistinguishable from the original
training data.
Diffusion models offer several benefits, including the ability to produce high-
quality images without adversarial training, leading to faster learning and
enhanced control over the generation process. Compared to GANs, diffusion
models also provide more training stability and are less susceptible to mode
collapse.
However, training diffusion models can demand significant computational
resources and often requires careful fine-tuning. IBM Research® has also
identified a vulnerability in these models: they can be embedded with hidden
backdoors, allowing malicious actors to manipulate the image generation process
to produce altered images.
6. Transformer models
However, transformers come with limitations: they require significant
computational resources and extended training time. Additionally, high-quality,
unbiased, and ample training data is essential to ensure accurate performance.
1. Application Modernization
Generative AI is revolutionizing application modernization and IT automation,
bridging the skills gap in these fields. Advances in large language models (LLMs)
and natural language processing (NLP) have enabled AI-driven coding, using deep
learning and vast neural networks trained on extensive datasets of open-source
code.
Developers can input natural language prompts describing desired code functions,
and generative AI suggests relevant code snippets or even complete functions.
This reduces the need for repetitive coding and speeds up development.
Generative AI can also facilitate code translation between programming
languages, which supports projects like converting legacy COBOL code into
modern languages such as Java.
2. Computer vision
Computer vision is a branch of artificial intelligence (AI) focused on tasks like
image classification, object detection, and semantic segmentation. By leveraging
machine learning and neural networks, it enables computers to analyze and
interpret digital images, videos, and other visual inputs to extract valuable
insights. This analysis allows systems to recommend actions or detect issues, such
as identifying defects in products. If AI equips computers with the ability to
"think," computer vision grants them the ability to "see," "observe," and
"understand."
Often used to inspect products or monitor production processes, computer vision
systems are capable of analyzing thousands of items per minute, detecting even
subtle defects that might go unnoticed by human inspectors. Its applications span
diverse industries, including energy, utilities, manufacturing, and automotive
sectors.
The technology relies on algorithmic models that allow computers to learn from
the context within visual data. When a large volume of data is processed, the
model gradually learns to distinguish between different images without direct
programming for specific image recognition.
Computer vision enables systems to derive insights from visual inputs and act
based on those observations, setting it apart from basic image recognition.
Some notable uses of computer vision today include:
4. Retail: E-commerce sites use visual search to suggest items that match or
enhance a customer’s wardrobe, personalizing the shopping experience.
3. Customer care
AI is enabling businesses to better understand and respond to growing
consumer demands. In a world of highly personalized online shopping, direct-to-
consumer models, and quick delivery options, generative AI offers a range of
benefits to enhance customer care, talent development, and application
performance.
4. Digital Labor
Organizations can boost productivity by integrating robotic process automation
(RPA) and digital labor to complement human efforts or provide additional
support when needed. For instance, digital labor can assist developers with
updating legacy systems more efficiently.
5. Generative AI
Generative AI, often referred to as "gen AI," involves deep learning models
capable of producing original content—such as detailed text, high-quality images,
realistic videos, and more—in response to user prompts.
Although generative models have been used in statistical analysis for years to
handle numerical data, advancements over the past decade have expanded their
application to more complex data types.
This shift aligns with the development of three advanced deep learning model
types:
1. Training
Generative AI starts with a "foundation model," a deep learning model designed
as a base for various types of generative AI applications. Today, the most
prevalent foundation models are large language models (LLMs) used for text
generation, though there are also foundation models specifically for generating
images, video, sound, and music, as well as multimodal models capable of
producing different types of content.
2. Tuning
Once the foundation model is established, it requires tuning for specific content
generation tasks.
This can be accomplished through several methods:
Fine-tuning, which involves supplying the model with labeled data specific
to the task—such as common questions or prompts the application may
receive and corresponding correct answers in the preferred format.
Developers and users regularly evaluate the outputs of their generative AI
applications, making frequent adjustments to enhance accuracy and relevance—
sometimes updating the model every week. In contrast, updates to the
foundation model itself occur much less frequently, typically every 12 to 18
months.
A specific branch of NLP, called statistical NLP, integrates algorithms with machine
learning and deep learning models to automatically extract, classify, and label
components of text and speech. It then assigns a probability to each potential
meaning of these components. Today, deep learning models, especially those
based on recurrent neural networks (RNNs), enable NLP systems to "learn" as
they operate, deriving increasingly accurate meanings from massive volumes of
raw, unstructured text and voice data.