
RADIATION-FREE SCOLIOSIS TRACKING

INTRODUCED BY:

# STUDENT NAME STUDENT ID


1 Yomna Hesham Abdel Hamid 42110132
2 Mohamed Ahmed El-Sayed Dawood 42110134
3 Gasser Ahmed Rashad 42110206
4 Noha El Sayed Ibrahim 42110440
5 Manar Amr Younis 42110233

SUPERVISED BY:

Dr. Nehal Khaled

Egypt/2024
Committee Report
We certify we have read this graduation project report as
an examining committee, examined the student in its content, and
that in our opinion it is adequate as a project document for
“Radiation-Free Scoliosis Tracking”.
Chairman:
Name:
Signature:
Date: / /2024

Supervisor:
Name: Dr. Nehal Khaled
Signature:
Date: / /2024

Examiner:
Name:
Signature:
Date: / /2024
Intellectual Property Right Declaration
This is to declare that the work carried out under the supervision of
Dr. Nehal Khaled, titled "Radiation-Free Scoliosis Tracking", in partial
fulfillment of the requirements of the Bachelor of Science in Computer
Science, is the sole property of Ahram Canadian University and the
respective supervisor. It is protected under intellectual property laws
and conventions. It may only be considered or used for purposes such as
extension for further enhancement, product development, or adoption for
commercial/organizational usage with the permission of the University
and the respective supervisor. The above statement applies to all
students and faculty members.

Names:

 Yomna Hesham Abdel Hamid

 Mohamed Ahmed El-Sayed Dawood

 Gasser Ahmed Rashad

 Noha El Sayed Ibrahim

 Manar Amr Younis

Supervisor:

Dr. Nehal Khaled


Anti-Plagiarism Declaration
This is to declare that the above publication, produced under the
supervision of Dr. Nehal Khaled and titled "Radiation-Free Scoliosis
Tracking", is the sole contribution of the author(s), and that no part
of it has been reproduced illegally (cut and paste) in a way that could
be considered plagiarism. All referenced parts have been used to support
the argument and have been cited properly. We will be responsible and
liable for any consequences if a violation of this declaration is
proven.

Names:

 Yomna Hesham Abdel Hamid

 Mohamed Ahmed El-Sayed Dawood

 Gasser Ahmed Rashad

 Noha El Sayed Ibrahim

 Manar Amr Younis


TABLE OF CONTENTS

CHAPTER 1: INTRODUCTION

OVERVIEW...................................................................................................................................................................6

MOTIVATION............................................................................................................................................................12

SCOPE...........................................................................................................................................................................12

AIM AND OBJECTIVE..........................................................................................................................................13

CONSTRAINTS........................................................................................................................................................ 13

DOCUMENT ORGANIZATION........................................................................................................................14

BACKGROUND.........................................................................................................................................................15

CHAPTER 1: INTRODUCTION
Overview

The spine, or backbone, is the body's central support structure, made up of
vertebrae, ligaments, and discs. When viewed from the front or back, a healthy
spine should appear straight. From a side view, however, it has a gentle S-shaped
curve, allowing for balanced weight distribution and flexible movement. If this
natural curve becomes misaligned or overly pronounced, it can lead to one of the
following conditions:

 Kyphosis: A rounding of the upper spine, often creating a visible hump on the
back.
 Lordosis: An exaggerated forward curve in the lower back or neck area.
 Scoliosis: A sideways, S- or C-shaped curve of the spine.


What is Scoliosis?

Scoliosis is characterized by a sideways curve in the spine, typically detected during
childhood or early adolescence. A curve greater than 10 degrees on an X-ray
qualifies as scoliosis, often taking on an S or C shape, with the spine sometimes
twisting or rotating as well.

Doctors diagnose scoliosis through a combination of medical history, physical
exams, and imaging tests. Treatment varies based on factors like age, growth
potential, curve severity, and whether the curve is permanent or temporary.
Individuals with mild scoliosis may only need periodic checkups, while others might
require bracing or surgery to manage the condition.

Certain conditions, such as cerebral palsy and muscular dystrophy, can contribute to
scoliosis. Additionally, some birth defects and genetic disorders, like Marfan
syndrome and Down syndrome, are associated with scoliosis. Many individuals with
scoliosis may also have a family history of the condition.
Types of Scoliosis

1. Idiopathic Adolescent Scoliosis: The most common type, primarily affecting
adolescents, and more frequent in girls than boys.

2. Congenital Scoliosis: A spinal deformity present at birth, caused by improper
formation of a spinal bone, leading to an imbalanced and curved spine.

3. Neuromuscular Scoliosis: Occurs in children with certain medical conditions,
such as Marfan syndrome, muscular dystrophy, cerebral palsy, and spina bifida.

4. Adult De Novo Scoliosis: Develops later in life due to age-related spinal
degeneration.

Symptoms of Scoliosis

 Pain or stiffness in the middle or lower back
 Curved posture
 Uneven shoulders or hips
 Challenges with standing or sitting upright
 Leaning to one side
 One side of the rib cage protruding forward

What is Kyphosis?

Kyphosis is a spinal condition characterized by an excessive curve in the upper back,
often resulting in rounded shoulders. Also referred to as round back, hunchback, or
dowager's hump, kyphosis can develop at any age but is rarely present at birth.
Severe cases may lead to pain or breathing difficulties due to lung compression.
Treatment options range from physical therapy and back exercises for mild cases to
surgery for more pronounced curvature.

Types of Kyphosis

1. Postural Kyphosis: Often due to poor posture, this type is most common in
adolescents and young adults and can typically be improved through proper
posture and exercise.

2. Scheuermann’s Kyphosis: This type occurs when the front of the spine grows
slower than the back during growth spurts, like those in puberty. Stretching
and anti-inflammatory medications can be effective treatments.

3. Congenital Kyphosis: A birth defect in which the spine does not develop
properly in the womb. Treatment focuses on preventing further curvature.

Causes of Kyphosis

 Degenerative spinal diseases, including arthritis or disc degeneration
 Muscle weakness or chronic poor posture
 Osteoporosis
 Spine injuries
 Spondylolisthesis, where one vertebra slips forward over another
 Scheuermann's disease, a condition in which multiple vertebrae wedge
together due to uneven growth during growth spurts; its exact cause is
unknown

Symptoms of Kyphosis

The main sign of kyphosis is an excessive forward curve in the upper back, often
giving the appearance of rounded shoulders or a hump. Other symptoms may
include:

 Pain or stiffness in the back or shoulder area
 Tightness in the hamstring muscles
 Uneven height of the shoulder blades

What is Lordosis?

Lordosis, also known as swayback, is a condition where the spine curves forward
excessively in the neck (cervical) or lower back (lumbar) regions. While some
forward curvature is normal, lordosis occurs when this curve is more pronounced.
The type of lordosis depends on the area of the spine affected:

1. Cervical Lordosis: The neck curves forward more than usual, positioning the
head further forward.

2. Lumbar Lordosis: The lower back curves excessively, causing the hips and
pelvis to shift forward. This may make a person appear to have their stomach
pushed forward or their buttocks more prominent.
Lordosis commonly affects people over 50, children going through growth spurts,
and individuals who are pregnant. Doctors may suggest stretching and exercises to
help improve posture, and most people with lordosis don’t require further
treatment. Children often grow out of the condition.

Causes of Lordosis

While the cause of lordosis is often unknown, certain medical conditions can lead to
its development, including:

 Achondroplasia
 Spondylolisthesis
 Osteoporosis
 Kyphosis
 Obesity

Symptoms of Lordosis

Many individuals with lordosis may not experience noticeable symptoms. However,
posture changes that may suggest lordosis include:

 Head and neck leaning forward
 Hips pushed forward or a more pronounced buttock
 A visible gap under the lower back when lying down
 Pain in the neck or lower back

Diagnosing Spinal Curvatures

After conducting a comprehensive medical history review and neurological exam,
specialists may recommend imaging tests to assess spinal curvature, such as:

 X-rays to evaluate the spine's alignment
 CT scans for detailed images
 MRI scans to assess soft tissues and spinal structure
 Electromyography (EMG) and other electrophysiological tests to examine
nerve function
The spine plays a critical role in supporting body structure and facilitating
movement, but abnormal curvatures such as scoliosis, kyphosis, and lordosis can
disrupt this function and lead to health complications. Adolescent spinal deformities
require careful monitoring to prevent progression and manage symptoms
effectively. Traditional imaging techniques, while essential for diagnosis, expose
patients to radiation, raising concerns over long-term health risks.

Adolescent scoliosis is a three-dimensional spinal deformity typically
diagnosed between ages 10 and 16. Whole-spine radiographs remain the gold
standard for diagnosing and monitoring scoliosis. However, according to Nash et al.,
scoliosis patients may undergo up to 22 full-spine X-rays over a 3-year treatment
period, leading to an 8% increase in cancer mortality and a fourfold higher risk of
breast cancer in this group. Kyphosis, occurring in 0.4% to 8% of children and teens
aged 10 to 18, is also common during adolescent growth phases. Lordosis generally
arises in children and young adults during growth spurts but can also affect older
adults, particularly those over 50, due to spinal degeneration.

Treatment Options for Spinal Curvatures

A wide range of treatments is available, from physical therapy to advanced spinal
surgeries. Specialists, therapists, and physicians work together to offer options that
enhance mobility and alleviate pain. The majority of patients do not need surgery,
and lifestyle modifications can often aid in managing the condition.

Lifestyle Modifications

 Maintaining good posture

 Strengthening spinal muscles through activities like yoga or Pilates

 Weight management to reduce spinal pressure

1. Nonsurgical Treatments

 Physical Therapy: Specialized therapists help with spine-focused exercises
 Bracing: Often used to support spine alignment
 Medication: Includes anti-inflammatory drugs
 Acupuncture: Can relieve pain and discomfort
 Spinal Cord Stimulation: May be recommended for chronic pain

2. Surgical Treatments

If non-invasive approaches do not provide adequate relief, surgical options are
available. The goal of spine surgery is to restore normal spinal alignment and
prevent the progression of the curvature.

3. Reconstructive Spine Surgeries

1. Pedicle Subtraction Osteotomy (PSO): Removes a wedge of bone from the spine
to decrease curvature.

2. Vertebral Column Resection (VCR): Involves removing one or more vertebrae to
reposition and straighten the spine.

Project Overview

This document outlines the Radiation-Free Scoliosis Tracking project, a web-based
application designed for healthcare providers to monitor spinal deformities with
minimal radiation exposure. The project addresses the need for safer, non-invasive
follow-up methods for tracking conditions like scoliosis, kyphosis, and lordosis.

Utilizing advanced measurement techniques, the application captures the angle of
the spine to generate a comprehensive 3D model. It incorporates artificial
intelligence to predict and construct a 3D representation of the patient's spine
from a 2D photograph of their back, smoothly integrating it with the corresponding
3D body object.

Using raster-stereography and surface topography technologies, both non-invasive
methods, the application then detects spinal deformities, muscle contractions,
and spasms in the patient's back. The model provides healthcare professionals with
critical information, including the Cobb angle for scoliosis as well as the degrees
of kyphosis and lordosis. In addition to its tracking capabilities, the application
recommends appropriate sizes for belts or braces tailored to each patient's needs.
A unique feature helps position patients correctly in front of the camera, ensuring
precise measurements are obtained.
MOTIVATION

Scoliosis is a condition that often requires regular monitoring through X-rays, which
exposes patients to harmful radiation. The motivation behind this project is to
improve patient safety by significantly reducing radiation exposure during routine
check-ups. Additionally, the project aims to enhance monitoring accuracy by
utilizing an AI-trained model to provide precise tracking of scoliosis progression.
The application significantly reduces patients' exposure to harmful radiation and the
risk of potential diseases by minimizing the need for X-rays during follow-up
appointments.

Furthermore, it seeks to facilitate communication by establishing a seamless
connection between patients and healthcare providers, thereby improving the
overall management of scoliosis.

SCOPE

The project focuses on developing a web application that allows:

 Continuous tracking of the patient's scoliosis condition.

 Advanced measurement techniques including artificial intelligence for
generating 3D models from 2D images, raster-stereography, and surface
topography.

 Patient safety and health monitoring by reducing reliance on X-rays.

 Efficient measurement of the Cobb angle for scoliosis, as well as the degrees of
kyphosis and lordosis.

 User guidance to position patients correctly in front of the camera for accurate
measurements.
AIM AND OBJECTIVE

The main objective of this project is to develop an AI-based system that helps scoliosis
patients by reducing their exposure to X-rays during routine follow-up. The system
will use advanced techniques to measure different angles of the spine and generate a
3D model, providing healthcare professionals with important information to
effectively monitor patient conditions and record progress.

CONSTRAINTS

Factors beyond the application's control can affect its accuracy, potentially
compromising its reliability, such as:

 Overweight patients: The device (Diers) used to collect data cannot reliably
detect landmarks on an overweight patient's back, because doctors or device
operators struggle to identify these landmarks accurately. This reduces the
quality of the data used to train the model, which in turn lowers overall
accuracy.
 Positioning: The doctor should photograph the patient's back in a "neutral
posture" with the arms at the sides. Any other position could affect the
accuracy of the results.
 Elimination: The AI models used in the application generate the spine and
delineate (segment) the vertebrae, calculate their slopes, and then compute
the Cobb angle (see the sketch after this list). Segmenting the vertebrae is
difficult in severe scoliosis cases, which reduces accuracy.
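As an illustration of the slope-based computation mentioned in the last constraint, the following minimal Python sketch estimates a Cobb angle from spinal midline points. It is only a sketch under stated assumptions, not the project's actual pipeline: the (x, y) midline points are hypothetical stand-ins for the model's output, and the polynomial degree and point spacing are illustrative choices.

    # A minimal sketch of a slope-based Cobb-angle estimate, assuming the model
    # has already produced (x, y) centroids of the spinal midline.
    import numpy as np

    def cobb_angle(midline_xy):
        """Estimate the Cobb angle (degrees) from midline points ordered top to bottom."""
        x, y = midline_xy[:, 0], midline_xy[:, 1]
        # Fit a smooth polynomial to the lateral deviation x as a function of height y.
        coeffs = np.polyfit(y, x, deg=6)
        y_dense = np.linspace(y.min(), y.max(), 500)
        # Tangent slope dx/dy along the curve, converted to an inclination angle.
        slopes = np.polyval(np.polyder(coeffs), y_dense)
        angles = np.degrees(np.arctan(slopes))
        # Angle between the two most-tilted tangents (most positive vs. most negative).
        return angles.max() - angles.min()

    # Hypothetical midline with a mild C-shaped curve.
    ys = np.linspace(0, 450, 17)               # vertebral heights (mm)
    xs = 20 * np.sin(np.pi * ys / 450)         # lateral deviation (mm)
    print(f"Estimated Cobb angle: {cobb_angle(np.column_stack([xs, ys])):.1f} deg")

For this toy C-shaped curve the script prints roughly 16 degrees; taking the angle between the two most-tilted tangents is a common approximation of the Cobb angle.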
DOCUMENT ORGANIZATION

 Chapter 2: Background

Covers the technical background relevant to our problem, explaining the
methods and technologies used in the AI field.

 Chapter 3: Literature Review

Reviews previous research efforts related to our topic and analyzes relevant
studies. It highlights the strengths and weaknesses of these solutions, setting
the stage for our proposed approach.

 Chapter 4: Proposed System

Presents the prototype system and describes the methodology applied in our
experiments.

 Chapter 5: Experimental Results

Discusses the system’s results and analysis of system implementation. Also


covers an overview of the code.

 Chapter 6: Conclusion and Future Work

Summarizes our project's achievements and results and suggests possible
improvements for further research.

BACKGROUND

What is Artificial Intelligence?

Artificial Intelligence (AI) refers to the theory and development of computer
systems designed to perform tasks traditionally requiring human intelligence,
such as speech recognition, decision-making, and pattern recognition. This term
broadly encompasses various technologies, including machine learning (ML), deep
learning, and natural language processing (NLP).

While AI is often used to describe a wide array of technologies currently in use,
there is ongoing debate about whether many of these qualify as true artificial
intelligence. Some experts argue that today's technologies represent advanced
forms of machine learning that are merely stepping stones toward "artificial
general intelligence" (AGI), a level where machines would possess human-like
intelligence.

In most contexts, AI today generally refers to machine learning-powered
applications—like ChatGPT or computer vision—that enable machines to
complete tasks once exclusive to humans, such as content generation,
autonomous driving, or data analysis. At its core, machine learning employs
algorithms trained on datasets to build models that allow computers to perform
tasks, including song recommendations, optimizing travel routes, or translating
languages.
Examples of widely used AI applications include:

 ChatGPT: Uses large language models (LLMs) to generate text responses
based on user input.

 Google Translate: Applies deep learning techniques to translate text across
different languages.

Artificial General Intelligence (AGI) represents a hypothetical point at which
computer systems could match or surpass human intellectual capabilities,
embodying "true" artificial intelligence. Developing AI involves specialized
hardware and software for creating and training machine learning algorithms.
While there isn't a single programming language exclusive to AI, Python, R,
Java, C++, and Julia are popular among AI developers.

What is Machine Learning?

Machine learning (ML) is a branch of artificial intelligence (AI) focused on creating
algorithms that learn from data to make predictions or classify information
without needing direct human input. Today, machine learning is applied across
many industries, from suggesting products based on previous purchases to
forecasting stock market trends and translating languages.

Within AI, machine learning specifically involves training algorithms on data to
build models that can make predictions or decisions. It encompasses numerous
techniques that enable computers to learn from data and draw conclusions
independently, without explicit instructions for each task.

There are various machine learning algorithms, such as linear regression, logistic
regression, decision trees, random forests, support vector machines (SVMs), k-
nearest neighbors (KNN), and clustering techniques. Each approach suits different
types of data and problems.

One prominent type of machine learning algorithm is the neural network, also
known as an artificial neural network. Inspired by the structure and function of
the human brain, neural networks consist of layers of interconnected nodes
(similar to neurons) that work together to process and interpret complex data.
They excel at recognizing intricate patterns and relationships in large datasets.

Types of Machine Learning

1. Supervised Learning

In supervised machine learning, algorithms are trained on labeled datasets that
contain tags identifying how the data should be understood. This "answer key"
enables the algorithm to learn associations and make predictions. Supervised
learning is widely used in tasks like spam classification, where certain messages
are flagged as spam. Common supervised learning methods include neural
networks, naïve Bayes, linear regression, logistic regression, random forest, and
support vector machines (SVM). It is especially useful for creating models for
predictive and classification tasks.
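As a concrete illustration of the spam-classification example above, the following minimal scikit-learn sketch trains a naïve Bayes classifier on a tiny, hypothetical labeled dataset; a real spam filter would train on thousands of messages.

    # A minimal supervised-learning sketch: bag-of-words features + naive Bayes.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    messages = [
        "win a free prize now", "limited offer, claim your reward",
        "meeting moved to 3pm", "please review the attached report",
    ]
    labels = ["spam", "spam", "ham", "ham"]  # the labeled "answer key"

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(messages, labels)

    print(model.predict(["claim your free reward"]))    # expected: ['spam']
    print(model.predict(["see the report before 3pm"]))  # expected: ['ham']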

2. Unsupervised Learning

Unsupervised learning trains algorithms on datasets without labels, requiring the
model to identify patterns without prior guidance. These algorithms find hidden
structures within the data, making them ideal for exploratory data analysis,
customer segmentation, cross-selling strategies, and image recognition.
Unsupervised learning also aids in reducing model features through
dimensionality reduction techniques like principal component analysis (PCA) and
singular value decomposition (SVD). Other commonly used algorithms include
neural networks, k-means clustering, and probabilistic clustering methods.
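The following minimal sketch combines two of the techniques named above, PCA for dimensionality reduction and k-means for clustering, on synthetic unlabeled data; all sizes are illustrative.

    # A minimal unsupervised-learning sketch: reduce dimensions, then cluster.
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    # 300 unlabeled samples in 10 dimensions, drawn from 3 hidden groups.
    X, _ = make_blobs(n_samples=300, n_features=10, centers=3, random_state=0)

    # Reduce to 2 principal components, then find 3 clusters with no labels.
    X2 = PCA(n_components=2).fit_transform(X)
    clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X2)

    print(X2.shape)               # (300, 2)
    print(np.bincount(clusters))  # roughly 100 samples per recovered cluster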

3. Semi-Supervised Learning

Semi-supervised learning combines labeled and unlabeled data. Usually, a small
set of labeled data is first used to guide the algorithm, followed by larger volumes
of unlabeled data to complete model training. This approach is often used when
there is a lack of large labeled datasets, particularly for classification and
prediction tasks.

4. Reinforcement Learning

Reinforcement learning trains algorithms through trial and error, where the
model operates within a specific environment and receives feedback on each
outcome. Over time, it learns from successes and failures, optimizing actions to
reach a desired result. A common example of reinforcement learning is training
algorithms through games, such as repeatedly playing chess to improve
performance based on past outcomes.
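The following minimal sketch illustrates this trial-and-error loop with tabular Q-learning on a hypothetical five-cell corridor, where the agent is rewarded only for reaching the last cell; all hyperparameters are illustrative.

    # A tiny reinforcement-learning sketch: the agent improves by trial and
    # error, using the reward feedback it receives on each outcome.
    import random

    n_states, actions = 5, [-1, +1]          # move left or right along the corridor
    Q = [[0.0, 0.0] for _ in range(n_states)]
    alpha, gamma, eps = 0.5, 0.9, 0.3        # learning rate, discount, exploration

    for _ in range(500):                     # episodes of trial and error
        s = 0
        while s != 4:
            a = random.randrange(2) if random.random() < eps else Q[s].index(max(Q[s]))
            s2 = min(max(s + actions[a], 0), n_states - 1)
            r = 1.0 if s2 == 4 else 0.0      # feedback on each outcome
            # Q-learning update: nudge the estimate toward reward + best future value.
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2

    print([q.index(max(q)) for q in Q])      # states 0-3 learn action 1 ("right")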

Common Machine Learning Algorithms

Some frequently used machine learning algorithms include:

1. Neural networks
2. Linear regression
3. Logistic regression
4. Clustering
5. Decision trees
6. Random forests

1. Neural networks
Neural networks are a type of computing model inspired by the human brain,
consisting of interconnected processing nodes that work together to recognize
patterns. They are highly effective in tasks such as natural language translation,
image and speech recognition, and even image generation.

2. Linear Regression
This algorithm predicts numerical outcomes by analyzing linear relationships
between variables. For instance, it can help predict house prices by examining
historical data for a specific area.

3. Logistic Regression
A supervised learning algorithm used to make predictions for categorical
outcomes, such as “yes” or “no.” Applications include spam filtering and quality
control on production lines.

4. Clustering
Clustering is an unsupervised learning method used to find patterns in data for
grouping purposes. This helps data scientists identify distinctions within data that
may be overlooked by humans.

5. Decision Trees
Decision trees can be used for both regression (predicting numbers) and
classification (categorizing data). They utilize a branching structure of decisions,
forming a "tree" of linked choices. Unlike neural networks, decision trees are
easier to validate, making them more transparent and interpretable.

6. Random Forests
Random forests predict values or categories by combining outcomes from
multiple decision trees, increasing the model's accuracy and stability (a brief
comparison with a single decision tree is sketched below).
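The following minimal scikit-learn sketch contrasts a single decision tree with a random forest on the iris dataset bundled with the library; the settings are illustrative and exact scores will vary.

    # A minimal sketch: one tree vs. an ensemble of 100 trees.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    tree = DecisionTreeClassifier(random_state=0)
    forest = RandomForestClassifier(n_estimators=100, random_state=0)

    # Averaging many trees typically gives a more stable, often higher, accuracy.
    print("tree  :", cross_val_score(tree, X, y, cv=5).mean())
    print("forest:", cross_val_score(forest, X, y, cv=5).mean())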

What is a Neural Network?

A neural network is a computational architecture inspired by how the human
brain operates. Comprised of interconnected processing units, or "nodes," neural
networks pass data across layers in a way similar to how neurons in the brain
transmit electrical signals. Neural networks are used in machine learning,
particularly in deep learning, where they draw insights from unlabeled data
without human guidance. They are sometimes referred to as artificial neural
networks (ANNs) and are fundamental to deep learning models. For example,
with sufficient training data, a neural network-based deep learning model can
recognize objects in an image it has never seen before.

Neural networks enable many AI applications, including large language models
(LLMs) like ChatGPT, image generators like DALL-E, and predictive models.

Neural networks generally consist of:

 Input Layer
 Hidden Layer(s)
 Output Layer

Each node in these layers performs a specific computation on the input it
receives. Nodes contain mathematical formulas, where variables are weighted
differently. If the computation result exceeds a threshold, the node passes data to
the next layer; otherwise, it doesn't.
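The following minimal sketch shows that node computation in isolation: a weighted sum of inputs plus a bias, compared against a threshold. All numbers are arbitrary illustrative values.

    # A minimal sketch of a single thresholded node (a perceptron-style unit).
    import numpy as np

    def node(inputs, weights, bias, threshold=0.0):
        """Fire (return 1) only if the weighted sum exceeds the threshold."""
        total = np.dot(inputs, weights) + bias
        return 1 if total > threshold else 0

    x = np.array([0.6, 0.2, 0.9])    # inputs from the previous layer
    w = np.array([0.8, -0.4, 0.3])   # each input weighted differently
    print(node(x, w, bias=-0.7))     # 0: weighted sum 0.67 - 0.7 stays below threshold
    print(node(x, w, bias=+0.5))     # 1: weighted sum 0.67 + 0.5 exceeds it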

Types of Neural Networks

Neural networks vary in complexity and structure, allowing for many unique
configurations. Common types include:

 Shallow Neural Networks: Typically have a single hidden layer, requiring
less processing power but suited for simpler tasks.
 Deep Neural Networks: Feature multiple hidden layers, enabling more
complex data processing but requiring greater computational resources.

Some key types of neural networks include:

1. Perceptron Networks: Basic, shallow networks with only an input and
output layer.

2. Multilayer Perceptron (MLP) Networks: Expand on perceptrons by adding a
hidden layer, increasing model complexity.

3. Feed-Forward Networks: Nodes pass information in one direction only,
from input to output.

4. Recurrent Neural Networks (RNNs): Allow feedback, so outputs from some
nodes can influence previous layers, beneficial for sequential data.

5. Modular Networks: Combine multiple neural networks to reach a final
output.

6. Radial Basis Function (RBF) Networks: Nodes use radial basis functions for
processing.

7. Liquid State Machines: Feature randomly connected nodes.

8. Residual Networks (ResNets): Use a process called identity mapping,
allowing data to skip layers and combine outputs from earlier layers with
those from later layers.

This flexibility in network structure enables neural networks to perform a broad
range of tasks across AI applications.

The most common types of neural networks and their typical applications:

1. Feedforward Neural Networks (FNNs) or Multi-Layer Perceptrons (MLPs)
Feedforward neural networks, often referred to as MLPs, consist of an input
layer, one or more hidden layers, and an output layer. Unlike simple perceptrons,
these networks typically use sigmoid neurons, which are better suited for
handling nonlinear real-world problems. Data flows in one direction—from input
to output—through the network, and they are widely used in foundational
applications like computer vision and natural language processing.

2. Convolutional Neural Networks (CNNs)
Convolutional neural networks are specialized feedforward networks designed
for tasks involving image and pattern recognition, especially in computer vision.
CNNs are particularly effective in processing grid-like data structures, such as
images, by applying principles from linear algebra (e.g., matrix multiplication) to
detect spatial hierarchies and patterns in the data.

3. Recurrent Neural Networks (RNNs)
RNNs are distinct from other neural networks due to their feedback loops,
which allow outputs from one step to influence inputs in subsequent steps. This
feature makes RNNs especially useful for sequential data, such as time series
analysis. They are commonly used for predicting future values in applications like
stock market forecasting and sales trend analysis.

What is a Convolutional Neural Network?

A convolutional neural network (CNN) is a specialized type of neural network
designed for processing grid-like data, such as images. It uses multiple layers to
analyze data, extracting critical features from it. One key advantage of CNNs is
that they reduce the need for extensive pre-processing in image tasks, eliminating
the need for manual, time-consuming feature extraction. CNNs have made image
classification and object recognition far more efficient by using matrix-based
operations to detect patterns. However, they are computationally intensive, often
requiring graphical processing units (GPUs) for effective model training.

CNNs are widely used in deep learning, particularly in the field of computer vision,
where they allow machines to interpret and analyze visual data. While artificial
neural networks are powerful tools for machine learning across various types of
data—such as images, audio, and text—CNNs are especially well-suited for image-
related tasks. For instance, while RNNs (and more specifically LSTMs) are effective
in predicting word sequences, CNNs excel in tasks like image classification.

With traditional image processing algorithms, engineers would manually design
filters based on heuristics, which involved significant trial and error. CNNs,
however, learn which features in these filters are important during training,
making them highly adaptable, especially for high-resolution images with
thousands of pixels. The purpose of CNNs is to transform data into a more
manageable form for processing, preserving the essential features needed to
accurately interpret the data. This makes CNNs ideal for large datasets, where
scalability is essential.

One primary distinction between CNNs and regular neural networks is the use of
convolutions—a mathematical operation performed instead of standard matrix
multiplication in at least one CNN layer. Convolutions allow CNNs to apply filters
across the data and adapt these filters during training, fine-tuning results as they
process vast amounts of data, such as images.

Since CNNs adjust filters during training, they eliminate the need for handcrafted
filters, allowing for greater flexibility and a more extensive range of filters that are
dynamically tailored to the dataset. This adaptability makes CNNs well-suited for
complex tasks like facial recognition. CNNs perform best with large datasets,
though they can be trained with as few as around 10,000 data points. However,
having access to more data generally enhances their accuracy and effectiveness.

Convolutional neural networks (CNNs) stand out in handling data like images,
audio, and speech due to their specialized structure, featuring three main types
of layers:

 Convolutional Layer
 Pooling Layer
 Fully-Connected (FC) Layer

The convolutional layer is the first layer in a CNN and performs the main
computations, which involve detecting patterns and features within input data.
This layer may be followed by additional convolutional or pooling layers, while the
fully-connected layer typically serves as the final layer. As data progresses
through these layers, the CNN gradually builds up complexity, enabling it to
identify more detailed aspects of the input. Initially, simpler features such as
edges or color gradients are identified, while deeper layers begin recognizing
larger, more complex shapes and elements within the data. Eventually, the CNN
can identify the entire object or concept it was trained to recognize.

1. Convolutional Layer

The convolutional layer is the core of a CNN, where most of the computation
occurs. It requires input data, filters, and a feature map. Let's say the input is a
color image, represented as a 3D matrix of pixels across height, width, and depth
channels (RGB). The convolutional layer also uses a filter (or kernel), which moves
across the image's receptive fields to detect specific features, a process known as
convolution. This filter is a small, 2D array of weights (commonly a 3x3 matrix),
representing the part of the image it will analyze. As the filter moves, or
"strides," across the image, a dot product is calculated between the filter values
and the corresponding input pixels in the receptive field.

This result is recorded in an output array. The filter continues moving across the
entire image in this manner, creating a feature map or activation map from these
dot products. The use of a single set of weights for the filter as it moves across the
image—known as parameter sharing—is a key aspect of CNNs. While some filter
parameters adjust during training via backpropagation and gradient descent,
three hyperparameters impact output volume size and must be set before
training begins:

1. Number of Filters: The number of filters determines the depth of the
output. For instance, if three filters are applied, the network will create
three distinct feature maps, leading to an output depth of three.

2. Stride: This hyperparameter controls the distance the filter moves across
the input matrix. A stride value of one means the filter shifts one pixel at
a time, whereas higher stride values reduce the output size, as the filter
jumps over more pixels.

3. Zero-padding: Used to adjust the input dimensions so that the filters align
with the image. Zero-padding sets all elements outside the image
boundary to zero, ensuring the filter can cover the entire image.
Types of padding include:

 Valid Padding (No Padding): No extra pixels are added, so parts of the
image may be excluded if dimensions don’t perfectly align.
 Same Padding: Ensures that the output layer maintains the same
dimensions as the input layer.
 Full Padding: Expands the output by adding zeros around the edges
of the input image, creating a larger output.

After each convolution operation, the CNN applies a Rectified Linear Unit (ReLU)
activation function to the feature map. ReLU introduces nonlinearity into the
network, allowing it to capture more complex patterns in the data. This
combination of convolutional, pooling, and fully connected layers, combined with
ReLU, makes CNNs particularly effective in tasks like image and object
recognition, where detailed pattern recognition and analysis are essential.
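The following minimal PyTorch sketch shows how the three hyperparameters above shape the output volume, followed by ReLU; the input stands in for a single 32x32 RGB image, and all values are illustrative.

    # A minimal sketch of filter count, stride, and padding in a convolution.
    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 32, 32)  # batch of one 32x32 RGB image

    # 8 filters -> output depth 8; stride 1 with "same" padding keeps 32x32.
    same = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=1, padding=1)
    # Stride 2 with no padding ("valid") jumps pixels, shrinking the output.
    strided = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=2, padding=0)

    print(same(x).shape)              # torch.Size([1, 8, 32, 32])
    print(strided(x).shape)           # torch.Size([1, 8, 15, 15])
    print(torch.relu(same(x)).min())  # ReLU clamps negative activations to 0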

Additional Convolutional Layer

After the initial convolutional layer, a CNN can incorporate additional
convolutional layers, allowing the network to build a hierarchical structure where
later layers view pixel regions processed by previous layers. For example, if a CNN
is designed to detect a bicycle in an image, it can break down the bicycle into
fundamental parts: a frame, wheels, handlebars, and pedals. These parts
correspond to lower-level patterns, while their arrangement forms higher-level
patterns, creating a hierarchy of features within the CNN. Essentially, the
convolutional layers transform the image into numerical representations that
allow the neural network to interpret and identify meaningful patterns.

2. Pooling Layer

The pooling layer, also called the downsampling layer, reduces the dimensions of
the data, decreasing the number of parameters the network must process. Like
the convolutional layer, the pooling layer moves a filter across the input, but
unlike convolution, it does not apply weights. Instead, it aggregates values within
each receptive field. The two most common pooling types are:

 Max Pooling: As the filter traverses the input, it records the maximum pixel
value in each receptive field and sends that to the output.

 Average Pooling: Here, the filter calculates the average value within the
receptive field and sends this result to the output.

Although pooling results in some loss of information, it brings key benefits to the
CNN, such as reducing complexity, enhancing efficiency, and helping prevent
overfitting.
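The following minimal PyTorch sketch shows both pooling types on a small illustrative input; a 2x2 window with stride 2 halves each spatial dimension.

    # A minimal sketch of max pooling vs. average pooling on a 4x4 input.
    import torch
    import torch.nn as nn

    x = torch.tensor([[[[1., 2., 5., 6.],
                        [3., 4., 7., 8.],
                        [9., 8., 3., 2.],
                        [7., 6., 1., 0.]]]])  # shape (1, 1, 4, 4)

    print(nn.MaxPool2d(2)(x))  # keeps the max of each 2x2 field: [[4., 8.], [9., 3.]]
    print(nn.AvgPool2d(2)(x))  # averages each field: [[2.5, 6.5], [7.5, 1.5]]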

3. Fully-Connected Layer

The fully-connected layer (FC layer) connects each node in the output layer
directly to every node in the previous layer. This layer consolidates features
identified in the earlier layers and performs classification based on those features.
While ReLU is commonly used in convolutional and pooling layers, the fully-
connected layer generally employs a softmax activation function to assign class
probabilities, with outputs ranging from 0 to 1, allowing for effective
categorization.
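Tying the three layer types together, the following minimal PyTorch sketch stacks a convolutional layer with ReLU, a pooling layer, and a fully-connected layer whose outputs are converted to class probabilities with softmax; the 10-class setup and input size are illustrative assumptions.

    # A minimal end-to-end CNN sketch: convolution + ReLU, pooling, FC + softmax.
    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),  # convolutional layer
        nn.ReLU(),
        nn.MaxPool2d(2),                            # pooling layer: 32x32 -> 16x16
        nn.Flatten(),
        nn.Linear(8 * 16 * 16, 10),                 # fully-connected layer
    )

    logits = model(torch.randn(1, 3, 32, 32))
    probs = torch.softmax(logits, dim=1)            # class probabilities in [0, 1]
    print(probs.sum().item())                       # 1.0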

What is Computer Vision?

Computer vision is a specialized area of artificial intelligence dedicated to
enabling machines to interpret and analyze images and videos. Essentially, it
equips machines with the capability to "see," allowing them to understand and
process the visual world in a manner akin to human perception. Key techniques
within computer vision include image classification, object detection, image
segmentation, and facial recognition. This technology finds applications across
various sectors, including security, healthcare, and entertainment.

Some notable examples of computer vision applications are:

 Facial Recognition: Recognizing individuals by analyzing visual data.

 Self-Driving Vehicles: Utilizing computer vision for navigation and obstacle
avoidance.

 Robotic Automation: Allowing robots to perform tasks and make decisions
based on visual input.

 Medical Anomaly Detection: Identifying irregularities in medical images to
enhance diagnostic accuracy.

 Sports Performance Analysis: Monitoring and analyzing athletes'
movements to improve their performance.

 Manufacturing Fault Detection: Spotting defects in products during the
manufacturing process.

 Agricultural Monitoring: Observing crop development, livestock health,
and environmental conditions using visual information.

Some common computer vision tasks include:

1. Image Classification
Image classification involves categorizing an image according to its content. It
serves as a fundamental component of computer vision, aiming to sort images
into specific predefined classes, such as identifying dogs, cats, or cars. This
process begins with training a model on a substantial dataset of labeled images,
allowing it to learn and make predictions on previously unseen images. To
illustrate, consider organizing a collection of 1,000 books into fiction and
non-fiction categories; image classification functions in a similar way by
classifying images into distinct groups.

2. Object Detection
Object detection enhances image classification by not only categorizing images
but also pinpointing the locations of specific objects within them. Think of it like
searching for a particular book in a library: rather than just scanning the titles on
the shelves, you can actually see each book's exact placement.

3. Image Segmentation
Image segmentation involves dividing an image into distinct segments, each
representing different objects or parts of the image. This technique is crucial in
scenarios where it’s important to isolate and analyze specific elements within an
image, such as in medical imaging for disease diagnosis. Image segmentation can
be performed using various methods, including contour detection, edge
detection, and region-based techniques. Imagine navigating through a dense
forest; image segmentation allows computer vision to help you distinguish
between trees, bushes, and other elements, guiding you on the best path
forward.

4. Facial Recognition
Facial recognition is a specialized area of computer vision focused on identifying
and verifying individuals through their facial features. Deep learning algorithms
analyze key aspects such as the eyes, nose, and mouth to create a unique facial
signature. This technology is commonly seen in smartphone unlocking features
and is also employed in security systems for identity verification and in social
media for automatically tagging friends in images.

Training, Validation, and Testing

Training a model involves repeatedly processing example images and adjusting
the model's weights to enhance the accuracy of its predictions. While we won't
delve into the specifics of deep learning or convolutional neural networks here,
it's important to note that you can utilize deep learning techniques like transfer
learning without needing to understand all the underlying algorithms. However,
there are key concepts to grasp for developing and maintaining an effective
model.

The dataset is divided to distinguish between training and validation. This
separation can often be automated, as seen with tools like Maximo Visual
Inspection. Understanding this separation is crucial because training alone can
lead to overfitting. As the model becomes more accurate on the training data, it
may become "too accurate," meaning it recognizes the training images with
nearly perfect accuracy but struggles with new, unseen images. This phenomenon
is known as overfitting. To create a more generalizable model, a validation set is
utilized to monitor training and prevent overfitting.
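The following minimal scikit-learn sketch shows this separation in practice: a held-out validation set gives an honest accuracy estimate, and a gap between training and validation accuracy is the classic sign of overfitting. The dataset and model settings are illustrative.

    # A minimal sketch of a training/validation split for spotting overfitting.
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)
    # Hold out 20% of the data; the model never trains on it.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=0)

    model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
    model.fit(X_train, y_train)

    print("train accuracy:", model.score(X_train, y_train))  # often near-perfect
    print("val accuracy:  ", model.score(X_val, y_val))      # the honest estimate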

Once the training and validation processes have been completed over a set
number of iterations (and potentially stopped early if no improvement in accuracy
is observed), you’ll end up with a model and usually a report detailing its
accuracy. However, it's important to remember that this reported accuracy only
reflects the performance of the data that was provided.

Although validation is a part of the training process, it's highly beneficial—and
often gratifying—to assess the model with an additional test dataset containing
images that were not included in the training or validation phases. This testing
often reveals the need for ongoing evaluation and potential retraining of the
model with updated datasets to enhance its performance.

What is Deep Learning?

Deep learning is a branch of machine learning that utilizes multilayered neural
networks, known as deep neural networks, to emulate the intricate decision-
making capabilities of the human brain. This technology underpins many of the
artificial intelligence (AI) applications we encounter in our daily lives.

The primary distinction between deep learning and traditional machine learning
lies in the architecture of the neural networks involved. Traditional machine
learning models typically consist of simple neural networks with one or two
layers, while deep learning models incorporate three or more layers, often
numbering in the hundreds or thousands, allowing for more complex training.

Unlike supervised learning models, which depend on structured, labeled input
data for accurate predictions, deep learning can also employ unsupervised
learning. This means that deep learning models can derive characteristics,
features, and relationships necessary for making accurate predictions from raw,
unstructured data. Furthermore, these models can assess and enhance their
predictions to achieve greater accuracy.

Deep learning plays a crucial role in data science and fuels a wide range of
applications and services that enhance automation, allowing analytical and
physical tasks to be performed without human intervention. This technology
enables a variety of everyday products and services, including digital assistants,
voice-activated TV remotes, credit card fraud detection systems, self-driving
vehicles, and generative AI.

Neural networks, also referred to as artificial neural networks, are designed to
emulate the human brain through a network of data inputs, weights, and biases
that function similarly to silicon neurons. These components collaborate to
effectively recognize, classify, and describe objects within the provided data.

Deep neural networks are composed of multiple layers of interconnected nodes,
with each layer building upon the previous one to enhance and optimize
predictions or classifications. This sequence of computations within the network is
referred to as forward propagation. The input and output layers in a deep neural
network are known as visible layers: the input layer is responsible for receiving
and processing data, while the output layer generates the final prediction or
classification.

In addition to forward propagation, another essential process called
backpropagation utilizes algorithms like gradient descent to identify errors in
predictions. It adjusts the weights and biases of the network by moving backward
through the layers, effectively training the model. The combination of forward
propagation and backpropagation allows a neural network to not only make
predictions but also to correct errors, gradually improving the algorithm's
accuracy over time.
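The following minimal PyTorch sketch shows one training loop built from exactly these two processes: a forward pass produces predictions, and backpropagation with gradient descent adjusts the weights and biases. The data and layer sizes are illustrative.

    # A minimal sketch of forward propagation + backpropagation.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # gradient descent
    loss_fn = nn.MSELoss()

    X = torch.randn(32, 4)          # 32 samples, 4 features each
    y = X.sum(dim=1, keepdim=True)  # a simple target the network can learn

    for step in range(200):
        pred = model(X)             # forward propagation
        loss = loss_fn(pred, y)     # how wrong was the prediction?
        optimizer.zero_grad()
        loss.backward()             # backpropagation: gradients, layer by layer
        optimizer.step()            # adjust weights and biases

    print(loss.item())              # loss shrinks steadily as the weights improve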

Training deep learning models requires substantial computational power. High-
performance graphical processing units (GPUs) are particularly well-suited for this
task, as they can perform many calculations across multiple cores while offering
ample memory. Distributed cloud computing can also provide necessary
resources. However, managing numerous GPUs in-house can significantly strain
internal resources and become quite expensive to scale.

Most deep learning applications are developed using one of three major
frameworks:

 JAX

 PyTorch

 TensorFlow

Types of Deep Learning Models

Deep learning algorithms are highly sophisticated, and various types of neural
networks have been developed to tackle specific challenges or datasets. Below
are six prominent models, presented in the approximate order of their evolution,
with each new model designed to address limitations of its predecessors. A
common drawback among these models is their tendency to operate as "black
boxes," making it difficult to decipher their inner mechanisms, which can lead to
challenges in interpretability. However, this complexity is often outweighed by
their advantages in accuracy and scalability.

1. CNNs
Convolutional Neural Networks (CNNs or ConvNets) are primarily utilized in
computer vision and image classification tasks. They excel at detecting features
and patterns within images and videos, facilitating functions such as object
detection, image recognition, pattern recognition, and facial recognition. CNNs
employ concepts from linear algebra, particularly matrix multiplication, to
uncover patterns within visual data.

A CNN is a specialized form of neural network that consists of several layers,
including an input layer, one or more hidden layers, and an output layer. Each
node within these layers connects to others and has an associated weight and
threshold. If a node's output exceeds the defined threshold, it is activated and
transmits data to the subsequent layer; otherwise, it does not pass data forward.

A CNN typically comprises at least three main types of layers: convolutional
layers, pooling layers, and fully connected (FC) layers. For more complex
applications, a CNN can include thousands of layers, each building upon its
predecessors. The "convolution" process involves iteratively processing the
original input to uncover intricate patterns. As data progresses through the CNN,
the model gains complexity, starting with the identification of simple features like
colors and edges. As the data moves through successive layers, the CNN
increasingly recognizes larger elements and shapes, ultimately identifying the
intended object.

CNNs are recognized for their exceptional performance with image, speech, and
audio signal inputs. Prior to the advent of CNNs, the process of feature extraction
for object identification in images was manual and labor-intensive. Now, CNNs
offer a more scalable solution for image classification and object recognition tasks
while effectively processing high-dimensional data. Additionally, CNNs facilitate
data exchange between layers, enhancing data processing efficiency. Although
some information may be lost in the pooling layers, the advantages of CNNs—
such as reduced complexity, improved efficiency, and decreased risk of overfitting
—often outweigh this drawback.

However, CNNs also come with challenges. They are computationally intensive,
requiring significant time and resources, often necessitating multiple graphical
processing units (GPUs) for effective operation. Furthermore, they demand highly
skilled experts with cross-domain expertise and meticulous testing of
configurations and hyperparameters.

2. RNNs
Recurrent Neural Networks (RNNs) are commonly employed in natural language
processing and speech recognition applications, as they are designed to handle
sequential or time-series data. RNNs are characterized by their feedback loops,
which allow them to utilize past information to inform current inputs and outputs.
They are particularly useful for making predictions based on time-series data, with
applications including stock market forecasting, sales predictions, and temporal
issues like language translation and image captioning. These functions are
frequently integrated into popular technologies such as Siri, voice search, and
Google Translate.

RNNs leverage their "memory" capabilities, where the information from previous
inputs influences the current output. Unlike traditional deep neural networks that
treat inputs and outputs as independent, the output of RNNs is contingent on
preceding elements within the sequence. While considering future events could
enhance the accuracy of predictions, unidirectional recurrent neural networks
cannot incorporate these future elements into their outputs.

RNNs utilize shared parameters across all layers of the network, employing the
same weight parameters within each layer. These weights are adjusted through
backpropagation and gradient descent to support reinforcement learning. To
compute gradients, RNNs use an algorithm known as backpropagation through
time (BPTT), which is tailored for sequential data and differs slightly from
traditional backpropagation. Like standard backpropagation, BPTT allows the
model to learn by calculating errors from the output layer back to the input layer.
However, BPTT accumulates errors at each time step, while feedforward
networks do not sum errors since they lack shared parameters across layers.

One advantage of RNNs over other types of neural networks is their ability to
handle both binary data processing and memory. RNNs can manage multiple
inputs and outputs, enabling them to produce various output types—such as one-
to-many, many-to-one, or many-to-many—rather than simply generating a single
result for a given input.

RNNs also come in different variations, with Long Short-Term Memory (LSTM)
networks being a notable example that outperforms simple RNNs by effectively
learning from and responding to longer-term dependencies. However, RNNs often
face two main challenges: exploding gradients and vanishing gradients, which are
defined by the size of the gradient—essentially the slope of the loss function
along the error curve.

 Vanishing Gradients: This occurs when the gradient becomes too small,
continually diminishing and ultimately resulting in insignificant weight
updates, rendering the algorithm unable to learn effectively.

 Exploding Gradients: In contrast, exploding gradients arise when the
gradient becomes excessively large, leading to an unstable model. In such
cases, the model weights can grow to unmanageable sizes and may be
represented as NaN (not a number). One potential solution to mitigate
these problems is to reduce the number of hidden layers in the neural
network, simplifying the RNN structure.

Additionally, RNNs often require lengthy training periods and can be challenging
to apply to large datasets. The optimization of RNNs can become complicated,
especially when they contain numerous layers and parameters.
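The following minimal PyTorch sketch applies an LSTM to a time-series prediction task of the kind described above, learning to predict the next value of a sine wave; the data and all sizes are illustrative.

    # A minimal LSTM sketch: predict the next value of a sequence.
    import torch
    import torch.nn as nn

    series = torch.sin(torch.linspace(0, 20, 200))

    # Build (input window, next value) training pairs from the series.
    window = 10
    X = torch.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:].unsqueeze(1)
    X = X.unsqueeze(-1)               # (samples, time steps, 1 feature)

    lstm = nn.LSTM(input_size=1, hidden_size=32, batch_first=True)
    head = nn.Linear(32, 1)
    opt = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()), lr=0.01)

    for _ in range(200):
        out, _ = lstm(X)              # memory of earlier inputs flows forward
        pred = head(out[:, -1])       # predict from the last time step
        loss = nn.functional.mse_loss(pred, y)
        opt.zero_grad(); loss.backward(); opt.step()

    print(loss.item())                # shrinks as the network learns the pattern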

3. Autoencoders and variational autoencoders

Deep learning has significantly advanced data analysis, enabling the examination
of images, speech, and other complex types beyond just numerical data.
Variational autoencoders (VAEs) were among the early models to support this,
offering a scalable approach that has become foundational to generative AI.
These models made a notable impact by facilitating the generation of realistic
images and sounds, marking an evolution in deep generative modeling.

Autoencoders function by encoding raw, unlabeled data into a condensed format
and then reconstructing it back to its original form. Basic autoencoders have been
widely applied to tasks like reconstructing blurry or partially corrupted images.
VAEs extended this by introducing the capability to create novel variations of the
data, rather than merely replicating it. This feature of generating unique data
spurred further developments, including generative adversarial networks (GANs)
and diffusion models, each capable of producing increasingly realistic, albeit
artificial, imagery. Thus, VAEs played a pivotal role in shaping the generative AI
landscape we see today.

Autoencoders rely on an architecture of encoders and decoders, similar to the
design behind large language models. Encoders transform data into a compact,
dense representation, clustering similar points within an abstract space. From this
space, decoders can sample to generate new data while retaining key
characteristics of the original dataset.
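The following minimal PyTorch sketch mirrors that encoder/decoder design: the encoder compresses 8x8 digit images into an 8-number code, and the decoder reconstructs the images from it. The sizes and training settings are illustrative.

    # A minimal autoencoder sketch: compress, then reconstruct.
    import torch
    import torch.nn as nn
    from sklearn.datasets import load_digits

    X = torch.tensor(load_digits().data, dtype=torch.float32) / 16.0  # (1797, 64)

    encoder = nn.Sequential(nn.Linear(64, 8), nn.ReLU())     # dense representation
    decoder = nn.Sequential(nn.Linear(8, 64), nn.Sigmoid())  # reconstruction
    opt = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

    for _ in range(300):
        code = encoder(X)             # encode to the compact abstract space
        recon = decoder(code)         # decode back toward the original form
        loss = nn.functional.mse_loss(recon, X)
        opt.zero_grad(); loss.backward(); opt.step()

    print(code.shape, loss.item())    # (1797, 8) and a shrinking reconstruction error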

The primary advantage of autoencoders lies in their ability to efficiently process
large datasets, providing a compressed view of the data where essential patterns
are highlighted. This is particularly useful for tasks like anomaly detection and
classification, as well as for reducing storage and transmission needs. Moreover,
because autoencoders can learn from unlabeled data, they are valuable when
labeled data is limited or unavailable. This unsupervised training approach is also
time-saving, allowing the model to enhance accuracy independently, without
manual feature selection. Furthermore, VAEs can create synthetic data, enabling
new possibilities in text and image generation.

However, autoencoders come with some limitations. Training complex models
can be resource-intensive, and during unsupervised learning, the model may
simply replicate the input without capturing essential properties. Additionally,
autoencoders might fail to capture intricate relationships within structured data,
limiting their ability to understand complex patterns.

4. GANs
Generative adversarial networks (GANs) are a class of neural networks designed
to generate new data that closely resembles the original training data, widely
used within and outside AI applications. For instance, GANs can create images
that look like human faces, although these images are artificially generated rather
than photographs of real individuals.

The "adversarial" aspect of GANs refers to the interaction between two main
components: the generator and the discriminator.

1. The generator is responsible for creating new outputs, such as images,
video, or audio, adding unique variations. For example, it might transform
an image of a horse into a zebra, with the result depending on the
generator's training and model configuration.

2. The discriminator serves as the "adversary," evaluating the generator's
outputs against real examples in the dataset. It attempts to differentiate
between authentic and generated images, video, or audio.

Training in GANs involves this dynamic between the generator and discriminator.
As the generator produces artificial data, the discriminator learns to identify the
differences between the real and generated samples. When the discriminator
accurately identifies a generated output, the generator is adjusted to improve its
results. This iterative process continues until the generator produces outputs
indistinguishable from real data.
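The following minimal PyTorch sketch reproduces that dynamic on toy one-dimensional data rather than images: the discriminator learns to separate real samples from generated ones, and the generator is adjusted whenever its outputs are caught. All sizes and hyperparameters are illustrative.

    # A minimal GAN sketch on toy 1-D data (samples from a Gaussian).
    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))  # generator
    D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCELoss()

    for step in range(2000):
        real = torch.randn(64, 1) * 0.5 + 3.0   # "real" data: mean 3, std 0.5
        fake = G(torch.randn(64, 4))            # generator output from noise

        # Discriminator: tell real (label 1) from generated (label 0).
        d_loss = (bce(D(real), torch.ones(64, 1))
                  + bce(D(fake.detach()), torch.zeros(64, 1)))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator: adjust so its outputs get scored as real.
        g_loss = bce(D(fake), torch.ones(64, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    samples = G(torch.randn(1000, 4))
    print(samples.mean().item(), samples.std().item())  # should drift toward 3.0, 0.5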

GANs’ key advantage lies in their ability to create highly realistic outputs that are
often challenging to tell apart from genuine data, which can be valuable for
training other machine learning models. Training a GAN is relatively
straightforward since it primarily relies on unlabeled data or minimally labeled
datasets. However, GANs do have limitations. The competitive training process
between the generator and discriminator can be computationally intensive and
may require substantial data to produce high-quality results. Another challenge is
"mode collapse," a scenario where the generator repeatedly produces similar
outputs instead of generating diverse variations.

5. Diffusion models
Diffusion models are a type of generative model trained through a process of adding and then removing noise, known as forward and reverse diffusion. They typically generate data, most often images, that resemble their training data. During training, Gaussian noise is incrementally added to each sample until the original is completely unrecognizable, and the model learns a reverse “denoising” process that enables it to create realistic outputs from random noise.

The training goal for a diffusion model is to minimize the difference between its
generated samples and the desired output. This difference, or loss, is calculated,
and the model’s parameters are adjusted to bring the generated samples closer
to the target, making the final results nearly indistinguishable from the original
training data.
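As one illustration, the forward (noising) half of a DDPM-style diffusion model can be written in a few lines of PyTorch; the linear beta schedule and tensor shapes below are common textbook choices, not values from any particular model:

import torch

T = 1000                                        # number of noising steps
betas = torch.linspace(1e-4, 0.02, T)           # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

def add_noise(x0, t):
    # Sample x_t from q(x_t | x_0): a progressively noisier version of x0.
    noise = torch.randn_like(x0)
    signal = alphas_bar[t].sqrt()
    spread = (1.0 - alphas_bar[t]).sqrt()
    return signal * x0 + spread * noise, noise

x0 = torch.rand(1, 3, 32, 32)    # dummy image batch (assumed shape)
x_t, eps = add_noise(x0, t=999)  # near t = T the image is almost pure noise
# Training minimizes the gap between the model's noise prediction and eps,
# e.g. loss = mse(model(x_t, t), eps), so the learned reverse process can
# start from random noise and denoise it into a realistic image.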

Diffusion models offer several benefits, including the ability to produce high-
quality images without adversarial training, leading to faster learning and
enhanced control over the generation process. Compared to GANs, diffusion
models also provide more training stability and are less susceptible to mode
collapse.

However, training diffusion models can demand significant computational
resources and often requires careful fine-tuning. IBM Research® has also
identified a vulnerability in these models: they can be embedded with hidden
backdoors, allowing malicious actors to manipulate the image generation process
to produce altered images.

6. Transformer models

Transformer models, which use an encoder-decoder architecture with advanced text processing, have fundamentally changed language model training. The encoder converts raw text into representations called embeddings, while the decoder uses these embeddings and previous outputs to predict each word in a sentence.

Transformers learn language structure through a “fill-in-the-blank” style of training, capturing the relationships between words and phrases without needing labeled parts of speech or grammar rules. This allows transformers to be pre-trained on vast datasets without any specific task in mind. Once trained, these models can then be quickly adapted to perform a wide range of tasks with minimal additional data.

Several key innovations enable transformers' success. Unlike previous models, such as recurrent neural networks (RNNs), which processed text sequentially, transformers handle words in parallel, accelerating training. They also learn word positions and contextual relationships within sentences, which helps them interpret meaning accurately and resolve ambiguities in longer sentences.
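The mechanism behind this parallel, context-aware processing is scaled dot-product self-attention. The following PyTorch sketch (a single attention head, with assumed toy dimensions) shows every token attending to every other token in one matrix operation:

import torch
import torch.nn.functional as F

seq_len, d_model = 8, 64                 # assumed toy sizes: 8 tokens, 64 dims
x = torch.randn(1, seq_len, d_model)     # token embeddings (with positions)

Wq = torch.nn.Linear(d_model, d_model)   # learned query projection
Wk = torch.nn.Linear(d_model, d_model)   # learned key projection
Wv = torch.nn.Linear(d_model, d_model)   # learned value projection

q, k, v = Wq(x), Wk(x), Wv(x)
scores = q @ k.transpose(-2, -1) / d_model ** 0.5  # pairwise token affinities
weights = F.softmax(scores, dim=-1)      # contextual relationships per token
context = weights @ v                    # all positions updated in parallel

Because the score matrix covers every pair of positions at once, long-range dependencies are captured without the step-by-step recurrence of an RNN.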

Transformers allow language models to be pre-trained on extensive, unlabeled text corpora, leading to large-scale models. Previously, individual tasks required separate models and labeled data for training. Now, a transformer can be pretrained on general data and then fine-tuned with smaller, task-specific labeled datasets for various applications.

Today, transformers are applied to both generative tasks, like machine translation, summarization, and question answering, and non-generative tasks, such as text classification and entity extraction. Their parallel processing capability speeds up training significantly, and their ability to track long-term dependencies in text enables a deeper understanding of context, enhancing output quality. Transformers are also highly scalable and flexible, making them adaptable to diverse tasks.

However, transformers come with limitations: they require significant
computational resources and extended training time. Additionally, high-quality,
unbiased, and ample training data is essential to ensure accurate performance.

Deep learning use cases

The applications of deep learning are continually expanding, providing businesses with powerful tools to increase efficiency and improve customer service. Here are some ways deep learning is transforming modern business:

1. Application modernization
Generative AI is revolutionizing application modernization and IT automation,
bridging the skills gap in these fields. Advances in large language models (LLMs)
and natural language processing (NLP) have enabled AI-driven coding, using deep
learning and vast neural networks trained on extensive datasets of open-source
code.

Developers can input natural language prompts describing desired code functions,
and generative AI suggests relevant code snippets or even complete functions.
This reduces the need for repetitive coding and speeds up development.
Generative AI can also facilitate code translation between programming
languages, which supports projects like converting legacy COBOL code into
modern languages such as Java.
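As a hedged illustration, an open code-generation model can be driven through the Hugging Face pipeline API; the model name below is one public example (any causal code model could be substituted) and the prompt is arbitrary:

from transformers import pipeline

# Example open model; downloading it requires internet access.
generator = pipeline("text-generation", model="Salesforce/codegen-350M-mono")

prompt = "# Python function that returns the factorial of n\ndef factorial(n):"
print(generator(prompt, max_new_tokens=40)[0]["generated_text"])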

2. Computer vision
Computer vision is a branch of artificial intelligence (AI) focused on tasks like
image classification, object detection, and semantic segmentation. By leveraging
machine learning and neural networks, it enables computers to analyze and
interpret digital images, videos, and other visual inputs to extract valuable
insights. This analysis allows systems to recommend actions or detect issues, such
as identifying defects in products. If AI equips computers with the ability to
"think," computer vision grants them the ability to "see," "observe," and
"understand."

Often used to inspect products or monitor production processes, computer vision
systems are capable of analyzing thousands of items per minute, detecting even
subtle defects that might go unnoticed by human inspectors. Its applications span
diverse industries, including energy, utilities, manufacturing, and automotive
sectors.

To perform effectively, computer vision systems require vast amounts of data, which they analyze repeatedly until they learn to recognize specific images. For instance, training a system to identify automobile tires involves feeding it extensive image datasets of tires and related objects so it can accurately recognize a tire and identify any defects.

The technology relies on algorithmic models that allow computers to learn from
the context within visual data. When a large volume of data is processed, the
model gradually learns to distinguish between different images without direct
programming for specific image recognition.
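A minimal sketch of such an image classifier in PyTorch (for example, "defect" versus "no defect"); the architecture, image size, and random stand-in data are assumptions made only to illustrate the training signal:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                 # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                 # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 2),        # two classes, e.g. defect / no defect
)

images = torch.rand(8, 3, 32, 32)    # dummy batch standing in for photos
labels = torch.randint(0, 2, (8,))   # dummy ground-truth labels
loss = nn.functional.cross_entropy(model(images), labels)
loss.backward()  # repeated over many real batches, the filters gradually
                 # learn to distinguish the classes without explicit rules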

Computer vision enables systems to derive insights from visual inputs and act
based on those observations, setting it apart from basic image recognition.
Some notable uses of computer vision today include:

1. Automotive: While fully autonomous vehicles are still in development, computer vision technology is already used in cars to enhance safety, with features like lane detection.

2. Healthcare: In radiology, computer vision assists doctors in identifying cancerous tumors within healthy tissue, enhancing diagnostic accuracy.

3. Marketing: Social media platforms offer suggestions for tagging individuals in photos, simplifying photo album management for users.

4. Retail: E-commerce sites use visual search to suggest items that match or
enhance a customer’s wardrobe, personalizing the shopping experience.

3. Customer care
AI is enabling businesses to better understand and respond to growing
consumer demands. In a world of highly personalized online shopping, direct-to-
consumer models, and quick delivery options, generative AI offers a range of
benefits to enhance customer care, talent development, and application
performance.

Through a customer-focused, data-driven approach, AI allows businesses to glean insights from customer feedback and purchasing patterns. These insights support improvements in product design, packaging, and overall customer satisfaction, which can ultimately drive sales. Generative AI also acts as a cognitive assistant in customer support, using conversation history, sentiment analysis, and call center transcripts to offer relevant guidance. Additionally, generative AI helps create customized shopping experiences, fostering customer loyalty and providing a competitive edge.

4. Digital labor
Organizations can boost productivity by integrating robotic process automation
(RPA) and digital labor to complement human efforts or provide additional
support when needed. For instance, digital labor can assist developers with
updating legacy systems more efficiently.

By leveraging foundation models, digital labor enhances knowledge workers’ productivity by enabling reliable self-service automation without technical barriers. Using enterprise-grade large language models (LLMs) for tasks like slot filling, digital labor can identify necessary information within conversations to perform tasks or make API calls with minimal manual input.
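As a toy illustration of slot filling, the snippet below pulls the fields an automation would need out of a free-text request. A production system would use an LLM for this step; the regular expressions are only a stand-in to make the idea concrete:

import re

utterance = "Book a meeting room for 4 people tomorrow at 10am"

# Each slot is a piece of information the automation needs before it can act.
matches = {
    "headcount": re.search(r"for (\d+) people", utterance),
    "day": re.search(r"\b(today|tomorrow)\b", utterance),
    "time": re.search(r"\bat (\d{1,2}(?::\d{2})?(?:am|pm))\b", utterance),
}
slots = {name: m.group(1) for name, m in matches.items() if m}
print(slots)  # {'headcount': '4', 'day': 'tomorrow', 'time': '10am'}
# Once all required slots are filled, the assistant can make the API call.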

Instead of requiring technical experts to define repetitive workflows for knowledge tasks, digital labor built on model-powered instructions allows knowledge workers to use self-service automation. For instance, in application development, no-code digital assistants can guide end-users who lack programming skills by teaching, overseeing, and validating code, accelerating the app creation process.

5. Generative AI

Generative AI, often referred to as "gen AI," involves deep learning models
capable of producing original content—such as detailed text, high-quality images,
realistic videos, and more—in response to user prompts.

In essence, generative models encode a simplified version of their training data, which they use to generate new content that resembles, but is distinct from, the original data.

Although generative models have been used in statistical analysis for years to
handle numerical data, advancements over the past decade have expanded their
application to more complex data types.
This shift aligns with the development of three advanced deep learning model
types:

 Variational Autoencoders (VAEs), introduced in 2013, enable models to create multiple content variations based on a prompt or instruction.

 Diffusion Models, first developed in 2014, apply incremental "noise" to images until they become unrecognizable, then reverse the process to produce original images in response to user inputs.

 Transformers (or transformer models) are trained on sequential data to generate complex sequences, such as words in sentences, shapes in images, video frames, or software commands. Transformers underpin most current generative AI applications, including tools like ChatGPT, GPT-4, Copilot, BERT, Bard, and Midjourney.

Generative AI typically operates through three main phases:

1. Training to establish a foundational model.

2. Tuning to specialize the model for a particular application.

3. Generation, evaluation, and further tuning to refine the model’s accuracy and performance.

1. Training

Generative AI starts with a "foundation model," a deep learning model designed
as a base for various types of generative AI applications. Today, the most
prevalent foundation models are large language models (LLMs) used for text
generation, though there are also foundation models specifically for generating
images, video, sound, and music, as well as multimodal models capable of
producing different types of content.

To create a foundation model, AI practitioners train a deep learning algorithm on vast amounts of relevant, raw, unstructured, and unlabeled data, often ranging from terabytes to petabytes of internet-sourced text, images, or video. This extensive training yields a neural network containing billions of parameters, encoding representations of entities, patterns, and relationships in the data. As a result, the model can autonomously generate content based on prompts. However, this process is highly resource-intensive, requiring thousands of interconnected GPUs, significant time, and substantial financial investment, often costing millions of dollars. Open-source foundation models, such as Meta's Llama-2, allow generative AI developers to bypass these costs and complexities.

2. Tuning

Once the foundation model is established, it requires tuning for specific content
generation tasks.
This can be accomplished through several methods:

 Fine-tuning, which involves supplying the model with labeled data specific to the task, such as common questions or prompts the application may receive and corresponding correct answers in the preferred format (a minimal sketch follows this list).

 Reinforcement Learning with Human Feedback (RLHF), where human evaluators assess the relevance or accuracy of the model’s outputs, allowing the model to improve iteratively. This feedback loop may involve users typing or speaking corrections to a virtual assistant or chatbot, refining its responses over time.
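A minimal sketch of the fine-tuning method from the list above, using the Hugging Face transformers Trainer API as one common route; the base model (gpt2), the hyperparameters, and the two toy labeled examples are illustrative assumptions:

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # example base model
tokenizer.pad_token = tokenizer.eos_token           # gpt2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tiny stand-in for the labeled prompt/answer pairs described above.
pairs = ["Q: What does the encoder produce? A: A dense embedding.",
         "Q: What does RLHF rely on? A: Human feedback on model outputs."]
dataset = Dataset.from_dict({"text": pairs}).map(
    lambda ex: tokenizer(ex["text"], truncation=True,
                         padding="max_length", max_length=64))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune-demo",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # adjusts the pretrained weights toward the labeled examples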

3. Generation, evaluation, and more tuning

Developers and users regularly evaluate the outputs of their generative AI
applications, making frequent adjustments to enhance accuracy and relevance—
sometimes updating the model every week. In contrast, updates to the
foundation model itself occur much less frequently, typically every 12 to 18
months.

Another approach to boost an AI application's performance is retrieval-augmented generation (RAG). This technique enables the model to incorporate relevant information from external sources beyond its training data, improving the accuracy and relevance of its outputs without changing the model's parameters.
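A minimal RAG sketch: embed a handful of external documents, retrieve the one closest to the user's question, and prepend it to the prompt. The embedding model name and the documents are illustrative assumptions:

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

# External documents that were never part of the model's training data.
docs = ["The clinic opens at 9am on weekdays.",
        "Follow-up assessments are scheduled every six months."]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

question = "When does the clinic open?"
q_vec = embedder.encode([question], normalize_embeddings=True)[0]

best = int(np.argmax(doc_vecs @ q_vec))  # cosine similarity on unit vectors
prompt = f"Context: {docs[best]}\nQuestion: {question}\nAnswer:"
# `prompt` is then passed to the language model, grounding its answer in
# retrieved information rather than in its fixed training data alone.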

6. Natural language processing and speech recognition

Natural Language Processing (NLP) combines computational linguistics, the rule-based modeling of human language, with statistical and machine learning models. This allows computers and digital devices to recognize, understand, and generate text and speech. NLP powers tools and devices capable of translating between languages, responding to text or voice commands, identifying or verifying users by voice, summarizing extensive text, understanding intent or sentiment in speech and text, and generating content on demand.

A specific branch of NLP, called statistical NLP, integrates algorithms with machine learning and deep learning models to automatically extract, classify, and label components of text and speech, then assigns a probability to each potential meaning of those components. Today, deep learning models, especially those based on recurrent neural networks (RNNs), enable NLP systems to "learn" as they operate, deriving increasingly accurate meanings from massive volumes of raw, unstructured text and voice data.
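As a brief illustration of this probabilistic view, a pretrained sentiment pipeline attaches a confidence score to its interpretation of a sentence (the library downloads a default model; the printed values are examples, not guaranteed output):

from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default model
print(classifier("The new interface is far easier to use than the old one."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}] -- a probability attached to
# the system's interpretation of the sentence, as described above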

Speech Recognition, or Automatic Speech Recognition (ASR), is a technology that allows software to convert spoken language into text. Unlike voice recognition, which focuses on identifying a specific user by their voice, speech recognition primarily involves transcribing verbal language into text format.
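A minimal ASR sketch using the same pipeline API with an open speech model; the model name is one small public example and the audio file path is a hypothetical placeholder for a real recording:

from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
result = asr("interview_recording.wav")  # placeholder audio file
print(result["text"])                    # the transcribed speech as plain text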
