Unit - 4 AI
Concept of Grammar
Grammar is essential for describing the syntactic structure of well-formed
programs. In the literary sense, grammars denote syntactical rules for conversation in natural
languages. Linguists have attempted to define grammars since the inception of natural
languages like English, Hindi, etc.
The theory of formal languages is also applicable in the field of computer science, mainly in
programming languages and data structures. For example, in the ‘C’ language, precise grammar
rules state how functions are made from lists and statements.
A mathematical model of grammar was given by Noam Chomsky in 1956, which is effective
for writing computer languages.
Mathematically, a grammar G can be formally written as a 4-tuple (N, T, S, P) where −
• N or VN = set of non-terminal symbols, i.e., variables.
• T or ∑ = set of terminal symbols.
• S = start symbol, where S ∈ N.
• P denotes the production rules for terminals as well as non-terminals. Each rule has the form α
→ β, where α and β are strings over VN ∪ ∑ and at least one symbol of α belongs to VN.
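For instance, a grammar in this notation that generates the language {a^n b^n | n ≥ 1} is
G = ({S}, {a, b}, S, P) with P = {S → aSb, S → ab}. A sample derivation is
S → aSb → aaSbb → aaabbb.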
Example
Before giving an example of constituency grammar, we need to know the fundamental points
about constituency grammar and constituency relation.
• All the related frameworks view the sentence structure in terms of constituency relation.
• The constituency relation is derived from the subject-predicate division of Latin as well
as Greek grammar.
• The basic clause structure is understood in terms of the noun phrase (NP) and the verb phrase
(VP).
We can write the sentence “This tree is illustrating the constituency relation” as follows −
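As an illustrative bracketing (one possible analysis), the sentence can be written as
[S [NP This tree] [VP is illustrating [NP the constituency relation]]].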
Dependency Grammar
Dependency grammar (DG) is the opposite of constituency grammar and is based on the
dependency relation. It was introduced by Lucien Tesnière. DG differs from constituency
grammar because it lacks phrasal nodes.
Example
Before giving an example of Dependency grammar, we need to know the fundamental points
about Dependency grammar and Dependency relation.
• In DG, the linguistic units, i.e., words are connected to each other by directed links.
• The verb becomes the center of the clause structure.
• Every other syntactic unit is connected to the verb by a directed link. These
syntactic units are called dependencies.
We can write the sentence “This tree is illustrating the dependency relation” as follows;
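As an illustrative analysis (one possible set of directed links, written head → dependent):
illustrating → tree, illustrating → is, illustrating → relation, tree → This, relation → the,
relation → dependency. The verb “illustrating” is the center (root) of the clause.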
A parse tree that uses constituency grammar is called a constituency-based parse tree, and a
parse tree that uses dependency grammar is called a dependency-based parse tree.
Definition of CFG
A CFG consists of a finite set of grammar rules with the following four components −
Set of Non-terminals
It is denoted by V. The non-terminals are syntactic variables that denote sets of strings,
which help define the language generated by the grammar.
Set of Terminals
Terminals are also called tokens, and the set is denoted by Σ. Strings are formed from these basic symbols.
Set of Productions
It is denoted by P. The set defines how terminals and non-terminals can be combined. Every
production consists of a non-terminal, an arrow, and a sequence of terminals and/or non-terminals.
The non-terminal is called the left side of the production, and the sequence it expands to is called
the right side of the production.
Start Symbol
Derivation begins from the start symbol, denoted by S. A non-terminal symbol is always
designated as the start symbol.
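As an illustration of these four components, the sketch below builds a small CFG and parses a sentence with it. It is not part of the original notes; it assumes the NLTK library is installed and uses a toy grammar invented for the example.

```python
import nltk

# Toy CFG: V = {S, NP, VP, Det, N, V}, Sigma = {'the', 'dog', 'cat', 'chased'},
# start symbol S, and the productions listed below.
grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N  -> 'dog' | 'cat'
V  -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the cat".split()):
    print(tree)   # prints the constituency-based parse tree
```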
Syntactic Analysis
Syntactic analysis, also called parsing or syntax analysis, is the third phase of NLP. The purpose
of this phase is to analyze the grammatical structure of the text against the rules of formal
grammar. Meaningfulness is checked separately; for example, a phrase like “hot ice-cream” is
grammatical but would be rejected by a semantic analyzer.
In this sense, syntactic analysis or parsing may be defined as the process of analyzing strings
of symbols in natural language for conformance to the rules of a formal grammar. The word
‘parsing’ comes from the Latin word ‘pars’, which means ‘part’.
Concept of Parser
It is used to implement the task of parsing. It may be defined as the software component
designed to take input data (text) and give a structural representation of the input after
checking for correct syntax as per a formal grammar. It also builds a data structure, generally in
the form of a parse tree, an abstract syntax tree, or another hierarchical structure.
The main roles of the parser include −
• To report any syntax error.
• To recover from commonly occurring errors so that the processing of the remainder of the
program can be continued.
• To create parse tree.
• To create symbol table.
• To produce intermediate representations (IR).
Types of Parsing
Derivation divides parsing into the following two types −
• Top-down Parsing
• Bottom-up Parsing
Top-down Parsing
In this kind of parsing, the parser starts constructing the parse tree from the start symbol and
then tries to transform the start symbol into the input. The most common form of top-down
parsing uses recursive procedures to process the input. The main disadvantage of recursive
descent parsing is backtracking.
Bottom-up Parsing
In this kind of parsing, the parser starts with the input symbols and tries to construct the parse
tree up to the start symbol.
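As a sketch of the two strategies (again assuming NLTK and re-using the toy grammar from the CFG example above), NLTK exposes a backtracking top-down parser and a shift-reduce bottom-up parser:

```python
import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N  -> 'dog' | 'cat'
V  -> 'chased'
""")
sentence = "the cat chased the dog".split()

# Top-down: expands from the start symbol S and backtracks on failure.
td = nltk.RecursiveDescentParser(grammar)
for tree in td.parse(sentence):
    print(tree)

# Bottom-up: shifts input tokens and reduces them toward the start symbol S.
bu = nltk.ShiftReduceParser(grammar)
for tree in bu.parse(sentence):
    print(tree)
```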
Concept of Derivation
In order to derive the input string, we need a sequence of production rules. A derivation is such
a sequence of production rule applications. During parsing, we need to decide which
non-terminal is to be replaced, as well as which production rule to use to replace it.
Types of Derivation
In this section, we will learn about the two types of derivations, which differ in which
non-terminal is chosen to be replaced at each step −
Left-most Derivation
In the left-most derivation, the sentential form of an input is scanned and replaced from the left
to the right. The sentential form in this case is called the left-sentential form.
Right-most Derivation
In the right-most derivation, the sentential form of an input is scanned and replaced from right to
left. The sentential form in this case is called the right-sentential form.
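As a worked example (added for illustration), take the grammar E → E + E | E * E | id and the string id + id * id. A left-most derivation always expands the left-most non-terminal:
E → E + E → id + E → id + E * E → id + id * E → id + id * id
A right-most derivation always expands the right-most non-terminal:
E → E + E → E + E * E → E + E * id → E + id * id → id + id * id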
MACHINE TRANSLATION
Machine translation (MT), the process of translating text from one source language into another
language, is one of the most important applications of NLP. We can understand the process of
machine translation with the help of the following flowchart −
Types of Machine Translation Systems
There are different types of machine translation systems. Let us see what the different types are.
Bilingual MT System
Bilingual MT systems produce translations between two particular languages.
Multilingual MT System
Multilingual MT systems produce translations between any pair of languages. They may be
either uni-directional or bi-directional in nature.
Approaches to Machine Translation (MT)
Let us now learn about the important approaches to Machine Translation. The approaches to
MT are as follows −
Direct MT Approach
It is the oldest, though less popular, approach to MT. Systems that use this approach are capable
of translating SL (source language) directly to TL (target language). Such systems are bilingual
and uni-directional in nature.
Interlingua Approach
The systems that use Interlingua approach translate SL to an intermediate language called
Interlingua (IL) and then translate IL to TL. The Interlingua approach can be understood with
the help of the following MT pyramid −
Transfer Approach
In the transfer approach, the source text is first analyzed into an abstract source-language
representation, this representation is then transferred into a corresponding target-language
representation, and the target text is finally generated from it.
Empirical MT Approach
This is an emerging approach to MT. Basically, it uses a large amount of raw data in the form of
parallel corpora. The raw data consist of texts and their translations. Analogy-based,
example-based, and memory-based machine translation techniques use the empirical MT approach.
Speech recognition
Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or
speech-to-text, is a capability which enables a program to process human speech into a written format.
While it’s commonly confused with voice recognition, speech recognition focuses on the translation of
speech from a verbal format to a text one whereas voice recognition just seeks to identify an individual
user’s voice.
The best kind of systems also allow organizations to customize and adapt the technology to their
specific requirements — everything from language and nuances of speech to brand recognition.
For example:
• Language weighting: Improve precision by weighting specific words that are spoken
frequently (such as product names or industry jargon), beyond terms already in the base
vocabulary.
• Speaker labeling: Output a transcription that cites or tags each speaker’s contributions
to a multi-participant conversation.
• Acoustics training: Attend to the acoustical side of the business. Train the system to
adapt to an acoustic environment (like the ambient noise in a call center) and speaker
styles (like voice pitch, volume and pace).
• Profanity filtering: Use filters to identify certain words or phrases and sanitize speech
output.
Meanwhile, speech recognition continues to advance. Companies like IBM are making inroads
in several areas to improve human and machine interaction.
Speech recognition technology is evaluated on its accuracy rate, i.e., word error rate (WER), and
speed. A number of factors can impact word error rate, such as pronunciation, accent, pitch,
volume, and background noise. Reaching human parity – meaning an error rate on par with that
of two humans speaking – has long been the goal of speech recognition systems. Research from
Lippmann estimates the human word error rate to be around 4 percent, but it has been difficult
to replicate the results from this paper.
Various algorithms and computation techniques are used to convert speech into text and
improve the accuracy of transcription. Below are brief explanations of some of the most
commonly used methods:
• Natural language processing (NLP): While NLP isn’t necessarily a specific algorithm
used in speech recognition, it is the area of artificial intelligence that focuses on the
interaction between humans and machines through language, both spoken and written.
Many mobile devices incorporate speech recognition into their systems to conduct
voice search—e.g. Siri—or provide more accessibility around texting.
• Hidden Markov models (HMM): Hidden Markov models build on the Markov chain
model, which stipulates that the probability of a given state hinges on the current state,
not its prior states. While a Markov chain model is useful for observable events, such as
text inputs, hidden Markov models allow us to incorporate hidden events, such as part-
of-speech tags, into a probabilistic model. They are utilized as sequence models within
speech recognition, assigning labels to each unit—i.e. words, syllables, sentences,
etc.—in the sequence. These labels create a mapping with the provided input, allowing
the model to determine the most appropriate label sequence.
• N-grams: This is the simplest type of language model (LM), which assigns
probabilities to sentences or phrases. An N-gram is a sequence of N words. For example,
“order the pizza” is a trigram or 3-gram and “please order the pizza” is a 4-gram.
Grammar and the probability of certain word sequences are used to improve recognition
accuracy (a minimal counting sketch appears after this list).
• Neural networks: Primarily leveraged for deep learning algorithms, neural networks
process training data by mimicking the interconnectivity of the human brain through
layers of nodes. Each node is made up of inputs, weights, a bias (or threshold) and an
output. If that output value exceeds a given threshold, it “fires” or activates the node,
passing data to the next layer in the network. Neural networks learn this mapping
function through supervised learning, adjusting based on the loss function through the
process of gradient descent. While neural networks tend to be more accurate and can
accept more data, this comes at a performance efficiency cost as they tend to be slower
to train compared to traditional language models.
• Speaker diarization (SD): Speaker diarization algorithms identify and segment speech
by speaker identity. This helps programs better distinguish individuals in a conversation
and is frequently applied in call centers to distinguish customers from sales agents.
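To make the N-gram idea concrete, here is a minimal sketch (not from the source text) that estimates bigram probabilities from a tiny toy corpus by simple counting; the corpus and variable names are invented for illustration.

```python
from collections import Counter

# Hypothetical toy corpus, purely for illustration.
corpus = "please order the pizza please order the salad".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(bigram_prob("order", "the"))   # 1.0  ('order' is always followed by 'the')
print(bigram_prob("the", "pizza"))   # 0.5  ('the' is followed by 'pizza' half the time)
```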
There is one important difference between an artificial intelligence program and a robot. The AI
program performs in a computer-simulated environment, while the robot performs in the physical
world.
Example:
In chess, an AI program can make a move by searching different nodes but has no facility
to touch or sense the physical world.
However, a chess-playing robot can make a move and grasp the pieces by interacting with the
physical world.
Image formation
Image formation is a physical process that captures objects in the scene through a lens and creates a 2-D
image.
Let's understand the geometry of a pinhole camera shown in the following diagram.
In this model, the optical axis is perpendicular to the image plane, and the image plane is generally
placed in front of the optical center.
So, let P be a point in the scene with coordinates (X, Y, Z) and P′ be its projection on the image
plane, with coordinates (x, y).
If the focal length from the optical center is f, then by using the properties of similar triangles, the
following equations are derived:
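With the image plane placed in front of the optical center, the perspective projection equations are
x = f·X / Z,  y = f·Y / Z
(in the equivalent model with the image plane behind the pinhole, both signs are negated).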
The 3-D information extraction process plays an important role in performing tasks like
manipulation, navigation and recognition. It deals with the following aspects:
1. To segment the scene
Segmentation is used to arrange the array of image pixels into regions. This helps to match
semantically meaningful entities in the scene.
• The goal of segmentation is to divide an image into regions which are homogeneous.
2. To determine the position and orientation of each object
Determining the position and orientation of each object relative to the observer is important for
manipulation and navigation tasks.
For example: Suppose a person goes to a store to buy something. While moving around, he must
know the locations of objects and obstacles, so that he can plan a path that avoids them.
• The whole orientation of the image should be specified in terms of a three-dimensional rotation.
3. To determine the shape of each and every object
When the camera moves around an object, the distance and orientation of that object change, but
it is important to preserve the shape of that object.
For example: If an object is a cube, that fact does not change, but it is difficult to represent global
shape in a way that deals with the wide variety of objects present in the real world.
• If the shape of an object stays the same across manipulation tasks, it becomes easy to decide how to
grasp that object from a particular place.
• Object recognition plays the most significant role in identifying and classifying objects, for example
when the geometric shapes are provided with color and texture.
There are a number of techniques available in the visual stimulus for 3-D image extraction, such
as motion, binocular stereopsis, texture, shading, and contour. Each of these techniques operates
on background assumptions about the physical scene to provide an interpretation.
Image processing is the process of transforming an image into a digital form and performing certain
operations on it to extract useful information. An image processing system usually treats all images
as 2-D signals when applying certain predetermined signal processing methods. Common operations include:
• Sharpening and restoration - Create an enhanced image from the original image
• Pattern recognition - Measure the various patterns around the objects in the image
• Retrieval - Browse and search images from a large database of digital images that are similar
to the original image
Image Acquisition
Image acquisition is the first step in image processing. This step is also known as preprocessing
in image processing. It involves retrieving the image from a source, usually a hardware-based
source.
Image Enhancement
Image enhancement is the process of bringing out and highlighting certain features of interest in
an image that have been obscured. This can involve changing the brightness, contrast, etc.
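As a small sketch of such an adjustment (assuming OpenCV is installed; the file name and the gain/bias values are arbitrary examples):

```python
import cv2

# Load an image (hypothetical file name) and adjust contrast and brightness:
# output = alpha * input + beta, clipped to the valid 0-255 range.
img = cv2.imread("input.jpg")
enhanced = cv2.convertScaleAbs(img, alpha=1.3, beta=25)  # alpha: contrast gain, beta: brightness offset
cv2.imwrite("enhanced.jpg", enhanced)
```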
Image Restoration
Image restoration is the process of improving the appearance of an image. However, unlike
image enhancement, image restoration is done using certain mathematical or probabilistic
models.
Color Image Processing
Color image processing includes a number of color modeling techniques in a digital domain.
This step has gained prominence due to the significant use of digital images over the internet.
Wavelets and Multiresolution Processing
Wavelets are used to represent images in various degrees of resolution. The images are
subdivided into wavelets or smaller regions for data compression and for pyramidal
representation.
Compression
Compression is a process used to reduce the storage required to save an image or the bandwidth
required to transmit it. This is done particularly when the image is for use on the Internet.
Morphological Processing
Morphological processing is a set of processing operations that analyze and transform images
based on their shapes.
Segmentation
Segmentation is one of the most difficult steps of image processing. It involves partitioning an
image into its constituent parts or objects.
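A minimal segmentation sketch (assuming OpenCV; the file name is a placeholder) that splits an image into foreground and background regions using Otsu's thresholding:

```python
import cv2

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)
# Otsu's method automatically chooses the threshold that best separates
# the two pixel-intensity classes (foreground vs. background).
_, mask = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("segmented.jpg", mask)
```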
Representation and Description
After an image is segmented into regions in the segmentation process, each region is represented
and described in a form suitable for further computer processing. Representation deals with the
image’s characteristics and regional properties. Description deals with extracting quantitative
information that helps differentiate one class of objects from another.
RECOGNITION
Recognition assigns a label (for example, “vehicle”) to an object based on its descriptors. It is
typically the final phase of digital image processing.
Applications of Image Processing
Image processing has been extensively used in medical research and has enabled more efficient
and accurate treatment plans. For example, it can be used for the early detection of breast cancer
using a sophisticated nodule detection algorithm in breast scans. Since medical usage calls for
highly trained image processors, these applications require significant implementation and
evaluation before they can be accepted for use.
In the case of traffic sensors, we use a video image processing system or VIPS. This consists of
a) an image capturing system b) a telecommunication system and c) an image processing system.
When capturing video, a VIPS has several detection zones which output an “on” signal whenever
a vehicle enters the zone, and then output an “off” signal whenever the vehicle exits the detection
zone. These detection zones can be set up for multiple lanes and can be used to sense the traffic
in a particular station.
Figure: left, a normal traffic image; right, a VIPS image with detection zones.
Besides this, it can automatically record the license plate of a vehicle, distinguish the type of
vehicle, monitor the speed of drivers on the highway, and more.
Image Reconstruction
Image processing can be used to recover and fill in the missing or corrupt parts of an image. This
involves using image processing systems that have been trained extensively with existing photo
datasets to create newer versions of old and damaged photos.
Face Detection
One of the most common applications of image processing that we use today is face detection. It
follows deep learning algorithms where the machine is first trained with the specific features of
human faces, such as the shape of the face, the distance between the eyes, etc. After learning
these features, the machine starts to detect objects in an image that resemble a human face. Face
detection is a vital tool used in security, biometrics and even the filters available on most social
media apps these days.
Benefits of Image Processing
The implementation of image processing techniques has had a massive impact on many tech
organizations. Here are some of the most useful benefits of image processing, regardless of the
field of operation:
• The digital image can be made available in any desired format (improved image, X-Ray,
photo negative, etc)
• Information can be processed and extracted from images for machine interpretation
• The pixels in the image can be manipulated to any desired density and contrast
Digital Image Processing: It is used for manipulating digital images with the help
of computer algorithms.
For digital image processing, the output may be an image or information
associated with that image, such as data on features, characteristics, bounding boxes,
or masks.
Digital image processing includes the eight key phases outlined above, from image acquisition
and enhancement through restoration, color image processing, wavelets, compression, and
morphological processing to segmentation, followed by representation and description and,
finally, recognition.
These techniques are often used during the image augmentation process. When you
lack data, you can extend your dataset with slightly augmented images. In this way,
you can improve the way your neural network model generalizes data and make sure
it provides high-quality results.
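A minimal augmentation sketch (assuming OpenCV; the file name is a placeholder) that generates slightly modified copies of one image:

```python
import cv2

img = cv2.imread("input.jpg")
h, w = img.shape[:2]

# A few simple augmentations: horizontal flip, small rotation, brightness change.
flipped = cv2.flip(img, 1)
rot_matrix = cv2.getRotationMatrix2D((w / 2, h / 2), 10, 1.0)  # rotate by 10 degrees
rotated = cv2.warpAffine(img, rot_matrix, (w, h))
brighter = cv2.convertScaleAbs(img, alpha=1.0, beta=40)

for name, aug in [("flip", flipped), ("rot", rotated), ("bright", brighter)]:
    cv2.imwrite(f"aug_{name}.jpg", aug)
```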
Object recognition is a key technology behind driverless cars, enabling them to recognize
a stop sign or to distinguish a pedestrian from a lamppost. It is also useful in a variety of
applications such as disease identification in bioimaging, industrial inspection, and
robotic vision.
Object detection and object recognition are similar techniques for identifying objects, but
they vary in their execution. Object detection is the process of finding instances of
objects in images. In the case of deep learning, object detection is a subset of object
recognition, where the object is not only identified but also located in an image. This
allows for multiple objects to be identified and located within the same image.
How Object Recognition Works
You can use a variety of approaches for object recognition. Recently, techniques
in machine learning and deep learning have become popular approaches to object
recognition problems. Both techniques learn to identify objects in images, but they differ
in their execution.
OBJECT RECOGNITION TECHNIQUES
1. Template matching
Template matching is a technique for finding small parts of an image that match a template image. It is
a straightforward process: template images for different objects are stored, and when an image is given as
input to the system, it is matched against the stored templates to determine the object in the input image.
Templates are frequently used for recognition of characters, numbers, objects, etc. Matching can be
performed on either color or gray-level images, and it can be either pixel-to-pixel or feature-based. In
feature-based matching, the features of the template image are compared to the features of
sub-images of the given input image to determine whether the template object is present in the input image.
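A pixel-based sketch using OpenCV's template matcher (the file names are placeholders; normalized cross-correlation is one of several available similarity measures):

```python
import cv2

scene = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("template.jpg", cv2.IMREAD_GRAYSCALE)
th, tw = template.shape

# Slide the template over the scene and score each position.
scores = cv2.matchTemplate(scene, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(scores)

if max_val > 0.8:  # arbitrary confidence threshold for this sketch
    x, y = max_loc
    cv2.rectangle(scene, (x, y), (x + tw, y + th), 255, 2)
    cv2.imwrite("match.jpg", scene)
```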
2. Color based
Color provides potent information for object recognition. A simple and efficient object detection scheme
is to represent and match images on the basis of color histograms.
Color information has been integrated into two existing methods for object detection: the part-based
detection framework and the Efficient Subwindow Search approach. The three main criteria that should be
taken into account when choosing an approach to integrating color into object detection are feature
combination, photometric invariance and compactness.
A variety of color models can be used for recognition of multicolored objects; they are compared
according to criteria such as robustness to a change in viewing direction, high discriminative power,
robustness to object occlusion and cluttering, and robustness to noise in the images.
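A histogram-based matching sketch (assuming OpenCV; the file names are placeholders) that compares the hue-saturation histograms of two images:

```python
import cv2

def hs_histogram(path):
    """Compute a normalized hue-saturation histogram of an image."""
    img = cv2.imread(path)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [30, 32], [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()

h1 = hs_histogram("object.jpg")
h2 = hs_histogram("candidate.jpg")

# Correlation close to 1.0 suggests similar color distributions.
similarity = cv2.compareHist(h1, h2, cv2.HISTCMP_CORREL)
print(similarity)
```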
3. Active and passive scanning
In passive scanning, costly feature extraction is performed at each sampling point, while the probability of
detecting an object or a suitable interest point can be low, which makes it a computationally expensive and
inefficient scanning method. Some methods therefore try to bound the region of the image in which
passive scanning is applied. In active scanning, local samples are used to guide the scanning process: at
the current scanning position a local image sample is extracted and mapped to a shift vector indicating the
next scanning position. The method takes successive samples toward the expected object location while
skipping regions unlikely to contain the object. The goal of active scanning is to save computational effort
while retaining good detection performance.
The active object-detection method (AOD-method) scans the image for multiple discrete time steps in
order to find an object. In the AOD-method this process consists of three phases, the last of which is
verifying object presence at the final scanning position with a standard object detector.
4. Shape based
Recently, shape features have been extensively explored to detect objects in real-world images. Shape
features are more attractive than local features like SIFT because most object categories (such as cows,
horses and cups) are better described by their shape than by their texture, and for wiry objects like bikes,
chairs or ladders, local features unavoidably contain a large amount of background clutter. Thus shape
features are often used as a replacement for, or complement to, local features.
Berg et al. proposed an algorithm to find correspondences between feature points for object
recognition in the framework of deformable shape matching. The basic subroutine in deformable shape
matching takes as input an image with an unknown object (shape) and compares it to a model by solving
the correspondence problem between the model and the object. It then computes an aligning transformation
and a similarity score based on both the aligning transform and the residual after applying the
aligning transformation.
Deep learning techniques have become a popular method for doing object
recognition. Deep learning models such as convolutional neural networks, or CNNs, are
used to automatically learn an object’s inherent features in order to identify that object.
For example, a CNN can learn to identify differences between cats and dogs by analyzing
thousands of training images and learning the features that make cats and dogs different.
There are two approaches to performing object recognition using deep learning:
• Training a model from scratch: To train a deep network from scratch, you gather
a very large labeled dataset and design a network architecture that will learn the
features and build the model. The results can be impressive, but this approach
requires a large amount of training data, and you need to set up the layers and
weights in the CNN.
• Using a pretrained deep learning model: Most deep learning applications use
the transfer learning approach, a process that involves fine-tuning a pretrained
model. You start with an existing network, such as AlexNet or GoogLeNet, and
feed in new data containing previously unknown classes. This method is less time-
consuming and can provide a faster outcome because the model has already been
trained on thousands or millions of images.
Deep learning offers a high level of accuracy but requires a large amount of data to make
accurate predictions.
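As a sketch of the pretrained, transfer-learning approach described above (assuming a recent PyTorch and torchvision install; it uses ResNet-18 rather than the AlexNet/GoogLeNet mentioned above, and the number of classes is a placeholder):

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a network pretrained on ImageNet and replace its final layer
# so it predicts our own (hypothetical) set of 5 object classes.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False          # freeze the pretrained feature extractor
model.fc = nn.Linear(model.fc.in_features, 5)  # new, trainable classification head

# Only the new head's parameters are optimized during fine-tuning.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training would then loop over a labeled dataset of the new classes:
# for images, labels in dataloader:
#     loss = criterion(model(images), labels)
#     ...
```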
Machine Learning Workflow
To perform object recognition using a standard machine learning approach, you start with
a collection of images (or video), and select the relevant features in each image. For
example, a feature extraction algorithm might extract edge or corner features that can be
used to differentiate between classes in your data.
These features are added to a machine learning model, which will separate these features
into their distinct categories, and then use this information when analyzing and
classifying new objects.
You can use a variety of machine learning algorithms and feature extraction methods,
which offer many combinations to create an accurate object recognition model.
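A minimal sketch of this workflow (assuming scikit-image and scikit-learn are installed): extract HOG features from each image, then train a classifier on them. The random "images" stand in for a real labeled dataset.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Stand-in dataset: 20 random 64x64 grayscale "images" in two classes.
# In practice these would be real labeled images.
train_images = [rng.random((64, 64)) for _ in range(20)]
train_labels = [0] * 10 + [1] * 10

def extract_features(images):
    """HOG (gradient-orientation) features for each image."""
    return np.array([
        hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for img in images
    ])

clf = LinearSVC()
clf.fit(extract_features(train_images), train_labels)

# Classify a new image by extracting the same features and predicting.
new_image = rng.random((64, 64))
print(clf.predict(extract_features([new_image]))[0])
```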
OBJECT RECOGNITION BY APPEARANCE
Appearance means what an object tends to look like. For example, a football is roughly round in
shape. A classifier is trained for each class of images. Taking the example of faces looking at the
camera, every face looks similar under good light and at adequate resolution. A strategy called
sliding window computes features for each image window and presents them to a classifier. One
option is to estimate and correct the illumination in each image window; another is to build
features out of gradient orientations. To find faces of different sizes, the sweep is repeated over
larger or smaller versions of the image. The responses across scales and locations are then
post-processed to produce the final set of detections. Post-processing is needed because several
overlapping windows may each report a match for a face; to yield a single high-quality match,
these partially overlapping matches at nearby locations are combined. The result is a face
detector that can search over locations and scales.
Many objects produce much more complex patterns than faces do, because several effects can
move features around in an image of the object. These effects include:
• Foreshortening, which causes a pattern viewed at a slant to be significantly distorted.
• Aspect, which causes objects to look different when seen from different directions.
• Occlusion, when some parts are hidden from some viewing directions. Objects can also occlude
one another, and parts of an object can occlude other parts (self-occlusion).
• Deformation, where internal degrees of freedom of the object change its appearance.
An object recognizer is then a collection of features that can tell whether the pattern elements are
present and whether they are in roughly the right place. The most obvious approach is to represent
the image window with a histogram of the pattern elements that appear in it, but this does not work
particularly well, because too many patterns get confused with one another.
Each year car accidents kill about 1.2 million people; to reduce this toll, cars should be
provided with sensors that detect pedestrians, which would save many lives. The most usual
cases are lateral or frontal views of a walk. In these cases, we see either a “lollipop” shape - the
torso is wider than the legs, which are together in the stance phase of the walk - or a “scissor”
shape, where the legs are swinging in the walk. Therefore, we need to build a useful
moving-window pedestrian detector. To represent the image window, it is better to use gradient
orientations than edges, because there is not always a strong contrast between a pedestrian and the
background. Pedestrians can move their arms and legs around, so we should use a histogram to
suppress some spatial detail in the feature. The window is broken up into (possibly overlapping)
cells, and an orientation histogram is built in each cell. Through this feature, we can determine
whether the head-and-shoulders curve is at the top of the window or at the bottom, but the feature
will not change if the head moves slightly.
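A sketch of such a histogram-of-oriented-gradients pedestrian detector, using OpenCV's built-in pretrained person model (the file name is a placeholder):

```python
import cv2

img = cv2.imread("street.jpg")

# HOG descriptor with OpenCV's default pretrained pedestrian (people) detector.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

# Sweep a detection window over the image at multiple scales.
boxes, weights = hog.detectMultiScale(img, winStride=(8, 8), scale=1.05)
for (x, y, w, h) in boxes:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("pedestrians.jpg", img)
```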
Binocular Stereopsis
This idea is similar to motion parallax, but we use two or more images separated in space rather
than in time. Predators tend to have eyes in the front of the head; the overlapping fields of view
this produces are what make binocular stereopsis possible. Because a given feature in the scene
will be in a different place relative to the Z-axis of each image plane, if we superpose the two
images there will be a disparity in the location of the image feature in the two images. In terms
of the optical flow equations, this corresponds to a translation vector T acting for time δt, with
Tx = b/δt and Ty = Tz = 0, where b is the baseline between the two eyes or cameras. The
horizontal disparity is then equal to the ratio of the baseline to the depth, and the vertical
disparity is zero, i.e., H = b/Z, V = 0. Under normal viewing conditions, humans fixate on a point
in the scene, i.e., the optical axes of the two eyes intersect at that point. If a point lies at a depth
differing by δZ from the fixation depth Z, the angular disparity is δθ = b·δZ/Z². In humans, b (the
baseline distance between the eyes) is about 6 cm, so for Z = 30 cm and the smallest detectable
disparity we get δZ ≈ 0.036 mm; that is, at a distance of 30 cm humans can discriminate depths
that differ by as little as 0.036 mm.
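As a worked check of these numbers (assuming, as in the standard derivation, a smallest detectable disparity of about 5 seconds of arc):

```python
import math

b = 0.06          # baseline between the eyes, in metres (about 6 cm)
Z = 0.30          # fixation distance, in metres (30 cm)
dtheta = 5 * math.pi / (180 * 3600)   # 5 arc-seconds, converted to radians

# From disparity = b * dZ / Z**2, solve for the depth difference dZ.
dZ = dtheta * Z**2 / b
print(f"{dZ * 1000:.3f} mm")   # approximately 0.036 mm
```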
Multiple views
Most of the techniques that have been developed make use of the information available
in multiple views, even from hundreds or thousands of cameras. A few problems in
multiple views can be solved algorithmically:
• Correspondence problem: identifying features in the different images that are projections of the
same feature in the 3-D world.
• Relative orientation problem: finding the transformation between the coordinate systems fixed to
the different cameras.
• Depth estimation problem: finding the depths of various points in the world for which image
plane projections were available in at least two views.
Texture
Texture is used to estimate distances and for segmenting objects. The texture elements are also
known as texels.
The paving tiles in the example scene are identical, but they appear different in the image for two
reasons:
• differences in the distances of the texels from the camera;
• differences in the foreshortening of the texels.
Various algorithms have been developed that use the variation in the appearance of the projected
texels as a basis for estimating surface normals, but they are not as accurate as the algorithms that
use multiple views.
In reconstruction from multiple views, the problem can be restated as: given a set of corresponding
pixel coordinates in the images, find the corresponding set of camera parameters and the 3-D
coordinates of the scene points.
In a segment-based model of a person, there is an angle between segments, mainly so that the
ankles and knees can be differentiated. If there are M candidate image rectangles, the number of
possible allocations of rectangles to segments grows as O(M^6) for a model with six segments, so
finding the best allocation by exhaustive search is slow. However, this can be solved using various
speed-ups that are available for an appropriate choice of ψ. This model is usually known as the
pictorial structure model.
We generally have to build a model of segment appearances when we do not know in advance what
a particular person looks like. An appearance model is a description of what a person looks like.
USING VISION
If vision systems could analyze video and understand what people are doing, we would be able to: design
buildings and public places better by collecting and using data about what people do in public; build more
accurate, more secure, and less intrusive surveillance systems; build computer sports commentators; and
build human-computer interfaces that watch people and react to their behavior.
Let us consider a vision system for an automated vehicle driving on a freeway. The tasks faced by the
driver include the following:
1. Lateral control—ensure that the vehicle remains securely within its lane or changes lanes smoothly
when required.
2. Longitudinal control—ensure that there is a safe distance to the vehicle in front.
3. Obstacle avoidance—monitor vehicles in neighboring lanes and be prepared for evasive maneuvers if
one of them decides to change lanes.
The problem for the driver is to generate appropriate steering, acceleration, and braking actions to best
accomplish these tasks.
For lateral control, one needs to maintain a representation of the position and orientation of the car
relative to the lane. We can use edge-detection algorithms to find edges corresponding to the lane-marker
segments. We can then fit smooth curves to these edge elements. The parameters of these curves carry
information about the lateral position of the car, the direction it is pointing relative to the lane, and the
curvature of the lane. This information, along with information about the dynamics of the car, is all that is
needed by the steering-control system. If we have good detailed maps of the road, then the vision system
serves to confirm our position (and to watch for obstacles that are not on the map).
For longitudinal control, one needs to know the distances to the vehicles in front. This can be accomplished
with binocular stereopsis or optical flow. Using these techniques, vision-controlled cars can now drive
reliably at highway speeds.
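As a sketch of depth from binocular stereopsis (assuming OpenCV; the two file names are placeholders for a rectified left/right image pair):

```python
import cv2

left = cv2.imread("left.jpg", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.jpg", cv2.IMREAD_GRAYSCALE)

# Block-matching stereo: larger disparity means the point is closer to the camera.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right)

# Depth is proportional to (focal length * baseline) / disparity,
# so the disparity map can be converted into distances to vehicles in front.
vis = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype("uint8")
cv2.imwrite("disparity.png", vis)
```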