Synopsis of M. Tech. Thesis On Face Detection Using Neural Network in Matlab by Lalita Gurjari
DATE: 25-09-2013
INTRODUCTION
The goal of my thesis is to show that the face detection problem can be solved efficiently and accurately using a view-based approach implemented with artificial neural networks. Specifically, I will demonstrate how to detect upright, tilted, and non-frontal faces in cluttered grayscale images, using multiple neural networks whose outputs are arbitrated to give the final output. Object detection is an important and fundamental problem in computer vision, and there have been many attempts to address it. The techniques which have been applied can be broadly classified into one of two approaches: matching two- or three-dimensional geometric models to images [Seutens et al., 1992, Chin and Dyer, 1986, Besl and Jain, 1985], or matching view-specific image-based models to images. Previous work has shown that view-based methods can effectively detect upright frontal faces and eyes in cluttered backgrounds [Sung, 1996, Vaillant et al., 1994, Burel and Carel, 1994]. This thesis implements the view-based approach to object detection using neural networks, and evaluates this approach in the face detection domain.
In developing a view-based object detector that uses machine learning, three main subproblems arise. First, images of objects such as faces vary considerably with lighting, occlusion, pose, facial expression, and identity. The detection algorithm should explicitly deal with as many of these sources of variation as possible, leaving little unmodelled variation to be learned. Second, one or more neural networks must be trained to handle all remaining variation in distinguishing objects from non-objects. Third, the outputs from multiple detectors must be combined into a single decision about the presence of an object.
The automatic recognition of human faces presents a significant challenge to the pattern recognition research community: human faces are very similar in structure, with only minor differences from person to person, so all faces effectively belong to a single class. Furthermore, changes in lighting conditions, facial expressions, and pose variations make face recognition one of the difficult problems in pattern analysis. Earlier work has proposed the idea that faces can be recognized using line edge detection, together with a face pre-filtering technique to speed up the searching process; it is an encouraging finding that such techniques have performed better than most existing methods in comparison experiments.
This thesis describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. As continual research is being conducted in the area of computer vision, one of the most practical applications under vigorous development is the construction of a robust real-time face detection system. Successfully constructing a real-time face detection system not only implies a system capable of analyzing video streams, but also naturally leads to relaxing the extremely constrained testing environments that many systems require. Analyzing a video sequence is the current challenge, since faces are constantly in dynamic motion, presenting many different possible rotations and illumination conditions. While solutions to the task of face detection have been presented, the detection performance of many systems depends heavily on a strictly constrained environment, and the problem of detecting faces under gross variations remains largely unsolved. This work presents a face detection system which uses an image-based neural network to detect face images.
Face-to-face communication is a real-time process operating at a short time scale. The level of uncertainty at this time scale is considerable, making it necessary for humans and machines to rely on sensory-rich perceptual primitives rather than slow symbolic inference processes. Because of real-time bandwidth and environmental constraints, video processing has to deal with much lower resolution and image quality than photograph processing. Video images can be acquired easily and capture the motion of a person, making it possible to track people until they are in a position convenient for recognition. The face is the most distinctive and widely used key to a person's identity. The area of face detection has attracted considerable attention in the advancement of human-machine interaction, as it provides a natural and efficient way to communicate between humans and machines. The problem of detecting faces and facial parts in image sequences has become a popular area of research due to emerging applications in intelligent human-computer interfaces, surveillance systems, content-based image retrieval, video conferencing, financial transactions, forensic applications, pedestrian detection, image database management, and so on. Face detection is essentially localising and extracting a face region from the background. This may seem like an easy task, but
the human face is a dynamic object with a high degree of variability in its appearance, which makes face detection a difficult problem in computer vision.
Overview of the Matlab Environment
The name MATLAB stands for "matrix laboratory"; it was originally written to provide easy access to matrix software developed by the LINPACK and EISPACK projects. Today, MATLAB engines incorporate the LAPACK and BLAS libraries, embedding the state of the art in software for matrix computation. MATLAB is an interactive, matrix-based system for scientific and engineering numeric computation and visualization. Its basic data element is an array that does not require dimensioning. It is used to solve many technical computing problems, especially those with matrix and vector formulations, in a fraction of the time it would take to write a program in a scalar, non-interactive language such as C.
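As a small illustration of the matrix and vector formulation mentioned above, the following MATLAB snippet normalizes a grayscale image treated as a single matrix, with no explicit loops over pixels; the file name is only a placeholder.

```matlab
% Read a grayscale image; the file name is only a placeholder.
I = double(imread('test_face.pgm'));

% Normalize the whole image to zero mean and unit variance with
% matrix/vector operations -- no explicit loops over pixels.
I = (I - mean(I(:))) / std(I(:));

% Written element by element in a scalar language such as C, the same
% operation would require two nested loops over every pixel.
```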
Matlab Environment
The software section is completely based on MATLAB. In our interface we have used MATLAB for face recognition: it matches the face against a predefined database and generates an event. This event is used to control the device by giving the controller an input to control the output, and thus controls the door. While some may regard face detection as simple pre-processing for a face recognition system, it is by far the most important process in a face detection and recognition system. However, face recognition is not the only possible application of a fully automated face detection system. There are applications in automated colour film development, where information about the exact face location is useful for determining exposure and colour levels during film development. There are even uses in face tracking for automated camera control in the film and television news industries.
In this project the author will attempt to detect faces in still images by using image invariants. To do this it is useful to study the grey-scale intensity distribution of an average human face. The following 'average human face' was constructed from a sample of 30 frontal-view human faces, of which 12 were from females and 18 from males. A suitably scaled colormap has been used to highlight grey-scale intensity differences. The grey-scale differences, which are invariant across all the sample faces, are strikingly apparent: the eye-eyebrow area always seems to contain dark (low) grey-levels, while the nose, forehead and cheeks contain bright (high) grey-levels. After a great deal of experimentation, the researcher found that the following areas of the human face were suitable for a face detection system based on image invariants and a deformable template.
Most face detection systems attempt to extract a fraction of the whole face, thereby eliminating most of the background and other areas of an individual's head, such as hair, that are not necessary for the face recognition task. With static images, this is often done by running a 'window' across the image. The face detection system then judges whether a face is present inside the window (Brunelli and Poggio, 1993). Unfortunately, with static images there is a very large search space of possible locations of a face in an image: faces may be large or small and be positioned anywhere from the upper left to the lower right of the image. Most face detection systems use an example-based learning approach to decide whether or not a face is present in the window at a given instant (Sung and Poggio, 1994; Sung, 1995). A neural network or some other classifier is trained using supervised learning with 'face' and 'non-face' examples, thereby enabling it to classify an image (the window in a face detection system) as a 'face' or 'non-face'. Unfortunately, while it is relatively easy to find face examples, how would one find a representative sample of images which represent non-faces (Rowley et al., 1996)? Therefore, face detection systems using example-based learning need thousands of 'face' and 'non-face' images for effective training. Rowley, Baluja, and Kanade (Rowley et al., 1996) used 1025 face images and 8000 non-face images (generated from 146,212,178 sub-images) for their training set. There is another technique for determining whether there is a face inside the face detection system's window: template matching. The difference between a fixed target pattern (face) and the window is computed and thresholded. If the window contains a pattern which is close to the target pattern (face), then the window is judged as containing a face.
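As a rough illustration of the windowing and template-matching ideas described above, the sketch below slides a 20x20 window across a grayscale image and thresholds its squared difference to a fixed face template. The image, the template, and the threshold value are placeholders for illustration, not the settings used in this thesis.

```matlab
% Minimal sliding-window template matching sketch (all inputs assumed).
img      = double(imread('scene.pgm'));     % grayscale test image (placeholder)
template = double(imread('face20.pgm'));    % 20x20 average-face template (placeholder)
thresh   = 1500;                            % distance threshold (assumed value)

[h, w] = size(img);
hits = [];                                  % rows: [row, col] of candidate face windows
for r = 1:h-19
    for c = 1:w-19
        win = img(r:r+19, c:c+19);              % current 20x20 window
        d   = sum((win(:) - template(:)).^2);   % squared difference to the template
        if d < thresh                           % close to the target pattern -> face
            hits(end+1, :) = [r, c];            %#ok<AGROW>
        end
    end
end
```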
System overview
The face detection system is designed as shown in Fig. 1.
1. Skin color filter (preprocessing) - The first step in preprocessing a color image consists of passing it through a skin color filter that detects the skin pixels. This is used to discard many of the pixels in the case of color images, thus reducing the number of comparisons between the window and the image.
2. Filtering the image - This part consists of repeatedly applying a mask of 20x20 pixels to the preprocessed image. The mask is to some degree invariant to rotation and scale.
3. Multilayer Perceptron (MLP) - The prenetwork is a single multilayer perceptron (MLP). This is a neural network with input units, hidden units, and one output neuron (the output neuron is responsible for outputting either a face or a non-face). The prenetwork is trained using backpropagation. This filter eliminates many of the pixels to be considered in the comparison and is applied directly to grayscale images. For color images, the output of the skin filter is fed to the MLP.
4. Detection - The output of the neural network varies between 1 and -1 according to whether a face has been detected or not, respectively.
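The four stages listed above can be tied together roughly as in the following MATLAB sketch. The helper functions skinFilter and preprocessWindow and the trained network object net are hypothetical names used only to show the flow of data between the stages.

```matlab
% High-level data flow of the detector (helper names are hypothetical).
rgb  = imread('input.jpg');                    % placeholder input image
mask = skinFilter(rgb);                        % 1. skin color preprocessing (hypothetical helper)
gray = double(rgb2gray(rgb));

detections = [];
for r = 1:size(gray,1)-19
    for c = 1:size(gray,2)-19
        if ~any(any(mask(r:r+19, c:c+19)))     % skip windows containing no skin pixels
            continue
        end
        win = preprocessWindow(gray(r:r+19, c:c+19));  % 2. 20x20 mask, lighting correction (hypothetical helper)
        y   = net(win(:));                     % 3. MLP output in the range [-1, 1]
        if y > 0                               % 4. positive output -> face candidate
            detections(end+1, :) = [r, c];     %#ok<AGROW>
        end
    end
end
```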
Implementation methods
1 Skin color filter
Detection of skin color in color images is a very popular and useful technique for face detection. Many techniques have been reported for locating skin color regions in the input image. While the input color image is typically in the RGB format, these techniques usually use color components in another color space, such as HSV. This is because the RGB components are sensitive to lighting conditions, so face detection may fail if the lighting condition changes.
The first step in preprocessing a color image consists of passing it through a skin color filter that detects the skin pixels. This is used to discard many of the pixels in the case of color images, thus reducing the number of comparisons between the window and the image. The first step in designing a skin color filter consists of converting the image from RGB to HSV, where H stands for Hue, S for Saturation and V for Value; this reduces the effect of illumination. H, S, and V are continuous values varying between 0 and 1. Quantization is applied to both H and S to get discrete values, with 10 levels used for each of these two parameters. Then a color histogram is formed over (H, S): for each (H, S) pair, the corresponding number of pixels is determined. This gives the first condition for a pixel to be a skin pixel: the color histogram value at (H, S) is compared to a skin threshold (determined empirically). If it is greater, the pixel can be classified as a skin pixel, provided it also satisfies the second condition (discussed below); otherwise, the pixel is rejected as a non-skin pixel. The second condition consists of comparing the edge value at each pixel with an edge threshold (determined empirically as well). The edge value is obtained by computing the gradient image using the Sobel operator, which is useful for detecting edges in the image. If the computed edge at the pixel is less than the threshold, and the first condition has been satisfied, the pixel is classified as a skin pixel (and set to white); otherwise, it is set to black. The algorithm of the skin color filter can be summarized as follows:
1. Transform the image from RGB to HSV.
2. Compute the HSV values for each pixel and the color histogram (H, S).
3. Compute the gradient of the image using the Sobel operator.
4. If (color histogram (H, S) > skin threshold and edge (x, y) < edge threshold), then Pixel (x, y) = 1 (white), a skin pixel; otherwise Pixel (x, y) = 0 (black), a non-skin pixel.
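As an illustration, a minimal MATLAB sketch of the skin color filter algorithm summarized above is given below. The skin histogram and the two thresholds are placeholders; as noted earlier, the thresholds would have to be determined empirically.

```matlab
% Skin color filter sketch; the skin histogram and thresholds are placeholders.
rgb = im2double(imread('input.jpg'));          % placeholder color image
hsv = rgb2hsv(rgb);                            % 1. transform from RGB to HSV
H = hsv(:,:,1);  S = hsv(:,:,2);

% 2. Quantize H and S to 10 levels each and look up the (H, S) color histogram.
Hq = min(floor(H*10) + 1, 10);
Sq = min(floor(S*10) + 1, 10);
skinHist = ones(10, 10);                       % placeholder for the learned skin histogram
colorScore = skinHist(sub2ind([10 10], Hq, Sq));

% 3. Gradient magnitude using the Sobel operator (computed on the gray image here).
hx = fspecial('sobel');  hy = hx';
gray = rgb2gray(rgb);
edgeMag = sqrt(imfilter(gray, hx).^2 + imfilter(gray, hy).^2);

% 4. A pixel is skin (white) if its color score is high and its edge value is low.
skinThreshold = 0.5;  edgeThreshold = 0.2;     % assumed empirical thresholds
skinMask = (colorScore > skinThreshold) & (edgeMag < edgeThreshold);
```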
2 Multilayer Perceptron (MLP)
The prenetwork is a single multilayer perceptron (MLP). This is a neural network with input units, hidden units, and one output neuron (the output neuron is responsible for outputting either a face or a non-face). The prenetwork is trained using back-propagation. This filter eliminates many of the pixels to be considered in the comparison and is applied directly to grayscale images. For color images, the output of the skin filter is fed to the MLP.
3 Detection
3.1 Filtering the image
This part consists of repeatedly applying a mask of 20x20 pixels to the preprocessed image. The mask is to some degree invariant to rotation and scale. The output of this operation (the output of the neural network) varies between 1 and -1 according to whether a face has been detected or not, respectively. If the face in the original (preprocessed) image is larger than the window size, the image is sub-sampled (i.e., its size is reduced) and the filter is applied to the image at each size until the face fits the mask.
First step: At each step, a further processing of the image is done to correct its illumination. This is done by first creating a function that varies linearly with the intensity inside the window. More precisely, the function varies linearly inside an oval in the window, and the outer contour is black to discard the background pixels. This transformed version of the image is then subtracted from the original one. Once this lighting correction has been done, histogram equalization is applied to the image to emphasize its contrast. As before, equalization is done only in the oval part of the window. This ensures that all images have the same properties regardless of the conditions under which they were taken and of the type of camera used.
Second step: The extracted window is fed to the input layer of the neural network that determines whether the image contains a face or not. The hidden layers of the network consist of three types of units, with each type being specialized in one task. The first type is a set of four receptive fields (hidden units) that are responsible for detecting features such as the individual eyes, the nose, and the corners of the mouth; these units look at 10x10 pixel regions. The second category consists of 16 units that look at 5x5 pixel regions and have the same job as those described above. The third type consists of 6 units that look at 20x5 pixel regions and are responsible for detecting the mouth and the pair of eyes; this is possible since these units are horizontal.
Third step: If the output of the network is 1, a face is detected; the opposite occurs for an output of -1.
In order to train the system, a set of face and non-face images was used. Features such as the eyes, the nose and the mouth were labeled, and the images were scaled and rotated using the following algorithm:
1. Initialize F, a vector that will hold the average positions of each labeled feature over all the faces, with the feature locations of the first face, F1.
2. Rotate, translate, and scale the feature coordinates in F so that the average locations of the eyes appear at predetermined positions in a 20x20 pixel window.
3. For each face i, compute the best rotation, translation, and scaling to align the face's features Fi with the average feature locations F. Such transformations can be written as a linear function of their parameters, so we can write a system of linear equations mapping the features from Fi to F. The least-squares solution to this over-constrained system yields the parameters for the best alignment transformation. Call the aligned feature locations F'i.
4. Update F by averaging the aligned feature locations F'i over all faces i.
5. Go to step 2.
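The least-squares alignment in step 3 above can be written out explicitly. The sketch below solves for the rotation, scale, and translation that best map one set of labeled feature points Fi onto the average locations F; the example coordinates are invented purely for illustration.

```matlab
% Best rotation, scale, and translation mapping feature points Fi onto F,
% solved as an over-constrained linear system in the parameters [a b tx ty]
% of the transform  x' = a*x - b*y + tx,  y' = b*x + a*y + ty.
Fi = [5 7; 14 7; 10 14];          % labeled feature points of face i (invented example)
F  = [6 6; 15 6; 10 15];          % average feature locations (invented example)

n   = size(Fi, 1);
A   = zeros(2*n, 4);
rhs = zeros(2*n, 1);
for k = 1:n
    x = Fi(k,1);  y = Fi(k,2);
    A(2*k-1, :) = [x, -y, 1, 0];   rhs(2*k-1) = F(k,1);
    A(2*k,   :) = [y,  x, 0, 1];   rhs(2*k)   = F(k,2);
end
p = A \ rhs;                      % least-squares solution [a; b; tx; ty]

% Apply the transform to obtain the aligned feature locations F'i.
Fa = [p(1)*Fi(:,1) - p(2)*Fi(:,2) + p(3), ...
      p(2)*Fi(:,1) + p(1)*Fi(:,2) + p(4)];
```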
The selection of non-face images during training is done as follows:
1. Create an initial set of non-face images by generating 1000 random images. Apply the preprocessing steps to each of these images.
2. Train a neural network to produce an output of 1 for the face examples, and -1 for the non-face examples. The training algorithm is standard error backpropagation with momentum. After the first iteration, the weights computed by training in the previous iteration are used as the starting point.
3. Run the system on an image of scenery which contains no faces. Collect the sub-images in which the network incorrectly identifies a face (an output activation > 0).
4. Select up to 250 of these sub-images at random, apply the preprocessing steps, and add them into the training set as negative examples. Go to step 2.
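A rough MATLAB sketch of this bootstrapping procedure, using the Neural Network Toolbox, is given below. The training matrices, the network size, and the scenery image are assumptions, and scanForFalsePositives is a hypothetical helper standing in for running the detector over a scenery image.

```matlab
% Bootstrap training sketch (data, network size, and scan helper are assumed).
faces    = rand(400, 500);              % placeholder for 20x20 face windows stored as columns
nonfaces = rand(400, 1000);             % 1. initial set of 1000 random nonface images
net = feedforwardnet(20, 'traingdm');   % MLP trained with backpropagation and momentum

for iter = 1:5
    X = [faces, nonfaces];
    T = [ones(1, size(faces, 2)), -ones(1, size(nonfaces, 2))];
    net = train(net, X, T);             % 2. retraining continues from the previous weights

    % 3. Run on a scenery image with no faces and collect windows the network
    %    wrongly accepts (output > 0); scanForFalsePositives is hypothetical.
    fp = scanForFalsePositives(net, imread('scenery.jpg'));

    % 4. Add up to 250 of these windows, chosen at random, as negative examples.
    fp = fp(:, randperm(size(fp, 2), min(250, size(fp, 2))));
    nonfaces = [nonfaces, fp];
end
```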
3.2 Merging overlapping detections and arbitration
First step: merging overlapping detections. Most faces are detected at many nearby locations, and therefore the final detection of the face consists of taking all these detections and combining them to find the true position of the face in the image. For each location found, the number of nearby detections is determined and compared to a given threshold. If the number of detections is greater than the threshold, a face is correctly detected; otherwise, it is a false detection. The location of the final detection is given by the centroid of all the nearby detections, which allows the different detections to be merged into a single final detection.
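A rough MATLAB sketch of this merging step is shown below; the list of candidate detections, the neighborhood radius, and the count threshold are invented values used only for illustration.

```matlab
% Merge nearby detections into single faces (all values are invented examples).
dets    = [30 40; 31 41; 32 40; 30 42; 80 120];  % [row, col] candidate detections
radius  = 10;                                    % neighborhood radius in pixels (assumed)
minHits = 3;                                     % detections needed to accept a face (assumed)

faces = [];
for i = 1:size(dets, 1)
    d = sqrt(sum((dets - dets(i,:)).^2, 2));     % distances to every detection
    nearby = dets(d < radius, :);
    if size(nearby, 1) >= minHits                % enough nearby detections -> true face
        faces(end+1, :) = mean(nearby, 1);       %#ok<AGROW>  centroid of the cluster
    else
        % too few nearby detections -> treated as a false detection
    end
end
faces = unique(round(faces), 'rows');            % collapse duplicate centroids
```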
Once a face has been detected using the above approach, all the other detections are considered errors and are discarded. The only detected part kept from the image is the one with a high enough number of detections within a small neighborhood.
Second step: arbitration among multiple networks. The above step is helpful in reducing the number of false detections (also called false positives). To reduce this number even further, a second step can be added, which consists of applying several networks and arbitrating between their outputs. Each detection at a particular position and scale is saved in an output pyramid, and the outputs of the different pyramids are ANDed together. When the outputs are ANDed, the detected part of an image will be classified as a face only if both networks agree on it. Since it is rare that two networks will misclassify a face in the same way, this strategy is helpful in decreasing the number of false detections. However, it might reject a correctly identified face if only one of the two networks detects it.
RESULTS
The training set consists of frontal faces that are only roughly aligned. The alignment was done by having a person place a bounding box around each face just above the eyebrows and about half-way between the mouth and the chin; this bounding box was then enlarged by 50%, cropped, and scaled to 20x20 pixels. By observing the performance of this face detector on a number of test images, a few different failure modes were noticed. The face detector was trained on frontal, upright faces. The faces were only very roughly aligned, so there is some variation in rotation both in plane and out of plane. Informal observation suggests that the face detector can detect faces that are tilted up to about 15 degrees in plane and about 45 degrees out of plane (toward a profile view); the detector becomes unreliable with more rotation than this. It was also noticed that harsh backlighting, in which the faces are very dark while the background is relatively light, sometimes causes failures. It is interesting to note that using a nonlinear variance normalization based on robust statistics to remove outliers improves the detection rate in this situation. Finally, this face detector fails on significantly occluded faces: if the eyes are occluded, for example, the detector will usually fail. The mouth is not as important, so a face with a covered mouth will usually still be detected.
REFERENCES
[1] Milan Sonka, Vaclav Hlavac, Roger Boyle, Image Processing, Analysis and Machine Vision, Tata McGraw-Hill.
[2] H. A. Rowley, S. Baluja, and T. Kanade, "Neural Network-Based Face Detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23-38, 1998.
[3] Henry A. Rowley, Shumeet Baluja, Takeo Kanade, "Rotation Invariant Neural Network-Based Face Detection", 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'98), p. 38, 1998.
[4] H. A. Rowley, S. Baluja, and T. Kanade, "Neural Network-Based Face Detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23-38, 1998.
[5] http://engineeringprojects101.blogspot.in/2012/07/face-detection-using-matlab.html
[6] Goldstein, A. J., Harmon, L. D., and Lesk, A. B., "Identification of human faces", Proc. IEEE, 59, pp. 748-760, 1971.
[7] Nakamura, O., Mathur, S., and Minami, T., "Identification of human faces based on isodensity maps", Pattern Recognition, Vol. 24(3), pp. 263-272, 1991.

(Signature of Candidate)
Remarks of Supervisor:
(Signature of Supervisor )