Exploring Apple's VISION Framework
Note: The project in this tutorial was built using Swift 5 in Xcode 12 beta, so you need Xcode 12 beta to run it. You don't need a physical device (cheers :-)) as the project was built to run in the simulator.
The app we are going to build is very simple. It allows users to select an image from the iPhone's gallery, and as soon as the image loads we detect the hand and body pose landmarks using the Vision framework.
Now, enough with the theory, let's dive into the coding.
Getting Started
First, open Xcode 12 beta and create a new project. Select the App template under iOS (previously "Single View App"), give the application a name on the next page, and make sure the interface is set to "Storyboard" and Swift is selected as the programming language.
We will keep the UI simple and give more priority to the Vision part. So let's begin! Go to Main.storyboard and pick an image view from the UI controls library. Drag and drop the image view into our view controller. Set the leading and trailing constraints to zero, center it vertically in the safe area, and set the height of the image view to 400. This makes the image view sit in the center of the view, no matter the size of the device.
Now, to access the phone's photo library we will need a button action, so add a button at the bottom of the view controller and name it "Select Photo" in the attributes inspector. Set the vertical spacing between the button and the image view to 100 and center it horizontally in the safe area.
Next, drag and drop two switch controls into the view controller, select both of them, and embed them in a stack view. These switch controls allow the user to choose between hand pose and body pose. Set the leading and trailing constraints of the stack view and set the vertical space between the stack view and the image view to 140. This pins the switch controls to the image view. Now add a label below each switch and add constraints to these labels. Finally, go to the attributes inspector of each label and set the texts to "Hand" and "Body" respectively. This completes the UI for our app. The storyboard now looks pretty much like this:
Implementing the photo library function
Now that we have set everything in the storyboard, it's time to write some code to use these elements. Go to ViewController.swift and adopt the UINavigationControllerDelegate protocol, which is required by the UIImagePickerController class.
Now, add three new IBOutlets and name them imageView, handSwitch and bodySwitch for the image view and the two switch controls respectively. Next, create the IBAction method for selecting an image from the gallery. The code in the view controller should look like this.
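Here is a minimal sketch of how the view controller could look at this point (the outlet and action names match the ones used later in this article):

import UIKit

class ViewController: UIViewController, UINavigationControllerDelegate {

    @IBOutlet weak var imageView: UIImageView!
    @IBOutlet weak var handSwitch: UISwitch!
    @IBOutlet weak var bodySwitch: UISwitch!

    override func viewDidLoad() {
        super.viewDidLoad()
    }

    @IBAction func selectPhotoAction(_ sender: Any) {
        let picker = UIImagePickerController()
        // The delegate needs ViewController to conform to UIImagePickerControllerDelegate
        // as well; we add that conformance in an extension next.
        picker.delegate = self
        picker.sourceType = .photoLibrary
        present(picker, animated: true, completion: nil)
    }
}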
Make sure you connect all these outlet variables and action methods to the respective controls in the storyboard. In selectPhotoAction, we create a constant of type UIImagePickerController, set its delegate to the class, and then present the UIImagePickerController to the user.
extension ViewController: UIImagePickerControllerDelegate {
    func imagePickerControllerDidCancel(_ picker: UIImagePickerController) {
        dismiss(animated: true, completion: nil)
    }
}
The code above dismisses the picker if the user cancels image selection. The extension also makes ViewController conform to UIImagePickerControllerDelegate. After this, declare two variables, imageWidth and imageHeight, of type CGFloat to hold the image width and height respectively. Also declare a pathLayer variable of type CALayer; we use this pathLayer to hold the Vision results. Your code should now look a little like this.
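Here is a sketch of the class at this stage (the initial values of the new properties are an assumption):

class ViewController: UIViewController, UINavigationControllerDelegate {

    @IBOutlet weak var imageView: UIImageView!
    @IBOutlet weak var handSwitch: UISwitch!
    @IBOutlet weak var bodySwitch: UISwitch!

    // Cached image dimensions, used later when drawing CALayer paths.
    var imageWidth: CGFloat = 0
    var imageHeight: CGFloat = 0

    // Layer that will hold the Vision results drawn on top of the image.
    var pathLayer: CALayer?

    @IBAction func selectPhotoAction(_ sender: Any) {
        let picker = UIImagePickerController()
        picker.delegate = self
        picker.sourceType = .photoLibrary
        present(picker, animated: true, completion: nil)
    }
}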
extension ViewController: UIImagePickerControllerDelegate {
    func imagePickerControllerDidCancel(_ picker: UIImagePickerController) {
        dismiss(animated: true, completion: nil)
    }
}
To access the photo library, you need to get permission from the users. So go to your Info.plist and add the entry "Privacy – Photo Library Usage Description" (the raw key is NSPhotoLibraryUsageDescription). We do this because, starting from iOS 10, you need to specify the reason why your app accesses the camera and photo library.
Now, to get the image in a standard size and orientation, we define a function that transforms the input gallery image into the required size. This function accepts a UIImage as an argument and returns the transformed image.
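A minimal sketch of such a function, assuming we scale the image to fit the screen and redraw it (redrawing also normalizes the orientation):

func scaleAndOrient(image: UIImage) -> UIImage {
    // Maximum size we want to work with (the screen, in points).
    let maxResolution = UIScreen.main.bounds.size

    let width = image.size.width
    let height = image.size.height

    // If the image already fits, just return it.
    guard width > maxResolution.width || height > maxResolution.height else {
        return image
    }

    // Scale down so that both dimensions fit the screen.
    let ratio = min(maxResolution.width / width, maxResolution.height / height)
    let newSize = CGSize(width: width * ratio, height: height * ratio)

    // Redrawing bakes in the image orientation, so the result is always .up.
    let renderer = UIGraphicsImageRenderer(size: newSize)
    return renderer.image { _ in
        image.draw(in: CGRect(origin: .zero, size: newSize))
    }
}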
Next, we define another function that prepares the pathLayer variable to hold the Vision results and displays the selected image in the UIImageView. This function takes the UIImage as its argument (here we call it show(_:)).
func show(_ image: UIImage) {
    // 1
    pathLayer?.removeFromSuperlayer()
    pathLayer = nil
    imageView.image = nil

    // 2
    let correctedImage = scaleAndOrient(image: image)

    // 3
    imageView.image = correctedImage

    guard let cgImage = correctedImage.cgImage else { return }
    let fullImageWidth = CGFloat(cgImage.width)
    let fullImageHeight = CGFloat(cgImage.height)
    let imageFrame = imageView.frame
    let widthRatio = fullImageWidth / imageFrame.width
    let heightRatio = fullImageHeight / imageFrame.height

    // 4
    let scaleDownRatio = max(widthRatio, heightRatio)

    // 5
    imageWidth = fullImageWidth / scaleDownRatio
    imageHeight = fullImageHeight / scaleDownRatio

    // 6
    let xLayer = (imageFrame.width - imageWidth) / 2
    let yLayer = imageView.frame.minY + (imageFrame.height - imageHeight) / 2
    let drawingLayer = CALayer()
    drawingLayer.bounds = CGRect(x: xLayer, y: yLayer, width: imageWidth, height: imageHeight)
    drawingLayer.anchorPoint = CGPoint.zero
    drawingLayer.position = CGPoint(x: xLayer, y: yLayer)
    drawingLayer.opacity = 0.5
    pathLayer = drawingLayer

    // 7
    self.view.layer.addSublayer(pathLayer!)
}
In the above code:
At 1, we remove the previous path and the old image from the view.
At 2, we call the previously defined func scaleAndOrient(image: UIImage) -> UIImage to transform the image.
At 3, we place the image inside the imageView.
At 4, we scale down the image according to the stricter dimension.
At 5, we cache the image dimensions to reference when drawing CALayer paths.
At 6, we prepare the pathLayer variable of type CALayer, which we defined earlier, to hold the Vision results.
At 7, we add the pathLayer to the main view.
That's it! Now we are ready to start with the actual part of this tutorial: Vision.
The first step is to create a request handler. For this, we use VNImageRequestHandler.
The next step is to create the request using VNDetectHumanHandPoseRequest. Because creating a request is computationally expensive, we create it as a lazy variable. So add the following lazy variable to the ViewController class.
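A sketch of that lazy property, assuming the completion handler is named handleDetectedHandLandmarks(request:error:) (implemented shortly); remember to add import Vision at the top of the file:

lazy var handLandmarkRequest: VNDetectHumanHandPoseRequest = {
    // The request calls back into handleDetectedHandLandmarks once results are ready.
    let request = VNDetectHumanHandPoseRequest(completionHandler: self.handleDetectedHandLandmarks)
    return request
}()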
Now we write a function that creates the requests for hand pose and body pose.
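A sketch of such a function, under the assumption that it is named createVisionRequests() (the body pose branch is left as a placeholder until we implement it):

func createVisionRequests() -> [VNRequest] {
    // Collect the requests to perform based on the switch status.
    var requests = [VNRequest]()
    if handSwitch.isOn {
        requests.append(handLandmarkRequest)
    }
    if bodySwitch.isOn {
        // A VNDetectHumanBodyPoseRequest would be appended here once body pose is implemented.
    }
    return requests
}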
In the above function, we created an array of type VNRequest. Based on the switch status, we append the corresponding request and return the requests array.
We need to send these requests to the request handler. So we write a function that fetches the above requests and passes them to the request handler.
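A minimal sketch of that function, assuming it is called performVisionRequest(image:orientation:):

func performVisionRequest(image: CGImage, orientation: CGImagePropertyOrientation) {
    // Fetch the requests based on the switch status.
    let requests = createVisionRequests()

    // Create the request handler for the selected image.
    let imageRequestHandler = VNImageRequestHandler(cgImage: image,
                                                    orientation: orientation,
                                                    options: [:])

    // Perform the requests off the main queue; the completion handlers
    // run when the results are ready.
    DispatchQueue.global(qos: .userInitiated).async {
        do {
            try imageRequestHandler.perform(requests)
        } catch {
            print("Failed to perform Vision request: \(error)")
        }
    }
}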
Once the results are fetched, they are handed over to the handleDetectedHandLandmarks() function in the completion handler of the handLandmarkRequest variable. The implementation of this function is as follows:
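A sketch of handleDetectedHandLandmarks; the parameter labels of the drawing function are assumptions:

func handleDetectedHandLandmarks(request: VNRequest?, error: Error?) {
    // 1: Handle any error produced while generating the results.
    if let nsError = error as NSError? {
        print("Hand landmark detection error: \(nsError)")
        return
    }
    // 2: Hand the results and the image bounds over to the drawing function.
    DispatchQueue.main.async {
        guard let drawLayer = self.pathLayer,
              let results = request?.results as? [VNRecognizedPointsObservation] else {
            return
        }
        self.drawFeaturesforHandPose(results: results, onImageWithBounds: drawLayer.bounds)
    }
}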
In the above function, at 1 we handle any error produced while generating the results, print it to the debugger, and return.
At 2, if there is no error, we pass the results and the image bounds to the drawFeaturesforHandPose() function.
// Inside drawFeaturesforHandPose, for each hand observation (steps 1-3,
// explained below, have already run inside this do block):
            // 4
            var indexPoints = [VNRecognizedPoint]()
            indexPoints.append(allPoints[.handLandmarkKeyIndexTIP]!)
            indexPoints.append(allPoints[.handLandmarkKeyIndexDIP]!)
            indexPoints.append(allPoints[.handLandmarkKeyIndexPIP]!)
            indexPoints.append(allPoints[.handLandmarkKeyIndexMCP]!)
            indexPoints.append(allPoints[.handLandmarkKeyWrist]!)
            // 5
            var indexCGPoints = [CGPoint]()
            var middleCGPoints = [CGPoint]()
            var littleCGPoints = [CGPoint]()
            var thumbCGPoints = [CGPoint]()
            var ringCGPoints = [CGPoint]()
            // 6
            for point in indexPoints {
                let reqPoint = CGPoint(x: point.location.x, y: point.location.y)
                indexCGPoints.append(reqPoint)
            }
            // ... the same is done for the other fingers ...
        } catch {
            print("error")
        }
    }
    CATransaction.commit()
}
At 1, we need to get a boundary for the detected observations. This helps us know where the hand is located in the image. For the Vision face detection API, we get a boundingBox property on the observations by default, but for the hand detection API there is no such property, so we write our own.
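A sketch of such a helper; the name getObservationBox(observation:) is an assumption. It simply takes the minimum and maximum of all recognized point locations, which Vision returns in normalized coordinates:

func getObservationBox(observation: VNRecognizedPointsObservation) throws -> CGRect {
    // Gather every recognized point of the hand.
    let allPoints = try observation.recognizedPoints(forGroupKey: .all)
    let locations = allPoints.values.filter { $0.confidence > 0 }.map { $0.location }

    guard let minX = locations.map({ $0.x }).min(),
          let maxX = locations.map({ $0.x }).max(),
          let minY = locations.map({ $0.y }).min(),
          let maxY = locations.map({ $0.y }).max() else {
        return .zero
    }

    // A normalized rect enclosing all landmarks, similar to Vision's boundingBox.
    return CGRect(x: minX, y: minY, width: maxX - minX, height: maxY - minY)
}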
At 2, we have to convert the observation box we get in step 1 to the image scale. To achieve this, we send the observationBox and the image bounds to the following function.
func boundingBox(forRegionOfInterest rect: CGRect, withinImageBounds bounds: CGRect) -> CGRect {
    var rect = rect
    let imageWidth = bounds.width, imageHeight = bounds.height
    // Reposition origin.
    rect.origin.x *= imageWidth
    rect.origin.x += bounds.origin.x
    rect.origin.y = (1 - rect.origin.y) * imageHeight + bounds.origin.y
    // Rescale normalized coordinates.
    rect.size.width *= imageWidth
    rect.size.height *= imageHeight
    return rect
}
At 3, for each hand, Vision returns 21 observation points (4 points for each finger and one for the wrist). Each point is accessed using a key associated with it. These points are also grouped per finger and can be accessed via group keys, which let us fetch all the points of a particular finger at once.
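For reference, the finger group keys on VNRecognizedPointGroupKey in the Xcode 12 beta API are:

.handLandmarkRegionKeyThumb
.handLandmarkRegionKeyIndexFinger
.handLandmarkRegionKeyMiddleFinger
.handLandmarkRegionKeyRingFinger
.handLandmarkRegionKeyLittleFinger

For example, to fetch all the index finger points at once (inside a do/catch or throwing context):

let indexFingerPoints = try observation.recognizedPoints(forGroupKey: .handLandmarkRegionKeyIndexFinger)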
Now, with all these points in their respective arrays, we draw lines between the points of each finger so that we can see the landmarks returned by the Vision API. For this purpose, we create variables of type CGMutablePath that hold the path between the points, and variables of type CAShapeLayer that display these lines on the view.
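As a sketch of this idea for one finger, assuming indexCGPoints already holds the index finger points converted to the pathLayer's coordinate space:

// Build a path connecting the index finger landmarks.
let indexPath = CGMutablePath()
if let first = indexCGPoints.first {
    indexPath.move(to: first)
    for point in indexCGPoints.dropFirst() {
        indexPath.addLine(to: point)
    }
}

// Display the path as a line on top of the image.
let indexLayer = CAShapeLayer()
indexLayer.path = indexPath
indexLayer.strokeColor = UIColor.red.cgColor
indexLayer.fillColor = nil
indexLayer.lineWidth = 2
pathLayer?.addSublayer(indexLayer)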
Now, with all of this done, it is time to make use of these functions. So we implement the imagePickerController(_:didFinishPickingMediaWithInfo:) method in the ViewController to process the selected image and send it to the Vision APIs.
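A sketch of how this could look, using the helper names assumed in the earlier sketches (show(_:) and performVisionRequest(image:orientation:)); where exactly the body switch gets turned off is also an assumption:

func imagePickerController(_ picker: UIImagePickerController,
                           didFinishPickingMediaWithInfo info: [UIImagePickerController.InfoKey: Any]) {
    dismiss(animated: true, completion: nil)

    guard let originalImage = info[.originalImage] as? UIImage else { return }

    // Display the image and prepare the drawing layer.
    show(originalImage)

    // Body pose is not implemented yet, so we keep the body switch turned off.
    bodySwitch.setOn(false, animated: true)

    // The scaled copy always has an .up orientation, so we can pass .up to Vision.
    guard let cgImage = scaleAndOrient(image: originalImage).cgImage else { return }
    performVisionRequest(image: cgImage, orientation: .up)
}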
Notice that we turned off the body switch, as we are yet to implement the body pose.
Now that we are done with the hand pose implementation, I request you to try the body pose implementation on your own so that you get to understand the methods we implemented. It is almost similar to the hand pose, except for the group key names, which I will provide below.
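For reference, the body pose group keys on VNRecognizedPointGroupKey in the same beta API are:

.bodyLandmarkRegionKeyFace
.bodyLandmarkRegionKeyTorso
.bodyLandmarkRegionKeyLeftArm
.bodyLandmarkRegionKeyRightArm
.bodyLandmarkRegionKeyLeftLeg
.bodyLandmarkRegionKeyRightLeg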
If you have any doubts about the implementation, download the full project from the Git repo, where I have implemented both hand pose and body pose.
This marks the end of our tutorial on implementing human hand pose and body pose.