Tutorial on iOS ARKit: Creating Drawings in the Air Using Only Your Fingers

Apple just revealed ARKit, their latest augmented reality (AR) library. While it might seem like another addition to the AR toolkit, its impact runs deeper, especially given the advancements in AR over the past few years.

ARKit tutorial illustration: Interacting with virtual objects in an iOS ARKit app

This post will guide you through building a fun ARKit example project using iOS ARKit. Imagine placing your fingers on a table as if holding a pen, tapping your thumbnail on the screen, and starting to draw. Upon finishing, you can transform your drawing into a 3D object, as illustrated in the animation below. The complete source code for this iOS ARKit example is available at GitHub.

Demonstration of our iOS ARKit sample augmented reality app being used

The Timeliness of iOS ARKit

Augmented reality is not a new concept for seasoned developers. Its initial significant development can be traced back to the time when developers gained access to individual frames from webcams. Those early AR applications were primarily focused on facial transformations. However, the novelty of turning faces into bunnies wore off quickly, and the initial hype subsided.

The missing ingredients for AR’s practicality have always been usability and immersion. Previous AR trends highlight this observation. For instance, when developers gained access to individual mobile camera frames, AR experienced a resurgence. Alongside the triumphant return of bunny transformers, a wave of apps emerged that superimposed 3D objects onto printed QR codes. However, these never gained widespread adoption because they augmented QR codes, not reality.

Then came Google Glass, a glimpse into the future. Sadly, within two years, this promising product met its demise. While critics attributed its failure to various factors, including social implications and Google’s lackluster launch, a crucial reason stands out - environmental immersion. Although Google Glass addressed usability, it remained limited to projecting 2D images onto the real world.

Tech giants like Microsoft, Facebook, and Apple took this lesson to heart. In June 2017, Apple unveiled iOS ARKit, a remarkable library that prioritized immersion. Holding a phone still poses a user experience barrier, but as Google Glass demonstrated, hardware isn’t the main obstacle.

With this significant shift, a new wave of AR excitement seems imminent. This time, it might find its niche and propel AR app development into the mainstream. This opens up Apple’s ecosystem and user base to augmented reality app development companies.

With that historical context, let’s delve into coding and experience Apple’s augmented reality firsthand!

Exploring ARKit’s Immersive Capabilities

ARKit offers two primary features: determining the camera’s location in 3D space and detecting horizontal planes. For the former, ARKit treats your phone as a camera moving within a 3D environment, anchoring virtual 3D objects to specific points. For the latter, it identifies horizontal surfaces like tables for object placement.

The technology behind this is Visual Inertial Odometry (VIO). Put simply, VIO merges camera frames with motion sensor data to track the device’s position in 3D space. This tracking is achieved by identifying high-contrast features or edge points in the image, such as the boundary between a blue vase and a white table. The relative movement of these points between frames helps estimate the device’s 3D location. Consequently, ARKit struggles when facing a featureless wall or experiencing rapid movements that result in blurry images.

Starting Your ARKit Journey in iOS

As of this writing, ARKit is integrated into the beta version of iOS 11. To begin, download iOS 11 Beta on an iPhone 6s or later and install the latest Xcode Beta. While you can initiate a new ARKit project through New > Project > Augmented Reality App, starting with the official Apple ARKit sample is more convenient for this tutorial, particularly for plane detection. It provides essential code blocks. Let’s dissect this example code and tailor it for our project.

First, determine the engine - Sprite SceneKit or Metal. The Apple ARKit example uses iOS SceneKit, Apple’s 3D engine. Next, set up a view for rendering 3D objects by adding an ARSCNView type view.

ARSCNView, a subclass of SceneKit’s SCNView, renders the live camera feed as the scene background, seamlessly aligning SceneKit space with the real world by treating the device as a moving camera.

However, ARSCNView requires an AR session object to handle camera and motion processing. So, begin by assigning a new session:

1
2
3
4
self.session = ARSession()
sceneView.session = session
sceneView.delegate = self
setupFocusSquare()

The last line incorporates a visual aid, Focus Square (provided by the sample code, not ARKit), which helps visualize plane detection status. The image below shows a focus square projected onto a table:

Focus square projected on a table using Apple ARKit

The next step is to activate the ARKit session. Restarting it each time the view appears is recommended, as past session data becomes irrelevant when user tracking is lost. Therefore, start the session in viewDidAppear:

1
2
3
4
5
override func viewDidAppear(_ animated: Bool) {
    let configuration = ARWorldTrackingSessionConfiguration()
    configuration.planeDetection = .horizontal
    session.run(configuration, options: [.resetTracking, .removeExistingAnchors])
}

This code configures the ARKit session to detect horizontal planes, currently the only option provided by Apple, although future object detection capabilities are hinted at. It then initiates the session and resets tracking.

Lastly, update the Focus Square whenever the camera’s position (device orientation or position) changes. This is achieved within the SCNView renderer delegate function, called before each 3D engine frame render:

1
2
3
func renderer(_ renderer: SCNSceneRenderer, updateAtTime time: TimeInterval) {
    updateFocusSquare()
}

At this point, running the app should display the focus square over the camera stream, actively searching for a horizontal plane. The next section will delve into plane detection and positioning the focus square accordingly.

Plane Detection in ARKit

ARKit can detect, update, and remove planes. For streamlined handling, create a dummy SceneKit node that stores the plane’s position information and a reference to the focus square. Planes are defined along the X and Z axes, with Y representing the surface normal. Keeping drawn nodes’ positions within the plane’s Y value creates the illusion of being printed on it.

Plane detection relies on ARKit’s callback functions. For instance, this function is triggered when a new plane is detected:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
var planes = [ARPlaneAnchor: Plane]()

func renderer(_ renderer: SCNSceneRenderer, didAdd node: SCNNode, for anchor: ARAnchor) {
    if let planeAnchor = anchor as? ARPlaneAnchor {
        serialQueue.async {
            self.addPlane(node: node, anchor: planeAnchor)
            self.virtualObjectManager.checkIfObjectShouldMoveOntoPlane(anchor: planeAnchor, planeAnchorNode: node)
        }
    }
}
    
func addPlane(node: SCNNode, anchor: ARPlaneAnchor) {
    let plane = Plane(anchor)
    planes[anchor] = plane
    node.addChildNode(plane)
}

...

class Plane: SCNNode {
    
    var anchor: ARPlaneAnchor
    var focusSquare: FocusSquare?
    
    init(_ anchor: ARPlaneAnchor) {
        self.anchor = anchor
        super.init()
    }
    ...
}

This function receives two parameters: anchor and node. node is a regular SceneKit node positioned at the plane’s location and orientation, but invisible due to lacking geometry. It serves as a placeholder for your custom, also invisible, plane node, which stores orientation and position data within the anchor.

This information is stored in ARPlaneAnchor using a 4x4 matrix. While a deep dive into matrices is beyond this post’s scope, envision it as a 2D array containing 4x4 floating-point numbers. By multiplying these numbers with a 3D vertex (v1) in its local space, you obtain a new vertex (v2) representing v1 in world space. For instance, if v1 = (1, 0, 0) locally and you want to position it at x = 100 in world space, v2 becomes (101, 0, 0) relative to world space. This math becomes more complex with rotations, but understanding it isn’t strictly necessary (refer to this excellent article’s relevant section for a thorough explanation).

checkIfObjectShouldMoveOntoPlane verifies if any drawings exist and whether their y-axis aligns with the newly detected plane.

Returning to updateFocusSquare() from the previous section, the goal is to keep the focus square centered on the screen while projecting it onto the nearest detected plane. The following code accomplishes this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
func updateFocusSquare() {
    let worldPos = worldPositionFromScreenPosition(screenCenter, self.sceneView)
    self.focusSquare?.simdPosition = worldPos
}

func worldPositionFromScreenPosition(_ position: CGPoint, in sceneView: ARSCNView) -> float3? {
    let planeHitTestResults = sceneView.hitTest(position, types: .existingPlaneUsingExtent)
    if let result = planeHitTestResults.first {
        return result.worldTransform.translation
    }
    return nil
}

sceneView.hitTest identifies real-world planes corresponding to a 2D screen point by projecting it onto the closest underlying plane. result.worldTransform is a 4x4 matrix encompassing the detected plane’s transform data, while result.worldTransform.translation conveniently extracts the position.

With this information, you can place 3D objects on detected surfaces based on a 2D screen point. Let’s proceed to the drawing aspect.

Drawing in AR

Drawing shapes by following a finger involves detecting the finger’s movement, placing vertices at those locations, and connecting them. While Bezier curves offer smooth connections, we’ll opt for a simplified approach. For each new finger position, a tiny, rounded box with near-zero height is placed on the detected plane, resembling a dot. Once the drawing is finished and the 3D button is selected, the height of these objects will be adjusted based on finger movement.

The PointNode class represents a point:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
let POINT_SIZE = CGFloat(0.003)
let POINT_HEIGHT = CGFloat(0.00001)

class PointNode: SCNNode {
    
    static var boxGeo: SCNBox?
    
    override init() {
        super.init()
        
        if PointNode.boxGeo == nil {
            PointNode.boxGeo = SCNBox(width: POINT_SIZE, height: POINT_HEIGHT, length: POINT_SIZE, chamferRadius: 0.001)
            
            // Setup the material of the point
            let material = PointNode.boxGeo!.firstMaterial
            material?.lightingModel = SCNMaterial.LightingModel.blinn
            material?.diffuse.contents  = UIImage(named: "wood-diffuse.jpg")
            material?.normal.contents   = UIImage(named: "wood-normal.png")
            material?.specular.contents = UIImage(named: "wood-specular.jpg")
        }
        
        let object = SCNNode(geometry: PointNode.boxGeo!)
        object.transform = SCNMatrix4MakeTranslation(0.0, Float(POINT_HEIGHT) / 2.0, 0.0)
        
        self.addChildNode(object)
        
    }
    
    . . .

}

The geometry is translated along the y-axis by half its height to ensure the object’s base rests at y = 0, appearing above the plane.

Within SceneKit’s renderer callback function, an indicator mimicking a pen tip is drawn using the same PointNode class. A point is placed at that location if drawing is enabled, or the drawing is raised into a 3D structure if 3D mode is active:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
func renderer(_ renderer: SCNSceneRenderer, updateAtTime time: TimeInterval) {

    updateFocusSquare()

    // Setup a dot that represents the virtual pen's tippoint
    if (self.virtualPenTip == nil) {
        self.virtualPenTip = PointNode(color: UIColor.red)
        self.sceneView.scene.rootNode.addChildNode(self.virtualPenTip!)
    }

    // Draw
    if let screenCenterInWorld = worldPositionFromScreenPosition(self.screenCenter, self.sceneView) {
            
        // Update virtual pen position
        self.virtualPenTip?.isHidden = false
        self.virtualPenTip?.simdPosition = screenCenterInWorld

        // Draw new point
        if (self.inDrawMode && !self.virtualObjectManager.pointNodeExistAt(pos: screenCenterInWorld)){
            let newPoint = PointNode()
            self.sceneView.scene.rootNode.addChildNode(newPoint)
            self.virtualObjectManager.loadVirtualObject(newPoint, to: screenCenterInWorld)
        }
            
        // Convert drawing to 3D
        if (self.in3DMode ) {
            if self.trackImageInitialOrigin != nil {
                DispatchQueue.main.async {
                    let newH = 0.4 *  (self.trackImageInitialOrigin!.y - screenCenterInWorld.y) / self.sceneView.frame.height
                    self.virtualObjectManager.setNewHeight(newHeight: newH)
                }
            }
            else {
                self.trackImageInitialOrigin = screenCenterInWorld
            }
        }
            
    }

virtualObjectManager manages the drawn points. In 3D mode, the difference from the previous position is calculated to adjust the height of all points.

Currently, drawing occurs on the detected surface, assuming the virtual pen is at the screen center. Now, let’s make it interactive by incorporating finger detection.

Detecting Fingertips in AR

Apple’s iOS 11 introduces the Vision Framework, offering efficient computer vision techniques. We’ll leverage its object tracking for this tutorial. Object tracking involves providing an image with the target object’s bounding box coordinates, initializing tracking, and then feeding in new images with the object’s movement. The library will then return the object’s updated location.

Here’s a little trick: Ask the user to position their hand on the table as if holding a pen, ensuring their thumbnail faces the camera. Then, they should tap their thumbnail on the screen. Two points to note: the thumbnail’s contrast against the skin and table should provide sufficient features for tracking, and since both the hand and table are on the same plane, projecting the thumbnail’s 2D location into 3D will approximate the finger’s position.

The image below illustrates potential feature points detectable by the Vision library:

iOS ARKit Feature points detected by the Vision library

Initialize thumbnail tracking within a tap gesture:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
// MARK: Object tracking
    
fileprivate var lastObservation: VNDetectedObjectObservation?
var trackImageBoundingBox: CGRect?
let trackImageSize = CGFloat(20)
    
@objc private func tapAction(recognizer: UITapGestureRecognizer) {
        
    lastObservation = nil
    let tapLocation = recognizer.location(in: view)
        
    // Set up the rect in the image in view coordinate space that we will track
    let trackImageBoundingBoxOrigin = CGPoint(x: tapLocation.x - trackImageSize / 2, y: tapLocation.y - trackImageSize / 2)
    trackImageBoundingBox = CGRect(origin: trackImageBoundingBoxOrigin, size: CGSize(width: trackImageSize, height: trackImageSize))
        
    let t = CGAffineTransform(scaleX: 1.0 / self.view.frame.size.width, y: 1.0 / self.view.frame.size.height)
    let normalizedTrackImageBoundingBox = trackImageBoundingBox!.applying(t)
        
    // Transfrom the rect from view space to image space
    guard let fromViewToCameraImageTransform = self.sceneView.session.currentFrame?.displayTransform(withViewportSize: self.sceneView.frame.size, orientation: UIInterfaceOrientation.portrait).inverted() else {
        return
    }
    var trackImageBoundingBoxInImage =  normalizedTrackImageBoundingBox.applying(fromViewToCameraImageTransform)
    trackImageBoundingBoxInImage.origin.y = 1 - trackImageBoundingBoxInImage.origin.y   // Image space uses bottom left as origin while view space uses top left
        
    lastObservation = VNDetectedObjectObservation(boundingBox: trackImageBoundingBoxInImage)
        
}

The challenge lies in converting the tap location from UIView coordinates to image coordinates. ARKit provides displayTransform for image to viewport conversion, but not the inverse. The solution? Using the matrix inverse.

Within the renderer, feed in new images to track finger movement:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
func renderer(_ renderer: SCNSceneRenderer, updateAtTime time: TimeInterval) {

    // Track the thumbnail
    guard let pixelBuffer = self.sceneView.session.currentFrame?.capturedImage,
        let observation = self.lastObservation else {
             return
    }
    let request = VNTrackObjectRequest(detectedObjectObservation: observation) { [unowned self] request, error in
        self.handle(request, error: error)
    }
    request.trackingLevel = .accurate
    do {
        try self.handler.perform([request], on: pixelBuffer)
    }
    catch {
        print(error)
    }

    . . .
}

Upon completion, the object tracking triggers a callback function to update the thumbnail’s location, essentially reversing the tap recognizer code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
fileprivate func handle(_ request: VNRequest, error: Error?) {
    DispatchQueue.main.async {
        guard let newObservation = request.results?.first as? VNDetectedObjectObservation else {
            return
        }
        self.lastObservation = newObservation
                        
        var trackImageBoundingBoxInImage = newObservation.boundingBox
            
        // Transfrom the rect from image space to view space
        trackImageBoundingBoxInImage.origin.y = 1 - trackImageBoundingBoxInImage.origin.y
        guard let fromCameraImageToViewTransform = self.sceneView.session.currentFrame?.displayTransform(withViewportSize: self.sceneView.frame.size, orientation: UIInterfaceOrientation.portrait) else {
            return
        }
        let normalizedTrackImageBoundingBox = trackImageBoundingBoxInImage.applying(fromCameraImageToViewTransform)
        let t = CGAffineTransform(scaleX: self.view.frame.size.width, y: self.view.frame.size.height)
        let unnormalizedTrackImageBoundingBox = normalizedTrackImageBoundingBox.applying(t)
        self.trackImageBoundingBox = unnormalizedTrackImageBoundingBox
            
        // Get the projection if the location of the tracked image from image space to the nearest detected plane
        if let trackImageOrigin = self.trackImageBoundingBox?.origin {
            self.lastFingerWorldPos = self.virtualObjectManager.worldPositionFromScreenPosition(CGPoint(x: trackImageOrigin.x - 20.0, y: trackImageOrigin.y + 40.0), in: self.sceneView)
        }            
    }
}

Finally, use self.lastFingerWorldPos instead of the screen center for drawing.

The Future of ARKit

This post showcased AR’s immersive potential through finger interaction and real-world object integration. Advancements in computer vision and AR-focused hardware like depth cameras will further enhance our ability to interact with the 3D world.

While not yet publicly available, Microsoft’s Hololens demonstrates a serious commitment to AR. This device combines specialized hardware with advanced 3D environment recognition. While the future of AR unfolds, you can be a part of it by developing captivating and immersive augmented reality experiences. Just do us all a favor and explore creative possibilities beyond turning objects into bunnies!