How ARKit 2 works, and why Apple is so focused on AR
Apple is investing in AR today, even though the killer use case isn’t here yet.
Augmented reality (AR) has played prominently in nearly all of Apple’s events since iOS 11 was introduced, Tim Cook has said he believes it will be as revolutionary as the smartphone itself, and AR was Apple’s biggest focus in sessions with developers at WWDC this year.
But why? Most users don’t think the killer app for AR has arrived yet—unless you count Pokémon Go. The use cases so far are cool, but they’re not necessary, and they’re arguably a lot less cool on an iPhone or iPad screen than they would be if you had glasses or contacts that did the same things.

From this year’s WWDC keynote to Apple’s various developer sessions hosted at the San Jose Convention Center and posted online for everyone to view, though, it’s clear that Apple is investing heavily in augmented reality for the future.
We’re going to comb through what Apple has said about AR and ARKit this week, go over exactly what the toolkit does and how it works, and speculate about the company’s strategy—why Apple seems to care so much about AR, and why it thinks it’s going to get there first in a coming gold rush.
What ARKit is and how it works
Let’s start with exactly what ARKit is and does. We are going to thoroughly review the high-level features and purposes of the toolkit. If you want even more detail, Apple has made talks and documentation on the subject available on its developer portal.
The simplest, shortest explanation of ARKit is that it does a lot of the heavy lifting for app developers in terms of working with the iOS device’s camera, scanning images and objects in the environment, and positioning 3D models in real space and making them fit in.
Or as Apple puts it:
ARKit combines device motion tracking, camera scene capture, advanced scene processing, and display conveniences to simplify the task of building an AR experience. You can use these technologies to create many kinds of AR experiences using either the back camera or front camera of an iOS device.
Apple initially launched ARKit with iOS 11 in 2017. App developers could use Xcode, Apple’s software-development environment on Macs, to build apps with it. ARKit primarily does three essential things behind the scenes in AR apps: tracking, scene understanding, and rendering.
Tracking keeps tabs on a device’s position and orientation in the physical world, and it can track objects like posters and faces—though some of those trackable items were not supported in the initial iOS 11 release.
Scene understanding essentially scans the environment and provides information about it to the developer, the app, or the user. In the first release, that meant horizontal planes and a few other things.
Rendering means that ARKit handles most of the work for placing 3D objects contextually in the scene captured by the device’s camera, like putting a virtual table in the middle of the user’s dining room while they’re using a furniture shopping app.
ARKit does this by tracking the environment in some specific ways. Let’s review what the initial release supported on that front.
In the orientation tracking configuration, ARKit uses the device’s internal sensors to track rotation in three degrees of freedom, but it’s like turning your head without walking anywhere—changes in physical position aren’t tracked here, just orientation in a spherical virtual environment with the device at the origin. Orientation tracking is an especially useful approach for augmenting far-off objects and places outside the device’s immediate vicinity.
There’s more to world tracking. It tracks the device’s camera viewing orientation and any changes in the device’s physical location. So unlike orientation tracking, it understands if the device has moved two feet to the right. It also does this without any prior information about the environment.
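The difference between the two configurations comes down to which class an app hands to its AR session. A minimal sketch, assuming a `sceneView` property of type `ARSCNView` already exists in the app's view hierarchy (the variable names are illustrative, not from Apple's samples):

```swift
import ARKit

// World tracking: six degrees of freedom (rotation plus physical movement),
// built up from camera imagery and motion sensors with no prior knowledge
// of the environment.
let worldConfig = ARWorldTrackingConfiguration()
sceneView.session.run(worldConfig)

// Orientation tracking: rotation only, useful for augmenting distant
// scenery where the user's position doesn't matter.
let orientationConfig = AROrientationTrackingConfiguration()
// sceneView.session.run(orientationConfig)
```

Swapping configurations on a running session restarts tracking under the new rules; apps generally pick one configuration per experience.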
Further, ARKit uses a process called visual inertial odometry, which involves identifying key physical features in the environment around the device. Those features are recorded from multiple angles as the device is moved and reoriented in physical space (moving is required; rotation doesn’t provide enough information). The images captured in this process are used together to understand depth; it’s similar to how humans perceive depth from two eyes.
This generates what Apple calls a world map, which can be used to position and orient objects, apply lighting and shadows to them, and much more. The more a user moves and reorients, the more information is tracked, and the more accurate and realistic the augmentations can become. When ARKit builds the world map, it matches it to a virtual coordinate space in which objects can be placed.
The device needs uninterrupted sensor data, and this process works best in well-lit environments that are textured and that contain very distinct features; pointing the camera at a blank wall won’t help much. Too much movement in the scene can also trip the process up.
ARKit tracks world map quality under the hood, and it indicates one of three states that developers are advised to report in turn to users in some way:
- Not available: The world map is not yet built.
- Limited: Some factor has prevented an adequate world map from being built, so functionality and accuracy may be limited.
- Normal: The world map is complete and good augmentation can be expected.
Plane detection uses the world map to detect surfaces on which augmented reality objects can be placed. When ARKit launched with iOS 11, only horizontal planes were detected and usable, and variations like bumps and curves could easily disturb efforts to accurately place 3D objects in the view.
Using these three tracking techniques, developers can tap ARKit to easily place 3D objects they’ve modeled on a plane in the user’s view of the camera image on the device’s screen.
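In code, plane detection is an opt-in flag on the world tracking configuration, and detected surfaces arrive as anchors through a delegate callback. A sketch, again assuming an `ARSCNView` named `sceneView`:

```swift
let configuration = ARWorldTrackingConfiguration()
configuration.planeDetection = [.horizontal]  // .vertical was added later, in ARKit 1.5
sceneView.session.run(configuration)

// ARSCNViewDelegate: ARKit adds an ARPlaneAnchor for each detected surface.
func renderer(_ renderer: SCNSceneRenderer, didAdd node: SCNNode, for anchor: ARAnchor) {
    guard let planeAnchor = anchor as? ARPlaneAnchor else { return }
    // Position content relative to planeAnchor.center and planeAnchor.extent,
    // which describe the plane's estimated location and size.
}
```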
Features added in iOS 11.3
Apple released ARKit 1.5 with iOS 11.3 earlier this year. The update made general improvements to the accuracy and quality of experiences that could be built with ARKit without significant added developer effort. It also increased the resolution of the user’s camera-based view on their screen during AR experiences.
The initial version of ARKit could only detect, track, and place objects on flat horizontal surfaces, so ARKit 1.5 added the ability to do the same with vertical surfaces and (to some extent) irregular surfaces that aren’t completely flat. Developers could place objects on the wall, not just the floor, and to a point, literal bumps in the road were no longer figurative bumps in the road.
ARKit 1.5 added basic 2D image tracking, meaning that ARKit apps could recognize something like a page in a book, a movie poster, or a painting on the wall. Developers could easily make their applications introduce objects to the environment once the device recognized those 2D images. For example, a life-sized Iron Man suit could be placed in the environment when the user points the device’s camera at an Avengers movie poster.
What Apple will add in iOS 12
That brings us to WWDC on June 4, 2018, where Apple announced iOS 12 and some major enhancements and additions to ARKit that make the platform capable of a wider range of more realistic applications and experiences.
The changes allow for virtual objects that fit into the environment more convincingly, multi-user AR experiences, and objects that remain in the same location in the environment across multiple sessions.
Saving and loading maps
Previously, AR world maps were not saved across multiple sessions, and they were not transferable between devices. That meant that if an object was placed in a scene at a particular location, a user could not revisit that location and find that the app remembered the previous scene. It also meant that AR experiences were always solo ones in most ways that mattered.
In iOS 11.3, Apple introduced relocalization, which let users restore a state after an interruption, like if the app was suspended. iOS 12 expands on that significantly. Once a world map is acquired, the user can relocalize to it in a later session, or the world map can be shared to another user or device using the MultipeerConnectivity framework. Sharing can happen via AirDrop, Bluetooth, Wi-Fi, or a number of other methods.
ARKit understands that the device is in the same scene as it was in another session, or the same one as another device was, and it can determine its position in that previous world map.
Apple demonstrated this by building an AR game for developers to study and emulate called SwiftShot, which had multiple users interacting with the same 3D objects on multiple devices at once.
But multi-user gaming is not the only possible use case. Among other things, saving and loading maps could allow app developers to create persistent objects in a certain location, like a virtual statue in a town square, that all users on iOS devices would see in the same place whenever they visited. Users could even add their own objects to the world for other users to find.
There are still some limitations, though. Returning to a scene that has changed significantly in the real world since the last visit can obviously cause relocalization to fail, but even changed lighting conditions (like day vs. night) could cause a failure, too. This is a notable new feature in ARKit, but some work still needs to be done to fully realize its potential.
Apple has added a new configuration called ARImageTrackingConfiguration, which further enables building applications that focus on 2D images rather than using the full world tracking approach. This is more performant for tracking lots of images at once, so it allows for superior experiences in certain apps that are built entirely around 2D image recognition.
The device tracks interesting points on visible 2D images in the scene, then goes through what Apple calls a dense tracking stage, wherein an image in the scene is warped into a rectangular shape so it can be compared to the reference image the app already knows about. An error image is generated to consider the differences, and then the position and orientation are adjusted until the error is sufficiently minimized.
ARKit 2 also extends this to 3D objects. Fundamentally, the way it reads the real-world 3D object is similar to the way it builds world maps. As with 2D image tracking, developers must include a reference object in the app to compare the real-world object to. Apple now offers a developer-focused tool for doing exactly this. Developers are advised to track rigid objects that are texture rich, and that are neither transparent nor reflective.
Apple explains object tracking this way:
Your app provides reference objects, which encode three-dimensional spatial features of known real-world objects, and ARKit tells your app when and where it detects the corresponding real-world objects during an AR session.
The potential applications of this feature are numerous. ARKit could recognize a specific children’s action figure and add virtual objects to the scene with which the toy could appear to interact, for example—that’s basically what the LEGO app that was demonstrated at the WWDC keynote did. Or ARKit could identify a specific make and model of a car in the real world, and place a representation of the car’s name and specifications near its location in the user’s view.
Improved face tracking
Using the front-facing TrueDepth sensor array on the iPhone X and likely some or all other future iOS devices, ARKit 2 adds the ability to track tongue movements (Apple has said many users in testing try sticking their tongues out right away when using Animoji, and are disappointed when it doesn’t work) and track eyes individually (meaning you can wink now). It improves tracking of the user’s gaze too, and Apple also made improvements to fidelity in different lighting conditions.
Finally, ARKit 2 supports advanced environment texturing. This means a few things. First, Apple says that it trained a neural network on thousands of environments, allowing ARKit to essentially hallucinate the contents of gaps in the scene and world map with some degree of accuracy.
Environment texturing also allows for rendering objects in a more realistic way in the context of the scene. ARKit tracks the ambient light in the environment and generates shadows from virtual objects in the user’s view. This alone closes a big gap in how real an object seems to users; an object that does not cast a shadow messes with our heads, it turns out.
It also applies reflections of the surrounding environment to objects that should have them—another gap-closer for convincing augmentation.
USDZ: The new AR object file format
In addition to new features in ARKit 2, Apple announced its new USDZ file format for AR objects at WWDC this year.
Based on Pixar’s open-source USD (universal scene description) format—USDZ was developed in a collaboration between Apple and Pixar—USDZ contains the 3D model and its textures in one file. It will be supported in iOS 12 and macOS Mojave.
USDZ files are relatively small, and can be shared across devices, or viewed on the web or in Apple’s Quick Look feature in macOS. Adobe announced native support for USDZ in its applications, which was a big win for the file format’s early adoption. But as with anything new, the format’s future is not yet known.
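On iOS 12, an app can hand a USDZ file to the system's Quick Look viewer rather than building a custom AR scene. A minimal sketch using the QuickLook framework, assuming `modelURL` is a file URL pointing at a bundled `.usdz` model:

```swift
import QuickLook

final class ModelPreviewSource: NSObject, QLPreviewControllerDataSource {
    let modelURL: URL  // file URL of a .usdz model; location is an assumption
    init(modelURL: URL) { self.modelURL = modelURL }

    func numberOfPreviewItems(in controller: QLPreviewController) -> Int { 1 }

    func previewController(_ controller: QLPreviewController,
                           previewItemAt index: Int) -> QLPreviewItem {
        // NSURL conforms to QLPreviewItem, so the file URL can be returned directly.
        return modelURL as NSURL
    }
}
```

Presenting a `QLPreviewController` with this data source gives users the standard view-and-place-in-AR experience without any ARKit code in the app itself.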
Why Apple is so focused on AR
According to some analysts and reports, smartphone sales were down for the first time in 2017. Though the iPhone X has been the world’s top-selling smartphone for much of the time since its introduction, Apple has only barely matched its year-ago smartphone sales in recent quarters.
The company is not currently in any immediate danger of disappointing stockholders—it had its best March quarter ever this year with more than $61 billion in revenue—but there may be concerning trends on the hardware front, so Apple needs to plan ahead.
The fact that Apple is still impressing investors with its quarterly earnings report is mostly thanks to two factors: the average selling price and profit margin of its phones (both are higher than much of the competition), and its growing services businesses, which include things like Apple Music and iCloud.
Apple has emphasized those services with an ongoing marketing and public relations campaign designed to establish the company as the privacy-focused alternative to digital services companies like Google and Facebook. But if smartphone sales decline a lot more, that might not make up the difference for Apple—in part because Apple has committed not to build a lucrative business out of users’ private data for advertisers in the way that Google and Facebook have.

Apple has achieved its biggest past successes by entering wild-west markets where mature products were not yet present. Smartphones, tablets, and computers are mature markets. Apple needs a major new product category in which to get ahead of the ball.
Tim Cook and other Apple executives seem to have picked AR as one of the best bets for that. Driving a file format like USDZ could help the company establish leadership in the space, assuming the format is widely adopted. Building out ARKit now means the company will be ready to roll when the watershed moment happens, rather than hurriedly trying to build a platform from scratch.
Apple is also reportedly exploring the idea of launching a new advertising platform. It tried this in 2010 with iAd, but made some poor bets in the process, and iAd was discontinued in 2016. A widely adopted AR platform could be a huge boon for Apple’s advertising ambitions.
ARKit makes it easy for app developers to add marketing and advertising activations that are contextual to the user’s location, the products they have with them, and so on. The object detection and image tracking features are aimed at that use case (among others). Reference images could replace QR codes, for example—a tool that has always shown promise to advertisers and marketers but that has been just a bit too clunky to become ubiquitous.

There might also be a bit of futuristic idealism at play. The possibilities are numerous. You could summon a lifelike figure to follow through a maze-like airport or mall to find what you’re looking for, as if you were being guided by a real person. Future iterations of the iPhone X’s front-facing TrueDepth sensor could perform sentiment analysis based on your facial expressions, delivering different content and functionality in apps or games depending on your mood. You could participate in wholly immersive entertainment not dissimilar from the Star Trek holodeck, with interactive, lifelike characters performing a compelling drama in your living room. You could walk down a street in New York City with a visible overlay over every door, showing you ratings and menus for each restaurant and bar. You could go bird watching and have every species that enters your field of view immediately identified before your eyes.
It’s wild, hypothetical stuff, but there are plenty of people in Silicon Valley who are compelled by distant hypotheticals.
We’ve seen reports that Apple is working on AR glasses internally. If the experiments pan out, the product is still quite a few years off—Cook himself said we have a long way to go in a recent interview—but the technology will mature eventually. If Apple launches the first mainstream-viable version of that product, and if AR is as revolutionary as Apple hopes, it could be another watershed moment like the first iPhone.
That said, Apple is not the only player here. Microsoft had an early lead with HoloLens, though it has since fallen behind in some respects. But it’s still pushing its Windows Mixed Reality platform today, and a new version of HoloLens is rumored to be coming. Google has Tango (and ARCore for current mobile devices), which has some advantages over ARKit—though most of those advantages are in Tango only, and Apple might have internal tech that stands toe-to-toe with Tango but that isn’t public yet because it’s not applicable to currently available consumer devices.

AR’s ubiquity and importance are not assured, nor is Apple’s dominance should the technology get there. But Apple is making a major effort to make both of those things a reality, and that means investing early—so early that many of its customers won’t see the value of the investment for some time.