The Greatest Engineering Challenge to Improve Mobile Augmented Reality Headsets

In 2016, tethered systems like the HTC Vive and Oculus Rift rose swiftly in popularity. They used "outside-in tracking" technology, relying on powerful workstation computers to do most of the legwork, along with external sensors such as the HTC Vive's lighthouse base stations, which must be installed in the user's immediate vicinity to create a tracking environment for measuring the user's orientation and position. That external hardware was widely perceived as cumbersome.

Since then, the Oculus Go, HTC Vive Focus and the Windows Mixed Reality headsets have moved to inside-out tracking, meaning no base stations are required to track a user's position, motion and orientation. This shift owes much to the design of the Microsoft HoloLens, which popularized inside-out tracking, and DAQRI Smart Glasses followed suit as they evolved from the DAQRI Smart Helmet. DAQRI now uses inside-out tracking as well to make its Smart Glasses easier to set up and use: untethered and mobile.

Latency is the main challenge for mobile augmented reality headsets, more so than for virtual reality headsets, which up until recently, were mostly tethered to powerful workstation or laptop computers.

The Microsoft HoloLens and DAQRI Smart Glasses (pictured here) do not have the processing power of a workstation computer, so they cannot rely on raw compute alone to push latency down to 5 milliseconds, the level at which visual lag becomes pleasant and effectively unnoticeable. (Image courtesy of DAQRI.)

Too much latency and you get the classic complaint about all augmented, virtual and mixed reality experiences: they make you nauseous or sick.

For mobile headsets that use inside-out tracking like the HoloLens, a technique known as visual-inertial fusion has to be built into the tracking pipeline to push latency down to undetectable levels. Prediction, late correction and inertial measurement units (IMUs) are the techniques and hardware that product design teams combine to bring end-to-end latency under 5 milliseconds, greatly improving the user experience.
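In rough outline, such a pipeline chains tracking, IMU propagation, prediction, rendering and a late 2D correction before scanout. The sketch below is purely conceptual; the objects and method names are hypothetical placeholders, not the HoloLens or DAQRI APIs. Each step is unpacked in the sections that follow.

```python
# Conceptual sketch of a visual-inertial fusion loop. The objects and method
# names (tracker, imu, renderer, display) are hypothetical placeholders,
# not APIs from HoloLens, DAQRI or any real SDK.

def present_one_frame(tracker, imu, renderer, display):
    pose = tracker.latest_pose()                  # absolute pose from 30 Hz camera tracking
    pose = imu.propagate(pose)                    # dead-reckon the pose up to "now" (~1 ms old)
    predicted = imu.predict(pose, display.photon_delay())   # predict to when photons will appear
    frame = renderer.render(predicted)            # full 3D render with the predicted pose
    frame = renderer.late_warp(frame, imu.predict(pose, display.photon_delay()))  # 2D late correction
    display.scanout(frame)                        # color-sequential scanout to the display
```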

Correcting latency is much harder to do in the realm of augmented reality.

Latency

Specifically, it is known as motion-to-photon latency, a property of both augmented and virtual reality characterized, in terms of the user experience, by how long it takes the display to catch up with the user's motion. If you move your head to the right, the displayed content lags and may show unwanted artefacts relative to the headset's current time, orientation and position. This makes the experience unpleasant, leaving users nauseous and compelled to stop and take off the headset.

In virtual reality, the headset fully occludes physical reality; the user is visually and fully immersed in a virtual environment. The vestibular system of the inner ear, combined with the sense of muscle motion, creates a perception of misalignment between the person and the digital content, but a relatively large range of latency (up to 20 milliseconds) remains mostly inconspicuous.

In augmented reality, optical see-through (OST) systems overlay digital content onto the real world. The real world seen through the optics arrives with effectively zero latency, so any lag in the overlaid virtual content is far more pronounced than the same lag in the fully occluded virtual environment of VR. Couple this with the fact that mobile AR headsets do not have the computing power of a tethered workstation or laptop, and it's easy to understand why latency is a bigger engineering challenge for AR than for VR.

Sensors

For mobile augmented reality devices like the HoloLens and DAQRI Smart Glasses, motion estimation—otherwise known as inside-out pose tracking—is accomplished by an array of IMUs and tracking cameras.

Think of every image as part of an assembly line. Tracking cameras (for both AR and VR headsets) typically run at a frame rate of 30 Hz, which translates to one image captured and read out for processing every 33 milliseconds. This frame rate also caps the exposure time of each frame at 33 milliseconds.

Microsoft HoloLens uses an LCoS display, which functions similarly to a DLP projector. The HoloLens set the tone for inside-out pose tracking when it was unveiled in 2015. (Image courtesy of Microsoft.)

If the exposure time is 20 milliseconds, the most convenient timestamp for processing is the midpoint of the exposure, 10 milliseconds in. By the time the exposure is complete, the image content is therefore already 10 milliseconds old. The image is then read out and sent to the image processing system, which adds another 5 milliseconds, bringing its age to 15 milliseconds.

But IMUs run at rates one to several orders of magnitude higher than cameras. An IMU operating at 1000 Hz generates one sample per millisecond with virtually no latency, so a sample is at most 1 millisecond old by the time the next one arrives.
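As a quick worked example of the arithmetic above, using the article's own figures (exact values vary by camera, IMU and device):

```python
# Worked example of sensor-data age using the figures from the article.
# All values are in milliseconds; actual numbers vary by camera, IMU and device.

CAMERA_RATE_HZ = 30
EXPOSURE_MS = 20               # assumed exposure time for this example
READOUT_AND_PROCESSING_MS = 5

# The image is timestamped at the middle of its exposure, so by the time the
# exposure ends the image content is already half an exposure old.
age_after_exposure = EXPOSURE_MS / 2                                    # 10 ms
age_after_processing = age_after_exposure + READOUT_AND_PROCESSING_MS  # 15 ms

IMU_RATE_HZ = 1000
imu_sample_age_max = 1000 / IMU_RATE_HZ                                 # at most 1 ms

print(f"camera image age entering the tracker: {age_after_processing:.0f} ms")
print(f"worst-case IMU sample age:             {imu_sample_age_max:.0f} ms")
```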

Tracking

Tracking estimates a 6 degrees-of-freedom (6DOF) pose, showing how the headset's position and orientation have changed. By the time the camera image reaches the tracking system, it is 15 milliseconds old. The IMU data, at about 1 millisecond old, is far fresher than the camera frame and is generally not processed at this stage. If the camera runs at 30 Hz, the tracking system should run at the same rate, giving it a 33-millisecond processing budget.
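A minimal sketch of what a tracker's output might look like: a 6DOF pose stamped with the time of the (already old) camera image. The class and field names are illustrative, not taken from any real SDK.

```python
# Minimal sketch of a timestamped 6DOF pose as a tracker might output it.
# Class and field names are illustrative, not from any real SDK.
from dataclasses import dataclass
import numpy as np

@dataclass
class TimestampedPose:
    timestamp_s: float      # time the camera image was captured (mid-exposure)
    position: np.ndarray    # 3-vector, metres, in the world frame
    rotation: np.ndarray    # 3x3 rotation matrix, headset-to-world

# At 30 Hz the tracker has roughly a 33 ms budget per frame; the pose it
# produces refers to an image that was already ~15 ms old when tracking began.
TRACKER_BUDGET_MS = 1000 / 30
```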

Rendering

In rendering, a frame buffer (representing a 2D image) is produced and sent to the display of the mobile AR unit. The renderer draws on several different sources of data to create this 2D image. Since rendering runs at a higher rate than motion estimation, it can be done separately, without interference from the tracking computations.

If rendering operates at 60 Hz, a new frame is generated roughly once every 17 milliseconds, and transmission of the frame to the display only begins at the tail end of that window.
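One common way to keep the two rates decoupled is to run rendering in its own loop that simply reads the most recent pose rather than waiting on the tracker. The sketch below assumes that structure; the threading details are an illustration, not a description of any specific headset's runtime.

```python
# Simplified sketch of a 60 Hz render loop decoupled from the 30 Hz tracker.
# The render loop reads whatever pose is newest; it never waits for tracking.
import threading
import time

latest_pose = None             # written by the tracking thread, read here
pose_lock = threading.Lock()

RENDER_RATE_HZ = 60            # a new frame roughly every 17 ms

def render_loop(stop_event):
    frame_interval = 1.0 / RENDER_RATE_HZ
    while not stop_event.is_set():
        start = time.monotonic()
        with pose_lock:
            pose = latest_pose  # most recent pose, regardless of tracker timing
        # ... render the frame with `pose`, then hand it off for scanout ...
        elapsed = time.monotonic() - start
        time.sleep(max(0.0, frame_interval - elapsed))
```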

Display

The mechanics of mobile AR displays are crucial for fending off unnecessary latency. The frame data is sent to the display one pixel and one line at a time, a process known as scanout. With the display running at 60 Hz, transmitting a frame to the display adds another 17 milliseconds of latency.
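Adding up the stages walked through so far gives a rough sense of where an uncorrected pipeline lands. The arithmetic below simply sums the article's own figures as worst-case budgets; real pipelines overlap stages, which is why the article later cites roughly 40 to 70 milliseconds in total.

```python
# Rough motion-to-photon budget built from the article's figures, treating each
# stage as a worst-case budget. Real pipelines overlap these stages, which is
# why the article cites roughly 40-70 ms in practice rather than this sum.

stages_ms = {
    "half of camera exposure (mid-exposure timestamp)": 10,
    "camera readout + image processing":                 5,
    "tracking budget at 30 Hz":                         33,
    "rendering budget at 60 Hz":                        17,
    "scanout to the display at 60 Hz":                  17,
}

total_ms = sum(stages_ms.values())
for stage, ms in stages_ms.items():
    print(f"{stage:<50} {ms:>3} ms")
print(f"{'worst-case total (before IMU-based correction)':<50} {total_ms:>3} ms")
```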

There are two main types of displays that are used for AR and VR: Color Sequential Displays and Line Sequential Displays.

Color Sequential Displays

Known as "liquid crystal on silicon," or LCoS, these displays are found primarily in mobile AR units, whereas Line Sequential Displays such as LCD and OLED panels are common in consumer electronics products and VR.

LCoS displays are used in the Microsoft HoloLens and DAQRI Smart Glasses and are only capable of showing one color at a time. To show a full RGB (red, green, blue) frame, the LCoS display matrix first receives data for only the red parts of the pixels, then the red LED is turned on to show the red image. The red LED is then turned off, and the same process is repeated for the blue and green subframes. (Image courtesy of JDC Technology.)

The LCoS display needs its data organized as single-color subframes rather than interleaved RGB frames. That is a mismatch, because GPUs create interleaved RGB frames by default, and DisplayPort and HDMI likewise transmit interleaved RGB frames.

If you sense a catch-22 here, you're correct. Because a single-color subframe isn't ready until the whole interleaved RGB frame has been received and separated, up to a frame of extra latency accrues before the first color subframe can be displayed.

To complicate matters further for mobile AR LCoS displays, each successive subframe in sequence has an offset delay of 4 milliseconds, adding to overall latency.
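A minimal sketch of the color separation step an LCoS controller has to perform: splitting an interleaved RGB frame into single-color subframes and scheduling each one a further 4 milliseconds out, per the article's figure. The array layout and the red-green-blue order here are assumptions; the actual order varies by panel.

```python
# Sketch of color separation for a color-sequential LCoS display: an interleaved
# RGB frame is split into single-color subframes, and each successive subframe
# is shown a further 4 ms later (per the article). Subframe order varies by panel.
import numpy as np

SUBFRAME_OFFSET_MS = 4.0

def to_subframes(rgb_frame: np.ndarray, frame_ready_ms: float):
    """rgb_frame: (height, width, 3) interleaved RGB image."""
    subframes = []
    for i, channel in enumerate(("red", "green", "blue")):
        subframes.append({
            "color": channel,
            "pixels": rgb_frame[:, :, i],                         # single-color plane
            "display_time_ms": frame_ready_ms + i * SUBFRAME_OFFSET_MS,
        })
    return subframes

frame = np.zeros((768, 1360, 3), dtype=np.uint8)   # 1360 x 768, per the DAQRI spec
for sf in to_subframes(frame, frame_ready_ms=0.0):
    print(sf["color"], "shown at", sf["display_time_ms"], "ms")
```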

IMUs, Prediction and Late Correction to the Rescue

If a product design team for mobile AR could reduce camera exposure time, motion blur would be reduced. But small, cheap cameras don't handle short exposures well; the resulting noise and artefacts would counteract the improvement in the first place.

Faster tracking algorithms or faster processing units could reduce the time it takes to process an image, but they incur a cost in overall accuracy.

But IMUs can run at 1000 Hz, so every sample is at most 1 millisecond old. By integrating the IMU samples recorded between the camera image's timestamp and the present, design teams can apply that relative motion on top of the last absolute pose from the tracker, producing a new pose that is only about 1 millisecond old. Without this calculation, the pose would carry roughly 40 to 70 milliseconds of latency in total.
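A simplified sketch of that idea: integrate the IMU samples recorded since the camera image's timestamp and compose the resulting relative motion with the tracker's absolute pose. Gravity compensation, sensor biases and drift handling are deliberately omitted, and the function names are illustrative.

```python
# Simplified IMU dead reckoning: apply the relative motion measured since the
# camera frame's timestamp on top of the tracker's (older) absolute pose.
# Gravity compensation, sensor biases and drift handling are omitted.
import numpy as np

def skew(w):
    """3x3 skew-symmetric matrix of a 3-vector, used for small rotations."""
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def propagate_pose(rotation, position, velocity, imu_samples, dt):
    """imu_samples: list of (gyro_rad_s, accel_m_s2) since the image timestamp."""
    for gyro, accel in imu_samples:
        # First-order (small-angle) integration of angular velocity.
        rotation = rotation @ (np.eye(3) + skew(gyro) * dt)
        world_accel = rotation @ accel            # rotate acceleration into the world frame
        velocity = velocity + world_accel * dt
        position = position + velocity * dt
    return rotation, position, velocity

# e.g. 40 ms of samples at 1000 Hz means 40 small integration steps
R, p, v = propagate_pose(np.eye(3), np.zeros(3), np.zeros(3),
                         [(np.zeros(3), np.zeros(3))] * 40, dt=0.001)
```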

DAQRI Smart Glasses are a marvel of modern engineering. This mobile AR unit contains a 6th Generation Intel Core m7 CPU, a dedicated vision processing unit for 6DOF tracking and two LCoS optical displays with a 44° diagonal FOV, a resolution of 1360 x 768 and a frame rate of 90 fps. Connectivity: WiFi 802.11 a/b/g/n/ac 2.4/5 GHz and Bluetooth. Battery: built-in rechargeable 5800 mAh lithium-ion. Storage: 64 GB solid-state drive. (Image courtesy of DAQRI.)
Since a pose will already be old by the time the rendered frame shows up on the display, the system needs a good estimate of the interval between the start of rendering and the moment photons are actually emitted from the display. Because the display runs in lockstep with the GPU, this interval can be calculated. That allows product design teams to "see" into the future: motion models fed by recent motion data, such as constant velocity, constant acceleration or higher-order motion curves, generate a pose predicted for the moment the frame will actually appear, enabling accurate motion prediction.
Motion prediction is accurate when the prediction interval is very short, on the order of 20 milliseconds, but it loses accuracy as the interval grows. A common practice to correct for this is to warp the frame buffer after the render step using an even more up-to-date pose prediction. (Image courtesy of DAQRI.)
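A toy version of such a motion model, fitting constant velocity and constant acceleration to recent position samples and extrapolating to the estimated photon time. Real systems also predict orientation, typically with quaternions; this sketch handles position only.

```python
# Toy motion prediction: fit constant velocity and acceleration to recent
# position samples and extrapolate to the estimated photon time.
# Real systems also predict orientation; this sketch handles position only.
import numpy as np

def predict_position(timestamps_s, positions, photon_time_s):
    """timestamps_s: (N,) sample times; positions: (N, 3); returns a (3,) prediction."""
    t = np.asarray(timestamps_s) - timestamps_s[-1]   # centre time on the newest sample
    # Fit p(t) = p0 + v*t + 0.5*a*t^2 per axis via least squares.
    A = np.stack([np.ones_like(t), t, 0.5 * t**2], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(positions), rcond=None)
    p0, v, a = coeffs
    dt = photon_time_s - timestamps_s[-1]             # how far ahead to predict
    return p0 + v * dt + 0.5 * a * dt**2
```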

This is a 2D operation, and as such, it only takes a few milliseconds compared to more complex 3D operations. On a modern GPU, this “late warping”—a type of late correction—can be executed just before the frame buffer is transmitted to the display.
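For a rotation-only correction, that warp reduces to a single 3x3 homography applied to the already rendered frame buffer. The sketch below assumes pinhole intrinsics for the virtual camera; the matrices here are placeholders, not values from any actual device.

```python
# Sketch of rotation-only late warping: reproject the rendered frame with a
# homography built from the newer pose prediction. Intrinsics are placeholders.
import numpy as np

def late_warp_homography(K, R_render, R_latest):
    """K: 3x3 intrinsics of the virtual camera; R_*: 3x3 world-to-camera rotations."""
    R_delta = R_latest @ R_render.T           # how the view rotated since rendering
    return K @ R_delta @ np.linalg.inv(K)     # 2D mapping from old pixels to new ones

# In practice the GPU applies this as a full-screen warp just before scanout;
# for offline testing one could use, e.g., cv2.warpPerspective(frame, H, (w, h)).
```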

The big difference between LCoS displays (used in mobile AR) and the OLED displays (used in VR headsets like the HTC Vive and Oculus Rift) is that there is no standard protocol for LCoS, and as a result, LCoS manufacturers each use their own proprietary protocols. The extra step exists because GPUs produce interleaved RGB frame buffers by design, so additional processing is needed to convert them into the color sequential format the LCoS display requires. To handle this, LCoS display manufacturers include a chip that receives frame buffers over common display standards, performs the color separation and finally passes the data on via a proprietary protocol.

A Himax LCoS panel is used in both the DAQRI Smart Glasses and the Microsoft HoloLens. Since only a single color is shown at a time, the colors can break up under motion, producing what's known as the rainbow effect.

Optimizing LCoS Displays for Mobile AR

Where LCoS displays produce unwanted effects and artefacts because of the color sequencing needed to continually refresh the display, late warping techniques can be used for correction. But mobile units are hindered by a lack of processing power compared with systems tethered to powerful workstations and laptops. Balancing the competing requirements of high throughput, low latency and low power means that connecting a GPU to an LCoS display still requires proprietary display protocols.

Absent standardization of such protocols, extensions to improve mobile AR support from the GPU to the LCoS display are needed as stepping stones to improve motion-to-photon latency. Though they are beginning to be developed, they are still a few years away. Until then, proprietary solutions will continue to be the norm. Admirers and mobile AR enthusiasts can likely agree that the amount of care and feats of engineering happening right now to improve motion-to-photon latency are anything but standard.