Google has updated Voice Access so that it can automatically detect icons using only the pixel values displayed on the screen, regardless of whether the icons have been given suitable accessibility labels. With this update, users have more control over their Android devices through voice commands.

Voice Access relies on on-screen user interface (UI) elements having reliable accessibility labels, which are provided to the operating system’s accessibility services via the accessibility tree. However, many apps do not supply adequate labels for all of their UI elements, which reduces the usability of Voice Access.

To address the issue, Google built a machine learning model that automatically detects icons on the screen from UI screenshots. This lets Voice Access recognize elements such as images and icons whether or not accessibility labels for them have been provided to Android’s accessibility services.

The latest version of Voice Access introduces a vision-based object detection model called IconNet. The model can automatically detect 31 types of icons on the screen in a manner that is agnostic to the underlying structure of the app being used.

Furthermore, the model is optimized to run on-device in mobile environments, with a compact size and fast inference time to enable a seamless user experience. According to a Google blog post, the company is working on improving the model and hopes to extend the icon detection capability to 70 types in the near future.

How does Google detect icons?

The Google researchers wrote in their blog post that the problem of detecting icons on app screens is similar to classical object detection. IconNet is based on the novel CenterNet architecture: it extracts features from input images and then predicts appropriate bounding box centers and sizes. To train IconNet, they collected and labeled more than 700,000 app screenshots, streamlining the process by using heuristics, auxiliary models, and data augmentation techniques to identify rarer icons and enrich existing screenshots with infrequent icons.
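The idea of predicting box centers and sizes can be illustrated with a small decoding step. The following is only a minimal sketch of how a CenterNet-style output might be turned into boxes, not IconNet's actual code: the function name `decode_centers`, the `Detection` type, the 0.5 threshold, and the toy 4x4 grid are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    """One detected icon (hypothetical type, for illustration only)."""
    label: str
    score: float
    cx: int        # center column in the heatmap grid
    cy: int        # center row in the heatmap grid
    width: float   # predicted box width, in grid cells
    height: float  # predicted box height, in grid cells

    def box(self) -> Tuple[float, float, float, float]:
        """Return (left, top, right, bottom) built from center and size."""
        return (self.cx - self.width / 2, self.cy - self.height / 2,
                self.cx + self.width / 2, self.cy + self.height / 2)


def decode_centers(heatmap: List[List[float]],
                   sizes: List[List[Tuple[float, float]]],
                   label: str,
                   threshold: float = 0.5) -> List[Detection]:
    """Keep cells that score above the threshold and are local maxima.

    heatmap[y][x] is the assumed center score for one icon type;
    sizes[y][x] is the assumed (width, height) predicted at that cell.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    detections = []
    for y in range(rows):
        for x in range(cols):
            score = heatmap[y][x]
            if score < threshold:
                continue
            # Compare against the surrounding 3x3 neighbourhood.
            neighbours = [
                heatmap[ny][nx]
                for ny in range(max(0, y - 1), min(rows, y + 2))
                for nx in range(max(0, x - 1), min(cols, x + 2))
                if (ny, nx) != (y, x)
            ]
            # Ties are kept, so a flat plateau yields several detections.
            if all(score >= n for n in neighbours):
                w, h = sizes[y][x]
                detections.append(Detection(label, score, x, y, w, h))
    return detections


# Toy data: two clear peaks, every cell predicting a 2x2 box.
heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.0],
    [0.1, 0.3, 0.2, 0.6],
    [0.0, 0.0, 0.1, 0.2],
]
sizes = [[(2.0, 2.0)] * 4 for _ in range(4)]
detections = decode_centers(heatmap, sizes, "gallery")
for det in detections:
    print(det.label, det.score, det.box())
```

A real model would produce one such heatmap per icon type at a much higher resolution; the sketch only shows why a center plus a size is enough to recover a bounding box.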

In Voice Access, an action is performed by tapping anywhere within the region of the UI element of interest, so once IconNet has located an icon, it can be activated by voice. Whenever users refer to an icon detected by IconNet by its name, e.g. “Tap Gallery”, Voice Access can find the icon and carry out the tap.
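As a rough illustration of that last step, a spoken command could be matched against the detected icon names and resolved to a tap location. This is a hypothetical sketch, not the Voice Access implementation: the function `tap_point_for_command`, the label-to-box dictionary, and the pixel coordinates are all assumed for the example.

```python
from typing import Dict, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def tap_point_for_command(command: str,
                          detections: Dict[str, Box]) -> Optional[Tuple[int, int]]:
    """Resolve a command such as "Tap Gallery" to a point inside the icon.

    `detections` maps lowercase icon names to bounding boxes (an assumed
    format). Returns None if the command is not a tap or the icon is unknown.
    """
    words = command.strip().lower().split(maxsplit=1)
    if len(words) != 2 or words[0] != "tap":
        return None
    box = detections.get(words[1])
    if box is None:
        return None
    left, top, right, bottom = box
    # Any point inside the box would do; the center is the simplest choice.
    return ((left + right) // 2, (top + bottom) // 2)


detected_icons = {"gallery": (100, 200, 180, 280)}
print(tap_point_for_command("Tap Gallery", detected_icons))  # (140, 240)
```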

Gilles Baechler and Srinivas Sunkara, Software Engineers at Google Research, said in a blog post, “We are constantly working on improving IconNet. Among other things, we are interested in increasing the range of elements supported by IconNet to include any generic UI element, such as images, text, or buttons. We also plan to extend IconNet to differentiate between similar looking icons by identifying their functionality. On the application side, we are hoping to increase the number of apps with valid content descriptions by augmenting developer tools to suggest content descriptions for different UI elements when building applications.”

Google introduced Voice Access back in 2016 and is still working on a number of projects that combine voice commands with AI.