Microsoft’s quest to build computing systems that understand the world around them doesn’t end with the company’s Project Oxford machine-learning technology. Researchers at the Redmond, Wash., software maker are also developing systems that mimic how humans pull information from the things they see.
“When a person is asked about something in a photo, they’re taking in a lot of details—a lot of words—to answer questions about it,” blogged Microsoft spokesperson Athima Chansanchai. “Now, a team of Microsoft researchers, together with colleagues from Carnegie Mellon University, has created a system that uses computer vision, deep learning and language understanding to analyze images and answer questions the same way humans would.”
Together, the researchers created a model that “applies multi-step reasoning to answer questions about pictures,” said Chansanchai. The technology is being advanced by Li Deng, Xiaodong He and Jianfeng Gao from Microsoft Research’s Deep Learning Technology Center, along with Carnegie Mellon University researchers Zichao Yang and Alex Smola.
“The system takes in information a human set of eyes and brain would, looking at a scene’s action (if there is any) and the relationships among multiple visual objects,” said Chansanchai. “Though it may sound simple for humans, it’s a lot for a computer to learn language and to find answers in an image. But using deep neural networks, it can.”
Deng and his group are imbuing the system with the ability to pay attention, focus on visual cues and infer answers progressively to solve problems. It’s an advancement in human behavior modeling that was not possible a few years ago, he said.
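The multi-step, attention-based reasoning described above can be illustrated with a toy sketch. The snippet below is not the researchers' actual model; it assumes random placeholder weights and shows only the core idea: score each image region against a question vector, summarize the regions by their attention weights, and refine the question vector for the next reasoning hop.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(regions, query, W_r, W_q):
    """One attention hop: score each image region against the current
    query vector, then refine the query with the attention-weighted
    summary of the regions."""
    # regions: (num_regions, d) visual features; query: (d,) question encoding
    scores = np.tanh(regions @ W_r + query @ W_q)  # (num_regions, d)
    weights = softmax(scores.sum(axis=1))          # one weight per region
    context = weights @ regions                    # weighted summary, shape (d,)
    return query + context, weights                # refined query for next hop

rng = np.random.default_rng(0)
d, n = 8, 5
regions = rng.normal(size=(n, d))     # stand-in for CNN region features
query = rng.normal(size=d)            # stand-in for an encoded question
W_r = rng.normal(size=(d, d))
W_q = rng.normal(size=(d, d))

# Two reasoning hops: each hop re-attends to the regions with the
# progressively refined query, mimicking "inferring answers progressively."
q1, w1 = attention_step(regions, query, W_r, W_q)
q2, w2 = attention_step(regions, q1, W_r, W_q)
```

In the published systems this idea is trained end to end inside a deep neural network; the point here is only the shape of the computation, where each hop narrows the model's focus before an answer is predicted.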
Microsoft envisions that the work will lead to systems that can anticipate human needs and provide real-time recommendations. Systems that can answer questions based on visual information are also key to developing artificial intelligence tools, according to the company.
For example, the technology could improve bike safety.
"The system could power all kinds of applications, such as a warning system for bicyclists, with a mounted camera continuously taking in the environment around the cyclist," said Chansanchai.
The image analysis system builds on Microsoft's prior work on technologies that can automatically caption photos. "The researchers say that was an important step in getting to this point because descriptions of scenes, annotated by people, provide meaning to a picture. That helps train the computer to understand the image the way a person would," Chansanchai wrote.
Microsoft is increasingly banking on machine-learning systems as a way to help developers build a new generation of intelligent apps. Last month, the company announced the public beta of the Project Oxford Language Understanding Intelligent Service (LUIS), enabling coders to create applications that understand spoken instructions and search queries, similar to Microsoft’s own virtual assistant, Cortana. Project Oxford is a collection of machine-learning application programming interfaces (APIs) that also includes face and emotion detection, speech recognition and computer vision.
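An application built on a language-understanding service like LUIS typically receives scored intents and extracted entities, then acts on the top-scoring intent. The sketch below parses a sample response; the JSON shape, intent names, and entity types shown are illustrative assumptions modeled loosely on the beta-era service, not the actual API contract.

```python
import json

# Hypothetical response shaped like what an intent-recognition service
# such as LUIS might return; the schema here is an assumption.
sample = json.loads("""
{
  "query": "remind me to pick up milk at 5 pm",
  "intents": [
    {"intent": "CreateReminder", "score": 0.92},
    {"intent": "None", "score": 0.03}
  ],
  "entities": [
    {"entity": "5 pm", "type": "ReminderTime"}
  ]
}
""")

def top_intent(response, threshold=0.5):
    """Return the highest-scoring intent name, or None if no intent
    clears the confidence threshold."""
    best = max(response["intents"], key=lambda i: i["score"])
    return best["intent"] if best["score"] >= threshold else None

print(top_intent(sample))  # the highest-scoring intent in the sample
```

Thresholding on the intent score is a common guard: when the service is unsure, the app can fall back to a clarifying prompt rather than act on a low-confidence guess.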