The voice recognition packages available today base their recognition process on a probabilistic table of word pairs and triples, which they use to map spoken phonemes to written text.
This knowledge of likely word relationships allows the software to predict that (to use a phrase from one of our test documents) “sagebrush strewn” is written far more often than “sagebrush sewn” and so is more likely to be the correct transcription.
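To make the idea concrete, here is a minimal sketch of how such a word-pair table might settle that choice. The counts are invented for illustration; real recognizers estimate them from large text corpora and combine them with acoustic evidence.

```python
# A minimal sketch of how a bigram (word-pair) table might rank two
# candidate transcriptions. The counts below are invented; real systems
# estimate them from large text corpora.

bigram_counts = {
    ("sagebrush", "strewn"): 120,   # hypothetical corpus counts
    ("sagebrush", "sewn"): 3,
}
unigram_counts = {"sagebrush": 150}

def bigram_probability(first, second):
    """Estimate P(second | first) from raw counts."""
    pair = bigram_counts.get((first, second), 0)
    total = unigram_counts.get(first, 0)
    return pair / total if total else 0.0

def score(candidate):
    """Multiply word-pair probabilities across the candidate phrase."""
    words = candidate.split()
    p = 1.0
    for first, second in zip(words, words[1:]):
        p *= bigram_probability(first, second)
    return p

candidates = ["sagebrush strewn", "sagebrush sewn"]
print(max(candidates, key=score))  # "sagebrush strewn" -- the likelier pair wins
```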
Voice recognition, and other complex pattern recognition tasks such as computer vision and document search, are only in the beginning stages of understanding and taking advantage of query context.
This is something that humans do instinctively. We unconsciously rely on clues such as speaker identity, location, current activities and the general topic of conversation surrounding a specific utterance to help us fill in missing sounds and words.
This isn't all that different from the “frame model” that's been part of artificial intelligence discussions since the 1960s.
For example, there are things you can expect of someone who's in the “paying for dinner” frame that's a subset of the “sitting in a restaurant” frame that's nested in the “social situation” frame. And those expectations help you understand that person's speech more accurately than if you did not have those contexts.
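A toy rendering of that nesting might look like the following; the frames and their expected phrases are invented, and the only point is that an inner frame inherits and refines the expectations of the frames that enclose it.

```python
# A toy rendering of the "frame model" idea: nested frames, each adding
# expectations that narrow what a listener is likely to hear. The frames
# and phrases are invented for illustration.

frames = {
    "social situation": {"parent": None, "expected": {"please", "thank you"}},
    "sitting in a restaurant": {"parent": "social situation",
                                "expected": {"menu", "waiter", "order"}},
    "paying for dinner": {"parent": "sitting in a restaurant",
                          "expected": {"check", "tip", "credit card"}},
}

def expectations(frame_name):
    """Collect expected vocabulary from a frame and every frame enclosing it."""
    expected = set()
    while frame_name is not None:
        frame = frames[frame_name]
        expected |= frame["expected"]
        frame_name = frame["parent"]
    return expected

print(expectations("paying for dinner"))
# Includes "check" and "tip" as well as "menu" and "thank you",
# reflecting how the nested contexts stack up.
```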
It follows that major progress in voice recognition could come from incorporating context: face recognition, calendar coordination, and other situational data such as the contents of the chart the conference room projector is displaying at the moment or the headings in the current document.
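As a purely speculative sketch (not any existing product's API), contextual vocabulary drawn from, say, the projected slide or the open document's headings could be used to re-rank candidate transcriptions. All names and weights here are invented for illustration.

```python
# Re-rank candidate transcriptions by how well they overlap with
# contextual vocabulary (hypothetically harvested from a projected chart
# or document headings). Names and weights are invented.

context_vocabulary = {"revenue", "quarter", "forecast", "sagebrush"}

def context_boost(candidate, weight=2.0):
    """Multiply the score by `weight` for each candidate word found in context."""
    boost = 1.0
    for word in candidate.split():
        if word in context_vocabulary:
            boost *= weight
    return boost

def rerank(candidates_with_scores):
    """candidates_with_scores: list of (phrase, acoustic/language-model score)."""
    return sorted(
        candidates_with_scores,
        key=lambda pair: pair[1] * context_boost(pair[0]),
        reverse=True,
    )

print(rerank([("fourth quarter forecast", 0.4), ("force quart or cast", 0.5)]))
# The contextually plausible phrase now outranks the acoustically stronger one.
```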
Advances in metadata integration and ubiquitous networking may have as much to do with what comes next in voice recognition as advances in acoustics or probability.