We currently see a large demand for automated handling of voice conversations. Companies wish to evolve their traditional interactive voice response (IVR) solutions that still use DTMF touch-tone to capture customer input. Instead, they would like customers to be able to speak with the IVR. This blog post explains the various technologies that can be used in this context:
Keyword spotting
Keyword spotting means detecting a defined keyword within spoken text. So if the keyword is “car”, and the spoken sentence is “My car’s license plate is AB789CG”, then the result would be positive.
If the user says “My vehicle’s license plate is AB789CG”, the keyword would not be detected, as the exact word “car” is never spoken, so you would end up with many false negatives.
If the user says “my card is an Ace”, the result might be positive, as “card” sounds similar to “car”. If the keyword is “Bee”, a spoken plate such as AB789CG would also be positive, as the letter “B” is pronounced in the same way as “Bee”. Both scenarios are false positives.
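The behaviour described above can be imitated on text with a rough Python sketch. Real keyword spotters compare audio, not spelling; here the string similarity from the standard `difflib` module stands in for acoustic similarity. The function name `spot_keyword` and the 0.8 threshold are assumptions chosen only for illustration.

```python
import re
from difflib import SequenceMatcher


def spot_keyword(keyword, utterance, threshold=0.8):
    """Return the first word that is 'close enough' to the keyword, else None.

    Spelling similarity is used here as a crude stand-in for how similar
    two words sound; a real keyword spotter works on the audio signal.
    """
    words = re.findall(r"[a-z]+", utterance.lower())
    for word in words:
        if SequenceMatcher(None, keyword.lower(), word).ratio() >= threshold:
            return word
    return None


# Exact keyword is spoken: detected.
print(spot_keyword("car", "My car's license plate is AB789CG"))      # car
# A synonym is spoken: missed (false negative).
print(spot_keyword("car", "My vehicle's license plate is AB789CG"))  # None
# A similar word is spoken: wrongly detected (false positive).
print(spot_keyword("car", "my card is an Ace"))                      # card
```

Raising the threshold removes the “card” false positive in this sketch, but it does nothing for the “vehicle” false negative, which is the basic limitation of matching on a fixed keyword.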
Speech Recognition
Automatic Speech Recognition (ASR) transcribes the entire text of a spoken sentence. In the above example, the user could say “I would like to insure my car that has the license plate umm, AB seven hundred eighty nine C G”, or “I would like to insure my car with plate AB seven eight nine CG”. For a human being, both sentences are equivalent, but an ASR engine would not return the same result for both. The ASR might also return a result such as “I would like to insure my car with plate A bee seven eight nine see gee”, as we haven’t told the ASR what to expect.
ASR engines typically integrate with an IVR system through MRCP (Media Resource Control Protocol).
Speech Grammars
A speech grammar is a feature of speech recognition that lets you tell the ASR engine what to expect. In our example, you could specify that a license plate corresponds to “letter, letter, number, number, number, letter, letter”. With a speech grammar, any of the sentences above would yield the same result, “I would like to insure my car with plate AB789CG”, regardless of how it is spoken. The grammar eliminates fillers such as “umm” and “ah” from the ASR output and replaces “bee”, “gee” and “see” with the letters B, G and C.
Speech grammars are essential if you would like to recognize structured content, such as sequences of numbers and letters (IDs, account numbers, amounts, license plates and the like). They are also essential if you would like to recognize dates and currencies. You can also define your own vocabularies, such as the cities of a country, last names, first names or road names. The better your speech grammars are defined, the higher the recognition rate will be. For example, if you tell a grammar to expect a first name and the user says “my name is Paris Hilton”, then the word “Paris” would be interpreted as a first name and not as the city.
Another example is “I see a bee” (i.e., I see a flying insect). With a speech grammar such as “letter, letter, letter, letter”, this would be recognized as “ICAB”.
A standard for speech grammars is SRGS, the W3C Speech Recognition Grammar Specification.
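To make the effect of a grammar concrete, here is a minimal Python sketch that normalizes transcripts against a “letters and digits” pattern. It is not SRGS, and it is not how an ASR engine applies a grammar internally (a real engine constrains the recognition itself). It only post-processes text, and all names (`FILLERS`, `LETTER_WORDS`, `read_number`, `recognize`) and the small word lists are assumptions made for this illustration.

```python
import re

FILLERS = {"um", "umm", "uh", "ah"}
# A few spoken forms an ASR might return for letters (deliberately incomplete).
LETTER_WORDS = {"bee": "B", "be": "B", "see": "C", "sea": "C", "gee": "G"}
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

PLATE = r"[A-Z]{2}\d{3}[A-Z]{2}"   # letter, letter, number x3, letter, letter
FOUR_LETTERS = r"[A-Z]{4}"


def read_number(tokens, i):
    """Read a spoken number at tokens[i]; return (digits, next_index) or None."""
    n, tok = len(tokens), tokens[i]
    if tok.isdigit():
        return tok, i + 1
    if tok in UNITS and i + 1 < n and tokens[i + 1] == "hundred":
        value, width, i = UNITS[tok] * 100, 3, i + 2   # "seven hundred ..."
    elif tok in UNITS:
        return str(UNITS[tok]), i + 1                  # single spoken digit
    elif tok in TENS:
        value, width = 0, 2                            # "eighty nine" -> 89
    else:
        return None
    if i < n and tokens[i] in TENS:
        value, i = value + TENS[tokens[i]], i + 1
    if i < n and UNITS.get(tokens[i], 0) > 0:
        value, i = value + UNITS[tokens[i]], i + 1
    return f"{value:0{width}d}", i


def recognize(spoken, pattern):
    """Normalize a spoken fragment and return it only if it fits the pattern."""
    tokens = [t for t in re.findall(r"[a-z]+|\d", spoken.lower())
              if t not in FILLERS]
    out, i = [], 0
    while i < len(tokens):
        number = read_number(tokens, i)
        if number:
            digits, i = number
            out.append(digits)
        else:
            # Known homophone ("bee" -> B); otherwise, as a simplification,
            # the token is treated as spelled-out letters ("ab" -> AB).
            out.append(LETTER_WORDS.get(tokens[i], tokens[i].upper()))
            i += 1
    symbols = "".join(out)
    return symbols if re.fullmatch(pattern, symbols) else None


print(recognize("umm, AB seven hundred eighty nine C G", PLATE))  # AB789CG
print(recognize("A bee seven eight nine see gee", PLATE))         # AB789CG
print(recognize("I see a bee", FOUR_LETTERS))                     # ICAB
```

The sketch is applied to the plate fragment only. A full grammar would also describe the carrier phrase (“I would like to insure my car with plate ...”) and mark the plate as a slot within it.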
Natural Language Understanding
Natural Language Understanding (NLU) is used both in chatbots and in conversational IVRs. An NLU engine is capable of recognizing intents and entities. An entity is an object the user refers to: in the above example, it doesn’t matter whether the user says “my car”, “my vehicle” or “my Audi”, as each is an object of the type “car”. An intent is what a user wants to do. So whether a user says “I want to provide coverage for my car”, “I would like to insure my car”, “I would like to get an insurance policy for my car” or “I would like to get a car policy”, the result is the same: the intent to get insurance. The advantage of NLU is that companies don’t need to anticipate all possible ways of saying things, as this is done by the NLU engine.
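The shape of an NLU result, one intent plus a list of entities, can be shown with a toy Python sketch. The lookup tables below are hard-coded and the names (`ENTITY_SYNONYMS`, `INTENT_CUES`, `parse`, `get_insurance`) are assumptions for illustration; a real NLU engine learns these associations from example utterances, which is exactly why you do not have to list every phrasing yourself.

```python
import re

# Hypothetical, hand-written tables standing in for a trained NLU model.
ENTITY_SYNONYMS = {"car": {"car", "vehicle", "audi"}}
INTENT_CUES = {"get_insurance": {"insure", "insurance", "coverage", "policy"}}


def parse(utterance):
    """Return the detected intent and entity types for one utterance."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    intent = next(
        (name for name, cues in INTENT_CUES.items() if words & cues), None
    )
    entities = sorted(
        etype for etype, synonyms in ENTITY_SYNONYMS.items() if words & synonyms
    )
    return {"intent": intent, "entities": entities}


for sentence in (
    "I want to provide coverage for my car",
    "I would like to insure my vehicle",
    "I would like to get an insurance policy for my Audi",
):
    print(parse(sentence))  # {'intent': 'get_insurance', 'entities': ['car']}
```

All three phrasings map to the same structured result, which is what the downstream IVR or chatbot logic acts on.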
Voice Biometrics
Voice biometrics take an audio “fingerprint” of a user and then compare incoming speech with the fingerprints on record to verify whether a user is who they claim to be, based on the sound, speed, rhythm, frequency distribution and other characteristics of each individual voice.
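The comparison step can be sketched in Python under one common assumption: that each voice has already been reduced to a numeric feature vector. How that vector is derived from audio is out of scope here. The three-element vectors, the names `cosine_similarity` and `verify_speaker`, and the 0.8 threshold are all made up for illustration and do not reflect any particular product.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def verify_speaker(enrolled, sample, threshold=0.8):
    """Accept the claimed identity if the sample is close to the stored print."""
    return cosine_similarity(enrolled, sample) >= threshold


# Made-up vectors: a stored voiceprint and two incoming samples.
enrolled_print = [0.90, 0.10, 0.40]
same_speaker = [0.85, 0.15, 0.35]
other_speaker = [0.10, 0.90, 0.20]

print(verify_speaker(enrolled_print, same_speaker))   # True
print(verify_speaker(enrolled_print, other_speaker))  # False
```

The threshold trades false accepts against false rejects, so in practice it would be tuned on recorded data, not picked by hand.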
AI (artificial intelligence) or not?
Artificial intelligence and neural networks are frequently used to perform the recognition tasks described above on spoken or typed text. Their advantage is that they are very versatile and can be continuously updated and improved with additional data samples. The downside is that they usually require more processing power (CPU or GPU) and might be slower than traditional explicit speech models that use techniques such as hidden Markov models and Fourier transforms.
Expertflow
Expertflow has extensive experience with a range of cognitive tools such as spaCy, Perspective API, Microsoft Cognitive Services, NLTK, Rasa, Google Dialogflow, IBM Watson, Sestek and Duckling, and has specialized data science engineers fluent in Python and neural networks. We understand the pros and cons of each technology and which technology makes the most sense in which scenario. We are reluctant to commit to a particular vendor at the beginning of a project, as customer requirements very often change over the course of a project, and being stuck with one particular technology might prevent ultimate success.