How Voice Recognition Technology Works

Voice recognition technology has progressed immensely over the last few years, although research into voice recognition and its applications has been going on since the 1970s.

Voice recognition technology uses voice samples either to identify a speaker (voice identification) or to confirm a speaker’s claimed identity (voice verification). These are its two applications.
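
The difference between the two applications can be sketched in a few lines of Python. This is only an illustration, not how any particular product works: it assumes each voice has already been reduced to a short list of numbers, and the names (verify, identify, enrolled), the cosine similarity measure and the 0.95 threshold are all hypothetical choices made for the example.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two feature vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def verify(sample, claimed_template, threshold=0.95):
    """Verification: compare the sample against ONE claimed identity."""
    return cosine_similarity(sample, claimed_template) >= threshold

def identify(sample, enrolled, threshold=0.95):
    """Identification: compare the sample against EVERY enrolled template."""
    best_name, best_score = None, -1.0
    for name, template in enrolled.items():
        score = cosine_similarity(sample, template)
        if score > best_score:
            best_name, best_score = name, score
    # Return nobody if even the best candidate is not close enough.
    return best_name if best_score >= threshold else None

# Made-up templates and sample (assumed values, not real voice data).
enrolled = {
    "alice": [0.9, 0.1, 0.4, 0.2],
    "bob": [0.2, 0.8, 0.1, 0.6],
}
sample = [0.85, 0.15, 0.38, 0.25]

print("identified as:", identify(sample, enrolled))
print("verified as alice:", verify(sample, enrolled["alice"]))
print("verified as bob:", verify(sample, enrolled["bob"]))
```

Verification needs a single comparison because the speaker has already claimed an identity, while identification has to search every enrolled template.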

How does voice recognition technology work?
Voice recognition technology works by recording a sample of a person’s speech and digitizing it to create a unique voice print, or template. Each spoken word is broken up into discrete segments, each comprising several tones. These tones are captured and digitized to build the speaker’s voice template.
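
As a rough sketch of that pipeline, the Python below splits a digitized signal into fixed-length segments, measures how strong a handful of tones are in each segment, and averages the results into a template. Everything specific here is an assumption made for illustration: the 8000 Hz sample rate, the 200-sample segments, the four frequency bands and the synthetic "voice" built from pure tones. Real systems generally work from recorded audio and richer features.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed)
FRAME_LEN = 200     # samples per segment (assumed), i.e. 25 ms

def synth_voice(freqs, num_samples=1600):
    """Stand-in for a digitized recording: a sum of pure tones."""
    return [
        sum(math.sin(2 * math.pi * f * n / SAMPLE_RATE) for f in freqs)
        for n in range(num_samples)
    ]

def split_into_frames(samples, frame_len=FRAME_LEN):
    """Break the recording into discrete, non-overlapping segments."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]

def tone_energy(frame, freq):
    """Strength of one tone in a frame (a single-frequency DFT magnitude)."""
    re = sum(s * math.cos(2 * math.pi * freq * n / SAMPLE_RATE)
             for n, s in enumerate(frame))
    im = sum(s * math.sin(2 * math.pi * freq * n / SAMPLE_RATE)
             for n, s in enumerate(frame))
    return math.hypot(re, im) / len(frame)

def make_template(samples, bands):
    """Average each tone's strength over all frames to form the template."""
    frames = split_into_frames(samples)
    if not frames:
        raise ValueError("recording is shorter than one frame")
    return [sum(tone_energy(f, freq) for f in frames) / len(frames)
            for freq in bands]

BANDS = [200, 400, 800, 1600]  # Hz, illustrative only
template = make_template(synth_voice([200, 800]), BANDS)
print([round(x, 3) for x in template])
```

In this toy input the 200 Hz and 800 Hz entries dominate the template, because those are the only tones present in the synthetic signal.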

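A voice template draws on two components of a person’s voice, one physiological and one behavioral.

Physiological component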
The physiological component of a person’s voice is based on the shape of that person’s vocal tract, i.e. the shape of the larynx, nose and mouth. Biometric technology uses the waveform of the voice sample to digitally model the shape of an individual’s vocal tract. Because vocal tracts differ from person to person, each person’s voice print is treated as unique.

Behavioral component
This component represents the physical movement of the individual’s jaw, tongue and larynx. Variation in this movement changes how a person speaks, including their accent, tone, pitch, pronunciation and pace of talking.

Factors that can cause a template mismatch
Voice recognition systems are not infallible, and several factors can cause a system to return an erroneous result.

Differences in a person’s physical well-being and emotional state can cause their speech to change. For example, a cold can alter a person’s speech enough to create a mismatch between the stored template and the person’s current voice sample. The same can happen if the person is excited, depressed, stressed or on medication. Other factors, such as background noise, ambient temperature and the input device, can also affect the result.
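
A toy Python example can show why such changes lead to a mismatch. It assumes, purely for illustration, that a template is a short list of numbers, that a cold or background noise adds a fixed distortion to the sample, and that the system accepts a sample only when its distance from the stored template is within a tolerance. The vectors, the distort helper and the 0.2 tolerance are invented for this sketch.

```python
import math

def distort(template, distortion, strength):
    """Simulate a changed voice: add a scaled distortion to the template."""
    return [t + strength * d for t, d in zip(template, distortion)]

def matches(sample, template, max_distance=0.2):
    """Accept the sample only if it lies close enough to the template."""
    return math.dist(sample, template) <= max_distance

stored_template = [0.5, 0.0, 0.5, 0.0]  # made-up enrollment template
cold_or_noise = [0.0, 1.0, 0.0, 1.0]    # made-up distortion pattern

for strength in (0.0, 0.1, 0.25, 0.5):
    sample = distort(stored_template, cold_or_noise, strength)
    distance = math.dist(sample, stored_template)
    print(f"strength={strength:.2f} distance={distance:.3f} "
          f"match={matches(sample, stored_template)}")
```

With these numbers a mild distortion still matches, while stronger ones push the sample past the tolerance, so the genuine speaker is rejected.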

There are two main types of voice recognition systems:

Text dependent: These systems require the speaker to say a predetermined word or phrase, called a “pass phrase”. The pass phrase can be anything: the person’s name, their birthplace, their favorite color or a set of numbers. The spoken phrase is compared to a sample captured during enrollment.

Text independent: These systems are trained to recognize a person without a pass phrase. However, they require longer speech inputs from the speaker in order to identify distinct vocal characteristics such as pitch, cadence and tone.

Some systems combine the text-dependent and text-independent approaches.
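
The two modes, and a combination of them, might be sketched in Python as follows. This is a simplification under stated assumptions: it supposes that some earlier step has already produced a transcript of what was said and a feature vector for the voice, and the Enrollment class, the function names, the 0.2 distance tolerance and the 10-second minimum are all hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Enrollment:
    pass_phrase: str  # phrase chosen at enrollment
    template: list    # voice features captured at enrollment

def voice_matches(features, template, max_distance=0.2):
    """True if the voice features lie close to the enrolled template."""
    return math.dist(features, template) <= max_distance

def text_dependent_check(spoken_text, features, enrollment):
    """Require both the right pass phrase and a matching voice."""
    phrase_ok = spoken_text.strip().lower() == enrollment.pass_phrase.lower()
    return phrase_ok and voice_matches(features, enrollment.template)

def text_independent_check(features, seconds_of_speech, enrollment,
                           min_seconds=10.0):
    """Ignore the words, but insist on enough speech to judge the voice."""
    if seconds_of_speech < min_seconds:
        return False
    return voice_matches(features, enrollment.template)

def combined_check(spoken_text, features, seconds_of_speech, enrollment):
    """Accept if either mode is satisfied (one possible way to combine)."""
    return (text_dependent_check(spoken_text, features, enrollment)
            or text_independent_check(features, seconds_of_speech, enrollment))

enrollment = Enrollment("open sesame", [0.5, 0.0, 0.5, 0.0])
features = [0.52, 0.03, 0.48, 0.01]  # made-up features from a new sample

print(text_dependent_check("Open Sesame", features, enrollment))
print(text_independent_check(features, 3.0, enrollment))
print(combined_check("hello there", features, 12.0, enrollment))
```

The text-dependent check can work from a short utterance because the words are known in advance, whereas the text-independent check refuses to decide until it has heard enough speech.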

Owing to their many benefits, voice recognition systems are becoming steadily more popular.