CIRCUITS WITH EARS
-
NEW TECHNOLOGY
-
Most of us remember HAL, the unlikely star of Stanley Kubrick's 2001: A Space
Odyssey. When the film was made
in the late 1960's, the idea of conversing with a computer was pure science
fiction. But as has happened so often,
yesterday's science fiction is today's technology. Speech synthesis by
computers is familiar to most of us by now,
but what about the area of speech recognition?
While speaker-independent recognition of connected speech by a computer is
still a decade off, firm toeholds have
been established in developmental areas important to its eventual success.
We're going to take a look at some of
the speech-recognition systems now available, and what's being done to improve
the state of the art.
-
A simplified explanation
-
Speech-recognition systems rely on the matching of a spoken word to a stored
model of that word. In practice,
the way words are modeled is the key to the success and accuracy of a system -
as well as to its expense, speed,
and more.
-
Generally, the user of the system is asked to speak each word in the system's
limited vocabulary several
times. Those spoken samples are analyzed by a variety of techniques, and the
samples of each word or phrase
are compared to one another. Differences are minimized, similarities maximized,
and the resulting model (called a template) is stored in the system's memory.
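
The training step can be sketched in a few lines of Python. This is only an illustration, assuming that each spoken sample has already been reduced to a list of numbers of equal length (the representation is not specified here), and that simple averaging stands in for whatever method a real system uses to minimize differences. The name build_template is hypothetical.

```python
def build_template(samples):
    """Average several equal-length feature lists into one template.

    Averaging is an assumed stand-in for the training method; real
    systems may weight or combine the samples quite differently.
    """
    if not samples:
        raise ValueError("need at least one training sample")
    length = len(samples[0])
    if any(len(s) != length for s in samples):
        raise ValueError("all samples must have the same length")
    # zip(*samples) groups the first feature of every sample, then the second...
    return [sum(values) / len(samples) for values in zip(*samples)]


# Three hypothetical repetitions of one word, four features each.
repetitions = [
    [1.0, 4.0, 2.0, 0.0],
    [3.0, 4.0, 2.0, 0.0],
    [2.0, 4.0, 5.0, 0.0],
]
template = build_template(repetitions)
```

Features that agree across repetitions pass through unchanged, while features that vary are pulled toward their mean.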
-
Once the "training" has been completed, any word or phrase spoken into the
system is analyzed using the same
techniques used in deriving the templates. That analyzed data is compared with
the stored templates, and a score
assigned to each match. If no score is high enough to be accepted as a fit, the
system gives a "non-recognition" message or asks to have the word repeated. If
more than one word is scored high enough so that there are several possible
fits, the system can ask which is correct or ask to have the word repeated.
However, in about 98% of all attempts, a single word is recognized uniquely.
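
The three possible outcomes described above (no fit, one fit, several fits) might be sketched as follows. The function name decide and the acceptance value of 0.8 are assumptions made for illustration; they are not figures from any particular system.

```python
def decide(scores, accept=0.8):
    """Classify a set of match scores (higher means a better fit).

    scores maps each vocabulary word to its score. The accept level
    is an illustrative assumption.
    """
    fits = sorted(
        (word for word, score in scores.items() if score >= accept),
        key=lambda word: scores[word],
        reverse=True,
    )
    if not fits:
        return ("reject", None)        # "non-recognition": ask for a repeat
    if len(fits) == 1:
        return ("accept", fits[0])     # a single word recognized uniquely
    return ("ambiguous", fits)         # several fits: ask which is correct


result = decide({"six": 0.9, "fix": 0.5, "sick": 0.4})
```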
-
Since it's important in matching a word to know precisely where a word begins
or ends, there is usually some hardware or software incorporated to give that
information. Also, there is usually some provision for normalizing the time
distribution of the word - that is to say, the duration of the voiced sounds
within the word. Without time normalization, variation in the ways we pronounce
a given word would make matching it against its template very difficult.
-
Generally, both time-dependent and time-independent analyses are done. The
time-independent analysis is usually concerned with the spectral distribution
of the word. For example, a spectral distribution analysis (called a histogram)
of the word six would show that the word has a lot of s sound within it, but
not that the s sound occurs twice, once at each end of the word. Rather, the
spectral histogram would show how much energy appeared at any one frequency
during the speaking of the word. In practice, narrow bands of frequencies are
usually sampled. There is some progress in the Fourier analysis of speech
through new hybrid analog/digital microprocessor technology, but that's a
subject best left until it can be covered somewhat more meaningfully.
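
To see why a spectral histogram cannot tell where in the word a sound occurs, consider this small sketch. It assumes the word has been cut into short frames, each giving the energy found in a few named bands; the band names and energy values are invented for the example.

```python
from collections import Counter


def spectral_histogram(frames):
    """Sum the energy in each band over all frames, discarding time order."""
    totals = Counter()
    for frame in frames:
        for band, energy in frame.items():
            totals[band] += energy
    return dict(totals)


# Hypothetical frames for "six": s, i, k, s (the s sounds are high-band).
six = [{"high": 5}, {"low": 3}, {"mid": 1}, {"high": 5}]
# The same frames shuffled so that both s sounds come first.
shuffled = [{"high": 5}, {"high": 5}, {"low": 3}, {"mid": 1}]

same = spectral_histogram(six) == spectral_histogram(shuffled)
```

Both orderings give the same histogram, which is exactly why a time-dependent analysis is needed as well.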
-
How it works
-
Let's take a look at the elements of most of today's speech-recognition
hardware in a little more detail (see Fig. 1). The first step is to provide as
favorable a signal-to-noise ratio as
possible. A noise-cancelling mike close to the speaker's mouth (often on a
headset) and push-to-talk operation help to accomplish that.
-
Also, there is usually some preemphasis and shaping of the incoming audio to
help eliminate background noise and help accentuate some of the weaker
segments of the speech spectrum. Some form of automatic gain control is usually
used, either in the form of an analog compressor or as a part of the computer's
task.
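
When preemphasis and gain control are done by the computer rather than in analog hardware, they might look something like the sketch below. It assumes a common first-order preemphasis filter and a crude peak-based gain adjustment; the coefficient 0.95 and both function names are illustrative assumptions, not details of any system described here.

```python
def pre_emphasize(samples, coeff=0.95):
    """First-order preemphasis: y[n] = x[n] - coeff * x[n-1].

    Slowly changing (low-frequency) content is attenuated, so the
    weaker high-frequency parts of speech stand out more.
    """
    out = []
    previous = 0.0
    for x in samples:
        out.append(x - coeff * previous)
        previous = x
    return out


def normalize_gain(samples, target_peak=1.0):
    """Scale a block of samples so its largest magnitude equals target_peak."""
    peak = max((abs(x) for x in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    return [x * target_peak / peak for x in samples]


steady = pre_emphasize([1.0, 1.0, 1.0, 1.0])        # slow signal is reduced
rapid = pre_emphasize([1.0, -1.0, 1.0, -1.0])       # fast signal is boosted
leveled = normalize_gain([0.5, -0.25])
```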

-
Since spectrum analysis is time independent, and since it can be used to
indicate whether or not speech is present, the incoming speech is first
analyzed for energy content in each voice
frequency spectrum sub-band of interest. While the energy content in each
sub-band is significant, the amplitude variations of speech overall are
generally of no help in analyzing speech; instead, zero-crossings have been
found to convey the most significant speech information. Those are counted to
give frequency information, although in some methods the interval between zero
crossings is counted instead.
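
Counting zero crossings is easy to sketch in software. The example below assumes the speech has been digitized into a list of samples; the 8000-sample-per-second rate and the 100-Hz test tone are arbitrary choices for the demonstration.

```python
import math


def count_zero_crossings(samples):
    """Count sign changes, ignoring samples that are exactly zero."""
    crossings = 0
    previous_sign = 0
    for x in samples:
        sign = (x > 0) - (x < 0)
        if sign == 0:
            continue
        if previous_sign != 0 and sign != previous_sign:
            crossings += 1
        previous_sign = sign
    return crossings


def estimate_frequency(samples, sample_rate):
    """A steady tone crosses zero twice per cycle, so f is crossings / (2 * T)."""
    duration = len(samples) / sample_rate
    return count_zero_crossings(samples) / (2.0 * duration)


# One second of a 100-Hz tone sampled 8000 times per second.
tone = [math.sin(2 * math.pi * 100 * n / 8000 + 0.1) for n in range(8000)]
estimate = estimate_frequency(tone, 8000)
```

For a pure tone the estimate should land close to the true frequency. Real speech is a mixture of frequencies, which is one reason the counting is normally done separately in each sub-band.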
-
In addition to the energy content in each frequency sub-band, some measurement
of the rate of change of speech spectrum energy (rapid for explosive sounds,
for example, and gradual for vowels) might also be made.
-
Once the end of the word has been detected, the word is framed, defining its
beginning and end, and time-related acoustic phenomena are analyzed. An
acoustic-feature detector extracts key features, including pauses, vowels and
vowel-like sounds, formants, and so on. Then the word is divided into a number
of equal parts (Threshold Technology, for example, uses 16 samples that are
spaced equally in time) to obtain a time-normalized pattern of those key
features.
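
Dividing a framed word into a fixed number of equal parts can be sketched as shown here. The sketch assumes that the word arrives as a list of per-frame measurements and that each part is represented by its average; only the figure of 16 parts comes from the text, and the function name is hypothetical.

```python
def time_normalize(frames, segments=16):
    """Reduce a word of any length to `segments` values, equally spaced in time.

    Each output value is the average of the frames falling in that
    segment. Words shorter than `segments` frames reuse frames.
    """
    n = len(frames)
    if n == 0:
        raise ValueError("empty word")
    pattern = []
    for i in range(segments):
        start = i * n // segments
        end = max((i + 1) * n // segments, start + 1)  # never an empty slice
        chunk = frames[start:end]
        pattern.append(sum(chunk) / len(chunk))
    return pattern


quick = time_normalize(list(range(32)))   # a word spoken quickly: 32 frames
slow = time_normalize(list(range(64)))    # a word spoken slowly: 64 frames
```

Both versions come out as 16-point patterns, so they can be compared against the same template point by point.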
-
Those patterns are compared with the templates stored in memory; the algorithms
that are used for those comparisons are a key difference between the various
speech-recognition systems. In all systems, the input word is
compared against the stored vocabulary, and the similarities and differences
are weighted into a correlation score. Those scores might be expressed as a
product, a vector distance, a probability evaluation, or a figure of merit. The
score is a numerical characterization of how good the match is.
-
Most systems require that the match or "fit" exceed some minimum value in order to be valid. Larger vocabularies, or more critical applications, often require a higher minimum value.
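
Two of the scoring methods mentioned, a vector distance and a product, might be sketched as follows, together with a minimum-value test. The normalized product used here, the minimum of 0.9, and all of the names are assumptions made for the example; actual manufacturers' algorithms are not described in that much detail.

```python
import math


def euclidean_distance(a, b):
    """Vector distance between pattern and template: smaller is better."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalized_product(a, b):
    """Dot product scaled to the range -1..1: larger is better."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(pattern, templates, minimum=0.9):
    """Return the best-scoring word, or None if no fit reaches the minimum."""
    if not templates:
        return None
    scores = {w: normalized_product(pattern, t) for w, t in templates.items()}
    word = max(scores, key=scores.get)
    return word if scores[word] >= minimum else None


templates = {"yes": [1.0, 0.0], "no": [0.0, 1.0]}   # toy two-feature templates
clear = best_match([1.0, 0.1], templates)
unclear = best_match([1.0, 1.0], templates)
```

Raising the minimum trades more rejected words for fewer wrong ones, which is why larger vocabularies and critical applications tend to call for a higher value.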
-
Speaker dependence
-
Let's consider the problem of recognizing more than one voice. For the
speaker-dependent recognition systems available today (speaker-dependent
means that the system can only effectively recognize words spoken by the person
who trained it), there is an easy answer: trade off vocabulary for more voices.
A system capable of recognizing
one speaker and a vocabulary of eighty words could just as well accommodate two
speakers, each training it to a list of forty words - or eight speakers and ten
words, five speakers and sixteen words, and so on.
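
The arithmetic behind that trade-off is simple enough to sketch. The example assumes that the system has a fixed number of template slots (80 here), that each speaker needs one template per word, and that every speaker trains the same number of words; the function name is hypothetical.

```python
def speaker_vocabulary_splits(template_slots):
    """List (speakers, words per speaker) pairs that exactly fill the store."""
    return [
        (speakers, template_slots // speakers)
        for speakers in range(1, template_slots + 1)
        if template_slots % speakers == 0
    ]


splits = speaker_vocabulary_splits(80)
```

The combinations quoted above (1 and 80, 2 and 40, 5 and 16, 8 and 10) all appear in the resulting list.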
-
Bell Labs has successfully made speech-recognition systems capable of
recognizing isolated "utterances" spoken by designated speakers. Those
systems use eighth-order LPC (Linear Predictive Coding) analysis. You may
recognize LPC analysis as the technique used by Texas Instruments to translate
speech into much-compressed data, and back again, in their Speak & Spell and
elsewhere.
-
The object of the Bell Labs investigation is an automatic directory assistance
system, but they found that the limited vocabulary and speaker dependence of
contemporary speech recognizers made the recognition of spoken names
impractical, if not impossible.
-
Limiting the vocabulary to the "names" of the letters used in spelling names
makes the task more manageable, but there are still drawbacks. One is that the
names of the letters are short compared to most words and so they don't give a
recognizer much to go on. There are also many letters whose names sound a great
deal like each other.
-
Bell Labs found an answer. They decided that even if they don't know for
certain what a given letter is, it's enough to know that it's one of, say, five
probable candidates. A string of six letters gives enough information that an
exact match to a recorded directory listing can be made most of the time, at
least under experimental conditions.
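
The idea of matching uncertain letters against a directory can be sketched like this. The directory entries and candidate sets below are invented for the example, and the function is a guess at the general principle rather than a description of the Bell Labs software.

```python
def matching_listings(candidate_sets, directory):
    """Return listings whose leading letters fit the candidates at every position.

    candidate_sets[i] is the set of letters the recognizer considers
    probable for the i-th spoken letter.
    """
    matches = []
    for name in directory:
        if len(name) < len(candidate_sets):
            continue  # listing is shorter than the spelled string
        if all(name[i] in letters for i, letters in enumerate(candidate_sets)):
            matches.append(name)
    return matches


# Hypothetical directory and a speaker spelling B-A-K-E-R.
directory = ["BAKER", "PARKER", "DAVIES", "TAYLOR"]
heard = [
    {"B", "D", "E", "P", "T"},   # letters whose names sound alike
    {"A", "J", "K"},
    {"K", "A", "J"},
    {"E", "B", "D"},
    {"R", "I"},
]
found = matching_listings(heard, directory)
```

No single letter is identified with certainty, yet only one listing survives all five positions.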
But what about speaker independence?
-
Slurring
-
In the same way that a system maximizes the similarities and minimizes the
differences between successive samples of a spoken word during training,
samples of the same word spoken by different individuals produce an even broader
template. In that way, differences between one speaker's articulation of a word
and that of another are slurred together. By extension, a system could become
speaker-independent if any such thing as a "universal" template (an absolute
set of similarities in the ways all people say a word) could be found.
-
Unfortunately, slurring also blurs the recognition capabilities of a system by
making dissimilar words sound more like each other. It may become impossible to
discriminate between similar-sounding words.
Just as today's speaker-dependent systems are evaluated in terms of their
accuracy - a 98% matching rate, for example - future speaker-independent systems
may be rated both for overall accuracy and for the percentage of the population
that the accuracy figure applies to.
-
Speaker independence is the first priority in improving coming generations of
speech-recognition systems, according to most manufacturers we talked to. One
promising approach involves producing speaker-adaptive systems that in some way
modify stored templates to help adjust them into a closer match with the
particular voice characteristics of the speaker. For example, a brief initial
sample of the voice might determine if it is that of a man, woman, or child and
whether it is basso, alto, tenor, soprano, etc. The spectrum's sub-band energy
distribution would obviously shift slightly as the pitch of the speaker's voice
shifts, and weighting factors could be introduced into the analysis to help
correct for differences between speakers.
-
Connected speech
-
We have seen that time analysis of speech for today's isolated word-recognition
systems requires proper framing of the word, which means recognizing its
beginning and its end. But normal speech is connected speech, with the end of
one word often indistinguishable from the beginning of the next.
-
IBM has been working on the problem of recognizing words and phrases in the
midst of continuous speech. Using a large mainframe computer and some advanced
techniques including spectrographic analysis, they've been able to take text
derived from a 1000-word vocabulary and, with a speaker reading the text at a
normal pace, transcribe the spoken text into printed copy with better than 90%
accuracy.
-
What's here and what's coming
-
Speech-recognition equipment is available at every level from single boards
an experimenter can connect to his computer to huge mainframe systems like
those in use at Bell Labs and IBM. A great deal of attention is being given to
continued developments in the sophistication and accuracy of voice-entry
terminals, which accept spoken rather than keyboard-entered data.
-
Threshold Technology, Incorporated and Centigram Corporation are two of the
leaders in speech terminals. A newcomer to that area is one of the pioneers in
speech recognition for experimenters, Heuristics Incorporated.
-
New on the experimental end are the Cognivox by Voicetek and the VET/1 and
VET/2 from Scott Instruments.
-
Commercial speech-recognition systems are also made by Verbex Corporation
(formerly Dialog Systems Incorporated), Scope Electronics Incorporated, and
Interstate Electronics, as well as Perception Technology Incorporated.
MIKE

-
How MIKE operates
-
Mike learns and recognizes patterns derived from spectrum-analysis data. When
learning a word, Mike stores patterns in memory for future reference. When
attempting to recognize a word, Mike compares the incoming pattern to each
reference pattern and generates a set of closeness-of-fit scores. Above a
certain threshold, the highest score is taken to indicate successful
recognition.
-
The spectrum analysis is performed every 25 milliseconds to measure the energy
in 19 logarithmically spaced frequency bands over the 300-Hz to 3,000-Hz range.
Mike's approach to that analysis is unique. The data to be analyzed is spun
past a single filter 16 times, each time at a different frequency, so that the
frequency of interest matches the center frequency of the filter. That is in
contrast to the conventional approach, which involves using 16 individually
tuned filters operating in parallel.
-
The spectrum-analysis data is digitized and passed to the word-framing process.
When a sufficient level of spectral activity is detected, the beginning of a
word is marked. When that activity falls below a threshold, the end of the word
is marked. Since Mike is an isolated-word recognition device, a silent interval
of approximately 100 milliseconds is required at the end of a word to frame it
adequately.
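
The framing rule can be sketched as follows, assuming one activity measurement per 25-millisecond analysis frame, so that the 100-millisecond silent interval corresponds to four quiet frames. The threshold value and the function name are assumptions for the example.

```python
def frame_words(frame_energies, threshold, silence_frames=4):
    """Return (start, end) frame indexes for each word, with end exclusive.

    A word begins at the first frame at or above the threshold and ends
    once `silence_frames` consecutive quiet frames have been seen.
    """
    words = []
    start = None
    quiet = 0
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = i          # beginning of a word
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= silence_frames:
                words.append((start, i - quiet + 1))
                start = None
                quiet = 0
    if start is not None:
        # Assumption: treat the end of the input as the end of the word.
        words.append((start, len(frame_energies) - quiet))
    return words


energies = [0, 0, 5, 6, 0, 7, 0, 0, 0, 0, 0, 4, 4]
found = frame_words(energies, threshold=3)
```

Note that the single quiet frame inside the first word does not split it, because it is shorter than the required silent interval.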
-
Noise-canceling and time-base normalization are integral parts of the
word-framing process. During silent intervals, constant (ambient) noise is
measured; during word framing, that constant noise signal is subtracted from
the input signal. When a word or segment of sound has been isolated, it is
normalized to a fixed time-duration to compensate for different speaking rates.
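
The noise-canceling step might be sketched like this, assuming each frame is a list of band energies and that the ambient level is taken as the average of the frames measured during silence. Clamping negative results to zero is an assumption of this sketch, not something stated above.

```python
def estimate_noise(silent_frames):
    """Average the band energies measured while nobody is speaking."""
    if not silent_frames:
        raise ValueError("need at least one silent frame")
    bands = len(silent_frames[0])
    count = len(silent_frames)
    return [sum(frame[b] for frame in silent_frames) / count for b in range(bands)]


def subtract_noise(frame, noise):
    """Remove the constant noise from one frame, never going below zero."""
    return [max(energy - level, 0.0) for energy, level in zip(frame, noise)]


noise = estimate_noise([[1.0, 2.0], [3.0, 2.0]])   # two silent frames, two bands
cleaned = subtract_noise([5.0, 1.0], noise)
```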
-
The pattern-generation process further operates on the framed word to extract
features of interest and to reduce it to a string of approximately 240 bits.
The pattern is then generated using a proprietary mapping algorithm.
-
In training Mike, patterns are logically ORed with the patterns of previous
repetitions of the word being learned. Typically, two or three repetitions of
each vocabulary word suffice for reliable recognition. When Mike is attempting
to recognize, patterns are compared by ANDing them in turn with each of the
previously learned reference patterns. The matching 1 bits in each comparison
are tallied, giving a set of scores, one per reference pattern.
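
The OR and AND operations can be sketched with ordinary integers standing in for the bit patterns. The 4-bit values below are toy examples (the real patterns are about 240 bits long), and the mapping that produces them is proprietary, so it is not shown.

```python
def train(reference, pattern):
    """Fold one more repetition into the reference by ORing the bits."""
    return reference | pattern


def score(reference, pattern):
    """AND the patterns and tally the 1 bits they have in common."""
    return bin(reference & pattern).count("1")


# Two hypothetical repetitions of one word during training.
reference = train(0b1010, 0b0110)
# A new utterance compared against the learned reference.
fit = score(reference, 0b0111)
```

Because ORing can only add bits, each extra repetition broadens the reference pattern.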
-
Mike recognizes a word if its score is both above a threshold and greater than
the next highest score by a prescribed increment. A code indicating the
identity of the recognized patterns is transmitted to a host device. If a word
is framed but does not meet the recognition criteria, a no-recognition code is
transmitted.
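
The two-part recognition rule might look like this in code. The threshold and increment values are invented for the example, and returning None stands in for the no-recognition code.

```python
def recognize(scores, threshold, increment):
    """Return the recognized word, or None for no recognition.

    The best score must be above the threshold and must beat the
    next highest score by at least the prescribed increment.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_word, best = ranked[0]
    if best <= threshold:
        return None
    if len(ranked) > 1 and best - ranked[1][1] < increment:
        return None
    return best_word


clear = recognize({"go": 200, "no": 150}, threshold=180, increment=20)
too_close = recognize({"go": 200, "no": 190}, threshold=180, increment=20)
too_weak = recognize({"go": 170, "no": 100}, threshold=180, increment=20)
```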
-
Centigram's recognition approach is patented in the United States (Patent
number 4,087,630) and patents have been applied for in 15 other countries.
Copyright 1979 by Centigram Corp., Sunnyvale, CA. Reprinted by permission.



-
Radio Electronics JUNE 1981
-
R-E