- Published by E and MM, November 1981, pages 51/52, by Alan Davies
JABBERWOCKY!...
- 'Twas brillig and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.
'Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!'
from 'jabberwocky' - Lewis Carroll
Reproduced by kind permission of Frederick Warne Publishers.
... or is there more to this than meets the ear?
- In this article we take a look at some methods of electronic speech production and manipulation
and investigate some applications of this technology in the recording industry and the rapidly
expanding market for the 'talking chip'
-
Undeniably some of the most interesting and frequently employed special effects in recent
popular music have been various treatments and subtle uses of the human voice and the
qualities which it possesses. A very good example of the use of the untreated voice is
David Bowie's 'Ashes to Ashes'. A close examination of this reveals an extremely compelling
use of background voices which seem to half chant, half whisper the words of the song.
This has the effect of drawing the listener's interest to it in just the same way as two people
whispering across the room arouse curiosity as to what is being said. This ploy commands
attention and is certainly a powerful musical 'hook'.
Another common effect which trades upon vocal qualities is the ubiquitous Wah-Wah pedal -
creatively used in the film music 'Theme from Shaft'. Related to this are the 'Mouth Tube' and
the much more sophisticated 'Vocoder'. One of the earliest examples of the use of the
latter's sound in popular music was 'Sparky's Magic Piano'. More recently ELO's 'Mister
Blue Sky' and television's talking robot 'Metal Mickey' have both used this equipment.
-
But why this fascination with vocal effects? To explain this it is important to appreciate
the way in which the human ear responds to sounds. Recent research has shown that the hearing
system is not only sensitive to the frequency and amplitude of an incoming signal but also
to the way in which both these parameters vary temporally. For example, if a pure sine wave
(no harmonics) of constant pitch and amplitude is played to a listener then he soon tires
of this - the ear becomes fatigued by the stimulus. If however the signal is mildly frequency
modulated (i.e. a slight vibrato introduced) at a rate of say 8-10Hz then the ear is able
to sustain a greater exposure to this before fatigue sets in. The same principle applies
to the introduction of amplitude modulation (tremolo): in both cases the incoming signal
is more interesting to the ear. If the principles of frequency and amplitude modulation
are now extended to waveforms having much higher harmonic contents (such as a ramp wave)
then this is even more interesting as modulations of the fundamental then produce more
complex modulations of the harmonic structure resulting in an extremely 'active' sound.
This discussion tends to suggest that there are receptors within the auditory system which
are 'tuned' to detect both frequency and amplitude variations in an incoming signal.
-
Indeed it has been shown that they are even capable of determining the shape of the modulating
waveforms! Thus it is transients within a sound which are important to maintain interest and
also very important when it comes to the recognition of, for example, musical instruments
or speech.
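The vibrato and tremolo described above can be tried out numerically. What follows is only a minimal sketch: the sample rate, the 440Hz test pitch, the modulation depths and the function name `modulated_sine` are all assumptions chosen for illustration, with the 8Hz modulation rate taken from the range quoted earlier.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for illustration)


def modulated_sine(freq_hz, seconds, vibrato_depth_hz=0.0,
                   tremolo_depth=0.0, mod_rate_hz=8.0):
    """Return samples of a sine wave with optional vibrato (FM) and tremolo (AM)."""
    samples = []
    phase = 0.0
    for n in range(int(seconds * SAMPLE_RATE)):
        t = n / SAMPLE_RATE
        # Slow modulating sine wave, swinging between -1 and +1.
        lfo = math.sin(2 * math.pi * mod_rate_hz * t)
        # Vibrato: the instantaneous pitch wobbles around freq_hz.
        inst_freq = freq_hz + vibrato_depth_hz * lfo
        phase += 2 * math.pi * inst_freq / SAMPLE_RATE
        # Tremolo: the level dips by up to tremolo_depth below full scale.
        amplitude = 1.0 - tremolo_depth * 0.5 * (1.0 + lfo)
        samples.append(amplitude * math.sin(phase))
    return samples


steady = modulated_sine(440.0, 0.5)                         # plain sine wave
vibrato = modulated_sine(440.0, 0.5, vibrato_depth_hz=5.0)  # frequency modulated
tremolo = modulated_sine(440.0, 0.5, tremolo_depth=0.3)     # amplitude modulated
```

Writing the three lists out as audio and listening to them in turn is probably the quickest way to judge the difference for yourself.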
This may be easily seen when you try to simulate the sounds of conventional musical instruments
on a synthesiser. The problem arises from the fact that it is very difficult to introduce
sufficient variation of both frequency and amplitude into the waveforms produced. With sounds
of short duration then it is just about possible to deceive the ear but with any sustained
sound such as the imitation of a held violin or oboe note then the ear is able to detect the
too regular nature of the waveform and labels it as electronically generated. With the advent
of the new generation of computer synthesisers (such as the Fairlight CMI) the
real-time control of frequency and amplitude parameters is possible to a very fine degree but
it is still difficult to produce a really convincing 'held sound' - the waveforms are still
too 'perfect' and do not possess the unpredictable irregularities common to all natural sounds.
So, the human ear is highly sensitive to changes in an incoming signal and this is a clue as
to the power of the human voice as a communicator and also its magnetic attraction when used
for special effects. There is no more flexible sound generator known to man than the human
voice. It is capable of extremely precise amplitude, frequency and harmonic control over a
relatively wide range resulting in a vast repertoire of expression.
-
How is the human voice able to achieve all this? Let's take a closer look at the way in which
speech is produced and the reasons why certain sounds are described as possessing a 'vocal
character'.
Speech is composed of two main component sounds:
(1) VOICED SOUNDS. These are produced when air from the lungs
is forced between the vocal chords, which are situated in the windpipe, causing these membranes
to vibrate and a pulsating column of air to enter the mouth and nasal cavities. The fundamental
pitch of the resultant note is determined by the length, thickness and tension of the vocal
chords.
(2) UNVOICED SOUNDS. If the air from the lungs is not forced
through the vocal chords but simply expelled through the mouth then unvoiced sounds such
as 'f' or 'h' are produced. These are very similar in nature to the sounds which may be
produced by the filtering of a 'white noise' source.
The shape of the mouth and the nasal cavities determines the character of both the above types
of sound - they act as complex filters, the response of which is variable by altering the
shape of the mouth. (Try vocalising the sound 'ah' and then slowly altering the shape of the
mouth and listen carefully for the changes in the harmonic structure which result from this.
All the vowel sounds can be produced in this manner). Precise variations are obtained by
movements of the tongue and lips which alter the resonant features of the filter system,
creating areas in which certain frequencies are boosted and others cut. The ranges in which
frequencies are boosted are known as formant bands (which are also present in the resonant
structures of musical instruments and largely account for their different sounds - each
instrument can be said to have its own formant 'fingerprint'). The lips play a particularly
important role in the production of sounds which may be distinguished by their dynamic
amplitude characteristics such as the percussive attack transients in sounds such as 'p'.
Overall then, the voice may be regarded as a complex sound generating instrument consisting of
an amplitude and frequency controlled oscillator (vocal chords and lungs), a noise generator
(lungs) and a set of formant filters (mouth and nasal cavities). Viewed in this light it would
seem that the basic ingredients required for voice production are available on a conventional
music synthesiser and it poses the question as to the feasibility of producing vocal sounds
using conventional synthesis techniques. These would involve using a voltage controlled
oscillator to simulate the vocal chords (ensuring that the waveform produced is sufficiently
rich in harmonic content, e.g. a pulse wave) and a noise generator for the unvoiced sounds.
Circuitry would be required to switch back and forth between these two sound sources depending
on whether voiced or unvoiced sounds were desired. For the filtering section, a bank of voltage
controlled bandpass filters could be used, with centre frequencies spaced a quarter or a third of an octave apart,
covering the area of the audio band in which speech components are most prominent
(approx. 150-8000 Hz). The array of filters would be similar to those employed in a graphic
equaliser except that those of course are not voltage controlled. If you possess a graphic
equaliser with sufficient frequency discrimination between its bands - preferably a 20 channel
unit - then you can have a go at simulating various vowel shapes on it by using a pulse wave
as a signal source and adjusting the slider controls to approximate the outlines of the
frequency spectra of the vowel shapes shown in Figure 1.
-
Figure 1. Guide to frequency spectra of vowel shapes.
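For readers without a graphic equaliser, the oscillator, noise generator and formant filter arrangement can be roughed out in software. This is a sketch under stated assumptions rather than a working speech synthesiser: only three bandpass channels stand in for the full filter bank, the 'ah' centre frequencies and gains are approximate textbook figures and are not read from Figure 1, and the filter is a standard two-pole digital bandpass chosen for convenience. All names here are hypothetical.

```python
import math
import random

SAMPLE_RATE = 16000  # samples per second (assumed)


def pulse_wave(freq_hz, n_samples, duty=0.1):
    """Narrow pulse train: rich in harmonics, standing in for the vocal chords."""
    period = SAMPLE_RATE / freq_hz
    return [1.0 if (n % period) < duty * period else 0.0
            for n in range(n_samples)]


def white_noise(n_samples, seed=0):
    """Noise source for the unvoiced sounds."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def bandpass(signal, centre_hz, q=8.0):
    """Two-pole bandpass filter with unity gain at its centre frequency."""
    w0 = 2 * math.pi * centre_hz / SAMPLE_RATE
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2 * math.cos(w0) / a0, (1 - alpha) / a0
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def formant_filter(source, formants):
    """Sum parallel bandpass channels; formants is a list of (centre_hz, gain)."""
    mixed = [0.0] * len(source)
    for centre_hz, gain in formants:
        band = bandpass(source, centre_hz)
        mixed = [m + gain * b for m, b in zip(mixed, band)]
    return mixed


# Assumed, approximate formant peaks for an 'ah' vowel (not taken from Figure 1).
AH_FORMANTS = [(700, 1.0), (1100, 0.5), (2450, 0.25)]

voiced_ah = formant_filter(pulse_wave(120.0, 4000), AH_FORMANTS)
unvoiced_hiss = bandpass(white_noise(4000), 4000, q=1.0)  # crude 'h'-like noise
```

Changing the centre frequencies in `AH_FORMANTS` plays the part of moving the equaliser sliders, and a fuller version would use one channel per slider.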

-
Interesting backing sounds for songs may be produced in this way especially if the output
from the equaliser is passed through a chorus unit producing not just one vowel sound but a
multiple effect.
Thus far the possibility of speech production using conventional synthesis techniques seems on
the cards. However, it is when considering the extremely complicated control voltages which
would be required to manipulate the filter bank that we come up against the main snag with
this system.
How can we overcome this problem? One possibility is to store the control voltages digitally
resulting in a hybrid analogue-digital speech synthesiser. Control voltages would be stored
in ROM (Read Only Memory) and a microprocessor could read these out and convert them via a D/A
(Digital to Analogue) converter into analogue voltages for the filter bank. This would enable
a limited vocabulary of words to be produced governed by the storage capacity of the ROM.
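The read-out described here can be pictured as a simple table look-up. The sketch below is purely illustrative: the 8-bit resolution, the 5V full-scale output, the four filter channels and the codes stored in `ROM` are all invented for the example and do not describe any particular product.

```python
FULL_SCALE_VOLTS = 5.0  # assumed D/A full-scale output
DAC_BITS = 8            # assumed converter resolution

# Hypothetical ROM contents: one 8-bit code per filter channel for each sound.
ROM = {
    "ah": (200, 255, 120, 40),
    "ee": (255, 60, 30, 180),
}


def dac(code):
    """Convert an unsigned digital code into an analogue voltage."""
    max_code = (1 << DAC_BITS) - 1
    if not 0 <= code <= max_code:
        raise ValueError("code out of range for the converter")
    return FULL_SCALE_VOLTS * code / max_code


def control_voltages(sound):
    """Read one sound's codes from ROM and return the filter control voltages."""
    return [dac(code) for code in ROM[sound]]


def rom_bytes_needed(n_sounds, n_channels):
    """Storage for one 8-bit code per channel per sound."""
    return n_sounds * n_channels
```

Real speech would need a whole sequence of such frames for each sound, one for every step of the changing filter settings, which is why the ROM capacity limits the vocabulary.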
A slightly different approach to speech synthesis and one which is now becoming more
commonplace is the entirely digital system. This in some ways is an extension of the one
described above, in that the components of words are stored in ROM. Now the data stored is
such that when read out and fed through a D/A converter, the analogue voltage produced is no
longer just a control voltage to be applied to a filter bank but may be immediately fed to an
amplifier and will produce the desired sound of say a vowel or consonant. The beauty of this
system is that instead of having to store lots of control voltages - up to 22 for each sound
in a large hybrid system consisting of 22 bandpass filter channels - it is now possible to
store fewer values for the same resultant sound. A further extension of this principle leads
to even more compact storage of words.
-
Consider the following words: better, batter, matter, match, fetch and mud.
It is possible to divide these up into component sounds:
better-(1)beh (2)tur
batter-(3)bah (2)tur
matter-(4)mah (2)tur
match-(4)mah (5)ch
fetch-(6)feh (5)ch
mud-(7)muh (8)d
From these individual components it is now possible to make new words such as:
fetter-(6)feh (2)tur
batch-(3)bah (5)ch
bed-(1)beh (8)d
bad-(3)bah (8)d
mad-(4)mah (8)d
much-(7)muh (5)ch
fed-(6)feh (8)d
etc.
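The word lists above amount to a small look-up table, which can be sketched as follows. The component numbers and spellings are those given in the lists; the names `COMPONENTS`, `WORDS` and `speak` are hypothetical, and in a real chip each component would be stored sound data and not a text label.

```python
# The eight component sounds, numbered as in the lists above.
COMPONENTS = {
    1: "beh", 2: "tur", 3: "bah", 4: "mah",
    5: "ch", 6: "feh", 7: "muh", 8: "d",
}

# Each word is stored only as a short sequence of component numbers.
WORDS = {
    "better": (1, 2), "batter": (3, 2), "matter": (4, 2),
    "match": (4, 5), "fetch": (6, 5), "mud": (7, 8),
    "fetter": (6, 2), "batch": (3, 5), "bed": (1, 8),
    "bad": (3, 8), "mad": (4, 8), "much": (7, 5), "fed": (6, 8),
}


def speak(word):
    """Return the component sounds to be played, in order, for one word."""
    return [COMPONENTS[number] for number in WORDS[word]]
```

Here `speak("fetter")` gives the components feh and tur, and the thirteen words listed share just eight stored sounds between them.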
-
As may be seen, an extension of this system will result in a large vocabulary being available
from relatively few component parts. This is the method which is used in various devices such
as the talking calculator or spelling game or in anything which uses a 'talking chip'.
So much for methods of producing speech 'from scratch'. Let's return to the topic of vocal
special effects. These make use of an existing human voice and subject it to different forms
of electronic processing. One of the most obvious of these is of course the addition of either
reverberation or echo. Another is the use of a pitch-shifter to generate the effect of two
voices singing together a fixed interval apart. But perhaps one of the most popular vocal
effects units is the vocoder.
Vocoding or VOice-CODING is not a new concept. Indeed the original idea was conceived before
the Second World War. There was interest in Germany in the thirties due to the military
potential of the unit for encoding secret messages. The first person to use the term vocoder
to describe a commercial unit was an American called Homer Dudley who in 1936 devised a
machine for the compression of the bandwidth of speech for transmission purposes. The modern
vocoder still operates on the same principles, namely that of the real-time superimposition of
speech onto a 'carrier signal' - nowadays this usually means a musical instrument.
Utilising this system it is possible to make almost anything speak from a guitar to a full
symphony orchestra.
-
The way in which the unit works may be seen by referring to the block diagram
of the circuitry of a typical vocoder (Figure 2).
-
This is somewhat simplified but gives an overall view of the processes involved.
Speech is input at point 'A' and is then split up into discrete frequency
bands by a series of bandpass filters. At the output of each of these there is an envelope
follower which produces a DC voltage proportional to the amplitude of the signal present in the
particular frequency band. The bank of bandpass filters thus produces a series of control voltages
which precisely follow the frequency spectrum of the incoming speech signal. These control
voltages are used to control a bank of VCAs (Voltage Controlled Amplifiers) as shown. Connected
to the signal input of each one of these is the 'carrier signal' (e.g. a guitar sound) which
enters the vocoder at point 'B'. This carrier is used for the production of the 'voiced'
portions of the speech and a noise generator for the 'unvoiced'. The circuit which selects either
'carrier' or 'noise' is the 'voiced/unvoiced detector'. This compares the relative levels of
high and low frequencies in the incoming speech signal. When there is a higher proportion of
frequencies above 4000 Hz than below, the noise generator is switched in as the component of
speech being input at that moment will be 'unvoiced'. The outputs of the VCAs go to an
identical bank of bandpass filters to those used for the analysis of the incoming speech signal.
Therefore, the control voltages derived from the speech input now determine the amplitude of
each frequency band in the carrier signal that is allowed through to the output summing amplifier.
The speech has therefore imposed its frequency spectrum on the musical carrier. Result - talking music!
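The signal path just described can be followed through in a short software sketch. It is a simplified model under several assumptions and not a description of any commercial unit: only seven bands are used, their centre frequencies, the filter Q and the envelope smoothing are guesses, and the voiced/unvoiced detector simply compares the envelopes of the bands above and below 4000 Hz. All function names are hypothetical.

```python
import math
import random

SAMPLE_RATE = 16000  # samples per second (assumed)
BAND_CENTRES_HZ = [250, 500, 1000, 2000, 3000, 5000, 6500]  # assumed bands


def bandpass(signal, centre_hz, q=4.0):
    """Two-pole bandpass filter with unity gain at its centre frequency."""
    w0 = 2 * math.pi * centre_hz / SAMPLE_RATE
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2 * math.cos(w0) / a0, (1 - alpha) / a0
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def envelope_follower(signal, smoothing_hz=50.0):
    """Rectify the signal, then smooth it to give a slowly varying level."""
    coeff = math.exp(-2 * math.pi * smoothing_hz / SAMPLE_RATE)
    envelope, level = [], 0.0
    for x in signal:
        level = coeff * level + (1 - coeff) * abs(x)
        envelope.append(level)
    return envelope


def vocode(speech, carrier, noise_seed=0):
    """Impose the spectrum of the speech (input A) on the carrier (input B)."""
    n = min(len(speech), len(carrier))
    speech, carrier = speech[:n], carrier[:n]
    rng = random.Random(noise_seed)
    noise = [rng.uniform(-1.0, 1.0) for _ in range(n)]

    # Analysis: one bandpass filter and envelope follower per band.
    envelopes = [envelope_follower(bandpass(speech, f)) for f in BAND_CENTRES_HZ]

    # Voiced/unvoiced detector: noise is switched in when the bands above
    # 4000 Hz outweigh those below.
    excitation = []
    for i in range(n):
        high = sum(e[i] for f, e in zip(BAND_CENTRES_HZ, envelopes) if f > 4000)
        low = sum(e[i] for f, e in zip(BAND_CENTRES_HZ, envelopes) if f <= 4000)
        excitation.append(noise[i] if high > low else carrier[i])

    # Synthesis: each envelope drives a VCA, followed by a matching bandpass
    # filter, and the bands are summed at the output.
    output = [0.0] * n
    for f, envelope in zip(BAND_CENTRES_HZ, envelopes):
        vca_out = [e * x for e, x in zip(envelope, excitation)]
        band = bandpass(vca_out, f)
        output = [o + b for o, b in zip(output, band)]
    return output


# Test signals only: a 110 Hz ramp wave as carrier and a 500 Hz tone as 'speech'.
carrier = [2.0 * ((110.0 * i / SAMPLE_RATE) % 1.0) - 1.0 for i in range(1600)]
speech = [math.sin(2 * math.pi * 500.0 * i / SAMPLE_RATE) for i in range(1600)]
talking = vocode(speech, carrier)
```

With a recorded voice in place of the test tone and a sustained chord as the carrier, the output should take on the articulation of the voice, though a convincing result would call for many more bands than are used here.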
The combination of the transient nature of both speech and music which this unit affords
provides a formidable tool for the making of aurally arresting sound effects which if used
sparingly will always demand the listener's interest.
So, we have now considered some methods of speech production and processing but what of this
Jabberwocky? Perhaps, when it comes to effects, it's not so much what is said but the way it
sounds which is important! E and MM