In the 1960s, researchers at Yale University’s Haskins Laboratories attempted to produce a machine that would read printed text aloud to blind people. Alvin Liberman and his colleagues figured the solution was to isolate the “phonemes,” the ostensible beads-on-a-string equivalent to movable type that linguists thought existed in the acoustic speech signal. Linguists had assumed (and some still do) that phonemes were roughly equivalent to the letters of the alphabet and that they could be recombined to form different words. However, when the Haskins group snipped segments from tape recordings of words or sentences spoken by radio announcers or trained phoneticians, and tried to link them together to form new words, the researchers found that the results were incomprehensible.1

That’s because, as most speech scientists agree, there is no such thing as a pure phoneme (though some linguists still cling to the idea). Discrete phonemes do not exist as such in the speech signal, and instead are always blended together in words. Even “stop consonants,” such as [b], [p], [t], and [g], don’t exist as isolated entities; it is impossible to utter a stop consonant without also producing a vowel before or after it. As a result, the consonant [t] in the spoken word tea, for example, sounds quite different from that in the word to. To produce the vowel sound in to, the speaker’s lips are protruded and narrowed, while they are retracted and open for the vowel sound in tea, yielding different acoustic representations of the initial consonant. Moreover, when the Haskins researchers counted the number of putative phonemes that would be transmitted each second during normal conversation, the rate exceeded what the human auditory system can interpret; the synthesized phrases would have become an incomprehensible buzz.

Half a century after this phoneme-splicing talking machine failed at Haskins, computer systems that recognize and synthesize human speech are commonplace. All of these programs, such as the digital assistant Siri on iPhones, work at the word level. What linguists now know about how the brain recovers words from streams of speech supports this word-level approach to speech reproduction. How humans process speech has also been molded by the physiology of speech production. Research on the neural bases of other aspects of motor control, such as learned hand-arm movements, suggests that phonemes reflect instruction sets for commands in the motor cortex that ultimately control the muscles that move our tongues, lips, jaws, and larynxes as we talk. But that remains a hypothesis. What is clear about language, however, is that humans are unique among extant species in the animal kingdom. From the anatomy of our vocal tracts to the complexity of our brains to the multifarious cultures that depend on the sharing of detailed information, humans have evolved the ability to communicate like no other species on Earth.

How vocalizations become human speech

Pipe organs provide a useful analogy for the function of the human vocal tract. These instruments date back to the medieval period in Europe and consist of a bellows, which provides the necessary acoustic energy, and a collection of pipes of various lengths. Each key on the organ controls a valve that directs turbulent airflow into a particular pipe, which acts as an acoustic filter, allowing maximum energy to pass through it at a set of frequencies determined by its length and whether it is open at one or both ends. A longer pipe will result in a set of potential acoustic energy peaks—its so-called “formant frequencies”—at relatively low frequencies, while a shorter pipe will produce a higher set of formant frequencies. In the human body, the lungs serve as the bellows, providing the source of acoustic energy for speech production. The supra-laryngeal vocal tract (SVT), the airway above the larynx, acts as the pipes, determining the formant frequencies that are produced.
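The dependence of a pipe's resonances on its length and on whether it is open at one or both ends can be sketched with the standard idealization of a uniform, lossless tube. The snippet below is only an illustration: the speed of sound (343 m/s), the function name `tube_resonances`, and the 17 cm example length (a common textbook stand-in for an adult vocal tract, not a figure from this article) are all assumptions.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumed value for air at roughly 20 degrees C


def tube_resonances(length_m, closed_at_one_end, count=3,
                    speed=SPEED_OF_SOUND_M_PER_S):
    """Return the first `count` resonances (Hz) of an ideal uniform tube."""
    if length_m <= 0:
        raise ValueError("length_m must be positive")
    if closed_at_one_end:
        # Quarter-wavelength resonator: odd multiples of c / (4 * L).
        return [(2 * n - 1) * speed / (4 * length_m)
                for n in range(1, count + 1)]
    # Open at both ends: integer multiples of c / (2 * L).
    return [n * speed / (2 * length_m) for n in range(1, count + 1)]


# Hypothetical example lengths: a longer and a shorter tube, closed at one end.
print(tube_resonances(0.17, closed_at_one_end=True))
print(tube_resonances(0.14, closed_at_one_end=True))
```

Under these assumptions the 17 cm tube resonates near 500, 1,500, and 2,500 Hz, and the shorter tube shifts every resonance upward, which is the length effect described above. A real vocal tract is not a uniform tube, so actual formant frequencies depart from these values as the tongue and lips change its shape.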

As Charles Darwin pointed out in 1859, the lungs of mammals and other terrestrial species are repurposed swim bladders, air-filled organs that allow bony fish to regulate their buoyancy. Lungs have retained the elastic property of swim bladders. During normal respiration, the diaphragm as well as the abdominal muscles and the intercostal muscles that run between the ribs work together to expand the lungs. The elastic recoil of the lungs then provides the force that expels air during expiration, with alveolar (lung) air pressure starting at a high level and falling linearly as the lungs deflate. During speech, however, the diaphragm is immobilized and alveolar air pressure is maintained at an almost uniform level until the end of expiration, as a speaker adjusts her intercostal and abdominal muscles to “hold back” against the force generated by the elastic recoil of the lungs.

This pressure, in combination with the tension of the muscles that make up the vocal cords of the larynx, determines the rate at which the vocal cords open and close—what’s known as the fundamental frequency of phonation (F0), perceived as the pitch of a speaker’s voice. In most languages, the F0 tends to remain fairly level, with momentary controlled peaks that signal emphasis, and then decline sharply at a sentence’s end, except in the case of certain questions, which often end with a rising or level F0. F0 contours and variations also convey emotional information.

In tonal languages, F0 contours differentiate words. For example, in Mandarin Chinese the word ma has four different meanings that are conveyed by different local F0 contours. In all of the world’s languages, however, the primary acoustic factors that specify a vowel or a consonant are its formant frequencies, determined by the positions of the tongue, the lips, and the larynx, which can move up or down to a limited degree. The SVT in essence acts as a malleable organ pipe, letting maximum energy through it at a set of frequencies determined by its shape and length. Temporal cues, such as the length of a vowel, also play a role in differentiating both vowels and consonants. For example, the duration of the vowel of the word see is longer than the duration of the vowel of the word sit, which has almost the same formant frequencies.

Perceiving the formant frequencies of speech and assigning them to the words that a person intends to communicate is complicated. For one thing, people differ in vocal tract length, which affects the formant frequencies of their speech. In 1952, in one of the first experiments aimed at machine recognition of speech, Gordon Peterson and Harold Barney at Bell Telephone Laboratories found that the average formant frequencies of the vowel [i]—such as in the word heed—were 270, 2,290, and 3,010 Hz for 76 adult males. In other words, local energy peaks in the acoustic signal occur at these formant frequencies and convey the vowel.2 The average formant frequencies of the vowel [u]—as in the word who—were 300, 870, and 2,240 Hz for the same group of men. Adult women produced formant frequencies that were higher for the same vowels because their SVTs were shorter than the men’s. Adolescents’ formant frequencies were higher still. Nonetheless, human listeners are typically able to identify these spoken vowel sounds thanks to a cognitive process known as perceptual normalization, by which we unconsciously estimate the length of a speaker’s SVT and correct for the corresponding shift in formant frequencies.
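One way to picture perceptual normalization is as a rescaling step: estimate how far a speaker's formants are shifted relative to a reference set, then undo that shift before comparing vowels. The sketch below assumes a single uniform scale factor, which is a simplification and not a description of what listeners actually compute. The reference values are the male averages quoted above; the function names and the speaker's tokens are hypothetical.

```python
import math

# Average male formants (Hz) for [i] and [u] as quoted in the text.
REFERENCE_FORMANTS = {
    "i": (270.0, 2290.0, 3010.0),
    "u": (300.0, 870.0, 2240.0),
}


def estimate_scale(observed, reference):
    """Geometric-mean ratio of observed to reference formants."""
    log_ratios = [math.log(o / r) for o, r in zip(observed, reference)]
    return math.exp(sum(log_ratios) / len(log_ratios))


def normalize(observed, scale):
    """Rescale a speaker's formants back toward the reference range."""
    return tuple(f / scale for f in observed)


def classify(formants):
    """Pick the reference vowel closest in log-frequency distance."""
    def log_distance(vowel):
        reference = REFERENCE_FORMANTS[vowel]
        return sum(math.log(f / r) ** 2 for f, r in zip(formants, reference))
    return min(REFERENCE_FORMANTS, key=log_distance)


# Hypothetical speaker whose shorter vocal tract raises every formant by 20%.
i_token = (324.0, 2748.0, 3612.0)
u_token = (360.0, 1044.0, 2688.0)

scale = estimate_scale(i_token, REFERENCE_FORMANTS["i"])
print(scale, classify(normalize(u_token, scale)))
```

Here the scale factor is estimated from a single [i] token and then applied to a different vowel from the same speaker. With only two reference vowels the classification step is trivial; the point is the order of operations, in which the speaker's overall shift is estimated first and removed before vowels are compared.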

Research has shown that listeners can deduce SVT length after hearing a short stretch of speech or even just a common phrase or word. University of Alberta linguist Terrance Nearey’s comprehensive 1978 study showed that the vowel [i] was an optimal signal for accounting for SVT length, and [u] only slightly less so.3 This explained one of the results of a 1952 Peterson and Barney project aimed at developing a voice-activated telephone dialing system that would have to work for men, women, and people who spoke different dialects of English. The duo presented a panel of listeners with words having the form h-vowel-d [hVd], such as had and heed, produced by 10 different speakers in quasi-random order, and asked the participants to identify each word. Out of 10,000 trials, listeners misidentified [i] only two times and [u] just six times, but misidentified words having other vowels hundreds of times. Similarly, in a 1994 experiment in which listeners had to estimate people’s height (which roughly correlates with vocal tract length) by listening to them produce an isolated vowel, the vowel [i] worked best.4

In short, people unconsciously take account of the fact that formant frequency patterns, which play a major role in specifying words, depend on the length of a speaker’s vocal tract. And both the fossil record and the ontogenetic development of children suggest that the anatomy of our heads, necks, and tongues has been molded by evolution to produce the sounds that clearly communicate the intended information.

Acoustics and physiology of human speech

Humans have a unique anatomy that supports our ability to produce complex language. The elastic recoil of the lungs provides the necessary acoustic energy, while the diaphragm, intercostal muscles, and abdominal muscles manipulate how that air is released through the larynx, a complex structure that houses the vocal cords, and the supra-laryngeal vocal tract (SVT), which includes the oral cavity and the pharynx, the cavity behind the mouth and above the larynx.

When air from the lungs rushes against and through the muscles, cartilages, and other tissue of the vocal cords, they rapidly open and close at a rate known as the fundamental frequency of phonation (F0), which is perceived as the pitch of a speaker’s voice. The formant frequencies that distinguish the principal sounds of words are set by the positions of the lips, tongue, and larynx.

In addition to the anatomy of the SVT, humans have evolved increased synaptic connectivity and malleability in certain neural circuits in the brain important for producing and understanding speech. Specifically, circuits linking cortical regions and the subcortical basal ganglia appear critical to support human language.
