Thursday, June 5, 2014

The magic of speech synthesis: linear predictive coding

Growing up in the '80s and '90s, I had a pretty decent idea how a lot of tech around me worked. Maybe I couldn't actually fix a TV with a blown tube or swap out a dead (soldered) CPU on a motherboard yet, but I knew how the big pieces fit together, what they were supposed to do, and what might happen if a given piece went kaput.

Speech synthesizers were not in that category.

When I first encountered a Speak & Spell, it seemed like magic. The voice was so crude and inhuman that it was obviously computer-generated (i.e., not recorded). It was halting and seemingly stitched together from scraps of speech, but I'd never even heard of phonemes, let alone a process by which a chip like the one I found inside could spit out words and phrases.

For a long time, I had an inordinate fascination with the Speak & Spell, the General Instrument SP0256-AL2, and the speech synthesis cartridges for the TI-99/4A and TRS-80. (Wasn't there a C64 speech cartridge too?) I never did find out much about how they worked, though, or get my hands on hardware to experiment with.

Linear Predictive Coding: Speech Analysis, Synthesis, Compression

Fast-forward 20 years or so to DSP class... and it turns out that most of those devices, along with a healthy amount of speech synthesis today, are based on variants of the linear predictive coding (LPC) technique. For my class project, I worked up an LPC example in Matlab to peek under the hood.

LPC models the human vocal tract as a medium-order time-varying filter (typically 10th-order) excited either by a pitched impulse train (voiced sounds, produced by the vocal cords) or by unpitched noise (unvoiced sounds), both powered by air from the diaphragm. A speech sequence (e.g., a word) is created by driving that excitation through the filter as the filter coefficients and gain change over time.
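To make that concrete, here's a minimal Python sketch of the source-filter idea (my project was in Matlab, so treat this as a translation of the concept, not the project code; the formant frequencies, bandwidth, gain, and pitch are made-up placeholders rather than values estimated from real speech):

import numpy as np
from scipy.signal import lfilter

fs = 10000                       # sample rate (Hz)
frame_len = int(0.02 * fs)       # one 20 ms frame

# Build a stable 10th-order all-pole "vocal tract" filter from five resonances
formants = [700, 1200, 2600, 3200, 4400]   # Hz (hypothetical)
bw = 100                                    # pole bandwidth (Hz)
poles = []
for f in formants:
    r = np.exp(-np.pi * bw / fs)
    w = 2 * np.pi * f / fs
    poles += [r * np.exp(1j * w), r * np.exp(-1j * w)]
A = np.real(np.poly(poles))      # denominator [1, a1, ..., a10]
gain = 0.5

# Voiced excitation: impulse train at the pitch period; unvoiced: white noise
pitch_period = 100               # samples (100 Hz pitch at fs = 10 kHz)
voiced_exc = np.zeros(frame_len)
voiced_exc[::pitch_period] = 1.0
unvoiced_exc = np.random.randn(frame_len)

# Synthesis: drive the excitation through the all-pole filter
voiced_frame = lfilter([gain], A, voiced_exc)
unvoiced_frame = lfilter([gain], A, unvoiced_exc)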

LPC discretizes speech into overlapping frames of 10-20 ms, where the filter coefficients, gain, impulse type, and pitch are constant for a given frame.
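The framing step itself is simple; a rough Python sketch (with a 50% overlap chosen arbitrarily, since the exact overlap isn't critical here) might look like:

import numpy as np

def frame_signal(x, fs, frame_ms=20, hop_ms=10):
    # Split the signal into overlapping frames and taper each one with a
    # Hamming window to soften the frame edges.
    frame_len = int(frame_ms * fs / 1000)
    hop = int(hop_ms * fs / 1000)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)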

LPC is most commonly used as a compression scheme: speech is analyzed to estimate frame parameters, the frame parameters are transmitted using far fewer bits than the original speech, and the parameters are applied to a filter and impulse train in the receiver to synthesize output speech.
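The analysis half can be sketched with the autocorrelation method, one common way to solve for the per-frame coefficients (not necessarily the exact approach my Matlab code took):

import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_analyze(frame, order=10):
    # Estimate the all-pole coefficients and excitation gain for one windowed
    # frame via the autocorrelation (normal equations) method.
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]   # lags 0..N-1
    a = solve_toeplitz(r[:order], r[1 : order + 1])                # solve R a = r[1..p]
    A = np.concatenate(([1.0], -a))                                # prediction filter 1 - sum(a_k z^-k)
    gain = np.sqrt(np.dot(A, r[: order + 1]))                      # residual energy -> gain
    return A, gain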

The figure shows data from the whole process. From the top, there's the filtered input audio, the detected pitch period in samples for each frame, the resulting excitation signals (pulse trains in green, noise in blue) and gains, and the final synthesized output.
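The pitch-period track in the figure comes from a per-frame pitch detector; a simple autocorrelation-peak version (again just a Python sketch, with an arbitrary 0.3 threshold for the voiced/unvoiced decision) looks roughly like:

import numpy as np

def detect_pitch(frame, fs, fmin=60, fmax=400, voiced_thresh=0.3):
    # Find the strongest autocorrelation peak in the plausible pitch range;
    # a weak peak means the frame is noise-like (unvoiced).
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo = int(fs / fmax)
    hi = min(int(fs / fmin), len(frame) - 1)
    lag = lo + np.argmax(r[lo:hi])
    voiced = r[lag] > voiced_thresh * r[0]
    return lag, voiced        # pitch period in samples, voiced flag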

Conclusions

Basic LPC turned out to be easier and more interesting to implement than I expected... considering that I didn't write custom code for everything and that I left out quite a bit of work that would normally be required to tune up the sound quality, optimize computing time, and/or meet compression specs. (Here's a great writeup on all the work that went into the Speak & Spell.)

A few samples of the output:


It's pretty cool to be able to pull speech apart, in a sense, and put it back together any way you like. I'm interested in experimenting with my code to create interesting musical textures, including vocoding by replacing the impulse train with audio from a musical instrument.
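To give an idea of what that vocoding experiment would involve (a sketch of the concept, not working project code): keep the per-frame LPC filters estimated from the speech, but drive them with the instrument audio instead of the usual impulse train or noise.

import numpy as np
from scipy.signal import lfilter

def vocode_frame(A, gain, carrier_frame):
    # Filter one frame of carrier (instrument) audio through the speech-derived
    # LPC filter, normalizing the carrier level first.
    carrier_frame = carrier_frame / (np.sqrt(np.mean(carrier_frame ** 2)) + 1e-9)
    return lfilter([gain], A, carrier_frame)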

Code is here!
