1996 Research Summaries for the Ptolemy Project

Modeling of Residuals in Music Analysis-Synthesis


Researcher: Michael Goodwin
Advisors: Edward A. Lee and David Wessel
Sponsors: CNMAT (Center for New Music and Audio Technology)

In the analysis and synthesis of a musical signal, the input music is analyzed to derive a set of intermediate parameters from which a perceptually approximate version of the music can be resynthesized. The accuracy of this approximation depends on the number of intermediate parameters that the analysis-synthesis system can manage. For instance, in a low bit rate coder where the analysis takes place at the transmitter and the synthesis takes place at the receiver, the number of intermediate parameters is limited by the allowable bit rate; the lower the bit rate, the less accurate the synthesis is. The application under consideration in this research is a music synthesizer that allows modification of the intermediate parameters so as to achieve musically significant transformations such as time-scaling, pitch-shifting, and spectral shaping.

Musical signals are commonly modeled as a sum of sinusoidal components with time-varying amplitudes and frequencies. Analysis methods based on this model generally estimate the parameters of the sinusoids on a frame-by-frame basis; the ensuing synthesis imposes low-order evolution models on the frame-rate analysis data to generate the sample-rate synthetic signal. For example, the amplitude of a sinusoidal component is typically interpolated linearly from frame to frame. As a result of this frame-based approach, variations in the original signal that occur on time scales smaller than a frame will not be accurately represented in the synthesis. If the synthetic signal is subtracted from the original signal, such rapid signal variations will appear in the leftover signal, which is termed the "residual". In music analysis-synthesis, this residual tends to contain both note attack components (such as marimba mallet strikes) and broadband noise processes (such as saxophone breath noise). Since the residual contains perceptually important features of the original music, the synthetic signal derived from the frame-rate sinusoidal model parameters tends to lack realism. In the following, the attack components are not considered; instead, two models of the noise process are discussed.
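The frame-based synthesis described above can be illustrated with a minimal sketch: one sinusoidal partial is generated from frame-rate amplitude and frequency estimates by linearly interpolating both to the sample rate, and the residual is then what remains after subtracting such partials from the original. This is an illustrative reconstruction, not the project's actual code; the frame length, sample rate, and parameter values are invented for the example.

```python
import numpy as np

def synthesize_partial(amps, freqs, frame_len, fs):
    """Synthesize one sinusoidal partial from frame-rate amplitude and
    frequency estimates, linearly interpolating both between frames.
    (Sketch only; real analysis-synthesis systems also match the
    analyzed phases at frame boundaries.)"""
    n_frames = len(amps)
    n = (n_frames - 1) * frame_len
    bounds = np.arange(n_frames) * frame_len     # frame-boundary sample indices
    samples = np.arange(n)
    amp = np.interp(samples, bounds, amps)       # sample-rate amplitude envelope
    freq = np.interp(samples, bounds, freqs)     # sample-rate frequency track
    phase = 2 * np.pi * np.cumsum(freq) / fs     # integrate frequency to phase
    return amp * np.sin(phase)

fs, frame_len = 8000, 400                        # 50 ms frames (assumed values)
# Frame-rate estimates for a partial whose amplitude swells then decays:
amps = np.array([0.0, 0.8, 1.0, 0.5, 0.1])
freqs = np.array([440.0, 441.0, 440.5, 440.0, 439.5])
synthetic = synthesize_partial(amps, freqs, frame_len, fs)
# The residual would be: residual = original - sum of such partials;
# sub-frame-scale variation in the original ends up in the residual.
```

Because the amplitude is piecewise linear between frames, any variation faster than the frame rate is smoothed away in `synthetic`, which is exactly why attack transients and breath noise land in the residual.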

The primary focus of this research is to improve synthesis realism by modeling the residual noise process so that this noise can be reinjected at the synthesis stage; two such models have been investigated. The first model is based on psychoacoustic data on noise perception: in perceiving broadband noise, the human auditory system is primarily sensitive to the total energy in each of a set of auditory equivalent rectangular bands (ERBs), which are similar to critical bands, and is largely insensitive to the distribution of energy within a single band. From the energies in the ERBs, a piecewise-constant spectral estimate is derived and interpreted as the magnitude response of a synthesis filter. Exciting this filter with white noise generates a synthetic residual that captures the salient spectral features of the broadband noise processes in the original residual. Some of the fine structure is lost, however; for instance, undesired crackling artifacts are introduced into smooth, breathy residuals. These artifacts can be reduced by smoothing the synthetic noise from frame to frame, but analytical determination of the appropriate smoothing function remains an open problem.

With this in mind, a second model has been explored in which the residual is modeled as a superposition of narrowband stochastic processes, each generated by modulating a sinusoid with a random lowpass envelope. Because the synthesis engine in this project is well suited to generating sinusoids, this model offers an advantage over the ERB approach. Also, crackle distortion can be readily avoided in the synthetic residual by enforcing continuity constraints on the modulated sinusoids at frame boundaries, so synthesis based on this model is easily achieved. A corresponding analysis approach is currently under development.
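The second model can be sketched as follows: each narrowband component is a sinusoid amplitude-modulated by a random lowpass envelope, and a broadband residual is approximated by superposing such components across a range of center frequencies. This is a hedged illustration of the idea, not the project's implementation; the moving-average envelope smoother, the frequency grid, and all numeric parameters are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def narrowband_noise(center_hz, env_smooth, n, fs):
    """One narrowband stochastic component: a sinusoid at center_hz
    amplitude-modulated by a random lowpass envelope.  The envelope is
    white noise smoothed by a moving average (a simple stand-in for a
    proper lowpass filter); larger env_smooth means a slower envelope
    and hence a narrower-band component."""
    noise = rng.standard_normal(n + env_smooth)
    kernel = np.ones(env_smooth) / env_smooth
    env = np.convolve(noise, kernel, mode="valid")[:n]  # lowpass envelope
    t = np.arange(n) / fs
    return env * np.sin(2 * np.pi * center_hz * t)

fs, n = 8000, 8000
# Superpose components on a grid of center frequencies to fill a band,
# mimicking a broadband residual built from narrowband processes:
residual = sum(narrowband_noise(f, 200, n, fs) for f in range(500, 2001, 100))
```

Since each component is just a sinusoid with a slowly varying amplitude, a sinusoid-oriented synthesis engine can generate it directly, and continuity of the envelopes across frame boundaries prevents the crackle artifacts mentioned above.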

In the synthesis stage of an analysis-synthesis system, it is often desirable to inject noise to account for signal features not well modeled by the analysis parameters. For example, in speech coders, voiced sounds are typically synthesized according to parameters related to pitch whereas unvoiced sounds such as sibilants and fricatives are generated using shaped noise. In musical applications, noise is necessary to recreate the effects of turbulence in wind instruments and the anomalies of the attacks of percussive sounds, for instance.

This project relates to the development of an analysis-based music synthesizer wherein the analysis essentially consists of a short-time Fourier transform process that extracts the time-varying amplitudes, frequencies, and phases of the many sinusoids comprising the original sound. The resultant spectral representation is well suited to transformations such as time scaling, pitch transposition, and various spectral modifications. However, the analysis parameters characterize only the deterministic signal components; for the efficient synthesis of realistic musical sounds, it is therefore necessary to generate a flexible yet accurate spectral noise representation that can be included in the signal parameterization before synthesis. The intent of this research is to derive such a representation that can undergo transformations without introducing perceptually undesirable artifacts.
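The short-time Fourier analysis described above can be sketched in miniature: for one analysis frame, the windowed FFT magnitude spectrum is searched for local maxima, and each peak yields a (frequency, amplitude, phase) triple. This is a bare-bones sketch under assumed parameters, not the project's analyzer; real systems refine the frequency estimates by interpolation and track peaks from frame to frame.

```python
import numpy as np

def frame_peaks(frame, fs, n_peaks=5):
    """Estimate sinusoid parameters for one analysis frame by picking
    the largest local maxima of the windowed FFT magnitude spectrum.
    Returns (frequency_hz, amplitude, phase) triples, strongest first."""
    n = len(frame)
    win = np.hanning(n)
    spec = np.fft.rfft(frame * win)
    mag = np.abs(spec)
    # Indices of local maxima of the magnitude spectrum:
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k] > mag[k - 1] and mag[k] >= mag[k + 1]]
    peaks.sort(key=lambda k: mag[k], reverse=True)
    bin_hz = fs / n
    return [(k * bin_hz,               # frequency estimate (bin resolution)
             2 * mag[k] / win.sum(),   # amplitude, compensating the window gain
             np.angle(spec[k]))        # phase at the frame origin
            for k in peaks[:n_peaks]]

fs, n = 8000, 1024
t = np.arange(n) / fs
# A test frame with two sinusoids (500 Hz at 0.7, 1500 Hz at 0.3):
frame = 0.7 * np.sin(2 * np.pi * 500 * t) + 0.3 * np.sin(2 * np.pi * 1500 * t)
params = frame_peaks(frame, fs, n_peaks=2)
```

Applying this per frame gives exactly the frame-rate parameter tracks discussed earlier; whatever the picked sinusoids fail to explain falls into the residual that the noise models target.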


Send comments to Michael Goodwin at michaelg@eecs.berkeley.edu.