VLDCMCaR (pronounced vldcmcar)
VLDCMCaR (pronounced vldcmcar) is a MATLAB application for exploring concatenative audio synthesis using six independent matching criteria. The entire application is encompassed in a graphical user interface (GUI). Using this program, a sound or composition can be concatenatively synthesized using audio segments from a corpus database of any size. Mahler can be synthesized using hours of Lawrence Welk; howling monkeys can approximate President Bush's speech; and a Schoenberg string quartet can be remixed using Anthony Braxton playing alto saxophone. A composer could also record a vocalization of a composition and fill it in using any sound material desired.
A mosaic is a medium assembled from many small pieces that contribute to the perception of a larger image. When the mosaic is viewed at close distance the image isn't clear; but from further away, and with a blurring of one's eyes, the image emerges. Figure 1 shows a mosaic assembled from hundreds of photographs (Figure 2) instead of colored tiles (Silver 2003). A simple algorithm selects each picture-tile that best approximates a particular portion of the original image. For instance, it will look for red images when in a red region, and for images that have lines and content that can approximate the lines and content in the original (see center image of slanted grill).
Figure 1: Photo-Mosaic
Figure 2: Detail of Figure 1
This technique is similar to the drawing method shown in Figure 3. Using a network of strings for reference, the artist fills in each square on the paper with what he sees in the corresponding square of his view. Square by square the original is transcribed onto paper. Even though the artist isn't replacing each square by a different picture, he breaks down the original into parts and uses the strings as reference points to create the complete picture. The algorithm of the artist here is basically the same as that of the computer matching tiles. It should also be noted that any image can be enlarged or shrunk using a proportional network of strings.
Figure 3: Albrecht Dürer: An Artist Drawing Through a Network of Strings
A method similar to photo-mosaicing exists in the synthesis of speech, called "concatenative speech synthesis" (Hunt 1996). This technique, actually developed in the early sixties, is used for text-to-speech synthesis. The computer segments a text into chunks that are then synthesized using a large database of spoken sound units called diphones. For instance, to synthesize the word "cabbage" the computer segments it into the diphones "ka," "ba," and "g." Then from a large database of spoken diphones it finds the best ones. These components are strung together to synthesize the word.
Concatenative speech synthesis and photo mosaic methods have recently
been applied to creating "audio mosaics," or "musaics" (Hazel 2003; Lazier
2003; Schwarz 2000, 2003; Zils 2001). From a "corpus" of sound samples, best
matches are found to approximate a given "target" sound. The methods for
analyzing the target and corpus can be very complex, perhaps entailing pattern
recognition and score following. Whereas Schwarz and Zils use the computer to
automatically make the matches, composer John Oswald selects and assembles his
sound snippets by hand (Oswald 2003).
While the authors cited above have done significant research in this field (Schwarz 2003; Zils 2001), their software isn't directly accessible and does not approach these methods the way I want to. I decided instead to create a prototype application using MATLAB to explore the creative usefulness of these methods. Instead of using pattern recognition and score following, VLDCMCaR (pronounced vldcmcar) makes matches based on up to six basic statistical quantities computed over uniformly applied windows. The user can select any of these parameters, specify an allowable percentage spread for each, and choose what happens when either too many or no matches are found.
A good analogy for this technique is the following. First make an outline of
Shakespeare's Hamlet as the "target"—the thing to be synthesized.
Then create a "corpus" that includes every word, line, stanza, and act of
Samuel Beckett's Waiting for Godot. The contents of the corpus thus fill
in the outline of the target. This is accomplished by finding best
substitutions for each word, line, stanza, or act of Hamlet, using the words,
lines, stanzas and acts of Waiting for Godot. A new play is thus
created, Waiting for Hamlet, which has characteristics of both plays.
The algorithm used in VLDCMCaR (pronounced vldcmcar) is basically the same as that used to create photo mosaics. A feature vector is created for each frame (window) of the target, which serves to characterize that frame. Each feature vector consists of six values (Table 1). The RMS feature of "Congregation.wav" is displayed in Figure 4.
| Feature Measure | Meaning of Feature |
|---|---|
| Number of Zero Crossings | General noisiness, existence of transients |
| Root Mean Square (RMS) | Mean acoustic energy (loudness) |
| Spectral Centroid | Mean frequency of total spectral energy |
| Spectral Rolloff | Frequency below which 85% of energy exists |
| Harmonicity | Deviation from a harmonic spectrum |
| Pitch | Estimate of fundamental frequency |

Table 1: Feature Vector Elements
Figure 4: Feature plots of a sound
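The two spectral features in Table 1 can be illustrated with a short sketch. This is written in Python for illustration only and is not the application's MATLAB code; the function names, the naive DFT, and the choice to weight by squared magnitude (energy) are all assumptions, since the text only gives the verbal definitions in the table.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (adequate for a short demo frame)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags

def spectral_centroid(mags, sample_rate, n):
    """Energy-weighted mean frequency in Hz (energy weighting is an assumption)."""
    energy = [m * m for m in mags]
    total = sum(energy)
    if total == 0:
        return 0.0
    return sum(k * sample_rate / n * e for k, e in enumerate(energy)) / total

def spectral_rolloff(mags, sample_rate, n, fraction=0.85):
    """Lowest bin frequency at or below which `fraction` of the energy lies."""
    energy = [m * m for m in mags]
    total = sum(energy)
    running = 0.0
    for k, e in enumerate(energy):
        running += e
        if running >= fraction * total:
            return k * sample_rate / n
    return (len(mags) - 1) * sample_rate / n
```

For a pure sinusoid both measures land on the frequency of the sinusoid; for noisier material the rolloff sits above the centroid, which is why the two can be useful as separate matching criteria.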
The target sound is analyzed with a user-specified window and hop size (window skip). These can be any number of samples depending on the requirements of the synthesis. Smaller windows create larger analysis databases with higher time-resolution, but less frequency-resolution. Larger windows create smaller analysis databases with higher frequency-resolution, but lower time-resolution. For each window a feature vector is produced and stored in the analysis database. This is done for both the target and corpus database. An example of an analysis is shown in Figure 5, in the analysis windows at the top left. Note that "Congregation.wav" was analyzed using a window size of 512 samples, and window skip of 256; "MonkeyCorpus.wav" was analyzed using a window size of 16,384 samples, and a hop size of 1,024 samples.
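The windowed analysis described above can be sketched as follows. This is a minimal Python illustration, not the application's MATLAB code: the function names and the dictionary layout of the "analysis database" are hypothetical, and only the two time-domain features (zero crossings and RMS) are computed.

```python
import math

def frames(signal, window_size, hop_size):
    """Yield successive windows of `window_size` samples, advancing by `hop_size`."""
    for start in range(0, len(signal) - window_size + 1, hop_size):
        yield signal[start:start + window_size]

def zero_crossings(frame):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))

def rms(frame):
    """Root mean square of the samples in one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def analyze(signal, window_size, hop_size):
    """Return one feature record per window (a toy analysis database)."""
    return [
        {"frame": i, "zero_crossings": zero_crossings(f), "rms": rms(f)}
        for i, f in enumerate(frames(signal, window_size, hop_size))
    ]
```

Halving the hop size roughly doubles the number of records returned, which is the trade-off between database size and time resolution noted above.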
With these analyses in hand, the target can be synthesized using the audio data of
the corpus. As long as the synthesis window skip matches the window skip of the
target, the target and synthesis will be the same duration. If the synthesis
window skip is twice as long as the target window skip, the synthesis will
be twice as long. The window size for the synthesis doesn't have to match the
window size of the corpus analysis. But if these are significantly different
there is a higher chance of making a mismatch, though even then the results can
be interesting.
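The relation between window skip and duration can be made concrete with a little arithmetic. The sketch below is in Python for illustration; the function names are hypothetical, and the length formula assumes that one synthesis window is written per target frame, each placed one synthesis skip after the previous one.

```python
def stretch_factor(target_hop, synth_hop):
    """Ratio of synthesis duration to target duration (approximate for long sounds)."""
    return synth_hop / target_hop

def synthesis_length(num_target_frames, synth_window, synth_hop):
    """Output length in samples when one window is written per target frame."""
    if num_target_frames == 0:
        return 0
    return (num_target_frames - 1) * synth_hop + synth_window
```

For instance, a target analyzed with a skip of 256 samples and synthesized with a skip of 1,024 samples gives `stretch_factor(256, 1024)`, a fourfold lengthening.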
Figure 5: Screen shot of VLDCMCaR (pronounced vldcmcar)
The synthesis algorithm iterates through the target database, finding matching frames from the corpus that fall within user-specified limits. For instance, in Figure 5 the user has specified in the bottom middle pane to first find all analysis frames in the corpus that have an RMS within ±5% of the target analysis frame RMS, and then out of those find the frames that have a spectral centroid within ±5% of the target analysis frame spectral centroid. Once the matches are found, one is selected either at random or by picking the closest frame, depending on the selected synthesis options.
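The cascaded search can be sketched as a sequence of filters, one per selected feature. The Python below is an illustration under stated assumptions, not the application's code: the names are hypothetical, and in particular the rule for the "closest" frame (smallest summed relative deviation over the selected features) is a guess, since the text does not define it.

```python
import random

def within(value, target, percent):
    """True if `value` lies within plus or minus `percent` of `target`."""
    return abs(value - target) <= abs(target) * percent / 100.0

def match_frame(target_vec, corpus, criteria, random_match=False, rng=random):
    """Return the index of a matching corpus frame, or None if there is none.

    criteria is an ordered list of (feature_name, percent_spread) pairs;
    each criterion narrows the candidates left by the previous one.
    """
    candidates = list(range(len(corpus)))
    for feature, percent in criteria:
        candidates = [i for i in candidates
                      if within(corpus[i][feature], target_vec[feature], percent)]
        if not candidates:
            return None
    if random_match:
        return rng.choice(candidates)

    def distance(i):
        # Assumed closeness measure: summed relative deviation.
        return sum(abs(corpus[i][f] - target_vec[f]) / (abs(target_vec[f]) or 1.0)
                   for f, _ in criteria)

    return min(candidates, key=distance)
```

Because each criterion only ever shrinks the candidate list, the order of the criteria does not change which frames survive all of them, but it does change the intermediate counts reported for each stage.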
Figure 6: 3 Windows 512 samples long, hop size 256
The matching frame is then accessed from the corpus audio-file and written into the target synthesis. In Figure 5 the user is synthesizing the target with a window size of 4,096 samples and a hop size of 256 samples. Since the target was analyzed with a hop size of 256 samples the synthesis will be as long as the target. A higher hop size will time-stretch the resynthesis of the target; a lower hop size will compress it. A graphic demonstration of the adding and shifting of windows is shown in Figure 6, which displays three Hann windows satisfying the constant overlap-add (COLA) condition.
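The adding and shifting of windows can be sketched as a plain overlap-add loop. This Python illustration uses hypothetical names and assumes a periodic Hann window, which sums exactly to one at a hop of half the window length; a symmetric Hann window only approximates this.

```python
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def overlap_add(frames, hop):
    """Apply a Hann window to each frame and sum it into the output at k * hop."""
    if not frames:
        return []
    size = len(frames[0])
    window = hann(size)
    out = [0.0] * ((len(frames) - 1) * hop + size)
    for k, frame in enumerate(frames):
        for i in range(size):
            out[k * hop + i] += window[i] * frame[i]
    return out
```

Feeding three frames of ones with a hop of half the window length reproduces the situation in Figure 6: wherever two windows overlap, their sum is constant.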
The user can specify any number of the six features to match, in any order; but as the
number of these increases the probability of finding a matching frame in the
corpus becomes small unless the corpus grows in size. In this particular case
the corpus is made from 30 audio-files of recorded monkey sounds. The total
duration of this corpus is only about seven minutes. Other corpora I have used
include a 3-CD set of Lawrence Welk's "champagne music", Schoenberg's four
string quartets, and sixty minutes of solo alto saxophone works by Anthony
Braxton. It is also possible to create a corpus from several different
sources, for instance a Bach flute partita, an Imam chanting the Koran, and
Charles Ives' fourth symphony.
In addition to the matching criteria the user can specify other options for the
synthesis. Figure 7 shows options presently available for synthesis. "Force
match" picks the best match for the checked matching criteria if none is found
within the given limits. "Random match" means if the algorithm finds more than
one suitable match, choose one at random instead of picking the best. "Force
RMS" makes the corpus frame have the same RMS as the target frame. Basically
this applies the target amplitude envelope to the synthesis. When "Extend
matches" is checked, any frame with zero matches will be filled by extending the
previous match. Thus if only the first frame of the target finds a match, the
rest of the synthesis will be the extension of that frame. Finally "Reverse
Samples" flips the matching corpus samples. Other synthesis options will be added.
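Three of these options are simple enough to sketch. The Python below is illustrative only and uses hypothetical names. The "Force RMS" and "Reverse Samples" functions follow directly from the descriptions above; the "Extend matches" function encodes one plausible reading, namely that an unmatched frame continues reading the corpus one hop past the previous frame's position. The text does not say exactly how the extension is done, so that part is an assumption.

```python
import math

def rms(frame):
    """Root mean square of a frame; 0.0 for an empty frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame)) if frame else 0.0

def force_rms(frame, target_rms):
    """Scale a corpus frame so its RMS equals that of the target frame."""
    current = rms(frame)
    if current == 0:
        return list(frame)  # silence cannot be rescaled
    gain = target_rms / current
    return [x * gain for x in frame]

def reverse_samples(frame):
    """Flip the order of the samples in a frame."""
    return frame[::-1]

def extend_matches(offsets, hop):
    """Fill unmatched frames (None) by reading on from the previous offset.

    offsets holds one corpus sample offset per target frame. Assumed reading:
    an unmatched frame starts `hop` samples after the previous frame's offset.
    Leading unmatched frames stay None because there is nothing to extend.
    """
    out = []
    for off in offsets:
        if off is None and out and out[-1] is not None:
            off = out[-1] + hop
        out.append(off)
    return out
```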
Figure 7: Options for synthesis
Figure 5 shows that the user is synthesizing the target with a window size of
4,096. Even though the corpus was analyzed with a window length of 16,384
samples, the synthesis will use only the first 4,096 samples of that. If a
window size larger than 16,384 samples is specified then additional
samples will be taken outside the end of the window. In this way there is
little practical need to create several corpus analysis databases with different
window sizes and hop sizes. A more useful analysis window size would be
about 4,096 samples, or 0.1 seconds, with a hop size of at most 2,048
samples, or 0.05 seconds. Larger window sizes could then be specified for
the synthesis without hurting the matching criteria.
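The way the synthesis window is read from the corpus, independently of the analysis window size, can be sketched as a simple slice. This Python illustration uses hypothetical names; it assumes each analysis frame starts at its index times the corpus hop size, and the zero padding at the end of the corpus is an assumption the text does not address.

```python
def corpus_segment(corpus_audio, frame_index, corpus_hop, synth_window):
    """Read `synth_window` samples starting where the matched analysis frame starts.

    A synthesis window shorter than the analysis window uses only its first
    samples; a longer one simply reads on past the end of the analysis window.
    """
    start = frame_index * corpus_hop
    segment = list(corpus_audio[start:start + synth_window])
    # Assumption: pad with silence if the read runs off the end of the corpus.
    return segment + [0.0] * (synth_window - len(segment))
```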
Figure 8: A view of the synthesis using Monkeys to fill in speech
Once the synthesis has finished VLDCMCaR (pronounced vldcmcar) displays
the resynthesized sound in the upper right corner and the matching process
output in the lower right corner (Figure 8). The output shows that for frame
684, 109 corpus frames match the RMS criterion. Of these,
only 2 also satisfy the spectral centroid threshold. The
result can be played and exported as an audio-file. The target audio waveform
is shown in the "Get Data Points" window. One can see the resemblance in the
synthesis at the top right.
I have created several sound examples using VLDCMCaR (pronounced vldcmcar). I am currently using this software to compose material for "Concatenative Variations on a Theme by Mahler." This will be a multi-movement work for multiple channels, demonstrating these methods and exploring an aesthetic of extreme remixing and recontextualization. Table 2 presents some of the interesting results I have obtained for three types of targets: percussive and dynamic, speech, and instrumental.
Click on entries in the "Result" column to hear the synthesis. The "Target" column contains the original sound used, information about where it came from, and its analysis parameters. For instance, the Schoenberg target came from his fourth string quartet and was analyzed with a window size of 4,096 points and a hop size of 2,048 points. The "Corpus" column contains information about the corpus content. The Monkeys corpus was analyzed with a window size of 16,384 samples and a hop size of 1,024 samples, and contains a total of 18,861 feature vectors, or points. The "Synthesis Window" columns contain information about the synthesis. For instance, sound example "B1" was synthesized with a window and hop size of 256 samples, and a rectangular window shape. The "Matching Criteria" column contains (if I remembered) the feature vector search parameters. Finally the "Other Options" column contains the synthesis options I specified (Figure 7), or anything else of note.
| Result | Target | Corpus | Synthesis Window Size | Synthesis Window Skip | Synthesis Window Shape | Matching Criteria | Other Options |
|---|---|---|---|---|---|---|---|
| 01 | Mahler, Ritenuto (Second Symphony) (2048,1024) | Mahler, Ritenuto (934 points) | 2048 | 1024 | Hann | RMS ± 0%; Spectral Rolloff ± 0% | none |
| 02 | Mahler, Ritenuto (Second Symphony) (2048,1024) | Mahler, Ritenuto (934 points) | 2048 | 1024 | Hann | RMS ± 10%; Spectral Rolloff ± 10% | Random Match |
| A1 | Mahler, Ritenuto (Second Symphony) (2048,1024) | Monkeys (16384,1024) (18,861 points) | 16384 | 1024 | Hann | RMS ± 5%; Spectral Rolloff ± 10% | Mixed with Mahler |
| A2 | Mahler, Ritenuto (Second Symphony) (2048,1024) | Monkeys | 16384 | 4096 | Hann | RMS ± 5%; Spectral Rolloff ± 10% | none |
| A3 | Mahler, Ritenuto (Second Symphony) (2048,1024) | Animals (22050,2048) (80,003 points) | 2048 | 1024 | Hann | RMS ± 5%; Spectral Rolloff ± 5% | |
| A4 | Mahler, Ritenuto (Second Symphony) (2048,1024) | Tea Cans (1024,512) (3,143 points) | 1024 | 512 | Hann | (can't remember) | |
| A5 | Mahler, Ritenuto (Second Symphony) (2048,1024) | Imam Chanting Koran (8192,2049) (4,682 points) | 4096 | 2049 | Hann | Spectral Centroid ± 0.1%; Spectral Rolloff ± 0.1% | Force RMS; Extend Matches |
| A6 | Mahler, Ritenuto (Second Symphony) (2048,1024) | John Cage Music for Voice (2048,1024) (288,084 points) | 5000 | 2250 | Hann | (can't remember) | none |
| A7 | Mahler, Ritenuto (Second Symphony) (2048,1024) | Lawrence Welk 3-CD set (2048,1024) (381,361 points) | 11000 | 5000 | Tukey, 25% | | |
| A8 | Mahler, Ritenuto (Second Symphony) (2048,1024) | Schoenberg String Quartets 1-4 (2048,1024) (359,485 points) | 10000 | 1024 | Hann | Spectral Centroid ± 0.05%; Spectral Rolloff ± 0.1% | Force RMS; Extend Matches |
| B1 | George W. Bush (512,256) | Monkeys | 256 | 256 | Rectangle | (can't remember) | none |
| B2 | George W. Bush (512,256) | Monkeys | 512 | 256 | Hann | | |
| B3 | George W. Bush (512,256) | Monkeys | 2048 | 256 | Hann | | |
| B4 | George W. Bush (512,256) | Tea Cans | 512 | 256 | Hann | | |
| B5 | George W. Bush (512,256) | Mahler, Ritenuto (2048,1024) | 256 | 256 | Rectangle | | |
| B6 | George W. Bush (512,256) | Bach, Partita 1 (2048,512) (56,657 points) | 2048 | 256 | Hann | Spectral Centroid ± 1%; Spectral Centroid ± 1% | Force Match; Force RMS |
| C1 | "Congregation" (Ministry, "Psalm 69") (512,256) | Tea Cans | 512 | 256 | Hann | (can't remember) | none |
| C2 | "Congregation" (Ministry, "Psalm 69") (512,256) | Animal Sounds | 22050 | 1024 | Hann | | |
| C3 | "Congregation" (Ministry, "Psalm 69") (512,256) | Anthony Braxton (sax) (2048,1024) (102,505 points) | 2048 | 256 | Hann | Spectral Centroid ± 0.1%; Spectral Rolloff ± 0.1% | Force RMS; Extend Matches |
| C4 | "Congregation" (Ministry, "Psalm 69") (512,256) | Lawrence Welk | 8192 | 4096 | Hann | Spectral Centroid ± 0.1%; Spectral Rolloff ± 0.1% | Force RMS; Extend Matches |
| D1 | Schoenberg, Mvt 1 (String Quartet 4) (4096,2048) | Anthony Braxton | 2048 | 1024 | Hann | Spectral Centroid ± 0.1%; Spectral Rolloff ± 0.1% | Force RMS; Extend Matches |
| D2 | Schoenberg, Mvt 1 (String Quartet 4) (4096,2048) | Anthony Braxton | 1024 | 512 | Hann | Spectral Centroid ± 0.5%; Spectral Rolloff ± 0.5% | Force RMS; Extend Matches |
| D3 | Schoenberg, Mvt 1 (String Quartet 4) (4096,2048) | Anthony Braxton | 4096 | 2048 | Hann | Pitch ± 1% | First 200 frames; Force RMS; Extend Matches |

Table 2: Synthesis Examples

Examples 01 and 02 demonstrate that the algorithm is working as predicted. In example 01 the corpus is the target itself and both matching criteria are set to ± 0%, so every frame must match exactly, and indeed the synthesis is a perfect reconstruction of the target. In example 02 a larger range is permitted, and one of the valid frames is chosen at random.
Mahler's dynamic Ritenuto from his second symphony provides a well-defined form using percussion and brass. This form is reflected quite markedly in examples A1-8. Examples A1 and A2 provide a humorous but significant example of a realistic monkey community becoming more and more agitated. At times example A5 uses corpus frames that sound like a rolling snare drum, but are really the sibilance from the chanting. In the Lawrence Welk synthesis of Mahler there is a great high note hit by wavering clarinets, concluding a gradual and beautiful climax.
The two speech samples provide a different type of target. What matters here is
not rise and fall, pitch and rhythm, but sibilance, vowel formants, and
gutturals. No matter the corpus material, the words Bush utters can be picked
out, if not completely understood (depending on whether you heard the original
target beforehand). Much of the sibilance remains, even when using the Bach
flute Partita. Examples B1,2,4, with such small window sizes, are basically
granular synthesis. The "congregation" example provides another speech example,
but the synthesis is a little different. Example C2 lengthens the original by
four times. Words can still be understood, though at a much slower pace.
Example C4 does the same thing, but expands the target 16 times, uses Lawrence
Welk for the corpus, extends the matches, and forces the target RMS. The result
is a Welk medley with the same amplitude envelope as the target, but a duration
of 2.5 minutes. (If you hear words in this you are indeed special.)
Examples D1-3 show the application of these methods to a short segment of
Schoenberg's introduction to his fourth string quartet. The synthesis uses
sixty minutes of Anthony Braxton playing solo alto saxophone. Examples D1 and D2
shorten the target by factors of 2 and 4 respectively. The pitch matching criterion is
used for example D3. It performs somewhat well when only one instrument is
heard and is stable, but breaks down in interesting ways when other pitches are
present. While filtering the target into bands might relieve this problem, a
better pitch algorithm is sorely needed.
I have been interested in using as source material the entire recorded collection of works by an artist like Mozart, Bach, Beethoven, or Mahler. Initially I was thinking of remixing the bits and pieces randomly to generate hours of new music. Selection only by randomness however produces a schizophrenic and uniform experience that has no feeling of rise, fall, tension, relaxation, discord, concord, and resolution—aspects that I strive to create in my work. Instead, by filling in an outline that has these aspects, with the material in a corpus, more control is available. What comes out of VLDCMCaR (pronounced vldcmcar) can be a complete composition; however there is usually more work necessary to compose a more coherent work.
It can be envisioned that a composer sings and speaks parts of their
composition into a soundfile and uses this as the target. The corpus will
follow these "directions" and fill in the outline. This is a unique method
for composition. Even more advanced would be a real-time application of these
methods. One could sing with a chorus of monkeys, or speak a symphony of Mahler.
Thus speech would become music, and perhaps music could become speech.
Another potential application is the use of concatenative synthesis for data
sonification. I will be looking into using this software for the sonification
of time-domain data. An upcoming conference is calling for works involving the
sonification of a common dataset: a 22-channel electroencephalogram (EEG) of a
person listening to popular Australian music. While the target is not audio
data, its outline could be used to form the material of the corpus. Thus the
EEG of the person listening to music will be used to recompose the music. This
presents an interesting alternative to listening directly to the data, or
translating it to MIDI data. A user could select a corpus of sounds to reflect
the activity of the data, filling in the outline. This is an entirely new method
for sonification.
In its present state VLDCMCaR (pronounced vldcmcar) will not be publicly distributed. Eventually (around the time of the 2004 International Computer Music Conference) this software will be made public, open source, and free. By then I will have had time to fix and optimize the application. Of prime importance are implementing the harmonicity algorithm and refining the pitch algorithm.
There is considerable work to be done on several aspects of the application.
The format of the database needs to be redesigned. As it stands it generates a
corpus audio file that can be indexed into by the synthesis routine. This
obviously duplicates large amounts of data. Instead the database should
contain paths and filenames of all audio files used to create it. That way the
synthesis function can index into each necessary audio file, rather than one
large corpus sound file.
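One way the redesigned database might look is sketched below. This is a hypothetical Python illustration of the idea only, not a description of any existing format: each feature vector records the path of its source file and a sample offset into it, so no audio needs to be duplicated. The class and function names, and the use of already decoded sample lists in place of real file reading, are assumptions.

```python
from dataclasses import dataclass

@dataclass
class CorpusFrame:
    path: str          # audio file the frame came from
    start_sample: int  # offset of the frame within that file
    features: dict     # feature vector for the frame

def build_index(files, window_size, hop_size, analyze_frame):
    """Build a list of CorpusFrame records from a mapping of path to samples.

    analyze_frame is any function turning one window of samples into a
    feature dictionary; decoding the audio files is left out of this sketch.
    """
    index = []
    for path, samples in files.items():
        for start in range(0, len(samples) - window_size + 1, hop_size):
            window = samples[start:start + window_size]
            index.append(CorpusFrame(path, start, analyze_frame(window)))
    return index
```

At synthesis time a matched record would supply the file to open and the offset to read from, in place of an index into one large corpus sound file.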
Additional work will explore filtering the target into bands and then generating
analyses for each band. That way a fuller representation of the target can be
created. Each band is then synthesized and combined to form the result. A
composition could thus consist of a Russian Orthodox Men's Chorus in the lower
frequencies, and Diamanda Galas in the higher frequencies.
VLDCMCaR (pronounced vldcmcar) is an acronym for "Very Large Database for Concatenative Music Composition and Recontextualization." The MATLAB GUI I have presented here is really the interface to the VLDCMCaR, so a new name is needed to describe it. This application demonstrates that these relatively unintelligent methods of concatenative synthesis for composition (compared with machine listening) are very interesting and effective in musical contexts. Further work is justified to port these methods to faster programming languages, utilize smarter database systems, and approach a real-time implementation for performance.
Hazel, S. Soundmosaic, 2003.
Hunt, A., A. Black. "Unit selection in a concatenative speech synthesis system using a large speech database." ICASSP, vol. 1, pp. 373-376, 1996.
Lazier, A., P. Cook. "Mosievius: Feature Driven Interactive Audio Mosaicing." In Proceedings of the 6th International Conference on Digital Audio Effects (DAFx-03), London, UK, September 8-11, 2003.
Oswald, J. Plunderphonics, 2003.
Schwarz, D. "A System for Data-Driven Concatenative Sound Synthesis." In Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-00), Verona, Italy, December 7-9, 2000.
Schwarz, D. "New Developments in Data-Driven Concatenative Sound Synthesis." In Proceedings of the 2003 International Computer Music Conference, Singapore, 2003.
Silver, R. Photomosaics, 2003.
Sturm, B. L. "MATConcat: An Application for Exploring Concatenative Sound Synthesis Using MATLAB." In Proceedings of the 2004 International Computer Music Conference, Miami, Florida, 2004.
Zils, A., F. Pachet. "Musical Mosaicing." In Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-01), Limerick, Ireland, December 6-8, 2001.