VLDCMCaR (pronounced vldcmcar)
Very Large Database for Concatenative Music Composition and Recontextualization

Bob L. Sturm
Graduate Media Arts & Technology
University of California, Santa Barbara
December 4, 2003

 


Abstract

VLDCMCaR (pronounced vldcmcar) is a MATLAB application for exploring concatenative audio synthesis using six independent matching criteria. The entire application is contained in a graphical user interface (GUI). Using this program, a sound or composition can be concatenatively synthesized from audio segments in a corpus database of any size. Mahler can be synthesized using hours of Lawrence Welk; howling monkeys can approximate a speech by President Bush; and a Schoenberg string quartet can be remixed using Anthony Braxton playing alto saxophone. A composer could also record a vocalization of a composition and fill it in using any sound material desired.


Background

A mosaic is an image assembled from many small pieces that together create the perception of a larger image. Viewed up close the image isn't clear; from farther away, and with a slight blurring of one's eyes, the image emerges. Figure 1 shows a mosaic assembled from hundreds of photographs (Figure 2) rather than colored tiles (Silver 2003). A simple algorithm selects the picture-tile that best approximates each portion of the original image. For instance, it looks for red images in a red region, and for images whose lines and content approximate those in the original (see the slanted grill in the center image).

Figure 1: Photo-Mosaic

Figure 2: Detail of Figure 1


This technique is similar to the drawing method shown in Figure 3. Using a network of strings for reference, the artist fills in each square on the paper with what he sees in the corresponding square of his view. Square by square, the original is transcribed onto paper. Even though the artist isn't replacing each square with a different picture, he breaks the original down into parts and uses the strings as reference points to create the complete picture. The artist's algorithm here is essentially the same as that of the computer matching tiles. It should also be noted that any image can be enlarged or shrunk using a proportional network of strings.

Figure 3: Albrecht Dürer: An Artist Drawing Through a Network of Strings


A method similar to photo-mosaicing exists in the synthesis of speech, called "concatenative speech synthesis" (Hunt 1996). This technique, developed as early as the sixties, is used for text-to-speech synthesis. The computer segments a text into chunks that are then synthesized using a large database of spoken sound units called diphones. For instance, to synthesize the word "cabbage" the computer segments it into units like "ka," "ba," and "g," and then finds the best matches in a large database of spoken diphones. These components are strung together to produce the sound of the word.


Concatenative speech synthesis and photo mosaic methods have recently been applied to creating "audio mosaics," or "musaics" (Hazel 2003; Lazier 2003; Schwarz 2000, 2003; Zils 2001). From a "corpus" of sound samples, best matches are found to approximate a given "target" sound. The methods for analyzing the target and corpus can be very complex, perhaps entailing pattern recognition and score following. Whereas Schwarz and Zils use the computer to make the matches automatically, composer John Oswald selects and assembles his sound snippets by hand (Oswald 2003).


Proposal

While the authors cited above have done significant research in this field (Schwarz 2003; Zils 2001), their software isn't directly accessible and does not approach the methods in the way I want. I decided instead to create a prototype application in MATLAB to explore the creative usefulness of these methods. Instead of using pattern recognition and score following, VLDCMCaR (pronounced vldcmcar) makes matches based on up to six basic statistical quantities computed over uniformly applied analysis windows. The user can select any of these parameters, specify an allowable spread in percent, and choose what happens when either too many or no matches are found.


A good analogy for this technique is the following. First make an outline of Shakespeare's Hamlet as the "target", the thing to be synthesized. Then create a "corpus" that includes every word, line, scene, and act of Samuel Beckett's Waiting for Godot. The contents of the corpus then fill in the outline of the target: the best substitution is found for each word, line, scene, or act of Hamlet using the words, lines, scenes, and acts of Waiting for Godot. A new play is thus created, Waiting for Hamlet, which has characteristics of both plays.


Methods

The algorithm used in VLDCMCaR (pronounced vldcmcar) is basically the same as that used to create photo mosaics. A feature vector is created for each frame (window) of the target, which serves to characterize it. Each feature vector consists of six values (Table 1). The RMS feature of "Congregation.wav" is displayed in Figure 4.


Feature Measure              Meaning of Feature
Number of Zero Crossings     General noisiness, presence of transients
Root Mean Square (RMS)       Mean acoustic energy (loudness)
Spectral Centroid            Mean frequency of the total spectral energy
Spectral Rolloff             Frequency below which 85% of the energy lies
Harmonicity                  Deviation from a harmonic spectrum
Pitch                        Estimate of the fundamental frequency

Table 1: Feature Vector Elements

Figure 4: Feature plots of a sound
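
To make the feature computations concrete, the following is a minimal MATLAB sketch of a per-window feature extractor. The function name and feature ordering are my own, the 85% rolloff threshold follows Table 1, and the harmonicity and pitch features (which need more elaborate algorithms) are omitted; this is illustrative, not the program's actual code.

    function f = framefeatures(x, fs)
    % Compute four of the Table 1 features for one analysis frame.
    % x is a column vector of samples; fs is the sample rate in Hz.
    N     = length(x);
    X     = abs(fft(x));
    X     = X(1:floor(N/2));                  % positive frequencies only
    freqs = (0:floor(N/2)-1)' * fs/N;         % bin center frequencies

    zc    = sum(abs(diff(double(x >= 0))));   % number of zero crossings
    rmsv  = sqrt(mean(x.^2));                 % root mean square (loudness)
    cent  = sum(freqs .* X) / (sum(X) + eps); % spectral centroid
    E     = cumsum(X.^2);                     % cumulative spectral energy
    roll  = freqs(min(find(E >= 0.85*E(end)))); % spectral rolloff (85%)

    f = [zc rmsv cent roll];                  % harmonicity, pitch omitted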


The target sound is analyzed with a user-specified window and hop size (window skip). These can be any number of samples depending on the requirements of the synthesis. Smaller windows create larger analysis databases with higher time-resolution, but less frequency-resolution. Larger windows create smaller analysis databases with higher frequency-resolution, but lower time-resolution. For each window a feature vector is produced and stored in the analysis database. This is done for both the target and corpus database. An example of an analysis is shown in Figure 5, in the analysis windows at the top left. Note that "Congregation.wav" was analyzed using a window size of 512 samples, and window skip of 256; "MonkeyCorpus.wav" was analyzed using a window size of 16,384 samples, and a hop size of 1,024 samples.
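
A sketch of this framewise analysis loop, under the same assumptions as above (wavread was MATLAB's soundfile reader at the time; the file name and the framefeatures helper are illustrative):

    [x, fs] = wavread('Congregation.wav');   % read the target soundfile
    x = mean(x, 2);                          % mix to mono if stereo
    winsize = 512;                           % analysis window, in samples
    hop     = 256;                           % window skip, in samples
    nframes = floor((length(x) - winsize)/hop) + 1;
    targetdb = zeros(nframes, 4);            % one feature vector per row
    for k = 1:nframes
        frame = x((k-1)*hop + (1:winsize));
        targetdb(k,:) = framefeatures(frame, fs);
    end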


With these analyses the target can then be synthesized using the audio data of the corpus. As long as the synthesis window skip matches the window skip of the target analysis, the target and synthesis will have the same duration; the synthesis lasts roughly the number of target frames times the synthesis window skip, in samples. If the synthesis window skip is twice the target window skip, the synthesis will be twice as long. The window size for the synthesis doesn't have to match the window size of the corpus analysis, but if these are significantly different there is a higher chance of a mismatch, though even then the results can be interesting.

Figure 5: Screen shot of VLDCMCaR (pronounced vldcmcar)


The synthesis algorithm iterates through the target database, finding matching frames from the corpus that are optimal within user-specified limits. For instance, in Figure 5 the user has specified in the bottom middle pane to first find all analysis frames in the corpus with an RMS within ±5% of the target frame's RMS, and then, of those, the frames with a spectral centroid within ±5% of the target frame's spectral centroid. Once the matches are found, one is selected either at random or as the closest frame, depending on the selected synthesis options.
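
The cascaded search might look like the following sketch. The column indices follow my feature ordering above (2 = RMS, 3 = spectral centroid), and randmatch stands in for the "Random match" option described below; none of these names come from the program itself.

    t   = targetdb(k,:);                 % feature vector of target frame k
    tol = 0.05;                          % +/- 5 percent allowable spread
    cand = find(abs(corpusdb(:,2) - t(2)) <= tol*abs(t(2)));    % RMS pass
    cand = cand(abs(corpusdb(cand,3) - t(3)) <= tol*abs(t(3))); % centroid pass
    if isempty(cand)
        match = [];                      % no match: force, extend, or skip
    elseif randmatch
        match = cand(ceil(rand*length(cand)));       % pick one at random
    else
        [d, i] = min(abs(corpusdb(cand,3) - t(3)));  % closest on last criterion
        match = cand(i);
    end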

Figure 6: 3 Windows 512 samples long, hop size 256


The matching frame is then accessed from the corpus audio file and written into the target synthesis. In Figure 5 the user is synthesizing the target with a window size of 4,096 samples and a hop size of 256 samples. Since the target was analyzed with a hop size of 256 samples, the synthesis will be as long as the target. A larger hop size will time-stretch the resynthesis of the target; a smaller one will compress it. A graphic demonstration of the adding and shifting of windows is shown in Figure 6, which displays three Hann windows satisfying the constant overlap-add (COLA) condition.
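
Figure 6 corresponds to the classic overlap-add loop sketched below, with a hop of half the window so the shifted Hann windows sum to a constant (the COLA condition). Here corpusaudio and startpos are illustrative stand-ins for the corpus samples and the matched frames' start positions.

    winsize = 512;  hop = 256;               % hop = winsize/2, as in Figure 6
    w = 0.5 - 0.5*cos(2*pi*(0:winsize-1)'/winsize);  % periodic Hann window
    y = zeros((nframes-1)*hop + winsize, 1); % output synthesis buffer
    for k = 1:nframes
        grain = corpusaudio(startpos(k) + (0:winsize-1)); % matched samples
        idx = (k-1)*hop + (1:winsize);
        y(idx) = y(idx) + w .* grain(:);     % window, shift, and add
    end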


The user can specify any number of the six features to match, in any order; but as their number increases, the probability of finding a matching frame becomes small unless the corpus grows in size. In this particular case the corpus is made from 30 audio files of recorded monkey sounds, with a total duration of only about seven minutes. Other corpora I have used include a 3-CD set of Lawrence Welk's "champagne music", Schoenberg's four string quartets, and sixty minutes of solo alto saxophone works by Anthony Braxton. It is also possible to create a corpus from several different sources, for instance a Bach flute partita, an Imam chanting the Koran, and Charles Ives' fourth symphony.


In addition to the matching criteria, the user can specify other options for the synthesis; Figure 7 shows those presently available. "Force match" picks the best match for the checked matching criteria if none is found within the given limits. "Random match" means that if the algorithm finds more than one suitable match, it chooses one at random instead of picking the best. "Force RMS" scales the corpus frame to the same RMS as the target frame, essentially applying the target's amplitude envelope to the synthesis. When "Extend matches" is checked, any frame without a match is replaced by extending the previous match; thus if only the first frame of the target finds a match, the rest of the synthesis will be an extension of that frame. Finally, "Reverse Samples" reverses the matched corpus samples in time. Other synthesis options will be added.
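
For example, "Force RMS" amounts to a per-frame gain, roughly as in this sketch (targetframe and grain stand for the current target and matched corpus frames):

    targetrms = sqrt(mean(targetframe.^2));       % loudness of target frame
    grainrms  = sqrt(mean(grain.^2));             % loudness of corpus frame
    grain = grain * (targetrms/(grainrms + eps)); % impose target envelope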

Figure 7: Options for synthesis


Figure 5 shows that the user is synthesizing the target with a window size of 4,096 samples. Even though the corpus was analyzed with a window length of 16,384 samples, the synthesis will use only the first 4,096 samples of each matched frame. If a window size larger than 16,384 samples is specified, additional samples are simply read beyond the end of the analysis window. In this way there is little practical need to create several corpus analysis databases with different window and hop sizes. A useful analysis window size is about 4,096 samples (roughly 0.1 seconds at 44.1 kHz), with a hop size of at most 2,048 samples (about 0.05 seconds). Larger window sizes can then be specified for the synthesis without compromising the matching.
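
One way such a read might be guarded in practice, sketched with the same illustrative names as above: request synthsize samples starting at the matched frame, and zero-pad if the request runs past the end of the corpus audio.

    synthsize = 4096;                     % synthesis window, in samples
    i0 = startpos(match);                 % first sample of matched frame
    i1 = min(i0 + synthsize - 1, length(corpusaudio));
    grain = corpusaudio(i0:i1);           % may read past analysis window end
    grain(end+1:synthsize) = 0;           % zero-pad if the corpus ran out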

Figure 8: A view of the synthesis using Monkeys to fill in speech


Once the synthesis has finished, VLDCMCaR (pronounced vldcmcar) displays the resynthesized sound in the upper right corner and the output of the matching process in the lower right corner (Figure 8). As can be seen there, for frame 684 the number of corpus frames matching the RMS criterion is 109; of those, only two also satisfy the spectral centroid threshold. The result can be played and exported as an audio file. The target audio waveform is shown in the "Get Data Points" window, and its resemblance to the synthesis at the top right is apparent.


Examples

I have created several sound examples using VLDCMCaR (pronounced vldcmcar). I am currently using this software to compose material for "Concatenative Variations on a Theme by Mahler." This will be a multi-movement work for multiple channels, demonstrating these methods and exploring an aesthetic of extreme remixing and recontextualization. Table 2 presents some of the interesting results I have obtained for three types of targets: percussive and dynamic, speech, and instrumental.


Click on entries in the "Result" column to hear the synthesis. The "Target" column contains the original sound used, information about where it came from, and its analysis parameters. For instance, the Schoenberg target came from his fourth string quartet and was analyzed with a window size of 4,096 points and a hop size of 2,048 points. The "Corpus" column contains information about the corpus content; the Monkeys corpus was analyzed with a window size of 16,384 samples and a hop size of 1,024 samples, and contains a total of 18,861 feature vectors, or points. The synthesis window columns contain information about the synthesis; for instance, sound example B1 was synthesized with a window and hop size of 256 samples and a rectangular window shape. The "Matching Criteria" column contains (if I remembered) the feature vector search parameters. Finally, the "Other Options" column contains the synthesis options I specified, or anything else of note (Figure 7).


(Blank cells repeat the entry above. Parenthesized pairs give the analysis window size and hop in samples; "points" counts feature vectors. Size, Skip, and Shape describe the synthesis window.)

Result | Target | Corpus | Size | Skip | Shape | Matching Criteria | Other Options
01 | Mahler, Ritenuto (Second Symphony) (2048,1024) | Mahler, Ritenuto (934 points) | 2048 | 1024 | Hann | RMS ±0%; Spectral Rolloff ±0% | none
02 | | | 2048 | 1024 | Hann | RMS ±10%; Spectral Rolloff ±10% | Random Match
A1 | | Monkeys (16384,1024) (18,861 points) | 16384 | 1024 | Hann | RMS ±5%; Spectral Rolloff ±10% | Mixed with Mahler
A2 | | Monkeys | 16384 | 4096 | Hann | RMS ±5%; Spectral Rolloff ±10% | none
A3 | | Animals (22050,2048) (80,003 points) | 2048 | 1024 | Hann | RMS ±5%; Spectral Rolloff ±5% |
A4 | | Tea Cans (1024,512) (3,143 points) | 1024 | 512 | Hann | (can't remember) |
A5 | | Imam Chanting Koran (8192,2049) (4,682 points) | 4096 | 2049 | Hann | Spectral Centroid ±0.1%; Spectral Rolloff ±0.1% | Force RMS; Extend Matches
A6 | | John Cage, Music for Voice (2048,1024) (288,084 points) | 5000 | 2250 | Hann | (can't remember) | none
A7 | | Lawrence Welk 3-CD set (2048,1024) (381,361 points) | 11000 | 5000 | Tukey, 25% | |
A8 | | Schoenberg String Quartets 1-4 (2048,1024) (359,485 points) | 10000 | 1024 | Hann | Spectral Centroid ±0.05%; Spectral Rolloff ±0.1% | Force RMS; Extend Matches
B1 | George W. Bush (512,256) | Monkeys | 256 | 256 | Rectangle | (can't remember) | none
B2 | | Monkeys | 512 | 256 | Hann | |
B3 | | Monkeys | 2048 | 256 | Hann | |
B4 | | Tea Cans | 512 | 256 | Hann | |
B5 | | Mahler, Ritenuto (2048,1024) | 256 | 256 | Rectangle | |
B6 | | Bach, Partita 1 (2048,512) (56,657 points) | 2048 | 256 | Hann | Spectral Centroid ±1%; Spectral Rolloff ±1% | Force Match; Force RMS
C1 | "Congregation" (Ministry, "Psalm 69") (512,256) | Tea Cans | 512 | 256 | Hann | (can't remember) | none
C2 | | Animal Sounds | 22050 | 1024 | Hann | |
C3 | | Anthony Braxton (sax) (2048,1024) (102,505 points) | 2048 | 256 | Hann | Spectral Centroid ±0.1%; Spectral Rolloff ±0.1% | Force RMS; Extend Matches
C4 | | Lawrence Welk | 8192 | 4096 | Hann | Spectral Centroid ±0.1%; Spectral Rolloff ±0.1% | Force RMS; Extend Matches
D1 | Schoenberg, Mvt 1 (String Quartet 4) (4096,2048) | Anthony Braxton | 2048 | 1024 | Hann | Spectral Centroid ±0.1%; Spectral Rolloff ±0.1% | Force RMS; Extend Matches
D2 | | | 1024 | 512 | Hann | Spectral Centroid ±0.5%; Spectral Rolloff ±0.5% | Force RMS; Extend Matches
D3 | | | 4096 | 2048 | Hann | Pitch ±1% | First 200 frames; Force RMS; Extend Matches

Table 2: Synthesis Examples

Examples 01 and 02 demonstrate that the algorithm works as predicted. In example 01 the RMS and spectral rolloff must match exactly, and indeed the synthesis is a perfect reconstruction of the target. In example 02 a wider tolerance is permitted, and from the valid frames one is chosen at random.

Mahler's dynamic Ritenuto from his second symphony provides a well-defined form using percussion and brass. This form is reflected quite markedly in examples A1-8. Examples A1 and A2 provide a humorous but significant result: a realistic monkey community becoming more and more agitated. At times example A5 uses corpus frames that sound like a rolling snare drum, but this is really the sibilance of the chanting. In the Lawrence Welk synthesis of Mahler (A7), a great high note hit by wavering clarinets concludes a gradual and beautiful climax.


The two speech samples provide a different type of target. What matters here is not rise and fall, pitch and rhythm; it is sibilance, vowel formants, and gutturals. No matter the corpus material, the words Bush utters can be picked out, if not completely understood (depending on whether you heard the original target beforehand). Much of the sibilance remains, even when using the Bach flute partita. Examples B1, B2, and B4, with such small window sizes, are essentially granular synthesis. The "Congregation" example provides another speech target, but the synthesis is a little different. Example C2 lengthens the original to four times its duration; words can still be understood, though at a much slower pace. Example C4 does the same thing, but expands the target sixteen times, uses Lawrence Welk for the corpus, extends the matches, and forces the target RMS. The result is a Welk medley with the same amplitude envelope as the target, but a duration of 2.5 minutes. (If you hear words in this you are indeed special.)


Examples D1-3 show the application of these methods to a short segment of the introduction to Schoenberg's fourth string quartet. The synthesis uses sixty minutes of Anthony Braxton playing solo alto saxophone. Examples D1 and D2 shorten the target by factors of two and four, respectively. The pitch matching criterion is used in example D3. It performs reasonably well when a single, stable pitch is present, but breaks down in interesting ways when other pitches occur. While filtering the target into bands might relieve this problem, a better pitch algorithm is sorely needed.


Applications

I have been interested in using as source material the entire recorded output of a composer like Mozart, Bach, Beethoven, or Mahler. Initially I thought of remixing the bits and pieces randomly to generate hours of new music. Selection by randomness alone, however, produces a schizophrenic and uniform experience with no feeling of rise, fall, tension, relaxation, discord, concord, and resolution: aspects I strive to create in my work. Filling in an outline that has these aspects with the material of a corpus gives more control. What comes out of VLDCMCaR (pronounced vldcmcar) can be a complete composition, though usually more work is needed to shape a coherent piece.


One can envision a composer singing and speaking parts of a composition into a soundfile and using this as the target. The corpus then follows these "directions" and fills in the outline. This is a unique method of composition. Even more advanced would be a real-time application of these methods: one could sing with a chorus of monkeys, or speak a symphony of Mahler. Speech would become music, and perhaps music could become speech.


Another potential application of concatenative synthesis is data sonification. I will be looking into using this software for the sonification of time-domain data. An upcoming conference is calling for works involving the sonification of a common dataset: a 22-channel electroencephalogram (EEG) of a person listening to popular Australian music. While the target is not audio data, its outline could be used to form the material of the corpus; thus the EEG of a person listening to music would be used to recompose the music. This presents an interesting alternative to listening directly to the data, or translating it to MIDI. A user could select a corpus of sounds to reflect the activity of the data, filling in its outline. This is an entirely new method of sonification.


Future Work

In its present state VLDCMCaR (pronounced vldcmcar) will not be publicly distributed. Eventually (around the time of the 2004 International Computer Music Conference) this software will be made public, open source, and free. By then I will have had time to fix and optimize the application. Of prime importance are implementing the harmonicity algorithm and refining the pitch algorithm.


There is considerable work to be done on several aspects of the application. The format of the database needs to be redesigned. As it stands it generates a corpus audio file that can be indexed into by the synthesis routine. This obviously duplicates large amounts of data. Instead the database should contain paths and filenames of all audio files used to create it. That way the synthesis function can index into each necessary audio file, rather than one large corpus sound file.
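
One possible layout for the redesigned database, sketched as a MATLAB struct; the field names are my own invention, not a committed design:

    corpus.files    = {'monkey01.wav'; 'monkey02.wav'}; % source audio paths
    corpus.fileidx  = [1; 1; 2];      % source file of each analysis frame
    corpus.startpos = [1; 1025; 1];   % frame's first sample within its file
    corpus.features = zeros(3, 6);    % one six-element feature vector per frame
    % Synthesis then reads matched frame k with:
    %   [x, fs] = wavread(corpus.files{corpus.fileidx(k)});
    %   grain = x(corpus.startpos(k) + (0:winsize-1));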


Additional work will explore filtering the target into bands and then generating analyses for each band. That way a fuller representation of the target can be created. Each band is then synthesized and combined to form the result. A composition could thus consist of a Russian Orthodox Men's Chorus in the lower frequencies, and Diamanda Galas in the higher frequencies.
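
A sketch of the band-splitting idea using a Butterworth crossover (butter and filter are in MATLAB's Signal Processing Toolbox; the 1 kHz crossover and the filter order are arbitrary choices of mine):

    fc = 1000;                                % crossover frequency, Hz
    [bl, al] = butter(4, fc/(fs/2));          % lowpass branch
    [bh, ah] = butter(4, fc/(fs/2), 'high');  % highpass branch
    xlo = filter(bl, al, x);                  % low-band target
    xhi = filter(bh, ah, x);                  % high-band target
    % ...analyze, match, and synthesize each band as above, then:
    % y = ylo + yhi;                          % combine the band syntheses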


Conclusion

VLDCMCaR is an acronym for "Very Large Database for Concatenative Music Composition and Recontextualization." The MATLAB GUI I have presented here is really the interface to the VLDCMCaR, so a new name is needed to describe it. This application demonstrates that these relatively unintelligent methods of concatenative synthesis (compared with machine listening) can be interesting and effective in musical contexts. Further work is justified to port these methods to faster compiled languages, utilize smarter database systems, and approach a real-time implementation for performance.


References

Hazel, S. Soundmosaic, 2003.


Hunt, A., A. Black. "Unit selection in a concatenative speech synthesis system using a large speech database." In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, pp. 373-376, 1996.


Lazier, A., P. Cook. "Mosievius: Feature Driven Interactive Audio Mosaicing." In Proceedings of the 6th International Conference on Digital Audio Effects (DAFx-03), London, UK, September 8-11, 2003.


Oswald, J. Plunderphonics, 2003.


Schwarz, D. "A System for Data-Driven Concatenative Sound Synthesis." In Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFx-00), Verona, Italy, December 7-9, 2000.


Schwarz, D. "New Developments in Data-Driven Concatenative Sound Synthesis." In Proceedings of the 2003 International Computer Music Conference, Singapore, 2003.


Silver, R. Photomosaics, 2003.


Sturm, B. L. "MATConcat: An Application for Exploring Concatenative Sound Synthesis Using MATLAB," In Proceedings of the 2004 International Computer Music Conference, Miami, Florida, 2004.


Zils, A., F. Pachet. "Musical Mosaicing." In Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFx-01), Limerick, Ireland, December 6-8, 2001.