A language for basic sound synthesis¶

Note

For the track based on arithmetic expressions, see A language for arithmetic.

The original PLAI course used arithmetic to introduce some ideas. In this variant, we use a simple model of sound synthesis processes to construct a language we can use analogous to A language for pictures. The motivation to introduce these small “domains” is that they help anchor the discussions on programming languages, and when they are not too familiar, help students ask key questions.

Note

We’re not going to be building a full fledged synth toolkit. What we’re interested in is making (hopefully) some simple sounds and compositions. The ideas here form the basis for Synth.jl.

Growing a language for sounds¶

We’re about to launch off a precipice in our efforts to figure out a language for generating sounds. When we set out on such a task in any domain, there are a few things we need to do to build up our understanding of the domain first. What are you going to build a language for if you don’t understand it in the first place? We’ll need to –

Get a sense of the “vocabulary” we want for working with sound generation.
Get a sense of how we wish to be able to generate sounds, transform them or combine more than one to form a new sounds.
Figure out the essence of sound composition – i.e. a minimal “core language” in which we we can express the ideas we’re interested in. Translate more specific ideas into this core language.

Note that we do not need to get all of this right at the first shot. We can take some reasonable steps and improve what we have at hand when we recognize the opportunity. To do that effectively, we’ll need to keep our eye on the mentioned “minimal core” as we go along.

Credits

This section is named in honour of an amazing talk of the same title by Guy Steele, the co-creator of Scheme - Growing a language, by Guy Steele (youtube link) given in 1998. It is a fantastic talk and a great performance & delivery, that I much recommend students watch and contemplate on. The beginning of the talk may unfortunately put off some as it appears sexist, but Guy is aware of it and explains himself a little into the talk. So do press on.

A plausible vocabulary¶

We hear a sound when a pattern of vibration hits our ear drums, causing them to vibrate according to the incoming air pressure patterns. When these vibrations are represented in a computer, we need to convert them into numbers and so we “sample” them at a sufficiently high frequency (called the “sample rate”) and measure the pressure values at these instants. This process is called “analog to digital conversion” (“ADC” for short). We can then reproduce these sounds by converting this sequence of numbers into air pressure influences emitted through a speaker. This process is called “digital to analog conversion” (“DAC” for short).

The auditorily simplest sound is a tone with a specific “pitch” and “volume”. Such a tone is a periodic (in time) waveform with a specific period corresponding to its pitch and a specific amplitude corresponding to its volume. The two pairs of concepts are related but not the same. While “pitch” is a perceptual aspect comparable to “colour” in the visual field, “period” or equivalently “frequency” is a definable objective property of the waveform itself. Similarly, while “volume” is a perceptual aspect, “amplitude” is a definable objective property of the waveform.

\[\begin{split}\begin{array}{rcl} \text{pitch} & \approx \propto & 12 \log_2(\text{frequency}) \text{ } (\text{in semitones}) \\ \text{volume} & \approx \propto & 20 \log_{10}(|\text{amplitude}|) \text{ } (\text{in dB}) \end{array}\end{split}\]

A tone that lasts for ever at the same amplitude and frequency doesn’t make for much music, does it? At the very least, we want tones to decay over time to mimic, say, a vibrating string. And we want a pattern of frequencies in time to make a melody.

The shape of the periodic waveform that makes a tone defines its “timbre” (pronounced “tahmbur”) or “colour”. The most colourless of these tones is perhaps the “sinewave” - which is easily modelled by the mathematical expression \(\sin(2\pi ft)\) where \(f\) is its frequency and \(t\) is the time. This tone has amplitude \(1.0\) and to get a different amplitude, we just have to multiply it by the scale factor we need, such as \(0.5\).

It so happens that we can build up more complex tones if we only had sinewaves available at our disposal, by mixing sine waves of different frequencies with different amplitudes.

The above description gives us a starting point for building up our vocabulary. To build up our language, we are free to think at as high a level as we want to. Our only intent here is to capture the idea of how we wish to construct a sound, without concern for how we’re actually going to turn the construction into actual sound. S-expressions are great for this purpose, so let’s dive in.

A sinewave oscillator that is vibrating at a particular frequency can be written as –

(oscil <freq>)

If we want to scale this oscillation to a different amplitude, we can write it as –

(mod <amp> (oscil <freq>))

This is already interesting in many ways.

By writing (mod <amp> (oscil <freq>)) we’re saying “make a new tone with a different volume by scaling this oscillation. This means the following expression should be interpretable too - (mod <amp1> (mod <amp2> (oscil <freq>))). There is already a “sound transformation as expression composition” emerging here.
We have another choice at hand - where if we’re always going to think of an oscillator as having an amplitude and a frequency, we could’ve written (oscil <amp> <freq>) and the leave the details for later.

Think

Is any one approach better than the other in some objective sense for our domain? Note that neither of the approaches is “wrong”.

We’ve left out what the <amp> and <freq> are supposed to be. Obviously we want to be able use fixed numbers for these like (mod 0.5 (oscil 440.0)) which is a tone oscillating at \(440\text{Hz}\) with an amplitude of \(0.5\) … and we should have that make sense in our language. But what would something like this mean?

(mod (oscil 2.0) (oscil 440.0))

The only reading is that we’re varying the ampltiude of the 440Hz oscillation at a low frequency of 2Hz. When we render this as a sound snippet, we would perhaps expect the volume to go up and down audibly.

Question

How many times should you expect the volume to go up and down every second in this case?

This is called “amplitude modulation” in synthesis parlance. To “modulate” means to “vary” and in this context we’re varying the “amplitude”. So could we also perhaps have the frequency of the oscillator here be modulated? -

(oscil (mod 440.0 (oscil 2.0)))

Let’s think about what the above expression could mean. We can pick it apart bit by bit.

(oscil 2.0) is something we know - a sine wave oscillation between \(-1.0\) and \(1.0\) that oscillates twice every second.
So (mod 440.0 (oscil 2.0)) is a sinewave oscillation that varies between \(-440.0\) and \(440.0\) twice every second.
If step 2 is supposed to be in the frequency position of the outermost (oscil ...), that would stand for the concept “an osciallator whose frequency varies from \(-440.0\) to \(440.0\) Hz twice every second. Negative frequencies are indistinguishable to the ear from positive frequencies since they’re both “oscillations”. So this will sound like the frequency is oscillating between \(0.0\) Hz to \(440.0\) Hz four times a second.

In audio synthesis parlance, we call this “frequency modulation”.

We’re building up a language for expressing simple sounds. So now we may want to ask what if I want to express the idea of the frequency oscillating between, say, 300Hz and 400Hz twice a second instead of from 0 to 440Hz. How might we want to write that?

Think

Try to come up with a few ways of expressing that idea on your own first, using s-expressions. It is just a way to write it down. We don’t yet need to consider how we’re evaluating it.

Here is a candidate.

(oscil (+ 350.0 (mod 50.0 (oscil 2.0))))

While that looks reasonable, we’re playing too loose with our notation here.

Think

What exactly is too loose about that notation? Think about it first for 5 minutes before proceeding because spoilers are ahead.

It is like we want to think about adding a number (“350.0”) with an expression that we think of as sound (:rkt”(mod 50.0 (oscil 2.0))”) or at least as an “oscillation”. The simpler thing to do here instead of rethinking the notion of addition is to look closely at what we meant when we used the number \(440.0\) as a frequency or the number \(0.5\) as the amplitude. In particular, we considered using an (oscil 2.0) in place of the amplitude number \(0.5\)!

While we wrote a number, we chose to model these two aspects as plain numbers, but what we meant was that these numbers stand for values that don’t change over time. If we make that notion explicit, we may write it as (konst 440.0) and (konst 0.5).

Now since (konst 300.0) and (oscil 2.0) are the same kinds of things, we can define the operation of “mixing” two such waveforms by introducing another word mix.

(oscil (mix (konst 350.0) (mod (konst 50.0) (oscil 2.0))))

In our case, we might just want to define mix using mathematical addition sample by sample, i.e. (mix (konst a) (konst b)) = (konst (+ a b)). Similarly we have (mod (konst 0.0) <expr>) = (konst 0.0). We should also be expecting the following to hold –

(mix <s1> <s2>) = (mix <s2> <s1>)
(mix <s1> (konst 0.0)) = <s1>
(mod <s1> (konst 0.0)) = (mod (konst 0.0) <s1>) = (konst 0.0)
(mod (konst 1.0) <s1>) = <s1>
(mix <s1> (mix <s2> <s3>)) = (mix (mix <s1> <s2>) <s3>)
(mod <s1> (mod <s2> <s3>)) = (mod (mod <s1> <s2>) <s3>)
(mod <s1> (mix <s2> <s3>)) = (mix (mod <s1> <s2>)
                                  (mod <s1> <s3>))

If we further wish to generalize (mod (mix (konst a) (konst b)) <s1>) = (mix (mod (konst a) <s1>) (mod (konst b) <s1>)), we might want to add the following symmetry as well, though it is not an inevitable conclusion. –

(mod <s1> <s2>) = (mod <s2> <s1>)

So we understand what our “mix” and “mod” expressions now mean and we’re starting to describe properties of these operations. Given how closely we thought of “mix” as being addition and “mod” as being multiplication, we should very much expect to model these using those mathematical operations.

Stuff, structure and properties

When we’re working out a particular domain, it is useful to look at three aspects - a) what is the “stuff” we’re working with .. the nouns, b) how are we making this stuff using other stuff .. i.e. what is the structure we’re adding to these nouns? and c) what properties do these combination and transformation operations have.

While we can describe some sounds that last forever using the above constructs, we also want to be able to describe time limited and time shifted sounds.

If we want to talk about sound <a> starting 4.0 seconds from t=0, we can write (after 4.0 <a>) where <a> is any expression that stands for a sound. This adds another “structure”, whose “property” is (after d1 (after d2 <s1>)) = (after (+ d1 d2) <s1>). Since we can’t move sounds into the past, we treat all the d1 and d2 that are less than 0.0 as equivalent to 0.0. So we should more precisely express that property as - (after d1 (after d2 <s1>)) = (after (+ (max 0.0 d1) (max 0.0 d2)) <s1>).
We also want to be able to stop or “cut” a sound after some seconds. We can write this idea as (cut dur <snd>). The meaning here is that until dur elapses, this expression is indistinguishable from <snd>. After dur elapses, this expression is equivalent to (konst 0.0). Such a “cut” has the property (cut d1 (cut d2 <snd>)) = (cut (min d1 d2) <snd>) assuming both durations to be positive.

Wait a minute. We want to make music, i.e. we want to be able to sequence or “stitch” sounds one after another to make patterns. We can introduce another operator to stitch two sounds together at a certain time.

(stitch <s1> dur <s2>)

The meaning being before dur has elapsed, the resultant sound is equivalent to <s1> and <s2> begins and <s1> ends exactly after dur has elapsed. Just by how we’re talking about it, it should be apparent that we can express this idea using things we’ve already described.

(stitch <s1> dur <s2>) = (mix (cut dur <s1>) (after dur <s2>))

Syntactic sugar

We just defined a “new” concept in terms of structure we’ve already articulated. At such a point, we should pause and think whether we want this new word to be genuinely a new word in our language or something we introduce as “syntactic sugar” in it - something that we’ll mechanically expand out into its definitionally equivalent form before constructing the specified sound using only the primitive structures. On the other hand, we can still choose to make this a primitive for other auxiliary reasons such as performance – for example if we’re able to implement the “stitch” operator more efficiently if we do it directly rather than through the primitive operators in our language.

If we perform the expansion, then we gain the ability to examine the correctness properties of our audio renderer without having to worry about another operator that needs to be examined/tested with every other operator. The savings from this during language design are significant enough that it is an important consideration before you choose your stand.

Think

What is the difference between the above definition of stitch and the alternative below? (Refer to definitions of cut and after given earlier.) –

(stitch <s1> dur <s2>)
    = (mix (mod (cut dur (konst 1.0)) <s1>)
           (mod (after dur (konst 1.0)) <s2>))

Apart from oscillations, it is useful to be able to modulate sounds using linear “envelopes”. So we’ll add one final operator for that.

(line a dur b)

This represents a “sound” (not really a sound, but just a time varying value) that starts at the value a and over the course of dur seconds linearly rises (or falls) to b and after dur seconds stays fixed at value b.

The language and its interpreter¶

Now we need to be able to express any expression in our language as a data structure we can process into a sound using some, as yet unknown, algorithm.

(define-type SndExpr (U oscil mod mix konst after cut line))
(struct oscil ([freq : SndExpr]))
(struct mod ([a : SndExpr] [b : SndExpr]))
(struct mix ([a : SndExpr] [b : SndExpr]))
(struct konst ([v : Real]))
(struct after ([dur : Real] [snd : SndExpr]))
(struct cut ([dur : Real] [snd : SndExpr]))
(struct line ([a : Real] [dur : Real] [b : Real]))

To define an interpreter for SndExpr, we need to decide what interpreting such an expression is expected to produce. Since we want to render our sounds to a file, we can think of our interpreter as producing a vector of samples generated at a certain sampling rate that we can then write to a file using a separate utility function.

(: interp (-> Real Real SndExpr (Vectorof Real)))
(define (interp sample-rate duration sndexpr)
    ...)

This would perhaps be not a bad way to do it in the general case. However, we’ve designed our language as an “expression language” where there is some recursive structure to how we can “compose” our sound specification expressions. We can therefore use the general nature of lambda functions to model our operators and write a separate renderer using the underlying function based representation. We might call this the “reference implementation” of the language.

A model for the sound expressions¶

In our case, we can model each sound as a “generator” which, when asked, produces a sample and another generator for the rest of the sound. Note the recursive way we’ve described it. We can directly model this structural recursion using functions.

(struct step ([val : Real] [gen : Gen]))
(define-type Gen (-> Real step))

; As a warm up, we define the simplest of them.
(: konst (-> Real Gen))
(define (konst v)
  (lambda (dt)
    (step v (konst v))))

The above definition of konst already demonstrates the “temporal recursion” in the definition. For the other operators, we’re going to have to do something similar.

(: oscil (-> Gen Gen))
(define (oscil freq)
  ; Using a helper function here since we need to accumulate
  ; phase from one time step to the next.
  (define (oscil* freq phase)
    (lambda (dt)
      (let ([f (freq dt)])
        (step (sin (* 2 pi phase))
              (oscil* (step-gen f) (+ phase (* (step-val f) dt)))))))
  (oscil* freq 0.0))

(: mod (-> Gen Gen Gen))
(define (mod a b)
  (lambda (dt)
    (let ([av (a dt)] [bv (b dt)])
      (step (* (step-val av) (step-val bv))
            (mod (step-gen av) (step-gen bv))))))


(: mix (-> Gen Gen Gen))
(define (mod a b)
  (lambda (dt)
    (let ([av (a dt)] [bv (b dt)])
      (step (+ (step-val av) (step-val bv))
            (mod (step-gen av) (step-gen bv))))))


(: after (-> Real Gen Gen))
(define (after dur s)
  (if (<= dur 0.0)
      s
      (lambda (dt)
        (step 0.0 (after (- dur dt) s)))))

(: cut (-> Real Gen Gen))
(define (cut dur s)
  (if (<= dur 0.0)
      (konst 0.0)
      (lambda (dt)
        (let ([v (s dt)])
          (step (step-val v)
                (cut (- dur dt) (step-gen s)))))))

(: line (-> Real Real Real Gen))
(define (line a dur b)
  ; Helper function since we need to keep track of time.
  (define (line* a dur b t)
    (if (>= t dur)
        (konst b)
        (lambda (dt)
          (step (+ a (* (/ t dur) (- b a)))
                (line* a dur b (+ t dt))))))
  (line* a dur b 0.0))

Given this representation, we can render a sound to a file using a procedure of the following shape (i.e. pseudo-code) -

(define (render-to-file filename sndgen dur sample-rate)
  (call-with-output-file filename #:exists 'replace
    (lambda (f)
      <write-file-header>
      (define (loop dt t gen)
        (when (<= t dur)
          (let ([v (g dt)])
            (write-sample-to-file f (as-float32 (step-val v)))
            (loop dt (+ t dt) (step-gen v)))))
      (loop (/ 1.0 sample-rate) 0.0 sndgen))))

See asynth.rkt

You can simply load the linked asynth.rkt file to define the interpreter we’re working through here as it provides all the necessary code and more sound operators if you want to play around. Since the symbols it provides have the same name as the structs we’ve defined here, you’ll want to import it with a prefix like -

(require (prefix-in a: "./asynth.rkt"))

The function konst in the file will now be available as a:konst.

The interpreter¶

We’ve chosen to define our language as an “expression language”, which has recursive structure in its syntax. The struct s we’ve defined above to hold the various terms help us represent the various sub-expressions as we write out our intended sound using them.

Expression languages lend themselves nicely to being modelled in terms of lambda functions for the purpose of studying their structure and properties since lambda expressions themselves have an aligned structure. This doesn’t mean that if we hadn’t defined our language as an expression language we couldn’t have used lambda functions. It is just that our modelling task will get more complicated in such cases.

So, given the lambda-based model in asynth.rkt and the availability of a procedure to render a sound to a WAV file, we can redefine the type of our interpreter (for the moment) like this –

(require racket/match)

(: interp (-> SndExpr a:Gen))
(define (interp sndexpr)
    (match sndexpr
        [(konst v) (a:konst v)]
        ...
        ))

TODO

Read the documentation for match in the Racket docs to understand how the pattern is being specified in the code above. In particular, lists can be matched using the list constructor based expression. Quoted symbols will be matched literally and unquoted symbols will be taken as variables to be bound to the values in the corresponding slots in the list.

TODO

Try and fill out the match branches for the other operators in our language. We’ll need one for each struct we’ve defined above since we need to cover all the cases in SndExpr. It is not a coincidence that the AST terms and our lambda-based reference implementation share a similarity.

AST: Abstract Syntax Tree

The SndExpr type characterizes a tree of expressions that makes up terms in our language. Looking at interp and how it directly interprets this tree, we can see how the interpreter operationalizes the meaning of our terms. This tree captures the logical structure of our terms, even if we might choose to literally write out our terms in a different form, say –

(oscil(300.0) + oscil(200.0)) * oscil(2.0)

This is what we meant when we said we aren’t going to be concerned about “syntax” - we’d convert both into the same “abstract syntax tree” anyway.

Syntactic sugar and “desugaring”¶

We haven’t yet made up our minds about how we’ll bring the stitch operator into our language - as syntactic sugar or as a primitive. But before that, we have to examine what we expect of it. Consider this –

(stitch (konst 0.0) dur <s>)
    = (mix (cut dur (konst 0.0))
           (after dur <s>)) ; By definition
    = (mix (konst 0.0) ; Since (cut dur (konst 0.0)) = (konst 0.0)
           (after dur <s>))
    = (after dur <s>) ; By the `mix` identity stated earlier.

And also this –

(stitch <s> dur (konst 0.0))
    = (mix (cut dur <s>)
           (after dur (konst 0.0))) ; By definition
    = (mix (cut dur <s>)
           (konst 0.0)) ; Since (after dur (konst 0.0)) = (konst 0.0)
    = (cut dur <s>) ; By the `mix` identity stated earlier.

So it looks like if we have stitch as an operator, we can desugar both after and cut using stitch! This would reduce the number of primitives by one, simplifying our language. Accounting for such syntactic sugar has some consequences.

Our interpreter interp does not have to change in its structure! This is great, but its type does change slightly to make clear what kinds of expressions it works with. We’ll see how it changes shortly.
We need a new desugar procedure to translate the sugar terms into ordinary terms in our language.

So let’s introduce stitch as a primitive into our language and the other after and cut as syntactic sugar on top of stitch.

(define-type SndExpr (U oscil mod mix konst stitch line))
(struct oscil ([freq : SndExpr]))
(struct mod ([a : SndExpr] [b : SndExpr]))
(struct mix ([a : SndExpr] [b : SndExpr]))
(struct konst ([v : Real]))
(struct stitch ([a : SndExpr] [dur : Real] [b : SndExpr]))
(struct line ([a : Real] [dur : Real] [b : Real]))

So should we introduce the new sugar terms below?

(struct after ([dur : Real] [a : SndExpr]))
(struct cut ([dur : Real] [a : SndExpr]))

Think

Take 5 mins to think about what the above type declarations are saying and what the consequences of choosing this type declaration would be. Do this now, ‘cos spoilers ahead.

Consider (struct oscil ([freq : SndExpr])). With desugaring in the picture, we have to now realize that this oscil concept is not one, but two – a) oscil which only permits core terms in its frequency expression place, and b) oscil which permits sugar terms as well for its frequency specification.

Like we do with abstraction of ordinary expressions, we can parameterize the oscil type like this –

(struct (e) oscil ([freq : e]))

This makes oscil not a type, but a “type constructor”. We get a type when we “apply” oscil to another type. We can likewise parameterize all our terms so that we can construct their sugar and core variants. Of all our terms, konst and line don’t need such parameterization.

(struct (e) oscil ([freq : e]))
(struct (e) mod ([a : e] [b : e]))
(struct (e) mix ([a : e] [b : e]))
(struct konst ([v : Real]))
(struct (e) stitch ([a : e] [dur : Real] [b : e]))
(struct line ([a : Real] [dur : Real] [b : Real]))

(struct (e) after ([dur : Real] [snd : e]))
(struct (e) cut ([dur : Real] [snd : e]))

Now we’re ready to define what “core expressions” and “sugar expressions” are.

(define (e) SndExpr (U (oscil e) (mod e) (mix e) konst (stitch e) line))

; Core expressions can only contain other core expressions as sub-terms.
(define-type SndCore (SndExpr SndCore))

; Sugar expressions permit sugar terms or core terms which may in turn
; contain other sugar expressions as sub-terms.
(define-type SndSugar (U (after SndSugar) (cut SndSugar) (SndExpr SndSugar)))

The type of our interp function now changes slightly, though its implementation does not (apart from getting rid of after and cut and replacing it with stitch).

(: interp (-> SndCore a:Gen))

We now need to define the process of “desugaring” expressions into their core form. The type of this procedure should now make sense.

(: desugar (-> SndSugar SndCore))
(define (desugar expr)
    (match expr
        [(after dur a)
         (stitch (konst 0.0) dur a)]
        [(cut dur a)
         (stitch a dur (konst 0.0))]))

That doesn’t work! Fix it!

If you run that definition by typed/racket, you’ll find that it does not type check. Find out what’s wrong with it and fix it.

Desugaring and compilation¶

We described “desugaring” as the process of taking a term and reducing it to other more “primitive” terms. This process goes by another word - “compilation” - and compilers do in essence just that - reduce terms in one language (given as an AST) into terms in, usually, a “lower level” language. Procedures that translate between languages may also be known as “cross compilers”.

This step affords us some interesting possibilities. For instance, we can transform code into equivalent but more efficient forms. In our case, say, if we see (mix (konst 0.0) <snd>) we can simply expand it into <snd>. This is where the “properties” we outlined come into play in an explicit way.

A language for basic sound synthesis¶

Growing a language for sounds¶

A plausible vocabulary¶

The language and its interpreter¶

A model for the sound expressions¶

The interpreter¶

Syntactic sugar and “desugaring”¶

Desugaring and compilation¶

Comp308: Principles of Programming Languages

Navigation

Related Topics