Some Audio Effects

Sun, 10 Aug 2008

Another quick post (quick in terms of me writing this bit - not much proof reading or going out of my way to make things clear), this time with some toys to play with.

Compressor

The first is a cool compressor, which is way more general than anything I've seen. (This is talking about dynamic range compression, not compression in the file size domain like mp3).

For those not in the know, a compressor will try to increase the volume of the quieter signals, and keep the loud signals loud (hence it compresses the dynamic range). There are a couple of uses for it, one is destroying the sound of a song (see commercial radio), another is for making something sound better. This compressor should be general enough to do both.

Basically, the compressor will track the loudness of the signal passing through it, and once it goes beyond a certain threshold, it reduces the gain of the signal by some ratio. The exact time that the gain is reduced is controlled by the attack and release (in order to preserve transients, the gain reduction won't kick in straight away, and once the signal drops below the threshold it should also slowly return to the original signal).

This compressor has a very general method for tracking loudness. It takes the power mean of the samples inside a window (this gives us two parameters - the size of the window, and the type of power mean (a power mean is the generalisation of most means, so you can do an arithmetic mean, RMS, or anywhere in between (or outside))). The power mean is then low pass filtered, which prevents it from jumping around lots (another parameter).
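For reference, the power mean with exponent p over a window of N samples can be written as below. Taking the absolute value of each sample is an assumption on my part (audio samples are signed, and the compress source may handle this differently). With p = 1 this is the arithmetic mean of the magnitudes, and with p = 2 it is RMS.

$$M_p(x_1, \dots, x_N) = \left( \frac{1}{N} \sum_{i=1}^{N} |x_i|^p \right)^{1/p}$$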

So then for every loudness, we have a response curve which maps an input value to an output value. e.g. we could have a curve that halved all the inputs, which would then halve the amplitude of the output, or we could have something that followed a square root curve, or any other curve. Since there are lots of different loudness levels, we can just specify a couple of these response curves, and assign them to loudness levels, the compressor will then interpolate between these curves to get the actual curve for a particular loudness level.
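A minimal sketch of the curve interpolation idea, not the actual code from compress. The names (response_curve, curve_point, apply_response) and the two toy curves are made up for illustration, and I'm assuming plain linear interpolation between the two nearest curves.

```c
#include <stddef.h>

/* A response curve maps an input amplitude to an output amplitude. */
typedef double (*response_curve)(double input);

/* One curve, pinned to the loudness level where it applies exactly. */
struct curve_point {
    double loudness;
    response_curve curve;
};

/* Two toy curves (hypothetical): leave the signal alone, or halve it. */
double curve_identity(double input) { return input; }
double curve_halve(double input) { return input * 0.5; }

/* points[] must be sorted by ascending loudness, and count must be >= 1.
 * Below the first level or above the last one, the nearest curve is used. */
double apply_response(const struct curve_point *points, size_t count,
                      double loudness, double input)
{
    if (loudness <= points[0].loudness)
        return points[0].curve(input);
    for (size_t i = 1; i < count; i++) {
        if (loudness <= points[i].loudness) {
            /* Blend the two curves either side of this loudness level. */
            double span = points[i].loudness - points[i - 1].loudness;
            double mix = (loudness - points[i - 1].loudness) / span;
            double low = points[i - 1].curve(input);
            double high = points[i].curve(input);
            return low + (high - low) * mix;
        }
    }
    return points[count - 1].curve(input);
}
```

With the identity curve at loudness 0 and the halving curve at loudness 1, a loudness of 0.5 gives a curve that scales the input by 0.75.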

Frequency Splitter

The next toy is a frequency splitter. This takes a wave file and splits it into a bunch of new wave files each with different band pass filters applied to them. The filters aren't particularly steep edged, so you will get plenty of overlap between the output files in the frequency space. For some things this is bad, for others I can imagine it would be good. I made this as a step to make the compressor a bit more general (so it could be a multi-band compressor). Converting the compressor to being multi-band using this should be trivial, the only reason I didn't do it was because I couldn't decide what the interface would look like (that is, the code interface, not the user interface).

It just uses a whole bunch of low pass filters, and then subtracts the outputs from adjacent filters to get particular bands. Writing a simple low pass filter is super easy (output = a * input + (1 - a) * last_output) and different values of a will give you different frequency responses.
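Here is a rough sketch of that idea, assuming a one-pole low pass filter that feeds back its previous output. The function names are made up for this example and this is not the fsplit source.

```c
#include <stddef.h>

/* One-pole low pass: out[t] = a * in[t] + (1 - a) * out[t - 1].
 * a should be in (0, 1]; smaller values give a lower cutoff. */
void lowpass(const double *in, double *out, size_t n, double a)
{
    double last_output = 0.0;
    for (size_t i = 0; i < n; i++) {
        last_output = a * in[i] + (1.0 - a) * last_output;
        out[i] = last_output;
    }
}

/* One band: the wider low pass (larger a) minus the narrower one. */
void bandpass(const double *in, double *out, size_t n,
              double a_narrow, double a_wide)
{
    double narrow = 0.0;
    double wide = 0.0;
    for (size_t i = 0; i < n; i++) {
        narrow = a_narrow * in[i] + (1.0 - a_narrow) * narrow;
        wide = a_wide * in[i] + (1.0 - a_wide) * wide;
        out[i] = wide - narrow;
    }
}
```

One nice property of building the bands by subtraction is that the lowest low pass output plus all of the bands adds back up to the widest low pass output, so the split files can be mixed back together.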

64-bit and Naming Conventions

So in the time between writing the audio library I use and now, I have changed my home coding convention (I did it at the time to ensure I wouldn't confuse work code with home code, now I don't know what the new naming convention will be as I start tomorrow). Also in that time, I moved from being a person who used x86 only to a person who uses x86, x86-64 and armel architectures. Some of the code in the library was assuming 32 bits, now it should work in 32-bit, and does work in 64-bit (testing it is boring).

I also can't decide what type to use for referring to a chunk of memory that I know I'm going to access by byte. I previously used a mixture of char* and unsigned char*, now I've changed to void* which forces me to be really explicit in places. I'm not sure if I'll stick with this, but that is how it is now.

Download

This should compile in amkel, but there is also a make file. Although I don't like writing them, there is the big advantage that make is everywhere, so for the moment I'm going to write them... until I find/make a better solution.

compress and fsplit (C, GPL3)

Frequency Tracking

Sun, 18 Nov 2007

This is a fairly simple program for doing frequency tracking. Frequency tracking is where you take a sound and try to follow the dominant pitch of that sound. For lots of sounds this dominant pitch doesn't really exist (e.g. a cymbal sound, or multiple sounds playing at once) so I chose to measure the frequency, confidence and whether or not a dominant pitch exists.

There are heaps of ways of doing this, I think at the moment my pet favourite is to use a filter bank. A filter bank is just a whole bunch of bandpass filters covering bands over a bit of the frequency spectrum. So one might cover 50Hz to 60Hz, the next from 60Hz to 75Hz etc. When I first started doing stuff in the frequency domain, I really wanted to make sure that there was no overlap between such bands. After doing lots of stuff in this area, I think I'm of the opinion that it isn't really possible, and even if it was it wouldn't be that good. So, while each filter tries to cover a specific area, they do overlap somewhat.

The magic numbers I've been using mean that I have 200 filters covering the spectrum from 50Hz to 2kHz (spaced in a geometric series, so they are closer together down the bottom). The filter design is pretty simple: I have 201 low pass filters, which work by making each output sample a running average of the previous samples (i.e. y[t] = y[t-1]*alpha + x[t]*(1-alpha), where x is the input and y is the filtered output) and alpha is chosen to put the cutoff frequency in the right spot. I then apply the filters to a section of sound (512 samples long) and subtract the results to get bandpass filters. So I might subtract a 200Hz lowpass from a 210Hz lowpass to give a 200-210Hz bandpass filter. I then just measure the RMS energy in each band.

Rather than just finding the maximum energy band and declaring it the winner, I keep something resembling a probability distribution over all the bands. It starts out with uniform probability, and then each probability is multiplied by the amount of energy present in that band, and a constant is added (so that we don't get stuck at 0). After this we find the total of all the numbers (which in general won't add up to 1 now) and remember this number as the confidence. We then normalise the numbers, and the largest one is considered the winner.
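The update step described above could look something like this. It is a sketch rather than the code from the pitch tracker, and the names (update_bands, floor_constant) are invented for the example.

```c
#include <stddef.h>

/* prob[] holds one weight per band and should start out uniform
 * (1.0 / n each). energy[] holds the energy measured in each band for
 * the current chunk of sound. floor_constant should be > 0 so that no
 * band gets stuck at 0. n must be at least 1.
 *
 * Returns the index of the winning band, and stores the total from
 * before normalisation in *confidence. */
size_t update_bands(double *prob, const double *energy, size_t n,
                    double floor_constant, double *confidence)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        prob[i] = prob[i] * energy[i] + floor_constant;
        total += prob[i];
    }
    *confidence = total;

    /* Normalise so the weights add up to 1 again, and pick the largest. */
    size_t winner = 0;
    for (size_t i = 0; i < n; i++) {
        prob[i] /= total;
        if (prob[i] > prob[winner])
            winner = i;
    }
    return winner;
}
```

Calling this once per chunk means a band has to keep showing up with energy to stay on top, which is the point of keeping the distribution around instead of just taking the loudest band each time.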

If the winner is at the high end of the spectrum, then we say it is unpitched. It seems right in practice.

To test it, I made a program which will take a wave file and add a sine wave over the top of the pitch it detected. I don't have many sounds where it is just a single instrument, so I don't have anything particularly good to test it on. But, it seems to go ok, just not when there are multiple instruments.

You can download the pitch tracker here (C code, GPL3, builds in amkel). And you can get some wav files to test it on from youiseek.com (CC Attribution Noncommercial Share Alike) (a local band... not particularly good for testing this on, but given that my wave file loader is so picky I just went for the first thing that worked).

To build it, just compile together all the .c files with -std=c99 (or amkel pt.c). Run with the first argument being the wav file to run on.

You might also notice that I've changed my coding style. Previously I named variable and function identifiers with camelCase, now I use under_scores (I know it is one word). It makes it ugly when I interface with old code (the sound library), but I think changing/avoiding the sound library will be less work than changing/avoiding gtk. Also, it makes it very easy for me to tell which things I develop at home, and which things I do at work (since I use the other convention at work).

Smoothing Filter

Sun, 14 Oct 2007

Most audio filters are linear, which means that the maths behind them is pretty well understood (not by me, but by smart people), and their behaviour can be predicted, and duplicated easily. There are even devices you can buy that will let you plug an expensive filter into it, give it a moment, and then it will have a perfect duplicate of that effect (see convolution reverbs (and a reverb is just a type of filter)).

But, not all filters are linear, and in my opinion, the interesting ones aren't (like the sorting filter). Mainly because you get a bit more surprised when you hear them applied to different sounds.

So today's filter is a smoothing filter. We start out with a sound wave, and we try to approximate the wave by just picking a few key points on it, and fitting a curve through those points. I've chosen to do it the super inefficient way:

  1. Approximate the whole sound wave with 2 points, one at the start and one at the end.
  2. Find the place where the approximated wave is the most different to the actual wave.
  3. Refine the approximation by adding a new point at this spot with the most error.
  4. Go to 2, until we are close enough to the original wave.
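The steps above can be sketched roughly like this. Two assumptions to flag: "close enough" is taken to mean the worst error is within some tolerance, and the curve between two key points is a smoothstep polynomial standing in for a cosine-shaped ease (it has a similar shape and avoids the maths library). All the names are made up for this example.

```c
#include <stddef.h>

/* S-shaped ease from 0 to 1, similar in shape to a cosine ease. */
static double ease(double t)
{
    return t * t * (3.0 - 2.0 * t);
}

/* Rebuild out[] using only the samples marked as key points. */
static void rebuild(const double *in, const unsigned char *is_key,
                    double *out, size_t n)
{
    size_t left = 0;
    for (size_t right = 1; right < n; right++) {
        if (!is_key[right])
            continue;
        for (size_t i = left; i < right; i++) {
            double t = (double)(i - left) / (double)(right - left);
            out[i] = in[left] + (in[right] - in[left]) * ease(t);
        }
        out[right] = in[right];   /* key points are reproduced exactly */
        left = right;
    }
}

/* Approximates in[] and writes the result to out[]. is_key[] is scratch
 * space of n bytes. tolerance must be >= 0. Returns the number of key
 * points that were needed. */
size_t smooth(const double *in, double *out, unsigned char *is_key,
              size_t n, double tolerance)
{
    if (n < 2) {
        if (n == 1) {
            out[0] = in[0];
            is_key[0] = 1;
        }
        return n;
    }
    for (size_t i = 0; i < n; i++)
        is_key[i] = 0;
    is_key[0] = 1;          /* step 1: one point at the start... */
    is_key[n - 1] = 1;      /* ...and one at the end */
    size_t keys = 2;

    for (;;) {
        rebuild(in, is_key, out, n);
        /* Step 2: find where the approximation is most wrong. */
        size_t worst = 0;
        double worst_error = 0.0;
        for (size_t i = 0; i < n; i++) {
            double error = in[i] - out[i];
            if (error < 0.0)
                error = -error;
            if (error > worst_error) {
                worst_error = error;
                worst = i;
            }
        }
        /* Step 4: stop once we are close enough. */
        if (worst_error <= tolerance)
            return keys;
        /* Step 3: add a key point at the worst spot. */
        is_key[worst] = 1;
        keys++;
    }
}
```

Every pass rebuilds and rescans the whole sound, so the time taken is roughly the length of the sound multiplied by the number of key points, which is why it needs a fair bit of computer time.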

So, for this to work, we need to have the entire sound available (you can't do this on a live sound - though you could do it on short windows of sound), and a fair bit of computer time (depending on how long the sound is). The actual effect it produces is probably best explained by a picture:

Original waveform, filtered waveform, aggressively filtered waveform

(top is original, middle is quite filtered, bottom is very much filtered)

And of course, the actual clips (as FLAC):

Original Filtered Aggressively Filtered

I interpolate using a cosine-ish thing, based on the two points either side of the point we are interpolating. Just because it is simple, and it isn't linear.

Stealth Distortion

Fri, 20 Jul 2007

I'm playing around with a simple program to synthesize sounds, and I've managed to make a sound which looks just like a perfect sine wave, but sounds like a sine wave with some dirt (maybe a bit like a harmonica). I call it stealth distortion. This is what the wave looks like:

Looks like a sine wave

This is what it sounds like (FLAC. ~100KB)

I put an envelope around it, so you get half a second of fade in (no distortion), one second of distorted sound and half a second of fade out (no distortion). I'm pretty confused by the whole thing, because I didn't intend for it to sound like this at all.

You might think that I've snuck in some odd harmonics that are subtle so you can't see them. Well, I'd like to think so too - but here is the spectrum of the distorted bit (narrowband filter of 8192 samples, using a Hanning window):

Looks normal enough

It is late, so I'm assuming tomorrow I will realise that I'm an idiot.

Sorting Filter Part 2

Thu, 28 Jun 2007

This is a kind of follow up to the previous thing about sorting filters. As well as doing the effect to images (which I did only because I had an easily accessible image library) you can apply it to sounds (which I have now done, since I have an easily accessible sound library). Actually, it makes a bit more sense when you apply it to a sound, and it is easier to see what is going on.

For those who didn't understand what I was doing, it will hopefully make more sense here: you take the waveform of a sound (which is just a collection of numbers) and put the numbers into groups of a fixed size (say groups of 10). In each of these groups, you sort the numbers. This will give you a wave that (typically) has a discontinuity every 10 numbers at the boundary between groups. So we then repeat the process 9 more times, with the groups slightly offset each time, now each of these new waveforms has a discontinuity every 10 numbers, but all the discontinuities are at different spots, and when we mix together all of these waveforms, we get something that is smooth (not true in general, but true in this case). This is what I call a sorting filter. So what does a sorting filter actually do to a sound?
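As a rough sketch (not the code from the sorting filter package), the process looks something like this. It always sorts ascending, and I'm assuming the leftover partial groups at the start and end of the sound get sorted as well. The function names are made up.

```c
#include <stdlib.h>
#include <string.h>

/* Comparison function for qsort: ascending order of floats. */
static int compare_float(const void *a, const void *b)
{
    float x = *(const float *)a;
    float y = *(const float *)b;
    return (x > y) - (x < y);
}

/* Sorts the samples in groups of `width`, once for every possible group
 * offset, and mixes the resulting waveforms together in equal parts.
 * Returns 0 on success and -1 on a bad width or a failed allocation. */
int sorting_filter(const float *in, float *out, size_t n, size_t width)
{
    if (width == 0)
        return -1;
    if (n == 0)
        return 0;
    float *sorted = malloc(n * sizeof *sorted);
    if (sorted == NULL)
        return -1;

    for (size_t i = 0; i < n; i++)
        out[i] = 0.0f;

    for (size_t offset = 0; offset < width; offset++) {
        memcpy(sorted, in, n * sizeof *sorted);
        /* Group boundaries sit at offset, offset + width, and so on. */
        size_t start = 0;
        size_t end = offset ? offset : width;
        while (start < n) {
            if (end > n)
                end = n;
            qsort(sorted + start, end - start, sizeof *sorted, compare_float);
            start = end;
            end += width;
        }
        /* Mix this waveform into the output. */
        for (size_t i = 0; i < n; i++)
            out[i] += sorted[i] / (float)width;
    }
    free(sorted);
    return 0;
}
```

For the example in the text you would call it with a width of 10, which sorts and mixes 10 differently offset copies of the sound.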

Basically, it will harshen high frequency sounds. If you have a sine wave that is of a high enough frequency, the sorting filter will turn it into a triangular wave, while a lower frequency sine wave gets through without any trouble. Since the filter is non-linear(*), this doesn't mean that the filtered high+low frequency sound will be a high frequency triangular wave and a low frequency sine wave.

I also modified the filter so that when you sort, you can sort either ascending or descending depending on whether the first sample in this window is above or below the last sample. This gives us 2 variables we can tweak with the filter - the width (how wide the window is) and the sorting policy (always ascend, always descend, same as first/last, opposite to first/last).

So I'm not quite sure how to demonstrate the workings of it. I had four 10 second long clips of audio which I've used as test dummies. I tried to pick different sounding things, you get 3 points for every song you know. The current best score is 12 (by me). I then filtered them, 8 times - 2 different window sizes (10 and 50) and 4 different sorting choices (always ascend, always descend, follow trend between first and last, disobey trend between first and last).

So this recording is just a few of the demonstrations put together in this order (no S or K samples in here, the S ones sound pretty bad, there are some ok K ones)

T(original) T(ascend10) T(oppose10) M(original) M(ascend10) M(follow10) M(oppose10) M(follow50) M(oppose50)

Sample Reel

Things to listen for:

  1. Listen to the hi-hat (or lack of) in the T samples.
  2. Note how the kick drum in the M samples stays big.
  3. Both of the ascend10 filters pretty much destroy the hi-hat.
  4. The snare drum in M(follow50)

You will probably hear these and think a lot of them sound rubbish, and there is no possible use for them. You wouldn't take one of these filters and apply it to an entire track, but I can definitely imagine mixing in a little bit of one of these to a track to add something different.

If you want to hear the rest of the sounds, or play with the source (it compiles cleanly with amkel and is GPL2) then get the whole sorting filter package

(*) A linear process is one where if you apply it to two signals separately and then mix the results together you get the same result as if you mixed the two signals together and then applied the process. (for the maths people, f is linear if f(a) + f(b) = f(a + b)) (**)

(**) This isn't quite the textbook definition of linear process. They also include the requirement that if you apply it to a signal which has been multiplied by a constant, then the result is the same as if you applied it to the original signal and then multiplied the result by that constant (or f(kx) = kf(x)). But to me, the number of interesting things where the first condition is true and this condition isn't is so small that I don't really bother caring about this one.

Deconvolution Part 2

Fri, 01 Jun 2007

(see part 1)

Cepstrum Filtering

Cepstrum filtering is a method of deconvolution which has a pretty neat idea behind it. If we take the Fourier transform of a signal that has been convolved with something, then we get a list of frequency components, but we know that each component was made by multiplying the original by that of the convolution. So if we can guess which number multiplied to give us the number we have, then everything is fine. Well, we can't really guess that, so we take the log of this number, now if you remember your high school maths, then you will know that log(ab) = log(a) + log(b). So this doesn't quite seem to help, because now we are just trying to guess which two numbers were added together, so then we take another Fourier transform. What does this achieve? Well, here we start making assumptions about the original signal and the echos - we assume that the original signal was changing, and the echos always worked the same way (so if there was an echo after 2ms at the start of the signal, there would be an echo after 2ms at the end of the signal, and everywhere in between). This means that the signal will have high frequency stuff going on after this second Fourier transform, and the echo signal will have the low frequency stuff going on.

You then draw a line somewhere and say "everything before this is the echo and everything after is the signal" and then reverse the whole process with those parts separated out. In theory, this will give you the signal and the convolution.
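Written out, with X as the spectrum of the original signal, H as the spectrum of the echos and Y as the spectrum of what was recorded (the symbol names are only for illustration, and I'm only looking at magnitudes here):

$$Y(f) = X(f)\,H(f) \quad\Rightarrow\quad \log|Y(f)| = \log|X(f)| + \log|H(f)|$$

$$c(q) = \mathcal{F}\{\log|Y|\}(q) = \mathcal{F}\{\log|X|\}(q) + \mathcal{F}\{\log|H|\}(q)$$

The second line works because the Fourier transform is linear, so the sum stays a sum, and the hope is that the two parts land in different places along q so the line can be drawn between them.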

Another way of looking at this is thinking about what happens in a simple case when there is a single point of echo in a signal. Imagine you are 10 metres away from the edge of an infinite swimming pool (i.e. it is infinite in every direction except for straight in front of you where you have a wall). Let's say that if you make a ripple, then it will hit the wall and return to you in 20 seconds (so the wave travels at 1 metre per second). And finally, you have a device which records the water level somewhere in the pool. For simplicity, we will put it where you are making ripples.

If you make a splash once every 20 seconds, then after your first splash, all of them will be slightly bigger, because they will contain the reflection of the previous splash (note that it doesn't get bigger and bigger and out of control, because there is no wall behind you to reflect the wave back). A similar thing will happen if you make a splash once every 10 seconds, just the echo won't work for the first 2 splashes. And for 5 seconds, and 2.5 seconds, and all these numbers as you keep on halving. What happens to a splash made once every 7 seconds? Well, it does get an echo, but it doesn't really line up with the splash you are making, so because there isn't much pattern, it will add about as much to your splash as it removes. So if you draw a graph of frequency of your splash vs typical amplification, then you get a pattern that has little waves in it. However, usually frequencies live nicely on logarithmic scales (since halving a frequency gives you the same "note" just an octave down), and once you do this you get a nice wave pattern, where the frequency of this wave corresponds to the echo time somehow, and the amplitude of the wave corresponds to how much echo there is.

Here are two graphs I found from my report. The first is the original spectrum, the second is the original+echo spectrum. The difference is very subtle, but you can hopefully see that the original+echo spectrum is more wavey towards the right (higher frequencies). I don't have time to redo this properly (with useful things like units):

Original: Spectrum of original signal

Original+echo: Spectrum of original+echo signal

At this point, you might be tempted to try to work out the frequency of this wavey pattern by using a Fourier transform. Unfortunately, due to maths being awesome, it just gives you the original signal in reverse. The cepstrum filtering trick should fix this.

Unfortunately, I found that it was super sensitive to noise, and I never got any stable stuff out of it. I'm still curious though, and I hope it was an implementation bug.

Panic

So I mentioned before that this was an assignment. And it wasn't any old assignment, it was an assignment which was worth 100% of my final mark. There was no exam, no class tests, no participation marks or anything. And it was due 2 days before all the marks had to be in, so you couldn't get an extension, or complain if you got a bad mark. I never really cared about marks, but I was also very nervous because if I handed this in as my assignment, then it would seem pretty dodgy. I knew it wouldn't fail (the lecturer loved FFTs, and I had two implementations of it, which surely would be a bonus), but still I didn't think that a result like this was worth enough for an entire subject.

I went for a walk I think to try to work out what I was going to do, and I kept on thinking about why doing a Fourier transform of the spectrum didn't work, and came to the conclusion that I should be looking for an answer in the time domain, not the frequency domain.

Auto-correlation

One neat thing you can do with two signals is to correlate them. If you were at a concert and you had a microphone at the front and one at the back, and you took recordings from both, if you played them back together, then it would sound rubbish because the time delays would be slightly different, giving an extra echo in the sound that would be very strong. If you delay the signal from the front microphone enough, then you should get a much better result. Correlation is a method you can use for doing this, and it is really simple - you just try lots of different delay times, and for each possible delay time, you multiply the amplitude of one source with the amplitude of the other, and add this result to the total for that particular delay time. So if both signals are big positive numbers at the delay compensated time, then it adds lots to the total for this delay, if one is positive and the other is negative, then it subtracts. Typically the correlation is done for every possible time from the start of a signal to the end (with both signals being the same length).

Some things to note:

  • The correlation of a periodic signal (like a sine wave) with another at the same frequency will also be periodic, with the same frequency being present.
  • The correlation of noise and anything else will be pretty much silence.
  • The correlation of a signal and a delayed signal will have a sharp spike at the time where the delay is (and possibly others, if they have periodic stuff)

In our case though, we don't have two signals, we just have the one, and it has both the echo and the original mixed together. So instead of correlating our signal with another, we correlate it with itself. This is called auto-correlation. Now some facts about auto-correlation:

  • The auto-correlation of a periodic signal (like a sine wave) is periodic, with the frequency still being present.
  • The auto-correlation of noise is a sharp spike at t=0, and silence after.
  • The auto-correlation of a signal with a delay will have a spike at the time where the delay is (and possibly others, if it has periodic stuff)
  • An auto-correlation always has a spike at the start
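Here is a small sketch of the brute force version of both, trying every delay up to some maximum. This is not the assignment code, and the function names are invented for the example.

```c
#include <stddef.h>

/* out[lag] = sum over t of a[t] * b[t + lag], for lag = 0 .. max_lag.
 * Both signals have n samples, and out needs room for max_lag + 1
 * totals. A peak at some lag means b looks like a delayed by that lag. */
void correlate(const double *a, const double *b, size_t n,
               size_t max_lag, double *out)
{
    for (size_t lag = 0; lag <= max_lag; lag++) {
        double total = 0.0;
        for (size_t t = 0; t + lag < n; t++)
            total += a[t] * b[t + lag];
        out[lag] = total;
    }
}

/* Auto-correlation is just correlating a signal with itself. */
void autocorrelate(const double *x, size_t n, size_t max_lag, double *out)
{
    correlate(x, x, n, max_lag, out);
}
```

If you feed it a single click followed by a half volume echo 3 samples later, the auto-correlation has the usual spike at the start and a second, smaller spike at a delay of 3.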

So, I decided that I would just try to find the spike in the auto-correlation. This was done by taking lots of auto-correlations of the signal at different times, and taking the average size of each spike. The final result would then theoretically be a convolution (with a spike at the start, and spikes at all the echo spots, with heights corresponding to the amplitude of the echo). This would work great if the original signal was white noise. Unfortunately, because of the periodic stuff, we can have other spikes: if a song has lots of middle Cs in it, then you might mistake the repeated pulses as being an echo.

To fix this, I did a dodgy hack, which worked pretty well. Do a Fourier transform of the signal, look for any frequency components that are stronger than some threshold, and set them to 0, then do an inverse Fourier transform. The idea with this is that we get rid of all the periodic stuff and are left with just the hisses, clicks and stuff like that.

Actually testing the method was difficult, but from my artificial tests I was able to extract the delay time accurately in everything, but I don't think I had the amplitude of the echos correct enough that you could hope to reverse the process. Actually, reversing the process is quite a difficult problem. Before our problem was that we knew S' = S * C, but didn't know what S or C were. Now we know C, but you can't just divide both sides by C, since that isn't straight multiplication - that is convolution.

Here is an example output from this, running off a recording from a club, where I've added a really obvious echo to it:

I like to think that the stuff on the very left is actually the reverb from the room. I have no idea how I'd test it though.

So yeah, it isn't quite a full result. But at least we can kind of mimic the echos in a room (or "steal" someone's reverb unit).

Implementation Fun

Unfortunately, I can't provide any source code for this yet. I don't think it belongs to me any more due to uni copyright madness. And in any case, it was a hacked together application. I might do something similar later though.

If you decide to do it, there is a neat trick you can do for the Fourier transforms - because of the way you use frequencies, you don't actually need to care that the output is bit-reversed, if you just implement a decimation in time and a decimation in frequency algorithm, then you can do transforms and inverse transforms and get the results always in the right order.

Deconvolution Part 1

Sat, 26 May 2007

A while back I did an assignment for a really fun subject. My assignment was going to be about audio deconvolution using cepstrum filtering. I ended up doing the assignment on something which might have been an idea of my own, but probably isn't, but is still a cool idea. Basically, it is one of those ideas you have which no one told you about, but it isn't so crazily complicated that it is infeasible for someone to come up with the idea.

After writing most of this, I realised that it was very long, so I'm breaking it up into parts. It will end up as either 2 or 3 parts I think.

Deconvolution

Suppose you are a musical purist who likes to pretend you can hear things that aren't there, buys one directional cables made from the trace metal in African swallows, wooden volume knobs from the remote Amazon and leads that have been subjected to zero-gravity and buried for 100 years in order to achieve the best sound quality. You come to the realisation that the only way you can get the true sound of your favourite band is to have them in your own house. So you throw out all your audio gear and pay the band to live at your house and whenever you want music, they will play it for you. Let's suppose you also have a bathroom which is 40 metres long, 4 metres wide and 4 metres high. Every surface in your bathroom is covered in hard tiles. Now imagine that you are at one end having a shower, while the band (wishing to respect your privacy) has set themselves up at the other end of the bathroom.

Ok, so this whole musical purist thing is too hard to keep up. Basically, the sounds you hear are never the original sounds as produced by whatever instruments are making them, but they are those original sounds plus lots of little echos (and in the 40m bathroom case, it would be particularly bad). Deconvolution is the method used for separating the original sound from the echos of it. The echos can be characterised as a list of times when the echos happen, and how loud they are. e.g. "an echo 0ms after the original at 90% volume, an echo 1ms after the original at 50% volume, an echo 4ms after at 10% volume, etc". It is a bit easier to view this as another signal, where the amplitude at a certain time is how much echo there is after that time. For example:

Simple convolution

This is a very simple case where we get 90% of the original signal (at t=0ms), an echo after 1ms at 50% volume, and 4ms later a faint 10% echo. Three things to note:

  1. This particular example is highly unlikely, since the total energy is greater than what was put in (90% + 50% + 10% = 150%).
  2. I've made it discrete - I don't think of time as a continuous thing because I'm a computer scientist. It is just the way we are.
  3. The echo at 0ms isn't really an echo - it is just the original sound which is coming straight to us without bouncing off any walls or anything like that. It is just easier to think about it as an echo.

For a real life example, usually the very start of the convolution has a few identifiable peaks corresponding to early reflections (things where the sound has hit only one or two walls before getting to you) and then a horrible mess after that which tapers out (corresponding to all the sounds which bounced forwards and backwards off the front wall and back wall several times before finally hitting a pot plant and making their way to you).

Deconvolution is a problem which is impossible to solve. Unfortunately, for any sound you have there are an infinite number of pairs of "original sound" and "echos" that you can have which make this sound. It is a bit like if we were trying to find out which numbers were multiplied together to get 15. It could have been 1 and 15, 2 and 7.5, 3 and 5, etc. However, although it is impossible, we can still get results which are mostly correct most of the time. The idea is that we say which pairs are more likely to happen. In our 15 example, maybe we know that the numbers are probably whole numbers, and probably near each other - then the original numbers might have been 3 and 5 (or 5 and 3). Sure, they might have been 1 and 15, but when you are doing the impossible you sometimes have to make some compromises.

Convolution

Convolution is the opposite of deconvolution - you take a signal and a list of echos, and you give a new signal with all of these echos applied to it. The most straightforward way of doing this is to take the signal, copy it once for each echo, offset it by that echo, scale it by the echo's volume and then mix all of these signals together. This means that the time taken is proportional to the number of echos multiplied by the length of the signal (since every sample of the signal has to be scaled by each different echo). For the size of numbers we are dealing with, this often turns out to be a bit too slow. Fortunately there is a method of speeding it up significantly so it is nearly proportional to just the length of the signal (assuming the signal is longer than the list of echos).
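The straightforward (slow) way can be sketched like this. I'm assuming the echoes are stored as another signal, where echo[j] is the volume of the echo that arrives j samples after the original, and the function name is made up.

```c
#include <stddef.h>

/* Mixes one scaled, delayed copy of the signal into out[] for every
 * echo. n and m must both be at least 1, and out needs room for
 * n + m - 1 samples. */
void convolve(const double *signal, size_t n,
              const double *echo, size_t m, double *out)
{
    for (size_t i = 0; i < n + m - 1; i++)
        out[i] = 0.0;

    for (size_t j = 0; j < m; j++) {
        if (echo[j] == 0.0)
            continue;                         /* no echo at this delay */
        for (size_t i = 0; i < n; i++)
            out[i + j] += signal[i] * echo[j];
    }
}
```

The two nested loops are where the "number of echos multiplied by the length of the signal" cost comes from.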

The technique uses Fourier Transforms to do its magic, and it isn't intuitively obvious why it should work either. This seems to be a common theme with things to do with Fourier Transforms. Basically a Fourier Transform takes a signal and tells us what frequency components it is made up of. I will hopefully write about Fourier transforms in more detail later. But the short story is that you take the Fourier transform of the signal and of the convolution, then multiply together the parts that correspond to each other (so if the original signal had a 50Hz frequency with amplitude 5, and the convolution had a 50Hz frequency with amplitude 2, then the resulting signal has a 50Hz frequency with amplitude 10). Then you invert the Fourier Transform (which is just the opposite of a Fourier Transform - building a signal from frequency components) and the result is the signal convolved with the convolution. You will have to take my word for it at the moment, but don't take my word too strongly, I've used a bit of mathematician's license with the description, and I have a feeling that I was supposed to reverse one of the signals too, but those are just details.

Time for a break

Convolution is very similar to multiplication like you do at school - you have two numbers (one representing the signal, one representing the echos) and each pair of digits is multiplied together in some order and mixed together (the final step where you add all the numbers together). The main difference is that there is no carrying:

Long multiplication

Compare with convolution (convolving 7123 with 2181):

Convolution

The two calculations are pretty much the same, and if your mind is warped enough then they have the same answer, just the convolved form has slightly more information in it. So from this, you can see my justification for why deconvolution is impossible. If I tell you {14, 9, 61, 23, 20, 26, 3} and ask you what numbers were convolved to give this answer, then you will have a bit of trouble doing so.
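To see that the two really do have the same answer, here is a small sketch that convolves the digits and then does the carrying afterwards. The function names are made up for this example.

```c
#include <stddef.h>

/* Convolves two lists of digits (most significant digit first) without
 * carrying. na and nb must be at least 1, and out needs room for
 * na + nb - 1 columns. */
void convolve_digits(const int *a, size_t na, const int *b, size_t nb,
                     long long *out)
{
    for (size_t i = 0; i < na + nb - 1; i++)
        out[i] = 0;
    for (size_t i = 0; i < na; i++)
        for (size_t j = 0; j < nb; j++)
            out[i + j] += (long long)a[i] * b[j];
}

/* Does the carrying that long multiplication would have done along the
 * way, turning the columns into an ordinary number. */
long long carry(const long long *columns, size_t n)
{
    long long value = 0;
    for (size_t i = 0; i < n; i++)
        value = value * 10 + columns[i];
    return value;
}
```

Convolving the digits of 7123 and 2181 gives the columns {14, 9, 61, 23, 20, 26, 3}, and carrying them gives 15535263, which is 7123 multiplied by 2181.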

 

About

I'm a nerd living in Sydney. This is a place where I can write stuff about my interests and not care that no one else is reading.

I like music, maths, programming, pretty pictures, filters and other good things.

(more info)

It should be fairly obvious that this isn't connected to my employer at all.
