(see part 1)
Cepstrum Filtering
Cepstrum filtering is a method of deconvolution which has a pretty neat idea behind it. If we take the Fourier transform of a signal that has been convolved with something, then we get a list of frequency components, but we know that each component was made by multiplying the original by that of the convolution. So if we can guess which number multiplied to give us the number we have, then everything is fine. Well, we can't really guess that, so we take the log of this number, now if you remember your high school maths, then you will know that log(ab) = log(a) + log(b). So this doesn't quite seem to help, because now we are just trying to guess which two numbers were added together, so then we take another Fourier transform. What does this achieve? Well, here we start making assumptions about the original signal and the echos - we assume that the original signal was changing, and the echos always worked the same way (so if there was an echo after 2ms at the start of the signal, there would be an echo after 2ms at the end of the signal, and everywhere in between). This means that the signal will have high frequency stuff going on after this second Fourier transform, and the echo signal will have the low frequency stuff going on.
You then draw a line somewhere and say "everything before this is the echo and everything after is the signal" and then reverse the whole process with those parts separated out. In theory, this will give you the signal and the convolution.
Another way of looking at this is thinking about what happens in a simple case when there is a single point of echo in a signal. Imagine you are 10 metres away from the edge of an infinite swimming pool (i.e. it is infinite in every direction except for straight in front of you where you have a wall). Lets say that if you make a ripple, then it will hit the wall and return to you in 20 seconds (so the wave travels at 1 metre per second). And finally, you have a device which records the water level somewhere in the pool. For simplicity, we will put it where you are making ripples.
If you make a splash once every 20 seconds, then after your first splash, all of them will be slightly bigger, because they will contain the reflection of the previous splash (note that it doesn't get bigger and bigger and out of control, because there is no wall behind you to reflect the wave back). A similar thing will happen if you make a splash once every 10 seconds, just the echo won't work for the first 2 splashes. And for 5 seconds, and 2.5 seconds, and all these numbers as you keep on halving. What happens to a splash made once every 7 seconds? Well, it does get an echo, but it doesn't really line up with the splash you are making, so because there isn't much pattern, it will add about as much to your splash as it removes. So if you draw a graph of frequency of your splash vs typical amplification, then you get a pattern that has little waves in it. However, usually frequencies live nicely on logarithmic scales (since halving a frequency gives you the same "note" just an octave down), and once you do this you get a nice wave pattern, where the frequency of this wave corresponds to the echo time somehow, and the amplitude of the wave corresponds to how much echo there is.
Here are two graphs I found from my report. The first is the original spectrum, the second is the original+echo spectrum. The difference is very subtle, but you can hopefully see that the original+echo spectrum is more wavey towards the right (higher frequencies). I don't have time to redo this properly (with useful things like units):
Original:

Original+echo:

At this point, you might be tempted to try to work out the frequency of this wavey pattern by using a Fourier transform. Unfortunately, due to maths being awesome, it just gives you the original signal in reverse. The cepstrum filtering trick should fix this.
Unfortunately, I found that it was super sensitive to noise, and I never got any stable stuff out of it. I'm still curious though, and I hope it was an implementation bug.
Panic
So I mentioned before that this was an assignment. And it wasn't any old assignment, it was an assignment which was worth 100% of my final mark. There was no exam, no class tests, no participation marks or anything. And it was due 2 days before all the marks had to be in, so you couldn't get an extension, or complain if you got a bad mark. I never really cared about marks, but I was also very nervous because if I handed this in as my assignment, then it would seem pretty dodgy. I knew it wouldn't fail (the lecturer loved FFTs, and I had two implementations of it, which surely would be a bonus, but still I didn't think that a result like this was worth enough for an entire subject.
I went for a walk I think to try to work out what I was going to do, and I kept on thinking about why doing a Fourier transform of the spectrum didn't work, and came to the conclusion that I should be looking for an answer in the time domain, not the frequency domain.
Auto-correlation
One neat thing you can do with two signals is to correlate them. If you were at a concert and you had a microphone at the front and one at the back, and you took recordings from both, if you played them back together, then it would sound rubbish because the time delays would be slightly different, giving an extra echo in the sound that would be very strong. If you delay the signal from the front microphone enough, then you should get a much better result. Correlation is a method you can use for doing this, and it is really simple - you just try lots of different delay times, and for each possible delay time, you multiply the amplitude of one source with the amplitude of the other, and add this result to the total for that particular delay time. So if both signals are big positive numbers at the delay compensated time, then it adds lots to the total for this delay, if one is positive and the other is negative, then it subtracts. Typically the correlation is done for every possible time from the start of a signal to the end (with both signals being the same length).
Some things to note:
- The correlation of a periodic signal (like a sine wave) with another at the same frequency will also be periodic, with the same frequency being present.
- The correlation of noise and anything else will be pretty much silence.
- The correlation of a signal and a delayed signal will have a sharp spike at the time where the delay is (and possibly others, if they have periodic stuff)
In our case though, we don't have two signals, we just have the one, and it has both the echo and the original mixed together. So instead of correlating our signal with another, we correlate it with itself. This is called auto-correlation. Now some facts about auto-correlation:
- The auto-correlation of a periodic signal (like a sine wave) is periodic, with the frequency still being present.
- The auto-correlation of noise is a sharp spike at t=0, and silence after.
- The auto-correlation of a signal with a delay will have a spike at the time where the delay is (and possibly others, if it has periodic stuff)
- An auto-correlation always has a spike at the start
So, I decided that I would just try to find the spike in the auto-correlation. This was done by taking lots of auto-correlations of the signal at different times, and taking the average size of each spike. The final result would then theoretically be a convolution with a spike at the start, and spikes at all the echo spots, with heights corresponding to the amplitude of the echo). This would work great if the original signal was white noise. Unfortunately, because of the periodic stuff, we can have other spikes, if a song has lots of middle Cs in it, then you might mistake the repeated pulses as being an echo.
To fix this, I did a dodgy hack, which worked pretty well. Do a Fourier transform of the signal, look for any frequency components that are stronger than some threshold, and set them to 0, then do an inverse Fourier transform. The idea with this is that we get rid of all the periodic stuff and are left with just the hisses, clicks and stuff like that.
Actually testing the method was difficult, but from my artificial tests I was able to extract the delay time accurately in everything, but I don't think I had the amplitude of the echos correct enough that you could hope to reverse the process. Actually, reversing the process is quite a difficult problem. Before our problem was that we knew S' = S * C, but didn't know what S or C were. Now we know C, but you can't just divide both sides by C, since that isn't straight multiplication - that is convolution.
Here is an example output from this, running off a recording from a club, where I've added a really obvious echo to it:

I like to think that the stuff on the very left is actually the reverb from the room. I have no idea how I'd test it though.
So yeah, it isn't quite a full result. But at least we can kind of mimic the echos in a room (or "steal" someones reverb unit).
Implementation Fun
Unfortunately, I can't provide any source code for this yet. I don't think it belongs to me any more due to uni copyright madness. And in any case, it was a hacked together application. I might do something similar later though.
If you decide to do it, there is a neat trick you can do for the Fourier transforms - because of the way you use frequencies, you don't actually need to care that the output is bit-reversed, if you just implement a decimation in time and a decimation in frequency algorithm, then you can do transforms and inverse transforms and get the results always in the right order.