Beat Detection [Thoughts/Comments/Ideas/Questions]
I've seen several old posts on beat detection and wanted to add my persepective in a new thread.
I'm an Electrical Engineer specialized in DSP and general signal processing so I think I have a different perspective compared to a lot of you software guys.
I'm trying to develop a beat detection algorithm plugin for WinAmp using some of it's imbedded abilities. Namely, all I need is the audio stream that WinAmp sends to the sound card (and thus into the analog waveform we call sound).
What I'm looking for in this post is others' thoughts/comments/ideas on my algorithm as well as trying to help get some better information out there compared to some of the other treads that I've seen with this subject.
First and foremost, I must make a definition of average signal energy. In this case for us, it would be the energy of the sound waveform we hear coming from our speakers. The signal energy is defined as the integral with respect to time of the signal squared over time period t1 to t2. I will refrain from writing that down in simple text notation as it will look fairly ugly. Average signal energy is defined as the signal energy devided by the number of samples (in discrete time) or the length of sample time (in continuous time). The second definition I need to make is instantaneous signal energy. The instantaneous signal energy is defined as the signal value squared.
In order for my algorithm to be valid, I have to make an assumption: there is greater signal energy during a beat than during a non-beat. Many other threads have hinted at the fact that there may be more bass or treble in the signal during a beat, but this assumption encompasses the entire frequency spectrum amungst other things.
Now, to the algorithm:
I select a sample size, of around 300 samples (this value is in need of tweaking, and I have not done enough analysis to determine the best sample size). I keep a running computation of the average signal energy by using the audio stream sent to the sound card. Note that the oldest samples of the 300 are constantly being replaced by the newer samples. I then use the current sound data to compute the instantaneous signal energy. I then compare the instaneous signal energy to the average signal energy. If it is greater, then there is a beat occuring.
This algorithm gives an "ok" result but there is a long stage where a beat occurs. This is because there are many samples where the instanteous signal energy is greater than the average. I refined the algorithm by observing the instaneous signal energy to the average. When the instanteous is at a maximum and greater than the average signal energy, a beat has occured. This change gave great results.
Now most people want to target a certain frequency spectrum with beat detection (most notably bass). This can be accomplished by taking the fourier transform of the sampled signal which effectively convertes the sound waveform in the time domain to a signal in the frequency domain. (the x-axis goes from time to frequency, the y-axis remains amplitude). [more on the best transform to use later].
The algorithm can then be completed as it was before, observing only the particular frequency range of interest. Note that the computations for signal energy are valid and identical in both the frequency and time domains.
The beauty of this algorithm is it can target different frequency ranges because different types of music typically have beats in different frequency components. (that's why trying to use a bass detection algorithm won't work well with latin music and vice versa).
Ok, so this sounds like a really great idea and all, but so far I've described everything in the continuous time/frequency domain. Obviously, we're working on a computer in the discrete domain. [And trust me, do not try and create an analog circuit that can do this functionality, it doesn't work, I've explored that].
Description of the algorithm in Discrete Time:
1. Keep a running average of +/- 300 samples of audio that is being read from WinAmp
2. Do a DCT (Discrete Cosine Transform) of the 300 samples to break the signal into it's frequency components.
3. Observe the instantneous signal energy and note when it is at a maximum and greater than the average signal energy.
4. A beat has occred when (3) has been met.
Notes about the algorithm:
1. I choose the DCT (discrete cosine transform) because it is easy to compute and gives desireable values. I could go into a long explanation of why it gives desireable values, but I'll just say it instead: other transforms such as the DFT and FFT give both real and imagonary parts where the DCT does not because of the nature of the cosine. So with no imaginary parts, there is no additional computations/complexities to the algorithm. The DCT can also be represented in Matrix form, so the computation is simple and non-iterative.
2. The sample value chosen [300 samples] is just a ballpark value. By making it larger, the algorithm will be more resistive to sound change and will observe beats more frequently. With less samples, the algorithm may become unstable as it will change too quickly to be a valid comparator to the instantaneous signal energy. Tweaking this sample size to the proper value will ensure that sound amplitude (volume) does not effect beat detection.
3. I've ran this algorithm with some tweaking using MATLAB with excellent results. The one concern I began to have is MATLAB wasn't fast enough to do everything. I could acctually observe a lag between when a beat actually occured and when a beat was detected. This is because MATLAB has a mediocre sampling ability. Second, the sound had some distortion on it. Thirdly, MATLAB is MUCH slower than something like C++, using C++ would get rid of that lag. I now am trying to make the move into using software like WinAmp to get the audio data that I need to run the algorithm.
Well I hope that this brought insight to some and sparked new ideas/comments. I'd love to hear them. I hope that everyone was able to get through this long post as I almost passed out just reading through it once.
Hope to hear interesting responses!