Well, what you say about calculating the X-Y velocity and stretch for only a small matrix, then using a weighted average to get the velocity/stretch for the remaining pixels, helps a bit - but it doesn't help much with applying the actual transform itself (copying all those pixels).
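To make the weighted-average idea concrete, here's a rough sketch in plain C. All the names, the grid size, and the frame size are made up by me for illustration - the point is just that the expensive per-cell math runs only GRID*GRID times, and everything else is bilinear interpolation:

```c
#include <assert.h>

#define GRID 4          /* coarse matrix is GRID x GRID cells */
#define W    64         /* full frame width in pixels (arbitrary) */
#define H    64         /* full frame height (arbitrary) */

static float coarse_dx[GRID][GRID];   /* per-cell X velocity (computed rarely) */
static float dx[H][W];                /* per-pixel X velocity (interpolated)   */

/* Expand the coarse velocity matrix to full resolution by taking a
 * weighted average (bilinear interpolation) of the 4 nearest cells. */
static void expand_field(void)
{
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            /* position of this pixel in coarse-grid coordinates */
            float gx = (float)x * (GRID - 1) / (W - 1);
            float gy = (float)y * (GRID - 1) / (H - 1);
            int x0 = (int)gx, y0 = (int)gy;
            int x1 = x0 + 1 < GRID ? x0 + 1 : x0;
            int y1 = y0 + 1 < GRID ? y0 + 1 : y0;
            float fx = gx - x0, fy = gy - y0;

            /* weighted average of the four surrounding cells */
            dx[y][x] = coarse_dx[y0][x0] * (1 - fx) * (1 - fy)
                     + coarse_dx[y0][x1] * fx       * (1 - fy)
                     + coarse_dx[y1][x0] * (1 - fx) * fy
                     + coarse_dx[y1][x1] * fx       * fy;
        }
    }
}
```

You'd do the same thing again for the Y deltas. Note this only saves the *calculation* of the field - the interpolation itself still touches every pixel, which is the real problem.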
Even with vis plugins that use a set of fixed delta fields that are only calculated once at initialization (look at the source for some XMMS plugins, such as Infinity and Jakdaw, to get a good idea of the basics of vis), you still need a LOT of CPU just to push those pixels around from one frame to another.
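The per-frame "push pixels around" pass boils down to something like the following sketch (my own simplified version, not actual Infinity/Jakdaw code - I'm assuming 8-bit palettized pixels and integer delta fields that say where each destination pixel comes from):

```c
#include <assert.h>

#define W 64            /* arbitrary demo frame size */
#define H 64

static unsigned char frame[2][H][W];   /* double-buffered frames          */
static int src_x[H][W], src_y[H][W];   /* delta fields, fixed at init     */

/* Build the new frame by fetching, for every destination pixel, the
 * source pixel the delta field points at.  This touches all W*H pixels
 * every frame, which is where the CPU time goes. */
static void apply_transform(int cur)   /* cur = index of the new frame */
{
    int prev = cur ^ 1;
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            frame[cur][y][x] = frame[prev][src_y[y][x]][src_x[y][x]];
}
```

Since neighboring destination pixels can come from wildly different source locations, the reads scatter all over the previous frame - which is also why this loop is so hostile to both the cache and MMX.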
MMX can help if you're doing the transforms in RGB space and not a palettized color space (do all 3 color components of a pixel at once). If you're using a palettized color space, it's not very useful, as you're RARELY going to copy a row of 8 pixels to an identical row of 8 pixels somewhere else on the screen. (Trust me - I've tried. I forget right now whether it was Infinity or Jakdaw that used RGB colorspaces at all points, but I was able to get a small performance boost from it using MMX. I could probably get more, but NOTHING like Milkdrop's ability to run wicked-fast at 1152x864.)
Jakdaw actually does use hardware acceleration to a slight degree - it renders to an OpenGL texture, which is scaled up and antialiased by the hardware. (This leaves the interesting thought of doing plugin work in YUV color spaces - it could make for some interesting effects, and most modern video cards support hardware scaling/conversion of YUV overlays in order to accelerate video applications, as almost any video compression codec in existence works in one of the YUV colorspaces.)
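For reference, the conversion the overlay hardware does on the way to the screen looks roughly like this - a common BT.601-style integer approximation of Y'CbCr to RGB (the constants are the standard ones; exact rounding varies between cards, and this particular fixed-point layout is just my sketch):

```c
#include <assert.h>

static unsigned char clamp8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (unsigned char)v;
}

/* Full-range Y'CbCr -> RGB, BT.601 coefficients in 16.16 fixed point. */
static void yuv_to_rgb(int y, int cb, int cr,
                       unsigned char *r, unsigned char *g, unsigned char *b)
{
    int d = cb - 128, e = cr - 128;
    *r = clamp8(y + ((91881 * e) >> 16));              /* + 1.402 * Cr          */
    *g = clamp8(y - ((22554 * d + 46802 * e) >> 16));  /* - 0.344*Cb - 0.714*Cr */
    *b = clamp8(y + ((116130 * d) >> 16));             /* + 1.772 * Cb          */
}
```

The appeal for vis work is that you could run an effect purely on the Y (brightness) plane at full resolution and let the hardware do this conversion for free on every pixel.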
It seems to me like Ryan has somehow convinced the hardware to perform the inter-frame transform. (I'm sure the hardware can do it, as it's a similar technique to moving walls, etc. in 3D games).
A simple "waterfall" effect could be achieved by texture-mapping your waveform onto a square. Every frame, the square moves down the screen, and is faded a bit. (Don't know if the hardware can do the fading itself, but subtracting a small amount from every pixel in a given frame/texture CAN be easily accelerated a great deal with MMX, since all the pixels are staying together in the frame and it's easy to operate on 8-pixel blocks.) You'd have to keep moving your wave upwards on the texture map so that it always appears in the same position, and you'd also have to occasionally start a new square above the one scrolling off the screen.
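The fade itself is just a saturating subtract on every byte - here's the plain-C version of it. This is exactly the operation MMX's PSUBUSB instruction performs on 8 bytes at once, which is why the fade vectorizes so well: unlike the warp, the pixels stay in place, so you really do process contiguous 8-pixel blocks.

```c
#include <assert.h>

/* Darken a buffer by subtracting `amount` from every byte,
 * saturating at zero so pixels fade to black and stay there. */
static void fade(unsigned char *buf, int n, unsigned char amount)
{
    for (int i = 0; i < n; i++)
        buf[i] = buf[i] > amount ? buf[i] - amount : 0;
}
```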
Now the question is, is there a technique for making 3D hardware perform a 2D pinch/whirl/other transform given a set of X and Y shifts for each portion of the texture or object... Hmm, time to go figure out where my OpenGL book is and maybe actually start learning OGL.