I have written before about the appeal of wideband codecs, and the damping effect that royalty issues have on them. Speex is an interesting alternative wideband codec: open source and royalty free. Having discussed the new Skype codec with Jonathan Christensen earlier this year I thought it would be interesting to hear from the creator of Speex, Jean-Marc Valin.
MS: What got you into codecs?
JMV: I did a master’s in Sherbrooke lab – the same lab that did G.729 and all that. I did speech enhancement rather than codecs, but learned a bit about codecs there and after I did my master’s I thought it would be nice to have a codec that everybody could use, especially on Linux. All those proprietary codecs were completely unavailable on Linux because of patent issues, so I thought it would be something nice to do. I met a guy named David Rowe who thought the same thing and knew more about codecs than I did, so we started Speex together. In the end he didn’t have much time to write code but I did and he had great advice and feedback.
MS: How much of Speex did you write, and how much was contributed by others?
JMV: I wrote about 90%, but most of the contributions were not in code but in terms of bug reports, feedback, suggestions, or in the early beginning David Rowe didn’t write much code but he gave me really good advice. So a lot was contributed, but not a lot of the contributions were code. The port to windows was contributed.
MS: Were there any radical innovations in algorithms that a contributor came up with?
JMV: No, I don’t think there’s an issue of that. And even what I wrote it was mostly just a matter of putting together building blocks that were generally known, and just putting together so that a decent codec resulted. There’s nothing in Speex that somebody would look at and say “Wow, this is completely unheard of.” There are a few features that aren’t in other codecs, but they’re not fundamental breakthroughs or anything like that.
MS: How is Speex IPR-free? Do you just study the patents and figure out work-arounds or do you just assume that if you write code from scratch it’s not infringing, or do you look at patents for speech technologies that have already expired…
JMV: It’s actually a mixture of all that. Basically the first thing with Speex is that I wasn’t trying to innovate, especially in the technological sense. A lot of Speex is built on really old technology that either wasn’t patented or if it was the patents had expired. A lot of 80’s technologies.
CELP is 80’s technology. CELP was not patented. There are developments of it like ACELP which was patented – actually by my former university, so although its actually a pretty nice technique I couldn’t use it so I just avoided it and used something else, which turned out to be not that much worse, and in the end it didn’t really matter – it was just a bit of an inconvenience.
MS: Are the users like Adobe calling you to verify that Speex is IPR free?
JMV: I had a few short contacts with them. I didn’t speak with any lawyers, so I assume somebody had a look at it and decided that it was safe enough to use. It’s a fundamental problem with patents, in any case, regardless of whether you’re open source royalty free or proprietary, patented or anything like that. Anyone can claim they have a patent on whatever you do. At some point it’s a calculated risk, and Speex is not more risky than any other technology. Even if you license the patents you never know who else might claim they have a patent on the thing.
MS: Has anybody tried that with Speex?
MS: How long has Speex been in use?
JMV: I started Speex in Feb. 02 and v1.1 was released in March 03 at which point the bit stream was frozen. All codecs have to freeze the bit stream at some point. All the G.72x ITU codecs have a development phase, then they agree on the codec and they say “this is the bit stream and it’s frozen in stone,” because you don’t want people changing the definition of the codec because nobody would be able to talk to each other.
MS: But you can change the implementations of the algorithms that generate the bit stream?
JMV: Most of the ITU codecs have a so-called “bit-exact” definition, which means a certain bit pattern as audio input has to produce exactly this pattern as the compressed version. This leaves a lot less room for optimization.
MS: Does Speex have a bit exact definition?
JMV: No. The decoder is defined, so the bit stream itself is defined, but there is no bit-exact definition, and there can’t really be because there is a floating point version and you can’t do bit exact with floating point anyway.
In that sense it’s more similar to the MPEG codecs that are also not bit-exact.
After the bit stream was frozen I spent quite a lot of time doing a fixed point port of Speex so it could run on ARM and other processors that don’t have floating point support. I also spent some time doing quality optimizations that didn’t involve changing the bit stream. There are still a lot of things you can do in terms of improving the encoder to produce a better combination of bits.
MS: So the decoder doesn’t change, but the encoder can be improved and that will give you a better end result?
JMV: Exactly. That’s what happened for example with MP3, where the first encoders were really, really bad. And over time they improved, and the current encoders are much better than the older ones.
MS: Have you optimized Speex for particular instruction sets?
JMV: There are a few optimizations that have been done in assembly code for ARM. Mostly for the ARM4 architecture there’s a tiny bit of work that I did several years ago to use the DSP instructions where available.
MS: How much attention have you paid to tweaking Speex for a particular platform, like for example a particular notebook computer?
JMV: Oh, no, no. First, all of that is completely independent of the actual codec. In the case of Speex I have in the same package I have the Speex codec and a lot of helper functions, echo cancellation, noise suppression and things like that. Those are completely independent of the codec. You could apply them to another codec or you could use Speex with another echo canceller. It’s completely interchangeable, and there are not really any laptop specific things. The only distinction between echo cancellers is between acoustic echo cancellers and line echo cancellers, which are usually completely different. The acoustic echo will be used mostly in VoIP when you have speakers and microphones instead of headsets. What really changes in terms of acoustic echo is not really from one laptop to another but from one room to another because you are canceling the echo from the whole room acoustics.
MS: Isn’t there a direct coupling between the speaker and the microphone?
JMV: Overall what you need to cancel is not just the path from the mic to the speakers. Even with the same laptop the model will change depending on all kinds of conditions. There’s the direct path which you need to cancel, but there’s also all the paths that go through all the walls in your room. Line echo cancellers only have a few tens of milliseconds, whereas acoustic echo cancellers need to cancel over more than 100 milliseconds and cancel all kinds of reflections and things like that.
Even if you are in front of your laptop and you just move slightly the path changes and you have to adjust for that.
MS: So who did the echo cancellers in the Speex package – was that you?
MS: G.711 has an annex that includes PLC (Packet Loss Concealment), and others say PLC is a part of their codec.
JMV: The PLC is tied to the codec in the case of Speex and pretty much all relatively advanced codecs. G.711 is pretty old and all packets are completely independent, so you can do pretty much anything you like for concealment. For Speex or any other CELP codec you need to tie the PLC to the codec.
MS: As far as wideband is concerned, how wideband is Speex? What are the available sample rates?
JMV: Wideband was part of the original idea of Speex. I didn’t even think about writing a narrowband version of it. And in the end some people convinced me that narrowband was still useful so I did it. But it was always meant to be wideband. The way it turned out to be done was in an embedded way, which means that if you take a wideband Speex stream it is made up of narrowband stream and extra information for the higher frequencies. That makes it pretty easy to interoperate with narrowband systems. For instance if you have a wideband stream and you want to convert it to the PSTN you just remove the bits that correspond to the higher frequencies and you have something narrowband. This is for 16 kHz. For higher frequencies, Speex will support also a 32 kHz mode – I wouldn’t say that it’s that great, and that’s one of the reasons I wrote another codec which is called CELT (pronounced selt).
MS: What about the minimum packet size you have for Speex?
JMV: Packetization for Speex is in multiples of 20 ms. The total delay is slightly more than that – around 10 ms, so the total delay introduced by the codec is around 30 ms, which is similar to the other CELP based codecs.
MS: is Speex a variable bit rate codec?
JMV: It has both modes. In most VoIP applications people want to use constant bit rate because they know what their link can operate at. In some cases you can use VBR, that’s an option that Speex supports.
VBR will reduce the average bandwidth so if you have hundreds of conversations going through the same link, then at the same quality VBR will take of the order of 20% less bandwidth, or something like that. I don’t remember the exact figures.
A conversation can go above the average bit rate just as easily as it can go below.
MS: Can you put a ceiling on it, suppose you specify a variable bit rate not to exceed 40 kbps?
JMV: Yes, that’s a supported option. It would sound slightly worse than a constant bit rate of 40 kbps. There’s always the trade-off of bit rate and quality. I believe some people already do it in the Asterisk PBX, but I could be wrong on that one.
MS: How does Speex compare to other codecs on MIPS requirement?
JMV: I haven’t done precise comparisons, but I can say that in terms of computational complexity Speex narrowband is comparable to G.729 (not G.729A, which is less complex) and AMR-NB and Speex wideband is comparable to AMR-WB. The actual performance on a particular architecture may vary depending on how much optimization has been done. In most applications I’ve seen, the complexity of Speex is not a problem.
MS: So what about AMR-WB? Seems like it’s laden with IP encumbrances? What are the innovations in that that make it really good, and do you think it’s better than Speex or does Speex have alternative means of getting the same performance?
JMV: I never did a complete listening test comparing Speex to AMR-WB. To me Speex sounds more natural, but I’m the author, so possibly someone would disagree with me on that. In any case there wouldn’t b a huge difference of one being much better than the other. The techniques are pretty different, AMR-WB uses ACELP. Both are fundamentally CELP but they do it very differently.
MS: The textbooks say CELP models the human voice tract. What does that mean?
JMV: It’s not really modeling; it’s making many assumptions that are sort of true if the signal is actually voice. Basically the LP part of CELP is Linear Prediction, and that is a model that would perfectly fit the vocal tract if we didn’t have a nose. The rest has to do with modeling the pitch, which is very efficient, assuming the signal has a pitch, which is not true of music, for instance. The Code Excited part is mostly about vector quantization, which is an efficient way of coding signals in general.
The whole thing all put together makes it pretty efficient for voice.
MS: What is the biggest design win that you know of for Speex?
JMV: There are a couple of high profile companies that use Speex. Recently the one that people talked about was Flash version 10. Google Talk is using it as well.
MS: Do you track at all how many people are using it in terms of which applications are using it?
JMV: In some cases I hear about this company using Speex, or that company tells me they are using it or they ask me a few questions so they can use it. So I have a vague idea of a few companies using it, but I don’t really track them or even have a way to track them because a part of the idea of being open source is that anyone can use it with very few restrictions, and with no restrictions from me on having to get a license or anything like that.
MS: How many endpoints are running Speex now?
JMV: It’s pretty much impossible to say. There are a large number of video games that use Speex. It’s very popular in that market because it’s free.
MS: Would gamers want to use CELT instead? That’s a very delay-sensitive environment.
JMV: I think it depends on the bandwidth. I was involved in one of the first games that used it was unreal tournament in 2004 and they were using a very low bit rate, so CELT wouldn’t have worked. Now the bandwidths are larger, so possibly someone will want to use CELT at some point.
MS: What is CELT?
JMV: It’s an acronym for Constrained Energy Lap Transform. It actually works pretty equally either on voice or music. The bandwidth is a bit more than Speex. Speex in wideband mode will usually take about 30 kbps at 16 kHz, whereas with CELT usually you want to use at least 40 kbps. At 40 kbps you have pretty decent voice quality, at full audio bandwidth, 44 or 48 kHz. This is the CD sample rate or slightly higher, 48 which is another widely used rate. For those sample rates which basically give you the entire audio spectrum in terms of bit rate usually you need at least 40 for voice. You can go a bit lower but not much. If you use 48 you get decent music quality and at 64 you get pretty good music quality.
MS: Is CELT a replacement for Speex?
JMV: No, there is definitely a place for both of them. There’s actually very little competition between them. usually people either want the lower rate of Speex; for instance if you want something that works at 20 kbps, you use Speex, and CELT is for higher bit rate, lower delays, and also supports music, so there’s nearly no overlap between the two.
MS: How does CELT compare to MP3?
JMV: I actually did some tests with an older version, and in terms of quality alone it was already better than MP3 which was originally quite surprising to me, because my original goal was not to beat MP3 but to make the delay much lower, because you can’t use MP3 for any kind of real-time communication, because the delay will be way more than 100 ms. CELT is designed for delays that are lower than 10 ms.
MS: Wow! So how many milliseconds are in a packet?
JMV: It is configurable. You can use CELT with packets as small as around 2 ms. or you can use packets that are up to 10. The default I recommend is around 5 ms.
MS: So the IP overhead must be astronomical! 2 ms at 64 kbps is 16 bytes per packet!
JMV: In normal operation you wouldn’t use the 2 ms mode, but I wanted to enable real-time music collaboration over the network. So you can have two people on the same continent that play music together in real time over the net. This is something you can’t do with any other codec that exists today.
MS: So the Internet delay is going to be at least 70ms.
JMV: It depends. Overall you need to have less than 25 ms one – way delay for the communication to work. That’s the total delay. So if you look at existing codecs, even so-called low-delay AAC already has at least 20 ms of packetization delay. So if you add the network and anything else you can’t make it. Codecs such as AMR-WB will have 25 or 30 ms for packetization, so you have already busted right there. It won’t work for music. So this is one reason why I wrote CELT.
MS: Have you played music over the Internet with CELT yet?
JMV: I haven’t tried it yet but some other people have tried it and reported pretty good results.
The other goal, even if you are not trying to play music over the network, it has to do with AEC which although some software does it relatively well it’s still not a completely solved problem. If you are able to have the delay low enough you almost don’t have to care about echo cancellation at all because the feedback is so quick that you don’t even notice the echo. Just like when you speak on the phone you hear your own voice in the head set and it doesn’t bother you because it’s instantaneous.
MS: Has anybody done any research on how long the delay can get before it begins to become disorienting?
JMV: There’s is some research, usually for a particular application and it will depend on the amount of echo and all of that but usually if you manage to get it below around 50 ms for the round trip usually it won’t be a problem, and the lower the delay the less annoying the echo is, even if you don’t cancel it.
MS: What phone are you talking on now?
JMV: My company phone. At home I’m set up with my cable provider.
MS: So you don’t use Speex yourself?
JMV: I used it when I was in Australia and I wanted to talk with my family back here. But I’m not using it in any regular fashion now.
MS: So what software did you use with the webcam?
JMV: At the time I was using Ekiga and OpenWengo. Both are Linux clients, because I don’t run Windows on my machines. Open Wengo is one of the few on Linux that can talk to a Windows machine.
MS: Have you ever used Skype?
JMV: Once or twice, but not regularly.
MS: What kind of cell phone do you have?
JMV: A really basic cell phone I think I have sent maybe one or two SMS messages in my life and that’s the most complicated I have ever done with that phone, which I use mainly in case of emergencies. I am not a telecommunication power user or anything like that.
MS: I am thinking that cell phones are how wideband codecs are going to take off. I’m not talking about AMR-WB. There are going to be hundreds of millions of smart phones sold, over the next few years that have Wi-Fi in them. And because you will be able to do VoIP from a platform that you can load applications onto, it seems like a Wi-Fi voice client for smart phones is going to be a way that wideband audio can really infiltrate and take off. I’m thinking that that might be a way for you to start using Speex in your daily life.
JMV: Well, I sure hope that Speex will take off a lot more in that area. Originally it wasn’t planned to go that far. Originally the only market I had in mind was the Linux or open source market with PC based soft phones. That’s the only thing I cared about. It was designed mainly for IP networks as opposed to wireless, and I just wanted to see far it would go. It turned out to be a lot further than I expected.
Porting to Windows was done pretty early in the game. That was a contribution – I have never actually compiled it for Windows. And eventually people started porting it to all sorts of devices I have never heard of like embedded DSPs and lots of different architectures.
MS: I must say, thank you very much. I feel that wideband audio is a great benefit to the telephone world, and will undoubtedly become very common over time, and one of the biggest impediments to wideband audio is the intellectual property issue, so having an open source, IPR-free implementation of a wideband codec that seems to be a good one is just a great thing for the world, and a wonderful thing you have done for the world.
JMV: I think wideband will be pretty important, especially for voice over IP because it’s basically the only way that VoIP can ever say that it’s better than the PSTN. As long as people stay with narrowband the best VoIP can be is “almost as good as PSTN.” And yes, IPR is a pretty serious issue there.