HD Voice, Peering and ENUM

The most convenient route between telephone service providers is through the PSTN, since you can’t offer phone service without connecting to it. Because of this convenience, telephone service providers tend to consider PSTN connectivity adequate, and don’t take the additional step of delivering IP connectivity. This is unfortunate because it inhibits the spread of high quality wideband (HD Voice) phone calls. For HD voice to happen, the two endpoints must be connected by an all-IP path, without the media stream crossing into the PSTN.

For example, OnSIP is my voice service provider. Any calls I make to another OnSIP subscriber complete in HD Voice (G.722 codec), because I have provisioned my phones to prefer this codec. Calls I make to phone numbers (E.164 numbers) that don’t belong to OnSIP complete in narrowband (G.711 codec), because OnSIP has to route them over the PSTN. If OnSIP were able to use an IP address for these calls instead of an E.164 number, it could avoid the PSTN and keep the call in G.722.

Xconnect has just announced an HD Voice Peering initiative, where multiple voice service providers share their numbers in a common directory called an ENUM directory. When a subscriber places a call, the service provider looks up the destination number in the ENUM directory; if it is there, the directory returns a SIP address (URI) to substitute for the phone number, and the call can complete without going over the PSTN. About half the participants in the Xconnect trial go a step further than ENUM pooling: they interconnect (“peer”) their networks directly through an Xconnect router, so the traffic doesn’t need to traverse the public Internet. [See correction in comments below]
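
To make the lookup step concrete, here is a minimal sketch of what an ENUM client does with a phone number, following the general scheme in RFC 6116: the digits are reversed, dot-separated and appended to a zone, and the NAPTR record found at that name carries a substitution expression that yields the SIP URI. The function names, the example number and the example.com and example.net addresses are my own illustrations, not anything Xconnect has published. The regexp handling is simplified: it uses Python's regex dialect and does not cope with escaped delimiters.

```python
import re
from typing import Optional


def e164_to_enum_domain(number: str, zone: str = "e164.arpa") -> str:
    """Build the DNS name an ENUM client queries for an E.164 number."""
    digits = [ch for ch in number if ch in "0123456789"]
    if not digits:
        raise ValueError("number contains no digits")
    # Digits are reversed so the most significant digit sits nearest the zone.
    return ".".join(reversed(digits)) + "." + zone


def apply_naptr_regexp(field: str, number: str) -> Optional[str]:
    """Apply a NAPTR substitution expression such as '!^.*$!sip:a@b!'.

    Simplification (assumption): delimiters are never escaped inside the
    pattern, and Python's regex syntax stands in for POSIX ERE.
    """
    delim = field[0]
    parts = field.split(delim)
    if len(parts) != 4:
        raise ValueError("expected <delim>pattern<delim>replacement<delim>flags")
    _, pattern, replacement, flags = parts
    match = re.search(pattern, number, re.IGNORECASE if "i" in flags else 0)
    return match.expand(replacement) if match else None


# Hypothetical number and record, for illustration only.
print(e164_to_enum_domain("+1 415 555 0100"))
print(apply_naptr_regexp("!^.*$!sip:alice@example.com!", "+14155550100"))
```

A real client would fetch the NAPTR records with a DNS library and pick the one whose service field advertises SIP; that part is left out here.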

There are other voice peering services that support this kind of HD connection, notably the VPF (Voice Peering Fabric). The VPF has an ENUM directory, but as the name suggests, it does not offer ENUM-only service; all the member companies interconnect their networks on a VPF router.

Some experts maintain that for business-grade call quality, it is essential to peer networks rather than route over the public Internet. Packets that traverse the public Internet are prone to delay and loss, while properly peered networks deliver packets quickly and reliably. In my experience, this has not been an issue. My access to OnSIP and to Vonage is over the public Internet, and I have never had any quality issues with either provider. From this I am inclined to conclude that explicit peering of voice networks is overkill, and that if you have a VoIP connection all that is needed for HD voice communication is to list your phone number in an ENUM directory. Presumably the voice service providers in Xconnect’s trial that are not peering share this opinion.

Xconnect’s ENUM directory is enormous, partly because it is pooled with Pathfinder – the GSMA ENUM directory administered by Neustar. Xconnect’s ENUM directory had over 120 million numbers in it as of 2007.

Xconnect and the VPF only add to their ENUM directories the numbers owned by their members. But even if you are not a customer of one of their members, you can still list your number in an ENUM directory, e164.org. This way, anybody who checks for your number in the directory can route the call over the Internet. Calls made this way don’t need to use SIP trunks, and they can complete in HD voice.

If you happen to have an Asterisk PBX, you can easily provision it to check in a list of ENUM directories before it places a call.
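
The logic a PBX follows here is simple enough to sketch. The Python below is not Asterisk configuration; it is a hedged illustration of the idea: try each ENUM zone in order, and fall back to the PSTN only if none of them knows the number. The resolver is passed in as a function so the sketch runs without a network; the zone list, the fake directory and the names route_call and enum_name are all assumptions of mine.

```python
from typing import Callable, Optional, Sequence, Tuple

# A resolver takes an ENUM domain name and returns a SIP URI, or None if
# the directory has no entry. A real one would do a DNS NAPTR query.
Resolver = Callable[[str], Optional[str]]


def enum_name(number: str, zone: str) -> str:
    """Reverse the digits of an E.164 number and append the ENUM zone."""
    digits = [ch for ch in number if ch in "0123456789"]
    return ".".join(reversed(digits)) + "." + zone


def route_call(number: str, zones: Sequence[str],
               resolve: Resolver) -> Tuple[str, str]:
    """Return ("sip", uri) from the first zone that lists the number,
    otherwise ("pstn", number)."""
    for zone in zones:
        uri = resolve(enum_name(number, zone))
        if uri:
            return ("sip", uri)
    return ("pstn", number)


# Hypothetical directory contents standing in for real DNS answers.
fake_directory = {
    "0.0.1.0.5.5.5.5.1.4.1.e164.org": "sip:alice@example.com",
}
zones = ["e164.arpa", "e164.org"]

print(route_call("+14155550100", zones, fake_directory.get))
print(route_call("+14155550199", zones, fake_directory.get))
```

The first call is found in the second zone and can stay on IP; the second is not listed anywhere, so it goes out over the PSTN as usual.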

HD Voice – state of deployment

At the HD Voice Summit in Las Vegas last week, Alan Percy of AudioCodes gave a presentation on the state of deployment of HD Voice, citing three levels of deployment: announced interest, trials and service deployment.

Percy’s take was that in the “Crossing the Chasm” technology adoption lifecycle, HD Voice is right at the chasm.

Here is his list, augmented with input from Jan Linden of GIPS, Tom Lemaire of FT/Orange, Doug Mohney of HD Voice News and Dave Erickson of Wyde Voice:

Category          Company                   Stage
PC VoIP           Skype                     >500 m downloads
                  QQ (China)                >500 m downloads
                  Gizmo5 (now Google)
Wireline telco    France Telecom            500K HD users
                  British Telecom           Trials
                  FT/Orange Spain           Deployed 1Q09
                  FT/Orange Poland          Deploys 1Q10
Mobile            Orange (Moldova)          Production
                  Orange (UK)               Deploys 3Q10
                  Orange (Belgium)          Deploys 2010
CLEC VoIP         Alteva                    Production
                  SimpleSignal              Production
                  Ooma                      25K HD users
                  8×8                       >70K HD users
                  OnSIP                     Production
                  Phone.com                 Trials
US MSOs           CableVision/Lightpath     Limited Trials
Conferencing      ZipDX                     Production
                  ClearOne                  Production
                  Citrix                    Production
                  FreeConferenceCall.com    Production
                  Global Crossing           Limited Trials

The main codecs in each of these deployments are: Skype: SILK; QQ, Citrix and FreeConferenceCall.com: iSAC; mobile: AMR-WB; all others: G.722.

Alan pointed out the conspicuous lack of involvement of the cable companies (MSOs), even though CableLabs has done a good job of creating HD specifications for them.

HD Communications Project

As part of the preparation for the fall HD Communications Summit, Jeff Pulver has put up a video clip promoting HD Voice for phone calls. It goes over the familiar arguments:

  • Sound quality on phone calls hasn’t improved since 1937. Since most calls are now made on cell phones, it has actually deteriorated considerably.
  • The move to VoIP has made it technically feasible to make phone calls with CD quality sound or better, yet instead VoIP calls are usually engineered to sound worse than circuit-switched calls (except in the case of Skype).
  • Improved sound quality on phone calls yields undisputed productivity benefits, particularly when the calls involve multiple accents.
  • Voice has become a commodity service, with minimal margins for service providers, yet HD Voice offers an opportunity for differentiation and potentially improved margins.

The HD Communications Summit is part of the HD Connect Project. The HD Connect Project aims to provide a coordination point for the various companies that have an interest in propagating HD Voice. These companies include equipment and component manufacturers, software developers and service providers.

Among the initiatives of the HD Connect Project is a logo program, like the Wi-Fi Alliance logo program. The logo requirements are currently technically lax, providing an indicator of good intentions rather than certain interoperability. Here’s a draft of the new logo:

[Image: HD Connect draft logo]

Another ingredient of the HD Connect project is the HDConnectNow.org website, billed as “the news and information place for The HD Connect Project.”

It is great that Jeff is stepping up to push HD Voice like this. With the major exception of Skype, almost no phone calls are made with wideband codecs (HD Voice). Over the past few years the foundation has been laid for this to change. Several good wideband codecs are now available royalty free, and all the major business phone manufacturers sell mostly (or solely) wideband-capable phones. Residential phones aren’t there yet, but this will change rapidly: the latest DECT standards are wideband, Gigasets are already wideband-capable, and Uniden is enthusiastic about wideband, too. As the installed base of wideband-capable phones grows, wideband calling can begin to happen.

Since most dialing is still done with old-style (E.164) phone numbers, wideband calls will become common within companies before there is much uptake between companies. That will come as VoIP trunking displaces circuit-switched trunking, and as ENUM databases are deployed and used.

GIPS webinar on Wideband Voice

GIPS yesterday sponsored a webinar on HD voice in the teleconferencing and videoconferencing industries. It was introduced by Elliot Gold of Telespan Publishing. Elliot made the point that HD video will be running at 90% of video equipment sales by the end of this year, and that HD audio is at least equally important to the user experience and should be the next technology transformation in the equipment and services markets.

Dovid Coplon of GIPS gave a more technical presentation, starting with a very accessible explanation of the benefits of HD audio.

Dovid made some interesting observations. He named some users of GIPS wideband technology including Google Talk and Yahoo Messenger. He cited a May 2009 poll that showed 10% of respondents used HD audio all the time or whenever they could. To me this is a surprisingly high number, possibly reflecting a biased poll sample, or perhaps a large number of respondents that didn’t understand the question.

Dovid compared the benefits of the GIPS wideband codecs, iSAC and iPCM-WB. iSAC is a lower-bit-rate, higher-complexity codec than iPCM-WB. Dovid pointed out that with IP packet overhead being so high, decreasing the bit-rate of a codec is not as useful as one might think. The implication was that the lower complexity of iPCM-WB outweighed its bandwidth disadvantage in mobile applications. He also included a codec quality comparison chart based on a MUSHRA scale. The chart predictably showed iSAC way better than any of the others, and anomalously showed Speex at 24 kbps as inferior to Speex at 16 kbps.
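
The packet overhead point is easy to check with a little arithmetic. The sketch below is mine, not from Dovid's slides. It assumes plain IPv4 + UDP + RTP headers (20 + 8 + 12 = 40 bytes, with no options, header compression or link-layer framing) and a 20 ms packet interval, which is a common choice but an assumption here. The 32 and 80 kbps figures are illustrative round numbers for a lower-rate and a higher-rate wideband codec.

```python
# IPv4 (20) + UDP (8) + RTP (12) header bytes; assumes no IP options,
# no RTP extensions and no header compression.
IP_UDP_RTP_HEADER_BYTES = 20 + 8 + 12


def wire_kbps(codec_kbps: float, packet_ms: float,
              header_bytes: int = IP_UDP_RTP_HEADER_BYTES) -> float:
    """Bit rate on the wire for a given codec rate and packet interval."""
    payload_bits = codec_kbps * packet_ms      # kbit/s * ms = bits per packet
    packets_per_second = 1000 / packet_ms
    return (payload_bits + header_bytes * 8) * packets_per_second / 1000


low = wire_kbps(32, 20)    # lower-rate codec, 20 ms packets (assumed)
high = wire_kbps(80, 20)   # higher-rate codec, 20 ms packets (assumed)
print(low, high, high / low)
```

With these assumptions the headers add a fixed 16 kbps to each stream, so a codec that is 2.5 times the bit rate costs only twice the bandwidth on the wire, and the gap narrows further as packets get shorter.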

Dovid also echoed all the codec engineers I have talked with in emphasizing that the codec is a small piece of the audio quality puzzle, and that network and physical acoustic impairments can have a greater effect on user experience.

You can download the slides here.

Interview with Jan Lindén of GIPS

In my ongoing series of wideband codec interviews I have discussed SILK with Jonathan Christensen of Skype and Speex with its primary author, Jean-Marc Valin. I have also written about Siren from Polycom. So it is high time to look at one of the best known and most widely deployed wideband codecs, iSAC from GIPS. I spoke with Jan Lindén, V.P. of Engineering at GIPS.

The entire interview is transcribed below. The highlights I got from it are that the biggest deployment of wideband voice in the world is from QQ in China, that the next revision of iSAC will be super-wideband, that 3G is still inadequate for videocalls and that GIPS indemnifies customers of iSAC against IPR challenges.

MS: GIPS has been in the wideband phone call business longer than anybody else. What do you think about the market?
JL: I think the market is definitely going towards wideband. Part of it is all the soft phones that people have used, that’s one step. The fastest way to make it really move is to get the cell phones to support HD Voice. Then people will realize that you can’t have anything else that is worse than what you have on your cell phone. And in the conference and video conference space, whoever has tried an HD Conference as opposed to a regular one immediately recognizes the advantage. The question is, how do you make people experience it? You can’t just wait for demand. You have to offer solutions so customers see the benefit.

MS: Do you think 2009 will be a watershed year for HD Voice?
JL: For sure the industry has woken up, and seen that this is an interesting area. The question is how much it will be in demand by customers. We know that if you try it you definitely want it, but how do you make the customers see that? We are seeing all the enterprise IP phones going to wideband, and we have started to see that move to residential solutions as well. Of course not for the ATA, but for anything like video phones and IP phones, there is much more interest in wideband. Especially for video, because people expect a higher quality in general. There is a lot of interest in video. People are building all kinds of solutions.

MS: What is driving the video solutions?
JL: In the softphone space obviously the biggest use is for personal use where you call your family – I live in San Francisco and my parents and all my siblings live in Sweden, and we talk all the time over video so the kids can see each other and the grandparents can see the grandkids.

MS: Which video solution do you use for that?
JL: Right now I use a solution from one of our customers, for which we supply the audio and video subsystems. I am using a pre-release of that; when it becomes available it’s going to be pretty good.

MS: So who are your main customers?
JL: The biggest names are IBM, Google, Yahoo, WebEx, Nortel, AOL, Citrix, Avaya and Samsung. For example we supply the audio and video subsystem for IBM Sametime. Maybe our biggest customer in terms of deployment is QQ in China. They have hundreds of millions of users. It’s similar to Yahoo or Google. They are not very well known outside China, but they are much bigger than Skype for example, in terms of online users at any given moment. They use iSAC.

MS: So all these customers run on PCs, right?
JL: We also have people who use our stuff in embedded solutions. IP phones, a few mobile devices – Samsung has some of our technology on their cell phones as an application. There is a video phone called the Ojo phone. There are a few ATA devices in Asia, a Wi-Fi phone from NEC. We will have some announcements later.

MS: How does cell phone video work?
JL: Most of them it’s not really videophone, more regular streaming. It depends on the service provider’s solution, which can be expensive. To get effective video phone performance you need Wi-Fi – 3G is still inadequate for good video quality. If the picture is small it can be decent, but you get delay, and the inconsistency of the data network means that a Wi-Fi solution is much more stable and gives better quality.

MS: Does GIPS sell complete conferencing systems?
JL: No, just the audio subsystem – for example Citrix Online’s GoToMeeting uses our audio subsystem to provide HD Voice on their conference bridge.

MS: What is the difference between iSAC and iPCM?
JL: The biggest difference is that iPCM wideband has a significantly higher bitrate, better quality, more robust against packet loss. The biggest reason people don’t use it is that its bitrate is about 80 kbps, while iSAC is variable between 10 and 32 kbps, so it has a much lower bit rate. They both have a 16 kHz sampling rate.

MS: Do you see the necessity for a super-wideband codec?
JL: We think that’s something we should support. We haven’t done it previously because the benefit from narrowband to wideband is a much bigger step than from wideband to super wideband. We are supporting super wideband in our next release of iSAC.

MS: What about the transcoding issue?
JL: Pragmatically, you will have to have transcoding in some scenarios. You will not find a way to get everybody to agree on one codec or even two or three, but you will probably get two or three codecs that cover most of what’s used.

MS: What about the idea of a mandate to support at least 3 different WB codecs – would that give a good chance of having one in common?
JL: It’s a good idea, but the question is, will you get everybody to buy into it? It’s the most crucial point. Of course those codecs will have to be good codecs that are not expensive, and preferably not too associated with one player that will create political issues with other players in the market.

MS: You mentioned “not too expensive.” iLBC is royalty free, but narrowband. Does GIPS offer a royalty free wideband codec?
JL: No, our iSAC codec is not free per se, but if you look at other pricing available today in the market we are effectively only charging for the indemnification. We are not charging even close to as much as typical codecs like AMR-WB.

MS: So that’s huge that you offer indemnification.
JL: Yes, and no free codecs do that, obviously. If you want indemnification you have to pay something.

MS: Who else indemnifies?
JL: Typically not the codec vendor, but if you go and buy a chip from someone that has a codec on it from somebody like TI you will typically get indemnification, but from the chip vendor, not the IPR vendor.

MS: So GIPS is unique among codec vendors in offering indemnification?
JL: Yes, but we don’t see ourselves as a codec vendor. We offer broader solutions that have codecs as just one element, engines that have all the signal processing and code you need to handle an implementation of voice and video on a platform. That’s where the value is. And the codecs we indemnify as they are a part of that solution. You can also buy our codecs separately, and then we also indemnify. Since we sell a product rather than just supplying IP, people expect indemnification.

VoIP Peering

I have been calling myself a lot recently, because I am chairing a panel on network interconnection at Jeff Pulver’s HD Communications show this week, and I wanted to get some real-world experience. The news is surprisingly good.

I subscribed to several VoIP service providers, and Polycom was kind enough to send me one of their new VVX 1500 video phones. So with the two Polycom phones on my desk (the other, an IP 650, is the subject of my HD Voice Cookbook) I was able to make HD Voice calls to myself, between different VoIP service providers.

All the calls I made were dialed with SIP URIs rather than phone numbers. Dialing with a SIP URI forces the call to stay off the PSTN. This means that the two phones are theoretically able to negotiate their preferred codec directly with each other. For these particular phones the preferred codec is G.722, a wideband codec. The word “theoretically” is needed because calls between service providers traverse multiple devices that can impose restrictions on SIP traffic – devices like SIP Proxies and Session Border Controllers. I presumed that HD compatibility would be the exception rather than the rule, but it turns out I was wrong about that. Basically all the calls went through with the G.722 codec except when the service provider’s system was misconfigured. Even more pleasingly, I was able to complete several video calls between the X-Lite client on my PC and the Polycom VVX 1500 (though the completion was random at about a 50% rate), and when I had a friend from Polycom call me from his VVX 1500 using my SIP address, the call completed in video on the first attempt.
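
Here is a rough sketch of the negotiation described above, and of how a device in the middle can spoil it. It follows the general idea of SDP offer/answer, where the caller lists codecs in preference order and the callee picks one it also supports, but it is a simplification of mine and not how any particular phone or Session Border Controller is implemented. The codec lists and the middlebox_allows parameter are illustrative assumptions.

```python
from typing import Iterable, Optional, Sequence


def negotiate(offer: Sequence[str], answerer_supports: Iterable[str],
              middlebox_allows: Optional[Iterable[str]] = None) -> Optional[str]:
    """Pick the first codec in the caller's preference order that the
    callee supports, after any middlebox has filtered the offer."""
    supported = set(answerer_supports)
    if middlebox_allows is not None:
        allowed = set(middlebox_allows)
        offer = [codec for codec in offer if codec in allowed]
    for codec in offer:
        if codec in supported:
            return codec
    return None  # no codec in common, so the call cannot be set up


caller_prefers = ["G722", "PCMU", "PCMA"]   # wideband first
callee_supports = ["G722", "PCMU", "PCMA"]

# Direct path: both phones prefer G.722, so the call is wideband.
print(negotiate(caller_prefers, callee_supports))

# A proxy or SBC that only passes narrowband codecs forces G.711.
print(negotiate(caller_prefers, callee_supports,
                middlebox_allows=["PCMU", "PCMA"]))
```

This is why the word “theoretically” matters: the phones at either end can both prefer G.722 and still end up in narrowband if something on the path strips it from the offer.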

Effectively 100% of VoIP calls made from phones are dialed using E.164 (PSTN) phone numbers, and consequently wideband codecs are almost never used (Skype is the huge exception, but Skype calls are normally made from a PC, not a phone). The benefit of E.164 addressing is that you can call anybody with a phone. What I learned from my experiment is that with SIP addressing you can call anybody with Internet connectivity, and have a much better audio experience.

This is somewhat surprising. Many engineers consider the Internet to be too unreliable to carry business-critical phone calls, and VoIP service providers like to interconnect directly with each other using peering arrangements like the Voice Peering Fabric and Xconnect.net. There is an exhaustive series of articles about VoIP Peering at VoIP Planet.

Interview with Jean-Marc Valin of Speex

I have written before about the appeal of wideband codecs, and the damping effect that royalty issues have on them. Speex is an interesting alternative wideband codec: open source and royalty free. Having discussed the new Skype codec with Jonathan Christensen earlier this year I thought it would be interesting to hear from the creator of Speex, Jean-Marc Valin.
MS: What got you into codecs?
JMV: I did a master’s in Sherbrooke lab – the same lab that did G.729 and all that. I did speech enhancement rather than codecs, but learned a bit about codecs there and after I did my master’s I thought it would be nice to have a codec that everybody could use, especially on Linux. All those proprietary codecs were completely unavailable on Linux because of patent issues, so I thought it would be something nice to do. I met a guy named David Rowe who thought the same thing and knew more about codecs than I did, so we started Speex together. In the end he didn’t have much time to write code but I did and he had great advice and feedback.
MS: How much of Speex did you write, and how much was contributed by others?
JMV: I wrote about 90%, but most of the contributions were not in code but in terms of bug reports, feedback, suggestions, or in the early beginning David Rowe didn’t write much code but he gave me really good advice. So a lot was contributed, but not a lot of the contributions were code. The port to Windows was contributed.
MS: Were there any radical innovations in algorithms that a contributor came up with?
JMV: No, I don’t think there’s an issue of that. And even what I wrote it was mostly just a matter of putting together building blocks that were generally known, and just putting together so that a decent codec resulted. There’s nothing in Speex that somebody would look at and say “Wow, this is completely unheard of.” There are a few features that aren’t in other codecs, but they’re not fundamental breakthroughs or anything like that.
MS: How is Speex IPR-free? Do you just study the patents and figure out work-arounds or do you just assume that if you write code from scratch it’s not infringing, or do you look at patents for speech technologies that have already expired…
JMV: It’s actually a mixture of all that. Basically the first thing with Speex is that I wasn’t trying to innovate, especially in the technological sense. A lot of Speex is built on really old technology that either wasn’t patented or if it was the patents had expired. A lot of 80’s technologies.
CELP is 80’s technology. CELP was not patented. There are developments of it like ACELP which was patented – actually by my former university, so although it’s actually a pretty nice technique I couldn’t use it so I just avoided it and used something else, which turned out to be not that much worse, and in the end it didn’t really matter – it was just a bit of an inconvenience.
MS: Are the users like Adobe calling you to verify that Speex is IPR free?
JMV: I had a few short contacts with them. I didn’t speak with any lawyers, so I assume somebody had a look at it and decided that it was safe enough to use. It’s a fundamental problem with patents, in any case, regardless of whether you’re open source royalty free or proprietary, patented or anything like that. Anyone can claim they have a patent on whatever you do. At some point it’s a calculated risk, and Speex is not more risky than any other technology. Even if you license the patents you never know who else might claim they have a patent on the thing.
MS: Has anybody tried that with Speex?
JMV: No.
MS: How long has Speex been in use?
JMV: I started Speex in Feb. 02 and v1.1 was released in March 03 at which point the bit stream was frozen. All codecs have to freeze the bit stream at some point. All the G.72x ITU codecs have a development phase, then they agree on the codec and they say “this is the bit stream and it’s frozen in stone,” because you don’t want people changing the definition of the codec because nobody would be able to talk to each other.
MS: But you can change the implementations of the algorithms that generate the bit stream?
JMV: Most of the ITU codecs have a so-called “bit-exact” definition, which means a certain bit pattern as audio input has to produce exactly this pattern as the compressed version. This leaves a lot less room for optimization.
MS: Does Speex have a bit exact definition?
JMV: No. The decoder is defined, so the bit stream itself is defined, but there is no bit-exact definition, and there can’t really be because there is a floating point version and you can’t do bit exact with floating point anyway.
In that sense it’s more similar to the MPEG codecs that are also not bit-exact.
After the bit stream was frozen I spent quite a lot of time doing a fixed point port of Speex so it could run on ARM and other processors that don’t have floating point support. I also spent some time doing quality optimizations that didn’t involve changing the bit stream. There are still a lot of things you can do in terms of improving the encoder to produce a better combination of bits.
MS: So the decoder doesn’t change, but the encoder can be improved and that will give you a better end result?
JMV: Exactly. That’s what happened for example with MP3, where the first encoders were really, really bad. And over time they improved, and the current encoders are much better than the older ones.
MS: Have you optimized Speex for particular instruction sets?
JMV: There are a few optimizations that have been done in assembly code for ARM. Mostly for the ARM4 architecture there’s a tiny bit of work that I did several years ago to use the DSP instructions where available.
MS: How much attention have you paid to tweaking Speex for a particular platform, like for example a particular notebook computer?
JMV: Oh, no, no. First, all of that is completely independent of the actual codec. In the case of Speex I have in the same package the Speex codec and a lot of helper functions, echo cancellation, noise suppression and things like that. Those are completely independent of the codec. You could apply them to another codec or you could use Speex with another echo canceller. It’s completely interchangeable, and there are not really any laptop specific things. The only distinction between echo cancellers is between acoustic echo cancellers and line echo cancellers, which are usually completely different. The acoustic echo canceller will be used mostly in VoIP when you have speakers and microphones instead of headsets. What really changes in terms of acoustic echo is not really from one laptop to another but from one room to another because you are canceling the echo from the whole room acoustics.
MS: Isn’t there a direct coupling between the speaker and the microphone?
JMV: Overall what you need to cancel is not just the path from the mic to the speakers. Even with the same laptop the model will change depending on all kinds of conditions. There’s the direct path which you need to cancel, but there’s also all the paths that go through all the walls in your room. Line echo cancellers only have a few tens of milliseconds, whereas acoustic echo cancellers need to cancel over more than 100 milliseconds and cancel all kinds of reflections and things like that.
Even if you are in front of your laptop and you just move slightly the path changes and you have to adjust for that.
MS: So who did the echo cancellers in the Speex package – was that you?
JMV: Yes.
MS: G.711 has an annex that includes PLC (Packet Loss Concealment), and others say PLC is a part of their codec.
JMV: The PLC is tied to the codec in the case of Speex and pretty much all relatively advanced codecs. G.711 is pretty old and all packets are completely independent, so you can do pretty much anything you like for concealment. For Speex or any other CELP codec you need to tie the PLC to the codec.
MS: As far as wideband is concerned, how wideband is Speex? What are the available sample rates?
JMV: Wideband was part of the original idea of Speex. I didn’t even think about writing a narrowband version of it. And in the end some people convinced me that narrowband was still useful so I did it. But it was always meant to be wideband. The way it turned out to be done was in an embedded way, which means that if you take a wideband Speex stream it is made up of narrowband stream and extra information for the higher frequencies. That makes it pretty easy to interoperate with narrowband systems. For instance if you have a wideband stream and you want to convert it to the PSTN you just remove the bits that correspond to the higher frequencies and you have something narrowband. This is for 16 kHz. For higher frequencies, Speex will support also a 32 kHz mode – I wouldn’t say that it’s that great, and that’s one of the reasons I wrote another codec which is called CELT (pronounced selt).
MS: What about the minimum packet size you have for Speex?
JMV: Packetization for Speex is in multiples of 20 ms. The total delay is slightly more than that – around 10 ms, so the total delay introduced by the codec is around 30 ms, which is similar to the other CELP based codecs.
MS: Is Speex a variable bit rate codec?
JMV: It has both modes. In most VoIP applications people want to use constant bit rate because they know what their link can operate at. In some cases you can use VBR, that’s an option that Speex supports.
VBR will reduce the average bandwidth so if you have hundreds of conversations going through the same link, then at the same quality VBR will take of the order of 20% less bandwidth, or something like that. I don’t remember the exact figures.
A conversation can go above the average bit rate just as easily as it can go below.
MS: Can you put a ceiling on it, suppose you specify a variable bit rate not to exceed 40 kbps?
JMV: Yes, that’s a supported option. It would sound slightly worse than a constant bit rate of 40 kbps. There’s always the trade-off of bit rate and quality. I believe some people already do it in the Asterisk PBX, but I could be wrong on that one.
MS: How does Speex compare to other codecs on MIPS requirement?
JMV: I haven’t done precise comparisons, but I can say that in terms of computational complexity Speex narrowband is comparable to G.729 (not G.729A, which is less complex) and AMR-NB and Speex wideband is comparable to AMR-WB. The actual performance on a particular architecture may vary depending on how much optimization has been done. In most applications I’ve seen, the complexity of Speex is not a problem.
MS: So what about AMR-WB? Seems like it’s laden with IP encumbrances? What are the innovations in that that make it really good, and do you think it’s better than Speex or does Speex have alternative means of getting the same performance?
JMV: I never did a complete listening test comparing Speex to AMR-WB. To me Speex sounds more natural, but I’m the author, so possibly someone would disagree with me on that. In any case there wouldn’t be a huge difference of one being much better than the other. The techniques are pretty different, AMR-WB uses ACELP. Both are fundamentally CELP but they do it very differently.
MS: The textbooks say CELP models the human vocal tract. What does that mean?
JMV: It’s not really modeling; it’s making many assumptions that are sort of true if the signal is actually voice. Basically the LP part of CELP is Linear Prediction, and that is a model that would perfectly fit the vocal tract if we didn’t have a nose. The rest has to do with modeling the pitch, which is very efficient, assuming the signal has a pitch, which is not true of music, for instance. The Code Excited part is mostly about vector quantization, which is an efficient way of coding signals in general.
The whole thing all put together makes it pretty efficient for voice.
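
For readers who want to see what the Linear Prediction part looks like in practice, here is a minimal textbook sketch: it estimates predictor coefficients from a signal's autocorrelation using the Levinson-Durbin recursion, so that each sample is approximated as a weighted sum of the previous few. This is not taken from the Speex source and leaves out everything a real codec adds (windowing, quantization, the pitch predictor, the codebook search). The function names and the test signal are my own.

```python
from typing import List


def autocorrelation(x: List[float], max_lag: int) -> List[float]:
    """r[k] = sum over n of x[n] * x[n - k], for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def lpc(x: List[float], order: int) -> List[float]:
    """Return taps a[1..order] so that x[n] ~ sum(a[k] * x[n - k])."""
    r = autocorrelation(x, order)
    a = [0.0] * (order + 1)      # a[0] is unused padding
    err = r[0]                   # prediction error energy so far
    for i in range(1, order + 1):
        if err <= 0.0:           # silent or perfectly predicted signal
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err            # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:]


# A decaying signal where each sample is 0.9 times the previous one:
# the first tap should come out close to 0.9 and the second close to 0.
signal = [0.9 ** n for n in range(200)]
coeffs = lpc(signal, 2)
print(coeffs)
```

The handful of coefficients this produces describes the spectral envelope of a short frame, which is the part of the signal the vocal tract shapes.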
MS: What is the biggest design win that you know of for Speex?
JMV: There are a couple of high profile companies that use Speex. Recently the one that people talked about was Flash version 10. Google Talk is using it as well.
MS: Do you track at all how many people are using it in terms of which applications are using it?
JMV: In some cases I hear about this company using Speex, or that company tells me they are using it or they ask me a few questions so they can use it. So I have a vague idea of a few companies using it, but I don’t really track them or even have a way to track them because a part of the idea of being open source is that anyone can use it with very few restrictions, and with no restrictions from me on having to get a license or anything like that.
MS: How many endpoints are running Speex now?
JMV: It’s pretty much impossible to say. There are a large number of video games that use Speex. It’s very popular in that market because it’s free.
MS: Would gamers want to use CELT instead? That’s a very delay-sensitive environment.
JMV: I think it depends on the bandwidth. I was involved in one of the first games that used it, which was Unreal Tournament in 2004, and they were using a very low bit rate, so CELT wouldn’t have worked. Now the bandwidths are larger, so possibly someone will want to use CELT at some point.
MS: What is CELT?
JMV: It’s an acronym for Constrained Energy Lapped Transform. It works about equally well on voice or music. The bit rate is a bit higher than Speex. Speex in wideband mode will usually take about 30 kbps at 16 kHz, whereas with CELT you usually want to use at least 40 kbps. At 40 kbps you have pretty decent voice quality at full audio bandwidth, 44.1 or 48 kHz. The first is the CD sample rate and the second, slightly higher, is another widely used rate. Those sample rates basically give you the entire audio spectrum, and in terms of bit rate you usually need at least 40 kbps for voice. You can go a bit lower but not much. If you use 48 kbps you get decent music quality and at 64 kbps you get pretty good music quality.
MS: Is CELT a replacement for Speex?
JMV: No, there is definitely a place for both of them. There’s actually very little competition between them. Usually people want the lower rate of Speex; for instance, if you want something that works at 20 kbps, you use Speex. CELT is for higher bit rates and lower delays, and it also supports music, so there’s nearly no overlap between the two.
MS: How does CELT compare to MP3?
JMV: I actually did some tests with an older version, and in terms of quality alone it was already better than MP3, which was quite surprising to me. My original goal was not to beat MP3 but to make the delay much lower: you can’t use MP3 for any kind of real-time communication, because the delay will be way more than 100 ms. CELT is designed for delays that are lower than 10 ms.
MS: Wow! So how many milliseconds are in a packet?
JMV: It is configurable. You can use CELT with packets as small as around 2 ms, or you can use packets that are up to 10 ms. The default I recommend is around 5 ms.
MS: So the IP overhead must be astronomical! 2 ms at 64 kbps is 16 bytes per packet!
JMV: In normal operation you wouldn’t use the 2 ms mode, but I wanted to enable real-time music collaboration over the network. So you can have two people on the same continent that play music together in real time over the net. This is something you can’t do with any other codec that exists today.
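To put numbers on the overhead question, here is a minimal sketch of the arithmetic in Python. The header sizes are an assumption (IPv4 plus UDP plus RTP with no header compression and no link-layer framing), and the function names are made up for illustration; the interview does not say which transport is used.

```python
# Assumed per-packet headers: IPv4 (20 bytes) + UDP (8 bytes) + RTP (12 bytes).
HEADER_BYTES = 20 + 8 + 12

def payload_bytes(bitrate_kbps, frame_ms):
    """Codec payload carried in one packet, in bytes."""
    # kbps * ms gives bits (1000 bits/s * 0.001 s), then 8 bits per byte.
    return bitrate_kbps * frame_ms / 8

def wire_bitrate_kbps(bitrate_kbps, frame_ms):
    """Bit rate on the wire once the assumed headers are added to every packet."""
    packets_per_second = 1000 / frame_ms
    return bitrate_kbps + HEADER_BYTES * 8 * packets_per_second / 1000

# The three packet sizes mentioned in the interview, at 64 kbps.
for frame_ms in (2, 5, 10):
    print(frame_ms, payload_bytes(64, frame_ms), wire_bitrate_kbps(64, frame_ms))
```

Under these assumptions a 2 ms packet at 64 kbps carries 16 bytes of audio behind 40 bytes of headers, so the stream occupies 224 kbps on the wire; at 5 ms it is 128 kbps and at 10 ms it is 96 kbps.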
MS: So the Internet delay is going to be at least 70ms.
JMV: It depends. Overall you need to have less than 25 ms one-way delay for the communication to work. That’s the total delay. So if you look at existing codecs, even so-called low-delay AAC already has at least 20 ms of packetization delay. So if you add the network and anything else you can’t make it. Codecs such as AMR-WB will have 25 or 30 ms for packetization, so you have already busted right there. It won’t work for music. So this is one reason why I wrote CELT.
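The budget described here can be laid out as a small Python sketch. The 25 ms limit and the packetization figures are the ones quoted in the interview; treating packetization as the only codec delay is a simplification, and the names are hypothetical.

```python
ONE_WAY_BUDGET_MS = 25  # total one-way delay quoted in the interview

def network_headroom_ms(packetization_ms, other_ms=0):
    """Milliseconds left for the network after packetization and any other fixed delay."""
    return ONE_WAY_BUDGET_MS - packetization_ms - other_ms

# Packetization delays as quoted above (lower bounds where a range was given).
codecs = {
    "CELT, 5 ms packets": 5,
    "low-delay AAC": 20,
    "AMR-WB": 25,
}

for name, packetization_ms in codecs.items():
    print(name, network_headroom_ms(packetization_ms))
```

With 5 ms packets there are 20 ms left for the network and everything else; with 20 ms of packetization only 5 ms remain, and at 25 ms the budget is already spent before a packet leaves the machine.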
MS: Have you played music over the Internet with CELT yet?
JMV: I haven’t tried it yet but some other people have tried it and reported pretty good results.
The other goal, even if you are not trying to play music over the network, has to do with AEC (acoustic echo cancellation), which is still not a completely solved problem, although some software does it relatively well. If you are able to get the delay low enough you almost don’t have to care about echo cancellation at all, because the feedback is so quick that you don’t even notice the echo. It’s just like when you speak on the phone: you hear your own voice in the headset and it doesn’t bother you because it’s instantaneous.
MS: Has anybody done any research on how long the delay can get before it begins to become disorienting?
JMV: There is some research, usually for a particular application, and it will depend on the amount of echo and all of that. But if you manage to get it below around 50 ms for the round trip it usually won’t be a problem, and the lower the delay the less annoying the echo is, even if you don’t cancel it.
MS: What phone are you talking on now?
JMV: My company phone. At home I’m set up with my cable provider.
MS: So you don’t use Speex yourself?
JMV: I used it when I was in Australia and I wanted to talk with my family back here. But I’m not using it in any regular fashion now.
MS: So what software did you use with the webcam?
JMV: At the time I was using Ekiga and OpenWengo. Both are Linux clients, because I don’t run Windows on my machines. OpenWengo is one of the few on Linux that can talk to a Windows machine.
MS: Have you ever used Skype?
JMV: Once or twice, but not regularly.
MS: What kind of cell phone do you have?
JMV: A really basic cell phone. I think I have sent maybe one or two SMS messages in my life, and that’s the most complicated thing I have ever done with that phone, which I use mainly in case of emergencies. I am not a telecommunication power user or anything like that.
MS: I am thinking that cell phones are how wideband codecs are going to take off. I’m not talking about AMR-WB. There are going to be hundreds of millions of smart phones sold, over the next few years that have Wi-Fi in them. And because you will be able to do VoIP from a platform that you can load applications onto, it seems like a Wi-Fi voice client for smart phones is going to be a way that wideband audio can really infiltrate and take off. I’m thinking that that might be a way for you to start using Speex in your daily life.
JMV: Well, I sure hope that Speex will take off a lot more in that area. Originally it wasn’t planned to go that far. Originally the only market I had in mind was the Linux or open source market with PC based soft phones. That’s the only thing I cared about. It was designed mainly for IP networks as opposed to wireless, and I just wanted to see how far it would go. It turned out to be a lot further than I expected.
Porting to Windows was done pretty early in the game. That was a contribution – I have never actually compiled it for Windows. And eventually people started porting it to all sorts of devices I have never heard of like embedded DSPs and lots of different architectures.
MS: I must say, thank you very much. I feel that wideband audio is a great benefit to the telephone world, and will undoubtedly become very common over time, and one of the biggest impediments to wideband audio is the intellectual property issue, so having an open source, IPR-free implementation of a wideband codec that seems to be a good one is just a great thing for the world, and a wonderful thing you have done for the world.
JMV: I think wideband will be pretty important, especially for voice over IP because it’s basically the only way that VoIP can ever say that it’s better than the PSTN. As long as people stay with narrowband the best VoIP can be is “almost as good as PSTN.” And yes, IPR is a pretty serious issue there.

Open up Skype?

Skype is the gorilla of HD Voice. Looking at my Skype client I see that there are at this moment about 16 million people enjoying the wideband audio experience on Skype. The other main type of Voice over IP, SIP, is rarely used for HD Voice conversations, though I wrote an HD Voice Cookbook to help to popularize wideband codecs on SIP. Since Skype has the largest base of wideband codec users, those who are enthusiasts of both HD Voice and SIP are eager for SIP networks to interoperate with Skype, allowing all HD-capable endpoints to talk HD to each other. Skype does already kind of interoperate with SIP, but only through the PSTN, which reduces the wideband media stream to narrowband. Opening up Skype would solve this problem, so it’s obviously a good idea. What is not so clear, however, is what it means to “open up Skype.”

Skype reinvented Voice over IP, and did it better than SIP. SIP was originally intended to be a lightweight way to set up real-time communications sessions. It was the Internet Engineering Task Force’s response to the complexities of the ITU VoIP standard, H.323. But SIP got hijacked by the telephone industry, and recast into the familiar mold of proliferating standards and proprietary implementations. SIP is no longer lightweight; implementation is a challenge and only the basic features are easily interoperable.

Take a look at my HD Voice Cookbook to see what it takes to set up a typical SIP phone, then compare this to installing Skype on your PC. Or compare it to the simplicity of plugging a POTS phone into your wall socket. So we have:

  • Skype, free video calls with HD voice from your PC to anywhere in the world;
  • POTS, narrowband voice-only calls that cost about $30 per month plus per-minute charges for international calls; or
  • SIP, that falls somewhere in between the two but which is way too complex for consumers to set up, and which people only really use for narrowband because everybody else only uses it for narrowband, so there’s no network effect.

Open VoIP standards got a several-year start on Skype, starting with H.323 and going on to SIP; but from its inception Skype blew them out of the water. To be sure it had a strong hype amplifier since P2P file sharing was controversial at that time, and Skype came from the same people as Kazaa, but at that time NetMeeting (an H.323 VoIP program) had an enormous installed base, since it came as part of Windows. The problem Skype solved was ease of use.

Skype doesn’t just give you video and wideband voice. It’s all encrypted and you get all sorts of bonus features like conferencing, presence, chat, desktop sharing, NAT traversal and dial-by-name. And did I mention it’s free?

The open standards VoIP community was beaten fair and square by Skype, and blew a several-year head start in the process.

Let me clarify that. In terms of minutes of voice traffic on network backbones, SIP traffic outweighs Skype, so from that point of view, SIP is not so beaten by Skype. The sense in which Skype has trounced the open standards VoIP community is in providing users with something better and cheaper than the decades-old PSTN experience, which carrier VoIP merely strives to emulate at a marginally lower price.

So it seems to me like sour grapes to clamor for Skype to make technical changes to conform to open standards, especially if those changes would impair some of the benefits that Skype offers users. How would users benefit from opening up Skype? Would the competition lower the cost of a Skype call? It’s hard to see how, when Skype calls are free. Would the service be more accessible, or accessible to more customers? No, because anybody with a browser can download Skype free by typing “Skype” or even “Skipe” into their browser’s search field. Would the open standards community innovate faster than Skype, and provide more and better features? Not based on their respective track records. The open standards community has had plenty of time to out-innovate Skype and has manifestly failed.

Anyway, what are the senses in which Skype is not open? It is certainly interoperable with the PSTN; SkypeIn and SkypeOut are among the cheapest ways to make calls on the PSTN. Actually, this may be the greatest threat to Skype’s innovation. SkypeIn and SkypeOut are the only way that Skype makes money; this is a powerful motivation for Skype to not incent users to abandon them. If this remains the only economic force acting on the company Skype is likely to decay into an old-style regular phone service provider.

After a lot of debate with people who know about these things, there seem to be two main ways in which Skype could be said to be not open:

  1. The protocol is proprietary and not published, so third parties can’t implement endpoints that interoperate with Skype endpoints.
  2. Only Skype can issue Skype addresses, and Skype controls the directories rather than using DNS like SIP.

Let’s look at the issue of the proprietary protocol first. Let’s break it into two parts, first who defines the protocols and second, their secrecy. In the debate between the cathedral and the bazaar, the cathedral has recently been losing out to the bazaar amongst the theorizers. We see the success of Apache, MySQL, Linux and Firefox and it looks as though the cathedral is being routed in the marketplace, too. But on the other hand we have successful companies like Apple, Google, Intel and Skype, whose success demonstrates that a design monopoly can often deliver a more elegant and tight user experience. There is no Linus Torvalds of SIP. Having taken the decision to implement a protocol other than SIP, it seems fine to me that whoever invented the Skype protocol should continue to design it, especially since they have manifestly done a much better job than the designers of SIP – ‘better’ in the sense of being more appealing to users.

What about the secrecy? A while back one of the original designers of SIP, Henning Schulzrinne, with his colleague Salman Baset, reverse engineered the Skype network and published his findings here. There is more technical background on Skype here. According to Baset and Schulzrinne:

Login is perhaps the most critical function to the Skype operation. It is during this process a Skype client authenticates its user name and password with the login server, advertises its presence to other peers and its buddies, determines the type of NAT and firewall it is behind, discovers online Skype nodes with public IP addresses, and checks the availability of latest Skype version.

Opening up the protocol to let other people use it would enable them to implement their own Skype login servers. This would enable a parallel network, but in the absence of a new protocol that enabled the login servers to exchange information, it would not lead to interoperability, in the sense of users on Skype being able to view the presence information of users on the parallel network, or even retrieve their IP address to make a call. So it would have the effect of fragmenting the Skype network, rather than opening it. Alternatively the Skype login servers could implement the SIP protocol to exchange presence information. But then it would start to be a SIP network, not a Skype network. And the market numbers say that users find SIP inferior to Skype. So why do it?

Opening up the protocol to let other people write Skype clients that logged into the Skype login servers would open up the network, but at the risk of introducing interoperability issues due to faulty interpretations of the specification. Network protocols are notoriously prone to this kind of problem. But guaranteed interoperability of the clients is one of the primary benefits of Skype over SIP from the point of view of the user, who would therefore not benefit from this step.

So why not have Skype distribute binaries that expose to third party applications the functionality of the protocols and the ability to log into the Skype login server through a published API? Wait a sec – they already do that.

Another objection to Skype publishing the protocols for third parties to implement is that there would be a danger of the third parties implementing some parts of the protocol but not others. For example not the encryption part, or not the parts that enable clients to be super-nodes or relays. A proliferation of this kind of free-rider would stress the network, making it more prone to failure.

Related to the issue of who implements the login servers is who issues Skype addresses. There is a central authority for issuing phone numbers (the ITU), and a central authority for issuing IP addresses (the IANA). But in both cases, the address space is hierarchical, allowing the central authority to delegate blocks of addresses to third party issuers. The Skype address space is not hierarchical, so it would require some kind of reworking to enable delegation. Alternatively the Skype login servers could accept logins from anybody with a SIP address. But there would be no guarantee that the client logging in was interoperable.

Scanning back through this posting, I see that my arguments could be parodied as “you can’t argue with success,” and “if it ain’t broke don’t fix it.” Arguments of this type are normally weak, so in this case I think my points are actually “there are reasons for Skype’s success,” “fixes could break it,” and “users would be better served if Skype competitors concentrated on seducing them with a superior offering,” the last of which, after all, is how Skype has won its users away from the traditional telecom industry. Some people are trying this approach, notably Gizmo5, which I plan to write about later.

HD Voice Cookbook

One of the themes of this blog is how phone conversations can sound much better in VoIP because of wideband codecs. If you have a corporate IT department and a new PBX from a company like Cisco, Avaya, Nortel, Siemens or Alcatel-Lucent, the phones can normally be configured to use the (wideband) G.722 codec on internal calls. And if you use Skype on your PC, it normally runs with a wideband codec, unless you make a SkypeOut call to a regular phone number.

But what if you are working out of a home office, and you just want your desk phone to sound good, and to use a wideband codec when calling other phones with wideband capabilities? Unfortunately it’s still a project that can require some technical skills and a lot of time. To make it easier for you, here’s a cookbook explaining step by step how I did it for a particular implementation (Polycom IP650 phone using an account at OnSIP).

Skype for iPhone

Well, that last post on the likely deficiencies of VoIP on iPhones may turn out to have been overly pessimistic. It looks as though Hell is beginning to freeze over. Skype is now running on iPhones over the Wi-Fi connection, and for a new release it’s running relatively well. AT&T deserves props for letting it happen – unlike T-Mobile, which isn’t letting it happen and therefore deserves whatever the opposite of props is.

Six hours after it was released, Skype became the highest-volume download on Apple’s App Store. In keeping with Skype’s reputation for ease of use, it downloads and installs with no problems, though as one expects with first revisions it has some bugs.

My brief experience with it has included several crashes – twice when I hung up a call and once when a calendar alarm went off in the middle of a call. Another interesting quirk is that when I called a friend on a PC Skype client from my iPhone, I heard him answer twice, about 3 seconds apart. Presumably a revision will be out soon to fix these problems.

Other quirky behaviour is a by-product of the iPhone architecture rather than bugs, and will have to be fixed with changes to the way the iPhone works. The biggest issue of this kind is that it is relatively hard to receive calls, since the Skype application has to be running in the foreground to receive a call. This is because the iPhone architecture preserves battery life by not allowing programs to run in the background.

Similar system design characteristics mean that when a cellular call comes in a Skype call in progress is instantly bumped off rather than offering the usual call waiting options. I couldn’t get my Bluetooth headset to work with Skype, so either it can’t be done, or the method to do it doesn’t reach Skype’s exemplary ease of use standards.

Now for the good news. It’s free. It’s free to call from anywhere in the world to anywhere in the world. And the sound quality is very good for a cell phone, even though the codec is only G.729. I expect future revisions to add SILK wideband audio support to deliver sound quality better than anything ever heard on a cell phone before. The chat works beautifully, and it is synchronized with the chat window on your PC, so everything typed by either party appears on both your iPhone and PC screen, with less than a second of lag.

After a half-hour Skype to Skype conversation on the iPhone I looked at my AT&T bill. No voice minutes and no data minutes had been charged, so there appear to be no gotchas in that department. A friend used an iPod Touch to make Skype Wi-Fi calls from an airport hot-spot in Germany – he reports the call quality was fine.

The New York Times review is here