The problem#

In my last post, I said serverless WebRTC was too cumbersome, but I wanted to see how streamlined I could make the process. While researching, I found a variant of serverless-webrtc called serverless-webrtc-qrcode, as well as a similar demo, webrtc-qr, both of which use QR codes as an easy way to transmit the offer strings. But they require both sides to have a camera to scan QR codes, while my use case is a WebRTC connection between my desktop, which has no camera, and my smartphone.

The solution#

minimal-webrtc now has a checkbox to enable serverless mode. In that mode, the QR code shown by the host is a much longer URL that includes the initial WebRTC offer. Opening that URL on another device (or in another browser window) will show another QR code along with a "Copy" button. With the first device, either press the "Scan QR code" button and point it at the QR code or use some other mechanism to copy the text and paste it into the text area labeled "Paste offer here:".

To run it locally, download the source code and run a web server to serve the wwwroot/ directory. If both devices can run a web server, then you can just access it via localhost on each, but, as before, because WebRTC requires HTTPS, to run it on your local network, you may need to set up a self-signed certificate.

The details#

There are a few separate problems to solve:

Reducing the volume of WebRTC setup traffic to just a single message each from the client and the host.
Reducing the size of the WebRTC setup traffic to fit in a QR code.
Transferring strings when QR codes are not an option (e.g. to a device without a camera).

Fewer WebRTC messages#

ICE candidates in offer#

Looking at the logs, it's clear that the vast majority of the messages sent over the WebSocket channel are ICE candidates. Luckily, with a little work, we can avoid sending them as separate messages.

The default way to handle ICE candidates is to send the offer first and then send candidates one at a time until there are no more available or a connection is established. This is intended to get the connection going as fast as possible, especially in situations where the ICE candidates are coming from querying a server. But since minimal-webrtc only uses locally generated candidates, there's nothing to wait for. And, even if there were, waiting allows us to avoid sending multiple messages.

By sending offers only when iceGatheringState is "complete", we ensure that all of the ICE information is communicated in a single message. Our original code did something like the following to send an offer as soon as a negotiationneeded event is generated and send each ICE candidate as it is generated:

pc.onnegotiationneeded = async function () {
  await pc.setLocalDescription(await pc.createOffer());
  send({description: pc.localDescription});
}
pc.onicecandidate = ({candidate}) => send({candidate});

Instead, we can wait for the iceGatheringState to change and the ICE candidates will already be included in the offer, so there's no need to send them manually:

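// Send the local description only once ICE gathering has finished,
// so the description already contains every candidate.
function sendOffer() {
  if (pc.iceGatheringState === "complete") {
    send({description: pc.localDescription});
  }
}
pc.onicegatheringstatechange = () => sendOffer();
pc.onnegotiationneeded = async function () {
  await pc.setLocalDescription(await pc.createOffer());
  // Usually a no-op here; covers the case where gathering is already complete.
  sendOffer();
};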

Data channel first#

In order to simplify the logic around when an offer should be generated, I made the initial WebRTC connection not depend on the audio/video settings and always create just an RTCDataChannel first. Once that connection is established, it can be used instead of the WebSocket which goes through the server to transmit the settings as well as the WebRTC offers to set up the audio/video connections. This is done by having the sendJson function check if the data channel is open and use it instead of the WebSocket if so.
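As a rough illustration of that fallback, here is a minimal sketch of what such a sendJson function could look like. The factory name createSendJson and the getDataChannel callback are hypothetical and not taken from minimal-webrtc; the only real API assumed is that RTCDataChannel exposes readyState and send(), and WebSocket exposes send().

```javascript
// Hypothetical sketch, not the actual minimal-webrtc code.
// webSocket: anything with send(text); getDataChannel: returns the
// current RTCDataChannel (or null if none has been created yet).
function createSendJson(webSocket, getDataChannel) {
  return function sendJson(message) {
    const text = JSON.stringify(message);
    const channel = getDataChannel();
    if (channel && channel.readyState === "open") {
      // Peer-to-peer path: no server involved once the channel is open.
      channel.send(text);
      return "datachannel";
    }
    // Fallback path: relay through the signaling server.
    webSocket.send(text);
    return "websocket";
  };
}
```

Returning which transport was used is only there to make the sketch easy to check; the real function presumably has no need for it.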

The initial offer is now always generated by the host (the first device to open the page that has the settings form to decide what audio/video channels are sent). In the normal (non-serverless) mode, the client just sends a message to let the host know it is ready to receive the initial offer. In serverless mode, the initial offer is contained in the URL which is used to start the client, and the only response needed to initiate the WebRTC data channel connection is the offer from the client which is passed back to the host via a QR code or copy and paste.
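To make the "offer in the URL" idea concrete, here is a small sketch of one way to carry a (compressed) offer in a link and read it back on the client. It assumes the bytes go into the URL fragment as URL-safe Base64; the function names and the example.com address are made up for illustration, and minimal-webrtc may encode its URL differently.

```javascript
// Hypothetical sketch: put offer bytes into a URL fragment and read them back.
// Assumes btoa/atob and URL are available (browsers, Node 16+).
function bytesToBase64Url(bytes) {
  const base64 = btoa(String.fromCharCode(...bytes));
  // Swap the two characters that are awkward in URLs and drop padding.
  return base64.replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
}

function base64UrlToBytes(text) {
  let base64 = text.replace(/-/g, "+").replace(/_/g, "/");
  base64 += "=".repeat((4 - (base64.length % 4)) % 4); // restore padding
  return Uint8Array.from(atob(base64), c => c.charCodeAt(0));
}

function buildClientUrl(pageUrl, offerBytes) {
  const url = new URL(pageUrl);
  url.hash = bytesToBase64Url(offerBytes);
  return url.toString();
}

function readOfferFromUrl(href) {
  const fragment = new URL(href).hash.slice(1); // drop the leading "#"
  return fragment ? base64UrlToBytes(fragment) : null;
}

// Example with placeholder bytes standing in for a compressed offer.
const exampleUrl = buildClientUrl(
  "https://example.com/camera/",
  Uint8Array.from([0, 255, 16, 62, 63, 250])
);
```

The same text encoding also works for the copy-and-paste path, since the string contains only letters, digits, "-" and "_".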

Offer too big#

Between those two fixes, the communication to set up a WebRTC connection is down to a single message in each direction. But there's a catch: those messages are in the range of 1000-1500 characters, too big to fit in a QR code. Well, technically, that's within the allowed limits of what fits in a QR code, but getting a QR reader to actually read such a QR code does not seem to work in practice.

serverless-webrtc-qrcode and webrtc-qr both solve this problem with essentially the same solution: multiple QR codes. Since they're implementing their own readers, they both have variants of the same trick of displaying a rotation of multiple QR codes that have some metadata about how many there are and having the reader keep reading until it has seen all of them. (The QR code standard actually specifies a variant of this called Structured Append for splitting data into multiple QR codes.)
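For illustration, the general multi-code trick can be sketched as follows. The "index/total:" framing here is invented for this example; it is not the format used by either project, nor the Structured Append encoding from the QR standard.

```javascript
// Hypothetical sketch of splitting a long string across several QR codes.
// Each chunk is prefixed with "index/total:" so a reader knows when it is done.
function splitIntoChunks(text, chunkSize) {
  const total = Math.ceil(text.length / chunkSize);
  const chunks = [];
  for (let i = 0; i < total; i++) {
    const body = text.slice(i * chunkSize, (i + 1) * chunkSize);
    chunks.push(`${i + 1}/${total}:${body}`);
  }
  return chunks;
}

// Collects scanned chunks in any order, ignoring repeats, and returns the
// reassembled text once every chunk has been seen (null until then).
class ChunkCollector {
  constructor() {
    this.parts = new Map();
    this.total = null;
  }
  add(chunk) {
    const match = /^(\d+)\/(\d+):([\s\S]*)$/.exec(chunk);
    if (!match) return null; // not one of our codes
    this.total = Number(match[2]);
    this.parts.set(Number(match[1]), match[3]);
    if (this.parts.size < this.total) return null;
    let text = "";
    for (let i = 1; i <= this.total; i++) text += this.parts.get(i);
    return text;
  }
}
```

A display would cycle through the chunks as QR codes while the reader feeds every successful scan to add() until it returns the full string.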

serverless-webrtc-qrcode limits its QR codes to 400 characters while webrtc-qr limits its QR codes to 40 characters. That suggests that if we can compress the offer down to around 400 characters, then the multiple QR code trick is probably not necessary.

Preset dictionary compression#

serverless-webrtc-qrcode actually already uses lz-string to compress its offers, which isn't good enough. In my experiments, I found node-lz4 gets better compression and lzma-js even better. But still not good enough. (I also ran across JSONC but didn't try it as it seemed focused on compressing structured JSON and probably wouldn't do well on a JSON document dominated by a single large string.)

Taking a step back, I noticed that WebRTC offers are very similar to each other. Perhaps if I could encode just what is different about the specific offer being sent, then that would be smaller. Instead of trying to figure out how to compute such a diff, I realized that's essentially what LZ-family compression algorithms already do: continually go through a byte stream and represent it in terms of its similarities to what appears earlier in the byte stream.

To be clear, this isn't a novel idea. I got the idea from SPDY's original header compression, which used preset dictionaries to make DEFLATE do a better job of compressing HTTP headers with similar reasoning: most sites use the same HTTP headers, so keeping around a collection of common ones makes the compression work better. This CloudFlare blog post explains the concept in a slightly different setting.

Although I didn't find it while developing this code, pako is a JavaScript port of zlib which supports preset dictionaries for DEFLATE compression/decompression. Unfortunately, while it does better than LZMA (without a preset dictionary) on some inputs, it still doesn't get below the 400 byte target size.
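To show what a preset dictionary buys, here is a small sketch using Node's built-in zlib module as a stand-in for pako (both wrap the same DEFLATE format, but this is not the code from the post). The SDP-like strings are made-up placeholders, not real offers, and the dictionary is simply a sample with the connection-specific values removed.

```javascript
// Sketch: DEFLATE with and without a preset dictionary (Node.js zlib).
const zlib = require("node:zlib");

// Made-up "typical offer" text shared by compressor and decompressor.
const dictionary = Buffer.from(
  "v=0\r\no=- 0 2 IN IP4 127.0.0.1\r\ns=-\r\nt=0 0\r\n" +
  "a=group:BUNDLE 0\r\nm=application 9 UDP/DTLS/SCTP webrtc-datachannel\r\n" +
  "c=IN IP4 0.0.0.0\r\na=ice-ufrag:\r\na=ice-pwd:\r\n" +
  "a=setup:actpass\r\na=mid:0\r\na=sctp-port:5000\r\n"
);

// Made-up offer that differs only in the session-specific values.
const offer = Buffer.from(
  "v=0\r\no=- 4611731400430051336 2 IN IP4 127.0.0.1\r\ns=-\r\nt=0 0\r\n" +
  "a=group:BUNDLE 0\r\nm=application 9 UDP/DTLS/SCTP webrtc-datachannel\r\n" +
  "c=IN IP4 0.0.0.0\r\na=ice-ufrag:Xy7q\r\na=ice-pwd:kq3Vd0pTzR1mN8sLw2Yb\r\n" +
  "a=setup:actpass\r\na=mid:0\r\na=sctp-port:5000\r\n"
);

const plain = zlib.deflateSync(offer);
const withDictionary = zlib.deflateSync(offer, { dictionary });
// The receiver must supply the identical dictionary to decompress.
const restored = zlib.inflateSync(withDictionary, { dictionary });

console.log("raw:", offer.length, "deflate:", plain.length,
            "deflate + dictionary:", withDictionary.length);
```

Because almost everything in the offer already appears in the dictionary, the compressor can emit back-references from the first byte instead of spelling out the boilerplate.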

Prefix compression#

For compression libraries that don't support preset dictionaries, they can be faked by noticing that compressing with a preset dictionary is very similar to compressing the dictionary, throwing away the output, and then compressing the actual data. While compression APIs don't have support for exactly that, simply compressing the concatenation prefix + data and then throwing away the beginning is pretty close. There are a couple of complications:

We need to figure out how much at the beginning of the compressed string is the repetitive part about prefix that should be thrown away and where the part about data actually starts.
If the compression format has a header that includes a length or checksum, that will complicate things.

I wrote prefix-compression-test to help work through those issues and determine how well the compression is actually working.

It defines a class PrefixCompressor, which is parameterized on compress/decompress functions and a prefix string. When compressing, it runs the base compress function on prefix + data and finds the first byte that differs from the result of compressing prefix alone. It chops off the matching bytes from the start of the compressed data and replaces them with a single number: how many bytes at the end of the compressed prefix did not match and so must be dropped. The decompress function uses that number to reconstruct the full compressed data (the compressed prefix minus that many trailing bytes, followed by the rest of the message), calls the base decompress function, and slices prefix off the start of the result before returning.

If that fails to work, it has additional logic that tries skipping bytes at the beginning of the compressed data until it finds a match, which gives an estimate of the length of the header, if there is one.
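The core of that idea can be sketched in a few lines. This is a simplified reconstruction from the description above, not the code in prefix-compression-test: the method names, the 2-byte count field, and the use of Node's raw DEFLATE as the base compressor are all assumptions made for the example, and the header-skipping logic is left out.

```javascript
// Simplified sketch of the PrefixCompressor idea (not the original code).
const zlib = require("node:zlib");

class PrefixCompressor {
  // compress/decompress: functions from Buffer to Buffer.
  constructor(compress, decompress, prefix) {
    this.compress = compress;
    this.decompress = decompress;
    this.prefix = Buffer.from(prefix);
    this.compressedPrefix = compress(this.prefix);
  }

  compressData(data) {
    const full = this.compress(Buffer.concat([this.prefix, Buffer.from(data)]));
    const cp = this.compressedPrefix;
    // Count leading bytes shared with the compressed prefix alone.
    const limit = Math.min(cp.length, full.length);
    let shared = 0;
    while (shared < limit && cp[shared] === full[shared]) shared++;
    // Assumption: store the number of dropped trailing bytes of the
    // compressed prefix as a 2-byte big-endian integer.
    const header = Buffer.alloc(2);
    header.writeUInt16BE(cp.length - shared);
    return Buffer.concat([header, full.subarray(shared)]);
  }

  decompressData(packed) {
    const dropped = packed.readUInt16BE(0);
    const cp = this.compressedPrefix;
    // Rebuild exactly what compress(prefix + data) produced.
    const full = Buffer.concat([cp.subarray(0, cp.length - dropped), packed.subarray(2)]);
    return this.decompress(full).subarray(this.prefix.length);
  }
}

// Raw DEFLATE has no header or checksum, which sidesteps the second complication.
const compressor = new PrefixCompressor(
  zlib.deflateRawSync, zlib.inflateRawSync,
  "a=setup:actpass\r\na=mid:0\r\na=sctp-port:5000\r\n"
);
const packed = compressor.compressData("a=ice-ufrag:Xy7q\r\na=mid:0\r\n");
const unpacked = compressor.decompressData(packed).toString();
```

The round trip works with any base compressor, but how many leading bytes actually match, and therefore how much is saved, depends heavily on the compression format.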

That code told me that for LZMA, it always had to skip the first 7 bytes. I found documentation of the LZMA header which said that the first 5 bytes are the options and the following 8 bytes are the maximum size of the decompressed data. Since all of the files I was testing with were below 2^16 bytes (64 KiB), their size took two bytes to express. Since that documentation said the length field could be set to its max value to indicate it was unknown, I just changed my LZMA compression function to overwrite those bytes with 0xff bytes.
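A minimal sketch of that header tweak, assuming the layout described above (5 bytes of options followed by an 8-byte size field, so bytes 5 through 12). The function name is made up, and the input is whatever byte array the LZMA library's compress call produced.

```javascript
// Hypothetical helper: mark the uncompressed size in an LZMA header as unknown.
// Assumes a 13-byte header: bytes 0-4 are options, bytes 5-12 are the size.
function markLzmaSizeUnknown(compressed) {
  const out = Uint8Array.from(compressed); // copy; leave the input untouched
  out.fill(0xff, 5, 13);                   // end index is exclusive
  return out;
}
```

With the size field constant, the header is identical for every input, so it no longer gets in the way of matching the compressed prefix against the compressed prefix + data.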

Finally, I built a prefix out of WebRTC offers with the connection-specific information scrubbed out. PrefixCompressor using LZMA compression with that prefix successfully compressed new offers down to below 400 bytes.

Transferring strings#

Reading the QR code#

Generating the QR code was straightforward. I used the QR-Code-generator library. For scanning QR codes, I found QR Scanner.

At first I had trouble getting it to read the QR code at all, but it turned out that unintuitively, having the QR code full frame in the camera didn't work. I had to hold the camera further away so the QR code took up only about half of the width and then it was able to read the QR code.

But it read the QR code as an empty string. After some debugging, I figured out the problem was this line which returns only the text interpretation of the QR code. Since I was using the QR code to transmit compressed data, I encoded the binary bytes directly (as opposed to encoding it in Base64). It turns out the library QR Scanner uses to decode the QR code does provide that information but it stores it in result.binaryData. Binary data isn't a common use case of QR codes, so no one noticed. As a workaround, I just modified the minified .js file so it returns result instead of result.data, so I could get access to result.binaryData.

Content Security Policy warning#

QR Scanner causes Firefox to generate the following warning:

Content Security Policy: The page’s settings observed the loading of a resource at https://.../camera/static/js/qr-scanner-worker.min.js (“worker-src”). A CSP report is being sent.

Trying to change the Content Security Policy, I was only able to cause it to get blocked; I couldn't figure out how to make that warning go away. Once I figured out QR Scanner was in fact working anyway, I just ignored the warning.

Transferring strings via SSH#

As my desktop does not have a camera, to make this connection, I still have to transfer one string from my smartphone to my desktop, specifically the link generated in the first step of the serverless connection.

Normally, I transfer text to my smartphone using qr (which outputs QR codes as text to the terminal) or in the other direction by opening an SSH session from my smartphone (using JuiceSSH) and attaching to the same screen session as on my desktop and using xclip to transfer text to the desktop's clipboard. As we know what we want to do with the text, this can be streamlined:

On desktop:

$ screen -S copy
[in screen session]
$ while read -r url; do chromium --incognito --app="$url"; done

On phone:

$ screen -x copy

and then paste the URL and hit enter. The browser window should pop up on the desktop with a QR code to scan to complete the connection.

Summary#

Putting all of the pieces together:

Always wait for ICE gathering to complete so there's only one offer message to send.
Always initially set up just an RTCDataChannel to keep negotiation simple.
Use PrefixCompressor to keep the offer message small enough to fit in a QR code.
Allow transferring via a Base64 encoded string instead of a QR code to support devices without cameras.

A Weird Imagination