A few words about WebSockets

The last post was about me vs AD. However, it was more of a “how to work with a client and make proper Architecture Decisions” post than a technical one. So in this one, I stay with the very same project, but look at it from the techie side - from the layer that sits deep below the abstractions we all see.

WebSockets got popular a few years back. However, the official standard, RFC 6455, was published in December 2011, so the technology is no newcomer among transport technologies for the Web. WS is a great solution for any website that needs dynamic, full-duplex communication between client code running in the browser and a server. What’s more, that communication is very efficient, which I will try to explain further in this post.

One important note about what I include here: the RFC passages I quote come from its introductory overview, which the RFC itself marks as non-normative.

Design Philosophy

For starters, let’s look at the design philosophy behind this protocol: keep the framing minimal. WebSocket uses framing only to make the protocol frame-based instead of stream-based and to distinguish Unicode text from binary data. Any metadata is expected to be managed entirely by the application. In that respect it resembles TCP: metadata is layered on top of WS by the application using it, just as HTTP or any other application protocol layers its metadata on top of TCP.
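
To make “minimal framing” a bit more tangible, here is a small Python sketch of how a single unmasked frame (the server-to-client direction) could be encoded. The function name `encode_frame` and the constants are my own, not part of any library, and the sketch ignores fragmentation and extensions. Frames sent by a client additionally carry a 4-byte masking key, which is left out here.

```python
import struct

OPCODE_TEXT = 0x1    # payload is UTF-8 text
OPCODE_BINARY = 0x2  # payload is arbitrary bytes


def encode_frame(payload, opcode):
    """Build a single, final, unmasked frame (hypothetical helper)."""
    first_byte = 0x80 | opcode  # FIN bit set, no reserved bits, 4-bit opcode
    length = len(payload)
    if length <= 125:
        # Short payloads: the length fits into the second header byte.
        header = struct.pack("!BB", first_byte, length)
    elif length <= 0xFFFF:
        # Marker 126 means "a 16-bit length follows".
        header = struct.pack("!BBH", first_byte, 126, length)
    else:
        # Marker 127 means "a 64-bit length follows".
        header = struct.pack("!BBQ", first_byte, 127, length)
    return header + payload


text_frame = encode_frame("hello".encode("utf-8"), OPCODE_TEXT)
binary_frame = encode_frame(bytes(300), OPCODE_BINARY)

print(text_frame.hex())         # 810568656c6c6f
print(len(binary_frame) - 300)  # 4 (header bytes for a 300-byte payload)
```

The only thing that tells text from binary is the opcode in the first byte, and a short message costs just two bytes of header in this direction.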

To boil things down to the concept, I would like to quote the RFC:

"Conceptually, WebSocket is really just a layer on top of TCP that
   does the following:
    o  adds a web origin-based security model for browsers

    o  adds an addressing and protocol naming mechanism to support
      multiple services on one port and multiple host names on one IP
      address

    o  layers a framing mechanism on top of TCP to get back to the IP
      packet mechanism that TCP is built on, but without length limits

    o  includes an additional closing handshake in-band that is designed
      to work in the presence of proxies and other intermediaries"

Adding those four points to what I described earlier makes everything WS adds over TCP plainly visible - there is nothing more. With all that said, we can conclude that WebSocket is efficient: it relies on the least possible framing, and there is no streaming session on top of it to drain resources.

Let’s do a techie talk now

So, with all that said about why the WebSocket protocol exists, we can move on to how it actually works. The protocol sets out distinct steps that servers are expected to implement.

We can distinguish the following operations/states during a WebSocket session:

Open Connection Handshake,
Establishing Connection,
Established Connection,
Closing Connection Handshake.

Doesn’t that look similar to another protocol you know? It does: the WS connection lifecycle is close to identical to the TCP connection lifecycle. To be more precise, WS is just an additional abstraction layer over a raw TCP connection, narrowed down to what the Web actually needs.
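
The four phases listed above can be sketched as a tiny state machine. This is only my own illustration: the state names are borrowed from the `readyState` values of the browser WebSocket API, and the `TRANSITIONS` table and the `advance` helper are hypothetical.

```python
from enum import Enum


class State(Enum):
    CONNECTING = "opening handshake in progress"
    OPEN = "connection established"
    CLOSING = "closing handshake in progress"
    CLOSED = "connection closed"


# Allowed moves between phases; anything else is treated as a bug.
TRANSITIONS = {
    State.CONNECTING: {State.OPEN, State.CLOSED},  # CLOSED if the handshake fails
    State.OPEN: {State.CLOSING, State.CLOSED},     # CLOSED if TCP drops abruptly
    State.CLOSING: {State.CLOSED},
    State.CLOSED: set(),
}


def advance(current, target):
    """Return the new state or raise if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


state = State.CONNECTING
for step in (State.OPEN, State.CLOSING, State.CLOSED):
    state = advance(state, step)
print(state.name)  # CLOSED
```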

Let’s not forget that the WS protocol implements a simple security mechanism to keep connections separated from each other and to protect them from being forged by clients of other TCP-based protocols. This mechanism is tied to the Open Connection Handshake step, so I will describe it while talking about that step.

Open Connection Handshake Step

This step is really close to what TCP does at the beginning of communication between hosts. Instead of exchanging segments with the SYN, SYN-ACK and ACK flags set, the client starts by sending an HTTP request with an Upgrade header. An example of such a request is below:

        GET /push HTTP/1.1
        Host: push.example.com
        Upgrade: websocket
        Connection: Upgrade
        Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
        Origin: http://example.com
        Sec-WebSocket-Protocol: pushproto
        Sec-WebSocket-Version: 13
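
A request like this one could be assembled by hand as in the Python sketch below. The helper name `build_upgrade_request` is my own invention, and the host, path, origin and subprotocol are simply the values from the example above. The key is 16 random bytes, base64-encoded, generated anew for every connection.

```python
import base64
import os


def build_upgrade_request(host, path, origin, subprotocol):
    """Return the raw handshake request and the key used (hypothetical helper)."""
    # A fresh random 16-byte nonce, base64-encoded, for each connection.
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        f"Origin: {origin}",
        f"Sec-WebSocket-Protocol: {subprotocol}",
        "Sec-WebSocket-Version: 13",
    ]
    # HTTP headers end with an empty line, hence the trailing CRLF CRLF.
    return "\r\n".join(lines) + "\r\n\r\n", key


request, key = build_upgrade_request(
    "push.example.com", "/push", "http://example.com", "pushproto"
)
print(request)
```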

After receiving such a request, the HTTP server upgrades the connection to WebSocket. It answers with an HTTP 101 Switching Protocols response, which means everything went well and everything that follows is WebSocket communication. If something doesn’t add up on the server’s side, the server answers with an ordinary HTTP error response (one of the 4xx codes, for example) and the communication carries on as plain HTTP. One typical way for a WS client to fail the Open Connection Handshake is to fail the security checks on the server side, for example when the Origin header contains a value the server is not willing to accept. When the checks pass, the server takes the key generated by the client (Sec-WebSocket-Key), concatenates it with the fixed GUID defined in the RFC (258EAFA5-E914-47DA-95CA-C5AB0DC85B11), hashes the result with SHA-1, which yields a 160-bit digest, and base64-encodes that digest. The resulting token is returned to the client in the Sec-WebSocket-Accept header field, which means that the server is willing to establish the connection. If this field is missing, or holds any value other than the expected one, the client must treat the handshake as failed and not use the connection.
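
The accept token is easy to reproduce in a few lines of Python. This is a minimal sketch: the function names are my own, and the response contains only the headers needed for a bare handshake (no subprotocol negotiation). The key used at the end is the one from the example request above.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455; it is the same for every server.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def compute_accept(sec_websocket_key):
    """Concatenate key and GUID, hash with SHA-1, base64-encode the digest."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def build_switching_protocols_response(sec_websocket_key):
    """Hypothetical helper producing a bare 101 response."""
    return (
        "HTTP/1.1 101 Switching Protocols\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Accept: {compute_accept(sec_websocket_key)}\r\n"
        "\r\n"
    )


print(compute_accept("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

A client can run the same `compute_accept` on the key it sent and compare the result with the header it got back; a mismatch means the handshake failed.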

Close Connection Handshake Step

As I presented above, the Open Connection step is really close to TCP and follows a well-defined sequence. The Close Connection Handshake is a simpler affair. The overview of it in the RFC’s introduction is non-normative, and the RFC describes the step as a complement to the TCP closing handshake (FIN/ACK), which is not always reliable end-to-end when proxies and other intermediaries are involved. So, to leave no room for discussion, I would like to quote the RFC itself on how this step is initiated:

    Either peer can send a control frame with data containing a specified
    control sequence to begin the closing handshake (detailed in
    Section 5.5.1).  Upon receiving such a frame, the other peer sends a
    Close frame in response, if it hasn't already sent one.  Upon
    receiving _that_ control frame, the first peer then closes the
    connection, safe in the knowledge that no further data is
    forthcoming.

Once this is done, we can say that the connection is closed and the peers no longer read data from the socket, as stated here:

    After sending a control frame indicating the connection should be
    closed, a peer does not send any further data; after receiving a
    control frame indicating the connection should be closed, a peer
    discards any further data received.
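
The two quoted rules can be modelled in a short Python sketch. Everything here is my own illustration: the `Peer` class is hypothetical, frames are unmasked, and no real sockets are involved. The Close frame uses opcode 0x8 and carries a 2-byte status code (1000 stands for a normal closure), optionally followed by a UTF-8 reason.

```python
import struct

OPCODE_CLOSE = 0x8
NORMAL_CLOSURE = 1000  # status code for a clean shutdown


def encode_close_frame(code=NORMAL_CLOSURE, reason=""):
    """Unmasked Close frame: 2-byte status code followed by a UTF-8 reason."""
    payload = struct.pack("!H", code) + reason.encode("utf-8")
    if len(payload) > 125:
        raise ValueError("control frame payloads are limited to 125 bytes")
    return bytes([0x80 | OPCODE_CLOSE, len(payload)]) + payload


class Peer:
    """Hypothetical peer that tracks only the two closing rules."""

    def __init__(self):
        self.close_sent = False
        self.close_received = False
        self.outbox = []  # frames this peer would write to the socket

    def send_data(self, frame):
        if self.close_sent:
            raise RuntimeError("no data may be sent after a Close frame")
        self.outbox.append(frame)

    def start_close(self):
        if not self.close_sent:
            self.outbox.append(encode_close_frame())
            self.close_sent = True

    def receive(self, frame):
        if self.close_received:
            return None  # discard anything that arrives after the peer's Close
        if frame[0] & 0x0F == OPCODE_CLOSE:
            self.close_received = True
            self.start_close()  # answer with our own Close if not sent yet
            return None
        return frame

    @property
    def closed(self):
        return self.close_sent and self.close_received


a, b = Peer(), Peer()
a.start_close()            # a begins the closing handshake
b.receive(a.outbox[-1])    # b sees the Close and replies with its own
a.receive(b.outbox[-1])    # a sees the reply; both sides are done
print(a.closed, b.closed)  # True True
```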

A little more about Security Model

The WebSocket protocol is designed to be used alongside a classic HTTP client, which is why it uses an origin-based security model. Another layer of security comes from the headers mentioned earlier that start with “Sec-”. At the time the RFC was written, it wasn’t possible to set such headers from HTML and JS in order to send a crafted HTTP request with XMLHttpRequest. Thus, the only browser-based client that can connect to such a server is a WebSocket client. Another option is to use a custom WS client, for example when writing a desktop app, but this option also allows a potential attacker to set the Origin header as needed. In such a case, the server architecture designer needs to prevent that scenario by adding one or more security layers to authenticate the client.
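
A server-side origin check could look roughly like the Python sketch below. The allow-list and the function name `is_origin_allowed` are assumptions of mine. As said above, this only protects against browser-based clients, because a custom client can send any Origin it likes.

```python
# Hypothetical allow-list; a real deployment would load this from configuration.
ALLOWED_ORIGINS = {"http://example.com", "https://example.com"}


def is_origin_allowed(headers):
    """Decide whether a handshake request may proceed, based on Origin only."""
    # Header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    origin = normalized.get("origin")
    if origin is None:
        # Non-browser clients may omit Origin; this sketch refuses them,
        # which is a policy choice of mine, not something the protocol demands.
        return False
    return origin in ALLOWED_ORIGINS


print(is_origin_allowed({"Origin": "http://example.com"}))       # True
print(is_origin_allowed({"Origin": "http://evil.example.net"}))  # False
```

When the check fails, the server would answer with an HTTP error (403 Forbidden is a natural choice) instead of the 101 response.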

Establishing a Connection

There is one important thing about the WS protocol and its establishing and established connections: the connection is full-duplex, and once opened and established it lasts until it is closed by the server or the client. What’s more, the WS protocol uses ports 80 and 443, the same ones as HTTP servers. In simple setups those connections would probably be handled by the very same server, acting as HTTP and WS server in one: nginx, for example. However, there are more elaborate setups, where the HTTP servers are separated and firewalls, load balancers or sometimes even proxies sit in front of them. The latter can be easier to manage when deploying a bigger production system that needs to scale up easily.
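
The port defaults map onto the two WebSocket URL schemes: `ws://` defaults to port 80 and `wss://` (WebSocket over TLS) to port 443. A small Python sketch of resolving an endpoint from a URL follows; the function name `resolve_endpoint` and the example URLs are made up for illustration.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"ws": 80, "wss": 443}


def resolve_endpoint(url):
    """Return (host, port, use_tls) for a ws:// or wss:// URL."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"not a WebSocket URL: {url}")
    # Fall back to the scheme's default when the URL carries no explicit port.
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port, parts.scheme == "wss"


print(resolve_endpoint("ws://push.example.com/push"))
# ('push.example.com', 80, False)
print(resolve_endpoint("wss://push.example.com:8443/push"))
# ('push.example.com', 8443, True)
```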

FIN ACK

To sum this all up, the WS protocol is just another protocol layered on top of the TCP transport protocol. It is not a modification of HTTP; it only uses HTTP to get started within a classic HTTP-based application flow. This is because it was designed to be used by scripts executed by browsers, which are, no more and no less, HTTP clients. Its best use case is real-time web applications that don’t need a streaming-based protocol but do need full-duplex, low-cost communication with an option to send both text and binary data.

This would be all for now, thanks for reading. Greetings and stay tuned!