WebSocket is one of the technologies under the HTML5 umbrella. It allows full-duplex communication channels over TCP between clients and servers. Prior to WebSocket, web developers had to do some wacky things (long polling, for example) to overcome the request-response nature of HTTP and achieve the same behavior.
Full-duplex communication channels open the door to a world of interactive web applications such as live chats and collaborative games, and can benefit other fields in which real-time communication is crucial, such as finance, sports, and analytics. All of it in a simple, standard, plugin-free way, and with minimal overhead. (To be precise, WebSocket is just one way to develop real-time web applications. Two others are WebRTC, which allows browser-to-browser P2P communication, and Server-Sent Events, which allow a server to continuously push data to clients over HTTP.)
My journey towards understanding WebSocket started by reading The Definitive Guide to HTML5 WebSocket, which I highly recommend. The book starts with a good overview of the history that led to the development of the WebSocket protocol, continues with a deep dive into the protocol itself, and wraps up with practical examples of two subprotocol implementations on top of it.
Overview of the WebSocket Protocol
The Opening Handshake
To establish a new WebSocket connection, a client sends an HTTP Upgrade request to the server. For example, in my demo-client (which will be discussed later on), the request looks like this:
The Origin, Host, User-Agent, Upgrade, and Connection headers are standard HTTP headers (I assume that the Origin header is null because the HTML page that invoked the request was loaded from my local disk rather than being served from a web server). The Sec-* headers are specific to the WebSocket protocol. The Sec-WebSocket-Version header is used to ensure that the server and the client are using the same version of the protocol. The Sec-WebSocket-Key header contains a random 16-byte value that has been base64 encoded. The server uses it to prove that it received the client’s request (discussed shortly). Sec-WebSocket-Extensions is an optional header used to negotiate a set of protocol extensions that will apply for the duration of the connection. In our case, the client states that it would like to use the deflate-frame extension. Another optional header, not used in the above example, is Sec-WebSocket-Protocol, through which the client and the server can negotiate the subprotocol that will be used over the WebSocket connection.
The server’s response to the above request is:
To indicate that it accepted the request, the server responds with the 101 Switching Protocols status code. The interesting header here is Sec-WebSocket-Accept, which corresponds to the Sec-WebSocket-Key header in the request. Its value is constructed by concatenating the value of the Sec-WebSocket-Key header with the constant GUID 258EAFA5-E914-47DA-95CA-C5AB0DC85B11, computing the SHA-1 hash of the concatenated string, and encoding the hash in base64. The following code calculates a value for the Sec-WebSocket-Accept header that matches the client key that is stored in the clientKey variable:
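A minimal sketch of this calculation, written in JavaScript for Node.js rather than the server’s C#. The helper name computeAcceptValue is hypothetical; clientKey mirrors the variable mentioned above, and the sample key and its expected result are the example given in RFC 6455.

```javascript
const crypto = require('node:crypto');

// GUID fixed by RFC 6455; it is appended to the client's key before hashing.
const WEBSOCKET_GUID = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11';

// computeAcceptValue is a hypothetical helper name, not part of any library.
function computeAcceptValue(clientKey) {
  return crypto
    .createHash('sha1')                 // SHA-1 over the concatenated string
    .update(clientKey + WEBSOCKET_GUID)
    .digest('base64');                  // base64-encode the 20-byte hash
}

// Sample key taken from RFC 6455, section 1.3.
console.log(computeAcceptValue('dGhlIHNhbXBsZSBub25jZQ=='));
// prints "s3pPLMBiTxaQ9kYGzzhZRbK+xOo="
```

Note that the key is never decoded: the base64 text is hashed exactly as it appears in the header.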
If the client included the Sec-WebSocket-Protocol or Sec-WebSocket-Extensions headers in the request, the server can include those headers in its response as well, containing the subprotocol and the extensions that it chose from the client’s offer.
The above request and response are referred to in the protocol as the opening handshake. After a successful handshake, the client and the server can send messages to each other over the connection.
Data exchanged between the client and the server in the WebSocket protocol is carried in frames, and each frame must be constructed according to the following format:
The first bit of each frame indicates whether it is the last frame of a message. A single message can be split into multiple frames. If this bit is set, this is the final frame of the message. The following three bits are reserved for future use.
The low nibble of the first byte contains the frame’s opcode which can be one of the following:
The 0x1 and 0x2 opcodes mean that the frame’s payload is textual (UTF-8 encoded) or binary, respectively. The 0x8 opcode means that the sender wishes to close the connection (discussed shortly). For frames that are part of a fragmented message, the opcode should be supplied only in the first frame. The other frames should contain the 0x0 opcode.
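To make the layout of the first byte concrete, here is a small JavaScript sketch that splits it into the final-frame bit, the reserved bits, and the opcode. The names OPCODES and parseFirstByte are hypothetical; the opcode values themselves come from RFC 6455.

```javascript
// Opcode values defined by RFC 6455 (the labels are just for display).
const OPCODES = {
  0x0: 'continuation',
  0x1: 'text',
  0x2: 'binary',
  0x8: 'close',
  0x9: 'ping',
  0xA: 'pong',
};

// parseFirstByte is a hypothetical helper name.
function parseFirstByte(byte) {
  return {
    fin: (byte & 0x80) !== 0,      // highest bit: final frame of the message
    reserved: (byte >> 4) & 0x07,  // next three bits: reserved for future use
    opcode: byte & 0x0f,           // low nibble: the frame's opcode
  };
}

const first = parseFirstByte(0x81);
console.log(first.fin, OPCODES[first.opcode]); // prints: true text
```

A first byte of 0x01 followed later by 0x80 would describe a text message fragmented into two frames: an opening text frame without the final bit, then a final continuation frame.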
The 0x9 (ping) and 0xA (pong) opcodes are used to send keep-alive frames between the client and the server. Upon receiving a frame with the 0x9 opcode, the receiving endpoint should respond with a frame that contains the 0xA opcode. These keep-alive frames help prevent an idle connection from being dropped (for example, by a timeout somewhere along the network path) when there is a long period with no messages being sent between the endpoints, which is likely to happen in some scenarios (for example, a chat client that did not send or receive messages for a couple of minutes).
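As a rough illustration of the reply, the sketch below builds a pong frame the way a server might send it, echoing the payload of the ping it answers (RFC 6455 requires the pong to carry the same application data). It is JavaScript for Node.js, and buildPongFrame is a hypothetical helper name.

```javascript
// buildPongFrame is a hypothetical helper name. It builds a pong frame in the
// server-to-client direction, so no masking key is included.
function buildPongFrame(pingPayload) {
  // Control frames are limited to 125 payload bytes, so the length always
  // fits in the single length byte that follows the first byte.
  if (pingPayload.length > 125) {
    throw new RangeError('control frame payload too long');
  }
  const header = Buffer.from([0x8a, pingPayload.length]); // final bit + opcode 0xA
  return Buffer.concat([header, pingPayload]);
}

console.log(buildPongFrame(Buffer.from('hi')).toString('hex')); // prints "8a026869"
```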
The highest bit of the second byte is the masking bit, and it should be set whenever the payload is masked. The protocol states that all frames sent from the client to the server must be masked and that all frames sent from the server to the client must not be masked (Section 10.3 of the protocol explains why).
The remaining 7 bits of the second byte contain the length of the actual data that was sent (the payload). If the data is up to 125 bytes in length, its size is encoded directly into those 7 bits. If the data is larger than 125 bytes, those bits contain a special value to indicate that, and the actual length is encoded in the following way:
- If the value of these 7 bits is 126, the length of the payload will be encoded into the next 2 bytes.
- If the value of these 7 bits is 127, the length of the payload will be encoded into the next 8 bytes.
By using the above method, the protocol ensures that the minimum required number of bytes is used to encode the length of each frame’s payload. There is no need to send 7 additional bytes over the wire for frames whose payload is 125 bytes or smaller.
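The length rules above can be sketched as a small encoder. This is JavaScript for Node.js; encodeLengthBytes is a hypothetical helper name, and it leaves the masking bit clear, as a server would. The extended lengths are written in big-endian (network) byte order, as RFC 6455 specifies.

```javascript
// encodeLengthBytes is a hypothetical helper name. It returns the bytes that
// follow the first byte of a frame: the 7-bit length field (masking bit clear)
// plus any extended length bytes.
function encodeLengthBytes(payloadLength) {
  if (payloadLength <= 125) {
    return Buffer.from([payloadLength]);           // length fits in the 7 bits
  }
  if (payloadLength <= 0xffff) {
    const bytes = Buffer.alloc(3);
    bytes[0] = 126;                                // marker: 2 length bytes follow
    bytes.writeUInt16BE(payloadLength, 1);
    return bytes;
  }
  const bytes = Buffer.alloc(9);
  bytes[0] = 127;                                  // marker: 8 length bytes follow
  bytes.writeBigUInt64BE(BigInt(payloadLength), 1);
  return bytes;
}

console.log(encodeLengthBytes(5).length);     // prints 1
console.log(encodeLengthBytes(300).length);   // prints 3
console.log(encodeLengthBytes(70000).length); // prints 9
```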
The next 4 bytes in the frame (starting with the 3rd, 5th, or 11th byte, depending on the number of bytes that were used to encode the payload’s length) are used to store the key that should be used to unmask the payload. These 4 bytes will be used only if the masking bit (highest bit of second byte) was set.
The remaining bytes in the frame (their number is determined by the payload length) store the actual data, and should be interpreted according to the frame’s opcode. As stated, if the frame is masked, those bytes must first be unmasked by XORing each payload byte with one byte of the masking key: the byte at position i is XORed with key byte i mod 4.
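Putting the header fields and the unmasking step together, here is a minimal JavaScript (Node.js) sketch of a frame parser. The name parseFrame is hypothetical, and the sketch assumes the buffer holds exactly one complete frame, which a real server cannot take for granted. The sample bytes are the masked "Hello" frame from RFC 6455.

```javascript
// parseFrame is a hypothetical helper; it assumes `buffer` holds one whole frame.
function parseFrame(buffer) {
  const fin = (buffer[0] & 0x80) !== 0;
  const opcode = buffer[0] & 0x0f;
  const masked = (buffer[1] & 0x80) !== 0;
  let length = buffer[1] & 0x7f;
  let offset = 2;

  if (length === 126) {
    length = buffer.readUInt16BE(offset);            // next 2 bytes
    offset += 2;
  } else if (length === 127) {
    // Number() is exact only up to 2^53 - 1, which is enough for a sketch.
    length = Number(buffer.readBigUInt64BE(offset)); // next 8 bytes
    offset += 8;
  }

  let maskingKey = null;
  if (masked) {
    maskingKey = buffer.subarray(offset, offset + 4);
    offset += 4;
  }

  // Copy the payload so that unmasking does not modify the input buffer.
  const payload = Buffer.from(buffer.subarray(offset, offset + length));
  if (masked) {
    for (let i = 0; i < payload.length; i++) {
      payload[i] ^= maskingKey[i % 4];
    }
  }
  return { fin, opcode, masked, payload };
}

// Masked text frame carrying "Hello" (example from RFC 6455, section 5.7).
const frame = Buffer.from([0x81, 0x85, 0x37, 0xfa, 0x21, 0x3d, 0x7f, 0x9f, 0x4d, 0x51, 0x58]);
console.log(parseFrame(frame).payload.toString('utf8')); // prints "Hello"
```

Because XOR is its own inverse, the same loop also masks a payload, which is what a client does before sending.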
The Closing Handshake
When one of the endpoints decides to close the connection, it should send a frame with the 0x8 opcode. This frame can contain a status code (there are predefined codes in the protocol) and a textual reason in the payload. Upon receiving a closing frame, the other endpoint must reply with its own closing frame. After both sides of the connection have sent and received a closing frame, the WebSocket connection can be closed.
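As an illustration of the closing frame’s payload, the JavaScript (Node.js) sketch below encodes the status code as a 2-byte big-endian integer followed by the UTF-8 reason, which is the layout RFC 6455 defines. The helper names buildClosePayload and buildCloseFrame are hypothetical, and the frame is built unmasked, as a server would send it.

```javascript
// buildClosePayload and buildCloseFrame are hypothetical helper names.
function buildClosePayload(statusCode, reason = '') {
  const reasonBytes = Buffer.from(reason, 'utf8');
  const payload = Buffer.alloc(2 + reasonBytes.length);
  payload.writeUInt16BE(statusCode, 0); // status code: 2 bytes, big-endian
  reasonBytes.copy(payload, 2);         // optional UTF-8 reason text
  return payload;
}

// Unmasked close frame, i.e. the server-to-client direction.
function buildCloseFrame(statusCode, reason) {
  const payload = buildClosePayload(statusCode, reason);
  // Control frames may carry at most 125 payload bytes.
  if (payload.length > 125) {
    throw new RangeError('close reason too long');
  }
  const header = Buffer.from([0x88, payload.length]); // final bit + opcode 0x8
  return Buffer.concat([header, payload]);
}

const closeFrame = buildCloseFrame(1000, 'bye'); // 1000 means normal closure
console.log(closeFrame.toString('hex')); // prints "880503e8627965"
```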
The core of the server implementation is the WebSocketServer class, which uses the TcpListener class to accept new TCP connections. Upon receiving a new connection, the class checks whether this is a WebSocket connection by looking for the opening handshake:
The OnAcceptClient method is called for each new client connection. That method calls the ReceiveClientHandshake method to asynchronously receive data from the new TCP connection and then continues to wait for other connections. When new data is received from the TCP connection, the OnHandshakeReceived method is called. In this method, we check whether enough bytes were received and whether the received data is a valid client handshake (by using the OpeningHandshakeHandler class), and close the connection if not. Otherwise, we create a response (again, by using the OpeningHandshakeHandler class) and send it back to the client:
In the SendHandshake method, we asynchronously send the server’s response to the client. When the send operation completes, the OnHandshakeSendCompleted method is called. At that point, we can be sure that we have a valid WebSocket connection. A new instance of the ClientConnection class is created and the OnClientConnected event is raised to indicate that we have a new client.
The ClientConnection class encapsulates a single WebSocket connection, and is the class with which we interact for the entire lifetime of the connection. Its public API exposes methods to send data, and to disconnect, as well as events which are raised when data is received or when the client disconnects:
The first method that needs to be called on a new ClientConnection instance is the StartReceiving method, which asynchronously receives data from the client:
When new data arrives, the OnDataReceived method is called. In it, we handle the received frame and call StartReceiving again to receive the next frame. Let’s take a look at the HandleFrame method:
In the HandleFrame method, we use the Frame class to parse the received bytes into something that is easier to work with and raise the correct events based on the frame’s opcode. Note that we return false in case the received frame was a closing frame. In that case, the OnDataReceived method will not call the StartReceiving method again.
The Frame class itself is just a way to encapsulate the frame’s parsing and byte-splitting code into something more elegant:
With the Frame class in place, receiving and sending frames becomes much simpler. Take a look, for example, at the Send methods of the ClientConnection class:
In those methods, we just create a new Frame instance with the correct opcode and data and let it handle all the rest (note the call to ToBuffer).
That’s pretty much it. With the WebSocketServer, ClientConnection, OpeningHandshakeHandler, and Frame classes doing the heavy lifting, creating an echo server is as simple as it gets:
I don’t think that the above code requires any explanation. The complete server code can be found on GitHub.
In order to focus on the important parts, I kept the page layout as simple as possible:
Which gives us the following page:
When the user clicks the Connect button, the connect function is called:
In it, we create a new WebSocket instance, passing the server’s address as a parameter. Then we add handlers for the following events: onopen, fired when the connection is established; onerror, fired when an error occurs; onmessage, fired when a new message arrives from the server; and onclose, fired when the connection is closed. In each of these event handlers we print a message to the screen by using the log function, and change the style of the DOM elements by using the changeState function:
When the user clicks the Send button, the send function is called. In it, we check that the connection is open and, if it is, we send the message to the server by calling the send function on the webSocket instance:
Finally, when the user clicks the Disconnect button, the disconnect function is called. In it, we just call the close function on the webSocket instance:
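The three client functions described above can be sketched together as follows. This is an illustration under stated assumptions, not the demo client’s actual code: createClient is a hypothetical wrapper, the log and changeState callbacks stand in for the page’s helpers (their signatures here are guesses), and the WebSocketImpl parameter exists only so the sketch can be exercised outside a browser.

```javascript
// createClient is a hypothetical wrapper. `log` and `changeState` stand in for
// the page's helpers; their signatures are assumptions made for this sketch.
function createClient(log, changeState, WebSocketImpl = WebSocket) {
  let webSocket = null;

  function connect(url) {
    webSocket = new WebSocketImpl(url);
    webSocket.onopen = () => { log('Connected'); changeState('connected'); };
    webSocket.onerror = () => { log('An error occurred'); };
    webSocket.onmessage = (event) => { log('Received: ' + event.data); };
    webSocket.onclose = () => { log('Disconnected'); changeState('disconnected'); };
  }

  function send(message) {
    // readyState 1 is the OPEN state in the WebSocket API.
    if (webSocket !== null && webSocket.readyState === 1) {
      webSocket.send(message);
      log('Sent: ' + message);
    } else {
      log('Not connected');
    }
  }

  function disconnect() {
    if (webSocket !== null) {
      webSocket.close(); // starts the closing handshake
    }
  }

  return { connect, send, disconnect };
}
```

In a page, the buttons’ click handlers would call connect with the server’s ws:// address, send with the text to transmit, and disconnect, respectively.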
That’s it. We have a fully functional WebSocket server, implemented in C#, and a corresponding client based on the HTML5 WebSocket API. You can run both the client and the server on your local machine. I urge you to download the code and start experimenting with it. Have fun!