Learn HTML5 with Me – WebSocket from Scratch

August 24, 2013

WebSocket is one of the technologies under the HTML5 umbrella. It provides full-duplex communication channels between clients and servers over a single TCP connection. Prior to WebSocket, web developers had to resort to some wacky techniques (such as polling and long polling) in order to overcome the request-response nature of HTTP and achieve similar behavior.

Having full-duplex communication channels in web applications opens the door to a world of interactive web applications, such as live chats and collaborative games, and can benefit various other fields in which real-time communication is crucial, such as finance, sports, and analytics. All of this comes in a simple, standard, and plugin-free way, with minimal overhead.

To be precise, WebSocket is just one way to develop real-time web applications. Two other ways are WebRTC, which allows browser-to-browser P2P communication, and Server-Sent Events, which allows a server to continuously push data to clients over HTTP.

My journey towards understanding WebSocket started by reading The Definitive Guide to HTML5 WebSocket, which I highly recommend. The book starts with a good overview of the history that led to the development of the WebSocket protocol, continues with a deep dive into the protocol itself, and sums up with practical examples of two sub-protocol implementations over it.

Overview of the WebSocket Protocol

The Opening Handshake

To establish a new WebSocket connection, a client sends an HTTP Upgrade request to the server. For example, in my demo-client (which will be discussed later on), the request looks like this:

    GET ws://127.0.0.1:54321/ HTTP/1.1
    Origin: null
    Host: 127.0.0.1:54321
    User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36
    Upgrade: websocket
    Connection: Upgrade
    Sec-WebSocket-Key: BbLRLYrRTZS85NWjOLMXGQ==
    Sec-WebSocket-Extensions: x-webkit-deflate-frame
    Sec-WebSocket-Version: 13

The Origin, Host, User-Agent, Upgrade, and Connection headers are standard HTTP headers (I assume that the Origin header is null because the HTML page that invoked the request was loaded from my local disk rather than being served from a web server). The Sec-* headers are specific to the WebSocket protocol. The Sec-WebSocket-Version header is used to ensure that the server and the client are using the same version of the protocol. The Sec-WebSocket-Key header contains a random 16-byte value that has been base64 encoded. It is used by the server to prove that it received the client’s request (discussed shortly). The Sec-WebSocket-Extensions header is optional, and is used to negotiate a set of extensions to the protocol that will be used for the duration of the connection. In our case, the client states that it would like to use the deflate-frame extension. Another optional header, which is not used in the above example, is the Sec-WebSocket-Protocol header, through which the client and the server can negotiate the sub-protocol that will be used over the WebSocket connection.
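Browsers build this request for you, but it can help to see how little is needed to produce one by hand. The following is a minimal sketch in JavaScript for Node.js, not taken from the demo code; the buildClientHandshake helper and its parameters are made up for this example, and only the mandatory headers are included:

```javascript
const crypto = require("crypto");

// Hypothetical helper: builds the text of an opening handshake request.
function buildClientHandshake(host, path) {
    // The key is 16 random bytes, base64 encoded (always 24 characters).
    const key = crypto.randomBytes(16).toString("base64");
    const lines = [
        "GET " + path + " HTTP/1.1",
        "Host: " + host,
        "Upgrade: websocket",
        "Connection: Upgrade",
        "Sec-WebSocket-Key: " + key,
        "Sec-WebSocket-Version: 13"
    ];
    // HTTP header lines are separated by CRLF and terminated by an empty line.
    return { key: key, request: lines.join("\r\n") + "\r\n\r\n" };
}

const handshake = buildClientHandshake("127.0.0.1:54321", "/");
console.log(handshake.request);
```

The key is returned next to the request text because the client needs it again later, in order to verify the server’s response.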

The server’s response to the above request is:

    HTTP/1.1 101 Switching Protocols
    Connection: Upgrade
    Sec-WebSocket-Accept: TwlhayhKaWFcyaAr5boetomx+4k=
    Upgrade: websocket

To indicate that the server accepted the request, it responds with the 101 Switching Protocols status code. The interesting header here is Sec-WebSocket-Accept, which corresponds to the Sec-WebSocket-Key header in the request. Its value is constructed by concatenating the value of the Sec-WebSocket-Key header with the constant GUID 258EAFA5-E914-47DA-95CA-C5AB0DC85B11, computing the SHA-1 hash of the concatenated string, and encoding the hash in base64. The following code calculates a value for the Sec-WebSocket-Accept header that matches the client key stored in the clientKey variable:

    // Requires: using System; using System.Security.Cryptography; using System.Text;
    const string keySuffix = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11";
    string paddedKey = clientKey + keySuffix;
    string encodedKey;
    using (SHA1 hasher = SHA1.Create())
    {
        encodedKey = Convert.ToBase64String(hasher.ComputeHash(Encoding.UTF8.GetBytes(paddedKey)));
    }
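A handy way to check an implementation of this calculation is to run it against the sample key that appears in the protocol specification (RFC 6455). Here is a minimal sketch of the same calculation in JavaScript for Node.js; the computeAcceptKey name is made up for this example:

```javascript
const crypto = require("crypto");

const KEY_SUFFIX = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11";

// Hypothetical helper: derives the Sec-WebSocket-Accept value from a client key.
function computeAcceptKey(clientKey) {
    return crypto
        .createHash("sha1")
        .update(clientKey + KEY_SUFFIX, "utf8")
        .digest("base64");
}

// Sample key taken from the handshake example in RFC 6455
console.log(computeAcceptKey("dGhlIHNhbXBsZSBub25jZQ==")); // s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```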

If the client included the Sec-WebSocket-Protocol or the Sec-WebSocket-Extensions headers in the request, the server can add those headers to the response as well, containing the sub-protocol that it selected and the extensions that it agreed to use.

The above request and response are referred to in the protocol as the opening handshake. After a successful handshake, the client and the server can send messages to each other over the connection.

Frame Format

Data in the WebSocket protocol is exchanged in units called frames. Each frame must be constructed according to the following format:

[Figure: FrameFormat]

The first bit of each frame indicates whether it is the last frame of a message. A single message can be split into multiple frames; if this bit is set, the frame is the final frame of the message. The following three bits are reserved, and must be 0 unless an extension that gives them a meaning has been negotiated.

The low nibble of the first byte contains the frame’s opcode, which can be one of the following:

    public enum Opcodes
    {
        Continuation = 0x0,
        Text = 0x1,
        Binary = 0x2,
        Close = 0x8,
        Ping = 0x9,
        Pong = 0xA
    }

The 0x1 and 0x2 opcodes mean that the type of the frame’s payload is textual (UTF-8 encoded) or binary, respectively. The 0x8 opcode means that the sender wishes to close the connection (discussed shortly). For a fragmented message, only the first frame carries the message’s opcode (0x1 or 0x2). The other frames carry the 0x0 (continuation) opcode.
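To make the layout of the first byte and the fragmentation rule more concrete, here is a small sketch in JavaScript for Node.js. The decodeFirstByte and reassemble helpers, and the { fin, opcode, payload } shape of a parsed frame, are assumptions made for this example. The sketch also ignores control frames, which are allowed to appear in the middle of a fragmented message:

```javascript
// Splits the first byte of a frame into its fields.
function decodeFirstByte(firstByte) {
    return {
        fin: (firstByte & 0x80) !== 0,     // highest bit: final frame of the message
        reserved: (firstByte >> 4) & 0x7,  // the three reserved bits
        opcode: firstByte & 0x0f           // low nibble
    };
}

// Joins the payloads of the frames that make up one message.
// Each frame is assumed to look like { fin, opcode, payload },
// where payload is an already unmasked Buffer.
function reassemble(frames) {
    if (frames.length === 0 || frames[0].opcode === 0x0)
        throw new Error("A message must start with a Text or Binary frame");
    for (let i = 1; i < frames.length; i++) {
        if (frames[i].opcode !== 0x0)
            throw new Error("Only the first frame may carry the message's opcode");
    }
    if (!frames[frames.length - 1].fin)
        throw new Error("The last frame must have the FIN bit set");
    return {
        opcode: frames[0].opcode,
        payload: Buffer.concat(frames.map(function (f) { return f.payload; }))
    };
}

// "Hello" split across two frames: Text ("Hel") followed by Continuation ("lo")
const message = reassemble([
    { fin: false, opcode: 0x1, payload: Buffer.from("Hel") },
    { fin: true, opcode: 0x0, payload: Buffer.from("lo") }
]);
console.log(message.payload.toString("utf8")); // Hello
console.log(decodeFirstByte(0x81)); // { fin: true, reserved: 0, opcode: 1 }
```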

The 0x9 (ping) and 0xA (pong) opcodes are used to send keep-alive frames between the client and the server. Upon receiving a frame with the 0x9 opcode, the receiving endpoint should respond with a frame that contains the 0xA opcode. These keep-alive frames help prevent the connection from being dropped (for example, by a proxy or another intermediary with an idle timeout) when there is a long period with no messages being sent between the endpoints, which is likely to happen in some scenarios (for example, a chat client that did not send or receive messages for a couple of minutes).
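According to the protocol, a pong carries the same payload as the ping it answers, and control frames such as these are limited to 125 bytes of payload. A minimal sketch of building the server’s reply, in JavaScript for Node.js (the buildPongReply helper is made up for this example):

```javascript
const PONG = 0xA;

// Builds the unmasked Pong frame that a server would send in reply to a Ping.
// pingPayload is assumed to be the already unmasked payload of the Ping frame.
function buildPongReply(pingPayload) {
    // Control frames may carry at most 125 bytes of payload,
    // so the length always fits into the 7 length bits of the second byte.
    if (pingPayload.length > 125)
        throw new Error("Control frame payload is too large");
    // First byte: FIN bit set + Pong opcode. Second byte: mask bit clear + length.
    const header = Buffer.from([0x80 | PONG, pingPayload.length]);
    return Buffer.concat([header, pingPayload]);
}

console.log(buildPongReply(Buffer.from("Hello"))); // <Buffer 8a 05 48 65 6c 6c 6f>
```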

The highest bit of the second byte is the masking bit, and it should be set whenever the payload is masked. The protocol states that all the frames sent from the client to the server must be masked and that all the frames sent from the server to the client must not be masked (Section 10.3 of the protocol explains why).

The remaining 7 bits of the second byte contain the length of the actual data that was sent (the payload). If the data is up to 125 bytes in length, its size is encoded directly into those 7 bits. If the data is larger than 125 bytes, those bits contain a special value to indicate that, and the actual length is encoded in the following way:

  • If the value of these 7 bits is 126, the length of the payload is encoded into the next 2 bytes, as a 16-bit unsigned integer in network byte order.
  • If the value of these 7 bits is 127, the length of the payload is encoded into the next 8 bytes, as a 64-bit unsigned integer in network byte order.

By using the above method, the protocol ensures that the minimum required number of bytes is used to encode the length of each frame’s payload. There is no need to send extra length bytes over the wire for frames whose payload is 125 bytes or less.
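To see what this means in practice, the following sketch (JavaScript for Node.js, with a made-up headerSize helper) computes how many header bytes a frame needs for a given payload length, with and without a masking key:

```javascript
// Number of header bytes needed for a frame with the given payload length.
function headerSize(payloadLength, isMasked) {
    let size = 2;          // first byte + (mask bit and 7 length bits)
    if (payloadLength > 0xffff)
        size += 8;         // the 7 bits hold 127, the length is in the next 8 bytes
    else if (payloadLength > 125)
        size += 2;         // the 7 bits hold 126, the length is in the next 2 bytes
    if (isMasked)
        size += 4;         // masking key
    return size;
}

[0, 125, 126, 65535, 65536].forEach(function (length) {
    console.log(length, headerSize(length, false), headerSize(length, true));
});
```

For payload lengths of 0, 125, 126, 65535 and 65536 bytes, this prints header sizes of 2, 2, 4, 4 and 10 bytes for unmasked frames, and 4 bytes more for masked ones.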

The next 4 bytes in the frame (starting with the 3rd, 5th, or 11th byte, depending on the number of bytes that were used to encode the payload’s length) are used to store the key that should be used to unmask the payload. These 4 bytes will be used only if the masking bit (highest bit of second byte) was set.

The remaining bytes in the frame (whose number is determined by the payload length) store the actual data, and should be interpreted according to the frame’s opcode. As stated, if the frame is masked, those bytes should first be unmasked by XOR-ing each payload byte with the corresponding byte of the masking key (byte i of the payload with byte i mod 4 of the key):

    // Unmasks the payload in place by XOR-ing each byte with the matching byte of the 4-byte masking key.
    // Note: This assumes that maskingKey was read from the frame with BitConverter.ToInt32,
    // so that BitConverter.GetBytes returns the key's bytes in their original order.
    private static void UnMask(byte[] payload, int maskingKey)
    {
        byte[] byteKeys = BitConverter.GetBytes(maskingKey);
        for (int index = 0; index < payload.Length; ++index)
        {
            payload[index] = (byte)(payload[index] ^ byteKeys[index % 4]);
        }
    }
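Putting the pieces of the frame format together, here is a rough sketch of a parser for a single, complete frame, written in JavaScript for Node.js. The parseFrame function is made up for this example and is not the parser used in the demo server. It is tried on the masked "Hello" text frame that appears as an example in RFC 6455:

```javascript
// Parses a single, complete frame from a Buffer.
function parseFrame(buffer) {
    const fin = (buffer[0] & 0x80) !== 0;
    const opcode = buffer[0] & 0x0f;
    const isMasked = (buffer[1] & 0x80) !== 0;

    // Payload length: 7 bits, or the next 2 or 8 bytes in network byte order.
    let length = buffer[1] & 0x7f;
    let offset = 2;
    if (length === 126) {
        length = buffer.readUInt16BE(2);
        offset = 4;
    } else if (length === 127) {
        // Number() is exact only up to 2^53 - 1, which is plenty for a sketch.
        length = Number(buffer.readBigUInt64BE(2));
        offset = 10;
    }

    // The masking key is present only if the masking bit is set.
    let maskingKey = null;
    if (isMasked) {
        maskingKey = buffer.subarray(offset, offset + 4);
        offset += 4;
    }

    // Copy the payload so that unmasking does not modify the input buffer.
    const payload = Buffer.from(buffer.subarray(offset, offset + length));
    if (isMasked) {
        for (let i = 0; i < payload.length; i++)
            payload[i] ^= maskingKey[i % 4];
    }

    return { fin: fin, opcode: opcode, isMasked: isMasked, payload: payload };
}

const frame = parseFrame(Buffer.from(
    [0x81, 0x85, 0x37, 0xfa, 0x21, 0x3d, 0x7f, 0x9f, 0x4d, 0x51, 0x58]));
console.log(frame.opcode, frame.payload.toString("utf8")); // 1 Hello
```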

The Closing Handshake

When one of the endpoints decides to close the connection, it should send a frame with the 0x8 opcode. The payload of this frame can contain a status code (the protocol predefines a set of codes, such as 1000 for a normal closure) followed by a textual reason. Upon receiving a closing frame, the other endpoint must reply with its own closing frame. After both sides of the connection have sent and received a closing frame, the underlying TCP connection can be closed.
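In the payload of a closing frame, the status code takes the first 2 bytes (an unsigned integer in network byte order) and the UTF-8 encoded reason follows it. A minimal sketch of building and reading such a payload, in JavaScript for Node.js (both helper names are made up for this example):

```javascript
// Builds the payload of a Close frame: a 2-byte status code in network
// byte order, followed by an optional UTF-8 encoded reason.
function buildClosePayload(code, reason) {
    const reasonBytes = Buffer.from(reason || "", "utf8");
    // Close is a control frame, so code + reason must fit into 125 bytes.
    if (reasonBytes.length > 123)
        throw new Error("Close reason is too long");
    const payload = Buffer.alloc(2 + reasonBytes.length);
    payload.writeUInt16BE(code, 0);
    reasonBytes.copy(payload, 2);
    return payload;
}

// Reads the status code and the reason back from a Close frame's payload.
function parseClosePayload(payload) {
    // An empty payload is allowed, in which case there is no status code.
    if (payload.length < 2)
        return { code: null, reason: "" };
    return { code: payload.readUInt16BE(0), reason: payload.toString("utf8", 2) };
}

const closePayload = buildClosePayload(1000, "bye"); // 1000 = normal closure
console.log(closePayload); // <Buffer 03 e8 62 79 65>
console.log(parseClosePayload(closePayload)); // { code: 1000, reason: 'bye' }
```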

Server Implementation

You cannot fully understand something without getting your hands dirty. So, to sum up my learning, I decided to write my own echo server in C# (the ‘hello world’ of WebSocket), along with a corresponding client in JavaScript. I will walk you through the implementation, which I hope will help you better understand the WebSocket protocol.

The core of the server implementation is the WebSocketServer class, which uses the TcpListener class to accept new TCP connections. Upon receiving a new connection, the class checks whether this is a WebSocket connection by looking for the opening handshake:

    private void OnAcceptClient(IAsyncResult asyncResult)
    {
        if (!m_isStarted)
            return;

        TcpClient client = m_tcpListener.EndAcceptTcpClient(asyncResult);
        ReceiveClientHandshake(client);

        m_tcpListener.BeginAcceptTcpClient(OnAcceptClient, null);
    }

    private void ReceiveClientHandshake(TcpClient client)
    {
        var buffer = new byte[1024];
        var socketAsyncEventArgs = new SocketAsyncEventArgs();

        socketAsyncEventArgs.UserToken = client;
        socketAsyncEventArgs.Completed += OnHandshakeReceived;
        socketAsyncEventArgs.SetBuffer(buffer, 0, buffer.Length);

        bool isAsync = client.Client.ReceiveAsync(socketAsyncEventArgs);
        if (!isAsync)
            OnHandshakeReceived(client.Client, socketAsyncEventArgs);
    }

    private void OnHandshakeReceived(object sender, SocketAsyncEventArgs e)
    {
        var client = (TcpClient) e.UserToken;

        int numberOfBytesReceived = e.SocketError != SocketError.Success ? 0 : e.BytesTransferred;
        if (numberOfBytesReceived <= 0)
        {
            client.Client.Shutdown(SocketShutdown.Both);
            client.Close();
            return;
        }

        // Note: We're working under the assumption that the entire handshake arrives in a single read
        string data = Encoding.UTF8.GetString(e.Buffer, 0, e.BytesTransferred);
        string handshakeString = OpeningHandshakeHandler.CreateServerHandshake(data);
        if (String.IsNullOrEmpty(handshakeString))
        {
            client.Client.Shutdown(SocketShutdown.Both);
            client.Close();
            return;
        }

        byte[] handshakeBytes = Encoding.UTF8.GetBytes(handshakeString);
        SendHandshake(client, handshakeBytes);
    }

The OnAcceptClient method is called for each new client connection. It calls the ReceiveClientHandshake method to asynchronously receive data from the new TCP connection and then continues to wait for other connections. When data is received from the TCP connection, the OnHandshakeReceived method is called. In this method, we check whether any bytes were received and whether the received data is a valid client handshake (by using the OpeningHandshakeHandler class), and close the connection if not. Otherwise, we take the response (which was created by the OpeningHandshakeHandler class as well) and send it back to the client:

    private void SendHandshake(TcpClient client, byte[] handshake)
    {
        var sendEventArgs = new SocketAsyncEventArgs();

        sendEventArgs.UserToken = client;
        sendEventArgs.SetBuffer(handshake, 0, handshake.Length);
        sendEventArgs.Completed += OnHandshakeSendCompleted;

        // SendAsync returns false when the operation completed synchronously,
        // in which case the Completed event is not raised
        bool isAsync = client.Client.SendAsync(sendEventArgs);
        if (!isAsync)
            OnHandshakeSendCompleted(client.Client, sendEventArgs);
    }

    private void OnHandshakeSendCompleted(object sender, SocketAsyncEventArgs e)
    {
        var client = (TcpClient)e.UserToken;

        var clientConnection = new ClientConnection(Guid.NewGuid(), client);
        clientConnection.Disconnected += OnClientDisconnected;

        m_clients.TryAdd(clientConnection.Id, clientConnection);
        OnClientConnected(clientConnection);
    }

In the SendHandshake method, we asynchronously send the server’s response to the client. When the send operation completes, the OnHandshakeSendCompleted method is called. At that point, the opening handshake is done and we have a valid WebSocket connection. A new instance of the ClientConnection class is created and the OnClientConnected event is raised to indicate that we have a new client.
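The OpeningHandshakeHandler class is not listed in this post. To give a rough idea of the work that such a class has to do, here is a sketch in JavaScript for Node.js. It is not the code from the demo server: the createServerHandshake function only borrows its name from the C# method above, and the set of checks it performs is an assumption made for this example:

```javascript
const crypto = require("crypto");

const KEY_SUFFIX = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11";

// Returns the text of the server's handshake response, or null if the
// request is not a valid WebSocket opening handshake.
function createServerHandshake(requestText) {
    const lines = requestText.split("\r\n");
    if (!/^GET \S+ HTTP\/1\.1$/.test(lines[0]))
        return null;

    // Header names are case-insensitive, so they are stored in lower case.
    const headers = {};
    lines.slice(1).forEach(function (line) {
        const separator = line.indexOf(":");
        if (separator > 0)
            headers[line.slice(0, separator).trim().toLowerCase()] = line.slice(separator + 1).trim();
    });

    const key = headers["sec-websocket-key"];
    const isUpgrade =
        (headers["upgrade"] || "").toLowerCase() === "websocket" &&
        (headers["connection"] || "").toLowerCase().indexOf("upgrade") !== -1;
    if (!isUpgrade || !key || headers["sec-websocket-version"] !== "13")
        return null;

    const accept = crypto.createHash("sha1").update(key + KEY_SUFFIX, "utf8").digest("base64");
    return "HTTP/1.1 101 Switching Protocols\r\n" +
        "Upgrade: websocket\r\n" +
        "Connection: Upgrade\r\n" +
        "Sec-WebSocket-Accept: " + accept + "\r\n\r\n";
}

const request =
    "GET / HTTP/1.1\r\n" +
    "Host: 127.0.0.1:54321\r\n" +
    "Upgrade: websocket\r\n" +
    "Connection: Upgrade\r\n" +
    "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==\r\n" +
    "Sec-WebSocket-Version: 13\r\n\r\n";
console.log(createServerHandshake(request));
```

Returning null for anything that does not look like a handshake mirrors the way the server code above closes the connection when it gets an empty response string.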

The ClientConnection class encapsulates a single WebSocket connection, and is the class with which we interact for the entire lifetime of the connection. Its public API exposes methods to send data, and to disconnect, as well as events which are raised when data is received or when the client disconnects:

    public class ClientConnection
    {
        public Guid Id { get; private set; }

        public event Action<ClientConnection, string> ReceivedTextualData;
        public event Action<ClientConnection, byte[]> ReceivedBinaryData;
        public event Action<ClientConnection> Disconnected;

        public ClientConnection(Guid id, TcpClient tcpClient) {…}
        public void StartReceiving() {…}
        public void Send(byte[] data) {…}
        public void Send(string data) {…}
        public void Disconnect(){…}
    }

The first method that needs to be called on a new ClientConnection instance is the StartReceiving method, which asynchronously receives data from the client:

    public void StartReceiving()
    {
        var buffer = new byte[1024];
        var socketAsyncEventArgs = new SocketAsyncEventArgs();

        socketAsyncEventArgs.Completed += OnDataReceived;
        socketAsyncEventArgs.SetBuffer(buffer, 0, buffer.Length);

        bool isAsync = m_tcpClient.Client.ReceiveAsync(socketAsyncEventArgs);
        if (!isAsync)
            OnDataReceived(m_tcpClient, socketAsyncEventArgs);
    }

    private void OnDataReceived(object sender, SocketAsyncEventArgs e)
    {
        if (!m_isConnected)
            return;

        int numberOfBytesReceived = e.SocketError != SocketError.Success ? 0 : e.BytesTransferred;
        if (numberOfBytesReceived <= 0)
        {
            Disconnect();
            return;
        }

        if (HandleFrame(e))
            StartReceiving();
    }

When new data arrives, the OnDataReceived method is called. In it, we handle the received frame and call StartReceiving again to receive the next frame. Note that, to keep things simple, the code assumes that each read returns exactly one complete frame. Let’s take a look at the HandleFrame method:

    private bool HandleFrame(SocketAsyncEventArgs args)
    {
        Frame frame = Frame.FromBuffer(args.Buffer);

        if (frame.Opcode == Frame.Opcodes.Close)
        {
            Disconnect();
            return false;
        }

        // Note: No support for fragmented messages
        if (frame.Opcode == Frame.Opcodes.Binary)
        {
            // Copy the event to a local variable and check it for null,
            // since raising an event that has no subscribers throws a NullReferenceException
            var binaryHandler = ReceivedBinaryData;
            if (binaryHandler != null)
                binaryHandler(this, frame.UnmaskedPayload);
        }
        else if (frame.Opcode == Frame.Opcodes.Text)
        {
            string textContent = Encoding.UTF8.GetString(frame.UnmaskedPayload, 0, (int)frame.PayloadLength);
            var textHandler = ReceivedTextualData;
            if (textHandler != null)
                textHandler(this, textContent);
        }

        return true;
    }

In the HandleFrame method, we use the Frame class to parse the received bytes into something that is easier to work with and raise the correct events based on the frame’s opcode. Note that we return false in case the received frame was a closing frame. In that case, the OnDataReceived method will not call the StartReceiving method again.
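The receiving code above hands the content of every read to the frame parser as is. TCP is a stream of bytes, though, so a single read may end in the middle of a frame or contain several frames. One possible way to deal with this is to accumulate the received bytes and cut complete frames out of them. Here is a minimal sketch of the idea in JavaScript for Node.js; both function names are made up for this example:

```javascript
// Returns the total size of the frame at the start of the buffer,
// or -1 if the frame has not fully arrived yet.
function completeFrameSize(buffer) {
    if (buffer.length < 2) return -1;
    let length = buffer[1] & 0x7f;
    let headerSize = 2;
    if (length === 126) {
        if (buffer.length < 4) return -1;
        length = buffer.readUInt16BE(2);
        headerSize = 4;
    } else if (length === 127) {
        if (buffer.length < 10) return -1;
        length = Number(buffer.readBigUInt64BE(2));
        headerSize = 10;
    }
    if ((buffer[1] & 0x80) !== 0) headerSize += 4; // masking key
    const total = headerSize + length;
    return buffer.length >= total ? total : -1;
}

// Splits the bytes received so far into complete frames and a remainder
// that should be kept until more data arrives.
function splitFrames(pending) {
    const frames = [];
    let size;
    while ((size = completeFrameSize(pending)) !== -1) {
        frames.push(pending.subarray(0, size));
        pending = pending.subarray(size);
    }
    return { frames: frames, remainder: pending };
}

// Two complete unmasked "Hello" frames followed by the start of a third one
const hello = Buffer.from([0x81, 0x05, 0x48, 0x65, 0x6c, 0x6c, 0x6f]);
const received = Buffer.concat([hello, hello, hello.subarray(0, 3)]);
const result = splitFrames(received);
console.log(result.frames.length, result.remainder.length); // 2 3
```

The remainder would be prepended to the bytes of the next read before calling splitFrames again.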

The Frame class itself is just a way to encapsulate the frame’s parsing and byte-splitting code into something more elegant:

    public class Frame
    {
        public enum Opcodes
        {
            Continuation = 0x0,
            Text = 0x1,
            Binary = 0x2,
            Close = 0x8,
            Ping = 0x9,
            Pong = 0xA
        }

        public bool IsFin { get; private set; }
        public bool IsMasked { get; private set; }
        public ulong PayloadLength { get; private set; }
        public int MaskingKey { get; private set; }
        public byte[] UnmaskedPayload { get; private set; }
        public Opcodes Opcode { get; private set; }

        public Frame(Opcodes opcode, byte[] payload, bool isFin) {...}
        public byte[] ToBuffer() {...}

        public static Frame FromBuffer(byte[] buffer) {...}
    }

With the Frame class in place, receiving and sending frames becomes much simpler. Take a look, for example, at the Send methods of the ClientConnection class:

    public void Send(byte[] data)
    {
        var frame = new Frame(Frame.Opcodes.Binary, data, true);
        Send(frame);
    }

    public void Send(string data)
    {
        var frame = new Frame(Frame.Opcodes.Text, Encoding.UTF8.GetBytes(data), true);
        Send(frame);
    }

    private void Send(Frame frame)
    {
        if (!m_isConnected)
            return;

        byte[] buffer = frame.ToBuffer();

        var sendEventArgs = new SocketAsyncEventArgs();
        sendEventArgs.SetBuffer(buffer, 0, buffer.Length);

        m_tcpClient.Client.SendAsync(sendEventArgs);
    }

In those methods, we just create a new Frame instance with the correct opcode and data and let it handle all the rest (note the call to ToBuffer).
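The body of the ToBuffer method is not listed in this post. As a rough idea of what serializing a server-side (and therefore unmasked) frame involves, here is a sketch in JavaScript for Node.js. The encodeFrame function is made up for this example and is not a translation of the actual C# code:

```javascript
// Serializes an unmasked frame, as a server would send it.
function encodeFrame(opcode, payload, isFin) {
    const firstByte = (isFin ? 0x80 : 0x00) | opcode;
    let header;
    if (payload.length <= 125) {
        // The length fits into the 7 length bits of the second byte.
        header = Buffer.from([firstByte, payload.length]);
    } else if (payload.length <= 0xffff) {
        // 126 marks a 16-bit length in the next 2 bytes.
        header = Buffer.alloc(4);
        header[0] = firstByte;
        header[1] = 126;
        header.writeUInt16BE(payload.length, 2);
    } else {
        // 127 marks a 64-bit length in the next 8 bytes.
        header = Buffer.alloc(10);
        header[0] = firstByte;
        header[1] = 127;
        header.writeBigUInt64BE(BigInt(payload.length), 2);
    }
    return Buffer.concat([header, payload]);
}

console.log(encodeFrame(0x1, Buffer.from("Hello", "utf8"), true)); // <Buffer 81 05 48 65 6c 6c 6f>
```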

That’s pretty much it. With the WebSocketServer, ClientConnection, OpeningHandshakeHandler, and Frame classes doing the heavy lifting, creating an echo server is as simple as it gets:

    public class EchoServer
    {
        private readonly WebSocketServer m_server;

        public EchoServer(IPAddress address, int port)
        {
            m_server = new WebSocketServer(address, port);
            m_server.OnClientConnected += OnClientConnected;
        }

        public void Start()
        {
            m_server.Start();
        }

        public void Stop()
        {
            m_server.Stop();
        }

        private void OnClientConnected(ClientConnection client)
        {
            client.ReceivedTextualData += OnReceivedTextualData;
            client.Disconnected += OnClientDisconnected;
            client.StartReceiving();
        }

        private void OnClientDisconnected(ClientConnection client)
        {
            client.ReceivedTextualData -= OnReceivedTextualData;
            client.Disconnected -= OnClientDisconnected;
        }

        private void OnReceivedTextualData(ClientConnection client, string data)
        {
            client.Send(data);
        }
    }

I don’t think that the above code requires any explanation. The complete server code can be found on GitHub.

Client Implementation

The client implementation is much simpler and is based on the WebSocket API, which is supported in the latest releases of all the major browsers.

In order to focus on the important parts, I kept the page layout as simple as possible:

    <!doctype html>
    <title>WebSocket Client</title>
    <meta charset="utf-8">
    <body>

        <div>
            <span>Server Address : </span>
            <input id=serverAddress type=text value=ws://127.0.0.1:54321>
            <button id=connect onclick=connect()>Connect</button>
            <button id=disconnect onclick=disconnect() disabled>Disconnect</button>
        </div>

        <div id=messageInputContainer style="visibility: collapse">
            <span>Enter Message : </span>
            <input id=message type=text>
            <button onclick=send()>Send</button>
        </div>

        <div id=messages>

        </div>
    </body>

This gives us the following page:

[Figure: ClientPageLayout]

When the user clicks the Connect button, the connect function is called:

    var webSocket;

    var connect = function(){
        var serverAddressInput = document.getElementById("serverAddress");
        var address = serverAddressInput.value;
        webSocket = new WebSocket(address);

        webSocket.onopen = function(e) {
            changeState(true);
            log("Connection open...");
        };

        webSocket.onerror = function (e) {
            changeState(false);
            log("Connection error...");
        };

        webSocket.onmessage = function(e){
            if(typeof e.data === "string")
                log("Received : " + e.data);
            else
                log("Binary message received...");
        };

        webSocket.onclose = function(e){
            log("Connection Closed...");
            changeState(false);
        };
    };

In it, we create a new WebSocket instance, giving it the server’s address as a parameter. Then, we add handlers for the following events: onopen, which is fired when the connection is established, onerror, which is fired when an error has occurred, onmessage, which is fired when a new message from the server has arrived, and onclose, which is fired when the connection has been closed. In these event handlers we print a message to the screen by using the log function and, when the state of the connection changes, update the DOM elements by using the changeState function:

    var log = function(message){
        // Wrap the message in a text node, so that it is never interpreted as HTML
        var text = document.createTextNode(message);
        var div = document.createElement('div');
        div.appendChild(text);

        document.getElementById("messages").appendChild(div);
    };

    var changeState = function(isConnected){
        var container = document.getElementById("messageInputContainer");
        container.style.visibility=isConnected?"visible":"collapse";

        var connectButton = document.getElementById("connect");
        connectButton.disabled = isConnected;

        var disconnectButton = document.getElementById("disconnect");
        disconnectButton.disabled = !isConnected;
    };

When the user clicks the Send button, the send function is called. In it, we check that the connection is open and, if it is, we send the message to the server by calling the send function on the webSocket instance:

    var send = function() {
        // WebSocket.OPEN is the named constant for the readyState value 1
        if (webSocket.readyState !== WebSocket.OPEN) {
            log("Cannot send data when the connection is closed...");
            return;
        }
        var messageInput = document.getElementById("message");
        var message = messageInput.value;
        log("Sending : " + message);
        webSocket.send(message);
    };
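The readyState property can hold one of four values that are defined by the WebSocket API: 0 (CONNECTING), 1 (OPEN), 2 (CLOSING), and 3 (CLOSED). If you would like the log to tell which state the connection was in when a send was refused, a small helper along these lines could do the job (the describeReadyState name is made up for this example):

```javascript
// The names of the readyState values defined by the WebSocket API,
// indexed by their numeric value.
var READY_STATES = ["CONNECTING", "OPEN", "CLOSING", "CLOSED"];

// Hypothetical helper: returns a human readable name for a readyState value.
var describeReadyState = function (readyState) {
    return READY_STATES[readyState] || "UNKNOWN";
};

console.log(describeReadyState(1)); // OPEN
console.log(describeReadyState(3)); // CLOSED
```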

Finally, when the user clicks the Disconnect button, the disconnect function is called. In it, we just call the close function on the webSocket instance:

    var disconnect = function(){
        log("Closing connection...");
        webSocket.close();
    };

That’s it. We have a fully functional WebSocket server, implemented in C#, and a corresponding client based on the HTML5 WebSocket API. You can run both the client and the server on your local machine. I urge you to download the code and start experimenting with it. Have fun!

Cross posted from http://www.programmingtidbits.com/post/2013/08/24/Learn-HTML5-with-Me-WebSocket-from-Scratch.aspx
