Creating high-performance UDP servers on Windows and Linux
August 29, 2018 Allen Drennan
There is little information available on the Internet about building highly scalable UDP servers, and what does exist often falls short of best practices. UDP servers are the backbone of many video game servers and streaming services, yet very few good examples or discussions exist on how to construct them on Windows and Linux. This article covers advanced topics related to UDP servers and assumes the reader already has some understanding of threads, sockets and the available APIs.
Most implementations revolve around the standard socket APIs, RecvFrom() and SendTo() (WSARecvFrom() and WSASendTo() on Windows). They are relatively easy to understand and there are plenty of examples. RecvFrom() typically receives a datagram on a well-known listening port and provides you the socket address of the sender. SendTo() simply sends a datagram, usually to the socket address previously provided by RecvFrom(). Many UDP server implementations start with these basic APIs and build everything else around them.
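For reference, a minimal blocking version of this model looks something like the sketch below (POSIX-style C; the port number and buffer size are arbitrary choices for illustration):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    // Create the well-known listening socket and bind it to port 5000.
    int listener = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5000);
    bind(listener, (struct sockaddr *)&addr, sizeof(addr));

    char buf[1500];
    for (;;) {
        // RecvFrom() gives us the datagram and the sender's address.
        struct sockaddr_in peer;
        socklen_t peerlen = sizeof(peer);
        ssize_t n = recvfrom(listener, buf, sizeof(buf), 0,
                             (struct sockaddr *)&peer, &peerlen);
        if (n < 0)
            break;
        // SendTo() replies using the address RecvFrom() provided.
        sendto(listener, buf, (size_t)n, 0,
               (struct sockaddr *)&peer, peerlen);
    }
    close(listener);
    return 0;
}
```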
If you are using a simple communication model, you might have an event loop in a thread that simply calls RecvFrom(), with multiple threads to handle parallel I/O. This is relatively efficient, and from a pure communication-only perspective it is the fastest approach, but it introduces issues as you start to build your application logic. The first issue you may encounter is the need to keep track of client sessions (or pseudo streams) and route each incoming datagram to the proper session object. Most applications using UDP require this at some level of their logic, and security libraries such as DTLS (datagram TLS) require you to maintain security state for each client session. If you are working with socket addresses, you would probably create a hash table to map each socket address to its session object, and that table would need some locking mechanism to maintain integrity. Suddenly you are performing a lot of extra processing for every datagram you receive, and performance begins to suffer, as the sketch below illustrates.
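To make that overhead concrete: every datagram in the socket-address model pays for something like the lookup below before any application work begins. The session structure and table here are hypothetical, purely for illustration:

```c
#include <netinet/in.h>
#include <pthread.h>
#include <stddef.h>

// Hypothetical session object and table, for illustration only.
struct session { struct sockaddr_in peer; int in_use; };

#define MAX_SESSIONS 1024
static struct session g_sessions[MAX_SESSIONS];
static pthread_mutex_t g_map_lock = PTHREAD_MUTEX_INITIALIZER;

// Every datagram pays for a lock acquisition and an address lookup
// before any application work can begin.
struct session *route_datagram(const struct sockaddr_in *peer)
{
    struct session *found = NULL;
    pthread_mutex_lock(&g_map_lock);
    for (int i = 0; i < MAX_SESSIONS; i++) {
        if (g_sessions[i].in_use &&
            g_sessions[i].peer.sin_addr.s_addr == peer->sin_addr.s_addr &&
            g_sessions[i].peer.sin_port == peer->sin_port) {
            found = &g_sessions[i];
            break;
        }
    }
    pthread_mutex_unlock(&g_map_lock);
    return found;
}
```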
You can use Windows APIs such as I/O completion ports (IOCP) and Registered I/O (RIO), and EPoll on Linux, to improve performance. They can be used asynchronously and in a non-blocking fashion. However, these APIs work with socket handles, not socket addresses, and since UDP is connection-less there is a widely held misunderstanding that UDP cannot, or should not, work with socket handles.
In fact, UDP can work with socket handles. UDP socket handles work well with asynchronous APIs such as IOCP and EPoll, and they perform substantially better internally (inside the kernel). They also help you avoid complicated application logic, such as the locks and hash tables needed to maintain state or look up session objects when using things like DTLS. If you are using socket addresses with RecvFrom() and SendTo(), you are not leveraging the full performance benefits of these APIs for scalable UDP servers.
Overview
In order to use socket handles with UDP you need to use the Connect() socket API. This is also where developers usually abandon the effort. First, we have all been taught that UDP is connection-less (and it is), so why would you want to Connect() it? Second, the steps required to properly set up a socket handle for UDP to both send and receive on a server are fairly confusing, and if you don't get them right it will never work. I believe this is a primary reason why so many implementations stick with RecvFrom() and socket addresses: it is easy to understand. There are also upper limits on the number of socket handles that can be open at one time, but this is unlikely to be your bottleneck on any given server.
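For readers unfamiliar with Connect() on a datagram socket, a minimal sketch follows. Calling connect() on a UDP socket performs no handshake; it simply fixes the default peer address, after which plain Send()/Recv() can be used and the kernel filters inbound datagrams to that peer (the address and port below are placeholders):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

    // The peer address; 192.0.2.1:5000 is a placeholder (TEST-NET-1).
    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(5000);
    inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr);

    // No packets are exchanged here; connect() only records the
    // default destination and installs a filter for inbound datagrams.
    connect(s, (struct sockaddr *)&peer, sizeof(peer));

    // After connect(), send()/recv() replace sendto()/recvfrom();
    // the socket handle itself now identifies the peer.
    const char msg[] = "hello";
    send(s, msg, sizeof(msg) - 1, 0);

    close(s);
    return 0;
}
```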
The asynchronous socket APIs, IOCP and RIO on Windows and EPoll on Linux, are designed to be very efficient with socket handles. If you can relate a client session to a socket handle, these APIs can send and receive directly, using the same approach you would use for a TCP session. Consider that last statement for a moment, because it is important. If you use socket handles for both TCP and UDP, you can unify a great deal of communication logic and client session objects across both protocols. This is another important benefit of using socket handles instead of socket addresses: with handles you have a uniform architecture for your communication and application logic.
Besides leading to more consistent and straightforward code, socket handles perform better. The kernel processes datagrams more efficiently when they are related to a socket handle, because of the structure of the internal routing tables (see "UDP Performance", p. 255, Unix Network Programming by W. Richard Stevens). When you send via a socket address on an unconnected socket, the kernel internally does a lookup and connects the socket handle, sends the datagram, then disconnects the socket handle again. This overhead can substantially reduce datagram throughput. Each underlying socket implementation handles this differently and performance can vary by OS revision, but fundamentally socket handles perform better. This is especially true for overlapped and event-driven APIs that work directly with socket handles.
Another major benefit is that IOCP/RIO on Windows and EPoll on Linux allow you to attach extra data to the overlapped operation or the event. Since the socket handle relates directly to a single client session, any stateful information, including the session object itself, can travel with the overlapped operation or event. This is an important distinction: if we can include session information with the operation, we can avoid many locks and hash table lookups. A properly architected IOCP/RIO server can do this and avoid thread contention and race conditions. A full discussion of this topic is beyond the scope of this article, but as long as you only have a single pending overlapped read at a time, you do not have to lock your session object with IOCP, regardless of how many I/O threads are running. This isn't entirely true for EPoll servers, since EPoll's oneshot behavior is inconsistent.
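On Windows, the usual way to attach per-session data is to embed the OVERLAPPED structure at the start of your own context record, so the pointer returned by GetQueuedCompletionStatus() can be cast back to the full record. A minimal sketch, with a hypothetical session field:

```c
#include <winsock2.h>
#include <windows.h>

// Hypothetical per-session context; the OVERLAPPED member must come
// first so an LPOVERLAPPED from the completion port can be cast back.
typedef struct _SESSION_OP {
    OVERLAPPED overlapped;   // passed to WSARecv()/WSASend()
    SOCKET     socket;       // the connected UDP client socket
    void      *session;      // application session object
    WSABUF     buffer;       // data buffer for this operation
} SESSION_OP;

// In the I/O thread, recover the session without any lookup or lock:
//
//   DWORD bytes; ULONG_PTR key; LPOVERLAPPED ov;
//   GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE);
//   SESSION_OP *op = (SESSION_OP *)ov;   // no hash table required
```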
Back to the topic at hand: if we can allocate a socket handle for each client session, we can leverage all of the aforementioned benefits.
Linux UDP Server
On Linux, the current most scalable approach is to use the EPoll APIs. EPoll has evolved over the years and is quite stable and scalable for both UDP and TCP servers. Additionally, Linux does an excellent job of implementing scalable UDP sockets in the kernel.
Linux I/O Model
A straightforward high-performance I/O model on Linux involves pre-allocating a group of threads whose only purpose is to process I/O in parallel. Each of these threads would be set up with the epoll_ctl() API as edge-triggered (EPOLLET) with oneshot delivery (EPOLLONESHOT). This was the preferred model before Linux kernel 4.5.
Due to potential scaling issues in the EPoll implementation, such as thundering-herd wake-ups where every waiting thread is woken for a single event, more recent kernels (4.5 and later) introduced EPOLLEXCLUSIVE. It is used in conjunction with level-triggered I/O, which is the default.
Either approach works well for building highly scalable UDP servers on Linux. In both models, each of these threads calls epoll_wait() in a loop, as in the sketch below.
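A minimal sketch of one such I/O thread follows, assuming an epoll instance created elsewhere with epoll_create1(); handle_event() is a hypothetical application callback, not a system API:

```c
#include <stddef.h>
#include <sys/epoll.h>

// Hypothetical application handler for a ready socket.
void handle_event(void *data_ptr);

// One I/O worker thread: waits on a shared epoll instance and
// dispatches each ready socket to the application.
void *io_thread(void *arg)
{
    int epfd = *(int *)arg;            // shared epoll descriptor
    struct epoll_event events[64];

    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; i++) {
            // data.ptr carries the per-socket data object we
            // registered with epoll_ctl().
            handle_event(events[i].data.ptr);

            // With EPOLLONESHOT the socket must be re-armed here
            // via epoll_ctl(epfd, EPOLL_CTL_MOD, ...).
        }
    }
    return NULL;
}
```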
This is the basic model of a scalable EPoll server and it is pretty much the same for UDP as it is for TCP.
Using UDP socket handles on Linux
In order to take advantage of socket handles with UDP on Linux, there are numerous steps in the initial setup of the client session. I like to think of this setup process in much the same way as handling an initial accept for a TCP session. Once the UDP session is "accepted", you can continue processing in a highly efficient manner.
To make this all work in a Linux UDP server, you need to follow the steps below (a consolidated code sketch follows the list):
- Create a UDP listening socket using the socket() API. This will be our well-known listening port.
- Obtain a socket address for the UDP listening socket. There are various ways to do this; I typically use getaddrinfo(). We will need the listening socket address in step 7.
- Use SetSockOpt() with SO_REUSEADDR on the listening socket. This is required so that we can later Bind() and Connect() another socket to the same address.
- Use epoll_ctl() with EPOLL_CTL_ADD and EPOLLIN on the listening socket. To initiate listening you start the process by adding the EPOLLIN event flag. Along with this event you should also include a pointer to a data object (epoll_data.ptr). Your data object should have a flag indicating whether or not a session object has already been allocated; we examine this flag with every event we receive.
- Use epoll_wait() to wait for your events in a loop.
- If you receive an EPOLLIN event, examine the flag inside the data object (epoll_data.ptr) to see whether a session object needs to be allocated.
- If this is an EPOLLIN event without a session object, then:
- Use RecvFrom() to obtain the client socket address. We also need to keep this first data buffer we received, so we can pass it up to the application layer once the client session and socket handle are set up.
- Create a new UDP socket using the socket() API. This new socket will be assigned to the client session. It should match the listening socket's family, socket type and protocol. This is our client socket.
- Use SetSockOpt() with SO_REUSEADDR on the client socket. This is also required.
- Bind() the client socket to the socket address of the listening socket. On Linux this essentially transfers responsibility for receiving the client session's data from the well-known listening socket to the newly allocated client socket. It is important to note that this behavior is not the same on other platforms such as Windows (unfortunately).
- Connect() the client socket to the client's socket address. This is the address received from RecvFrom(), not the listening socket address. This sets up the socket so that data can be sent to the client session over the new client socket with the Send() API.
- Finally, use epoll_ctl() with EPOLL_CTL_ADD and EPOLLIN on the client socket. Along with this event, include a pointer to your session object (epoll_data.ptr).
- If this is an EPOLLIN event with a session object, then:
- Use Recv() to read the data from the socket. We no longer need RecvFrom(), since the client socket is already allocated and the client session object already exists.
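Pulling those steps together, a condensed sketch of this "UDP accept" path might look like the following. Error handling is omitted, and struct session is a hypothetical application type:

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>

// Hypothetical session object (application-defined).
struct session { int fd; int has_session; };

// Handle an EPOLLIN event on the well-known listening socket:
// "accept" the datagram's sender as a new UDP session.
int udp_accept(int epfd, int listener,
               const struct sockaddr_in *listen_addr,
               char *buf, size_t buflen, struct session *s)
{
    // Read the first datagram; keep the buffer for the application
    // layer and capture the sender's socket address.
    struct sockaddr_in peer;
    socklen_t peerlen = sizeof(peer);
    ssize_t n = recvfrom(listener, buf, buflen, 0,
                         (struct sockaddr *)&peer, &peerlen);
    if (n < 0)
        return -1;

    // Create the client socket, matching the listener's family,
    // socket type and protocol.
    int client = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

    // SO_REUSEADDR so we can bind to the listener's address.
    int on = 1;
    setsockopt(client, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

    // Bind to the listening address; on Linux this moves delivery
    // of this peer's datagrams to the client socket.
    bind(client, (const struct sockaddr *)listen_addr,
         sizeof(*listen_addr));

    // Connect to the peer so plain Send()/Recv() work directly.
    connect(client, (struct sockaddr *)&peer, peerlen);

    // Register the client socket with our session object attached.
    s->fd = client;
    s->has_session = 1;
    struct epoll_event ev;
    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.ptr = s;
    epoll_ctl(epfd, EPOLL_CTL_ADD, client, &ev);

    return (int)n;   // bytes of the first datagram, already in buf
}
```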
Windows UDP Server
On Windows we can use either I/O completion ports (IOCP) or Registered I/O (RIO), the latter being the most scalable approach currently available. The concepts are nearly identical between the two APIs, so we will discuss IOCP primarily.
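For orientation, a minimal sketch of the IOCP plumbing follows: create a completion port, associate each connected UDP client socket with it, and service completions from a pool of I/O worker threads. This pairs with the SESSION_OP sketch shown earlier:

```c
#include <winsock2.h>
#include <windows.h>

// Create one completion port shared by all I/O worker threads.
// (A concurrent-thread hint of 0 lets the kernel pick a value.)
HANDLE create_port(void)
{
    return CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
}

// Associate a connected UDP client socket with the port. The key
// could carry the session pointer, though the OVERLAPPED-extension
// approach shown earlier is usually preferred.
BOOL attach_socket(HANDLE port, SOCKET s, ULONG_PTR key)
{
    return CreateIoCompletionPort((HANDLE)s, port, key, 0) != NULL;
}

// One I/O worker thread: dequeue completions forever.
DWORD WINAPI io_worker(LPVOID param)
{
    HANDLE port = (HANDLE)param;
    for (;;) {
        DWORD bytes = 0;
        ULONG_PTR key = 0;
        LPOVERLAPPED ov = NULL;
        if (!GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE)
            && ov == NULL)
            break;   // the port itself failed or was closed
        // Recover the per-session context from the OVERLAPPED pointer
        // and dispatch to application logic here.
    }
    return 0;
}
```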
Using IOCP for UDP servers seems like a dark art. There is a widely held belief that you must pre-allocate memory buffers in order to receive data. This is not true; it is possible to perform a zero-byte read operation for UDP servers with IOCP.
For highly scalable UDP servers on Windows, memory can be precious, so avoiding pre-allocated memory buffers leads to greater scale. Additionally, pre-allocating memory buffers requires a great deal of extra logic to manage them as hash tables or queues with locking mechanisms. All of this slows down the processing of individual datagrams and is completely unnecessary.
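The article does not show the posting step, but a minimal sketch of a zero-byte overlapped receive might look like the following. Note that the precise Winsock semantics of zero-byte reads on UDP sockets (for example, how the pending datagram is surfaced after the completion) are exactly the subtle part, so treat this purely as an outline of the posting call:

```c
#include <winsock2.h>

// Post an overlapped zero-byte receive on a connected UDP socket.
// No data buffer is committed up front; the completion merely signals
// that a datagram has arrived for this session.
int post_zero_byte_read(SOCKET s, LPWSAOVERLAPPED ov)
{
    WSABUF wsabuf;
    wsabuf.len = 0;        // zero-length buffer: nothing pre-allocated
    wsabuf.buf = NULL;
    DWORD flags = 0;
    int rc = WSARecv(s, &wsabuf, 1, NULL, &flags, ov, NULL);
    if (rc == SOCKET_ERROR && WSAGetLastError() != WSA_IO_PENDING)
        return -1;         // immediate failure
    return 0;              // pending, or completed immediately
}
```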
Note: unfortunately, some aspects of how socket handles work on Unix and Linux do not work properly on Windows. More on that topic later.