Introduction

Network applications are growing more popular by the day. In one of my previous articles I discussed the importance of choosing the right protocol. For any given protocol there is usually a plethora of vendors offering implementations (frameworks or products), which opens the door to evaluating which one fits best. Products are typically evaluated along several dimensions, e.g. throughput, intuitiveness, portability across environments (Windows/Linux/macOS) and developer friendliness. Most often, the performance of a framework is judged on advertised statistics and benchmarks. It is worth remembering the remark attributed to former British prime minister Benjamin Disraeli: “There are three kinds of lies: lies, damned lies, and statistics.” Such benchmarks represent a set of generic scenarios that might not relate to the solution one intends to implement; e.g. for an Extract, Transform, Load (ETL) scenario it would make little sense to rely on a benchmark run against secured proxies. This article elaborates on approaches such as multiple buffers, a shared buffer, zero copy and memory mapping, which network application vendors (e.g. middleware vendors) currently follow to implement protocols. It also describes the situations in which each approach provides the best resource benefits, depending on the solution being addressed. The goal is to highlight the importance of evaluating the protocol design of a given product to judge its suitability for the solution.

Protocol Programming Approaches and Optimizations

Conventional Programming Approach

Programming languages such as Java have introduced NIO APIs for writing network applications. NIO offers channels and buffers to read and write data over the network, as explained in detail here. Libraries such as Netty and MINA make it easier to write network applications that adhere to a standard thread model. Even with these libraries available, certain optimizations can improve performance significantly. Which of them apply depends on factors such as the expected portability, control over data, performance and memory footprint. Taking the Java language offerings as an example, this section elaborates on a few approaches that can be used to optimize resource utilization and discusses their pros and cons. Figure 1 below illustrates a conventional programming approach.

Figure 1 - Conventional Java based NIO Framework Design

As illustrated in Figure 1,

An incoming byte stream is read from the source into a buffer in the OS kernel.
The bytes in the kernel buffer are then copied into a buffer in the user space (the application).
The application may process the bytes and then copy them once more, into an output buffer in the user space.
The bytes are copied from the output (target) buffer back into the kernel space.
The copied bytes are then transferred to the target channel.
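
The steps above can be sketched in a few lines of Java NIO. This is only a minimal illustration under stated assumptions, not how any particular framework implements it: the class MultiBufferRelay and its methods are hypothetical names, and in-memory channels stand in for real sockets so that the example runs on its own. With a socket, each read() and write() call would also cross the user/kernel boundary.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the conventional approach: one input and one output buffer.
final class MultiBufferRelay {

    /** Relays everything from source to target through separate input and output buffers. */
    static long relay(ReadableByteChannel source, WritableByteChannel target, int bufferSize)
            throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        ByteBuffer input = ByteBuffer.allocate(bufferSize);   // user-space read buffer
        ByteBuffer output = ByteBuffer.allocate(bufferSize);  // user-space write buffer
        long total = 0;
        while (source.read(input) != -1) {      // with a socket: kernel buffer -> input buffer
            input.flip();
            output.put(input);                  // stand-in for processing: a second copy in user space
            output.flip();
            while (output.hasRemaining()) {
                total += target.write(output);  // with a socket: output buffer -> kernel -> target
            }
            input.clear();
            output.clear();
        }
        return total;
    }

    /** Demo helper (an assumption for illustration): runs relay() over in-memory channels. */
    static byte[] relayInMemory(byte[] payload, int bufferSize) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            relay(Channels.newChannel(new ByteArrayInputStream(payload)),
                    Channels.newChannel(sink), bufferSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] payload = "hello, network".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(relayInMemory(payload, 8), StandardCharsets.UTF_8));
    }
}
```

Every chunk of data passes through two user-space buffers here, which is exactly the copying overhead the following points refer to.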

While the above is a working model, the following factors should be noted:

As Eli Bendersky explained in his post, context switching and/or switching between user space and kernel space has performance implications.
Copying arrays between buffers consumes CPU and bandwidth.
Reading all the data into a buffer demands more memory, and every additional buffer used within the application demands still more memory to hold the data.

The aforementioned limitations can be controlled and minimized. The next section elaborates on a few ways to optimize the above so that an application performs better with improved resource utilization.

Optimizing Buffer Usage

As discussed in the earlier section, multiple buffers cost memory and CPU (to copy data), which makes that approach resource intensive in comparison to the other approaches described below. Reducing the buffer count is therefore an obvious optimization. WSO2 Pass Through Transport is a tangible example: it reduces the buffer count by using a shared buffer model, which in turn allows it to be one of the fastest Enterprise Service Buses (ESBs) in the market. Figure 2 below illustrates how a shared buffer could be applied.

Figure 2 - Shared buffer for reading and writing

As depicted in Figure 2, the incoming byte stream is read into a single shared buffer in the user space (the application). The data collected in this buffer is channeled directly to the target, reducing the number of times it needs to be copied. Compared with the multiple buffers of Figure 1, this method allows better utilization of both CPU and memory. It can be applied to many scenarios, e.g. secure proxying or header-based routing, where the content needs to be read only partially: the application reads just the header and transfers the remaining, much larger fraction of the content through the single buffer. However, for cases that require observing or processing the entire content within the application, a single buffer will not provide performance or memory benefits.
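
A shared buffer relay could look roughly like the sketch below. It is not taken from WSO2 Pass Through Transport; SharedBufferRelay and its methods are hypothetical names, and in-memory channels again stand in for sockets. The loop follows the read, flip, write, compact idiom described in the ByteBuffer documentation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: one buffer is used for both reading and writing.
final class SharedBufferRelay {

    /** Relays everything from source to target through a single shared buffer. */
    static long relay(ReadableByteChannel source, WritableByteChannel target, int bufferSize)
            throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        ByteBuffer shared = ByteBuffer.allocate(bufferSize);
        long total = 0;
        // Continue until the source is exhausted and the buffer has been fully drained.
        while (source.read(shared) != -1 || shared.position() > 0) {
            shared.flip();                   // switch the same buffer from filling to draining
            total += target.write(shared);   // write straight out of the shared buffer
            shared.compact();                // keep any unwritten bytes, make room for the next read
        }
        return total;
    }

    /** Demo helper (an assumption for illustration): runs relay() over in-memory channels. */
    static byte[] relayInMemory(byte[] payload, int bufferSize) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            relay(Channels.newChannel(new ByteArrayInputStream(payload)),
                    Channels.newChannel(sink), bufferSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] payload = "header: route-a\nlarge body...".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(relayInMemory(payload, 8), StandardCharsets.UTF_8));
    }
}
```

There is no second user-space buffer and no copy between buffers. For header-based routing, the application could inspect the first bytes with an absolute get(index) before writing, which reads from the buffer without moving its position.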

Zero Copy

Zero copy is mainly used to serve static content, e.g. files on disk, over the network. It is also one of the key technologies that make Kafka one of the fastest event streaming platforms, as explained in detail here. The term zero copy is sometimes also used to describe avoiding copies between buffers within the user space; e.g. Netty uses the term to describe its buffer interface, as described here. The focus of this section, however, is Operating System (OS) level zero copy, as explained in detail here.

Figure 3 - OS Level Zero Copy

As illustrated in Figure 3, zero copy completely bypasses the phase of reading data into the application (user space), which allows it to perform much better than the approaches discussed in earlier sections. The flow is fairly straightforward: the application instructs the OS, via a syscall, to transfer data between two file descriptors. In Java the method transferTo() is used; on Linux it calls sendfile(..) in the OS, which then transfers data between a given file and a network channel. More details on zero copy can be found here. Zero copy spares both CPU and memory the excessive copying of data and the context switching between user and kernel spaces.
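
A minimal sketch of the transferTo() call is shown below. ZeroCopySend and its methods are hypothetical names. One caveat is important: the demo helper writes to an in-memory channel so that the example is self-contained, and for such a target the JDK falls back to ordinary copying. The OS-level zero copy path can only be taken when the target is a channel the OS can write to directly, such as a SocketChannel, and the platform supports it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: sends a file to a channel without reading it into application buffers.
final class ZeroCopySend {

    /** Transfers the whole file to target and returns the number of bytes sent. */
    static long send(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = source.size();
            long position = 0;
            // transferTo() may send fewer bytes than requested, so loop until done.
            while (position < size) {
                long sent = source.transferTo(position, size - position, target);
                if (sent <= 0) {
                    break; // no progress, e.g. the file was truncated in the meantime
                }
                position += sent;
            }
            return position;
        }
    }

    /** Demo helper (an assumption for illustration): sends a temp file to an in-memory sink. */
    static byte[] sendTempFile(byte[] content) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            Path file = Files.createTempFile("zero-copy-demo", ".bin");
            try {
                Files.write(file, content);
                send(file, Channels.newChannel(sink));
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] content = "static file content".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(sendTempFile(content), StandardCharsets.UTF_8));
    }
}
```

In a real server the target would be the SocketChannel of the accepted connection. Note that the application never sees the bytes, which is where the performance gain comes from.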

Despite the significant performance gain it brings, zero copy depends on the OS and the underlying hardware. It is most commonly used to transfer static content, e.g. files from disk. There is very little mention of zero copy between network sockets; in Java, at least, only FileChannel supports zero copy. A possible reason is that the splice() syscall, which potentially allows zero copy between network channels, is tightly bound to Linux and also requires the Network Interface Controller (NIC) to support DMA. This makes the approach less portable, and it gives the application no control over the data flowing through. The lack of control also increases complexity, so proper measures should be taken to handle errors and exceptions.

Using Memory Maps

Applications also tend to use mmap(), a POSIX-compliant Unix syscall that maps a file into memory. Like zero copy, this method allows an application to transfer data without first copying the content into a buffer of its own in the user space, and it gains considerable performance benefits as a result. As opposed to zero copy, however, it still requires switching context between the user and kernel spaces. Eduard Brose, in his study, lays out the pros and cons of this approach. He commented mainly on the cost incurred in creating, modifying and flushing the memory map tables, and on the complexity of implementation, i.e. there are exceptions which need to be handled carefully. For large data or files, however, the cost of creating and maintaining the tables is small next to the cost of copying the entire data into the user space, and so becomes negligible. Memory maps share the limitations of zero copy and are also widely criticized for their development complexity.
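
In Java, memory mapping is exposed through FileChannel.map(), which is typically backed by mmap() on Unix-like systems. The sketch below is only an illustration under that assumption; MappedFileSend and its methods are hypothetical names, and an in-memory channel stands in for a socket.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: writes a memory-mapped file to a channel.
final class MappedFileSend {

    /** Maps the whole file read-only and writes the mapped region to target. */
    static long send(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // A single mapping is limited to Integer.MAX_VALUE bytes; larger files
            // would have to be mapped region by region.
            MappedByteBuffer mapped =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long total = 0;
            while (mapped.hasRemaining()) {
                total += target.write(mapped);  // no intermediate heap buffer in this code
            }
            return total;
        }
    }

    /** Demo helper (an assumption for illustration): sends a temp file to an in-memory sink. */
    static byte[] sendTempFile(byte[] content) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            Path file = Files.createTempFile("mmap-demo", ".bin");
            // The mapping may outlive the channel, so the file is removed at JVM exit
            // rather than immediately (immediate deletion can fail on some platforms).
            file.toFile().deleteOnExit();
            Files.write(file, content);
            send(file, Channels.newChannel(sink));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] content = "memory mapped content".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(sendTempFile(content), StandardCharsets.UTF_8));
    }
}
```

The mapping stays valid until the buffer is garbage collected, and page faults or I/O errors surface while the buffer is being accessed. That is part of the handling complexity mentioned above.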

Applicability Matrix

Each of the performance optimizations described in earlier sections has its own pros and cons. There is no silver bullet that can be generalized to every use case. Hence it is important for an application developer to choose the framework that best suits a given solution, to reap the maximum benefits in terms of control, portability, memory, bandwidth and performance. Figure 4 summarizes the above findings and could be used as a guideline.

Approach	Portability	Control and Applicability	Memory Footprint	Performance
Multiple buffers in user space (to read/write)	Supported by many programming languages. Portable (no OS or hardware dependencies)	Applications have better control over data. Applicable to use cases which require intermediate data reading/processing, e.g. transformation	High, since multiple buffers hold data	Low in comparison to the other approaches
Shared buffer in user space	Supported by many programming languages. Portable (no OS or hardware dependencies)	Applications have better control over data. Partial reads can be done for processing. Not applicable to large transformations which require the entire content to be read	Medium, an intermediate buffer is required in user space	High in comparison to the multiple buffer approach. Low in comparison to the rest
Zero copy	Mostly OS and hardware dependent. Socket to socket zero copy, e.g. using splice(..), can only be done on Linux and requires the NIC to support DMA	Application has no control over the data. Exceptions need to be properly handled. Applicable to transferring static content, e.g. files on disk, over the network	Lowest, no data is copied to the user space	High in comparison to the other approaches. CPU is not used to copy data through the user space, and switching between user and kernel space is kept to a minimum
Memory maps	OS dependent, mmap() is specific to Unix	Application has no control over the data. Exceptions need to be properly handled	Low in comparison to the multiple/shared buffer approaches, memory is only allocated for table mappings	Potentially high for transferring large data in comparison to the multiple/shared buffer approaches. Low in comparison to zero copy

Figure 4 - Applicability Matrix

Conclusion

Based on the evaluation carried out, it is apparent that there is no generic approach for every solution. A shared buffer within the user space is an optimization that could be applied in the majority of cases, given its lower complexity and better portability compared to zero copy and memory mapped files, and its capability to outperform multiple buffers. Approaches such as zero copy look promising for a limited number of cases, such as transferring static content over the network. Nevertheless, application designers should use their discretion to choose the best approach for a given solution. It will be interesting to see whether operations such as splice() evolve to be compatible across multiple environments, and whether more zero copy patterns emerge in future that support socket to socket transfer with partial reads.

Tech Explosives

Friday, December 25, 2020

Spotting The Most Optimum Framework For Your Network Application