Friday, December 25, 2020

Spotting The Most Optimum Framework For Your Network Application



Network applications are getting popular by the day. In one of my previous articles I discussed the importance of choosing the right protocol. For any given protocol there is a plethora of vendors supporting it, which opens the door to evaluating which vendor offers the best protocol implementation (framework/product). Products are typically evaluated along various dimensions: throughput, intuitiveness, portability across environments (Windows/Linux/macOS), developer friendliness and so on. Most often the performance of a framework is judged on advertised statistics and benchmarks. It is worth remembering what former British Prime Minister Benjamin Disraeli is said to have observed: "There are three kinds of lies: lies, damned lies, and statistics." Benchmarks represent a set of generic paradigms and might not relate to the solution one intends to implement; for an Extract, Transform, Load (ETL) scenario, for example, it would be senseless to rely on a benchmark done on secured proxies. This article elaborates on approaches such as multiple buffers, shared buffers, zero copy and memory mapping, which network application vendors (e.g. middleware) currently use to implement protocols, and illustrates how each approach can provide the best resource utilization depending on the solution being addressed. The goal is to highlight the importance of conducting a protocol design evaluation of a given product to judge its suitability for the solution. 

Protocol Programming Approaches and Optimizations 

Conventional Programming Approach 

Programming languages such as Java have introduced NIO APIs for writing network applications. NIO offers channels and buffers to read and write data over the network, as explained in detail here. Libraries such as Netty and Mina make it easier to write network applications that adhere to a standard thread model. Even with these libraries, there are certain optimizations that allow significant improvements in performance. They all depend on factors such as expected portability, control over the data, performance and memory footprint. Taking Java's offerings as an example, this section elaborates on a few of the approaches that can be used to optimize resource utilization and discusses their pros and cons, starting with Figure 1 below, which illustrates the conventional programming approach. 

Figure 1 - Conventional Java based NIO Framework Design 

As illustrated in Figure 1, 

  1. An incoming byte stream is read from the source into a buffer in the OS Kernel 

  2. Bytes in the kernel buffer will then be copied into a buffer in the user space/application 

  3. The application may process the bytes and subsequently perform another copy into an output buffer in the user space. 

  4. The bytes will be read from the output/target buffer into the kernel space

  5. The copied bytes will then be transferred to the target channel.  
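To make the steps above concrete, here is a minimal Java NIO sketch of the conventional approach; file channels stand in for the source and target, and the two ByteBuffers are the user-space input and output buffers from steps 2 and 3 (the class name and buffer sizes are illustrative, not taken from any particular framework).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ConventionalCopy {
    // Copies src to dst through two user-space buffers, mirroring steps 1-5 above.
    static void copy(Path src, Path dst) throws IOException {
        try (FileChannel in =;
             FileChannel out =, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer inBuf = ByteBuffer.allocate(4096);   // user-space input buffer (step 2)
            ByteBuffer outBuf = ByteBuffer.allocate(4096);  // user-space output buffer (step 3)
            while ( != -1) {
                inBuf.flip();
                outBuf.put(inBuf);        // the extra user-space copy described in step 3
                outBuf.flip();
                while (outBuf.hasRemaining()) {
                    out.write(outBuf);    // bytes go back into the kernel (step 4)
                }
                outBuf.clear();
                inBuf.clear();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("conv", ".in");
        Path dst = Files.createTempFile("conv", ".out");
        Files.write(src, "hello network".getBytes());
        copy(src, dst);
        System.out.println(new String(Files.readAllBytes(dst))); // prints "hello network"
    }
}
```

Note the two allocations and the put() between them; that second copy is exactly what the optimizations below try to eliminate.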

While the above is a working model, the following factors should be noted, 

  • As Eli Bendersky explains in his post, context switching between the user space and the kernel space has performance implications 

  • Array copying between buffers consumes CPU cycles and memory bandwidth. 

  • Reading all the data into buffers demands more memory: the more buffers an application uses, the more memory must be allocated to hold the data. 

The aforementioned limitations can be controlled and minimized. The next section elaborates on a few ways to optimize the above, making an application perform much better with improved resource utilization. 

Optimizing Buffer Usage  

As discussed in the earlier section, multiple buffers cost memory and CPU (to copy the data), making this approach resource intensive compared to the approaches articulated below. Hence it is worth identifying ways to reduce the buffer count as an optimization. The WSO2 Pass-Through Transport is a tangible example of reducing the buffer count by using a shared buffer, which in turn makes it one of the fastest Enterprise Service Buses (ESBs) in the market. Figure 2 below illustrates how a shared buffer could be applied. 

Figure 2 - Shared buffer for reading and writing

As depicted in Figure 2, the incoming byte stream is read into a single shared buffer in the user space/application. The data collected in this buffer is channeled directly to the target, reducing the number of times it needs to be copied. As opposed to the multiple buffers described in Figure 1, this method allows better resource utilization in terms of both CPU and memory. It applies to many paradigms, e.g. secure proxying and header-based routing, where content can be read partially: read only the header and transfer the remaining, larger fraction of the content through the single buffer. However, for cases that require observing or processing the entire content within the application, a single buffer will not provide performance or memory benefits.
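A minimal Java NIO sketch of the shared-buffer idea (an illustration of the concept, not WSO2's actual implementation): one buffer is reused for both reading from the source and writing to the target, so the extra user-space copy disappears.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

public class SharedBufferRelay {
    // Relays bytes from source to target through ONE shared buffer:
    // each chunk is read once and written straight out, no second copy.
    static void relay(ReadableByteChannel src, WritableByteChannel dst) throws IOException {
        ByteBuffer shared = ByteBuffer.allocateDirect(4096); // the single shared buffer
        while ( != -1) {
            shared.flip();                 // switch from reading into, to writing from
            while (shared.hasRemaining()) {
                dst.write(shared);         // channel the same bytes to the target
            }
            shared.clear();                // reuse the very same buffer
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "pass-through".getBytes();
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        relay(Channels.newChannel(new ByteArrayInputStream(payload)),
              Channels.newChannel(sink));
        System.out.println(sink.toString()); // prints "pass-through"
    }
}
```

In-memory channels are used here just to keep the sketch self-contained; the same relay works for socket channels.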

Zero Copy  

Zero copy is mainly used to serve static content, e.g. files on disk, over the network. It is also one of the key technologies behind Kafka being one of the fastest event streaming platforms, as explained in detail here. There are also instances where the term zero copy describes avoiding copies between buffers within the user space; Netty, for example, uses the term to describe its buffer interface, as described here. However, the focus of this section is Operating System (OS) level zero copy, as explained in detail here.

Figure 3 - OS Level Zero Copy 

As illustrated in Figure 3, zero copy completely bypasses the phase of reading data into the application/user space, allowing it to perform much better than all the approaches discussed in earlier sections. The flow is fairly straightforward: the application instructs the OS, via a syscall, to transfer data between two file descriptors. In Java the transferTo() method is used, which calls sendfile() in the OS to transfer data between a given file and a network channel (or vice versa). More details on zero copy can be found here. Zero copy conserves both CPU and memory by avoiding excessive copying of data and context switching between the user and kernel spaces. 
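In Java this boils down to FileChannel.transferTo(). The sketch below transfers a file to another channel; with a SocketChannel as the target the JVM can use sendfile() underneath (a file-to-file transfer is used here only to keep the example self-contained and runnable).

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopyDemo {
    // Hands the transfer to the OS: no bytes are copied into the user space.
    static void transfer(Path src, Path dst) throws IOException {
        try (FileChannel in =;
             FileChannel out =, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long pos = 0, size = in.size();
            // transferTo() may move fewer bytes than requested, so loop until done.
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, out);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("zc", ".in");
        Path dst = Files.createTempFile("zc", ".out");
        Files.write(src, "static content".getBytes());
        transfer(src, dst);
        System.out.println(new String(Files.readAllBytes(dst))); // prints "static content"
    }
}
```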

Despite the significant performance gain it brings, zero copy depends on the OS and the underlying hardware. It is most commonly used to transfer static content such as files on disk; there is very little mention of zero copy between network sockets, and in Java only FileChannel supports it. A likely reason is that the splice() syscall, which could enable zero copy between network channels, is tightly bound to Linux and also requires the Network Interface Controller (NIC) to support DMA. This makes the approach less portable and provides no control over the data flowing through. The lack of control also increases complexity, so proper measures should be taken to handle errors and exceptions.

Using Memory Maps 

Applications also tend to use mmap(), a POSIX-compliant Unix syscall that memory-maps a file. Similar to zero copy, this method allows applications to transfer data without reading the content into the user space, gaining great performance benefits as a result. However, as opposed to zero copy, it still requires context switches between the user and kernel spaces. Eduard Brose, in his study, articulates the pros and cons of this approach: chiefly the cost incurred for memory map table creation, modification and flushing, and the complexity of implementation (there are exceptions which need to be handled carefully). For large data/files, however, the table creation and maintenance cost does not outweigh the cost of copying the entire data into the user space, so it becomes negligible. Memory maps share the limitations of zero copy and are also highly criticized for their development complexity.  
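Java exposes mmap() through, which returns a MappedByteBuffer. A small hedged sketch of the idea (the file and its contents are illustrative):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MemoryMapDemo {
    // Maps the whole file and reads it back through the mapping; pages are
    // brought in on access rather than copied up front into a user-space buffer.
    static String readMapped(Path file) throws IOException {
        try (FileChannel ch =, StandardOpenOption.READ)) {
            MappedByteBuffer map =, 0, ch.size());
            byte[] bytes = new byte[(int) ch.size()];
            map.get(bytes);
            return new String(bytes);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mmap", ".dat");
        Files.write(file, "mapped data".getBytes());
        System.out.println(readMapped(file)); // prints "mapped data"
    }
}
```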

Applicability Matrix    

Each of the performance optimizations articulated in the earlier sections has its own pros and cons; there is no silver-bullet approach which can be generalized for every use case. Hence it is important for an application developer to choose a framework that best suits a given solution to reap the maximum benefits in terms of control, portability, memory, bandwidth and performance. Figure 4 summarizes the above findings and could be used as a guideline.  



| Approach | Portability | Control and Applicability | Memory Footprint | Performance |
| --- | --- | --- | --- | --- |
| Multiple buffers in user space (to read/write) | Supported by many programming languages; portable (no OS or hardware dependencies) | Application has full control over the data. Applicable for use cases which require intermediate reading/processing, e.g. transformations | High, since multiple buffers hold the data | Low in comparison to the other approaches |
| Shared buffer in user space | Supported by many programming languages; portable (no OS or hardware dependencies) | Application has good control over the data. Partial reads can be done for processing. Not applicable for large transformations which require the entire content to be read | Medium, an intermediate buffer is still required in user space | High in comparison to the multi-buffer approach; low in comparison to the rest |
| Zero copy | Mostly OS and hardware dependent. Socket-to-socket zero copy, e.g. using splice(), is Linux-only and requires the NIC to support DMA | Application has no control over the data; exceptions need to be handled properly. Applicable for transferring static content, e.g. files on disk, over the network | Lowest, no data is copied to the user space | High in comparison to the other approaches: no CPU spent on data copies and no switching between user and kernel space |
| Memory maps | OS dependent, mmap() is specific to Unix | Application has no control over the data; exceptions need to be handled properly | Low in comparison to the multi/shared buffer approaches; memory is only allocated for the map tables | Potentially high for large data in comparison to the multi/shared buffer approaches; low in comparison to zero copy |

Figure 4 - Applicability Matrix 


Based on the evaluation carried out, it is apparent that there is no generic approach for every solution. Using a shared buffer within the user space might be an optimization applicable to the majority of cases, given its reduced complexity and better portability compared to zero copy and memory-mapped files, and its ability to perform better than multiple buffers. Approaches such as zero copy look promising for a limited number of cases, such as transferring static content over the network. Nevertheless, application designers should use their discretion to choose the best approach for a given solution. It would be captivating to see operations such as splice() evolve to be compatible across multiple environments, and to see more zero-copy paradigms emerge that allow socket-to-socket transfer with partial reads in the future. 

Wednesday, May 15, 2019

Integrate WSO2 APIM with ETCD


Wide adoption of SOA/MSA has influenced modern systems to be disaggregated. These disparate systems need to be integrated together to form a business unit or an application. Because the systems are implemented against well-defined standards/interfaces, e.g. HTTP or JMS, it is possible to integrate them seamlessly. Each interface introduces a way to locate these systems/resources, i.e. a URL in the form of an endpoint. These endpoints can change rapidly (an IP change, for example), which creates a requirement to manage them centrally and to allow the services/systems connecting to each other to discover them.

The following describes how WSO2 APIM could be integrated with ETCD to dynamically discover the endpoints. The tutorial depicts the following,

  1. A sample service represents a unit of work done through a warehouse management system.
  2. The services are exposed as REST endpoints and can be accessed through the IP and port (localhost, 8090)
  3. Hypothetically, let's consider that the port on which the service is running changes from time to time.
  4. Each time the port changes, it would not be practical to reflect the change in every other service/application which consumes the endpoint

Hence, the following demonstrates how the service URL can be added as an entry in the etcd registry; this entry is an identifier representing the URL of the service. Connecting applications query the registry for the URL of the endpoint instead of embedding the URL directly in the application. A change to the endpoint URL then only needs to be reflected in the registry entry, and it propagates to the connecting applications without any changes to the applications themselves.

Sample Service

Make sure that Ballerina is downloaded and configured. For detailed instructions please refer here.

A sample mock service could be found here.

Once the service is available, execute the following,

ballerina run abc_warehouse.bal

The above will start the service on localhost:8090, and the following message will be prompted in the console as a result,

Initiating service(s) in 'abc_warehouse.bal'
[ballerina/http] started HTTP/WS endpoint

ETCD Setup

Make sure that ETCD is running by executing the following,

The following response could be observed,


Insert a key into the etcd registry (this command can also be used to change the entry),

curl -X PUT  -d value="http://localhost:8090/warehouse/order"

To retrieve the key, execute the following,

WSO2 APIM Integration

WSO2 APIM provides the capability to define dynamic endpoints. Using this capability in conjunction with mediation extensions, the APIM gateway can be integrated with the etcd registry to dynamically discover service endpoints. 

The following is the mediation extension which should be added.

<sequence name="warehouseSvcDiscoverySeq" trace="disable" xmlns="">
    <property name="State" value="Discovery Sequence"/>
    <call blocking="true">
        <http method="get" uri-template=""/>
    </call>
    <log level="full"/>
    <property description="WarehouseOrderEP Value" expression="json-eval($.node.value)" name="WarehouseOrderEP" scope="default" type="STRING"/>
    <property name="Discovery Endpoint Value" expression="$ctx:WarehouseOrderEP"/>
    <header name="To" expression="$ctx:WarehouseOrderEP"/>
</sequence>

The relevant extension could be downloaded from here.
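For illustration, here is what the json-eval($.node.value) step in the sequence does, sketched in plain Java. The response shape assumes etcd's v2 keys API (GET /v2/keys/&lt;key&gt; returns {"node":{"value":...}}); the key name and the naive regex extraction are illustrative assumptions, and a real gateway would of course fetch the response over HTTP and use a proper JSON parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EtcdLookup {
    // Extracts $.node.value from an etcd v2 keys-API response, i.e. what the
    // json-eval(...) in the sequence above does. Naive regex parsing for brevity.
    static String extractNodeValue(String json) {
        Matcher m = Pattern.compile("\"value\"\\s*:\\s*\"([^\"]+)\"").matcher(json);
        return m.find() ? : null;
    }

    public static void main(String[] args) {
        // A response shaped like etcd v2's GET /v2/keys/<key> output
        // (the key name is a hypothetical example).
        String response = "{\"action\":\"get\",\"node\":{\"key\":\"/WarehouseOrderEP\","
                + "\"value\":\"http://localhost:8090/warehouse/order\"}}";
        System.out.println(extractNodeValue(response));
        // prints http://localhost:8090/warehouse/order
    }
}
```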


After applying the above extension, invoke the WSO2 APIM gateway endpoint.

curl -k -H "Content-Type: application/json" -X POST -d '{"cust_id":"hmart", "delivery":"addr1", "contact":"784-7948754", "item":"A100", "quantity":120}' -v

In the warehouse Ballerina service, the following could be observed,

ABC - Order received

2019-05-15 11:40:20,773 INFO [] - Order placed for Item: A100, quantity: 120. Deliverary to: addr1. Contact: 784-7948754

Change the port on which the Ballerina service is running (from 8090 to 9090).

Reflect the port change in the ETCD registry entry by issuing the following command,

curl -X PUT -d value="http://localhost:9090/warehouse/order"

After reflecting the change, invoke the APIM gateway URL; the same response should be received.


Follow this post to integrate with Consul.

What’s coming up ?

This tutorial explained how etcd integration can be done via dynamic endpoints and mediation extensions. The WSO2 APIM 3.0.0 release will provide out-of-the-box support for this integration with its Micro Gateway offering; for more information refer here.

Friday, August 10, 2018

How Ballerina does I/O

Recently I came across this article, which I found rather fascinating, and thought of comparing how Ballerina does I/O relative to the programming languages compared in the article.

If you have not read the article before, I encourage you to do so before reading the rest of this post, since the concepts explained there will not be reiterated.

Author's Summary of the Article 

The article mainly compares the existing I/O capabilities provided by different programming languages (PHP, Java, Node.js and Go). The following table illustrates how the author summarized his findings. 

The author compares several programming languages on their I/O capabilities and elaborates on the importance of a programming language supporting multi-threading over process creation, non-blocking I/O, and ease of use (providing transparency so developers can write straightforward code rather than manage callbacks; callbacks were an artificial consequence of non-blocking functionality in some of the languages, which reduces ease of use).

The author concludes that Golang provides most of the characteristics required of a programming language to perform I/O operations, in comparison to the other languages he compared.

Since Golang came out on top, I thought of further evaluating its pros and cons to really identify its capabilities and limitations.

Further Evaluation 

Going beyond the boundaries of the article, I came across the following post, which describes some of the pain points developers are experiencing in Golang:

- Lack of vectored I/O, which means operations such as scatter/gather are not supported. 
- Each time an I/O read is performed, a buffer is allocated first, before performing the syscall. However, this does not guarantee that the read will return having filled the allocated buffer with content.

To elaborate, say you're attempting to read content from a socket. When attempting to read, it is expected that content is present (obviously.. duh).

There are several approaches to reading content. The post explains that in Golang a buffer is allocated first, and subsequently a read() syscall is issued. As stated above, bytes might not be present when attempting to read; if there are no bytes, the read returns empty. The allocated buffer, however, remains, which means an increased memory footprint.

So is there an alternative? How can you allocate a buffer and perform the read syscall only if content is present? I will explain further in this post how Ballerina does this. Again, if you have not read the article I encourage you to do so; the author explains how the syscalls work, which I will not repeat.

Ballerina I/O Architecture 


Shown above is an illustration of Ballerina's I/O architecture.

I/O, or input/output, is about how a program reads data for processing and writes data after processing. It wouldn't be wrong to say that every program involves an I/O operation one way or another: you read from a file and process it, write some of the content to a socket, or perhaps do some calculation and write the result to the standard output. At the end of the day you would mainly see the following attributes involved:

# I/O sources - files, sockets, devices etc.: the entities you use to gather or place the information your program needs for processing.
# Information - mainly delivered to a program as bytes, the standard way to represent information.
# Interpretation - a sequence of bytes (1 or more) can be interpreted in different ways. A byte is basically a number; that number could represent a character in an alphabet, a pixel in an image, a minute detail of an audio clip, a state in a traffic light, etc. Once bytes are read from or written to an I/O source, the program interprets their values to perform its processing; these interpretations are commonly characters, records, etc.

Interpretation can be independent of the source: whether you get the information from a socket, a file or standard input doesn't matter as long as it's the same information. Ballerina's I/O architecture revolves around this principle.

So what advantage does this provide?

Say, as a developer, you write a program to read from a file and perform some processing, and suddenly you're told that the very same content should instead be read over the network. The source changes from file to socket, but the information and the interpretation stay the same. So in Ballerina all you need to change is the source, and the rest of the program keeps working. You will see how once you look at the semantics/syntax it provides.


You could try the samples available on the site to get a broader sense of how to use the I/O APIs. The following is pseudo code showing what they look like.

function main(string... args) {
    //Initialize channel/I/O source
    io:ByteChannel sourceChannel = io:openFile(filePath, permission);
    var result =;
    //Write to the channel/I/O source
    sourceChannel.write(..);
}

So basically, as a developer, when you write a program and figure out the I/O source needs to change from file to socket, all you need to change is how the ByteChannel is initialized; in Ballerina all I/O sources are represented as a ByteChannel.

Going back to the beginning of the post, comparing Ballerina with the other programming languages covered in the article, the following are some highlights:

# Async model -

Ballerina uses a threading model with a controlled number of I/O threads and reduced context switches, based on NIO (non-blocking I/O), which is described further in this article.

# Ease of use -

As you will observe, there are no callbacks; the non-blocking behaviour is transparent to the developer. This makes it as easy to use as the languages which top the list for ease of use. 

As I mentioned at the beginning of the post, let me illustrate the approach Ballerina uses to read content while avoiding the additional memory footprint. 

Under the Hood (How it works)

Say you're reading content from a socket, using the syntax highlighted earlier in the post. When the program execution hits the following line,

var result =;

the functional flow looks like the following,

1. When read is called, an event is registered with a selector (a multiplexer which captures events from the kernel and notifies accordingly). At this point no buffers are allocated and no thread is held or blocked internally; the event waits until it is notified by the kernel. 
2. When the specific I/O source writes data to the kernel buffer, the selector is notified, signalling to the user space that data is ready to be read from the given channel.
3. At this point there is certainty that data is available to be read, so the selector manager dispatches a read event to the IOThreadPool (a dedicated thread group that handles I/O operations in Ballerina). Only at this point does Ballerina allocate a buffer.
4. Data is read into the newly allocated buffer, and once the operation completes, the next LoC is executed. The entire operation is natively non-blocking, but the developer does not need to struggle with handling callbacks.
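Ballerina's internals aside, the same allocate-only-on-readiness pattern can be sketched with a plain Java NIO Selector (the loopback server, message and buffer size below are illustrative). Note how the read buffer is created only after the selector reports the channel readable, mirroring step 3 above:

```java
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

public class LazyAllocRead {
    // Sends msg to a local server socket and reads it back; the server-side
    // read buffer is allocated only once the selector reports readiness.
    static String echoOnce(String msg) throws Exception {
        try (Selector selector =;
             ServerSocketChannel server = {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            try (SocketChannel client = SocketChannel
                    .open(new InetSocketAddress(InetAddress.getLoopbackAddress(),
                            ((InetSocketAddress) server.getLocalAddress()).getPort()))) {
                client.write(ByteBuffer.wrap(msg.getBytes(StandardCharsets.UTF_8)));
                while (true) {
          ;  // steps 1/2: wait until the kernel signals readiness
                    for (SelectionKey key : selector.selectedKeys()) {
                        if (key.isAcceptable()) {
                            SocketChannel ch = server.accept();
                            ch.configureBlocking(false);
                            ch.register(selector, SelectionKey.OP_READ);
                        } else if (key.isReadable()) {
                            // Step 3: data is known to be present, allocate only now.
                            SocketChannel ch = (SocketChannel) key.channel();
                            ByteBuffer buf = ByteBuffer.allocate(64);
                  ;
                            ch.close();
                            buf.flip();
                            return StandardCharsets.UTF_8.decode(buf).toString();
                        }
                    }
                    selector.selectedKeys().clear();
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(echoOnce("ping")); // prints "ping"
    }
}
```

No buffer exists while the channel is idle; memory is committed only when there is certainty of data, which is the limitation of the eager-allocation model discussed earlier.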


This post covered some of the factors considered when evaluating the I/O APIs offered by different programming languages; the comparison points were derived from the article. Subsequently, some of the pain points of existing I/O models were discussed. Finally, Ballerina's I/O architecture was measured against the comparison points from the article, with an explanation of how Ballerina addresses some commonly known limitations faced even by the winner of the article's comparison.

I encourage you to try Ballerina out yourself. 

Thursday, August 3, 2017

What's Special About WSO2 ESB ??

I am a bit late to write this post. Better late than never :)

Why should you consider WSO2 ESB ?

The recently published article unveils the answer to the question: What is WSO2 ESB?

WSO2 ESB is one of the most mature products in the WSO2 stack: it's scalable, it's fast, and it has all the features to support your integration needs. This, I believe, is evident and self-explanatory once you download it.

Monday, February 27, 2017

How I got started with Ballerina

I am certain most of my friends would click on the link expecting to see me dancing :)

With the announcement of Ballerina, the new integration language, I thought of writing a quick summary on how I got started. 


I downloaded Ballerina from here and referred to the installation instructions to get started.

Writing an EIP

CBR (Content-Based Routing), a very common EIP in the integration world, was something I tried out with Ballerina. Here's how I did it.  

Creating a Mock Service in Ballerina

Something I was longing to try out in Ballerina was writing a service which could be executed in the same runtime. Here's how I did it, 

Started the composer, and voila, it provided a graphical view for me to represent the service and what it should do; all I had to do was drag and drop a few elements onto the canvas. It was like drawing a flow chart. 

The service I created accepts an incoming HTTP message and sends a mock response back. The source view showed the language syntax I could use; here's how that looked:

import ballerina.lang.messages;

service GadgetInventoryMockService {
    resource inquire(message m) {
        message response = {};
        json payload = `{"inquire":"gadget","availability":"true"}`;
        messages:setJsonPayload(response, payload);
        reply response;
    }
}

Similarly, I managed to create both services ("Widget Inventory" and "Gadget Inventory").

Routing with Ballerina

Just like creating a service, I was able to drag and drop a set of elements in the graphical view and create the router:

import ballerina.lang.jsons;
import ballerina.lang.messages;

service ContentBasedRouter {
    resource lookup(message m) {
        http:ClientConnector widgetEP = create http:ClientConnector("http://localhost:9090/widgets");
        http:ClientConnector gadgetEP = create http:ClientConnector("http://localhost:9090/gadgets");
        json requestMessage = messages:getJsonPayload(m);
        string inventoryType = jsons:getString(requestMessage, "$.type");
        message response = {};
        if (inventoryType == "gadget") {
            response = http:ClientConnector.get(gadgetEP, "/", m);
        } else {
            response = http:ClientConnector.get(widgetEP, "/", m);
        }
        reply response;
    }
}

Looking back, I realize it was not only convenient to create the message flow, but it was also easier for me to describe the flow through the diagram, with the connections, the message flow and the client as separate entities (the picture was actually speaking 1000 words :) ). 

Running What I Wrote 

I was excited to see how this diagram would look when it's running.

This is all I had to do,

ballerina run service ./gadgetInventoryMockService.bal ./widgetInventoryMockService.bal ./router.bal

where gadgetInventoryMockService.bal and widgetInventoryMockService.bal were the mock services I wrote and router.bal is the routing logic. I would've preferred to bundle the whole project into one package instead of having to pass each individual file as an argument. I checked on this capability with the team, and it will be supported in the near future by the composer, so I'll have my fingers crossed. As a result, on my local machine each of the bal files was running as a service at the following URLs. The files I used can be found here.

Gadget Inventory Mock Service
Widget Inventory Mock Service

So, to practically experience how Ballerina routed the requests, I sent the following request using the cURL client, 

curl -v http://localhost:9090/route -d '{"type" : "gadget"}'

The following response should be observed,


Re-executed the request with the following,
curl -v http://localhost:9090/route -d '{"type" : "widget"}'

Then the following response should be observed,

In general, there are more components, e.g. fork-join capability, which will be required to implement some of the EIPs I wanted to try out, such as scatter-gather; so tick tock for the next release. However, it was a great experience.