Tuesday, August 2, 2011

Sudden drop in traffic through some interfaces in Power-1 (solved)

Recently we encountered strange behaviour in one of the Power-1 clusters deployed at one of our telco customers. A brief explanation of the problem follows.
AAA traffic entered through the external interface and was destined for a RADIUS server that sits behind another interface. The traffic graphs showed sudden drops in traffic, and all authentication requests were lost at those moments. After a couple of seconds the traffic returned to normal. This happened not only during peak hours but at other times as well. The problem remained even after switching from one Power-1 device to the other.
After digging down into the problem we came up with a solution.
As we suspected, interface drops had been recorded. You can get an idea of the Tx/Rx errors and drops by issuing "ifconfig" followed by the interface name.
When we issued this command for the relevant interfaces, we noticed a huge number of Rx drops on the external interface.
From this we can conclude that the Rx buffer is not sufficient. To understand the problem better, it helps to understand what the Rx buffer is.
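To pull just the drop counter out of that output, a small helper along these lines can be used. This is a minimal sketch: the function name rx_dropped and the interface name eth0 are only placeholders, and it assumes the older net-tools layout in which the counters appear as "RX packets:... dropped:..." on one line.

```shell
# Read ifconfig output on stdin and print the RX "dropped" counter.
# Assumes the older net-tools format: "RX packets:N errors:N dropped:N ..."
rx_dropped() {
    awk '/RX packets/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^dropped:/) { sub(/^dropped:/, "", $i); print $i }
    }'
}

# Usage (eth0 is a placeholder for the external interface):
#   ifconfig eth0 | rx_dropped
```

Running it twice a few seconds apart and comparing the two numbers shows whether the drops are still increasing.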
When the NIC receives a packet, it issues an interrupt to the CPU. The CPU first executes the relevant interrupt procedure and then hands the packet to the relevant software component (in this case the Check Point firewall kernel). But the CPU cannot always handle packets at the rate the NIC receives them, so the NIC needs some temporary storage. For this purpose the NIC is allocated a buffer in RAM. The same applies to the Tx buffer.
You can always view the Tx and Rx buffers allocated to an interface by issuing "ethtool -g" followed by the interface name. The output shows the maximum values as well as the currently allocated values.
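The effect of the buffer size can be illustrated with a toy model. This is only a sketch with made-up numbers, not a measurement from the Power-1: the function ring_drops and its parameters (buffer size, packets arriving per tick, packets the CPU drains per tick, number of ticks) are hypothetical.

```shell
# Toy model of an Rx buffer: packets arrive faster than the CPU drains them.
# Prints how many packets are dropped because the buffer is full.
ring_drops() {
    ring=$1; in_rate=$2; out_rate=$3; ticks=$4
    queued=0; dropped=0; t=0
    while [ "$t" -lt "$ticks" ]; do
        queued=$((queued + in_rate))          # NIC places packets in the buffer
        if [ "$queued" -gt "$ring" ]; then    # buffer overflow: excess is dropped
            dropped=$((dropped + queued - ring))
            queued=$ring
        fi
        queued=$((queued - out_rate))         # CPU processes some packets
        if [ "$queued" -lt 0 ]; then queued=0; fi
        t=$((t + 1))
    done
    echo "$dropped"
}

ring_drops 256 100 60 10     # small buffer, short burst: 204 packets dropped
ring_drops 1024 100 60 10    # larger buffer, same burst: 0 packets dropped
```

A larger buffer absorbs a short burst like this one. If the CPU stays slower than the arrival rate for long enough, any buffer size eventually overflows.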
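As a small sketch, the Rx maximum and current value can be picked out of the "ethtool -g" output like this. The function name rx_ring_sizes and the interface name eth0 are placeholders, and the parsing assumes the usual layout with a "Pre-set maximums" section followed by a "Current hardware settings" section.

```shell
# Read "ethtool -g" output on stdin and print the Rx maximum and current size.
rx_ring_sizes() {
    awk '
        /^Pre-set maximums/          { section = "max" }
        /^Current hardware settings/ { section = "current" }
        /^RX:/                       { print section, $2 }
    '
}

# Usage (eth0 is a placeholder for the external interface):
#   ethtool -g eth0 | rx_ring_sizes
```

The "RX Mini" and "RX Jumbo" lines are skipped on purpose, because only the plain "RX:" line is matched.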
Now we know what the Rx buffer really is. The buffer fills up when the CPU takes too long to process traffic. Does Check Point provide a way to increase performance, i.e. both throughput and connection rate? Well, it does: the SecureXL technology. On SPLAT it is provided by the Performance Pack (another module that is loaded). So we can speed up packet handling by tuning SecureXL.
So we analysed the SecureXL stats as well. You can get the details from "fwaccel stats", or you can view them in "/proc/ppk/statistics". As suspected, the number of F2F (non-accelerated) packets was higher than the number of accelerated packets. We tried to improve this by modifying the rule base, and after some effort we got the stats to an acceptable value.
Still, the problem remained, so we moved on to the next step: increasing the buffer memory.
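To put a number on "higher than accelerated", the share of F2F packets can be worked out from the two counters. This sketch deliberately does not parse the "fwaccel stats" output. It assumes you read the accelerated and F2F packet counts off the output yourself; f2f_percent is a hypothetical helper name and the numbers are invented.

```shell
# Print the percentage of packets that took the F2F (non-accelerated) path.
# Arguments: accelerated packet count, F2F packet count.
f2f_percent() {
    accel=$1; f2f=$2
    total=$((accel + f2f))
    if [ "$total" -eq 0 ]; then
        echo 0                        # no packets counted yet
    else
        echo $((f2f * 100 / total))   # integer percentage, rounded down
    fi
}

f2f_percent 250 750    # prints 75: three quarters of the packets are not accelerated
```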
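In this case the maximum was 4096 (note that "ethtool -g" reports these sizes as a number of ring entries rather than in kB). We increased the value in increments of 1024 and checked the drops after each change.
As expected, this solved the problem.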
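The stepwise increase can be sketched as follows. The ring size is changed with "ethtool -G". The helper rx_ring_steps, the interface name eth0 and the starting size of 1024 are assumptions for illustration only, since the original starting value is not recorded here.

```shell
# Print the Rx ring sizes to try, stepping from the current size up to the maximum.
# Arguments: current size, maximum size, step.
rx_ring_steps() {
    size=$1; max=$2; step=$3
    while [ "$size" -lt "$max" ]; do
        size=$((size + step))
        if [ "$size" -gt "$max" ]; then size=$max; fi   # never exceed the maximum
        echo "$size"
    done
}

# Usage (eth0 and the starting size are placeholders):
#   for size in $(rx_ring_steps 1024 4096 1024); do
#       ethtool -G eth0 rx "$size"
#       # watch the Rx drop counter for a while before taking the next step
#   done
```

Going up one step at a time, rather than straight to the maximum, makes it easier to stop at the smallest value that makes the drops disappear.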