In part 3, I talked about how an application’s susceptibility to intermittent issues is another reason why small and medium businesses need application testing. If you haven’t read it, go here.
In this post, I talk about reason #4.
In part 2, I mentioned that the customer was using NAT to separate its network from the SaaS provider’s network. That’s not uncommon. Every company wants to help protect its network from anything malicious from another network. Nothing new there.
What was also common, especially among smaller businesses and homes, was the type of NAT the customer was using. They were using port address translation (PAT), which is an extension of the network address translation method.
A Quick Primer on PAT
Port Address Translation is a form of NAT that allows multiple devices on a local network (LAN) to be mapped to a single public IP address. This mechanism is also referred to as single-address NAT. A goal of using PAT is to conserve public IP addresses.
When PAT is being used, the IP addresses from an internal network are translated by a router to the external network’s single IP address. Each request from a different internal IP address is separated by a port, hence port address translation.
So, for example, the customer had over 1,000 users accessing the SaaS provider’s application. At any given point, that’s 1,000 users, whose computer IP address would need to be mapped to one external IP address using various ports.
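To make the mapping concrete, here is a minimal Python sketch of a PAT table. It is a toy model, not how any particular router implements it: the class name `PatTable`, the address `203.0.113.10` (a documentation address), and the starting port of 1024 are all assumptions for illustration. Real devices also track the protocol and destination, and they free ports when connections close or time out.

```python
class PatTable:
    """Toy PAT table: maps (internal IP, internal port) to a port on one external IP."""

    def __init__(self, external_ip, first_port=1024, last_port=65535):
        self.external_ip = external_ip
        self.next_port = first_port   # next unused external port (assumed allocation scheme)
        self.last_port = last_port
        self.mappings = {}            # (internal_ip, internal_port) -> external port

    def translate(self, internal_ip, internal_port):
        """Return the (external IP, external port) used for an outbound connection."""
        key = (internal_ip, internal_port)
        if key not in self.mappings:
            if self.next_port > self.last_port:
                # The pool is exhausted; a real router would drop the packet here.
                raise RuntimeError("no free external ports left")
            self.mappings[key] = self.next_port
            self.next_port += 1
        return self.external_ip, self.mappings[key]


table = PatTable("203.0.113.10")
first = table.translate("10.0.0.5", 51000)
second = table.translate("10.0.0.6", 51000)
print(first)   # ('203.0.113.10', 1024)
print(second)  # ('203.0.113.10', 1025)
```

Two different internal hosts share the same external IP address, and only the external port tells their traffic apart.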
PAT is a commonly used method in our homes. This is what Comcast and Frontier, my ISP, use to give us access to the Internet.
The customer decided to do the same as well. However, in order to save some money, they allocated one IP address. The extra cost of another IP address from their WAN provider may not have been too cost-prohibitive. After all, they are a smaller company and need to save where they can.
Oh Those Chatty Apps
In all the captured data that I analyzed, I often noticed that when a user received the “Page Cannot Be Displayed” error, the application had opened as many as 100 TCP connections in a span of a couple of minutes!
So let’s do the math.
1,000 users times 100 TCP connections. That’s 100,000 connections!
So what do you think will happen when 100,000 connections are requested with PAT? The router will attach a port to the external IP address for each connection.
See the problem?
TCP is a connection-oriented protocol that uses port numbers to identify the client and server ends of each connection. RFC 793 defines the initial TCP specification, and it states that the TCP header allocates 16 bits to specify the port number.
So each connection opened by the application should open a connection on the external side and use up a TCP port. In this case, the egress router at the customer’s data center would need 100,000 ports to get across the WAN provider’s network over to the SaaS provider.
The problem is that 16 bits only give us 65,536 port numbers, or about 65,000 ports!
For this reason, the router is short about 35,000 ports. It will run out of ports to allocate to all the user connections. What should happen is that packets will get dropped at that router when this happens.
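The arithmetic behind that shortfall can be checked in a few lines of Python. This is a worst-case sketch that assumes every user hits the 100-connection peak at the same moment and that the whole 16-bit port range is usable; in practice port 0 is reserved and routers often skip the low-numbered ports, so the real gap would be slightly larger.

```python
USERS = 1_000
CONNECTIONS_PER_USER = 100   # peak observed in the captures
PORT_FIELD_BITS = 16         # size of the TCP port field per RFC 793

demand = USERS * CONNECTIONS_PER_USER   # connections needing an external port
port_space = 2 ** PORT_FIELD_BITS       # port numbers available on one external IP
shortfall = demand - port_space         # connections left without a port

print(demand, port_space, shortfall)    # 100000 65536 34464
```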
TCP port issues
So it turns out that the primary culprit was the lack of available ports to allocate on the customer’s egress router for an application that opens up many TCP connections. This data center router simply did not have enough ports, and was dropping those packets.
The short-lived nature of typical HTTP requests meant that a refresh of the screen could buy the router enough time to free up some ports for new requests. But user productivity suffered because these errors kept recurring at unpredictable times.
Sorry, It’s You
After analyzing all of the data, I recommended that the customer add one or two additional external IP addresses that the data center router can translate IP addresses to with PAT. Doing this resolved the issue.
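As a rough sizing check, the number of external IP addresses follows from dividing the peak connection count by the ports available per address. The helper `external_ips_needed` below is a hypothetical name, and the model is a simplification: it assumes each external port carries one connection at a time and ignores any port reuse a particular router might support.

```python
import math


def external_ips_needed(peak_connections, ports_per_ip=65_535):
    """Estimate how many external IPs a PAT pool needs (simplified model)."""
    return math.ceil(peak_connections / ports_per_ip)


needed = external_ips_needed(100_000)
additional = needed - 1   # the customer already had one external IP

print(needed, additional)  # 2 1
```

Under these assumptions one extra address covers the worst case, and a second leaves headroom, which is in line with the "one or two" in the recommendation.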
The packet loss on the WAN provider’s network was an issue to address. The application provider’s network was not an issue, but certainly the application’s chatty nature was.
But both of these were known issues. The customer was convinced it was an issue on the application provider’s network.
The unknown was that the customer’s choice to use PAT with one external IP address proved to be the main issue causing performance problems for its users.
So what I hope you are able to take away from this post series is that even smaller or medium-sized companies need to consider application testing before deployment to their users. Once you know how the application works, you can plan your network or the use of your network’s resources appropriately.
After testing confirmed that adding external IP addresses for PAT resolved the problem, the customer added a firewall to more effectively handle the application’s numerous connections.
So after months of pain, the customer’s users could use the application without having the frustrating browser error.
The SaaS provider was able to prove that its network was not the issue, and while its application is chatty, that was not the root cause of the customer’s issue.
Are you at a small or medium business? Do you think you don’t need to test your applications? Tell me why.
Photo credit: Jeanot (derivative work)