Years ago, I was working with a company that was in the middle of migrating many of its applications to a new data center. I came in as part of the team, and my role was to analyze the applications and determine whether there would be any performance issues after the migration.
For one application, I was involved from the beginning and was thus able to do some pre-migration testing with the application team. I asked all the questions that I felt would give me the information I needed to test this application and give them my recommendation to mitigate any potential issues. Many of these questions can be found in my ebook, Analyzing HTTP: Nine Things to Look For When Profiling and Troubleshooting Web Performance.
After getting my questions answered, the application team and I identified a number of transactions that a typical user of this application would execute. And we went about testing the application.
With the tools at my disposal, I was able to determine that the application would not perform well when moved to the new data center. Changes would need to be made, either to the application or to the infrastructure, to mitigate the expected performance issues.
This web application was Java-based. After doing the testing with the application team and analyzing the data, I saw that it was very chatty. Many of the application’s transactions also had a high number of application turns. Chatty and turn-heavy: that’s a recipe for some frustrated user calls to the helpdesk!
The application also had a global user base, so network latency to the data center could be as high as 500ms. And that’s when the network is working as expected. With a chatty application, the greater the distance between the user and the application’s servers, the worse the performance you can expect.
So I knew I could not recommend this app for migration. Otherwise, I’d probably be migrating myself out of the building.
As is typical of chatty applications, everything performed well when the client was local to the server. During testing, transactions routinely had just above 500ms response times.
That’s great! But that was local testing.
Problems with applications like this do not surface until users are moved away from the server, or, as was the case here, the server is moved away from the users. Analysis showed that overall application response time would increase by 70 times if it were moved to the new data center as-is!
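To see why turns and latency compound like this, here is a minimal back-of-envelope sketch in Python. Every number in it is an assumption chosen for illustration (the turn count, the processing time, and treating the ~500ms latency as a round trip), not a figure from the actual analysis, but the arithmetic shows how a transaction that feels instant on the LAN can take the better part of a minute over a long-haul link.

```python
# Back-of-envelope model of how application turns amplify network latency.
# All numbers here are hypothetical, chosen only to illustrate the scale.

def min_response_time(app_turns: int, rtt: float, server_time: float) -> float:
    """Rough lower bound on transaction response time (seconds).

    Each application turn needs at least one network round trip, so the
    network alone contributes app_turns * rtt; server_time covers processing.
    """
    return app_turns * rtt + server_time

turns = 80          # request/response pairs in one chatty transaction (assumed)
server_time = 0.4   # server-side processing per transaction (assumed)
lan_rtt = 0.002     # ~2 ms round trip when the client is local to the server
wan_rtt = 0.5       # ~500 ms round trip for a remote user after the migration

local = min_response_time(turns, lan_rtt, server_time)
remote = min_response_time(turns, wan_rtt, server_time)

print(f"Local:    {local:.2f} s")          # ~0.56 s
print(f"Remote:   {remote:.1f} s")         # ~40 s
print(f"Slowdown: {remote / local:.0f}x")  # ~72x -- the same order of magnitude
```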
The Application Performance Recommendations
Because of the analysis that I did, I made two broad recommendations to the application team:
The first was to change the application to make its transactions less chatty. Some requests could be combined. Others could be cached. Or some of the Java calls could be rewritten.
The second, which I always recommend because companies are not always able to implement the primary recommendation for a number of reasons, was to utilize thin-client technology.
The customer already had a Citrix server farm in use for other applications, so I recommended that they deploy this J2EE application on Citrix. With the application running right next to its servers, only keystrokes and screen updates would cross the WAN, and the ICA protocol would give users the impression that the application is fast, even though the underlying transactions would still be slowed by the distance.
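To put rough, hypothetical numbers on why this helps, here is a small sketch comparing direct WAN access with a thin client sitting in the data center. As before, the turn counts and latencies are illustrative assumptions, not measurements from this engagement.

```python
# Hypothetical comparison: direct access over the WAN vs. a thin client
# (e.g. Citrix/ICA) located next to the application servers.

turns = 80          # application turns in one chatty transaction (assumed)
server_time = 0.4   # server-side processing per transaction (assumed)
lan_rtt = 0.002     # round trip inside the data center
wan_rtt = 0.5       # round trip from a remote user to the data center

# Direct access: every application turn crosses the WAN.
direct = turns * wan_rtt + server_time

# Thin client: the turns run between the Citrix server and the application
# server on the LAN; the user pays roughly one WAN round trip for input
# and screen updates.
thin_client = wan_rtt + turns * lan_rtt + server_time

print(f"Direct over the WAN: {direct:.1f} s")       # ~40 s
print(f"Via thin client:     {thin_client:.2f} s")  # ~1 s
```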
The customer went with recommendation #2.
Months later, I was pulled into a series of meetings: the migration had started, and performance was terrible.
Something had gone wrong! I have recommended Citrix and other thin-client technologies in the past with success, so I knew it wasn’t a case of the ICA protocol not doing its job.
After getting involved again and running some tests to capture data on the client and server sides, I noticed a pattern.
Every time a user complained about slowness in the application, I noticed a burst of network traffic on the server. Those bursts came during the server’s night, and night time for the server was morning time for the frustrated users.
After working with the server team to identify any sort of nightly scan, the culprit was uncovered.
It was a nightly system security scan. The packet data showed that it didn’t generate much network traffic, but a scan like this can be very CPU-intensive, depending on what’s being scanned at the time.
This is well known, which is why such a scan is supposed to run during the night hours of the local region. During the migration, however, one of the things that was done was to copy the OS image from the server that had been local to the users onto the server in the new data center. Unfortunately, the start time of that scan was somehow missed.
So that was our problem. The start time of the scan was never changed!
The scan schedule was adjusted, and users were happy again.
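One simple safeguard is to express every scheduled job in both the data center’s timezone and the users’ timezone before cutover. Here is a minimal sketch of that check, using hypothetical timezones and a hypothetical 02:00 start time; substitute your own regions and schedules.

```python
# Sketch: what does a server-local scan time look like to remote users?
# Timezones and the 02:00 start time are hypothetical examples.
from datetime import datetime
from zoneinfo import ZoneInfo

dc_tz = ZoneInfo("America/Chicago")   # new data center (assumed)
user_tz = ZoneInfo("Europe/London")   # where the users are (assumed)

# The scan start time copied over with the OS image: 02:00 "local" time,
# which now means 02:00 in the new data center's timezone.
scan_start = datetime(2024, 3, 4, 2, 0, tzinfo=dc_tz)

print("Scan starts (data center):", scan_start.strftime("%H:%M %Z"))
print("Scan starts (users):      ", scan_start.astimezone(user_tz).strftime("%H:%M %Z"))
# 02:00 at the data center lands at 08:00 for the users: the start of
# their business day, right when the scan is hammering the CPU.
```

Run as part of a pre-migration checklist, a check like this could flag the mismatch before the first helpdesk call.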
So how do you keep something as small as a security scan from causing performance issues that can derail your data center migration or consolidation?
Steps to a Successful Migration
Well, these things happen, but here are 3 general tips for a successful data center migration for your application:
- Build a team that is dedicated to supporting the migration
- Develop a testing process that every application must go through
- Incorporate the appropriate tools
All of this was in place on this project. Because of that, once the performance problems were identified, the team was able to get together, and I was able to help them find the culprit. Without that in place, the migration of this application would have been impacted for much longer.
Do you have any data center migration stories? What issues did you encounter? Let me know about it below.
Photo credit: geralt