Firstly, you need to take a look at the current infrastructure. How are you measuring application performance across the data centre elements, especially the SAN where the critical applications and data reside? In most cases this is done element by element (switch, server, storage) rather than with an end-to-end view, often as a legacy of previous consolidations. Element tools can only give you a view of capacity, performance or utilisation on their part of the SAN. If you have multiple vendors supplying your SAN storage infrastructure, then you end up using each vendor's tools to see only their bit.
Have you ever had a major trouble ticket raised and the people responsible (both internal and external) for the elements blame one another? This finger-pointing exercise is both massively inefficient and costly to the business. What you need is a view of the whole SAN and all its components, regardless of vendor. Once you have that, you can then go about baselining the application performance. The old adage of 'you can't manage what you can't measure' comes into play here. You need to know how an application is performing before a migration or consolidation - if you don't know how it's performing, how will you know it is performing better in the new system?
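Baselining in this sense is simply recording end-to-end transaction latency before any change is made. A minimal sketch, assuming hypothetical latency samples (a real tool would pull these from SAN-wide instrumentation rather than a hard-coded list):

```python
# Sketch of baselining end-to-end transaction latency.
# The sample figures below are illustrative assumptions, not real measurements.
from statistics import mean

def baseline_latency(samples_ms, percentile=0.95):
    """Return the mean and an approximate percentile of latency samples."""
    ordered = sorted(samples_ms)
    idx = min(int(len(ordered) * percentile), len(ordered) - 1)
    return mean(ordered), ordered[idx]

# Hypothetical per-transaction latencies (milliseconds) captured pre-migration
samples = [8.2, 9.1, 7.8, 12.4, 8.9, 9.6, 11.0, 8.4]
avg, p95 = baseline_latency(samples)
print(f"baseline: mean={avg:.1f} ms, p95={p95:.1f} ms")
```

The point is not the arithmetic but the discipline: the numbers recorded here are what the new or consolidated system will be judged against.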
Vendors are very reluctant to give a service level agreement for their new system, but they will all assure you: 'Trust me, the new system will deliver far higher performance and a far greater return on investment than before!' Be sure you talk about performance rather than availability. Availability is, of course, massively important, but an application can have five-nines availability and still run slowly. It's vital to know what you have currently and what to expect in the new or consolidated system.
Monitor the migration or consolidation
Secondly, now that you know how the application is performing (i.e. you have set a threshold of 10 milliseconds per transaction as an end-to-end baseline), you can start the migration or consolidation process. Vendors are very helpful here and will provide services to ensure this process goes smoothly. However, it would be better to be able to model what is going to happen with an application before the migration or consolidation, particularly when virtualisation is involved. For example, if you are consolidating five physical machines down to one supporting 50 or more virtual machines, it is a good idea to see if it will work before you commit to moving the live application. Also, by monitoring in real time you can see everything that is going on across the SAN and alleviate any potential or real bottlenecks that cause application latency.
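Even a back-of-envelope model helps before committing to a consolidation. The sketch below checks combined I/O demand against a safety margin of the new host's capacity; the demand figures, capacity and 70% headroom factor are all assumed values for illustration:

```python
# Hypothetical pre-consolidation check: will five physical machines'
# combined IOPS demand fit on one consolidated host with headroom to spare?

def consolidation_fits(per_machine_iops, host_iops_capacity, headroom=0.7):
    """Compare total demand against a safety margin of the host's capacity."""
    demand = sum(per_machine_iops)
    budget = host_iops_capacity * headroom
    return demand <= budget, demand, budget

ok, demand, budget = consolidation_fits(
    per_machine_iops=[3000, 4500, 2500, 5000, 3500],  # the five old hosts (assumed)
    host_iops_capacity=25000,                         # new consolidated host (assumed)
)
print("fits" if ok else "over budget", demand, budget)
```

With these illustrative numbers the demand exceeds the 70% budget, which is exactly the kind of finding you want before moving the live application, not after.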
A typical recent example is a company (which shall remain nameless) that bought a million-Euro array to improve the latency of its data warehouse by a couple of milliseconds. While waiting for delivery, it added monitoring to its infrastructure and discovered that the data warehouse, which was still attached via the SAN to the existing array, had a faulty cable and a faulty HBA causing the excess latency. A few hundred Euros spent on fixing the fault would have averted the need for the new array entirely.
Optimise the infrastructure
Thirdly, the SAN is made up of many components, often many thousands, and each one of these has the potential to cause a problem. By far the best way to manage this complexity is to 'ring fence' an application and monitor it in real time. For example, you may have an SAP-based application running your supply chain; this application can be isolated so you know which virtual and physical machines, switch ports, storage ports, etc. it is using. A performance threshold can then be set for this application and a window in a console created with a simple traffic-light visual to show how it is performing.
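The traffic-light visual reduces to a simple mapping from measured latency onto a status. A minimal sketch, assuming illustrative thresholds (10 ms for green, 15 ms for red) rather than values from any real product:

```python
# Sketch of the traffic-light status described above.
# The green/red thresholds are assumed values for illustration.

def traffic_light(latency_ms, green_below=10.0, red_above=15.0):
    """Map a ring-fenced application's end-to-end latency to red/amber/green."""
    if latency_ms < green_below:
        return "green"
    if latency_ms > red_above:
        return "red"
    return "amber"

for latency in (8.0, 12.5, 17.2):
    print(latency, traffic_light(latency))
```

Each supporting department can then be shown the same status, computed over only the components the ring-fenced application actually touches.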
This graphical view can be customised to show each department involved in supporting the application what is going on. Switch port utilisation is typically less than 10%, so savings can be made quickly by load balancing traffic across ports and deferring the purchase of additional capacity until it is really needed. The main reason business-critical applications are not virtualised is that performance and availability SLAs can't be set in an environment that has a large area of 'SAN blindness'. By monitoring the application in real time, end-to-end across the SAN, the cost efficiencies of virtualisation and private cloud technologies can be fully realised.
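Spotting load-balancing opportunities from port utilisation is mechanical once the data is in one place. A hedged sketch, with made-up port names and utilisation figures (a real tool would read live switch counters):

```python
# Sketch of finding over-used switch ports and the least-loaded port
# to offload traffic onto. Port names and utilisations are hypothetical.

def rebalance_candidates(port_utilisation, hot=0.6, cold=0.1):
    """Pair each over-used port with the least-loaded under-used port."""
    cold_ports = sorted(
        (p for p, u in port_utilisation.items() if u < cold),
        key=port_utilisation.get,
    )
    return [
        (p, cold_ports[0]) for p, u in port_utilisation.items()
        if u > hot and cold_ports
    ]

ports = {"sw1/p1": 0.82, "sw1/p2": 0.05, "sw1/p3": 0.07, "sw1/p4": 0.65}
print(rebalance_candidates(ports))
```

Even this crude pairing shows why end-to-end visibility defers spend: the capacity is usually already there, just unevenly used.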
On-going performance monitoring
Finally, staff managing elements of the SAN can sometimes forget why they are there: to keep the critical applications that run the business available and performing well. When there is latency in an application, there are instances of the server people blaming the switch people, the network people blaming the application owner, and everybody blaming the storage staff. Being able to improve the 'mean time to innocence' (or guilt) of the component or department that is causing the latency is key - and this can only be done quickly with a real-time, end-to-end view.
On-going real-time monitoring of the SAN will surface problems before they impact users, alert you to inefficiency in capacity or I/O throughput, and allow a far better understanding of the SAN, so future consolidations and migrations are performed on the elements that need improvement. Many large enterprises across the globe are now adopting real-time, end-to-end application and infrastructure performance monitoring to improve their competitive advantage. Organisations that stay with a 'black hole' in their SAN and an 'overprovision and hope' policy will become less competitive in the market as their IT systems cost more and are far less efficient than their competitors'.
Tags: Storage Networking