Over the years I have encountered many services that relied on monitoring systems that were monolithic or leaned too heavily on a single approach. Take, for example, synthetic transactions, where a client somewhere submits a transaction to a generic API in a complex multi-tiered system. In one instance, the team responsible for running a service with only server and synthetic transaction monitoring was convinced without a doubt (they had already drunk from the 5-gallon bucket of Kool-Aid) that their high availability numbers meant the service was running great. The service in question was just converting to an ITIL/MOF process, and the team was perplexed by the high number of recurring incidents. They were further flummoxed by incidents that did not show up in monitoring and alarming, but rather came from end users, developers, and execs at the company.
From an outside perspective it’s relatively easy to see what was happening here. They had huge holes in the monitoring infrastructure that they relied on to get an internal picture of service health. In addition, when you run an incident management team without completely integrating a problem management flow into your process, you are basically trying to bail out a boat with a one-foot hole in its bottom using a thimble. Monitoring, incident management, and continual service improvement have to be performed together like a trio of finely tuned instruments to be successful. If you can’t coordinate these three instruments as a conductor, you are doomed to failure. Without getting too deep into this trio, I wanted to touch on how important they are to each other. The rest of this article will focus on holistic monitoring, and I will leave ITIL/MOF, Incident Management, and Problem Management for a future series.
First and foremost, you must take a broad approach to monitoring. Synthetic transactions are great for getting a rough picture, but they are a sample in time; they do not represent the actual quality of service the user is experiencing, nor do they reflect the actual state of the service as a whole. Synthetic transactions by nature will not hit every server in every tier of the system, due to the complexity of top tier load balancing, middle tier hashing and load balancing, and back tier partitioning and replica state.
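To put a rough number on how little of a multi-tiered system a synthetic probe touches, here is a back-of-the-envelope sketch. It assumes each probe lands on one server per tier uniformly at random, which is a simplification of real load balancing, hashing, and partitioning. The tier sizes and probe counts are hypothetical.

```python
from math import prod


def expected_path_coverage(servers_per_tier, probes):
    """Expected fraction of distinct end-to-end paths touched by `probes`
    synthetic transactions.

    Assumption: every probe picks one server in each tier uniformly at
    random and independently of the other probes.
    """
    paths = prod(servers_per_tier)  # number of distinct server combinations
    return 1.0 - (1.0 - 1.0 / paths) ** probes


if __name__ == "__main__":
    # Hypothetical topology: 20 web, 30 API, and 32 SQL servers.
    tiers = [20, 30, 32]
    # One probe a minute for an hour, a day, and a week.
    for probes in (60, 1440, 10080):
        print(probes, f"{expected_path_coverage(tiers, probes):.1%}")
```

Even under this generous assumption, a probe per minute leaves most server combinations unexercised for a long time, which is why the organic data described next matters.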
Organic transaction monitoring, true QoS
To broaden monitoring effectiveness in a multi-tiered system, organic transaction monitoring and an associated BI infrastructure should be used alongside synthetic transaction monitoring. Within the system, the quality of every single transaction should be collected, aggregated, and stuffed into a BI warehouse so it can be reported on in close to real time. This type of monitoring is light years beyond relying solely on synthetic data and will give you true insight into how individual customers, partners, and consumers of your service perceive the quality of delivery. Knowing at every level of the system how long the transaction took and what its result was is a huge advantage. You can visually see patterns in the data that tell you more about your service in real time than weeks of testing would. You can SEE traffic patterns. You have now transformed yourself from the guy who watches red and green lights on a screen to the Systems Engineering equivalent of Cypher sitting at the console watching the Matrix code and deciphering blondes, brunettes and redheads…
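As a minimal sketch of the aggregation step, assuming each transaction is logged with a tier, a customer, a latency, and a result, the roll-up below computes count, success rate, and 95th percentile latency per group. The `Transaction` record and its field names are hypothetical; a real pipeline would feed a BI warehouse rather than an in-memory list.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    tier: str          # hypothetical tier name, e.g. "web", "api", "sql"
    customer: str      # who issued the transaction
    latency_ms: float  # how long this tier took to serve it
    ok: bool           # whether the transaction succeeded


def summarize(transactions, key=lambda t: t.tier):
    """Group transactions by `key` and report count, success rate,
    and nearest-rank 95th percentile latency for each group."""
    groups = defaultdict(list)
    for t in transactions:
        groups[key(t)].append(t)

    summary = {}
    for name, txns in groups.items():
        latencies = sorted(t.latency_ms for t in txns)
        rank = max(1, math.ceil(0.95 * len(latencies)))  # nearest-rank method
        summary[name] = {
            "count": len(txns),
            "success_rate": sum(t.ok for t in txns) / len(txns),
            "p95_latency_ms": latencies[rank - 1],
        }
    return summary
```

Passing a different `key`, for example `lambda t: t.customer`, gives the same view per customer or partner instead of per tier.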
Server HW Monitoring
No approach to monitoring is complete without an effective toolset to monitor the health of the servers you run your service on. Preventing impact to the service is another important component of monitoring. A great example of this is a SQL farm that represents the backend of a major global scale service but has insufficient server / preventative monitoring. When you engineer redundancy into your services, sometimes you take this safety net for granted. Take the case of this SQL farm, which has redundant partitioned data stores with multiple RAID 10 drives on each server. Bad drive on one of the sides of the RAID 10 array? No problem, you have redundancy; the server and service continue to hum along and your availability is wonderful. But if you are not alerting on the dirty bit / drive failure, and you are relying on datacenter staff to identify this via a walkthrough, you are sitting on an availability time bomb. The MTBF rating on a drive is not a bulletproof vest for your service. I can’t tell you how many times I have seen a multi-drive failure in a RAID 10 array (on both sides of the array) because there was no alarm and the DC folks didn’t see the red light on the drive and replace / rebuild. The larger the service, the more prone you are to errors like this. With the addition of targeted and carefully thought out server monitoring and alarming you can avoid this pitfall. This is a single example, but there are many others that you should use the server monitoring tool in your bag to solve. Make sure you have monitoring cases for all your hardware. Alerting on specific events is a great strategy to start with. Have an event catalog for the HW platform and OS you are running on and use it as a basis for this monitoring and alarming.
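One way to picture an event catalog driving alarms is a simple lookup from event ID to severity and action. This is only a sketch: the event IDs, severities, and actions below are made up for illustration, and real ones would come from your hardware vendor and OS documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    severity: str  # "page", "ticket", or "ignore"
    action: str    # what the operator should do


# Hypothetical event IDs; substitute the ones your platform actually emits.
EVENT_CATALOG = {
    "DISK_PREDICTIVE_FAILURE": CatalogEntry("ticket", "Schedule drive replacement"),
    "RAID_DEGRADED": CatalogEntry("page", "Replace failed drive and rebuild array"),
    "RAID_REBUILD_COMPLETE": CatalogEntry("ignore", "No action needed"),
}

# Events missing from the catalog still raise a ticket so gaps get noticed.
UNKNOWN = CatalogEntry("ticket", "Unknown event: add it to the catalog")


def triage(events, catalog=EVENT_CATALOG):
    """Turn (host, event_id) pairs into alerts, dropping 'ignore' events."""
    alerts = []
    for host, event_id in events:
        entry = catalog.get(event_id, UNKNOWN)
        if entry.severity != "ignore":
            alerts.append(
                {"host": host, "event": event_id,
                 "severity": entry.severity, "action": entry.action}
            )
    return alerts
```

With something like this in place, the degraded RAID 10 array pages someone the moment the first drive fails, rather than waiting for a walkthrough to spot the red light.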
We are done! Our service now has synthetic transactional monitoring, organic QoS monitoring and transactional BI, server monitoring, and our availability looks fantastic.
You are not done; you are about halfway there. WHAT?! You say, “Justin, what are you talking about! We are doing great! Availability is at an all-time high, incidents are down dramatically and the exec team is throwing a party for us!” You are definitely not done…and at the risk of depressing you, you will never be 100% done with monitoring if you are using a continual service improvement methodology. Let’s continue.
Capacity & Bottleneck monitoring (performance monitoring)
As the feature team innovates and releases code at breakneck speed to keep up with the competition, there is often little accurate consideration of how new features change the performance profile of the service. Just saying the words “performance profile of the service” is an immense simplification of the gravity of this topic. In actuality, each tier and each component of your service has to be considered individually and evaluated in isolation. Capacity and performance bottlenecks will be vastly different for each component, and hitting a bottleneck will have far reaching effects throughout your system. The bottleneck on your web tier could be CPU, on your API or cache tier it could be memory, and on your SQL/File tier it could be controller throughput, disk IO, memory, CPU, network, or disk space. In fact, you could have almost any bottleneck at any tier or component type. What makes this worse is that you need to understand what your primary and secondary bottlenecks are for each individual component in order to plan a monitoring scheme that alerts you to approaching bottlenecks. A classic case here is that a production site on V1 of the service has plenty of headroom, in each tier and component type, from the standpoint of bottlenecks. V2 gets deployed and you notice that bug fixes to existing features, changes to existing features, and new feature adoption by the user base are now eating up 35% more CPU in the API tier, and the cache hit ratio has dropped on the cache tier, which in turn is driving IOPS and disk queue length up on the SQL/File tier, getting you dangerously close to your alert threshold or, worse, an actual bottleneck. Surely you have a test team that does benchmarking, performance testing, breakpoint / bottleneck testing, failover and recovery, etc.
Too often service operators rely on QA results and “performance certifications” and fail to consider that there are far too many variables to get accurate enough results in these areas within test environments to know with confidence what will happen when the bits are deployed. To be confident in QA results in these areas you need to have stellar parity in data distribution, server ratios, load profiles, configuration, isolation, and scenarios. Even if you have this, the result set will almost never be close, because the new feature sets the test team is simulating load for have not yet been adopted by a production user base. What this means is that the load running in test for new features is basically a guess. As a production operator or engineering team responsible for the production site, you need to rely on your own assessment and monitoring here.
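A rough sketch of what "your own assessment" could look like: compare the metrics a pilot deployment produces against the previous version's baseline, per tier and per metric, and flag anything that moved the wrong way by more than a tolerance. The metric names, the 10% tolerance, and the sample values are assumptions for illustration.

```python
# Metrics where a drop (not a rise) is the bad direction. Hypothetical names.
HIGHER_IS_BETTER = {"cache_hit_ratio"}


def profile_regressions(baseline, pilot, tolerance=0.10):
    """Return (tier, metric, relative_change) for every metric that got
    worse by more than `tolerance` between baseline and pilot.

    Both inputs map (tier, metric) -> value.
    """
    regressions = []
    for key, new_value in pilot.items():
        old_value = baseline.get(key)
        if not old_value:
            # No baseline (or a zero baseline): relative change is undefined.
            continue
        tier, metric = key
        change = (new_value - old_value) / old_value
        worse_by = -change if metric in HIGHER_IS_BETTER else change
        if worse_by > tolerance:
            regressions.append((tier, metric, change))
    return sorted(regressions)


if __name__ == "__main__":
    v1 = {("api", "cpu_pct"): 40.0,
          ("cache", "cache_hit_ratio"): 0.95,
          ("sql", "disk_queue_length"): 2.0}
    v2_pilot = {("api", "cpu_pct"): 54.0,
                ("cache", "cache_hit_ratio"): 0.80,
                ("sql", "disk_queue_length"): 2.1}
    for tier, metric, change in profile_regressions(v1, v2_pilot):
        print(f"{tier}/{metric}: {change:+.0%}")
```

The point of the design is that the comparison runs against real production traffic on a small slice of the site, not against simulated load.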
Some rules to live by related to Capacity and Bottlenecks
Rule 1 – You must understand the primary and secondary bottlenecks for each service tier and component type
Rule 2 – Ensure you have sufficient monitoring in place to identify trends that will ultimately lead to a violation of the maximum capacity of a tier or component
Rule 3 – Never deploy (anything – hotfixes, releases, patches, etc.) broadly without piloting and seeing the ACTUAL impact to the service as a whole (each component, each tier)
Rule 4 – Never let a bottleneck occur; your capacity planning and service expansion should happen far enough in advance to avoid violations
Rule 5 – Never trust without data. Test, test, test. Validate with data.
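Rules 2 and 4 boil down to spotting a trend early enough to act on it. As a minimal sketch, assuming you sample a utilization metric once a day, a least-squares line through the samples gives a rough estimate of how many days remain before a threshold is crossed. A straight-line fit is an assumption; real growth is often seasonal or stepwise, so treat the result as a prompt to investigate, not a forecast.

```python
def days_until_threshold(samples, threshold):
    """Estimate days from the last sample until a fitted line reaches
    `threshold`.

    samples: list of (day, value) pairs.
    Returns 0.0 if the fitted value is already at or above the threshold,
    None if the trend is flat or decreasing.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")

    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span at least two distinct days")

    # Ordinary least-squares slope and intercept.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    intercept = mean_y - slope * mean_x

    last_day = max(x for x, _ in samples)
    fitted_now = slope * last_day + intercept
    if fitted_now >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - fitted_now) / slope
```

For example, CPU samples of 50, 52, 54, and 56 percent on four consecutive days with an 80 percent threshold give an estimate of 12 days, which tells you whether your expansion lead time is long enough.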
Point of Service Monitoring
Up to this point in the discussion we have talked about monitoring methods that are focused geographically near the system. When we ask what happens outside of our sphere of influence, or outside the parts of the system under our control, things get a little more complicated. Everything we have considered thus far is judged from the point of being physically at or near the system. What happens when our system is located in several US datacenters and geo distributed across those datacenters, and we start getting reports of system availability issues from users in Canada, China, India or anywhere else in the world? Our initial reaction is “surely you can’t expect me to monitor and control the internet globally?” While we certainly can’t control it, we do have some ways to monitor and influence it. If a large number of users in Canada are paying for your service and they cannot get to it reliably at times due to any number of network issues between you and them, do you think they will be able to digest the technical significance of route flapping or a routing loop in a poorly peered transit network? They will not be able to digest this and will blame your service for the hit in availability or reliability. With this in mind, it’s critical to understand where your client base is, where your partner base is, and to monitor the actual experience these customers have from their region. If you find recurring problems or patterns in specific geographical areas, it may make sense to work directly with transit networks, ISPs and backbone providers to alleviate the issues where possible. Have remote monitoring in key areas according to the distribution of your customers to be able to see what they see and solve the problems. If your application has a client that customers use, consider instrumenting it with the ability to report back end to end QoS. If your application is web based, synthetic transactions from a monitoring node in the area may be the best option.
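Assuming clients or regional monitoring nodes report back a region and a success flag for each request, a small roll-up like the following could surface regions whose experience lags the target. The report format, the minimum sample count, and the availability target are all assumptions for the sake of the sketch.

```python
from collections import Counter


def regional_availability(reports, min_samples=100):
    """Compute availability per region from (region, ok) reports.

    Regions with fewer than `min_samples` reports are left out, since a
    handful of failures would otherwise swing the number wildly.
    """
    total, good = Counter(), Counter()
    for region, ok in reports:
        total[region] += 1
        good[region] += int(bool(ok))
    return {r: good[r] / n for r, n in total.items() if n >= min_samples}


def lagging_regions(availability, target=0.999):
    """Regions below the target, worst first."""
    below = [r for r, a in availability.items() if a < target]
    return sorted(below, key=lambda r: availability[r])
```

Recurring entries in the lagging list for the same geography are the pattern worth taking to a transit network, ISP, or backbone provider.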
Network Monitoring
Many teams have separate networking organizations that don’t tightly interact with the teams responsible for operating the service. In addition, many global scale services have taken the time to build networks that have few or no SPOFs. It’s hard to think of a global scale service that does not have redundant L2 and L3, redundant load balancing, an HSRP implementation, and strong redundant peering at the network core layer. Even so, it’s critical to expose network monitoring results and events to the team that is responsible for the service. Without this information the service delivery team is blind to network risks or events that could cause a hit to availability. If a top of rack switch is throwing errors and could take out an entire rack of API servers or, worse, SQL servers, the service delivery team could mitigate this risk by shunting traffic away from the VIPs servicing that rack or by failing over the SQL servers to other replicas residing in a different rack, colocation or even datacenter. Network monitoring design, implementation, and alarms should be exposed to the service operations team to mitigate risk.
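To illustrate what the service team can do once network events reach them, here is a sketch that maps a top of rack switch alert to the servers behind it and a suggested mitigation per role. The switch names, host names, and topology map are hypothetical; in practice this data would come from your inventory or CMDB.

```python
# Hypothetical topology: top of rack switch -> [(host, role), ...]
RACKS = {
    "tor-sw-a12": [("api-101", "api"), ("api-102", "api")],
    "tor-sw-b07": [("sql-201", "sql"), ("sql-202", "sql")],
}

# Suggested mitigation per server role, following the options in the text.
MITIGATIONS = {
    "api": "shunt traffic away from the VIP servicing this host",
    "sql": "fail over to a replica in a different rack",
}


def mitigation_plan(switch, racks=RACKS, mitigations=MITIGATIONS):
    """List (host, action) pairs for every server behind `switch`.

    Unknown switches yield an empty plan; roles without a defined
    mitigation are flagged for manual investigation.
    """
    return [
        (host, mitigations.get(role, "investigate manually"))
        for host, role in racks.get(switch, [])
    ]
```

A switch error event for `tor-sw-b07` would then produce a failover item for each SQL server in that rack before the switch actually fails.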
Any monitoring strategy should take a multipronged approach to cover as much of the surface area of the service as is possible and economical. Ultimately the right monitoring approach for your service will be based on what your service delivers, your availability goals, and your customers’ expectations. Your system may have many other considerations and areas not covered in this discussion where you require special case monitoring to ensure you have an accurate picture of your service as a whole. In fact, we did not cover:
• Version monitoring
• Configuration monitoring
• Change monitoring
• Environmental monitoring
• Backup monitoring
• Security monitoring
There is also a slew of other monitoring topics. If there is sufficient interest, it’s likely I will cover these topics in a follow-up. If you have questions on methods or on implementing a specific type of monitoring, feel free to ask. I’m happy to help where I can.
Thank you for taking the time to read and comment.
© Justin White and JCW Press, 2012. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Justin White and Nolander Press with appropriate and specific direction to the original content.