Sunday, September 07, 2008

Quantifying DBA Workload and Measuring Automation ROI

I mentioned in a prior blog entry that I would share some insight on objectively measuring DBA workload to determine how many DBAs are needed in a given environment. Recently, I received a comment on that posting (which I’m publishing below verbatim), and responding to it prompted me to cover the above topic as well and make good on my word.

Here’s the reader’s comment:
Venkat,
I was the above anonymous poster. I have used Kintana in the past for doing automation of E-Business support tasks. It was very good. The hard part was putting together the ROI for the investment.

My only concern about these analysts who write these reports is that they are not MECE (Mutually Exclusive, Collectively Exhaustive). They then circulate it to the CIOs, who use it to benchmark their staff without all the facts.

So in your estimate, out of the 40 hours a DBA works in a week (hahaha), how many hours can the RBA save?

The reason I ask is that repetitive tasks take only 10-20% of the DBA's time. Most of the time is spent working on new projects, providing development assistance, identifying issues in poorly performing systems and so on. I know this because I have been doing this for the past 14 years.

Also, from the perspective of being proactive versus reactive, let us take two common scenarios: a disk failure and a crazy workload hijacking the system. The users would know about it at about the same time you do. How would an RBA help there?

Thanks
Mahesh
========

Here’s my response:

Mahesh,

Thanks for your questions. I’m glad you liked the Kintana technology. In fact, if you found that (somewhat antiquated) tool useful, chances are you will absolutely fall in love with some of the newer run book automation (RBA) technologies, specifically database automation products that comply with RBA 2.0 norms, like Data Palette. Defining a business case prior to purchasing/deploying new technology is key. Similarly, measuring the ROI gained (say, on a quarterly basis) is equally relevant. Fortunately, both of these can be boiled down to a science with some upfront work. I’m providing a few tips below on how to accomplish this as I simultaneously address your questions.

Step 1] Identify the # of DBAs in your organization – both onshore and offshore. Multiply that number by the blended average DBA cost. Take, for instance, a team of 8 DBAs – all in the US. Assuming the average loaded cost per DBA is $120K/year, we are talking about $960K being spent per year on the DBAs.

Step 2] Understand the average work pattern of the DBAs. At first blush it may seem that only 10-20% of a DBA’s workload is repeatable. My experience reveals otherwise. Some DBAs may say that “most of their time is spent on new projects, providing development assistance, identifying issues in poorly performing systems and so on.” But ask yourself, what does that really mean? These are all broad categories. If you get more granular, you will find repeatable task patterns in each of them. For instance, “working on new projects” may involve provisioning new dev/test databases, refreshing schemas with production data, etc. These are repeatable, right? Similarly, “identifying issues in poorly performing systems” may involve a consistent triage/root-cause-analysis pattern (many senior DBAs have a methodology for dealing with this in their respective environments) and can be boiled down to a series of repeatable steps.

It’s amazing how many of these activities can be streamlined and automated, if the underlying task pattern is identified and mapped out on a whiteboard. Also, I find that rather than asking DBAs “what do you do…”, ticketing systems often reveal a better picture. I recently mined 3 months’ worth of tickets from a Remedy system for an organization with about 14 DBAs (working across Oracle and SQL Server) and the following picture emerged (all percentages are rounded up):

- New DB Builds: 210 hours (works out to approx. 3% of overall DBA time)
- Database Refreshes/Cloning: 490 hours (7%)
- Applying Quarterly Patches: 420 hours (6%)
- SQL Server Upgrades (from v2000 to v2005): 280 hours (4%)
- Dealing with failed jobs: 140 hours (2%)
- Space management: 245 hours (3.5%)
- SOX-related database audits and remediation: 280 hours (4%)
- … (remaining data truncated for brevity…)

Now you get the picture… When you begin to add up the percentages, they should total 100% (otherwise you have gaps in your data; interview the DBAs to fill those gaps).
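
To make the Step 1 and Step 2 math concrete, here’s a minimal sketch in Python using the hypothetical figures from the example above (14 DBAs, a 3-month ticket sample, roughly 2,000 working hours per DBA per year). The ticket export format and category names are assumptions, not a prescription for any particular ticketing system:

```python
# Tally ticket hours per task category and express each as a share of the
# team's total capacity over the sampling window (hypothetical data).
from collections import defaultdict

NUM_DBAS = 14                  # team size from the example above
HOURS_PER_DBA_PER_MONTH = 167  # ~2,000 hours/year divided by 12
SAMPLE_MONTHS = 3

# (task category, hours logged) -- in practice, exported from the ticketing system
tickets = [
    ("New DB Builds", 210),
    ("Database Refreshes/Cloning", 490),
    ("Applying Quarterly Patches", 420),
    ("SQL Server Upgrades", 280),
    ("Dealing with failed jobs", 140),
    ("Space management", 245),
    ("SOX-related audits and remediation", 280),
    # ... remaining categories omitted ...
]

capacity = NUM_DBAS * HOURS_PER_DBA_PER_MONTH * SAMPLE_MONTHS  # ~7,000 hours

hours_by_task = defaultdict(float)
for task, hours in tickets:
    hours_by_task[task] += hours

for task, hours in sorted(hours_by_task.items(), key=lambda kv: -kv[1]):
    print(f"{task}: {hours:.0f} hours ({hours / capacity:.1%} of DBA time)")

accounted = sum(hours_by_task.values())
print(f"Unaccounted time: {capacity - accounted:.0f} hours "
      f"({(capacity - accounted) / capacity:.1%}) -- fill the gap via DBA interviews")
```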

Step 3] Once I have this data, I pinpoint the top 3 activities - not isolated issues like dealing with disk failure, but the routine tasks that the DBAs need to do multiple times each week, or perhaps each day – like the morning healthcheck, dealing with application faults such as blocking locks, transaction log cleanup, and so on.

Also, as part of this process, if I see that a top item’s description pertains to a broad category such as “working on new projects…”, I break the category down into a list of tangible tasks such as “provisioning a new database”, “compliance-related scanning and hardening”, etc.

Step 4] Now I’m ready to place each of those top activities under the microscope and begin to estimate how much of it follows a specific pattern. Note that a task may not be 100% repeatable, but that doesn’t mean it cannot be streamlined across environments and automated - even a 20-30 percent gain per activity has huge value!

Once you add up the efficiency numbers, a good RBA product should allow you to gain anywhere from 20 to 40 percent in overall efficiency - about $200K to $400K of higher productivity in the case of our example with 8 DBAs - which means they can take on more databases without additional head-count, or support the existing databases in a more comprehensive manner (Step 5 below). These numbers should be treated as the cornerstone of a solid business case, and as the basis for measuring value post-implementation – task by task, process by process.
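
Here’s a back-of-the-envelope sketch of how those per-task estimates roll up into a dollar figure for the business case, using the 8-DBA/$120K example from Step 1. The three tasks and their automatable fractions below are purely illustrative placeholders; substitute the shares you derived in Step 2 and the per-task estimates from Step 4:

```python
# Roll per-task automation estimates up into an overall efficiency gain
# and a productivity dollar value (illustrative numbers only).
NUM_DBAS = 8
LOADED_COST = 120_000
annual_spend = NUM_DBAS * LOADED_COST  # $960K/year from Step 1

# task: (share of overall DBA time, fraction of that task you estimate is automatable)
tasks = {
    "Database refreshes/cloning": (0.07, 0.80),
    "Quarterly patching":         (0.06, 0.70),
    "Morning healthchecks":       (0.05, 0.90),
}

overall_gain = sum(share * frac for share, frac in tasks.values())
print(f"Overall efficiency gain: {overall_gain:.1%}")
print(f"Productivity value reclaimed: ${overall_gain * annual_spend:,.0f}/year")
```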

(Note: The data from Step 2 can also be used to determine how many DBAs you actually need in your environment to handle the day-to-day workload. If you only have the activities listed and not the corresponding DBA time in your ticketing system, no worries… Have one of your mid-level DBAs (someone not too senior, not too junior) assign an educated guess to each activity in terms of the number of hours it would take him/her to carry out each of those tasks. Multiply that by the number of times each task is listed in the ticketing system and derive a weekly or monthly total. Multiply that by 52 or 12 to determine the total number of DBA hours expended on those activities per year. Divide that by 2,000 (the average number of hours/year/DBA) and you have the number of DBAs needed in your environment. Use a large sample-set from the ticketing system (say, a year) to avoid short-term discrepancies. If you don’t have a proper ticketing system, no problem – ask your DBA colleagues to track what they are working on in a spreadsheet for a full week or month. That should give you a starting point to objectively analyze their workload and build the case for automation technology, adding more head-count, or both!)
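
A minimal sketch of the headcount math in the note above; the task list, hours-per-occurrence estimates and weekly counts are hypothetical stand-ins for your mid-level DBA’s educated guesses and your own ticket counts:

```python
# Estimate required DBA headcount from per-task effort and frequency
# (hypothetical workload figures).
HOURS_PER_DBA_PER_YEAR = 2_000

# task: (estimated hours per occurrence, occurrences per week)
workload = {
    "Database refresh":   (6, 4),
    "Failed-job triage":  (1, 10),
    "Space management":   (2, 5),
    "New database build": (8, 1),
}

weekly_hours = sum(hrs * count for hrs, count in workload.values())
annual_hours = weekly_hours * 52
dbas_needed = annual_hours / HOURS_PER_DBA_PER_YEAR

print(f"Weekly workload: {weekly_hours} hours")
print(f"Annual workload: {annual_hours} hours")
print(f"DBAs needed for this workload: {dbas_needed:.1f}")
```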

Step 5] Now sit down with your senior DBAs (your thought-leaders) and identify all the tasks/activities that they would like to do more of to stay ahead of the curve, avoid some of those frequent incidents and make performance more predictable and scalable - activities such as more capacity planning, proactive maintenance and healthchecks, more architecture/design work, working more closely with Development to avoid some of those resource-intensive SQL statements, using features/capabilities in newer DBMS versions, auditing backups and DR sites more thoroughly, defining standard DBA workbooks, etc. Also specify how that will help the business – in terms of reduced performance glitches and service interruptions, fewer war-room sessions and higher uptime, better statutory compliance and so on. The value-add from increased productivity should become part of the business case. One of my articles talks more about where the additional time (gained via automation) can be fruitfully expended to increase operational excellence.

My point here is, there’s no such thing in IT as a “one-time activity”. When you go granular and start looking at end-to-end processes, you see a lot of commonalities. And then all you have to do is summarize the data, crunch the numbers, and boom - you’ve nailed the true ROI potential! Sounds simple, huh? It truly is.

Last but not least, regarding your two examples – “disk failure” and “a crazy workload hijacking the system” – those may not necessarily be the best examples to start with when you begin to build an automation efficiency model. You need to go with the 80-20 rule - start with the 20% of the task patterns that take up 80% of your time. You refer to your use cases as “common scenarios”, but I’m sure you don’t have the failed-disk problem occurring too frequently. If these issues do happen frequently in your environment (at least in the short term) and you have no control over them other than reacting in a certain way, then as Step 3 suggests, let’s drill into how you react to them. That’s the process you can streamline and automate.

Let me use the “crazy workload” example to expound further. Say I’m the DBA working the early Monday morning shift and I get a call (or a ticket) from Help Desk stating that a user is complaining about “slow performance”. So I (more or less) carry out the following steps:

1. Identify which application it is (billing, web, SAP, Oracle Financials, data warehouse, etc.)
2. Identify all the tiers associated with it (web server, app server, clustered DB nodes, etc.)
3. Evaluate what kind of DB the app is using (say, a 2-node Oracle 10.2 RAC)
4. Run a healthcheck on the DB server (check on CPU levels, free memory, swapping/paging, network traffic, disk space, top process list, etc.) to see if anything is amiss
5. Run a healthcheck on the DB and all the instances (sessions, SQL statements, wait events, alert-log errors, etc.)
6. If everything looks alright from steps 4 and 5, I update the ticket to state the database looks fine and reassign the ticket to another team (sys admin, SAN admin, web admin team, or even back to the Help Desk for further analysis of the remaining tiers in the application stack).
7. If I see a process consuming copious amounts of I/O or CPU on a DB server, I check to see if it’s a DB-related process or some ad-hoc process a sys admin has kicked off (say, an ad-hoc backup in the middle of the day!). If it’s a database process, I check to see what that session is doing inside the database – running a SQL statement, waiting on a resource, etc. Based on what I see, I may take additional steps, such as running an OS or DB trace on it – until I eliminate a bunch of suspects and narrow down the root cause. Once I ascertain the symptoms and the cause, I may kill the offending session to alleviate the issue and get things back to normal - if it’s a known issue (and I have pre-approval to kill it). If I can’t resolve it then and there, I gather the relevant stats, update the ticket and reassign it to the group that has the authority to deal with it.

As the above example shows, many of these steps (specifically, 1 to 7) can be modeled as a “standard operating procedure” and automated. If the issue identified is a known problem, you can build a rule in the RBA product (assuming the product supports RBA 2.0 norms) to pinpoint the problem signature and link it to a workflow that applies the pre-defined fix, along with updating/closing the ticket. If the problem is not a known issue, the workflow can simply carry out steps 1 to 7, update the ticket with the relevant details and assign it to the right person or team. Now I don’t need to do these steps manually every time I get a call stating “there seems to be a performance problem in the database…” and, more importantly, if it’s truly a database problem, I can deal with it even before the end user experiences it and calls the Help Desk.
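
To illustrate, here is a minimal, self-contained sketch of those seven steps expressed as a single runbook-style workflow in Python. Every helper below is a hypothetical placeholder returning canned data (a real RBA product would supply connectors to the CMDB, the OS and the database); the point is the control flow, not the implementation:

```python
# A triage runbook modeled after steps 1-7 above, with stubbed-out connectors.
from dataclasses import dataclass, field

@dataclass
class Ticket:
    id: str
    description: str
    notes: list = field(default_factory=list)
    assigned_to: str = "DBA"
    closed: bool = False

# --- placeholder connectors (canned data so the sketch runs end to end) ------
def identify_application(ticket): return "billing"                               # step 1
def lookup_tiers(app):            return ["web", "app", "db"]                    # step 2
def lookup_database(app):         return {"name": "BILLPRD", "host": "dbhost01"} # step 3
def server_healthcheck(host):     return True, ["CPU 35%", "swap OK"]            # step 4
def database_healthcheck(db):     return False, ["session 1182 burning CPU"]     # step 5
def find_top_consumer(db):        return {"sid": 1182, "known_issue": True, "preapproved": True}
def kill_session(db, session):    print(f"killing session {session['sid']} on {db['name']}")

def update_ticket(ticket, note, close=False):
    ticket.notes.append(note)
    ticket.closed = close

def reassign_ticket(ticket, team):
    ticket.assigned_to = team

# --- the workflow itself ------------------------------------------------------
def triage_performance_ticket(ticket):
    app = identify_application(ticket)
    tiers = lookup_tiers(app)          # step 2: would scope the checks below
    db = lookup_database(app)

    server_ok, server_findings = server_healthcheck(db["host"])
    db_ok, db_findings = database_healthcheck(db)

    if server_ok and db_ok:
        # step 6: database tier looks healthy -- hand the ticket to the next team
        update_ticket(ticket, "DB tier healthy: " + "; ".join(server_findings + db_findings))
        reassign_ticket(ticket, team="sysadmin")
        return ticket

    # step 7: drill into the offending session
    culprit = find_top_consumer(db)
    if culprit["known_issue"] and culprit["preapproved"]:
        kill_session(db, culprit)
        update_ticket(ticket, f"Known issue; killed session {culprit['sid']}", close=True)
    else:
        update_ticket(ticket, "Diagnostics attached: " + "; ".join(db_findings))
        reassign_ticket(ticket, team="app-dba")
    return ticket

print(triage_performance_ticket(Ticket("INC001", "slow performance")))
```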

In certain other situations, when the Help Desk gets a phone call about performance issues, they can execute the same triage workflow and either have a ticket created/assigned automatically or solve the issue at their level if appropriate. This kind of remediation avoids the need for further escalation and, in many cases, avoids incorrect escalations from the Help Desk (how many times have you been paged for a performance problem that’s not caused by the database?). If the problem cannot be automatically remediated (e.g., a failed disk), the workflow can open a ticket and assign it to the Sys Admin or Storage team.

This kind of scenario not only empowers the Help Desk and lets them be more effective, but also reduces the workload for Tier 2/3 admin staff. Last but not least, it eliminates a significant number of the false-positive alerts that the DBAs have to deal with. In one recent example, the automation deployment team I was working with helped a customer’s DBA team go from over 2,000 replication-related alerts a month (80% of them were false positives, but needed to be looked at and triaged anyway…) to just over 400. I don’t know about you, but to me, that’s gold!

One final thing: this may sound somewhat Zen, but do look at an automation project as an ongoing journey. By automating just 2 or 3 processes, you may not get all the value you can. Start with your top 2 or 3 processes, but once those are automated, audit the results, measure the value and then move on to the next top 2 or 3 activities. Continue this cycle until the law of diminishing returns kicks in (usually after 4-5 cycles) and I guarantee your higher-ups and your end-users alike will love the results. (More on that in this whitepaper.)

Wednesday, September 03, 2008

Clouds, Private Clouds and Data Center Automation

As part of Pacific Crest’s Mosaic Expert team, I had the opportunity to attend their annual Technology Leadership Forum in Vail last month. I participated in half-a-dozen panels and was fortunate to meet with several contributors in the technology research and investment arena. Three things seemed to rank high on everyone’s agenda: cloud computing and its twin enablers - virtualization and data center automation. The cloud juggernaut is making everyone want a piece of the action – investors want to invest in the next big cloud (pun intended!), researchers want to learn about it and CIOs would like to know when and how to best leverage it.

Interestingly, even “old-world” hosting vendors like Savvis and Rackspace are repurposing their capabilities to become cloud computing providers. In a similar vein, InformationWeek recently reported that some of the telecom behemoths like AT&T and Verizon, with excess data center capacity, have jumped into the fray with Synaptic Hosting and Computing as a Service - their respective cloud offerings. And to add to the mix, the term “private cloud” is floating around to describe organizations that are applying SOA concepts to data center management, making server, storage and application resources available as a service for users, project teams and other IT customers to leverage (complete with resource metering and billing) – all behind the corporate firewall.

As already stated in numerous publications, there are obvious concerns around data security, compliance, performance and uptime predictability. But the real question seems to be: what makes an effective cloud provider?

Google’s Dave Girouard was a keynote presenter at Pacific Crest and he touched upon some of the challenges facing Google as they opened up their Google Apps offering in the cloud. In spite of pouring hundreds of millions of dollars into cloud infrastructure, they are still grappling with stability concerns. It appears that the size of the company and the type of cloud (public or private) are less relevant; what matters more is the technology components and corresponding administrative capabilities behind the cloud architecture.

Take another example: Amazon. They are one of the earliest entrants into cloud computing and have the broadest portfolio of services in this space. Their AWS (Amazon Web Services) offering includes storage, queuing, database and a payment gateway in addition to core computing resources. Similar to Google, they have invested millions of dollars, yet are prone to outages.

In my opinion, while concerns over privacy, compliance and data security are legitimate and will always remain, the immediate issue is the scalability and predictability of performance and uptime. Clouds are being touted as a good way for smaller businesses and startups to gain resources, as well as for businesses with cyclical resource needs (e.g., retail) to gain incremental resources at short notice. I believe the current crop of larger cloud computing providers such as Amazon, Microsoft and Google can do a far better job with compliance and data security than the average startup/small business. (Sure, users and CIOs need to weigh their individual risk versus upside prior to using a particular cloud provider.) However, for those businesses that rely on the cloud for their bread-and-butter operations, whether cyclical or year-round, uptime and performance considerations are crucial. If the service is not up, they don’t have a business.

Providing predictable uptime and performance always boils down to a handful of areas. If provisioned and managed correctly, cloud computing has the potential to be used as the basis for real-time business (rather than being relegated to the status of backup/DR infrastructure). But the key questions that CIOs need to ask their vendors are: What is behind the so-called cloud architecture? How stable is that technology? How many moving parts does it have? Can the vendor provide component-level SLAs and visibility? As providers like AT&T and Verizon enter the fray, they can learn a lot from Amazon and Google’s recent snafus and leverage technologies that simplify the environment and enable it to operate in lights-out mode – making the difference between a reliable cloud offering and one that’s prone to failures.

The challenge, however, as Om Malik points out on his GigaOm blog, is that much of cloud computing infrastructure is fragile because providers are still using technologies built for a much less strenuous web. Data centers are still being managed with a significant amount of manual labor. “Standards” merely mean processes documented across reams of paper and plugged into SharePoint-type portals. No doubt, people are trained to use these standards. But documentation and training don’t always account for those operators being plain forgetful, or sick, on vacation, or leaving the company and being replaced (temporarily or permanently) by other people who may not have the same operating context within the environment. Analyst studies frequently point out that over 80% of outages are due to human error.

The problem is, many providers, while issuing weekly press releases proclaiming their new cloud capabilities, haven’t really transitioned their data center management from manual to automated. They may have embraced virtualization technologies like VMware and Hyper-V, but they are still grappling with the same old methods combined with some very hard-working and talented people. Virtualization makes deployment fast and easy, but it also significantly increases the workload for the team that’s managing that new asset behind the scenes. Because virtual components are so much easier to deploy, the result is server and application sprawl, and demand for work activities such as maintenance, compliance, security, incident management and service request management goes through the roof. Companies (including the well-funded cloud providers) do not have the luxury of indefinitely adding head-count, nor is throwing more bodies at the problem always a good idea. They need to examine each layer in the IT stack and evaluate it for cloud readiness. They need to leverage the right technology to manage each asset throughout its lifecycle in lights-out mode – from provisioning to upgrades and migrations, and everything in between.

That’s where data center automation comes in. Data center automation technologies have been around almost as long as virtualization and have proven to have the kind of maturity required for reliable lights-out automation. Data center automation products from companies such as HP (at the server, storage and network levels) and Stratavia (at the server, database and application levels) make a compelling case for marrying both physical and virtual assets behind the cloud with automation to enable dynamic provisioning and post-provisioning life-cycle management with reduced errors and less stress on human operators.

Data center automation is a vital component of cloud computing enablement. Unfortunately, service providers (internal or external) that make the leap from antiquated assets to virtualization to the cloud without proper planning and deployment of automation technologies tend to provide patchy services, giving the cloud model a bad name. Think about it… Why can some providers offer dynamic provisioning and real-time error/incident remediation in the cloud, while others can’t? How can some providers be agile in getting assets online and keeping them healthy, while others falter (or don’t even talk about it)? Why do some providers do a great job offering server cycles or storage space in the cloud, but a lousy job with databases and applications? The difference is well-designed and well-implemented data center automation - at every layer across the infrastructure stack.