Sunday, September 07, 2008

Quantifying DBA Workload and Measuring Automation ROI

I mentioned in a prior blog entry that I would share some insight on objectively measuring DBA workload to determine how many DBAs are needed in a given environment. Recently, I received a comment on that posting (which I’m publishing below verbatim), the response to which prompted me to cover the above topic as well and make good on my word.

Here’s the reader’s comment:
Venkat,
I was the above anonymous poster. I have used Kintana in the past for doing automation of E-Business support tasks. It was very good. The hard part was to put the ROI for the investment.

My only concern about these analysts who write these reports is that they are not MECE (Mutually Exclusive, Collectively exhaustive). They then circulate it to the CIO's who use it to benchmark their staff with out all the facts.

So in your estimate, out of the 40 hours a DBA works in a week (hahaha), how many hours can the RBA save?

The reason I ask is that repetitive tasks take only 10-20% of the DBA's time. Most of the time is spent working on new projects, providing development assistance, identify issues in poorly performing systems and so on. I know this because I have been doing this for the past 14 years.

Also, from the perspective of being proactive versus reactive, let us take two common scenario's. Disk Failure and a craxy workload hijacking the system. The users would know it about the same time you know it too. How would a RBA help there?

Thanks
Mahesh
========

Here’s my response:

Mahesh,

Thanks for your questions. I’m glad you liked the Kintana technology. In fact, if you found that (somewhat antiquated) tool useful, chances are you will absolutely fall in love with some of the newer run book automation (RBA) technologies, specifically database automation products that comply with RBA 2.0 norms like Data Palette. Defining a business case prior to purchasing/deploying new technology is key. Similarly, measuring the ROI gained (say, on a quarterly basis) is equally relevant. Fortunately, both of these can be boiled down to a science with some upfront work. I’m providing a few tips below on how to accomplish this, as I simultaneously address your questions.

Step 1] Identify the # of DBAs in your organization – both onshore and offshore. Multiply that number by the blended average DBA cost. Take, for instance, a team of 8 DBAs – all in the US. Assuming the average loaded cost per DBA is $120K/year, we are talking about $960K being spent per year on the DBAs.

Step 2] Understand the average work pattern of the DBAs. At first blush it may seem that only 10-20% of a DBA’s workload is repeatable. My experience reveals otherwise. Some DBAs may say that “most of their time is spent on new projects, providing development assistance, identify issues in poorly performing systems and so on.” But ask yourself, what does that really mean? These are all broad categories. If you get more granular, you will find repeatable task patterns in each of them. For instance, “working on new projects” may involve provisioning new dev/test databases, refreshing schemas with production data, etc. These are repeatable, right? Similarly, “identifying issues in poorly performing systems” may involve a consistent triage/root cause analysis pattern (especially since many senior DBAs tend to have a methodology for dealing with this in their respective environments) and can be boiled down to a series of repeatable steps.

It’s amazing how many of these activities can be streamlined and automated, if the underlying task pattern is identified and mapped out on a whiteboard. Also, I find that rather than asking DBAs “what do you do…”, ticketing systems often reveal a better picture. I recently mined 3 months’ worth of tickets from a Remedy system for an organization with about 14 DBAs (working across Oracle and SQL Server) and the following picture emerged (all percentages are rounded up):

- New DB Builds: 210 hours (works out to approx. 3% of overall DBA time)
- Database Refreshes/Cloning: 490 hours (7%)
- Applying Quarterly Patches: 420 hours (6%)
- SQL Server Upgrades (from v2000 to v2005): 280 hours (4%)
- Dealing with failed jobs: 140 hours (2%)
- Space management: 245 hours (3.5%)
- SOX related database audits and remediation: 280 hours (4%)
- … (remaining data truncated for brevity…)

Now you get the picture… When you begin to add up the percentages, it should total 100% (otherwise you have gaps in your data; interview the DBAs to fill those gaps.)
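If you want to replicate this kind of ticket mining, a quick script goes a long way. Here’s a minimal sketch in Python, assuming a hypothetical CSV export from your ticketing system with category and hours columns (Remedy’s actual export format will differ, so treat the field names and file name as placeholders):

```python
import csv
from collections import defaultdict

# Team capacity over the sampling window, e.g., 14 DBAs x 13 weeks x 40 hours/week.
# Adjust to your own head-count and sampling period.
TOTAL_DBA_HOURS = 14 * 13 * 40

def summarize_tickets(path):
    """Tally hours by task category from a ticket export (assumed columns: category, hours)."""
    hours_by_category = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            hours_by_category[row["category"]] += float(row["hours"])
    return hours_by_category

if __name__ == "__main__":
    totals = summarize_tickets("ticket_export.csv")  # hypothetical file name
    for category, hours in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{category:<45} {hours:7.0f} hours  ({100 * hours / TOTAL_DBA_HOURS:.1f}%)")
    gap = TOTAL_DBA_HOURS - sum(totals.values())
    print(f"Unaccounted time: {gap:.0f} hours ({100 * gap / TOTAL_DBA_HOURS:.1f}%) "
          "-- fill via DBA interviews")
```

The “unaccounted time” line is exactly where the gaps show up; that’s the portion you chase down via interviews.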

Step 3] Once I have this data, I pinpoint the top 3 activities - not isolated issues like dealing with disk failure, but the routine tasks that the DBAs need to do multiple times each week, or perhaps each day – like the morning healthcheck, dealing with application faults such as blocked locks, transaction log cleanup, and so on.

Also, as part of this process, if I see that the top items’ description pertains to a category such as “working on new projects…”, I break down the category into a list of tangible tasks such as “provisioning a new database”, “compliance-related scanning and hardening”, etc.

Step 4] Now I’m ready to place each of those top activities under the microscope and begin to estimate how much of it follows a specific pattern. Note that 100% of any task may not be repeatable, but that doesn’t mean it cannot be streamlined across environments and automated - even a 20-30 percent gain per activity has huge value!

Once you add up the efficiency numbers, a good RBA product should allow you to gain anywhere from 20 to 40 percent overall efficiency gains - about $200K to $400K of higher productivity in the case of our example with 8 DBAs - which means they can take on more databases without additional head-count, or support the existing databases in a more comprehensive manner (step 5 below). These numbers should be treated as the cornerstone for a solid business case, and for measuring value post-implementation – task by task, process by process.
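To show how the per-task numbers roll up into that overall figure, here’s a hedged back-of-the-envelope sketch. The task hours and the “automatable fraction” per task below are illustrative assumptions for an 8-DBA team, not measurements from any particular environment:

```python
# Roll-up of per-task savings into an overall efficiency gain and dollar value.
DBA_COUNT = 8
LOADED_COST = 120_000                     # per DBA per year (assumption from Step 1)
ANNUAL_TEAM_HOURS = DBA_COUNT * 2_000     # ~2,000 working hours per DBA per year

# (annual hours spent on the task, estimated fraction that follows a repeatable pattern)
tasks = {
    "Database refreshes/cloning": (1800, 0.7),   # illustrative values
    "Quarterly patching":         (1500, 0.6),
    "New DB builds":              (800,  0.8),
    "Failed job handling":        (600,  0.5),
    "Healthchecks/space mgmt":    (1200, 0.6),
}

hours_freed = sum(hours * fraction for hours, fraction in tasks.values())
efficiency  = hours_freed / ANNUAL_TEAM_HOURS
value       = efficiency * DBA_COUNT * LOADED_COST

print(f"Hours freed per year: {hours_freed:,.0f}")        # ~3,820
print(f"Overall efficiency gain: {efficiency:.0%}")       # ~24%
print(f"Productivity value: ${value:,.0f}")               # ~$229,000
```

Swap in your own task list, hours and automatable fractions from Steps 2 through 4 and this becomes the skeleton of your business case.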

(Note: The data from Step 2 can also be used to determine how many DBAs you actually need in your environment to handle the day-to-day workload. If you only have the activities listed and not the corresponding DBA time in your ticketing system, no worries… Have one of your mid-level DBAs (someone not too senior, not too junior) assign his/her educated guess to each activity in terms of the number of hours it would take him/her to carry out each of those tasks. Multiply that by the number of times each task is listed in the ticketing system and derive a weekly or monthly total. Multiply that by 52 or 12 to determine the total # of DBA hours expended for those activities per year. Divide that by 2,000 (avg. # of hours/year/DBA) and you have the requisite # of DBAs needed in your environment. Use a large sample-set from the ticketing system (say, a year) to avoid short-term discrepancies. If you don’t have a proper ticketing system, no problem – ask your DBA colleagues to track what they are working on within a spreadsheet for a full week or month. That should give you a starting point to objectively analyze their workload and build the case for automation technology or adding more head-count, or both!)
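The math in that note boils down to: required DBAs = (sum over tasks of occurrences × estimated hours per occurrence, annualized) ÷ 2,000. A quick sketch, with made-up task counts and estimates:

```python
# Hypothetical monthly ticket counts and a mid-level DBA's hour estimates per occurrence.
monthly_tasks = {
    "database refresh":   (12, 6.0),    # (occurrences per month, hours each) -- illustrative
    "failed job triage":  (30, 0.75),
    "space management":   (20, 1.0),
    "quarterly patching": (4,  10.0),
}

annual_hours  = sum(count * hours for count, hours in monthly_tasks.values()) * 12
required_dbas = annual_hours / 2_000    # ~2,000 working hours per DBA per year

print(f"Annual task hours: {annual_hours:,.0f}")              # 1,854
print(f"DBAs needed for this workload: {required_dbas:.1f}")  # 0.9
```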

Step 5] Now sit down with your senior DBAs (your thought-leaders) and identify all the tasks/activities that they would like to do more of, to stay ahead of the curve, avoid some of those frequent incidents and make performance more predictable and scalable - activities such as more capacity planning, proactive maintenance and healthchecks, more architecture/design work, working more closely with Development to avoid some of those resource-intensive SQL statements, using features/capabilities in newer DBMS versions, auditing backups and DR sites more thoroughly, defining standard DBA workbooks, etc. Also specify how that will help the business – in terms of reduced performance glitches and service interruptions, fewer war-room sessions and higher uptime, better statutory compliance and so on. The value-add from increased productivity should become part of the business case. One of my articles talks more about where the additional time (gained via automation) can be fruitfully expended to increase operational excellence.

My point here is, there’s no such thing in IT as a “one time activity”. When you go granular and start looking at end-to-end processes, you see a lot of commonalities. And then all you have to do is summarize the data, crunch the numbers, and boom - you get the true ROI potential nailed! Sounds simple, huh? It truly is.

Last but not least, regarding your two examples: “disk failure” and “a crazy workload hijacking the system” – those may not necessarily be the best examples to start with when you begin to build an automation efficiency model. You need to go with the 80-20 rule - start with the 20% of the task patterns that take up 80% of your time. You refer to your use cases as “common scenarios”, but I’m sure you don’t have the failed disk problem occurring too frequently. If the above issues do happen frequently in your environment and (at least in the short term) you have no control over them other than reacting in a certain way, then, as Step 3 suggests, let’s drill into how you react to them. That’s the process you can streamline and automate.

Let me use the “crazy workload” example to expound further. Say I’m the DBA working the early Monday morning shift and I get a call (or a ticket) from Help Desk stating that a user is complaining about “slow performance”. So I (more or less) carry out the following steps:

1. Identify which application it is (billing, web, SAP, Oracle Financials, data warehouse, etc.)
2. Identify all the tiers associated with it (web server, app server, clustered DB nodes, etc.)
3. Evaluate what kind of DB the app is using (say, a 2-node Oracle 10.2 RAC)
4. Run a healthcheck on the DB server (check on CPU levels, free memory, swapping/paging, network traffic, disk space, top process list, etc.) to see if anything is amiss
5. Run a healthcheck on the DB and all the instances (sessions, SQL statements, wait events, alert-log errors, etc.)
6. If everything looks alright from steps 4 and 5, I update the ticket to state the database looks fine and reassign the ticket to another team (sys admin, SAN admin, web admin team, or even back to the Help Desk for further analysis of the remaining tiers in the application stack).
7. If I see a process consuming copious amounts of I/O or CPU on a DB server, I check to see if it’s a DB-related process or some ad-hoc process a sys admin has kicked off (say, an ad-hoc backup in the middle of the day!). If it’s a database process, I check and see what that session is doing inside the database – running a SQL statement or waiting on a resource, etc. Based on what I see, I may take additional steps such as run an OS or DB trace on it – until I eliminate a bunch of suspects and narrow down the root cause. Once I ascertain symptoms and the cause, I may kill the offending session to alleviate the issue and get things back to normal - if it’s a known issue (and I have pre-approval to kill it). If I can’t resolve it then and there, I may gather the relevant stats, update the ticket and reassign it to the group that has the authority to deal with it.

As the above example shows, many of the steps above (specifically, 1 to 7) can be modeled as a “standard operating procedure” and automated. If the issue identified is a known problem, you can build a rule in the RBA product (assuming the product supports RBA 2.0 norms) to pinpoint the problem signature and link it to a workflow that will apply the pre-defined fix, along with updating/closing the ticket. If the problem is not a known issue, the workflow can just carry out steps 1 to 7, update the ticket with relevant details there and assign it to the right person or team. Now I don’t need to do these steps manually every time I get a call stating “there seems to be a performance problem in the database…” and more importantly, if it’s truly a database problem, I can now deal with the problem even before the end user experiences it and calls the Help Desk.
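To make that concrete, here’s a bare-bones sketch of steps 1 through 7 expressed as an automatable runbook. This is my own illustration in Python, not Data Palette’s workflow format; every check below returns canned values and would map to your own monitoring scripts and ticketing API in real life:

```python
from dataclasses import dataclass

@dataclass
class Health:
    ok: bool
    top_consumer: str = ""          # busiest OS process, if any
    is_db_process: bool = False
    known_issue: bool = False
    preapproved_kill: bool = False

def run_os_healthcheck(host):       # step 4: CPU, memory, swap, disk space, top processes
    return Health(ok=False, top_consumer="oracle_sess_4711", is_db_process=True,
                  known_issue=True, preapproved_kill=True)    # canned result for the demo

def run_db_healthcheck(db):         # step 5: sessions, SQL, wait events, alert-log errors
    return Health(ok=True)

def update_ticket(ticket_id, note, reassign_to=None, close=False):
    print(f"[ticket {ticket_id}] {note}"
          + (f" -> reassigned to {reassign_to}" if reassign_to else "")
          + (" (closed)" if close else ""))

def triage_performance_ticket(ticket_id, app):
    db_host = f"{app}-db01"                    # steps 1-3: app, tiers and DB identified upstream
    os_health = run_os_healthcheck(db_host)    # step 4
    db_health = run_db_healthcheck(db_host)    # step 5

    if os_health.ok and db_health.ok:          # step 6: DB looks fine, hand off to the next tier
        update_ticket(ticket_id, "Database healthy", reassign_to="sysadmin/SAN/web team")
        return

    # step 7: drill into the offending process/session
    if os_health.is_db_process and os_health.known_issue and os_health.preapproved_kill:
        update_ticket(ticket_id, f"Killed known offender {os_health.top_consumer}", close=True)
    else:
        update_ticket(ticket_id, f"Diagnostics gathered for {os_health.top_consumer}",
                      reassign_to="senior DBA or owning team")

triage_performance_ticket("INC0042", "billing")
```

The point isn’t the code itself; it’s that once the decision tree is written down this explicitly, it can be handed to an RBA engine (with proper approvals) instead of being replayed by a human at 6 a.m. every Monday.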

In certain other situations, when Help Desk gets a phone call about performance issues, they can execute the same triage workflow and either have a ticket created/assigned automatically or solve the issue at their level if appropriate. This kind of remediation avoids the need for further escalation of the issue and in many cases, avoids incorrect escalations from the Help Desk (how many times have you been paged for a performance problem that’s not caused by the database?). If the problem cannot be automatically remediated by the DBA (e.g., failed disk), the workflow can open a ticket and assign it to the Sys Admin or Storage team.

This kind of scenario not only empowers the Help Desk and lets them be more effective, but also reduces the workload for Tier 2/3 admin staff. Last but not least, it reduces a significant amount of false positive alerts that the DBAs have to deal with. In one recent example, the automation deployment team I was working with helped a customer’s DBA team go from over 2,000 replication-related alerts a month (80% of them were false positives, but needed to be looked at and triaged anyway…) to just over 400. I don’t know about you, but to me, that’s gold!

One final thing: this may sound somewhat Zen, but do look at an automation project as an ongoing journey. By automating 2 or 3 processes, you may not necessarily get all the value you can. Start with your top 2 or 3 processes, but once those are automated, audit the results, measure the value and then move on to the next top 2 or 3 activities. Continue this cycle until the law of diminishing returns kicks in (usually that involves 4-5 cycles) and I guarantee your higher-ups and your end-users alike will love the results. (More on that in this whitepaper.)

Wednesday, September 03, 2008

Clouds, Private Clouds and Data Center Automation

As part of Pacific Crest’s Mosaic Expert team, I had the opportunity to attend their annual Technology Leadership Forum in Vail last month. I participated in half-a-dozen panels and was fortunate to meet with several contributors in the technology research and investment arena. Three things seemed to rank high on everyone’s agenda: cloud computing and its twin enablers - virtualization and data center automation. The cloud juggernaut is making everyone want a piece of the action – investors want to invest in the next big cloud (pun intended!), researchers want to learn about it and CIOs would like to know when and how to best leverage it.

Interestingly, even “old-world” hosting vendors like Savvis and Rackspace are repurposing their capabilities to become cloud computing providers. In a similar vein, InformationWeek recently reported that some of the telecom behemoths like AT&T and Verizon, with excess data center capacity, have jumped into the fray with Synaptic Hosting and Computing as a Service - their respective cloud offerings. And to add to the mix, terms such as private clouds are floating around to refer to organizations that are applying SOA concepts to data center management, making server, storage and application resources available as a service for users, project teams and other IT customers to leverage (complete with resource metering and billing) – all behind the corporate firewall.

As already stated in numerous publications, there are obvious concerns around data security, compliance, performance and uptime predictability. But the real question seems to be: what makes an effective cloud provider?

Google’s Dave Girouard was a keynote presenter at Pacific Crest and he touched upon some of the challenges facing Google as they opened up their Google Apps offering in the cloud. In spite of pouring hundreds of millions of dollars into cloud infrastructure, they are still grappling with stability concerns. It appears that the size of the company and the type of cloud (public or private) matter less than the technology components and corresponding administrative capabilities behind the cloud architecture.

Take another example: Amazon. They are one of the earliest entrants to cloud computing and have the broadest portfolio of services in this space. Their AWS (Amazon Web Services) offering includes storage, queuing, database and a payment gateway in addition to core computing resources. Similar to Google, they have invested millions of dollars, yet are prone to outages.

In my opinion, while concerns over privacy, compliance and data security are legitimate and will always remain, the immediate issue is around scalability and predictability of performance and uptime. Clouds are being touted as a good way for smaller businesses and startups to gain resources, as well as for businesses with cyclical resource needs (e.g., retail) to gain incremental resources at short notice. I believe the current crop of larger cloud computing providers such as Amazon, Microsoft and Google can do a way better job with compliance and data security than the average startup/small business. (Sure, users and CIOs need to weigh their individual risk versus upside prior to using a particular cloud provider.) However, for those businesses that rely on the cloud for their bread-and-butter operations, whether cyclical or year-round, uptime and performance considerations are crucial. If the service is not up, they don’t have a business.

Providing predictable uptime and performance always boils down to a handful of areas. If provisioned and managed correctly, cloud computing has the potential to be used as the basis for real-time business (rather than being relegated to the status of backup/DR infrastructure.) But the key questions that CIOs need to ask their vendors are: what is behind the so-called cloud architecture? How stable is that technology? How many moving parts does it have? Can the vendor provide component-level SLAs and visibility? As providers like AT&T and Verizon enter the fray, they can learn a lot from Amazon and Google’s recent snafus and leverage technologies that simplify the environment and enable it to operate in lights-out mode – making the difference between a reliable cloud offering and one that’s prone to failures.

The challenge, however, as Om Malik points out on his GigaOm blog, is that much of cloud computing infrastructure is fragile because providers are still using technologies built for a much less strenuous web. Data centers are still being managed with a significant amount of manual labor. “Standards” merely imply processes documented across reams of paper and plugged into Sharepoint-type portals. No doubt, people are trained to use these standards. But documentation and training don’t always account for those operators being plain forgetful, or even sick, on vacation or leaving the company and being replaced (temporarily or permanently) with other people who may not have the same operating context within the environment. Analyst studies frequently refer to the fact that over 80% of outages are due to human errors.

The problem is, many providers, while issuing weekly press releases proclaiming their new cloud capabilities, haven’t really transitioned their data center management from manual to automated. They may have embraced virtualization technologies like VMware and Hyper-V, but they are still grappling with the same old methods combined with some very hard-working and talented people. Virtualization makes deployment fast and easy, but it also significantly increases the workload for the team that’s managing that new asset behind the scenes. Because virtual components are so much easier to deploy, the result is server and application sprawl, and demand for work activities such as maintenance, compliance, security, incident management and service request management goes through the roof. Companies (including the well-funded cloud providers) do not have the luxury of indefinitely adding head-count, nor is throwing more bodies at the problem always a good idea. They need to examine each layer in the IT stack and evaluate it for cloud readiness. They need to leverage the right technology to manage that asset throughout its lifecycle in lights-out mode – right from provisioning to upgrades and migrations, and everything in between.

That’s where data center automation comes in. Data center automation technologies have been around now for almost as long as virtualization and are proven to have the kind of maturity required for reliable lights-out automation. Data center automation products from companies such as HP (on the server, storage and network levels) and Stratavia (on the server, database and application levels) make a compelling case for marrying both physical and virtual assets behind the cloud with automation to enable dynamic provisioning and post-provisioning life-cycle management with reduced errors and stress on human operators.

Data center automation is a vital component of cloud computing enablement. Unfortunately, service providers (internal or external) that make the leap from antiquated assets to virtualization to the cloud without proper planning and deployment of automation technologies tend to provide patchy services giving a bad name to the cloud model. Think about it… Why can some providers offer dynamic provisioning and real-time error/incident remediation in the cloud, while others can’t? How can some providers be agile in getting assets online and keeping them healthy, while others falter (or don’t even talk about it)? Why do some providers do a great job with offering server cycles or storage space in the cloud, but a lousy job with databases and applications? The difference is, well-designed and well-implemented data center automation - at every layer across the infrastructure stack.

Wednesday, July 09, 2008

So, what's your “Database to DBA” Ratio?

The “Database to DBA” ratio is a popular metric for measuring DBA efficiency in companies. (Similarly, in the case of other IT admins, the corresponding “managed asset to admin” ratio (such as the "Servers to SA" ratio in the case of systems administrators) seems to be of interest.) What does such a metric really mean? Every so often, I come across IT Managers bragging that they have a ratio of “50 DB instances to 1 DBA” or “80 DBs to 1 DBA”... -- Is that supposed to be good? And conversely, is a lower ratio such as “5 to 1” necessarily bad? Compared to what? In response, I get back vague assertions such as “well, the average in the database industry seems to be 20 to 1”. Yeah? Sez who??

Even if such a universal metric existed in the database or general IT arena, would it have any validity? A single DBA may be responsible for a hundred databases. But maybe 99 of those databases are generally “quiet” and never require much attention. Other than the daily backups, they pretty much run by themselves. But the remaining database could be a monster and may consume copious amounts of the DBA's time. In such a case, what is the true database to DBA ratio? Is it really 100 to 1 or is it merely 1 to 1? Given such scenarios, what is the true effectiveness of a DBA?

The reality is, a unidimensional *and* subjective ratio, based on so-called industry best practices, never reveals the entire picture. A better method (albeit also subjective) to evaluate and improve DBA effectiveness would be to establish the current productivity level ("PL") as a baseline, initiate ways to enhance it and carry out comparisons on an ongoing basis against this baseline. Cross-industry comparisons seldom make sense, however the PL from other high-performing IT groups in similar companies/industries may serve as a decent benchmark.

Let's take a moment to understand the key factors that should shape the PL. In this regard, an excellent paper titled “Ten Factors Affect DBA Staffing Requirements”, written by two Gartner analysts, Ed Holub and Ray Paquet, comes to mind. Based somewhat on that paper, I’m listing below a few key areas that typically influence your PL:

1. Rate of change (in the environment as indicated by new rollouts, app/DDL changes, etc.)
2. Service level requirements
3. Scope of DBA services (do the DBAs have specific workbooks, or are the responsibilities informal)
4. # of databases under management
5. Database sizes
6. Data growth rate
7. Staff skills levels
8. Process maturity (are there well-defined standard operating procedures for common areas such as database installation, configuration, compliance, security, maintenance and health-checks)
9. Tools standardization
10. Automation levels

In my mind, these factors are most indicative of the overall complexity of a given environment. Now let’s figure out this PL model together. Assign a score between 1 (low) and 10 (high) in each of the above areas as it pertains to *your* environment. Go on, take an educated guess.

Areas 1 to 6 form what I call the Environmental Complexity Score. Areas 7 to 10 form the Delivery Maturity Score. Now lay out an X-Y Line graph with the former plotted on the Y-axis and the latter plotted on the X-axis.

Your PL depends on where you land. If you picture the X-Y chart as comprising 4 quadrants (left top, left bottom, right top and right bottom), the left top is "Bad", the left bottom is "Mediocre", the right top is "Good" and the right bottom is "Excellent".


Bad indicates that your environment complexity is relatively high, but the corresponding delivery maturity is low. Mediocre indicates that your delivery maturity is low, but since the environment complexity is also relatively low, it may not be a huge issue. Such environments probably don't see issues crop up frequently and there is no compelling need to improve delivery maturity. Good indicates that your environmental complexity is high, but so is your delivery maturity. Excellent indicates that your delivery maturity is high even with the environment complexity being low. That means you are truly geared to maintain service levels even if the environment gets more complex in the future.

Another thing that Excellent may denote is that your delivery maturity helps keep environment complexity low. For instance, higher delivery maturity may enable your team to be more proactive/business-driven, actively implement server/db consolidation initiatives to keep server or database counts low. Or the team may be able to actively implement robust data archival and pruning mechanisms to keep overall database sizes constant even in the face of high data growth rates.
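If you want to play with this model, here’s a tiny sketch of the scoring in Python. How you combine the factor scores is up to you (I’m simply averaging), and the quadrant split at 5 is my assumed midpoint of the 1-10 scale:

```python
def productivity_level(scores):
    """scores: dict with the ten factor scores, each rated 1 (low) to 10 (high)."""
    complexity = sum(scores[f] for f in ("rate_of_change", "service_levels", "scope_of_services",
                                         "db_count", "db_sizes", "data_growth")) / 6.0
    maturity   = sum(scores[f] for f in ("staff_skills", "process_maturity",
                                         "tools_standardization", "automation_levels")) / 4.0
    # Quadrants: X = Delivery Maturity, Y = Environmental Complexity (midpoint assumed at 5)
    if maturity < 5:
        quadrant = "Bad" if complexity >= 5 else "Mediocre"
    else:
        quadrant = "Good" if complexity >= 5 else "Excellent"
    return complexity, maturity, quadrant

example = dict(rate_of_change=8, service_levels=7, scope_of_services=5, db_count=9, db_sizes=6,
               data_growth=7, staff_skills=6, process_maturity=4,
               tools_standardization=5, automation_levels=3)
print(productivity_level(example))   # -> (7.0, 4.5, 'Bad')
```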


So, now you have a Productivity Level that provides a simplistic, yet comprehensive indication of your team's productivity, as opposed to the age-old "databases to DBA" measure. Also, by actively addressing the areas that make up Delivery Maturity, you have the opportunity to enhance your PL.

But this PL is still subjective. If you would like to have a more objective index around your team's productivity and more accurately answer the question "how many DBAs do I need today?", there is also a way to accomplish that. But more on that in a future blog.

Thursday, May 01, 2008

Six Ways to Tell if an RBA Tool Has Version 2.0 Capabilities

Here’s an update to my last blog about RBA version 2.0. I promised a reader I would provide a set of criteria for determining whether a given RBA platform offers version 2.0 capabilities, i.e., the ability to create and deploy more dynamic and intelligent workflows that can evolve as the underlying environment changes, as well as accommodate the higher flexibility required for advanced IT process automation. In this piece, I expound mostly on the former requirement since my prior blog talked about the latter in quite some detail.

If you don’t have time to read the entire blog post, here’s the exec summary version of the six criteria:
1. Native metadata auto-discovery capabilities for target environmental state assessment
2. Policy-driven automation
3. Metadata Injection
4. Rule Sets for 360-degree event visibility and correlation
5. Root cause analysis and remediation
6. Analytics, trending and modeling.

Now, let’s look at each of these in more detail:

1. Native metadata auto-discovery capabilities for target environmental state assessment. In this context, the term “metadata” refers to data describing the state of the target environment where automation workflows are deployed. Once this data is collected (in a centralized metadata repository owned by the RBA engine), it effectively abstracts target environmental nuances and can be leveraged by the automation workflows deployed there to branch out based on changes encountered. This effectively reduces the amount of deployment-specific hard-coding that needs to go into each workflow. This metadata can also be shared with other tools that may need access to such data to drive their own functionality. Most first-generation RBA tools do not have a metadata repository of their own. They rely on an existing CMDB to pull relevant metadata or in the absence of a CMDB, the workflows attempt to query such data at runtime.

As I have mentioned in my earlier post, the existing metadata in typical CMDBs only scratches the surface for the kind of metadata required to drive intelligent automation. For instance, in the case of a database troubleshooting workflow, the kind of information you may need for automated remediation can range from database configuration data to session information to locking issues. While a CMDB may have the former, you would be hard-pressed to get the latter. Now, RBA 1.0 vendors will tell you that their workflows can collect all such metadata at runtime; however, some of these collections may require administrative/root privileges. Allowing workflows to run with admin privileges (especially for tasks not requiring such access) can be dangerous (read, potential security policy or compliance violation). Alternatively, allowing them to collect metadata during runtime makes for some very bulky workflows that can quickly deplete system resources on the target environments, especially in the case of workflows that are run frequently. Not ideal.

The ideal situation is to leverage built-in metadata collection capabilities within the RBA platform to identify and collect the requisite metadata in the RBA repository (start with the bare minimum and then add additional pieces of metadata, as required by the incremental workflows you roll out). If there is a CMDB in place, the RBA platform should be able to leverage that for configuration-related metadata and then fill in the gaps natively.

Also, the RBA platform needs to have an “open collection” capability. This means, the product may come with specific metadata collection capabilities out-of-the-box. However if the users/customers wish to deploy workflows that need additional metadata which is not available out of the box, they should be able to define custom collection routines.
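To illustrate what an “open collection” might look like from the user’s side, here’s a hypothetical sketch: a custom routine that pulls session counts from an Oracle target into a central repository table. The repository schema, the connection details and the overall plumbing are my own assumptions for illustration, not any vendor’s actual API:

```python
import datetime
import os
import sqlite3            # stand-in for the central metadata repository
import cx_Oracle          # or any DB-API driver for your target platform

def collect_session_metadata(target_dsn, repo_path="metadata_repo.db"):
    """Custom collection: capture session counts by status for a target, with a timestamp."""
    # Use a read-only monitoring account -- no SYSDBA/root privileges needed for this query.
    target = cx_Oracle.connect(user="monitor", password=os.environ["MONITOR_PW"], dsn=target_dsn)
    cur = target.cursor()
    cur.execute("SELECT status, COUNT(*) FROM v$session GROUP BY status")
    rows = cur.fetchall()
    target.close()

    repo = sqlite3.connect(repo_path)
    repo.execute("""CREATE TABLE IF NOT EXISTS session_metadata
                    (collected_at TEXT, target TEXT, status TEXT, session_count INTEGER)""")
    now = datetime.datetime.utcnow().isoformat()
    repo.executemany("INSERT INTO session_metadata VALUES (?, ?, ?, ?)",
                     [(now, target_dsn, status, count) for status, count in rows])
    repo.commit()
    repo.close()
```

How such a routine gets registered and scheduled is, of course, product-specific; the point is that the platform should let you add collections like this without hacking the workflows themselves.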

2. Policy-Driven Automation. Unfortunately, not all environmental metadata can be auto-discovered or auto-collected. For instance, attributes such as whether an environment is Development, Test or Production, or what its maintenance window is, are useful pieces of information to have since the behavior of an automation workflow may have to change based on these attributes. However, given the difficulty in discovering these, it may be easier to specify them as policies that can be referenced by workflows. Second generation RBA tools have a centralized policy layer within the metadata repository wherein specific variables and their values can be specified by users/customers. If a value changes (say, the maintenance window start time moves from Saturday 9 p.m. Mountain Time to Sunday 6 a.m. Mountain Time), then it only needs to be updated in one area (the policy screen) and all the downstream workflows that rely on it get the updated value.
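A toy illustration of the idea: if policies live in one central store, workflows look the value up at runtime instead of hard-coding it, so a change made in one place propagates everywhere. The storage format and lookup function below are my own assumptions:

```python
# Central policy store -- in a real RBA 2.0 product this lives in the metadata repository.
policies = {
    "finance-prod": {"environment": "production",
                     "maintenance_window": "Sun 06:00-10:00 America/Denver"},
    "finance-dev":  {"environment": "development",
                     "maintenance_window": "any"},
}

def get_policy(target, key):
    """Workflows reference policies by name; edit the value above and every workflow follows."""
    return policies[target][key]

# A patching workflow branching on policy values rather than hard-coded assumptions:
target = "finance-prod"
if get_policy(target, "environment") == "production":
    print(f"Schedule patch for {target} inside: {get_policy(target, 'maintenance_window')}")
else:
    print(f"Patch {target} immediately")
```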

3. Metadata Injection. Here’s where things get really interesting. Once you gather relevant metadata (either auto-discovered or via policies), there needs to be a way for disparate automation workflows to leverage all that data during runtime. The term “metadata injection” refers to the method by which such metadata is made available to the automation workflows - especially to make runtime decisions and branch into pre-defined sub-processes or steps.

As an example of how this works, the Expert Engine within Stratavia’s Data Palette parses the code within the workflow steps for any metadata variable references and then substitutes (injects) those with the most current metadata. Workflows also have access to metadata history (for comparisons and trending; more on that below).
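Conceptually, the injection step can be as simple as scanning the step code for variable placeholders and substituting the latest repository values just before execution. A rough sketch (the {{...}} placeholder syntax is my own, not necessarily what any product uses):

```python
import re

# Latest metadata for a target, as it might sit in the repository
metadata = {
    "oracle_home": "/u01/app/oracle/product/10.2.0/db_1",
    "archive_dest_pct_used": "87",
}

def inject(step_code, repo):
    """Replace {{variable}} references in a workflow step with current metadata values."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: repo[m.group(1)], step_code)

# cleanup_archivelogs.sh is a hypothetical site script; the injection mechanics are the point.
step = "if [ {{archive_dest_pct_used}} -gt 85 ]; then cleanup_archivelogs.sh {{oracle_home}}; fi"
print(inject(step, metadata))
# -> if [ 87 -gt 85 ]; then cleanup_archivelogs.sh /u01/app/oracle/product/10.2.0/db_1; fi
```

The workflow body stays generic; only the repository changes as the environment changes.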

Here’s a quick picture (courtesy of Stratavia Corp) that shows differences in how workflows are deployed in an RBA version 1 tool versus RBA 2.0. Note how the metadata repository works as an abstraction layer in the latter case, not only abstracting environmental nuances but also simplifying workflow deployment.


4. Rule Sets for 360-degree event visibility and correlation. This capability allows relevant infrastructure and application events to be made visible to the RBA engine – events from the network, server, hypervisor (if the server is a virtual machine), storage, database and application need to come together for a comprehensive “360-degree view”. This allows the RBA engine to leverage Rule Sets for correlating and comparing values - especially for automated troubleshooting, incident management and triage.

Further, an RBA 2.0 platform should be capable of handling not just simple boolean logic and comparisons within its rules engine, but also advanced Rule Sets for correlations involving time series, event summaries, rate of change and other complex conditions.
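For instance, a rule that goes beyond a simple threshold check might correlate a metric’s rate of change across layers. A hedged sketch of what evaluating such a Rule Set could look like:

```python
def rate_of_change(samples):
    """samples: list of (timestamp_in_seconds, value) tuples, oldest first; returns units/hour."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / ((t1 - t0) / 3600.0)

# 360-degree inputs: metrics from different layers of the same service (illustrative values)
db_active_sessions = [(0, 40), (1800, 95), (3600, 210)]    # sampled every 30 minutes
san_latency_ms     = [(0, 4.0), (1800, 4.1), (3600, 4.2)]

# Rule: sessions climbing steeply while storage latency stays flat -> suspect the workload, not the SAN
if rate_of_change(db_active_sessions) > 100 and rate_of_change(san_latency_ms) < 1:
    print("Rule fired: runaway workload suspected -> launch workload triage workflow")
```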

5. Root cause analysis and remediation. This is one of my favorites! According to various analyst studies, IT admin teams spend as much as 80% of problem management time trying to identify root cause and 20% in fixing the issue. As many of you know, root cause analysis is not only time consuming, but also places significant stress on team members and business stakeholders (as people are under the gun to quickly find and fix the underlying issue). After numerous “war-room” sessions and finger-pointing episodes, the root cause (sometimes) emerges and gets addressed.

A key goal of second-generation RBA products is to go beyond root cause identification to automation of the entire root cause analysis process. This is done by gathering the relevant statistics in the metadata repository to obtain a 360-degree view (mentioned earlier), analyzing them via Rule Sets (also referred above) and then providing metrics to identify the smoking gun. If the problem is caused by a Known Error, the relevant remediation workflow can be launched.

Many first-generation RBA tools assume this kind of root cause analysis is best left to a performance monitoring tool that the customer may already have deployed. So what happens if the customer doesn’t have one deployed? What happens if a tool capable of such analysis is deployed, but not all the teams are using that tool? Usually, each IT team has its own point solution(s) that looks at the problem in a silo’d manner, which commonly leads to finger-pointing. I have rarely seen a single performance analysis tool that is universally adopted by all IT admin teams within a company, with everyone working off the same set of metrics to identify and deal with root cause. If IT teams have such a tough time identifying root cause manually, should companies just forget about trying to automate that process? Maybe not… With RBA 2.0, regardless of what events disparate monitoring tools may present, such data can be aggregated (either natively and/or from those existing monitoring tools) and evaluated via a Rule Set to identify recurring problem signatures, which can then be promptly dealt with via a corresponding workflow.

All that most monitoring systems do is present a problem or potential problem (yellow/red lights), send out an alert and/or run a script. Combining root cause analysis capabilities and automation workflows within the RBA platform helps improve service levels and frequently reduces alert floods (caused frequently by the monitoring tools), unnecessary tickets and incorrect escalations.

Improving service levels – what a unique concept! Hey wait a minute, isn’t that what automation is really supposed to do? And yet, it amazes me how many companies don’t want to go there and just continue dealing with incidents and problems manually. RBA 2.0 begins to weaken that resistance.

6. Analytics, trending and modeling. These kinds of capabilities are nice to have and, if leveraged in an RBA context, can be really powerful. Once relevant statistics and KPIs are available within the metadata repository, Rule Sets and workflows should be able to access history and summary data (pre-computed based on predictive algorithms and models) to understand trends and patterns and deal with issues before they become incidents.

These can be simplistic models (no PhD required to deploy them), and yet they avoid many performance glitches, process failures and downtime. For instance, if disk space is expected to run out in the next N days, it may make sense to automatically open a ticket in Remedy. But if space is expected to run out in N hours, it may make sense to actually provision incremental storage on a pre-allocated device or to perform preventative space maintenance (e.g., backup and delete old log files to free up space), in addition to opening/closing the Remedy ticket. A good way to understand these events and take timely action is to compare current activity with historical trends and link the outcome to a remediation workflow.
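The disk-space case might look something like this in practice. The forecast here is a naive linear extrapolation and the thresholds are assumptions; the ticketing and provisioning actions are stand-ins for whatever your environment uses:

```python
def hours_until_full(free_gb_samples, interval_hours=1.0):
    """Naive linear forecast from recent free-space samples (oldest first)."""
    consumed_per_hour = (free_gb_samples[0] - free_gb_samples[-1]) / (
        (len(free_gb_samples) - 1) * interval_hours)
    if consumed_per_hour <= 0:
        return float("inf")                  # space isn't shrinking; nothing to do
    return free_gb_samples[-1] / consumed_per_hour

def remediate_space(samples):
    hours_left = hours_until_full(samples)
    if hours_left < 24:
        print("Provision pre-allocated storage / purge old logs, then update and close the ticket")
    elif hours_left < 24 * 14:               # assumed two-week lead time for routine handling
        print("Open a Remedy ticket for capacity planning")
    else:
        print("No action; keep trending")

remediate_space([120, 112, 103, 95])         # consuming ~8.3 GB/hour -> roughly 11 hours left
```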

Tuesday, April 29, 2008

Run Book Automation Gets Smarter

Over a year ago, I had written about different data center automation options, including Run Book Automation (RBA) or, as some call it, IT Process Automation. Since then there’s been lots of activity in that space. Many IT organizations that had merely been curious about this kind of technology in earlier months have now begun to earmark specific budgets to evaluate and deploy these tools. I’m beginning to see more and more RFPs in this area. Even though some of the vendors in this space like iConclude and RealOps have been acquired and seem to have lost some of the core talent and drive behind the technology, this area is continuing to see a tremendous amount of innovation, driven primarily by other startups. In fact, it almost seems that a new version of run book automation has evolved. Analysts are referring to the enhancements with glowing labels such as intelligent process automation, Run Book Automation version 2.0 and Decision Automation, depending on (the complexity of) the use case being automated (such as disaster recovery automation in the latter case).

There appear to be primarily two catalysts for the emergence of RBA 2.0:
1. RBA 1.0 was too simplistic (and limiting). RBA by itself wasn’t meant to introduce any new automation functionality. It was merely designed to string along existing tools and scripts in the right sequence to automate specific low-level and mundane IT processes. Some prominent examples are automation of Help Desk (a.k.a. “Tier 1”) work patterns such as trouble ticket enrichment and basic alert-flow triage and response. Given that the primary user-base for this kind of technology was junior IT operators, the assumption was that they typically wouldn’t know much coding or scripting and would need generic out-of-the-box functions to automate things such as opening and closing tickets in popular ticketing systems, rebooting servers, changing job schedules, and so on. Also, these products frequently wouldn’t even expose the source code within these steps for any kind of customization – all to keep complexity at bay. Depending on what you intended to accomplish, you ended up buying different “integration packs” that allowed you to interact with specific toolsets already deployed within your environment and chain together the requisite steps.

2. RBA 1.0 was too static in nature. When you defined and rolled out a workflow, it made specific assumptions about the platform, version and state of the environment and the underlying toolsets that it connected to. If any of those underlying components changed, then all too often the workflow would cease to function in the manner expected, producing unreliable results and diluting the value of automation. Some of the RBA 1.0 products queried the state of the environment at runtime to avoid this problem, but that resulted in bulky workflow steps that consumed higher CPU, memory and I/O resources on the target environments (the impact was especially evident in the case of frequently run workflows).

Companies that had deployed RBA Version 1 began to run into these limitations as they attempted to move up the IT food chain - to Tier 2 and Tier 3 teams and their increasingly complex activities. Most Tier 2/3 areas, such as database administration, systems administration or application support called for these two deficiencies to be overcome.

Consequently RBA 2.0 showed up with specific enhancements to address the above two areas. (I'm going to lead with the solution for problem #2 since that's a wee bit more challenging.)
- More intelligent and dynamic workflows – Dynamic workflows evolve in a pre-defined (read, pre-approved) way as the environment undergoes changes. This is accomplished in RBA 2.0 via the introduction of a metadata repository between the automation workflows and the target environment. This metadata layer captures the current state of the target environment (along with historical data for comparison and trend analysis purposes) and injects this information into the workflows just prior to or during runtime (this process is referred to as “metadata injection”).

This centralized metadata repository gets populated via relevant collections – either natively or via integration with a CMDB if one exists in the environment. (Note: Even if a CMDB exists, native collections are still relevant since the metadata required for advanced process automation goes way deeper than the metadata found in a conventional CMDB. For instance, in the case of database automation, the type of collections include not only DBMS platform, version, patch level and configuration settings, but also functional aspects such as sessions logged into the database, wait events experienced, locks held, etc. – all of which may be required to either trigger or drive real-time automation behavior.)
Environmental attributes that cannot be auto-discovered can be specified within the metadata repository via a central policy engine.

Metadata injection is one of the key differentiators within RBA 2.0 to bring about requisite changes in automation behavior. This allows workflows to acquire a degree of dynamism (in a pre-approved way) - without getting bloated with all kinds of ad-hoc/runtime checks or worse, becoming stale (and having to be maintained/updated manually).

- Higher flexibility to accommodate higher complexity – RBA version 2 often exposes the source-code beneath the steps for editing and allows new steps to be added in any scripting language the user may prefer. It is not uncommon to see two disparate Tier 2 teams each with its own scripting language preference (if you don’t believe me, just look towards any Sys Admin team that manages both UNIX and Windows boxes or any DBA team that manages both Oracle and SQL Server… These preferences all too often have a tendency to get religious.) The ability to view code and modify the underlying workflow steps, along with support for disparate scripting languages (within the same workflow) allows admin teams to inspect out-of-the-box functionality and make relevant modifications to the existing workflow templates to fit their more advanced requirements.

Talking of products offering RBA version 2, Stratavia’s Data Palette is leading the charge in this area via its central metadata repository and decision automation capabilities. (The product just picked up another award today for making the top software innovators list at the Interop/Software 2008 event.)

Deploying RBA is a strategic decision for many organizations. Expect to have to live with your choice for quite some time. Before you place your bet on a specific solution, take a broad look at representative IT processes (if required, across multiple tiers/teams) you expect to automate today as well as, in the next 24 months, and ensure you are investing in a platform that comes closest to supporting your organization’s ambitions.

Tuesday, March 18, 2008

Data Palette Gets Better and BladeLogic Joins the BMC Camp

It’s just Tuesday, and already it’s been a hectic week for data center automation with Stratavia announcing Data Palette 5.0 and BMC acquiring BladeLogic for about $800M.

Let me start with what's more relevant to customers and end-users: Data Palette 5.0 is something the Stratavia engineers have been working on for almost 2 years now (in conjunction with version 4). It is a game-changer in that it is the first data center automation (DCA) platform to combine native provisioning capabilities with run book automation and database automation to address DCA needs across multiple domains – the sys admins (with server provisioning and hardening requirements), the Tier 1/Help Desk (with generic run book automation requirements) and the DBAs/App admins (with more advanced database/application automation requirements).

The traditional DCA products like BladeLogic and Opsware have essentially focused on provisioning. In the case of Opsware, they started out with server provisioning and then expanded into network and storage provisioning by acquiring Rendition and Creekpath respectively. However these vendors quickly figured out that provisioning without process automation forms an incomplete offering. After all, most admins would like to do “something” with newly provisioned hardware – like harden the server, install specific applications on it, create user accounts and so on – all of which require process automation capabilities. To address that, Opsware last year acquired run book automation vendor, iConclude. BladeLogic addressed that problem via an OEM with run book automation vendor, RealOps (relabeled as Orchestration Manager) and when the latter got acquired by BMC, established an OEM arrangement with another vendor Opalis.

Sounds nice, huh? Well not so much for customers, because as anyone who has tried to integrate multiple products together knows, leveraging disparate tools in a seamless manner is not as easy as it sounds. There are always different GUIs, configuration styles and architectural components to reckon with, in addition to competing terminology, metadata and redundant functionality. Such integration challenges have been more pronounced in the case of BladeLogic especially, since with an OEM, you only have so much control over another vendor’s offering… But I guess BladeLogic has an excellent bunch of sales guys that have been able to sell into large organizations in spite of these challenges. Either that, or customers didn’t know they had an alternative.

With Data Palette 5.0, this hurdle towards a coherent integration of provisioning and run book automation capabilities is a thing of the past. Data Palette’s new console provides both sets of capabilities that tie into each other via a common architecture that includes a shared console, policies, rules, metadata and consistent terminology. Now you can define templates for provisioning Windows, Linux and UNIX servers, along with corresponding CIS policies, and apply them via a generic run book/workflow. You can use the same console to discover existing servers and leverage an automated run book to scan for compliance and configuration policy violations, report on them and optionally, remediate them. No more having to go from one product screen to the other and having to re-enter rules and preferences.

The other improvement is the thin client for both defining and deploying workflows. Both administration functions (such as defining workflows) as well as user functions (executing the workflows) can be accomplished via the web GUI, built primarily with Ajax and Flash. The new GUI implements the concept of a “digital whiteboard” with easy drag-and-drop of reusable workflow steps to illustrate a custom process. The workflow steps are chained together via an input/output mapping that allows users to create new workflows without the need for scripting or having to study the code within each step. However, if you are a scripter, the code is available for viewing/edits and you can add additional steps to the library in any scripting language with built-in version control (in fact, you can mix and match steps in different languages – all within the same workflow!!!).

Data Palette 5.0 also brings in the concept of custom policies. I realize different vendors seem to gravitate towards a different definition for the word “policy”. Within Data Palette, policies are a collection of common attributes that you can use to treat multiple “like” servers (or other data center assets) as one. For example, let’s say you have 40 Windows 2003 machines and 10 VMware Windows virtual machines that you need to apply PCI rules to. You can create a PCI policy outlining the relevant Windows-specific rules and attributes and then apply that singular policy to all 50 machines. Corresponding automation workflows can then utilize the policy attributes to log onto those servers, do the appropriate checks and report on all of them in a consistent manner. Finally, Data Palette’s original forte in database administration automation and 5.0’s innovation in terms of functionality and ease of use makes it stand head and shoulders above anyone else in terms of out-of-the-box database and application related content and capabilities.

BTW, in case you are wondering how Data Palette differentiates from BladeLogic and Opsware (both in terms of current capability as well as future direction), here’s a quick picture to represent each vendor’s competencies.


BladeLogic and Stratavia share some customers (BladeLogic Operations Manager used for server provisioning and configuration management, and Stratavia Data Palette utilized for database and application automation). BladeLogic does have a good team that knows what they are doing. I would hate to see a good team and good technology disrupted. And it looks like BladeLogic had no dearth of suitors. In closing, I hope BladeLogic gets its fair share of mindshare at BMC.

Monday, March 10, 2008

Taking a Stab at a Shared Industry Definition for a "DBA Workbook"

Whenever I come across shops where business users, application support groups or infrastructure teams complain about the DBAs not meeting their expectations, the first question I ask is “OK, so what’s your DBA really supposed to do?”. And I’m amazed by the range of answers I get. It seems like there is quite a bit of vagueness regarding the DBA role. Just recently I blogged about measuring IT admin productivity, but in the case of DBAs, it appears there is not even a consistent set of expectations regarding what to expect from them, let alone measure their productivity.

Alright first things first! Has anyone come across a good DBA Workbook? I’m not even talking about a cookbook with step-by-step instructions on how to do different tasks, but just a preliminary workbook that documents exactly what the heck they are supposed to do.

So, what goes into an average DBA’s work-day? As I work at multiple companies, I see DBAs doing things all across the gamut - from ensuring backups are proper to tuning SQL statements to even writing stored procedures! It seems DBAs in many environments either gravitate towards what their predecessors used to do (and try to meet expectations that were set way before their time) or end up catering to whatever the loudest squeaky voice tends to ask for. Sometimes I see DBAs being grouped into teams of specialists – like Operations DBAs (aka Systems DBAs), Application DBAs, Development DBAs and Engineering DBAs. I’m myself partial to having the DBAs segregated into not more than 3 groups: say, the Operations DBA team, a Development DBA team and optionally, an Application DBA team (especially if the environment has large complex ERP/CRM type implementations). I don’t see the need for additional silos such as Engineering DBAs or Build DBAs; instead I view them as more of an area of focus for personnel within existing teams. For instance, each DBA team needs to have Tier 3 personnel that are responsible for Engineering-type DBA work like defining standardized database configurations. Anything more than 3 core teams tends to prevent economies of scale and can create workload imbalance in my experience (at any point in time, you may find one DBA group to be super-busy while another is relatively sitting idle or working on low priority activities). Similarly, creating too many silos can also lead to islands of institutional knowledge wherein none of the DBAs have a comprehensive understanding of the entire IT stack – from the database to the OS and storage, and all the way back to the application. A siloed, lop-sided view can prevent effective root cause analysis during problem management.

It is also not uncommon to find sub-groups within the 3 main DBA groups based on DBMS platform (Oracle team versus SQL Server team, etc.) and OS platform (UNIX DBAs versus Windows DBAs). The former sub-category tends to be more prevalent than the latter. However, I’m increasingly finding DBA managers interested in building cross-platform awareness. As leading DBMS platforms begin to display similar capabilities, databases are increasingly getting commoditized. Also, different 3rd party applications requiring multiple DBMS platforms, as well as changing business requirements (say, the occasional M&A), cast more pressure on a DBA team to manage heterogeneous platforms. Furthermore, decision automation platforms like Data Palette help create an “abstraction layer” across physical DBMSs, allowing DBAs skilled in one platform to be productive on others. So in the future, we may be seeing even less of a need for platform-based sub-categories.

Alright, so here’s a stab at a preliminary DBA workbook. Please feel free to adapt it as you see fit (since clearly there is no such thing as “one size fits all” here.) This workbook is not supposed to be a precise guide for what DBAs in your specific organization should be working on; rather, it’s meant to serve as a starting point template for you to customize per your business requirements. But do bear in mind that if you find the need to reinvent it completely, then perhaps you don’t really need a DBA; it’s some other role you are seeking to fill. So in other words, consider a 10-20% deviation from this definition as reasonable.

All you DBAs and DBA Managers out there – I would love to get your input on what you feel is right or missing here. Hopefully this can develop into a common definition of what should constitute a legitimate scope of work for DBAs. Such a definition sets expectations both within and outside the DBA team, and fosters higher collaboration. It should help managers analyze gaps in DBA scope of work in their own organizations, as well as identify gaps in skill-sets for future training and hiring. And the best part is that such a workbook can be linked to an SLA (that outlines the service levels the DBAs are responsible for achieving), and last but not least, be the precursor to a more detailed cookbook comprising specific run-books for recurring (and some non-recurring yet complex) operations. The goal is to reduce surprises for other groups and avoid DBAs having to hear users and customers say things like “oh, I expected our DBAs to do more. In my previous company, they did so much more…”.

Also, if you ever venture into a DBA outsourcing/co-sourcing arrangement with a 3rd party vendor, this can serve to delineate what you expect the vendor's DBAs to do versus your own team.

Operations/Systems DBA Workbook
- Database backups
- Database recovery
- Implementing robust change control and configuration management policies
- Capacity planning & system-level architecture
- OS/Storage/DB configuration and optimization
- Database security checks and audits, including compliance-related reporting & controls
- General database health checks
- Database monitoring
- Working with DBMS support (Oracle Support, Microsoft Support, etc.)
- Defining and maintaining quantitative SLAs and helping enhance current service levels
- Physical data model/architecture, DDL generation and maintenance
- DB server migrations and decommissioning
- Database creation and configuration
- Applying patches
- Upgrading the databases from version X to version X+Y
- Maintenance (proactive, reactive and automated); both scheduled and unscheduled
- Database refreshes
- Log-file review
- Trouble-shooting and repairs
- Reporting on state of databases
- Database tuning (proactive and reactive)
- Maintaining specialized environments such as log shipping, replication, cluster-based instances, etc.
- SOP (Std. Operating Procedure) definition & automation (see the sketch after this list)
- DBMS and related tools license management
- Documentation related to all Systems DBA areas
- Management of database-related tools (monitoring systems, tool repositories, etc.)
- Automated measurement of quantitative SLAs & any deviations thereof
- Development & utilization of detailed triage processes
- Detailed user workload analysis & segregation
- Infrastructure advice and planning
- Disaster recovery/failover services
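
To illustrate the SOP definition & automation item above, here’s a minimal sketch (in Python) of one such SOP: a nightly health check that scans the alert log for errors and verifies that the last backup isn’t stale. The paths, the threshold and the backup marker file are all assumptions for illustration; the point is simply that once the steps are written down, they can be scripted and eventually folded into a run book.

#!/usr/bin/env python3
"""Illustrative SOP sketch: a nightly database health check that scans an
Oracle alert log for errors and verifies the last backup is recent enough.
Paths, thresholds and the marker-file convention are all hypothetical;
adapt them to your own environment (or port the logic into your RBA tool)."""

import os
import sys
import time

ALERT_LOG = "/u01/app/oracle/diag/rdbms/orcl/ORCL/trace/alert_ORCL.log"  # hypothetical path
LAST_BACKUP_MARKER = "/backups/ORCL/last_backup.ok"  # assumed to be touched by the backup job
MAX_BACKUP_AGE_HOURS = 30

def scan_alert_log(path: str) -> list:
    """Return alert-log lines containing ORA- errors (or a note if the log is missing)."""
    if not os.path.exists(path):
        return [f"Alert log not found: {path}"]
    with open(path, errors="ignore") as f:
        return [line.rstrip() for line in f if "ORA-" in line]

def backup_is_stale(marker: str, max_age_hours: int) -> bool:
    """True if the backup marker file is missing or older than the threshold."""
    if not os.path.exists(marker):
        return True
    age_hours = (time.time() - os.path.getmtime(marker)) / 3600
    return age_hours > max_age_hours

if __name__ == "__main__":
    findings = scan_alert_log(ALERT_LOG)
    if backup_is_stale(LAST_BACKUP_MARKER, MAX_BACKUP_AGE_HOURS):
        findings.append(f"No successful backup recorded in the last {MAX_BACKUP_AGE_HOURS} hours")
    if findings:
        print("HEALTH CHECK FAILED:\n" + "\n".join(findings))
        sys.exit(1)   # non-zero exit lets a scheduler or RBA tool raise a ticket
    print("Health check passed")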


Application DBA Workbook
- Setting up application/reporting environment
- Loading flatfile data into the database (script setup, automation & troubleshooting); see the sketch after this list
- Database cloning
- Application back-end process design and management
- Carry out application-specific trouble-shooting and reactive repairs
- Application configuration and setup
- New application implementation and upgrades
- Creating and implementing application security policy recommendations, including compliance-related reporting & controls
- Data management and manipulation. For example, ensuring data cleanliness, validating scrubbing rules, addressing batch job failures due to bad data (referential integrity violations, etc.)
- Perform application workflow analysis, characterization and segregation to reduce workflow and data-related failures
- Define triage and escalation procedures for each application failure such that appropriate parties can take ownership of specific issues during problem situations
- Liaison with 3rd party application vendors (e.g., Oracle/PeopleSoft Support) for support and trouble-shooting
- Work with in-house Development DBA and Development team in longer-term proactive application repair and re-architecture
- Application DBA related documentation and cross-training
- Assist QA in building and implementing test plans to validate and test application bug fixes, patches and upgrades
- Evaluate recurring/common application failures and define application “Repair SOPs” so that similar application failures get handled in a consistent manner.
- Strengthen application architecture for better performance and scalability and reduced failures
- Working with the Systems DBAs on application/database integration (example: integrating two databases, etc.)
- Working with the Systems DBAs on configuring the database to optimally accommodate each application (example: providing consultative input on tasks such as data sizing, capacity planning, custom setup of memory structures and disk layouts, etc.)
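
As an illustration of the flat-file loading item above, here’s a hedged sketch of the kind of repeatable load script an Application DBA might maintain: rows are validated up front so referential-integrity problems end up in a reject list instead of failing a batch job at 2 AM. The table names, columns and the sqlite3 stand-in driver are purely illustrative; a real environment would use cx_Oracle, pyodbc or similar.

#!/usr/bin/env python3
"""Illustrative sketch of a repeatable flat-file load with up-front validation.
Table/column names are hypothetical; sqlite3 stands in for a real DBMS driver."""

import csv
import sqlite3

def load_orders(csv_path: str, conn):
    """Insert valid rows; return (number loaded, rejected rows with reasons)."""
    valid_customers = {row[0] for row in conn.execute("SELECT customer_id FROM customers")}
    loaded, rejects = 0, []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if not row.get("order_id"):
                rejects.append({**row, "_reason": "missing order_id"})
            elif row.get("customer_id") not in valid_customers:
                rejects.append({**row, "_reason": "unknown customer_id"})  # would violate the FK
            else:
                conn.execute(
                    "INSERT INTO orders (order_id, customer_id, amount) VALUES (?, ?, ?)",
                    (row["order_id"], row["customer_id"], row.get("amount", 0)),
                )
                loaded += 1
    conn.commit()
    return loaded, rejects

if __name__ == "__main__":
    # Tiny in-memory demo so the sketch runs end-to-end.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE orders (order_id TEXT, customer_id TEXT, amount REAL)")
    conn.execute("INSERT INTO customers VALUES ('C001')")
    with open("orders.csv", "w", newline="") as f:
        f.write("order_id,customer_id,amount\nO1,C001,99.50\nO2,C999,10.00\n")
    loaded, rejects = load_orders("orders.csv", conn)
    print(f"Loaded {loaded} rows, rejected {len(rejects)}: {rejects}")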


Development DBA Workbook
- Specialized database SQL tuning
- Documentation related to all pre-production DBA areas and cross-training
- Participate in application and database design reviews
- New development projects implementation
- Logical data modeling and architecture
- Logical to physical design conversion
- SQL tuning
- Developing and deploying DDL/DML coding standards
- Coding (or recoding) of specific database modules for enhanced database/application performance
- Database code (DDL/DML) version control
- Scripting and testing schema builds and new rollouts (see the sketch after this list)
- DBMS software installation, configuration and management for development database environments as per standards set in Production (by Systems DBAs)
- Data cloning within development environments
- Training Developers on writing optimal SQL code
- General liaison between Systems DBAs and Developers
- General liaison between Systems DBAs and Application DBAs
- Change-control guidelines and provisions for new project implementations
- Assist QA in building and implementing test plans to validate and test pre-production DB bug fixes, patches and upgrades; as well as assist in stress/volume testing.
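
For the schema build and version control items above, here’s a minimal sketch of one common approach: numbered DDL scripts applied in order, with each applied script recorded in a version table so the same build is repeatable across dev, QA and (eventually) production. The directory layout, the table name and the sqlite3 stand-in are assumptions.

#!/usr/bin/env python3
"""Sketch of version-controlled schema builds: apply numbered DDL scripts in
order and record each one in a schema_version table. sqlite3 stands in for
the real DBMS driver; the directory and table names are assumptions."""

import os
import sqlite3

MIGRATIONS_DIR = "migrations"  # e.g. 001_create_tables.sql, 002_add_index.sql

def apply_migrations(conn, migrations_dir: str):
    """Apply any not-yet-applied .sql scripts in name order; return what was applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version "
        "(script TEXT PRIMARY KEY, applied_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    already = {row[0] for row in conn.execute("SELECT script FROM schema_version")}
    applied = []
    for script in sorted(os.listdir(migrations_dir)):
        if not script.endswith(".sql") or script in already:
            continue
        with open(os.path.join(migrations_dir, script)) as f:
            conn.executescript(f.read())   # run the DDL/DML in the script
        conn.execute("INSERT INTO schema_version (script) VALUES (?)", (script,))
        conn.commit()
        applied.append(script)
    return applied

if __name__ == "__main__":
    os.makedirs(MIGRATIONS_DIR, exist_ok=True)
    demo = os.path.join(MIGRATIONS_DIR, "001_create_demo.sql")
    if not os.path.exists(demo):
        with open(demo, "w") as f:
            f.write("CREATE TABLE IF NOT EXISTS demo (id INTEGER PRIMARY KEY, name TEXT);")
    conn = sqlite3.connect("dev_build.db")
    print("Applied:", apply_migrations(conn, MIGRATIONS_DIR) or "nothing new")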

Thursday, February 21, 2008

How Do You Measure IT Admin Productivity?

A key part of my job is to carry out operational maturity assessments for different organizations. This helps me understand where a given IT admin team (usually DBAs, sys admins or app support teams) currently is (chaotic, reactive, proactive, etc.), where the business needs them to be, and how to bridge the gap in the shortest amount of time.

During the course of these assessments, I run into two kinds of IT administrators – the folks that view the extent of automation I (or their managers) advocate as a somewhat scary proposition and state that they wouldn’t want to jeopardize their environment by automating beyond a certain (rather basic) level. They claim to take complete personal responsibility for their environment and do what it takes to minimize risk in the day-to-day functioning of the company, by manually performing tasks that are considered critical to keeping the lights on. Let’s call this group the reluctant automators.

Then there are the folks that are eager to push the envelope to see how far and how fast their environments can embrace automation. Let’s refer to this group as the eager automators.

The latter group seems to view their jobs as something beyond keeping the lights on; they view their role as that of an enabler - wherein they allow business groups to focus on the organization’s core competency. Take, for instance, a brokerage firm whose primary business function is to trade stocks profitably for its customers. This creates a primary layer of business users - the traders who trade the stocks - and multiple peripheral teams that support the primary users and enable them to be successful. In most organizations where technology is not the core offering, IT admins tend to be a part of the peripheral layer.

Reluctant automators often fret about formal interactions with other groups and are usually disassociated from the primary business users. They view meetings as a time-sink; something that takes them away from their “real work” – i.e., doing things on their computers – things like moving data, keeping the configuration current, viewing log files, etc. – all relevant things, no doubt. They frequently view the keyboard activities as the “brainy stuff” and a meeting tends to be an activity where they can actually space out… (you know what I’m talking about, you have run into those folks with the vacant stares in meetings…)

The eager automators, on the other hand, often state that repeatable things (the same ol’ stuff like moving data around, viewing log file output and updating configuration) leave them brain-dead, and hence they can’t wait to automate them off their day-to-day workload (or at least semi-automate them such that they can be reliably offloaded to less senior people!). They manage the mundane by exception and use the freed-up time to instead work on defining crisper user requirements and service levels, measuring deviations in performance and managing them. These kinds of activities require them to spend a lot of their time in meetings – with users, project teams and infrastructure peer groups. Even though some of these meetings are badly run (and are indeed boring), they have learnt to embrace them as opportunities to get better aligned with the business, understand growth trends, share insight and influence behavior.

So here’s the paradox that I (and I’m sure, countless IT managers) frequently run into. Given these two IT admin profiles, who is more productive? There is often a tendency in corporate America to mistake hard work for higher productivity. But here, I don’t mean hard-working in the sense of someone putting in 80 hours a week. Rather, who gets more stuff done? (Assuming there is a standard and reasonable expectation of “stuff” or job output/quality in the first place.)

Both groups comprise ethical, hard-working people who usually put in way more than 40 hours per week to earn their salaries. However both camps are equally adamant that their outlook is the right one for the company.

I have my own thoughts on the matter (will post them in a future blog), but which school of thought do *you* subscribe to? And why is that appropriate? Unfortunately, there is no universally accepted HR manual that helps us decide whether one notion is better than the other.

And more pertinently, regardless of personal viewpoints about individual employee performance, how can a manager quantitatively measure something as seemingly subjective as productivity? Good processes such as ITIL seem to help some, by getting work activities better documented – exposing the # of service requests handled, # of incidents handled, etc., as well as the time put in. However, this information is often too uni-dimensional and does not tell the complete story. For instance, it does not always take into account factors such as which incidents were truly avoidable and which service requests were redundant or occurred due to incomplete work done in the first place! And often, in many environments, IT admins get away with incomplete ticketing records. (So not all activities are logged into the ticketing system, and the information that gets logged is not always consistent or complete.)
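
That said, even an imperfect ticket export can yield something richer than raw counts. Here’s a back-of-the-envelope sketch assuming a hypothetical CSV extract with “known error” and “rework” flags; a real Remedy export would need its own field mapping, but the idea is to surface avoidable and redundant work rather than just volume.

#!/usr/bin/env python3
"""Back-of-the-envelope sketch of mining a ticket export for something beyond
raw counts: how many incidents were repeats of a known issue, and how much of
the queue was rework. Column names and flag values are hypothetical."""

import csv
from collections import Counter

def summarize(ticket_csv: str) -> dict:
    counts = Counter()
    with open(ticket_csv, newline="") as f:
        for t in csv.DictReader(f):
            counts["total"] += 1
            if t.get("type") == "incident":
                counts["incidents"] += 1
                if t.get("known_error") == "yes":   # repeat of a documented issue
                    counts["avoidable_incidents"] += 1
            if t.get("rework") == "yes":            # reopened / incomplete first pass
                counts["rework"] += 1
    return {
        "tickets": counts["total"],
        "incidents": counts["incidents"],
        "avoidable_incident_pct": round(100 * counts["avoidable_incidents"] / max(counts["incidents"], 1), 1),
        "rework_pct": round(100 * counts["rework"] / max(counts["total"], 1), 1),
    }

if __name__ == "__main__":
    # Tiny demo file so the sketch runs as-is.
    with open("tickets.csv", "w", newline="") as f:
        f.write("id,type,known_error,rework\n1,incident,yes,no\n2,request,,yes\n3,incident,no,no\n")
    print(summarize("tickets.csv"))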

Conventional IT management does not appear to provide an easy answer to these questions… how do you deal with it?

Tuesday, January 22, 2008

3 Predictions for MySQL Post Sun Acquisition

The $1B acquisition of MySQL by Sun didn’t seem to get its fair share of publicity next to the gargantuan $8.5B same-day acquisition of BEA by Oracle. Nevertheless it was a monster payout for MySQL, which finally has the opportunity to come into its own. With Sun’s stewardship and products like Stratavia’s Data Palette providing automation functionality for MySQL, it finally has a chance to overcome its two primary weaknesses: (a) the lack of capable support tools and the resultant perceived lack of stability and scalability, and (b) the lack of personnel with adequate proficiency in managing enterprise-class installations. These flaws have frequently relegated MySQL to a third-tier DBMS platform behind the usual suspects (Oracle, SQL Server, DB2), even legacy platforms (Sybase, Informix) and, alas, nimbler open-source and pseudo open-source platforms (PostgreSQL and EnterpriseDB). The digerati have it pinned as being suited merely for small shops and/or small LAMP apps, in spite of some robust enterprise-class functionality in the newer releases.

With the acquisition, I predict 3 things to happen to MySQL in the next 12-24 months:
1. Enhanced native support (finally!) – More native tools for MySQL, especially better monitoring and ad-hoc GUI tools, which will greatly enhance DBA productivity and give DBAs from other platforms a more favorable disposition towards MySQL. Regardless, as with other DBMS platforms, there will always be a gap between DBA demand and supply, and MySQL-savvy automation products such as Data Palette will continue to be seen as a viable option to fill part of this gap.
2. More partner activity - With Sun’s partner base being substantially larger than MySQL’s, more ISVs are likely to pay attention to the latter and begin to establish a tertiary tools eco-system – including better migration, upgrade and patching tools, data load/unload utilities and more robust maintenance and backup capabilities.
3. Higher innovation – With Sun’s reputation for innovation, especially in the open source arena, MySQL should come into its own with functionality matching and, hopefully, surpassing the mainstream DBMS platforms (a smaller legacy footprint usually enables faster innovation…). Even table-stakes capabilities such as an enhanced SQL optimizer, partitioning and index types for handling large data-sets without crashes/corruption would go a long way in building credibility for MySQL as a platform that can be relied on for mission-critical, enterprise-class deployments. These would also make it alluring for ISVs that are looking for more cost-effective options for embedding databases within their products. (I would think the latter would be an especially strong target market for MySQL, especially once the restrictive licensing issues are resolved…)

I still don’t expect a lot of automation capabilities for MySQL right out of the chute. Funnily enough, I have been upbeat about MySQL for a while and had championed including out-of-the-box support for MySQL within Data Palette a couple of years ago. While customers have appreciated that capability, ironically it was never a make-or-break deal for us (even with customers having a serious MySQL presence); just icing on the cake. However, with the acquisition, I’m predicting MySQL will find more mainstream adopters for both itself and its supporting ecosystem, making that early investment in MySQL automation capabilities worthwhile for Stratavia.

Go MySQL! And go Sun for stepping up to the plate! Pretty soon it’s going to be time to make it real… perhaps starting with fixing that open-source licensing model…?

Thursday, January 17, 2008

Who is to blame for Oracle patch application failures... Is that even a real question?

I just read a revealing blog entry by Jaikumar Vijayan about how two-thirds of DBAs miss timely and accurate application of Oracle’s quarterly critical patch updates (CPUs), and the subsequent objections from some quarters that this shows DBAs in a bad light! Wow, I gotta say I sure don't agree with the latter view-point!

Honestly, I don’t think the article quite implies that DBAs are to blame for this. However, that is indeed my assertion. I have been in databases for over 16 years and, as a Systems DBA, I consider myself and others like me to be the primary custodians of my company’s (and clients’) most important asset - data. To imply that the lack of corporate security standards in this area, or the lack of visible threats such as Slammer, is sufficient grounds for ignoring patches is hare-brained in my humble opinion. I consider it my job to come up to speed on the latest patch updates, document the decision to apply the patch (or not!), educate and work with the security auditors (internal and external) if any, and ensure the databases I’m responsible for are totally secure. Not doing that is akin to saying, “oh, it’s okay for me not to test my database backups unless some corporate mandate forces me to do so.” Applying database patches, like auditing backups, is one of the most basic job functions of a DBA. Claiming ignorance or being over-worked in other areas doesn’t count as an excuse, especially after your customers’ credit card data has been stolen by a hacker!

A comment to the above story by Slavich Markovich (one of the people quoted there) says that DBAs are not lazy and goes on to state that they just have too many things they are working on. Since when has a DBA not been “over-worked”? Way too often, that’s been part of the DBA’s job description – even though it doesn’t have to be that way. I know many fine DBAs who, regardless of their workload, corporate politics or the prevailing state of ignorance, don’t let anything stop them from applying the relevant CPUs. They research the patchset, educate their peers, security groups and auditors, evangelize the benefits to application managers, arrange to test it in non-production and coordinate the entire process with change control committees to ensure its success.

Markovich lists several items in his comment that supposedly deter DBAs from adhering to a regular patch cycle. I’m taking each of his points below, one by one, and responding to it:
1. The need to test all applications using the database is a heavy burden => yeah? so is coordinating and testing backups, deal with it!

2. Oracle supports only the latest patchsets => Oracle’s patch-set frequency and support policies tend to vary from version to version. Rather than making bland, generic statements, log onto Metalink, do the relevant research, see which patch-set applies to which databases in your environment and go make it happen!

3. The lack of application vendor certification of the CPUs => sure, certain CPU patches sometimes impact application functionality, but the majority don’t. Regardless, that’s what testing is for – to ensure your application functionality is not negatively impacted by a patch-set. If the testing shows no adverse impact, do a limited deployment and then move to a full deployment of that patch-set. CPUs are released by Oracle almost every quarter, so don’t expect all 3rd party vendors to update their certification matrix quite so rapidly. And BTW, most application problems that are suspected to have been caused by an Oracle CPU can be traced back to an error in the patch application process (why bother reading all the patch-set release notes, right?)

4. The simple fact that it takes a huge amount of work to manually shutdown the database and apply the patch in an organization running hundreds if not thousands of instances => if I had a nickel for every time I heard this… dude, ever heard of run book automation for databases? If you are still stuck with manual task methods and scripts, you only have yourself to blame. (Do yourself a favor, check out the Data Palette automation platform; see the sketch after this list for the flavor of what gets automated.)

5. For production critical databases you have to consider maintenance windows which might come once a year => ever heard of rolling upgrades/patches? If your environment isn’t clustered or lacks a standby environment, work with the application and change control committees to negotiate more reasonable maintenance windows or, even better, build a business case for a Data Guard implementation. Remember, to be a successful DBA in today’s complex IT environment, you can’t just be a command-line expert; you need to possess good communication skills, not to mention salesmanship. Use those skills to jockey for reliable service levels – which include a well-patched and secure database environment. Don’t just attempt to live within the constraints imposed on you.

6. The lack of understanding by some IT security personnel of the severity of the problem simply does not generate enough pressure in the organization => quit blaming others, you are responsible for those databases, not some arcane person with a fancy title that has “security” somewhere in there.

7. All in all, I know of companies that analyze and deploy CPUs as soon as three months after release but those companies are very few and usually have budgets in the millions for such things… => also, quit generalizing. There are many DBAs working for small, private companies with minuscule budgets who take their Oracle CPUs very seriously, and vice versa.
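
Coming back to point 4: here’s a stripped-down sketch of the kind of loop a patching run book wraps. A real RBA workflow layers approvals, scheduling, pre/post validation and rollback on top of this. The inventory file and the exact shell commands are assumptions; always follow the patch README for your specific version.

#!/usr/bin/env python3
"""Stripped-down sketch of what a patching run book encapsulates: walk an
inventory of instances, stop each one, apply the patch, restart, and record
the outcome. The inventory layout and shell commands are assumptions."""

import csv
import os
import subprocess

INVENTORY = "instances.csv"   # hypothetical: host,sid,oracle_home,patch_dir

def run(host: str, command: str) -> bool:
    """Run a command on a remote host via ssh; return True on success."""
    result = subprocess.run(["ssh", host, command], capture_output=True, text=True)
    if result.returncode != 0:
        print(f"[{host}] FAILED: {command}\n{result.stderr.strip()}")
    return result.returncode == 0

def patch_instance(host: str, sid: str, oracle_home: str, patch_dir: str) -> bool:
    env = f"export ORACLE_SID={sid}; export ORACLE_HOME={oracle_home}; "
    steps = [
        env + "echo 'shutdown immediate;' | sqlplus -s / as sysdba",
        env + f"cd {patch_dir} && {oracle_home}/OPatch/opatch apply -silent",
        env + "echo 'startup;' | sqlplus -s / as sysdba",
    ]
    # all() short-circuits: if shutdown fails, we never attempt the patch.
    return all(run(host, step) for step in steps)

if __name__ == "__main__":
    if not os.path.exists(INVENTORY):
        raise SystemExit(f"Create {INVENTORY} with columns: host,sid,oracle_home,patch_dir")
    results = {}
    with open(INVENTORY, newline="") as f:
        for row in csv.DictReader(f):
            results[f"{row['host']}/{row['sid']}"] = patch_instance(
                row["host"], row["sid"], row["oracle_home"], row["patch_dir"])
    failed = [k for k, ok in results.items() if not ok]
    print(f"Patched {len(results) - len(failed)}/{len(results)} instances; failures: {failed or 'none'}")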

The truth is, a secure, stable and scalable database environment has very little to do with the size of the budget and everything to do with astute DBAs that think outside the box and take charge of their environment.