Thursday, May 01, 2008

Six Ways to Tell if an RBA Tool Has Version 2.0 Capabilities

Here’s an update to my last blog about RBA version 2.0. I promised a reader I would provide a set of criteria for determining whether a given RBA platform offers version 2.0 capabilities, i.e., the ability to create and deploy more dynamic and intelligent workflows that can evolve as the underlying environment changes, as well as accommodate the higher flexibility required for advanced IT process automation. In this piece, I expound mostly on the former requirement since my prior blog talked about the latter in quite some detail.

If you don’t have time to read the entire blog post, here’s the exec summary version of the six criteria:
1. Native metadata auto-discovery capabilities for target environmental state assessment
2. Policy-driven automation
3. Metadata Injection
4. Rule Sets for 360-degree event visibility and correlation
5. Root cause analysis and remediation
6. Analytics, trending and modeling

Now, let’s look at each of these in more detail:

1. Native metadata auto-discovery capabilities for target environmental state assessment. In this context, the term “metadata” refers to data describing the state of the target environment where automation workflows are deployed. Once this data is collected (in a centralized metadata repository owned by the RBA engine), it effectively abstracts target environmental nuances and can be leveraged by the automation workflows deployed there to branch out based on changes encountered. This effectively reduces the amount of deployment-specific hard-coding that needs to go into each workflow. This metadata can also be shared with other tools that may need access to such data to drive their own functionality. Most first-generation RBA tools do not have a metadata repository of their own. They rely on an existing CMDB to pull relevant metadata or in the absence of a CMDB, the workflows attempt to query such data at runtime.

As I mentioned in my earlier post, the existing metadata in typical CMDBs only scratches the surface of the kind of metadata required to drive intelligent automation. For instance, in the case of a database troubleshooting workflow, the kind of information you may need for automated remediation can range from database configuration data to session information to locking issues. While a CMDB may have the configuration data, you would be hard-pressed to find session or locking information there. Now, RBA 1.0 vendors will tell you that their workflows can collect all such metadata at runtime; however, some of these collections may require administrative/root privileges. Allowing workflows to run with admin privileges (especially for tasks not requiring such access) can be dangerous (read: potential security policy or compliance violation). Alternatively, allowing them to collect metadata during runtime makes for some very bulky workflows that can quickly deplete system resources on the target environments, especially in the case of workflows that are run frequently. Not ideal.

The ideal situation is to leverage built-in metadata collection capabilities within the RBA platform to identify and collect the requisite metadata in the RBA repository (start with the bare minimum and then add additional pieces of metadata, as required by the incremental workflows you roll out). If there is a CMDB in place, the RBA platform should be able to leverage that for configuration-related metadata and then fill in the gaps natively.

Also, the RBA platform needs to have an “open collection” capability. That is, the product may come with specific metadata collection capabilities out of the box, but if users/customers wish to deploy workflows that need additional metadata not available out of the box, they should be able to define custom collection routines.
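To make “open collection” a little more concrete, here is a minimal sketch of a collector registry. Everything in it is hypothetical: the names `register_collector`, `run_collection`, `collect_config` and `collect_lock_waits` are made up for illustration, and the collectors return stand-in values instead of querying a real target.

```python
from typing import Callable, Dict

Collector = Callable[[str], dict]

# Hypothetical registry: collector name -> routine that gathers metadata for a target.
COLLECTORS: Dict[str, Collector] = {}


def register_collector(name: str) -> Callable[[Collector], Collector]:
    """Decorator that adds a collection routine to the registry."""
    def decorator(fn: Collector) -> Collector:
        COLLECTORS[name] = fn
        return fn
    return decorator


@register_collector("config")
def collect_config(target: str) -> dict:
    # Stand-in for an out-of-the-box collector; a real one would query the target.
    return {"target": target, "source": "built-in"}


@register_collector("lock_waits")
def collect_lock_waits(target: str) -> dict:
    # Stand-in for a customer-defined collector added for a new workflow.
    return {"target": target, "source": "custom"}


def run_collection(target: str) -> dict:
    """Run every registered collector and return one snapshot per collector name."""
    return {name: fn(target) for name, fn in COLLECTORS.items()}
```

The point of the design is that a custom routine is registered the same way as a built-in one, so the scheduled collection run picks it up without changes to any workflow.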

2. Policy-Driven Automation. Unfortunately, not all environmental metadata can be auto-discovered or auto-collected. For instance, attributes such as whether an environment is Development, Test or Production, or what its maintenance window is, are useful pieces of information to have since the behavior of an automation workflow may have to change based on these attributes. However, given the difficulty in discovering these, it may be easier to specify them as policies that can be referenced by workflows. Second-generation RBA tools have a centralized policy layer within the metadata repository wherein specific variables and their values can be specified by users/customers. If a value changes (say, the maintenance window start time changes from Saturday 9 p.m. Mountain Time to Sunday 6 a.m. Mountain Time), it only needs to be updated in one place (the policy screen) and all the downstream workflows that rely on it get the updated value.
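As a rough illustration of the idea, the sketch below keeps policies in one dictionary and has a workflow step consult it before acting. The target name `erp-db`, the `POLICIES` layout, the four-hour window length and the function names are all assumptions for the example. Times are naive local datetimes, so time zone handling is left out.

```python
from datetime import datetime, timedelta

# Hypothetical central policy layer: one entry per target environment.
# weekday follows Python's convention (Monday = 0, so Saturday = 5, Sunday = 6).
POLICIES = {
    "erp-db": {
        "tier": "Production",
        "maintenance_window": {"weekday": 5, "start_hour": 21, "hours": 4},
    },
}


def in_maintenance_window(window: dict, now: datetime) -> bool:
    """Return True if `now` falls inside the weekly window described by the policy."""
    # Most recent occurrence of the window's start at or before `now`.
    days_back = (now.weekday() - window["weekday"]) % 7
    start = (now - timedelta(days=days_back)).replace(
        hour=window["start_hour"], minute=0, second=0, microsecond=0
    )
    if start > now:
        start -= timedelta(days=7)
    return start <= now < start + timedelta(hours=window["hours"])


def may_restart(target: str, now: datetime, policies: dict = POLICIES) -> bool:
    """Workflow step: branch on policy values instead of hard-coded ones."""
    policy = policies[target]
    if policy["tier"] != "Production":
        return True  # no window restriction outside Production in this sketch
    return in_maintenance_window(policy["maintenance_window"], now)
```

Moving the window from Saturday 9 p.m. to Sunday 6 a.m. then means changing the one `maintenance_window` entry. Every workflow that calls `may_restart` sees the new value on its next run.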

3. Metadata Injection. Here’s where things get really interesting. Once you gather relevant metadata (either auto-discovered or via policies), there needs to be a way for disparate automation workflows to leverage all that data during runtime. The term “metadata injection” refers to the method by which such metadata is made available to the automation workflows - especially to make runtime decisions and branch into pre-defined sub-processes or steps.

As an example of how this works, the Expert Engine within Stratavia’s Data Palette parses the code within the workflow steps for any metadata variable references and then substitutes (injects) those with the most current metadata. Workflows also have access to metadata history (for comparisons and trending; more on that below).
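The general technique of parsing a step for variable references and substituting current values can be sketched with Python's standard `string.Template`. This is not Data Palette's actual mechanism or syntax. The repository contents, the `$name` reference style and the `check_sessions` command are invented for the example.

```python
from string import Template

# Hypothetical metadata repository: latest known values per target.
REPOSITORY = {
    "db01": {"data_dir": "/u01/oradata", "max_sessions": 300},
    "db02": {"data_dir": "/data/ora", "max_sessions": 150},
}

# One workflow step, written once, with metadata references instead of hard-coded values.
STEP_SOURCE = "check_sessions --limit $max_sessions --datadir $data_dir"


def inject(step_source: str, target: str, repository: dict) -> str:
    """Replace each $variable in the step with the target's current metadata.

    Template.substitute raises KeyError if the step references metadata
    that has not been collected for this target.
    """
    return Template(step_source).substitute(repository[target])
```

Here `inject(STEP_SOURCE, "db01", REPOSITORY)` and `inject(STEP_SOURCE, "db02", REPOSITORY)` produce two different commands from the same step definition, which is the abstraction the repository is meant to provide.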

Here’s a quick picture (courtesy of Stratavia Corp) that shows differences in how workflows are deployed in an RBA version 1 tool versus RBA 2.0. Note how the metadata repository works as an abstraction layer in the latter case, not only abstracting environmental nuances but also simplifying workflow deployment.


4. Rule Sets for 360-degree event visibility and correlation. This capability allows relevant infrastructure and application events to be made visible to the RBA engine – events from the network, server, hypervisor (if the server is a virtual machine), storage, database and application need to come together for a comprehensive “360-degree view”. This allows the RBA engine to leverage Rule Sets for correlating and comparing values - especially for automated troubleshooting, incident management and triage.

Further, an RBA 2.0 platform should be capable of handling not just simple boolean logic and comparisons within its rules engine, but also advanced Rule Sets for correlations involving time series, event summaries, rate of change and other complex conditions.
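To show the difference between a plain threshold and a rule over a time series, here is a small sketch of a Rule Set that correlates a storage metric with the rate of change of a database metric. The metric names, thresholds and the `Rule`/`evaluate` structure are assumptions made for illustration, not any vendor's rule syntax.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Sample = Tuple[float, float]        # (timestamp in seconds, value)
Events = Dict[str, List[Sample]]    # metric name -> time-ordered samples


def rate_of_change(samples: List[Sample]) -> float:
    """Average change per second between the first and last sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def latest(samples: List[Sample]) -> float:
    return samples[-1][1]


@dataclass
class Rule:
    name: str
    condition: Callable[[Events], bool]


# Hypothetical Rule Set: one simple comparison, one cross-layer correlation.
RULES = [
    Rule("storage_latency_high",
         lambda e: latest(e["storage.latency_ms"]) > 20),
    Rule("lock_waits_rising_with_slow_storage",
         lambda e: rate_of_change(e["db.lock_waits"]) > 0.5
         and latest(e["storage.latency_ms"]) > 20),
]


def evaluate(rules: List[Rule], events: Events) -> List[str]:
    """Return the names of all rules whose condition holds for these events."""
    return [rule.name for rule in rules if rule.condition(events)]
```

The second rule fires only when both layers show the pattern, which is the kind of correlation a single-tool view tends to miss.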

5. Root cause analysis and remediation. This is one of my favorites! According to various analyst studies, IT admin teams spend as much as 80% of problem management time trying to identify root cause and 20% fixing the issue. As many of you know, root cause analysis is not only time consuming, but also places significant stress on team members and business stakeholders (as people are under the gun to quickly find and fix the underlying issue). After numerous “war-room” sessions and finger-pointing episodes, the root cause (sometimes) emerges and gets addressed.

A key goal of second-generation RBA products is to go beyond root cause identification to automation of the entire root cause analysis process. This is done by gathering the relevant statistics in the metadata repository to obtain a 360-degree view (mentioned earlier), analyzing them via Rule Sets (also referred to above) and then providing metrics to identify the smoking gun. If the problem is caused by a Known Error, the relevant remediation workflow can be launched.

Many first-generation RBA tools assume this kind of root cause analysis is best left to a performance monitoring tool that the customer may already have deployed. So what happens if the customer doesn’t have one deployed? What happens if a tool capable of such analysis is deployed, but not all the teams are using that tool? Usually, each IT team has its own point solution(s) that look at the problem in a siloed manner, which commonly leads to finger-pointing. I have rarely seen a single performance analysis tool that is universally adopted by all IT admin teams within a company, with everyone working off the same set of metrics to identify and deal with root cause. If IT teams have such a tough time identifying root cause manually, should companies just forget about trying to automate that process? Maybe not… With RBA 2.0, regardless of what events disparate monitoring tools may present, such data can be aggregated (either natively and/or from those existing monitoring tools) and evaluated via a Rule Set to identify recurring problem signatures, which can then be promptly dealt with via a corresponding workflow.

All that most monitoring systems do is present a problem or potential problem (yellow/red lights), send out an alert and/or run a script. Combining root cause analysis capabilities and automation workflows within the RBA platform helps improve service levels and often reduces alert floods (frequently caused by the monitoring tools themselves), unnecessary tickets and incorrect escalations.

Improving service levels – what a unique concept! Hey wait a minute, isn’t that what automation is really supposed to do? And yet, it amazes me how many companies don’t want to go there and just continue dealing with incidents and problems manually. RBA 2.0 begins to weaken that resistance.

6. Analytics, trending and modeling. These kinds of capabilities are nice to have and, if leveraged from an RBA context, can be really powerful. Once relevant statistics and KPIs are available within the metadata repository, Rule Sets and workflows should be able to access history and summary data (pre-computed based on predictive algorithms and models) to understand trends and patterns and deal with issues before they become incidents.

These can be simplistic models (no PhD required to deploy them), and yet help avoid many performance glitches, process failures and downtime. For instance, if disk space is expected to run out in the next N days, it may make sense to automatically open a ticket in Remedy. But if space is expected to run out in N hours, it may make sense to actually provision incremental storage on a pre-allocated device or to perform preventative space maintenance (e.g., back up and delete old log files to free up space), in addition to opening/closing the Remedy ticket. A good way to understand these events and take timely action is to compare current activity with historical trends and link the outcome to a remediation workflow.
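The disk space example can be sketched with about the simplest model possible: a straight-line projection from the first and last historical samples. The thresholds (14 days for a ticket, 24 hours for immediate remediation), the action names and the function names are assumptions, since the text above leaves N unspecified. A real deployment would tune them and would probably fit more than two points.

```python
from typing import List, Optional, Tuple

UsageSample = Tuple[float, float]  # (hours since first sample, GB used)


def hours_until_full(samples: List[UsageSample], capacity_gb: float) -> Optional[float]:
    """Project when the disk fills, assuming growth continues at the observed average rate.

    Returns None when usage is flat or shrinking, since no fill time can be projected.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    growth_per_hour = (used1 - used0) / (t1 - t0)
    if growth_per_hour <= 0:
        return None
    return (capacity_gb - used1) / growth_per_hour


def choose_action(hours_left: Optional[float],
                  urgent_hours: float = 24,
                  ticket_days: float = 14) -> str:
    """Map the projection to a remediation path (thresholds are illustrative)."""
    if hours_left is None or hours_left > ticket_days * 24:
        return "none"
    if hours_left <= urgent_hours:
        # Hours away: free or provision space now, and record it in a ticket.
        return "remediate_and_ticket"
    # Days away: a ticket is enough.
    return "open_ticket"
```

With a 500 GB volume that grew from 400 GB to 424 GB over 48 hours, the projection is 152 hours (a little over six days), which lands in the "open a ticket" branch. Faster growth would push the same volume into the remediation branch.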