Thursday, November 12, 2009

Reference Architecture for Delivering IT as a Service

I’m back now after a 4-month “hiatus” that comprised multiple customer engagements – productive activity that keeps me (and my rants here) relevant. A key area I have been working on is using Stratavia’s Data Palette to help my customers deliver IT as a service within their organizations – or to be more precise, helping them deliver applications as a service. From the perspective of “end users” such as project managers, application team leads and QA managers (i.e., the recipients of these services), individual components of the infrastructure (plain servers or raw storage) don’t matter; it’s about having fully baked apps appropriately packaged and delivered – including the webserver, the application server and the database layers.

The obvious reason CIOs are looking to upgrade their IT delivery capabilities is to improve business efficiency and agility while, of course, reducing costs. A less frequently cited but equally vital reason is to keep up with the competition! For instance, every financial services firm out there has already built or is building an internal cloud. In fact, larger organizations across all industry verticals are taking the next step to attain scalability via newer delivery models such as self-service and cloud computing. But truly gaining tangible benefits from such scalable models is a challenge. And currently, application administration is the weakest link in the chain!

The whole premise of cloud computing – being able to rapidly mass-deploy applications in the cloud – frequently comes to a screeching halt due to the way IT currently operates. Most IT administration teams are just not geared up to provision and manage scores of complex, heterogeneous application tiers in an agile manner – unless more and more manpower is added, and even that doesn’t scale past a certain point. Sure, there is help from conventional systems management vendors like HP, BMC, IBM, EMC, VMware and Cisco. Automation products from these venerable vendors are able to help organizations reduce server, network and storage provisioning time from multiple weeks to a few hours. But then the bottleneck just shifts upstream into the application layers: specifically the middleware and the database tiers. End-to-end automation of these tiers is a prerequisite to large-scale application deployments on the cloud.

Conventional server provisioning and runbook automation products have neither application-level smarts nor native (application-specific) automation functionality to be of help here. Apart from some very basic application binary installs, they cannot be used to automate the complex activities across the application operations lifecycle, depicted alongside – unless one ends up writing and maintaining millions of lines of custom script code (job security, anyone?).

Stratavia provides IT organizations with a way to break this logjam at the database and application tiers. It does this via its Data Palette automation platform, along with a portfolio of automation modules, called DCA Apps, that plug into the underlying platform. The DCA Apps include solutions for the entire operations lifecycle of the database and middleware tiers represented in the graphic above. The solution allows companies to obtain the following benefits (while complementing prior investments such as HP/Opsware and BMC/BladeLogic), thus truly enabling IT to be delivered as a service.

- Streamlining and improving IT operations

  • Standardize IT processes across heterogeneous platforms and assets

  • Reduce delay between service request and delivery; Improve service level metrics such as “first-time-right” and “on-time-delivery”

  • Establish & control delivery quality across multi-tiered skill sets; Enable non-SMEs to carry out complex operations

  • Improve service delivery with “self service” capability in key areas such as application build provisioning, code releases and migrations

  • Remove compliance & support risks due to variety of version, patch & configuration requirements

- Increasing efficiencies

  • Reduce IT Admin time spent on mundane activities

  • Increase Asset to Admin ratio

The reference architecture Stratavia uses to accomplish these objectives is shown in the schematic below:
The right side of the schematic above shows the major tiers that make up the entire application stack. The middle portion shows the role of the Data Palette automation fabric in both orchestrating and performing the administrative activities across the entire operations lifecycle. (The Data Palette platform includes the orchestration capabilities, while the DCA Apps perform the administration activities.) These lifecycle activities include provisioning and patching, configuration and compliance management, recurring maintenance (e.g., log pruning, backups, healthchecks, index rebuilds, table reorgs, partition shuffles, etc.), incident response (false-positive alert suppression and white-noise reduction, problem diagnosis and root cause analysis, auto-resolving known errors, etc.) and frequent service requests (e.g., code releases, database refreshes and cloning, upgrades, adding/modifying user accounts, adding space, restoring an application snapshot, failover, etc.). Data Palette also provides out-of-the-box integration adaptors to auto-cut tickets, update a CMDB, and interface with various systems management toolsets in order to adhere to standard ITIL processes while carrying out these activities (not dissimilar to a DBA or App Admin who performs this work manually).

The fabric also helps in abstracting the backend component-level complexities from the end-users.

On the left side and the top, the schematic illustrates 4 classes of users:

  • Non-Technical End Users: This class of users refers to the application and business end-users. These users are typically not too IT operations-savvy (nor should they have to be!) and conventionally request resources via a Help Desk / ticketing system such as BMC Remedy, HP Service Manager or Service-Now.com. Once the ticket is created, it is assigned to the appropriate technicians and may traverse multiple IT operations groups before the request is fulfilled. Data Palette enables self-service capabilities in this scenario by presenting a Service Catalog front-end to these users. Frequently, Data Palette’s native adaptors are used to integrate with existing ticketing systems so that these end-users do not have to be exposed to the Data Palette console (and do not have to learn yet another tool or interface!). The Service Catalog is established within the system they are already familiar with, wherein they can put in their request along with relevant details such as service name, required duration, billing code, etc. Once the request is saved, it can be auto-routed to a manager for approval. The ticket creation or approval action triggers an automated workflow (within a Data Palette DCA App) that provisions the service and makes it available to the end-user, updating/closing out the ticket once the service is brought online. The service is usually multi-layered and can comprise multiple sub-workflows that will provision a database instance, install an Apache webserver and a WebLogic app server, create user accounts in the database and so on. The minutiae are abstracted from the end-user. (A minimal sketch of this flow appears after this list of user classes.)

  • IT Operators – This category of users refers to Tier 1 personnel such as Help Desk operators, NOC personnel and, in some cases, even outsourced/offshore administrators. These users tend to be the preliminary points of contact for alerts from different monitoring tools or problem calls from end users. Data Palette empowers these IT operators to carry out automated incident triage and even auto-remediation of recurring incidents, thereby reducing the need to escalate to IT Operations Administrators (Tier 2 personnel). IT Operators are not SMEs, but they have a greater degree of awareness of the IT environment and hence can have direct access to the Data Palette console (i.e., bypass the previously mentioned Service Catalog) along with the ability to execute specific workflows in certain environments – managed via Data Palette’s multi-tenancy and role-based access control.

  • IT Operations Administrators – These are the Tier 2 SMEs – the DBAs and the Application Admins that have privileges within Data Palette to deploy automation services, along with the ability to set relevant Policies and metadata to properly influence Data Palette’s automation behavior in the environments they manage.

  • IT Operations Engineers – These are Tier 3 SMEs – the IT Operations Engineers (also referred to as Applications or Database Systems Engineers or Architects) that have the ability to define automation services by configuring the Data Palette DCA Apps and the corresponding workflows, along with any site-specific pre- and post-Steps. They decide which automation service should be available to which user type across the enterprise, what parameters should be entered (balancing ease-of-use against flexibility and control), which toolsets to integrate with, which metadata to leverage, and so on. Accordingly, their role within Data Palette is a superset of the prior users’ roles, allowing them to read, write (update) and execute automation workflows and corresponding service definitions.
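To make the end-user self-service flow described above concrete, here is a minimal Python sketch of how an approved Service Catalog ticket might fan out into tier-level provisioning steps. The function names (provision_database, install_app_server, close_ticket, etc.) are hypothetical stand-ins, not actual Data Palette or Remedy APIs; the point is simply that a single approval event drives several coordinated sub-workflows while the originating ticket is kept up to date.

```python
# Illustrative sketch only: all function and field names below are hypothetical,
# not actual Data Palette or ticketing-system APIs.

def handle_service_request(ticket):
    """Triggered when an approved Service Catalog ticket arrives."""
    spec = ticket["details"]          # e.g., service name, duration, billing code

    # Each sub-workflow provisions one tier of the requested service.
    db = provision_database(version=spec["db_version"], size_gb=spec["db_size_gb"])
    install_web_server(kind="apache")
    app = install_app_server(kind="weblogic", datasource=db["jdbc_url"])
    create_app_accounts(db, users=spec["users"])

    close_ticket(ticket["id"], note=f"Service online at {app['url']}")


# Hypothetical helper stubs so the sketch is self-contained.
def provision_database(version, size_gb):
    return {"jdbc_url": f"jdbc:oracle:thin:@dbhost:1521/svc_{version}"}

def install_web_server(kind):
    return {"kind": kind}

def install_app_server(kind, datasource):
    return {"url": "http://apphost:7001", "datasource": datasource}

def create_app_accounts(db, users):
    pass

def close_ticket(ticket_id, note):
    print(f"Ticket {ticket_id} closed: {note}")


if __name__ == "__main__":
    handle_service_request({
        "id": "REQ-1234",
        "details": {"db_version": "11g", "db_size_gb": 50, "users": ["qa_lead"]},
    })
```

In practice the orchestration engine, not hand-written code, would sequence these steps; the sketch just shows the shape of the work being automated behind the Service Catalog.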

Finally, the bottom portion of the reference architecture shows Data Palette integrating with existing enterprise monitoring and popular configuration and compliance audit toolsets such as MS SCOM, HP OVO, Patrol, Tivoli, Tripwire, EMC Ionix Configuration Manager and Guardium via its integration adaptors for these product sets. These products (and others like them) are frequently already deployed by enterprises for performance monitoring and for scanning OS, database and application configurations, and they can be set up to invoke a Data Palette remediation workflow (via Data Palette’s web service APIs) to address drifts and SLA violations. Data Palette’s configuration repair and incident resolution workflows can fix the violations in online mode, or schedule the repair during the appropriate maintenance window based on pre-defined policies. Administrators can specify, on an environment-by-environment basis, which violations should be repaired immediately, which need to be scheduled, and which can be safely ignored.
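As a rough illustration of that per-environment decision – not Data Palette’s actual policy syntax, just an assumed structure – the triage between immediate repair, scheduled repair and suppression could be expressed roughly like this:

```python
# Hypothetical per-environment policy table; not Data Palette's actual policy format.
REMEDIATION_POLICY = {
    "production": {"config_drift": "schedule",  "failed_backup": "immediate", "low_sev_alert": "ignore"},
    "qa":         {"config_drift": "immediate", "failed_backup": "immediate", "low_sev_alert": "ignore"},
}

def handle_violation(environment, violation_type, run_workflow, schedule_workflow):
    """Decide whether to repair now, defer to a maintenance window, or ignore."""
    action = REMEDIATION_POLICY.get(environment, {}).get(violation_type, "schedule")
    if action == "immediate":
        run_workflow(violation_type)
    elif action == "schedule":
        schedule_workflow(violation_type, window="next_maintenance")
    # "ignore": record the violation and move on


# Example: a monitoring tool reports configuration drift on a production asset.
handle_violation(
    "production", "config_drift",
    run_workflow=lambda v: print(f"repairing {v} now"),
    schedule_workflow=lambda v, window: print(f"repair of {v} scheduled for {window}"),
)
```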

Enterprise features of the Data Palette platform such as multi-tenancy, RBAC (role-based access control), single sign-on, LDAP integration and Smart Groups (wherein multiple intra-cloud assets can be addressed and manipulated as a single entity), along with self-configuring, out-of-the-box automation content for the database and application tiers, make the above architecture and corresponding value eminently attainable. (As a point of reference, a Proof of Concept takes 3 days to 2 weeks depending on the scope; a broader Pilot including integrations with existing toolsets can be implemented in 2 to 4 weeks.)

Email me at vdevraj at stratavia dot com if you would like a detailed whitepaper on this solution architecture.

Monday, July 20, 2009

Why aren't databases getting migrated to VMware?

During a recent customer CTO focus group meeting, a key topic discussed was the perceived unwillingness (or should one say, inability) of many larger organizations to move their databases onto virtual servers.

The majority of databases, especially production systems, continue to run on physical servers. Given the emphasis in today's economy on data center consolidation and the need to enable newer IT delivery models such as self-service and private cloud initiatives, migrating databases from expensive, underutilized physical servers to shared virtual environments should be a no-brainer. However, IT managers find themselves swimming upstream as soon as they broach the topic of database migrations with their application users or, for that matter, their DBAs.

Migrating databases to virtual environments is difficult for several reasons – especially their complexity and mission-critical nature, the extended time required to perform those migrations gracefully and the general lack of tolerance among application users for the corresponding maintenance/outage windows. A database has hooks deep into the underlying operating system as well as the application stack above it. Even relatively “minor” migrations, wherein both the source and target platforms and versions are exactly the same, can impact database performance and stability if all the associated factors are not duly considered. Let's explore some of these factors.

A database migration in this context is defined as moving a database from a physical to a virtual environment. Whenever possible, the source and target environments retain the same operating system (OS) attributes. But frequently, those may need to be changed as well. For instance, when migrating an Oracle or DB2 database running on an IBM pSeries server with AIX 5.x to a virtual environment on VMware, the underlying operating system type has to change to a Linux flavor on the x86 platform. Accordingly, the following are the two main options for a database migration:
• The same OS on both the physical (source) and virtual (target) environments
• A change in the OS type or version in the target virtual environment.

The latter use case is obviously more complex than the former. Regardless, with both use cases there are several factors, decision points and associated actions that need to be taken into consideration. The sequence of actions in the schematic below (Figure 1) illustrates a few of the core issues associated with such a migration for Oracle, SQL Server and DB2.


Especially as Rows 2 and 4 in Figure 1 reveal, there are specific actions that need to be taken both on the source (physical) and target (virtual) environments. Furthermore, all of these actions need to be taken in line with corporate standards and best practices including checking for change control approvals, carrying out the work within pre-defined maintenance windows, and rolling back specific actions upon failure or other environment conditions such as the maintenance window being exceeded.

All too often, IT managers and generic IT administrators (read: non-DBAs) mistakenly believe that the scope of database migrations is limited to provisioning a new target database server, copying the data contents from the source to the target and, finally, re-pointing the applications to the new target environment. In this context, one needs to differentiate between a database server and a database instance. In the case of the former, one can use existing provisioning tools to set up the requisite platform and version build, patch level, kernel parameters, file systems and other operating system-related aspects (equating to Row 3 in Figure 1 above) and utilize standard run book automation (RBA) tools to interact with change control and ticketing mechanisms (equating to Rows 1 and 5 in Figure 1). However, these provisioning and RBA tools fall short when it comes to the setup and management of much of the database internals (as shown in Rows 2 and 4 in Figure 1) – tasks that make up the core of the migration process. For instance, tools such as BMC BladeLogic, HP Opsware SAS and VMware vCenter Server can rapidly provision target database server environments, but lack the database instance and application-specific context and the corresponding automation content to discern and establish crucial post-server-provisioning aspects. Conventional RBA tools such as BMC BladeLogic Orchestration Manager, HP Operations Orchestrator and VMware vCenter Orchestrator have the capability to interact with change control and ticketing systems to perform the peripheral tasks in Rows 1 and 5 in Figure 1, but lack knowledge of database internals and hence require all the steps indicated in Rows 2 and 4 to be developed from scratch – a process that can take multiple years.

Similarly, virtual server migration tools such as VMware vCenter Converter (a.k.a. VMware P2V) and Novell PlateSpin PowerConvert, which migrate a server in its entirety, also don't do justice to the task at hand because a database server may comprise multiple databases/instances supporting several different applications, and the DBA may need to selectively migrate a few databases/instances at a time (based on application user and change management approvals) rather than the entire database server. P2V-like tools don't take these intra-database structures and other application-level dependencies into account. Also, these tools only deal with x86 servers running Windows and Linux. Non-x86 hardware platforms and other operating systems such as AIX and Solaris are woefully ignored.

A database instance has its own unique set of requirements and configuration options that need to be defined and managed. Take, for example, an Oracle database environment. A typical mid-sized to large company (the kind that will benefit most from automation) may have dozens, even hundreds of database instances across different versions (e.g., 9i, 10g, 11g) and configurations (standalone, RAC, Data Guard, etc.). New databases have to be set up for multiple environments (production, QA, stage, etc.) and applications (OLTP, reporting, warehouse, batch, etc.). Based on the OS platform, database version, configuration, usage factors and size, different data transfer methods have to be selectively applied during a migration (e.g., RMAN duplicate, Transportable Tablespaces, Data Pump, Export/Import, etc.). Once the data is extracted, it needs to be imported into the target environment using the appropriate method(s) and reconfigured for user and application access. Structures such as stored procedures, triggers and other objects need to be transferred as well, using appropriate mechanisms. External stored procedures need to be recompiled and relinked on the target. Depending on the source and target operating systems, issues such as endian formats, byte sizes and character types need to be considered. Shared libraries have to be installed and clustering-related parameters have to be defined. Backup methods need to be reconfigured and rescheduled. Agents need to be reinstalled. Maintenance and other scheduled jobs have to be set up. Database links need to be re-established in the case of federated or distributed databases. Thus, several configuration, security and performance related factors need to be evaluated and appropriately addressed.
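To give a flavor of just one of these decisions, here is a deliberately simplified sketch of how a data transfer method might be chosen. The rules and thresholds are illustrative assumptions, not Oracle’s (or Stratavia’s) official guidance; a real migration weighs many more factors such as downtime tolerance, endian conversion, character sets and storage layout.

```python
# Deliberately simplified, illustrative decision logic only; real migrations weigh
# many more factors (downtime tolerance, endian conversion, character sets, etc.).

def pick_transfer_method(source_os, target_os, major_version, size_gb):
    """Suggest an Oracle data transfer method for a physical-to-virtual migration."""
    same_platform = (source_os == target_os)

    if same_platform and major_version >= 10:
        return "RMAN duplicate"              # block-level copy, least reconfiguration
    if not same_platform and major_version >= 10 and size_gb > 500:
        return "Transportable Tablespaces"   # cross-platform move with endian conversion
    if major_version >= 10:
        return "Data Pump"                   # logical export/import, slower but flexible
    return "Export/Import"                   # legacy fallback for 9i and earlier

print(pick_transfer_method("AIX 5.x", "Linux x86", major_version=10, size_gb=800))
```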

Many of these considerations have prevented DBAs from viewing automation as a reliable method to perform migrations. And performing them manually is not a realistic option, since they can take up several thousands of DBA hours and carry a significant risk of performance degradation or downtime if human errors creep in. These complex issues prevent most databases from being migrated to virtual environments, in spite of the obvious cost advantages associated with virtualization. There is enough fear prevalent among IT managers to keep their hands off databases and leave them running on mammoth, underutilized physical servers.

That’s where database automation products such as Stratavia’s Data Palette can help. Specifically, Data Palette provides value over conventional server provisioning and RBA technologies as well as P2V-type tools in two key ways:
- due to its comprehensive database automation content, it can handle complex migration use cases out-of-the-box

- due to its ability to collect, persist and embed metadata within its automation workflows (to provide environmental awareness to those workflows), it can handle heterogeneity and ongoing changes in platform builds, versions, application attributes and usage profiles in a much more scalable manner. It allows a consistent set of migration workflows to be used across disparate environments from a central console, making the automation easy to deploy, maintain and run.

Migrating databases from physical servers to virtual machines becomes straightforward with Data Palette due to the ability of its workflow engine to span both physical and virtual environments (within a single workflow) - allowing the entire spectrum of activities related to database migration (as illustrated in Figure 1) to be automated end-to-end with the press of a button.

This automation content coupled with Data Palette’s role-based access control and Service Catalog type user interface allows senior DBAs as well as junior or offshore database operations personnel and even non-DBAs for that matter (e.g., trusted, but non-database-savvy IT personnel such as systems administrators and application project leads) to perform these migrations in self-service mode without requiring them to have knowledge of underlying database processes, and without local administrative access on the source and target servers - in a manner that meets all DBA and change control approvals.

Post-migration, a key Data Palette feature called Smart Groups™ allows the database instances to retain all of their prior monitoring, maintenance and other scheduled activity in a seamless manner, without requiring any manual intervention. These advanced management capabilities and out-of-the-box automation content allow complex database physical-to-virtual migration projects to be completed in days, rather than weeks and months.

Friday, May 15, 2009

Implementing a Simple Internal Database or Application Cloud – Part II

Based on the overview of private clouds in my prior blog, here’s the 5-step recipe for launching your implementation:

  1. Identify the list of applications that you want to deploy on the cloud;

  2. Document and publish precise end-user requirements and service levels to set the right expectations – for example, time in hours to deliver a new database server;

  3. Identify underlying hardware and software components – both new and legacy – that will make up the cloud infrastructure;

  4. Select the underlying cloud management layer for enabling existing processes, and connecting to existing tools and reporting dashboards;

  5. Decide if you wish to tie in access to public clouds for cloud bursting, cloud covering or simply, backup/archival purposes.

Identify the list of applications that you want to deploy on the cloud
The primary reason for building a private cloud is control – not only in terms of owning and maintaining proprietary data within a corporate firewall (mitigating any ownership, compliance or security concerns), but also deciding what application and database assets to enable within the cloud infrastructure. Public clouds typically give you the option of an x86 server running either Windows or Linux. On the database front, it’s usually either Oracle or SQL Server (or the somewhat inconsequential MySQL). But what if your core applications are built to use DB2 UDB, Sybase or Informix? What if you rely on older versions of Oracle or SQL Server? What if your internal standards are built around AIX, Solaris or HP/UX systems? What if your applications need specific platform builds? Having a private cloud gives you full control over all of these areas.

The selection of applications and databases to reside on the cloud should be governed by a single criterion – popularity; i.e., which applications are likely to proliferate the most in the next 2 years across development, test and production. List your top 5 or 10 applications and their infrastructure – regardless of operating system and database type(s) and version(s).

Document precise requirements and SLAs for your cloud
My prior blog entry talked about broad requirements (such as self-service capabilities, real-time management and asset reuse), but break those down into detailed requirements that your private cloud needs to meet. Share those with your architecture / ops engineering peers, as well as target users (application teams, developer/QA leads, etc.) and gather input. For instance, a cloud deployment I’m currently working on aims to meet the following manifesto, arrived at after a series of meticulous workshops attended by both IT stakeholders and cloud users:


In the above example, BMC Remedy was chosen as the front-end for driving self-service requests largely because the users were already familiar with using that application for incident and change management. In its place, you can utilize any other ticketing system (e.g., HP/Peregrine or EMC Infra) to present a friendly service catalog to end users. Up-and-coming vendor Service-now extends the notion of a service catalog to include a full-blown shopping cart and corresponding billing. Also, depending on which cloud management software you utilize, you may have additional (built-in) options for presenting a custom front-end to your users – whether they are IT-savvy developers, junior admin personnel located off-shore or actual application end-users.


Identify underlying cloud components
Once you have your list of applications and corresponding requirements laid out, you can begin defining which hardware and software components you are going to use. For instance, can you standardize on the x86 platform for servers with VMware as the virtualization layer? Or do you have a significant investment in IBM AIX, HP or Sun hardware? Each platform tends to have its own virtualization layer (e.g., AIX LPARs, Solaris Containers, etc.), all of which can be utilized within the cloud. Similarly, for the storage layer, can you get away with just one vendor offering – say, NetApp filers – or do you need to accommodate multiple storage options such as EMC and Hitachi? Again, the powerful thing about private clouds is – you get to choose! During a recent cloud deployment, the customer required us to utilize EMC SANs for production application deployments, and NetApp for development and QA.

Also, based on application use profiles and corresponding availability and performance SLAs, you may need to include clustering or facilities for standby databases (e.g., Oracle Data Guard or SQL log shipping) and/or replication (e.g., GoldenGate, Sybase Replication Server).

Now as you read this, you are probably saying to yourself – “Hey wait a minute… I thought this was supposed to be a recipe for a ‘simple cloud’. By the time I have identified all the requirements, applications and underlying components (especially with my huge legacy footprint), the cloud will become anything but simple! It may take years and gobs of money to implement anything remotely close to this…” Did I read your mind accurately? Alright, let’s address this question below.


Select the right cloud management layer
Based on all the above items, the scope of the cloud and underlying implementation logistics can become rather daunting – making the notion of a “simple cloud” seem unachievable. However, here’s where the cloud management layer comes to the rescue. A good cloud management layer keeps things simple via three basic functions:

  1. Abstraction;
  2. Integration; and
  3. Out-of-the-box automation content

Cloud management software ties in existing services, tools and processes to orchestrate and automate specific cloud management functions end-to-end. Stratavia’s Data Palette is an example of intelligent cloud management software. Data Palette is able to accommodate diverse tasks and processes – such as asset provisioning, patching, data refreshes and migrations, resource metering, maintenance and decommissioning – due to significant out-of-the-box content. (Stratavia refers to these as Solution Packs, which are basically discrete products that can be plugged into the underlying Data Palette platform.) All such content is externally referenceable via a Web Service API, making it straightforward to integrate with existing tools and with 3rd party service catalogs such as Service-Now or Remedy.
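For example, a service catalog entry could kick off an automation workflow through such a web service call. The endpoint path, payload fields and response field below are invented for illustration and are not Data Palette’s documented API; the sketch only shows the integration pattern.

```python
# Hypothetical integration sketch: the endpoint path and payload fields are invented
# for illustration and are not Data Palette's documented API.
import requests

def launch_workflow(base_url, api_user, api_password, workflow, params):
    """Ask the automation platform to run a named workflow on behalf of a catalog request."""
    response = requests.post(
        f"{base_url}/api/workflows/{workflow}/run",   # hypothetical endpoint
        json={"parameters": params},
        auth=(api_user, api_password),
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["execution_id"]           # hypothetical response field

# Example: a Service-Now or Remedy business rule could call this when a request is approved.
# execution_id = launch_workflow("https://automation.example.com", "svc_catalog", "secret",
#                                "refresh_qa_database", {"source": "PROD1", "target": "QA3"})
```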

Data Palette does not impose restrictions on server, operating system and infrastructure components within the cloud. Its database Solution Packs support various flavors and versions of Oracle, SQL Server, DB2, Sybase and Informix running on UNIX (Solaris, HP/UX and AIX), Linux and Windows. Storage components such as NetApp are managed via the Zephyr API (ZAPI) / ONTAPI interfaces.

However, in addition to out-of-the-box integration and automation capabilities, the primary means of keeping complexity at bay is the abstraction layer. Data Palette uses a metadata repository that is populated via native auto-discovery (or via integration with pre-deployed CMDBs and monitoring tools) to gather a set of configuration and performance metadata identifying the current state of the infrastructure, along with centrally defined administrative policies. This central metadata repository makes it possible for the right automated procedures to be executed on the right kind of infrastructure – avoiding mistakes (typically associated with static workflows from classic run book automation products) such as executing an Oracle9i-specific data refresh method on a recently upgraded Oracle10g database – without the cloud administrator or user having to track and reconcile such infrastructure changes and manually adjust/tweak the automation workflows. Such metadata-driven automation keeps the automation workflows dynamic, allowing the automation to scale seamlessly to hundreds or thousands of heterogeneous servers, databases and applications.
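A toy sketch of what metadata-driven dispatch means in practice is shown below. The repository structure and workflow names are invented for illustration, not Data Palette internals; the idea is simply that the workflow consults the currently discovered version of the asset rather than a hard-coded assumption.

```python
# Toy sketch of metadata-driven dispatch; the repository contents and workflow names
# are illustrative, not Data Palette internals.

METADATA_REPOSITORY = {
    # asset name -> discovered configuration (kept current by auto-discovery / CMDB sync)
    "FINDB01": {"dbms": "oracle", "version": 10, "edition": "RAC"},
    "HRDB02":  {"dbms": "oracle", "version": 9,  "edition": "standalone"},
}

REFRESH_WORKFLOWS = {
    ("oracle", 9):  "refresh_with_export_import",
    ("oracle", 10): "refresh_with_data_pump",
}

def refresh_database(asset_name):
    """Pick the refresh method that matches the asset's *current* discovered version."""
    meta = METADATA_REPOSITORY[asset_name]
    workflow = REFRESH_WORKFLOWS[(meta["dbms"], meta["version"])]
    print(f"Running {workflow} against {asset_name} ({meta['edition']})")

refresh_database("FINDB01")   # uses the 10g method even if this asset was 9i last quarter
```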

Metadata collections can also be extended to specific (custom) application configurations and behavior. Data Palette allows Rule Sets to be applied to incoming collections to identify and respond to maintenance events and service level violations in real time, making the cloud autonomic (in terms of self-configuring and self-healing attributes), with detailed resource metering and administrative dashboards.

Data Palette’s abstraction capabilities also extend to the user interface, wherein specific groups of cloud administrators and users are maintained via multi-tenancy (called Organizations), Smart Groups™ (dynamic groupings of assets and resources), and role-based access control.

Optionally tie in access to public clouds for Cloud Bursting, Cloud Covering or backup/archival purposes
Now that you are familiar with the majority of the ingredients for a private cloud rollout, the last item worth considering is whether to extend the cloud management layer’s set of integrations to tie into a public cloud provider – such as GoGrid, Flexiscale or Amazon EC2. Based on specific application profiles, you may be able to use a public cloud for Cloud Bursting (i.e., selectively leveraging a public cloud during peak usage) and/or Cloud Covering (i.e., automating failover to a public cloud). If you are not comfortable with the notion of a full service public cloud, you can consider a public sub-cloud (for specific application silos e.g., “storage only”) such as Nirvanix or EMC Atmos for storing backups in semi-online mode (disk space offered by many of the vendors in this space is relatively cheap – typically 20 cents per GB per month). Most public cloud providers offer an extensive API set that the internal cloud management layer can easily tap into (e.g., check out wiki.gogrid.com). In fact, depending on your internal cloud ingredients, you can take the notion of Cloud Covering to the next level and swap applications running on the internal cloud to the external cloud and back (kind of an inter-cloud VMotion operation, for those of you who are familiar with VMware’s handy VMotion feature). All it takes is an active account (with a credit card) with one of these providers to ensure that your internal cloud has a pre-set path for dynamic growth when required – a nice insurance policy to have for any production application – assuming the application’s infrastructure and security requirements are compatible with that public cloud.
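As a generic illustration of a cloud-bursting hook – the endpoint, payload fields and threshold below are invented placeholders, not the GoGrid, Flexiscale or EC2 API – the decision could look something like this:

```python
# Generic, hypothetical sketch of a burst decision; the provider endpoint and fields
# below are invented placeholders, not any specific public cloud provider's API.
import requests

CPU_BURST_THRESHOLD = 0.85   # assumed policy: burst when sustained utilization exceeds 85%

def maybe_burst(internal_cpu_utilization, provider_url, api_key, image_id):
    """Spin up capacity in a public cloud only when the internal cloud is saturated."""
    if internal_cpu_utilization < CPU_BURST_THRESHOLD:
        return None
    resp = requests.post(
        f"{provider_url}/servers",                      # placeholder endpoint
        json={"image": image_id, "size": "medium"},
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json().get("server_id")
```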

Wednesday, January 28, 2009

Implementing a Simple Internal Database or Application Cloud - Part I

A “simple cloud”? That comes across as an oxymoron of sorts since there’s nothing seemingly simple about cloud computing architectures. And further, what do DBAs and app admins have to do with the cloud, you ask? Well, cloud computing offers some exciting new opportunities for both Operations DBAs and Application DBAs – models that are relatively easy to implement, and bring immense value to IT end-users and customers.

The typical large data center environment has already embraced a variety of virtualization technologies at the server and storage levels. Add-on technologies offering automation and abstraction via service oriented architecture (SOA) are now allowing them to extend these capabilities up the stack – towards private database and application sub-clouds. These developments seem more pronounced in the banking, financial services and managed services sectors. However, while working on Data Palette automation projects at Stratavia, every once in a while I do come across IT leaders, architects and operations engineering DBAs in other industries as well who are beginning to envision how specific facets of private cloud architectures can enable them to service their users and customers more effectively (while also compensating for the workload of colleagues who have exited their companies due to the ongoing economic turmoil). I wanted to specifically share here some of the progress in database and application administration with regard to cloud computing.

So, for those database and application admins who haven’t had a lot of exposure to cloud computing (which, BTW, is a common situation since most IT admins and operations DBAs are dealing with boatloads of “real-world hands-on work” rather than participating in the next evolution of database deployments), let’s take a moment to understand what it is and its relative benefits. An “application” in this context refers to any enterprise-level app – both 3rd party (say, SAP or Oracle eBusiness Suite) as well as home-grown N-tier apps that have a fairly large footprint. Those are the kinds of applications that get maximum benefit from the cloud. Hence I use the term “data center asset” or simply “asset” to refer to any type of database or application. However, at times I do resort to specific database terminology and examples, which can be extrapolated to other application and middleware types as well.

Essentially a cloud architecture refers to a collection of data center assets (say, database instances, or just schemas to allow more granularity) that are dynamically provisioned and managed throughout their lifecycle – based on pre-defined service levels. This lifecycle covers multiple areas starting with deployment planning (e.g., capacity, configuration standards, etc.), provisioning (installation, configuration, patching and upgrades) and maintenance (space management, logfile management, etc.) extending all the way to incident and problem management (fire-fighting, responding to brown-outs and black-outs), and service request management (e.g., data refreshes, app cloning, SQL/DDL release management, and so on). All of these facets are managed centrally such that the entire asset pool can be viewed and controlled as one large asset (effectively virtualizing that asset type into a “cloud”).

Here’s a picture representing a fully baked database cloud implementation:

As I mentioned in a prior blog entry, there are multiple components that have come together to enable a cloud architecture. But more on that later. Let’s look at the database/application-specific attributes of a cloud (you could read this as a list of requirements for a database cloud).

  • Self-service capabilities: Database instances or schemas need to be capable of being rapidly provisioned based on user specifications by administrators, or by the users themselves (in selective situations – areas where the administrators feel comfortable giving control to the users directly). This provisioning can be done on existing or new servers (the term “OS images” is more appropriate given that most of the “servers” would be virtual machines rather than real bare metal) with appropriate configuration, security and compliance levels. Schema changes or SQL/DDL releases can be rolled out in a scheduled manner, or on-demand. The bulk of these releases, along with other service requests (such as refreshes, cloning, etc.), should be capable of being carried out by project teams directly – with the right credentials (think role-based access control).
  • Real-time infrastructure: I'm borrowing a term from Gartner (specifically, distinguished analyst Donna Scott's vocabulary) to describe this requirement. Basically, the assets need to be maintained in real time per specific deployment policies (such as development environment versus QA or Stage): tablespaces and datafiles created per specific naming / size conventions and filesystem/mount-point affinity (accommodating specific SAN or NAS devices, different LUN properties and RAID levels for reporting/batch databases versus OLTP environments), data backed up at the requisite frequency per the right backup plan (full, incremental, etc.), resource usage metered, failover/DR occurring as needed, and finally, the asset archived and de-provisioned based on either a specific time-frame (specified at the time of provisioning) or on-demand – after the user or administrator indicates that the environment is no longer required (or after a specific period of inactivity). All of this needs to be subject to administrative/manual oversight and controls (think dashboards and reports, as well as the ability to interact with or override automated workflow behavior). A minimal sketch of such a deployment policy appears after this list.
  • Asset type abstraction and reuse: One should be able to mix-and-match these asset types. For instance, one can rollout an Oracle-only database farm or a SQL Server-only estate. Alternatively, one can also span multiple database and application platforms allowing the enterprise to better leverage their existing (heterogeneous) assets. Thus, the average resource consumer (i.e., the cloud customer) shouldn’t have to be concerned about what asset types or sub-types are included therein – unless they want to override default decision mechanisms. The intra-cloud standard operating procedures take those physical nuances into account, thereby effectively virtualizing the asset type.
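By way of illustration only – the policy fields below are invented, not a Data Palette policy definition – the “real-time infrastructure” requirement boils down to provisioning and maintenance workflows consulting per-environment policy data such as this:

```python
# Invented policy structure for illustration; not any vendor's policy definition.
DEPLOYMENT_POLICIES = {
    "development": {
        "datafile_mount": "/u02/oradata",
        "naming_pattern": "{app}_{env}_data{seq:02d}.dbf",
        "backup": {"type": "incremental", "frequency": "weekly"},
        "expire_after_days": 90,       # auto-deprovision idle dev environments
    },
    "production": {
        "datafile_mount": "/san/lun_raid10",
        "naming_pattern": "{app}_{env}_data{seq:02d}.dbf",
        "backup": {"type": "full", "frequency": "daily"},
        "expire_after_days": None,     # never auto-expire production
    },
}

def datafile_name(env, app, seq):
    """Build a datafile path that honors the environment's naming and mount conventions."""
    policy = DEPLOYMENT_POLICIES[env]
    return policy["datafile_mount"] + "/" + policy["naming_pattern"].format(app=app, env=env, seq=seq)

print(datafile_name("development", "crm", 1))   # /u02/oradata/crm_development_data01.dbf
```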
The benefits of a database cloud include empowering users to carry out diverse activities in self-service mode in a secure, role-based manner, which in turn enhances service levels. Activities such as having a database provisioned or a test environment refreshed can often take multiple hours or days. Those can be reduced to a fraction of their normal time – reducing latency, especially in situations where there need to be hand-offs and task turn-over across multiple IT teams. In addition, the resource-metering and self-managing capabilities of the cloud allow better resource utilization and avoid resource waste, improving performance levels, reducing outages and removing other sources of unpredictability from the equation.

A cloud, while viewed as bleeding edge by some organizations, is viewed by larger organizations as critical – especially in the current economic situation. Rather than treating each individual database or application instance as a distinct asset and managing it per its individual requirements, a cloud model allows virtual asset consolidation, thereby allowing many assets to be treated as one and promoting unprecedented economies of scale in resource administration. So as companies continue to scale out data and assets, but cannot afford to correspondingly scale up administrative personnel, the cloud helps them achieve non-linear growth.

Hopefully the attributes and benefits of a database or application cloud (and the tremendous underlying business case) become apparent here. My next blog entry (or two) will focus on the requisite components and the underlying implementation methods to make this model a reality.

Friday, January 09, 2009

Protecting Your IT Operations from Failing IT Services Firms

The recent news about India-based IT outsourcing major Satyam and its top management’s admissions of accounting fraud bring forth shocking and desperate memories of an earlier time – when multiple US conglomerates such as Enron, Arthur Andersen, Tyco, etc. fell under similar circumstances, bringing down with them the careers and aspirations of thousands of employees, customers and investors. Ironically, Satyam (the name means “truth” in Sanskrit), whose management by its own admission had been duping investors for several years, received the Recognition of Commitment award from the US-based Institute of Internal Auditors in 2006 and was featured in Thomas Friedman’s best-seller “The World is Flat”. Indeed, how the mighty have fallen…

As one empathizes with those affected, the key question that comes to mind is: how do we prevent another Satyam? However, that line of questioning seems rather idealistic. The key question should probably be: how can IT outsourcing customers protect themselves from these kinds of fallouts? Given how flat the world is, an outsourcing vendor’s fall from grace (especially one as ubiquitous as Satyam in this market) has reverberations throughout the global IT economy – directly in the form of failed projects, and indirectly in the form of lost credibility for customer CIOs who rely on these outsourcing partners for their critical day-to-day functioning.

Having said that, here are some key precautionary measures (in progressive order) that companies can take to protect themselves and their IT operations, beyond standard measures such as using multiple IT partners, structured processes and centralized documentation/knowledge bases.
· Move from time & material (T&M) arrangements to fixed-priced contracts
· Move from static knowledge-bases to automated standard operating procedures (SOPs)
· Own the IP associated with process automation

Let’s look at how each of these afford higher protection in situations such as the above:
· Moving from T&M arrangements to fixed price contracts – T&M contracts rarely provide incentive for the IT outsourcing vendor to bring in efficiencies and innovation. The more hours that are billed, the more revenue they make – so where’s the motivation to reduce the manual labor? Moreover, T&M labor makes customers vulnerable to loss of institutional knowledge and gives them little to no leverage when negotiating rates or contract renewals, because switching out a vendor (especially one that holds much of the “tribal knowledge”) is easier said than done.

With fixed price contracts, the onus of ensuring quality and timely delivery is on the IT services vendor (to do so profitably requires the use of as little labor as possible) and consequently, one finds more structure (such as better documentation and process definition) and higher use of innovation and automation. All of this works in the favor of the customer and, in the case of a contractor or the vendor no longer being available, makes it easier for a replacement to hit the ground running.

· Moving from static knowledge-bases to automated SOPs – It is no longer enough to have standard operating procedures documented within SharePoint-type portals. It is crucial to automate these static run books and documented SOPs via data center automation technologies, especially newer run book automation product sets (a.k.a. IT process automation platforms) that allow definition and utilization of embedded knowledge within the process workflows. These technologies allow contractors to move static process documentation to workflows that use this environmental knowledge to actually perform the work. Thus, the current process knowledge no longer merely resides in peoples’ heads, but gets moved to a central software platform, thereby mitigating the impact of losing key contractor personnel or vendors. (A minimal sketch of this shift from static documentation to an executable SOP appears after this list.)

· Owning the IP associated with such process automation platforms – Frequently, companies that are using outsourced services ask, “Why should I invest in automation software? I have already outsourced our IT work to company XYZ. They should be buying and using such software. Ultimately, we have no control over how they perform the work anyway…” The Satyam situation is a classic example of why it behooves end-customers to actually purchase and own the IP related to process automation software, rather than deferring it to the IT services partner. By having process IP defined within a software platform that the customer owns, it becomes feasible to switch contractors and/or IT services firms. If the IT services firm owns the technology deployment, the corresponding IP walks out the door with the vendor, preventing the customer from getting the benefit of the embedded process knowledge.
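Coming back to the automated-SOPs point above, here is a minimal sketch of the difference between a documented run book and an executable one. The step names, asset names and threshold are generic illustrations under assumed conventions, not any vendor’s API; the point is that the environmental knowledge lives in the platform rather than in someone’s head.

```python
# Minimal sketch: the same SOP, first as static documentation, then as an executable
# workflow that consults environmental knowledge. All names are generic illustrations.

SOP_DOCUMENT = """
1. Check tablespace free space.
2. If below 10%, add a datafile on the approved mount point for that environment.
3. Update the change ticket.
"""

ENVIRONMENT_KNOWLEDGE = {"FINDB01": {"mount": "/san/lun07", "free_pct_threshold": 10}}

def space_management_sop(asset, current_free_pct, add_datafile, update_ticket):
    """Executable version of the documented SOP above."""
    knowledge = ENVIRONMENT_KNOWLEDGE[asset]
    if current_free_pct < knowledge["free_pct_threshold"]:
        add_datafile(asset, mount=knowledge["mount"])
        update_ticket(asset, action="added datafile")
```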

It is advisable for the customer to have some level of control and oversight over how the work is carried out by the vendor. It is fairly commonplace for the customer to insist on the use of specific tools and processes such as ticketing systems, change control mechanisms, monitoring tools and so on. The process automation engine shouldn’t be treated any differently. The bottom line is, whoever holds the process IP carries the biggest stick during contract renewals. If owning the technology is not feasible for the customer, at least make sure that the embedded knowledge is in a format wherein it can be retrieved and reused by the next IT services partner that replaces the current one.

Friday, January 02, 2009

Zen and the Art of Automation Continuance

The new year is a good time to start thinking about automation continuance. Most of us initiate automation projects with a focus on a handful of areas – such as provisioning servers or databases, automating the patch process, and so on. While this kind of focus is helpful in ensuring a successful outcome for that specific project, it also has the effect of reducing overall ROI for the company – because once the project is complete, people move on to other day-to-day work patterns (relying on their usual manual methods), instead of continuing to identify, streamline and automate other repetitive and complex activities.

Just recently I was asked by a customer (a senior manager at a Fortune 500 company that has been using data center automation technologies, including HP/Opsware and Stratavia's Data Palette, for almost a year), “How do I influence my DBAs to truly change their behavior? I thought they had tasted blood with automation, but they keep falling back to reactive work. How do I move my team closer to spending the majority of their time on proactive work items such as architecture, performance planning, providing service level dashboards, etc.?” Sure enough, their DBA team started out impressively, automating over half a dozen tasks such as database installs, startup/shutdown processes, cloning, tablespace management, etc. However, during the course of the year, their overall reactive workload seems to have relapsed.

Indeed, it can seem an art to keep IT admins motivated towards continuing automation.

A good friend of mine in the Oracle DBA community, Gaja Krishna Vaidyanatha, coined the phrase “compulsive tuning disorder” to describe DBA behavior that involves spending a lot of time tweaking parameters, statistics and such in the database, almost to the point of negative returns. A dirty little secret in the DBA world is that this affliction frequently extends to areas beyond performance tuning and can be referred to as “compulsive repetitive work disorder”. Most DBAs I work with are aware of their malady, but do not know how to break the cycle. They see repetitive work as something that regularly falls on their plate and that they have no option but to carry out. Some of those activities may be partially automated, but overall, the nature of their work doesn’t change. In fact, they don’t know how to change, nor are they incented or empowered to figure it out.

Given this scenario, it’s almost unreasonable for managers to expect DBAs to change just because a new technology has been introduced in the organization. It almost requires a different management model, laden with heaps of work redefinition, coaching, oversight, re-training and, to cement the behavior modification, a different compensation model. In other words, the surest way to bring about change in people is to change the way they are paid. Leaving DBAs to their own devices and expecting change is not being fair to them. Many DBAs are unsure how any work pattern changes will impact their users’ experience with the databases, or even whether that change will cost them their jobs. It’s just too easy to fall back to their familiar ways of reactive work patterns. After all, the typical long hours of reactive work show one as a hardworking individual, provide a sense of being needed and foster notions of job security.

In these tough economic times, however, sheer hard work doesn’t necessarily translate to job security. Managers are seeking individuals that can come up with smart and innovative ways for non-linear growth. In other words, they are looking to do more with the same team – without killing off that team with super long hours, or having critical work items slip through the cracks.

Automation is the biggest enabler of non-linear growth. With the arrival of the new year, it is a good time to be talking about models that advocate changes to work patterns and corresponding compensation structures. Hopefully you can use the suggestions below to guide and motivate your team to get out of the mundane rut and continue down the path of more automation (assuming of course, that you have invested in broader application/database automation platforms such as Data Palette that are capable of accommodating your path).

1. Establish a DBA workbook with weights assigned to different types of activity. For instance, “mundane activity” may get a weight of, say, 30, whereas “strategic work” (whatever that may be for your environment) may be assigned a weight of 70. Now break down both work categories into specific activities that are required in your environment. Make streamlining and automating repetitive task patterns an intrinsic part of the strategic work category. Check your ticketing system to identify accurate and granular work items. Poll your entire DBA team to fill in any gaps (especially if you don’t have usable ticketing data). As a starting point, here’s a DBA workbook template that I had outlined in a prior blog.

2. Introduce a variable compensation portion to the DBAs’ total compensation package (if you don’t already have one) and link that to the DBA workbook – specifically to the corresponding weights. Obviously, this will require you to have a method to verify whether DBAs are indeed living up to the activity in the workbook. Make sure that there are activity IDs and cause codes for each work pattern (whether it’s an incident, service request or whatever). Get maniacal about having DBAs create a ticket for everything they do and properly categorize that activity. Also integrate your automation platform with your ticketing system so you can measure what kinds of mundane activities are being carried out in a lights-out manner. For instance, many Stratavia customers establish ITIL-based run books for repetitive DBA activities within Data Palette. As part of these automated run books, tickets get auto-created/auto-updated/auto-closed. That in turn will ensure that automated activities, as well as manual activities, get properly logged and relevant data is available for end-of-month (or quarterly or even annual) reconciliation of work goals and results – prior to paying out the bonuses.
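Here is a back-of-the-envelope sketch of how such a reconciliation could translate ticket data into a score, using the example weights of 30 and 70 from item 1. The hours, scoring formula and payout curve are illustrative assumptions on my part, not a prescribed compensation model.

```python
# Back-of-the-envelope sketch using the example weights above (30 mundane / 70 strategic).
# The hours, scoring formula and payout curve are illustrative, not a prescribed model.

WEIGHTS = {"mundane": 30, "strategic": 70}

def quarterly_score(hours_by_category):
    """hours_by_category: e.g. {"mundane": 250, "strategic": 150}, reconciled from tickets."""
    total = sum(hours_by_category.values())
    # Weight each hour by the value of its category; the result ranges from 30
    # (all mundane work) to 70 (all strategic work).
    weighted = sum(WEIGHTS[category] * hours for category, hours in hours_by_category.items())
    return round(weighted / total, 1)

score = quarterly_score({"mundane": 250, "strategic": 150})
bonus = 5000 * (score / 70)           # illustrative payout curve against the maximum weight
print(score, round(bonus, 2))         # 45.0 3214.29
```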

If possible, pay out the bonuses at least quarterly. Getting them (or not!) will be a frequent reminder to the team regarding work expected of them versus the work they actually do. If there are situations that truly require the DBAs to stick to mundane work patterns, identify them and get the DBAs to streamline, standardize and automate them in the near future so they no longer pose a distraction from preferred work patterns.

Many companies already have bonus plans for their DBAs and other IT admins. However, they link those plans to areas such as company sales, profits or EBITDA levels. Get away from that! Those areas are not “real” to DBAs. IT admins rarely have direct control over company revenue or spending levels. Such linking, while safer for the company of course (i.e., no revenue/profits, no bonuses), does not serve it well in the long run. It does not influence employee behavior other than telling them to do “whatever it takes” to keep business users happy and the cash register ringing, which in turn promotes reactive work patterns. There is no motivation or time for IT admins to step back and think strategically. But changing bonuses and variable compensation criteria to areas that IT admins can explicitly control – such as sticking to a specific workbook with more onus on strategic behavior – brings about the positive change all managers can revel in and, in turn, better profits for the company.

Happy 2009!

PS => I do have a more formal white-paper on this subject titled “5 Steps to Changing DBA Behavior”. If you are interested, drop me a note at “dbafeedback at stratavia dot com”. Cheers!