Tuesday, December 21, 2010

Protect the Security of Your Data Center

 

Over the years I have been involved with my share of data center recovery exercises, covering everything from power outages and hurricanes to earthquakes and chemical explosions. Each time, no matter how well the business continuity plan is written, something new is learned. Here are a few lessons that stand out from my experience.

Where’s the Business Continuity Plan?

Before you tell everyone that you have a business continuity plan (BCP) in place, you should actually make sure that you have written one so you know what to do. Oh, and by the way, everyone should know where it is so that it can be easily located. This sounds like a no-brainer, but you would be surprised at how many companies jump in with both feet without actually planning out their recovery process. So what is the lesson learned? Actually write out a business continuity plan, point out its weaknesses and don’t be afraid to present it to your executive team. Before you do that, however, I suggest knowing the answers to the questions they are going to ask, which will help you get their buy-in and support for the plan. Here are a few things you will need to know:

  • How much will this cost?
  • What if we lost our data center today?
  • What are our existing recovery plans?
  • What disasters are we currently protected against?
  • What is our risk to specific disasters?

How much will this cost?

This might be the easiest question because the answer isn’t a dollar figure. Instead, address what a data center disaster would cost the company in man-hours and business impact. While your costs may involve purchasing risk analysis or additional hardware (all of which are important and will need the management team’s support), the more important question you can ask is, “What will this cost us if we don’t?” I can tell you for a fact that the management team will understand the impact when your company can’t service its customers, can’t receive messages and can’t close sales. The cost is easily justified. You only need one business continuity plan to protect you from every disaster, and let’s face it, whether it is a tsunami or the simple mistake of pulling the wrong disk from the array, you are at risk for downtime. It is money well spent, and the knowledge that a detailed BCP is in place puts everyone at ease. However, as stated before, once your BCP is written make sure you know where it is. The last thing you want is to have to think “now where did I put that?” or “who had it last?” In fact, I’d highly recommend getting a fireproof safe. Put the plan in, along with some granola bars, coffee and aspirin, because you will need all three.

What if we lost our data center today?

It’s important to be honest because you will shortchange yourself if you try to sugarcoat it. The truth is that if we can’t recover our business-critical systems within a 48-hour period, the risk of going out of business increases significantly, and at the very least we lose customers, productivity and ultimately revenue. That truth will get the attention of most executives, who will then provide the support and resources you need to implement your plan. It’s important to remember that the plan doesn’t get written in a weekend and implemented the following week. It takes months, and there is a process with 10 clearly outlined steps. This is followed by continuous testing, revisions and updates, so executive support is imperative to the overall process. While losing the entire data center may be the extreme case, I know most companies can recover their business-critical systems within a 2-4 hour RTO and are protected from the most common data center incidents, like a failed drive, processor or power supply. These are the most common disasters IT managers face (and hopefully the only ones your company will have to face). However, that doesn’t cover an incident like the one I experienced a few years ago, which I would never have thought to include in a BCP.

What are our recovery plans?

A few years ago I was managing a BCP team where we were setting up the controls to replicate systems from five locations on the east coast to a disaster recovery facility in Arizona. A new rack of servers was built as a duplicate of the primary data center IT systems and sent to one of the locations to back up one of the remaining sites. The shipping company dropped the rack off the loading dock and the impact shot the drives through the chassis. But that wasn’t the real disaster we faced, just one of the challenges we had to deal with along the way. The real disaster occurred when we were about to bring the disaster recovery facility online and I received a call that the UPS at one of the locations had exploded. I had never considered the fact that UPS units are essentially large, chemical-filled batteries that can explode, and when they do they cover everything in all of their chemical makeup glory. In addition, it was the primary power source for the data center, so not only was the power out for all of the IT systems, but those IT systems were also covered in toxic goo and we had to promptly contact HAZMAT for cleanup. Once it was determined that the data center wasn’t coming online anytime soon, the recovery process was started and systems were brought online within fifteen minutes, restoring operations to a functional level. This was only possible because everyone knew what their responsibilities were, and we reacted as a coordinated team with controlled violence. So, know your recovery plan inside and out!

What are our existing recovery plans?

You may be surprised to learn that your existing data center recovery plans may not be in all that bad of shape. However, there is always room for improvement. Most recovery plans include some form or combination of tape for recovery. Tape is an option, but only for those systems that have an RTO/RPO greater than 24-48 hours, and it is not really a solution if you need to recover an entire data center. What is required to recover an entire data center is a co-location or disaster recovery facility with virtualized blade server infrastructure, which minimizes overall footprint, power and cooling costs and keeps the systems readily available rather than merely readily recoverable. Being available versus recoverable is a big difference when it comes to RPO and RTO.

What disasters are we currently protected against?

Most IT managers will have procedures in place that protect against a server or storage failure or corruption, but few are protected against entire data center failures. There are many types of disasters that will be identified in the risk assessment of a BCP, but they boil down to three categories: sudden-impact disasters like environmental incidents, chemical spills and fires; weather-related disasters like hurricanes, tornadoes and earthquakes; and human-related disasters like malicious attacks or pandemic flu outbreaks. These are the types of disasters that are typically addressed in a business continuity plan to fully engage the disaster rather than just recover a few servers. It is a tougher sell to executives to prepare for this type of disaster, but the whole point is to be proactive and preventative in order to keep the business assets protected. The best disaster is the one that doesn’t happen, and the best plan is the one that doesn’t need to be enacted.

What is our risk to specific disasters?

Depending on the location of the corporate facilities or data center, some of these three categories may not apply to you, because certain natural disasters are more likely in specific regions of the world. For example, although there is a fault line that runs through the middle of the Midwest United States, it is far more likely that northern California will be hit by an earthquake than St. Louis, MO. Similarly, San Francisco isn’t likely to be at risk of a tornado the way the Midwest frequently is. This will all be identified in the risk assessment of the BCP analysis to help evaluate and rate the level of risk that your company or data center is subject to. Another environmental example that is often overlooked is a company located near a major interstate highway. In this case, there is the potential risk of a semi truck full of chemicals or, worse, hazardous material overturning and causing the evacuation of several square miles. A similar event recently occurred just north of Boston, MA, where a chemical company that made paint exploded, leveling an entire city block and causing structural damage to buildings beyond that.

While these types of events are less common than the typical “worm du jour” attacking your IT systems, they are certainly important to consider, not only for the recovery of your data center but for the health and safety of your company’s most valuable asset: its employees.

Wednesday, October 6, 2010

The Cost of Downtime

Figuring out the cost of downtime is the first step in creating a business continuity plan.

Here’s how to do it:

Calculating the cost of downtime can help you set your RTO and RPO so there are no additional nasty surprises during and after a system outage. Knowing your cost of downtime can also help senior management understand IT system disaster recovery hardware and software budgets. While there is a simple formula below for calculating your cost of downtime, you can also consider these questions:
• How much money would your company lose if you lost all your transaction data for the last twelve hours, or even the last ten minutes?
• What is the value of the knowledge contained in your company’s last twelve hours’ worth of e-mails and e-mail attachments? What would it cost to have your engineers recreate the last twelve hours of work?
• What’s your exposure if you can’t produce this data in compliance with Sarbanes-Oxley, HIPAA, SEC and other regulations?

Here’s a simple way to estimate the cost of each downtime occurrence.
Cost Per Occurrence = (To + Td) x (Hr + Lr)
To = Length of Outage, in hours
Td = Time Delta to Data Backup, in hours (How long since the last backup?)
Hr = Hourly Rate of Personnel (Calculate by taking the monthly expense per department and dividing by the number of work hours in the month.)
Lr = Lost Revenue per Hour (Applies if the department generates profit. A good rule is to look at profitability over three months and divide by the number of work hours in that period.)
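
As a rough illustration, the formula above can be turned into a few lines of Python. This is only a sketch: the function name and the sample figures are made up for the example and should be replaced with your own department numbers.

```python
def cost_per_occurrence(outage_hours, hours_since_backup,
                        hourly_personnel_rate, lost_revenue_per_hour):
    """Cost Per Occurrence = (To + Td) x (Hr + Lr)."""
    return (outage_hours + hours_since_backup) * (
        hourly_personnel_rate + lost_revenue_per_hour
    )


# Illustrative figures only (assumptions, not data from a real company):
# a 4-hour outage, last backup 12 hours old, $1,500/hour in personnel
# cost and $5,000/hour in lost revenue.
estimate = cost_per_occurrence(4, 12, 1500.0, 5000.0)
print(f"Estimated cost per occurrence: ${estimate:,.2f}")
```

Running the same calculation with a shorter backup interval (a smaller Td) is a quick way to show management how more frequent backups or replication reduce the exposure.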

Finally, define the recovery objectives for your applications. The best way to quantify your objectives is with a Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for each application. Knowing your RTO and RPO will help you determine what software and hardware features and functionality are required for your backup and recovery architecture.
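
One possible way to keep per-application objectives honest is to record them next to the results of your recovery exercises. The sketch below assumes RTO is the longest tolerable time to restore service and RPO is the longest tolerable window of lost data, both in hours. The application names and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjective:
    application: str
    rto_hours: float  # longest tolerable time to restore service
    rpo_hours: float  # longest tolerable window of lost data


def meets_objectives(objective, recovery_hours, hours_since_last_backup):
    """Return True when a tested recovery stays within both objectives."""
    return (recovery_hours <= objective.rto_hours
            and hours_since_last_backup <= objective.rpo_hours)


# Hypothetical applications and objectives.
objectives = [
    RecoveryObjective("email", rto_hours=4, rpo_hours=1),
    RecoveryObjective("file archive", rto_hours=48, rpo_hours=24),
]

# Hypothetical exercise results: (recovery time, age of last backup), in hours.
exercise_results = {"email": (2.5, 12), "file archive": (30, 12)}

for objective in objectives:
    recovery_hours, backup_age = exercise_results[objective.application]
    ok = meets_objectives(objective, recovery_hours, backup_age)
    print(f"{objective.application}: {'OK' if ok else 'GAP'}")
```

In this made-up example the email system recovers quickly enough but would lose more data than its RPO allows, which is the kind of gap that points to replication rather than a nightly backup.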

Thursday, September 9, 2010

Failover isn’t Just for Disasters Anymore: Migrations and Provisioning using Dynamic Infrastructure

Over the last seven years of implementing disaster recovery solutions for Fortune 500 companies’ IT infrastructure, I have noticed a growing trend: data center managers want to do more with the controls they have implemented for their business continuity plan. They have disaster recovery solutions in place for recovery in the event of a disaster, but they are looking at those same solutions and asking themselves, “How can I use this to help with daily operations or little ‘d’ disasters?” The rapid adoption of Dynamic Infrastructure, as utilized by Double-Take® Software, is giving data center managers the ability to maintain their Business Continuity Plans (BCP) and facilitate projects such as server, hardware or complete data center migrations.

Dynamic Infrastructure is the ability to move any data, any server, anywhere, anytime with minimal impact to business operations. Data center managers are now using the controls implemented for their BCP to move servers in real time from older, lower-performing servers to new high-performance servers, to move from physical to virtual, and to move many servers to another server in the same data center or to a co-location facility across the street. The controls originally implemented to enable failover are now becoming a data center maintenance necessity, and not just in the event of a disaster. So, failover isn’t just for disasters anymore.

A few customers that I have helped recover found that the ability to fail over both the data and system state from one server to another similar server (with different hardware) meant they could now move data anywhere and fail forward at any time, without any disruption to their users or business operations. Now data center managers are asking themselves, “How else can I use business continuity controls for my daily operations?” A few have utilized Dynamic Infrastructure to move entire data centers, with great success.

Dynamic Infrastructure is becoming ever more important with daily data center operations and the adoption of virtualization for the disaster recovery center. Virtualization is helping reduce the overall footprint required for the number of servers needed to protect a data center, as well as reducing power, space and overall hardware requirements. Because Dynamic Infrastructure allows you to move data between dissimilar hardware, you can easily move from physical to virtual and back again. Many operations managers are simply using real-time replication to move from physical to virtual, minimizing the downtime that typical P2V processes require, and then keeping those systems on the virtual infrastructure. Migrations have now become routine and are easily accomplished with minimal impact to user access. It wasn’t that long ago that the mere mention of the word “migration” (whether it was data, server, hardware or OS) made IT managers wince. The issue with migrations in the past was mostly compatibility, and not realizing there were compatibility issues until you reached the “point of no return.” At that point you didn’t have any choice to back out, so you forged ahead and tried to work through the driver, hardware or software compatibility issues as they arose. Dynamic Infrastructure has intelligently reduced the typical risks of migration and provisioning by using auto-discovery features, automating much of the provisioning and, more importantly, providing the ability to return to the original production version if needed. Eliminating the proverbial point of no return is pretty important when attempting to migrate systems for any purpose.

Dynamic Infrastructure also eliminates the distance requirement that most other solutions have. Hardware migrations, as well as most virtualization solutions, usually require both systems to be in the same building. For example, if you are looking to replace an older CX series SAN with a newer iSCSI standard design, you almost certainly need fibre links between the two systems, and they would likely need to be in the same building to prevent network latency from impacting the data transfer. The same goes for moving systems between virtual servers, whether using VMware® VMotion™ or other virtualization tools.

Whenever significant distance is involved between servers there are several things to consider.

  • Bandwidth throughput
  • IP latency
  • Volume of data
  • Transactional change rate of the data
  • Disk write speed at the target location

These are less of a concern when using Dynamic Infrastructure because there are options available that mitigate the items listed above:

  • Bandwidth scheduling is available to control how much bandwidth data transfers may use at different times, so transfers can continue during your peak production periods, when less bandwidth is available for systems transfer.
  • Transfer compression is available that can shrink typical database transactions by upwards of 80% compared with the normal file or block copy you may see with some synchronous transfers.
  • Asynchronous replication has less of an issue with IP latency than other data transfer systems because it doesn’t require a return acknowledgement that the data has been written on the other side before sending the next set of data for transfer.
  • Because Dynamic Infrastructure mitigates the top three concerns with distance, the volume of data has less of an impact. Typically, data transfers for a P2V conversion require downtime for the system that is being converted, which means the data transfer usually has to be completed over a weekend or a scheduled change control period. So you might have 48 hours maximum (more than likely you don’t have that long), and a few terabytes of data just aren’t going to move that fast.
  • If bandwidth isn’t a concern, then you have to account for the amount of data that can be written to disk. I have heard several customers state “I have fibre” or “gigabit network,” which is great, because the throughput won’t be an issue. However, if that 7200 RPM drive in the system you are transferring to can only write 20GB per hour, then that is as fast as the data will move. I have seen some fibre array SANs that will scream a data transfer at 65 gigabytes per hour, but that is under ideal circumstances and probably not conservative enough for planning purposes.
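
To make the interaction between these factors concrete, here is a minimal planning sketch in Python. It is a simplified model of my own, not a vendor formula: it assumes the transfer runs at the slower of the network and the target disk, that compression simply stretches the usable link capacity, and that changes made during the copy subtract from the effective rate. All function names and figures are illustrative.

```python
def effective_rate_gb_per_hour(link_mbps, disk_write_gb_per_hour,
                               compression_ratio=0.0):
    """Return the slower of the network and disk rates, in GB per hour.

    link_mbps: usable bandwidth in megabits per second.
    compression_ratio: fraction by which data shrinks on the wire
                       (0.8 means 80% smaller); must be below 1.0.
    """
    if not 0.0 <= compression_ratio < 1.0:
        raise ValueError("compression_ratio must be in [0.0, 1.0)")
    # Mbit/s -> MB/s (divide by 8) -> MB/hour (x 3600) -> GB/hour (/ 1000).
    wire_gb_per_hour = link_mbps / 8 * 3600 / 1000
    # Compression lets more source data fit through the same link.
    network_gb_per_hour = wire_gb_per_hour / (1.0 - compression_ratio)
    return min(network_gb_per_hour, disk_write_gb_per_hour)


def initial_sync_hours(data_gb, change_gb_per_hour, rate_gb_per_hour):
    """Hours to finish the first synchronization while changes keep arriving.

    Returns None when the change rate meets or exceeds the transfer rate,
    because the copy would never catch up.
    """
    net_rate = rate_gb_per_hour - change_gb_per_hour
    if net_rate <= 0:
        return None
    return data_gb / net_rate


# A gigabit link writing to a disk that sustains only 20 GB/hour:
rate = effective_rate_gb_per_hour(link_mbps=1000, disk_write_gb_per_hour=20)
hours = initial_sync_hours(data_gb=1000, change_gb_per_hour=2,
                           rate_gb_per_hour=rate)
print(f"Effective rate: {rate:.1f} GB/hour, initial sync: {hours:.1f} hours")
```

With these sample numbers the disk, not the network, is the bottleneck, and a one-terabyte system takes longer than a 48-hour change window, which is the situation the last two bullets describe.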

Because Dynamic Infrastructure is a real-time data transfer, there isn’t any downtime or deadline required to move the systems and keep them current. You can start moving a terabyte system on a Monday, throttle the bandwidth utilization to only use what you specify and let it run. Set it and go, and let it notify you via SMTP messaging when it is complete and ready to fail over to the new system. The greatest thing about this is that you are capturing all the changes the users made during the transfer process. When the initial synchronization is complete, you can be confident that you have the most current set of data. If this were a P2V conversion process or a restoration from tape, you would have some length of downtime during the conversion or restoration, and you would either have to prevent your users from accessing that system during the process or determine how to update the target systems with the incremental changes that occurred during the restoration period.

Failover, thanks to Dynamic Infrastructure, becomes a choice rather than a necessity and a maintenance task rather than a disastrous event. There isn’t a need to wait for the disaster to introduce itself before you start exercising your failover process. Any good BCP should be exercised at least every six months, if not after each change control outage, to see if any modifications need to be made. This also allows the BCP controls to be used for day-to-day operations, which ensures familiarity with failover and failback. The failover process is now an essential part of provisioning, migrating or just moving systems to improve performance and processing, either near or far, without limitations. Dynamic Infrastructure is the ability to move systems from a data center anytime, anywhere, for whatever purpose. Systems are moved wherever needed for the purpose of provisioning, building a co-location facility, enhancing hardware or application performance or performing routine maintenance without interrupting business operations. This not only helps day-to-day operations for data center managers but may ultimately facilitate the protection of systems for cloud computing.

Friday, August 20, 2010

What You Need to Know About Backup for Virtual Machines

Virtualization technology continues to be adopted for consolidating data center infrastructure and providing a more flexible platform for moving, provisioning and backing up workloads. This has put an emphasis on backup and recovery: because a single virtual host can contain several virtual machines, it is even more important to protect not only the individual virtual machines but the entire virtual host. Virtualization has the same single point of failure as some cluster technologies and services: shared disk. What happens to your virtual infrastructure if you lose connectivity to your primary storage unit? The entire virtual host becomes unavailable, and that can put upwards of 8-10 production workloads at risk. But it doesn’t have to be that way.

Before the adoption of virtualization, backups could be configured to back up pieces of an application, the data or the operating system of a server, and sometimes the entire server with all of the above. The industry now refers to these as virtual workloads, as they contain the operating system, the application and usually the associated data as well. When a physical server is converted through the P2V process, a virtual disk image (VMDK or VHD) is created, which is used to spin up that workload as a virtual machine. So, for each virtual machine there is a corresponding virtual disk image that needs to be protected. However, backing up virtual machines is only one part of the process; the other, just as important, is recovery. Virtualization has actually simplified the backup and recovery process, as you only need to back up the virtual disk image to be able to restore the server as a whole, versus trying to back up and recover bits and pieces of it. Another advantage is that IT managers can usually restore that disk image to different hardware if necessary, or to an entirely different virtual host server.

This is where workload portability solutions have developed to easily move virtual workloads between virtual hosts for high availability. Products like VMware® vMotion and Microsoft® Live Migration provide the ability to transfer workloads in real time between similar virtual platforms, and other products like Double-Take® Move provide the ability to move virtual workloads between any virtual platforms. VMware vMotion utilizes the replication functionality of the attached storage to replicate the virtual disk images between devices, while Microsoft Live Migration is built upon failover clustering technology that allows shared storage between the virtual hosts. The Double-Take Move product is more hardware and virtual vendor neutral, as it allows virtual workloads to be moved in real time across any hardware or virtual platform. All of these solutions are effective backup solutions that can easily transfer the entire virtual disk image to another server and spin it up quickly to minimize any interruption to production operations.

But the underlying technology to back those virtual workloads up hasn’t changed much. The processes used to back up virtual machines are the same as they were for physical servers and basically break down into hardware-based replication and host-based replication. Tape backup solutions aren’t addressed in this article, as they have become more of an archive option and don’t meet the RTO or RPO requirements of virtual infrastructure. However, once the virtual disk images have been replicated to a designated backup area, tape is often used to archive those disk images to meet certain industry regulatory compliance requirements.

  • Hardware-Based Replication – Some virtual solutions utilize the inherent replication of the direct-attached storage or SAN to replicate the virtual disk images offsite to another storage device for backup. This usually uses either synchronous replication or a snapshot-type technology that periodically sends scheduled updates to the virtual disk image at the target destination. Synchronous replication sends blocks of changed data and waits for a confirmation from the receiving device before sending the next block for replication. This happens pretty quickly, but it usually has distance limitations and requires more bandwidth (because it sends data in blocks versus bytes). This differs from the snapshot process, which is usually scheduled to send changes to the virtual disk image based on a defined size or time period and is only available between like devices.
  • Host-Based Replication – This is an asynchronous technology that sits on the virtual host machine, replicates changes to the virtual disks as they occur and then applies them on the target servers in the order in which the operations are received. Asynchronous replication is usually transmitted at the byte level and provides additional compression in order to consume less bandwidth than block-level synchronous replication. Host-based replication is also more flexible because it is hardware agnostic and isn’t tied to a specific hardware or virtualization vendor, and it can be used for both physical and virtual environments.
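
The distance sensitivity of synchronous replication can be seen with a back-of-the-envelope model. The Python sketch below is a deliberate simplification, not a description of any product: it assumes a synchronous sender pays one full round trip per block before sending the next, while an asynchronous sender streams blocks back to back and only the final block pays a one-way delay. Block counts and timings are made-up illustration values.

```python
def synchronous_seconds(blocks, send_seconds_per_block, round_trip_seconds):
    """Each block waits for its acknowledgement before the next is sent."""
    return blocks * (send_seconds_per_block + round_trip_seconds)


def asynchronous_seconds(blocks, send_seconds_per_block, round_trip_seconds):
    """Blocks are streamed without waiting; only the last one-way trip adds delay."""
    return blocks * send_seconds_per_block + round_trip_seconds / 2


# Illustrative values: 10,000 changes, 1 ms to put each on the wire.
blocks = 10_000
send_time = 0.001

for label, rtt in [("same building", 0.0005), ("cross-country", 0.05)]:
    sync_s = synchronous_seconds(blocks, send_time, rtt)
    async_s = asynchronous_seconds(blocks, send_time, rtt)
    print(f"{label}: synchronous {sync_s:.1f}s, asynchronous {async_s:.1f}s")
```

In this model the asynchronous time barely moves as the round trip grows, while the synchronous time scales with it, which is why synchronous hardware replication tends to carry distance limitations.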

Three Tips for Protecting Virtual Infrastructure

  • Flexible Infrastructure - If you already have a disaster recovery solution in place, make sure that it is flexible enough to be used for physical and virtual server platforms. Also, make sure that you aren’t locked into a vendor solution that is hardware-specific and can only talk to other devices like itself. Data center managers maintain a variety of hardware and require workload flexibility to maintain solutions for every environment. Find a solution that is hardware agnostic and will fit the existing infrastructure design.
  • Offsite Storage - Most companies already have a co-location facility in place or, at the very least, a satellite office that can be used to receive data offsite for disaster recovery. Whether it is a duplicate virtual infrastructure in the same data center, across the street or in another country, having an up-to-date virtual workload backup provides the option for recovery when needed.
  • WAN Infrastructure - How the data is transmitted will be important when protecting your virtual infrastructure. Some synchronous hardware solutions use block-based replication, which requires more bandwidth and can have distance limitations. Some of the virtual products on the market use a snapshot-based technology that sends data in periodic chunks (versus only the changes made to specific files) and can also saturate existing bandwidth during the transfer process. WAN accelerators from companies such as Riverbed and Silver Peak can help with bandwidth limitations.

In summary, no matter which virtualization product you select, make sure you think about protecting those servers on top of just converting to a virtual environment. Selecting a flexible data backup and recovery solution will not only help provide high availability but it can also help data center managers better maintain these systems by having the ability to provision, convert and move the systems near or far. This provides more options for deploying virtual environments, managing them on a daily basis and enabling a better backup and recovery strategy.

Monday, July 19, 2010

The Unlikely Disaster

Recently a small earthquake rattled the Virginia, DC and Maryland area. While there was no serious damage, the quake brought to light the fact that disaster can strike in the most unlikely way. Those who live in California know the danger and destruction an earthquake can bring, but folks in DC don’t share this understanding because earthquakes in the area are few and far between. The moral of this event is that we must always be ready for the unlikely disaster.

Without a doubt, each geographic location has its own idea of which natural disasters may occur: earthquakes in the West, blizzards in the Midwest, tornadoes in the Plains, and hurricanes in the East. But the reality is that Mother Nature is a woman with the prerogative to change her mind at will, and being prepared for one type of disaster may not leave you prepared for another. Even small natural disasters can impact large regional areas.

Business Continuity follows the same precept: planning for a certain type of outage does not guarantee that you are protected from other types of outages as well. Power, cellular, telecom and internet outages can wreak havoc on your business. So while you may have the ability to move your business to a CoLo facility, if it is down as well, your BC plan is of no value.

Unfortunately, many BC & DR solutions on the market today don’t meet customer needs. Instead, they lure customers into a false sense of security that their business is protected. So while working on your BC plan, be sure to look at the big picture and think about the small, medium and large disasters, as well as the unlikely ones that can destroy your business.

Good planning!

Monday, June 28, 2010

Business Continuity for the iPad and iPhone

With the releases of the iPad and iPhone 4, businesses are being put in an interesting position when it comes to Business Continuity. For the past few years the iPhone has been the up-and-coming player for enterprise email. While it has not surpassed the BlackBerry, its adoption rate is very high. Now the iPad enters the fight and adds a new dynamic to your BC planning.

Exchange is the Linchpin 

For most “I” users, access to email is the single most important feature of their device of choice. Because most organizations use Exchange, supporting “I” devices is relatively easy with ActiveSync. But as more users become dependent on remote email access, providing BC for Exchange becomes even more critical. With users in the office it is easy to see the effect of an Exchange outage. But mobile users are unseen and unknown, accessing their inboxes 24 x 7 x 365. Installing Windows updates at 3:00am on a Sunday could easily result in calls to the helpdesk asking when Exchange will be back online. Having published maintenance windows and SLAs is critical to ensure everyone in your company has a clear understanding of what to expect with regard to your servers.

Intranet and Application Access

With the iPad, accessing your company’s intranet and applications is much easier than ever before, and as such, uptime for these services becomes a new or updated headache. Providing proven and consistent availability for your servers is crucial to keeping the business productive and users happy.

24 x 7 x 365!

While remote access is not a new technology, both the iPad and the iPhone 4 make it much simpler for whole new groups of users to access a variety of services. Gone are the days of the required VPN. Users can easily access files and data from anywhere in the world at any time. Providing Business Continuity has just become much more complex. Time to update your BC plan!

Monday, June 14, 2010

Hyper-V R2 – It’s Not Just for Servers!

For the past 15 years I have found interesting ways to test and demo different software and operating systems. In my early engineering days I had a spare hard drive that I would swap into my computer and boot from, which served as my test platform. As technology progressed I had a spare computer for testing that I would wipe out and reinstall on a regular basis. With the advent of virtualization I started to run a hybrid of both physical and virtual machines to meet my needs. But in the past 5 years, virtual technology has completely dominated my demos and tests.

Virtual Server 2005

When Microsoft released Virtual Server 2005, most technology pundits dismissed it as a futile attempt to go after the VMware production virtual market. But some of us embraced the technology for what it was: a simple, easy, lightweight demo and test platform. Instead of having to install a heavy application that stole precious resources from Windows XP and Vista, Virtual Server was a simple web-based app that took advantage of integration with the host OS. Although it didn’t have all the features of VMware Workstation, it was very stable and effective for giving product demos on a laptop.

Hyper-V RTM

When Hyper-V was released for Windows Server 2008, it became apparent that Microsoft was taking enterprise virtualization seriously. Despite its limited feature set, Hyper-V was another good solution for product demos. Its biggest flaw for this type of use was getting Server 2008 installed on a laptop. Driver incompatibility and non-existent drivers plagued many laptop owners who wanted to walk the razor’s edge. Although it wasn’t impossible to run Hyper-V on a laptop, it was definitely a daunting task.

Windows 7 & Hyper-V R2

Today Microsoft has a veritable panacea available to engineers for test and demo: Windows 7 & Server 2008 R2. If you own a laptop from a major manufacturer that runs Windows 7 x64, then the chances are good that you can download Windows 7 drivers that are WHQL certified. Most of these drivers will work in Server 2008 R2, and if your processor supports virtualization then you can also run Hyper-V R2. While running a server OS on a laptop is not a common occurrence for most folks, it has become an increasing trend in the “engineer” space.

Robust Demo and Test Built-in

With Windows 7 and Server 2008 R2 sharing a common architecture and kernel, running 2008 R2 as my main laptop OS is a true pleasure. Dual cores and 4GB of RAM provide plenty of horsepower to run all my applications, as well as supporting 3 VMs concurrently. With virtualization built right into the OS, demos are again simple and convenient.

Yes, there are trade-offs that must be taken into consideration when running Server 2008 as your primary OS.  Cost, hardware and application compatibility, backup, and support are just a few.  But for some of us, these trade-offs are well worth having robust, built-in virtualization.

Running Windows Server 2008 R2 with Hyper-V on your laptop may be a worthwhile venture if you find yourself doing lots of product demos or tests.  Hyper-V R2 – It’s Not Just for Servers!