Disaster recovery is a vital part of any backup strategy, but sometimes it's not clear how it differs from your everyday backups. A Microsoft survey discovered most organizations experience 4 or more disruptions each year with an average cost of $1.5 million an hour. To fight the high cost of downtime, 43% of IT professionals are planning to invest in or improve their business continuity with cloud-based disaster recovery, citing reduced costs and expanded coverage as their primary reasons, according to IDG.
With disaster recovery (DR) taking such a high priority in the IT world right now, we asked our resident expert Josh Larsen, Sales Engineer, to answer some of the most common DR questions.
I have been with Green House Data for nearly 8 years. In that time I’ve built some of our first virtualization, private cloud, and public cloud platforms, and implemented many of the DR solutions we use on a daily basis.
Today I fill a sales engineering role, so I work with our customers to help put together a DR solution that meets their specific requirements.
Backup is a traditional model that I think most people are familiar with. It’s when you have a tape drive or attached storage in your server room and you’re performing a backup on a weekly or maybe a full plus schedule. So you’re getting an archive of your data, with all these retention points, so you have daily, monthly, weekly, maybe yearly copies of your data.
You then have the ability to go back and find a specific file or folder and restore that from the archive back to your server.
Disaster recovery gives you a smaller range of recovery points – usually going back just hours – but it’ll allow you to recover a larger range of that data, whether it’s an entire datastore from your SAN or an entire server, and you can recover that to a separate facility. You can do this very quickly, too.
In most environments, we recommend people have both a backup and a DR solution. It gives you a lot of control over what you want to do. If someone deletes a spreadsheet, or it becomes corrupt, you can restore that file to the user’s folder, for instance, but with DR you protect your entire facility from a disaster, such as a fire or an unrecoverably corrupt SAN. This is also considered a business continuity solution, where you can continue your daily operations from another facility.
We’re starting to see a convergence between backup and DR, where people are using a DR solution to replicate all of their data offsite, with the ability now added by DR providers to pull back individual files as well. I think this’ll be a game changer down the road.
They can be one and the same. It goes back to how you traditionally think about things. Smaller businesses will have backups on a regular basis, you know, at 11:00 at night when their users aren’t accessing files. That’s going to allow them to restore data from that single point in time. If a user happens to lose data right before that night’s backup, they’re going to have to go back to the previous backup.
What replication is allowing us to do now is to perform these recoveries synchronously or near-synchronously. It means only changed data is moved. It gives you more flexibility as to how you want to recover your data.
While your window might be smaller—not weeks, months, or days back—you have more checkpoints, like 5, 10, or 15 minutes ago. So a lot of applications are starting to leverage replication and deduplication.
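To make the difference in recovery points concrete, here is a minimal Python sketch that compares how much data is lost under a nightly backup versus 15-minute replication checkpoints. The dates, the failure time, and the function names are illustrative assumptions, not taken from any particular product.

```python
from datetime import datetime, timedelta


def last_recovery_point(failure: datetime, first_point: datetime,
                        interval: timedelta) -> datetime:
    """Return the most recent recovery point taken at or before the failure."""
    completed_intervals = (failure - first_point) // interval
    return first_point + completed_intervals * interval


def data_lost(failure: datetime, first_point: datetime,
              interval: timedelta) -> timedelta:
    """Time span of changes lost when rolling back to the last recovery point."""
    return failure - last_recovery_point(failure, first_point, interval)


# Assumed scenario: protection starts at 11:00 p.m., and the failure
# happens at 10:40 p.m. the next day, just before that night's backup.
first = datetime(2024, 1, 1, 23, 0)
failure = datetime(2024, 1, 2, 22, 40)

nightly_loss = data_lost(failure, first, timedelta(days=1))
replica_loss = data_lost(failure, first, timedelta(minutes=15))

print("Nightly backup loses:", nightly_loss)
print("15-minute checkpoints lose:", replica_loss)
```

In this scenario the nightly schedule rolls back almost a full day of work, while the checkpointed replica rolls back only a few minutes.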
These are two of the most important considerations when putting together a DR solution. RTO is recovery time objective, or how quickly you need your infrastructure to be ready again in a new location. For instance, this is my critical ERP system so I need that to be back up within an hour. Or, I don’t need my payroll system until the end of the pay period, so I can afford to leave that down for a few days.
Recovery Point Objective is how much data can be lost between the source and the target within that replication. That’s usually the length of time it takes for the data to get from the source to the target. So if it’s a website, and you don’t push regular changes to your site, you might have a higher RPO. But if it’s a highly transactional database, you want to have a really low RPO, so as much of the data as possible will be replicated on to the target environment.
Most of the solutions we use allow us to set a different RPO and RTO for each system. The other thing we can do is set boot orders for our systems. If you’re looking at DR, you don’t just have Application A on this server and Application B on this other server; you’re going to have a set of things. One might be the backend web server for another web service, so you need them to come up together. We set a boot order in this example so the backend comes online first and when the application comes back up, it can immediately connect back to the database.
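One way to picture a boot order is as a dependency graph that gets sorted so every system starts after the things it relies on. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The system names and the dependency map are hypothetical, and a real DR product would express this through its own configuration rather than code like this.

```python
from graphlib import TopologicalSorter

# Hypothetical systems: each key maps to the set of systems it depends on.
dependencies = {
    "database": set(),
    "backend-api": {"database"},
    "web-frontend": {"backend-api"},
    "payroll": {"database"},
}


def boot_order(deps: dict) -> list:
    """Return a start-up sequence in which dependencies always come first."""
    return list(TopologicalSorter(deps).static_order())


order = boot_order(dependencies)
print(" -> ".join(order))
```

The database always comes up first here, so the backend can connect the moment it boots. Systems with no dependency on each other (the payroll system and the web tier in this example) may appear in either relative order.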
We can also set priorities at the replication level. So with payroll vs. ERP as an example again, you only have so much bandwidth to work with. We configure a kind of “SLA” for that replication, so we make sure each machine is replicating within certain parameters and recovery times, and specify the priority for others after that.
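As a rough illustration of prioritizing replication on a limited link, the sketch below grants bandwidth to the most critical system first and gives whatever is left to the next one. The `ReplicationJob` class, the bandwidth figures, and the allocation rule are all assumptions for the example; actual products apply their own policies.

```python
from dataclasses import dataclass


@dataclass
class ReplicationJob:
    name: str
    priority: int          # 1 = most critical (assumed convention)
    required_mbps: float   # bandwidth needed to hold this system's target


def allocate(jobs: list, link_mbps: float) -> dict:
    """Grant bandwidth in priority order until the link is used up."""
    remaining = link_mbps
    plan = {}
    for job in sorted(jobs, key=lambda j: j.priority):
        granted = min(job.required_mbps, remaining)
        plan[job.name] = granted
        remaining -= granted
    return plan


jobs = [
    ReplicationJob("payroll", priority=2, required_mbps=40.0),
    ReplicationJob("erp", priority=1, required_mbps=80.0),
]
plan = allocate(jobs, link_mbps=100.0)
print(plan)
```

With a 100 Mbps link, the ERP system gets everything it asks for and payroll replicates with the remainder, which matches the idea that payroll can tolerate a longer recovery window.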
The first thing I would ask is, “Are your offsite backups good enough?” If you’re wondering if you’re going to be able to restore from backups quickly enough, the answer is probably no. Do you have an environment to even operate out of? Say there was a fire at your primary site. Do you have a location to restore your backups to? If the answer is once again no, what we’ll do is set up a cloud environment, public or private, or on a large scale maybe even collocated floor space, and then we pair that with a DR solution to add to that existing backup.
Of course the biggest difference is the physical gear in play. At a large scale, you can do DR to a collocated target, but that’s expensive up-front. You’re making a capital and operational expense to set up that gear and operate it only until you need it.
With a cloud target, you don’t need those resources running until you need them, which keeps the month-to-month operational expense very low. We can handle both physical and virtual machines at the source.
Most new DR customers we see now just moved to virtual environments and they’re looking to take advantage of some of those virtual-specific benefits, like taking an entire image-level copy of a VM, or replicating a back-end datastore.
Hot site basically means the target site is always up and running, sometimes called an active-active connection, where you’re almost not even really failing over between sites so much as one site goes away and the other site continues to work.
A warm site has preconfigured infrastructure at the destination, but it may not be up and running, which means there is some failover transition, a slightly longer RTO.
A cold site is not prepared at the target. For some people that’s empty data center floor space without equipment; lately we’re seeing this term used for things like “workplace recovery”, where you have space for your employees to operate out of as well as your infrastructure.
In a perfect world, most people would want a hot site as there is little or no downtime. The problem is it can be cost prohibitive. We have to set up high-availability within the application level or platforms in use, like a SQL cluster, or an Exchange availability group. That can be expensive from a licensing or architectural standpoint, but also the maintenance adds up. You’re essentially maintaining two environments on a daily basis, the source and the target. That means updates, monitoring, the whole thing. Whereas with a warm or cold site you only need to constantly maintain your active infrastructure and your changes are automatically being replicated over. People are ultimately going to identify the pieces of infrastructure that are most vital and set them up as hot, warm, or cold based on priority.
A lot of the DR solutions do include features that allow monitoring for outages. Nothing too fancy, very similar to your current monitoring, except instead of sending an e-mail notification it will go ahead and trigger the failover process. We don’t use this often because there are many false positives. It doesn’t consider issues outside of server availability.
If you have a facility fire, you can’t proactively fail over to the target facility; those servers have to actually burn and shut down first. Or maybe you have a malware outbreak, which doesn’t trigger unavailability, and all of a sudden, the malware has replicated over to your failover site and you can’t even recover. False positives could even include something as simple as taking your server down for maintenance and having the software prematurely fail over.
When we’re talking about propagating changes between source and target, we can do some additional things. If you have similar or the same vendors for networking gear, or you’re failing over from a virtual machine to the cloud, you can replicate your firewall settings. Many people don’t consider their configuration settings—that data might replicate, but the target environment may not have the same rules, so the changes I just made to this VM might make it unavailable within a DR target environment.
This really gets down to our philosophy on DR, which is not just about recovery. It’s a complete plan: who does this, what does this, and when. That’s why we like to avoid automation where it makes more sense to have human involvement.
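The idea of keeping a person in the loop can be sketched as a simple gate: monitoring may recommend a failover, but nothing starts without an operator's approval. The class, the three-check threshold, and the function names below are hypothetical and only illustrate the principle.

```python
from dataclasses import dataclass


@dataclass
class OutageSignal:
    consecutive_failed_checks: int
    in_maintenance_window: bool


def recommend_failover(signal: OutageSignal, failure_threshold: int = 3) -> bool:
    """Monitoring only recommends; planned maintenance is never an outage."""
    if signal.in_maintenance_window:
        return False
    return signal.consecutive_failed_checks >= failure_threshold


def start_failover(signal: OutageSignal, operator_approved: bool,
                   failure_threshold: int = 3) -> bool:
    """Fail over only when monitoring recommends it AND a person approves."""
    return recommend_failover(signal, failure_threshold) and operator_approved


maintenance = OutageSignal(consecutive_failed_checks=5, in_maintenance_window=True)
real_outage = OutageSignal(consecutive_failed_checks=5, in_maintenance_window=False)

print(start_failover(maintenance, operator_approved=True))   # maintenance is ignored
print(start_failover(real_outage, operator_approved=False))  # waits for a person
print(start_failover(real_outage, operator_approved=True))   # proceeds
```

A gate like this does nothing about cases monitoring cannot see, such as malware that leaves servers available, which is why the human decision stays part of the plan.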
This ties into backup vs. replication. I’ve seen this a lot in previous jobs where companies can’t make a full backup of their data. You don’t have enough bandwidth or time to make a full copy of every PDF, document, and e-mail at Law Firm X, for example. Replication and deduplication remove a lot of that pain. The software will locate only what has changed since the last backup and replicate just those changes.
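To show what "replicate just those changes" can look like, here is a minimal sketch that splits data into fixed-size blocks, hashes each block, and reports only the blocks that differ from the previous copy. This is one simple technique chosen for illustration; a given DR product may detect changes quite differently, and the tiny block size and sample data are assumptions for the demo.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, just to keep the example readable


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Hash each fixed-size block of the data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Return indexes of blocks in `new` that are modified or newly added."""
    old_hashes = block_hashes(old, block_size)
    new_hashes = block_hashes(new, block_size)
    return [
        idx for idx, digest in enumerate(new_hashes)
        if idx >= len(old_hashes) or old_hashes[idx] != digest
    ]


previous_copy = b"AAAABBBBCCCC"
current_data = b"AAAAXXXXCCCCDDDD"

to_send = changed_blocks(previous_copy, current_data)
print("Blocks to replicate:", to_send)
```

Only the modified second block and the newly appended fourth block need to cross the network; the two unchanged blocks stay put. The sketch does not handle data that shrinks, which a real tool would also need to track.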
The other thing we can do is pre-seed data, so a company might ship us a hard drive that has a lot of their source data. We can import that, then top it off with deduplication, so we avoid having to copy those initial files via the network and we can start only copying over changed items.
On the recovery side, backup means you need to restore the data from the backup device to your systems, so it has to actually be moved twice. With smart replication, if you can recover your original systems, you only need to move any changed items back.
The first line of verification is going to be monitoring. Our monitoring is highly detailed. Tying into the RPO and RTO, we can configure alerts that might say, for example, “The RTO has dropped beyond five minutes,” meaning I’m not getting the recoverability window that I need. Or maybe I started replicating too many machines, and I get an alert that the target environment isn’t configured to be large enough to stand them all up. These proactive alerts let us look to see if anything is actually wrong with how we designed the replication. It could be bandwidth that just isn’t allowing data to move quickly enough.
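The two kinds of alerts described above can be sketched as simple threshold checks: one on how far replication has fallen behind its target window, and one on whether the target environment is large enough for everything being protected. The function name, the five-minute window, and the memory figures are assumptions for illustration only.

```python
from datetime import timedelta


def replication_alerts(lag: timedelta, target_window: timedelta,
                       vm_memory_gb: dict, target_memory_gb: float) -> list:
    """Return human-readable alerts for lag and target capacity problems."""
    alerts = []
    if lag > target_window:
        alerts.append(
            f"Replication is {lag} behind, beyond the {target_window} window"
        )
    needed = sum(vm_memory_gb.values())
    if needed > target_memory_gb:
        alerts.append(
            f"Target too small: protected VMs need {needed} GB, "
            f"target has {target_memory_gb} GB"
        )
    return alerts


# Hypothetical protected machines and their memory footprints.
protected = {"erp": 64, "payroll": 16, "web": 32}

healthy = replication_alerts(timedelta(minutes=2), timedelta(minutes=5),
                             protected, target_memory_gb=128)
unhealthy = replication_alerts(timedelta(minutes=8), timedelta(minutes=5),
                               protected, target_memory_gb=96)

for alert in unhealthy:
    print(alert)
```

Either alert is a prompt to look at the design, for instance whether bandwidth is the bottleneck or whether the target needs to grow.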
As with any data recovery scenario, testing is going to be your number one best practice. During the discovery phase, we work to set up testing windows so the failover can be tested every month or six months. I wouldn’t go any longer than that.
To check to make sure your data actually came back up, in most cases, customers will be able to log into their infrastructure when it is hosted at our location. Even if the source was at their office, they have web-based access or something they can use to check replication on our end. With public or private cloud, they have full access, and they can actually watch the failover in process, with VMs getting powered on, and then test access to each.
Encryption is absolutely possible. Most DR products now include end-to-end encryption, and many also include WAN acceleration. We can also use an IPsec VPN tunnel. Encrypting storage at rest is often an option, depending on the solution.