October 3rd, 2005, 03:32 AM
your experience on full DR testing
I have never been involved in a FULL disaster recovery test, which involves a lot of parties, so if you have been involved in one before, can you share your experiences?
Some questions on my mind:
1) How long does it take on average to plan such BIG tests... are we talking years?
2) For 24x7 operations, like banks, how does one carry out such tests? I believe the switchover mechanism must be very transparent and fast for the users involved.
3) What other experiences have you had? Thanks.
October 3rd, 2005, 10:06 AM
A fact of life is that you are not permitted to stop production or trading.
So what do you mean by "FULL", and what do you mean by "disaster"?
In reality you do it with a "sample" or "scale model"; somewhat similar to a "proof of concept" exercise in a development environment.
October 3rd, 2005, 10:24 AM
As nihil said, you use samples or scale models. The only problem we ever came across when testing was a hardware hack of the crypto drives (ATMs), and even then it was local to that drive and machine. That was sample testing, anyway.
OK, maybe I should edit this post: when I said ATMs, I meant Automated Teller Machines, not Asynchronous Transfer Mode.
October 3rd, 2005, 11:01 PM
Re: your experience on full DR testing
I like parties!
Originally posted here by ghostmachine
i have never been involved in a FULL disaster recovery test which involves a lot of parties ...
Oh, not that kind.
OK, FULL DRs are almost never completely tested. However, if your org has a warm or hot site, you can come pretty close by running tests that bring the warm or hot site up with live data, then take it back down. You can also test DRs in increments using dry runs and scripts.
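Those incremental dry runs lend themselves to scripting. A minimal sketch in Python (the step names and check functions below are purely hypothetical, not anyone's actual setup):

```python
from typing import Callable

def run_dr_checklist(steps: list[tuple[str, Callable[[], bool]]]) -> dict[str, bool]:
    """Run each DR test step in order and record pass/fail.

    A check that crashes is recorded as a failure rather than aborting,
    so one broken step doesn't hide the results of the rest of the run.
    """
    results: dict[str, bool] = {}
    for name, check in steps:
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results

# Hypothetical steps; real checks would ping the warm site, mount the
# restored volumes, query the application, and so on.
steps = [
    ("warm site reachable", lambda: True),
    ("restore completed", lambda: True),
    ("application responds", lambda: 1 / 0),  # a crashing check counts as a failure
]
```

The point of recording every result instead of stopping at the first failure is that a DR dry run is a survey, not a gate: you want the full list of what's broken before the next test window.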
If your DR plan is well defined and complete, you can plan and execute tests every six or twelve months. If you are starting from scratch with DR, you may be a couple of years out before you are ready for any testing.
Of course, start with backup and restore testing. Validate your backups first. Make sure you can successfully restore. Yeah, your backup system does a nice job, and even verifies at the end of each job. Have you actually taken a volume backup and run a restore just to see if it restores?
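That restore check can be as simple as backing up a directory, restoring it somewhere else, and comparing checksums. A toy sketch (local tar.gz archives and paths are illustrative stand-ins for whatever your backup software produces):

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path

def sha256(path: Path) -> str:
    """Hash a file so source and restored copies can be compared."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_restore(source: Path, archive: Path, restore_dir: Path) -> bool:
    """Back up `source` to `archive`, restore it, and compare checksums.

    Passing proves this backup set actually restores -- the test the
    thread recommends, as opposed to trusting the job's auto-verify.
    """
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=source.name)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(restore_dir)
    restored_root = restore_dir / source.name
    for original in source.rglob("*"):
        if original.is_file():
            restored = restored_root / original.relative_to(source)
            if not restored.is_file() or sha256(original) != sha256(restored):
                return False
    return True
```

Note this only proves the round trip on the machine that wrote the backup; a real test would also restore from the offsite media onto different hardware.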
Wanna know how good most of the backup system "verifies" are?
The stats ain't pretty.
October 3rd, 2005, 11:13 PM
Good points Rapier~
You need to do some stress testing as well (volume handling)... sort of like "unit testing" in the software development model?
October 3rd, 2005, 11:48 PM
I have a hotsite and we do it pretty much the same way that rapier57 indicated.
We do a full restore to our backup servers, which takes hours. Because of that, we try to do it often so we only have to restore files that were modified, added, or deleted since the date of our last test. We do it every couple of months, sometimes every three months, sometimes every six.
I've had VERY good experience with our current backup software and have NEVER (to my knowledge) had a bad backup set. So, my stats would offset those that rapier57 speaks of. I test them all the time by restoring to test servers on my production network.
We can't do a full DR test because you have to keep your production network up at all times. However, since we have a DR site, we have identical setups so we can come damn close. Sometimes we'll even change over the routing so our DR site becomes the live site.
We don't plan for it much, as there are only two of us. It goes more along the lines of "Hey, want to get out of the office on Friday? Let's go to our hotsite..."
October 4th, 2005, 03:55 PM
You are actually doing "best practices," phish, and that's good. The auto verify in a backup package doesn't generally give a good validation of the backup. The only real validation is to run a restore and check that it worked. 'Course, all this had to be learnt the hard way.
School of Hard Knocks is rough.
October 5th, 2005, 11:22 AM
Well, I don't post much around here, but this grabbed my attention.
For real DR planning and testing, the scope is huge. If your IT shop has a mainframe, AS/400, Unix/Linux systems, and Windoze systems, the complexity can be overwhelming. (It is for me.)
We have always had a pretty fair DR test for our mainframe and AS/400 systems. We contract with IBM Business Recovery Services, and use their Boulder, Colorado facility for testing annually. I'm not a mainframe or AS/400 guy, but it seems to me that these folks have the easy part. They can basically take their system backup tapes with them to Boulder, and within about 6 hours, they have live systems.
The networking guys get to have fun during the test too. We have manufacturing sites scattered around the US. Each is connected to our data center through ATM, Frame, or DSL/VPN connections. For each annual DR test, the corporate office picks a couple of locations to participate. While the mainframe and AS/400 systems are being brought online, the WAN links for the selected remote sites are switched over to the IBM site. (WAN connectivity is part of our contract with them.) This allows the remote sites to test their access to the restored mainframe and AS/400 environments.
The real fun is in the open systems side of the house. We have several issues to overcome. First, we use ghost images to back up the OS part of a server. This grabs most of the application installs and the OS configuration. This is great, if you have identical hardware to blow that ghost image back down to. It is highly unlikely, though, that IBM would be able to provide identical hardware for us at the BRS site. We use IBM Tivoli Storage Manager to back up the data from our Windoze servers. Once the mainframe is back online in Boulder, we have access to these data backups. Here again, we need to have systems to restore the data to. We have also had to address the issue of internet connectivity, since we have several smaller sites that connect to the data center using VPN over DSL or cable modem. Add to that our internet presence, as parts of it are critical to business continuity.
The solution we finally landed on involves a new data facility in our city. We have leased space in this underground facility and we have gigabit speed connectivity to it. We have moved most of our test systems over there, and we perform nightly production data copies to the site. This location has been configured with a connection to the internet, and our WAN can be transparently switched there as well. We have a couple of production AD domain controllers there and a laptop configured as a domain controller. The laptop gets hand carried to the Boulder facility to provide DNS/WINS/DHCP and AD authentication there. The test systems are quickly reconfigured to act as production systems. Granted, our test systems are not as powerful as the production systems, and we don't have as many of them, but it was decided that the company could live with reduced performance, if it meant having access to these systems at all. We have actually tested most of our mission critical systems in this environment. The non critical systems are handled on an as needed basis.
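A toy sketch of that nightly copy: checksum-based mirroring between two local directories standing in for the production and DR trees (the real setup presumably runs over the gigabit link with the backup software, not a script like this):

```python
import hashlib
import shutil
from pathlib import Path

def digest(path: Path) -> str:
    """Content hash used to decide whether a file needs recopying."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def nightly_copy(prod: Path, dr: Path) -> int:
    """Mirror production data into the DR tree, copying only files that
    are missing there or whose contents differ. Returns how many files
    were copied, so a quiet night shows up as zero.
    """
    copied = 0
    for src in prod.rglob("*"):
        if not src.is_file():
            continue
        dest = dr / src.relative_to(prod)
        if not dest.exists() or digest(dest) != digest(src):
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)  # copy2 preserves timestamps
            copied += 1
    return copied
```

Running it twice in a row with no changes should copy nothing the second time, which is also a cheap nightly sanity check that the DR copy is in sync.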
The other issue to consider in a disaster is ... people. Where are we most of the time? In the data center. Where will we most likely be in the event of a disaster? In the data center. Will we be injured, or will our families be injured during the disaster? Probably. All I can say about this is: Document your systems to the max! We have tried to document everything to the point that a manager or a person from a different area of expertise could bring the disaster systems up. (This is the really difficult part to test)
Well, this got really long, really fast. It doesn't cover everything, and covering everything is the real issue with disaster recovery.
I wonder if there will be snow in Boulder this November?