In part two of our disaster recovery testing series, we explore the critical steps of executing a DR test. From coordinating teams and documenting issues to maintaining communication channels during the test, this episode covers everything you need to know about running an effective DR test.
Host W. Curtis Preston and co-host Prasanna Malaiyandi share practical advice from their extensive experience with disaster recovery testing. They discuss the importance of having backup communication methods, maintaining detailed documentation, and conducting thorough post-test analysis. Learn why testing your DR plan regularly is crucial and how to build a recovery mindset across your organization.
Whether you're planning your first DR test or looking to improve your existing testing procedures, this episode provides valuable insights to help ensure your disaster recovery testing success.
You found The Backup Wrap-up, your go-to podcast for all things
backup, recovery, and cyber recovery.
In our final episode about DR testing, the rubber meets the road.
Last time we talked about getting ready for your DR test, and this time we're
talking about actually running the test.
We'll cover what you need to do during the test, like coordinating between
teams, documenting what goes wrong, because something always goes wrong,
and making sure that you've got backup communication methods ready.
By the way, if you don't know who I am, I'm W. Curtis Preston, AKA Mr.
Backup, and I've been passionate about backup and recovery for over 30 years,
ever since I had to tell my boss that
we had no backups of that really important production database that we had just lost.
I don't want that to happen to you, and that's why I do this.
On this podcast, we turn unappreciated backup admins into Cyber Recovery Heroes.
This is The Backup Wrap-up.
Welcome to the show.
Hi, I am W. Curtis Preston, AKA Mr.
Backup, and if you could just take a couple of seconds to
either like or subscribe or,
uh, follow the channel so that you can, uh, always get our great content.
That would be awesome.
I am once again joined by a guy who has finally put some of his car
knowledge to use, Prasanna Malaiyandi.
I'm doing well Curtis, and yes, I am finally putting some of that car knowledge
to use, uh, for viewers who may not, or listeners who may not be aware.
I tend to watch a lot of car YouTube stuff, um, a lot of it
tends to be around fabrication, engine rebuilding, drag racing.
It's a really odd mix, but a lot of it is just YouTube knowledge.
And so I finally decided to try something different, and I've been taking auto
shop classes at my local community college, which has been amazing.
And so as part of it, you actually have a hands-on lab section where you get to
actually work on cars like your own car.
And right now it's all basic stuff, right?
So oil changes, underhood inspections, inspecting cooling systems.
But we actually gotta do things like charging tests, uh, compression tests,
leak down tests, replacing spark plugs.
So excited.
I'm actually using these hands for things.
And you did an oil change yesterday on your wife's car.
I did do an oil change on my wife's car.
Yep.
How filthy was the oil in your wife's car?
Yeah, it looked almost brand new.
Um, it didn't have many miles since the last oil change.
I'd probably say five or 600 miles, but it was a sacrifice since I needed to
actually do an oil change for the class.
So.
Right, right now, this isn't the first time you've done an oil change, right?
no.
Okay.
I've done one in the past, but this is the first time I've done it on a lift,
which oh my God, is so much nicer.
Oh, you brought, you brought her car into the class.
class.
And we put it
I see, I see.
and I.
Yeah.
Yeah.
Everything's nicer on a lift.
Absolutely.
When you're not like struggling underneath the car, trying not to drop the hot oil
on you, and you're actually able to get a large, like container underneath, like the
drum was probably like three feet wide.
Right, right.
Yeah.
You had one of those that you can wheel around, right?
Yep.
So significantly easier.
And then I was just thinking, I was like, do I have room in
my garage for a two post lift?
Even a short one, but no.
Trust me, I have thought about it back when I was doing a
lot more work on my cars.
I definitely looked into it and I was like, okay, I don't, I
can't spend that kind of money.
So let's talk about something that we actually know a little bit
about. Uh, so two weeks ago, for those of you that
follow the show, uh,
we did DR testing part one, and then,
um, we aired, uh, speaking of DR
testing, a great episode from 2021, which was the best DR testing story ever, right?
Yep.
Oh yeah.
The scariest, I would say.
Yeah.
Where a guy, for reasons that he goes into in the show, essentially purposefully
destroys his production environment, not just for DR testing, but as a
matter of how everything happened, he ends up testing his DR system
and it does work, but oh my God.
There was, there was a, there was a quote in there that said something like, he had
a long weekend that lasted like five days.
Yeah.
Yeah.
So, yeah.
things it's like, and well, and the other challenge is he was up in Alaska,
Yeah,
If he needed to get parts or other things like
right.
luck.
Yeah, exactly
if I'm about to do like a house repair or something else like that,
it's like, you know not to do it on a Saturday or a Sunday or a Friday,
right.
if you have to call someone or you need to pick up something
and you don't do it at night.
Yeah, definitely.
Definitely don't do it at night.
Right.
Yeah.
Um, yeah, so that's a great episode if you didn't listen to that episode.
That is a great episode.
Um, and, uh, uh, yeah, listen to that.
So two weeks ago, we talked essentially about getting ready to do the DR test,
preparing for it, setting the scope for it, agreeing on what's going to be a
success, and then this week we're gonna talk about actually executing the DR test.
And again, this is a DR test.
What would you say is the purpose of a DR test Prasanna?
To make sure that, in the case of an actual disaster,
you're actually able to recover as agreed upon, whatever your agreement was.
Yeah, I, I think that's sort of the general, yeah.
Obviously that's the purpose of a test in general, right,
is to test whether or not you could do it when you need it.
But since most tests fail, I'm going to say that the other purpose, and
perhaps the bigger purpose, is to fix the parts of the DR system that
you discover are broken in some way.
Right?
Um, and, uh, so probably one of the biggest
outcomes of a DR test is to feed back into the DR plan, right?
Yeah.
just in terms of what fails, I know sometimes people are like,
oh, it's just thinking about like, I can't restore the data.
But a lot of times what really fails is the dependencies that you didn't consider.
Right.
Did you make sure you're able to fail over and recover your Active
Directory in your DR site before you can bring your applications online?
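The dependency point here, that Active Directory has to be up before the applications that rely on it, can be sketched as an ordering problem. This is a minimal illustration, not anything the hosts describe building: the service names and the dependency map are hypothetical, and it uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) to produce a recovery order in which every dependency comes first.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be up before it.
DEPENDS_ON = {
    "network": [],
    "dns": ["network"],
    "active_directory": ["network", "dns"],
    "database": ["active_directory"],
    "app_server": ["database", "active_directory"],
}

def recovery_order(depends_on):
    """Return services in an order where every dependency comes first."""
    # static_order() raises graphlib.CycleError if the map contains a cycle.
    return list(TopologicalSorter(depends_on).static_order())

order = recovery_order(DEPENDS_ON)
print(order)
```

Writing the map down at all is probably the more useful part: a dependency nobody listed is exactly the kind of gap a DR test tends to expose.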
You, you know, um, I'm glad you brought that up because I aired
another classic episode about the actual disaster recovery on an island.
And, uh, again, well, it's with the islands, right?
Because Alaska was Kodiak Island.
Um, but this was in a Caribbean island.
And they do an actual DR, you know, an actual recovery because there
was a hurricane that took it out.
And one of those dependencies that you talked about was the lack of internet
Yeah.
and, uh, lack, the lack of power, the lack of internet.
These are all things that we come to expect on a normal, everyday basis, which in
an actual disaster is, is not,
Yep.
not that, right.
Yeah.
And we also had that other episode.
Do you remember maybe, I don't know if you want to air that or not.
The derecho one.
That's right, the one that talked about the derecho.
I'm gonna have to, I have to go find that one.
'cause it's not titled the Derecho episode.
It was, um.
I'll have to find it. If I can find that, I'll rebroadcast it in keeping
with the disaster recovery theme.
I'll, I'll definitely see if I can find that when I br
'cause that was also very good.
I didn't even know what a derecho was.
Derecho is a land hurricane.
Uh, a hurricane that forms over land.
I don't know why it's called derecho, but that is what it is.
Right.
Yep.
Uh, to me that just means right, you know, to the right in Spanish.
But, you know, it is what it is.
Um, so, if we're executing the DR test,
we've agreed on what we're gonna test.
We've agreed on what the success criteria are.
It's time to actually start walking through the test
we're going to have.
And also we created an environment that we're going to test in.
We're not doing what our friend from Alaska did.
I was just thinking, are you just gonna go around like the TV shows
when they get hit with an attack and they're just like, unplug,
unplug the cables, unplug the cables.
Yeah, don't do that.
We have some sort of test environment.
Generally speaking, these days, it's generally gonna be the
cloud, and we're going to start executing the, um, this test.
Can you think of, uh, and one of the things, again, this is more of
a setup thing, but one of the things you wanna make sure of is to allocate
enough time, uh, for this
process to unfold in its natural, um, evolution.
I would say time.
And then also make sure you have the resources right.
And I'm not, I don't mean compute resources, but people because.
Right?
Make sure that people are available, right?
yeah.
Um.
Don't do this at, like, uh, quarter end, because people may
be firefighting other things.
Yeah.
Yeah.
The company that I, the, the bank, we did it on a weekend.
Um, but it was a dedicated, you know, a, a dedicated weekend where we're
going to do the DR test, and we did that because again, you're, you're
making all these resources available for the DR test, which means they're not
available to do their day job, and their day job would happen during the week.
So we chose to do it on the weekend and.
I'd say the bigger, the bigger you're going,
this isn't coming out in English.
Uh, the bigger the test, the bigger the need to prepare and,
you know, to make sure you have those resources and to not do it when the
normal production stuff is going on.
It requires buy-in from the business, communication, right?
All these
Yeah.
right?
Yeah.
Make sure you communicate to all the powers that be that you are doing
a DR test, especially if you're gonna do any kind of failover.
Um,
it too, right?
Because you want this to be done on an ongoing basis.
right,
to convince 'em upfront, Hey, here's why it's valuable, such that when you go back
and after the results, right, you're like, Hey, we now need to do another DR test.
Maybe six months down the line,
right.
already bought it.
Another thing as, as we're going through the DR test, we're documenting
what went right, what went wrong, especially what went wrong.
Right.
Um, go ahead.
so
this is an interesting thing 'cause when we had Mike on the podcast, right, and
he was talking about sort of doing these tabletop exercises, right?
I think it's important the person documenting kind of needs to
take an objective perspective.
Mm-hmm.
Right, because you may be showing some biases or the person documenting
may not want to document certain things, or may just sort of dismiss
it as, Hey, this isn't important,
Right,
Versus actually capturing what happened throughout the process.
right.
Agreed.
Um, the next thing is, and we covered this, uh, in the previous
episode, but, you know, we
talked about testing little parts
of the infrastructure, but once we've tested this piece
and this piece and this piece,
I do think it's important to test, you know, you look at the scenario: what
would this scenario do to our company?
Right?
The scenario is a disaster.
The scenario is a fire, a
terrorist action, um, and it's gonna take out all of this infrastructure.
What would that do to us?
So for example, you might not need to test your ability to recover from a SaaS outage
if you have a data center and your data center goes out, right?
What you're gonna test is gonna be scenario dependent,
but, um, what would be the impact to
our business and our ability to use the different parts of our infrastructure?
And so speaking of dependencies, if we don't have internet of any kind, it
is, it is kind of a SaaS outage, right?
Right.
Um, so, uh, we're gonna, we want to test as many of those parts
of our infrastructure that are going to be impacted by the
scenario that we're testing, right.
Yeah, and sometimes it's a little bit about.
consequences or identifying gaps.
It's like when you're writing code, right?
You normally do unit tests,
Um.
but then when you actually test the end-to-end functionality, you're
like, oh, I didn't realize that this interacts with this other thing
this way, and things don't work.
That's why we also do end-to-end testing in addition to unit tests.
Yeah.
And again, this is why, way back in the beginning, I
was saying that the purpose of the DR test is to identify these gaps, right?
The, yeah.
I mean,
you know, we can have that perfect test that goes well and that's great and
everybody feels better, but it's just as valuable to have the DR test that
had a big hole or a small hole and, um, you know,
to document that and address that.
And this is why we do it on a regular basis.
I have a question for you, Curtis.
Yeah.
Do you think DR testing,
so most organizations have a risk management team,
Mm-hmm.
right?
Which usually has a lot of this information in terms of, okay,
what are the business risks and everything else like that.
But they're also probably the ones who are coordinating across the business
in order to say, okay, let's do a test.
Mm-hmm.
Right, where the infrastructure DR testing that we're talking about here
is probably one portion of that overall
Mm-hmm.
Mm-hmm.
Do you think that's fair?
Yeah, I think that's fair.
And you know,
I think that if we're doing a real DR test,
this is a business test as much as it is a technology test, right?
Yeah.
There is this, that overlap between business continuity planning
and disaster recovery planning.
And maybe for a DR test, we're not concerned so much with, um,
uh, like if it's just a DR test, we're not concerned with, let's say,
uh, buildings and places for people to work and things like that.
We're concerned more with getting the technology back up and running.
But I, I'm glad you brought that up.
That is a, a separate aspect that does need to be taken into account.
Well, and the benefit with this is if there's already a team that is looking
at that business continuity aspect,
Mm-hmm.
You may not need to convince the business as much, right?
In order to be
Right.
right, you should partner with people who already, like that is their job,
Agreed.
Agreed.
them.
Agreed.
We talked about documenting things that we discover here.
I, I think that we should be maintaining like a log of, you
know, all of the tests and the things that we've learned from them.
Because again, that may be helpful for, uh, you know, for
future generations of tests.
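A running log like the one described here could be as simple as timestamped entries that record each step and its outcome as it happens, rather than a summary written from memory afterward. The sketch below is only one possible shape: the class names, field names, and sample entries are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LogEntry:
    step: str
    outcome: str   # "ok" or "failed"
    note: str = ""
    # Timestamp is captured when the entry is created, in UTC.
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class DrTestLog:
    name: str
    entries: list = field(default_factory=list)

    def record(self, step, outcome, note=""):
        """Append what happened, whether it went well or not."""
        self.entries.append(LogEntry(step, outcome, note))

    def failures(self):
        """Everything that did not go as planned, in the order it happened."""
        return [e for e in self.entries if e.outcome != "ok"]

# Hypothetical entries for illustration only.
log = DrTestLog("weekend failover test")
log.record("restore database", "ok")
log.record("start application", "failed", "could not reach directory service")
for entry in log.failures():
    print(entry.at.isoformat(), entry.step, "-", entry.note)
```

Recording the successes alongside the failures keeps the log closer to an objective account of the test, and gives later tests something to compare against.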
You know, it's important to have a DR
runbook, and
one of the purposes of the test is to update that runbook.
So let's just talk about that.
Um, the thing about having a DR
runbook: I do believe in having an electronic copy of the DR
runbook, uh, but also have the ability to easily update
a paper copy of that runbook.
So the way to do that is to have some sort of documentation system
online that you can easily update.
Um, and then if you want to have a paper copy and you want to have a paper copy,
then um, the best way to have that is a, is a loose leaf type notebook system right
where you can update pages of it, where you don't have to update the entire book.
I have a comment about the electronic copy.
Sure.
I would recommend also keeping a copy out of your normal corporate infrastructure.
Agreed.
Right, right.
in case, say you get hit with ransomware and you no longer have access to that
infrastructure, or someone deletes your account that hosted that data, right?
So make sure it's something completely disconnected as well.
A copy just in case.
And I go back to think about the Pixar story, right, where they just happened to
be lucky with Toy Story 2 and have a copy offsite, offline, to save the movie.
Exactly.
Um, yeah, I, I, I think obviously we, we have to keep security in mind.
We have to make sure that what, wherever that system, wherever that other
copy is, it's protected by security.
But the whole point of it is to have it outside the normal security.
So, uh, there, there's a, there's a, um, a balance that you need to have there.
Right.
Um, what about communications during the DR tests?
Um, we need to keep everyone
abreast of what's going on.
You wanna talk about that a little bit?
Yeah, so you wanna make sure people aren't working in silos and because during a
DR test things are gonna be chaotic.
but since this is more of a controlled environment, you want to establish
those patterns and say, this is a normal way that we communicate.
It might be via phones, it might be emails.
You might jump into a video conference, right?
Whatever it is that you use, make sure that you have all the right
stakeholders in that session.
Right,
in order.
So, so then everyone knows what's going on.
The other thing though, uh, to mention is make sure you also have
alternate methods, just like what we talked about with the runbook itself.
right.
In case your voice-over-IP phones are down in your corporate environment
or your chat, Slack, is down, or whatever else you're using,
Right.
Make sure you have an alternate mechanism to get in touch with people.
Yeah.
That's a real challenge.
Um, I mean, it is a challenge to have communication during it.
Smoke signals.
What'd you say?
Smoke signals.
Smoke signals. It's a real challenge because we depend so much on technology,
and I would say that, um,
again, if it's an outage, generally the outage is for you
and not for everything else.
So for example, if you're relying on Zoom, um, as your mechanism, Zoom will
probably be up when you have your outage.
You just need to make sure that everybody can get to Zoom.
So, um, for example, your challenge there
will be if you are using, um,
you know, a third-party authentication mechanism to get into Zoom
and then you don't have access to that, that could be a problem.
So these are the things you wanna make sure, you wanna be able to make sure that
you can communicate during the outage.
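One way to reason about backup communication methods is to write down what each channel itself depends on, then ask which channels survive a given outage. The sketch below is illustrative only: the channel names, the dependency names, and the simplification that a personal phone tree depends on nothing are all assumptions, not something from the episode.

```python
# Hypothetical channels in priority order, each with the systems it relies on.
CHANNELS = [
    ("slack", {"corporate_sso", "internet"}),
    ("zoom", {"corporate_sso", "internet"}),
    ("zoom_guest_link", {"internet"}),
    # Simplification: treated as having no dependencies at all.
    ("personal_cell_phone_tree", set()),
]

def usable_channels(channels, down):
    """Return channel names, in priority order, whose dependencies are all up."""
    return [name for name, needs in channels if not (needs & down)]

# If the third-party authentication system is unavailable:
print(usable_channels(CHANNELS, down={"corporate_sso"}))
# ['zoom_guest_link', 'personal_cell_phone_tree']
```

Running this kind of check against the test scenario ahead of time shows whether the planned war-room channel would actually be reachable, or whether it quietly shares a dependency with the thing being failed over.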
Um, and I can definitely think of a, you know, of a multi-headed zoom call where
everybody's just sort of keeping everybody abreast of what's going on, right.
Um, and we wanna make sure that the stakeholders are aware of everything
that's going on, as well as the people
that are executing the test.
Um, and then, by the way, the Zoom call,
or something like a Zoom call, I think is the best way to have
coordination between the teams if there are multiple teams.
You don't necessarily have to have everybody who's
doing something with the DR
test be on the Zoom call, but the purpose of the Zoom call, I
think, is probably to keep all of the different teams aware
of what the other teams are doing.
Right?
It's almost like a war room, if you will.
Right?
exactly.
The big, again, the bigger the test, the bigger it is, the bigger
the need is to have, uh, some type of communication like this.
Right?
Uh, and then you've also got escalation procedures.
What happens if something doesn't go right?
Who do we call?
Um, yeah.
You could throw a monkey wrench in things and be like, someone who
normally is part of the DR test,
Right?
Or would be responsible for something.
You could be like, that person is home sick with the flu
and cannot be in the office.
Now what do you do?
Yeah.
Um, yeah.
If your DR
test says, you know, call Steve,
um, you know, the more you have something like that,
the bigger a problem that kind of thing is gonna be, right?
Funny you bring this up.
So I was reading The Register this morning
Mm-hmm.
there was a call in or a write in from a, a reader,
Mm-hmm.
and they were saying that they had worked at a company
in IT and managed a bunch of infrastructure, and they had built
this system to automate all of their, uh, software deployment stuff.
Mm-hmm.
Um, and then they had quit the company, but no one knew how to
operate it. In the closet was a machine, and
he had left his number.
It said, do not reboot, call Steve or whatever his name was.
And he got the call and this was like 20 years later,
Wow.
he got a call and he was like, I don't remember the password.
I'm sorry.
You gotta figure it out on your own.
Wow.
That's crazy.
That's just crazy.
So call Steve.
That's funny.
Um, yeah, don't, don't be like that.
Um, so, uh, let's just say we get to the end of the test, right?
We've successfully recovered all of the, all of the aspects
if we're doing a full DR test.
What needs to happen is a full sort of end-to-end functional test of the
different parts of the business to make sure that not just that the, that a
system was recovered or a database was recovered, but the application and the
system around that application is able to function.
And again, this is why we go into things like phone systems, right?
Yeah.
Um, you know, if, if the, the application that we're recovering is our customer
call center, um, but we don't have phones, uh, great, uh, you know, all of that
stuff, all of that stuff has to work.
And you've got to do the functional end-to-end test to make sure that all
the parts that you are pretending
were damaged are now fully functional.
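The distinction here, between "the system was restored" and "the business function works," can be expressed as a set of functional checks that all have to pass before a recovery counts as done. In this sketch the check names are hypothetical and the checks are stand-in lambdas; real ones would exercise the recovered application, phones, and so on.

```python
def run_functional_checks(checks):
    """Run each named check; a check that raises counts as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results

def recovery_complete(results):
    """Done only if at least one check ran and every check passed."""
    return bool(results) and all(results.values())

# Hypothetical call-center example: the data is back, but the phones are not.
results = run_functional_checks({
    "database_restored": lambda: True,
    "agent_can_open_customer_record": lambda: True,
    "inbound_call_reaches_agent": lambda: False,
})
print(results)
print("recovery complete:", recovery_complete(results))
```

The same structure works for a narrowly scoped test: the list of checks just covers whatever was agreed to be in scope.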
I agree with that, but I think it's also one of the things, you have to
be careful not to boil the ocean.
Yes.
Yeah, yeah.
Well, again, this is about,
Yeah.
what's that?
a balance.
Well, what I'm saying is, whatever it is, I think what you're
talking about is more about scope,
Yes.
Because
even if we just agreed to test this one part of the application, you
need to do a functional test of whatever it is that you recovered,
even if it's just a small part of the environment.
That's all I'm saying.
Yeah.
Right.
That we often focus a little bit too much time on the
recovery, the restore, and we say, okay, the application's restored.
I can walk away.
No, the application's restored
when people can do whatever
it is that application was supposed to do.
I was thinking more about, be careful about thinking
about all the failure scenarios.
Like I was saying, the person gets sick with the flu, right.
Oh, yeah, yeah, yeah.
Yeah.
Be careful about going down that rabbit hole because you will never come back
out, because it might be, what if the butterfly flaps its wings halfway around
the world and causes X, Y, Z, right?
So,
The butterfly will die.
right.
So don't get overwhelmed by these scenarios is
Yeah.
And speaking of not being overwhelmed, when we get to the,
uh, the post-game analysis, right?
Let's measure against the success criteria that we agreed to.
Um, we, we look at the things that didn't work and the
bottlenecks and things like that.
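Measuring against the agreed success criteria can be made mechanical if the criteria were written down as numbers beforehand. The sketch below assumes the criteria are per-system recovery-time targets, which is an assumption for illustration; the system names and durations are hypothetical.

```python
from datetime import timedelta

def evaluate(targets, actuals):
    """Compare measured recovery times against agreed targets, per system."""
    report = {}
    for system, target in targets.items():
        actual = actuals.get(system)  # None means it was never recovered
        met = actual is not None and actual <= target
        report[system] = {"target": target, "actual": actual, "met": met}
    return report

# Hypothetical targets agreed before the test, and what was measured.
targets = {
    "call_center_app": timedelta(hours=4),
    "file_shares": timedelta(hours=24),
    "reporting": timedelta(hours=48),
}
actuals = {
    "call_center_app": timedelta(hours=6),
    "file_shares": timedelta(hours=10),
}

for system, row in evaluate(targets, actuals).items():
    print(system, "met" if row["met"] else "MISSED", row["actual"])
```

A missed target in this report is a pointer to a bottleneck worth investigating, not a verdict on the people who ran the test.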
The key, again, here is to better the world, not to prove that
you were the best or whatever.
Um, I know it can be really difficult.
Say that again.
you were the worst.
You were the worst.
Yeah.
Um, you know, we're looking for things that we can improve.
We're looking for procedures that we can update based on, you know,
the lessons that we learned.
Um, any other post-game analysis
that you can think of?
I would also say.
If this is your first time doing this, I think it's also good
to say what things went well.
I think a lot of times we tend to focus on the negatives,
Right.
right?
But if this is your first time, like this is really hard.
This is a hard thing to do.
Yeah.
And you should acknowledge that and realize if you got through, like I
know Curtis, you've always talked about the bank and your DR tests, right?
And how I don't think you guys ever completed a hundred
No.
right?
No.
Yeah, yeah.
So don't be too hard on yourself.
Congratulate yourself first off on doing the test in the first place,
and second, making it to the end of the test, even if everyone is dead.
Um, you know, and then, and then, and then, you know, yeah,
don't be too hard on yourself.
Right.
Uh, because these things rarely go well, uh,
unless it's like fully automated.
And, you know, the, the more I will say, the more you can
automate things, the better.
Right?
Yeah.
So you ran the tests,
Mm-hmm.
things that went well, things that went wrong.
I think the next step after that is
identifying how you close the gaps,
Right.
And coming up with a plan, because you don't want to
just let these things linger,
Right.
create a plan.
Identify what are the most critical elements that you want to address first
Mm-hmm.
timeframes, and make sure you get buy-in across the board to fix those things.
Yeah.
Agreed.
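The gap-closing plan described above (most critical items first, each with a timeframe) maps naturally onto a small prioritized list. This is a sketch with hypothetical field names, owners, and dates; a ticketing system or spreadsheet would serve the same purpose.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    gap: str
    owner: str      # a team or role rather than one named person
    severity: int   # 1 = most critical
    due: date

def prioritized(items):
    """Most critical first; ties broken by the earliest due date."""
    return sorted(items, key=lambda item: (item.severity, item.due))

# Hypothetical gaps found during a test.
items = [
    ActionItem("runbook missing directory failover steps", "infrastructure team", 1, date(2025, 3, 1)),
    ActionItem("no offline copy of runbook", "backup team", 1, date(2025, 2, 15)),
    ActionItem("restore of file shares slower than expected", "storage team", 2, date(2025, 4, 1)),
]

for item in prioritized(items):
    print(item.severity, item.due.isoformat(), item.owner, "-", item.gap)
```

Assigning each item to a team or role, rather than a single individual, avoids recreating the single-point-of-failure problem inside the remediation plan itself.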
Right.
Um, you have an action item list and who's responsible
for addressing the different things, and then of course, what's the next thing?
You do it again.
Right?
Um.
When?
Uh, soon.
Right.
Um, I would say I'm a fan of more frequent, smaller tests
than like an annual huge test.
Right.
Um, I think the more often we do that, the more we get into the
mindset of thinking about the things that can go wrong, because a lot
of things are, you know, um,
the same in different disciplines across
the, uh, the organization, right?
So the more often we test, the more often we get to a recovery mindset and
we start including those things in the system design from the very beginning.
Yeah.
Right?
Um, again, that's the other purpose.
I would add that to
my original question: that's the other purpose of a DR test, is
to get people to a DR mindset,
Yeah.
um, to a recovery mindset of saying, um, we need to design the infrastructure and
the processes around the infrastructure so that they are easy to recover.
Right.
Yep.
Or at least even think about it to start with rather than, oh yeah, this failed.
Now what?
What were you gonna do with our DR?
Yeah.
And, and, and lemme just give you a, a, a silly but simple example of what happens
when you don't have a recovery mindset.
So I go back to the bank, right?
I have so many good stories from the days of the bank, right?
And when we bought a T 1000, which was, uh, an HP server, it
was a really big server, um, it was a huge server.
It had a hundred gigabytes of data.
Ginormous, wait.
Let me go grab my flash drive.
It was a huge server for the time, and it came with a two gigabyte tape drive.
Right.
I think with compression it was like a four gigabyte tape drive
that, that was a system design.
And there, there were no changes.
No, we, we, we added 30%.
With one server, we added 30% to the capacity of the
data center with one server.
There wasn't a single discussion about what we should do from a
backup and recovery perspective.
That's what happens when you don't have a recovery mindset,
Yeah.
right?
Is that you, you do things, you add things to the system without any thought
to what they would, you know, how that would impact the recovery system.
So that's why we want to have a recovery mindset.
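The mismatch in that story is easy to quantify with the numbers given: 100 GB of data against a 2 GB tape drive (about 4 GB with compression). The sketch below only counts tapes for one full backup; it says nothing about how long that would take, since no throughput figure is given.

```python
import math

def tapes_needed(data_gb, tape_gb):
    """Number of tapes required to hold one full backup."""
    return math.ceil(data_gb / tape_gb)

print(tapes_needed(100, 2))  # 50 tapes at native capacity
print(tapes_needed(100, 4))  # 25 tapes if 2:1 compression holds
```

Even in the best case, that is 25 tape swaps for a single full backup of one server, which is the kind of question a recovery mindset would raise before the purchase rather than after.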
Yep.
Okey dokey.
I think we covered everything.
Yeah.
I think so, yeah, everything you could possibly want to know about
DR in, uh, four episodes with the two, the two, maybe five.
We'll see.
We'll see if I can find that other episode.
Thanks Prasanna for, uh, you know, once again, uh, you know, great team.
Woo hoo.
Go team.
Go.
Go team. And, uh, I want to thank our listeners once again.
We'd be nothing without you.
That is a wrap.