Disaster Recovery Testing: The Execution Phase

In part two of our disaster recovery testing series, we explore the critical steps of executing a DR test. From coordinating teams and documenting issues to maintaining communication channels during the test, this episode covers everything you need to know about running an effective DR test.
Host W. Curtis Preston and co-host Prasanna Malaiyandi share practical advice from their extensive experience with disaster recovery testing. They discuss the importance of having backup communication methods, maintaining detailed documentation, and conducting thorough post-test analysis. Learn why testing your DR plan regularly is crucial and how to build a recovery mindset across your organization.
Whether you're planning your first DR test or looking to improve your existing testing procedures, this episode provides valuable insights to help ensure your disaster recovery testing success.
You found The Backup Wrap-Up, your go-to podcast for all things
Speaker:
backup, recovery, and cyber recovery.
Speaker:
In our final episode about DR testing, the rubber meets the road.
Speaker:
Last time we talked about getting ready for your DR test, and this time we're
Speaker:
talking about actually running the test.
Speaker:
We'll cover what you need to do during the test, like coordinating between
Speaker:
teams, documenting what goes wrong, because something always goes wrong,
Speaker:
and making sure that you've got backup communication methods ready.
Speaker:
By the way, if you don't know who I am, I'm W. Curtis Preston, AKA Mr.
Speaker:
Backup, and I've been passionate about backup and recovery for over 30 years,
Speaker:
ever since I had to tell my boss
Speaker:
we had no backups of that really important production database that we had just lost.
Speaker:
I don't want that to happen to you, and that's why I do this.
Speaker:
On this podcast, we turn unappreciated backup admins into Cyber Recovery Heroes.
Speaker:
This is The Backup Wrap-Up.
Speaker:
Welcome to the show.
Speaker:
Hi, I am W. Curtis Preston, AKA Mr.
Speaker:
Backup, and if you could just take a couple of seconds to
Speaker:
either like or subscribe or,
Speaker:
uh, follow the channel so that you can, uh, always get our great content.
Speaker:
That would be awesome.
Speaker:
I am once again joined by a guy who has finally put some of his car
Speaker:
knowledge to use, Prasanna Malaiyandi.
Speaker:
I'm doing well Curtis, and yes, I am finally putting some of that car knowledge
Speaker:
to use, uh, for viewers who may not, or listeners who may not be aware.
Speaker:
I tend to watch a lot of car YouTube stuff, um, a lot of it
Speaker:
tends to be around fabrication, engine rebuilding, drag racing.
Speaker:
It's a really odd mix, but a lot of it is just YouTube knowledge.
Speaker:
And so I finally decided to try something different, and I've been taking auto
Speaker:
shop classes at my local community college, which has been amazing.
Speaker:
And so as part of it, you actually have a hands-on lab section where you get to
Speaker:
actually work on cars like your own car.
Speaker:
And right now it's all basic stuff, right?
Speaker:
So oil changes, underhood inspections, inspecting cooling systems.
Speaker:
But we actually gotta do things like charging tests, uh, compression tests,
Speaker:
leak down tests, replacing spark plugs.
Speaker:
So excited.
Speaker:
I'm actually using these hands for things.
Speaker:
And you did a, you did an oil change yesterday on your wife's car,
Speaker:
I did do an oil change on my wife's car.
Speaker:
Yep.
Speaker:
How filthy was the oil in your wife's car?
Speaker:
Yeah, it looked almost brand new.
Speaker:
Um, it didn't have many miles since the last oil change.
Speaker:
I'd probably say five or 600 miles, but it was a sacrifice since I needed to
Speaker:
actually do an oil change for the class.
Speaker:
So.
Speaker:
Right, right. Now, this isn't the first time you've done an oil change, right?
Speaker:
No.
Speaker:
Okay.
Speaker:
I've done one in the past, but this is the first time I've done it on a lift,
Speaker:
which oh my God, is so much nicer.
Speaker:
Oh, you brought, you brought her car into the class.
Speaker:
class.
Speaker:
And we put it
Speaker:
I see, I see.
Speaker:
and I.
Speaker:
Yeah.
Speaker:
Yeah.
Speaker:
Everything's nicer on a lift.
Speaker:
Absolutely.
Speaker:
When you're not like struggling underneath the car, trying not to drop the hot oil
Speaker:
on you, and you're actually able to get a large, like container underneath, like the
Speaker:
drum was probably like three feet wide.
Speaker:
Right, right.
Speaker:
Yeah.
Speaker:
You had one of those that you can wheel around, right?
Speaker:
Yep.
Speaker:
So significantly easier.
Speaker:
And then I was just thinking, I was like, do I have room in
Speaker:
my garage for a two post lift?
Speaker:
Even a short one, but no.
Speaker:
Trust me, I have thought about it back when I was doing a
Speaker:
lot more work on my cars.
Speaker:
I definitely looked into it and I was like, okay, I don't, I
Speaker:
can't spend that kind of money.
Speaker:
So let's talk about something that we actually know a little bit of something
Speaker:
about, uh, so last, so two weeks ago, for those of you that follow the,
Speaker:
uh, episode, or for those of you that follow the show, uh, two weeks ago
Speaker:
we did DR testing part one, and then.
Speaker:
Um, we aired, uh, a great, speaking of DR
Speaker:
testing, a great episode from 2021, which was the best DR testing story ever, right?
Speaker:
Yep.
Speaker:
Oh yeah.
Speaker:
The scariest, I would say.
Speaker:
Yeah.
Speaker:
Where a guy for reasons that he goes into in the show, he essentially purposefully
Speaker:
destroys his production environment, not just for DR testing, but as a,
Speaker:
as a matter of how everything happened, he ends up testing his DR system
Speaker:
and it, it does work, but oh my God.
Speaker:
There was, there was a, there was a quote in there that said something like, he had
Speaker:
a long weekend that lasted like five days.
Speaker:
Yeah.
Speaker:
Yeah.
Speaker:
So, yeah.
Speaker:
things it's like, and well, and the other challenge is he was up in Alaska,
Speaker:
Yeah,
Speaker:
If he needed to get parts or other things like
Speaker:
right.
Speaker:
luck.
Speaker:
Yeah, exactly
Speaker:
if I'm about to do like a house repair or something else like that,
Speaker:
it's like, you know not to do it on a Saturday or a Sunday or a Friday,
Speaker:
right.
Speaker:
if you have to call someone or you need to pick up something
Speaker:
and you don't do it at night.
Speaker:
Yeah, definitely.
Speaker:
Definitely don't do it at night.
Speaker:
Right.
Speaker:
Yeah.
Speaker:
Um, yeah, so that's a great episode if you didn't listen to that episode.
Speaker:
That is a great episode.
Speaker:
Um, and, uh, uh, yeah, listen to that.
Speaker:
So this one, the, the.
Speaker:
Two weeks ago, we talked essentially about getting ready to do the DR test,
Speaker:
preparing for it, setting the scope for it, agreeing on what's going to be a
Speaker:
success, and then this week we're gonna talk about actually executing the DR test.
Speaker:
And again, this is a DR test.
Speaker:
What would you say is the purpose of a DR test Prasanna?
Speaker:
To make sure that you're actually in the case of an actual disaster,
Speaker:
you're able to recover as agreed upon whatever your agreement was.
Speaker:
Yeah, I, I think that's sort of the general, yeah.
Speaker:
Obviously that's the purpose of a test in general, right.
Speaker:
Is to, is to, is to.
Speaker:
To test whether or not you could do it when you, when you need it.
Speaker:
But since most tests fail, I'm going to say that the other purpose and
Speaker:
perhaps the bigger purpose is to fix the parts of your, of the DR system that
Speaker:
you discover are broken in some way.
Speaker:
Right?
Speaker:
Um, and, uh, so the, the probably one of the biggest
Speaker:
outcomes of a DR test is to feed back into the DR plan, right?
Speaker:
Yeah.
Speaker:
just in terms of what fails, I know sometimes people are like,
Speaker:
oh, it's just thinking about like, I can't restore the data.
Speaker:
But a lot of times what really fails is the dependencies that you didn't consider.
Speaker:
Right.
Speaker:
Did you make sure you're able to fail over and recover your Active
Speaker:
Directory in your DR site before you can bring your applications online?
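
[Show-notes sketch, not from the episode: the dependency point above can be made concrete as an ordering exercise. Assuming a hypothetical set of services and dependencies (the names below are invented for illustration), a topological sort gives one recovery order in which directory services come up before the applications that rely on them. It uses Python's standard `graphlib` module, available in Python 3.9 and later.]

```python
from graphlib import TopologicalSorter

# Hypothetical services, for illustration only. Each key maps to the set of
# services that must already be recovered before the key can be brought up.
dependencies = {
    "active_directory": set(),
    "dns": set(),
    "database": {"active_directory", "dns"},
    "application": {"database", "active_directory"},
}


def recovery_order(deps):
    """Return one valid order in which every service follows its dependencies."""
    # static_order() yields dependencies first and raises graphlib.CycleError
    # if two services depend on each other.
    return list(TopologicalSorter(deps).static_order())


order = recovery_order(dependencies)
print(order)
```

[Writing the map down before the test is arguably the useful part: a dependency nobody listed, such as internet access or power, is the kind of gap a DR test tends to expose.]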
Speaker:
You, you know, um, I'm glad you brought that up because I aired
Speaker:
another classic episode about the actual disaster recovery on an island.
Speaker:
And, uh, again, well, it's with the islands, right?
Speaker:
Because Alaska was Kodiak Island.
Speaker:
Um, but this was in a Caribbean island.
Speaker:
And they do an actual DR, you know, an actual recovery because there
Speaker:
was a hurricane that took it out.
Speaker:
And one of those dependencies that you talked about was the lack of internet
Speaker:
Yeah.
Speaker:
and, uh, lack, the lack of power, the lack of internet.
Speaker:
These are all things that we come to expect on a normal everyday basis, which
Speaker:
In
Speaker:
an actual disaster is, is not,
Speaker:
Yep.
Speaker:
not that right.
Speaker:
Yeah.
Speaker:
And we also had that other episode.
Speaker:
Do you remember maybe, I don't know if you want to air that or not.
Speaker:
The derecho one.
Speaker:
That's right, the one that talked about the derecho.
Speaker:
I'm gonna have to, I have to go find that one.
Speaker:
'cause it's not titled the Derecho episode.
Speaker:
It was, um.
Speaker:
I'll have to find, if I can find that, I'll rebroadcast it in the keeping
Speaker:
of the, the disaster recovery theme.
Speaker:
I'll, I'll definitely see if I can find that when I br
Speaker:
'cause that was also very good.
Speaker:
I didn't even know what a derecho was.
Speaker:
Derecho is a land hurricane.
Speaker:
Uh, a hurricane that forms over land.
Speaker:
I don't know why it's called derecho, but that is what it is.
Speaker:
Right.
Speaker:
Yep.
Speaker:
Uh, to me that just means Right, you know, to the right in Spanish.
Speaker:
But,
Speaker:
you know, it is what it is.
Speaker:
Um, so, uh, so, so we talk about if we're executing the DR test.
Speaker:
Uh, we, we, you know, we, we, we've, we've agreed on what we're gonna test.
Speaker:
We've agreed on what the success criteria are.
Speaker:
It's time to actually start walking through the, the test
Speaker:
we're, we're going to have.
Speaker:
And, and also we created a, an environment that we're going to test in.
Speaker:
We're not doing what our friend from Alaska did.
Speaker:
I, I was just thinking, are you just gonna go around like the TV shows
Speaker:
when they get hit with an attack and they're just like, unplug,
Speaker:
unplug the cables, unplug the cables.
Speaker:
Yeah, don't do that.
Speaker:
We, we have some sort of test environment.
Speaker:
Generally speaking, these days, it's generally gonna be the
Speaker:
cloud and we're going to start executing the, the, um, this test.
Speaker:
Can you think of, uh, and, and one of the things, again, this is more of
Speaker:
a setup thing, but one of the things you wanna make sure is to allocate
Speaker:
enough time, uh, for this, you know,
Speaker:
for this process to unfold in its natural, um, evolution.
Speaker:
I would say time.
Speaker:
And then also make sure you have the resources right.
Speaker:
And I'm not, I don't mean compute resources, but people because.
Speaker:
Right?
Speaker:
Make sure that people are available, right?
Speaker:
yeah.
Speaker:
Um.
Speaker:
don't do this at like, uh, quarter end because people may
Speaker:
be firefighting other things or.
Speaker:
Yeah.
Speaker:
Yeah.
Speaker:
The company that I, the, the bank, we did it on a weekend.
Speaker:
Um, but it was a dedicated, you know, a, a dedicated weekend where we're
Speaker:
going to do the DR test, and we did that because again, you're, you're
Speaker:
making all these resources available for the DR test, which means they're not
Speaker:
available to do their day job, and their day job would happen during the week.
Speaker:
So we chose to do it on the weekend and.
Speaker:
I'd say the bigger, the bigger you're going, the bigger te the bigger,
Speaker:
this isn't coming out in English.
Speaker:
Uh, the bigger the test, the bigger the need to prepare and to, to have, um,
Speaker:
you know, to make sure you have those resources and to not do it when the
Speaker:
normal production stuff is going on.
Speaker:
requires buy-in from the business, communication, right?
Speaker:
All these
Speaker:
Yeah.
Speaker:
right?
Speaker:
Yeah.
Speaker:
Make sure you.
Speaker:
Make sure you communicate to all the powers that be, that you are doing
Speaker:
a DR test, especially if you're gonna do any kind of failover.
Speaker:
Um,
Speaker:
it too, right?
Speaker:
Because you want this to be done on an ongoing basis.
Speaker:
right,
Speaker:
to convince 'em upfront, Hey, here's why it's valuable, such that when you go back
Speaker:
and after the results, right, you're like, Hey, we now need to do another DR test.
Speaker:
Maybe six months down the line,
Speaker:
right.
Speaker:
already bought in.
Speaker:
Another thing as, as we're going through the DR test, we're documenting
Speaker:
what went right, what went wrong, especially what went wrong.
Speaker:
Right.
Speaker:
Um, go ahead.
Speaker:
so
Speaker:
this is an interesting thing 'cause when we had Mike on the podcast, right, and
Speaker:
he was talking about sort of doing these tabletop exercises, right?
Speaker:
I think it's important the person documenting kind of needs to
Speaker:
take an objective perspective.
Speaker:
Mm-hmm.
Speaker:
Right, because you may be showing some biases or the person documenting
Speaker:
may not want to document certain things, or may just sort of dismiss
Speaker:
it as, Hey, this isn't important,
Speaker:
Right,
Speaker:
Versus actually capturing what happened throughout the process.
Speaker:
right.
Speaker:
Agreed.
Speaker:
Um, the next thing is, and, and we covered this, uh, in the previous
Speaker:
episode, but once you've, you know, we talked about testing little parts
Speaker:
of the infrastructure, but once we grow, once we've tested this piece
Speaker:
and this piece and this piece,
Speaker:
I do think it's important to test, you know, you look at the scenario, what
Speaker:
would this scenario do to our company?
Speaker:
Right?
Speaker:
The scenario is a disaster.
Speaker:
The scenario is a fire, a
Speaker:
terrorist action, um, and it's gonna take out all of this infrastructure.
Speaker:
What would that do to us?
Speaker:
So for example, you might not need to test your ability to recover from a SaaS outage
Speaker:
when your, if you have a data center and your data center goes out, right?
Speaker:
It's a, it's gonna be scenario dependent,
Speaker:
what you're gonna test, but, um, you, you might wanna, what would be the impact to
Speaker:
our business and our ability to use the different parts of our infrastructure?
Speaker:
And so speaking of dependencies, if we don't have internet of any kind, it
Speaker:
is, it is kind of a SaaS outage, right?
Speaker:
Right.
Speaker:
Um, so, uh, we're gonna, we want to test as many of those parts
Speaker:
of our infrastructure that are going to be impacted by the
Speaker:
scenario that we're testing, right.
Speaker:
Yeah, and sometimes it's a little bit about
Speaker:
consequences or identifying gaps.
Speaker:
It's like when you're writing code, right?
Speaker:
You normally do unit tests,
Speaker:
Um.
Speaker:
but then when you actually test the end-to-end functionality, you're
Speaker:
like, oh, I didn't realize that this interacts with this other thing
Speaker:
this way, and things don't work.
Speaker:
That's why we also do end-to-end testing in addition to unit tests.
Speaker:
Yeah.
Speaker:
And, and, and again, this is why, way back in the beginning, I
Speaker:
was saying that the purpose of the DR test is to identify these gaps, right?
Speaker:
The, yeah.
Speaker:
I mean, we can have, um,
Speaker:
You know, we can have that perfect test that goes well and that's great and
Speaker:
everybody feels better, but it's just as valuable to find the DR test that
Speaker:
had, that had a big hole or a small hole and, um, you know, the, uh, and, and,
Speaker:
and to document that and address that.
Speaker:
And this is why we do it on a regular basis.
Speaker:
I have a question for you, Curtis.
Speaker:
Yeah.
Speaker:
Do you think DR
Speaker:
testing?
Speaker:
So most organizations have a risk management team,
Speaker:
Mm-hmm.
Speaker:
right?
Speaker:
Which usually has a lot of this information in terms of, okay,
Speaker:
what are the business risks and everything else like that.
Speaker:
But they're also probably the ones who are coordinating across the business
Speaker:
in order to say, okay, let's do a test.
Speaker:
Mm-hmm.
Speaker:
Right, whereas the infrastructure DR testing that we're talking about here
Speaker:
is probably one portion of that overall
Speaker:
Mm-hmm.
Speaker:
Mm-hmm.
Speaker:
Do you think that's fair?
Speaker:
Yeah, I think that's fair.
Speaker:
And you know, this is, we're going to.
Speaker:
I think that if we're doing a, a real DR test, we're going to this.
Speaker:
This is a business test as much as it is a technology test, right?
Speaker:
Yeah.
Speaker:
There is this, that overlap between business continuity planning
Speaker:
and disaster recovery planning.
Speaker:
And maybe for a DR test, we're not concerned so much with, um.
Speaker:
Uh, like if, if it's just a DR test, we're not concerned with, let's say,
Speaker:
uh, uh, buildings and people, places for people to, to work and things like that.
Speaker:
We're concerned more with getting the technology back up and running.
Speaker:
But I, I'm glad you brought that up.
Speaker:
That is a, a separate aspect that does need to be taken into account.
Speaker:
Well, and the benefit with this is if there's already a team that is looking
Speaker:
at that business continuity aspect,
Speaker:
Mm-hmm.
Speaker:
You may not need to convince the business as much, right?
Speaker:
In order to be
Speaker:
Right.
Speaker:
right, you should partner with people who already, like that is their job,
Speaker:
Agreed.
Speaker:
Agreed.
Speaker:
them.
Speaker:
Agreed.
Speaker:
We talked about documenting things that we discover here.
Speaker:
I, I think that we should be maintaining like a log of, you
Speaker:
know, all of the tests and the things that we've learned from them.
Speaker:
Because again, that may be helpful for, uh, you know, for
Speaker:
future generations of tests.
Speaker:
You know, it's important to have a DR
Speaker:
runbook and to, to, to have this, you know, one of the pur
Speaker:
the, one of the purposes of the test is to update that runbook.
Speaker:
So let's just talk about that.
Speaker:
Um, the, the, the thing about having a DR
Speaker:
runbook, I do believe in having an electronic copy of the DR
Speaker:
runbook, uh, but, uh, also have the ability to easily update
Speaker:
a paper copy of that runbook.
Speaker:
So the way to do that is to have some sort of documentation system
Speaker:
online that you can easily update.
Speaker:
Um, and then if you want to have a paper copy and you want to have a paper copy,
Speaker:
then, um, the best way to have that is a, is a loose-leaf type notebook system, right,
Speaker:
where you can update pages of it, where you don't have to update the entire book.
Speaker:
I have a comment about the electronic copy.
Speaker:
Sure.
Speaker:
I would recommend also keeping a copy out of your normal corporate infrastructure.
Speaker:
Agreed.
Speaker:
Right, right.
Speaker:
in case, say you get hit with ransomware and you no longer have access to that
Speaker:
infrastructure, or someone deletes your account that hosted that data, right?
Speaker:
So make sure it's something completely disconnected as well.
Speaker:
A copy just in case.
Speaker:
And I go back to think about the Pixar story, right, where they just happen to
Speaker:
be lucky with Toy Story 2 and have a copy offsite offline to save the movie.
Speaker:
Exactly.
Speaker:
Um, yeah, I, I, I think obviously we, we have to keep security in mind.
Speaker:
We have to make sure that what, wherever that system, wherever that other
Speaker:
copy is, it's protected by security.
Speaker:
But the whole point of it is to have it outside the normal security.
Speaker:
So, uh, there, there's a, there's a, um, a balance that you need to have there.
Speaker:
Right.
Speaker:
Um, what about communications during the DR tests?
Speaker:
Um, we need to keep everyone
Speaker:
abreast of what's going on.
Speaker:
You wanna talk about that a little bit?
Speaker:
Yeah, so you wanna make sure people aren't working in silos and because during a
Speaker:
DR test things are gonna be chaotic.
Speaker:
but since this is more of a controlled environment, you want to establish
Speaker:
those patterns and say, this is a normal way that we communicate.
Speaker:
It might be via phones, it might be emails.
Speaker:
You might jump into a video conference, right?
Speaker:
Whatever it is that you use, make sure that you have all the right
Speaker:
stakeholders in that session.
Speaker:
Right,
Speaker:
in order.
Speaker:
So, so then everyone knows what's going on.
Speaker:
The other thing though, uh, to mention is make sure you also have
Speaker:
alternate methods, just like what we talked about with the runbook itself.
Speaker:
Make
Speaker:
right.
Speaker:
in case your voice-over-IP phones are down in your corporate environment
Speaker:
or your chat, Slack, is down, or whatever else you're using,
Speaker:
Right.
Speaker:
Make sure you have an alternate mechanism to get in touch with people.
Speaker:
Yeah.
Speaker:
That's a real challenge.
Speaker:
Um, I mean, it, it is
Speaker:
Smoke
Speaker:
to have communication during it.
Speaker:
What'd you say?
Speaker:
signals.
Speaker:
Smoke signals. It's, it's a real challenge because we depend so much on technology
Speaker:
and I would say that that, um.
Speaker:
Again, if it's an outage, generally the outage is for you
Speaker:
and not for everything else.
Speaker:
So for example, if you're relying on Zoom, um, as your mechanism, Zoom will
Speaker:
probably be up when you have your outage.
Speaker:
You just need to make sure that everybody can get to Zoom.
Speaker:
So, um, for example, your, your challenge there
Speaker:
will be if you are using, um,
Speaker:
you know, a, a, a third-party authentication mechanism to get into Zoom
Speaker:
and then you don't have access to that, that could be, that could be a problem.
Speaker:
So these are the things you wanna make sure, you wanna be able to make sure that
Speaker:
you can communicate during the outage.
Speaker:
Um, and I can definitely think of a, you know, of a multi-headed Zoom call where
Speaker:
everybody's just sort of keeping everybody abreast of what's going on, right.
Speaker:
Um, and we wanna make sure that the stakeholders are aware of everything
Speaker:
that's going on, as well as the people that are executing the, um,
Speaker:
that are ex executing the test.
Speaker:
Um, and then what about, um, I, I, I think, by the way, the Zoom call,
Speaker:
I think is the best way to have, or something like a Zoom call to have
Speaker:
coordination between the teams if there are multiple teams that are happening.
Speaker:
You don't necessarily have to have everybody who's
Speaker:
doing something with the DR,
Speaker:
uh,
Speaker:
to, to be on the Zoom call, but the purpose of the Zoom call, I
Speaker:
think is probably to keep, keep all of the different teams aware
Speaker:
of what the other teams are doing.
Speaker:
Right?
Speaker:
It's almost like a war room, if you will.
Speaker:
Right?
Speaker:
exactly.
Speaker:
The big, again, the bigger the test, the bigger it is, the bigger
Speaker:
the need is to have, uh, some type of communication like this.
Speaker:
Right?
Speaker:
Uh, and then you've also got escalation procedures.
Speaker:
What happens if something doesn't go right?
Speaker:
Who do we call?
Speaker:
Um, yeah.
Speaker:
you could throw a monkey wrench in things and be like, someone's about to
Speaker:
do a, normally is part of the DR test.
Speaker:
Right?
Speaker:
Or would be responsible for something.
Speaker:
You could be like, that person is home sick with the flu
Speaker:
and cannot be in the office.
Speaker:
Now what do you do?
Speaker:
Yeah.
Speaker:
Um, yeah.
Speaker:
If your DR
Speaker:
test says, you know, call Steve.
Speaker:
Um, this, this is the, the, you know, the more you have something like that,
Speaker:
the bigger that, that, that kind of thing is gonna be a problem, right?
Speaker:
you bring this up.
Speaker:
So I was reading The Register this morning
Speaker:
Mm-hmm.
Speaker:
there was a call in or a write in from a, a reader,
Speaker:
Mm-hmm.
Speaker:
and they were saying that they had worked at a company
Speaker:
in IT and managed a bunch of infrastructure and they had built
Speaker:
this system to automate all of their, uh, software deployment stuff.
Speaker:
Mm-hmm.
Speaker:
Um, and then they had quit the company, but no one knew how to
Speaker:
operate it, and he had left his number, it's in the closet, was a machine.
Speaker:
He had left his number.
Speaker:
It said, do not reboot, call Steve or whatever his name was.
Speaker:
And he got the call and this was like 20 years later,
Speaker:
Wow.
Speaker:
he got a call and he was like, I don't remember the password.
Speaker:
I'm sorry.
Speaker:
You gotta figure it out on your own.
Speaker:
Wow.
Speaker:
That's crazy.
Speaker:
That's just crazy.
Speaker:
So call Steve.
Speaker:
That's funny.
Speaker:
Um, yeah, don't, don't be like that.
Speaker:
Um, so, uh, let's just say we get to the end of the test, right?
Speaker:
We've successfully recovered all of the, all of the aspects
Speaker:
if we're doing a full DR test.
Speaker:
What needs to happen is a full sort of end-to-end functional test of the
Speaker:
different parts of the business to make sure that not just that the, that a
Speaker:
system was recovered or a database was recovered, but the application and the.
Speaker:
The, the system around that application is able to function.
Speaker:
And again, this is why we go into things like phone systems, right?
Speaker:
Yeah.
Speaker:
Um, you know, if, if the, the application that we're recovering is our customer
Speaker:
call center, um, but we don't have phones, uh, great, uh, you know, all of that
Speaker:
stuff, all of that stuff has to work.
Speaker:
And you've got to do the functional end-to-end test to make sure that all
Speaker:
the parts that you are pretending
Speaker:
are, you know, were damaged, are now fully functional.
Speaker:
I agree to that, but I think it's also one of the things, you have to
Speaker:
be careful not to boil the ocean.
Speaker:
Yes.
Speaker:
Yeah, yeah.
Speaker:
Well, again, this is about,
Speaker:
Yeah.
Speaker:
what's that?
Speaker:
a balance.
Speaker:
Well, what I'm saying is, whatever it is, this is, I, I think what you're
Speaker:
talking about is, is more about scope,
Speaker:
Yes.
Speaker:
Because.
Speaker:
Even if we just agreed to test this one part of the application, you
Speaker:
need to do a functional test of whatever it is that you recovered.
Speaker:
E even if it's just a small part of the environment.
Speaker:
What I'm that, that's all I'm saying.
Speaker:
Yeah.
Speaker:
Right.
Speaker:
That, that, that we often focus a little bit too much time on the
Speaker:
recovery, the restore, and we say, okay, the application's restored.
Speaker:
I can walk away.
Speaker:
No, the application's restored
Speaker:
when the application is restored, when people can do the thing, whatever
Speaker:
it is, that application was supposed to do.
Speaker:
I was intent, I was thinking more about, be careful about thinking
Speaker:
about all the failure scenarios.
Speaker:
Like I was saying, the person gets sick with the flu, right.
Speaker:
Oh, yeah, yeah, yeah.
Speaker:
Yeah.
Speaker:
about going down that rabbit hole because you will never come back
Speaker:
out because it might be, what if the butterfly flaps its wings halfway around
Speaker:
the world and causes X, Y, Z, right?
Speaker:
So,
Speaker:
The butterfly will die.
Speaker:
right.
Speaker:
So don't get overwhelmed by these scenarios is
Speaker:
Yeah.
Speaker:
And, and speaking of not being overwhelmed when we get to the
Speaker:
post, you know, when we get to the, uh, the post game analysis, right?
Speaker:
Let's measure against the success criteria that we agreed to.
Speaker:
Um, we, we look at the things that didn't work and the
Speaker:
bottlenecks and things like that.
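
[Show-notes sketch, not from the episode: one way to measure a test against agreed success criteria is to record a target and a measured recovery time per system and list the misses. This assumes the criteria were expressed as time targets, which the hosts do not specify; the system names and numbers below are invented for illustration.]

```python
from dataclasses import dataclass


@dataclass
class RecoveryResult:
    system: str
    target_minutes: int  # recovery time agreed before the test (assumed criterion)
    actual_minutes: int  # recovery time measured during the test


def missed_targets(results):
    """Return the results whose measured recovery time exceeded the agreed target."""
    return [r for r in results if r.actual_minutes > r.target_minutes]


# Hypothetical figures, for illustration only.
results = [
    RecoveryResult("active_directory", target_minutes=60, actual_minutes=45),
    RecoveryResult("call_center_app", target_minutes=240, actual_minutes=390),
]

for r in missed_targets(results):
    print(f"{r.system}: {r.actual_minutes - r.target_minutes} min over target")
```

[The misses become the starting point for the gap list and action items, rather than a scorecard for any one team.]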
Speaker:
The key, again here is to better the world, not to prove that
Speaker:
you were the best or whatever.
Speaker:
Um, I know it can be really difficult.
Speaker:
Say that again.
Speaker:
you were the worst.
Speaker:
You were the worst.
Speaker:
Yeah.
Speaker:
Um, you know, we're looking for things that we can improve.
Speaker:
We're looking for procedures that we can update based on, you know,
Speaker:
the lessons that we learned.
Speaker:
Um, any other post-game analysis
Speaker:
that you can think of?
Speaker:
I would also say.
Speaker:
If this is your first time doing this, I think it's also good
Speaker:
to say what things went well.
Speaker:
I think a lot of times we tend to focus on the negatives,
Speaker:
Right.
Speaker:
right?
Speaker:
But if this is your first time, like this is really hard.
Speaker:
This is a hard thing to do.
Speaker:
Yeah.
Speaker:
And you should acknowledge that and realize if you got through, like I
Speaker:
know Curtis, you've always talked about the bank and your DR tests, right?
Speaker:
And how I don't think you guys ever completed a hundred
Speaker:
No.
Speaker:
right?
Speaker:
No.
Speaker:
Yeah, yeah.
Speaker:
So don't be too hard on yourself.
Speaker:
Congratulate yourself first off on doing the test in the first place,
Speaker:
and second, making it to the end of the test, even if everyone is dead.
Speaker:
Um, you know, and then, and then, and then, you know, yeah,
Speaker:
don't be too hard on yourself.
Speaker:
Right.
Speaker:
Uh, because these things, these things rarely do they go well, uh,
Speaker:
unless it's like fully automated.
Speaker:
And, you know, the, the more I will say, the more you can
Speaker:
automate things, the better.
Speaker:
Right?
Speaker:
Yeah.
Speaker:
So you ran the tests,
Speaker:
Mm-hmm.
Speaker:
things that went well, things that went wrong.
Speaker:
I think the next step after that is
Speaker:
identifying how you close the gaps,
Speaker:
Right.
Speaker:
And coming up with a plan, because you don't want to
Speaker:
just let these things linger,
Speaker:
Right.
Speaker:
create a plan.
Speaker:
Identify what are the most critical elements that you want to address first
Speaker:
Mm-hmm.
Speaker:
timeframes, and make sure you get buy-in across the board to fix those things.
Speaker:
Yeah.
Speaker:
Agreed.
Speaker:
Right.
Speaker:
Um, you, you, you have a, you have an action item list and who's responsible
Speaker:
for addressing the different things, and then of course, what's the next thing?
Speaker:
You do it again.
Speaker:
Right?
Speaker:
Um.
Speaker:
When,
Speaker:
Uh, soon.
Speaker:
Right.
Speaker:
Um, I would say I'm a fan of more frequent, smaller tests
Speaker:
than like an annual huge test.
Speaker:
Right.
Speaker:
Um, I think the more often we do that, the more we get into the, the
Speaker:
mindset of thinking about the things that can go wrong, because a, a lot
Speaker:
of things are, are, you know, um,
Speaker:
they're the same in different disciplines across the, uh, the,
Speaker:
the, uh, the organization, right?
Speaker:
So the more often we test, the more often we get to a recovery mindset and
Speaker:
we start including those things in the system design from the very beginning.
Speaker:
Yeah.
Speaker:
Right?
Speaker:
Um, again, that's the other purpose.
Speaker:
I would add that to
Speaker:
my original question, that's the other purpose of a DR test, is
Speaker:
to get people to a DR mindset,
Speaker:
Yeah.
Speaker:
um, to a recovery mindset of saying, um, we need to design the infrastructure and
Speaker:
the processes around the infrastructure so that they are easy to recover.
Speaker:
Right.
Speaker:
Yep.
Speaker:
Or at least even think about it to start with rather than, oh yeah, this failed.
Speaker:
Now what?
Speaker:
What were you gonna do with our DR?
Speaker:
Yeah.
Speaker:
And, and, and lemme just give you a, a, a silly but simple example of what happens
Speaker:
when you don't have a recovery mindset.
Speaker:
So I go back to the bank, right?
Speaker:
I have so many good stories from the days of the bank, right?
Speaker:
And when we bought a T 1000, which was, uh, an HP server, it
Speaker:
was a really big server and it had, um, it was a huge server.
Speaker:
It had a hundred gigabytes of data.
Speaker:
Ginormous, wait.
Speaker:
Let me go grab my flash drive.
Speaker:
It was a huge server for the time, and it came with a two gigabyte tape drive.
Speaker:
Right.
Speaker:
I think with compression it was like a four gigabyte tape drive.
Speaker:
That, that was a system design.
Speaker:
And there, there were no changes.
Speaker:
No, we, we, we added 30%.
Speaker:
With one server, we added 30% to the capacity of the
Speaker:
data center with one server.
Speaker:
There wasn't a single discussion about what we should do from a
Speaker:
backup and recovery perspective.
Speaker:
That's what happens when you don't have a recovery mindset,
Speaker:
Yeah.
Speaker:
right?
Speaker:
Is that you, you do things, you add things to the system without any thought
Speaker:
to what they would, you know, how that would impact the recovery system.
Speaker:
So that's why we want to have a recovery mindset.
Speaker:
Yep.
Speaker:
Okey dokey.
Speaker:
I think we covered everything.
Speaker:
Yeah.
Speaker:
I think so, yeah, everything you could possibly want to know about
Speaker:
DR in, uh, four episodes with the two, the two, maybe five.
Speaker:
We'll see.
Speaker:
We'll see if I can find that other episode.
Speaker:
Thanks Prasanna for, uh, you know, once again, uh, you know, great team.
Speaker:
Woo hoo.
Speaker:
Go team.
Speaker:
Go.
Speaker:
Team go and uh, I want to thank you once again to our listeners.
Speaker:
We'd be nothing without you.
Speaker:
That is a wrap.