(For frequent listeners, I made a mistake with last week's episode. THIS is the episode I meant to publish last week. I know it has the same description. Sorry about that.)
This disaster recovery case study takes you inside a real DR scenario when a hurricane devastated an island's data center. Our guest shares his firsthand experience managing recovery efforts with limited resources, no mainland connection, and countless unexpected challenges. (This is the on-the-ground account of the story we told last week with "Harry Potter.")
Listen as we explore how basic DR assumptions fell apart, from authentication dependencies to satellite communication limits. Learn why sleeping on air mattresses and eating chicken and rice became part of this disaster recovery case study, and discover critical lessons about DR planning, testing, and documentation that could save your organization. This episode reveals the reality of disaster recovery when everything - including trees - gets stripped away.
You've found The Backup Wrap-up, your go-to podcast for all things
backup, recovery, and cyber recovery.
In this episode, we share a fascinating disaster recovery case
study that will blow your mind.
I'm talking with someone who had to do an actual DR
in the worst possible circumstances: on an island,
after a hurricane took out everything.
And when I say everything, I mean, everything, the data center, the
infrastructure, even the trees.
He had to sleep on an air mattress, eat chicken and rice for two weeks, and
figure out how to restore systems when basic assumptions like "we have internet"
weren't true anymore.
If you've ever wondered what a real DR
situation looks like, you're about to find out.
Plus we get into some serious discussions about backup strategy and why
assuming things will work
isn't the same as testing them.
Trust me, you don't want to miss this episode.
By the way, it is a classic episode from four years ago. It was an amazing story
then, and it's a great story now.
And so we're bringing it to you during the holidays.
I hope you enjoy it.
Welcome to the show. I'm your host, W. Curtis Preston, AKA Mr.
Backup, and I have with me my, my meat advisor,
Prasanna Malaiyandi.
How's it going,
Prasanna?
I don't know if I'm your meat advisor.
I think I am.
Your, uh.
Uh, what do they say?
Sort of your apprentice or wishing to be your apprentice.
You're, you're, yeah.
You're not my, I'm guessing you, you have a lot of vegetarian
dishes in your house, right?
Yes,
we do have a, my wife is vegetarian and Right.
Typically we do vegetarian, and given that we eat a lot of Indian food, Indian
food, you can make a ton of dishes without ever having to make anything meat.
So there's a lot of variety.
Yeah.
Uh, so, so you're more sort of an, you're, you're, you're my meat enthusiast.
Yes.
How's
that?
You, you, you're meat curious.
Yes.
But meat curious.
But I do want to hear about your latest adventure, because I've been, I've
been asking you, I think almost every week how your dry aging is going.
So I know that this will probably come out later, but now that.
Technically we're recording this after Thanksgiving.
I wanna know how was it and what happened.
So I'm gonna, I'm gonna bring on our guest and then, uh, and then
we'll, we'll chitchat about that.
So, uh, we have, uh, it's a rare treat for us because I've been in IT for so long.
Rarely do I have someone who's been in IT longer than me, and this is
one of those times and I'm super excited because, uh, he started in
IT just after I was in high school.
And, uh, started out in the hardware, uh, side of things, actually working
and running Digital Equipment Corporation, which we call DEC,
uh, their internal email service. Went into, uh, IT, uh, has done a lot
of things in data center operations, system administrator, data center
manager, and he's recently retired.
Super jelly.
And, and now lives in Seattle.
He is a friend of the person that we previously had on that we called
Harry Potter to keep him anonymous.
And so to continue that tradition, I would like to introduce to
the podcast, uh, Ron Weasley.
Hello.
It's a pleasure to be
here.
You know, we had, we had your friend Harry Yes.
On and um, and, and we, we had a great podcast there.
But, um, I, I, I, I kept asking him, I was like, do you think that the guy that
actually was the one with the fingers on the keyboard would, would talk to us?
And he said, yes.
And so here you are.
Uh, but.
So the, before we get to that, uh, we'll get back to the, to the meat conversation.
So Ron, I have had a, a, an ongoing sort of a project of experimenting
with dry aging meat at home.
And I started out with these things called the umai bags, which UMAI,
it's short for umami and, um.
The, and so Thanksgiving was the first time I did a dry aged brisket at home.
I, I think the dry aging process went really well.
I didn't quite get, and by the way, if you are a brisket fan, uh, just
go to YouTube and type in dry aged brisket and you'll see why I was
interested in dry aging briskets.
'cause the weird thing is that.
A lot of people in the dry aging slash brisket community don't think
that briskets benefit from dry aging, but these videos suggest otherwise.
So I tried it out.
The problem was that I had a, uh, a noon Thanksgiving. We had, for those
concerned, we had a sort of COVID-friendly Thanksgiving gathering, so we had,
We had 10 people.
We did it outside.
We were socially distanced.
I actually rented tables and chairs so that I could do that and I, so that we
could, you know, follow all the rules.
Uh, but it was at noon and I had, so that meant I had to start my
brisket at midnight and, um, it meant that part of the brisket.
Well,
you're very dedicated.
Yeah.
Most brisket cooking was while I was sleeping.
And, uh, let's just say the critical part is towards the end when you
need to be checking doneness.
And I was really not wanting to get up at five o'clock in the morning
when this thing was really done.
And so I kind of got up at six.
And the difference between getting up at five and getting up at six
is the difference between a brisket that is tender and a brisket that
becomes pot roast. And unfortunately, I blew my 60-day experiment.
Mm.
For an extra
hour's sleep.
So I had a, the brisket was super, super tender, super, super juicy.
Easily the juiciest brisket I've ever cooked, but it was slightly overdone
and so I was disappointed in it as the brisket maker. The people there
loved it.
Um, and so I had no complaints.
I also had no brisket left.
Yeah, that's when, you know, it was good when
we were done.
Uh, I think I grabbed a handful of it, uh, just so that I
could, you know, eat it later.
But yeah, we, we had a, like a 16 pound brisket that was
completely gone from 10 people.
Who also had a turkey and a ham to eat.
So, you know, I don't know.
Was that a success?
I, I just
did a quick lookup of it, six to eight weeks.
It's saying to, to do that, that's,
yeah.
A lot of it's time.
It's
dedication,
right?
Yeah.
Dedication.
Yeah.
Really.
Yeah.
So I'm current, I'm currently in the process of an actual dry aging experiment
where I have a dedicated refrigerator.
With, uh, precise temperature and humidity control that's going
on, like literally right now.
It started December 1st, and I'm hoping to have the results of
that by, by New Year's and have a New Year's, uh, dry aged brisket.
But, um, but we don't know about that yet, so, um.
We'll talk about that on a later podcast.
Ron, you listened to the podcast where we talked about you, right?
Yes.
With with Harry.
Yes.
Yes.
And the, the idea was that there was a hurricane.
That took out, so this actually happened, by the way, this is, this is, you
know, uh, this, this is a true story.
Yes.
The, so a hurricane took out an island.
An island took out an island.
You and Harry both worked for, you know, we'll call it Hogwarts.
Someone had to go down there
and do the disaster recovery.
My understanding is that it, it was sort of a toss up between you and Harry.
And Harry couldn't get there fast enough and you could, and so you
were on your way to, you drew
the lucky straw, you drew the lucky, lucky straw.
So
you were on, you were on your own.
Does that sound about right?
Yes.
Yeah.
Yeah.
The, um, one of the things, kind of the requirements, was, um, because of
the nature of the response, you had to be comfortable doing command-line
recovery of the backup application.
Why was that?
Uh, just because you, um, you, you know, the, um, you're gonna
be, um, on the console of the server for a lot of the Okay.
So no GUI for you?
No GUI, no GUI for you at the beginning?
Yeah.
Kind of a recap for the listeners who may not fully recall
the episode: this is where
the hurricane took out the data center.
I believe that you moved the servers into a different data center to try
to recover, and you moved some of the backup infrastructure as well, correct?
Right.
So the, the way that the site had been set up was, um, that from,
like, a, a main computing, um, standpoint,
they had two main data centers and the design was to have, you know, half the
capacity in one, half the capacity in the other, with, um, backups and copies
going between the two.
Right.
And so we had two backup systems and we replicated between
them, and so data center A would back up half of the servers and replicate
to data center B, and B would back up the other half and replicate to A.
So each side had
copies of all of the backups for the entire site.
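
(Editor's sketch: the cross-replication layout described here can be modeled in a few lines. This is only an illustration under assumptions, not the site's actual configuration: the `BackupSite` class, the `cross_replicate` helper, and the server names are all hypothetical.)

```python
from dataclasses import dataclass, field


@dataclass
class BackupSite:
    """One backup system. Names and fields are illustrative only."""
    name: str
    local_backups: set = field(default_factory=set)  # servers this site backs up itself
    replicas: set = field(default_factory=set)       # copies received from the partner site

    def holdings(self) -> set:
        """Everything this site could restore on its own."""
        return self.local_backups | self.replicas


def cross_replicate(a: BackupSite, b: BackupSite) -> None:
    """Each site sends a copy of its own backups to the other."""
    a.replicas |= b.local_backups
    b.replicas |= a.local_backups


# Hypothetical server names, split half and half between the two sites.
site_a = BackupSite("A", local_backups={"db01", "app01"})
site_b = BackupSite("B", local_backups={"db02", "app02"})
cross_replicate(site_a, site_b)

# If B is flooded, A alone still holds a copy of every server's backups.
all_servers = site_a.local_backups | site_b.local_backups
restorable_from_a = site_a.holdings()
print(sorted(restorable_from_a))
```

The sketch covers only restore coverage. It says nothing about whether the surviving site has the capacity to take over backing up every server going forward, which is the separate problem that comes up later in the conversation.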
Hmm.
And, um, so site A, uh, data center A was fine.
Um, the building that data center B was in was the one that was
damaged and the damage to the data center, um, wasn't direct.
It was indirect in, you know, the building was damaged, but it was water
damage that came flooding on down and, um, flooded the racks and actually had,
uh, you know, a foot or so of water, uh, on the floor
of the data center itself.
Uh, the data centers were not raised floor data centers.
They were, um, the, the everything was, um, you know, in cable trays above
and suspended above kind of thing.
So, uh, it was quick.
It sounds like a bad
combination to have with a,
uh, hurricane.
Yeah.
Um, so like in, in, in the rack that had our equipment, so we had a, um, you know,
backup server, um, uh, and a couple of, uh, media servers and then a storage, um,
and, um, and then a tape library, and the library was at the bottom, and the library
was the one that was damaged the most.
Um, the rest of the equipment, um, was okay, although the, some of the
damage actually was caused by the, um,
desire to get the equipment quickly out of the one place into the other.
And so when they were taking it out, they weren't as careful as they could
have been with rails and whatnot.
So we had some problems, um, racking the equipment up at the, at the new
location, you know, and they took a, a small little server room and quickly
converted it into a data center to house all of the, the facilities.
Yeah.
And so, um.
So that was from the backup system standpoint, the impact on it.
Um, because the, because of the replication, we were able to start,
um, providing restores of, of a lot of the servers that they needed.
Um.
From the, from the A site, right.
Uh, while we were recovering the, the backup system for the
B site, we, we knew that we could do the recoveries from the A site.
We just knew that when we started to bring things back online, A wasn't gonna
have the capacity to back it all up.
Right.
We needed to get B back on for going forward.
Right.
So, and then we were.
We were dealing with the fact that, um, we had to have vendors come in to work
on their equipment and do a checkout.
And we weren't, we didn't even fire it up until the vendor came in and certified
that, that they were gonna continue to support it after we had done what we did.
Um,
oh, that's interesting.
Yeah.
And so we had
to wait, um, for them to be How long,
how long, how long do you think it was between basically you arrived?
And you could actually start doing something from, from, well, let me
rephrase, but you could actually start doing a restore of any kind.
Um, we were restoring.
I was restoring, starting to restore servers the first or second day.
Um, okay, so you have to understand that.
Well, um, um, so the, the initial recovery team was a team of, uh, you
know, like, like myself for the backups.
We had a, uh, a
DBA there to handle the databases.
They had some network people.
Um, and then they actually had to fly in a couple of vendors for
the, um, um, emergency generators.
Mm-hmm.
Um, to, to them because, um, they were, um.
The, the generators that they put in were emergency generators, for short-term outages.
Well, they found themselves faced with a long-term outage, uh, power wise, right?
It was gonna be a long time before power was brought back.
And so they were running their emergency generators way beyond the duty cycle.
So they had guys in there, um, babying and keeping them going while they, uh, um.
Came up with longer term solutions for how they were gonna power it.
Um, so, um.
So when they, when all this team got together, as well as this was,
was flowing in, and then you had the local, um, IT staff and that's
just, just the whole site staff.
Um, they had started to pull out their, um, recovery plan and which
pieces of equipment that, uh, and which systems and that needed to
come back online first, second, and third and all that kind of stuff.
Right?
Oh, so they actually had a runbook and a
plan put in place ahead of time? To, to some
degree.
Yeah.
Yeah.
I mean, they, they kind of knew, you know, based on the business.
What did
that, what did that look like?
Was it actually, I never
saw it myself, you know?
Right.
Um, but I do know that one of the things that was kind of interesting
is that, you know, the runbook, um, that, that, uh, that they had was this
somewhat abstract thing at the time because, you know, they thought
about it and they planned and they, they tested little pieces of it,
but they never tested it in its entirety, like the disaster presented.
You know, you know, we, we talk about that all the time.
Mm-hmm.
Uh, Ron, that, that.
That, that, that's exactly the same thing.
Back in the day when I was firing backups in anger, when we did a DR test at the
bank that I was at, that we never tested, because, you know, without the cloud,
without virtualization and, you know, and, and additional hardware or whatever, doing
a full DR test is ridiculously difficult.
Right.
It is.
It's,
and, and so no one, no one does that, no one tests their whole runbook, um, right.
Or, well, well, now, I think now more people do.
But, so it sounds like you had that problem.
So they, they had this runbook, but it was primarily in theory up until,
right.
Because what they learned is, is, you know, as, as.
They learned that they had made certain assumptions that they shouldn't have made.
Hmm.
And like one of the things,
go ahead.
Yeah.
Well,
like for instance, Active Directory design.
Okay.
So the Active Directory for the island was tied to the mainland.
To the, to the corporate data center.
And when they broke the link to the corporate data center, they were, it's
like, oh, we can't authenticate anything.
So they had this, that's one of the first things they had to bring up.
Right.
So, so help me understand there.
So when, when the hurricane hit, basically they lost connection to the mainland?
Yeah.
Okay.
So they had to, they had to do all of this locally.
Um, interesting.
Yeah.
You know, it's interesting.
You would think that they would not make that assumption being an island.
Right, right.
Well, and, and, and see, uh, you know, um, a lot of it was driven by the
experience of the local, you know, the local experience on the island.
And, um, the, the prior hurricanes that they had had, had not been as devastating
as that particular one that hit them.
And so they were able to make it through without.
The kinds of, of impact and losses that they, they, um, did, I mean, it walked,
walked right up the middle of the island.
I mean, it just devastated them.
Um, and so, you know, so they were dealing with that.
And then one of the other interesting things that, it took a interesting
how it takes a while for you to figure out what's going on.
They had this, um, um, satellite communication hookup,
which was their fallback.
And so that was the way they were talking between, um.
Uh, you know, from, from the facility there to the, to the corporate end of it.
And every day, like around noon or so, the, um, connections
would start dropping off.
I was using it to try to, to phone home, you know, in the afternoons, couldn't
get a, uh, a connection or anything.
And they were like, what's going on?
It was working fine in the morning, but in the afternoon and one of the
network guys started poking around.
Oh.
You know, and he actually called the dish mo slightly.
Well, what it was was that it was an emergency and um, it was supposed to
be for emergency only, short term.
Again, one of these short term kinds of things.
And they had turned it into their main network link.
Well it was, um, it was a metered thing 'cause it was shared by all, you know,
all the other emergency equipment.
So they would use up their full
day's allotment by noon or, or even earlier sometimes.
Right.
Oh, so they had, they had a bandwidth allotment?
Yeah.
And so then they were metered in the afternoon.
Right.
And so they had to, they had to work.
That was one huge, you
know, you know what this reminds me of?
You know, it's, you have unlimited bandwidth.
Just don't use too much of it.
Just, yeah, just don't use, don't use it all.
You hit your
data cap and we're gonna meter you.
Wow.
Okay.
So yeah.
So they, that was a huge thing for, for the network team, was to work on,
um, and work with, uh, you know, the vendors and that to try to get, um,
a, a, a reliable, fast enough connection and multiple connections.
Um, I, since I'm no longer, um, working with them, I don't
know the, the end results.
I do know that they were, um, headed towards several, um, um, different
microwave connections from the facility.
To, um, you know, to, to a, a main link that would then take them to the mainland.
Um, running their own dedicated link to the mainland was,
you know, just cost prohibitive.
Uh, you know, it's interesting.
I I, I did wanna just mention, I, I'm wondering the degree to which this
new, so Elon Musk has now come out with this, uh, I mean, they're right.
They're in beta right now, and it's a completely redesigned way to do
satellite-based internet connections, Starlink, where they have
all of these, um, satellites in,
um, it's a different kind of orbit, I guess, than usual.
And, and they're saying that they can actually get both bandwidth
and latency equal to and or better than, um, what you can do
on land, and so it, it, it's, I I don't know the degree to, I don't
know how much it scales up mm-hmm.
For a data center connection, but it, it, it's just, it's just the interesting,
you know, thoughts towards the future for things like islands that are,
you know, cut off the way they are.
Mm-hmm.
I like, even, even, you know, the, this, the microwave connection.
Like how reliable is the main connection that they're connecting to there?
Right.
It's probably pretty reliable, but what if it wasn't right?
What if that went down?
Then you're, you're really right.
Well, and, and what they were running into there was like, um, remote,
I, I'm gonna use transceivers, but maybe that's not the right word.
But you have a, um, a, a remote tower that is
relaying, you know, it's a relay.
Um, and so, you know, it's, it's receiving a signal and you know,
because microwave is line of sight.
And so if you're gonna go, um, you know, over mountainous things, you
have to send it between a series of towers to get it to where you wanna go.
And, um, they were finding out that they would lose a tower.
And when they'd go out and look, well, somebody had gone
out and stolen all the gas
outta the generator because gas was, like, hard to get.
Oh, yeah.
Um, and then, um, or, or people were ripping up, um, you know, 'cause power
lines and all that were down, people were ripping out the copper, you know,
so, so you run into that kinda stuff where that just made the recovery
effort, you know, external to the site hard, um, which then impacted the site.
Yeah.
Yeah.
You were dealing, you were dealing, in this case, you, you had
somewhat of a perfect storm where you're dealing with the fact that
your data centers were flooded.
You weren't, uh, you didn't have a raised floor.
Uh, so that makes that problem worse.
And then you didn't, you know, the, the, the DR
design made assumptions that were no longer true.
And then meanwhile, the, the things that you did have were being frustrated by
other things that were, yes.
It's like, Hey, can you, can you stop messing with the things that actually
work while we're trying to put the data center back together over here?
Wow.
That's, you know, we, we, we live in backup land and we think of the, the,
the restore is the part we focus on.
But it sounds like most of the problems that you were experiencing
had nothing to do with the actual act of getting data from
storage devices to server?
No,
actually that part of it went, um, went well in, in the areas.
The, um, the issues that we ran into was, um, like you said, the, um, the
disconnect between the design or intent and the reality as it, as it unfolded.
Um, you know, they'd say, oh, so we need, you know, this server, uh,
brought back online.
Um, so find the most recent backup and you start looking through.
Mm-hmm.
And it's like, oh, we're not backing that up.
You know, when did you bring that online?
Why didn't you tell us?
You know?
Um, and so we ran into a few of those where, um, we couldn't give them, uh.
Yes.
Oh, that, that is one of the most frustrating.
So, um, and then, um, well, and, and, and then, you know, or we're backing it
up, but we're not including everything, you know, so, um, we're backing it up
in the sense that we're doing an OS backup, but the data that's on it, you
didn't include it, you know, so the, um.
So this, this was a NetBackup shop, so you were not using all local
drives, is what you're saying.
And, and it was funny that when I first hired on there, came on board, um,
they were in the middle of, um, of a, um, upgrade and kind of a transition.
Uh, the, the manager in charge of the backups and, and
whatnot at the time, um,
was pushing this "we, we should only back up what we really need."
And so he was trying to push the, uh, I hate that responsibility onto the
data owners and saying, you need to define what's important to you and
let us know and we'll back it up.
So they stopped all local drives.
Right.
And um.
I, you know, here's the thing, I, I don't, I don't disagree with the idea
of not backing up worthless data.
I don't disagree with that idea.
What I disagree with is the implementation of, okay, so it, what I believe is mm-hmm.
You should identify what is worthless and then we will exclude that.
Right.
Not identify what is valuable, right,
and then we will include that.
I, I just, I just disagree with that, the implementation and just
because you always forget things.
Well, you always forget things and then you, you add things.
It's just like the server that he talked about, right?
You end up adding file systems and the, the, I, you know, I go back to
when I helped redesign the, the backup system for a broadcasting company.
Uh, they were 20 terabytes as I recall when I got there.
And one of the things they weren't doing is that they
weren't doing all local drives.
And I pushed really hard that we should do all local drives as
part of the redesign, we did it, we discovered 10 more terabytes.
Yeah.
Of, of data that they weren't, that they weren't backing up.
And yes, it was really valuable data.
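
(Editor's sketch: the difference between "tell us what to include" and "back up everything unless excluded" shows up as soon as someone adds a file system without telling the backup team. The function names, paths, and lists below are hypothetical, and this is plain Python rather than any backup product's real selection syntax.)

```python
def select_by_include(discovered, include):
    """Include-list approach: back up only what someone remembered to list."""
    return [fs for fs in discovered if fs in include]


def select_by_exclude(discovered, exclude):
    """Exclude-list approach: back up everything found, minus what was ruled out."""
    return [fs for fs in discovered if fs not in exclude]


# What the data owner listed when the server was first set up (hypothetical).
include_list = {"/", "/var", "/data"}
# What was explicitly judged worthless (hypothetical).
exclude_list = {"/scratch"}

# What is actually on the server today; /data2 was added later and never reported.
discovered = ["/", "/var", "/data", "/scratch", "/data2"]

included = select_by_include(discovered, include_list)
everything_but = select_by_exclude(discovered, exclude_list)

print("include-list backs up:", included)
print("exclude-list backs up:", everything_but)
```

With the include list, `/data2` is silently skipped. With the exclude list, it is picked up automatically and the only cost is possibly backing up something nobody needed.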
Right.
Well what, what I found interesting, um.
Through that whole process of, you know, coming on board with that, um,
trying to push that idea of, of the data owners being responsible to identify
and we back up what they say to then we have to do with this recovery.
Right.
Um, I.
My experience.
Yeah.
Well, yeah.
And my experiences in who, whose fault has it been Ron, over the, over
the time that I've been involved in, you know, the system side of things.
Um, and watching, watching the way things unfold over time is that is
really good in theory, that approach of making the, the data owners
responsible for identifying it.
But what I have
learned and watched has been,
the real weakness of that is, um, you know, data owners come and go and
you don't get the transition between
an outgoing data owner and incoming data owner of what's
covered, what's not covered.
Um, you know, I know that I struggled all through my time, um, in, in, you
know, um, the IT thing, of doing proper and adequate documentation so that if I were
hit by a bus as, as the saying goes, um, somebody would know what was, you
know, what to do when they stepped in.
Um, but I don't think that that was
happening across the board.
And so, you know, you'd have a system that was brought online, you know, like four
years ago, you had a very conscientious data owner who identified all this
stuff and it's covered. Well, over those four years that guy moved on and
somebody else not so dedicated, came in.
Some changes were made that weren't documented.
That weren't covered, you know, and so then you have,
you fast forward to, to the
disaster and you have a situation where, well, we are only backing
up part of it, you know, and, and
yeah, this is why, or this is why I'm a strong proponent, A, of
virtualization, and B, the reason with virtualization, it, it helps
to, it helps to minimize this problem because really all we have to do is make
sure we understand, we know about new,
you know, VMware servers or Hyper-V servers, and then you can, you
can tell your backup server or your backup software, uh, back
up all VMs that show up on this thing, unless I tell you otherwise.
And when you're backing up those VMs, back up everything on those VMs.
So you, you solve this problem, uh, you know, from a more global perspective.
Um, and, but this idea.
The, you know, we could talk for hours on all of the things that could go
wrong with manually selecting data sets.
Uh, and you know what, if you, if you back up a, a few terabytes of
worthless data, that is way, that is a much smaller problem than the one
that this discussion started with.
Right?
Which is.
Restart this.
Yeah.
And I'm trying to restore the server.
Yeah.
And it wasn't
backed up.
Yeah.
Um, yeah.
Virtualization, I think, you know, it was, it was fun to watch that come
in and be part of that coming in.
And it did make backups easier from a, you know, from the
perspective of, of, um, you got it all.
You know, you just, I mean, the equivalent of all local drives
was the default for, for VMs.
One of the things that we struggled with.
Right.
Um.
Uh, was that we were never able to convince and get the, um, virtualization
team to agree upon some kind of a scheme that they would manage their
VMs under that would allow us to do.
What you said, um, Curtis, this, uh, the
automatic discovery of new VMs.
Yeah, yeah.
It's like, you know, if you had just pick something, a folder,
a, you know, whatever kind of,
or tags or whatever they wanted to use
the delimiter, you could do, you could do within the, the virtualization software
to identify, you know, production machines, development machines,
however you wanted to break 'em up.
And then we come in and say, you know, in this, on this server, on this
cluster, whatever, back up all of these types of machines, then we wouldn't
have to worry when they added machines or took machines away, you know?
Right.
So, you know, we, we still had the same problems in the, with the virtual
servers that we did, with the physical servers of them not telling us
Right.
The, I, I think the, the right long-term solution there is, is tags, right?
And, and then, and then I think that there should be a policy that
says if you have a VM that has no tags, start backing it up, put it
in this policy, and then yell at me.
Right?
Because that way you can say, Hey, there's a new VM.
And then they, then they can come to you and go, oh, that's, that's
test or dev or whatever, and, and you can take it out right?
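
(Editor's sketch: the "no tags means back it up and yell at me" rule can be written as a small policy function. Everything named here is an assumption for illustration: the `VM` class, the tag names, and the policy names are hypothetical and do not come from any particular hypervisor or backup product.)

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    name: str
    tags: frozenset = frozenset()


# Hypothetical mapping from a VM tag to a backup policy name.
TAG_TO_POLICY = {"prod": "prod-daily", "dev": "dev-weekly"}
NO_BACKUP_TAG = "no-backup"           # explicit opt-out, set by the VM owner
CATCH_ALL_POLICY = "untagged-catch-all"


def assign(vm: VM):
    """Return (policy_name_or_None, needs_review) for one VM."""
    if not vm.tags:
        # Nobody told us anything: protect it anyway and raise a flag.
        return CATCH_ALL_POLICY, True
    if NO_BACKUP_TAG in vm.tags:
        return None, False
    for tag, policy in TAG_TO_POLICY.items():
        if tag in vm.tags:
            return policy, False
    # Tagged, but with nothing we recognise: treat it like an untagged VM.
    return CATCH_ALL_POLICY, True


inventory = [
    VM("erp-db", frozenset({"prod"})),
    VM("build-box", frozenset({"dev"})),
    VM("sandbox", frozenset({"no-backup"})),
    VM("mystery-vm"),  # showed up overnight with no tags
]

for vm in inventory:
    policy, needs_review = assign(vm)
    note = "  <- review with the VM owner" if needs_review else ""
    print(f"{vm.name}: {policy}{note}")
```

The point of the design is that the default is protection. Someone has to take an explicit step to remove a VM from backups, so a forgotten VM costs some storage instead of costing a restore.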
It's funny how, it's funny how, no.
You know, the more things change, the more they stay the same.
Oh,
yeah.
And, and the key, I mean, that I learned in my, uh, experience for this particular
one, this, um, was in the, all the times I did, well, not quite true, this was
the biggest disaster.
I should, let me put it that way.
'cause I've been involved in a couple of other DR scenarios
that were very small, um mm-hmm.
Kinds of issues, you know.
Um, we had
nothing beats a completely wiped out island.
Right.
Though we had
a data center.
Yeah.
Um, uh, that had a similar problem where, um.
Somebody had not properly cut the holes in the ceiling of the
data center or plugged them when they were running cables through.
And a lab had a, um, a problem and the sprinklers were released and
the water came down and flooded a corner of this data center.
Um, and it happened to be in a city by the Bay.
Um, but um, we ended up.
Shipping their tapes up or up to our facility and then doing the,
the, um, import to be able to then do the restore from our facility
while they were recovering.
Um, you know, the site down there.
So we had, I had that one before, but one of the things that I've
noticed that's kind of like, I think a problem, it's important to
the, the proper implementation of a, of a DR resolution.
Um, as well as just the planning, the building and setting up your
infrastructure, your IT infrastructure so that it can be recoverable.
And that's communication among the various teams and groups that are responsible
for all the pieces that make up, you know, an IT infrastructure and, um.
You know, that wasn't a strong point with that company.
Um, they, there was some regulatory reasons why they needed to make
sure that, um, no one person had the keys to the kingdom kind of thing.
But the way they did it tended to put walls between the groups and
so they weren't communicating in a manner, um, that I feel that they
should have, that would've helped.
Um.
You know, address like the left hand, not yeah.
Knowing what the right hand is doing
well and, and address some of the assumptions, you know, address some of
the assumptions because I've learned in, in a lot of my implementations, you
know, um, anytime doing it, you know, uh, learned a long time ago in the
beginning that you gotta have a plan.
You gotta kind of figure out what you're gonna do, what the steps,
what you gonna, you know, um, how are you gonna pull it off, and what
are you gonna do if it doesn't?
Go the way you wanted to, you know, 'cause then you gotta back everything out.
Um, and so I would, you know, do the planning part and then run it
by, uh, other people in my group.
To break the assumptions.
'cause I know when I'm building my plan, I'm making certain assumptions.
I just know it right off the bat.
You know, it's just, I know that's the way we all operate.
Um, and so I'm building my plan and then I have to run it by somebody who will
not have my assumptions to find those.
That reminds me, when, when we used to do the DR test, the way we
would do them is I wasn't allowed to participate in the DR test.
Mm-hmm.
We had to have someone else who was an it, who was an IT person, and then
they would follow my documentation.
Mm-hmm.
Um, and, and I had to pretend to, to be dead.
Um, it was always, by the way, you talked about, you know, you got hit by the bus.
I always hated the fact that it, like, why can't I like
win the lottery and disappear?
Right.
Always
has to be a bus.
Why?
Yeah.
Uh, but yeah, I had to pretend to be dead.
And as I, I've, as I've discussed
more than once on the podcast,
the, the standard was, um, that, you know, a, a success was we got 100% recovery
without having to have Curtis help.
Right?
And not once
did we get that.
Because there was always, there was always something that was left out of the
documentation, no matter how much you try.
Right.
To,
yeah.
Well, and that's, I mean, that's why they're, they, they should
be viewed as living documents.
You should be, um, yeah.
Reviewing them.
Yeah.
You should be testing them, you should be updating them.
Um, but you know, there have not been many
places that I've worked in the, in the 30 some years that I've
worked in, in doing that kind of stuff, that operated that way.
Um, you know, there were companies that I would, you know, like, I
can remember one that I got in and they didn't have anything, and
I really pushed really hard to get some kind of a company-wide, uh, DR
plan pulled together and so we put a lot of effort into it and we built a
document that then was put on a shelf.
And it was never looked at again.
Right?
And it's like, well, why do we even do it?
The binder
has cobwebs and everything over
it.
Why did we even do it?
You know?
Um, the last company I was at was one of the better ones for at least attempting to
have that be an ongoing part of their, um,
their operational thing.
'cause they, they, uh, when an, uh, application or a service was
brought in, it had to be identified.
Is it business critical?
You know, what level of, of protection does it need?
And then, um, if it is business critical, they had to have a DR plan and they
had to do a, um, tabletop one year.
And then the following year they actually had to do,
execute the plan, you know?
And so, um, a good portion, did
they have to do the whole plan
for the application?
Kinda like where
we talked earlier?
For
the application.
Right?
For the application.
Okay.
They had to pretend they had to have a greenfield.
Right?
Right.
And so there was still some weaknesses in that part because, um.
Uh, there, there were, there were assumptions made, right?
Uh, that okay.
Mm-hmm.
We're gonna assume that there's a proper infrastructure.
We're gonna assume this, we're gonna assume there's DNS, you know,
from an application standpoint.
Okay, that's fine.
But what was never really done, and this is kind of the breakdown, um, on
the island, was those assumptions about the infrastructure had never been
tested or figured out. Gone.
And that's what caught them in the beginning.
Right.
Those assumptions were all underwater and,
yeah.
Yeah.
The infrastructure part was the part that had broken the worst, if that's,
yeah, you know?
You know, it's interesting and I just wonder though, like how many times do
people make that sort of assumption?
Like you kind of make assumptions that power may or may not be available, but
the building may or may not be standing, but there are certain things about.
The outside world that you just assume will still be up and running.
Mm-hmm.
As you're trying to work these things through, I guess on an
island, things get more complicated.
Mm-hmm.
In terms of what may or may not be working, like you said, links,
communications links, et cetera.
But I guess you're right, that is something you have to take into
consideration are things outside of your data center that you have to
think about.
Right.
But I, I was, I thought about this a lot after, you know, in the aftermath
of that for the, um, um, for, for some time after being involved in that.
And what I experienced there, what was experienced there to me could very
well happen in a regional sort of way here on the mainland and, and leave,
leave a company in the same boat.
Right.
Um, and so, you know, while there are some particularities to being on an
island, um, it doesn't mean that if you're not on an island, you don't
have to worry about the kinds of assumptions that, that, you know, that
turned out to be big problems in their initial, um, you know, recovery attempt.
Yeah.
You.
You would think that being, being on an island, they would not assume
a connection to the mainland.
But apparently that was the case.
That was the case, yeah.
Um, and the, the, the difficulty I have when I hear this story,
because the, the island situation, what it does is, I think it, it
brings to the foreground or makes possible many of the worst case scenarios that
could happen, that that could happen to a lesser degree on the mainland, but,
but to a greater degree on, on an island.
Um, but the, the, the real problem here is that many, many of the modern solutions
that I would think of to solve these problems are based on using the cloud.
Mm-hmm.
Which is problematic when you look at an island situation.
Mm-hmm.
It's not as much when you're looking at a mainland situation, but if, for
example, you, you, you, you would have to have, you know, internet connectivity,
which is why I go back to that.
I, I'm really curious to see how this, how the, the Elon Musk project
goes, or goes forward.
And whether or not it's, um, you know, something that can
help solve this problem because
everything that I know that's being done right now has to do, like
in the really cool DR perspective has to do with using the cloud.
Nobody's talking about, you know, u using physical server DR services or you know,
all of the, basically without using the cloud, it's so much harder and with using
the cloud, it's so much easier to not just to test an application, but to test
the disaster recovery of the entire site.
But here's a question for you, Curtis.
Yeah.
Given the island scenario and assuming that your, the cloud
was available on the island.
Yeah.
Wouldn't, and say everyone got hit by this, um, isn't the cloud provider in the
same situation where they may not have enough resources or they might be down?
That's why I say granted, I, I know that you probably are gonna do DR
to the mainland in this example or somewhere else, but I'm just wondering
if everyone starts doing that.
Like, a cloud provider is just a big
data center, it's just somebody else's server.
Yeah, that's,
yeah.
Yeah.
Agreed.
I, I'm just saying that's why the, the, the, the, the island situation is so
problematic because I, I completely agree with you that I, if it was, let's just
say this, this was a pretty small island.
Let's say it's a bigger island and there is a cloud provider.
'cause there, there wasn't, I, I don't think, cloud
stuff available on the island.
Uh, if you, if it was available, I wouldn't use that as your DR site.
I would use the mainland.
Right.
And so, because Yes, you, because you're completely right if, if your data center's
underwater and you're on an island, quite likely their data center's underwater.
Right.
Um, the, um, it reminds me, I used to, you know, manage data
centers in Delaware and we had a,
our offsite vaulting company.
It wasn't Iron Mountain, it was this local company.
And what they had was they had a World War II bunker, um, like, like bomb shelter.
And that's where they stored tapes for people, which sounds really
good until a hurricane comes.
And so whenever a hurricane was on its way to Delaware, we had to pay money
to have all of our tapes moved out of the bunker and up to the second floor.
So that if our data, if our data center was underwater, our tapes
were not also underwater. Anyway.
But yeah, I, I just, it's just the, the island situation is frustrating.
Um, and.
Uh, so any further thoughts?
Um, Ron on the, because we haven't even talked about the recovery yet, and I,
I'm just gonna have to have you back 'cause I, I find this solu, I find
this discussion incredibly fascinating.
You're, you know, you're, you're smart.
You know what you're doing, you're articulate, uh, and, and you
know, I'm, I'm super glad that we, that we finally have you on.
Can you think, uh, so what I'm gonna do is we're gonna with, on this podcast,
we'll sort of round out the discussion on
everything sort of almost not recovery, like all of the
things that just frustrated you being able to start a recovery.
Can you think of anything else that falls into that category that happened to you?
Well, I was just trying to go back.
Um,
I mean there were no Snickers bars available, for example.
Um, well, I was just gonna ask you about how it was on the island itself.
I know you had a job to do, trying to get
the company back up and running, but I'm sure there is also.
Yeah, yeah.
Well, okay on that.
Um, so I need to go back to see the island, um, because I didn't really
get to see the island, you know?
Um, it was an interesting, it was an interesting journey.
We, um, we flew down in a corporate jet, right?
That was a trip in and of itself.
Um,
that must have been cool.
Was that cool?
That cool?
Just tell me.
That was cool.
Very cool.
Uh,
but I'll tell you what.
Okay.
It, it
bit me in the butt because I didn't have to go through TSA
or any of that on the way down.
And so I forgot that I had put a couple of things in my bag, and when I was
leaving the island, I had to actually fly commercial, TSA, and, and it was,
it was fruit and it, like they take it out and you can't take it with you.
So, um, but um, it was, it was weird.
You know, when we landed, uh, you know, at that time, this was very
early after the, the, um, the init, the hurricane itself, right?
And so, um, limited flights in and out, you had to, um, you had to, they, you had
to, basically, they had to request and were given this tiny window to land in.
And so, boy, they had to make sure they made it on time and all that.
Um, they had a number of
people with the company that they were relocating to a facility on the mainland
so that they could get online and do work in helping to bring the site back.
Right.
Uh, they couldn't work locally and so they were waiting when we landed and you
know, we got off and they put us in a shuttle and then took us and there wasn't.
I grew up in Montana and Montana in the winter, when you look out,
um, you got the evergreens, but if there's not an evergreen, it's brown.
It's brown, right?
I mean, there's, there's no green, um, there.
Um, and so this looked a lot like Montana in the winter without the snow.
I mean, it was just
brown.
There wasn't any leaves, any green anywhere.
It had been stripped bare.
So it was weird seeing that.
Oh, because it had been stripped, from, bare, yeah, from the hurricane.
Gotcha.
The wind just
stripped it off.
One of the guys that, that he, he'd worked at the facility, I
forget, he's been like 20 years.
He said, I've been doing this commute for the last 20 years.
I didn't know there were houses down there on the side of the road, you
know, because it was so lush and green, you couldn't see beyond, oh.
And he says, I'm seeing stuff that I didn't even know was there.
But anyways, um, and then we get there and, you know, there's, they
basically put us up in conference rooms.
They turned conference rooms into, um, um,
just, uh, dorms, right?
And we had, um, um, they had these little air mattresses and, um, I brought, uh,
they told us, you know, bring a bag and, and so I brought like a sleeping
bag, you know, change of clothes.
Uh, and, uh, pretty much spent, well, the entire time I was
there, um, I spent it on,
on the facility.
'cause there was nothing to go, no place to really go.
Um, for, for the locals, you know, there were long lines for everything.
Um, and so it just didn't seem right to go getting in long lines for those people.
Um, they, they were providing us with, um, you know, the food, um, and it was
a lot of, um, chicken and rice, but, um, you know, you'll survive on, on, on it.
So they, they took the, the company took very
good care, I thought, of, of the people.
Um, and they also were taking real good care of, uh, local workers, um, who could
not come back in, um, to the facility, you know, just because they, they weren't,
their stuff wasn't yet recovered, you know, because there's some manufacturing
goes on there, and so they weren't
yet ready to do manufacturing.
Um, but they took care of 'em.
They shipped in, um, brought in pallets, pallets full of, of, uh, bulk food
and whatnot and, and, uh, had workers come in, you know, and they just
basically parceled 'em out and helped supplement a lot of their stuff.
So I thought that was really good.
Their, their response overall in taking care of their, their, their local
employees, that I thought was really well, and, you know, and, and they did
good by us as far as, as, you know.
Under the circumstances, you know, we didn't go hungry and we, you know,
we had a place to sleep, but it was,
I was there for almost two weeks and it was pretty much, um, you know, 12, 14,
16 hour days in the, in the beginning.
And then it tapered down after about the first week.
Um, and then it was, we were down to sort of eight hour days.
But then you're just
sitting around a lot of the time.
And so since there was nothing to do, we just worked
like, what else?
What else do you do?
Yeah.
When you're in that situation, you know?
Yeah.
Um, subsequent you're not
watching Netflix.
Yeah.
Subsequent teams that came down, um, by, by that time, um, they were
actually coming down on commercial, you know, 'cause commercial,
um, flights had started to pick
back up as they had, um, kind of cleared stuff up in that.
Uh, and then they had, um, rented out hotels, uh, space for
them and were shuttling them.
Um, so they weren't staying on the, on the, on the, the campus itself.
And, but so, but it was kind of fun in one sense.
Um, I did a, um, the only exercise I could really get there was just a lot of
walking and I walked around all over the campus just to look at the damage and,
you know, different parts of it had had,
you know, some of it unscathed.
Um, the particular building that the, the data center B was in, um, was one of
the oldest buildings there on the campus.
And so, um, it, it's, you know, not surprising that it was the
one that suffered the most damage.
So, um, and, uh, but then it was kind of, you know, it was a, the small group
of us that were, were forced together.
People that you'd never met before or didn't know were all kind of
forced together to, to work together.
And that I found that was fun.
That was interesting.
You know, you get to, you get to know people and it, did you
have any.
Did you have any language barriers?
'cause I, I'm guessing you don't speak the language of the locals.
No.
No.
But, but, um, you know, um, everybody within the facility
is, is fluent in, uh, English.
It's okay.
You know, because it's a, it's, um, a mainland company, you know, headquarters.
Okay.
Um, so, so they're all, they're all bilingual.
Um, which puts them way higher up the ladder than me, you know?
Um, I'm always, I'm always, yeah.
Um, you know, my hat off to anybody that speaks more than one
language as far as I'm concerned.
Um, absolutely.
Yeah.
So, Prasanna,
how many you got?
Prasanna.
I can understand other languages.
I don't necessarily speak a.
Still, that's still better.
Yeah.
Mm-hmm.
Yeah.
So yeah, so it was a very, very interesting experience.
Um, overall, I'll tell you, you know, it, it was, uh, not that, not
that I wanna have to do it again, but 'cause of the circumstances.
But
you got to really do sort of the worst case scenario of what
a lot of us, um, prepare for.
Prepare for.
And uh, that's why I really wanted to have you on Yeah.
And, and talk to you.
And by the way, we're gonna have you back 'cause you, you've got a lot of
institutional knowledge up there of what it's like to actually do this
in, you know, in the real world.
Um, so we're definitely gonna have you back, but, uh, I'm gonna end
this one now because we try to keep these around 45 minutes or so.
We're definitely gonna have you back.
Would that be alright?
Oh, sure, sure.
Yeah.
You know, having spent most of my career preparing for disasters and,
and running a couple of, uh, of, you know, tests, uh, to actually be,
be able to participate in a full on recovery of a, of a facility.
Yeah.
That was a, a, a highlight, you know, I mean, again, you don't want those
kinds of things to happen, but to be able to be a part of one was, yeah.
That's something that I, I'm really proud of.
This was fascinating.
I I, I, it's interesting right at the end there, we got the image of you on a, on
an air mattress with a sleeping bag.
Uh, that, that, that, that added like a whole, I didn't even, I didn't even
think about that part of like, you know, that you needed a place to sleep
and those places weren't exactly, uh, in the, they were in short supply.
Anyway, so, um, thanks so much for, for coming on.
Sure.
Thanks, Prasanna.
Thanks Curtis.
Thank you, Ron.
I, yeah, pleasure.
Like Curtis said, I just have that image, image of you on an air mattress
with a bowl of chicken and rice with, uh, lights flickering, kind
of like an apocalypse like scene.
And the data center slowly coming back online, it's like Jurassic Park almost.
I know that system.
All right, and with that, I wanna thank the listeners for your attention and,
uh, and, and listening all the time.
That is a wrap.