In this eye-opening episode, we examine a real-world disaster recovery test gone wrong from Kodiak Island, Alaska. Our guest Paul Van Dyke shares his story of intentionally taking down an entire server environment over a weekend, armed with only backup tapes and determination. This disaster recovery test example showcases both what not to do and valuable lessons learned.
Paul walks us through his bold 2001 decision to reorganize storage across five servers by completely wiping them all at once. What was supposed to be a weekend project turned into a five-day marathon, including sleeping on his office floor to swap backup tapes. While he eventually succeeded in restoring everything, this disaster recovery test example demonstrates why proper testing and planning are crucial for any major infrastructure changes.
Join us for this candid conversation about backup testing, restoration planning, and the unique challenges of managing IT infrastructure on a remote Alaskan island. Learn from Paul's experience so you don't have to learn these lessons the hard way!
You found The Backup Wrap-up, your go-to podcast for all things
backup, recovery, and cyber recovery.
This episode contains one of the craziest DR stories I've ever heard.
In 2021, we talked to Paul Van Dyke, an IT supervisor in Kodiak Island, Alaska.
He tested his DR system by intentionally destroying his production environment.
Spoiler alert, he lived and so did his data, but not without significant pain.
This is one of our favorite episodes to look back on, so we're rebroadcasting
it this week since we just got done talking about DR testing.
I'm sure you'll love it.
By the way, if this is your first episode, I'm W. Curtis Preston, AKA Mr.
Backup, and I've been passionate about backup and recovery for
over 30 years, ever since
I had to tell my boss that we had no backups of the production database
we just lost.
I don't want that to happen to you, and that's why I do this.
On this podcast, we turn unappreciated backup admins into Cyber Recovery Heroes.
This is the backup wrap up.
Welcome to the show. I'm your host, W. Curtis Preston, AKA Mr.
Backup, and I have with me my personal financial advisor, Prasanna Malaiyandi.
What's up Prasanna.
I am good.
Curtis, what advice have I been giving you?
I.
Well,
you know, I was talking to you about that.
Yeah, yeah.
That loan that I was thinking about doing, and you've been advising me,
you know, that there's this idea that I have of, of doing a loan to a friend
and it's a large enough that, um, I was like, what do you think Prasanna?
I.
And, you know, you gave me advice on moving forward, but
doing all of the right things.
Yep.
And, and I was really surprised at some point.
I, I really expect you, you to say, well, you know, I was watching
this YouTube video on, on personal personal loan administration
and so unfortunately this
what this guy said.
So I don't watch YouTube videos on personal finance, but I do read a mm-hmm.
Forum on personal finance.
So.
See again, again, I, you just, you're just a random fount of knowledge of
random topics and once again, uh, your knowledge came in, uh, came in handy.
So glad I can help Curtis.
That's what I'm here.
It's always good.
And by the way, welcome back to the United States.
Having left it for.
A brief period of time.
It is good to be back, I have to say.
So I did do a long flight to India for a very short trip
and made a long flight back.
What's a long flight?
Uh, so I think flying time was about 26 hours, but door to door was
probably closer to like 34, 35 hours.
Wow.
Yeah.
That's just, I, I've done, I've, I've flown to India,
but I don't, I don't think.
I don't remember.
I I just remember it was really long.
Yeah,
it's long.
And it was
a, it was a, yeah, long,
and especially now with the pandemic, right?
They require masks on the plane the entire time minus Right.
When you're eating or drinking.
Right.
So literally, and even when you're eating and drinking, they're like, oh, take
your mask off in between bites and sips.
Put your mask back on.
Oh, they're very, they're very, um, what, what's cautious?
I dunno what the word is.
Yeah, yeah,
yeah, yeah.
And so it was a little bit of a hassle like.
You take a bite, it's like, instead of taking a bite and chewing, I
would literally take three bites and then put the mask on and
then sit there and chew, right.
And then swallow and then be like, okay, next time.
And I, and you took, took
a covid test on the way out, another Covid test on the way in,
and then another Covid test after we got back, so,
right, right.
All good to
go.
So everything was fine, but it is a little bit of a hassle, but.
Yeah, at least travel is returning back to some form of normal, I guess.
Yeah.
And you, and you have your wife back?
Yep.
My wife came back with me as well, so it's good to be back.
She's been, she's been gone for a while.
Visiting family and stuff.
Yeah, visiting family.
And we were in India for Diwali, which if anyone, that's my first time
I've ever been in India for Diwali.
And I have to say, it is crazy with the amount of fireworks going on, like
it sounds more than July 4th here.
Yes, more than July 4th because it happened over seven days.
And for the seven days it would be from like 4:00 PM till 11:00 PM And it's
not just like the little sparklers that you might do here, or even just like
the little rockets, they sounded like full on like Roman candles and like.
Loud gunshots.
I think you and I were on a call a couple times and you heard a
little noise in the background.
You're like, what is that?
I was
like, what?
What is happening already?
And the sky's
like, I, I was reading an article, I think at NPR where they were saying that they
had like a picture before and a picture after of like the sky after fireworks.
And it's just like clear to completely covered in smoke.
That's crazy, but it's good to be back.
Yeah.
Good to have you back.
You missed.
Aw, well it's nice to be on the same.
Likewise, you know?
Yeah.
It is nice to be in the same time zone.
Uh, by the way, I should mention our standard disclaimer Prasanna.
And I work for different companies.
He works for Zoom.
I work for Druva.
This is not a.
Podcast of either company, the opinions that you hear are ours.
And also, uh, please rate this podcast at ratethispodcast.com/restore
and, uh, or your favorite podcatcher if, if it's not listed there.
And, uh, finally, if you are interested in the topics that we talk
about, come, come, come, come, come.
Yeah.
Just like Paul, just, uh, contact me at W. Curtis Preston
at gmail.com or @wcpreston on Twitter.
And uh, we will have you on.
And it's a friendly environment, right, Paul?
Absolutely, absolutely.
Everybody should come.
And, uh, we love to talk about all things, uh, backup security,
data protection, data resilience, uh, you know, puppies, whatever.
Um, I like puppies and movies.
I also wanna mention our giveaway.
We are giving away one free
ebook version of my new book, Modern Data Protection, published in May
courtesy of O'Reilly and Associates.
All you have to do to qualify for the drawing is uh, to subscribe to
our newsletter on backupcentral.com.
Just it's right there in the top menu, subscribe.
And in the following week, I will select one new listener.
To receive a free ebook copy.
So our, I've selected the winner from this week, and your name is John Doherty.
Congratulations.
You'll get an email from me and another one from O'Reilly with your ebook.
So.
Back to the podcast.
So our, our next guest is from a part of America that is
connected but not connected.
He has been, uh, in IT for quite some time, just, just about as
long as I have, uh, short of, just short of 30 years it looks like.
And he's actually, this is actually, this is the second time we've had this.
He's had one job that entire time.
He has been at the, uh, he is the IT supervisor at the Kodiak Island Borough.
That would be Kodiak, Alaska.
He's two hours, or no, he's one hour behind us.
Welcome to the podcast Paul Van Dyke.
Thank you very much.
It's a pleasure to be here.
So, uh, how, how did we find you, Paul?
I have followed you on Twitter for, um, a number of, uh, for quite some
time, and, uh, enjoy listening to your podcast, uh, when I, when I have
some free time and space and, uh,
when it gets dark and cold and you're indoors.
Something like that.
That's, that's right.
Listen to that.
Listen to us by the fire.
Yeah.
So outside splitting
firewood or you know, any, any of these, any of these Alaskan activities that I do.
Yeah.
And so then you, you reached out to us, right.
Saying, Hey, you know, 'cause we, because we say this, right?
We say, Hey, if you have, if you want to talk about our favorite topics,
then come on and we will bring you on.
So we, we are so happy to have like an actual
IT practitioner, um, you know, does this I I was about to say in the wild.
I did.
But that, that, that's true.
That leads a whole other, uh, you know, you know that, that, that brings up a
whole other connotation where you live.
What, can you describe what, what it's like where you live?
'cause you know, for those of us that live down here, we have these visions
of what it's like to live in Alaska.
And I have no idea.
You know,
so I'll, uh, I'll, I'll not shatter the, the, the stereotype of interior Alaska
where it is dark and cold all winter long.
We live on an island in the Gulf of Alaska.
It is very similar to the Pacific Northwest, although right now
we're in the, in the mid twenties.
It's snow on the ground.
We rarely, uh, we rarely see single digits Fahrenheit.
Um, it, we, we might see them for a week throughout the winter.
Okay.
Um, mostly we're in the twenties.
We get above freezing and it will, it will warm and, and thaw, thaw and freeze
throughout the winter on, on occasion.
So we don't, we don't go into the deep freeze like interior Alaska
or, and what about the, what about the, the, the, the sunlight aspect?
We are affected by that.
Um, in the, around the solstice we.
Usually get dark, probably three 30 or four in the afternoon, and it doesn't get
light till 9, 9 30, uh, in the morning.
But you don't have this period where you're, where you're dark 24 by seven,
correct?
Correct.
That is, uh, that is Northern Alaska.
Okay.
Because I, um, where was, well, well that's actually where was Northern
Exposure set the, that TV show?
You remember that TV show?
I do remember that that was, that was more interior Alaska, I believe they
were trying to shoot around the, around the Fairbanks area or try to Gotcha.
Because I, I do remember that that was an episode where they had, you
know, there's a period where they get nothing but sun, and then there's a
period where they get nothing but night.
Um,
That was a plot line. So I, I know this is probably going to be my, um,
inexperience talking to people from Alaska, but like how do you
get supplies and stuff like that?
Like you said, you live on an island.
You live in Alaska, right?
Yes.
So everything is either, uh, is either barged in or, or flown in.
And, uh, it's a 45 minute flight to Anchorage, the, the largest
city in Alaska out of Kodiak or, uh, or it's about a 10 to 13 hour
ferry ride from Kodiak to mainland.
I, I, that's, I did not expect that part 10 to 13 hour ferry ride.
Correct.
So how far are you from the mainland?
Um, it's 250 miles from Kodiak to Anchorage.
Okay.
Wow.
And then, uh, and then I'm not sure what the, uh, what the gap is between
Kodiak and the mainland as far as the, uh, the nautical miles that the ferry
takes.
Wow.
That's far.
I did not expect that.
And what, what is the island like from a, you know, what does it look like?
Is it, well, I'll just stop there.
It, you know, it looks a lot like Ireland.
Um, okay.
We do have forest.
We do have, uh, we do have mountains.
Uh, we have one glacier on the island.
Uh, we do have a lot of, uh, a lot of shrubbery, um, alder flat land down
on the southern end of the island.
We're the second largest island in the US, right after the Hawaiian island.
Wow.
Wow.
How big is it?
I don't know.
I just live here.
Here's what I know, and that is that the state of Alaska is so much
bigger than most people think it is.
Yes.
Because of the, and I don't know, perhaps one of, you know the, the way that maps
are done, there's a word for, yeah.
Projection.
Do you know what the word is?
The projection.
Well, it had, there's a, there's a, it's like a person's name, I think.
Yeah,
yeah.
It's a person's name.
Mercator,
yeah.
Is that what it is?
It, it just has to do with, there, there is a way of spreading out a global map
onto a flat surface and it's that style.
And when you do it that way, Alaska looks a lot smaller
than it, than it actually is.
'cause Alaska is actually bigger.
Than Texas is my understanding, or similar size to
Texas.
I went to college in Texas and before I went, my dad bought me a hat that
said if you cut Alaska in half, Texas would be the third largest state.
I didn't wear that hat on campus.
Is it really that big?
It is.
Is that Wow.
Yeah.
See, I, I knew it was big.
I didn't realize it was that big.
That big,
big.
Yeah.
Yeah.
Yeah.
Um, and then, and then, because Alaska appears
in the corner of most continental United States Maps, yeah.
Hawaii.
It's far away.
Yeah.
Yeah, yeah.
Yeah.
Uh, I'm Googling Kodiak Island size, by the way.
12,000 square miles.
How many people?
There you go.
Uh,
uh, we are, uh, a little over 13,000 people right now.
So, so it's so it's a pretty rural, rural, I can't, that
is a word I have trouble with.
Rural.
Rural, uh, world.
Okay.
And so what do, but we do
have, we do have the world's largest Coast Guard base.
We have a rocket launch facility.
We have a, a thriving fishing industry.
And, um.
And, you know, we have state government represented local government, um, and
other, other service industries in Kodiak.
Well, if you have rockets and you have fish, I mean, that's really all you need.
Curtis like sold, so That's right.
So t tell me, so you work for the borough, uh, the island borough.
What?
Tell me what the IT environment is like and what, you know, what, what
do you, what do you need it for?
What do you, you know, what does it look like, et cetera.
We, we have a very small IT off-, uh, IT shop. The, the, the borough
is, um, paramount or, or, uh, akin to a county in most locations.
Okay.
And so, um.
We have the functions of accounting.
We have an assessing department, a finance department, a clerk's office,
an engineering facilities department, and a community development department.
So, um, our number one, our number one, and it's, it's sad, uh, to, to.
To put in this context, but our number one goal is to assess properties
and tax them and collect money to pay for our school district.
Hey, that's, that's important.
Is the, uh, is our school district is the number one expense
tax revenue,
correct?
Yeah.
And so then with that we have, uh, we have other functions to support
development of the community.
And obviously the finance department ensures that everything is accounted for.
Um, and then, so as an IT department, we support all these, all these functions.
We, we do the, uh, the gamut from A to Z.
Um.
We have a virtual infrastructure onsite.
We do a lot of things that are line of business applications to
support the borough that are onsite.
And so we have a, we have an onsite data center.
We do have some functionality in the cloud.
We do use some of the cloud services that other people use, but a lot of
these line of business applications require internal infrastructure.
So we have, we have it.
Go ahead.
So we, we have the, we have infrastructure here to support, um, all those things.
And it, and it needs to be backed up
it seems, given how far you are from the mainland and probably given how far
you are from any public cloud region.
Right.
I'm
sure.
Where, where would be the closest public cloud region, do you know?
Or do you know?
Um.
I believe that there are some cloud, some, uh, public cloud providers in Alaska.
Okay.
I've heard that, uh, some of the telecommunication companies have either
partnered with Azure or AWS and have some, some functionality hosted locally.
Yeah.
My guess is Equinix has some data centers out in Alaska.
Okay.
That some of the public clouds are probably using.
One of the challenges that we have is Alaska is also known as
the Ring of Fire, and so we are very seismically active. In 2007,
we, we got fiber optic communications to the island, but before that
we had satellite internet.
So, um, a lot of my early IT career, we were under satellite internet,
and so we didn't feel comfortable,
uh, outsourcing. Or, or, or, or cloud services
were very new at that time anyway, but we felt like we were an island and we had
to have all of our resources on island.
Now, with better infrastructure and more reliable infrastructure,
uh, we're able to.
To look at outsourcing it.
So when you had all your infrastructure back in 2007 on the island, how did
you deal with like disaster recovery?
Was there, like, is there like another island like nearby that you,
that that's a, that's a good question.
Um, fortunately we never had to do disaster recovery.
Uh, although I, I, I, I, I do have a story from 2001.
Um, this was almost a disaster,
but, uh, yeah, we'll get, we'll get to that.
Yeah.
But, uh, but really we were, uh, you, we cataloged what we had and, and, uh, we've,
we, we just, we haven't had to do it, but we had a, uh, we had a, a plan in place
that we would just acquire more hardware.
Um, I also, as, uh, as part of my job, I am also part of our
emergency, um, operations center, and I'm the logistics section chief.
With our emergency operations center.
So, um, you know, bringing in supplies to our community Mm-hmm.
Would be something that I would be responsible for doing.
Is that, is that a volunteer position or that's part of your
job working for the borough?
Uh, it is part of my job working for the borough.
Okay.
The borough and the city are jointly responsible for emergency response.
So we have, uh.
A city government here in Kodiak as well, and they have, uh, police department, fire
department, and other, other resources.
How many boroughs are on the island?
There's just one borough.
Okay.
And the borough covers Kodiak Island and a portion of the
mainland across the Shelikof Strait.
Oh, weird.
Interesting.
It's, um, it's a, it's a function of the watershed that is on the mainland
that drains into the Shelikof Strait for, I believe, as it was explained
to me, for fisheries resources.
So rivers and streams on that side are part of our borough,
uh, for That makes sense.
Fish habitat.
Right.
As a, as a fishing community, it, it, it's important.
It, uh, we have a vested interest in it.
Yeah.
Um, and yeah.
By the way, I did check there is an, um, a us Alaska region in AWS
So I have us, I have central Eastern, east Indiana, Pacific, and Alaska.
So Alaska is its own, uh, AWS um.
According to a website that I just looked at.
That's literally the extent of my research.
But Prasanna, do you know any different?
I do not.
Okay.
Alright.
Yeah, it, it looks like it's its own region, so, um.
So, and, and so then the other thing I would have is in preparation
for any kind of disaster, which I think, so what kind of disasters do
you need to prepare for up there?
Obviously fire, like a giant fire would be a problem.
Do you have, you know, you, you don't have tornadoes or hurricanes
or that sort of thing up way.
Right?
Probably have tsunamis.
Tsunamis.
We,
tsunamis are, are one thing.
We do have wind events.
Um, no we don't have hurricanes 'cause they refuse to call, uh, the windy
day last week a hurricane, even though it was blowing 70 miles an hour.
Hmm.
I wonder if they had derechos.
Yeah.
Did you hear our episode about Derechos?
I don't think I did.
Yeah.
Derecho is a land hurricane.
It's a hurricane that starts over land and we had.
A guest on who was in the middle of a derecho with, and the thing is, unlike
a ocean hurricane, it just, it's more like a tornado in that it just happens.
So he just, he just was on his porch.
I think he was out on his yard or something, right?
Prasanna?
Yeah, he was out
and then he grabbed the dog and ran back in, I think.
Yeah.
Do, uh,
Prasanna, do you have the, yeah, it's episode number 126, Stop ransomware
attacks in seconds with Greg Edwards.
Right?
Yeah.
So you wouldn't know it from the title, but Yeah.
There, we talked to him.
So it's called a derecho, uh, which is weird.
It's like the Spanish word for right.
But it, it, um, it means, it, it's a land hurricane, which is just, um.
Yeah.
So maybe, maybe that's what, maybe you just need to get, you know, need to
explain to these people, Hey, we didn't have a hurricane, we had a derecho.
Um, no, we, we have, we have low pressures in the Gulf of Alaska,
and that brings about, uh, um, crazy winds, strong winds and, uh, wind.
Um, we did have a, uh.
In our data center.
Was that in the middle of winter?
Yes, it was actually the, the room next to our data center is the, uh, is the
mechanical room for the building and there were some louvers that were stuck open.
And so a, uh, a coil froze and then it thawed out and so that waterline
broke and ended up flooding.
Um.
Flooding our data center to, to some extent, we had a, a couple inches of water
on the floor and, um, it, it drained.
It drained through the.
Into the basement, but, uh, it was, uh, it was a little scary.
That's also where our electrical connections go through the floor.
Oof.
Into the basement as well.
But it was, it essentially a non-event, though?
It was a non-event?
Uh, we did have, we did have some backup tapes sitting on the counter
and we asked the, the maintenance guy who was wearing rubber boots.
To walk into the water and to grab those backup tapes.
Very important.
Very important.
Paul, so,
do you use tapes mainly for your backups?
Because I know you mentioned that you had the maintenance
guy go in, grab some tapes.
I, um, we currently, we use, uh, we use tapes, d uh, deduplication
appliances, offsite storage, and uh, and then local storage as well.
So all the things.
We wanna be secure.
We want to be protected.
And, and what do you do to get, you know, give, especially given that
you're an island, what do you do to, you know, separate a copy of the backups
from the thing that you're protecting?
We have, we have a, uh, a safe.
In our data center, we also have a safe across the street
in a, uh, in another building.
So we take our, we take our tapes offsite, which may be, uh, 200, 250 feet away.
You walk 'em across the street, right?
Yes.
Yeah.
And, and that's good.
I mean, is there, is there any concern, you know, have you had discussions
of, you know, if there was something like a flood or anything like that?
Is there any concern that you know, that you have things too close together there?
Um, a small tactical nuke could, uh, could take, take out my,
my disaster recovery plans.
Mm-Hmm.
Um, I think that would take out their problems.
Yeah.
The whole, the whole island.
I think
that's right.
That
really, um, you know, my backups are, are for, for the use cases that I have.
Um.
If, if our data center were to die, um, if the building were to collapse in an
earthquake, you know, I would be looking for additional hardware to restore onto.
Mm-Hmm.
Um, my backups are really, uh, primarily used for accidentally deleted files.
Ransomware, um.
And, and any localized, you know, localized disasters.
Um, we have talked about doing cloud-based backups, and we do have more bandwidth
available to do cloud-based backups.
But, uh, bandwidth is expensive, as we talked about living on
an island, and until recently it's been rather restricted.
Right?
So to do cloud-based backups would also mean looking at cloud-based recovery.
And Right.
That you
wouldn't be able to bring it back in case we have not made
that decision yet.
Right.
Well, I guess I, I guess, and I, I wasn't even necessarily going to there,
although, you know, I do work for a cloud company and obviously that's,
that's our solution for everything.
Having said that, uh, I was just thinking about, I don't know, an occasional
copy of tapes being FedExed to Anchorage or something, you know, even if
it's infrequent, just just to have a copy that's a little farther away
than a few hundred feet, or having,
or having like a building on the mainland that's still part of the borough.
You just ship the tapes there.
Yeah.
And, and we have, we have talked about, uh, talked about some of those solutions.
We've also looked at, uh, you know, moving tapes to Iron Mountain.
Um, a lot, but a lot of it depends on, as you asked earlier, what are
the things that we, what are the, what are the hazards and, and how
are we restoring from those hazards?
Right.
Right.
So let's talk about, um.
You know, you, you, you gave us a couple of stories in
your email when you wrote me.
Um, I, I really like this first one that, that
intentionally destroying my complete environment.
That is.
I, one word comes to mind, and not everybody can say this
word, but the word is chutzpah.
Uh, guts, my friend. Intentionally destroying your complete environment
to test your backup tapes or to test your backup system.
Um, you really gotta tell us about that.
So, so, well, first off, what, what possessed you to, to do that?
I ha I had a purpose.
I, I, I honestly had a purpose and, uh, and yeah, when you put it like that, it,
it sounds like, um, sounds like I was missing a few IQ points on that test.
Hindsight is 20/20 and, and, and to survive it is, uh, is, is really
where the, uh, where the beauty is.
So this was, this was post, um.
This was about 2001 Mm-Hmm.
And we had, we had invested in our infrastructure.
We had moved away from, uh, you know, PCs as servers and custom built things.
And we were moving into more industrial acquired servers.
And we had purchased through, uh, through two fiscal years, we had purchased
five Compaq ML370 servers.
Because we purchased them through two fiscal years.
We had three.
We, we, we, we had, we had some that had 9.1 gigabyte drives and some
that had 18.2 gigabyte drives, but they were all configured to be about
45 gigabytes of RAID 5 storage.
And so we were trying to be 100% by the book.
We installed these five servers.
We had two domain controllers, an email server, a file server,
and an application server.
Just completed my MCSE training, and so we were doing this as a
standard rollout as much as possible.
After about a year, the usage on disk was very asymmetrical.
Our domain controllers didn't use very much space.
Our email server didn't use very much space.
In the early two thousands, our file server was rapidly growing
and we were adding applications to our application server.
So the 45 gigabytes was filling up in an asymmetrical fashion, and
I started looking at the disks.
We had 9.1 gig drives and 18.2 gig drives, and I said,
well, if I just move some of these disks around and I take four of the 9.1 gig
drives and build a RAID array in the first three servers, and then I move the 18.2
gig drives to the last two servers, I will have matched my storage with my workload.
So I ran a full backup on Friday night, and I came in Saturday morning.
Sorry.
So you weren't ju you weren't just testing backups.
You had a, a purpose, an alternate, like you had an extra purpose besides
just testing your, your backups.
I, I
was trying to match my, my disk space in my servers to the usage of
the, of the demand on these servers.
Gotcha.
I, I just want, I, I pulled up the stats, by the way, on a.
On a, uh, Compaq, uh, ML370 and, uh, that comes with a maximum
of four gigabytes of RAM, a Pentium III one gigahertz processor.
And, and here's the best part, an integrated dual channel
Wide Ultra2 SCSI adapter.
Nice.
Back in the day.
So, so basically you, that's what you mean by basically by pulling drives
apart, you destroyed any RAID that was going on and you required, uh, these
RAID arrays to be completely rebuilt, which would zero everything out.
And then, and then you do the restore.
Correct.
Uh, and now you, you mentioned two RAID arrays, right?
Well, every, every one of these servers had its own RAID array.
Okay.
I had two different size drives, so
I Right.
I moved the drives around because each system had a mix of the drive types.
Well, um, as I bought them, I, I had two servers that were full of the 9.1 mm-Hmm.
Gigabyte drives.
And, and each server held six drives.
Yeah.
And then I had bought three servers that had the 18.2.
Gotcha.
Gigabyte drives and I only had, uh, four, drive four of those
drives in, in those three.
Gotcha.
Each of those three servers.
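The drive shuffle described here can be checked with a little arithmetic. The sketch below assumes the standard RAID 5 rule that one drive's worth of space goes to parity, so usable capacity is (drives - 1) x drive size. It ignores hot spares, formatting overhead, and controller limits, and the function name raid5_usable_gb is made up for illustration. The before and after layouts use the drive counts Paul gives (two servers with six 9.1 GB drives, three servers with four 18.2 GB drives). The raw figure for the 18.2 GB servers comes out a little above the "about 45 gigabytes" he mentions, so the real configuration probably differed in some detail.

```python
def raid5_usable_gb(drive_count, drive_size_gb):
    """Usable capacity of one RAID 5 array: one drive's worth is spent on parity."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return (drive_count - 1) * drive_size_gb

# (drive_count, drive_size_gb) per server, from the counts given in the conversation.
before = [(6, 9.1), (6, 9.1), (4, 18.2), (4, 18.2), (4, 18.2)]

# Planned layout: four 9.1 GB drives in each of the first three servers,
# all of the 18.2 GB drives split across the last two
# (presumably the file and application servers).
after = [(4, 9.1), (4, 9.1), (4, 9.1), (6, 18.2), (6, 18.2)]

for label, layout in (("before", before), ("after", after)):
    sizes = [round(raid5_usable_gb(n, size), 1) for n, size in layout]
    print(label, sizes)
```

Under these assumptions the three lightly used servers drop to roughly 27 GB each and the two busy ones grow to roughly 91 GB each, using the same 24 drives.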
So
yeah.
So basically you, you, in one move, well, a series of small moves wiped
out the storage arrays on five servers.
Is that Yes, yes.
Oh boy.
And everyone was okay with this.
I had planned on doing it.
I, I explained what I was going to do.
And we trusted our backup tapes.
And by the way, the tapes did fine.
They just, everything,
everything was restored over a weekend and it only took me
five days over a weekend in five
days.
What,
so what, so what was that like?
Uh, come Monday morning.
And you had moved into the data center, uh, I'm assuming, is
that what happened, by the way?
So I,
I slept in my office Sunday night because the amount of time it took to rebuild the
RAID arrays, to initialize them, and then to start restoring data, which I didn't
realize that it was gonna take longer to restore data than it was to back it up,
which was, which was my first lesson.
Why is that?
Why is that Paul?
Do you know why it takes longer to restore than to back
up?
I have not listened to your podcast long enough to to answer that question.
That, by the way, massive suck up response.
I love it.
I don't think we've covered this parti this particular topic, so
that's why I want to bring it up.
I'm guessing based on some numbers that you've thrown out
that this was parity-based RAID.
Yes.
Right.
This was a RAID 5.
Probably RAID 5.
Okay.
RAID 5.
That's the answer to the question Prasanna.
Why does it take longer to write?
Because it has to compute the parity
across everything.
Yes,
yes.
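For readers who have not seen how parity works, here is a minimal sketch. RAID 5 parity is the byte-wise XOR of the data blocks in a stripe, so every write has to keep a parity block up to date on top of writing the data itself. The block contents and the helper name xor_blocks are illustrative assumptions; a real controller does this in hardware or firmware and rotates the parity block across the drives.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe across three data drives (made-up contents).
stripe = [b"mail", b"file", b"apps"]
parity = xor_blocks(stripe)

# If one drive is lost, XOR of the survivors and the parity gives its block back.
survivors = [stripe[0], stripe[2]]
rebuilt = xor_blocks(survivors + [parity])
```

The same XOR that protects the data is extra work on every write, which is one reason a restore onto a freshly built RAID 5 array can run slower than the backup that read from it.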
Um, there's also, there's also another potential, depending on the
backup product that you're using.
Are you doing any kind of multiplexing?
I.
When you're, when you're doing backups.
I wasn't, um, at that point.
Okay, well, mm-Hmm.
Because that would've made it worse if you were, and, and so let's
just talk about that for a minute.
So the multiplexing is evil.
Uh, it's a necessary evil.
I, I always felt like if, if you're going to tape as tape
got faster and faster, you, you,
you know, backup speeds that were a few megabytes per second
were completely incapable of making an LTO tape drive happy.
Even older LTOs, let alone modern LTOs.
And so a lot of backup vendors came out with multiplexing where they take
a bunch of little streams and they interleave them block by block onto a
tape, which solves the backup problem.
But then when you go to restore, you have to read all of that
data and throw away most of it.
So it makes a really crappy
restore speed.
But in your case, I, yeah,
as long as you never have to restore, it's fine.
As long.
Exactly.
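The multiplexing trade-off described above can be shown with a toy model. This is only a sketch: the stream names, the block labels, and the functions multiplex and restore are invented for illustration, and it simplifies by reading the whole tape on restore, whereas a real product can stop once the last wanted block has gone by.

```python
def multiplex(streams):
    """Interleave blocks from several backup streams round-robin onto one tape."""
    tape = []
    longest = max((len(blocks) for blocks in streams.values()), default=0)
    for i in range(longest):
        for name, blocks in streams.items():
            if i < len(blocks):
                tape.append((name, blocks[i]))
    return tape

def restore(tape, wanted):
    """Read every block on the tape, keeping only those from the wanted stream."""
    data = []
    blocks_read = 0
    for name, block in tape:
        blocks_read += 1
        if name == wanted:
            data.append(block)
    return data, blocks_read

# Four slow clients sharing one tape drive (made-up data).
streams = {
    "file": ["f0", "f1", "f2"],
    "mail": ["m0", "m1", "m2"],
    "app": ["a0", "a1", "a2"],
    "dc": ["d0", "d1", "d2"],
}
tape = multiplex(streams)
data, blocks_read = restore(tape, "file")
```

In this toy example, restoring one of the four streams means reading all twelve blocks to get three back, which is the "read all of that data and throw away most of it" problem.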
Uh, but in your case, I think what you had was the RAID, the, the parity, right,
penalty.
And so how long did you think it was gonna take?
My backups were usually done by, by midday Saturday.
So.
Why should it take longer than, uh, the, than the period of
backing up to, to restore it all?
I, I did get four servers up and running by Monday morning.
I had to sleep in my office.
Mm-Hmm.
Sunday night.
So I was here to change tapes in the middle of the restore
to get the file server running.
Then the application server, which had its own complexities, um, from
running multiple applications and trying to get a, a good backup of live
applications, uh, took an additional three days to, to get up and running.
Yeah, I was gonna ask you if you were able to get everything up and running, but it
looks like minus the application server, everything was good to go by Monday.
So you're cri you, you prioritized critical applications.
Applications that would get you yelled at, basically.
Email file server logins.
Those were, yeah, yeah.
Those, those all sound really important.
Uh, you know, uh, I, I don't know.
I, what, what's the equivalent of CEO there?
Uh, uh,
the borough
manager, the Borough manager's laptop.
If that was part of this, that would, that would go.
Um,
yeah.
So do you remember how many.
Tapes, like you said, you were swapping out tapes.
Do you remember like, because I could imagine if you're sleeping
in your office at like, probably like three in the morning, a tape
probably finishes restoring and you're like, dammit, I gotta wake up now.
I I only had two tapes.
Okay.
Two tapes backed up my entire environment.
Um, I was running the Exabyte M2
tape drive.
So this was the Mammoth drives.
So, yes, so Exabyte had Mammoth, Sony had AIT, so this was the next
generation of eight millimeter drives.
Because the, 'cause my, the first tapes I cut my teeth on were
Exabyte 8200s, which were the,
The, they were like, I don't know, uh, one gigabyte or
something on those, those drives.
But the Mammoth was their attempt at large, and so
they had 60 gigabytes native,
120 gigabytes compressed is what the, the, the advertised capacity was. By
the way, Exabyte, best company name ever.
Right.
But that company is no more.
Right.
The company that made those drives is, uh, is no more.
And, and, and back when Exabyte was named Exabyte, we were all like,
we're never gonna have an exabyte.
Now we're, you know.
No, it's, it's crazy.
But yeah.
So this was old school.
This was a, um, a cassette, helical scan tape.
We talked, we talked about helical scan a week or so.
Actually, you, you haven't heard it.
Well, the listeners may have heard it by this point, but you haven't heard it
'cause we haven't broadcast it yet, Paul.
Not the fastest tape drives in the world.
So you were able, I guess after five days, get all the data back, get the
applications and everything else up and running, and you still had your job.
I'm still, I'm still here 20, 28 years plus later I'm still here.
Yes.
And, and, and again for, for those who are listening who say, well,
gee, the tapes held a maximum of 120 gigabytes with compression.
You know, if the tapes held a maximum of 120 gigabytes, so that's a maximum
of 250 gigabytes, you had to restore.
What's the big freaking deal?
Well, the big freaking deal was that that was a ton of data back then, right?
That was a really big, let's see, the, the transfer rate, I'm showing it 12, if I did
my math right, 12 megabytes per second, which sounds about right given the.
The generation and timeframe.
So yeah.
So it's only 120 gigabytes, but the advertised thing
was 43 gigabytes per hour.
But clearly you weren't getting that, that that was the problem.
You were not getting the advertised transfer rate because of the write
penalty that you were experiencing when you were doing the restore.
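The arithmetic behind those numbers, as a back-of-the-envelope sketch. It assumes decimal units (1 GB = 1000 MB), two completely full tapes at the advertised 120 GB compressed capacity, and that the 43 GB/hour figure applies to the restored data volume. The one-quarter-speed case is purely hypothetical, not a rate quoted in the episode:

```python
def gb_per_hour_to_mb_per_sec(gb_per_hour):
    """Convert a tape drive rating from GB/hour to MB/s (decimal units)."""
    return gb_per_hour * 1000 / 3600

def restore_hours(total_gb, mb_per_sec):
    """Hours needed to move total_gb at a sustained mb_per_sec."""
    return total_gb * 1000 / mb_per_sec / 3600

advertised = gb_per_hour_to_mb_per_sec(43)            # about 11.9 MB/s
total_gb = 2 * 120                                     # two full tapes (assumed)
best_case = restore_hours(total_gb, advertised)        # about 5.6 hours
quarter_speed = restore_hours(total_gb, advertised / 4)  # hypothetical slowdown

print(round(advertised, 1), round(best_case, 1), round(quarter_speed, 1))
```

So 43 GB/hour and 12 MB/s are the same rating, and at that rate the whole environment fits in well under a working day. Anything that cuts the sustained write rate on the target disks stretches that proportionally.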
So what, what, uh, so, so we've already covered it.
You, you made it through.
Everything restored.
Clearly you didn't meet the, the objective, you know, the, the initial
time objective, but you got everything back and you got the critical things
back by Monday morning, and so I, I'm guessing that you didn't.
Like there wasn't a, was was there one of those giant postmortem sessions?
You know, I, I don't think it was till, uh, till maybe a, a week or
two down the road that I realized how incredibly stupid that was.
So you, so you, you then had a postmortem with yourself is what you're saying?
I did.
I did.
You, everybody was happy.
I, I mean, right back in, back in that day, uh, you know, probably, uh, a,
a a month or two later, I had talked to somebody about, about storage area
networks and, and they had given me a quote for, oh, you want to, you want to
add storage to your servers with a storage area network that'll only be $30,000.
In, in, in 2001.
Right.
And, uh, you know, I, I felt rather accomplished.
Yeah.
I, I had done what I intended to do, you know, granted it was a, a sleepless
night on the floor of my office.
But, so you're saying that because you reorganized this data, the, the
storage and you reallocated the storage more efficiently, you didn't need the
$30,000 SAN, is that what you're saying?
Correct.
So I had achieved the objective that I was after, and I saved my
organization money in my mind.
Did you tweak or change any of your backup restore plans based on this experience?
I think I changed my expectations on my restore plans because it all restored,
but, but at the same time it, it was the knowledge that when you're backing
up live applications that are running.
They, that's a difficult, that's a difficult thing.
And so now I see, you know, um, in, in virtual environments where,
where things are, um, and I'm missing the right word for it, but
where things are flushed and, um.
Quiesced, brought to rest.
Yeah.
So, so you recognize, I mean, you, you saw when you began telling me
this story, you know, when you started telling us this story, the, the, the
idea of essentially completely deleting your entire data center and then using
your backups really for the first time.
I was a bit flabbergasted.
But you you,
I was young.
I was young.
You're agreeing with me that this was a really bad thing to do.
It sounds like
Paul's gotten wiser with the longer beard and.
Since those days.
Yeah.
By the way, for,
for the, for the listeners here, white there, in
here,
I, I, I, I do think it's appropriate to mention, so, you know, we record
this as audio, but I'm looking at a camera version of, of, you
know, my co-host and my guest here.
And I am the only one who doesn't have this long flowing beard Prasanna.
Still has this yeared, uh, how long?
How long now?
It's now I think
like 19 months.
19 months without shaving.
So he's got this long, uh, much blacker beard than uh,
Paul, but Paul has the length.
Paul ha Paul, I'll just say this, Paul looks like he's from Alaska.
He looks exactly like what I would expect from someone from Alaska.
He has a redneck cap on and this long, you know, beard, although
quite a bit grayer than Prasanna and it may have been this event Paul.
That put that you're like, you know, it's one of those things where
you're like, I did this to myself.
I have no one to blame but myself.
I'm sure you said that many times during that event.
So what, so going back, what, so your goal was laudable and your
eventual results were successful.
What would you have done differently?
I.
To accomplish the same goal, but without perhaps the amount of pain that you had,
I, I would've, I would've had a safety line, uh, or, or some
sort of, sort of safety rope.
Um, I think that, uh, you know, if I had a, a, a spare server or a surplus
server, that I could have migrated each one of my servers over one at a time.
Planned outages, so I wasn't destroying everything.
I had no, I had no capacity, no, no RAID arrays left,
right.
You, yeah.
After pulling
drives out, you know, there were no servers that were functioning until I
started restoring, restoring my domain controller from that very first server.
I, I mean, I, I had a, a, a textbook Windows 2000 environment
with two domain controllers.
Both of those were offline.
My email server was offline.
My file server was offline.
My application server was offline.
Everything was offline until I started restoring.
Now it is, I would, I would just start in with one server and, and,
and work through it methodically
with a, I just wonder.
I wonder the degree to which that would've been possible given, you know, I don't
have the, I don't have a whiteboard.
I don't know how much you, because you were moving, drives around
and reallocating resources.
I don't know the degree to which that would've been possible.
It would've cost some money.
But in, in hindsight, I mean, how bad could I have screwed up?
I mean, if, if my, if, if I had, I don't think it
could have been any worse.
I, I mean, if I had accidentally dropped my tapes or, uh, ran them across the,
a, a magnet, uh, between point A and point B and lost my backups, I, I,
it would've been a be
I'm so far beyond it that I don't, I, I don't think about these things.
Well, the, but now you're here and you're talking to us, and so we're
asking you to relive that horrible day.
So I, I, I, I think if, you know, looking back on it, this is, and again, you know.
Love you Paul.
Uh, thanks so much for coming on and being, being, uh, open at, at the same
time, I'm gonna yell at you a little bit.
Um, to me, your core, your core failure was failure to test at least one
restore prior to doing this, right.
Um, because you blew up your entire environment without any idea.
What restoring even one server was gonna be like if, even if you had just
blown up one server and restored it because your problem, everything worked.
Everything worked.
Your only problem was a failure to set proper expectations even within yourself.
And that was because you'd never actually done a large restore with your backups.
By the way, you are not alone Prasanna.
Is he alone?
Not at all.
And, and, and I will also say that I have been in this situation before.
I'll, I'll tell you a similar story.
A hundred years ago when I bought a, my first commercial backup program, the pro,
the pro, the, the product was called SM-arch,
for archive, even though, which now offends me because it wasn't back,
it wasn't archive, it was backup.
But anyway, SM-arch, which was from a Minneapolis company, they're,
they're no longer. Software Moguls is the name of the company.
And I had bought this as my first commercial backup tool, and we had had
it for a couple of months, but I was still running my dumps to my old tapes.
In the meantime.
Right.
And then we had this, our first large outage, we lost the dis drives on
our primary file server H pfs oh one, I still remember the server name.
And that was 25 years ago.
And I, I was so excited.
I grabbed my SM-arch tapes and my
dump tapes, put them in my back pocket.
And I ran, I remember dri because it, it was a couple miles down the road where
the, where the other data center was.
And I remember running down there throwing in the tape drive.
And I remember kicking off the restore.
And what I remember was this was Blink.
Blink.
Long period.
Blink.
Blink long period.
And what I did was I created a while loop and I was watching the, the, the, the size
of the file system not grow totally okay.
At least not, not by a speed that was gonna finish anytime that millennium.
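A minimal sketch of that kind of watch loop, written in Python rather than the shell loop Curtis would have typed. `watch` and `projected_hours` are illustrative names, and the sample numbers below are invented:

```python
import shutil
import time

def watch(path, interval=60, count=5):
    """Sample used bytes on the filesystem holding `path` every `interval` seconds."""
    samples = []
    for i in range(count):
        if i:
            time.sleep(interval)
        samples.append((time.monotonic(), shutil.disk_usage(path).used))
    return samples

def projected_hours(samples, target_bytes):
    """Estimate hours remaining from (seconds, bytes_restored) samples."""
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    rate = (b1 - b0) / (t1 - t0)  # bytes per second
    if rate <= 0:
        return float("inf")  # not growing: it will never finish
    return (target_bytes - b1) / rate / 3600

# Invented example: 1 GB restored in the first hour of a 10 GB restore.
eta = projected_hours([(0, 0), (3600, 1_000_000_000)], 10_000_000_000)
stalled = projected_hours([(0, 5), (60, 5)], 10_000_000_000)
print(round(eta, 1), stalled)
```

Running something like this on a single test restore gives a realistic time estimate before anyone has to depend on it.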
And so I called the, the tech support and I was like, Hey, uh,
you know what, what's going on?
And they go, well, by any chance did you turn on the compression feature?
Yes, I did.
It was a software compression feature, which, which.
The way it worked was, this is old school.
It would, it would run a compress -c, I think it would be, uh, to,
to compress the file, to send it to standard out, and then redirect it to a file in
temp, and then back up that compressed file. During a restore, it would
restore the compressed file into temp, then run uncompress on the file
in that location and then move the file from temp, the uncompressed
version of the file, from temp.
There's a lot, you know.
Anyway, long story short, it was never gonna finish in
any sort of reasonable time.
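The per-file workflow described above can be sketched roughly as follows. This is a reconstruction from the description, not the product's actual code: it uses Python's `zlib` in place of the Unix `compress` utility, ordinary directories stand in for the tape and for temp, and all function and file names are hypothetical.

```python
import os
import shutil
import tempfile
import zlib

def backup_file(src, temp_dir, archive_dir):
    """Compress src into temp, then 'back up' the compressed copy."""
    staged = os.path.join(temp_dir, os.path.basename(src) + ".z")
    with open(src, "rb") as f, open(staged, "wb") as out:
        out.write(zlib.compress(f.read()))
    return shutil.move(staged, archive_dir)

def restore_file(archived, temp_dir, dest):
    """Restore the compressed copy into temp, uncompress it there, move it into place."""
    staged = shutil.copy(archived, temp_dir)       # pass 1: "tape" -> temp
    plain = staged[:-2]                            # strip the ".z" suffix
    with open(staged, "rb") as f, open(plain, "wb") as out:
        out.write(zlib.decompress(f.read()))       # pass 2: uncompress in temp
    os.remove(staged)
    return shutil.move(plain, dest)                # pass 3: temp -> final location

with tempfile.TemporaryDirectory() as root:
    temp_dir = os.path.join(root, "tmp")
    archive_dir = os.path.join(root, "tape")
    os.makedirs(temp_dir)
    os.makedirs(archive_dir)
    src = os.path.join(root, "report.txt")
    data = b"quarterly numbers\n" * 1000
    with open(src, "wb") as f:
        f.write(data)
    archived = backup_file(src, temp_dir, archive_dir)
    restored = restore_file(archived, temp_dir, os.path.join(root, "restored.txt"))
    with open(restored, "rb") as f:
        roundtrip_ok = f.read() == data
print(roundtrip_ok)
```

Every file goes through temp on the way back, one file at a time, with a separate uncompress step in the middle. That works, but across an entire file server the extra passes add up.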
And thank God I still had my other tapes, but this was all because I did
the same thing you did and that was, I had never tested a large restore.
Yes.
And um, you know, in your case, thank God you
had enough time.
Thank God you were able to restore the critical servers in time
so that, you know, nobody's pulling their hair out and your, your borough could
continue to do its function. In my case,
uh, thank God I had the other tapes in my back pocket because I just pulled them out
and just typed, you know, ufsrestore.
You know, I was going to say that, um, one thing for testing, right?
I know Paul, you mentioned that you had two domain controllers, right?
I think potentially you could have taken down one of the domain
controllers and sort of done a restore of that domain controller.
To test it out while still keeping the entire environment up and running,
and then made sure, and you probably would've noticed, hey, my backups
are slow, or My restores are slow,
so he shouldn't test it on his most important, most critical server.
There are so many things that
could have been done way, could, could've, shoulda, would've.
It's okay.
We all learn.
We all learn from these lessons.
Honestly, that's
why you're here.
Yeah.
You and there.
Have you ever seen there, there there's a company called despair.com.
Have you seen this company?
I have not.
So they, they make, they make de-motivation posters and, uh,
one of them, one of them is a picture of a sinking ship.
They look like motivational posters.
They're demotivational posters, and one of them is a picture of a sinking ship
and it says, I think it says mistakes.
And then it said it could be that the purpose of your life is to
serve as a warning to others.
I would much rather people listen to this and run
from, from my, from my decisions than to, than to think that I had good decisions.
But one of the funny things though, Curtis, is in your book, I remember
you telling me the story, right?
That when you're writing the book, right, and you were sending it
out for all the reviewers, right?
I.
Someone came back to you and was like, Hey Curtis, you forgot to put
a chapter about testing backups.
Yeah.
I completely left testing out of my book and thanks to Stewart Guy, it's
like, it's like the fourth time that Stewart gets credit on the podcast.
You were right.
Stewart
Guy, his name's Stuart Little like.
Come on.
He's a mouse.
I love you, Stuart.
Anyway.
All right.
Well, Paul, I, I, again, want to applaud you for coming on, for being so honest
about what was clearly, clearly a very large mistake that you made it out alive.
You accomplished your goal.
Right.
It, it's not like.
I mean it, this could have been much worse, right?
It could have been.
I intentionally destroyed my entire environment and then I
found out my backups don't work.
It could have been that one.
That could have, it could have.
It wasn't that, thank God you, it wasn't that.
But thank you.
Like, just this, this is, I mean this may be the best story we've
had on, you know, on the podcast.
We've had some other people that have had
bad things happen to them.
This is the first time where it was, you know, self-inflicted
user created.
Normally we're blaming the end user.
We don't blame ourself.
Yeah.
This is a classic PEBKAC situation.
Are you familiar with that acronym?
Absolutely. Problem exists between keyboard and chair, for those that don't.
Uh, and then the entire environment became FUBAR, which is, uh, look that up.
Fouled up beyond all recognition.
Yeah.
So wouldn't you agree, Prasanna, this is like.
This has been great.
This has been an awesome story and it's something that I don't think
end users realize what goes on.
Like I know sometimes we blame end users for, oh, you did this, you did that,
but everyone's human things happen.
So don't get frustrated at your IT people.
Absolutely.
By the way, and and I, I say this every once in a while, you know,
there's only two industries where they refer to their customers as users,
IT and drug dealers.
Yeah.
You know, it's a thing anyway, so, uh, tha yeah.
And you know what?
If you're out there and you have a story like this, we'd love to
have you come on and tell it.
We'll even let you be anonymous.
If you're embarrassed about what happened, you know, we'll give you a pseudonym.
Uh, you know, we, we did some Harry Potter characters for a while.
We'll pick, you know, pick your favorite book and, uh, you know, whatever.
We'll make you a, we'll make you one of the Eternals from the movie
that just came out and whatever, you know, whatever you want to be.
If you just, we just love great stories because we learn from it, right?
That's the key.
Mistakes happen.
Um, you know, we learn.
So thanks, uh, to, thanks Paul so much for coming on.
Thank you
and thanks Prasanna for, uh, your insight into this as well.
Anytime
Curtis.
And thanks Paul.
And, uh, thanks to the listeners and remember to subscribe.
That is a wrap.