Backup from Hell: SMB vs 400TB

Experience the backup from hell in this eye-opening episode of The Backup Wrap-up. What started as a straightforward 40TB backup spiraled into a months-long battle with 400TB of data, failing tape drives, and directories containing hundreds of millions of files.
Host W. Curtis Preston shares his first-hand account of tackling this backup from hell, including the challenges of dealing with SMB protocol limitations, tape drive failures, and the infamous "million file problem." Learn why backing up 99 million files in a single directory isn't just challenging - it's nearly impossible over standard protocols.
Discover the solutions that finally worked, from switching to disk-based backup to implementing local tar backups. Whether you're a backup admin or IT professional, this episode offers valuable insights into handling extreme backup scenarios.
You found The Backup Wrap-up, your go-to podcast for all things
Speaker:
backup, recovery, and cyber recovery.
Speaker:
In this episode, you'll hear the harrowing tale of what I'm
Speaker:
calling the backup from hell.
Speaker:
A project that started as a simple one-time backup, a 40 terabyte backup
Speaker:
of two Synology boxes, that turned into a 400 terabyte nightmare
Speaker:
that took months to complete.
Speaker:
We're talking hundreds of millions of files with one directory alone
Speaker:
containing 99 million of them.
Speaker:
I'll share how I dealt with failing tape drives, ridiculously slow
Speaker:
backup speeds, and the ultimate solution that finally got the job done.
Speaker:
If you've ever wondered what happens when everything that could go wrong
Speaker:
with the backup actually goes wrong,
Speaker:
this episode is for you. Plus, you'll learn some valuable lessons about what to check
Speaker:
before starting a massive backup job.
Speaker:
By the way, if you don't know who I am, I'm W. Curtis Preston, AKA Mr.
Speaker:
Backup, and I've been passionate about backup and recovery for
Speaker:
over 30 years, ever since
Speaker:
I had to tell my boss that we had no backups of the production
Speaker:
database that we just lost.
Speaker:
I don't want that to happen to you, and that's why I do this show.
Speaker:
On this podcast, we turn unappreciated backup admins into Cyber Recovery Heroes.
Speaker:
This is The Backup Wrap-up.
Speaker:
Welcome to the show, and if I could ask you to just take one quick second
Speaker:
and, uh, subscribe or follow us so you can make sure that you get all of this
Speaker:
great content, that would be great.
Speaker:
I'm W. Curtis Preston, AKA Mr.
Speaker:
Backup, and I have with me a guy that apparently owes Ben Kingsley
Speaker:
a huge apology Prasanna Malaiyandin
Speaker:
how's it going?
Speaker:
Prasanna, why do you owe
Speaker:
an apology?
Speaker:
so as everyone's probably like, who's Ben Kingsley.
Speaker:
So if you don't know, he is an actor and he also played Gandhi in the movie Gandhi.
Speaker:
He did.
Speaker:
Right?
Speaker:
And for the longest time I was a little, not upset, but like the fact that you have
Speaker:
like probably one of the most important Indian people in history being played
Speaker:
By a guy with the name Ben Kingsley.
Speaker:
Exactly.
Speaker:
Yeah.
Speaker:
Ben Kingsley.
Speaker:
And so today I found out that Ben Kingsley is actually Indian.
Speaker:
Half
Speaker:
How about that?
Speaker:
should say.
Speaker:
Yeah,
Speaker:
what?
Speaker:
he's Anglo Indian.
Speaker:
Anglo Indian.
Speaker:
Yes.
Speaker:
It's like us.
Speaker:
You and me we're Indian.
Speaker:
so his paternal side is from Gujarat.
Speaker:
Right.
Speaker:
And his mom's side I think is European.
Speaker:
His dad was a physician who was born in Kenya.
Speaker:
And Ben Kingsley's name is not actually Ben Kingsley.
Speaker:
It's like Krishna Bhanji, I think
Speaker:
Yeah.
Speaker:
Yeah.
Speaker:
Yeah.
Speaker:
And he realized that he wasn't getting called into the right casting
Speaker:
roles when he was looking for, when he was starting off his career.
Speaker:
So he is like, let me change my name.
Speaker:
And so he changed his name to Ben Kingsley and people started calling
Speaker:
him in and he started getting roles.
Speaker:
Racism in early Hollywood? Say it isn't so.
Speaker:
Racism in current Hollywood.
Speaker:
Say it isn't so. He wouldn't be the only one to do so.
Speaker:
Yeah.
Speaker:
yeah, so I apologize to Sir Ben Kingsley, uh, for all these years.
Speaker:
Yeah.
Speaker:
You were putting it in the same category as the quote unquote
Speaker:
Indian guy from the Short Circuit movie, which I don't know his name,
Speaker:
but he is very much not an Indian
Speaker:
person.
Speaker:
Do you know who it was?
Speaker:
the name?
Speaker:
I'm looking it up.
Speaker:
Or it's also like how Apu from, uh, the Simpsons is not Indian,
Speaker:
Yeah, he's, he's played by, um.
Speaker:
Oh, I know that.
Speaker:
I know the actor, but his name is escaping me.
Speaker:
So Fisher Stevens, is that
Speaker:
Fisher Stevens.
Speaker:
Yeah, Fisher Stevens.
Speaker:
Who?
Speaker:
Those of you that watch succession
Speaker:
will, uh, uh, Fisher Stevens was in succession.
Speaker:
He was, he was a, a lawyer, a a smarmy lawyer, which
Speaker:
always plays smarmy characters
Speaker:
yeah, I was just thinking, because I remember him from the blacklist
Speaker:
where he plays Marvin, the lawyer.
Speaker:
Yeah, got, he's got kind of the lawyer face.
Speaker:
I'm glad that you, you finally realized the error of your ways.
Speaker:
But did you know he was
Speaker:
No, no, I didn't.
Speaker:
I guess I always brought it up just like you, like I would bring
Speaker:
Ben Kingsley playing Gandhi and, um, as just another example of, uh, you
Speaker:
know, what would we call it, brown face, I guess we'd call it brown face.
Speaker:
Yeah.
Speaker:
Yeah.
Speaker:
But people taking actor and there's been a lot of those great roles throughout the
Speaker:
Great.
Speaker:
You know, great roles played by very not,
Speaker:
you know, people that are not of that ethnic group.
Speaker:
Yeah.
Speaker:
and I think maybe also at the time, right, there weren't many
Speaker:
Indian actors in Hollywood at all.
Speaker:
And I would rather have the fact, or I would rather it like the movie be made
Speaker:
with someone who is non-Indian, rather, because it's a great movie.
Speaker:
I
Speaker:
don't know.
Speaker:
You've seen it,
Speaker:
good movie.
Speaker:
Yeah.
Speaker:
Yeah.
Speaker:
So I would rather have that rather than not having the movie at all.
Speaker:
Hmm, I see what you're saying.
Speaker:
I see what you're saying.
Speaker:
Yeah.
Speaker:
And of course, you know, we have the same challenge with, uh, Asian, uh, actors,
Speaker:
right?
Speaker:
Uh, there's literally only three Chinese actors in all of Hollywood.
Speaker:
Like if you, if you look at like the Chinese roles, they've gone to
Speaker:
literally like one, there's one guy.
Speaker:
Uh, I forgot how many roles he's had, but he has had a prolific career playing
Speaker:
every Chinese person that you know.
Speaker:
Um, but, um, anyway, so we're gonna talk about something that we've
Speaker:
alluded to a little bit on the podcast.
Speaker:
Uh, sort of tell the final saga of what I'm calling the backup from Hell.
Speaker:
I may maybe, uh, we should probably phrase that slightly differently.
Speaker:
It's probably the,
Speaker:
the backup that keeps giving.
Speaker:
the back, the backup that, yeah.
Speaker:
Uh, what a mess.
Speaker:
The beginning of the story is
Speaker:
that I was asked to do a backup of two Synology boxes that they
Speaker:
were, uh, repurposing, right?
Speaker:
So they were, um, going to move the data.
Speaker:
They, they were gonna reuse these servers, but they wanted to get a backup of the, of
Speaker:
the, the data before they moved it off of
Speaker:
Backup is good.
Speaker:
Yeah,
Speaker:
Backup is good.
Speaker:
Yeah.
Speaker:
Apparently they hadn't had a backup of the, of these servers before.
Speaker:
And, um, they said it was
Speaker:
about 40 terabytes of data.
Speaker:
That's the information that I was given and after I had started doing
Speaker:
the backup, I very quickly realized that 40 terabytes might have been.
Speaker:
An understatement.
Speaker:
You, found additional data around
Speaker:
right as you
Speaker:
data.
Speaker:
Yeah.
Speaker:
Uh, so it turned out that it wasn't like 40 terabytes of data.
Speaker:
It was more like 400 terabytes of
Speaker:
Yeah, and
Speaker:
I'm guessing because these were systems that were kind of probably off on the
Speaker:
side, they hadn't been used in a while.
Speaker:
Like that's, I think, the problem, and I think we talked about this in one of
Speaker:
our episodes about sort of systems that kind of get stored away in the corner.
Speaker:
No one worries about
Speaker:
it.
Speaker:
Right?
Speaker:
And do you leave it powered on your old backup systems?
Speaker:
Right.
Speaker:
We just talked about that.
Speaker:
And so I think that becomes a challenge.
Speaker:
It's when you have these systems that are no longer actively being
Speaker:
used, it kind of gets away from you.
Speaker:
Yeah.
Speaker:
Yeah.
Speaker:
And so the customer really didn't have any idea just how much data that they
Speaker:
were dealing with here. It turned out to be, like I said, close to half a petabyte of
Speaker:
Yeah.
Speaker:
And, and for you, that changes things significantly because
Speaker:
changes the backup design like massively.
Speaker:
Yeah.
Speaker:
Yeah.
Speaker:
because your backup target, I think you had mentioned previously
Speaker:
that it was like a server, right?
Speaker:
That you were backing this data up to
Speaker:
I was backing it up via a server, a Windows server.
Speaker:
And, um, and tape, right?
Speaker:
I had tape, but it was sized for, um, you know, 40 terabytes.
Speaker:
And so, which is, which is basically the, the, the server and the tape
Speaker:
library was the perfect size for that.
Speaker:
But then I started realizing that it was filling up.
Speaker:
And again, this, this is my fault for not really looking at the size of the
Speaker:
data before really jumping in there, but basically I realized very quickly
Speaker:
that this was a whole lot more data
Speaker:
than,
Speaker:
than,
Speaker:
you expected. And I think, just kind of looking at lessons
Speaker:
learned, if you're a backup admin who is being told, Hey, this new
Speaker:
application is coming online.
Speaker:
Make sure that you understand like what is the expected growth of that application.
Speaker:
Because what you size for, say, a five terabyte database with a 1% growth
Speaker:
is very different than like a file server with like a 50% growth rate.
Speaker:
Yeah, exactly.
Speaker:
Um, and just because somebody says they have 10 terabytes of data doesn't mean
Speaker:
that they have 10 terabytes of data.
Speaker:
So you mentioned you had a backup server, you had a tape drive.
Speaker:
Is there a reason you chose to use tape
Speaker:
Well, the, I mean, tape is great for long-term retention of data,
Speaker:
which is what this customer wanted.
Speaker:
They wanted to hold onto this data for a long period of time,
Speaker:
and that's where tape is great.
Speaker:
And tape also is has, uh, you know, if you're able to properly feed it,
Speaker:
tape is actually, can be quite fast.
Speaker:
the challenge that I had when backing up this data that for various reasons,
Speaker:
which I think I, I think by the end I sort of figured out the, the
Speaker:
core reason for various reasons.
Speaker:
Individual backups off of the, these filers, the, they were just
Speaker:
slow, just, um, you know, they were
Speaker:
Like how slow, slow.
Speaker:
like, slow, was like, like, like three and a half kilobytes a second slow.
Speaker:
So like slower than like a 56 K modem back
Speaker:
Yeah.
Speaker:
Right.
Speaker:
And you can multiplex all you want.
Speaker:
So first off, you know, I, I was using NetBackup, which, you know, NetBackup, it
Speaker:
did a great job at what we had available.
Speaker:
Um, the, challenge was that because I couldn't put.
Speaker:
The client on the filers themselves.
Speaker:
So there was a way, allegedly, to put a backup client on the filer,
Speaker:
but I could never get that to work.
Speaker:
And so I had to back up over SMB because I'm backing up
Speaker:
over SMB, I'm just, I'm just.
Speaker:
I'm just limited at what that was, right?
Speaker:
What,
Speaker:
I could get, and because I'm backing up over SMB, the client
Speaker:
is just the backup server,
Speaker:
right?
Speaker:
So instead of running a backup from two clients, I'm running a backup from one
Speaker:
client because that's the backup server.
Speaker:
I'm backing it up over SMB.
Speaker:
And because of that, I'm limited to the number of jobs I can run at one time.
Speaker:
NetBackup, um, says 99 jobs, which you'd say, gee, that
Speaker:
sounds like a
Speaker:
99 problems.
Speaker:
right?
Speaker:
But, but the thing is, towards the end, as I was running a lot of these backups,
Speaker:
the aggregate speed of like 99 backups was only like 30, 40 megabytes a second,
Speaker:
you
Speaker:
you're talking about 400 terabytes of data to
Speaker:
400 terabytes of data doing the math.
Speaker:
I backed up for months,
Speaker:
right?
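A quick back-of-the-envelope check on "doing the math" (a sketch added for illustration, not something worked through on the show). It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly sustained rate with no failures or restarts, which the episode makes clear was not the case. The function name is made up for this example.

```python
TB = 10**12  # decimal terabyte (assumption: the episode does not say TB vs TiB)
MB = 10**6   # decimal megabyte (same assumption)

def days_to_transfer(total_bytes: float, bytes_per_second: float) -> float:
    """Days needed to move total_bytes at a constant, uninterrupted rate."""
    seconds = total_bytes / bytes_per_second
    return seconds / 86_400  # seconds per day

# Aggregate rates quoted in the episode: roughly 30 to 40 MB/s across ~99 jobs.
for rate_mb in (30, 40):
    days = days_to_transfer(400 * TB, rate_mb * MB)
    print(f"400 TB at {rate_mb} MB/s: about {days:.0f} days")
```

Under those assumptions the answer lands somewhere around four to five months of nonstop writing, which lines up with "I backed up for months."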
Speaker:
And I tried all these different things.
Speaker:
Uh, you know, num, you know, was I running too many backups at a time?
Speaker:
Was I running not enough backups at a time?
Speaker:
You know, it, um, you know, and then the problem is every, every
Speaker:
test would take days or weeks.
Speaker:
I think we should mention one thing.
Speaker:
You were talking about these tests taking days or
Speaker:
mm-hmm.
Speaker:
and then do you wanna mention sort of some of the issues you ran into with these long
Speaker:
running jobs just due to infrastructure or
Speaker:
Yeah.
Speaker:
other issues in the environment?
Speaker:
Yeah, you know, backups are not made to run over weeks or months.
Speaker:
Just backup infrastructure isn't made to work like that.
Speaker:
And so when you do backups over weeks or months.
Speaker:
Weird things happen that, cause you know, consternation, one of the things
Speaker:
is LTO tape drives are great, but like we were using like the half high LTO
Speaker:
drives and as far as I could tell, their duty cycle was not meant to
Speaker:
be a hundred percent for two months.
Speaker:
Right.
Speaker:
Um, they're meant to be backed up for, you know, several hours and then give
Speaker:
'em a rest and then back up several hours and then give 'em a rest.
Speaker:
I was just beating the crap outta these things for weeks or months at a time.
Speaker:
And what would happen is after some significant period of time,
Speaker:
it would just go write error.
Speaker:
And that's fine when a backup runs for a few hours and then just try again.
Speaker:
But if you, but if it took you two weeks or three weeks to get to that point
Speaker:
and then you get a write error, um,
Speaker:
then
Speaker:
it's not like you could restart these jobs either, right?
Speaker:
I think you're running into
Speaker:
Yeah.
Speaker:
Well,
Speaker:
I mean, I mean, I could restart em, but, but it's like after
Speaker:
a period of time I became, I.
Speaker:
I eventually got to the point where I said, tape is not my friend.
Speaker:
I, anybody who
Speaker:
this is coming from Mr.
Speaker:
Backup.
Speaker:
know anybody who listens to this podcast knows that I am, I am a friend of tape,
Speaker:
right?
Speaker:
I believe strongly in tape for a lot of reasons, but I don't think that, uh,
Speaker:
specific and, and you know, maybe the, my LTO friends can chime in here, but I don't
Speaker:
think that these tape drives were designed to be backed up to like this for weeks and
Speaker:
months at a time, 24 7 with no, because as soon as one, I was multiplexing
Speaker:
as many backups together as I could.
Speaker:
And when one backup would finish, I would just add another backup onto it, right?
Speaker:
Because
Speaker:
I, I could, I could, I.
Speaker:
what I couldn't do is I couldn't say, well, let's do these 10 backups, let
Speaker:
them run until they're finished, and then we'll do the next 10 backups.
Speaker:
And that would've given the tape drives a, a moment to breathe, I think.
Speaker:
But, uh, I couldn't do that because the, because we, we just
Speaker:
didn't have that kind of time.
Speaker:
And so I
Speaker:
was just, I was just try, you know, tagging it
Speaker:
and, and I know you've always talked about like the shoe shining problem,
Speaker:
given that you're not going very fast with these backups, right.
Speaker:
Do you think that also led to some issues as well for the tape drives?
Speaker:
yeah.
Speaker:
So again, the core problem was that each individual backup was running slow.
Speaker:
No matter how many of them I multiplexed together, it was not enough
Speaker:
speed to make the tape drive happy.
Speaker:
And so, yes, the tape drives were shoe-shining.
Speaker:
And when a tape drive is continually shoe-shining, the tape drive will fail.
Speaker:
And so everything, I remember learning about tape drives was
Speaker:
coming back to haunt me, right?
Speaker:
Um, this is all of the design that I was, that I had done throughout
Speaker:
the years on backup, um, you know,
Speaker:
um, backup system
Speaker:
And system.
Speaker:
all of the things that, you know, what do you do when the backups, you know?
Speaker:
And so I came to understand
Speaker:
that the only way I was gonna finish this backup was to do it to disc.
Speaker:
And just quickly before you move on, I think along the way, didn't
Speaker:
you also have a tape drive that failed that you then had to go
Speaker:
Oh, multiple Multiple times.
Speaker:
Swap out tape drives, reboot tape drives, put in cleaning tapes and tape drives.
Speaker:
And by the way, that's another thing is the way tape drives normally do
Speaker:
is you run them for a certain number of hours and then there's a cleaning
Speaker:
tape that goes in there and cleans it.
Speaker:
And when you have a robotic library, that happens automatically.
Speaker:
Well, when you just run the tape drive for.
Speaker:
Two months, you know, that
Speaker:
And so at some point the tape drive just fails.
Speaker:
Yeah.
Speaker:
um, yeah.
Speaker:
And so I ultimately decided that the only way to get this done was to, um, you know,
Speaker:
buy, uh, enough disc to back this up.
Speaker:
And that wasn't cheap.
Speaker:
Uh, but I, I didn't think that there was any other way that this was ever
Speaker:
going to get done 'cause again, the core problem that we've had with tape
Speaker:
for the last three decades has been that if the backup isn't
Speaker:
fast enough for the tape drive, it's a fundamental mismatch,
Speaker:
right?
Speaker:
And so we use multiplexing to make that better.
Speaker:
But if the multi, but if the speed you're dealing with is in kilobytes a second,
Speaker:
Yeah.
Speaker:
Well, and especially 'cause you're limited by those two, uh, Synology boxes, right?
Speaker:
Which are limiting your bandwidth, right?
Speaker:
It's not like
Speaker:
Yeah.
Speaker:
Synology boxes you can then pull from,
Speaker:
Yeah, and I was, I was watching, like, I was running every kind of tool I could
Speaker:
run to see, like, that I wasn't overtaxing them.
Speaker:
The, that was the really weird part is that the, it's not like the
Speaker:
Synology boxes were saying, you're really beating the crap out of it.
Speaker:
You shouldn't do so many
Speaker:
backups at a time.
Speaker:
It wasn't, it, it was, I didn't have a high I/O wait.
Speaker:
I didn't have high CPU, I didn't have high ram.
Speaker:
There, there was no, there was no
Speaker:
rhyme or reason as to why. We'll get to the rhyme or reason later.
Speaker:
I figured it out.
Speaker:
Um, but I knew that tape wasn't gonna work.
Speaker:
So, so I had to bring in, uh, a couple of other Synology disc arrays, by the
Speaker:
way, and populate them with enough disc to handle all of this, uh, this backup.
Speaker:
Right.
Speaker:
Yeah,
Speaker:
And, um.
Speaker:
Then
Speaker:
but that wasn't without its issues either.
Speaker:
Right?
Speaker:
When you, when you brought those in, that wasn't without its issues either.
Speaker:
No, it wasn't without issues.
Speaker:
And the other thing, what I needed to do was to, I felt that with, in terms of the
Speaker:
number of directories that were remaining, I wasn't sure like the different sizes.
Speaker:
So what I did was I split
Speaker:
those jobs into many smaller jobs.
Speaker:
NetBackup is really good at like running thousands of jobs, right?
Speaker:
So rather than just have a hundred jobs, I turned that into like 2,400 jobs.
Speaker:
Like I went,
Speaker:
I went another level deep and created a policy for each of these
Speaker:
directories, and then I ran those and it was running for a while.
Speaker:
It was, it was, you know, again, more time.
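One way to picture "going another level deep" (an illustrative sketch only; the episode does not describe how the per-directory policies were actually generated, and this is not a NetBackup API). The helper name `dirs_at_depth` is hypothetical. It lists every directory a fixed number of levels below a share root, so each one could become its own backup job.

```python
import os

def dirs_at_depth(root: str, depth: int) -> list[str]:
    """Return the directories exactly `depth` levels below root, sorted."""
    current = [root]
    for _ in range(depth):
        next_level = []
        for directory in current:
            with os.scandir(directory) as entries:
                # Keep real subdirectories only; do not follow symlinks.
                next_level.extend(
                    entry.path
                    for entry in entries
                    if entry.is_dir(follow_symlinks=False)
                )
        current = next_level
    return sorted(current)
```

With depth 1 you get one job per top-level directory; with depth 2 you get one per second-level directory, which is the kind of jump from a hundred jobs to a couple of thousand described here.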
Speaker:
And what I started seeing
Speaker:
were these jobs, like an individual job that was running
Speaker:
for an inordinate amount of time.
Speaker:
but you also had some jobs that would finish like super fast, right?
Speaker:
Like
Speaker:
They'd finish five, they'd finish in
Speaker:
Some of 'em, some of 'em finished in five minutes, some 'em would finish.
Speaker:
But I noticed that over time there were certain policies that were running for
Speaker:
really, really long periods of time, and eventually I started poking around.
Speaker:
That's when I discovered what ultimately was the true culprit.
Speaker:
And, uh, anyone who's been around backup for a long time
Speaker:
has seen this culprit before.
Speaker:
It's just, this is the worst example of this culprit that I've ever seen.
Speaker:
And what is that?
Speaker:
We affectionately refer to it as the million file problem.
Speaker:
Hmm.
Speaker:
Because remember, again, going back to that, um, that client back from
Speaker:
25 years ago, we had one server.
Speaker:
That was going to be storing a bunch of images and it was going
Speaker:
to result in millions of files.
Speaker:
And we knew that back then that the million file problem is, a real problem.
Speaker:
and and million file problem ev over, over the network is even worse, right?
Speaker:
Because everything is, is, is a
Speaker:
round trip.
Speaker:
The way we fixed it back then was we used a product called
Speaker:
FlashBackup, which would back up at the raw level, but store the
Speaker:
information, and that was not available to me.
Speaker:
Why?
Speaker:
Because that product no longer exists
Speaker:
No.
Speaker:
because it doesn't run on a Synology box.
Speaker:
Right.
Speaker:
Remember, I'm not the Synology
Speaker:
All it was was an SMB mount to me.
Speaker:
Right?
Speaker:
And by the way, for those curious, yes, I tested SMB, I tested NFS.
Speaker:
It didn't matter.
Speaker:
It didn't matter.
Speaker:
Um, the um.
Speaker:
And
Speaker:
by the way, this was a constant, you know, you know the phrase, never, never
Speaker:
go into battle with an untested weapon.
Speaker:
This was a constant example of: I am in the battle, I'm in the stuff,
Speaker:
and now I'm trying to test stuff,
Speaker:
and everything I did to try to make things better just made it take longer
Speaker:
and the client just had to wait.
Speaker:
And the the client was incredibly patient, honestly.
Speaker:
And, and you know, I did my best to say, look, I, I've been doing this for 30
Speaker:
years, I've never seen anything like this.
Speaker:
Right.
Speaker:
And that, that helped.
Speaker:
But in the end, I was backing up.
Speaker:
You know, we got down to, I, I learned a way to identify which
Speaker:
were the problem directories.
Speaker:
So I would kick off a policy and I would watch, and I would notice
Speaker:
that it had run for, let's say, an hour.
Speaker:
And it listed, let's say 300,000 files backed up.
Speaker:
No kilobytes.
Speaker:
Hmm.
Speaker:
Literally, there's a kilobytes column,
Speaker:
and there's no value in there.
Speaker:
We backed up 300,000 files, no kilobytes.
Speaker:
so that, that helped me identify these problem
Speaker:
Problem child.
Speaker:
Yeah.
Speaker:
it and let the other non-problem policies finish.
Speaker:
And
Speaker:
Right.
Speaker:
Yeah.
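The "lots of files, no kilobytes" signal could be expressed as a simple filter (a hedged sketch; the `JobStatus` record and its field names are hypothetical stand-ins for whatever the job monitor shows, not a real NetBackup data structure, and the thresholds are arbitrary).

```python
from dataclasses import dataclass

@dataclass
class JobStatus:
    policy: str             # hypothetical: policy / directory name
    elapsed_hours: float    # how long the job has been running
    files_written: int      # the "files" column
    kilobytes_written: int  # the "kilobytes" column

def looks_like_small_file_problem(
    job: JobStatus, min_hours: float = 1.0, max_kb_per_file: float = 1.0
) -> bool:
    """Flag jobs that have run a while and moved many files but almost no data."""
    if job.elapsed_hours < min_hours or job.files_written == 0:
        return False
    return job.kilobytes_written / job.files_written < max_kb_per_file

jobs = [
    JobStatus("dir_a", 1.0, 300_000, 0),          # the pattern from the episode
    JobStatus("dir_b", 1.0, 1_000, 5_000_000),    # a healthy job
]
suspects = [j.policy for j in jobs if looks_like_small_file_problem(j)]
```

Here `suspects` ends up holding only the policy with 300,000 files and zero kilobytes, which is the one you would pause so the healthy jobs can finish.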
Speaker:
up getting down to like 150 policies that were the problem policies.
Speaker:
And so I backed them up and I was able to get them.
Speaker:
Over time, I was able to get them backed up, and then finally I got down to about
Speaker:
20 policies, I think somewhere around
Speaker:
policies.
Speaker:
Go ahead.
Speaker:
And at this point when you're down to the 20, like some of these have
Speaker:
been running for a long time, right?
Speaker:
Like how?
Speaker:
like two months backups that have been running for two months,
Speaker:
successfully running for two months.
Speaker:
Yeah.
Speaker:
And what was good was at this point again.
Speaker:
Like this is information that would've been really helpful to have at the
Speaker:
beginning, but it was information that, to get all this information at the
Speaker:
beginning, it would've taken time to, like we, we just wanted to get started.
Speaker:
Yeah.
Speaker:
What I ended up finding was that, um, these backups, um.
Speaker:
The, the, there were millions and millions and millions, like one of the, one
Speaker:
of the directories that I was backing up, it had 99 million files in it,
Speaker:
one directory, 99 million files, and eventually what I realized was that
Speaker:
again, the problem this time was just SMB.
Speaker:
So the fact that every one of these files results in a round
Speaker:
trip conversation, possibly multiple round trip conversations.
Speaker:
Yep.
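To see why per-file round trips dominate, here is a rough latency model (a sketch under stated assumptions, not a measurement from this project). The number of round trips per file and the round-trip time are both assumed values; the episode gives neither. The model also ignores data transfer time entirely, which is reasonable only when files are tiny.

```python
def days_for_small_files(
    file_count: int, round_trips_per_file: float, rtt_seconds: float
) -> float:
    """Days spent purely waiting on the network for a single serial stream."""
    return file_count * round_trips_per_file * rtt_seconds / 86_400

FILES = 99_000_000  # the single directory mentioned in the episode

# Assumed scenarios: (round trips per file, round-trip time in seconds).
for trips, rtt in ((3, 0.001), (10, 0.005)):
    days = days_for_small_files(FILES, trips, rtt)
    print(f"{trips} round trips at {rtt * 1000:.0f} ms each: {days:.1f} days")
```

Even with optimistic assumptions the wait is measured in days for one directory, and with less friendly ones it stretches to roughly two months, with almost no bytes moved. Throughput never enters the equation, which is why adding more parallel jobs helped so little.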
Speaker:
And I realized that the only way I was gonna back up these truly problem
Speaker:
directories was to back them up locally.
Speaker:
But how do I back them up locally?
Speaker:
Well, luckily this is when I just, you know, basically go back
Speaker:
to dumb, dumb old backup tools.
Speaker:
And so I was able to run a backup using tar logged in locally
Speaker:
on the filers, and then just.
Speaker:
Directing the tarball across the network that finally worked.
Speaker:
That's crazy.
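The idea of reading the files locally and sending one continuous stream over the network can be sketched like this (illustrative only: the host used the tar command on the filers, and the exact invocation is not given in the episode, so this Python stand-in and the function name `stream_tar` are assumptions).

```python
import os
import tarfile

def stream_tar(src_dir: str, out) -> None:
    """Write src_dir as an uncompressed tar stream to the binary file object `out`.

    Mode "w|" produces a pure forward-only stream, so `out` can be a pipe
    or a socket file rather than a seekable file on disk.
    """
    top_level = os.path.basename(os.path.normpath(src_dir))
    with tarfile.open(fileobj=out, mode="w|") as tar:
        tar.add(src_dir, arcname=top_level)
```

Run on the machine that holds the data, something like `stream_tar(some_directory, sys.stdout.buffer)` (after `import sys`) could have its output piped to another host. The point is that every per-file open and stat happens locally, and the network only ever sees one long sequential stream instead of millions of small conversations.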
Speaker:
So you had these 20 jobs, right?
Speaker:
And some of them you said were running for 60 plus days, and then you sort of
Speaker:
were like, okay, let me start this over.
Speaker:
And by the way, you were kind of forced to start them over
Speaker:
because something happened right?
Speaker:
At
Speaker:
yeah.
Speaker:
Something some unknown thing.
Speaker:
Um, I think I.
Speaker:
I, I, I don't know.
Speaker:
I, I actually don't know
Speaker:
what caused it, but they, they did fail
Speaker:
and,
Speaker:
And you were like, I'm not gonna start these
Speaker:
yeah.
Speaker:
I'm not gonna start 'em again.
Speaker:
It's just, yeah.
Speaker:
Well, Because
Speaker:
like, one of jobs, the, the one with 99 fi, 99 million
Speaker:
files, we were nowhere near.
Speaker:
I.
Speaker:
yeah.
Speaker:
After 60 days you were barely
Speaker:
yeah, yeah.
Speaker:
We're barely, barely scratching the surface.
Speaker:
so I'm like, I, I, I don't have, I don't have that, you know, I, I don't
Speaker:
have the amount of time that it would take, so, so I switched to, you know,
Speaker:
experimentally once again, experimentally, I'm experimenting on the fly, I'm
Speaker:
doing development in production.
Speaker:
Uh, I was like, well, let me see how long, how quick a tar ball would run.
Speaker:
I ran a tar ball.
Speaker:
I remember for like a day, you remember this?
Speaker:
I ran a
Speaker:
a day and it, I, I had a du of the size of the directory and after a day it had
Speaker:
done like, like a half of it or something.
Speaker:
Yeah.
Speaker:
You're like, what?
Speaker:
One's taking 66 days and barely scratching the
Speaker:
yeah,
Speaker:
You are mainly done.
Speaker:
Almost done within a day.
Speaker:
yeah.
Speaker:
And so I was like, this is the way.
Speaker:
Right.
Speaker:
So it, it, it wasn't, it wasn't a way for everything because the, the, this
Speaker:
was, um, because I, you know, I'm glad that I, that I use NetBackup for the
Speaker:
bulk of it, because then I have the catalog data and, you know, and, um,
Speaker:
but
Speaker:
on the restore side.
Speaker:
yeah, yeah.
Speaker:
So this will.
Speaker:
This will be the diff the restores will be more difficult for these
Speaker:
like remaining 20 directories.
Speaker:
I mean, not, not astronomically.
Speaker:
So like,
Speaker:
you know, can create a tarball, a
Speaker:
list of this.
Speaker:
So, you know, lessons learned, like,
Speaker:
do that.
Speaker:
Don't store millions of files on the other side of a, of an SMB box.
Speaker:
I guess
Speaker:
Yeah, so Well, and I think a couple things, even if it's not SMB, right?
Speaker:
Just having that many files, because I think what people don't realize is
Speaker:
even though the size of every disc has gotten significantly larger, right?
Speaker:
You're talking like 18 terabyte, 20 terabyte disk
Speaker:
Yeah.
Speaker:
They can only handle so many operations per disc, right?
Speaker:
That number hasn't changed.
Speaker:
It's about a hundred per second.
Speaker:
And so no matter how many, how big your disc is, right?
Speaker:
If it was 20 one-terabyte discs, right, then you get 20 times a hundred IOPS.
Speaker:
Versus if it's one 20-terabyte disc, you still only get that one hundred.
Speaker:
So that's a big thing that people don't realize with these larger size discs.
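The spindle arithmetic above, written out (a minimal sketch; the 100 operations per second figure is the rule of thumb quoted in the conversation, and real drives vary).

```python
IOPS_PER_DISK = 100  # rule-of-thumb figure quoted in the episode

def aggregate_iops(disk_count: int, iops_per_disk: int = IOPS_PER_DISK) -> int:
    """Total random operations per second across independent disks."""
    return disk_count * iops_per_disk

def iops_per_terabyte(disk_count: int, tb_per_disk: float) -> float:
    """How many operations per second are available per terabyte stored."""
    return aggregate_iops(disk_count) / (disk_count * tb_per_disk)

many_small = aggregate_iops(20)  # twenty 1 TB disks
one_large = aggregate_iops(1)    # one 20 TB disk
print(many_small, one_large)
print(iops_per_terabyte(20, 1), iops_per_terabyte(1, 20))
```

Same 20 TB of capacity either way, but the single large disk has one twentieth of the operations available per terabyte, which matters a lot when the workload is millions of tiny files.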
Speaker:
Yeah.
Speaker:
And, and the thing was that the.
Speaker:
That many files.
Speaker:
So, because the problem, the, ultimately the problem wasn't disc IO, the problem
Speaker:
was network IO.
Speaker:
Right?
Speaker:
Network latency.
Speaker:
So, because
Speaker:
when I actually ran, I ran two tar balls.
Speaker:
I.
Speaker:
Simultaneously is what I did.
Speaker:
I using
Speaker:
I just, I ran, I was always running two at a time.
Speaker:
When I was running two at a time, I/O wait was sitting at 10,
Speaker:
which is, is high,
Speaker:
but I was like, well, it's got nothing else going on, so I'm, I'm
Speaker:
it go.
Speaker:
Right?
Speaker:
The highest I/O wait I saw during all of those hundreds of
Speaker:
simultaneous backups was like four.
Speaker:
yeah,
Speaker:
So like I wasn't disc bound.
Speaker:
I was
Speaker:
bound, but not network bound in terms of throughput, network bound, in terms of
Speaker:
latency,
Speaker:
and
Speaker:
of operations, just because SMB is very chatty.
Speaker:
very chatty.
Speaker:
It's probably the chattiest of the protocols,
Speaker:
and
Speaker:
we, you
Speaker:
it was just a really bad combination.
Speaker:
Yeah.
Speaker:
And you know why this, and this is why backup vendors have their own protocols,
Speaker:
like Data Domain has boost, right?
Speaker:
To help alleviate and solve some of these issues.
Speaker:
Yeah.
Speaker:
You talked about, don't, don't do the somewhere we were talking about.
Speaker:
Just don't do this.
Speaker:
I, I'd like, I'd like to talk today.
Speaker:
When I looked at these, these, uh, these directories that had these
Speaker:
tens of millions of files, it was a structure that was very clearly
Speaker:
created by some application.
Speaker:
Every one of these directories had a common structure created by some,
Speaker:
I'm gonna say stupid application that thought this was perfectly fine.
Speaker:
That it was perfectly fine to create 99 million files for
Speaker:
Do you know, I,
Speaker:
item.
Speaker:
I bet they were using the file system as a database
Speaker:
I don't know.
Speaker:
what it was.
Speaker:
given just like the number of files and the size of those files.
Speaker:
I know it was forensic type information
Speaker:
and I, I don't, I clearly
Speaker:
That, that's fine.
Speaker:
Yeah, yeah,
Speaker:
No, I'm just saying I clearly don't know enough about forensic stuff
Speaker:
to know why they would want tens of
Speaker:
millions of files,
Speaker:
but
Speaker:
So where are you?
Speaker:
So you talked about these 20 jobs that you were starting to do tarballs with.
Speaker:
So where are you right now?
Speaker:
So we finished all of them but one. There was one that, for some reason,
Speaker:
the file didn't look right.
Speaker:
It was weird.
Speaker:
Um, the backup completed, but for some reason the tarball
Speaker:
just didn't look right.
Speaker:
I don't wanna go into details.
Speaker:
It just didn't look right,
Speaker:
so I'm rerunning that one.
Speaker:
So, based on its size and how well it's doing, it should
Speaker:
finish in about a day or so.
Speaker:
Um, and what I'm
Speaker:
is a significant improvement in terms of
Speaker:
A significant improvement: a day versus, you know, a year, um,
Speaker:
Or two, I think actually it might have been two.
Speaker:
Yeah,
Speaker:
Agreed.
Speaker:
Um, and because, again, I don't have the catalog,
Speaker:
what I'm currently running is a tar tvf
Speaker:
on all of those files and creating tarballs, or creating, I'm sorry, text
Speaker:
files, a list
Speaker:
of the files that are in there.
Speaker:
And then I'm gonna do a count on the files that are in there and
Speaker:
check it against the count of the files that are in the directory.
Speaker:
And hopefully those numbers should be the same.
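The check described here, listing what is in each tarball, counting it, and comparing against the count in the source directory, can be sketched in Python. This is an illustration and not the script used on the job; the function names are made up for this sketch, and it assumes the tarball was created from the directory itself so that the directory entry is included in both counts.

```python
import os
import tarfile


def count_tar_members(tar_path):
    """Count entries in a tarball by iterating over its headers,
    roughly what `tar tvf archive.tar | wc -l` reports."""
    count = 0
    with tarfile.open(tar_path, "r:*") as tar:
        for _member in tar:
            count += 1
    return count


def count_tree_entries(root):
    """Count root plus every file and directory beneath it,
    roughly what `find root | wc -l` reports."""
    count = 1  # the root directory itself
    for _dirpath, dirnames, filenames in os.walk(root):
        count += len(dirnames) + len(filenames)
    return count


def tarball_matches_directory(tar_path, root):
    """Return True when the tarball holds as many entries as the directory tree."""
    return count_tar_members(tar_path) == count_tree_entries(root)
```

A matching count only shows that nothing was skipped; it says nothing about whether file contents are intact. At the file counts discussed in this episode, Python's tarfile module also keeps every header it reads in memory, so the command-line listing is likely the more practical route.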
Speaker:
Yeah, because I believe you were even saying that to run things
Speaker:
like a find to get a list of all the files in a directory, or a du,
Speaker:
Yeah.
Speaker:
took hours, right?
Speaker:
Well, it was days actually.
Speaker:
In
Speaker:
fact, it was why I didn't have this information in the beginning
Speaker:
because everything was so big, and every find, every du, every command
Speaker:
that I had... du is quicker than find.
Speaker:
du is.
Speaker:
It just does less work than find.
Speaker:
But the problem that I ultimately realized was that du wasn't
Speaker:
really being helpful in terms of
Speaker:
the
Speaker:
scope of the job. The scope of the job was determined
Speaker:
by the number of these files.
Speaker:
And I couldn't get those numbers because that was the thing that took forever.
Speaker:
Once the number of jobs dwindled down to about 20, that's when I
Speaker:
was able to run these, uh, the
Speaker:
and they would actually complete.
Speaker:
And that's when I realized just how bad it was.
Speaker:
So if you had to start this over, and hopefully you never do, but I'm just
Speaker:
saying, if you had to go back to day one, what would you do differently?
Speaker:
I know you talked about making sure you understand the size of your backups.
Speaker:
Right.
Speaker:
It just feels like some of these, you just have to go through the process
Speaker:
though because you don't know what to do.
Speaker:
Like it's not like you could just start day one and be like,
Speaker:
oh, I know I need to go to disk.
Speaker:
I need to do X, Y, and Z.
Speaker:
Right?
Speaker:
It's sort of like a learning process.
Speaker:
I would say that.
Speaker:
Yeah, because the problem is you're going off into the unknown,
Speaker:
you're doing a backup of something that you don't know what it is.
Speaker:
And I would say, if at all possible, get things like
Speaker:
du's, uh, you know, disk usage. It's a Unix command, but you can load those
Speaker:
tools in Windows as well. Like, if you're going to back up, if you're
Speaker:
gonna back up a hundred directories,
Speaker:
get a du of every one of those directories so that you have an idea
Speaker:
of just what you're dealing with,
Speaker:
if at all possible.
Speaker:
Also, look and see the number of files. And if you're
Speaker:
trying to do that, you know, it's not that hard: you just run a find dot dash,
Speaker:
you know, I didn't even do a print, just find dot, pipe to wc -l, right?
Speaker:
That was it.
Speaker:
Right?
Speaker:
Um, to, to get the number of files.
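As a sketch of that pre-flight survey, the following Python walks each top-level directory once and reports both numbers: how many entries it holds and how many bytes. The function names are made up for this illustration. Two differences from the commands mentioned are worth flagging: unlike find, the count here leaves out the starting directory itself, and unlike du, it sums apparent file sizes and not disk blocks.

```python
import os


def survey(root):
    """Return (entry_count, total_bytes) for everything beneath root."""
    entries = 0
    total_bytes = 0
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as it:
            for entry in it:
                entries += 1
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)  # descend later, no recursion
                elif entry.is_file(follow_symlinks=False):
                    total_bytes += entry.stat(follow_symlinks=False).st_size
    return entries, total_bytes


def survey_all(parent):
    """Survey each immediate subdirectory of parent, keyed by directory name."""
    results = {}
    with os.scandir(parent) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                results[entry.name] = survey(entry.path)
    return results
```

Sorting the result by entry count instead of by bytes would surface a 99-million-file directory right away, even if its total size looks unremarkable.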
Speaker:
I'd say, again,
Speaker:
if I could go back in time, I would say maybe do a little bit more of this
Speaker:
research prior to beginning the job.
Speaker:
Um, but it's easy to say that now,
Speaker:
um, because I know what
Speaker:
I know.
Speaker:
Right.
Speaker:
Um, but the, you know, the core problem was that you've
Speaker:
got these millions of files.
Speaker:
I mean, which is
Speaker:
already gonna be a problem if you're backing it up in any sort of normal way.
Speaker:
But if you're backing it
Speaker:
up remotely over the network, it's going to kill you.
Speaker:
Yeah.
Speaker:
So, um, you gotta figure out a way to do that.
Speaker:
And then I would just say, see if there's anything that you can do with the, with
Speaker:
the application that's created this data
Speaker:
which is why it's important to get involved early on, right when an
Speaker:
application is being developed or deployed, right, to get involved so
Speaker:
they understand the backup requirements.
Speaker:
yeah.
Speaker:
And so, this backup that would never finish, I literally was, I
Speaker:
was starting to think that this thing was never gonna finish.
Speaker:
Um.
Speaker:
It's essentially, finally... I mean, at this point it's
Speaker:
not a hundred percent, but I'm now, you know, I'm
Speaker:
at the finish line.
Speaker:
Yeah.
Speaker:
at the finish line.
Speaker:
Yeah.
Speaker:
Um, it's nice.
Speaker:
I know one of the other things you mentioned that you were using
Speaker:
NetBackup, but you had also looked at other tools out there as well, right?
Speaker:
That could potentially help you with this effort.
Speaker:
Right.
Speaker:
So do you think that that becomes valuable, like either looking at other
Speaker:
tools, um, I know you had reached out to, like, Synology support, you
Speaker:
had reached out to some experts, like
Speaker:
Yeah.
Speaker:
Yeah.
Speaker:
The problem there... you could do, like with Synology,
Speaker:
you can like copy the data from A to B.
Speaker:
Mm-Hmm.
Speaker:
They have this ability, essentially, like, you know, for lack of a
Speaker:
better word, they have SnapMirror.
Speaker:
They have the equivalent of SnapMirror
Speaker:
Yep.
Speaker:
from one Synology box to another.
Speaker:
But to me that wasn't really a backup. Like, I wanted it in a format... you know,
Speaker:
in the end I was forced to not do what I wanted, with the tar.
Speaker:
Um, but I wanted it in a cataloged format.
Speaker:
So we looked at a couple of... The problem was never NetBackup.
Speaker:
Right?
Speaker:
NetBackup made it, um, easy to script this whole thing because it was the
Speaker:
only way I could make sense of it.
Speaker:
'Cause it was thousands of directories and, um, even
Speaker:
more thousands of subdirectories under those directories.
Speaker:
And the only way I could make sense of this was to script it all.
Speaker:
And, um, the, the fact that NetBackup allowed me to do that was great.
Speaker:
Um, there are some other tools these days, some of the newer tools,
Speaker:
they want to make it easy for you.
Speaker:
But if you get into a complicated situation like this, some of the newer
Speaker:
tools don't even have the ability to sort of grab it by the horns.
Speaker:
the way I was
Speaker:
able to do in NetBackup.
Speaker:
Yeah.
Speaker:
I think the other thing also that you were doing, which I thought was interesting,
Speaker:
was also your scripting, right?
Speaker:
Trying to automate this, like, uh, I know like scheduling your,
Speaker:
the backup policies to run, right?
Speaker:
And then you were sort of doing load balancing to make sure
Speaker:
that you keep the two filers
Speaker:
Yeah.
Speaker:
Yeah.
Speaker:
I couldn't, yeah, that was the thing.
Speaker:
Normally, I believe in just throwing
Speaker:
everything in the NetBackup scheduler and letting it figure it out.
Speaker:
But again, because of the limitations of the weird thing I had,
Speaker:
I couldn't figure out a way to load balance across the two target filers
Speaker:
with the NetBackup scheduler.
Speaker:
Um, maybe I could have, uh, done that better.
Speaker:
I don't know.
Speaker:
But, uh, so the way I was doing it was I was just assigning a backup:
Speaker:
when a backup would finish, I would assign the next backup to the
Speaker:
filer that now had more space available to it.
Speaker:
Right.
Speaker:
So I just had a while loop that was running, you
Speaker:
know, checking to see if a backup job was done.
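The polling loop described here can be sketched as follows. This is not the actual script, and it does not call any NetBackup command: `start_backup`, `is_done`, and `free_space` are hypothetical callables you would supply yourself, and the limits of two concurrent jobs and a 60-second poll are assumptions.

```python
import time


def run_queue(directories, filers, start_backup, is_done, free_space,
              max_running=2, poll_seconds=60):
    """Start each backup on whichever filer currently has the most free space.

    start_backup(directory, filer) -> job handle   (hypothetical callable)
    is_done(job) -> bool                           (hypothetical callable)
    free_space(filer) -> comparable number         (hypothetical callable)
    Returns the (directory, filer) assignments in the order they were made.
    """
    pending = list(directories)
    running = []
    assignments = []
    while pending or running:
        # Drop jobs that have finished since the last poll.
        running = [job for job in running if not is_done(job)]
        # Fill any free slots, picking the emptiest filer each time.
        while pending and len(running) < max_running:
            target = max(filers, key=free_space)
            directory = pending.pop(0)
            running.append(start_backup(directory, target))
            assignments.append((directory, target))
        if running:
            time.sleep(poll_seconds)
    return assignments
```

Free space is re-read every time a slot opens, so the choice of target follows whatever the filers look like at that moment and not a plan made up front.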
Speaker:
But I think that's important, right?
Speaker:
You can always script some of these things if it doesn't
Speaker:
exist in the native tools, right?
Speaker:
Don't be afraid.
Speaker:
Yeah.
Speaker:
Don't be afraid.
Speaker:
You know, obviously I'm pretty good at scripting and
Speaker:
I'm pretty good at backup.
Speaker:
And, um, thanks.
Speaker:
Thanks very much to Veritas for keeping their, uh, their documentation online.
Speaker:
Uh, the number of times I Googled,
Speaker:
you know, backup job, you know, how do I list, uh, you know, and
Speaker:
I know there's a command to do this.
Speaker:
How do I do that?
Speaker:
And, you know, and then a man page would come up and I would read it
Speaker:
and I was like, oh, yeah, yeah, yeah.
Speaker:
It's
Speaker:
been a while.
Speaker:
Yeah.
Speaker:
Um.
Speaker:
You have to also thank Cygwin, of course.
Speaker:
Yes, special thanks to Cygwin. Without Cygwin...
Speaker:
That is the tool that you can download and run on any Windows
Speaker:
server to give you Unix capabilities.
Speaker:
I will say there were moments where Cygwin was both helpful and
Speaker:
terrorizing me, because it was the whole, like, backslash versus forward slash thing.
Speaker:
Because in Windows, you know, the file separator is a backslash, which
Speaker:
in Unix is an escape character,
Speaker:
Yep.
Speaker:
and Cygwin wasn't consistent about
Speaker:
when that escape character would be an escape character.
Speaker:
Like, if you piped it into a file, it would do one thing.
Speaker:
If you piped it into a command, it would behave differently.
Speaker:
And, um, so that, that definitely l lent.
Speaker:
The fact that I was doing constant file manipulation on directories
Speaker:
that were seven levels deep,
Speaker:
Yeah.
Speaker:
did not help.
Speaker:
Yeah.
Speaker:
Oh, and then the one thing with
Speaker:
Cygwin is that it doesn't see...
Speaker:
To point the backups in NetBackup, I have to point
Speaker:
'em at backslash backslash filer name, backslash
Speaker:
share name.
Speaker:
Cygwin doesn't see that.
Speaker:
Cygwin sees only mapped drive names
Speaker:
and
Speaker:
have to map it using
Speaker:
you have to map it to a drive name.
Speaker:
Let's say you map it to,
Speaker:
the letter F, and then in Cygwin you would see /cygdrive/f,
Speaker:
which would be the same as this backslash backslash mount.
Speaker:
You know, I was constantly having to go back and forth between
Speaker:
those two and, and that was fun.
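Going back and forth between the two path styles is mechanical once the drive mapping is written down. A minimal sketch, assuming a mapping from share to drive letter that you maintain yourself; the names `filer1` and `share1` are placeholders and not the names used on this job.

```python
def unc_to_cygdrive(unc_path, drive_map):
    r"""Translate \\filer\share\dir into /cygdrive/<letter>/dir.

    drive_map maps a share such as r"\\filer1\share1" to its drive letter.
    """
    normalized = unc_path.replace("\\", "/")
    for share, letter in drive_map.items():
        prefix = share.replace("\\", "/").rstrip("/")
        lowered = normalized.lower()
        if lowered == prefix.lower() or lowered.startswith(prefix.lower() + "/"):
            return f"/cygdrive/{letter.lower()}{normalized[len(prefix):]}"
    raise ValueError(f"no mapped drive for {unc_path!r}")


def cygdrive_to_unc(cyg_path, drive_map):
    r"""Translate /cygdrive/<letter>/dir back into \\filer\share\dir."""
    parts = cyg_path.split("/")
    if len(parts) < 3 or parts[0] != "" or parts[1] != "cygdrive":
        raise ValueError(f"not a /cygdrive path: {cyg_path!r}")
    letter = parts[2].lower()
    for share, mapped in drive_map.items():
        if mapped.lower() == letter:
            tail = [p for p in parts[3:] if p]  # ignore empty segments
            return "\\".join([share.rstrip("\\")] + tail)
    raise ValueError(f"drive {letter!r} is not in the map")


if __name__ == "__main__":
    drives = {r"\\filer1\share1": "F"}  # placeholder mapping
    print(unc_to_cygdrive(r"\\filer1\share1\cases\2021", drives))
    print(cygdrive_to_unc("/cygdrive/f/cases/2021", drives))
```

Keeping paths in one form inside a script and converting only at the point where a tool needs the other form also sidesteps most of the backslash escaping trouble.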
Speaker:
Um,
Speaker:
scripting
Speaker:
here's the thing.
Speaker:
After all of this experience and everything you've learned, you're probably
Speaker:
never gonna use any of this again.
Speaker:
I don't know about that.
Speaker:
I dunno about that.
Speaker:
I tell you what, I'm taking a tar of all those scripts that
Speaker:
I wrote, um, because I will say this: the NetBackup
Speaker:
documentation, while, uh, extensive, doesn't give a lot of examples.
Speaker:
And so, like, I'm thinking of, um, the bpduplicate command,
Speaker:
which is the command to copy backups from one place to another.
Speaker:
I couldn't figure out from reading the man page how to
Speaker:
actually do what I needed to do.
Speaker:
So I would, like,
Speaker:
I would have to run tests, you
Speaker:
know. And, um, you know, now that Cohesity's
Speaker:
acquiring them, it's not like they're now gonna rewrite their man pages.
Speaker:
I just thought that they could have used some more, some more examples.
Speaker:
But
Speaker:
Yeah.
Speaker:
I figured it out eventually.
Speaker:
You know, I think someone used to have a forum that people would post on about.
Speaker:
Yeah, someone used to have that and then, but people stopped posting
Speaker:
on that forum, so I don't know
Speaker:
You know?
Speaker:
Um, where people are getting their help now,
Speaker:
but, uh,
Speaker:
Well, I'm glad that this is almost over,
Speaker:
yeah.
Speaker:
Yeah.
Speaker:
nearly over and I'm glad you're still alive,
Speaker:
I am alive.
Speaker:
I didn't kill anyone along the way.
Speaker:
I didn't scream at anyone.
Speaker:
Like the, the story that
Speaker:
you have heard were, were Curtis Cuss Preston.
Speaker:
I didn't scream at anyone.
Speaker:
yeah.
Speaker:
But I really, really, really think you should do an Office Space on those filers.
Speaker:
yeah.
Speaker:
Well, that would sort of defeat the purpose of the
Speaker:
but, uh, yeah, I like that idea.
Speaker:
Hmm.
Speaker:
Anyway.
Speaker:
Well, uh, thanks Prasanna for helping me, uh, sort of through this.
Speaker:
You were my constant counselor through this.
Speaker:
I think I learned a bunch.
Speaker:
I know usually I'm all about YouTube knowledge, but in this case it was
Speaker:
the Preston knowledge, so it was good.
Speaker:
I.
Speaker:
Yeah.
Speaker:
Yeah.
Speaker:
Uh, thanks, everybody else, for, uh, listening along with this sad, sad story
Speaker:
with I think a decent, happy ending.
Speaker:
That is a wrap.
Speaker:
The Backup Wrap-up is written, recorded, and produced by me, W. Curtis Preston.
Speaker:
If you need backup or DR
Speaker:
consulting, content generation, or expert witness work,
Speaker:
check out backupcentral.com.
Speaker:
You can also find links to my O'Reilly books on the same website.
Speaker:
Remember, this is an independent podcast, and any opinions that you
Speaker:
hear are those of the speaker
Speaker:
and not necessarily an employer.
Speaker:
Thanks for listening.