Experience the backup from hell in this eye-opening episode of The Backup Wrap-up. What started as a straightforward 40TB backup spiraled into a months-long battle with 400TB of data, failing tape drives, and directories containing hundreds of millions of files.
Host W. Curtis Preston shares his first-hand account of tackling this backup from hell, including the challenges of dealing with SMB protocol limitations, tape drive failures, and the infamous "million file problem." Learn why backing up 99 million files in a single directory isn't just challenging - it's nearly impossible over standard protocols.
Discover the solutions that finally worked, from switching to disk-based backup to implementing local tar backups. Whether you're a backup admin or IT professional, this episode offers valuable insights into handling extreme backup scenarios.
You've found The Backup Wrap-up, your go-to podcast for all things
backup, recovery, and cyber recovery.
In this episode, you'll hear the harrowing tale of what I'm
calling the backup from hell.
A project that started as a simple one-time backup, 40 terabytes
across two Synology boxes, that turned into a 400 terabyte nightmare
that took months to complete.
We're talking hundreds of millions of files with one directory alone
containing 99 million of them.
I'll share how I dealt with failing tape drives and ridiculously slow
backup speeds, and the ultimate solution that finally got the job done.
If you've ever wondered what happens when everything that could go wrong
with a backup actually goes wrong, this episode is for you.
Plus, you'll learn some valuable lessons about what to check
before starting a massive backup job.
By the way, if you don't know who I am, I'm W. Curtis Preston, AKA Mr.
Backup, and I've been passionate about backup and recovery for
over 30 years, ever since
I had to tell my boss that we had no backups of the production
database that we just lost.
I don't want that to happen to you, and that's why I do this show.
On this podcast, we turn unappreciated backup admins into Cyber Recovery Heroes.
This is The Backup Wrap-up.
Welcome to the show, and if I could ask you to just take one quick second
and, uh, subscribe or follow us so you can make sure that you get all of this
great content, that would be great.
I'm W. Curtis Preston, AKA Mr. Backup, and I have with me a guy that apparently owes Ben Kingsley a huge apology, Prasanna Malaiyandi.
How's it going?
Prasanna, why do you owe
an apology?
So everyone's probably like, who's Ben Kingsley?
If you don't know, he is an actor, and he played Gandhi in the movie Gandhi.
He did.
Right?
And for the longest time I was a little, not upset, but bothered by the fact that you have probably one of the most important Indian people in history being played by a guy with the name Ben Kingsley.
Exactly.
Yeah.
Ben Kingsley.
And so today I found out that Ben Kingsley is actually Indian.
Half Indian, I should say.
How about that?
Yeah,
what?
He's Anglo-Indian.
Anglo-Indian.
Yes.
It's like us.
You and me, we're Indian.
So his paternal side is from Gujarat.
Right.
And his mom's side I think is European.
His dad was a physician who was born in Kenya.
And Ben Kingsley's name is not actually Ben Kingsley.
It's Krishna Bhanji, I think.
Yeah.
Yeah.
Yeah.
And he realized that he wasn't getting called into the right casting roles when he was starting off his career.
So he's like, let me change my name.
And so he changed his name to Ben Kingsley, and people started calling him in, and he started getting roles.
Racism in early Hollywood? Say it isn't so.
Racism in current Hollywood? Say it isn't so.
He wouldn't be the only one to do so.
Yeah.
Yeah, so I apologize to Sir Ben Kingsley, uh, for all these years.
Yeah.
You were putting it in the same category as the quote-unquote Indian guy from the Short Circuit movie, who, I don't know his name, but he is very much not an Indian person.
Do you know who it was?
The name?
I'm looking it up.
Or it's also like how Apu from, uh, The Simpsons is not Indian.
Yeah, he's played by, um...
Oh, I know that.
I know the actor, but his name is escaping me.
So Fisher Stevens, is that right?
Fisher Stevens.
Yeah, Fisher Stevens.
Who?
Those of you that watch Succession will know him. Fisher Stevens was in Succession.
He was a lawyer, a smarmy lawyer. He always plays smarmy characters.
Yeah, I was just thinking, because I remember him from The Blacklist, where he plays Marvin, the lawyer.
Yeah, he's got kind of the lawyer face.
I'm glad that you finally realized the error of your ways.
But did you know he was Indian?
No, no, I didn't.
I guess I always brought it up just like you. Like, I would bring up Ben Kingsley playing Gandhi as just another example of, uh, you know, what would we call it? Brownface. I guess we'd call it brownface.
Yeah.
Yeah.
But people taking a role like that... there's been a lot of those great roles throughout the years.
You know, great roles played by, you know, people that are not of that ethnic group.
Yeah.
And I think maybe also at the time, right, there weren't many Indian actors in Hollywood at all.
And I would rather have the movie be made with someone who is non-Indian, because it's a great movie.
I don't know.
You've seen it. It's a good movie.
Yeah.
Yeah.
So I would rather have that than not have the movie at all.
Hmm, I see what you're saying.
I see what you're saying.
Yeah.
And of course, you know, we have the same challenge with, uh, Asian actors, right?
Uh, there's literally only three Chinese actors in all of Hollywood.
Like, if you look at the Chinese roles, they've gone to literally, like, one guy.
Uh, I forgot how many roles he's had, but he has had a prolific career playing every Chinese person that you know.
Um, but anyway, we're gonna talk about something that we've alluded to a little bit on the podcast.
Uh, sort of tell the final saga of what I'm calling the backup from hell.
Maybe, uh, we should probably phrase that slightly differently.
It's probably the...
the backup that keeps on giving.
The backup that... yeah.
Uh, what a mess.
The beginning of the story is that I was asked to do a backup of two Synology boxes that they were, uh, repurposing, right?
So they were, um, going to move the data.
They were gonna reuse these servers, but they wanted to get a backup of the data before they moved it off of them.
Backup is good.
Yeah,
Backup is good.
Yeah.
Apparently they hadn't had a backup of these servers before.
And, um, they said it was about 40 terabytes of data.
That's the information that I was given, and after I had started doing the backup, I very quickly realized that 40 terabytes might have been an understatement.
You found additional data as you went along, right?
Yeah.
Uh, so it turned out that it wasn't like 40 terabytes of data.
It was more like 400 terabytes of data.
Yeah, and I'm guessing because these were systems that were kind of probably off on the side; they hadn't been used in a while.
Like, that's, I think, the problem, and I think we talked about this in one of our episodes, about systems that kind of get stored away in the corner.
No one worries about them.
Right?
Like, do you leave your old backup systems powered on?
Right.
We just talked about that.
And so I think that becomes a challenge.
When you have these systems that are no longer actively being used, it kind of gets away from you.
Yeah.
Yeah.
And so the customer really didn't have any idea just how much data they were dealing with here. It turned out to be, like I said, close to half a petabyte of data.
Yeah.
And for you, that changes things significantly, because...
It changes the backup design, like, massively.
Yeah.
Yeah.
Because your backup target, I think you had mentioned previously, was like a server, right?
That you were backing this data up to?
I was backing it up via a server, a Windows server.
And, um, and tape, right?
I had tape, but it was sized for, um, you know, 40 terabytes.
The server and the tape library were basically the perfect size for that.
But as I started, I realized that it was filling up.
And again, this is my fault for not really looking at the size of the data before really jumping in there, but basically I realized very quickly that this was a whole lot more data than...
Than you expected. And I think, just kind of looking at lessons learned: if you're a backup admin who is being told, hey, this new application is coming online, make sure that you understand what the expected growth of that application is.
Because what you size for, say, a five terabyte database with a 1% growth rate is very different than, like, a file server with a 50% growth rate.
Yeah, exactly.
Um, and just because somebody says they have 10 terabytes of data doesn't mean
that they have 10 terabytes of data.
So you mentioned you had a backup server, you had a tape drive.
Is there a reason you chose to use tape?
Well, I mean, tape is great for long-term retention of data,
which is what this customer wanted.
They wanted to hold onto this data for a long period of time,
and that's where tape is great.
And tape also, uh, you know, if you're able to properly feed it, tape can actually be quite fast.
The challenge that I had when backing up this data was that, for various reasons, which I think by the end I sort of figured out the core reason for, the individual backups off of these filers were just slow.
Like how slow?
Like three and a half kilobytes a second slow.
So slower than a 56K modem back in the day. A 56K modem moves about seven kilobytes a second, so this was half dial-up speed.
Yeah.
Right.
And you can multiplex all you want.
So first off, you know, I was using NetBackup, which did a great job with what we had available.
Um, the challenge was that I couldn't put the client on the filers themselves.
There was allegedly a way to put a backup client on the filer, but I could never get that to work.
And so I had to back up over SMB, and because I'm backing up over SMB, I'm just limited in what I could get, right?
And because I'm backing up over SMB, the client is just the backup server,
right?
So instead of running a backup from two clients, I'm running a backup from one
client because that's the backup server.
I'm backing it up over SMB.
And because of that, I'm limited to the number of jobs I can run at one time.
NetBackup, um, says 99 jobs, which, you should say, gee, that sounds like a lot.
99 problems.
right?
But the thing is, towards the end, as I was running a lot of these backups, the aggregate speed of like 99 backups was only like 30 or 40 megabytes a second, you know?
And you're talking about 400 terabytes of data to back up.
400 terabytes of data. Doing the math, I backed up for months,
right?
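For context, a rough back-of-the-envelope check (illustrative numbers based on the figures above, not from the episode):

```
400 TB ≈ 400,000,000 MB
400,000,000 MB ÷ 35 MB/s ≈ 11.4 million seconds ≈ 132 days of continuous streaming
```

So even with everything running perfectly, that aggregate speed means months of wall-clock time.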
And I tried all these different things.
Uh, you know, was I running too many backups at a time?
Was I running not enough backups at a time?
You know, and then the problem is every test would take days or weeks.
I think we should mention one thing. You were talking about these tests taking days or weeks...
mm-hmm.
And then do you wanna mention sort of some of the issues you ran into with these long-running jobs, just due to infrastructure or other issues in the environment?
Yeah.
Yeah. Backups are not made to run over weeks or months.
Backup infrastructure just isn't made to work like that.
And so when you do backups over weeks or months, weird things happen that cause you, you know, consternation.
One of the things is, LTO tape drives are great, but we were using the half-height LTO drives, and as far as I could tell, their duty cycle was not meant to be a hundred percent for two months.
Right.
Um, they're meant to back up for, you know, several hours and then get a rest, and then back up several hours and then get a rest.
I was just beating the crap outta these things for weeks or months at a time.
And what would happen is, after some significant period of time, it would just throw a write error.
And that's fine when a backup runs for a few hours; then you just try again.
But if it took you two weeks or three weeks to get to that point and then you get a write error, um, then...
It's not like you could restart these jobs either, right?
I think that's what you were running into.
Yeah.
Well, I mean, I could restart 'em, but after a period of time, I eventually got to the point where I said, tape is not my friend.
Anybody who...
And this is coming from Mr. Backup.
Anybody who listens to this podcast knows that I am a friend of tape,
right?
I believe strongly in tape for a lot of reasons, but, and maybe my LTO friends can chime in here, I don't think that these specific tape drives were designed to be written to like this for weeks and months at a time, 24/7, with no rest.
Because I was multiplexing as many backups together as I could.
And when one backup would finish, I would just add another backup onto it, right?
Because I could.
What I couldn't do is say, well, let's do these 10 backups, let them run until they're finished, and then we'll do the next 10 backups.
And that would've given the tape drives a moment to breathe, I think.
But, uh, I couldn't do that, because we just didn't have that kind of time.
And so I was just, you know, tacking backups on.
And I know you've always talked about the shoe-shining problem.
Given that you're not going very fast with these backups, do you think that also led to some issues for the tape drives?
Yeah.
So again, the core problem was that each individual backup was running slow.
No matter how many of them I multiplexed together, it was not enough speed to make the tape drive happy.
And so, yes, the tape drive was shoe-shining.
And when a tape drive is continually shoe-shining, the tape drive will fail.
And so everything I remember learning about tape drives was coming back to haunt me, right?
Um, this is all of the design work that I had done throughout the years on backup, um, you know, backup system design.
All of the things like, you know, what do you do when the backups can't feed the drive?
And so I came to understand
that the only way I was gonna finish this backup was to do it to disk.
And just quickly, before you move on: I think along the way, didn't you also have a tape drive that failed that you then had to go swap out?
Oh, multiple, multiple times.
Swap out tape drives, reboot tape drives, put cleaning tapes in tape drives.
And by the way, that's another thing: the way tape drives normally work is you run them for a certain number of hours, and then there's a cleaning tape that goes in there and cleans them.
And when you have a robotic library, that happens automatically.
Well, when you just run the tape drive nonstop for two months, you know...
And so at some point the tape drive just fails.
Yeah.
Um, yeah.
And so I ultimately decided that the only way to get this done was to, um, you know, buy enough disk to back this up.
And that wasn't cheap.
Uh, but I didn't think that there was any other way that this was ever going to get done, 'cause again, the core problem that we've had with tape for the last three decades has been that if the backup isn't fast enough for the tape drive, it's a fundamental mismatch,
right?
And so we use multiplexing to make that better.
But if the speed you're dealing with is in kilobytes a second...
Yeah.
Well, and especially 'cause you're limited by those two, uh, Synology boxes, right?
Which are limiting your bandwidth, right?
It's not like there were more Synology boxes you could then pull from.
Yeah.
Yeah, and I was watching... like, I was running every kind of tool I could to make sure I wasn't overtaxing them.
That was the really weird part: it's not like the Synology boxes were saying, you're really beating the crap out of me, you shouldn't run so many backups at a time.
It wasn't that. I didn't have a high I/O wait.
I didn't have high CPU, I didn't have high RAM usage.
There was no rhyme or reason as to why. Well, we'll get to the rhyme or reason later.
I figured it out.
Um, but I knew tape, and I knew this wasn't gonna work.
So I had to bring in, uh, a couple of other Synology disk arrays, by the way, and populate them with enough disk to handle all of this backup.
Right.
Yeah,
And, um, then...
But that wasn't without its issues either, right? When you brought those in?
No, it wasn't without issues.
And the other thing I needed to do: in terms of the number of directories that were remaining, I wasn't sure of the different sizes.
So what I did was I split those jobs into many smaller jobs.
NetBackup is really good at, like, running thousands of jobs, right?
So rather than just have a hundred jobs, I turned that into like 2,400 jobs.
I went another level deep and created a policy for each of these directories, and then I ran those, and it was running for a while.
It was, you know, again, more time.
And what I started seeing were these individual jobs that were running an inordinate amount of time.
But you also had some jobs that would finish, like, super fast, right?
Some of 'em, some of 'em finished in five minutes.
But I noticed that over time there were certain policies that were running for really, really long periods of time, and eventually I started poking around.
That's when I discovered what ultimately was the true culprit.
And, uh, anyone who's been around backup for a long time
has seen this culprit before.
It's just, this is the worst example of this culprit that I've ever seen.
And what is that?
We affectionately refer to it as the million file problem.
Hmm.
Because remember, again, going back to that, um, that client from 25 years ago: we had one server that was going to be storing a bunch of images, and it was going to result in millions of files.
And we knew back then that the million file problem is a real problem.
And the million file problem over the network is even worse, right?
Because everything is a round trip.
The way we fixed it back then was we used a product called FlashBackup, which would back up at the raw level but store the file information. And that was not available to me.
Why?
Because that product no longer exists.
No.
And because it doesn't run on a Synology box.
Right.
Remember, I'm not on the Synology box.
All it was to me was an SMB mount.
Right?
And by the way, for those curious, yes, I tested SMB, I tested NFS.
It didn't matter.
It didn't matter.
Um, and by the way, this was a constant thing. You know the phrase, never go into battle with an untested weapon?
This was a constant example of: I am in the battle, I'm in the thick of it, and now I'm trying to test stuff.
And everything I did to try to make things better just made it take longer, and the client just had to wait.
And the client was incredibly patient, honestly.
And, you know, I did my best to say, look, I've been doing this for 30 years and I've never seen anything like this.
Right.
And that helped.
But in the end... we got down to... I learned a way to identify which were the problem directories.
So I would kick off a policy and I would watch, and I would notice that it had run for, let's say, an hour.
And it listed, let's say, 300,000 files backed up. And zero kilobytes.
Hmm.
Literally, there's a kilobytes column, and there was no value in there.
We backed up 300,000 files, no kilobytes.
So that helped me identify these problem policies.
Problem child.
Yeah.
So I'd stop it and let the other non-problem policies finish.
And...
Right.
Yeah.
I ended up getting down to like 150 policies that were the problem policies.
And so I backed them up. Over time, I was able to get them backed up, and then finally I got down to about 20 policies, I think, somewhere around there.
Go ahead.
And at this point, when you're down to the 20, some of these have been running for a long time, right? Like how long?
Like two months. Backups that had been running for two months, successfully running for two months.
Yeah.
And what was good was, at this point, again... like, this is information that would've been really helpful to have at the beginning, but to get all this information at the beginning would've taken time, and we just wanted to get started.
Yeah.
What I ended up finding was that, um, with these backups, there were millions and millions and millions of files. Like, one of the directories that I was backing up had 99 million files in it.
One directory, 99 million files. And eventually what I realized was that, again, the problem this time was just SMB.
So the fact that every one of these files results in a round
trip conversation, possibly multiple round trip conversations.
Yep.
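To put an illustrative number on that round-trip cost (an assumption-laden sketch, not a measurement from the episode):

```
99,000,000 files × 3 SMB round trips per file × 1 ms per round trip
  ≈ 297,000 seconds ≈ 3.4 days of pure protocol latency
```

And that's before a single byte of file data moves, on an optimistic 1 ms LAN; the real per-file overhead here was clearly far worse.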
And I realized that the only way I was gonna back up these truly problem
directories was to back them up locally.
But how do I back them up locally?
Well, luckily, this is when you just, you know, basically go back to dumb old backup tools.
And so I was able to run a backup using tar, logged in locally on the filers, and then just direct the tarball across the network.
That finally worked.
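A minimal sketch of that kind of local tar backup, with hypothetical share and host names; the point is that the millions of file opens happen locally on the filer, and only one stream crosses the wire:

```bash
# Run this logged in locally on the Synology, so file enumeration stays local.
# One tar stream crosses the network instead of millions of SMB round trips.
tar cf - /volume1/bigshare | ssh backupserver 'cat > /backups/bigshare.tar'
```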
That's crazy.
So you had these 20 jobs, right?
And some of them, you said, were running for 60-plus days, and then you sort of were like, okay, let me start this over.
And by the way, you were kind of forced to start them over, because something happened, right?
Yeah. Something, some unknown thing.
Um, I actually don't know what caused it, but they did fail.
And...
And you were like, I'm not gonna start these over.
Yeah, I'm not gonna start 'em again.
Well, because, like, one of the jobs, the one with the 99 million files, we were nowhere near done.
Yeah.
After 60 days you were barely...
Yeah, we were barely, barely scratching the surface.
So I'm like, I don't have the amount of time that this would take. So I switched to, you know, experimenting once again. I'm experimenting on the fly, I'm doing development in production.
Uh, I was like, well, let me see how quick a tarball would run.
I ran a tarball. I remember, for like a day. You remember this?
I ran it for a day, and I had a du of the size of the directory, and after a day it had done like half of it or something.
Yeah.
You're like, what? The one that ran 66 days barely scratched the surface.
Yeah.
And you were nearly done, almost done, within a day.
Yeah.
And so I was like, this is the way.
Right.
So it wasn't the way for everything, because, you know, I'm glad that I used NetBackup for the bulk of it, because then I have the catalog data.
But on the restore side...
Yeah, yeah.
So the restores will be more difficult for these remaining 20 directories.
I mean, not astronomically so.
Like, you know, you can create a list of what's in a tarball.
So, you know, lessons learned: don't do that.
Don't store millions of files on the other side of an SMB mount, I guess.
Yeah. Well, and I think a couple things. Even if it's not SMB, right?
Just having that many files... because I think what people don't realize is, even though the size of every disk has gotten significantly larger, right, you're talking like 18 terabyte, 20 terabyte disks...
Yeah.
They can only handle so many operations per disk, right?
That number hasn't changed.
It's about a hundred per second.
And so it doesn't matter how big your disk is, right?
If it was 20 one-terabyte disks, you'd get 20 times a hundred IOPS.
Versus if it's one 20-terabyte disk, you still only get that hundred.
So that's a big thing that people don't realize with these larger size disks.
Yeah.
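Spelled out, the arithmetic Prasanna is pointing at (illustrative spinning-disk numbers):

```
20 × 1 TB disks: 20 spindles × ~100 IOPS each ≈ 2,000 IOPS
 1 × 20 TB disk:  1 spindle  × ~100 IOPS      ≈   100 IOPS
```

Same total capacity, one twentieth the operations per second.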
And the thing was, with that many files... because ultimately the problem wasn't disk I/O. The problem was network I/O.
Right?
Network latency.
So, because...
When I actually ran the tars, I ran two tarballs simultaneously is what I did.
I was always running two at a time.
When I was running two at a time, I/O wait was sitting at 10, which is high.
But I was like, well, it's got nothing else going on, so I'm letting it go.
Right?
The highest I/O wait I saw during all of those hundreds of simultaneous backups was like four.
yeah,
So like I wasn't disk bound.
I was network bound, but not network bound in terms of throughput. Network bound in terms of latency and number of operations, just because SMB is very chatty.
Very chatty.
It's probably the chattiest of the protocols,
And, you know, it was just a really bad combination.
Yeah.
And this is why backup vendors have their own protocols, like Data Domain has Boost, right?
To help alleviate and solve some of these issues.
Yeah.
You talked about... somewhere we were talking about, just don't do this. I'd like to talk about that.
When I looked at these, uh, these directories that had these tens of millions of files, it was a structure that was very clearly created by some application.
Each of these directories had a common structure created by some, I'm gonna say, stupid application that thought this was perfectly fine.
That it was perfectly fine to create 99 million files for one item.
Do you know, I...
I bet they were using the file system as a database, given just, like, the number of files and the size of those files.
I don't know what it was.
I know it was forensic-type information, and I don't, I clearly...
That's fine.
Yeah, yeah.
No, I'm just saying I clearly don't know enough about forensic stuff to know why they would want tens of millions of files.
But...
So where are you?
So you talked about these 20 jobs that you were starting to do tarballs with.
So where are you right now?
So we finished all of them but one. There was one where, for some reason, the file didn't look right.
It was weird.
Um, the backup completed, but for some reason the tarball just didn't look right.
I don't wanna go into details. It just didn't look right.
So I'm rerunning that one.
Based on its size and how well it's doing, it should finish in about a day or so.
Um, and what I'm...
Which is a significant improvement in terms of time.
A significant improvement. A day versus, you know, a year.
Or two, I think actually it might have been two.
Yeah.
Agreed.
Um, and what I'm doing, because again, I don't have the catalog: what I'm currently running is a tar tvf on all of those tarballs, creating text files, a list of the files that are in there.
And then I'm gonna do a count of the files that are in there and check it against a count of the files that are in the directory.
And hopefully those numbers should be the same.
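A minimal sketch of that verification pass, with hypothetical file names (the counts are approximate, since both listings include directory entries):

```bash
# List the tarball's contents into a text file: the poor man's catalog
tar tvf /backups/bigshare.tar > bigshare_contents.txt
# Count the entries in the tarball
wc -l < bigshare_contents.txt
# Count what's actually in the source directory; the numbers should match
find /volume1/bigshare | wc -l
```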
Yeah, because I believe you were even saying that to run things like a find, to get a list of all the files in a directory, or a du...
Yeah.
...took hours, right?
Well, it was days, actually.
In fact, that's why I didn't have this information in the beginning, because everything was so big, and every find, every du, every command that I ran took forever.
du is quicker than find.
It just does less work than find.
But the problem that I ultimately realized was that du wasn't really being helpful in terms of the scope of the job.
The scope of the job was determined by the number of these files.
And I couldn't get those numbers, because that was the thing that took forever.
Once the number of jobs dwindled down to about 20, that's when I was able to run these, uh, finds, and they would actually complete.
And that's when I realized just how bad it was.
So if you had to start this over, and hopefully you never do, but I'm just saying, if you had to go back to day one, what would you do differently?
I know you talked about making sure you understand the size of your backups.
Right.
It just feels like with some of these, you just have to go through the process, though, because you don't know what to do.
Like, it's not like you could just start day one and be like, oh, I know I need to go to disk, I need to do X, Y, and Z.
Right?
It's sort of like a learning process.
I would say that, yeah... because the problem is you're going off into the unknown.
You're doing a backup of something, and you don't know what it is.
And I would say, if at all possible, get things like du's. Uh, you know, du, disk usage, it's a Unix command, but you can load those tools on Windows as well.
Like, if you're gonna back up a hundred directories, get a du of every one of those directories, so that you have an idea of just what you're dealing with, if at all possible.
Also, look and see what the number of files is.
You know, it's not that hard. You just run a find dot... I didn't even do a -print, just find . piped to wc -l, right?
That was it.
Right?
Um, to get the number of files.
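Something like this, run up front, would have surfaced the monster directories before the job started; a sketch with a hypothetical path:

```bash
# Pre-flight survey: a size and a file count for every top-level directory
for d in /volume1/share/*/; do
    echo "== $d"
    du -sh "$d"          # total size of the directory tree
    find "$d" | wc -l    # number of files (and directories) in it
done
```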
I'd say, again, if I could go back in time, I would maybe do a little bit more of this research prior to beginning the job.
Um, but it's easy to say that now, um, because I know what I know.
Right.
Um, but, you know, the core problem was that you've got these millions of files.
Which is already gonna be a problem if you're backing it up in any sort of normal way.
But if you're backing it up remotely over the network, it's going to kill you.
Yeah.
So, um, you gotta figure out a way to deal with that.
And then I would just say, see if there's anything that you can do with the application that's creating this data.
Which is why it's important to get involved early on, right, when an application is being developed or deployed, so they understand the backup requirements.
yeah.
And so, this backup that would never finish... I literally was starting to think that this thing was never gonna finish.
Um, it's essentially done. I mean, at this point, it's not a hundred percent, but I'm now, you know, I'm
at the finish line.
Yeah.
You're at the finish line.
Yeah.
Um, it's nice.
I know one of the other things you mentioned is that you were using NetBackup, but you had also looked at other tools out there as well, right?
That could potentially help you with this effort.
Right.
So do you think that becomes valuable, like, either looking at other tools... um, I know you had reached out to Synology support, you had reached out to some experts...
Yeah.
Yeah.
The problem was... there were things you could do. Like with Synology, you can copy the data from A to B.
Mm-Hmm.
They have this ability, essentially, like, you know, for lack of a better word, they have SnapMirror.
They have the equivalent of SnapMirror.
Yep.
From one Synology box to another.
But to me, that wasn't really a backup. I wanted it in a format... you know, in the end, I was forced to not do what I wanted with the tar.
Um, but I wanted it in a cataloged format.
So we looked at a couple of tools, but the problem was never NetBackup.
Right?
NetBackup made it, um, easy to script this whole thing, and scripting was the only way I could make sense of it.
'Cause it was thousands of directories and, um, even more thousands of subdirectories under those directories.
And the only way I could make sense of this was to script it all.
And, um, the fact that NetBackup allowed me to do that was great.
Um, there are some other tools these days, some of the newer tools, that want to make it easy for you.
But if you get into a complicated situation like this, some of the newer tools don't even have the ability to sort of grab it by the horns
the way you're able to with NetBackup.
Yeah.
I think the other thing also that you were doing, which I thought was interesting, was your scripting, right?
Trying to automate this. Like, uh, scheduling the backup policies to run, right?
And then you were sort of doing load balancing to make sure that you keep the two filers balanced.
Yeah.
Yeah.
Yeah, that was the thing.
Normally, I just believe in throwing everything into the NetBackup scheduler and letting it figure it out.
But because, again, because of the limitations of the weird setup I had, I couldn't figure out a way to load balance across the two target filers with the NetBackup scheduler.
Um, maybe I could have, uh, done that better.
I don't know.
But, uh, so the way I was doing it was I was just assigning a backup.
When a backup would finish, I would assign the next backup to the filer that now had more space available to it.
Right.
So I just had a while loop that was running, you know, checking to see if a backup job was done.
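Curtis didn't walk through the actual script, but a minimal sketch of that loop might look like this, assuming NetBackup's bpdbjobs and bpbackup commands and a hypothetical policy list; treat it as an illustration, not the real thing:

```bash
#!/bin/bash
# Sketch only: start the next policy whenever a job slot frees up.
MAX_ACTIVE=10                               # hypothetical concurrency cap
while read -r policy; do
    # Wait until the number of active NetBackup jobs drops below the cap
    while [ "$(bpdbjobs -report | grep -c Active)" -ge "$MAX_ACTIVE" ]; do
        sleep 300
    done
    echo "$(date): starting policy $policy"
    bpbackup -i -p "$policy"                # -i starts the policy's backup now
done < policies.txt
```

In Curtis's version, the choice of which policy to start next was driven by which target filer had more free space.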
But I think that's important, right?
You can always script some of these things if they don't exist in the native tools, right?
Don't be afraid.
Yeah.
Don't be afraid.
You know, obviously I'm pretty good at scripting, and I'm pretty good at the backup stuff.
And, um, thanks. Thanks very much to Veritas for keeping their, uh, documentation online.
Uh, the number of times I Googled, you know, backup job, how do I list... you know, I know there's a command to do this. How do I do that?
And, you know, then a man page would come up, and I would read it, and I was like, oh, yeah, yeah, yeah.
It's been a while.
Yeah.
Um.
You have to also thank Cygwin, of course.
Yes, special thanks to Cygwin. Without Cygwin...
That is the tool that you can download and run on any Windows server to give you Unix capabilities.
I will say there were moments where Cygwin was both helping me and terrorizing me, because of the whole backslash versus forward slash thing.
Because in Windows, you know, the file separator is a backslash, which in Unix is an escape character.
Yep.
And Cygwin wasn't consistent about when that escape character would be treated as an escape character.
Like, if you piped it into a file, it would do one thing.
If you piped it into a command, it would behave differently.
And, um, so that definitely added to the pain.
The fact that I was doing constant file manipulation on directories that were seven levels deep did not help.
Yeah.
Yeah.
Oh, and then the one thing with Cygwin is that it doesn't see UNC paths.
To point NetBackup at the backups, I have to point 'em at backslash backslash filer name, backslash share name.
Cygwin doesn't see that.
Cygwin sees only mapped drive names.
And you have to map it using...
You have to map it to a drive letter.
Let's say you map it to the letter F, and then in Cygwin you would see /cygdrive/f, which would be the same as that backslash backslash mount.
You know, I was constantly having to go back and forth between those two, and that was fun.
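Concretely, the dance looks something like this, with hypothetical filer and share names:

```bash
# In Windows cmd, map the UNC path to a drive letter first,
# because Cygwin can't use \\filername\sharename directly:
#   net use F: \\filername\sharename
# In Cygwin, the mapped drive then shows up under /cygdrive:
ls /cygdrive/f
```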
Um...
So, about all this scripting, here's the thing.
After all of this experience and everything you've learned, you're probably
never gonna use any of this again.
I don't know about that.
I dunno about that.
I tell you what, I'm taking a tar of all those scripts that I wrote. Um, because I will say this: the NetBackup documentation, while, uh, extensive, doesn't give a lot of examples.
And so, like, I'm thinking of, um, the bpduplicate command, which is the command to copy backups from one place to another.
I couldn't figure out from reading the man page how to actually do what I needed to do.
So I would have to run tests, you know.
And, um, you know, now that Cohesity's acquiring them, it's not like they're gonna rewrite their man pages.
I just thought that they could have used some more examples.
But...
Yeah.
I figured it out eventually.
You know, I think someone used to have a forum that people would post this stuff on.
Yeah, someone used to have that, and then people stopped posting on that forum.
So I don't know, you know, where people are getting their help now.
But, uh...
Well, I'm glad that this is almost over.
yeah.
Yeah.
Nearly over, and I'm glad you're still alive.
I am alive.
I didn't kill anyone along the way.
I didn't scream at anyone.
Like the stories you may have heard from back when I was Curtis "Cuss" Preston.
I didn't scream at anyone.
yeah.
But I really, really, really think you should do an Office Space on those filers.
Yeah.
Well, that would sort of defeat the purpose of the backup.
But, uh, yeah, I like that idea.
Hmm.
Anyway.
Well, uh, thanks, Prasanna, for helping me, uh, sort of work through this.
You were my constant counselor through this.
I think I learned a bunch.
I know usually I'm all about YouTube knowledge, but in this case it was the Preston knowledge, so it was good.
Yeah.
Yeah.
Uh, thanks, everybody else, for, uh, listening along with this sad, sad story with, I think, a decent, happy ending.
That is a wrap.
The Backup Wrap-up is written, recorded, and produced by me, W. Curtis Preston.
If you need backup or DR consulting, content generation, or expert witness work, check out backupcentral.com.
You can also find links to my O'Reilly books on the same website.
Remember, this is an independent podcast, and any opinions that you hear are those of the speaker and not necessarily those of an employer.
Thanks for listening.