Oct. 28, 2024

Backup from Hell: SMB vs 400TB

Experience the backup from hell in this eye-opening episode of The Backup Wrap-up. What started as a straightforward 40TB backup spiraled into a months-long battle with 400TB of data, failing tape drives, and directories containing hundreds of millions of files.

Host W. Curtis Preston shares his first-hand account of tackling this backup from hell, including the challenges of dealing with SMB protocol limitations, tape drive failures, and the infamous "million file problem." Learn why backing up 99 million files in a single directory isn't just challenging - it's nearly impossible over standard protocols.

Discover the solutions that finally worked, from switching to disk-based backup to implementing local tar backups. Whether you're a backup admin or IT professional, this episode offers valuable insights into handling extreme backup scenarios.

Transcript
Speaker:

You found The Backup Wrap-up, your go-to podcast for all things



Speaker:

backup recovery and cyber recovery.



Speaker:

In this episode, you'll hear the harrowing tale of what I'm



Speaker:

calling the backup from hell.



Speaker:

A project that started as a simple one-time backup, a 40 terabyte



Speaker:

backup of two Synology boxes that turned into a 400 terabyte nightmare



Speaker:

that took months to complete.



Speaker:

We're talking hundreds of millions of files with one directory alone



Speaker:

containing 99 million of them.



Speaker:

I'll share how I dealt with failing tape drives, ridiculously slow



Speaker:

backup speeds, and the ultimate solution that finally got the job done.



Speaker:

If you've ever wondered what happens when everything that could go wrong



Speaker:

with the backup actually goes wrong.



Speaker:

This episode is for you, plus you'll learn some valuable lessons about what to check



Speaker:

before starting a massive backup job.



Speaker:

By the way, if you don't know who I am, I'm W. Curtis Preston, AKA Mr.



Speaker:

Backup, and I've been passionate about backup and recovery for



Speaker:

over 30 years, ever since



Speaker:

I had to tell my boss that we had no backups of the production



Speaker:

database that we just lost.



Speaker:

I don't want that to happen to you, and that's why I do this show.



Speaker:

On this podcast, we turn unappreciated backup admins into Cyber Recovery Heroes.



Speaker:

This is The Backup Wrap-up.



Speaker:

Welcome to the show, and if I could ask you to just take one quick second



Speaker:

and, uh, subscribe or follow us so you can make sure that you get all of this



Speaker:

great content, that would be great.



Speaker:

I'm W. Curtis Preston, AKA Mr.



Speaker:

Backup, and I have with me a guy that apparently owes Ben Kingsley



Speaker:

a huge apology, Prasanna Malaiyandi.



Speaker:

how's it going?



Speaker:

Prasanna, why do you owe



Speaker:

an apology?



Speaker:

So everyone's probably like, who's Ben Kingsley?



Speaker:

So if you don't know, he is an actor and he also played Gandhi in the movie Gandhi.



Speaker:

He did.



Speaker:

Right?



Speaker:

And for the longest time I was a little, not upset, but like the fact that you have



Speaker:

like probably one of the most important Indian people in history being played



Speaker:

By a guy with the name Ben Kingsley.



Speaker:

Exactly.



Speaker:

Yeah.



Speaker:

Ben Kingsley.



Speaker:

And so today I found out that Ben Kingsley is actually Indian.



Speaker:

Half



Speaker:

How about that?



Speaker:

should say.



Speaker:

Yeah,



Speaker:

what?



Speaker:

he's Anglo Indian.



Speaker:

Anglo Indian.



Speaker:

Yes.



Speaker:

It's like us.



Speaker:

You and me we're Indian.



Speaker:

so his paternal side is from Gujarat.



Speaker:

Right.



Speaker:

And his mom's side I think is European.



Speaker:

His dad was a physician who was born in Kenya.



Speaker:

And Ben Kingsley's name is not actually Ben Kingsley.



Speaker:

It's like Krishna Bhanji, I think



Speaker:

Yeah.



Speaker:

Yeah.



Speaker:

Yeah.



Speaker:

And he realized that he wasn't getting called into the right casting



Speaker:

roles when he was looking for, when he was starting off his career.



Speaker:

So he is like, let me change my name.



Speaker:

And so he changed his name to Ben Kingsley and people started calling



Speaker:

him in and he started getting roles.



Speaker:

Racism in early Hollywood? Say it isn't so.



Speaker:

Racism in current Hollywood.



Speaker:

Say it isn't so. He wouldn't be the only one to do so.



Speaker:

Yeah.



Speaker:

yeah, so I apologize to Sir Ben Kingsley, uh, for all these years.



Speaker:

Yeah.



Speaker:

You were putting it in the same category as the quote unquote



Speaker:

Indian guy from the Short Circuit movie, which I don't know his name,



Speaker:

but he is very much not an Indian



Speaker:

person.



Speaker:

Do you know who it was?



Speaker:

the name?



Speaker:

I'm looking it up.



Speaker:

Or it's also like how Apu from, uh, the Simpsons is not Indian,



Speaker:

Yeah, he's, he's played by, um.



Speaker:

Oh, I know that.



Speaker:

I know the actor, but his name is escaping me.



Speaker:

So Fisher Stevens, is that



Speaker:

Fisher Stevens.



Speaker:

Yeah, Fisher Stevens.



Speaker:

Who?



Speaker:

Those of you that watch Succession



Speaker:

will, uh, uh, Fisher Stevens was in Succession.



Speaker:

He was, he was a, a lawyer, a a smarmy lawyer, which



Speaker:

always plays smarmy characters



Speaker:

Yeah, I was just thinking, because I remember him from The Blacklist



Speaker:

where he plays Marvin, the lawyer.



Speaker:

Yeah, he's got kind of the lawyer face.



Speaker:

I'm glad that you, you finally realized the error of your ways.



Speaker:

But did you know he was



Speaker:

No, no, I didn't.



Speaker:

I guess I always brought it up just like you, like I would bring



Speaker:

up Ben Kingsley playing Gandhi, um, as just another example of, uh, you



Speaker:

know, what would we call it, brown face, I guess we'd call it brown face.



Speaker:

Yeah.



Speaker:

Yeah.



Speaker:

But people taking actor and there's been a lot of those great roles throughout the



Speaker:

Great.



Speaker:

You know, great roles played by very not,



Speaker:

you know, people that are not of that ethnic group.



Speaker:

Yeah.



Speaker:

and I think maybe also at the time, right, there weren't many



Speaker:

Indian actors in Hollywood at all.



Speaker:

And I would rather have the fact, or I would rather it like the movie be made



Speaker:

with someone who is non-Indian, rather, because it's a great movie.



Speaker:

I



Speaker:

don't know.



Speaker:

You've seen it,



Speaker:

good movie.



Speaker:

Yeah.



Speaker:

Yeah.



Speaker:

So I would rather have that rather than not having the movie at all.



Speaker:

Hmm, I see what you're saying.



Speaker:

I see what you're saying.



Speaker:

Yeah.



Speaker:

And of course, you know, we have the same challenge with, uh, Asian, uh, actors,



Speaker:

right?



Speaker:

Uh, there's literally only three Chinese actors in all of Hollywood.



Speaker:

Like if you, if you look at like the Chinese roles, they've gone to



Speaker:

literally like one, there's one guy.



Speaker:

Uh, I forgot how many roles he's had, but he has had a prolific career playing



Speaker:

every Chinese person that you know.



Speaker:

Um, but, um, anyway, so we're gonna talk about something that we've



Speaker:

alluded to a little bit on the podcast.



Speaker:

Uh, sort of tell the final saga of what I'm calling the backup from Hell.



Speaker:

I may maybe, uh, we should probably phrase that slightly differently.



Speaker:

It's probably the,



Speaker:

the backup that keeps giving.



Speaker:

the back, the backup that, yeah.



Speaker:

Uh, what a mess.



Speaker:

The beginning of the story



Speaker:

that I was asked to do a backup of two Synology boxes that they



Speaker:

were, uh, repurposing, right?



Speaker:

So they were, um, going to move the data.



Speaker:

They, they were gonna reuse these servers, but they wanted to get a backup of the, of



Speaker:

the, the data before they moved it off of



Speaker:

Backup is good.



Speaker:

Yeah,



Speaker:

Backup is good.



Speaker:

Yeah.



Speaker:

Apparently they hadn't had a backup of the, of these servers before.



Speaker:

And, um, then the, the, um, and, and , they said it was



Speaker:

about 40 terabytes of data.



Speaker:

That's the information that I was given and after I had started doing



Speaker:

the backup, I very quickly realized that 40 terabytes might have been.



Speaker:

An understatement.



Speaker:

You, found additional data around



Speaker:

right as you



Speaker:

data.



Speaker:

Yeah.



Speaker:

Uh, so it turned out that it wasn't like 40 terabytes of data.



Speaker:

It was more like 400 terabytes of



Speaker:

Yeah, and



Speaker:

I'm guessing because these were systems that were kind of probably off on the



Speaker:

side, they hadn't been used in a while.



Speaker:

Like that's, I think, the problem, and I think we talked about this in one of



Speaker:

our episodes about sort of systems that kind of get stored away in the corner.



Speaker:

No one worries about



Speaker:

it.



Speaker:

Right?



Speaker:

And do you leave it powered on your old backup systems?



Speaker:

Right.



Speaker:

We just talked about that.



Speaker:

And so I think that becomes a challenge.



Speaker:

It's when you have these systems that are no longer actively being



Speaker:

used, it kind of gets away from you.



Speaker:

Yeah.



Speaker:

Yeah.



Speaker:

And so the customer really didn't have any idea just how much data that they



Speaker:

were dealing with here, out to be, like I said, like close to half a petabyte of



Speaker:

Yeah.



Speaker:

And, and for you, that changes things significantly because



Speaker:

changes the backup design like massively.



Speaker:

Yeah.



Speaker:

Yeah.



Speaker:

because your backup target, I think you had mentioned previously



Speaker:

that it was like a server, right?



Speaker:

That you were backing this data up to



Speaker:

I was backing it up via a server, a Windows server.



Speaker:

And, um, and tape, right?



Speaker:

Had tape, but it's sized for, um, you know, 40 terabytes.



Speaker:

And so, which is, which is basically the, the, the server and the tape



Speaker:

library was the perfect size for that.



Speaker:

But then I started realizing that it was filling up.



Speaker:

And again, this, this is my fault for not really looking at the size of the



Speaker:

data before really jumping in there, but basically I realized very quickly



Speaker:

that this was a whole lot more data



Speaker:

than,



Speaker:

than,



Speaker:

Than you expected. And I think, just kind of looking at lessons



Speaker:

learned as you're a backup admin who is being told, Hey, this new



Speaker:

application is coming online.



Speaker:

Make sure that you understand like what is the expected growth of that application.



Speaker:

Because what you size for, say, a five terabyte database with a 1% growth



Speaker:

is very different than like a file server with like a 50% growth rate.



Speaker:

Yeah, exactly.



Speaker:

Um, and just because somebody says they have 10 terabytes of data doesn't mean



Speaker:

that they have 10 terabytes of data.



Speaker:

So you mentioned you had a backup server, you had a tape drive.



Speaker:

Is there a reason you chose to use tape



Speaker:

Well, the, I mean, tape is great for long-term retention of data,



Speaker:

which is what this customer wanted.



Speaker:

They wanted to hold onto this data for a long period of time,



Speaker:

and that's where tape is great.



Speaker:

And tape also, uh, you know, if you're able to properly feed it,



Speaker:

tape is actually, can be quite fast.



Speaker:

The challenge that I had when backing up this data was that, for various reasons,



Speaker:

which I think I, I think by the end I sort of figured out the, the



Speaker:

core reason for various reasons.



Speaker:

Individual backups off of the, these filers, the, they were just



Speaker:

slow, just, um, you know, they were



Speaker:

Like how slow, slow.



Speaker:

like, slow, was like, like, like three and a half kilobytes a second slow.



Speaker:

So like slower than like a 56 K modem back



Speaker:

Yeah.



Speaker:

Right.



Speaker:

And you can multiplex all you want.



Speaker:

So first off, you know, I, I was using NetBackup, which, you know, NetBackup, it



Speaker:

did a great job at what we had available.



Speaker:

Um, the, challenge was that because I couldn't put.



Speaker:

The client on the filers themselves.



Speaker:

So there was a way, allegedly, to put a backup client on the filer,



Speaker:

but I could never get that to work.



Speaker:

And so I had to back up over SMB because I'm backing up



Speaker:

over SMB, I'm just, I'm just.



Speaker:

I'm just limited at what that was, right?



Speaker:

What,



Speaker:

I could get, and because I'm backing up over SMB, the client



Speaker:

is just the backup server,



Speaker:

right?



Speaker:

So instead of running a backup from two clients, I'm running a backup from one



Speaker:

client because that's the backup server.



Speaker:

I'm backing it up over SMB.



Speaker:

And because of that, I'm limited to the number of jobs I can run at one time.



Speaker:

NetBackup, um, says 99 jobs, which you'd say, gee, that



Speaker:

sounds like a



Speaker:

Nine problems.



Speaker:

right?



Speaker:

But, but the thing is, towards the end, as I was running a lot of these backups,



Speaker:

the aggregate speed of like 99 backups was only like 30, 40 megabytes a second,



Speaker:

you



Speaker:

you're talking about 400 terabytes of data to



Speaker:

400 terabytes of data doing the math.



Speaker:

I backed up for months,



Speaker:

right?
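
The "doing the math" remark can be checked with a quick back-of-the-envelope calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly steady aggregate rate with no failures or restarts, which the real job did not have; the function name is just for illustration.

```python
# Rough transfer-time estimate for the figures quoted in the conversation:
# about 400 TB moved at an aggregate 30 to 40 MB/s.
TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)
SECONDS_PER_DAY = 86_400


def transfer_days(total_bytes: float, bytes_per_second: float) -> float:
    """Days needed to move total_bytes at a constant rate."""
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


for rate_mb in (30, 40):
    days = transfer_days(400 * TB, rate_mb * MB)
    print(f"{rate_mb} MB/s -> about {days:.0f} days")
```

Even in this idealized form the answer lands somewhere between roughly four and five months, which is consistent with a backup that ran for months.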



Speaker:

And I tried all these different things.



Speaker:

Uh, you know, num, you know, was I running too many backups at a time?



Speaker:

Was I running not enough backups at a time?



Speaker:

You know, it, um, you know, and then the problem is every, every



Speaker:

test would take days or weeks.



Speaker:

Think we should mention one thing.



Speaker:

You were talking about these tests taking days or



Speaker:

mm-hmm.



Speaker:

and then do you wanna mention sort of some of the issues you ran into with these long



Speaker:

running jobs just due to infrastructure or



Speaker:

Yeah.



Speaker:

other issues in the environment?



Speaker:

Yeah, you know, backups are not made to run over weeks or months.



Speaker:

Just backup infrastructure isn't made to work like that.



Speaker:

And so when you do backups over weeks or months.



Speaker:

Weird things happen that, cause you know, consternation, one of the things



Speaker:

is LTO tape drives are great, but like we were using like the half high LTO



Speaker:

drives and as far as I could tell, their duty cycle was not meant to



Speaker:

be a hundred percent for two months.



Speaker:

Right.



Speaker:

Um, they're meant to be backed up for, you know, several hours and then give



Speaker:

'em a rest and then back up several hours and then give 'em a rest.



Speaker:

I was just beating the crap outta these things for weeks or months at a time.



Speaker:

And what would happen is after some significant period of time,



Speaker:

it would just go write error.



Speaker:

And that's fine when a backup runs for a few hours and then just try again.



Speaker:

But if you, but if it took you two weeks or three weeks to get to that point



Speaker:

and then you get a write error, um,



Speaker:

then



Speaker:

it's not like you could restart these jobs either, right?



Speaker:

I think you're running into



Speaker:

Yeah.



Speaker:

Well,



Speaker:

I mean, I mean, I could restart em, but, but it's like after



Speaker:

a period of time I became, I.



Speaker:

I eventually got to the point where I said, tape is not my friend.



Speaker:

I, anybody who



Speaker:

this is coming from Mr.



Speaker:

Backup.



Speaker:

know anybody who listens to this podcast knows that I am, I am a friend of tape,



Speaker:

right?



Speaker:

I believe strongly in tape for a lot of reasons, but I don't think that, uh,



Speaker:

specific and, and you know, maybe the, my LTO friends can chime in here, but I don't



Speaker:

think that these tape drives were designed to be backed up to like this for weeks and



Speaker:

months at a time, 24 7 with no, because as soon as one, I was multiplexing



Speaker:

as many backups together as I could.



Speaker:

And when one backup would finish, I would just add another backup onto it, right?



Speaker:

Because



Speaker:

I, I could, I could, I.



Speaker:

what I couldn't do is I couldn't say, well, let's do these 10 backups, let



Speaker:

them run until they're finished, and then we'll do the next 10 backups.



Speaker:

And that would've given the tape drives a, a moment to breathe, I think.



Speaker:

But, uh, I couldn't do that because the, because we, we just



Speaker:

didn't have that kind of time.



Speaker:

And so I



Speaker:

was just, I was just try, you know, tagging it



Speaker:

and, and I know you've always talked about like the shoe shining problem,



Speaker:

given that you're not going very fast with these backups, right.



Speaker:

Do you think that also led to some issues as well for the tape drives?



Speaker:

yeah.



Speaker:

So again, the core problem was that each individual backup was running slow.



Speaker:

matter how many of them that I multiplex together, it was not enough



Speaker:

speed to make the tape drive happy.



Speaker:

And so, yes, the tape drive is shoe-shining.



Speaker:

And when a tape drive is continually shoe-shining, the tape drive will fail.



Speaker:

And so everything I remember learning about tape drives was



Speaker:

coming back to haunt me, right?



Speaker:

Um, this is all of the design that I was, that I had done throughout



Speaker:

the years on backup, um, you know,



Speaker:

um, backup system



Speaker:

And system.



Speaker:

all of the things that, you know, what do you do when the backups, you know?



Speaker:

And so I came to understand



Speaker:

that the only way I was gonna finish this backup was to do it to disc.



Speaker:

And just quickly before you move on, I think along the way, didn't



Speaker:

you also have a tape drive that failed that you then had to go



Speaker:

Oh, multiple Multiple times.



Speaker:

Swap out tape drives, reboot tape drives, put in cleaning tapes and tape drives.



Speaker:

And by the way, that's another thing is the way tape drives normally do



Speaker:

is you run them for a certain number of hours and then there's a cleaning



Speaker:

tape that goes in there and cleans it.



Speaker:

And when you have a robotic library, that happens automatically.



Speaker:

Well, when you just run the tape drive for.



Speaker:

Two months, you know, that



Speaker:

And so at some point the tape drive just fails.



Speaker:

Yeah.



Speaker:

um, yeah.



Speaker:

And so I ultimately decided that the only way to get this done was to, um, you know,



Speaker:

buy, uh, enough disc to back this up.



Speaker:

And that wasn't cheap.



Speaker:

Uh, but I, I didn't think that there was any other way that this was ever



Speaker:

going to get done 'cause again, the core problem that we've had with tape



Speaker:

for the last three decades has been that the backup, if the backup isn't



Speaker:

fast enough for the tape drive, it's a, it's a fundamental mismatch,



Speaker:

right?



Speaker:

And so we use multiplexing to make that better.



Speaker:

But if the multi, but if the speed you're dealing with is in kilobytes a second,



Speaker:

Yeah.



Speaker:

Well, and especially 'cause you're limited by those two, uh, Synology boxes, right?



Speaker:

Which are limiting your bandwidth, right?



Speaker:

It's not like



Speaker:

Yeah.



Speaker:

Synology boxes you can then pull from,



Speaker:

Yeah, and I was, I was watching, like, I was running every kind of tool I could



Speaker:

run to see, like, I wasn't overtasking.



Speaker:

The, that was the really weird part is that the, it's not like the



Speaker:

Synology boxes were saying, you're really beating the crap out of it.



Speaker:

You shouldn't do so



Speaker:

backups at a time.



Speaker:

It wasn't, it, it was, I didn't have a high I/O wait.



Speaker:

I didn't have high CPU, I didn't have high ram.



Speaker:

There, there was no, there was no



Speaker:

rhyme or reason as to why. We'll get to the rhyme or reason later.



Speaker:

I figured it out.



Speaker:

Um, but, but I knew that tape and this wasn't gonna work.



Speaker:

So, so I had to bring in, uh, a couple of other Synology disc arrays, by the



Speaker:

way, and populate them with enough disc to handle all of this, uh, this backup.



Speaker:

Right.



Speaker:

Yeah,



Speaker:

And, um.



Speaker:

Then



Speaker:

but that wasn't without its issues either.



Speaker:

Right?



Speaker:

When you, when you brought those in, that wasn't without its issues either.



Speaker:

No, it wasn't without issues.



Speaker:

And the other thing, what I needed to do was to, I felt that with, in terms of the



Speaker:

number of directories that were remaining, I wasn't sure like the different sizes.



Speaker:

So what I did was I split, I.



Speaker:

Those jobs into many smaller jobs.



Speaker:

NetBackup is really good at like running thousands of jobs, right?



Speaker:

So rather than just have a hundred jobs, I turned that into like 2,400 jobs.



Speaker:

Like I went,



Speaker:

I went another level deep and created a policy for each of these



Speaker:

directories, and then I ran those and it was running for a while.
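
The "go another level deep" idea can be sketched as follows. This is only an illustration of enumerating directories at a chosen depth so that each one can become its own backup selection; it does not call NetBackup, and the function name `selections_at_depth` is hypothetical.

```python
import os


def selections_at_depth(root: str, depth: int) -> list[str]:
    """Return the directories exactly `depth` levels below `root`, sorted.

    Each returned path is a candidate for its own backup job or policy.
    """
    current = [root]
    for _ in range(depth):
        next_level = []
        for parent in current:
            with os.scandir(parent) as entries:
                for entry in entries:
                    # Skip files and symlinked directories.
                    if entry.is_dir(follow_symlinks=False):
                        next_level.append(entry.path)
        current = next_level
    return sorted(current)
```

Two caveats with this sketch: files sitting directly in a shallower directory are not covered by any of the returned paths and would need a separate job, and splitting this way does nothing for a single directory that holds tens of millions of files by itself.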



Speaker:

It was, it was, you know, again, more time.



Speaker:

And what I started seeing.



Speaker:

Were these jobs that were like an individual job that was running



Speaker:

for an inordinate amount of time.



Speaker:

but you also had some jobs that would finish like super fast, right?



Speaker:

Like



Speaker:

They'd finish five, they'd finish in



Speaker:

Some of 'em, some of 'em finished in five minutes, some 'em would finish.



Speaker:

But I noticed that over time there were certain policies that were running for



Speaker:

really, really long periods of time, and eventually started poking around.



Speaker:

when I discovered what ultimately was the, the true culprit.



Speaker:

And, uh, anyone who's been around backup for a long time



Speaker:

has seen this culprit before.



Speaker:

It's just, this is the worst example of this culprit that I've ever seen.



Speaker:

And what is that?



Speaker:

We affectionately refer to it as the million file problem.



Speaker:

Hmm.



Speaker:

Because remember, again, going back to that, um, that client back from



Speaker:

25 years ago, we had one server.



Speaker:

That was going to be storing a bunch of images and it was going



Speaker:

to result in millions of files.



Speaker:

And we knew back then that the million file problem is a real problem.



Speaker:

And the million file problem over the network is even worse, right?



Speaker:

Because everything is, is, is a



Speaker:

round trip.



Speaker:

The way we fixed it back then was we used a product back then called



Speaker:

FlashBackup, which would back up at the raw level, but store the file-level



Speaker:

information, and that was not available to me.



Speaker:

Why?



Speaker:

Because that product no longer exists



Speaker:

No.



Speaker:

because it doesn't run on a Synology box.



Speaker:

Right.



Speaker:

Remember, I'm not on the Synology.



Speaker:

All it was was an SMB mount to me.



Speaker:

Right?



Speaker:

And by the way, for those curious, yes, I tested SMB, I tested NFS.



Speaker:

It didn't matter.



Speaker:

It didn't matter.



Speaker:

Um, the um.



Speaker:

And



Speaker:

by the way, this was a constant, you know, you know the phrase, never, never



Speaker:

go into battle with an untested weapon.



Speaker:

This was a constant example of I am in the battle, I'm in the stuff,



Speaker:

and now I'm trying to test stuff



Speaker:

And everything I did to try to make things better just made it take longer



Speaker:

and the client just had to wait.



Speaker:

And the client was incredibly patient, honestly.



Speaker:

And, and you know, I did my best to say, look, I, I've been doing this for 30



Speaker:

years, I've never seen anything like this.



Speaker:

Right.



Speaker:

And that, that helped.



Speaker:

But in the end, I was backing up.



Speaker:

You know, we got down to, I, I learned a way to identify which



Speaker:

were the problem directories.



Speaker:

So I would kick off a policy and I would watch, and I would notice



Speaker:

that had run for, let's say an hour.



Speaker:

And it listed, let's say 300,000 files backed up.



Speaker:

kilobytes.



Speaker:

Hmm.



Speaker:

Literally there's, there's a kilobyte column that



Speaker:

shows kilobytes, and there's no value in there.



Speaker:

We backed up 300,000 files, no kilobytes.
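
That symptom (a large file count with essentially no data moved) can be turned into a simple screening rule. The sketch below is a hypothetical heuristic over job statistics you would export yourself; the `JobSnapshot` record, its field names, and the thresholds are assumptions for illustration, not NetBackup's data model.

```python
from dataclasses import dataclass


@dataclass
class JobSnapshot:
    """Hypothetical point-in-time view of one running backup job."""
    policy: str
    elapsed_hours: float
    files: int
    kilobytes: int


def looks_like_small_file_problem(
    job: JobSnapshot,
    min_files: int = 100_000,        # assumed threshold
    max_kb_per_file: float = 1.0,    # assumed threshold
) -> bool:
    """Flag jobs that have walked many files but moved almost no data."""
    if job.files < min_files:
        return False
    return job.kilobytes / job.files <= max_kb_per_file


suspect = JobSnapshot("dir_a", elapsed_hours=1.0, files=300_000, kilobytes=0)
print(suspect.policy, looks_like_small_file_problem(suspect))
```

Jobs flagged this way are the ones worth pulling aside so the well-behaved policies can finish first.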



Speaker:

so that, that helped me identify these problem



Speaker:

Problem child.



Speaker:

Yeah.



Speaker:

it and let the other non-problem policies finish.



Speaker:

And



Speaker:

Right.



Speaker:

Yeah.



Speaker:

up getting down to like 150 policies that were the problem policies.



Speaker:

And so I backed them up and I was able to get them.



Speaker:

Over time, I was able to get them backed up, and then finally I got down to about



Speaker:

20 policies, I think somewhere around



Speaker:

policies.



Speaker:

Go ahead.



Speaker:

And at this point when you're down to the 20, like some of these have



Speaker:

been running for a long time, right?



Speaker:

Like how?



Speaker:

like two months backups that have been running for two months,



Speaker:

successfully running for two months.



Speaker:

Yeah.



Speaker:

And what was good was at this point again.



Speaker:

Like this is information that would've been really helpful to have at the



Speaker:

beginning, but it was information that, to get all this information at the



Speaker:

beginning, it would've taken time to, like we, we just wanted to get started.



Speaker:

Yeah.



Speaker:

What I ended up finding was that, um, these backups, um.



Speaker:

The, the, there were millions and millions and millions, like one of the, one



Speaker:

of the directories that I was backing up, it had 99 million files in it,



Speaker:

one directory, 99 million files, and eventually what I realized was that



Speaker:

again, the problem this time was just SMB.



Speaker:

So the fact that every one of these files results in a round



Speaker:

trip conversation, possibly multiple round trip conversations.



Speaker:

Yep.
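
To see why per-file round trips dominate, here is a minimal latency model. It assumes files are processed strictly one after another and that every file costs a fixed number of network round trips; the round-trip counts and times below are illustrative assumptions, not measurements from this environment.

```python
SECONDS_PER_DAY = 86_400


def serial_latency_days(
    file_count: int,
    round_trips_per_file: int,
    round_trip_seconds: float,
) -> float:
    """Days spent purely waiting on round trips, ignoring data transfer."""
    total_seconds = file_count * round_trips_per_file * round_trip_seconds
    return total_seconds / SECONDS_PER_DAY


FILES = 99_000_000  # the single directory described in the episode

# Assumed scenarios: (round trips per file, seconds per round trip)
for trips, rtt in ((5, 0.001), (5, 0.010)):
    days = serial_latency_days(FILES, trips, rtt)
    print(f"{trips} round trips at {rtt * 1000:.0f} ms each -> {days:.1f} days")
```

Under these assumptions, five round trips per file at 10 ms each is about 57 days of pure waiting for 99 million files, before a single byte of file content counts toward throughput. That is the same order of magnitude as the two-month runs described here, and it explains why neither disk nor bandwidth looked busy.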



Speaker:

And I realized that the only way I was gonna back up these truly problem



Speaker:

directories was to back them up locally.



Speaker:

But how do I back them up locally?



Speaker:

Well, luckily this is when I just, you know, basically go back



Speaker:

to dumb, dumb old backup tools.



Speaker:

And so I was able to run a backup using tar logged in locally



Speaker:

on the filers, and then just.



Speaker:

Directing the tarball across the network that finally worked.
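
The episode does not give the exact command that was used, so the following is only a sketch of the same pattern: read the files locally on the box that holds them, and send one continuous archive stream over the network instead of one conversation per file. It uses Python's standard `tarfile` module in streaming mode; the function name `stream_tar` is hypothetical.

```python
import tarfile
from typing import BinaryIO


def stream_tar(source_dir: str, out_stream: BinaryIO) -> None:
    """Write an uncompressed tar of source_dir to a non-seekable stream.

    Mode "w|" is tarfile's streaming write mode: it never seeks, so the
    destination can be a pipe or socket rather than a local file.
    """
    with tarfile.open(fileobj=out_stream, mode="w|") as archive:
        # arcname="." stores paths relative to source_dir.
        archive.add(source_dir, arcname=".")
```

Run on the filer itself with `sys.stdout.buffer` as `out_stream`, the output could be piped through something like ssh to a file on the backup host (paths and host names would be site-specific). The per-file work then happens against local disk, and the network only sees a single sequential stream.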



Speaker:

That's crazy.



Speaker:

So you had these 20 jobs, right?



Speaker:

And some of them you said were running for 60 plus days, and then you sort of



Speaker:

were like, okay, let me start this over.



Speaker:

And by the way, you were kind of forced to start them over



Speaker:

because something happened right?



Speaker:

At



Speaker:

yeah.



Speaker:

Something some unknown thing.



Speaker:

Um, I think I.



Speaker:

I, I, I don't know.



Speaker:

I, I actually don't know



Speaker:

what caused it, but they, they did fail



Speaker:

and,



Speaker:

And you were like, I'm not gonna start these



Speaker:

yeah.



Speaker:

I'm not gonna start 'em again.



Speaker:

It's just, yeah.



Speaker:

Well, Because



Speaker:

like, one of jobs, the, the one with 99 fi, 99 million



Speaker:

files, we were nowhere near.



Speaker:

I.



Speaker:

yeah.



Speaker:

After 60 days you were barely



Speaker:

yeah, yeah.



Speaker:

We're barely, barely scratching the surface.



Speaker:

so I'm like, I, I, I don't have, I don't have that, you know, I, I don't



Speaker:

have the amount of time that it would take, so, so I switched to, you know,



Speaker:

experimentally once again, experimentally, I'm experimenting on the fly, I'm



Speaker:

doing development in production.



Speaker:

Uh, I was like, well, let me see how long, how quick a tar ball would run.



Speaker:

I ran a tar ball.



Speaker:

I remember for like a day, you remember this?



Speaker:

I ran a



Speaker:

a day and it, I, I had a du of the size of the directory and after a day it had



Speaker:

done like, like a half of it or something.



Speaker:

Yeah.



Speaker:

You're like, what?



Speaker:

One's taking 66 days and barely scratching the



Speaker:

yeah,



Speaker:

You are mainly done.



Speaker:

Almost done within a day.



Speaker:

yeah.



Speaker:

And so I was like, this is the way.



Speaker:

Right.



Speaker:

So it, it, it wasn't, it wasn't a way for everything because the, the, this



Speaker:

was, um, because I, you know, I'm glad that I, that I use NetBackup for the



Speaker:

bulk of it, because then I have the catalog data and, you know, and, um,



Speaker:

but



Speaker:

on the restore side.



Speaker:

yeah, yeah.



Speaker:

So this will.



Speaker:

This will be the diff the restores will be more difficult for these



Speaker:

like remaining 20 directories.



Speaker:

I mean, not, not astronomically.



Speaker:

So like,



Speaker:

you know, can create a tarball, a



Speaker:

list of this.



Speaker:

So, you know, lessons learned, like,



Speaker:

do that.



Speaker:

Don't store millions of files on the other side of a, of an SMB box.



Speaker:

I guess



Speaker:

Yeah, so Well, and I think a couple things, even if it's not SMB, right?



Speaker:

Just having that many files, because I think what people don't realize is



Speaker:

even though the size of every disc has gotten significantly larger, right?



Speaker:

You're talking like 18 terabyte, 20 terabyte disk



Speaker:

Yeah.



Speaker:

They can only handle so many operations per disc, right?



Speaker:

That number hasn't changed.



Speaker:

It's about a hundred per second.



Speaker:

And so no matter how many, how big your disc is, right?



Speaker:

If it was 20 one-terabyte discs, right, then you get 20 times a hundred IOPS.



Speaker:

Versus if it's one 20 terabyte disc, you only still get that a hundred.



Speaker:

So that's a big thing that people don't realize with these larger size discs.
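
The spindle arithmetic is easy to make concrete. The sketch below uses the rough figure quoted in the conversation (about 100 operations per second per disk) and ignores RAID overhead and caching; the function names are just for illustration.

```python
IOPS_PER_DISK = 100  # rough per-spindle figure quoted in the conversation


def aggregate_iops(disk_count: int, iops_per_disk: int = IOPS_PER_DISK) -> int:
    """Total operations per second across all disks, ignoring RAID overhead."""
    return disk_count * iops_per_disk


def iops_per_terabyte(disk_count: int, disk_size_tb: float) -> float:
    """How many operations per second are available per terabyte stored."""
    return aggregate_iops(disk_count) / (disk_count * disk_size_tb)


for count, size_tb in ((20, 1), (1, 20)):
    print(
        f"{count} x {size_tb} TB: {aggregate_iops(count)} IOPS total, "
        f"{iops_per_terabyte(count, size_tb):.0f} IOPS per TB"
    )
```

Both layouts hold 20 TB, but the single large disk offers one twentieth of the operations per second, which matters far more than capacity when the workload is millions of tiny files.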



Speaker:

Yeah.



Speaker:

And, and the thing was that the.



Speaker:

That many files.



Speaker:

So ultimately the problem wasn't disk I/O, the problem



Speaker:

io.



Speaker:

Right?



Speaker:

Network latency.



Speaker:

So, because



Speaker:

when I actually ran, I ran two tarballs.



Speaker:

I.



Speaker:

Simultaneously is what I did.



Speaker:

I using



Speaker:

I just, I ran, I was always running two at a time.



Speaker:

When I was running two at a time, I/O wait was sitting at 10,



Speaker:

which is, is high,



Speaker:

but I was like, well, it's got nothing else going on, so I'm, I'm



Speaker:

it go.



Speaker:

Right?



Speaker:

The highest I/O wait ran during all of those hundreds of



Speaker:

simultaneous backups was like four.



Speaker:

yeah,



Speaker:

So like I wasn't disk bound.



Speaker:

I was



Speaker:

bound, but not network bound in terms of throughput, network bound, in terms of



Speaker:

latency,



Speaker:

and



Speaker:

of operations, just because SMB is very chatty.



Speaker:

very chatty.



Speaker:

It's probably the chattiest of the protocols,



Speaker:

and



Speaker:

we, you



Speaker:

it was just a really bad combination.



Speaker:

Yeah.
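
To see why a chatty protocol hurts even when throughput is fine, here is a back-of-the-envelope sketch. Only the 99 million file count comes from the episode; the five round trips per file and the one millisecond round-trip time are hypothetical placeholders, and `latency_bound_days` is an illustrative name, so treat the result as a way of reasoning rather than a measurement.

```python
# Back-of-the-envelope estimate for a latency-bound backup over a network
# file protocol. Every input except the file count is a made-up assumption;
# measure your own environment before drawing conclusions.
SECONDS_PER_DAY = 86_400


def latency_bound_days(file_count: int,
                       round_trips_per_file: int,
                       round_trip_seconds: float,
                       parallel_streams: int = 1) -> float:
    """Days spent purely waiting on per-file protocol round trips."""
    total_seconds = file_count * round_trips_per_file * round_trip_seconds
    return total_seconds / parallel_streams / SECONDS_PER_DAY


# 99 million files (from the episode), 5 round trips per file and
# 1 ms per round trip (both hypothetical), one stream.
days = latency_bound_days(99_000_000, 5, 0.001)
print(f"Waiting on round trips alone: about {days:.1f} days")
```

Note that the size of the files never enters the calculation. In this model the wait grows with the number of files and the number of round trips each one costs, which is why reading the data locally changes the picture so much.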



Speaker:

And this is why backup vendors have their own protocols,



Speaker:

like Data Domain has Boost, right?



Speaker:

To help alleviate and solve some of these issues.



Speaker:

Yeah.



Speaker:

You talked about, don't, don't do the somewhere we were talking about.



Speaker:

Just don't do this.



Speaker:

I, I'd like, I'd like to talk today.



Speaker:

When I looked at these directories that had these



Speaker:

tens of millions of files, it was a structure that was very clearly



Speaker:

created by some application.



Speaker:

one of these directories had a common structure created by some



Speaker:

I'm gonna say stupid application that thought this was perfectly fine.



Speaker:

That it was perfectly fine to create 99 million files for



Speaker:

Do you know, I,



Speaker:

item.



Speaker:

I bet they were using the file system as a database



Speaker:

I don't know.



Speaker:

what it was.



Speaker:

given just like the number of files and the size of those files.



Speaker:

I know it was forensic type information



Speaker:

and I, I don't, I clearly



Speaker:

That, that's fine.



Speaker:

Yeah, yeah,



Speaker:

No, I'm just saying I clearly don't know enough about forensic stuff



Speaker:

to know why they would want tens of



Speaker:

of files,



Speaker:

but



Speaker:

So where are you?



Speaker:

So you talked about these 20 jobs that you were starting to do tarballs with.



Speaker:

So where are you right now?



Speaker:

So we finished all of them but one. There was one that for some reason



Speaker:

the file didn't look right.



Speaker:

It was weird.



Speaker:

Um, the backup completed, but for some reason the tarball,



Speaker:

it just, it just didn't look right.



Speaker:

I don't wanna go into details.



Speaker:

It just didn't look



Speaker:

so I'm rerunning that one.



Speaker:

So it, based on its size and how well it's doing, it should



Speaker:

finish in about a day or so.



Speaker:

Um, and what I'm



Speaker:

is a significant improvement in terms of



Speaker:

A significant improvement a day versus, you know, a year, um,



Speaker:

Or two, I think actually it might have been two.



Speaker:

Yeah,



Speaker:

Agreed.



Speaker:

Um, and what I'm doing is I'm, because again, I don't have the catalog.



Speaker:

What I'm currently running is a tar tvf



Speaker:

On all of those files and creating tarballs or creating, I'm sorry, text



Speaker:

files, a list.



Speaker:

of the, the files that are in there.



Speaker:

And then I'm gonna do a count on the files that are in there and



Speaker:

check it against the count of the files that are in the directory.



Speaker:

And, and hopefully those numbers should be the same.
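
The check described here (list what is in each tarball, count the entries, and compare against the source directory) might look roughly like the sketch below. It uses Python's standard `tarfile` module rather than the `tar tvf` command from the episode, and the helper names and the tiny demo tree are illustrative assumptions.

```python
import os
import tarfile
import tempfile


def count_files_in_tar(tar_path: str) -> int:
    """Count regular-file members in an archive (similar in spirit to
    counting the file lines of a `tar tvf` listing)."""
    count = 0
    with tarfile.open(tar_path, "r") as archive:
        for member in archive:
            # Hard links are stored as link members, so they are not
            # counted here; adjust if the source tree uses them.
            if member.isfile():
                count += 1
    return count


def count_files_in_dir(root: str) -> int:
    """Count regular files under root without following symlinks."""
    total = 0
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    total += 1
    return total


# Small self-contained demo: build a tree, tar it, compare the counts.
with tempfile.TemporaryDirectory() as work:
    source = os.path.join(work, "data")
    os.makedirs(os.path.join(source, "sub"))
    for name in ("a.txt", "b.txt", os.path.join("sub", "c.txt")):
        with open(os.path.join(source, name), "w") as handle:
            handle.write("x")
    tar_path = os.path.join(work, "data.tar")
    with tarfile.open(tar_path, "w") as archive:
        archive.add(source, arcname="data")
    in_tar = count_files_in_tar(tar_path)
    on_disk = count_files_in_dir(source)

print("match" if in_tar == on_disk else "MISMATCH", in_tar, on_disk)
```

A matching count does not prove the archive is restorable, but a mismatch is a cheap early warning that a tarball needs a second look.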



Speaker:

Yeah, because I believe you are even saying that to run things



Speaker:

like a find to get a list of all the files in a directory or a du



Speaker:

Yeah.



Speaker:

hours, right?



Speaker:

Well, it was days actually.



Speaker:

In



Speaker:

fact, it was why I didn't have this information in the beginning



Speaker:

because everything was so big and every find, every du every command



Speaker:

that I had, du is quicker than find.



Speaker:

du is.



Speaker:

It just does less work than find.



Speaker:

But the problem that I ultimately realized was that du wasn't



Speaker:

really being helpful in terms of.



Speaker:

The



Speaker:

scope of the job. The scope of the job was determined



Speaker:

by the number of these files.



Speaker:

And I couldn't get those numbers because that was the thing that took forever.



Speaker:

Once the number of jobs dwindled down to about 20, that's when I



Speaker:

was able to run these, uh, the



Speaker:

and they would, they would actually complete.



Speaker:

And that's when I realized just how bad it was.



Speaker:

so if you had to start this over, and hopefully you never do, but I'm just



Speaker:

saying, if you had to go back to day one, what would you do differently?



Speaker:

I know you talked about making sure you understand the size of your backups.



Speaker:

Right.



Speaker:

It just feels like some of these, you just have to go through the process



Speaker:

though because you don't know what to do.



Speaker:

Like it's not like you could just start day one and be like,



Speaker:

oh, I know I need to go to disc.



Speaker:

I need to do X, Y, and Z.



Speaker:

Right?



Speaker:

It's sort of like a learning process.



Speaker:

would say that I.



Speaker:

Yeah, because the problem is you're going off into the unknown,



Speaker:

you're doing a backup of something that you don't know what it is.



Speaker:

And I, I would say if possible, if at all possible, get things like



Speaker:

du, uh, you know, disk usage. It's a Unix command, but you can load those



Speaker:

tools in Windows as well. Like if you're going to back up, if you're



Speaker:

gonna back up a hundred directories.



Speaker:

Get a du of every one of those directories so that you have an idea



Speaker:

of just what you're dealing with,



Speaker:

if at all possible.



Speaker:

Also, look at the number of files. And if you're



Speaker:

trying to do that, you know, it's not that hard, you just run a find . -print,



Speaker:

you know, I didn't even do a -print, just find . piped to wc -l, right?



Speaker:

That was it.



Speaker:

Right?



Speaker:

Um, to, to get the number of files.
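
A pre-job survey along these lines (a du plus a find piped to wc -l for each directory you plan to back up) could be sketched as follows. The function names and the demo directories are hypothetical, and the sizes are apparent file sizes rather than the allocated blocks that du reports by default.

```python
import os
import tempfile


def survey(root: str) -> tuple:
    """Return (file_count, total_bytes) for everything under root.
    Sizes are apparent sizes (st_size), not allocated disk blocks."""
    files = 0
    size = 0
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    files += 1
                    size += entry.stat(follow_symlinks=False).st_size
    return files, size


def survey_children(parent: str) -> dict:
    """Survey each immediate subdirectory of parent separately."""
    report = {}
    with os.scandir(parent) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                report[entry.name] = survey(entry.path)
    return report


# Demo tree: one directory with a single larger file, one with many tiny files.
with tempfile.TemporaryDirectory() as parent:
    os.makedirs(os.path.join(parent, "small"))
    os.makedirs(os.path.join(parent, "many"))
    with open(os.path.join(parent, "small", "one.bin"), "wb") as handle:
        handle.write(b"0123456789")
    for index in range(5):
        with open(os.path.join(parent, "many", f"f{index}"), "wb") as handle:
            handle.write(b"x")
    report = survey_children(parent)

# Sort by file count, since the number of files is what drove the job length.
for name, (files, size) in sorted(report.items(),
                                  key=lambda item: item[1][0],
                                  reverse=True):
    print(f"{name}: {files} files, {size} bytes")
```

Sorting by file count rather than by size reflects the lesson of the episode: the directory with the most files, not the most bytes, is the one most likely to hurt.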



Speaker:

I'd say if again.



Speaker:

If I could go back in time, I, I would say maybe do a little bit more of this



Speaker:

research prior to beginning the job.



Speaker:

Um, but it's easy to say that now,



Speaker:

um, because I know what



Speaker:

I know.



Speaker:

Right.



Speaker:

Um, but the, you know, the core problem was that you've



Speaker:

got these millions of files.



Speaker:

I mean, which is all.



Speaker:

Already gonna be a problem if you're backing it up in any sort of normal way.



Speaker:

But if you're



Speaker:

up remotely over the network, it's going to kill you.



Speaker:

Yeah.



Speaker:

So, um, you gotta figure out a way to do that.



Speaker:

And then I would just say, see if there's anything that you can do with the, with



Speaker:

the application that's created this data



Speaker:

which is why it's important to get involved early on, right when an



Speaker:

application is being developed or deployed, right, to get involved so



Speaker:

they understand the backup requirements.



Speaker:

yeah.



Speaker:

And so, this backup that would never finish, I literally was, I



Speaker:

was starting to think that this thing was never gonna finish.



Speaker:

Um.



Speaker:

It's essentially finally, I mean, it's not, at this point, it's



Speaker:

not a hundred percent, but I'm, I'm now, you know, it's just, I'm



Speaker:

at the finish line.



Speaker:

Yeah.



Speaker:

at the finish line.



Speaker:

Yeah.



Speaker:

Um, it's nice.



Speaker:

I know one of the other things you mentioned that you were using



Speaker:

NetBackup, but you had also looked at other tools out there as well, right?



Speaker:

That could potentially help you with this effort.



Speaker:

Right.



Speaker:

So do you think that that becomes valuable, like either looking at other



Speaker:

tools, um, I know you had reached out to like Synology support, you



Speaker:

had reached out to some experts, like



Speaker:

Yeah.



Speaker:

Yeah.



Speaker:

The problem there was, you could do, like with Synology,



Speaker:

you can like copy the data from A to B.



Speaker:

Mm-hmm.



Speaker:

They have this ability essentially like, you know, for lack of a



Speaker:

better word, they have SnapMirror.



Speaker:

they have the equivalent of SnapMirror.



Speaker:

Yep.



Speaker:

from one Synology box to another.



Speaker:

But to me that wasn't really a backup. Like, I wanted it in a format, you know,



Speaker:

the end I was forced to not do what I wanted with the tar.



Speaker:

Um, but I wanted it in a cataloged format.



Speaker:

So we looked at a couple of, the problem was never NetBackup.



Speaker:

Right?



Speaker:

NetBackup made it, um, easy to script this whole thing because it was the



Speaker:

only way I could make sense of it.



Speaker:

'cause it was, it was thousands of directories and, um, and even



Speaker:

more thousands of subdirectories under those directories.



Speaker:

And the only way I could make sense of this was to script it all.



Speaker:

And, um, the, the fact that NetBackup allowed me to do that was great.



Speaker:

Um, there are some other tools these days, some of the newer tools,



Speaker:

they want to make it easy for you.



Speaker:

But if you get into a complicated situation like this, some of the newer



Speaker:

tools don't even have the ability to sort of grab it by the horns.



Speaker:

The



Speaker:

able to do a NetBackup,



Speaker:

Yeah.



Speaker:

I think the other thing also that you were doing, which I thought was interesting,



Speaker:

was also your scripting, right?



Speaker:

Trying to automate this, like, uh, I know like scheduling your,



Speaker:

the backup policies to run, right?



Speaker:

And then you were sort of doing load balancing to make sure



Speaker:

that you keep the two filers



Speaker:

Yeah.



Speaker:

Yeah.



Speaker:

I couldn't, yeah, that was the thing.



Speaker:

I couldn't normally, I, I just, I believe in just throwing



Speaker:

everything in the NetBackup scheduler and let it figure it out.



Speaker:

But because again, because of the limitations of the weird thing I had,



Speaker:

I couldn't figure out a way to load balance across the two target filers



Speaker:

the NetBackup scheduler.



Speaker:

Um, maybe I could have, uh, done that better.



Speaker:

I don't know.



Speaker:

But, uh, so the way I was doing it was I was just assigning a backup.



Speaker:

a backup would finish, I would assign the next backup to the



Speaker:

filer that now had more space available to it.



Speaker:

Right.



Speaker:

So I just had a while loop that was running, you



Speaker:

know, checking to see if a backup job was done.
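
The loop described here can be sketched as a small simulation. It is not the script from the episode and it calls no NetBackup commands. The job names, sizes, filer names, and the limit of two concurrent jobs are all hypothetical, and a job "finishing" is simulated instead of polled.

```python
from collections import deque


def pick_target(free_space: dict) -> str:
    """Return the name of the target filer with the most free space."""
    return max(free_space, key=free_space.get)


def schedule(jobs: list, free_space: dict, max_running: int = 2) -> list:
    """Assign each (job_name, size_tb) to whichever filer has the most
    free space at the moment a slot opens. Returns (job, target) pairs."""
    pending = deque(jobs)
    running = []            # jobs currently "in flight"
    assignments = []
    free = dict(free_space)  # work on a copy
    while pending or running:
        # Fill any open slots with the next pending jobs.
        while pending and len(running) < max_running:
            name, size_tb = pending.popleft()
            target = pick_target(free)
            free[target] -= size_tb  # reserve the space up front
            running.append(name)
            assignments.append((name, target))
        # A real script would sleep here and ask the backup product
        # whether a job has completed. The simulation simply treats
        # the oldest running job as finished.
        running.pop(0)
    return assignments


# Hypothetical workload and targets, sizes in TB.
plan = schedule(
    jobs=[("dir1", 10), ("dir2", 4), ("dir3", 6), ("dir4", 1)],
    free_space={"filerA": 50, "filerB": 45},
)
for job, target in plan:
    print(f"{job} -> {target}")
```

Reserving the space when a job is assigned, rather than when it finishes, keeps two jobs from both being sent to the same nearly full target.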



Speaker:

but I think that's important, right?



Speaker:

You can always script some of these things if it doesn't



Speaker:

exist in the native tools, right?



Speaker:

Don't be afraid.



Speaker:

Yeah.



Speaker:

Don't be afraid.



Speaker:

you know, obviously I'm, I'm pretty good at scripting and



Speaker:

I'm pretty good in the backup.



Speaker:

And, um, thanks.



Speaker:

Thanks very much to Veritas for keeping their, uh, their documentation online.



Speaker:

Uh, the number of times I Googled.



Speaker:

You know, backup job, you know, how do, how do I list, uh, you know, and



Speaker:

I know there's a, there's, I know there's a command to, to do this.



Speaker:

How do I do that?



Speaker:

And, you know, and then a man page would come up and I would read it



Speaker:

and I was like, oh, yeah, yeah, yeah.



Speaker:

It's



Speaker:

been a while.



Speaker:

Yeah.



Speaker:

Um.



Speaker:

you have to also thank Cygwin, of course.



Speaker:

Yes, special thanks to Cygwin.



Speaker:

That is the tool that you can download and run on any Windows



Speaker:

server to give you Unix capabilities.



Speaker:

I will say there were, there were moments where Cygwin was both helpful and



Speaker:

terrorizing me because it was the whole like backslash versus forward slash thing.



Speaker:

Because in Windows, you know, the file separator is a backslash, which



Speaker:

in Unix is an escape character,



Speaker:

Yep.



Speaker:

and Cygwin wasn't consistent.



Speaker:

When that escape character would be an escape character.



Speaker:

Like, like if you piped it into a file, it would do one thing.



Speaker:

If you piped it into a command, it would do it, it would behave differently.



Speaker:

And, um, so that, that definitely l lent.



Speaker:

The fact that I was doing constant file manipulation on directories



Speaker:

that were seven levels deep,



Speaker:

Yeah.



Speaker:

did not help.



Speaker:

Yeah.



Speaker:

Oh, and then I couldn't, the, the, the, the one thing with



Speaker:

Cygwin is that it doesn't see.



Speaker:

It doesn't see the, to point the backups to NetBackup, I have to point



Speaker:

'em at backslash backslash filer name



Speaker:

share name.



Speaker:

Cygwin doesn't see that.



Speaker:

Cygwin sees only mapped drive names



Speaker:

and



Speaker:

have to map it using



Speaker:

you have to map it to a drive name.



Speaker:

Let's say you map it to,



Speaker:

to letter F, and then in Cygwin you would see /cygdrive/f.



Speaker:

Which would be the same as this backslash-backslash mount.



Speaker:

know, I was constantly having to go back and forth between



Speaker:

those two and, and that was fun.
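
The back and forth between the two path styles can be captured in a pair of helper functions. This is a minimal sketch: the drive mapping, filer name, and share name are hypothetical, and the functions only rewrite strings; they do not check that the drive is actually mapped.

```python
# Hypothetical mapping of Windows drive letters to the UNC shares
# they were mapped from (for example with "net use F: \\filer1\share1").
DRIVE_MAP = {"F": r"\\filer1\share1"}


def unc_to_cygdrive(unc_path: str, drive_map: dict) -> str:
    """Rewrite \\\\filer\\share\\dir as /cygdrive/<letter>/dir."""
    lowered = unc_path.lower()
    for letter, share in drive_map.items():
        prefix = share.lower().rstrip("\\")
        if lowered == prefix or lowered.startswith(prefix + "\\"):
            rest = unc_path[len(prefix):].replace("\\", "/")
            return f"/cygdrive/{letter.lower()}{rest}"
    raise ValueError(f"no mapped drive for {unc_path!r}")


def cygdrive_to_unc(cyg_path: str, drive_map: dict) -> str:
    """Rewrite /cygdrive/<letter>/dir back to the UNC form."""
    parts = cyg_path.split("/")  # "/cygdrive/f/a" -> ["", "cygdrive", "f", "a"]
    if len(parts) < 3 or parts[0] != "" or parts[1] != "cygdrive":
        raise ValueError(f"not a /cygdrive path: {cyg_path!r}")
    share = drive_map[parts[2].upper()].rstrip("\\")
    return "\\".join([share] + parts[3:])


unix_style = unc_to_cygdrive(r"\\filer1\share1\cases\2024", DRIVE_MAP)
windows_style = cygdrive_to_unc(unix_style, DRIVE_MAP)
print(unix_style)
print(windows_style)
```

Doing the conversion in one place, instead of by hand inside each script, also sidesteps most of the backslash escaping surprises mentioned earlier, because the backslashes live in raw strings and never pass through a shell.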



Speaker:

Um,



Speaker:

scripting



Speaker:

here's the thing.



Speaker:

After all of this experience and everything you've learned, you're probably



Speaker:

never gonna use any of this again.



Speaker:

I don't know about that.



Speaker:

I dunno about that.



Speaker:

I tell you what, I'm taking a tar of all those scripts that



Speaker:

I wrote, um, because I will say this, that, that the NetBackup



Speaker:

documentation while, uh, extensive, it doesn't give a lot of examples.



Speaker:

And so like, I'm thinking of like, um, like the bpduplicate command,



Speaker:

which is the command to copy backups from one place to another.



Speaker:

I couldn't, I couldn't figure out from reading the man page how to



Speaker:

actually do, to do what I needed to do.



Speaker:

So I would, I would like.



Speaker:

I would do, I would have to run tests, you



Speaker:

know. And, um, you know, now that Cohesity's



Speaker:

acquiring them, it's not like they're now gonna rewrite their man pages.



Speaker:

I just thought that they could have used some more, some more examples.



Speaker:

But



Speaker:

Yeah.



Speaker:

I figured it out eventually.



Speaker:

You know, I think someone used to have a forum that people would post on about.



Speaker:

Yeah, someone used to have that and then, but people stopped posting



Speaker:

on that forum, so I don't know



Speaker:

You know?



Speaker:

Um, where people are getting their help now,



Speaker:

but, uh,



Speaker:

Well, I'm glad that this is almost over,



Speaker:

yeah.



Speaker:

Yeah.



Speaker:

nearly over and I'm glad you're still alive,



Speaker:

I am alive.



Speaker:

I didn't kill anyone along the way.



Speaker:

I didn't scream at anyone.



Speaker:

Like the, the story that



Speaker:

you have heard were, were Curtis Cuss Preston.



Speaker:

I didn't scream at anyone.



Speaker:

yeah.



Speaker:

but I really, really, really think you should do an office space on those filers.



Speaker:

yeah.



Speaker:

Well, that would sort of defeat the purpose of the



Speaker:

but, uh, yeah, I like that idea.



Speaker:

Hmm.



Speaker:

Anyway.



Speaker:

Well, uh, thanks Prasanna for helping me, uh, sort of through this.



Speaker:

You were my constant counselor through this.



Speaker:

I think I learned a bunch.



Speaker:

I know usually I'm all about YouTube knowledge, but in this case it was



Speaker:

the Preston knowledge, so it was good.



Speaker:

I.



Speaker:

Yeah.



Speaker:

Yeah.



Speaker:

uh, thanks everybody else for, uh, uh, listening along with this sad, sad story



Speaker:

with I think a decent, happy ending.



Speaker:

That is a wrap.



Speaker:

The Backup Wrap-up is written, recorded, and produced by me, W. Curtis Preston.



Speaker:

If you need backup or DR



Speaker:

consulting, content generation, or expert witness work,



Speaker:

check out backupcentral.com.



Speaker:

You can also find links to my O'Reilly books on the same website.



Speaker:

Remember, this is an independent podcast and any opinions that you



Speaker:

hear are those of the speaker



Speaker:

and not necessarily an employer.



Speaker:

Thanks for listening.