In this episode of The Backup Wrap-Up, host W. Curtis Preston discusses the importance of understanding the difference between snapshots and backups. He emphasizes that storage snapshots should not be considered as true backups. The episode also covers the recent 1Password and Okta hack, highlighting the frustration of such incidents, especially for those who advocate for password managers and cloud technologies. Tune in to learn more about the risks and implications of relying solely on snapshots and the importance of proper backup strategies.
Speaker:
I'm not sure who needs to hear this, but snapshots are not backups.
Speaker:
And in this context, I'm talking about virtual snapshots, like
Speaker:
what you do with the storage array.
Speaker:
We're not talking about cloud snapshots, like what you do
Speaker:
with EBS volumes in the AWS.
Speaker:
If you don't know the difference or you can't articulate why
Speaker:
these snapshots are not backups.
Speaker:
Well, have I got a podcast for you?
Speaker:
Hi, I'm W.
Speaker:
Curtis Preston, AKA Mr. Backup.
Speaker:
And my podcast turns unappreciated backup admins into cyber recovery heroes.
Speaker:
This is The Backup Wrap-Up.
Speaker:
Hi, and welcome to the Backup Wrap Up.
Speaker:
I'm your host, W.
Speaker:
Curtis Preston, and I have with me my Spanish language
Speaker:
encourager, Prasanna Malaiyandi.
Speaker:
I gave you a positive thing
Speaker:
I know, I'm impressed.
Speaker:
And I'm also impressed that you are done now with, what, Spanish 2, right?
Speaker:
Spanish 2, uh, sí, sí, terminé con Span, uh, I almost said
Speaker:
Spanish, Español 2, um, yeah.
Speaker:
So now I need to, I need to do a little review because there's, it's, it's
Speaker:
really, you know, for a, for an English speaker, uh, the challenges, as I know,
Speaker:
I've spoken to you, the challenges are things like, we're like, it, it, the,
Speaker:
the gender thing is, is not a big deal.
Speaker:
You get used to that.
Speaker:
What's challenging is words that end in -ma, like problema; you would
Speaker:
think that would be feminine, but it's actually a masculine word.
Speaker:
Uh, because it, it ends in an, uh, ma.
Speaker:
I have to pound that stuff into my head and just say it over and over.
Speaker:
I will be using, the thing that I made where I, I'm actually
Speaker:
using technology to make a verbal recording that I can listen to.
Speaker:
Uh, it's very, very cool.
Speaker:
Uh, but I am excited to move on to Spanish 3 and
Speaker:
bring you, dragging, with me.
Speaker:
It's helpful.
Speaker:
I used to be pretty good at Spanish because I took it in high
Speaker:
school and then I lost everything.
Speaker:
I can order things off a menu, but that's about it.
Speaker:
Right.
Speaker:
Right.
Speaker:
So, uh, I think it's time for the news of the week.
Speaker:
This one is frustrating.
Speaker:
Uh, we'll, we'll, we'll do a frustrating one first and then
Speaker:
some good news about a vendor.
Speaker:
Let's talk about this 1Password hack.
Speaker:
And it frustrates me for many reasons.
Speaker:
One is, we're such a fan of password managers.
Speaker:
And there are those who are not fans of password managers.
Speaker:
There are also those who are not a fan of the cloud.
Speaker:
And this just, uh, is both of those things.
Speaker:
So, uh, do you want to talk a little bit about the 1Password/Okta hack?
Speaker:
So basically what happened is 1Password, I don't know if it was
Speaker:
1Password or Okta, but basically some hackers got into 1Password.
Speaker:
They were able to change some things in their Okta instance and try to get a list
Speaker:
of admins, which I'm sure they were going to use to target and be able to deploy
Speaker:
things so they could cause bigger damage.
Speaker:
Right.
Speaker:
But as they were starting to sort of peel back the onion and
Speaker:
figure out, okay, what happened?
Speaker:
They realized what actually happened is it started on the Okta side,
Speaker:
Right.
Speaker:
The reason they were able to do what, what they ended up being able
Speaker:
to do was because Okta had already been, had already been hacked.
Speaker:
And so it's interesting in this case because it wasn't like, oh, they
Speaker:
just got into Okta and then they ping ponged and got into 1Password.
Speaker:
They actually looked at support files that someone in 1Password, an engineer,
Speaker:
had uploaded because they probably had some issue or something else like that.
Speaker:
They were reaching out to Okta and they uploaded basically a support bundle.
Speaker:
Um, and in that support bundle, they contained, in addition to sort of like
Speaker:
information needed to troubleshoot, it also contained session cookies,
Speaker:
Right.
Speaker:
Which they were then able to use.
Speaker:
Right.
Speaker:
Yeah, and so they were able to basically impersonate the 1Password employee
Speaker:
and log into the system and that's when they started causing havoc.
Speaker:
Yeah.
Speaker:
And, and the good news is that they did, that 1Password did see what was happening.
Speaker:
Uh, they did see this weird session coming from an odd IP.
Speaker:
And they, they shut it down before basically all they did was
Speaker:
attempt to get a list of admins.
Speaker:
They didn't actually get the list of admins is what it looks like.
Speaker:
Um, but the, the whole thing was again, you know, it went back to
Speaker:
this, the fact that the, the hacker initially had compromised the
Speaker:
Okta support system, from which they then got this support file.
Speaker:
And, you know, in retrospect, it looks like everybody did their job.
Speaker:
It, it looks like, right.
Speaker:
That, that, that.
Speaker:
Things were noticed, things were stopped.
Speaker:
I think that in the case of Okta, maybe they weren't noticed quick
Speaker:
enough because it actually, 1Password was just one company of several that
Speaker:
were compromised because of this.
Speaker:
It doesn't appear for those of you that are 1Password customers, it doesn't
Speaker:
appear that any customer data was accessed, doesn't appear, you know,
Speaker:
unlike the, what was the other one?
Speaker:
The LastPass, the LastPass hack where they actually got the vault.
Speaker:
And then you had to worry if you had insecure passwords, right?
Speaker:
That, that, that, that were guessable, but it doesn't appear
Speaker:
that that was the case here.
Speaker:
They found it pretty quickly.
Speaker:
I think another interesting thing that I found, by the way, this is an
Speaker:
article on The Register, we'll put a link in the show notes description.
Speaker:
Right.
Speaker:
The one thing I found interesting is they were trying to figure
Speaker:
out, okay, what actually happened?
Speaker:
Like, when was this information stolen? And so they went back and they looked at
Speaker:
their logs on the Okta side and they were trying to figure out, okay, was the archive
Speaker:
accessed before the support engineer on the Okta side accessed it? And they were able
Speaker:
to figure out, no, the support engineer hadn't opened it yet, so it wasn't a
Speaker:
rogue support engineer on the Okta side.
Speaker:
And then they looked at the 1Pass side, 1Password side, and they saw,
Speaker:
okay, the person who uploaded it was on a public Wi-Fi at a hotel.
Speaker:
And they were like, oh, maybe it could have been stolen during the upload
Speaker:
because you know, those connections are always unencrypted and all the rest.
Speaker:
But they looked and they're like, no, the upload process
Speaker:
had TLS end to end up to Okta.
Speaker:
And so.
Speaker:
Everything was encrypted.
Speaker:
It wasn't stolen in the process.
Speaker:
So there must've been something else that had accessed it on the Okta side.
Speaker:
So it's interesting how they're able to piece this all together.
Speaker:
It's almost like CSI, right?
Speaker:
You're piecing together all these clues, trying to figure out, okay,
Speaker:
what happened, what went wrong?
Speaker:
Well, that's what you have to do in an incident response system, right?
Speaker:
You've got to, you know, piece together what you can from the logs that you have.
Speaker:
Logs, logs, logs, right?
Speaker:
That's what it's all about is the logs.
Speaker:
The, the, the only thing that's somewhat distressing is that it
Speaker:
appeared that they made some tweaks.
Speaker:
Some upgrades to their MFA.
Speaker:
Uh, and by upgrades, I mean like they changed some of their stuff.
Speaker:
They're like, maybe we shouldn't allow anyone to log into Okta who isn't
Speaker:
at a, um, a 1Password IP, right?
Speaker:
Maybe, maybe we shouldn't allow that.
Speaker:
That seems like
Speaker:
Uh, that's,
Speaker:
you should have decided that before, but yeah,
Speaker:
I don't know about that, Curtis, because there are times, like, you might
Speaker:
be traveling, and you might be the super user, and you have to change
Speaker:
something, and you're not at a desk.
Speaker:
Or, imagine that 1Password blows up, something happens to their site,
Speaker:
and they need to reset passwords, and they could eventually get locked out.
Speaker:
But maybe this is an
Speaker:
I would think, yeah, I would, uh, I would think that that would be an
Speaker:
exception case that you, you would deal with that maybe by the default
Speaker:
would be that you would turn it off.
Speaker:
The other thing was that it appears that now they're using YubiKeys.
Speaker:
For those that aren't familiar with that, Y U B I K E Y.
Speaker:
This is a hardware token.
Speaker:
It's pretty inexpensive as these systems go.
Speaker:
But it is a, a hard, you know, it's, it's an actual system that you can actually
Speaker:
buy for your own personal use. I took a look and it was, I think it was, a
Speaker:
hundred dollars for two of them. Um,
Speaker:
you want two.
Speaker:
you know, and you want two.
Speaker:
Yeah, and so, um, then Uh, and, you know, and so it looks like they're
Speaker:
now using YubiKeys where maybe before they weren't using YubiKeys.
Speaker:
I'm glad that they made those steps.
Speaker:
It's just, it's just, it's just a shame that we always make changes to
Speaker:
systems after we've been, you know, after we've had an exposure, but good
Speaker:
on them, good on them for finding it good on them for responding and
Speaker:
saying, Hey, we could make this better.
Speaker:
Uh, I'm a little dinging Okta here.
Speaker:
There was a thing at the end that Okta basically said that perhaps if you're
Speaker:
sending a support file, uh, what was the, what was the type of that file?
Speaker:
A HAR file?
Speaker:
Um, some sort of archive file.
Speaker:
Yeah.
Speaker:
Yeah.
Speaker:
And they're like, perhaps you shouldn't, perhaps you should sanitize
Speaker:
that before you send it to us.
Speaker:
And I'm like, well, maybe your system that creates the HAR file
Speaker:
should sanitize it for the customer.
Speaker:
Before you upload it.
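Illustration (not from the episode): a HAR file is a JSON document that records HTTP requests and responses, including their headers and cookies. A minimal sketch of the kind of sanitizing step being discussed might look like the following, assuming the standard HAR layout (`log.entries[].request` and `.response`, each with `headers` and `cookies` lists). The function name `sanitize_har` and the choice of which headers count as sensitive are assumptions made for this example, not anything Okta or 1Password has published.

```python
import copy

# Header names treated as sensitive in this sketch (an assumption; real
# captures may also carry tokens in custom headers, URLs, or bodies).
SENSITIVE_HEADERS = {"cookie", "set-cookie", "authorization"}


def sanitize_har(har: dict) -> dict:
    """Return a copy of a HAR document with cookies and auth headers removed."""
    clean = copy.deepcopy(har)  # never modify the caller's capture in place
    for entry in clean.get("log", {}).get("entries", []):
        for side in ("request", "response"):
            message = entry.get(side, {})
            # Keep only headers whose name is not in the sensitive set.
            message["headers"] = [
                header
                for header in message.get("headers", [])
                if header.get("name", "").lower() not in SENSITIVE_HEADERS
            ]
            # HAR also stores parsed cookies separately; drop them all.
            message["cookies"] = []
    return clean
```

This only covers headers and the parsed cookie lists. Session tokens can also appear in query strings, posted form data, or response bodies, so a real sanitizer would need to look there too.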
Speaker:
Uh, there was a little bit, I thought, of victim blaming, uh, towards the end there,
Speaker:
but, um, anyway, a, a, a much, I think one that's gonna be a, a lot easier to
Speaker:
talk about: our former employer, Druva.
Speaker:
Uh, there, the headline here, another, it's a TechTarget article
Speaker:
that they've added a gen AI assistant to their cloud backup tool.
Speaker:
And I watched a, like a video of a demo and it basically, it looked like, uh,
Speaker:
you know, an NLM that you can use to interface with the Druva cloud platform.
Speaker:
And so you could ask it things like, Hey, show me my backups, like in, in like, Hey,
Speaker:
show me which backups have failed.
Speaker:
I think the big thing is, especially with, uh, with the sort of focus on
Speaker:
AI and making it easily available to a large audience without needing
Speaker:
to know all the training, Right.
Speaker:
And being an expert in it, I think has now made it sort of commonplace, right?
Speaker:
It's easy to pick up an AI model that has been pre trained
Speaker:
and use it for some purpose.
Speaker:
right?
Speaker:
In their case, they're using, uh, Amazon's, uh, no, no.
Speaker:
Great surprise there.
Speaker:
They're using Amazon's, uh, NLM model,
Speaker:
you referring to?
Speaker:
Sorry, are you referring to NLP or LLM?
Speaker:
You're calling it NLM.
Speaker:
thank you.
Speaker:
Oh, yeah, sorry.
Speaker:
NLP Thank you.
Speaker:
NL NLP.
Speaker:
So they're, they're using Amazon's, uh, product to do this.
Speaker:
No.
Speaker:
Great surprise.
Speaker:
Druva lives in Amazon, and, uh, it, it seems like the biggest thing that
Speaker:
they had to do was to make sure that it was integrated with the security.
Speaker:
That's the biggest thing that I see is, yeah, you have to make sure because if
Speaker:
you just pick up any random LLM, I'll use LLM because that's a large language
Speaker:
model, right, which a lot of the AI models are built on top of, if you
Speaker:
pick that, then depending on what it is trained on, you might get back bogus
Speaker:
answers, random answers, answers that aren't even safe, right?
Speaker:
And so at least the fact that Druva is putting guardrails in place to make
Speaker:
sure that what gets returned is sensible and safe, I think is a good step.
Speaker:
Yeah.
Speaker:
So we've seen, uh, we've seen AI use with Alcion.
Speaker:
We saw it with, um, I'm trying to think.
Speaker:
Yeah.
Speaker:
Cohesity came out with one.
Speaker:
Do you remember who else?
Speaker:
It was Druva.
Speaker:
I'm trying to, I know there's another one.
Speaker:
Um,
Speaker:
Was it Commvault?
Speaker:
in the thing.
Speaker:
Oh, Dell, Dell, Dell said, yeah, I haven't, I don't think I've seen
Speaker:
anything from Commvault, but, uh, Dell is also doing, uh, yeah, I merged,
Speaker:
I think, in, uh, uh, uh, NLP and LLM
Speaker:
To NLM.
Speaker:
It's, it's a Curtis only thing.
Speaker:
So, yeah, so that's interesting.
Speaker:
Uh, I think anything that makes it easier to interface with your
Speaker:
backup system is a good thing.
Speaker:
As long as they have those guardrails in place and
Speaker:
And I know you always like to talk about how, what's the job, what's the, one of
Speaker:
the most important jobs that you give to the most junior person at a company?
Speaker:
exactly.
Speaker:
Backup, right?
Speaker:
And so,
Speaker:
why would you do
Speaker:
yeah, and so giving an assistant, if you will, right, to that junior backup person
Speaker:
is helpful as they're learning the ropes.
Speaker:
Yeah.
Speaker:
Well, that is the news of the week.
Speaker:
As you know, every episode of the Backup Wrap Up is going to dive
Speaker:
deep into one particular topic.
Speaker:
This week's topic is snapshot, snapshot, snapshot.
Speaker:
Click, click, click,
Speaker:
It's a topic that comes up a lot.
Speaker:
This word comes up a lot on this show.
Speaker:
So it's time to dive deep into a word that I, I bet at one point in your career,
Speaker:
Prasanna, you must've heard this
Speaker:
word a hundred times a day.
Speaker:
What do you think?
Speaker:
At least.
Speaker:
At least,
Speaker:
least.
Speaker:
because there's certainly one company that thinks they do snapshots different
Speaker:
and better than everyone else.
Speaker:
And they're probably, they certainly, it's certainly at one point in time
Speaker:
that was true.
Speaker:
I think a number of other vendors are now doing snapshots
Speaker:
the way NetApp did snapshots.
Speaker:
So just a quick story to sort of bring out why different ways
Speaker:
to do snapshots really matter.
Speaker:
I was at a, I'm in my brain, uh, live translating the story because I
Speaker:
was at one of the largest companies in the world at a consulting gig.
Speaker:
And we were helping them to pick a new storage and backup system.
Speaker:
They were looking for an integrated system that would do both storage and
Speaker:
data protection as part of that storage.
Speaker:
They were already a NetApp customer and they knew that that meant that
Speaker:
they knew every foible of NetApp.
Speaker:
So, uh, they knew every bad thing about NetApp, but they
Speaker:
also knew the good things.
Speaker:
And I think of all of the consulting gigs that I've done throughout the years,
Speaker:
they had done the best at presenting.
Speaker:
These are our requirements.
Speaker:
And they are well defined and there are reasons behind every one of them.
Speaker:
Mm hmm.
Speaker:
And one of their requirements was that they wanted end users, just regular Joe
Speaker:
and Jane person sitting at the desktop,
Speaker:
to be able to do their own restores.
Speaker:
Right.
Speaker:
And they wanted them to have the ability to do that at points in
Speaker:
time that were like an hour at a time throughout the day, going back
Speaker:
to, you know, it was like 90 days.
Speaker:
So specifically they said, this is why we want 90 days of user browsable snapshots.
Speaker:
At the time that was like,
Speaker:
not everybody did that, right?
Speaker:
Depending on how you did snapshots as a vendor, you either could or
Speaker:
could not meet that requirement.
Speaker:
Just end of story.
Speaker:
And so, the, the, the responses ranged from, uh, no problem, right?
Speaker:
Literally no problem.
Speaker:
To, I remember one vendor coming in and going, That's the
Speaker:
dumbest requirement we've ever
Speaker:
I bet I can
Speaker:
you want 90 days of user browsable snapshots?
Speaker:
Yeah, I think you know exactly who that was.
Speaker:
It was just chaos. And by the way, there was a vendor that came in and they had
Speaker:
this, let me restate that, I'm pretty sure it was that vendor that had this
Speaker:
sort of very convoluted system that was based on, like, you had this block system
Speaker:
and then you had this other system that had the blocks and you could,
Speaker:
Stephen
Speaker:
it was just really, really complicated.
Speaker:
And they had this like this wizard of a presenter that
Speaker:
was just an amazing presenter.
Speaker:
That was very, um, you know, charming and very smart and
Speaker:
just presented all of the stuff.
Speaker:
Uh, and, and even though that presenter did their best, the, the
Speaker:
customer just was not having it.
Speaker:
And even though he was a really great presenter.
Speaker:
Just didn't play.
Speaker:
Anyway, the thing is that the key there is that how you do snapshots very much
Speaker:
dictates how everything's going to work.
Speaker:
Going back to the requirements, right?
Speaker:
You said that they had a very clear list of things.
Speaker:
Do you know why they had that requirement for 90 days?
Speaker:
Was it to like?
Speaker:
What was the purpose behind having that?
Speaker:
Because I'm sure today everyone kind of thinks oh, that's just reasonable, right?
Speaker:
Oh, of course, why don't I have that?
Speaker:
But back then it seemed like that was something very,
Speaker:
they had very specific numbers on the number of restores that had been done.
Speaker:
And I know this as a backup person, but if you look at just
Speaker:
all of the restores that are done.
Speaker:
Ever anywhere, you know, for the history of restores, 99 percent of
Speaker:
them are done from data from yesterday.
Speaker:
Right.
Speaker:
And then, or, or from the most recent snapshot, and then there is this cliff
Speaker:
of usage that just gets smaller and smaller.
Speaker:
And it's just this incredibly ever-decreasing line to zero.
Speaker:
And I think in their world, what they showed was that ever-decreasing line to
Speaker:
zero basically dropped off at 90 days.
Speaker:
And so they said, we need 90 days of user browsable snapshots.
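Illustration (not from the episode): the reasoning above, deriving a retention requirement from how old the restored data usually is, can be sketched as a small calculation. The function name `retention_days`, the `coverage` parameter, and the sample numbers are hypothetical, chosen only to show the shape of the analysis. They are not the customer's real figures.

```python
import math


def retention_days(restore_ages_days, coverage=0.99):
    """Smallest retention (in days) that would have satisfied the given
    fraction of past restores.

    restore_ages_days: one entry per historical restore, giving how many
    days old the restored data was at the time of the restore.
    """
    if not restore_ages_days:
        raise ValueError("need at least one restore record")
    ages = sorted(restore_ages_days)
    # Index of the restore that brings cumulative coverage up to the target.
    k = math.ceil(coverage * len(ages)) - 1
    return ages[max(k, 0)]


# Hypothetical history: most restores are for yesterday's data, with a long tail.
history = [1] * 90 + [7] * 6 + [30] * 3 + [90] * 1
print(retention_days(history, coverage=1.0))  # 90: covers every restore seen
```

With real restore logs in place of `history`, the same calculation gives a number you can put in a requirements document along with the data behind it.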
Speaker:
That's what I meant was that they were really good at articulating what
Speaker:
their requirements were and also
Speaker:
why those requirements.
Speaker:
So like, you
Speaker:
yeah.
Speaker:
And I think that's a good lesson though for backup admins, right?
Speaker:
You should be looking at the metrics of these systems because if you want
Speaker:
to make a case for say new technologies or new process improvements or other
Speaker:
things like that, having this data to show why you need something so you could
Speaker:
put it into requirements is critical.
Speaker:
exactly, exactly.
Speaker:
So let's define
Speaker:
What's a
Speaker:
what we mean.
Speaker:
Yeah.
Speaker:
What is a snapshot?
Speaker:
Let me give you my definition, let's see how closely it
Speaker:
aligns with what you call it.
Speaker:
Alright.
Speaker:
What's your definition
Speaker:
of a
Speaker:
my definition of a snapshot is a point in time copy of the data that existed
Speaker:
at some point in time, so it has to have been plausible, that can be
Speaker:
preserved such that it's not modified when the primary copy gets modified.
Speaker:
So, I would take your definition and I would insert one word
Speaker:
I think to make it perfect
Speaker:
Okay?
Speaker:
and that is the word virtual at the beginning because It's a virtual
Speaker:
copy, because that is really what differentiates a snapshot from a copy,
Speaker:
because you just said a copy, right?
Speaker:
So it is a, I like to use the word view.
Speaker:
It's a view.
Speaker:
Into your volume that you're protecting with the snapshot
Speaker:
at a different point in time.
Speaker:
And I like the word view because it, it, which is very much a database term, right?
Speaker:
It's just a different, it's a way to look at your current volume
Speaker:
at a different point in time.
Speaker:
The, I think the most important thing that differentiates a true snapshot from a lot
Speaker:
of other things that we call snapshots
Speaker:
is that it is a virtual copy in that it relies on the primary volume that it is
Speaker:
protecting for most of the blocks of data.
Speaker:
The bulk of the blocks, when you're reading that snapshot, the bulk
Speaker:
of those blocks are going to come from the, the current volume.
Speaker:
Are you with me?
Speaker:
Right.
Speaker:
Basically, that the changed data is going to come from
Speaker:
snapshot.
Speaker:
some, the snapshot, right, how that happens, that's the difference between
Speaker:
copy on write and redirect on write, but the bulk of the data is going
Speaker:
to come from the current volume.
Speaker:
That's basically what I'm
Speaker:
Okay.
Speaker:
I
Speaker:
we on the same page?
Speaker:
Okay.
Speaker:
All right.
Speaker:
So we can go, I mean, only one of us actually worked at a vendor that did
Speaker:
snapshot, so, you know, want to make sure I'm getting things right here.
Speaker:
Um, this is really the key of the difference between a snapshot and a
Speaker:
copy or a snapshot and a backup. Why does that matter from a backup perspective?
Speaker:
Why does that
Speaker:
snapshots.
Speaker:
They're not independent.
Speaker:
And that's why, you know, those of us that, you know, care about
Speaker:
things like backup and recovery.
Speaker:
We just like to scream and say, snapshots are not backup.
Speaker:
Right.
Speaker:
Um, Now, I will say that you can use snapshots as a way to get backup, but a
Speaker:
snapshot by itself on a volume is, I like to call it a convenience copy, right?
Speaker:
It is a way to go back in time as long as you don't have media failure.
Speaker:
Right, right.
Speaker:
You don't have a double disk failure in a RAID 5 array or a triple
Speaker:
disk failure in a RAID 6 array.
Speaker:
Yeah, and like you're saying, you could use a snapshot to allow you to do other backup
Speaker:
mechanisms like take a copy, preserve it in a point of time, now you move that
Speaker:
data and you back up that data, right?
Speaker:
Which, if you're integrating with applications, you can now take an
Speaker:
application consistent point in time while your database is in hot
Speaker:
backup mode, take that snapshot, now you preserve that point in time.
Speaker:
You can,
Speaker:
uh, thaw the database, so it can continue operating like normal. Thaw the database.
Speaker:
What word are you saying there?
Speaker:
I'll follow.
Speaker:
I just, I don't know what word I thought you were saying there.
Speaker:
So basically you're saying, because you freeze it.
Speaker:
You're saying you freeze it and now you thaw it.
Speaker:
Okay, all
Speaker:
Yeah, or you could quiesce and unquiesce.
Speaker:
Yeah.
Speaker:
Yeah.
Speaker:
Okay.
Speaker:
I just, I don't usually use that term.
Speaker:
So it really threw me.
Speaker:
right.
Speaker:
And then you do your backup off of that snapshot, right?
Speaker:
So you now have a copy that's frozen that you can now do your
Speaker:
backup and it's all good to go.
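Illustration (not from the episode): the freeze, snapshot, thaw, then back up sequence described above can be sketched as follows. `DemoDatabase`, `quiesced`, and `backup_via_snapshot` are hypothetical names invented for this example. A real database or storage array would expose its own commands for hot-backup mode and snapshot creation.

```python
from contextlib import contextmanager


class DemoDatabase:
    """Hypothetical stand-in for a database that supports a hot-backup mode."""

    def __init__(self):
        self.frozen = False
        self.log = []

    def freeze(self):
        self.frozen = True
        self.log.append("freeze")

    def thaw(self):
        self.frozen = False
        self.log.append("thaw")


@contextmanager
def quiesced(db):
    """Freeze the database for the duration of the block, then always thaw it."""
    db.freeze()
    try:
        yield db
    finally:
        db.thaw()  # runs even if taking the snapshot raises


def backup_via_snapshot(db, take_snapshot, copy_out):
    """Freeze only long enough to snapshot, then copy from the snapshot."""
    with quiesced(db):
        snap = take_snapshot()  # should be near-instant on real storage
    # The database is running normally again; the slow copy reads the
    # frozen point-in-time view rather than the live volume.
    copy_out(snap)
    return snap
```

The point of the design is that the database is only held in its frozen state for the moment it takes to create the snapshot, not for the hours the backup copy might take.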
Speaker:
Right.
Speaker:
Um, I'd say the most common outside of storage arrays, the most common snapshot
Speaker:
that is used in that way is VSS, right?
Speaker:
The Windows Volume Shadow Copy Service.
Speaker:
And it, it's basically integrated into the operating system.
Speaker:
It's integrated into the applications.
Speaker:
It is.
Speaker:
And backup apps can integrate with VSS.
Speaker:
These are all done with APIs and a backup app can show up and say, Hey,
Speaker:
I am here to do a backup of this box.
Speaker:
Um, please, I'm going to really simplify it.
Speaker:
Please take a snapshot of everything that needs to have a snapshot taken of
Speaker:
it before I take a backup, then you take a backup and then they, um, and it can
Speaker:
take a backup of that snapshot, even though the volume continues to change.
Speaker:
It's given this view into the volume that is static.
Speaker:
And then it can back up that volume, uh, and get that perfectly application
Speaker:
consistent version of the volume.
Speaker:
Uh, even if the backup takes two hours, it doesn't matter.
Speaker:
It has it, all of the blocks will be from the same exact point in time.
Speaker:
And then when it's done, it can tell VSS to delete that snapshot
Speaker:
or it can keep it around.
Speaker:
It's up to you.
Speaker:
It's just, it's a configuration thing.
Speaker:
One thing I do want to mention
Speaker:
is that, like we said, snapshots just give you that point in time copy, right?
Speaker:
It's a read only, point in time copy.
Speaker:
Now sometimes you will also hear, and I don't think we're covering
Speaker:
it later, clones being used, right?
Speaker:
Where I take a virtual copy of the volume and start using it, just going
Speaker:
back to the previous discussion, Curtis.
Speaker:
The difference is clones are writable.
Speaker:
So you're making changes to it.
Speaker:
Right, a snapshot is a read only copy that you preserve that point in time,
Speaker:
nothing's gonna change it, and it's always there for you to go back, you
Speaker:
can pull your file, your data out of it.
Speaker:
Clones, on the other hand, give you a copy of the volume at a point in time,
Speaker:
but it's so you can use it for some purpose, like for testing out restore
Speaker:
capabilities.
Speaker:
Can I verify my backups?
Speaker:
Those sort of things.
Speaker:
And also doing, like, database recovery against that copy, the clone
Speaker:
copy, and other things like that.
Speaker:
So, clones are different than snapshots, even though they both
Speaker:
might start from a snapshot copy.
Speaker:
Right.
Speaker:
Um, yeah, I, I would probably just call it a read write snapshot, but
Speaker:
maybe that's a contradiction in terms.
Speaker:
Um,
Speaker:
Yes, a snapshot is a point in time, Curtis.
Speaker:
yeah, exactly.
Speaker:
There are three ways that snapshots are created.
Speaker:
The most common way, I, would you say it's still the most common
Speaker:
way, the copy on write method?
Speaker:
Uh, no, I don't think, I don't think so anymore.
Speaker:
Okay.
Speaker:
Well, historically what used to be the most common method before
Speaker:
one vendor ruined it for everybody
Speaker:
Made something
Speaker:
is.
Speaker:
So it is called the copy on write method.
Speaker:
And the reason it's called the copy on write method is that we create this,
Speaker:
this storage area that is going to hold the, um, the, the snapshot blocks.
Speaker:
And when we go to update a block, because we're, it's a storage volume, right?
Speaker:
So we're going to update a block and we say, Hey, Uh, there's
Speaker:
a snapshot for this block.
Speaker:
We're going to copy that block out to the snapshot area before you write.
Speaker:
So that's why it's called copy on write.
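Illustration (not from the episode): a toy, in-memory model of the copy on write behavior just described. `CowVolume` and its methods are hypothetical names. Real arrays work on disk blocks and metadata rather than Python lists, and this sketch simplifies by giving every snapshot its own private copy of a preserved block.

```python
class CowVolume:
    """Toy copy-on-write volume: old data is copied out before an overwrite."""

    def __init__(self, blocks):
        self.blocks = list(blocks)  # the primary (live) volume
        self.snapshots = []         # per snapshot: {block_no: preserved old data}
        self.extra_reads = 0        # extra I/O caused by snapshots
        self.extra_writes = 0

    def snapshot(self):
        """Create a snapshot; nothing is copied yet, so this is instant."""
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, block_no, data):
        # Which snapshots still rely on the primary volume for this block?
        needing = [s for s in self.snapshots if block_no not in s]
        if needing:
            old = self.blocks[block_no]  # extra read of the old contents
            self.extra_reads += 1
            for saved in needing:
                saved[block_no] = old    # extra write into the snapshot area
                self.extra_writes += 1
        self.blocks[block_no] = data     # finally, the write that was asked for

    def read_snapshot(self, snap_id, block_no):
        """Preserved copy if one exists, otherwise the live block."""
        saved = self.snapshots[snap_id]
        return saved.get(block_no, self.blocks[block_no])


vol = CowVolume(["a", "b", "c"])
snap = vol.snapshot()
vol.write(1, "B")
print([vol.read_snapshot(snap, i) for i in range(3)])  # ['a', 'b', 'c']
```

Note that after the write above, the snapshot area holds only block 1. Blocks 0 and 2 are still read from the primary volume, which is why losing the primary volume also loses the snapshot.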
Speaker:
And that is very expensive because if you think about it, there are, let me count.
Speaker:
There, there, there is a read.
Speaker:
And a write for every write.
Speaker:
There's, there's a says,
Speaker:
and it's not just the one read and write that happens on the snapshot side.
Speaker:
You also have a bunch of metadata and updating indirect blocks and
Speaker:
a whole bunch of other things.
Speaker:
So, yeah, doing a copy on write might lead to, say, 10 additional I/O
Speaker:
operations or 12.
Speaker:
Yeah.
Speaker:
for that one block.
Speaker:
And so what happens is that over time that the number of blocks, remember
Speaker:
when I initially, I mentioned that the bulk of the data is going to come from
Speaker:
the primary volume, but over time.
Speaker:
As a snapshot has been created, more and more blocks are going to be copied into
Speaker:
that snapshot area, and which means that at some point, you know, a significant
Speaker:
portion of my snapshot, if I'm,
Speaker:
if I'm reading it, a significant portion is going to come from the snapshot area.
Speaker:
And so it, there are multiple reasons that there's a performance hit.
Speaker:
I think the biggest performance hit is,
Speaker:
you know, you talk about all of the IOs that have to happen every time we update,
Speaker:
uh, you know, we do a copy on write.
Speaker:
The other is that as time goes on, the more and more data that I have to get
Speaker:
from my snapshot area when I'm doing a read, the performance goes down.
Speaker:
And this goes back to that vendor that they basically suggested that if we had
Speaker:
90 days of user browsable snapshots, that their performance was going to be
Speaker:
like half of what, uh, what it typically
Speaker:
And I would say that is probably based on older technology, Curtis.
Speaker:
I think when you had traditional RAID arrays or RAID groups
Speaker:
that you were creating,
Speaker:
I think there was more of a performance impact. I think now, since you end
Speaker:
up aggregating a bunch of disks and then carving out volumes, and so
Speaker:
you can share in the performance,
Speaker:
I think that's not as big of a concern anymore as it used to be.
Speaker:
Like I know those systems you're talking about, they now support
Speaker:
a thousand snapshots, right, for
Speaker:
so you think that the main hit from a performance standpoint on a copy
Speaker:
on write snapshot scenario in modern technology is mainly that I/O hit the
Speaker:
first time you go to do a write, every time, every time you update a block?
Speaker:
And by the way, remember that it has to do that for every, it has to sort of
Speaker:
calculate every time it does a write.
Speaker:
It has to calculate, is there a snapshot that is looking at this
Speaker:
block as it exists at this point in time, and then you have to copy it,
Speaker:
uh, for that, for that snapshot.
Speaker:
Which, there are different mechanisms you could use.
Speaker:
You could use bitmaps, you could use other things.
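Illustration (not from the episode): one way a bitmap could answer the question "has this block already been preserved for the snapshot?" with a single bit per block. `CopiedBitmap` is a hypothetical name, and this is only a sketch of the general idea, not any vendor's on-disk format.

```python
class CopiedBitmap:
    """One bit per block: 1 means the original block is already preserved."""

    def __init__(self, n_blocks):
        self.n_blocks = n_blocks
        self.bits = bytearray((n_blocks + 7) // 8)  # 8 blocks tracked per byte

    def is_copied(self, block_no):
        return bool(self.bits[block_no // 8] & (1 << (block_no % 8)))

    def mark_copied(self, block_no):
        self.bits[block_no // 8] |= 1 << (block_no % 8)
```

At one bit per block, a million blocks need roughly 122 KiB of bitmap, so the check on each write can stay in memory.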
Speaker:
So, it's not the end of the world, and I think that a lot of these storage vendors
Speaker:
have optimized if they are continuing to use copy on write technologies.
Speaker:
But I would say a good chunk of them have moved away from
Speaker:
copy on write because of the I/O
Speaker:
penalties that we've talked
Speaker:
right, right.
Speaker:
So the, the, there was this vendor that came out, a little vendor
Speaker:
called, at one point it used to be called Network Appliance.
Speaker:
And it had a little screw and bolt as its logo.
Speaker:
right.
Speaker:
Um, the, the, at one point they said, we're just going to change our
Speaker:
name to NetApp because that's what
Speaker:
That's what everyone calls them.
Speaker:
I remember actually working at a startup and got my hands on my
Speaker:
first NetApp appliance and was like, yeah, wow, these are kind of cool.
Speaker:
And this was before I started working there.
Speaker:
And I was like, wow, this is really amazing and simple
Speaker:
for what it does and easy to
Speaker:
Yeah, exactly.
Speaker:
And they used a completely different way to, um, to do snapshots that at the time
Speaker:
was revolutionary, which I think has now been adopted by a lot of storage vendors.
Speaker:
And that is, we call it redirect on write.
Speaker:
Do you want to describe how that
Speaker:
works?
Speaker:
so what read so before you get there?
Speaker:
I think we need to talk about their write anywhere file layout
Speaker:
which allows then the snapshot.
Speaker:
Yep, WAFL. Which a lot of other vendors I think do something similar as well these
Speaker:
days. But what it is, going back to the copy on write example: if you are writing
Speaker:
a block of data in a copy on write sort of file system, previous file systems,
Speaker:
you would always write to the same spot.
Speaker:
And because you're always writing to the same spot, that's why you have to first
Speaker:
copy out the data and then update it.
Speaker:
With a write anywhere file layout, what you end up doing is, it doesn't
Speaker:
matter which actual block you end up writing to, you basically construct the
Speaker:
metadata tree to reference that block.
Speaker:
Even though I might be updating block 100, block 100 might actually
Speaker:
physically be at, like, block 1000.
Speaker:
And because I have all my metadata that tells me exactly where that data
Speaker:
exists, I just need to update the metadata to say, okay, if someone
Speaker:
tries to access block 100, it's actually physically on block 1000.
Speaker:
And so you can end up writing to any location in the file system
Speaker:
and not having to worry about always hitting that same location.
Speaker:
And so that's kind of WAFL in a nutshell.
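
The logical-to-physical indirection described above can be sketched as follows. This is a simplified model under stated assumptions: `WriteAnywhereVolume` is a hypothetical name, a flat dictionary stands in for the metadata tree, and an append-only list stands in for the disk.

```python
class WriteAnywhereVolume:
    """Toy write-anywhere layout (hypothetical, illustrative only)."""

    def __init__(self):
        self.physical = []     # append-only list standing in for the disk
        self.block_map = {}    # metadata: logical block number -> physical location

    def write(self, logical_no, data):
        # The data lands in any free physical location, never on top of the old data.
        self.physical.append(data)
        # Only the metadata pointer changes to reference the new location.
        self.block_map[logical_no] = len(self.physical) - 1

    def read(self, logical_no):
        # Follow the pointer to wherever the block physically lives.
        return self.physical[self.block_map[logical_no]]


vol = WriteAnywhereVolume()
vol.write(100, b"v1")
first_location = vol.block_map[100]
vol.write(100, b"v2")   # logical block 100 now points at a different physical block
```

After the second write, the old physical block still holds `b"v1"`; nothing had to be copied out before the update.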
Speaker:
So basically what you have then is this metadata system that has a pointer
Speaker:
to every block and it doesn't really matter where those blocks happen to be.
Speaker:
And you can have thousands of these snapshots, right?
Speaker:
These pointers at the top.
Speaker:
right, so then when we go to do an update and we have a snapshot, it just
Speaker:
means that what we're going to do is we're going to change the pointer.
Speaker:
We're going to say, okay, this, there's this block that's sitting here.
Speaker:
And we know we, we, we're not supposed to update that block because we
Speaker:
have a snapshot that's relying on that, that's requiring that block.
Speaker:
And then we're going to just redirect.
Speaker:
We're going to, we're going to write a new block for the
Speaker:
new version of that old block.
Speaker:
And then we're going to change the pointer, right?
Speaker:
We're going to redirect the pointer.
Speaker:
To this new location of that block.
Speaker:
Meanwhile, the old block is still sitting there and we've got a
Speaker:
snapshot that's just pointing to it.
Speaker:
Right.
Speaker:
Um, and so you've got this infinite number of snapshots that are pointing
Speaker:
to an infinite number of, well, it's not an infinite number of snapshots, but
Speaker:
you have a very high number of snapshots that are, that, that, and they're all.
Speaker:
Just a whole bunch of metadata pointing to a whole bunch of blocks that are
Speaker:
just sitting all around the volume.
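
A redirect-on-write snapshot, as described here, is just a frozen set of pointers. The sketch below assumes a hypothetical `RedirectOnWriteVolume` class and copies the whole pointer map for each snapshot to keep the example short; real systems would more likely share unchanged parts of a metadata tree.

```python
class RedirectOnWriteVolume:
    """Toy redirect-on-write volume (hypothetical, illustrative only)."""

    def __init__(self):
        self.physical = []     # append-only list standing in for the disk
        self.active = {}       # live map: logical block -> physical location
        self.snapshots = {}    # snapshot name -> frozen copy of the map

    def write(self, logical_no, data):
        # Same cost with or without snapshots: write a new block, move the pointer.
        self.physical.append(data)
        self.active[logical_no] = len(self.physical) - 1

    def create_snapshot(self, name):
        # Copies pointers only; no data blocks are touched.
        self.snapshots[name] = dict(self.active)

    def read(self, logical_no, snapshot=None):
        block_map = self.active if snapshot is None else self.snapshots[snapshot]
        return self.physical[block_map[logical_no]]


vol = RedirectOnWriteVolume()
vol.write(100, b"monday")
vol.create_snapshot("snap1")
vol.write(100, b"tuesday")   # old block stays put; only the live pointer moves
```

The snapshot keeps pointing at the old block while the live volume points at the new one, and the write path never checks whether a snapshot exists.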
Speaker:
And the one thing, so this all sounds amazing, right?
Speaker:
Because you're like, Oh, writes are, writes are super fast.
Speaker:
I don't have to worry about it.
Speaker:
In fact, Curtis, just one correction.
Speaker:
When you're going to actually write a new block, you never actually have to
Speaker:
look up the old block because you're always writing to a new location.
Speaker:
So you don't care if that old block is occupied by a snapshot or not, right?
Speaker:
But this is the downside, though, of snapshots, especially with the
Speaker:
write anywhere file layout:
Speaker:
you now need a process to go through and say, when a snapshot gets deleted,
Speaker:
what blocks are no longer being used actively because they don't
Speaker:
belong to snapshots and they're not currently part of the volume.
Speaker:
So you have a garbage reclamation process or different vendors call it
Speaker:
different things, but some process to go through and reclaim all of those
Speaker:
free blocks so they can be reused.
Speaker:
I think you said the same thing I said in different words, but
Speaker:
yeah, I see what you're saying.
Speaker:
I guess I was saying that, that it's making a decision on what to do.
Speaker:
Based on whether or not, I think if I could just change my part
Speaker:
of my answer, instead of, is this block being used by anything else?
Speaker:
I guess is,
Speaker:
Yeah, more
Speaker:
I guess that's the question I'm asking.
Speaker:
Yeah.
Speaker:
Uh, and actually, I guess what you're saying is it actually doesn't even make
Speaker:
that decision at that point in time.
Speaker:
If it's going to modify a new block, if it's going to modify a
Speaker:
block, it just writes a new block.
Speaker:
And then what happens to that old block is a completely separate process.
Speaker:
If there's, if there is a snapshot that's pointing to that, that
Speaker:
old block, then it will stay.
Speaker:
If there are no snapshots that are pointing to that block, then at some
Speaker:
point the garbage collection process will come and make it go away.
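
The separate reclamation pass described here can be sketched as a set difference. The function name `reclaim_free_blocks` and the dictionary-based pointer maps are assumptions for illustration; they are not taken from any vendor's implementation.

```python
def reclaim_free_blocks(all_blocks, active_map, snapshot_maps):
    """Return physical blocks referenced by neither the live volume nor any snapshot."""
    in_use = set(active_map.values())          # blocks the live volume points to
    for snap_map in snapshot_maps.values():
        in_use.update(snap_map.values())       # blocks any snapshot still points to
    return set(all_blocks) - in_use


# Hypothetical state: four physical blocks, logical block 100 rewritten several times.
all_blocks = range(4)
active_map = {100: 3}                 # live volume points at physical block 3
snapshot_maps = {"snap1": {100: 1}}   # one snapshot still pins physical block 1

free_now = reclaim_free_blocks(all_blocks, active_map, snapshot_maps)
free_after_delete = reclaim_free_blocks(all_blocks, active_map, {})
```

Block 1 becomes reclaimable only after the snapshot that pins it is deleted, which matches the point that the write path never makes this decision itself.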
Speaker:
Is that a better, is that a better
Speaker:
Yes, that is a better description.
Speaker:
Now this is
Speaker:
don't want to argue with a former NetApp employee
Speaker:
about how NetApps
Speaker:
now I should say this is based on NetApp's technology, different
Speaker:
vendors may do different things, but for the most part, most of the
Speaker:
vendors do something similar ish.
Speaker:
Specifically, it's based on how NetApp worked when you worked
Speaker:
there, which was a while ago.
Speaker:
And things may be different now,
Speaker:
that is also true.
Speaker:
not.
Speaker:
I mean, WAFL is like at the core of their...
Speaker:
You know, the core of their technology.
Speaker:
And so, and that's why with redirect on write, that's why you could essentially
Speaker:
have an infinite number of snapshots with, with zero performance penalty,
Speaker:
you have zero performance penalty of having basically the performance
Speaker:
penalty of doing a write update
Speaker:
Is the same whether you have a snapshot or you don't have a snapshot.
Speaker:
The, the only penalty, if you want to call it that, is that, um, one, one
Speaker:
big difference between this way and the other way is there is no snapshot area.
Speaker:
Right.
Speaker:
So in the other method, the snapshot area could fill up if
Speaker:
you held snapshots for too long.
Speaker:
In this configuration, there is no snapshot area.
Speaker:
The snapshot area is the volume, and if you have too many updates and you
Speaker:
keep too many snapshots, you would fill up the volume with snapshots.
Speaker:
So you've gotta get rid of the older
Speaker:
Well, and this is where, that would apply if you're talking NetApp terminology
Speaker:
traditional volumes, but most of it has moved over to virtual volumes, where once
Speaker:
again, you have an aggregate, a shared pool of common data, and for each of the
Speaker:
volumes, typically you also set a limit on how much space a snapshot can occupy.
Speaker:
So you could say, I am allowing 20 percent for snapshots of my overall
Speaker:
volume capacity, in which case it'll start
Speaker:
And what happens when you hit that wall?
Speaker:
I, I don't know what the current behavior is.
Speaker:
Previously.
Speaker:
I believe it would like let you like start automatically pruning
Speaker:
snapshots and trying to free up space.
Speaker:
Right, right.
Speaker:
Because it, obviously, it's not going to, it's not going to prune, uh,
Speaker:
production, you know, current data.
Speaker:
So, yeah, I, it could, but again, we're, we're talking specifically NetApp,
Speaker:
but something has to happen, right?
Speaker:
If you're using this method.
Speaker:
Something has to happen.
Speaker:
Either we have to stop creating new snapshots, right?
Speaker:
Or stop updating the snapshots that we have.
Speaker:
And, uh, we need to delete older snapshots or we need to maybe delete, you know,
Speaker:
certain ones in the middle, right?
Speaker:
Basically you've got to do some kind of pruning or else you're going to
Speaker:
Yeah, the other challenge is also figuring out what snapshot to delete
Speaker:
because blocks are being shared, right?
Speaker:
You might be like, hey, this snapshot is huge and you go delete it, but
Speaker:
because those blocks are being shared by other snapshots, you're not actually
Speaker:
going to free any space, right?
Speaker:
So you need to be able to figure out like which snapshot actually
Speaker:
contains unique blocks that if I delete it will actually save me space.
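
One way to answer the question of which snapshot actually frees space is to count references per physical block. This is a sketch under assumptions: `reclaimable_blocks` is a hypothetical helper, and each snapshot is modeled as a plain dictionary from logical block to physical block.

```python
from collections import Counter


def reclaimable_blocks(snapshot_maps, active_map):
    """For each snapshot, the physical blocks that only that snapshot references."""
    refs = Counter()
    for snap_map in snapshot_maps.values():
        refs.update(set(snap_map.values()))    # count each snapshot once per block
    live = set(active_map.values())            # blocks the live volume still uses
    return {
        name: {b for b in set(snap_map.values()) if refs[b] == 1 and b not in live}
        for name, snap_map in snapshot_maps.items()
    }


# Hypothetical example: both snapshots reference three blocks, mostly shared.
snapshot_maps = {
    "big":   {1: 10, 2: 11, 3: 12},
    "small": {1: 10, 2: 11, 3: 13},
}
active_map = {1: 10, 2: 14, 3: 15}

unique = reclaimable_blocks(snapshot_maps, active_map)
```

Here deleting either snapshot frees exactly one block, even though each one references three, because block 10 is still live and block 11 is shared by both snapshots.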
Speaker:
complicated.
Speaker:
Storage management.
Speaker:
I
Speaker:
I don't miss production storage management.
Speaker:
Any other final thoughts on redirect on write?
Speaker:
think that covers it
Speaker:
I mean, my personal take: if you're going to do snapshots on a storage array, I
Speaker:
think redirect on write is the way to go.
Speaker:
It sounds like what you're saying, they've made copy on write better,
Speaker:
but I still think redirect on write is just significantly better.
Speaker:
So, um, but it might be more complicated if you're coding it.
Speaker:
Right.
Speaker:
So the next one is what I'm going to call the dumbest of all snapshot methods.
Speaker:
That's not what I have in the book.
Speaker:
I gave it a much nicer name in the book.
Speaker:
And guess who does this method?
Speaker:
The leading hypervisor company in the world.
Speaker:
Yes.
Speaker:
I think that's a fair statement, right?
Speaker:
company in the world.
Speaker:
I think, I think it still is.
Speaker:
Yeah.
Speaker:
And that would be VMware.
Speaker:
So the way VMware does snapshots is just literally the Dumbest
Speaker:
implementation of snapshots that I've ever seen and I don't know how they
Speaker:
haven't addressed it, but here it is.
Speaker:
When you create a snapshot in VMware, it literally holds all the writes.
Speaker:
Now, by the way, if I'm wrong, by the way, you know, Broadcom, don't sue me.
Speaker:
This is based, this is based on my understanding of VMware snapshots.
Speaker:
Uh, you know, I've, I've checked every once in a while and they,
Speaker:
no one seems bothered by this.
Speaker:
Uh, but if this has changed, any of you that are, you know, if anybody
Speaker:
works for Broadcom slash VMware, then, you know, feel free to update
Speaker:
me and I will update this episode.
Speaker:
And I'll just delete this section.
Speaker:
But here's the way it works.
Speaker:
When you create a single snapshot on a VMware volume, it halts
Speaker:
all writes on the, on the current volume.
Speaker:
And then it keeps all writes in a snapshot area.
Speaker:
And then when you delete that snapshot, it replays all those
Speaker:
writes against the production volume.
Speaker:
And this is why, when you make a snapshot
Speaker:
and then you, if you hold that snapshot for a long time and then
Speaker:
you delete that snapshot, this is why it has a big performance hit
Speaker:
against the production volume.
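
The hold-all-writes behavior described here can be modeled with a small sketch. This follows the speaker's description only and is not a claim about how VMware implements it internally; the class `HoldAllWritesDisk` and its methods are hypothetical names.

```python
class HoldAllWritesDisk:
    """Toy model of the 'hold all writes' snapshot method (hypothetical, illustrative only)."""

    def __init__(self, num_blocks):
        self.base = [b""] * num_blocks   # production disk, frozen while a snapshot exists
        self.delta = None                # snapshot area; None means no snapshot

    def create_snapshot(self):
        self.delta = {}                  # from now on, writes are held here

    def write(self, block_no, data):
        if self.delta is not None:
            self.delta[block_no] = data  # held aside, production disk untouched
        else:
            self.base[block_no] = data

    def read(self, block_no):
        # Newest data wins: check the held writes first, then the frozen disk.
        if self.delta is not None and block_no in self.delta:
            return self.delta[block_no]
        return self.base[block_no]

    def delete_snapshot(self):
        # Replay every held write against the production disk.
        replayed = len(self.delta)
        for block_no, data in self.delta.items():
            self.base[block_no] = data
        self.delta = None
        return replayed


disk = HoldAllWritesDisk(4)
disk.create_snapshot()
disk.write(0, b"a")
disk.write(1, b"b")
base_before_delete = disk.base[0]    # still the old contents
replayed = disk.delete_snapshot()    # cost grows with how much was written meanwhile
```

The longer the snapshot is held, the more blocks accumulate in `delta`, and all of that I/O lands on the production disk at deletion time.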
Speaker:
But no one
Speaker:
this is
Speaker:
snapshot of a VM
Speaker:
yeah, this is why you do not do this.
Speaker:
You don't use VMware level snapshots the way you do any other
Speaker:
snapshots, because, and by the way, I used VMware for years before knowing this.
Speaker:
That's why I want to make sure I mentioned it.
Speaker:
And, and that is that if you create a snapshot and then hold it for a
Speaker:
long period of time, you're going to get hit with a massive IO hit
Speaker:
when you delete that snapshot.
Speaker:
So if you're using VMware, VMware level snapshots, then you use them the way we
Speaker:
talked about earlier, where you create a snapshot, you make a, you make a backup.
Speaker:
And then you delete the snapshot.
Speaker:
Maybe you take a VMware level snapshot, and then you take a storage level
Speaker:
snapshot of that snapshot, and then you delete the, the VMware level snapshot.
Speaker:
You should, if this is the way your snapshot system works, you
Speaker:
cannot leave the snapshots around for any significant period of time.
Speaker:
I was going to chime in.
Speaker:
Thank you for covering that.
Speaker:
The, this specifically is VMware software snapshots,
Speaker:
if you wanna call it that.
Speaker:
Right, that are only done at the VMware level.
Speaker:
Now, there are integrations that various storage vendors offer
Speaker:
by plugging into the VMware API.
Speaker:
So whenever you trigger a VMware snapshot, it actually triggers
Speaker:
a storage level snapshot.
Speaker:
So avoiding some of these issues, but not everyone is aware of it.
Speaker:
Not everyone is using a third party storage array that integrates with VMware.
Speaker:
So.
Speaker:
Just
Speaker:
Yeah.
Speaker:
So, right.
Speaker:
Thanks for, thanks for clarifying that.
Speaker:
This is specifically VMware level snapshots that are done by VMware.
Speaker:
And without any third party storage.
Speaker:
Yeah.
Speaker:
And I don't know why VMware did this, but it's bonkers.
Speaker:
It's just literally one of the weirdest
Speaker:
coded things I've ever heard.
Speaker:
Why would you do it that way?
Speaker:
Somewhere in a meeting, this is how they decided to
Speaker:
it.
Speaker:
was probably easier
Speaker:
and yeah, yeah, maybe it was easier,
Speaker:
and they didn't talk to the
Speaker:
I wonder about that.
Speaker:
They didn't, they exactly, they did not talk to the backup folks.
Speaker:
Well, uh, I think we have, uh, summarized the world of snapshots.
Speaker:
That?
Speaker:
No, I think we did a good job with that.
Speaker:
So, copy on write, redirect on write, dumbest method ever.
Speaker:
Those are the three types.
Speaker:
I've got it officially in the book, uh, I've got this labeled
Speaker:
as the hold all writes method.
Speaker:
Uh, I should really just change that to the dumbest method ever.
Speaker:
But, um, yeah.
Speaker:
So, you know, snapshots are a great tool
Speaker:
in the backup and recovery arsenal.
Speaker:
They are a great sort of basis upon which we're going to talk about one
Speaker:
of my favorite ways to do backup.
Speaker:
And we're going to talk about that in another episode.
Speaker:
Hint, it's called near CDP, not CDP.
Speaker:
It's called near CDP.
Speaker:
And, uh, it's just, just the number one thing you have to understand
Speaker:
about snapshots is that unless you have copied this snapshot to
Speaker:
another location via some mechanism,
Speaker:
which could be backup,
Speaker:
it could be replication of the volume,
Speaker:
it could be a number of things,
Speaker:
you do not have a backup.
Speaker:
You have a picture of your volume.
Speaker:
And that picture of your volume is as worthless as a picture of your
Speaker:
house after your house burns down.
Speaker:
It'll just be a nice memory and, uh, and a really bad day.
Speaker:
So it's just, that's the really, the most important thing to
Speaker:
understand about snapshots.
Speaker:
And now, if this is the first time snapshots have been explained to you,
Speaker:
now you understand why I don't like it that they call
Speaker:
what AWS does snapshots, because that very much does not meet
Speaker:
the definition that we just had.
Speaker:
And I'm glad I brought this up because it's important: what we're talking
Speaker:
about is storage level snapshots.
Speaker:
Darn it.
Speaker:
I don't know
Speaker:
You can't call it that, yeah, because
Speaker:
this.
Speaker:
Yeah, these are traditional snapshots.
Speaker:
There are other things out there that people call snapshots
Speaker:
that don't work like this.
Speaker:
AWS snapshots don't work like this.
Speaker:
Uh, they are an actual image copy.
Speaker:
It's actually, they actually, when you make an AWS snapshot, it actually copies
Speaker:
that, that point in time out to another area of storage, which happens to be S3,
Speaker:
you?
Speaker:
And I think specifically you're talking about an AWS EBS snapshot
Speaker:
thank you.
Speaker:
I am talking about an AWS EBS snapshot.
Speaker:
Um, my former employer, Druva, they call what they do snapshots.
Speaker:
They call their backups snapshots.
Speaker:
I never liked that, but you know, nobody asked me.
Speaker:
Uh, so, but what we're talking about here is traditional snapshots.
Speaker:
And, um, a lot of other people will call what they do a snapshot.
Speaker:
Um, the problem is like a lot of terms in the, in the backup world.
Speaker:
It's a term like so many of our terms are, um, they're words that are used
Speaker:
just, they're just English words that are used in so many different contexts.
Speaker:
when, like when we had the CDP episode, we couldn't figure out what to call
Speaker:
those points in time, because a lot of the CDP vendors called them snapshots.
Speaker:
Yeah, exactly.
Speaker:
Yeah, exactly.
Speaker:
All right.
Speaker:
Well, uh, I guess the only thing left for me to say is that's a wrap