This week’s guest tells the most incredible story we’ve ever had on the podcast. We’ve had ransomware restores, disaster recoveries after a hurricane, but we’ve never had someone who deleted their entire computing environment and then restored it using their backups. (Backups that had never been tested to this degree, BTW.)
Paul VanDyke is the IT Supervisor at the Kodiak Island Borough in Alaska. Kodiak is the second largest island in the US, and the borough has to satisfy its backup and DR needs while staying on the island. Cloud resources are not a possibility due to bandwidth concerns, so he’s doing things “old school.” We first talk about the kinds of things they are protecting against, including tsunamis, fires, and strong winds. Their backups are primarily based on tape, and for DR they store copies of all backups in a nearby safe. We also discuss ways they could improve their resilience, such as shipping some tapes to a location on the mainland.
But the highlight of this episode is the story of when Paul intentionally destroyed his entire environment and then tested his backup system! He learned many valuable lessons, starting with “don’t ever do that again!” Luckily, his test was successful, albeit not without some challenges. He wiped the storage arrays on five servers and then restored them: two domain controllers, an email server, a file server, and an application server. (He had his reasons for doing it this way, which he goes into in the podcast.)
One big thing he learned was that restores are often slower than backups, so he prioritized critical apps (e.g., email, file server, logins) and got them up by Monday morning. It then took him a few more days to get the application server up and running due to a more complicated restore. We have a really good discussion on how Paul could have done things better, including a really good idea that Prasanna came up with. Curtis also tells a similar story about the first time he “tested” backups when he actually needed them, versus doing it in advance.
We cover a number of topics and questions on this podcast:
What was an Exabyte Mammoth (M2) tape drive?
What is a helical scan tape drive?
What is multiplexing?
Why can restores be slower than backups?
What happens when you rebuild a RAID array?
Should you have a post-mortem after a large incident?
How important is recovery testing?
How important is it to set expectations in IT, especially when it comes to recovery times?