Looking at this from a high level, the day-to-day job of an admin is basically to keep the lights on and things moving forward. We put a huge amount of time and effort into making sure what we do is done to the level we expect. But for some reason projects always seem to get blurred or rushed along. This leads to sub-par work, and maybe some check boxes left un-ticked. Eventually we catch these, or go back and fix them later. But until then there is a landmine waiting to catch someone, because we as admins got distracted. We have all run across IT landmines, some of our own doing and some left by our predecessors. It's inevitable that you will run across them.
Over the last month or so I have had some time to think and reflect on what is going on. As I sit and look at what a good portion of my work day is wrapped around, I realize it's amazing that anything in IT actually works. Think about it… and hear me out on this.
When building a Rube Goldberg Machine, you pick what you want to accomplish, what your theme will be, and what materials you will use. I think that is the key starting place. Then you start to lay out the design roughly, and research and figure out the physics of said design. You spend countless hours researching and documenting the plan. Then it is time to build. This is your time to shine with all the research you did, and you think you can take on the world. You start laying out the pieces and putting together the individual parts. One team works on one section, you work on another, another team works on another part, and so on. Your teams start testing individual parts of the contraption, find flaws, and adapt as necessary. This is where your design and the actual end product begin to differ. Things keep changing, and tests keep failing and succeeding. Then finally you have fully tested individual parts and segments. It's ready for the big moment. Let's run this thing. This is what all that time and effort has been leading up to. You start the first action in a seemingly never-ending chain of events. Things go smoothly until that one part. And it fails. You assess the situation, reset, make some modifications, and go again. This repeats until you finally get it to succeed all the way through.
Now compare this story to building a new data center. You put together your business requirements, your design, and your hardware choices, and begin formulating the complete build design. You research every best-practice document, check every driver version to make sure it's supported, and on and on. You order your hardware and wait for it to show up. When it does, the real work begins. You start building out your new data center. One team does storage, one team does the network, one team does the compute, and so on. You start testing, find some issues, make some changes to correct the errors, and continue until each team is satisfied with their part. Then you start to test the whole system. Things fail and you make more changes, and you repeat this process a few times.
If you look at the two, they are vastly different in what they do, but extremely similar in the build and design process.
So, take the Rube Goldberg Machine that you built. Think of running that same machine billions of times a day on repeat. One slight gust of wind or a shake in the floor could throw the contraption off, and it would fail. In IT you build redundant systems, but that's really like building two of these things and trying to keep them running all day, every day, 365 days a year. All it takes is one thing that is not correct, from a malformed packet to a wrong SQL query, and the whole contraption could go up in flames.
Also, let's not forget about all the changes you made from the original design. Now sit back and think. Did you document every change? Did you test every setting? Did you… and so on. I am sure there are a few changes in there that were not documented. These are what I like to call landmines. They are the little settings that you forgot you changed for some reason or another. Maybe you changed them out of frustration because some process was taking too long, like disabling user acknowledgements or disabling spanning tree. Right now they might not make a difference, but in the future they will make someone's day a nightmare.
Think about it a year down the road. You go to plug a device into a switch for a redundant link, and all of a sudden you create a switching loop. Or you go to do a firmware update and upload it, and all of a sudden all your servers reboot at once. All because you changed one setting a year ago. These are the landmines, lying in wait for someone to step on them.
So lately I have come to the conclusion that IT infrastructure is a Rube Goldberg Machine built on a field of landmines. We are all just waiting for one thing to go wrong. Then we make another change and hope we did not alter the path of something and step on a landmine in the process.
So what is the moral of this story? I really don't know; I just thought I should write about it. Maybe it's this: document your changes, no matter how small, because they could come back and bite you or someone else later down the road. And make sure you understand the full ramifications of your changes before you make them, as they may alter the path of something a few steps down the process chain, in turn blowing everything up.