A large part of what we do here at McLean IT is manage client infrastructure, or, put in plainer terms, the servers and networks that support their respective businesses. No two networks are exactly the same, and that’s by design. There are some foundational similarities between them, but as they say, “form follows function”. The unique needs of the business dictate the exact architecture of any given system of computers.
One of my larger clients has had some critical problems on their network. Twice in the last couple of months, the entire network failed. Users were unable to log in, print, browse the internet, or reach the network file server. Of course a problem of this scale can’t be solved by a simple “have you tried turning it off and on again?”. When it happens, it causes real disruption because the entire office grinds to a standstill, and since it is a large office, the downtime cost is considerable. Each time it happens, it can take up to a couple of hours to fix. 2 hours x 20+ unproductive admin staff = -$$$.
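To put a rough number on that last line, here is a minimal sketch of the arithmetic. The 2 hours and 20 staff come from the outage described above; the hourly rate is a made-up placeholder, not a figure from this client, so swap in a real loaded rate before quoting the result to anyone.

```python
def downtime_cost(hours, staff, hourly_rate):
    """Estimate lost payroll value for an outage (hours x people x rate)."""
    return hours * staff * hourly_rate


# 2 hours and 20 staff are from the outage described above.
# The 30.0/hour rate is a hypothetical placeholder.
cost = downtime_cost(hours=2, staff=20, hourly_rate=30.0)
print(f"Estimated cost per outage: ${cost:,.2f}")
```

This only counts idle wages. It ignores missed deadlines, lost sales, and the time spent catching up afterwards, so treat it as a floor rather than a full estimate.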
First, let me say a quick disclaimer that I’ve had this client for a little over a year now and it’s taken me all that time to untangle the mess that it was. Virtually nothing was documented. What was, was inaccurate. Everything down to the wiring was incomprehensible. This particular issue has only come up in the last couple of months.
It should be noted that the domain controller and file server are the same device. This isn’t uncommon in smaller networks and this is how this office started out. A lot of the mess I’ve had to untangle can be attributed to rapid growth without proper management. A good argument for proper network design. But of course the implication of this single-server design is that if anything goes wrong with the domain controller, so too go the files.
The first time this happened, it took me a while to find the cause of the issue. Some users could reach the internet, some could not. Some could log in, some could not. Some could print and access servers, some could not. The DC (Domain Controller) was unresponsive. I could ping it, but could not log on locally. Shares were not accessible. Rebooting took forever, indicating a problem accessing the domain controller (in other words, it could not access itself). I ran a repair of the Directory Services and rebooted, which eventually helped to the point of being able to log in, and everything worked… for a few minutes, until the issue returned. I booted into safe mode and found a bunch of errors pertaining to the BDC (Backup Domain Controller). Knowing that users were unable to log in, it was fairly obvious that the BDC wasn’t doing its job, and it was scheduled to be decommissioned anyway, so I turned it off. Then I rebooted the DC once again. Since last time it seemed to work at first but halted after a few minutes, I waited to be sure. This time, it came back up and was stable. The conclusion, basically, was that the AD (Active Directory) of the BDC was corrupted somehow and was preventing AD replication from the DC, thereby crippling it. There wasn’t much point in trying to fix the BDC since it was soon to be replaced anyway.
So since I’m still trying to untangle the way these servers were being utilized, I didn’t want to take the BDC away outright. I left it there just in case there was an undocumented role that it filled. There did turn out to be one, but it was easily replaced with a VM (Virtual Machine) that I created using Hyper-V.
About a couple of weeks ago, the same thing happened a second time. The symptoms were the same, and the DC was again unresponsive. Since it seemed so familiar, I pinged the static IP address of the BDC and it responded right away. Somehow, the BDC had been turned back on. I’m still not sure how it was activated, or by whom, but there it was, giving me another massive headache. I went on site, made sure the BDC was off, repaired the AD of the DC (the repair was necessary as somehow the sync was corrupting things again), reloaded DNS for good measure, and things were up again.
The manager was not pleased that this was causing issues yet again. I needed to find a more permanent solution to this problem ASAP.
So let me walk you through all the system failures involved here. To the end user, all they can determine is that “the network is down”. But there are actually several different but related system failures contributing to the outage.
The office workers don’t all arrive at the same time. Some turn their computers off at night, others leave theirs running but logged out, and still others leave theirs running and logged in. When the DC goes down, workstations are unable to contact it, which means users cannot log in to their domain accounts. Those who are already logged in can continue to use their computers as usual, apart from not being able to reach some network resources or the internet. Those who were logged off cannot log in, but are confused that some people nearby are logged in, and so they report to the manager that it works for some people and not others (this causes much confusion for everyone, but is easily explained). Those whose computers were turned off will have the same issue, but with an additional potential problem in this case, which brings me to the next system failure.
DHCP (Dynamic Host Configuration Protocol). Basically what this does is hand out addresses to all the network devices asking for one. Without an address, a device cannot speak to the network. Of course this system is also configured on the DC. So when it goes down, devices can no longer obtain or renew an address, and eventually lose the ability to talk to one another. But some devices *do* have addresses, further confusing end users. This is explained by DHCP’s lease time. Basically it’s the time before the assigned address expires and the device is expected to ask for a new one. So some user devices might still have time left, while others’ leases may have expired in the middle of the server failure.
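The lease behavior is easier to see with a couple of made-up machines. This is only a sketch: the 8-hour lease, the client names, and all the timestamps are assumptions for illustration, not values from this network. It also ignores the fact that real clients try to renew partway through the lease.

```python
from datetime import datetime, timedelta

# Assumed lease length; the real scope setting is not stated in this post.
LEASE = timedelta(hours=8)


def has_valid_lease(last_renewed, now, lease=LEASE):
    """True while the lease obtained at last_renewed has not yet expired."""
    return now < last_renewed + lease


# Hypothetical clients and the last time each obtained or renewed a lease.
last_renewed = {
    "left-running": datetime(2013, 1, 15, 7, 0),
    "shut-down-overnight": datetime(2013, 1, 14, 16, 30),
}

now = datetime(2013, 1, 15, 9, 30)  # hypothetical moment during an outage
for name, renewed in last_renewed.items():
    state = "still has an address" if has_valid_lease(renewed, now) else "lease expired"
    print(f"{name}: {state}")
```

With these numbers, the machine that stayed on overnight keeps its address through the outage, while the one that was shut down the previous afternoon comes back with an expired lease and nobody to ask for a new one.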
I would argue that the file server should be entirely separate from the DC, but the current configuration isn’t uncommon for small business networks. In many cases, *every* role is filled by a single DC with no BDC. But in an office this large, you want failsafes because downtime is expensive.
The other issue is security. The server room isn’t locked, so when I deactivate a server for good reason, there’s nothing preventing a well-intentioned individual from turning it back on again.
In any case, the first order of business was to replace the BDC. The network is still running mostly Server 2003, but I put in an additional Server 2012 machine, which filled the role nicely. The benefit of this change is that now, if the DC goes down, users can still log in to their workstations. I haven’t yet taken the step of moving the file server from the DC, but I have duplicated the data to a second location, and I may simply remove the DC role from that server altogether and choose a new DC.
I set up the new BDC to host a secondary zone for DNS (Domain Name System). DNS basically maps friendly, memorable names to addresses (both static and those assigned by DHCP): “DC1” resolves to 10.0.0.2, et cetera. This addition means that even if the DC goes down, users can still resolve local network names to addresses, and can also still access the internet.
The last piece of the puzzle that I need to solve is DHCP. With Server 2003, there’s no such thing as “Hot Standby” for DHCP, and you can’t have two DHCP servers competitively handing out addresses from the same pool on the same network without causing even more headaches. So for now a single DHCP server is supporting the network, and it’s on the DC. What I can do is split the address pool in half and assign each half to a different DHCP server. For example I can allow DHCP1 to hand out addresses from 10.0.0.51 to 10.0.0.150, and then set DHCP2 to assign from 10.0.0.151 to 10.0.0.250. There are only about sixty DHCP clients on the entire network, so either DHCP server can handle the entire load AND have plenty of room for more, and if either one goes down, things will continue to run like normal.
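The split itself is simple enough to check in a few lines. This is a sketch using Python’s standard ipaddress module, with the range and client count from the paragraph above. The function name split_pool is made up for illustration, and none of this is a DHCP server configuration, just the arithmetic behind one.

```python
from ipaddress import IPv4Address


def split_pool(first, last):
    """Split an inclusive address range into two non-overlapping halves."""
    start, end = int(IPv4Address(first)), int(IPv4Address(last))
    if end <= start:
        raise ValueError("range must contain at least two addresses")
    mid = start + (end - start) // 2
    return (
        (IPv4Address(start), IPv4Address(mid)),
        (IPv4Address(mid + 1), IPv4Address(end)),
    )


CLIENTS = 60  # "about sixty" DHCP clients, per the text above

for label, (lo, hi) in zip(("DHCP1", "DHCP2"), split_pool("10.0.0.51", "10.0.0.250")):
    size = int(hi) - int(lo) + 1
    verdict = "covers" if size >= CLIENTS else "cannot cover"
    print(f"{label}: {lo} to {hi} ({size} addresses, {verdict} all {CLIENTS} clients)")
```

The point of checking the size of each half is the failover claim: each server only survives the loss of the other if its own half is big enough for every client on the network.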
However, the network addressing is a bit of a mess. Static (manually set) addresses are all over the network range with no logic to it. Some printers are accessed directly by IP address by a few non-conformists, and even some mapped network drives are by IP to some of these servers. The office manager sometimes logs into the DC to create new users, and he does this by IP — and it’s been the same IP for years. So that makes this a behavior issue. The address range will need an overhaul – putting all the static addresses in the 10.0.0.1-10.0.0.50 range and letting DHCP handle the rest. The downside is that anyone incorrectly connecting to resources by IP instead of by DNS or UNC path will experience some short-term grief. It’s not ideal, but it needs to be done. The alternative is a messy DHCP pool and a network with no logic to it, and that’s just asking for more trouble. The lesson here is that technology is not always the ideal solution — sometimes it has to be a change in behavior.
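Before renumbering anything, it helps to know which statically addressed devices sit outside the planned static range. Here is a minimal sketch of that audit. The 10.0.0.1 to 10.0.0.50 range and the DC1 address come from this post, but the other device names and addresses are invented for illustration. In practice the inventory would come from your own documentation.

```python
from ipaddress import IPv4Address

# Planned static range from the overhaul described above.
STATIC_FIRST = IPv4Address("10.0.0.1")
STATIC_LAST = IPv4Address("10.0.0.50")


def misplaced_statics(devices):
    """Return names of statically addressed devices outside the static range."""
    return sorted(
        name
        for name, addr in devices.items()
        if not STATIC_FIRST <= IPv4Address(addr) <= STATIC_LAST
    )


# Hypothetical inventory; only DC1's address appears in this post.
devices = {
    "DC1": "10.0.0.2",
    "printer-frontdesk": "10.0.0.187",
    "nas-old": "10.0.0.64",
}

print("Needs renumbering:", ", ".join(misplaced_statics(devices)))
```

Anything this flags is a device that someone may be reaching by IP, so it doubles as a list of people to warn before the change.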
I spent half of today trying to track down the server hosting one of my clients’ websites. The business and website have been around for so long, and the site maintenance and development have changed hands so many times, that the business owners have no idea who hosts their website, nor whose account controls it. The world won’t end without this information, but until I get it, their website can never be updated, and I can’t implement any branded email for that business. Basically, nothing can change.
I’ve preached the gospel of documentation before, but for you fellow IT technicians, I thought I’d give some tips.
In my experience, those in the IT field tend to have stronger math skills than language skills, so technical documentation doesn’t come naturally to most. But the fundamental importance of documentation cannot be overstated. And although technical writing is a career in itself, it is a skill that should be cultivated by every IT pro.
Servers are probably the easiest place to start. In a busy IT environment, a lot can change in a short period. To avoid the impression that you’re laying track while trying to catch a moving train, start with something that doesn’t necessarily change so quickly.
Clean Up First
Sometimes businesses grow so fast that the infrastructure doesn’t have time to be tidy. One client that comes to mind had infrastructure so overly complex that the only way I could even fathom beginning to write documentation for it was to start consolidating things. There were four different internet connections and gateways, each with their own configuration. I replaced those with a single, 4-WAN-port router. There were also two different fax servers and two NAS devices, none of which worked properly. I culled the herd until the infrastructure made sense again, and from there I could rebuild.
Ask Questions Like Your Life Depends On It
The devil is in the details. Capture everything you can. And I mean everything – even down to the date of deployment and hard drive serial numbers. That can save a lot of time down the road to determine warranty status, or to schedule hardware upgrades. If there are any quirks or special considerations for a particular piece of hardware, include that in the notes. A proprietary piece of software? Include it. Does it have RAID? How is it configured? Is the drive partitioned? How? Is it backed up? When and how? Are notifications sent out upon success/failure? To whom? Most importantly, what roles does the server fill? Domain Controller? File, web, or database server? DHCP? DNS? VPN? E-Mail? Fax? Are there shared folders? Where is the original install media? Does it have a static or dynamic address?
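One way to keep yourself honest about those questions is to turn them into a fixed record and let the blanks show. The sketch below is just one possible shape, not a standard. The class name, the field names, and the unanswered helper are all made up for illustration, and the example values are loosely based on the single-server setup described earlier.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class ServerRecord:
    """One server's documentation, mirroring the questions above."""
    hostname: str
    deployed_on: Optional[str] = None        # date of deployment
    drive_serials: List[str] = field(default_factory=list)
    raid_config: Optional[str] = None        # e.g. level and layout, if any
    backup_schedule: Optional[str] = None    # when and how it is backed up
    notify_on_backup: List[str] = field(default_factory=list)
    roles: List[str] = field(default_factory=list)
    install_media_location: Optional[str] = None
    addressing: Optional[str] = None         # static (with the IP) or dynamic
    notes: str = ""                          # quirks, proprietary software


def unanswered(record):
    """List the fields that are still empty for this server."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Hypothetical example entry.
dc1 = ServerRecord(
    hostname="DC1",
    roles=["Domain Controller", "File", "DHCP", "DNS"],
    addressing="static 10.0.0.2",
)
print("Still to document for DC1:", ", ".join(unanswered(dc1)))
```

The same fields work just as well as headings in a word processor template. The value is in having the same questions asked of every device, not in the tool.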
Bottom line is, an outsider (assuming all random outsiders are also IT technicians) should be able to walk in off the street and know, at a glance, every practical nuance of the device in question. Because frankly, that random outsider could be you. Sometimes a server could be a champ and work without any intervention for ages, and then POOF. It breaks, and you know nothing about it because you never had to.
Avoid the proverbial egg on your face with comprehensive documentation.
Software & Licenses
This can be a long list, depending on the size of the business in question. The types of clients I work with are comparatively small – anywhere from two to fifty users. But even small businesses typically have a surprisingly long list of software that they frequently use. If they use accounting software, one would hope that they keep a record of their correspondence with the vendor, which typically includes the key codes needed to activate the software. I would argue that any and all codes should be kept in a unified IT document as well.
The kinds of information captured here are purchase dates, license codes, license restrictions, registration information, system requirements or dependencies, and also vendor information.
My documentation will sometimes refer to software user manuals, and I try to also identify where each manual can be found, to save time sleuthing around the internet.
Include Contact Info
Yours, the client’s, everyone’s. Include the phone company. The ISP. The web host. The domain registrar. The go-to electrician. Anyone who either you or the client might need to contact. It’s not enough that the info is on your speed dial. That doesn’t do the client any good after you were hit by a bus.
Network diagrams are perhaps one of the biggest time savers. Some networks are simple: a router with many clients. Some are not. Cover your bases. The definitive diagram designer application for Windows is Microsoft Visio, and for Mac it’s Omnigraffle. There are other alternatives. Diagrams aren’t limited to complex networks either. Maybe the client has clustered servers or a co-location. Or maybe they utilize server virtualization. Diagrams are always helpful. That’s why instruction manuals are usually full of them.
Time Is Money
You may be sensing a theme here. IT documentation should be designed to save time. As a reference document, it saves your precious time, and makes you look professional, and more importantly, proactive. If anything were to happen to you, it’s also a form of insurance for the client, because now at least some of your extensive knowledge can be recovered or shared. If you work for a company with many technicians, then all the better. Your fate is no longer tied with that one client. You can send the new guy, if he’s up for it.
Update It Regularly
Regularly is relative. In some cases, collecting all of this information is done passively over a very long period. The last one I finished took roughly six months. The important thing is that you update it as you make changes, and audit it, say, once a year.
IT infrastructure is a living breathing beast that grows and expands with the business. Documentation should grow and expand with it, or it quickly becomes useless.
Use A Template
Save yourself time in the future by making a template in whatever word processor you choose. The information you collect is largely the same no matter who you’re doing it for.
Up until recently I used a Pages template (yes, on a Mac), because hey, it’s pretty. But now I use Scrivener, which is what you might call a document project management tool. At its heart, it’s a word processor, but with a few new tricks. How to use Scrivener is way outside the scope of this document, but suffice it to say, it’s perfect for technical writing. And for those of you who already use Scrivener, I’ve included a template that you’re welcome to use.
I was inspired to do this post by Sharon Bennett (@bennettbusiness) with her article “My IT Guy was Hit by a Bus!”, which resonated with me because my early mentors used the same metaphor to drive home the importance of – yes – documentation. Her post included a great example template that looks to have been written in MS Word, which I’ve included a link to as well.
Her articles are really well presented, and when I read them, I find myself nodding in agreement. She does a far better job of speaking to the Small Business audience than I ever have. Check out her site here: http://bennettbusinessconnections.com
Do you have any experiences where proper documentation saved you from disaster? Have your own templates or methods for documentation? Something important that I missed? Let me know in the comments!
Not to beat a dead horse, but I wanted to be clear on the importance of IT documentation. To date, I’ve yet to start with a client who already has it. In some cases I’ve been able to force the previous IT support vendor to put something together from their records (always an awkward conversation), but even then it’s been so inaccurate that it is virtually worthless.
I wish I remembered who first told me “document it, or it doesn’t exist”. It might have been my boss when I used to work as a systems admin at a local software developer company, as that mantra is more common with programmers. But of all the advice I’ve been given in my career, that one stands out the most.
Ironically, when I went to technical college, we were taught how to tear down hardware and build it up again, how not to electrocute ourselves, and how to diagnose both hardware and software issues with computers. But what was woefully absent was any kind of documentation training, which is a real shame. But to be frank, I think this really separates the professionals from the amateurs.
Francis Bacon once famously said “Knowledge is power”. With IT documentation, you harness the power to make informed decisions about your infrastructure.
I’ve planned for some time to write a series of blog posts featuring the IT tool kit and apps I have at my disposal. Apparently today is the day I start.
Being über mobile wasn’t a conscious decision; it was a reaction to the market demand as I saw it. There are two main types of technology support service: drop-off and on-site. If you have a computer at home, it’s not generally seen as too much of a hassle to unplug everything and take it to a service depot. A busy office doesn’t necessarily have the same luxury of time or manpower. One computer out of commission costs them significant productivity. Assigning someone to drop off a computer at a tech shop and pick it up again only compounds the issue. Worse, if the issue is widespread, as with network issues, it’s not something you can pack into your car and drive somewhere.
So I resolved to give businesses the white glove treatment (as opposed to the rubber glove). Complimentary pick up, drop off; active on site and remote support.
So without further delay, here is part 1 of my IT tool kit feature, What’s In The Bag
I’d discovered Kato through ThinkGeek but purchased directly from the Hazard 4 website. The best part about it was the adaptability of it, and that there were dedicated pockets for both a tablet and a netbook. Specifically, it was said to accommodate the iPad and an 11″ MacBook Air perfectly. And so it does. So integral to my mobility is this bag, that I choose many tools based solely on whether they will fit inside.
Some of my colleagues highly recommend a larger camera bag which has greater capacity for more tools, but for the purposes of mobility, less is more.
Mid-2012 Apple MacBook Air 11″
This is my first MacBook, but not my first Mac. My previous (and only) two laptops were an Asus W3V, followed by a Dell XPS M1530. What I took away from my experiences with both models (and from all the laptops I serviced in the line of duty) was that laptops don’t live long and most aren’t very well made. Hauling around a laptop as one often does in my line of work, I learned quickly that holding it wrong could produce some frightening crackling sounds due to shoddy plastic flexing. So when I had the opportunity to buy the famed aluminum unibody notebook, I took it, and I’ve been completely satisfied with the result. This despite the fact that many of the tools I came to rely on were Windows only. That said, I’ve never been left wanting, and the tools I use now vastly outperform the ones I once used. Within days of my getting it, a friend was admiring it and accidentally dropped it on hardwood. Not even a scratch on it.
I chose the MacBook Air 11″ due to the ultra portability and was also influenced by the fact that it fit perfectly inside Hazard 4’s “Kato”. It’s also impossibly light. I can comfortably hold it with a couple of fingers. The one addition I made when ordering was to add 8GB of RAM, which in hindsight was a wise investment.
Apple iPhone 5
My first smartphone was a Blackberry Curve, but not by choice. I’d wanted an iPhone at the time, and when my contract was up, I picked up an iPhone 4 and later the iPhone 5. Why the iPhone? Several reasons. I like the uniformity. An iPhone is an iPhone. From model to model, I can know where the settings are and how to work them. Some argue that Apple’s devices are too restrictive, but to me that just makes them easier to work with. Harder to break. Not true of the Blackberry and certainly not of any Android phone. Don’t get me started about the Android. The only other smartphone OS I would consider would be the Windows phone. I like the direction Microsoft took, and they’re pretty slick.
The LTE on the iPhone 5 is actually faster than my home internet connection, if you can believe it. Bluetooth and WiFi tethering provide access to the internet on the rest of my devices.
I also use the camera extensively to capture things like serial numbers, error codes, and wiring to aid in technical documentation.
Plantronics Voyager Pro HD
The most recent addition to my mobile toolset, this is one of two Bluetooth hands-free devices that I use regularly. My old earpiece was a Motorola H500, but when I turned my head it would flop around too much. The Voyager fits comfortably in my ear, and as Ric said of the Voyager Legend model in his mobile IT tool kit feature, it pretty much destroys the competition.
Apple iPad (3rd Gen)
You may be sensing a theme here. I used to loathe Apple fanboys. It was a factor that held me back from having an Apple desktop for years. I was afraid I would become one of those smug “I’m a Mac” guys that nobody can stand. Say what you will about Apple, but they make a solid, trendsetting product.
At first, most mobile tools I used were available for the iPhone. At one point, though, the small (but beautifully crisp) display became too much of a burden. So in came the iPad. A few months more of this, though, and my Dell XPS M1530 packed it in, which was when I bought the MacBook Air.
I was told that once I bought the Air, I’d never use the iPad again, but this turned out to be patently untrue. Each has its distinct uses and is a vital part of my toolset.
I received this kit along with something I purchased for my Mac Mini. For a “bonus” kit, it’s amazing how much it covers. With standard bits, Torx, and even the super-rare tri-wing bit common in Wii consoles, I can open almost anything with this kit. It’s also the most compact kit I’ve encountered to date. As much as I’ve used it, I’ve miraculously managed not to lose any of the bits!
TechLite Lumen Master TE116
Crawling under desks and inside dim wiring closets is great and all, but even better with a good quality flashlight. I scored 3 of these babies at Costco for about $8 apiece.
Hardware deployments invariably start with sealed boxes. In a bind, I’ll sometimes MacGyver myself tools using paper clips or keys, but nothing speeds up unboxing like a tanto style knife. Except maybe a box cutter. But box cutters are only really good for cutting boxes. And maybe crafts. This bad boy is also ready to cut through seat belts in an emergency. For a computer guy, I use this a ton. There’s a spot on the Kato bag that it fits in perfectly, and my jeans have a great spot for it too.
I picked this up in response to all the presentations I attended (and performed) where the source was a laptop or iPad. Presenters came so prepared with multimedia content, but remembered only to bring the projector, and relied solely on the internal speakers to fill the room. Big mistake. Tablet and laptop speakers are better than nothing, but barely. This is a compact stereo speaker system that plugs in via the headphone jack and can easily fill a medium-sized room.
Western Digital 500 GB My Passport
I get called to businesses in response to emergencies, some of which are data related. If I suspect an impending system failure, I can plug this in and get a backup going right away. There are larger capacities available, but so far I find 500GB suits me just fine. The new models are USB3, and the best feature is that it both communicates over and is powered by the same single USB cable. No fiddling with extra wires or open power sockets required!
Corsair Flash Voyager Mini 32GB
Although technically not “in the bag”, this little guy is a huge part of my ability to be mobile, and a vital part of my IT tool kit. Loads of storage space, it’s rubber, shock-proof, hooks to my key ring, and the best part is, there’s no lid to lose! This is the issue I have with many other flash drives, including other offerings from Corsair. This key contains the majority of the software I use on-site.
Up next in Part II – the software I use on-the-go.