One of my larger clients has had some critical problems on their network. Twice in the last couple of months, the entire network failed. Users are unable to log in, print, browse the internet, or access the network file server. Of course, a problem of this scale can’t be explained by a simple “have you tried turning it off and on again?”. When it happens, the entire office grinds to a standstill, and because it is a large office, the downtime cost is substantial. Each time it happens, it can take up to a couple of hours to fix. 2 hours x 20+ unproductive admin staff = -$$$.
First, a quick disclaimer: I’ve had this client for a little over a year now, and it’s taken me all that time to untangle the mess I inherited. Virtually nothing was documented, and what was documented was inaccurate. Everything down to the wiring was incomprehensible. This particular issue has only come up in the last couple of months.
It should be noted that the domain controller and file server are the same device. This isn’t uncommon in smaller networks and this is how this office started out. A lot of the mess I’ve had to untangle can be attributed to rapid growth without proper management. A good argument for proper network design. But of course the implication of this single-server design is that if anything goes wrong with the domain controller, so too go the files.
The first time this happened, it took me a while to find the cause. Some users could reach the internet, some could not. Some could log in, some could not. Some could print and access servers, some could not. The DC (Domain Controller) was unresponsive. I could ping it, but could not log on locally. Shares were not accessible. Rebooting took forever, indicating a problem reaching the domain controller (in other words, it could not reach itself). I ran a repair of Directory Services and rebooted, which eventually helped to the point of being able to log in, and everything worked… for a few minutes, until the issue returned. I booted into safe mode and found a bunch of errors pertaining to the BDC (Backup Domain Controller). Knowing that users were unable to log in, it was fairly obvious that it wasn’t doing its job, and since it was scheduled to be decommissioned anyway, I turned it off. Then I rebooted the DC once again. Since last time it had seemed to work at first but halted after a few minutes, I waited to be sure. This time, it came back up and was stable. The conclusion, basically, was that the AD (Active Directory) database on the BDC had somehow become corrupted and was poisoning AD replication from the DC, thereby crippling it. There wasn’t much point in trying to fix the BDC since it was soon to be replaced anyway.
Since I’m still untangling how these servers are being used, I didn’t want to take the BDC away outright. I left it in place in case there was an undocumented role that it filled. There did turn out to be one, but it was easily replaced with a VM (Virtual Machine) that I created using Hyper-V.
About a couple of weeks ago, the same thing happened a second time: the same symptoms, and the DC again unresponsive. Since it all seemed so familiar, I pinged the static IP address of the BDC, and it responded right away. Somehow, the BDC had been turned back on. I’m still not sure how it was activated, or by whom, but there it was, giving me another massive headache. I went on site, made sure the BDC was off, repaired the AD database on the DC (the repair was necessary because the sync was somehow corrupting things again), reloaded DNS for good measure, and things were up again.
The manager was not pleased that this was causing issues yet again. I needed to find a more permanent solution to this problem ASAP.
So let me walk you through all the system failures involved here. To the end user, all they can determine is that “the network is down”. But there are actually several distinct but related system failures contributing to the outage.
The office workers don’t all arrive at the same time. Some turn their computers off at night; others leave theirs running but logged out; still others leave them running and logged in. When the DC goes down, nobody can contact it, which means nobody can log in to a domain account. Those who are already logged in can continue to use their computers as usual, apart from being unable to reach some network resources or the internet. Those who were logged off cannot log in, and are confused that some people nearby are logged in, so they report to the manager that it works for some people and not others (this causes much confusion, but as you can see it is easily explained). Those whose computers were turned off have the same issue, plus a potential additional problem, which brings me to the next system failure.
DHCP (Dynamic Host Configuration Protocol). Basically, this hands out addresses to all the network devices asking for one. Without an address, a device cannot speak to the network. Of course, this system is also configured on the DC, so when it goes down, so too goes the ability of devices to talk to one another. But some devices *do* still have addresses, further confusing end users. This is explained by DHCP’s lease time: the time before an assigned address expires and the device is expected to ask for a new one. So some user devices might have lease time left, while others’ leases may have expired in the middle of the server failure.
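The lease behavior above can be sketched as a toy model in Python (the 8-hour lease and the timestamps are illustrative, not the office’s actual settings):

```python
# Toy model of why some machines kept working during the outage: a DHCP
# client keeps its address until the lease expires, so a device whose
# lease still had time left stays on the network even with the DHCP
# server down.
LEASE_SECONDS = 8 * 3600  # hypothetical 8-hour lease

def lease_valid(granted_at: float, now: float) -> bool:
    """True while the address handed out at `granted_at` is still leased."""
    return now - granted_at < LEASE_SECONDS

outage_start = 100_000.0           # moment the DHCP server goes down
check_time = outage_start + 3600   # one hour into the outage

# Renewed an hour before the outage: still has roughly six hours left.
print(lease_valid(outage_start - 3600, check_time))      # True
# Granted seven hours before the outage: expired mid-outage, and there
# is no server to answer the renewal request.
print(lease_valid(outage_start - 7 * 3600, check_time))  # False
```

That difference in remaining lease time is exactly why the outage looked intermittent from desk to desk.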
I would argue that the file server should be entirely separate from the DC, but the current configuration isn’t uncommon for small business networks. In many cases, *every* role is filled by a single DC with no BDC. But in an office this large, you want failsafes because downtime is expensive.
The other issue is security. The server room isn’t locked, so when I deactivate a server for good reason, there’s nothing preventing a well-intentioned individual from turning it back on again.
In any case, the first order of business was to replace the BDC. The network is still running mostly Server 2003, but I put in an additional Server 2012 box, which filled the role nicely. The benefit of this change is that now if the DC goes down, users can still log in to their workstations. I haven’t yet taken the step of moving the file server off the DC, but I have duplicated the data to a second location, and I may simply remove the DC role from that server altogether and choose a new DC.
I set the new BDC up as a secondary zone for DNS (Domain Name System). DNS basically resolves friendly, memorable names to addresses (both static and DHCP-assigned): “DC1” becomes 10.0.0.2, et cetera. This addition means that even if the DC goes down, users can not only resolve local network names to addresses, but also access the internet.
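To make the name-to-address lookup concrete, here is a minimal sketch in Python. The `resolve` helper is mine, and “localhost” stands in for a name like “DC1”, which would only resolve on the office network:

```python
import socket

def resolve(name: str) -> str:
    """Resolve a hostname to an IPv4 address, or report failure."""
    try:
        return socket.gethostbyname(name)
    except socket.gaierror:
        return "unresolved"

# "localhost" resolves on any machine; on the office network, a query
# for "DC1" would go to the DC's DNS first, and to the BDC's secondary
# zone if the DC were down.
print(resolve("localhost"))    # 127.0.0.1 on most systems
print(resolve("dc1.invalid"))  # "unresolved" (the .invalid TLD never resolves)
```

With only a primary zone, a DC outage would make every lookup behave like that second case, which is why the secondary zone matters.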
The last piece of the puzzle that I need to solve is DHCP. With Server 2003, there’s no such thing as “hot standby” for DHCP, and you can’t have two DHCP servers competitively handing out addresses from the same pool on the same network without causing even more headaches. So right now a single DHCP server is supporting the network, and it’s on the DC. What I can do is split the address pool in half and assign each half to a different DHCP server. For example, I can allow DHCP1 to hand out addresses from 10.0.0.51 to 10.0.0.150, and set DHCP2 to assign from 10.0.0.151 to 10.0.0.250. There are only about sixty DHCP clients on the entire network, so either DHCP server can handle the entire load AND have plenty of room for more, and if one goes down, things will continue to run like normal.
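A quick sketch of that split-scope arithmetic, using the hypothetical server names and ranges above:

```python
import ipaddress

# Split-scope layout for the office subnet: each server owns a disjoint
# half of the dynamic pool, so neither can hand out an address the other
# might also assign. Ranges are the illustrative ones from the plan.
SCOPES = {
    "DHCP1": (ipaddress.IPv4Address("10.0.0.51"), ipaddress.IPv4Address("10.0.0.150")),
    "DHCP2": (ipaddress.IPv4Address("10.0.0.151"), ipaddress.IPv4Address("10.0.0.250")),
}

def pool_size(scope: tuple) -> int:
    """Number of addresses in an inclusive (low, high) range."""
    low, high = scope
    return int(high) - int(low) + 1

for name, scope in SCOPES.items():
    print(name, pool_size(scope), "addresses")  # 100 addresses each

# Either half alone comfortably covers the ~60 clients on the network.
```

The key property is that the two ranges never overlap, so there is no lease conflict even though both servers answer requests on the same subnet.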
However, the network addressing is a bit of a mess. Static (manually set) addresses are scattered across the network range with no logic to them. Some printers are accessed directly by IP address by a few non-conformists, and even some mapped network drives point to these servers by IP. The office manager sometimes logs into the DC to create new users, and he does this by IP, which has been the same for years. So that makes this a behavior issue. The address range will need an overhaul: putting all the static addresses in the 10.0.0.1-10.0.0.50 range and letting DHCP handle the rest. The downside is that anyone incorrectly connecting to resources by IP instead of by DNS name or UNC path will experience some short-term grief. It’s not ideal, but it needs to be done. The alternative is a messy DHCP pool and a network with no logic to it, and that’s just asking for more trouble. The lesson here is that technology is not always the ideal solution; sometimes it has to be a change in behavior.
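As part of the overhaul, a tiny helper like this (my own, purely illustrative) can sanity-check which addresses belong in the static block:

```python
import ipaddress

# Planned convention: statics live in 10.0.0.1-10.0.0.50, everything
# above that belongs to DHCP. The helper name is hypothetical.
STATIC_LOW = ipaddress.IPv4Address("10.0.0.1")
STATIC_HIGH = ipaddress.IPv4Address("10.0.0.50")

def should_be_static(ip: str) -> bool:
    """True if this address falls in the reserved static range."""
    addr = ipaddress.IPv4Address(ip)
    return STATIC_LOW <= addr <= STATIC_HIGH

print(should_be_static("10.0.0.2"))    # True  (the DC's address)
print(should_be_static("10.0.0.120"))  # False (dynamic pool territory)
```

Run against an export of current address assignments, a check like this flags every device that will need to move before the new DHCP scopes go live.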