I visited a school district recently that experienced a complete network failure. Through some troubleshooting, the tech was able to narrow the culprit down to something happening at the high school.
This article discusses how to troubleshoot a local area network when something like this happens. The images are for reference only, but they should give you an idea of how to at least START the troubleshooting process, and hopefully they will help someone out there find the demon that shut down the network.
First, let me talk about how the tech narrowed down the issue to one building. All the connections from campus come into one data closet. When his users started complaining that they could no longer get online or use their email application, the tech went to the closet and unplugged one connection at a time until the problem went away (a smaller scale will be discussed below). Once he determined the problem went away when he unplugged the high school, he knew where to start looking for the problem. This is where I came in (with an extra set of hands/eyes to help me).
At the high school, I unplugged all the connections in the tech closet. I then only connected a nearby computer and the connection back to the central tech closet (which leads to the Internet):
The little black screen in the upper left corner is a simulated PING test. I ran a PING test on the computer out to the DNS server we use. You could also PING something like YAHOO.COM, if you so choose.
How do I run a PING test? Do this:
- In Windows XP, click START, then RUN, then type CMD and press ENTER. In Vista, click the WINDOWS BUTTON, then type CMD in the search box and press ENTER.
- At the prompt now on the screen, type PING YAHOO.COM and press ENTER (or use whatever address you need/want, so long as that address LETS you ping. Some do not).
- You will get one of two BASIC responses: either "Reply From..." or "Request Timed Out." "Reply From..." is a successful ping (you can reach the site you are trying to ping). "Request Timed Out," however, is a FAILED ping. It means your computer could not reach the computer you tried to ping. This is useful for troubleshooting!
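If you would rather have a script read the results for you, here is a minimal Python sketch of the same pass/fail reading. It only looks at text that the Windows ping command has already printed. The function name and the sample lines (including the 192.0.2.1 address) are made up for illustration. One assumption worth calling out: Windows can also print a "Reply from ... Destination host unreachable" line, which starts like a success but is really a failure, so the sketch treats it as one.

```python
def classify_ping_line(line):
    """Return "ok", "fail", or "other" for one line of Windows ping output.

    Hypothetical helper for illustration; it only inspects the text.
    """
    text = line.strip().lower()
    if text.startswith("reply from"):
        # "Reply from x: Destination host unreachable." is NOT a success.
        if "unreachable" in text:
            return "fail"
        return "ok"
    if "request timed out" in text:
        return "fail"
    # Headers, blank lines, and statistics fall through to here.
    return "other"


# Illustrative sample lines, not captured from a real network.
samples = [
    "Reply from 192.0.2.1: bytes=32 time=14ms TTL=117",
    "Request timed out.",
    "Pinging yahoo.com with 32 bytes of data:",
]
for s in samples:
    print(classify_ping_line(s), "<-", s)
```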
Okay, so what if your one computer and one connection back to the Internet fails the ping test? At that point, you know you have a problem in the devices that run between the two sites (in this case, between the main closet and the high school).
Since we were able to ping the world, we are back to troubleshooting the network. Once you confirm that your network works with just one computer, you plug ONE of the connections on your network back in:
I prefer to always pick one port to use for ALL the remaining tests. That way, you are limiting the possible problems down to one actual "leg" of your network. As you can see above, I took the wire from "Connection 1" and plugged it into port 5. In our case, I have no idea where "Connection 1" goes on the high school network. And for now, I don't even care. My goal is to find the link that is crashing the network. Once I hooked up "Connection 1," I ran the PING test again on the computer nearby. Now, what I really did was set the ping test to run indefinitely. The command for that is: PING -t YAHOO.COM
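For anyone who wants to launch that never-ending ping from a script, here is a rough Python sketch. The function names are hypothetical. It assumes the ping program is on the PATH, that Windows needs the -t switch to keep going, and that ping on Linux and macOS keeps running until interrupted without any extra switch.

```python
import platform
import subprocess


def continuous_ping_command(host, system=None):
    """Build the command for a ping that runs until you stop it."""
    system = system or platform.system()
    if system == "Windows":
        return ["ping", "-t", host]  # -t: keep pinging until interrupted
    return ["ping", host]  # Linux/macOS ping runs until interrupted by default


def watch_ping(host):
    """Print ping output line by line until CTRL+C is pressed."""
    proc = subprocess.Popen(
        continuous_ping_command(host),
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        for line in proc.stdout:
            print(line.rstrip())
    except KeyboardInterrupt:
        pass  # the person at the keyboard decided to stop watching
    finally:
        proc.terminate()
```

Calling watch_ping("yahoo.com") would then scroll the replies while you swap cables, and CTRL+C stops it.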
In the example above, I am able to ping the outside world. At the high school, I was able to get to the outside world.
In the next example, I UNPLUG "Connection 1" and plug "Connection 2" into the SAME PORT that I used for the first test. Again, this is because I know that port works now that it has passed the first test.
Once "Connection 2" is hooked up, I check my PING test to be sure I do not have failure messages. It's important to stop here a moment to say this: You will most likely have *SOME* failures. One or two over the course of dozens is not too bad. What we are looking for is a complete failure where you get "Timed Out" error after "Timed Out" error - many in a row.
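The "one or two is fine, many in a row is not" rule can be sketched in Python as counting the longest streak of timeouts. This is only an illustration: the function names are made up, and the threshold of 5 in a row is my own assumption, not a number from the troubleshooting session.

```python
def longest_timeout_run(lines):
    """Return the longest streak of consecutive "Request timed out" lines."""
    longest = 0
    current = 0
    for line in lines:
        text = line.strip().lower()
        if "request timed out" in text:
            current += 1
            longest = max(longest, current)
        elif text.startswith("reply from"):
            current = 0  # a good reply breaks the streak
    return longest


def looks_like_outage(lines, threshold=5):
    """threshold is an assumed cutoff for "many in a row"; tune to taste."""
    return longest_timeout_run(lines) >= threshold


# Illustrative output: a single dropped ping among good replies.
reply = "Reply from 192.0.2.1: bytes=32 time=14ms TTL=117"
timeout = "Request timed out."
occasional = [reply, reply, timeout, reply, reply, timeout, reply]
print(looks_like_outage(occasional))
```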
As long as the PING test passes, you repeat the procedure for each of your connections: disconnect the wire, plug in the next wire, check the PING.
So, what happens when you get a FAIL?
Uh-oh! We failed our PING test!! Now, we know where the problem is. Or at least we know which part of the network is causing the problem.
What I did at this point at the high school was to unplug the error-causing wire, and check the PING. Everything looked good. I plugged the offending wire back in, and checked the PING - FAIL! I did this because I wanted to be sure it wasn't some fluke. It was not.
Now, I unplugged the errant wire again. Then, I plugged the other wires into the switch. I checked the PING to make sure that everything was working while all the other wires were plugged in. Everything worked.
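The whole routine so far (test each wire alone on the same port, double-check any failure, then plug the good wires back in together) can be written down as a small Python sketch. Nothing here touches real hardware: ping_ok_with is a hypothetical stand-in for a person plugging in a set of wires and reading the ping window, and "Connection 3" is an invented label for the bad wire.

```python
def find_bad_connections(connections, ping_ok_with):
    """Walk the one-wire-at-a-time procedure.

    connections:  list of wire labels.
    ping_ok_with: callable taking the set of wires currently plugged in
                  and returning True if the ping test passes (hypothetical).
    Returns (suspects, others_ok_together).
    """
    suspects = []
    for wire in connections:
        if ping_ok_with({wire}):
            continue  # this leg is fine, move on to the next wire
        # Rule out a fluke: unplugged should pass, plugged back in should fail.
        if ping_ok_with(set()) and not ping_ok_with({wire}):
            suspects.append(wire)
    good = set(connections) - set(suspects)
    # Final check: everything except the suspects, all plugged in at once.
    return suspects, ping_ok_with(good)


# Simulated network where one wire (an invented label) kills the ping.
wires = ["Connection 1", "Connection 2", "Connection 3", "Connection 4"]
fake_ping = lambda plugged: "Connection 3" not in plugged
print(find_bad_connections(wires, fake_ping))
```

Testing one wire at a time on a known-good port keeps each result tied to exactly one leg of the network, which is the point of the procedure.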
Our next task is to find out where the errant wire goes and discover which devices on that part of the network might be causing the problem. I believe we will find a situation where someone plugged both ends of a network cable into one switch. Basically, it's taking one end and plugging it into, say, port 5, and then taking the other end of the same cable and plugging it into port 7. The actual port numbers don't matter. What matters is that it creates a loop. Ever heard feedback when someone is speaking with a microphone because they get too close to the speaker system? Same idea, but with data spinning out of control in an infinite loop. It is not pretty, and it brings many networks to a screeching halt because nothing else can get through while all this other data is going round-and-round.
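To see why one short cable can do so much damage, here is a toy Python model of a single broadcast frame on one switch. It is a deliberately simplified sketch: it assumes an unmanaged 8-port switch with no spanning tree protection, it ignores timing and bandwidth, and the function name is made up. Ports 5 and 7 are the same example ports used above.

```python
def simulate_broadcast(ports, loop_cable=None, rounds=5):
    """Count how many copies of ONE broadcast reach other ports each round.

    ports:      port numbers on a single switch (no spanning tree assumed).
    loop_cable: optional (a, b) pair, one cable joining two ports.
    """
    peer = {}
    if loop_cable:
        a, b = loop_cable
        peer[a], peer[b] = b, a

    arriving = [ports[0]]  # the frame first enters on the first port
    delivered_per_round = []
    for _ in range(rounds):
        next_arriving = []
        delivered = 0
        for ingress in arriving:
            for egress in ports:
                if egress == ingress:
                    continue  # never flooded back out the port it came in on
                if egress in peer:
                    # Travels down the loop cable and re-enters the switch.
                    next_arriving.append(peer[egress])
                else:
                    delivered += 1  # lands on whatever is plugged in there
        delivered_per_round.append(delivered)
        arriving = next_arriving
    return delivered_per_round


ports = list(range(1, 9))
print(simulate_broadcast(ports))                     # no loop
print(simulate_broadcast(ports, loop_cable=(5, 7)))  # cable from 5 to 7
```

Without the loop, the frame is flooded once and then it is gone. With the cable between ports 5 and 7, two copies keep re-entering the switch and get flooded again every round, and that is from a single broadcast. Real networks send broadcasts constantly, so the junk piles up fast.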
Once we are able to hunt down the offending equipment, we take whatever steps we need to take in order to cure the problem. For example, if it is an issue where a wire is plugged back into the same switch, we unplug the wire. If it turns out to be an infected computer, we get it off the network and heal it.
One thing that should still be done is another PING test. The difference at this stage is that you do not have to unplug everything else to run the test. All you have to do is plug in the wire that was causing the problem. If the problem persists, unplug it and hunt down another culprit. If everything works after you plug in the wire, then you have cured your network of the offending loop/computer!! Wahoo!
Cool post, glad I was there to help! John
Yup, happens to us all the time. Our co-op tech coordinator had an elementary school he is totally rewiring (originally wired with Cat 3 16 or so years ago). While doing this, we're running a minimum of two network ports in each classroom. Several of these classrooms aren't being used at the moment, so we're putting in the ports, testing each line, but leaving only one jumper cable to the room.
School calls and says their brand new network, with gig switches and Cat 6 cabling, is down. So, the co-op tech guy goes into the room with the switches for this campus, and wow, look at that Xmas tree! Network storm! He proceeds to calmly unplug cables from the switch, one at a time, until one unplugging results in the lights settling down and winking instead of the full Osborne.
Read the room description off the cable and index in the server room. Proceed to that room, to find the jumper cable plugged into BOTH network ports. Problem solved.
Yep, few of our schools in our area are set up with managed switches. I figure we're moving that way eventually though, once the price becomes more feasible for our smaller districts (all of ours are small).