Network Audit - Tutorial?
A while ago some of you guys wanted to know how a network audit was performed or how my colleague and I tackled this for a subdivision from a big firm here in Belgium.
Well, as I have been quite busy with this and other things, it took a while, but I finally managed to write something down. It is a bit crude, but it might be considered a tutorial. I'm not sure about that, so I'll let you guys be the judge.
Here it goes:
At the request of “a big firm”, we carried out a network audit of specific VLANs in an existing LAN in one of their subdivisions. This check-up gave an overview of the critical components and an idea of the bandwidth utilization. The performance of the checked VLANs was analyzed in detail, which gave an idea of the data flow, errors and load on the switch ports.
Alternatives or improvements would be presented to improve the overall performance of the monitored subnets. The scope of the audit did not include the complete network or the wide area network (WAN) performance.
To get a clear picture of the hierarchical network topology, we redrew the network layout in Visio and added it to the final report.
The LAN of this subdivision is built with different types of Cisco switches: Cisco 6009 and Cisco 6006 chassis in the core, connected to Cisco 35xx and 1924 switches in the access layer.
- Excerpt from the report -
The 6-slot Catalyst 6006 & 9-slot Catalyst 6009 chassis: deliver high-performance, multilayer switching solutions for enterprise and service provider networks. Backplane up to 32 Gbps.
WS-X6K-SUP1A-2GE: the Supervisor 1A offers layers 3-7 services via two daughter cards, the Policy Feature Card (PFC) and Multilayer Switch Feature Card (MSFC). The PFC offers enhanced services like server switching and QoS, while the Multilayer Switch Feature Card (MSFC) provides multiprotocol routing. Supervisor 1A provides enhanced performance and support for a broad range of services.
WS-6148-RJ-45/WS6348-RJ-45: 48-port 10/100 module, field upgradeable to Cisco Inline Power or IEEE 802.3af, RJ-45. Cisco Catalyst 6500 Series 48-Port 10/100 RJ-45 Classic Interface Module; upgradeable to support Cisco Prestandard PoE Daughter Card or to IEEE 802.3af PoE daughter card.
WS-6408A-GBIC: The Catalyst 6500 8-port Fabric-enabled GBIC-based Gigabit Ethernet Module, providing integrated Cisco Express Forwarding (CEF), is designed for high-density Gigabit core aggregation. Features: Interface: 1000BASE-SX, 1000BASE-LX/LH, 1000BASE-ZX
The Catalyst 3548/3524: members of the Catalyst 3500 Series XL Ethernet switches; fixed-configuration, stackable switches that provide wire-speed Fast Ethernet and Gigabit Ethernet connectivity for midsized networks and the access edge.
The Catalyst 1924F: Provides 24 10BaseT ports and two 100BaseFX ports in a compact, single rack unit chassis.
- end excerpt -
This is just to give you an overview of the type of network devices present in this firm.
After this we analyzed the VLANs in full detail.
Several VLANs were in use in the LAN of the subdivision. To perform the audit, we used a laptop with analysis software installed (Fluke Networks OptiView) and placed it in vlan1, where it gathered all the info from the devices active in that VLAN.
A second laptop was placed in another VLAN (vlan2), doing the same thing.
This produced a summary like the following:
IP Subnets :
10.0.x.x/255.255.255.x (1 device); 10.101.x.x/255.255.255.x (1 device);
10.101.x.x/255.255.255.x (1 device);
10.101.x.x/255.255.255.x (+/- 2 devices); 10.101.x.x/255.255.255.x (1 device);
IPX Networks : None detected
NetBIOS Domains : COMMON (273 hosts); XXX (2 servers); XXX (1 host); XXX (5 hosts, 2 servers); XXX(2 hosts); WORKGROUP (2 hosts, 3 printers); WORKSTATIONS (2 servers)
VLAN1 : Hosts detected : +/- 329
Printers detected : +/- 44
Servers : +/-13
VLAN2 : Hosts detected : 2
Printers detected : 1
Access Points : 1
Servers : 2
...and so on.
Next we gave them an overview of the active servers and services on their network.
We listed the servers by name (if possible) or otherwise by IP address, along with the active services on these servers (like DHCP, DNS, print server, etc.).
(To their surprise, though, there was a server they “forgot” about … hmmm, strange, what do you mean you forgot about that server :D )
That done, we could get to the neat stuff … the network performance measurements.
For the network performance measurements we used the Fluke OptiView Console in the specific VLANs of the LAN. This tool gives an overview of utilization, broadcasts, collisions and errors.
Errors like ‘Interface error rate exceeded error threshold’, ‘Interface utilization exceeded warning/error threshold’ and ‘Key device not responding’ needed to be analyzed in depth.
The problem log of the software also contained other items like ‘Only device in NetBIOS domain’, ‘New device on the network’, ‘NetBIOS name change’ or ‘IP address changed’. These could be important in some areas of the network.
The most frequent errors were:
- Interface Error rate Exceeded Error Threshold
- Interface Utilization Exceeded Error Threshold
- Key device not responding
- Overlapped subnet mask
- Router interface has come up/ has gone down
- And to a lesser extent: Duplicate IP address
To elaborate a bit on these errors:
- Interface Error rate Exceeded Error Threshold:
This error indicated that, on the device associated with it, the error rate on the interface had exceeded the globally set threshold (1% by default) or the individual interface threshold for more than 2 minutes.
By default, the traffic counted as “errored” includes CRC (frame check sequence) errors, alignment errors, fragments, short frames and jabber frames. The symptoms were slow network performance due to lost packets and frame retransmissions.
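The "above the threshold for more than 2 minutes" rule can be sketched roughly like this. It is only an illustration and not how the OptiView software actually implements it: the sample format (timestamp in seconds, errored frames, total frames per polling interval) and the function name are assumptions of mine.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical sample format: (timestamp_s, errored_frames, total_frames)
Sample = Tuple[float, int, int]


def error_threshold_exceeded(samples: Iterable[Sample],
                             threshold: float = 0.01,
                             hold_seconds: float = 120.0) -> bool:
    """Return True if the error rate stays above `threshold` (1% by default)
    for more than `hold_seconds` (2 minutes by default) without a break."""
    exceeded_since: Optional[float] = None
    for timestamp, errored, total in samples:
        rate = errored / total if total else 0.0
        if rate > threshold:
            if exceeded_since is None:
                exceeded_since = timestamp  # start of the bad streak
            if timestamp - exceeded_since > hold_seconds:
                return True
        else:
            exceeded_since = None  # streak broken, start over
    return False
```

A single bad poll does not trigger anything here; only an unbroken run of bad polls longer than two minutes does, which matches the description above.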
A bit more info about this and the other errors:
In line with normal Ethernet behaviour, collisions and errors are, under normal circumstances, proportional to the load on the ports concerned. Especially in a shared environment, the number of collisions (and errors) increases as the load on the medium increases.
As a rule of thumb, things are considered healthy if the average load stays below 40%, the average collisions stay under 5%, and errors do not appear or at least do not average above 1%.
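Those rule-of-thumb values are easy to turn into a quick check. A minimal sketch, assuming you already have average percentages per port; the function name and the labels it returns are made up for this example.

```python
from typing import List


def guideline_violations(load_pct: float,
                         collision_pct: float,
                         error_pct: float) -> List[str]:
    """List which of the rule-of-thumb guidelines a port breaks.

    Guidelines used: average load below 40%, average collisions
    under 5%, average errors not above 1%.
    """
    problems = []
    if load_pct >= 40.0:
        problems.append("load")
    if collision_pct >= 5.0:
        problems.append("collisions")
    if error_pct > 1.0:
        problems.append("errors")
    return problems
```

For example, `guideline_violations(55, 2, 3)` returns `["load", "errors"]`, and an empty list means the port stays within all three guidelines.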
If we detected errors on ports without an exceptionally high load, we assumed they were most likely related to the transmission medium (from NIC to NIC) and had to be resolved at once. If the errors went together with a very high load on these ports, we examined them in more depth.
The broadcast ratio is a different matter: it has no direct link to a specific port, because a broadcast is sent to all ports of the broadcast domain. The broadcast ratio will rise as more servers, workstations and networking protocols are added to the network.
So we gave them a top 10 of utilization, broadcast ratio, collision ratio and error ratio in the report, and attached the complete audit reports as an annex so they could look through them and solve the other minor errors themselves.
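If you collect the per-port numbers yourself, a top 10 like this is a few lines of code. A small sketch, assuming the measurements sit in a dictionary keyed by port; the port names and metric keys in the usage note are hypothetical, not taken from the actual report.

```python
import heapq
from typing import Dict, List, Tuple


def top_n(ports: Dict[str, Dict[str, float]],
          metric: str,
          n: int = 10) -> List[Tuple[str, float]]:
    """Return the `n` ports with the highest value for `metric`,
    highest first, as (port_name, value) pairs."""
    ranked = heapq.nlargest(n, ports.items(),
                            key=lambda item: item[1][metric])
    return [(name, stats[metric]) for name, stats in ranked]
```

With `ports = {"sw1:Fa0/1": {"utilization": 85.0}, "sw1:Fa0/2": {"utilization": 12.0}}`, calling `top_n(ports, "utilization")` lists Fa0/1 first. The same function works for the broadcast, collision and error ratios by changing `metric`.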
- Interface Utilization Exceeded Error/Warning Threshold
This error indicated that, on the device associated with this warning, the utilization on the interface had exceeded the globally set error threshold (80% by default) for more than 2 minutes. There were a lot of interfaces with this error, indicating that their LAN was under heavy load and urgently needed upgrading.
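For reference, interface utilization is usually derived from two readings of an octet counter (for example the IF-MIB `ifInOctets` counter) taken some seconds apart. The sketch below shows that arithmetic for one direction of a link. I am assuming a 32-bit counter that wraps at most once between readings; I don't know exactly how the OptiView software computes its figure.

```python
def utilization_pct(octets_start: int,
                    octets_end: int,
                    interval_s: float,
                    speed_bps: float,
                    counter_bits: int = 32) -> float:
    """Utilization of one direction of a link, in percent.

    octets_start / octets_end: two readings of an octet counter,
    taken `interval_s` seconds apart on a link of `speed_bps` bits/s.
    """
    if interval_s <= 0 or speed_bps <= 0:
        raise ValueError("interval_s and speed_bps must be positive")
    # Modulo handles a single counter wrap between the two readings.
    delta_octets = (octets_end - octets_start) % (1 << counter_bits)
    return delta_octets * 8 * 100.0 / (interval_s * speed_bps)
```

On a 100 Mbit/s port, 300,000,000 octets in 30 seconds works out to exactly the 80% default error threshold.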
- Key device not responding
This error indicated that the Agent was not able to communicate with the associated device after repeated tries. To monitor the status of the network and its devices, the Network Inspector application checked basic communication with key devices approximately every 2 minutes. Key devices were determined automatically by the application when it detected that the device offered an important service to the network.
Therefore all servers, routers and switches were considered key devices.
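The idea of "not responding after repeated tries" can be sketched as below. This is not the tool's code: the number of tries is an assumption, and `probe` is a placeholder for whatever reachability check you use (for instance a wrapper around ping), so the example does not depend on any particular network library.

```python
from typing import Callable, Iterable, List


def find_unresponsive(devices: Iterable[str],
                      probe: Callable[[str], bool],
                      tries: int = 3) -> List[str]:
    """Return the devices for which `probe` failed on every attempt.

    `probe(device)` should return True when the device answers.
    A device is only reported after `tries` consecutive failures.
    """
    unresponsive = []
    for device in devices:
        # any() stops at the first successful attempt
        if not any(probe(device) for _ in range(tries)):
            unresponsive.append(device)
    return unresponsive
```

Run on a schedule of roughly every 2 minutes against the list of key devices, this gives the same kind of "Key device not responding" entries described above.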
- Overlapped subnet mask
This indicated that the configured subnet mask was most likely not the best choice for that specific device. This could be caused by incorrect configuration of the device’s subnet mask.
Crudely stated, in IP networking, devices must have some information about which devices are in the local network, and how to reach remote devices. The subnet mask allows a device to quickly determine if it needs to send a packet of information to the router for delivery or try to reach a device directly. When the subnet mask is wrong, this decision cannot be made correctly.
The subnet mask separates the IP address into two parts: network and host. The subnet mask uses "1"s to indicate the network portion and "0"s to indicate the host portion. For example, a subnet mask of 255.255.192.0 is 11111111.11111111.11000000.00000000. (Most of you already know this, but I thought I'd add it for reference's sake and for those who didn’t.)
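For those who like to see it in code, here is a short sketch using Python's standard `ipaddress` module. It shows the binary form of a mask and the "deliver directly or send to the router" decision. The function names and the example addresses in the usage note are mine, purely for illustration.

```python
import ipaddress


def mask_to_binary(mask: str) -> str:
    """Render a dotted-decimal subnet mask as dotted binary."""
    return ".".join(f"{int(octet):08b}" for octet in mask.split("."))


def is_on_link(src_ip: str, mask: str, dst_ip: str) -> bool:
    """True if `dst_ip` lies in the sender's own subnet, so the packet
    can be delivered directly instead of being sent to the router."""
    network = ipaddress.ip_interface(f"{src_ip}/{mask}").network
    return ipaddress.ip_address(dst_ip) in network
```

`mask_to_binary("255.255.192.0")` gives `11111111.11111111.11000000.00000000`. With that mask, a host at 10.101.1.5 treats 10.101.63.9 as local but sends traffic for 10.101.64.9 to its router.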
When a device was reported as having an overlapped subnet mask, the OptiView Console application had determined that other devices on the network were reporting a different subnet into which this device would also fit, and that the other subnet would probably be a better choice (just stating the obvious).
For example, the device may be reporting a subnet mask of "255.255.0.0" while many other devices on the network are reporting a subnet mask of "255.255.255.0" and this device would fit into the same subnet if it used that mask.
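A rough sketch of that check, again with the standard `ipaddress` module. It is my own reading of the description, not the tool's actual logic: it looks at the subnets the other devices report, most common first, and suggests the mask of the first one the device would also fit into. All names and addresses are hypothetical.

```python
import ipaddress
from collections import Counter
from typing import Dict, Optional


def suggest_mask(device_ip: str,
                 device_mask: str,
                 others: Dict[str, str]) -> Optional[str]:
    """`others` maps IP address -> subnet mask as reported by other devices.

    Returns a suggested mask if the device fits into a subnet that other
    devices use but that differs from its own, otherwise None.
    """
    own = ipaddress.ip_interface(f"{device_ip}/{device_mask}")
    subnets = Counter(
        ipaddress.ip_interface(f"{ip}/{mask}").network
        for ip, mask in others.items()
    )
    for network, _count in subnets.most_common():
        if network != own.network and own.ip in network:
            return str(network.netmask)
    return None
```

A device at 10.101.1.50 with mask 255.255.0.0, sitting among neighbours in 10.101.1.x with 255.255.255.0, gets 255.255.255.0 suggested. A device that already uses that mask gets `None`.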
As these different subnet masks can interfere with normal network traffic, the values were adjusted afterwards. One such device was a router, which of course accounted for some of the network problems, like a conversation needing double the bandwidth or a certain resource not being reachable.
The last item, duplicate IP address, was found a few times during the audit of the network. As this can corrupt the ARP caches in routers when it occurs frequently, we mentioned it to the customer so he could look into why this was happening.
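Duplicate IP addresses typically show up as one IP address being seen with more than one MAC address. A minimal sketch of that check, assuming you have a list of (IP, MAC) pairs from ARP tables or captures; the input format and function name are assumptions for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_duplicate_ips(
        observations: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group observed (ip, mac) pairs and return only the IP addresses
    that were seen with more than one MAC address."""
    macs_by_ip: Dict[str, Set[str]] = defaultdict(set)
    for ip, mac in observations:
        macs_by_ip[ip].add(mac.lower())  # normalise case before comparing
    return {ip: macs for ip, macs in macs_by_ip.items() if len(macs) > 1}
```

Keep in mind that an IP address can legitimately move to another MAC address (after a NIC swap, for instance), so a hit is a reason to look closer, not proof of a conflict.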
One other thing we found was that some switches had uptimes as high as 375 days, so it was safe to say that some, if not all, of the switches hadn’t received an update or a much-needed reboot in quite some time.
So, in conclusion, from the information collected by the Fluke OptiView Console we can conclude the following:
- On several interfaces the ‘Utilization thresholds’ were exceeded, especially on the access switches. The exceeded thresholds were noted mostly during working hours, from early in the morning until about 10 o’clock and from around noon until approximately 16:00.
In vlan1, for example, there were two interfaces on a certain device that indicated high load for many hours. This wasn’t abnormal though, because it was the WAN link, which is also monitored and shaped by a Packeteer PacketShaper. Just to show you that high loads don’t necessarily mean problems.
- At different locations and at different moments, devices did not respond to an IP ping request. This meant a key device (server, router, switch) had no IP connectivity for a short period of time. This could not be tolerated from a key device and had to be analyzed further.
- On one device ‘Router interface going down’ was seen many times. This indicated that no communication was possible through that specific interface, which of course needed immediate attention and was reported to the IT department.
- Furthermore, a lot of other problems were found on the network, like a rogue dial-in server that was active in one of the VLANs.
Finally, when we took a look at the performance in the different VLANs and the connections from the access layer to the core, there was sufficient bandwidth available.
It was particularly the access ports that were suffering from high peaks.
The configuration of the switches needed to be cleaned up; for example, there were a lot of differences between the configurations on the core switches, so after the completion of the audit we adjusted them until they were consistent.
After the audit we presented everything we had found in a nice big report with lots of charts and numbers, talked it over, and gave advice on where to improve or adjust certain things. Some things we adjusted immediately after the audit, as these were key points in their infrastructure; other things were adjusted or adapted by the IT department of the firm.
For now this is it … It is a lot to read maybe, but I hope some of you can use the information provided. If you can add some tips, alterations or improvements, don’t hesitate to drop me a message; like many here, I learn something every day.
Also please excuse any language or spelling errors, I tried my best to avoid them.