XGS3700 - intermittently slow responses to pings and CPU over 100% messages in log

imaohw Posts: 123  Ally Member
edited August 2022 in Switch
I was making some changes to the configuration of my XGS3700 switch stack and noticed that pages in the web user interface occasionally loaded slowly, sometimes taking as long as 2-3 seconds.

I decided to run a continuous ping from a PC connected to the XGS3700 stack and saw intermittent slow response times. See below:


 
I logged into the switch thru SSH and displayed the CPU utilization using "show cpu-utilization".  The utilization was much higher than I would have expected.  There was very little network traffic at the time. All of the downstream switches (GS2210, GS1900, GS1350, GS1910) were showing CPU utilization of less than 10% (most at 1-2%). See XGS3700 CPU utilization output below:



To try to narrow down the high utilization I used the "show cpu-utilization process" command, which according to the CLI manual should provide CPU and memory usage per process.  However, this command displayed exactly the same results as "show cpu-utilization", with no per-process detail.

I may be wasting my time looking into this as I do not see any issues with network traffic flowing thru the XGS3700 stack. However, I get concerned when I see instances of CPU utilization of over 80% and when there is slowness in the XGS3700 web UI and the ping latency. I am curious about the following:
  • Why does the CLI command "show cpu-utilization process" not work as documented?
  • Where should I be looking to better understand the intermittent latency in ping responses from the XGS3700?
  • Are the CPU utilization statistics from the XGS3700 stack combining all CPUs from all switches in the stack or only from the "master" switch? 
  • Are all CPUs used when there are multiple switches or just the master's CPU?
  • Should I be concerned with the CPU utilization percentages?
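
One way to put numbers on the intermittent ping latency is to save the continuous ping output to a file and flag the replies that stand out from the baseline. The following is a minimal sketch, not anything provided by the switch. The function names, the 10x-median threshold, and the sample lines (including the 192.0.2.1 documentation address) are assumptions made for illustration.

```python
import re
import statistics

# Matches the round-trip time in typical ping reply lines, for example
# "time=1.23 ms" (Linux/macOS) or "time<1ms" / "time=12ms" (Windows).
RTT_PATTERN = re.compile(r"time[=<]\s*(\d+(?:\.\d+)?)\s*ms", re.IGNORECASE)


def parse_rtts(lines):
    """Return the round-trip times (in ms) found in ping output lines."""
    rtts = []
    for line in lines:
        match = RTT_PATTERN.search(line)
        if match:
            rtts.append(float(match.group(1)))
    return rtts


def find_spikes(rtts, factor=10.0):
    """Return (index, rtt) pairs whose RTT exceeds factor times the median."""
    if not rtts:
        return []
    threshold = factor * statistics.median(rtts)
    return [(i, rtt) for i, rtt in enumerate(rtts) if rtt > threshold]


# Hypothetical sample output; the real capture was not preserved in this thread.
SAMPLE = [
    "Reply from 192.0.2.1: bytes=32 time<1ms TTL=64",
    "Reply from 192.0.2.1: bytes=32 time=1ms TTL=64",
    "Reply from 192.0.2.1: bytes=32 time=187ms TTL=64",
    "Reply from 192.0.2.1: bytes=32 time=1ms TTL=64",
    "Request timed out.",
    "Reply from 192.0.2.1: bytes=32 time=2ms TTL=64",
]

if __name__ == "__main__":
    for index, rtt in find_spikes(parse_rtts(SAMPLE)):
        print(f"reply {index}: {rtt} ms")
```

Comparing the timestamps of the flagged replies with the switch log may show whether the slow pings line up with the periods of high CPU utilization. Timed-out requests are skipped by this sketch and would need to be counted separately.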

All Replies

  • Zyxel小編 Lucious
    Zyxel小編 Lucious Posts: 278  Zyxel Employee
    edited October 2020
    @imaohw

    1. 
    The CLI reference guide is a "generic" document that does not apply exactly to every model.
    See the note on the first page: some command options in the guide may not be available in your product.

    2.
    In older ZyNOS firmware such as 4.30 on the XGS3700 series, the ICMP process has a lower priority in the CPU queue than it does in newer firmware. We think this could be the cause of the intermittent latency.

    3.
    In stacking operation, the master slot's CPU handles most tasks, except for particular ones such as polling DDMI information from transceivers.

    4.
    As long as there is no significant impact on operation, the CPU utilization seems okay.

    Zyxel_Lucious

  • imaohw
    imaohw Posts: 123  Ally Member
    edited November 2020
    @Zyxel_Lucious - I am now seeing periods of CPU Utilization of over 100% every few days on my XGS3700 switch stack.  Messages in the log like this:

       2 Nov 06 21:49:13 IN system: CPU utilization is over 100 and keep 5 seconds, driver count = 34.

    • The stack consists of 3 XGS3700-24 switches and 1 XGS3700-48 HP switch.  
    • Across all four switches only 55 ports are currently in use plus 8 10G stacking ports in ring topology. 
    • IP Source Guard is on.
    • RSTP is configured on 13 ports.
    • There are 3 static LAGs to other Zyxel switches.
    • There is 1 LACP LAG to a USG.
    • There are 17 VLANs configured.
    • No other switch capabilities are configured.

    Any suggestions as to what I can look for?
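
To see how often these messages occur, log lines of the form quoted above can be tallied with a short script. This is a minimal sketch under the assumption that every alert follows the same wording as that one message. The function names are made up for illustration, and the meaning of "driver count" is not documented in this thread, so the value is only extracted, not interpreted.

```python
import re
from collections import Counter

# Pattern built from the single log line quoted in this thread; other
# firmware versions may word the message differently.
LOG_PATTERN = re.compile(
    r"(?P<month>[A-Z][a-z]{2}) (?P<day>\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) .*?"
    r"CPU utilization is over (?P<threshold>\d+) and keep (?P<seconds>\d+) seconds, "
    r"driver count = (?P<driver_count>\d+)"
)


def parse_cpu_alerts(lines):
    """Extract the CPU-utilization alerts from an iterable of log lines."""
    alerts = []
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match:
            alerts.append({
                "date": f"{match['month']} {match['day']}",
                "time": match["time"],
                "threshold": int(match["threshold"]),
                "seconds": int(match["seconds"]),
                "driver_count": int(match["driver_count"]),
            })
    return alerts


def alerts_per_day(alerts):
    """Count alerts per calendar date string (for example 'Nov 06')."""
    return Counter(alert["date"] for alert in alerts)


# The log line quoted in the post above.
EXAMPLE_LINE = (
    "   2 Nov 06 21:49:13 IN system: CPU utilization is over 100 "
    "and keep 5 seconds, driver count = 34."
)

if __name__ == "__main__":
    for date, count in alerts_per_day(parse_cpu_alerts([EXAMPLE_LINE])).items():
        print(date, count)
```

Feeding a saved copy of the full system log through `parse_cpu_alerts` would show whether the alerts cluster at particular times of day, which could help match them to scheduled activity on the network.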
  • Zyxel小編 Lucious
    Zyxel小編 Lucious Posts: 278  Zyxel Employee
    @imaohw

    Could you send us the tech-support log?

    Zyxel_Lucious
  • imaohw
    imaohw Posts: 123  Ally Member
    @Zyxel_Lucious - log sent by PM.
  • Zyxel小編 Lucious
    Zyxel小編 Lucious Posts: 278  Zyxel Employee
    edited November 2020
    @imaohw

    After checking the log,

    1.
    As long as DHCP snooping is enabled, all DHCP packets are processed by the CPU, no matter which VLAN they belong to or what destination address they have.
    Your config contains "dhcp snooping vlan 1-2,5-6,10,20,99,133,201-202", so there could be a considerable number of DHCP packets burdening the CPU.
    We suggest rate-limiting DHCP packets per second (pps) on the DHCP Snooping Port Configure page, or simply disabling DHCP snooping, to see whether CPU utilization drops.

    2.
    In our experience, a large amount of SNMP polling can often lead to high CPU utilization.
    You may also want to check your SNMP application.

    Aside from the high ping latency, is there any other impact that could be related to the high CPU utilization?
    For example, DHCP clients not getting an IP address in time?

    Zyxel_Lucious
  • imaohw
    imaohw Posts: 123  Ally Member
    @Zyxel_Lucious

    1. I checked the "snooping table" on the XGS3700.  There are currently 83 devices in the table.  Fewer than 100 is typical.  Almost all devices are on the network permanently. Fewer than 15 devices leave and return to the network. The lease time remaining on the devices averages over 12 hours.

    Looking at the configuration of the DHCP server (USG1100), the lease time is 7 days for several VLANs and 1 day for a few others; the shortest, for guests (rarely used), is 6 hours.  Almost all of the devices on the network have a "reservation" for a specific IP address in the DHCP server. The USG1100 logs show, on average, fewer than 20 DHCP events per hour.

    It seems unlikely this is a DHCP issue but I will try disabling DHCP snooping to see if that makes a difference.
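
As a rough sanity check on the DHCP numbers above, the expected steady-state renewal traffic can be estimated with a few lines of arithmetic. This is only a back-of-the-envelope sketch. It assumes each client renews at half its lease time (the default T1 in RFC 2131) and that each renewal is one DHCPREQUEST plus one DHCPACK. It also takes the worst case of all 83 devices being on the shortest (6 hour) lease, which is pessimistic for this network. The function names are made up for illustration.

```python
def renewal_packets_per_second(clients, lease_seconds, packets_per_renewal=2):
    """Estimate steady-state DHCP renewal traffic in packets per second.

    Assumes every client renews at T1 = lease / 2 and that a renewal is a
    DHCPREQUEST answered by a DHCPACK (2 packets).
    """
    renew_interval = lease_seconds / 2
    return clients * packets_per_renewal / renew_interval


def events_per_second(events_per_hour):
    """Convert an hourly event count into an average per-second rate."""
    return events_per_hour / 3600


# 83 devices in the snooping table, all pessimistically on the 6 hour lease.
worst_case_pps = renewal_packets_per_second(clients=83, lease_seconds=6 * 3600)

# Fewer than 20 DHCP events per hour in the USG1100 logs.
logged_eps = events_per_second(20)

if __name__ == "__main__":
    print(f"worst-case renewals: {worst_case_pps:.4f} packets/s")
    print(f"logged DHCP events:  {logged_eps:.4f} events/s")
```

Both figures come out well below one packet per second. The estimate ignores retries and clients that broadcast repeatedly after a failure, so it does not rule out bursts, but it is consistent with the view that normal DHCP traffic alone is an unlikely cause.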

    2. There should be minimal SNMP traffic on the network as SNMP is not currently in use. I will do a packet capture of SNMP traffic just to be sure.

    The only time I am seeing DHCP issues is when I am using both legs of a LAG to the USG1100.  Currently one leg is "downed" and there are no DHCP issues.

    Let me know if there is anything else I should look at. While this does not appear to be impacting  network traffic (other than pinging the switch), it is concerning as currently the network is hardly being used and I would not expect CPU utilization to go over 100%.
  • Ace
    Ace Posts: 25  Freshman Member
    @imaohw

    Did you enable CPU protection on your XGS3700? In my experience, high CPU utilization can be caused by ARP packets.
    You could enable it to limit ARP and see whether that resolves the problem.

  • imaohw
    imaohw Posts: 123  Ally Member
    @Ace - Thanks for the suggestion. I had not tried enabling CPU protection.  I have now. Maybe it will help me narrow down which port's traffic is causing the issue.
  • imaohw
    imaohw Posts: 123  Ally Member
    Update - rebooting the XGS3700 switch stack caused the CPU utilization to drop to average levels below 15% and I am no longer seeing CPU Utilization over 100% messages in the log.