
High CPU only when no WAN connection

BSOD2600

RT-AC87U 384.13_2
Diversion 4.1.8
amtm 3.0


I've been having this problem for several Merlin and Diversion versions, so I don't think it's specifically related to their latest releases.

Basically, when the cable modem (WAN) loses its connection, both CPUs on the router are pegged at 100%. This prevents clients on the LAN from even being able to obtain a DHCP address. Most of the time, the router is so resource-constrained that I'm unable to run top after SSHing into it. Even when the WAN connection is restored, the router stays locked up and is unable to provide routing to the LAN clients until it's been power cycled.
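Since the router is usually too starved to run top interactively once this starts, one idea is to leave a small check running beforehand that notices the load spike. The following is only a minimal sketch: the function name load_exceeds and the threshold are made up for illustration, and it assumes the standard /proc/loadavg format (1-minute load average in the first field).

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when the 1-minute load average
# is above the given threshold.
#   $1: threshold, e.g. 2.0
#   $2: file to read (optional, defaults to /proc/loadavg)
load_exceeds() {
    # The first field of /proc/loadavg is the 1-minute load average.
    awk -v limit="$1" '{ exit !($1 > limit) }' "${2:-/proc/loadavg}"
}

# Possible usage (commented out so the sketch does not loop when sourced):
# append a batch-mode top snapshot to a log whenever the load passes 2.0.
# The log path is an arbitrary choice.
#
# while true; do
#     load_exceeds 2.0 && top -b -n 1 | head -n 12 >> /tmp/top_snapshots.log
#     sleep 30
# done
```

With something like this started before an outage, the snapshots would already be on disk by the time SSH becomes unusable.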

Today, I was finally able to capture this behavior. top always showed the tdts_rule_agent process, with dnsmasq and mtdblock3 alternating as the other two top offenders. Here is a snapshot before/after disabling Diversion, while the WAN is disconnected.

Code:
DIVERSION ENABLED

Mem: 194012K used, 61664K free, 3552K shrd, 1888K buff, 26700K cached
CPU: 47.1% usr 52.6% sys  0.0% nic  0.0% idle  0.0% io  0.0% irq  0.1% sirq
Load average: 2.75 2.45 1.77 4/114 5437
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
  778     1 nobody   R    51068 19.9   0 46.1 dnsmasq --log-async
 5430  5334 admin    R     1732  0.6   1 44.6 tdts_rule_agent -g -r /jffs/signature/rule.trf
  192     1 admin    R     5132  2.0   1  5.0 nt_center
  169     1 admin    S      652  0.2   0  1.4 tftpd
<SNIP>

Mem: 196496K used, 59180K free, 1708K shrd, 1200K buff, 9152K cached
CPU: 49.8% usr 46.2% sys  0.0% nic  3.5% idle  0.0% io  0.0% irq  0.3% sirq
Load average: 2.84 2.39 1.34 3/114 6274
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
  794     1 nobody   R    81292 31.7   0 43.0 dnsmasq --log-async
   27     2 admin    RW       0  0.0   1 22.1 [mtdblock3]
  196     1 admin    S     5144  2.0   0  2.5 nt_center
  179     1 admin    S     2604  1.0   0  2.4 protect_srv
<SNIP>

admin@RT-AC87U:/tmp/home/root# iostat
Linux 2.6.36.4brcmarm (RT-AC87U)        01/14/20        _armv7l_        (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.09    0.00    2.12    0.06    0.00   96.72

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
mtdblock3         0.12         4.16         0.00     249594          0
sda               0.03         4.64         0.11     277944       6344

admin@RT-AC87U:/tmp/home/root# dstat
Traceback (most recent call last):
  File "/opt/bin/dstat", line 32, in <module>
    import six
ImportError: No module named six


DIVERSION DISABLED
Mem: 129604K used, 126072K free, 3512K shrd, 1048K buff, 9620K cached
CPU:  1.5% usr 52.2% sys  0.0% nic 46.0% idle  0.0% io  0.0% irq  0.0% sirq
Load average: 2.63 2.48 1.81 2/115 6174
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
 6136  5699 admin    R     1732  0.6   1 49.3 tdts_rule_agent -g -r /jffs/signature/rule.trf
  192     1 admin    R     5132  2.0   1  1.4 nt_center
  169     1 admin    S      652  0.2   1  0.9 tftpd
  179     1 admin    R     2604  1.0   1  0.6 protect_srv
<SNIP>
About 2 weeks ago, I migrated the Diversion use case to a dedicated Pi-hole device (partly on a hunch that it was related to this problem). When the symptom occurred today, I disabled Diversion (and pixelserv), after which the CPUs only ranged between 0-50%. tdts_rule_agent was still always at the top of the list, but at least clients were able to get DHCP leases.

I also capture the RT-AC87U syslogs with Splunk. Looking back over the time of the incident, the only thing logged is a constant repeat of dnsmasq listing the nameservers and host addresses.
[Attachment: RT-AC87U - dnsmasq loop.png]
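To put a number on the dnsmasq loop, something like the following could be run against the router's syslog. This is only a rough sketch: the function name count_dnsmasq_reloads is made up, and the pattern assumes dnsmasq's usual "using nameserver ..." and "read ..." messages behind a standard "Mon DD HH:MM:SS" syslog timestamp, which may not match exactly what is in the screenshot.

```shell
#!/bin/sh
# Hypothetical helper: count dnsmasq nameserver/hosts reload messages
# per minute in a syslog-format file.
#   $1: path to the log file
count_dnsmasq_reloads() {
    awk '/dnsmasq\[[0-9]+\]: (using nameserver|read )/ {
             # Fields 1-3 are month, day, HH:MM:SS; keep only HH:MM.
             minute = $1 " " $2 " " substr($3, 1, 5)
             count[minute]++
         }
         END { for (m in count) print m, count[m] }' "$1" | sort
}
```

For example, count_dnsmasq_reloads /tmp/syslog.log (assuming that is where the firmware keeps its log) would print one line per minute with the number of matching dnsmasq messages, which should make it obvious when the loop starts relative to the WAN dropping.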



Does anyone have an idea what the root cause of this problem is?
 
