
RT-BE88U – dnsmasq intermittently disappears (no process, port 53 unbound)

bacon612

Occasional Visitor

Hi everyone,


I’m seeing an intermittent issue on an RT-BE88U running Asuswrt-Merlin where dnsmasq stops running entirely, resulting in loss of DNS for the LAN. I’ve done some debugging and wanted to share the findings in case it’s helpful or if others are seeing the same behavior.


Router / Environment


Router: RT-BE88U
Firmware: Asuswrt-Merlin (current release)
Uptime when failure occurs: typically several days


Network setup includes:


  • 2 AiMesh nodes
    • RT-AX58U (Merlin)
    • RT-AC68U (Merlin, though no longer officially supported)
  • Debian 12 chroot environment running:
    • Plex Media Server
    • Transmission
  • Entware installed
  • Custom scripts for Transmission VPN routing

However, none of these scripts interact with dnsmasq.

Additional WAN context:
I am bypassing the AT&T gateway using a WAS-110 fiber bridge / XGS-PON ONT directly on the BE88U. WAN connectivity remains up when the issue occurs, and the observed failure is that dnsmasq disappears entirely (no process, no port 53 listener) until service restart_dnsmasq is run. Because the most specific log clue so far is dnsmasq: failed to send packet: Bad file descriptor during IPv6 RA / DHCPv6 activity on br0, I’m not sure whether the WAN bypass is relevant or just incidental, but I wanted to disclose it.



Symptom


Internet suddenly stops working for LAN clients.


Observed behavior:


  • DNS resolution fails
  • Router itself still reachable via SSH
  • WAN connectivity still appears normal

Running diagnostics shows dnsmasq is no longer running.


Example:



pidof dnsmasq
(no output)

ps | grep dnsmasq
(only the grep process itself is listed)


Port 53 is also not bound by any process.




netstat -lntup | grep :53
(no dnsmasq entries)



Restarting dnsmasq immediately restores service:




service restart_dnsmasq





Relevant Log Entries


The most interesting log entry appears several minutes before the outage:




dnsmasq[25137]: failed to send packet: Bad file descriptor



Immediately prior to this, there are IPv6 RA / DHCPv6 related messages:




router advertisement on 2600:xxxx::, old prefix for br0
DHCPv6 stateless on 2600:xxxx:: constructed for br0
router advertisement on 2600:xxxx:: constructed for br0
dnsmasq[25137]: failed to send packet: Bad file descriptor



After that, dnsmasq continues logging DHCP activity for a while, then eventually disappears completely.




Confirmation During Failure


When the issue occurred I captured the following:




pidof dnsmasq
(no output)

netstat -lntup | grep :53
(no listener)

ps | grep -E 'dcd|dnsmasq|networkmap|wanduck|roamast'
wanduck
networkmap
roamast
dcd



Other router daemons were still running normally.




Additional Observations


After restarting dnsmasq, I consistently see:




dnsmasq-dhcp: Working around kernel bug: faulty source address scope for VRF slave br0



Bridge configuration:




br0
eth0
eth1
eth2
...
wl0
wl1
wl0.1
wl1.1
tap22



So dnsmasq is bound to a fairly busy bridge interface.




Memory / Resource Check


Memory does not appear to be exhausted.


Example:




Mem: ~2GB total
~57MB free
~736MB available including cache



No OOM killer messages in dmesg.




Other daemon instability observed


Occasionally the dcd process crashes earlier in the logs:




kernel: CPU: PID Comm: dcd
kernel: Tainted: P O



However dcd is still running during the dnsmasq outage, so it may be unrelated.




Current Workaround


Restarting dnsmasq restores service immediately:




service restart_dnsmasq





Questions


  1. Has anyone else seen dnsmasq disappear like this on the BE series routers?
  2. Could this be related to:
    • IPv6 RA / DHCPv6 handling
    • bridge interface (br0)
    • kernel networking bug referenced in the dnsmasq log?
  3. Would disabling LAN IPv6 be a useful diagnostic test?



Thanks in advance for any insight.
Happy to capture additional diagnostics if there are recommended commands.
 
@bacon612, please remove the Release tag from your post. The Release tag applies to firmware release announcements.
 
The watchdog daemon should be restarting dnsmasq before you would need to do it manually. Is watchdog running?

The dnsmasq log is specific to dhcpv6, so disabling IPv6 could help as a diagnostic.
Thanks for the reply. I checked further. There is definitely a running /sbin/watchdog process, and strings /sbin/watchdog shows dnsmasq-related logic including dnsmasq_check, /var/run/dnsmasq.pid, start_dnsmasq, and stop_dnsmasq.


However, in the actual failure state I captured earlier, dnsmasq was completely absent (pidof dnsmasq empty, no port 53 listener) and had not been restarted automatically until I manually ran service restart_dnsmasq. I also checked syslog for watchdog, start_dnsmasq, and stop_dnsmasq and didn’t see any such log entries around normal operation.


So it looks like there is dnsmasq-aware logic in watchdog, but I’m not seeing evidence that it recovers this failure mode on my BE88U. I’ve added some more logging to /jffs/scripts/dnsmasq-monitor.sh in the hope that, if this keeps happening, I’ll have more to go on.
 
What is the output of these commands?
Code:
nvram get asus_mfg
ls -l /var/run/dnsmasq*.pid
 
bacon612@RT-BE88U-2140:/tmp/home/root# nvram get asus_mfg
0
bacon612@RT-BE88U-2140:/tmp/home/root# ls -l /var/run/dnsmasq*.pid
-rw-r--r-- 1 nobody nobody 6 Mar 12 14:22 /var/run/dnsmasq.pid
bacon612@RT-BE88U-2140:/tmp/home/root#
 
OK, so watchdog should be reading the process from that pid file and verifying the process ID exists. So next time it fails, check if that pid file is still there before starting dnsmasq. If that's missing, then dnsmasq cleaned up the pidfile when it terminated, which doesn't make for a useful watchdog.
 
Thanks for the suggestion regarding the PID file.


I’ve added some additional instrumentation so the next time the failure occurs I can capture the router state before manually restarting dnsmasq.


I created a small background monitor script (/jffs/scripts/dnsmasq-monitor.sh) that runs every 5 seconds and watches for:


  • dnsmasq PID disappearing
  • port 53 listener disappearing
  • new Bad file descriptor entries in syslog

If any of those occur, the script writes a snapshot to:



/jffs/dnsmasq-debug/
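
For anyone wanting to build something similar, here is a minimal sketch of how the "new Bad file descriptor entries" check could work. It is not the actual contents of dnsmasq-monitor.sh. The function name count_new_matches and the state-file approach are illustrative assumptions, and the syslog location mentioned in the usage note is assumed rather than confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper: report how many NEW lines matching a pattern have
# appeared in a log file since the previous call. The previous match count
# is remembered in a small state file.
count_new_matches() {
    logfile="$1"
    statefile="$2"
    pattern="${3:-Bad file descriptor}"

    # grep -c prints the number of matching lines; nothing if the file is missing.
    current="$(grep -c -- "$pattern" "$logfile" 2>/dev/null)"
    current="${current:-0}"

    previous="$(cat "$statefile" 2>/dev/null)"
    previous="${previous:-0}"

    # Always store the new baseline so the next call compares against it.
    echo "$current" > "$statefile"

    if [ "$current" -gt "$previous" ]; then
        echo $((current - previous))
    else
        # Count unchanged, or it dropped (for example after log rotation).
        echo 0
    fi
}
```

A monitor loop could call something like `count_new_matches /tmp/syslog.log /tmp/bfd.count` every few seconds and trigger a snapshot whenever the result is non-zero. Both paths there are assumptions. Because the baseline is reset when the count drops, matches that arrive in the same interval as a log rotation can be missed.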


The dump includes:


  • pidof dnsmasq
  • port 53 listener state
  • process list (dnsmasq, watchdog, networkmap, roamast, etc.)
  • bridge and interface state
  • IPv4/IPv6 addresses
  • routing tables
  • recent syslog lines
  • recent dmesg output
  • recent Bad file descriptor lines
  • watchdog/dnsmasq control lines from syslog
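
As a rough illustration of the dump described above, the sketch below writes a timestamped snapshot file. It is an assumption-laden outline, not the real script: the function names section and write_snapshot are made up, the syslog path default is assumed, and it covers only part of the list (no dmesg or filtered syslog lines). Any command that is missing or fails is recorded instead of aborting the dump.

```shell
#!/bin/sh
# Run one command and record its output (or its failure) under a header.
section() {
    echo "== $* =="
    "$@" 2>&1 || echo "(exit status $?)"
    echo
}

# Hypothetical snapshot writer. Arguments (all optional):
#   $1 output directory, $2 dnsmasq pid file, $3 syslog file (assumed path)
write_snapshot() {
    outdir="${1:-/jffs/dnsmasq-debug}"
    pidfile="${2:-/var/run/dnsmasq.pid}"
    syslog="${3:-/tmp/syslog.log}"

    mkdir -p "$outdir" || return 1
    out="$outdir/snapshot-$(date +%Y%m%d-%H%M%S).txt"

    {
        section pidof dnsmasq
        section ls -l "$pidfile"
        section netstat -lntup
        section brctl show
        section ip addr show
        section ip route show
        section ip -6 route show
        section tail -n 50 "$syslog"
    } > "$out"

    # Print the path so the caller can log where the dump went.
    echo "$out"
}
```

Recording `ls -l` on the PID file in the same snapshot means the PID-file question raised earlier gets answered at the moment of failure, before any restart changes the state.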

The idea is to capture enough context to determine whether:


  1. dnsmasq exited cleanly and removed /var/run/dnsmasq.pid
  2. the PID file remains but the process is gone
  3. something else is stopping dnsmasq
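
The first two cases can be told apart with a check along these lines. This is a sketch under assumptions: dnsmasq_state is a made-up name, it must run as root so that `kill -0` can signal-test the process, and it does not confirm that the PID in the file actually belongs to dnsmasq (a reused PID would look "running"). The third case cannot be decided from the PID file alone and still needs the log context.

```shell
#!/bin/sh
# Hypothetical helper: classify dnsmasq state from its pid file.
# Prints one of: running, stale-pidfile, no-pidfile
dnsmasq_state() {
    pidfile="${1:-/var/run/dnsmasq.pid}"

    if [ ! -f "$pidfile" ]; then
        # Matches case 1: the pid file is gone.
        echo "no-pidfile"
        return
    fi

    # Keep only digits, in case the file has trailing whitespace or a newline.
    pid="$(tr -cd '0-9' < "$pidfile")"

    # kill -0 sends no signal; it only tests whether the process exists.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "running"
    else
        # Matches case 2: the pid file remains but the process is gone.
        echo "stale-pidfile"
    fi
}
```

Running `dnsmasq_state` with no arguments before `service restart_dnsmasq` would show which case applies.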

For reference, current checks show:




nvram get asus_mfg
0



and the expected PID file exists:




/var/run/dnsmasq.pid



Next time the issue occurs I’ll check whether that PID file is still present before restarting dnsmasq and post the captured dump.


Hopefully that will clarify whether the watchdog should have caught the failure.
 
