I think I figured the problem out: it's Trend Micro's URL filtering service. If I do a killall wred, local DNS lookups work again after a couple of seconds; running wred -B again causes DNS to lock up after a couple of minutes. There are several issues at hand here (this is on the 380.62 beta on my RT-AC68U):
1. The wred process makes a DNS query to rgom10-en.url.trendmicro.com for every URL visited (this can easily reach hundreds of queries, as tcpdump shows).
2. The A record that rgom10-en.url.trendmicro.com resolves to (via a151.g.akamai.net) only has a TTL of 20 seconds, which means dnsmasq can only cache the answer for 20 seconds; the upstream DNS gets DoSed with a fresh batch of concurrent queries every time the cache expires.
3. Google DNS (my upstream DNS at 8.8.8.8 / 8.8.4.4) throttles the rate at which queries can be made, and some requests may never receive a response if the outgoing rate is too high. This may cause wred to lock up waiting for a DNS reply.
4. wred makes an HTTP connection to rgom10-en.url.trendmicro.com on port 80 for every URL visited. Each connection requests a URL status, waits for the response, and is then closed (connections are not reused). The number of connections can easily reach hundreds at a time.
5. Every time a connection is closed, it is placed in the TIME_WAIT state.
6. The maximum number of connections allowed in the TIME_WAIT state defaults to 4096 (/proc/sys/net/ipv4/tcp_max_tw_buckets).
7. A connection stays in the TIME_WAIT state for 2 minutes (120 seconds - /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_time_wait) before timing out. You can check both of these values with the commands shown after the log excerpt below.
8. The sheer number of connections causes the time wait bucket table to overflow (see /tmp/syslog.log):
Code:
Sep 18 11:35:03 kernel: TCP: time wait bucket table overflow
Sep 18 11:35:08 kernel: net_ratelimit: 259 callbacks suppressed
Sep 18 11:35:08 kernel: TCP: time wait bucket table overflow
Sep 18 11:35:38 kernel: TCP: time wait bucket table overflow
Sep 18 11:35:38 kernel: TCP: time wait bucket table overflow
Sep 18 11:35:38 kernel: TCP: time wait bucket table overflow
Sep 18 11:35:38 kernel: TCP: time wait bucket table overflow
Sep 18 11:35:38 kernel: TCP: time wait bucket table overflow
Sep 18 11:35:38 kernel: TCP: time wait bucket table overflow
Sep 18 11:35:38 kernel: TCP: time wait bucket table overflow
Sep 18 11:35:38 kernel: TCP: time wait bucket table overflow
Sep 18 11:35:38 kernel: TCP: time wait bucket table overflow
Sep 18 11:35:47 kernel: net_ratelimit: 35 callbacks suppressed
Sep 18 11:35:47 kernel: TCP: time wait bucket table overflow
Sep 18 11:35:47 kernel: TCP: time wait bucket table overflow
Sep 18 11:35:47 kernel: TCP: time wait bucket table overflow
Sep 18 11:35:47 kernel: TCP: time wait bucket table overflow
Sep 18 11:35:47 kernel: TCP: time wait bucket table overflow
Sep 18 11:35:47 kernel: TCP: time wait bucket table overflow
Sep 18 11:35:47 kernel: TCP: time wait bucket table overflow
Sep 18 11:35:47 kernel: TCP: time wait bucket table overflow
Sep 18 11:35:47 kernel: TCP: time wait bucket table overflow
Sep 18 11:35:48 kernel: TCP: time wait bucket table overflow
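As mentioned in points 6 and 7 above, you can check both kernel defaults on your own router before changing anything:
Code:
# current maximum number of TIME_WAIT sockets (point 6)
cat /proc/sys/net/ipv4/tcp_max_tw_buckets
# current conntrack TIME_WAIT timeout in seconds (point 7)
cat /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_time_wait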
To work around this issue, I did the following (in theory, Trend Micro should really plan these connections more carefully):
/jffs/scripts/dnsmasq.postconf:
Code:
#!/bin/sh
CONFIG=$1
source /usr/sbin/helper.sh

# drop the servers-file directive so dnsmasq stops using the ISP-pushed resolvers
pc_replace "servers-file=/tmp/resolv.dnsmasq" "" "$CONFIG"

# raise the TIME_WAIT bucket limit from the default 4096
echo '65535' > /proc/sys/net/ipv4/tcp_max_tw_buckets
The script raises the maximum number of time wait buckets every time dnsmasq is started or restarted, and also stops the router from leaking DNS queries to my ISP's pushed IPv6 servers.
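Note that for a postconf script to run at all, custom JFFS scripts have to be enabled (Administration -> System in the Asuswrt-Merlin UI) and the file must be executable:
Code:
chmod a+rx /jffs/scripts/dnsmasq.postconf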
Add the following lines to /jffs/configs/hosts.add:
Code:
184.25.56.254 rgom10-en.url.trendmicro.com
184.25.56.255 rgom10-en.url.trendmicro.com
The 184.25.56.x range is the one closest to me, but you can do an nslookup to see where Akamai sends you; the DNS responses differ depending on your geolocation.
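For example, a quick lookup from the router's shell shows the addresses Akamai currently hands out for your location (8.8.8.8 here is just whichever upstream resolver you want to ask):
Code:
nslookup rgom10-en.url.trendmicro.com 8.8.8.8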
Asuswrt UI -> Tools -> Other Settings -> TCP Timeout: time_wait -> 15 -> Save. Then, from the shell:
Code:
service restart_dnsmasq
killall wred
# wait about 5 seconds before restarting
wred -B
Verify that the number of TIME_WAIT entries stays low with the following command (assuming the 184.25.56.x range):
Code:
cat /proc/net/ip_conntrack | grep 184.25.56. | grep TIME_WAIT | wc -l
16
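If you want to watch it for a while instead of checking once, a simple loop works too (same assumption about the 184.25.56.x range - adjust the grep to yours):
Code:
# print the TIME_WAIT count for the Trend Micro range every 5 seconds
while true; do
  cat /proc/net/ip_conntrack | grep 184.25.56. | grep TIME_WAIT | wc -l
  sleep 5
done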
Local DNS should start working again after this. Hopefully this helps someone who might be having a similar issue.
