What's new

Mysterious loss of connectivity

  • SNBForums Code of Conduct

    SNBForums is a community for everyone, no matter what their level of experience.

    Please be tolerant and patient of others, especially newcomers. We are all here to share and learn!

    The rules are simple: Be patient, be nice, be helpful or be gone!

Your syslog doesn't go back far enough.....

Modem log shows the event 8/7 at 23:01, syslog starts at 23:07
Look at the first syslog.

Actually this promted me to check and Colin Taylor the interface really DOES go down:

Code:
Aug  7 23:01:27 kernel: eth0 (Int switch port: 3) (Logical Port: 3) (phyId: c) Link DOWN.
Aug  7 23:01:30 kernel: eth0 (Int switch port: 3) (Logical Port: 3) (phyId: c) Link Up at 1000 mbps full duplex

Colin Taylor with this new information what do you think is wrong with the router? How come the entire eth0 interface goes down and up and yet the WAN IP is not properly renewed? Surely then the fault cannot be with the modem?
 
Last edited:
Look at the first syslog.
OK, I found it.
The problem is that the kernel saw the drop, but the router firmware (wanduck) never saw it because it was so quick. I'd have to double check, but I think it' may be a polling operation to detect WAN down.

The sequence should look like that in post #17
 
Ah fantastic. Seems like we are making progress here. How can I increase the polling frequency of wanduck?

I think the logs may show different behaviour since the last problem. From searching this forum looks like Diversion may mess with wanduck. Since I first reported this problem I have installed Diversion. Maybe that's the problem now.
 
Last edited:
Look at the first syslog.

Actually this promted me to check and Colin Taylor the interface really DOES go down:

Code:
Aug  7 23:01:27 kernel: eth0 (Int switch port: 3) (Logical Port: 3) (phyId: c) Link DOWN.
Aug  7 23:01:30 kernel: eth0 (Int switch port: 3) (Logical Port: 3) (phyId: c) Link Up at 1000 mbps full duplex

Colin Taylor with this new information what do you think is wrong with the router? How come the entire eth0 interface goes down and up and yet the WAN IP is not properly renewed? Surely then the fault cannot be with the modem?
Can you provide the complete syslog that covers that time period. We need to see the other messages before and after 23:01:27.
 
It's in this post:


Here is the link:


Code:
DESKTOP-73CRC8F
Aug  7 22:58:28 dnsmasq-dhcp[28213]: DHCPREQUEST(br0) 192.168.1.3 fe:92:f7:fd:92:f7
Aug  7 22:58:28 dnsmasq-dhcp[28213]: DHCPACK(br0) 192.168.1.3 fe:92:f7:fd:92:f7 RE200
Aug  7 23:01:27 kernel: eth0 (Int switch port: 3) (Logical Port: 3) (phyId: c) Link DOWN.
Aug  7 23:01:30 kernel: eth0 (Int switch port: 3) (Logical Port: 3) (phyId: c) Link Up at 1000 mbps full duplex
Aug  7 23:02:38 wlceventd: wlceventd_proc_event(469): eth7: Deauth_ind 30:24:32:E6:04:06, status: 0, reason: Unspecified reason (1)
Aug  7 23:02:38 hostapd: eth7: STA 30:24:32:e6:04:06 IEEE 802.11: disassociated
Aug  7 23:02:38 wlceventd: wlceventd_proc_event(505): eth7: Auth 30:24:32:E6:04:06, status: Successful (0)
Aug  7 23:02:38 wlceventd: wlceventd_proc_event(515): eth7: ReAssoc 30:24:32:E6:04:06, status: Successful (0)
Aug  7 23:02:38 hostapd: eth7: STA 30:24:32:e6:04:06 IEEE 802.11: associate
 
OK thanks for that, I see it now. As John pointed out this is different behaviour than that which you initially reported. In post #17 the router did restart the wan interface, just like you did manually last time, but you were still having problems.

So we're still left guessing what exactly is going on with the logic within the modem. Does the router need to obtain a new IP address every time the modem's WAN restarts? If so does there have to be some sort of delay before the router is allowed to do this? I suppose one thing to test the next time this happens, before manually restarting anything, is to see if you can still log into the modem.

Another thing that might be a factor is that fact that your modem's subnet is overlapping with that of your VPN. The modem is using 10.0.0.0/8 and your VPN is typically using something like 10.8.0.0/24.

At the moment I think the most promising solution is to enable the router's Network Monitoring. As far as I can see that should restart the WAN interface if the router looses connectivity to the internet.
 
Last edited:
I think it was me in terms of speculation about different behaviour.
Before in router log we saw:
Code:
Jul 16 18:08:11 kernel: eth0 (Int switch port: 3) (Logical Port: 3) (phyId: c) Link DOWN.
Jul 16 18:08:12 WAN_Connection: ISP's DHCP did not function properly.
Jul 16 18:08:15 kernel: eth0 (Int switch port: 3) (Logical Port: 3) (phyId: c) Link Up at 1000 mbps full duplex
Jul 16 18:08:18 WAN_Connection: WAN(0) link up.
Jul 16 18:08:18 rc_service: wanduck 1119:notify_rc restart_wan_if 0
Jul 16 18:08:18 lldpd[1128]: removal request for address of 10.241.204.243%11, but no knowledge of it
Based on instances of 'restart_wan_if' in 'wanduck.c' I think this corresponds to this segment from line 2384 of wanduck.c:
C:
    current_state[wan_unit] = get_wan_state(wan_unit);
    if(current_state[wan_unit] != WAN_STATE_CONNECTING && current_state[wan_unit] != WAN_STATE_CONNECTED && current_state[wan_unit] != WAN_STATE_DISABLED){
        snprintf(cmd, sizeof(cmd), "restart_wan_if %d", wan_unit);
        _dprintf("wanduck2: %s.\n", cmd);
        notify_rc_and_wait(cmd);
    }
Whereas now, despite same thing - eth0 down for 3 seconds - in router log we get only:
Code:
Aug  7 23:01:27 kernel: eth0 (Int switch port: 3) (Logical Port: 3) (phyId: c) Link DOWN.
Aug  7 23:01:30 kernel: eth0 (Int switch port: 3) (Logical Port: 3) (phyId: c) Link Up at 1000 mbps full duplex
Yet I see that wanduck is running:
Code:
admin@RT-AX86U-4168:/tmp/home/root# ps |grep -i wan
1119 admin    11688 S    /sbin/wanduck
6839 admin     5436 S    grep -i wan

What explains why 'wanduck' didn't operate this time? Does this 'wanduck' routine need to detect the actual downtime to trigger restart, as in detect 'eth0 down' in the short space of time between eth0 down and eth0 up? If so, then perhaps john9527 is onto something here:

OK, I found it.
The problem is that the kernel saw the drop, but the router firmware (wanduck) never saw it because it was so quick. I'd have to double check, but I think it' may be a polling operation to detect WAN down.

The sequence should look like that in post #17

Could it just be it happened too quickly for the detection routine? Can I increase the polling frequency? I can't figure out from 'wanduck.c' what controls the polling frequency.

Or could it be Diversion blocks it?

I think a reasonable guess is that upon modem refresh and eth0 down and up then router ought to restart WAN, as it tried to first time round (even though this failed for some reason - but I was using redirect all: yes before, whereas now I'm using VPN Director - so maybe that helps on that front).

That's interesting about modem subnet. I suppose I cannot change that because it is set by ISP? Or is it set by router or itself? That's confusing for me.

Next time I will try to see if I can log into modem given the issue.

I have also enabled the /jffs/scripts/dhcpc-event tweak to get extra info.

With network monitoring looks like this can only be enabled if 'dual wan' is set. And then I would put in say 8.8.8.8 or something?

I'd really like to figure out what caused wanduck to fail though - why it ran the first time but not the second time.
 
Last edited:
That's interesting about modem subnet. I suppose I cannot change that because it is set by ISP? Or is it set by router or itself? That's confusing for me.
It may be configurable in the modem's UI. For example, on the Asus you would change it here.

EDIT - Never mind. I see it's your ISP's DHCP range. Nothing you can do about that.
 
I have also enabled the /jffs/scripts/dhcpc-event tweak to get extra info.
I noticed that without this tweak a DHCP renew request will not be logged in syslog. So it's possible that the reason wanduck didn't restart the whole interface is because it was able to renew its lease and determined that it didn't need to.

With network monitoring looks like this can only be enabled if 'dual wan' is set. And then I would put in say 8.8.8.8 or something?
I don't see anything in the UI to suggest dual WAN needs to be used.
 
Last edited:
I noticed that without this tweak a DHCP renew request will not be logged in syslog. So it's possible that the reason wanduck didn't restart the whole interface is because it was able to renew its lease and determined that it didn't need to.
Any clue under this scenario what would result in loss of connectivity pending manual WAN restart?
Seems like we may need to wait for my connection to crap out again with this new tweak to get further information about what is happening.
john9527 please chip in with your thoughts on the wanduck.c aspect, polling or otherwise. I looked through the code and it is hard to follow it given the vast number of exceptions and optional code blocks.
 
I'm curious about the overlapping subnet issue. If your VPN provider has a router with an IP in the 10.0.0.0/8 subnet and if your ISP does as well, it seems like there will be problems. What does your System Log -> Routing Table look like when the VPN is connected?

I can imagine it works when the VPN is set up after the ISP connection because of the order in the routing table, but then when the ISP connection resets its rule could have precedence and now you can't reach the VPN provider.

EDIT - I know order doesn't usually matter in the routing table, but it can matter if there's a duplicate route, which is possible with overlapping subnets.
 
Last edited:
I'm curious about the overlapping subnet issue. If your VPN provider has a router with an IP in the 10.0.0.0/8 subnet and if your ISP does as well, it seems like there will be problems. What does your System Log -> Routing Table look like when the VPN is connected?

I can imagine it works when the VPN is set up after the ISP connection because of the order in the routing table, but then when the ISP connection resets its rule could have precedence and now you can't reach the VPN provider.

EDIT - I know order doesn't usually matter in the routing table, but it can matter if there's a duplicate route, which is possible with overlapping subnets.
default10.0.0.10.0.0.0UG000WAN
10.0.0.0*255.0.0.0U000WAN
10.0.0.1*255.255.255.255UH000WAN
10.8.2.0*255.255.255.0U000tun11
192.168.1.0*255.255.255.0U000LAN
239.0.0.0*255.0.0.0U000LAN

Here is something that confuses me a lot, which I noticed recently. If I set my DNS to NordVPN and perform an nslookup on netflix.com, I get:

Code:
nslookup netflix.com
Server:  UnKnown
Address:  192.168.1.1


Name:    netflix.com
Address:  192.0.0.69

192.0.0.69? That seems like a private IP? And just one hop (see below)? Very confusing.

Code:
tracert netflix.com

Tracing route to netflix.com [192.0.0.69]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  RT-AX86U-4168 [192.168.1.1]
  2    44 ms    46 ms    48 ms  192.0.0.69

Trace complete.

tracert amazon.co.uk

Tracing route to amazon.co.uk [54.239.33.58]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  RT-AX86U-4168 [192.168.1.1]
  2    47 ms    47 ms    48 ms  10.8.2.1
  3    47 ms    49 ms    47 ms  5.226.136.129
  4    61 ms    45 ms    50 ms  ae2.rt0-thn2.ldn.as25369.net [5.226.136.9]

Else using the normal CleanBrowsing (Family) set as WAN DNS:

Code:
nslookup netflix.com
Server:  RT-AX86U-4168
Address:  192.168.1.1


Non-authoritative answer:
Name:    netflix.com
Addresses:  2a05:d018:76c:b685:c898:aa3a:42c7:9d21
          2a05:d018:76c:b683:e1fe:9fbf:c403:57f1
          2a05:d018:76c:b684:b233:ac1f:be1f:7
          18.200.8.190
          54.155.246.232
          54.73.148.110
 
Last edited:
192.0.0.69? That seems like a private IP? And just one hop (see below)? Very confusing.

It's not even a private IP. It belongs to IANA. See ARIN. Specifically,

Addresses from this range should not be used as an alternative to the private IPv4 address ranges assigned by the IETF in the Best Current Practice document, RFC 1918, which can be found at:

When your connection is borked again, try checking that routing table.
 
ColinTaylor and others I have new data today. This is not following an internet connection failure, but rather just a modem reboot and what appears to be a successful restore of WAN connection. Albeit I do see the errors 'ISP's DHCP did not function properly'. I am trying to work out if the router is correctly dealing with the WAN DHCP lease/renewal stuff.
Router syslog:
The relevant bits I think are:
Code:
Aug 10 04:37:48 kernel: eth0 (Int switch port: 3) (Logical Port: 3) (phyId: c) Link DOWN.
Aug 10 04:37:51 WAN_Connection: ISP's DHCP did not function properly.
Aug 10 04:37:56 WAN_Connection: WAN(0) link down.
Aug 10 04:38:10 kernel: eth0 (Int switch port: 3) (Logical Port: 3) (phyId: c) Link Up at 1000 mbps full duplex
Aug 10 04:38:11 WAN_Connection: WAN(0) link up.
Aug 10 04:38:11 rc_service: wanduck 1119:notify_rc restart_wan_if 0
And:
Code:
Aug 10 04:38:14 kernel: eth0 (Int switch port: 3) (Logical Port: 3) (phyId: c) Link DOWN.
Aug 10 04:38:17 kernel: eth0 (Int switch port: 3) (Logical Port: 3) (phyId: c) Link Up at 1000 mbps full duplex
Aug 10 04:38:19 WAN_Connection: ISP's DHCP did not function properly.
Aug 10 04:38:22 custom_script: Running /jffs/scripts/dhcpc-event (args: leasefail)
Aug 10 04:38:42 custom_script: Running /jffs/scripts/dhcpc-event (args: bound)
Aug 10 04:38:43 wan: finish adding multi routes
And finally:
Code:
Aug 10 04:38:44 WAN_Connection: WAN was restored.
Modem log:
If it is relevant at 09:36am today my router states in respect of WAN lease:
19 hour(s) 2 minute(s) 45 seconds
Is there a way of verifying that the lease obtained is a real proper lease? I remain suspicious that the router is working with expired leases or something.
I will watch out to see what happens at the 12 hour point - that should prompt renewal right? I wonder if the problem is that the router is, in fact, trying to obtain lease too quickly?
This determines how aggressive the router will be in asking for a lease - it has nothing to do with the duration.

That setting was added because Charter would blacklist routers sending them too many DHCP requests in a short period of time. Their modem tends to take longer than the router to get fully up, so on a reboot, the router would end up sending DHCP requests before the modem was ready, which led to customers getting blacklisted by Charter, forcing them to turn off modem and router for 5-10 mins to fix things.

In aggressive mode, the router will work the same way it used to work before.

In Normal mode, it will wait up to 3 minutes before re-sending DHCP requests if the first one didn't get answered, to avoid blacklisting. So it might take longer to obtain a lease if you just rebooted modem or router.

For most people, Aggressive is fine. For Charter users (and other ISPs that are more aggressive), set it to Normal.
Update: I just checked at the 12 hour point, and all I see in the syslog is:
Code:
Aug 10 16:38:43 custom_script: Running /jffs/scripts/dhcpc-event (args: renew)
Full syslog here:
Seems (args: bound) is for new IP and (args: renew) is just for renewal.

For fun I just tried unplugging ethernet cable to modem and putting it back in, and here is what I see on syslog:
Code:
Aug 10 17:04:46 kernel: eth0 (Int switch port: 3) (Logical Port: 3) (phyId: c) Link DOWN.
Aug 10 17:04:50 kernel: eth0 (Int switch port: 3) (Logical Port: 3) (phyId: c) Link Up at 1000 mbps full duplex
No relevant entries after this. So no wan restart - would we expect one? It seems 48 hour modem refresh needs WAN restart but ASUS router does not think a disconnect/reconnect needs a WAN restart, or? I don't understand how all this works well enough.

Are we any further to knowing what the problem is with this extra information?
 
Last edited:
Sorry, this has got very long now. Colin Taylor I looked through wanduck.c and it seems to me that your suggestion regarding Network Monitoring may be a good idea. I have set:
Capture.PNG

The wanduck.c code is very hard to follow but I think this gets called with the functions do_dns_detect() and do_ping_detect().
When you set the network monitoring then dns_probe gets set to '1':
Code:
admin@RT-AX86U-4168:/tmp/home/root# nvram show |grep -i dns_probe
dns_probe=1
dns_probe_content=131.107.255.255 112.4.20.71 fd3e:4f5a:5b81::1
dns_probe_host=dns.msftncsi.com
size: 71879 bytes (59193 left)
And also wandog_target (for ping probe) gets set:
Code:
admin@RT-AX86U-4168:/tmp/home/root# nvram show |grep -i wandog
wandog_delay=0
wandog_enable=1
wandog_fb_count=4
wandog_interval=5
wandog_maxfail=12
wandog_target=www.google.com
size: 71879 bytes (59193 left)
And it seems to be actually doing something every 5 seconds (presumably given wandog_interval=5):
Code:
admin@RT-AX86U-4168:/tmp/home/root# tcpdump -vpni tun11 |grep -i msftncsi.com
tcpdump: listening on tun11, link-type RAW (Raw IP), capture size 262144 bytes
    10.8.3.2.43853 > 185.228.168.168.53: 24973+ A? dns.msftncsi.com. (34)
    185.228.168.168.53 > 10.8.3.2.43853: 24973 1/0/0 dns.msftncsi.com. A 131.107.255.255 (50)
    10.8.3.2.51202 > 185.228.168.168.53: 46556+ A? dns.msftncsi.com. (34)
    185.228.168.168.53 > 10.8.3.2.51202: 46556 1/0/0 dns.msftncsi.com. A 131.107.255.255 (50)
    10.8.3.2.36507 > 185.228.168.168.53: 55641+ A? dns.msftncsi.com. (34)
    185.228.168.168.53 > 10.8.3.2.36507: 55641 1/0/0 dns.msftncsi.com. A 131.107.255.255 (50)
    10.8.3.2.50597 > 185.228.168.168.53: 29697+ A? dns.msftncsi.com. (34)
    185.228.168.168.53 > 10.8.3.2.50597: 29697 1/0/0 dns.msftncsi.com. A 131.107.255.255 (50)
^C20443 packets captured
20509 packets received by filter
0 packets dropped by kernel

admin@RT-AX86U-4168:/tmp/home/root#
admin@RT-AX86U-4168:/tmp/home/root# tcpdump -vpni tun11 |grep -i google.com
tcpdump: listening on tun11, link-type RAW (Raw IP), capture size 262144 bytes
    10.8.3.2.33411 > 185.228.168.168.53: 32861+ A? www.google.com. (32)
    10.8.3.2.33411 > 185.228.168.168.53: 34397+ AAAA? www.google.com. (32)
    185.228.168.168.53 > 10.8.3.2.33411: 32861 1/0/0 www.google.com. A 216.239.38.120 (48)
    185.228.168.168.53 > 10.8.3.2.33411: 34397 1/0/0 www.google.com. AAAA 2001:4860:4802:32::78 (60)
    10.8.3.2.39070 > 185.228.168.168.53: 59112+ A? www.google.com. (32)
    10.8.3.2.39070 > 185.228.168.168.53: 60648+ AAAA? www.google.com. (32)
    185.228.168.168.53 > 10.8.3.2.39070: 59112 1/0/0 www.google.com. A 216.239.38.120 (48)
    185.228.168.168.53 > 10.8.3.2.39070: 60648 1/0/0 www.google.com. AAAA 2001:4860:4802:32::78 (60)
^C4965 packets captured
4968 packets received by filter
0 packets dropped by kernel

So I guess this would detect broken WAN state and force a restart_WAN?
I would still really like to understand why this is necessary and what is ultimately breaking the connection.
 
So I guess this would detect broken WAN state and force a restart_WAN?
I would still really like to understand why this is necessary and what is ultimately breaking the connection.
If the wanduck queries are routing out tun11, how does your WAN DHCP respond on the real WAN interface for DHCP renewal?
 
How can I check? I am really totally lost with how WAN DHCP is supposed to work. I don't know if it is working properly in my case or not.
I noticed:
Code:
Aug 10 16:38:43 custom_script: Running /jffs/scripts/dhcpc-event (args: renew)
And also a 'bound' entry earlier.
Code:
Aug 10 04:38:14 kernel: eth0 (Int switch port: 3) (Logical Port: 3) (phyId: c) Link DOWN.
Aug 10 04:38:17 kernel: eth0 (Int switch port: 3) (Logical Port: 3) (phyId: c) Link Up at 1000 mbps full duplex
Aug 10 04:38:19 WAN_Connection: ISP's DHCP did not function properly.
Aug 10 04:38:22 custom_script: Running /jffs/scripts/dhcpc-event (args: leasefail)
Aug 10 04:38:42 custom_script: Running /jffs/scripts/dhcpc-event (args: bound)
Aug 10 04:38:43 wan: finish adding multi routes
Does the existence of those mean that DHCP is being correctly renewed and initially set? Or could that be bogus?
My WAN IP lease is 24 hours. Every 48 hours my LTE modem resets. The LTE modem reset seems to have the potential to cause loss of internet connectivity. I don't understand whether this is to do with a WAN DHCP lease problem or not.
 
How can I check? I am really totally lost with how WAN DHCP is supposed to work. I don't know if it is working properly in my case or not.
I noticed:
Code:
Aug 10 16:38:43 custom_script: Running /jffs/scripts/dhcpc-event (args: renew)
And also a 'bound' entry earlier.
Code:
Aug 10 04:38:14 kernel: eth0 (Int switch port: 3) (Logical Port: 3) (phyId: c) Link DOWN.
Aug 10 04:38:17 kernel: eth0 (Int switch port: 3) (Logical Port: 3) (phyId: c) Link Up at 1000 mbps full duplex
Aug 10 04:38:19 WAN_Connection: ISP's DHCP did not function properly.
Aug 10 04:38:22 custom_script: Running /jffs/scripts/dhcpc-event (args: leasefail)
Aug 10 04:38:42 custom_script: Running /jffs/scripts/dhcpc-event (args: bound)
Aug 10 04:38:43 wan: finish adding multi routes
Does the existence of those mean that DHCP is being correctly renewed and initially set? Or could that be bogus?
My WAN IP lease is 24 hours. Every 48 hours my LTE modem resets. The LTE modem reset seems to have the potential to cause loss of internet connectivity. I don't understand whether this is to do with a WAN DHCP lease problem or not.
Maybe it’s just DNS being sent through the tunnel. I don’t use VPN so it just struck me as odd at first. Not sure it’s anything to chase down.
 
It appears that udhcpc is bound to the WAN interface, so will hopefully ignore any VPN. The fact that there was a renew event at 50% of the lease time which didn't generate an error makes me think it's working.

Code:
 2430 admin     3492 S    /sbin/udhcpc -i eth0 -p /var/run/udhcpc0.pid -s /tmp/udhcpc -t2 -T5 -A160 -O33 -O249
 
So is the ordinary process as follows:

- router obtains WAN IP
- at 50% lease expiry router renews IP (args: renew)
- else if connection fails, router obtains new IP (args:bound)

So this would be all ticking away nicely save for the 48 hour modem refresh, when presumably ISP forces reconnect and invalidates the WAN IP lease. Following this forced reconnect, the ASUS router is not aware that the WAN IP lease expired (why not?) and so it goes on thinking everything is great. Internet is dead until manual restart.

Could my modem be blocking some kind of WAN IP renewal request from my ISP with these entries:

Code:
Detect UDP port scan attack, and the attack has been blocked, scan packet from 10.36.5.127

Or is that something unrelated?
 

Similar threads

Latest threads

Sign Up For SNBForums Daily Digest

Get an update of what's new every day delivered to your mailbox. Sign up here!
Top