AC86U - httpd not responding

ScottW

Senior Member
I have an AC86U with Merlin 386.3_2. It was set up "from scratch" a few months ago, after installing 386.3_2 and doing a full reset to defaults. The only additions are Diversion and Skynet. There are about 40 active devices. NVRAM usage is 70168/131072. There is a 2 GB swap file, which was created by amtm. JFFS storage shows 3.25/47.00 MB.

No problems of any kind -- all very stable -- UNTIL the last few weeks. No recent settings changes to correlate with the beginning of the issue. Perhaps a couple new devices on the LAN, but that's it.

The issue: After perhaps 5-7 days of uptime, when attempting to login to the webUI, the AC86U accepts userID and password, then hangs. Traffic is still being routed just fine on LAN/WAN and wireless still working fine. But the web interface is "dead" after accepting credentials.

I can still log in via ssh. After a "service restart_httpd", the webUI will respond with a login screen. But as before, it hangs immediately after entering credentials and clicking "Sign in".

I don't see anything abnormal with cpu or memory in htop. A reboot gets everything working again -- and I can login normally for a few days. But the problem eventually occurs again.

Other things I've tried: I disabled the "web integration" for both Diversion and Skynet, but the problem still recurred a few days later. I deleted/recreated the Swap file (using amtm), but the problem still recurred a few days later.

Any common "known causes" for httpd hanging like this? Anything else I can try in ssh to debug/correct it when it occurs?
 
What browser(s) have you tested with? Have you tried disabling all extensions? Have you changed any of the browser settings?

Even if you haven't done it yourself, updates to browsers bring new defaults (often).

Is this the same device you use to connect to the router with each time?

Have you tried rebooting the client device first before trying to log on?
 
Hi L&LD -- Thanks for the suggestions!


Two different browsers on three different clients and two different OS.... Safari on iOS, and Edge on Windows. No extensions on Safari, it's plain vanilla. A couple extensions on Edge; I will try disabling those next time to see if there's any difference. Rebooting devices (iPad, iPhone, Windows10 desktop) hasn't helped.

I don't think it is browser or client related, since behavior is the same with two very different browsers on two different platforms. Also (after a router reboot) it immediately works fine from those same browsers/clients for several days. It *seems* like a memory issue or something that occurs over a period of time, but the memory stats look good (via 'free -h') and NVRAM looks good (via 'nvram show | grep ^size').

Since I *do* still have SSH access when it happens, I'm hoping there's something more I can look at from the shell that might give a clue the next time it happens? Just restarting httpd gives me another login dialog via http, but it just hangs again (until after rebooting the router again).
 
Seems like the basics have been accounted for, hope others can give more pertinent suggestions.

Maybe the next step is to disable one or both of the scripts you use, to test further.
 
I do have Diversion and Skynet running. No visible sign of any trouble there; both appear to be working fine. I did disable their "web integrations", but the problem occurred again a few days later. I have also disabled them (via their respective menu options) AFTER the problem occurs, and that didn't resolve it... But that's not quite the same as disabling/uninstalling them completely, so they don't load at all during boot. I hate to go there, since I've used them both for years without issue and consider both highly valuable.

I considered it might be an issue with the usb stick, so I had amtm delete/recreate the swap file -- which it did without any errors. The amtm "disk check script" declares the usb "clean" at each boot, and there's plenty of space (about 26gb available on 32gb stick).

At the moment, I can login to the webUI just fine -- but current uptime is only 2 days, and the problem doesn't usually surface for 5-7 days.
 
I just changed main router to AC86U and moved AC88U to Mesh Node a week ago. Performance seemed better but have just experienced this same oddity. Have a log server running and see many of these error messages...

Error locking /var/lock/clientlist.lock: 28 No space left on device.

The router self-corrected with...
watchdog 1670:notify_rc stop_httpd
watchdog 1670:notify_rc start_httpd
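
Since the lock error above reports "No space left on device", one thing that can be ruled out quickly is whether the filesystem behind /var/lock is actually full. The following is only a minimal sketch, assuming a `df` that accepts `-P` (BusyBox and GNU both do). The helper name `use_pct_from_df` and the 95% threshold are made up for illustration.

```shell
#!/bin/sh
# use_pct_from_df: read "df -P" output on stdin and print the use
# percentage (without the % sign) from the first filesystem line.
use_pct_from_df() {
    # With -P each filesystem is on one line; field 5 is the capacity column.
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

dir=/var/lock
[ -d "$dir" ] || dir=/tmp   # fallback so the sketch also runs off-router

pct=$(df -P "$dir" 2>/dev/null | use_pct_from_df)
case "$pct" in
    ''|*[!0-9]*)
        echo "could not read usage for $dir" ;;
    *)
        if [ "$pct" -ge 95 ]; then   # 95 is an arbitrary example threshold
            echo "WARNING: filesystem for $dir is ${pct}% full"
        else
            echo "filesystem for $dir is ${pct}% full"
        fi ;;
esac
```

If usage looks normal, the message is probably not about ordinary disk space, which would be consistent with what the later posts in this thread found.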

Looking around for the issue has led me to a couple other threads.


I'm still digging but these may help you troubleshoot and hopefully find the root cause.
 
Hasn't Diversion always had problems with the RT-AC86U? I know it won't run here without issues!
 
I'm still digging but these may help you troubleshoot and hopefully find the root cause.
Thanks - I'll check out those threads. Doesn't seem to be a memory or file space issue from what I can see, but maybe related to a specific file, so I'll read through those threads.

The only change I can think of was that I added a couple devices to the LAN -- meaning I also added to the ClientList and DHCP reservations. Around 40-45 devices are active, depending on what's turned on (several pieces of electronics test gear). The AC86U's memory usage itself looks fine, there's a 2 GB swap file, and NVRAM looks fine, but perhaps there's some other limit somewhere in Networkmap that I bumped up against.
 
Hasn't Diversion always had problems with the RT-AC86U? I know it won't run here without issues!
Nope. I've been running Diversion (and Skynet) on my AC86U for 3 years. Zero issues until the last couple months. Perhaps there were issues on the AC68U, I don't know, but a lot of people confuse the two.
 
Not sure if this would do it for you (see attached file), but I had the problem where the GUI just stops working after a while, even though the router seemed to be working and I could log in using ssh. Since this problem has existed for a long time with no prospect of getting resolved, I wrote a script to restart the httpd service every 4 hours. That timing may be overkill, but there does not seem to be a downside to doing this. Since doing this, I have had no problems with the GUI. For good measure, I also reset DNSMASQ and the connmon GUI. If you do not want to do that, you could edit the file. When you manually run the script, it will add a cron job and add an entry to post-mount so it will survive a reboot. If you use this file, copy it to the /jffs/scripts directory, rename it to remove the .txt extension, set the permissions to 755, then run "sh restart-GUI" (without the quotes). I'd be curious if this solves your problem.
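
The attached file's contents are not shown in the thread, so the following is only a rough sketch of the approach described above, not the actual script. It assumes Asuswrt-Merlin's `cru` helper in its `cru a <id> "<schedule> <command>"` form. The job name `restart_gui` and the helper `cron_line` are hypothetical.

```shell
#!/bin/sh
# Hypothetical stand-in for the idea described above; the real attachment may differ.
JOB_ID="restart_gui"              # made-up job name
SCHEDULE="0 */4 * * *"            # minute 0 of every 4th hour
COMMAND="service restart_httpd"   # same command used manually earlier in the thread

# Build the crontab-style line "<schedule> <command>".
cron_line() {
    printf '%s %s\n' "$SCHEDULE" "$COMMAND"
}

if command -v cru >/dev/null 2>&1; then
    # On the router: register (or replace) the job under JOB_ID.
    cru a "$JOB_ID" "$(cron_line)"
else
    # Off-router: just show what would be registered.
    echo "cru not found; would add: $(cron_line)"
fi
```

Because jobs added this way do not persist across a reboot, the post's idea of re-adding the job from a boot-time script such as post-mount would still be needed.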
 

Attachments

  • restart-GUI.txt
    973 bytes · Views: 119
Thanks, but as stated earlier -- I have already tried restarting the gui (service restart_httpd) when it hangs. It doesn't fix the problem. All it does is allow another login attempt -- but the GUI hangs again after entering credentials and clicking the login button. I have tried manually restarting DNSMASQ when the problem is present, and that didn't affect it either. I don't use connmon.

But I did find two new clues today!

CLUE #1: I am now pretty sure it isn't "after a few days" that this happens. I wasn't always logging in every consecutive day but started doing so -- and the last two times this happened were Sunday mornings, despite being a different number of days since reboot. That led me to look at what scripts are running Saturday night / early Sunday morning (some diversion update/backup scripts, then some scripts I wrote to backup certain folders and rotate/purge old backups).

CLUE #2: Looking at logs and backup files, most of those scripts seem to be working fine. EXCEPT FOR ONE. That script includes the line "nvram show >> $filename" to backup nvram. The resulting text file is normally about 69k most nights but is 0k on the nights before the httpd hangs. Completely empty -- implying "nvram show" had no output.

What would cause "nvram show" to produce no output? Keep in mind, each time I have found the gui unresponsive -- I logged on via ssh, and "nvram show" worked fine with normal output. That's one of the first things I checked. Total/used size also looks fine, i.e. "size: 69985 bytes (61087 left)". But "nvram show" created an empty output when run a few hours earlier by cron, so it seems to be a transient/intermittent issue.

I did update that script to capture STDERR in future (it was being discarded), so perhaps the next time it happens I will have another clue -- if there is an error message being thrown.
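
A minimal sketch of that kind of change: keep stderr next to the dump and flag an empty result straight away. The helper name `dump_to_file` and the backup path in the usage comment are placeholders, not the names used in the actual script.

```shell
#!/bin/sh
# dump_to_file OUTFILE CMD [ARGS...]
# Runs CMD, saving stdout to OUTFILE and stderr to OUTFILE.err.
# Returns 0 only if CMD exited successfully and OUTFILE is non-empty.
dump_to_file() {
    out=$1; shift
    "$@" >"$out" 2>"$out.err"
    rc=$?
    if [ "$rc" -ne 0 ] || [ ! -s "$out" ]; then
        # Record when it happened and how the command ended.
        echo "$(date): '$*' exit=$rc, output=$(wc -c <"$out") bytes" >&2
        return 1
    fi
    return 0
}

# Possible use on the router (placeholder path):
#   dump_to_file "/path/to/backups/nvram-$(date +%Y%m%d).txt" nvram show
```

The `-s` test is what catches the 0-byte case described above, even when the command itself reports success.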
 
I had a similar issue last week when a script was not running as expected (spdMerlin).
I blocked the script via cron, rebooted the router, and forced a Diversion reinstall from amtm.

After that everything went back as usual ... stable.
 
I rewrote and combined the scripts for that night into a single script, with logging and redirect of Stderr for each step. The empty file from "nvram show >filename" is the main oddity, so if that shows up again and Stderr doesn't have any clue then I'll add more diagnostics around it. I'd really like to diagnose the root cause rather than just shot-gunning other stuff that might mask it, at least until my patience runs out. :cool:
 
This happened again this week (unable to login to webui). I gathered one "new" piece of information before rebooting.

As before, "service restart_httpd" doesn't help. It allows another login attempt, but that hangs again when the "sign in" button is clicked.

One of the scripts that ran the night before backs up nvram using "nvram show >$fn" (where $fn includes a date string). That file is normally about 69k, consistent with the "used" size of my nvram. But the most recent one was 0 bytes.

From my SSH session, I tried "nvram show". It hung -- no output at all. Was still able to break out with ctrl-C. Next I tried a few random "nvram get <name>". Those worked fine, with correct output.

I then created a script to do "nvram get <name>" for every variable name individually. When that "hung" at a particular variable, I commented that line and ran again. I did that repeatedly until I had identified all the variable names that cause "nvram get <name>" to hang. It does seem that *most* of the problems were with variables backed by /jffs/nvram files. But not all -- for example, "nvram get" also hung with "oauth_dm_refresh_ticket" and "ipv6_dns3".
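
The comment-out-and-rerun loop described above could be automated by giving each lookup a time limit. This is a sketch under assumptions, not the script that was actually used. The helpers `run_with_timeout` and `probe_names` are invented names, the variable names are assumed to come from an earlier good "nvram show" dump, and the process is assumed to be killable (Ctrl-C worked above).

```shell
#!/bin/sh
# run_with_timeout SECS CMD [ARGS...]
# Runs CMD in the background with a watchdog that kills it after SECS seconds.
# Returns CMD's exit status, or 137 (128 + SIGKILL) if the watchdog fired.
run_with_timeout() {
    secs=$1; shift
    "$@" &
    cmd_pid=$!
    # Watchdog output goes to /dev/null so it never holds a pipe open.
    ( sleep "$secs"; kill -9 "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
    killer_pid=$!
    wait "$cmd_pid"
    rc=$?
    kill "$killer_pid" 2>/dev/null
    wait "$killer_pid" 2>/dev/null
    return "$rc"
}

# probe_names LIMIT CMD [ARGS...]
# Reads one name per line on stdin, runs "CMD ARGS name" with a time limit,
# and prints the names whose lookup had to be killed.
probe_names() {
    limit=$1; shift
    while IFS= read -r name; do
        rc=0
        run_with_timeout "$limit" "$@" "$name" >/dev/null 2>&1 || rc=$?
        if [ "$rc" -eq 137 ]; then
            echo "timed out: $name"
        fi
    done
}

# Possible use on the router, with names taken from an older good dump
# (placeholder file name; multi-line values may yield a few bogus names):
#   cut -d= -f1 /path/to/old-nvram-dump.txt | probe_names 5 nvram get
```

Each lookup runs in its own process, so one hanging variable no longer stops the whole pass.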

So, I think the webUI problem is likely a result of this "nvram get" hang.
Anyone have any ideas what might cause "nvram get <name>" or "nvram show" to completely hang like this??
 
Just keeping this thread updated...

Yesterday I had to power down the router ("safely removed" USB first) because it was time to replace the UPS battery. After that, it powered up normally and seemed to be working fine.

Today I logged into the WebUI, and noticed there were no log entries for the tasks which normally run daily at 11pm and Midnight. Logged in via SSH to check crontab, and the entries were missing. Looked at "services-start" script, and the "cru a" entries looked fine -- so they *should* have been added to crontab at the previous day's power-up.

I rebooted again without changing anything and checked crontab again. Now the entries from "services-start" were there, but so was an additional entry -- incorrectly formatted -- that looked like text from the latter portion of the previous command -- somehow duplicated and inserted as another crontab line. Very strange.

Without changing anything, I rebooted again and looked at crontab again. This time, it looked perfect -- the two entries from services-start were there, the errant entry I previously saw was not. The rest of the entries (vpn watchdog, dns renew, diversion, skynet, etc) were all there and looked good.

All that -- coupled with the previous "nvram show" problems -- led me to wonder if perhaps there's something wrong with the jffs partition. So today I saved settings, reset the router to defaults, reformatted the jffs partition, restored settings and then restored jffs files from a tar backup.

It's all working fine at the moment... But since the original problem (hanging WebUI) sometimes didn't show up for days, I'll have to monitor for a while to see if reformatting jffs had any benefit.
 
Just to close this out...

The problem with the webUI intermittently hanging was related to "nvram get" intermittently hanging, and the "nvram get" intermittently hanging was apparently related to some problem with JFFS. As mentioned in the previous post, I did a reboot, created a tar backup of jffs, reformatted jffs, and restored the tar. No further problems since doing that. My best guess now is that the problem was something like a block on JFFS that was causing intermittent read failures -- and reformatting mapped it out.
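
For anyone repeating the backup and restore part, here is a rough sketch of the two tar steps, with a check that the archive is readable before anything gets reformatted. The function names and the USB path in the comments are placeholders. The reformat itself is done from the webUI between the two steps and is not scripted here.

```shell
#!/bin/sh
# backup_dir SRC ARCHIVE
# Archives the directory SRC (stored under its own name, relative to its
# parent), then confirms the archive can be listed.
backup_dir() {
    src=$1; archive=$2
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    tar -tzf "$archive" >/dev/null || return 1
}

# restore_dir ARCHIVE PARENT
# Unpacks ARCHIVE under PARENT, recreating PARENT/<name of archived dir>.
restore_dir() {
    archive=$1; parent=$2
    tar -xzf "$archive" -C "$parent"
}

# Possible use on the router (placeholder USB path):
#   backup_dir /jffs /tmp/mnt/<usb-label>/jffs-backup.tar.gz
#   ... reformat JFFS from the webUI and reboot ...
#   restore_dir /tmp/mnt/<usb-label>/jffs-backup.tar.gz /
```

Keeping the archive on the USB stick rather than on JFFS itself matters here, since the reformat wipes JFFS.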
 
I have similar issue on a sporadic basis.
Maybe it would be nice to set up a script that, once a week, makes a copy of jffs, formats jffs, and restores it.
I'm using the reboot scheduler every night, so this might be an option.
 
I wouldn't recommend formatting of jffs weekly -- seems like that would be a lot of wear since formatting writes every block. If you suspect jffs might be the problem, I'd say back it up, reformat it and restore -- to see if it helps. Theoretically, if a bad (or intermittently bad) block has developed and is causing read/write problems then formatting should mark that block bad and prevent future occurrences. If jffs frequently develops more bad blocks, it may be time for a new router. But there are probably a lot of things that can cause httpd to stop responding -- and I don't think jffs problems would be near the top of that list. Seems it was the problem in my case -- at least I'm hoping it was, since the problem hasn't recurred since.
 
I am having the same problem... The AC86U accepts the user ID and password, then hangs. Internet access works fine. "nvram show" under SSH does not return. This happens every few days after a reboot. I tried the proposed solution, but the problem remains. I'd like to confirm whether my steps are correct.

1. Back up jffs. I did:
cd /tmp/mnd/RTHDD/backup/
tar -cvpzf backup.tar.gz /jffs
2. In the webUI, I selected "Format JFFS partition at next boot", clicked Apply, and then clicked Reboot. I checked and confirmed that some of the scripts were no longer in /jffs/scripts/, i.e. the format did occur.
3. Restore jffs:
cd /
tar -xvzf /tmp/mnd/RTHDD/backup.tar.gz

4. Issue "halt" under SSH.

I have Skynet, Diversion, and YazFi. Firmware is 386.5. This also happened on 386.4.

Did I miss out any steps?
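
One thing that may be worth ruling out in the steps above: the backup is written from inside /tmp/mnd/RTHDD/backup/, while the restore reads /tmp/mnd/RTHDD/backup.tar.gz one level up. That may just be a transcription slip, but it could be worth confirming that the archive being restored exists and actually holds the JFFS files. A minimal sketch follows. The helper name `check_archive` is made up, and the `jffs/` prefix assumes tar stored the members without the leading slash, as it normally does.

```shell
#!/bin/sh
# check_archive ARCHIVE PREFIX
# Succeeds if ARCHIVE exists, is non-empty, and lists at least one member
# whose name starts with PREFIX (e.g. "jffs/").
check_archive() {
    archive=$1; prefix=$2
    if [ ! -s "$archive" ]; then
        echo "missing or empty: $archive" >&2
        return 1
    fi
    # grep -c prints 0 when nothing matches, so count is always a number.
    count=$(tar -tzf "$archive" 2>/dev/null | grep -c "^$prefix")
    if [ "$count" -gt 0 ]; then
        echo "$count entries under $prefix"
        return 0
    fi
    echo "no entries under $prefix in $archive" >&2
    return 1
}

# Possible use before step 3 (placeholder path):
#   check_archive /path/to/backup.tar.gz "jffs/" && echo "archive looks usable"
```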
 
