What's new

Scribe syslog-ng not starting on boot if IPv6 enabled.

  • SNBForums Code of Conduct

    SNBForums is a community for everyone, no matter what their level of experience.

    Please be tolerant and patient of others, especially newcomers. We are all here to share and learn!

    The rules are simple: Be patient, be nice, be helpful or be gone!

Also, it bears repeating that programming is not my day job, it's strictly a hobby.
I've been with you since before scribe was just a ghost of an idea, and I've admired each step. None of this is a criticism.

My point here is that trying to solve this gremlin does too much damage to the solid version in place two weeks ago, and I my suggestion is to revert this looping.

If someone who is experiencing the issue wants to help, they could run without scribe, and in a terminal kill syslogd and klogd, and see if in the resulting logs there are some clues about what restarts them. Or, run syslog-ng from a terminal without killing syslogd and klogd and then kill them from the terminal. I don't think there is anything fundamentally dangerous about both syslog-ng and the daemons both hitting the sockets, it's just that log messages won't necessarily be going to one place, or the same place.
 
I've been with you since before scribe was just a ghost of an idea, and I've admired each step. None of this is a criticism.

My point here is that trying to solve this gremlin does too much damage to the solid version in place two weeks ago, and I my suggestion is to revert this looping.

If someone who is experiencing the issue wants to help, they could run without scribe, and in a terminal kill syslogd and klogd, and see if in the resulting logs there are some clues about what restarts them. Or, run syslog-ng from a terminal without killing syslogd and klogd and then kill them from the terminal. I don't think there is anything fundamentally dangerous about both syslog-ng and the daemons both hitting the sockets, it's just that log messages won't necessarily be going to one place, or the same place.
I know your words weren't meant as criticism.

It depends on your view of "too much damage". You are concerned with keeping 100% of your logs - others, myself included, are more concerned with seamless operation, and are willing to give up a few log entries during boot-up to have it otherwise "just work". "Too much damage" for many people is syslog-ng failing to start at boot; not everyone using these scripts is technical, and issues like this lead to reports of "scribe doesn't work on my router" without any context or clue what the actual problem is. In my view the "least harm" solution is syslog-ng starting properly on everyone's router at the expense of some people losing some undefined number of log entries. I understand that you and other people may not view that as the "least harm", but a choice has to be made.

I had originally put the kill instruction outside the do loop because I was more worried about it taking a while for the system to kill syslogd and klogd than I was about the possibility of them being restarted after being killed. I was worried about the effect of issuing a kill signal to the same process twice, and the idea that they might get automatically restarted within that one second honestly never even crossed my mind. Turns out I was worried about the wrong thing.

It's hard to gauge here if the boot-up issues affect just a few people or many, and worse, if the problem will slowly spread as ASUS tweaks things here and there on various model build profiles. For instance, RMerlin said (if I understood correctly) that the location of the syslog.log file is just a flag in the per-model build profile, so I decided to anticipate ASUS slowly transitioning models to syslog.log being on /jffs instead of /tmp. That expectation gave rise to the need to come up with a method to detect where it was putting syslog.log that wasn't model or firmware version dependent. I don't think that requiring the people that this start-up issue affects to do some additional things is a tenable long-term solution.
 
Another thing occurs to me. If in this boot cycle something is restarting the daemons, are we sure that thing isn't restarting them later? We might have syslog-ng running, but if syslogd is restarted we have two processes reading from sockets at the same time.

I think we need more data from the folks with this issue; maybe they run a cronjob for a while testing whether there is a PID for syslogd.
 
I'm all too willing to help out with any further investigation on this. While I'm currently enjoying this new code I do understand the desire to not miss any messages in the log files. I would be willing to give some of my down time to help, and can live without scribe for a while if needs be as currently I have no problems I need to keep an eye on. Besides, it'll help me understand a little more.
 
Another thing occurs to me. If in this boot cycle something is restarting the daemons, are we sure that thing isn't restarting them later? We might have syslog-ng running, but if syslogd is restarted we have two processes reading from sockets at the same time.

I think we need more data from the folks with this issue; maybe they run a cronjob for a while testing whether there is a PID for syslogd.
Once the router is running and all the jffs scripts are active, the known things that restart syslogd and klogd will trip the service-event script and they will be killed. In theory it should just be 'logger start' that gets called by anything that want to try to restart them, but oddly enough 'time start' will restart them also without tripping the 'logger start' service event. So anytime service event sees 'logger' or 'time' and anything but 'stop', it loads up the scribe script and kills syslogd and klogd. I thought about making that it's own little script to reduce the time to load to kill those, but I think the time difference is negligible. There's a good chance scribe stays in the SSD onboard DRAM cache if it exists (not all SSDs have DRAM cache), or even in "used but available" memory so loading would be essentially instantaneous.

So if that's changed and other things are starting syslogd and klogd without tripping logger start, then it really doesn't even have to be people who've had issues at boot-up that could see syslogd and klogd start without tripping service-event. Rather than cron, which would be hit or miss, the service-event script could be tweaked to see if there is a live syslogd PID, and log it if there is.
 
I'm all too willing to help out with any further investigation on this. While I'm currently enjoying this new code I do understand the desire to not miss any messages in the log files. I would be willing to give some of my down time to help, and can live without scribe for a while if needs be as currently I have no problems I need to keep an eye on. Besides, it'll help me understand a little more.
Thanks. I may try to code a watchdog into the service event script that will log PIDs of syslogd and klogd any time service event fires. See if something besides logger or time starts them. Not tonight though, maybe not until next week, there's a lot going on right now.
 
then it really doesn't even have to be people who've had issues at boot-up that could see syslogd and klogd start without tripping service-event
Ah. Good point. Just because someone's been bit by a shark doesn't mean the water's safe for the rest of us.
I do understand the desire to not miss any messages in the log files.

Specifically, the messages that are being logged at the exact moment that this behaviour is manifesting itself. The webcam is recording you making a police report while overwriting the film of the accident.
 
Specifically, the messages that are being logged at the exact moment that this behaviour is manifesting itself. The webcam is recording you making a police report while overwriting the film of the accident.
I do get it, but I don't really have the time to try to track down what's happening, and I'm more or less convinced that even if I did, I'd end up with something I have no control over - I know that for some people just enabling IPv6 triggers it, and I can't possibly imagine a scenario where I could change anything that would change that behavior. It's even possible (though much less likely) that some ISPs are mis-configured in such a way that it's triggering the syslogd/klogd restart, and if that were the case, then even Asus probably couldn't resolve without breaking something else. The fact that the internet works at all is somewhat of a miracle considering how much mis-configured stuff is out there.

It's really easy for someone who's concerned about the lost messages to move the kill statements back outside the loop and do their own troubleshooting. Yes, they'd have to fix it every time scribe updated, but that's not a huge deal for a technical person. Heck, if someone can find a reliable way to fix it that doesn't introduce problems that require manual intervention, I'll probably add it in.

If you step back, what you're asking is to break it for some non-technical people to make it easier for an almost certainly smaller number of technical people to troubleshoot. I prefer to make it work for non-technical people and let the technical people play with it to see if they can figure out what's going on. I just can't in good conscious say "if you're non-technical and it doesn't work for you, oh well it sucks to be you even though I can fix it so it does work for you at the expense of losing some of the messages during boot-up that you probably don't care about anyways." These are the kinds of philosophical differences that code forks are made of.

Having said all that (dang, I can be long-winded), I'm having to replace my AC86U - it keeps rebooting spontaneously after being on a few days, and sometimes crashes and needs a power cycle. If my new router has the bootup issue, maybe late in July I may have time to fiddle with it and see if I can figure out what's causing it. Or maybe I'll have time to code in an "elorimer switch" for advanced users that uses the old method. :cool: So who knows what the future may bring.
 
Thanks. I may try to code a watchdog into the service event script that will log PIDs of syslogd and klogd any time service event fires. See if something besides logger or time starts them. Not tonight though, maybe not until next week, there's a lot going on right now.
Put the following line near the top of /jffs/scripts/service-event:
Code:
printf "$( date ) $0 $* \tsylogd: %s\tklogd: %s\n" "$( /bin/pidof syslogd )" "$( /bin/pidof klogd )" >> /opt/var/log/services.log
Periodically (like every couple days) look at the /opt/var/log/services.log file and see if any numbers are listed next to syslogd: or klogd: - that would indicate that something kicked them off. Hopefully the log will show that 'logger start' or 'time start' is what kicked them off, which means they will be killed automatically. If it's not one of those two, then I'll need to add another culprit to the list.
 
Put the following line near the top of /jffs/scripts/service-event:
Code:
printf "$( date ) $0 $* \tsylogd: %s\tklogd: %s\n" "$( /bin/pidof syslogd )" "$( /bin/pidof klogd )" >> /opt/var/log/services.log
Periodically (like every couple days) look at the /opt/var/log/services.log file and see if any numbers are listed next to syslogd: or klogd: - that would indicate that something kicked them off. Hopefully the log will show that 'logger start' or 'time start' is what kicked them off, which means they will be killed automatically. If it's not one of those two, then I'll need to add another culprit to the list.
Welp, at least with the AX86U Pro (388.2_2), there seem to be multiple culprits. What. A. Pain. Okay, scorched earth time, anytime service-event is called, find and kill klgod & syslogd. I was trying to minimize CPU cycles by only trying to kill them in known instances, but it looks like that will also be a game of whack-a-mole.

EDIT: Not sure about that now. Ugh, not enough time to thoroughly test now, so brute force / scorched earth will remain. Hope the cpu impact isn't too much.
 
Last edited:
What did you find besides time and logger?
I started to think ddns and dnsmasq but now I think not. I wasn't really clever enough with my logging and I think I misread the cause. The cases I caught may have been actually started by logger. I made my logging a bit more rational and I'm going to monitor for the next few days and see what turns up.
 

Latest threads

Sign Up For SNBForums Daily Digest

Get an update of what's new every day delivered to your mailbox. Sign up here!
Top