Defunct cfg_server Zombies

Oh I mean no matter what I do it will immediately spawn cfg_server until there are approx 1010 running (which I assume has something to do with the little beast’s ulimit but was hoping to not have to care)

I just experienced the most amazing thing, I could use the webui! the network map and system statuses loaded! but within moments there were 28 cfg_server processes which is already plenty to make it unusable again. (9 is too many). Now there’s 59. It’s seriously a tribble problem istg.
 
@evolempt

Have you tried disabling your AiMesh config to confirm whether it's the culprit?

It's my understanding that cfg_client and cfg_server are related to AiMesh.

Kind Regards,


Gary
 
🤦🏻‍♀️ never even crossed my mind. I generally avoid replacing packages like that but i’m absolutely not a purist about it.


truly my lack of sed/awk skills are my greatest professional shame. This 👆🏻 gets me the same output as ps wT | grep cfg_server does.
There is a reason for that: I wanted to see the output so we could tell how many were defunct.

So if you have time, please share the output of

ps wT | awk '/cfg_server/{if( $0 !~ /awk/)printf "%s\n", $0}'
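
For what it's worth, one way to answer the "how many are defunct" question without eyeballing ps output is to read the state straight from procfs. This is a minimal sketch, assuming a Linux /proc with the usual Name: and State: lines in each status file. count_defunct is a made-up helper name, not something that ships with the firmware. It walks processes only, so its total can be lower than what ps wT shows if ps is also listing threads.

```shell
#!/bin/sh
# Hypothetical helper: prints "<total> <defunct>" for processes with the given name.
# $1 = process name, $2 = procfs root (defaults to /proc; overridable for testing).
count_defunct() {
    name=$1
    proc_root=${2:-/proc}
    total=0
    defunct=0
    for status in "$proc_root"/[0-9]*/status; do
        # Skip entries that vanished or cannot be read
        [ -r "$status" ] || continue
        pname=$(awk '/^Name:/ {print $2; exit}' "$status")
        [ "$pname" = "$name" ] || continue
        total=$((total + 1))
        # State is a single letter; Z means zombie (defunct)
        state=$(awk '/^State:/ {print $2; exit}' "$status")
        if [ "$state" = "Z" ]; then
            defunct=$((defunct + 1))
        fi
    done
    echo "$total $defunct"
}
```

Running count_defunct cfg_server on the router would print two numbers: how many cfg_server processes exist and how many of them are in state Z.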
 
You could actually put a single line inside the service-event-end script that triggers this kill script every time the cfg event happens.
Seems worth noting that service-event and service-event-end haven’t run in 22 hours, and when they did last run there were 3 cfg_server processes running. (I have them both logging their execution)
 
What about when you use pstree -s cfg_server?
Code:
# ps wT | grep cfg_server | wc -l
1012
# pstree -s cfg_server
-+- 00001 admin /sbin/init
 \--- 21470 admin cfg_server

What does your one-liner look like that is capturing these events?

I’m not totally sure I understand the question here. So, more info in case this answers it: in /jffs/scripts/service-event and /jffs/scripts/service-event-end I had just added logger lines that resulted in "Running service-event” and “Running service-event-end” popping up in the system log. I have since updated the lines to log the service that triggered the scripts (e.g. /usr/bin/logger "$2 kicked off service-event-end";) but they haven’t run since Sunday. If you’re wondering how I know that there were 3 instances of cfg_server running, it’s just that I happened to have been watching it at that moment.

The service-event-end script includes this line:
Code:
if echo "$2" | /bin/grep -q "^cfgsync"; then { sh /jffs/sbin/zomg_cfg_server & }; fi
but cfgsync has yet to fire as far as I know.

(please assume that any smart quotes that make their way into my posts are not in the script file. XenForo is helping)
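
For anyone repeating this setup, the hook logic described above can also be written with a case statement instead of piping through grep. This is only a sketch under the same assumption the post makes, namely that $2 holds the name of the service that triggered the script. is_cfgsync_event is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the event target starts with "cfgsync".
is_cfgsync_event() {
    case "$1" in
        cfgsync*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Possible use inside /jffs/scripts/service-event-end:
#   /usr/bin/logger "$2 kicked off service-event-end"
#   if is_cfgsync_event "$2"; then sh /jffs/sbin/zomg_cfg_server & fi
```

The case pattern does the same prefix match as grep -q "^cfgsync" without spawning an extra process each time the hook runs.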
 
FWIW... I've noticed that my implementation of zomg_cfg_server executes, but never actually logs within the service-event-end script.
 
No, I am just asking so I can repeat your tests. Basic debugging 101. I am going to look at this more myself. I only get the one pstree entry as well, so it appears that must be the parent. The zombie processes must be processes that never complete a specific task. So something is happening here that prevents the parent from reaping the child process, or the child process is waiting for something from the parent and never finishes. It could be something as simple as a lock file holding up a process in the wrong spot. In this instance, it may be more important to kill the parent; the zombie processes should die as well. Trigger a full restart of cfg afterwards.
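
Since a defunct entry only goes away once its parent collects its exit status (or the parent itself exits), the parent PID is the thing to look up before deciding what to kill. This is a minimal sketch, assuming a Linux /proc where each status file carries a PPid: line. ppid_of is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: prints the parent PID of the given PID.
# $1 = PID, $2 = procfs root (defaults to /proc; overridable for testing).
ppid_of() {
    pid=$1
    proc_root=${2:-/proc}
    # Prints nothing if the process no longer exists
    awk '/^PPid:/ {print $2; exit}' "$proc_root/$pid/status" 2>/dev/null
}
```

If ppid_of prints the same PID for every defunct cfg_server entry, that one process is the parent that is failing to reap them.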
 
🤦🏻‍♀️ never even crossed my mind. I generally avoid replacing packages like that but i’m absolutely not a purist about it.


truly my lack of sed/awk skills are my greatest professional shame. This 👆🏻 gets me the same output as ps wT | grep cfg_server does.

this however ps wT | awk '/cfg_server/{if( $0 !~ /awk/)printf "%s ", $1}' produces the output we would expect out of pidof.
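
A side note on the $0 !~ /awk/ guard: a common alternative (not the approach used in the scripts in this thread) is to wrap one character of the pattern in brackets, so the filter's own command line no longer contains the text it is searching for. This is a small sketch, assuming ps prints the PID in the first column as it does in the output quoted here. pids_matching is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: reads ps-style lines on stdin and prints the first
# column (PID) of every line matching the pattern, space separated.
# Pass the pattern with one bracketed character, e.g. '[c]fg_server':
# the regex still matches "cfg_server", but the literal text "[c]fg_server"
# on grep's own command line does not match it.
pids_matching() {
    grep -- "$1" | awk '{printf "%s ", $1}'
}

# Possible use on the router:
#   ps wT | pids_matching '[c]fg_server'
```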

I maybe fell down a rabbit hole today, but I have a functional script for my XT8.

Code:
#!/bin/sh


i=1
awk_pidof() { ps wT | awk '/\scfg_server/{if( $0 !~ /awk/)printf "%s ", $1}'; }

if [ "$(ps wT | grep $(basename "$0") | wc -l)" -gt 3 ]; then
  # has to be gt 2 because otherwise grep gets caught in the dragnet
  # gt 3 because… for some reason there are always 2 zomg processes?
  echo "Exiting zombie slayer because it thinks it's a duplicate process"
  exit;
fi

while [ "$(awk_pidof | wc -w)" -le 5 ]; do sleep 1; done

for pid in $(awk_pidof); do
   if [ "$i" -gt "5" ]; then
      # echo $i
      # echo "killing pid $pid"
      /bin/kill -s 9 $pid
      /usr/bin/logger "Zombie slayer killing pid $pid"
   fi
   i=$((i + 1))
done
Try this improvement of your previous script.

Code:
#!/bin/sh


i=1
awk_pidof() { ps wT | awk '/\scfg_server/{if( $0 !~ /awk/)printf "%s ", $1}'; }

if [ "$(ps wT | awk -v var="$(basename "$0")" '{if( $0 !~ /awk/ && $0 ~ var )printf "%s ", $1}' | wc -w)" -gt 2 ]; then
  # match with `$0 ~ var`; inside /.../ awk treats $var as literal regex text, not the variable
  # gt 2 because, as noted above, ps always lists 2 zomg entries for a single run
  echo "Exiting zombie slayer because it thinks it's a duplicate process"
  exit;
fi

while [ "$(awk_pidof | wc -w)" -le 3 ]; do sleep 1; done

for pid in $(awk_pidof); do
   if [ "$pid" != "$(pstree -s cfg_server | awk '{if($0 !~ /init/)print $2}')" ]; then
      # echo $i
      # echo "killing pid $pid"
      /bin/kill -s 9 $pid &
      /usr/bin/logger "Zombie slayer killing pid $pid"
   fi
   i=$((i + 1))
done
 
So that’s a fun experiment to run. It also may explain some logs I got on Sunday (can’t kill pid that doesn’t exist) that I assumed were due to the fact that I somehow was running two zombie slayer processes.

In ssh session #1

Code:
# while [ true ]; do ps wT | grep cfg_server | wc -l && sleep 2; done
1012
1011
1012
1
1
1
…
1
1
5
5
1
1
…
1
1
3
5
16
1
1
… etc

in ssh session #2

Code:
# pstree -s cfg_server
-+- 00001 admin /sbin/init
 \--- 21470 admin cfg_server
# kill -9 21470

I’ll give you one guess as to when i ran kill.

Meanwhile, in the system log…

Code:
Mar 21 13:27:35 (redacted): cfgsync kicked off service-event
Mar 21 13:27:36 (redacted): cfgsync kicked off service-event-end
Mar 21 13:27:40 (redacted): Zombie slayer killing pid 10091
Mar 21 13:27:40 (redacted): Zombie slayer killing pid 10092
Mar 21 13:27:40 (redacted): Zombie slayer killing pid 10093
Mar 21 13:27:40 (redacted): Zombie slayer killing pid 10094
Mar 21 13:27:40 (redacted): Zombie slayer killing pid 10095
Mar 21 13:27:40 (redacted): Zombie slayer killing pid 10096
Mar 21 13:27:40 (redacted): Zombie slayer killing pid 10097
Mar 21 13:27:40 (redacted): Zombie slayer killing pid 10098
Mar 21 13:27:40 (redacted): Zombie slayer killing pid 10099
Mar 21 13:28:06 (redacted): cfgsync kicked off service-event
Mar 21 13:28:06 (redacted): cfgsync kicked off service-event-end
Mar 21 13:28:10 (redacted): Zombie slayer killing pid 10401
Mar 21 13:28:10 (redacted): Zombie slayer killing pid 10402
Mar 21 13:28:10 (redacted): Zombie slayer killing pid 10403
Mar 21 13:28:10 (redacted): Zombie slayer killing pid 10404
Mar 21 13:28:10 (redacted): Zombie slayer killing pid 10405
Mar 21 13:28:10 (redacted): Zombie slayer killing pid 10406
Mar 21 13:28:10 (redacted): Zombie slayer killing pid 10407
Mar 21 13:28:10 (redacted): Zombie slayer killing pid 10408
Mar 21 13:28:10 (redacted): Zombie slayer killing pid 10409
Mar 21 13:28:35 (redacted): cfgsync kicked off service-event
Mar 21 13:28:36 (redacted): cfgsync kicked off service-event-end
Mar 21 13:28:40 (redacted): Zombie slayer killing pid 10765
Mar 21 13:28:40 (redacted): Zombie slayer killing pid 10766
Mar 21 13:28:40 (redacted): Zombie slayer killing pid 10767
Mar 21 13:28:40 (redacted): Zombie slayer killing pid 10768
Mar 21 13:28:40 (redacted): Zombie slayer killing pid 10769
Mar 21 13:28:40 (redacted): Zombie slayer killing pid 10770
Mar 21 13:28:40 (redacted): Zombie slayer killing pid 10771
 
I would try a one liner like

Code:
if printf "%s" "$@" | /bin/grep -qE "^(((start|stop|restart|kill|reload)_?.*cfgsync)$)"; then { sh /jffs/sbin/zomg_cfg_server & }; fi

with

Code:
#!/bin/sh


i=1
awk_pidof() { ps wT | awk '/\scfg_server/{if( $0 !~ /awk/)printf "%s ", $1}'; }

if [ "$(ps wT | awk -v var="$(basename "$0")" '{if( $0 !~ /awk/ && $0 ~ var )printf "%s ", $1}' | wc -w)" -gt 2 ]; then
  # match with `$0 ~ var`; inside /.../ awk treats $var as literal regex text, not the variable
  # gt 2 because, as noted above, ps always lists 2 zomg entries for a single run
  echo "Exiting zombie slayer because it thinks it's a duplicate process"
  exit;
fi

while [ "$(awk_pidof | wc -w)" -le 3 ]; do sleep 1; done

for pid in $(awk_pidof); do
   if [ "$pid" != "$(pstree -s cfg_server | awk '{if($0 !~ /init/)print $2}')" ]; then
      # echo $i
      # echo "killing pid $pid"
      /bin/kill -s 9 $pid &
      /usr/bin/logger "Zombie slayer killing pid $pid"
   fi
   i=$((i + 1))
done

see what kind of results you get.
 
That is assuming we can figure out why it wasn't running.

To me, your debug test shows that they all come from the same parent process. Possibly a lock file in the wrong place, or processes never completing. Apparently the parent is not reaping the child processes, so they stay defunct.
 
I am thinking that it may be better to simply place this one liner in init-start


sh /jffs/sbin/zomg_cfg_server &

change the script to

Code:
#!/bin/sh


i=1
awk_pidof() { ps wT | awk '/\scfg_server/{if( $0 !~ /awk/)printf "%s ", $1}'; }

if [ "$(ps wT | awk -v var="$(basename "$0")" '{if( $0 !~ /awk/ && $0 ~ var )printf "%s ", $1}' | wc -w)" -gt 2 ]; then
  # match with `$0 ~ var`; inside /.../ awk treats $var as literal regex text, not the variable
  # gt 2 because, as noted above, ps always lists 2 zomg entries for a single run
  echo "Exiting zombie slayer because it thinks it's a duplicate process"
  exit;
fi
while true; do
while [ "$(awk_pidof | wc -w)" -le 3 ]; do sleep 1; done

for pid in $(awk_pidof); do
   if [ "$pid" != "$(pstree -s cfg_server | awk '{if($0 !~ /init/)print $2}')" ]; then
      # echo $i
      # echo "killing pid $pid"
      /bin/kill -s 9 $pid &
      /usr/bin/logger "Zombie slayer killing pid $pid"
   fi
   i=$((i + 1))
done
done

and reboot the router.
 
The reason you assumed there were two processes is that the & on the one-liner makes the script run first as a forked background process, which results in the extra process ID.
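
Because counting ps lines to spot a duplicate run is sensitive to forks like this, a different technique (not what the scripts in this thread do) is a lock directory: mkdir either creates it or fails, so only one instance gets through. This is a minimal sketch. The lock path and the helper names acquire_lock and release_lock are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical lock location; any writable path that is cleared on reboot would do.
LOCK_DIR=${LOCK_DIR:-/tmp/zomg_cfg_server.lock}

# mkdir is atomic: it succeeds for exactly one caller and fails if the
# directory already exists.
acquire_lock() { mkdir "$LOCK_DIR" 2>/dev/null; }

# Remove the lock so a later run can start.
release_lock() { rmdir "$LOCK_DIR" 2>/dev/null; }

# Possible use at the top of the slayer script:
#   acquire_lock || { echo "Exiting: another instance holds the lock"; exit 0; }
#   trap release_lock EXIT
```

One caveat with this approach: if the script is killed with kill -9 the trap never runs, so the stale directory has to be removed by hand (or by a reboot, if it lives on a tmpfs).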
 
I noticed that every time the zombie slayer script ran, I wound up with 0 cfg_server processes running, instead of 3 or so.

So, I thought “could it be that it’s killing the wrong process first‽” and fixed up your edits to make sure not to kill that one. But still I was winding up with zero. I decided to add a break to only kill one process. Again zero. Grabbed a random one from the list and ran kill -15 pid. all dead.

I have once again run up against the edges of my cleverness. Here’s the state of the script, which I’m quite happy with overall and would use if I didn’t have that second node. (note: this happens whether or not one has multiple nodes, to say nothing of AIMesh)

TL;DR runaway cfg_server issues will happen regardless of AIMesh usage; killing one of the 1012 processes kills them all, killing connection to nodes; i have no idea how to deal with root cause

(re root cause: As @SomeWhereOverTheRainBow points out: lockfile? process never returning? idk)

Code:
#!/bin/sh
# This script relies on parsing strings, don't include `cfg_server` in its name

awk_pidof() { ps wT | awk '/\scfg_server/{if( $0 !~ /awk/)printf "%s ", $1}'; }

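# match with `$0 ~ var` (inside /.../ awk treats $var as literal regex text); gt 2 because ps lists 2 zomg entries for a single run, as noted earlier in the thread
if [ "$(ps wT | awk -v var="$(basename "$0")" '{if( $0 !~ /awk/ && $0 ~ var )printf "%s ", $1}' | wc -w)" -gt 2 ]; then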
  echo "Exiting zombie slayer because it thinks it's a duplicate process"
  exit;
fi

while :; do
   while [ "$(awk_pidof | wc -w)" -le 3 ]; do sleep 5; done;

   /usr/bin/logger "Zombie slayer detects excess cfg_server processes"
   # echo "Zombie slayer detects excess cfg_server processes"

   i=0
   parent_pid=$(pstree -s cfg_server | awk '{if($0 !~ /init/)printf "%i", $2}')

   for pid in $(awk_pidof); do
      if [ "$pid" != "$parent_pid" ]; then
         # echo "killing pid $pid"
         /bin/kill -s 9 $pid &
         # /usr/bin/logger "Zombie slayer killing pid $pid"
         i=$((i + 1))
         break
      fi
   done

   /usr/bin/logger "Zombie slayer back to sleep after killing $i processes"
   # echo "Zombie slayer back to sleep after killing $i processes"
done

inevitably someone is going to say “why don’t you just kill all the processes always and forever, and switch how your nodes work?" the short answer is, i’m a software person not a hardware person and i dread having to think about channels and juggling multiple SSIDs and whatnot. I only need the second node for a tiny corner of my home, but I *need* that node for that tiny corner of my home.
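
One thing that might be worth ruling out, given that pstree shows a single cfg_server and killing any one entry takes out all of them: the T in ps wT asks BusyBox ps to list threads, so the ~1012 lines could be threads of one process rather than separate zombie processes. This is only a guess from the symptoms and is not confirmed anywhere in this thread. Here is a rough sketch for checking, assuming a Linux /proc with a task/ directory per process. thread_count is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: prints how many threads the given PID currently has.
# $1 = PID, $2 = procfs root (defaults to /proc; overridable for testing).
thread_count() {
    pid=$1
    proc_root=${2:-/proc}
    n=0
    # Each thread of a process has its own numeric directory under task/
    for t in "$proc_root/$pid"/task/[0-9]*; do
        if [ -d "$t" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Possible use, with the PID that pstree -s cfg_server reports:
#   thread_count 21470
```

If that number tracks the line count from ps wT | grep cfg_server, the entries are threads inside one process, which would also explain why a single kill removes every one of them.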
 
My guess is that your best bet would be to wait for another firmware release. Do a clean flash and a factory reset, then minimally reconfigure each setting, all while monitoring this cfg process. See if things change. This will probably require your continued patience and re-evaluation until you have determined the issue has been resolved or gone away. Unfortunately, I have never been able to get any of my routers running AIMesh to reproduce this issue. All of my routers run only 3 cfg_server PIDs. It never grows past that.
 
