Defunct cfg_server Zombies

garycnew

All:

On my primary router, I've run into an issue where cfg_server child processes die, become defunct, grow to 15 - 20 zombie processes, consume all of my available RAM, and cause the primary router to start swapping.

Once the primary router starts swapping, the load average starts skyrocketing until the primary router is no longer usable.

There are always 3 healthy cfg_server processes. I've tried to kill the parent cfg_server process to force cleanup of the defunct, zombie processes, but it just results in all cfg_server processes becoming defunct.

Are there any known solutions to clear the cfg_server defunct processes and free the associated memory being used without rebooting the primary router?

Respectfully,


Gary
 
What router model? What firmware version? Using AiMesh? How many nodes? Wireless or Ethernet backhaul?

I suspect a problem with the master communicating with slaves, but that's pure speculation since I don't even know if you're using AiMesh.

My AX86U primary with Merlin 386.3_2 and (1) AiMesh node has only (1) cfg_server process, using 48k. No problems. But that doesn't mean the problem doesn't exist. :)

Here's a thread discussing multiple cfg_server processes on the AC86U; it doesn't mention a firmware version and has no real solution (the workaround was killing them all)...
RT-AC86U cfg_server - what is it? | SmallNetBuilder Forums (snbforums.com)

Post more details; perhaps there's a pattern for specific models / firmware / AiMesh conditions.
 
What router model?
RT-AC66U_B1 (All)
What firmware version?
Asuswrt-Merlin 384.19 (All)
Using AiMesh?
Yes
How many nodes?
5 x AiMesh Nodes
Wireless or Ethernet backhaul?
Ethernet Backhaul (All)

My original attempt to killall -9 cfg_server processes was based on the suggestion from the previous post you referenced, but did not resolve the issue as all cfg_server processes became defunct and would not re-spawn.

I used htop's tree view to identify the parent cfg_server and killing it produced the previously mentioned results.

Zombie growth can take days to weeks.

Upgrading firmware isn't an option until I am able to procure new hardware.

Appreciate your assistance @ScottW and @Tech9

Respectfully,


Gary
 
You could draft a handler for the processes, but I am not sure what you would call to "restart" cfg_server.

Code:
[ "$(pidof cfg_server | wc -w)" -gt 1 ]
# Replace the "1" with whatever number is acceptable.

Your handler would poll the process list with pidof and check whether the number of processes has reached a certain count using wc -w, which returns the exact number of processes currently active. If it becomes greater than the number you deem acceptable, you send killall -9 cfg_server until all the cfg_server processes are killed. Once they are killed, you simply restart the process. TBH this is all hypothetical, assuming it can be done and assuming the process doesn't auto-spawn at a certain point.
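A minimal sketch of such a handler might look like the following; the threshold of 3 and the use of service restart_cfgsync as the restart call are assumptions on my part, not something tested:

Code:
#!/bin/sh
# Hypothetical cfg_server handler (sketch only).
# Assumptions: 3 is the acceptable number of processes, and
# "service restart_cfgsync" is a valid way to bring cfg_server back.

THRESHOLD=3

if [ "$(pidof cfg_server | wc -w)" -gt "$THRESHOLD" ]; then
    # Too many cfg_server processes: kill them all...
    killall -9 cfg_server
    # ...then ask the firmware to restart the config sync service
    # (the watchdog may also respawn it on its own).
    service restart_cfgsync
    logger "cfg_server handler: killed and restarted cfg_server"
fi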
 
This is precisely the method proposed in the previously mentioned post. The problem is that after killall -9 cfg_server was executed on my router, the processes all became defunct and did not re-spawn.

I wonder if it must be done prior to maxing out the RAM and hitting swap?

I'll set up the recommended script, configure it for anything greater than the 3 healthy processes that are currently active, and see if it helps.

I wonder if/how this might affect AiMesh?
 
There’s a watchdog to restart cfg_server, so killing them periodically *probably* won’t have any ill effect. But what’s causing them to stop working (go “zombie”) is anyone’s guess.

If you can’t upgrade FW, my only other suggestion (if not already tried) would be to power off all nodes, factory-reset the primary, then factory-reset and add each node again.

How are the nodes connected? All through a common switch? What type of cable? The process failing to respond (and new ones being spawned by the watchdog) implies some kind of communication issue between router and nodes… but it could also be some firmware bug.
 
What is the output when you run this command?

Code:
ps -T | awk '/<defunct>/ {print $1}' | awk '{printf "%s",$0} END {print " "}'

The problem with defunct processes is that they are waiting for the parent to reap them; if we kill the parent, they will wait indefinitely. Maybe they can be killed via PID; see if the command above gives a list of PIDs on a single line.
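If it does print PIDs, a follow-up sketch along these lines could try signalling them directly (untested on my part; note the caveat in the comments):

Code:
# Sketch only: collect the defunct PIDs, then try to kill them.
# Caveat: zombies are already dead and generally can't be signalled away;
# they only disappear once their parent (or init) reaps them.
zpids="$(ps -T | awk '/<defunct>/ {print $1}')"
for p in $zpids; do
    kill -9 "$p" 2>/dev/null   # ignore PIDs that have already exited (e.g. the awk above)
done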
 

Code:
# ps -T | awk '/<defunct>/ {print $1}' | awk '{printf "%s",$0} END {print " "}'
16359
 

Code:
ps -T | awk '/cfg_server/ {print $1}' | sed '$d' | awk '{printf "%s ",$0} END {print ""}'

lists all the PIDs for cfg_server

whereas

Code:
ps -T | awk '/<defunct>/ {print $1}' | sed '$d' | awk '{printf "%s ",$0} END {print ""}'

or

Code:
ps -T | awk '/defunct/ {print $1}' | sed '$d' | awk '{printf "%s ",$0} END {print ""}'

would hopefully list the PIDs for defunct processes. If not, you could adjust the defunct pattern to match what the defunct entry actually says, and it should indicate the proper PIDs of the defunct child processes to kill.
 
There’s a watchdog to restart cfg_server, so killing them periodically *probably* won’t have any ill effect. But what’s causing them to stop working (go “zombie”) is anyone’s guess.
If there's already a watchdog to restart cfg_server, I'm less concerned about ill effects.
If you can’t upgrade FW, my only other suggestion (if not already tried) would be to power off all nodes, factory-reset the primary, then factory-reset and add each node again.
No can do. My AiMesh Nodes are multifaceted and highly customized as BotNodes and TorNodes.
How are the nodes connected?
Star Topology
All through a common switch?
Directly connected to the primary router, with a single AiMesh Node piggybacking off one of the directly connected AiMesh Nodes. There's an issue in firmware 384 where AiMesh Nodes have problems when not directly connected.
What type of cable?
CAT-5 (All Connected at 1Gbps)
The process failing to respond (and new ones being spawned by the watchdog) implies some kind of communication issue between router and nodes… but it could also be some firmware bug.
I've implemented the recommended script and I believe I see it working. I saw a fourth cfg_server process spawn while it was still in a healthy state, and it was successfully killed by the script. I'm thinking that if additional cfg_server processes go unchecked, they eventually go defunct.

As long as the zomg_cfg_server script keeps the processes from exhausting my RAM and there are no ill effects... I'll be happy.

I should know within "days or weeks" whether there are any ill effects associated.

Thanks @ScottW @Tech9 @SomeWhereOverTheRainBow

Respectfully,


Gary

P.S. As I was typing this reply, I received email notification that the zomg_cfg_server script was executed and it's back to only 3 cfg_server processes reported.
 

Presently, there are no defunct processes reported:

Code:
# ps -T | awk '/cfg_server/ {print $1}' | sed '$d' | awk '{printf "%s ",$0} END {print ""}'
18160 18168 18169

Code:
# ps -T | awk '/defunct/ {print $1}' | sed '$d' | awk '{printf "%s ",$0} END {print ""}'
#
 
I hope something we mentioned here helps you. I have made similar handlers to keep processes that like to commit processing suicide "active". Let us know how well it works or if you need any further assistance.
 
I added the
Code:
sed '$d'
because the command was also counting its own PID; granted, it was the last PID on the list.
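As an aside, another way to keep the pipeline from counting its own PID is the bracket trick, so the trailing sed '$d' isn't needed (just an alternative suggestion, not something tested in this thread):

Code:
# The character class stops awk's own command line ("[c]fg_server")
# from matching the pattern, so no trailing PID has to be trimmed.
ps -T | awk '/[c]fg_server/ {print $1}' | awk '{printf "%s ",$0} END {print ""}'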
 

Thanks again!
 
You could draft a handler for the processes, but I am not sure what you would call to "restart" cfg_server.

@ScottW @SomeWhereOverTheRainBow

It seems "killall -9 cfg_server" causes the watchdog to execute a "start_cfgsync" on the primary router; thus, the alternative to the killall is the following:

Code:
# service restart_cfgsync

However, it seems to force the AiMesh Nodes to execute a "restart_wireless" each time, which is filling up the syslog.

Not sure if this is a viable solution.

Respectfully,


Gary
 
What about

Code:
kill -9 $(ps -T | awk '/cfg_server/ {print $1}' | awk '{printf "%s ",$0} END {print ""}')
 

Doesn't that essentially do the same thing?

I'm wondering if killing any processes other than the first 3 might do the trick?
 
All:

I ended up modifying the zomg_cfg_server script (run from a 1-minute cron job) to only kill -9 any newly created cfg_server processes beyond the original 3. This appears to work without invoking the watchdog's start_cfgsync (as the parent cfg_server is preserved in a healthy state), which avoids the restart_wireless on the AiMesh Nodes and the constant writing to the primary router's syslog.

Code:
# cat /jffs/sbin/zomg_cfg_server
#!/bin/sh
# Kill only cfg_server PIDs beyond the first 3 reported by pidof; the
# first 3 are assumed to be the original healthy processes (including
# the parent), so the watchdog's start_cfgsync is never triggered.

i=1

procs="$(pidof cfg_server)"

for pid in $procs; do
   # Skip the first 3 PIDs; kill anything spawned beyond them.
   if [ "$i" -gt "3" ]; then
      #echo $i
      #echo $pid
      /usr/bin/kill -9 $pid
      /usr/bin/logger "Running /jffs/sbin/zomg_cfg_server"
   fi

   i=$((i + 1))
done

It seems to be working very well. I'll update this post if there are any observed ill effects.
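For reference, the 1-minute cron job itself isn't shown above; on Asuswrt-Merlin it could be registered with cru along these lines (the identifier is just an example, and the entry would need to be re-added at boot, e.g. from /jffs/scripts/services-start):

Code:
# Run the zombie-killer script every minute (example identifier).
cru a zomg_cfg_server "* * * * * /jffs/sbin/zomg_cfg_server"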

I believe this to be a better solution than the previous killall recommendation.

Respectfully,


Gary
 
