Defunct cfg_server Zombies

garycnew

All:

On my primary router, I've run into an issue where cfg_server child processes die, become defunct, grow to 15 - 20 zombie processes, consume all of my available RAM, and cause the primary router to start swapping.

Once the primary router starts swapping, the load average starts skyrocketing until the primary router is no longer usable.

There are always 3 healthy cfg_server processes. I've tried to kill the parent cfg_server process to force cleanup of the defunct, zombie processes, but it just results in all cfg_server processes becoming defunct.

Are there any known solutions to clear the cfg_server defunct processes and free the associated memory being used without rebooting the primary router?

Respectfully,


Gary
 
What router model? What firmware version? Using AiMesh? How many nodes? Wireless or Ethernet backhaul?

I suspect a problem with the master communicating with slaves, but that's pure speculation since I don't even know if you're using AiMesh.

My AX86U primary with Merlin 386.3_2 and (1) AiMesh node has only (1) cfg_server process, using 48k. No problems. But that doesn't mean the problem doesn't exist. :)

Here's a thread discussing multiple cfg_server processes on the AC86U; it doesn't mention a firmware version and has no real solution (the workaround was killing them all)...
RT-AC86U cfg_server - what is it? | SmallNetBuilder Forums (snbforums.com)

Post more details; perhaps there's a pattern for specific models / firmware / AiMesh conditions.
 
What router model?
RT-AC66U_B1 (All)
What firmware version?
Asuswrt-Merlin 384.19 (All)
Using AiMesh?
Yes
How many nodes?
5 x AiMesh Nodes
Wireless or Ethernet backhaul?
Ethernet Backhaul (All)

My original attempt to killall -9 cfg_server processes was based on the suggestion from the previous post you referenced, but did not resolve the issue as all cfg_server processes became defunct and would not re-spawn.

I used htop's tree view to identify the parent cfg_server and killing it produced the previously mentioned results.

Zombie growth can take days to weeks.

Upgrading firmware isn't an option until I am able to procure new hardware.

Appreciate your assistance @ScottW and @Tech9

Respectfully,


Gary
 
You could draft a handler for the processes, but I am not sure what you would call to "restart" cfg_server.

Code:
[ "$(pidof cfg_server | wc -w)" -gt 1 ]
# Replace the "1" with whatever number is acceptable.

Your handler would poll the process list with pidof and check whether the number of processes has reached a certain count using wc -w, which returns the exact number of processes currently active. If it becomes greater than the number you deem acceptable, you send killall -9 cfg_server until all the cfg_server processes are killed. Once they are killed, you simply restart the process. TBH this is all hypothetical, assuming it can be done and assuming the process doesn't auto-spawn at a certain point.
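A minimal sketch of such a handler might look like the following; the threshold of 3 and the use of service restart_cfgsync as the restart call are assumptions on my part, not something tested:

Code:
#!/bin/sh
# Hypothetical cfg_server handler (sketch only).
# Assumptions: 3 is the acceptable number of processes, and
# "service restart_cfgsync" is a valid way to bring cfg_server back.

THRESHOLD=3

if [ "$(pidof cfg_server | wc -w)" -gt "$THRESHOLD" ]; then
    # Too many cfg_server processes: kill them all...
    killall -9 cfg_server
    # ...then ask the firmware to restart the config sync service
    # (the watchdog may also respawn it on its own).
    service restart_cfgsync
    logger "cfg_server handler: killed and restarted cfg_server"
fi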
 
This is precisely the method proposed in the previously mentioned post. The problem is that after killall -9 cfg_server was executed on my router, the processes all became defunct and did not re-spawn.

I wonder if it must be done prior to maxing out the RAM and hitting swap?

I'll set up the recommended script, configure it for anything greater than the 3 healthy processes that are currently active, and see if it helps.

I wonder if/how this might affect AiMesh?
 
There’s a watchdog to restart cfg_server, so killing them periodically *probably* won’t have any ill effect. But what’s causing them to stop working (go “zombie”) is anyone’s guess.

If you can’t upgrade FW, my only other suggestion (if not already tried) would be to power off all nodes, factory-reset the primary, then factory-reset and add each node again.

How are the nodes connected? All through a common switch? What type of cable? The process failing to respond (and new ones being spawned by the watchdog) implies some kind of communication issue between router and nodes… but it could also be some firmware bug.
 
What is the output when you run this command?

Code:
ps -T | awk '/<defunct>/ {print $1}' | awk '{printf "%s",$0} END {print " "}'

The problem with defunct processes is that they are waiting for the parent to reap them; if we kill the parent, they will wait indefinitely. Maybe they can be killed via PID; see if the command above gives a list of PIDs on a single line.
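If it does print PIDs, a follow-up sketch along these lines could try signalling them directly (untested on my part; note the caveat in the comments):

Code:
# Sketch only: collect the defunct PIDs, then try to kill them.
# Caveat: zombies are already dead and generally can't be signalled away;
# they only disappear once their parent (or init) reaps them.
zpids="$(ps -T | awk '/<defunct>/ {print $1}')"
for p in $zpids; do
    kill -9 "$p" 2>/dev/null   # ignore PIDs that have already exited (e.g. the awk above)
done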
 

Code:
# ps -T | awk '/<defunct>/ {print $1}' | awk '{printf "%s",$0} END {print " "}'
16359
 

Code:
ps -T | awk '/cfg_server/ {print $1}' | sed '$d' | awk '{printf "%s ",$0} END {print ""}'

lists all the PIDs for cfg_server

whereas

Code:
ps -T | awk '/<defunct>/ {print $1}' | sed '$d' | awk '{printf "%s ",$0} END {print ""}'

or

Code:
ps -T | awk '/defunct/ {print $1}' | sed '$d' | awk '{printf "%s ",$0} END {print ""}'

would hopefully list the PIDs for defunct processes. If not, you could adjust the defunct pattern to match what the defunct entry actually says, and it should indicate the proper PIDs of the defunct child processes to kill.
 
There’s a watchdog to restart cfg_server, so killing them periodically *probably* won’t have any ill effect. But what’s causing them to stop working (go “zombie”) is anyone’s guess.
If there's already a watchdog to restart cfg_server, I'm less concerned about ill effects.
If you can’t upgrade FW, my only other suggestion (if not already tried) would be to power off all nodes, factory-reset the primary, then factory-reset and add each node again.
No can do. My AiMesh Nodes are multifaceted and highly customized as BotNodes and TorNodes.
How are the nodes connected?
Star Topology
All through a common switch?
Directly connected to the primary router, with a single AiMesh Node piggybacking off one of the directly connected AiMesh Nodes. There's an issue in firmware 384 where AiMesh Nodes have problems when not directly connected.
What type of cable?
CAT-5 (All Connected at 1Gbps)
The process failing to respond (and new ones being spawned by the watchdog) implies some kind of communication issue between router and nodes… but it could also be some firmware bug.
I've implemented the recommended script and I believe I see it working. I saw a fourth cfg_server process spawn while it was still in a healthy state, and it was successfully killed by the script. I'm thinking that if additional cfg_server processes go unchecked, they eventually go defunct.

As long as the zomg_cfg_server script keeps the processes from exhausting my RAM and there are no ill effects... I'll be happy.

I should know within "days or weeks" whether there are any ill effects associated.

Thanks @ScottW @Tech9 @SomeWhereOverTheRainBow

Respectfully,


Gary

P.S. As I was typing this reply, I received email notification that the zomg_cfg_server script was executed and it's back to only 3 cfg_server processes reported.
 

Presently, there are no defunct processes reported:

Code:
# ps -T | awk '/cfg_server/ {print $1}' | sed '$d' | awk '{printf "%s ",$0} END {print ""}'
18160 18168 18169

Code:
# ps -T | awk '/defunct/ {print $1}' | sed '$d' | awk '{printf "%s ",$0} END {print ""}'
#
 
I hope something we mentioned here helps you. I have made similar handlers to keep processes that like to commit processing suicide "active". Let us know how well it works or if you need any further assistance.
 
I added the
Code:
sed '$d'
because the command was also counting its own PID; granted, it was the last PID on the list.
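As an aside, another way to keep the pipeline from counting its own PID is the bracket trick, so the trailing sed '$d' isn't needed (just an alternative suggestion, not something tested in this thread):

Code:
# The character class stops awk's own command line ("[c]fg_server")
# from matching the pattern, so no trailing PID has to be trimmed.
ps -T | awk '/[c]fg_server/ {print $1}' | awk '{printf "%s ",$0} END {print ""}'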
 

Thanks again!
 
You could draft a handler for the processes, but I am not sure what you would call to "restart" cfg_server.

@ScottW @SomeWhereOverTheRainBow

It seems "killall -9 cfg_server" causes the watchdog to execute a "start_cfgsync" on the primary router; thus, the alternative to the killall is the following:

Code:
# service restart_cfgsync

However, it seems to force the AiMesh Nodes to execute a "restart_wireless" each time, which is filling up the syslog.

Not sure if this is a viable solution.

Respectfully,


Gary
 
What about

Code:
kill -9 $(ps -T | awk '/cfg_server/ {print $1}' | awk '{printf "%s ",$0} END {print ""}')
 

Doesn't that essentially do the same thing?

I'm wondering if killing any processes other than the first 3 might do the trick?
 
All:

I ended up modifying the zomg_cfg_server script (run from a 1-minute cron job) to only kill -9 any newly created cfg_server processes beyond the original 3. This appears to work without invoking the watchdog's start_cfgsync (as the parent cfg_server is preserved in a healthy state), which avoids the restart_wireless on the AiMesh Nodes and the constant writing to the primary router's syslog.

Code:
# cat /jffs/sbin/zomg_cfg_server
#!/bin/sh
# Kill only cfg_server PIDs beyond the first 3 reported by pidof; the
# first 3 are assumed to be the original healthy processes (including
# the parent), so the watchdog's start_cfgsync is never triggered.

i=1

procs="$(pidof cfg_server)"

for pid in $procs; do
   # Skip the first 3 PIDs; kill anything spawned beyond them.
   if [ "$i" -gt "3" ]; then
      #echo $i
      #echo $pid
      /usr/bin/kill -9 $pid
      /usr/bin/logger "Running /jffs/sbin/zomg_cfg_server"
   fi

   i=$((i + 1))
done

It seems to be working very well. I'll update this post if there are any observed ill effects.
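For reference, the 1-minute cron job itself isn't shown above; on Asuswrt-Merlin it could be registered with cru along these lines (the identifier is just an example, and the entry would need to be re-added at boot, e.g. from /jffs/scripts/services-start):

Code:
# Run the zombie-killer script every minute (example identifier).
cru a zomg_cfg_server "* * * * * /jffs/sbin/zomg_cfg_server"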

I believe this to be a better solution than the previous killall recommendation.

Respectfully,


Gary
 
