Opened 7 years ago

Closed 7 years ago

#2244 closed defect (fixed)

remove ddns component, but boss still keeps trying to start it

Reported by: jreed Owned by: jinmei
Priority: medium Milestone: Sprint-20121023
Component: ~Boss of BIND (obsolete) Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: Core Feature Depending on Ticket:
Estimated Difficulty: 6 Add Hours to Ticket: 0
Total Hours: 9.55 Internal?: no

Description

Boss keeps trying to start ddns (see ticket #2243). I don't need ddns, so I decided to stop using it:

> config remove Boss/components b10-ddns
> config commit
> config show Boss/components/b10-ddns
Error: Boss/components/b10-ddns/kind not found

And config show Boss/components does not show it.

The log showed:

2012-09-06 08:19:30.781 DEBUG [b10-boss.boss] BIND10_RECEIVED_NEW_CONFIGURATION received new configuration: {'components': {'b10-cmdctl': {'kind': 'needed', 'special': 'cmdctl'}, 'b10-xfrout': {'kind': 'dispensable'}, 'b10-auth': {'priority': 10, 'kind': 'needed', 'special': 'auth'}, 'b10-stats': {'kind': 'dispensable', 'address': 'Stats'}, 'b10-resolver': {'priority': 10, 'kind': 'needed', 'special': 'resolver'}, 'b10-stats-httpd': {'kind': 'dispensable', 'address': 'StatsHttpd'}}}
2012-09-06 08:19:30.781 INFO  [b10-boss.boss] BIND10_CONFIGURATOR_RECONFIGURE reconfiguring running components
2012-09-06 08:19:30.781 DEBUG [b10-boss.boss] BIND10_CONFIGURATOR_BUILD building plan '{'sockcreator': ({'priority': 200, 'kind': 'core', 'special': 'sockcreator'}, <isc.bind10.special_component.SockCreator object at 0x7f7ff57ed610>), 'msgq': ({'priority': 199, 'kind': 'core', 'special': 'msgq'}, <isc.bind10.special_component.Msgq object at 0x7f7ff57ed650>), 'b10-cmdctl': ({'kind': 'needed', 'special': 'cmdctl'}, <isc.bind10.special_component.CmdCtl object at 0x7f7ff5322050>), 'b10-xfrout': ({'kind': 'dispensable'}, <isc.bind10.component.Component object at 0x7f7ff5322090>), 'b10-auth': ({'priority': 10, 'kind': 'needed', 'special': 'auth'}, <isc.bind10.special_component.Auth object at 0x7f7ff5322110>), 'cfgmgr': ({'priority': 198, 'kind': 'core', 'special': 'cfgmgr'}, <isc.bind10.special_component.CfgMgr object at 0x7f7ff57ed690>), 'b10-ddns': ({'kind': 'dispensable'}, <isc.bind10.component.Component object at 0x7f7ff5311fd0>), 'b10-stats': ({'kind': 'dispensable', 'address': 'Stats'}, <isc.bind10.component.Component object at 0x7f7ff5311e90>), 'b10-resolver': ({'priority': 10, 'kind': 'needed', 'special': 'resolver'}, <isc.bind10.special_component.Resolver object at 0x7f7ff5311f50>), 'b10-stats-httpd': ({'kind': 'dispensable', 'address': 'StatsHttpd'}, <isc.bind10.component.Component object at 0x7f7ff53220d0>)}' -> '{'b10-resolver': {'priority': 10, 'kind': 'needed', 'special': 'resolver'}, 'sockcreator': {'priority': 200, 'kind': 'core', 'special': 'sockcreator'}, 'msgq': {'priority': 199, 'kind': 'core', 'special': 'msgq'}, 'b10-stats': {'kind': 'dispensable', 'address': 'Stats'}, 'b10-cmdctl': {'kind': 'needed', 'special': 'cmdctl'}, 'b10-xfrout': {'kind': 'dispensable'}, 'b10-stats-httpd': {'kind': 'dispensable', 'address': 'StatsHttpd'}, 'b10-auth': {'priority': 10, 'kind': 'needed', 'special': 'auth'}, 'cfgmgr': {'priority': 198, 'kind': 'core', 'special': 'cfgmgr'}}'
2012-09-06 08:19:30.782 DEBUG [b10-boss.boss] BIND10_CONFIGURATOR_RUN running plan of 0 tasks
2012-09-06 08:19:30.788 DEBUG [b10-cmdctl.cmdctl] CMDCTL_SEND_COMMAND sending command get_config to module ConfigManager
2012-09-06 08:19:30.788 DEBUG [b10-cmdctl.cmdctl] CMDCTL_COMMAND_SENT command 'get_config' to module 'ConfigManager' was sent

Then a few seconds later:

2012-09-06 08:19:35.826 INFO  [b10-boss.boss] BIND10_COMPONENT_START component b10-ddns is starting

And this keeps happening.

2012-09-06 08:24:59.409 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 24982 of b10-ddns ended with status 0
2012-09-06 08:24:59.409 ERROR [b10-boss.boss] BIND10_COMPONENT_FAILED component b10-ddns (pid 24982) failed: process exited normally with exit status 0
2012-09-06 08:25:06.053 INFO  [b10-boss.boss] BIND10_COMPONENT_START component b10-ddns is starting
2012-09-06 08:25:06.053 INFO  [b10-boss.boss] BIND10_STARTING_PROCESS starting process b10-ddns
2012-09-06 08:25:06.065 DEBUG [b10-boss.boss] BIND10_STARTED_PROCESS_PID started b10-ddns (PID 6493)

and so on ...

So I did a "Boss shutdown" and restarted and it no longer tried to start ddns.

Subtickets

Change History (16)

comment:1 Changed 7 years ago by muks

  • Summary changed from remove boss component, but boss still keeps trying to start it to remove ddns component, but boss still keeps trying to start it

comment:2 Changed 7 years ago by muks

  • Milestone changed from New Tasks to Sprint-20121009

comment:3 Changed 7 years ago by jinmei

  • Owner set to jinmei
  • Status changed from new to accepted

comment:4 Changed 7 years ago by jinmei

trac2244 is ready for review.

First off, I'd argue the subject is misleading. After investigating
it, I realized it happens primarily because b10-ddns repeatedly
terminates and is then restarted; "remove ddns" didn't work as
expected in that specific situation. If you remove ddns while it's
working fine, it should be cleanly stopped as expected.

As far as I can see, what happened is this:

b10-ddns repeatedly fails and then restarts. When it fails, its
internal component state is set to "FAILED", and it's pushed into
the restart queue of the main bind10 process. If the
"remove b10-ddns" command arrives between the failure and the restart,
the configurator class effectively ignores the command, because it
adds the removed component to the planned task list only when the
component's state is "running":

                if component.running():
                    plan.append({
                        'command': STOP_CMD,
                        'component': component,
                        'name': cname
                    })

And even if the specified component was removed from the
configurator, the restart queue of the bind10 process does not check
for consistency with the underlying configuration, so the restart
would take place anyway:

    def restart_processes(self):
...
        for component in self.components_to_restart:
            if not component.restart(now):

So the fix is two-fold:

  • Make sure that the configurator honors the remove command as long as the component's state is either "running" or "failed"
  • In bind10's restart_processes(), skip the restart if the component has been removed from the configuration

The branch implements this idea with a few additional cleanups (see
commit logs).
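To make the idea concrete, here is a small self-contained sketch of the
two changes (illustrative only, not the actual branch code; the helper
names and the injected configured_names set are made up for the example):

    # Toy model of the fix; names are illustrative, not the real BIND 10 code.
    import time

    STOP_CMD = 'stop'

    class Component:
        def __init__(self, name):
            self.name = name
            self.state = 'running'   # 'running' | 'failed' | 'stopped'

        def is_running(self):
            return self.state == 'running'

        def is_failed(self):
            return self.state == 'failed'

        def restart(self, now):
            self.state = 'running'
            return True

    def build_remove_plan(component, cname):
        # Fix 1: schedule a STOP not only for a running component but also
        # for a failed one, so a remove issued between a crash and the next
        # restart attempt is no longer silently dropped.
        plan = []
        if component.is_running() or component.is_failed():
            plan.append({'command': STOP_CMD,
                         'component': component,
                         'name': cname})
        return plan

    def restart_processes(components_to_restart, configured_names):
        # Fix 2: before restarting, drop queued components that no longer
        # exist in the configuration.
        still_dead = []
        now = time.time()
        for component in components_to_restart:
            if component.name not in configured_names:
                continue                 # removed from config: forget it
            if not component.restart(now):
                still_dead.append(component)
        return still_dead

    # The failure window: b10-ddns has crashed and sits in the restart queue.
    ddns = Component('b10-ddns')
    ddns.state = 'failed'
    print(build_remove_plan(ddns, 'b10-ddns'))    # now yields a STOP task
    print(restart_processes([ddns], set()))       # -> [] (no restart)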

Frankly, I didn't like this fix very much: "if the state is xxx then
do something" type of code is generally quite fragile, and this fix
adds yet another piece of such logic. But to fix the fundamental issue
cleanly, I guess we need a design-level overhaul of the component
framework. I hope we eventually have time for that (the current
implementation seems too fragile in general, and is difficult to
maintain both because of this kind of logic and because various modules
can change the state), but that would be far beyond the scope of this
ticket. So I reluctantly chose the band-aid fix.

This is the proposed changelog entry:

486.	[bug]		jinmei
	The bind10 process now terminates a component (subprocess) by the
	"config remove Boss/components" bindctl command even if the
	process crashes immediately before the command is sent to bind10.
	Previously this led to an inconsistent state between the
	configuration and an internal component list of bind10, and bind10
	kept trying to restart the component.  A known specific case of
	this problem is that b10-ddns could keep failing (due to lack of
	dependency modules) and the administrator couldn't stop the
	restart via bindctl.
	(Trac #2244, git TBD)

comment:5 Changed 7 years ago by jinmei

  • Owner changed from jinmei to UnAssigned
  • Status changed from accepted to reviewing

comment:6 Changed 7 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:7 follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Did you try running the system tests? They seem to be failing reliably for me, like this:

cd tests/system; \
sh /home/vorner/work/bind10/tests/system/runall.sh
S:bindctl:Mon Oct  8 14:13:03 CEST 2012
T:bindctl:1:A
A:System test bindctl
Using SQLite3 database file /home/vorner/work/bind10/tests/system/bindctl/nsx1/zone.sqlite3
Zone name is .
Loading file "/home/vorner/work/bind10/tests/system/bindctl//nsx1/root.db"
4 RR(s) loaded in 0.35 second(s) (100.00% of /home/vorner/work/bind10/tests/system/bindctl//nsx1/root.db)
Done.
I:starting server nsx1
I:Checking b10-auth is disabled by default (0)
I:Starting b10-auth and checking that it works (1)
I:Checking BIND 10 statistics after a pause (2)
I:Stopping b10-auth and checking that (3)
I:Restarting b10-auth and checking that (4)
I:failed
I:Rechecking BIND 10 statistics after a pause (5)
I:failed
I:Changing the data source from sqlite3 to in-memory (6)
I:failed
I:Rechecking BIND 10 statistics after changing the datasource (7)
I:failed
I:Starting more b10-auths and checking that (8)
I:failed
I:Rechecking BIND 10 statistics consistency after a pause (9)
I:failed 
I:Stopping extra b10-auths and checking that (10)
I:failed
I:exit status: 1
R:FAIL
E:bindctl:Mon Oct  8 14:14:52 CEST 2012
S:glue:Mon Oct  8 14:14:52 CEST 2012
T:glue:1:A
A:System test glue
Using SQLite3 database file ./nsx1/zone.sqlite3
Zone name is .
Loading file "./nsx1/root.db"
16 RR(s) loaded in 0.32 second(s) (100.00% of ./nsx1/root.db)                   
Done.
Using SQLite3 database file ./nsx1/zone.sqlite3
Zone name is root-servers.nil.
Loading file "./nsx1/root-servers.nil.db"
5 RR(s) loaded in 0.19 second(s) (100.00% of ./nsx1/root-servers.nil.db)        
Done.
Using SQLite3 database file ./nsx1/zone.sqlite3
Zone name is com.
Loading file "./nsx1/com.db"
8 RR(s) loaded in 0.18 second(s) (100.00% of ./nsx1/com.db)                     
Done.
Using SQLite3 database file ./nsx1/zone.sqlite3
Zone name is net.
Loading file "./nsx1/net.db"
7 RR(s) loaded in 0.17 second(s) (100.00% of ./nsx1/net.db)                     
Done.
I:starting server nsx1
I:testing that a TLD referral gets a full glue set from the root zone (0)
I:testing that we don't find out-of-zone glue (1)
I:exit status: 0
R:PASS
E:glue:Mon Oct  8 14:14:57 CEST 2012
S:ixfr/in-2:Mon Oct  8 14:14:57 CEST 2012
T:ixfr/in-2:1:A
A:System test ixfr/in-2
Using SQLite3 database file /home/vorner/work/bind10/tests/system/ixfr/zone.sqlite3
Zone name is example.
Loading file "/home/vorner/work/bind10/tests/system/ixfr/db.example.n6"
10 RR(s) loaded in 0.31 second(s) (100.00% of /home/vorner/work/bind10/tests/system/ixfr/db.example.n6)
Done.
/home/vorner/work/bind9/bin/tests/system/testsock.pl: bind(10.53.0.1, 53210): Address already in use
I:Couldn't bind to socket (yet)
/home/vorner/work/bind9/bin/tests/system/testsock.pl: bind(10.53.0.1, 53210): Address already in use
I:Couldn't bind to socket (yet)
/home/vorner/work/bind9/bin/tests/system/testsock.pl: bind(10.53.0.1, 53210): Address already in use
I:Couldn't bind to socket (yet)
/home/vorner/work/bind9/bin/tests/system/testsock.pl: bind(10.53.0.1, 53210): Address already in use
I:Couldn't bind to socket (yet)
/home/vorner/work/bind9/bin/tests/system/testsock.pl: bind(10.53.0.1, 53210): Address already in use
/home/vorner/work/bind10/tests/system/start.pl: could not bind to server addresses, still running?
I:server sockets not available
R:FAIL
Can't open perl script "/home/vorner/work/bind10/tests/system/ixfr/stop.pl": No such file or directory

I think the testsock.pl failures are caused by the previous failed tests, since they leave stuff running; I had to kill bind10 manually. They do seem like they could be related, since the first failing test is about restarting. Many lettuce tests are also failing, with a traceback like:

 Given I have bind10 running with configuration xfrin/retransfer_master.conf with cmdctl port 47804 as master # features/terrain/bind10_control.py:107
    Traceback (most recent call last):
      File "/home/vorner/.local/lib64/python2.7/site-packages/lettuce/core.py", line 117, in __call__
        ret = self.function(self.step, *args, **kw)
      File "/home/vorner/work/bind10/tests/lettuce/features/terrain/bind10_control.py", line 120, in have_bind10_running
        step.given(start_step)
      File "/home/vorner/.local/lib64/python2.7/site-packages/lettuce/core.py", line 326, in given
        return self.behave_as(string)
      File "/home/vorner/.local/lib64/python2.7/site-packages/lettuce/core.py", line 366, in behave_as
        assert not steps_failed, steps_failed[0].why.exception
    AssertionError: Got: 2012-10-08 14:23:30.925 FATAL [b10-boss.boss] BIND10_STARTUP_ERROR error during startup: b10-msgq already running, or socket file not cleaned , cannot start

And there are some minor things. The log description is missing an „m“ at the end of the line:

The boss module simply skipped restarting that module, and the whole syste
went back to the expected state (except that the crash itself is likely

The method name is_failed looks wrong in English. I know it is for consistency, but it would read more correctly as has_failed or is_restarting.

comment:8 in reply to: ↑ 7 ; follow-up: Changed 7 years ago by jinmei

Replying to vorner:

Did you try running the system tests? They seem to be failing reliably for me, like this:

I'm pretty sure I ran the lettuce tests and confirmed they passed (I
reconfirmed it now); I must confess I probably skipped the older
system tests, but I suspect the failure you've seen is due to
something else.

Actually, I can see that happen too sometimes (it's not always
reproducible), but the same thing happens (again, sometimes) on the
master branch too. On a closer look at the log file when it fails,
I noticed b10-auth received queries before it had fully started up.
I suspect that, depending on the timing, when b10-auth tries to receive
the config message from cfgmgr in the initialization phase it can actually
accept and handle a query via SessionImpl::readData()
(io_service().run_one() would invoke an unintended event handler). This is
not good, and I guess we need to address it in a cleaner way, but that's
certainly outside the scope of this ticket.

I'm not sure whether this branch can specifically introduce
regressions in system tests. If you see the failure selectively on
this branch, we should probably discuss it more. If it happens on
master or 93e596c (the branch point) too, that should be a different
matter (we should probably create a ticket for that).

And there are some minor things. The log description is missing an „m“ at the end of the line:

The boss module simply skipped restarting that module, and the whole syste
went back to the expected state (except that the crash itself is likely

Good catch, thanks. Fixed.

The method name is_failed looks wrong in English. I know it is for consistency, but it would read more correctly as has_failed or is_restarting.

Okay, I've changed it to is_running() so that both are consistently
named is_xxx().

comment:9 Changed 7 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:10 in reply to: ↑ 8 ; follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

I'm not sure whether this branch can specifically introduce
regressions in system tests. If you see the failure selectively on
this branch, we should probably discuss it more. If it happens on
master or 93e596c (the branch point) too, that should be a different
matter (we should probably create a ticket for that).

I did indeed mean the older (shell- and perl-based) ones. And I think it is introduced in this branch, because:

  • 93e596c completes fine, without a failure. Other branches I reviewed recently were OK too.
  • This branch fails with the same set of errors reliably, every time (it failed in 4 of 4 attempts, with even a reboot in between, etc.).
  • The first failing scenario is about restarting; the others are just follow-ups:
I:Restarting b10-auth and checking that (4)
I:failed

The other changes are OK.

comment:11 in reply to: ↑ 10 Changed 7 years ago by jinmei

Replying to vorner:

I did indeed mean the older (shell- and perl-based) ones. And I think it is introduced in this branch, because:

  • 93e596c completes fine, without a failure. Other branches I reviewed recently were OK too.
  • This branch fails with the same set of errors reliably, every time (it failed in 4 of 4 attempts, with even a reboot in between, etc.).
  • The first failing scenario is about restarting; the others are just follow-ups:

Okay, you were right. I've found a bug that caused this problem and
fixed it. We should have been able to notice it sooner if we had
unit tests for Bob.reap_children(). We could add one within this
ticket, but I'd rather suggest leaving it to a separate ticket, and to
a larger framework: despite its importance, the Bob class and
bind10_src.py.in in general are only very poorly tested. I suggest
refactoring them so that the core logic is more independent of
system-dependent things like process scheduling, probably also making
the Bob class more concise, and providing more tests for them.
Providing tests for reap_children() would be part of this larger task
(a rough sketch of the idea follows).
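
Purely as an illustration of that direction (not the real
Bob.reap_children() code; the function, its parameters, and the canned
waitpid sequence below are made up), injecting the wait call makes the
reaping logic testable without spawning real processes:

    # Illustrative only: a reap-children style loop with the wait call
    # injected, so a unit test can drive it with canned results.
    import os

    def reap_children(known_pids, on_exit, waitpid=os.waitpid):
        """Collect exited children and report each via on_exit(pid, status)."""
        while True:
            try:
                pid, status = waitpid(-1, os.WNOHANG)
            except OSError:
                break
            if pid == 0:            # no more exited children right now
                break
            if pid in known_pids:
                on_exit(pid, status)

    def test_reports_known_child():
        events = iter([(123, 0), (0, 0)])   # child 123 exits, then nothing
        exited = []
        reap_children({123}, lambda p, s: exited.append((p, s)),
                      waitpid=lambda *_: next(events))
        assert exited == [(123, 0)]

    test_reports_known_child()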

If this makes sense, and the latest branch passes the tests (which I'm
quite sure it does), I'll create the follow-up tickets and close this one.

comment:12 Changed 7 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:13 Changed 7 years ago by vorner

  • Owner changed from vorner to muks
  • Total Hours changed from 0 to 1.30

Hello

Now it passes, thank you :-). Please merge.

comment:14 follow-up: Changed 7 years ago by vorner

  • Owner changed from muks to jinmei

Sending to the correct person O:-)

comment:15 in reply to: ↑ 14 Changed 7 years ago by jinmei

Replying to vorner:

Sending to the correct person O:-)

Thanks, merge done. Closing.

comment:16 Changed 7 years ago by jinmei

  • Resolution set to fixed
  • Status changed from reviewing to closed
  • Total Hours changed from 1.30 to 9.55