Opened 7 years ago

Closed 7 years ago

#2712 closed defect (fixed)

Cmdctl shutdown command does not shut down b10-cmdctl

Reported by: jelte Owned by: jinmei
Priority: medium Milestone: Sprint-20130423
Component: ~cmd-ctl (obsolete) Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 6 Add Hours to Ticket: 0
Total Hours: 4.6 Internal?: no

Description

This makes #2710 even more annoying; it looks like there is a deadlock when calling sendmsg() (it appears to hang on 'with self._lock' in ccsession.cc:81).

Subtickets

Change History (17)

comment:1 follow-up: Changed 7 years ago by shane

Do you mean when we send a specific shutdown command to b10-cmdctl and not the rest, that it causes a lockup in b10-cmdctl?

Is this new?

Why don't we see this during normal system shutdown?

comment:2 Changed 7 years ago by jreed

See #2083

comment:3 in reply to: ↑ 1 Changed 7 years ago by jelte

Replying to shane:

Do you mean when we send a specific shutdown command to b10-cmdctl and not the rest, that it causes a lockup in b10-cmdctl?

Yes

Is this new?

I don't know, I haven't seen it but I don't think I ever tried Cmdctl shutdown before

Why don't we see this during normal system shutdown?

i don't know (is it followed by a kill?)

comment:4 Changed 7 years ago by jwright

  • Component changed from Unclassified to cmd-ctl
  • Milestone changed from New Tasks to Next-Sprint-Proposed

comment:5 Changed 7 years ago by shane

  • Milestone changed from Previous-Sprint-Proposed to Next-Sprint-Proposed

comment:6 Changed 7 years ago by jelte

  • Milestone changed from Next-Sprint-Proposed to Sprint-20130402

comment:7 Changed 7 years ago by muks

When you fix this bug, please add a note to ticket #2083 asking its reporter to check if that bug is also fixed.

comment:8 Changed 7 years ago by jinmei

  • Owner set to jinmei
  • Status changed from new to accepted

comment:9 follow-up: Changed 7 years ago by jinmei

trac2712 is ready for review.

This was a simple deadlock (-like situation).

The 'Cmdctl shutdown' command from bindctl is handled in a separate
thread specifically handling the corresponding HTTP session, and if
the module is Cmdctl itself, it calls the ModuleCCSession's
command_handler() callback directly. For the shutdown command it
triggers "send_stopping", which eventually causes cc.Session.sendmsg()
for the CC session of the ModuleCCSession. This will require
acquiring a lock for the CC session.

On the other hand, another thread is normally sleeping on that CC
session for read events, holding that lock.

So, unless the sleeping thread is woken up by receiving a read event
(receiving a command from other module, etc), it effectively hangs.

Note that this shouldn't happen if the shutdown command was sent from
the Init module, because it's handled by the right thread.

Of course, the fundamental issue is, as has been the case, the naive
use of threads: creating a threaded application with an (seemingly) ad
hoc design, sharing various kinds of data without much consideration
of thread related issues. We need a fundamental overhaul here, and
while I hope we'll actually have time for it some day, that's
certainly too big for this ticket. My suggested fix is to send all
commands to msgq and have it send back to Cmdctl if the command is for
Cmdctl itself. That's redundant in some sense but should work, and
coding-wise the change is quite small.

I've implemented this fix with some additional tests (including a new
lettuce test to confirm the revised behavior at the system level). I
also added an assertion check in command_handler in case we introduce
a similar type of bug in future.

Proposed changelog entry:

606.?	[bug]		jinmei
	b10-cmdctl: corrected a hangup problem on receiving the shutdown
	command from bindctl.  Note, however, that cmdctl is defined as
	a "needed" module by default, so shutting down cmdctl would cause
	shutdown of the entire BIND 10 system anyway, and is therefore
	still not very useful in practice.
	(Trac #2712, git TBD)

comment:10 Changed 7 years ago by jinmei

  • Owner changed from jinmei to UnAssigned
  • Status changed from accepted to reviewing

comment:11 Changed 7 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:12 in reply to: ↑ 9 ; follow-up: Changed 7 years ago by vorner

Hello

Replying to jinmei:

Of course, the fundamental issue is, as has been the case, the naive
use of threads: creating a threaded application with an (seemingly) ad
hoc design, sharing various kinds of data without much consideration
of thread related issues. We need a fundamental overhaul here, and
while I hope we'll actually have time for it some day, that's
certainly too big for this ticket. My suggested fix is to send all
commands to msgq and have it send back to Cmdctl if the command is for
Cmdctl itself. That's redundant in some sense but should work, and
coding-wise the change is quite small.

I agree we need some rewrite here, but we need it almost everywhere. So I'm not optimistic here.

Also, I think this approach is better, since the code has one less special-case. I guess the special-case was mostly a premature optimisation.

However, it'll stop working if we ever unify the sessions (using just one from cfgmgr), because msgq refuses to send messages back to the sender, even if it is addressed there. Do you have idea why the special code is there?

Some comments to the code:

Could you describe the motivation of the test_handle_post_request_to_itself, for example in the docstring?

I believe this should be „quit“, not „quite“:

        # shutdown.  So "send bind10 command" will fail (it cannot complete
        # "quite").

I'm confused by this comment:

    ignore_failure ('ignoring failure', optional): set to not None if bindctl
    is expected to fail (and it's acceptable).

    Fails if bindctl does not exit with status code 0 and ignore_failure
    is not None.

So, if ignore_failure is set to None (a false value), is the failure ignored?

comment:13 Changed 7 years ago by jinmei

  • Owner changed from vorner to jinmei

(From the comment I assume the review was completed and the ticket was supposed
to be back to me)

comment:14 in reply to: ↑ 12 Changed 7 years ago by jinmei

Thanks for the review.

Replying to vorner:

Replying to jinmei:

However, it'll stop working if we ever unify the sessions (using just one from cfgmgr), because msgq refuses to send messages back to the sender, even if it is addressed there. Do you have idea why the special code is there?

I have no idea; I checked the code history, but it seemed to be coded
that way almost from the beginning without specific commit log for
that particular part. In any case, if and when we unify the session,
at least we should be able to notice the regression via the added
lettuce test, so I think it's okay to think about an alternative at
the point if and when it actually happens.

Some comments to the code:

Could you describe the motivation of the test_handle_post_request_to_itself, for example in the docstring?

Oops, actually, that test wasn't intended to be added. I simply
removed it (see also the commit log).

I believe this should be „quit“, not „quite“:

        # shutdown.  So "send bind10 command" will fail (it cannot complete
        # "quite").

Good catch, corrected.

I'm confused by this comment:

    ignore_failure ('ignoring failure', optional): set to not None if bindctl
    is expected to fail (and it's acceptable).

    Fails if bindctl does not exit with status code 0 and ignore_failure
    is not None.

So, if ignore_failure is set to None (a false value), is the failure ignored?

if ignore_failure is set to None, failure is detected and causes a
test failure (which is the previous behavior and the default of the
new version). I think the documentation is not incorrect, but
admittedly it can be confusing due to the kind of double-negation.
I've updated it so it'll hopefully be less confusing. Is that better?

comment:15 Changed 7 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:16 Changed 7 years ago by vorner

  • Owner changed from vorner to jinmei
  • Total Hours changed from 0 to 1

OK, I think it can be merged now.

comment:17 Changed 7 years ago by jinmei

  • Resolution set to fixed
  • Status changed from reviewing to closed
  • Total Hours changed from 1 to 4.6

Thanks, merge done, closing.

Note: See TracTickets for help on using tickets.