Opened 9 years ago

Closed 9 years ago

#516 closed defect (fixed)

xfrin/zonmgr will cause system stall

Reported by: jinmei Owned by: jinmei
Priority: high Milestone: A-Team-Sprint-20110126
Component: ~Inter-module communication(obsolete) Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity:
Sub-Project: Feature Depending on Ticket:
Estimated Difficulty: 0.0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

I just figured out (at least one of) the root cause of b10-auth hang
I mentioned in #513. b10-xfrin and b10-zonemgr were the culprit.

These programs send control commands via CC channels, but never receive
responses. e.g. b10-xfrin has the following code:

    def publish_xfrin_news(self, zone_name, zone_class,  xfr_result):
...
            try:
                self._send_cc_session.group_sendmsg(msg, XFROUT_MODULE_NAME)
                self._send_cc_session.group_sendmsg(msg, ZONE_MANAGER_MODULE_NAME)
            except socket.error as err: 
...

But the other end of the command always returns a (either error/success)
response. So the response messages will eventually fill up the socket
receive queue. At this point, b10-msgq will start blocking in sending
further commands to these processes, and eventually the entire system
will stall.

An easy fix is to add group_recvmsg() and just ignore the result after each
call to group_sendmsg(). Since this bug is quite critical I think we
should need such a quick fix soon anyway.

On top of that, we'd have to think about cases whether and how to handle
the case where sendmsg() succeeds but recvmsg() doesn't. We'll also need
to make these command exchanges asynchronous eventually. And, as we
discussed in f2f, we may even have to think about adopting a different
messaging framework.

I'm going to add this ticket to the current A team sprint since this bug
is quite critical and I don't want us to forget it in the backlog status.
But I suspect it's beyond the scope of this sprint, and should actually
consider it a fodder of the next sprint.

Subtickets

Change History (10)

comment:1 Changed 9 years ago by zhanglikun

The problem "response messages will eventually fill up the socket receive queue" had been mentioned in ticket 22. We should find one better way to fix the problem, except "Make sure the sender should receive any response even it does't care about the result". Or else, the problem will happen again,

comment:2 Changed 9 years ago by jelte

One note, if the receiver sends a response being either success or fail, shouldn't the sender be interested in this?

Anyhow, I say add receives to the sender now for a quick solution, and for a more permanent solution, I propose we add a field to the command specification that specifies whether there will be responses to control commands. If that is true, senders MUST recv() at some point after sending, and receiver MUST respond(). If false, both MUST NOT.

comment:3 Changed 9 years ago by jinmei

I've pushed a quick hack fix to this problem in branch trac516.

It passes all unit tests locally and has been working on my personal
server. But this patch still misses tests for the problem.

If we think the problem is sufficiently critical and this patch is
better to be merged now even without tests, let's do it and add
more tests in a separate ticket. Otherwise, I think we should defer
this fix to the next release.

comment:4 Changed 9 years ago by jinmei

This is the proposed changelog entry:

  154.?	[bug]		jinmei
	b10-xfrin/b10-zonemgr: Fixed a bug where these programs didn't
	receive command responses from CC sessions.  Eventually the
	receive buffer became full, and many other components that rely
	on CC channels would stall (as noted in #420 and #513).  This is
	an urgent care fix due to the severity of the problem; we'll need
	to revisit it for cleaner fix later. (Trac #516, git TBD)

comment:5 Changed 9 years ago by jinmei

  • Status changed from new to reviewing

comment:6 Changed 9 years ago by jelte

  • Owner changed from UnAssigned to jelte

comment:7 Changed 9 years ago by jelte

  • Owner changed from jelte to jinmei

The changes in the code all look ok, and as they are supposed to be.

Note that I did run the system without apparent problems, but haven't looked at whether the domain sockets were really not filled anymore. So perhaps you'll want to wait for an ack by jeremy. From a code point of view, this change is ok and can be merged.

comment:8 Changed 9 years ago by jreed

Previously I was getting: [b10-auth] timeout happened while sending statistics data: Timeout while reading
 data from cc session

and netstat showed many sockets with stuck Recv-Q.

Commit 3752558cd9e659d55cdd5f261e45df6a793f4aeb worked for me and now I don't get that timeout.

comment:9 Changed 9 years ago by jinmei

I've merged the quick hack fix in master.

Replying to some other comments:

One note, if the receiver sends a response being either success or fail, shouldn't the sender be interested in this?

In general, it should, I think. Especially in the case of fail, the sender
may have to take some recovery action.

Anyhow, I say add receives to the sender now for a quick solution, and for a more permanent solution, I propose we add a field to the command specification that specifies whether there will be responses to control commands. If that is true, senders MUST recv() at some point after sending, and receiver MUST respond(). If false, both MUST NOT.

That looks like making sense.

I'm going to close this ticket for now, and open a new one as a backlog
for more complete tests and cleaner solution.

comment:10 Changed 9 years ago by jinmei

  • Resolution set to fixed
  • Status changed from reviewing to closed
Note: See TracTickets for help on using tickets.