Opened 8 years ago

Closed 8 years ago

#1543 closed defect (complete)

the socket requestor should abort itself on unexpected errors

Reported by: jinmei Owned by: vorner
Priority: very high Milestone: Sprint-20120207
Component: ~Boss of BIND (obsolete) Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DNS Feature Depending on Ticket: Socket creator
Estimated Difficulty: 3 Add Hours to Ticket: 0
Total Hours: 2.25 Internal?: no

Description

Currently (as of #1522) the socket creator just throws an exception
in the arguably rare event of encountering unexpected failures in the
communication with the boss process (either via the msgq bus or the
UNIX domain socket). It's up to the application how to react to it,
but the current client code (as of #805) doesn't do anything and keep
using the creator. This is probably not a good choice, because if
such an error even happens it's more likely that something fundamental
is broken in the msgq or the UNIX domain socket (and internal state of
both or either of the requestor and boss), and it's more likely that
the requestor cannot get any useful result any more.

One possible option is to destroy the requestor and create it again,
but doing so has a harsh side effect of clearing all sockets (in the
boss, which will then cause inconsistency between the boss and the
application unless the application also closes the sockets). Even if
we accept the side effect, the recovery process would be complicated.

So the suggestion in the mean time is to have the requestor abort the
program either via assert() failure or throwing a fatal exception
(which would only be catchable at the top level of application and
make the process terminate, possibly with logging the fact). Assuming
it should actually very rare, we won't see this in practice, and by
adding the explicit handling we can be sure that we don't leave the
code in some undefined state that just happens to cause visible
failure by (un)luck.

Subtickets

Change History (9)

comment:1 Changed 8 years ago by jelte

  • Estimated Difficulty changed from 0 to 3

comment:2 Changed 8 years ago by jelte

  • Milestone changed from Next-Sprint-Proposed to Sprint-20120207

comment:3 Changed 8 years ago by vorner

  • Owner set to vorner
  • Status changed from new to accepted

comment:4 Changed 8 years ago by vorner

  • Owner changed from vorner to UnAssigned
  • Status changed from accepted to reviewing

Hello

It should be ready for review. It is based on the #1542 branch, which is still in review, because I use the exceptions introduced there to recognize the OK to continue error cases. The first commit there (058f2664b387c531920c6068b29cee34fc966a05) is just trivial merge with master, to keep up to date.

I think this might need a changelog entry, as it changes the user visible behaviour:

[feature]	vorner
When the allocation of a socket fails for a different reason than the socket
not being provided by the OS, the b10-auth and b10-resolver abort, as the
system might be in inconsistent state after such error.

comment:5 Changed 8 years ago by jelte

  • Owner changed from UnAssigned to jelte

comment:6 Changed 8 years ago by jelte

  • Owner changed from jelte to vorner

The code itself looks good, just a few nits, which are not really specific to this changeset, but bothered me anyway :)

  • I see the added try-catch block in portconfig.cc uses a slightly different braces style than most of the rest of the code; a newline before the catch. This is consistent with some of the existing catches in this file, and in fact is used in more places, but most of the time we put the catch on the same line as the closing brace. This is not really specified in our style guide, but I personally prefer the same-line approach, though of course opinions may differ :)
  • I also am not really fond of the use of 'we' in the error message descriptions; e.g. change
The situation is the same as in the SRVCOMM_EXCEPTION_ALLOC case, but we don't
know further details about the error, because it was signaled by throwing
something not being an exception. This is definitely a bug.

to (for instance)

The situation is the same as in the SRVCOMM_EXCEPTION_ALLOC case, but further
details about the error are unknown, because it was signaled by throwing
something that was not an exception. This is definitely a bug.

Again, not something unique to this changeset (we do it in more messages), but I'd prefer we don't add more :)

Same for the other added message.

comment:7 Changed 8 years ago by vorner

  • Owner changed from vorner to jelte

Hello

Thanks, both issues should have been properly addressed in the last two commits.

comment:8 Changed 8 years ago by jelte

  • Owner changed from jelte to vorner

looks good, please go ahead and merge

comment:9 Changed 8 years ago by vorner

  • Resolution set to complete
  • Status changed from reviewing to closed
  • Total Hours changed from 0 to 2.25

Thank you, closing.

Note: See TracTickets for help on using tickets.