Opened 8 years ago

Closed 8 years ago

Last modified 8 years ago

#1522 closed task (complete)

Implementation of isc::server_common::SocketRequestor

Reported by: vorner Owned by: vorner
Priority: very high Milestone: Sprint-20120110
Component: Unclassified Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: Core Feature Depending on Ticket: Socket creator
Estimated Difficulty: 0 Add Hours to Ticket: 0
Total Hours: 22.83 Internal?: no

Description

This is a task split off of #805. The commit c8caecc35195080d091611bb0809cbc2834ba3f4 (on the trac805 branch) introduces an interface for the SocketRequestor, which is supposed to actually get the socket over the wire from the boss.

The plan is to implement a concrete class implementing the methods. The tests might even want to subclass this concrete class to mock the communication over the unix domain socket.

The requestSocket would be blocking: it would send the request over the CC session, wait for the answer, check whether the unix domain socket is already connected (connecting if not), and pick up the socket over it.

The releaseSocket would just send the request over the CC session and wait for the answer (so the answer is picked up and isn't stuck there forever).

The SocketRequestor::init function should be implemented in this task as well.
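
For orientation, a minimal sketch of the intended shape (the method names come from this ticket; the parameter lists and the mocking note are illustrative assumptions, not the actual API):

    #include <stdint.h>
    #include <string>

    class SocketRequestor {
    public:
        virtual ~SocketRequestor() {}
        // blocking: send the request over the CC session, wait for
        // the answer, connect the unix domain socket if it isn't
        // connected yet, and pick the socket up over it
        virtual int requestSocket(const std::string& address,
                                  uint16_t port) = 0;
        // send the release request and wait for the answer so it
        // isn't stuck in the queue forever
        virtual void releaseSocket(const std::string& token) = 0;
    };

    // A concrete subclass would do the real CC session and unix
    // domain socket work; tests can subclass it to mock the unix
    // domain socket communication.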

Subtickets

Change History (24)

comment:1 Changed 8 years ago by jelte

  • Owner set to jelte
  • Status changed from new to assigned

comment:2 Changed 8 years ago by shane

  • Feature Depending on Ticket set to Socket creator

comment:3 follow-up: Changed 8 years ago by jelte

  • Owner changed from jelte to UnAssigned
  • Status changed from assigned to reviewing

Ok, it should be about ready for review.

Branch is based on the commit mentioned above.

Apart from adding a concrete instance that uses the ModuleCCSession and implements the abstract methods, most work went into the tests (as expected). Tests now contain a mini forking domain socket server which pumps status+send_fd() data to a domain socket.

A few not entirely related changes I needed:

  • cleanup() in socket_requestor.h, which removes the singleton and resets the static pointer (mostly needed for tests, where I want to be able to initialize it again), but also useful for programs that want to clean up before they exit. (I left initTest() alone for now; for this purpose it would not be correct, but we can probably merge the two.)
  • added a groupSendMsg() and groupRecvMsg() call to ModuleCCSession (which are simple wrappers that directly call the equivalent methods of the internal msgq connection).

oh and a few other notes:

There are a few constants that I suspect will be needed in more places (currently at the top of socket_request.cc). If so, we'll need to move them somewhere public; as I only needed them here for now, I put them in the .cc.

comment:4 Changed 8 years ago by jelte

  • Total Hours changed from 0 to 10

comment:5 in reply to: ↑ 3 ; follow-up: Changed 8 years ago by jinmei

Replying to jelte:

I've taken an initial look at the branch. I don't think we can
complete the entire review process (I've not even reviewed all of the
branch) and I'll be off for a while for holidays, so I've dumped what
I've found so far without picking up the ticket.

  • First, it didn't compile. I've fixed it myself and pushed the fix.
  • Second, tests didn't pass for me. I've also fixed it myself and pushed the fix.

socket_request.cc

  • 'static' isn't necessary for the constant strings:
    static const std::string CREATOR_SOCKET_OK("1");
    
    (and to be pedantic this use of static is officially deprecated in C++)
  • in general, defining a non-POD object this way is a source of the static initialization fiasco. Sometimes it can happen to be safe depending on how exactly it's used, but I'd generally rather avoid it in the first place than go to lengths to make sure it's safe. In this case we could either define it as a char* and compare them using strcmp() or define it via a proxy function:
    const std::string& CREATOR_SOCKET_OK() {
        static const std::string str("1");
        return (str);
    }
    
  • The test failure was because the original code didn't take into account the case where sockaddr variants have a "length" member (in this case sun_len). I also suspect there may be some kernel implementation that rejects connect()/bind() if this member isn't set, so it would be safer if we do this:
    #include <config.h>
    ...
    #ifdef HAVE_SA_LEN
        sock_pass_addr.sun_len = <length>;
    #endif
    
  • I suspect there's an off-by-one bug here:
            if (path.size() > sizeof(sock_pass_addr.sun_path)) {
                isc_throw(SocketError, "Unable to open domain socket " << path <<
                                       ": path too long");
            }
    
            strcpy(sock_pass_addr.sun_path, path.c_str());
    
    Consider the case path.size() == sizeof(sun_path) and where the terminating \0 will be written. (If my guess is correct, tell vorner that this proves why it was prudent to add an assert() in the SocketSessionForwarder constructor:-)
  • If the test doesn't cover the boundary case (I've not checked it), I'd add it.
  • createFdShareSocket (maybe some others also) is not exception safe: if it throws after creating the socket it will leak. I'd use something like ScopedSocket defined in lib/util/io/socketsession.cc (latest master). Maybe we should move it to sockaddr_util.h (and rename the file) so that it can be shared by other libraries?
  • I'd avoid C-style cast:
            if (connect(sock_pass_fd,
                        (struct sockaddr *)&sock_pass_addr,
                        len) == -1) {
    
    (btw this code also breaks the style guideline in terms of the position of *). I defined and used a conversion template in lib/util/io/sockaddr_util.h. It's still a kind of fraud for the compiler, but IMO still better than C-style cast or reinterpret_cast, if only to limit the use of these beasts to cases where there is absolutely no other alternative.
  • getSocketFd: In my understanding passed_sock_fd == 0 is a valid case.
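
A minimal self-contained sketch pulling the bullets above together (illustrative only, not the branch's code): RAII in the spirit of ScopedSocket, a >= length check that leaves room for the terminating \0, sun_len set under #ifdef HAVE_SA_LEN, and a conversion helper instead of a C-style cast:

    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>
    #include <cstring>
    #include <string>
    #include <stdexcept>

    // RAII guard so an exception between socket() and the return
    // cannot leak the descriptor (in the spirit of ScopedSocket in
    // lib/util/io/socketsession.cc).
    class ScopedSocket {
    public:
        explicit ScopedSocket(int fd) : fd_(fd) {}
        ~ScopedSocket() { if (fd_ >= 0) { close(fd_); } }
        int fd() const { return (fd_); }
        int release() { const int fd = fd_; fd_ = -1; return (fd); }
    private:
        int fd_;
    };

    // Conversion helper instead of a bare C-style cast (analogous to
    // the template in lib/util/io/sockaddr_util.h).
    inline const struct sockaddr*
    convertSockAddr(const struct sockaddr_un* sa) {
        const void* p = sa;
        return (static_cast<const struct sockaddr*>(p));
    }

    int
    connectDomainSocket(const std::string& path) {
        struct sockaddr_un sock_pass_addr;
        memset(&sock_pass_addr, 0, sizeof(sock_pass_addr));
        sock_pass_addr.sun_family = AF_UNIX;
        // '>=', not '>': leave room for the terminating '\0'
        if (path.size() >= sizeof(sock_pass_addr.sun_path)) {
            throw std::runtime_error("domain socket path too long: " +
                                     path);
        }
        strcpy(sock_pass_addr.sun_path, path.c_str());
    #ifdef HAVE_SA_LEN
        sock_pass_addr.sun_len = SUN_LEN(&sock_pass_addr);
    #endif
        ScopedSocket sock(socket(AF_UNIX, SOCK_STREAM, 0));
        if (sock.fd() == -1) {
            throw std::runtime_error("failed to create domain socket");
        }
        if (connect(sock.fd(), convertSockAddr(&sock_pass_addr),
                    sizeof(sock_pass_addr)) == -1) {
            throw std::runtime_error("failed to connect to " + path);
        }
        return (sock.release());   // caller now owns the fd
    }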

socket_requester_test.cc

  • in create(), the code around strncpy() made me nervous, too, although in this case it doesn't seem to have a bug.

dns_server_unittest.cc

This is not for this branch (I guess it's a result of importing #805),
but the tests didn't pass for me either. I needed to add this:

         struct sockaddr_in addr;
+        memset(&addr, 0, sizeof(addr));
         addr.sin_family = AF_INET;

While looking at it I noticed some other things for this file:

  • some of the above comments apply to this implementation, too: exception safety, consideration for sin_len, non plain-old-data static object, unnecessary namespace-level static.
  • I would personally use getaddrinfo to create the sockaddr to avoid this kind of portability issue (see the sketch after this list).
  • I would use IPv6 for testing for future generation software like BIND 10 (we could also test IPv4 if we want).
  • technically, passing protocol 0 to socket for AF_INET isn't good:
            int result(socket(AF_INET, type, 0));
    
    If and when DCCP or SCTP is more common this can be ambiguous.
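
A hedged sketch of the last three bullets combined (createTestSocket and its parameters are made up for the example): getaddrinfo() builds the sockaddr, the family is IPv6, and the concrete ai_protocol is passed instead of 0:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netdb.h>
    #include <unistd.h>
    #include <cstring>
    #include <stdexcept>

    int
    createTestSocket(const char* addr, const char* port, int type) {
        struct addrinfo hints;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_INET6;   // IPv6-first, per the suggestion
        hints.ai_socktype = type;
        hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV;
        struct addrinfo* res;
        if (getaddrinfo(addr, port, &hints, &res) != 0) {
            throw std::runtime_error("getaddrinfo failed");
        }
        // getaddrinfo() filled in the sockaddr (including any sa_len
        // member) and the concrete protocol, so neither is hand-coded
        const int s = socket(res->ai_family, res->ai_socktype,
                             res->ai_protocol);
        if (s == -1) {
            freeaddrinfo(res);
            throw std::runtime_error("socket() failed");
        }
        if (bind(s, res->ai_addr, res->ai_addrlen) == -1) {
            close(s);                 // don't leak on the error path
            freeaddrinfo(res);
            throw std::runtime_error("bind() failed");
        }
        freeaddrinfo(res);
        return (s);
    }

A test would then call, e.g., createTestSocket("::1", "5300", SOCK_STREAM).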

comment:6 Changed 8 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:7 follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jelte

Hello

I'd have a different set of comments:

  • The comment with CREATOR_SOCKET_UNAVAILABLE seems misleading. In the case of this status, the fd is not sent.
  • When closing the unix domain sockets:
    std::map<std::string, int>::iterator it;
    for (it = fd_share_sockets_.begin(); it != fd_share_sockets_.end(); ++it ) 
    
Is there a reason to declare it outside the loop? This would leave it in scope even after the loop ends (not that there would be much after the end).
  • This comment change is strange:
         /// \param session the CC session that'll be used to talk to the
    -    ///     socket creator.
    -    /// \throw InvalidOperation when it is called more than once.
    +    ///                socket creator.
    +    /// \throw InvalidOperation when it is called more than once,
    +    ///                         when socket_path is empty
         static void init(config::ModuleCCSession& session);
    
    Where is the socket_path it talks about?
  • The cleanup, is it really needed? Isn't initTest(NULL) enough? I know it doesn't release the memory, but is it a problem in tests?
  • The status messages from boss are 2 bytes long, either "0\n" or "1\n", but you seem to take only one byte.
  • This should be „so far“?
    // Clears the messages the client sent to far on the fake msgq
    
  • What use is clearMsgQueue() at the beginning of a test (there are multiple such tests)?
  • The test testBadSocketReleaseAnswers seems to have two parts. Should they be commented to say what failure they test?
  • With the mini-fork-client thing, when you accept, the client_address is unused. Passing NULL should be OK there.
  • When you send nonsense instead of send_fd (e.g. when it is -2), can you be sure the socket will be in a consistent state? And that it'll eat the one wrong byte somehow? Because you seem to use the socket after that.
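
For the accept point above, the NULL form is just this (server_sock being the listening socket; names illustrative):

    // the client's address is unused, so no sockaddr is needed
    const int client_sock = accept(server_sock, NULL, NULL);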

Thanks

With regards

comment:8 in reply to: ↑ 5 Changed 8 years ago by jelte

Addressing comments one at a time, in the order they came in

Replying to jinmei:

Replying to jelte:

I've taken an initial look at the branch. I don't think we can
complete the entire review process (I've not even reviewed all of the
branch) and I'll be off for a while for holidays, so I've dumped what
I've found so far without picking up the ticket.

  • First, it didn't compile. I've fixed it myself and pushed the fix.
  • Second, tests didn't pass for me. I've also fixed it myself and pushed the fix.

thanks

socket_request.cc

  • in general, defining a non-POD object this way is a source of the static initialization fiasco. Sometimes it can happen to be safe depending on how exactly it's used, but I'd generally rather avoid it in the first place than go to lengths to make sure it's safe. In this case we could either define it as a char* and compare them using strcmp() or define it via a proxy function:
    const std::string& CREATOR_SOCKET_OK() {
        static const std::string str("1");
        return (str);
    }
    

Went for that suggestion; we can reuse these in wrappers (we'd just need to add them to the header, but I have not done that).

  • The test failure was because the original code didn't take into account the case where sockaddr variants have a "length" member (in this case sun_len). I also suspect there may be some kernel implementation that rejects connect()/bind() if this member isn't set, so it would be safer if we do this:
    #include <config.h>
    ...
    #ifdef HAVE_SA_LEN
        sock_pass_addr.sun_len = <length>;
    #endif
    

ok added

  • I suspect there's an off-by-one bug here:
            if (path.size() > sizeof(sock_pass_addr.sun_path)) {
                isc_throw(SocketError, "Unable to open domain socket " << path <<
                                       ": path too long");
            }
    
            strcpy(sock_pass_addr.sun_path, path.c_str());
    
    Consider the case path.size() == sizeof(sun_path) and where the terminating \0 will be written. (If my guess is correct, tell vorner that this proves why it was prudent to add an assert() in the SocketSessionForwarder constructor:-)

heh

  • If the test doesn't cover the boundary case (I've not checked it), I'd add it.

Yeah, fixed the check, and added a test

  • createFdShareSocket (maybe some others also) is not exception safe: if it throws after creating the socket it will leak. I'd use something like ScopedSocket defined in lib/util/io/socketsession.cc (latest master). Maybe we should move it to sockaddr_util.h (and rename the file) so that it can be shared by other libraries?

Hmm, looking at that new code I think there is more of it that is intended to be used here (or if not intended, at least useful).

For now I've gone for the path of least resistance and added a close() in the existing throws. This will still leak if there is any other exception in that part of the code (or if a new throw is added). But given the seeming intent of the classes in that socketsession, this functionality should probably be moved there as well (i.e. why not let SocketSessionReceiver create and connect to a socket, instead of passing it an opened one? Or perhaps provide both options if there is a use-case for the existing one).

  • I'd avoid C-style cast:
            if (connect(sock_pass_fd,
                        (struct sockaddr *)&sock_pass_addr,
                        len) == -1) {
    
    (btw this code also breaks the style guideline in terms of the position of *). I defined and used a conversion template in lib/util/io/sockaddr_util.h. It's still a kind of fraud for the compiler, but IMO still better than C-style cast or reinterpret_cast, if only to limit the use of these beasts to cases where there is absolutely no other alternative.

Same as above; for now I left this in here, as is, with a TODO message (oh the pain, committing code with TODOs).

  • getSocketFd: In my understanding passed_sock_fd == 0 is a valid case.

Oh yes it is, and I have actually seen it (due to an error elsewhere, but still).

socket_requester_test.cc

  • in create(), the code around strncpy() made me nervous, too, although in this case it doesn't seem to have a bug.

Actually it did: the same off-by-one as in the code itself (though since we create the original length here it is less dangerous, that piece could in theory change).

dns_server_unittest.cc

Skipping these; to be handled in #805.

comment:9 in reply to: ↑ 7 ; follow-up: Changed 8 years ago by jelte

  • Owner changed from jelte to vorner

Replying to vorner:

Hello

I'd have a different set of comments:

  • The comment with CREATOR_SOCKET_UNAVAILABLE seems misleading. In the case of this status, the fd is not sent.

ack, updated

  • When closing the unix domain sockets:
    std::map<std::string, int>::iterator it;
    for (it = fd_share_sockets_.begin(); it != fd_share_sockets_.end(); ++it ) 
    
Is there a reason to declare it outside the loop? This would leave it in scope even after the loop ends (not that there would be much after the end).

Aesthetics :) But now that I see it again, it went over the official line length anyway, so I inlined it.

  • This comment change is strange:
         /// \param session the CC session that'll be used to talk to the
    -    ///     socket creator.
    -    /// \throw InvalidOperation when it is called more than once.
    +    ///                socket creator.
    +    /// \throw InvalidOperation when it is called more than once,
    +    ///                         when socket_path is empty
         static void init(config::ModuleCCSession& session);
    
    Where is the socket_path it talks about?

oops, that shouldn't have been in there. Removed the comment.

  • The cleanup, is it really needed? Isn't initTest(NULL) enough? I know it doesn't release the memory, but is it a problem in tests?

Only if we want to run memory checkers on our tests. As I said, they serve very similar use-cases, and can probably be made into one function (or perhaps initTest can call cleanup if necessary then set it to the new value), but from an API point-of-view, calling something called init to clean up seems weird :) (and I did not want to mess with initTest() at this point because of followup work in 805 and/or other branches related to the socket creator).

  • The status messages from boss are 2 bytes long, either "0\n" or "1\n", but you seem to take only one byte.

ah, doh, changed.

  • This should be „so far“?
    // Clears the messages the client sent to far on the fake msgq
    

yep

  • What use is clearMsgQueue() at the beginning of a test (there are multiple such tests)?

When initializing a ModuleCCSession, there is already some communication over the CC session (this is also the reason the SocketRequestorTest initializer adds an empty success response before starting the fake CC session). We don't need to see what's in it here, so it just empties the queue for easier access to the later messages we do want to check.

  • The test testBadSocketReleaseAnswers seems to have two parts. Should they be commented to say what failure they test?

added

  • With the mini-fork-client thing, when you accept, the client_address is unused. Passing NULL should be OK there.

ack, null'ed

  • When you send nonsense instead of send_fd (e.g. when it is -2), can you be sure the socket will be in a consistent state? And that it'll eat the one wrong byte somehow? Because you seem to use the socket after that.

Well, in a way, reusing it and seeing that the next one does not fail is a way of testing this. I couldn't really come up with any better way to test it :)

comment:10 in reply to: ↑ 9 ; follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jelte

Hello

Replying to jelte:

  • createFdShareSocket (maybe some others also) is not exception safe: if it throws after creating the socket it will leak. I'd use something like ScopedSocket defined in lib/util/io/socketsession.cc (latest master). Maybe we should move it to sockaddr_util.h (and rename the file) so that it can be shared by other libraries?

Hmm, looking at that new code I think there is more of it that is intended to be used here (or if not intended, at least useful).

For now I've gone for the path of least resistance and added a close() in the existing throws. This will still leak if there is any other exception in that part of the code (or if a new throw is added). But given the seeming intent of the classes in that socketsession, this functionality should probably be moved there as well (i.e. why not let SocketSessionReceiver create and connect to a socket, instead of passing it an opened one? Or perhaps provide both options if there is a use-case for the existing one).

I don't think anything else there could throw, as these are C calls.

Anyway, would you add a ticket to incorporate the code from the socketsession, so we don't have some code twice?

Same as above; for now I left this in here, as is, with a TODO message (oh the pain, committing code with TODOs).

Still the asterisk on the wrong side of the space ;-).

Replying to jelte:

  • The status messages from boss are 2 bytes long, either "0\n" or "1\n", but you seem to take only one byte.

ah, doh, changed.

But you still receive (and test) messages that are "0\0" and "1\0", not "0\n" and "1\n" (that's a newline, not a null byte). You need to update the constants and the tests, and the buffer for reading should be 3 bytes large if you want to compare it with ==.
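
Roughly what the fixed read looks like (a fragment in the style of the branch snippets above; sock_pass_fd is assumed to be the connected domain socket and CREATOR_SOCKET_OK() is assumed to now return "1\n"):

    // read both status bytes ("0\n" or "1\n") into a NUL-terminated
    // 3-byte buffer so it can be compared with == below
    char status[3];
    memset(status, 0, sizeof(status));
    if (read(sock_pass_fd, status, 2) != 2) {
        isc_throw(SocketError, "failed to read status from boss");
    }
    if (CREATOR_SOCKET_OK() == status) {
        // status is "1\n"; go on and pick up the FD
    }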

  • When you send nonsense instead of send_fd (e.g. when it is -2), can you be sure the socket will be in a consistent state? And that it'll eat the one wrong byte somehow? Because you seem to use the socket after that.

Well, in a way, reusing it and seeing that the next one does not fail is a way of testing this. I couldn't really come up with any better way to test it :)

Well, my reasoning is, if nothing guarantees this should be one byte, then there'll be a system that'll fail (Solaris, for example).

And besides, if someone sends nonsense, then it might as well be some other amount of nonsense and the protocol gets completely confused. Is there a reason for this in practice?

Thank you

comment:11 in reply to: ↑ 10 ; follow-up: Changed 8 years ago by jinmei

Replying to vorner:

As I reported on the daily call, I'm taking over the ticket so we can
complete the socket creator tasks while Jelte is taking time off.

I've re-reviewed the entire branch as a successor implementor, and
made some unrelated cleanups: from 80d9d4f to d8ac471. Most of them
should be trivial and noncontroversial, but there's one substantial
set of changes: I moved some static/private member functions out of
SocketRequestor because they don't need any private information of the
class. I generally believe it's better to implement things outside of
a class if they can reasonably be implemented only via the class's
public interfaces.

Hmm, looking at that new code I think there is more of it that is intended to be used here (or if not intended, at least useful).

For now I've gone for the path of least resistance and added a close() in the existing throws. This will still leak if there is any other exception in that part of the code (or if a new throw is added). But given the seeming intent of the classes in that socketsession, this functionality should probably be moved there as well (i.e. why not let SocketSessionReceiver create and connect to a socket, instead of passing it an opened one? Or perhaps provide both options if there is a use-case for the existing one).

I don't think anything else there could throw, as these are C calls.

Anyway, would you add a ticket to incorporate the code from the socketsession, so we don't have some code twice?

Same as above; for now I left this in here, as is, with a TODO message (oh the pain, committing code with TODOs).

On looking at it further, I guess we should eventually merge the FD
passing part of the socket creator interface and the SocketSession
framework (is this what Jelte meant by "functionality should be
moved"? If so, I tend to agree). Both do quite the same things using
pretty low level interfaces, which are generally more error prone and
less portable. So unifying such a problematic part makes more sense
than other general cases of reducing duplicates. But that would be
beyond the scope of the initial development of socket creator support,
so I'm okay with deferring it to a post-release development cycle.

Still the asterisk on the wrong side of the space ;-).

I believe I fixed it and other similar style issues.

Replying to jelte:

  • The status messages from boss are 2 bytes long, either "0\n" or "1\n", but you seem to take only one byte.

ah, doh, changed.

But you still receive (and test) messages that are "0\0" and "1\0", not "0\n" and "1\n" (that's a newline, not a null byte). You need to update the constants and the tests, and the buffer for reading should be 3 bytes large if you want to compare it with ==.

This should be fixed in 182024b.

  • When you send nonsense instead of send_fd (e.g. when it is -2), can you be sure the socket will be in a consistent state? And that it'll eat the one wrong byte somehow? Because you seem to use the socket after that.

Well, in a way, reusing it and seeing that the next one does not fail is a way of testing this. I couldn't really come up with any better way to test it :)

Well, my reasoning is, if nothing guarantees this should be one byte, then there'll be a system that'll fail (Solaris, for example).

And besides, if someone sends nonsense, then it might as well be some other amount of nonsense and the protocol gets completely confused. Is there a reason for this in practice?

I've not fully understood the issue here. What happens in the -2 case
is the child sends one byte of data and it's received in recv_fd(),
where the one-byte data is received by recvmsg(), but since no FD is
passed recv_fd() fails and requestSocket() throws. I don't see how
this makes the subsequent doRequest() fail... That said, I think if
requestSocket() fails due to an unexpected error, the connection
generally cannot be assumed to be usable anymore and the caller
application should basically terminate the connection (and reopen it
if necessary). The SocketSession framework is designed that way.

If we agree, I'm okay with removing the "doRequest() after failure"
test as the behavior in such a case wouldn't be guaranteed.

I've noticed some other, higher design level things while I looked at
the code more closely. These may not have to be addressed in this
release cycle but I'd like to clarify/revise it especially if we
consider unifying this API/implementation with SocketSession?:

  • In the current implementation (and probably as a result of the design) I suspect a socket could be passed and leak if "status" is bogus. Admittedly this wouldn't normally happen, but such a design seems fragile to me.
  • I personally don't think it is the best idea to signal the case of "CREATOR_SOCKET_UNAVAILABLE" by exception. At the very least it should be specifically separated from other exceptions such as the one resulting from bogus status data; otherwise, assuming the app should generally drop the connection in the latter case, we could drop and re-establish the connection unnecessarily
  • I also don't think the interface using magic strings like "0\n" is clean. As we've already noticed, such string manipulation is error prone. It's also counter-intuitive to see the existence of '\n' (although this might be mitigated by introducing some constant representation). Also, passing non-atomic data like this before passing the fd using sendmsg/recvmsg would cause some tricky failure modes such as the one mentioned in the first bullet above. I'd rather use the currently-unused one-byte data that is passed with the FD using sendmsg/recvmsg.
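
A sketch of that last idea (illustrative; recvStatusAndFd and the '1' convention are assumptions, not the branch's recv_fd()): the one-byte status rides in the normal data of the same recvmsg() that carries the FD as SCM_RIGHTS ancillary data, so even a bogus status cannot leak the descriptor:

    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <unistd.h>
    #include <cstring>

    // Returns the passed FD if the status is OK ('1'); closes any
    // passed FD and returns -1 otherwise.
    int
    recvStatusAndFd(int sock) {
        char status = 0;
        struct iovec iov;
        iov.iov_base = &status;
        iov.iov_len = 1;
        char control[CMSG_SPACE(sizeof(int))];
        struct msghdr msg;
        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = control;
        msg.msg_controllen = sizeof(control);
        if (recvmsg(sock, &msg, 0) != 1) {
            return (-1);
        }
        struct cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
        if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
            cmsg->cmsg_type != SCM_RIGHTS) {
            return (-1);              // no FD actually arrived
        }
        int fd;
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
        if (status != '1') {          // bogus or "unavailable" status:
            close(fd);                // the FD can still be reclaimed
            return (-1);
        }
        return (fd);
    }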

comment:12 Changed 8 years ago by jinmei

  • Owner changed from jelte to vorner

comment:13 in reply to: ↑ 11 ; follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

I've re-reviewed the entire branch as a successor implementor, and
made some unrelated cleanups: from 80d9d4f to d8ac471. Most of them
should be trivial and noncontroversial, but there's one substantial
set of changes: I moved some static/private member functions out of
SocketRequestor because they don't need any private information of the
class. I generally believe it's better to implement things outside of
a class if they can reasonably be implemented only via the class's
public interfaces.

Well, I'm not sure in this case. I understand the class as a kind of namespace/way to say these things belong together. This feels as if the guts are lying all over the place. But I don't really care about it much; it's an aesthetic matter.

On looking at it further, I guess we should eventually merge the FD
passing part of the socket creator interface and the SocketSession
framework (is this what Jelte meant by "functionality should be
moved"? If so, I tend to agree). Both do quite the same things using
pretty low level interfaces, which are generally more error prone and
less portable. So unifying such a problematic part makes more sense
than other general cases of reducing duplicates. But that would be
beyond the scope of the initial development of socket creator support,
so I'm okay with deferring it to a post-release development cycle.

Yes, I agree with this as well.

I've not fully understood the issue here. What happens in the -2 case
is the child sends one byte of data and it's received in recv_fd(),
where the one-byte data is received by recvmsg(), but since no FD is
passed recv_fd() fails and requestSocket() throws. I don't see how
this makes the subsequent doRequest() fail... That said, I think if
requestSocket() fails due to an unexpected error, the connection
generally cannot be assumed to be usable anymore and the caller
application should basically terminate the connection (and reopen it
if necessary). The SocketSession framework is designed that way.

Ah, the sendmsg contains one byte of data. I didn't know that, so it might work.

Anyway, it would be bad to close the connection. That would signal the boss to reclaim all the sockets transferred previously.

If we agree, I'm okay with removing the "doRequest() after failure"
test as the behavior in such a case wouldn't be guaranteed.

ACK, I agree, it tests only stuff that happens after the sending part went berserk anyway.

  • In the current implementation (and probably as a result of the design) I suspect a socket could be passed and leak if "status" is bogus. Admittedly this wouldn't normally happen, but such a design seems fragile to me.

Hmm, what do you mean? Like if the boss sent a "2\n" and the socket? Yes, then it would leak, but can we really avoid that in any way?

  • I personally don't think it is the best idea to signal the case of "CREATOR_SOCKET_UNAVAILABLE" by exception. At the very least it should be specifically separated from other exceptions such as the one resulting from bogus status data; otherwise, assuming the app should generally drop the connection in the latter case, we could drop and re-establish the connection unnecessarily

I'd like to point out that such a situation shouldn't happen in practice and would be the same kind of bug as wrong data being sent, etc. By the time the client picks up the socket, it was already prepared and reserved. It can be unavailable only in case the token is wrong.

  • I also don't think the interface using magic strings like "0\n" is clean. As we've already noticed, such string manipulation is error prone. It's also counter-intuitive to see the existence of '\n' (although this might be mitigated by introducing some constant representation). Also, passing non-atomic data like this before passing the fd using sendmsg/recvmsg would cause some tricky failure modes such as the one mentioned in the first bullet above. I'd rather use the currently-unused one-byte data that is passed with the FD using sendmsg/recvmsg.

The newline is there to preserve some kind of consistency, and I wasn't sure whether Python might want to do some kind of line-based buffering (well, it probably wouldn't).

Anyway, the code as is looks OK, but I noticed one more problem I overlooked before. I don't see the code sending the token over the unix domain socket to tell the boss which socket is being requested.

Thanks

comment:14 in reply to: ↑ 13 ; follow-up: Changed 8 years ago by jinmei

Replying to vorner:

If we agree, I'm okay with removing the "doRequest() after failure"
test as the behavior in such a case wouldn't be guaranteed.

ACK, I agree, it tests only stuff that happens after the sending part went berserk anyway.

I've changed the test ordering to avoid doing doRequest() in a
"possibly broken" state.

As for this one:

Anyway, the code as is looks OK, but I noticed one more problem I overlooked before. I don't see the code sending the token over the unix domain socket to tell the boss which socket is being requested.

Is there any documentation that describes the protocol between the
boss and other clients? I cannot find it in bind10/README or in
lib/bind10/README, so I've looked at the bind10.py implementation and
guessed what's expected, and updated the code based on it.

This caused an additional, unexpected issue: if the remote end (in
practice, boss) closes the connection first, requestSocket() would
result in SIGPIPE (and the application would die). Just like we did
for SocketSession, we need to at least ignore that signal. I've
added that code too (btw I suspect this is also the case for the boss -
an app could die after requestSocket(), which would automatically
close the connection). The fact that we did the same trick again is
another reason why we should unify this and SocketSession (but
that's beyond the scope of our current work).
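
The ignore itself is small (a sketch; where exactly to install it is a design choice):

    #include <csignal>

    // e.g. called once from SocketRequestor::init(). With SIGPIPE
    // ignored, writing to a connection the peer has closed fails
    // with EPIPE instead of killing the process.
    void
    ignoreSigPipe() {
        if (signal(SIGPIPE, SIG_IGN) == SIG_ERR) {
            // extremely unlikely; report or throw as appropriate
        }
    }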

I think other points are for post release cleanups:

Anyway, it would be bad to close the connection. That would signal the boss to reclaim all the sockets transferred previously.

Hmm, on further considering this and other bullet points below, I now
think we should probably manage the states of open sockets at the boss
and applications separately from the channel between them (i.e., the
connection between the UNIX domain sockets). We should probably
introduce a (re)synchronization mechanism for the open socket states
between the boss and the application when the connection is
(re)established, and then we can simply drop the connection if it's
possibly in an unusable state (and re-establish a new one).

  • In the current implementation (and probably as a result of the design) I suspect a socket could be passed and leak if "status" is bogus. Admittedly this wouldn't normally happen, but such a design seems fragile to me.

Hmm, what do you mean? Like if the boss sent a "2\n" and the socket? Yes, then it would leak, but can we really avoid that in any way?

See the third bullet, for example. If we pass the pair of status (1
byte) and FD via a single sendmsg()/recvmsg() (the status will be in
the 1-byte "normal" data; the FD is passed as ancillary data),
requestSocket() could examine both the FD and status and can close the
socket (e.g.) if FD is bogus.

  • I personally don't think it is the best idea to signal the case of "CREATOR_SOCKET_UNAVAILABLE" by exception. At the very least it should be specifically separated from other exceptions such as the one resulting from bogus status data; otherwise, assuming the app should generally drop the connection in the latter case, we could drop and re-establish the connection unnecessarily

I'd like to point out that such a situation shouldn't happen in practice and would be the same kind of bug as wrong data being sent, etc. By the time the client picks up the socket, it was already prepared and reserved. It can be unavailable only in case the token is wrong.

I thought "unavailable" can be possible for a reason we cannot
control, such as an unrelated third-party application already opening
the port. Am I mistaken?

  • I also don't think the interface using magic strings like "0\n" is clean. As we've already noticed, such string manipulation is error prone. It's also counter-intuitive to see the existence of '\n' (although this might be mitigated by introducing some constant representation). Also, passing non-atomic data like this before passing the fd using sendmsg/recvmsg would cause some tricky failure modes such as the one mentioned in the first bullet above. I'd rather use the currently-unused one-byte data that is passed with the FD using sendmsg/recvmsg.

The newline is there to preserve some kind of consistency, and I wasn't sure whether Python might want to do some kind of line-based buffering (well, it probably wouldn't).

Hmm, I'm not sure whether you mean the newline is good or bad by the
Python thing, but as long as we use multi-byte data that's probably
just a matter of taste. More substantially, if we could switch to
1-byte status code (obviously it would exclude a newline), we'll be
able to pass the pair of status code and FD via a single call to
sendmsg()/recvmsg(), which will make error handling simpler.

comment:15 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:16 Changed 8 years ago by jinmei

I forgot one more thing, not raised in the review but I noticed it
myself: I guess we need to add some logging messages for the socket
requestor. So I added some.

One thing that might raise a discussion is the format of address +
port in:

% SOCKETREQUESTOR_GETSOCKET Received a %1 socket for [%2]:%3, FD=%4, token=%5, path=%6

I've used [address]:port regardless of the address family. I know
it doesn't conform to the consensus (addr:port for the old version of
IP), but I expect we'll eventually introduce a formatter library to
help produce consistent output. So I didn't bother to introduce a
local formatter for now.

comment:17 in reply to: ↑ 14 ; follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Trac is back, so the answer should go here as well.

Michal 'vorner' Vaner
Replying to jinmei:

Replying to vorner:
Is there any documentation that describes the protocol between the
boss and other clients? I cannot find it in bind10/README or in
lib/bind10/README, so I've looked at the bind10.py implementation and
guessed what's expected, and updated the code based on it.

No, I didn't find the time to write the documentation yet and I doubt anyone
else did, as it was mostly me (with the exception of this particular branch) who
worked on the socket creator. I know I should have, so I'll probably create a
ticket to do it sometime soon.

This caused an additional, unexpected issue: if the remote end (in
practice, boss) closes the connection first, requestSocket() would
result in SIGPIPE (and the application would die). Just like we did
for SocketSession, we need to at least ignore that signal. I've
added that code too (btw I suspect this is also the case for the boss -
an app could die after requestSocket(), which would automatically
close the connection). The fact that we did the same trick again is
another reason why we should unify this and SocketSession (but
that's beyond the scope of our current work).

Well, this wouldn't be such a problem, because boss is one of the components
that aren't allowed to crash. And it doesn't close the socket anywhere in its
code. But maybe better to have it there.

Anyway, it would be bad to close the connection. That would signal the boss to reclaim all the sockets transferred previously.

Hmm, on further considering this and other bullet points below, I now
think we should probably manage the states of open sockets at the boss
and applications separately from the channel between them (i.e., the
connection between the UNIX domain sockets). We should probably
introduce a (re)synchronization mechanism for the open socket states
between the boss and the application when the connection is
(re)established, and then we can simply drop the connection if it's
possibly in an unusable state (and re-establish a new one).

Well, they are bound to the socket they were sent over for a simple reason: we
need to know when an application crashed, to release its sockets. And it seems
an overkill to do so much to handle a possible bug, at least for this stage of
the project, it seems to me.

And anyway, I consider this code to be mostly temporary, for two reasons:

  • If/when we adopt dbus instead of msgq, we can register at the bus to tell the boss when an application that sent a given request disconnects from the boss.
  • If/when we adopt dbus, we might want to get rid of all this low-level error-prone code and send the sockets over the dbus (it doesn't allow sending them on Windows, but our current code doesn't either). And the socket could go directly in the response to the request, avoiding this two-phase exchange.

So, having a note in the code or a low-priority ticket for it is OK, but I'd be
against trying to solve it now or in the next sprint or so.

Hmm, what do you mean? Like if the boss sent a "2\n" and the socket? Yes, then it would leak, but can we really avoid that in any way?

See the third bullet, for example. If we pass the pair of status (1
byte) and FD via a single sendmsg()/recvmsg() (the status will be in
the 1-byte "normal" data; the FD is passed as ancillary data),
requestSocket() could examine both the FD and status and can close the
socket (e.g.) if FD is bogus.

But if the „status“ byte were bogus, the fd could leak anyway, couldn't it?
If there was one?

Anyway, I do agree that sending it within the sendmsg packet, all at once, is a
good idea.

  • I personally don't think it is the best idea to signal the case of "CREATOR_SOCKET_UNAVAILABLE" by exception. At the very least it should be specifically separated from other exceptions such as the one resulting from bogus status data; otherwise, assuming the app should generally drop the connection in the latter case, we could drop and re-establish the connection unnecessarily

I'd like to point out that such a situation shouldn't happen in practice and would be the same kind of bug as wrong data being sent, etc. By the time the client picks up the socket, it was already prepared and reserved. It can be unavailable only in case the token is wrong.

I thought "unavailable" can be possible for a reason we cannot
control, such as an unrelated third-party application already opening
the port. Am I mistaken?

We might have misunderstood each other. The CREATOR_SOCKET_UNAVAILABLE constant
is for the case I described. If the socket can't be created for reasons like the
port being taken already, the result would be sent over the msgq. But you're
right this case isn't signalled by a different exception now. The problem lies
within boss, as it sends the same error status for all errors, which should be
improved.

Another cleanup ticket, maybe?

The newline is there to preserve some kind of consistency, and I wasn't sure whether Python might want to do some kind of line-based buffering (well, it probably wouldn't).

Hmm, I'm not sure whether you mean the newline is good or bad by the
Python thing, but as long as we use multi-byte data that's probably
just a matter of taste. More substantially, if we could switch to
1-byte status code (obviously it would exclude a newline), we'll be
able to pass the pair of status code and FD via a single call to
sendmsg()/recvmsg(), which will make error handling simpler.

My point was, if Python did line-oriented buffering, the newline would force it
to flush. Looking back, it seems a crazy idea to think there would be
line-oriented buffering, but as you say, this may turn out to be irrelevant
when we go for the status with sendmsg.

Few notes for the new code:

  • Is there no logging when the request fails for some reason (error response over msgq, CREATOR_SOCKET_UNAVAILABLE, other failures…)?
  • The protocolString function could be used inside the createRequestSocketMessage as well and reuse some code.
  • Is it possible to make gtest produce a warning (similar to the „You have disabled tests“ line at the bottom of the tests) if some of the tests are skipped when it is impossible to do the timeout?
  • When reading the tokens in the child, the hardcoded length is 4. Maybe add a note to the function that it shouldn't be used with tokens of different lengths?
  • What is wrong with ssize_t in 85bf91d5836306a39b471567952ad1e0af91d948? ssize_t is signed, so the error handling should work, and seems cleaner than int, as it is 64-bit on 64-bit systems (therefore it allows sending more than 2GB of data at once, though it is unlikely anyone would do that).

comment:18 in reply to: ↑ 17 ; follow-up: Changed 8 years ago by jinmei

Replying to vorner:

This caused an additional, unexpected issue: if the remote end (in
practice, boss) closes the connection first, requestSocket() would
result in SIGPIPE (and the application would die). Just like we did
for SocketSession, we need to at least ignore that signal. I've
added that code too (btw I suspect this is also the case for the boss -
an app could die after requestSocket(), which would automatically
close the connection). The fact that we did the same trick again is
another reason why we should unify this and SocketSession (but
that's beyond the scope of our current work).

Well, this wouldn't be such a problem, because boss is one of the components
that aren't allowed to crash. And it doesn't close the socket anywhere in its
code. But maybe better to have it there.

I'm afraid I was misunderstood. Assuming your comment is about
"btw I suspect this is also the case for the boss", what I wanted to
point out is that a bug in a client component could make the boss
crash, too.

Anyway, it would be bad to close the connection. That would signal the boss to reclaim all the sockets transferred previously.

Hmm, on further considering this and other bullet points below, I now
think we should probably manage the states of open sockets at the boss
and applications separately from the channel between them

Well, they are bound to the socket they were sent over for a simple reason: we
need to know when an application crashed, to release its sockets. And it seems

I understood that, but it can be done based on the fact that the
application itself dies. In any case, as long as a socket could be in
a broken state, I don't think it acceptable to pretend that it's safe
to keep using it. The choices I can see are: drop everything and
create all necessary sockets from scratch; similar to the previous
one, but just abort the application and restart it; or separate the
socket state and connection state.

an overkill to do so much to handle a possible bug, at least for this stage of
the project, it seems to me.

And anyway, I consider this code to be mostly temporary, for two reasons:

  • If/when we adopt dbus instead of msgq, we can register at the bus to tell the boss when an application that sent a given request disconnects from the boss.
  • If/when we adopt dbus, we might want to get rid of all this low-level error-prone code and send the sockets over the dbus (it doesn't allow sending them on Windows, but our current code doesn't either). And the socket could go directly in the response to the request, avoiding this two-phase exchange.

So, having a note in the code or a low-priority ticket for it is OK, but I'd be
against trying to solve it now or in the next sprint or so.

As a bottom line, I'm not requesting this be addressed within the
branch or by the release. But I'm opposed to doing nothing (beyond
making some comments) about this even in the near future. While I
admit the possibility of this failure mode should be quite small, the
underlying system is sufficiently complicated and we cannot be so
sure. Also, since we don't have a concrete plan of when and how to
replace msgq yet (and in any case I'm not so sure how the replacement
would magically solve all the problems), I'm not comfortable using the
plan as an excuse for not doing anything. Leaving the possibility of
undefined behavior in the code for an undetermined period seems
irresponsible.

Maybe my practical suggestion would be to just abort the application
if such a "should not happen" thing happens, with an assert or
non-catchable exception. If it really doesn't happen, it's just fine;
if it can happen in a practical scenario, it's better to know that
fact sooner. And I insist we should do it sooner rather than later -
if not for the coming release, in the next sprint or so.

Hmm, what do you mean? Like if the boss sent a "2\n" and the socket? Yes, then it would leak, but can we really avoid that in any way?

See the third bullet, for example. If we pass the pair of status (1
byte) and FD via a single sendmsg()/recvmsg() (the status will be in
the 1-byte "normal" data; the FD is passed as ancillary data),
requestSocket() could examine both the FD and status and can close the
socket (e.g.) if FD is bogus.

But if the „status“ byte were bogus, the fd could leak anyway, couldn't it?
If there was one?

The difference is that in the current style recv_fd() can be called
only after we examine the status, and it just wouldn't make sense to
call it if we know the status code is bogus. If we piggy-back the
status in the recvmsg, we can detect the code is bogus while still
retrieving the FD from the ancillary data, then close the FD (all
within recv_fd()).

  • I personally don't think it is the best idea to signal the case of "CREATOR_SOCKET_UNAVAILABLE" by exception. At the very least it should be specifically separated from other exceptions such as the one resulting from bogus status data; otherwise, assuming the app should generally drop the connection in the latter case, we could drop and re-establish the connection unnecessarily

I'd like to point out that such a situation shouldn't happen in practice and would be the same kind of bug as wrong data being sent, etc. By the time the client picks up the socket, it was already prepared and reserved. It can be unavailable only in case the token is wrong.

I thought "unavailable" can be possible for a reason we cannot
control, such as an unrelated third-party application already opening
the port. Am I mistaken?

We might have misunderstood each other. The CREATOR_SOCKET_UNAVAILABLE constant
is for the case I described. If the socket can't be created for reasons like the
port being taken already, the result would be sent over the msgq. But you're
right this case isn't signalled by a different exception now. The problem lies
within boss, as it sends the same error status for all errors, which should be
improved.

Right, I realized I misunderstood it about CREATOR_SOCKET_UNAVAILABLE
after making this comment, and I also understood the main point still
stands as you noted.

Another cleanup ticket, maybe?

I think so, and I think it is quite important because such an error
can easily happen in practice.

Few notes for the new code:

  • Is there no logging when the request fails for some reason (error response over msgq, CREATOR_SOCKET_UNAVAILABLE, other failures…)?

My intent was that in these cases we throw an exception and it will be
logged in an appropriate context at a higher layer. (But when we
separate the error case of failing to create a socket from other rare
failures, it would probably make sense to log the event for the
former).

  • The protocolString function could be used inside the createRequestSocketMessage as well and reuse some code.

Okay, done, and while doing so I noticed that the current code could
accept bogus 'protocol' and 'share mode'. Even though it may be
rejected after exchanging messages with the boss, it would be better
to reject such a case at an earlier point. I added the check and
corresponding tests.

  • Is it possible to make gtest produce a warning (similar to the „You have disabled tests“ line at the bottom of the tests) if some of the tests are skipped when it is impossible to do the timeout?

As far as I know, no. I've quickly looked at the Google Test
documentation but couldn't find anything usable for this purpose.

  • When reading the tokens in the child, the hardcoded length is 4. Maybe add a note to the function that it shouldn't be used with tokens of different lengths?

Good idea, and I actually thought about something like that. Added
some comments.

  • What is wrong with ssize_t in 85bf91d5836306a39b471567952ad1e0af91d948? ssize_t is signed, so the error handling should work, and seems cleaner than int, as it is 64-bit on 64-bit systems (therefore it allows sending more than 2GB of data at once, though it is unlikely anyone would do that).

I simply tried to use the same type as the return value of read() or
write(), but I didn't necessarily have to change it. In any case
using a larger size than int doesn't help anyway because
read()/write() can only return int values. So in either case the
behavior of the code should be exactly the same. If you still want to
revert this part, however, please do so.

comment:19 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:20 in reply to: ↑ 18 ; follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

I'm afraid I was misunderstood. Assuming your comment is about
"btw I suspect this is also the case for the boss", what I wanted to
point out is that a bug in a client component could make the boss
crash, too.

Ah, right, that would be bad. But I believe boss ignores SIGPIPE already for some other reason.

I understood that, but it can be done based on the fact that the
application itself dies. In any case, as long as a socket could be in
a broken state, I don't think it acceptable to pretend that it's safe
to keep using it. The choices I can see are: drop everything and
create all necessary sockets from scratch; similar to the previous
one, but just abort the application and restart it; or separate the
socket state and connection state.

Maybe killing the application for now could be the way to go. If it turns out it happens in practice, we could have a look if it can be solved in a better way, right?

Maybe my practical suggestion would be to just abort the application
if such a "should not happen" thing happens, with an assert or
non-catchable exception. If it really doesn't happen, it's just fine;
if it can happen in a practical scenario, it's better to know that
fact sooner. And I insist we should do it sooner rather than later -
if not for the coming release, in the next sprint or so.

Aborting or doing something similarly simple is OK, but I was afraid you wanted to implement some complicated resynchronisation protocol to work around possible bugs, which, if ever encountered, should be fixed, and we don't know if they exist at all. That would sound like a waste of time to me.

The difference is that in the current style recv_fd() can be called
only after we examine the status, and it just wouldn't make sense to
call it if we know the status code is bogus. If we piggy-back the
status in the recvmsg, we can detect the code is bogus while still
retrieving the FD from the ancillary data, then close the FD (all
within recv_fd()).

But then, calling close on _something_ we got from the ancillary data (if the status is bogus, then the socket may be as well) might lead to closing something completely unrelated. That doesn't sound too good to me either.

Few notes for the new code:

  • Is there no logging when the request fails for some reason (error response over msgq, CREATOR_SOCKET_UNAVAILABLE, other failures…)?

My intent was that in these cases we throw an exception and it will be
logged in an appropriate context at a higher layer. (But when we
separate the error case of failing to create a socket from other rare
failures, it would probably make sense to log the event for the
former).

OK, that is fine, I was just making sure it was not overlooked.

Okay, done, and while doing so I noticed that the current code could
accept bogus 'protocol' and 'share mode'. Even though it may be
rejected after exchanging messages with the boss, it would be better
to reject such a case at an earlier point. I added the check and
corresponding tests.

Well, these are enums, so the application should not be able to create the invalid values in any legal way, but being too paranoid is probably better than being too careless.

  • What is wrong with ssize_t in 85bf91d5836306a39b471567952ad1e0af91d948? ssize_t is signed, so the error handling should work, and seems cleaner than int, as it is 64-bit on 64-bit systems (therefore it allows sending more than 2GB of data at once, though it is unlikely anyone would do that).

I simply tried to use the same type as the return value of read() or
write(), but I didn't necessarily have to change it. In any case
using a larger size than int doesn't help anyway because
read()/write() can only return int values. So in either case the
behavior of the code should be exactly the same. If you still want to
revert this part, however, please do so.

Well, my man 2 read (and write) tells me my read and write return ssize_t. But as I say, the practical difference would be none I guess, so it doesn't really matter.

Thank you

comment:21 in reply to: ↑ 20 Changed 8 years ago by jinmei

Replying to vorner:

I'm afraid I was misunderstood. Assuming your comment is about
"btw I suspect this is also the case for the boss", what I wanted to
point out is that a bug in a client component could make the boss
crash, too.

Ah, right, that would be bad. But I believe boss ignores SIGPIPE already for some other reason.

Ah, okay, you're right about the boss ignoring it.

I understood that, but it can be done based on the fact that the
application itself dies. In any case, as long as a socket could be in
a broken state, I don't think it acceptable to pretend that it's safe
to keep using it. The choices I can see are: drop everything and
create all necessary sockets from scratch; similar to the previous
one, but just abort the application and restart it; or separate the
socket state and connection state.

Maybe killing the application for now could be the way to go. If it turns out it happens in practice, we could have a look if it can be solved in a better way, right?

That works for me.

The difference is that in the current style recv_fd() can be called
only after we examine the status, and it just wouldn't make sense to
call it if we know the status code is bogus. If we piggy-back the
status in the recvmsg, we can detect the code is bogus while still
retrieving the FD from the ancillary data, then close the FD (all
within recv_fd()).

But then, calling close on _something_ we got from the ancillary data (if the status is bogus, then the socket may be as well) might lead to closing something completely unrelated. That doesn't sound too good to me either.

The value returned as the ancillary data for SOL_SOCKET/SCM_RIGHTS
should be more reliable than the 1-byte user data; the former should
be a valid file descriptor unless the kernel has a bug. The latter
could be bogus due to a sender application's bug. If we still want to
cover kernel bugs, I don't see any option other than aborting the
program at this point.

Okay, done, and while doing so I noticed that the current code could
accept bogus 'protocol' and 'share mode'. Even though it may be
rejected after exchanging messages with the boss, it would be better
to reject such a case at an earlier point. I added the check and
corresponding tests.

Well, these are enums, so the application should not be able to create the invalid values in any legal way, but being too paranoid is probably better than being too careless.

Exactly. I added the code to be safe against a buggy or malicious
caller.

Now, unless I'm missing something, there shouldn't be any more open
issues for this ticket, right?

comment:22 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:23 Changed 8 years ago by vorner

  • Resolution set to complete
  • Status changed from reviewing to closed

Thanks everybody, the code is merged, closing.

comment:24 Changed 8 years ago by jinmei

  • Total Hours changed from 10 to 22.83