Opened 8 years ago

Closed 8 years ago

#1705 closed defect (fixed)

attempt to run multiple auth servers causes FATAL [b10-auth.server_common] SRVCOMM_EXCEPTION_ALLOC exception when allocating a socket: File exists

Reported by: jreed Owned by: vorner
Priority: medium Milestone: Sprint-20120306
Component: ~Inter-module communication(obsolete) Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: Core Feature Depending on Ticket:
Estimated Difficulty: 0 Add Hours to Ticket: 0
Total Hours: 2.12 Internal?: no

Description

Maybe this is related to #1703.

2012-02-22 11:20:33.334 DEBUG [b10-auth.cc] CC_GROUP_SEND sending message '{ "command": [ "get_module_spec", { "module_name": "Logging" } ] }' to group 'ConfigManager'
2012-02-22 11:20:33.335 DEBUG [b10-auth.cc] CC_GROUP_RECEIVE trying to receive a message
2012-02-22 11:20:33.337 FATAL [b10-auth.server_common] SRVCOMM_EXCEPTION_ALLOC exception when allocating a socket: File exists
2012-02-22 11:20:33.337 INFO  [b10-boss.boss] BIND10_LOST_SOCKET_CONSUMER consumer 21 of sockets disconnected, considering all its sockets closed
2012-02-22 11:20:33.337 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 23289 of b10-auth-2 ended with status 6
2012-02-22 11:20:33.337 ERROR [b10-boss.boss] BIND10_COMPONENT_FAILED component b10-auth-2 (pid 23289) failed with 6 exit status
2012-02-22 11:20:33.338 FATAL [b10-boss.boss] BIND10_COMPONENT_UNSATISFIED component b10-auth-2 is required to run and failed
2012-02-22 11:20:33.338 INFO  [b10-boss.boss] BIND10_SHUTDOWN stopping the server

My configuration is:

{"version": 2, "Auth": {"listen_on": [{"port": 5300, "address": "127.0.0.1"}]}, "Boss": {"components": {"b10-auth-2": {"kind": "needed", "special": "auth"}, "b10-xfrin": {"kind": "dispensable", "address": "Xfrin"}, "b10-cmdctl": {"kind": "needed", "special": "cmdctl"}, "b10-xfrout": {"kind": "dispensable", "address": "Xfrout"}, "b10-auth": {"kind": "needed", "special": "auth"}, "b10-zonemgr": {"kind": "dispensable", "address": "Zonemgr"}, "b10-stats": {"kind": "dispensable", "address": "Stats"}, "b10-stats-httpd": {"kind": "dispensable", "address": "StatsHttpd"}}}}

Maybe I just didn't configure correctly. I didn't find documentation for this.

Subtickets

Change History (12)

comment:1 Changed 8 years ago by vorner

The exception makes no sense to me. But it fails for me too, it seems there's something rotten in there. Did anything change since we merged the two tickets? And it doesn't really say which file exists :-(.

comment:2 Changed 8 years ago by vorner

  • Owner set to vorner
  • Status changed from new to accepted

Accepting right away as discussed on jabber.

comment:3 Changed 8 years ago by vorner

  • Component changed from b10-auth to Inter-module communication
  • Milestone changed from New Tasks to Sprint-20120306
  • Owner changed from vorner to UnAssigned
  • Status changed from accepted to reviewing
  • Sub-Project changed from DNS to Core

The problem seems to be from the category „You'll probably not believe me, but I'm telling the truth, even if it sounds incredible…“.

So, the file exists error comes from epoll_ctl when adding a new socket from inside an asio tcp acceptor. It means the file descriptor being added to the watcher (or whatever it is) is already there. Which was strange, because the file descriptor was just received from the boss when it was added. However, it turned out that the recvmsg actually does return the same file descriptor multiple times (maybe something inside the kernel gets confused when we are sending the same file descriptor to multiple applications).

I found a workaround that seems to help ‒ I dup the file descriptor after I receive it and close the original one. Now they are not being duplicated, but I fear two things:

  • There's a bug in linux kernel and we should report it.
  • When it can create a duplicate FD to one the recvmsg returned, it might as well hit a different one. I don't know how to check this, but it could be quite a disaster if it did.

And, I'm not sure how to write any kind of test for this. I propose we include this fix now and create a ticket to investigate the kernel or something. I'd be very interesting where this comes from.

As the feature was not in previous release, I don't think it needs a changelog entry (and I wouldn't like to describe the error there in few sentences).

comment:4 Changed 8 years ago by jelte

  • Owner changed from UnAssigned to vorner

Yes that workaround seems to help. Sure is ugly though :/ And if it is not a kernel bug, we should probably at least ask some guru wth is going on here :)

Having tested with adding and removing instances for two hours now, I have not seen the dup() return an existing one (still counts as anecdotal data but if nothing else, the hit window is reduced from always to almost never :p)

So I think we can merge this and go look for a better solution.

comment:5 Changed 8 years ago by jinmei

Why is this still open?

But as it's still open: I noticed one thing:

    int new_fd(dup(fd));
    close(fd);
    return (new_fd);

dup(2) can fail. We should catch that situation.

comment:6 follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

It was still open because I just merged it on friday afternoon and didn't do the usual close-ticket rituals yet.

Yes, you're right about the missing check. I just put the code there to test if it helps and then forgot about it. Dow about this?

comment:7 in reply to: ↑ 6 Changed 8 years ago by jinmei

Replying to vorner:

Yes, you're right about the missing check. I just put the code there to test if it helps and then forgot about it. Dow about this?

Um...what do you mean by 'Dow about this'? Create an open ticket?
Or fix it now on master?

comment:8 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:9 follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jinmei

„Doh“ was a typo, it should have been „How“ ‒ I pushed the change and asked for review, when you pointed it out ;-)

comment:10 in reply to: ↑ 9 Changed 8 years ago by jinmei

Replying to vorner:

„Doh“ was a typo, it should have been „How“ ‒ I pushed the change and asked for review, when you pointed it out ;-)

Oh okay. I didn't realize the branch was updated (or that it even
still existed). With this change if dup() succeeds and close() fails
new_fd will leak. I suggest the following patch instead.

--- a/src/lib/util/io/fd_share.cc
+++ b/src/lib/util/io/fd_share.cc
@@ -110,8 +110,14 @@ recv_fd(const int sock) {
     // It is strange, but the call can return the same file descriptor as
     // one returned previously, even if that one is not closed yet. So,
     // we just re-number every one we get, so they are unique.
-    int new_fd(dup(fd));
-    if (close(fd) == -1 || new_fd == -1) {
+    const int new_fd(dup(fd));
+    const int error = close(fd);
+    if (new_fd == -1 || error == -1) {
+        // even if close has failed above there's anything else we can do
+        // than returning an error at this point.
+        if (new_fd != -1) {
+            close(new_fd); // likewise.  Just return an error if this fails.
+        }
         return (FD_SYSTEM_ERROR);
     }
     return (new_fd);

comment:11 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:12 Changed 8 years ago by vorner

  • Resolution set to fixed
  • Status changed from reviewing to closed
  • Total Hours changed from 0 to 2.12

Thanks, closing

Note: See TracTickets for help on using tickets.