Opened 8 years ago

Closed 6 years ago

#1859 closed defect (wontfix)

many auth servers results in Unable to open domain socket on macmini

Reported by: jreed Owned by:
Priority: medium Milestone: Remaining BIND10 tickets
Component: Unclassified Version: bind10-old
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: Medium
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 5 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

I can reproduce the following frequently on the macmini system by attempting to run around 15 b10-auth components:

2012-03-29 07:10:38.818 DEBUG [b10-auth.cc] CC_GROUP_RECEIVED message arrived ('{ "from": "4f746d5e_3@macmini.lab.isc.org", "group": "Boss", "instance": "*", "reply": 2, "seq": 19, "to": "4f746d5e_12@macmini.lab.isc.org", "type": "send" }', '{ "result": [ 0, { "path": "/tmp/sockcreator-SWQsjy/sockcreator", "token": "t3455974382" } ] }')
2012-03-29 07:10:38.818 DEBUG [b10-auth.cc] CC_GROUP_RECEIVED message arrived ('{ "from": "4f746d5e_3@macmini.lab.isc.org", "group": "Boss", "instance": "*", "reply": 2, "seq": 18, "to": "4f746d5e_11@macmini.lab.isc.org", "type": "send" }', '{ "result": [ 0, { "path": "/tmp/sockcreator-SWQsjy/sockcreator", "token": "t3513635662" } ] }')
2012-03-29 07:10:38.819 FATAL [b10-auth.server_common] SRVCOMM_EXCEPTION_ALLOC exception when allocating a socket: Unable to open domain socket /tmp/sockcreator-SWQsjy/sockcreator: Connection refused
2012-03-29 07:10:38.819 FATAL [b10-auth.server_common] SRVCOMM_EXCEPTION_ALLOC exception when allocating a socket: Unable to open domain socket /tmp/sockcreator-SWQsjy/sockcreator: Connection refused
2012-03-29 07:10:38.819 INFO  [b10-boss.boss] BIND10_SOCKET_CREATED successfully created socket 37

Other times I get different failures, like those in #1850 or like this:

2012-03-29 07:13:47.201 ERROR [b10-auth.cc] CC_READ_ERROR error reading data from command channel (End of file.)
2012-03-29 07:13:47.201 DEBUG [b10-auth.cc] CC_DISCONNECT disconnecting from message queue daemon
2012-03-29 07:13:47.202 FATAL [b10-auth.auth] AUTH_SERVER_FAILED server failed: Error while reading data from cc session: End of file.
2012-03-29 07:13:47.202 DEBUG [b10-auth.datasrc] DATASRC_CACHE_DESTROY destroying the hotspot cache

(Also often my sockcreator is left running.)

I used the following script to generate my configuration:

#!/usr/bin/awk -f

BEGIN {
  print "{\"version\": 2,";

  print "\"Auth\": {\"datasources\": [";
  print "{\"zones\": [{\"origin\": \"example\", \"file\": \"tests/smallzone/master.zone.file-canonical\"}], \"type\": \"memory\"}],";
  print "\"listen_on\": [{\"port\": 5300, \"address\": \"127.0.0.1\"}]},";

  print "\"Boss\": {\"components\": {";

  for (i = 1; i <= ARGV[1]; i++) {
    print "\"b10-auth-" i "\": {\"kind\": \"needed\", \"special\": \"auth\"}";
    if (i < ARGV[1]) { print ","; }
  }

  print "}}}";
}

The command line argument is the number of b10-auth components to run.
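
For readers without awk at hand, here is a minimal Python sketch that builds the same structure with the json module. The function name build_config is made up for illustration; the output is equivalent JSON, not byte-for-byte the same text as the awk script prints.

```python
import json


def build_config(num_auth):
    """Return a config dict with num_auth b10-auth components (hypothetical helper)."""
    components = {
        f"b10-auth-{i}": {"kind": "needed", "special": "auth"}
        for i in range(1, num_auth + 1)
    }
    return {
        "version": 2,
        "Auth": {
            "datasources": [
                {
                    "zones": [
                        {
                            "origin": "example",
                            "file": "tests/smallzone/master.zone.file-canonical",
                        }
                    ],
                    "type": "memory",
                }
            ],
            "listen_on": [{"port": 5300, "address": "127.0.0.1"}],
        },
        "Boss": {"components": components},
    }


# Print a configuration with 15 auth components as JSON.
print(json.dumps(build_config(15), indent=1))
```

Building the dict first and serializing once avoids the manual comma handling between components.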

I think this is a macmini or portability issue. I can run 50 of the b10-auth server components on Linux fine.

Subtickets

Change History (10)

comment:1 follow-up: Changed 8 years ago by jinmei

Maybe it's related to the default system limit on the number of
open files? On my MacBook Pro, the default is 256.

In any case, running 50 b10-auth instances doesn't make sense
(although testing extreme cases does make sense if that was the
purpose).
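
The per-process open-file limit mentioned above can be inspected from Python with the standard resource module (Unix only). This is just a sketch for checking the limit; the helper names nofile_limits and describe are made up here.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


soft, hard = nofile_limits()
print(f"open files: soft={describe(soft)} hard={describe(hard)}")
```

The soft limit is the one a process actually runs into; it can be raised up to the hard limit without extra privileges.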

comment:2 in reply to: ↑ 1 ; follow-up: Changed 8 years ago by jreed

Replying to jinmei:

Maybe it's related to the default system limit on the number of
open files? On my MacBook Pro, the default is 256.

I raised it to 10000 and still had the same problems.

In any case, running 50 b10-auth instances doesn't make sense
(although testing extreme cases does make sense if that was the
purpose).

I noticed problems once I started using multiple auths (I first tried 4). But sometimes it worked. I raised it to around 15 to get it to fail consistently. (It sometimes worked even with 15.)

The 50 was on a different system just to quickly verify that it worked elsewhere.

I am trying to find out why a Unix socket would return "Connection refused". Maybe nothing is listening on it any longer? Maybe some queue is temporarily full?
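
As an illustration of the "no longer listening" case only, the following Python sketch binds a Unix stream socket, closes it without removing the socket file, and then tries to connect to the leftover path. It does not claim this is what happened on the macmini; the helper names connect_errno and stale_socket_errno are made up.

```python
import errno
import os
import socket
import tempfile


def connect_errno(path):
    """Try to connect to a Unix stream socket; return 0 or the errno."""
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        client.connect(path)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        client.close()


def stale_socket_errno():
    """Bind a socket, close it without unlinking, then connect to the path."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "sock")
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.close()  # the socket file stays on disk, but nobody listens
    try:
        return connect_errno(path)
    finally:
        os.unlink(path)
        os.rmdir(tmpdir)


print(errno.errorcode[stale_socket_errno()])
```

A path that does not exist at all fails differently (ENOENT), so "Connection refused" suggests the socket file was present at the time of the connect.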

comment:3 in reply to: ↑ 2 ; follow-up: Changed 8 years ago by jreed

From listen():

     The backlog parameter defines the maximum length for the queue of pending
     connections.  If a connection request arrives with the queue full, the
     client may receive an error with an indication of ECONNREFUSED. 

I don't know yet how this is related, if it is.
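
To experiment with the backlog behaviour quoted above, here is a small Python sketch: a listener that never calls accept(), and a number of non-blocking clients that all connect to it. The function name fill_backlog and the outcome labels are made up. What happens to the clients beyond the queue is platform dependent: the man page text allows ECONNREFUSED, while other systems may report that the connect would block instead.

```python
import os
import socket
import tempfile


def fill_backlog(backlog, attempts):
    """Connect `attempts` clients to a listener that never calls accept()."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "sock")
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(backlog)
    clients = []
    outcome = {"connected": 0, "refused": 0, "would_block": 0, "other": 0}
    try:
        for _ in range(attempts):
            client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            client.setblocking(False)  # never hang when the queue is full
            clients.append(client)     # keep it open so the queue stays full
            try:
                client.connect(path)
                outcome["connected"] += 1
            except ConnectionRefusedError:
                outcome["refused"] += 1
            except BlockingIOError:
                outcome["would_block"] += 1
            except OSError:
                outcome["other"] += 1
    finally:
        for client in clients:
            client.close()
        server.close()
        os.unlink(path)
        os.rmdir(tmpdir)
    return outcome


# 15 clients against a backlog of 5, mirroring the numbers in this ticket.
print(fill_backlog(5, 15))
```

Running it with different backlog values on the affected system and on Linux would show whether the two platforms treat a full queue differently.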

comment:4 in reply to: ↑ 3 ; follow-up: Changed 8 years ago by vorner

Hello

Replying to jreed:

     The backlog parameter defines the maximum length for the queue of pending
     connections.  If a connection request arrives with the queue full, the
     client may receive an error with an indication of ECONNREFUSED. 

I don't know yet how this is related, if it is.

I think it is. One possible problem I see is that the boss listens on a Unix domain socket to pass the file descriptors from the socket creator over. Imagine the following:

  • All 15 auths are started at approximately the same time.
  • So they all connect to boss at approximately the same time and want to ask for sockets.
  • The boss picks one and starts handling it (it is not blocking on the socket creator at this moment; that happened before). But as it is Python, it takes some time to get through the data structures and answer.
  • At the same time, 14 others try to connect and the queue gets full.

It could also happen while the boss is blocked requesting another socket from the socket creator (the socket is first „reserved“ over msgq, then sent over the Unix domain socket).

The difference I see: I guess Linux just sets the listen queue to some infinite number no matter what you put there, while we set the listen parameter to 5 in the boss. Could you try setting it to something larger? It is on line 900 of boss.

This still wouldn't explain the msgq failure, which might be something unrelated. Anyway, this listening problem would be solved once we teach the msgq to send file descriptors in-band; we could then get rid of the whole thing with the listening socket in boss.
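
For readers unfamiliar with the mechanism, here is a generic Python sketch of passing an open socket's file descriptor across a Unix domain socket. It is not the BIND 10 boss or socket creator protocol: the function name, the b"token" payload, and the use of socket.socketpair are all made up for illustration. It needs Python 3.9 or later for socket.send_fds and socket.recv_fds.

```python
import socket


def pass_socket_over_unix_pair():
    """Send a bound UDP socket's descriptor across a Unix socketpair."""
    creator_end, auth_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    bound = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    bound.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    try:
        # The payload is an arbitrary label; the descriptor travels as
        # ancillary data next to it.
        socket.send_fds(creator_end, [b"token"], [bound.fileno()])
        data, fds, _flags, _addr = socket.recv_fds(auth_end, 1024, 1)
        received = socket.socket(fileno=fds[0])  # wrap the received descriptor
        try:
            return data, bound.getsockname(), received.getsockname()
        finally:
            received.close()
    finally:
        bound.close()
        creator_end.close()
        auth_end.close()


data, sent_name, received_name = pass_socket_over_unix_pair()
print(data, sent_name == received_name)
```

The receiver ends up with its own descriptor for the same bound socket, which is why the two getsockname() results agree.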

comment:5 in reply to: ↑ 4 ; follow-up: Changed 8 years ago by jreed

Replying to vorner:

The difference I see: I guess Linux just sets the listen queue to some infinite number no matter what you put there, while we set the listen parameter to 5 in the boss. Could you try setting it to something larger? It is on line 900 of boss.

Sorry, I can't find it. Can you please provide a patch or paste the line(s) here?

comment:6 in reply to: ↑ 5 ; follow-up: Changed 8 years ago by jreed

Replying to jreed:

Replying to vorner:

The difference I see: I guess Linux just sets the listen queue to some infinite number no matter what you put there, while we set the listen parameter to 5 in the boss. Could you try setting it to something larger? It is on line 900 of boss.

Sorry, I can't find it. Can you please provide a patch or paste the line(s) here?

Never mind. I was looking at an old version of bind10. I found it, but it doesn't appear to help: I tried 20 and 40 for the listen parameter, and bind10 still exits.

(By the way, I can pretty consistently reproduce everything exiting except b10-sockcreator, which is left running.)

Here is the output of one attempt: http://git.bind10.isc.org/~jreed/bind10.log-trac1859-20120405.txt

comment:7 in reply to: ↑ 6 Changed 8 years ago by jreed

Here is the output of one attempt: http://git.bind10.isc.org/~jreed/bind10.log-trac1859-20120405.txt

For some reason, the logging shows:

2012-04-05 13:11:27.458 INFO  [b10-boss.boss] BIND10_SOCKET_GET requesting socket [::]:53 of type TCP from the creator
2012-04-05 13:11:27.458 DEBUG [b10-auth.cc] 2012-04-05 13:11:27.458 ERROR [b10-boss.boss] BIND10_SOCKET_ERROR error on bind call in the creator: 13/Permission denied
2012-04-05 13:11:27.459 DEBUG 

Where do these port 53 attempts come from? I have:

{"version": 2,
"Auth": {"datasources": [
{"zones": [{"origin": "example", "file": "tests/smallzone/master.zone.file-canonical"}], "type": "memory"}],
"listen_on": [{"port": 5300, "address": "127.0.0.1"}]},
"Boss": {"components": {
"b10-auth-1": {"kind": "needed", "special": "auth"}
,
"b10-auth-2": {"kind": "needed", "special": "auth"}

}}}

comment:8 Changed 8 years ago by shane

  • Defect Severity changed from N/A to Medium
  • Milestone New Tasks deleted

comment:9 Changed 6 years ago by tomek

  • Milestone set to Remaining BIND10 tickets

comment:10 Changed 6 years ago by tomek

  • Resolution set to wontfix
  • Status changed from new to closed
  • Version set to old-bind10

This issue is related to bind10 code that is no longer part of Kea.

If you are interested in BIND10/Bundy framework or its DNS components,
please check http://bundy-dns.de.

Closing ticket.
