Opened 8 years ago

Closed 8 years ago

#1829 closed defect (fixed)

CC Session timeouts / ConfigManager not responding (on OpenBSD)

Reported by: jreed Owned by: jinmei
Priority: medium Milestone: Sprint-20120403
Component: Unclassified Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: Core Feature Depending on Ticket:
Estimated Difficulty: 5 Add Hours to Ticket: 0
Total Hours: 4.5 Internal?: no

Description

When I try to use master (at commit 51b3f091d6a750ad4a65825c4f36bb7c05dddbcb) on OpenBSD by running ./src/bin/bind10/run_bind10.sh --verbose I get the following:

2012-03-22 19:11:30.685 DEBUG [b10-auth.cc] CC_ESTABLISH trying to establish connection with message queue daemon at /home/jreed/src/bind10/msgq_socket
2012-03-22 19:11:30.686 DEBUG [b10-auth.cc] CC_ESTABLISHED successfully connected to message queue daemon
2012-03-22 19:11:34.6212012-03-22 19:11:34.622 ERROR [b10-xfrin.xfrin] XFRIN_UNKNOWN_ERROR unknown error: CC Session timeout waiting for configuration manager
2012-03-22 19:11:34.632 ERROR [b10-xfrout.xfrout] XFROUT_MODULECC_SESSION_ERROR error encountered by configuration/command module: No answer from ConfigManager when asking about Remote module Auth
2012-03-22 19:11:34.643Traceback (most recent call last):
  File "/home/jreed/src/bind10/src/lib/python/isc/cc/session.py", line 203, in _receive_full_buffer
  ERRORINFO [ b10-cmdctl.cmdctl [] b10-boss.bossCMDCTL_COMMAND_ERROR error in command get_module_spec to module ConfigManager: Module 'ConfigManager' not responding] 
BIND10_PROCESS_ENDED process 407 of b10-xfrout ended with status 0
2012-03-22 19:11:34.644 ERROR [b10-boss.boss] BIND10_COMPONENT_FAILED component b10-xfrout (pid 407) failed: process exited normally with exit status 0
2012-03-22 19:11:34.645 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 29542 of b10-xfrin ended with status 0
2012-03-22 19:11:34.645 ERROR [b10-boss.boss] BIND10_COMPONENT_FAILED component b10-xfrin (pid 29542) failed: process exited normally with exit status 0
    self._receive_len_data()
  File "/home/jreed/src/bind10/src/lib/python/isc/cc/session.py", line 163, in _receive_len_data
    new_data = self._receive_bytes(self._recv_len_size)
  File "/home/jreed/src/bind10/src/lib/python/isc/cc/session.py", line 149, in _receive_bytes
    data = self._socket.recv(size)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jreed/src/bind10/src/bin/stats/b10-stats-httpd", line 828, in <module>
    stats_httpd = StatsHttpd()
  File "/home/jreed/src/bind10/src/bin/stats/b10-stats-httpd", line 187, in __init__
    self.open_mccs()
  File "/home/jreed/src/bind10/src/bin/stats/b10-stats-httpd", line 199, in open_mccs
    SPECFILE_LOCATION, self.config_handler, self.command_handler)
  File "/home/jreed/src/bind10/src/lib/python/isc/config/ccsession.py", line 188, in __init__
    self._session = Session(socket_file)
  File "/home/jreed/src/bind10/src/lib/python/isc/cc/session.py", line 57, in __init__
    env, msg = self.recvmsg(False)
  File "/home/jreed/src/bind10/src/lib/python/isc/cc/session.py", line 121, in recvmsg
    data = self._receive_full_buffer(nonblock)
  File "/home/jreed/src/bind10/src/lib/python/isc/cc/session.py", line 218, in _receive_full_buffer
    raise SessionTimeout("recv() on cc session timed out")
isc.cc.session.SessionTimeout: recv() on cc session timed out
Exception AttributeError: "'ModuleCCSession' object has no attribute '_session'" in <bound method ModuleCCSession.__del__ of <isc.config.ccsession.ModuleCCSession object at 0x20258f150>> ignored
2012-03-22 19:11:34.664 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 5312 of b10-stats-httpd ended with status 256
2012-03-22 19:11:34.664 ERROR [b10-boss.boss] BIND10_COMPONENT_FAILED component b10-stats-httpd (pid 5312) failed: process exited normally with exit status 256
2012-03-22 19:11:34.691 ERROR [b10-auth.cc] CC_TIMEOUT timeout reading data from command channel
2012-03-22 19:11:34.692 DEBUG [b10-auth.cc] CC_DISCONNECT disconnecting from message queue daemon
2012-03-22 19:11:34.692 FATAL [b10-auth.auth] AUTH_SERVER_FAILED server failed: Timeout while reading data from cc session
2012-03-22 19:11:34.693 DEBUG [b10-auth.datasrc] DATASRC_CACHE_DESTROY destroying the hotspot cache
2012-03-22 19:11:34.696 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 25879 of b10-auth ended with status 256
2012-03-22 19:11:34.696 ERROR [b10-boss.boss] BIND10_COMPONENT_FAILED component b10-auth (pid 25879) failed: process exited normally with exit status 256
2012-03-22 19:11:34.696 FATAL [b10-boss.boss] BIND10_COMPONENT_UNSATISFIED component b10-auth is required to run and failed
2012-03-22 19:11:34.697 INFO  [b10-boss.boss] BIND10_SHUTDOWN stopping the server

Earlier with same code I was able to run it (not usable though) and received:

2012-03-22 16:55:27.550 ERROR [b10-zonemgr.zonemgr] ZONEMGR_CCSESSION_ERROR command channel session error: No answer from ConfigManager when asking about Remote module Logging

and other failures and tracebacks, but bind10 kept running. I was able to run it earlier and even connect bindctl to cmdctl. bindctl crashed when used after that though; whenever sending a command (like help or Stats show):

 File "/home/jreed/src/bind10/src/lib/python/isc/config/ccsession.py", line 508, in request_current_config
    raise ModuleCCSessionError("Bad config version")
isc.config.ccsession.ModuleCCSessionError: Bad config version

Subtickets

Attachments (1)

session.py.diff (622 bytes) - added by jinmei 8 years ago.

Download all attachments as: .zip

Change History (13)

comment:1 Changed 8 years ago by jinmei

I figured this out.

On OpenBSD, it appears when a process sends data larger than the
socket send buffer, it doesn't block (as long as some of the data can
be sent) and immediately returns after sending as many bytes as
possible. But the Python cc.session module assumes all data should be
sent (except in case of error), so it can result in partial write
without triggering an exception or error log.

(As a related note, OpenBSD's default send buffer size seems to be
quite small: 4096 bytes on my test system. But simply increasing the
size wouldn't be a reliable solution anyway).

What specifically happened in the ticket's scenario is that cfgmgr
tried to send relatively large data, which resulted in partial write,
and msgq blocks on read for the rest of the data.

So, the attached patch solves this problem (msgq internally does a
similar thing, so it's safe from this problem). (Yes I wrote the
patch without a test - blame me:-)

From a quick look, the C++ version has the same problem (so sending
statistics from auth to stats may cause a trouble, too), and also for
the message passing from auth to xfrout. We should probably check and
fix these, but at least I confirmed that I could start the bind10
system on OpenBSD and had it respond to a version.bind query.

What should we do about this? This ticket is not even in the current
sprint, but I personally think it's great if we can add OpenBSD to
the supported system list in the next release.

Changed 8 years ago by jinmei

comment:2 follow-up: Changed 8 years ago by jreed

The patch works for me:

> Stats show Stats
{
    "Stats": {
        "boot_time": "2012-03-23T17:06:19Z", 
        "last_update_time": "2012-03-23T17:06:19Z", 
        "lname": "4f6cad8b_5@openbsd5-64.lab.isc.org", 
        "report_time": "2012-03-23T17:06:52Z", 
        "timestamp": 1332522412.812677
    }
}

comment:3 in reply to: ↑ 2 Changed 8 years ago by jreed

Replying to jreed:

The patch works for me:

Boss shutdown is very noisy though. (Maybe different ticket for this.)

> Boss shutdown
2012-03-23 17:10:08.470 INFO >  [b10-boss.boss] BIND10_SHUTDOWN stopping the server
2012-03-23 17:10:08.474 INFO  [b10-boss.boss] BIND10_CONFIGURATOR_STOP bind10 component configurator is shutting down
2012-03-23 17:10:08.475 INFO  [b10-boss.boss] BIND10_COMPONENT_STOP component Socket creator is being stopped
2012-03-23 17:10:08.476 INFO  [b10-boss.boss] BIND10_SOCKCREATOR_TERMINATE terminating socket creator
2012-03-23 17:10:08.477 INFO  [b10-boss.boss] BIND10_COMPONENT_STOP component msgq is being stopped
2012-03-23 17:10:08.477 INFO  [b10-boss.boss] BIND10_COMPONENT_STOP component b10-xfrin is being stopped
2012-03-23 17:10:08.478 INFO  [b10-boss.boss] BIND10_STOP_PROCESS asking b10-xfrin to shut down
2012-03-23 17:10:08.491[b10-msgq] Closing socket fd 14
 [b10-msgq] Receive error: EOF
INFO  [[b10-msgq] Closing socket fd 16
b10-boss.boss[b10-msgq] Receive error: EOF
] BIND10_COMPONENT_STOP component b10-cmdctl is being stopped
2012-03-23 17:10:08.494 INFO  [b10-boss.boss] BIND10_STOP_PROCESS asking b10-cmdctl to shut down
2012-03-23 17:10:08.509[b10-msgq] Closing socket fd 17
 [b10-msgq] Receive error: EOF
INFO [b10-msgq] Closing socket fd 18
 [[b10-msgq] Got read on Strange Socket fd 18
b10-boss.boss] BIND10_COMPONENT_STOP component b10-xfrout is being stopped
2012-03-23 17:10:08.511 INFO  [b10-boss.boss] BIND10_STOP_PROCESS asking b10-xfrout to shut down
2012-03-23 17:10:08.5132012-03-23 17:10:08.513  INFOINFO   [ [b10-xfrout.xfroutb10-boss.boss] ] XFROUT_RECEIVED_SHUTDOWN_COMMAND shutdown command receivedBIND10_COMPONENT_STOP component b10-auth is being stopped

2012-03-23 17:10:08.514 INFO  [b10-boss.boss] BIND10_STOP_PROCESS asking b10-auth to shut down
2012-03-23 17:10:08.517 [b10-msgq] Closing socket fd 21
INFO[b10-msgq] Receive error: EOF
 [b10-msgq] Closing socket fd 22
 [[b10-msgq] Receive error: EOF
b10-boss.boss] [b10-msgq] Closing socket fd 20
BIND10_COMPONENT_STOP component cfgmgr is being stopped[b10-msgq] Receive error: EOF

2012-03-23 17:10:08.520 INFO  [b10-boss.boss] BIND10_STOP_PROCESS asking cfgmgr to shut down
2012-03-23 17:10:08.541[b10-msgq] Closing socket fd 9
 [b10-msgq] Receive error: EOF
INFO[b10-msgq] Closing socket fd 15
 [b10-msgq] Receive error: EOF
 [b10-boss.boss] BIND10_COMPONENT_STOP component b10-zonemgr is being stopped
2012-03-23 17:10:08.550 INFO  [b10-boss.boss] BIND10_STOP_PROCESS asking b10-zonemgr to shut down
2012-03-23 17:10:08.553[b10-msgq] Closing socket fd 11
 [b10-msgq] Receive error: EOF
INFO[b10-msgq] Closing socket fd 13
 [b10-msgq] Receive error: EOF
 [b10-boss.boss] BIND10_COMPONENT_STOP component b10-stats is being stopped
2012-03-23 17:10:08.565 INFO  [b10-boss.boss] BIND10_STOP_PROCESS asking b10-stats to shut down
2012-03-23 17:10:08.5662012-03-23 17:10:08.567  INFOINFO   [ [b10-stats.statsb10-boss.boss] ] STATS_RECEIVED_SHUTDOWN_COMMAND shutdown command receivedBIND10_COMPONENT_STOP component b10-stats-httpd is being stopped

2012-03-23 17:10:08.567 INFO  [b10-boss.boss] BIND10_STOP_PROCESS asking b10-stats-httpd to shut down[b10-msgq] Closing socket fd 12

[b10-msgq] Receive error: EOF
2012-03-23 17:10:08.582 INFO  [b10-stats-httpd.stats-httpd] STATHTTPD_SHUTDOWN shutting down
2012-03-23 17:10:08.583 INFO  [b10-stats-httpd.stats-httpd] STATHTTPD_CLOSING closing 127.0.0.1#8000
[b10-msgq] Closing socket fd 19
[b10-msgq] Receive error: EOF
2012-03-23 17:10:09.593 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 18461 of b10-auth ended with status 0
2012-03-23 17:10:09.593 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 4316 of b10-stats-httpd ended with status 0
2012-03-23 17:10:09.594 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 12013 of b10-xfrout ended with status 0
2012-03-23 17:10:09.594 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 20200 of b10-cmdctl ended with status 0
2012-03-23 17:10:09.594 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 20594 of b10-xfrin ended with status 0
2012-03-23 17:10:09.595 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 15693 of b10-stats ended with status 0
2012-03-23 17:10:09.596 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 5340 of b10-zonemgr ended with status 0
2012-03-23 17:10:09.597 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 19768 of cfgmgr ended with status 0
2012-03-23 17:10:09.598 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 14219 of Socket creator ended with status 0
2012-03-23 17:10:09.598 INFO  [b10-boss.boss] BIND10_SEND_SIGTERM sending SIGTERM to msgq (PID 6489)
2012-03-23 17:10:09.703 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 6489 of msgq ended with status 0
2012-03-23 17:10:09.703 INFO  [b10-boss.boss] BIND10_SHUTDOWN_COMPLETE all processes ended, shutdown complete

Failed to send request, the connection is closed
[1] + Done                 src/bin/bind10/run_bind10.sh --verbose 

comment:4 Changed 8 years ago by jinmei

(as discussed on jabber tentatively moving it to the current sprint)

comment:5 Changed 8 years ago by jinmei

  • Milestone changed from New Tasks to Sprint-20120403

comment:6 Changed 8 years ago by jinmei

  • Owner set to jinmei
  • Status changed from new to accepted

comment:7 Changed 8 years ago by jinmei

trac1829 is ready for review (this time I wrote a test first:-)

I also made it a bit more reliable by ensuring header part can also be
sent incrementally if necessary (it's less likely to be necessary
in practice, but can still happen if, e.g. the other end is too busy).

Another note, not directly related to this ticket: I checked the same
scenario with test code written in C. It didn't happen. I guess
it's due to something internal to (OpenBSD's?) Python network/socket
library. So we don't have to update the C++ version (at least it's
not urgent).

This is proposed changelog:

409.?	[bug]		jinmei
	Python CC library now ensures write operations transmit all given
	data (unless an error happens).  Previously it didn't check the
	size of transmitted data, which could result in partial write on
	some systems (notably on OpenBSD) and subsequently cause system
	hang up or other broken state.  This fix specifically solves start
	up failure on OpenBSD.
	(Trac #1829, git TBD)

comment:8 Changed 8 years ago by jinmei

  • Owner changed from jinmei to UnAssigned
  • Status changed from accepted to reviewing

comment:9 Changed 8 years ago by jelte

  • Owner changed from UnAssigned to jelte

comment:10 follow-up: Changed 8 years ago by jelte

  • Owner changed from jelte to jinmei

looks good, please merge :)

comment:11 in reply to: ↑ 10 Changed 8 years ago by jinmei

Replying to jelte:

looks good, please merge :)

Thanks for the review. Merge done, closing.

comment:12 Changed 8 years ago by jinmei

  • Estimated Difficulty changed from 0 to 5
  • Resolution set to fixed
  • Status changed from reviewing to closed
  • Total Hours changed from 0 to 4.5

(giving an estimation of 5 at my discretion)

Note: See TracTickets for help on using tickets.