Opened 7 years ago

Closed 5 years ago

#2560 closed defect (wontfix)

stats crash on cc timeout

Reported by: jreed Owned by:
Priority: medium Milestone: Remaining BIND10 tickets
Component: Unclassified Version: bind10-old
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: Low
Sub-Project: Core Feature Depending on Ticket:
Estimated Difficulty: 2 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

The following is from the last release (20121115). The bind10 parent was suspended on purpose to test something else, and then b10-stats crashed:

Traceback (most recent call last):
  File "/home/jreed/dnsbench/work/origin/bind10-20121115-release/20121126190544/install/lib/python3.1/site-packages/isc/cc/session.py", line 212, in _receive_full_buffer
    self._receive_len_data()
  File "/home/jreed/dnsbench/work/origin/bind10-20121115-release/20121126190544/install/lib/python3.1/site-packages/isc/cc/session.py", line 172, in _receive_len_data
    new_data = self._receive_bytes(self._recv_len_size)
  File "/home/jreed/dnsbench/work/origin/bind10-20121115-release/20121126190544/install/lib/python3.1/site-packages/isc/cc/session.py", line 158, in _receive_bytes
    data = self._socket.recv(size)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jreed/dnsbench/work/origin/bind10-20121115-release/20121126190544/install/libexec/bind10-devel/b10-stats", line 687, in <module>
    stats.start()
  File "/home/jreed/dnsbench/work/origin/bind10-20121115-release/20121126190544/install/libexec/bind10-devel/b10-stats", line 373, in start
    self.do_polling()
  File "/home/jreed/dnsbench/work/origin/bind10-20121115-release/20121126190544/install/libexec/bind10-devel/b10-stats", line 256, in do_polling
    (answer, env) = self.cc_session.group_recvmsg(False, seq)
  File "/home/jreed/dnsbench/work/origin/bind10-20121115-release/20121126190544/install/lib/python3.1/site-packages/isc/cc/session.py", line 275, in group_recvmsg
    env, msg  = self.recvmsg(nonblock, seq)
  File "/home/jreed/dnsbench/work/origin/bind10-20121115-release/20121126190544/install/lib/python3.1/site-packages/isc/cc/session.py", line 130, in recvmsg
    data = self._receive_full_buffer(nonblock)
  File "/home/jreed/dnsbench/work/origin/bind10-20121115-release/20121126190544/install/lib/python3.1/site-packages/isc/cc/session.py", line 227, in _receive_full_buffer
    raise SessionTimeout("recv() on cc session timed out")
isc.cc.session.SessionTimeout: recv() on cc session timed out
[b10-msgq] Closing socket fd 10
[b10-msgq] Receive error: EOF

This is repeatable.

b10-stats should not crash with a noisy traceback if something is temporarily unavailable.

Change History (9)

comment:1 Changed 7 years ago by jwright

  • Defect Severity changed from N/A to Medium
  • Milestone changed from New Tasks to Next-Sprint-Proposed

comment:2 Changed 7 years ago by jwright

  • Defect Severity changed from Medium to Low

comment:3 Changed 7 years ago by naokikambe

A quick proposed patch:

  • src/bin/stats/stats.py.in

    diff --git a/src/bin/stats/stats.py.in b/src/bin/stats/stats.py.in
    index 7123c53..789c79d 100755
    --- a/src/bin/stats/stats.py.in
    +++ b/src/bin/stats/stats.py.in
    @@ -688,7 +688,8 @@ if __name__ == "__main__":
         except OptionValueError as ove:
             logger.fatal(STATS_BAD_OPTION_VALUE, ove)
             sys.exit(1)
    -    except isc.cc.session.SessionError as se:
    +    except (isc.cc.session.SessionError,
    +            isc.cc.session.SessionTimeout) as se:
             logger.fatal(STATS_CC_SESSION_ERROR, se)
             sys.exit(1)
         except StatsError as se:

I think that error should be caught in __main__. Alternatively, would it be better for SessionTimeout to inherit from SessionError?
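The inheritance alternative can be illustrated with a minimal sketch. The classes below are stand-ins defined only for this example, not the real isc.cc.session classes, and run() is a hypothetical helper. It assumes the timeout class is made a subclass of the generic session error, in which case the existing single except clause would already cover timeouts:

```python
# Stand-in exception classes for illustration only; they are NOT
# imported from isc.cc.session.
class SessionError(Exception):
    """Stand-in for a generic cc session failure."""


class SessionTimeout(SessionError):
    """Stand-in for a timeout, modelled here as a subclass of SessionError."""


def run(start):
    """Call start() and turn session failures into an exit status.

    Hypothetical helper: returns 1 on any SessionError (including
    SessionTimeout, because of the inheritance above), else 0.
    """
    try:
        start()
    except SessionError as se:
        print("session error: %s" % se)
        return 1
    return 0


def timing_out_start():
    """Simulate the failure from the traceback in the description."""
    raise SessionTimeout("recv() on cc session timed out")


exit_status = run(timing_out_start)
```

With this hierarchy no call site needs to list both exception types, at the cost of no longer being able to tell a timeout from other session errors unless SessionTimeout is caught first.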

BTW, is suspending a normal operation? If stats shuts down even once, the collected statistics are lost. I'm not sure it is OK for stats to die when boss is suspended. Or should stats wait and do nothing until boss is woken up?
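The "wait and do nothing" option could look roughly like the sketch below. Everything in it is an assumption made for illustration: SessionTimeout is a stand-in class, and poll_with_retry() and its parameters are hypothetical names, not part of b10-stats or isc.cc.session. The idea is to retry the receive a bounded number of times instead of letting the first timeout end the process:

```python
import time


class SessionTimeout(Exception):
    """Stand-in for isc.cc.session.SessionTimeout (illustration only)."""


def poll_with_retry(recv, max_attempts=5, delay=1.0, sleep=time.sleep):
    """Call recv() until it succeeds, retrying only on SessionTimeout.

    Hypothetical helper: returns whatever recv() returns, or None if
    every attempt timed out. Other exceptions propagate unchanged.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return recv()
        except SessionTimeout:
            # Peer is unavailable for now; wait before trying again,
            # but do not sleep after the final attempt.
            if attempt < max_attempts:
                sleep(delay)
    return None


# Simulated receiver that times out twice and then answers.
calls = {"count": 0}


def flaky_recv():
    calls["count"] += 1
    if calls["count"] < 3:
        raise SessionTimeout("recv() on cc session timed out")
    return "answer"


result = poll_with_retry(flaky_recv, max_attempts=5, delay=0.0)
```

A bounded retry keeps the already collected statistics in memory while the peer is unavailable; whether the process should eventually give up or wait indefinitely is a policy question this sketch does not settle.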

comment:4 follow-up: Changed 7 years ago by vorner

I don't see what the „correct“ behaviour should be here (unless it's about
producing a more user-friendly error message, in which case I don't think it's
such a critical ticket).

But we do need to fix the message queue system eventually and this was one of
the proposed fixes (not just for msgq, but in general). I think we should not
fix these one by one, mostly because there are so many instances of this.

comment:5 in reply to: ↑ 4 Changed 7 years ago by jinmei

Replying to vorner:

But we do need to fix the message queue system eventually and this was one of
the proposed fixes (not just for msgq, but in general). I think we should not
fix these one by one, mostly because there are so many instances of this.

I tend to +1 this. If it only happens when one does something "on
purpose", I'd rather avoid pasting on another ad hoc bandaid.

comment:6 Changed 7 years ago by jreed

Also see #2636.

comment:7 Changed 7 years ago by jreed

  • Milestone set to Next-Sprint-Proposed

Also see #2880

comment:8 Changed 6 years ago by tomek

  • Milestone set to Remaining BIND10 tickets

comment:9 Changed 5 years ago by tomek

  • Resolution set to wontfix
  • Status changed from new to closed
  • Version set to old-bind10

This issue is related to bind10 code that is no longer part of Kea.

If you are interested in the BIND10/Bundy framework or its DNS components,
please check http://bundy-dns.de.

Closing ticket.
