Opened 7 years ago

Closed 5 years ago

#2636 closed defect (wontfix)

stats crash

Reported by: jreed
Owned by: jreed
Priority: medium
Milestone: Remaining BIND10 tickets
Component: statistics
Version: bind10-old
Keywords:
Cc:
CVSS Scoring:
Parent Tickets:
Sensitive: no
Defect Severity: N/A
Sub-Project: Core
Feature Depending on Ticket:
Estimated Difficulty: 5
Add Hours to Ticket: 0
Total Hours: 0
Internal?: no

Description

OPS told me that the Nagios check on the as112 server showed memory usage going up. (But looking at the data below, that happened after the crash.)

I logged in and saw in my logs that the stats server had crashed:

2013-01-15 19:00:00.791 ERROR [b10-auth.auth/95428] AUTH_ZONEMGR_COMMS error communicating with zone manager: ASIO read: data length is not ready
2013-01-15 19:00:01.152 ERROR [b10-auth.cc/95426] CC_TIMEOUT timeout reading data from command channel
2013-01-15 19:00:01.154 ERROR [b10-auth.auth/95426] AUTH_ZONEMGR_COMMS error communicating with zone manager: Timeout while reading data from cc session
2013-01-15 19:00:01.154 ERROR [b10-auth.cc/95426] CC_LENGTH_NOT_READY length not ready
2013-01-15 19:00:01.154 ERROR [b10-auth.auth/95426] AUTH_ZONEMGR_COMMS error communicating with zone manager: ASIO read: data length is not ready
2013-01-15 19:01:06.404 ERROR [b10-auth.cc/95422] CC_TIMEOUT timeout reading data from command channel
2013-01-15 19:01:07.169 ERROR [b10-auth.cc/95424] CC_TIMEOUT timeout reading data from command channel
2013-01-15 19:01:07.269 ERROR [b10-auth.auth/95422] AUTH_ZONEMGR_COMMS error communicating with zone manager: Timeout while reading data from cc session
2013-01-15 19:01:07.269 ERROR [b10-auth.cc/95422] CC_LENGTH_NOT_READY length not ready
2013-01-15 19:01:07.269 ERROR [b10-auth.auth/95422] AUTH_ZONEMGR_COMMS error communicating with zone manager: ASIO read: data length is not ready
2013-01-15 19:01:08.103 ERROR [b10-auth.auth/95424] AUTH_ZONEMGR_COMMS error communicating with zone manager: Timeout while reading data from cc session
2013-01-15 19:01:08.104 ERROR [b10-auth.cc/95424] CC_LENGTH_NOT_READY length not ready
2013-01-15 19:01:08.104 ERROR [b10-auth.auth/95424] AUTH_ZONEMGR_COMMS error communicating with zone manager: ASIO read: data length is not ready
2013-01-15 19:01:12.717 INFO  [b10-boss.boss/95413] BIND10_PROCESS_ENDED process 95423 of b10-stats ended with status 256
2013-01-15 19:01:12.780 ERROR [b10-boss.boss/95413] BIND10_COMPONENT_FAILED component b10-stats (pid 95423) failed: process exited normally with exit status 256
2013-01-15 19:01:12.785 INFO  [b10-boss.boss/95413] BIND10_COMPONENT_START component b10-stats is starting
2013-01-15 19:01:12.792 INFO  [b10-boss.boss/95413] BIND10_STARTING_PROCESS starting process b10-stats
2013-01-15 19:01:13.876 ERROR [b10-auth.cc/95429] CC_TIMEOUT timeout reading data from command channel
2013-01-15 19:01:14.339 ERROR [b10-auth.auth/95429] AUTH_ZONEMGR_COMMS error communicating with zone manager: Timeout while reading data from cc session
2013-01-15 19:01:14.340 ERROR [b10-auth.cc/95429] CC_LENGTH_NOT_READY length not ready
2013-01-15 19:01:14.340 ERROR [b10-auth.auth/95429] AUTH_ZONEMGR_COMMS error communicating with zone manager: ASIO read: data length is not ready
2013-01-15 19:01:15.084 INFO  [b10-stats.stats/78714] STATS_STARTING starting

The redirected bind10 output showed:

Traceback (most recent call last):
  File "/usr/local/lib/python3.1/site-packages/isc/cc/session.py", line 212, in _receive_full_buffer
    self._receive_len_data()
  File "/usr/local/lib/python3.1/site-packages/isc/cc/session.py", line 172, in _receive_len_data
    new_data = self._receive_bytes(self._recv_len_size)
  File "/usr/local/lib/python3.1/site-packages/isc/cc/session.py", line 158, in _receive_bytes
    data = self._socket.recv(size)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/libexec/bind10/b10-stats", line 687, in <module>
    stats.start()
  File "/usr/local/libexec/bind10/b10-stats", line 373, in start
    self.do_polling()
  File "/usr/local/libexec/bind10/b10-stats", line 256, in do_polling
    (answer, env) = self.cc_session.group_recvmsg(False, seq)
  File "/usr/local/lib/python3.1/site-packages/isc/cc/session.py", line 275, in group_recvmsg
    env, msg  = self.recvmsg(nonblock, seq)
  File "/usr/local/lib/python3.1/site-packages/isc/cc/session.py", line 130, in recvmsg
    data = self._receive_full_buffer(nonblock)
  File "/usr/local/lib/python3.1/site-packages/isc/cc/session.py", line 227, in _receive_full_buffer
    raise SessionTimeout("recv() on cc session timed out")
isc.cc.session.SessionTimeout: recv() on cc session timed out

Now "Stats show" reports:

{
    "Auth": {
        "opcode.iquery": 4, 
        "opcode.notify": 1542, 
        "opcode.other": 0, 
        "opcode.query": 11446454195,
        "opcode.status": 0, 
        "opcode.update": 906151246, 
        "queries.tcp": 4529597, 
        "queries.udp": 12348077390, 
        "rcode.badalg": 0, 
        "rcode.badkey": 0, 
        "rcode.badmode": 0, 
        "rcode.badname": 0, 
        "rcode.badsigvers": 0, 
        "rcode.badtime": 0, 
        "rcode.badtrunc": 0, 
        "rcode.formerr": 469435, 
        "rcode.noerror": 35305553, 
        "rcode.notauth": 198, 
        "rcode.notimp": 906151250, 
        "rcode.notzone": 0, 
        "rcode.nxdomain": 11319210406, 
        "rcode.nxrrset": 0, 
        "rcode.other": 0, 
        "rcode.refused": 91402768, 
        "rcode.servfail": 57451, 
        "rcode.yxdomain": 0, 
        "rcode.yxrrset": 0
    },
    "Boss": {
        "boot_time": "2012-12-19T21:41:21Z"
    }, 
    "Stats": {
        "boot_time": "2013-01-15T19:01:14Z", 
        "last_update_time": "2013-01-15T19:32:27Z", 
        "lname": "50f5a77a_1c@ethel.isc.org", 
        "report_time": "2013-01-15T19:32:27Z", 
        "timestamp": 1358278347.888214
    }
}

The end of the dmesg output has:

Limiting icmp unreach response from 25027 to 200 packets/sec
Limiting icmp unreach response from 25088 to 200 packets/sec
Limiting icmp unreach response from 26449 to 200 packets/sec
Limiting icmp unreach response from 28994 to 200 packets/sec
Limiting icmp unreach response from 28885 to 200 packets/sec
Limiting icmp unreach response from 30465 to 200 packets/sec
Limiting icmp unreach response from 3801 to 200 packets/sec
swap_pager_getswapspace(16): failed
swap_pager_getswapspace(8): failed
swap_pager_getswapspace(16): failed
swap_pager_getswapspace(12): failed
swap_pager_getswapspace(16): failed
swap_pager_getswapspace(12): failed
swap_pager_getswapspace(9): failed
swap_pager_getswapspace(5): failed
swap_pager_getswapspace(16): failed
swap_pager_getswapspace(12): failed
swap_pager_getswapspace(9): failed
swap_pager_getswapspace(5): failed
swap_pager_getswapspace(16): failed
swap_pager_getswapspace(12): failed
swap_pager_getswapspace(9): failed
swap_pager_getswapspace(5): failed

/var/log/messages around the time of the crash:

Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(16): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(8): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(16): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(12): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(16): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(12): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(9): failed
Jan 15 19:01:06 ethel kernel: 
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(5): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(16): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(12): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(9): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(5): failed
Jan 15 19:01:06 ethel kernel: 
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(16): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(12): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(9): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(5): failed

Maybe something grew too large?

Subtickets

Change History (19)

comment:1 follow-up: Changed 7 years ago by vorner

My guess is the following happened:

  • Something grew too large (I have no idea what).
  • The system started swapping, and it swapped the boss process out.
  • The stats daemon asked boss for the list of processes over the msgq (that's around line 256, as the traceback indicates).
  • The boss took too long to come back out of swap, and stats timed out waiting for the answer.

I don't think there's a reasonable workaround for this. But it could be solved by the (hopefully) upcoming msgq enhancements: auto-answering "undeliverable", tracking responses in msgq, and using asynchronous reads. The timeout would then be larger, and if it still fired, msgq would just send an error response, so stats would abort this poll instead of crashing. A rough sketch of the stats-side handling is below.
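
As an illustration only (not actual BIND 10 code; the helper name is made up), the stats-side handling could catch the SessionTimeout around the group_recvmsg() call seen in the traceback and give up on that one poll:

import isc.cc.session

def poll_once(cc_session, seq):
    # Hypothetical helper: receive one answer for an outstanding poll request.
    try:
        answer, env = cc_session.group_recvmsg(False, seq)
    except isc.cc.session.SessionTimeout:
        # The recipient (e.g. a swapped-out boss) did not answer in time.
        # Give up on this polling cycle; the next one will try again.
        return None
    return answer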

comment:2 follow-ups: Changed 7 years ago by shane

Surely the cc library should retry on timeout?

Also, is whatever process was using all the memory gone now? (It would be nice to know where the problem came from, of course!)

If not, perhaps we can run an hourly "ps" (or something similar) to a log file via cron and look for growing memory use?

comment:3 in reply to: ↑ 2 ; follow-up: Changed 7 years ago by jreed

Replying to shane:

Also, is whatever process was using all the memory gone now? (It would be nice to know where the problem came from, of course!)

I didn't see anything. There was plenty of real memory and swap when I checked then, and it still looks good today. No large processes.

If not, perhaps we can enable hourly "ps" or something to a log file via cron and look for growing memory use?

I added a cronjob for this.

comment:4 in reply to: ↑ 2 Changed 7 years ago by vorner

Hello

Replying to shane:

Surely the cc library should retry on timeout?

Surely not. The msgq does not lose messages, so if there was a recipient, the request got delivered. If there wasn't one, there probably wouldn't be one at the time of a retry either, so it would be a waste.

If there is a recipient, either the handling takes a long time (and doing it again would probably take a long time too, timing out again and burning CPU), or the recipient just doesn't want to answer.

So a retry doesn't help in either case, and it could make things worse: imagine the message had a side effect, like increasing the number of running auth processes by one (by design, not by accident). Then delivering it twice is not what the administrator asked for.

It could be solved by setting a longer timeout, but before we do so, we need to make sure timeouts don't happen under normal circumstances, e.g. when the recipient isn't there.

Also, is whatever process was using all the memory gone now? (It would be nice to know where the problem came from, of course!)

It could be something outside bind10 as well. Is there some kind of backup running, or something like that?

comment:5 in reply to: ↑ 1 Changed 7 years ago by naokikambe

Replying to vorner:

so stats would just abort this poll and not crash).

Of course, stats might also need to be further revised for this.

I'm curious about what caused so much swapping.

comment:6 Changed 7 years ago by shane

Okay, so the scope of this ticket from a coding point of view is:

  1. Change the cc library so that it doesn't double-throw. That should allow applications to catch the exception.
  2. Revise stats if necessary (it could make the shutdown cleaner and save state if needed).

There's still the issue of what caused the memory to grow, but we can perhaps treat that as a separate ticket.

comment:7 Changed 7 years ago by vorner

Actually, this is not a double-throw as you know it from C++. It catches one exception and raises a different one in its place. That is a completely valid and commonly used pattern in Python, and it can be caught.

The problem is that nobody tries to catch it. And even if it were caught, it's not really clear what to do about it in the current state of the libraries.
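
For illustration, here is a minimal standalone Python sketch (not BIND 10 code; the class below only stands in for isc.cc.session.SessionTimeout) of the pattern the traceback shows: the low-level error is caught and a different exception is raised in its place, and a caller can still catch the new one:

import socket

class SessionTimeout(Exception):
    """Stand-in for isc.cc.session.SessionTimeout."""

def receive():
    try:
        raise socket.timeout("timed out")
    except socket.timeout:
        # Python keeps the original exception as context, which is what
        # produces "During handling of the above exception, another
        # exception occurred" in the traceback above.
        raise SessionTimeout("recv() on cc session timed out")

try:
    receive()
except SessionTimeout:
    print("caught the replacement exception; the poll could be skipped here")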

comment:8 Changed 7 years ago by jwright

  • Milestone changed from New Tasks to Next-Sprint-Proposed

comment:9 Changed 7 years ago by jreed

Also see #2560.

comment:10 Changed 7 years ago by jreed

Also see #2880.

comment:11 in reply to: ↑ 3 Changed 7 years ago by jreed

Replying to jreed:

I added a cronjob for this.

I forgot to check, and later saw that it didn't work. (The command was missing some details, and the output wasn't logged but was sent to email, which was not set up, so it was sitting in a sendmail client mqueue. Note to self: even though % works on the /bin/sh command line, it doesn't work in a crontab entry.)

Anyway, I made a new, better script and cronjob to gather the details.
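
For reference, the crontab gotcha is that an unescaped % in the command field is turned into a newline, and everything after the first % is passed to the command as standard input, so date format strings need backslash escapes. A sketch of an hourly snapshot entry (the paths and filenames here are made up for the example):

# minute hour day month weekday  command
0 * * * * ps auxww >> /var/log/ps-snapshots/ps-$(date +\%Y\%m\%d).log 2>&1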

comment:12 Changed 7 years ago by naokikambe

  • Milestone set to Next-Sprint-Proposed

comment:13 Changed 7 years ago by muks

  • Milestone changed from Next-Sprint-Proposed to Sprint-20130611

comment:14 Changed 7 years ago by muks

  • Milestone changed from Sprint-20130611 to Next-Sprint-Proposed

This ticket needs more breakdown (discussed during sprint planning call). Moving temporarily to Next-Sprint-Proposed, from where it'll go to Previous-Sprint-Proposed.

comment:15 Changed 7 years ago by jinmei

Jeremy, the description of this ticket is too broad. Could you do the following?

  • check if the memory usage issue is still happening, and if so, create a separate ticket only focusing on it.
  • check if the crash still happens, and if so, create a separate ticket.
  • then close this ticket.

Regarding the second point, I suspect it still can happen, but the real issue we should resolve is what to do if a remote module disappears while b10-stats tries to collect statistics from it. Maybe we can provide a cleaner solution after we introduce various improvements to msgq and the CC system, so we might simply defer the whole issue until then, especially if it doesn't happen often.

comment:16 Changed 7 years ago by muks

  • Owner set to jreed
  • Status changed from new to assigned

comment:17 Changed 6 years ago by naokikambe

After #2781, b10-stats no longer falls over when it catches a SessionTimeout: it skips the request, waits for some time, and then asks again. The fix is commit c185e035984740f3914a28f6a93f9d47ae8e75b2.
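
A rough sketch of that skip-and-retry behaviour (assumptions only, not the actual change from #2781; send_request and the wait time are made up for the example):

import time
import isc.cc.session

POLL_RETRY_WAIT = 10  # seconds; an assumed value, not taken from the fix

def poll_with_retry(cc_session, send_request, retries=3):
    # Hypothetical wrapper: send_request() is assumed to issue the poll
    # command over the cc session and return the sequence number to wait on.
    # On a SessionTimeout, skip this attempt, wait a while, and ask again
    # instead of letting the exception terminate b10-stats.
    for _ in range(retries):
        seq = send_request()
        try:
            answer, env = cc_session.group_recvmsg(False, seq)
            return answer
        except isc.cc.session.SessionTimeout:
            time.sleep(POLL_RETRY_WAIT)
    return None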

comment:18 Changed 6 years ago by tomek

  • Milestone set to Remaining BIND10 tickets

comment:19 Changed 5 years ago by tomek

  • Resolution set to wontfix
  • Status changed from assigned to closed
  • Version set to old-bind10

This issue is related to bind10 code that is no longer part of Kea.

If you are interested in the BIND10/Bundy framework or its DNS components, please check http://bundy-dns.de.

Closing ticket.
