Opened 8 years ago

Closed 8 years ago

Last modified 8 years ago

#1847 closed defect (invalid)

CC timeouts in auth

Reported by: jreed Owned by:
Priority: medium Milestone:
Component: Unclassified Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 2 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

I started receiving these while doing some performance load testing on two different systems:

2012-03-27 06:21:26.845 ERROR [b10-auth.cc] CC_TIMEOUT timeout reading data from command channel
2012-03-27 06:21:26.846 ERROR [b10-auth.auth] AUTH_STATS_TIMEOUT timeout while sending statistics data: Timeout while reading data from cc session

I don't recall them from earlier tests so maybe something was introduced or changed related to this in the last month or so.

I didn't research this yet, but opening the ticket so it is no forgot.

Subtickets

Change History (15)

comment:1 Changed 8 years ago by jreed

  • Summary changed from CC timeouts when auth is under a heavy load to CC timeouts in auth

Actually I am now seeing this without any load. I think this is a new regression related to msqg. Also see ticket #1848.

comment:2 Changed 8 years ago by jreed

Also I saw this output:

Traceback (most recent call last):
  File "/Local/Users/jreed/dnsbench/work/master/20120327130322/install/libexec/bind10-devel/b10-msgq", line 551, in <module>
    msgq.run()
  File "/Local/Users/jreed/dnsbench/work/master/20120327130322/install/libexec/bind10-devel/b10-msgq", line 460, in run
    self.run_poller()
  File "/Local/Users/jreed/dnsbench/work/master/20120327130322/install/libexec/bind10-devel/b10-msgq", line 479, in run_poller
    self.__process_write(fd)
  File "/Local/Users/jreed/dnsbench/work/master/20120327130322/install/libexec/bind10-devel/b10-msgq", line 391, in __process_write
    (_, msg) = self.sendbuffs[fileno]
KeyError: 25930328

(I may have copy and pasted some of that wrong as it was hidden mixed in many lines of other output.)

comment:3 Changed 8 years ago by jreed

I am getting these timeouts on multiple systems.

Also another problem is (when configured with four b10-auth components), it will only log that two or three started:

2012-03-27 11:58:28.619 INFO  [b10-auth.auth] AUTH_SERVER_STARTED server started
2012-03-27 11:58:28.650 INFO  [b10-auth.auth] AUTH_SERVER_STARTED server started

but ps shows me four were started at that time.

comment:4 Changed 8 years ago by jreed

I tried reversing f20a049cb0e6b29464acec69d26a46f26b7e2170 (ticket #1829) and reinstalled session.py but problems continue.

comment:5 Changed 8 years ago by jreed

I tried a master from a few days ago (before #1829, #1172, #1775, #1688) and it also has the problems.

comment:6 Changed 8 years ago by jreed

The AUTH_STATS_TIMEOUT problem is due to me no running b10-stats collector. I guess that is related to CC_TIMEOUT but don't understand that part.

As for the missing logging and the traceback in b10-msgq shown above, I am not sure yet.

comment:7 Changed 8 years ago by jreed

  • Resolution set to invalid
  • Status changed from new to closed

I opened a new ticket about the traceback. I will open a different ticket about missing logging. I will open a different ticket (if doesn't exist yet) about stats communication should be reversed (that is no stats if no stats collector). I am closing this ticket.

comment:8 follow-up: Changed 8 years ago by jelte

I have seen this one as well where I don't run Stats. It is not so much a timeout issue; the problem is that stats collection is a 'push' method; modules push their stats regularly to the Stats module. Since they expect the module to answer, and they do not check whether it is running, this results in a timeout (noone is answering). The easy fix would be to not expect a response. A better solution might be to make it a pull action ('getstats' or something, initiated by the stats collector)

comment:9 in reply to: ↑ 8 ; follow-up: Changed 8 years ago by jinmei

Replying to jelte:

I have seen this one as well where I don't run Stats. It is not so much a timeout issue; the problem is that stats collection is a 'push' method; modules push their stats regularly to the Stats module. Since they expect the module to answer, and they do not check whether it is running, this results in a timeout (noone is answering). The easy fix would be to not expect a response. A better solution might be to make it a pull action ('getstats' or something, initiated by the stats collector)

I'd actually suggest revisiting the IPC method for statistics at more
fundamental level. Blocking/timeout issue aside, I suspect the CC
channel is not appropriate for exchanging statistics information due
to scalability issues. I assume we'd eventually (soon?) like to be
able to have per-zone statistics, while still allowing millions of
zones to be served. If there's a reasonable way of using the CC
channel to exchange all of the statistics, that's fine, but I suspect
we need to use a different method for this purpose.

comment:10 in reply to: ↑ 9 ; follow-up: Changed 8 years ago by vorner

Hello

Replying to jinmei:

I'd actually suggest revisiting the IPC method for statistics at more
fundamental level. Blocking/timeout issue aside, I suspect the CC
channel is not appropriate for exchanging statistics information due
to scalability issues. I assume we'd eventually (soon?) like to be
able to have per-zone statistics, while still allowing millions of
zones to be served. If there's a reasonable way of using the CC
channel to exchange all of the statistics, that's fine, but I suspect
we need to use a different method for this purpose.

I'm not sure what you mean. If it's some way to push only diffs of the statistics, I think we want that sometime ‒ if we have millions of zones, its unlikely all of them would be accessed every 10 seconds or like that, so it makes little sense to send them all. Having a poll principle instead of push or notification or something would seem better too.

But if you mean creating YAUDS (Yet Another Unix Domain Socket) for the purpose of statistics, I'd be strongly against that. I believe the CC channel should be checked and made sure it can handle many and large messages to cope with the load. Having millions of zones would create messages several megabytes large, but these should pose no problem for the system.

comment:11 Changed 8 years ago by naokikambe

Basically I agree with that polling model rather than the current pushing model. BTW the current auth seems to ignore the result even if AUTH_STATS_TIMEOUT happens. I think it can at least check the returned value from AuthCountersImpl::submitStatistics(), which currently returns false if AUTH_STATS_TIMEOUT happens, and stop the timer if the value is false (or if the value has been false for some times). IMO a proposed change is like this:

diff --git a/src/lib/asiolink/interval_timer.cc b/src/lib/asiolink/interval_timer.cc
index 9873e9b0..78b0bcf 100644
--- a/src/lib/asiolink/interval_timer.cc
+++ b/src/lib/asiolink/interval_timer.cc
@@ -121,7 +121,9 @@ IntervalTimerImpl::callback(const asio::error_code& ec) {
         // Set next expire time.
         update();
         // Invoke the call back function.
-        cbfunc_();
+        // If the call back function returns false, then the timer
+        // should be cancelled.
+        if (!cbfunc_()) cancel();
     }
 }

Of course, I'm sure other changes are required by this change.

comment:12 in reply to: ↑ 10 ; follow-up: Changed 8 years ago by jinmei

Replying to vorner:

I'd actually suggest revisiting the IPC method for statistics at more
fundamental level. Blocking/timeout issue aside, I suspect the CC
channel is not appropriate for exchanging statistics information due
to scalability issues. I assume we'd eventually (soon?) like to be
able to have per-zone statistics, while still allowing millions of
zones to be served. If there's a reasonable way of using the CC
channel to exchange all of the statistics, that's fine, but I suspect
we need to use a different method for this purpose.

I'm not sure what you mean. If it's some way to push only diffs of the statistics, I think we want that sometime ‒ if we have millions of zones, its unlikely all of them would be accessed every 10 seconds or like that, so it makes little sense to send them all. Having a poll principle instead of push or notification or something would seem better too.

But if you mean creating YAUDS (Yet Another Unix Domain Socket) for the purpose of statistics, I'd be strongly against that. I believe the CC channel should be checked and made sure it can handle many and large messages to cope with the load. Having millions of zones would create messages several megabytes large, but these should pose no problem for the system.

I was not proposing anything specific at this moment, and in any event
it wouldn't be YAUDS (depending on what exactly and specifically you
mean YAUDS). One possibility I had in my mind is that b10-auth stores
statistics in a separate storage such as an SQLite3 DB file, maybe in
a periodic batch task. b10-stats or any other statistics clients
would then retrieve the stored statistics values from that storage
separately.

I'm not necessarily excluding the use of the common message bus,
however, as long as we can exchange necessary information in a
reasonably timely fashion without disrupting the main server of
b10-auth (i.e., responding to queries).

comment:13 in reply to: ↑ 12 ; follow-up: Changed 8 years ago by naokikambe

Replying to jinmei:

One possibility I had in my mind is that b10-auth stores
statistics in a separate storage such as an SQLite3 DB file, maybe in
a periodic batch task. b10-stats or any other statistics clients
would then retrieve the stored statistics values from that storage
separately.

IMO stats might periodically pull out statistics from each module via a CC session, and store them into an SQLite3 DB together with a timestamp. Because maybe auth is very busy to answering DNS queries, and its asynchronous writing to the SQLite3 DB is likely to be very difficult. I'm afraid it might be possible to affect its response performance. That situations of auth are definitely in my image. If we complete replacing the current msgq with another better implement, couldn't that issue be problem, which is transmitting large bytes in the CC channel?

Thanks

comment:14 in reply to: ↑ 13 Changed 8 years ago by jinmei

Replying to naokikambe:

One possibility I had in my mind is that b10-auth stores
statistics in a separate storage such as an SQLite3 DB file, maybe in
a periodic batch task. b10-stats or any other statistics clients
would then retrieve the stored statistics values from that storage
separately.

IMO stats might periodically pull out statistics from each module via a CC session, and store them into an SQLite3 DB together with a timestamp. Because maybe auth is very busy to answering DNS queries, and its asynchronous writing to the SQLite3 DB is likely to be very difficult. I'm afraid it might be possible to affect its response performance. That situations of auth are definitely in my image. If we complete replacing the current msgq with another better implement, couldn't that issue be problem, which is transmitting large bytes in the CC channel?

I wouldn't expect too much from an unrealized magic technology:-)

I believe we all share the same goal: exchanging statistic information
from a (possibly very busy) server module and a statistics collector
without disrupting the main service of the server. It just seems to
me that opinions/views vary about where the major bottleneck that may
cause the disruption is.

To me, explicit inter-process communication, especially for a large
amount of data, is much riskier in that sense. The other end might
also be busy and the send() operation could block, but asynchronous
communication makes the code more complicated. Also, I'm not so sure
how and where b10-auth can store the statistics data until it's asked
to provide it by the collector (which is an unpredictable event in the
pull model). The number of zones may be millions, and the zone data
may be in a separate DB file or DB server, and b10-auth may not even
be able to have a handy way to get a list of all zones stored in the
DB.

Alternatively, b10-auth could simply build and maintain the statistics
as it responds to queries on the fly (and in memory), and periodically
(e.g. every 30 sec) dump it to somewhere, either to the collector via
asynchronous communication, or to an SQLite3 DB file. I guess
synchronous write is okay for the latter case if the period is
reasonably short, and the implementation can be much simpler. This is
what I imagined at the cited part above.

But, again, I'm not insisting on any particular approach as long as
the goal is achieved. So far, however, it seems to me that direct
communication approach has more difficult issues to solve for
stability and scalability.

comment:15 Changed 8 years ago by shane

  • Milestone New Tasks deleted
Note: See TracTickets for help on using tickets.