Opened 9 years ago

Closed 9 years ago

#513 closed defect (fixed)

b10-auth hangs in submitting statistics

Reported by: jinmei Owned by: jinmei
Priority: high Milestone: A-Team-Sprint-20110126
Component: b10-auth Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity:
Sub-Project: Feature Depending on Ticket:
Estimated Difficulty: 0.0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

After integrating the statistics changes, b10-auth on my personal server often hangs (meaning it stops responding to any queries). I found it blocked at the call to group_sendmsg() in AuthCountersImpl::submitStatistics().

I also found that when this happened msgq apparently blocked in reading data: The receive socket buffer from xfrout to msgq has 169 bytes of data; receive buffer from zonemgr to msgq has 7320 bytes of data; receive buffer from auth to msgq has 8112 bytes of data.

I've not figured out how exactly this happened, but I suspect it's a similar issue to #420.

The exact cause should be fixed, of course, but I think we should at least provide a woraround for b10-auth quickly. I propose adding a configuration knob to turn off the periodic submission of statistics (maybe as part of a way of customizing submission interval). This kind of knob will be necessary anyway, so the change itself won't be a temporary workaround.

I'm categorizing this problem as critical. I believe it should be fixed before the next (this week's) release.

Subtickets

Change History (14)

comment:1 follow-up: Changed 9 years ago by y-aharen

May I take this ticket?

Workaround:

  • Add config "Auth/enablesubmitstatistics" (boolean, default: false)
  • Check the config is true in statistics submission function

comment:2 in reply to: ↑ 1 ; follow-up: Changed 9 years ago by jinmei

Replying to y-aharen:

May I take this ticket?

Workaround:

  • Add config "Auth/enablesubmitstatistics" (boolean, default: false)
  • Check the config is true in statistics submission function

If you want, please, but I've already been working on it.

Would you like me to continue it, or would you take it over?

comment:3 in reply to: ↑ 2 ; follow-up: Changed 9 years ago by y-aharen

Replying to jinmei:

Replying to y-aharen:

May I take this ticket?

Workaround:

  • Add config "Auth/enablesubmitstatistics" (boolean, default: false)
  • Check the config is true in statistics submission function

If you want, please, but I've already been working on it.

Would you like me to continue it, or would you take it over?

OK, could you please continue working on it?

comment:4 in reply to: ↑ 3 Changed 9 years ago by jinmei

Replying to y-aharen:

Would you like me to continue it, or would you take it over?

OK, could you please continue working on it?

ack.

comment:5 Changed 9 years ago by jinmei

branch trac513 is ready for review.

I ended up introducing several sets of changes:

  • added a new method to asiolink::IntervalTimer? to disable/modify the timer interval (and a supplemental new method to help tests)
  • changed the way to configure ioservice and interval timer in AuthSrv?. this could be skipped, but it makes the configuration code much simpler, so I decided to introduce it.
  • added a new configuration for b10-auth to suctomize the timer interval for statistics submission. this is the main purpose of this ticket.

As a result the diff is a bit bigger than I originally intended. If it's too big for the reviewer, please let me know. I'll divide the entire diff into two or three chunks.

The proposed changelog entry is this:

  152.?	[func]*		jinmei
	b10-auth: Added new configuration variable "statistics-interval"
	to allow the user to change the timer interval for periodic
	statistics updates.  The update can also be disabled by setting
	the value to 0.  Disabling statistics updates will also work as
	a temporary workaround of a known issue that b10-auth can block in
	sending statistics and stop responding to queries as a result.
	(Trac #513, git TBD)

comment:6 Changed 9 years ago by jinmei

  • Owner changed from jinmei to UnAssigned
  • Status changed from new to reviewing

comment:7 Changed 9 years ago by y-aharen

  • Owner changed from UnAssigned to y-aharen

comment:8 follow-up: Changed 9 years ago by y-aharen

  • Owner changed from y-aharen to jinmei

I've completed my review.

  • src/bin/auth/auth_srv.cc:
109     /// Interval timer for periodic submission of statistics counters.
110     /// When NULL, the submission is disabled.
111     IntervalTimer statistics_timer_;

IntervalTimer can't be NULL, it's not a pointer. I think the comment on line 110 can be deleted because the way to disable timer is described in the other place.

  • I think the interval value passed through cmdctl should be checked that is not negative (i.e. not < 0). In the current implementation, we can set Auth/statistics-interval to -1. It causes infinite invoking of call back function. The parameter check can be done in StatisticsIntervalConfig::build(). If the value is negative, it should throw AuthConfigError or treat as 0 to disable timer. The choice of solution is up to you.

comment:9 in reply to: ↑ 8 ; follow-up: Changed 9 years ago by jinmei

Replying to y-aharen:

I've completed my review.

Thanks!

  • src/bin/auth/auth_srv.cc:
109     /// Interval timer for periodic submission of statistics counters.
110     /// When NULL, the submission is disabled.
111     IntervalTimer statistics_timer_;

IntervalTimer can't be NULL, it's not a pointer. I think the comment on line 110 can be deleted because the way to disable timer is described in the other place.

Ah, good catch. It was originally a pointer, but I modified it to a real
object in a later phase of development, and forgot to update the comment.
Fixed in the branch.

  • I think the interval value passed through cmdctl should be checked that is not negative (i.e. not < 0). In the current implementation, we can set Auth/statistics-interval to -1. It causes infinite invoking of call back function. The parameter check can be done in StatisticsIntervalConfig::build(). If the value is negative, it should throw AuthConfigError or treat as 0 to disable timer. The choice of solution is up to you.

Another nice catch. Fixed.

Please let me know if there are other issues or the branch is ready to merge.

(The following are notes that are not directly related to this ticket)

Although I agree it should be rejected at the frontend of configuration
parser (so I fixed it), I also believe this is primarily a problem of
the interface of IntervalTimer::setupTimer(). It takes an unsigned
integer and naively passes to boost::posix_time::seconds(), which assumes
a signed long value. We should at least use a fully compatible type
(simple 'long' would be best because we cannot even assume it's 32 bit),
and, if we want to reject negative timer intervals as part of the
IntervalTimer? interface, we should explicitly check and reject them
in its setupTimer(). But I wouldn't go into that level within this ticket
anyway. It would have to go to a separate ticket/task.

For that matter, while I extended the IntervalTimer? class, I noticed
the name of "setupTimer()" was redundant because the class name already
clarifies that it's somehow related to the timer. As a hindsight it
should have been named "setup", and so I chose "cancel()" rather than
"cancelTimer()", which might be better in terms of consistency.
So, for consistency and conciseness, I'd suggest renaming "setupTimer()"
to "setup()" (or at least something not redundant). It should be
a topic of separate ticket, though.

comment:10 Changed 9 years ago by jinmei

  • Owner changed from jinmei to y-aharen

comment:11 in reply to: ↑ 9 ; follow-up: Changed 9 years ago by y-aharen

The branch and changelog entry are OK. It is ready to merge into trunk.

Replying to jinmei:

Although I agree it should be rejected at the frontend of configuration
parser (so I fixed it), I also believe this is primarily a problem of
the interface of IntervalTimer::setupTimer(). It takes an unsigned
integer and naively passes to boost::posix_time::seconds(), which assumes
a signed long value. We should at least use a fully compatible type
(simple 'long' would be best because we cannot even assume it's 32 bit),
and, if we want to reject negative timer intervals as part of the
IntervalTimer? interface, we should explicitly check and reject them
in its setupTimer(). But I wouldn't go into that level within this ticket
anyway. It would have to go to a separate ticket/task.

I intended to make sure IntervalTimer::setupTimer() only accepts positive value or 0 by that, although boost::posix_time::seconds() is for generic purpose and accepts negative value. If it is a bad idea, I will create a ticket to address that.

For that matter, while I extended the IntervalTimer? class, I noticed
the name of "setupTimer()" was redundant because the class name already
clarifies that it's somehow related to the timer. As a hindsight it
should have been named "setup", and so I chose "cancel()" rather than
"cancelTimer()", which might be better in terms of consistency.
So, for consistency and conciseness, I'd suggest renaming "setupTimer()"
to "setup()" (or at least something not redundant). It should be
a topic of separate ticket, though.

I agree with you. I will create a ticket later.

comment:12 Changed 9 years ago by y-aharen

  • Owner changed from y-aharen to jinmei

comment:13 in reply to: ↑ 11 Changed 9 years ago by jinmei

Replying to y-aharen:

The branch and changelog entry are OK. It is ready to merge into trunk.

Okay, thanks. I'm going to close this ticket.

Replying to jinmei:

Although I agree it should be rejected at the frontend of configuration
parser (so I fixed it), I also believe this is primarily a problem of
the interface of IntervalTimer::setupTimer(). It takes an unsigned
integer and naively passes to boost::posix_time::seconds(), which assumes
a signed long value. We should at least use a fully compatible type
(simple 'long' would be best because we cannot even assume it's 32 bit),
and, if we want to reject negative timer intervals as part of the
IntervalTimer? interface, we should explicitly check and reject them
in its setupTimer(). But I wouldn't go into that level within this ticket
anyway. It would have to go to a separate ticket/task.

I intended to make sure IntervalTimer::setupTimer() only accepts positive value or 0 by that, although boost::posix_time::seconds() is for generic purpose and accepts negative value. If it is a bad idea, I will create a ticket to address that.

The problem is that by (implicitly) converting uint32_t to (signed)
long we lose information. More specifically, think about the case
where someone passes 0xffffffff to setupTimer() (perhaps expecting it
effectively works as "infinity".) But it will be converted to -1
within asio, and we'll see some surprising effects. Implicit conversion
into a type with smaller capacity should generally be avoided to
avoid this kind of surprise.

Note also that it's quite fragile to rely on a specific size (32bits
in the case of current code) of 'long'. In theory we could even see
an environment where long is 16 bit signed integer. For such
(mostly imaginary) systems, even more sensible interval like 32768
would cause an overflow.

So, the safest way is to use the same (long) or non larger type (such as
signed int) in the interface to setupTimer() and (if you don't want to allow
negative interval) explicitly reject negative values as part of validation.
A suboptimal alternative would be to use an unsigned type whose max value
is normally still positive for 'long' (this would be the case with uint16_t
for most systems we are supporting), but as already explained it's
much, much suboptimal than the safest way.

comment:14 Changed 9 years ago by jinmei

  • Resolution set to fixed
  • Status changed from reviewing to closed
Note: See TracTickets for help on using tickets.