Opened 8 years ago

Closed 8 years ago

#1751 closed defect (fixed)

stats doesn't separate or consolidate reports from multiple auth instances

Reported by: jinmei Owned by: naokikambe
Priority: medium Milestone: Sprint-20120320
Component: statistics Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: Core Feature Depending on Ticket:
Estimated Difficulty: 6 Add Hours to Ticket: 0
Total Hours: 28.04 Internal?: no

Description

If we run multiple b10-auth instances, each submits statistics
to b10-stats. (I've not looked into the code but it appears that)
b10-stats doesn't distinguish these reports, so depending on the
timing we see two different sets of statistics via bindctl or the http
server (you can see it at
http://ns.jinmei.org:8853/bind10/statistics/xml)

This is obviously bad. I can think of several ways to address this
issue:

  • consolidate all reports from b10-auth and present the unified statistics only.
  • maintain the separate reports and show them separetely.
  • support both mode.

The primary concern of the administrator would be the consolidated
(i.e. whole-system) statistics, so the first choice seems reasonable.
But the second one may not be that bad, because it would provide some
useful information such as when queries are not equally distributed.
And, the consolidated information can be calculated from separate
reports (but not vice versa). To this end the third one may be the
best if we can show it without being too noisy.

In either case, different instances of the auth server need to
uniquely identify themselves in the statistics reports. I don't know
if they are doing it right now, but if not, we first need to implement
that feature.

Subtickets

Change History (15)

comment:1 Changed 8 years ago by naokikambe

Trivial comments. AFAIK in current stats multiple instances are not considered. Stats considers that the received data is from the same source, and stats overwrites it. Maybe there is currently no way for stats to identify which instance sends the statistics. We should note this if we choose the second or third choice above.

BTW in my environment the multiple auth instances currently at almost same time seem to send the respective statistics to stats. This behavior might not be so important but we should note this when we try the first choice above.

comment:2 Changed 8 years ago by vorner

The auth don't distinguish themself yet, but they should as a result of either #1675 or #1676.

I think it could be then easy to do the third option easily, a statistics message would contain a instance parameter and it would be stored by that. Then a statistics item would have a consolidation method (average/sum/…) and they could be agregated by that.

comment:3 follow-up: Changed 8 years ago by jelte

i think separate counts would be easiest to achieve ('for now'), then an accumulator could be a separate task (stats show output is already getting quite big).

But TBH, I never really understood why statistics are not cumulative in the first place (though to get that right, we probably need another addition 'count_type': [cumulative|write-once|overwrite] or something)

comment:4 in reply to: ↑ 3 Changed 8 years ago by naokikambe

Replying to jelte:
That approach is different from this, but I tried to implement the increase method for incremental count into stats in #170. But it wasn't enough and it wasn't actually needed at that time so we postponed consideration. This is also secondly listed in TODO. Actually the requirement for stats has been changed very much since that time so we should consider about that.

comment:5 Changed 8 years ago by jelte

  • Estimated Difficulty changed from 0 to 6

comment:6 Changed 8 years ago by jelte

  • Milestone changed from Next-Sprint-Proposed to Sprint-20120320

comment:7 Changed 8 years ago by naokikambe

  • Owner set to naokikambe
  • Status changed from new to assigned

comment:8 follow-up: Changed 8 years ago by naokikambe

  • Owner changed from naokikambe to UnAssigned
  • Status changed from assigned to reviewing

The branch trac1751 is ready for reviewing. But I think this is a temporary fix for this defect.

A pid argument is newly introduced in the set command of stats. It's used by stats to identify which instance sends statistics data. I'm not sure whether the statistics management by pid is ok or not for this purpose.

For a multi-instance module like auth, the process id should be set in the argument. However for a single-instance module like boss, the pid doesn't have to be set. In this case the default value -1 is assumed.

In addition of the variable of statistics data of each module which exists before, stats internally has the variable of statistics data of each pid. When stats receives new statistics data, it internally updates statistics data of each pid and updates statistics data of each module by accumulating statistics data of each pid.

Even if the sender instance has already died, stats preserves statistics data which it had before, and accumulates it. But if the pid on the system rotates and same number is assigned again to one of the multiple instances, statistics data of the instance might be reset.

The views in bindctl and HTTP/XML aren't changed at all in that branch. But in another ticket they would be more changed so that administrators can see statistics by each instance.

comment:9 Changed 8 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:10 in reply to: ↑ 8 ; follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to naokikambe

Hello

Replying to naokikambe:

The branch trac1751 is ready for reviewing. But I think this is a temporary fix for this defect.

Yes, I agree with this ‒ I think it is better than not having it, but it still needs more tweaking. I'm especially worried about one thing ‒ now a module that has multiple instances preserves them across restarts, while the others don't. This acts in an inconsistent way. Would you create a ticket for it?

As for the thing with PIDs, we probably want to solve it in a more wide way, together with addressing multiple instances in commands, etc. So that'll probably provide further insight on how this should be handled in the long term.

About the code, first I'd like to note that resolver also supports multiple instances. Would you add the support to it as well?

The tests of stats, I think EXPECT_GT would be better here:

    EXPECT_TRUE(statistics_session_.sent_msg->get("command")
                ->get(1)->get("pid")->intValue() > 0);

What do the two stars mean here?

**{'queries.tcp':1001}

I think you mean set Boss/components/b10-auth-2 here:

echo 'config add Boss/components b10-auth-2
config set Boss/components/b10-auth { "special": "auth", "kind": "needed" }

Also, would you provide a changelog entry? I think one would be needed, since it fixes a visible problem.

Thank you

comment:11 in reply to: ↑ 10 Changed 8 years ago by naokikambe

  • Owner changed from naokikambe to vorner

Thank you for reviewing.

Replying to vorner:

Yes, I agree with this ‒ I think it is better than not having it, but it still needs more tweaking. I'm especially worried about one thing ‒ now a module that has multiple instances preserves them across restarts, while the others don't. This acts in an inconsistent way. Would you create a ticket for it?

Actually I'm not sure whether accumulated counts which the non-existent instance had should be lost, but I revised the codes so that such inconsistency wouldn't happen. Stats queries the pid and instance list to Boss, searches the non-existent instance, and then removes statistics data which it had. Please review 94f99cad8539239a93877a1774fbadf89238c181.

As for the thing with PIDs, we probably want to solve it in a more wide way, together with addressing multiple instances in commands, etc. So that'll probably provide further insight on how this should be handled in the long term.

Yes, I think using PID is a short-term solution. Of course, we should discuss further about such addressing.

About the code, first I'd like to note that resolver also supports multiple instances. Would you add the support to it as well?

AFAIK, query counters in the resolver aren't implemented yet.
Maybe auth/statistics.{cc,h} are reusable for the resolver. They should be revised for the resolver, but we can implement counters by the same way as auth, e.g. #1399, #1401 and #1613.
Of course, query counts in a multi-instance resolver would be also consistent after that implementation.

The tests of stats, I think EXPECT_GT would be better here:

    EXPECT_TRUE(statistics_session_.sent_msg->get("command")
                ->get(1)->get("pid")->intValue() > 0);

Thank you for the suggestion, but I replaced it with the check with a real pid.
Please review ddf22289ac033aa2ffdaf8e348c4a05f2d1d6951.

What do the two stars mean here?

**{'queries.tcp':1001}

I just intended to pass a keyword argument to the function. The keyword includes "." , which has a special meaning in python, so the keyword must be literal.

I think you mean set Boss/components/b10-auth-2 here:

echo 'config add Boss/components b10-auth-2
config set Boss/components/b10-auth { "special": "auth", "kind": "needed" }

Yes, that was my typo. But after I fixed it, maybe I encountered another bug. After the second auth is removed via bindctl, the auth doesn't respond any more. It doesn't respond even to shutdown via bindctl. This seems for me to be the same phenomenon in #1703. So I gave up the final statistics check there.
And I also updated the regression check against an probable statistics inconsistency displayed by `Stats show'. Two more auths are started and the query counters are checked multiple times. Please review 3ca2c002d807fa1f8e7dbf6731ac4f1796f4e5a0.

Also, would you provide a changelog entry? I think one would be needed, since it fixes a visible problem.

Yes, I would. Please review eecc0d847fc5f9dc74ce96788e0b3489a98fb244.

Thanks

comment:12 follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to naokikambe

Hello

Replying to naokikambe:

Actually I'm not sure whether accumulated counts which the non-existent instance had should be lost, but I revised the codes so that such inconsistency wouldn't happen. Stats queries the pid and instance list to Boss, searches the non-existent instance, and then removes statistics data which it had. Please review 94f99cad8539239a93877a1774fbadf89238c181.

I'm not sure if any statistics should be lost after a restart. If the xfrin crashes and is started again, it starts its statistics from 0 again. But anyway, this is a different issue.

About the code, first I'd like to note that resolver also supports multiple instances. Would you add the support to it as well?

AFAIK, query counters in the resolver aren't implemented yet.

Um, yes, that would be good enough reason not to add the support now :-)).

Thank you for the suggestion, but I replaced it with the check with a real pid.

That is indeed even better.

Yes, I would. Please review eecc0d847fc5f9dc74ce96788e0b3489a98fb244.

I'm not sure, if the changelog should be so detailed about the implementation ‒ after all, it's for users and they don't care if it is by using PID or some other magic. How about something like this?:

[bug]		naokikambe
The statistic counters are now properly accumulated across multiple instances
of b10-auth (if there are multiple instances), instead of providing result for
random instance.

To the new code, I think these list comprehensions are quite hard to read. Would it be possible to say the same in some more understandable way, or at least add a comment what the meaning of each list is?

Thank you

With regards

comment:13 in reply to: ↑ 12 Changed 8 years ago by naokikambe

  • Owner changed from naokikambe to vorner

Thank you for replying.

Replying to vorner:

I'm not sure, if the changelog should be so detailed about the implementation ‒ after all, it's for users and they don't care if it is by using PID or some other magic. How about something like this?:

Thank you. I revised it so. That would be simpler for the users to understand easily.
Please check again 65d2a83510c3abf2f2cc3fb082ce56c99be32dfb.

To the new code, I think these list comprehensions are quite hard to read. Would it be possible to say the same in some more understandable way, or at least add a comment what the meaning of each list is?

Sorry for my carelessness for a reviewer. I added some notes to the code and rearranged it.
Please review b24200598f331f9b46805e04b92c23dd7018f2d4. If there is another part unable for you to understand, please let me know.

Regards,

comment:14 follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to naokikambe
  • Total Hours changed from 0 to 1.79

Thank you, this looks OK. Please merge.

comment:15 in reply to: ↑ 14 Changed 8 years ago by naokikambe

  • Resolution set to fixed
  • Status changed from reviewing to closed
  • Total Hours changed from 1.79 to 28.04

Thank you, I've merged. I'm closing this ticket.
And later I'll create a ticket to consider how to address multiple instances according to your comment.

Regards

Note: See TracTickets for help on using tickets.