Opened 8 years ago

Closed 7 years ago

#1938 closed defect (fixed)

why ZONEMGR_UNKNOWN_ZONE_NOTIFIED ?

Reported by: jreed Owned by: jinmei
Priority: high Milestone: Sprint-20130402
Component: secondary manager Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: Medium
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 5 Add Hours to Ticket: 0
Total Hours: 13.5 Internal?: no

Description

Who sent these notifies?

2012-05-02 00:16:05.830 ERROR [b10-zonemgr.zonemgr] ZONEMGR_UNKNOWN_ZONE_NOTIFIED notified zone 139.168.192.IN-ADDR.ARPA. (class IN) is not known to the zone manager
2012-05-02 00:16:05.830 ERROR [b10-auth.auth] AUTH_ZONEMGR_ERROR received error response from zone manager: "[b10-zonemgr] Notified zone (139.168.192.IN-ADDR.ARPA., IN) doesn't belong to zonemgr"
2012-05-02 00:16:09.841 ERROR [b10-zonemgr.zonemgr] ZONEMGR_UNKNOWN_ZONE_NOTIFIED notified zone 139.168.192.IN-ADDR.ARPA. (class IN) is not known to the zone manager
2012-05-02 00:16:09.842 ERROR [b10-auth.auth] AUTH_ZONEMGR_ERROR received error response from zone manager: "[b10-zonemgr] Notified zone (139.168.192.IN-ADDR.ARPA., IN) doesn't belong to zonemgr"
2012-05-02 00:16:17.827 ERROR [b10-zonemgr.zonemgr] ZONEMGR_UNKNOWN_ZONE_NOTIFIED notified zone 139.168.192.IN-ADDR.ARPA. (class IN) is not known to the zone manager
2012-05-02 00:16:17.827 ERROR [b10-auth.auth] AUTH_ZONEMGR_ERROR received error response from zone manager: "[b10-zonemgr] Notified zone (139.168.192.IN-ADDR.ARPA., IN) doesn't belong to zonemgr"

I don't have detailed logging enabled (18K qps is too much).

If these are errors then maybe should provide some details on who is sending it.

Also ZONEMGR_UNKNOWN_ZONE_NOTIFIED doc says "Please submit a bug report." So here it is.

The AUTH_ZONEMGR_ERROR documentation says it is a "debug message" but above you see ERROR.

Subtickets

Change History (26)

comment:1 Changed 8 years ago by shane

  • Milestone changed from New Tasks to Next-Sprint-Proposed

comment:2 Changed 7 years ago by shane

I got this as well. This should be INFO at most, maybe DEBUG.

comment:3 Changed 7 years ago by jreed

  • Milestone set to Next-Sprint-Proposed

These problems continues. I am changing this to next-sprint-proposed (I am trying to troubleshoot fatal problems on the system which may be related since this says "Please submit a bug report.")

Maybe the log message should also indicate where the NOTIFY came from.

comment:4 Changed 7 years ago by jreed

  • Milestone set to Next-Sprint-Proposed

comment:5 Changed 7 years ago by jreed

This problem continues. A production system should not log thousands of ERROR messages.

comment:6 Changed 7 years ago by jreed

  • Milestone set to Next-Sprint-Proposed

comment:7 Changed 7 years ago by jelte

I think the last half of the description is simply wrong (or perhaps wishful thinking).

From the looks of it in the code, this is simply a log message that a NOTIFY was received for a zone that the system has no knowledge of; there is no earlier check, at least in zonemgr itself.

Operationally this is certainly an error, why would you get NOTIFIES for zones you don't know about?

Should those not be logged? Or should it be warning, info?

comment:8 Changed 7 years ago by jelte

Let me disagree with myself here: there does appear to be a bug, since it should really respond with NOTAUTH instead of logging an error and not responding at all...

comment:9 Changed 7 years ago by shane

  • Milestone changed from Previous-Sprint-Proposed to Next-Sprint-Proposed

comment:10 Changed 7 years ago by shane

  • Milestone changed from Previous-Sprint-Proposed to Next-Sprint-Proposed

comment:11 Changed 7 years ago by shane

  • Milestone changed from Previous-Sprint-Proposed to Next-Sprint-Proposed

The issue with this being an ERROR is that an external user should not be able to trivially cause an ERROR. An ERROR is something the administrator should probably notice and do something about. A stray NOTIFY... not so much.

So, there are two actions here:

  1. Change the error level from ERROR to INFO
  2. Put a check earlier (in b10-auth?) to avoid wasting system resources and spamming up logs

comment:12 Changed 7 years ago by shane

  • Defect Severity changed from N/A to Medium
  • Priority changed from medium to high

Setting severity, and raising priority since this is affecting a production system.

comment:13 Changed 7 years ago by jelte

  • Milestone changed from Next-Sprint-Proposed to Sprint-20130402

comment:14 Changed 7 years ago by jinmei

  • Owner set to jinmei
  • Status changed from new to accepted

comment:15 follow-up: Changed 7 years ago by jinmei

trac1938 is ready for review.

There are several points to be fixed/improved, and each is mostly
independent.

  • b10-auth should have checked if it has authority for the zone in the NOTIFY (there was actually a TODO). This branch does that. And, I believe it suppresses most of the noisy logs reported in this ticket. BTW, this check could also be done in zonemgr, but I chose to do it in auth, because such bogus NOTIFY seems to be not so uncommon (as was reported in this ticket), and so it'd be better to deal with them without involving expensive inter-module communication. Changes up to af2ca1d (and b058641) address this issue.
  • still, zonemgr's behavior doesn't make sense. It's completely possible that a primary server receives a NOTIFY. So calling it "unknown" at the error level is obviously wrong. I've lowered log level, revised the log message and description to explain the situation more accurately, and made sure it doesn't trigger another error log at b10-auth. These changes are up to 0b8dd6b.
  • Finally, I tried to add some lettuce tests that cover these cases, and then I noticed existing notify related tests have something that doesn't make sense in waiting for log messages:
        Then wait 5 times for new master stderr message NOTIFY_OUT_SENDING_NOTIFY
        Then wait for new master stderr message NOTIFY_OUT_RETRY_EXCEEDED
    
    In these cases the notify should have been responded, so counting timeouts doesn't make sense. It actually seems to be a bug of notify_out, but instead of fixing it in this ticket, I've added one small log message to notify_out when it receives a valid response to NOTIFY, and used it for the newly added tests. Last few commits are for these tests.

I believe the branch now addresses all issues discussed in this
ticket.

Proposed changelog:

592.?	[bug]		jinmei
	b10-auth and zonemgr now handle some uncommon NOTIFY messages more
	gracefully: auth immediately returns a NOTAUTH response if the
	server does not have authority for the zone (the behavior
	compatible with BIND 9) without bothering zonemgr; zonemgr now
	simply skips retransfer if the specified zone is not in its
	secondary zone list, instead of producing noisy error logs.
	(Trac #1938, git TBD)

comment:16 Changed 7 years ago by jinmei

  • Owner changed from jinmei to UnAssigned
  • Status changed from accepted to reviewing

comment:17 Changed 7 years ago by jelte

  • Owner changed from UnAssigned to jelte

comment:18 in reply to: ↑ 15 ; follow-up: Changed 7 years ago by jreed

Replying to jinmei:

compatible with BIND 9) without bothering zonemgr; zonemgr now
simply skips retransfer if the specified zone is not in its
secondary zone list, instead of producing noisy error logs.

Is there still noisy logging at some debug level? Or is it entirely silent now?

comment:19 in reply to: ↑ 18 Changed 7 years ago by jinmei

Replying to jreed:

Replying to jinmei:

compatible with BIND 9) without bothering zonemgr; zonemgr now
simply skips retransfer if the specified zone is not in its
secondary zone list, instead of producing noisy error logs.

Is there still noisy logging at some debug level? Or is it entirely silent now?

They are still logged but at a debug level.

comment:20 follow-up: Changed 7 years ago by jelte

  • Owner changed from jelte to jinmei

All changesets look fine in general, a few minor comments:

  • I replaced a few occurrences of 'readble' with 'readable' in the comments
  • it's not in master yet, but the NOTIFY_OUT_REPLY_RECEIVED should probably use the AddressFormatter? class instead of raw address:port so that v6 addresses get blockquotes around them (the class is introduced in #1086, but whichever of these two branches is merged last should make the change I guess)

So this can be merged

comment:21 Changed 7 years ago by jelte

hmz I did run into an issue with the lettuce tests, they seem to be failing a lot on xfrin_notify_handling scenario 4 (notauth, but the failure is no XFROUT_NOTIFY_COMMAND).

Additionally (or to make it worse), the stats daemon produces a LOT of output, and I'm not sure whether it's necessary for this specific test, so I suggest we remove it from the config and the tests.

Of course that doesn't help anything, and I wonder if this too is related to #2790

comment:22 in reply to: ↑ 20 Changed 7 years ago by jinmei

Thanks for the review.

Replying to jelte:

All changesets look fine in general, a few minor comments:

  • I replaced a few occurrences of 'readble' with 'readable' in the comments

These are good (although they existed before this branch:-)

  • it's not in master yet, but the NOTIFY_OUT_REPLY_RECEIVED should probably use the AddressFormatter? class instead of raw address:port so that v6 addresses get blockquotes around them (the class is introduced in #1086, but whichever of these two branches is merged last should make the change I guess)

ack.

As for the lettuce tests (in your next comment), I tried various
"lightweight" workaround including using 'new' twice (according to
your last suggestion), but none seemed to work reliably. So I decided
to clarify and update the meaning of 'new' in terrain.py, and update
some features so they will work with this clarification (and,
if I understand it correctly, some cases were not really safe due to
the possible timing issues of 'new' even before this clarification;
they should have been fixed anyway).

All tests basically seem to pass (the only exception in my environment
is the XFRIN_XFR_TRANSFER_STARTED failure, but we know that's a
different issue), so I think this change doesn't introduce additional
regression, but please review and check it in your environment.

comment:23 Changed 7 years ago by jinmei

  • Owner changed from jinmei to jelte

comment:24 follow-up: Changed 7 years ago by jelte

  • Owner changed from jelte to jinmei

OK this approach seems to work :)

There is one final thing, not sure if this is the one you meant, but I think xfrin_notify_handling.feature:168 should be 'not XFRIN_TRANSFER_SUCCESS' instead of 'not XFRIN_TRANSFER_STARTED'

-    Then wait for bind10 stderr message XFRIN_XFR_TRANSFER_PROTOCOL_VIOLATION not XFRIN_XFR_TRANSFER_STARTED
+    Then wait for bind10 stderr message XFRIN_XFR_TRANSFER_PROTOCOL_VIOLATION not XFRIN_XFR_TRANSFER_SUCCESS

comment:25 in reply to: ↑ 24 Changed 7 years ago by jinmei

Replying to jelte:

OK this approach seems to work :)

Okay, good.

There is one final thing, not sure if this is the one you meant, but I think xfrin_notify_handling.feature:168 should be 'not XFRIN_TRANSFER_SUCCESS' instead of 'not XFRIN_TRANSFER_STARTED'

-    Then wait for bind10 stderr message XFRIN_XFR_TRANSFER_PROTOCOL_VIOLATION not XFRIN_XFR_TRANSFER_STARTED
+    Then wait for bind10 stderr message XFRIN_XFR_TRANSFER_PROTOCOL_VIOLATION not XFRIN_XFR_TRANSFER_SUCCESS

Ah, right. It looks like a typo existing before this branch, but now
that we stop using 'new' here, it could make the test fail (actually,
I suspect it would always make it fail; I'm not sure why I didn't
notice it). I've made this change and merged the branch.

Now closing the ticket.

comment:26 Changed 7 years ago by jinmei

  • Resolution set to fixed
  • Status changed from reviewing to closed
  • Total Hours changed from 0 to 13.5
Note: See TracTickets for help on using tickets.