Opened 9 years ago

Closed 8 years ago

#833 closed defect (complete)

[b10-resolver] Nameservers unreachable but really are

Reported by: jreed Owned by: dvv
Priority: low Milestone: Sprint-20111122
Component: resolver Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: High
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 7.0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

I noticed some SERVFAILs in my b10-resolver logging. Apparently, b10-resolver/nsas is giving up too soon.

Two examples are:

98.156.17.218.in-addr.arpa. PTR

0.1.c.5.1.8.e.f.f.f.1.d.9.1.2.0.a.b.7.1.5.1.f.1.0.7.4.0.1.0.0.2.ip6.arpa. PTR

which both result in:
[b10-resolver] Nameservers unreachable
and SERVFAIL

I set this to critical since it may not give proper result.

Subtickets

Change History (24)

comment:1 Changed 9 years ago by jelte

I'm not a hundred percent sure of this, but I've been adding a bit of extra debugging to a private branch here, and it looks like one of the main problems is that it's doing a lot of unnecessary work, so much in fact that the internal timeouts kick in and the NSAS marks zones as unreachable.

For instance, every lookup that is a cache miss starts a new recursion from the root right now (it doesn't search for the lowest known delegation at this moment). Another thing is that there is no 'front'-demuxer; if we ask it to resolve the same name/type twice (the second one before the first one finishes), it'll start a second RunningQuery? that does all the same work. These two combined add up a lot.

To fix the first one we have partial support in the cache, but we need to add some code to the resolver to make use of that (and that is not simply a one-line addition, though I think it isn't much). The second one requires a bit of design first.

I can do that and make tickets for it, and I think we should make this ticket 'depend' on those, and when they are done we can probably see better if this is a problem in itself or really just a side-effect of all that extra work.

(now that i've typed this, perhaps it also helps if we make the client_timeout higher for 'internal' queries, i.e. queries we initiated ourselves. I shall try that)

comment:2 Changed 9 years ago by stephen

  • Defect Severity set to N/A
  • Milestone set to Sprint-20110503
  • Sub-Project set to DNS

comment:3 Changed 9 years ago by vorner

  • Owner set to vorner
  • Status changed from new to accepted

I need to get back to know the resolver internals, so I'll enjoy this ticket :-).

comment:4 Changed 9 years ago by vorner

  • Owner changed from vorner to UnAssigned
  • Status changed from accepted to assigned

Hmm, after reading this, I should split something off. So, splitting off the cache part as #859.

comment:5 Changed 9 years ago by stephen

  • Estimated Difficulty changed from 0.0 to 7

comment:6 Changed 9 years ago by vorner

The #859 was merged. Does the problem still exist?

comment:7 Changed 9 years ago by jelte

Yes, on current master I still see this problem, unfortunately.

comment:8 Changed 9 years ago by stephen

  • Owner changed from UnAssigned to stephen

comment:9 Changed 9 years ago by stephen

  • Milestone changed from Sprint-20110614 to Year 3 Task Backlog
  • Owner changed from stephen to UnAssigned

Moved to year 3 task backlog during May 31 2011 Sprint Planning Meeting.

comment:10 Changed 8 years ago by jelte

This happens to me nearly ever time with the zone for nu.nl (a pretty popular dutch news site), if you want something to easily reproduce the problem.

comment:11 Changed 8 years ago by stephen

  • Milestone changed from Year 3 Task Backlog to Sprint-20110927

Moved to current sprint after sprint planning meeting of 2011-09-06.

comment:12 Changed 8 years ago by jelte

  • Milestone changed from Sprint-20110927 to Sprint-20111011

comment:13 Changed 8 years ago by jelte

  • Priority changed from critical to minor

comment:14 Changed 8 years ago by jreed

  • Defect Severity changed from N/A to High

comment:15 follow-up: Changed 8 years ago by dvv

  • Owner changed from UnAssigned to dvv

comment:16 in reply to: ↑ 15 Changed 8 years ago by jinmei

Would it be possible to complete (merge) the current tasks you are
working on and in the review queue? The general convention is that
every developer has at most one development ticket (also, after
completing one task and putting it on the review queue, the next
default action is to take on someone else's review request, instead
of starting another development task). Otherwise the review queue
tends to become longer and longer with long-lasting incomplete tasks.

Right now there doesn't seem to be an open review request, but from
a quick look you could respond to the review comment on #1138. If you
could first fully complete it, that would be very helpful.

comment:17 Changed 8 years ago by dvv

I took care of #1138 and #1140 review comments over this weekend.

comment:18 Changed 8 years ago by jelte

  • Milestone changed from Sprint-20111011 to Sprint-20111025

comment:19 Changed 8 years ago by jreed

I have several examples. Notice how this one gives up in a fraction of a second. To see earlier detailed (before is marked unreachable but still servfail) for this same query see http://git.bind10.isc.org/~jreed/833.txt

2011-10-19 06:19:57.154 DEBUG [b10-resolver.resolver] RESOLVER_DNS_MESSAGE_RECEIVED DNS message received: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47761
;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;157.11.107.109.in-addr.arpa. IN PTR

2011-10-19 06:19:57.154 DEBUG [b10-resolver.resolver] RESOLVER_QUERY_ACCEPTED query accepted: '157.11.107.109.in-addr.arpa./PTR/IN' from 127.0.0.1#45111
2011-10-19 06:19:57.154 DEBUG [b10-resolver.resolver] RESOLVER_NORMAL_QUERY processing normal query
2011-10-19 06:19:57.154 DEBUG [b10-resolver.reslib] RESLIB_RESOLVE asked to resolve <157.11.107.109.in-addr.arpa. IN PTR> (resolve() instance 2)
2011-10-19 06:19:57.154 DEBUG [b10-resolver.cache] CACHE_RESOLVER_LOOKUP_MSG looking up message in resolver cache for 157.11.107.109.in-addr.arpa./PTR
2011-10-19 06:19:57.154 DEBUG [b10-resolver.cache] CACHE_LOCALZONE_UNKNOWN entry with key 157.11.107.109.in-addr.arpa.12 not found in local zone data
2011-10-19 06:19:57.154 DEBUG [b10-resolver.cache] CACHE_MESSAGES_FOUND found a message entry for 157.11.107.109.in-addr.arpa.12 in the message cache
2011-10-19 06:19:57.155 DEBUG [b10-resolver.cache] CACHE_RRSET_LOOKUP looking up 11.107.109.in-addr.arpa./NS/IN in RRset cache
2011-10-19 06:19:57.155 DEBUG [b10-resolver.cache] CACHE_RESOLVER_LOOKUP_RRSET looking up RRset in resolver cache for 157.11.107.109.in-addr.arpa./PTR
2011-10-19 06:19:57.155 DEBUG [b10-resolver.cache] CACHE_LOCALZONE_UNKNOWN entry with key 157.11.107.109.in-addr.arpa.12 not found in local zone data
2011-10-19 06:19:57.155 DEBUG [b10-resolver.cache] CACHE_RRSET_LOOKUP looking up 157.11.107.109.in-addr.arpa./PTR/IN in RRset cache
2011-10-19 06:19:57.155 DEBUG [b10-resolver.cache] CACHE_RRSET_NOT_FOUND no RRset found for 157.11.107.109.in-addr.arpa./PTR/IN in cache
2011-10-19 06:19:57.155 DEBUG [b10-resolver.reslib] RESLIB_RECQ_CACHE_NO_FIND did not find <157.11.107.109.in-addr.arpa. IN PTR> in the cache, starting RunningQuery (resolve() instance 2)
2011-10-19 06:19:57.155 DEBUG [b10-resolver.reslib] RESLIB_RUNQ_CACHE_LOOKUP looking up up <157.11.107.109.in-addr.arpa. IN PTR> in the cache
2011-10-19 06:19:57.155 DEBUG [b10-resolver.cache] CACHE_RESOLVER_LOOKUP_MSG looking up message in resolver cache for 157.11.107.109.in-addr.arpa./PTR
2011-10-19 06:19:57.155 DEBUG [b10-resolver.cache] CACHE_LOCALZONE_UNKNOWN entry with key 157.11.107.109.in-addr.arpa.12 not found in local zone data
2011-10-19 06:19:57.156 DEBUG [b10-resolver.cache] CACHE_MESSAGES_FOUND found a message entry for 157.11.107.109.in-addr.arpa.12 in the message cache
2011-10-19 06:19:57.156 DEBUG [b10-resolver.cache] CACHE_RRSET_LOOKUP looking up 11.107.109.in-addr.arpa./NS/IN in RRset cache
2011-10-19 06:19:57.156 DEBUG [b10-resolver.reslib] RESLIB_RUNQ_CACHE_FIND found <157.11.107.109.in-addr.arpa. IN PTR> in the cache
2011-10-19 06:19:57.156 DEBUG [b10-resolver.reslib] RESLIB_REFERRAL referral received in response to query for <157.11.107.109.in-addr.arpa. IN PTR>
2011-10-19 06:19:57.156 DEBUG [b10-resolver.cache] CACHE_RESOLVER_UPDATE_MSG updating message for 157.11.107.109.in-addr.arpa./PTR/IN
2011-10-19 06:19:57.156 DEBUG [b10-resolver.cache] CACHE_MESSAGES_UPDATE updating message entry 157.11.107.109.in-addr.arpa./PTR/IN
2011-10-19 06:19:57.156 DEBUG [b10-resolver.cache] CACHE_MESSAGES_REMOVE removing old instance of 157.11.107.109.in-addr.arpa./PTR/IN first
2011-10-19 06:19:57.156 DEBUG [b10-resolver.cache] CACHE_RRSET_UPDATE updating RRset 11.107.109.in-addr.arpa./NS/IN in the cache
2011-10-19 06:19:57.156 DEBUG [b10-resolver.cache] CACHE_RRSET_LOOKUP looking up 11.107.109.in-addr.arpa./NS/IN in RRset cache
2011-10-19 06:19:57.156 DEBUG [b10-resolver.cache] CACHE_RRSET_REMOVE_OLD removing old RRset for 11.107.109.in-addr.arpa./NS/IN to make space for new one
2011-10-19 06:19:57.157 DEBUG [b10-resolver.reslib] RESLIB_REFER_ZONE referred to zone 11.107.109.in-addr.arpa.
2011-10-19 06:19:57.157 DEBUG [b10-resolver.nsas] NSAS_SEARCH_ZONE_NS searching NSAS for nameservers for zone 11.107.109.in-addr.arpa.
2011-10-19 06:19:57.157 DEBUG [b10-resolver.reslib] RESLIB_RUNQ_FAIL failure callback - nameservers are unreachable
2011-10-19 06:19:57.157 DEBUG [b10-resolver.resolver] RESOLVER_DNS_MESSAGE_SENT DNS message of 45 bytes sent: ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 47761
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;157.11.107.109.in-addr.arpa. IN PTR

comment:20 Changed 8 years ago by jreed

Either work done in this ticket or new tickets opened to help clarify the log messages, add some identifier that can correlate logs together, new log messages as needed.

comment:21 Changed 8 years ago by jelte

  • Milestone changed from Sprint-20111025 to Sprint-20111108

comment:22 Changed 8 years ago by jelte

  • Milestone changed from Sprint-20111108 to Sprint-20111122

comment:23 Changed 8 years ago by dvv

I've added more log output and am trying to run the resolver in my environment. Another nasty little problem that gets in the way of using the resolver cropped up: some heavily used name servers treat EDNS in a weird way: they reject EDNS requests over UDP with FORMERR no matter what UDP size is advised, but happily accept them over TCP. An immediate example is ns[12345].msft.net. My suspicion is that most if not all non-bind Windows-based servers might exhibit the same behavior. I'm dealing with it right now. I can keep it as a part of this ticket or put this one on hold (as I haven't really been able to reproduce "nameservers are unreachable" while trying to use the resolver in real life without hitting the EDNS problem) and open a new ticket.

Last edited 8 years ago by dvv (previous) (diff)

comment:24 Changed 8 years ago by dvv

  • Resolution set to complete
  • Status changed from assigned to closed

Unable to reproduce the invalid "nameservers are unreachable" situation. Closing the ticket for now, the EDNS fallback issue is carried over to #1386.

Note: See TracTickets for help on using tickets.