Opened 10 years ago

Closed 9 years ago

#80 closed defect (fixed)

review: auth server incorrectly returns SERVFAIL to queries of class ANY

Reported by: jinmei Owned by: jinmei
Priority: high Milestone: A-Team-Sprint-20110316
Component: b10-auth Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity:
Sub-Project: Feature Depending on Ticket:
Estimated Difficulty: 5.0 Add Hours to Ticket:
Total Hours: Internal?: no

Description

example:

% dig @127.0.0.1 -p 5300 version.bind txt -c ANY

; <<>> DiG 9.6.0-APPLE-P2 <<>> @127.0.0.1 -p 5300 version.bind txt -c ANY
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 36081
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

What happened internally is even worse: The meta data source threw an exception.
Even though it's an unusual query, it's completely valid and not something that should trigger an exception.

Also, in this specific case, the query should actually match version.bind/TXT/CH, and should return that record (although BIND 9 behaves in a counter intuitive manner in this case, too; it first matches the "IN" class and tries to resolve "version.bind" within that class, which normally fails).

Subtickets

Change History (25)

comment:1 Changed 10 years ago by jinmei

  • Owner set to jinmei
  • Status changed from new to accepted

I fixed this problem in r1404. Could someone review the change?

comment:2 Changed 10 years ago by jinmei

  • Owner changed from jinmei to shane
  • Status changed from accepted to assigned
  • Summary changed from auth server incorrectly returns SERVFAIL to queries of class ANY to review: auth server incorrectly returns SERVFAIL to queries of class ANY

assigning it to Shane, as the review dispatcher. Also changed the subject line so that it's clear that it's a review request.

comment:3 Changed 10 years ago by shane

  • Owner changed from shane to each

comment:4 Changed 10 years ago by shane

  • Component changed from Unclassified to b10-auth

comment:5 Changed 9 years ago by each

Looks like this should have been moved to "reviewing" status a while ago, but just as well that it wasn't, because it's not really fixed.

Queryies for qclass ANY don't servfail anymore, but they do fail. They attempt to look up records with a literal class value of 255.

BIND 9 actually does a search for any class. It looks for matching qname/qtype records in all classes and returns the first match found. BIND 10 should do the same.

comment:6 Changed 9 years ago by jreed

  • billable set to 0
  • Estimated Difficulty set to 0.0
  • Internal? unset

How to test? Use sqlite3 directly to add HS and/or CH zones? Even when I comment out the check in the datasrc library for "in" it assumes a CH zone is IN. Anyways, the -C ANY appears to give what I have.


comment:7 Changed 9 years ago by each

  • billable changed from 0 to 1

We should be able to test this by seeing if we get the expected answer, regardless of class, to a query for a given qname/qtype. A class-IN zone example.com should give an answer if queried with with "dig -c any -t ns example.com". The class-CH static zones should answer when queried with "dig -c any -t txt authors.bind".

(I'm surprised, incidentally, to discover that the second one of those doesn't work correctly in BIND 9, but the first one does.)

comment:8 Changed 9 years ago by jreed

  • Milestone changed from 05. 3rd Incremental Release: Serious Secondary to 06. 4th Incremental Release

comment:9 follow-up: Changed 9 years ago by shane

  • Milestone set to A-Team-Task-Backlog
  • Owner changed from each to UnAssigned
  • Priority changed from major to critical

I just tested this query on my server and it crashed b10-auth.

comment:10 Changed 9 years ago by jreed

#0  0xbb8152d7 in _lwp_kill () from /usr/lib/libc.so.12
#1  0xbb815294 in raise () from /usr/lib/libc.so.12
#2  0xbb814b46 in abort () from /usr/lib/libc.so.12
#3  0xbb7d663a in __assert13 () from /usr/lib/libc.so.12
#4  0xbbbbf6f4 in isc::datasrc::addToMessage (q=@0xbfbfe604, 
    sect=isc::dns::Message::SECTION_AUTHORITY, rrset=@0xbfbfe4c4, 
    no_dnssec=false) at /usr/pkg/include/boost/smart_ptr/shared_ptr.hpp:418
#5  0xbbbbb570 in isc::datasrc::DataSrc::doQuery (this=0xbb60a0b8, 
    q=@0xbfbfe604) at data_source.cc:602
#6  0x0805be9c in AuthSrvImpl::processNormalQuery (this=0xbb60a080, 
    io_message=@0xbb610be0, message=@0xbfbfe724, buffer=@0xbfbfe71c)
    at auth_srv.cc:508
#7  0x0805d027 in AuthSrv::processMessage (this=0xbb6034c0, 
    io_message=@0xbb610be0, message=@0xbfbfe7d4, buffer=@0xbfbfe7cc, 
    server=0xbfbfe8bc) at auth_srv.cc:468
#8  0x0805f556 in MessageLookup::operator() (this=0xbb601eb0, 
    io_message=@0xbb610be0, message=@0xbfbfe834, answer_message=@0xbfbfe82c, 
    buffer=@0xbfbfe824, server=0xbfbfe8bc) at auth_srv.cc:173
#9  0xbba87734 in asiolink::UDPServer::asyncLookup (this=0xbfbfe8b0)
    at udp_server.cc:288
#10 0xbba8a6ab in asio::asio_handler_invoke<asiolink::DNSServer::AsyncLookup<asiolink::UDPServer> > (function=@0xbfbfe8b0)
    at ../../../src/lib/asiolink/dns_server.h:135
#11 0xbba8a744 in asio::detail::completion_handler<asiolink::DNSServer::AsyncLookup<asiolink::UDPServer> >::do_complete (owner=0xbb602300, base=0xbb604610)
    at ../../../ext/asio/asio/detail/handler_invoke_helpers.hpp:41
#12 0xbba74efd in asiolink::IOService::run (this=0xbb60a080)
    at ../../../ext/asio/asio/detail/task_io_service_operation.hpp:35
#13 0x080694f5 in main (argc=Cannot access memory at address 0x0) at main.cc:188

comment:11 Changed 9 years ago by stephen

  • Milestone changed from A-Team-Task-Backlog to A-Team-Sprint-20110316

comment:12 in reply to: ↑ 9 Changed 9 years ago by jinmei

Replying to shane:

I just tested this query on my server and it crashed b10-auth.

I can't reproduce that. Can you be more specific, with the zone data
(text) you used and the offending query?

comment:13 Changed 9 years ago by jinmei

  • Owner changed from UnAssigned to shane

comment:14 follow-up: Changed 9 years ago by shane

  • Owner changed from shane to jinmei

I used this command just now (but please don't crash the server - I use setuid and it doesn't recover properly from crashes yet):

dig @b10-ns.time-travellers.org -c any -t aaaa time-travellers.org

This is with an SQLite-based back-end. You can get the contents of the zone via AXFR (I keep it open).

dig @b10-ns.time-travellers.org -t axfr time-travellers.org

comment:15 in reply to: ↑ 14 Changed 9 years ago by jinmei

Replying to shane:

dig @b10-ns.time-travellers.org -c any -t aaaa time-travellers.org

Okay, this looks like a penalty for setting up an IPv4-only (or type
A-only if you like to be super accurate) host even after the
exhaustion of the addresses:-)

Apparently the key is "-c any -t <non-existent-type>. With that
condition I could reproduce the symptom, too.

comment:16 Changed 9 years ago by jinmei

  • Estimated Difficulty changed from 0.0 to 5.0

comment:17 Changed 9 years ago by jinmei

Branch trac80 is ready for review.

For testing purposes I've cherry-picked a couple of changes from
trac626, which are irrelevant to the fix itself and can be ignored.
The main diff is

git diff 0f225ec^ 66e6ece

The bug was in retrieving RRsets from RRsetList, which stores the
result of data source lookup. The original code often did this:

    result_list.findRRset(rrtype. query.qclass());

When it's a class ANY query, query.qclass() is ANY, but result_list
stores the data from the data source, which is not normally ANY
(typically it's class IN). The buggy code sometimes naively assumes
the result is non NULL, and sometimes incorrectly handles the case as
it returns NULL.

Within the data source implementation we ensure the RR class is
already valid, so we don't have to match the class. I've solved this
issue by introducing a custom search routine.

Various instances of this type of bug were fixed with added tests, but
I've left open one final case: in getNsec3() and getNsec3Param().
There didn't seem to be any test for these cases anyway, and we'd have
to prepare NSEC3-signed zone to test it, which would make the diff too
big. Since this case of bug doesn't seem to be critical in that it
won't trigger a crash, I'd leave it to a separate task. I thought it
would be more important to fix the crash bug as soon as possible.

In tests, I've also introduced "no DNSSEC mode" to simplify tests
where DNSSEC related results are not the essential part of the test.
There's another unrelated cleanup: I've changed test domain name from
"flame.org" to "example.net" to avoid using a real domain main used by
someone.

It would also make sense to check both class IN and ANY for all
existing query tests like I did in this branch for WildcardCname?.
That would go to a separate task, too.

This is the proposed changelog entry:

  197.?	[bug]		jinmei
	b10-auth, src/lib/datasrc: class ANY queries were not handled
	correctly in the generic data source (mainly for sqlite3).  It
	could crash b10-auth in the worst case, and could result in
	incorrect responses in some other cases.
	(Trace #80, git TBD)

comment:18 Changed 9 years ago by jinmei

  • Owner changed from jinmei to UnAssigned
  • Status changed from assigned to reviewing

comment:19 Changed 9 years ago by jinmei

There was a typo in the proposed changelog entry, fixing it:

  197.?	[bug]		jinmei
	b10-auth, src/lib/datasrc: class ANY queries were not handled
	correctly in the generic data source (mainly for sqlite3).  It
	could crash b10-auth in the worst case, and could result in
	incorrect responses in some other cases.
	(Trac #80, git TBD)

comment:20 Changed 9 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:21 follow-up: Changed 9 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

It seems g++ doesn't like passing a function-local class as a template argument, I'm not sure if this is OK by the standard or it is a bug. But I've taken the class outside, since it's a .cc file anyway, so it won't spam much of a namespace.

Secondly, there are things like this:

RRsetPtr cname_rrset = findRRsetFromList(syn,
					 RRType::CNAME());
addToMessage(q, Message::SECTION_ANSWER, cname_rrset);
chaseCname(q, task, cname_rrset);

As I get it, findRRsetFromList can return null pointer. Are the other functions expected to handle it? Or, is it so it can't happen in these cases that the type would be missing?

Thank you

comment:22 in reply to: ↑ 21 Changed 9 years ago by jinmei

Replying to vorner:

It seems g++ doesn't like passing a function-local class as a template argument, I'm not sure if this is OK by the standard or it is a bug. But I've taken the class outside, since it's a .cc file anyway, so it won't spam much of a namespace.

I don't know about what the standard says either, but in any event I'm
okay with the change.

Secondly, there are things like this:

RRsetPtr cname_rrset = findRRsetFromList(syn,
					 RRType::CNAME());
addToMessage(q, Message::SECTION_ANSWER, cname_rrset);
chaseCname(q, task, cname_rrset);

As I get it, findRRsetFromList can return null pointer. Are the other functions expected to handle it? Or, is it so it can't happen in these cases that the type would be missing?

I'd first like to note that this patch shouldn't change the behavior
regarding the possible null pointer case, so the question is not
specific to this patch but should apply to the original code, too.
On top of that, my general understanding is that when the code omits
null pointer check it means the pointer actually cannot be null in
that context.

In the above specific example, synthesizeCname() should ensure "syn"
contains a CNAME RR(set) that makes the call to findRRsetFromList()
successful. The only possible exception is the case where the
underlying data source implementation is buggy and returns an empty
DNAME RRset, but that case is rejected by the check for sys.size()
(see notes below though).

We could still add an explicit check and throw an exception so that if
the code overlooks minor exceptional cases that breaks the assumption
then the program will still not crash but return a SERVFAIL. But I
would personally have it crashed in that case rather than obscuring
the bug.

Note: I admit the code around this logic isn't clean and is difficult
to follow (but I note it was not written by me). We should rather
make synthesizeCname never fails (or has it throw for buggy data
sources) so that we can omit the syn.size() check, and we should even
get rid of the use of RRsetList in this case because we actually don't
expect a list of (multiple types of) RRsets in this context, etc. At
the risk of deferring so many cleanups for the data source
implementation, I'm hoping we can make it much cleaner in our
scheduled refactoring/redesigning the data source API and implementation.

comment:23 Changed 9 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:24 follow-up: Changed 9 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

I wasn't accusing you of it ;-). I just saw the code and worried little bit if it can happen or not and if it was considered. So I just asked. I usually write something like assert(pointer); to both make sure the bug is caught soon enough if it happens and to mark I considered the situation and I think it can't happen.

Anyway, I agree that this bugfix is not a place for making the code nicer, let's worry about that during the refactoring. So, go ahead and merge it.

Thanks

comment:25 in reply to: ↑ 24 Changed 9 years ago by jinmei

  • Resolution set to fixed
  • Status changed from reviewing to closed

Replying to vorner:

I wasn't accusing you of it ;-). I just saw the code and worried little bit if it can happen or not and if it was considered. So I just asked. I usually write something like assert(pointer); to both make sure the bug is caught soon enough if it happens and to mark I considered the situation and I think it can't happen.

Don't worry, I didn't interpret it as if I was accused:-) Double check
is always good and appreciated.

Anyway, I agree that this bugfix is not a place for making the code nicer, let's worry about that during the refactoring. So, go ahead and merge it.

Thanks, merged, and now closing this ticket. I'll open a new ticket
for the TODO (remaining minor NSEC3 cases).

Note: See TracTickets for help on using tickets.