Opened 9 years ago

Closed 9 years ago

#249 closed defect (fixed)

review: auth server can return SERVFAIL with an answer

Reported by: jinmei Owned by: jinmei
Priority: medium Milestone: 05. 3rd Incremental Release: Serious Secondary
Component: data source Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity:
Sub-Project: Feature Depending on Ticket:
Estimated Difficulty: Add Hours to Ticket:
Total Hours: Internal?:

Description

This happens if an sqlite3 zone DB doesn't contain the zone's NS RRs.

We should fix it.

Subtickets

Change History (12)

comment:1 follow-up: Changed 9 years ago by jinmei

branches/trac249 is ready for review.

The proposed changelog entry is:

  59.?	[bug]		jinmei
	lib/datasrc,bin/auth: The authoritative server could return a
	SERVFAIL with a partial answer if it finds a data source broken
	while looking for an answer.  This can happen, for example, if a
	zone that doesn't have an NS RR is configured and loaded as a
	sqlite3 data source. (Trac #249, rTBD)

An initial note to the reviewer: to add test cases for this bug, I decided to fundamentally refactor the test data source mock. Until now it hardcoded the test data within the method logic, which made it very difficult to maintain whenever you wanted to add data. Ideally this refactoring would go into a separate ticket, but since the bug fix itself is pretty easy, I merged both sets of changes into a single branch. The refactoring part of the diff is r2272; the rest of the changes in the branch are the tests and the bug fix.

If combining these two changes isn't acceptable, please say so.

Notes on the bug and fix themselves:

I'd rather see this as a data source bug than an authoritative server issue, so I fixed it solely in the data source library.

If a zone doesn't have an apex SOA/NS, the zone is broken (for example, BIND 9 refuses to load such zones, and even our loadzone tool doesn't allow a zone without an SOA). So I believe it's acceptable, and even more sensible, to throw an exception rather than signaling the error via the message's Rcode.
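For illustration only (this is a sketch, not the code actually on the branch; DataSourceError and checkApexNS() are hypothetical names), the idea is essentially:

    #include <stdexcept>
    #include <string>

    // Hypothetical error type; the branch presumably uses an exception
    // class from lib/datasrc, not this one.
    class DataSourceError : public std::runtime_error {
    public:
        explicit DataSourceError(const std::string& what) :
            std::runtime_error(what) {}
    };

    void checkApexNS(bool found_apex_ns, const std::string& zone_name) {
        if (!found_apex_ns) {
            // A zone without an apex NS RRset is broken data, not a normal
            // lookup failure, so give up via an exception rather than
            // returning SERVFAIL with a partial answer.
            throw DataSourceError("zone " + zone_name +
                                  " is broken: no NS RRset at the apex");
        }
    }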

This observation leads to broader discussions. First, I suspect other internal SERVFAIL cases should actually be changed to exceptions. But I was not so sure whether they are due to something like a broken database or to an unusual, but not necessarily broken, condition, so I didn't touch them. We should revisit these cases too, and, whatever the conclusion is, we should avoid SERVFAIL+answer for these cases.

Second, this problem suggests we need to clarify which data the data source module can treat as safe. In a "captive" world like BIND 9, we can basically assume that the data parser performs the necessary validation, so once the data is successfully loaded it's valid. In a more open world like BIND 10, however, we cannot naively assume that: the data source may be "non-captive", with bogus data inserted via an out-of-band interface, or it may be a buggy, dynamically linked implementation provided by a third party.

So, in the longer term, I think we need to define a border in the top-level data source, below which we cannot assume the validity of the data and the module must perform validation. Then other BIND 10 modules (specifically, the auth server) can assume they only handle valid data above the border line.

comment:2 Changed 9 years ago by jinmei

  • Owner changed from jinmei to UnAssigned
  • Status changed from new to reviewing
  • Summary changed from auth server can return SERVFAIL with an answer to review: auth server can return SERVFAIL with an answer

comment:3 in reply to: ↑ 1 ; follow-up: Changed 9 years ago by each

  • Owner changed from UnAssigned to jinmei

Replying to jinmei:

I decided to fundamentally refactor the test data source mock.

Glorious. Vast improvement. Thank you. Merge it.

Notes on the bug and fix themselves:

Rather than throwing exceptions from the data source, I would have had b10-auth look for a SERVFAIL Rcode in the reply from doQuery(), and either clear the partial answers out of the Message or construct a new Message with no answer. But, if we're going to throw exceptions, then I agree with your speculation that we should do so in every SERVFAIL case, not just the two that you've already done.
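Roughly the kind of handling I had in mind (a sketch only; the Message type below is a simplified stand-in, not the real libdns++ API):

    // Sketch only: 'Message' here is a simplified placeholder, just to
    // show the shape of the Rcode check after doQuery().
    struct Message {
        static const int RCODE_SERVFAIL = 2;
        int rcode;
        int answer_count;   // partial answers accumulated by doQuery()

        void clearAnswerSection() { answer_count = 0; }
    };

    void postProcessReply(Message& message) {
        if (message.rcode == Message::RCODE_SERVFAIL) {
            // Drop whatever partial answer doQuery() left behind so the
            // client only ever sees a clean SERVFAIL.
            message.clearAnswerSection();
        }
    }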

This observation leads to broader discussions. First, I suspect other internal SERVFAIL cases should actually be changed to exceptions. But I was not so sure whether they are due to something like a broken database or to an unusual, but not necessarily broken, condition, so I didn't touch them. We should revisit these cases too, and, whatever the conclusion is, we should avoid SERVFAIL+answer for these cases.

AFAIK, SERVFAIL always means either that the database is insane, or we've been told to do something unnatural (for example, a glue query found on the task queue, which should never happen). Both seem like natural circumstances for an exception to me.

My only question about it is, I recall having a conversation early in the BIND 10 project about someone (maybe you?) running benchmarks that showed exceptions caused a serious performance hit. If that's true, then maybe it would be preferable to continue signaling the problem via the Rcode, and simply handle it better on the calling end? But I'm okay either way.

So, in the longer term, I think we need to define a border in the top-level data source, below which we cannot assume the validity of the data and the module must perform validation.

What do you mean by "the module"?

comment:4 Changed 9 years ago by jinmei

  • Component changed from Unclassified to data source

comment:5 in reply to: ↑ 3 ; follow-up: Changed 9 years ago by jinmei

Replying to each:

Rather than throwing exceptions from the data source, I would have had b10-auth look for a SERVFAIL Rcode in the reply from doQuery(), and either clear the partial answers out of the Message or construct a new Message with no answer. But, if we're going to throw exceptions, then I agree with your speculation that we should do so in every SERVFAIL case, not just the two that you've already done.

Okay, but let's leave the other cases to a separate ticket anyway.

This observation leads to broader discussions. First, I suspect other internal SERVFAIL cases should actually be changed to exceptions. But I was not so sure whether they are due to something like a broken database or to an unusual, but not necessarily broken, condition, so I didn't touch them. We should revisit these cases too, and, whatever the conclusion is, we should avoid SERVFAIL+answer for these cases.

AFAIK, SERVFAIL always means either that the database is insane, or we've been told to do something unnatural (for example, a glue query found on the task queue, which should never happen). Both seem like natural circumstances for an exception to me.

So, do you mean you now agree exceptions are (more) natural?

Exception vs. error code is often a controversial issue that comes down to a matter of preference, so I'm not necessarily insisting that exceptions be the only choice. But, as I already explained and you seem to have agreed, at least for these two cases the failure is clearly due to a broken database, which should normally be rejected by the tool that builds the data source, so I personally think exceptions are better.

My only question about it is, I recall having a conversation early in the BIND 10 project about someone (maybe you?) running benchmarks that showed exceptions caused a serious performance hit. If that's true, then maybe it would be preferable to continue signaling the problem via the Rcode, and simply handle it better on the calling end? But I'm okay either way.

Overall, I don't think this is an issue in this case.

If an exception is actually triggered, that's normally an expensive operation. But these two cases should truly be "exceptional": they should never happen with a proper tool, and of course they can't be triggered via a remote packet as long as the data source is valid. So in the normal case we can ignore the cost of exceptions. If someone uses a broken tool that produces a broken data source and suffers a performance penalty, I think we can say that's the cost of using a broken tool (and the solution is to fix the tool).

That's one reason why I didn't touch the other SERVFAIL cases. If some of them can be triggered via a remote query even when the underlying data source is valid, that could be a source of DoS and we might want to treat it differently (although this concern applies to broken queries in general... that would be a subject for a different discussion).
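Just to illustrate the cost argument above (not from the ticket; a generic micro-benchmark sketch): with typical "zero-cost" exception implementations the non-throwing path through a try block is essentially free, and only an actually thrown exception is expensive.

    #include <chrono>
    #include <iostream>
    #include <stdexcept>

    int main() {
        const int iterations = 100000;
        long sink = 0;

        // Path 1: try block entered repeatedly, exception never thrown.
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations; ++i) {
            try {
                sink += i;
            } catch (const std::exception&) {
                sink = -1;
            }
        }
        auto t1 = std::chrono::steady_clock::now();

        // Path 2: exception thrown and caught on every iteration.
        for (int i = 0; i < iterations; ++i) {
            try {
                throw std::runtime_error("broken data source");
            } catch (const std::exception&) {
                ++sink;
            }
        }
        auto t2 = std::chrono::steady_clock::now();

        using std::chrono::duration_cast;
        using std::chrono::microseconds;
        std::cout << "never thrown: "
                  << duration_cast<microseconds>(t1 - t0).count() << " us, "
                  << "thrown every time: "
                  << duration_cast<microseconds>(t2 - t1).count() << " us "
                  << "(sink=" << sink << ")" << std::endl;
        return (0);
    }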

So, in the longer term, I think we need to define a border in the top-level data source, below which we cannot assume the validity of the data and the module must perform validation.

What do you mean by "the module"?

The top level data source (or, in this context, the query processing logic that interfaces to the underlying data source). For example, when we do

        return (ds->findRRset(task.qname, task.qclass, task.qtype,
                              target, task.flags, zonename));

we trust the data source to build valid RRsets in target, but even this may not always be guaranteed. A broken data source implementation may return an empty RRset (an RRset with no RDATA), for example.

Saying "that's not our problem" is one solution. But I persontally think it's a bit naive in an open environment like BIND 10, and even if we adopt this solution we should at least be sure a broken input doesn't cause a catastrophic consequence within (e.g.) data_source.cc. It won't be an easy task anyway.

But again, it's beyond the scope of this ticket. If we want to continue this discussion, let's move it to the list or create a separate ticket.

comment:6 Changed 9 years ago by jinmei

  • Owner changed from jinmei to each

comment:7 in reply to: ↑ 5 Changed 9 years ago by each

  • Owner changed from each to jinmei

Replying to jinmei:

So, do you mean you now agree exceptions are (more) natural?

Yes, I agree; from a coding perspective, these are totally natural uses for exceptions. What I was trying to say was that I originally avoided exceptions in the data source code because of the performance issue, and I wanted to make sure this isn't something we need to concern ourselves with.

I'm slightly concerned about the possibility of a bug turning up in the future that makes it possible to trigger a servfail on purpose. If that happened, then the use of exceptions might make it a little more of a DoS vector than otherwise. But at present I don't know of any way to get a servfail other than "the server is crazy", so I think your reasoning is valid.

I'm not sure why you want to defer converting the other servfails to exceptions to another ticket, though. As it stands, the server can fail with an error code *or* an exception, and only one of those is handled correctly in b10-auth (i.e., explicitly clearing out partial answers). Shouldn't we either use exceptions in every case, or else handle both sorts of servfail condition?

To be clear, I'm fine with merging the branch as it is now, I just wanted to discuss these things.

(One minor note: When we get the validating resolver written, it will be possible to trigger a servfail by sending a query that's known not to validate. We'll definitely want to use the error code method then.)

comment:8 follow-up: Changed 9 years ago by each

Suggestion for a way to combine both sorts of servfail handling:

    string error_msg;
    try {
        Query query(message, dnssec_ok);
        impl_->data_sources_.doQuery(query);
    } catch (const Exception& ex) {
        // save the detail here; 'ex' goes out of scope before the
        // logging below
        error_msg = ex.what();
        message.setRcode(Rcode::SERVFAIL());
    }

    if (message.getRcode() == Rcode::SERVFAIL()) {
        if (impl_->verbose_mode_) {
            cerr << "[b10-auth] Internal error, returning SERVFAIL: " <<
                error_msg << endl;
        }

        makeErrorMessage(message, response_renderer, Rcode::SERVFAIL(),
                         impl_->verbose_mode_);
        return (true);
    }

comment:9 in reply to: ↑ 8 ; follow-up: Changed 9 years ago by jinmei

Replying to each:

Suggestion for a way to combine both sorts of servfail handling:

Hmm... I'm not sure this is the way to go (right now). We've fixed the known problems with try-catch + exceptions, so the explicit check for rcode == SERVFAIL is redundant for them. For the other cases, we are not even sure there's really a scenario that could cause them.

Although it might be embarrassing if this really happened, it's not the kind of critical bug that a buffer overrun or cache poisoning is. It may not even be a protocol violation (RFC 1035 doesn't seem to say that the answer section of a SERVFAIL response must be empty, etc.). So adding an explicit case for rcode == SERVFAIL seems like a kind of premature optimization, which may remain in the code for unknown reasons. I'd rather make a separate ticket where we analyze the other SERVFAIL cases to determine whether we can safely treat all of them as truly exceptional and, if not, what the best way to handle them is.

After that analysis, the hybrid approach may turn out to be the best solution.

comment:10 in reply to: ↑ 9 ; follow-up: Changed 9 years ago by jinmei

Checking: so, is it okay to merge this branch to trunk in its current form (and open a separate ticket for the other SERVFAIL cases)?

comment:11 in reply to: ↑ 10 ; follow-up: Changed 9 years ago by each

Replying to jinmei:

Checking: so, is it okay to merge this branch to trunk in its current form (and open a separate ticket for the other SERVFAIL cases)?

Quoting myself: "To be clear, I'm fine with merging the branch as it is now" :)

comment:12 in reply to: ↑ 11 Changed 9 years ago by jinmei

  • Resolution set to fixed
  • Status changed from reviewing to closed

Replying to each:

Replying to jinmei:

Checking: so, is it okay to merge this branch to trunk in its current form (and open a separate ticket for the other SERVFAIL cases)?

Quoting myself: "To be clear, I'm fine with merging the branch as it is now" :)

Ah, okay, thanks. Merged to trunk.

For the other cases of SERVFAIL, opened a separate ticket (#255).

Closing this one.
