Opened 9 years ago

Closed 8 years ago

#414 closed defect (duplicate)

huge zone in sqlite3 hangs b10-auth if nxdomain

Reported by: jreed
Owned by: shane
Priority: medium
Milestone: Sprint-20110531
Component: data source
Version:
Keywords:
Cc:
CVSS Scoring:
Parent Tickets:
Sensitive: no
Defect Severity: N/A
Sub-Project: DNS
Feature Depending on Ticket:
Estimated Difficulty: 30.0
Add Hours to Ticket: 0
Total Hours: 0
Internal?: no

Description

I don't yet know at what zone size this is triggered.

I have a zone.sqlite3 with 100 million entries. If I query for something that does not exist, b10-auth hangs. If I kill b10-auth, bind10 will restart it and I can do some queries (of labels in the zone).

If I query for something not in the zone, I get REFUSED (correctly).

This problem happens with or without caching (-n).

I straced the hung process for about 10 seconds and got 4951 read and lseek lines:

Process 10963 attached - interrupt to quit
lseek(15, 4249217024, SEEK_SET)         = 4249217024
read(15, "\r\0\0\0\r\0D\0\3\260\3`\3\36\2\334\2\214\2<\1\372\1\270\1h\1\30\0\326\0\224"..., 1024) = 1024
lseek(15, 4249220096, SEEK_SET)         = 4249220096
read(15, "\r\0\0\0\r\0D\0\3\260\3n\3,\2\334\2\214\2J\2\10\1\270\1h\1&\0\344\0\224"..., 1024) = 1024
lseek(15, 4249222144, SEEK_SET)         = 4249222144
read(15, "\r\0\0\0\r\0R\0\3\276\3|\3,\2\334\2\232\2X\2\10\1\270\1v\0014\0\344\0\224"..., 1024) = 1024
lseek(15, 4249224192, SEEK_SET)         = 4249224192
read(15, "\r\0\0\0\r\0R\0\3\276\3n\3\36\2\334\2\232\2J\1\372\1\270\1v\1&\0\326\0\224"..., 1024) = 1024
lseek(15, 4249226240, SEEK_SET)         = 4249226240

...

read(15, "\r\0\0\0\r\0R\0\3\276\3n\3\36\2\334\2\232\2J\1\372\1\270\1v\1&\0\326\0\224"..., 1024) = 1024
lseek(15, 4258495488, SEEK_SET)         = 4258495488
read(15, "\r\0\0\0\r\0D\0\3\260\3`\3\36\2\334\2\214\2<\1\372\1\270\1h\1\30\0\326\0\224"..., 1024) = 1024
lseek(15, 4258498560, SEEK_SET)         = 4258498560
read(15, "\r\0\0\0\r\0D\0\3\260\3n\3,\2\334\2\214\2J\2\10\1\270\1h\1&\0\344\0\224"..., 1024) = 1024
lseek(15, 4258499584, SEEK_SET)         = 4258499584
read(15, "\r\0\0\0\r\0R\0\3\276\3|\3,\2\334\2\232\2X\2\10\1\270\1v\0014\0\344\0\224"..., 1024) = 1024
lseek(15, 4258500608, SEEK_SET)         = 4258500608
read(15, "\r\0\0\0\r\0R\0\3\276\3n\3\36\2\334\2\232\2J\1\372\1\270\1v\1&\0\326\0\224"..., 1024) = 1024
Process 10963 detached
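
A trace like this can be summarized with a short script, for example to check whether the reads walk the file in one direction. The following is a minimal sketch in Python; the function name `summarize` and the keys it returns are hypothetical, and the sample input is just the first four trace lines quoted above.

```python
import re

# Patterns for the two strace line shapes seen in the trace above.
LSEEK_RE = re.compile(r"^lseek\(\d+, (\d+), SEEK_SET\)\s+= \d+")
READ_RE = re.compile(r"^read\(\d+, .*, \d+\)\s+= (\d+)")


def summarize(lines):
    """Count seeks and reads and report the file-offset range touched."""
    offsets = []
    reads = 0
    bytes_read = 0
    for line in lines:
        m = LSEEK_RE.match(line)
        if m:
            offsets.append(int(m.group(1)))
            continue
        m = READ_RE.match(line)
        if m:
            reads += 1
            bytes_read += int(m.group(1))
    return {
        "seeks": len(offsets),
        "reads": reads,
        "bytes_read": bytes_read,
        "first_offset": offsets[0] if offsets else None,
        "last_offset": offsets[-1] if offsets else None,
        # True when every seek lands at or after the previous one.
        "forward_only": all(a <= b for a, b in zip(offsets, offsets[1:])),
    }


# First four lines of the trace above, copied verbatim (raw strings keep
# the backslash escapes exactly as strace printed them).
SAMPLE = [
    r'lseek(15, 4249217024, SEEK_SET)         = 4249217024',
    r'read(15, "\r\0\0\0\r\0D\0\3\260\3`\3\36\2\334\2\214\2<\1\372\1\270\1h\1\30\0\326\0\224"..., 1024) = 1024',
    r'lseek(15, 4249220096, SEEK_SET)         = 4249220096',
    r'read(15, "\r\0\0\0\r\0D\0\3\260\3n\3,\2\334\2\214\2J\2\10\1\270\1h\1&\0\344\0\224"..., 1024) = 1024',
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

In the lines quoted above, every read returns 1024 bytes and the offsets only increase, which would be consistent with the library scanning a large part of the file rather than looping over the same pages.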

Change History (12)

comment:1 in reply to: ↑ description Changed 9 years ago by jinmei

Replying to jreed:

I don't yet know at what zone size this is triggered.

I have a zone.sqlite3 with 100 million entries. If I query for something that does not exist, b10-auth hangs. If I kill b10-auth, bind10 will restart it and I can do some queries (of labels in the zone).

FWIW I saw the same problem in my earlier performance experiments (and I believe I reported that, although not ticketed). It happened whether the query should return a positive or a negative response. My guess at that time was that it took too much time within the sqlite3 library, and your strace result seems to support this speculation.

comment:2 Changed 9 years ago by jreed

Yes, it happens for most positive responses too. Also the CPU percentage for b10-auth jumps up to around 20% use. I haven't waited to see how long this lasts. I will let it run now.

I am able to successfully get a response for the SOA and NS (as long as it is not already hung).

And I can query for the A record for the NS record's target. If I have caching enabled (default), this will return the answer if I previously asked for the SOA or NS first and got it back in the ADDITIONAL SECTION (because it was cached). But if I run b10-auth without caching (-n) or if the cache expired (after 30 seconds), then it returns no answer (but returns the NS records associated with it in the AUTHORITY SECTION instead) -- and does not hang. (This A and NS at the same label was caused by a different problem reported in #413).

comment:3 Changed 9 years ago by stephen

  • Milestone set to A-Team-Sprint-20110126

comment:4 Changed 9 years ago by stephen

  • Estimated Difficulty changed from 0.0 to 30

comment:5 follow-up: Changed 9 years ago by shane

The problem here ends up being an unindexed column. The 'rdtype' column on the 'records' table is used in the following lookup:

const char* const q_previous_str = "SELECT name FROM records "
    "WHERE zone_id=?1 AND rdtype = 'NSEC' AND "
    "rname < $2 ORDER BY rname DESC LIMIT 1";

In this case, the SQL engine has to look through ALL entries with "rname < $2" in order to find ones that match "rdtype = 'NSEC'" since rdtype is unindexed.

The solution is to add an appropriate index:

CREATE INDEX records_byrdtype ON records (rdtype)

The code change is simple (one line in SCHEMA_LIST[]), but we need to deal with being able to update the schema. We'll open a separate ticket for this, along with collecting a couple more SQL updates in other tickets.
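
The effect of such an index on the query plan can be checked outside b10-auth. The following is a minimal sketch using Python's standard sqlite3 module. The table definition is a simplified stand-in (the real schema is not shown in this ticket), the records_byrname index is assumed to exist, the helper names are hypothetical, and the second parameter is written as ?2 rather than $2 so that both can be bound positionally. The plan text depends on the SQLite version, so no expected output is shown.

```python
import sqlite3

# Assumed, simplified stand-in for the 'records' table: only the columns
# the query touches, plus an assumed pre-existing index on rname.
SCHEMA = """
CREATE TABLE records (
    id INTEGER PRIMARY KEY,
    zone_id INTEGER NOT NULL,
    name TEXT NOT NULL,
    rname TEXT NOT NULL,
    rdtype TEXT NOT NULL
);
CREATE INDEX records_byrname ON records (rname);
"""

# The lookup quoted above, with both parameters in ?N form.
Q_PREVIOUS = (
    "SELECT name FROM records "
    "WHERE zone_id=?1 AND rdtype = 'NSEC' AND "
    "rname < ?2 ORDER BY rname DESC LIMIT 1"
)


def build_demo(n=1000):
    """Create an in-memory zone with n 'A' rows and a single NSEC row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    rows = [
        (1, f"host{i:05d}.example.", f"example.host{i:05d}.", "A")
        for i in range(n)
    ]
    rows.append((1, "example.", "example.", "NSEC"))
    conn.executemany(
        "INSERT INTO records (zone_id, name, rname, rdtype) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    return conn


def query_plan(conn, zone_id, rname):
    """Return the 'detail' column of EXPLAIN QUERY PLAN for the lookup."""
    cur = conn.execute("EXPLAIN QUERY PLAN " + Q_PREVIOUS, (zone_id, rname))
    return [row[-1] for row in cur.fetchall()]


if __name__ == "__main__":
    conn = build_demo()
    print("before:", query_plan(conn, 1, "example.host00500."))
    conn.execute("CREATE INDEX records_byrdtype ON records (rdtype)")
    print("after: ", query_plan(conn, 1, "example.host00500."))
```

Comparing the two printed plans shows which index, if any, SQLite chooses in each case. On an existing database file the statement could be applied idempotently with CREATE INDEX IF NOT EXISTS, though how the schema upgrade should be handled is left to the separate ticket mentioned above.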

comment:6 in reply to: ↑ 5 Changed 9 years ago by jinmei

Replying to shane:

The problem here ends up being an unindexed column. The 'rdtype' column on the 'records' table is used in the following lookup:

[...]

The code change is simple (one line in SCHEMA_LIST[]), but we need to deal with being able to update the schema. We'll open a separate ticket for this, along with collecting a couple more SQL updates in other tickets.

(not directly related to the main topic of this ticket) If you change
the schema in an incompatible manner, please also reconsider renaming
'rdtype' to, e.g., 'rrtype'. "rdtype" is BIND 9 terminology, which
has the notion of "rdataset". Since we are basically handling things
per RR or per RRset, "rdtype" does not really sound like an appropriate
name.

comment:7 Changed 9 years ago by jinmei

What should we do with this ticket? Is this still urgent?

I see this can be an important bug fix, but it seems to me completing
the in-memory data source is more urgent both in terms of "contract" and
in practice (such a huge zone wouldn't yet be ready to be used in actual,
even experimental, deployment due to other performance limitations of
the current implementation).

So I'd propose moving this to the sprint backlog, and move/prioritize
wildcard related tasks.

Giving it back to Shane, who seems to have the strongest opinion.

comment:8 Changed 9 years ago by jinmei

  • Owner set to shane
  • Status changed from new to assigned

comment:9 Changed 9 years ago by jinmei

This ticket has stalled for 3 weeks.

I suspect it's time for someone other than Shane to take it over.

comment:10 Changed 9 years ago by stephen

  • Milestone A-Team-Task-Backlog deleted

comment:11 Changed 9 years ago by shane

  • Milestone set to Sprint-20110405

comment:12 Changed 8 years ago by shane

  • Defect Severity set to N/A
  • Resolution set to duplicate
  • Status changed from assigned to closed
  • Sub-Project set to DNS

The changes in ticket #324 fix this problem as well. Please refer to that ticket for more information.
