Opened 9 years ago

Closed 9 years ago

Last modified 9 years ago

#657 closed defect (fixed)

listen_on for b10-auth leads to 100% CPU usage

Reported by: jinmei Owned by: vorner
Priority: very high Milestone: A-Team-Sprint-20110309
Component: b10-auth Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity:
Sub-Project: Feature Depending on Ticket:
Estimated Difficulty: 5.0 Add Hours to Ticket: 0
Total Hours: 5.0 Internal?: no

Description

If we start BIND 10 with listen_on specified, b10-auth consumes
nearly 100% of CPU time.

{"version": 2,
 "Auth": {
   "listen_on": [{"address": "127.0.0.1", "port": 5300}]
 }
}

If we remove b10-config.db, this doesn't happen.

I have not fully investigated it, but it seems quite critical.

Following Jeremy's recent proposal, I propose we either

  • fix this very soon (within a few hours), or
  • revert f06ce638877acf6f8e1994962bf2dbfbab029edf

Marking this as a blocker, and also pushing it to the current A-Team sprint.

Subtickets

Change History (18)

comment:1 follow-up: Changed 9 years ago by vorner

  • Add Hours to Ticket changed from 0.0 to 5.0
  • Total Hours changed from 0.0 to 5.0

comment:1 follow-up: Changed 9 years ago by jreed

I also see this sometimes. Note the two scenarios below:

[bind10] Started b10-auth (PID 68789)
[b10-auth] DNSServices created.
[bind10] Starting b10-xfrout
[b10-auth] Configuration session channel created.
...
Setting listen addresses:
 0.0.0.0:53
 127.0.0.1:5300[bind10] Started b10-cmdctl (PID 68794)

[bind10] BIND 10 started
Initialize TCP server at 0.0.0.0:53
Initialize UDP server at 0.0.0.0:53
Initialize TCP server at 127.0.0.1:5300
Initialize UDP server at 127.0.0.1:5300
[b10-auth] Configuration channel established.
[b10-auth] Xfrin session channel created.
[b10-auth] Xfrin session channel established.
[b10-auth] Statistics session channel created.
[b10-auth] Statistics session channel established.
Setting listen addresses:
 0.0.0.0:53
 127.0.0.1:5300
Initialize TCP server at 0.0.0.0:53
Initialize UDP server at 0.0.0.0:53
Initialize TCP server at 127.0.0.1:5300
Initialize UDP server at 127.0.0.1:5300
[b10-auth] Set statistics timer to 60 seconds

Not sure why the above shows the listen addresses being set twice. This takes b10-auth to 100% CPU.

But I also have:

[bind10] BIND 10 started
Setting listen addresses:
 0.0.0.0:53
 127.0.0.1:5300
Initialize TCP server at 0.0.0.0:53
Unable to set new address: Failed to initialize network servers: Permission denied
[b10-auth] error: Server configuration failed: Failed to initialize network servers: Permission denied
[b10-auth] Configuration channel established.
[b10-auth] Xfrin session channel created.
[b10-auth] Xfrin session channel established.
[b10-auth] Statistics session channel created.
[b10-auth] Statistics session channel established.
Setting listen addresses:
 ::1:5300
 127.0.0.1:5300
Initialize TCP server at ::1:5300
Initialize UDP server at ::1:5300
Initialize TCP server at 127.0.0.1:5300
Initialize UDP server at 127.0.0.1:5300
[b10-auth] Set statistics timer to 60 seconds
[b10-auth] Data source database file: /usr/home/jreed/opt/bind10/var/bind10-devel/zone.sqlite3
[b10-auth] Server started.

Problem: it falls back to default listen addresses and ports. That is broken.

I put this into the same ticket for now, as I believe it is related.

But now it doesn't use high CPU.

comment:2 in reply to: ↑ 1 Changed 9 years ago by jreed

Replying to jreed:

Problem: it falls back to default listen addresses and ports. That is broken.

I opened a ticket for that: #662.

comment:3 Changed 9 years ago by vorner

  • Owner set to vorner
  • Status changed from new to accepted

Hmm, it seems I'm responsible for this one, so I'm going to have a look into it.

comment:4 follow-ups: Changed 9 years ago by vorner

  • Owner changed from vorner to UnAssigned
  • Status changed from accepted to reviewing

Hello

The problem was caused by a loop in the TCP and UDP servers which retried accept/recv on error. That caused a busy loop: when the original socket was closed during an update, its accept generated an error, and the loop kept retrying on the closed socket.

I added a condition to allow only some errors, which seemed to me to be non-fatal. However, I have no idea how to test this (because even with the bug the server did work, it only kept being busy), and the asio documentation is quite sparse on how to recognize the errors (Stephen suggested that it probably contains errno, which makes sense, but how do we test that, too?).

The branch contains one unrelated change ‒ I noticed a nop call when debugging this, so I removed it.

Anyway, we might want to move away from coroutines in the code, and maybe we should revisit the set of allowed errors some time. But this seems to solve the 100% CPU problem, and I tested that the server answers queries, so I'm putting it up for review this way.

Proposed changelog entry:

[bug]      vorner
Solved a 100% CPU usage problem after switching addresses in b10-auth
(and possibly, but unconfirmed, in b10-resolver). It was caused by repeated
reads/accepts on a closed socket.

comment:5 Changed 9 years ago by jinmei

  • Owner changed from UnAssigned to jinmei
  • Status changed from reviewing to accepted

comment:6 in reply to: ↑ 4 Changed 9 years ago by jinmei

Replying to vorner:

The branch contains one unrelated change ‒ I noticed a nop call when debugging this, so I removed it.

First off, this is not a "nop". We need this trick to ensure we use the
special sqlite3 database file ($B10_FROM_BUILD/bind10_zones.sqlite3)
when running in the "from source" mode.

Actually, I don't like having this special case; we previously
discussed this on jabber and (if I remember correctly) roughly
agreed that we should stop using the hardcoded special value and
should have the user specify the file in the configuration (if
necessary).

But that would be beyond the scope of this ticket anyway. For now
I've reverted this commit and pushed the revert to the shared
repo (b9d341e).

I'll post a separate comment on the main topic.

comment:7 in reply to: ↑ 4 ; follow-up: Changed 9 years ago by jinmei

Replying to vorner:

The problem was caused by a loop in the TCP and UDP servers which retried accept/recv on error. That caused a busy loop: when the original socket was closed during an update, its accept generated an error, and the loop kept retrying on the closed socket.

I added a condition to allow only some errors, which seemed to me to be non-fatal. However, I have no idea how to test this (because even with the bug the server did work, it only kept being busy), and the asio documentation is quite sparse on how to recognize the errors (Stephen suggested that it probably contains errno, which makes sense, but how do we test that, too?).

I confirmed the change stopped the symptom, but as I looked closely
at the implementation, several concerns arose (so the review took
much longer than I originally expected).

First, we should have been able to avoid this problem if we had
implemented #388 correctly. When "the original socket was closed in
update", the stop interface introduced in #388 is called, and since
the "stop state" is checked in TCP/UDPServer::operator(), that should
prevent the infinite loop.

In fact, #388 does this for UDPServer (so the loop doesn't occur there
even without the patch of this thread), but the TCPServer version is
buggy. TCPServer::stopped_by_hand_ is a plain value (instead of a
pointer or reference that can be shared by multiple objects), so even
if TCPServer::stop() sets it to true, the effect is lost once the
TCPServer object is copied by the coroutine framework.

Moving one step further, we should (probably) have been able to avoid
introducing this bug in #388 if we had really written tests. I
understand that tests relying on real sockets and network I/O like
this one may not be trivial, but according to the ticket log of #388
it's questionable whether the implementor even thought about tests.
In hindsight, we should not have allowed code of that quality to be
merged in the first place.

Assuming we fix this bug of #388, I'm also not sure this
implementation is reasonable. It introduces another piece of state,
stopped_by_hand_, which has a subtle relationship with the other
state of the class (i.e., the sockets), making the code difficult to
understand and more error prone. Closing the sockets at this point
doesn't seem to be a good idea either; it will make subsequent socket
operations fail abruptly, and since that cannot be distinguished from
real errors in the network layer, we cannot be sure whether we should
log the event or not (as commented in the patch). I'd rather "cancel"
the socket here and handle the cancel event inside the handler. That
way, and with my understanding that cancel can be called multiple
times, we can also eliminate the stopped_by_hand_ member variable. I
also note that "socket_->close()" in TCPServer::stop() doesn't do
what the code intended (whether that's close() or cancel()) for the
same reason stopped_by_hand_ is buggy.

I have another concern about how we handle updating "listen_on".
Currently it closes all existing sockets and listens on new ones from
scratch. If the old and new sets of interfaces share the same
address/port, this will (possibly) lead to service disruption. We
should do this in a more sophisticated way: compare the old and new
sets, stop only the disappearing ones, and open only the genuinely
new ones.

With all of this considered, most substantially the code/design
quality of #388 without tests, I'd suggest we step back first
and restart: revert #388, #575, and #576, redo #388 with tests,
then re-add #575 and #576.

Finally, comments about the patch itself follow:

  • [comment] handling socket errors in UDP/TCPServer::operator() would be necessary anyway, even if we fix the various bugs of #388 (there may be system-level errors that could lead to an infinite loop).
  • I'd avoid using UNIX-specific error codes such as EWOULDBLOCK. ASIO has higher-level error constants (see asio/error.hpp)
  • similarly, I'm not sure we have to explicitly check the error category (it may not do harm, but the intent isn't clear and IMO it reduces code readability)
  • [comment] you actually fixed another bug in TCPServer::operator(). In the following original code:
                do {
                    CORO_YIELD acceptor_->async_accept(*socket_, *this);
                } while (!ec);
    
    !ec should actually be just ec.
  • as for logging, as I suggested, I'd separate intentional cancels from real errors and handle them separately in terms of logging.

comment:8 Changed 9 years ago by jinmei

  • Owner changed from jinmei to vorner
  • Status changed from accepted to assigned

comment:9 in reply to: ↑ 7 ; follow-up: Changed 9 years ago by vorner

  • Owner changed from vorner to jinmei
  • Status changed from assigned to reviewing

Hello

Replying to jinmei:

In fact, #388 does this for UDPServer (so the loop doesn't occur there
even without the patch of this thread), but the TCPServer version is
buggy. TCPServer::stopped_by_hand_ is a plain value (instead of a
pointer or reference that can be shared by multiple objects), so even
if TCPServer::stop() sets it to true, the effect is lost once the
TCPServer object is copied by the coroutine framework.

Well, the problem is, if we share it, we might accidentally close an already accepted connection (or rather, not close it, but stop handling it). A solution without that problem would be much more complicated and unreadable.

Actually, I don't like stopped_by_hand_ much anyway. If we close the socket, the OS knows, so nothing more will come.

Moving one step further, we should (probably) have been able to avoid
introducing this bug in #388 if we had really written tests. I
understand that tests relying on real sockets and network I/O like
this one may not be trivial, but according to the ticket log of #388
it's questionable whether the implementor even thought about tests.
In hindsight, we should not have allowed code of that quality to be
merged in the first place.

About the tests, I agree we should have them. But that ticket would have needed a test framework around the whole sockets stuff, which I believe would have been too big for it. I think a ticket was opened for the test framework back then, but I can't find it right now.

Anyway, I can't think of a test that could actually catch this. Because all functionality was preserved, nothing stopped working. Even the socket got closed, so testing that we can't connect wouldn't help. And at the end of the test the io service would be destroyed, eliminating even the 100% CPU problem.

Assuming we fix this bug of #388, I'm also not sure this
implementation is reasonable. It introduces another piece of state,
stopped_by_hand_, which has a subtle relationship with the other
state of the class (i.e., the sockets), making the code difficult to
understand and more error prone. Closing the sockets at this point
doesn't seem to be a good idea either; it will make subsequent socket
operations fail abruptly, and since that cannot be distinguished from
real errors in the network layer, we cannot be sure whether we should
log the event or not (as commented in the patch). I'd rather "cancel"
the socket here and handle the cancel event inside the handler. That
way, and with my understanding that cancel can be called multiple
times, we can also eliminate the stopped_by_hand_ member variable. I
also note that "socket_->close()" in TCPServer::stop() doesn't do
what the code intended (whether that's close() or cancel()) for the
same reason stopped_by_hand_ is buggy.

Hmm. That sounds like a nice refactoring of the code. Should it be on a separate ticket, this one, or #388 (reopened)?

I have another concern about how we handle updating "listen_on".
Currently it closes all existing sockets and listens on new ones from
scratch. If the old and new sets of interfaces share the same
address/port, this will (possibly) lead to service disruption. We
should do this in a more sophisticated way: compare the old and new
sets, stop only the disappearing ones, and open only the genuinely
new ones.

Yes, and I believe there's a TODO about it. At the time I wrote it into the resolver I wanted something working, with a note that this is fine-tuning. I agree we want it in the long term, but it's not part of this ticket. Should I just open a new one for it?

With all of this considered, most substantially the code/design
quality of #388 without tests, I'd suggest we step back first
and restart: revert #388, #575, and #576, redo #388 with tests,
then re-add #575 and #576.

Hmm, I'm a little bit afraid of that. I'm not sure what reverting these branches would do to the code and history; they are already quite tangled into the rest. It might work, though.

But we would then either have to promise to have it done before the release, or revert even the parts with documentation updates.

Anyway, do you have an idea how to do the tests in reasonable way?

Finally, comments about the patch itself follow:

  • [comment] handling socket errors in UDP/TCPServer::operator() would be necessary anyway, even if we fix the various bugs of #388 (there may be some system level errors that could lead to infinite loop).

ACK. I was quite surprised by that code and wondered why whoever wrote it did it this way; it gives a kind of busy-loop impression.

  • I'd avoid using UNIX-specific error codes such as EWOULDBLOCK. ASIO has higher-level error constants (see asio/error.hpp)

Good ☺. The asio online documentation is a little unfriendly to me, but I'll have a look there soon.

  • [comment] you actually fixed another bug in TCPServer::operator(). In the following original code:
                do {
                    CORO_YIELD acceptor_->async_accept(*socket_, *this);
                } while (!ec);
    
    !ec should actually be just ec.

Yes, I know. I even had a theory about why the original code worked anyway.

  • as for logging, as I suggested, I'd separate intentional cancels from real errors and handle them separately in terms of logging.

Should that be part of this ticket?

ACK about the false nop, I already noticed the hidden code there.

(Switching the ticket state back to reviewing; accepting kind of takes it out of the review state.)

comment:10 in reply to: ↑ 9 ; follow-up: Changed 9 years ago by jinmei

Replying to vorner:

Well, the problem is, if we share it, we might accidentally close an already accepted connection (or rather, not close it, but stop handling it). A solution without that problem would be much more complicated and unreadable.

Actually, I don't like stopped_by_hand_ much anyway. If we close the socket, the OS knows, so nothing more will come.

[snip]

About the tests, I agree we should have them. But that ticket would have needed a test framework around the whole sockets stuff, which I believe would have been too big for it. I think a ticket was opened for the test framework back then, but I can't find it right now.

Anyway, I can't think of a test that could actually catch this. Because all functionality was preserved, nothing stopped working. Even the socket got closed, so testing that we can't connect wouldn't help. And at the end of the test the io service would be destroyed, eliminating even the 100% CPU problem.

Right now I don't have a specific idea about how to test it, but the
difficulty you mention seems to highlight another problem with how
this was developed (although I understand you were not responsible
for that): we should have begun with how to test it and then
implemented it. Clearly #388 wasn't developed that way, and now that
the code exists we are puzzled about how to test it. This is another
reason I'd suggest we first step back.

Asynchronous behavior like this is generally hard to understand, so
IMO we shouldn't have allowed it to be merged without tests. Even if
we didn't have this bug, we'll eventually want to update the code,
and without tests we won't be able to do that with confidence.

Another reason I'd prefer stepping back is to keep ourselves
responsible for writing tests. As I noted in my previous comment,
#388 was put up for review without tests and without even explaining
why the tests were missing. IMO this is not sufficiently responsible,
and it would be an effective way to keep ourselves motivated if the
cost of not doing so were being forced to revert the irresponsible
commit.

If you don't agree with reverting these features, however, one
possible compromise might be:

  • merge the error check proposed in this ticket anyway. This would be necessary regardless of the problems of #388 as discussed separately, so we could say this is not giving immunity to #388 but an independent fix that happens to solve another bug caused by #388.
  • in parallel, create a test ticket for #388 and give it the highest priority (whoever is most responsible for #388 should do it first). This should be done in a test-driven manner, so that we won't be trapped into a particular style of implementation, saying "we cannot test this implementation".

Assuming we fix this bug of #388, I'm also not sure this
implementation is reasonable.

[snip]

Hmm. That sounds like a nice refactoring of the code. Should it be on a separate ticket, this one, or #388 (reopened)?

I'd say it should go to a separate ticket.

I have another concern about how we handle updating "listen_on".

[snip]

Yes, and I believe there's a TODO about it. At the time I wrote it into the resolver I wanted something working, with a note that this is fine-tuning. I agree we want it in the long term, but it's not part of this ticket. Should I just open a new one for it?

Yes, please.

With all of this considered, most substantially the code/design
quality of #388 without tests, I'd suggest we step back first
and restart: revert #388, #575, and #576, redo #388 with tests,
then re-add #575 and #576.

Hmm, I'm a little bit afraid of that. I'm not sure what reverting these branches would do to the code and history; they are already quite tangled into the rest. It might work, though.

But we would then either have to promise to have it done before the release, or revert even the parts with documentation updates.

Anyway, do you have an idea how to do the tests in reasonable way?

See above.

  • I'd avoid using UNIX-specific error codes such as EWOULDBLOCK. ASIO has higher-level error constants (see asio/error.hpp)

Good. The asio online documentation is a little unfriendly to me, but I'll have a look there soon.

Making documentation friendly is a difficult task :-) I think the ASIO
documentation is quite decent, but I agree it could be much improved.
The same applies to our documentation :-)

  • as for logging, as I suggested, I'd separate intentional cancels from real errors and handle them separately in terms of logging.

Should that be part of this ticket?

Not necessarily. It depends on how we deal with the general issue:
if we step back, there will be multiple follow-up tasks, and this
would be part of them. If we choose to merge the error-check fix,
it wouldn't contain larger changes such as using cancel anyway.

comment:11 Changed 9 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:12 in reply to: ↑ 10 ; follow-up: Changed 9 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

  • I'd avoid using UNIX-specific error codes such as EWOULDBLOCK. ASIO has higher-level error constants (see asio/error.hpp)

OK, this is updated. Should I merge this? Or did I overlook something in the long discussion?

Thanks.

comment:13 in reply to: ↑ 12 Changed 9 years ago by jinmei

Replying to vorner:

  • I'd avoid using UNIX-specific error codes such as EWOULDBLOCK. ASIO has higher-level error constants (see asio/error.hpp)

OK, this is updated. Should I merge this? Or did I overlook something in the long discussion?

It can be merged (but we shouldn't forget to create and assign the
#388 revision ticket as a high-priority task).

comment:14 Changed 9 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:15 Changed 9 years ago by vorner

  • Resolution set to fixed
  • Status changed from reviewing to closed

Thanks, merged.

And the ticket for #388 is #678.

comment:16 Changed 9 years ago by vorner

  • Add Hours to Ticket changed from 0 to 5

comment:17 Changed 9 years ago by vorner

  • Estimated Difficulty changed from 0.0 to 5