Opened 9 years ago

Closed 9 years ago

#775 closed defect (fixed)

b10-auth should not exit if it cannot bind to ports

Reported by: shane Owned by: hanfeng
Priority: high Milestone: Sprint-20110503
Component: b10-auth Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 0.0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

The b10-auth process should not exit if it cannot bind to ports, but should report the error condition.

More background:

I have a server with a running DNS server listening to 0.0.0.0:53, so b10-auth cannot start with the default configuration. The cfgmgr cannot update its settings because b10-auth is not responding (since it cannot start).

See ticket #701 for a ticket explaining the behavior (which might be related).

Subtickets

Change History (15)

comment:1 Changed 9 years ago by shane

  • Priority changed from major to critical

comment:2 Changed 9 years ago by shane

  • Milestone changed from Year 3 Task Backlog to Sprint-20110419

comment:3 Changed 9 years ago by hanfeng

  • Owner set to hanfeng
  • Status changed from new to assigned

comment:4 Changed 9 years ago by hanfeng

  • Owner changed from hanfeng to UnAssigned
  • Status changed from assigned to reviewing

comment:5 Changed 9 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:6 Changed 9 years ago by vorner

  • Owner changed from vorner to hanfeng

Is it possible you forgot to push your changes? I can't find them.

comment:7 Changed 9 years ago by hanfeng

  • Owner changed from hanfeng to vorner

comment:8 follow-up: Changed 9 years ago by vorner

  • Owner changed from vorner to hanfeng

Hello

Actually, I don't like this at all. The function is called in two situations:

  • At startup, when the server needs to bind to the ports the first time. This is the place which the bug talks about, because if this fails, the function throws, and as it is not caught, the server exits just when it starts. That would be mostly OK, because it can't work anyway. But currently our config system depends on the component running to ask it if the new config is good, which it can't, so it can't set a better config.
  • When the setting is changed at runtime. This is the place where current behaviour is actually needed (ok, the abort part is questionable, but IMO it's better, because it will reject the new config by not answering, it will timeout and the server will restart with the config it was able to work before). If you remove the throw there, it will fail to assign new addresses, restore the original ones. But it will not send error to the config manager, so user will think the config is OK (the error will appear only in the log, not at his screen with bindctl). Furthermore, cfgmrg will also think the config is OK, so it will write it to disk and present the config to the server next time it starts. At that time it will be even worse, because it will not have any addresses to roll back to, so the server becomes unusable at some unknown future time when the admin probably isn't near and looking at it.

So, in short, the throw must stay there. In case of the first startup, if it throws, we might want to catch it and not exit the whole program, at last as a short-time workaround, before we make it possible to configure things even when they are not running.

Furthermore, please notice that you modify behaviour of resolver as well, this part of code is shared with it.

And, anyway, the tests fail:

[ RUN      ] AuthConfigTest.listenAddressConfig
Unable to set new address: Failed to initialize network servers: Cannot assign requested address
../../../../src/lib/testutils/portconfig.h:109: Failure
Value of: result->equals(*isc::config::createAnswer())
  Actual: true
  Expected: false
[  FAILED  ] AuthConfigTest.listenAddressConfig (2 ms)

With regards

comment:9 in reply to: ↑ 8 ; follow-up: Changed 9 years ago by hanfeng

Replying to vorner:

So, in short, the throw must stay there. In case of the first startup, if it throws, we might want to catch it and not exit the whole program, at last as a short-time workaround, before we make it possible to configure things even when they are not running.

For this point, I don't agree, if every time port binding failed, we restart, it will make boss quite busy, during our test last time, we can see the server launch and quit for several times. The auth server isn't alive or dead, but keep jumping between them, which is quite terrible, you even can not get a chance to modify the configure by hand.

As for the last throw in port config. I have restore it.

Regards

comment:10 Changed 9 years ago by hanfeng

  • Owner changed from hanfeng to vorner

comment:11 in reply to: ↑ 9 ; follow-up: Changed 9 years ago by vorner

  • Owner changed from vorner to hanfeng

Hello

Replying to hanfeng:

Replying to vorner:

So, in short, the throw must stay there. In case of the first startup, if it throws, we might want to catch it and not exit the whole program, at last as a short-time workaround, before we make it possible to configure things even when they are not running.

For this point, I don't agree, if every time port binding failed, we restart, it will make boss quite busy, during our test last time, we can see the server launch and quit for several times. The auth server isn't alive or dead, but keep jumping between them, which is quite terrible, you even can not get a chance to modify the configure by hand.

Well, for one, it wouldn't get so busy, one restart every 10 seconds isn't busy (but it can be annoying, of course).

Anyway, the abort is not responsible for the jumping. If the process is starting up and it can't bind to the ports, the rollback is to the empty set of addresses, so the second exception can not happen, therefore the abort can't happen. The thing that did kill the process was the first exception, which you catch now.

The effect of the abort is, if user changes the configuration at runtime and it fails, it tries to return back to the original ones. If that fails as well (which it should not, in reality), there's some serious problem. So in that case, it aborted, making the boss restart it (with the old config). That should work, because it worked some time before already. But if that fails as well, it rollbacks to empty set of sockets, you catch the exception and it sits there. So it would jump only once and only in the really improbable situation.

But, after explaining the situation, I don't really care much about it, it's rare. In the long term, we should rewrite the changing of sockets so the old ones are released only after the new ones are successfully bound, so we wouldn't have to care about it (eg. the rollback couldn't throw).

As for the last throw in port config. I have restore it.

ACK. I don't really like the catch-all thing there or the fact that the server would be sitting there in completely useless way (and, in fact, in somehow inconsistent configuration). But due to the current problem with configuration, it's probably the less evil thing. So, could you add some comment around it that it's a temporary solution and should be removed once we are able to configure modules while they are not running?

And, maybe, it should have a changelog.

Otherwise, it is OK

Thanks

comment:12 in reply to: ↑ 11 Changed 9 years ago by hanfeng

Replying to vorner:

The effect of the abort is, if user changes the configuration at runtime and it fails, it tries to return back to the original ones.

I understand the logic, and I add comment for them. As for abort, If we use exception, it's better we don't add other way to represent the abnormal situation.

ACK. I don't really like the catch-all thing there or the fact that the server would be sitting there in completely useless way (and, in fact, in somehow inconsistent configuration).

I have modify the catch statement to only catch configure exception.

And, maybe, it should have a changelog.

I have add the change log.

Regards

comment:13 Changed 9 years ago by hanfeng

  • Defect Severity set to N/A
  • Owner changed from hanfeng to vorner
  • Sub-Project set to DNS

comment:14 Changed 9 years ago by vorner

  • Owner changed from vorner to hanfeng

Hello

The abort was there because we can't throw 2 exceptions at once.

Anyway, the code looks OK. Please merge.

Thanks

comment:15 Changed 9 years ago by hanfeng

  • Resolution set to fixed
  • Status changed from reviewing to closed
Note: See TracTickets for help on using tickets.