Opened 8 years ago

Closed 8 years ago

Last modified 8 years ago

#1342 closed defect (complete)

Reintroduce delayed restarts after #213

Reported by: vorner Owned by: jelte
Priority: medium Milestone: Sprint-20111122
Component: ~Boss of BIND (obsolete) Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: Core Feature Depending on Ticket:
Estimated Difficulty: 4 Add Hours to Ticket:
Total Hours: 6 Internal?: no

Description

The #213 changes in boss remove the support to restart a component after some delay. Currently if it fails, it either means the whole system stops (depending on the config), or it tries to restart it right away. In the second case, if it keeps failing, we have a loop. This is uncomfortable regression.

The old way with restart after some 10s or so should be reintroduced.

Subtickets

Change History (16)

comment:1 Changed 8 years ago by jelte

  • Add Hours to Ticket changed from 0 to 4

comment:2 Changed 8 years ago by jelte

  • Add Hours to Ticket 4 deleted
  • Estimated Difficulty changed from 0 to 4

comment:3 Changed 8 years ago by jelte

  • Milestone changed from Next-Sprint-Proposed to New Tasks

comment:4 Changed 8 years ago by vorner

  • Milestone changed from New Tasks to Next-Sprint-Proposed

We might need this before we merge #213.

comment:5 Changed 8 years ago by jelte

  • Milestone changed from Next-Sprint-Proposed to Sprint-20111122

comment:6 Changed 8 years ago by jelte

  • Owner set to jelte
  • Status changed from new to assigned

comment:7 follow-up: Changed 8 years ago by jelte

  • Owner changed from jelte to UnAssigned
  • Status changed from assigned to reviewing

Turns out with the changes from 213, it's actually easier to implement this, so I've removed the legacy scheduler class and let the component class handle it's own restarting.

Behaviour of restarts changed slightly; it'll now be restarted X seconds (default 10) after time of start, not time of death. But the calculation of restart time is in its own method so we can easily change the algorithm.

comment:8 Changed 8 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:9 in reply to: ↑ 7 ; follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jelte

Hello

Replying to jelte:

Turns out with the changes from 213, it's actually easier to implement this, so I've removed the legacy scheduler class and let the component class handle it's own restarting.

Behaviour of restarts changed slightly; it'll now be restarted X seconds (default 10) after time of start, not time of death. But the calculation of restart time is in its own method so we can easily change the algorithm.

I think that this actually was the original behavior, not 10s after the death, but 10s after the original start. Anyway, this doesn't seem very important.

I fixed the boss tests to pass (and pushed), as you removed the dead_processes variable.

I guess you see no way to test the boss code itself as unittest? Anyway, you might want to check how components interacts with various kinds and being killed sooner or later.

Actually, I found a problem with the behaviour. If I choose a needed component (let's say b10-auth in the default configuration) and kill it later than 10s after the system started, it is restarted. If I kill it again soon afterwards, the system stops. However, the semantics of needed component was to stop only if it fails at startup (with 10s being large enough time to fit the initialization in there probably), when the user typed the command and can see the system is ill. But this might happen after the user went for a holiday, so in this case it shouldn't bring the system down.

Also, what is the purpose of the restart_delay parameter? Nothing uses it.

And, the description of this parameter is misleading. It assumes the component is restarted every time, no matter if the component might be core for example.

Thanks

comment:10 in reply to: ↑ 9 ; follow-up: Changed 8 years ago by jelte

  • Owner changed from jelte to vorner

Replying to vorner:

Hello

Replying to jelte:

I fixed the boss tests to pass (and pushed), as you removed the dead_processes variable.

ah, doh, thanks, I had that change too, but had edited the .py, and not
noticed the .in file :/

I guess you see no way to test the boss code itself as unittest? Anyway, you might want to check how components interacts with various kinds and being killed sooner or later.

There may be a way to do it, I may be able to come up with a mocky
thingy, but either that would mostly be testing the mock itself, and I
think having a test that actually kills a process from the outside
would be more useful.

Actually, I found a problem with the behaviour. If I choose a needed component (let's say b10-auth in the default configuration) and kill it later than 10s after the system started, it is restarted. If I kill it again soon afterwards, the system stops. However, the semantics of needed component was to stop only if it fails at startup (with 10s being large enough time to fit the initialization in there probably), when the user typed the command and can see the system is ill. But this might happen after the user went for a holiday, so in this case it shouldn't bring the system down.

Heh. I *knew* it was too convenient that that variable existed already
:)

Anyway, I added a new one, _original_start_time which is now used for
the check for initial failure (and which is only set on the very first
call to start()). This should now be fixed.

Also, what is the purpose of the restart_delay parameter? Nothing uses it.

seemed useful, but yeah, if we don't use it we might as well hardcode
the value. done.

And, the description of this parameter is misleading. It assumes the component is restarted every time, no matter if the component might be core for example.

specific comment isn't relevant anymore, as i just removed the
parameter, but i added some new text, please take a look at it

comment:11 in reply to: ↑ 10 ; follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jelte

Replying to jelte:

Replying to vorner:

I guess you see no way to test the boss code itself as unittest? Anyway, you might want to check how components interacts with various kinds and being killed sooner or later.

There may be a way to do it, I may be able to come up with a mocky
thingy, but either that would mostly be testing the mock itself, and I
think having a test that actually kills a process from the outside
would be more useful.

In the second sentence, I meant more like the stuff in lib/python/isc/bind10/components.py. There are quite few changes, but not many tests for it. There should be something to see that if it is restarted, the _original_start_time isn't changed, that it is None before the first start, the starts sets it, that a core component doesn't get restarted or return True, etc.

Actually, I found a problem with the behaviour. If I choose a needed component (let's say b10-auth in the default configuration) and kill it later than 10s after the system started, it is restarted. If I kill it again soon afterwards, the system stops. However, the semantics of needed component was to stop only if it fails at startup (with 10s being large enough time to fit the initialization in there probably), when the user typed the command and can see the system is ill. But this might happen after the user went for a holiday, so in this case it shouldn't bring the system down.

Heh. I *knew* it was too convenient that that variable existed already
:)

:-)). Well, I heard one should be wary about the things that come for free ;-).

comment:12 in reply to: ↑ 11 Changed 8 years ago by jelte

  • Owner changed from jelte to vorner

Replying to vorner:

Replying to jelte:

Replying to vorner:

In the second sentence, I meant more like the stuff in lib/python/isc/bind10/components.py. There are quite few changes, but not many tests for it. There should be something to see that if it is restarted, the _original_start_time isn't changed, that it is None before the first start, the starts sets it, that a core component doesn't get restarted or return True, etc.

Ah, yes, of course. I updated a number of the tests (added the return value checks to them) and made a separate once for the start time thingy.

I also found and fixed a bad test that wasn't executed because it was hidden by another test with the same name ('needed' was used when it should've been 'dispensable').

Some additional fooling around also brought to light something I didn't know about python; default values for optional parameters are only evaluated once, which is why originally the time.time() in the tests hack didn't work for me. Also fixed that and updated the tests

Heh. I *knew* it was too convenient that that variable existed already
:)

:-)). Well, I heard one should be wary about the things that come for free ;-).

hear hear :)

comment:13 Changed 8 years ago by vorner

  • Owner changed from vorner to jelte

It looks OK now. Please merge.

Thank you

comment:14 Changed 8 years ago by vorner

  • Total Hours changed from 0 to 1

comment:15 Changed 8 years ago by jelte

  • Add Hours to Ticket set to 5
  • Resolution set to complete
  • Status changed from reviewing to closed

thanks, merged, closing ticket

comment:16 Changed 8 years ago by jelte

  • Add Hours to Ticket 5 deleted
  • Total Hours changed from 1 to 6
Note: See TracTickets for help on using tickets.