Opened 8 years ago

Closed 8 years ago

#1365 closed task (complete)

Restarts in #213

Reported by: vorner Owned by: vorner
Priority: medium Milestone: Sprint-20111122
Component: ~Boss of BIND (obsolete) Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: Core Feature Depending on Ticket:
Estimated Difficulty: 0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

This is a sub-branch of #213, handling of the restarts using the new framework. It's in branch 213-incremental-restarts.

Subtickets

Change History (15)

comment:1 follow-up: Changed 8 years ago by vorner

  • Owner set to jinmei
  • Status changed from new to reviewing

Hello

I looked at the system tests. It looks like the socket creator (together with some more processes) refused to stop in a peaceful way, so boss started to send SIGTERMs and SIGKILLs. However, socket creator reset it's internal process object for some reason, so it couldn't send it, therefore not being able to stop. The patch should solve the problem in the boss side, but I'm not clear on why the processes didn't want to stop in the first place.

I'm giving the ticket directly to you, as you already started reviewing the branch. I hope it's OK.

comment:2 Changed 8 years ago by jinmei

  • Milestone changed from Sprint-DHCP-20111109 to Sprint-20111108

comment:3 in reply to: ↑ 1 ; follow-up: Changed 8 years ago by jinmei

Replying to vorner:

Hello

I looked at the system tests. It looks like the socket creator (together with some more processes) refused to stop in a peaceful way, so boss started to send SIGTERMs and SIGKILLs. However, socket creator reset it's internal process object for some reason, so it couldn't send it, therefore not being able to stop. The patch should solve the problem in the boss side, but I'm not clear on why the processes didn't want to stop in the first place.

I chased it further, and found the real cause. If we reset the creator,
we implicitly delete the Popen object. It internally waits() for the
child process in __del__(), which results in incosistent state
between the boss (who believes the process is still alive) and the system
(where the process doesn't exist).

This is tricky, so I added some more detailed documentation about this
point (pushed).

Other general comments about this subbranch:

  • maybe rename BoB.processes to BoB.components?
  • Likewise, I'd rename the 'info' parameter of register_process() to 'component' and document that it's expected to be an object of (sub)class of BaseComponent?.
  • restart_processes() is now never-called but kept for reference, right? I'd add a comment or doc string about that not to confuse people who happen to see this code.
  • some part of restart_processes() are now buggy. e.g. this is obviously wrong:
                        self.processes[proc_info.pid] = proc_info
    
    but I'm okay with leaving it if we address the previous bullet point.
  • It seems we can now remove BoB.sockcreator.
  • Some log messages should be cleaned up, e.g. BIND10_SOCKCREATOR_CRASHED.
  • test_show_processes_started: I'm not sure if this replacement is okay. Actually I'm not sure about the original intent of the test, but from the test name it seems to try to test whether the intended set of processes (components) have started up. If so, the specific list of components had a meaning. (We may eventually have to replace them from a list of processes based on the default config, but right now they are hardcoded).

The following are suggestions that I found useful when I chased the
systest failure further:

  • suggestion: adding this to the end of SockCreator?._start_internal()

self._boss.log_started(self.pid())

  • suggestion: in reap_children(), log the event of final child shutdown:
                if pid in self.processes:
                    # One of the processes we know about.  Get information on it.
                    component = self.processes.pop(pid)
                    if component.running() and self.runnable:
                        # Tell it it failed. But only if it matters (we are
                        # not shutting down and the component considers itself
                        # to be running.
                        component.failed(exit_status);
    
                    # Additional log here:
                    logger.info(something_like_indicating_shutdown,
                                component.name(), component.pid(),
                                exit_status)
    

comment:4 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:5 in reply to: ↑ 3 ; follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

This is tricky, so I added some more detailed documentation about this
point (pushed).

Thanks for that. Seems like the computers are able to bring new surprises all the time :-)).

  • maybe rename BoB.processes to BoB.components?

I added comment about that, I believe processes is better, since there isn't one-to-one mapping between components and processes.

  • Likewise, I'd rename the 'info' parameter of register_process() to 'component' and document that it's expected to be an object of (sub)class of BaseComponent?.

The documentation is in 213-incremental already, it will get there on the merge. Parameter renamed.

  • restart_processes() is now never-called but kept for reference, right? I'd add a comment or doc string about that not to confuse people who happen to see this code.

I added a comment. It actually is called, but as the loop is over empty dict, it only returns None and does nothing. Yes, it is preserved for the time when we return the delayed restarts, which should be soon I hope.

  • test_show_processes_started: I'm not sure if this replacement is okay. Actually I'm not sure about the original intent of the test, but from the test name it seems to try to test whether the intended set of processes (components) have started up. If so, the specific list of components had a meaning. (We may eventually have to replace them from a list of processes based on the default config, but right now they are hardcoded).

Actually, this tests the method that lists the currently running processes, when you send the Boss show_processes command. It didn't test the right processes are started, the start_all_processes was used to only fill in some processes to be tested. Actually, this way it is closer to the idea of unit tests, testing only the one thing, and it is the function providing the processes. It is more or less the same as the above test, but checks it works even when there are some tests (so the format can be checked).

The following are suggestions that I found useful when I chased the
systest failure further:

Added both.

Thank you

comment:6 in reply to: ↑ 5 ; follow-up: Changed 8 years ago by jinmei

Replying to vorner:

  • maybe rename BoB.processes to BoB.components?

I added comment about that, I believe processes is better, since there isn't one-to-one mapping between components and processes.

I see, but then doesn't this contradict the idea?

                component = self.processes.pop(pid)

or

            components_to_stop = list(self.processes.values())
            for component in components_to_stop:

Actually, these code fragments made me make this comment.

  • test_show_processes_started: I'm not sure if this replacement is okay. [...]

Actually, this tests the method that lists the currently running processes, when you send the Boss show_processes command. It didn't test the right processes are started, the start_all_processes was used to only fill in some processes to be tested. Actually, this way it is closer to the idea of unit tests, testing only the one thing, and it is the function providing the processes. It is more or less the same as the above test, but checks it works even when there are some tests (so the format can be checked).

Okay, but in that case I'd rename the test to, e.g. just
test_show_processes(), and the original test_show_processes() to
test_show_processes_empty().

Other changes look okay.

comment:7 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:8 in reply to: ↑ 6 ; follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

Replying to vorner:

  • maybe rename BoB.processes to BoB.components?

I added comment about that, I believe processes is better, since there isn't one-to-one mapping between components and processes.

I see, but then doesn't this contradict the idea?

                component = self.processes.pop(pid)

or

            components_to_stop = list(self.processes.values())
            for component in components_to_stop:

Actually, these code fragments made me make this comment.

Actually, each item in the dict corresponds to one process. But the values are componenents, as each process belongs to one.

  • test_show_processes_started: I'm not sure if this replacement is okay. [...]

Actually, this tests the method that lists the currently running processes, when you send the Boss show_processes command. It didn't test the right processes are started, the start_all_processes was used to only fill in some processes to be tested. Actually, this way it is closer to the idea of unit tests, testing only the one thing, and it is the function providing the processes. It is more or less the same as the above test, but checks it works even when there are some tests (so the format can be checked).

Okay, but in that case I'd rename the test to, e.g. just
test_show_processes(), and the original test_show_processes() to
test_show_processes_empty().

OK, renamed.

Thanks

comment:9 in reply to: ↑ 8 Changed 8 years ago by jinmei

Replying to vorner:

  • maybe rename BoB.processes to BoB.components?

I added comment about that, I believe processes is better, since there isn't one-to-one mapping between components and processes.

I see, but then doesn't this contradict the idea?

                component = self.processes.pop(pid)

or

            components_to_stop = list(self.processes.values())
            for component in components_to_stop:

Actually, these code fragments made me make this comment.

Actually, each item in the dict corresponds to one process. But the values are componenents, as each process belongs to one.

I (now) know that, but that's not my point. The point is that such
code seems to indicate confusion on the relationship between processes
and components.

For example, consider the case where a component has multiple
processes. Then component.kill() will be called multiple times here:

        for component in components_to_stop:
            logger.info(BIND10_SEND_SIGTERM, component.name(), component.pid())
            try:
                component.kill()

(It's also awkward the log message talks about a particular process
PID, while kill() is a component-wide operation...hmm, and I'd wonder
what component.pid() means if the component has multiple processes).

Calling kill() on a component may or may not be safe depending on the
underlying implementation, but it's not clear from the documentation,
and, frankly, this code doesn't seem to take into account that
possibility but simply confuse components with processes.

Likewise, this code doesn't seem to be well defined to me:

                component = self.processes.pop(pid)
                logger.info(BIND10_PROCESS_ENDED, component.name(), pid,
                            exit_status)
                if component.running() and self.runnable:
                    component.failed(exit_status);

If the component has multiple processes, what happens here is that one
of the processes has died. Such a component would internally maintain
a list of processes, and to restart a new one in 'failed', it would
first have to remove the entry from the list for the dead process then
spawn a new one. But, again, since failed() is component-wide method,
the component cannot even know which process has died.

All these things seem to suggest that the current abstraction is
incomplete for the multi-process component model. IMO, we should
either:

  • make it well defined throughout the code, or
  • drop the incomplete model for the moment and assume one-to-one mapping between processes and component, at least in the boss implementation. (in this case it will be more intuitive to me if we rename BoB.processes to BoB.components).

comment:10 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:11 follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

OK, it is true it's incomplete in some ways and not tested at all. So I renamed it to components and made a note it might change.

comment:12 in reply to: ↑ 11 Changed 8 years ago by jinmei

Replying to vorner:

Hello

OK, it is true it's incomplete in some ways and not tested at all. So I renamed it to components and made a note it might change.

Looks okay (there seem to be other points in comments and naming
where we might want to clarify the relationship 'process' and
'comments', but I guess it's better not to tweak the code further for
this level of cleanups for now).

comment:13 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:14 Changed 8 years ago by jelte

  • Milestone changed from Sprint-20111108 to Sprint-20111122

comment:15 Changed 8 years ago by vorner

  • Resolution set to complete
  • Status changed from reviewing to closed

Merged back to #213, closing this ticket.

Note: See TracTickets for help on using tickets.