Opened 8 years ago

Closed 8 years ago

Last modified 8 years ago

#1596 closed task (complete)

Addressing of multiple instances of the same component

Reported by: vorner Owned by: vorner
Priority: medium Milestone: Sprint-20120221
Component: Unclassified Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DNS Feature Depending on Ticket: Multiple auths
Estimated Difficulty: 5 Add Hours to Ticket: 0
Total Hours: 13.26 Internal?: no

Description

To run multiple auth servers, we need to be able addressing them individually (at last for the shutdown command). Currently if single one of them should be stopped, all of them stop.

A short-term solution could be:

  • Pass an --id argument to each started process.
  • Add the same id parameter to the shutdown command, which will be filled in by boss when it sends the command.
  • If the shutdown command contains the parameter and it is different from the passed --id parameter, ignore the command.

We might want a better long-term solution, but this one would work for a minimal multi-core support by starting multiple auth servers.

Subtickets

Attachments (1)

b10-config.db (606 bytes) - added by jinmei 8 years ago.

Download all attachments as: .zip

Change History (19)

comment:1 Changed 8 years ago by jelte

  • Estimated Difficulty changed from 0 to 5

comment:2 Changed 8 years ago by jelte

  • Milestone changed from Next-Sprint-Proposed to Sprint-20120207

comment:3 Changed 8 years ago by vorner

  • Owner set to vorner
  • Status changed from new to accepted

comment:4 Changed 8 years ago by vorner

  • Owner changed from vorner to UnAssigned
  • Status changed from accepted to reviewing

Hello

It should be ready for review. I send the pid of process to stop together with the command and the recipient can check it. Currently only authoritative server and resolver do, the others probably don't need to be run in multiple instances. I propose to introduce the checking into others as follow-up task(s). I also added some documentation.

I'd propose this changelog:

[func]		vorner
It possible to start authoritative server or resolver in multiple
instances, to use more than one CPU. Configuration is described in
the guide.

comment:5 Changed 8 years ago by jinmei

  • Owner changed from UnAssigned to jinmei

comment:6 follow-up: Changed 8 years ago by jinmei

First off, I made a few trivial changes to the branch.

higher level points

  • When I tried to add an additional b10-auth process (from the source env):
    > config add Boss/components b10-auth-2
    > config set Boss/components/b10-auth-2/special auth
    > config set Boss/components/b10-auth-2/kind needed
    > config commit
    
    with the attached config (it uses non privileged port), two processes in fact ran, but only one of them seemed to listen on:
    % lsof -n | grep b10-auth|grep IPv4    
    b10-auth  97674 jinmei   11u    IPv4 0x0fc42788        0t0      TCP 127.0.0.1:hacl-hb (LISTEN)
    b10-auth  97674 jinmei   12u    IPv4 0x0ea86294        0t0      UDP 127.0.0.1:hacl-hb
    
    I also saw this error in logs:
    2012-02-08 17:20:18.508 INFO  [b10-boss.boss] BIND10_SOCKET_GET requesting socket [::]:53 of type TCP from the creator
    2012-02-08 17:20:18.509 ERROR [b10-boss.boss] BIND10_SOCKET_ERROR error on bind call in the creator: 13/Permission denied
    2012-02-08 17:20:18.509 ERROR [b10-auth.server_common] SRVCOMM_ADDRESS_FAIL failed to listen on addresses ("Error creating socket on bind")
    
    I stopped it once and restarted it, but the situation was the same. If restarted it as a root, one of b10-auth listens on the privileged port:
    % sudo lsof -n | grep b10-auth|grep IPv4
    b10-auth  97589           root   16u     IPv4 0x0fc41748        0t0      TCP *:domain (LISTEN)
    b10-auth  97589           root   17u     IPv4 0x0ea86294        0t0      UDP *:domain
    b10-auth  97605           root   11u     IPv4 0x0f4adad8        0t0      TCP 127.0.0.1:hacl-hb (LISTEN)
    b10-auth  97605           root   12u     IPv4 0x0d669658        0t0      UDP 127.0.0.1:hacl-hb
    
  • It'd be convenient if I could differentiate different instances of the same type of process in logs.
  • I'm now feeling I'm confused. What was the expected relationship between components and processes? I thought a single component can consist of multiple (same type of) processes, but the updates to bind10 and component.py seem to assume it's one-to-one relationship. Comments and documentation also seem to be not very consistent. bind10_src.py says:
            # Some time in future, it may happen that a single component has
            # multple processes. If so happens, name "components" may be
            # inapropriate. But as the code isn't probably completely ready
            # for it, we leave it at components for now.
            self.components = {}
    
    which seems to support the one-to-multi design. On the other hand, component.py says:
    This framework allows for a single process to be started multiple times (by
    specifying multiple components with the same configuration). However, the rest
    of the system might not handle such situation well, so until it is made so,
    it would be better to start each process at most once.
    
    which seems to support the one-to-one mapping.
  • As for the testability of resolver, I see it's difficult with the current implementation architecture. For a longer term, I personally think we should unify command/config submodule for auth and resolver and share as much code as possible. But in any case it's beyond the scope of this task.
  • A related note: since this feature involves multiple processes, I think we need system tests. Then we can also test the resolver at least at the higher level. It could be okay to defer it to a separate task, but I think it should be done pretty soon (e.g., in the next sprint).

auth_srv.h

Just checking: why did you remove this? Because it's trivial?

-    /// \return the value of the counter.
+

auth/command.cc

  • Is there a rationale for the ordering between <string> and boost headers? Also, if the basic rule is "if header A (possibly) depends on header B, include A first", dns/ headers should be included much later. Same for exceptions.h. Same for other files.
  • what if args is bogus (e.g. pid is not int, perhaps there can be other bogus cases)? it should be tested. same for resolver, and I suspect it can be more devastating because exceptions won't be caught at a lower level.
  • maybe we should log the shutdown event? Especially if the wrong pid is not expected to be delivered (I was not sure about that), that would be worth noting.
  • ShutdownCommand::exec: to me this would look more natural when it fits in a single line:
            if (args && args->getType() == isc::data::Element::map &&
                args->contains("pid")) {
    
    but if you had a particular preference with that style I don't object.
  • ShutdownCommand::exec: I'd rather save a few lines by not introducing pid and my_pid. again, it may be a matter of preference.

command_unittest.cc

  • If we add a prefix of _ to itimer (which is probably a good practice - I was confused when I saw 'param' in a test), maybe we should be consistent.
  • AuthCommandTest::shutdownCorrectPID: another matter of taste, but I don't need the temporary 'param' if we make the member 'param' variable mutable (normal ElementPtr?) and reset it.

component_test.py

  • Is it okay to only check if pid is None to be sure if the PID is really passed in the non-test environment?

bind10-guide

  • do we need to regenerate and merge HTML in individual branch?
  • I suspect there's no SQLite3 locking issue due to using multiple auth instances (at least for now auth only uses it for read-only, so multiple auth instances shouldn't content each other). But the guide says "could be", which is basically always correct by definition, I don't oppose to that if you really want to mention that possibility.

changelog

Maybe "core" is more commonly used than CPU today? (even though
"multi-core" has also limited meaning). Not a strong opinion anyway.

Changed 8 years ago by jinmei

comment:7 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:8 in reply to: ↑ 6 ; follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

  • When I tried to add an additional b10-auth process (from the source env):

[...]

with the attached config (it uses non privileged port), two
processes in fact ran, but only one of them seemed to listen on:

That is to be expected. This is solved in #1595, which is already merged in master, but not in this branch yet. If you merge with master locally, you should see that it works (or you could trust me I tried it O:-).

  • It'd be convenient if I could differentiate different instances of the same type of process in logs.

Hmm, I guess it could be done somehow. But I think it would be a different ticket.

  • I'm now feeling I'm confused. What was the expected relationship between components and processes? I thought a single component can consist of multiple (same type of) processes, but the updates to bind10 and component.py seem to assume it's one-to-one relationship. Comments and documentation also seem to be not very consistent. bind10_src.py says:
            # Some time in future, it may happen that a single component has
            # multple processes. If so happens, name "components" may be
            # inapropriate. But as the code isn't probably completely ready
            # for it, we leave it at components for now.
            self.components = {}
    
    which seems to support the one-to-multi design. On the other hand, component.py says:
    This framework allows for a single process to be started multiple times (by
    specifying multiple components with the same configuration). However, the rest
    of the system might not handle such situation well, so until it is made so,
    it would be better to start each process at most once.
    
    which seems to support the one-to-one mapping.

Well, when I wrote the first comment, I thought about a component that is a pipeline, for example, so there would be two different processes that do the one thing. This component could still be started multiple times by putting it multiple times into the configuration.

Anyway, I'd like to be able to avoid specifying the same component multiple times, I'd rather have a count = something configuration element in each of them. If the boss then generates multiple component objects or just one is question for the future implementation, though. And I'd like to try this multiple processes thing out first a little before investing too much time into fine-tuning the configuration of it.

  • As for the testability of resolver, I see it's difficult with the current implementation architecture. For a longer term, I personally think we should unify command/config submodule for auth and resolver and share as much code as possible. But in any case it's beyond the scope of this task.

ACK.

  • A related note: since this feature involves multiple processes, I think we need system tests. Then we can also test the resolver at least at the higher level. It could be okay to defer it to a separate task, but I think it should be done pretty soon (e.g., in the next sprint).

Hmm, I'd like to merge it to master first (just to make sure both of them listen on the port). It could be quite easy to do the tests then, like starting two copies, sending a query, stopping one of them, sending a query, starting it again, stopping the first one (to verify the sharing of sockets), send the query again, then shout them both down and see that a query times out. So, I'd create a new ticket and put it to next-splint-proposed.

auth_srv.h

Just checking: why did you remove this? Because it's trivial?

-    /// \return the value of the counter.
+

I don't remember removing it, so I probably hit some random keys in vi when looking into the file.

auth/command.cc

  • Is there a rationale for the ordering between <string> and boost headers? Also, if the basic rule is "if header A (possibly) depends on header B, include A first", dns/ headers should be included much later. Same for exceptions.h. Same for other files.

I don't think there's a rationale behind this. I just reordered them to have local headers first, then the project headers and then system headers/library headers. I don't usually think about it to that much detail, just about the three groups (with the same-named header going first in the local group). I reordered them little bit by some kind of gut feeling, I think it should be enough (the other project headers would go first in their directory, so they'd be tested there).

  • what if args is bogus (e.g. pid is not int, perhaps there can be other bogus cases)? it should be tested. same for resolver, and I suspect it can be more devastating because exceptions won't be caught at a lower level.

Ah, you're right it could be something else than int. I check for that now, but I don't see any other bogus input (negative numbers would be handled implicitly as not equal, etc).

  • maybe we should log the shutdown event? Especially if the wrong pid is not expected to be delivered (I was not sure about that), that would be worth noting.

It is expected to come, see (newly added) comment.

On a slightly related note, I think we will need to do some redesign near the configuration, commands and communication and how they should work when there's more than one instance of a component. This way it is more like a hackish way to make it work than a proper way.

  • ShutdownCommand::exec: I'd rather save a few lines by not introducing pid and my_pid. again, it may be a matter of preference.

I find this slightly more readable, so I'd like to keep it this way.

  • AuthCommandTest::shutdownCorrectPID: another matter of taste, but I don't need the temporary 'param' if we make the member 'param' variable mutable (normal ElementPtr?) and reset it.

I don't know, I don't think that one needs to be mutable.

  • do we need to regenerate and merge HTML in individual branch?

Shouldn't I regenerate it when I change it? (it's always a problem when we have a generated file in the repository). Anyway, I wanted to the output, so I needed to regenerate it. Should I revert the html back?

  • I suspect there's no SQLite3 locking issue due to using multiple auth instances (at least for now auth only uses it for read-only, so multiple auth instances shouldn't content each other). But the guide says "could be", which is basically always correct by definition, I don't oppose to that if you really want to mention that possibility.

I'd like to keep it there. There were several surprises from SQLite for us about logging already, so I expect the worst there.

changelog

Maybe "core" is more commonly used than CPU today? (even though
"multi-core" has also limited meaning). Not a strong opinion anyway.

OK, I'll use core when I merge it (and I hope I won't forget).

comment:9 in reply to: ↑ 8 ; follow-up: Changed 8 years ago by jinmei

Replying to vorner:

Hello

Replying to jinmei:

  • When I tried to add an additional b10-auth process (from the source env):

[...]

with the attached config (it uses non privileged port), two
processes in fact ran, but only one of them seemed to listen on:

That is to be expected. This is solved in #1595, which is already merged in master, but not in this branch yet. If you merge with master locally, you should see that it works (or you could trust me I tried it O:-).

Ah, okay, I trust you:-) but I wish you could have noted it on asking
the review.

  • It'd be convenient if I could differentiate different instances of the same type of process in logs.

Hmm, I guess it could be done somehow. But I think it would be a different ticket.

That's fine.

  • I'm now feeling I'm confused. What was the expected relationship between components and processes? [...]

Well, when I wrote the first comment, I thought about a component that is a pipeline, for example, so there would be two different processes that do the one thing. This component could still be started multiple times by putting it multiple times into the configuration.

Anyway, I'd like to be able to avoid specifying the same component multiple times, I'd rather have a count = something configuration element in each of them. If the boss then generates multiple component objects or just one is question for the future implementation, though. And I'd like to try this multiple processes thing out first a little before investing too much time into fine-tuning the configuration of it.

That's fine. But it would be helpful if the comment clarifies how a
multi-process component would work (in future).

  • A related note: since this feature involves multiple processes, I think we need system tests. [...]

Hmm, I'd like to merge it to master first (just to make sure both of them listen on the port). It could be quite easy to do the tests then, like starting two copies, sending a query, stopping one of them, sending a query, starting it again, stopping the first one (to verify the sharing of sockets), send the query again, then shout them both down and see that a query times out. So, I'd create a new ticket and put it to next-splint-proposed.

That's fine for me.

  • what if args is bogus (e.g. pid is not int, perhaps there can be other bogus cases)? it should be tested. same for resolver, and I suspect it can be more devastating because exceptions won't be caught at a lower level.

Ah, you're right it could be something else than int. I check for that now, but I don't see any other bogus input (negative numbers would be handled implicitly as not equal, etc).

See below.

  • do we need to regenerate and merge HTML in individual branch?

Shouldn't I regenerate it when I change it? (it's always a problem when we have a generated file in the repository). Anyway, I wanted to the output, so I needed to regenerate it. Should I revert the html back?

As someone who reviews the branch, it's more convenient if the diff
only contains essential changes. It would also help reduce conflicts
on merge if different developers modify the guide in different
branches. I think this is something we should discuss at the project
level (a candidate biweekly call item?). For this particular branch
I'd leave it to you whether to revert it.

Comments on the revised branch follow:

  • auth_log.h: not a big deal (and not really specific to this ticket), but I wonder why we define local log levels like this:
    // Debug messages upon shutdown
    const int DBG_AUTH_SHUT = DBGLVL_START_SHUT;
    
    especially now that we have use this value for two purposes. Why can't we simply use DBGLVL_START_SHUT for both cases?
  • AuthCommand::exec: in general, I'd rather prefer rejecting bogus inputs (either as a result of exception or by returning a specific error) than ignoring them. In the case of auth, it should actually have been so because Element class would throw an exception (and it will be caught in execAuthServerCommand(), logged, and translated into an error answer to the command issuer). Same comment applies to the resolver, although in that case we (probably) cannot simply throw an exception because it wouldn't be caught in a proper context.
  • in shutdownNotInt test: this comment isn't correct (naive copy-paste):
        // With the correct PID, it should act exactly the same as in case
        // of no parameter
    
    but see the previous comment. Depending on how we do that the test may need to be adjusted too.
  • resover.c::my_command_handler(): it repeats a quite large comment that commented in auth regarding the 'not for us' note. I'd avoid that and just refer from one of them to the other. (for a longer term we should solve this by unifying the framework itself, but that's another story).
  • auth command_unittest: not a related point to this branch any more, but for consistency we might add the _ postfix to other member variables of AuthCommandTest??

comment:10 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:11 in reply to: ↑ 9 ; follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

That is to be expected. This is solved in #1595, which is already merged in master, but not in this branch yet. If you merge with master locally, you should see that it works (or you could trust me I tried it O:-).

Ah, okay, I trust you:-) but I wish you could have noted it on asking
the review.

Sorry, I didn't think about it, it seemed unrelated to the ticket.

Shouldn't I regenerate it when I change it? (it's always a problem when we have a generated file in the repository). Anyway, I wanted to the output, so I needed to regenerate it. Should I revert the html back?

As someone who reviews the branch, it's more convenient if the diff
only contains essential changes. It would also help reduce conflicts
on merge if different developers modify the guide in different
branches. I think this is something we should discuss at the project
level (a candidate biweekly call item?). For this particular branch
I'd leave it to you whether to revert it.

OK, I'll try not to commit it next time because of the review, but I think it makes no sense to revert it now (and produce another large commit without any essential changes). The problem with conflicts is small one and easy to address, another regeneration solves it.

  • auth_log.h: not a big deal (and not really specific to this ticket), but I wonder why we define local log levels like this:
    // Debug messages upon shutdown
    const int DBG_AUTH_SHUT = DBGLVL_START_SHUT;
    
    especially now that we have use this value for two purposes. Why can't we simply use DBGLVL_START_SHUT for both cases?

I think for historical reasons, we didn't have the DBGLVL_START_SHUT at the time when most of the logging was written and when they appeared, the local constants were just updated. I didn't want to break the consistency with them and I don't want to go through the whole auth and modify them there, the cleanup change would be too big to piggy-back it I think. Maybe another ticket (the tickets seem to grow exponentially this way, every ticket has multiple follow-up tickets).

  • AuthCommand::exec: in general, I'd rather prefer rejecting bogus inputs (either as a result of exception or by returning a specific error) than ignoring them. In the case of auth, it should actually have been so because Element class would throw an exception (and it will be caught in execAuthServerCommand(), logged, and translated into an error answer to the command issuer). Same comment applies to the resolver, although in that case we (probably) cannot simply throw an exception because it wouldn't be caught in a proper context.

I changed it to throw the exception (and added the catch for the resolver, it can't hurt).

Thank you

comment:12 in reply to: ↑ 11 ; follow-up: Changed 8 years ago by jinmei

Replying to vorner:

  • auth_log.h: not a big deal (and not really specific to this ticket), but I wonder why we define local log levels like this:
    // Debug messages upon shutdown
    const int DBG_AUTH_SHUT = DBGLVL_START_SHUT;
    
    especially now that we have use this value for two purposes. Why can't we simply use DBGLVL_START_SHUT for both cases?

I think for historical reasons, we didn't have the DBGLVL_START_SHUT at the time when most of the logging was written and when they appeared, the local constants were just updated. I didn't want to break the consistency with them and I don't want to go through the whole auth and modify them there, the cleanup change would be too big to piggy-back it I think. Maybe another ticket (the tickets seem to grow exponentially this way, every ticket has multiple follow-up tickets).

Not updating this branch for this is okay. The ticket bloat issue is
true, but I can't think of a better way to make a good balance between
ensuring quality and reasonably rapid integration. In many cases,
however, cleanup tickets are trivial and much easier than the original
one.

  • AuthCommand::exec: in general, I'd rather prefer rejecting bogus inputs (either as a result of exception or by returning a specific error) than ignoring them. In the case of auth, it should actually have been so because Element class would throw an exception (and it will be caught in execAuthServerCommand(), logged, and translated into an error answer to the command issuer). Same comment applies to the resolver, although in that case we (probably) cannot simply throw an exception because it wouldn't be caught in a proper context.

I changed it to throw the exception (and added the catch for the resolver, it can't hurt).

Okay.

The branch now looks basically okay. I've noted a few more minor
things, but I'd leave it to you whether/how to address them. After
addressing them (or not) please feel free to merge.

  • bind10_src: in my often incorrect understanding of the English grammar, an 'if' clause cannot be the subject of a sentence. So I'd change 'If it turns...or if we' to "Whether it turns...or whether we":
           # want to support multiple instances of a single component. If it turns
           # out that we'll have a single component with multiple same processes  
           # or if we start multiple components with the same configuration (we do
           # this now, but it might change) is an open question.
    
  • in auth command test, I'd initialize expect_rcode_ in the constructor based on the spirit of initializing everything as soon as possible.
  • Now we catch exceptions from the CC/element module at a reasonable level, we could even omit this check, which would make the main code a bit more concise:
    args->getType() == isc::data::Element::map &&
    

comment:13 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:14 in reply to: ↑ 12 ; follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

The branch now looks basically okay. I've noted a few more minor
things, but I'd leave it to you whether/how to address them. After
addressing them (or not) please feel free to merge.

  • bind10_src: in my often incorrect understanding of the English grammar, an 'if' clause cannot be the subject of a sentence. So I'd change 'If it turns...or if we' to "Whether it turns...or whether we":
           # want to support multiple instances of a single component. If it turns
           # out that we'll have a single component with multiple same processes  
           # or if we start multiple components with the same configuration (we do
           # this now, but it might change) is an open question.
    

Well, the subject isn't 'if', but 'it'. Besides, whether is the same grammar kind of word as if.

Anyway, when merging, I found a problem which needed a bigger change ‒ the stats module stopped shutting down and it turned out I forgot to update all the spec files to expect the pid argument. Do you want to have a look at that change as well?

Also, there's one change, the Makefile in dns/tests/testdata talked about wrong file (I looked at the commit when it was introduced), so I fixed it (strange it wasn't caught by the build bots).

Thank you

comment:15 in reply to: ↑ 14 Changed 8 years ago by jinmei

Replying to vorner:

  • bind10_src: in my often incorrect understanding of the English grammar, an 'if' clause cannot be the subject of a sentence. So I'd change 'If it turns...or if we' to "Whether it turns...or whether we":
           # want to support multiple instances of a single component. If it turns
           # out that we'll have a single component with multiple same processes  
           # or if we start multiple components with the same configuration (we do
           # this now, but it might change) is an open question.
    

Well, the subject isn't 'if', but 'it'.

I didn't mean that (sub) sentence (note the "clause"). What I meant
is that the clause "If it turns out that we'll have a single component
with multiple same processes (or if... configuration)" is used as the
subject of the sentence of "<clause> is an open question".

And,

Besides, whether is the same grammar kind of word as if.

They are the same "kind", but in my possibly incorrect understanding
of the grammar, they are not always interchangeable. Whether the
corresponding clause can be a sentence subject is, again, in my
understanding, one exception.

An English grammar textbook I just happen to have now shows this
example:

"Whether there are any survivors is still uncertain."
(noting that this "Whether" cannot be replaced with "If").

while in this case you can use either whether or if:
"I asked her whether/if she had done that sort of work before".

But anyway, I don't claim I'm an English grammar expert, and at the
very least I don't think the original text could be misunderstood even
if it's grammatically incorrect. So if you believe it's correct I
don't oppose to keeping it.

Anyway, when merging, I found a problem which needed a bigger change ‒ the stats module stopped shutting down and it turned out I forgot to update all the spec files to expect the pid argument. Do you want to have a look at that change as well?

That looks okay, but is it okay to not change cmdctl?

Also, there's one change, the Makefile in dns/tests/testdata talked about wrong file (I looked at the commit when it was introduced), so I fixed it (strange it wasn't caught by the build bots).

Ack, thanks.

comment:16 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:17 Changed 8 years ago by vorner

  • Resolution set to complete
  • Status changed from reviewing to closed
  • Total Hours changed from 0 to 9.76

Thank you, closing

comment:18 Changed 8 years ago by jinmei

  • Total Hours changed from 9.76 to 13.26
Note: See TracTickets for help on using tickets.