Opened 7 years ago

Closed 6 years ago

#2400 closed defect (wontfix)

notify auth::DataSrcClientsMgr when the builder thread dies

Reported by: jinmei Owned by:
Priority: medium Milestone: DNS Outstanding Tasks
Component: b10-auth Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 4 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

Currently, there's no effective way to notify the main thread of
b10-auth that runs the DataSrcClientsMgr object when the separate
thread running DataSrcClientsBuilder object dies due to an
unexpected exception. This is not good: the auth server keeps running
with older zone data without noticing it, and the communication queue
is growing.

So the current implementation takes a harsh way: the builder thread
terminates the entire process by assert(false). This is still
suboptimal.

We should provide a way to notify the main thread when the builder
thread needs to terminate itself due to an unexpected exception.
My current idea is something like this:

  • introduce another shared variable (of boolean), say, builder_ok_. it's a member of the manager, set to true initially, and its pointer will be passed to the builder.
  • on unexpected termination, the builder sets it to false. (this could be protected by a mutex, but that's not absolutely necessary)
  • the manager checks variable every time it sends a new command to the builder or allow the application to access the client lists via the holder. If the value has been changed to false, it takes an appropriate action (in practice, it would throw a fatal exception to terminate the process anyway).

Subtickets

Change History (5)

comment:1 in reply to: ↑ description ; follow-up: Changed 7 years ago by vorner

Hello

Replying to jinmei:

So the current implementation takes a harsh way: the builder thread
terminates the entire process by assert(false). This is still
suboptimal.

Hmm, I didn't notice yesterday during the review, but is assert(false) the
best way? Asserts can be turned off, this probably should have been abort().
Should we fast-fix this somehow?

  • on unexpected termination, the builder sets it to false. (this could be protected by a mutex, but that's not absolutely necessary)

Is there a reason why this should be safe not to protect it by mutex? If the
reason is that boolean is probably small, well, it doesn't sound like a very
good reason for me. I heard that even the sig_atomic_t is not safe to be used
this way.

  • the manager checks variable every time it sends a new command to the builder or allow the application to access the client lists via the holder. If the value has been changed to false, it takes an appropriate action (in practice, it would throw a fatal exception to terminate the process anyway).

I'm not sure about this. So, on Monday an update to a zone happens and the
thread crashes during the time. But the server happily runs with the old
version of zone, because the server didn't exchange it. Than the next update
happens on for example Friday and then the server unexpectedly crashes during
an unrelated update, so the admin starts looking into what is wrong with the
Friday update.

What I think could work is send the main thread a message over the msgq, for
example. Or signal it (though that's tricky). Or somehow notify it right away.

comment:2 in reply to: ↑ 1 ; follow-up: Changed 7 years ago by jinmei

Replying to vorner:

This point first:

  • the manager checks variable every time it sends a new command to the builder or allow the application to access the client lists via the holder. If the value has been changed to false, it takes an appropriate action (in practice, it would throw a fatal exception to terminate the process anyway).

I'm not sure about this. So, on Monday an update to a zone happens and the
thread crashes during the time. But the server happily runs with the old
version of zone, because the server didn't exchange it. Than the next update
happens on for example Friday and then the server unexpectedly crashes during
an unrelated update, so the admin starts looking into what is wrong with the
Friday update.

Yeah, I know this is suboptimal in terms of the notification timing.
But in case you didn't notice it, note the "(every time) allow the
application to access the client lists". So, the main thread will
notice it at the next query handling after the builder thread dies.
It's still possible that there's been no query between Monday and
Friday, so in theory the same worst case scenario could happen. But
in this case the data source is not really used, so it's not really
incorrect in terms of externally visible behavior.

What I think could work is send the main thread a message over the msgq, for
example. Or signal it (though that's tricky). Or somehow notify it right away.

I agree it's better if the main thread can notice failure of the
builder thread without any effective delay, if it can be done easily.
These two ways seem to be much more complicated than the idea I
explained, however, so I'm not so sure if it's definitely better
regarding pros and cons balance.

  • on unexpected termination, the builder sets it to false. (this could be protected by a mutex, but that's not absolutely necessary)

Is there a reason why this should be safe not to protect it by mutex? If the
reason is that boolean is probably small, well, it doesn't sound like a very
good reason for me. I heard that even the sig_atomic_t is not safe to be used
this way.

This is not really safe in its strictest sense. But the race
condition in this case is that the main thread may not immediately
notice the change of the boolean. Since we already made this
compromise as discussed above (as long as we use this approach), it
shouldn't be a big deal whether the main thread handles one more query
than the ideal case before it notices the failure.

BTW, when hinting omitting the lock, I was thinking it wouldn't be
good if we need to acquire one more extra lock in every query handling
only for a basically "should not happen" event. But on second thought
we can probably just use the same lock that currently protects the
access to the client list map.

Hmm, I didn't notice yesterday during the review, but is assert(false) the
best way? Asserts can be turned off, this probably should have been abort().
Should we fast-fix this somehow?

As you know this is done.

comment:3 in reply to: ↑ 2 Changed 7 years ago by vorner

Hello

Replying to jinmei:

Yeah, I know this is suboptimal in terms of the notification timing.
But in case you didn't notice it, note the "(every time) allow the
application to access the client lists". So, the main thread will
notice it at the next query handling after the builder thread dies.
It's still possible that there's been no query between Monday and
Friday, so in theory the same worst case scenario could happen. But
in this case the data source is not really used, so it's not really
incorrect in terms of externally visible behavior.

Ah, sorry, my bad, I wasn't paying attention when reading this. I
misinterpreted it as whenever the list of commands (the queue) is accessed,
which would be whenever the zone (or other zone) would be reloaded again. Your
way makes sense.

This is not really safe in its strictest sense. But the race
condition in this case is that the main thread may not immediately
notice the change of the boolean. Since we already made this
compromise as discussed above (as long as we use this approach), it
shouldn't be a big deal whether the main thread handles one more query
than the ideal case before it notices the failure.

I don't know. I could imagine a system causing some strange failure (eg.
SIGBUS) when one thread reads some memory location while the other one writes.
Also, I don't know how long period it might happen to be without the memory
barrier which is implicit in the mutex (eg. „flush“ on the caches). Maybe I'm
just too paranoid there, but I consider this area as „Here be dragons and other
beasts of undefined behaviour“.

BTW, when hinting omitting the lock, I was thinking it wouldn't be
good if we need to acquire one more extra lock in every query handling
only for a basically "should not happen" event. But on second thought
we can probably just use the same lock that currently protects the
access to the client list map.

+1 to the same lock.

Hmm, I didn't notice yesterday during the review, but is assert(false) the
best way? Asserts can be turned off, this probably should have been abort().
Should we fast-fix this somehow?

As you know this is done.

Thanks for that.

comment:4 Changed 6 years ago by stephen

  • Milestone set to DNS Outstanding Tasks

comment:5 Changed 6 years ago by tomek

  • Resolution set to wontfix
  • Status changed from new to closed

DNS and BIND10 framework is outside of scope for Kea project.
The corresponding code has been removed from Kea git repository.
If you want to follow up on DNS or former BIND10 issues, see
http://bundy-dns.de project.

Closing ticket.

Note: See TracTickets for help on using tickets.