Opened 7 years ago

Closed 7 years ago

#2198 closed defect (fixed)

make sure InterprocessSyncLocker is thread safe

Reported by: jinmei Owned by: muks
Priority: medium Milestone: Sprint-20130423
Component: logging Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: Core Feature Depending on Ticket:
Estimated Difficulty: 5 Add Hours to Ticket: 0
Total Hours: 8.08 Internal?: no

Description

If we use a separate thread for b10-auth for background loading
both main thread and loader thread will share the datasrc logger.
We need to confirm the InterprocessSyncLocker also works for multi
threads.

Ideally we should have some explicit tests for this. If it's deemed
to be too difficult, we should at least confirm that by some
experiments on a couple of different platforms.

Subtickets

Attachments (4)

0001-master-Disable-use-of-InterprocessSync-in-Logger-as-.patch (2.5 KB) - added by muks 7 years ago.
Patch to disable InterprocessSync? usage in Logger until we fix the threads issue
experiments.patch (4.0 KB) - added by vorner 7 years ago.
Experiments with logging
lettuce.trace.2361 (3.1 MB) - added by vorner 7 years ago.
CMDCTL strace
lettuce.trace.2318 (2.5 MB) - added by vorner 7 years ago.
MSGQ strace

Change History (53)

comment:1 Changed 7 years ago by muks

For the file based locker, I think multiple threads in the same process will not wait for a lock, i.e., the range lock is a per-process lock and recursive locking will be granted. Please check.

comment:2 Changed 7 years ago by jelte

  • Milestone changed from Next-Sprint-Proposed to Sprint-20120918

comment:3 Changed 7 years ago by muks

  • Owner set to muks
  • Status changed from new to assigned

Picking

comment:4 Changed 7 years ago by muks

  • Owner changed from muks to UnAssigned
  • Status changed from assigned to reviewing

Up for review.

comment:5 Changed 7 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:6 follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to muks

Hello

I have several concerns about the branch.

First, the branch #2202 introduces a wrapped lock (and a thread, but it shouldn't be needed here). I think we should not be dealing directly with pthreads through the code, both because of exception safety and error handling and because of portability. Is there any reason why you didn't use the lock from there?

Furthermore, the code seems to be doing no error checking whatsoever. Most of the pthread functions can return error codes and they should be handled. Also, the new or insertion to the map can throw. If that happens, the mutex stays locked, causing a deadlock the next time anyone tries to acquire it.

Also, the lock, tryLock and unlock perform find on the map without locking the lock. That is problematic ‒ what if someone modifies it while they perform the find?

Is there really need for the complicated thing with the map? Will there ever be multiple locks with the same task name in the same process?

And, last, how often is the lock on the thing called? Are there some performance implications?

comment:7 in reply to: ↑ 6 ; follow-up: Changed 7 years ago by muks

  • Owner changed from muks to vorner

Hi Michal

Replying to vorner:

Hello

I have several concerns about the branch.

First, the branch #2202 introduces a wrapped lock (and a thread, but it shouldn't be needed here). I think we should not be dealing directly with pthreads through the code, both because of exception safety and error handling and because of portability. Is there any reason why you didn't use the lock from there?

I didn't know that there's a new lock class introduced in the #2202 branch. :) That is the first answer for why I didn't use that lock. For the RAII style lock, #2202's locker seems relevant. Ignoring that the branch is still in review, we still can't use this implementation here as the interface provided by that lock class is insufficient. It's not possible to unlock the mutex before its destruction (i.e., before that basic block is exited). I have not read more into #2202.

Use of pthreads in this file should be fine as it is a lock class implementation. We use other POSIX specific bits in this class such as fcntl(). The locker in #2202 also uses pthreads.

Furthermore, the code seems to be doing no error checking whatsoever. Most of the pthread functions can return error codes and they should be handled. Also, the new or insertion to the map can throw. If that happens, the mutex stays locked, causing a deadlock the next time anyone tries to acquire it.

I agree with the use of a RAII-style lock. A simple wrapper around a pthread mutex has been introduced. All pthread calls are now checked. Even then, I don't suppose our code can catch such exceptions and continue in case the map throws in this codepath.

Also, the lock, tryLock and unlock perform find on the map without locking the lock. That is problematic ‒ what if someone modifies it while they perform the find?

Good catch. :) The find() is now protected by a lock.

Is there really need for the complicated thing with the map? Will there ever be multiple locks with the same task name in the same process?

This is what InterprocessSyncFile's interface for the file locking provides (among multiple processes anyway) and we keep that semantic. It's not unlikely for multiple logging threads to exist, all of which use the "logger" name.

And, last, how often is the lock on the thing called? Are there some performance implications?

Our typical use case is the logger. Logging calls happen once every few hundred ms on average, so it doesn't have a big impact. The existing (thread-unsafe) code uses file range locks which are slower.

comment:8 in reply to: ↑ 7 Changed 7 years ago by vorner

  • Owner changed from vorner to muks

Replying to muks:

First, the branch #2202 introduces a wrapped lock (and a thread, but it shouldn't be needed here). I think we should not be dealing directly with pthreads through the code, both because of exception safety and error handling and because of portability. Is there any reason why you didn't use the lock from there?

I didn't know that there's a new lock class introduced in the #2202 branch. :) That is the first answer for why I didn't use that lock. For the RAII style lock, #2202's locker seems relevant. Ignoring that the branch is still in review, we still can't use this implementation here as the interface provided by that lock class is insufficient. It's not possible to unlock the mutex before its destruction (i.e., before that basic block is exited). I have not read more into #2202.

Looking at it again, I think the SyncMapMutex? is more or less the same as the
Locker from #2202. I don't think we want to have two implementations of the
same thing, we have more than enough code already. Could it be reused at least
for this mutex?

For the rest, well, the interface of #2202 mutex is minimalistic, since I
didn't need more. But it could indeed be extended. From a longer perspective, I
think it is better to have one complete interface than two half-complete.

Furthermore, the code seems to be doing no error checking whatsoever. Most of the pthread functions can return error codes and they should be handled. Also, the new or insertion to the map can throw. If that happens, the mutex stays locked, causing a deadlock the next time anyone tries to acquire it.

I agree with the use of a RAII-style lock. A simple wrapper around a pthread mutex has been introduced. All pthread calls are now checked. Even then, I don't suppose our code can catch such exceptions and continue in case the map throws in this codepath.

Well, it may or may not crash on these. It should be at least possible to do
some kind of graceful shutdown or recovery by dropping the current query and
trying again. Our policy says the classes should not get to invalid state on
exception.

Which brings me to these two parts of code:

data = new SyncMapData;
sync_map.erase(it);

int ret = pthread_mutex_destroy(&data->second);
if (ret != 0) {
    isc_throw(isc::InvalidOperation,
              "Error destroying InterprocessSyncFile mutex: "
              << strerror(ret));
}

delete data;

In both cases, it can leak if the pthread function returns error. The second
one could be solved by deleting data even before the erase. The first one could
be done with something like auto_ptr.

However, these are all patches for fragile design I think. I'd like to propose
an alternative approach:

  • Use the mutex from #2202. That one tries to be very paranoid about errors and be exception safe. Also, it reuses current code and makes future porting to windows easier.
  • Have one global mutex for the map. This can use the locker on stack in the usual way. You unlock it sooner than you exit the method sometimes, but I don't think it makes a big difference on how long it is held and if it did, there can be an inner block inside the method to invoke the destructor sooner.
  • Each task has a mutex (wrapped one, from #2202) allocated dynamically. The InterprocessSyncFile holds a shared pointer to it (so more than one can have the same one) and the map (which is protected by the above mutex) contains a weak_ptr to the mutex. This way unused mutexes get cleaned up automatically.
  • An InterprocessSyncFile contains a scoped_ptr which holds a Locker when the corresponding mutex is locked.

This has the advantage that whenever anything is lost due to exiting a method
or exception or something like that, everything gets cleared and unlocked
automatically. Also, it could become slightly more readable, there'd be less
error checking.

comment:9 Changed 7 years ago by jelte

  • Milestone Sprint-20120918 deleted

comment:10 Changed 7 years ago by jinmei

  • Milestone set to Next-Sprint-Proposed

comment:11 Changed 7 years ago by muks

  • Milestone changed from Next-Sprint-Proposed to Sprint-20121009

Moving to current sprint as we are out of tickets, and because we can revisit it now that #2202 is in master.

comment:12 follow-up: Changed 7 years ago by muks

  • Owner changed from muks to vorner

Up for review.

Mutex::tryLock() has been added (a requirement for our tryLock()) and Mutex::lock() and Mutex::unlock() were made public. Mutex::tryLock() has to be public anyway. The way we use a Mutex::Locker, it doesn't add benefit to dynamically allocate it (so it lives when it runs out of scope) when we can call the same methods on Mutex directly and the lock goes when it is destroyed.

I've added API comments to discourage the use of these methods directly and use the Mutex::Locker where applicable.

comment:13 in reply to: ↑ 12 ; follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to muks

Hello

First of all, it doesn't compile for me on a clean build ‒ the directories need
some reordering. Please, try a clean build and see.

Replying to muks:

Mutex::tryLock() has been added (a requirement for our tryLock()) and Mutex::lock() and Mutex::unlock() were made public. Mutex::tryLock() has to be public anyway. The way we use a Mutex::Locker, it doesn't add benefit to dynamically allocate it (so it lives when it runs out of scope) when we can call the same methods on Mutex directly and the lock goes when it is destroyed.

I'm not really happy about that. With this, we'll see stuff like this to appear in some time:

Locker locker(mutex);
...
...
// Unlock for a short while
mutex.unlock();
doSomething();
mutex.lock();
...
...

This would be very bad if doSomething() threw.

Also, the tryLock ‒ I guess it is needed somewhere and not provided for
completeness? Because I don't see the need for it and it seems to be used in
tests only.

Also, with the dynamic allocation and use of some kind of scoped_ptr, you could get rid of the try-catch block here:

    // First grab the thread lock...
    mutex_->lock();

    // ... then the file lock.
    try {
        if (do_lock(F_SETLKW, F_WRLCK)) {
            is_locked_ = true;
            return (true);
        }
    } catch (...) {
        mutex_->unlock();
        throw;
    }

    mutex_->unlock();
scoped_ptr<Mutex::Locker> locker(new Locker(mutex_));

if (do_lock(...)) {
        is_locked_ = true;
        locker_ = locker.release();
        return (true);
}

return (false);

Also, what is the difference with lock and tryLock? While the use of mutex is slightly different, but the call to do_lock looks exactly the same to me.

This really needs some error checking, EBUSY is not everything the pthread_try_lock can return:

bool
Mutex::tryLock() {
    assert(impl_ != NULL);
    const int result = pthread_mutex_trylock(&impl_->mutex);
    if (result != 0) {
        return (false);
    }
    ++impl_->locked_count; // Only in debug mode
    return (true);
}

What is the point of the exercise with weak pointers, if the code still relies on the fact that everything in the map is a valid pointer?

Also, what is the reason behind this?

-#include "interprocess_sync_file.h"
+#include <util/interprocess_sync_file.h>

comment:14 in reply to: ↑ 13 ; follow-up: Changed 7 years ago by muks

  • Owner changed from muks to vorner

Hi Michal

Replying to vorner:

Hello

First of all, it doesn't compile for me on a clean build ‒ the directories need
some reordering. Please, try a clean build and see.

Thanks for catching this. I've updated the dependencies and order of build. It should build properly now.

Replying to muks:

Mutex::tryLock() has been added (a requirement for our tryLock()) and Mutex::lock() and Mutex::unlock() were made public. Mutex::tryLock() has to be public anyway. The way we use a Mutex::Locker, it doesn't add benefit to dynamically allocate it (so it lives when it runs out of scope) when we can call the same methods on Mutex directly and the lock goes when it is destroyed.

I'm not really happy about that. With this, we'll see stuff like this to appear in some time:

Locker locker(mutex);
...
...
// Unlock for a short while
mutex.unlock();
doSomething();
mutex.lock();
...
...

This would be very bad if doSomething() threw.

If the code were written like this, then true. This is why the API doc comment was added asking users of Mutex to consider using Mutex::Locker where applicable. But where it doesn't fit in naturally, allowing lock() and unlock() methods is not bad.

Also, the tryLock ‒ I guess it is needed somewhere and not provided for
completeness? Because I don't see the need for it and it seems to be used in
tests only.

It's required to test that the other methods work properly for a file lock. Also because this is a interprocess synchronization object (not just for logger), tryLock() is a reasonable operation to provide.

Also, with the dynamic allocation and use of some kind of scoped_ptr, you could get rid of the try-catch block here:

    // First grab the thread lock...
    mutex_->lock();

    // ... then the file lock.
    try {
        if (do_lock(F_SETLKW, F_WRLCK)) {
            is_locked_ = true;
            return (true);
        }
    } catch (...) {
        mutex_->unlock();
        throw;
    }

    mutex_->unlock();
scoped_ptr<Mutex::Locker> locker(new Locker(mutex_));

if (do_lock(...)) {
        is_locked_ = true;
        locker_ = locker.release();
        return (true);
}

return (false);

Wouldn't this involve maintaining another locker_ member variable? Is the current code incorrect or complicated to follow? To me the latter code and maintaining locker_ would be more complicated and unnatural in this particular case.

Also, what is the difference with lock and tryLock? While the use of mutex is slightly different, but the call to do_lock looks exactly the same to me.

F_SETLKW vs. F_SETLK.

This really needs some error checking, EBUSY is not everything the pthread_try_lock can return:

bool
Mutex::tryLock() {
    assert(impl_ != NULL);
    const int result = pthread_mutex_trylock(&impl_->mutex);
    if (result != 0) {
        return (false);
    }
    ++impl_->locked_count; // Only in debug mode
    return (true);
}

tryLock() returns false indicating the lock operation failed. Does it have to throw as well? I'll add it if you think it's valid.

What is the point of the exercise with weak pointers, if the code still relies on the fact that everything in the map is a valid pointer?

As the weak pointer maintains no direct reference, I thought it was more understandable code. Or I am not following the question. :)

Also, what is the reason behind this?

-#include "interprocess_sync_file.h"
+#include <util/interprocess_sync_file.h>

Just for consistency with the rest of the codebase.

comment:15 in reply to: ↑ 14 Changed 7 years ago by vorner

  • Owner changed from vorner to muks

Hello

Replying to muks:

Thanks for catching this. I've updated the dependencies and order of build. It should build properly now.

Thanks.

Replying to muks:

If the code were written like this, then true. This is why the API doc comment was added asking users of Mutex to consider using Mutex::Locker where applicable. But where it doesn't fit in naturally, allowing lock() and unlock() methods is not bad.

Well, using lock() and unlock() in a language that has exceptions is very
risky. I'd very much like to avoid that and disallow that. Jinmei seems to
agree with the approach (I guess from the condvar discussion in #2332). So I'd
hate to allow it because of it is more slightly more convenient in one place.

I may be too scared of threads, but I know it might be very problematic to
debug (yes, I did try writing threaded applications before). So I want to be on
the safe side and disallow any kinds of bugs I can.

Also, the tryLock ‒ I guess it is needed somewhere and not provided for
completeness? Because I don't see the need for it and it seems to be used in
tests only.

It's required to test that the other methods work properly for a file lock. Also because this is a interprocess synchronization object (not just for logger), tryLock() is a reasonable operation to provide.

Well, I have idea of how to do it with a Locker too (throwing exception in case
it isn't possible to lock). But I still don't know if we want such operation at
all, if it is not really used in the real code anywhere.

scoped_ptr<Mutex::Locker> locker(new Locker(mutex_));

if (do_lock(...)) {
        is_locked_ = true;
        locker_ = locker.release();
        return (true);
}

return (false);

Wouldn't this involve maintaining another locker_ member variable? Is the current code incorrect or complicated to follow? To me the latter code and maintaining locker_ would be more complicated and unnatural in this particular case.

Yes. But then, it means we don't let people directly to the lock() and unlock() and get rid of the very awkwardly looking try thing and it would IMO be even safer.

This really needs some error checking, EBUSY is not everything the pthread_try_lock can return:

bool
Mutex::tryLock() {
    assert(impl_ != NULL);
    const int result = pthread_mutex_trylock(&impl_->mutex);
    if (result != 0) {
        return (false);
    }
    ++impl_->locked_count; // Only in debug mode
    return (true);
}

tryLock() returns false indicating the lock operation failed. Does it have to throw as well? I'll add it if you think it's valid.

Well, but you want to distinguish problems like „The mutex does not exist“ and
„The same thread tries to lock it again“ from „The mutex is being locked right
now, try later“. Because the first two are serious programmer errors (the
second could actually lead to application trying again and again and failing
forever, without providing any notification something is wrong), while the last
is normal operation. It doesn't have to throw, assert like the rest would be enough.

What is the point of the exercise with weak pointers, if the code still relies on the fact that everything in the map is a valid pointer?

As the weak pointer maintains no direct reference, I thought it was more understandable code. Or I am not following the question. :)

I think that if we want to use them, we want to be prepared for the situation
when we find an invalid weak pointer in the map, just in case something forgot
to remove it. This way, it would throw (I think).

Also, what is the reason behind this?

-#include "interprocess_sync_file.h"
+#include <util/interprocess_sync_file.h>

Just for consistency with the rest of the codebase.

But the second is less corect. There are quote-includes through the code as well, if they are from the local directory.

Anyway, would it be OK if I took the oportunity and wrote a counter-proposal for the code as I see it?

comment:16 Changed 7 years ago by muks

  • Owner changed from muks to vorner

Hi vorner

To move this bug forward, I've used the Mutex::Locker and made the Mutex operations private again. I've used a shared_ptr instead of a scoped_ptr, and the ref is then retained by the class member.

If you want to rewrite this, please feel free and assign this bug to yourself.

comment:17 Changed 7 years ago by muks

  • Owner changed from vorner to muks

Taking back temporarily as I want to add another test.

comment:18 Changed 7 years ago by muks

  • Owner changed from muks to vorner

Back to review.

comment:19 follow-up: Changed 7 years ago by vorner

  • Total Hours changed from 0 to 3.25

Hello

I noticed just one small detail. The thread body function (testThread) for the wrapper
expects to get a void function (returning nothing). I guess the `boost: is
able to throw away the result, but then, it's useless to return the NULL
needlessly ;-).

Otherwise, I think it is OK to merge.

comment:20 Changed 7 years ago by vorner

  • Owner changed from vorner to muks

comment:21 in reply to: ↑ 19 Changed 7 years ago by muks

Replying to vorner:

Hello

I noticed just one small detail. The thread body function (testThread) for the wrapper
expects to get a void function (returning nothing). I guess the `boost: is
able to throw away the result, but then, it's useless to return the NULL
needlessly ;-).

This has been updated.

comment:22 Changed 7 years ago by muks

  • Resolution set to fixed
  • Status changed from reviewing to closed

Merged to master in commit f8181bb72251fd3d5f33e157d1f2a2a75a58d765:

* 6adfc24 [2198] Don't return anything from thread functions
* 5b78174 [2198] Update comment
* 4ed683c [2198] Test the non-blocking variant of Mutex::Locker
* e6ac5ef [2198] Check pthread_mutex_trylock()'s return value
* 48e80a5 [2198] Revert include to local
* fff8929 [2198] Stop using direct lock methods on Mutex inside InterprocessSyncFile
* ecef170 [2198] Make direct methods on Mutex private
* 350c775 [2198] Add a block argument to Mutex::Locker constructor
* 666bfa9 [2198] Add AlreadyLocked exception to Mutex::Locker
* d042cd0 [2198] Remove dependency on libb10-util in src/lib/util/unittests/
* c0d25c4 [2198] First use a mutex for mutual exclusion among threads, then a lock file
* e71cd13 [2198] Add Mutex::tryLock() and make lock methods public
* 6642455 [2198] Test that locks work from threads
* e562ac7 [2198] Build threads directory first because . depends on it

Resolving as fixed. Thank you for the reviews.

comment:23 Changed 7 years ago by muks

  • Resolution fixed deleted
  • Status changed from closed to reopened

Re-opening bug to handle issues on various platforms. I'll update it tomorrow and put it to review.

comment:24 Changed 7 years ago by muks

These issues occured during the merge to master:

  • On Mac OS, -lpthread was not being used to link to the pthreads library which made pydnspp tests fail.
  • On CentOS (and even Debian and Solaris i686), pthread_mutex_trylock() returned EDEADLK when a mutex was already locked in the same thread, which was not checked. Fedora and some other systems returned 'EBUSY'.
    • This caused a locker test's child to fail on Solaris and not return status to its parent.
  • On Debian i686, there was a static destruction fiasco where the Mutex and SyncMap inside InterprocessSyncFile were being destroyed before the logger object which used it.

comment:25 Changed 7 years ago by muks

  • Owner changed from muks to vorner
  • Status changed from reopened to reviewing

Up for review. The changes for the above issues should be self-explanatory.

  • On Mac OS, one more change was made to make it look in the builddir for libb10-threads library.
  • Two objects are left created (leaked) in InterprocessSyncFile? to avoid the static destruction fiasco. This is not a leak worth bothering though as it doesn't grow with time. With some clever arrangement of instantiation, we can avoid the static destruction fiasco, but I think the approach in the branch is safer.

comment:26 Changed 7 years ago by vorner

  • Owner changed from vorner to muks
  • Total Hours changed from 3.25 to 3.85

Hello

First, did you try running it through the build bot? Maybe if there were so many problems, it would be worth it.

Also, the two functions (InterprocessSyncFile::getSyncMapMutex, InterprocessSyncFile::getSyncMap), maybe they could be stand-alone functions that live in the anonymous namespace. There is probably no need to have them in the header of the class.

And, the thing with EDEADLK, I'm not sure this is the correct behaviour ‒ I can't imagine where trying to lock the mutex that is being locked by the same thread makes sense in well-designed application. I don't actually mind this behaviour much, because the method is probably not going to be used outside of unit tests anyway, so I don't care much, but I wanted to say that the behaviour of debian might be reasonable approach too.

Anyway, I don't think I need another review round for the two functions, so if you want to change them, go ahead and then after running through the build bot, please merge.

Thank you

comment:27 Changed 7 years ago by muks

  • Resolution set to fixed
  • Status changed from reviewing to closed

I've had to rewrite all patches on top of master as it had moved on, and also because the earlier merge+revert got git thinking that the reverted state was the latest state of code (and it didn't apply changes in this branch). They don't need re-review as the patches are the same, just rebased.

Merged to master in commit e6ba20b0907ab7e861e608a71a35f71ab228e79f:

* 4e199b2 [2198] Move get methods into unnamed namespace
* c3a8f64 [2198] Fix library paths to add lib/util/threads/ too (Mac OS test failure)
* 73f49fc [2198] Avoid static destruction fiasco with InterprocessSyncFile and logger
* 59d1e77 [2198] Check for EDEADLK too when using Mutex::tryLock()
* d7a9543 [2198] Link -lpthread in libb10-threads target
* 1bb3279 [2198] Don't return anything from thread functions
* 9a85c00 [2198] Update comment
* 9de4809 [2198] Test the non-blocking variant of Mutex::Locker
* 11df86c [2198] Check pthread_mutex_trylock()'s return value
* 699a941 [2198] Revert include to local
* a80cd44 [2198] Stop using direct lock methods on Mutex inside InterprocessSyncFile
* 48bfbcc [2198] Fix include filename
* 13e9951 [2198] Make direct methods on Mutex private
* f8fb9a0 [2198] Add a block argument to Mutex::Locker's constructor
* 1b638d6 [2198] Add AlreadyLocked exception to Mutex::Locker
* 638f92b [2198] Remove dependency on libb10-util in src/lib/util/unittests/
* 1c5c694 [2198] First use a mutex for mutual exclusion among threads, then a lock file
* b09b430 [2198] Add Mutex::tryLock() and make lock methods public
* 2d01422 [2198] Test that locks work from threads
* 0a4ef32 [2198] Build threads directory first because . depends on it

Also committed one more fix to master:

* ae2e285 [master] Execute run_dbutil.sh using shell

This is necessary because run_dbutil.sh sometimes loses its execute (+x) mode bit when regenerated, and tests fail due to it.

Resolving as fixed. Thank you for the reviews Michal.

comment:28 Changed 7 years ago by jinmei

  • Resolution fixed deleted
  • Status changed from closed to reopened

Reopening...this one still seems to have serious issues.

I've found that once we merge #2211 (we heavily start using threads
from this ticket) to master auth unittests start failing, randomly,
but quite often. After spending some amount of time I figured out
that there are interprocess_sync_file caused various types of
inter-thread race conditions.

First one is due to the use of shared_ptr in InterprocessSyncFile.
The reference counter in it is shared by multiple threads, we need to
protect the access to it reasonably. It may help if we enable the
thread support in boost, but I'm not so sure about it.

Also, getSyncMapMutex() itself seems to have a race condition.
sync_map_mutex can be assigned at the same time. Some other variables
like is_locked_ (and maybe fd_) are used without any protection, at
least some of which seem to be very dangerous.

Overall, the quality of this code is still very low and not ready for
merge. Could you first revert it (if git revert causes
troubles for later use in some other way)?

I'm not sure how portable this issue is, but I'm pushing the merged
branch (#2211 and master) for reference. The branch name is
trac2211merge. Try building it and running auth unittests to see if
it happens on other environments.

comment:29 Changed 7 years ago by muks

The InterprocessSyncFile? was written modifed assuming that a different instance would be used in each thread, i.e., different logger object in each thread, but re-reading the description, I think it wants the same logger object to be shared among threads. The lock was made to work among many threads, but it (the instance) itself cannot be shared.

It can be fixed with more mutual exclusion. I'll revert the merge and fix it during the next sprint.

comment:30 Changed 7 years ago by muks

  • Status changed from reopened to assigned

Merge of branch trac2198_3 was reverted.

Changed 7 years ago by muks

Patch to disable InterprocessSync? usage in Logger until we fix the threads issue

comment:31 Changed 7 years ago by jinmei

I suggest we first discuss whether we really want to solve this issue
ourselves.

And, if we do, I suggest a much simplified approach: have
LoaggerManager maintain a single global mutex, and we use it
LoggerImpl::outputRaw(). LoaggerManager is the most reliable
singleton in the logging framework (relying on how applications
perform logging initialization), so it's probably the best to
implement it on top it, rather than introducing yet another point of
singleton.

comment:32 Changed 7 years ago by muks

  • Owner changed from muks to UnAssigned
  • Status changed from assigned to reviewing

This has been implemented in the trac2198_4 branch.

I have not added a test such as the interprocess one in logger_lock_test.cc + logger_lock_test.sh.in as it involves adding more API for passing mock objects and such. It wouldn't test logging calls from 2 threads deterministically to catch a race. The Mutex and Mutex::Locker are already tested in util/threads/tests/.

comment:33 follow-up: Changed 7 years ago by muks

If we are to pick up the tryLock() bits of Mutex and Mutex::Locker from the other branches, we can test if the Mutex in LoggerManager is locked at least, before log4cplus is used. But I'm not sure if we should test it.. let me know in the review.

comment:34 in reply to: ↑ 33 Changed 7 years ago by muks

Replying to muks:

If we are to pick up the tryLock() bits of Mutex and Mutex::Locker from the other branches, we can test if the Mutex in LoggerManager is locked at least, before log4cplus is used. But I'm not sure if we should test it.. let me know in the review.

This has been implemented now. We check that the LoggerManager mutex is locked during logging calls.

comment:35 Changed 7 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:36 Changed 7 years ago by vorner

  • Owner changed from vorner to muks

Hello

I thought I was careful about the map and shared pointers being protected
before. But maybe it got changed somewhere during the work and I didn't notice
:-|. Anyway…

I'm no longer sure about the scope of this ticket. Is it to make the
InterprocessLock? thread safe, or is it to make the output not garbled or is it
to make the logging thread safe?

The branch seems to address the second point now, but I think not the third ‒
and we need to ensure that in the end somehow. I think the current
implementation can either crash or cause inconsistent state, at least
temporarily, for example in the following situation:

  • A main thread and a background thread is running.
  • The background thread has a lot to do and it is logging while it is doing so (let's say it is loading a huge zone).
  • The administrator reconfigures the debugging. The main thread reacts to it, reconfiguring the logging library. This may include things like changing logging severities, but also changing files to be logged to.

If the reconfiguration happens at the same time as logging call, the internals
are not locked and the threads might confuse each other. And this can happen
even in case the logger does not output anything, because it needs to check if
the output should or should not be produced.

On the other hand, locking everything might not be good too. We have too much
logging over the place and it would impose too much overhead and lock
contention.

So I'm not sure what to do here. We probably should try to find some solution
on the mailing list.

comment:37 Changed 7 years ago by jelte

  • Milestone changed from Sprint-DHCP-20130103 to Sprint-20130108

comment:38 Changed 7 years ago by muks

  • Owner changed from muks to UnAssigned

Please review the trac2198_5 branch. This introduces a mutex inside a process to synchronize threads calling log4cplus.

I have tested that this causes lettuce runs on my local machine to pass, whereas it always fails on master.

But I have also seen at least one intermixed log message with this branch. So please review the branch carefully to see if you find anything that may cause a race.

comment:39 Changed 7 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:40 follow-up: Changed 7 years ago by vorner

Hello

Looking at the code, it looks sane an mostly OK. Just, around the lock()/unlock()/tryLock methods in mutex, the comments say not to use the method directly. But the methods are private. I guess the comment is useless there.

Also, I think we do lock correctly. I confirmed by writing a multi-process & multi-threaded program that does only logging. It produced no garbled messages. The programs are attached, if you want to have a look.

However, the problems with logging in lettuce still stays. I don't know how much it is related to this ticket, but:

  • I run the lettuce with redirection to pipe, by run_lettuce.sh | tee output.log, so I can examine the log even if it scrolls up too far. That may be related.
  • After some time of experimenting, I managed to catch the problem with strace running. This is the garbled output:
    2013-03-29 13:10:41.565 DEBUG [b10-msgq.msgq/2318] MSGQ_RECV_HDR Received header: {'type': 'getlname'}
    2013-03-29 13:10:41.560 DEBUG [b10-cmdctl.cmdctl2013-03-29 13:10:41.566/ DEBUG2361 [] b10-msgq.msgq/2318
    ] MSGQ_RECV_HDR Received header: {'type': 'subscribe', 'group': 'Xfrin', 'instance': '*'}2013-03-29 13:10:41.566
     DEBUG [b10-cmdctl.pycc /DEBUG2361 [b10-msgq.msgq/PYCC_LNAME_RECEIVED received local name: 515584c1_8@ruth2318
    ] MSGQ_SUBS_NEW_TARGET Creating new target for subscription to group 'Xfrin' for instance '*'2013-03-29 13:10:41.566
     DEBUG [b10-cmdctl.pycc/2361] PYCC_LNAME_RECEIVED received local name: 515584c1_9@ruth
    2013-03-29 13:10:41.567 DEBUG [b10-cmdctl.config/2361] CONFIG_LOG_WILD_MATCH will use logging configuration for wildcard logger *
    

Relevant strace files are attached to ticket.

From how it looks, it seems we do the logging, but the write() calls are buffered and delayed for later time for some reason.

I don't have a concrete idea what to do about it. I think we still do need the locking and it should be merged (once the detail up there are handled), but it obviously is not enough.

Changed 7 years ago by vorner

Experiments with logging

Changed 7 years ago by vorner

CMDCTL strace

Changed 7 years ago by vorner

MSGQ strace

comment:41 Changed 7 years ago by vorner

  • Owner changed from vorner to muks

comment:42 in reply to: ↑ 40 Changed 7 years ago by muks

Replying to vorner:

Looking at the code, it looks sane an mostly OK. Just, around the lock()/unlock()/tryLock methods in mutex, the comments say not to use the method directly. But the methods are private. I guess the comment is useless there.

These are now removed.

comment:43 Changed 7 years ago by muks

  • Owner changed from muks to vorner

Can we merge trac2198_5 the way it is right now into master and look at the remaining issue?

comment:44 follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to muks
  • Total Hours changed from 3.85 to 8.08

Hello

Yes, it now looks good to merge.

We may want to open another ticket for the other issues, because the problem is not in the scope of this ticket, strictly speaking.

comment:45 Changed 7 years ago by muks

Merged trac2198_5 branch to master in commit 4f3d61ec6d64807a8f174c4d028154732c4bad77:

* 40de1c1 [2198] Remove unnecessary text from API documentation
* e85c7ea [2198] Fix library paths to add lib/util/threads/ too (contd.)
* e974876 [2198] Fix library paths to add lib/util/threads/ too (OpenBSD test failure)
* 36e6a75 [2198] Link libb10-threads in log tests
* e642b05 [2198] Check that the logger locks the LoggerManager mutex before calling log4cplus
* 84bf517 [2198] Add Mutex::tryLock() and non-blocking variant of Mutex::Locker
* f8dc481 [2198] Fix library paths to add lib/util/threads/ too (Mac OS test failure)
* b9d570e [2198] Link -lpthread in libb10-threads target
* 2caf9ff [2198] Use a mutex from LoggerManager during logging calls

comment:46 follow-up: Changed 7 years ago by jreed

e642b05 duplicates noinst_SCRIPTS = console_test.sh and following lines.

Also a Changelog entry is missing for this.

comment:47 in reply to: ↑ 46 Changed 7 years ago by muks

Replying to jreed:

e642b05 duplicates noinst_SCRIPTS = console_test.sh and following lines.

Thank you for catching this. I've fixed it in master now.

Also a Changelog entry is missing for this.

I don't think a ChangeLog is required as it's got no changes to affect a user's view (neither bugfix in parsing something, nor some new functionality change). It just grabs another lock before logging. But if you feel strongly that it requires a ChangeLog entry, we'll add one.

comment:48 in reply to: ↑ 44 Changed 7 years ago by muks

Replying to vorner:

We may want to open another ticket for the other issues, because the problem is not in the scope of this ticket, strictly speaking.

I have created ticket #2900 for this issue.

comment:49 Changed 7 years ago by muks

  • Resolution set to fixed
  • Status changed from reviewing to closed

Resolving as fixed. Thank you for the reviews Michal, and Jinmei before that.

Note: See TracTickets for help on using tickets.