Opened 8 years ago

Closed 7 years ago

Last modified 7 years ago

#1622 closed defect (fixed)

Ensure multiple processes log to same file when files are being rolled

Reported by: stephen Owned by: jinmei
Priority: medium Milestone: Sprint-20130611
Component: logging Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: Medium
Sub-Project: Core Feature Depending on Ticket:
Estimated Difficulty: 6 Add Hours to Ticket: 0
Total Hours: 5.91 Internal?: no

Description

Type implementing a subclass of log4cplus::RollingFileAppender? (and a lock file, both solutions suggested in #1518) to keep all processes logging to the same file when a log file is rolled.

Subtickets

Change History (37)

comment:1 Changed 8 years ago by shane

  • Milestone changed from New Tasks to Next-Sprint-Proposed

comment:2 Changed 8 years ago by jelte

  • Estimated Difficulty changed from 0 to 6

comment:3 Changed 7 years ago by jreed

This problem still exists with log4cplus 1.1.0 which is said to fix this problem. I will report upstream. Here is the example from production n10 server:

%ls -ltr /var/log/bind10/ | tail
-rw-r--r--  1 root  wheel   400705 Oct 30 21:30 bind10.log.8
-rw-r--r--  1 root  wheel  1048681 Oct 31 06:00 bind10.log.1
-rw-r--r--  1 root  wheel    82692 Oct 31 06:00 output.log
-rw-r--r--  1 root  wheel  1048625 Oct 31 06:00 bind10.log.2
-rw-r--r--  1 root  wheel     2799 Oct 31 11:36 bind10.log.4
-rw-r--r--  1 root  wheel     6325 Oct 31 11:36 bind10.log.3
-rw-r--r--  1 root  wheel   546025 Oct 31 13:03 bind10.log
-rw-r--r--  1 root  wheel   116931 Oct 31 13:04 bind10.log.5
-rw-r--r--  1 root  wheel   400637 Oct 31 13:04 bind10.log.7
-rw-r--r--  1 root  wheel   144156 Oct 31 13:04 bind10.log.6

Multiple log files still written too. Now all out of order. fstat shows multiple processes have different bind10 log files open.

comment:4 Changed 7 years ago by jreed

log4cplus 1.1.0 has UseLockFile?. Let's try it.

comment:5 Changed 7 years ago by jreed

I see this was duplicated by #1896. See it for other details.

comment:6 Changed 7 years ago by jreed

  • Milestone set to Next-Sprint-Proposed

comment:7 Changed 7 years ago by jreed

  • Milestone set to Next-Sprint-Proposed

comment:8 Changed 7 years ago by jinmei

Do we have a specific idea of how to do it?

comment:10 Changed 7 years ago by jreed

  • Milestone changed from Previous-Sprint-Proposed to Next-Sprint-Proposed

comment:11 Changed 7 years ago by jreed

Also #1629 suggests useLockFile.

comment:12 Changed 7 years ago by shane

  • Defect Severity changed from N/A to Medium
  • Milestone changed from Previous-Sprint-Proposed to Next-Sprint-Proposed
  • Priority changed from low to medium

Hm... I'm not sure why this priority is so low. Raising!

comment:13 Changed 7 years ago by shane

  • Sub-Project changed from DNS to Core

comment:14 Changed 7 years ago by shane

  • Milestone changed from Previous-Sprint-Proposed to Next-Sprint-Proposed

comment:15 Changed 7 years ago by shane

  • Milestone changed from Previous-Sprint-Proposed to Next-Sprint-Proposed

comment:16 follow-up: Changed 7 years ago by shane

  • Milestone changed from Previous-Sprint-Proposed to Next-Sprint-Proposed

Now that #2198 has cured some problems, we should at least see if this is still broken.

comment:17 in reply to: ↑ 16 Changed 7 years ago by jreed

Replying to shane:

Now that #2198 has cured some problems, we should at least see if this is still broken.

Problem still exists.

comment:18 Changed 7 years ago by jreed

  • Milestone changed from Previous-Sprint-Proposed to Next-Sprint-Proposed

comment:19 Changed 7 years ago by muks

  • Milestone changed from Next-Sprint-Proposed to Sprint-20130514

comment:20 Changed 7 years ago by jinmei

How can we reproduce it? specific procedure on specific machine?

comment:21 Changed 7 years ago by jreed

I don't think any specific machine is needed (I have seen on Linux and different BSDs). Just enable logging to a file, set the log rotation to small file size so happens frequently, set debug level to something so logging happens, and then start using cmdctl/bindctl, sending queries, etc. Logs for different components will keep logging to a renamed file.

comment:22 Changed 7 years ago by jinmei

It's not surprising to me log4cplus 1.1.0 doesn't work without using
its own file lock:

    if (useLockFile)
    {
...
        // Recheck the condition as there is a window where another
        // process can rollover the file before us.

        helpers::FileInfo fi;
        if (getFileInfo (&fi, filename) == -1
            || fi.size < maxFileSize)
        {
            // The file has already been rolled by another
            // process. Just reopen with the new file.

            // Open it up again.
            open (std::ios::out | std::ios::ate);
            loglog_opening_result (loglog, out, filename);

            return;
        }
    }

comment:23 Changed 7 years ago by jinmei

  • Owner set to jinmei
  • Status changed from new to accepted

comment:24 Changed 7 years ago by jinmei

Confirmed: log4cplus 1.1.0 + its internal lock file solved this
problem.

What should we do with 1.0.x? Can we simply say "if this issue
bothers you, you should use log4cplus 1.1.0"? It doesn't seem to be
impossible to port the specific fix to this problem to 1.0.4, so we
could define our own custom version of RollingFileAppender. But, in
general, I don't like to tweak external libraries that way. It's a
hack, and if we need this type of hack for every small problem for
external libraries, we are going to lose the advantages of avoiding
in-house tools for non core business. It's especially so if the
latest version of that library solves the problem itself.

comment:25 follow-up: Changed 7 years ago by jreed

We could have a log warning that a newer log4cplus is needed when using file based log rotation. We also should document this need.

Do you have a branch to push for the fix? (I didn't see yet to try it.)

comment:26 in reply to: ↑ 25 Changed 7 years ago by jinmei

Replying to jreed:

We could have a log warning that a newer log4cplus is needed when using file based log rotation. We also should document this need.

I don't think logging this event is feasible. It will be quite noisy
as all of the 100 modules complain about the same thing. There could
be some more hack to address this concern, but, again, I'd be
reluctant to make our code messier due to a problem of the external
library. So my current suggestion is simply to document it.

comment:27 Changed 7 years ago by jinmei

trac1622 is ready for review.

I've basically taken a minimalist approach: didn't introduce a derived
class to solve this for log4cplus 1.0.x; didn't bother to leave
warning log message when we detect it. I explained why in the ticket
comments. Note that this should be relatively a rare problem in
practice because normally logging should only happen sporadically and
less frequently.

Another note is that the log4cplus's internal log file is its default:
"<log file name>.lock". I considered an option of specifying our own
name, but concluded the default should be better. Note that this
shouldn't cause permission problems we've seen with the in-house
version of lock file; in fact, otherwise we couldn't even update the
log file itself due to the permission problem.

Proposed changelog entry:

622.?	[bug]		jinmei
	logging: prevented multiple BIND 10 processes from generating
	multiple small log files when they dumped logs to files and try
	to roll over them simultaneously.  This fix relies on a feature of
	underling logging library (log4cplus) version 1.1.0 or higher,
	so the problem can still happen if BIND 10 is built with an older
	version of log4cplus. (But this is expected to happen rarely in
	any case unless a verbose debug level is specified).
	(Trac #1622, git TBD)

comment:28 Changed 7 years ago by jinmei

  • Owner changed from jinmei to UnAssigned
  • Status changed from accepted to reviewing

comment:29 Changed 7 years ago by jreed

I am running trac1622 on one my personal systems that had the problem. It appears to be gone now with this fix. Thank you. The files have rotated a few times and now I have all logging from different components in single files as I expect. (I didn't test with older log4cplus without support.)

comment:30 Changed 7 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:31 follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

The code looks good. So, assuming both you and Jeremy checked it helps with the problem, I think it can be merged.

comment:32 Changed 7 years ago by vorner

  • Total Hours changed from 0 to 0.91

comment:33 follow-up: Changed 7 years ago by jreed

I still have a problem:

Some log files still out of order:

-rw-r--r--  1 root  reed  1050197 Jun  3 11:35 bind10.log.1
-rw-r--r--  1 root  reed  1049479 Jun  3 13:03 bind10.log.13
-rw-r--r--  1 root  reed    22216 Jun  3 13:04 bind10.log

The last two lines in the .13 log file are:

2013-05-30 23:27:54.052 DEBUG [b10-zonemgr.zonemgr/11742] ZONEMGR_REFRESH_ZONE refreshing zone foo. (class IN)
2013-06-03 13:03:16.463 DEBUG [b10-cmdctl.cmdctl/12565] CMDCTL_SEND_COMMAND sending command shutdown to module Cmdctl

For some reason "DEBUG [b10-cmdctl.cmdctl/12565] CMDCTL_SEND_COMMAND" was sent to wrong destination.
And see this inode:

1630925   2112 -rw-r--r--    1 root              reed                1049479 Jun  3 13:03 /home/reed/work/isc/bind10/log/bind10.log.13

And from fstat (same inode):

root     python3.3  23304    8 /        1630925 -rw-r--r--  1049479 w 
root     python3.3   6330    7 /        1630925 -rw-r--r--  1049479 w 
root     python3.3   7269    8 /        1630925 -rw-r--r--  1049479 w 

Those three PIDs that have old log file still opened are b10-stats-httpd, b10-ddns, and b10-cfgmgr.

This is with log4cplus-1.1.0.

Using the trac1622 branch.

comment:34 in reply to: ↑ 33 Changed 7 years ago by jinmei

Replying to jreed:

I still have a problem:

Some log files still out of order:

-rw-r--r--  1 root  reed  1050197 Jun  3 11:35 bind10.log.1
-rw-r--r--  1 root  reed  1049479 Jun  3 13:03 bind10.log.13
-rw-r--r--  1 root  reed    22216 Jun  3 13:04 bind10.log

The last two lines in the .13 log file are:

2013-05-30 23:27:54.052 DEBUG [b10-zonemgr.zonemgr/11742] ZONEMGR_REFRESH_ZONE refreshing zone foo. (class IN)
2013-06-03 13:03:16.463 DEBUG [b10-cmdctl.cmdctl/12565] CMDCTL_SEND_COMMAND sending command shutdown to module Cmdctl

For some reason "DEBUG [b10-cmdctl.cmdctl/12565] CMDCTL_SEND_COMMAND" was sent to wrong destination.
And see this inode:

I suspect it's a log4cplus bug (or an odd "feature"). It first
appends data to the "current" file and then check if it needs to roll
over it:

void
RollingFileAppender::append(const spi::InternalLoggingEvent& event)
{
    FileAppender::append(event);

    if(out.tellp() > maxFileSize) {
        rollover(true);
    }
}

I suggest you check if it has been fixed in a new version and if not
report it to the log4cplus developer.

And from fstat (same inode):

root     python3.3  23304    8 /        1630925 -rw-r--r--  1049479 w 
root     python3.3   6330    7 /        1630925 -rw-r--r--  1049479 w 
root     python3.3   7269    8 /        1630925 -rw-r--r--  1049479 w 

Those three PIDs that have old log file still opened are b10-stats-httpd, b10-ddns, and b10-cfgmgr.

That's not necessarily a problem. I suspect they've simply not been
dumping logs while the files are rolled over.

comment:35 in reply to: ↑ 31 Changed 7 years ago by jinmei

Replying to vorner:

The code looks good. So, assuming both you and Jeremy checked it helps with the problem, I think it can be merged.

Thanks for the review, merge done, closing.

comment:36 Changed 7 years ago by jinmei

  • Resolution set to fixed
  • Status changed from reviewing to closed
  • Total Hours changed from 0.91 to 5.91

comment:37 Changed 7 years ago by jreed

#2988 is the follow-up ticket

Note: See TracTickets for help on using tickets.