Opened 8 years ago

Closed 8 years ago

#1518 closed defect (complete)

0 byte rotated log files when changing settings

Reported by: jreed
Owned by: UnAssigned
Priority: high
Milestone: Sprint-20120124
Component: logging
Version:
Keywords:
Cc:
CVSS Scoring:
Parent Tickets:
Sensitive: no
Defect Severity: High
Sub-Project: Core
Feature Depending on Ticket: none
Estimated Difficulty: 7 (but see comment)
Add Hours to Ticket: 3
Total Hours: 0
Internal?: no

Description

All my rotated log files were full of log entries. Changing debuglevel from 99 to 0 resulted in lost logs:

-rw-r--r--  1 root  wheel    7853 Dec 19 18:47 /var/log/bind10.log
-rw-r--r--  1 root  wheel       0 Dec 19 18:47 /var/log/bind10.log.1
-rw-r--r--  1 root  wheel       0 Dec 19 18:47 /var/log/bind10.log.2
-rw-r--r--  1 root  wheel       0 Dec 19 18:47 /var/log/bind10.log.3
-rw-r--r--  1 root  wheel       0 Dec 19 18:47 /var/log/bind10.log.4
-rw-r--r--  1 root  wheel       0 Dec 19 18:47 /var/log/bind10.log.5
-rw-r--r--  1 root  wheel       0 Dec 19 18:47 /var/log/bind10.log.6
-rw-r--r--  1 root  wheel     498 Dec 19 18:47 /var/log/bind10.log.7
-rw-r--r--  1 root  wheel  169364 Dec 19 18:47 /var/log/bind10.log.8

These were all full a few seconds earlier. In addition, the new 8th file (8 is my maxver) is now larger than the maxsize (30000).

Jinmei told me in jabber: "I sometimes see size-0 log files, too."

Subtickets

Change History (11)

comment:1 Changed 8 years ago by jreed

I was able to reproduce this by changing debuglevel up and down. For your info, the system was receiving around 60,000 queries per second (it could only respond to about 13%), so it generated lots of logs fast and rotated files fast. I didn't check, but maybe the logs from the zeroed-out files got dumped into the last file (so were not lost yet). I understand the logs will be lost anyway due to rotation, but that is why I turned the debuglevel down, so I could look at them -- when they became 0 size, that defeated my technique.

comment:2 Changed 8 years ago by jelte

  • Estimated Difficulty changed from 0 to 7 (but see comment)
  • Milestone changed from New Tasks to Sprint-20120110

Note that for now, the task for this ticket is just to find out what is happening; do not spend more than 1 (one) hour on this. It was estimated with a sizeable number of points partly because we have no idea what is wrong, or whether the fault lies in the library or in the way we currently use it.

If it turns out to be a trivial error we can fix, we can do it. If it is not trivial, we will defer this to the next sprint.

comment:3 Changed 8 years ago by jelte

  • Priority changed from major to critical

comment:4 Changed 8 years ago by shane

  • Feature Depending on Ticket set to none

comment:5 Changed 8 years ago by stephen

  • Owner set to stephen
  • Status changed from new to assigned

comment:6 follow-up: Changed 8 years ago by stephen

  • Owner changed from stephen to UnAssigned
  • Status changed from assigned to reviewing

Firstly, the problem is not one of lost data. What is happening is that when the log settings are changed, a small number of new log files are created with a size of zero (or a few hundred bytes). If the logging is configured such that only a small number of files are retained, these zero-length log files use up part of that limit and older files are deleted. (Upping the maximum number of versions to something like 20 and looking at the i-node numbers before and after changing the log settings shows what is happening.)

The reason why these files are created is not completely clear, but it is to do with the fact that there are multiple processes logging to the same file. Currently when logging settings are changed, the logging code clears all appenders (the component that writes to the file) and sets up new ones. The effect of this is that for each process, all log files it has open are closed and then re-opened.
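
As a minimal sketch of what that reset amounts to in each process, assuming the standard log4cplus API (the file name and rotation parameters below are illustrative values, not the actual BIND 10 configuration):

#include <log4cplus/logger.h>
#include <log4cplus/fileappender.h>

// Sketch: what a logging reconfiguration does in each process. Dropping
// the appenders closes the log file; creating a new appender reopens it.
void reconfigureLogging() {
    log4cplus::Logger root = log4cplus::Logger::getRoot();

    // Closes the file held by the old RollingFileAppender.
    root.removeAllAppenders();

    // Reopens the same file; name, maxsize and maxver are made-up values.
    log4cplus::SharedAppenderPtr app(
        new log4cplus::RollingFileAppender(
            LOG4CPLUS_TEXT("/var/log/bind10.log"), 30000, 8));
    root.addAppender(app);
}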

Ideally each process should reopen the same file as it closed. However, I speculate that when a BIND 10 process attempts to reopen the log file, sometimes the operation fails and for some reason a file rollover is triggered. (A file rollover is where old log files are renamed with a higher version number and a new current log file is opened.) Since there are several BIND 10 processes, there can be several rollovers, hence the empty files. Looking at the files opened by the BIND 10 processes confirms that view - after a logging reconfiguration we can end up with different BIND 10 processes pointing to different versions of the log file. (Processes do not necessarily all end up pointing to different log files - two or three processes may end up pointing to the same file.) This observation also explains why some files are zero length and some only a few hundred bytes - it depends on whether the process in question generated a log message when it received the configuration change: if so, the message was logged to the newly-opened log file.

So we have one problem with a change of logging configuration. However, thinking about it, even if the reconfiguration did not generate redundant files, we would still have a problem when a file rollover takes place. To illustrate this, suppose we have processes Alpha and Beta both writing to the file "out.log". Suppose also that the file has reached its maximum size. When Alpha next writes to the log file, the size of the file is checked. Since it has reached its maximum size, Alpha closes the file and performs the rollover: "out.log" is renamed "out.log.1", a new "out.log" is created, and Alpha writes the message to that. During this process, Beta still has the same file open, so after the rename it is "out.log.1" that Beta has open. If Beta now writes to the log file, the same thing happens: the check on the file size causes the file to be closed and the rollover to take place - "out.log.1" is renamed "out.log.2", "out.log" is renamed "out.log.1", and Beta opens "out.log" and writes the message to it. So although we started with both Alpha and Beta logging to the same file, we end up with Alpha and Beta logging to different files. (I don't think we have seen this yet, but only because tests tend to concentrate on a single component at a time.)
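
The POSIX behaviour underlying this is that rename() has no effect on descriptors that are already open: a process keeps writing to the inode it opened, whatever the file is now called. A minimal standalone demonstration (not BIND 10 code):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    // Open the log file the way an appender would.
    int fd = open("out.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    // Simulate another process (Alpha) performing a rollover.
    rename("out.log", "out.log.1");

    // This write lands in out.log.1, not out.log: the descriptor is
    // bound to the inode, not to the name.
    const char msg[] = "written after the rename\n";
    write(fd, msg, sizeof(msg) - 1);
    close(fd);
    return 0;
}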

In summary, we have two problems with the rotating log files if all processes are set to log to the same file:

  1. When logging is reconfigured, instead of processes closing the current log file and reopening it, sometimes the BIND 10 processes can end up opening different log files.
  2. Even if (1) were not a problem, we can expect processes to end up logging to different log files when the log file size reaches the maximum limit.

Suggestions

  1. I think we will need to enable log4cplus debugging output to chase down the first problem. (It seems to be on by default but output is written to stderr, which is not captured by processes started by Boss.) It may be that the problem is just that a number of processes attempt to open the same file together and the error causes the rollover. A small random wait may be all that is required.
  2. I think the second problem could be mitigated by using a class derived from the log4cplus RollingFileAppender that overrides some of its methods. When a rollover happens, instead of closing the current file, rolling the file names and opening a new file, it should close the file, open what it thinks should be the current file, and try to write again. Only if this file exists and is at maximum size should the rollover be performed. We might also want to consider the idea of a lock file to prevent two processes trying to roll over a particular log file at the same time. (In fact, use of a lock file might well solve the first problem as well; see the sketch after this list.)
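
To make the lock-file idea concrete, here is a minimal sketch of how a rollover could be serialized across processes using a POSIX advisory lock. This is not BIND 10 or log4cplus code: the helper names and the lock file path are made up, and a real implementation would sit inside the derived appender.

#include <fcntl.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <unistd.h>

// Block until we hold an exclusive advisory lock on the lock file.
// Returns the descriptor holding the lock, or -1 on error.
int acquireRolloverLock(const char* lockPath) {
    int fd = open(lockPath, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        return -1;
    }
    if (flock(fd, LOCK_EX) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

void releaseRolloverLock(int fd) {
    flock(fd, LOCK_UN);
    close(fd);
}

// Intended use inside the customized appender, in outline:
//
//   int lock = acquireRolloverLock("/var/log/bind10.log.lock");
//   if (lock >= 0) {
//       // Re-check under the lock: another process may have already
//       // rolled the file while we were waiting.
//       struct stat st;
//       if (stat("/var/log/bind10.log", &st) == 0 && st.st_size >= maxsize) {
//           // ... perform the rename chain and open a fresh file ...
//       } else {
//           // ... just reopen the current file and retry the write ...
//       }
//       releaseRolloverLock(lock);
//   }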

I have put this up for review in case anyone else has insights or suggestions. Before the next sprint planning meeting I will close it and create tickets for the agreed solutions.

comment:7 Changed 8 years ago by jreed

Is this something that should be discussed with the log4cplus upstream?

Is log4cplus commonly used with multiple unrelated processes logging to the same log files and using the same log4cplus configuration?

comment:8 Changed 8 years ago by jelte

That second point does sound like something that should be handled natively by the rollover code in log4cplus. If your description is correct, it should also be reproducible by a test (see the sketch below). Perhaps we should write that and take it up with the log4cplus devs.

I also think that fixing 2 correctly should solve 1.
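
Such a test might look roughly like the sketch below: run two or more copies of it concurrently and watch which versions of the file each process ends up writing to (comparing the inode numbers of out.log* before and after a rollover shows the divergence). The file name and rotation limits are illustrative values, not a real configuration.

#include <log4cplus/logger.h>
#include <log4cplus/fileappender.h>
#include <log4cplus/loggingmacros.h>
#include <unistd.h>

int main() {
    // One rolling appender per process, all pointing at the same file;
    // a small maxsize forces frequent rollovers.
    log4cplus::SharedAppenderPtr app(
        new log4cplus::RollingFileAppender(
            LOG4CPLUS_TEXT("out.log"), 30000, 8));
    log4cplus::Logger logger =
        log4cplus::Logger::getInstance(LOG4CPLUS_TEXT("test"));
    logger.addAppender(app);

    // With two or more instances running, each process should
    // eventually drift onto its own version of the file.
    for (int i = 0; i < 10000; ++i) {
        LOG4CPLUS_INFO(logger, "pid " << getpid() << " message " << i);
    }
    return 0;
}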

comment:9 in reply to: ↑ 6 Changed 8 years ago by vorner

Hello

  2. Even if (1) were not a problem, we can expect processes to end up logging to different log files when the log file size reaches the maximum limit.

This sounds possible. It might also have happened at the time of the reconfiguration: if a logger buffers a message or two, the chance that the size limit is reached at the moment the loggers dump their buffers is higher than usual.

Suggestions

  2. I think the second problem could be mitigated by using a class derived from the log4cplus RollingFileAppender that overrides some of its methods. When a rollover happens, instead of closing the current file, rolling the file names and opening a file, it should close the file, open what it thinks should be the current file and try to write again. Only if this file exists and is at maximum size should the rollover be performed. We might also want to consider the idea of a lock file to prevent two processes trying to roll over a particular log file at the same time. (In fact, use of a lock file might well solve the first problem as well.)

It might be, and I guess it would also be the fastest way to test the theory, as it shouldn't be much code. And I agree with Jelte that it could solve (1) as well. Maybe the library wasn't meant for this kind of use.

comment:10 Changed 8 years ago by jinmei

Confirming the theory via log4cplus output seems reasonable.

I don't have a strong opinion about sub-classing RollingFileAppender, but if the base class is designed to allow that type of customization, it seems worth trying (on the other hand, if the base class is not expected to be derived from, that's probably not a good idea).

Another thing: postfix seems to have multiple processes dump logs to the same file. It's probably using built-in logging code, but we may be able to find some hints about how to handle such a case in its implementation.

comment:11 Changed 8 years ago by stephen

  • Add Hours to Ticket changed from 0 to 3
  • Resolution set to complete
  • Status changed from reviewing to closed

Confirming the theory via log4cplus output seems reasonable.

I've done this by modifying the logger_example.cc code (in branch stephen-multiplelog) to write multiple sets of messages separated by a short delay (to give time to examine the files while the processes are running). If two processes are started and the files are allowed to roll, the first file contains messages from both processes; thereafter each file contains messages from only a single process.
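
The modification was presumably along these lines (a sketch only, not the actual branch code; the batch sizes and delay are made-up values):

#include <log4cplus/logger.h>
#include <log4cplus/fileappender.h>
#include <log4cplus/loggingmacros.h>
#include <unistd.h>

int main() {
    log4cplus::SharedAppenderPtr app(
        new log4cplus::RollingFileAppender(
            LOG4CPLUS_TEXT("out.log"), 30000, 8));
    log4cplus::Logger logger =
        log4cplus::Logger::getInstance(LOG4CPLUS_TEXT("example"));
    logger.addAppender(app);

    // Write messages in batches with a pause between them, so the
    // rolled files can be examined while the processes are running.
    for (int batch = 0; batch < 20; ++batch) {
        for (int i = 0; i < 200; ++i) {
            LOG4CPLUS_INFO(logger, "pid " << getpid()
                                   << " batch " << batch << " msg " << i);
        }
        sleep(2);
    }
    return 0;
}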

Ticket #1622 created to implement a solution.
