Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#2410 closed defect (fixed)

race condition(?) in dhcp[46]_test.py

Reported by: jelte Owned by: stephen
Priority: medium Milestone: Sprint-DHCP-20121129
Component: dhcp Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DHCP Feature Depending on Ticket:
Estimated Difficulty: 4 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

while looking at testing with bamboo, we saw regular failures in dhcp6_test.py:

FAIL: test_portnumber_0 (__main__.TestDhcpv6Daemon)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/test/bamboo-home/xml-data/build-dir/TESTPROJBIND10-TESTPLANBIND10-JOB1/src/bin/dhcp6/tests/dhcp6_test.py", line 156, in test_portnumber_0
    self.assertEqual( str(error).count("Failed to parse port number"), 1)
AssertionError: 0 != 1

(and other similar ones from the same set)

an earlier message may be related:

 F/test/bamboo-home/xml-data/build-dir/TESTPROJBIND10-TESTPLANBIND10-JOB1/src/bin/dhcp6/tests/dhcp6_test.py:172: ResourceWarning: unclosed file <_io.FileIO name=13 mode='wb'>

(and something about no data on either stdout and stderr, but I can't find the exact message in the logs right now)

Sometimes it works, sometimes it does not, I think I saw the same for dhcp4 at some point.

I suspect there is a potential race condition here, where sometimes read is done before the process got to printing. Not sure why this is regularly hit here, and not on our main test bots (perhaps the initial wait is long enough there).

Either way that unclosed file should probably be adressed anyway :)

Subtickets

Change History (16)

comment:1 Changed 7 years ago by jelte

I now ran into this error on a normal test run as well btw

comment:2 Changed 7 years ago by shane

  • Milestone New Tasks deleted

comment:3 Changed 7 years ago by stephen

  • Milestone set to Sprint-DHCP-20121129

comment:4 Changed 7 years ago by muks

Please also check if #2496 is related and can be fixed by whatever fixes this issue.

comment:5 Changed 7 years ago by stephen

  • Owner set to stephen
  • Status changed from new to accepted

comment:6 Changed 7 years ago by stephen

Now ready for review.

The problem appears to have been diagnosed correctly - the parent process reads from the pipe connecting it to the subprocess, assuming that at that point, the subprocess had written to the pipe. Although not encountering the problem on my systems, I was able to simulate it by (temporarily) adding a 2-second delay into "main()" of the server process.

This fix causes the parent to loop round reading the pipe until either the parent process terminates or some output appears.

Regarding the unclosed file, additional close calls were added to close unclosed descriptors, and the ProcessInfo object explicitly deleted. The latter appears to eliminate all but one of the unclosed descriptor messages, the exception being the case where the process ends by itself (instead of being killed). It is not clear what the reason for this is.

comment:7 Changed 7 years ago by stephen

  • Owner changed from stephen to UnAssigned
  • Status changed from accepted to reviewing

comment:8 Changed 7 years ago by marcin

  • Owner changed from UnAssigned to marcin

comment:9 Changed 7 years ago by marcin

  • Owner changed from marcin to stephen

Reviewed commit 56bcca870b41742f47160cae5b30b81ef62c1095

It seems to me that in the future we could change os.spawn approach to subprocess.communicate as it is intended to replace old modules. That would also help to avoid race condition.
Do we want to submit the ticket on this one?

dhcp6_test.py
The code appears to have some redundancy. There are two variables: output and new_output. The latter one receives new chunks of output from a process and appends it to the other variable. When process finally produces the output both variables are empty initially. When the following happens:

if (new_output is not None):
    output = output + str(new_output)

output is equal to new_output.
We finally hit this check:

if pi.process.poll() is not None or len(error) > 0 or len(output) > 0:
    break;

Let's assume that process is still running, there were no errors. The output is not empty (it is equal to new_output) so the loop breaks and it never gets to the point when the next chunk of process'es output is read to new_output and thus no other chunk of data is appended to output. What is then the reason to keep two variables around? Why don't we just write to output directly?

comment:10 Changed 7 years ago by stephen

  • Owner changed from stephen to marcin

It seems to me that in the future we could change os.spawn approach to subprocess.communicate as it is intended to replace old modules. That would also help to avoid race condition. Do we want to submit the ticket on this one?

I think we need to leave the code as is. Popen.comminicate reads stdout and stderr until an EOF is read, then waits for the process to terminate. This implicitly assumes that the process does communicate - whereas for some tests, we start up the process which outputs some text and enters its main loop. We check the text then explicitly kill the process.

dhcp6_test.py

The code appears to have some redundancy. There are two variables: output and new_output...

Fair point - it was a hangover from an attempt to accumulate output in multiple reads. These variables have been removed. In addition, some repeated code has been moved to a separate method and the "while" loop simplified.

comment:11 Changed 7 years ago by marcin

  • Owner changed from marcin to stephen

Reviewed commit aac0dd304b86ee38049480c8155f5eb97142c435

I verified the fix on the VM in the following way...

  • I was running the Debian6 system on VirtualBox?: Linux marcin-debian 2.6.32-5-amd64 #1 SMP Sun Sep 23 10:07:46 UTC 2012 x86_64 GNU/Linux
  • I checked out two versions of bind10 to different folders: master branch to one folder and trac2410 with a fix to the other folder
  • I built both versions of bind10 and for each of them I launched the DHCPv6 test in the loop with this script in src/bin/dhcp6 folder:
    #!/bin/bash
    for i in {1..1000} 
    do
        make check
        if [[ $? -ne 0 ]]
        then
            exit 1;
        fi
    
    done
    

While script was running I was doing expensive folder copy operation of Downloads folder to Downloads2 in my home folder.

I hit the issue described in this ticket during the first iteration of the test when running it on master branch.

I made around 30 iterations of this test on trac2410 and did not hit this issue. This indicates that the fix works for the certain machine load.

Important: this change fixes the test fatal error but it does not take away the resource leak error also described in this ticket:

Running command: ../b10-dhcp6 -v
/home/marcin/devel/bbind10/src/bin/dhcp6/tests/dhcp6_test.py:151: ResourceWarning: unclosed file <_io.FileIO name=10 mode='wb'>

However, in the code there is a TODO statement that explains that resource warning is actually a false alarm. Is the following line neccessary:

pi = None

?
Doesn't this object die when the function ends anyway?

comment:12 Changed 7 years ago by stephen

  • Owner changed from stephen to marcin

Doesn't this object die when the function ends anyway?

It should do, which is why I think the issue is more complicated than it seems. There are several Python unit tests within each file. If the "pi = None" line is omitted, the all the unit tests report the resource leak. With it in, only the first one reports the leak.

From checks, the unclosed file is not one opened directly by the unit test code, it appears to be one opened by the subprocess.Popen object (part of the Python library) within ProcessInfo. The same leak happens with the tests for src/bin/bind10. Ticket #2043 has been opened for fix miscellaneous resource leaks.

As the current code fixes the problem (and also #2496), I suggest we close this ticket and leave the resource leak to #2043.

Last edited 7 years ago by stephen (previous) (diff)

comment:13 Changed 7 years ago by stephen

Suggested ChangeLog entry is:

XXX.	[bug]		stephen
	Fixed race condition in DHCP tests whereby the test program spawned
        a subprocess but attempted to read (without waiting) from the
        interconnecting pipe before the subprocess had written anything.  The
        lack of output from the subprocess was being interpreted as a test
        failure.
	(Trac #2410, git xxx)

comment:14 Changed 7 years ago by marcin

  • Owner changed from marcin to stephen

All ok. Please merge.

comment:15 Changed 7 years ago by stephen

  • Resolution set to fixed
  • Status changed from reviewing to closed

comment:16 Changed 7 years ago by stephen

Forgot to say: merged in commit f53e65cdceeb8e6da4723730e4ed0a17e4646579

Note: See TracTickets for help on using tickets.