I submitted this report so Sun, but I haven't heard back from them. All I have at this point is a review ID. The bug involves object monitors not being released properly in code with a large number of local variables when running under the HotSpot Server VM. It appears the bug gets triggered when the VM compiles the Java bytecode to native code.
I was able to reproduce it on both Linux and Windows with both the original and latest versions of Java 6 (at the time of the bug report, update 3) using the Server VM and compilation enabled. If the client VM was used, the bug was not reproducible. If compilation was disabled (-Xint), the bug was not reproducible. I tried it with the latest version of Java 5 (update 13), and the bug was not present.
Here's the report I submitted.
Date Created: Tue Oct 09 21:34:27 MDT 2007 Type: bug Customer Name: Paul Miner SDN ID: paul.miner status: Waiting Category: hotspot Subcategory: compiler2 Company: TGW release: 6 hardware: x86 OSversion: linux priority: 4 Synopsis: Hotspot server leaves synchronized block with monitor in bad state Description: FULL PRODUCT VERSION : java version "1.6.0_03" Java(TM) SE Runtime Environment (build 1.6.0_03-b05) Java HotSpot(TM) Server VM (build 1.6.0_03-b05, mixed mode) Also present in the original 1.6.0 b105 FULL OS VERSION : Linux 2.4.x Linux 2.6.x Windows XP Windows 2003EXTRA RELEVANT SYSTEM CONFIGURATION :
A DESCRIPTION OF THE PROBLEM :
I stumbled on this problem when a thread starting blocking on a synchronized method, but it was the only thread attempting to obtain the monitor on the object. It manifests as a monitor that one thread says will say another thread is locking, but the other thread won't be locking it. If the thread that is blamed exits, it then blames another random thread.
For reasons I don't understand, two important conditions are needed: a loop with many iterations (I was guessing this had to do with the compile threshold), and a lot of local variables. If the loop with many iterations is executed within a synchronized block inside a method with a lot of local variables, when it leaves the synchronized block, something is wrong with the monitor. If it enters a second synchronized block on the same object, the management package will show it holding the monitor after it exits the block, but with a null stack frame. If the thread just continues on and a different thread tries to obtain the monitor, the second thread will say the first thread still holds the monitor, although the management package shows that it does not. If the first thread exits, it will then say a different thread holds the monitor, sometimes even itself.
I have included two programs demonstrating both variations of the problem.
THE PROBLEM WAS REPRODUCIBLE WITH -Xint FLAG: No
THE PROBLEM WAS REPRODUCIBLE WITH -server FLAG: Yes
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Run the included programs, LockTest and LockTest2 using the server VM
EXPECTED VERSUS ACTUAL BEHAVIOR :
If the bug were not present, it would exit the main() function. Instead, the second thread blocks forever waiting for the monitor to be released.
The main() function runs the faulty code in a thread (call it thread A). After the function runs, thread A sleeps for 10 seconds while main starts a second thread (thread B) to run the faulty code a second time. While waiting for the second thread to finish, main() prints the state of thread B and the lock it is waiting on. For the first 10 seconds of its life, thread B is waiting on thread A. When thread A finishes sleeping and exits, thread B will claim another thread holds the monitor (ironically, it often blames itself, but sometimes the compiler thread).
Using the client VM, the message "Exiting main()" should appear in the messages, and the program should end after the two threads finish sleeping for 10 seconds. This is the expected behavior.
I first noticed the problem running under 1.6.0 b105, and updated to 1.6.0_03, but the problem was still present. I've tested under the various OSes listed above. I tried the program (sans management package stuff) under 1.5.0_04 on an XP machine, and it didn't exhibit the bug. Just to double-check, I ran the same class file under 1.6.0_03 and it failed.
ERROR MESSAGES/STACK TRACES THAT OCCUR :
Output for LockTest: main: About to start first thread: Thread-0 Thread-0: Entering test() Thread-0: Checkpoint #0, number of monitors: 1 Thread-0: MonitorInfo[0]=java.lang.Object@14b7453, StackDepth=3, frame=LockTest.test(LockTest.java:28) Thread-0: Checkpoint #1, number of monitors: 0 Thread-0: Checkpoint #2, number of monitors: 1 Thread-0: MonitorInfo[0]=java.lang.Object@14b7453, StackDepth=3, frame=LockTest.test(LockTest.java:35) Thread-0: Checkpoint #3, number of monitors: 1 Thread-0: MONITOR HELD AFTER EXITING SYNCHRONIZED BLOCK! Thread-0: MonitorInfo[0]=java.lang.Object@14b7453, StackDepth=-1, frame=No Frame Thread-0: Leaving test() main: About to start second thread: Thread-1 Thread-1: Entering test() Thread-1 state: BLOCKED, LockName: java.lang.Object@14b7453, LockOwnerName: Thread-0 Thread-1 state: BLOCKED, LockName: java.lang.Object@14b7453, LockOwnerName: Thread-0 Thread-1 state: BLOCKED, LockName: java.lang.Object@14b7453, LockOwnerName: Thread-0 Thread-1 state: BLOCKED, LockName: java.lang.Object@14b7453, LockOwnerName: Thread-0 Thread-1 state: BLOCKED, LockName: java.lang.Object@14b7453, LockOwnerName: Thread-0 Thread-1 state: BLOCKED, LockName: java.lang.Object@14b7453, LockOwnerName: Thread-0 Thread-1 state: BLOCKED, LockName: java.lang.Object@14b7453, LockOwnerName: Thread-0 Thread-1 state: BLOCKED, LockName: java.lang.Object@14b7453, LockOwnerName: Thread-0 Thread-1 state: BLOCKED, LockName: java.lang.Object@14b7453, LockOwnerName: Thread-0 Thread-0: exiting Thread-1 state: BLOCKED, LockName: java.lang.Object@14b7453, LockOwnerName: Thread-1 Thread-1 state: BLOCKED, LockName: java.lang.Object@14b7453, LockOwnerName: Thread-1 Thread-1 state: BLOCKED, LockName: java.lang.Object@14b7453, LockOwnerName: Thread-1 Thread-1 state: BLOCKED, LockName: java.lang.Object@14b7453, LockOwnerName: Thread-1 Thread-1 state: BLOCKED, LockName: java.lang.Object@14b7453, LockOwnerName: Thread-1 ... Output for LockTest2: main: About to start first thread: Thread-0 Thread-0: Number of monitors: 0 Thread-0: Entering test() Thread-0: Leaving test(), n=-2014260032 Thread-0: Number of monitors: 0 main: About to start second thread: Thread-1 Thread-1: Number of monitors: 0 Thread-1 state: BLOCKED, LockName: LockTest2@1034bb5, LockOwnerName: Thread-0 Thread-1 state: BLOCKED, LockName: LockTest2@1034bb5, LockOwnerName: Thread-0 Thread-1 state: BLOCKED, LockName: LockTest2@1034bb5, LockOwnerName: Thread-0 Thread-1 state: BLOCKED, LockName: LockTest2@1034bb5, LockOwnerName: Thread-0 Thread-1 state: BLOCKED, LockName: LockTest2@1034bb5, LockOwnerName: Thread-0 Thread-1 state: BLOCKED, LockName: LockTest2@1034bb5, LockOwnerName: Thread-0 Thread-1 state: BLOCKED, LockName: LockTest2@1034bb5, LockOwnerName: Thread-0 Thread-1 state: BLOCKED, LockName: LockTest2@1034bb5, LockOwnerName: Thread-0 Thread-1 state: BLOCKED, LockName: LockTest2@1034bb5, LockOwnerName: Thread-0 Thread-0: exiting Thread-1 state: BLOCKED, LockName: LockTest2@1034bb5, LockOwnerName: CompilerThread1 Thread-1 state: BLOCKED, LockName: LockTest2@1034bb5, LockOwnerName: CompilerThread1 Thread-1 state: BLOCKED, LockName: LockTest2@1034bb5, LockOwnerName: CompilerThread1 Thread-1 state: BLOCKED, LockName: LockTest2@1034bb5, LockOwnerName: CompilerThread1 Thread-1 state: BLOCKED, LockName: LockTest2@1034bb5, LockOwnerName: CompilerThread1 Thread-1 state: BLOCKED, LockName: LockTest2@1034bb5, LockOwnerName: CompilerThread1 Thread-1 state: BLOCKED, LockName: LockTest2@1034bb5, LockOwnerName: CompilerThread1 ...REPRODUCIBILITY :
---------- BEGIN SOURCE ----------
---------- END SOURCE ----------
CUSTOMER SUBMITTED WORKAROUND :
I found this out while isolating the bug that it was sensitive to the number of local variables. When I removed code while trying to come up with a test case, I usually removed the local variables with them, causing the bug to disappear (to my frustration). After much gnashing of teeth, I found out that this and the large loops were the key pieces. Reducing the number of local variables makes the problem go away. For example, on a Windows 2003 machine I tried it on, commenting out the lines declaring localVars 30-49 fixed it, and on a Linux 2.4 machine, commenting out the line declaring localVars 40-49 fixed it.
I suppose running with -Xint would also fix it, but changing the code is a better solution if at all possible.