Bug 71380 - NRWT incorrectly associates crash logs with tests
Summary: NRWT incorrectly associates crash logs with tests
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Dirk Pranke
URL:
Keywords:
Depends on: 73942 81575 81580 81586 81595 81598 81600 81603
Blocks: 64491
  Show dependency treegraph
 
Reported: 2011-11-02 11:34 PDT by Simon Fraser (smfr)
Modified: 2012-03-21 16:39 PDT (History)
10 users (show)

See Also:


Attachments
Wait for Crash Reporter to finish even when it lets the crashed process die quickly (5.41 KB, patch)
2011-12-06 09:08 PST, Adam Roben (:aroben)
no flags Details | Formatted Diff | Diff
Wait for Crash Reporter to finish even when it lets the crashed process die quickly (6.18 KB, patch)
2011-12-06 11:20 PST, Adam Roben (:aroben)
no flags Details | Formatted Diff | Diff
Patch (31.71 KB, patch)
2012-03-15 20:28 PDT, Dirk Pranke
no flags Details | Formatted Diff | Diff
more testing (42.49 KB, patch)
2012-03-16 15:21 PDT, Dirk Pranke
no flags Details | Formatted Diff | Diff
Roll-up patch (49.07 KB, patch)
2012-03-16 20:02 PDT, Dirk Pranke
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Comment 1 Eric Seidel (no email) 2011-11-02 13:13:45 PDT
Interesting.  Sounds bad.  Will look shortly.
Comment 2 Adam Roben (:aroben) 2011-12-06 07:12:42 PST
It looks like we only wait for ReportCrash to finish when a test times out. We don't wait if the test crashed quickly.

The code is all in this area: http://trac.webkit.org/browser/trunk/Tools/Scripts/webkitpy/layout_tests/port/server_process.py?rev=98040#L181

Note that we only call self._executive.wait_newest(self._port.is_crash_reporter) from within _handle_timeout.
Comment 3 Adam Roben (:aroben) 2011-12-06 07:32:39 PST
It seems that self.crashed and self._check_for_crash are about the Python subprocess crashing, not about DRT crashing.
Comment 4 Adam Roben (:aroben) 2011-12-06 07:37:35 PST
(In reply to comment #3)
> It seems that self.crashed and self._check_for_crash are about the Python subprocess crashing, not about DRT crashing.

My mistake. self.crashed/self._check_for_crash are about ServerProcess's self._proc. self._proc corresponds to DRT/WTR.
Comment 5 Adam Roben (:aroben) 2011-12-06 08:02:34 PST
I think we probably need something like ORWT's processIsCrashing functionality.
Comment 6 Adam Roben (:aroben) 2011-12-06 09:08:14 PST
Created attachment 118060 [details]
Wait for Crash Reporter to finish even when it lets the crashed process die quickly
Comment 7 Adam Roben (:aroben) 2011-12-06 09:09:22 PST
Comment on attachment 118060 [details]
Wait for Crash Reporter to finish even when it lets the crashed process die quickly

This is only part of the story. Waiting for the newest Crash Reporter instance isn't enough, since we can have multiple Crash Reporter instances going at once. We should probably teach NRWT how to double-check that it's chosen the right crash log, perhaps by comparing PIDs.
Comment 8 Adam Roben (:aroben) 2011-12-06 09:10:38 PST
BTW, this bug is very easy for me to reproduce:

1. Insert an abort() in DumpRenderTree.mm:runTest
2. run-webkit-tests --no-http --exit-after-n-crashes-or-timeouts=2
Comment 9 Adam Roben (:aroben) 2011-12-06 09:23:34 PST
(In reply to comment #7)
> (From update of attachment 118060 [details])
> Waiting for the newest Crash Reporter instance isn't enough, since we can have multiple Crash Reporter instances going at once.

Similarly, always choosing the newest crash log for a given process name isn't enough, since we can have multiple crash logs written for DRT before we go to find the crash log itself.
Comment 10 Adam Roben (:aroben) 2011-12-06 11:20:10 PST
Created attachment 118075 [details]
Wait for Crash Reporter to finish even when it lets the crashed process die quickly
Comment 11 Adam Roben (:aroben) 2011-12-06 12:05:06 PST
Comment on attachment 118075 [details]
Wait for Crash Reporter to finish even when it lets the crashed process die quickly

Landed as r102162.
Comment 12 Adam Roben (:aroben) 2011-12-06 12:06:41 PST
My plan is to have ServerProcess collect the crash log itself (after waiting as needed), then pass it along to SingleTestWriter as part of the DriverOutput, rather than having SingleTestWriter find the crash log after the fact as it does today.
Comment 13 Adam Roben (:aroben) 2011-12-06 12:17:24 PST
Unfortunately, ServerProcess doesn't (and shouldn't) know about WebProcess crashes. So maybe this needs to be handled at the WebKitDriver level?
Comment 14 Adam Roben (:aroben) 2011-12-06 12:17:48 PST
We might need to change "#CRASHED - WebProcess\n" to "#CRASHED - WebProcess <pid>\n".
Comment 15 Eric Seidel (no email) 2011-12-06 12:27:39 PST
(In reply to comment #14)
> We might need to change "#CRASHED - WebProcess\n" to "#CRASHED - WebProcess <pid>\n".

That sounds like a good idea. :)
Comment 16 Eric Seidel (no email) 2011-12-19 10:49:39 PST
Comment on attachment 118075 [details]
Wait for Crash Reporter to finish even when it lets the crashed process die quickly

Cleared Dirk Pranke's review+ from obsolete attachment 118075 [details] so that this bug does not appear in http://webkit.org/pending-commit.
Comment 17 Dirk Pranke 2012-03-15 15:36:58 PDT
i will take a look ...
Comment 18 Dirk Pranke 2012-03-15 20:28:09 PDT
Created attachment 132184 [details]
Patch
Comment 19 Dirk Pranke 2012-03-15 20:33:21 PDT
Okay, I'm working on this, but there's definitely some ways to go as I continue to characterize how things crash and how we look for them. I *think* (am not positive) that we were being way too conservative in waiting for ReportCrash to complete.

If anyone at Apple knew exactly how ReportCrash runs (and when it exits) and felt like telling me, that would be nice :) there's not a lot of info on the interwebs. From what I can tell, at least when we're blocked on a read from DRT, we don't even find out that a process has crashed until after  the report has been written to disk. Interestingly, though, the ReportCrash process sticks around for another 20-30 seconds, basically doing nothing (I haven't attached to it to see if it is doing anything). 

There are certainly other paths that I'm going to need to investigate, in particular figuring out when to wait in a crashed subprocess (haven't tested WTR at all yet).

Any feedback on this patch would be welcome; does this fix crash log issues with DRT on SnowLeopard and/or Lion for y'all if running with one DRT at a time, at least? How about multiple DRTs? (Haven't tested that yet, either).
Comment 20 Dirk Pranke 2012-03-16 15:21:46 PDT
Created attachment 132394 [details]
more testing
Comment 21 Dirk Pranke 2012-03-16 20:02:56 PDT
Created attachment 132438 [details]
Roll-up patch
Comment 22 Dirk Pranke 2012-03-16 20:11:07 PDT
Comment on attachment 132438 [details]
Roll-up patch

View in context: https://bugs.webkit.org/attachment.cgi?id=132438&action=review

Okay, getting close. Still needs more unit tests, but this should fix the bug and be worth checking out. I haven't tested it on Lion yet, either ...

> Tools/Scripts/webkitpy/common/system/crashlogs.py:69
> +                errors += "ERROR: Failed to read '%s': %s\n" % (path, str(e))

I'm not sure that we really want to include this in the crash log, but it's useful for debugging race conditions in finding the right crash log and waiting long enough for now ... I'll probably also move the crashlog.py changes into a separate patch.

> Tools/Scripts/webkitpy/layout_tests/controllers/test_result_writer.py:43
> +    writer = TestResultWriter(port._filesystem, port, root_output_dir, test_name)

I'll move the filesystem member variable cleanup into a different patch ...

> Tools/Scripts/webkitpy/layout_tests/models/test_expectations.py:662
> +                                CRASH: ('crash', 'crashes'),

this change just removes the DRT string since it isn't right for Webkit2. I'm not sure if this whole block even needs to exist, or if we can just pull it from the TestFailures.

> Tools/Scripts/webkitpy/layout_tests/models/test_failures.py:-88
> -    @staticmethod

I'll probably break this into a separate patch as well ...

> Tools/Scripts/webkitpy/layout_tests/port/base.py:1052
> +        # FIXME: Does it make sense to push this logic into webkitpy.common.system.crashlogs ?

Looking for feedback on this in particular ...

> Tools/Scripts/webkitpy/layout_tests/port/chromium.py:610
> +            crash=crash, crashed_process_name=crashed_process_name, crashed_pid=crashed_pid, crash_log=crash_log,

I'm a bit waffly on whether we really need the process name and the pid *in addition to* the crash log, but it's definitely helpful for now to ensure we're correlating pid to crash log properly.

> Tools/Scripts/webkitpy/layout_tests/port/mac.py:164
> +        # FIXME: Should most of this be moved into CrashLogs()?

thoughts?

> Tools/WebKitTestRunner/TestController.cpp:812
>      if (!m_didPrintWebProcessCrashedMessage) {

Any Apple/Mac people wanna comment on this part? It looks like WKPageGetProcessIdentifier is mac-only; should we get other ports to implement something like this?
Comment 23 Dirk Pranke 2012-03-19 16:36:36 PDT
Comment on attachment 132438 [details]
Roll-up patch

clearing r? since I'm breaking this into multiple patches ...
Comment 24 Dirk Pranke 2012-03-19 18:08:27 PDT
oookay, I think the 7 listed bugs are about the size of it :) (In fairness, a couple of them are barely-related nits that I stumbled over while working on this).
Comment 25 Dirk Pranke 2012-03-21 16:39:05 PDT
closing as all sub-tasks have been committed as of r111619. let me know if you see any more issues.