
Message boards : Graphics cards (GPUs) : System crash when entering sleep while GPU task running

lukeu
Joined: 14 Oct 11
Posts: 26
Credit: 60,583,284
RAC: 31
Message 51530 - Posted: 20 Feb 2019 | 11:35:02 UTC

About once every week or two, when I put my PC to sleep at the end of the day it instead hangs for a few minutes with one (of 2) monitors on and blank, and then shuts off. I believe this is Windows 10 writing a "bug check" (crash dump) due to the kernel having crashed, as seen in the Event Log.

After watching it for a while, I'm starting to suspect that this is related to the GPU tasks. These always come back in BOINC as "Error during computation" after the reboot. (And the last 3 or so have been PABLO_v3.) Here is one task that reported failure:

http://www.gpugrid.net/result.php?resultid=20531108

This failed with the infamous:

The simulation has become unstable. Terminating to avoid lock-up (1)


Unfortunately I can't tell if this message was written before or after the reboot.

NOTES: no overclocking, last recorded temp seems reasonable at 67°C. I'm not running a bleeding-edge driver (399.07) but I note this was also occurring with an older driver (391.35).

Now putting my developer's hat on (although I'm not a systems developer), my hunch here would be that some kernel memory corruption has occurred, and when Windows comes to checkpoint its processes it barfs.

This is also causing data corruption outside of BOINC. A number of recently-written files end up "zero-padded", i.e. their content replaced with 0x00's. (Probably something to do with SSD / write caching.) I've had to manually repair my git repo a number of times now, for example. I'm starting to worry if other files might have been zapped that I just haven't noticed yet.
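
For anyone wanting to check for this kind of damage, here is a minimal Python sketch that walks a directory tree and lists non-empty files consisting only of 0x00 bytes. The function names and the optional age filter are illustrative assumptions, not part of BOINC or git, and the check will miss files that are only partly zeroed.

```python
import os
import time


def is_zero_filled(path, chunk_size=1 << 20):
    """Return True if the file is non-empty and every byte is 0x00."""
    if os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # Reached end of file without seeing a non-zero byte.
                return True
            # Stripping NULs leaves nothing only when the chunk is all 0x00.
            if chunk.strip(b"\x00"):
                return False


def find_zeroed_files(root, max_age_days=None):
    """Yield paths under root whose content is entirely zero bytes.

    max_age_days (hypothetical convenience parameter) restricts the scan
    to files modified within that many days.
    """
    cutoff = None
    if max_age_days is not None:
        cutoff = time.time() - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if cutoff is not None and os.path.getmtime(path) < cutoff:
                    continue
                if is_zero_filled(path):
                    yield path
            except OSError:
                # Unreadable or vanished file: skip it rather than abort.
                continue
```

Something like list(find_zeroed_files(some_directory, max_age_days=7)) would list candidates written in the last week, where some_directory is a placeholder. For a git repository's own objects, git fsck is still the more thorough check.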

I realise it's hard to separate cause & effect from the symptoms. For example, could the crash be due to something else, which then zeros out some of GPUGrid's files and that's why IT fails? From a low sample size of less than a dozen such crashes, I can only say that it never occurred when I didn't run BOINC (i.e. before this Winter).

If it would help, I can send the resulting crash-dump to a developer (230MB 7-Zipped). PM me if interested.

Zalster
Joined: 26 Feb 14
Posts: 175
Credit: 4,013,368,076
RAC: 8,321
Message 51534 - Posted: 20 Feb 2019 | 20:19:43 UTC - in response to Message 51530.

Quick question for you.

Did you tell BOINC to stop processing when you exit or did you just exit BOINC and leave the task crunching?

It sounds to me as if you are exiting BOINC but leaving the task crunching. Then, when you put the machine to sleep, the task is force-quit, which crashes the kernel and leaves you with the error during computation.

Just a thought..



Z
____________

lukeu
Joined: 14 Oct 11
Posts: 26
Credit: 60,583,284
RAC: 31
Message 51536 - Posted: 21 Feb 2019 | 7:03:41 UTC - in response to Message 51534.

> Did you tell BOINC to stop processing when you exit or did you just exit BOINC and leave the task crunching?

Neither, I leave both BOINC and the tasks running (so that my PCs can wake on a timer and pre-heat my office before I arrive).

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2031
Credit: 14,709,781,269
RAC: 1,512,501
Message 51538 - Posted: 21 Feb 2019 | 16:58:41 UTC - in response to Message 51530.
Last modified: 21 Feb 2019 | 17:08:26 UTC

> About once every week or two, when I put my PC to sleep at the end of the day it instead hangs for a few minutes with one (of 2) monitors on and blank, and then shuts off. I believe this is Windows 10 writing a "bug check" (crash dump) due to the kernel having crashed, as seen in the Event Log.
>
> After watching it for a while, I'm starting to suspect that this is related to the GPU tasks. These always come back in BOINC as "Error during computation" after the reboot. (And the last 3 or so have been PABLO_v3.) Here is one task that reported failure:
>
> http://www.gpugrid.net/result.php?resultid=20531108
>
> This failed with the infamous:
>
> The simulation has become unstable. Terminating to avoid lock-up (1)

GPUGrid tasks don't tolerate suspending. Even stopping a task could take a minute or so.
I always close BOINC and let it exit the science apps too, then watch the MSI Afterburner tray monitors (GPU usage and temperature) until they go down.
After that I restart / turn off my PC.
You should do the same to avoid such errors. (Until this bug gets fixed = forever.)
When your PC is turned on (by a timer or by you) from a complete shutdown, BOINC will resume the tasks it holds.
(You don't need to suspend them with the OS.)
If you have a password-protected user account, it won't log on automatically at startup unless you set Windows not to ask for a username and password at startup. You can specify which user account to log on at startup as follows:
Press Windows key + R
Type
control userpasswords2
Press [Enter] or click [OK], then uncheck the "Users must enter a user name and password to use this computer" checkbox, click [OK], type your username and password (twice), then click [OK].
If your PC is connected to a domain, you should specify your username as domain\username.
I also recommend turning off the "fast startup" option in power management.
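
The "quit BOINC, then wait for the GPU to go quiet" routine above could be scripted roughly as follows. This is only a sketch under assumptions: it assumes an NVIDIA card with nvidia-smi on the PATH, that boinccmd is reachable and allowed to talk to the local client, and the threshold, polling interval and function names are made up for illustration. It does not shut the PC down itself.

```python
import subprocess
import time


def parse_gpu_utilization(smi_output):
    """Parse nvidia-smi CSV output: one integer percentage per GPU line."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def read_gpu_utilization():
    """Ask nvidia-smi for the current utilization of every GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_utilization(out)


def wait_for_idle(read=read_gpu_utilization, threshold=5, stable_reads=3,
                  interval=2.0, timeout=180.0, sleep=time.sleep):
    """Return True once all GPUs stay at or below threshold percent for
    stable_reads consecutive polls, or False if timeout seconds pass first."""
    calm = 0
    waited = 0.0
    while waited <= timeout:
        if all(u <= threshold for u in read()):
            calm += 1
            if calm >= stable_reads:
                return True
        else:
            # Any busy reading resets the streak.
            calm = 0
        sleep(interval)
        waited += interval
    return False


def quit_boinc_and_wait():
    """Tell the BOINC client to quit, then wait for the GPU to settle."""
    subprocess.run(["boinccmd", "--quit"], check=True)
    return wait_for_idle()
```

If quit_boinc_and_wait() returns True, the GPU has been idle for a few consecutive readings and a restart or shutdown should be as safe as the manual procedure; a False result means the card was still busy after the timeout and is worth a look before powering off.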

Beyond
Joined: 23 Nov 08
Posts: 1104
Credit: 6,101,732,079
RAC: 890
Message 51540 - Posted: 21 Feb 2019 | 18:34:42 UTC - in response to Message 51538.

> GPUGrid tasks don't tolerate suspending. Even stopping a task could take a minute or so.
> I always close BOINC and let it exit the science apps too, then watch the MSI Afterburner tray monitors (GPU usage and temperature) until they go down.
> After that I restart / turn off my PC.

This is exactly the procedure that I use and it works. Never shut down the PC until the GPUGrid app stops. Avoid sleep and hibernate. Also turn off write caching on the BOINC drive. If you take these steps your errors should disappear. Listen to Zoltan when he gives GPUGrid advice. :-)

mmonnin
Joined: 2 Jul 16
Posts: 255
Credit: 647,700,389
RAC: 65
Message 51541 - Posted: 21 Feb 2019 | 18:49:00 UTC

I 3rd that. Shutting down the OS with BOINC running can result in computation errors, especially for GPU tasks.

lukeu
Joined: 14 Oct 11
Posts: 26
Credit: 60,583,284
RAC: 31
Message 51544 - Posted: 22 Feb 2019 | 9:50:17 UTC
Last modified: 22 Feb 2019 | 9:52:07 UTC

Thanks for the tips folks.

Note that I'm really interested in sleeping rather than shutting off the computer.

Another interesting data-point is that my 2nd Win 7 PC running short tasks (GTX 660) has had no such problems this winter.

> GPUGrid tasks don't tolerate suspending. Even stopping a task could take a minute or so.


Right, that's good to know. I think my next step will be to try a middle-ground approach: I'll use "Snooze GPU" and wait a few minutes for the GPU tasks to idle properly before putting my PC to sleep. I can watch for the GPU RAM dropping in Process Explorer to confirm.

I'd be willing to bet (at least a few fillér) that this will improve it. I'm guessing the OS is too impatient when suspending the apps, which doesn't let GPUGrid checkpoint safely. Worth a shot anyway.
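
A rough way to automate that "snooze, then confirm the app has gone" check is sketched below. It rests on several assumptions: that boinccmd's --set_gpu_mode never <seconds> behaves like the Manager's "Snooze GPU", that the science app's image name starts with "acemd", and that Windows' tasklist is available. The helper names are hypothetical. It only confirms that the process has exited, which may not be sufficient on its own.

```python
import csv
import io
import subprocess
import time


def running_images(tasklist_csv):
    """Extract image names from `tasklist /FO CSV /NH` output."""
    rows = csv.reader(io.StringIO(tasklist_csv))
    return [row[0] for row in rows if row]


def science_app_running(tasklist_csv, prefix="acemd"):
    """True if any running image name starts with prefix (case-insensitive)."""
    return any(name.lower().startswith(prefix)
               for name in running_images(tasklist_csv))


def get_tasklist():
    """Return the current process list as CSV text (Windows only)."""
    return subprocess.run(["tasklist", "/FO", "CSV", "/NH"],
                          capture_output=True, text=True, check=True).stdout


def wait_for_exit(list_tasks=get_tasklist, prefix="acemd", interval=5.0,
                  timeout=300.0, sleep=time.sleep):
    """Poll until no matching process remains; False if timeout passes first."""
    waited = 0.0
    while waited <= timeout:
        if not science_app_running(list_tasks(), prefix):
            return True
        sleep(interval)
        waited += interval
    return False


def snooze_gpu_and_wait(seconds=3600):
    """Suspend GPU work for `seconds`, then wait for the app to exit."""
    subprocess.run(["boinccmd", "--set_gpu_mode", "never", str(seconds)],
                   check=True)
    return wait_for_exit()
```

Calling snooze_gpu_and_wait() before sleeping the PC would return True once no acemd process is left, and False if one is still running after five minutes.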

> Also turn off write caching on the BOINC drive.


That would be safer, yes. However, it would hurt my code compile times, and even if I segregated BOINC to a separate SSD I'm not sure the main drive would be saved from a hard kernel crash, so I think I'd need to do it on both of my SSDs. Fingers crossed suspending GPUGrid will work! I'll report back.

(FYI: you know I even went with an SSD specifically because they claimed its super-capacitors would protect data "in flight" during a sudden power outage. Has never bloody worked. I can say that with confidence, since some power company workmen repeatedly shut the power off to my building without notice a few weeks back. Creating more git-repo repair jobs for me. Fun times.)

lukeu
Joined: 14 Oct 11
Posts: 26
Credit: 60,583,284
RAC: 31
Message 51545 - Posted: 22 Feb 2019 | 10:02:51 UTC

Clarification: I say 'checkpoint' in the sense of Windows recording the state of the processes rather than the BOINC term for tasks saving their progress. I've just realised that could be ambiguous. (Hmm, now do processes need to be checkpointed for sleep, or is that just for hibernate, I wonder...)

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2031
Credit: 14,709,781,269
RAC: 1,512,501
Message 51552 - Posted: 22 Feb 2019 | 23:25:51 UTC - in response to Message 51544.

> (FYI: you know I even went with an SSD specifically because they claimed its super-capacitors would protect data "in flight" during a sudden power outage. Has never bloody worked. I can say that with confidence, since some power company workmen repeatedly shut the power off to my building without notice a few weeks back. Creating more git-repo repair jobs for me. Fun times.)

That feature can't save the data from the write cache of the OS. That's why disabling write caching, or a good UPS, is recommended. (Is it a Samsung PRO SSD?) These SSDs have a relatively large DDR RAM cache, so you should try disabling write caching in the OS and check how much it hurts code compilation times. Perhaps you lose less time than the time spent repairing things.

lukeu
Joined: 14 Oct 11
Posts: 26
Credit: 60,583,284
RAC: 31
Message 51571 - Posted: 25 Feb 2019 | 8:59:45 UTC - in response to Message 51552.

Yeah, had a UPS. It died, so I was hoping this SSD would suffice instead of more lead-acid batteries.

> That feature can't save the data from the write cache of the OS.

Indeed. I guess I'd hoped that the "safer" level of write caching would be good enough with the SSD's capacitors. But I suppose my hopes for this complex pipeline involving multiple vendors were too high. :)

> so you should try disabling write caching in the OS

Promising result! After disabling all write caching, a Java full compile goes from 48 -> 52s.

But then: I've got System/Data partitions on my Samsung EVO 960 (non-pro I think) and a small Intel SSD as a temp drive. I've already configured it so the bulk of writes during a compile go to the Intel, so I only need to disable write caching on the Samsung. Looks like this will: protect my data better, keep my 48s compile times, and hopefully avoid a new UPS still. Win win win.

(I've got a similar amount of C++, which takes ~9m to build, but I think that would turn out similarly.)
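
As a quick worked check of those numbers, here is a small Python sketch. The projection for the C++ build is purely an assumption that it would slow down by the same ratio as the Java build with all write caching disabled; it has not been measured.

```python
def slowdown_percent(before_s, after_s):
    """Relative slowdown of a build, in percent."""
    return (after_s - before_s) / before_s * 100.0


def projected_seconds(baseline_s, before_s, after_s):
    """Scale another build's time by the same before/after ratio (assumption)."""
    return baseline_s * after_s / before_s


java_slowdown = slowdown_percent(48, 52)       # about 8.3 percent
cpp_projected = projected_seconds(9 * 60, 48, 52)  # 585 s, i.e. 9 min 45 s
```

So the measured Java penalty is roughly 8%, and under the same-ratio assumption the ~9 minute C++ build would gain about 45 seconds.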

> Perhaps you lose less time than the time spent repairing things.

True. I guess my bigger concern is that as compile times increase I become more likely to lose focus, flick over and check the news, and if something catches my interest "Poof!" 15 minutes is gone. :-D

lukeu
Joined: 14 Oct 11
Posts: 26
Credit: 60,583,284
RAC: 31
Message 51598 - Posted: 6 Mar 2019 | 7:52:24 UTC

A quick update: I've since had two such hard shutdowns out of 8 attempts at sleeping.

I've not observed any zeroed-out files, which is the main thing... so far so good with write caching disabled on the data drive.

(I found one file corrupt, but Eclipse regenerated it clean before I could see what was wrong. This may have just been truncated though, which would be expected & fine. I think Eclipse auto-saves this every 5 minutes, so it could have just been bad luck with the timing.)

Also, one time GPUGrid resumed without a computation error, which I think is a first. However, today it still had a computation error, so I guess the "Snooze GPU" plan must be flawed (which I find odd, since the 'acemd-922-80.exe' application did completely exit before I put the PC to sleep last night). *shrug* This is not so important anyway.
