Message boards : Number crunching : Too many errors!!

flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,325,690,197
RAC: 3,008,290
Message 52173 - Posted: 3 Jul 2019 | 16:56:23 UTC

I suspect this is happening to everyone; something needs to be done right away.

mmonnin
Joined: 2 Jul 16
Posts: 255
Credit: 647,700,389
RAC: 79
Message 52178 - Posted: 3 Jul 2019 | 20:38:23 UTC

Maybe so but nothing happens at this project 'right away'.

Miklos M.
Joined: 16 Jun 12
Posts: 3
Credit: 6,220,550
RAC: 26,842
Message 52321 - Posted: 22 Jul 2019 | 1:02:02 UTC

I am getting many frozen units. They usually stall at around 6% or so: the elapsed time keeps running, but after 2-3 hours nothing is happening.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2031
Credit: 14,705,407,269
RAC: 1,376,423
Message 52322 - Posted: 22 Jul 2019 | 1:15:06 UTC - in response to Message 52321.

GPU 0 in your PC is quite hot (85°C = 185°F), which may cause workunits to freeze.
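
A minimal sketch of how such a temperature check could be scripted on a machine with NVIDIA GPUs. It assumes `nvidia-smi` is on the PATH; the 80°C warning threshold and all function names are illustrative assumptions, not GPUGrid or NVIDIA limits.

```python
import subprocess

HOT_THRESHOLD_C = 80  # assumed warning threshold, not an official limit


def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def parse_temperatures(csv_text):
    """Parse 'index, temperature' lines into a {gpu_index: temp_c} dict."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)
    return temps


def hot_gpus(temps, threshold_c=HOT_THRESHOLD_C):
    """Return the sorted indices of GPUs at or above the threshold."""
    return sorted(i for i, t in temps.items() if t >= threshold_c)


def read_temperatures():
    """Query nvidia-smi for per-GPU temperatures (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temperatures(result.stdout)
```

Calling `hot_gpus(read_temperatures())` lists the GPUs worth a closer look, and `c_to_f(85)` reproduces the 185°F figure quoted above.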

Miklos M.
Joined: 16 Jun 12
Posts: 3
Credit: 6,220,550
RAC: 26,842
Message 52324 - Posted: 22 Jul 2019 | 14:20:20 UTC

More errors and time wasted. I will wait until I have an answer or a correction is made. It's a shame, since I would like to crunch for medical science.

mmonnin
Joined: 2 Jul 16
Posts: 255
Credit: 647,700,389
RAC: 79
Message 52325 - Posted: 22 Jul 2019 | 23:33:49 UTC

Retvari Zoltan gave you the answer. No one else responded because it is probably correct, and it is the first thing that should be fixed.

kksplace
Joined: 4 Mar 18
Posts: 34
Credit: 59,327,675
RAC: 395,689
Message 52326 - Posted: 23 Jul 2019 | 0:17:54 UTC - in response to Message 52324.

Though temperature may have something to do with it, it may be other things as well.

GPUGrid long work units especially may 'freeze' as you describe if interrupted frequently. This can happen if you have BOINC Computing preferences set to "Suspend when computer is in use", "Suspend GPU computing when computer is in use", or "Suspend when non-BOINC CPU usage is above ____%". I had this problem when first starting GPUGrid about a year ago. I got help via the Forum to figure it out:

http://www.gpugrid.net/forum_thread.php?id=4699#48749

I would recommend straight up disabling the "Suspend...." BOINC features and letting the computer sort out how much to allocate when you are doing other things.

Also, even when I had the problem, I found that if I selected "Snooze GPU" (and waited; GPUGrid WUs take a while to actually 'snooze' on Windows) then unchecked "Snooze GPU", the majority of the time the WU would continue. Sure, still some wasted time, but at least the WU was completed.
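
For anyone who prefers a file over the BOINC Manager dialogs, the "Suspend..." settings above can also be expressed as a local preferences override. The sketch below only builds the XML text. The tag names are given as recalled from BOINC's `global_prefs_override.xml` format and should be checked against the client documentation before use; `build_override` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def build_override():
    """Build override XML that keeps BOINC computing while the PC is in use."""
    root = ET.Element("global_preferences")
    settings = {
        # Assumed tag names; verify against the BOINC client documentation.
        "run_if_user_active": "1",      # keep computing while the computer is in use
        "run_gpu_if_user_active": "1",  # keep GPU computing while the computer is in use
        "suspend_cpu_usage": "0",       # 0 = never suspend on non-BOINC CPU usage
    }
    for tag, value in settings.items():
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")
```

The returned string would be saved as `global_prefs_override.xml` in the BOINC data directory, after which the client has to re-read its local preferences. Changing the same three options in BOINC Manager has the same effect and is the safer route if in doubt.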

kksplace
Joined: 4 Mar 18
Posts: 34
Credit: 59,327,675
RAC: 395,689
Message 52327 - Posted: 23 Jul 2019 | 0:24:25 UTC - in response to Message 52322.
Last modified: 23 Jul 2019 | 0:25:18 UTC

Retvari Zoltan wrote: "GPU 0 in your PC is quite hot (85°C = 185°F), which may cause workunits to freeze."


My personal experience is that this temperature doesn't cause problems. I have been crunching GPUGrid for over a year now (24/7) with a 1070 in a case with airflow problems that crunches GPUGrid WUs at between 82 and 85 degrees. Feel free to check my recent tasks to confirm. I might be decreasing the life of the GPU (time will tell), but at least in my case everything is working OK.

(I am now looking at cooling options for this one. Because of my experience from my first build (Linux only), I have gained some confidence in trying modifications for the Dell machine with the 1070.)

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2031
Credit: 14,705,407,269
RAC: 1,376,423
Message 52328 - Posted: 23 Jul 2019 | 1:10:53 UTC - in response to Message 52326.
Last modified: 23 Jul 2019 | 1:13:49 UTC

kksplace wrote: "GPUGrid long work units especially may 'freeze' as you describe if interrupted frequently."

That's correct.
But his tasks (except for the last ones) haven't been suspended (interrupted) at all.
So this leaves the high GPU temperature as the most probable cause.
The cause could also be:
- overcommitted CPU
- inadequate PSU (maybe the PSU got too hot too)
- overclocked GPU memory
- overclocked GPU
- overclocked CPU memory
- overclocked CPU
- SLI cable
- burnt PCIe power connector(s)
- burnt M/B power connector (the 12V pins)
- interfering applications which use GPU acceleration (for example: browsers, video playback software, games)
So all the usual stuff.
Of course it could be something we can't think of, as we don't know his system.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2031
Credit: 14,705,407,269
RAC: 1,376,423
Message 52329 - Posted: 23 Jul 2019 | 1:25:32 UTC - in response to Message 52327.

Retvari Zoltan wrote: "GPU 0 in your PC is quite hot (85°C = 185°F), which may cause workunits to freeze."
kksplace wrote: "My personal experience is that this temperature doesn't cause problems."

Recent GPUs tolerate high temperatures much better than GPUs did 5-6 years ago, but the physics hasn't changed since then. Thermal expansion and contraction are the same as they always were, and so is the damage they cause. Capacitors are more advanced now than they were 5-6 years ago, but they still prefer lower temperatures.

kksplace wrote: "I have been crunching GPUGrid for over a year now (24/7) with a 1070 in a case with airflow problems that crunches GPUGrid WUs at between 82 and 85 degrees. Feel free to check my recent tasks to confirm."

I did. Some of your tasks hit 90°C.

kksplace wrote: "I might be decreasing the life of the GPU (time will tell),"

You are.

kksplace wrote: "but at least in my case everything is working OK."

For now.

flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,325,690,197
RAC: 3,008,290
Message 52336 - Posted: 23 Jul 2019 | 20:24:35 UTC

85°C is just too hot no matter which way you look at it; there's no margin for error. I would feel very uncomfortable running my GPUs that hot. Mine are in the mid-70s now, and I keep a very close watch on them.

PappaLitto
Joined: 21 Mar 16
Posts: 502
Credit: 4,270,783,201
RAC: 1,862,080
Message 52338 - Posted: 23 Jul 2019 | 20:36:34 UTC

GPUs are not only prone to errors when very hot, they also run less efficiently. Heated metal has a higher resistivity than colder metal, so the GPU will use more power at 85-90°C than at 60-70°C.
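
To put a rough number on the resistivity point, here is a small worked example using the standard linear approximation R(T) = R_ref * (1 + alpha * (T - T_ref)), with the textbook temperature coefficient for copper. It covers only the resistance of copper conductors and is not a model of a GPU's total power draw. The function names and the midpoint temperatures (65°C and 87.5°C) are assumptions chosen for illustration.

```python
ALPHA_COPPER = 0.00393  # per °C, temperature coefficient of copper referenced to 20 °C
T_REF_C = 20.0          # reference temperature for ALPHA_COPPER


def relative_resistance(temp_c, alpha=ALPHA_COPPER, t_ref=T_REF_C):
    """Resistance at temp_c relative to the resistance at t_ref (linear model)."""
    return 1.0 + alpha * (temp_c - t_ref)


def resistance_increase(t_cool_c, t_hot_c):
    """Fractional rise in conductor resistance going from t_cool_c to t_hot_c."""
    return relative_resistance(t_hot_c) / relative_resistance(t_cool_c) - 1.0


if __name__ == "__main__":
    # Midpoints of the 60-70 °C and 85-90 °C ranges mentioned in the post.
    rise = resistance_increase(65.0, 87.5)
    print(f"Copper resistance rises by about {rise:.1%}")
```

Under these assumptions the copper resistance is roughly 7 to 8 percent higher in the hotter range, which gives a sense of scale for the conductor losses.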
