Message boards : Number crunching : failing work units?

Author Message
bfromcolo
Joined: 3 Sep 13
Posts: 8
Credit: 30,883,031
RAC: 3,604
Message 51412 - Posted: 1 Feb 2019 | 22:48:58 UTC

I have 2 1700x systems running the same version of Mint 19, and the same NVIDIA driver 390.77. Both are current in terms of updates.

One has a GTX-960 4G and fails every work unit. The other has a GTX-950 and a GTX-750ti, and it seems to run fine.

I don't think the 960 is an issue as it has run tasks OK under Windows 10 in the same system.

I'm stumped.

rod4x4
Joined: 4 Aug 14
Posts: 72
Credit: 1,489,473,969
RAC: 2,154,034
Message 51413 - Posted: 2 Feb 2019 | 0:52:12 UTC - in response to Message 51412.
Last modified: 2 Feb 2019 | 1:29:50 UTC

The error the 960 is reporting is generally associated with overclocking. Have you tried reducing the clock on the gpu and memory?

Other things to check:
What temperature is the 960 running at? (Are the GPU fans working effectively?)
You could try swapping the cards between the hosts to check whether the issue follows the card or stays with the host.
Reseat the GPU in the PCIe slot. Is the power connector seated correctly in both the GPU and the power supply?
Is the power supply sufficient for the GPU, CPU and host accessories? Linux tends to run the GPU and CPU harder than Windows.

bfromcolo
Joined: 3 Sep 13
Posts: 8
Credit: 30,883,031
RAC: 3,604
Message 51414 - Posted: 2 Feb 2019 | 14:15:11 UTC

Thanks for responding.

The 960 is at the stock clocks for the model, EVGA SSC. I can experiment with dropping them. As I said it ran some GPU Grid work units fine in Windows, and I have run other projects and FAH on it in Linux. I will try moving the card to the other system.

Temperatures are good, <80C

Power supply is an inexpensive older CX430 Corsair as I recall. Should be adequate for a stock 1700x and 960, but given its age could be a possibility.

One other difference between the systems is the motherboards. The 960 is on a B450 board and the other system has the B350 chipset.

rod4x4
Joined: 4 Aug 14
Posts: 72
Credit: 1,489,473,969
RAC: 2,154,034
Message 51415 - Posted: 2 Feb 2019 | 14:57:13 UTC - in response to Message 51414.

Does the B450 have the latest BIOS? I don't have a B450 board, so I can't comment on any issues with it. It is a relatively new chipset (kudos to AMD for the new Zen platform), but I have seen big improvements on the B350 with new BIOS releases.
Moving the card between host PCs seems to be the next step in isolating the issue.
Good Luck!

Jim1348
Joined: 28 Jul 12
Posts: 688
Credit: 1,371,992,468
RAC: 1,065
Message 51416 - Posted: 2 Feb 2019 | 15:49:39 UTC
Last modified: 2 Feb 2019 | 15:50:30 UTC

It is probably the factory overclock. I have seen it on an EVGA GTX 970, though not as severely as this one.

By the way, I think that Linux pushes a card harder than Windows, due to the presence of WDDM on Windows, which slows things down. Work units that will fail on Linux might work on Windows at the same nominal GPU clock. So slow the clock down.

bfromcolo
Joined: 3 Sep 13
Posts: 8
Credit: 30,883,031
RAC: 3,604
Message 51428 - Posted: 4 Feb 2019 | 18:00:39 UTC

I tried a GTX-950 stock and got the same error, this card has done many work units in Windows 7 and Linux without issue in other systems.

Swapped in a newer more powerful power supply.

Re-installed Mint 19.1, fully updated.

Updated the BIOS to the latest.

Still fails 100% of the time.

Did some Einstein and Milkyway with 0 issues.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2031
Credit: 14,706,573,669
RAC: 1,413,751
Message 51429 - Posted: 4 Feb 2019 | 19:40:02 UTC - in response to Message 51428.
Last modified: 4 Feb 2019 | 19:44:11 UTC

I tried a GTX-950 stock and got the same error, this card has done many work units in Windows 7 and Linux without issue in other systems.
It has been shown many times that the GPUGrid app under Linux or Windows XP (especially with SWAN_SYNC applied) stresses the GPU more than any other project. That means you should use GPUGrid under Linux or Windows XP to validate the stability of your system, not other projects' apps under other OSes (i.e. Windows Vista and up).

Swapped in a newer more powerful power supply.
That's nice

Re-installed Mint 19.1, fully updated.
That's very precautionary.

Updated the BIOS to the latest.
That's ok (What BIOS btw? the card's? the MB's? both?)

Still fails 100% of the time.
Oh, that's because you forgot to heed the advice you've been given before, to reduce the clock speed of your GPU and/or increase its fan speed (to reduce its temperatures). It would be nice to know the error message of your tasks. The previous ones (on your GTX 960) failed with
# The simulation has become unstable. Terminating to avoid lock-up (1)
which is a clear sign that the GPU clocks are too high for the given temperature. The other cause of such errors is faulty memory on the card (a warped card, broken soldering of the chips), but some workunits can run for a very long time before they fail, which suggests the first explanation.

Did some Einstein and Milkyway with 0 issues.
If you read my post carefully, you should know by now that it doesn't matter.

mmonnin
Joined: 2 Jul 16
Posts: 255
Credit: 647,700,389
RAC: 79
Message 51430 - Posted: 4 Feb 2019 | 20:01:10 UTC

GPUGrid is actually not that hard on cards compared to other projects. Just watching the auto boost of cards shows this.

Two cards failing in the same PC says it's something about the PC rather than the cards.

Jim1348
Joined: 28 Jul 12
Posts: 688
Credit: 1,371,992,468
RAC: 1,065
Message 51431 - Posted: 4 Feb 2019 | 22:05:16 UTC - in response to Message 51430.
Last modified: 4 Feb 2019 | 22:09:05 UTC

GPUGrid is actually not that hard on cards compared to other projects. Just watching the auto boost of cards shows this.

I can assure you that Zoltan is correct.
I run (or have run) all the GPU projects, and know the difference.

rod4x4
Joined: 4 Aug 14
Posts: 72
Credit: 1,489,473,969
RAC: 2,154,034
Message 51432 - Posted: 4 Feb 2019 | 23:45:25 UTC - in response to Message 51428.

It would be interesting to know if the 960 will run at any power output.
You can limit the power of the GPU with the nvidia utility nvidia-smi.

The 960 will have an allowable range (using this command) of 60W to 140W
I am not sure of the 950 allowable range, but if you enter a value out of the allowable range, the command will let you know the correct values.

With a maximum power limit (default from factory) I find the 960 under Linux in BLOCKING mode uses around 92W (around 98W in SPIN mode)
Have you checked to see the GPU power draw when a task is running? (nvidia-smi without any switches will display this)

Try running the GPU at 80W limit with this command: nvidia-smi -pl 80

If this is successful, try playing with the Power limit to find the highest stable value. Would be interesting to hear your results.
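
A minimal sketch of the steps above, assuming a single GPU and a driver recent enough to report power limits. The helper name clamp_watts is made up for illustration, and the 60-140 W range is only the GTX 960 figure quoted above; other cards will differ.

```shell
#!/bin/sh
# Sketch only: keep a requested power limit inside the range the card allows.

# clamp_watts REQUESTED MIN MAX -> prints REQUESTED forced into [MIN, MAX]
# (hypothetical helper; whole watts only)
clamp_watts() {
    if [ "$1" -lt "$2" ]; then
        echo "$2"
    elif [ "$1" -gt "$3" ]; then
        echo "$3"
    else
        echo "$1"
    fi
}

# Only query the GPU when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    # Current draw, current limit, and the limits the driver will accept.
    nvidia-smi --query-gpu=name,power.draw,power.limit,power.min_limit,power.max_limit \
        --format=csv || echo "nvidia-smi query failed" >&2
fi

# Changing the limit needs root, so it is left commented out here:
# sudo nvidia-smi -pl "$(clamp_watts 80 60 140)"
```

The query line is just a more compact view of what plain nvidia-smi already shows; the clamp only saves a round trip when the requested value falls outside the allowed range.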

bfromcolo
Joined: 3 Sep 13
Posts: 8
Credit: 30,883,031
RAC: 3,604
Message 51433 - Posted: 5 Feb 2019 | 13:56:46 UTC

OK I have it running at 80W now, we will see what happens. Just looking occasionally in nvidia-smi I have not seen it over 71W. Temperature is steady at 69C.

I don't know that this has in fact slowed it down much. Power state is P2. The boost clock is bouncing between 1316 and 1366, factory boost spec on this card is 1342. Memory clock is 6008 which is max for P2, factory spec is 7010 max at P3.
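
Checking nvidia-smi by hand can miss short peaks. As a rough sketch, assuming the driver reports power.draw for this card, the readings can be logged and reduced to a peak value. The helper name peak_watts is made up for illustration.

```shell
#!/bin/sh
# Sketch only: find the highest power draw in a series of samples.

# peak_watts: reads one numeric wattage per line on stdin, prints the largest.
# (hypothetical helper; prints nothing if there is no input)
peak_watts() {
    awk 'NR == 1 || $1 + 0 > max { max = $1 + 0 } END { if (NR) print max }'
}

# Example use while a task is running (samples every 5 s for 10 minutes):
#   timeout 600 nvidia-smi --query-gpu=power.draw \
#       --format=csv,noheader,nounits -l 5 | peak_watts
#
# To watch power state and clocks alongside the draw:
#   nvidia-smi --query-gpu=timestamp,pstate,clocks.sm,clocks.mem,power.draw,temperature.gpu \
#       --format=csv -l 5
```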

Just FYI the 950 that failed in this system is the exact same one that ran fine in my other 1700x with nearly identical configuration beyond the chipset.

The motherboard is a Gigabyte Aorus B450M, and the BIOS update I applied moved it from F1 to F5.

bfromcolo
Joined: 3 Sep 13
Posts: 8
Credit: 30,883,031
RAC: 3,604
Message 51434 - Posted: 5 Feb 2019 | 21:19:42 UTC

It ran 3 small work units successfully at 80W, I pushed it up to 100W and will see what happens.

rod4x4
Joined: 4 Aug 14
Posts: 72
Credit: 1,489,473,969
RAC: 2,154,034
Message 51435 - Posted: 6 Feb 2019 | 2:10:51 UTC - in response to Message 51434.

This test demonstrates that the clock on the 960 is too high for GPUgrid tasks.
Limiting the power to 80W is simply a quick and easy way of limiting the GPU clock.
Also note that smaller tasks do not push the GPU as hard as larger tasks.

Playing with either the power limit or the clock settings (these can be set in the NVIDIA Settings GUI if you have "coolbits" enabled) will be your best bet for completing GPUgrid tasks.

Not really sure why only your B450 motherboard is exhibiting this behaviour. Does anyone else reading this post have any experience with AMD B450 (AM4 socket) motherboards and GPUgrid tasks? I know that the B350 boards are fine.

I would also recommend enabling nvidia persistence mode either via the nvidia persistence daemon or with: nvidia-smi -pm 1

Please note that all nvidia-smi commands must be reissued after a reboot.
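
Since the settings are lost on reboot, one option is a small script run as root at startup. This is a sketch only: the function name apply_gpu_settings, the LIMIT_WATTS and DRY_RUN variables and the --apply switch are made up for illustration, and 80 W is simply the value tried earlier in this thread.

```shell
#!/bin/sh
# Sketch only: re-apply persistence mode and the power limit after a reboot.
# Must run as root for the settings to take effect.

LIMIT_WATTS=${LIMIT_WATTS:-80}   # assumed default; use whatever proved stable

# apply_gpu_settings WATTS
# With DRY_RUN set to a non-empty value, the commands are printed, not executed.
apply_gpu_settings() {
    run=${DRY_RUN:+echo}
    $run nvidia-smi -pm 1      # enable persistence mode
    $run nvidia-smi -pl "$1"   # set the power limit in watts
}

# Only act when explicitly asked, so the file can be sourced safely.
if [ "${1:-}" = "--apply" ]; then
    apply_gpu_settings "$LIMIT_WATTS"
fi
```

How the script gets called at boot (a root cron @reboot entry, a systemd unit, or something else) depends on the distribution and is left open here. Running it with DRY_RUN=1 first shows what would be executed.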

If you want to extract further performance, you can enable SPIN mode on GPUgrid tasks using this guide posted by ServicEnginIC:
https://www.gpugrid.net/forum_thread.php?id=4813&nowrap=true#50824

rod4x4
Joined: 4 Aug 14
Posts: 72
Credit: 1,489,473,969
RAC: 2,154,034
Message 51436 - Posted: 6 Feb 2019 | 2:25:27 UTC - in response to Message 51434.

Just noticed you have more failures, presumably when you upped the power to 100W.

Perhaps try 90W, and then increment by 1W up or down from there.
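
The stepping rule can be written down as a tiny helper. This is only an illustration: the name next_limit and the pass/fail labels are made up, and deciding whether a given limit "passed" still means letting real tasks run to completion at that setting.

```shell
#!/bin/sh
# Sketch only: pick the next power limit to try, 1 W at a time.

# next_limit CURRENT_WATTS OUTCOME
#   OUTCOME "pass" -> try 1 W higher; "fail" -> back off 1 W.
next_limit() {
    case $2 in
        pass) echo $(( $1 + 1 )) ;;
        fail) echo $(( $1 - 1 )) ;;
        *)    echo "usage: next_limit WATTS pass|fail" >&2; return 1 ;;
    esac
}

# Example: tasks completed at 90 W, so try 91 W next.
#   sudo nvidia-smi -pl "$(next_limit 90 pass)"
```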

Beyond
Joined: 23 Nov 08
Posts: 1104
Credit: 6,101,732,079
RAC: 1,085
Message 51439 - Posted: 6 Feb 2019 | 14:57:36 UTC - in response to Message 51434.
Last modified: 6 Feb 2019 | 15:33:46 UTC

Not really sure why only your B450 motherboard is exhibiting this behaviour. Does anyone else reading this post have any experience with AMD B450 (AM4 socket) motherboards and GPUgrid tasks? I know that the B350 boards are fine.

This is undoubtedly related to the GPU, not the MB. I'm running 13x 1060 and 2x 1050Ti GPUs on 5x Ryzen 7 systems (2x X370 and 3x X470 MBs) with nary a hiccup (3 GPUs per system). This still leaves 13 threads per machine for running CPU projects at max speed.

bfromcolo
Joined: 3 Sep 13
Posts: 8
Credit: 30,883,031
RAC: 3,604
Message 51443 - Posted: 7 Feb 2019 | 18:14:40 UTC - in response to Message 51439.

Not really sure why only your B450 motherboard is exhibiting this behaviour. Does anyone else reading this post have any experience with AMD B450 (AM4 socket) motherboards and GPUgrid tasks? I know that the B350 boards are fine.

This is undoubtedly related to the GPU, not the MB. I'm running 13x 1060 and 2x 1050Ti GPUs on 5x Ryzen 7 systems (2x X370 and 3x X470 MBs) with nary a hiccup (3 GPUs per system). This still leaves 13 threads per machine for running CPU projects at max speed.


Maybe, but as discussed above I can put this GPU in another system and it will work, it even works in this system in Windows. And I moved a GTX-950 that was working fine in another system and experienced the same errors here. I am not sure we really understand what is happening here.

But it finished a Long work unit with no error at 90W.

rod4x4
Joined: 4 Aug 14
Posts: 72
Credit: 1,489,473,969
RAC: 2,154,034
Message 51444 - Posted: 7 Feb 2019 | 23:37:40 UTC - in response to Message 51443.
Last modified: 8 Feb 2019 | 0:34:33 UTC

The runtime for the long task was good. According to the chart on the Performance tab, your 960 is only 24 minutes slower than the fastest time for this card.

93W should be about the correct power draw for your card in BLOCKING mode.

If you set up SPIN mode, you could probably go up to around 98W as a power limit.

This should put your card amongst the fastest for this model.

After finding the best power limit setting, it would be interesting to see how this affects the runtimes of your other GPU projects. This will demonstrate which project gets the most out of the GPU.
