Message boards : Graphics cards (GPUs) : Don't understand why it is failing

~Stack~
Joined: 10 Dec 09
Posts: 7
Credit: 77,610,772
RAC: 0
Message 44203 - Posted: 16 Aug 2016 | 23:28:08 UTC

Greetings,

I have recently acquired another Nvidia GPU after a long absence where I only crunched projects with CPU work. All of the other projects are doing great with the GPU. GPUGRID, however, is not. The jobs are failing with this:

<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process exited with code 197 (0xc5, -59)
</message>
<stderr_txt>
# SWAN Device 0 :
# Name : GeForce GTX 470
# ECC : Disabled
# Global mem : 1279MB
# Capability : 2.0
# PCI ID : 0000:02:00.0
# Device clock : 1250MHz
# Memory clock : 1701MHz
# Memory width : 320bit
#SWAN: FATAL: cannot find image for module [.nonbonded.cu.] for device version 200

</stderr_txt>
]]>


I have read and searched online but have not found anything that is relevant to my case.

Can someone point out where things might be going bad please?

Here is the host: https://www.gpugrid.net/show_host_detail.php?hostid=362128

Also, another bit of relevant data if you are digging through the host logs: I picked up two NVidia cards, a 470 and a 460. After a few weeks of crunching, the 460 crapped out hard core yesterday and shat out all of the GPU work the computer had queued up. It will work for an hour or two after a reboot, then die again. It has since been removed. It was only $30, what can I expect? *sigh* :-) Anyway, the point is to ignore the 460 workloads; they are not relevant. The 470, however, is doing great on other projects.

Thanks!

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 851
Message 44215 - Posted: 17 Aug 2016 | 14:22:56 UTC - in response to Message 44203.

#SWAN: FATAL: cannot find image for module [.nonbonded.cu.] for device version 200

This error message says that the application does not include a compiled image for compute capability 2.0 (called "version 200" above) GPUs.
As the GPUGrid app works with CC 3.0 to CC 5.2 cards, your card is too old for this project.
I don't recommend crunching on these very old cards, as their energy efficiency is terrible compared to recent cards.
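
The explanation above can be illustrated with a small Python sketch (an editorial illustration, not part of the GPUGrid application). It pulls the device version out of the SWAN error line and compares it with the CC 3.0 to CC 5.2 range quoted in this reply. The function names are hypothetical, and the decoding of "version 200" as major*100 + minor*10 is an assumption inferred from the values seen in this thread.

```python
import re

# Range quoted in this thread for the GPUGrid app at the time of posting.
SUPPORTED_MIN = (3, 0)
SUPPORTED_MAX = (5, 2)

# Matches the SWAN error line shown in the task's stderr output.
FATAL_RE = re.compile(
    r"cannot find image for module \[(.+?)\] for device version (\d+)"
)

def parse_device_version(stderr_text):
    """Return (major, minor) from a SWAN 'cannot find image' line, or None."""
    match = FATAL_RE.search(stderr_text)
    if match is None:
        return None
    version = int(match.group(2))
    # Assumption: 'device version 200' encodes CC 2.0 as major*100 + minor*10.
    return version // 100, (version % 100) // 10

def is_supported(cc):
    """True if the (major, minor) tuple lies within the quoted range."""
    return SUPPORTED_MIN <= cc <= SUPPORTED_MAX

line = ("#SWAN: FATAL: cannot find image for module [.nonbonded.cu.] "
        "for device version 200")
cc = parse_device_version(line)
print(cc, is_supported(cc))  # prints: (2, 0) False
```

Tuples compare element by element, so `(2, 0)` falls below `(3, 0)` and the GTX 470 is reported as unsupported.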

~Stack~
Joined: 10 Dec 09
Posts: 7
Credit: 77,610,772
RAC: 0
Message 44219 - Posted: 17 Aug 2016 | 21:44:12 UTC - in response to Message 44215.

Thanks for that info! I tried searching for that message but never found a good explanation of what it was trying to tell me.


Out of curiosity, BOINC tells GPUGRID what card I have. Why doesn't GPUGRID throw an error /before/ it sends the work? I feel kinda bad that I took up work that just errored out like that.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 851
Message 44221 - Posted: 17 Aug 2016 | 23:50:14 UTC - in response to Message 44219.

Thanks for that info! I tried searching for that message but never found a good explanation of what it was trying to tell me.

There are a couple of useful threads for novices in the FAQ, for example:
FAQ - Recommended GPUs for GPUGrid crunching
However, this error message is not listed there.
Try the advanced search, and extend its time limit to get more results.

Out of curiosity, BOINC tells GPUGRID what card I have. Why doesn't GPUGRID throw an error /before/ it sends the work? I feel kinda bad that I took up work that just errored out like that.

That's just sloppy business on GPUGrid's part; you should not feel bad.
This behavior applies to the brand new GTX 10X0 cards as well: these are CC6.1 cards and fail every workunit with the same error, yet the GPUGrid scheduler will still send them work (until their daily quota reaches 0).
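
As a rough illustration of the kind of check being asked for, the Python sketch below filters hosts by compute capability before any work is dispatched. It is a hypothetical stand-in: `should_send_work` is not a BOINC or GPUGrid function, and the default range is simply the CC 3.0 to CC 5.2 figure quoted earlier in this thread.

```python
def should_send_work(host_cc, min_cc=(3, 0), max_cc=(5, 2)):
    """Hypothetical pre-dispatch check.

    Returns True only if the host GPU's compute capability, given as a
    (major, minor) tuple, lies within the range the app was built for.
    """
    return min_cc <= host_cc <= max_cc

# Compute capabilities as stated in this thread, plus one in-range value
# (CC 5.2) added purely for contrast.
hosts = {
    "GeForce GTX 470": (2, 0),
    "GTX 10X0": (6, 1),
    "example CC 5.2 card": (5, 2),
}

for name, cc in hosts.items():
    verdict = "send" if should_send_work(cc) else "skip"
    print(f"{name}: CC {cc[0]}.{cc[1]} -> {verdict}")
```

With a check like this on the server side, the GTX 470 and the GTX 10X0 cards would be skipped instead of receiving workunits that are certain to fail.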

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 44228 - Posted: 18 Aug 2016 | 22:35:33 UTC - in response to Message 44221.

Thanks, I've updated the FAQ.

vseven
Joined: 2 Apr 18
Posts: 2
Credit: 1,132,200
RAC: 0
Message 49296 - Posted: 17 Apr 2018 | 12:22:02 UTC
Last modified: 17 Apr 2018 | 12:22:54 UTC

Just an FYI: I get the same error on a shiny new Tesla V100 using Ubuntu 16.04 and NVidia driver 390.30, which is fairly new:

<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process exited with code 197 (0xc5, -59)
</message>
<stderr_txt>
# CUDA Synchronisation mode: BLOCKING
# SWAN Device 0 :
# Name : Tesla V100-PCIE-16GB
# ECC : Enabled
# Global mem : 16160MB
# Capability : 7.0
# PCI ID : 94A8:00:00.0
# Device clock : 1380MHz
# Memory clock : 877MHz
# Memory width : 4096bit
# GPU [Tesla V100-PCIE-16GB] Platform [Linux] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : Tesla V100-PCIE-16GB
# ECC : Enabled
# Global mem : 16160MB
# Capability : 7.0
# PCI ID : 94A8:00:00.0
# Device clock : 1380MHz
# Memory clock : 877MHz
# Memory width : 4096bit
#SWAN: FATAL: cannot find image for module [.nonbonded.cu.] for device version 700

</stderr_txt>
]]>

This is very unfortunate, because I could probably chew through these WUs in no time. No issues on the other projects I'm crunching (Seti, Amicable Numbers, Milkyway).
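
When comparing hosts, it can help to pull the device name and compute capability out of the stderr header automatically. The Python sketch below is an editorial illustration that assumes the "# Key : Value" layout shown in the log above; `parse_device_fields` is a hypothetical helper, not part of BOINC or the GPUGrid app.

```python
def parse_device_fields(stderr_text, wanted=("Name", "Capability", "Global mem")):
    """Collect selected '# Key : Value' fields from a SWAN device header.

    The log may repeat the device block, so the first occurrence wins.
    """
    fields = {}
    for line in stderr_text.splitlines():
        if not line.startswith("#") or ":" not in line:
            continue
        # Drop the leading '#' and spaces, then split at the first colon.
        key, _, value = line.lstrip("# ").partition(":")
        key, value = key.strip(), value.strip()
        if key in wanted and value:
            fields.setdefault(key, value)
    return fields

sample = """\
# SWAN Device 0 :
# Name : Tesla V100-PCIE-16GB
# ECC : Enabled
# Global mem : 16160MB
# Capability : 7.0
"""

info = parse_device_fields(sample)
major, minor = (int(part) for part in info["Capability"].split("."))
print(info["Name"], (major, minor))  # prints: Tesla V100-PCIE-16GB (7, 0)
```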

PappaLitto
Joined: 21 Mar 16
Posts: 511
Credit: 4,648,917,755
RAC: 2,329,067
Message 49297 - Posted: 17 Apr 2018 | 12:37:40 UTC

I could be wrong, but I don't think CUDA 9.0 is supported yet in this version of ACEMD, which is the application for this project.

vseven
Joined: 2 Apr 18
Posts: 2
Credit: 1,132,200
RAC: 0
Message 49343 - Posted: 20 Apr 2018 | 15:32:01 UTC - in response to Message 49297.

I believe you are correct. I spun up an Ubuntu instance using a P100 and CUDA 8, and the tasks are now working.

Chilean
Joined: 8 Oct 12
Posts: 98
Credit: 385,652,461
RAC: 0
Message 50597 - Posted: 25 Sep 2018 | 12:12:42 UTC - in response to Message 49296.

Same here. V100s aren't supported.

Steffen
Joined: 2 Mar 19
Posts: 2
Credit: 48,438,972
RAC: 0
Message 51778 - Posted: 9 May 2019 | 17:27:34 UTC

The same now happens with a GTX 1660 Ti. Einstein@Home accepts it, so that is my answer for now.
