Message boards : Number crunching : Unit crash after 0 second

Author Message
Profile Zarck
Joined: 16 Aug 08
Posts: 143
Credit: 314,944,668
RAC: 112,316
Message 51629 - Posted: 14 Mar 2019 | 11:13:07 UTC
Last modified: 14 Mar 2019 | 11:54:29 UTC

Units crash after 0 seconds.

No problem with Asteroids GPU, Einstein GPU, Folding@home GPU, Milkyway GPU, or SETI GPU.

See you later,
*_*

Nvidia Quadro K5000 + Geforce Titan.
____________

flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,502,302,597
RAC: 2,142,509
Message 51630 - Posted: 14 Mar 2019 | 20:49:14 UTC - in response to Message 51629.

Both your cards are pretty old; they may not be capable of working with these WUs. Have you completed any work units?
____________

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2048
Credit: 14,828,612,169
RAC: 2,502,210
Message 51631 - Posted: 14 Mar 2019 | 20:51:51 UTC - in response to Message 51629.

Units crash after 0 seconds.

No problem with Asteroids Gpu, Einstein Gpu, Milkyway Gpu, Seti Gpu.

Nvidia Quadro K5000 + Geforce Titan.

The error message at the end of the stderr.txt is:
SWAN : FATAL Unable to load module .nonbonded.cu. (300)
It means that the Quadro K5000 is too old for this project as it is only Compute Capability 3.0.

Your Titan should work fine, but it gets very hot (83°C): (Task 20673822)
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -52 (0xffffffcc)</message>
<stderr_txt>
# GPU [GeForce GTX TITAN] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX TITAN
# ECC : Disabled
# Global mem : 6144MB
# Capability : 3.5
# PCI ID : 0000:28:00.0
# Device clock : 928MHz
# Memory clock : 3004MHz
# Memory width : 384bit
# Driver version : r419_29 : 41935
# GPU 0 : 81C
# GPU 1 : 70C
# GPU 0 : 82C
# GPU 0 : 83C
# GPU [Quadro K5000] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 1 :
# Name : Quadro K5000
# ECC : Disabled
# Global mem : 4096MB
# Capability : 3.0
# PCI ID : 0000:0F:00.0
# Device clock : 705MHz
# Memory clock : 2700MHz
# Memory width : 256bit
# Driver version : r419_29 : 41935
SWAN : FATAL Unable to load module .nonbonded.cu. (300)
</stderr_txt>
]]>

I don't see any other task assigned to your Titan, so you've probably excluded it from getting GPUGrid work by mistake. The card you should exclude is the Quadro K5000.
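
For anyone who wants to check a stderr dump like the one above mechanically, here is a minimal Python sketch that reads the "SWAN Device", "Name" and "Capability" lines and lists the devices below a cutoff. The function names `parse_devices` and `too_old` are made up for illustration, and the 3.5 cutoff is only an assumption inferred from this thread (3.0 fails, 3.5 is expected to work), not an official project requirement.

```python
import re

# Assumption: cutoff inferred from this thread only (3.0 fails, 3.5 should work).
MIN_CAPABILITY = (3, 5)


def parse_devices(stderr_text):
    """Return {device_index: (name, (major, minor))} from SWAN-style stderr lines."""
    devices = {}
    current = None  # index of the device block being read
    name = None
    for raw in stderr_text.splitlines():
        line = raw.strip()
        m = re.match(r"#\s*SWAN Device\s+(\d+)\s*:", line)
        if m:
            current = int(m.group(1))
            name = None
            continue
        m = re.match(r"#\s*Name\s*:\s*(.+)", line)
        if m and current is not None:
            name = m.group(1).strip()
            continue
        m = re.match(r"#\s*Capability\s*:\s*(\d+)\.(\d+)", line)
        if m and current is not None:
            devices[current] = (name, (int(m.group(1)), int(m.group(2))))
    return devices


def too_old(devices, minimum=MIN_CAPABILITY):
    """Return the sorted device indexes whose capability is below `minimum`."""
    return sorted(i for i, (_, cap) in devices.items() if cap < minimum)


# Shortened copy of the relevant lines from the log in this post.
SAMPLE = """\
# SWAN Device 0 :
# Name : GeForce GTX TITAN
# Capability : 3.5
# SWAN Device 1 :
# Name : Quadro K5000
# Capability : 3.0
"""

if __name__ == "__main__":
    devs = parse_devices(SAMPLE)
    for idx in too_old(devs):
        print(idx, devs[idx][0])
```

With the sample lines, only device 1 (the Quadro K5000) is reported as too old.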

Profile Zarck
Joined: 16 Aug 08
Posts: 143
Credit: 314,944,668
RAC: 112,316
Message 51632 - Posted: 14 Mar 2019 | 23:22:26 UTC
Last modified: 15 Mar 2019 | 0:05:45 UTC

How do I exclude a card in BOINC?
____________

Profile Zarck
Joined: 16 Aug 08
Posts: 143
Credit: 314,944,668
RAC: 112,316
Message 51633 - Posted: 15 Mar 2019 | 9:36:29 UTC

My configuration is now too old for your project, but I can still run all the others.


____________

mmonnin
Joined: 2 Jul 16
Posts: 265
Credit: 647,845,139
RAC: 966
Message 51678 - Posted: 31 Mar 2019 | 0:33:48 UTC - in response to Message 51632.
Last modified: 31 Mar 2019 | 0:34:22 UTC

How do I exclude a card in BOINC?


The format in cc_config.xml (inside the <options> section) is:
<exclude_gpu>
<url>project_URL</url>
<device_num>N</device_num>
<type>NVIDIA|ATI|intel_gpu</type>
<app>appname</app>
</exclude_gpu>

The type element is needed if you have GPUs from more than one manufacturer, for example an Intel iGPU plus an NVIDIA card.
https://boinc.berkeley.edu/wiki/Client_configuration
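
As a rough illustration of the format above, this Python sketch builds a complete cc_config.xml with one exclude_gpu entry using the standard library. The helper name `build_cc_config` is hypothetical. The project URL is taken from the link posted in this thread and must match the URL your BOINC client shows for the project. The device number 1 is an assumption based on the Quadro K5000 appearing as "SWAN Device 1" in the log earlier in the thread, so check the numbering BOINC reports in its own event log before relying on it.

```python
import xml.etree.ElementTree as ET


def build_cc_config(project_url, device_num=None, gpu_type=None, app=None):
    """Return cc_config.xml text containing a single <exclude_gpu> entry.

    Only <url> is always written; the other elements are added when given.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    excl = ET.SubElement(options, "exclude_gpu")
    ET.SubElement(excl, "url").text = project_url
    if device_num is not None:
        ET.SubElement(excl, "device_num").text = str(device_num)
    if gpu_type is not None:
        ET.SubElement(excl, "type").text = gpu_type
    if app is not None:
        ET.SubElement(excl, "app").text = app
    if hasattr(ET, "indent"):  # pretty-printing is available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumed values: URL as linked in this thread, device 1 = Quadro K5000.
    print(build_cc_config("http://www.gpugrid.net/", device_num=1, gpu_type="NVIDIA"))
```

The printed text would go into cc_config.xml in the BOINC data directory. The client has to re-read its config files, and possibly be restarted, before the exclusion applies.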

Profile Michael H.W. Weber
Joined: 9 Feb 16
Posts: 58
Credit: 583,509,698
RAC: 39,244
Message 51794 - Posted: 14 May 2019 | 5:58:35 UTC
Last modified: 14 May 2019 | 6:00:04 UTC

Since May 13th, all newly downloaded tasks have ended with an error after 0 seconds of compute time. The log is the same for all of them:

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 212 (0xd4, -44)</message>
<stderr_txt>

</stderr_txt>
]]>
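
A side note on the numbers in that exit line: 212, 0xd4 and -44 are the same 8-bit value written as unsigned decimal, hexadecimal and signed decimal. The same relation holds for the "-52 (0xffffffcc)" pair in the Windows log earlier in the thread, at 32 bits. The short Python sketch below only shows that arithmetic; the helper names are made up, and it says nothing about what the code means to the application.

```python
def as_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement integer."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):  # sign bit set, so the value is negative
        return value - (1 << bits)
    return value


def as_unsigned_hex(value, bits):
    """Return the low `bits` bits of `value` as a hexadecimal string."""
    return hex(value & ((1 << bits) - 1))


if __name__ == "__main__":
    print(as_unsigned_hex(212, 8), as_signed(212, 8))  # 0xd4 -44
    print(as_unsigned_hex(-52, 32))                    # 0xffffffcc
```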

The same machine (a GTX 750 Ti under Ubuntu 18.04 LTS Linux) has completed hundreds of tasks over the past few years.
Any idea what exactly the error means?
I have just reset GPUGRID on that machine, hoping it will resume computation properly.
Meanwhile, I am testing Einstein to see whether the card is still OK.

Michael.
____________
President of Rechenkraft.net - Germany's first and largest distributed computing organization.

Zalster
Joined: 26 Feb 14
Posts: 177
Credit: 4,121,030,726
RAC: 843,988
Message 51809 - Posted: 14 May 2019 | 14:01:08 UTC - in response to Message 51794.

There seem to be a lot of work units with high failure rates. I noticed that my multi-card machine had no work, so I checked: all of its work units had errored out. I thought it was a problem with my machine, but then I checked the work units and saw that not one of them has been completed by any of the numerous other machines that tried. It looks like all the work units are on their way to 9 failed attempts. I checked Server Status and can see that the number of different work unit types that are failing is climbing.


____________

sis651
Joined: 25 Nov 13
Posts: 66
Credit: 162,627,597
RAC: 33,140
Message 51813 - Posted: 14 May 2019 | 17:22:56 UTC

My notebook was unusually quiet, so I suspected a problem. I checked BOINC, and the GPUGrid tasks were failing with error code 212. In one case BOINC couldn't resume the job; later, all jobs ended with error code 212 without even starting.

Profile Michael H.W. Weber
Joined: 9 Feb 16
Posts: 58
Credit: 583,509,698
RAC: 39,244
Message 51831 - Posted: 16 May 2019 | 10:37:17 UTC
Last modified: 16 May 2019 | 10:37:34 UTC

The problem is discussed here:
http://www.gpugrid.net/forum_thread.php?id=4924&nowrap=true#51786

Michael.
____________
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
