Is it time to have a conversation about...

Message boards : Number crunching : Is it time to have a conversation about...

Betting Slip Joined: 5 Jan 09 Posts: 670 Credit: 2,498,095,550 RAC: 0	Message 42028 - Posted: 26 Oct 2015 | 9:02:25 UTC Last modified: 26 Oct 2015 | 9:02:51 UTC
	...the very high failure rates on some WUs, and what I would consider high failure rates on most WUs? Maybe it would be an idea for all of us to put aside our "mine is bigger than yours" egos for a while and run all our cards at stock speeds, to assess whether overclocking is causing, or is a large part of the cause of, the high failure rates. I know a percentage of failures will be caused by machines that are perpetually failing, and I can't understand why these aren't suspended from the project, with an email sent to the account holder explaining the problem. However, we may be overestimating the number of these machines and their effect on overall failure rates. In the interests of this project it would be good to at least try to see if we could get the failure rate below 5% on any WU. The advantages would be considerable, including: higher result confidence for the scientists themselves; higher real-time throughput; higher user satisfaction. Just something that I think we should be talking about.

Dayle Diamond Joined: 5 Dec 12 Posts: 84 Credit: 1,630,338,415 RAC: 187,330	Message 42359 - Posted: 10 Dec 2015 | 23:11:23 UTC - in response to Message 42028. Last modified: 10 Dec 2015 | 23:29:23 UTC
	Now that the error rates for different work units are posted on the Server Status page, I have to say that I'm totally shocked. One out of five tasks is failing. I've had a couple of posts up trying to minimize errors on my home PCs, and if the numbers are true, I'm one of the lucky ones. My error rate is closer to one in twenty. If we can get 20% more FLOPS just by increasing system stability, then that's gotta be a top priority. I can't speak for everyone, but my cards have not been overclocked past the factory default, and I've always found GPUGrid programs to be weirdly fragile. Power outages will render a WU invalid. Windows automatic updates will render a WU invalid. One work unit crashing immediately on one GPU will somehow spill over, and the other GPU will show up invalid. Sometimes work units will restart, showing 8 hours of computing but starting at 0% again. After an extended time, those will be invalid. This doesn't happen with my other projects.
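	One possible way to reconcile a project-wide "one in five" figure with a personal rate near one in twenty is that a few hosts failing every task can dominate the count. The sketch below uses made-up numbers only (the host counts and task counts are assumptions, not GPUGrid statistics) to show how the aggregate task failure rate can sit far above the rate seen on a typical healthy host.

```python
# All numbers below are illustrative assumptions, not GPUGrid statistics.

def aggregate_failure_rate(hosts):
    """hosts: list of (tasks_returned, tasks_failed) tuples, one per host."""
    total = sum(returned for returned, _ in hosts)
    failed = sum(failures for _, failures in hosts)
    return failed / total

# 95 healthy hosts: 20 tasks each in some period, 1 failure each (5%).
healthy = [(20, 1)] * 95

# 5 broken hosts: every task fails quickly, so each one returns many
# more (failed) tasks in the same period. 80 per host is an assumption.
broken = [(80, 80)] * 5

print(f"healthy hosts only: {aggregate_failure_rate(healthy):.1%}")
print(f"all hosts:          {aggregate_failure_rate(healthy + broken):.1%}")
```

	With these assumed numbers, 5% of the hosts push the task-count failure rate from 5% to roughly 21.5%, even though the compute time lost on the broken hosts is small.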

Retvari Zoltan Joined: 20 Jan 09 Posts: 2343 Credit: 16,201,255,749 RAC: 851	Message 42376 - Posted: 12 Dec 2015 | 12:45:53 UTC - in response to Message 42359. Last modified: 12 Dec 2015 | 12:47:35 UTC
	"Now that the error rates for different work units are posted in the Server Status page, I have to say that I'm totally shocked. One out of five tasks is failing."
	These data are not "normalized", so hosts that fail every single workunit are included, which makes the numbers look worse than they really are. Failing a workunit usually takes much less time (less than a minute) than completing one (4~24+ hours), so these numbers start from a 100% failure rate and then get lower.
	"I've had a couple of posts up trying to minimize errors on my home PCs, and if the numbers are true, I'm one of the lucky ones. My error rate is closer to one in twenty."
	I have a similar error rate.
	"If we can get 20% more FLOPS just by increasing system stability, then that's gotta be a top priority."
	Those who care usually read this forum and are guided to take proper action, but there are some who don't. This is a volunteer project, so there's no time or resources to find and warn every single contributor with failing hosts, but everybody's welcome to do that.
	"I can't speak for everyone, but my cards have not been overclocked past the factory default..."
	Factory default is usually safe, but it's no guarantee of error-free operation. Also, if overclocking is done carefully, it won't harm stability.
	"...and I've always found GPUGrid programs to be weirdly fragile."
	They are, as they use the GPU in a different way than other apps, so it has to be stable all the time.
	"Power outages will render a WU invalid."
	A power outage can corrupt the checkpoint, because the data has not been written to the HDD/SSD before the power goes off. Turning off delayed writes could resolve this, but it will also decrease overall system responsiveness and reduce HDD/SSD lifetime a little.
	"Windows automatic updates will render a WU invalid."
	I think that can happen only when a GPU driver update comes with the Windows updates; otherwise only the system restart could harm the WU, but I don't have restart-related WU failures.
	"One work unit crashing immediately on one GPU will somehow spill over and the other GPU will show up invalid."
	I had such an experience on one of my hosts. It turned out that the 12V power pins on the 24-pin MB connector were burnt. It's recommended to use MBs with extra 12V connectors (Molex, SATA, or PCIe) for multiple-GPU configurations.
	"Sometimes work units will restart, showing 8 hours of computing but start at 0% again. After an extended time, those will be invalid."
	I have had similar experiences. These workunits tend to get stuck at some point, and when you restart them (or the system) they begin from 0%. When I catch them, I abort them.
	"This doesn't happen with my other projects."
	Other projects don't utilize the GPU to such an extent as GPUGrid does. But the power failure issue could happen on other projects too (if they checkpoint too frequently, or don't checkpoint at all, like CPDN).
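	The checkpoint corruption described above (data still sitting in a write cache when the power goes off) is commonly guarded against by writing to a temporary file, forcing it to disk, and only then renaming it over the old checkpoint. The following is a generic sketch of that pattern, not a description of how the GPUGrid application writes its checkpoints; the function name `write_checkpoint` and the file names are hypothetical.

```python
import os
import tempfile

def write_checkpoint(path, data: bytes):
    """Write data so that a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temporary file in the same directory, so the final rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())    # ask the OS to push the data to the device
        os.replace(tmp_path, path)  # swap the new file in over the old one
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Usage: two successive checkpoints in a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    ckpt = os.path.join(scratch, "state.ckpt")
    write_checkpoint(ckpt, b"step 1")
    write_checkpoint(ckpt, b"step 2")
    with open(ckpt, "rb") as f:
        print(f.read())
```

	Even with this pattern, a drive that acknowledges writes still held in its own volatile cache can lose data on power failure, which is consistent with the delayed-write trade-off mentioned in the post.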

fractal Joined: 16 Aug 08 Posts: 87 Credit: 1,248,879,715 RAC: 0	Message 42389 - Posted: 15 Dec 2015 | 3:40:16 UTC - in response to Message 42376.
	"I've had a couple of posts up trying to minimize errors on my home PCs, and if the numbers are true, I'm one of the lucky ones. My error rate is closer to one in twenty."
	"I have a similar error rate."
	Wow. That is much too high. One of my systems went to 5% once, so I took it apart and cleaned out the clogged fans. Anything above 1-2% is way too high.

jjch Joined: 10 Nov 13 Posts: 98 Credit: 15,298,275,388 RAC: 1,155,802	Message 42394 - Posted: 15 Dec 2015 | 15:29:39 UTC
	Most of the time when I see a GPUGRID work unit fail, it is due to one of the servers crashing. Sometimes it happens when BOINC crashes. It's not always clear to me why these events occur, but I do try to at least cover the basics with the latest updates to prevent further issues: latest BIOS, Windows updates, NVIDIA drivers, BOINC version. Sometimes I run into a problem with the most recent versions, but usually it's better. Don't leave the Windows automatic updates setting on. You can have it set to download the updates but let you choose when to install them. That way you can schedule a maintenance window to complete your updates, pause BOINC, reboot, etc. What is the easiest way to check all my GPUs and determine which ones have the highest error rates? I would like to see if I can make any additional improvements.
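	One rough way to answer the "which GPUs have the highest error rates" question is to tally task outcomes per host and GPU. The sketch below assumes you have already collected outcomes yourself (for example by hand from your task lists) into `(host, gpu, succeeded)` records; that record format, the host names, and the sample data are all hypothetical. The 2% default threshold borrows the "anything above 1-2%" rule of thumb mentioned earlier in the thread.

```python
from collections import defaultdict

def error_rates(tasks):
    """tasks: iterable of (host, gpu, succeeded). Returns {(host, gpu): rate}."""
    counts = defaultdict(lambda: [0, 0])  # (host, gpu) -> [failed, total]
    for host, gpu, succeeded in tasks:
        entry = counts[(host, gpu)]
        entry[1] += 1
        if not succeeded:
            entry[0] += 1
    return {key: failed / total for key, (failed, total) in counts.items()}

def flag_unstable(rates, threshold=0.02):
    """Return ((host, gpu), rate) pairs above the threshold, worst first."""
    flagged = [(key, rate) for key, rate in rates.items() if rate > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)

# Hypothetical sample data: two hosts, three GPUs.
sample = (
    [("pc-a", 0, True)] * 39 + [("pc-a", 0, False)] * 1
    + [("pc-a", 1, True)] * 40
    + [("pc-b", 0, True)] * 17 + [("pc-b", 0, False)] * 3
)

for (host, gpu), rate in flag_unstable(error_rates(sample)):
    print(f"{host} GPU {gpu}: {rate:.1%}")
```

	With this sample, GPU 0 on "pc-b" (3 failures in 20 tasks) is listed first and GPU 0 on "pc-a" (1 in 40) second, while the error-free GPU is not flagged.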

Dayle Diamond Joined: 5 Dec 12 Posts: 84 Credit: 1,630,338,415 RAC: 187,330	Message 42471 - Posted: 21 Dec 2015 | 15:44:01 UTC
	"Those who care usually read this forum and are guided to take proper action, but there are some who don't. This is a volunteer project, so there's no time or resources to find and warn every single contributor with failing hosts, but everybody's welcome to do that."
	That's a good idea. I can't afford more GPUs right now, but if I can help reactivate a few hosts, perhaps I can be of assistance. Retvari, I've just combed through my work unit list to see which tasks had failed with other users. I've sent out a few busybody notes to users with 100% error rates, encouraging them to check the forums. You know hardware well beyond my abilities; will you keep an eye on the forums over the holidays?

Retvari Zoltan Joined: 20 Jan 09 Posts: 2343 Credit: 16,201,255,749 RAC: 851	Message 42474 - Posted: 21 Dec 2015 | 19:12:02 UTC - in response to Message 42471.
	"I've sent out a few busybody notes to users with 100% error rates encouraging them to check the forums. You know hardware well beyond my abilities, will you keep an eye on the forums over the holidays?"
	I will. However, I think there won't be much need for my expertise during the holidays, as I don't share your optimism that this time of year is the best for turning "careless" volunteers into careful ones :) I have sent some warning messages in the past, but I received very few responses.

Dayle Diamond Joined: 5 Dec 12 Posts: 84 Credit: 1,630,338,415 RAC: 187,330	Message 42485 - Posted: 23 Dec 2015 | 20:56:26 UTC - in response to Message 42474.
	Of about fifteen notes, I got one reply. A contributor was able to fix their problem. GPUgrid is 1,728 GFLOPS faster for it! Shame there's no way to get in touch with "Anonymous".
