Message boards : Number crunching : PABLO_redo_goal_RMSD2_KIX_CMY tasks is swamping other projects

Keith Myers
Joined: 13 Dec 17
Posts: 257
Credit: 234,046,463
RAC: 70
Message 50969 - Posted: 28 Nov 2018 | 6:51:52 UTC

I have a PABLO_redo_goal_RMSD2_KIX_CMY task that, as far as I can tell, is swamping the normal Seti work requests on the host it is running on.

It has run for 3 hours so far and is projected to run for another 4.45 hours. When I run sched_ops, GPUGrid is shown to have a larger shortfall than Seti. That should never happen with my resource share set to 900 for Seti and 50 for GPUGrid.
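For reference, here is a rough sketch of what a 900/50 split works out to, assuming the client divides resources in proportion to each project's resource share. The function name is only illustrative and is not part of BOINC.

```python
def share_fractions(shares):
    """Return each project's fraction of the total resource share."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


# Resource shares quoted in this post.
fractions = share_fractions({"Seti": 900, "GPUGrid": 50})
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
```

Under that assumption GPUGrid is entitled to roughly one twentieth of the host over the long run, which is why a larger GPUGrid shortfall looks wrong.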

Has anybody else run this task type yet, or run into problems with it? I have found that if I suspend GPUGrid, Seti is able to ask for its normal amount of work and keep the caches up.

This had me stumped all afternoon: after Seti came back from its outage, my task cache continued to fall and my Seti gpu work requests were for a pitiful number of seconds. I did not know whether the graphics driver upgrade had anything to do with it or whether that was a coincidence.

Comments?

Zalster
Joined: 26 Feb 14
Posts: 175
Credit: 4,013,368,076
RAC: 8,321
Message 50970 - Posted: 28 Nov 2018 | 7:05:18 UTC - in response to Message 50969.

I have run those work units, but as you know I run dedicated machines for each project, so I cannot comment on the swamping. Hopefully someone else will be able to help you with this.

Keith Myers
Joined: 13 Dec 17
Posts: 257
Credit: 234,046,463
RAC: 70
Message 50971 - Posted: 28 Nov 2018 | 7:22:51 UTC

Something is going seriously wrong on this host. I can't maintain my Seti gpu cache. It is not requesting any gpu work.

STARBASEn
Joined: 17 Feb 09
Posts: 67
Credit: 1,001,674,001
RAC: 73
Message 50975 - Posted: 28 Nov 2018 | 16:14:05 UTC

Keith Myers wrote: "This had me stumped all afternoon: after Seti came back from its outage, my task cache continued to fall and my Seti gpu work requests were for a pitiful number of seconds. I did not know whether the graphics driver upgrade had anything to do with it or whether that was a coincidence. Comments?"

I have similar issues with CPU jobs when running WCG and QC concurrently with equal priorities. WCG tends to swamp out QC over time, so I have to intervene periodically by suspending the WCG jobs, letting the cache refill with QC tasks, and then unsuspending WCG. The WCG tasks take about an hour each and the QC jobs currently take about 1-2 hours each. My cache level setting is fairly low at 0.5 days.

When QC has no tasks in the queue yet is not requesting more work, the event log shows that WCG has filled the cache, which is why I suspend WCG to get more QC jobs. Your situation appears to be the opposite of mine: your Pablo tasks are a lot longer than the Seti jobs, which implies Seti should swamp GPUGrid, although your 900/50 Seti/GPUGrid priorities may be why it does not. So my general theory is that whichever project has the shorter jobs (WCG in my case) asks for work more often and gradually fills the cache, so the other project never gets work because the cache is always full.

With your priorities, if GPUGrid has more time left to complete than your cache size, Seti won't send more work. I do not know how often Seti requests work, but if it is infrequent, GPUGrid hogs the show. I haven't found a cure for this yet, except maybe doing what Zalster does. I could be way off base, but it is something to look at and I hope it helps.
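To make that concrete, here is a toy model of the idea. It is only a sketch and not BOINC's actual work-fetch logic: it assumes the client asks for work only when the estimated remaining runtime of queued tasks is below the cache size. The function name and the task durations are made up for illustration.

```python
def shortfall_seconds(cache_days, queued_task_hours):
    """Toy model: seconds of work still needed to fill the cache.

    Assumption: no work is requested once the estimated remaining
    runtime of queued tasks reaches the cache size.
    """
    cache = cache_days * 86400
    queued = sum(queued_task_hours) * 3600
    return max(0.0, cache - queued)


# 0.5-day cache with three 1-hour tasks queued: still 9 hours short.
short_jobs = shortfall_seconds(0.5, [1, 1, 1])
# Same cache with two long tasks of 7 hours each: nothing is requested.
long_jobs = shortfall_seconds(0.5, [7, 7])
print(short_jobs, long_jobs)
```

In this model, anything that inflates the estimated queued runtime, whether long tasks or tasks the client wrongly believes it holds, drives the shortfall to zero and stops further requests.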

Keith Myers
Joined: 13 Dec 17
Posts: 257
Credit: 234,046,463
RAC: 70
Message 50976 - Posted: 28 Nov 2018 | 21:34:17 UTC
Last modified: 28 Nov 2018 | 21:43:36 UTC

I normally have a maximum of 6 long/short tasks on any host at any time. That is what my 1 day cache and 900/50 Seti/GPUGrid resource share split nets me. I have either 400 or 500 Seti tasks on board each of my hosts at all times. None of my other hosts had any trouble getting work; only host 477434 was having issues. I think the scheduler has messed up that host: it shows 12 tasks in progress, so 6 of them are 'ghosts', and I am pretty sure that is what messed things up. BOINC thought I had much more GPUGrid work than I actually have. I only have 6 actual tasks on board that host, like normal.

I aborted the suspect long-running task that the thread title is about, and that had no effect. I aborted a couple of others and received some short tasks in return, and that didn't help either. Only when I suspended the project did I finally regain the ability to request Seti gpu work.

The solution was to reset the project on that host. That cleared things up and brought the host back to normal, except that I now have the 6 ghost tasks. I hope they clear from the database once the deadline is reached early next month.

These are the 'ghost' tasks.
https://www.gpugrid.net/result.php?resultid=19811972
https://www.gpugrid.net/result.php?resultid=19812931
https://www.gpugrid.net/result.php?resultid=19815503
https://www.gpugrid.net/result.php?resultid=19818063
https://www.gpugrid.net/result.php?resultid=19820978
https://www.gpugrid.net/result.php?resultid=19830474
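
Put another way, the ghosts are whatever the project's task list shows as in progress that the client does not actually hold. A minimal sketch of that comparison follows, using the six result IDs linked above as the ghost entries. The IDs standing in for the six tasks really on the host are made up for illustration, and the function name is hypothetical.

```python
def find_ghosts(server_in_progress, client_on_board):
    """Return result IDs the server lists as in progress but the client lacks."""
    return sorted(set(server_in_progress) - set(client_on_board))


# The six result IDs linked above.
ghost_ids = [19811972, 19812931, 19815503, 19818063, 19820978, 19830474]
# Hypothetical placeholder IDs for the six tasks actually on the host.
client_ids = [1, 2, 3, 4, 5, 6]

server_ids = ghost_ids + client_ids  # the 12 tasks shown as in progress
ghosts = find_ghosts(server_ids, client_ids)
print(len(server_ids), len(ghosts))
print(ghosts)
```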

STARBASEn
Joined: 17 Feb 09
Posts: 67
Credit: 1,001,674,001
RAC: 73
Message 50980 - Posted: 29 Nov 2018 | 16:39:21 UTC

The "ghost" task issue is perplexing. The GPUGrid task scheduler seems to ignore cache settings and sends a maximum of two jobs per gpu installed on any given system if tasks are available. In your case, with 4 tasks (2 of them ghosts) per gpu, I can see why you were not receiving WUs from Seti, as you pointed out.

In the past, what I believe you are describing as ghost tasks appeared, rarely, from WCG (cpu work) on several of my machines. The issue showed up as a completed task ready to send that WCG was not aware existed. I never got credit for it, and it persisted for months until I finally gave up trying to locate the problem. As I recall, I suspended receiving new tasks from all projects and, when my cache was empty, re-installed BOINC and re-registered the new client with the projects, which cleared things up. Actually, I think re-installing BOINC is a bit of overkill: there must be an XML file recording that the WU exists and is ready to send, while the other XML files that schedule the upload do not include that particular task, so it never gets picked up. That was several years ago and it hasn't happened since with any project.

I am sure that if you empty your queue, remove BOINC and then do a fresh install, the ghosts will be gone. But again, you may want to see whether you can locate and edit the file(s) showing the ghost jobs first. If you find them, would you please report back which ones are the culprits? Good luck.

Keith Myers
Joined: 13 Dec 17
Posts: 257
Credit: 234,046,463
RAC: 70
Message 50982 - Posted: 30 Nov 2018 | 1:13:42 UTC - in response to Message 50980.

By resetting the project I got rid of the "ghosts" in the client_state.xml file. That allowed BOINC to correctly assess each project's credit debt so that I could start receiving Seti tasks again.

That does nothing for the scheduler database, though. If a project enables "resend lost tasks", as Seti does for example, there is a way to get the lost tasks resent to the host, but you have to use a specific technique to trigger it. I don't know whether GPUGrid has that turned on in its scheduler software. It is up to the administrators to enable it; as a user you have no control.

Once I reset the project, all remnants of the "ghost" tasks were removed from the client_state.xml file, so there is no chance of recovering them now. Now I just have to wait for the task deadlines to clear them from my account database.
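
For anyone who wants to see what the client thinks it holds before resetting, a rough sketch along these lines could list the task names. It assumes tasks are stored as `<result>` elements with a `<name>` child. The sample XML below is a made-up, heavily simplified stand-in for client_state.xml, and the task names in it are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified stand-in for client_state.xml (assumed layout).
SAMPLE = """<client_state>
  <result>
    <name>example_task_A_0</name>
    <wu_name>example_task_A</wu_name>
  </result>
  <result>
    <name>example_task_B_1</name>
    <wu_name>example_task_B</wu_name>
  </result>
</client_state>"""


def result_names(xml_text):
    """Return the <name> of every <result> element in the XML text."""
    root = ET.fromstring(xml_text)
    return [result.findtext("name") for result in root.iter("result")]


names = result_names(SAMPLE)
print(names)
```

Comparing that list against the tasks shown on the project website would show which entries exist on only one side. Stop the client and make a backup first if you intend to edit the real file.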
