Message boards : Graphics cards (GPUs) : Posted on BOINC Alpha; re: 6.6.20

Author Message
Profile Paul D. Buck
Send message
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 8226 - Posted: 5 Apr 2009 | 18:38:58 UTC

I posted this on the Alpha mailing list. Those of you who have 4 and 8 core systems may want to consider how your systems handle workloads in light of these observations ... if you are only running one or two projects these notes may not apply ... comments, as always, welcome ...

One of my long-standing complaints about BOINC is with the CPU scheduler and the fact that the more CPUs you have, the less optimal its scheduling choices seem to be. The CPU scheduler seems to be modeled and tested primarily on single CPU systems with occasional duals. Most of the problems I have noted are usually only readily apparent when you look for them on 4 CPU systems or better. My first trip down this rabbit hole was over 4 years ago when I had one of the first 4 CPU systems and JM VII was lead on the development of the CPU scheduler.

To date little has changed and perhaps I am one of the few to note these issues, but that does not mean that they do not exist.

For example, I installed 6.6.20 and noted in prior posts that I had a long standing count of "waiting" tasks to run and I thought that it was unusual. For other reasons I down-leveled to 6.5.0 and less than 8 hours after doing that I have only three ...

More troublesome is the question of 6.6.20 not scheduling correctly to the point that it actually causes GPU Grid tasks on multi-GPU systems to take excessive amounts of time to complete. Yet the actual run times may only reflect the "normal" 6 hour run time. I can't recall if I posted about that here ... and I still have pending inquiries on GPU Grid, but, since the down level I seem to be back to executing normally. With 6.6.20 I was seeing at least two and as many as all 4 tasks taking 24 or more hours to process, yet, as I said, only logging 6 hours in the internal time. Since time accounting is still not properly recorded and validated I do not know what to trust though clock watching does tell me that there is a problem ... so does the drastic reduction in throughput (cut in half) ... (Also for the first time I noted GPU tasks being suspended, though none of them were actually in deadline trouble) ...

My first posts here indicated that I felt that 6.6.20 was not honoring the task switch interval ... but I think that the major conceptual flaw is almost simpler than that, (though there still may be issues with the TSI mechanism).

The CPU scheduler is really based on a single CPU (or GPU) model with the assumption that what works for a single processing stream will scale to multiple streams. This has not been the case, though those of us that have brought it up in the past (usually by pointing out issues that arise as a consequence of this design deficiency; such as large numbers of "waiting" tasks or large memory footprints, or tasks failing because of being suspended or halted too many times, etc.) have largely been ignored.

Let's talk cash register lines... in the old days when 1 register was open there was one line, when 4 were open, there were 4 lines ... and if you got in the "wrong" line you got screwed by the person counting out $2,259.69 ... all in change, mostly pennies ...

Modern stores have one line that feeds the "bank" of registers and thus the one slow person does not impact those behind that one slow person... because they have access to all the other open registers...

BOINC should work the same way, but does not. When it recalculates the work schedule it does so even for those tasks that are being processed. And, as often as this calculation is done, the old execution plan is discarded.

What SHOULD happen is that the queue should be examined and scheduling within that queue should be re-ordered if needs be, but the currently executing tasks should not be abandoned or suspended willy-nilly ...

One or more tasks actually being processed should be interrupted only when there is an emergency or the project's time slice is filled. Just as we would not pull a person out of line that was already being rung up at the cash register we should not stop processing an executing task.

And this is why I think that I was seeing problems with 6.6.20 (I probably am one of the few that already have 4 GPU cores in one system): BOINC was changing its mind so often that the tasks were not being allowed to run ... I posted a log a few posts ago here and the clue may be in there ...

Last note, if the task in potential deadline trouble can be satisfied by being run after the completion of the next task to be suspended because of switching or completion then that task still should not pre-empt running tasks.
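
The cash register idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not BOINC's actual scheduler: the `Task` class, `simulate`, and `needs_preemption` are hypothetical names, runtimes are assumed to be known in advance, and all processors are treated as identical. One deadline-ordered queue feeds every processor, running tasks are never interrupted, and pre-emption is considered only when waiting for the next free processor would miss the deadline.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    # Hypothetical task record; only the deadline is used for ordering.
    deadline: float                          # hours from now
    name: str = field(compare=False)
    runtime: float = field(compare=False)    # remaining hours of work


def simulate(tasks, n_procs):
    """Feed one deadline-ordered queue to n_procs processors, never preempting.

    Returns a dict mapping task name to finish time in hours.
    """
    queue = list(tasks)
    heapq.heapify(queue)                 # earliest deadline comes out first
    free_at = [0.0] * n_procs            # when each processor next frees up
    heapq.heapify(free_at)
    finish = {}
    while queue:
        task = heapq.heappop(queue)
        start = heapq.heappop(free_at)   # the next "register" to open up
        end = start + task.runtime
        finish[task.name] = end
        heapq.heappush(free_at, end)
    return finish


def needs_preemption(task, free_at):
    """True only if waiting for the next free processor would miss the deadline."""
    return min(free_at) + task.runtime > task.deadline


if __name__ == "__main__":
    work = [Task(10, "a", 6), Task(4, "b", 2), Task(8, "c", 3)]
    print(simulate(work, n_procs=2))
    print(needs_preemption(Task(4, "x", 1), [2.0, 3.0]))
```

The point of `needs_preemption` is the "last note" above: if the task can still make its deadline by starting on the next processor to come free, nothing that is already running has to be stopped.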

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 8228 - Posted: 5 Apr 2009 | 18:58:56 UTC - in response to Message 8226.

I think you are being a bit vague about the problems you are experiencing due to these scheduling issues. It's only a guess that the recent GPU-Grid issue is caused by the BOINC scheduler.

Regarding the many WUs in flight: you're probably right that this is not ideal. However, they may just say "well, just increase the time between app switches or switch off 'leave in memory'".
One problem I see with your approach: very long tasks like CPDN could throw the system off balance, i.e. run for far too long. This could be stopped, though, as soon as a certain amount of debt has built up.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile Paul D. Buck
Send message
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 8237 - Posted: 6 Apr 2009 | 5:52:23 UTC - in response to Message 8228.
Last modified: 6 Apr 2009 | 5:52:39 UTC

I think you are being a bit vague about the problems you are experiencing due to these scheduling issues. It's only a guess that the recent GPU-Grid issue is caused by the BOINC scheduler.

Regarding the many WUs in flight: you're probably right that this is not ideal. However, they may just say "well, just increase the time between app switches or switch off 'leave in memory'".
One problem I see with your approach: very long tasks like CPDN could throw the system off balance, i.e. run for far too long. This could be stopped, though, as soon as a certain amount of debt has built up.

MrS

You are correct, it is only a guess ... sans more information from the GPU Grid project, which I have repeatedly asked for in the other thread where I discuss this/that issue, I can only go on my instincts and observations. Using 6.5.0 (and a couple other versions) with GPU Grid I have never seen a GPU Grid task suspended while other tasks are then run in advance of previously running tasks. In that most tasks running on this system only take about 6 hours there is little need to do this type of suspension and switching in that on average I complete a task roughly once every 90 minutes...

And, as to it NOT being 6.6.20 ... well, then I would have expected to have seen it ALREADY since I down leveled ... and I have not ... I was seeing it almost continuously with at least one task with 6.6.20 and there was no correlation of the task name (id) and the task in trouble.

As to the other, you misread what I said ... long running tasks would still switch out at the normal switch interval as determined by the participant (default 60 minutes). I do leave tasks in memory because historically several projects' tasks do not take well to suspension and removal from memory. But this does lead to the issue of a big memory footprint if the CPU Scheduler misbehaves and starts more tasks than needful.

But the core issue is that the developers seem to have a chunk of code that is operationally optimal for systems with fewer resources. As I stated I first started observing issues nearly 4 years ago when I got my first 4 core system. And, the internal model does not seem to have been changed. Sadly, I suspect that one of the reasons that this occurs is that it is likely that the "best" system that the developers use for testing has 4 cores or less because we all know that the developers are almost always resource starved.

But the second problem is that I doubt that they spend hours staring at the execution patterns as I do ... I have two monitors and work on one and the other is always logged onto a BOINC instance and I watch it out of the corner of my eye and note changes.

As to your last point, sadly, you may be correct ... the BOINC Development team has a long history of ignoring suggestions as to how to improve BOINC ... up to and including developed and tested code changes ...

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 8250 - Posted: 6 Apr 2009 | 16:56:49 UTC
Last modified: 6 Apr 2009 | 16:57:16 UTC

Now I have a better understanding of what you mean. And I agree, with 6.5.0 or previous versions I have not seen a suspended GPU task either. And yes, it doesn't make any sense to suspend a GPU-Grid task (if no other CUDA project is present).

And I also leave apps in memory, mainly so that I don't waste computation time. But I don't have as many projects as you do, so this is not as much of an issue for me.

.. don't know what else to say.
Edit: guess I should get some dinner ;)

MrS
____________
Scanning for our furry friends since Jan 2002

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 8271 - Posted: 7 Apr 2009 | 2:43:00 UTC - in response to Message 8250.

Now I have a better understanding of what you mean. And I agree, with 6.5.0 or previous versions I have not seen a suspended GPU task either. And yes, it doesn't make any sense to suspend a GPU-Grid task (if no other CUDA project is present).

And I also leave apps in memory, mainly so that I don't waste computation time. But I don't have as many projects as you do, so this is not as much of an issue for me.

I'm running v6.6.20 on 7 machines (5 of which are quads) and have seen no problem with suspending tasks excessively. I have seen a GPUGRID task suspended in favor of running a new one but that's because the newly downloaded task had an earlier due date than the one already running. It seems due dates have been shortened. That's a choice by the GPUGRID admins and doesn't reflect a BOINC problem.

Profile Paul D. Buck
Send message
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 8276 - Posted: 7 Apr 2009 | 8:12:05 UTC - in response to Message 8271.

Now I have a better understanding of what you mean. And I agree, with 6.5.0 or previous versions I have not seen a suspended GPU task either. And yes, it doesn't make any sense to suspend a GPU-Grid task (if no other CUDA project is present).

And I also leave apps in memory, mainly so that I don't waste computation time. But I don't have as many projects as you do, so this is not as much of an issue for me.

I'm running v6.6.20 on 7 machines (5 of which are quads) and have seen no problem with suspending tasks excessively. I have seen a GPUGRID task suspended in favor of running a new one but that's because the newly downloaded task had an earlier due date than the one already running. It seems due dates have been shortened. That's a choice by the GPUGRID admins and doesn't reflect a BOINC problem.

It does if the resumed task or other tasks never suspended change from taking 6 hours to complete to 24 hours ...

I am not sure if the issue is with the tasks or with 6.6.20 or what. But, when the only thing changed is 6.6.20 to 6.5.0 and the problems go with the change ... well ...

In my case, the quad is the number of GPUs I have running tasks ... on an i7 ...

Anyway, for the nonce, I am back operational. I reported the problem which I suspect is nothing more than an exaggeration of a long standing issue that the developers love to ignore because it is inconvenient for them ... :)

But, if what I suspect is true this is going to bite them pretty big time real soon when more people and projects start doing GPU work ...

Well, I suspect I will quit asking soon as they keep feeding me the Oz line about the man behind the curtain ...

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 8282 - Posted: 7 Apr 2009 | 19:28:38 UTC

It may not be the perfect thread for this, but since the discussion got to that point already:

Paul, could you switch the i7 back to 6.6.20 and see if you can get the hanging tasks again? If you could observe this we'd know there's a serious problem. Otherwise we can only speculate so much.

MrS
____________
Scanning for our furry friends since Jan 2002

slozomby
Send message
Joined: 29 Jan 09
Posts: 17
Credit: 7,767,932
RAC: 0
Level
Ser
Message 8294 - Posted: 8 Apr 2009 | 2:11:43 UTC - in response to Message 8282.

6.6.20 has other issues.

it stopped downloading milkyway and lattice cpu workunits almost entirely, reporting that no cpu tasks were selected even though cpu tasks were available. even on my crappy laptop with no cuda gpu.

it did seem to handle gpugrid and seti on the same machine better than 6.5.0 for me at least. and i never saw the scheduler swap apps every 3 seconds like i get with 6.5.0 occasionally.

ive reverted back to 6.5.0. hopefully the next release is a little better.

Profile Paul D. Buck
Send message
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 8296 - Posted: 8 Apr 2009 | 5:31:29 UTC - in response to Message 8282.

Paul, could you switch the i7 back to 6.6.20 and see if you can get the hanging tasks again? If you could observe this we'd know there's a serious problem. Otherwise we can only speculate so much.


I was afraid someone was going to ask me to do that ... sigh ...

Well, I just got done making an 800K+ log file to demonstrate the poor performance of the CPU scheduler in electing which tasks to run and noting that on that system I have seen it change its mind in less than 30 seconds as to what is desperately needed to be done ...

If I am well enough tomorrow I will give it a shot for a couple hours (I had a REAL bad day today and am not in good shape, sorry). I think I can tell if it is behaving badly and for the heck of it I will turn on logging and see if I can capture something that will indicate what is going on and making those tasks run badly.

The real pits is I still need to finish my taxes ...

slozomby
Send message
Joined: 29 Jan 09
Posts: 17
Credit: 7,767,932
RAC: 0
Level
Ser
Message 8298 - Posted: 8 Apr 2009 | 7:08:55 UTC - in response to Message 8296.

Well, I just got done making an 800K+ log file to demonstrate the poor performance of the CPU scheduler in electing which tasks to run and noting that on that system I have seen it change its mind in less than 30 seconds as to what is desperately needed to be done ...


i dont think thats limited to 6.6.20. when i downgraded to 6.5.0 my cpu switched projects about 1 a sec for a couple of minutes till i suspended everything but one cpu project. turning them back on 1 at a time let everything resume normally.

a brief excerpt from the log:

07-Apr-2009 18:59:05 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:06 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:06 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:07 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:08 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:10 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:11 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:12 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:13 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:14 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:15 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:16 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:17 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:19 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:20 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:21 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:22 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:23 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:25 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:26 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:27 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:28 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:29 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:30 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:31 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:32 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:34 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103
07-Apr-2009 18:59:35 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:36 [Milkyway@home] Restarting task ps_s86_15_5127483_1239154851_0 using milkyway version 19
07-Apr-2009 18:59:37 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:38 [Milkyway@home] Restarting task ps_s86_15_5127483_1239154851_0 using milkyway version 19
07-Apr-2009 18:59:39 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:40 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:42 [Milkyway@home] Restarting task ps_s86_15_5127483_1239154851_0 using milkyway version 19
07-Apr-2009 18:59:43 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
07-Apr-2009 18:59:44 [rosetta@home] Restarting task vpr247_t000_1_NMR_NESG_IGNORE_THE_REST_10477_29945_0 using minirosetta version 154
07-Apr-2009 18:59:45 [Milkyway@home] Restarting task ps_s86_15_5127483_1239154851_0 using milkyway version 19
07-Apr-2009 18:59:46 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501
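
For anyone who wants to quantify this kind of thrashing in their own message log, here is a minimal Python sketch that counts "Restarting task" lines per task. The regular expression and the `count_restarts` helper are assumptions based only on the line format shown in the excerpt above; other BOINC versions may format these messages differently.

```python
import re
from collections import Counter

# Matches lines like:
# 07-Apr-2009 18:59:05 [ABC@home] Restarting task <name> using <app> version <n>
RESTART = re.compile(
    r"\[(?P<project>[^\]]+)\] Restarting task (?P<task>\S+) using"
)


def count_restarts(lines):
    """Return a Counter keyed by (project, task name) for every restart line."""
    counts = Counter()
    for line in lines:
        match = RESTART.search(line)
        if match:
            counts[(match["project"], match["task"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "07-Apr-2009 18:59:05 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103",
        "07-Apr-2009 18:59:06 [QMC@HOME] Restarting task three_2bd_stackexp-ecp2-DZ.3520_0 using Amolqc-preRC1exp version 501",
        "07-Apr-2009 18:59:07 [ABC@home] Restarting task abc_wu_35912036809000_9079000_2 using abc-finder version 103",
    ]
    for (project, task), n in count_restarts(sample).most_common():
        print(f"{n:3d}  {project}  {task}")
```

A task that shows up dozens of times within a minute, as in the excerpt above, is being restarted far more often than any task switch interval would explain.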

Profile Paul D. Buck
Send message
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 8301 - Posted: 8 Apr 2009 | 9:12:17 UTC - in response to Message 8298.

Well, I just got done making an 800K+ log file to demonstrate the poor performance of the CPU scheduler in electing which tasks to run and noting that on that system I have seen it change its mind in less than 30 seconds as to what is desperately needed to be done ...


i don't think thats limited to 6.6.20. when i downgraded to 6.5.0 my cpu switched projects about 1 a sec for a couple of minutes till i suspended everything but one cpu project. turning them back on 1 at a time let everything resume normally.

Actually this problem was first described by a guy named Paul D. Buck about 3 or 4 years ago. He noted it with the then current application on his brand new Dell 4 Dual processor (with HT) systems. ... :)

I am sorry if you got the impression that I was describing a problem that affects only 6.6.20 ... it is a long standing problem that arises from the design of the CPU Scheduler that has as one of its "Prime Directives" to not miss deadlines. The problem is quite simply that the most effective strategy when running on a single resource is not the best strategy when you have multiple processing resources.

The most common side effects are that there will be more than the expected number of tasks in "Waiting to Run" state (WTR). For example I have a task switch interval (TSI) of 720 minutes (12 hours) and virtually all tasks on that system should run to completion if BOINC honors TSI. But it doesn't, and so right at this moment I have 6 tasks in WTR.

The second symptom is that BOINC seemingly changes its mind on what to run on a moment to moment basis with no observable reason. Again, I have a low queue of 1 day, 8+4 processing elements and by actual counts have 1-1.4 days of work on hand, earliest deadline is 4 days hence and BOINC is randomly starting and stopping tasks. Some tasks are started, run for seconds to minutes and then suspended for hours before being run again. If the task was in such need of being run, why isn't it the first task restarted after it was suspended?

Anyway JM VII and I are debating this and he is insisting that all is well ... the problem is that it is noticeable on a 4 processor system but only glaringly obvious on an 8 processor system.

The other thing that is going on is that BOINC is running the internal model as much as 5 times a minute ... to my mind that is also madness. The processing needs and deadline issues are not going to change that much in that short of a timeframe. Heck, on the system in question I counted the tasks done in 24 hours and it was 252 with FreeHAL counted and 219(?) without. Fundamentally, I was averaging a task completion once every 6 minutes. This was confirmed with another count over another 12 hour period.

To my mind, that means that the next task in deadline trouble can be scheduled to the next free resource and run then ...
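
The arithmetic behind that "once every 6 minutes" figure, and what it implies for how often the model is re-run per finished task, can be checked with a short Python sketch. The function names are made up for illustration; the inputs (252 and 219 tasks in 24 hours, up to 5 model runs a minute) are the figures quoted above, and the 219 count carries the same question mark it has in the post.

```python
def minutes_between_completions(tasks_done, hours):
    """Average minutes between task completions over the observation window."""
    return hours * 60 / tasks_done


def model_runs_per_completion(runs_per_minute, tasks_done, hours):
    """How many times the scheduling model runs, on average, per finished task."""
    return runs_per_minute * minutes_between_completions(tasks_done, hours)


if __name__ == "__main__":
    print(round(minutes_between_completions(252, 24), 1))   # with FreeHAL
    print(round(minutes_between_completions(219, 24), 1))   # without FreeHAL
    print(round(model_runs_per_completion(5, 252, 24), 1))  # at 5 runs a minute
```

With 252 completions a day, a processor comes free roughly every 5.7 minutes on average, so at 5 model runs a minute the schedule is being recomputed nearly 30 times for every task that actually finishes.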

Profile Michael Goetz
Avatar
Send message
Joined: 2 Mar 09
Posts: 124
Credit: 60,073,744
RAC: 11,430
Level
Thr
Message 8303 - Posted: 8 Apr 2009 | 16:19:51 UTC - in response to Message 8301.

Paul,

The behavior you describe... sheesh, it's so correct I've always thought that was the *intended* behavior. BOINC has always behaved like that, as you say.

From what I can tell, BOINC re-calculates its short term debts (and its work scheduling) whenever various events occur, and a lot of things that it does incur these recalculations.

Under lots of circumstances, it also appears to override the 60 minute run-time preference (or 720 in your configuration).

This is one of the reasons all of my computers now remove tasks from memory when they are suspended. It's not the only reason, however: Some tasks use extremely large amounts of RAM and need to be removed if anyone is using the computer.

I've found that by letting the tasks get removed from memory that the footprint of having *many* suspended tasks is eliminated, and the overhead from reloading the tasks is actually insignificant.

It also greatly improves system latency for anyone who is using the computer.

Of course, this only works with applications that checkpoint. Then again, I consider not checkpointing to be an excellent reason to detach from a project.

Bottom line is that I have always seen this behavior in BOINC, but it's simply not an issue if you're not leaving tasks in memory. Yes, you lose some efficiency, especially with tasks that have long times between checkpoints.

BTW... the WORST behavior I've observed from BOINC with unnecessary swapping is when it runs short of RAM. Here's a real life example I observed recently. A single task (a huge Lattice GARLI WU) was using 1.2 gig of RAM. The computer was set to limit BOINC RAM use to 1.0 gig when the computer is in use. If that GARLI is running and I start using the computer (it's a quad core), you would expect BOINC to suspend the GARLI task, right? It HAS to be removed, period, because it alone uses more RAM than BOINC is allowed.

But, of course, it doesn't work that way. BOINC decided somehow to remove a smaller task. That left a free core, so it started up a new task on that core -- from the same project as the task it just suspended!!! It's still over committed on memory, so a few seconds later (after that new task initializes), it suspends the brand new task -- and starts yet another new task from the same project.

Rinse and repeat until every single task from that project moved from the "Ready to start" status to the "Ready to Run" status with 5 to 10 seconds of run time on each task.

Eventually, of course, after trying to run nearly every unstarted non-GPU task on the system, it got around to actually suspending the 1.2 gig GARLI task that was causing the problem. But by that time I had 30+ suspended tasks.
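
What I expected BOINC to do can be written down in a few lines of Python. This is a hedged sketch of one possible policy (suspend the biggest memory user first), not a description of BOINC's code; `tasks_to_suspend` is a hypothetical helper, and the three 0.1 gig tasks are invented to fill out the quad core. Only the 1.2 gig GARLI task and the 1.0 gig limit come from the example above.

```python
def tasks_to_suspend(running, limit_gb):
    """Pick tasks to suspend, largest memory user first, until usage fits.

    running maps task name to RAM use in gigabytes.
    """
    total = sum(running.values())
    victims = []
    by_size = sorted(running.items(), key=lambda item: item[1], reverse=True)
    for name, ram in by_size:
        if total <= limit_gb:
            break                 # already under the limit, stop suspending
        victims.append(name)
        total -= ram
    return victims


if __name__ == "__main__":
    # GARLI figure is from the post; the small tasks are assumed values.
    running = {"garli": 1.2, "small_1": 0.1, "small_2": 0.1, "small_3": 0.1}
    print(tasks_to_suspend(running, limit_gb=1.0))
```

Under this policy the GARLI task is the only one suspended, and nothing else gets started and stopped along the way.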

Stupid, stupid, stupid.

BTW, the main reason I run BOINC as 'remove from memory' is to leave more ram for users. I'd rather lose credits than have my kids complaining about the computers being slow.


____________
Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.

Profile Paul D. Buck
Send message
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 8305 - Posted: 8 Apr 2009 | 16:39:16 UTC

Michael,

So far on the mailing list the reception is that BOINC is fine and I am making a big deal out of nothing. Sadly, this seems to be the first reaction anytime an issue is reported...

In essence every event that changes the state of the system in relationship to a task does it. Start an upload, recalculate, finish an upload, recalculate, finish a task, recalculate (this makes sense actually), finish a download, recalculate (this ALMOST makes sense, unless you just recalculated a minute or so ago); oh well, there are other actions that trigger ... and this is again my point that on "larger" systems these triggers come so fast that it is lunacy... on a single threaded machine yes these triggers make sense... but not when you have many cores ... I guess that on the up and coming 16 CPU machines it is going to be really insane ...
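
One way to tame those triggers is to rate-limit them. The Python sketch below is purely illustrative: `RescheduleGate`, the event names, and the 60 second minimum interval are all assumptions made for this example, not anything taken from the BOINC source. A finished task always triggers a recalculation (that one makes sense), while upload and download events are folded into the next allowed run.

```python
class RescheduleGate:
    """Decide whether an event should trigger a schedule recalculation now."""

    def __init__(self, min_interval_s=60.0):
        self.min_interval_s = min_interval_s   # assumed value, for illustration
        self.last_run = None                   # time of the last recalculation
        self.pending = False                   # an event was deferred

    def on_event(self, kind, now):
        """Return True if the scheduler should run now, False to defer."""
        urgent = kind == "task_finished"       # a resource just came free
        due = (
            self.last_run is None
            or now - self.last_run >= self.min_interval_s
        )
        if urgent or due:
            self.last_run = now
            self.pending = False
            return True
        self.pending = True                    # remember work for the next run
        return False


if __name__ == "__main__":
    gate = RescheduleGate(min_interval_s=60.0)
    events = [
        ("upload_started", 0), ("upload_finished", 5),
        ("download_finished", 20), ("task_finished", 30),
        ("download_finished", 45), ("upload_started", 95),
    ]
    for kind, t in events:
        print(t, kind, gate.on_event(kind, t))
```

On a machine finishing a task every few minutes, a gate like this would still recalculate at every completion but would ignore the bursts of transfer events in between.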

The reason I leave in memory is that many applications, though they should be "suspendible" and removable from memory don't always react well with that setting. Thus you can have tasks that are suspended and crash costing you the time.

Heck there is evidence that using the restriction on allowing 100% of computer CPU use (putting it at 98% for example) causes some tasks to abort (Einstein and Rosetta).

Have you posted this behavior on the mailing lists? Will you? If you are not interested, may I summarize it and post it?

Profile Michael Goetz
Avatar
Send message
Joined: 2 Mar 09
Posts: 124
Credit: 60,073,744
RAC: 11,430
Level
Thr
Message 8311 - Posted: 8 Apr 2009 | 18:08:34 UTC - in response to Message 8305.

Feel free to summarize and repost. I'm not up to leading this crusade right now.

I'm kind of tied up recovering data and migrating to a new computer -- the main disk in my primary system died yesterday.

Between recovering data, setting up raid on the new system, migrating data, installing apps, physically moving hardware, etc., I'm feeling kinda swamped right now.

Mike

____________
Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.

Profile Paul D. Buck
Send message
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 8317 - Posted: 9 Apr 2009 | 5:43:50 UTC

I started a 6.6.21 thread in that there is a version of that name available...

dyeman
Send message
Joined: 21 Mar 09
Posts: 35
Credit: 591,434,551
RAC: 0
Level
Lys
Message 8326 - Posted: 9 Apr 2009 | 21:31:26 UTC

6.6.20 has replaced 6.4.7 as the official BOINC release.

Profile Paul D. Buck
Send message
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 8328 - Posted: 9 Apr 2009 | 22:07:50 UTC

Yes, and there are a couple potential major issues that may bite us in the butt too ...

Not to mention the annoyances I am trying to get some attention on ...
