The project will be taken down in about an hour to perform an update of the BOINC server code. Ideally you shouldn't notice anything, but usually the world isn't ideal. See you again on the other side.
RE: RH - Please let me know
Remove project / add project doesn't normally change the HostID - BOINC is designed to recycle the numbers, if for example it recognises the IP address and hardware configuration.
Doesn't matter if it's one at a time or multiples at a time, but it's probably best if you don't mix task types (whether from this project or across projects). If I do start monitoring your host - thanks for the offer - it would help the other observers if you could tell us a bit about any configuration details which can't be observed from the outside - and GPU utilisation factor is one of those.
Don't bust a gut changing things over. I need a bit of a breather, and to set up and get used to a replacement monitor: and Bernd needs to test some more new server code fixes next week, which will give us a new set of apps (designated as 'beta', but in reality the same as the existing ones) with blank application_details records to have a go at.
I didn't want to spam the boards with my stats - just milestone threads - but apparently signatures are no longer optional. Follow the link if you're interested.
http://www.boincsynergy.com/images/stats/comb-3475.jpg
With new hosts and a new monitor, let's see how that looks.
I've knocked out the old data (and with it, the extreme data points) - but even so, Juan's new machines show very wide scatter.
Here's that in figures:
[pre] Jason Holmis Claggy Juan Juan Juan RH RH
Host: 11363 2267 9008 10352 10512 10351 5367 5367
GTX 780 GTX 660 GT 650M GTX 690 GTX 690 GTX 780 GTX 670 GTX 670
Credit for BRP4G, GPU
Maximum 2708.58 2197.18 10952.0 7209.47 6889.8 6652.9 4137.85
Minimum 115.82 88.84 153.90 1667.23 1244.41 1546.02 1355.49
Average 1326.79 1277.87 3631.58 2728.70 2198.10 2463.06 2007.02
Median 1541.35 1411.09 2426.03 2135.67 1948.04 2091.49 1910.19
Std Dev 628.07 690.05 2712.34 1403.91 942.62 969.59 305.80
nSamples 76 102 71 52 43 44 459
Runtime (seconds) (before)(after)
Maximum 5027.36 5088.99 11295.0 5605.83 8922.7 3182.0 4191.43 5099.40
Minimum 3239.20 3294.83 8122.09 3081.97 3854.24 1852.2 4061.45 4284.52
Average 3645.57 4549.28 8902.94 4411.88 6305.41 2342.3 4128.08 4686.13
Median 3535.46 4769.05 8847.82 3673.33 5127.40 1864.0 4127.35 4672.83
Std Dev 344.17 456.55 508.22 998.49 1932.50 615.41 20.40 204.66
nSamples 365 94
Turnround (days)
Maximum 6.09 3.91 2.75 0.08 0.45 0.22 0.91
Minimum 0.13 0.07 0.13 0.04 0.05 0.02 0.15
Average 1.94 1.46 0.90 0.05 0.09 0.03 0.67
Median 1.46 1.54 0.79 0.04 0.06 0.03 0.69
Std Dev 1.78 1.00 0.65 0.01 0.06 0.03 0.12 [/pre]
All three of Juan's machines are showing a very wide variation in runtime - he'll have to explain that by local observation, I can't pick it up from the website.
RE: What is most helpful
On the basis of that guidance I am going to provide multiple weak systems that will run only Albert and will remain untouched after initial setup. Also, I'll go "natural" without multiple work units or doing anything with the clocks.
These will be new hosts (really low-powered hosts) so won't carry any prior statistics or other baggage with them.
I'll get on it, shortly.
If you need something different, I think Juan and I are both ready to make any sacrifice of "credits" if we are being helpful.
Computer 11519
Pretending to be a new user. New install of GPU, new install of drivers, new install of BOINC.
First work fetch of BRP4G-opencl-ati has estimated runtime of 10 seconds.
Obviously, they are erroring-out.
Run time 3 min 40 sec
Exit status 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED
I know what the fix is, but I'm not concerned with fixing it. I'm concerned with helping you fix it.
What do you want me to do?
7.2.42
Maximum elapsed time exceeded
Activated exception handling...
[22:05:40][3552][INFO ] Starting data processing...
[22:05:41][3552][INFO ] Using OpenCL platform provided by: Advanced Micro Devices, Inc.
[22:05:41][3552][INFO ] Using OpenCL device "Juniper" by: Advanced Micro Devices, Inc.
[22:05:41][3552][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
------> Starting from scratch...
[22:05:41][3552][INFO ] Header contents:
------> Original WAPP file: ./p2030.20130202.G202.32-01.96.N.b0s0g0.00000_DM209.60
------> Sample time in microseconds: 65.4762
------> Observation time in seconds: 274.62705
------> Time stamp (MJD): 56326.065838408722
------> Number of samples/record: 0
------> Center freq in MHz: 1214.289551
------> Channel band in MHz: 0.33605957
------> Number of channels/record: 960
------> Nifs: 1
------> RA (J2000): 62454.7106018
------> DEC (J2000): 83413.5978003
------> Galactic l: 0
------> Galactic b: 0
------> Name: G202.32-01.96.N
------> Lagformat: 0
------> Sum: 1
------> Level: 3
------> AZ at start: 0
------> ZA at start: 0
------> AST at start: 0
------> LST at start: 0
------> Project ID: --
------> Observers: --
------> File size (bytes): 0
------> Data size (bytes): 0
------> Number of samples: 4194304
------> Trial dispersion measure: 209.6 cm^-3 pc
------> Scale factor: 0.00111372
[22:05:46][3552][INFO ] Seed for random number generator is 1168661235.
[22:05:56][3552][INFO ] Derived global search parameters:
------> f_A probability = 0.08
------> single bin prob(P_noise > P_thr) = 1.32531e-008
------> thr1 = 18.139
------> thr2 = 21.241
------> thr4 = 26.2686
------> thr8 = 34.6478
------> thr16 = 48.9581
[22:06:42][3552][INFO ] Checkpoint committed!
[22:07:44][3552][INFO ] Checkpoint committed!
[22:08:46][3552][INFO ] Checkpoint committed!
[22:09:20][3552][INFO ] OpenCL shutdown complete!
[22:09:20][3552][WARN ] BOINC wants us to quit prematurely or we lost contact! Exiting...
Thanks,
I had hoped the new host+app onramp for GPUs would improve, but see that it hasn't. I'm not surprised, given that we know two precise mechanisms there: the default GPU efficiency pinned at 10% (0.1), and improperly applied normalisation (you can't normalise time estimates without a functional host_scale, which is disabled for the onramp period).
New user, host &/or application is central to this effort, so thanks again for the information. At this point you could either choose to jigger the bounds of tasks (allowing it to reach where host_scale kicks in) or alternatively let it go on erroring & see what happens (I imagine it'd just keep erroring & reduce quota to 1/day).
Both options have merit so it's your choice, though I think the jiggering option has been pretty thoroughly used, and the second one is more likely in common usage cases. Up to you.
Jason
On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question. - C Babbage
RE: At this point you
That's what happened. Down to 1 wu/day and I'm done for the day.
Man, am I ever glad I drove that one hour round trip in a 15mpg vehicle to try to get a steady stream of work headed Albert's direction.
There's always the 1 wu I'll get tomorrow.
lol, yeah, all in a good cause though :) Obvious breakage like that makes the case put forward in some quarters - that it's working fine - look a tad on the ridiculous side. The more 'normal' situations like that, that simply don't work, the better we understand, and can push to get it fixed once and for all.
From treblehit's server log https://albert.phys.uwm.edu/host_sched_logs/11/11519
I do think we ought to try and work out exactly where those figures come from. As with the numbers Claggy and I saw right at the beginning of this thread, they are vastly higher than any known 'peak FLOPs' value calculated and displayed by the BOINC client for any known GPU. At the very most, that calculated speed (or some rule-of-thumb fraction of it) should be used as a sanity cap on the PFC avg number - once we've understood what PFC avg is in this context, and how it came to be that way.
RE: I do think we ought to
Doesn't the Main project have this adjustment because they have a single DCF there? But we don't use DCF here, so this adjustment shouldn't be used?
Claggy
RE: From treblehit's server
Sure, first from client perspective:
referring to the dodgy diagram, factoring in the bad onramp-period default pfc_scale of 0.1 for GPUs, and inactive host_scale (x1), results in:
wu pfc ('peak flop claim') est = 0.1 * 1 * wu_est (10% of minimum possible)
device peak_flops for a standard GPU is likely ~20x the actual rate (app, card & system dependent)
--> est about 1/200th of required elapsed --> bound exceeded
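That chain can be checked with a few lines (a Python sketch, not BOINC source; the 0.1 default, inactive host_scale, and ~20x peak-vs-actual ratio are the figures quoted in this thread, and the 4000 s task time is a made-up example):

```python
# Sketch of the estimate chain above (illustrative, not BOINC source).
pfc_default = 0.1        # onramp default pfc for GPU app versions
host_scale = 1.0         # inactive during the onramp period
peak_over_actual = 20.0  # marketing peak flops vs. the rate the app sustains

# The improper normalisation divides by the pfc, so the assumed speed is
# (1/0.1) = 10x marketing peak, i.e. ~200x the real speed:
assumed_over_real = (1.0 / pfc_default) * host_scale * peak_over_actual

# The elapsed-time estimate shrinks by the same factor:
true_elapsed = 4000.0                       # a task that really takes ~4000 s
estimate = true_elapsed / assumed_over_real

print(assumed_over_real)  # 200.0
print(estimate)           # 20.0 s -> blows the time-limit bound immediately
```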
Now digging through server end...
RE: RE: I do think we
There is no adjustment; the adjustment is a lie. It is hard-wired active for all clients >= 7.0.28.
RE: RE: RE: I do think
But only on projects that don't use DCF. Einstein, on my i7-2600K/HD7770, has a DCF of:
1.267963
Albert, of course, has:
Claggy
RE: RE: RE: RE: I do
Well, you've lost me there, because every scheduler reply to a >= 7.0.28 client, according to the scheduler code, pushes it [and there is no configuration switch for it].
RE: RE: RE: RE: Quote
Einstein has an older scheduler than Albert (or at least server version):
29/06/2014 11:45:58 | Einstein@Home | sched RPC pending: Requested by user
29/06/2014 11:45:58 | Einstein@Home | [sched_op] Starting scheduler request
29/06/2014 11:45:58 | Einstein@Home | Sending scheduler request: Requested by user.
29/06/2014 11:45:58 | Einstein@Home | Not requesting tasks: "no new tasks" requested via Manager
29/06/2014 11:45:58 | Einstein@Home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
29/06/2014 11:45:58 | Einstein@Home | [sched_op] ATI work request: 0.00 seconds; 0.00 devices
29/06/2014 11:46:00 | Einstein@Home | Scheduler request completed
29/06/2014 11:46:00 | Einstein@Home | [sched_op] Server version 611
29/06/2014 11:46:00 | Einstein@Home | Project requested delay of 60 seconds
29/06/2014 11:46:00 | Einstein@Home | [sched_op] Deferring communication for 00:01:00
29/06/2014 11:46:00 | Einstein@Home | [sched_op] Reason: requested by project
29/06/2014 11:46:05 | Albert@Home | sched RPC pending: Requested by user
29/06/2014 11:46:05 | Albert@Home | [sched_op] Starting scheduler request
29/06/2014 11:46:05 | Albert@Home | Sending scheduler request: Requested by user.
29/06/2014 11:46:05 | Albert@Home | Reporting 2 completed tasks
29/06/2014 11:46:05 | Albert@Home | Not requesting tasks: don't need
29/06/2014 11:46:05 | Albert@Home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
29/06/2014 11:46:05 | Albert@Home | [sched_op] ATI work request: 0.00 seconds; 0.00 devices
29/06/2014 11:46:08 | Albert@Home | Scheduler request completed
29/06/2014 11:46:08 | Albert@Home | [sched_op] Server version 703
29/06/2014 11:46:08 | Albert@Home | Project requested delay of 60 seconds
29/06/2014 11:46:08 | Albert@Home | [sched_op] handle_scheduler_reply(): got ack for task h1_0997.10_S6Direct__S6CasAf40_997.55Hz_1017_1
29/06/2014 11:46:08 | Albert@Home | [sched_op] handle_scheduler_reply(): got ack for task p2030.20130202.G202.32-01.96.N.b2s0g0.00000_2384_5
29/06/2014 11:46:08 | Albert@Home | [sched_op] Deferring communication for 00:01:00
29/06/2014 11:46:08 | Albert@Home | [sched_op] Reason: requested by project
Claggy
Ah, all right,
Yeah only interested in fixing current code, rather than diagnosing/patching old versions :)
RE: Ah allright, Yeah
I was thinking that they were using Einstein customisations here that might not be needed; looking at robl's Einstein log shows it's the durations that get scaled there:
http://einstein.phys.uwm.edu/hosts_user.php?userid=613597
2014-06-29 09:28:50.6296 [PID=17986] [send] [HOST#7536795] Sending app_version 483 einsteinbinary_BRP5 7 139 BRP5-cuda32-nv270; 49.97 GFLOPS
2014-06-29 09:28:50.6312 [PID=17986] [send] est. duration for WU 193304662: unscaled 9004.88 scaled 18527.18
2014-06-29 09:28:50.6312 [PID=17986] [HOST#7536795] Sending [RESULT#443159459 PB0024_00191_182_0] (est. dur. 18527.18 seconds)
2014-06-29 09:28:50.6324 [PID=17986] [send] est. duration for WU 193307638: unscaled 9004.88 scaled 18527.18
2014-06-29 09:28:50.6324 [PID=17986] [send] [WU#193307638] meets deadline: 18527.18 + 18527.18 < 1209600
2014-06-29 09:28:50.6332 [PID=17986] [send] [HOST#7536795] Sending app_version 483 einsteinbinary_BRP5 7 139 BRP5-cuda32-nv270; 49.97 GFLOPS
2014-06-29 09:28:50.6347 [PID=17986] [send] est. duration for WU 193307638: unscaled 9004.88 scaled 18527.18
2014-06-29 09:28:50.6347 [PID=17986] [HOST#7536795] Sending [RESULT#443165551 PB0024_00141_24_0] (est. dur. 18527.18 seconds)
2014-06-29 09:28:50.6356 [PID=17986] [send] est. duration for WU 193249827: unscaled 9004.88 scaled 18527.18
2014-06-29 09:28:50.6356 [PID=17986] [send] [WU#193249827] meets deadline: 37054.37 + 18527.18 < 1209600
2014-06-29 09:28:50.6364 [PID=17986] [send] [HOST#7536795] Sending app_version 483 einsteinbinary_BRP5 7 139 BRP5-cuda32-nv270; 49.97 GFLOPS
2014-06-29 09:28:50.6380 [PID=17986] [send] est. duration for WU 193249827: unscaled 9004.88 scaled 18527.18
2014-06-29 09:28:50.6381 [PID=17986] [HOST#7536795] Sending [RESULT#443038987 PB0023_01561_144_0] (est. dur. 18527.18 seconds)
Claggy
Now the server side: that 'Best version of app' string comes from sched_version.cpp (scheduler inbuilt functions) and uses the following resources:
app->name, bavp->avp->id, bavp->host_usage.projected_flops/1e9
That projected_flops is set during app version selection; as the number of samples will be < 10, flops will be adjusted based on the pfc samples average for the app version (there will be 100 of those from other users).
Since that's normalised elsewhere (see red ellipse on dodgy diagram), the net effect translates the pfc of 0.1 used for the original estimate to 1, so peak_flops is x10-20.
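As a sketch of that selection step (names are mine, not taken from sched_version.cpp; the <10 samples threshold and the 0.1 figures are from the description above):

```python
# Sketch (names mine): projected_flops for a host with <10 pfc samples falls
# back to the app-version pfc average, normalised against a reference app
# version -- the step flagged here as the design flaw, since the two averages
# largely cancel and the efficiency drops back out of the projection.

def pick_projected_flops(peak_flops, n_host_samples, av_pfc_avg, ref_pfc_avg):
    if n_host_samples < 10:
        effective_pfc = av_pfc_avg / ref_pfc_avg  # ~0.1 / ~0.1 -> ~1
        return peak_flops * effective_pfc
    return None  # converged (host_scale) path not sketched here

# Hypothetical figures: both averages near the hard-wired 0.1 default, so the
# projection collapses to raw marketing peak, i.e. 10-20x the real speed.
print(pick_projected_flops(3584e9, 0, 0.1, 0.1))  # 3.584e+12
```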
RE: I was thinking that
Yeah, they were before. Quite a lot of work Bernd had to do to get here to stock updated server code. Now (here) it should be pretty close or identical (for our purposes) to current BOINC master, IIRC.
RE: Now the server side,
Richard do you want code line numbers for that ?
RE: Ah allright, Yeah
Yes, concentrating on the current code and moving it forward is certainly the right approach - but it's probably worth just being aware of the steps we moved through to reach this point, because it can influence compatibility problems that could arise in the future.
As we've discussed, DCF was deprecated from client v7.0.28, and in the server code from a little earlier. But not everything in the BOINC world moves in lockstep, so we have older and newer servers in use, and we also have older and newer clients in use.
Older servers take account of client DCF when scaling runtime estimates prior to allocating work:
[send] active_frac 0.999987 on_frac 0.999802 DCF 0.776980
Newer servers don't:
[send] on_frac 0.999802 active_frac 0.999987 gpu_active_frac 0.999978
Those are both the same machine (the one I've been graphing here), which explains why on_frac and active_frac are identical. But the first line comes from the Einstein server log, and the second line from the Albert server log.
So, even my late-alpha version of BOINC (v7.3.19) is maintaining, using and reporting DCF against an 'old server' project which needs it. Good compatibility choice.
But the reverse case is not so happy. An older client (I'm talking standard stock clients here, not Jason's specially-tweaked client) will carry on using and reporting DCF as before, because it doesn't parse the tag. But the newer server code has discarded DCF completely, and doesn't scale its internal runtime estimates when presented with a work request from a client which is still using it.
This can - and does - result in servers allocating vastly different volumes of work from what the client expects, because the estimation process doesn't have all the same inputs.
Say, for the sake of argument, that an 'old' (pre-v7.0.28) client has got itself into a state with DCF=100, and asks for 1 day of work. For the BRP4G tasks we're studying here, we'd all expect the server to allocate maybe 20 tasks, and the client to agree with the server calculation of estimated runtime, slightly over 1 day. But if the client is using DCF, and the server isn't, that can appear as a 100 day work cache when the client does the local calculation. That's a case where server-client compatibility breaks down, and breaks down badly.
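Putting rough numbers on that scenario (a sketch: only DCF=100, the 1-day request, and the ~20 tasks/day figure come from the post; the per-task estimate is hypothetical):

```python
# Sketch of the old-client / new-server mismatch described above.
dcf = 100.0          # pathological old-client duration correction factor
task_est_s = 4320.0  # hypothetical server estimate per BRP4G task (1.2 h)
request_s = 86400.0  # client asks for 1 day of work

# The new server ignores DCF entirely when sizing the allocation:
tasks_sent = int(request_s // task_est_s)  # 20 tasks

# The old client still multiplies every estimate by its DCF locally:
client_cache_days = tasks_sent * task_est_s * dcf / 86400.0

print(tasks_sent)         # 20
print(client_cache_days)  # 100.0 -> a "100 day" cache from the client's view
```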
It's a bit of a stretch to examine border cases when the standard setup doesn't even work right. IMO let's start at the common case & work outward, because I guarantee if the numbers come up flaky there, then they aren't going to be magically better with incompatible servers and clients.
For the present question (treblehit's example): specifically, the old Project DCF isn't involved in treblehit's example, on Albert, in any way (even though maintained by the client). It's the improper normalisation with inactive host_scale appearing in another form
... however...
since both host_scale and pfc_scales are somewhat noisy and unstable 'per app DCFs' in disguise, and improperly normalised, it amounts to familiar sets of wacky number symptoms. If you keep looking for those you will find them everywhere, because the entire system is dependent on these, and you'd just end up swearing Project DCF is active server side - which in a sense (through a lot of spaghetti) it is, though it isn't called that, and is per app version and per host app version instead.
i.e. forget Project DCF (for now), use pfc_scale & host_scale.
RE: RE: Now the server
right, that's what I meant by line numbers (with brief description)
Claggy's case:
his marketing flops estimate is peak_flops / app version pfc.
The app version pfc is normalised to 0.1 (design flaw), and any real samples would have driven it toward 0.05 or lower, so that figure should be 10-20x+ marketing flops, and is NOT the intent, nor remotely correct design. It's gibberish.
RE: I can't quickly find
17/06/2014 18:17:17 | | CAL: ATI GPU 0: AMD Radeon HD 7700 series (Capeverde) (CAL version 1.4.1848, 1024MB, 984MB available, 3584 GFLOPS peak)
17/06/2014 18:17:17 | | OpenCL: AMD/ATI GPU 0: AMD Radeon HD 7700 series (Capeverde) (driver version 1348.5 (VM), device version OpenCL 1.2 AMD-APP (1348.5), 1024MB, 984MB available, 3584 GFLOPS peak)
17/06/2014 18:17:17 | | OpenCL CPU: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz (OpenCL driver vendor: Advanced Micro Devices, Inc., driver version 1348.5 (sse2,avx), device version OpenCL 1.2 AMD-APP (1348.5))
Claggy
there you go. app version pfc average (!) is 3584GFLOPS/34968.78 ~= 0.102**
[Edit:]
** unfortunately, that's improperly normalised, so meaningless without the normalisation reference app version figure, as per red ellipse on diagram... so the true figure will be likely around 0.02 or so, but anybody's guess without saying what app version is at 0.1
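For what it's worth, the back-division is easy to reproduce (figures as quoted above; the interpretation as an improperly normalised pfc average is the one argued in this thread):

```python
# Back-calculating the app-version pfc average from Claggy's numbers.
peak_gflops = 3584.0         # client-reported marketing peak (HD 7700 series)
projected_gflops = 34968.78  # the absurd figure the scheduler worked with

pfc_avg = peak_gflops / projected_gflops
print(round(pfc_avg, 3))     # 0.102 -> projected = peak / pfc: the efficiency
                             # was divided out rather than multiplied in
```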
RE: app version pfc is
The advice given to project administrators in http://boinc.berkeley.edu/trac/wiki/AppPlanSpec is:
I'm wondering whether they put in 0.1, expecting this to be a multiplier (real flops are lower than peak flops), but end up dividing by 0.1 instead? And from what you say, 'default 1' doesn't match the code either?
Edit: the alternative C++ documentation for plan_classes is in http://boinc.berkeley.edu/trac/wiki/PlanClassFunc. There, the example is
.21 // estimated GPU efficiency (actual/peak FLOPS)
At least one of those must be upside down.
RE: RE: app version pfc
Nope [0.1 is hardwired via 'magic number'], and 1 wouldn't be right for GPU anyway. Correct would be ~0.05: don't normalise (except for credit), and enable + set a default host_scale of 1 from the start... which would yield a projected flops (before convergence) of 0.05 x 1 x peak_flops... basically one 20th of the marketing flops... then [let it] scale itself.
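Side by side, that's the difference between the current projection and the proposed one (a sketch of the proposal as described here, not committed code; the peak figure is hypothetical):

```python
# Current: improper normalisation effectively divides peak by the 0.1 pfc.
# Proposed: multiply by a ~5% default efficiency and a host_scale of 1,
# then let host_scale converge from there.

peak = 3584e9  # hypothetical marketing peak flops

current_projected = peak / 0.1 * 1.0    # divide by pfc
proposed_projected = 0.05 * 1.0 * peak  # multiply by efficiency

print(current_projected / peak)   # ~10.0 (10x marketing flops)
print(proposed_projected / peak)  # ~0.05 (one 20th of marketing flops)
```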
See edit to my last. In my view, if the relevant numbers are all <<1, we should be multiplying by them, not dividing by them.
Out of coffee error - going shopping. Back soon.
RE: See edit to my last. In
The main issue is really that he starts with real marketing flops (more or less usable), works out an average efficiency there (yuck, but still OK-ish), but then he normalises to some other app version... IOW multiplies by some arbitrary large number (or divides by some fraction, if you prefer) with no connection to real throughputs or efficiencies in this device+app.
That's OK for a relative number for credit (debatable)... but totally useless for time and throughput estimates (which are absolute estimates). Improper normalisation shrank your estimate by multiplying the projected_flops to 10x+ the bloated marketing flops.
RE: At least one of those
In a sense, yes. GPU app+device+conditions efficiency would be actual/peak, and must be less than 1 (and it is; e.g. it should be around 0.05 for a single-task CUDA GPU). Normalisation could be viewed as turning it upside down. It'll raise the GFlops & shrink the time estimate artificially --> the exact opposite of the kind of behaviour we want for new hosts/apps.
A bit will become clearer when I have the next dodgy diagram ready. Getting bogged down in broken code is a bit of a red herring at the moment, as there are design-level issues to tackle first.
In particular, debugging the normalisation, including the absurd GFlops numbers it produces, is pointless in the context of estimates. That's because neither the time nor GFlops should be normalised [AT ALL], so it all gets disabled in estimates, and restricted to credit-related uses, where it's applicable for getting the same credit claims from different apps.
RE: RE: At least one of
Well, we do (crudely) have two separate cases to deal with.
1) initial attach. We have to get rid of that divide-by-almost-zero, or hosts can't run. They get the absurdly low runtime estimate/bound and error when they exceed it.
2) steady state. In my (political) opinion, trying to bring back client-side DCF will be flogging one dead horse too many. We need some sort of server-side control of runtime estimates, so that client scheduling works and user expectations are met. I'm happy to accept that the new version will be different to the one we have now, and look forward to seeing it.
OK, I'll get out of your hair, and take my coffee downstairs to grab some more stats.
RE: RE: RE: At least
LoL, always appreciate bouncing it around, thanks. At the moment it's a bit like pointing to a bucket of kittens and saying 'that's not the flower-pot I ordered!'. Yeah it's possible to debate over the intent versus function more, but when push comes to shove it's just wrong & gives wacky numbers. Not really any more complicated than that in some sense ;)
[pre]June 29, 2014 18:00 UTC
https://albertathome.org/host/9649
BRP4G 2x using 1 cpu thread each (app_config), GPU utilization = 92%
running an additional 4x Skynet POGs cpu WUs
GPU 7950 mem=1325, gpu=1150, pcie v2 x16
OS Win7 x64 Home Premium
CPU 980X running at 3.41 GHz with HT off
MEM Triple channel 1600 (7.7.7.20.2)[/pre]
RE: 1) initial attach. We
I'll be bringing more machines online today in a desperate attempt to provide steady, un-fiddled-with, untweaked, vanilla BRP4G work for you.
I just need instructions: A) let them fail so you can see that, or B) somehow prevent them from failing so that you have the reliable work-flow.
Instructions, please.
Bret
Um, if you don't mind, I think it might be best to wait a little time. The administrators on this project are based in Europe, and as you know Jason is ahead of our time-zone, in Australia. I think it might be better to wait 12 hours or so, until we have a chance to compare notes by email when the lab opens in the morning.
After all, we don't want to use up our entire supply of unattached new hosts in one hit, or else we won't have anything left to test Jason's patches with....
RE: Um, if you don't mind,
Yes, unhooking that normalisation ( which divides by ~0.1, multiplies the GPU GFlops x~10 into absurd levels, and shrinks time estimates) is going to take quite some preparation to unhook *safely*. That same mechanism is hooked into credit (where it does make sense), so quite a lot of backwards & forwards for clarification, discussion and debate will be needed to get it 'right', and part of that's going to be me communicating effectively (which isn't always easy :)).
The other aspect is that some bandaids will be painful to rip off, and still other odd artefacts might be hiding inside... and the only way to tell for sure is to open it up.
The next few days will tell if we're all on the same page (but looking from different angles is fine). To me though, we are well through the tricky bits of understanding the current system enough to say it needs to be a lot better.
Latest scattergram.
I've reverted my 5367 to normal running (early afternoon yesterday), so my timings *should* be lower and steadier - doesn't really seem to show in credit yet. I wonder why Claggy's laptop gets such variable credit?
RE: I wonder why Claggy's
Multiple tasks on a smaller GPU, each running longer, will generate higher raw peak flop claims (pfc's); then that's averaged with the wingman's (yellow triangle on dodgy diagram). So the result can be anywhere from normal range to jackpot, as we previously assessed, depending on the wingman's claim. Though the prevalence of the jackpot conditions is less obvious, the noise in the system is still there.
RE: RE: I wonder why
I'm just running a single GPU task on both my GPU hosts (the T8100's 128MB 8400M GS doesn't count).
Claggy
RE: RE: RE: I wonder
Could be the wingmen. (There are a number of combinations of wingman types that'll give random results between two regions. Two similar wingmen tend to cancel with averaging and come out 'normal'.)
RE: RE: RE: RE: I
Conversely, when he's paired with me - now back to lower, stable, runtimes - no jackpot, no bonus. Sorry 'bout that.
RE: RE: RE: RE: Quote
LoL, yep, throwing the dice to get an answer is as good as any ;)
@Richard/Claggy
Should I continue to crunch BRP4G only, or do you suggest I crunch another type of WU too (I could do GPU work only here)?
BTW, I slowed down my crunchers here, since I don't believe quantity is what you're looking for, and now they will produce a stable number of daily WUs.
RE: BTW I slow down my
I think that's probably a good idea. We're already at the stage where my last 12 consecutive validations have been against one or other of your hosts (5 different machines, I think). And the machines are all pretty similar, to each other and to mine: GTX 670/690/780, running Win7/64 or (in one case) Server 2008.
In order to see (now) and test (later) BOINC's behaviour in the real world, we probably need a reasonable variation in hosts to give us realistic variation in the times and credits.
Bernd has launched a new 'BRP5' (Perseus Arm Survey) v1.40, with a Beta app tag on it, to test that new feature in the BOINC scheduler. I'm in the process of switching my machine over to run that instead. Some company would be nice, but be warned: we're half expecting to fall over the 'EXIT_TIME_LIMIT_EXCEEDED' problem at some stage with BRP5 Beta, so hosts running it probably need to be watched quite closely for strange estimated runtimes, and you'll need to be ready to take action to correct them.
RE: ... some company would
I just downloaded my first v1.40 BRP5 and I'd say it's looking pretty good so far! The estimated completion time shown in Boinc is 5h03m08s.
These are the relevant lines from the scheduler log:
And I've got this in the application details:
For v1.39 the tasks took less than 5 hours and the APR was 21.91 GFlops.
Whatever was changed seems to be working with regard to the initial estimates, assuming the app and workload are more or less the same. Keep up the good work!
Nothing's been changed yet...
I got something similar - 25.25Gflops and 4h57m02s24
But note that line I've picked out: that means there are fewer than 100 completed tasks for this app_version yet, across the project as a whole.
The worry is that when 100 tasks have been completed, but before you have completed 11 tasks on your host (to use APR), you'll see 'adjusting projected flops based on PFC avg' and some absurdly large number. That'll be when the errors (if any) start.
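A rough sketch of that hand-over, with invented names and thresholds based on the description above (not the actual scheduler source): below 11 completed tasks the host's own measured rate (APR) isn't trusted yet, so a CreditNew-style scheduler falls back to a project-wide PFC average, which can be wildly wrong for any particular host.

```python
# Invented sketch of the APR / PFC-avg fallback described in the post.

APR_THRESHOLD = 11  # completed tasks needed before the host's APR is used

def projected_flops(host_completed, host_apr, project_pfc_avg):
    if host_completed >= APR_THRESHOLD:
        return host_apr          # host's own measured rate: sane
    return project_pfc_avg       # "adjusting projected flops based on PFC avg"

# Host that really does ~22 GFLOPS, but the project-wide average is inflated:
early = projected_flops(3, 22e9, 250e9)    # falls back to the inflated average
later = projected_flops(12, 22e9, 250e9)   # APR takes over once trusted
```

With the inflated `early` figure, the runtime estimate comes out far too short, which is exactly where the EXIT_TIME_LIMIT_EXCEEDED worry comes from.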
Roger that, will keep a close watch on things until I've completed my first 11 tasks then.
Well, here's the first conundrum:
All Binary Radio Pulsar Search (Perseus Arm Survey) tasks for computer 5367
After 200 minutes of solid GTX 670 work on Perseus, I earn the princely sum of ... 15 credits!
Yeah, I've got something similar: 13.01 credits for 150 minutes of HD7770 work:
https://albertathome.org/workunit/619367
Claggy