[New release] BRP app v1.23/1.24 (OpenCL) feedback thread

Bikeman (Heinz-Bernd Eggenstein)
Joined: 28 Aug 06
Posts: 164
Credit: 1864017
RAC: 0

RE: Just out of curiosity,

Message 79258 in response to message 79253

Quote:

Just out of curiosity, was the Einstein app ever run in double precision and compared to results of single precision calculations? I presume it was based on "does not need", but I'd be interested to know the difference.

If memory serves me right, the BRP (then called ABP-) app started with code that indeed used double precision for some parts of its computations, and ran only on CPUs. When the idea came up to implement a GPU version, the code was changed to use single precision in those parts (almost all of the code) that were supposed to go on the GPU. At that point the scientists made sure that the ability to find pulsars wasn't compromised by this change. Note that the task of the app is not to determine the characteristics of a pulsar detection to extremely high precision (this is done in post-processing of pulsar candidates and using re-observations), but to find candidate signals that stick out of the noise clearly enough to follow up on them. While this statement simplifies things quite a bit, it gives you an intuitive idea of why single precision is OK for this search.
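To give a toy illustration of that intuition (this is not the actual BRP pipeline, just a sketch): if a candidate signal sticks out of the noise by a large factor, rounding the input data to single precision does not change which frequency bin is flagged as the candidate.

```python
import numpy as np

# Illustrative sketch only: inject a strong tone into Gaussian noise and
# show that single-precision data yields the same candidate bin as
# double-precision data.
rng = np.random.default_rng(42)
n = 4096
t = np.arange(n)
signal = 0.5 * np.sin(2 * np.pi * 100 * t / n)  # injected "pulsar" tone at bin 100
data64 = signal + rng.normal(0.0, 1.0, n)       # float64 data
data32 = data64.astype(np.float32)              # same data, rounded to float32

def loudest_bin(x):
    """Return the index of the strongest non-DC bin in the power spectrum."""
    power = np.abs(np.fft.rfft(x)) ** 2
    return int(np.argmax(power[1:]) + 1)        # skip the DC bin

# Both precisions flag the same candidate for follow-up.
assert loudest_bin(data64) == loudest_bin(data32)
```

The peak's power here exceeds the typical noise bin by more than an order of magnitude, so the rounding error introduced by float32 is irrelevant to the detection decision, which is the point HB is making about candidate searches.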

Cheers
HB

Infusioned
Joined: 11 Feb 05
Posts: 38
Credit: 149000
RAC: 0

Ah I understand. You need a

Message 79259 in response to message 79258

Ah, I understand. You need a way to cut through all the junk, and the volunteers are the garbage filter, which means "good enough" detection is OK. Understood.

Also, I checked my Milkyway@Home history to see if I was having validation issues there:

http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=429181

and all my work is validated instantly because they are set to a minimum quorum of 1. I don't know if that's due to the fact that I have 44 million credit and am being considered a trusted source (if such a thing is even designated by the server), or if that's just how the project is. I don't remember it being that way (I thought it used to be a quorum of 2).

So now that makes me nervous. If my results are off, the project isn't comparing them. And the project runs in double precision, so the results need to be accurate.

Bikeman (Heinz-Bernd Eggenstein)
Joined: 28 Aug 06
Posts: 164
Credit: 1864017
RAC: 0

I don't want to get too far

Message 79260 in response to message 79259

I don't want to get too far off topic here, but it so happens that there is a paper specifically on the validation strategies for the type of simulation that is done at Milkyway@Home, written by the MW scientists: http://www.cs.rpi.edu/~szymansk/papers/dais10.pdf. Just to cure your nervousness :-)

Cheers
HB

zombie67 [MM]
Joined: 10 Oct 06
Posts: 73
Credit: 30924459
RAC: 0

RE: Hmmm... GPU temperature

Message 79261 in response to message 79257

Quote:
Hmmm... GPU temperature is ok??

It is OC'd slightly. I will move back to stock and see if that makes a difference.

Dublin, California
Team: SETI.USA

Infusioned
Joined: 11 Feb 05
Posts: 38
Credit: 149000
RAC: 0

RE: I don't want to get too

Message 79262 in response to message 79260

Quote:

I don't want to get too far off topic here, but it happens there is a paper specifically on the validation strategies for the type of simulation that is done at Milkyway@Home, written by the MW scientists: http://www.cs.rpi.edu/~szymansk/papers/dais10.pdf. Just to cure your nervousness :-)

Cheers
HB

Excellent. I will read it in chunks to break up the day as I need breaks from my work. Thanks.

Edit:
OK, I lied; I read it all just now. So it seems that bad results aren't quite so bad, but they still negatively affect things. And, ironically enough, they do have trusted/untrusted host status for users.

I will try to dig more on this because I see I have a lot of inconclusive results for Einstein now. For what it is worth, I know there was an issue with NVIDIA cards silently overflowing and generating bad numbers on the Seti Beta app. However, that still doesn't excuse bad numbers from AMD 6xxx cards if that's the issue.

zombie67 [MM]
Joined: 10 Oct 06
Posts: 73
Credit: 30924459
RAC: 0

Looks like reducing the OC

Looks like reducing the OC solved it. I also upgraded from 12.3 to 12.4, so I can't be 100% sure. But whatever the case, it's working again.

Also, FWIW, I am running 3 at a time (.33), and still only ~45% GPU load. And this is with cores reserved, so the CPU has only ~90% load. Is it possible to get to >90% GPU load? Is there an upper limit on the number of simultaneous tasks?

Dublin, California
Team: SETI.USA

Bikeman (Heinz-Bernd Eggenstein)
Joined: 28 Aug 06
Posts: 164
Credit: 1864017
RAC: 0

RE: Looks like reducing the

Message 79264 in response to message 79263

Quote:

Looks like reducing the OC solved it. I also upgraded from 12.3 to 12.4. So I can't be 100% sure. But whatever the case, It's working again.

Also, FWIW, I am running 3 at a time (.33), and still only ~45% GPU load. And this is with cores reserved, so the CPU has only ~90% load. Is it possible to get to >90% GPU load? Is there an upper limit on the number of simultaneous tasks?

The upper limit is reached when the Video RAM is exhausted. So per GB of VRAM you should be able to execute at least 2, possibly 3 instances. It's hard to tell where the "sweet spot" is to maximize the overall output, so some experimentation with the number of "reserved" CPU cores (cores not allocated to CPU apps) and # of GPU jobs in parallel is the best way to find out.
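For anyone wanting to try this experiment: one way to run several GPU tasks per card is a BOINC app_config.xml in the project directory. The sketch below is illustrative only; the app name is a placeholder (check your own task list for the real one), and the 0.33/0.5 values just reproduce the "3 at a time" setup discussed above. Einstein@Home also exposes a "GPU utilization factor" in the project preferences that achieves the same thing server-side.

```xml
<!-- Hypothetical app_config.xml sketch: app name and values are placeholders. -->
<app_config>
  <app>
    <name>einsteinbinary_BRP4</name>
    <gpu_versions>
      <gpu_usage>0.33</gpu_usage> <!-- 1/0.33 => 3 tasks share one GPU -->
      <cpu_usage>0.5</cpu_usage>  <!-- CPU fraction budgeted per GPU task -->
    </gpu_versions>
  </app>
</app_config>
```

Start with 2 tasks per GB of VRAM as HB suggests, watch GPU load and per-task run times, and adjust from there; more tasks only help as long as total throughput (tasks per hour) keeps rising.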

CU
HB

Infusioned
Joined: 11 Feb 05
Posts: 38
Credit: 149000
RAC: 0

A little update: I PM'd

Message 79265 in response to message 79264

A little update:

I PM'd Raistmer on the Seti Beta boards and asked him to read the last bit of this thread. He said he did not notice a higher failure rate with the 69xx series cards during his development of AMD apps.

Also, poking through my MW WUs, I validate just fine against:

CPU:
171830352
171730343
171601831
171601829
171650656
171850223
171838869

Anonymous GPU:
171940837

Other 69xx: (making sure my card isn't defective)
171917181
171954514

NVIDIA OpenCL:
171784516

HD 58xx GPU:
171907299

So, at this point, I am inclined to believe that my specific card isn't defective, and that the 69xx series cards are producing valid results.

Should I go back to doing Albert or Einstein wu's?

Bikeman (Heinz-Bernd Eggenstein)
Joined: 28 Aug 06
Posts: 164
Credit: 1864017
RAC: 0

Hi! The issue with the HD

Message 79266 in response to message 79265

Hi!

The issue with the HD 6900 series is this: There is a specific function (used by the FFT lib we are using for the OpenCL apps) that is computed with less accuracy on HD 6900 cards than on others. This is confirmed by AMD. It is not even a defect or bug, because the OpenCL standard allows this behavior.

To deal with it, we made an app that uses a more accurate, but somewhat slower variant of this function. On Einstein@Home, this special app version is now delivered to HD6900 cards running the OpenCL app.

Bottom line: it is safe (validation-wise) to resume computations on Einstein@Home with HD6900 cards.
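For readers curious how such a workaround can look in code: HB doesn't name the function, but the OpenCL C spec only mandates a ULP error bound for the standard math builtins, while the native_* variants have implementation-defined accuracy. A hypothetical sketch of an FFT twiddle-factor kernel that can be built with either the fast or the precise path:

```c
/* Hedged sketch, not the actual BRP/FFT code. The function name, the
 * kernel, and the USE_PRECISE_MATH switch are illustrative; the real
 * app selects the accurate variant for HD 6900-class devices. */
__kernel void twiddle(__global float2 *w, const float step)
{
    int i = get_global_id(0);
    float phase = (float)i * step;
#ifdef USE_PRECISE_MATH
    /* Standard builtins: accuracy bounded by the OpenCL spec (slower). */
    w[i] = (float2)(cos(phase), sin(phase));
#else
    /* native_* builtins: implementation-defined accuracy (faster). */
    w[i] = (float2)(native_cos(phase), native_sin(phase));
#endif
}
```

The point is that this is a legal quality/speed trade-off under the spec, which is why HB stresses it is "not even a defect or bug"; the fix is simply to compile in the slower, spec-bounded path on the affected hardware.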

Cheers
HB

Infusioned
Joined: 11 Feb 05
Posts: 38
Credit: 149000
RAC: 0

I'm glad you got to the

Message 79267 in response to message 79266

I'm glad you got to the bottom of things. I guess that means the next card I add will be a 79xx card instead of another 69xx. I can't imagine why AMD thought worse accuracy was acceptable, considering their whole push for compute-oriented video cards and APUs. Then again, maybe that's why things were changed with the 7xxx cards (assuming you had no errors with those)?

Hats off for all the hard work in getting this app developed.
