I don't think that this bug is related to the symptoms we see here, at least if Charlie describes the conditions that trigger it correctly (see commit message):
The host in question has only one OpenCL platform (if your log excerpt from SETI is correct)
The host has only GPUs from one vendor
I'll look into the problem...
Oliver
Raistmer's stderr.txt listing of OpenCL devices only reports GPU devices, not CPU devices.
Boinc should be reporting (on its startup, in the Event Log) CPU OpenCL support from the installed AMD drivers, and is unlikely to report CPU OpenCL support from Intel drivers (AMD CPU).
The openclapp sample calls boinc_get_opencl_ids(), which reads an init_data.xml file to determine which GPU to use. Actual project applications must also do this. Please see the ReadMe.txt file for more information.
This same sample is designed to run with AMD, NVIDIA and Intel Ivy Bridge GPUs. It is supplied with 3 minimal init_data.xml files, one for each of these 3 vendors (GPU "types".) Copy the appropriate init_data.xml file into the directory containing the openclapp executable.
This seems to imply you should send an init_data.xml suitable for each vendor's device.
int boinc_get_opencl_ids(int argc, char** argv, int type, cl_device_id* device, cl_platform_id* platform);
This returns the OpenCL platform and device IDs for the GPU that your app should use. Pass the argc and argv your application receives from the BOINC client. The third argument type should specify the vendor of the desired GPU and can be one of the following:
Symbol                  Value
PROC_TYPE_NVIDIA_GPU    1
PROC_TYPE_AMD_GPU       2
PROC_TYPE_INTEL_GPU     3
With BOINC Clients version 7.0.12 or later, the first 3 arguments will be ignored and all data will be taken from the init_data.xml file in the slot directory. The first 3 arguments allow this to work with older BOINC Clients. If your OpenCL app can use OpenCL-capable GPUs from any vendor, you can pass 0 for the third argument (type); if you pass a type value of 0, the type will be taken from the gpu_type field of the init_data.xml file on newer clients, but will return an error code of CL_INVALID_DEVICE_TYPE on older clients.
This function is in the library boinc/api/libboinc_opencl.a (on Macs: boinc/mac_build/build/Deployment/libboinc_opencl.a). As an alternative to linking the library, you can add the file boinc/api/libboinc_opencl.cpp to your source files when building your project application.
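The rule quoted above (where the GPU type comes from, depending on client version and the type argument) can be modelled in a few lines. This is only a sketch of the documented behaviour, not the library's code: type_source and FIRST_CLIENT_USING_INIT_DATA are hypothetical names, and the value of CL_INVALID_DEVICE_TYPE is taken from OpenCL's cl.h.

```python
# Values from the table in the quoted documentation.
PROC_TYPE_NVIDIA_GPU = 1
PROC_TYPE_AMD_GPU = 2
PROC_TYPE_INTEL_GPU = 3

CL_INVALID_DEVICE_TYPE = -31  # constant value as defined in OpenCL's cl.h
FIRST_CLIENT_USING_INIT_DATA = (7, 0, 12)  # hypothetical name for "7.0.12 or later"


def type_source(client_version, type_arg):
    """Hypothetical model: return (status, where the GPU type comes from).

    client_version is a (major, minor, release) tuple, type_arg is the
    third argument the app would pass to boinc_get_opencl_ids().
    """
    if client_version >= FIRST_CLIENT_USING_INIT_DATA:
        # Newer clients: the arguments are ignored, init_data.xml decides.
        return 0, "gpu_type field of init_data.xml"
    if type_arg == 0:
        # Older clients cannot resolve "any vendor".
        return CL_INVALID_DEVICE_TYPE, None
    return 0, "type argument"


print(type_source((7, 0, 12), 0))
print(type_source((6, 12, 34), 0))
print(type_source((6, 12, 34), PROC_TYPE_AMD_GPU))
```

The model deliberately leaves out what happens when a newer client writes an init_data.xml without a gpu_type field, which is the case under discussion in this thread.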
This implies that if your app can run on a GPU from any vendor and you set the type to 0 (as opposed to 1, 2 or 3), then the type is taken from what is actually in init_data.xml. Is the wrong init_data.xml being supplied?
This seems to imply you should send an init_data.xml suitable for each vendor's device.
That's not the case: init_data.xml is written by the client to communicate details to the app (which uses the BOINC API to access it).
Claggy wrote:
is the Wrong init_data.xml being supplied?
N/A, see above.
The problem is clearly stated in the failing tasks STDERR output:
[09:27:06][8096][INFO ] Starting data processing...
GPU type not found in init_data.xml
[09:27:06][8096][ERROR] Failed to get OpenCL platform/device info from BOINC (error: -161)!
[09:27:06][8096][ERROR] Demodulation failed (error: -161)!
@Tullio: I need a copy of an affected init_data.xml file. You'll find them underneath the slots directory of the BOINC data directory. Please make sure that the particular slot is used to run one of the failing tasks.
PLEASE DON'T PASTE THE FILE HERE (CONTAINS CONFIDENTIAL DATA)! SEND IT TO ME VIA PRIVATE MESSAGE! THANKS
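Since the file should not be posted publicly, a quick local check for the field named in the error message might look like the sketch below. It assumes the file parses as XML and that the field is an element called gpu_type (the name used in the BOINC wiki text quoted earlier); the root element name and the sample value "ATI" are placeholders, and find_gpu_type is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def find_gpu_type(init_data_xml):
    """Return the text of the first <gpu_type> element, or None if it is
    missing or empty. Assumes the input is well-formed XML."""
    root = ET.fromstring(init_data_xml)
    node = root if root.tag == "gpu_type" else root.find(".//gpu_type")
    if node is None or node.text is None or not node.text.strip():
        return None
    return node.text.strip()


# Placeholder documents, NOT real init_data.xml contents.
sample_ok = "<app_init_data><gpu_type>ATI</gpu_type></app_init_data>"
sample_missing = "<app_init_data><slot>0</slot></app_init_data>"

print(find_gpu_type(sample_ok))
print(find_gpu_type(sample_missing))
```

To check a real file, read it from the slot directory and pass its text to find_gpu_type; a result of None would be consistent with the "GPU type not found" message.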
FYI, we deployed a different scheduler on the 16th which was definitely flawed and we have an idea why you receive work you never requested. I reverted that one and the original one is in place again. This will likely fix your problem.
@Tullio: please run an update/work-request in both of your machines and report back.
Summary (to avoid confusion):
There are two issues at hand
The scheduler sends types of tasks not requested (here: Intel GPU): both computers affected (wait for Tullio's work requests)
The (correctly sent) ATI GPU tasks fail to run because of a seemingly flawed init_data.xml (wait for Tullio's PM)
I am getting errors in the binary pulsar radio search in Albert@home 1.40, while I do not get errors in Einstein@home with 1.39 on the same PC with Windows 8.1 and two ATI/AMD graphics processors, one on board called Devastator (8670D) and one discrete called Caicos (HD 8470). Albert@home and Einstein@home seem to see only the first, while SETI@home sees both.
Tullio, I do understand your original problem description (see my summary above). What I was trying to say is:
We think we fixed the reason for problem 1, so I need you to try again and report back the result. Your hosts haven't contacted the server since we deployed the fix, please click on "Update" in the BOINC Manager.
We need a copy of an affected init_data.xml file in order to help you with problem 2. Please PM that file to me.
I think there's a problem with that. When Boinc starts a task it populates a slot, and when the task finishes, errors out, etc., it empties the slot (even with the network suspended). I just tried it: two tasks running on my E8500/9800GTX+, one pending upload, only two slots.
Tullio's tasks are erroring between 1.5 and 2.75 secs after start, so he hasn't got much time to grab the init_data.xml before the slot ceases to exist.
He might be able to use the Activity menu to suspend GPU usage quickly enough to recover an init_data.xml before it errors, if he's lucky.
That's right, unfortunately. One can, however, pass --exit_before_start to the BOINC client so that it prepares the slot directory but exits right away. This way the init_data.xml file will be retained. Of course this means the core client has to be started manually, not automatically via the BOINC Manager or as a service.
HTH, Oliver
I tried putting --exit_before_start in a shortcut, but couldn't get it to work: the app started and kept running. I suspended usage via the task bar, suspended the tasks that had been running and tried fresh tasks, with the same result.
Instead I tried --exit_after_app_start 1
That did work: Boinc started an app, ran it for a second, then Boinc exited.
Tullio will have to make a shortcut to boinc.exe and put in the Target Box the following: "C:\Program Files\BOINC\boinc.exe" --exit_after_app_start 1
He'll have to Suspend GPU usage via the task bar, grab some GPU work from Albert, let it download, exit Boinc, double click his new shortcut, wait for Boinc.exe to start in the command window, then exit, then navigate to the slot directories:
C:\ProgramData\BOINC\slots
and search each slot for the einsteinbinary_BRP5_1.40_windows_intelx86__BRP5-opencl-ati.exe app. Once he's found the right slot, he can PM the init_data.xml.
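The slot search described above could also be scripted. The following is a minimal sketch, assuming each slot is a subdirectory of the slots directory and that the app shows up in its slot as a file with exactly the name given; find_slots_running is a hypothetical helper, not part of BOINC.

```python
from pathlib import Path


def find_slots_running(slots_dir, app_name):
    """Return the init_data.xml paths of all slots that contain a file
    named app_name. Slots without an init_data.xml are skipped."""
    matches = []
    for slot in sorted(Path(slots_dir).iterdir()):
        if slot.is_dir() and (slot / app_name).exists():
            init_data = slot / "init_data.xml"
            if init_data.is_file():
                matches.append(init_data)
    return matches


if __name__ == "__main__":
    app = "einsteinbinary_BRP5_1.40_windows_intelx86__BRP5-opencl-ati.exe"
    slots = Path(r"C:\ProgramData\BOINC\slots")
    if slots.is_dir():
        for path in find_slots_running(slots, app):
            print(path)
```

Run it after Boinc has exited, so the slot contents are still in place; any path it prints is a candidate file to send by PM.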
Boinc should be reporting (on its startup, in the Event Log) CPU OpenCL support from the installed AMD drivers, and is unlikely to report CPU OpenCL support from Intel drivers (AMD CPU).
I've been looking at:
http://boinc.berkeley.edu/trac/wiki/GPUApp
This seems to imply you should send an init_data.xml suitable for each vendor's device.
http://boinc.berkeley.edu/trac/wiki/OpenclApps
This implies that if your app can run on a GPU from any vendor and you set the type to 0 (as opposed to 1, 2 or 3), then the type is taken from what is actually in init_data.xml. Is the wrong init_data.xml being supplied?
http://boinc.berkeley.edu/trac/wiki/ProjectPlan
Part of the Boinc project plan is:
Claggy
That's not the case: init_data.xml is written by the client to communicate details to the app (which uses the BOINC API to access it).
N/A, see above.
The problem is clearly stated in the failing tasks STDERR output:
[09:27:06][8096][INFO ] Starting data processing...
GPU type not found in init_data.xml
[09:27:06][8096][ERROR] Failed to get OpenCL platform/device info from BOINC (error: -161)!
[09:27:06][8096][ERROR] Demodulation failed (error: -161)!
@Tullio: I need a copy of an affected init_data.xml file. You'll find them underneath the slots directory of the BOINC data directory. Please make sure that the particular slot is used to run one of the failing tasks.
PLEASE DON'T PASTE THE FILE HERE (CONTAINS CONFIDENTIAL DATA)! SEND IT TO ME VIA PRIVATE MESSAGE! THANKS
Thanks,
Oliver
FYI, we deployed a different scheduler on the 16th which was definitely flawed and we have an idea why you receive work you never requested. I reverted that one and the original one is in place again. This will likely fix your problem.
@Tullio: please run an update/work-request in both of your machines and report back.
Summary (to avoid confusion):
There are two issues at hand
Oliver
I am getting errors in the binary pulsar radio search in Albert@home 1.40, while I do not get errors in Einstein@home with 1.39 on the same PC with Windows 8.1 and two ATI/AMD graphics processors, one on board called Devastator (8670D) and one discrete called Caicos (HD 8470). Albert@home and Einstein@home seem to see only the first, while SETI@home sees both.
Tullio, I do understand your original problem description (see my summary above). What I was trying to say is:
Thanks,
Oliver
Tullio's tasks are erroring between 1.5 and 2.75 secs after start, so he hasn't got much time to grab the init_data.xml before the slot ceases to exist.
He might be able to use the Activity menu to suspend GPU usage quickly enough to recover an init_data.xml before it errors, if he's lucky.
Claggy
That's right, unfortunately. One can, however, pass --exit_before_start to the BOINC client so that it prepares the slot directory but exits right away. This way the init_data.xml file will be retained. Of course this means the core client has to be started manually, not automatically via the BOINC Manager or as a service.
HTH,
Oliver
Instead I tried --exit_after_app_start 1
That did work: Boinc started an app, ran it for a second, then Boinc exited.
Tullio will have to make a shortcut to boinc.exe and put in the Target Box the following: "C:\Program Files\BOINC\boinc.exe" --exit_after_app_start 1
He'll have to Suspend GPU usage via the task bar, grab some GPU work from Albert, let it download, exit Boinc, double click his new shortcut, wait for Boinc.exe to start in the command window, then exit, then navigate to the slot directories:
C:\ProgramData\BOINC\slots
and search each slot for the einsteinbinary_BRP5_1.40_windows_intelx86__BRP5-opencl-ati.exe app. Once he's found the right slot, he can PM the init_data.xml.
Claggy
Yep, thanks for the details.
@Tullio: please try to follow the steps Claggy just described.
Oliver
Dear all,
several of my computers are giving errors of different types constantly. I had to suspend work for this project while I see what's going on:
Computer 12403: example task 1777793
I am not sure but I guess that I should apply the solution described by Claggy:
--exit_after_app_start 1, am I right?
Computer 12382: example task 1779193
For this one, a NVidia Quadro FX 3450 (on Ubuntu 12.04) which is not detected by BOINC, I guess I should just remove it. I cannot find the applications available for the project, but I guess they are only for GPU, so there is no point in keeping this computer if the GPU is not recognized/not valid. Am I right?
Computer 12620: example task 1780214
And about the 3rd one, I don't know what the hell is going on :)
1st and 3rd one work well with GPU projects like Milkyway, Einstein, SETI, Collatz and others.
Please, give me some clues.
Thanks