Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use SYCL 2020 barrier instead of SYCL 1.2.1 one by default #1988

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion cmake/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -106,7 +106,7 @@ Use `ONEDPL_PAR_BACKEND` variable before the invocation of `find_package(oneDPL
### Using oneDPL package on Windows
On Windows, CMake requires some workarounds to use icx[-cl] successfully. A CMake package has been provided 'oneDPLWindowsIntelLLVM' to apply these required workarounds.
Some workarounds are provided for icpx, but it is not fully supported on Windows at this time. We also recommend updating to the most recent version of CMake, as they are actively improving support for Intel compilers (https://gitlab.kitware.com/cmake/cmake/-/issues/24314).
To enable the workarounds, please add `find_package(oneDPLWindowsIntelLLVM)` to your cmake file before you call `project()`. If using oneDPL from source files, you must add oneDPL's cmake directory to your `CMAKE_PREFIX_PATH` to allow CMake to find `oneDPLWindowsIntelLLVM`.
To enable the workarounds, please add `find_package(oneDPLWindowsIntelLLVM)` to your cmake file before you call `project()`. If using oneDPL from source files, you must add oneDPL's cmake directory to your `CMAKE_PREFIX_PATH` to allow CMake to find `oneDPLWindowsIntelLLVM`.

For example:

Expand Down
182 changes: 96 additions & 86 deletions documentation/library_guide/macros.rst
Original file line number Diff line number Diff line change
Expand Up @@ -50,89 +50,99 @@ Additional Macros
Use these macros to control aspects of |onedpl_short| usage. You can set them in your program code
before including |onedpl_short| headers.

================================== ==============================
Macro Description
================================== ==============================
``PSTL_USE_NONTEMPORAL_STORES`` This macro enables the use of ``#pragma vector nontemporal``
for write-only data when algorithms such as ``std::copy``, ``std::fill``, etc.,
are executed with unsequenced policies.
For further details about the pragma, see the |vector_pragma|_.
If the macro evaluates to a non-zero value,
the use of ``#pragma vector nontemporal`` is enabled.
By default, the macro is not defined.

Using this macro may have the same effect on the implementation of parallel
algorithms in the C++ standard libraries of GCC and LLVM.
---------------------------------- ------------------------------
``PSTL_USAGE_WARNINGS`` This macro enables Parallel STL to
emit compile-time messages, such as warnings
about an algorithm not supporting a certain execution policy.
When set to 1, the macro allows the implementation to emit
usage warnings. When the macro is not defined (by default)
or evaluates to zero, usage warnings are disabled.

Using this macro may have the same effect on the implementation of parallel
algorithms in the C++ standard libraries of GCC and LLVM.
---------------------------------- ------------------------------
``ONEDPL_USE_TBB_BACKEND`` This macro controls the use of |onetbb_long| or |tbb_long| for parallel
execution policies (``par`` and ``par_unseq``).

When the macro evaluates to a non-zero value, or when it is not defined (by default)
and no other parallel backends are explicitly chosen, algorithms with parallel policies
are executed using the |onetbb_short| or |tbb_short| library.
Setting the macro to 0 disables use of TBB API for parallel execution and is recommended
for code that should not depend on the presence of the |onetbb_short| or |tbb_short| library.

If all parallel backends are disabled by setting respective macros to 0, algorithms
with parallel policies are executed sequentially by the calling thread.
---------------------------------- ------------------------------
``ONEDPL_USE_OPENMP_BACKEND`` This macro controls the use of OpenMP* for parallel execution policies (``par`` and ``par_unseq``).

When the macro evaluates to a non-zero value, algorithms with parallel policies are executed
using OpenMP unless the TBB backend is explicitly enabled (that is, the TBB backend takes
precedence over the OpenMP backend).
When the macro is not defined (by default) and no other parallel backends are chosen,
a dedicated compiler option to enable OpenMP (such as ``-fopenmp``) also enables its use
for algorithms with parallel policies.
Setting the macro to 0 disables use of OpenMP for parallel execution.

If all parallel backends are disabled by setting respective macros to 0, algorithms
with parallel policies are executed sequentially by the calling thread.
---------------------------------- ------------------------------
``ONEDPL_USE_DPCPP_BACKEND`` This macro enables the use of device execution policies.

When the macro is not defined (default),
device policies are enabled only if SYCL support can be detected;
otherwise, they are disabled.
If the macro is set to a non-zero value, device policies are enabled unconditionally.
Setting the macro to 0 disables device policies.

When device policies are disabled, no SYCL dependency is introduced,
and their usage will lead to compilation errors.
---------------------------------- ------------------------------
``ONEDPL_USE_PREDEFINED_POLICIES`` This macro enables the use of predefined device policy objects,
such as ``dpcpp_default`` and ``dpcpp_fpga``. When the macro is not defined (by default)
or evaluates to non-zero, predefined policies objects can be used.
When the macro is set to 0, predefined policies objects and make functions
without arguments (``make_device_policy()`` and ``make_fpga_policy()``) are not available.
---------------------------------- ------------------------------
``ONEDPL_ALLOW_DEFERRED_WAITING`` This macro allows waiting for completion of certain algorithms executed with
device policies to be deferred. (Disabled by default.)

When the macro evaluates to non-zero, a call to a oneDPL algorithm with
a device policy might return before the computation completes on the device.

.. Warning:: Before accessing data produced or modified by the call, waiting
for completion of all tasks in the corresponding SYCL queue is required;
otherwise, the program behavior is undefined.
---------------------------------- ------------------------------
``ONEDPL_FPGA_DEVICE`` Use this macro to build your code containing |onedpl_short| parallel
algorithms for FPGA devices. (Disabled by default.)
---------------------------------- ------------------------------
``ONEDPL_FPGA_EMULATOR`` Use this macro to build your code containing Parallel STL
algorithms for FPGA emulation device. (Disabled by default.)

.. Note:: Define ``ONEDPL_FPGA_DEVICE`` and ``ONEDPL_FPGA_EMULATOR`` macros in the same
application to run on a FPGA emulation device.
Define only the ``ONEDPL_FPGA_DEVICE`` macro to run on a FPGA hardware device.
================================== ==============================
==================================== ==============================
Macro Description
==================================== ==============================
``PSTL_USE_NONTEMPORAL_STORES`` This macro enables the use of ``#pragma vector nontemporal``
for write-only data when algorithms such as ``std::copy``, ``std::fill``, etc.,
are executed with unsequenced policies.
For further details about the pragma, see the |vector_pragma|_.
If the macro evaluates to a non-zero value,
the use of ``#pragma vector nontemporal`` is enabled.
By default, the macro is not defined.

Using this macro may have the same effect on the implementation of parallel
algorithms in the C++ standard libraries of GCC and LLVM.
------------------------------------ ------------------------------
``PSTL_USAGE_WARNINGS`` This macro enables Parallel STL to
emit compile-time messages, such as warnings
about an algorithm not supporting a certain execution policy.
When set to 1, the macro allows the implementation to emit
usage warnings. When the macro is not defined (by default)
or evaluates to zero, usage warnings are disabled.

Using this macro may have the same effect on the implementation of parallel
algorithms in the C++ standard libraries of GCC and LLVM.
------------------------------------ ------------------------------
``ONEDPL_USE_TBB_BACKEND`` This macro controls the use of |onetbb_long| or |tbb_long| for parallel
execution policies (``par`` and ``par_unseq``).

When the macro evaluates to a non-zero value, or when it is not defined (by default)
and no other parallel backends are explicitly chosen, algorithms with parallel policies
are executed using the |onetbb_short| or |tbb_short| library.
Setting the macro to 0 disables use of TBB API for parallel execution and is recommended
for code that should not depend on the presence of the |onetbb_short| or |tbb_short| library.

If all parallel backends are disabled by setting respective macros to 0, algorithms
with parallel policies are executed sequentially by the calling thread.
------------------------------------ ------------------------------
``ONEDPL_USE_OPENMP_BACKEND`` This macro controls the use of OpenMP* for parallel execution policies (``par`` and ``par_unseq``).

When the macro evaluates to a non-zero value, algorithms with parallel policies are executed
using OpenMP unless the TBB backend is explicitly enabled (that is, the TBB backend takes
precedence over the OpenMP backend).
When the macro is not defined (by default) and no other parallel backends are chosen,
a dedicated compiler option to enable OpenMP (such as ``-fopenmp``) also enables its use
for algorithms with parallel policies.
Setting the macro to 0 disables use of OpenMP for parallel execution.

If all parallel backends are disabled by setting respective macros to 0, algorithms
with parallel policies are executed sequentially by the calling thread.
------------------------------------ ------------------------------
``ONEDPL_USE_DPCPP_BACKEND`` This macro enables the use of device execution policies.

When the macro is not defined (default),
device policies are enabled only if SYCL support can be detected;
otherwise, they are disabled.
If the macro is set to a non-zero value, device policies are enabled unconditionally.
Setting the macro to 0 disables device policies.

When device policies are disabled, no SYCL dependency is introduced,
and their usage will lead to compilation errors.
------------------------------------ ------------------------------
``ONEDPL_USE_PREDEFINED_POLICIES`` This macro enables the use of predefined device policy objects,
such as ``dpcpp_default`` and ``dpcpp_fpga``. When the macro is not defined (by default)
or evaluates to non-zero, predefined policies objects can be used.
When the macro is set to 0, predefined policies objects and make functions
without arguments (``make_device_policy()`` and ``make_fpga_policy()``) are not available.
------------------------------------ ------------------------------
``ONEDPL_ALLOW_DEFERRED_WAITING`` This macro allows waiting for completion of certain algorithms executed with
device policies to be deferred. (Disabled by default.)

When the macro evaluates to non-zero, a call to a oneDPL algorithm with
a device policy might return before the computation completes on the device.

.. Warning:: Before accessing data produced or modified by the call, waiting
for completion of all tasks in the corresponding SYCL queue is required;
otherwise, the program behavior is undefined.
------------------------------------ ------------------------------
``ONEDPL_FPGA_DEVICE`` Use this macro to build your code containing |onedpl_short| parallel
algorithms for FPGA devices. (Disabled by default.)
------------------------------------ ------------------------------
``ONEDPL_FPGA_EMULATOR`` Use this macro to build your code containing Parallel STL
algorithms for FPGA emulation device. (Disabled by default.)

.. Note:: Define ``ONEDPL_FPGA_DEVICE`` and ``ONEDPL_FPGA_EMULATOR`` macros in the same
application to run on a FPGA emulation device.
Define only the ``ONEDPL_FPGA_DEVICE`` macro to run on a FPGA hardware device.
------------------------------------ ------------------------------
``ONEDPL_USE_SYCL121_GROUP_BARRIER`` The macro controls the semantics of group barriers,
which can be aligned with either SYCL 1.2.1 or SYCL 2020 specification.
It affects algorithms that use device execution policies.

Set this macro to a non-zero value to enable SYCL 1.2.1 group barriers.
The default value is 1 when using the oneAPI DPC++ Compiler and 0 otherwise.

.. Note:: SYCL 1.2.1 group barriers can provide better performance on Intel GPUs.
.. Note:: The macro may be removed in future releases in favor of SYCL 2020 group barriers.
==================================== ==============================
25 changes: 20 additions & 5 deletions include/oneapi/dpl/pstl/hetero/dpcpp/sycl_defs.h
Original file line number Diff line number Diff line change
Expand Up @@ -63,6 +63,7 @@
#define _ONEDPL_SYCL2020_KNOWN_IDENTITY_PRESENT (!_ONEDPL_LIBSYCL_VERSION_LESS_THAN(50300))
#define _ONEDPL_SYCL2020_FUNCTIONAL_OBJECTS_PRESENT (!_ONEDPL_LIBSYCL_VERSION_LESS_THAN(50300))
#define _ONEDPL_SYCL2020_REQD_SUB_GROUP_SIZE_PRESENT (!_ONEDPL_LIBSYCL_VERSION_LESS_THAN(50300))
#define _ONEDPL_SYCL2020_GROUP_BARRIER_PRESENT (!_ONEDPL_LIBSYCL_VERSION_LESS_THAN(50300))
#define _ONEDPL_SYCL2020_TARGET_PRESENT (!_ONEDPL_LIBSYCL_VERSION_LESS_THAN(50400))
#define _ONEDPL_SYCL2020_TARGET_DEVICE_PRESENT (!_ONEDPL_LIBSYCL_VERSION_LESS_THAN(50400))
#define _ONEDPL_SYCL2020_ATOMIC_REF_PRESENT (!_ONEDPL_LIBSYCL_VERSION_LESS_THAN(50500))
Expand Down Expand Up @@ -209,17 +210,31 @@ __get_accessor_size(const _Accessor& __accessor)
#endif
}

// TODO: switch to SYCL 2020 with DPC++ compiler.
// SYCL 1.2.1 version is used due to better performance on Intel GPUs.
// The performance gap is negligible since
// https://github.com/intel/intel-graphics-compiler/commit/ed639f68d142bc963a7b626badc207a42fb281cb (Aug 20, 2024)
// But the fix is not a part of the LTS GPU drivers (Linux) yet.
#if !defined(ONEDPL_USE_SYCL121_GROUP_BARRIER)
# if _ONEDPL_LIBSYCL_VERSION
# define ONEDPL_USE_SYCL121_GROUP_BARRIER 1
# else
# define ONEDPL_USE_SYCL121_GROUP_BARRIER 0
# endif
#endif

template <typename _Item>
constexpr void
__group_barrier(_Item __item)
{
#if 0 // !defined(_ONEDPL_LIBSYCL_VERSION) || _ONEDPL_LIBSYCL_VERSION >= 50300
//TODO: usage of sycl::group_barrier: probably, we have to revise SYCL parallel patterns which use a group_barrier.
// 1) sycl::group_barrier() implementation is not ready
// 2) sycl::group_barrier and sycl::item::group_barrier are not quite equivalent
#if _ONEDPL_SYCL2020_GROUP_BARRIER_PRESENT && !ONEDPL_USE_SYCL121_GROUP_BARRIER
// SYCL 2020 barrier: applies to local and global memory within a work-group
sycl::group_barrier(__item.get_group(), sycl::memory_scope::work_group);
#else
#elif ONEDPL_USE_SYCL121_GROUP_BARRIER
// SYCL 1.2.1 barrier: applies to local memory within a work-group
__item.barrier(sycl::access::fence_space::local_space);
#else
# error "sycl::group_barrier is not supported, and no alternative is available"
#endif
}

Expand Down
Loading