uxlfoundation · dmitriy-sobolev · Jan 5, 2025 · Jan 22, 2025 · Jan 23, 2025 · Jan 23, 2025
diff --git a/cmake/README.md b/cmake/README.md
@@ -106,7 +106,7 @@ Use `ONEDPL_PAR_BACKEND` variable before the invocation of `find_package(oneDPL
 ### Using oneDPL package on Windows
 On Windows, CMake requires some workarounds to use icx[-cl] successfully.  A CMake package has been provided 'oneDPLWindowsIntelLLVM' to apply these required workarounds.
 Some workarounds are provided for icpx, but it is not fully supported on Windows at this time.  We also recommend updating to the most recent version of CMake, as they are actively improving support for Intel compilers (https://gitlab.kitware.com/cmake/cmake/-/issues/24314).
-To enable the workarounds, please add `find_package(oneDPLWindowsIntelLLVM)` to your cmake file before you call `project()`.  If using oneDPL from source files, you must add oneDPL's cmake directory to your `CMAKE_PREFIX_PATH` to allow CMake to find `oneDPLWindowsIntelLLVM`. 
+To enable the workarounds, please add `find_package(oneDPLWindowsIntelLLVM)` to your cmake file before you call `project()`.  If using oneDPL from source files, you must add oneDPL's cmake directory to your `CMAKE_PREFIX_PATH` to allow CMake to find `oneDPLWindowsIntelLLVM`.
 
 For example:
 

diff --git a/documentation/library_guide/macros.rst b/documentation/library_guide/macros.rst
@@ -50,89 +50,99 @@ Additional Macros
 Use these macros to control aspects of |onedpl_short| usage. You can set them in your program code
 before including |onedpl_short| headers.
 
-================================== ==============================
-Macro                              Description
-================================== ==============================
-``PSTL_USE_NONTEMPORAL_STORES``    This macro enables the use of ``#pragma vector nontemporal``
-                                   for write-only data when algorithms such as ``std::copy``, ``std::fill``, etc.,
-                                   are executed with unsequenced policies.
-                                   For further details about the pragma, see the |vector_pragma|_.
-                                   If the macro evaluates to a non-zero value,
-                                   the use of ``#pragma vector nontemporal`` is enabled.
-                                   By default, the macro is not defined.
-
-                                   Using this macro may have the same effect on the implementation of parallel
-                                   algorithms in the C++ standard libraries of GCC and LLVM.
----------------------------------- ------------------------------
-``PSTL_USAGE_WARNINGS``            This macro enables Parallel STL to
-                                   emit compile-time messages, such as warnings
-                                   about an algorithm not supporting a certain execution policy.
-                                   When set to 1, the macro allows the implementation to emit
-                                   usage warnings. When the macro is not defined (by default)
-                                   or evaluates to zero, usage warnings are disabled.
-
-                                   Using this macro may have the same effect on the implementation of parallel
-                                   algorithms in the C++ standard libraries of GCC and LLVM.
----------------------------------- ------------------------------
-``ONEDPL_USE_TBB_BACKEND``         This macro controls the use of |onetbb_long| or |tbb_long| for parallel
-                                   execution policies (``par`` and ``par_unseq``).
-
-                                   When the macro evaluates to a non-zero value, or when it is not defined (by default)
-                                   and no other parallel backends are explicitly chosen, algorithms with parallel policies
-                                   are executed using the |onetbb_short| or |tbb_short| library.
-                                   Setting the macro to 0 disables use of TBB API for parallel execution and is recommended
-                                   for code that should not depend on the presence of the |onetbb_short| or |tbb_short| library.
-
-                                   If all parallel backends are disabled by setting respective macros to 0, algorithms
-                                   with parallel policies are executed sequentially by the calling thread.
----------------------------------- ------------------------------
-``ONEDPL_USE_OPENMP_BACKEND``      This macro controls the use of OpenMP* for parallel execution policies (``par`` and ``par_unseq``).
-
-                                   When the macro evaluates to a non-zero value, algorithms with parallel policies are executed
-                                   using OpenMP unless the TBB backend is explicitly enabled (that is, the TBB backend takes
-                                   precedence over the OpenMP backend).
-                                   When the macro is not defined (by default) and no other parallel backends are chosen,
-                                   a dedicated compiler option to enable OpenMP (such as ``-fopenmp``) also enables its use
-                                   for algorithms with parallel policies.
-                                   Setting the macro to 0 disables use of OpenMP for parallel execution.
-
-                                   If all parallel backends are disabled by setting respective macros to 0, algorithms
-                                   with parallel policies are executed sequentially by the calling thread.
----------------------------------- ------------------------------
-``ONEDPL_USE_DPCPP_BACKEND``       This macro enables the use of device execution policies.
-
-                                   When the macro is not defined (default),
-                                   device policies are enabled only if SYCL support can be detected;
-                                   otherwise, they are disabled.
-                                   If the macro is set to a non-zero value, device policies are enabled unconditionally.
-                                   Setting the macro to 0 disables device policies.
-
-                                   When device policies are disabled, no SYCL dependency is introduced,
-                                   and their usage will lead to compilation errors.
----------------------------------- ------------------------------
-``ONEDPL_USE_PREDEFINED_POLICIES`` This macro enables the use of predefined device policy objects,
-                                   such as ``dpcpp_default`` and ``dpcpp_fpga``. When the macro is not defined (by default)
-                                   or evaluates to non-zero, predefined policies objects can be used.
-                                   When the macro is set to 0, predefined policies objects and make functions
-                                   without arguments (``make_device_policy()`` and ``make_fpga_policy()``) are not available.
----------------------------------- ------------------------------
-``ONEDPL_ALLOW_DEFERRED_WAITING``  This macro allows waiting for completion of certain algorithms executed with
-                                   device policies to be deferred. (Disabled by default.)
-
-                                   When the macro evaluates to non-zero, a call to a oneDPL algorithm with
-                                   a device policy might return before the computation completes on the device.
-
-                                   .. Warning:: Before accessing data produced or modified by the call, waiting
-                                      for completion of all tasks in the corresponding SYCL queue is required;
-                                      otherwise, the program behavior is undefined.
----------------------------------- ------------------------------
-``ONEDPL_FPGA_DEVICE``             Use this macro to build your code containing |onedpl_short| parallel
-                                   algorithms for FPGA devices. (Disabled by default.)
----------------------------------- ------------------------------
-``ONEDPL_FPGA_EMULATOR``           Use this macro to build your code containing Parallel STL
-                                   algorithms for FPGA emulation device. (Disabled by default.)
-
-                                   .. Note:: Define ``ONEDPL_FPGA_DEVICE`` and ``ONEDPL_FPGA_EMULATOR`` macros in the same
-                                      application to run on a FPGA emulation device.
-                                      Define only the ``ONEDPL_FPGA_DEVICE`` macro to run on a FPGA hardware device.
-================================== ==============================
+==================================== ==============================
+Macro                                Description
+==================================== ==============================
+``PSTL_USE_NONTEMPORAL_STORES``      This macro enables the use of ``#pragma vector nontemporal``
+                                     for write-only data when algorithms such as ``std::copy``, ``std::fill``, etc.,
+                                     are executed with unsequenced policies.
+                                     For further details about the pragma, see the |vector_pragma|_.
+                                     If the macro evaluates to a non-zero value,
+                                     the use of ``#pragma vector nontemporal`` is enabled.
+                                     By default, the macro is not defined.
+
+                                     Using this macro may have the same effect on the implementation of parallel
+                                     algorithms in the C++ standard libraries of GCC and LLVM.
+------------------------------------ ------------------------------
+``PSTL_USAGE_WARNINGS``              This macro enables Parallel STL to
+                                     emit compile-time messages, such as warnings
+                                     about an algorithm not supporting a certain execution policy.
+                                     When set to 1, the macro allows the implementation to emit
+                                     usage warnings. When the macro is not defined (by default)
+                                     or evaluates to zero, usage warnings are disabled.
+
+                                     Using this macro may have the same effect on the implementation of parallel
+                                     algorithms in the C++ standard libraries of GCC and LLVM.
+------------------------------------ ------------------------------
+``ONEDPL_USE_TBB_BACKEND``           This macro controls the use of |onetbb_long| or |tbb_long| for parallel
+                                     execution policies (``par`` and ``par_unseq``).
+
+                                     When the macro evaluates to a non-zero value, or when it is not defined (by default)
+                                     and no other parallel backends are explicitly chosen, algorithms with parallel policies
+                                     are executed using the |onetbb_short| or |tbb_short| library.
+                                     Setting the macro to 0 disables use of TBB API for parallel execution and is recommended
+                                     for code that should not depend on the presence of the |onetbb_short| or |tbb_short| library.
+
+                                     If all parallel backends are disabled by setting respective macros to 0, algorithms
+                                     with parallel policies are executed sequentially by the calling thread.
+------------------------------------ ------------------------------
+``ONEDPL_USE_OPENMP_BACKEND``        This macro controls the use of OpenMP* for parallel execution policies (``par`` and ``par_unseq``).
+
+                                     When the macro evaluates to a non-zero value, algorithms with parallel policies are executed
+                                     using OpenMP unless the TBB backend is explicitly enabled (that is, the TBB backend takes
+                                     precedence over the OpenMP backend).
+                                     When the macro is not defined (by default) and no other parallel backends are chosen,
+                                     a dedicated compiler option to enable OpenMP (such as ``-fopenmp``) also enables its use
+                                     for algorithms with parallel policies.
+                                     Setting the macro to 0 disables use of OpenMP for parallel execution.
+
+                                     If all parallel backends are disabled by setting respective macros to 0, algorithms
+                                     with parallel policies are executed sequentially by the calling thread.
+------------------------------------ ------------------------------
+``ONEDPL_USE_DPCPP_BACKEND``         This macro enables the use of device execution policies.
+
+                                     When the macro is not defined (default),
+                                     device policies are enabled only if SYCL support can be detected;
+                                     otherwise, they are disabled.
+                                     If the macro is set to a non-zero value, device policies are enabled unconditionally.
+                                     Setting the macro to 0 disables device policies.
+
+                                     When device policies are disabled, no SYCL dependency is introduced,
+                                     and their usage will lead to compilation errors.
+------------------------------------ ------------------------------
+``ONEDPL_USE_PREDEFINED_POLICIES``   This macro enables the use of predefined device policy objects,
+                                     such as ``dpcpp_default`` and ``dpcpp_fpga``. When the macro is not defined (by default)
+                                     or evaluates to non-zero, predefined policies objects can be used.
+                                     When the macro is set to 0, predefined policies objects and make functions
+                                     without arguments (``make_device_policy()`` and ``make_fpga_policy()``) are not available.
+------------------------------------ ------------------------------
+``ONEDPL_ALLOW_DEFERRED_WAITING``    This macro allows waiting for completion of certain algorithms executed with
+                                     device policies to be deferred. (Disabled by default.)
+
+                                     When the macro evaluates to non-zero, a call to a oneDPL algorithm with
+                                     a device policy might return before the computation completes on the device.
+
+                                     .. Warning:: Before accessing data produced or modified by the call, waiting
+                                        for completion of all tasks in the corresponding SYCL queue is required;
+                                        otherwise, the program behavior is undefined.
+------------------------------------ ------------------------------
+``ONEDPL_FPGA_DEVICE``               Use this macro to build your code containing |onedpl_short| parallel
+                                     algorithms for FPGA devices. (Disabled by default.)
+------------------------------------ ------------------------------
+``ONEDPL_FPGA_EMULATOR``             Use this macro to build your code containing Parallel STL
+                                     algorithms for FPGA emulation device. (Disabled by default.)
+
+                                     .. Note:: Define ``ONEDPL_FPGA_DEVICE`` and ``ONEDPL_FPGA_EMULATOR`` macros in the same
+                                        application to run on a FPGA emulation device.
+                                        Define only the ``ONEDPL_FPGA_DEVICE`` macro to run on a FPGA hardware device.
+------------------------------------ ------------------------------
+``ONEDPL_USE_SYCL121_GROUP_BARRIER`` The macro controls the semantics of group barriers,
+                                     which can be aligned with either SYCL 1.2.1 or SYCL 2020 specification.
+                                     It affects algorithms that use device execution policies.
+
+                                     Set this macro to a non-zero value to enable SYCL 1.2.1 group barriers.
+                                     The default value is 1 when using the oneAPI DPC++ Compiler and 0 otherwise.
+
+                                     .. Note:: SYCL 1.2.1 group barriers can provide better performance on Intel GPUs.
+                                     .. Note:: The macro may be removed in future releases in favor of SYCL 2020 group barriers.
+==================================== ==============================
diff --git a/include/oneapi/dpl/pstl/hetero/dpcpp/sycl_defs.h b/include/oneapi/dpl/pstl/hetero/dpcpp/sycl_defs.h
@@ -63,6 +63,7 @@
 #define _ONEDPL_SYCL2020_KNOWN_IDENTITY_PRESENT               (!_ONEDPL_LIBSYCL_VERSION_LESS_THAN(50300))
 #define _ONEDPL_SYCL2020_FUNCTIONAL_OBJECTS_PRESENT           (!_ONEDPL_LIBSYCL_VERSION_LESS_THAN(50300))
 #define _ONEDPL_SYCL2020_REQD_SUB_GROUP_SIZE_PRESENT          (!_ONEDPL_LIBSYCL_VERSION_LESS_THAN(50300))
+#define _ONEDPL_SYCL2020_GROUP_BARRIER_PRESENT                (!_ONEDPL_LIBSYCL_VERSION_LESS_THAN(50300))
 #define _ONEDPL_SYCL2020_TARGET_PRESENT                       (!_ONEDPL_LIBSYCL_VERSION_LESS_THAN(50400))
 #define _ONEDPL_SYCL2020_TARGET_DEVICE_PRESENT                (!_ONEDPL_LIBSYCL_VERSION_LESS_THAN(50400))
 #define _ONEDPL_SYCL2020_ATOMIC_REF_PRESENT                   (!_ONEDPL_LIBSYCL_VERSION_LESS_THAN(50500))
@@ -209,17 +210,31 @@ __get_accessor_size(const _Accessor& __accessor)
 #endif
 }
 
+// TODO: switch to SYCL 2020 with DPC++ compiler.
+// SYCL 1.2.1 version is used due to better performance on Intel GPUs.
+// The performance gap is negligible since
+// https://github.com/intel/intel-graphics-compiler/commit/ed639f68d142bc963a7b626badc207a42fb281cb (Aug 20, 2024)
+// But the fix is not a part of the LTS GPU drivers (Linux) yet.
+#if !defined(ONEDPL_USE_SYCL121_GROUP_BARRIER)
+#    if _ONEDPL_LIBSYCL_VERSION
+#        define ONEDPL_USE_SYCL121_GROUP_BARRIER 1
+#    else
+#        define ONEDPL_USE_SYCL121_GROUP_BARRIER 0
+#    endif
+#endif
+
 template <typename _Item>
 constexpr void
 __group_barrier(_Item __item)
 {
-#if 0 // !defined(_ONEDPL_LIBSYCL_VERSION) || _ONEDPL_LIBSYCL_VERSION >= 50300
-    //TODO: usage of sycl::group_barrier: probably, we have to revise SYCL parallel patterns which use a group_barrier.
-    // 1) sycl::group_barrier() implementation is not ready
-    // 2) sycl::group_barrier and sycl::item::group_barrier are not quite equivalent
+#if _ONEDPL_SYCL2020_GROUP_BARRIER_PRESENT && !ONEDPL_USE_SYCL121_GROUP_BARRIER
+    // SYCL 2020 barrier: applies to local and global memory within a work-group
     sycl::group_barrier(__item.get_group(), sycl::memory_scope::work_group);
-#else
+#elif ONEDPL_USE_SYCL121_GROUP_BARRIER
+    // SYCL 1.2.1 barrier: applies to local memory within a work-group
     __item.barrier(sycl::access::fence_space::local_space);
+#else
+#    error "sycl::group_barrier is not supported, and no alternative is available"
 #endif
 }