Add loongarch support and LSX/LASX impl #2560

Merged
merged 3 commits into google:master on May 15, 2025

Conversation

HecaiYuan
Contributor

Added CI tests and implemented LSX (128-bit vector extension) and LASX (256-bit vector extension) optimizations. Welcome to review. @jan-wassenberg @jeromejj

@jan-wassenberg
Member

Congrats on the successful implementation! I will review next week, have been very busy with gemma.cpp.

@johnplatts
Contributor

There are some more enhancements that should be made on the LSX/LASX targets:

  • Implement Vec128<T> → Vec256<T> Combine, Vec256<T> → Vec128<T> LowerHalf/UpperHalf, and Vec256<T> → Vec128<T, N> [ZeroExtend]ResizeBitCast using __builtin_shufflevector if available (this can be checked for using #if HWY_HAS_BUILTIN(__builtin_shufflevector); Clang uses __builtin_shufflevector to implement some of the AVX __m128/__m128i/__m128d → __m256/__m256i/__m256d intrinsics)
  • LSX/LASX-specific implementations of RoundingShiftRight using __lsx_vsrlri_*/__lsx_vsrari_*/__lasx_xvsrlri_*/__lasx_xvsrari_*
  • LSX/LASX-specific implementations of RoundingShr using __lsx_vsrlr_*/__lsx_vsrar_*/__lasx_xvsrlr_*/__lasx_xvsrar_* (see the sketch after this list)
  • Implement Mask128<T>/Mask256<T> SetOnlyFirst/SetAtOrAfterFirst using __lsx_vsub_q/__lasx_xvsub_q (equivalent to PPC8 vec_sub(__vector unsigned __int128, __vector unsigned __int128))

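For illustration, a minimal hypothetical sketch of the RoundingShr bullet above for 8-bit lanes, assuming the __lsx_vsrlr_b/__lsx_vsrar_b intrinsics named there take the per-lane shift counts as a second vector operand; this is a sketch of the suggestion, not the code that was merged:

// Hypothetical sketch: LSX-specific RoundingShr for 8-bit lanes. Assumes
// __lsx_vsrlr_b / __lsx_vsrar_b (rounding logical/arithmetic right shift by
// per-lane counts), as listed in the suggestion above.
template <size_t N>
HWY_API Vec128<uint8_t, N> RoundingShr(Vec128<uint8_t, N> v,
                                       Vec128<uint8_t, N> bits) {
  return Vec128<uint8_t, N>{__lsx_vsrlr_b(v.raw, bits.raw)};
}

template <size_t N>
HWY_API Vec128<int8_t, N> RoundingShr(Vec128<int8_t, N> v,
                                      Vec128<int8_t, N> bits) {
  return Vec128<int8_t, N>{__lsx_vsrar_b(v.raw, bits.raw)};
}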
@HecaiYuan
Contributor Author

Thank you for the suggestions. I'll address the ones that apply.

@HecaiYuan
Contributor Author

HecaiYuan commented May 9, 2025

LoongArch does not have the __lsx_vsub_q/__lasx_xvsub_q instructions. Other suggestions have been addressed. @johnplatts

Member

@jan-wassenberg left a comment


Nice work! And it's a nice instruction set :)
Some comments already on LSX and the other files, I have not yet reviewed LASX.

CMakeLists.txt Outdated
# enabling LASX still requires the -mlasx flag to be passed. In order to enable
# all targets, we can check them directly, adding them if they are supported.
# This way, our local compilers (GCC 8.3.0 or Clang 8.0.1) can also enable the
# LSX & LASX targets. Any better ideas are welcome.
Member

How about using check_cxx_source_compiles to verify compiler support?

Contributor

@jinboson May 10, 2025

Yeah, no problem. Will update in the next push.

CMakeLists.txt Outdated
@@ -68,6 +68,14 @@ endif()
# The following is only required with GCC < 6.1.0 or CLANG < 16.0
set(HWY_CMAKE_ARM7 OFF CACHE BOOL "Set copts for Armv7 with NEON (requires vfpv4)?")

# Upstream compilers(GCC 14 or CLANG 18) start supporting LSX by default, but
# enabling LASX still require -mlasx flag to be passed, in order to enable all
Member

Do we really require -mlasx? The patch includes enabling runtime dispatch, but only for LASX?

Contributor

@jinboson May 10, 2025

I am not sure whether I understand you correctly, so let me expand on it. In line 71, the word "supporting" should be replaced with "enabling", which is more accurate: GCC 14 or Clang 18 can actually compile LASX code, but the -mlasx flag must be passed; otherwise we would only compile LSX code.

As for enabling runtime dispatch, it's NOT only for LASX; that was my mistake. Will fix it in the next push.

Thanks for your reviews; please feel free to let me know if there are any enhancements that should be made.

HWY_API Vec128<float, N> IfThenElse(Mask128<float, N> mask,
Vec128<float, N> yes, Vec128<float, N> no) {
const DFromV<decltype(yes)> d;
Member

Is there a penalty from mixing int/float? If not, it's probably better to BitCast to signed (use RebindToSigned<decltype(d)>), then call the previous IfThenElse overload.
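A minimal, hypothetical sketch of this suggestion (BitCast to signed and reuse the existing integer overload); the merged code may differ:

// Hypothetical sketch of the suggestion above: BitCast to signed and forward
// to the existing integer IfThenElse overload. Not necessarily the merged code.
template <size_t N>
HWY_API Vec128<float, N> IfThenElse(Mask128<float, N> mask,
                                    Vec128<float, N> yes,
                                    Vec128<float, N> no) {
  const DFromV<decltype(yes)> d;
  const RebindToSigned<decltype(d)> di;
  return BitCast(d, IfThenElse(RebindMask(di, mask), BitCast(di, yes),
                               BitCast(di, no)));
}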

Contributor Author

Done.

template <class D, HWY_IF_V_SIZE_LE_D(D, 16), HWY_IF_T_SIZE_D(D, 1)>
HWY_INLINE VFromD<D> Iota0(D /*d*/) {
alignas(16) TFromD<D> _tmp_data[] = {
Member

Here also static constexpr?

Contributor Author

Done.

HWY_API Vec128<uint8_t, N> ShiftRightSame(const Vec128<uint8_t, N> v,
int bits) {
return Vec128<uint8_t, N>{__lsx_vsrl_b(v.raw, __lsx_vreplgr2vr_b(bits))};
Member

Whenever there is no efficiency loss, it's nice to reuse code to avoid duplication. Can we just have a single template of T, N that calls Set() and Shr()?
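For illustration, a hypothetical sketch of the suggested single template (not necessarily the merged code):

// Hypothetical sketch: one generic ShiftRightSame that broadcasts the count
// with Set and reuses the per-lane variable shift Shr, instead of per-type
// intrinsics.
template <typename T, size_t N>
HWY_API Vec128<T, N> ShiftRightSame(const Vec128<T, N> v, const int bits) {
  const DFromV<decltype(v)> d;
  return Shr(v, Set(d, static_cast<T>(bits)));
}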

Contributor Author

Done.

}

// usinged
Member

Unsigned

Contributor Author

Done.

HWY_INLINE VFromD<D> ClampF64ToI32Max(D d, VFromD<D> v) {
// The max can be exactly represented in binary64, so clamping beforehand
// prevents x86 conversion from raising an exception and returning 80..00.
Member

Does LoongArch also have the same behaviour as x86? Might want to update the comment.

Contributor Author

I made modifications to the DemoteTo(double->UI32) instruction and removed ClampF64ToI32Max. It seems that ClampF64ToI32Max is not needed.

#endif // HWY_TARGET > HWY_LASX

// ------------------------------ Integer <=> fp (ShiftRight, OddEven)
Member

These implementations do not use/depend on ShiftRight/OddEven, so can remove the comment.

Contributor Author

Done.

Member

@jan-wassenberg left a comment

Also reviewed LASX now :)


namespace detail {

// we don't have intrinsics to operate between 128-bit and 256-bit,
Member

Hm, it is only 100% guaranteed correct to read the union member that was last written. Maybe it's better to use hwy::CopyBytes instead, if that is zero-cost?

Contributor

@jinboson May 13, 2025

template <class D, HWY_IF_V_SIZE_D(D, 32)> 
HWY_API VFromD<D> Combine(D d, VFromD<Half<D>> hi, VFromD<Half<D>> lo) {
#if HWY_HAS_BUILTIN(__builtin_shufflevector)
  (void)d;
  typedef uint32_t U32RawVectType __attribute__((__vector_size__(16)));
  return VFromD<D>{reinterpret_cast<typename detail::Raw256<TFromD<D>>::type>(
      __builtin_shufflevector(reinterpret_cast<U32RawVectType>(lo.raw),
                              reinterpret_cast<U32RawVectType>(hi.raw), 0, 1, 2,
                              3, 4, 5, 6, 7))};
#else
  const RebindToUnsigned<decltype(d)> du;
  const Half<decltype(du)> du128;
  detail::U256 u;
  u.ii[0] = BitCast(du128, lo).raw;
  u.ii[1] = BitCast(du128, hi).raw;
  return BitCast(d, VFromD<decltype(du)>{u.i256});
#endif
}

For example, with the Combine implementation, you mean that we should do as below to avoid relying on the union member trick:

const RebindToUnsigned<decltype(d)> du;
const Half<decltype(du)> du128;
__m256i vec_result;
__m128i vec_tmp[2];
vec_tmp[0] = BitCast(du128, lo).raw;
vec_tmp[1] = BitCast(du128, hi).raw;
CopyBytes<32>(vec_tmp, &vec_result);
return BitCast(d, VFromD<decltype(du)>{vec_result});

Right? Any more suggestions are appreciated. Thanks.

Member

Yes, that looks good. You might want to add alignas(32) to both. The __builtin_shufflevector code path is still preferable when supported; it has less risk of extra code/actual copies being generated.
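For illustration, a hypothetical sketch of that fallback with alignas(32) applied to both buffers (the merged code may differ):

// Hypothetical sketch: CopyBytes-based Combine fallback with alignas(32) on
// both the temporary halves and the result, as suggested above.
template <class D, HWY_IF_V_SIZE_D(D, 32)>
HWY_API VFromD<D> Combine(D d, VFromD<Half<D>> hi, VFromD<Half<D>> lo) {
  const RebindToUnsigned<decltype(d)> du;
  const Half<decltype(du)> du128;
  alignas(32) __m256i vec_result;
  alignas(32) __m128i vec_tmp[2];
  vec_tmp[0] = BitCast(du128, lo).raw;
  vec_tmp[1] = BitCast(du128, hi).raw;
  CopyBytes<32>(vec_tmp, &vec_result);
  return BitCast(d, VFromD<decltype(du)>{vec_result});
}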

Contributor

Thanks very much, we can do that.

// ------------------------------ ShiftLeftSame

template <typename T, HWY_IF_UI8(T)>
HWY_API Vec256<T> ShiftLeftSame(const Vec256<T> v, const int bits) {
Member

Here also, should we implement as Shl(v, Set(d, bits))?
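A hypothetical sketch of that, for illustration (not necessarily the merged code):

// Hypothetical sketch: forward ShiftLeftSame to Shl with a broadcast count.
template <typename T>
HWY_API Vec256<T> ShiftLeftSame(const Vec256<T> v, const int bits) {
  const DFromV<decltype(v)> d;
  return Shl(v, Set(d, static_cast<T>(bits)));
}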

Contributor

Yeah, it's nice to reuse the code; we can do that. Thanks.

const DFromV<decltype(x)> d;
const RebindToUnsigned<decltype(d)> du;
const RepartitionToWide<decltype(du)> dd;
return BitCast(d, VFromD<decltype(du)>{__lasx_xvbitsel_v(
Member

Should this be implemented using OddEven, to avoid duplicating the logic?

Contributor

Yes, will update in next push.


// ------------------------------ Floating-point classification

// FIXME: disable gcc-14 tree-based loop optimizations to prevent test failures
Member

Which tests are failing? Would be good to mention that in the comment.

Contributor Author

Done.


template <class D, HWY_IF_V_SIZE_D(D, 32)>
HWY_API VFromD<D> Combine(D d, VFromD<Half<D>> hi, VFromD<Half<D>> lo) {
#if HWY_COMPILER_CLANG && HWY_HAS_BUILTIN(__builtin_shufflevector)
Member

GCC also supports __builtin_shufflevector, right?

Contributor Author

You're right. I'll make the revisions. Thanks.

template <class D, HWY_IF_V_SIZE_D(D, 32)>
HWY_API VFromD<D> LoadDup128(D d, const TFromD<D>* HWY_RESTRICT p) {
detail::U256 u;
u.ii[0] = u.ii[1] = __lsx_vld(p, 0);
Member

We could implement this using Combine() to minimize usage of U256.
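For illustration, a hypothetical sketch of that (not necessarily the merged code):

// Hypothetical sketch: LoadDup128 expressed via Combine, avoiding the U256
// union entirely.
template <class D, HWY_IF_V_SIZE_D(D, 32)>
HWY_API VFromD<D> LoadDup128(D d, const TFromD<D>* HWY_RESTRICT p) {
  const Half<decltype(d)> dh;
  const VFromD<decltype(dh)> v128 = LoadU(dh, p);
  return Combine(d, v128, v128);
}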

Contributor

Done, thanks.

const RebindToSigned<decltype(d)> di;
const auto a = ConcatLowerLower(d, v, v);
const auto b = ConcatUpperUpper(d, v, v);
return BitCast(d, Vec256<int64_t>{__lasx_xvshuf_d(idx.raw, BitCast(di, b).raw,
Member

Wouldn't vperm_d be much more efficient?

Contributor

@jinboson May 13, 2025

Actually, the LoongArch instruction set does not have a vperm_d instruction :)

Member

Contributor

__lasx_xvpermi_d only accepts a compile-time constant value for imm.

Member

I see. We might still be better off synthesizing 32-bit indices from 64-bit: 2*idx + {1,0}, then using vperm_w?

Contributor

You mean the same process as on x86, right? :

// 64-bit lanes: convert indices to 8x32 unless AVX3 is available
template <class D, HWY_IF_V_SIZE_D(D, 32), HWY_IF_T_SIZE_D(D, 8), typename TI>
HWY_API Indices256<TFromD<D>> IndicesFromVec(D d, Vec256<TI> idx64) {
  static_assert(sizeof(TFromD<D>) == sizeof(TI), "Index size must match lane");
  const Rebind<TI, decltype(d)> di;
  (void)di;  // potentially unused
#if HWY_IS_DEBUG_BUILD
  HWY_DASSERT(AllFalse(di, Lt(idx64, Zero(di))) &&
              AllTrue(di, Lt(idx64, Set(di, static_cast<TI>(2 * Lanes(di))))));
#endif

#if HWY_TARGET <= HWY_AVX3
  (void)d;
  return Indices256<TFromD<D>>{idx64.raw};
#else
  const Repartition<float, decltype(d)> df;  // 32-bit!
  // Replicate 64-bit index into upper 32 bits
  const Vec256<TI> dup =
      BitCast(di, Vec256<float>{_mm256_moveldup_ps(BitCast(df, idx64).raw)});
  // For each idx64 i, idx32 are 2*i and 2*i+1.
  const Vec256<TI> idx32 = dup + dup + Set(di, TI(1) << 32);
  return Indices256<TFromD<D>>{idx32.raw};
#endif
}

Member

Yes indeed :)

Contributor

It seems that the implementation of IndicesFromVec is much more complicated than ConcatLowerLower plus ConcatUpperUpper, which means no performance benefit? What do you think?

Member

@jan-wassenberg May 13, 2025

I see ConcatLowerLower/UpperUpper are 3 cycles like on x86, but shuf_d is just 1 cycle latency, so total 7.
The proposal is DupEven (1 cycle) plus a load and two adds (likely also one cycle each) plus perm_w (3 cycles), so total 6. But I am not familiar with the throughputs. The numbers on https://jia.je/unofficial-loongarch-intrinsics-guide are likely reciprocals, so cycles per instruction, right? That might change things, but generally I'd think addition pairs better with shuffle-heavy code.

Contributor

@jinboson May 14, 2025

Agreed on the impressive cycle analysis. I implemented TableLookupLanes for T_SIZE = 8 as below:

template <typename T, HWY_IF_T_SIZE(T, 8)>
HWY_API Vec256<T> TableLookupLanes(Vec256<T> v, Indices256<T> idx) {
  // const DFromV<decltype(v)> d;
  // const RebindToSigned<decltype(d)> di;
  // const auto a = ConcatLowerLower(d, v, v);
  // const auto b = ConcatUpperUpper(d, v, v);
  // return BitCast(d, Vec256<int64_t>{__lasx_xvshuf_d(idx.raw, BitCast(di,
  // b).raw,
  //                                                  BitCast(di, a).raw)});
  using TI = MakeSigned<T>;
  const DFromV<decltype(v)> d;
  const RebindToSigned<decltype(d)> di64;
  const Repartition<int32_t, decltype(d)> di32;
  const Vec256<TI> dup{__lasx_xvpackev_w(idx.raw, idx.raw)};
  const Vec256<TI> idx32 = dup + dup + Set(di64, int64_t(1) << 32);
  return BitCast(
      d, TableLookupLanes(BitCast(di32, v), Indices256<int32_t>{idx32.raw}));
}

The cycles are 7?
==> DupEven (1 cycle) + two adds (2 cycles) + one Set() (1 cycle) + vperm_w (3 cycles); where am I wrong?

But even supposing my calculation were right, I would still respect what you said:

but generally I'd think addition pairs better with shuffle-heavy code.

HWY_API Vec256<T> TwoTablesLookupLanes(Vec256<T> a, Vec256<T> b,
Indices256<T> idx) {
const auto idx2 = Indices256<T>{__lasx_xvandi_b(idx.raw, 31)};
const auto sel_hi_mask = __lasx_xvslli_b(idx.raw, 2);
Member

Let's also use ShiftLeft here. Also, is the xvandi really required? It's OK to have an implementation-defined result if the indices are out of bounds.

Contributor

@jinboson May 14, 2025

TableLookupLanes will choose xvshuf_b for T_SIZE=1, which returns unpredictable results when idx.raw bits 7..5 are not all zeros. So the xvandi is required to ensure the results are as expected.

}

template <typename T, HWY_IF_T_SIZE(T, 2)>
HWY_API Vec256<T> TwoTablesLookupLanes(Vec256<T> a, Vec256<T> b,
Member

We could also consider making this generic for all lane types (just remove the IF_T_SIZE) to reuse code.
Using IfNegativeThenElse would avoid the _h suffix. And we'd have to compute the shift count, which is a bit ugly - but I think 8*sizeof(T) - 6 + CeilLog2(sizeof(T)) would work.

Contributor

It works, thanks. Done!

const Mask128<T> maskL = MaskFromVec(LowerHalf(VecFromMask(d, mask)));
const Vec128<T> expandL = Expand(LowerHalf(v), maskL);
// We have to shift the input by a variable number of bytes, but there isn't
// a table-driven option for that until VBMI, and CPUs with that likely also
Member

Might want to update the comment. I think there is no native Expand here.

Contributor

Done!

HecaiYuan and others added 3 commits May 14, 2025 16:46
@jinboson
Contributor

Please let me know if anything else is needed. :)

Member

@jan-wassenberg left a comment

Thanks for making the changes. Looks like we're ready to merge.

@copybara-service copybara-service bot merged commit 32ed3c8 into google:master May 15, 2025
27 of 29 checks passed
@jinboson
Contributor

Thanks to @jan-wassenberg and @johnplatts for your thoughtful and detailed reviews, and also to @HecaiYuan for the changes. Thanks everyone! :)
