Add loongarch support and LSX/LASX impl #2560

Merged
merged 3 commits into google:master on May 15, 2025

Conversation

HecaiYuan
Contributor

Added CI tests and implemented LSX (128-bit vector extension) and LASX (256-bit vector extension) optimizations. Welcome to review. @jan-wassenberg @jeromejj

@jan-wassenberg
Member

Congrats on the successful implementation! I will review next week, have been very busy with gemma.cpp.

@johnplatts
Contributor

There are some more enhancements that should be made on the LSX/LASX targets:

  • Implement Vec128<T> → Vec256<T> Combine, Vec256<T> → Vec128<T> LowerHalf/UpperHalf, and Vec256<T> → Vec128<T, N> [ZeroExtend]ResizeBitCast using __builtin_shufflevector if available (this can be checked for using #if HWY_HAS_BUILTIN(__builtin_shufflevector); Clang uses __builtin_shufflevector to implement some of the AVX __m128/__m128i/__m128d → __m256/__m256i/__m256d intrinsics)
  • LSX/LASX-specific implementations of RoundingShiftRight using __lsx_vsrlri_*/__lsx_vsrari_*/__lasx_xvsrlri_*/__lasx_xvsrari_*
  • LSX/LASX-specific implementations of RoundingShr using __lsx_vsrlr_*/__lsx_vsrar_*/__lasx_xvsrlr_*/__lasx_xvsrar_* (see the sketch after this list)
  • Implement Mask128<T>/Mask256<T> SetOnlyFirst/SetAtOrAfterFirst using __lsx_vsub_q/__lasx_xvsub_q (equivalent to PPC8 vec_sub(__vector unsigned __int128, __vector unsigned __int128))

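For illustration, a minimal hypothetical sketch of the RoundingShr bullet above for 8-bit lanes, assuming the __lsx_vsrlr_b/__lsx_vsrar_b intrinsics named there take the per-lane shift counts as a second vector operand; this is a sketch of the suggestion, not the code that was merged:

// Hypothetical sketch: LSX-specific RoundingShr for 8-bit lanes. Assumes
// __lsx_vsrlr_b / __lsx_vsrar_b (rounding logical/arithmetic right shift by
// per-lane counts), as listed in the suggestion above.
template <size_t N>
HWY_API Vec128<uint8_t, N> RoundingShr(Vec128<uint8_t, N> v,
                                       Vec128<uint8_t, N> bits) {
  return Vec128<uint8_t, N>{__lsx_vsrlr_b(v.raw, bits.raw)};
}

template <size_t N>
HWY_API Vec128<int8_t, N> RoundingShr(Vec128<int8_t, N> v,
                                      Vec128<int8_t, N> bits) {
  return Vec128<int8_t, N>{__lsx_vsrar_b(v.raw, bits.raw)};
}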
@HecaiYuan
Contributor Author

Thank you for the suggestions. I'll address the ones that apply.

@HecaiYuan
Contributor Author

HecaiYuan commented May 9, 2025

LoongArch does not have the __lsx_vsub_q/__lasx_xvsub_q instructions. Other suggestions have been addressed. @johnplatts

Member

@jan-wassenberg left a comment


Nice work! And it's a nice instruction set :)
Some comments already on LSX and the other files, I have not yet reviewed LASX.

CMakeLists.txt Outdated
# enabling LASX still requires the -mlasx flag to be passed. In order to enable
# all targets, we can check them directly, adding them if they are supported.
# This way, our local compilers (GCC 8.3.0 or Clang 8.0.1) can also enable the
# LSX & LASX targets. Any better ideas are welcome.
Member

How about using check_cxx_source_compiles to verify compiler support?

Contributor

@jinboson May 10, 2025

Yeah, no problem. Will update in the next push.

CMakeLists.txt Outdated
@@ -68,6 +68,14 @@ endif()
# The following is only required with GCC < 6.1.0 or CLANG < 16.0
set(HWY_CMAKE_ARM7 OFF CACHE BOOL "Set copts for Armv7 with NEON (requires vfpv4)?")

# Upstream compilers(GCC 14 or CLANG 18) start supporting LSX by default, but
# enabling LASX still require -mlasx flag to be passed, in order to enable all
Member

Do we really require -mlasx? The patch includes enabling runtime dispatch, but only for LASX?

Contributor

@jinboson May 10, 2025

I am not sure whether I understand you correctly, so let me expand on it. In line 71, the word "supporting" should be replaced with "enabling", which is more accurate: GCC 14 or Clang 18 can actually compile LASX code, but the -mlasx flag must be passed; otherwise we would only compile LSX code.

As for enabling runtime dispatch, it's NOT only for LASX; that was my mistake. Will fix it in the next push.

Thanks for your reviews; please feel free to let me know if there are any enhancements that should be made.

HWY_API Vec128<float, N> IfThenElse(Mask128<float, N> mask,
Vec128<float, N> yes, Vec128<float, N> no) {
const DFromV<decltype(yes)> d;
Member

Is there a penalty from mixing int/float? If not, it's probably better to BitCast to signed (use RebindToSigned<decltype(d)>), then call the previous IfThenElse overload.
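A minimal, hypothetical sketch of this suggestion (BitCast to signed and reuse the existing integer overload); the merged code may differ:

// Hypothetical sketch of the suggestion above: BitCast to signed and forward
// to the existing integer IfThenElse overload. Not necessarily the merged code.
template <size_t N>
HWY_API Vec128<float, N> IfThenElse(Mask128<float, N> mask,
                                    Vec128<float, N> yes,
                                    Vec128<float, N> no) {
  const DFromV<decltype(yes)> d;
  const RebindToSigned<decltype(d)> di;
  return BitCast(d, IfThenElse(RebindMask(di, mask), BitCast(di, yes),
                               BitCast(di, no)));
}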

Contributor Author

Done.

template <class D, HWY_IF_V_SIZE_LE_D(D, 16), HWY_IF_T_SIZE_D(D, 1)>
HWY_INLINE VFromD<D> Iota0(D /*d*/) {
alignas(16) TFromD<D> _tmp_data[] = {
Member

Here also static constexpr?

Contributor Author

Done.

HWY_API Vec128<uint8_t, N> ShiftRightSame(const Vec128<uint8_t, N> v,
int bits) {
return Vec128<uint8_t, N>{__lsx_vsrl_b(v.raw, __lsx_vreplgr2vr_b(bits))};
Member

Whenever there is no efficiency loss, it's nice to reuse code to avoid duplication. Can we just have a single template of T, N that calls Set() and Shr()?
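For illustration, a hypothetical sketch of the suggested single template (not necessarily the merged code):

// Hypothetical sketch: one generic ShiftRightSame that broadcasts the count
// with Set and reuses the per-lane variable shift Shr, instead of per-type
// intrinsics.
template <typename T, size_t N>
HWY_API Vec128<T, N> ShiftRightSame(const Vec128<T, N> v, const int bits) {
  const DFromV<decltype(v)> d;
  return Shr(v, Set(d, static_cast<T>(bits)));
}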

Contributor Author

Done.

}

// usinged
Member

Unsigned

Contributor Author

Done.

HWY_INLINE VFromD<D> ClampF64ToI32Max(D d, VFromD<D> v) {
// The max can be exactly represented in binary64, so clamping beforehand
// prevents x86 conversion from raising an exception and returning 80..00.
Member

Does LoongArch also have the same behaviour as x86? Might want to update the comment.

Contributor Author

I made modifications to the DemoteTo(double->UI32) instruction and removed ClampF64ToI32Max. It seems that ClampF64ToI32Max is not needed.

#endif // HWY_TARGET > HWY_LASX

// ------------------------------ Integer <=> fp (ShiftRight, OddEven)
Member

These implementations do not use/depend on ShiftRight/OddEven, so can remove the comment.

Contributor Author

Done.

Member

@jan-wassenberg left a comment

Also reviewed LASX now :)


namespace detail {

// we don't have intrinsics to operate between 128-bit and 256-bit,
Member

Hm, it is only 100% guaranteed correct to read the union member that was last written. Maybe it's better to use hwy::CopyBytes instead, if that is zero-cost?

Contributor

@jinboson May 13, 2025

template <class D, HWY_IF_V_SIZE_D(D, 32)> 
HWY_API VFromD<D> Combine(D d, VFromD<Half<D>> hi, VFromD<Half<D>> lo) {
#if HWY_HAS_BUILTIN(__builtin_shufflevector)
  (void)d;
  typedef uint32_t U32RawVectType __attribute__((__vector_size__(16)));
  return VFromD<D>{reinterpret_cast<typename detail::Raw256<TFromD<D>>::type>(
      __builtin_shufflevector(reinterpret_cast<U32RawVectType>(lo.raw),
                              reinterpret_cast<U32RawVectType>(hi.raw), 0, 1, 2,
                              3, 4, 5, 6, 7))};
#else
  const RebindToUnsigned<decltype(d)> du;
  const Half<decltype(du)> du128;
  detail::U256 u;
  u.ii[0] = BitCast(du128, lo).raw;
  u.ii[1] = BitCast(du128, hi).raw;
  return BitCast(d, VFromD<decltype(du)>{u.i256});
#endif
}

For example, with the Combine implementation, you mean that we should do as below to avoid relying on the union member trick:

const RebindToUnsigned<decltype(d)> du;
const Half<decltype(du)> du128;
__m256i vec_result;
__m128i vec_tmp[2];
vec_tmp[0] = BitCast(du128, lo).raw;
vec_tmp[1] = BitCast(du128, hi).raw;
CopyBytes<32>(vec_tmp, &vec_result);
return BitCast(d, VFromD<decltype(du)>{vec_result});

Right? Any more suggestions are appreciated. Thanks.

Member

Yes, that looks good. You might want to add alignas(32) to both. The __builtin_shufflevector code path is still preferable when supported; it has less risk of extra code/actual copies being generated.
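For illustration, a hypothetical sketch of that fallback with alignas(32) applied to both buffers (the merged code may differ):

// Hypothetical sketch: CopyBytes-based Combine fallback with alignas(32) on
// both the temporary halves and the result, as suggested above.
template <class D, HWY_IF_V_SIZE_D(D, 32)>
HWY_API VFromD<D> Combine(D d, VFromD<Half<D>> hi, VFromD<Half<D>> lo) {
  const RebindToUnsigned<decltype(d)> du;
  const Half<decltype(du)> du128;
  alignas(32) __m256i vec_result;
  alignas(32) __m128i vec_tmp[2];
  vec_tmp[0] = BitCast(du128, lo).raw;
  vec_tmp[1] = BitCast(du128, hi).raw;
  CopyBytes<32>(vec_tmp, &vec_result);
  return BitCast(d, VFromD<decltype(du)>{vec_result});
}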

Contributor

Thanks very much, we can do that.

// ------------------------------ ShiftLeftSame

template <typename T, HWY_IF_UI8(T)>
HWY_API Vec256<T> ShiftLeftSame(const Vec256<T> v, const int bits) {
Member

Here also, should we implement as Shl(v, Set(d, bits))?
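A hypothetical sketch of that, for illustration (not necessarily the merged code):

// Hypothetical sketch: forward ShiftLeftSame to Shl with a broadcast count.
template <typename T>
HWY_API Vec256<T> ShiftLeftSame(const Vec256<T> v, const int bits) {
  const DFromV<decltype(v)> d;
  return Shl(v, Set(d, static_cast<T>(bits)));
}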

Contributor

Yeah, it's nice to reuse the code; we can do that. Thanks.

const DFromV<decltype(x)> d;
const RebindToUnsigned<decltype(d)> du;
const RepartitionToWide<decltype(du)> dd;
return BitCast(d, VFromD<decltype(du)>{__lasx_xvbitsel_v(
Member

Should this be implemented using OddEven, to avoid duplicating the logic?

Contributor

Yes, will update in next push.


// ------------------------------ Floating-point classification

// FIXME: disable gcc-14 tree-based loop optimizations to prevent test failures
Member

Which tests are failing? Would be good to mention that in the comment.

Contributor Author

Done.


template <class D, HWY_IF_V_SIZE_D(D, 32)>
HWY_API VFromD<D> Combine(D d, VFromD<Half<D>> hi, VFromD<Half<D>> lo) {
#if HWY_COMPILER_CLANG && HWY_HAS_BUILTIN(__builtin_shufflevector)
Member

GCC also supports __builtin_shufflevector, right?

Contributor Author

You're right. I'll make the revisions. Thanks.

template <class D, HWY_IF_V_SIZE_D(D, 32)>
HWY_API VFromD<D> LoadDup128(D d, const TFromD<D>* HWY_RESTRICT p) {
detail::U256 u;
u.ii[0] = u.ii[1] = __lsx_vld(p, 0);
Member

We could implement this using Combine() to minimize usage of U256.
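For illustration, a hypothetical sketch of that (not necessarily the merged code):

// Hypothetical sketch: LoadDup128 expressed via Combine, avoiding the U256
// union entirely.
template <class D, HWY_IF_V_SIZE_D(D, 32)>
HWY_API VFromD<D> LoadDup128(D d, const TFromD<D>* HWY_RESTRICT p) {
  const Half<decltype(d)> dh;
  const VFromD<decltype(dh)> v128 = LoadU(dh, p);
  return Combine(d, v128, v128);
}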

Contributor

Done, thanks.

const RebindToSigned<decltype(d)> di;
const auto a = ConcatLowerLower(d, v, v);
const auto b = ConcatUpperUpper(d, v, v);
return BitCast(d, Vec256<int64_t>{__lasx_xvshuf_d(idx.raw, BitCast(di, b).raw,
Member

Wouldn't vperm_d be much more efficient?

Contributor

@jinboson May 13, 2025

Actually, the LoongArch instruction set does not have a vperm_d instruction :)

Member

Contributor

__lasx_xvpermi_d only accepts a compile-time constant value for imm.

Member

I see. We might still be better off synthesizing 32-bit indices from 64-bit: 2*idx + {1,0}, then using vperm_w?

Contributor

You mean the same process as on x86, right? :

// 64-bit lanes: convert indices to 8x32 unless AVX3 is available
template <class D, HWY_IF_V_SIZE_D(D, 32), HWY_IF_T_SIZE_D(D, 8), typename TI>
HWY_API Indices256<TFromD<D>> IndicesFromVec(D d, Vec256<TI> idx64) {
  static_assert(sizeof(TFromD<D>) == sizeof(TI), "Index size must match lane");
  const Rebind<TI, decltype(d)> di;
  (void)di;  // potentially unused
#if HWY_IS_DEBUG_BUILD
  HWY_DASSERT(AllFalse(di, Lt(idx64, Zero(di))) &&
              AllTrue(di, Lt(idx64, Set(di, static_cast<TI>(2 * Lanes(di))))));
#endif

#if HWY_TARGET <= HWY_AVX3
  (void)d;
  return Indices256<TFromD<D>>{idx64.raw};
#else
  const Repartition<float, decltype(d)> df;  // 32-bit!
  // Replicate 64-bit index into upper 32 bits
  const Vec256<TI> dup =
      BitCast(di, Vec256<float>{_mm256_moveldup_ps(BitCast(df, idx64).raw)});
  // For each idx64 i, idx32 are 2*i and 2*i+1.
  const Vec256<TI> idx32 = dup + dup + Set(di, TI(1) << 32);
  return Indices256<TFromD<D>>{idx32.raw};
#endif
}

Member

Yes indeed :)

Contributor

It seems that the implementation of IndicesFromVec is much more complicated than ConcatLowerLower plus ConcatUpperUpper, which means no performance benefit? What do you think?

Member

@jan-wassenberg May 13, 2025

I see ConcatLowerLower/UpperUpper are 3 cycles like on x86, but shuf_d is just 1 cycle latency, so total 7.
The proposal is DupEven (1 cycle) plus a load and two adds (likely also one cycle each) plus perm_w (3 cycles), so total 6. But I am not familiar with the throughputs. The numbers on https://jia.je/unofficial-loongarch-intrinsics-guide are likely reciprocals, so cycles per instruction, right? That might change things, but generally I'd think addition pairs better with shuffle-heavy code.

Contributor

@jinboson May 14, 2025

Agreed on the impressive cycle analysis. I implemented TableLookupLanes for T_SIZE = 8 as below:

template <typename T, HWY_IF_T_SIZE(T, 8)>
HWY_API Vec256<T> TableLookupLanes(Vec256<T> v, Indices256<T> idx) {
  // const DFromV<decltype(v)> d;
  // const RebindToSigned<decltype(d)> di;
  // const auto a = ConcatLowerLower(d, v, v);
  // const auto b = ConcatUpperUpper(d, v, v);
  // return BitCast(d, Vec256<int64_t>{__lasx_xvshuf_d(idx.raw, BitCast(di,
  // b).raw,
  //                                                  BitCast(di, a).raw)});
  using TI = MakeSigned<T>;
  const DFromV<decltype(v)> d;
  const RebindToSigned<decltype(d)> di64;
  const Repartition<int32_t, decltype(d)> di32;
  const Vec256<TI> dup{__lasx_xvpackev_w(idx.raw, idx.raw)};
  const Vec256<TI> idx32 = dup + dup + Set(di64, int64_t(1) << 32);
  return BitCast(
      d, TableLookupLanes(BitCast(di32, v), Indices256<int32_t>{idx32.raw}));
}

The cycles are 7?
==> DupEven (1 cycle) + two adds (2 cycles) + one Set() (1 cycle) + vperm_w (3 cycles); where am I wrong?

But even supposing my calculation were right, I would still respect what you said:

but generally I'd think addition pairs better with shuffle-heavy code.

HWY_API Vec256<T> TwoTablesLookupLanes(Vec256<T> a, Vec256<T> b,
Indices256<T> idx) {
const auto idx2 = Indices256<T>{__lasx_xvandi_b(idx.raw, 31)};
const auto sel_hi_mask = __lasx_xvslli_b(idx.raw, 2);
Member

Let's also use ShiftLeft here. Also, is the xvandi really required? It's OK to have an implementation-defined result if the indices are out of bounds.

Contributor

@jinboson May 14, 2025

TableLookupLanes will choose xvshuf_b for T_SIZE=1, which returns unpredictable results when idx.raw bits 7..5 are not all zeros. So the xvandi is required to ensure the results are as expected.

}

template <typename T, HWY_IF_T_SIZE(T, 2)>
HWY_API Vec256<T> TwoTablesLookupLanes(Vec256<T> a, Vec256<T> b,
Member

We could also consider making this generic for all lane types (just remove the IF_T_SIZE) to reuse code.
Using IfNegativeThenElse would avoid the _h suffix. And we'd have to compute the shift count, which is a bit ugly - but I think 8*sizeof(T) - 6 + CeilLog2(sizeof(T)) would work.

Contributor

It works, thanks. Done!

const Mask128<T> maskL = MaskFromVec(LowerHalf(VecFromMask(d, mask)));
const Vec128<T> expandL = Expand(LowerHalf(v), maskL);
// We have to shift the input by a variable number of bytes, but there isn't
// a table-driven option for that until VBMI, and CPUs with that likely also
Member

Might want to update the comment. I think there is no native Expand here.

Contributor

Done!

HecaiYuan and others added 3 commits May 14, 2025 16:46
@jinboson
Contributor

Please let me know if anything else is needed. :)

Member

@jan-wassenberg left a comment

Thanks for making the changes. Looks like we're ready to merge.

@copybara-service copybara-service bot merged commit 32ed3c8 into google:master May 15, 2025
27 of 29 checks passed
@jinboson
Contributor

Thanks to @jan-wassenberg and @johnplatts for your thoughtful and detailed reviews, and also to @HecaiYuan for the changes. Thanks everyone! :)
