Add loongarch support and LSX/LASX impl #2560
Conversation
Congrats on the successful implementation! I will review next week; I have been very busy with gemma.cpp.
There are some more enhancements that should be made on the LSX/LASX targets:
Thank you for the suggestion. I'll fix what doesn't fit.
LoongArch does not have the __lsx_vsub_q/__lasx_xvsub_q instructions. The other suggestions have been addressed. @johnplatts
Nice work! And it's a nice instruction set :)
Some comments already on LSX and the other files; I have not yet reviewed LASX.
CMakeLists.txt
Outdated
# enabling LASX still require -mlasx flag to be passed, in order to enable all
# targets, we can check them directly, adding them if they are supported. In
# this way, Our local compilers(GCC 8.3.0 or CLANG 8.0.1) also could enable
# LSX & LASX targets. Any better ideas are welcom.
How about using check_cxx_source_compiles to verify compiler support?
Yeah, no problem. Will update in the next push.
CMakeLists.txt
Outdated
@@ -68,6 +68,14 @@ endif()
# The following is only required with GCC < 6.1.0 or CLANG < 16.0
set(HWY_CMAKE_ARM7 OFF CACHE BOOL "Set copts for Armv7 with NEON (requires vfpv4)?")

# Upstream compilers(GCC 14 or CLANG 18) start supporting LSX by default, but
# enabling LASX still require -mlasx flag to be passed
Do we really require -mlasx? The patch includes enabling runtime dispatch, but only for LASX?
I am not sure whether I understand you correctly, so let me expand. In line 71, the word "supporting" should be replaced with "enabling", which is more accurate: GCC 14 or CLANG 18 can actually compile LASX code, but the -mlasx flag must be passed; otherwise, we would only compile LSX code.
As for enabling runtime dispatch, it's NOT only for LASX; that was my mistake. Will fix it in the next push.
Thanks for your reviews; please feel free to let me know if there are any enhancements that should be made.
hwy/ops/loongarch_lsx-inl.h
Outdated
HWY_API Vec128<float, N> IfThenElse(Mask128<float, N> mask,
                                    Vec128<float, N> yes, Vec128<float, N> no) {
  const DFromV<decltype(yes)> d;
Is there a penalty from mixing int/float? If not, it's probably better to BitCast to signed (use RebindToSigned<decltype(d)>), then call the previous IfThenElse overload.
Done.
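For reference, the suggested reuse could look roughly like this (a sketch, not necessarily the exact code that landed; it assumes RebindMask converts the float mask to the signed-integer domain):
template <size_t N>
HWY_API Vec128<float, N> IfThenElse(Mask128<float, N> mask,
                                    Vec128<float, N> yes, Vec128<float, N> no) {
  const DFromV<decltype(yes)> d;
  const RebindToSigned<decltype(d)> di;
  // Reuse the integer IfThenElse overload; the bit patterns are unchanged.
  return BitCast(d, IfThenElse(RebindMask(di, mask), BitCast(di, yes),
                               BitCast(di, no)));
}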
hwy/ops/loongarch_lsx-inl.h
Outdated
template <class D, HWY_IF_V_SIZE_LE_D(D, 16), HWY_IF_T_SIZE_D(D, 1)>
HWY_INLINE VFromD<D> Iota0(D /*d*/) {
  alignas(16) TFromD<D> _tmp_data[] = {
Here also static constexpr?
Done.
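The change amounts to something like the following sketch (table values shown for the 8-bit case; Load is assumed here in place of the raw __lsx_vld):
template <class D, HWY_IF_V_SIZE_LE_D(D, 16), HWY_IF_T_SIZE_D(D, 1)>
HWY_INLINE VFromD<D> Iota0(D d) {
  // static constexpr lets the compiler keep the table in read-only storage
  // instead of re-materializing it on each call.
  alignas(16) static constexpr TFromD<D> kIota[16] = {
      0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
  return Load(d, kIota);
}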
hwy/ops/loongarch_lsx-inl.h
Outdated
HWY_API Vec128<uint8_t, N> ShiftRightSame(const Vec128<uint8_t, N> v,
                                          int bits) {
  return Vec128<uint8_t, N>{__lsx_vsrl_b(v.raw, __lsx_vreplgr2vr_b(bits))};
Whenever there is no efficiency loss, it's nice to reuse code to avoid duplication. Can we just have a single template of T, N that calls Set() and Shr()?
Done.
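A deduplicated version might look like this sketch, with Set broadcasting the count and Shr doing the per-lane shift:
template <typename T, size_t N>
HWY_API Vec128<T, N> ShiftRightSame(const Vec128<T, N> v, int bits) {
  const DFromV<decltype(v)> d;
  // One definition for all lane types; Shr selects the right intrinsic.
  return Shr(v, Set(d, static_cast<T>(bits)));
}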
hwy/ops/loongarch_lsx-inl.h
Outdated
}

// usinged
Unsigned
Done.
hwy/ops/loongarch_lsx-inl.h
Outdated
HWY_INLINE VFromD<D> ClampF64ToI32Max(D d, VFromD<D> v) {
  // The max can be exactly represented in binary64, so clamping beforehand
  // prevents x86 conversion from raising an exception and returning 80..00.
Does LoongArch also have the same behaviour as x86? Might want to update the comment.
I made modifications to the DemoteTo (double->UI32) ops and removed ClampF64ToI32Max. It seems that ClampF64ToI32Max is not needed.
hwy/ops/loongarch_lsx-inl.h
Outdated
#endif  // HWY_TARGET > HWY_LASX

// ------------------------------ Integer <=> fp (ShiftRight, OddEven)
These implementations do not use/depend on ShiftRight/OddEven, so can remove the comment.
Done.
Also reviewed LASX now :)
hwy/ops/loongarch_lasx-inl.h
Outdated
namespace detail {

// we don't have intrinsics to operate between 128-bit and 256-bit,
Hm, it is only guaranteed correct to read the last union member written. Maybe it's better to use hwy::CopyBytes instead, if that is zero-cost?
template <class D, HWY_IF_V_SIZE_D(D, 32)>
HWY_API VFromD<D> Combine(D d, VFromD<Half<D>> hi, VFromD<Half<D>> lo) {
#if HWY_HAS_BUILTIN(__builtin_shufflevector)
  (void)d;
  typedef uint32_t U32RawVectType __attribute__((__vector_size__(16)));
  return VFromD<D>{reinterpret_cast<typename detail::Raw256<TFromD<D>>::type>(
      __builtin_shufflevector(reinterpret_cast<U32RawVectType>(lo.raw),
                              reinterpret_cast<U32RawVectType>(hi.raw), 0, 1,
                              2, 3, 4, 5, 6, 7))};
#else
  const RebindToUnsigned<decltype(d)> du;
  const Half<decltype(du)> du128;
  detail::U256 u;
  u.ii[0] = BitCast(du128, lo).raw;
  u.ii[1] = BitCast(du128, hi).raw;
  return BitCast(d, VFromD<decltype(du)>{u.i256});
#endif
}
For example with the Combine implementation above, you mean that we should do as below to avoid reading a union member other than the last one written:
  const RebindToUnsigned<decltype(d)> du;
  const Half<decltype(du)> du128;
  __m256i vec_result;
  __m128i vec_tmp[2];
  vec_tmp[0] = BitCast(du128, lo).raw;
  vec_tmp[1] = BitCast(du128, hi).raw;
  CopyBytes<32>(vec_tmp, &vec_result);
  return BitCast(d, VFromD<decltype(du)>{vec_result});
Right? Any more suggestions are appreciated. Thanks.
Yes, that looks good. You might want to add alignas(32) to both. The __builtin_shufflevector code path is still preferable when supported; it carries less risk of extra code/actual copies being generated.
Thanks very much, we can do that.
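With alignas added, the fallback branch would then read roughly as follows (a sketch of the #else path only; names are from the snippet above):
  const RebindToUnsigned<decltype(d)> du;
  const Half<decltype(du)> du128;
  alignas(32) __m128i vec_tmp[2];
  alignas(32) __m256i vec_result;
  vec_tmp[0] = BitCast(du128, lo).raw;
  vec_tmp[1] = BitCast(du128, hi).raw;
  // CopyBytes avoids relying on reading a union member other than the one
  // last written.
  CopyBytes<32>(vec_tmp, &vec_result);
  return BitCast(d, VFromD<decltype(du)>{vec_result});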
hwy/ops/loongarch_lasx-inl.h
Outdated
// ------------------------------ ShiftLeftSame

template <typename T, HWY_IF_UI8(T)>
HWY_API Vec256<T> ShiftLeftSame(const Vec256<T> v, const int bits) {
Here also, should we implement this as Shl(v, Set(d, bits))?
Yeah, it's nice to reuse the code; we can do that. Thanks.
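The reuse would look like this sketch, mirroring the LSX ShiftRightSame change:
template <typename T>
HWY_API Vec256<T> ShiftLeftSame(const Vec256<T> v, const int bits) {
  const DFromV<decltype(v)> d;
  // Broadcast the count once, then defer to the existing Shl overloads.
  return Shl(v, Set(d, static_cast<T>(bits)));
}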
hwy/ops/loongarch_lasx-inl.h
Outdated
  const DFromV<decltype(x)> d;
  const RebindToUnsigned<decltype(d)> du;
  const RepartitionToWide<decltype(du)> dd;
  return BitCast(d, VFromD<decltype(du)>{__lasx_xvbitsel_v(
Should this be implemented using OddEven, to avoid duplicating the logic?
Yes, will update in the next push.
hwy/ops/loongarch_lasx-inl.h
Outdated
// ------------------------------ Floating-point classification

// FIXME: disable gcc-14 tree-based loop optimizations to prevent test failures
Which tests are failing? Would be good to mention that in the comment.
Done.
hwy/ops/loongarch_lasx-inl.h
Outdated
template <class D, HWY_IF_V_SIZE_D(D, 32)>
HWY_API VFromD<D> Combine(D d, VFromD<Half<D>> hi, VFromD<Half<D>> lo) {
#if HWY_COMPILER_CLANG && HWY_HAS_BUILTIN(__builtin_shufflevector)
GCC also supports __builtin_shufflevector, right?
You're right. I'll make the revisions. Thanks.
hwy/ops/loongarch_lasx-inl.h
Outdated
template <class D, HWY_IF_V_SIZE_D(D, 32)>
HWY_API VFromD<D> LoadDup128(D d, const TFromD<D>* HWY_RESTRICT p) {
  detail::U256 u;
  u.ii[0] = u.ii[1] = __lsx_vld(p, 0);
We could implement this using Combine() to minimize usage of U256.
Done, thanks.
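A Combine-based version could look like this sketch (LoadU is assumed for the 128-bit load):
template <class D, HWY_IF_V_SIZE_D(D, 32)>
HWY_API VFromD<D> LoadDup128(D d, const TFromD<D>* HWY_RESTRICT p) {
  const Half<decltype(d)> dh;
  // Load one 128-bit block and place it in both halves, avoiding U256.
  const VFromD<decltype(dh)> lo = LoadU(dh, p);
  return Combine(d, lo, lo);
}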
hwy/ops/loongarch_lasx-inl.h
Outdated
  const RebindToSigned<decltype(d)> di;
  const auto a = ConcatLowerLower(d, v, v);
  const auto b = ConcatUpperUpper(d, v, v);
  return BitCast(d, Vec256<int64_t>{__lasx_xvshuf_d(idx.raw, BitCast(di, b).raw,
Wouldn't vperm_d be much more efficient?
Actually, the LoongArch instruction sets do not have a vperm_d implementation :)
__lasx_xvpermi_d only accepts a compile-time constant value for imm.
I see. We might still be better off synthesizing 32-bit indices from 64-bit: 2*idx + {1,0}, then using vperm_w?
You mean the same process as on x86, right? :
// 64-bit lanes: convert indices to 8x32 unless AVX3 is available
template <class D, HWY_IF_V_SIZE_D(D, 32), HWY_IF_T_SIZE_D(D, 8), typename TI>
HWY_API Indices256<TFromD<D>> IndicesFromVec(D d, Vec256<TI> idx64) {
  static_assert(sizeof(TFromD<D>) == sizeof(TI), "Index size must match lane");
  const Rebind<TI, decltype(d)> di;
  (void)di;  // potentially unused
#if HWY_IS_DEBUG_BUILD
  HWY_DASSERT(AllFalse(di, Lt(idx64, Zero(di))) &&
              AllTrue(di, Lt(idx64, Set(di, static_cast<TI>(2 * Lanes(di))))));
#endif
#if HWY_TARGET <= HWY_AVX3
  (void)d;
  return Indices256<TFromD<D>>{idx64.raw};
#else
  const Repartition<float, decltype(d)> df;  // 32-bit!
  // Replicate 64-bit index into upper 32 bits
  const Vec256<TI> dup =
      BitCast(di, Vec256<float>{_mm256_moveldup_ps(BitCast(df, idx64).raw)});
  // For each idx64 i, idx32 are 2*i and 2*i+1.
  const Vec256<TI> idx32 = dup + dup + Set(di, TI(1) << 32);
  return Indices256<TFromD<D>>{idx32.raw};
#endif
}
Yes indeed :)
It seems that the implementation of IndicesFromVec is much more complicated than ConcatLowerLower plus ConcatUpperUpper, which means it gains nothing in performance? What do you think?
I see ConcatLowerLower/UpperUpper are 3 cycles each, as on x86, but shuf_d is just 1 cycle latency, so the total is 7.
The proposal is DupEven (1 cycle) plus load + two adds (likely also one cycle each) plus perm_w (3 cycles), so the total is 6. But I am not familiar with the throughputs. The numbers on https://jia.je/unofficial-loongarch-intrinsics-guide are likely reciprocal throughputs, so cycles per instruction, right? That might change things, but generally I'd think addition pairs better with shuffle-heavy code.
Agreed on the impressive cycle analysis. I implemented TableLookupLanes for T_SIZE = 8 as below:
template <typename T, HWY_IF_T_SIZE(T, 8)>
HWY_API Vec256<T> TableLookupLanes(Vec256<T> v, Indices256<T> idx) {
  // const DFromV<decltype(v)> d;
  // const RebindToSigned<decltype(d)> di;
  // const auto a = ConcatLowerLower(d, v, v);
  // const auto b = ConcatUpperUpper(d, v, v);
  // return BitCast(d, Vec256<int64_t>{__lasx_xvshuf_d(
  //     idx.raw, BitCast(di, b).raw, BitCast(di, a).raw)});
  using TI = MakeSigned<T>;
  const DFromV<decltype(v)> d;
  const RebindToSigned<decltype(d)> di64;
  const Repartition<int32_t, decltype(d)> di32;
  const Vec256<TI> dup{__lasx_xvpackev_w(idx.raw, idx.raw)};
  const Vec256<TI> idx32 = dup + dup + Set(di64, int64_t(1) << 32);
  return BitCast(
      d, TableLookupLanes(BitCast(di32, v), Indices256<int32_t>{idx32.raw}));
}
The cycles are 7? ==> DupEven (1 cycle) + two adds (2 cycles) + one Set() (1 cycle) + vperm_w (3 cycles); where am I wrong?
But supposing my calculation were right, I would still respect what you said: "but generally I'd think addition pairs better with shuffle-heavy code."
hwy/ops/loongarch_lasx-inl.h
Outdated
HWY_API Vec256<T> TwoTablesLookupLanes(Vec256<T> a, Vec256<T> b,
                                       Indices256<T> idx) {
  const auto idx2 = Indices256<T>{__lasx_xvandi_b(idx.raw, 31)};
  const auto sel_hi_mask = __lasx_xvslli_b(idx.raw, 2);
Let's also use ShiftLeft here. Also, is the xvandi really required? It's OK to have an implementation-defined result if the indices are out of bounds.
TableLookupLanes will choose xvshuf_b for T_SIZE=1, which returns unpredictable results when idx.raw bits 7..5 are not all zero. So the xvandi is required to ensure the results are as expected.
hwy/ops/loongarch_lasx-inl.h
Outdated
}

template <typename T, HWY_IF_T_SIZE(T, 2)>
HWY_API Vec256<T> TwoTablesLookupLanes(Vec256<T> a, Vec256<T> b,
We could also consider making this generic for all lane types (just remove the IF_T_SIZE) to reuse code.
Using IfNegativeThenElse would avoid the _h suffix. And we'd have to compute the shift count, which is a bit ugly - but I think 8 * sizeof(T) - 6 + CeilLog2(sizeof(T)) would work.
It works, thanks. Done!
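A sketch of the generic version (it assumes Indices256<T>::raw can be wrapped in an integer vector, and that TableLookupLanes tolerates the table-select bit; the real code masks indices, e.g. with xvandi, where needed):
template <typename T>
HWY_API Vec256<T> TwoTablesLookupLanes(Vec256<T> a, Vec256<T> b,
                                       Indices256<T> idx) {
  const DFromV<decltype(a)> d;
  const RebindToSigned<decltype(d)> di;
  // Move the "select second table" index bit into the sign bit. For
  // T_SIZE=1 this is 8-6+0 = 2, matching the xvslli_b above.
  constexpr int kShift = 8 * sizeof(T) - 6 + CeilLog2(sizeof(T));
  const auto sel_hi_mask = ShiftLeft<kShift>(VFromD<decltype(di)>{idx.raw});
  const auto lo = BitCast(di, TableLookupLanes(a, idx));
  const auto hi = BitCast(di, TableLookupLanes(b, idx));
  return BitCast(d, IfNegativeThenElse(sel_hi_mask, hi, lo));
}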
hwy/ops/loongarch_lasx-inl.h
Outdated
  const Mask128<T> maskL = MaskFromVec(LowerHalf(VecFromMask(d, mask)));
  const Vec128<T> expandL = Expand(LowerHalf(v), maskL);
  // We have to shift the input by a variable number of bytes, but there isn't
  // a table-driven option for that until VBMI, and CPUs with that likely also
Might want to update the comment. I think there is no native Expand here.
Done!
Please let me know if anything else is needed. :)
Thanks for making the changes. Looks like we're ready to merge.
Thanks go to @jan-wassenberg and @johnplatts for the thoughtful and detailed reviews, and also to @HecaiYuan for the changes. Thanks everyone! :)
Added CI tests and implemented LSX (128-bit vector extension) and LASX (256-bit vector extension) optimizations. Reviews are welcome. @jan-wassenberg @jeromejj