Skip to content

Load index into Vector using highway ops #2544

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
zafeerali943 opened this issue Mar 26, 2025 · 19 comments
Open

Load index into Vector using highway ops #2544

zafeerali943 opened this issue Mar 26, 2025 · 19 comments

Comments

@zafeerali943
Copy link

Hi Champs!

I am trying to load/store an index into a simd vector from an index array pointer of type(size_t* indices). Is there any relevant operator available in highway to perform such load and store operations?

Thanks

@jan-wassenberg
Copy link
Member

Hi, our lane types are fixed-size whereas size_t can be either 32-bit or 64-bit.
I'd recommend choosing an IndexT as either uint32_t #if SIZE_MAX == 0xFFFFFFFF else uint64_t. Then you can cast pointers to this type and use the normal Load/LoadU.
Hope this helps?

@zafeerali943
Copy link
Author

@jan-wassenberg Thanks for the response! It sounds good. I’m currently trying to load elements from an array with any data type using an index vector. I believe this can be achieved using the GatherIndex operation in Highway, but I’m facing some confusion regarding the following:

Let’s say I want to load int32 elements into a 128-bit SIMD vector using an index vector.

The index vector is 64 bits wide, meaning it can hold 2 indices per vector. However, int32 elements can be processed in sets of 4 at a time.

This leads me to wonder—using the GatherIndex operation to load 4 int32 elements into a vector with only 2 indices per vector seems impossible, right?

How can this situation be handled?

@jan-wassenberg
Copy link
Member

Yes, mixed-width gathers are problematic. Although x86 supports them, other platforms do not. Thus I would advise ensuring the width of the index and the to-be-gathered elements match. If you really must have size_t indices, then you'd probably want to DemoteTo them to u32 (unless SIZE_MAX == 0xFFFFFFFF) to match your int32 data.

@zafeerali943
Copy link
Author

zafeerali943 commented Mar 27, 2025

@jan-wassenberg I understood, but array size always would not be of size less than int32 right? such cases we can't able to use gather right for all dtypes with index vector of either int32/int64 type?

This what I am trying to achieve:
const size_t N = Lanes(d);
const size_t* read = indices;
const size_t iL0 = readL[0 * N];
alignas(alignof(T)) size_t scalar_indices[N];
StoreU(indices_vec, d, scalar_indices);
Vec< D > arr = GatherIndex(d, arr, scalar_indices);

@jan-wassenberg
Copy link
Member

That's correct. CPUs usually do not support gather for less than 32-bit elements. Gather is anyway slow and best avoided if possible, unless you're going to use the resulting vector inside other SIMD code.

@zafeerali943
Copy link
Author

zafeerali943 commented Mar 27, 2025

@jan-wassenberg I am currently trying to use basecase of vqsort to work with sort indices along with keys wihtout the use of KV structure, thats were I am trying to use gather and index vector here. Do you have any suggestions

@jan-wassenberg
Copy link
Member

hm, I'm not sure I understand. VQSort key+value only works with KV structures, unless you are defining u32 "keys" yourself which consist of the actual u16 key plus a u16 index. If you want to gather values, you can just use normal C++ code. This will be about as fast as vector code for that.

@zafeerali943
Copy link
Author

hm, I'm not sure I understand. VQSort key+value only works with KV structures, unless you are defining u32 "keys" yourself which consist of the actual u16 key plus a u16 index. If you want to gather values, you can just use normal C++ code. This will be about as fast as vector code for that.

@jan-wassenberg to be precise, I am trying to introduce two different pointers one is Key and another index pointer, trying modify basecase code to implement index swapping along with key swap.

@jan-wassenberg
Copy link
Member

Ah, got it. That will involve changing traits-inl.h

  HWY_INLINE Vec<D> First(D /* tag */, const Vec<D> a, const Vec<D> b) const {
    return Min(a, b);
  }

to add an output argument for the indices, and instead of returning the min directly, getting a mask via Lt(a, b), then using that to return IfThenElse(mask, a, b) (which is the same as Min), but then also doing the IfThenElse on two also to be added index arguments. Again, a sizeable undertaking.

@zafeerali943
Copy link
Author

@jan-wassenberg that's sounds good!, I believe for basecase the size of elements would be less than 256, so we can handle the indices in uint16, the only thing I am still getting stuck is that how we can handle appropriate indices to dtype of input key.

@jan-wassenberg
Copy link
Member

For the code above, indices would be the same size as the dtype. You can use MakeUnsigned<T> to obtain an unsigned type of the same size as T.

@zafeerali943
Copy link
Author

@jan-wassenberg I could able to understand the VQSort calls for all DTypes, but how KV is actually handled, since keys are used for comparison, how values are getting rearranged accordingly to the keys?

@zafeerali943
Copy link
Author

For the code above, indices would be the same size as the dtype. You can use MakeUnsigned<T> to obtain an unsigned type of the same size as T.

@jan-wassenberg Between I tried implementing the suggestions you gave; it got worked but while copying the sorted vectors back to indices array, it stores wrong results, not sure what goes wrong

@jan-wassenberg
Copy link
Member

KV is handled by including the value inside the key, and changing the key comparator to ignore the value part of the key.

@zafeerali943
Copy link
Author

@jan-wassenberg Kindly can you share the exact implementation details that performs the above case

@zafeerali943
Copy link
Author

@jan-wassenberg are you pointing to this piece of code :

struct OrderAscendingKV64 : public KeyValue64 {
using Order = SortAscending;
using OrderForSortingNetwork = OrderAscending;

HWY_INLINE bool Compare1(const LaneType* a, const LaneType* b) const {
return (*a >> 32) < (*b >> 32);
}

template
HWY_INLINE Mask Compare(D /* tag */, Vec a, Vec b) const {
return Lt(ShiftRight<32>(a), ShiftRight<32>(b));
}

// Not required to be stable (preserving the order of equivalent keys), so
// we can include the value in the comparison.
template
HWY_INLINE Vec First(D /* tag */, const Vec a, const Vec b) const {
return Min(a, b);
}

template
HWY_INLINE Vec Last(D /* tag */, const Vec a, const Vec b) const {
return Max(a, b);
}

template
HWY_INLINE Vec FirstOfLanes(D d, Vec v,
uint64_t* HWY_RESTRICT /* buf */) const {
return MinOfLanes(d, v);
}

template
HWY_INLINE Vec LastOfLanes(D d, Vec v,
uint64_t* HWY_RESTRICT /* buf */) const {
return MaxOfLanes(d, v);
}

// Same as for regular lanes.
template
HWY_INLINE Vec FirstValue(D d) const {
return Set(d, hwy::LowestValue<TFromD>());
}

template
HWY_INLINE Vec LastValue(D d) const {
return Set(d, hwy::HighestValue<TFromD>());
}

template
HWY_INLINE Vec PrevValue(D d, Vec v) const {
return Sub(v, Set(d, uint64_t{1} << 32));
}
};

@jan-wassenberg
Copy link
Member

Yes, that's right. Compare is shifting out the value bits.

@zafeerali943
Copy link
Author

@jan-wassenberg, does the current VQSort implementation in [vqsort-ini.h] support all Highway-supported intrinsics, including x86 (AVX-512, AVX2), Arm (NEON, SVE), and RISC-V (V extension)? Are there any other intrinsic sets supported by VQSort?

Thanks in advance!

@jan-wassenberg
Copy link
Member

Yes indeed, you can see the full list in the readme or detect_targets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants