Skip to content

GH-46683: Add utf8_zfill kernel for sign-aware zero padding #46815

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

iabhi4
Copy link
Contributor

@iabhi4 iabhi4 commented Jun 15, 2025

What does this PR do?

Adds a new compute kernel utf8_zfill to Arrow Compute module. This kernel zero-pads UTF-8 strings while preserving leading signs.

Why is it needed?

Python's str.zfill() behavior is useful and expected in many data cleaning scenarios. Arrow lacked a direct equivalent.

Implementation details

  • Adds Utf8ZFillTransform
  • Registers the function in the compute kernel registry
  • Adds support to Python bindings and tests
  • Includes sign-aware padding logic

Are these changes tested?

  • C++ unit tests: TestStringKernels, Utf8ZFill
  • Python tests: test_utf8_zfill()
    Ran both tests

@iabhi4
Copy link
Contributor Author

iabhi4 commented Jun 15, 2025

@pitrou Just wanted to give you a heads-up that the implementation and tests are ready for review.
Please let me know if I’ve missed anything or if there’s a changelog entry or any documentation I should update as part of this PR.

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @iabhi4 !

The CI failures look connected. For example:

/arrow/cpp/src/arrow/compute/kernels/scalar_string_internal.h:179:  Check failed: _s.ok() Operation failed: registry->AddFunction(std::move(func))
Bad status: Invalid: In function 'utf8_zfill': description line length exceeds 78 characters

We do not require changelog entry per PR - it is done automatically at the release. As for the documentation, on the Python side, the User Guide is very minimal at this point and so the work would be out of scope of this PR (will keep in mind to maybe add a new issue under #46601).

That being said, I would love if we would have a short example here: https://arrow.apache.org/docs/dev/python/generated/pyarrow.compute.PadOptions.html, even if only for the new utf8_zfill kernel 😊

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 16, 2025
@iabhi4
Copy link
Contributor Author

iabhi4 commented Jun 17, 2025

Thanks for the review @AlenkaF, took care of the CI errors and added an Example in the PadOptions docs as suggested

@iabhi4 iabhi4 requested a review from AlenkaF June 17, 2025 22:53
@AlenkaF
Copy link
Member

AlenkaF commented Jun 18, 2025

@github-actions crossbow submit preview-docs

Copy link

Revision: 7d33c20

Submitted crossbow builds: ursacomputing/crossbow @ actions-08f3ce64cb

Task Status
preview-docs GitHub Actions

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates — the PR looks good!
I just have one minor comment/suggestion, which is optional.

cc @rok / @pitrou mind having a look at the C++ part?

@iabhi4 iabhi4 requested review from AlenkaF and pitrou June 18, 2025 21:57
@AlenkaF
Copy link
Member

AlenkaF commented Jun 19, 2025

I think a suggestion of mine might have been lost during the review stage. What I meant to propose was setting padding="0" as the default, while still allowing users to override it with any other character. That would align better with the function's name (zero fill) and also provide the desired behaviour without requiring users to explicitly specify the padding each time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants