Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add hashmap to get_const_offset #4010

Closed

Conversation

sjamesr
Copy link
Contributor

@sjamesr sjamesr commented Jan 7, 2025

No description provided.

@sjamesr
Copy link
Contributor Author

sjamesr commented Jan 7, 2025

This change has worse performance than HEAD for small modules, but it is a substantial improvement for modules with large numbers of constants (I saw a 18% speedup in loading a module with 10,000 consts).

Ideas:

  • Right now there are at least 3 allocations per insertion. Right now the hashmap keys and const buffer memory cannot be shared, because the constant buffer will be realloc'd when it grows, potentially invalidating the pointers in the hashmap. Perhaps the constant buffer could be a linked list of chunks of N constants. This way, more constant buffer space could be added without reallocating the existing buffer.

  • hash map uses chaining to resolve collisions. Moving to open addressing would reduce the number of allocations.

  • maybe the loader should be smarter about using a hash map only when the number of constants grows beyond a certain threshold

  • maybe toolchains should emit table references instead of {i32,i64,f32,f64}.const instructions, thus moving the de-dup/compression step to the assembler/linker and out of the loader

@wenyongh
Copy link
Contributor

wenyongh commented Jan 9, 2025

This change has worse performance than HEAD for small modules, but it is a substantial improvement for modules with large numbers of constants (I saw a 18% speedup in loading a module with 10,000 consts).

@sjamesr thanks for your effort to improve the performance, I haven't read the detailed code to understand the optimization, but it seems that a 18% speedup is not as expected. I tried to improve the perpformance with another PR (see #4012), and tested the standalone case in tests/standalone/test-printf which has more than 20000 consts, experiments show that the PR can greatly improve the performance for this case: the load time is reduced from ~550ms to ~5ms in my machine. You can also test it with the below command:

cd tests/standalone/test-printf
time iwasm --heap-size=102400 --stack-size=8096000 test_printf_builtin.wasm
# or modify the source code to print the execution time of wasm_loader_load manually

I also tested you patch, but it didn't improve the loading time for the test-print case, instead, it is even worse. So I think maybe my PR is better? And could you test my PR for your case? Thanks in advance.

@sjamesr
Copy link
Contributor Author

sjamesr commented Jan 9, 2025

This does indeed show a substantial improvement in our benchmark, a 15% speedup in loading a module with ~1000 consts. After this, the largest single cost in our benchmark is read_leb.

Thank you for doing this!

My change adds a bh_hash_map of Const -> offset, returning the existing offset if it's there or adding a new entry to the map if it's not. Moving bh_hash_map to open addressing and away from chaining might have a substantial impact by reducing the number of allocations, and this map is used elsewhere in the code. But for now I think I'll drop my change and see if we can improve read_leb.

@wenyongh
Copy link
Contributor

wenyongh commented Jan 10, 2025

OK, welcome, I refactored some code and applied similar changes to wasm mini loader. If you are to drop this PR, then I will let others help review my PR.

For reading leb in wasm loader of fast interpreter, one optimization I can think is that we don't need to check the leb format again in the second scan since the checks have been done in the first scan. I also submitted another PR (see #4017) for it, but it doesn't improve performance a lot for the case test-printf. Maybe you can have a try for your case.

Another optimization is to refine the bh_read_leb function for the first scan, but I have no idea yet, maybe you can have a try also.

@sjamesr
Copy link
Contributor Author

sjamesr commented Jan 10, 2025

#4012

@sjamesr sjamesr closed this Jan 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants