
Increase search table size(s) #123

Closed
meme opened this issue Jan 27, 2020 · 12 comments

meme commented Jan 27, 2020

When running this tool on a large C codebase, I was hitting errors from values exceeding the range of a 16-bit short, so I modified _search.py to build the table with larger field widths:

diff --git a/documentation/_search.py b/documentation/_search.py
index cd8b643..efe2bc5 100644
--- a/documentation/_search.py
+++ b/documentation/_search.py
@@ -108,7 +108,7 @@ class ResultMap:
     #
     #  prefix  |  name  | \0 |  URL
     # id + len | suffix |    | suffix
-    # 16b + 8b |        | 8b |
+    # 32b + 8b |        | 8b |
     #
     # prefixed & suffixed item (flags & 0xb11 == 0b11):
     #
@@ -124,7 +124,7 @@ class ResultMap:
     #
     offset_struct = struct.Struct('<I')
     flags_struct = struct.Struct('<B')
-    prefix_struct = struct.Struct('<HB')
+    prefix_struct = struct.Struct('<IB')
     suffix_length_struct = struct.Struct('<B')
     alias_struct = struct.Struct('<H')
 
@@ -268,23 +268,24 @@ class ResultMap:
                 output += b'\0'
                 output += e.url.encode('utf-8')
 
-        assert len(output) == offset
+        # XXX
+        # assert len(output) == offset
         return output
 
 class Trie:
     #  root  |     |       header         | results | child 1 | child 1 | child 1 |
     # offset | ... | | result # | child # |   ...   |  char   | barrier | offset  | ...
-    #  32b   |     |0|    7b    |   8b    |  n*16b  |   8b    |    1b   |   23b   |
+    #  32b   |     |0|    7b    |   8b    |  n*32b  |   8b    |    1b   |   23b   |
     #
     # if result count > 127, it's instead:
     #
     #  root  |     |      header          | results | child 1 | child 1 | child 1 |
     # offset | ... | | result # | child # |   ...   |  char   | barrier | offset  | ...
-    #  32b   |     |1|   11b    |   4b    |  n*16b  |   8b    |    1b   |   23b   |
+    #  32b   |     |1|   11b    |   4b    |  n*32b  |   8b    |    1b   |   23b   |
 
     root_offset_struct = struct.Struct('<I')
     header_struct = struct.Struct('<BB')
-    result_struct = struct.Struct('<H')
+    result_struct = struct.Struct('<I')
     child_struct = struct.Struct('<I')
     child_char_struct = struct.Struct('<B')
 
@@ -421,8 +422,8 @@ def serialize_type_map(map: List[Tuple[CssClass, str]]) -> bytearray:
 # magic  | version | symbol | result |  type  |
 # header |         | count  |  map   |  map   |
 #        |         |        | offset | offset |
-#  24b   |   8b    |  16b   |  32b   |  32b   |
-search_data_header_struct = struct.Struct('<3sBHII')
+#  24b   |   8b    |  32b   |  32b   |  32b   |
+search_data_header_struct = struct.Struct('<3sBIII')
 
 def serialize_search_data(trie: Trie, map: ResultMap, type_map: List[Tuple[CssClass, str]], symbol_count, *, merge_subtrees=True, merge_prefixes=True) -> bytearray:
     serialized_trie = trie.serialize(merge_subtrees=merge_subtrees)

It seems that the corresponding changes to search.js are somewhat less trivial, since that code is a bit more spread out (hardcoded buffer seek offsets, etc.). Is this the correct way to proceed with this issue, and if so, what changes are required to keep the search in sync with this expanded data format?
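For context, a minimal sketch of the overflow the patch above works around, assuming the IDs are packed with Python's struct module the same way _search.py does:

import struct

# '<H' is a little-endian unsigned 16-bit field: anything above 65535 overflows.
result_id = 70000  # e.g. a symbol index in a large C codebase
try:
    struct.Struct('<H').pack(result_id)
except struct.error as error:
    print('16-bit field overflows:', error)

# '<I' is a 32-bit field, so the same ID packs fine (at twice the storage cost).
print(struct.Struct('<I').pack(result_id))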

mosra commented May 8, 2020

Hello,

apologies for the late reply, I was busy with other projects over the past months.

This is one of the hard problems, heh. I'm not exactly happy about unconditionally expanding the fields to 32 bits, as that would significantly increase the size of the search data for projects that don't need that many symbols (for example, with Magnum I'm at over 1.3 MB with 12k symbols, and even with that the initial load times are starting to become problematic). One option could be to support both 16- and 32-bit sizes in a single format, depending on how much is needed.

How many symbols do you have in your case, what's the size of the generated searchdata-v0.js (or .bin) file, and how big is it gzip-compressed? I fear that even if the search supported that, the download size would be too big to be practical and you'd need to start looking for different solutions anyway.
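To illustrate the "both 16- and 32-bit sizes in a single format" idea, a hypothetical sketch (pick_result_struct is not actual m.css API, just an illustration of choosing the field width at serialization time):

import struct

def pick_result_struct(symbol_count: int) -> struct.Struct:
    # Keep the compact 16-bit layout when everything fits, fall back to
    # 32 bits only for projects that actually need it.
    return struct.Struct('<H') if symbol_count <= 0xffff else struct.Struct('<I')

# A flag in the file header would then tell search.js which width to read.
print(pick_result_struct(12_000).format)  # '<H', e.g. a Magnum-sized project
print(pick_result_struct(70_000).format)  # '<I', needs the wider fields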

meme commented May 8, 2020

Can we use variable-length integer encoding, like https://developers.google.com/protocol-buffers/docs/encoding#varints?

I will investigate sizes; I have to fish that project back out of whatever hole I left it in. Still very interested in getting this working, though.
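For reference, a minimal sketch of the base-128 varint scheme linked above (protobuf-style), assuming it would be applied to the per-result IDs; none of this exists in m.css:

def encode_varint(value: int) -> bytes:
    # 7 payload bits per byte, high bit set on every byte except the last.
    out = bytearray()
    while True:
        byte = value & 0x7f
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes, pos: int = 0):
    result = shift = 0
    while True:
        byte = data[pos]
        result |= (byte & 0x7f) << shift
        pos += 1
        if not byte & 0x80:
            return result, pos
        shift += 7

print(encode_varint(100))    # b'd' -- small IDs stay at one byte
print(encode_varint(70733))  # b'\xcd\xa8\x04' -- large IDs grow to three
print(decode_varint(encode_varint(70733)))  # (70733, 3)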

mosra commented May 8, 2020

Making JavaScript code robust is hard, so I'm trying to have as little JS as possible :) Variable-length encoding is possible, but the client side would need to unpack it first, which could offset the savings from the faster download. I was personally thinking about pre-processing the data with, for example, a Burrows–Wheeler transform, but there's again the problem of having to decode it back on the client side.

I'd bet more on server-side compression, which is transparent to the client; gzip already does quite a good job (the 1.3 MB file gets compressed to about 700 kB on transfer) and more advanced compression schemes (brotli, zstd) could do even better.
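To check what transparent compression would buy for a given project, a quick measurement along these lines should do (the searchdata-v0.bin name is the one mentioned above; adjust it to whatever your build actually produces):

import gzip
from pathlib import Path

raw = Path('searchdata-v0.bin').read_bytes()
compressed = gzip.compress(raw, compresslevel=9)
print(f'raw:  {len(raw) / 1024:.1f} kB')
print(f'gzip: {len(compressed) / 1024:.1f} kB')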

Curious: is the project public?

meme commented May 8, 2020

I want to generate documentation for radare2. It is quite large, but the current documentation "solution" is sub-par.

meme commented May 8, 2020

Additionally: could you help correct (or rewrite) the patch to use 32 bits, if it's not a lot of up-front work? Then I could just give this a go and see whether it's even worth the time.

mosra commented May 12, 2020

Done in the search-32 branch, with all needed changes in 73564df. The 100% test coverage fortunately makes such a change rather easy :)

The resulting file size is about 10% larger; it's not as bad as I expected, but I'm also not entirely happy about it. I'll keep this in a branch until I get an idea of how to solve it differently.

meme commented May 13, 2020

Just got a chance to try it out, it's snappy and looks beautiful. Thank you for your work!

The search reports: 70733 symbols (9624.4 kB)

You can view it here temporarily (commit is HEAD): http://sdf.org/~keegan/radare2-89cfe05d2/index.html

mosra commented May 18, 2020

Nice :) The search data download took a few more seconds than would be acceptable for me. You're just above the 65k limit, so here are a couple of ideas that could trim down the search data size:

  • there's a ton of structs, and each struct member is one entry in the search. It's unlikely that anybody will ever search for a cb field (there are about 10 of them) instead of the name of the struct containing it, so excluding members of simple structs could be an option (tangentially, having m.css support the INLINE_SIMPLE_STRUCTS Doxygen option would be great, but I ran into quite severe issues with the XML output when trying that out last time -- see Doxygen: Support inlining simple structs in file #83 for reference)
  • something similar could maybe be done for enum values and the like, but I think that would actually hurt usability
  • the docs currently contain everything, including private APIs and C files, so cutting that down a bit could easily get you under 65k (ideally building the docs bottom-up for each relevant file, but for a project of this size that's work for decades, heh) -- see the config sketch after this list
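As a rough sketch for the last point, these are standard Doxyfile options for trimming what ends up in the docs (and thus in the search data); whether they fit radare2's goals is a separate question, and INLINE_SIMPLE_STRUCTS is the one noted above as currently problematic with m.css:

EXTRACT_PRIVATE       = NO        # drop private APIs from the docs
EXTRACT_STATIC        = NO        # drop file-static symbols
EXCLUDE_PATTERNS      = */test/*  # hypothetical pattern, skip whole directories
INLINE_SIMPLE_STRUCTS = YES       # would fold simple struct members, see #83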

meme commented May 18, 2020

I will note that there is no gzip compression enabled on that page. Aside from that, this is good advice that I will investigate... does this exist in the FAQ or Quick Start Guide? I imagine others would find it useful.

mosra commented May 18, 2020

The first point needs changes on my side (there's nothing that could filter those out currently), and for the other two -- I don't recommend enabling M_SHOW_UNDOCUMENTED and I think that's pointed out a few times there :)

meme commented May 18, 2020

Yeah, I understand about M_SHOW_UNDOCUMENTED, but this is an effort to improve documentation in the radare2 project, as well as provide a reference for developers wishing to integrate, so disabling it is unfortunately not an option right now.

I will continue to explore the first two.

mosra commented Jan 9, 2022

(Sorry for embarrassingly late replies, finally got a chance to get back to this project.)

This is finally fixed as of b0cf44e and 0411e18. I ended up making the type sizes variable, so small projects still keep small search data binaries while larger projects can expand the limits where needed. This raises the limits from the default 65k symbols and 16 MB file size up to 4G for both. It unfortunately isn't automagic, as the data packing defaults are picked to keep file sizes implicitly small and estimating the required sizes beforehand would be overly error-prone. Instead, if the limits are hit, the doc generation aborts with a message suggesting which configuration options to modify, for example:

OverflowError: Trie result ID too large to store in 16 bits, set SEARCH_RESULT_ID_BYTES = 3 in your conf.py.

Add the suggested options to conf.py (or to the Doxyfile-mcss with an M_ prefix) and restart the doc generation.
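A minimal conf.py example following that message (SEARCH_RESULT_ID_BYTES is the option named in the error above; other limits are raised via the analogous options suggested by their own messages):

# conf.py
SEARCH_RESULT_ID_BYTES = 3   # allow result IDs beyond the 16-bit default

# or, when driving m.css through Doxygen, in Doxyfile-mcss:
# M_SEARCH_RESULT_ID_BYTES = 3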

With this being implemented, I'm going to drop the search-32 branch, as it's not needed for anything anymore.
