
Increase search table size(s) #123

Closed
meme opened this issue Jan 27, 2020 · 12 comments

meme commented Jan 27, 2020

When running this tool on a large C codebase, I was hitting errors from values exceeding the range of a 16-bit short, so I modified _search.py to build the table with larger field widths:

diff --git a/documentation/_search.py b/documentation/_search.py
index cd8b643..efe2bc5 100644
--- a/documentation/_search.py
+++ b/documentation/_search.py
@@ -108,7 +108,7 @@ class ResultMap:
     #
     #  prefix  |  name  | \0 |  URL
     # id + len | suffix |    | suffix
-    # 16b + 8b |        | 8b |
+    # 32b + 8b |        | 8b |
     #
     # prefixed & suffixed item (flags & 0xb11 == 0b11):
     #
@@ -124,7 +124,7 @@ class ResultMap:
     #
     offset_struct = struct.Struct('<I')
     flags_struct = struct.Struct('<B')
-    prefix_struct = struct.Struct('<HB')
+    prefix_struct = struct.Struct('<IB')
     suffix_length_struct = struct.Struct('<B')
     alias_struct = struct.Struct('<H')
 
@@ -268,23 +268,24 @@ class ResultMap:
                 output += b'\0'
                 output += e.url.encode('utf-8')
 
-        assert len(output) == offset
+        # XXX
+        # assert len(output) == offset
         return output
 
 class Trie:
     #  root  |     |       header         | results | child 1 | child 1 | child 1 |
     # offset | ... | | result # | child # |   ...   |  char   | barrier | offset  | ...
-    #  32b   |     |0|    7b    |   8b    |  n*16b  |   8b    |    1b   |   23b   |
+    #  32b   |     |0|    7b    |   8b    |  n*32b  |   8b    |    1b   |   23b   |
     #
     # if result count > 127, it's instead:
     #
     #  root  |     |      header          | results | child 1 | child 1 | child 1 |
     # offset | ... | | result # | child # |   ...   |  char   | barrier | offset  | ...
-    #  32b   |     |1|   11b    |   4b    |  n*16b  |   8b    |    1b   |   23b   |
+    #  32b   |     |1|   11b    |   4b    |  n*32b  |   8b    |    1b   |   23b   |
 
     root_offset_struct = struct.Struct('<I')
     header_struct = struct.Struct('<BB')
-    result_struct = struct.Struct('<H')
+    result_struct = struct.Struct('<I')
     child_struct = struct.Struct('<I')
     child_char_struct = struct.Struct('<B')
 
@@ -421,8 +422,8 @@ def serialize_type_map(map: List[Tuple[CssClass, str]]) -> bytearray:
 # magic  | version | symbol | result |  type  |
 # header |         | count  |  map   |  map   |
 #        |         |        | offset | offset |
-#  24b   |   8b    |  16b   |  32b   |  32b   |
-search_data_header_struct = struct.Struct('<3sBHII')
+#  24b   |   8b    |  32b   |  32b   |  32b   |
+search_data_header_struct = struct.Struct('<3sBIII')
 
 def serialize_search_data(trie: Trie, map: ResultMap, type_map: List[Tuple[CssClass, str]], symbol_count, *, merge_subtrees=True, merge_prefixes=True) -> bytearray:
     serialized_trie = trie.serialize(merge_subtrees=merge_subtrees)

It seems that the corresponding changes to search.js are somewhat less trivial, since that code is a bit more spread out (hardcoded buffer seek offsets, etc.). Is this the correct way to proceed with this issue, and if so, what changes are required to keep the search in sync with this expanded data format?
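For context, a minimal sketch of the overflow the patch above works around, assuming the IDs are packed with Python's struct module the same way _search.py does:

import struct

# '<H' is a little-endian unsigned 16-bit field: anything above 65535 overflows.
result_id = 70000  # e.g. a symbol index in a large C codebase
try:
    struct.Struct('<H').pack(result_id)
except struct.error as error:
    print('16-bit field overflows:', error)

# '<I' is a 32-bit field, so the same ID packs fine (at twice the storage cost).
print(struct.Struct('<I').pack(result_id))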

mosra commented May 8, 2020

Hello,

apologies for the late reply, I was busy with other projects over the past months.

This is one of the hard problems, heh. I'm not exactly happy about unconditionally expanding the fields to 32 bits, as that would significantly increase the size of the search data for projects that don't need that many symbols (for example, with Magnum I'm at over 1.3 MB with 12k symbols, and even with that the initial load times are starting to become problematic). One option could be to support both 16- and 32-bit sizes in a single format, depending on how much is needed.

How many symbols do you have in your case, what's the size of the generated searchdata-v0.js (or .bin) file, and how big is it gzip-compressed? I fear that even if the search supported that, the download size would be too big to be practical and you'd need to start looking for different solutions anyway.
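To illustrate the "both 16- and 32-bit sizes in a single format" idea, a hypothetical sketch (pick_result_struct is not actual m.css API, just an illustration of choosing the field width at serialization time):

import struct

def pick_result_struct(symbol_count: int) -> struct.Struct:
    # Keep the compact 16-bit layout when everything fits, fall back to
    # 32 bits only for projects that actually need it.
    return struct.Struct('<H') if symbol_count <= 0xffff else struct.Struct('<I')

# A flag in the file header would then tell search.js which width to read.
print(pick_result_struct(12_000).format)  # '<H', e.g. a Magnum-sized project
print(pick_result_struct(70_000).format)  # '<I', needs the wider fields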

meme commented May 8, 2020

Can we use variable-length integer encoding, like https://developers.google.com/protocol-buffers/docs/encoding#varints?

I will investigate sizes; I have to fish that project back out of whatever hole I left it in. Still very interested in getting this working, though.
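For reference, a minimal sketch of the base-128 varint scheme linked above (protobuf-style), assuming it would be applied to the per-result IDs; none of this exists in m.css:

def encode_varint(value: int) -> bytes:
    # 7 payload bits per byte, high bit set on every byte except the last.
    out = bytearray()
    while True:
        byte = value & 0x7f
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes, pos: int = 0):
    result = shift = 0
    while True:
        byte = data[pos]
        result |= (byte & 0x7f) << shift
        pos += 1
        if not byte & 0x80:
            return result, pos
        shift += 7

print(encode_varint(100))    # b'd' -- small IDs stay at one byte
print(encode_varint(70733))  # b'\xcd\xa8\x04' -- large IDs grow to three
print(decode_varint(encode_varint(70733)))  # (70733, 3)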

mosra commented May 8, 2020

Making JavaScript code robust is hard, so I'm trying to have as little JS as possible :) Variable-length encoding is possible, but the client side would need to unpack it first, which could offset the savings from the faster download. I was personally thinking about pre-processing the data with, for example, a Burrows–Wheeler transform, but there's again the problem of having to decode it back on the client side.

I'd bet more on server-side compression, which is transparent to the client; gzip already does quite a good job (the 1.3 MB file gets compressed to about 700 kB on transfer) and more advanced compression schemes (brotli, zstd) could do even better.
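To check what transparent compression would buy for a given project, a quick measurement along these lines should do (the searchdata-v0.bin name is the one mentioned above; adjust it to whatever your build actually produces):

import gzip
from pathlib import Path

raw = Path('searchdata-v0.bin').read_bytes()
compressed = gzip.compress(raw, compresslevel=9)
print(f'raw:  {len(raw) / 1024:.1f} kB')
print(f'gzip: {len(compressed) / 1024:.1f} kB')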

Curious: is the project public?

meme commented May 8, 2020

I want to generate documentation for radare2. It is quite large, but the current documentation "solution" is sub-par.

meme commented May 8, 2020

Additionally: could you help correct (or rewrite) the patch to use 32 bits, if it's not a lot of up-front work? Then I could just give this a go and see whether it's even worth the time.

mosra commented May 12, 2020

Done in the search-32 branch, with all needed changes in 73564df. The 100% test coverage fortunately makes such a change rather easy :)

The resulting file size is about 10% larger; it's not as bad as I expected, but I'm also not entirely happy about it. I'll keep this in a branch until I get an idea of how to solve it differently.

meme commented May 13, 2020

Just got a chance to try it out, it's snappy and looks beautiful. Thank you for your work!

The search reports: 70733 symbols (9624.4 kB)

You can view it here temporarily (commit is HEAD): http://sdf.org/~keegan/radare2-89cfe05d2/index.html

mosra commented May 18, 2020

Nice :) The search data download took a few more seconds than would be acceptable for me. You're just above the 65k limit, so here are a couple of ideas that could trim down the search data size:

  • there's a ton of structs, and each struct member is one entry in the search. It's unlikely that anybody will ever search for a cb field (there are about 10 of them) instead of the name of the struct containing it, so excluding members of simple structs could be an option (tangentially, having m.css support the INLINE_SIMPLE_STRUCTS Doxygen option would be great, but I ran into quite severe issues with the XML output when trying that out last time -- see Doxygen: Support inlining simple structs in file #83 for reference)
  • something similar could maybe be done for enum values and the like, but I think that would actually hurt usability
  • the docs currently contain everything, including private APIs and C files, so cutting that down a bit could easily get you under 65k (ideally building the docs bottom-up for each relevant file, but for a project of this size that's work for decades, heh) -- see the config sketch after this list
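As a rough sketch for the last point, these are standard Doxyfile options for trimming what ends up in the docs (and thus in the search data); whether they fit radare2's goals is a separate question, and INLINE_SIMPLE_STRUCTS is the one noted above as currently problematic with m.css:

EXTRACT_PRIVATE       = NO        # drop private APIs from the docs
EXTRACT_STATIC        = NO        # drop file-static symbols
EXCLUDE_PATTERNS      = */test/*  # hypothetical pattern, skip whole directories
INLINE_SIMPLE_STRUCTS = YES       # would fold simple struct members, see #83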

meme commented May 18, 2020

I will note that there is no gzip compression enabled on that page. Aside from that, this is good advice that I will investigate... does this exist in the FAQ or Quick Start Guide? I imagine others would find it useful.

mosra commented May 18, 2020

The first point needs changes on my side (there's nothing that could filter those out currently), and for the other two -- I don't recommend enabling M_SHOW_UNDOCUMENTED and I think that's pointed out a few times there :)

meme commented May 18, 2020

Yeah, I understand about M_SHOW_UNDOCUMENTED, but this is an effort to improve documentation in the radare2 project, as well as provide a reference for developers wishing to integrate, so disabling it is unfortunately not an option right now.

I will continue to explore the first two.

mosra commented Jan 9, 2022

(Sorry for embarrassingly late replies, finally got a chance to get back to this project.)

This is finally fixed as of b0cf44e and 0411e18. I ended up making the type sizes variable, so small projects still keep small search data binaries while larger projects can expand the limits where needed. This raises the limits from the default 65k symbols and 16 MB file size up to 4G for both. It unfortunately isn't automagic, as the data packing defaults are picked to keep file sizes implicitly small and estimating the required sizes beforehand would be overly error-prone. Instead, if the limits are hit, the doc generation aborts with a message suggesting which configuration options to modify, for example:

OverflowError: Trie result ID too large to store in 16 bits, set SEARCH_RESULT_ID_BYTES = 3 in your conf.py.

Add the suggested options to conf.py (or to the Doxyfile-mcss with an M_ prefix) and restart the doc generation.
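A minimal conf.py example following that message (SEARCH_RESULT_ID_BYTES is the option named in the error above; other limits are raised via the analogous options suggested by their own messages):

# conf.py
SEARCH_RESULT_ID_BYTES = 3   # allow result IDs beyond the 16-bit default

# or, when driving m.css through Doxygen, in Doxyfile-mcss:
# M_SEARCH_RESULT_ID_BYTES = 3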

With this being implemented, I'm going to drop the search-32 branch, as it's not needed for anything anymore.
