Feature Request: Fine-tuned selection of including/excluding specific code block when `--compress` (i.e. Function Body code block) #561

atjsh · 2025-05-17T07:42:49Z

Topic: repomix --compress, LLM-based code compression

Motivation

Generate richer context

When I run --compress in Repomix, I’d like the output to capture not only symbol information but also the implementation details of functions.
My use-case is to bundle an entire source-code repository into a single file, feed it to an LLM, and let the model analyze the codebase holistically.
Because many insights depend on what happens inside a function, simply exporting the symbols wasn't enough for me—I need the function bodies in the compressed output as well.

Giving users control over which code snippets get included would provide much greater flexibility.

Apply custom pre-processing to code blocks identified via tree-sitter

With tree-sitter we can pinpoint distinct AST nodes—such as a function’s name and its body. If we could:

retain every function symbol untouched, and
pass each function body through an LLM for aggressive compression,

then we’d keep the static-analysis benefits of full symbol visibility and supply the language model with far richer context.

Ideally, each code block (e.g., function name, function body) would have its own configurable processing pipeline.
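
A minimal sketch of what "one pipeline per code block" could look like. Everything here is hypothetical: the `BlockKind` names, the `PipelineConfig` shape, and the `collapseWhitespace` stand-in are illustrations, not existing Repomix APIs. Extracting the blocks with tree-sitter is assumed to have happened already.

```typescript
// Hypothetical block kinds; Repomix does not necessarily expose these names.
type BlockKind = "signature" | "body";

interface CodeBlock {
  kind: BlockKind;
  text: string;
}

// A processor transforms the text of one block.
type Processor = (text: string) => string;

// One ordered list of processors per block kind.
type PipelineConfig = Record<BlockKind, Processor[]>;

function processBlocks(blocks: CodeBlock[], config: PipelineConfig): CodeBlock[] {
  return blocks.map((block) => ({
    kind: block.kind,
    // Run the block text through each configured step in order.
    text: config[block.kind].reduce((acc, step) => step(acc), block.text),
  }));
}

// Stand-in for an LLM-based compressor: it only collapses whitespace.
const collapseWhitespace: Processor = (text) => text.replace(/\s+/g, " ").trim();

const exampleConfig: PipelineConfig = {
  signature: [], // keep function symbols untouched
  body: [collapseWhitespace], // compress function bodies
};

const processed = processBlocks(
  [
    { kind: "signature", text: "function addValues(left: number, right: number): number" },
    { kind: "body", text: "  return left   +   right;  " },
  ],
  exampleConfig,
);
```

With an empty pipeline the signature passes through unchanged, while the body goes through whatever steps the user configured.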

Related issues

"Various compression levels to reduce token count" (#36) and "Add --summary CLI option for LLM-based code summarization" (#511) discuss LLM-based summarization.

More Notes

I am currently developing llmlingua-2-js, a JavaScript port of LLMLingua-2. Once it is finished, developers will be able to run LLM-based code compression seamlessly in the same Node.js environment that Repomix uses.

I plan to open a separate PR for llmlingua-2-js when it is ready.


yamadashy · 2025-05-22T14:27:55Z

Hi, @atjsh !
Thank you for the suggestion!

I may not have fully grasped the issue, but are you proposing to introduce stages to --compress and add one that uses llmlingua?

To make it more concrete—just as an example—you’d allow passing a compression level to --compress, use structure (default) for the current behavior, and add semantic to include llmlingua-compressed function bodies. Is that correct?

I’ve been planning to add variations to the compression levels, so this sounds great! As we discussed on Discord, using llmlingua for compression could raise some challenges, but let’s take it one step at a time.

atjsh · 2025-05-26T05:28:31Z

Idea

> To make it more concrete—just as an example—you’d allow passing a compression level to --compress, use structure (default) for the current behavior, and add semantic to include llmlingua-compressed function bodies. Is that correct?

Yes, that is my current idea.

Let's say we "compress" this source code.

function addValues(left: number, right: number): number {
    console.log(left);
    console.log(left);
    console.log(left);
    console.log(left);
    console.log(left);
    console.log(left);
    console.log(right);
    console.log(right);
    console.log(right);
    console.log(right);
    console.log(right);
    console.log(right);

    return left + right;
}

I want to tell the LLM:

the function's call signature: (left: number, right: number) -> number
the behavior: "log inputs, and return the sum of the inputs."

The structure should be kept as-is.

// structure - kept as-is
function addValues(left: number, right: number): number { }

The semantic part could be compressed. We could omit "some unnecessary" code from the original source.

// semantic - some contents are removed
console.log(left)
console.log(left)
console.log(right)

return left + right

Combined result:

function addValues(left: number, right: number): number {
    console.log(left)
    console.log(left)
    console.log(right)

    return left + right
}
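
The split-compress-recombine flow above could be sketched roughly as follows. This is only an illustration under stated assumptions: `splitFunction` uses a naive brace search instead of tree-sitter node ranges, and `dropConsecutiveDuplicates` is a hypothetical stand-in for the real compressor (it is not LLMLingua-2 and keeps fewer repeated lines than the hand-written example above).

```typescript
interface SplitFunction {
  signature: string;
  body: string;
}

// Naive split at the first "{" and last "}". A real implementation would
// use tree-sitter node ranges; this breaks on object-typed parameters.
function splitFunction(source: string): SplitFunction {
  const open = source.indexOf("{");
  const close = source.lastIndexOf("}");
  if (open === -1 || close < open) {
    throw new Error("no function body found");
  }
  return {
    signature: source.slice(0, open).trim(),
    body: source.slice(open + 1, close),
  };
}

// Hypothetical stand-in compressor: drops blank lines and collapses
// runs of identical consecutive lines into a single line.
function dropConsecutiveDuplicates(body: string): string {
  const kept: string[] = [];
  for (const line of body.split("\n")) {
    const trimmed = line.trim();
    if (trimmed.length === 0) continue;
    if (kept[kept.length - 1] !== trimmed) kept.push(trimmed);
  }
  return kept.join("\n");
}

// Put the untouched signature back around the compressed body.
function recombine(signature: string, compressedBody: string): string {
  const lines = compressedBody.split("\n").map((line) => "    " + line);
  return `${signature} {\n${lines.join("\n")}\n}`;
}

const source = `function addValues(left: number, right: number): number {
    console.log(left);
    console.log(left);
    console.log(right);
    console.log(right);

    return left + right;
}`;

const parts = splitFunction(source);
const combined = recombine(parts.signature, dropConsecutiveDuplicates(parts.body));
```

The signature never passes through the compressor, so the structure stays exactly as written.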

Methodology for source-code (text) compression

My current idea for compressing function bodies is LLMLingua-2. Compared to, for example, LLaMA 3, it's small enough to run locally. Also, when summarizing text, LLMLingua-2 does not "generate" new tokens that are not present in the original input, so it avoids the hallucination problem. (source)
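
To illustrate the "no new tokens" property, here is a toy extractive compressor. It is not LLMLingua-2: the `toyScorer` is a made-up heuristic standing in for a trained keep/drop token classifier, and all names are hypothetical. The point is only that a compressor which filters the input can never emit a token that was not in the input.

```typescript
// Hypothetical scorer signature; a real system would use a trained model.
type TokenScorer = (token: string, index: number, tokens: string[]) => number;

// Extractive compression: keep or drop each original token, never rewrite.
function extractiveCompress(
  tokens: string[],
  score: TokenScorer,
  threshold: number,
): string[] {
  return tokens.filter((token, index) => score(token, index, tokens) >= threshold);
}

// True when `candidate` can be obtained from `original` by deleting tokens only.
function isSubsequence(candidate: string[], original: string[]): boolean {
  let matched = 0;
  for (const token of original) {
    if (matched < candidate.length && candidate[matched] === token) matched++;
  }
  return matched === candidate.length;
}

// Toy heuristic: a token that repeats its predecessor scores 0, others score 1.
const toyScorer: TokenScorer = (token, index, tokens) =>
  index > 0 && tokens[index - 1] === token ? 0 : 1;

const originalTokens = ["log(left)", "log(left)", "log(right)", "return", "left+right"];
const compressedTokens = extractiveCompress(originalTokens, toyScorer, 1);
const faithful = isSubsequence(compressedTokens, originalTokens);
```

A generative summarizer offers no such guarantee, which is why its output would fail the `isSubsequence` check whenever it paraphrases.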

Should I split the issue?

It's getting quite big haha.
Maybe I should make smaller GH issues with more specific implementation ideas. For example:

including function body in --compress pipeline
applying the compression pipeline to compressed output
etc...

Or we "could" plan and implement the new compression pipeline right away. It's kinda risky tho, so it might have to be treated as an experimental feature. What do you think?

atjsh · 2025-05-26T05:39:38Z

Compressing a whole source-code file at once (import statements, function signatures, function implementations, etc., everything) could be another method too.
It could be way easier to implement.

atjsh changed the title ~~Add fine-tuned option selection feature for including/excluding specific code block (i.e. Function Body code block)~~ Add fine-tuned selection of including/excluding specific code block (i.e. Function Body code block) May 17, 2025

atjsh changed the title ~~Add fine-tuned selection of including/excluding specific code block (i.e. Function Body code block)~~ Fine-tuned selection of including/excluding specific code block when --compress (i.e. Function Body code block) May 17, 2025

atjsh changed the title ~~Fine-tuned selection of including/excluding specific code block when --compress (i.e. Function Body code block)~~ Feature Request: Fine-tuned selection of including/excluding specific code block when --compress (i.e. Function Body code block) May 19, 2025

yamadashy added enhancement New feature or request needs discussion Issues needing discussion and a decision to be made before action can be taken labels May 22, 2025
