Feature Request: Fine-tuned selection of including/excluding specific code block when --compress (i.e. Function Body code block) #561

Open
atjsh opened this issue May 17, 2025 · 3 comments
Labels
enhancement New feature or request needs discussion Issues needing discussion and a decision to be made before action can be taken

Comments

@atjsh

atjsh commented May 17, 2025

Topic: repomix --compress, LLM-based code compression

Motivation

Generate richer context

When I run --compress in Repomix, I’d like the output to capture not only symbol information but also the implementation details of functions.
My use-case is to bundle an entire source-code repository into a single file, feed it to an LLM, and let the model analyze the codebase holistically.
Because many insights depend on what happens inside a function, simply exporting the symbols wasn't enough for me—I need the function bodies in the compressed output as well.

Giving users control over which code snippets get included would provide much greater flexibility.

Apply custom pre-processing to code blocks identified via tree-sitter

With tree-sitter we can pinpoint distinct AST nodes—such as a function’s name and its body. If we could:

  • retain every function symbol untouched, and
  • pass each function body through an LLM for aggressive compression,

then we’d keep the static-analysis benefits of full symbol visibility and supply the language model with far richer context.

Ideally, each code block (e.g., function name, function body) would have its own configurable processing pipeline.
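A per-block pipeline like the one described above could be sketched as follows. This is only an illustration: the `BlockKind`, `CodeBlock`, `Processor`, and `runPipeline` names are hypothetical and not part of Repomix, and `stripBlankLines` is a trivial stand-in for a real LLM-based compressor. In practice the blocks would presumably come from tree-sitter captures rather than being written by hand.

```typescript
// Hypothetical block kinds; a real implementation would derive these
// from tree-sitter query captures.
type BlockKind = "signature" | "body";

interface CodeBlock {
  kind: BlockKind;
  text: string;
}

// A processor maps a block's text to its processed text.
type Processor = (text: string) => string;

// One ordered list of processors per block kind.
type PipelineConfig = Record<BlockKind, Processor[]>;

// Run every block through the processors configured for its kind.
function runPipeline(blocks: CodeBlock[], config: PipelineConfig): CodeBlock[] {
  return blocks.map((block) => ({
    kind: block.kind,
    text: config[block.kind].reduce((acc, step) => step(acc), block.text),
  }));
}

// Stand-in for an LLM-based compressor: trim lines and drop blank ones.
const stripBlankLines: Processor = (text) =>
  text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .join("\n");

const exampleConfig: PipelineConfig = {
  signature: [], // no processors: symbols stay untouched
  body: [stripBlankLines],
};

const exampleBlocks: CodeBlock[] = [
  { kind: "signature", text: "function addValues(left: number, right: number): number" },
  { kind: "body", text: "  console.log(left);\n\n  return left + right;" },
];

const processedBlocks = runPipeline(exampleBlocks, exampleConfig);
```

Keeping the processors as plain functions means a heavier compressor could later be swapped in for `body` without touching how signatures are handled.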

Related issues

More Notes

I am currently developing llmlingua-2-js, a JavaScript port of LLMLingua-2. Once it is finished, developers will be able to run LLM-based code compression seamlessly in the same Node.js environment that Repomix uses.

I plan to open a separate PR for llmlingua-2-js when it is ready.

@atjsh atjsh changed the title Add fine-tuned option selection feature for including/excluding specific code block (i.e. Function Body code block) Add fine-tuned selection of including/excluding specific code block (i.e. Function Body code block) May 17, 2025
@atjsh atjsh changed the title Add fine-tuned selection of including/excluding specific code block (i.e. Function Body code block) Fine-tuned selection of including/excluding specific code block when --compress (i.e. Function Body code block) May 17, 2025
@atjsh atjsh changed the title Fine-tuned selection of including/excluding specific code block when --compress (i.e. Function Body code block) Feature Request: Fine-tuned selection of including/excluding specific code block when --compress (i.e. Function Body code block) May 19, 2025
@yamadashy
Owner

Hi, @atjsh!
Thank you for the suggestion!

I may not have fully grasped the issue, but are you proposing to introduce stages to --compress and add one that uses llmlingua?

To make it more concrete—just as an example—you’d allow passing a compression level to --compress, use structure (default) for the current behavior, and add semantic to include llmlingua-compressed function bodies. Is that correct?

I’ve been planning to add variations to the compression levels, so this sounds great! As we discussed on Discord, using llmlingua for compression could raise some challenges, but let’s take it one step at a time.

@yamadashy yamadashy added enhancement New feature or request needs discussion Issues needing discussion and a decision to be made before action can be taken labels May 22, 2025
@atjsh
Author

atjsh commented May 26, 2025

Idea

To make it more concrete—just as an example—you’d allow passing a compression level to --compress, use structure (default) for the current behavior, and add semantic to include llmlingua-compressed function bodies. Is that correct?

Yes, that is my current idea.
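To make the level idea concrete, here is a minimal sketch of how a level value might be resolved. It assumes a hypothetical `--compress [level]` form; as far as this thread states, `--compress` currently takes no value, and the `structure` and `semantic` names are only the examples proposed above. `parseCompressLevel` is an illustrative helper, not an existing Repomix function.

```typescript
// Level names taken from the proposal in this thread; not an existing Repomix option.
const COMPRESS_LEVELS = ["structure", "semantic"] as const;
type CompressLevel = (typeof COMPRESS_LEVELS)[number];

// Resolve the raw CLI value into a level, or null when compression is off.
// A bare `--compress` (true) maps to "structure", i.e. the current behavior.
function parseCompressLevel(value: string | boolean | undefined): CompressLevel | null {
  if (value === undefined || value === false) {
    return null;
  }
  if (value === true) {
    return "structure";
  }
  const match = COMPRESS_LEVELS.find((level) => level === value);
  if (match === undefined) {
    throw new Error(
      `Unknown compress level "${value}". Expected one of: ${COMPRESS_LEVELS.join(", ")}`,
    );
  }
  return match;
}
```

Treating the bare flag as `structure` would keep existing invocations working unchanged if levels were added later.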


Let's say we "compress" this source code.

function addValues(left: number, right: number): number {
    console.log(left);
    console.log(left);
    console.log(left);
    console.log(left);
    console.log(left);
    console.log(left);
    console.log(right);
    console.log(right);
    console.log(right);
    console.log(right);
    console.log(right);
    console.log(right);

    return left + right;
}

I want to convey two things to the LLM:

  • the function's call signature: (left: number, right: number) -> number
  • the behavior: "log inputs, and return the sum of the inputs."

The structure should be kept as-is.

// structure - kept as-is
function addValues(left: number, right: number): number { }

The semantic part could be compressed. We could omit "unnecessary" code from the original source.

// semantic - some contents are removed
console.log(left)
console.log(left)
console.log(right)

return left + right

Combined result:

function addValues(left: number, right: number): number { 
console.log(left)
console.log(left)
console.log(right)

return left + right
}
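The split-compress-combine flow in the example above could look roughly like this. It is a sketch under assumptions: `collapseRepeatedLines` is a deliberately simple stand-in that drops consecutive duplicate statements, not LLMLingua-2, so its output differs slightly from the hand-written illustration (it keeps one `console.log(left)` rather than two). Both function names are hypothetical.

```typescript
// Stand-in body compressor (NOT LLMLingua-2): trims each line, drops blank
// lines, and collapses runs of identical consecutive lines into one.
function collapseRepeatedLines(body: string): string {
  const kept: string[] = [];
  for (const line of body.split("\n")) {
    const trimmed = line.trim();
    if (trimmed.length === 0) {
      continue;
    }
    if (kept[kept.length - 1] !== trimmed) {
      kept.push(trimmed);
    }
  }
  return kept.join("\n");
}

// Re-attach a compressed body to its untouched signature.
function combineSignatureAndBody(signature: string, compressedBody: string): string {
  const indented = compressedBody
    .split("\n")
    .map((line) => `    ${line}`)
    .join("\n");
  return `${signature} {\n${indented}\n}`;
}

const originalBody = [
  "    console.log(left);",
  "    console.log(left);",
  "    console.log(right);",
  "    console.log(right);",
  "",
  "    return left + right;",
].join("\n");

const combinedResult = combineSignatureAndBody(
  "function addValues(left: number, right: number): number",
  collapseRepeatedLines(originalBody),
);
```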

Methodology for source-code (text) compression

My current idea for compressing function bodies is LLMLingua-2. Compared to, for example, Llama 3, it is small enough to run locally. Also, when summarizing text, LLMLingua-2 does not "generate" new tokens that are not present in the original input, which avoids the hallucination problem. (source)
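The "no new tokens" property follows from the extractive approach: each input token is either kept or dropped, so the output is always a subsequence of the input. A minimal sketch of that idea is below. The `keepProbability` values are made up for illustration; in LLMLingua-2 they would come from a trained token classifier, and `extractiveCompress` is a hypothetical name, not the library's API.

```typescript
interface ScoredToken {
  token: string;
  // Probability that the token should be preserved (illustrative values only).
  keepProbability: number;
}

// Keep tokens whose score meets the threshold, preserving their original order.
function extractiveCompress(tokens: ScoredToken[], threshold: number): string[] {
  return tokens
    .filter((scored) => scored.keepProbability >= threshold)
    .map((scored) => scored.token);
}

// True when `candidate` can be obtained from `original` by deleting tokens only.
function isSubsequence(candidate: string[], original: string[]): boolean {
  let matched = 0;
  for (const token of original) {
    if (matched < candidate.length && candidate[matched] === token) {
      matched++;
    }
  }
  return matched === candidate.length;
}

const scoredTokens: ScoredToken[] = [
  { token: "console.log(left);", keepProbability: 0.9 },
  { token: "console.log(left);", keepProbability: 0.2 },
  { token: "console.log(right);", keepProbability: 0.8 },
  { token: "console.log(right);", keepProbability: 0.1 },
  { token: "return left + right;", keepProbability: 0.99 },
];

const keptTokens = extractiveCompress(scoredTokens, 0.5);
const onlyOriginalTokens = isSubsequence(
  keptTokens,
  scoredTokens.map((scored) => scored.token),
);
```

Because the compressor can only delete, a check like `isSubsequence` could serve as a cheap sanity test on any compressed function body.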

Should I split the issue?

It's getting quite big haha.
Maybe I should open smaller GH issues, each with a more specific implementation idea. For example:

  • including function body in --compress pipeline
  • applying the compression pipeline to compressed output
  • etc...

Or we "could" plan and implement the new compression pipeline right away. It is kinda risky tho - it might have to be treated as an experimental feature. What do you think?

@atjsh
Author

atjsh commented May 26, 2025

Compressing a whole source-code file at once (import statements, function signatures, function implementations, everything) could be another method too.
It could be much easier to implement.

2 participants