πŸ’¬ RepoQA

🚩The First Benchmark for Long-Context Code Understanding.🚩

πŸ”Š The goal of RepoQA is to create a series of long-context code understanding tasks that challenge chat/instruction models for code:

  • Multi-Lingual: RepoQA covers 50 high-quality repositories across 5 programming languages.
  • Application-Driven: While "Needle in the Code" by CodeQwen uses a synthetic task to probe the weak spots of an LLM's long context, RepoQA focuses on tasks that reflect real-world use.
  • πŸ” Searching Needle Function (πŸ”—): Search a function given its description.
  • 🚧 RepoQA is still under development... More types of QA tasks are coming soon... Stay tuned!
          
# Using RepoQA is super easy
pip install "repoqa[vllm]"
# RepoQA supports 5 backends
repoqa.search_needle_function --backend openai    --model "gpt-4-turbo"
repoqa.search_needle_function --backend anthropic --model "claude-3-haiku-20240307"
repoqa.search_needle_function --backend vllm      --model "Qwen/CodeQwen1.5-7B-Chat"
repoqa.search_needle_function --backend hf        --model "Qwen/CodeQwen1.5-7B-Chat"
repoqa.search_needle_function --backend google    --model "gemini-1.5-pro-latest"
          
          

πŸ”Ž Searching Needle Function (SNF)

Overview: This task asks the model to retrieve 10 needle functions from each of 5 languages x 10 repositories (500 sub-tasks/tests in total). Each time, the model is given a long chunk of source code (ordered by import dependencies) and a precise function description, and we ask the model to find the function in the context that corresponds to the description. More details can be found at πŸ”—How It Works.

πŸ† Benchmark @ 16K Code Context

πŸ› οΈ Config: The code in the prompt is fixed to 16K tokens (by CodeLlama tokenizer). Yet, the required context is a bit larger than 16K so we extend 8K and 16K models using either Dynamic RoPE Scaling or just no scaling -- whichever is better. For example, RoPE scaling makes Llama 3 models substaintially better and CodeLlama-13B slight better (Credit to @abacaj for the finding!).

πŸ“ Note: SNF is an elementary test focusing on testing LLMs' capabilities on long-context code understanding and retrieval. It does not lead to simple conclusions like "model X is better than Model Y (on everything)". It's a start-point task, and we will include more challenging tasks in the future. :D

The leaderboard uses an adjustable match-similarity threshold (default thresh = 0.8; a larger threshold means closer to exact match).

How It Works

SNF includes 500 sub-tasks from 5 languages x 10 repositories x 10 needles. The prompt and expected output are demonstrated in the following figure:

The evaluator passes a test if the model-generated function (i) is more similar to the ground-truth needle than to any other function in the context, and (ii) that similarity is above a certain threshold (default 0.8, user-configurable). By default, we define the similarity using the BLEU score (smoothing method 4).
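A minimal sketch of this check, assuming NLTK's sentence-level BLEU with smoothing method 4 as the similarity function; the function and variable names below are illustrative, not RepoQA's internal API.

# Minimal sketch of the SNF pass/fail check using NLTK's BLEU
# (smoothing method 4) as the similarity function; names are illustrative.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method4

def similarity(candidate: str, reference: str) -> float:
    # Token-level BLEU between the model's retrieved function and a reference.
    return sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=smooth)

def passes(retrieved: str, needle: str, other_functions: list[str],
           threshold: float = 0.8) -> bool:
    # (i) the needle must be the retrieved function's best match, and
    # (ii) that similarity must reach the threshold (default 0.8).
    to_needle = similarity(retrieved, needle)
    return (to_needle >= threshold and
            all(to_needle > similarity(retrieved, other)
                for other in other_functions))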

The curation of the dataset involves four steps: (i) select permissive repositories based on quality metrics; (ii) collect the source code and analyze file dependencies; (iii) use tree-sitter to parse all functions and select a subset of them as needle functions; (iv) prompt GPT-4 Turbo to generate a function description for each needle. Detailed information and scripts for dataset curation can be found in our GitHub repo.
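Step (iii) can be illustrated with a simplified, Python-only sketch. RepoQA itself uses tree-sitter so the same pass works across all five languages; the standard-library ast module and the min_lines filter below are used here only to keep the illustration short.

# Simplified, Python-only illustration of step (iii): enumerate candidate
# needle functions in a source file. RepoQA uses tree-sitter to do this
# across all five languages; ast and min_lines are illustrative shortcuts.
import ast

def candidate_needles(source: str, min_lines: int = 5) -> list[tuple[str, str]]:
    """Return (name, code) pairs for functions with at least `min_lines` lines."""
    tree = ast.parse(source)
    needles = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            code = ast.get_source_segment(source, node)
            if code and code.count("\n") + 1 >= min_lines:
                needles.append((node.name, code))
    return needles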

πŸ™‹πŸ»β€β™€οΈ FAQ

Just yet another needle test?

No. Here are some notes:
  • SNF != RepoQA, SNF ∈ RepoQA: Yes, SNF is a variant of the needle test, but SNF != RepoQA. SNF is a starting point and an elementary test: if a model can't pass SNF, don't expect it to pass more challenging tasks. We will build more challenging tasks in the future.
  • Unlike vanilla needle tests, which use a single fully synthetic retrieval test, SNF is a multi-lingual, application-driven, and comprehensive test that requires LLMs to understand a natural-language description before retrieval, which aligns with real-world advanced code search.

Non-determinism

In theory, since we use greedy decoding, the results should be deterministic. In practice, they may vary slightly: (i) OpenAI/Anthropic models do not always appear to be deterministic (thanks to @scottinallcaps); and (ii) for local inference, library versions and tensor-parallelism configuration may also slightly affect reproducibility.
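For local (Hugging Face) inference, the intended greedy-decoding setup looks roughly like the sketch below; this is an illustration of the decoding configuration, not RepoQA's exact inference code, and the prompt string is a placeholder.

# Illustration of greedy decoding for local inference: with do_sample=False
# the output should be deterministic up to library/hardware differences.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/CodeQwen1.5-7B-Chat")
model = AutoModelForCausalLM.from_pretrained("Qwen/CodeQwen1.5-7B-Chat")

inputs = tokenizer("Find the function described below:", return_tensors="pt")
output = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))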

Known limitations

  • The current descriptions are intentionally verbose to avoid one description mapping to multiple functions. In the real world, however, developers may naturally use shorter descriptions for code search. (Thanks @chrisgorgo for the suggestion!)