Cracking the Coding Evaluation

Lucy Gao
November 13, 2023
7-minute read

Tabby offers an open-source alternative to GitHub Copilot with easy setup and self-hosting options. We embrace an open ecosystem to support major open source coding LLMs (e.g. StarCoder, CodeLlama, WizardCoder, etc.), and enable easy integration of proprietary models. In addition, Tabby performs retrieval-augmented code completion to suggest code from your private codebase. We firmly believe in the continuous advancement of open source coding LLMs, yet we need quantitative measurements to guide the direction of product improvement and to help developers decide on their model of choice.

Evaluating coding LLMs has also been a hot topic in academia. Many different metrics targeting different coding tasks have been proposed over the past year. At Tabby, we prioritize metrics that best resemble real-world development workflows, and of course, the metrics should be constructed from unbiased data sources. In this blog post, we discuss what we want from code completion benchmarks and review the latest academic progress in this area.

Existing Paradigms

Existing coding LLM benchmarks mostly focus on the Pass@k metric: generating k code samples per problem and measuring how often at least one of them passes the given unit tests. OpenAI initially introduced this metric in Evaluating Large Language Models Trained on Code in July 2021, along with the release of the HumanEval benchmark dataset.
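
To make the metric concrete, here is a minimal sketch of the unbiased Pass@k estimator described in that paper: draw n ≥ k samples for a problem, count the c samples that pass all tests, and compute 1 - C(n-c, k) / C(n, k). The function name and the sample counts below are illustrative assumptions, not taken from any particular evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing sample.
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 10 samples were generated and 3 of them passed.
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
print(round(pass_at_k(10, 3, 5), 4))  # 0.9167
```

A benchmark score is then typically the mean of this per-problem estimate over all problems.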

🤖 HumanEval

HumanEval is a hand-crafted dataset, consisting of 164 Python programming problems with unit tests. An example task looks like:

from typing import List

def below_zero(operations: List[int]) -> bool:
    """
    You're given a list of deposit and withdrawal operations on a bank account that starts with zero balance. Your task is to detect if at any point the balance of account falls below zero, and at that point function should return True. Otherwise it should return False.

    >>> below_zero([1, 2, 3])
    False
    >>> below_zero([1, 2, -4, 5])
    True
    """

HumanEval was a pioneering research effort, but it now suffers from some unfortunate drawbacks:

  1. Data is likely contaminated. The HumanEval dataset has been around for over two years and has been discussed and documented widely online. The latest coding LLMs are likely to have picked up its test data while crawling for training data, which would make the evaluation no longer valid.
  2. Trivial coding questions that don't mimic real engineering setups. HumanEval mostly consists of LeetCode-style interview questions, each giving a single function whose body the LLM fills in. In a more realistic corporate setup, developers often add code to multiple files in a single PR and constantly refer to functions implemented in other files. These tasks are more interesting and more challenging for LLMs, and they are critical scenarios for AI coding assistants to land in enterprises.
  3. Unit tests are too weak. Researchers noticed that the test cases in HumanEval tasks (on average 7.7 tests per problem) aren't enough to guarantee the correctness of the generated code (e.g. a wrong implementation could still pass all existing tests), and thus augmented the test cases in the HumanEval benchmark by 80x in HumanEvalPlus.
  4. Limited coverage of programming languages. This one is obvious, as HumanEval only includes Python code. We ❤️ all programming languages!
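
The weak-test problem in point 3 can be illustrated with the below_zero task shown earlier. In the sketch below, the checks are just the two docstring examples (an assumption for illustration, not the dataset's actual test suite), `below_zero_wrong` is a deliberately incorrect hypothetical completion, and `extra_case` is an additional input of the kind an augmented test suite might add.

```python
from typing import List


def below_zero_wrong(operations: List[int]) -> bool:
    # Incorrect: flags any withdrawal, even if the balance stays positive.
    return any(op < 0 for op in operations)


def below_zero_correct(operations: List[int]) -> bool:
    # Correct: track the running balance and stop once it goes negative.
    balance = 0
    for op in operations:
        balance += op
        if balance < 0:
            return True
    return False


docstring_examples = [([1, 2, 3], False), ([1, 2, -4, 5], True)]
extra_case = ([5, -1], False)  # balance goes 5 -> 4, never below zero

for impl in (below_zero_wrong, below_zero_correct):
    passes_examples = all(impl(ops) == want for ops, want in docstring_examples)
    passes_extra = impl(extra_case[0]) == extra_case[1]
    print(impl.__name__, passes_examples, passes_extra)
# below_zero_wrong True False
# below_zero_correct True True
```

Both implementations pass the two original checks; only the extra case separates them.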

🧩 Mostly Basic Programming Problems (MBPP)

MBPP is another popular benchmark for code generation. Researchers from Google introduced it in the paper Program Synthesis with Large Language Models in August 2021, one month after the release of HumanEval. It contains 974 entry-level Python (as the name clearly suggests) programming tasks. An example looks like:

   """
   Write a python function to remove first and last occurrence of a given character from the string.

   assert remove_Occ("hello","l") == "heo"
   assert remove_Occ("abcda","a") == "bcd"
   assert remove_Occ("PHP","P") == "H"
   """

Unlike HumanEval, MBPP targets basic tasks commonly encountered by engineers, such as string manipulation, simple arithmetic, and basic data structure operations. However, it still suffers from drawbacks similar to those of HumanEval described above.
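
Because the MBPP example above states its tests as plain assert statements, checking a candidate completion can be sketched as follows. The `candidate` string is a hypothetical model output and `passes_all_tests` is an illustrative helper, not part of any official harness. A real evaluation would run untrusted code in a sandbox with timeouts rather than calling `exec` directly.

```python
from typing import List

# Hypothetical model output for the task above.
candidate = '''
def remove_Occ(s, ch):
    first = s.find(ch)
    if first == -1:
        return s
    s = s[:first] + s[first + 1:]
    last = s.rfind(ch)
    if last != -1:
        s = s[:last] + s[last + 1:]
    return s
'''

tests = [
    'assert remove_Occ("hello","l") == "heo"',
    'assert remove_Occ("abcda","a") == "bcd"',
    'assert remove_Occ("PHP","P") == "H"',
]


def passes_all_tests(code: str, test_statements: List[str]) -> bool:
    """Return True if the code defines a function that satisfies every assert."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # define the candidate function
        for statement in test_statements:
            exec(statement, namespace)  # raises AssertionError on failure
    except Exception:
        return False
    return True


print(passes_all_tests(candidate, tests))  # True
```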

What are we looking for in coding LLM evaluations?

🔬 Scientific and Relevant Setup

Top of mind for us is the metric setup. As mentioned above, most existing coding LLM evaluations focus on function-level code generation: given a docstring or, at most, a function signature, the LLM is expected to generate the entire function body.

Here is what we think a trustworthy evaluation setup should cover:

  1. Non-trivial code. Definitely no more LeetCode-style coding questions! The ideal evaluation should target projects with substantial engineering complexity. Signals such as lines of code, number of files, or number of contributors could serve as good indicators of code complexity.
  2. Cross-file references. This is a key factor that differentiates a reliable and practical evaluation from one that only scratches the surface of the coding world. Engineers do not code in a silo; they are strongly encouraged to reuse functions and APIs already implemented in the existing codebase.
  3. Code completion. Code completion is the most widely adopted LLM-powered feature in developer tools. Millions of developers worldwide use AI code completion in their daily workflow. Tabby provides a low-barrier solution for code completion, and is committed to continuously improving the end-to-end product quality.

⚖️ Easy and Low-Cost to Run

The ease and cost of running evaluations directly determine the number of models we can evaluate and how often we can afford to update the results (when the evaluation data is refreshed, for example). There are efforts to leverage crowdsourcing to rate the quality of LLM responses (e.g. Glaive arena), which excel at gathering high-quality human feedback and provide valuable insights into user behavior. However, crowdsourced ratings are harder to scale and take longer to produce results. We are iterating quickly on Tabby, and decided that scalability and ease are critical to us now.

🔍 Data Quality and Inclusion

Data quality is critical to maintaining the legitimacy of such evaluations. Here's what's important for evaluation data:

  1. Train/Eval Data Split. It's one of the most important concepts in your Machine Learning 101 course. Yet it is so basic that folks often neglect how hard it is to maintain in real-world applications over time. For example, HumanEval started as a manually drafted dataset precisely to ensure this separation. Nevertheless, over time it still ran into data contamination issues.
  2. Evaluation Quality. HumanEvalPlus, mentioned above, is a great example of this. Understanding the quality of the evaluation is important for developing a fair sense of the true model performance. We also encourage continuous efforts to improve evaluation quality!💪🏻
  3. Data Inclusion / Coverage. In the case of coding, inclusion covers efforts such as broadening support for different programming languages. In practice, choosing a reasonable ratio for each programming language is also tricky yet important.
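
As one illustration of guarding the train/eval split, a crude contamination check can measure how many token n-grams of an evaluation sample also appear verbatim in a training corpus. This is a simplified heuristic chosen for this sketch, not a method prescribed by any of the benchmarks discussed here. The whitespace tokenizer, the n-gram size, and the sample strings are all assumptions.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Whitespace-tokenize text and return its set of token n-grams."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(eval_sample: str, training_docs: List[str], n: int = 5) -> float:
    """Fraction of the sample's n-grams that also occur in the training docs."""
    sample_grams = ngrams(eval_sample, n)
    if not sample_grams:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in training_docs:
        corpus_grams |= ngrams(doc, n)
    return len(sample_grams & corpus_grams) / len(sample_grams)


# Hypothetical training corpus and evaluation samples.
training_docs = ["x = 1", "def add(a, b): return a + b  # helper"]
print(overlap_ratio("def add(a, b): return a + b", training_docs))  # 1.0
print(overlap_ratio("def mul(a, b): return a * b", training_docs))  # 0.0
```

A high ratio suggests the sample may have leaked into training data and deserves a closer look; a low ratio does not prove the sample is clean, since paraphrased or reformatted copies would evade exact matching.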

Highlights of recent research innovations

In this section, we showcase a few recent research works from academia toward building reliable and sound evaluations for coding LLMs. Across these papers, we observe a growing emphasis on evaluating coding LLMs with repository-level context, which indeed aligns with what we have been looking for.

🗂️ CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

The CrossCodeEval benchmark specifically addresses the gap that "existing code completion datasets such as HumanEval and MBPP mostly focus on single-file tasks, neglecting the real-world complexity of multi-file software projects". To this end, CrossCodeEval uses a static-analysis-based method to select examples that strictly require cross-file context for accurate code completion. Experiments show that cross-file context improves end-to-end system performance (LLM + code retriever), yet there is still a lot of room for improvement.
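
To give a rough feel for how static analysis can locate code that depends on other files, the sketch below uses Python's `ast` module to find lines that use names imported from modules inside the same repository. This is a simplified illustration of the general idea, not CrossCodeEval's actual procedure. The function name, the `local_modules` parameter, and the sample source are assumptions.

```python
import ast
from typing import List, Set


def cross_file_usage_lines(source: str, local_modules: Set[str]) -> List[int]:
    """Return line numbers that reference names imported from in-repo modules."""
    tree = ast.parse(source)
    imported: Set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] in local_modules:
                for alias in node.names:
                    imported.add(alias.asname or alias.name)
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in local_modules:
                    # "import a.b" binds "a"; "import a.b as c" binds "c".
                    imported.add(alias.asname or alias.name.split(".")[0])
    return sorted({
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and node.id in imported
    })


# Hypothetical file: "utils" lives in the same repository, "os" does not.
source = "from utils import helper\nimport os\n\nx = helper(1)\ny = os.getcwd()\n"
print(cross_file_usage_lines(source, {"utils"}))  # [4]
```

Lines found this way are candidates for completion targets whose correct answer cannot be inferred from the current file alone.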

🧪 RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

RepoBench also recognizes that current benchmarks primarily focus on single-file tasks, which leaves a gap in assessing these systems in more complex, real-world, multi-file programming scenarios. Therefore, RepoBench introduces three interconnected evaluation tasks, RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline), to measure the quality of each module as well as the end-to-end system.

💾 RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

RepoCoder presents an innovative approach that combines a similarity-based retriever and LLM prediction into an iterative retrieval-generation pipeline. To demonstrate the effectiveness of this method, the authors also introduced RepoEval, covering scenarios such as line, API invocation, and function body completion from high-quality real-world repositories.
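
The iterative retrieval-generation idea can be sketched as a loop in which each draft completion is appended to the retrieval query for the next round. The code below is a toy illustration under stated assumptions, not RepoCoder's implementation: the token-overlap (Jaccard) retriever is chosen here for simplicity, and `toy_generate` is a hypothetical stub standing in for an LLM call.

```python
import re
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two code strings."""
    ta, tb = set(re.findall(r"\w+", a)), set(re.findall(r"\w+", b))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def retrieve(query: str, snippets: List[str], top_k: int = 1) -> List[str]:
    """Return the top_k repository snippets most similar to the query."""
    return sorted(snippets, key=lambda s: jaccard(query, s), reverse=True)[:top_k]


def iterative_completion(unfinished: str, snippets: List[str],
                         generate: Callable[[str], str], iterations: int = 2) -> str:
    query, completion = unfinished, ""
    for _ in range(iterations):
        context = "\n".join(retrieve(query, snippets))
        completion = generate(context + "\n" + unfinished)
        # Next round: the draft completion augments the retrieval query.
        query = unfinished + "\n" + completion
    return completion


def toy_generate(prompt: str) -> str:
    # Stub model: only gets the signature right when it can see the definition.
    if "def save_report" in prompt:
        return "save_report(report, path)"
    return "save_report(report)"


snippets = [
    "def load_config(path): return parse(path)",
    "def save_report(report, path): write(path, report)",
]
unfinished = "cfg = load_config(path)\n# persist results"
print(iterative_completion(unfinished, snippets, toy_generate, iterations=1))
# save_report(report)
print(iterative_completion(unfinished, snippets, toy_generate, iterations=2))
# save_report(report, path)
```

In this toy run, the first round retrieves the `load_config` snippet, so the stub produces a draft with the wrong signature. The draft still mentions `save_report`, which steers the second retrieval to the right definition.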
