Ottia Labs: Examining LLM Accuracy in Task Time Predictions

Discover the comparative accuracy of task time predictions by Large Language Models and human developers through Ottia Labs' empirical study. Learn how our insights can optimize project management.

Introduction

In a recent empirical study conducted by Ottia Labs, we compared the accuracy of task duration estimates generated by Large Language Models (LLMs) against those provided by human developers. To keep the comparison fair, both the LLMs and the human participants received identical task descriptions, with no adjustments or supplementary details such as code bases (information that typically becomes available to experienced developers later in the process).
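For illustration, querying a model for an estimate might look like the minimal sketch below. The prompt wording, model name, and response schema are placeholders of our choosing, not the exact setup used in the study.

```python
# Hypothetical sketch of requesting a duration estimate from an LLM.
# The prompt, model name, and JSON schema are illustrative placeholders,
# not the study's actual configuration.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def estimate_task_hours(task_description: str) -> dict:
    """Ask the model for an estimate in hours plus a confidence level."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": ("You estimate software task durations. Reply with "
                         "JSON only: {\"hours\": <number>, \"confidence\": <1-8>}.")},
            {"role": "user", "content": task_description},
        ],
    )
    return json.loads(response.choices[0].message.content)

print(estimate_task_hours("Add pagination to the orders API endpoint."))
```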

Methodology

The study analyzed over 4,100 data points covering eight different LLMs as well as human estimates. Each estimate was scored for accuracy and tagged with a confidence level, providing a comprehensive comparison of predictive capabilities.
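The accuracy figures below can be read as a simple ratio of estimated to actual hours, with the "80-120% correctness range" covering estimates between 0.8x and 1.2x of the actual time. Here is a minimal sketch of that scoring, assuming per-task records of estimate, actual time, and a 1-8 confidence level; the field names and sample data are ours, not the study's schema.

```python
# Minimal sketch of the accuracy metrics used throughout this write-up.
# Field names and sample data are illustrative, not the study's schema.
from dataclasses import dataclass

@dataclass
class Estimate:
    estimated_hours: float
    actual_hours: float
    confidence: int  # 1 (highest) through 8 (lowest), as inferred from the text

def accuracy_ratio(e: Estimate) -> float:
    """Ratio above 1.0 means overestimation; e.g. 1.95 is roughly 2x actual time."""
    return e.estimated_hours / e.actual_hours

def within_correct_range(e: Estimate, lo: float = 0.8, hi: float = 1.2) -> bool:
    """True when the estimate lands in the 80-120% correctness range."""
    return lo <= accuracy_ratio(e) <= hi

samples = [Estimate(10, 8, 2), Estimate(5, 5, 1), Estimate(12, 4, 6)]
mean_ratio = sum(accuracy_ratio(e) for e in samples) / len(samples)
share_in_range = sum(within_correct_range(e) for e in samples) / len(samples)
print(f"mean ratio: {mean_ratio:.2f}, within 80-120%: {share_in_range:.0%}")
```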

Findings

Human Estimates: A Benchmark for Accuracy

On average, human developers demonstrated an impressive ability to predict task durations: their estimates were only 1% higher than the actual hours required. This accuracy underscores their deep understanding of workload and project demands, but it also hints at a common human bias, in which an estimate tends to become a fixed time allocation.

LLM Performance Analysis

Codechat Bison and Codestral: Close Contenders

Codechat Bison: Estimated 93% of the actual time required on average; 99% of its estimates fell within the reliable confidence range.

Codestral: Overestimated slightly but remained remarkably accurate, often aligning closely with developer estimates.

GPT-Based Models: A Mixed Bag of Predictions

GPT 3.5 Turbo: Moderately confident, leading in perfectly accurate estimates.

GPT 4 Turbo: Mostly overestimated task times, producing accurate estimates for only a small share of tasks at any confidence level.

GPT 4 Vision: Relatively confident, yet consistently underestimated at all confidence levels.

The Overestimators: Mistral, Gemini 1.5, and Claude 3

Mistral: Frequently overestimated durations but maintained high confidence.

Gemini 1.5: Consistently doubled the actual time needed across most confidence levels.

Claude 3: Balanced between under- and overestimation at the top confidence levels, but overestimated substantially at low confidence levels.

Conclusion

While this preliminary study only scratches the surface, it already highlights the unique strengths and limitations of both human and machine-based task time predictions. As the tech industry continues to evolve, understanding these nuances will be crucial for companies aiming to optimize project timelines and ensure successful project outcomes.

Future Work

We plan to explore further advancements in automated task description generation and fine-tuning estimation models. To stay updated on our latest research and findings, please subscribe to our newsletter.

Detailed Findings

Codechat Bison

Accuracy: Estimated 93% of actual hours.

Confidence: Considered 99% of its estimates correct; 22% fell within the 80-120% correctness range.

Performance: Slight underestimation at high confidence levels; significant overestimation at lower levels.

GPT 3.5 Turbo

Accuracy: Estimates were 33% higher than the actual hours on average.

Confidence: 44% of estimates fell within confidence levels 2-5; the leader in perfectly accurate estimates (19%).

Performance: Overestimated at all confidence levels, but never beyond 177% of actual time (at level 4), and remained quite precise even at levels 7 and 8 (121% and 90%).

Codestral

Accuracy: Overestimated by 25% on average.

Confidence: Considered 87% of its estimates correct, with 75% in the top three confidence levels and 19% of estimates exactly correct.

Performance: A slight overestimation tendency at all levels except level 1, with no large deviations at any level; a substantial share of near-correct estimates (26%).

Mistral

Accuracy: Overestimated by 95% on average, i.e. most estimates were roughly 2x the actual time.

Confidence: 86% of its estimates fell within the top five confidence levels.

Performance: Often overestimated, especially from level 5 onwards.

Claude 3

Accuracy: Overestimated by 37% on average.

Confidence: Deemed 64% of its estimates correct, with 54% in the top three confidence levels.

Performance: Slightly underestimated at level 2; overestimated at the other levels.

Gemini 1.5

Accuracy: Overestimated by 2x on average; only 16% of estimates fell within the 80-120% correctness range.

Confidence: 53% of estimates fell within the high confidence levels (1-5).

Performance: Consistently estimated double the needed time at all levels except levels 1-3.

GPT 4 Vision

Accuracy: Underestimated by 50% on average.

Confidence: Considered 62% of its estimates accurate.

Performance: Mostly underestimated tasks by 50% or more, but overestimated by 56% at level 2.

GPT 4 Turbo

Accuracy: Overestimated by 3x on average.

Confidence: Believed 60% of its estimates were correct.

Performance: Mostly overestimated tasks; only 12% of estimates fell within the 80-120% accuracy range; the smallest fraction of perfect predictions (7.5%).
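To put the per-model figures side by side, the reported average estimate-to-actual ratios can be ranked by their distance from a perfect 1.0. The snippet below simply transcribes the averages quoted in this write-up; the Gemini 1.5 and GPT 4 Turbo values are approximations, since they were reported only as roughly 2x and 3x.

```python
# Average estimate-to-actual ratios as reported above (1.0 = perfect).
# Gemini 1.5 and GPT 4 Turbo approximate the reported "2x" and "3x" figures.
reported_ratios = {
    "Human developers": 1.01,  # estimates 1% higher than actual hours
    "Codechat Bison": 0.93,    # estimated 93% of actual hours
    "Codestral": 1.25,         # overestimated by 25%
    "GPT 3.5 Turbo": 1.33,     # 33% higher than actual on average
    "Claude 3": 1.37,          # overestimated by 37%
    "GPT 4 Vision": 0.50,      # underestimated by 50%
    "Mistral": 1.95,           # overestimated by 95%, roughly 2x
    "Gemini 1.5": 2.00,        # roughly 2x on average
    "GPT 4 Turbo": 3.00,       # roughly 3x on average
}
for name, ratio in sorted(reported_ratios.items(),
                          key=lambda item: abs(item[1] - 1.0)):
    print(f"{name:16s} ratio {ratio:.2f} (off by {abs(ratio - 1.0):.0%})")
```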

Summary

LLMs exhibit varied accuracy and confidence, on average tending to overestimate task durations. At confidence levels 1 to 5, some LLMs produce quite accurate estimates.

Samuli Argillander
Founder/CTO
