Discover the comparative accuracy of task time predictions by Large Language Models and human developers through Ottia Labs' empirical study. Learn how our insights can optimize project management.
In a recent empirical study conducted by Ottia Labs, we aimed to compare the accuracy of task duration estimates generated by Large Language Models (LLMs) against those provided by human developers. To ensure unbiased results, both LLMs and human participants received identical task descriptions, devoid of any adjustments or supplementary details such as code bases—information typically available to experienced developers later in the process.
The study involved analyzing over 4,100 data points across eight different LLMs and human estimates. Each task was evaluated for accuracy and confidence levels, providing a comprehensive comparison of predictive capabilities.
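To make the metrics used below concrete, here is a minimal sketch of how an estimate can be scored against actual hours. The data, field names, and the 80-120% "correctness" band definition are assumptions for illustration, not Ottia Labs' actual pipeline:

```python
# Hypothetical task records: model (or human) estimate vs. actual hours.
tasks = [
    {"estimated_h": 3.0, "actual_h": 2.5},
    {"estimated_h": 8.0, "actual_h": 8.5},
    {"estimated_h": 1.0, "actual_h": 2.0},
]

# Ratio > 1.0 means overestimation, < 1.0 means underestimation.
ratios = [t["estimated_h"] / t["actual_h"] for t in tasks]

# Mean deviation from actual hours, in percent (positive = overestimates).
mean_deviation_pct = (sum(ratios) / len(ratios) - 1.0) * 100

# Share of estimates falling within the 80-120% correctness band.
in_band = sum(1 for r in ratios if 0.8 <= r <= 1.2) / len(ratios)

print(f"mean deviation: {mean_deviation_pct:+.0f}%")
print(f"within 80-120%: {in_band:.0%}")
```

With this scoring, a model whose mean deviation is +95% is asking for roughly double the time actually needed, which is how the per-model figures below should be read.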
On average, human developers demonstrated an impressive ability to predict task durations: their estimates were only 1% higher than the actual hours required. This accuracy underscores their deep understanding of workload and project demands, but it also reflects a familiar human bias, where estimates tend to harden into fixed time allocations.
Codechat Bison: Estimated 93% of the actual time required; 99% of estimates fell within the reliable confidence range.
Codestral: Slightly optimistic but remarkably accurate, often aligning closely with developer estimates.
GPT 3.5 Turbo: Moderately confident; leads in perfectly accurate estimates.
GPT 4 Turbo: Mostly overestimated task times, predicting a smaller percentage of tasks accurately at specific confidence levels.
GPT 4 Vision: Relatively confident, but consistently underestimated at all confidence levels.
Mistral: Frequently overestimated durations while maintaining high confidence.
Gemini 1.5: Consistently doubled the actual time needed, regardless of confidence level.
Claude 3: Balanced between under- and overestimation at top confidence levels, but substantially overestimated at low confidence levels.
While this preliminary study only scratches the surface, it already highlights the unique strengths and limitations of both human and machine-based task time predictions. As the tech industry continues to evolve, understanding these nuances will be crucial for companies aiming to optimize project timelines and ensure successful project outcomes.
We plan to explore further advancements in automated task description generation and fine-tuning estimation models. To stay updated on our latest research and findings, please subscribe to our newsletter.
Detailed results for each model:

Accuracy: Estimated 93% of actual hours.
Confidence: 99% of its estimates considered correct, with 22% within the 80-120% correctness range.
Performance: Slight underestimation at high confidence levels; significant overestimation at lower levels.
Accuracy: On average, estimates are 33% higher than actual hours.
Confidence: 44% of estimates at levels 2-5; champion of 100% accurate estimates (19%).
Performance: Overestimates at all confidence levels, but never by more than 177% of actual time (at level 4), and remains fairly precise even at levels 7 and 8 (121% and 90%).
Accuracy: Overestimated by 25% on average.
Confidence: 87% of estimates considered correct, with 75% in the top three confidence levels and 19% of estimates 100% correct.
Performance: Slight overestimation tendency at all levels except level 1, with no major deviations at any level; substantial share of near-correct estimates (26%).
Accuracy: Overestimated by 95% on average, i.e. most estimates were roughly double the actual time.
Confidence: 86% of its estimates fell within the top five confidence levels.
Performance: Often overestimated, especially from level 5 onwards.
Accuracy: Overestimated by 37% on average.
Confidence: 64% of estimates deemed correct, with 54% in the top three confidence levels.
Performance: Slightly underestimated at level 2; overestimated at all other levels.
Accuracy: Overestimates by 2x on average; only 16% of estimates within the 80-120% correct range.
Confidence: 53% of estimates fall at high confidence levels 1-5.
Performance: Consistently overestimated by double the needed time across all levels, except at levels 1-3.
Accuracy: Underestimates by 50% on average.
Confidence: Considers 62% of its estimates to be accurate.
Performance: Mostly underestimated tasks by 50% or more, but overestimated by 56% at level 2.
Accuracy: Overestimated by 3x on average.
Confidence: Believed 60% of its estimates were correct.
Performance: Mostly overestimated tasks; only 12% within the 80-120% accuracy range; smallest fraction of perfect predictions (7.5%).
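The per-level performance figures above amount to grouping estimate-to-actual ratios by the model's self-reported confidence level. A minimal sketch with made-up data (level 1 is assumed to be the highest confidence, as in the breakdowns above):

```python
from collections import defaultdict

# Hypothetical (confidence level, estimate/actual ratio) pairs;
# level 1 is assumed to be the model's highest confidence.
records = [
    (1, 1.05), (1, 0.95),
    (2, 1.25), (2, 1.75),
    (3, 2.00), (3, 2.00),
]

by_level = defaultdict(list)
for level, ratio in records:
    by_level[level].append(ratio)

# Mean ratio per level: 1.0 means spot-on, 2.0 means the model asked
# for twice the time the task actually took.
mean_ratio = {lvl: sum(rs) / len(rs) for lvl, rs in sorted(by_level.items())}
print(mean_ratio)
```

A pattern like the one in this toy data (accurate at level 1, roughly 2x at level 3) is what statements such as "often overestimated, especially from level 5 onwards" summarize.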
LLMs exhibit varied accuracy and confidence, on average tending to overestimate task durations. At confidence levels 1 to 5, some LLMs produce quite accurate estimates.