Muhammad Adnan Rizqullah
King Abdulaziz University
Faculty of Computing and Information Technology
Advisor: Dr. Emad Yosif Albassam
Traditional development methods struggle to keep pace with increasing system complexity
Promising solution for efficient code generation, but reliability and comprehension require thorough evaluation
Providing test cases as input to guide LLMs toward producing more accurate and reliable code
Empirically analyze effectiveness across multiple dimensions
### Prompt
Check if any two numbers in the list are closer to each other than the given threshold.
>>> has_close_elements([1.0, 2.0, 3.0], 0.5)
False
### Signature
def has_close_elements(numbers: List[float], threshold: float) -> bool:
### Prompt + Signature
(same as baseline)
### Test (50% of suite)
assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
Key Difference: TDP adds explicit test cases as executable specifications (50% public tests, 50% withheld for validation)
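As a concrete illustration, a TDP prompt can be assembled mechanically from these pieces. The sketch below is a minimal example of that assembly; the function name `build_tdp_prompt`, the 50/50 split helper, and the exact prompt layout are assumptions for illustration, not the pipeline used in this study.

```python
import random

def build_tdp_prompt(docstring: str, signature: str, tests: list[str],
                     public_ratio: float = 0.5, seed: int = 0) -> tuple[str, list[str]]:
    """Combine the baseline prompt (docstring + signature) with a public
    subset of the test suite; the remaining tests are withheld for validation."""
    rng = random.Random(seed)
    shuffled = tests[:]
    rng.shuffle(shuffled)
    split = max(1, int(len(shuffled) * public_ratio))
    public, withheld = shuffled[:split], shuffled[split:]

    prompt = (
        f"{signature}\n"
        f'    """{docstring}"""\n\n'
        "# The implementation must satisfy these tests:\n"
        + "\n".join(public)
    )
    return prompt, withheld
```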
| Strategy | Mechanism | Improves | Auto-Evaluation? |
|---|---|---|---|
| Chain-of-Thought (CoT) | Step-by-step reasoning decomposition | How the model reasons | ✗ No |
| Structured CoT | CoT + programming constructs | How the model reasons (up to +13.79% Pass@1) | ✗ No |
| Persona Prompting | Role assignment to shape perspective | Code quality, not correctness | ✗ No |
| Few-Shot | Concrete input-output examples | What to produce | ✗ No |
| TDP | Executable test assertions | What + correctness criteria | ✓ Built-in |
TDP is orthogonal to reasoning-enhancement techniques (CoT) and role-setting (Persona) — suggesting potential complementarity.
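The "built-in auto-evaluation" entry refers to TDP's withheld tests: generated code can be checked automatically, with no human in the loop. A minimal sketch of that check, assuming each withheld test is an `assert candidate(...)` string and reusing the running `has_close_elements` example (the use of `exec` and the fixed function name are illustrative assumptions):

```python
def passes_withheld_tests(generated_code: str, withheld_tests: list[str]) -> bool:
    """Define the candidate function from the generated source, then run
    every withheld assertion against it; any failure rejects the solution."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)
        candidate = namespace["has_close_elements"]   # name from the running example
        for test in withheld_tests:
            exec(test, {"candidate": candidate})      # e.g. "assert candidate([...], 0.3) == True"
        return True
    except Exception:
        return False
```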
| Reference | Year | Dataset | LLM Models | Contribution | Limitation |
|---|---|---|---|---|---|
| Liu et al. | 2025 | HumanEval variants | 9 models (GPT-4, Mixtral, Llama3, etc.) | Autonomous self-debugging framework with test dependency graphs; 10.41% average improvement | All benchmarks HumanEval-based; No systematic model size comparison; Pass@k attempts undisclosed |
| Mathews & Nagappan | 2024 | HumanEval, MBPP, CodeChef | GPT-4, Llama 3 | 9-30% accuracy improvement with remediation loops | TDP applied only to failed cases; Non-standard difficulty analysis; Limited model scope |
| Fakhoury et al. | 2024 | HumanEval, MBPP | GPT-3/3.5 variants | Semi-automated TiCODER workflow; 38% improvement | Requires human intervention; TDP not isolated; No difficulty analysis |
| Piya & Sullivan | 2024 | Custom LeetCode | GPT-3.5 | 5:8 efficiency ratio; TDP best practices | Single model; Lacks statistical rigor; TDP not isolated; Low test coverage |
| Chen et al. | 2022 | HumanEval, MBPP, APPS, Contests | Codex, InCoder, CodeGen variants | Dual execution agreement; 18.8% improvement | Outdated models; No difficulty analysis; TDP not isolated |
| Lahiri et al. | 2022 | HumanEval, MBPP | code-davinci-002 | Interactive test-driven specification | Requires user feedback; Single model; TDP not isolated; No difficulty analysis |
Test-Driven Prompting (TDP) shows promise, but existing research has critical limitations:
What is the performance of test-driven code generation across programming languages of differing popularity and type systems?
Scope: Python, JavaScript, C++, TypeScript, PHP, Ruby, Go, C#
Objective: Measure TDP performance across 8 languages to address language bias in existing research
What is the performance of test-driven code generation on various models with differing characteristics?
Scope: Closed/open-source, varying sizes, general vs specialized
Objective: Compare TDP effectiveness across closed/open-source models, varying sizes, and specialized vs general-purpose LLMs
What is the relationship between programming problem difficulty and LLM performance?
Objective: Analyze LLM performance across problem difficulty levels from introductory to competition-level
What is the relationship between test suite completeness and LLM performance?
Objective: Quantify the relationship between test suite completeness and LLM performance
How can a decision framework guide developers in selecting appropriate LLMs for platform-specific development?
Objective: Develop decision framework for LLM selection in mobile development (Android/Java, iOS/Swift) based on accuracy, budget, and deployment needs
| Model | Source Type | Size | Specialization |
|---|---|---|---|
| GPT-4o | Closed Source | Large | General Purpose |
| GPT-4o-mini | Closed Source | Small | General Purpose |
| Claude 3.5 Sonnet | Closed Source | Large | General Purpose |
| Claude 3.5 Haiku | Closed Source | Small | General Purpose |
| Qwen 2.5 Coder 32B | Open Source | 32B | Coding Specialized |
| Qwen 2.5 Coder 14B | Open Source | 14B | Coding Specialized |
| Qwen 2.5 Coder 7B | Open Source | 7B | Coding Specialized |
| Qwen 2.5 Coder 3B | Open Source | 3B | Coding Specialized |
Comprehensive Coverage: 4 closed-source + 4 open-source models spanning 3B to large-scale parameters
| Language | Frequency | Type System |
|---|---|---|
| Python | High | Dynamic |
| JavaScript | High | Dynamic |
| C++ | High | Static |
| TypeScript | High | Static |
| PHP | Medium | Dynamic |
| Ruby | Medium | Dynamic |
| Go | Medium | Static |
| C# | Medium | Static |
Balanced Design: 2 languages per frequency × type-system cell (4 per factor level) enables robust statistical analysis
Cassano, F., et al. (2023). MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation. IEEE TSE, 49(7).
| Category | Mean Δ | Median Δ |
|---|---|---|
| High-Freq Dynamic | 5.71% | 7.46% |
| High-Freq Static | 4.03% | 2.96% |
| Med-Freq Dynamic | 4.91% | 4.79% |
| Med-Freq Static | 3.86% | 4.88% |
Note: Black lines = group means (μ), green dashed lines = medians (M)
Example: Python (+7.09 pp) vs JavaScript (+4.32 pp)
Both dynamic and high-frequency, yet nearly 3 pp apart!
| Model | First-Attempt Δ | Remediation Δ |
|---|---|---|
| GPT-4o | +6.54% | +1.29% |
| GPT-4o-mini | +5.36% | +1.29% |
| Claude Sonnet | +5.76% | +0.66% |
| Claude Haiku | +6.56% | +0.93% |
| Qwen 32B | +7.08% | +3.47% |
| Qwen 14B | +7.43% | +7.74% |
| Qwen 7B | +6.58% | +6.70% |
| Qwen 3B | +3.45% | +3.68% |
THE strongest empirical finding: Perfect success rate + large effect size = model-agnostic effectiveness
80% Success Rate (8/10): Small+TDP ≥ Large Baseline
GPT-4o-mini + TDP (80.7%) vs GPT-4o baseline (77.4%): +3.3 pp, 16.7× cheaper
HumanEval: perfect parity (81.7%); MBPP: mini wins 79.6% vs 73.1% (+6.5 pp)
Claude Haiku + TDP (84.6%) vs Claude Sonnet baseline (81.4%): +3.2 pp advantage
Economic transformation: Smaller models + TDP achieve competitive performance at fraction of cost
Inverse pattern: Larger improvements on easier problems!
Remediation recovery on EASY: 0.0% → 45.77% (+45.77 pp). Specification errors are recoverable!
Why 0%? 67-88% of Easy failures are trailing newlines; 100% of Easy problems require them vs 92.5% Medium and 79.2% Hard, leaving no "escape routes."
Mechanism: Easy problems = low algorithmic demands + low specification clarity. Hard problems = high algorithmic demands + higher specification clarity. TDP eliminates specification disadvantage.
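A hypothetical illustration of this mechanism (the greeting task and its test are invented for illustration, not drawn from the benchmarks): a single TDP assertion turns an implicit trailing-newline requirement into an explicit, checkable one.

```python
# Hypothetical easy task: format a greeting line.
def greet_v1(name: str) -> str:
    return f"Hello, {name}!"        # plausible baseline attempt: no trailing newline

def greet_v2(name: str) -> str:
    return f"Hello, {name}!\n"      # TDP attempt: the test below makes "\n" explicit

# The kind of assertion a TDP prompt would expose; it rejects v1 and accepts v2.
check = lambda f: f("Ada") == "Hello, Ada!\n"
assert not check(greet_v1) and check(greet_v2)
```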
| Test Ratio | Avg. Tests | Δ | Effect Size | p-value |
|---|---|---|---|---|
| 0.25 | 1.56 | +2.90 pp | d=1.24 | 0.005 |
| 0.5 | 3.34 | +3.35 pp | d=1.23 | 0.099 |
| 0.75 | 5.07 | +4.27 pp | d=1.97 | 0.018 |
| 1.0 | 7.20 | +6.85 pp | d=2.83 | 0.009 |
Ratio 0.5: Although p=0.099 slightly exceeds the conventional 0.05 threshold, the Wilcoxon signed-rank test statistic of 0.0 indicates perfect unidirectional effectiveness (all models improved, no regressions), demonstrating consistent practical effectiveness despite the limited sample size.
Qwen models: Plateau at 0.5-0.75 (5.49-6.09 pp), jump to 8.53-9.14 pp at 1.0
GPT models: U-shaped pattern with dip at 0.5 (0.61-1.22 pp), recover to ~4.9 pp at 1.0
Key insight: Even minimal coverage (1.56 tests) provides meaningful gains (+2.90 pp, p=0.005)
Practical recommendation: 50% test suite offers good cost-benefit trade-off
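A sketch of how the four ratio conditions could be constructed (the helper name, rounding rule, and seed handling are assumptions; only the ratios come from the table):

```python
import math
import random

def sample_public_tests(tests: list[str], ratio: float, seed: int = 0) -> list[str]:
    """Expose `ratio` of the test suite in the prompt; the rest stays withheld."""
    if ratio <= 0:
        return []
    k = max(1, math.ceil(len(tests) * ratio))
    return random.Random(seed).sample(tests, k)

# The four conditions studied: 25%, 50%, 75%, and 100% of the suite.
suite = [f"assert candidate(case_{i}) == expected_{i}" for i in range(7)]
for ratio in (0.25, 0.5, 0.75, 1.0):
    print(ratio, len(sample_public_tests(suite, ratio)))
```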
| Node | Question | Answer | Next | Recommendation |
|---|---|---|---|---|
| 1 | First-Attempt Correct? | Yes | → 2a | First-attempt scenario |
| 1 | First-Attempt Correct? | No | → 2b | Multi-attempt scenario |
| 2a | Budget? | High | → 3a | Higher budget |
| 2a | Budget? | Low | → 3b | Lower budget |
| 2b | Budget? | High | → 3c | Higher budget |
| 2b | Budget? | Low | → 3d | Lower budget |
| 3a | Self-Host? | Yes | ✓ | Qwen-32B First (And: 75.16%, iOS: 78.25%) |
| 3a | Self-Host? | No | ✓ | GPT-4o First (And: 75.81%, iOS: 81.04%) |
| 3b | Self-Host? | Yes | ✓ | Qwen-14B First (And: 71.52%, iOS: 74.86%) |
| 3b | Self-Host? | No | ✓ | GPT-4o-mini First (And: 68.37%, iOS: 73.71%) |
| 3c | Self-Host? | Yes | ✓ | Qwen-32B Multi (And: 78.56%, iOS: 85.97%, $5/hr) |
| 3c | Self-Host? | No | ✓ | GPT-4o Multi (And: 80.36%, iOS: 88.87%) |
| 3d | Self-Host? | Yes | ✓ | Qwen-14B Multi (And: 73.49%, iOS: 78.07%, $1.8/hr) |
| 3d | Self-Host? | No | ✓ | GPT-4o-mini Multi (And: 74.64%, iOS: 80.72%) |
Note: Multi-attempt adds 5-8 pp | iOS consistently 4-7 pp > Android | And = Android
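Read as selection logic, the tree can be sketched as a small function; the accuracy and hourly-cost figures are copied from the table above, while the function and parameter names are illustrative assumptions.

```python
def recommend_model(first_attempt_only: bool, high_budget: bool, self_host: bool) -> str:
    """Walk the decision tree: attempt mode -> budget -> hosting constraint."""
    if first_attempt_only:                                   # node 1 -> 2a
        if high_budget:                                      # node 2a -> 3a
            return ("Qwen-32B First (Android 75.16%, iOS 78.25%)" if self_host
                    else "GPT-4o First (Android 75.81%, iOS 81.04%)")
        return ("Qwen-14B First (Android 71.52%, iOS 74.86%)" if self_host      # node 3b
                else "GPT-4o-mini First (Android 68.37%, iOS 73.71%)")
    if high_budget:                                          # node 2b -> 3c
        return ("Qwen-32B Multi (Android 78.56%, iOS 85.97%, $5/hr)" if self_host
                else "GPT-4o Multi (Android 80.36%, iOS 88.87%)")
    return ("Qwen-14B Multi (Android 73.49%, iOS 78.07%, $1.8/hr)" if self_host  # node 3d
            else "GPT-4o-mini Multi (Android 74.64%, iOS 80.72%)")

# Example: privacy-constrained Android app, moderate budget, multi-attempt allowed.
print(recommend_model(first_attempt_only=False, high_budget=False, self_host=True))
```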
IEEE Access - Accepted
i-JIM - Accepted
I welcome your questions and feedback
Muhammad Adnan Rizqullah
Advisor: Dr. Emad Yosif Albassam
King Abdulaziz University
Faculty of Computing and Information Technology
Every problem-model combination tested with BOTH baseline AND TDP
→ Eliminates selection bias
Baseline: Problem + function signature
TDP: Baseline + 50% test cases
→ 50% withheld for validation
Up to 5 attempts with error feedback for BOTH strategies
→ Equal opportunities prevent compounded advantages
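A hedged sketch of that loop, where `generate` stands in for the model call and `run_withheld_tests` for the test harness (both are assumed interfaces, not artifacts of this study):

```python
MAX_ATTEMPTS = 5  # identical cap for the baseline and TDP conditions

def solve_with_remediation(prompt: str, generate, run_withheld_tests):
    """Generate code, feed the failure message back, and retry up to the cap."""
    feedback = ""
    code = ""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        code = generate(prompt + feedback)
        passed, error = run_withheld_tests(code)   # (bool, error message)
        if passed:
            return code, attempt
        feedback = f"\n# Previous attempt failed with: {error}\n# Please fix the code."
    return code, MAX_ATTEMPTS
```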
Shapiro-Wilk normality → Paired t-tests / Wilcoxon
Cohen's d effect sizes + 95% confidence intervals
→ Ensures practical significance
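A sketch of that statistical pipeline using SciPy; the input arrays are placeholders for paired baseline/TDP accuracies, and the paired Cohen's d formula is the common d_z variant (an assumption, not necessarily the exact analysis code):

```python
import numpy as np
from scipy import stats

def compare_paired(baseline: np.ndarray, tdp: np.ndarray, alpha: float = 0.05) -> dict:
    """Shapiro-Wilk on the paired differences, then paired t-test or Wilcoxon."""
    diff = tdp - baseline
    normal = stats.shapiro(diff).pvalue > alpha
    test = stats.ttest_rel(tdp, baseline) if normal else stats.wilcoxon(tdp, baseline)
    cohens_d = diff.mean() / diff.std(ddof=1)                  # paired (d_z) effect size
    ci_95 = stats.t.interval(0.95, len(diff) - 1,
                             loc=diff.mean(), scale=stats.sem(diff))
    return {"normal": normal, "p_value": test.pvalue, "cohens_d": cohens_d, "ci_95": ci_95}
```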
BUT: Means vs medians inconsistent
Med-Freq Static: mean 3.86% vs median 4.88%
Example: Python (+7.09 pp) vs JavaScript (+4.32 pp): both dynamic and high-frequency, yet nearly 3 pp apart
Highest improvement: +7.09 pp; lowest: +3.36 pp
Same category, different performance
Key finding: TDP effectiveness depends on alignment between problem characteristics and language-specific training patterns, not on broad categorical distinctions
10× difference between benchmarks vs 2× max within language types
Mechanism: Specification clarity matters more than language properties
Finding: Problem characteristics (test suite comprehensiveness, baseline difficulty) >>> intrinsic language properties
Largest improvements:
| Lang | Model | Δ |
|---|---|---|
| C++ | GPT-4o | +14.14 pp |
| PHP | GPT-4o | +13.64 pp |
| Python | GPT-4o | +11.51 pp |
| PHP | Qwen | +11.11 pp |
| Ruby | Qwen | +9.35 pp |

Largest regressions:
| Lang | Model | Δ |
|---|---|---|
| PHP | GPT-4o | -3.12 pp |
| Ruby | GPT-4o | -1.26 pp |
| C++ | Qwen | -1.25 pp |
| Go | Qwen | -0.65 pp |
| JS | Qwen | -0.63 pp |
Notable: PHP with GPT-4o shows the largest improvement (+13.64 pp on MBPP) AND the largest regression (-3.12 pp on HumanEval): same language-model pair, different benchmark!
Key insight: Python shows roughly 2× the improvement of TypeScript/Go, despite all being popular languages
Finding: TDP effectiveness independent of proprietary vs community-developed models
Qwen 32B + TDP: 84.7% vs GPT-4o + TDP: 83.9%
Open-source matches closed-source!
Cost Advantage: Qwen 32B at $0.21/M tokens vs Claude Sonnet at $12.50/M tokens: 59× cost savings!
Practical implication: Organizations with self-hosting capabilities can achieve competitive accuracy at dramatically lower cost
Remediation Attenuation:
Per-attempt overhead:
To match TDP accuracy:
TDP reduces remediation tasks: HumanEval 9.5 vs 13; MBPP 18.29 vs 44.43 (TDP vs baseline)
Example: Task 53.0 - All models failed due to missing trailing newline (100% identical failure)
TDP makes implicit requirements explicit
Example: Task 271 - Pattern recognition in number sequences (TDP helps marginally)
TDP clarifies constraints but can't convey algorithmic insights
Predictive guidance: Specification-heavy domains (data formatting, API integration) → maximum TDP benefit. Algorithm-heavy domains (optimization, graph theory) → moderate benefit
Performance vs Python:
Android (Java):
| Model | Acc@1 | Remediated Acc |
|---|---|---|
| GPT-4o | 75.81% | 80.36% |
| GPT-4o-mini | 70.74% | 74.64% |
| Qwen-32B | 75.16% | 78.56% |
| Qwen-14B | 73.23% | 73.49% |
iOS (Swift):
| Model | Acc@1 | Remediated Acc |
|---|---|---|
| GPT-4o | 81.04% | 88.87% |
| GPT-4o-mini | 75.47% | 80.72% |
| Qwen-32B | 78.25% | 85.97% |
| Qwen-14B | 74.34% | 78.07% |
Key findings:
Context: Android prototype, limited budget, multi-attempt OK
Decision Path:
Context: iOS production app, maximum quality priority
Decision Path:
Context: Android app, privacy regulations (HIPAA), moderate budget
Decision Path:
Extend recommendations to untested models:
Framework applies beyond tested models based on capability tier matching
Integration of LLM capabilities with established SE practices like TDD
Bridging academic research and practitioner needs