A neuro-symbolic approach for test oracle generation
PhD: Università della Svizzera italiana
English
Automatic generation of test oracles is an open challenge in software testing. Although the generation of test prefixes has received substantial attention, the oracle problem, that is, the automatic generation of assertions that verify program behavior, remains largely open. This dissertation proposes the integration of symbolic and neural techniques into a unified neuro-symbolic framework to effectively generate test oracles. The research conducts a systematic three-stage empirical study spanning the generation of both axiomatic and concrete test oracles.

The first stage defines Tratto, a neuro-symbolic approach that generates axiomatic test oracles token by token. Tratto integrates a symbolic component that constrains the search space of valid tokens to ensure compilability, and a neural component that guides token generation toward semantically relevant oracles. Tratto achieves 73% accuracy, 72% precision, and 61% F1-score on a ground-truth dataset, outperforming the symbolic baseline Jdoctor (61% accuracy, 62% precision, 25% F1-score) and the neural baseline GPT-4 (40% accuracy, 24% precision, 37% F1-score). The approach generates three times more correct oracles than Jdoctor while producing ten times fewer false positives than GPT-4, demonstrating improved soundness through neuro-symbolic integration.

The second stage is a large-scale empirical study that evaluates large language models for generating concrete oracles. We conducted the study on an unbiased dataset of 13,866 test oracles from 135 Java projects, with all test cases created after the models' training cut-off date to ensure strict separation from training data. The study systematically varies model families, sizes, and prompt configurations, generating 554,640 predictions.
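The token-by-token scheme of the first stage can be sketched as follows. This is a minimal illustration, not Tratto's actual implementation: the toy grammar, the candidate tokens, and the `neural_score` stub are invented for the example. The symbolic component restricts which tokens may legally extend the partial oracle, and the neural component chooses among the remaining candidates.

```python
# Illustrative sketch (assumed, simplified): token-by-token oracle generation
# where a symbolic component restricts the candidate tokens and a stand-in
# "neural" scorer selects among them.

# Toy "grammar": for each partial oracle, the symbolically valid next tokens.
VALID_NEXT = {
    (): ["receiver"],
    ("receiver",): [".size()", ".isEmpty()"],
    ("receiver", ".size()"): ["==", ">="],
    ("receiver", ".size()", "=="): ["0", "oldSize"],
    ("receiver", ".isEmpty()"): [";"],
    ("receiver", ".size()", "==", "0"): [";"],
    ("receiver", ".size()", "==", "oldSize"): [";"],
}

def neural_score(partial, token):
    """Stand-in for a learned model: scores candidate tokens by plausibility."""
    preferences = {"receiver": 1.0, ".isEmpty()": 0.9, "==": 0.8,
                   "0": 0.7, ".size()": 0.6, "oldSize": 0.4, ">=": 0.3,
                   ";": 1.0}
    return preferences.get(token, 0.1)

def generate_oracle():
    partial = ()
    while True:
        candidates = VALID_NEXT.get(partial, [])       # symbolic constraint
        if not candidates:
            break                                      # early termination
        best = max(candidates, key=lambda t: neural_score(partial, t))  # neural choice
        partial = partial + (best,)
        if best == ";":
            break
    return " ".join(partial)

print(generate_oracle())  # → receiver .isEmpty() ;
```

Because every emitted token comes from the symbolically valid set, the resulting oracle is well-formed by construction, at the cost of terminating early when no valid candidate remains.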
The results show that larger models consistently outperform smaller ones across all specializations, and that providing contextual information beyond the test prefix and the source code of the invoked methods does not lead to any significant performance improvement. Mutation analysis shows that LLM-generated oracles achieve fault-detection capability comparable to developer-written oracles (43% vs. 45% mutation score), while also revealing a critical limitation: 50% of the generated oracles in the experiment fail to compile or execute.

The third stage defines Tracto, a neuro-symbolic approach for concrete oracle generation, composed of a symbolic module that combines grammar-based static analysis with a project-identifier resolution system to retrieve candidate tokens and enforce validation constraints, and a neural module that performs token selection and concrete literal inference. A systematic comparison against a fine-tuned purely neural model (Qwen2.5-Coder-3B) and the best-performing vanilla model from the empirical study (Qwen2.5-Coder-32B) on 3,448 post-cutoff test oracles reveals fundamental trade-offs. The purely neural model achieves the highest raw accuracy (33%), compared to Tracto (20%) and the vanilla LLM (10%). However, Tracto yields more robust oracles, reaching higher compilation rates (80%) than the purely neural model (73%) and the vanilla LLM (10%), and higher test-pass rates (59%, against 51% and 7%, respectively).

The results presented in this dissertation support the research hypothesis that a neuro-symbolic approach improves the quality of the generated oracles with respect to purely neural approaches. Symbolic constraints reduce false positives, improve compilation rates, and enhance test-pass rates for both axiomatic and concrete oracle generation. However, these benefits come at the cost of reduced oracle-generation coverage, due to early termination when validation constraints detect problematic patterns.
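Mutation analysis, used in the second stage to compare the fault-detection capability of generated and developer-written oracles, can be illustrated with a minimal sketch. The function under test, the hand-written mutants, and the oracle below are invented examples (real mutation tools derive mutants automatically, e.g. by operator replacement); the mutation score is the fraction of mutants the oracle detects ("kills").

```python
# Illustrative sketch (assumed, simplified): mutation score as the fraction
# of seeded faults ("mutants") that a test's oracle detects ("kills").

def original_max(a, b):
    return a if a > b else b

# Hand-written mutants of original_max (normally generated automatically).
def mutant_ge(a, b):       # '>' replaced by '>='
    return a if a >= b else b

def mutant_lt(a, b):       # '>' replaced by '<'
    return a if a < b else b

def mutant_swapped(a, b):  # branch results swapped
    return b if a > b else a

def oracle(fn):
    """A test: the prefix calls fn, the oracle asserts expected behavior."""
    return fn(3, 5) == 5 and fn(5, 3) == 5 and fn(4, 4) == 4

def mutation_score(mutants, oracle):
    killed = sum(1 for m in mutants if not oracle(m))  # mutant detected
    return killed / len(mutants)

mutants = [mutant_ge, mutant_lt, mutant_swapped]
print(mutation_score(mutants, oracle))  # 2 of 3 mutants killed → 0.6666666666666666
```

Here `mutant_ge` survives because it behaves identically on the chosen inputs, while the other two mutants are killed; a higher mutation score indicates an oracle that detects more seeded faults.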
Classification
Computer science and technology
Open access status
green
Identifiers
Persistent URL
https://n2t.net/ark:/12658/srd1334462