mirror of
https://github.com/Z3Prover/z3
synced 2026-05-16 15:15:35 +00:00
Add TPTP integration test report (500 files, z3 -T:5)
Agent-Logs-Url: https://github.com/Z3Prover/z3/sessions/35080496-0ba4-4fbc-a441-5f5c430e978c Co-authored-by: NikolajBjorner <3085284+NikolajBjorner@users.noreply.github.com>
This commit is contained in:
parent
57e0a12e3a
commit
a07922f2b7
2 changed files with 4641 additions and 0 deletions
139
tptp-test-report.md
Normal file
139
tptp-test-report.md
Normal file
|
|
@ -0,0 +1,139 @@
|
|||
# TPTP Integration Test Report — `copilot/integrate-tptp-into-src-shell`
|
||||
|
||||
**Date:** 2026-05-12
|
||||
**Branch:** `copilot/integrate-tptp-into-src-shell`
|
||||
**Z3 version:** 4.17.0 (64-bit)
|
||||
**Test corpus:** 500 randomly-selected TPTP v8.1.2 problems
|
||||
**Domains sampled:** ARI (Arithmetic, n=247), SET (Set Theory, n=168), SYN (Syntactic, n=85)
|
||||
**Invocation:** `z3 -T:5 <file.p>`
|
||||
|
||||
> **Note:** `tptp.org` was not accessible from the test sandbox (DNS blocked). Problems were drawn from the GoelandProver/GoelandBenchmarks repository, which mirrors TPTP v8.1.2 problem files in the same format.
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
| Category | Count | Pct |
|
||||
|---|---|---|
|
||||
| Correct answer | 178 | 35.6% |
|
||||
| Wrong answer | 25 | 5.0% |
|
||||
| Parse error - negative literal | 52 | 10.4% |
|
||||
| Parse error - rational literal | 40 | 8.0% |
|
||||
| Parse error - TFF type declaration | 96 | 19.2% |
|
||||
| Sort mismatch (TFF arithmetic) | 46 | 9.2% |
|
||||
| Crash (unhandled z3_error) | 40 | 8.0% |
|
||||
| Timeout (process killed at 5s) | 23 | 4.6% |
|
||||
|
||||
**Overall correct rate: 178/500 = 35.6%**
|
||||
|
||||
---
|
||||
|
||||
## Timing
|
||||
|
||||
| Metric | Value |
|
||||
|---|---|
|
||||
| Minimum | 0.010 s |
|
||||
| Maximum | 10.141 s |
|
||||
| Mean | 1.099 s |
|
||||
| Median | 0.029 s |
|
||||
| Under 1 s | 435 (87.0%) |
|
||||
| 1-5 s | 2 (0.4%) |
|
||||
| Over 5 s | 63 (12.6%) |
|
||||
|
||||
---
|
||||
|
||||
## Per-Domain Results
|
||||
|
||||
| Domain | Correct | Total | Rate |
|
||||
|---|---|---|---|
|
||||
| ARI (Arithmetic) | 0 | 247 | 0.0% |
|
||||
| SET (Set Theory) | 113 | 168 | 67.3% |
|
||||
| SYN (Syntactic) | 65 | 85 | 76.5% |
|
||||
|
||||
---
|
||||
|
||||
## Bug Findings
|
||||
|
||||
### Bug 1 - Negative integer literals not parsed (52 files)
|
||||
|
||||
The TFF parser rejects negative integer literals such as `-2` or `-4`.
|
||||
|
||||
Example (ARI007_1.p):
|
||||
```
|
||||
tff(n2_less_2,conjecture, $less(-2,2) ).
|
||||
```
|
||||
Error: `TPTP parse error: unexpected character '-' at 32:11`
|
||||
|
||||
Every TFF problem using negative constants fails immediately.
|
||||
|
||||
---
|
||||
|
||||
### Bug 2 - Rational/real literals not parsed (40 files)
|
||||
|
||||
The parser rejects rational literals written as `p/q` (TPTP syntax for rationals, e.g. `1/2`).
|
||||
|
||||
Error pattern: `TPTP parse error: unexpected character '/' at 32:11`
|
||||
|
||||
---
|
||||
|
||||
### Bug 3 - TFF user-type declarations parse error (96 files)
|
||||
|
||||
TFF problems that introduce predicate or function type signatures fail to parse.
|
||||
|
||||
Example (ARI094_1.p):
|
||||
```
|
||||
tff(p_type,type, p: $int > $o ).
|
||||
```
|
||||
Error: `TPTP parse error: expected ')' at 32:19`
|
||||
|
||||
The `: $int > $o` type product syntax is not supported.
|
||||
|
||||
---
|
||||
|
||||
### Bug 4 - Sort mismatch for TFF typed variables (46 files)
|
||||
|
||||
When a TFF formula uses typed quantifier variables such as `[X: $int]`, the TPTP frontend binds them to the uninterpreted sort `U` rather than Z3's `Int` sort, causing a sort mismatch when they are passed to arithmetic built-ins.
|
||||
|
||||
Example (ARI005_1.p):
|
||||
```
|
||||
tff(prove_12_less_something,conjecture, ? [X: $int] : $less(12,X) ).
|
||||
```
|
||||
Error: `TPTP frontend error: Sort mismatch at argument #2 for function (declare-fun $less (U U) Bool) supplied sort: Int`
|
||||
|
||||
---
|
||||
|
||||
### Bug 5 - Arithmetic built-ins treated as uninterpreted (25 wrong answers)
|
||||
|
||||
Even when parsing succeeds, simple TFF arithmetic conjectures return CounterSatisfiable instead of Theorem. The arithmetic predicates `$less`, `$lesseq`, `$greater`, `$greatereq` are being declared as uninterpreted predicates over sort `U`, not mapped to Z3 built-in arithmetic relations.
|
||||
|
||||
Examples of wrong answers:
|
||||
- ARI002_1.p: `~ $less(3,2)` → expected Theorem, got CounterSatisfiable
|
||||
- ARI014_1.p: `$lesseq(2,2)` → expected Theorem, got CounterSatisfiable
|
||||
- ARI015_1.p: `$less(2,3)` → expected Theorem, got CounterSatisfiable
|
||||
|
||||
---
|
||||
|
||||
### Bug 6 - Unhandled z3_error exception on timeout (40 crashes)
|
||||
|
||||
When a problem hits the timeout, Z3 crashes with:
|
||||
```
|
||||
terminate called after throwing an instance of 'z3_error'
|
||||
what(): timeout
|
||||
```
|
||||
instead of printing `% SZS status Timeout` and exiting cleanly. The TPTP frontend does not catch `z3_error` around the solver call.
|
||||
|
||||
---
|
||||
|
||||
## SET and SYN Domain (67-77% Success)
|
||||
|
||||
The SET and SYN domains use purely first-order FOF (untyped) formulas with no arithmetic and work reasonably well. Main failure modes are timeouts on harder combinatorial problems.
|
||||
|
||||
---
|
||||
|
||||
## Recommendations
|
||||
|
||||
1. Fix negative/rational literal parsing in the TPTP5 lexer
|
||||
2. Map TFF arithmetic sorts ($int, $rat, $real) to Z3 Int/Real sorts
|
||||
3. Map arithmetic predicates ($less, $lesseq, etc.) to Z3 built-in arithmetic relations
|
||||
4. Support TFF type declaration syntax (name: sort > sort)
|
||||
5. Catch z3_error in the TPTP frontend and output SZS Timeout status instead of crashing
|
||||
4502
tptp-test-results.json
Normal file
4502
tptp-test-results.json
Normal file
File diff suppressed because it is too large
Load diff
Loading…
Add table
Add a link
Reference in a new issue