3
0
Fork 0
mirror of https://github.com/Z3Prover/z3 synced 2026-05-16 15:15:35 +00:00

Add TPTP integration test report (500 files, z3 -T:5)

Agent-Logs-Url: https://github.com/Z3Prover/z3/sessions/35080496-0ba4-4fbc-a441-5f5c430e978c

Co-authored-by: NikolajBjorner <3085284+NikolajBjorner@users.noreply.github.com>
This commit is contained in:
copilot-swe-agent[bot] 2026-05-12 17:40:32 +00:00 committed by GitHub
parent 57e0a12e3a
commit a07922f2b7
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
2 changed files with 4641 additions and 0 deletions

139
tptp-test-report.md Normal file
View file

@ -0,0 +1,139 @@
# TPTP Integration Test Report — `copilot/integrate-tptp-into-src-shell`
**Date:** 2026-05-12
**Branch:** `copilot/integrate-tptp-into-src-shell`
**Z3 version:** 4.17.0 (64-bit)
**Test corpus:** 500 randomly-selected TPTP v8.1.2 problems
**Domains sampled:** ARI (Arithmetic, n=247), SET (Set Theory, n=168), SYN (Syntactic, n=85)
**Invocation:** `z3 -T:5 <file.p>`
> **Note:** `tptp.org` was not accessible from the test sandbox (DNS blocked). Problems were drawn from the GoelandProver/GoelandBenchmarks repository, which mirrors TPTP v8.1.2 problem files in the same format.
---
## Executive Summary
| Category | Count | Pct |
|---|---|---|
| Correct answer | 178 | 35.6% |
| Wrong answer | 25 | 5.0% |
| Parse error - negative literal | 52 | 10.4% |
| Parse error - rational literal | 40 | 8.0% |
| Parse error - TFF type declaration | 96 | 19.2% |
| Sort mismatch (TFF arithmetic) | 46 | 9.2% |
| Crash (unhandled z3_error) | 40 | 8.0% |
| Timeout (process killed at 5s) | 23 | 4.6% |
**Overall correct rate: 178/500 = 35.6%**
---
## Timing
| Metric | Value |
|---|---|
| Minimum | 0.010 s |
| Maximum | 10.141 s |
| Mean | 1.099 s |
| Median | 0.029 s |
| Under 1 s | 435 (87.0%) |
| 1-5 s | 2 (0.4%) |
| Over 5 s | 63 (12.6%) |
---
## Per-Domain Results
| Domain | Correct | Total | Rate |
|---|---|---|---|
| ARI (Arithmetic) | 0 | 247 | 0.0% |
| SET (Set Theory) | 113 | 168 | 67.3% |
| SYN (Syntactic) | 65 | 85 | 76.5% |
---
## Bug Findings
### Bug 1 - Negative integer literals not parsed (52 files)
The TFF parser rejects negative integer literals such as `-2` or `-4`.
Example (ARI007_1.p):
```
tff(n2_less_2,conjecture, $less(-2,2) ).
```
Error: `TPTP parse error: unexpected character '-' at 32:11`
Every TFF problem using negative constants fails immediately.
---
### Bug 2 - Rational/real literals not parsed (40 files)
The parser rejects rational literals written as `p/q` (TPTP syntax for rationals, e.g. `1/2`).
Error pattern: `TPTP parse error: unexpected character '/' at 32:11`
---
### Bug 3 - TFF user-type declarations parse error (96 files)
TFF problems that introduce predicate or function type signatures fail to parse.
Example (ARI094_1.p):
```
tff(p_type,type, p: $int > $o ).
```
Error: `TPTP parse error: expected ')' at 32:19`
The `: $int > $o` type product syntax is not supported.
---
### Bug 4 - Sort mismatch for TFF typed variables (46 files)
When a TFF formula uses typed quantifier variables such as `[X: $int]`, the TPTP frontend binds them to the uninterpreted sort `U` rather than Z3's `Int` sort, causing a sort mismatch when they are passed to arithmetic built-ins.
Example (ARI005_1.p):
```
tff(prove_12_less_something,conjecture, ? [X: $int] : $less(12,X) ).
```
Error: `TPTP frontend error: Sort mismatch at argument #2 for function (declare-fun $less (U U) Bool) supplied sort: Int`
---
### Bug 5 - Arithmetic built-ins treated as uninterpreted (25 wrong answers)
Even when parsing succeeds, simple TFF arithmetic conjectures return CounterSatisfiable instead of Theorem. The arithmetic predicates `$less`, `$lesseq`, `$greater`, `$greatereq` are being declared as uninterpreted predicates over sort `U`, not mapped to Z3 built-in arithmetic relations.
Examples of wrong answers:
- ARI002_1.p: `~ $less(3,2)` → expected Theorem, got CounterSatisfiable
- ARI014_1.p: `$lesseq(2,2)` → expected Theorem, got CounterSatisfiable
- ARI015_1.p: `$less(2,3)` → expected Theorem, got CounterSatisfiable
---
### Bug 6 - Unhandled z3_error exception on timeout (40 crashes)
When a problem hits the timeout, Z3 crashes with:
```
terminate called after throwing an instance of 'z3_error'
what(): timeout
```
instead of printing `% SZS status Timeout` and exiting cleanly. The TPTP frontend does not catch `z3_error` around the solver call.
---
## SET and SYN Domain (67-77% Success)
The SET and SYN domains use purely first-order FOF (untyped) formulas with no arithmetic and work reasonably well. Main failure modes are timeouts on harder combinatorial problems.
---
## Recommendations
1. Fix negative/rational literal parsing in the TPTP5 lexer
2. Map TFF arithmetic sorts ($int, $rat, $real) to Z3 Int/Real sorts
3. Map arithmetic predicates ($less, $lesseq, etc.) to Z3 built-in arithmetic relations
4. Support TFF type declaration syntax (name: sort > sort)
5. Catch z3_error in the TPTP frontend and output SZS Timeout status instead of crashing

4502
tptp-test-results.json Normal file

File diff suppressed because it is too large Load diff