3
0
Fork 0
mirror of https://github.com/Z3Prover/z3 synced 2026-05-16 15:15:35 +00:00

Add TPTP master-branch re-run report (500 files, z3 -T:5, 43.8% correct)

Agent-Logs-Url: https://github.com/Z3Prover/z3/sessions/d15efc5f-0eac-4cf1-b3f1-27788eea9447

Co-authored-by: NikolajBjorner <3085284+NikolajBjorner@users.noreply.github.com>
This commit is contained in:
copilot-swe-agent[bot] 2026-05-13 00:06:50 +00:00 committed by GitHub
parent ddf912d8a6
commit 215f298536
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
2 changed files with 4631 additions and 0 deletions

129
tptp-test-report-master.md Normal file
View file

@ -0,0 +1,129 @@
# TPTP Integration Test Report — `master` branch (re-run after #9483 merge)
**Date:** 2026-05-12
**Branch:** `master` (commit `1a2d3e0ebb` — "Integrate TPTP with internal APIs via `cmd_context`…")
**Z3 version:** 4.17.0 (64-bit)
**Test corpus:** 500 randomly-selected TPTP v8.1.2 problems (same seed=42 sample as previous run)
**Domains:** ARI (n=247), SET (n=168), SYN (n=85)
**Invocation:** `z3 -T:5 <file.p>`
> **Note:** `tptp.org` is not accessible from the CI sandbox. Problems are drawn from
> [GoelandProver/GoelandBenchmarks](https://github.com/GoelandProver/GoelandBenchmarks),
> which mirrors TPTP v8.1.2 in identical format.
---
## Results vs. Previous Run (`copilot/integrate-tptp-into-src-shell`)
| Category | Previous | Master | Change |
|---|---|---|---|
| Correct answer | 178 (35.6%) | 219 (43.8%) | +41 |
| Wrong answer | 25 (5.0%) | 15 (3.0%) | -10 |
| Parse error — negative literal | 52 (10.4%) | 0 (0.0%) | -52 |
| Parse error — rational literal | 40 (8.0%) | 0 (0.0%) | -40 |
| Parse error — TFF type declaration | 96 (19.2%) | 73 (14.6%) | -23 |
| Sort mismatch (TFF arithmetic) | 46 (9.2%) | 128 (25.6%) | +82 |
| Crash (unhandled z3_error) | 40 (8.0%) | 38 (7.6%) | -2 |
| Timeout (process killed at 5s) | 23 (4.6%) | 25 (5.0%) | +2 |
| Other errors | 0 (0.0%) | 2 (0.4%) | +2 |
**Overall correct rate: 219/500 = 43.8%** (was 35.6%)
---
## Per-Domain Accuracy
| Domain | Previous | Master |
|---|---|---|
| ARI (Arithmetic) | 0/247 (0.0%) | 43/247 (17.4%) |
| SET (Set Theory) | 113/168 (67.3%) | 113/168 (67.3%) |
| SYN (Syntactic) | 65/85 (76.5%) | 63/85 (74.1%) |
---
## Timing
| Metric | Value |
|---|---|
| Minimum | 0.008 s |
| Maximum | 10.169 s |
| Mean | 1.073 s |
| Median | 0.030 s |
| Under 1 s | 435 (87.0%) |
| Over 5 s | 63 (12.6%) |
---
## Fixes Confirmed in Master
### Fixed: Negative integer literals now parsed
`$less(-2, 2)` now returns `% SZS status Theorem`**0 files** fail on negative literals (was 52).
### Fixed: Typed quantifier variables (`[X: $int]`)
`? [X: $int] : $less(12,X)` now returns `% SZS status Theorem` — resolved via correct sort mapping.
### Fixed: Simple arithmetic predicates as uninterpreted
The ARI domain went from **0% to 17.4%** correct. Simple facts like `$lesseq(2,2)` and `$less(2,3)` now return Theorem.
---
## Remaining Issues
### Issue 1 — Sort mismatch: bare integer constants treated as sort U (128 files)
When a TFF formula uses bare integer constants (no typed variables), the arithmetic constant is still in sort `U` instead of `Int`.
Example (`ARI055_1.p`):
```
tff(prove_31_not_12,conjecture, 31 != 12 ).
```
Error: `TPTP frontend error: Sorts U and Int are incompatible`
This is the most widespread remaining issue (128 files, 25.6%).
### Issue 2 — TFF type declarations not supported (73 files)
`tff(name, type, f: $int > $o)` type-introduction blocks still fail to parse.
Example (`ARI354_1.p`):
```
tff(real_less_problem_10,conjecture, ~ $less(-3.25,-8.69) ).
```
Error: `TPTP parse error: expected ')' at 32:15`
(The real literal `-3.25` triggers a different parse path than the negative integer `-2`.)
### Issue 3 — Real/rational literals with decimal point not parsed (73 files overlap)
Decimal real literals like `-3.25` or `1.5` are not recognized. These are a subset of the type-declaration parse errors.
### Issue 4 — `$uminus` function not recognized (2 files)
`$uminus(2)` is a TPTP arithmetic function for unary negation.
Error: `TPTP parse error: expected identifier at 32:5`
### Issue 5 — Crash on timeout (38 files)
Unchanged from before: when the timeout fires, `z3_error` is thrown uncaught and `std::terminate` is called, producing:
```
terminate called after throwing an instance of 'z3_error'
what(): timeout
```
instead of `% SZS status Timeout`.
### Issue 6 — Wrong answers (15 files)
15 files receive an incorrect SZS verdict. Examples:
- `SYN364+1.p`, `SYN375+1.p`, `SYN377+1.p` — expected Theorem, got CounterSatisfiable
- `SET576+3.p` — expected Theorem, got CounterSatisfiable
These appear to be FOF problems in the SYN/SET domains where Z3's quantifier handling returns an incorrect model.
---
## Recommendations
1. **Fix bare integer constant sort** — when the formula context is TFF with `$int` typing, integer numeral tokens should produce `Int`-sorted terms, not `U`-sorted ones.
2. **Support decimal real literals** — add lexer support for `d.d` (decimal point) real literal syntax.
3. **Support `$uminus`** — map this TPTP arithmetic function to Z3's unary minus.
4. **Catch `z3_error` in TPTP frontend** — print `% SZS status Timeout` before exiting to avoid crashing.

File diff suppressed because it is too large Load diff