3
0
Fork 0
mirror of https://github.com/Z3Prover/z3 synced 2026-05-28 12:56:28 +00:00

Add compare-stats anomaly reporter workflow for 30h bug/crash triage (#9647)

This adds an agentic workflow that analyzes `compare_stats.html` over a
rolling 30-hour window and publishes a GitHub Discussion summarizing
bugs, crashes, and anomalies. It explicitly captures unknown-outlier
patterns where a benchmark is `unknown` while peers in the same set are
mostly `sat`/`unsat`/`timeout`.

- **Workflow added**
- Introduces `.github/workflows/compare-stats-anomaly-reporter.md` (plus
compiled `.lock.yml`).
  - Supports `workflow_dispatch` and scheduled execution.
- Uses safe discussion output with auto-close of older reports for the
same stream.

- **Data acquisition + robustness**
- Fetches `http://mtzguido.tplinkdns.com:8081/z3/compare_stats.html`
with `curl` and `wget` fallback.
- Adds integrity checks (non-empty HTML/table presence) and explicit
incomplete-report behavior on fetch/parse failures.

- **30-hour analysis semantics**
- Filters rows by timestamp candidates (`time`, `timestamp`, `date`,
`run`, etc.) using UTC.
- Falls back to full-table analysis when timestamps are unavailable, and
marks the report accordingly.

- **Classification logic**
- Detects bug/crash signals from status/details (`crash`, `segfault`,
`assert`, `abort`, `exception`, `error`, `failed`, `bug`).
  - Detects:
- unknown-outlier anomalies (thresholded minority `unknown` in otherwise
decisive SAT-family outcomes),
- status divergences (conflicting non-timeout outcomes for same
benchmark),
    - repeated hard-failure anomalies.

- **Discussion output shape**
- Produces a compact report with executive counts, bug/crash table,
anomaly subsections, and raw extraction summary/limitations.

```yaml
safe-outputs:
  create-discussion:
    title-prefix: "[Compare Stats] "
    category: "agentic workflows"
    close-older-discussions: true
```

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
This commit is contained in:
Copilot 2026-05-27 09:25:33 -07:00 committed by GitHub
parent 1aa2158bf4
commit b3fff5b399
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
2 changed files with 1564 additions and 0 deletions

1373
.github/workflows/compare-stats-anomaly-reporter.lock.yml generated vendored Normal file

File diff suppressed because it is too large Load diff

View file

@ -0,0 +1,191 @@
---
description: Analyze compare_stats.html for the latest 30 hours and publish bug/crash/anomaly summary as a GitHub Discussion
on:
schedule:
- cron: "0 */12 * * *"
workflow_dispatch:
permissions: read-all
strict: false
timeout-minutes: 45
network:
allowed:
- defaults
- mtzguido.tplinkdns.com
tools:
bash: [":*"]
github:
toolsets: [default]
safe-outputs:
create-discussion:
title-prefix: "[Compare Stats] "
category: "agentic workflows"
close-older-discussions: true
missing-tool:
create-issue: true
noop:
report-as-issue: false
---
# Compare Stats Bug/Crash/Anomaly Reporter
Your name is ${{ github.workflow }}. You are a Z3 benchmarking analysis agent for `${{ github.repository }}`.
Analyze the benchmark comparison page below, focusing on results from the last 30 hours, then create a GitHub Discussion with a concise but actionable summary of:
- Bugs
- Crashes
- Anomalies
Source URL:
`http://mtzguido.tplinkdns.com:8081/z3/compare_stats.html`
Note: this endpoint is currently HTTP-only. Treat fetched data as non-sensitive benchmark telemetry and do not include secrets in requests or reports.
Note: the workflow runs every 12 hours but analyzes 30 hours intentionally to provide overlap and avoid missing transient failures between runs.
Overlapping windows are expected; `close-older-discussions: true` keeps only the latest report thread active.
## Requirements
### 1) Fetch and save the source page
Use bash to fetch the page into `/tmp/gh-aw/agent/compare_stats.html`.
Try this first:
```bash
curl -fsSL --max-time 60 "http://mtzguido.tplinkdns.com:8081/z3/compare_stats.html" -o /tmp/gh-aw/agent/compare_stats.html
```
If that fails, retry once with:
```bash
wget -q -T 60 -O /tmp/gh-aw/agent/compare_stats.html "http://mtzguido.tplinkdns.com:8081/z3/compare_stats.html"
```
If both fail, still create a discussion that explains the fetch failure, includes stderr output, and marks the report as incomplete.
After a successful fetch, perform basic integrity checks before parsing:
- file is non-empty
- content includes `<html` and at least one `<table`
- if checks fail, treat as suspicious/incomplete data and report this explicitly
### 2) Parse tabular data
Use Python to parse all tables from the HTML into normalized rows.
Use resilient parsing:
- Prefer `pandas.read_html` when available.
- If pandas fails, parse with `html.parser`/regex fallback.
Persist normalized JSON to `/tmp/gh-aw/agent/compare_stats_rows.json`.
### 3) Detect time window (last 30 hours)
Find candidate timestamp columns using case-insensitive column-name matches:
- `time`, `timestamp`, `date`, `run`, `created`, `updated`
Parse datetimes with timezone handling if present. Use current UTC time and filter to rows where timestamp is within the past 30 hours.
Treat naive timestamps as UTC.
If no timestamp can be extracted:
- Report this limitation explicitly.
- Continue analysis on all rows.
- Mark the discussion as "time-window fallback".
### 4) Classify bugs/crashes/anomalies
Infer key columns using column-name heuristics:
- status/result/outcome
- benchmark/instance/file/name
- set/suite/group/track/family
- message/error/details/reason
Normalize status strings to lowercase.
#### Bugs / Crashes
Classify a row as **crash/bug** if status/details contain terms like:
- `crash`, `segfault`, `assert`, `abort`, `exception`, `error`, `failed`, `bug`
#### Anomalies
At minimum, detect:
1. **Unknown-outlier anomaly** (required):
- Within the same benchmark set/suite/group, if most rows are in `{sat, unsat, timeout}` but a minority are `unknown`, flag the `unknown` rows as anomalies.
- Rationale: require enough samples for confidence and avoid flagging sets where `unknown` is common behavior. `0.4` caps unknown results to a minority, while `0.6` enforces a decisive majority of sat/unsat/timeout outcomes. Any remainder after those constraints is intentionally allowed for other statuses.
- Use this threshold: `total_rows >= 4`, `unknown_count / total_rows <= 0.4`, and `(sat_count + unsat_count + timeout_count) / total_rows >= 0.6`.
- If set/suite/group columns are missing, fallback grouping order is: directory prefix of benchmark path/name, then benchmark name prefix before first separator (`/`, `:`, `::`), then a single global group.
2. **Status divergence anomaly**:
- Same benchmark name appears multiple times with conflicting non-timeout statuses (for example `sat` vs `unsat`).
- Ignore timeout-only disagreements here; timeout behavior is covered under the repeated hard-failure anomaly section to reduce noise from transient runtime variance.
3. **Repeated hard-failure anomaly**:
- Same benchmark appears repeatedly with crash/error-like status in the time window.
### 5) Generate discussion report
Create a GitHub Discussion using `create-discussion` safe output.
Use this structure:
```markdown
### Compare Stats Analysis Report
**Source**: [compare_stats.html](http://mtzguido.tplinkdns.com:8081/z3/compare_stats.html)
**Workflow Run**: [#${{ github.run_id }}](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }})
**Analysis Time (UTC)**: <timestamp>
**Window**: last 30 hours (or fallback mode)
### Executive Summary
- Rows analyzed: N
- Rows in 30h window: M (or "timestamp unavailable")
- Bugs/crashes: B
- Anomalies: A
### Bugs and Crashes
| Benchmark Set | Benchmark | Status | Details | Timestamp |
|---|---|---|---|---|
| ... |
### Anomalies
#### Unknown-Outlier Cases
| Benchmark Set | Benchmark | Status | Peer Status Distribution | Timestamp |
|---|---|---|---|---|
| ... |
#### Status Divergences
| Benchmark | Observed Statuses | Benchmark Set(s) | Timestamp(s) |
|---|---|---|---|
| ... |
#### Repeated Hard Failures
| Benchmark | Failure Count | Representative Status/Details | Benchmark Set(s) |
|---|---|---|---|
| ... |
### Notes and Limitations
- Mention parsing assumptions
- Mention missing columns/timestamps if any
<details>
<summary><b>Raw Extraction Summary</b></summary>
- Table count
- Candidate columns used
- Top status distribution
- Up to 30 representative raw rows (sanitized)
</details>
```
## Reporting Rules
- Be factual and concise.
- Do not claim certainty when column mapping is heuristic.
- If no bugs/crashes/anomalies are found, still create the discussion and explicitly state "No issues detected in analyzed window."
- Do not open PRs or modify repository files.