# Code Health Repowise computes a 1-10 health score for every file in your repo from twenty-five deterministic markers: McCabe complexity, deep nesting, brain methods, class cohesion (LCOM4), god classes, clone detection, untested hotspots, function-level churn, code-age volatility, ownership dispersion, relative churn, change entropy, co-change scatter, recent defect history, test-quality smells, and more. **No LLM calls, no cloud requirement.** Pure Python over tree-sitter and git data, designed to finish in under 30 seconds on a 3,000-file repo.
repowise code-health loop: 25 deterministic markers fan into three signals (defect risk, maintainability, performance), the graph and git history locate where risk concentrates, and refactoring intelligence emits concrete plans (Extract Class, Extract Helper, Move Method, Break Cycle, Split File) your agent executes
Code health runs as a loop: **measure** every file across three signals, **locate** where the risk concentrates through the dependency graph and git history, then **fix** it with a concrete refactoring plan an agent can execute (see [Refactoring targets](#refactoring-targets) and [docs/REFACTORING.md](REFACTORING.md)). Linters check patterns. The health score predicts risk. A linter flags a line that matches a known-bad shape; the score estimates which files are likely to harbor the next bug, ranks them, and is calibrated against real defect history (see [Does the score find the bugs?](#does-the-score-find-the-bugs)). It uses signals a linter has no access to: churn, ownership, co-change, blast radius, hotspots. The [performance signal](#performance-static-performance-risk) is the one part that overlaps a linter, and even it follows the call graph across files, which a file-local linter cannot. ## Quick start ```bash repowise init # full index, populates health tables repowise health # KPIs + 20 worst-scoring files + top findings repowise update # re-score only changed files on each subsequent run ``` Open `http://localhost:7777/repos//health` for the dashboard once the local server is running (`repowise serve`). ## The score Each file starts at 10.0. Marker findings deduct from the score; deductions are capped per category so any one category can drive the score down by at most: | Category | Cap | Markers | |------------------------|-------|------------| | Organizational | −3.5 | developer_congestion, knowledge_loss, hidden_coupling, function_hotspot, code_age_volatility, ownership_risk, churn_risk, change_entropy, co_change_scatter, prior_defect | | Structural complexity | −2.5 | brain_method, low_cohesion, god_class, nested_complexity, bumpy_road, complex_conditional | | Test coverage | −2.0 | untested_hotspot, coverage_gap | | Test coverage (cont.) | −2.0 | coverage_gradient | | Size & complexity | −1.5 | complex_method, large_method, primitive_obsession | | Duplication | −1.0 | dry_violation | | Test quality | −0.5 | large_assertion_block, duplicated_assertion_block | | Error handling | −0.5 | error_handling | Twenty-five markers across the categories above. `function_hotspot` and `code_age_volatility` are blame-based and sit in the organizational bucket: both are tier-aware and stay silent on ESSENTIAL-tier repos until the per-line blame index is built. Per-marker weight multipliers let the strongest empirical predictors deduct more than the uniform severity table alone allows. The weights are **calibrated offline against a defect corpus, not hand-tuned**: each file is scored at the pre-window commit (T0, no leakage) and a logistic regression with NLOC as an explicit control fits each marker's defect lift beyond file size. The runtime stays deterministic; only the learned constants ship. The strongest predictors are `co_change_scatter`, `change_entropy`, `ownership_risk`, and `nested_complexity`; widely-firing smells that proved weak under leakage-free scoring are floored. The full per-marker table lives in the [architecture doc](architecture/code-health.md#61-calibrated-weight-multipliers), and the calibration with confidence intervals is in the [benchmark report](https://github.com/repowise-dev/repowise-bench/blob/master/health-defect/BENCHMARK_REPORT.md). The final score is clamped to `[1.0, 10.0]`. The three repo-level KPIs: - **Hotspot Health**: NLOC-weighted average over the files classified as hotspots by the git layer (high churn percentile plus minimum-activity floors), not a fixed top-N slice. - **Average Health**: NLOC-weighted average over all files. - **Worst Performer**: single lowest-scoring file. ## Three health signals: defect risk, maintainability, and performance Repowise surfaces three orthogonal health signals computed from the same marker stream by one shared scoring kernel: **defect risk** (the calibrated, overall number), **maintainability**, and **performance**. They are co-equal views, never blended into one number (the section on the overall score below explains why). The score above is the defect-risk signal: it is calibrated against a defect corpus, the bands are calibrated to it (Alert files carry roughly 17x the defect rate of Healthy files), and it is the overall number surfaced everywhere. But not every code smell predicts bugs. A handful of markers fire widely and matter a lot for how hard code is to read and change, yet proved weak as defect predictors under leakage-free scoring, so the defect calibration floors them to 0.5 (`low_cohesion`, `brain_method`, `primitive_obsession`, `dry_violation`, `error_handling`). Floored inside a defect-framed score they do two unhelpful things at once: they still nudge the number a little (noise against the calibrated signal) and they get no credit for the real problem they describe (maintainability). Repowise therefore computes a second, parallel signal, maintainability, from the same marker stream: - The floored smells above deduct at full weight (1.0) in maintainability instead of the 0.5 the defect calibration imposes. The defect calibration does not apply to a non-defect signal, so the maintainability weights are expert-set and tuned only against the maintainability pillar's own per-category caps. - The structural smells that are genuine defect predictors and core maintainability concerns (`god_class`, `large_method`, `nested_complexity`) count toward both dimensions. - Pure defect and organizational predictors (`change_entropy`, `ownership_risk`, `co_change_scatter`, and the like) stay out of maintainability entirely. - SQL markers (`sql_high_complexity` for branchy stored routines, `sql_select_star` in views/routines, `sql_update_delete_without_where`) are maintainability-only by construction: no defect corpus covers procedural SQL, so they carry an advisory weight under their own capped `sql` category and never touch the defect score. Routine complexity is counted from the decision keywords in the body text (procedural bodies do not parse into an AST); statement smells come from the sqlglot AST and stay silent on dbt/Jinja templates and on statements whose parse looks garbled. The two signals are computed by the single shared scoring kernel (`scoring.score_file`) against independent weight/category/cap tables, and they never feed back into each other. The overall, surfaced score remains exactly the defect score (byte-for-byte; a golden test locks this) until a later, deliberate decision to blend. Maintainability is surfaced alongside it as a co-equal headline: - **REST/overview**: `summary.maintainability_average` plus a per-file `maintainability_score` on every metric row. - **MCP `get_health`**: `kpis.maintainability_average` and per-file `defect_score` / `maintainability_score` / `performance_score`. - **CLAUDE.md** and the CLI `status` line print a maintainability headline next to defect-risk health. - Every finding carries a `dimension` (`defect` / `maintainability` / `performance`) naming the pillar it homes under, so findings can be filtered per signal. ## Performance: static performance risk The third signal, performance, flags shapes that waste work (code whose structure does redundant I/O), rather than measured runtime. It is deliberately high-precision and low-recall: a few real findings the rest of the toolchain can trust beat a wall of maybes. Standard linters do not find this class of problem. clippy, ruff's `PERF` rules, ESLint, and golangci-lint read one function at a time, so a loop in one file whose database call lives in another is invisible to them. On a 12,000-file benchmark the standard linters found **0** of the cross-function I/O-in-loop cases; repowise surfaced 557 findings across the run, about 90 of them spanning function boundaries, and 98% fell in categories ruff has no rule for. The findings are ordered by impact rather than raw count (NDCG 0.755 against 0.292 for severity-only). One caveat: the Rust dialect was new when the benchmark ran, and clippy could not be run end-to-end on the corpus because of a Windows build wall, so the Rust comparison there is catalogue-level, not a measured head-to-head. The data is in the [perf-detection benchmark](https://github.com/repowise-dev/repowise-bench/tree/master/perf-detection). The detectors (all under one bounded `performance` category cap of 1.0, so the pillar stays advisory) are: - **`io_in_loop`**: a database call, network request, filesystem read, or subprocess spawn that runs once per loop iteration: the classic N+1. This is the moat. Two things make it more than a file-local lint: - **Dependency classification.** The loop-nested call is resolved through a shared I/O-boundary classifier (`io_kind ∈ {db, network, filesystem, subprocess, lock}`) and only fires on a classified execution sink (an actual round-trip like `.execute` / awaited HTTP / `subprocess.run`), not a query-builder chain or a same-named pure helper. - **Call-graph reachability.** The loop and the I/O call need not be in the same function. A bounded-depth (≤3 hops) walk over the resolved `calls` graph catches the interprocedural case (loop in `A`, sink in a helper `A` calls) that no file-local linter can see. Cross-function findings carry their resolved `caller -> ... -> sink` path for explainability. - **`string_concat_in_loop`**: quadratic `+=` string building in a loop. - **`blocking_sync_in_async`**: a synchronous blocking call inside an `async` function, which stalls the whole event loop (mirrors ruff `ASYNC210/230/251`). - **`resource_construction_in_loop`**: a heavy I/O client or connection (`sqlite3.connect` / `httpx.Client` / `boto3.client` / `new PrismaClient` / `new HttpClient` / `sql.Open`) constructed every iteration instead of hoisted: connection churn, and socket exhaustion for `HttpClient`. - **`lock_in_loop`**: a mutex acquired on every iteration (`lock.acquire` / `mu.Lock` / `synchronized` / `lock(x){}`): a contention site. Activates the `lock` I/O-boundary kind. - **`serial_await_in_loop`**: an awaited I/O round-trip run one at a time in a loop where a `gather` / `Promise.all` could fan it out. Advisory: a static analyzer cannot prove the iterations are independent, so the finding suggests rather than asserts. - **`membership_test_against_list_in_loop`**: `x in big_list` (or `big_list.includes(x)`) inside a loop is O(n·m); a set makes each lookup O(1). Fires only when the right operand is provably a list, never a set or dict. - **`nested_loop_with_io`**: an I/O sink in the inner body of a **nested** loop: O(n·m) round-trips, the quadratic cousin of `io_in_loop`. The nesting itself raises confidence the finding is real, so it surfaces alongside `io_in_loop`. - **`blocking_io_under_lock`**: an I/O round-trip reached while a block-scoped lock is held (a C# `lock(){}` / Java `synchronized(){}` block, directly or through a call). Every other thread blocks for the full I/O wait, a throughput killer. The cross-function case reuses the same bounded-reachability engine as the N+1 moat, with a `lock -> io` entry set. Two markers use centrality as a precision gate rather than only a sort key: a shape that is noisy when flagged everywhere only fires in a hot function (one with top-quintile call-graph in-degree, or in a churny/hotspot file), computed by a reusable severity ranker over the same call graph the N+1 pass uses: - **`hot_path_sync_io`**: a blocking subprocess or filesystem call in a hot, request-reachable function, even outside a loop. It generalizes the pillar beyond loops: its latency is paid on every call through the function. (DB and network are excluded, since both are awaited in async code, and the un-awaited calls a static pass sees are result materializers or chained awaits, not blocking round-trips; subprocess and filesystem are synchronous by construction.) Advisory: a latency signal ranked by centrality, not a defect. - **`nested_loop_quadratic`**: a data-dependent loop nested inside another (O(n²)) in a hot function. The centrality gate makes the list short and reviewable (it cut a 13x volume of raw nested loops), but centrality answers "is this function important", not "is n large", so it ships advisory and informational only, never at a weight that moves the score. A few markers are language-specific, contributed by that language's dialect (see below) rather than the shared core: - **`regex_compile_in_loop`** (Java, Go): a `Pattern.compile` / `regexp.MustCompile` recompiled every iteration instead of hoisted. Skipped on Python / .NET, which cache compiled patterns; on Go it fires only for a string-literal pattern (a dynamic argument may legitimately vary per iteration and cannot be hoisted). - **`defer_in_loop`** (Go): a `defer` inside a loop holds the deferred handle until the enclosing function returns, not the iteration: the classic Go file or row-handle leak. A pure syntactic shape, very high precision. - **`goroutine_in_unbounded_loop`** (Go): a `go ...()` spawned per element of a `for k, v := range coll` loop, with no concurrency bound: a spawn explosion (use a worker pool or bounded `errgroup`). Restricted to the two-variable `range` form, which is only legal over a collection (a single-variable `for i := range n` is a bounded count loop). Advisory. - **`list_insert_zero_in_loop`** (Python): `lst.insert(0, x)` each iteration shifts the whole list (O(n²)); use `collections.deque.appendleft`. Gated to a literal `0` index and to a list not re-created each iteration. - **`pd_concat_in_loop`** (Python): `pd.concat([acc, chunk])` inside a loop copies the whole frame each pass (O(n²)); collect chunks and concat once. - **`json_parse_in_loop`** (JS/TS): the `JSON.parse(JSON.stringify(x))` deep-clone idiom in a loop (use `structuredClone`). Restricted to that idiom; parsing a distinct payload each iteration is necessary work, not waste. Advisory. - **`array_spread_in_reduce`** (JS/TS): `arr.reduce((a, x) => [...a, x], [])` rebuilds the accumulator every step (O(n²)); push-and-return instead. The `.reduce` is itself the loop, so this fires regardless of an enclosing loop. - **sync-over-async** (C#, via `blocking_sync_in_async`): `.Result` / `.Wait()` / `.GetAwaiter().GetResult()` inside an `async` method blocks a thread-pool thread. C# is the one non-Python language with real `async`/`await`. - **`sql_cartesian_join`** (SQL): a comma-join (`FROM a, b`) with no join predicate anywhere in the statement produces the full cross product (O(n·m) rows). An explicit `CROSS JOIN` states intent and is not flagged; a comma-join whose statement carries a `WHERE` is old-style join syntax and is not flagged either. Detected on the sqlglot AST; a statement whose parse looks garbled (dialect mismatch) emits nothing. Advisory. Each performance finding's `details` carry the `boundary_kind` it crosses, a `cross_function` flag, and the reachability `path` for the cross-function case. Severity is ranked by centrality (an N+1 in a high-traffic, churny function outranks one in a leaf), not by raw count. **Languages.** The performance signal fires on Python, TypeScript/JavaScript, Java, Go, and C#. Each language is a self-contained `PerfDialect` plugin (`analysis/health/perf/dialects/`) that owns its callee-extraction grammar, its execution-sink lexicon, the loop / string / async predicates, and its own marker list, registered in `PERF_DIALECTS` like the rest of the per-language pipeline. A language without a dialect emits no perf findings (never a wrong one). The db/network/filesystem/subprocess lexicons and the per-language precision hazards (Java `.find`/`.get`, GORM `Find`/`Save`, C# in-memory-vs-`IQueryable` LINQ) are each handled inside that language's dialect. The verb sets are gated for precision: distinctive sinks (EF `*Async`, Spring-Data `findBy*`, JDBC `executeQuery`) fire on name alone, while ambiguous verbs require file-level db-import evidence. `io_in_loop` is validated across languages on an 11-repo OSS corpus: Go 96.7%, TypeScript 100%, Python 96.2% hand-labeled precision; the `blocking_sync_in_async` C# `.Result`/Result-pattern collision and the Go `*sql.Rows.Scan` cursor FP were caught and fixed by that validation. `string_concat_in_loop` is validated at 100% (26/26) after a reset-per-iteration guard (an accumulator re-initialized each iteration is bounded, not O(n²)). `nested_loop_quadratic` now fires only on a same-collection shape (two nested loops over the same collection = all-pairs O(n²)) instead of raw nesting depth; that makes it precision-safe-by-construction but rare, so it stays advisory-only, as do `blocking_io_under_lock`, `pd_concat_in_loop`, `json_parse_in_loop`, and `goroutine_in_unbounded_loop` (high-precision by construction, low corpus recall). **Soundness limits (honest, by design).** Performance is a static signal, so it under-reports rather than over-reports (these cap recall, not precision): dynamic dispatch, monkeypatching, and callbacks-as-values produce no `calls` edge and are invisible; ORM lazy-load N+1 fires on attribute access (no visible call) and is explicitly out of scope (this includes Hibernate lazy-load N+1, which fires on a getter, and EF Core navigation-property lazy load), so we catch explicit repository or query calls in loops, not attribute-triggered lazy loads; chains longer than three hops from the loop are not followed; and an unmodelled library is untyped (`None`), so its sinks don't fire. We call this performance RISK, never measured performance, and never fold it into the defect score. The commit-agreement precision study and its caveats are published in the [perf-detection methodology](https://github.com/repowise-dev/repowise-bench/blob/master/perf-detection/METHODOLOGY.md). **Verified findings.** `serial_await_in_loop` and `nested_loop_quadratic` are advisory because a quick scan cannot tell whether a loop's iterations depend on each other. When a deeper analysis can prove they do not, the finding is marked verified: it asserts the fix (fan out the awaits, or a genuine quadratic scan) instead of hedging, and carries `dataflow_verified: true`. The check is conservative (anything it cannot prove stays advisory) and only touches Python, TypeScript, JavaScript, and Go. It changes the wording, not the score. Performance surfaces exactly where maintainability does: a `performance_average` on the overview summary and MCP `kpis`, a per-file `performance_score`, a Performance KPI card and per-pillar finding filter on the dashboard, the per-file Health tab and drawer, and a `Performance risk` line in CLAUDE.md and the CLI `status` summary (each omitted/`null` on indexes built before the detectors landed). The dimension names are mirrored in `@repowise-dev/types` (`HEALTH_DIMENSIONS`) with a parity test on each side. ## The overall score: defect, not a blend The single number repowise surfaces as the headline (the dashboard ring, the band, the badge, the "does the score find the bugs?" stat) is, and stays, the defect score. Maintainability and performance are presented as co-equal pillars, not blended into the headline. This is a deliberate decision, for three reasons: 1. **Band calibration.** The Healthy/Warning/Alert cutoffs are calibrated to the defect score (Alert is roughly 17x the defect rate). A blended headline would invalidate those boundaries with no recalibration corpus behind the new number. 2. **Honesty of the validation stat.** "Does the score find the bugs?" is a claim about the defect pillar; it must stay bound to the number it measures. 3. **Different precision profiles.** Maintainability is expert-set and performance is high-precision/low-recall advisory. Neither is a calibrated bug predictor, so neither should move the bug-calibrated headline. A golden test (`tests/unit/health/test_scoring_dimensions`) locks the defect score byte-for-byte against the pre-split single score, so no pillar can ever regress it. Introducing a blended overall score would require a written rationale and a recalibration plan; until then, overall = defect. ## Bands and distribution On top of the 1-10 number, every score falls into one of three **bands**. These are the single categorical scheme repowise surfaces (there is deliberately no letter grade: a letter on top of the number would be a third overlapping scale with arbitrary cliffs): | Band | Score | Meaning | |------|-------|---------| | **Healthy** | `≥ 8.0` | Low-risk, maintainable. | | **Warning** | `4.0 – 8.0` | Worth watching; rising complexity or process risk. | | **Alert** | `< 4.0` | High-risk; concentrates defects. | The cutoffs are not arbitrary. On our calibration corpus, Alert files carry roughly 17x the per-file defect rate of Healthy files, so the band boundaries are empirically defensible. They are defined once in core (`analysis/health/grading.py`) and mirrored in `@repowise-dev/types` for the UI; a parity test on each side locks the values. The **health distribution** is the NLOC-weighted split of the repo across the three bands: what share of your code (by volume, not file count) is Healthy, Warning, or Alert. `repowise health` prints it as a one-line summary; the dashboard renders it as a bar. ```text Distribution (by code volume): 8% alert (12 files) · 21% warning (88 files) · 71% healthy (410 files) ``` ## Badge `repowise health --badge` prints ready-to-paste Markdown for a README health badge (a Shields-style color and `N.N/10` badge, no letter). A running Repowise server (or the hosted app) also serves the badge directly: ```text GET /api/repos/{repo_id}/health/badge.svg # self-rendered flat SVG GET /api/repos/{repo_id}/health/badge.json # Shields endpoint payload ``` Embed the dynamic form via Shields: ```markdown ![code health](https://img.shields.io/endpoint?url=/api/repos//health/badge.json) ``` ## Does the score find the bugs? The score is only worth anything if the files it flags are the files that actually break. After an index, repowise checks that against the repo's own history and prints a one-line callout: ``` Does the score find the bugs? 16/20 lowest-health files had a bug fix in the last 6 months, 3.3x the 24% baseline (80% vs 24%). ``` It ranks every file by health score, takes the 20 lowest, and counts how many were touched by a `fix:` commit in the trailing ~180-day window (the same signal the `prior_defect` marker uses). That precision is contrasted with the repo-wide base rate (the fraction of all files with a recent fix) to give the lift. The same number appears on the web `health` and `overview` dashboards, where it expands into a per-K table (worst 10/20/30), a concentration stat (what share of recently-fixed files fall in the least-healthy 20%), and the exact flagged files. Agents can read the same stat over MCP, so a coding agent can confirm the score is trustworthy on this repo before acting on it: ```python # MCP: dashboard mode, the same precision@K / lift block get_health(include=["accuracy"]) ``` It stays silent on repos without enough history to be honest (fewer than 25 scored files, or fewer than 5 recently-fixed files). One caveat it discloses: `prior_defect` is itself one (down-weighted) input to the score, so this is an association on the indexed history, not a leakage-free forward prediction. ### Beyond your repo: cross-project validation The per-repo callout is a quick local check. The score is also validated across projects, so the number is not tuned to a single codebase. Every file is graded at a commit that precedes the bug window, so no future information leaks in, then checked against what actually broke: - Across 21 repositories, 9 languages, and 2,826 files, the score reaches a mean ROC AUC of **0.737** (95% CI 0.683 to 0.787). ROC AUC measures how often the score ranks a known-buggy file worse than a clean one: 0.5 is a coin flip, 1.0 is perfect. It beats raw churn by 0.10 AUC and a prior-defects baseline by 0.117 AUC (DeLong p < 1e-9). - On the public PROMISE/jEdit defect dataset, which played no part in calibration, the same markers score AUC **0.76 to 0.78**. That held-out result is the main evidence the signal is not overfit. - The limits ship with the wins. Among files of similar size the signal is weak (within-size-band AUC near 0.49), so part of the headline is simply that larger files carry more risk; and a prior-defects baseline still finds bugs slightly more efficiently under a fixed review budget (it wins on Popt by 0.085). Full methodology, confidence intervals, and the named head-to-head against CodeScene are in the [benchmark hub](https://github.com/repowise-dev/repowise-bench). ## The markers **brain_method**: A single function that is simultaneously long, deeply nested, highly complex, and central to the dependency graph. The strongest single signal of fragile code. Centrality is judged against the repo's own dependency density (top-quintile of connected files, with an absolute hub bar), so it fires on sparse-graph languages too, not just Python. **low_cohesion**: A class whose methods split into groups that share no fields and don't call each other (LCOM4 ≥ 2). Measured by the walker's class-level model; a high value usually means several smaller, single-responsibility classes are hiding inside one. **god_class**: A large class (≥ 200 lines, ≥ 15 methods) that also contains a brain method. Size alone isn't flagged: the brain-method requirement keeps flat data holders and config tables from firing. **nested_complexity**: Functions with control-flow nesting ≥ 4 levels. Hard to read, hard to test, hard to refactor. **bumpy_road**: Multiple branches stacked at the same depth, usually a sign the function is doing several jobs that should be split. **complex_method**: Cyclomatic complexity ≥ 9. Each branch is a path the test suite has to cover. **large_method**: Long functions that also carry at least some branching. A long-but-perfectly-flat body (a big config/data literal, a wall of sequential assignments) is a layout artefact rather than a complexity smell, so it is excluded: the trigger is about length-with-substance, not raw line count. **primitive_obsession**: Many primitive parameters in one signature. A dataclass or parameter object would name the inputs. Suppressed in very small modules (under ~60 non-blank lines), where a wide signature is an idiomatic config/builder/forwarder rather than a design smell. **dry_violation**: Cross-file code clones, detected by a native Rabin-Karp rolling hash over tree-sitter tokens (variable renames don't hide a clone). Pairs are ranked by co-change so dormant duplicates rank lower than active ones. **untested_hotspot**: A hotspot file with low or zero coverage and many dependents. The textbook "write tests before refactoring" case. **coverage_gap**: Non-test files with meaningful uncovered surface. Severity grades along coverage depth. **coverage_gradient**: A continuous coverage deduction that scales with the uncovered fraction (`4.0 × (1 − line_coverage_pct/100)`, capped), so files stay penalised in proportion to how much code is untested rather than only when they fall below a hard threshold. Fires across the whole 0-100% range for files with known coverage; silent (no imputation) where coverage was never ingested. **developer_congestion**: Too many active authors touching the same file. Usually an ownership problem dressed up as a code problem. **knowledge_loss**: The primary authors of the file are no longer active on the project. Refactor while someone still remembers why. Gated on recent activity: an abandoned-but-stable file is low risk (the survivor effect), so this only fires while the code is still being changed. **hidden_coupling**: Files that consistently change in the same commits without an explicit import or dependency edge between them. Captures behavioral coupling (shared protocols, parallel config, copy-pasted constants) that static analysis cannot see. Tier-aware: empty on ESSENTIAL-tier repos until co-change backfill runs. **complex_conditional**: Branch / loop guards that combine three or more boolean operators. Severity grows with the operator count (LOW at 3, MED at 4, HIGH at 5, CRIT at 6+). **function_hotspot**: Functions that are both structurally complex and frequently modified. Per-function modification counts come from a per-line blame index built once per file (FULL git tier) and shared with `code_age_volatility`. Fires when a function's distinct-commit count is at or above the repo-wide p80 AND the function meets a structural floor (CCN ≥ 10 or max nesting ≥ 3). Tier-aware: returns no findings on ESSENTIAL-tier repos until `backfill_blame()` runs. **code_age_volatility**: Functions whose median line age is at least a year old that are suddenly being modified. Strong defect predictor: the editor is usually working in unfamiliar territory. Uses the same per-line blame index: `median_age_days` from per-line author timestamps, `recent_mod_count` from distinct shas inside the last 30 days. Severity escalates with both axes (CRIT when median age ≥ 2y AND ≥ 5 recent commits). Tier-aware: same ESSENTIAL no-op as `function_hotspot`. **ownership_risk**: Long-run ownership dispersion. Counts minor contributors (authors who each own less than 5% of the file's commits) and the dominant owner's share. Many drive-by authors with no clear owner is the single strongest defect correlate in the literature (Bird et al.). Fires on files with ≥ 5 commits where ≥ 3 contributors are minor or no owner holds 40%. Complements `developer_congestion`, which measures active (90-day) contention rather than lifetime dispersion. **churn_risk**: Relative churn: the fraction of a file's lines rewritten in the last 90 days, normalized by file size. A file whose recent window rewrote more lines than it contains is structurally unstable regardless of how big it is. Because the trigger is a ratio to NLOC, it does not simply re-flag large files. Fires when the file is actively churning (≥ 5 recent commits, top quartile of repo churn) and relative churn ≥ 1.0. **change_entropy**: How scattered a file's change history is, adapted from Hassan's History Complexity Metric. Each commit is treated as a one-period window whose entropy is `log2(files-touched)`, distributed across its files and decayed over time. A file repeatedly caught up in wide, scattered commits scores high; one changed in focused, single-purpose commits stays low even if it changes often, so this is not a churn proxy. Fires when the file is actively changing (≥ 3 recent commits) and sits in the top 20% of repo change entropy. Tier-aware: silent on ESSENTIAL-tier repos (no co-change walk). **co_change_scatter**: Breadth of coupling. Counts the distinct files a file co-changes with above the indexer's recording threshold; a high count means editing it tends to ripple across the codebase (shotgun surgery). This is the breadth complement to `hidden_coupling`, which flags specific undeclared coupled pairs. Fires on actively-changing files (≥ 3 recent commits) coupled to ≥ 8 distinct partners. Tier-aware: silent on ESSENTIAL-tier repos. ## Test quality These two fire only on test files and live in a deliberately small category (cap −0.5), so a noisy test never dominates its own health score. **large_assertion_block**: A test that fires 15 or more assertions in one uninterrupted run. Such a test usually checks several behaviours at once: when it fails it points at a line, not a cause, and it's brittle to unrelated changes. Splitting it into focused cases makes failures legible. **duplicated_assertion_block**: The same run of assertions copy-pasted across tests. Reuses the Rabin-Karp clone detector and keeps only the clone regions that overlap an assertion block on a test file. A change to the asserted behaviour then has to be edited in several places, and usually isn't, so the copies drift. **error_handling**: Swallowed-exception and unsafe-unwrap anti-patterns: an empty or comment-only `catch`/`except` body, a Python catch-all `except:` / `except Exception:`, Rust `.unwrap()` / `.expect()` / `panic!`-family macros, and Go's empty `if err != nil {}` or blank-identifier discard of a call's error. Detection is precision-first: only the unambiguous shapes fire, and an unsupported language or parse failure yields no signal rather than a guess. Each occurrence is a LOW finding anchored to its line, and the whole category is capped at −0.5 per file: this is an advisory maintainability flag (every linter is expected to surface `except: pass`), deliberately not a calibrated defect predictor. On the 21-repo benchmark it is AUC-neutral, so it is excluded from the weight calibration and bounded so it can never move a file's score by more than half a point. ## Test coverage Pass coverage reports straight into the analyzer: ```bash pytest --cov --cov-report=lcov:coverage.lcov repowise health --coverage coverage.lcov # Cobertura, Clover, or multiple sources also work: repowise health \ --coverage backend/coverage.xml --coverage-format cobertura \ --coverage frontend/lcov.info ``` Formats are auto-detected: **LCOV**, **Cobertura** XML, **Clover** XML, and a **normalized JSON** (`repowise-coverage-v1`) keyed by repo-relative path; the last lets you feed coverage from any runner once it's mapped to one shape: ```json { "format": "repowise-coverage-v1", "files": { "src/foo.py": { "line_coverage_pct": 87.5, "total_coverable_lines": 40 } } } ``` Coverage data feeds into `untested_hotspot`, `coverage_gap`, and `coverage_gradient` (a continuous deduction proportional to the uncovered fraction), and shows up on the `/repos//health/coverage` dashboard. ## Refactoring targets ```bash repowise health --refactoring-targets ``` A health score tells you a file is in trouble; a refactoring target names the specific fix. Repowise emits one structured `RefactoringSuggestion` per opportunity, computed deterministically during the health pass from data it has already produced: the call graph, the class cohesion model, the clone pairs, and git co-change. No re-parse, no LLM, inside the same <30s budget. Five detectors ship today: | Type | What it names | Built from | |------|---------------|------------| | **Extract Class** | The cohesion groups an incohesive / god class should split into: the exact methods + fields per group. | LCOM4 union-find components; god-class shape confirmed by Lanza-Marinescu (WMC = Σ McCabe, TCC). | | **Extract Helper** | A clone's exact occurrences and where the shared helper should live. | Rabin-Karp clone pairs (line ranges + token count + co-change); extraction site = community centroid of the files. Transitive clones are clustered into one suggestion, not pairwise nags. | | **Move Method** | A feature-envy method and the class it actually belongs to. | Jaccard distance of the method's entity set (fields/methods it touches) to each class over the call graph; fires only when a foreign class is clearly nearer than its own. | | **Break Cycle** | The minimal set of import edges to invert to break a dependency cycle. | A strongly-connected component, then a greedy minimum feedback arc set over the real import edges. | | **Split File** | The cohesive files an oversized module should decompose into: which symbols move where, plus the import edits in every dependent. | Community detection (Leiden/Louvain) over a weighted intra-file symbol graph, gated on partition **modularity**. Language-agnostic (reads `defines`/`calls` edges); the file-level analog of Extract Class. | Each suggestion is structured data, not a string: a `plan` (the split groups, the move target, the cut edges), the `evidence` that justifies it (LCOM4=3, the clone ranges, the cycle size), the `impact_delta` (the health deduction it recovers), an `effort_bucket` (`S`/`M`/`L`/`XL`), and a `blast_radius` (the callers and co-changing files that must move with it). Human-readable text is rendered from the structure at the edges (CLI / MCP / web); the structure is the source of truth. Ranking is graph-aware. Suggestions sort by `impact_delta × call-graph centrality × blast_radius`, so a plan on a central hub file outranks the same plan on a leaf, not the churn-only sort other tools use. The default surface honors a `min_confidence` gate (`low` / `medium` / `high`, default `medium`). For agentic workflows, the same data is one MCP call away: ```python get_health(include=["refactoring"]) # ranked structured plans get_health(targets=["src/api/server.py"]) # one file in detail get_health(targets=["module:src.api"]) # everything in a module ``` The web **Refactoring** tab renders each plan as a card (split groups as a tree, move arrow, clone occurrences with line links, file-split groups with their residual core and import-rewrite list) with a **copy-to-agent** prompt and an opt-in **Generate code** action that expands a plan into generated code plus a unified diff. Code generation runs only on an explicit request, never in the indexing hot path. It is enabled by default and can be turned off with `refactoring.llm.enabled: false`. Full reference: **[docs/REFACTORING.md](REFACTORING.md)**. ## Trends Every health run writes a `HealthSnapshot` row (rolling 50 entries per repo). Each snapshot stores the repo KPIs **and** a compact `{path: score}` map, so the history doubles as a per-file record. Two repo-level alerts run over the history: - **Declining Health**: current `hotspot_health` is ≥ 0.5 below the snapshot 5 runs ago. - **Predicted Decline**: the three most recent snapshots are each strictly below the one before. Inspect from the CLI: ```bash repowise health --trend ``` Or from MCP: ```python get_health(include=["trend"]) ``` ### Per-file score over time The same snapshots power a per-file trajectory: a file's score plotted across runs (CodeScene's signature view). It surfaces on the file's Health tab and in the health drawer as a sparkline, with a delta vs. the previous run and a Declining flag (the per-file version of the alerts above: ≥ 0.5 below the run 5 snapshots back, or three consecutive drops). A trend is silent on thin history: it needs at least two snapshots that both carry the file, otherwise the UI shows "no score history yet" rather than a misleading single dot. Gaps (a file absent from some snapshots) are skipped, not zero-filled. Fetch it directly: ```bash # REST: one file's series + current delta + declining flag GET /api/repos/{repo_id}/health/files/trend?file_path=path/to/file.py ``` ```python # MCP: targeted mode attaches a per-file `trends` block get_health(targets=["path/to/file.py"]) ``` ## File signals Every file carries process, people, and topology signals we already compute during indexing. They answer "should I worry about this file?" with context the score alone can't, and they surface together on the file's Health tab and in the health drawer, grouped, captioned, and silent ("no signal") when the underlying data is absent rather than imputed. | Group | Signal | Means | |-------|--------|-------| | Process | Prior defects | Bug-fix commits touching this file in the last ~6 months. `0` is a real, reassuring signal. | | Process | Change scatter | `change_entropy_pct` (0-100): how spread out its edits are across commits. High = chaotic change. | | Process | 90-day churn | Commits and lines added/deleted in the trailing 90 days. | | Process | Age | How long the file has existed in git history. | | People | Primary owner | The all-time top committer and their commit share. | | People | Recent owner | The top committer in the last 90 days. A different name from the primary owner flags a knowledge handoff. | | Topology | Dependents | How many files depend on this one (graph in-degree). | | Topology | Dependencies | How many files this one depends on (graph out-degree). | These are pure surfacing: no new measurement, no scoring. Fetch them directly: ```bash # REST: embedded in the file-detail aggregate and the drawer breakdown GET /api/repos/{repo_id}/files/{path} # data.health.signals GET /api/repos/{repo_id}/health/files/breakdown?file_path=path/to/file.py ``` ```python # MCP: attached to the get_context health block (null fields dropped) get_context(targets=["path/to/file.py"], include=["health"]) # MCP: also on get_health targeted mode, one `signals` object per metric get_health(targets=["path/to/file.py"], include=["signals"]) ``` ## Hotspot anatomy Two views dissect where risk concentrates, both plotted from data already on disk (churn from the git indexer, complexity from the walker, blame at function granularity). ### Churn × complexity One dot per recently-changed file: the x-axis is its 90-day commit count (churn), the y-axis is its max cyclomatic complexity, dot size is NLOC, and dot color is the health band. Dashed guides sit at the repo's median churn and median complexity, so the tinted top-right corner reads "busier and more complex than a typical file here", the refactor zone, where volatility and tangle collide and defects concentrate. It lives on the **Hotspots & churn** dashboard tab, toggleable with the churn × bus-factor view. ```bash # REST: repo-level point list (one point per churned file) GET /api/repos/{repo_id}/health/churn-complexity ``` ```python # MCP: the same point list, dashboard mode get_health(include=["churn_complexity"]) ``` Files with no recent churn are omitted (they have nothing to say on the churn axis); complexity is never used to filter, so a high-churn, low-complexity file still shows in the bottom-right ("changes constantly but stays simple"). ### Functions by churn The file's Health tab lists its functions ranked by modification count, with the 90-day recent-mod count, median age, and blame owner per function, the same `git_function_blame` rollup the symbol page uses. It promotes per-function ownership and volatility out of the buried marker cards into a first-class table, so "which function in this file is the actual hotspot" is one glance away. ## Configuration Per-file overrides live in `.repowise/health-rules.json`: ```json { "disabled_biomarkers": ["primitive_obsession"], "rules": [ { "path": "tests/**/*.py", "disabled_biomarkers": ["large_method", "complex_method"] }, { "path": "src/legacy/**", "disabled_biomarkers": ["dry_violation"] } ] } ``` `path` holds an fnmatch-style glob over the repo-relative POSIX path (`path_glob` and `glob` are accepted aliases). ### Severity overrides and profiles A team can soften a signal it treats as advisory without disabling it outright by remapping its severity (typically a demotion). Overrides apply repo-wide and per-path; an explicit per-path entry wins over the repo-wide one. ```json { "profile": "small-team", "severity_overrides": { "complex_method": "low" }, "rules": [ { "path": "src/generated/**", "severity_overrides": { "large_method": "low" } } ] } ``` Accepted severity values are `low`, `medium`, `high`, `critical`. The named `small-team` profile expands to a preset demotion of the process/people and noisier structural signals a 1-3 person repo can't support; an explicit `severity_overrides` key always wins over the preset. Only the severity label is tunable. The per-marker weight multipliers and the category caps are the calibrated constants the benchmark numbers rest on and are deliberately not overridable, so a team's local policy never changes what the published accuracy claims mean. Markers that carry a continuous deduction (`coverage_gradient`) are unaffected by severity remaps. ## Incremental updates `repowise update` only re-scores the changed files. Findings and metrics for unchanged files stay put, no nightly full re-index needed. ## Status one-liner `repowise status` includes a single-line health summary (the maintainability and performance pillars append once the index has populated them): ``` Health: 7.4 (avg) · 6.2 (hotspots) · 2.1 (worst: payments/processor.ts) · 7.0 (maintainability) · 9.1 (performance) ``` ## Comparison How the code-health signal compares to the tools it is most often placed next to. The honest dividing line: each of these has a rules or linter engine and a definitional rating; repowise predicts which files harbor the next bug and validates that forward in time against a labeled bug corpus. Each tool also has real strengths repowise does not claim to beat, noted under the table. | Capability | Repowise | CodeScene | SonarQube | Code Climate / Qlty¹ | Codacy | |---|---|---|---|---|---| | Per-file health / maintainability score | ✅ 1-10, 25 markers | ✅ 1-10 Code Health | ⚠️ A-E ratings from rule counts | ✅ A-F grade | ✅ A-F grade | | Score uses git / behavioral signals (churn, ownership, co-change, hotspots) | ✅ | ✅ its core | ❌ static rules only | ⚠️ churn-vs-complexity, no co-change | ❌ | | Cross-file / call-graph analysis | ✅ interprocedural call graph | ⚠️ git temporal coupling, not a static call graph | ⚠️ cross-file taint (security only) | ❌ file-local | ❌ file-local | | Defect-validated score vs a real bug corpus (published numbers) | ✅ mean ROC AUC 0.737, held-out 0.76-0.78 | ⚠️ "Code Red" study (defect density / cycle time), no per-file AUC | ❌ | ❌ | ❌ | | Static performance risk (N+1 / IO-in-loop across the call graph) | ✅ | ❌ | ❌ | ❌ | ❌ | | Test-coverage ingestion | ✅ LCOV/Cobertura/Clover/JSON | ✅ JaCoCo/Cobertura/LCov/Clover | ⚠️ imports reports | ✅ | ✅ | | Untested-hotspot detection (coverage × risk) | ✅ | ✅ | ❌ | ⚠️ coverage beside hotspots, no composite | ❌ | | Concrete cross-file refactoring plans (Extract Class / Move Method / Split File) | ✅ + opt-in LLM code-gen | ⚠️ AI fixes 5 in-function smells, no cross-file plan | ❌ flags + AI CodeFix | ❌ flags only | ❌ flags + per-issue AI autofix | | Health trend tracking + declining-health alerts | ✅ | ✅ | ✅ quality gates, new-code focus | ✅ | ✅ quality gates | | AI-agent / MCP integration | ✅ MCP tools | ✅ Code Health MCP server | ✅ official MCP server | ❌ | ✅ official MCP + Guardrails | | Security scanning (adjacent axis) | ⚠️ separate security layer | ⚠️ secondary | ✅ strong (cross-file taint) | ❌ | ✅ full SAST/SCA/secrets/IaC | | License | ✅ AGPL-3.0, free internal use | ⚠️ free OSS; paid per active author | ⚠️ Community free; paid by LOC | ⚠️ free OSS; paid teams | ⚠️ free OSS; paid per dev | ¹ Code Climate Quality was spun out as Qlty Software in November 2024; "Code Climate" now sells the Velocity engineering-analytics product. This row tracks the Quality / Qlty capabilities. SonarQube "Security Hotspots" are static-analysis security flags, not git-churn hotspots. What each tool does that repowise does not: **SonarQube** has the broadest security scanning (cross-file taint analysis), the widest language coverage, and the most widely adopted merge-gate model. **CodeScene** is the most mature behavioral-analysis product (knowledge maps, off-boarding simulation, 28+ languages) and has the only published empirical defect study in this group (the "Code Red" correlation study, Tornhill and Borg, TechDebt 2022). **Code Climate / Qlty** defined the churn-vs-complexity quadrant and pairs with Velocity for DORA analytics. **Codacy** has the widest security suite of the four and polished PR automation. ## See also - [`packages/core/src/repowise/core/analysis/health/README.md`][hr]: developer overview of the layer. - Sub-package READMEs: `complexity/`, `coverage/`, `duplication/`, `biomarkers/`. [hr]: ../packages/core/src/repowise/core/analysis/health/README.md