R-0310 March 2026 · 7 min

A note on evals.

How we measure whether Mara is actually useful for the work analysts do, not the work benchmarks pretend they do.

Most benchmarks for security models test trivia. They ask the model to identify a vulnerability class from a snippet, or to name an ATT&CK technique from a sentence. The model gets a number. The number goes up over time. None of this is the work.

The work is sitting at 3 a.m. with a half-formed alert, three unrelated logs and a manager asking what to do. The work is reading a sandbox report that is mostly noise and deciding which two lines matter. The work is writing the post-mortem so that the people on next week's shift do not make the same mistake.

What we measure instead.

—Triage realism. Does Mara's ranking of a queue match how a senior analyst would have ranked it?
—Discriminating questions. When uncertain, does Mara ask the question that actually separates hypotheses?
—Calibration. When Mara says ‘high confidence’, is it right at the rate we tell users to expect?
—Useful refusals. Does Mara decline to fabricate, and decline gracefully?
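
Two of these measures are easy to make concrete. The sketch below is illustrative only and is not the harness behind any published numbers. It assumes triage realism is scored as Kendall's tau between Mara's ordering and a senior analyst's ordering of the same alerts, and that calibration is checked by comparing the observed hit rate for each confidence label against the rate users were told to expect. The function names, the alert IDs and the 0.9 target for ‘high confidence’ are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def rank_agreement(model_rank, analyst_rank):
    """Kendall's tau between two orderings of the same alert IDs (no ties).

    Assumption: the source does not say how ranking match is scored;
    tau is one common choice, used here for illustration.
    """
    if len(model_rank) < 2 or set(model_rank) != set(analyst_rank):
        raise ValueError("need the same two or more alerts in both rankings")
    analyst_pos = {alert: i for i, alert in enumerate(analyst_rank)}
    concordant = discordant = 0
    # combinations() yields pairs in model order, so x is ranked above y by the model.
    for x, y in combinations(model_rank, 2):
        if analyst_pos[x] < analyst_pos[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def calibration_table(judgements):
    """Map each stated confidence label to (count, observed hit rate).

    judgements: iterable of (label, was_correct) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # label -> [total, correct]
    for label, was_correct in judgements:
        counts[label][0] += 1
        counts[label][1] += int(was_correct)
    return {label: (n, hits / n) for label, (n, hits) in counts.items()}


def calibration_gap(table, promised):
    """Observed rate minus promised rate, per label that has a promise."""
    return {label: table[label][1] - promised[label]
            for label in table if label in promised}


# Hypothetical data: one swapped pair at the top of a four-alert queue.
tau = rank_agreement(["a3", "a1", "a2", "a4"], ["a1", "a3", "a2", "a4"])
print(round(tau, 3))  # 0.667

# Hypothetical data: ten 'high' calls, eight of them right.
table = calibration_table([("high", True)] * 8 + [("high", False)] * 2)
gap = calibration_gap(table, {"high": 0.9})  # 0.9 is an assumed target
print(round(gap["high"], 3))  # -0.1
```

A tau of 1.0 means the two orderings are identical and -1.0 means one is the reverse of the other. A negative calibration gap means the label was right less often than users were told to expect.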

These metrics are harder to publish because they require a panel of practitioners and not a leaderboard. We publish them anyway. We publish the failure modes too. A model is something you have to live with, not something you launch.

Mara is a research preview from venode. Feedback, corrections and disagreements are welcome at hello@venode.ai.

