Now accepting pilot partners

See how they actually work.
Not how they interview.

AI-native work trials that predict on-the-job performance — with evidence.

Trusted by founders from YC, SPC, and more

The Problem

Hiring is broken.
Everyone knows it.

That sinking feeling three months in when your "great interview" turns into a performance problem.

Resumes lie

Credentials don't predict performance. You're hiring based on marketing, not capability.

Interviews are theater

Candidates perform rehearsed answers. You learn who interviews well, not who ships.

LeetCode is cargo cult

Inverting binary trees has nothing to do with building products. Memorization isn't problem-solving.

AI changes everything

Your best hire uses AI to 10x their output. Traditional interviews penalize this.

"The best predictor of future performance is past performance — in similar conditions."

So why are we still using artificial conditions to predict real work?

How It Works

Four steps to certainty

A work trial that shows you who someone is — not who they pretend to be in an interview.

01

Local Environment

Real environment. Real tools.

Candidate works on their own machine with their choice of AI tools. Just like their actual job.

Watcher connected
$ claude --version
Claude Code v1.0.14
$ git clone project-repo && cd project-repo
$ npm run dev
02

Ambiguous Assignment

Designed to reveal agency.

A real-world project with incomplete requirements. See how they handle ambiguity and turn chaos into shipped code.

Assignment Brief

Build a support inbox triage system. Ingest tickets, cluster by topic, propose auto-replies, admin UI for review.

"Requirements intentionally incomplete. Ask questions. Make decisions. Ship something real."

ambiguity · agency · tradeoffs
03

Observer Agent

Intelligent probing. Not surveillance.

Our AI observes how they work and asks surgical questions at key moments to understand their thinking.

Observer: I noticed you chose SQLite over Postgres. What's driving that decision?
Candidate: Faster to prototype. I'll note Postgres migration as tech debt if this scales.
Observer: What would you cut first if your time were cut in half?
04

Evidence-Based Report

Signal, not vibes.

Rubric scores, evidence clips, and a clear recommendation you can act on.

Candidate Report — HIRE
Intelligence: 4.5
AI Tool Usage: 4.8
Judgment: 4.2
Agency: 4.6
"Strong evidence of independent thinking and effective AI usage..."
The Report

Decisions backed by evidence

Not a vibe. A report that makes the hiring decision obvious — with receipts.

Candidate Evaluation Report

Full-Stack Engineer • Jan 2025

STRONG HIRE
Active Time

14.2 hours

Commits

47

Decisions Logged

12

Rubric Scores

Intelligence: 4.8
Grit: 4.2
AI Tool Usage: 4.0
Openness: 4.6

Key Evidence

14:23 — "Chose to ship auth-less MVP first, documented security as Day 2 priority. Explicitly traded speed for completeness."

16:45 — When tests failed, diagnosed root cause in 8 mins using Claude Code. Fixed without introducing regressions.

What we measure

Six dimensions that predict on-the-job success. Each scored 1–5 with evidence.

Intelligence

Problem decomposition, pattern recognition, learning speed

Grit

Persistence through ambiguity and setbacks

AI Tool Usage

Leverages AI effectively as a force multiplier

Openness

Receptivity to feedback, new approaches, and change

Judgment

Prioritization, tradeoffs, knowing what matters

Agency

Proactive decisions, owns outcomes, escalates with options

Why It Works

Built for the real world

Not another coding test. A rethinking of how to predict who will ship.

Tests ambiguity tolerance

Incomplete specs, shifting requirements. See who thrives in chaos and who freezes.

Measures actual agency

Do they wait for permission or make decisions? Agency is observable, not self-reported.

Observes AI usage (doesn't ban it)

We measure how well they leverage AI tools — not whether they can work without them.

3-day signal vs 1-hour snapshot

Energy management, iteration cycles, how they handle getting stuck. Patterns, not performances.

Defensible hiring decisions

Every recommendation backed by timestamped evidence. Show your board exactly why you hired them.

Works with AI-native candidates

The best engineers use every tool available. So should the evaluation.

Traditional hiring vs. Polymath

Traditional Hiring
  • Resume screening (marketing document)
  • LeetCode (tests memorization)
  • Behavioral interviews (rehearsed answers)
  • "Culture fit" (vibes-based)
  • Decision made on 3 hours of performance
Polymath Society
  • Real work trial (actual output)
  • Ambiguous problems (tests thinking)
  • Observed behavior (not self-reported)
  • Evidence-backed rubric (defensible)
  • 3-day signal (patterns, not moments)
Who It's For

For teams who can't afford to guess

Bad hires cost 6–12 months. The wrong engineer can kill a startup.

Startup Founders

Seed to Series B

Every engineer either accelerates or drags. Get certainty before committing $200K+/year.

"We've been burned twice by 'great interviews.' Never again."

Hiring Managers

Engineering leads

Your gut says hire, but can you defend it? Get a report that makes the decision obvious.

"Finally, something that shows me how they actually work."

Technical Recruiters

In-house teams

Resumes all look the same. Give your hiring managers signal they can act on.

"The report sells itself. Hiring managers trust it."

Currently onboarding pilot partners

We're working with a small group of startups to refine the process. Hiring engineers soon? Let's talk.

Become a Pilot Partner
Get Early Access

Join the waitlist

Limited pilot spots available. Be among the first.

No spam, ever · Reply within 48 hours