Fleshed out report
Task

Build an agent that does better on LegalBench-RAG — an open benchmark for case-law retrieval and analysis, scored against ground-truth citations from real legal writing.

Also do anything else you think could help the legal industry adopt AI at large.

Top 1% of Big Tech employees in the Bay Area
3rd in cohort on raw output, among 20 candidates
28% LegalBench-RAG score on the held-out test set, 4× a naive Opus 4.7 baseline (7%)
42 experiments run across the build vs. a cohort average of 18; shot count is a strong proxy for intensity
FDE skills · demonstrated this cohort
  • Has read extensively on agent harnesses: scaffolding, constraints, knowledge management.
  • Built an eval set of 80 test cases from scratch, showing strong eval-engineering instincts.
  • Experimented extensively with retrieval and memory architectures.
Demo · what he built

Walks through the agent end-to-end on a real LegalBench-RAG task: retrieval, reasoning, citation. The last 90 seconds show the vibe-coded interface a lawyer can drive directly. Recorded on day 14 in a single take, unedited.

Percentile vs. big tech engineers (Amazon / Google)

Estimated by big tech employees themselves, based on their co-workers. Dashed ring is the median big-tech engineer.

[Radar chart. Axes: Judgement, Agency, Intensity, Intelligence, Learning speed, Technical affinity, People affinity. Series: this candidate; median big-tech engineer.]
Rating scale · for observations + downsides
  • bad: bottom of working engineers
  • okay: mid-tier of working engineers
  • good: top quartile of working engineers
  • great: top decile of working engineers
  • exceptional: Anthropic researcher / FAANG L6 tier · top 1–2%
What we observed
1 · exceptional

Very high agency and judgement — built a self-improving harness and a UI for lawyer feedback, and read the surrounding literature extensively on his own.

Why
  • Routinely got up to speed on the literature quickly to inform decisions, covering legal terminology as well as agent-harness best practices and retrieval mechanisms. Ramp speed was very high (top 10% of cohort).
  • On day 6 he realised he was iterating on the harness too slowly in code, so he built a UI for lawyers to give feedback, paired with a self-improving harness.
  • Set up his own CI/CD pipeline to test and version-control each experiment.
2 · great

Intense worker — 6th out of 20 on intensity of work for the hours they put in.

Why
  • Put in an average of 6.2 hours every day. Cohort average was 4.2 hours. 5+ hours on 13 of 14 days.
  • Average focus of 6 on a scale of 8 (about 60% of time in flow state); cohort average was 2.5.
  • On day 3, read 11 blogs and papers on harnesses and evals over a span of 4 hours. More stamina than most.
  • Operated with a sense of urgency: set a schedule, set ambitious goals, and got 80% of tasks done on time.
3 · great

Highly parallel AI tool usage.

Why
  • After feedback, went from running AI tools single-threaded 80% of the time to only 10% of the time.
  • On day 3, over a span of 2 hours, managed the design of the lawyer-feedback website and 2 coding agents in parallel.
  • Behaviour change was fairly immediate.
Downsides
1 · okay

Over-delegating to AI.

Why
  • Often handed technical tasks over to AI completely while leaving them underspecified. Only after nearly half an hour of flailing did he decide the AI couldn't do the task and step in.
  • Even for interface design, relied on an AI first draft rather than giving the AI a granular specification.
  • Our prior is that this will improve quickly with feedback.
2 · okay

Over-deliberates decisions instead of getting real-world feedback.

Why
  • Made 5 design iterations over 2 days for the lawyer-facing platform.
  • Did not put it in front of real lawyers. When the AI recommended doing so, he decided against it for fear of burning social capital.
In their own words · self-reported

These are the candidate's own answers about how they like to work. Self-reported, not a measured assessment — useful for fit, not a personality verdict.

How do you best like to work with people? Be as descriptive as you can about how you are with teams, at work and otherwise.

I do my best work heads-down and solo, but I'm not closed off — I like a small team where everyone owns a clear piece, we sync briefly, then go deep on our own. I default to writing things down over scheduling a meeting. In a group I'm more of a quiet builder than the loudest voice; I'll push back on something I disagree with, but usually in writing or 1:1 rather than in a big room. Socially I'm friendly but reserved, and I recharge alone.

How do you like feedback?

Direct and written, so I can sit with it. I tend to act on it a day later rather than in the moment — I like to think it through before changing course.

When are you most productive? What is your ideal way to work?

Early morning and late evening; the middle of the day is my weakest stretch. My ideal day is two or three uninterrupted 3–4 hour blocks, a goal set up front, and notifications off until the block is done.

What drains you?

Lots of small context-switches and back-to-back meetings. Cold outreach to strangers is the thing I avoid most.

Day by day · at a glance
FOCUS chip · % time in flow
  • slow (0.5–1.5): <30% in flow state
  • okay (1.5–3): 30–50% in flow state
  • intense (3–7): 50–70% in flow state
  • very intense (7–8): >70% in flow state
OUTPUT chip · vs. cohort
  • slow (0.5–1.5): bottom of cohort that day
  • okay (1.5–3): around cohort median
  • intense (3–7): top tier of cohort that day
  • very intense (7–8): #1 or #2 of cohort that day
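The chip bands above amount to a small lookup from score to label. A minimal sketch in Python, assuming scores lie in the 0.5–8 range shown in the legend. The names `chip_label`, `EDGES` and `LABELS` are hypothetical. The legend lists each shared edge (1.5, 3, 7) in both neighbouring bands, so assigning an edge to the upper band is an assumption, not something the report states.

```python
from bisect import bisect_right

# Band edges and labels copied from the legend.
# ASSUMPTION: a score exactly on a shared edge (1.5, 3, 7) belongs to the upper band.
EDGES = [1.5, 3.0, 7.0]
LABELS = ["slow", "okay", "intense", "very intense"]


def chip_label(score: float) -> str:
    """Map a FOCUS or OUTPUT score to its chip label."""
    # ASSUMPTION: anything outside the legend's 0.5-8 range is treated as invalid.
    if not 0.5 <= score <= 8.0:
        raise ValueError(f"score {score} is outside the 0.5-8 range of the legend")
    # bisect_right counts how many edges are <= score, which is the band index.
    return LABELS[bisect_right(EDGES, score)]
```

For example, `chip_label(7.4)` gives "very intense" and `chip_label(2.7)` gives "okay", matching the day 3 and day 14 focus values reported in this section.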
At a glance · across 14 days
Hours are actual hours worked. Focus and output are scored from slow to very intense on the 0–8 scale above. Cohort averages: hours 4.2, focus 2.5, output 2.5.

Day   Hours   Focus                 Output
1     4.5     okay (1.8)            slow (0.9)
2     5       okay (2.4)            okay (2.2)
3     7       very intense (7.4)    intense (6.0)
4     3       slow (0.8)            slow (0.7)
5     6       intense (4.5)         okay (2.6)
6     7       intense (5.2)         intense (5.0)
7     7       intense (6.0)         very intense (7.3)
8     6       very intense (7.7)    very intense (7.8)
9     8       intense (6.0)         intense (5.5)
10    5       okay (2.0)            okay (2.2)
11    8       very intense (7.6)    very intense (7.5)
12    5       okay (2.3)            okay (2.4)
13    6       intense (5.5)         very intense (7.4)
14    5       okay (2.7)            intense (4.0)
Resume · background
Education
University of California San Diego · Sep 2023 – Mar 2025
M.S. in Computer Science and Engineering · Specialization in Artificial Intelligence · GPA 4.00 / 4.00
  • Key Courses: Computer Vision, Robotics, ML Systems, Software Engineering, Recommender Systems
Indian Institute of Technology Bombay · Jul 2019 – Jul 2023
B.Tech with Honors in Computer Science and Engineering · Minor in Entrepreneurship · CPI 9.66 / 10
  • Key Courses: Advanced Image Processing, Machine Learning, Linear Algebra, Probabilistic Theory, Web Security
Experience
Computer Vision Intern · Duality AI · Jun 2024 – Sep 2024
  • Built pipelines to generate high-fidelity Gaussian Splatting synthetic environments to validate vision models in real-world settings
  • Designed automated 3D reconstruction techniques for featureless objects, reducing digital-twin generation time by 40%
  • Collaborated with Autodesk to validate Unreal Engine simulations for robotics tasks; structured domain randomization reduced Sim2Real gap and increased mAP-50 by 15% for object detection and segmentation
Data and Applied Scientist Intern · Microsoft India · May 2022 – Jul 2022
  • Developed a decision-tree ranker to recommend emails without user queries, improving Outlook search capabilities
  • Integrated data pipelines across team infrastructures, combining user-specific features from large-scale context logs
  • Proposed hierarchical feature-sets for the ranker, reducing latency for recommendations and improving recall
Key Projects
Mirror AI: Deployable Personas · Oct 2024 – Dec 2024
Honorable mention, Supabase YC Hackathon
  • Designed an agentic LLM architecture with LangGraph to mirror user personalities as interactive digital personas
  • Deployed a full-stack platform using Supabase + Vercel for secure hosting and user authentication
  • Integrated with the Notion API for personal context; one-click deployment to publish a persona
Improving LLM Reasoning for Numerical Problems · Sep 2024 – Dec 2024
  • Enhanced MathPrompter (ACL 2023) with chain-of-thought, achieving 10% higher accuracy on Llama 3.1 1B where prior methods failed
  • Reduced hallucination rates by integrating multi-step validation for robust, consistent outputs
Inverse Rendering with 2D Gaussian Splatting · Mar 2024 – May 2024
  • Built a novel inverse-rendering framework in CUDA to recover PBR properties of a scene using 2D Gaussian Splatting
  • Improved normal-map MAE by 15% over current SOTA, achieving superior novel-view synthesis and relighting
Real-time 3D Perception for Home Robots · Sep 2023 – Sep 2024
Graduate Student Researcher, Supervisor: · UC San Diego
  • Investigated real-time dense visual SLAM methods using NeRFs and Gaussian Splatting for robot navigation
  • Integrated object segmentation, grasp-pose estimation, and 3D mapping on the Fetch robot via ROS; novel tabletop rearrangement algorithm reduced cost by 20% vs. SOTA
3D Tomography with Primal-Dual Neural Networks · May 2021 – Jul 2023
UCL Research Internship, Supervisor: · University College London
  • Built a stochastic neural-network architecture of a primal-dual algorithm for online 3D-volume reconstruction from tomographic projections; 99.6% structural similarity in low-dosage conditions
  • Shipped a Python library with custom gradient operators for single-pass volume reconstruction, cutting compute by up to 5× over SOTA learning-based approaches
Other
  • Image Colorization GAN: web app coloring grayscale images using pix2pix U-Net architecture
  • Sudoku Solver: Augmented Reality app solving Sudoku from a live feed with robust real-time performance
  • Autonomous Robot: Roomba-like robot with visual-SLAM using EKF and A* path planning on ROS
Skills
Programming: C++, C, Python, MATLAB, Linux & Bash, SQL, HTML, JavaScript
Tools: PyTorch, ROS, TensorFlow, scikit-learn, OpenCV, ReactJS, Matplotlib, Arduino
Expertise in: Full-stack development, Generative AI, 3D Perception, ML Systems, Statistical Image Processing