Hire AI-Native Engineers

See who's engineering and who's vibe coding. Candidates work with real coding agents like Claude Code and Codex, and you see every decision they make.

outship

Claude Code

Codex

bash

Claude Code v2.1.113

Welcome back!

Opus 4.7 · API Usage Billing

~/workspace

Tips for getting started

Run /init to create a CLAUDE.md file with instructions for Claude

Recent activity

No recent activity

●

The api pods are getting OOM-killed at 600 MB but the image is 1.21 GB — it's shipping the full python:3.12 base. I'll switch to a multi-stage distroless build and tighten the k8s resource block to match the new footprint.

●

Read(Dockerfile)

⎿Read 38 lines

●

Update(Dockerfile)

⎿Added 9 lines, removed 5 lines

-FROM python:3.12 AS app
-WORKDIR /srv
-COPY pyproject.toml uv.lock ./
-RUN pip install --no-cache-dir -e .
-COPY . .
+FROM python:3.12-slim AS builder
+WORKDIR /build
+COPY pyproject.toml uv.lock ./
+RUN pip install --no-cache-dir uv && \
+    uv sync --frozen --no-dev
+
+FROM gcr.io/distroless/python3-debian12:nonroot
+COPY --from=builder /build/.venv /srv/.venv
+USER nonroot

●

Update(k8s/api/deployment.yaml)

⎿Added 4 lines, removed 4 lines

     resources:
       limits:
-        memory: "1Gi"
-        cpu: "1000m"
+        memory: "256Mi"
+        cpu: "500m"
       requests:
-        memory: "512Mi"
+        memory: "128Mi"

●

Bash(docker build -t inference-api:slim . && docker images inference-api:slim)

⎿inference-api slim ce4f2b1 187MB built in 42.3s

●

Bash(kubectl apply -f k8s/api/deployment.yaml --dry-run=server)

⎿deployment.apps/inference-api configured (server dry run)

·Combobulating…(thinking with xhigh effort)

⏵⏵ accept edits on (shift+tab to cycle)

main.py

from __future__ import annotations

import logging
import os
from contextlib import asynccontextmanager

import torch
from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, make_asgi_app

from app.models.classifier import Classifier
from app.schemas import PredictRequest, PredictResponse
from app.metrics import COUNT_DESC

logger = logging.getLogger(__name__)
# Prometheus metrics, labelled by model name.
INFERENCE_LATENCY = Histogram(
    "inference_seconds", "Wall time per inference", ["model"]
)
INFERENCE_COUNT = Counter("inference_total", COUNT_DESC, ["model"])


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the classifier once at startup; prefer the GPU when one is available.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    app.state.classifier = Classifier.from_pretrained(
        os.environ["MODEL_PATH"], device=device,
    )
    logger.info("inference-api ready on %s", device)
    yield
    # Release the model (and cached GPU memory) on shutdown.
    del app.state.classifier
    if device == "cuda":
        torch.cuda.empty_cache()


app = FastAPI(lifespan=lifespan, title="inference-api")
# Expose Prometheus metrics at /metrics.
app.mount("/metrics", make_asgi_app())


@app.post("/v1/predict", response_model=PredictResponse)
async def predict(req: PredictRequest, request: Request) -> PredictResponse:
    clf = request.app.state.classifier
    # Time the inference call, then count the completed request.
    with INFERENCE_LATENCY.labels(model=clf.name).time():
        out = await clf.predict(req.input)
    INFERENCE_COUNT.labels(model=clf.name).inc()
    return PredictResponse(label=out.label, score=out.score)


@app.get("/healthz")
async def healthz() -> dict[str, str]:
    return {"status": "ok"}

World-class engineering teams already hire with Outship

AthenaHQ

Engineers are shipping with AI every day

It's time for hiring processes to stop pretending otherwise

See how it works →

Traditional technical interviews were designed for a different era

Algorithm puzzles and LeetCode were designed for engineers coding alone with nothing but an editor. That role does not exist anymore. Banning AI entirely or allowing it naively is also not the fix.

Weak engineers can hide behind AI

Unskilled candidates might look great in your current interview process, but they will commit 40k lines of unreviewed slop to your repo or delete your production DB once hired.*

You can't sit behind every candidate and watch them build

If you could, you'd see exactly who can engineer: who decomposes the problem correctly, catches the agent when it drifts, and goes into the specifics when things break. But doing this at scale isn't an option.

* Yes, this has actually happened to several of our customers before they started using Outship.

See exactly how candidates work

Every prompt, every edit, every command, and every decision is analyzed to show you who's engineering and who's vibe coding

Experienced engineer

Decomposes the problem and researches the stack. Catches the agent when it drifts and ships with tests.

Cleaned dead branches and unused imports the agent left behind

Codebase Orientation

Used plan mode before touching the auth flow

Architectural Research

Added Redis cache + invalidation on order write paths

Full-Stack Execution

Pushed back on a hallucinated Pydantic validator

Agent Communication

Caught N+1 query the agent left unbatched

Verification & Shipping

Wrote 3-option tradeoff before picking SQS over polling

Architectural Research

Memoized list filter to kill avoidable re-renders

Full-Stack Execution

Folded 3 near-identical handlers the agent copy-pasted into one helper

Codebase Orientation

Added timeout on outbound HTTP the agent missed

Verification & Shipping

Picked optimistic locking, documented row-lock deadlock risk

Architectural Research

Implemented idempotency key on retry-safe writes

Full-Stack Execution

Stopped the agent mid-edit when it duplicated existing util logic

Agent Communication
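
To make one of the behaviors above concrete, here is a minimal sketch of an idempotency key on a retry-safe write. It is an illustration only, not taken from an Outship session: `IdempotentWriter`, `create_order`, and the `"req-123"` key are hypothetical names, and the in-memory dictionary stands in for what would normally be a durable store with an atomic check-and-set.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class IdempotentWriter:
    """In-memory stand-in for a store keyed by idempotency key (hypothetical)."""

    _results: Dict[str, dict] = field(default_factory=dict)

    def write(self, idempotency_key: str, perform: Callable[[], dict]) -> dict:
        # A retry with a key we have already seen returns the stored result
        # instead of performing the write a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = perform()
        self._results[idempotency_key] = result
        return result


orders: List[dict] = []


def create_order() -> dict:
    # Hypothetical write with a visible side effect: appends a new order.
    order = {"id": len(orders) + 1}
    orders.append(order)
    return order


writer = IdempotentWriter()
first = writer.write("req-123", create_order)
retry = writer.write("req-123", create_order)  # simulated client retry
```

After the retry, `orders` still holds a single entry and `retry` is the same result as `first`. The sketch is not thread-safe; a production version would need the lookup and the write to happen atomically.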

Analyzing session against rubric…

Inexperienced vibe coder

Pastes the spec, re-prompts “fix it” until tests pass. Accepts every diff without reading what changed.

Failed Migration

Network Partition

Cold Start

claude code

✻ Welcome to Claude Code (v2.1.34)

cwd: ~/api-service · model: claude-opus-4-7

⏵⏵ accept edits on (shift+tab to cycle)

Technical screening platform for the AI era

Real tasks in real environments, with every step captured

See it in action →

Tasks that actually matter

Outship supports virtually any task. Each candidate gets a real VM with VS Code, dependencies, and AI agents configured.*

Import interview projects you already use from GitHub, ask candidates to solve a past bug, or even have them ship a real PR to your product.

Create your first task →

* GPUs and specialized setups are available; contact us!

The Process Is the Signal

Outship captures everything a candidate does while building: every prompt, every edit, every command, every decision.

It's like watching every candidate think out loud. You see how they decompose the problem, drive the agent, and recover when things break.

See how it works →

latency-benchmark

Coding Agent

Terminal

~/workspace $ codex

>_ OpenAI Codex (v0.46.0)

model: gpt-5-codex high /model to change

directory: ~/workspace

To get started, describe a task or try one of these commands:

/init – create an AGENTS.md file with instructions for Codex

/status – show current session configuration

/approvals – choose what Codex can do without approval

/model – choose what model and reasoning effort to use

/review – review any changes and find issues

> Instrument every stage of the voice pipeline (ASR, NLU, TTS) with high-resolution timing. Collect p50/p95/p99 latencies across 100 runs, add warm-up rounds, generate a markdown report with per-stage breakdown, and make sure the harness works with both the local mock pipeline and the remote gRPC endpoint. Use histogram-based percentile calculation instead of sorting...

100% context left · ? for shortcuts

Scrub to explore the session →

Candidate prompts agent
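
For readers unfamiliar with the technique named in the prompt above, here is a rough sketch of histogram-based percentile calculation with warm-up rounds. It is not the candidate's or the agent's actual solution: `LatencyHistogram`, `benchmark`, the 1 ms bucket width, and the 1000 ms ceiling are all assumptions chosen for illustration. The idea is to count samples into fixed-width buckets and walk the cumulative counts, rather than storing and sorting every sample.

```python
import math
import time
from typing import Callable


class LatencyHistogram:
    """Fixed-width histogram of latencies in milliseconds (hypothetical helper)."""

    def __init__(self, bucket_width_ms: float = 1.0, max_ms: float = 1000.0) -> None:
        self.bucket_width_ms = bucket_width_ms
        # One extra bucket at the end collects samples above max_ms.
        self.counts = [0] * (int(math.ceil(max_ms / bucket_width_ms)) + 1)
        self.total = 0

    def record(self, latency_ms: float) -> None:
        # Clamp out-of-range samples into the last (overflow) bucket.
        index = min(int(latency_ms / self.bucket_width_ms), len(self.counts) - 1)
        self.counts[index] += 1
        self.total += 1

    def percentile(self, p: float) -> float:
        """Return the upper edge of the bucket that contains the p-th percentile."""
        if self.total == 0:
            raise ValueError("no samples recorded")
        target = math.ceil(self.total * p / 100.0)
        seen = 0
        for index, count in enumerate(self.counts):
            seen += count
            if seen >= target:
                return (index + 1) * self.bucket_width_ms
        return len(self.counts) * self.bucket_width_ms


def benchmark(stage: Callable[[], None], runs: int = 100, warmup: int = 10) -> LatencyHistogram:
    """Time `stage` for `runs` iterations after `warmup` untimed iterations."""
    for _ in range(warmup):
        stage()  # warm-up rounds are executed but not recorded
    hist = LatencyHistogram()
    for _ in range(runs):
        start = time.perf_counter()
        stage()
        hist.record((time.perf_counter() - start) * 1000.0)
    return hist
```

Calling `benchmark(some_stage).percentile(95)` returns an estimate whose error is bounded by the bucket width. That is the tradeoff against sorting: constant memory and a single pass, in exchange for resolution limited to the bucket size.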

Test for how engineers work today

Other platforms bolted AI onto their tests. Outship is purpose-built for engineering with AI.

See how it works →

Algorithmic tests & toy assessments

Output only — you can't see the thought process
Bans AI entirely, or allows a crappy AI copilot
Algorithm memorization, not real engineering
Toy problems, tests for a job that no longer exists

Preferred by fast-growing Silicon Valley teams

Outship

Every prompt, edit, and decision captured
Real coding agents like Claude Code and Codex
Real tasks from your own codebase
Tests the craft engineers actually practice today

Focus on skills that matter, fill roles faster

Top 1%

Standout engineers your competitors miss

2-3 hours

Of your engineers' time saved per candidate

Coding agents are eating software

Engineering teams are adopting AI to ship more, faster than ever. Identify hires with the agency, speed, and taste to thrive in this era.

Book a Demo

On Y Combinator asking founders to attach a coding-agent session they're proud of:

“I seriously think this is just the beginning of being able to find the real builders.”

Garry Tan

President & CEO, Y Combinator

“taste is a new core skill”

Greg Brockman

Co-Founder, OpenAI

“The obvious tactical thing is just get really good at using AI tools. When I was graduating as a senior from high school, the obvious tactical thing was get really good at coding, and this is the new version of that.”

Sam Altman

CEO, OpenAI

“Every one of our engineers, some 40,000, are now assisted by AI and our productivity has gone up incredibly.”

Jensen Huang

President & CEO, NVIDIA

“When I speak to my most senior engineers — the best developers we have — they actually say that they haven't written a single line of code since December. They actually generate code and supervise it.”

Gustav Söderström

Co-CEO, Spotify

On Coinbase mandating that engineers onboard onto AI coding tools:

“there were a couple people that had not done it. Some of them had a good reason… and some of them didn't. And they got fired.”

Brian Armstrong

Co-Founder & CEO, Coinbase

Integrations

Connect your stack

GitHub, your cloud, your CI/CD, your code review pipeline — Outship plugs into the tools your team already uses, so candidate work mirrors how engineers actually ship.

Get started

Task repositories imported from GitHub

Frequently asked questions

Candidates code on a real Linux VM, so virtually any language or framework works. We also support live dev servers, which is useful for full-stack and React Native projects where candidates need to view a running app.

Two options. You can provide access through the Anthropic or OpenAI APIs, or through cloud providers like AWS Bedrock, Google Vertex AI, and Azure. Your keys are proxied so candidates never see them. You can also have candidates log in and authenticate with their own accounts.

Remote VM by default. Local mode is also available if you want candidates to use their own setup, including MCPs, CLIs, and custom skills. Some of our customers use it to see how candidates actually work day to day.

Candidates consistently prefer Outship over traditional technical interviews. They build a real project that matches the scope of actual work at your company instead of grinding through algorithm puzzles. They appreciate that the process tests the skills and tools they'll actually use on the job.

Yes. Candidates work in a fully isolated VM or on their own machine against a fresh repository. The task repositories you've attached and your GitHub credentials are never exposed to them.

Yes. Book a demo and we'll show you a sample interview and the corresponding reports for candidates at different skill levels.

Yes. Outship integrates with common ATS providers like Ashby, Greenhouse, and Workday. We also support automated flows, for example routing every candidate who applies through a specific channel into a task. Our team will work with you to set up the workflow that fits your hiring process.

Most of our customers are ready to send their first assessment the same day they onboard. We create a shared Slack channel with your team and work through the setup with you, helping you test with candidates and refine the process until it fits your use case.

Outship lets you hire engineers built for this era

Book a Demo