Y COMBINATOR · EXTRACTED

Software Is Changing (Again) ft. Andrej Karpathy

Andrej Karpathy on the three eras of software, the new LLM stack, and where the real leverage sits in 2026.

2.4M views on YouTube
"LLMs are a new kind of computer, and you program them in English." — Andrej Karpathy

Karpathy spent years at OpenAI and Tesla building AI systems that ended up in production. He left to teach. This Y Combinator keynote is where he laid out the Software 1.0 / 2.0 / 3.0 framework that's now everywhere: classical hand-written code, trained neural network weights, and natural-language LLM programs. The talk isn't a forecast. It's a working theory of where the leverage is right now, what the stack actually looks like, and what kinds of products are buildable today that weren't 18 months ago. If you build with AI for a living, this is the operating manual.

TACTIC 01

Pick the right era of software for the problem

The whole talk hangs on one frame. There are now three ways to write software, and they all coexist. Software 1.0 is classical code, where a human writes the rules. Software 2.0 is neural network weights, where you specify a dataset and an architecture and gradient descent does the programming. Software 3.0 is LLMs, where the program is a prompt written in plain English. None of these replaces the others. They have different cost curves, different reliability profiles, different failure modes. Most teams burning months right now are using one of the three for a problem that belongs to a different era.

THE PLAY

Before you build a feature, classify it. If the rules are clear and the input is structured, write the code (1.0). If the input is messy but you have a lot of labeled data and you need it fast at runtime, train a model (2.0). If the task is linguistic and a wrong answer is annoying but not catastrophic, prompt an LLM (3.0). Picking wrong burns weeks. Picking right cuts the build in half.
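The classification rule above can be written as a small decision function. This is a sketch only: the `Feature` fields and the `pick_era` name are hypothetical labels for the questions in the play, and the three checks are a literal reading of the rule, not something taken from the talk.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Era(Enum):
    CODE = "1.0"     # hand-written rules
    WEIGHTS = "2.0"  # trained model
    PROMPT = "3.0"   # prompted LLM


@dataclass
class Feature:
    # Hypothetical yes/no answers to the questions in the play.
    rules_are_clear: bool
    input_is_structured: bool
    has_labeled_data: bool
    needs_fast_runtime: bool
    is_linguistic: bool
    errors_are_tolerable: bool


def pick_era(f: Feature) -> Optional[Era]:
    """Return the era the play suggests, or None if no rule matches."""
    if f.rules_are_clear and f.input_is_structured:
        return Era.CODE
    if not f.input_is_structured and f.has_labeled_data and f.needs_fast_runtime:
        return Era.WEIGHTS
    if f.is_linguistic and f.errors_are_tolerable:
        return Era.PROMPT
    return None  # no clean fit: worth a conversation before estimating


invoice_totals = Feature(True, True, False, False, False, False)
support_reply_draft = Feature(False, False, False, False, True, True)
```

Here `pick_era(invoice_totals)` returns `Era.CODE` and `pick_era(support_reply_draft)` returns `Era.PROMPT`. A `None` result is the useful case: it marks a feature that does not clearly belong to any one era.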

TACTIC 02

Treat LLMs as 1960s infrastructure

Karpathy compares LLMs in 2026 to mainframe computing in the 1960s. The intelligence sits in centralized data centers. You access it through APIs and time-share the compute. The phase where everyone's laptop runs a useful local model hasn't arrived. He doesn't predict when it will. The point is that the architecture is centralized for the foreseeable future, and that fact should shape what you build. Cost will keep dropping, latency will keep improving, but the network round-trip is here to stay.

THE PLAY

Stop trying to ship features that depend on local inference unless you also make hardware. The surface to design against is API-based access with 200ms to 2-second round trips. That means streaming UIs, optimistic states, partial results rendered as they arrive, skeleton screens that don't feel broken. The teams that lose are the ones who treat the API call as a synchronous function. The teams that win make latency invisible.
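One way to picture "partial results rendered as they arrive" is an async loop that shows a placeholder at once and re-renders on every chunk. The sketch below assumes a streaming source; `fake_llm_stream` is a hypothetical stand-in that emits one word at a time, not a real provider SDK, and `render_streaming` is an illustrative name.

```python
import asyncio
from typing import AsyncIterator, Callable


async def fake_llm_stream(text: str, delay_s: float = 0.01) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming API: yields one word per chunk."""
    for word in text.split():
        await asyncio.sleep(delay_s)  # simulated network latency per chunk
        yield word + " "


async def render_streaming(stream: AsyncIterator[str],
                           render: Callable[[str], None]) -> str:
    """Show a placeholder immediately, then re-render as each chunk arrives."""
    render("Loading...")  # skeleton state, visible before the first chunk
    shown = ""
    async for chunk in stream:
        shown += chunk
        render(shown)     # the user sees progress instead of a frozen screen
    return shown.strip()


frames = []  # each call to render() appends one "frame" the user would see
final = asyncio.run(
    render_streaming(fake_llm_stream("partial results arrive early"), frames.append)
)
```

The contrast is with awaiting the full response and rendering once: the total time is the same, but the user sees the first frame immediately rather than after the whole round trip.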

TACTIC 03

Build the suit before you build the robot

Karpathy uses an Iron Man analogy that's worth taking seriously. The Iron Man suit augments a human pilot who stays in control. The Iron Man robot acts on its own. Almost every product that's working in 2026 is a suit: Cursor for code, Perplexity for search, Claude for writing. The autonomous-agent space, where the AI takes consequential actions without a human in the loop, is where most of the failures and most of the abandoned MVPs live. LLMs aren't reliable enough to act unsupervised in most domains. The right default is suit mode. Reach for autonomy only when an error is genuinely cheaper than the supervision that would have caught it, and even then, slowly.

THE PLAY

Walk through your product feature by feature. For each AI-driven action, ask: does a human approve before this affects something real? If yes, you're shipping a suit. If no, ask the harder question. Is the cost of an error here actually lower than the cost of a human glance? If you can't answer that with a clear yes, redesign as a suit. Most agentic features that bleed retention fail this test.
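In code, the suit-mode test comes down to one property: the side effect is deferred until a human says yes. A minimal sketch, assuming a hypothetical `ProposedAction` type and an `approve` callback that stands in for whatever confirmation UI the product has:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    description: str             # what the human is asked to approve
    execute: Callable[[], str]   # the real side effect, deferred until approval


def run_with_checkpoint(action: ProposedAction,
                        approve: Callable[[str], bool]) -> str:
    """Suit mode: nothing real happens unless the human approves."""
    if not approve(action.description):
        return "skipped"
    return action.execute()


outbox = []  # stands in for "something real" the action would affect


def send_invoice() -> str:
    outbox.append("invoice")
    return "sent"


action = ProposedAction("Email invoice to customer", send_invoice)
```

Calling `run_with_checkpoint(action, approve=lambda d: False)` returns `"skipped"` and leaves `outbox` empty. A robot-mode feature is the same code with the `approve` call removed, which makes it easy to find in review.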

TACTIC 04

Make the autonomy slider visible

Once a product is in suit mode, the next design question is how much of the work the AI does versus how much the human does. Cursor's design is the canonical example: tab completion at the low end, inline edits in the middle, agent mode at the high end. The user picks where to sit based on the task and on how much they trust the AI for that task. The mistake Karpathy flags is fixing the slider in one position and forcing every user to live there. The right pattern is making the position visible, making it adjustable, and letting users move as their trust grows. Trust is built one task at a time. The slider is what lets that happen.

THE PLAY

In your AI product, find the autonomy decision and surface it as a control the user can see. Default new users to a conservative position where the AI suggests and the human commits. Let power users opt up to higher autonomy as they get comfortable. The slider is not a settings-screen toggle. It belongs in the main flow.
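As a rough model, the slider is an ordered setting stored per user, with a conservative default. The level names below mirror the Cursor example, but the `Autonomy` enum, the default constant, and the `is_available` check are assumptions made for illustration.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    SUGGEST = 1  # AI proposes, human commits (tab completion)
    EDIT = 2     # AI rewrites a selected region, human reviews (inline edits)
    AGENT = 3    # AI works across the whole task (agent mode)


# New users start at the conservative end and opt up as trust grows.
DEFAULT_FOR_NEW_USERS = Autonomy.SUGGEST


def is_available(user_level: Autonomy, action_level: Autonomy) -> bool:
    """An AI action is offered only at or below the user's slider position."""
    return action_level <= user_level
```

With these definitions, `is_available(DEFAULT_FOR_NEW_USERS, Autonomy.AGENT)` is `False`, and it becomes `True` only after the user moves the slider up themselves.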

TACTIC 05

Ship the verification UI, not just the model

A pattern Karpathy keeps returning to: the bottleneck in LLM products is rarely whether the model produces the right output. It's whether the user can verify the output is right, fast enough that the loop is worth doing. If the AI generates in 2 seconds and the human takes 10 minutes to check, the loop is broken. Cursor invests heavily in the diff view. Perplexity invests heavily in citations. The teams that ship LLM products that stick are the ones that put as much engineering into the verification surface as into the generation pipeline. The teams that don't are the ones whose users churn after week two because they realize the AI is faster but their workflow isn't.

THE PLAY

For every AI-generated output in your product, sit with the question: how does the user check this is right, in seconds, without leaving the flow? If the answer involves opening another tab, scrolling through the source, or copy-pasting into another tool, the verification UI hasn't been designed yet. Ship a diff view, a citation, a confidence band, a highlight of what changed. Time-budget verification separately from generation, and don't ship until verification fits in the budget.
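A "highlight of what changed" can start very small. The sketch below uses Python's standard `difflib` to reduce an AI edit to only the lines the user has to check; the `changed_lines` helper and the sample strings are illustrative, not taken from any product mentioned above.

```python
import difflib
from typing import List


def changed_lines(before: str, after: str) -> List[str]:
    """Return only removed ('- ') and added ('+ ') lines, dropping unchanged ones."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


original = "total = price * qty\nprint(total)"
ai_edit = "total = price * qty * (1 + tax)\nprint(total)"

for line in changed_lines(original, ai_edit):
    print(line)
```

The user reviews two lines instead of rereading the whole output, which is the difference between a verification step that fits in seconds and one that does not.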

TACTIC 06

Write your codebase for LLMs to read

The most practical part of the talk is for engineering leaders. Karpathy argues that codebases now have a second class of reader: AI agents. They have no institutional memory, they rebuild context from the code on every call, and they're going to be touching every line in your repo. The same hygiene that helps human engineers helps LLMs more, because humans can ask a teammate and LLMs can't. He calls out specific patterns that work: documentation files written for AI agents at the repo root, short well-named functions, explicit types, comments on the "why" not the "what." He also calls out specific examples in industry. Vercel's documentation is good for AI agents. Clerk's, in his telling, is not.

THE PLAY

Add an AGENTS.md at the root of your repo (llms.txt plays the same role at the root of a docs site). Describe the architecture in two paragraphs. List the conventions. Flag the gotchas. Then audit your three most-touched files for LLM readability. Are function names descriptive without context? Are types explicit? Are there comments where the reasoning isn't obvious from the code? This investment pays back the first week your team uses an AI agent on the codebase, and it compounds every week after that.
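The "are types explicit?" part of the audit can be partly automated. The sketch below assumes a Python codebase and uses the standard `ast` module to list functions with an unannotated parameter or return value; the `functions_missing_types` name is hypothetical, and it deliberately ignores `*args` and `**kwargs` to stay short.

```python
import ast
from typing import List


def functions_missing_types(source: str) -> List[str]:
    """Names of functions with an unannotated parameter or return value."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            params = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
            untyped = any(
                p.annotation is None
                for p in params
                if p.arg not in ("self", "cls")  # conventionally left unannotated
            )
            if untyped or node.returns is None:
                missing.append(node.name)
    return missing


sample = '''
def area(width: float, height: float) -> float:
    return width * height

def proc(x, y):
    return x + y
'''

print(functions_missing_types(sample))
```

In this sample only `proc` is flagged, and it is also the function whose name says nothing without context, so the same list doubles as a starting point for the naming check.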

TACTIC 07

The work is in the tail, not the demo

Karpathy ends on the gap that kills most AI startups. Demos work in the easy 80% of cases. Real products live in the messy long tail: the unusual inputs, the edge cases, the failures users hit on day three when they trust the system enough to push it. He pulls from his Tesla years here. After more than a decade of work, full self-driving still isn't done, and the reason isn't capability. It's reliability in the tail. The same dynamic plays out in every AI product. The team that does the unglamorous work (building eval sets, tracking failure modes, designing graceful degradation) wins its vertical. The team that ships another impressive demo and pivots loses.

THE PLAY

Pick the most common failure mode your AI product has right now. Not the most exotic, the most common. Build an eval set of real examples that cover it. Track your accuracy on that set every week. Don't ship features that regress it. This is boring work. It is also the work that separates the AI products that exist in two years from the ones that don't.
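The weekly check can be as plain as an accuracy number and a gate. This is a sketch under simple assumptions: exact-match scoring, an eval set of (input, expected) pairs, and hypothetical names throughout (`accuracy`, `safe_to_ship`, and the toy lookup "models" below).

```python
from typing import Callable, List, Tuple

# (input, expected output) pairs drawn from real examples of the failure mode
EvalSet = List[Tuple[str, str]]


def accuracy(predict: Callable[[str], str], eval_set: EvalSet) -> float:
    """Fraction of eval examples the predictor gets exactly right."""
    if not eval_set:
        raise ValueError("eval set is empty")
    correct = sum(1 for x, expected in eval_set if predict(x) == expected)
    return correct / len(eval_set)


def safe_to_ship(new_acc: float, baseline_acc: float, tolerance: float = 0.0) -> bool:
    """Block the release if accuracy on the eval set regressed."""
    return new_acc >= baseline_acc - tolerance


eval_set = [("2 + 2", "4"), ("10 / 4", "2.5"), ("7 * 6", "42"), ("9 - 10", "-1")]

# Toy stand-ins for last week's model and this week's candidate.
baseline = {"2 + 2": "4", "7 * 6": "42"}
candidate = {"2 + 2": "4", "7 * 6": "42", "10 / 4": "2.5"}

baseline_acc = accuracy(lambda q: baseline.get(q, "?"), eval_set)
candidate_acc = accuracy(lambda q: candidate.get(q, "?"), eval_set)
```

Here the baseline scores 0.5 and the candidate 0.75, so `safe_to_ship(candidate_acc, baseline_acc)` is `True`. Swap the two arguments and it returns `False`, which is the regression the play says not to ship.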

YOUR ACTION PLAN

All the plays, back to back. Use this as your checklist.

  01. Pick the right era of software for the problem
      Write the era number on every ticket before estimating it.

  02. Treat LLMs as 1960s infrastructure
      Audit every AI feature. Anywhere the user waits more than 400ms with no feedback, ship a streaming or optimistic state this week.

  03. Build the suit before you build the robot
      Find the one feature in your product that takes action without a human checkpoint. Add the checkpoint.

  04. Make the autonomy slider visible
      Add a visible "how much should the AI do" control to your primary AI feature this sprint.

  05. Ship the verification UI, not just the model
      Pick your weakest AI feature. Add one verification element this week. Diff, citation, or change highlight.

  06. Write your codebase for LLMs to read
      Write AGENTS.md today. Two paragraphs on architecture, a list of conventions, a list of gotchas. Ship it before tomorrow.

  07. The work is in the tail, not the demo
      Pick your top failure mode. Build the eval. Run it weekly.

Y COMBINATOR · EXTRACTED BY PODEX