Claude and HeyGen: how to record a video without ever stepping in front of a camera

It's now possible to record a few minutes of video with your own face and voice without sitting down in front of a camera even once. All you need is a finished script, a digital clone and a handful of tools that an AI agent stitches together for you. Below is a breakdown of how the process works in practice, and what it really costs.

Let me start with a simple observation: the figure you see in a video like this isn't a person, but their digital clone. Built in roughly ten minutes. The script stays your own — these are still the speaker's real thoughts — but the face on screen was generated by a machine.

Three tools and one conductor

The whole setup rests on three pieces. HeyGen is the tool that creates the avatar, a digital double that speaks in your voice on the recording. ElevenLabs (also written 11 Labs) handles the voice clone, faithfully reproducing your timbre and the way you speak. Remotion assembles the final video and adds the motion graphics: captions and animated elements that appear at exactly the right moment.

The conductor is Claude Code, the tool in which Claude works like an operator: it reads files, runs steps and connects to other programs. Without it you'd have to copy script fragments by hand, download recordings, upload them back and stitch them together, tediously, piece by piece. Claude Code chains those steps into a single run: you say what you want, and it does it.

Step one — the face clone

The avatar is created in HeyGen. The fastest route is a webcam recording: the tool shows a short text to read aloud, the recording runs about fifteen seconds, and a moment later the clone is ready. The second route is uploading your own footage — with around ten gigabytes of recordings the model has more data to learn from and captures your facial expressions better.

This is where the new version, Avatar 5, comes in. The earlier ones (Avatar 3 and 4) worked tolerably well but gave themselves away with artificial lip movement and mechanical gestures. The fifth was trained on more than ten million data points of facial expression and builds a digital double from a fifteen-second recording. The result is natural enough that it's hard to tell from a real take: the figure moves its head, blinks, swallows. The difference between the old and new versions is obvious: on Avatar 3, anyone who knows you will instantly catch that "something's off" about the mouth.

Step two — the voice clone

HeyGen's default voice sounds weak, so the voice is sourced separately, from ElevenLabs. This is the crucial stage: of everything that shapes how the result lands, the voice carries the most weight. There are two options. The instant voice clone is quick and simple but less faithful. The professional voice clone needs material: the tool asks for at least thirty minutes of recordings, and more is better, with around two hours giving the best result.

In the panel you can adjust pace, stability, similarity and the intensity of the style. Landing on settings that sound natural takes trial and error: expect many iterations before you hit your own way of speaking. The finished recording is downloaded as an audio file and uploaded to HeyGen's AI Studio section, where it's synced with the avatar. Worth knowing: the same voice imported straight into HeyGen sounds worse than it does in ElevenLabs, which is why this stage is done separately.
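
The trial and error over those four sliders can be made systematic by listing the combinations to audition up front. Here is a minimal sketch, not ElevenLabs' own API: the `VoiceSettings` class, its field names and the 0-to-1 ranges are assumptions made for illustration, and the code only builds the list of candidates without calling any service.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class VoiceSettings:
    """One combination of the four sliders (hypothetical names and ranges)."""
    speed: float
    stability: float
    similarity: float
    style: float

    def __post_init__(self) -> None:
        # Assumed ranges: speed must be positive, the rest sit between 0 and 1.
        if self.speed <= 0:
            raise ValueError("speed must be positive")
        for name in ("stability", "similarity", "style"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")


def candidate_grid(speeds, stabilities, similarities, styles):
    """Return every combination of the given slider values, in a stable order."""
    return [
        VoiceSettings(*combo)
        for combo in product(speeds, stabilities, similarities, styles)
    ]


if __name__ == "__main__":
    grid = candidate_grid([1.0], [0.4, 0.5, 0.6], [0.7, 0.8], [0.0, 0.2])
    print(len(grid), "combinations to listen to")
```

Keeping the grid small matters: every candidate is a clip someone has to listen to, so three or four values per slider is usually plenty for one round.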

Why the script has to be cut into pieces

There are two technical limits. In Avatar 5, HeyGen lets you generate material up to three minutes long. ElevenLabs has its own ceiling: past roughly a minute the voice starts to degrade and resembles the original less and less. The practical sweet spot is fragments of forty-five to sixty seconds — that way the voice stays consistent throughout.

That's why a ten-minute script has to be split up. Claude Code does it: it takes the text from disk, cuts it into chunks of about a minute and passes them to HeyGen. There's one condition — a cut always falls at the end of a sentence, never in the middle of one. That way, once everything is stitched back together, you can't hear the seams between fragments.
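
The cutting rule is simple enough to sketch in a few lines. This is one possible implementation, not the actual script used in the setup: the speaking pace of 150 words per minute is an assumption for estimating duration, and the function names are made up for the example.

```python
import re

WORDS_PER_MINUTE = 150  # assumed speaking pace; measure your own voice clone
MAX_SECONDS = 60        # upper end of the 45-60 second sweet spot


def split_sentences(text: str) -> list[str]:
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def estimate_seconds(text: str) -> float:
    """Estimate spoken duration from the word count."""
    return len(text.split()) / WORDS_PER_MINUTE * 60


def chunk_script(text: str, max_seconds: float = MAX_SECONDS) -> list[str]:
    """Group whole sentences into chunks that stay under max_seconds.

    A cut only ever falls between sentences. A single sentence longer
    than max_seconds is kept whole and becomes its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    for sentence in split_sentences(text):
        candidate = " ".join(current + [sentence])
        if current and estimate_seconds(candidate) > max_seconds:
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The regular expression is deliberately naive: abbreviations such as "e.g." would trigger a false cut, so a real script may need a proper sentence tokenizer.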

There's one more detail that calls for a workaround. For now, Avatar 5 doesn't work over the API — that is, the interface programs use to connect to each other automatically. Only Avatar 3 and 4 are available there. The solution: Claude Code generates material in Avatar 4 over the API, while a separate script (built on Playwright, a tool that clicks around a browser the way a human would) opens the HeyGen panel and switches the version to the fifth. It's a temporary workaround — once Avatar 5 reaches the API, it goes away.

What Remotion does

The last stage is editing. Remotion is given a background and a graphics style, then assembles the recordings on its own, transcribes them (turning speech into text with precise timestamps) and, on that basis, drops captions and animations in at the right seconds. Since it knows a given word was spoken at the forty-fourth second, that's exactly where it fires the animation. That's how the audio lines up with the graphics.
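
Remotion positions everything by frame number, so a word's timestamp has to be converted before an animation can be attached to it. The sketch below shows that conversion in Python for illustration only (Remotion itself is written in TypeScript); the `Word` structure, the trigger list and the 30 fps frame rate are assumptions, not the format any particular transcription tool emits.

```python
from dataclasses import dataclass

FPS = 30  # assumed frame rate of the composition


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float


def to_frame(seconds: float, fps: int = FPS) -> int:
    """Convert a timestamp in seconds to the nearest frame number."""
    return round(seconds * fps)


def cue_frames(words: list[Word], triggers: set[str], fps: int = FPS) -> list[tuple[str, int]]:
    """Return (word, frame) pairs for each transcript word that fires an animation."""
    cues = []
    for word in words:
        key = word.text.lower().strip(".,!?")
        if key in triggers:
            cues.append((key, to_frame(word.start, fps)))
    return cues


if __name__ == "__main__":
    transcript = [Word("The", 43.6, 43.8), Word("clone.", 44.0, 44.4)]
    print(cue_frames(transcript, {"clone"}))  # [('clone', 1320)]
```

A word spoken at the forty-fourth second lands on frame 1320 at 30 fps, and that is where the animation would start.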

This whole run — from script to finished video — can be kicked off in the evening and collected in the morning. A process that once needed a camera operator, an editor and a voice artist turns into a task done overnight.

What actually changes

Three conclusions come out of this setup. First: the avatar has crossed the so-called uncanny valley, the zone in which an almost-human figure still looks unsettling, and is now believable enough not to jar. Second: a single agent can run the whole production: pull the voice, cut the audio, hand it to HeyGen, assemble it in Remotion. Third, and most important: the bottleneck has shifted away from recording and editing toward what the machine won't do, namely the idea, the script and the strategy. The human stays where they matter most.

It's worth tallying the costs, because they aren't low. A HeyGen plan runs about $30 a month, ElevenLabs about $22 (roughly a hundred minutes of material), Claude Code from $20 to $200. HeyGen over the API is billed separately: each one-minute fragment is about four dollars, so a ten-minute video comes to roughly forty to fifty, depending on how many fragments the script is cut into. For comparison: a freelance editor costs $35–75 an hour, a ten-minute piece can run up to $300, and a studio plus a professional voice artist push the bill into the thousands. (The figures are ballpark; tool prices change, so it's worth checking the current ones.)
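
The arithmetic is easy to rerun with your own numbers. The sketch below uses the ballpark prices quoted above and assumes, purely for illustration, that the API charge applies per fragment regardless of its length; the actual billing model may differ.

```python
import math

# Ballpark monthly plans in USD, taken from the figures above
# (Claude Code at its low-end price).
MONTHLY_USD = {"HeyGen": 30, "ElevenLabs": 22, "Claude Code": 20}
API_USD_PER_FRAGMENT = 4  # approximate; assumed to be charged per fragment


def fragments_needed(video_seconds: int, fragment_seconds: int) -> int:
    """Number of fragments required to cover the whole video."""
    return math.ceil(video_seconds / fragment_seconds)


def api_cost(video_seconds: int, fragment_seconds: int) -> int:
    """Estimated HeyGen API cost in USD for one video."""
    return fragments_needed(video_seconds, fragment_seconds) * API_USD_PER_FRAGMENT


if __name__ == "__main__":
    print("Subscriptions per month:", sum(MONTHLY_USD.values()))
    print("10 min in 60 s fragments:", api_cost(600, 60))
    print("10 min in 45 s fragments:", api_cost(600, 45))
```

Under these assumptions a ten-minute video costs $40 in API charges with sixty-second fragments and $56 with forty-five-second ones, on top of about $72 a month in subscriptions.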

Three honest caveats

To finish, three doubts that suggest themselves. "It's artificial, inauthentic" — partly fair, but when the script, the voice and the face are yours, the only thing missing is physical presence in front of a camera. "It'll flood the web with junk" — junk is piling up anyway, and removing the production bottleneck only sharpens the competition: the better idea wins, because weak content with a good avatar is still weak content. "It'll take editors' jobs" — it changes them, more likely; the edge goes to whoever adds their own knowledge of the subject to the tool.

And one sober note to close on: this isn't a setup that works "from the first click." Behind the finished result sit more than a hundred, maybe two hundred generated clips and a fair amount of fiddling with settings. The tools shorten production, but getting to a good result still takes patience and a few rounds of fixes.