Kling 3.0 and O3: Kuaishou’s New Bid for the AI Video Crown
Kuaishou has released two new AI video models, Kling 3.0 and O3: 15-second videos, multi-shot editing with up to six camera cuts, and native audio generation. With real examples from X and a look ahead to Seedance 2.0.
By Thomas Fenkart · 3 min read
In early February, Kuaishou dropped Kling 3.0 and O3: two video models with seriously ambitious goals. I took a closer look this time, and not just at the press releases.

Maximum video length is now 15 seconds. That doesn't sound like much, but in the AI video world it's a real leap from the previous 10. What interests me more is the multi-shot feature: up to six camera cuts in a single generation. On X, @doctorwasif shows a 15-second sci-fi thriller: a hacker uncovers a conspiracy, and four cuts build tension from a wide shot to an intense reveal. All from one prompt, with natural flow between shots. @sparker888 built a full-on fan production, using Midjourney V7 reference images as elements for the multi-shot feature, rendered in under an hour. Four minutes of finished video, including bloopers at the end.

The audio thing

Native audio sync is the other big point in Kling 3.0. Dialogue, music, sound effects: everything is generated directly alongside the video. Japanese, Korean, and Spanish now work too, not just English and Chinese. What surprised me: multi-person dialogue with three people talking at once, correct lip-sync, and proper voice assignment. Until now, it always turned into chaos as soon as multiple people were in the frame.

@bonega_ai nails it: "The biggest AI video problem has always been consistency. Same character, new shot, new face. Kling 3.0 just shipped multi-shot storyboarding. 6 shots per clip, character identity locked across every angle."

A day later came O3, the Swiss Army knife of the family. Text-to-video, image-to-video, video-to-video, multi-reference, editing: everything in one model. You can use 10+ reference images at once, and the system keeps characters and style consistent. The text-based editing in O3 is interesting: add objects, change lighting, swap backgrounds via text input, without masking. @terencesia_ shows an ad demo and comments: "Brands aren't hiring photographers the way they used to. More ad production is moving in-house."
Magic Hour published a detailed review: Kling 3.0 is "one of the first AI video models that feels built for structured storytelling, not just flashy clips." Their take: stronger than most competitors for multi-shot and 15-second narratives, but on-screen text and complex physics (water, fire, fabric) can still be problematic. Hands and fingers in close-ups remain inconsistent.

What this means for the industry

With this double release, Kuaishou is making a statement to Sora, Veo, and Runway. Longer videos, multi-shot, native audio: these are exactly the features professional users have been asking for. The API was out a day after launch, and there's already a ComfyUI integration. That shows Kling isn't positioning itself as a toy but as part of a production pipeline.

Will that put them in the lead? The advantage lasted exactly six days. On February 10, ByteDance countered with Seedance 2.0: 2K video with synced audio, up to 12 reference files, and, according to Forbes, it "nails real world physics." More on that in the next article.