An Interview with the Sora Team: How Was It Built? How Long Does Generation Take? When Will It Be Available?
A while ago, Sora's core team gave an interview that revealed a lot of previously unshared information. I listened back to the recording four times, put together a verbatim English transcript, and translated it into Chinese.
Host:
First of all, thank you guys for joining me. I imagine you're super busy, so this is much appreciated. If you don't mind, could you go one more time and give me your names and your roles at OpenAI?
Bill Peebles:
My name is Bill Peebles. I'm a lead on Sora here at OpenAI.
Tim Brooks:
My name is Tim Brooks. I'm also a research lead on Sora.
Aditya Ramesh:
I'm Aditya. I'm also a lead on the Sora team.
Host:
Okay, so I've reacted to Sora. I saw the announcement, the website, and all those prompts and example videos you guys shared, and it was super impressive. Can you give me a concise breakdown of how exactly it works? We've explained DALL-E and diffusion before, but how does Sora make videos?
Bill Peebles:
Yeah, at a high level, Sora is a generative model. There have been a lot of very cool generative models over the past few years, ranging from language models like the GPT family to image generation models like DALL-E.
Sora is a video generation model, which means it looks at a lot of video data and learns to generate photorealistic videos. The exact way it does that draws on techniques from both diffusion-based models like DALL-E and large language models like the GPT family; it sits somewhere in between. It's trained like DALL-E, but architecturally it looks more like the GPT family. At a high level, it's trained to generate videos of the real world, of digital worlds, and of all kinds of content.
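As a rough illustration of what "trained like DALL-E but shaped like GPT" could look like, here is a minimal, hypothetical sketch in which a transformer denoises a sequence of noisy patch tokens under a simple diffusion-style objective. The module, the shapes, and the linear noise schedule are assumptions made for the example, not Sora's actual architecture, and conditioning on the text prompt and the noise level is omitted for brevity.

```python
# Hypothetical sketch only: a transformer that denoises patch tokens,
# trained with a simple diffusion-style objective. Not OpenAI's code.
import torch
import torch.nn as nn

class PatchTransformer(nn.Module):
    def __init__(self, dim=512, depth=8, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.out = nn.Linear(dim, dim)  # predicts the noise added to each token

    def forward(self, tokens):          # tokens: (batch, num_patches, dim)
        return self.out(self.backbone(tokens))

def diffusion_training_step(model, patch_tokens, optimizer):
    """Add noise to the patch tokens, have the model predict it back."""
    t = torch.rand(patch_tokens.shape[0], 1, 1)          # random noise level per clip
    noise = torch.randn_like(patch_tokens)
    noisy = (1 - t) * patch_tokens + t * noise           # toy linear noise schedule
    loss = nn.functional.mse_loss(model(noisy), noise)   # standard denoising loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```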
Host:
It creates a huge variety of stuff, kind of the same way the other models do, based on what it's trained on. What is Sora trained on?
Tim Brooks:
We can't go into much detail on it, but it's trained on a combination of data that's publicly available as well as data that OpenAI has licensed.
One innovation in creating Sora was enabling it to train on videos of different durations, as well as different aspect ratios and resolutions. That's really new: previously, when people trained an image or video generation model, they would typically train it at a very fixed size, for example only one resolution.
What we do is take images as well as videos of all kinds, wide aspect ratios, tall videos, long videos, short videos, high resolution, low resolution, and we turn them all into small pieces we call patches.
Then we're able to train on videos with different numbers of patches, depending on the size of the input. That allows our model to be really versatile: it can train on a wider variety of data, and it can generate content at different resolutions and sizes.
Host:
You've had access to using it, building it, and developing it for some time now. And obviously, or maybe not obviously, there are a ton of variables with video. I make videos; I know there's lighting, reflections, all kinds of physics and moving objects involved. What have you found that Sora, in its current state, is good at? And are there things that are specific weaknesses? I'll show the video I asked for in a second, where there are six fingers on one hand. What have you seen as the particular strengths and weaknesses of what it's making?
Tim Brooks:
It definitely excels at photorealism, which is a big step forward. And the fact that the videos can be so long, up to a minute, is really a leap from what was previously possible. But some things it still struggles with. Hands in general are a pain point, as you mentioned, but also some aspects of physics: in one of the examples with the 3D printer, you can see it doesn't quite get that right. And if you ask for something really specific, like a camera trajectory over time, it has trouble with that. So some aspects of physics, and of the motion and trajectories that happen over time, it still struggles with.
Host:
It's really interesting to see the stuff it does well, because like you said, there are those examples of really good photorealism with lighting and reflections, and even close-ups and textures. And just like with DALL-E, you can give it styles like "shot on 35mm film" or "shot on a DSLR with a blurry background". There's no sound in these videos, though. I'm super curious whether it would be a gigantic extra lift to add sound, or whether it's more complicated than I'm realizing. How far do you feel you are from being able to have AI-generated sound in an AI-generated video?
Bill Peebles:
It's hard to give exact timelines for these kinds of things. For this first version, we were really focused on pushing the capabilities of video generation models forward, because before this, a lot of AI-generated video was around four seconds long at a pretty low frame rate, and the quality wasn't great. So that's where a lot of our effort has gone so far. We definitely agree that adding in these other kinds of content would make videos way more immersive, so it's something we're thinking about. But right now, Sora is mainly a video generation model, and we've been focused on pushing the capabilities in that domain.
Host:
Okay, so DALL-E has improved a lot over time; it's gotten better in a lot of ways, and you're constantly working toward making Sora better. First of all, how did you get to the point where it was good enough that you knew it was ready to share with the world, and we had this mic-drop moment? And then how do you decide how to keep moving forward and what to make it better at?
Tim Brooks:
A big motivation for us, really the motivation for putting Sora out in this blog-post form even though it's not yet ready as a product, is to get feedback: to understand how this could be useful to people, and what safety work needs to be done. That will really set our research roadmap moving forward. So it's not currently a product; it's not available in ChatGPT or anything, and we don't have any timelines for when we would turn it into a product. Right now we're in the feedback-gathering stage. We'll definitely be improving it, but how we should improve it is still an open question. We wanted to show the world this technology that's on the horizon and start hearing from people: how could this be useful to you, from safety experts how we could make it safe for the world, and from artists how it could fit into their workflows. That's really going to set our agenda moving forward.
Host:
What have you heard so far?
Tim Brooks:
One piece of feedback we've definitely heard is that people are interested in having more detailed controls. Right now you have this fairly short prompt, but people really want more control over exactly the content that's generated, so that's definitely one thing we'll be looking into. It will be an interesting direction moving forward.
Host:
Interesting. I can imagine just wanting to make sure it's widescreen, or vertical, or well-lit, something like that, so you don't have to worry about prompt engineering. Okay, so since you've been working on this stuff for a long time: is there a future where you can generate a video that's indistinguishable from a normal video? That's how DALL-E feels like it has evolved over time, where you can ask for a photorealistic picture and it can make it. Is that something you could imagine actually being possible? I'd guess probably yes, because we've already seen it do so much.
Aditya Ramesh:
Eventually I think it's going to be possible, but as we approach that point we want to be careful about releasing these capabilities, so that people on social media are aware of when a video they see could be real or fake, and of when it comes from a trusted source. We want to make sure these capabilities aren't used in ways that could spread misinformation.
Host:
I saw there's a watermark in the bottom corner of Sora-generated videos, which is obviously pretty important, but a watermark like that can be cropped out. I'm curious whether there are other ways you think about easily identifying AI-generated videos, especially from a tool like Sora.
Aditya Ramesh:
For DALL·E 3, we trained provenance classifiers that can tell whether an image was generated by the model or not. We're working on adapting that technology to work for videos as well. That won't be a complete solution in and of itself, but it's a first step.
Host:
Got it. Kind of like metadata, or a sort of embedded flag, so that if you play with that file, you know it's AI-generated.
Aditya Ramesh:
C2PA does that, but the classifier we trained can be run on any image or video, and it tells you whether it thinks the media was generated by one of our models or not.
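The difference between the two approaches can be pictured roughly as follows. This is a purely hypothetical illustration: the C2PA check is reduced to a byte search and the classifier to a toy linear model, and neither reflects OpenAI's real tooling.

```python
# Hypothetical contrast between embedded provenance metadata and a
# pixel-based provenance classifier; neither reflects OpenAI's tools.
import numpy as np

def has_c2pa_manifest(file_bytes: bytes) -> bool:
    """C2PA-style check: provenance travels with the file as signed metadata,
    so it is lost if the file is cropped, re-encoded, or screenshotted."""
    return b"c2pa" in file_bytes                 # stand-in for real manifest parsing

def provenance_score(frames: np.ndarray, weights: np.ndarray) -> float:
    """Classifier-style check: a trained model looks only at the pixels and
    scores how likely the media is to come from a known generator, so it
    still works on stripped copies. A toy linear model stands in for it here."""
    features = frames.mean(axis=(0, 1, 2))       # toy per-channel statistics
    logit = float(features @ weights)
    return 1.0 / (1.0 + np.exp(-logit))          # probability-like score

video = np.random.rand(16, 64, 64, 3)            # fake clip: frames, H, W, RGB
print(has_c2pa_manifest(b"...no manifest here..."))
print(provenance_score(video, weights=np.array([0.2, -0.1, 0.05])))
```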
Host:
Got it. What I'm also curious about is your own reaction. You obviously had to get to the point where Sora comes out and you think it's ready for the world to see what it's capable of. What's been your reaction to other people's reactions to Sora? There's a lot of "this is super cool, this is amazing", but there's also a lot of "oh my God, my job is in danger." How do you digest all the different ways people react to this thing?
Aditya Ramesh:
A lot of the reception definitely included some anxiety about what's going to happen next, and we feel that in terms of our mission: to make sure this technology is deployed in a safe way, in a way that's responsible toward all the things people are already doing with video. But I also see a lot of opportunity. Right now, for example, if a person has an idea for a movie they want to produce, it can be really difficult to get funding because the budgets are so large; production companies have to weigh the risk of the investment they're making. One way I think AI could help is by drastically lowering the cost of going from an idea to a finished video.
Host:
Yeah, there are a lot of parallels with DALL·E in the way I think people are going to use it. When DALL·E got really good, I could use it as a brainstorming tool, or to visualize a thumbnail for a video, for example. I could see a lot of the same kinds of use cases being particularly great with Sora. I know you're not giving timelines, but you're in the testing phase now. Do you think it will be available for public use any time soon?
Aditya Ramesh:
Not any time soon, I think.
Host:
I guess my last question is: way down the road, when Sora is making five-minute YouTube videos with sound and perfect photorealism, what medium makes sense to dive into next? Photos are one thing, but videos add this whole dimension of time and physics, and all these new variables like reflections and sound. You jumped into this faster than I expected. What's next on the horizon for AI-generated media in general?
Tim Brooks:
Something I'm really excited about is how the use of AI tools will evolve toward creating completely new kinds of content, and I think a lot of that will come from us learning how people use these tools to do new things. It's easy to think about how they could be used to recreate things that already exist, but I actually think they'll enable completely new types of content. It's hard to know what that is until the tools are in the hands of the most creative people, but really creative people with new tools do amazing things; they make things that weren't previously possible. That's what really motivates me. Long term, it's about how this could turn into completely new experiences in media that we're not even thinking about today. It's hard to picture exactly what that is, but pushing the creative boundaries, and letting really creative people push them with completely new tools, is going to be really exciting.
Host:
Yeah, it's interesting. I feel like, since it's trained on existing content, it can only produce things based on what already exists. The only way to get it to be creative is with your prompt, I imagine: you have to get clever with prompt engineering and figure out what to say to it. Is that accurate?
Bill Peebles:
The model has other cool capabilities beyond just text-based prompting. In the research post we released with Sora, we had an example of blending between two input videos. In one really cool case, the video on the left starts out as a drone flying through the Colosseum, and it gradually transitions into the video on the right, a butterfly swimming underwater. At one point the Colosseum gradually begins decaying and looks as if it's covered in coral reefs and partially underwater. These kinds of generated videos really start to feel new relative to what's been possible in the past with older forms of technology, so beyond just prompting, we're excited about these being new experiences people can create with technology like Sora.
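One simple way to picture that blending capability is to interpolate between the two clips with a weight that shifts from the first clip to the second over time. The sketch below does this directly on pixel arrays for clarity, which is an assumption; the effect described in the Sora report presumably operates on the model's internal representations rather than on raw pixels.

```python
# Illustrative only: blend two clips by interpolating them with a weight
# that moves from the first clip to the second over the course of the video.
import numpy as np

def blend_over_time(clip_a: np.ndarray, clip_b: np.ndarray) -> np.ndarray:
    """clip_a, clip_b: arrays of shape (frames, height, width, channels).
    Early frames follow clip_a, late frames follow clip_b."""
    frames = min(len(clip_a), len(clip_b))
    weights = np.linspace(0.0, 1.0, frames).reshape(-1, 1, 1, 1)
    return (1 - weights) * clip_a[:frames] + weights * clip_b[:frames]

colosseum = np.random.rand(48, 64, 64, 3)   # stand-in for the drone clip
butterfly = np.random.rand(48, 64, 64, 3)   # stand-in for the underwater clip
print(blend_over_time(colosseum, butterfly).shape)   # (48, 64, 64, 3)
```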
Aditya Ramesh:
In some ways, we really see modeling reality as the first step toward being able to transcend it.
Host:
Wow, I like that, it's really interesting. The better it's able to model reality, the faster you can build on top of it, and ideally that unlocks new creative possibilities as a tool, and all kinds of other things. Super cool! Well, I'll leave it open if there's anything else you want people to know. Obviously you've been working on this longer than anyone else has gotten to see it or play with it. What else do you want the world to know about Sora and OpenAI?
Tim Brooks:
Another thing we're excited about is how learning from video data will make AI more broadly useful than just for creating videos. We live in a world that we see, almost like a video we're watching, and there's a lot of information about the world that isn't in text. Models like GPT are really intelligent and understand a lot about the world, but there's information they miss when they don't see the visual world the way we do. So one thing we're excited about for Sora, and for future AI models that build on it, is that by learning from visual data they will hopefully have a better understanding of the world we live in, and in the future be able to help us better because they understand things better.
Host:
That is super cool. I imagine there's a lot of compute and a lot of talented engineering that goes into that, so I wish you guys the best of luck. And eventually, when I'm able to plug more stuff into Sora, I'm very excited for that moment too. So keep me posted.
Bill Peebles:
Will do.
Host:
Thank you.
OpenAI Team:
Thanks.
A million years later…
Host:
One more fun fact I forgot to ask about during the recording, but everyone wants to know: how long does it take Sora to generate a video from a single prompt? I did ask them off camera, and the answer was: it depends, but you could go get a coffee, come back, and it would still be working on the video. So "a while" seems to be the answer.
Author: 赛博禅心
WeChat official account: 赛博禅心
This article was translated and published by @赛博禅心 on 人人都是产品经理. Reproduction without the author's permission is prohibited.
Header image from Unsplash, under the CC0 license.
The views in this article represent only the author's; the 人人都是产品经理 platform provides information storage services only.