OPENAI-SORA+技术文档总结+中英对照原稿.docx


OpenAI Sora technical report: original text, translation, and summary

Key points

Model approach:
1. The architecture is a diffusion model combined with a transformer.
2. For training, a pretrained model first encodes a large number of source videos of varying sizes into a unified patch representation; the extracted spacetime elements serve as the transformer's tokens.
3. The model's strong results are closely tied to a very large dataset and more compute time.

Strengths:
1. Coherent characters and backgrounds: even when a character moves out of the camera's view and comes back, it keeps the same features.
2. A high level of natural-language understanding.
3. Videos of different sizes (landscape or portrait) can be generated from the same seed to suit different devices.
4. It can generate up to 1 minute of high-definition video.
5. Text, images, and video can all serve as control inputs for the output.

Weaknesses:
1. A weak grasp of physical rules: for example, a candle does not go out after being blown on, left and right are confused, and dropped glass does not shatter.
2. High compute requirements (a guess).

What it can do:
1. Text-to-video, image-to-video, image-plus-text-to-video, and video modification.
2. Video restyling, video extension, and video completion.

Outlook:
1. A reshuffle of the AI video generation industry.
2. The ceiling of diffusion models is higher than imagined.
3. Global consistency can be solved.
4. Text-to-3D generation may see a breakthrough.
5. Potential for new applications in AR, VR, and Vision Pro.

Expert view: thoughts from 路思勉 of Tsinghua University's 叉院 (Institute for Interdisciplinary Information Sciences) after reading the technical report:
1. The ceiling of the diffusion generation framework is far higher than we previously imagined (费可能已经学了). Make diffusion great again! This is a shot in the arm for diffusion researchers. In mathematical theory, diffusion can also fit almost any data distribution, including continuous real-world video.
2. Scale is all you need. Once scale goes up, video generation can show emergent phenomena similar to those seen in LLMs, including video coherence, 3D consistency, and long-range coherence.
3. Physics priors and the like may not need to be introduced separately; scaled data is enough.

Original report: https://openai.com/research/video-generation-models-as-world-simulators

Video generation models as world simulators

We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.

This technical report focuses on (1) our method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) qualitative evaluation of Sora's capabilities and limitations. Model and implementation details are not included in this report.

Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks [1,2,3], generative adversarial networks [4,5,6,7], autoregressive transformers [8,9] and diffusion models [10,11,12]. These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a generalist model of visual data: it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high definition video.

Turning visual data into patches

We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data [13,14]. The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text: code, math and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data [15,16,17,18]. We find that patches are a highly-scalable and effective representation for training generative models on diverse types of videos and images.

At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space [19], and subsequently decomposing the representation into spacetime patches.

Video compression network

We train a network that reduces the dimensionality of visual data [20]. This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.

Spacetime latent patches

Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens. This scheme works for images too since images are just videos with a single frame. Our patch-based representation enables Sora to train on videos and images of variable resolutions, durations and aspect ratios. At inference time, we can control the size of generated videos by arranging randomly-initialized patches in an appropriately-sized grid.
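
The report does not disclose patch sizes or the shape of the latent, so the following is only a minimal sketch of how a compressed video could be cut into spacetime patches and flattened into tokens. The function name `patchify`, the `(T, H, W, C)` latent layout, and the patch extents `pt`, `ph`, `pw` are all assumptions made for illustration, not Sora's actual implementation.

```python
import numpy as np


def patchify(latent, pt, ph, pw):
    """Split a latent video of shape (T, H, W, C) into flattened spacetime patches.

    Hypothetical helper: returns an array of shape
    (num_patches, pt * ph * pw * C), one row per transformer token.
    """
    T, H, W, C = latent.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    # Split each axis into (number of patches, extent of one patch).
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Move the three patch-index axes to the front and patch contents to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per patch: this is the token sequence.
    return x.reshape(-1, pt * ph * pw * C)


rng = np.random.default_rng(0)

# An assumed latent video: 8 latent frames of 32x48 with 4 channels.
video_latent = rng.normal(size=(8, 32, 48, 4))
print(patchify(video_latent, pt=2, ph=4, pw=4).shape)  # (384, 128)

# An image is treated as a video with a single frame.
image_latent = rng.normal(size=(1, 32, 48, 4))
print(patchify(image_latent, pt=1, ph=4, pw=4).shape)  # (96, 64)
```

Because the token count simply follows from the latent's size, inputs of different durations and aspect ratios produce sequences of different lengths without any resizing or cropping.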

Scaling transformers for video generation

Sora is a diffusion model [21,22,23,24,25]; given input noisy patches (and conditioning information like text prompts), it's trained to predict the original "clean" patches. Importantly, Sora is a diffusion transformer [26]. Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling [13,14], computer vision [15,16,17,18], and image generation [27,28,29].

In this work, we find that diffusion transformers scale effectively as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as training progresses. Sample quality improves markedly as training compute increases.
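
The report gives no noise schedule or loss function, so the following is a minimal sketch of the "noisy patches in, clean patches out" training setup under stated assumptions: a variance-preserving mix of signal and Gaussian noise, and a mean-squared-error loss. The names `make_training_pair` and `denoising_loss` are hypothetical, and the "model" here is a placeholder that returns its input unchanged.

```python
import numpy as np


def make_training_pair(clean_tokens, noise_level, rng):
    """Corrupt clean patch tokens with Gaussian noise (assumed schedule).

    noise_level lies in [0, 1]: 0 keeps the tokens clean, 1 yields pure noise.
    Returns (noisy_tokens, target), where the target is the clean tokens.
    """
    noise = rng.normal(size=clean_tokens.shape)
    noisy = np.sqrt(1.0 - noise_level) * clean_tokens + np.sqrt(noise_level) * noise
    return noisy, clean_tokens


def denoising_loss(predicted_clean, clean_tokens):
    """Mean squared error between the prediction and the clean tokens (assumed loss)."""
    return float(np.mean((predicted_clean - clean_tokens) ** 2))


rng = np.random.default_rng(0)
clean = rng.normal(size=(384, 128))  # e.g. 384 patch tokens of width 128

noisy, target = make_training_pair(clean, noise_level=0.5, rng=rng)

# Placeholder "model": returns the noisy input unchanged. A trained diffusion
# transformer would also take the noise level and the text conditioning.
prediction = noisy
print(denoising_loss(prediction, target))
```

In a real training loop the noise level would be sampled per example and the loss would drive gradient updates of the transformer.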

Variable durations, resolutions, aspect ratios

Past approaches to image and video generation typically resize, crop or trim videos to a standard size, e.g., 4 second videos at 256x256 resolution. We find that instead training on data at its native size provides several benefits.

Sampling flexibility

Sora can sample widescreen 1920x1080p videos, vertical 1080x1920 videos and everything in between. This lets Sora create content for different devices directly at their native aspect ratios. It also lets us quickly prototype content at lower sizes before generating at full resolution, all with the same model.
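
To make the idea of choosing the output size by laying out noise patches more concrete, here is a minimal sketch. The number of frames and pixels covered by one patch (`frames_per_patch`, `pixels_per_patch`), the token width, and the rounding-up of partial patches are all assumed values chosen for illustration; the report does not state them.

```python
import math

import numpy as np


def grid_shape(frames, height, width, frames_per_patch=4, pixels_per_patch=16):
    """Number of patches along (time, height, width) for a requested output size.

    The per-patch coverage values are assumptions. Sizes that are not exact
    multiples are rounded up, which would imply padding.
    """
    return (
        math.ceil(frames / frames_per_patch),
        math.ceil(height / pixels_per_patch),
        math.ceil(width / pixels_per_patch),
    )


def init_noise_tokens(shape, token_dim=64, seed=0):
    """Gaussian-noise tokens arranged in a (time, height, width) grid."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(*shape, token_dim))


print(grid_shape(16, 1080, 1920))  # widescreen: (4, 68, 120)
print(grid_shape(16, 1920, 1080))  # vertical:   (4, 120, 68)
print(grid_shape(1, 1024, 1024))   # one frame:  (1, 64, 64)

# A small low-resolution grid for quick prototyping.
tokens = init_noise_tokens(grid_shape(8, 256, 384))
print(tokens.shape)  # (2, 16, 24, 64)
```

Only the grid changes between the three cases; the same model would denoise whichever grid it is handed.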

Improved framing and composition

We empirically find that training on videos at their native aspect ratios improves composition and framing. We compare Sora against a version of our model that crops all training videos to be square, which is common practice when training generative models. The model trained on square crops (left) sometimes generates videos where the subject is only partially in view. In comparison, videos from Sora (right) have improved framing.

Language understanding

Training text-to-video generation systems requires a large amount of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3 [30] to videos. We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos.

Similar to DALL·E 3, we also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model. This enables Sora to generate high quality videos that accurately follow user prompts.

Prompting with images and videos

All of the results above and in our landing page show text-to-video samples. But Sora can also be prompted with other inputs, such as pre-existing images or video. This capability enables Sora to perform a wide range of image and video editing tasks: creating perfectly looping video, animating static images, extending videos forwards or backwards in time, etc.

Animating DALL·E images

Sora is capable of generating videos provided an image and prompt as input. Below we show example videos generated based on DALL·E 2 [31] and DALL·E 3 [30] images.

[Example: "A Shiba Inu dog wearing a beret and black turtleneck."]
[Example: "In an ornate, historical hall, a massive tidal wave peaks and begins to crash. Two surfers, seizing the moment, skillfully navigate the face of the wave."]

Extending generated videos

Sora is also capable of extending videos, either forward or backward in time. Below are four videos that were all extended backward in time starting from a segment of a generated video. As a result, each of the four videos starts different from the others, yet all four videos lead to the same ending.

We can use this method to extend a video both forward and backward to produce a seamless infinite loop.

Video-to-video editing

Diffusion models have enabled a plethora of methods for editing images and videos from text prompts. Below we apply one of these methods, SDEdit [32], to Sora. This technique enables Sora to transform the styles and environments of input videos zero-shot.
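
The general SDEdit recipe is to add noise to the input up to an intermediate level and then run the reverse diffusion process from there, guided by the new prompt. How it is wired into Sora is not described, so the following is only a sketch of that recipe under stated assumptions: a variance-preserving noising step, a linear sequence of noise levels, and a hypothetical `denoise_step` callable standing in for one reverse step of a trained, prompt-conditioned model. The toy step used in the demo simply pulls the sample toward a fixed target and is not a real denoiser.

```python
import numpy as np


def sdedit(input_latent, denoise_step, strength, num_steps, rng):
    """SDEdit-style editing sketch: partially noise the input, then denoise it.

    denoise_step(x, level, next_level) is a hypothetical callable for one
    reverse-diffusion step. strength in (0, 1] is the noise level at which
    denoising starts; lower values start closer to the unmodified input.
    """
    if not 0.0 < strength <= 1.0:
        raise ValueError("strength must be in (0, 1]")
    noise = rng.normal(size=input_latent.shape)
    # Jump to an intermediate noise level instead of starting from pure noise.
    x = np.sqrt(1.0 - strength) * input_latent + np.sqrt(strength) * noise
    # Walk the noise level down to zero in equal steps (assumed schedule).
    levels = np.linspace(strength, 0.0, num_steps + 1)
    for level, next_level in zip(levels[:-1], levels[1:]):
        x = denoise_step(x, level, next_level)
    return x


rng = np.random.default_rng(0)
source = rng.normal(size=(4, 8))  # stand-in for an input video latent
target = np.ones((4, 8))          # stand-in for "what the prompt asks for"


def toy_step(x, level, next_level):
    # Toy placeholder: move toward the target in proportion to the noise removed.
    return x + (target - x) * (level - next_level) / level


edited = sdedit(source, toy_step, strength=0.6, num_steps=5, rng=rng)
print(edited.shape)  # (4, 8)
```

With a real model, the starting noise level trades faithfulness to the input video against freedom to follow the new prompt.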

Connecting videos

We can also use Sora to gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. In the examples below, the videos in the center interpolate between the corresponding videos on the left and right.

Image generation capabilities

Sora is also capable of generating images. We do this by arranging patches of Gaussian noise in a spatial grid with a temporal extent of one frame. The model can generate images of variable sizes, up to 2048x2048 resolution.

Emerging simulation capabilities

We find that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people, animals and environments from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, etc.; they are purely phenomena of scale.

3D consistency. Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space.

Long-range coherence and object permanence. A significant challenge for video generation systems has been maintaining temporal consistency when sampling long videos. We find that Sora is often, though not always, able to effectively model both short- and long-range dependencies. For example, our model can persist people, animals and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video.

Interacting with the world. Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks.

Simulating digital worlds. Sora is also able to simulate artificial processes; one example is video games. Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity. These capabilities can be elicited zero-shot by prompting Sora with captions mentioning "Minecraft."

These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world, and the objects, animals and people that live within them.

Discussion

Sora currently exhibits numerous limitations as a simulator. For example, it does not accurately model the physics of many basic interactions, like glass shattering. Other interactions, like eating food, do not always yield correct changes in object state. We enumerate other common failure modes of the model, such as incoherencies that develop in long duration samples or spontaneous appearances of objects, in our landing page.

We believe the capabilities Sora has today demonstrate that continued scaling of video models is a promising path towards the development of capable simulators of the physical and digital world, and the objects, animals and people that live within them.
