
Thursday, February 15, 2024

Thoughts on AI Video & Mixed Reality

🎥 OpenAI Sora

As you might have seen the other day, OpenAI's new text-to-video tool, Sora, was announced with a few demos [1]. At first glance this looks very impressive and a feat of AI engineering. My guess is that what makes it so good is "knowledge" of what the physics of a scene should be. The only detail I could find in their announcement is that it achieves the results by transforming the video data into "patches" and using a diffusion transformer to learn realistic visual patterns. I believe this is analogous to how LLMs learn, in that they take these "patches" and use them as building blocks to create new content.
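
To make the "patches" idea a little more concrete, here is a minimal sketch (my own illustration, not anything from OpenAI's write-up) of chopping a video tensor into spacetime patches, the visual analogue of text tokens; the patch sizes are made up:

```python
import numpy as np

# Hypothetical illustration: turn a video into "spacetime patches",
# the visual analogue of text tokens. Patch sizes are made up.
def video_to_patches(video, patch_t=4, patch_h=16, patch_w=16):
    """video: (T, H, W, C) array -> (num_patches, patch_t*patch_h*patch_w*C)."""
    T, H, W, C = video.shape
    # Trim so each dimension divides evenly by the patch size.
    T, H, W = T - T % patch_t, H - H % patch_h, W - W % patch_w
    video = video[:T, :H, :W]
    patches = (
        video.reshape(T // patch_t, patch_t,
                      H // patch_h, patch_h,
                      W // patch_w, patch_w, C)
             .transpose(0, 2, 4, 1, 3, 5, 6)   # group patch indices together
             .reshape(-1, patch_t * patch_h * patch_w * C)
    )
    return patches

# Example: 2 seconds of 24 fps, 256x256 RGB video.
video = np.random.rand(48, 256, 256, 3)
print(video_to_patches(video).shape)  # (12*16*16, 4*16*16*3) = (3072, 3072)
```

Each row would then be embedded and fed to the diffusion transformer, much like text tokens are fed to an LLM.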

The thing that gets me is that some of these scenes still seem to violate object permanence. You can see scenes where things appear out of nowhere, although very smoothly, or objects appear to merge and then re-emerge. My thinking is that this is because the inference of the next frame comes from learning the most likely position of an object for the next frame. This probably works well for slowly moving scenes, but for dynamic scenes with multiple objects it will likely generate wonky outputs. We have to remember that images and video are 2D light projections of our 3D world, and unless we store the light-field data, we are training on a compressed information set. For example, in the demo video of two pirate ships in a coffee mug (below), things don't look too bad until one of the ships (on the right) pivots and turns around. Then it looks like the ship is floating on top of the surface and moving in a way that is not natural.

It seems to me that what one needs is a physics engine that can quickly check nothing is violated and, if it is, feed back into Sora to make adjustments. I don't have the technical lexicon here to speak in standard diffusion/transformer terms, but is there a way to assign a position and momentum vector to each pixel in a video frame and then check with some physics engine that there are no violations? Or is this too computationally expensive to be useful?
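
I don't know how one would actually wire this up, but here is a crude sketch of the spirit of the idea: use optical flow as a stand-in for per-pixel velocity, and flag frames where that "velocity" changes implausibly fast from one frame to the next, i.e., where momentum isn't even approximately conserved. The threshold is arbitrary and this is nowhere near a real physics engine:

```python
import cv2
import numpy as np

# Rough sketch: per-pixel "momentum" check between consecutive frames.
# Optical flow approximates pixel velocity; a huge frame-to-frame change
# in velocity implies an implausible acceleration (teleporting/merging).
def motion_violations(frames, accel_threshold=20.0):
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    flows = [
        cv2.calcOpticalFlowFarneback(grays[i], grays[i + 1], None,
                                     0.5, 3, 15, 3, 5, 1.2, 0)
        for i in range(len(grays) - 1)
    ]
    violations = []
    for t in range(len(flows) - 1):
        # Change in per-pixel velocity between successive flow fields.
        accel = np.linalg.norm(flows[t + 1] - flows[t], axis=-1)
        if np.percentile(accel, 99) > accel_threshold:  # arbitrary cutoff
            violations.append(t + 1)
    return violations
```

Something like this would only catch gross violations, and running it per pixel at generation time is exactly the kind of compute cost I'm wondering about.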

🥽 AR/XR/VR in Science

I've also been thinking a lot about augmented, mixed, and virtual reality (AR/XR/VR) given Apple's release of the Vision Pro and the excitement around it. For one, I think Meta's Quest 3 should get a little bit more respect. When I read through what it does in comparison to the Vision Pro, other than the "premiumness" of the Vision Pro, they are very comparable.1

What I've been thinking about is where the true value is, at least for me. I obviously see the entertainment aspect of AR/XR/VR, but I'm more interested in how to use these tools to do science and education. The education aspect is pretty clear:

  • Create some kind of AR/XR/VR lab experience.
  • Trainees/students work through the lab in these spaces.

This is pretty valuable because each person can get direct guidance as they work through the steps and material. Prior to this, you would have to wait for the instructor to provide individual assistance, which is time consuming. Also, if you pair this AR/XR/VR environment with an LLM, you get real-time querying, for example:

Me: Do I mix 🧪 chemical A with B?

XR Agent: No, you need to mix chemical A with chemical D (points in XR environment to chemical D).

(DALL-E generated image: an XR scenario of a chemistry lab)
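
Here is a minimal sketch of how that interaction loop might be wired up. Everything in it is hypothetical (the lab manifest, the HIGHLIGHT convention, and the highlight_object call are stand-ins, not real headset or LLM SDK calls); the point is just that the headset's scene understanding gets passed to the LLM as context, and the LLM's answer gets rendered back into the scene:

```python
# Hypothetical XR lab-assistant loop. The lab manifest, the HIGHLIGHT
# convention, and highlight_object() are stand-ins, not real SDK calls.
LAB_MANIFEST = {"chemical A": "bench 1", "chemical B": "bench 2",
                "chemical D": "fume hood"}

def highlight_object(item: str) -> None:
    # Stand-in for a real XR call that would outline or point to the
    # named object in the student's view.
    print(f"[XR] highlighting {item}")

def answer_student(question: str, llm) -> str:
    # `llm` is any callable that takes a prompt string and returns text,
    # e.g., a wrapper around a chat-completion API.
    prompt = (
        "You are a chemistry lab assistant. Visible items and their "
        f"locations: {LAB_MANIFEST}. Answer the student's question and, "
        "if you refer to an item, end with HIGHLIGHT:<item>.\n"
        f"Student: {question}"
    )
    reply = llm(prompt)
    if "HIGHLIGHT:" in reply:
        highlight_object(reply.rsplit("HIGHLIGHT:", 1)[-1].strip())
    return reply
```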

That would be for augmented/mixed reality. You could also imagine a completely VR setting where a physical lab, along with all the physics and chemistry, is replicated. This would be akin to a digital twin. This is great because you can (1) repeat the lab over and over, (2) eliminate physical risk in dangerous scenarios, and (3) let lab partners participate from anywhere in the physical world. The major limiting factor here is the need for fully physics-driven digital twins. An example would be: if I mix two chemicals in a beaker in a hood, the result should look and behave according to real physics. I think we are getting there on the looks part via tools like the Unreal Engine; we now need to incorporate other physics like thermodynamics and quantum effects.2
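
As a toy example of the kind of physics I mean, here is a back-of-the-envelope sketch of what a digital-twin beaker could compute when a strong acid and a strong base are mixed: the enthalpy of neutralization tells you how much the solution should warm up, which the renderer could then turn into visible effects or a safety warning. The constants are standard textbook values; the rest is my own oversimplification:

```python
# Toy thermodynamics for a digital-twin beaker: strong acid + strong base.
# Simplifications: dilute aqueous solutions, density and heat capacity of water.
DELTA_H_NEUTRALIZATION = -57.1e3   # J per mol of water formed (textbook value)
WATER_CP = 4.184                   # J / (g * K)

def mixing_temperature_rise(vol_acid_L, conc_acid_M, vol_base_L, conc_base_M):
    moles_reacted = min(vol_acid_L * conc_acid_M, vol_base_L * conc_base_M)
    heat_released = -DELTA_H_NEUTRALIZATION * moles_reacted     # J
    mass_solution = (vol_acid_L + vol_base_L) * 1000.0          # g, ~water density
    return heat_released / (mass_solution * WATER_CP)           # K

# 100 mL of 1 M HCl into 100 mL of 1 M NaOH -> roughly a 7 K temperature rise.
print(f"{mixing_temperature_rise(0.1, 1.0, 0.1, 1.0):.1f} K")
```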

Closing remarks

It's going to be an exciting time to be alive and do science. I'm hoping that these resources and tools keep improving and don't stagnate because of some technical limitation, for example, Sora never being able to properly incorporate the known conservation laws of physics, or AR/XR/VR never being able to handle the compute demand to make these scenarios feel like "reality" 😆. My hope is that I can easily go back to school and learn how to do actual things by putting on a headset in the evening.

Footnotes


  1. I have not used either of these devices so my commentary is a bit unwarranted. 

  2. I don't mean simulating the quantum mechanics of, say, molecules. What I mean is reproducing the known physical results and behaviors. 

References

[1] Video generation models as world simulators. https://openai.com/research/video-generation-models-as-world-simulators (accessed February 16, 2024).

