timr ·89 days ago
For all of the hype around LLMs, this general area (image generation and graphical assets) seems to me to be the big long-term winner of current-generation AI. It hits the sweet spot given the fundamental limitations of the methods:
* so-called "hallucination" (actually just how generative models work) is a feature, not a bug.
* anyone can easily see the unrealistic and biased outputs without complex statistical tests.
* human intuition is useful for evaluation, and not fundamentally misleading (i.e. the equivalent of "this text sounds fluent, so the generator must be intelligent!" hype doesn't really exist for imagery. We're capable of treating it as technology and evaluating it fairly, because there's no equivalent human capability.)
* even lossy, noisy, collapsed and over-trained methods can be valuable for different creative pursuits.
* perfection is not required. You can easily see distorted features in output, and iteratively try to improve them.
* consistency is not required (though it will unlock hugely valuable applications, like video, should it ever arrive).
* technologies like LoRA allow even unskilled users to train character-, style- or concept-specific models with ease.
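As a concrete illustration of that last bullet, here is a minimal NumPy sketch of the LoRA idea (all dimensions and the scaling factor are made-up illustration values, not from any particular model): the full weight matrix stays frozen, and only a small low-rank update is trained, which is why fine-tuning becomes cheap enough for unskilled users.

```python
import numpy as np

# Hypothetical dimensions for a single projection matrix (illustrative only).
d_in, d_out, rank = 768, 768, 8

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))   # frozen base weight, never updated

# LoRA trains only the low-rank factors A and B instead of a full matrix.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))              # zero-init, so training starts exactly at W
alpha = 16.0                             # illustrative scaling factor

def forward(x):
    # Effective weight is W + (alpha / rank) * B @ A, applied without merging.
    return W @ x + (alpha / rank) * (B @ (A @ x))

full_params = d_out * d_in               # what full fine-tuning would update
lora_params = rank * (d_in + d_out)      # what LoRA actually trains
print(f"full: {full_params}, LoRA: {lora_params} "
      f"({100 * lora_params / full_params:.1f}% of full)")
```

With these toy numbers the trainable parameter count drops to about 2% of the full matrix, which is the core reason per-character or per-style adapters are practical on consumer hardware.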
I've been amazed at how much better image/visual generation models have become in the last year, and IMO the pace of improvement has not been slowing as much as it has for text models. Moreover, it's becoming increasingly clear that the future isn't the wholesale replacement of photographers, cinematographers, etc., but rather a generation of crazy AI-based power tools that can do things like add and remove concepts from imagery with a few text prompts. It's insanely useful, and just like Photoshop in the 90s, a new generation of power users is already emerging and doing wild things with the tools.
woolion ·88 days ago
This is the third image-to-3D AI I've tested, and in all cases the examples they give already look like 2D renders of 3D models. My tests were with cel-shaded images (cartoony, without realistic lighting), and the model outputs something very flat with very bad topology, which is worse than starting from a low-poly base or extruding the drawing. I suspect it can't give decent results without accurate shading from which the normal vectors could be recomputed, and thus lacks any 'understanding' of what the structure would be from the lines and forms alone.
In any case it would be cool if they specified the set of inputs that is expected to give decent results.
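The hypothesis above can be illustrated with a toy Lambertian shading model (a single directional light and made-up normals, purely for illustration): smooth shading encodes the surface normal in the pixel intensity, but cel shading quantizes intensity into flat bands, so two quite different normals can become indistinguishable.

```python
import numpy as np

light = np.array([0.0, 0.0, 1.0])  # directional light pointing at the surface

def lambert(n):
    # Smooth Lambertian shading: intensity varies continuously with the normal,
    # so the normal can (partly) be recovered from the intensity.
    return max(0.0, float(n @ light))

def cel(intensity, bands=2):
    # Cel shading: intensity is flattened into a few discrete bands.
    return np.floor(intensity * bands) / bands

n1 = np.array([0.0, 0.6, 0.8])     # two clearly different unit normals
n2 = np.array([0.5, 0.0, 0.866])

print(lambert(n1), lambert(n2))             # distinct intensities
print(cel(lambert(n1)), cel(lambert(n2)))   # same flat band: the shading cue is gone
```

Under smooth shading the two normals produce different intensities; after cel quantization they land in the same band, which is consistent with the model having little to work with on flat, cartoony inputs.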
talldayo ·89 days ago
> 0.5 seconds per 3D asset generation on a GPU with 7GB VRAM
Holy cow - I was thinking this might be one of those datacenter-only models but here I am proven wrong. 7GB of VRAM suggests this could run on a lot of hardware that 3D artists own already.
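Back-of-the-envelope arithmetic on that figure (assuming fp16 weights and counting weights only, ignoring activations and framework overhead) suggests why it fits on consumer cards:

```python
vram_bytes = 7 * 1024**3    # 7 GiB of VRAM
bytes_per_param = 2         # fp16 / bf16 weights
max_params = vram_bytes // bytes_per_param
print(f"~{max_params / 1e9:.2f}B parameters fit in weights alone")  # ~3.76B
```

In practice activations and buffers eat a sizable share of that budget, so the actual model is well under this ceiling, but the point stands: this is mid-range gaming GPU territory, not datacenter territory.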
puppycodes ·88 days ago
I really can't wait for this technology to improve. Unfortunately, just from testing it, it doesn't seem very useful yet. It takes more work to fix the bad model it approximates from the input image than to start from scratch with a good foundation. I would rather see something that takes a series of steps to reach a higher-quality end product more slowly, instead of expecting everything to come from one image. Perhaps I'm missing the use case?
fsloth ·88 days ago
I see these as usable not as main assets, but as low-effort embellishments that add complexity to the main scene. The fact that they maintain their profile makes them usable in situations where a mere 2D billboard impostor (i.e. the original image always oriented towards the camera) would not cut it.
You can totally create a figure image (Midjourney|Bing|Dalle3), drag and drop it into the image input, and get a surprisingly good 3D representation. It's not a highly detailed model, but it's something you could very well put on a shelf in a 3D scene as an embellishment, where the camera never sees its back and the model is never the center of attention.
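For reference, the billboard impostor described above can be sketched in a few lines (NumPy, yaw-only rotation about the world up axis; the function name is illustrative): the quad is rotated so its front always faces the camera, which is exactly why it breaks down whenever the viewer should be able to see the object's side.

```python
import numpy as np

def billboard_rotation(obj_pos, cam_pos):
    """Rotation that spins a quad about the world up (y) axis to face the camera.

    This is the classic 2D billboard impostor: the sprite only ever shows its
    front, so any viewpoint that should reveal its side or back looks wrong.
    """
    to_cam = cam_pos - obj_pos
    angle = np.arctan2(to_cam[0], to_cam[2])  # yaw toward the camera
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[  c, 0.0,   s],
                     [0.0, 1.0, 0.0],
                     [ -s, 0.0,   c]])

R = billboard_rotation(np.array([0.0, 0.0, 0.0]), np.array([3.0, 0.0, 4.0]))
facing = R @ np.array([0.0, 0.0, 1.0])  # quad's forward axis after rotation
print(facing)  # points along the camera direction (0.6, 0, 0.8)
```

A generated 3D mesh, even a rough one, has no such constraint: it keeps its silhouette from oblique angles, which is the property being valued here.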