
> Writing a training loop for CLIP manually wound up with me banging against all sorts of strange roadblocks and missing bits of documentation, and I still don't have it working.

There is working training code for OpenCLIP: https://github.com/mlfoundations/open_clip

But training multi-modal text-to-image models is still a _very_ new thing in the software world. Even so, my experience has been that it's never been easier to get started on this stuff from the software POV. The hardware is the tricky bit (along with avoiding bandwidth bottlenecks on distributed systems).

That isn't to say there isn't code out there for training. Just that you're going to run into issues, and learning how to solve those issues as you encounter them is soon going to be a highly valuable skill.
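For what it's worth, the loss at the heart of a CLIP training loop is small. Here's a minimal PyTorch sketch, assuming you already have image/text embeddings from your two encoders. This is illustrative, not the open_clip implementation, and clip_loss is just a name I made up:

  import torch
  import torch.nn.functional as F

  def clip_loss(image_emb, text_emb, logit_scale):
      # Project both sets of embeddings onto the unit sphere
      image_emb = F.normalize(image_emb, dim=-1)
      text_emb = F.normalize(text_emb, dim=-1)

      # Cosine similarity of every image vs. every text in the batch,
      # scaled by a learned temperature (logit_scale)
      logits = logit_scale * image_emb @ text_emb.t()

      # The matching image/text pairs sit on the diagonal
      labels = torch.arange(logits.shape[0], device=logits.device)

      # Symmetric cross-entropy: image->text and text->image
      return (F.cross_entropy(logits, labels)
              + F.cross_entropy(logits.t(), labels)) / 2

logit_scale is the exponentiated learned temperature from the paper. Everything hard (the data pipeline, the enormous batch sizes, gathering embeddings across GPUs so the contrastive batch is global) lives outside this function, which is where the roadblocks tend to come from.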

edit:

I'm seeing in a sibling comment that you're hoping to train your own model from scratch on a single GPU. Currently, at least, scaling laws for transformers [0] mean that the only models that perform much of anything at all need a lot of parameters. The bigger the better - as far as we can tell.
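Very roughly, the Kaplan et al. language-model result (so take the exact exponent as illustrative rather than something measured for CLIP) is that loss falls off as a power law in parameter count N:

  L(N) ≈ (N_c / N)^0.076

At that exponent, halving the loss costs you several orders of magnitude more parameters, which is why single-GPU-scale models don't get far.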

Very simply: researchers start by making a model big enough to fill a single GPU. Then they replicate the model across hundreds or thousands of GPUs, but feed each replica a different shard of the data. Gradient updates are then synchronized across replicas, ideally overlapping communication with compute to avoid bottlenecks. This is referred to as data-parallel training.
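A minimal sketch of that pattern with PyTorch's DistributedDataParallel. Assumptions here: one process per GPU launched via torchrun, and a model whose forward() returns the loss directly (the dataset and model are placeholders):

  import os
  import torch
  import torch.distributed as dist
  from torch.nn.parallel import DistributedDataParallel as DDP
  from torch.utils.data import DataLoader
  from torch.utils.data.distributed import DistributedSampler

  def train(model, dataset, epochs=1):
      # One process per GPU; torchrun sets LOCAL_RANK, RANK, WORLD_SIZE
      dist.init_process_group("nccl")
      local_rank = int(os.environ["LOCAL_RANK"])
      torch.cuda.set_device(local_rank)

      model = DDP(model.cuda(local_rank), device_ids=[local_rank])

      # DistributedSampler gives each rank a disjoint shard of the data
      sampler = DistributedSampler(dataset)
      loader = DataLoader(dataset, batch_size=64, sampler=sampler)

      opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
      for epoch in range(epochs):
          sampler.set_epoch(epoch)  # reshuffle the shards each epoch
          for images, texts in loader:
              # Assumed: forward() computes and returns the loss
              loss = model(images.cuda(local_rank), texts.cuda(local_rank))
              opt.zero_grad()
              loss.backward()  # DDP all-reduces gradients across ranks here
              opt.step()

      dist.destroy_process_group()

Launched with something like "torchrun --nproc_per_node=8 train.py". DDP averages gradients across ranks during backward(), overlapping the all-reduce with the rest of the backward pass - which is exactly where the interconnect bandwidth mentioned above starts to matter.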

[0] https://www.lesswrong.com/tag/scaling-laws



All this horsepower deployed for image generation is interesting, but somebody wake me up when there's a Stable Diffusion for SQL, or when on-demand generative user interfaces are spun up on the fly to suit the task at hand.


Will do!



