it looks like there have already been attempts to pair small language models with retrieval-augmented generation. here is one such effort from 2024 that seems to have had some success.
perhaps it's possible to train a small language model only on code that does not require attribution for reuse (e.g. 0-clause bsd, mit no attribution, etc. [1]), and then have it work with a much larger body of code (i.e. including code that does require attribution) via retrieval-augmented generation.
the idea is not to replicate the full capabilities of what the major dark boxen provide atm, but rather to see what is possible, and it would be interesting to find out what kinds of tasks this sort of thing could be useful for. at least from the legal and ethical perspectives, one might be in the clear, because the parts that would require attribution would be traceable [2].
[1] i know there is much less code under these licenses than there is code that requires attribution.
[2] apparently source attribution works when using retrieval-augmented generation. or so i have read :)
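to make the traceability point concrete, here is a rough sketch of the retrieval side of the idea. everything in it is hypothetical: the tiny corpus, the source urls, and the keyword-overlap scoring are made up, and a real system would retrieve with embeddings over a large index. the point is just that each retrieved snippet keeps its source and license attached, so anything the model reuses from the context stays attributable.

```python
import re
from dataclasses import dataclass

@dataclass
class Snippet:
    code: str
    source: str   # hypothetical origin of the snippet
    license: str  # license that requires attribution on reuse

# toy stand-in for the larger, attribution-requiring corpus
CORPUS = [
    Snippet("def clamp(x, lo, hi):\n    return max(lo, min(x, hi))",
            "example.org/utils", "MIT"),
    Snippet("def mean(xs):\n    return sum(xs) / len(xs)",
            "example.org/stats", "Apache-2.0"),
]

def retrieve(query, corpus, k=1):
    # naive keyword-overlap scoring; a real system would use embedding search
    q = set(re.findall(r"[a-z]+", query.lower()))
    def score(s):
        return len(q & set(re.findall(r"[a-z]+", s.code.lower())))
    return sorted(corpus, key=score, reverse=True)[:k]

def build_context(query, corpus):
    # every retrieved snippet carries its source and license inline,
    # so reuse in the model's output is traceable back to the corpus
    hits = retrieve(query, corpus)
    context = "\n\n".join(
        f"# source: {s.source} (license: {s.license})\n{s.code}"
        for s in hits
    )
    return context, hits

context, hits = build_context("clamp a value between lo and hi", CORPUS)
print(context.splitlines()[0])  # prints the attribution line for the top hit
```

the small model trained on permissive code would then generate against this context, and the attribution lines give you a record of which licensed snippets were in scope for any given answer.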