And when I say major limitations, I mean that foundationally it will need to change for it even to be possible. The current models being used haven't changed at the base level for over a decade, so as of right now we don't even know how to solve this problem, and if we solve this problem, whether the new model will solve the problems currently being solved. Some have wondered if we should put supervising AI around current AI to redo things over and over until it's right essentially, but current models are already prohibitively expensive for profit, so obviously not practical. So here are a few things.
1. Consistently spell words right in a composed image
2. Automatically manipulate an existing image or audio and change perspective on it without recomposing it from scratch
3. Actually count anything. When it gets it right, it's just guessing.