The question will be how long it takes to get there, and whether it will be quicker or cheaper to use complementary technologies like radar and lidar alongside cameras in the meantime.
Full autonomy will likely just take more advanced cameras than they have now (ones that more closely replicate the abilities of the human eye) and, as you said, AI that reaches human levels of driving intelligence. Both of those areas are showing progress, but it's slower than some might like, particularly when companies already have limited-scope autonomous vehicles operating in certain regions that rely on a combination of sensors in addition to cameras.
To even somewhat replicate human vision, I've heard a camera would need to combine what are currently multiple separate cameras: a wide-angle color camera, a narrow-angle high-resolution camera, and a high-frame-rate mono camera, all with the ability to dynamically adjust focus and aperture. Interestingly, a team of researchers at the University of Maryland created a camera called the Artificial Microsaccade-Enhanced Event Camera (AMI-EV) that uses a rotating wedge prism to mimic the microsaccades of the human eye. Paired with the right software, such a camera might be able to differentiate between objects and handle objects in motion much like the human eye does.

As camera tech improves and incorporates multiple-angle and multiple-frame-rate components along with things like AMI-EV, it's possible the hardware will eventually be up to the task. I'd anticipate those cameras will be quite expensive at first, though, and it will probably take time for them to become cost effective. Software will also probably improve greatly over the coming years. Weather will probably always be a challenge for vision-based systems, but more advanced cameras may be able to mitigate that.
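For anyone curious what "handling objects in motion" means for an event-style camera, here's a minimal Python sketch of the general event-camera principle: instead of streaming full frames, report only the pixels whose brightness changed enough. This is a toy illustration of the concept, not how AMI-EV itself works, and the threshold value is just an assumption.

```python
import numpy as np

# Toy illustration of the event-camera idea: report per-pixel brightness-change
# "events" rather than full frames. NOT the AMI-EV algorithm itself, just the
# general principle that this class of hardware builds on.

def brightness_events(prev_frame: np.ndarray, curr_frame: np.ndarray,
                      threshold: float = 0.15):
    """Return (row, col, polarity) events where log-intensity changed enough."""
    eps = 1e-3  # avoid log(0)
    delta = np.log(curr_frame + eps) - np.log(prev_frame + eps)
    rows, cols = np.nonzero(np.abs(delta) > threshold)
    polarity = np.sign(delta[rows, cols]).astype(int)  # +1 brighter, -1 darker
    return list(zip(rows.tolist(), cols.tolist(), polarity.tolist()))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_a = rng.random((4, 4))   # fake grayscale frame, values in [0, 1)
    frame_b = frame_a.copy()
    frame_b[1, 2] *= 2.0           # simulate a bright object moving in
    frame_b[3, 0] *= 0.3           # and something dimming
    print(brightness_events(frame_a, frame_b))
```

The appeal of this style of output is that static background produces almost no data, so the software only has to reason about what's actually changing in the scene.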
Tesla FSD 13, however, is a very impressive system using the types of cameras available today, and Tesla has some excellent software for it. It's a very good option for something short of true autonomy, and will likely remain one until camera hardware improves and AI advances allow for full autonomy. Until some of those newer vision-only technologies are available, I assume many companies will keep integrating multiple types of sensors in complementary and redundant roles to bridge the gap.
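To show what that complementary/redundant role can look like at the decision level, here's a deliberately simplified Python sketch: each sensor is treated as an independent vote, and the vehicle only acts when enough of them agree. The sensor names, thresholds, and confidence numbers are hypothetical, and real fusion stacks are far more sophisticated (probabilistic, tracking objects over time), but the redundancy idea is the same.

```python
from dataclasses import dataclass

# Simplified sketch of sensor redundancy: act on an obstacle only when
# independent sensors agree, so one blinded or noisy sensor (e.g. a camera
# in glare) is less likely to cause a missed obstacle or phantom braking.
# All names and numbers here are hypothetical.

@dataclass
class Detection:
    sensor: str        # "camera", "radar", or "lidar"
    obstacle: bool     # does this sensor think something is in the path?
    confidence: float  # 0.0 - 1.0

def should_brake(detections: list[Detection],
                 min_agreeing_sensors: int = 2,
                 min_confidence: float = 0.6) -> bool:
    """Brake only if enough distinct sensors independently report an obstacle."""
    agreeing = {d.sensor for d in detections
                if d.obstacle and d.confidence >= min_confidence}
    return len(agreeing) >= min_agreeing_sensors

if __name__ == "__main__":
    readings = [
        Detection("camera", obstacle=True, confidence=0.55),  # low: fog/glare
        Detection("radar",  obstacle=True, confidence=0.90),
        Detection("lidar",  obstacle=True, confidence=0.85),
    ]
    print(should_brake(readings))  # True: radar and lidar agree with high confidence
```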