I agree with you that our ability to create metarepresentations is a better explanation for our cognitive abilities than Gould’s byproduct idea. But the question remains: Why have we humans, as you say, “run away with” metarepresentations as no other animals have?
My conjecture is that we were enabled by the lucky coincidence of having both rapid, flexible vocal systems and rapid, flexible manipulators (hands). Together, these enabled our ancestors to communicate by “show and tell” in ways that no other known animals have been able to do nearly as well.
It’s still an open question for me whether our ability to create metarepresentations is just an extension of our language ability or something beyond it.