Again, super early and anecdotal results, but so far yes. I'm running an experiment today where I'm swapping all of my "chain of thought" calls to just use Sonnet 3.5, and so far I'm seeing excellent and consistent results. I'm using it for coding, so it may not perform as well in other domains. But so far my vibes-based evals are passing with flying colors.
Wow, that's amazing. Yeah, I'd expect Claude to do particularly well in chain-of-thought calls - Anthropic seems to be embedding that functionality more and more in Claude. Saw that with the Golden Gate Bridge Claude, and now with Claude doing CoT thinking while tool calling.