@GaryMarcus @hendrycks @Yoshua_Bengio We recently tested all of the major LLMs on tic-tac-toe, modified chess, and a novel game. Even the top models failed: illegal moves, claiming a win when they had lost, etc.
https://www.linkedin.com/posts/srinipagidyala_%F0%9D%90%80%F0%9D%90%9D%F0%9D%90%9A%F0%9D%90%A9%F0%9D%90%AD-%F0%9D%90%A8%F0%9D%90%AB-%F0%9D%90%82%F0%9D%90%A8%F0%9D%90%A5%F0%9D%90%A5%F0%9D%90%9A%F0%9D%90%A9%F0%9D%90%AC%F0%9D%90%9E-you-asked-ugcPost-7463024667893678080-ry-S/?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAiQskBDwD97HygfnTfd5jikrmGP83UMa0