@mcxfrank's (amazing) work also includes huge participatory data collection efforts. Don't miss them. Wordbank gives you the words children around the world know (at 16-30mo). https://wordbank.stanford.edu/
Training on a large amount of data (not only from children) sounds promising, but apparently the mismatch between what you see and what is said is too great.
