I’ve had this idea for a long time. What if you let the player (optionally) voice act their own character? You give them the lines, let them record, and so on. A one-and-done sort of deal.
For multiplayer, I think the game should only download another player's voice if that player is your friend.
There will be in-game voice chat, so that wouldn't really be useful.
Dedicated custom gestures could contain audio, though. Of course, only if the server allows them. In a LAN game (which technically is a server, just with different defaults), you can honestly just trust people with their content, since they are usually in the same building as you.
Not to mention that people manage to make offensive MC skins, and you know how limited those are. If someone wants to make something offensive, they will find a way to do it.
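To make the "only if the server allows it" rule and the different LAN defaults a bit more concrete, here is a rough sketch in Python. Everything in it is made up for illustration: the `ServerSettings` fields, the default values, and the way the friends-only suggestion from earlier is folded in are all assumptions, not how the game actually works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSettings:
    # Hypothetical flags; the names and defaults are assumptions.
    allow_gesture_audio: bool = False  # assumed: public servers opt in
    friends_only_audio: bool = True    # assumed: strangers' audio is skipped


def lan_defaults() -> ServerSettings:
    # A LAN game is still a server, just with more trusting defaults.
    return ServerSettings(allow_gesture_audio=True, friends_only_audio=False)


def should_download_audio(settings: ServerSettings,
                          sender: str,
                          friends: set[str]) -> bool:
    """Decide whether to fetch the audio attached to a player's custom gesture."""
    if not settings.allow_gesture_audio:
        return False
    if settings.friends_only_audio and sender not in friends:
        return False
    return True
```

With these assumed defaults, a public server downloads nothing until the admin turns audio on, and then only from friends, while a LAN game accepts audio from everyone in the session.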
Most of that would be custom gestures or part of the model creation process. As for talking with NPCs, I do not think that will involve voice, because there is just way too much to take into account with all the dynamic item names and everything. The NPCs could only say a few basic lines if I allowed that.