Random Material Name Generator

wdnmd · October 19, 2021, 11:20am

Random, where the Game will make up random shit on the fly, give it randomized names made out of random Syllables, and you have no Idea what each Ore is until you experiment with it.
ore.7z (751.3 KB)

use lazy_static::lazy_static;
use rand::{prelude::ThreadRng, seq::SliceRandom, thread_rng};

lazy_static! {
    // Vowel-like fragments that can fill a 'v' slot.
    static ref VOVELS: Vec<&'static str> = vec![
        "i", "e", "u", "a", "ee", "ea", "ie", "ou", "er", "or", "o", "ir", "ur", "ear", "our", "ar",
        "al", "au", "ae", "oar", "oo", "ui", "ew", "a", "eigh", "ow", "oa", "oi", "oy", "eer",
        "ere", "are", "air", "oor",
    ];
    // Single letters that can fill a 'c' slot.
    static ref CONSTONANTS: Vec<&'static str> = vec![
        "b", "c", "d", "f", "g", "h", "j", "k", "l", "m", "n", "p", "q", "r", "s", "t", "v", "w", "x",
        "y", "z",
    ];
    // Syllable shapes: 'c' is a consonant slot, 'v' is a vowel slot.
    static ref TEMPLATE: Vec<&'static str> = vec!["cv", "vc", "cvc"];
}

// Build one syllable: pick a random template, then fill each slot at random.
fn gen_syllable(rng: &mut ThreadRng) -> String {
    let template = TEMPLATE.choose(rng).unwrap();
    let mut output = String::new();
    for c in template.chars() {
        if c == 'c' {
            output += *CONSTONANTS.choose(rng).unwrap();
        } else {
            output += *VOVELS.choose(rng).unwrap();
        }
    }
    output
}

// Glue `len` syllables together and append the "ite" ending.
fn gen_name(len: i32, rng: &mut ThreadRng) -> String {
    let mut name = String::new();
    for _ in 0..len {
        name += &gen_syllable(rng);
    }
    name + "ite"
}

fn main() {
    let mut rng = thread_rng();
    println!("{}", gen_name(2, &mut rng));
}

Gregorius · October 19, 2021, 1:41pm

Haha, nice one, there will be a blacklist of words based on the US and UK English Dictionary too.

Also it is “Consonants” and “Vowels”. Also since when is “R” a Vowel?

I might also remove certain Consonants and Vowels from the List, because not everyone can actually say them out loud. Such as how many Asian people can't do the rolling “R”, or how Germans like me can't do the “Th”. Not to mention the many ways you can ambiguously pronounce “Ch”.

Gregorius · November 7, 2021, 9:01pm

So I found this Index of /enwiki/latest/

There is a Title Dump File in there which is a bit less than 90MB in Size. I think that should be a sustainable Blacklist for “Names of things that exist in Real Life” to check against, while generating Words.

I also love how the first few lines contain Titles for Wikipedia Articles named “Fuck You And Then Some!”. XD

Gregorius · November 7, 2021, 11:33pm

I extracted the Archive of Title Names (87MiB) to a 330MiB large Text File, and then ran that through the following:

cat input | awk -F  "_" '!$3' | iconv -f utf8 -t ascii//TRANSLIT | tr -cd '[:alpha:]\n' | tr '[:upper:]' '[:lower:]' | tr -s 'a-z' | awk 'length($0)>3' | sort | uniq -u >> output

This shortened the File down from the original 16 Million to just 5.6 Million Lines and 72MiB.

Here is what each Pipe Segment is doing:

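1. cat input: open the File.
2. awk -F "_" '!$3': remove all lines that have 3 or more words in them. The Underscore is what Wikipedia uses instead of Spaces because URLs.
3. iconv -f utf8 -t ascii//TRANSLIT: Screw all this UTF-8 nonsense, I want easy to pronounce ASCII and transliterations!
4. tr -cd '[:alpha:]\n': Just kill all the other Stuff, I want only the Alphabet and \n in this File.
5. tr '[:upper:]' '[:lower:]': Make it all lowercase now!
6. tr -s 'a-z': Kill all repetitive Characters like double-L or so.
7. awk 'length($0)>3': After all this is trimmed down, everything 3 or shorter should get cut out entirely.
8. sort: Sort the whole thing.
9. uniq -u: Deduplicate Entries. Note that uniq -u keeps only the lines that occur exactly once, so any Entry that shows up more than once is dropped entirely; plain uniq (or sort -u) keeps one copy of each instead.
10. >> output: And append to the Output File!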
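The same per-title filtering can be sketched in Rust, in case it ever needs to run inside the game instead of in a shell. This is only a rough approximation: the names `normalize_title` and `build_blacklist` are made up for illustration, non-ASCII characters are simply dropped rather than transliterated the way iconv does it, and repeated entries are collapsed to a single copy.

```rust
use std::collections::BTreeSet;

/// Normalize one title line, or return None if the line should be dropped.
fn normalize_title(line: &str) -> Option<String> {
    // Titles use '_' instead of spaces; drop anything with a non-empty third word.
    if matches!(line.split('_').nth(2), Some(w) if !w.is_empty()) {
        return None;
    }
    let mut out = String::new();
    for ch in line.chars() {
        // Assumption: non-ASCII characters are dropped, not transliterated.
        if !ch.is_ascii_alphabetic() {
            continue;
        }
        let lower = ch.to_ascii_lowercase();
        // Squeeze runs of the same letter, like `tr -s 'a-z'`.
        if out.ends_with(lower) {
            continue;
        }
        out.push(lower);
    }
    // Keep only entries longer than three characters.
    if out.len() > 3 {
        Some(out)
    } else {
        None
    }
}

/// Sorted set of normalized titles with one copy of each entry.
fn build_blacklist<'a>(lines: impl IntoIterator<Item = &'a str>) -> BTreeSet<String> {
    lines.into_iter().filter_map(normalize_title).collect()
}

fn main() {
    let titles = ["Foo_Bar", "Fuck_You_And_Then_Some!", "Llama", "LAMA", "Ore"];
    for word in build_blacklist(titles) {
        println!("{}", word);
    }
}
```

A `BTreeSet` is used here because it sorts and deduplicates in one step, which covers the last pipe segments without a separate pass.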

exylic · November 11, 2021, 11:20pm

In standard American English, a final R is actually a rhotacized vowel sound. In standard British English, it’s just a vowel sound.

Gregorius · November 11, 2021, 11:55pm

Interesting. Though I actually will blacklist certain letters of the alphabet too.

Or consider them a collision, so you won't get “Foorium” and “Foolium”, since L and R sound the same in certain Languages, meaning these two would be considered identical Names.

So while the letters L and R can show up, in the boiled down collision check they will be considered the same letter. M and N would be the same too, just like C, K and Q.

H and Apostrophes would be entirely blacklisted.
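A minimal sketch of what such a boiled-down collision check could look like, assuming each name is reduced to a key and two names collide when their keys match. The function name `collision_key` is hypothetical, and dropping H and apostrophes from the key (rather than rejecting the name) is an assumption made here so that real-world words can still be compared.

```rust
/// Reduce a name to the key used for collision checks.
/// The displayed name keeps its letters; only the key merges them.
fn collision_key(name: &str) -> String {
    name.chars()
        .filter_map(|ch| match ch.to_ascii_lowercase() {
            // Assumption: H and apostrophes are dropped from the key.
            'h' | '\'' => None,
            // L and R count as the same letter.
            'l' | 'r' => Some('l'),
            // M and N count as the same letter.
            'm' | 'n' => Some('m'),
            // C, K and Q count as the same letter.
            'c' | 'k' | 'q' => Some('k'),
            other => Some(other),
        })
        .collect()
}

fn main() {
    for name in ["Foorium", "Foolium", "Kanite", "Qamite"] {
        println!("{} -> {}", name, collision_key(name));
    }
}
```

With this key, "Foorium" and "Foolium" both reduce to the same string, so the second one would be rejected.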

Consonant Clusters may also be avoided, because I don't want “ng”, “nk”, “kst”, “ch” or “sch” sounds to be forced onto people who can't pronounce those for shit.
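With syllable templates like cv, vc and cvc, gluing two syllables together can put two consonants side by side (cvc followed by cv, for example). One way to avoid that is to reject such names after generation. The sketch below assumes only a, e, i, o and u count as vowels and treats every other letter as a consonant; `is_vowel` and `has_consonant_cluster` are made-up names.

```rust
/// True for the five plain vowels; every other letter counts as a consonant here.
fn is_vowel(ch: char) -> bool {
    matches!(ch.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u')
}

/// True if two consonant letters ever stand next to each other.
fn has_consonant_cluster(name: &str) -> bool {
    let letters: Vec<char> = name.chars().filter(|c| c.is_ascii_alphabetic()).collect();
    letters
        .windows(2)
        .any(|pair| !is_vowel(pair[0]) && !is_vowel(pair[1]))
}

fn main() {
    for name in ["bakite", "bankite", "ekstite"] {
        println!("{}: cluster = {}", name, has_consonant_cluster(name));
    }
}
```

A generator could simply reroll whenever this returns true, or it could be made cluster-free by construction by only chaining templates whose edges do not both end and start with a consonant.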

Edit: The Double-O would also not be allowed, making the Foo example above a bit bad. And the “-ium” ending would be a special case only allowed for base Materials and nothing else.

(also because people love to misunderstand things: This is only for AUTOMATICALLY generated names, you can still manually name a Material or something with whatever weird UTF-8 Characters you can come up with. But you may need to expect Tofu)

The Game will also be able to mostly detect transliterations when people reference Ingame Objects, which might help with automatic Translation, and even may end up linking to the ingame knowledge database (which contains all Information your Player-Character happens to have about the subject of that made up Word).

My Plan for this is to at least work a little bit with Japanese, even though I highly doubt there will be more than a dozen people ever using that part of translation, and one might have to expect the Japanese side to at least know the Latin Alphabet.

Edit: The Wikipedia Article on Lojban might be a good informative thing to read.

Also I already know better ways than above to ensure Communication works without Languages, but that isn’t what this particular Thread is about.

rdr · November 19, 2021, 5:01pm

I do not think that this should be that much of a concern, otherwise you are going to have to ban too many sounds. For example, the Adyghe language (spoken by 800 thousand people here in Russia) has only two vowels, which are а and ы; Russian, spoken by around 260 million, has 6 vowels (around three times fewer than German, I suppose). Mongolian (spoken by around 6 million people) does not distinguish b and v; it also has the f sound only in loanwords. Turkish (spoken by around 90 million) has some terminal devoicing: at (“horse”) and ad (“name”) are indistinguishable (more or less); this also appears in Russian (kod and kot are indistinguishable) and some other Slavic languages, in the Malay language (spoken by around 290 million), in Mongolian, and, as far as I know, in some varieties of German. And these are just some of the more basic troubles that someone might have with pronouncing randomly generated words. Mongolian, for example, has vowel harmony, so for Mongolians it is hard to pronounce a word that does not follow the rules of vowel harmony. You could possibly make a random word generator which creates words easy to pronounce for everybody, but:

You will have to exclude most of the world’s languages using some kind of criteria, e.g. exclude non-Indo-European languages, or exclude languages that are spoken by fewer than 200 million people, etc.
You will have to acknowledge the different phonological processes of all the languages you chose.

But such a system will probably be just bland. I suppose you could instead use some obscure language with simple phonetics (such as Hawaiian maybe, it sounds sweet and does not have many consonants). Another option is to choose some short list of languages according to your taste and only follow their rules.

Though I personally would like the opposite: making the phonetics of the random name generator as complex as possible. Maybe even making some obscure phonological rules (a foundation of the far-future mechaenetia conlang???).

Anyway, probably the best option (at least from a value-to-effort standpoint) is to just use Latin as a foundation.

Gregorius · November 19, 2021, 9:05pm

Thank you! That one will boil down the possible words even more! Let's see, maybe I go below 33MiB with that one. 29.6MiB by considering b, p, v, w and f the same letter in the Blacklist! (I already consider ALL vowels as the same too)

And yes, there are a lot of Language Issues of course, and I mostly expect English to be the used Language, but at the very least I want to make it possible for people to unambiguously say words with whatever horribly butchered pronunciation they can come up with, and still be understood perfectly fine, because there is no similar made-up or real word that is even close to that.

Most letters will of course still exist, R and L could still be used in the same word, it’s just that a Word with L and R in switched places would be considered identical and therefore another RNG name would need to be chosen for it.
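A hedged sketch of that "choose another RNG name" step: candidates are tried one after another until one has a key that is neither in the blacklist nor already taken by an earlier name. The names `collision_key` and `pick_name` are hypothetical, the key merges only the letter groups mentioned in this thread so far, and the candidates are passed in as a plain iterator so the example stays independent of any particular random number generator.

```rust
use std::collections::HashSet;

/// Collision key: all vowels merge, and so do the consonant groups named so far.
fn collision_key(name: &str) -> String {
    name.chars()
        .map(|ch| match ch.to_ascii_lowercase() {
            'a' | 'e' | 'i' | 'o' | 'u' => 'a',
            'l' | 'r' => 'l',
            'm' | 'n' => 'm',
            'c' | 'k' | 'q' => 'k',
            'b' | 'p' | 'v' | 'w' | 'f' => 'b',
            other => other,
        })
        .collect()
}

/// Take candidates until one has a key that is neither blacklisted nor used.
/// Returns None if the candidate source runs dry.
fn pick_name(
    candidates: impl IntoIterator<Item = String>,
    blacklist_keys: &HashSet<String>,
    used_keys: &mut HashSet<String>,
) -> Option<String> {
    for candidate in candidates {
        let key = collision_key(&candidate);
        if blacklist_keys.contains(&key) {
            continue;
        }
        // `insert` returns false when the key was already present.
        if used_keys.insert(key) {
            return Some(candidate);
        }
    }
    None
}

fn main() {
    let mut blacklist_keys = HashSet::new();
    blacklist_keys.insert(collision_key("table"));
    let mut used_keys = HashSet::new();

    let first = vec!["tapre".to_string(), "ralite".to_string()];
    println!("{:?}", pick_name(first, &blacklist_keys, &mut used_keys));

    let second = vec!["larite".to_string(), "dolite".to_string()];
    println!("{:?}", pick_name(second, &blacklist_keys, &mut used_keys));
}
```

Storing the blacklist as keys rather than raw words is what lets the merged letters shrink the list, since many real words collapse onto the same key.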

I can't wait to see which sequence of blacklisted stuff is so short that my system will bias towards words starting with certain Letters (favoring shorter generated names), since there are no “long” words that start with certain combos IRL. (I know I phrased this one very badly)

Also Also, those Names would be used for very specific Stuff, Metal is still gonna be Metal, Magic Diagram Dust is still gonna be Magic Diagram Dust (yes THAT is what I think I will name Redstone if I copy it!), Wood is still gonna be Wood, but may come from a Foobar Tree.

rdr · November 19, 2021, 9:32pm

Probably the easiest solution really… or maybe even the only one that will not drive a man into the abyss of madness. Though I think all languages have at least two vowels, so there might be a way to sneak two of them into the generator… Also, even though different vowels are considered the same, how will they look in the output of the generator? Will there be only a u e o i? Will there be more? Will there be fewer? Will there (in the actual output only) be more?

Gregorius · November 19, 2021, 9:48pm

A E I O U will exist in the actual output, they will just be considered identical when it comes to checking name collisions.

And considering some Names will have the -ium or the -ite Ending due to being “Elements” and “Ores”, there is no way I could avoid them. Only H is truly blacklisted from the Alphabet, unless I seriously have to use it. (which I do not think would be the case)
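The fixed endings could be attached after the random part is generated, roughly like this. `MaterialKind` and `with_suffix` are made-up names for illustration, and only the two endings mentioned here are covered.

```rust
/// Kinds of generated material that get a fixed ending.
#[derive(Clone, Copy)]
enum MaterialKind {
    Element,
    Ore,
}

/// Append the ending that belongs to the given kind of material.
fn with_suffix(stem: &str, kind: MaterialKind) -> String {
    let suffix = match kind {
        MaterialKind::Element => "ium",
        MaterialKind::Ore => "ite",
    };
    format!("{}{}", stem, suffix)
}

fn main() {
    println!("{}", with_suffix("fol", MaterialKind::Element));
    println!("{}", with_suffix("bak", MaterialKind::Ore));
}
```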

rdr · November 19, 2021, 9:52pm

Thanks for the clarification. There will be no more vowels, right?

Gregorius · November 20, 2021, 2:25am

Unless you consider Y and J to be vowels, there shouldn’t be any more.

rdr · November 20, 2021, 3:14pm

By the way, what do you think of also considering the d-t, z-s, and g-k pairs (and also zh-sh if such combinations will be… since they are not clusters) the same (not all of these consonants the same, only the two within each pair; they are paired by voicing)? At least at the ends of words, because of the terminal devoicing described earlier. Though considering Russian, it would be better to just consider the voiced-devoiced pairs the same everywhere. This will probably cleanse the list of possible words much more.

Gregorius · November 20, 2021, 4:25pm

Yeah, the combos of certain things are considered the same as many single letter ones. And yes, I boiled down d and t, and g, c, k too. I looked at my keyboard and went through all Characters basically.

loggeek · January 21, 2022, 10:20am

You could check this for phonetics. Voiced combos could be b/p, v/f, d/t, g/k, z/s, zh/sh…
But why would you store all possible words in a file? You could just add the RWG in your game (with a blacklist BTW).

Gregorius · January 21, 2022, 12:14pm

Did I not already make clear that I was planning on a huge List of words for a blacklist? I would not go for a whitelist approach at all, that’s just too many, lol.

The “List of Wikipedia Titles” is basically a List of Words I do NOT want to show up when randomly generating Words.