I extracted the archive of title names (87 MiB) to a 330 MiB text file, and then ran it through the following:
cat input | awk -F "_" '!$3' | iconv -f utf8 -t ascii//TRANSLIT | tr -cd '[:alpha:]\n' | tr '[:upper:]' '[:lower:]' | tr -s 'a-z' | awk 'length($0)>3' | sort | uniq -u >> output
This shortened the file from the original 16 million lines down to 5.6 million, and from 330 MiB to 72 MiB.
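To make the effect concrete, here is a minimal sketch of the same pipeline run over a handful of invented titles (the sample titles are made up for illustration, not taken from the actual dump):

```
printf '%s\n' 'Hello_World' 'Heloworld' 'List_of_lists_of_lists' 'Cat' 'AAAA' 'Quite_Nice' \
  | awk -F '_' '!$3' \
  | iconv -f utf8 -t ascii//TRANSLIT \
  | tr -cd '[:alpha:]\n' \
  | tr '[:upper:]' '[:lower:]' \
  | tr -s 'a-z' \
  | awk 'length($0)>3' \
  | sort \
  | uniq -u
# prints only "quitenice":
#   List_of_lists_of_lists has three or more words, dropped early
#   Cat becomes "cat", only three characters, dropped
#   AAAA squeezes down to "a", dropped
#   Hello_World and Heloworld both collapse to "heloworld", so uniq -u drops both copies
```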
Here is what each pipe segment is doing (a commented, line-by-line version of the whole pipeline follows this list):
- open the file
- remove all lines that have three or more words in them. The underscore is what Wikipedia uses instead of spaces, because URLs.
- screw all this UTF-8 nonsense, I want easy-to-pronounce ASCII, so transliterate everything!
- just kill all the other stuff, I want only letters and \n in this file.
- make it all lowercase now!
- squeeze all repeated characters, so a double L becomes a single L.
- after all that trimming, cut out everything three characters or shorter entirely.
- sort the whole thing.
- keep only the lines that occur exactly once. Note that uniq -u throws away every line that has a duplicate, both copies included, rather than merely collapsing duplicates (see the comparison after this list).
- and append the result to the output file!
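For reference, here is the same pipeline again, one segment per line with comments. It should behave identically, except that I spelled the first awk filter as NF < 3, which says "fewer than three fields" directly and also survives the corner case of a third word that is literally 0 (awk treats an input field of "0" as falsy, so !$3 would keep such a line):

```
cat input |                            # open the file
  awk -F '_' 'NF < 3' |                # drop titles with three or more words
  iconv -f utf8 -t ascii//TRANSLIT |   # transliterate UTF-8 down to plain ASCII
  tr -cd '[:alpha:]\n' |               # keep only letters and newlines
  tr '[:upper:]' '[:lower:]' |         # lowercase everything
  tr -s 'a-z' |                        # squeeze repeated letters (ll -> l)
  awk 'length($0) > 3' |               # drop anything three characters or shorter
  sort |                               # sort so duplicates end up adjacent
  uniq -u >> output                    # keep only lines occurring exactly once, append
```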
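One subtlety worth spelling out: uniq -u is not deduplication in the usual sense. A toy comparison (the sample words are made up):

```
printf '%s\n' anna bert bert cleo | uniq -u    # prints: anna, cleo  (bert vanishes entirely)
printf '%s\n' anna bert bert cleo | sort -u    # prints: anna, bert, cleo  (duplicates collapsed)
```

For this name list that is arguably a feature, since anything several different titles collapse to is probably too generic anyway; if plain deduplication were the goal, sort -u would be the usual spelling.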