To be honest, I’m a techie by training, and I’ve never been into linguistics. Sure, it’s interesting to know languages, but learning them is a hassle. In general, the technical sciences always seemed more understandable and interesting to me than the humanities. That was the case until I had to think up a new domain name. Tormented by the lack of good ideas, and having rejected a lot of banal options, I figured that if inspiration won’t come on its own, I have to go looking for it. So I decided to approach the problem technically and make a domain name generator.
A good idea for the randomizer came quickly. There are already almost two million domains in Runet, with good and bad names. Of course, judging a name as "good" or "bad" is subjective, but the good names have something in common, and so do the bad ones. I suspect more than one generation of linguists has puzzled over this common factor (or maybe it has been known for a long time), but I wanted to be technical about it, so I decided that what makes a domain name good or bad is its combination of letters. The idea was as follows: break each domain name into syllables and save those syllables in a "syllable dictionary". With the syllable dictionary we can combine syllables in any order and get nice domain names (provided that the original base from which the dictionary was compiled had good names). In addition, this approach can generate not only domain names but anything else: nicknames, drug names, or personal names, for example.
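The syllable dictionary idea can be sketched in Python roughly as follows. The post does not say how names were split into syllables, so the splitting rule here (optional consonants followed by a run of vowels, with trailing consonants glued to the last chunk, and "y" counted as a vowel) is purely an assumption for illustration. The function names `split_syllables`, `build_dictionary` and `make_name` are hypothetical as well.

```python
import random
import re

# Assumption: Latin-only names, with "y" treated as a vowel.
VOWELS = "aeiouy"

def split_syllables(name):
    """Naive split: each chunk is optional consonants plus a run of vowels.
    Consonants left over at the end are attached to the last chunk."""
    chunks = re.findall(rf"[^{VOWELS}]*[{VOWELS}]+", name)
    tail = name[sum(len(c) for c in chunks):]
    if chunks:
        chunks[-1] += tail
    elif tail:
        chunks = [tail]  # no vowels at all: keep the name as one chunk
    return chunks

def build_dictionary(names):
    """Collect the unique syllables of all source names."""
    syllables = set()
    for name in names:
        syllables.update(split_syllables(name.lower()))
    return sorted(syllables)

def make_name(syllables, count=2, rng=random):
    """Glue `count` randomly chosen syllables together."""
    return "".join(rng.choice(syllables) for _ in range(count))

if __name__ == "__main__":
    dictionary = build_dictionary(["google", "habrahabr"])
    print(dictionary)
    print(make_name(dictionary, count=3))
```

The quality of the output depends entirely on the source list: a dictionary built from good names tends to give better combinations than one built from arbitrary strings.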
The first experiments gave optimistic results, but also showed that things are not so simple. Allowing for the complete randomness of the generated words, a generated nickname did look like a nickname, and a generated drug name looked like a drug name. But the yield of good results was small. Besides, we can easily tell a man’s name from a woman’s name by ear (leaving exceptions aside), but it was hard to tell a generated man’s name from a generated woman’s name. Finally, words that are unnatural for the language (e.g., beginning with a soft or hard sign, or with unpronounceable sound combinations like mts-, nts-) need to be sifted out or marked somehow.
After some more thought, I decided that the main problem was the endings. When the ending of an "artificial" word resembled the ending of a "natural" word, the whole word looked natural. When the ending showed up earlier in the word, or went missing altogether, it was hard to call the word good. So I decided to put the endings in a separate dictionary and build new words along the lines of:
[word] = [arbitrary syllable combination]+[arbitrary ending].
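A minimal Python sketch of this formula, assuming the syllable and ending dictionaries already exist. The post does not describe how endings were extracted, so the sample lists below are made up for illustration, and `make_word` is a hypothetical name.

```python
import random

def make_word(syllables, endings, n_syllables=2, rng=random):
    """[word] = [arbitrary syllable combination] + [arbitrary ending]."""
    stem = "".join(rng.choice(syllables) for _ in range(n_syllables))
    return stem + rng.choice(endings)

# Hypothetical sample data, not taken from the original dictionaries.
SYLLABLES = ["ma", "ri", "le", "ta", "vi", "ko"]
FEMALE_ENDINGS = ["na", "la", "ya"]
MALE_ENDINGS = ["slav", "mir", "dor"]

if __name__ == "__main__":
    print(make_word(SYLLABLES, FEMALE_ENDINGS))
    print(make_word(SYLLABLES, MALE_ENDINGS))
```

Keeping a separate ending list per category (for instance, one for men's names and one for women's) is one way to make the generated words distinguishable by ear, since the ending is always taken from the matching list.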
This principle started to give very good results, in my opinion. However, the problem of sifting out unnatural words remained. To fix it, I decided to try writing a function that scores a word numerically: an excellent word should get 100 points, and something that cannot be considered a word at all should get 0.
After surfing the Internet, I found a good word for the characteristic I wanted to measure: "euphony". But googling "euphony estimation algorithms" didn’t give me any good results. So I decided to do the following: call a word "euphonious" if its vowels and consonants alternate, and "non-euphonious" if it consists of runs of letters of the same type. The numerical score can then be defined as the ratio of the number of "vowel-consonant" pairs of adjacent letters to the total number of adjacent letter pairs. For completeness, I introduced some additional conditions:
– forbidden letters (ь, ъ and ы for Russian): if a word begins with one of them, it gets 0 points
– if a word begins with a doubled letter, its score is reduced by 80%
– if a word begins with two different consonants or two different vowels in a row, its score is reduced by a quarter
As a result of such simple calculations, we can rank artificial words one way or another, discarding bad ones or highlighting good ones.
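The scoring rules above can be sketched in Python roughly like this. It is one reading of the description, not the original code: the vowel sets, the treatment of ь and ъ as non-vowels, and the decision to give 0 points to words shorter than two letters (which have no letter pairs) are assumptions, and the names `euphony` and `rank` are hypothetical.

```python
# Assumption: Latin and Russian vowels; every other character is "not a vowel".
VOWELS = set("aeiouy" "аеёиоуыэюя")
# Letters a word may not start with (the Russian case from the list above).
FORBIDDEN_FIRST = set("ьъы")

def is_vowel(ch):
    return ch in VOWELS

def euphony(word):
    """Score a word from 0 (not a word) to 100 (perfect alternation)."""
    word = word.lower()
    # Assumption: a word with no letter pairs scores 0.
    if len(word) < 2 or word[0] in FORBIDDEN_FIRST:
        return 0.0
    pairs = list(zip(word, word[1:]))
    # Count adjacent pairs made of one vowel and one non-vowel.
    mixed = sum(1 for a, b in pairs if is_vowel(a) != is_vowel(b))
    score = 100.0 * mixed / len(pairs)
    first, second = word[0], word[1]
    if first == second:
        score *= 0.2   # doubled letter at the start: minus 80%
    elif is_vowel(first) == is_vowel(second):
        score *= 0.75  # two different letters of one type: minus a quarter
    return score

def rank(words):
    """Sort candidate words from most to least euphonious."""
    return sorted(words, key=euphony, reverse=True)

if __name__ == "__main__":
    for w in rank(["banana", "stop", "mtsa", "aab"]):
        print(w, round(euphony(w), 1))
```

A threshold on the score can then be used to discard bad candidates, or the top of the ranked list can be highlighted as the good ones.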
What came out of my experiments can be seen at http://vidumschik.ru. I think a generator like this could be useful to a lot of people. But I’d really like to know whether anyone has already worked on measuring word euphony. Or maybe someone can suggest a good algorithm?
All of this was done by a friend of mine, who cannot post here for well-known reasons.
Euphony in numbers