Aron Joosse, Gökçe Kuşcu and Giovanni Cassani
In this work, we predict traits of fictional characters (gender, polarity, and age) based on their names only, considering the traits assigned to each character by the author of the story as well as the intuitions of participants in an online survey.
Building on previous studies on the sound symbolism of names, we address the following gaps in the literature. First, we evaluate whether form-based features (e.g., letter unigrams) allow us to reliably predict characters’ attributes from names; we then compare these features to semantic representations extracted from character n-grams using FastText (FT). Second, we analyze different types of names (made-up, e.g., Morgra; real, e.g., John; and talking, i.e., names relying on existing English words, e.g., Bolt) to assess whether sound symbolism equally impacts choices and perceptions when names do or do not rely on words with established semantics. Third, we extend the analysis of sound symbolism to a novel attribute, age.
We derived a set of target names from a fantasy fan-fiction corpus crawled from Archive Of Our Own (AOOO names) and a corpus of children's and Young Adult books (YA names) kindly provided by Vanessa Joosen (University of Antwerp). AOOO names were manually tagged as referring to male/female and good/evil characters. YA names were manually tagged as referring to male/female and young/old characters. We then ran an online survey asking monolingual English participants to drag a slider bar (anchored between -50 and 50) to indicate how well a name would fit a young/old, good/evil, or male/female character.
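As a rough illustration of how such slider responses could be aggregated before modeling, the sketch below computes a mean rating per name and trait; the column names and values are hypothetical placeholders, not the actual survey export format or procedure.

```python
# Hypothetical sketch: aggregating slider responses (-50 to 50) into one
# mean rating per name and trait; column names are assumptions.
import pandas as pd

responses = pd.DataFrame({
    "name":   ["Morgra", "Morgra", "John", "John"],
    "trait":  ["polarity", "age", "polarity", "age"],
    "rating": [35, -10, -20, 5],  # slider values between -50 and 50
})

# One row per name, one column per trait, holding the mean rating.
mean_ratings = (
    responses.groupby(["name", "trait"])["rating"]
    .mean()
    .unstack("trait")
)
print(mean_ratings)
```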
Names were featurized as count vectors of letter unigrams to investigate the predictive power of surface-form features. Moreover, we trained a custom FT model (character n-grams between 2 and 5, window size = 2) on the Corpus of Contemporary American English (COCA), which we sentence-tokenized and from which we removed stop-words. We then embedded names using the FT model. We predicted author choices using Random Forest classifiers and participant ratings using ElasticNet regression, whose regularization corrects for the large number of dimensions in the FT embeddings.
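A minimal sketch of this pipeline is given below, assuming gensim and scikit-learn as the toolkit. The names, labels, ratings, and the two toy sentences standing in for COCA are placeholders, and hyperparameters not stated above (vector size, epochs, min_count, number of trees, ElasticNet penalties) are illustrative guesses rather than the values actually used.

```python
# Minimal sketch of the featurization and modeling steps; all data below
# are placeholders for the real corpora, names, and annotations.
from collections import Counter
import string

import numpy as np
from gensim.models import FastText
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import ElasticNet

names = ["Morgra", "John", "Bolt"]              # placeholder target names
gender_labels = np.array([1, 0, 0])             # placeholder author-assigned labels
survey_ratings = np.array([30.0, -12.0, 5.0])   # placeholder mean slider ratings

# 1) Surface-form features: count vectors of letter unigrams.
alphabet = list(string.ascii_lowercase)
def unigram_counts(name):
    counts = Counter(name.lower())
    return [counts.get(letter, 0) for letter in alphabet]

X_unigrams = np.array([unigram_counts(n) for n in names])

# 2) Semantic features: a custom FastText model (char n-grams 2-5, window 2)
#    trained on sentence-tokenized, stop-word-filtered text; the toy
#    sentences stand in for the COCA corpus.
sentences = [["dog", "ran", "home"], ["bolt", "lightning", "struck"]]
ft = FastText(sentences, vector_size=100, window=2, min_n=2, max_n=5,
              min_count=1, epochs=10)
X_ft = np.array([ft.wv[n.lower()] for n in names])  # OOV names built from n-grams

# 3) Author choices: Random Forest classification on either feature set.
clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X_unigrams, gender_labels)

# 4) Participant ratings: ElasticNet regression, whose L1/L2 penalty helps
#    with the high-dimensional FT embeddings.
reg = ElasticNet(alpha=1.0, l1_ratio=0.5)
reg.fit(X_ft, survey_ratings)
```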
Preliminary results indicate that letter unigrams and FT embeddings are good at predicting gender, with an accuracy of 0.67 when predicting author choices. Figures decrease for age and polarity, with accuracies of 0.57 and 0.56, respectively (majority baseline = 0.5 for all classification problems). When predicting survey ratings from FT embeddings across all three character traits, made-up names (MAE = 21.56) fared worse than real names (MAE = 15.51) and talking names (MAE = 17.30), indeed suggesting that semantic representations are more reliable when they can draw on the co-occurrence patterns of existing words.
Therefore, while the unigrams appearing in character names are to some extent predictive of the attributes authors chose, deriving semantic embeddings from n-gram features yields varying results depending on name type. We plan to extend this work by also considering abstract phonological features and testing a wider array of classifiers and regressors.