Bram Gubbels and Raquel G. Alhama
According to the Efficient Market Hypothesis (Fama, 1970, Journal of Finance), all publicly available information is reflected in the stock price. Thus, when new information is released, the stock price changes accordingly. The ability to automatically analyze this information improves the competitive position of investors; concretely, the sentiment expressed in publicly available data has been shown to be an important predictor of future prices.
In this project, we focus on sentiment analysis of annual reports (10-Ks), which are comprehensive summaries of a company's performance over the past year, including its expectations for the future. The annual report is more extensive and detailed than quarterly reports and other public sources, and it reflects the optimism or pessimism of the management. We analyze the 10-Ks of the companies in the Standard and Poor's 500 index, the most commonly used index, as it is widely regarded as the benchmark for the United States stock market.
The dominant method for extracting sentiment from annual reports in the finance and accounting literature is the bag-of-words model, which builds textual representations that disregard word order and the syntactic context of each word. However, the inability of the bag-of-words model to capture the linguistic relationships between words often leads to misclassification of the sentiment. This is particularly problematic when analyzing annual reports, since companies tend to mask bad news by including positive words in the sentences that describe it.
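To make this limitation concrete, the minimal sketch below scores a sentence by counting positive and negative words; the tiny word lists are illustrative placeholders for the much larger domain-specific dictionaries typically used in the finance literature. Because the counts ignore context, a sentence that reports bad news in positive terms can still receive a positive score.

```python
# Minimal sketch of dictionary-based bag-of-words sentiment scoring.
# The word lists below are illustrative placeholders, not the dictionaries
# actually used in our experiments.
import re
from collections import Counter

POSITIVE = {"gain", "growth", "improvement", "profitable", "strong"}
NEGATIVE = {"loss", "decline", "impairment", "litigation", "weak"}

def bow_sentiment(text: str) -> float:
    """Return a score in [-1, 1] based purely on word counts."""
    tokens = Counter(re.findall(r"[a-z']+", text.lower()))
    pos = sum(tokens[w] for w in POSITIVE)
    neg = sum(tokens[w] for w in NEGATIVE)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

# Scored as clearly positive (0.5), even though the sentence conveys bad news:
print(bow_sentiment("Strong and profitable growth is unlikely given the recent loss."))
```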
In our ongoing work, we explore whether we can solve this problem with models that incorporate syntactic information. To this end, we compare the bag-of-words approach to two models that incorporate different types of syntactic information. First, we use FinBERT (Araci, 2019, arXiv), a variant of the well-known BERT model (Devlin et al., 2018, arXiv) that has been trained on texts from the financial domain. Based on the Transformer architecture, this model incorporates positional encoding, thereby providing a representation of word order. Second, we use a Recursive Neural Tensor Network (RNTN) from Stanford CoreNLP (Manning et al., 2014, ACL). Thanks to its recursive structure, the RNTN explicitly represents sentences as syntactic trees, and it has been shown to capture negation scope and sentiment fluctuations in text (Socher et al., 2013, EMNLP). Finally, we will use the extracted sentiment to predict the response of stock prices to the release of the annual report.
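As an illustration of how sentence-level sentiment can be obtained from such a model, the sketch below queries a publicly released FinBERT checkpoint through the Hugging Face transformers pipeline. The model identifier ProsusAI/finbert is one public distribution of FinBERT; the exact checkpoint, preprocessing, and score aggregation used in our experiments may differ.

```python
# Minimal sketch: sentence-level sentiment with a public FinBERT checkpoint.
# Assumes the `transformers` library is installed; "ProsusAI/finbert" is one
# publicly available FinBERT release, used here purely for illustration.
from transformers import pipeline

classifier = pipeline("text-classification", model="ProsusAI/finbert")

sentences = [
    "The company expects revenue growth to accelerate next year.",
    "Strong and profitable growth is unlikely given the recent loss.",
]
for sentence, result in zip(sentences, classifier(sentences)):
    # Each result is a dict such as {"label": "positive", "score": 0.97};
    # FinBERT distinguishes positive, negative and neutral sentences.
    print(f"{result['label']:>8}  {result['score']:.2f}  {sentence}")
```

An analogous per-sentence score can be obtained from the RNTN through the sentiment annotator in Stanford CoreNLP, which implements the model of Socher et al. (2013).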