Predicting Dow Jones Index

Using Online News Text Mining

머신러닝을 활용한 다우존스 등락예측.pdf


Table of Contents

  1. Introduction
  2. Main Body
     2.1 Related Research
     2.2 Data Collection: Reddit News
     2.3 Analysis Process
     2.4 Machine Learning Techniques Used
  3. Analysis Results
  4. Limitations and Future Improvements
  5. Conclusion
  6. References and Used Codes

SUMMARY

<aside> 💡 This study predicts the daily direction (rise or fall) of the Dow Jones Index from news headlines that ranked highly in popularity and views on Reddit. Each day, the titles of the 25 most popular news posts were extracted and converted into a Bag-of-Words (BOW) matrix using two weighting schemes, raw counts and TF-IDF. Three machine learning models (Logistic Regression, Naive Bayes, and XGBoost) were trained on these matrices to predict the movement of the index, and XGBoost achieved the highest accuracy of the three. News headlines alone, however, proved a limited basis for predicting market movements. To make the model more useful in practice, the study exploited the relationship between trading volume and the influence of news: restricting prediction to days in the top 45% of trading volume yielded higher prediction accuracy.

</aside>
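
The feature-construction steps described in the summary can be sketched in plain Python. This is a minimal illustration, not the code used in the study: the headlines, the tokenizer, the smoothed IDF formula, and the helper names `build_bow` and `top_volume_days` are all assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Crude tokenizer (assumption): lowercase, keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())

def build_bow(docs):
    """Return (vocab, count_matrix, tfidf_matrix) for a list of documents."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: j for j, tok in enumerate(vocab)}

    # Count matrix: one row per document, one column per vocabulary term.
    counts = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok, c in Counter(toks).items():
            row[index[tok]] = c
        counts.append(row)

    # Document frequency and a smoothed IDF, ln((1 + n) / (1 + df)) + 1.
    # This is one common variant, chosen here as an assumption.
    n_docs = len(docs)
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    tfidf = [[c * w for c, w in zip(row, idf)] for row in counts]
    return vocab, counts, tfidf

def top_volume_days(volume_by_day, fraction=0.45):
    """Return the days whose trading volume ranks in the top `fraction`."""
    ranked = sorted(volume_by_day, key=volume_by_day.get, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return set(ranked[:keep])

# Hypothetical headlines; the study used the top 25 titles per day.
days = {
    "day1": ["Oil prices surge after supply cut", "Markets rally on jobs report"],
    "day2": ["Oil prices fall", "Central bank holds rates"],
}
# Each day's headlines are joined into a single document.
docs = [" ".join(headlines) for headlines in days.values()]
vocab, counts, tfidf = build_bow(docs)

# Hypothetical trading volumes, used to pick the days worth predicting.
selected = top_volume_days({"day1": 120, "day2": 340, "day3": 90, "day4": 200}, 0.5)
```

Terms that appear in every document (here `oil`) keep a TF-IDF weight equal to their count, while terms unique to one day (here `rally`) are weighted up. The rows of either matrix would then serve as inputs to a classifier, with only the days returned by `top_volume_days` retained.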

Keywords: Text Mining, Stock Prediction, Machine Learning, Dow Jones, News, Reddit.


1. Introduction

Information technology is developing rapidly, and big-data analysis is now used in a wide range of fields. Finance is among the fields where big-data research and practical application are most active. Applications already being adopted in the financial sector include personal credit evaluation, customer satisfaction evaluation, and insurance fraud detection. The relationship between big data and finance has become close enough that the term 'techfin' is now used in place of 'fintech'.

Stock price prediction is one area of finance where big-data research is especially active. As internet and mobile device usage expands, the influence of internet news on stock prices is growing. Many market participants consult internet news when making decisions because it is easily accessible and updated quickly. Practitioners have acted on this as well. In 2011, DCM Capital in the UK analyzed Twitter text to gauge investor sentiment and used the results in its hedge fund operations; at that time, the firm analyzed 100 million Twitter posts every day and reflected the results in its portfolio. In Korea, Koscom applied big-data sentiment analysis to specific news and reflected the results in stock price prediction.

There are three main approaches to stock price prediction: mathematical, statistical, and artificial intelligence methods (Schumaker, Chen, 2010). Technical and mathematical analysis used to predominate, but research on stock price prediction using artificial intelligence techniques has recently become active. These techniques can use not only numerical data but also unstructured data such as text and images. Unstructured data makes it possible to analyze areas that call for more refined techniques, such as investor sentiment and the social mood among economic actors, and to extract meaningful information from them. Text mining, opinion mining, and cluster analysis based on natural language processing are now being carried out. Through such analysis, the psychological factors of market participants are increasingly reflected in investment decisions, which offers new perspectives on investing.


2. Main Body

2.1 Related Research

Most existing studies on predicting stock prices with text mining have analyzed news articles or headlines. An et al. (2010) combined machine learning with time series analysis to predict stock prices using news from Naver Finance, achieving a success rate of 55%. Jung et al. (2015) applied sentiment analysis to news to predict the stock prices of individual companies and reported a meaningful level of accuracy. Inwon et al. (2016) predicted the trend of stock price fluctuations by analyzing standardized disclosure information together with news.

Other studies have analyzed text from social networking services (SNS) such as Twitter. As the number of smartphone users grows, the influence of SNS is becoming more significant. Because SNS lets individuals write and share their opinions freely, posts express positive or negative sentiment and reflect what users are interested in. Bollen et al. (2011) confirmed that public sentiment is reflected in large volumes of Twitter text and demonstrated that it can improve the accuracy of predicting the Dow Jones Industrial Average. Kim et al. (2014) conducted sentiment analysis on user comments in the stock discussion rooms of Naver, Daum, and Paxnet and confirmed that, given sufficient SNS data, such comments can be used to predict stock price fluctuations. Kim et al. (2014) built a sentiment dictionary tailored to the stock market and presented a model that predicts stock price fluctuations using sentiment analysis and machine learning.
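
To make the idea of dictionary-based sentiment analysis concrete, the following is a minimal sketch of scoring a piece of text against a sentiment dictionary. It does not reproduce the method of any study cited above: the tiny `LEXICON`, its word polarities, and the averaging rule in `sentiment_score` are hypothetical choices made only for illustration.

```python
import re

# Hypothetical mini-dictionary: +1 for positive terms, -1 for negative terms.
# A dictionary for real use would be built from stock-market text.
LEXICON = {
    "surge": 1, "rally": 1, "gain": 1,
    "plunge": -1, "crash": -1, "loss": -1,
}

def sentiment_score(text, lexicon=LEXICON):
    """Average polarity of the dictionary words found in `text`.

    Returns a value in [-1, 1], or 0.0 when no dictionary word appears.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = [lexicon[tok] for tok in tokens if tok in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

positive = sentiment_score("Stocks rally as tech shares surge")
mixed = sentiment_score("Markets crash after brief rally")
neutral = sentiment_score("Central bank holds rates")
```

A score like this could be computed per comment or per headline and then aggregated by day to form a feature for a prediction model.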

There have also been studies that analyze stock prices using search volumes rather than news articles or SNS. Preis et al. (2013) used Google Trends data to predict changes in the stock market. According to that study, Google search volumes reflect the current economic situation and the psychology of economic agents, and search volumes for specific keywords were observed to increase before an economic crisis.

Unlike these existing studies, this study applies text mining to a dataset that mixes general news and SNS: news headlines ranked by their popularity on Reddit.