
Abstract: This paper surveys stock prediction techniques that use Big Data Analytics. It also covers the importance, challenges, and applications of Big Data and Big Data Analytics. Six papers are reviewed to describe the experimentation and proposed work in stock prediction using Big Data Analytics. The survey outlines the techniques and methodologies used in these papers and concludes that predictive analytics is the most suitable approach for stock prediction.

Keywords: Big Data; Big Data Analytics; Stock Prediction; Predictive Analytics

I. INTRODUCTION

Big Data refers to a huge volume of both structured and unstructured data that can be handled only with new software and frameworks, because classical software and database techniques cannot process it within a bounded time frame. With respect to size, Big Data is described here as data larger than a gigabyte. Even smaller data sets may be called Big Data, however, depending on the context in which they are used. Earlier techniques such as relational databases differ fundamentally from Big Data. The main difference is that a single processor was traditionally used to process and analyze all of the information, whereas Big Data distributes the work across a large number of processors to handle very large amounts of information.

A. Need for Big Data

In the modern world, the use of data has increased enormously. The main reason for this growth is the emergence of the internet and digital machines. Data accumulation has shifted from employee-generated data to user-generated data and finally to machine-generated data. As a result, the volume of accumulated data has grown exponentially over the years since the emergence of the internet and digital devices.
Computing and manipulating this huge accumulation of data requires a new technique, because traditional techniques cannot process it within a prescribed time. This problem led to the emergence of, and need for, a new technique to handle such data efficiently, called "Big Data".

B. Characteristics of Big Data

Big Data has four main characteristics, often referred to as the four V's of Big Data: Volume, Velocity, Variety and Value [1]. Other characteristics such as Veracity, Validity and Volatility are also mentioned.
1) Volume: Volume refers to the amount of data to be processed with big data tools. The volume may be large or small, depending on the needs of the application.
2) Velocity: Velocity refers to the speed at which data is processed using big data tools. This characteristic helps to complete the processing within the prescribed time.
3) Variety: Variety refers to the different kinds and formats of information given as input. It distinguishes structured from unstructured data, which is a basic requirement of Big Data.
4) Value: Value refers to the distinction between the uses of the small and large data that are combined to accomplish a task. Value makes use of the volume and variety of the data to offer quality analytics.

C. Importance of Big Data

1) Big Data plays a vital role in fields such as health care, business, science and research. Its main aspect is processing a huge amount of data in parallel across a large number of processors.
2) Big Data tools can manipulate, compute, predict, analyze and compare any amount of input information and provide results within a short span of time.

D. Applications of Big Data

Big Data can be applied in various sectors such as medicine, health care, education, business and social networks [9], [11], [13].
Each sector has its own specific way of organizing and implementing Big Data techniques. Across applications, the main factors are volume and velocity.

978-1-5090-5682-8/17/$31.00 ©2017 IEEE
International Conference on Innovations in Power and Advanced Computing Technologies i-PACT2017

E. Challenges of Big Data

Alongside its many benefits, Big Data poses an equal number of challenges in manipulation and processing. These include:
• Complex and regularly emerging techniques.
• Utilizing the techniques and understanding how they differ.
• Privacy and security.
• Lack of experts to process it.
• Implementing cloud techniques.

II. BIG DATA ANALYTICS

Big Data Analytics is the technique proposed alongside Big Data to identify "hidden patterns, unknown correlation, business information, user preference, trends in market and social network, and unknown statistical relations".

A. Importance of Big Data Analytics

The major role of Big Data Analytics is to combine a huge amount of statistical data to yield a precise output within a prescribed time limit. Its main benefits are as follows.
1) Cost Reduction: Using Hadoop or similar tools to map the data and reduce its volume in cloud storage lowers cost.
2) Quicker and Better Decision Making: Big Data tools allow information to be analyzed immediately for quick decision making. Analytical tools can provide better results, such as predictions and accurate decisions.

B. Types of Data Analytics

Big Data Analytics is broadly classified into three types: Descriptive, Predictive and Prescriptive analytics [2].
1) Descriptive Analytics: Business intelligence (BI) is the broad area in which descriptive analysis is carried out. It is the beginning stage of data processing, providing suggestions on how to use historical data for prediction.
It utilizes both data mining and data aggregation methods. In BI, the traditional primary applications include scoreboards, dashboards, data screening and visualization. More recently, Descriptive Analytics has mainly been used to identify and analyze what happened in past data and what can be done to improve decisions and predictions. For this purpose, a new analytical technique emerged, known as Predictive Analytics.
2) Predictive Analytics: Predictive Analytics uses past or historical data to predict the future with reasonable accuracy. It can be used in fields such as weather forecasting, stock prediction and economic variation prediction. The major tool used for Predictive Analytics is RHadoop, a combination of R and Hadoop that provides better results with higher accuracy. Combining the predictive and descriptive techniques gives rise to a further type of analytics called Prescriptive Analytics, described next.
3) Prescriptive Analytics: Prescriptive Analytics refers to analyzing an abstraction of the exact data related to a particular field in order to enhance the classification result. It combines predictive and descriptive analytics. Its major application is Business Intelligence. Because it combines both techniques, prescriptive analysis generally produces a better result.

C. Application Areas

Predictive Analytics can be used in applications such as fraud detection, risk management, stock prediction, child protection, economic variation and clinical diagnosis. Some of them are explained in this paper.
1) Fraud Detection: Fraud is the act of cheating or of misusing authentication or authorization.
Fraud occurs in many industries, institutions and business sectors, and even in the public sector, and directly causes losses to individuals or governments. Predictive Analytics plays a major role in preventing such fraud by stopping fraudsters from gaining access.
2) Economy-Level Prediction: The economy of a company or a country is directly related to profit or loss and to increases or decreases in market prices, respectively. Predictive Analytics is used to predict the level or stability of the financial status of a company or country, based on its behavior, current events and trends.
3) Stock Prediction: Stocks are an indicator of the financial standing of a company in every country of the world. Investing in the stock market can lead to a financial gain or loss for the person who bought stock in a particular product or company. Predicting stock prices is therefore essential for improving financial status or preventing future losses. Predictive Analytics is used for this stock prediction.
4) Child Protection: Predictive Analytics may be used to reduce child fatality and child abuse in child welfare agencies.

III. STOCK MARKET PREDICTION

The stock market is a place where people buy and sell shares and stocks as they wish, with the basic aim of financial gain. Investing in the stock market seems to be an easy task, but that is rarely the case. Investing in a particular stock also carries a high risk. Stock Prediction is the technique used to identify whether the price of a particular stock will increase or decrease.
Analysts' interest in predicting stock prices is growing rapidly, because prediction helps to avoid risk and to improve financial status considerably. This section reviews various papers related to stock market prediction.

A. Sentiment analysis on social media for stock movement prediction

1) Theme: To develop a model that predicts stock price movement using the sentiments on specific topics.
2) Proposed Model: A new feature called "topic-sentiment" is incorporated for better stock market prediction. The topic-sentiment feature represents the sentiments related to specific topics of the company and is used for stock prediction. Topics and sentiments are extracted simultaneously.
3) Experimentation: Historical data for 18 companies are extracted from Yahoo Finance. Messages for the 18 stocks are extracted from the Yahoo Finance Message Board over a period of about one year. The sentences are split, and Stanford CoreNLP is used for POS tagging and lemmatization of each word in each sentence. For each transaction date, the sentiment value of each topic is calculated, and the importance of each topic is also considered [3].
4) Advantages: The main advantages are
a) It is the first research to show the effectiveness of incorporating sentiment analysis by investigating large-scale data.
b) The sentiment of a particular topic was considered.
c) Both JST-based and aspect-based methods are implemented.
5) Disadvantages: It has some drawbacks such as
a) Only stock movements (either up or down) are predicted.
b) A limited number of topics and sentiments are retrieved, and the accuracy was very low, at about 56%.

B. Twitter mood predicts the stock market

1) Theme: To check whether public sentiment expressed in a large collection of daily tweets can predict the stock market.
2) Proposed Model: Two tools are used to measure variation in the public mood from tweets.
The public mood variation results are correlated with the Dow Jones Industrial Average (DJIA).
3) Experimentation: OpinionFinder is used to identify the emotional polarity of the sentences as either strong or weak. GPOMS measures 6 different mood states: Calm, Alert, Sure, Vital, Kind and Happy. The ability of the tools to capture various mood states is cross-validated. Granger causality analysis excludes the exceptional public mood responses [4].
4) Advantages: The main advantages are
a) An accuracy of 86.7% was obtained in predicting the stock market prices.
b) The Mean Absolute Percentage Error (MAPE) was reduced by more than 6%.
5) Disadvantages: It has some drawbacks such as
a) Only a few factors were included in the analysis.
b) Ground truth for public mood states is not considered.

C. The Use of Artificial Neural Networks in the Analysis and Prediction of Stock Prices

1) Theme: To predict the closing price of the stock PETR4 using an artificial neural network.
2) Proposed Model: Prediction is generated in three stages. The data sets of PETR4 stock are obtained. Cleaning and data normalization are carried out in the pre-processing stage. An MLP feedforward network model is used for prediction. Both these techniques are correlated to find the accuracy of stock prediction.
3) Experimentation: Dates that have no data are removed by the cleaning process with the help of the DJIA. The network is trained on inputs and intended outputs using resilient backpropagation. The performance of the neural networks is measured by Root Mean Square Error and Mean Percentage Error. The two errors are then compared to check the accuracy [5].
4) Advantages: The main advantages are
a) Satisfactory results were obtained.
b) A strategy for the best performance of neural network prediction was provided.
5) Disadvantages: It has some drawbacks such as
a) Only one stock's historical price was included for prediction.
b) The behavior and tendency of stocks were not considered.

D. Stock Market Prediction: A Big Data Approach

1) Theme: To predict stock performance by applying machine learning and fundamental analysis.
2) Proposed Model: Random Walk Theory and the Efficient Market Hypothesis are used. Data are gathered and prepared for sentiment analysis. After analysis, the sentiments are aggregated and visualized as a graph. Machine learning is performed using linear regression.
3) Experimentation: News articles are collected with the Mozenda Web Crawler. After the news and tweets are lemmatized, all stop words, URLs and duplicates are removed. Sentiment analysis is carried out using HDFS in a Hadoop Hive environment. Sentiments are aggregated and the graphs are generated using RHadoop. For machine learning, a generalized linear model with the binomial family is used as the regression model. The historical data are obtained from Yahoo Finance on a daily basis [6].
4) Advantages: The main advantages are
a) Social media combined with historical data improves prediction results.
b) It showed that political and economic news influences stock market prices.
5) Disadvantages: It has some drawbacks such as
a) Only particular news items or tweets are considered for prediction.
b) Prediction was less accurate.

E. Improved Twitter Sentiment Prediction through 'Cluster-then-Predict Model'

1) Theme: To propose a hybrid approach that uses unsupervised learning to cluster the tweets and then applies supervised learning methods.
2) Proposed Model: Sentiment is predicted using both supervised and unsupervised learning. Feature extraction is performed after the data set is obtained. Clusters of tweets are formed. Various decision tree algorithms are implemented and their performance is evaluated. Prediction is then performed.
3) Experimentation: Tweets are collected using Tweepy, a Python library for the Twitter API, and features are extracted with the bag-of-words approach. The K-means clustering algorithm partitions the tweets according to the words they contain, so that tweets with similar words fall into the same cluster. The Random Forest decision tree method makes the solution more interpretable, and classification algorithms such as CART, Support Vector Machines and logistic regression are used and their performance parameters evaluated. Finally, prediction is done with the cluster-then-predict model [7].
4) Advantages: The main advantages are
a) It is scalable, so a considerable number of tweets can be extracted.
b) A hybrid mechanism improved prediction accuracy by about 2.33% over previous models.
5) Disadvantages: It has some drawbacks such as
a) Only 2000 tweets were considered for sentiment analysis.
b) Only two classes (positive/negative) could be predicted.

F. Spin-offs in Indian Stock Market owing to Twitter Sentiments, Commodity Prices and Analyst Recommendations

1) Theme: To find out whether Twitter sentiments and commodity prices help in predicting actual stock prices for the top 50 companies listed on the NIFTY at NSE, India.
2) Proposed Model: Natural Language Processing, Sentiment Analysis and Machine Learning techniques are used for prediction. Tweets are collected and processed to perform sentiment analysis, and the results are then correlated. The correlated information is used in the prediction model. Analyst recommendations are also taken into account to improve the accuracy of the predicted stock prices.
3) Experimentation: Tweets are collected using the Twitter Search API and then preprocessed to remove stop words and duplicates. Sentiment analysis is carried out with a lexicon-based approach to find the polarity of words. Multiple linear regression models are built using the Granger causality method [8].
Pearson's correlation coefficient is used to find the correlation between commodity prices and actual stock prices.
4) Advantages: The main advantages are
a) It observed that analysts' recommendations have higher prediction accuracy than Twitter sentiments.
b) It showed that including commodities that are highly correlated with the stock market increases prediction accuracy.
5) Disadvantages: It has some drawbacks such as
a) Only 6 commodity prices were included.
b) Only 2 classes (BUY/SELL) were implemented.

IV. DISCUSSIONS

This paper surveys Big Data, Big Data Analytics and Stock Prediction. The Big Data section discusses its definition, importance, need, differences from its predecessors, challenges and applications. Brief information about Big Data Analytics follows, covering its definition, importance, types and application areas. Three types of Big Data Analytics are covered: Descriptive, Prescriptive and Predictive Analytics. Four distinct application areas are included: Child Protection, Economy-Level Prediction, Fraud Detection and Stock Prediction. The next section describes the concept of Stock Prediction and the use of Big Data Analytics in stock prediction. It presents the theme, proposed model, experimentation, advantages and disadvantages of six papers related to stock price prediction that can be implemented with the help of Big Data Analytics.

From the papers included in the stock prediction section, it is clear that predictive analytics can draw on both historical data and sentiment analysis to achieve the desired result.
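
As a rough illustration of how a sentiment signal can be related to historical prices, the sketch below scores messages with a lexicon-based approach and then computes Pearson's correlation coefficient between daily sentiment and daily returns. It is not the method of any surveyed paper: the lexicon, the messages, the returns and the function names sentiment_score and pearson are all hypothetical and chosen only for this example.

```python
from math import sqrt

# Hypothetical toy lexicon; real lexicon-based studies use far larger word lists.
LEXICON = {"gain": 1, "up": 1, "strong": 1, "loss": -1, "down": -1, "weak": -1}


def sentiment_score(text):
    """Sum the lexicon polarities of the words in one message."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Synthetic example data (assumed, not taken from any surveyed paper):
# one list of messages per trading day, and that day's stock return.
daily_messages = [
    ["strong gain today", "stock up"],
    ["weak results", "loss widens"],
    ["up again"],
    ["down on weak demand"],
]
daily_returns = [0.012, -0.008, 0.004, -0.010]

# Aggregate message scores into one sentiment value per day.
daily_sentiment = [sum(sentiment_score(m) for m in day) for day in daily_messages]

print("daily sentiment:", daily_sentiment)
print("correlation with returns:", round(pearson(daily_sentiment, daily_returns), 3))
```

A coefficient near +1 or -1 on real data would suggest that the sentiment series carries information worth feeding into a prediction model, while a value near 0 would not; on a toy series this short the number has no statistical meaning.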
Historical data can be obtained from Yahoo Finance, and tweets can be collected with either the Twitter Search API or the Twitter Streaming API. The survey also suggests that a larger amount of input data produces more accurate stock predictions. The performance of a stock prediction model can be improved by combining two or more predictive analytics algorithms, such as linear regression, clustering or classification, to reduce the prediction error, which is an important factor in stock prediction. Bio-inspired optimization algorithms can also be used to build effective machine-learning-based stock prediction models [10], [12].

V. CONCLUSION

This paper has surveyed Big Data, Big Data Analytics, the types of Data Analytics and several stock price prediction approaches based on Big Data Analytics. A discussion section describing the flow and content of the paper is also included. The survey covers techniques related to Big Data Analytics and stock prediction. The study indicates that predictive analytics is the most suitable analytical technique for predicting stocks well in advance. The survey concludes that combining historical data with sentiment analysis techniques enhances the accuracy of stock prediction.
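
The surveyed papers report prediction quality with error measures such as Root Mean Square Error and Mean Absolute Percentage Error. A minimal sketch of how two candidate models could be compared on these measures is given below. The price series and both forecasts are synthetic values assumed for illustration, so the comparison shows only how the metrics are computed and is not evidence for the survey's conclusion.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root Mean Square Error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent; actual values must be non-zero."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100 * sum(ratios) / len(ratios)


# Synthetic closing prices and two hypothetical forecasts (assumed values).
actual = [100.0, 102.0, 101.0, 104.0]
history_only = [99.0, 100.0, 103.0, 101.0]
history_plus_sentiment = [100.0, 101.0, 101.0, 103.0]

for name, forecast in [
    ("history only", history_only),
    ("history + sentiment", history_plus_sentiment),
]:
    print(name, "RMSE:", round(rmse(actual, forecast), 3),
          "MAPE %:", round(mape(actual, forecast), 3))
```

RMSE penalizes large individual errors more heavily, while MAPE expresses the error relative to the price level, so reporting both gives a more balanced picture when comparing models across stocks with different price ranges.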