Deep learning for finance: deep portfolios

We explore the use of deep learning hierarchical models for problems in financial prediction and classification. Financial prediction problems – such as those presented in designing and pricing securities, constructing portfolios, and risk management – often involve large data sets with complex data interactions that currently are difficult or impossible to specify in a full economic model. Applying deep learning methods to these problems can produce more useful results than standard methods in finance. In particular, deep learning can detect and exploit interactions in the data that are, at least currently, invisible to any existing financial economic theory.
Applied Stochastic Models in Business and Industry (J. B. Heaton, N. G. Polson and J. H. Witte)

To train the architecture, we need a training data set D = {Y^(i), X^(i)}_{i=1}^T of input-output pairs and a loss function C(Y, Ŷ) at the level of the output signal. Let (W, b) denote the learning parameters that we compute during training. In its simplest form, we solve

\arg\min_{W,b} \sum_{i=1}^{T} C\big(Y^{(i)}, \hat{Y}_{W,b}(X^{(i)})\big).

It is common to add a regularization penalty, denoted by \phi(W, b), to avoid over-fitting and to stabilize our predictive rule. We combine this with the loss function via a parameter \lambda > 0, which gauges the overall level of regularization. We then need to solve

\arg\min_{W,b} \sum_{i=1}^{T} C\big(Y^{(i)}, \hat{Y}_{W,b}(X^{(i)})\big) + \lambda \phi(W, b).    (2)

The choice of the amount of regularization, \lambda, is a key parameter: it gauges the trade-off, present in any statistical modeling, that too little regularization will lead to over-fitting and poor out-of-sample performance. In many cases, we will take a separable penalty, \phi(W, b) = \phi(W) + \phi(b). The most useful penalty is the ridge or L^2-norm, which can be viewed as a default choice, namely

\phi(W) = \|W\|_2^2 = \sum_i W_i^2.

Other norms include the lasso, which corresponds to an L^1-norm and which can be used to induce sparsity in the weights and/or offsets. The ridge norm is particularly useful when the amount of regularization, \lambda, has itself to be learned, because there are many good predictive generalization results for ridge-type predictors. When sparsity in the weights is paramount, it is common to use a lasso L^1-norm penalty.

The common numerical approach for the solution of (2) is a form of stochastic gradient descent, which, adapted to a deep learning setting, is usually called back-propagation. One caveat of back-propagation in this context is the multi-modality of the system to be solved (and the resulting slow convergence properties), which is the main reason why deep learning methods rely heavily on the availability of large computational power. One of the advantages of using a deep network is that first-order derivative information is directly available.
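As a toy illustration of the penalized objective (2), consider a single linear layer with squared-error loss and a ridge penalty. This is a minimal numpy sketch on synthetic data (all names and numbers are illustrative, not the paper's architecture); gradient descent on the penalized objective recovers the closed-form ridge solution.

```python
import numpy as np

# Minimal sketch of the penalized training objective (2):
#   argmin_W  sum_i C(Y_i, Yhat_W(X_i)) + lambda * phi(W),
# with squared-error loss and ridge penalty phi(W) = ||W||^2.
# A single linear layer stands in for the deep architecture.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # training inputs
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
Y = X @ true_w + 0.1 * rng.normal(size=100)

lam = 0.1                               # regularization level lambda

def objective(w):
    resid = Y - X @ w
    return resid @ resid + lam * w @ w  # loss plus ridge penalty

# Gradient descent on the penalized objective.
w = np.zeros(5)
for _ in range(500):
    grad = -2 * X.T @ (Y - X @ w) + 2 * lam * w
    w -= 0.001 * grad

# The ridge problem has a closed-form solution to check against.
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ Y)
print(np.allclose(w, w_closed, atol=1e-3))
```

In a deep network the same objective is minimized, but the gradient is obtained by the chain rule (back-propagation) rather than in one analytic step.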
There are tensor libraries available that directly calculate \nabla_{W,b} C\big(Y^{(i)}, \hat{Y}_{W,b}(X^{(i)})\big) using the chain rule across the training data set. For ultra-large data sets, we use mini-batches and stochastic gradient descent to perform this optimization [9]. An active area of research is the use of this information within a Langevin MCMC algorithm that allows sampling from the full posterior distribution of the architecture. The deep learning model is by its very design highly multi-modal, and the parameters are high dimensional and in many cases unidentified in the traditional sense. Traversing the objective function is the desired problem, and the multi-modality and slow convergence of traditional descent methods can be alleviated with proximal algorithms such as the alternating direction method of multipliers, as discussed in Polson et al. [10].

There are two key training problems that can be addressed using the predictive performance of an architecture.

(1) How much regularization to add to the loss function. As indicated before, one approach is to use cross-validation and to teach the algorithm to calibrate itself to the training data. An independent hold-out data set is kept separately to perform an out-of-sample measurement of the training success in a second step. As we vary the amount of regularization, we obtain a regularization path and choose the level of regularization to optimize out-of-sample predictive loss. Another approach is to use Stein's unbiased estimator of risk (SURE; Stein [11]).

(2) A more challenging problem is to train the size and depth of each layer of the architecture, that is, to determine L and N = (N_1, ..., N_L). This is known as the model selection problem. In the next subsection, we will describe a technique known as dropout, which addresses this problem.

Stein's unbiased estimator of risk (SURE) proceeds as follows.
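The mini-batch variant can be sketched as follows: each update uses an unbiased gradient estimate from a random batch rather than the full training set. The data, batch size, and step size below are illustrative assumptions; the iterates still settle near the same penalized minimizer.

```python
import numpy as np

# Hypothetical sketch of mini-batch stochastic gradient descent for
# the penalized objective, as used for ultra-large data sets [9].
# Each step uses a random batch in place of the full gradient.

rng = np.random.default_rng(1)
n, d = 500, 4
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.0, 0.5])
Y = X @ w_true + 0.05 * rng.normal(size=n)

lam, batch, eta = 0.1, 32, 0.01
w = np.zeros(d)
for step in range(3000):
    idx = rng.choice(n, size=batch, replace=False)
    Xb, Yb = X[idx], Y[idx]
    # unbiased batch estimate of the (per-sample scaled) full gradient
    grad = -2 * Xb.T @ (Yb - Xb @ w) / batch + 2 * lam * w / n
    w -= eta * grad

w_closed = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
print(np.linalg.norm(w - w_closed) < 0.05)
```

In practice a decaying step size is used so the iterates converge rather than fluctuate around the minimizer; a constant step suffices for this illustration.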
For a stable predictor, Ŷ, we can define the degrees of freedom of the predictor by df = E\big(\sum_i \partial \hat{Y}_i / \partial Y_i\big). Then, given the scalability of our algorithm, the derivative \partial \hat{Y}_i / \partial Y_i is available using the chain rule for the composition of the L layers.

Copyright © 2016 John Wiley & Sons, Ltd. Appl. Stochastic Models Bus. Ind. 2017, 33, 3-12.

Now let the in-sample mse (mean-squared error) be given by err = \|Y - \hat{Y}\|^2 and, for a future observation Y^*, let the out-of-sample predictive mse be Err = E\big(\|Y^* - \hat{Y}\|^2\big). In expectation, we then have

E(\text{Err}) = E(\text{err}) + 2\,\text{Cov}(\hat{Y}, Y),

where the expectation is taken over the data generating process. The latter term is a covariance and depends on df. Stein's unbiased risk estimate then becomes

\text{SURE} = \|Y - \hat{Y}\|^2 + 2\sigma^2 \sum_i \partial \hat{Y}_i / \partial Y_i.

Models with the best predictive mse are favored.

Dropout is a model selection technique. It is designed to avoid over-fitting in the training process and does so by removing input dimensions in X randomly with a given probability p. In a simple model with one hidden layer, we replace the network

\hat{Y} = f(Z^{(1)}), \quad Z^{(1)} = W^{(1)} X + b^{(1)},

with the dropout architecture

D^{(1)} \sim \text{Ber}(p), \quad \tilde{X} = D^{(1)} \star X, \quad Z^{(1)} = W^{(1)} \tilde{X} + b^{(1)}.

In effect, this replaces the input X by D \star X, where \star denotes the element-wise product and D is a matrix of independent Bernoulli Ber(p) distributed random variables. It is instructive to see how this affects the underlying loss function and optimization problem. For example, set the biases to zero for simplicity and suppose that we wish to minimize mse, C(Y, \hat{Y}) = \|Y - \hat{Y}\|^2; then, when marginalizing over the randomness, we have a new objective

\arg\min_W \; E_{D \sim \text{Ber}(p)} \|Y - W(D \star X)\|^2.

With \Gamma = (\text{diag}(X^\top X))^{1/2}, this is equivalent to

\arg\min_W \; \|Y - pWX\|^2 + p(1-p)\|\Gamma W\|^2.

We can also interpret the last expression as a Bayesian ridge regression with a g-prior. Put simply, dropout reduces the likelihood of over-reliance on small sets of input data in training [12, 13]. Dropout can be viewed as the optimization version of model selection.
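The marginalization identity behind dropout can be checked numerically in the zero-bias, squared-error setting: averaging the dropout loss over many Bernoulli masks reproduces the deterministic ridge-like objective with penalty p(1-p)\|\Gamma W\|^2. The dimensions below are illustrative.

```python
import numpy as np

# Numerical check: with Gamma = diag(X^T X)^{1/2},
#   E_D ||Y - (D * X) W||^2  =  ||Y - p X W||^2 + p(1-p) ||Gamma W||^2,
# where D has i.i.d. Bernoulli(p) entries and * is element-wise.

rng = np.random.default_rng(2)
n, d, p = 50, 3, 0.8
X = rng.normal(size=(n, d))
W = rng.normal(size=d)
Y = rng.normal(size=n)

# Monte Carlo estimate of the expected dropout loss.
trials = 50_000
total = 0.0
for _ in range(trials):
    D = rng.random(size=(n, d)) < p          # Bernoulli(p) mask
    resid = Y - (D * X) @ W
    total += resid @ resid
mc = total / trials

# Closed-form marginalized objective.
Gamma2 = np.diag(X.T @ X)                    # squared column norms of X
closed = np.sum((Y - p * X @ W) ** 2) + p * (1 - p) * Gamma2 @ (W ** 2)

print(abs(mc - closed) / closed < 0.02)
```

The extra term penalizes weights on high-variance inputs, which is the ridge-with-g-prior reading given above.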
This contrasts with the traditional spike-and-slab prior (which has proven so popular in Bayesian model averaging), which switches between probability models and requires computationally intensive MCMC methods for implementation. Another application of dropout regularization is the choice of the number of hidden units in a layer. This can be achieved if we drop units of the hidden rather than the input layer and then establish which probability p gives the best results.

3. Probabilistic interpretation

In a traditional probabilistic setting, we could view the output Y as a random variable generated by a probability model p\big(Y \mid \hat{Y}_{W,b}(X)\big), where the conditioning is on the predictor \hat{Y}_{W,b}(X). The corresponding loss function is then

C(Y, \hat{Y}) = -\log p\big(Y \mid \hat{Y}_{W,b}(X)\big),

namely, the negative log-likelihood. For example, when predicting the probability of default, we have a multinomial logistic regression model, which leads to a cross-entropy loss function. Often, the L^2-norm for a traditional least squares problem,

C\big(Y, \hat{Y}(X)\big) = \|Y - \hat{Y}(X)\|^2,

is chosen as an error measure, giving an mse target function. Probabilistically, the regularization term, \lambda \phi(W, b), can be viewed as a negative log-prior distribution over parameters, namely

-\log p(\phi(W, b)) = \lambda \phi(W, b), \qquad p(\phi(W, b)) \propto \exp(-\lambda \phi(W, b)).

This framework then provides a correspondence with Bayes learning: our deep predictor is simply a regularized maximum a posteriori (MAP) estimator. We can show this using Bayes rule as

p(W, b \mid D) \propto p\big(Y \mid \hat{Y}_{W,b}(X)\big)\, p(W, b) \propto \exp\big(-\!\log p(Y \mid \hat{Y}_{W,b}(X)) - \log p(W, b)\big),

and the deep learning predictor satisfies \hat{Y}_{\hat{W},\hat{b}}, where (\hat{W}, \hat{b}) := \arg\max_{W,b} \log p(W, b \mid D) and

-\log p(W, b \mid D) = \sum_{i=1}^{T} C\big(Y^{(i)}, \hat{Y}_{W,b}(X^{(i)})\big) + \lambda \phi(W, b)

is the negative log-posterior distribution over parameters given the training data, D = {Y^(i), X^(i)}_{i=1}^T. (For more detail on the link between deep learning and probability theory, see also Lake et al. [14].)
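The MAP correspondence can be made concrete in the Gaussian case: with likelihood Y | X, w ~ N(Xw, σ²I) and prior w ~ N(0, τ²I), the negative log-posterior is proportional to the penalized least-squares objective with λ = σ²/τ², so both share the ridge minimizer. A toy linear model stands in for the deep predictor; all numbers are illustrative.

```python
import numpy as np

# Sketch of the Bayes correspondence: the regularized estimator is the
# MAP estimator under a Gaussian likelihood and Gaussian prior, with
# lambda = sigma^2 / tau^2.

rng = np.random.default_rng(3)
n, d = 80, 3
X = rng.normal(size=(n, d))
Y = X @ np.array([1.0, 0.0, -1.0]) + 0.2 * rng.normal(size=n)

sigma2, tau2 = 0.04, 0.4
lam = sigma2 / tau2

def neg_log_posterior(w):              # up to an additive constant
    return ((Y - X @ w) @ (Y - X @ w)) / (2 * sigma2) + (w @ w) / (2 * tau2)

def penalized_loss(w):                 # objective (2) with ridge penalty
    return (Y - X @ w) @ (Y - X @ w) + lam * (w @ w)

# The two objectives are proportional, hence share the ridge minimizer.
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
w_test = w_map + 0.1
print(np.isclose(neg_log_posterior(w_test) * 2 * sigma2,
                 penalized_loss(w_test)))
```

Scaling the negative log-posterior by 2σ² recovers the penalized loss exactly, which is the sense in which deep training is regularized MAP estimation.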
4. Stacked auto-encoders

For finance applications, one of the most useful deep learning routines is an auto-encoder. An auto-encoder trains the architecture to replicate X itself, namely X = Y, via a bottleneck structure. This means we select a model F_{W,b}(X) that aims to concentrate the information required to recreate X. Put differently, an auto-encoder creates a more cost-effective representation of X.

Suppose that we have N input vectors X = {X_1, ..., X_N} \in R^{M \times N} and N output (or target) vectors {X_1, ..., X_N} \in R^{M \times N}. If (for simplicity) we set the biases to zero and use one hidden layer (L = 2) with only K < M factors, then our input-output market-map becomes

F_W(X)_j = \sum_{k=1}^{K} W^2_{jk}\, f\Big(\sum_{i=1}^{M} W^1_{ki} X_i\Big) = \sum_{k=1}^{K} W^2_{jk} Z_k, \quad \text{for } Z_k = f\Big(\sum_{i=1}^{M} W^1_{ki} X_i\Big),\; k = 1, \ldots, K,

where f(\cdot) is a univariate activation function. Because, in an auto-encoder, we are trying to fit the model X = F_{W,b}(X), in the simplest possible case with zero biases we train the weights W = (W^1, W^2) via a criterion function

C(W) = \arg\min_W \|X - F_W(X)\|^2 + \lambda \phi(W), \quad \text{with } \phi(W) = \|W^1\|^2 + \|W^2\|^2,

where \lambda is a regularization penalty. If we use an augmented Lagrangian (as in the alternating direction method of multipliers) and introduce the latent factor Z, then we have a criterion function that consists of two steps, an encoding step (a penalty for Z) and a decoding step for reconstructing the output signal, via

\arg\min_{W,Z} \|X - W^2 Z\|^2 + \lambda \phi(Z) + \|Z - f(W^1 X)\|^2,

where the regularization on W^1 induces a penalty on Z. The last term is the encoder, the first two the decoder.

In an auto-encoder, for a training data set {X_1, X_2, ...}, we set the target values as Y_i = X_i. A static auto-encoder with two linear layers, akin to a traditional factor model, can be written as a deep learner as

Z^{(2)} = W^{(1)} X + b^{(1)},
a^{(2)} = f(Z^{(2)}),
Z^{(3)} = W^{(2)} a^{(2)} + b^{(2)},
\hat{Y} = F_{W,b}(X) = f(Z^{(3)}),

where a^{(2)}, a^{(3)} are activation levels. It is common to set a^{(1)} = X.
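A minimal version of the reconstruction criterion can be sketched with a zero-bias, linear-activation auto-encoder trained by gradient descent: the K-dimensional bottleneck learns to reproduce data that have an intrinsic K-factor structure. The data and hyperparameters below are synthetic assumptions.

```python
import numpy as np

# Minimal sketch of an auto-encoder X ≈ F_W(X) = W2 f(W1 X) with a
# K-dimensional bottleneck, zero biases, and linear activation f,
# trained by gradient descent on ||X - F_W(X)||^2 (no penalty term).

rng = np.random.default_rng(4)
M, N, K = 6, 200, 2                    # M features, N samples, K << M
factors = rng.normal(size=(K, N))      # latent factors
loadings = rng.normal(size=(M, K))
X = loadings @ factors + 0.05 * rng.normal(size=(M, N))

W1 = 0.1 * rng.normal(size=(K, M))     # encoder weights
W2 = 0.1 * rng.normal(size=(M, K))     # decoder weights

err0 = np.mean((X - W2 @ (W1 @ X)) ** 2)
eta = 0.01
for _ in range(3000):
    Z = W1 @ X                         # encode
    R = X - W2 @ Z                     # reconstruction residual
    gW2 = -2 * R @ Z.T / N             # gradient wrt decoder
    gW1 = -2 * (W2.T @ R) @ X.T / N    # gradient wrt encoder
    W2 -= eta * gW2
    W1 -= eta * gW1

err1 = np.mean((X - W2 @ (W1 @ X)) ** 2)
print(err1 < 0.2 * err0)
```

With a nonlinear f and a regularization penalty this becomes the criterion in the text; the linear case already shows the bottleneck concentrating the information needed to recreate X.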
The goal is to learn the weight matrices (W^{(1)}, W^{(2)}). If X \in R^N, then W^{(1)} \in R^{M \times N} and W^{(2)} \in R^{N \times M}, where M \le N provides the auto-encoding at a lower dimensional level. If W^{(1)} is estimated from the structure of the training data matrix, then we have a traditional factor model, and the W^{(1)} matrix provides the factor loadings. (We note that PCA in particular falls into this category; [15].) If W^{(2)} is estimated based on the pair \bar{X} = [Y, X] = X (which means estimation of W^{(2)} based on the structure of the training data matrix with the specific auto-encoder objective), then we have a sliced inverse regression model. If W^{(1)} and W^{(2)} are simultaneously estimated based on the training data X, then we have a two-layer deep learning model.

A dynamic one-layer auto-encoder for a financial time series (Y_t) can, for example, be written as a coupled system of the form

\hat{Y}_t = W_X X_t + W_Y Y_{t-1} \quad \text{and} \quad (X_t, Y_{t-1})^\top = W Y_t.

We then need to learn the weight matrices W_X and W_Y. Here, the state equation encodes, and the matrix W decodes the Y_t vector into its history Y_{t-1} and the current state X_t.

(The auto-encoder demonstrates nicely that in deep learning we do not have to model the variance-covariance matrix explicitly, as our model is already directly in predictive form. Given an estimated nonlinear combination of deep learners, there is an implicit variance-covariance matrix, but that is not the driver of the method.)

5. Application: smart indexing for the biotechnology IBB index

We consider weekly returns data for the component stocks of the biotechnology IBB index for the period January 2012 to April 2016. We train our learner without knowledge of the component weights. Our goal is to find a selection of investments for which good out-of-sample tracking properties of our objective can be found.
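The PCA special case mentioned above can be verified directly: with linear activations, encoding with the top-K principal directions and decoding with their transpose achieves the best possible rank-K reconstruction error (Eckart-Young). The data below are synthetic.

```python
import numpy as np

# Sketch of the factor-model special case: a linear auto-encoder whose
# decoder holds the top-K principal components acts as factor loadings
# [15], and its reconstruction error equals the optimal rank-K error.

rng = np.random.default_rng(5)
M, N, K = 8, 300, 3
X = rng.normal(size=(M, K)) @ rng.normal(size=(K, N)) \
    + 0.1 * rng.normal(size=(M, N))

# PCA via SVD of the (uncentered, for simplicity) data matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W2 = U[:, :K]            # decoder = factor loadings
W1 = W2.T                # encoder = transposed loadings
X_hat = W2 @ (W1 @ X)    # rank-K auto-encoded reconstruction

# Eckart-Young: the error matches the best rank-K approximation,
# i.e., the sum of the discarded squared singular values.
best = np.sum(s[K:] ** 2)
achieved = np.linalg.norm(X - X_hat) ** 2
print(np.isclose(achieved, best))
```

Training W^{(1)} and W^{(2)} jointly with nonlinear activations generalizes this to the two-layer deep learning model described in the text.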
5.1. Four-step algorithm

Assume that the available market data have been separated into two (or more, for an iterative process) disjoint sets for training and validation, respectively, denoted by X and \hat{X}. Our goal is to provide a self-contained procedure that illustrates the trade-offs involved in constructing portfolios to achieve a given goal, for example, to beat a given index by a pre-specified level. The projected real-time success of such a goal will depend crucially on the market structure implied by our historical returns. (While not explicitly investigated here, there is also the possibility of including further conditioning variables during our training phase. These might include accounting information or further returns data in the form of derivative prices or volatilities in the market.)

Our four-step deep learning algorithm proceeds via auto-encoding, calibrating, validating, and verifying. This data-driven and model-independent approach provides a new paradigm for prediction and can be summarized as follows. (See also Hutchinson et al. [16]. To contextualize within classic statistical methods, see, e.g., Wold [17] or Hastie et al. [18].)

I. Auto-encoding. Find the market-map, denoted by F^m_W(X), that solves the regularization problem

\arg\min_W \|X - F^m_W(X)\| \quad \text{subject to} \quad \|W\| \le L^m.    (3)

For appropriately chosen F^m, this auto-encodes X with itself and creates a more information-efficient representation of X (in a form of pre-processing).

II. Calibrating. For a desired result (or target) Y, find the portfolio-map, denoted by F^p_W(X), that solves the regularization problem

\arg\min_W \|Y - F^p_W(X)\| \quad \text{subject to} \quad \|W\| \le L^p.    (4)

This creates a (nonlinear) portfolio from X for the approximation of objective Y.
III. Validating. Find L^m and L^p to suitably balance the trade-off between the two errors

\epsilon_m = \|\hat{X} - F^m_{W^*_m}(\hat{X})\| \quad \text{and} \quad \epsilon_p = \|\hat{Y} - F^p_{W^*_p}(\hat{X})\|,

where W^*_m and W^*_p are the solutions to (3) and (4), respectively.

IV. Verifying. Choose a market-map F^m and a portfolio-map F^p such that the validation (step III) is satisfactory.

A central observation for the application of our four-step procedure in a finance setting is that univariate activation functions can frequently be interpreted as compositions of financial put and call options on linear combinations of the input assets. As such, the deep feature abstractions implicit in a deep learning routine become deep portfolios, and are investible, which gives rise to a deep portfolio theory. Put differently, deep portfolio theory relies on deep features, lower (or hidden) layer abstractions, which, through training, correspond to the independent variables.

The question is how to use training data to construct the deep portfolios. The theoretical flexibility to approximate virtually any nonlinear payout function puts regularization in training and validation at the center of deep portfolio theory. In our four-step procedure, portfolio optimization and inefficiency detection become almost entirely data-driven (and therefore model-free) tasks, contrasting with classic portfolio theory.

When plotting the goal of interest as a function of the amount of regularization, we refer to this as the efficient deep frontier, which serves as a metric during the verification step.

5.2. Smart indexing the IBB index

For the four phases of our deep portfolio process (auto-encode, calibrate, validate, and verify), we conduct auto-encoding and calibration on the period January 2012 to December 2013, and validation and verification on the period January 2014 to April 2016. For the auto-encoder as well as the deep learning routine, we use one hidden layer with five neurons.
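The four steps can be sketched end to end on a toy universe, with regularized linear maps standing in for the deep market-map F^m and portfolio-map F^p. Everything here (returns, index, regularization level) is a synthetic assumption, not the paper's IBB data or network.

```python
import numpy as np

# Illustrative sketch of the four-step procedure:
# I auto-encode, II calibrate, III validate, IV verify.

rng = np.random.default_rng(6)
T, T_val, n_stocks = 104, 52, 20            # weekly returns: train, validate
R = 0.02 * rng.normal(size=(T + T_val, n_stocks))
index = R @ (np.ones(n_stocks) / n_stocks)  # toy index: equal-weight average

R_tr, R_va = R[:T], R[T:]
y_tr, y_va = index[:T], index[T:]
lam = 1e-4                                  # stands in for the bounds L^m, L^p

# I. Auto-encoding: regularized linear self-map of the stock universe.
W_m = np.linalg.solve(R_tr.T @ R_tr + lam * np.eye(n_stocks), R_tr.T @ R_tr)

# II. Calibrating: portfolio-map from the stocks to the target index.
w_p = np.linalg.solve(R_tr.T @ R_tr + lam * np.eye(n_stocks), R_tr.T @ y_tr)

# III. Validating: out-of-sample errors eps_m and eps_p of the two maps.
eps_m = np.linalg.norm(R_va - R_va @ W_m)
eps_p = np.linalg.norm(y_va - R_va @ w_p)

# IV. Verifying: accept if out-of-sample tracking error is small
# relative to the size of the target itself.
print(eps_p < 0.2 * np.linalg.norm(y_va))
```

Sweeping `lam` and re-running step III traces out a regularization path whose out-of-sample error curve is the efficient deep frontier used in the verification step.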
After auto-encoding the universe of stocks, we consider the two-norm difference between every stock and its auto-encoded version and rank the stocks by this measure of degree of communal information. (In reproducing the universe of stocks from a bottleneck network structure, the auto-encoder reduces the total information to an information subset that is applicable to a large number of stocks. Therefore, proximity of a stock to its auto-encoded version provides a measure for the similarity of a stock with the stock universe.) As there is no benefit in having multiple stocks contributing the same information, we increase the number of stocks in our deep portfolio by using the 10 most communal stocks plus x-number of most non-communal stocks (as we do not want to add unnecessary communal information); for example, 25 stocks means 10 plus 15 (where x = 15). In the top-left chart in Figure 1, we see the stocks AMGN and BCRX with their auto-encoded versions as the two stocks with the highest and lowest communal information, respectively.

In the calibration phase, we use rectified linear units (ReLU) and fourfold cross-validation.
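The communal-information ranking can be sketched as follows, with a truncated-SVD bottleneck standing in for the trained auto-encoder and a synthetic stock universe in which stock 0 is deliberately made strongly idiosyncratic.

```python
import numpy as np

# Sketch of the communal-information ranking: each stock's return
# series is compared with its auto-encoded version, and stocks are
# ranked by the two-norm difference.

rng = np.random.default_rng(7)
n_stocks, T, K = 30, 104, 5
common = rng.normal(size=(K, T))        # common factors
loadings = rng.normal(size=(n_stocks, K))
idio = rng.normal(size=(n_stocks, T))
idio[0] *= 5.0                          # stock 0: strongly idiosyncratic
R = loadings @ common + 0.3 * idio      # returns matrix (stocks x weeks)

# Bottleneck reconstruction via truncated SVD (linear auto-encoder).
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_hat = U[:, :K] @ np.diag(s[:K]) @ Vt[:K]

# Two-norm distance to the auto-encoded version, per stock.
dist = np.linalg.norm(R - R_hat, axis=1)
order = np.argsort(dist)                # most communal first

most_communal = order[:10]              # always kept
least_communal = order[::-1][:15]       # plus x = 15 non-communal stocks
portfolio = np.concatenate([most_communal, least_communal])
print(len(set(portfolio.tolist())) == 25)
```

As intended, the highly idiosyncratic stock lands among the non-communal additions, since its returns are poorly reproduced from the bottleneck.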
In the top-right chart in Figure 1, we see training results for deep portfolios with 25, 45, and 65 stocks, respectively. In the bottom-left chart of Figure 1, we see validation (i.e., out-of-sample application) results for the different deep portfolios. In the bottom-right chart in Figure 1, we see the efficient deep frontier of the considered example, which plots the number of stocks used in the deep portfolio against the achieved validation accuracy. Model selection (i.e., verification) is conducted through comparison of efficient deep frontiers.

While the efficient deep frontier still requires us to choose (similarly to classic portfolio theory) between two desirables, namely, index tracking with few stocks as well as a low validation error, these decisions are now purely based on out-of-sample performance, making deep portfolio theory a strictly data-driven approach.

5.3. Outperforming the IBB index

The 1% problem seeks the best strategy to outperform a given benchmark by 1% per year. In our theory of deep portfolios, this is achieved by uncovering a performance-improving deep feature, which can be trained and validated successfully. Crucially, thanks to the Kolmogorov-Arnold theorem (Section 2), hierarchical layers of univariate nonlinear payouts can be used to scan for such features in virtually any shape and form.

For the current example (beating the IBB index), we have amended the target data during the calibration phase by replacing all returns smaller than -5% by exactly 5%, which aims to create an index tracker with anti-correlation in periods of large drawdowns. We see the amended target as the red curve in the top-left chart in Figure 2 and the training success on the top-right. In the bottom-left chart in Figure 2, we see how the learned deep portfolio achieves outperformance (in times of drawdowns) during validation.
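The target amendment itself is a one-line transformation; the sketch below assumes a -5% cutoff and a +5% replacement on a synthetic return series.

```python
import numpy as np

# Sketch of the target amendment for the 1% problem: index returns
# below a cutoff are replaced by a fixed positive return, so the
# calibrated portfolio is trained toward anti-correlation in large
# drawdowns. Cutoff/replacement values and the series are assumptions.

index_returns = np.array([0.01, -0.08, 0.03, -0.06, 0.02, -0.01])
cutoff, replacement = -0.05, 0.05

amended = np.where(index_returns < cutoff, replacement, index_returns)
print(amended.tolist())  # the -8% and -6% weeks become +5%
```

The deep portfolio is then calibrated against `amended` rather than the raw index, exactly as in step II of the four-step procedure.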
[Figure 1 near here: four panels. Top left: auto-encoder high vs. low precision example (AMGN and BCRX with their auto-encoded versions). Top right: calibration phase (IBB vs. S25, S45, S65). Bottom left: validation phase. Bottom right: verification phase, the deep frontier (validation error, two-norm out-of-sample error).]

Figure 1. We see the four phases of a deep portfolio process: auto-encode, calibrate, validate, and verify. For the auto-encoder as well as the deep learning routine, we use one hidden layer with five neurons. We use rectified linear unit activation functions. We have a list of component stocks but no weights; we want to select a subset of stocks and infer weights to track the IBB index. S25, S45, and so on denote the number of stocks used. After ranking the stocks by auto-encoding, we increase the number of stocks by using the 10 most communal stocks plus x-number of most non-communal stocks (as we do not want to add unnecessary communal information); for example, 25 stocks means 10 plus 15 (where x = 15). We use weekly returns and fourfold cross-validation in training. We calibrate on the period January 2012 to December 2013 and then validate on the period January 2014 to April 2016. The deep frontier (bottom right) shows the trade-off between the number of stocks used and the validation error.

The efficient deep frontier in the bottom-right chart in Figure 2 is drawn with regard to the amended target during the validation period. Due to the more ambitious target, the validation error is now larger throughout, but, as before, the verification suggests that, for the current model, a deep portfolio of at least 40 stocks should be employed for reliable prediction.
[Figure 2 near here: four panels. Top left: calibration target amendment (IBB vs. amended target). Top right: calibration phase. Bottom left: validation phase. Bottom right: verification phase, the deep frontier (validation error, two-norm out-of-sample error).]

Figure 2. We proceed exactly as in Figure 1, but we alter the target index in the calibration phase by replacing all returns smaller than -5% by exactly 5%, which aims to create an index tracker with anti-correlation in periods of large drawdowns. On the top left, we see the altered calibration target. During the validation phase (bottom left), we notice that our tracking portfolio achieves the desired returns in periods of drawdowns, while the deep frontier (which is calculated with respect to the modified target on the validation set, bottom right) shows that the expected deviation from the target increases somewhat throughout compared to Figure 1 (as would be expected).

6. Conclusion

Deep learning presents a general framework for using large data sets to optimize predictive performance. As such, deep learning frameworks are well-suited to many problems, both practical and theoretical, in finance. This paper introduces deep learning hierarchical decision models for problems in financial prediction and classification. Deep learning has the potential to improve, sometimes dramatically, on the predictive performance of conventional methods. Our example on smart indexing in Section 5 presents just one way to implement deep learning models in finance. Sirignano [19] provides an application to limit order books. Many other applications remain for development.

References

1. Dean J, Corrado G, Monga R, et al. Large scale distributed deep networks.
Advances in Neural Information Processing Systems 2012; 25: 1223-1231.
2. Ripley BD. Pattern Recognition and Neural Networks. Cambridge University Press: Cambridge, 1996.
3. Kolmogorov A. The representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR 1957; 114: 953-956.
4. Diaconis P, Shahshahani M. On non-linear functions of linear combinations. SIAM Journal on Scientific and Statistical Computing 1984; 5(1): 175-191.
5. Lorentz GG. The 13th problem of Hilbert. Proceedings of Symposia in Pure Mathematics, American Mathematical Society 1976; 28: 419-430.
6. Gallant AR, White H. There exists a neural network that does not make avoidable mistakes. IEEE International Conference on Neural Networks 1988; 1: 657-664.
7. Poggio T, Girosi F. Networks for approximation and learning. Proceedings of the IEEE 1990; 78(9): 1481-1497.
8. Hornik K, Stinchcombe M, White H. Multilayer feedforward networks are universal approximators. Neural Networks 1989; 2(5): 359-366.
9. LeCun YA, Bottou L, Orr GB, Müller KR. Efficient backprop. Neural Networks: Tricks of the Trade 1998; 1524: 9-48.
10. Polson NG, Scott JG, Willard BT. Proximal algorithms in statistics and machine learning. Statistical Science 2015; 30: 559-581.
11. Stein C. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics 1981; 9(6): 1135-1151.
12. Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science 2006; 313(5786): 504-507.
13. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 2014; 15: 1929-1958.
14. Lake BM, Salakhutdinov R, Tenenbaum JB. Human-level concept learning through probabilistic program induction. Science 2015; 350(6266): 1332-1338.
15. Cook RD. Fisher lecture: dimension reduction in regression. Statistical Science 2007; 22(1): 1-26.
16. Hutchinson JM, Lo AW, Poggio T. A nonparametric approach to pricing and hedging derivative securities via learning networks. Journal of Finance 1994; 49(3): 851-889.
17. Wold H. Causal inference from observational data: a review of ends and means. Journal of the Royal Statistical Society, Series A (General) 1956; 119: 28-61.
18. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning, Vol. 2. Springer, 2009.
19. Sirignano J. Deep learning for limit order books, 2016. arXiv:1601.01987v7.
