10-13-2022, 05:18 PM
Deep Forecasting Bitcoin with LSTM Architectures
<div>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="519" height="923" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-127.png" alt="" class="wp-image-782501" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-127.png 519w, https://blog.finxter.com/wp-content/uplo...69x300.png 169w" sizes="(max-width: 519px) 100vw, 519px" /></figure>
</div>
<p>Although <a href="https://blog.finxter.com/how-neural-networks-learn/" data-type="post" data-id="568016" target="_blank" rel="noreferrer noopener">Neural Networks</a> do a tremendous job learning rules in tabular, structured data, they leave a great deal to be desired when it comes to ‘unstructured’ data. This brings us to a new concept: <strong><em>Recurrent Neural Networks</em></strong>.</p>
<figure class="wp-block-embed-youtube wp-block-embed is-type-video is-provider-youtube"><a href="https://blog.finxter.com/bitcoin-price-forecast-with-lstm-based-architectures/"><img src="https://blog.finxter.com/wp-content/plugins/wp-youtube-lyte/lyteCache.php?origThumbUrl=https%3A%2F%2Fi.ytimg.com%2Fvi%2FZYmVCK2sX9w%2Fhqdefault.jpg" alt="YouTube Video"></a><figcaption></figcaption></figure>
<h2>Recurrent Neural Network</h2>
<p>A Recurrent Neural Network is to a Feedforward Neural Network as a list is to a single object: it may be thought of as a set of interrelated feedforward networks, or as a looped network. </p>
<p>RNNs specialize in picking up and highlighting the main characteristics of your data (more on that in <a rel="noreferrer noopener" href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/" target="_blank">Andrej Karpathy’s Blog</a>). They are often followed by a Feed Forward (Dense) layer, which weighs their output.</p>
<h2>Long Short-Term Memory</h2>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="576" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-128-1024x576.png" alt="" class="wp-image-782540" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-128-1024x576.png 1024w, https://blog.finxter.com/wp-content/uplo...00x169.png 300w, https://blog.finxter.com/wp-content/uplo...68x432.png 768w, https://blog.finxter.com/wp-content/uplo...ge-128.png 1345w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>Long Short-Term Memory (LSTM) layers are a type of RNN with a special ability to deal with time (more on this can be found in <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/" target="_blank" rel="noreferrer noopener">Colah’s article</a>). </p>
<p>As the term <strong><em>memory</em></strong> suggests, their greatest promise is to capture correlations between past and present events. In particular, they fit naturally into time series forecasting. </p>
<p>Here we aim at a hands-on introduction to several LSTM-based architectures (and more are to come <img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f609.png" alt="😉" class="wp-smiley" style="height: 1em; max-height: 1em;" />). </p>
<h2>Article Overview</h2>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="682" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-129-1024x682.png" alt="" class="wp-image-782562" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-129-1024x682.png 1024w, https://blog.finxter.com/wp-content/uplo...00x200.png 300w, https://blog.finxter.com/wp-content/uplo...68x512.png 768w, https://blog.finxter.com/wp-content/uplo...ge-129.png 1345w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>We use Bitcoin’s daily closing price as a case study. Specifically, we use the Bitcoin price and sentiment analysis we gathered in a <a rel="noreferrer noopener" href="https://blog.finxter.com/python-time-series-forecast-a-guided-example-on-bitcoin-price-data/" target="_blank">previous article</a>. We use <a href="https://blog.finxter.com/tensorflow-developer-income-and-opportunity/" data-type="post" data-id="259596" target="_blank" rel="noreferrer noopener">TensorFlow</a>’s <a href="https://blog.finxter.com/keras-developer-income-and-opportunity/" data-type="post" data-id="257517" target="_blank" rel="noreferrer noopener">Keras</a> API for the implementation.</p>
<p>In this article, we will cover the following architectures:</p>
<ol>
<li>‘Vanilla’ LSTM</li>
<li>Stacked LSTM</li>
<li>Bidirectional LSTM</li>
<li>Encoder-Decoder LSTM-LSTM</li>
<li>Encoder-Decoder CNN-LSTM</li>
</ol>
<p>The last one is the most convoluted (pun intended).</p>
<p>There is one main issue when dealing with time series, which is how the problem is framed. Two situations are common: having only the historical target values (a univariate problem), or having them together with other information (a multivariate problem). </p>
<p>Moreover, you might be interested in a one-step prediction or a multi-step prediction, i.e., predicting only the next day or, say, all days in the next week. Although it may not sound like it, you have to adjust your model to whatever situation you are facing. </p>
<p>Think of how you would deal with a multivariate multi-step problem: should you train a one-step model and forecast all features just to feed them back into your model to predict the following days? That would be crazy! </p>
<p><a rel="noreferrer noopener" href="https://www.kaggle.com/code/ryanholbrook/forecasting-with-machine-learning" target="_blank">Kaggle’s time series course</a> does a good job introducing the several strategies present to deal with multi-step prediction. Fortunately, setting an LSTM network for a multi-step multivariate problem is as easy as setting it for a univariate one-step problem – you just need to change two numbers. </p>
<p>This is another advantage of Neural Networks, apart from their capacity for memory. </p>
<p>Of course, the architecture list above is not exhaustive. For instance, a new <em>Attention</em> layer was recently introduced, which has been <a href="https://exchange.scale.com/public/blogs/state-of-ai-report-2021-transformers-taking-ai-world-by-storm-nathan-benaich" target="_blank" rel="noreferrer noopener">working wonders</a>. We shall come back to it in a future article, where we will walk through a hybrid Attention-CLX model.</p>
<p>Credits to <a href="https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/" target="_blank" rel="noreferrer noopener">ML Mastery blog</a> for part of the code. </p>
<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f6ab.png" alt="🚫" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Disclaimer</strong>: This article is a programming/data analysis tutorial only and is not intended to be any kind of investment advice.</p>
<h2>How to Prepare the Data for LSTM?</h2>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="683" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-130-1024x683.png" alt="" class="wp-image-782575" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-130-1024x683.png 1024w, https://blog.finxter.com/wp-content/uplo...00x200.png 300w, https://blog.finxter.com/wp-content/uplo...68x512.png 768w, https://blog.finxter.com/wp-content/uplo...ge-130.png 1345w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>We will use two sources of data, both described in our <a href="https://blog.finxter.com/python-time-series-forecast-a-guided-example-on-bitcoin-price-data/" target="_blank" rel="noreferrer noopener">previous article</a>: <a href="https://senticrypt.com" target="_blank" rel="noreferrer noopener">SentiCrypt</a>’s Bitcoin sentiment analysis and Bitcoin’s daily closing price. By following the steps in the previous article, you can also set this up differently, for example with minute-level data. </p>
<p>Let us load the already-saved sentiment analysis and download the Bitcoin price:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd
import yfinance as yf

# sentiment data saved in the previous article
sentic = pd.read_csv('sentic.csv', index_col=0, parse_dates=True)
sentic.index.freq='D'

# daily closing price; interval='1d' asks for daily bars between start and end
btc = yf.download('BTC-USD', start='2020-02-14', end='2022-09-23', interval='1d')[['Close']]
btc.columns = ['btc']

# one row per day, sentiment columns plus the price
data = pd.concat([sentic,btc], axis=1)
data
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="887" height="416" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-115.png" alt="" class="wp-image-782302" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-115.png 887w, https://blog.finxter.com/wp-content/uplo...00x141.png 300w, https://blog.finxter.com/wp-content/uplo...68x360.png 768w" sizes="(max-width: 887px) 100vw, 887px" /></figure>
</div>
<p>The LSTM layer expects a 3D array as input whose shape represents:</p>
<p><code>(data_size, timesteps, number_of_features)</code>.</p>
<p>That is, the first and last elements are the number of rows and columns of the input data, respectively. The timesteps element is the size of the time chunk you want your LSTM to process at a time. This is the time frame within which the LSTM looks for relations between past and present. It is essentially the size of its (long short-term) <em>memory</em>.</p>
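<p>As a rough illustration of that shape, the sketch below builds overlapping windows from a toy 2D array with plain NumPy. The array <code>toy</code>, the window length of 3 and the helper name <code>make_windows</code> are illustrative assumptions and are not part of this article's pipeline, which arrives at the same kind of array through lagged columns instead.</p>

```python
import numpy as np

def make_windows(values, timesteps):
    """Stack overlapping windows of `timesteps` consecutive rows.

    Returns an array of shape (n_windows, timesteps, n_features).
    Hypothetical helper, for illustration only.
    """
    n_windows = values.shape[0] - timesteps + 1
    return np.stack([values[i:i + timesteps] for i in range(n_windows)])

# 6 days, 2 features per day (toy numbers)
toy = np.arange(12).reshape(6, 2)
windows = make_windows(toy, timesteps=3)
print(windows.shape)  # (4, 3, 2)
```

<p>Each of the 4 samples holds 3 consecutive days of 2 features, which is the layout an LSTM layer expects.</p>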
<p>To decide how many time-steps, we recall our first time series article where we explored partial auto-correlations of Bitcoin price’s lags. </p>
<p>That is easily achieved through <a href="https://blog.finxter.com/logistic-regression-scikit-learn-vs-statsmodels/" data-type="post" data-id="22984" target="_blank" rel="noreferrer noopener">statsmodels</a>:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from statsmodels.graphics.tsaplots import plot_pacf
import matplotlib.pyplot as plt

plot_pacf(data.btc, lags=20)
plt.show()
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="568" height="433" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-116.png" alt="" class="wp-image-782309" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-116.png 568w, https://blog.finxter.com/wp-content/uplo...00x229.png 300w" sizes="(max-width: 568px) 100vw, 568px" /></figure>
</div>
<p>If you were with me in the <a href="https://blog.finxter.com/python-time-series-forecast-a-guided-example-on-bitcoin-price-data/" data-type="post" data-id="715831" target="_blank" rel="noreferrer noopener">first article</a>, you might remember our curious 10-lag correlation. Here we use this magic number and <strong>feed the model a 10-day frame to make a 5-day prediction.</strong> I found the results with 10 days better than with 6 or 20 days (in most cases – see below for more about this). We also assume we have today’s data and try to forecast the next 5 days.</p>
<p>An easy way to reshape the data is through a slight modification of our <code>make_lags</code> function together with NumPy’s <code><a href="https://blog.finxter.com/numpy-reshape/" data-type="post" data-id="3781" target="_blank" rel="noreferrer noopener">reshape()</a></code> method. </p>
<p>So, instead of a Series, we will take a DataFrame as input and will output a concatenation of the original frame with its respective lags. We use negative lags to prepare the target DataFrame. We will ignore observations with the produced NaN values and will use the align method to align their indexes. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def make_lags(df, n_lags=1, lead_time=1):
    """
    Compute lags of a pandas.DataFrame from lead_time to lead_time + n_lags.
    Alternatively, a list can be passed as n_lags.
    Returns a pd.DataFrame resulting from the concatenation of df's shifts.
    """
    if isinstance(n_lags,int):
        lag_list = range(lead_time, n_lags+lead_time)
    else:
        lag_list = n_lags
    lags=list()
    for i in lag_list:
        df_lag = df.shift(i)
        if i!=0:
            df_lag.columns = [f'{col}_lag_{i}' for col in df.columns]
        lags.append(df_lag)
    return pd.concat(lags, axis=1)

# features: today's values plus the 9 previous days (the 10-day frame)
X = make_lags(data, n_lags=10, lead_time=0).dropna()
# targets: the price for each of the next 5 days (negative lags)
y = make_lags(data[['btc']], n_lags=range(-5,0)).dropna()

# keep only the dates present in both frames
X, y = X.align(y, join='inner', axis=0)
</pre>
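<p>To see what positive and negative shifts do, here is a small sketch on a made-up frame. The values and the name <code>toy</code> are assumptions for illustration; only <code>shift</code>, <code>add_suffix</code>, <code>concat</code> and <code>dropna</code> from pandas are used.</p>

```python
import pandas as pd

# Toy frame standing in for the price data (values are made up)
toy = pd.DataFrame({'btc': [10.0, 11.0, 12.0, 13.0, 14.0]})

past = toy.shift(1)      # value from the previous row: a feature
future = toy.shift(-1)   # value from the next row: a target

demo = pd.concat(
    [toy, past.add_suffix('_lag_1'), future.add_suffix('_lag_-1')],
    axis=1,
)
print(demo)

# Rows with NaN at either end carry incomplete information and are dropped
complete = demo.dropna()
print(complete.index.tolist())  # [1, 2, 3]
```

<p>The first row has no past and the last row has no future, which is why both disappear after <code>dropna()</code>.</p>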
<p>Next, we train-test split the data with <a href="https://blog.finxter.com/tutorial-how-to-create-your-first-neural-network-in-1-line-of-python-code/" data-type="post" data-id="2463" target="_blank" rel="noreferrer noopener">sklearn</a>, taking 10% as test size. As usual for time series, we include <code>shuffle=False</code> as a parameter.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=.1, shuffle=False)
</pre>
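<p>The effect of an unshuffled split can be pictured with a few lines of pure Python. This is only a sketch of the idea, not sklearn's implementation; the helper name <code>chronological_split</code> and the 20-day toy series are assumptions.</p>

```python
import math

def chronological_split(rows, test_size):
    """Keep the first rows for training and the last ones for validation.

    Hypothetical helper: a pure-Python picture of an unshuffled split.
    """
    n_val = math.ceil(len(rows) * test_size)
    cut = len(rows) - n_val
    return rows[:cut], rows[cut:]

days = list(range(20))            # stand-in for 20 consecutive days
train_days, val_days = chronological_split(days, test_size=0.25)
print(train_days[-1], val_days[0])  # 14 15
```

<p>Every validation day comes after every training day, so the model is never evaluated on dates that precede its training data.</p>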
<p>It is good practice to normalize the data before feeding it into a Neural Network. We do it now, before things get 3D.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.preprocessing import MinMaxScaler

# fit on the training set only, then apply the same scaling to both sets
mms = MinMaxScaler().fit(X_train)
X_train, X_val = mms.transform(X_train), mms.transform(X_val)
</pre>
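<p>Fitting the scaler on the training set alone has a visible consequence, sketched below with hand-written min-max scaling in NumPy. The numbers and the helper <code>scale</code> are made up for illustration and only mimic the formula; they are not the scaler object used above.</p>

```python
import numpy as np

train = np.array([[10.0], [20.0], [30.0]])
val = np.array([[25.0], [40.0]])

# Statistics come from the training rows only
lo, hi = train.min(axis=0), train.max(axis=0)

def scale(x):
    """Min-max scaling with the training minimum and maximum."""
    return (x - lo) / (hi - lo)

print(scale(train).ravel().tolist())  # [0.0, 0.5, 1.0]
print(scale(val).ravel().tolist())    # [0.75, 1.5]
```

<p>Validation values outside the training range land outside [0, 1]. That is expected, and it keeps information about the validation period from leaking into the preprocessing.</p>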
<p>Finally, we use NumPy to <a href="https://blog.finxter.com/numpy-reshape/" data-type="post" data-id="3781" target="_blank" rel="noreferrer noopener">reshape</a> everything to 3D arrays. Observe that there is no such thing as a 3D <code>pd.DataFrame</code>.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import numpy as np

def add_dim(df, timesteps=5):
    """
    Transforms a pd.DataFrame into a 3D np.array with shape (n_samples, timesteps, n_features)
    """
    df = np.array(df)
    array_3d = df.reshape(df.shape[0],timesteps ,df.shape[1]//timesteps)
    return array_3d

# number of lags stored in X: total columns divided by the number of original features
timesteps = X_train.shape[1] // data.shape[1]
X_train, X_val = map(add_dim, [X_train, X_val], [timesteps]*2)
</pre>
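<p>It can help to check on one hand-made row how this reshape maps columns onto the time axis. The sketch assumes two hypothetical features (<code>sent</code> and <code>btc</code>) and the column order produced by concatenating shifts in increasing lag, as <code>make_lags</code> does.</p>

```python
import numpy as np

n_features = 2   # hypothetical features: sent, btc
timesteps = 3

# One sample laid out as in a lagged table:
# [sent_lag0, btc_lag0, sent_lag1, btc_lag1, sent_lag2, btc_lag2]
flat = np.array([[0.0, 100.0, 0.1, 101.0, 0.2, 102.0]])

cube = flat.reshape(flat.shape[0], timesteps, flat.shape[1] // timesteps)
print(cube.shape)             # (1, 3, 2)
print(cube[0, 1].tolist())    # both features at lag 1: [0.1, 101.0]
print(cube[0, :, 1].tolist()) # btc across lags 0, 1, 2: [100.0, 101.0, 102.0]
```

<p>Under this layout, position <code>k</code> on the time axis holds lag <code>k</code>, so index 0 is the most recent observation.</p>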
<p>Of course, you can always prepare a function to do everything in one shot:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def prepare_data(df, target_name, n_lags, n_steps, lead_time, test_size, normalize=True):
    '''
    Prepare data for LSTM.
    '''
    # positive n_steps are turned into negative shifts (future values)
    if isinstance(n_steps,int):
        n_steps = range(1,n_steps+1)
    n_steps = [-x for x in list(n_steps)]

    X = make_lags(df, n_lags=n_lags, lead_time=lead_time).dropna()
    y = make_lags(df[[target_name]], n_lags=n_steps).dropna()
    X, y = X.align(y, join='inner', axis=0)

    from sklearn.model_selection import train_test_split
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=test_size, shuffle=False)

    if normalize:
        from sklearn.preprocessing import MinMaxScaler
        mms = MinMaxScaler().fit(X_train)
        X_train, X_val = mms.transform(X_train), mms.transform(X_val)

    if isinstance(n_lags,int):
        timesteps = n_lags
    else:
        timesteps = len(n_lags)

    return add_dim(X_train,timesteps), add_dim(X_val,timesteps), y_train, y_val
</pre>
<p>Note that one should give positive values to <code>n_steps</code> to have the right negative shifts. Fortunately, <code>y_train</code>, <code>y_val</code> are not reshaped, which makes life easier when comparing predictions with reality.</p>
<p>All set, let’s start with the most basic Vanilla model.</p>
<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Side note</strong>: We are keeping things simple here, but in a future post, we will prepare our own batches and take a closer look at the <em>stateful</em> parameter of an LSTM layer. More on its input and output can be found in <a href="https://github.com/MohammadFneish7/Keras_LSTM_Diagram" target="_blank" rel="noreferrer noopener">Mohammad’s Git</a>.</p>
<h2>How to Implement Vanilla LSTM with Keras?</h2>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="683" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-131-1024x683.png" alt="" class="wp-image-782588" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-131-1024x683.png 1024w, https://blog.finxter.com/wp-content/uplo...00x200.png 300w, https://blog.finxter.com/wp-content/uplo...68x512.png 768w, https://blog.finxter.com/wp-content/uplo...ge-131.png 1345w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>A model is called Vanilla when it has no additional structure apart from the output layer. </p>
<p>To implement it we add an <a href="https://keras.io/api/layers/recurrent_layers/lstm/" target="_blank" rel="noreferrer noopener">LSTM</a> and a <a href="https://keras.io/api/layers/core_layers/dense/" target="_blank" rel="noreferrer noopener">Dense</a> layer. We must pass the number of units of each and the input shape for the LSTM layer.</p>
<p>The input shape is exactly <code>(n_timesteps, n_features)</code>, which can be inferred from <code>X_train.shape</code>. The number of units for the LSTM layer is a hyperparameter and should be tuned. For the Dense layer, it is the number of outputs we want, which is 5 here. </p>
<p>The code below is tuning-friendly: it specifies the main parameters in advance. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from keras.models import Sequential
from keras.layers import Dense, LSTM

# Data preparation
n_lags, n_steps, lead_time, test_size = 10, 5, 0, .2

# hyperparameters
epochs, batch_size, verbose = 50, 72, 0
model_params = {}

# preparing data
X_train, X_val, y_train, y_val = prepare_data(data, 'btc', n_lags, n_steps, lead_time, test_size)

# model architecture
vanilla = Sequential()
vanilla.add(LSTM(units=200, activation='relu', input_shape=(X_train.shape[1],X_train.shape[2]) ))
vanilla.add(Dense(n_steps))
</pre>
<p>The <code>model_params</code> <a href="https://blog.finxter.com/python-dictionary/" data-type="post" data-id="5232" target="_blank" rel="noreferrer noopener">dictionary</a> will be useful for passing additional parameters to the <code>fit</code> method, such as an <code>EarlyStopping</code> callback. </p>
<p>We also write a function that fits the model, then plots and assesses its predictions. The present code does not return anything, so feel free to change it in order to do so. We fix the optimizer as Adam and the loss as Mean Squared Error.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def fit_model(model, learning_rate=0.001, time_distributed=False, epochs=epochs, batch_size=batch_size, verbose=verbose):
    y_ind = y_val.index
    # TimeDistributed models output 3D arrays, so the targets get a third axis
    if time_distributed:
        y_train_0 = y_train.to_numpy().reshape((y_train.shape[0], y_train.shape[1],1))
        y_val_0 = y_val.to_numpy().reshape((y_val.shape[0], y_val.shape[1],1))
    else:
        y_train_0 = y_train
        y_val_0 = y_val

    # fit network
    from keras.optimizers import Adam
    adam = Adam(learning_rate=learning_rate)
    # pass the optimizer object; the string 'adam' would ignore learning_rate
    model.compile(loss='mse', optimizer=adam)
    history = model.fit(X_train, y_train_0, epochs=epochs, batch_size=batch_size, verbose=verbose, **model_params,
                        validation_data=(X_val, y_val_0), shuffle=False)

    # make a prediction
    if time_distributed:
        predictions = model.predict(X_val)[:,:,0]
    else:
        predictions = model.predict(X_val)
    yhat = pd.DataFrame(predictions, index=y_ind, columns=[f'pred_lag_{i}' for i in range(-n_steps,0)])
    yhat_shifted = pd.concat([yhat.iloc[:,i].shift(-n_steps+i) for i in range(len(yhat.columns))], axis=1)

    # calculate RMSE and R2 (sklearn metrics take y_true first, then y_pred)
    from sklearn.metrics import mean_squared_error, r2_score
    rmse = np.sqrt(mean_squared_error(y_val, yhat))
    r2 = r2_score(y_val, yhat)

    # plot the loss curves (top) and the predictions against reality (bottom)
    import matplotlib.pyplot as plt
    fig, (ax1,ax2) = plt.subplots(2,1,figsize=(14,14))
    y_val.iloc[:,0].plot(ax=ax2,legend=True)
    yhat_shifted.plot(ax=ax2)
    ax2.set_title('Prediction comparison')
    ax2.annotate(f'RMSE: {rmse:.5f} \n R2 score: {r2:.5f}', xy=(.68,.93), xycoords='axes fraction')
    ax1.plot(history.history['loss'], label='train')
    ax1.plot(history.history['val_loss'], label='test')
    ax1.legend()
    plt.show()
</pre>
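<p>For reference, the two scores annotated on the plot can be written out by hand for a single output column. The helpers <code>rmse</code> and <code>r2</code> and the four toy values are assumptions for illustration, not the sklearn functions themselves. The sketch also shows that R2 is not symmetric in its arguments.</p>

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root of the mean squared difference."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """One minus residual sum of squares over total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 3.0, 3.5])

print(rmse(y_true, y_pred))
print(r2(y_true, y_pred))   # 0.9
print(r2(y_pred, y_true))   # 0.8, swapping the arguments changes the score
```

<p>RMSE does not care about the order of its arguments, but R2 normalizes by the spread of whatever is passed first, so the true values should always go first.</p>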
<p>The <code>time_distributed</code> parameter will be used in the last two architectures. </p>
<p>I opted to expose a manual <code>learning_rate</code> because, at one point, the Stacked LSTM’s output was an array of NaNs. After figuring out that the gradient descent was not converging, I fixed it by decreasing Adam’s learning rate. </p>
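<p>Why a smaller learning rate can rescue a diverging run is easiest to see on a toy problem. The sketch below uses plain gradient descent on <code>f(x) = x**2</code>, which is an assumption made for simplicity: Adam adapts its steps and behaves differently in detail, but an overly large step size causes the same kind of blow-up.</p>

```python
def gradient_descent(learning_rate, steps=50, x0=1.0):
    """Minimize f(x) = x**2 with plain gradient descent (gradient is 2*x)."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * 2 * x
    return x

small_step = gradient_descent(learning_rate=0.1)
large_step = gradient_descent(learning_rate=1.1)

print(abs(small_step))  # shrinks toward the minimum at 0
print(abs(large_step))  # grows at every step: the iteration diverges
```

<p>In a real network, values that keep growing like this eventually overflow, and the loss and predictions turn into NaNs.</p>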
<p>Use <code>verbose=1</code> as a global parameter to debug your network.</p>
<p>Without further ado:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fit_model(vanilla)</pre>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="1019" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-117-1024x1019.png" alt="" class="wp-image-782356" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-117-1024x1019.png 1024w, https://blog.finxter.com/wp-content/uplo...00x298.png 300w, https://blog.finxter.com/wp-content/uplo...50x150.png 150w, https://blog.finxter.com/wp-content/uplo...68x764.png 768w, https://blog.finxter.com/wp-content/uplo...ge-117.png 1159w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>The performance is comparable to our XGBoost 1-day prediction in the <a href="https://blog.finxter.com/time-series-forecast-a-complete-workflow-part-ii/" target="_blank" rel="noreferrer noopener">last article</a>:</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="479" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-118-1024x479.png" alt="" class="wp-image-782359" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-118-1024x479.png 1024w, https://blog.finxter.com/wp-content/uplo...00x140.png 300w, https://blog.finxter.com/wp-content/uplo...68x359.png 768w, https://blog.finxter.com/wp-content/uplo...ge-118.png 1536w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>Moreover, we are predicting 5 days, not only one, making the r2 score more impressive. </p>
<p>What bothers me, on the other hand, is the fact the predictions for all five days look identical. It requires further analysis to understand why that is happening, which we will not do here.</p>
<h2>How to Build a Stacked LSTM?</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="615" height="923" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-132.png" alt="" class="wp-image-782591" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-132.png 615w, https://blog.finxter.com/wp-content/uplo...00x300.png 200w" sizes="(max-width: 615px) 100vw, 615px" /></figure>
</div>
<p>We can also stack two LSTM layers. </p>
<p>To do so, we need to be careful to give a 3D input to the second LSTM layer, and that is the role the parameter <code>return_sequences</code> plays. We gain a slight increase in the training score in this case.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># model architecture
stacked = Sequential()
# return_sequences=True keeps the time axis, so the next LSTM receives a 3D input
stacked.add(LSTM(100, activation='relu', return_sequences=True, input_shape=(X_train.shape[1],X_train.shape[2])))
stacked.add(LSTM(100, activation='relu'))
stacked.add(Dense(n_steps))

fit_model(stacked)
</pre>
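<p>The shape bookkeeping behind <code>return_sequences</code> can be sketched with a toy recurrence in NumPy. Note the assumptions: <code>simple_rnn</code> is a hypothetical helper with random weights and a plain tanh update, not an LSTM cell and not Keras code. Only the shapes are meant to carry over.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def simple_rnn(x, units, return_sequences=False):
    """Toy tanh recurrence with random weights (not an LSTM cell)."""
    n_samples, timesteps, n_features = x.shape
    w_in = rng.normal(size=(n_features, units))
    w_rec = rng.normal(size=(units, units))
    h = np.zeros((n_samples, units))
    states = []
    for t in range(timesteps):
        h = np.tanh(x[:, t, :] @ w_in + h @ w_rec)
        states.append(h)
    # All hidden states (3D) or only the last one (2D)
    return np.stack(states, axis=1) if return_sequences else h

x = np.zeros((4, 10, 2))                       # (samples, timesteps, features)
seq = simple_rnn(x, units=8, return_sequences=True)
last = simple_rnn(seq, units=8)                # second layer consumes the 3D output
print(seq.shape, last.shape)                   # (4, 10, 8) (4, 8)
```

<p>Without <code>return_sequences=True</code>, the first layer would hand over a 2D array and the second recurrent layer would have no time axis to iterate over.</p>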
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="1019" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-119-1024x1019.png" alt="" class="wp-image-782369" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-119-1024x1019.png 1024w, https://blog.finxter.com/wp-content/uplo...00x298.png 300w, https://blog.finxter.com/wp-content/uplo...50x150.png 150w, https://blog.finxter.com/wp-content/uplo...68x764.png 768w, https://blog.finxter.com/wp-content/uplo...ge-119.png 1159w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<h2>What is a Bidirectional LSTM Layer?</h2>
<p>In general, any RNN layer that meets some minimal requirements can be made bidirectional through Keras’ <a rel="noreferrer noopener" href="https://keras.io/api/layers/recurrent_layers/bidirectional/" target="_blank">Bidirectional</a> layer. It combines two copies of your RNN layer, one of which processes the sequence backward. </p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="766" height="406" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-120.png" alt="" class="wp-image-782378" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-120.png 766w, https://blog.finxter.com/wp-content/uplo...00x159.png 300w" sizes="(max-width: 766px) 100vw, 766px" /><figcaption>Image from <a href="https://analyticsindiamag.com/complete-guide-to-bidirectional-lstm-with-python-codes/">AIM</a>.</figcaption></figure>
</div>
<p>You can either specify the <code>backward_layer</code> as a second RNN layer or just wrap a single one, which will make the Bidirectional instance use a copy as the backward model. An implementation can be found below. </p>
<p>The score is comparable to the Stacked LSTM.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from keras.layers import Bidirectional

bilstm = Sequential()
bilstm.add(Bidirectional(LSTM(100, activation='relu'), input_shape=(X_train.shape[1], X_train.shape[2])))
bilstm.add(Dense(n_steps))

fit_model(bilstm)
</pre>
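<p>The idea of reading the sequence in both directions can be sketched in NumPy. As before, <code>last_state</code> is a hypothetical toy tanh recurrence with random weights rather than an LSTM. The two directions are concatenated at the end, which is what the Keras wrapper does by default, so a wrapped layer with 100 units passes 200 features to the Dense layer.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def last_state(x, w_in, w_rec):
    """Final hidden state of a toy tanh recurrence (not an LSTM cell)."""
    h = np.zeros((x.shape[0], w_rec.shape[0]))
    for t in range(x.shape[1]):
        h = np.tanh(x[:, t, :] @ w_in + h @ w_rec)
    return h

units, n_features = 4, 2
x = rng.normal(size=(3, 10, n_features))   # (samples, timesteps, features)

# Each direction has its own weights
fwd = last_state(x, rng.normal(size=(n_features, units)), rng.normal(size=(units, units)))
bwd = last_state(x[:, ::-1, :], rng.normal(size=(n_features, units)), rng.normal(size=(units, units)))

both = np.concatenate([fwd, bwd], axis=-1)
print(both.shape)  # (3, 8)
```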
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="1019" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-121-1024x1019.png" alt="" class="wp-image-782386" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-121-1024x1019.png 1024w, https://blog.finxter.com/wp-content/uplo...00x298.png 300w, https://blog.finxter.com/wp-content/uplo...50x150.png 150w, https://blog.finxter.com/wp-content/uplo...68x764.png 768w, https://blog.finxter.com/wp-content/uplo...ge-121.png 1159w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<h2>Encoder-Decoder LSTM</h2>
<p>An Encoder-Decoder structure is designed so that one network is dedicated to feature selection and a second one to the actual forecast. The two networks can be of different types; even pairs of a recurrent and a non-recurrent network are allowed. </p>
<p>Here we explore two pairs: LSTM-LSTM and CNN-LSTM. </p>
<p>Compared to the previously presented architectures, the main difference is the inclusion of the <a href="https://keras.io/api/layers/reshaping_layers/repeat_vector/" target="_blank" rel="noreferrer noopener"><code>RepeatVector</code></a> layer and the wrapper <code><a href="https://keras.io/api/layers/recurrent_layers/time_distributed/" target="_blank" rel="noreferrer noopener">TimeDistributed</a></code>. </p>
<p>Although the <code>RepeatVector</code> is smoothly included, the <code>TimeDistributed</code> layer needs some care. It wraps a layer object and applies that layer to each temporal slice of the input it receives. It considers the <code>.shape[1]</code> of the input as the temporal dimension (our <code>prepare_data</code> is in accordance with that). </p>
<p>Moreover, one has to watch out since it outputs a 3D array, in particular our model will output 3D predictions. </p>
<p>For this reason, we have to feed the model with reshaped <code>y_val</code>, <code>y_train</code> so that the loss functions can be computed. Fortunately, we already included the <code>time_distributed</code> parameter in the <code>fit_model</code> to deal with the reshaping. </p>
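<p>The shapes involved can be traced with a NumPy sketch. It only mimics what the two layers do to array shapes and is not Keras code. The sizes are arbitrary, and a single output per time step is assumed here to keep the example small.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, units, n_steps = 4, 6, 5

encoded = rng.normal(size=(n_samples, units))        # 2D encoder output

# RepeatVector-like step: copy the encoding once per output step
repeated = np.repeat(encoded[:, np.newaxis, :], n_steps, axis=1)
print(repeated.shape)   # (4, 5, 6)

# TimeDistributed(Dense)-like step: one weight matrix shared by all steps
w, b = rng.normal(size=(units, 1)), np.zeros(1)
out = repeated @ w + b
print(out.shape)        # (4, 5, 1)
```

<p>The output keeps the time axis, which is why the targets need a matching third axis before a loss can be computed.</p>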
<p>We also increase the number of Epochs since these networks seem to take longer to find a minimum. We include an <code>EarlyStopping</code> though. It already gives an astonishing score!</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from keras.callbacks import EarlyStopping
from keras.layers import RepeatVector, TimeDistributed

# Data preparation
n_lags, n_steps, lead_time, test_size = 10, 5, 0, .2

# hyperparameters
epochs, batch_size, verbose = 300, 32, 0
model_params = {'callbacks':[EarlyStopping(monitor="val_loss", patience=20, mode="auto")]}

# preparing data
X_train, X_val, y_train, y_val = prepare_data(data, 'btc', n_lags, n_steps, lead_time, test_size, normalize=True)

# Encoder
lstmlstm = Sequential()
lstmlstm.add(LSTM(100, activation='relu', input_shape=(X_train.shape[1], X_train.shape[2])))
lstmlstm.add(RepeatVector(n_steps))

# Decoder
lstmlstm.add(LSTM(100, activation='relu', return_sequences=True))
# Note: fit_model only keeps channel 0 of this 3D output
lstmlstm.add(TimeDistributed(Dense(n_steps)))

# epochs and batch_size are passed explicitly because fit_model's defaults
# were bound to the values in place when the function was defined
fit_model(lstmlstm, time_distributed=True, epochs=epochs, batch_size=batch_size)
</pre>
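<p>For readers unfamiliar with the <code>patience</code> argument, the following pure-Python sketch mimics the basic rule behind early stopping: stop once the monitored loss has failed to improve for <code>patience</code> consecutive epochs. The helper <code>early_stopping_epoch</code> and the loss values are made up for illustration; the real callback has more options (such as a minimum delta) that are ignored here.</p>

```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop.

    Hypothetical helper: a simplified version of patience-based early stopping.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best value: remember it and reset the counter
            best = loss
            wait = 0
        else:
            # No improvement this epoch
            wait += 1
            if wait >= patience:
                return epoch
    # Never triggered: training runs through the last epoch
    return len(val_losses) - 1


# Made-up validation losses
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.69, 0.70, 0.75, 0.80]

print(early_stopping_epoch(losses, patience=2))  # 4
print(early_stopping_epoch(losses, patience=3))  # 8
```

<p>With a patience of 2 the run stops before reaching the better value at epoch 5, which is why a generous patience (20 in the code above) is a reasonable choice for noisy loss curves.</p>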
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="1016" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-122-1024x1016.png" alt="" class="wp-image-782407" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-122-1024x1016.png 1024w, https://blog.finxter.com/wp-content/uplo...00x298.png 300w, https://blog.finxter.com/wp-content/uplo...50x150.png 150w, https://blog.finxter.com/wp-content/uplo...68x762.png 768w, https://blog.finxter.com/wp-content/uplo...ge-122.png 1162w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>This is the first time the outputs for the different steps are visibly different from each other. </p>
<p>Still, the predictions seem to be following some trend. In theory, a neural network should be powerful enough to capture trends as well. In practice, however, detrending often gives better results. Even so, 0.82 is a massive increase from the 0.32 of our XGBoost model. </p>
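<p>As a reminder of what detrending means, here is a minimal NumPy sketch that fits a straight line to a toy series, subtracts it, and adds it back afterwards. The series is synthetic and the linear trend is an assumption made for the example; it is not what was done with the Bitcoin data in this article.</p>

```python
import numpy as np

# Synthetic series: a linear trend plus small made-up fluctuations
t = np.arange(10, dtype=float)
noise = np.array([0.3, -0.2, 0.1, 0.0, -0.1, 0.2, -0.3, 0.1, 0.0, -0.1])
series = 2.0 * t + 5.0 + noise

# Fit a straight line (degree-1 polynomial) and subtract it
slope, intercept = np.polyfit(t, series, deg=1)
trend = slope * t + intercept
detrended = series - trend

# A model would be trained on `detrended`; the trend is added back to its forecasts
restored = detrended + trend

print(round(slope, 2))
print(np.allclose(restored, series))  # True
```

<p>The model then only has to learn the fluctuations around the trend, which are usually on a much smaller and more stable scale.</p>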
<h2>Encoder-Decoder CNN-LSTM Network</h2>
<p>The last architecture we present is the CNN-LSTM one. </p>
<p>Here a <a rel="noreferrer noopener" href="https://blog.finxter.com/classification-of-star-wars-lego-images-using-cnn-and-transfer-learning/" data-type="post" data-id="33108" target="_blank">Convolutional Neural Network</a> is used as a feature extractor, a role in which CNNs are well known to perform well on photos and videos. </p>
<p>The main reason they are so useful in this case is mathematical: the “convolutional” in a CNN’s name refers to the <a rel="noreferrer noopener" href="https://en.wikipedia.org/wiki/Convolution" target="_blank">convolution operation in mathematics</a>, which is used to emphasize translation-invariant features. </p>
<p>That makes complete sense when you have a photo, since you want your mobile phone to recognize Toto as a dog regardless of whether it is in the lower-left corner or in the upper-center of the picture (of course your dog’s name is Toto, right?). You may recognize the CNN’s action as the smoothed lines in the graph. </p>
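<p>The translation property can be checked on a toy 1D signal with NumPy. In this sketch the kernel and the signal are made up; it shows that convolving a shifted input gives the shifted response, and that taking the maximum over the whole axis afterwards gives the same value wherever the pattern sits.</p>

```python
import numpy as np

kernel = np.array([1.0, 0.0, -1.0])   # a simple made-up filter

signal = np.zeros(12)
signal[2:5] = [1.0, 3.0, 1.0]         # a small "bump" near the start

shifted = np.roll(signal, 4)          # the same bump, 4 positions later

out = np.convolve(signal, kernel, mode="same")
out_shifted = np.convolve(shifted, kernel, mode="same")

# The response to the shifted input is the shifted response
print(np.allclose(np.roll(out, 4), out_shifted))  # True

# Pooling over the whole axis removes the dependence on position
print(out.max() == out_shifted.max())             # True
```

<p>For a time series, this means a filter that has learned a certain price pattern responds to it no matter where in the input window it occurs.</p>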
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from keras.callbacks import EarlyStopping
from keras.layers import RepeatVector, TimeDistributed, Conv1D, MaxPooling1D, Flatten

# Data preparation
n_lags, n_steps, lead_time, test_size = 10, 5, 0, .2

# hyperparameters
epochs, batch_size, verbose = 300, 32, 0
model_params = {'callbacks':[EarlyStopping(monitor="val_loss", patience=20, mode="auto")]}

# preparing data
X_train, X_val, y_train, y_val = prepare_data(data, 'btc', n_lags, n_steps, lead_time, test_size)

# Encoder
cnn_lstm = Sequential()
cnn_lstm.add(Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(X_train.shape[1], X_train.shape[2])))
cnn_lstm.add(Conv1D(filters=64, kernel_size=3, activation='relu'))
cnn_lstm.add(MaxPooling1D(pool_size=2))
cnn_lstm.add(Flatten())
cnn_lstm.add(RepeatVector(n_steps))

# Decoder
cnn_lstm.add(LSTM(200, activation='relu', return_sequences=True))
cnn_lstm.add(TimeDistributed(Dense(100, activation='relu')))
# Note: fit_model only keeps channel 0 of this 3D output
cnn_lstm.add(TimeDistributed(Dense(n_steps)))

# epochs and batch_size are passed explicitly because fit_model's defaults
# were bound to the values in place when the function was defined
fit_model(cnn_lstm, time_distributed=True, epochs=epochs, batch_size=batch_size)
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="1019" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-123-1024x1019.png" alt="" class="wp-image-782418" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-123-1024x1019.png 1024w, https://blog.finxter.com/wp-content/uplo...00x298.png 300w, https://blog.finxter.com/wp-content/uplo...50x150.png 150w, https://blog.finxter.com/wp-content/uplo...68x764.png 768w, https://blog.finxter.com/wp-content/uplo...ge-123.png 1159w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<h2>Extra Perks</h2>
<p>For the sake of completeness, we tweaked the code a bit. </p>
<p>Do you remember the seemingly significant correlation that popped up at the 20-day lag? Well, increasing from 10 to 20 timesteps actually increases the R2 score of the last model:</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="1019" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-124-1024x1019.png" alt="" class="wp-image-782424" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-124-1024x1019.png 1024w, https://blog.finxter.com/wp-content/uplo...00x298.png 300w, https://blog.finxter.com/wp-content/uplo...50x150.png 150w, https://blog.finxter.com/wp-content/uplo...68x764.png 768w, https://blog.finxter.com/wp-content/uplo...ge-124.png 1159w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>Funnily enough, it increases even more if you use <strong>unnormalized data</strong>, reaching a stellar score of about 0.94! </p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="1019" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-125-1024x1019.png" alt="" class="wp-image-782429" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-125-1024x1019.png 1024w, https://blog.finxter.com/wp-content/uplo...00x298.png 300w, https://blog.finxter.com/wp-content/uplo...50x150.png 150w, https://blog.finxter.com/wp-content/uplo...68x764.png 768w, https://blog.finxter.com/wp-content/uplo...ge-125.png 1159w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>The last thing worth mentioning is the choice of the <a href="https://blog.finxter.com/gradient-descent-in-neural-nets-a-simple-guide-to-ann-learning/" data-type="post" data-id="673142" target="_blank" rel="noreferrer noopener">activation function</a>. If you got the warning below and wonder why, Keras’ <a href="https://keras.io/api/layers/recurrent_layers/lstm/" target="_blank" rel="noreferrer noopener">LSTM documentation</a> provides an answer.</p>
<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f6d1.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>WARNING</strong>: <code>tensorflow:Layer lstm_70</code> will not use cuDNN kernels since it doesn’t meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.</p>
<p>(No, I did not load 70 LSTM layers. I loaded around 210 <img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f635-200d-1f4ab.png" alt="??" class="wp-smiley" style="height: 1em; max-height: 1em;" />)</p>
<p>The documentation says:</p>
<p><em>“The requirements to use the cuDNN implementation are:</em></p>
<ol>
<li><em>activation == tanh</em></li>
<li><em>recurrent_activation == sigmoid</em></li>
<li><em>recurrent_dropout == 0</em></li>
<li><em>unroll is False</em></li>
<li><em>use_bias is True</em></li>
<li><em>Inputs, if use masking, are strictly right-padded.</em></li>
<li><em>Eager execution is enabled in the outermost context.”</em></li>
</ol>
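<p>The first five criteria in the quoted list depend only on the layer’s arguments, so they can be checked mechanically. The function below is a hypothetical helper written for this illustration (it is not part of Keras); it does not cover the masking and eager-execution conditions.</p>

```python
def cudnn_blockers(activation="tanh", recurrent_activation="sigmoid",
                   recurrent_dropout=0.0, unroll=False, use_bias=True):
    """List which of the quoted criteria an LSTM configuration violates.

    Hypothetical helper, not a Keras function. The defaults mirror the
    values required by the quoted documentation.
    """
    blockers = []
    if activation != "tanh":
        blockers.append("activation must be tanh")
    if recurrent_activation != "sigmoid":
        blockers.append("recurrent_activation must be sigmoid")
    if recurrent_dropout != 0:
        blockers.append("recurrent_dropout must be 0")
    if unroll:
        blockers.append("unroll must be False")
    if not use_bias:
        blockers.append("use_bias must be True")
    return blockers


# The layers in this article use activation='relu'
print(cudnn_blockers(activation="relu"))  # ['activation must be tanh']
print(cudnn_blockers())                   # []
```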
<p>Changing the activation to ‘<em>tanh</em>’ is enough in our case to use cuDNN, and training becomes incredibly faster! However, tanh fits our problem poorly:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fit_model(cnn_lstm, time_distributed=True, learning_rate=1)</pre>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="1019" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-126-1024x1019.png" alt="" class="wp-image-782440" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-126-1024x1019.png 1024w, https://blog.finxter.com/wp-content/uplo...00x298.png 300w, https://blog.finxter.com/wp-content/uplo...50x150.png 150w, https://blog.finxter.com/wp-content/uplo...68x764.png 768w, https://blog.finxter.com/wp-content/uplo...ge-126.png 1159w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>(You read that right: the learning rate is 1000x larger than the default. Otherwise the loss curve does not even change.)</p>
<h2>Main Takeaways</h2>
<p>There are a few points we have to keep in mind about LSTM:</p>
<ul>
<li>The shape of their input </li>
<li>What are <em>time steps</em></li>
<li>The shape of the layer’s output, especially when using <code>return_sequences</code></li>
<li>Hyperparameter tuning is worth your time. For instance, the activation functions <em>relu</em> and <em>tanh</em> have their own pros and cons.</li>
<li>There are different architectures to play with (and many more to come; we will deal with Attention blocks and Multi-headed networks soon). Consider using them. I’ve become especially inclined towards the Encoder-Decoders.</li>
</ul>
<p>Feel free to use and edit the code here. </p>
<hr class="wp-block-separator has-alpha-channel-opacity"/>
</div>
https://www.sickgaming.net/blog/2022/10/...itectures/
<div>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="519" height="923" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-127.png" alt="" class="wp-image-782501" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-127.png 519w, https://blog.finxter.com/wp-content/uplo...69x300.png 169w" sizes="(max-width: 519px) 100vw, 519px" /></figure>
</div>
<p>Although <a href="https://blog.finxter.com/how-neural-networks-learn/" data-type="post" data-id="568016" target="_blank" rel="noreferrer noopener">Neural Networks</a> do a tremendous job learning rules in tabular, structured data, they leave a great deal to be desired when it comes to ‘unstructured’ data. And there we come to a new concept: <strong><em>Recurrent Neural Networks</em></strong>.</p>
<figure class="wp-block-embed-youtube wp-block-embed is-type-video is-provider-youtube"><a href="https://blog.finxter.com/bitcoin-price-forecast-with-lstm-based-architectures/"><img src="https://blog.finxter.com/wp-content/plugins/wp-youtube-lyte/lyteCache.php?origThumbUrl=https%3A%2F%2Fi.ytimg.com%2Fvi%2FZYmVCK2sX9w%2Fhqdefault.jpg" alt="YouTube Video"></a><figcaption></figcaption></figure>
<h2>Recurrent Neural Network</h2>
<p>A Recurrent Neural Network is to a Feedforward Neural Network as a list is to a single object: it may be thought of as a set of interrelated feedforward networks, or as a looped network. </p>
<p>It specializes in picking up and highlighting the main characteristics of your data (more on that in <a rel="noreferrer noopener" href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/" target="_blank">Andrej Karpathy’s Blog</a>). Recurrent layers are often followed by a Feed Forward (Dense) layer, which weighs their output.</p>
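<p>The “looped network” picture can be written down in a few lines of NumPy. This is a minimal sketch of a plain recurrent layer (not an LSTM), with random made-up weights and toy sizes: the same small feedforward step is applied at every time step, and its output is fed back in as the hidden state.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_timesteps, n_features, n_units = 10, 2, 4   # toy sizes

x = rng.normal(size=(n_timesteps, n_features))    # one toy input sequence
W_x = rng.normal(size=(n_features, n_units)) * 0.1  # input weights (made up)
W_h = rng.normal(size=(n_units, n_units)) * 0.1     # recurrent weights (made up)
b = np.zeros(n_units)

h = np.zeros(n_units)   # hidden state, carried from one step to the next
states = []
for x_t in x:
    # The same weights are reused at every step: this is the "loop"
    h = np.tanh(x_t @ W_x + h @ W_h + b)
    states.append(h)

states = np.stack(states)
print(states.shape)   # (10, 4)
```

<p>Keeping only the last row corresponds to a layer that returns a single vector per sequence, while keeping all rows corresponds to <code>return_sequences=True</code>.</p>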
<h2>Long Short-Term Memory</h2>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="576" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-128-1024x576.png" alt="" class="wp-image-782540" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-128-1024x576.png 1024w, https://blog.finxter.com/wp-content/uplo...00x169.png 300w, https://blog.finxter.com/wp-content/uplo...68x432.png 768w, https://blog.finxter.com/wp-content/uplo...ge-128.png 1345w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>Long Short-Term Memory (LSTM) cells have the extra special ability to deal with time (more on it can be found in <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/" target="_blank" rel="noreferrer noopener">Colah’s article</a>). </p>
<p>As the term <strong><em>memory</em></strong> suggests, their greatest promise is to understand correlations between past and present events. In particular, they fit naturally into time series forecasting. </p>
<p>Here we aim at a hands-on introduction to several LSTM-based architectures (and more is to come <img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f609.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" />). </p>
<h2>Article Overview</h2>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="682" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-129-1024x682.png" alt="" class="wp-image-782562" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-129-1024x682.png 1024w, https://blog.finxter.com/wp-content/uplo...00x200.png 300w, https://blog.finxter.com/wp-content/uplo...68x512.png 768w, https://blog.finxter.com/wp-content/uplo...ge-129.png 1345w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>We use the Bitcoin daily closing price as a case study. Specifically, we use the Bitcoin price and sentiment analysis we gathered in a <a rel="noreferrer noopener" href="https://blog.finxter.com/python-time-series-forecast-a-guided-example-on-bitcoin-price-data/" target="_blank">previous article</a>. We use <a href="https://blog.finxter.com/tensorflow-developer-income-and-opportunity/" data-type="post" data-id="259596" target="_blank" rel="noreferrer noopener">TensorFlow</a>’s <a href="https://blog.finxter.com/keras-developer-income-and-opportunity/" data-type="post" data-id="257517" target="_blank" rel="noreferrer noopener">Keras</a> API for the implementation.</p>
<p>In this article, we cover the following architectures:</p>
<ol>
<li>‘Vanilla’ LSTM</li>
<li>Stacked LSTM</li>
<li>Bidirectional LSTM</li>
<li>Encoder-Decoder LSTM-LSTM</li>
<li>Encoder-Decoder CNN-LSTM</li>
</ol>
<p>The last one is the most convoluted (pun intended).</p>
<p>One main issue when dealing with time series is how the problem is framed. Two situations are common: having only the historical values of the target (a univariate problem), or having them together with other information (a multivariate problem). </p>
<p>Moreover, you might be interested in a one-step prediction or a multi-step prediction, i.e., predicting only the next day or, say, all days in the next week. Although it may not sound like it, you have to adjust your model to whichever situation you are facing. </p>
<p>Think of how you would deal with a multivariate multi-step problem: should you train a one-step model and forecast all features in order to feed them back into your model to predict the following days? That would be crazy! </p>
<p><a rel="noreferrer noopener" href="https://www.kaggle.com/code/ryanholbrook/forecasting-with-machine-learning" target="_blank">Kaggle’s time series course</a> does a good job introducing the several strategies available for multi-step prediction. Fortunately, setting up an LSTM network for a multi-step multivariate problem is as easy as setting it up for a univariate one-step problem: you just need to change two numbers. </p>
<p>This is another advantage of Neural Networks, apart from their capacity for memory. </p>
<p>Of course, the architecture list above is not exhaustive. For instance, a new <em>Attention</em> layer was recently introduced, which has been <a href="https://exchange.scale.com/public/blogs/state-of-ai-report-2021-transformers-taking-ai-world-by-storm-nathan-benaich" target="_blank" rel="noreferrer noopener">working wonders</a>. We shall come back to it in a future article, where we will walk through a hybrid Attention-CLX model.</p>
<p>Credits to <a href="https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/" target="_blank" rel="noreferrer noopener">ML Mastery blog</a> for part of the code. </p>
<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f6ab.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Disclaimer</strong>: This article is a programming/data analysis tutorial only and is not intended to be any kind of investment advice.</p>
<h2>How to Prepare the Data for LSTM?</h2>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="683" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-130-1024x683.png" alt="" class="wp-image-782575" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-130-1024x683.png 1024w, https://blog.finxter.com/wp-content/uplo...00x200.png 300w, https://blog.finxter.com/wp-content/uplo...68x512.png 768w, https://blog.finxter.com/wp-content/uplo...ge-130.png 1345w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>We will use two sources of data, both described in our <a href="https://blog.finxter.com/python-time-series-forecast-a-guided-example-on-bitcoin-price-data/" target="_blank" rel="noreferrer noopener">previous article</a>: <a href="https://senticrypt.com" target="_blank" rel="noreferrer noopener">SentiCrypt</a>’s Bitcoin sentiment analysis and Bitcoin’s daily closing price (by following the steps in the previous article, you can do it differently, using minute-level data, for example). </p>
<p>Let us load the already-saved sentiment analysis and download the Bitcoin price:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd
import yfinance as yf

# Sentiment data saved in the previous article, indexed by date
sentic = pd.read_csv('sentic.csv', index_col=0, parse_dates=True)
sentic.index.freq='D'

# Daily closing prices; interval='1d' requests one bar per day
btc = yf.download('BTC-USD', start='2020-02-14', end='2022-09-23', interval='1d')[['Close']]
btc.columns = ['btc']

# Join both sources on their date index
data = pd.concat([sentic,btc], axis=1)
data
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="887" height="416" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-115.png" alt="" class="wp-image-782302" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-115.png 887w, https://blog.finxter.com/wp-content/uplo...00x141.png 300w, https://blog.finxter.com/wp-content/uplo...68x360.png 768w" sizes="(max-width: 887px) 100vw, 887px" /></figure>
</div>
<p>The LSTM layer expects a 3D array as input whose shape represents:</p>
<p><code>(data_size, timesteps, number_of_features)</code>.</p>
<p>That is, the first and last elements are the number of rows and columns of the input data, respectively. The timesteps element is the size of the time chunk you want your LSTM to process at a time. This is the time frame within which the LSTM will look for relations between past and present. It is essentially the size of its (long short-term) <em>memory</em>.</p>
<p>To decide how many time steps to use, we recall our first time series article, where we explored the partial autocorrelations of the Bitcoin price’s lags. </p>
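<p>As a quick illustration of that 3D layout, the sketch below cuts a toy table into overlapping windows with NumPy’s <code>sliding_window_view</code>. This is not the approach used in this article (which builds lag columns and reshapes them), and the sizes are made up; it only shows what <code>(data_size, timesteps, number_of_features)</code> looks like. Here the rows inside a window run from oldest to newest.</p>

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

n_days, n_features, timesteps = 8, 2, 3   # toy sizes

# Toy table: one row per day, one column per feature
table = np.arange(n_days * n_features).reshape(n_days, n_features)

# Windows along the time axis; the window axis is appended last,
# giving shape (n_days - timesteps + 1, n_features, timesteps)
windows = sliding_window_view(table, window_shape=timesteps, axis=0)

# Reorder to (data_size, timesteps, number_of_features)
windows = windows.transpose(0, 2, 1)

print(windows.shape)   # (6, 3, 2)
print(windows[0])      # the first three days of the table
```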
<p>That is easily achieved through <a href="https://blog.finxter.com/logistic-regression-scikit-learn-vs-statsmodels/" data-type="post" data-id="22984" target="_blank" rel="noreferrer noopener">statsmodels</a>:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from statsmodels.graphics.tsaplots import plot_pacf
import matplotlib.pyplot as plt

plot_pacf(data.btc, lags=20)
plt.show()
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="568" height="433" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-116.png" alt="" class="wp-image-782309" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-116.png 568w, https://blog.finxter.com/wp-content/uplo...00x229.png 300w" sizes="(max-width: 568px) 100vw, 568px" /></figure>
</div>
<p>If you followed the <a href="https://blog.finxter.com/python-time-series-forecast-a-guided-example-on-bitcoin-price-data/" data-type="post" data-id="715831" target="_blank" rel="noreferrer noopener">first article</a>, you might remember our curious correlation at the 10-day lag. Here we use this magic number and<strong> feed the model a 10-day frame to make a 5-day prediction. </strong>I found the results with 10 days better than with 6 or 20 days (for most cases; see below for more about this). We also assume we have today’s data and try to forecast the next 5 days.</p>
<p>An easy way to accomplish the reshaping of the data is through a slight modification of our <code>make_lags</code> function together with NumPy’s <code><a href="https://blog.finxter.com/numpy-reshape/" data-type="post" data-id="3781" target="_blank" rel="noreferrer noopener">reshape()</a></code> method. </p>
<p>So, instead of a Series, we take a DataFrame as input and output a concatenation of the original frame with its respective lags. We use negative lags to prepare the target DataFrame. We drop the observations with the NaN values this produces and use the <code>align</code> method to align the indexes of features and targets. </p>
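<p>Before the full function, here is what a positive and a negative shift do on a five-row toy frame. The prices and dates are made up; the point is only to show where the NaN values appear and which rows survive the alignment.</p>

```python
import pandas as pd

# Toy frame with made-up prices
toy = pd.DataFrame({"btc": [10.0, 11.0, 12.0, 13.0, 14.0]},
                   index=pd.date_range("2022-01-01", periods=5, freq="D"))

# Positive shift: yesterday's value becomes a feature; the first row turns NaN
features = pd.concat([toy, toy.shift(1).add_suffix("_lag_1")], axis=1).dropna()

# Negative shift: tomorrow's value becomes the target; the last row turns NaN
target = toy.shift(-1).add_suffix("_lag_-1").dropna()

# Keep only the dates present in both frames
features, target = features.align(target, join="inner", axis=0)

print(features.join(target))
```

<p>Only the three middle dates remain: the first one has no past value and the last one has no future value.</p>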
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def make_lags(df, n_lags=1, lead_time=1):
    """
    Compute lags of a pandas.DataFrame from lead_time to lead_time + n_lags.
    Alternatively, a list can be passed as n_lags.
    Returns a pd.DataFrame resulting from the concatenation of df's shifts.
    """
    if isinstance(n_lags,int):
        lag_list = range(lead_time, n_lags+lead_time)
    else:
        lag_list = n_lags
    lags=list()
    for i in lag_list:
        df_lag = df.shift(i)
        if i!=0:
            df_lag.columns = [f'{col}_lag_{i}' for col in df.columns]
        lags.append(df_lag)
    return pd.concat(lags, axis=1)

X = make_lags(data, n_lags=20, lead_time=0).dropna()
y = make_lags(data[['btc']], n_lags=range(-5,0)).dropna()

X, y = X.align(y, join='inner', axis=0)
</pre>
<p>Next, we train-test split the data with <a href="https://blog.finxter.com/tutorial-how-to-create-your-first-neural-network-in-1-line-of-python-code/" data-type="post" data-id="2463" target="_blank" rel="noreferrer noopener">sklearn</a>, taking 10% as test size. As usual for time series, we include <code>shuffle=False</code> as a parameter.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=.1, shuffle=False)
</pre>
<p>It is good practice to normalize the data before feeding it into a Neural Network. We do it now, before things get 3D. Note that the scaler is fitted on the training set only.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler().fit(X_train)
X_train, X_val = mms.transform(X_train), mms.transform(X_val)
</pre>
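<p>The arithmetic behind min-max scaling is simple enough to reproduce by hand. In this NumPy sketch with made-up numbers, the minimum and maximum come from the training rows only and are then reused for the validation rows, so validation values can fall outside the [0, 1] range.</p>

```python
import numpy as np

# Made-up single-feature data
train = np.array([[1.0], [2.0], [3.0], [5.0]])
val = np.array([[4.0], [7.0]])

# Statistics come from the training set only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

def scale(a):
    """Min-max scale with the training statistics."""
    return (a - col_min) / (col_max - col_min)

train_scaled, val_scaled = scale(train), scale(val)

print(train_scaled.ravel())  # values 0.0, 0.25, 0.5, 1.0
print(val_scaled.ravel())    # values 0.75, 1.5
```

<p>A validation price above the training maximum maps to a value larger than 1, which is expected for a trending series such as a price.</p>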
<p>Finally, we use NumPy to <a href="https://blog.finxter.com/numpy-reshape/" data-type="post" data-id="3781" target="_blank" rel="noreferrer noopener">reshape</a> everything to 3D arrays. Observe that there is no such thing as a 3D <code>pd.DataFrame</code>.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import numpy as np

def add_dim(df, timesteps=5):
    """
    Transforms a pd.DataFrame into a 3D np.array with shape (n_samples, timesteps, n_features)
    """
    df = np.array(df)
    array_3d = df.reshape(df.shape[0],timesteps ,df.shape[1]//timesteps)
    return array_3d

# Must match the number of lags built with make_lags above (n_lags=20)
timesteps = 20
X_train, X_val = map(add_dim, [X_train, X_val], [timesteps]*2)
</pre>
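<p>Why a plain <code>reshape</code> is enough: the lagged frame stores all features at lag 0, then all features at lag 1, and so on, so consecutive groups of columns already correspond to time steps. The sketch below checks this on a single made-up row with two hypothetical features, <code>a</code> and <code>b</code>.</p>

```python
import numpy as np

n_features, timesteps = 2, 3

# Column layout of the lagged frame (hypothetical feature names a and b)
columns = ["a_lag_0", "b_lag_0", "a_lag_1", "b_lag_1", "a_lag_2", "b_lag_2"]
row = np.array([[10, 20, 11, 21, 12, 22]])

# Same reshape as in add_dim
cube = row.reshape(row.shape[0], timesteps, row.shape[1] // timesteps)

print(cube.shape)   # (1, 3, 2)
print(cube[0])
# [[10 20]
#  [11 21]
#  [12 22]]
```

<p>Each row of <code>cube[0]</code> is one time step and each column is one feature, with the most recent observation (lag 0) first.</p>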
<p>Of course, you can always prepare a function to do everything in one shot:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def prepare_data(df, target_name, n_lags, n_steps, lead_time, test_size, normalize=True):
    '''
    Prepare data for LSTM.
    '''
    if isinstance(n_steps,int):
        n_steps = range(1,n_steps+1)
    n_steps = [-x for x in list(n_steps)]
    X = make_lags(df, n_lags=n_lags, lead_time=lead_time).dropna()
    y = make_lags(df[[target_name]], n_lags=n_steps).dropna()
    X, y = X.align(y, join='inner', axis=0)

    from sklearn.model_selection import train_test_split
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=test_size, shuffle=False)

    if normalize:
        from sklearn.preprocessing import MinMaxScaler
        mms = MinMaxScaler().fit(X_train)
        X_train, X_val = mms.transform(X_train), mms.transform(X_val)

    if isinstance(n_lags,int):
        timesteps = n_lags
    else:
        timesteps = len(n_lags)

    return add_dim(X_train,timesteps), add_dim(X_val,timesteps), y_train, y_val
</pre>
<p>Note that one should give positive values to <code>n_steps</code> to have the right negative shifts. Fortunately, <code>y_train</code>, <code>y_val</code> are not reshaped, which makes life easier when comparing predictions with reality.</p>
<p>All set, let’s start with the most basic Vanilla model.</p>
<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f4a1.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Side note</strong>: We are keeping things simple here, but in a future post, we will prepare our own batches and take a closer look at the <em>stateful</em> parameter of an LSTM layer. More on its input and output can be found in <a href="https://github.com/MohammadFneish7/Keras_LSTM_Diagram" target="_blank" rel="noreferrer noopener">Mohammad’s Git</a>.</p>
<h2>How to Implement Vanilla LSTM with Keras?</h2>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="683" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-131-1024x683.png" alt="" class="wp-image-782588" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-131-1024x683.png 1024w, https://blog.finxter.com/wp-content/uplo...00x200.png 300w, https://blog.finxter.com/wp-content/uplo...68x512.png 768w, https://blog.finxter.com/wp-content/uplo...ge-131.png 1345w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>A model is called Vanilla when it has no additional structure apart from the output layer. </p>
<p>To implement it we add an <a href="https://keras.io/api/layers/recurrent_layers/lstm/" target="_blank" rel="noreferrer noopener">LSTM</a> and a <a href="https://keras.io/api/layers/core_layers/dense/" target="_blank" rel="noreferrer noopener">Dense</a> layer. We must pass the number of units of each and the input shape for the LSTM layer.</p>
<p>The input shape is exactly <code>(n_timesteps, n_features)</code>, which can be inferred from <code>X_train.shape</code>. The number of units for the LSTM layer is a hyperparameter to be tuned; for the Dense layer, it is the number of outputs we want, which is 5 here. </p>
<p>The code below is tuning-friendly: the main parameters are specified up front. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from keras.models import Sequential
from keras.layers import Dense, LSTM

# Data preparation
n_lags, n_steps, lead_time, test_size = 10, 5, 0, .2

# hyperparameters
epochs, batch_size, verbose = 50, 72, 0
model_params = {}

# preparing data
X_train, X_val, y_train, y_val = prepare_data(data, 'btc', n_lags, n_steps, lead_time, test_size)

# model architecture
vanilla = Sequential()
vanilla.add(LSTM(units=200, activation='relu', input_shape=(X_train.shape[1],X_train.shape[2]) ))
vanilla.add(Dense(n_steps))
</pre>
<p>The <code>model_params</code> <a href="https://blog.finxter.com/python-dictionary/" data-type="post" data-id="5232" target="_blank" rel="noreferrer noopener">dictionary</a> will be useful for passing additional parameters to the <code>fit</code> method, such as an <code>EarlyStopping</code> callback. </p>
<p>We also write a function that fits the model, then plots and assesses its predictions. The present code does not return anything, so feel free to change it in order to do so. We fix the optimizer as Adam and the loss as Mean Squared Error.</p>
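<p>The two scores reported in the plots are RMSE and R2. As a reference for how they are defined, here is a minimal NumPy sketch for a single output column with made-up numbers. It also shows that R2 is not symmetric in its arguments, so the true values have to go first.</p>

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # variation of the true values
    return 1.0 - ss_res / ss_tot

# Made-up values: three perfect predictions and one large miss
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 8.0]

print(rmse(y_true, y_pred))         # 2.0
print(round(r2(y_true, y_pred), 3)) # -2.2
print(round(r2(y_pred, y_true), 3)) # 0.448
```

<p>Swapping the arguments changes the denominator, so the same predictions can look far better or worse than they are.</p>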
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def fit_model(model, learning_rate=0.001, time_distributed=False, epochs=epochs, batch_size=batch_size, verbose=verbose): y_ind = y_val.index if time_distributed: y_train_0 = y_train.to_numpy().reshape((y_train.shape[0], y_train.shape[1],1)) y_val_0 = y_val.to_numpy().reshape((y_val.shape[0], y_val.shape[1],1)) else: y_train_0 = y_train y_val_0 = y_val # fit network from keras.optimizers import Adam adam = Adam(learning_rate=learning_rate) model.compile(loss='mse', optimizer='adam') history = model.fit(X_train, y_train_0, epochs=epochs, batch_size=batch_size, verbose=verbose, **model_params, validation_data=(X_val, y_val_0), shuffle=False) # make a prediction if time_distributed: predictions = model.predict(X_val)[:,:,0] else: predictions = model.predict(X_val) yhat = pd.DataFrame(predictions, index=y_ind, columns=[f'pred_lag_{i}' for i in range(-n_steps,0)]) yhat_shifted = pd.concat([yhat.iloc[:,i].shift(-n_steps+i) for i in range(len(yhat.columns))], axis=1) # calculate RMSE from sklearn.metrics import mean_squared_error, r2_score rmse = np.sqrt(mean_squared_error(y_val, yhat)) import matplotlib.pyplot as plt fig, (ax1,ax2) = plt.subplots(2,1,figsize=(14,14)) y_val.iloc[:,0].plot(ax=ax2,legend=True) yhat_shifted.plot(ax=ax2) ax2.set_title('Prediction comparison') ax2.annotate(f'RMSE: {rmse:.5f} \n R2 score: {r2_score(yhat,y_val):.5f}', xy=(.68,.93), xycoords='axes fraction') ax1.plot(history.history['loss'], label='train') ax1.plot(history.history['val_loss'], label='test') ax1.legend() plt.show()
</pre>
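<p>For reference, the two scores annotated on the plot can be computed by hand. The following is a minimal NumPy sketch with made-up numbers; the names <code>y_true</code> and <code>y_pred</code> are illustrative and not part of the article's code.</p>

```python
import numpy as np

# Made-up targets and predictions, only for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

# RMSE: square root of the mean squared error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R2: one minus the ratio of the residual sum of squares to the
# total sum of squares around the mean of the true values
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE: {rmse:.5f}, R2: {r2:.5f}")
```

<p>Unlike the RMSE, R2 is not symmetric in its two arguments, because the denominator uses only the true values. The argument order therefore matters when calling a library function for it.</p>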
<p>The <code>time_distributed</code> parameter will be used in the last two architectures. </p>
<p>I opted to expose <code>learning_rate</code> as a parameter because the Stacked LSTM once returned an array of NaNs. It turned out that gradient descent was not converging, and decreasing Adam’s learning rate fixed it. </p>
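<p>Why a step size that is too large can blow up is easiest to see on a toy problem. The sketch below runs plain gradient descent (not Adam, and not the LSTM) on a one-dimensional quadratic; the function name and the numbers are assumptions chosen for illustration.</p>

```python
def gradient_descent(learning_rate, steps=50, w0=1.0, curvature=10.0):
    """Minimize f(w) = 0.5 * curvature * w**2 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        grad = curvature * w          # derivative of f at w
        w = w - learning_rate * grad  # gradient step
    return w


# With learning_rate=0.05 each step multiplies w by 0.5, so w shrinks toward 0.
w_small = gradient_descent(learning_rate=0.05)
# With learning_rate=0.25 each step multiplies w by -1.5, so |w| keeps growing.
w_large = gradient_descent(learning_rate=0.25)

print(abs(w_small), abs(w_large))
```

<p>In a real network the same growth eventually overflows to infinities and NaNs, which is consistent with the symptom described above.</p>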
<p>Use <code>verbose=1</code> as a global parameter to debug your network.</p>
<p>Without further ado:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fit_model(vanilla)</pre>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="1019" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-117-1024x1019.png" alt="" class="wp-image-782356" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-117-1024x1019.png 1024w, https://blog.finxter.com/wp-content/uplo...00x298.png 300w, https://blog.finxter.com/wp-content/uplo...50x150.png 150w, https://blog.finxter.com/wp-content/uplo...68x764.png 768w, https://blog.finxter.com/wp-content/uplo...ge-117.png 1159w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>The performance is comparable to our XGBoost 1-day prediction in the <a href="https://blog.finxter.com/time-series-forecast-a-complete-workflow-part-ii/" target="_blank" rel="noreferrer noopener">last article</a>:</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="479" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-118-1024x479.png" alt="" class="wp-image-782359" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-118-1024x479.png 1024w, https://blog.finxter.com/wp-content/uplo...00x140.png 300w, https://blog.finxter.com/wp-content/uplo...68x359.png 768w, https://blog.finxter.com/wp-content/uplo...ge-118.png 1536w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>Moreover, we are predicting five days rather than one, which makes the R2 score more impressive. </p>
<p>What bothers me, on the other hand, is that the predictions for all five days look identical. Understanding why that happens requires further analysis, which we will not do here.</p>
<h2>How to Build a Stacked LSTM?</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="615" height="923" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-132.png" alt="" class="wp-image-782591" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-132.png 615w, https://blog.finxter.com/wp-content/uplo...00x300.png 200w" sizes="(max-width: 615px) 100vw, 615px" /></figure>
</div>
<p>We can also stack two LSTM layers. </p>
<p>To do so, the second LSTM layer must receive a 3D input, and that is the role of the <code>return_sequences</code> parameter: when it is set to <code>True</code>, the first layer returns its output for every time step instead of only the last one. We gain a slight increase in the training score in this case.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># model architecture
stacked = Sequential()
stacked.add(LSTM(100, activation='relu', return_sequences=True, input_shape=(X_train.shape[1],X_train.shape[2])))
stacked.add(LSTM(100, activation='relu'))
stacked.add(Dense(n_steps))

fit_model(stacked)
</pre>
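<p>The effect of <code>return_sequences</code> on shapes can be seen without Keras. The sketch below uses a toy tanh recurrence in NumPy as a stand-in for a recurrent layer; <code>simple_rnn</code> is a hypothetical helper, not Keras' LSTM.</p>

```python
import numpy as np


def simple_rnn(x, units, return_sequences=False, seed=0):
    """Toy tanh recurrence over x of shape (batch, timesteps, features)."""
    rng = np.random.default_rng(seed)
    batch, timesteps, features = x.shape
    w_in = rng.normal(scale=0.1, size=(features, units))
    w_rec = rng.normal(scale=0.1, size=(units, units))
    h = np.zeros((batch, units))
    states = []
    for t in range(timesteps):
        h = np.tanh(x[:, t, :] @ w_in + h @ w_rec)
        states.append(h)
    if return_sequences:
        return np.stack(states, axis=1)  # (batch, timesteps, units)
    return h                             # (batch, units)


x = np.random.default_rng(1).normal(size=(4, 10, 3))
seq = simple_rnn(x, units=8, return_sequences=True)
last = simple_rnn(x, units=8)
print(seq.shape, last.shape)

# The 3D output can feed a second recurrent layer; the 2D one cannot.
stacked_out = simple_rnn(seq, units=8)
print(stacked_out.shape)
```

<p>The 2D output is simply the last time slice of the 3D one, which is why only the first layer of a stack needs <code>return_sequences=True</code>.</p>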
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="1019" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-119-1024x1019.png" alt="" class="wp-image-782369" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-119-1024x1019.png 1024w, https://blog.finxter.com/wp-content/uplo...00x298.png 300w, https://blog.finxter.com/wp-content/uplo...50x150.png 150w, https://blog.finxter.com/wp-content/uplo...68x764.png 768w, https://blog.finxter.com/wp-content/uplo...ge-119.png 1159w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<h2>What is a Bidirectional LSTM Layer?</h2>
<p>In general, any RNN layer that meets a few minimal requirements can be made bidirectional through Keras’ <a rel="noreferrer noopener" href="https://keras.io/api/layers/recurrent_layers/bidirectional/" target="_blank">Bidirectional</a> layer. It runs two copies of your RNN layer side by side, one of which reads the sequence backward. </p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="766" height="406" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-120.png" alt="" class="wp-image-782378" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-120.png 766w, https://blog.finxter.com/wp-content/uplo...00x159.png 300w" sizes="(max-width: 766px) 100vw, 766px" /><figcaption>Image from <a href="https://analyticsindiamag.com/complete-guide-to-bidirectional-lstm-with-python-codes/">AIM</a>.</figcaption></figure>
</div>
<p>You can either specify the <code>backward_layer</code> as a second RNN layer or just wrap a single one, which will make the Bidirectional instance use a copy as the backward model. An implementation can be found below. </p>
<p>The score is comparable to the Stacked LSTM.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from keras.layers import Bidirectional

bilstm = Sequential()
bilstm.add(Bidirectional(LSTM(100, activation='relu'), input_shape=(X_train.shape[1], X_train.shape[2])))
bilstm.add(Dense(n_steps))

fit_model(bilstm)
</pre>
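<p>Conceptually, the wrapper runs one recurrence over the sequence as given and another over the time-reversed sequence, then merges the two results. The NumPy sketch below illustrates this with a toy tanh recurrence and concatenation (Keras' default <code>merge_mode</code> is <code>'concat'</code>); <code>run_rnn</code> and the weights are assumptions for illustration only.</p>

```python
import numpy as np


def run_rnn(x, w_in, w_rec):
    """Toy tanh recurrence; returns the final hidden state, shape (batch, units)."""
    h = np.zeros((x.shape[0], w_rec.shape[0]))
    for t in range(x.shape[1]):
        h = np.tanh(x[:, t, :] @ w_in + h @ w_rec)
    return h


rng = np.random.default_rng(0)
batch, timesteps, features, units = 4, 10, 3, 8
x = rng.normal(size=(batch, timesteps, features))

# Each direction has its own, independent set of weights
forward_weights = (rng.normal(scale=0.1, size=(features, units)),
                   rng.normal(scale=0.1, size=(units, units)))
backward_weights = (rng.normal(scale=0.1, size=(features, units)),
                    rng.normal(scale=0.1, size=(units, units)))

h_forward = run_rnn(x, *forward_weights)
h_backward = run_rnn(x[:, ::-1, :], *backward_weights)  # same data, reversed in time

merged = np.concatenate([h_forward, h_backward], axis=-1)
print(merged.shape)
```

<p>With concatenation the output has twice as many units as a single direction, which the following <code>Dense</code> layer absorbs without any change to the code.</p>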
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="1019" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-121-1024x1019.png" alt="" class="wp-image-782386" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-121-1024x1019.png 1024w, https://blog.finxter.com/wp-content/uplo...00x298.png 300w, https://blog.finxter.com/wp-content/uplo...50x150.png 150w, https://blog.finxter.com/wp-content/uplo...68x764.png 768w, https://blog.finxter.com/wp-content/uplo...ge-121.png 1159w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<h2>Encoder-Decoder LSTM</h2>
<p>An Encoder-Decoder structure is designed so that one network is dedicated to feature selection and a second one to the actual forecast. The two networks can be of different types; even pairing a recurrent network with a non-recurrent one is allowed. </p>
<p>Here we explore two pairs: LSTM-LSTM and CNN-LSTM. </p>
<p>Compared to the previously presented architectures, the main difference is the inclusion of the <a href="https://keras.io/api/layers/reshaping_layers/repeat_vector/" target="_blank" rel="noreferrer noopener"><code>RepeatVector</code></a> layer and the wrapper <code><a href="https://keras.io/api/layers/recurrent_layers/time_distributed/" target="_blank" rel="noreferrer noopener">TimeDistributed</a></code>. </p>
<p>Although <code>RepeatVector</code> fits in smoothly, the <code>TimeDistributed</code> layer needs some care. It wraps a layer object and applies that layer to every temporal slice of the input it receives. It treats <code>.shape[1]</code> of its input as the temporal dimension (our <code>prepare_data</code> is in accordance with that). </p>
<p>Moreover, one has to watch out because it outputs a 3D array. In particular, our model will output 3D predictions. </p>
<p>For this reason, we have to feed the model reshaped versions of <code>y_val</code> and <code>y_train</code> so that the loss function can be computed. Fortunately, we already included the <code>time_distributed</code> parameter in <code>fit_model</code> to deal with the reshaping. </p>
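<p>The shapes involved can be traced in plain NumPy. The sketch below mimics what <code>RepeatVector</code> and <code>TimeDistributed(Dense(...))</code> do to array shapes and why the targets need a third axis; the random arrays stand in for real layer outputs, and a single output unit is an assumption made to keep the example small.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
batch, units, n_steps = 4, 6, 5

# Stand-in for the encoder output: one vector per sample
encoded = rng.normal(size=(batch, units))

# RepeatVector(n_steps): copy that vector once per forecast step
repeated = np.repeat(encoded[:, np.newaxis, :], n_steps, axis=1)

# Stand-in for the decoder output with return_sequences=True
decoded = rng.normal(size=(batch, n_steps, units))

# TimeDistributed(Dense(1)): one shared weight matrix applied to every time slice
w, b = rng.normal(size=(units, 1)), np.zeros(1)
per_slice = np.stack([decoded[:, t, :] @ w + b for t in range(n_steps)], axis=1)
all_at_once = decoded @ w + b  # same result as a single batched product

# 2D targets gain a trailing axis so they line up with the 3D predictions
y = rng.normal(size=(batch, n_steps))
y_3d = y.reshape((batch, n_steps, 1))
mse = np.mean((all_at_once - y_3d) ** 2)

print(repeated.shape, all_at_once.shape, y_3d.shape)
```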
<p>We also increase the number of epochs since these networks seem to take longer to find a minimum. To avoid wasting time, we include an <code>EarlyStopping</code> callback. The result is already an astonishing score!</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from keras.layers import RepeatVector, TimeDistributed

# Data preparation
n_lags, n_steps, lead_time, test_size = 10, 5, 0, .2

# hyperparameters
epochs, batch_size, verbose = 300, 32, 0
model_params = {'callbacks': [EarlyStopping(monitor="val_loss", patience=20, mode="auto")]}

# preparing data
X_train, X_val, y_train, y_val = prepare_data(data, 'btc', n_lags, n_steps, lead_time, test_size, normalize=True)

# Encoder
lstmlstm = Sequential()
lstmlstm.add(LSTM(100, activation='relu', input_shape=(X_train.shape[1], X_train.shape[2])))
lstmlstm.add(RepeatVector(n_steps))

# Decoder
lstmlstm.add(LSTM(100, activation='relu', return_sequences=True))
lstmlstm.add(TimeDistributed(Dense(n_steps)))

fit_model(lstmlstm, time_distributed=True)
</pre>
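<p>The <code>patience</code> argument used above can be read as "stop once the monitored value has failed to improve for this many epochs in a row". The pure-Python sketch below illustrates that rule on a made-up loss history; it is a simplification that ignores options such as <code>min_delta</code>, and <code>early_stopping_epoch</code> is a hypothetical helper, not a Keras function.</p>

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0   # improvement: reset the counter
        else:
            wait += 1              # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.69, 0.70, 0.70, 0.70]
print(early_stopping_epoch(losses, patience=2))
print(early_stopping_epoch(losses, patience=3))
```

<p>A larger patience tolerates longer plateaus, which matters here because the loss curves of these networks can stall for a while before improving again.</p>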
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="1016" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-122-1024x1016.png" alt="" class="wp-image-782407" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-122-1024x1016.png 1024w, https://blog.finxter.com/wp-content/uplo...00x298.png 300w, https://blog.finxter.com/wp-content/uplo...50x150.png 150w, https://blog.finxter.com/wp-content/uplo...68x762.png 768w, https://blog.finxter.com/wp-content/uplo...ge-122.png 1162w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>This is the first time the outputs for the individual steps are visibly different from each other. </p>
<p>The predictions still seem to follow some trend. In theory, the neural network should be powerful enough to capture trends as well; in practice, however, detrending often gives better results. Nevertheless, 0.82 is a massive increase from the 0.32 of our XGBoost model. </p>
<h2>Encoder-Decoder CNN-LSTM Network</h2>
<p>The last architecture we present is the CNN-LSTM one. </p>
<p>Here a <a rel="noreferrer noopener" href="https://blog.finxter.com/classification-of-star-wars-lego-images-using-cnn-and-transfer-learning/" data-type="post" data-id="33108" target="_blank">Convolutional Neural Network</a> is used as a feature selector, a role in which it is well known to perform well for photos and videos. </p>
<p>The main reason CNNs are so useful in this case is mathematical: the convolutional part of the name refers to the <a rel="noreferrer noopener" href="https://en.wikipedia.org/wiki/Convolution" target="_blank">convolution operation in mathematics</a>, which is used to emphasize translation-invariant features. </p>
<p>That makes complete sense when you have a photo, since you want your mobile phone to recognize Toto as a dog regardless of whether it is in the lower-left corner or in the upper-center of the picture (of course your dog’s name is Toto, right?). You may recognize the CNN's action as the smoothed lines in the graph. </p>
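<p>The same idea carries over to one-dimensional series. The NumPy sketch below convolves a made-up filter with a signal containing a small bump, then with the same signal shifted in time; the filter values and the signal are assumptions for illustration.</p>

```python
import numpy as np

kernel = np.array([1.0, 2.0, 1.0])   # made-up filter

signal = np.zeros(12)
signal[2:5] = [1.0, 3.0, 1.0]        # a small bump near the start
shifted = np.roll(signal, 5)         # the same bump, five steps later

out = np.convolve(signal, kernel, mode="valid")
out_shifted = np.convolve(shifted, kernel, mode="valid")

# The response moves along with the pattern...
print(np.argmax(out), np.argmax(out_shifted))
# ...but its strength stays the same
print(out.max(), out_shifted.max())
```

<p>The filter responds equally strongly wherever the pattern occurs, and a pooling step on top of it then discards part of the information about where that was.</p>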
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from keras.layers import RepeatVector, TimeDistributed, Conv1D, MaxPooling1D, Flatten

# Data preparation
n_lags, n_steps, lead_time, test_size = 10, 5, 0, .2

# hyperparameters
epochs, batch_size, verbose = 300, 32, 0
model_params = {'callbacks': [EarlyStopping(monitor="val_loss", patience=20, mode="auto")]}

# preparing data
X_train, X_val, y_train, y_val = prepare_data(data, 'btc', n_lags, n_steps, lead_time, test_size)

# Encoder
cnn_lstm = Sequential()
cnn_lstm.add(Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(X_train.shape[1], X_train.shape[2])))
cnn_lstm.add(Conv1D(filters=64, kernel_size=3, activation='relu'))
cnn_lstm.add(MaxPooling1D(pool_size=2))
cnn_lstm.add(Flatten())
cnn_lstm.add(RepeatVector(n_steps))

# Decoder
cnn_lstm.add(LSTM(200, activation='relu', return_sequences=True))
cnn_lstm.add(TimeDistributed(Dense(100, activation='relu')))
cnn_lstm.add(TimeDistributed(Dense(n_steps)))

fit_model(cnn_lstm, time_distributed=True)
</pre>
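<p>It can help to follow how the time axis shrinks through the encoder. The sketch below does the arithmetic for the default settings of <code>Conv1D</code> and <code>MaxPooling1D</code> (no padding, convolution stride 1, pooling stride equal to the pool size), assuming the input has 10 time steps; the helper functions are hypothetical.</p>

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a 1D convolution with stride 1 and no padding."""
    return length - kernel_size + 1


def pool1d_out_len(length, pool_size):
    """Output length of non-overlapping pooling without padding."""
    return length // pool_size


timesteps = 10                                # assumption: 10 time steps in the input
after_conv1 = conv1d_out_len(timesteps, 3)    # first Conv1D
after_conv2 = conv1d_out_len(after_conv1, 3)  # second Conv1D
after_pool = pool1d_out_len(after_conv2, 2)   # MaxPooling1D
flattened = after_pool * 64                   # Flatten over the 64 filters

print(after_conv1, after_conv2, after_pool, flattened)
```

<p>Under that assumption, <code>RepeatVector(n_steps)</code> hands the decoder one copy of the flattened vector per forecast step.</p>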
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="1019" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-123-1024x1019.png" alt="" class="wp-image-782418" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-123-1024x1019.png 1024w, https://blog.finxter.com/wp-content/uplo...00x298.png 300w, https://blog.finxter.com/wp-content/uplo...50x150.png 150w, https://blog.finxter.com/wp-content/uplo...68x764.png 768w, https://blog.finxter.com/wp-content/uplo...ge-123.png 1159w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<h2>Extra Perks</h2>
<p>For the sake of completeness, we tweaked the code a bit. </p>
<p>Do you remember the seemingly significant correlation that popped up at the 20-day lag? Well, increasing from 10 to 20 time steps actually increases the R2 score of the last model:</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="1019" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-124-1024x1019.png" alt="" class="wp-image-782424" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-124-1024x1019.png 1024w, https://blog.finxter.com/wp-content/uplo...00x298.png 300w, https://blog.finxter.com/wp-content/uplo...50x150.png 150w, https://blog.finxter.com/wp-content/uplo...68x764.png 768w, https://blog.finxter.com/wp-content/uplo...ge-124.png 1159w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>Funnily enough, the score increases even more if you use <strong>unnormalized data</strong>, reaching a stellar ~0.94! </p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="1019" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-125-1024x1019.png" alt="" class="wp-image-782429" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-125-1024x1019.png 1024w, https://blog.finxter.com/wp-content/uplo...00x298.png 300w, https://blog.finxter.com/wp-content/uplo...50x150.png 150w, https://blog.finxter.com/wp-content/uplo...68x764.png 768w, https://blog.finxter.com/wp-content/uplo...ge-125.png 1159w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>The last thing worth mentioning is the choice of the <a href="https://blog.finxter.com/gradient-descent-in-neural-nets-a-simple-guide-to-ann-learning/" data-type="post" data-id="673142" target="_blank" rel="noreferrer noopener">activation function</a>. If you got the warning below and wonder why, the Keras <a href="https://keras.io/api/layers/recurrent_layers/lstm/" target="_blank" rel="noreferrer noopener">LSTM documentation</a> provides an answer.</p>
<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f6d1.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>WARNING</strong>: <code>tensorflow:Layer lstm_70</code> will not use cuDNN kernels since it doesn’t meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.</p>
<p>(No, I did not load 70 LSTM layers. I loaded around 210 <img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f635-200d-1f4ab.png" alt="??" class="wp-smiley" style="height: 1em; max-height: 1em;" />)</p>
<p>The documentation says:</p>
<p><em>“The requirements to use the cuDNN implementation are:</em></p>
<ol>
<li><em>activation == tanh</em></li>
<li><em>recurrent_activation == sigmoid</em></li>
<li><em>recurrent_dropout == 0</em></li>
<li><em>unroll is False</em></li>
<li><em>use_bias is True</em></li>
<li><em>Inputs, if use masking, are strictly right-padded.</em></li>
<li><em>Eager execution is enabled in the outermost context.”</em></li>
</ol>
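<p>The first five requirements coincide with the default values of the corresponding <code>LSTM</code> arguments, so they can be checked from the arguments alone. The helper below is a hypothetical sketch, not part of Keras, and it does not cover the last two requirements (masking and eager execution).</p>

```python
# Default values of the LSTM arguments named in the first five requirements
CUDNN_REQUIRED = {
    "activation": "tanh",
    "recurrent_activation": "sigmoid",
    "recurrent_dropout": 0.0,
    "unroll": False,
    "use_bias": True,
}


def unmet_cudnn_criteria(**lstm_kwargs):
    """Return the names of the arguments that rule out the cuDNN kernel."""
    config = {**CUDNN_REQUIRED, **lstm_kwargs}
    return [name for name, required in CUDNN_REQUIRED.items()
            if config[name] != required]


print(unmet_cudnn_criteria(activation="relu"))  # the setting used in this article
print(unmet_cudnn_criteria())                   # all defaults
```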
<p>Changing the activation to ‘<em>tanh</em>‘ is enough in our case to use the cuDNN kernels, and they are incredibly fast! However, tanh fits our problem poorly:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fit_model(cnn_lstm, time_distributed=True, learning_rate=1)</pre>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="1019" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-126-1024x1019.png" alt="" class="wp-image-782440" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-126-1024x1019.png 1024w, https://blog.finxter.com/wp-content/uplo...00x298.png 300w, https://blog.finxter.com/wp-content/uplo...50x150.png 150w, https://blog.finxter.com/wp-content/uplo...68x764.png 768w, https://blog.finxter.com/wp-content/uplo...ge-126.png 1159w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>(You read that right: the learning rate is 1000x larger than the default. Otherwise the loss curve does not even change.)</p>
<h2>Main Takeaways</h2>
<p>There are a few points we have to keep in mind about LSTM:</p>
<ul>
<li>The shape of their input </li>
<li>What are <em>time steps</em></li>
<li>The shape of the layer’s output, especially when using <code>return_sequences</code></li>
<li>Hyperparameter tuning is worth your time. For instance, the activation functions <em>relu</em> and <em>tanh</em> each have their own pros and cons.</li>
<li>There are different architectures to play with (and many more to come: we will deal with Attention blocks and Multi-headed networks soon). Consider using them. I’ve become especially fond of the Encoder-Decoders.</li>
</ul>
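<p>To make the first two points concrete, here is a minimal sketch of how a one-dimensional series becomes the 3D array an LSTM expects, with shape (samples, time steps, features). The helper <code>make_windows</code> is hypothetical and much simpler than the <code>prepare_data</code> function used in this article.</p>

```python
import numpy as np


def make_windows(series, n_lags, n_steps):
    """Hypothetical helper: slice a 1D series into input and target windows."""
    X, y = [], []
    for start in range(len(series) - n_lags - n_steps + 1):
        X.append(series[start:start + n_lags])                     # past values
        y.append(series[start + n_lags:start + n_lags + n_steps])  # values to forecast
    X = np.array(X)[..., np.newaxis]  # add the features axis: (samples, timesteps, 1)
    return X, np.array(y)


series = np.arange(30, dtype=float)  # made-up data
X, y = make_windows(series, n_lags=10, n_steps=5)
print(X.shape, y.shape)
```

<p>Each sample holds <code>n_lags</code> consecutive time steps of a single feature, and each target row holds the <code>n_steps</code> values that follow them.</p>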
<p>Feel free to use and edit the code here. </p>
<hr class="wp-block-separator has-alpha-channel-opacity"/>
</div>