10-03-2022, 07:46 AM
Python Time Series Forecast on Bitcoin Data (Part II)
<div>
<figure class="wp-block-embed-youtube wp-block-embed is-type-video is-provider-youtube"><a href="https://blog.finxter.com/time-series-forecast-a-complete-workflow-part-ii/"><img src="https://blog.finxter.com/wp-content/plugins/wp-youtube-lyte/lyteCache.php?origThumbUrl=https%3A%2F%2Fi.ytimg.com%2Fvi%2FQxHPH6Bn2Tc%2Fhqdefault.jpg" alt="YouTube Video"></a><figcaption></figcaption></figure>
<p>A time series is essentially tabular data with the special feature of having a time index. The common forecast task is <em>‘knowing the past (and sometimes the present), predict the future’</em>. This task, taken as a principle, shows up in several ways: in how you interpret your problem, in feature engineering, and in which forecast strategy you choose.</p>
<p>This is the second article in our series. In the <a href="https://blog.finxter.com/python-time-series-forecast-a-guided-example-on-bitcoin-price-data/" target="_blank" rel="noreferrer noopener">first article</a> we discussed how to create features out of a time series using lags and trends. Today we go in the opposite direction and treat trends as something you want subtracted from the data before modelling. </p>
<p>The reason is that <a href="https://blog.finxter.com/machine-learning-engineer-income-and-opportunity/" data-type="post" data-id="306050" target="_blank" rel="noreferrer noopener">Machine Learning</a> models work in different ways. Some handle subtractions well, others do not. </p>
<p>For example, for any feature you include in a <a href="https://blog.finxter.com/python-linear-regression-1-liner/" data-type="post" data-id="1920" target="_blank" rel="noreferrer noopener">Linear Regression</a>, the model will automatically work out whether to subtract it from the actual data or not. A <a href="https://blog.finxter.com/random-forest-classifier-made-simple/" data-type="post" data-id="2531" target="_blank" rel="noreferrer noopener">Tree Regressor</a> (and its variants) does not behave in the same way and will usually ignore a trend in the data. </p>
<p>Therefore, whenever using the latter type of model, one usually calls for a <em>hybrid model</em>: we first use a linear(ish) model to detect global and periodic patterns, and then apply a second Machine Learning model to infer more sophisticated behavior. </p>
<p>We use the <a href="https://blog.finxter.com/python-time-series-forecast-a-guided-example-on-bitcoin-price-data/" data-type="URL" data-id="https://blog.finxter.com/python-time-series-forecast-a-guided-example-on-bitcoin-price-data/" target="_blank" rel="noreferrer noopener">Bitcoin Sentiment Analysis</a> data we captured in the last article as a proof of concept.</p>
<p>The hybrid model part of this article is heavily based on <a href="https://www.kaggle.com/learn/time-series" target="_blank" rel="noreferrer noopener">Kaggle’s Time Series Crash Course</a>. However, we intend to automate the process and discuss the <code>DeterministicProcess</code> class in more depth.</p>
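<p>To make the hybrid idea concrete before touching the real data, here is a minimal sketch on a synthetic series. It is an illustration under stated assumptions, not the article's pipeline: the data is made up, and a lag-1 autoregression stands in for the tree-based second model so that the example only needs NumPy.</p>

```python
import numpy as np

# Synthetic series (an assumption for illustration): linear trend plus a wiggle.
t = np.arange(100, dtype=float)
y = 5.0 + 2.0 * t + 3.0 * np.sin(t / 4.0)

# Stage 1: a linear model captures the global trend.
A = np.column_stack([np.ones_like(t), t])
trend_coef, *_ = np.linalg.lstsq(A, y, rcond=None)
trend_fit = A @ trend_coef
residual = y - trend_fit

# Stage 2: a second model learns what is left over.
# A lag-1 autoregression stands in for the tree-based regressor here.
B = np.column_stack([np.ones(len(residual) - 1), residual[:-1]])
ar_coef, *_ = np.linalg.lstsq(B, residual[1:], rcond=None)
residual_fit = B @ ar_coef

# Hybrid prediction = trend + modelled residual (defined from the second point on).
hybrid_fit = trend_fit[1:] + residual_fit

mse_trend_only = np.mean((y[1:] - trend_fit[1:]) ** 2)
mse_hybrid = np.mean((y[1:] - hybrid_fit) ** 2)
print(mse_trend_only, mse_hybrid)
```

<p>The second stage only ever sees the detrended residual, which is the point of the hybrid setup: the model that cannot follow a trend never has to.</p>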
<h2>Trends, as something you don’t want to have</h2>
<p>(Or that you want subtracted from your data)</p>
<p>A streamlined way to deal with trends and seasonality is using, respectively, <a rel="noreferrer noopener" href="https://www.statsmodels.org/dev/generated/statsmodels.tsa.deterministic.DeterministicProcess.html" target="_blank"><code>DeterministicProcess</code></a> and <code><a href="https://www.statsmodels.org/dev/generated/statsmodels.tsa.deterministic.CalendarFourier.html" target="_blank" rel="noreferrer noopener">CalendarFourier</a></code> from <code><a href="https://blog.finxter.com/logistic-regression-scikit-learn-vs-statsmodels/" data-type="post" data-id="22984" target="_blank" rel="noreferrer noopener">statsmodels</a></code>. Let us start with the former. </p>
<p><code>DeterministicProcess</code> aims at creating features to be used in a Regression model to determine trend and periodicity. It takes your <code>DatetimeIndex</code> and a few other parameters and returns a DataFrame full of features for your ML model. </p>
<p>A typical instance of the class looks like the one below. We use the <code>sentic_mean</code> column to illustrate.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from statsmodels.tsa.deterministic import DeterministicProcess

# `dataset` is the DataFrame built in the first article
y = dataset['sentic_mean'].copy()

dp = DeterministicProcess(
    index=y.index,
    constant=True,  # bias term
    order=2         # linear and quadratic trend
)
X = dp.in_sample()
X
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="634" height="830" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-12.png" alt="" class="wp-image-746494" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-12.png 634w, https://blog.finxter.com/wp-content/uplo...29x300.png 229w" sizes="(max-width: 634px) 100vw, 634px" /></figure>
</div>
<p>We can use <code>X</code> and <code>y</code> as features and target to train a <code>LinearRegression</code> model. In this way, the <code>LinearRegression</code> will learn whatever characteristics from <code>y</code> can be inferred (in our case) solely out of: </p>
<ul>
<li>the number of elapsed time intervals (<code>trend</code> column); </li>
<li>the square of that number (<code>trend_squared</code>); and </li>
<li>a bias term (<code>const</code>). </li>
</ul>
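<p>The three features listed above are simple enough to build by hand, which may help demystify what the regression sees. The sketch below uses a hypothetical ten-day index and a made-up quadratic target, and fits it with NumPy's least squares instead of <code>LinearRegression</code>.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical daily index; the article's dataset is not needed here.
index = pd.date_range("2022-01-01", periods=10, freq="D")

# Hand-built equivalents of the const, trend and trend_squared columns.
trend = np.arange(1, len(index) + 1, dtype=float)
X = pd.DataFrame(
    {"const": 1.0, "trend": trend, "trend_squared": trend ** 2},
    index=index,
)

# A made-up target that is exactly quadratic in the elapsed time.
y = 4.0 + 0.5 * X["trend"] - 0.1 * X["trend_squared"]

# Ordinary least squares recovers the three coefficients.
coef, *_ = np.linalg.lstsq(X.to_numpy(), y.to_numpy(), rcond=None)
print(dict(zip(X.columns, coef.round(6))))
```

<p>Because the target here is an exact polynomial in the time counter, the fit is perfect; on real data the same columns only capture the smooth part of the series.</p>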
<p>Check out the result:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X, y)

predictions = pd.DataFrame(
    model.predict(X),
    index=X.index,
    columns=['Deterministic Curve']
)
</pre>
</p>
<p>Comparing predictions and actual values gives:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import matplotlib.pyplot as plt

plt.figure()
ax = plt.subplot()
y.plot(ax=ax, legend=True)
predictions.plot(ax=ax)
plt.show()
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="559" height="434" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-13.png" alt="" class="wp-image-746507" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-13.png 559w, https://blog.finxter.com/wp-content/uplo...00x233.png 300w" sizes="(max-width: 559px) 100vw, 559px" /></figure>
</div>
<p>Even the quadratic term seems negligible here. The <code>DeterministicProcess</code> class also helps us with future predictions since it carries a method that provides the appropriate future form of the chosen features. </p>
<p>Specifically, the <code>out_of_sample</code> method of <code>dp</code> takes the number of time intervals we want to predict as input and generates the needed features for you. </p>
<p>We use 60 days below as an example:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">X_out = dp.out_of_sample(60)

predictions_out = pd.DataFrame(
    model.predict(X_out),
    index=X_out.index,
    columns=['Future Predictions']
)

plt.figure()
ax = plt.subplot()
y.plot(ax=ax, legend=True)
predictions.plot(ax=ax)
predictions_out.plot(ax=ax, color='red')
plt.show()
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="559" height="434" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-14.png" alt="" class="wp-image-746529" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-14.png 559w, https://blog.finxter.com/wp-content/uplo...00x233.png 300w" sizes="(max-width: 559px) 100vw, 559px" /></figure>
</div>
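<p>For intuition, the future features are just the same columns continued past the end of the training sample. The helper below is a hand-rolled stand-in, not the <code>statsmodels</code> implementation; its name and the assumption of a regular daily index are mine.</p>

```python
import numpy as np
import pandas as pd

def future_trend_features(index: pd.DatetimeIndex, steps: int) -> pd.DataFrame:
    """Hand-rolled stand-in for out-of-sample trend features (illustrative only)."""
    n = len(index)
    # Continue the time counter where the training sample stopped.
    trend = np.arange(n + 1, n + steps + 1, dtype=float)
    # Extend the dates by `steps` periods, assuming a regular daily index.
    future_index = pd.date_range(
        index[-1] + pd.Timedelta(days=1), periods=steps, freq="D"
    )
    return pd.DataFrame(
        {"const": 1.0, "trend": trend, "trend_squared": trend ** 2},
        index=future_index,
    )

train_index = pd.date_range("2022-01-01", periods=5, freq="D")
X_out = future_trend_features(train_index, steps=3)
print(X_out)
```

<p>Since these columns depend only on the calendar, they are known for any future date, which is what makes them usable at prediction time.</p>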
<p>Let us repeat the process with <code>sentic_count</code> to get a feel for a higher-order trend. </p>
<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f44d.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>As a rule of thumb, the order should be one plus the total number of (trending) hills + peaks in the graph, but not much more than that.</strong> </p>
<p>We choose 3 for <code>sentic_count</code> and compare the output with the <code>order=2</code> result (we do not write the code twice, though).</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from statsmodels.tsa.deterministic import DeterministicProcess, CalendarFourier

y = dataset['sentic_count'].copy()

dp = DeterministicProcess(
    index=y.index,
    constant=True,
    order=3
)
X = dp.in_sample()

model = LinearRegression().fit(X, y)

predictions = pd.DataFrame(
    model.predict(X),
    index=X.index,
    columns=['Deterministic Curve']
)

X_out = dp.out_of_sample(60)

predictions_out = pd.DataFrame(
    model.predict(X_out),
    index=X_out.index,
    columns=['Future Predictions']
)

plt.figure()
ax = plt.subplot()
y.plot(ax=ax, legend=True)
predictions.plot(ax=ax)
predictions_out.plot(ax=ax, color='red')
plt.show()
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="570" height="429" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-15.png" alt="" class="wp-image-746535" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-15.png 570w, https://blog.finxter.com/wp-content/uplo...00x226.png 300w" sizes="(max-width: 570px) 100vw, 570px" /></figure>
</div>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="570" height="429" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-16.png" alt="" class="wp-image-746538" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-16.png 570w, https://blog.finxter.com/wp-content/uplo...00x226.png 300w" sizes="(max-width: 570px) 100vw, 570px" /></figure>
</div>
<p>Although the order-three polynomial fits the data better, use discretion in deciding whether the sentiment count will really decrease so drastically in the next 60 days. As a rule, trust short-term predictions more than long-term ones.</p>
<p><code>DeterministicProcess</code> accepts other parameters, making it a very interesting tool. Most of them are described below.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">dp = DeterministicProcess(
    # the DatetimeIndex of your data
    index,
    # in case the data shows some periodicity, include the size of the
    # periodic cycle here: 7 would mean 7 days in our case
    period: int or None,
    # includes a constant feature in the returned DataFrame, i.e., a feature
    # with the same value in every row. It is the equivalent of a bias term
    # in Linear Regression
    constant: bool,
    # order of the polynomial that you think best approximates your trend:
    # the simpler the better
    order: int,
    # make it True if you think the data has some periodicity. If you make it
    # True and do not specify the period, the dp will try to infer the period
    # from the index
    seasonal: bool,
    # we come back to this next
    additional_terms: tuple of statsmodels DeterministicTerms,
    # drops resulting features which are collinear with others. If you will
    # use a linear model, make it True
    drop: bool
)
</pre>
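<p>To give a feeling for what the <code>seasonal</code> and <code>period</code> options amount to, here is a rough hand-built sketch with <code>period=7</code>: one indicator column per position in the cycle, with one column dropped so the rest are not collinear with the constant. The column names and the two-week index are assumptions for illustration; the columns <code>statsmodels</code> actually returns are named differently.</p>

```python
import pandas as pd

# Hypothetical two weeks of daily data.
index = pd.date_range("2022-08-01", periods=14, freq="D")

# Position of each row inside a 7-step cycle, counted from the start of the sample.
position = pd.Series(range(len(index)), index=index) % 7

# One indicator column per position in the cycle.
dummies = pd.get_dummies(position, prefix="s").astype(float)

# Drop one column so the remaining ones are not collinear with a constant term.
seasonal = dummies.drop(columns="s_0")
print(seasonal.head(8))
```

<p>With indicator columns, a linear model learns one free offset per position in the cycle, which is why this option can produce a much more jagged fitted curve than a smooth polynomial or Fourier term.</p>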
<h2>Seasonality</h2>
<p>As a hardened mathematician, I find seasonality my favorite part because it deals with Fourier analysis (and wave functions are just… <a rel="noreferrer noopener" href="https://youtu.be/2awbKQ2DLRE?t=218" target="_blank">cool!</a>):</p>
<figure class="wp-block-embed-youtube wp-block-embed is-type-video is-provider-youtube"><a href="https://blog.finxter.com/time-series-forecast-a-complete-workflow-part-ii/"><img src="https://blog.finxter.com/wp-content/plugins/wp-youtube-lyte/lyteCache.php?origThumbUrl=https%3A%2F%2Fi.ytimg.com%2Fvi%2F2awbKQ2DLRE%2Fhqdefault.jpg" alt="YouTube Video"></a><figcaption></figcaption></figure>
<p>Do you remember your first ML course when you heard Linear Regression can fit arbitrary functions, not only lines? So, why not a wave function? We just did it for polynomials and didn’t even feel like it <img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f609.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p>In general, for any expression <code>f</code> which is a function of a feature or of your <code>DatetimeIndex</code>, you can create a feature column whose ith row is the value of <code>f</code> corresponding to the ith index. </p>
<p>Then linear regression finds the constant coefficient multiplying <code>f</code> that best fits your data. Again, this procedure works in general, not only with Datetime indexes – the <code>trend_squared</code> term above is an example of it.</p>
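<p>As a small check of that claim, the sketch below builds a sine and a cosine column for a weekly period and lets least squares recover the coefficients of a synthetic wave. The data and coefficients are invented for the example, and NumPy's <code>lstsq</code> plays the role of the linear regression.</p>

```python
import numpy as np

t = np.arange(56, dtype=float)  # eight weeks of daily steps
period = 7.0

# Features: a constant plus one sine and one cosine of the weekly period.
X = np.column_stack([
    np.ones_like(t),
    np.sin(2 * np.pi * t / period),
    np.cos(2 * np.pi * t / period),
])

# Synthetic target built from known coefficients.
true_coef = np.array([3.0, 2.0, 0.5])
y = X @ true_coef

# Least squares finds the coefficient multiplying each feature column.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef.round(6))
```

<p>The regression stays linear in the coefficients even though the features are waves, which is all that "linear" refers to here.</p>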
<p>For seasonality, we use a second amazing <code>statsmodels</code> class: <code><a href="https://www.statsmodels.org/dev/generated/statsmodels.tsa.deterministic.CalendarFourier.html" target="_blank" rel="noreferrer noopener">CalendarFourier</a></code>. It is another <code>statsmodels</code> <code>DeterministicTerm</code> class (i.e., it has the <code>in_sample</code> and <code>out_of_sample</code> methods) and is instantiated with two parameters, <code>freq</code> and <code>order</code>. </p>
<p>As <code>freq</code>, the class expects a string such as ‘D’, ‘W’, ‘M’ for day, week or month, respectively, or any of the quite comprehensive Pandas <a href="https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases" target="_blank" rel="noreferrer noopener">Datetime offset aliases</a>. </p>
<p>The <code>order</code> is the Fourier expansion order, which should be understood as the number of waves you expect within your chosen frequency (count the number of ups and downs; one wave is one up and one down).</p>
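<p>One way to picture the order is as the number of sine/cosine pairs built over the chosen calendar period. The helper below is a hypothetical, hand-written analogue for a yearly cycle; its name, column labels and the way it measures the position within the year are assumptions, not the <code>statsmodels</code> internals.</p>

```python
import numpy as np
import pandas as pd

def fourier_pairs(index: pd.DatetimeIndex, order: int) -> pd.DataFrame:
    """Build `order` sine/cosine pairs over a yearly cycle (illustrative helper)."""
    # Fraction of the year elapsed at each timestamp, in [0, 1).
    days_in_year = np.where(index.is_leap_year, 366, 365)
    frac = np.asarray((index.dayofyear - 1) / days_in_year, dtype=float)
    cols = {}
    for k in range(1, order + 1):
        # The k-th harmonic completes k full waves per year.
        cols[f"sin_{k}"] = np.sin(2 * np.pi * k * frac)
        cols[f"cos_{k}"] = np.cos(2 * np.pi * k * frac)
    return pd.DataFrame(cols, index=index)

index = pd.date_range("2022-01-01", periods=365, freq="D")
features = fourier_pairs(index, order=2)
print(features.columns.tolist())
```

<p>Each extra unit of order adds two columns, so a high order quickly becomes a flexible (and easily overfit) family of curves.</p>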
<p><code>CalendarFourier</code> integrates smoothly with <code>DeterministicProcess</code>: just include an instance of it in the list of <code>additional_terms</code>. </p>
<p>Here is the full code for <code>sentic_mean</code>:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from statsmodels.tsa.deterministic import DeterministicProcess, CalendarFourier
from sklearn.linear_model import LinearRegression

y = dataset['sentic_mean'].copy()

fourier = CalendarFourier(freq='A', order=2)

dp = DeterministicProcess(
    index=y.index,
    constant=True,
    order=2,
    seasonal=False,
    additional_terms=[fourier],
    drop=True
)
X = dp.in_sample()

model = LinearRegression().fit(X, y)

predictions = pd.DataFrame(
    model.predict(X),
    index=X.index,
    columns=['Prediction']
)

X_out = dp.out_of_sample(60)

predictions_out = pd.DataFrame(
    model.predict(X_out),
    index=X_out.index,
    columns=['Prediction']
)

plt.figure()
ax = plt.subplot()
y.plot(ax=ax, legend=True)
predictions.plot(ax=ax)
predictions_out.plot(ax=ax, color='red')
plt.show()
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="559" height="434" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-17.png" alt="" class="wp-image-746594" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-17.png 559w, https://blog.finxter.com/wp-content/uplo...00x233.png 300w" sizes="(max-width: 559px) 100vw, 559px" /></figure>
</div>
<p>If we take <code>seasonal=True</code> inside <code>DeterministicProcess</code>, we get a crisper line:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="559" height="434" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-18.png" alt="" class="wp-image-746597" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-18.png 559w, https://blog.finxter.com/wp-content/uplo...00x233.png 300w" sizes="(max-width: 559px) 100vw, 559px" /></figure>
</div>
<p>Including <code>ax.set_xlim(('2022-08-01', '2022-10-01'))</code> before <code>plt.show()</code> zooms the graph in:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="571" height="466" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-19.png" alt="" class="wp-image-746600" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-19.png 571w, https://blog.finxter.com/wp-content/uplo...00x245.png 300w" sizes="(max-width: 571px) 100vw, 571px" /></figure>
</div>
<p>Although I suggest using the <code>seasonal=True</code> parameter with care, it does find interesting patterns (with a huge RMSE, though).</p>
<p>For instance, look at this BTC percentage change zoomed chart:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="572" height="461" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-20.png" alt="" class="wp-image-746607" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-20.png 572w, https://blog.finxter.com/wp-content/uplo...00x242.png 300w" sizes="(max-width: 572px) 100vw, 572px" /></figure>
</div>
<p>Here <code>period</code> is set to 30 and <code>seasonal=True</code>. I also manually rescaled the predictions to make them more visible in the graph. Although the predictions are far from the truth, thinking as a trader, isn’t it impressive how many peaks and hills it gets right? At least for this zoomed month…</p>
<p>To keep the workflow promise, I prepared a function that does everything so far in one shot:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def deseasonalize(df: pd.Series, season_freq='A', fourier_order=0,
                  constant=True, dp_order=1, dp_drop=True,
                  model=LinearRegression(), fourier=None, dp=None,
                  **DeterministicProcesskwargs)->(pd.Series, plt.Axes, DeterministicProcess):
    """
    Returns a deseasonalized and detrended df, a seasonal plot,
    and the fitted DeterministicProcess instance.
    """
    if fourier is None:
        fourier = CalendarFourier(freq=season_freq, order=fourier_order)
    if dp is None:
        dp = DeterministicProcess(
            index=df.index,
            constant=constant,  # use the argument instead of a hard-coded True
            order=dp_order,
            additional_terms=[fourier],
            drop=dp_drop,
            **DeterministicProcesskwargs
        )
    X = dp.in_sample()
    model = model.fit(X, df)  # fit the model that was passed in
    y_pred = pd.Series(
        model.predict(X),
        index=X.index,
        name=df.name+'_pred'
    )

    # Plot the input series against the fitted trend/seasonal curve
    ax = plt.subplot()
    df.plot(ax=ax, legend=True)
    y_pred.plot(ax=ax, legend=True)

    y_deseason = df - y_pred
    y_deseason.name = df.name +'_deseasoned'
    return y_deseason, ax, dp

# The sentic_mean analysis reduces to:
y_deseason, ax, dp = deseasonalize(
    y,
    season_freq='A',
    fourier_order=2,
    constant=True,
    dp_order=2,
    dp_drop=True,
    model=LinearRegression()
)
</pre>
<h2>Cycles and Hybrid Models</h2>
<p>Let us move on to a complete Machine Learning prediction. We use <code>XGBRegressor</code> and compare its performance across three setups: </p>
<ol>
<li>Predict <code>sentic_mean</code> directly using lags;</li>
<li>Same prediction adding the seasonal/trending with a <code>DeterministicProcess</code>;</li>
<li>A hybrid model, using <code>LinearRegression</code> to infer and remove seasons/trends, and then applying an <code>XGBRegressor</code>.</li>
</ol>
<p>The first part will be the bulkiest, since the other two follow from simple modifications to the resulting code. </p>
<h3>Preparing the data</h3>
<p>Before any analysis, we split the data into train and test sets. Since we are dealing with time series, this means we set the ‘present date’ at a point in the past and try to predict its respective ‘future’. Here we pick 22 days in the past.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">s = dataset['sentic_mean']
s_train = s[:'2022-09-01']
</pre>
<p>We made this first split in order to not leak data while doing any analysis. </p>
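<p>One detail worth keeping in mind with this kind of split is that pandas label slicing on a <code>DatetimeIndex</code> includes the end date. The sketch below shows this on a made-up two-week series standing in for the real data.</p>

```python
import pandas as pd

# Hypothetical stand-in for the sentiment series.
index = pd.date_range("2022-08-25", periods=14, freq="D")
s = pd.Series(range(14), index=index, name="sentic_mean")

cutoff = "2022-09-01"
s_train = s.loc[:cutoff]                       # label slicing includes the cutoff date
s_future = s[s.index > pd.Timestamp(cutoff)]   # strictly after the cutoff

print(s_train.index[-1].date(), s_future.index[0].date())
```

<p>Building the held-out part with a strict inequality keeps the two pieces disjoint, so nothing from the ‘future’ slips into the analysis.</p>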
<p>Next, we prepare target and feature sets. Recall that our SentiCrypto data was set to be available every day at 8 AM. Imagine we are making the prediction at 9 AM. </p>
<p>In this case, anything up to the present date (the ‘<code>lag_0</code>‘) can be used as a feature, and our target is <code>s_train</code>‘s first lead (which we define as a -1 lag). To choose other lags as features, we examine the partial autocorrelation plot from <code>statsmodels</code>:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(s_train, lags=20)
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="568" height="433" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-21.png" alt="" class="wp-image-746619" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-21.png 568w, https://blog.finxter.com/wp-content/uplo...00x229.png 300w" sizes="(max-width: 568px) 100vw, 568px" /></figure>
</div>
<p>We use the first four for <code>sentic_mean</code> and the first seven + the 11th for <code>sentic_count</code> (you can easily test different combinations with the code below.) </p>
<p>Now that we have finished choosing features, we go back to the full series for engineering. We apply to <code>s_mean</code> and <code>s_count</code> the <code>make_lags</code> function we defined in the last article (which we transcribe here for convenience). </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def make_lags(df, n_lags=1, lead_time=1):
    """
    Compute lags of a pandas.Series from lead_time to lead_time + n_lags.
    Alternatively, a list can be passed as n_lags.
    Returns a pd.DataFrame whose ith column is either the i+lead_time lag
    or the ith element of n_lags.
    """
    if isinstance(n_lags, int):
        lag_list = list(range(lead_time, n_lags+lead_time))
    else:
        lag_list = n_lags
    lags = {
        f'{df.name}_lag_{i}': df.shift(i) for i in lag_list
    }
    return pd.concat(lags, axis=1)

X = make_lags(s, [0,1,2,3,4])
y = make_lags(s, [-1])

display(X)
y
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="524" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-22-1024x524.png" alt="" class="wp-image-746635" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-22-1024x524.png 1024w, https://blog.finxter.com/wp-content/uplo...00x153.png 300w, https://blog.finxter.com/wp-content/uplo...68x393.png 768w, https://blog.finxter.com/wp-content/uplo...36x785.png 1536w, https://blog.finxter.com/wp-content/uplo...age-22.png 1600w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="524" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-23-1024x524.png" alt="" class="wp-image-746639" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-23-1024x524.png 1024w, https://blog.finxter.com/wp-content/uplo...00x153.png 300w, https://blog.finxter.com/wp-content/uplo...68x393.png 768w, https://blog.finxter.com/wp-content/uplo...36x785.png 1536w, https://blog.finxter.com/wp-content/uplo...age-23.png 1600w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
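<p>The lag and lead convention is easiest to see on a tiny series. The four values below are invented; the column names mimic the ones produced by <code>make_lags</code>.</p>

```python
import pandas as pd

s = pd.Series(
    [10.0, 11.0, 12.0, 13.0],
    index=pd.date_range("2022-09-01", periods=4, freq="D"),
    name="x",
)

frame = pd.DataFrame({
    "x_lag_0": s,             # today's value
    "x_lag_1": s.shift(1),    # yesterday's value, seen from today
    "x_lag_-1": s.shift(-1),  # tomorrow's value: the prediction target
})
print(frame)
```

<p>Positive lags leave missing values at the start of the frame, while the -1 lag leaves one at the end, since the last day has no known ‘tomorrow’ yet.</p>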
<p>Now a train-test split with <code>sklearn</code> is convenient (notice the <code>shuffle=False</code> parameter, which is key for time series):</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=22, shuffle=False)
X_train
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="536" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-24-1024x536.png" alt="" class="wp-image-746647" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-24-1024x536.png 1024w, https://blog.finxter.com/wp-content/uplo...00x157.png 300w, https://blog.finxter.com/wp-content/uplo...68x402.png 768w, https://blog.finxter.com/wp-content/uplo...36x805.png 1536w, https://blog.finxter.com/wp-content/uplo...age-24.png 1596w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>(Observe that the final date is set correctly, in accordance with our analysis’ split.)</p>
<p> Applying the regressor:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from xgboost import XGBRegressor
from sklearn.metrics import r2_score

xgb = XGBRegressor(n_estimators=50)
xgb.fit(X_train, y_train)

predictions_train = pd.DataFrame(
    xgb.predict(X_train),
    index=X_train.index,
    columns=['Prediction']
)
predictions_test = pd.DataFrame(
    xgb.predict(X_test),
    index=X_test.index,
    columns=['Prediction']
)

print(f'R2 train score: {r2_score(y_train[:-1],predictions_train[:-1])}')

plt.figure()
ax = plt.subplot()
y_train.plot(ax=ax, legend=True)
predictions_train.plot(ax=ax)
plt.show()

plt.figure()
ax = plt.subplot()
y_test.plot(ax=ax, legend=True)
predictions_test.plot(ax=ax)
plt.show()

print(f'R2 test score: {r2_score(y_test[:-1],predictions_test[:-1])}')
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="819" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-25-1024x819.png" alt="" class="wp-image-746650" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-25-1024x819.png 1024w, https://blog.finxter.com/wp-content/uplo...00x240.png 300w, https://blog.finxter.com/wp-content/uplo...68x614.png 768w, https://blog.finxter.com/wp-content/uplo...age-25.png 1318w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="868" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-26-1024x868.png" alt="" class="wp-image-746653" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-26-1024x868.png 1024w, https://blog.finxter.com/wp-content/uplo...00x254.png 300w, https://blog.finxter.com/wp-content/uplo...68x651.png 768w, https://blog.finxter.com/wp-content/uplo...age-26.png 1302w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>You can reduce overfitting by reducing the number of estimators, but the R2 test score remains negative.</p>
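<p>A negative R2 may look odd at first, so here is a short sketch of the standard definition, 1 - SS_res / SS_tot, written by hand on toy numbers. The function name and the data are mine; the formula is the usual coefficient of determination.</p>

```python
import numpy as np

def r2(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
mean_baseline = [2.5] * 4               # always predict the mean
reversed_guess = [4.0, 3.0, 2.0, 1.0]   # systematically wrong

print(r2(y_true, mean_baseline), r2(y_true, reversed_guess))
```

<p>A score of zero matches the ‘always predict the mean’ baseline, so a negative test score means the model does worse on unseen data than that constant guess.</p>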
<p>We can replicate the process for <code>sentic_count</code> (or whatever you want). Below is a function to automate it.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from statsmodels.tsa.stattools import pacf

def apply_univariate_prediction(series, test_size, to_predict=1, nlags=20,
                                minimal_pacf=0.1,
                                model=XGBRegressor(n_estimators=50)):
    '''
    Starting from series, breaks it in train and test subsets;
    chooses which lags to use based on pacf > minimal_pacf;
    and applies the given sklearn-type model.
    Returns the resulting features and targets and the trained model.
    It plots the graph of the training and prediction, together with their r2_score.
    '''
    s = series.iloc[:-test_size]
    if isinstance(to_predict, int):
        to_predict = [to_predict]

    s_pacf = pd.Series(pacf(s, nlags=nlags))
    column_list = s_pacf[s_pacf>minimal_pacf].index

    X = make_lags(series, n_lags=column_list).dropna()
    y = make_lags(series, n_lags=[-x for x in to_predict]).loc[X.index]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, shuffle=False)

    model.fit(X_train, y_train)

    predictions_train = pd.DataFrame(
        model.predict(X_train),
        index=X_train.index,
        columns=['Train Predictions']
    )
    predictions_test = pd.DataFrame(
        model.predict(X_test),
        index=X_test.index,
        columns=['Test Predictions']
    )

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14,5), sharey=True)
    y_train.plot(ax=ax1, legend=True)
    predictions_train.plot(ax=ax1)
    ax1.set_title('Train Predictions')
    y_test.plot(ax=ax2, legend=True)
    predictions_test.plot(ax=ax2)
    ax2.set_title('Test Predictions')
    plt.show()

    print(f'R2 train score: {r2_score(y_train[:-1],predictions_train[:-1])}')
    print(f'R2 test score: {r2_score(y_test[:-1],predictions_test[:-1])}')

    return X, y, model

apply_univariate_prediction(dataset['sentic_count'], 22)
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="472" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-27-1024x472.png" alt="" class="wp-image-746672" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-27-1024x472.png 1024w, https://blog.finxter.com/wp-content/uplo...00x138.png 300w, https://blog.finxter.com/wp-content/uplo...68x354.png 768w, https://blog.finxter.com/wp-content/uplo...36x708.png 1536w, https://blog.finxter.com/wp-content/uplo...age-27.png 1600w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">apply_univariate_prediction(dataset['BTC-USD'], 22)</pre>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="479" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-28-1024x479.png" alt="" class="wp-image-746679" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-28-1024x479.png 1024w, https://blog.finxter.com/wp-content/uplo...00x140.png 300w, https://blog.finxter.com/wp-content/uplo...68x359.png 768w, https://blog.finxter.com/wp-content/uplo...36x718.png 1536w, https://blog.finxter.com/wp-content/uplo...age-28.png 1600w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<h2>Predicting with Seasons</h2>
<p>Since the features created by <code>DeterministicProcess</code> are only time-dependent, we can add them harmlessly to the feature DataFrame we automatically get from our univariate predictions. </p>
<p>The predictions, though, are still univariate. We use the <code>deseasonalize</code> function to obtain the season features. The data preparation is as follows:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">s = dataset['sentic_mean']

X, y, _ = apply_univariate_prediction(s,22);

s_deseason, _, dp = deseasonalize(s, season_freq='A', fourier_order=2, constant=True, dp_order=2, dp_drop=True, model=LinearRegression() );

X_f = dp.in_sample().shift(-1)
X = pd.concat([X,X_f], axis=1, join='inner').dropna()
</pre>
<p>With a bit of copy and paste, we arrive at:</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="477" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-29-1024x477.png" alt="" class="wp-image-746689" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-29-1024x477.png 1024w, https://blog.finxter.com/wp-content/uplo...00x140.png 300w, https://blog.finxter.com/wp-content/uplo...68x358.png 768w, https://blog.finxter.com/wp-content/uplo...36x715.png 1536w, https://blog.finxter.com/wp-content/uplo...age-29.png 1600w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>And we actually perform way worse! <img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f631.png" alt="😱" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<h2>Deseasonalizing</h2>
<p>Nevertheless, the right-hand graphic illustrates the model's inability to grasp trends. Our last shot is a hybrid model. </p>
<p>Here we follow three steps:</p>
<ol>
<li>We use the <code>LinearRegression</code> to capture the seasons and trends, rendering the series <code>y_s</code>. Then we acquire a deseasonalized target <code>y_ds = y-y_s</code>;</li>
<li>Train an <code>XGBRegressor</code> on <code>y_ds</code> and the lagged features, resulting in deseasonalized predictions <code>y_pred</code>;</li>
<li>Finally, we incorporate <code>y_s</code> back to <code>y_pred</code> to compare the final result.</li>
</ol>
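<p>The three steps can be sketched on synthetic data as follows. This is a minimal illustration under stated assumptions, not the function used in this article: the series is made up, a single lag is used as a feature, and scikit-learn's <code>DecisionTreeRegressor</code> is swapped in for <code>XGBRegressor</code> to keep the example light.</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 200
t = np.arange(n, dtype=float)

# Synthetic series: a linear trend plus a component that depends on its own past.
noise = rng.normal(scale=0.5, size=n)
cycle = np.zeros(n)
for i in range(1, n):
    cycle[i] = 0.7 * cycle[i - 1] + noise[i]
y = 0.05 * t + cycle

test_size = 22
train = slice(0, n - test_size)

# Step 1: the linear model captures the trend; y_s is its fitted and extrapolated curve.
T = t.reshape(-1, 1)
linear = LinearRegression().fit(T[train], y[train])
y_s = linear.predict(T)
y_ds = y - y_s                      # detrended target

# Step 2: a tree model learns the detrended series from its previous value.
X_lag = y_ds[:-1].reshape(-1, 1)    # value at time i - 1
target = y_ds[1:]                   # value at time i
n_train = n - test_size - 1
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X_lag[:n_train], target[:n_train])
y_pred_ds = tree.predict(X_lag)

# Step 3: add the trend back to obtain predictions on the original scale.
y_pred = y_pred_ds + y_s[1:]
```

<p>Because the trend features depend only on time, <code>y_s</code> can be computed for the test period without looking at test targets; only the second model needs lagged values.</p>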
<p>Although Bitcoin-related data are hard to predict, there was a huge improvement in the <code>r2_score</code> (finally something positive!). The function used is defined below.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">get_hybrid_univariate_prediction(dataset['sentic_mean'], 22, season_freq='A', fourier_order=2, constant=True, dp_order=2, dp_drop=True, model1=LinearRegression(), fourier=None, is_seasonal=True, season_period=7, dp=None, to_predict=1, nlags=20, minimal_pacf=0.1, model2=XGBRegressor(n_estimators=50) )
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="475" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-30-1024x475.png" alt="" class="wp-image-746703" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-30-1024x475.png 1024w, https://blog.finxter.com/wp-content/uplo...00x139.png 300w, https://blog.finxter.com/wp-content/uplo...68x356.png 768w, https://blog.finxter.com/wp-content/uplo...36x712.png 1536w, https://blog.finxter.com/wp-content/uplo...age-30.png 1600w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>Instead of going through every detail, we will also automate this code. To get the code running smoothly, we revisit the <code>deseasonalize</code> and <code>apply_univariate_prediction</code> functions and remove their plotting parts. </p>
<p>The final function only plots graphs and returns nothing. It intends to give you a baseline for a hybrid model score. Change the function at will to make it return whatever you need.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def get_season(series: pd.Series, test_size, season_freq='A', fourier_order=0, constant=True, dp_order=1, dp_drop=True, model1=LinearRegression(), fourier=None, is_seasonal=False, season_period=None, dp=None):
    """
    Decompose series in a deseasonalized and a seasonal part.
    The parameters are relative to the fourier and DeterministicProcess used.
    Returns y_ds and y_s.
    """
    se = series.iloc[:-test_size]
    if fourier is None:
        fourier = CalendarFourier(freq=season_freq, order=fourier_order)
    if dp is None:
        dp = DeterministicProcess(
            index=se.index,
            constant=constant,
            order=dp_order,
            additional_terms=[fourier],
            drop=dp_drop,
            seasonal=is_seasonal,
            period=season_period
        )
    X_in = dp.in_sample()
    X_out = dp.out_of_sample(test_size)
    # fit on the training part only, then predict over train + test dates
    model1 = model1.fit(X_in, se)
    X = pd.concat([X_in,X_out],axis=0)
    y_s = pd.Series(
        model1.predict(X),
        index=X.index,
        name=series.name
    )
    y_ds = series - y_s
    y_ds.name = series.name +'_deseasoned'
    return y_ds, y_s

def prepare_data(series, test_size, to_predict=1, nlags=20, minimal_pacf=0.1):
    '''
    Creates a feature dataframe by making lags and a target series by a negative to_predict-shift.
    Returns X, y.
    '''
    s = series.iloc[:-test_size]
    if isinstance(to_predict,int):
        to_predict = [to_predict]
    from statsmodels.tsa.stattools import pacf
    s_pacf = pd.Series(pacf(s,nlags=nlags))
    column_list = s_pacf[s_pacf>minimal_pacf].index
    X = make_lags(series, n_lags=column_list).dropna()
    y = make_lags(series,n_lags=[-x for x in to_predict]).loc[X.index].squeeze()
    return X, y

def get_hybrid_univariate_prediction(series: pd.Series, test_size, season_freq='A', fourier_order=0, constant=True, dp_order=1, dp_drop=True, model1=LinearRegression(), fourier=None, is_seasonal=False, season_period=None, dp=None, to_predict=1, nlags=20, minimal_pacf=0.1, model2=XGBRegressor(n_estimators=50) ):
    """
    Apply the hybrid model method by deseasonalizing/detrending a time series with model1 and investigating the resulting series with model2.
    It plots the respective graphs and computes r2_scores.
    """
    y_ds, y_s = get_season(series, test_size, season_freq=season_freq, fourier_order=fourier_order, constant=constant, dp_order=dp_order, dp_drop=dp_drop, model1=model1, fourier=fourier, dp=dp, is_seasonal=is_seasonal, season_period=season_period)
    # forward the lag-selection settings to prepare_data
    X, y_ds = prepare_data(y_ds, test_size=test_size, to_predict=to_predict, nlags=nlags, minimal_pacf=minimal_pacf)
    X_train, X_test, y_train, y_test = train_test_split(X, y_ds, test_size=test_size, shuffle=False)
    y = y_s.squeeze() + y_ds.squeeze()
    model2 = model2.fit(X_train,y_train)
    # add the seasonal/trend part back to the deseasonalized predictions
    predictions_train = pd.Series(
        model2.predict(X_train),
        index=X_train.index,
        name='Prediction'
    )+y_s[X_train.index]
    predictions_test = pd.Series(
        model2.predict(X_test),
        index=X_test.index,
        name='Prediction'
    )+y_s[X_test.index]
    fig, (ax1,ax2) = plt.subplots(1,2, figsize=(14,5), sharey=True)
    y_train_ps = y.loc[y_train.index]
    y_test_ps = y.loc[y_test.index]
    y_train_ps.plot(ax=ax1, legend=True)
    predictions_train.plot(ax=ax1)
    ax1.set_title('Train Predictions')
    y_test_ps.plot(ax=ax2, legend=True)
    predictions_test.plot(ax=ax2)
    ax2.set_title('Test Predictions')
    plt.show()
    print(f'R2 train score: {r2_score(y_train_ps[:-to_predict],predictions_train[:-to_predict])}')
    print(f'R2 test score: {r2_score(y_test_ps[:-to_predict],predictions_test[:-to_predict])}')
</pre>
<p><strong>A note of warning:</strong> if you do not expect your data to follow time patterns, do focus on cycles! The hybrid model works well for many tasks, but it actually decreases the R2 score of our previous Bitcoin prediction:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">get_hybrid_univariate_prediction(dataset['BTC-USD'], 22, season_freq='A', fourier_order=4, constant=True, dp_order=5, dp_drop=True, model1=LinearRegression(), fourier=None, is_seasonal=True, season_period=30, dp=None, to_predict=1, nlags=20, minimal_pacf=0.05, model2=XGBRegressor(n_estimators=20) )
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="474" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-31-1024x474.png" alt="" class="wp-image-746717" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-31-1024x474.png 1024w, https://blog.finxter.com/wp-content/uplo...00x139.png 300w, https://blog.finxter.com/wp-content/uplo...68x356.png 768w, https://blog.finxter.com/wp-content/uplo...36x711.png 1536w, https://blog.finxter.com/wp-content/uplo...age-31.png 1600w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>The former score was around 0.31.</p>
<h2>Conclusion</h2>
<p>This article aims at presenting functions for your time series workflow, especially for lags and deseasonalization. Use them with care, though: apply them to get baseline scores before delving into more sophisticated models.</p>
<p>In future articles we will bring forth multi-step predictions (predict more than one day ahead) and compare performance of different models, both univariate and multivariate.</p>
<hr class="wp-block-separator has-alpha-channel-opacity"/>
</div>
https://www.sickgaming.net/blog/2022/10/...a-part-ii/
<div>
<figure class="wp-block-embed-youtube wp-block-embed is-type-video is-provider-youtube"><a href="https://blog.finxter.com/time-series-forecast-a-complete-workflow-part-ii/"><img src="https://blog.finxter.com/wp-content/plugins/wp-youtube-lyte/lyteCache.php?origThumbUrl=https%3A%2F%2Fi.ytimg.com%2Fvi%2FQxHPH6Bn2Tc%2Fhqdefault.jpg" alt="YouTube Video"></a><figcaption></figcaption></figure>
<p>A Time Series is essentially tabular data with the special feature of having a time index. The common forecast task is <em>‘knowing the past (and sometimes the present), predict the future’</em>. This task, taken as a principle, reveals itself in several ways: in how to interpret your problem, in feature engineering and in which forecast strategy to take.</p>
<p>This is the second article in our series. In the <a href="https://blog.finxter.com/python-time-series-forecast-a-guided-example-on-bitcoin-price-data/" target="_blank" rel="noreferrer noopener">first article</a> we discussed how to create features out of a time series using lags and trends. Today we follow the opposite direction by highlighting trends as something you want directly deducted from your model. </p>
<p>The reason is that <a href="https://blog.finxter.com/machine-learning-engineer-income-and-opportunity/" data-type="post" data-id="306050" target="_blank" rel="noreferrer noopener">Machine Learning</a> models work in different ways. Some are good with subtractions, others are not. </p>
<p>For example, for any feature you include in a <a href="https://blog.finxter.com/python-linear-regression-1-liner/" data-type="post" data-id="1920" target="_blank" rel="noreferrer noopener">Linear Regression</a>, the model will automatically detect whether to deduct it from the actual data or not. A <a href="https://blog.finxter.com/random-forest-classifier-made-simple/" data-type="post" data-id="2531" target="_blank" rel="noreferrer noopener">Tree Regressor</a> (and its variants) does not behave in the same way: it cannot extrapolate beyond the target values seen in training, so it usually ignores a trend in the data. </p>
<p>Therefore, whenever using the latter type of models, one usually calls for a <em>hybrid model</em>, meaning, we use a Linear(ish) first model to detect global periodic patterns and then apply a second Machine Learning model to infer more sophisticated behavior. </p>
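<p>The difference between the two model families is easy to see on a toy example. The sketch below uses a made-up noiseless trend and scikit-learn's <code>DecisionTreeRegressor</code> as a representative tree model (an assumption; the article itself works with <code>XGBRegressor</code>).</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# A noiseless upward trend: y = 2 * t, observed for t = 0..99.
t_train = np.arange(100, dtype=float).reshape(-1, 1)
y_train = 2.0 * t_train.ravel()

# "Future" time steps the models never saw.
t_future = np.arange(100, 110, dtype=float).reshape(-1, 1)

linear = LinearRegression().fit(t_train, y_train)
tree = DecisionTreeRegressor(random_state=0).fit(t_train, y_train)

linear_future = linear.predict(t_future)  # keeps climbing along the line
tree_future = tree.predict(t_future)      # stuck at the last training level (198.0)
```

<p>The tree can only return values it stored in its leaves, so every future step gets the last training value. Removing the trend with a linear model first leaves the tree with a target it can actually handle.</p>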
<p>We use the <a href="https://blog.finxter.com/python-time-series-forecast-a-guided-example-on-bitcoin-price-data/" data-type="URL" data-id="https://blog.finxter.com/python-time-series-forecast-a-guided-example-on-bitcoin-price-data/" target="_blank" rel="noreferrer noopener">Bitcoin Sentiment Analysis</a> data we captured in the last article as a proof of concept.</p>
<p>The hybrid model part of this article is heavily based on <a href="https://www.kaggle.com/learn/time-series" target="_blank" rel="noreferrer noopener">Kaggle’s Time Series Crash Course</a>. However, we intend to automate the process and discuss the <code>DeterministicProcess</code> class in more depth.</p>
<h2>Trends, as something you don’t want to have</h2>
<p>(Or that you want it deducted from your model)</p>
<p>A streamlined way to deal with trends and seasonality is using, respectively, <a rel="noreferrer noopener" href="https://www.statsmodels.org/dev/generated/statsmodels.tsa.deterministic.DeterministicProcess.html" target="_blank"><code>DeterministicProcess</code></a> and <code><a href="https://www.statsmodels.org/dev/generated/statsmodels.tsa.deterministic.CalendarFourier.html" target="_blank" rel="noreferrer noopener">CalendarFourier</a></code> from <code><a href="https://blog.finxter.com/logistic-regression-scikit-learn-vs-statsmodels/" data-type="post" data-id="22984" target="_blank" rel="noreferrer noopener">statsmodels</a></code>. Let us start with the former. </p>
<p><code>DeterministicProcess</code> aims at creating features to be used in a Regression model to determine trend and periodicity. It takes your <code>DatetimeIndex</code> and a few other parameters and returns a DataFrame full of features for your ML model. </p>
<p>A usual instance of the class will read like the one below. We use the <code>sentic_mean</code> column to illustrate.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from statsmodels.tsa.deterministic import DeterministicProcess

y = dataset['sentic_mean'].copy()

dp = DeterministicProcess(
    index=y.index,
    constant=True,
    order=2
)

X = dp.in_sample()
X
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="634" height="830" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-12.png" alt="" class="wp-image-746494" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-12.png 634w, https://blog.finxter.com/wp-content/uplo...29x300.png 229w" sizes="(max-width: 634px) 100vw, 634px" /></figure>
</div>
<p>We can use <code>X</code> and <code>y</code> as features and target to train a <code>LinearRegression</code> model. In this way, the <code>LinearRegression</code> will learn whatever characteristics from <code>y</code> can be inferred (in our case) solely out of: </p>
<ul>
<li>the number of elapsed time intervals (<code>trend</code> column); </li>
<li>the square of that number (<code>trend_squared</code>); and </li>
<li>a bias term (<code>const</code>). </li>
</ul>
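<p>To make these three columns concrete, they can be rebuilt by hand and compared with what <code>DeterministicProcess</code> returns. The sketch below uses a toy five-day index instead of the article's dataset; <code>by_hand</code> is an illustrative name.</p>

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.deterministic import DeterministicProcess

# Toy daily index standing in for the dataset's DatetimeIndex.
index = pd.date_range("2022-01-01", periods=5, freq="D")

dp = DeterministicProcess(index=index, constant=True, order=2)
X = dp.in_sample()

# The same three columns written out by hand.
trend = np.arange(1, len(index) + 1, dtype=float)  # 1, 2, 3, ... elapsed steps
by_hand = pd.DataFrame(
    {"const": 1.0, "trend": trend, "trend_squared": trend ** 2},
    index=index,
)
```

<p>Nothing in these features depends on the values of <code>y</code>: they are functions of the position in the index only, which is why they can also be generated for future dates.</p>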
<p>Check out the result:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X,y)

predictions = pd.DataFrame(
    model.predict(X),
    index=X.index,
    columns=['Deterministic Curve']
)
</pre>
<p>Comparing predictions and actual values gives:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import matplotlib.pyplot as plt

plt.figure()
ax = plt.subplot()
y.plot(ax=ax, legend=True)
predictions.plot(ax=ax)
plt.show()
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="559" height="434" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-13.png" alt="" class="wp-image-746507" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-13.png 559w, https://blog.finxter.com/wp-content/uplo...00x233.png 300w" sizes="(max-width: 559px) 100vw, 559px" /></figure>
</div>
<p>Even the quadratic term seems negligible here. The <code>DeterministicProcess</code> class also helps us with future predictions since it carries a method that provides the appropriate future form of the chosen features. </p>
<p>Specifically, the <code>out_of_sample</code> method of <code>dp</code> takes the number of time intervals we want to predict as input and generates the needed features for you. </p>
<p>We use 60 days below as an example:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">X_out = dp.out_of_sample(60)

predictions_out = pd.DataFrame(
    model.predict(X_out),
    index=X_out.index,
    columns=['Future Predictions']
)

plt.figure()
ax = plt.subplot()
y.plot(ax=ax, legend=True)
predictions.plot(ax=ax)
predictions_out.plot(ax=ax, color='red')
plt.show()
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="559" height="434" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-14.png" alt="" class="wp-image-746529" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-14.png 559w, https://blog.finxter.com/wp-content/uplo...00x233.png 300w" sizes="(max-width: 559px) 100vw, 559px" /></figure>
</div>
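<p>What <code>out_of_sample</code> returns is easiest to see on a tiny index. The following sketch assumes a toy five-day daily index rather than the article's data.</p>

```python
import pandas as pd
from statsmodels.tsa.deterministic import DeterministicProcess

index = pd.date_range("2022-01-01", periods=5, freq="D")
dp = DeterministicProcess(index=index, constant=True, order=1)

X_in = dp.in_sample()          # rows for the 5 observed days, trend = 1..5
X_out = dp.out_of_sample(3)    # rows for the 3 days that follow, trend = 6..8
```

<p>The future rows continue both the dates and the <code>trend</code> counter, so a model fitted on <code>X_in</code> can be applied to <code>X_out</code> directly.</p>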
<p>Let us repeat the process with <code>sentic_count</code> to get a feel for a higher-order trend. </p>
<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f44d.png" alt="👍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>As a rule of thumb, the order should be one plus the total number of (trending) peaks and valleys in the graph, but not much more than that.</strong> </p>
<p>We choose 3 for <code>sentic_count</code> and compare the output with the <code>order=2</code> result (we do not write the code twice, though).</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">y = dataset['sentic_count'].copy()

from statsmodels.tsa.deterministic import DeterministicProcess, CalendarFourier

dp = DeterministicProcess(
    index=y.index,
    constant=True,
    order=3
)
X = dp.in_sample()

model = LinearRegression().fit(X,y)

predictions = pd.DataFrame(
    model.predict(X),
    index=X.index,
    columns=['Deterministic Curve']
)

X_out = dp.out_of_sample(60)

predictions_out = pd.DataFrame(
    model.predict(X_out),
    index=X_out.index,
    columns=['Future Predictions']
)

plt.figure()
ax = plt.subplot()
y.plot(ax=ax, legend=True)
predictions.plot(ax=ax)
predictions_out.plot(ax=ax, color='red')
plt.show()
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="570" height="429" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-15.png" alt="" class="wp-image-746535" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-15.png 570w, https://blog.finxter.com/wp-content/uplo...00x226.png 300w" sizes="(max-width: 570px) 100vw, 570px" /></figure>
</div>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="570" height="429" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-16.png" alt="" class="wp-image-746538" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-16.png 570w, https://blog.finxter.com/wp-content/uplo...00x226.png 300w" sizes="(max-width: 570px) 100vw, 570px" /></figure>
</div>
<p>Although the order-three polynomial fits the data better, use discretion in deciding whether the sentiment count will decrease so drastically in the next 60 days or not. Usually, trust short-term predictions rather than long-term ones.</p>
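<p>The tension between a better fit and a riskier forecast can be reproduced with plain NumPy. The sketch below is an illustration on made-up data whose true trend is a straight line; it fits polynomials of order 1 and 3 and extrapolates both 60 steps ahead.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(100, dtype=float)
y = 0.02 * t + rng.normal(scale=1.0, size=t.size)  # true trend is a straight line

t_future = np.arange(100, 160, dtype=float)        # 60 steps ahead

results = {}
for order in (1, 3):
    # Polynomial.fit rescales t internally, which keeps the fit numerically stable.
    poly = np.polynomial.Polynomial.fit(t, y, deg=order)
    in_sample_sse = float(np.sum((y - poly(t)) ** 2))
    results[order] = {"sse": in_sample_sse, "forecast": poly(t_future)}
```

<p>The cubic always matches the training data at least as well as the line, since the line is a special case of it. Its forecast, however, is driven by terms that grow like the cube of time, so small coefficient errors are amplified the further out one looks.</p>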
<p><code>DeterministicProcess</code> accepts other parameters, making it a very interesting tool. A description of nearly all of them follows.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">dp = DeterministicProcess(
    index,  # the DatetimeIndex of your data
    period: int or None,  # in case the data shows some periodicity, include the size of the periodic cycle here: 7 would mean 7 days in our case
    constant: bool,  # includes a constant feature in the returned DataFrame, i.e., a feature with the same value for everyone. It returns the equivalent of a bias term in Linear Regression
    order: int,  # order of the polynomial that you think better approximates your trend: the simpler the better
    seasonal: bool,  # make it True if you think the data has some periodicity. If you make it True and do not specify the period, the dp will try to infer the period out of the index
    additional_terms: tuple of statsmodels' DeterministicTerms,  # we come back to this next
    drop: bool  # drops resulting features which are collinear to others. If you will use a linear model, make it True
)
</pre>
<h2>Seasonality</h2>
<p>As a hardened mathematician, I find seasonality my favorite part because it deals with Fourier analysis (and wave functions are just… <a rel="noreferrer noopener" href="https://youtu.be/2awbKQ2DLRE?t=218" target="_blank">cool!</a>):</p>
<figure class="wp-block-embed-youtube wp-block-embed is-type-video is-provider-youtube"><a href="https://blog.finxter.com/time-series-forecast-a-complete-workflow-part-ii/"><img src="https://blog.finxter.com/wp-content/plugins/wp-youtube-lyte/lyteCache.php?origThumbUrl=https%3A%2F%2Fi.ytimg.com%2Fvi%2F2awbKQ2DLRE%2Fhqdefault.jpg" alt="YouTube Video"></a><figcaption></figcaption></figure>
<p>Do you remember your first ML course when you heard Linear Regression can fit arbitrary functions, not only lines? So, why not a wave function? We just did it for polynomials and didn’t even feel like it <img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f609.png" alt="😉" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p>In general, for any expression <code>f</code> which is a function of a feature or of your <code>DatetimeIndex</code>, you can create a feature column whose ith row is the value of <code>f</code> corresponding to the ith index. </p>
<p>Then linear regression finds the constant coefficient multiplying <code>f</code> that best fits your data. Again, this procedure works in general, not only with Datetime indexes – the <code>trend_squared</code> term above is an example of it.</p>
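<p>As a small illustration of this idea outside the time series setting, the sketch below takes <code>f</code> to be the logarithm of an ordinary numeric feature. The data are made up, and NumPy's least squares routine stands in for <code>LinearRegression</code>.</p>

```python
import numpy as np

x = np.linspace(1.0, 50.0, 200)          # any numeric feature, not necessarily time
y = 3.0 * np.log(x) + 1.5                # data generated from f(x) = log(x)

# Feature matrix: a constant column and the transformed column f(x).
features = np.column_stack([np.ones_like(x), np.log(x)])

# Ordinary least squares finds the coefficient multiplying each column.
coefficients, *_ = np.linalg.lstsq(features, y, rcond=None)
intercept, slope = coefficients
```

<p>The model stays linear in its coefficients even though the fitted curve is not a line in <code>x</code>; that is all a "linear" regression requires.</p>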
<p>For seasonality, we use a second amazing <code>statsmodels</code> class: <code><a href="https://www.statsmodels.org/dev/generated/statsmodels.tsa.deterministic.CalendarFourier.html" target="_blank" rel="noreferrer noopener">CalendarFourier</a></code>. It is another <code>statsmodels</code> <code>DeterministicTerm</code> class (i.e., with the <code>in_sample</code> and <code>out_of_sample</code> methods) and is instantiated with two parameters, <code>freq</code> and <code>order</code>. </p>
<p>As <code>freq</code>, the class expects a string such as ‘D’, ‘W’, ‘M’ for day, week or month, respectively, or any of the quite comprehensive Pandas <a href="https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases" target="_blank" rel="noreferrer noopener">Datetime offset aliases</a>. </p>
<p>The <code>order</code> is the Fourier expansion order, which should be understood as the number of waves you expect within one period of your chosen frequency (count the number of ups and downs; one wave is one up and one down).</p>
<p><code>CalendarFourier</code> integrates smoothly with <code>DeterministicProcess</code>: include an instance of it in the list of <code>additional_terms</code>. </p>
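<p>To see what a Fourier order means in terms of columns, here is a hand-rolled, fixed-period analogue. It is a sketch, not <code>CalendarFourier</code>'s actual output: <code>fourier_features_sketch</code> and its column names are hypothetical, and the period is given in time steps rather than as a calendar frequency.</p>

```python
import numpy as np
import pandas as pd


def fourier_features_sketch(index, period, order):
    """Hypothetical helper: sin/cos pairs for k = 1..order waves per period."""
    t = np.arange(len(index), dtype=float)
    columns = {}
    for k in range(1, order + 1):
        angle = 2.0 * np.pi * k * t / period
        columns[f"sin_{k}"] = np.sin(angle)
        columns[f"cos_{k}"] = np.cos(angle)
    return pd.DataFrame(columns, index=index)


index = pd.date_range("2022-01-01", periods=28, freq="D")
X = fourier_features_sketch(index, period=7.0, order=2)

# A weekly signal made of one slow wave and one faster wave.
t = np.arange(len(index))
y = 1.0 * np.sin(2 * np.pi * t / 7) + 0.5 * np.cos(2 * np.pi * 2 * t / 7)

# Least squares recovers the weight of each wave.
weights, *_ = np.linalg.lstsq(X.to_numpy(), y, rcond=None)
```

<p>Each unit of order adds one sine and one cosine column, so order 2 yields four features. The regression then only has to find how much of each wave is present in the data.</p>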
<p>Here is the full code for <code>sentic_mean</code>:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from statsmodels.tsa.deterministic import DeterministicProcess, CalendarFourier

y = dataset['sentic_mean'].copy()

fourier = CalendarFourier(freq='A',order=2)

dp = DeterministicProcess(
    index=y.index,
    constant=True,
    order=2,
    seasonal=False,
    additional_terms=[fourier],
    drop=True
)
X = dp.in_sample()

from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X,y)

predictions = pd.DataFrame(
    model.predict(X),
    index=X.index,
    columns=['Prediction']
)

X_out = dp.out_of_sample(60)

predictions_out = pd.DataFrame(
    model.predict(X_out),
    index=X_out.index,
    columns=['Prediction']
)

plt.figure()
ax = plt.subplot()
y.plot(ax=ax, legend=True)
predictions.plot(ax=ax)
predictions_out.plot(ax=ax, color='red')
plt.show()
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="559" height="434" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-17.png" alt="" class="wp-image-746594" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-17.png 559w, https://blog.finxter.com/wp-content/uplo...00x233.png 300w" sizes="(max-width: 559px) 100vw, 559px" /></figure>
</div>
<p>If we set <code>seasonal=True</code> inside <code>DeterministicProcess</code>, we get a crisper line:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="559" height="434" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-18.png" alt="" class="wp-image-746597" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-18.png 559w, https://blog.finxter.com/wp-content/uplo...00x233.png 300w" sizes="(max-width: 559px) 100vw, 559px" /></figure>
</div>
<p>Including <code>ax.set_xlim(('2022-08-01', '2022-10-01'))</code> before <code>plt.show()</code> zooms the graph in:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="571" height="466" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-19.png" alt="" class="wp-image-746600" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-19.png 571w, https://blog.finxter.com/wp-content/uplo...00x245.png 300w" sizes="(max-width: 571px) 100vw, 571px" /></figure>
</div>
<p>Although I suggest using the <code>seasonal=True</code> parameter with care, it does find interesting patterns (with a huge RMSE, though).</p>
<p>For instance, look at this BTC percentage change zoomed chart:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="572" height="461" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-20.png" alt="" class="wp-image-746607" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-20.png 572w, https://blog.finxter.com/wp-content/uplo...00x242.png 300w" sizes="(max-width: 572px) 100vw, 572px" /></figure>
</div>
<p>Here <code>period</code> is set to 30 and <code>seasonal=True</code>. I also manually rescaled the predictions to make them more visible in the graphic. Although the predictions are far from the truth, thinking as a trader, isn’t it impressive how many peaks and valleys it gets right? At least for this zoomed month…</p>
<p>To keep the workflow promise, I prepared code that does everything so far in one shot:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def deseasonalize(df: pd.Series, season_freq='A', fourier_order=0, constant=True, dp_order=1, dp_drop=True, model=LinearRegression(), fourier=None, dp=None, **DeterministicProcesskwargs)->(pd.Series, plt.Axes, DeterministicProcess):
    """
    Returns a deseasonalized and detrended df, a seasonal plot, and the fitted DeterministicProcess instance.
    """
    if fourier is None:
        fourier = CalendarFourier(freq=season_freq, order=fourier_order)
    if dp is None:
        dp = DeterministicProcess(
            index=df.index,
            constant=constant,
            order=dp_order,
            additional_terms=[fourier],
            drop=dp_drop,
            **DeterministicProcesskwargs
        )
    X = dp.in_sample()
    # fit the model passed by the caller on the deterministic features
    model = model.fit(X, df)
    y_pred = pd.Series(
        model.predict(X),
        index=X.index,
        name=df.name+'_pred'
    )
    # plot the original series against the fitted seasonal/trend curve
    ax = plt.subplot()
    df.plot(ax=ax, legend=True)
    y_pred.plot(ax=ax, legend=True)
    y_deseason = df - y_pred
    y_deseason.name = df.name +'_deseasoned'
    return y_deseason, ax, dp
</pre>
<p>The <code>sentic_mean</code> analysis gets reduced to:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">y_deseason, ax, dp = deseasonalize(y, season_freq='A', fourier_order=2, constant=True, dp_order=2, dp_drop=True, model=LinearRegression() )
</pre>
<h2>Cycles and Hybrid Models</h2>
<p>Let us move on to a complete Machine Learning prediction. We use <code>XGBRegressor</code> and compare its performance across three setups: </p>
<ol>
<li>Predicting <code>sentic_mean</code> directly using lags;</li>
<li>The same prediction, adding seasonal and trend features from a <code>DeterministicProcess</code>;</li>
<li>A hybrid model, using <code>LinearRegression</code> to infer and remove seasons/trends, and then applying an <code>XGBRegressor</code>.</li>
</ol>
<p>The first part will be the bulkiest, since the other two follow from simple modifications of the resulting code. </p>
<h3>Preparing the data</h3>
<p>Before any analysis, we split the data into train and test sets. Since we are dealing with time series, this means we set the ‘present date’ to a point in the past and try to predict its respective ‘future’. Here we pick 22 days in the past.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">s = dataset['sentic_mean']
s_train = s[:'2022-09-01']
</pre>
<p>We made this first split in order to not leak data while doing any analysis. </p>
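<p>A small sketch of how such a date-based cut behaves. The series <code>toy</code> and its end date are hypothetical, chosen only so that 22 days remain after the cutoff; the point is that label slicing on a <code>DatetimeIndex</code> includes the end date.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical daily index; the end date is an assumption chosen so that
# 22 days remain after the cutoff.
index = pd.date_range("2022-01-01", "2022-09-23", freq="D")
toy = pd.Series(np.arange(len(index), dtype=float), index=index, name="toy")

# Label-based slicing on a DatetimeIndex includes the end label.
toy_train = toy.loc[:"2022-09-01"]
toy_future = toy[toy.index > toy_train.index[-1]]

print(toy_train.index[-1])   # last date the analysis is allowed to see
print(len(toy_future))       # number of held-out days
```

<p>Any statistic used to design features (such as the partial autocorrelation) should be computed on the training part only, so the held-out days never influence those choices.</p>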
<p>Next, we prepare target and feature sets. Recall that SentiCrypto’s data was set to be available every day at 8 AM. Imagine we are doing the prediction at 9 AM. </p>
<p>In this case, anything up to the present date (the ‘<code>lag_0</code>‘) can be used as a feature, and our target is <code>s_train</code>‘s first lead (which we define as a -1 lag). To choose other lags as features, we examine the partial autocorrelation plot from <code>statsmodels</code>:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(s_train, lags=20)
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="568" height="433" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-21.png" alt="" class="wp-image-746619" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-21.png 568w, https://blog.finxter.com/wp-content/uplo...00x229.png 300w" sizes="(max-width: 568px) 100vw, 568px" /></figure>
</div>
<p>We use the first four lags for <code>sentic_mean</code> and the first seven plus the 11th for <code>sentic_count</code> (you can easily test different combinations with the code below). </p>
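<p>For intuition about what the plot measures, here is a sketch that estimates partial autocorrelations by regression and keeps the lags above a threshold. The helper <code>pacf_ols</code> and the synthetic AR(1) series are assumptions for illustration; <code>statsmodels</code> offers several estimators, so its numbers can differ slightly from this regression-based version.</p>

```python
import numpy as np
import pandas as pd


def pacf_ols(x, nlags):
    """Partial autocorrelations estimated by OLS: for each lag k, regress
    x_t on x_{t-1}, ..., x_{t-k} and keep the coefficient of x_{t-k}."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    out = [1.0]  # lag 0 is 1 by convention
    for k in range(1, nlags + 1):
        # Column j holds the series lagged by j+1 steps, aligned with x[k:].
        lagged = np.column_stack(
            [x[k - j - 1:len(x) - j - 1] for j in range(k)]
        )
        coef, *_ = np.linalg.lstsq(lagged, x[k:], rcond=None)
        out.append(coef[-1])
    return pd.Series(out)


# Synthetic AR(1) process: only the first lag carries direct information.
rng = np.random.default_rng(42)
noise = rng.normal(size=2000)
ar1 = np.zeros(2000)
for i in range(1, 2000):
    ar1[i] = 0.7 * ar1[i - 1] + noise[i]

toy_pacf = pacf_ols(ar1, nlags=10)

# Keep the lags whose partial autocorrelation exceeds a threshold.
minimal_pacf = 0.1
selected_lags = list(toy_pacf[toy_pacf > minimal_pacf].index)
print(selected_lags)
```

<p>Note that a rule of the form <code>pacf > threshold</code> only keeps positive partial autocorrelations; strongly negative lags would need an absolute value.</p>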
<p>Now that we have chosen the features, we go back to the full series for feature engineering. We apply to <code>s_mean</code> and <code>s_count</code> the <code>make_lags</code> function we defined in the last article (transcribed here for convenience). </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def make_lags(df, n_lags=1, lead_time=1):
    """
    Compute lags of a pandas.Series from lead_time to lead_time + n_lags.
    Alternatively, a list can be passed as n_lags.
    Returns a pd.DataFrame whose ith column is either the i+lead_time lag
    or the ith element of n_lags.
    """
    if isinstance(n_lags, int):
        lag_list = list(range(lead_time, n_lags+lead_time))
    else:
        lag_list = n_lags
    lags = {
        f'{df.name}_lag_{i}': df.shift(i) for i in lag_list
    }
    return pd.concat(lags, axis=1)


X = make_lags(s, [0,1,2,3,4])
y = make_lags(s, [-1])
display(X)
y
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="524" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-22-1024x524.png" alt="" class="wp-image-746635" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-22-1024x524.png 1024w, https://blog.finxter.com/wp-content/uplo...00x153.png 300w, https://blog.finxter.com/wp-content/uplo...68x393.png 768w, https://blog.finxter.com/wp-content/uplo...36x785.png 1536w, https://blog.finxter.com/wp-content/uplo...age-22.png 1600w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="524" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-23-1024x524.png" alt="" class="wp-image-746639" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-23-1024x524.png 1024w, https://blog.finxter.com/wp-content/uplo...00x153.png 300w, https://blog.finxter.com/wp-content/uplo...68x393.png 768w, https://blog.finxter.com/wp-content/uplo...36x785.png 1536w, https://blog.finxter.com/wp-content/uplo...age-23.png 1600w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
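<p>The sign convention is easy to mix up, so here is a tiny sketch on a hypothetical five-day series (the name <code>toy</code> and its values are made up): a positive shift looks into the past, a negative shift looks into the future.</p>

```python
import pandas as pd

toy = pd.Series(
    [10.0, 11.0, 12.0, 13.0, 14.0],
    index=pd.date_range("2022-09-01", periods=5, freq="D"),
    name="toy",
)

frame = pd.concat(
    {
        "toy_lag_0": toy.shift(0),    # today's value
        "toy_lag_1": toy.shift(1),    # yesterday's value
        "toy_lag_-1": toy.shift(-1),  # tomorrow's value (the target)
    },
    axis=1,
)
print(frame)
```

<p>The first row has no past value and the last row has no future value, which is why lagged features start with missing entries and the target ends with one.</p>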
<p>Now a train-test split with <code>sklearn</code> is convenient (notice the <code>shuffle=False</code> parameter, which is key for time series):</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=22, shuffle=False)
X_train
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="536" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-24-1024x536.png" alt="" class="wp-image-746647" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-24-1024x536.png 1024w, https://blog.finxter.com/wp-content/uplo...00x157.png 300w, https://blog.finxter.com/wp-content/uplo...68x402.png 768w, https://blog.finxter.com/wp-content/uplo...36x805.png 1536w, https://blog.finxter.com/wp-content/uplo...age-24.png 1596w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>(Observe that the final date is set correctly, in accordance with our analysis’ split.)</p>
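<p>With <code>shuffle=False</code> and an integer <code>test_size</code>, <code>train_test_split</code> keeps the row order and holds out the last rows. The sketch below writes the same chronological split by hand with positional slicing, on a hypothetical 40-day frame (<code>X_toy</code> and <code>y_toy</code> are made-up names).</p>

```python
import numpy as np
import pandas as pd

index = pd.date_range("2022-08-01", periods=40, freq="D")
X_toy = pd.DataFrame({"toy_lag_0": np.arange(40.0)}, index=index)
y_toy = X_toy["toy_lag_0"].shift(-1).rename("toy_lag_-1")

test_size = 22
# Chronological split: everything before the last test_size rows is training data.
X_tr, X_te = X_toy.iloc[:-test_size], X_toy.iloc[-test_size:]
y_tr, y_te = y_toy.iloc[:-test_size], y_toy.iloc[-test_size:]

# The training period ends strictly before the test period starts.
print(X_tr.index.max(), X_te.index.min())
```

<p>A shuffled split would mix future rows into the training set, so the model would be evaluated on dates it has effectively already seen.</p>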
<p> Applying the regressor:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from xgboost import XGBRegressor
from sklearn.metrics import r2_score

xgb = XGBRegressor(n_estimators=50)
xgb.fit(X_train, y_train)

predictions_train = pd.DataFrame(
    xgb.predict(X_train),
    index=X_train.index,
    columns=['Prediction']
)
predictions_test = pd.DataFrame(
    xgb.predict(X_test),
    index=X_test.index,
    columns=['Prediction']
)

# The last target row has no 'tomorrow' yet, so it is left out of the scores
print(f'R2 train score: {r2_score(y_train[:-1],predictions_train[:-1])}')

plt.figure()
ax = plt.subplot()
y_train.plot(ax=ax, legend=True)
predictions_train.plot(ax=ax)
plt.show()

plt.figure()
ax = plt.subplot()
y_test.plot(ax=ax, legend=True)
predictions_test.plot(ax=ax)
plt.show()

print(f'R2 test score: {r2_score(y_test[:-1],predictions_test[:-1])}')
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="819" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-25-1024x819.png" alt="" class="wp-image-746650" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-25-1024x819.png 1024w, https://blog.finxter.com/wp-content/uplo...00x240.png 300w, https://blog.finxter.com/wp-content/uplo...68x614.png 768w, https://blog.finxter.com/wp-content/uplo...age-25.png 1318w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="868" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-26-1024x868.png" alt="" class="wp-image-746653" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-26-1024x868.png 1024w, https://blog.finxter.com/wp-content/uplo...00x254.png 300w, https://blog.finxter.com/wp-content/uplo...68x651.png 768w, https://blog.finxter.com/wp-content/uplo...age-26.png 1302w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>You can reduce overfitting by lowering the number of estimators, but the R2 test score remains negative.</p>
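<p>As a reminder of what a negative score means, the sketch below computes the coefficient of determination by hand on a made-up four-point example (the helper <code>r2_by_hand</code> is hypothetical; it follows the usual definition, 1 minus the ratio of residual to total sum of squares). A score of 0 corresponds to always predicting the mean of the true values, so a negative score is worse than that baseline.</p>

```python
import numpy as np


def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_true = np.array([1.0, 2.0, 3.0, 4.0])

mean_baseline = np.full_like(y_true, y_true.mean())
reversed_guess = y_true[::-1]

print(r2_by_hand(y_true, y_true))          # 1.0, perfect predictions
print(r2_by_hand(y_true, mean_baseline))   # 0.0, constant mean prediction
print(r2_by_hand(y_true, reversed_guess))  # -3.0, worse than the mean
```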
<p>We can replicate the process for <code>sentic_count</code> (or whatever you want). Below is a function to automate it.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from statsmodels.tsa.stattools import pacf


def apply_univariate_prediction(series, test_size, to_predict=1, nlags=20,
                                minimal_pacf=0.1,
                                model=XGBRegressor(n_estimators=50)):
    '''
    Starting from series, breaks it in train and test subsets;
    chooses which lags to use based on pacf > minimal_pacf;
    and applies the given sklearn-type model.
    Returns the resulting features and targets and the trained model.
    It plots the graph of the training and prediction, together with their r2_score.
    '''
    # Choose the lags on the training part only, to avoid leaking test data
    s = series.iloc[:-test_size]
    if isinstance(to_predict, int):
        to_predict = [to_predict]
    s_pacf = pd.Series(pacf(s, nlags=nlags))
    column_list = s_pacf[s_pacf>minimal_pacf].index

    X = make_lags(series, n_lags=column_list).dropna()
    y = make_lags(series, n_lags=[-x for x in to_predict]).loc[X.index]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, shuffle=False)

    model.fit(X_train, y_train)
    predictions_train = pd.DataFrame(
        model.predict(X_train),
        index=X_train.index,
        columns=['Train Predictions']
    )
    predictions_test = pd.DataFrame(
        model.predict(X_test),
        index=X_test.index,
        columns=['Test Predictions']
    )

    fig, (ax1,ax2) = plt.subplots(1,2, figsize=(14,5), sharey=True)
    y_train.plot(ax=ax1, legend=True)
    predictions_train.plot(ax=ax1)
    ax1.set_title('Train Predictions')
    y_test.plot(ax=ax2, legend=True)
    predictions_test.plot(ax=ax2)
    ax2.set_title('Test Predictions')
    plt.show()

    print(f'R2 train score: {r2_score(y_train[:-1],predictions_train[:-1])}')
    print(f'R2 test score: {r2_score(y_test[:-1],predictions_test[:-1])}')
    return X, y, model


apply_univariate_prediction(dataset['sentic_count'],22)
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="472" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-27-1024x472.png" alt="" class="wp-image-746672" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-27-1024x472.png 1024w, https://blog.finxter.com/wp-content/uplo...00x138.png 300w, https://blog.finxter.com/wp-content/uplo...68x354.png 768w, https://blog.finxter.com/wp-content/uplo...36x708.png 1536w, https://blog.finxter.com/wp-content/uplo...age-27.png 1600w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">apply_univariate_prediction(dataset['BTC-USD'], 22)</pre>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="479" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-28-1024x479.png" alt="" class="wp-image-746679" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-28-1024x479.png 1024w, https://blog.finxter.com/wp-content/uplo...00x140.png 300w, https://blog.finxter.com/wp-content/uplo...68x359.png 768w, https://blog.finxter.com/wp-content/uplo...36x718.png 1536w, https://blog.finxter.com/wp-content/uplo...age-28.png 1600w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<h2>Predicting with Seasons</h2>
<p>Since the features created by <code>DeterministicProcess</code> depend only on time, we can add them harmlessly to the feature DataFrame we obtained automatically from our univariate predictions. </p>
<p>The predictions, though, are still univariate. We use the <code>deseasonalize</code> function to obtain the season features. The data preparation is as follows:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">s = dataset['sentic_mean']
X, y, _ = apply_univariate_prediction(s,22);

s_deseason, _, dp = deseasonalize(s, season_freq='A', fourier_order=2, constant=True, dp_order=2, dp_drop=True, model=LinearRegression());
# Shift by -1 so each row carries the time features of the day being predicted
X_f = dp.in_sample().shift(-1)
X = pd.concat([X,X_f], axis=1, join='inner').dropna()
</pre>
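<p>The reason time-only features are safe to add is that they can be generated for future dates without observing the target. A minimal sketch, assuming a hypothetical 60-day daily index and a weekly period (neither is taken from the article's dataset):</p>

```python
import pandas as pd
from statsmodels.tsa.deterministic import DeterministicProcess

# Hypothetical daily index standing in for the article's dataset.
index = pd.date_range("2022-06-01", periods=60, freq="D")

dp_toy = DeterministicProcess(
    index=index,
    constant=True,   # intercept column
    order=1,         # linear trend
    seasonal=True,   # seasonal dummy columns
    period=7,        # weekly pattern
    drop=True,       # drop perfectly collinear columns
)

X_known = dp_toy.in_sample()          # features for the observed dates
X_future = dp_toy.out_of_sample(22)   # features for the next 22 dates

print(X_known.columns.tolist())
print(X_future.index[0], X_future.index[-1])
```

<p>Both frames share the same columns, so a model fitted on <code>X_known</code> can be applied directly to <code>X_future</code>.</p>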
<p>With a bit of copy and paste, we arrive at:</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="477" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-29-1024x477.png" alt="" class="wp-image-746689" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-29-1024x477.png 1024w, https://blog.finxter.com/wp-content/uplo...00x140.png 300w, https://blog.finxter.com/wp-content/uplo...68x358.png 768w, https://blog.finxter.com/wp-content/uplo...36x715.png 1536w, https://blog.finxter.com/wp-content/uplo...age-29.png 1600w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>And we actually perform way worse! <img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f631.png" alt="😱" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<h2>Deseasonalizing</h2>
<p>Nevertheless, the right-hand plot illustrates the model’s inability to grasp trends. Our last shot is a hybrid model. </p>
<p>Here we follow three steps:</p>
<ol>
<li>We use <code>LinearRegression</code> to capture the seasons and trends, yielding the series <code>y_s</code>. Then we obtain a deseasonalized target <code>y_ds = y-y_s</code>;</li>
<li>We train an <code>XGBRegressor</code> on <code>y_ds</code> and the lagged features, resulting in deseasonalized predictions <code>y_pred</code>;</li>
<li>Finally, we add <code>y_s</code> back to <code>y_pred</code> to compare the final result.</li>
</ol>
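<p>The three steps can be sketched on synthetic data with NumPy alone. This is only an illustration under stated assumptions: the series is made up, <code>np.polyfit</code> stands in for <code>LinearRegression</code>, and a one-lag least-squares fit stands in for <code>XGBRegressor</code>.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, test_size = 300, 22
t = np.arange(n)

# Synthetic series: linear trend plus an AR(1) cycle.
cycle = np.zeros(n)
for i in range(1, n):
    cycle[i] = 0.8 * cycle[i - 1] + rng.normal(scale=0.5)
y = 0.1 * t + cycle

# Step 1: fit the trend on the training period only, extrapolate to all dates.
slope, intercept = np.polyfit(t[:-test_size], y[:-test_size], deg=1)
y_s = slope * t + intercept
y_ds = y - y_s

# Step 2: model the detrended series from its own first lag (training period only).
lag, target = y_ds[:-1], y_ds[1:]
tr = slice(None, -test_size)
phi = np.dot(lag[tr], target[tr]) / np.dot(lag[tr], lag[tr])
y_ds_pred = phi * lag

# Step 3: add the trend back to obtain predictions on the original scale.
y_hat = y_ds_pred + y_s[1:]


def r2(a, b):
    return 1 - np.sum((a - b) ** 2) / np.sum((a - a.mean()) ** 2)


print("R2 test, hybrid:", r2(y[1:][-test_size:], y_hat[-test_size:]))
print("R2 test, trend only:", r2(y[1:][-test_size:], y_s[1:][-test_size:]))
```

<p>The second model only has to explain what the first one left behind, which is why the trend can be extrapolated even by a learner that cannot extrapolate on its own.</p>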
<p>Although Bitcoin-related data are hard to predict, there was a huge improvement in the <code>r2_score</code> (finally something positive!). The function used here is defined further below.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">get_hybrid_univariate_prediction(
    dataset['sentic_mean'], 22,
    season_freq='A',
    fourier_order=2,
    constant=True,
    dp_order=2,
    dp_drop=True,
    model1=LinearRegression(),
    fourier=None,
    is_seasonal=True,
    season_period=7,
    dp=None,
    to_predict=1,
    nlags=20,
    minimal_pacf=0.1,
    model2=XGBRegressor(n_estimators=50)
)
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="475" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-30-1024x475.png" alt="" class="wp-image-746703" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-30-1024x475.png 1024w, https://blog.finxter.com/wp-content/uplo...00x139.png 300w, https://blog.finxter.com/wp-content/uplo...68x356.png 768w, https://blog.finxter.com/wp-content/uplo...36x712.png 1536w, https://blog.finxter.com/wp-content/uplo...age-30.png 1600w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>Instead of going through every detail, we will also automate this code. To get the code running smoothly, we revisit the <code>deseasonalize</code> and <code>apply_univariate_prediction</code> functions and remove their plotting parts. </p>
<p>The final function only plots graphs and returns nothing. It intends to give you a baseline for a hybrid model score. Change the function at will to make it return whatever you need.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def get_season(series: pd.Series, test_size, season_freq='A', fourier_order=0,
               constant=True, dp_order=1, dp_drop=True,
               model1=LinearRegression(), fourier=None,
               is_seasonal=False, season_period=None, dp=None):
    """
    Decompose series in a deseasonalized and a seasonal part.
    The parameters are relative to the fourier and DeterministicProcess used.
    Returns y_ds and y_s.
    """
    # Fit on the training part only; the test part is extrapolated below
    se = series.iloc[:-test_size]
    if fourier is None:
        fourier = CalendarFourier(freq=season_freq, order=fourier_order)
    if dp is None:
        dp = DeterministicProcess(
            index=se.index,
            constant=constant,
            order=dp_order,
            additional_terms=[fourier],
            drop=dp_drop,
            seasonal=is_seasonal,
            period=season_period
        )
    X_in = dp.in_sample()
    X_out = dp.out_of_sample(test_size)
    model1 = model1.fit(X_in, se)
    X = pd.concat([X_in,X_out],axis=0)
    y_s = pd.Series(
        model1.predict(X),
        index=X.index,
        name=series.name
    )
    y_ds = series - y_s
    y_ds.name = series.name +'_deseasoned'
    return y_ds, y_s


def prepare_data(series, test_size, to_predict=1, nlags=20, minimal_pacf=0.1):
    '''
    Creates a feature dataframe by making lags and a target series by a negative to_predict-shift.
    Returns X, y.
    '''
    s = series.iloc[:-test_size]
    if isinstance(to_predict,int):
        to_predict = [to_predict]
    s_pacf = pd.Series(pacf(s,nlags=nlags))
    column_list = s_pacf[s_pacf>minimal_pacf].index
    X = make_lags(series, n_lags=column_list).dropna()
    y = make_lags(series,n_lags=[-x for x in to_predict]).loc[X.index].squeeze()
    return X, y


def get_hybrid_univariate_prediction(series: pd.Series, test_size,
                                     season_freq='A', fourier_order=0,
                                     constant=True, dp_order=1, dp_drop=True,
                                     model1=LinearRegression(), fourier=None,
                                     is_seasonal=False, season_period=None,
                                     dp=None, to_predict=1, nlags=20,
                                     minimal_pacf=0.1,
                                     model2=XGBRegressor(n_estimators=50)):
    """
    Apply the hybrid model method by deseasonalizing/detrending a time series with model1
    and investigating the resulting series with model2.
    It plots the respective graphs and computes r2_scores.
    """
    y_ds, y_s = get_season(series, test_size, season_freq=season_freq,
                           fourier_order=fourier_order, constant=constant,
                           dp_order=dp_order, dp_drop=dp_drop, model1=model1,
                           fourier=fourier, dp=dp, is_seasonal=is_seasonal,
                           season_period=season_period)
    # Forward the lag-selection parameters instead of silently using the defaults
    X, y_ds = prepare_data(y_ds, test_size=test_size, to_predict=to_predict,
                           nlags=nlags, minimal_pacf=minimal_pacf)
    X_train, X_test, y_train, y_test = train_test_split(X, y_ds, test_size=test_size, shuffle=False)
    y = y_s.squeeze() + y_ds.squeeze()

    model2 = model2.fit(X_train,y_train)
    # Add the seasonal part back to the deseasonalized predictions
    predictions_train = pd.Series(
        model2.predict(X_train),
        index=X_train.index,
        name='Prediction'
    )+y_s[X_train.index]
    predictions_test = pd.Series(
        model2.predict(X_test),
        index=X_test.index,
        name='Prediction'
    )+y_s[X_test.index]

    fig, (ax1,ax2) = plt.subplots(1,2, figsize=(14,5), sharey=True)
    y_train_ps = y.loc[y_train.index]
    y_test_ps = y.loc[y_test.index]
    y_train_ps.plot(ax=ax1, legend=True)
    predictions_train.plot(ax=ax1)
    ax1.set_title('Train Predictions')
    y_test_ps.plot(ax=ax2, legend=True)
    predictions_test.plot(ax=ax2)
    ax2.set_title('Test Predictions')
    plt.show()

    print(f'R2 train score: {r2_score(y_train_ps[:-to_predict],predictions_train[:-to_predict])}')
    print(f'R2 test score: {r2_score(y_test_ps[:-to_predict],predictions_test[:-to_predict])}')
</pre>
<p><strong>A note of warning:</strong> if you do not expect your data to follow time patterns, do focus on cycles! The hybrid model works well for many tasks, but it actually decreases the R2 score of our previous Bitcoin prediction:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">get_hybrid_univariate_prediction(
    dataset['BTC-USD'], 22,
    season_freq='A',
    fourier_order=4,
    constant=True,
    dp_order=5,
    dp_drop=True,
    model1=LinearRegression(),
    fourier=None,
    is_seasonal=True,
    season_period=30,
    dp=None,
    to_predict=1,
    nlags=20,
    minimal_pacf=0.05,
    model2=XGBRegressor(n_estimators=20)
)
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="474" src="https://blog.finxter.com/wp-content/uploads/2022/10/image-31-1024x474.png" alt="" class="wp-image-746717" srcset="https://blog.finxter.com/wp-content/uploads/2022/10/image-31-1024x474.png 1024w, https://blog.finxter.com/wp-content/uplo...00x139.png 300w, https://blog.finxter.com/wp-content/uplo...68x356.png 768w, https://blog.finxter.com/wp-content/uplo...36x711.png 1536w, https://blog.finxter.com/wp-content/uplo...age-31.png 1600w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>The former score was around 0.31.</p>
<h2>Conclusion</h2>
<p>This article aims to present functions for your time series workflow, especially for lags and deseasonalization. Use them with care, though: apply them to get baseline scores before delving into more sophisticated models.</p>
<p>In future articles we will cover multi-step predictions (predicting more than one day ahead) and compare the performance of different models, both univariate and multivariate.</p>
<hr class="wp-block-separator has-alpha-channel-opacity"/>
</div>
https://www.sickgaming.net/blog/2022/10/...a-part-ii/