07-02-2019, 11:28 AM
Jupyter and data science in Fedora
<div style="margin: 5px 5% 10px 5%;"><img src="http://www.sickgaming.net/blog/wp-content/uploads/2019/07/jupyter-and-data-science-in-fedora.png" width="1150" height="804" title="" alt="" /></div><div><p>In the past, kings and leaders used oracles and magicians to help them predict the future — or at least get some good advice due to their supposed power to perceive hidden information. Nowadays, we live in a society obsessed with quantifying everything. So we have data scientists to do this job.</p>
<p>Data scientists use statistical models, numerical techniques and advanced algorithms that didn’t come from statistical disciplines, along with the data that already exists in databases, to find, infer and predict data that doesn’t exist yet. Sometimes this data is about the future. That is why we do a lot of predictive and prescriptive analytics.</p>
<p>Here are some questions to which data scientists help find answers:</p>
<p> <span id="more-28554"></span> </p>
<ol>
<li>Which students have a high propensity to abandon the class? For each one, what are the reasons for leaving?</li>
<li>Which house has a price above or below the fair price? What is the fair price for a certain house?</li>
<li>What are the hidden groups into which my clients classify themselves?</li>
<li>Which future problems will this premature child develop?</li>
<li>How many calls will I get in my call center tomorrow at 11:43 AM?</li>
<li>Should my bank lend money to this customer or not?</li>
</ol>
<p>Note how the answers to all these questions are not sitting in any database waiting to be queried. This is all data that doesn’t exist yet and has to be calculated. That is part of the job we data scientists do.</p>
<p>Throughout this article you’ll learn how to prepare a Fedora system as a Data Scientist’s development environment and also a production system. Most of the basic software is RPM-packaged, but the most advanced parts can only be installed, nowadays, with Python’s <em>pip</em> tool.</p>
<h2>Jupyter — the IDE</h2>
<p>Most modern data scientists use Python, and an important part of their work is EDA (exploratory data analysis). EDA is a manual and interactive process in which you retrieve data, explore its features, search for correlations, use plots to visualize and understand how the data is shaped, and prototype predictive models.</p>
<p>Jupyter is a web application perfect for this task. Jupyter works with Notebooks, documents that mix rich text including beautifully rendered math formulas (thanks to <a href="http://mathjax.org">mathjax</a>), blocks of code and code output, including graphics.</p>
<p>Notebook files have the extension <em>.ipynb</em>, which stands for Interactive Python Notebook.</p>
<h3>Setting up and running Jupyter</h3>
<p>First, install essential packages for Jupyter (<a href="https://fedoramagazine.org/howto-use-sudo/">using </a><em><a href="https://fedoramagazine.org/howto-use-sudo/">sudo</a></em>): </p>
<pre class="wp-block-preformatted">$ sudo dnf install python3-notebook mathjax sscg</pre>
<p>You might want to install additional and optional Python modules commonly used by data scientists: </p>
<pre class="wp-block-preformatted">$ sudo dnf install python3-seaborn python3-lxml python3-basemap python3-scikit-image python3-scikit-learn python3-sympy python3-dask+dataframe python3-nltk</pre>
<p>Set a password to log into the Notebook web interface and avoid those long tokens. Run the following commands from any directory in your terminal:</p>
<pre class="wp-block-preformatted">$ mkdir -p $HOME/.jupyter<br />$ jupyter notebook password</pre>
<p>Now, type a password for yourself. This will create the file <em>$HOME/.jupyter/jupyter_notebook_config.json</em> with your hashed password.</p>
<p>Next, prepare for SSL by generating a self-signed HTTPS certificate for Jupyter’s web server: </p>
<pre class="wp-block-preformatted">$ cd $HOME/.jupyter; sscg</pre>
<p>Finish configuring Jupyter by editing your <em>$HOME/.jupyter/jupyter_notebook_config.json</em> file. Make it look like this:</p>
<pre class="wp-block-preformatted">{<br /> "NotebookApp": {<br /> "password": "<span style="color: blue">sha1:abf58...87b</span>",<br /> "ip": "*",<br /> "allow_origin": "*",<br /> "allow_remote_access": true,<br /> "open_browser": false,<br /> "websocket_compression_options": {},<br /> "certfile": "<span style="color: red">/home/aviram</span>/.jupyter/<span style="color: green">service.pem</span>",<br /> "keyfile": "<span style="color: red">/home/aviram</span>/.jupyter/<span style="color: green">service-key.pem</span>",<br /> "notebook_dir": "<span style="color: red">/home/aviram</span>/Notebooks"<br /> }<br />} </pre>
<p>The parts in <span style="color: red">red</span> must be changed to match your folders. Parts in <span style="color: blue">blue</span> were already there after you created your password. Parts in <span style="color: green">green</span> are the crypto-related files generated by <em>sscg</em>.</p>
<p>Create a folder for your notebook files, as configured in the <em>notebook_dir</em> setting above: </p>
<pre class="wp-block-preformatted">$ mkdir $HOME/Notebooks</pre>
<p>Now you are all set. Just run Jupyter Notebook from anywhere on your system by typing:</p>
<pre class="wp-block-preformatted">$ jupyter notebook</pre>
<p>Or add this line to your <em>$HOME/.bashrc</em> file to create a shortcut command called <em>jn</em>:</p>
<pre class="wp-block-preformatted">alias jn='jupyter notebook'</pre>
<p>After running the command <em>jn</em>, access <em>https://your-fedora-host.com:8888</em> from any browser on the network to see the Jupyter user interface. You’ll need to use the password you set up earlier. Start typing some Python code and Markdown text. This is how it looks:</p>
<figure class="wp-block-image"><img src="http://www.sickgaming.net/blog/wp-content/uploads/2019/07/jupyter-and-data-science-in-fedora.png" alt="" /><figcaption>Jupyter with a simple notebook</figcaption></figure>
<p>In addition to the IPython environment, you’ll also get a web-based Unix terminal provided by <em>terminado</em>. Some people might find this useful, while others find this insecure. You can disable this feature in the config file.</p>
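<p>One way to turn the terminal off is the <em>terminals_enabled</em> option of <em>NotebookApp</em>. The following is a minimal sketch that sets it in the JSON config file edited earlier, assuming the file lives at the path shown above. The helper name <em>disable_terminals</em> is hypothetical and not part of Jupyter.</p>

```python
import json
from pathlib import Path


def disable_terminals(config_path):
    """Set NotebookApp.terminals_enabled to false in a Jupyter JSON config.

    Hypothetical helper for illustration; all other settings are kept as-is.
    """
    path = Path(config_path)
    # Start from the existing settings if the file is already there
    config = json.loads(path.read_text()) if path.exists() else {}
    config.setdefault("NotebookApp", {})["terminals_enabled"] = False
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Example call (assumes the default per-user config location):
# disable_terminals(Path.home() / ".jupyter" / "jupyter_notebook_config.json")
```

<p>Restart Jupyter afterwards so the new setting is picked up. Editing the file by hand and adding <em>"terminals_enabled": false</em> inside the <em>NotebookApp</em> block has the same effect.</p>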
<h2>JupyterLab — the next generation of Jupyter</h2>
<p>JupyterLab is the next generation of Jupyter, with a better interface and more control over your workspace. At the time of writing it is not RPM-packaged for Fedora, but you can easily install it with <em>pip</em>:</p>
<pre class="wp-block-preformatted">$ pip3 install jupyterlab --user<br />$ jupyter serverextension enable --py jupyterlab<br /></pre>
<p>Then run your regular <em>jupyter notebook</em> command or <em>jn</em> alias. JupyterLab will be accessible from <em>https://your-fedora-host.com:8888/<strong>lab</strong></em>.</p>
<h2>Tools used by data scientists</h2>
<p>This section introduces some of these tools and shows how to install them. Unless noted otherwise, each module is already packaged for Fedora and was installed as a prerequisite of the components above.</p>
<h3>Numpy</h3>
<p><em>Numpy</em> is an advanced, C-optimized math library designed to work with large in-memory datasets. It provides multidimensional array and matrix support and operations, including math functions such as log(), exp() and trigonometry.</p>
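<p>As a rough illustration of those operations, here is a small sketch you could paste into a notebook cell. The matrix values are made up for the example.</p>

```python
import numpy as np

# A small 2x3 matrix of positive values (illustrative data)
m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

logs = np.log(m)            # element-wise natural logarithm
restored = np.exp(logs)     # exp() undoes log(), up to rounding error
col_means = m.mean(axis=0)  # mean of each column
product = m @ m.T           # matrix product, shape (2, 2)

print(col_means)
print(product)
```

<p>All of these calls operate on the whole array at once, without an explicit Python loop, which is where the C-optimized speed comes from.</p>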
<h3>Pandas</h3>
<p>In this author’s opinion, Python is THE platform for data science mostly because of Pandas. Built on top of Numpy, Pandas makes it easy to prepare and display data. You can think of it as a spreadsheet without a UI that is ready to work with much larger datasets. Pandas helps with retrieving data from a SQL database, CSV or other types of files, manipulating columns and rows, filtering data and, to some extent, visualizing data with matplotlib.</p>
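<p>A minimal sketch of that workflow follows. The CSV content, column names and city names are invented for the example; in practice you would pass a file path or a database query result instead of an in-memory string.</p>

```python
import io

import pandas as pd

# Invented sample data standing in for a real CSV file
csv_text = """city,bedrooms,price
Springfield,2,150000
Springfield,3,210000
Shelbyville,2,120000
Shelbyville,4,260000
"""

houses = pd.read_csv(io.StringIO(csv_text))

# Column manipulation: derive a new column from existing ones
houses["price_per_bedroom"] = houses["price"] / houses["bedrooms"]

# Row filtering: keep only the larger houses
large = houses[houses["bedrooms"] >= 3]

# Aggregation: mean price per city
mean_price = houses.groupby("city")["price"].mean()

print(large)
print(mean_price)
```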
<h3>Matplotlib</h3>
<p>Matplotlib is a library to plot 2D and 3D data. It has great support for annotations, labels and overlays in graphics.</p>
<figure class="wp-block-image"><img src="http://www.sickgaming.net/blog/wp-content/uploads/2019/07/jupyter-and-data-science-in-fedora-1.png" alt="" class="wp-image-28572" /><figcaption>matplotlib pair of graphics showing a cost function searching its optimal value through a gradient descent algorithm</figcaption></figure>
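<p>A plot in the spirit of the figure above can be sketched as follows. This is not the code behind that figure; the cost function, starting point and learning rate are assumptions chosen to keep the example short.</p>

```python
import matplotlib.pyplot as plt
import numpy as np


def cost(x):
    """Toy cost function with its minimum at x = 0 (an assumption)."""
    return x ** 2


def gradient(x):
    """Derivative of the toy cost function."""
    return 2 * x


# Plain gradient descent: step against the gradient ten times
x = 4.0
learning_rate = 0.3
path = [x]
for _ in range(10):
    x = x - learning_rate * gradient(x)
    path.append(x)
path = np.array(path)

# Plot the cost curve and overlay the points visited by the descent
xs = np.linspace(-4.5, 4.5, 200)
fig, ax = plt.subplots()
ax.plot(xs, cost(xs), label="cost(x) = x^2")
ax.plot(path, cost(path), "o-", color="red", label="gradient descent steps")
ax.annotate("start", xy=(path[0], cost(path[0])), xytext=(2.0, 18.0),
            arrowprops={"arrowstyle": "->"})
ax.set_xlabel("x")
ax.set_ylabel("cost")
ax.legend()
```

<p>In a Jupyter notebook the figure is rendered right below the cell.</p>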
<h3>Seaborn</h3>
<p>Built on top of matplotlib, Seaborn’s graphics are optimized for a more statistical comprehension of data. It can automatically display regression lines or Gaussian curve approximations of plotted data.</p>
<figure class="wp-block-image"><img src="http://www.sickgaming.net/blog/wp-content/uploads/2019/07/jupyter-and-data-science-in-fedora-2.png" alt="" /><figcaption>Linear regression visualised with Seaborn</figcaption></figure>
<h3><a href="https://www.statsmodels.org/">StatsModels</a></h3>
<p>StatsModels provides algorithms for statistical and econometric data analysis, such as linear and logistic regression. StatsModels is also home to the classical family of <a href="https://www.statsmodels.org/stable/examples/index.html#stats">time series algorithms</a> known as ARIMA.</p>
<figure class="wp-block-image"><img src="http://www.sickgaming.net/blog/wp-content/uploads/2019/07/jupyter-and-data-science-in-fedora-3.png" alt="" class="wp-image-28575" /><figcaption>Normalized number of passengers across time (blue) and ARIMA-predicted number of passengers (red)</figcaption></figure>
<h3>Scikit-learn</h3>
<p>The central piece of the machine-learning ecosystem, <a href="https://scikit-learn.org/stable/">scikit-learn</a> provides predictor algorithms for <a href="https://scikit-learn.org/stable/supervised_learning.html#supervised-learning">regression</a> (Elasticnet, Gradient Boosting, Random Forest etc.) and <a href="https://scikit-learn.org/stable/supervised_learning.html#supervised-learning">classification</a>, as well as clustering (K-means, DBSCAN etc.). It features a very well designed API. Scikit-learn also has classes for advanced data manipulation, splitting datasets into train and test parts, dimensionality reduction and data pipeline preparation.</p>
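<p>The sketch below strings several of those pieces together: a train/test split, a preprocessing step and a Random Forest classifier combined in a pipeline. The dataset is synthetic and the parameter values are arbitrary assumptions, so treat it as an outline rather than a tuned model.</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class dataset standing in for real data
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Hold out 25% of the rows to evaluate the model on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Pipeline: scale the features, then fit a Random Forest
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"accuracy on held-out data: {accuracy:.2f}")
```

<p>Every estimator exposes the same <em>fit</em>, <em>predict</em> and <em>score</em> methods, so swapping the classifier for another one usually means changing a single line.</p>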
<h3>XGBoost</h3>
<p>XGBoost is one of the most advanced regressors and classifiers in use today. It’s not part of scikit-learn, but it adheres to scikit-learn’s API. <a href="https://xgboost.ai">XGBoost</a> is not packaged for Fedora and should be installed with <em>pip</em>. <a href="https://xgboost.readthedocs.io/en/latest/gpu/index.html">XGBoost can be accelerated with your NVIDIA GPU</a>, but at the time of writing not through its <em>pip</em> package; for GPU support you have to compile it yourself against CUDA. Install the regular package with:</p>
<pre class="wp-block-preformatted">$ pip3 install xgboost --user</pre>
<h3>Imbalanced Learn</h3>
<p><a href="https://imbalanced-learn.readthedocs.io">imbalanced-learn</a> provides ways to under-sample and over-sample data. It is useful in fraud detection scenarios, where known fraud cases are very few compared to non-fraud cases. In these situations the known fraud data needs to be augmented so that it carries more weight when training predictors. Install it with <em>pip</em>:</p>
<pre class="wp-block-preformatted">$ pip3 install imbalanced-learn --user</pre>
<h3>NLTK</h3>
<p>The <a href="https://www.nltk.org">Natural Language Toolkit</a>, or NLTK, helps you work with human language data, for example to build chatbots.</p>
<h3>SHAP</h3>
<p>Machine learning algorithms are very good at predicting, but not at explaining why they made a prediction. <a href="https://github.com/slundberg/shap">SHAP</a> addresses that by analyzing trained models.</p>
<figure class="wp-block-image"><img src="http://www.sickgaming.net/blog/wp-content/uploads/2019/07/jupyter-and-data-science-in-fedora-4.png" alt="" /><figcaption>Where SHAP fits into the data analysis process</figcaption></figure>
<p>Install it with <em>pip</em>:</p>
<pre class="wp-block-preformatted">$ pip3 install shap --user</pre>
<h3><a href="https://keras.io">Keras</a></h3>
<p>Keras is a library for deep learning and neural networks. Install it with <em>pip</em>:</p>
<pre class="wp-block-preformatted">$ sudo dnf install python3-h5py<br />$ pip3 install keras --user</pre>
<h3><a href="https://www.tensorflow.org">TensorFlow</a></h3>
<p>TensorFlow is a popular library for building neural networks. Install it with <em>pip</em>:</p>
<pre class="wp-block-preformatted">$ pip3 install tensorflow --user<br /></pre>
<hr class="wp-block-separator" />
<p><em>Photo courtesy of <a href="https://www.flickr.com/photos/87249144@N08/">FolsomNatural</a> on <a href="https://www.flickr.com/photos/87249144@N08/45871861611/">Flickr</a> (CC BY-SA 2.0).</em></p>
</div>