Fedora - Spam Classification with ML-Pack - xSicKxBot - 07-21-2020

Spam Classification with ML-Pack <div><h2 id="introduction">Introduction</h2> <p><a href="https://mlpack.org">ML-Pack</a> is a small footprint C++ machine learning library that can be easily integrated into other programs. It is an actively developed open source project, released under a <a href="https://github.com/mlpack/mlpack/blob/master/LICENSE.txt">BSD-3 license</a>. Machine learning has gained popularity due to the large amount of electronic data that can now be collected. Other popular machine learning frameworks include <a href="https://www.tensorflow.org/">TensorFlow</a>, <a href="https://mxnet.apache.org/">MxNet</a>, <a href="https://pytorch.org/">PyTorch</a>, <a href="https://chainer.org/">Chainer</a> and <a href="http://paddlepaddle.org/">Paddle Paddle</a>; however, these are designed for more complex workflows than ML-Pack. On Fedora, ML-Pack is packaged by its lead developer <a href="https://koji.fedoraproject.org/koji/packageinfo?packageID=15021">Ryan Curtin</a>. In addition to a command line interface, ML-Pack has bindings for <a href="https://www.python.org/">Python</a> and <a href="https://julialang.org/">Julia</a>. Here, we will focus on the command line interface, since this may be useful for system administrators to integrate into their workflows.</p> <h2 id="installation">Installation</h2> <p>You can install ML-Pack on the Fedora command line using</p> <pre class="wp-block-preformatted">$ sudo dnf -y install mlpack mlpack-bin</pre> <p>You can also install the documentation, development headers and Python bindings by using</p> <pre class="wp-block-preformatted">$ sudo dnf -y install mlpack-doc \
   mlpack-devel mlpack-python3</pre> <p>though they will not be used in this introduction.</p> <h2 id="example">Example</h2> <p>As an example, we will train a machine learning model to classify spam SMS messages. To keep this article brief, Linux commands will not be fully explained, but you can find out more about them by using the man command; for example, for the first command used below, <i>wget</i>,</p> <pre class="wp-block-html">$ man wget</pre> <p>will tell you that <i>wget</i> downloads files from the web and list the options you can use with it.</p> <h3 id="getdata">Get a dataset</h3> <p>We will use an example spam dataset in Indonesian provided by Yudi Wibisono:</p> <pre class="wp-block-preformatted">$ wget https://drive.google.com/file/d/1-stKadfTgJLtYsHWqXhGO3nTjKVFxm_Q/view
$ unzip dataset_sms_spam_bhs_indonesia_v1.zip</pre>
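<p>Before pre-processing, it is worth checking what was downloaded. The commands below are a minimal sketch, not part of the original article; they assume the archive unpacks to <i>dataset_sms_spam_v1.csv</i>, the file name used in the next section:</p> <pre class="wp-block-preformatted">$ unzip -l dataset_sms_spam_bhs_indonesia_v1.zip
$ file dataset_sms_spam_v1.csv
$ head -n 3 dataset_sms_spam_v1.csv</pre> <p>The <i>file</i> command should report carriage-return line terminators, which is why the first pre-processing step below rewrites the line endings.</p>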
<h3 id="preprocessdata">Pre-process dataset</h3> <p>We will try to classify a message as spam or ham by the number of occurrences of each word in the message. We first change the file line endings, remove line 243, which is missing a label, and then remove the header from the dataset. Then, we split our data into two files, labels and messages. Since the label is at the end of each message, the message is reversed, the label is removed and placed in one file, and the remainder of the message is placed in another file.</p> <pre class="wp-block-html">$ tr '\r' '\n' < dataset_sms_spam_v1.csv > dataset.txt
$ sed '243d' dataset.txt > dataset1.csv
$ sed '1d' dataset1.csv > dataset.csv
$ rev dataset.csv | cut -c1 | rev > labels.txt
$ rev dataset.csv | cut -c2- | rev > messages.txt
$ rm dataset.csv
$ rm dataset1.csv
$ rm dataset.txt</pre> <p>Machine learning works on numeric data, so we will use labels of 0 for ham and 1 for spam. The dataset contains three labels: 0, normal SMS (ham); 1, fraud (spam); and 2, promotion (spam). We will label all spam as 1, so both fraud and promotion messages are labelled as 1:</p> <pre class="wp-block-html">$ tr '2' '1' < labels.txt > labels.csv
$ rm labels.txt</pre> <p>The next step is to convert all text in the messages to lower case and, for simplicity, remove punctuation and any symbols that are not spaces, line endings or in the range a-z (one would need to expand this range of symbols for production use):</p> <pre class="wp-block-html">$ tr '[:upper:]' '[:lower:]' < \
   messages.txt > messagesLower.txt
$ tr -Cd 'abcdefghijklmnopqrstuvwxyz \n' < \
   messagesLower.txt > messagesLetters.txt
$ rm messagesLower.txt</pre> <p>We now obtain a sorted list of unique words used (this step may take a few minutes, so use nice to give it a low priority while you continue with other tasks on your computer):</p> <pre class="wp-block-html">$ nice -20 xargs -n1 < messagesLetters.txt > temp.txt
$ sort temp.txt > temp2.txt
$ uniq temp2.txt > words.txt
$ rm temp.txt
$ rm temp2.txt</pre> <p>We then create a matrix where, for each message, the frequency of word occurrences is counted (more on this on Wikipedia, <a href="https://en.wikipedia.org/wiki/Tf–idf">here</a> and <a href="https://en.wikipedia.org/wiki/Document-term_matrix">here</a>). This requires a few lines of code, so the full script, which should be saved as ‘makematrix.sh’, is below:</p> <pre class="wp-block-html">#!/bin/bash
declare -a words=()
declare -a letterstartind=()
declare -a letterstart=()
letter=" "
i=0
lettercount=0
while IFS= read -r line; do
    labels[$((i))]=$line
    let "i++"
done < labels.csv
i=0
# Record where each starting letter begins in the sorted word list,
# so that later searches only scan words with a matching first letter
while IFS= read -r line; do
    words[$((i))]=$line
    firstletter="$( echo $line | head -c 1 )"
    if [ "$firstletter" != "$letter" ]
    then
        letterstartind[$((lettercount))]=$((i))
        letterstart[$((lettercount))]=$firstletter
        letter=$firstletter
        let "lettercount++"
    fi
    let "i++"
done < words.txt
letterstartind[$((lettercount))]=$((i))
echo "Created list of letters"
touch wordfrequency.txt
rm wordfrequency.txt
touch wordfrequency.txt
messagecount=0
messagenum=0
messages="$( wc -l messages.txt )"
i=0
while IFS= read -r line; do
    let "messagenum++"
    declare -a wordcount=()
    declare -a wordarray=()
    read -r -a wordarray <<< "$line"
    # Count occurrences of each vocabulary word in this message,
    # restricting the search range by first letter
    for word in "${wordarray[@]}"; do
        firstletter="$( echo $word | head -c 1 )"
        for ((j=0; j<lettercount; j++)); do
            if [ "${letterstart[$j]}" == "$firstletter" ]
            then
                for ((k=letterstartind[j]; k<letterstartind[j+1]; k++)); do
                    if [ "${words[$k]}" == "$word" ]
                    then
                        let "wordcount[$k]++"
                    fi
                done
            fi
        done
    done
    # Append one row of word counts for this message
    row=""
    for ((k=0; k<${#words[@]}; k++)); do
        row+="${wordcount[$k]:-0} "
    done
    echo "${row% }" >> wordfrequency.txt
    echo "Processed message ""$messagenum"
    let "i++"
done < messagesLetters.txt
# Create csv file
tr ' ' ',' < wordfrequency.txt > data.csv</pre> <p>Since <a href="https://www.gnu.org/software/bash/">Bash</a> is an interpreted language, this simple implementation can take up to 30 minutes to complete.</p>
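<p>If you would rather not wait that long, the same document-term matrix can be built considerably faster with awk, which reads each file in a single pass. The script below is a sketch rather than part of the original workflow; it assumes the <i>words.txt</i> and <i>messagesLetters.txt</i> files produced above and writes <i>data.csv</i> directly. Save it as ‘makematrix.awk’:</p> <pre class="wp-block-html">#!/usr/bin/awk -f
# Pass 1 (words.txt): assign each vocabulary word a column number.
FNR == NR { col[$1] = FNR; nwords = FNR; next }
# Pass 2 (messagesLetters.txt): print one CSV row of word counts per message.
{
    split("", count)
    for (i = 1; i <= NF; i++)
        if ($i in col)
            count[col[$i]]++
    row = ""
    for (j = 1; j <= nwords; j++)
        row = row (j > 1 ? "," : "") (count[j] ? count[j] : 0)
    print row
}</pre> <p>then run it with</p> <pre class="wp-block-html">$ awk -f makematrix.awk words.txt messagesLetters.txt > data.csv</pre>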
<p>If using the above Bash script on your primary workstation, run it as a task with low priority so that you can continue with other work while you wait:</p> <pre class="wp-block-html">$ nice -20 bash makematrix.sh</pre> <p>Once the script has finished running, split the data into testing (30%) and training (70%) sets:</p> <pre class="wp-block-html">$ mlpack_preprocess_split \
   --input_file data.csv \
   --input_labels_file labels.csv \
   --training_file train.data.csv \
   --training_labels_file train.labels.csv \
   --test_file test.data.csv \
   --test_labels_file test.labels.csv \
   --test_ratio 0.3 \
   --verbose</pre> <h3 id="trainmodel">Train a model</h3> <p>Now train a <a href="https://mlpack.org/doc/mlpack-3.3.1/cli_documentation.html#logistic_regression">logistic regression model</a>:</p> <pre class="wp-block-html">$ mlpack_logistic_regression \
   --training_file train.data.csv \
   --labels_file train.labels.csv --lambda 0.1 \
   --output_model_file lr_model.bin</pre> <h3 id="testmodel">Test the model</h3> <p>Finally, we test our model by producing predictions,</p> <pre class="wp-block-html">$ mlpack_logistic_regression \
   --input_model_file lr_model.bin \
   --test_file test.data.csv \
   --output_file lr_predictions.csv</pre> <p>and comparing the predictions with the exact results:</p> <pre class="wp-block-html">$ export incorrect=$(diff -U 0 lr_predictions.csv \
   test.labels.csv | grep '^@@' | wc -l)
$ export tests=$(wc -l < lr_predictions.csv)
$ echo "scale=2; 100 * ( 1 - $((incorrect)) \
   / $((tests)))" | bc</pre> <p>This gives a validation rate of approximately 90%, similar to that obtained <a href="https://towardsdatascience.com/spam-detection-with-logistic-regression-23e3709e522">here</a>.</p>
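<p>Note that counting <i>diff</i> hunks only approximates the number of mispredictions: with zero lines of context, consecutive mismatched lines are merged into a single hunk, so the figure above can be slightly optimistic. For an exact per-line accuracy, a one-line sketch using the same two files is:</p> <pre class="wp-block-html">$ paste -d, lr_predictions.csv test.labels.csv | \
   awk -F, '$1 == $2 { correct++ } END { printf "%.2f\n", 100 * correct / NR }'</pre>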
<p>The dataset is composed of approximately 50% spam messages, so the validation rates are quite good without doing much parameter tuning. In typical cases, datasets are unbalanced, with many more entries in some categories than in others. In these cases a good overall validation rate can be obtained even while consistently mispredicting the classes with few entries. To better evaluate such models, one can compare the number of misclassifications of spam and the number of misclassifications of ham. Of particular importance in applications is the number of false positives, that is, ham messages incorrectly classified as spam, since these are typically not delivered. The script below produces a confusion matrix, which gives a better indication of misclassification. Save it as ‘confusion.sh’:</p> <pre class="wp-block-html">#!/bin/bash
declare -a labels
declare -a lr
i=0
while IFS= read -r line; do
    labels[i]=$line
    let "i++"
done < test.labels.csv
i=0
while IFS= read -r line; do
    lr[i]=$line
    let "i++"
done < lr_predictions.csv
TruePositiveLR=0
FalsePositiveLR=0
TrueZeroLR=0
FalseZeroLR=0
Positive=0
Zero=0
for i in "${!labels[@]}"; do
    if [ "${labels[$i]}" == "1" ]
    then
        let "Positive++"
        if [ "${lr[$i]}" == "1" ]
        then
            let "TruePositiveLR++"
        else
            let "FalseZeroLR++"
        fi
    fi
    if [ "${labels[$i]}" == "0" ]
    then
        let "Zero++"
        if [ "${lr[$i]}" == "0" ]
        then
            let "TrueZeroLR++"
        else
            let "FalsePositiveLR++"
        fi
    fi
done
echo "Logistic Regression"
echo "Total spam" $Positive
echo "Total ham" $Zero
echo "Confusion matrix"
echo "               Predicted class"
echo "                Ham | Spam"
echo "               ---------------"
echo " Actual| Ham  | " $TrueZeroLR " | " $FalsePositiveLR
echo " class | Spam | " $FalseZeroLR " | " $TruePositiveLR
echo ""</pre> <p>then run the script</p> <pre class="wp-block-html">$ bash confusion.sh</pre> <p>You should get output similar to</p> <p>Logistic Regression<br />Total spam 183<br />Total ham 159<br />Confusion matrix</p> <figure class="wp-block-table"> <table> <tbody> <tr> <td> </td> <td> </td> <td colspan="2">Predicted class</td> </tr> <tr> <td> </td> <td> </td> <td>Ham</td> <td>Spam</td> </tr> <tr> <td rowspan="2">Actual class</td> <td>Ham</td> <td>128</td> <td>31</td> </tr> <tr> <td>Spam</td> <td>26</td> <td>157</td> </tr> </tbody> </table> </figure> <p>which indicates a reasonable level of classification. Other methods you can try in ML-Pack for this problem include <a href="https://mlpack.org/doc/mlpack-3.3.1/cli_documentation.html#nbc">Naive Bayes</a>, <a href="https://mlpack.org/doc/mlpack-3.3.1/cli_documentation.html#random_forest">random forest</a>, <a href="https://mlpack.org/doc/mlpack-3.3.1/cli_documentation.html#decision_tree">decision tree</a>, <a href="https://mlpack.org/doc/mlpack-3.3.1/cli_documentation.html#adaboost">AdaBoost</a> and <a href="https://mlpack.org/doc/mlpack-3.3.1/cli_documentation.html#perceptron">perceptron</a>.</p> <p>To improve the error rate, you can try other pre-processing methods on the initial dataset. Neural networks can give up to 99.95% validation rates; see for example <a href="https://thesai.org/Downloads/Volume11No1/Paper_67-The_Impact_of_Deep_Learning_Techniques.pdf">here</a>, <a href="https://www.kaggle.com/kredy10/simple-lstm-for-text-classification">here</a> and <a href="https://www.kaggle.com/xiu0714/sms-spam-detection-bert-acc-0-993">here</a>. However, using these techniques with ML-Pack cannot be done from the command line interface at present and is best covered in another post.</p> <p>For more on ML-Pack, please see the <a href="https://mlpack.org/docs.html">documentation</a>.</p> </div>

https://www.sickgaming.net/blog/2020/07/20/spam-classification-with-ml-pack/