{"id":135390,"date":"2025-12-23T09:15:44","date_gmt":"2025-12-23T09:15:44","guid":{"rendered":"https:\/\/blog.finxter.com\/?p=1671622"},"modified":"2025-12-23T09:15:44","modified_gmt":"2025-12-23T09:15:44","slug":"stop-testing-llms-with-poetry-use-blackjack-instead","status":"publish","type":"post","link":"https:\/\/sickgaming.net\/blog\/2025\/12\/23\/stop-testing-llms-with-poetry-use-blackjack-instead\/","title":{"rendered":"Stop Testing LLMs with Poetry: Use Blackjack Instead"},"content":{"rendered":"<p class=\"has-base-2-background-color has-background\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f64f.png\" alt=\"\ud83d\ude4f\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\" \/> <strong>Image and research source<\/strong>: Thomas Taylor (<a href=\"https:\/\/github.com\/thomasgtaylor\/llm21\">GitHub<\/a>)<\/p>\n<p>If you want to see what an LLM is really good at (and where it still slips), don\u2019t ask it to write a poem or generate code. Ask it to make the same small decision again and again under clear rules.<\/p>\n<p><strong>That is why blackjack basic strategy is such a useful lens.<\/strong><\/p>\n<p>Basic strategy is essentially a decision table. Given your hand and the dealer\u2019s upcard, there is one best move for a given rule set: hit, stand, double, split, or surrender. It is not a vibe. 
It is a lookup problem.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><a href=\"https:\/\/x.com\/FinxterDotCom\/status\/2002478044414677196\" target=\"_blank\" rel=\" noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"710\" height=\"748\" src=\"https:\/\/blog.finxter.com\/wp-content\/uploads\/2025\/12\/image-43.png\" alt=\"\" class=\"wp-image-1671623\" style=\"aspect-ratio:0.949216628368498;width:710px;height:auto\" srcset=\"https:\/\/blog.finxter.com\/wp-content\/uploads\/2025\/12\/image-43.png 710w, https:\/\/blog.finxter.com\/wp-content\/uploads\/2025\/12\/image-43-285x300.png 285w\" sizes=\"auto, (max-width: 710px) 100vw, 710px\" \/><\/a><\/figure>\n<\/div>\n<p>So you would expect modern models to nail it. And some do. But what makes this benchmark interesting is not \u201cwho got the highest score.\u201d It is how the models fail.<\/p>\n<h3 class=\"wp-block-heading\">The result that matters is not the winner, it is the pattern of mistakes<\/h3>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"509\" src=\"https:\/\/blog.finxter.com\/wp-content\/uploads\/2025\/12\/image-45-1024x509.png\" alt=\"\" class=\"wp-image-1671625\" srcset=\"https:\/\/blog.finxter.com\/wp-content\/uploads\/2025\/12\/image-45-1024x509.png 1024w, https:\/\/blog.finxter.com\/wp-content\/uploads\/2025\/12\/image-45-300x149.png 300w, https:\/\/blog.finxter.com\/wp-content\/uploads\/2025\/12\/image-45-768x382.png 768w, https:\/\/blog.finxter.com\/wp-content\/uploads\/2025\/12\/image-45.png 1455w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<p><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a1.png\" alt=\"\u26a1\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\" \/> Check out Thomas&#8217; Page: <a 
href=\"https:\/\/thomasgtaylor.com\/blackjack\/\">https:\/\/thomasgtaylor.com\/blackjack\/<\/a><\/p>\n<p>When models get decisions wrong in blackjack, they do not usually fail randomly. They tend to develop a consistent style of mistakes.<\/p>\n<p>One model might double too often. Another might be overly cautious and miss good doubles. Another might surrender in spots where it should fight on. That is a big deal because it mirrors what many developers see in real products: the model is mostly reliable, but it has a few recurring blind spots.<\/p>\n<p>This is the key point for builders. LLMs do not fail like buggy programs. They fail like policies with systematic biases.<\/p>\n<h3 class=\"wp-block-heading\">Accuracy and outcomes are not the same thing<\/h3>\n<p>The benchmark tracks two things that people often confuse:<\/p>\n<ul class=\"wp-block-list\">\n<li>decision accuracy: did the model pick the basic strategy move?<\/li>\n<li>outcome: did the bankroll go up or down over the run?<\/li>\n<\/ul>\n<p>These can diverge. Blackjack has asymmetric payouts. A single bad double can hurt more than a small hit\/stand mistake because doubling puts twice the original bet at risk. And over a limited number of hands, luck still matters. So a model that is slightly less accurate can end up with a better balance simply because variance went its way.<\/p>\n<p>This is not just gambling trivia. It is a reminder that your evaluation metric shapes what looks \u201cbest.\u201d If your product cares about costly failures, you should measure cost-weighted errors, not just raw accuracy.<\/p>\n<h3 class=\"wp-block-heading\">Why this matters outside blackjack<\/h3>\n<p>A blackjack hand is a tiny state with a clear action set. 
Software is full of the same structure:<\/p>\n<ul class=\"wp-block-list\">\n<li>incident triage rules<\/li>\n<li>retry and backoff policies<\/li>\n<li>access control and permissions<\/li>\n<li>billing and pricing logic<\/li>\n<li>feature rollout rules<\/li>\n<li>compliance checks<\/li>\n<\/ul>\n<p>In all of these, you often have clear policies you want followed. If a model struggles to consistently follow a small decision table, it is likely to drift when it is asked to follow your company\u2019s rules unless you design around that.<\/p>\n<h3 class=\"wp-block-heading\">The better mental model: LLMs behave like learned heuristics<\/h3>\n<p>A traditional program executes rules. A plain LLM often imitates rules and sometimes improvises. That is why each model shows its own \u201cerror personality.\u201d The model is not just retrieving the correct table cell every time. It is applying a learned pattern that is usually right and occasionally biased.<\/p>\n<p>This is the important angle for the Finxter community: treat the model like a policy learner, not a calculator.<\/p>\n<h3 class=\"wp-block-heading\">What to do with this insight<\/h3>\n<p>The engineering move is not to argue with the model harder. It is to change the shape of the task so the model cannot drift.<\/p>\n<p>A few practical approaches:<\/p>\n<ul class=\"wp-block-list\">\n<li>Put the strategy table in code and have the model call it.<\/li>\n<li>If you keep it in the prompt, force a structured lookup format and validate the output.<\/li>\n<li>Log mistakes by category (too many doubles, early surrenders, split errors) because that tells you what to fix.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\">A simple Finxter challenge you can copy<\/h3>\n<p>The real win here is not blackjack itself. It is the idea of a small, repeatable benchmark.<\/p>\n<p>Pick any domain where ground truth exists as a clear set of rules or a decision table. Generate many reproducible test cases. 
Score both accuracy and cost-weighted outcomes. Then look for recurring error patterns, not just the overall score.<\/p>\n<p>That gives you something far more useful than \u201cmodel A feels smarter than model B.\u201d It tells you how a model behaves under repetition, which is what matters when you are building real systems.<\/p>\n<p class=\"has-base-2-background-color has-background\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2728.png\" alt=\"\u2728\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\" \/> <strong><a href=\"https:\/\/blog.finxter.com\/ai\/\">Join the Finxter AI Newsletter<\/a><\/strong> to be on the right side of change &#8211; with 130k readers!<\/p>\n<p>The post <a href=\"https:\/\/blog.finxter.com\/stop-testing-llms-with-poetry-use-blackjack-instead\/\">Stop Testing LLMs with Poetry: Use Blackjack Instead<\/a> appeared first on <a href=\"https:\/\/blog.finxter.com\">Be on the Right Side of Change<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Image and research source: Thomas Taylor (GitHub) If you want to see what an LLM is really good at (and where it still slips), don\u2019t ask it to write a poem or generate code. Ask it to make the same small decision again and again under clear rules. 
That is why blackjack basic strategy is [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2,857],"tags":[73,468,528],"class_list":["post-135390","post","type-post","status-publish","format-standard","hentry","category-games","category-python-tut","tag-programming","tag-python","tag-tutorial"],"_links":{"self":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/135390","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/comments?post=135390"}],"version-history":[{"count":0,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/135390\/revisions"}],"wp:attachment":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/media?parent=135390"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/categories?post=135390"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/tags?post=135390"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}