February 12, 2024
Search "Artificial Intelligence" on Accounting Today and you'll get 2,018 hits. There's endless content around how AI will revolutionize accounting, yet large language models (LLMs) like ChatGPT still suffer from a control problem. Hallucinations and indeterminate outputs are key hurdles in implementing them as autonomous solutions to tasks, where the margin of error is slim.
Don't just take my word for it though. Let's go through a concrete example of vendor coding, automated from start to finish with the help of AI. We'll take a close look at this reliability problem, why it matters, and how we can overcome it.
In addition, this is meant to be a comprehensive tutorial to let you stand up your own robust vendor coding workflow. And when I say robust, I mean provable guarantees of accuracy. No sifting through thousands of transactions trying to find the AI's small mistakes by hand.
The only prerequisite here is the ability to execute a Python script on your computer. If you can't do this, I would highly recommend this video -- I promise it won't take more than 15 minutes to learn.
Okay, let's get started.
A lot of the time, bookkeeping boils down to identifying the vendor for a transaction. Accounting software has rule engines that can be configured to reliably output the expense type from the vendor and the amount.
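To make "rule engine" concrete, here's a hypothetical miniature one. The specific vendors, amounts, and expense accounts below are invented for illustration; real accounting software exposes the same idea through a configuration UI.

```python
# Hypothetical rule table: each rule maps a vendor plus an amount
# predicate to an expense account.
RULES = [
    ("Viasat", lambda amt: True, "Travel"),
    ("Taco Bell", lambda amt: amt < 200, "Meals & Entertainment"),
]

def categorize(vendor, amount):
    """Return the expense account for a transaction, or None if no
    rule applies and the transaction needs manual review."""
    for rule_vendor, applies, account in RULES:
        if vendor == rule_vendor and applies(amount):
            return account
    return None
```

The catch, as we'll see next, is that the `vendor` argument has to be clean for any of this to work.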
Sounds simple enough, so why is this a hard problem?
Details of a transaction are all packed into a single memo. The vendor passes some custom string of text to the bank, and the bank wraps it with some additional context. For example, if you swiped your business card at Taco Bell, your bank statement might report something like:
Card Purchase 8/29 Taco Bell #xxxxx Madison WI Card
08/29 TacoBell xxx-xxx-xxxx Madison WI Card
TACOB #xxxxx Madison WI Card xxxx
It's hard because neither banks nor vendors have a standardized way of reporting. This creates a maelstrom of edge cases, whose solutions are impossible to fully express in traditional code. Instead, we've historically solved this by handing the transactions to a bookkeeper with a fresh cup of coffee.
Let's look at a real example from a business that banks with Chase. These are 10 of the 9,827 total transactions we need to categorize:
Our goal is to pull out the clean vendor names: Lagos, Fortnite, Panda Express, Viasat, and so on.
As you can see, the exact method for extraction would be different for each of the 10 examples. Hard coding this would be futile, but it's a textbook use case for the flexible text processing capabilities of LLMs.
For this part, we'll be using the OpenAI API.
It's no different than passing a set of instructions along with the specific memo we want to process to ChatGPT on the web interface. Instead of having to do this by hand 10,000 times, we can create a simple Python script that does it for us.
Here's the set of instructions I used (see appendix for the full code):
You will be presented with a bank transaction memo that should have a vendor name in it. Please extract the vendor from the memo. Infer the proper company name (i.e. amzn should be Amazon), use proper capitalization (capital first letter then the rest lowercase until the next word), and ignore the location suffix. Return only the vendor and nothing else in your response -- no exposition.
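The appendix has the full script; here's a minimal sketch of the call itself, assuming the openai v1 Python SDK and an `OPENAI_API_KEY` set in the environment. The helper name is my own.

```python
# Sketch of the per-memo extraction call (see appendix for the full script).
INSTRUCTIONS = (
    "You will be presented with a bank transaction memo that should have "
    "a vendor name in it. Please extract the vendor from the memo. Infer "
    "the proper company name (i.e. amzn should be Amazon), use proper "
    "capitalization (capital first letter then the rest lowercase until "
    "the next word), and ignore the location suffix. Return only the "
    "vendor and nothing else in your response -- no exposition."
)

def extract_vendor(memo):
    """Send one memo to GPT-4 and return the cleaned vendor string."""
    from openai import OpenAI  # deferred import: only needed at call time
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # reduce run-to-run variation
        messages=[
            {"role": "system", "content": INSTRUCTIONS},
            {"role": "user", "content": memo},
        ],
    )
    return response.choices[0].message.content.strip()
```

Looping this over a CSV of memos is all the script really does.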
It's slower and more expensive, but I always start with GPT-4, the most powerful model available. My process is to use the greediest approach until the task is completed to a high standard, then figure out how to trim the fat later.
Below are the same 10 transactions processed with the script:
Pretty good right? Well, yes and no.
The upside is the extraction is nearly perfect. The downside is the memos themselves are noisy, which causes problems for downstream vendor-based rules. For example, say we have a rule that any transaction from Viasat is a travel expense. Because the extracted vendor is "Viasat In-Flight Wi-Fi", our rule wouldn't apply.
When it comes to bookkeeping, you're only as strong as your weakest link. If you still have to sift through all 9,827 transactions to find these slight mistakes, you're not actually saving that much time. This is a common obstacle for implementing LLMs -- engineering a workflow to make review efficient is essential.
So how can we do this for vendor coding?
GPT-4's high fidelity extraction gives us a list of 9,827 candidate vendors. What we need to do is match these potentially misspelled, truncated, or partially extracted candidates to a master list of vendors over which the downstream associations to expense accounts are defined.
If you don't already have a master list, there are two easy ways to start. Either produce it from completed examples of vendor coding you've done in the past, or search online for companies sorted by yearly revenue and export the first few thousand.
Off the bat, I got 4,197 perfect matches with my fairly short list of 1,000 standardized vendors. That's nearly half of the transactions! Even if we stopped here it would be a significant dent in the work, but we can do better. Let's use some more sophisticated techniques for the remaining 5,630 transactions.
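Those perfect matches come from a plain lookup against the master list. A minimal sketch (the case-insensitive comparison is my own assumption about what "perfect match" should tolerate):

```python
def exact_match(candidates, master_list):
    """Split extracted vendors into exact (case-insensitive) matches
    against the master list, and the leftovers for fuzzier methods."""
    lookup = {v.lower(): v for v in master_list}
    matched, unmatched = {}, []
    for candidate in candidates:
        hit = lookup.get(candidate.lower())
        if hit is not None:
            matched[candidate] = hit
        else:
            unmatched.append(candidate)
    return matched, unmatched
```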
If you've ever tried to correct data entry errors, you're probably familiar with the type of problem we need to solve here.
For simple typos, you can sometimes get away with using the edit distance between strings of text (also called Levenshtein distance) -- Tropi Quatics Pet Cente and Tropiquatics Pet Center have a low edit distance of 2. However, this fails if the elements of your master list are too close together under this metric. For example, AMC and ATT also have an edit distance of 2, which could be a source of false matches.
Though false positives are more concerning, the edit distance can also be a poor indicator of true matches. For example, a town council might be abbreviated TC -- the edit distance is 10, even though the match would be obvious to any human assigned the task.
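For reference, the Levenshtein distance used in both examples above can be computed with the textbook dynamic program:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,        # delete ca
                curr[j - 1] + 1,    # insert cb
                prev[j - 1] + cost, # substitute ca -> cb
            ))
        prev = curr
    return prev[-1]
```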
Let's look at an LLM-powered string comparison method that blows edit distance out of the water. I'm convinced variants of this technique will be an absolute game changer for bookkeeping in the next few years.
LLMs like GPT-4 are trained to be fill-in-the-blank machines -- huge swaths of text from the internet are randomly masked and the LLM's goal is to correctly guess the hidden words. To succeed, the LLM is forced to construct a well organized "mental model" of the world that helps it generate the prediction.
Unlike with humans, we can directly access the mental model of a trained LLM -- it's just a collection of numbers saved in memory.
Vector embedding describes the process of mathematically embedding a string of text into the mental model of an LLM. Conceptually similar strings are mathematically close together in this embedding space. Thus, we can use it just like the edit distance to assess how far apart two pieces of text are in meaning. For ease of explanation, let's call this the semantic distance metric.
For each of the remaining 5,630 extracted vendors, I computed the best match from the master list using the semantic distance (see script in appendix). I listed each match along with the corresponding level of similarity (between 0 and 1).
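Under the hood this is just cosine similarity between embedding vectors. A sketch of the matching step, assuming the openai v1 SDK; the embedding model name is an assumption, and the `best_match` helper is my own:

```python
import math

def cosine_similarity(u, v):
    """Similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_match(candidate_vec, master_vecs):
    """Return (index, similarity) of the closest master-list vendor."""
    return max(
        ((i, cosine_similarity(candidate_vec, v))
         for i, v in enumerate(master_vecs)),
        key=lambda pair: pair[1],
    )

def embed(texts, model="text-embedding-ada-002"):
    """Fetch embedding vectors for a batch of strings."""
    from openai import OpenAI  # deferred import: only needed at call time
    client = OpenAI()
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]
```

Embed the master list once, then run `best_match` for each leftover candidate.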
Just like the edit distance, we must choose a threshold to accept a match. But how do we know what threshold to choose? Below is a plot of accuracy and number of transactions categorized as a function of the threshold chosen.
For every choice of threshold between 0 and 1, the blue line shows the number of automatically categorized transactions, and the red line shows the corresponding accuracy. The power of this technique is we can get statistical guarantees on the margin of error based on the threshold we use.
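A curve like this can be produced from a hand-labeled validation sample. In the sketch below, `matches` is a hypothetical list of `(similarity, was_correct)` pairs; for each candidate threshold we count how many matches would be auto-accepted and what fraction of those are right:

```python
def sweep_thresholds(matches, thresholds):
    """For each threshold, return (threshold, number of matches accepted,
    accuracy among accepted matches) from a labeled validation sample."""
    results = []
    for t in thresholds:
        accepted = [was_correct for sim, was_correct in matches if sim >= t]
        n = len(accepted)
        accuracy = sum(accepted) / n if n else float("nan")
        results.append((t, n, accuracy))
    return results
```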
The black dotted lines show that if we choose 0.78 as our threshold, we would categorize 1,479 additional transactions at a 99% accuracy. That means we're down to 4,151 or 42% of the transactions we started with.
99% accuracy is not only on par with human error rates, but because the matching is semantic, the 14 incorrect transactions were all matched to closely related vendors. For example, "Jet's Pizza" was misclassified as "Domino's Pizza."
There are all sorts of creative ways we can leverage the above threshold to confidently automate the vendor coding process. If we care more about categorizing large transactions correctly, we can set different thresholds based on the amount of the transaction.
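For instance, a tiered policy might demand more similarity before auto-categorizing a big-ticket transaction. The tiers below are made up for illustration:

```python
def threshold_for_amount(amount):
    """Hypothetical tiered policy: require higher match similarity
    before auto-categorizing larger transactions."""
    if amount >= 10_000:
        return 0.95
    if amount >= 1_000:
        return 0.85
    return 0.78  # the base threshold from the plot above
```

A mistake on a $12 lunch is cheap to fix later; a mistake on a $50,000 equipment purchase is not.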
Ultimately, the key insight here is that we've achieved agency over the certainty with which our AI bookkeeper makes decisions.
In this toy example, we're left with 4,151 transactions to categorize by hand. This number would undoubtedly have been lower if our vendor list were more complete, but it's a rock-solid 58% reduction in the amount of work done by hand. With some optimization and more historical transactions, this could easily turn into 90% of transactions instantly and reliably categorized with no need for human intervention. That's an order of magnitude less time and money spent.
This is how LLMs will be successfully woven into accounting processes at large. They don't need to handle everything, but the tasks they do take on should be executed with absolute confidence.
We can make the above workflow faster, cheaper, and less error prone. I won't go into too much detail, but I want to give a sense of what levers you might pull for optimization.
GPT-4 costs $0.03/1k prompt tokens and $0.06/1k sampled tokens. If you're not familiar, tokens are the way LLMs break down their inputs. It's just a different unit like characters or words. I use this simple calculator to determine the number of tokens in a block of text.
For the example we went through, the prompt comes out to about 125 tokens. The output vendor should be no more than 10 tokens. For 10,000 transactions, we have:
10,000 txns * 125 tokens/txn = 1,250,000 prompt tokens * $0.03/1k tokens = $37.50.
10,000 txns * 10 tokens/txn = 100,000 sampled tokens * $0.06/1k tokens = $6.00.
That's a total of $43.50 for all 10,000 transactions. If we use GPT-3.5 Turbo instead, which costs $0.0005/1k prompt tokens and $0.0015/1k sampled tokens, we get a total of $0.78. That's more than fifty times cheaper!
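This arithmetic is easy to script so you can re-price a workflow as models and prompts change. A small helper, with rates in dollars per 1k tokens:

```python
def api_cost(n_txns, prompt_tokens, sampled_tokens, prompt_rate, sampled_rate):
    """Total dollar cost of n_txns calls; rates are dollars per 1k tokens."""
    per_txn = (prompt_tokens * prompt_rate + sampled_tokens * sampled_rate) / 1000
    return n_txns * per_txn

gpt4_cost = api_cost(10_000, 125, 10, prompt_rate=0.03, sampled_rate=0.06)
gpt35_cost = api_cost(10_000, 125, 10, prompt_rate=0.0005, sampled_rate=0.0015)
```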
You can shave costs down even further by going with open source models instead of a model provider like OpenAI. In this case, the unit economics are good even with GPT-4, but it makes a difference if you further subdivide tasks and use more expensive, longer prompts.
Astonishingly, the embeddings of all 11,000 pieces of text cost <$0.01!
In the example, we performed the vendor extraction with a single prompt. It's often possible to get better results by splitting tasks into simpler subcomponents. Here's an alternative workflow we could have tested:
Doing all that now takes us to four OpenAI calls, where things can start to get expensive. There's rich space for exploration here, and many companies are developing tools to test and compare different workflows and prompting techniques.
The script for our example is set up to run serially. It took about a second to process a transaction on my machine, so that's around 3 hours for the full 10,000 transactions.
We can make this faster by:
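One of the biggest levers is issuing the API calls concurrently -- the work is I/O-bound, so ordinary Python threads are enough. A sketch, where `extract_fn` is whatever per-memo function you already have (mind your API rate limits when picking `max_workers`):

```python
from concurrent.futures import ThreadPoolExecutor

def extract_all(memos, extract_fn, max_workers=16):
    """Apply extract_fn to every memo concurrently, preserving order.
    Since each call mostly waits on the network, threads give a
    near-linear speedup up to the API's rate limit."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_fn, memos))
```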
I've made all the scripts I used in this tutorial available for use. Here they are: