The conversation around AI is dominated by technology. But to use bots and AI agents well, something far less glamorous is required. For those of us working in analytics, the best preparation isn’t a new platform or framework – it’s returning to an old, often postponed task: documentation.
Most people are by now aware of large language models (LLMs) and access them via chat interfaces. They know about hallucinations and wrong answers – as well as how easy it is, intentionally or not, to coax out an answer we like. Technological solutions to these problems are heavily hyped.
This stands somewhat in opposition to the world view of analysts, data-oriented people and engineers, who like to find answers in large datasets and reason around numbers. These sources of truth should obviously be of great utility in helping LLMs become more correct.
As everyone gets used to getting answers (correct or not) from their chatbots, they will want to ask for the latest sales figure rather than look up a report or dig into a database. Other AI systems will require access to this data as well (see the latest buzzwords of Agentic AI and protocols like MCP). Whether you board the hype train or not, the expectations will come.
So how do we start preparing for that, and what should we look out for?
AN EXPLOSION OF TOOLS, PRODUCTS AND PROMISES
One perspective is the technical one. Many products and platforms – and much hype – are popping up, with a bewildering amount of discussion surrounding integrations and advanced AI platforms that manage much more than just LLMs.
Technical protocols such as MCP, query routing, selecting the right models, and AI engineering in general are important concepts under discussion at the moment.
WORDS ARE IMPORTANT
We must bear in mind that no matter how advanced the technology becomes, agents and LLMs still operate in the domain of words.
But what do words mean?
This is not a new challenge for us in data engineering or analytics. Imagine a meeting where a salesperson and someone from accounting are arguing over last year’s sales figures. The salesperson counts what he registered in the sales system when he talked to the client. The accounting person might only count invoiced items that were finally signed and paid for. They attach different concepts to the same term or word. The keyword here is “semantics”, which on a philosophical level deals with the meaning of words.
Between humans, we resolve confusions like the one above by talking to each other and aligning through micro-discussions. We come to understand that we refer to slightly different meanings.
But large language models, as of today, have no concept of meaning. They are just statistical models built on words. So for them to be effective, we have to explain everything precisely. We can pass on what things mean by describing it in words, or override an agent’s usual behaviour (“if asked for sales, please ask for a specification”). But if our description is just “Here you find sales”, it will not do a great job.
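As a minimal sketch of what such an override could look like in practice – the prompt wording and the call_llm helper are hypothetical placeholders, not any specific product’s API:

    # Sketch: steering an agent's behaviour with words alone.
    # "call_llm" is a stub standing in for whatever chat API you use.
    SYSTEM_PROMPT = """You answer questions about company data.
    The term "sales" is ambiguous in our organization:
    - registered sales: deals entered in the sales system
    - invoiced sales: items invoiced, signed and paid for
    If the user asks for "sales" without saying which one,
    ask them to specify before answering."""

    def call_llm(system: str, user: str) -> str:
        raise NotImplementedError("swap in your actual chat-model client")

    def answer(question: str) -> str:
        # Every question travels together with the semantic instructions.
        return call_llm(system=SYSTEM_PROMPT, user=question)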
DOCUMENTATION
The solution is not some new complex magic, but something we all probably have on the backlog from yesterday:
Just document what every dataset and field actually means.
Consider the example above: an LLM will not understand what we mean by a field called “sales inc” if it isn’t specified anywhere. Just adding a description gives it a fighting chance to answer a question with the intended sales number.
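As a minimal, hypothetical sketch of what such a description could look like as machine-readable metadata – the field name is from the example, but the wording of the description is invented purely for illustration:

    # Hypothetical documentation for an ambiguously named field.
    # The description below is one possible intended meaning,
    # invented for illustration only.
    COLUMN_DOCS = {
        "sales inc": (
            "Invoiced sales including VAT. Counts only invoices that "
            "were signed and paid; deals registered in the sales system "
            "but not yet invoiced are excluded."
        ),
    }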
This can be done with prompt engineering, such as providing the description in the query (“sales inc means...”). You can put it in the reports, or, ideally, integrate it into a compliance/lineage system such as Purview. There is no one-size-fits-all solution.
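As a sketch of the prompt-engineering variant, reusing the hypothetical COLUMN_DOCS dictionary from above and leaving the actual model call out of scope:

    # Sketch: inject the field descriptions directly into the prompt,
    # so the model sees the definitions before the question.
    def build_prompt(question: str, column_docs: dict[str, str]) -> str:
        glossary = "\n".join(
            f'- "{name}" means: {description}'
            for name, description in column_docs.items()
        )
        return f"Field definitions:\n{glossary}\n\nQuestion: {question}"

    print(build_prompt("What was sales inc last year?", COLUMN_DOCS))

The report and lineage-system variants achieve the same thing by storing the description where the tooling already looks.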
The trick here is that you can ask an AI to document your system, summarize descriptions and take a guess. It might be able to figure out that your sales definition corresponds to typical sales processes or financial standards. But for an unclear “sales inc” it will likely guess wrong, so you will still need to review and correct it.
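As a rough sketch of that bootstrapping step, reusing the call_llm placeholder from the earlier sketch (the prompt wording is illustrative):

    # Sketch: let an LLM draft first-pass descriptions for undocumented
    # columns, flagging ambiguous names for human review.
    def draft_descriptions(table: str, columns: list[str]) -> str:
        prompt = (
            f"Table '{table}' has these undocumented columns: "
            f"{', '.join(columns)}. Suggest a one-sentence description "
            "for each, and explicitly flag any name you find ambiguous "
            "so a human can confirm or correct the guess."
        )
        return call_llm(system="You document data warehouses.", user=prompt)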
POINTS TO CONSIDER
Though providing meaningful descriptions is a good idea, it entails a few pitfalls for us data-centric people.
CONCLUSION – DOCUMENT EARLY, MOVE FASTER LATER
Without documented systems, metrics and ratios, everything remains an unknown unknown. Therefore, the first and most critical step is simply to document what exists today, to the best of our ability, and to surface ambiguities rather than hide them.
As technical solutions mature, the urgency of this foundation will only increase. Expectations will quickly shift from “this would be nice to have” to “this should already exist.” Starting early allows organizations to build understanding, ownership, and process maturity. Trying to retrofit structure and documentation later – under pressure and scrutiny – is almost always more painful and far less effective.
Author: Daniel Hedblom, BI Consultant, Random Forest