Data Validation In Power BI Copilot AI Instructions

Here’s yet another post in my series on things I think you should be doing in Power BI Copilot AI Instructions. Today: validating values that users enter as filters in their prompts. It’s something of a companion piece to last week’s post about helping users understand what data is and isn’t in the semantic model, because the more I think about it, the more convinced I am that the biggest problem users face when querying data in natural language is knowing what data is there to be queried. As I said last week, if you’re an end user interacting with a Power BI report via a slicer or a filter, you know what values are available to choose because you can see them listed in the slicer or the filter – but you don’t see them when composing a prompt. As a developer, or someone doing a demo, it’s easy to forget this because you know the data so well; for an end user it’s not so easy, so they need all the help that the model developer can give them.

Let’s see an example using the semantic model that I’ve been using in this series containing UK real estate sales data. The Transactions table in my semantic model contains one row for each property sold; each property’s address is given and each address has a UK postcode (something like a US zip code – I’m sure all countries have an equivalent).

Everyone in the UK knows their postcode and a postcode contains a wealth of geographic information, as this section of the Wikipedia article on postcodes shows. There’s no need to get too detailed on their format though because I want to point out one important feature of all of the properly-formatted postcodes in the Transactions table shown above: they all have a space in the middle of them. And people being people, when they use postcodes, they usually forget that and write a postcode without the space.
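Conveniently, the part of a UK postcode after the space (the “inward code”) is always exactly three characters, so a space-free postcode can be repaired mechanically. Here’s a minimal sketch of that idea in Python – the function name is my own, not anything Copilot uses internally:

```python
import re

def normalize_postcode(value: str) -> str:
    """Insert the space a UK postcode needs before its last three
    characters (the inward code is always one digit followed by
    two letters, e.g. 9XG)."""
    compact = re.sub(r"\s+", "", value).upper()  # strip any spaces, uppercase
    if len(compact) < 5:
        return value.upper()  # too short to be a postcode; leave as entered
    return compact[:-3] + " " + compact[-3:]
```

So `normalize_postcode("YO89XG")` gives back `"YO8 9XG"`, which is the kind of correction we’d like Copilot to suggest to the user.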

This has consequences for Power BI Copilot. For example, the prompt:

show count of transactions for the postcode YO89XG

…returns a message saying that Copilot can’t find any data for the postcode “YO89XG”. This is because the postcode doesn’t contain a space. This is what you might expect as a developer, but it will not make much sense to an end user.
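The underlying issue is that the filter is an exact string comparison: the stored values contain a space, the value in the prompt doesn’t, so nothing matches. A hypothetical sketch with made-up sample values illustrates this:

```python
# Hypothetical sample of the Postcode column - stored values include the space
postcodes = ["YO8 9XG", "YO8 9XF", "SW1A 1AA"]

# Exact match on the space-free string from the prompt finds nothing
count_without_space = sum(1 for p in postcodes if p == "YO89XG")  # 0 rows

# Exact match on the correctly formatted string succeeds
count_with_space = sum(1 for p in postcodes if p == "YO8 9XG")   # 1 row
```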

On the other hand if the postcode in the prompt does contain a space in the right place, like so:

show count of transactions for the postcode YO8 9XG

…it returns the desired result:

How can we address this specific issue? Fairly easily, it turns out, because UK postcode formats are well documented and I would imagine Copilot has been trained on the same Wikipedia page on postcodes that I linked to above. As a result, adding the following to the AI Instructions for my semantic model:

The postcode column contains postcodes for locations in England and Wales. If the user enters a value to filter by for postcode that returns no data and the value is in an invalid format for a postcode, tell them they appear to have made a mistake, explain why the postcode format is wrong and suggest some changes to the value entered by the user that might result in a valid postcode.

…means that when I use the first prompt above, for the postcode without the space, I get a much more helpful response:

Clicking on the first option in this screenshot alters the prompt to include a space in the right place, which results in the user seeing the desired data:

I was encouraged by this, but there’s one obvious problem here: this only works for data like UK postcodes where the format is widely known. The format of your company’s invoice numbers is unlikely to be something that Copilot knows about.

So I experimented with using regular expressions in my AI Instructions and guess what, they seemed to work really well! But then I stopped to think – could I really trust an LLM to use a regex to validate values? The good thing about working at Microsoft is that I have a bunch of friendly colleagues who know way more about AI than I do so I asked them this question. One of them told me that for Copilot to properly validate data using regexes it would need to write some code and it can’t do that yet; instead it’s probably interpreting what the regex is looking for and trying to match the value against the interpretation. So while it might appear to work it would be prone to making errors.

Damn. That meant that if the LLM made a mistake when validating the data before running the query it would run the risk of preventing the user from filtering by a valid postcode, which would not be good. But then I thought, what if I applied the validation after it was clear that the user had entered a postcode that returned no data? That way it would be less important if the LLM made a mistake in its check because it would only happen when it was clear the user needed extra help.

Writing the AI Instruction so that it only validates the data after checking whether the value the end user was filtering on exists seemed to work. Here’s the AI Instruction, using a regex I found here to validate UK postcodes:

The postcode column contains postcodes for locations in England and Wales. UK postcodes must follow the format described in the following regular expression:
^([A-Za-z]{2}[\d]{1,2}[A-Za-z]?)[\s]+([\d][A-Za-z]{2})$
If the user enters a value to filter by for postcode that returns no data and the value is in an invalid format for a postcode, tell them they appear to have made a mistake, explain why the postcode format is wrong and suggest some changes to the value entered by the user that might result in a valid postcode.

Note how I say “If the user enters a value to filter by for postcode that returns no data…”
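For comparison, here’s what that same regex does when it’s actually executed as code rather than interpreted by an LLM – the deterministic behaviour Copilot can’t give you today. This is just my own illustration using Python’s `re` module, with the exact pattern from the instruction above:

```python
import re

# The exact pattern from the AI Instruction above
POSTCODE_RE = re.compile(r"^([A-Za-z]{2}[\d]{1,2}[A-Za-z]?)[\s]+([\d][A-Za-z]{2})$")

def looks_like_postcode(value: str) -> bool:
    """True if the value matches the postcode pattern used in the AI Instruction."""
    return POSTCODE_RE.match(value) is not None
```

Run programmatically, `"YO8 9XG"` matches while `"YO89XG"` and `"E48QJJ"` don’t. One caveat worth noting: this particular pattern requires two leading letters, so it would also reject postcodes from single-letter areas like E or N – which is another reason to only apply the check after a filter has already returned no data.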

Here’s the result for the following prompt asking for data for an invalid postcode:

show count of transactions for the postcode E48QJJ

I did some other tests on sample data and they do indeed suggest that the wording you use in the AI Instruction can control whether Copilot tries to validate the data before checking if the value the user is filtering on exists (which, as I said, would be bad because of the risk of it making a mistake when trying to validate data) or after (which, as I said, is a lot less dangerous).

All in all it seems that putting some thought into data validation in AI Instructions can result in a much friendlier end user experience in Copilot. That said, I doubt that, however good your AI Instructions are, the experience will ever match seeing a list of possible values in a filter or slicer. Maybe what we need is something like IntelliSense when writing a prompt, so you can see and search for values in your data?
