Using LLMs to validate geocoder outputs

LLMs
address data
Author

Alex Lee

Published

January 25, 2026

Machine learning algorithms are typically evaluated and compared against one another using an independent validation dataset. For geocoding algorithms, such datasets don’t appear to be readily available. Furthermore, to evaluate the performance of an algorithm against an address dataset, how do we define the correctness of a prediction? One way is to use the latitude and longitude values along with the haversine formula to calculate the distance between each prediction and the ground truth. Alternatively, we can compare the text of each matched address with its ground-truth address.
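As a rough illustration of the distance-based approach, here is a minimal sketch of the haversine formula. The function name `haversine_km` and the Earth radius constant are my own choices for the example, and coordinates are assumed to be in decimal degrees.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed constant for this sketch)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

A geocoder could then be scored by, say, the median distance between predicted and true coordinates, or the share of predictions within some tolerance.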

This is a bit more involved since the results are text. One performance measure I find useful is to compare address pairs at four different levels of granularity: unit number, street number, street name, and suburb / postcode.

So, for example, if the correct address is 4 / 23 Johnston Avenue Burwood East 3151 and our geocoder produces 23 Johnston Avenue Burwood East 3151, then this is accurate up to the street number but not the apartment number.
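To make the idea concrete, the sketch below counts how many levels agree, working from the coarsest (suburb / postcode) through street name and street number to the finest (unit). It assumes the addresses have already been split into components and normalised, which is the hard part in practice; the dictionary keys and the function name `granularity_score` are hypothetical.

```python
# Components ordered from coarsest to finest; the key names are illustrative.
LEVELS = ["suburb_postcode", "street_name", "street_number", "unit"]


def granularity_score(truth, predicted):
    """Count matching levels from the coarsest down, stopping at the first mismatch."""
    score = 0
    for level in LEVELS:
        # A component missing from both addresses compares as None == None,
        # so two addresses without a unit number still match at the unit level.
        if truth.get(level) != predicted.get(level):
            break
        score += 1
    return score


truth = {
    "unit": "4",
    "street_number": "23",
    "street_name": "JOHNSTON AVENUE",
    "suburb_postcode": "BURWOOD EAST 3151",
}
predicted = {
    "street_number": "23",
    "street_name": "JOHNSTON AVENUE",
    "suburb_postcode": "BURWOOD EAST 3151",
}
print(granularity_score(truth, predicted))  # 3: correct up to street number
```

Exact string comparison is a simplification here. It would not, for instance, treat two different suburb names sharing a postcode as equivalent.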

Depending on the application, a geocoder that is accurate to the street name (or even suburb name) may be sufficient, for example if we are only interested in creating aggregated statistics across suburbs. For other applications (e.g., a food delivery service) the exact apartment number may be required.

When evaluating the outputs of my own geocoding algorithms, I previously had to manually review each pair of addresses to score the predictions. For example, see this notebook where I compare whereabouts with other geocoders. LLMs provide a more streamlined way to automate this. I found that the following prompt with GPT-4 gives structured output and assesses accuracy at the four levels of geographic granularity above.

You are scoring the quality of a list of Australian street addresses with their corresponding matches.

You should identify all the components of the street address, including unit or apartment number, shop / tenancy number, street number,  street name, suburb and postcode and use these components in creating a score. 

You consider four levels of granularity for each address in order to assign a score: unit, property number (e.g., house number), 
street name, suburb / postcode. A numerical score should be assigned as follows:

- 4: unit number, street number, street name and suburb / postcode all match (including where there is no unit number)
- 3: property number, street name and suburb / postcode match
- 2: only the streetname and suburb / postcode match
- 1: only the suburb / postcode match
- 0: none of the components match or if the state is different, for example 'NSW' instead of 'QLD'

Additional instructions:
- Shop and tenancy are equivalent, as are unit and apartment number. Level is not considered relevant. 
- If the suburb name is different but the postcode and other components of the address match then we consider them the same suburb.
- If there is no unit or shop number in the addresses then we consider that to be a match on the unit or shop number, since they are both empty

Structure of the output:
- The output should consist ONLY of a json list with keys for each entry: 
  - input_address, matched_address, unit_match, street_number_match, street_name_match, suburb_match, postcode_match, overall_score
- The columns with the '_match' suffix are binary. 
- No other text should be shown in the output, ONLY THE JSON.

The pairs of input and matched addresses are shown below:
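The prompt above leaves two practical steps open: appending the address pairs, and reading back the JSON. The sketch below shows one way these might look. The `input | matched` line format, the helper names `build_prompt` and `parse_scores`, and the sample response are all assumptions made for illustration; the sample is hand-written, not real model output, and the call to the model itself is left out.

```python
import json

# Keys requested in the "Structure of the output" part of the prompt
EXPECTED_KEYS = {
    "input_address", "matched_address", "unit_match", "street_number_match",
    "street_name_match", "suburb_match", "postcode_match", "overall_score",
}


def build_prompt(instructions, pairs):
    """Append one 'input | matched' line per address pair to the instructions."""
    lines = [f"{inp} | {matched}" for inp, matched in pairs]
    return instructions.rstrip() + "\n\n" + "\n".join(lines)


def parse_scores(response_text):
    """Parse the JSON list and check each entry's keys and score range."""
    records = json.loads(response_text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON list")
    for rec in records:
        missing = EXPECTED_KEYS - set(rec)
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        if int(rec["overall_score"]) not in range(5):
            raise ValueError(f"score out of range: {rec['overall_score']}")
    return records


# Hand-written stand-in for a model response (illustrative only)
sample_response = """[
  {"input_address": "4 / 23 Johnston Avenue Burwood East 3151",
   "matched_address": "23 Johnston Avenue Burwood East 3151",
   "unit_match": 0, "street_number_match": 1, "street_name_match": 1,
   "suburb_match": 1, "postcode_match": 1, "overall_score": 3}
]"""

records = parse_scores(sample_response)
print(records[0]["overall_score"])  # 3
```

Validating the keys and score range up front makes it easier to spot responses where the model has added extra text or drifted from the requested structure.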

Another example of LLMs saving time on menial tasks.