A simple geocoder using Postgres and Python

python
postgres
geospatial
Author

Alex Lee

Published

March 14, 2021

For a work project last year we needed to create at short notice a geocoder for Victorian addresses. While there are a number of open source options such as libpostal, we decided to take a different approach, following the ideas in the paper Exploiting Redundancy, Recurrence and Parallelism: How to Link Millions of Addresses with Ten Lines of Code in Ten Minutes. This allowed us to create a proof-of-concept within a few days and was relatively simple to code from scratch, using Python and Postgres.

An observation about street addresses

The paper is partly based on some observations about the statistics of street addresses which imply significant amount of redundancy in the way they are described. For example:

Based on the January 2021 release of the GNAF there are 3,839,496 addresses in Victoria. 516,926 of which (i.e., over 13%) are uniquely defined by three numbers: flat number, street number and postcode.

As a result, these addresses can be identified irrespective of typographical errors in the street name, suburb or other text. Furthermore, even where the text is necessary to identify an address, often only a subset of the text is needed, and this is generally the parts of the text that occur relatively infrequently.

The algorithm: general idea

The authors of the paper use this idea as the starting point for a geocoding algorithm (actually an address linking algorithm in their paper) that has some nice properties:

Despite its simplicity it works surprisingly well and at least for me was a good project for getting more hands-on experience using Postgres.

Based on these ideas, the authors developed a simple geocoding algorithm that represents each address in terms of rare phrases, i.e., pairs of consecutive tokens that occur below a certain frequency, along with other infrequently occuring elements.

The algorithm: details and implementation

Here are the basic steps of the algorithm for geocoding street addresses:

  1. Create a database to store the GNAF data;
  2. For both the input addresses and GNAF create corresponding phrase tables which generate the phrases for each address, i.e., the pairs of consecutive tokens;
  3. An inverted index is generated on the phrases which maps each phrase to a set of addresses that contain these phrases;
  4. The inverted index tables for the input addresses and GNAF addresses are then linked, only considering those that appear below a certain frequency. This has the advantage of only requiring searches over a small subset of addresses. The result is a set of candidate matches for each of the input addresses;
  5. For each input address a similarity score is calculated with its set of candidate addresses. We used the similarity function from pg_tgrm module;
  6. As a final step, an additional constraint on the numeric tokens is imposed: the set of numeric tokens in the input address has to be a subset of those in the matched address. This ensures, for example, that the unit numbers are correct.

The paper has SQL code for most of these steps, with the exception of the GNAF setup and the numerical constraint.

I have written a Python module for geocoding that implements this algorithm in Postgres thanks to the nice results library. It is available in the repo PhraseGeocoder.

Summary and extensions to the algorithm

I found this an instructive project for learning more about Postgres, in particular things like Common Table Expressions, string similarity and operations that I wasn’t at all familiar with.

There are a few obvious ways that this algorithm can be extended:

Some of these variations I have coded up and will upload them at a later date.

References