Imaging you have a map and on that map you define a bunch of geo-locations; polygons, which are defined by their vertices as latitude and longitude co-ordinates. These geo-locations may overlap and may either be very big or very small (or in-between). The problem is to figure out, for any point on the map, which of these geo-locations bound it.

Finding whether a polygon bounds a point is actually pretty simple. The problem here is that the number of polygons I have to work with is huge (millions and millions) and the answer has to be delivered super-duper fast; in the order of a couple of millisecond, max! Of course, it is possible to brute-force a solution by checking the point against each and every polygon but that takes far too long.

What’s needed is a way of producing a “candidate” list; a small sub-set of the complete list that can be checked very quickly.

It turns out there are a number of possible ways of doing this. One way is to use a “quad tree” to section off the map into smaller and smaller segments. It’s, effectively, like performing a 2D binary search; drilling down on the map until you get to a small enough quadrant that you have a candidate set.

This approach does have its down sides!

Depending on the distribution of your polygons in 2D space you could end up with a greatly unbalanced tree, which could be very costly in terms of performance. Also, it’s necessary to test every candidate found and where there are lots of overlapping polygons this could end up being a large number.

I need a better, faster and more efficient solution!

After giving this problem a lot of thought I’ve come up with an approach that, at least on paper (because I’ve yet to code it up) should provide a way of identifying the bounding polygons in near constant time. Unfortunately, it’s slightly complex and is a little difficult to explain and, I suspect, will be even harder to code but if it works it will be quite a nice and elegant solution. I decided to share my idea in the hopes of getting constructive feedback.

Here goes…

This is a proposed framework to build a pre-computed point-in-polygon index, which can be used to perform constant time O(1) translation from Lat/Long to geo-location polygon items. Unfortunately, the process to create this index is a little convoluted; however, it makes use of standard algorithms and so it should be possible to use pre-written libraries for the more complex parts of the process.

The solution works by creating a collection of virtual polygons that represent all the intersections of real polygons such that the virtual polygons never overlap. A good way to think of this is to imagine three glass polygons overlapping. If you focus on the edges of each polygon you’ll be able to see that you can actually visualise a larger set of polygons made up of the edges and where they intersect.

The above diagram demonstrates a simple example of 3 ‘glass’ polygons (rectangles for the sake of keeping things simple) overlapping. As can be seen, the edges of these polygons intersect and these intersection points can be assigned vertices, thus creating a collection of virtual polygons.

What’s interested about these virtual polygons is that they do not overlap; they overlap real polygons but not each other. We can use this property to our advantage such that we can assume a one to many mapping of virtual polygons to the real polygons that they overlay. For example, VP7 overlays RP1, RP2 and RP3 but it does not overlay any thing else. Using this we can’t create a simple virtual polygon to real polygon index.

- VP1 = RP1
- VP2 = RP2
- VP3 = RP3
- VP4 = RP1, RP2
- VP5 = RP1, RP3
- VP6 = RP2, RP3
- VP7 = RP1, RP2, RP3

Obviously, figuring out the virtual polygons is by no means a simple task but it turns out that this is actually quite a common thing to want to do and is a staple of most image processing packages. The process is called “clipping” and there are libraries for this that support set operations on the polygons. I am hopeful that one of these libraries might be used to pre-compute the intersection vertices.

Of course, we still needs a way to figure out which of the virtual polygons our point of interest resides. To do this we can use a geohash that, effectively, divides the map into a grid with each sector in the grid being a bucket containing the details of the virtual polygons and which real polygons they overlay within that sector.

The example, above, has a point in the 4,5 sector. In this sector we find two virtual polygons, VP4 and VP7. The point is not in VP4, the dot is in VP7. Our index will map VP7 to R1, R2 and R3 and so these are the real polygons our point resides within. Sectors that don’t contain any polygons do not need to be indexed.

When we are performing a lookup on a specific geo-point we use the same geohash to figure out what bucket it should map to and extract the corresponding virtual polygon candidates. Once we have these it’s a simple case of iterating the list until we find one our point lives within. The candidate list should be very small because virtual polygons don’t overlap.

Since we know none of these virtual polygons overlap we can stop searching as soon as we find a match. Unless we happen to have a grid location that happens to have an intersection of numerous polygons of acute angles there is a fair chance that most of these buckets will contain a small (4-6) number of virtual polygons to test.

This being the case the amortised time complexity for performing a lookup in the computed index is approximately O(1). It’s actually O( n ) where n is the number of virtual polygons in that bucket but as this is likely to be a tiny number in proportion to the complete sample space we can assume that for all intents and purposes that the amortised time is constant.

Of course, it is quite likely the final index will be quite large; however, this can be tuned by changing the granularity of the geohash. The greater the geohash granularity the bigger the index but the less v-polygons that need testing. Whilst this may prove to be yet another problem this is one that is relatively simple to solve as there are more than enough solutions to work around this.

For example, if the index was too large to be kept in memory it could easily be stored in a persisted key/value document store (something like Kyoto Cabinet). Disk and memory are cheap and so the trade-off for the ability to perform geo-location lookups in almost constant time is likely to be worth it.