
Sorting into Chunks

Posted by kumakyoo on 11 April 2025 in English. Last updated on 25 April 2025.

This blog post is part of a series of blog posts about the new OSM file format “OMA”. This is the sixth post. At the end of the article you’ll find links to the other blog entries.

 

The Oma file format uses a set of bounding boxes to sort the data geographically. The default set used by the converter consists of three grids.
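As I understand it from the earlier posts in this series, each element is sorted into the smallest bounding box of the set that fully contains it, and anything that fits no box ends up in the catch-all last level. A minimal sketch of that rule, under this assumption (the function names are mine, not the converter's):

```python
# Sketch of the sorting rule (my reading of the series, not converter code):
# an element goes into the smallest box of the set that fully contains it.
# Boxes are (lon_min, lat_min, lon_max, lat_max) tuples in degrees.

def contains(box, elem):
    """True if the bounding box of the element lies fully inside box."""
    return (box[0] <= elem[0] and box[1] <= elem[1]
            and box[2] >= elem[2] and box[3] >= elem[3])

def choose_box(boxes, elem):
    """Pick the smallest box (by area) that fully contains the element."""
    candidates = [b for b in boxes if contains(b, elem)]
    return min(candidates,
               key=lambda b: (b[2] - b[0]) * (b[3] - b[1]),
               default=None)
```

With a set like the default one, a small element would land in a 1-degree cell, a long motorway in a 10-degree cell, and something spanning continents in the whole-world box.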

 

The Default Grids

The first grid uses a mesh size of 1 degree in both directions between -45° and 45° latitude. Towards the poles the mesh size increases and finally the area around the poles is put into a single bounding box.

The grid looks like this:

[Image: the default bounding box grid, level 1]

The second grid uses a mesh size of 10 degrees in both directions between -80° and 80° latitude. It does not include the areas around the poles.

This grid looks like this:

[Image: the default bounding box grid, level 2]

The third grid isn’t really a grid; it’s just the whole world as one big bounding box.1
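For concreteness, the two regular levels can be generated in a few lines. This is an illustrative sketch only, not the converter's code, and it omits the widening mesh and the polar caps of the first grid:

```python
# Illustrative sketch: build the regular part of the default grid levels
# as (lon_min, lat_min, lon_max, lat_max) boxes in degrees.

def grid(step, lat_limit):
    """Regular grid with the given mesh size, clipped to +/- lat_limit."""
    boxes = []
    lat = -lat_limit
    while lat < lat_limit:
        lon = -180
        while lon < 180:
            boxes.append((lon, lat, lon + step, lat + step))
            lon += step
        lat += step
    return boxes

level1 = grid(1, 45)             # 1-degree mesh between -45 and 45 latitude
level2 = grid(10, 80)            # 10-degree mesh between -80 and 80 latitude
level3 = [(-180, -90, 180, 90)]  # the whole world as one bounding box
```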

 

Data Distribution in Bounding Boxes

In practice, data isn’t evenly distributed around the globe. Large areas are covered by oceans, where there is little data, while other regions are densely mapped.

Using identical bounding boxes everywhere results in a few large chunks and many small ones. Balancing the chunk sizes would speed up data access.

As an example, look at the distribution of nodes in Germany:

[Image: distribution of data in Germany]

The Ruhr area (centre left) contains about 10% of Germany’s nodes, and the area around Berlin another 8.6%, while the area just north of Berlin, for example, contains only 1.4%.

Let’s see how this affects search times: I chose the street Bienenweg in Neubrandenburg for comparison. The search for this street takes 2.5 seconds, compared with the 4.7 seconds needed for Viktorstraße in Wuppertal.

 

Improving Bounding Boxes

But how do you get a better set of bounding boxes? I tried some ad hoc algorithms, but could not find out what really makes a good set. This question could probably make a good master’s thesis (or even a PhD thesis).

Anyway, I’d like to take a quick look at what I tried. The first idea led to this set of bounding boxes:

[Image: distribution of data with an improved set]

The nodes are still not evenly distributed, but the differences are somewhat smaller. Searching for the Bienenweg now takes 2.8 seconds, while searching for Viktorstraße has been reduced to 2.9 seconds. On average, this seems to be at least some improvement.

I wondered if overlapping bounding boxes were a problem. So I tried the same thing with (almost) non-overlapping bounding boxes. The result looks like this:

[Image: distribution of data with an improved set without overlap]

The distribution looks similar to before. Unfortunately, the search times are worse: still 2.8 seconds for the Bienenweg, but 3.5 seconds for the Viktorstraße.

I also tried increasing the number of bounding boxes. This seems to improve the results slightly. Unfortunately, the more bounding boxes you use, the harder it becomes to create the Oma files (crashes happen more often). I fear that the algorithm for sorting data into bounding boxes will have to be improved before further experiments in this direction are possible.

In one of the comments on an earlier post, Geonick pointed out a discussion about geoparquet.2 In that discussion, GitHub user jiayuasu claims that an algorithm called a KDB-tree is optimal for geoparquet. I have not had time to look into this, but it might be a good choice for OMA files as well.
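To make the underlying idea concrete, here is a sketch of the k-d-style median split that KDB-trees build on. This is my own illustration, not jiayuasu's algorithm and not anything from the Oma converter: split the nodes at the median coordinate, alternating between longitude and latitude, until every region holds roughly the same number of nodes.

```python
# Sketch of a k-d-style median split (my illustration, not Oma code):
# recursively halve the point set at the median, alternating axes,
# so that every resulting bounding box holds a similar number of nodes.

def kd_boxes(points, depth=0, max_points=4):
    """Return a list of (lon_min, lat_min, lon_max, lat_max) boxes,
    each covering at most max_points of the given (lon, lat) points."""
    if not points:
        return []
    if len(points) <= max_points:
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        return [(min(xs), min(ys), max(xs), max(ys))]
    axis = depth % 2                      # 0: split by longitude, 1: by latitude
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                   # median index: half the points each side
    return (kd_boxes(pts[:mid], depth + 1, max_points)
            + kd_boxes(pts[mid:], depth + 1, max_points))
```

By construction each leaf box holds about the same number of nodes, which is exactly the balance property the fixed grids lack.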

To summarise: The question of what makes a good set of bounding boxes for OMA files is still open.

  1. There is also a last chunk without any bounding box. This is used for anything without geometry. In the current implementation, it’s all the collections. 

  2. Geoparquet is an approach similar to OMA files, based on Parquet files. 

Discussion

Comment from -karlos- on 21 April 2025 at 08:22

Did you consider the concept of Caesium, to split the data in half?

You did see this: osm.org/user/daniel-j-h/diary/406584

Comment from kumakyoo on 21 April 2025 at 09:40

Thanks for pointing out those two things. I wasn’t aware of the concept of Caesium, so I didn’t consider it. As far as I understand it, it’s something like a k-d-tree approach.

Regarding the Z-curve: if anything, this could be used within slices to sort the data once more. It would be nice, but you need indexed access to the elements, which is not the case: the data is compressed and must be read linearly. And even if it were not compressed, the size of the elements is not fixed, so indexing still would not work. So I’m not sure whether such a technique could be used. Anyway, it would be nice to be able to read only part of a slice if only part is needed.
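For reference, the Z-curve index discussed here is just a bit-interleaving of the two coordinates. A minimal sketch (my illustration, not part of any Oma tool):

```python
def morton(x, y, bits=16):
    """Interleave the bits of the integer cell coordinates x and y
    into a single Z-order index; nearby cells get nearby indices."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # bits of y go to odd positions
    return z
```

Sorting the elements of a slice by this index would keep spatially close elements close together in the file, but as noted above, compressed variable-size records rule out random access into the result.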
