OpenStreetMap

Sorting into Chunks

Thanks for pointing out those two things. I wasn’t aware of the concept of Caesium, so I didn’t consider it. As far as I understand it, it’s something like a k-d tree approach.

Regarding the Z-curve: if anything, it could be used within slices to sort the data once more. That would be nice, but it requires indexed access to the elements, which is not available: the data is compressed and must be read linearly. And even if it were not compressed, the elements do not have a fixed size, so indexing still would not work. So I’m not sure whether such a technique could be used. In any case, it would be nice to be able to read only part of a slice when only part is needed.
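For readers unfamiliar with the Z-curve, here is a minimal sketch of how a Z-order (Morton) key is usually computed: the bits of the two coordinates are interleaved into one number, and sorting by that number keeps nearby points close together. The class and method names are made up for this illustration and are not part of the Oma format or library. The sketch assumes non-negative integer coordinates (for example, after adding a fixed offset).

```java
// Hypothetical helper for illustration only; not part of the Oma library.
final class ZCurve {
    // Spread the lower 32 bits of v so that they occupy the even bit positions.
    static long spreadBits(long v) {
        v &= 0xFFFFFFFFL;
        v = (v | (v << 16)) & 0x0000FFFF0000FFFFL;
        v = (v | (v << 8))  & 0x00FF00FF00FF00FFL;
        v = (v | (v << 4))  & 0x0F0F0F0F0F0F0F0FL;
        v = (v | (v << 2))  & 0x3333333333333333L;
        v = (v | (v << 1))  & 0x5555555555555555L;
        return v;
    }

    // Interleave x (even bits) and y (odd bits) into a single sortable key.
    // Assumes x and y are non-negative.
    static long mortonKey(int x, int y) {
        return spreadBits(x) | (spreadBits(y) << 1);
    }
}
```

Sorting elements by such a key needs no indexed access, but jumping to a key range inside a slice would, which is the limitation described above.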

About main keys and values

@SomeoneElse Yes, there is certainly room for improvement in the type file. In your case, you can add pedestrian after highway - EXCEPTION. Closed ways with highway=pedestrian will then be treated as areas.

About main keys and values

@SomeoneElse: Here’s what the algorithm does:

  1. Multipolygons are always areas.
  2. If the sequence of nodes is not closed, it’s a way.
  3. If the area tag exists and its value is yes or no, this value is used.
  4. The information from the type file is used (if a key could be determined).

So a man_made=pier will always be treated as a linear feature, except if it has area=yes (or it’s a multipolygon).

Neither leisure=track nor highway=raceway is listed in the type file, so the defaults for their keys apply: leisure features are areas, highway features are linear features. Again, this can be overridden by the area tag.
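The four rules listed above can be sketched as a small decision function. This is only an illustration, not the converter's actual code: the names are hypothetical, and the type-file lookup is reduced to a single parameter holding the default for the element's main key. The comment does not say what happens when no key can be determined; the sketch arbitrarily treats such elements as ways.

```java
import java.util.Map;

// Hypothetical sketch of the decision order; not the converter's real code.
final class AreaDecision {
    enum Default { AREA, WAY }

    // typeFileDefault: what the type file says for the element's main key,
    // or null if no key could be determined (assumption: treated as a way).
    static boolean isArea(boolean isMultipolygon, boolean isClosed,
                          Map<String, String> tags, Default typeFileDefault) {
        if (isMultipolygon) return true;        // rule 1: multipolygons are areas
        if (!isClosed) return false;            // rule 2: unclosed node sequence is a way
        String area = tags.get("area");
        if ("yes".equals(area)) return true;    // rule 3: explicit area tag wins
        if ("no".equals(area)) return false;
        return typeFileDefault == Default.AREA; // rule 4: fall back to the type file
    }
}
```

With this order, a closed man_made=pier whose key defaults to a way is only an area if it carries area=yes or is a multipolygon.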

Unfortunately, OSM does not provide a clear way to distinguish between ways and areas. It’s always some guesswork - that cannot be helped.

The OMA File Format

I fear I wasn’t as clear about this as I wanted to be: I’m not using single-precision floats, precisely because they lose precision. I’m using the same trick that pbf files (and many others) use: I multiply the numbers by 10,000,000, which yields integers without any loss of precision. That gives an accuracy of about 1 cm. In my opinion that’s enough.
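A minimal sketch of this scaling trick, assuming the scaled values are kept in 32-bit integers (the comment does not say which integer type is used; the names below are made up). The largest coordinate, 180 degrees, becomes 1,800,000,000 and still fits into an int. One unit corresponds to 0.0000001 degrees, which is roughly 1.1 cm at the equator.

```java
// Hypothetical helper for illustration; not taken from the Oma converter.
final class FixedPoint {
    // Seven decimal places, the same factor mentioned in the text.
    static final int SCALE = 10_000_000;

    // Convert degrees to a scaled integer (assumption: stored as a 32-bit int).
    static int toFixed(double degrees) {
        return (int) Math.round(degrees * SCALE);
    }

    // Convert a scaled integer back to degrees.
    static double toDegrees(int fixed) {
        return fixed / (double) SCALE;
    }
}
```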

A New File Format for OSM Data

First, I would encourage you to go ahead and put the information in this post into a wiki linked to one of your repos.

I’m not sure which wiki you are referring to. I plan to add a page in the OSM-Wiki when the format is finalized. I don’t want to do it in advance, because then I would have to change the entry every time I change the format. But maybe you have something else in mind.

I would suggest, not a reason to give up on this effort.

Of course not. I have spent a year on this. I’m not going to give up just because there is something similar out there. But I like the idea of comparing it to my approach. It will help me get a clearer picture of the strengths and weaknesses of my format. And it might bring up some new ideas that I may have overlooked.

Third, can oma files be used to generate tiles? This might help make the format’s usefulness more obvious.

I think so. It’s an all-purpose format that contains almost all the information available from OSM. It might be better suited for vector tiles though, because, I think, Oma files could be used directly, without any additional preparation.

Fourth, are you testing your converters so that we can be confident of the round-trip behavior of a conversion into and then out of oma? I do not see a “tests” directory anywhere. :–)

No, there is currently no automated testing. The main goal so far has been to create a new file format. In my opinion this cannot be tested automatically, because after every change I would have to rewrite all the tests, and then I would have to test the tests, just to run them once…

The converter and the library are a kind of add-on, a prototype to show what is possible. When the format is fixed and a “real” converter/library is created, it should definitely be accompanied by automated tests.

Having said that, I did a lot of testing during the development of the two tools mentioned. It was just nothing automated. :-)

Regarding a roundtrip: That is not possible. You can’t convert Oma files back to OSM files. Some information is lost in the conversion process, for example the IDs and other meta information of the nodes that make up a way, but also how multipolygons have been pieced together and a few other things.

There is only one round trip I know of: From Oma to Opa and back. The resulting Oma file must be identical to the original.

And finally, would you object to things being done in python? I have a lot of experience working in java, but it would be good to have tools in other languages too.

Of course it would be nice to have the library in several languages. But first, the file format needs to be finalized. Python and PHP are languages for which I’ll probably write the library myself when the time comes (but I don’t mind if someone else volunteers). For other languages, other people will have to do the job.

Concerning the converter: I doubt that Python is fast enough for this job. And memory management may also be an issue. Java is (despite its reputation) one of the fastest languages available (but memory is an issue here too - I’ll cover that in my next post in this series) and thus a rewrite in another language may be required sooner or later.

Using the Oma Library

In the TypeFilter, what are possible parameter values? A (area), W (way), N (node).

And C (collection) - I’ll go into more detail on collections in one of my next blog posts (about how Oma files handle relations). Parameters can be combined, so you can also use “WA” if you are interested in ways and areas. Have a look at the API for more details.

Instead of taking a char as argument, you probably could use an enum with a fixed set of options

Yes, that would probably be more Java-like. It’s only a prototype of a library, mainly intended to show what is possible. A “real” library will need a lot of refinement, I think.
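To illustrate the suggestion, here is a hedged sketch of what an enum-based variant could look like. It is not the library's current API: ElementType and its methods are hypothetical names. The letters N, W, A and C are the ones mentioned above, and a combination such as “WA” maps to a set of enum values.

```java
import java.util.EnumSet;

// Hypothetical enum for illustration; the prototype library uses chars instead.
enum ElementType {
    NODE('N'), WAY('W'), AREA('A'), COLLECTION('C');

    final char code;

    ElementType(char code) { this.code = code; }

    // Look up the type for a single letter; unknown letters are rejected.
    static ElementType fromCode(char c) {
        for (ElementType t : values()) {
            if (t.code == c) return t;
        }
        throw new IllegalArgumentException("Unknown type code: " + c);
    }

    // Parse a combined string such as "WA" into a set of types.
    static EnumSet<ElementType> parse(String codes) {
        EnumSet<ElementType> result = EnumSet.noneOf(ElementType.class);
        for (char c : codes.toCharArray()) {
            result.add(fromCode(c));
        }
        return result;
    }
}
```

The benefit of this design is that a typo like “X” fails immediately instead of silently matching nothing.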

Traditionally in Java, I have a query method or a method returning an iterator, and then I can iterate over the found values (e.g. with iterator.hasNext(); iterator.next()).

In the past, I have run into problems when trying to write Java iterators myself, so I have shied away from this approach. I’m probably missing something fundamental here. The design was probably inspired by Python.
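One common way around the usual iterator pitfalls is a small adapter with a one-element lookahead, so that hasNext() can be called any number of times without consuming anything. The sketch below assumes a pull-style source whose method returns null once it is exhausted; whether the OmaReader signals the end that way is an assumption, and PullIterator is a hypothetical name.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

// Hypothetical adapter: wraps a "next() returns null at the end" source
// in a standard java.util.Iterator.
final class PullIterator<T> implements Iterator<T> {
    private final Supplier<T> source; // assumed to return null when exhausted
    private T lookahead;              // element fetched but not yet handed out
    private boolean fetched;          // true if lookahead holds the latest fetch

    PullIterator(Supplier<T> source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        if (!fetched) {               // fetch at most once per element
            lookahead = source.get();
            fetched = true;
        }
        return lookahead != null;
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        T result = lookahead;
        lookahead = null;
        fetched = false;
        return result;
    }
}
```

With a reader object exposing such a method, the adapter could be created as new PullIterator<>(reader::next) and used in a plain while loop.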

In your code, the reader has a next() method that seems to automatically reset when you set a filter. But this “reset” is not really visible in the code.

Originally you had to call reset() manually every time you set a filter. This was error-prone, so I decided to have it called automatically. (Normally I don’t like automatisms that cannot be overruled by a human, but in this case I can’t see a use case, because after setting a filter, the OmaReader is in an undefined state and the only way to get back to a defined state is to call reset() anyway.)

Although then people might try to run two queries in parallel on the same reader, and I don’t know if that is supported or not.

It’s not supported yet. You might create two OmaReaders in two threads and make them read the same file in parallel. That should work, but might slow everything down; that depends on how file access works under the hood.

I think what confuses me is that I don’t see where/when the query happens: I set a filter, and suddenly I can access next to get results. This does not seem intuitive to me (but others might have a different opinion on that)

When an OmaReader is created, the file is immediately opened and some basic data (needed for all queries) is retrieved. Querying starts with the first call to next(). It scans the file until it finds an item that fits the filter (skipping larger parts if possible) and returns this item. On the next call, it continues the search from where it stopped.
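The behaviour described here can be modelled with a toy scanner. This is not the OmaReader implementation: it works on in-memory chunks of integers, each with a known value range, instead of a file, and all names are hypothetical. It shows the three points from the text: setting a filter resets the position, next() scans lazily and skips whole chunks that cannot match, and each call resumes where the previous one stopped.

```java
import java.util.List;

// Toy model for illustration only; not the real OmaReader.
final class LazyScanner {
    // A chunk knows the range of its values, so it can be skipped as a whole.
    static final class Chunk {
        final int min, max;
        final List<Integer> items;
        Chunk(int min, int max, List<Integer> items) {
            this.min = min; this.max = max; this.items = items;
        }
    }

    private final List<Chunk> chunks;
    private int lo = Integer.MIN_VALUE, hi = Integer.MAX_VALUE;
    private int chunkIndex, itemIndex; // current scan position

    LazyScanner(List<Chunk> chunks) { this.chunks = chunks; }

    // Setting a filter implies a reset, as in the text above.
    void setFilter(int lo, int hi) { this.lo = lo; this.hi = hi; reset(); }

    void reset() { chunkIndex = 0; itemIndex = 0; }

    // Returns the next matching item, or null when nothing is left.
    Integer next() {
        while (chunkIndex < chunks.size()) {
            Chunk c = chunks.get(chunkIndex);
            boolean overlaps = c.max >= lo && c.min <= hi;
            if (overlaps) { // otherwise the whole chunk is skipped
                while (itemIndex < c.items.size()) {
                    int item = c.items.get(itemIndex++);
                    if (item >= lo && item <= hi) return item;
                }
            }
            chunkIndex++;
            itemIndex = 0;
        }
        return null;
    }
}
```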

A New File Format for OSM Data

More modern compression algorithms would be ZStandard (https://facebook.github.io/zstd/) or LZ4 (https://github.com/lz4/lz4-java), which are much faster for both compression and decompression, and Zstd might even result in better compression than the default deflate.

Sounds like a good idea. I didn’t know these two compression algorithms and I didn’t look for alternatives to deflate. Many thanks for pointing this out. I’ll have a look soon.

I’ll also have a look at GeoParquet and GeoDesk when I find the time. They might contain additional ideas I overlooked. Many thanks too.

A New File Format for OSM Data

Would you share some figures to query the same data from pbf files, and maybe even specialized services like overpass ?

This is very difficult to answer (which is why I didn’t give any numbers above).

First of all, the idea behind overpass is similar to the idea behind Oma files: do the problematic stuff once. Instead of a file, overpass uses a database. Databases have some advantages over files: for example, they contain indexes to speed up searching. Since I don’t have an instance of overpass on my computer I can only guess, but I’d say that it would be faster. The drawback is that databases are not so easy to share. The dumps tend to get big, and initialisation may take some time because the indexes have to be created. All in all, I think these two approaches are not easily comparable.

Querying a pbf file with tools like osmium or osmconvert is not easily comparable either. For example, osmconvert cannot search for tags. You can only reduce the amount of data, for example by specifying the bounding box of Wuppertal (wherever you get it from; with Oma files you can easily query it). This step alone took 1:44 minutes.

After that, you can use osmfilter to search for tags. Theoretically, I think, it should be possible to query the Viktorstraße. In practice, I always confuse the --keep and --drop options you need for that, and I do not get what I’m looking for. But if I managed, I think, it would add only a second or so.

Osmium is easier to use, but the problems are similar: you have to run it twice, once with osmium extract to limit the data to Wuppertal (again without knowing where to get the bounding polygon from) and once with osmium tags-filter to select the tags. I can’t give times here, though: Osmium just crashes because my computer does not have enough main memory.

A New File Format for OSM Data

I hope that at least the library can be used by non-specialists too (only some basic knowledge of Java programming is needed). :-) I plan to give a short introduction on how to use the library in my next post.