OpenStreetMap

With the new release of more than 59 million points of interest (POIs) from Overture, consisting of Microsoft and Meta POI datasets combined, the natural question arises: how can this be useful for OpenStreetMap?

Challenges to consider

The most important challenge in getting this data into OSM is making sure the place labels in Overture have an equivalent in OSM. This is mostly doable with automation, but many cases require context.

Validating these matches is the next challenge: street-level imagery from Mapillary will be especially helpful, but being there in person to validate is an even bigger advantage. Even then, if the data is to be added to OSM one-by-one (not imported) with validation, the tags need to be in a proper format.

Loading up the data to analyze

I got started by referencing Feye Andal’s great and succinct guide on viewing the data in AWS Athena. One point the instructions leave slightly unclear: your Athena instance, and the S3 bucket where query results are saved, must both be in the us-west-2 region, the same as the Overture dataset, unless you first copy the dataset to a bucket in your own region. Make sure the regions match, and the instructions should work flawlessly!

Analyzing the data

Exploring the dataset, I found 1037 unique place labels. More than 86,000 POIs are labeled structure_and_geography, which can refer to a wide range of natural geography or built structures in OSM and is difficult to match to any specific tag without context. Others translate directly, such as a laundromat.

Some example tags include: "forest", "stadium_arena", "farm", "professional_services", "baptist_church", "park", "print_media", "spas", "passport_and_visa_services", "restaurant", "dentist"

To get most of the tags matched, I used Python with the openai module, connecting to my OpenAI account; each request costs a few fractions of a penny.

I set a system message, which defines the role the AI should play or assume. My message was:

system_msg = 'You are a helpful assistant who understands data structures, place and map data labeling ontology, and OpenStreetMap tagging. I will give you single labels of a POI category, and you will give me back the single OSM equivalent tag that most makes sense in the format of list with a single string like ["key=value"] unless it has multiple tags such as a mexican restaurant, then give the list of multiple like ["amenity=restaurant","cuisine=mexican"] or if there is no good match you will write back in all caps, ["UNKNOWN"]. Only include a list of tags or the list with unknown value, do not include any dialogue.'
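Since the system message asks for replies shaped like ["key=value"] lists, the reply string can be parsed defensively before storing it. A minimal sketch; the helper name parse_tag_reply is mine, not from the post:

```python
import json

def parse_tag_reply(reply: str) -> list:
    """Parse a model reply expected to look like ["key=value", ...].

    Falls back to ["UNKNOWN"] when the reply is not a valid JSON list
    of strings, since the model occasionally adds stray dialogue.
    """
    try:
        tags = json.loads(reply)
    except json.JSONDecodeError:
        return ["UNKNOWN"]
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        return ["UNKNOWN"]
    return tags

print(parse_tag_reply('["amenity=restaurant","cuisine=mexican"]'))
# → ['amenity=restaurant', 'cuisine=mexican']
```

Anything that does not parse cleanly can then be queued for the manual review described below, instead of silently landing in the dictionary.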

I made an empty dictionary:

overture_osm_dict = { }

Then I made a list of all the unique tags, and looped through it. My code looks like:

for tag in overture_tags:
    if tag not in overture_osm_dict:
        user_msg = tag
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "system", "content": system_msg},
                      {"role": "user", "content": user_msg}])
        osm_tag = response["choices"][0]["message"]["content"]
        overture_osm_dict[tag] = osm_tag

I recommend adding a sleep timer, or a handler for timeout responses, as parsing 1037 items produced roughly 10 timeouts.
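One way to do this is a small retry wrapper around the API call. This is a sketch under the assumption that the client raises ordinary exceptions on timeout; the exact exception classes vary by openai library version, so it catches broadly:

```python
import time

def with_retries(fn, attempts=3, delay=2.0):
    """Call fn(), pausing and retrying when it raises.

    The OpenAI client can raise timeout or rate-limit errors; since the
    exact exception classes depend on the library version, this sketch
    catches broadly and only re-raises after the last attempt.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(delay * (attempt + 1))  # simple linear backoff
```

In the loop above, the `openai.ChatCompletion.create(...)` call could then be wrapped as `with_retries(lambda: openai.ChatCompletion.create(...))`.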

In the end I had a few tags that were unknown, and I made manual fixes as needed. Running the loop multiple times yielded different results, so it is good to be aware that the AI is not consistent.

I made various fixes to the JSON structure, including stray line breaks, quotation marks in the wrong places, and badly formatted tags. Some tags also seemed simply invented, such as amenity=water_supplier for Overture’s water_supplier, which I changed to office=water_utility, though that could be quite wrong depending on the POI.
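Many of those format problems can be caught automatically with a simple key=value check before fixes go into the dictionary. A sketch; the regex is a deliberate simplification of what OSM actually allows (real keys and values can contain more characters):

```python
import re

# A tag should look like key=value; this pattern is a simplification of
# real OSM tagging, which permits more characters in keys and values.
TAG_RE = re.compile(r"^[a-z][a-z0-9_:]*=[a-z0-9_;\- ]+$")

def clean_tags(tags):
    """Strip whitespace, keep well-formed tags, and flag malformed ones.

    Returns (cleaned, bad): e.g. a trailing space in "shop=tailor " is
    fixed, while an empty value like "sport=" is flagged for review.
    """
    cleaned, bad = [], []
    for tag in tags:
        tag = tag.strip()
        if tag == "UNKNOWN" or TAG_RE.match(tag):
            cleaned.append(tag)
        else:
            bad.append(tag)
    return cleaned, bad
```

Run over the whole dictionary, this surfaces cases like the bare "office" value or the empty "sport=" for manual attention.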

Other debatable tags came back as unknown, so I added tags manually:

  1. “personal_assistant”: [“office=administrative”]
  2. “kids_recreation_and_party”: [“shop=party”]
  3. “sewing_and_alterations”: [“shop=tailor”] instead of “craft=sewing”
  4. “sports_bar”: [“amenity=bar”, “sport=”] but dropping “sport=” to just be a bar

There are many more up for review.

In my final version, the dictionary is something like:

{
  "forest": ["landuse=forest"],
  "stadium_arena": ["leisure=stadium"],
  "farm": ["landuse=farm"],
  "professional_services": ["office"],
  "baptist_church": ["amenity=place_of_worship", "religion=baptist"],
  "park": ["leisure=park"],
  "print_media": ["amenity=newspaper"],
  "spas": ["amenity=spa"],
  "passport_and_visa_services": ["office=government", "office=visa", "office=passport"],
  "restaurant": ["amenity=restaurant"],
  "dentist": ["amenity=dentist"],
  "sports_club_and_league": ["sport=club"],
  "thai_restaurant": ["amenity=restaurant", "cuisine=thai"],
  "clothing_store": ["shop=clothes"],
  "insurance_agency": ["office=insurance"],
  "barber": ["shop=hairdresser"],
  "bar": ["amenity=bar"],
  "agriculture": ["landuse=farmland"],
  "accommodation": ["amenity=hotel"],
  "event_planning": ["amenity=event_planning"],
  "non_governmental_association": ["amenity=community_centre"],
  "elementary_school": ["amenity=school", "education=primary"],
  "landmark_and_historical_building": ["historic=yes"],
  "gym": ["leisure=sports_centre"],
  "pilates_studio": ["amenity=gym", "sport=pilates"],
  "hotel": ["tourism=hotel"],
  "advertising_agency": ["office=advertising_agency"],
  "educational_research_institute": ["amenity=school", "research_institute=yes"],
  "furniture_store": ["shop=furniture"],
  ....
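Applying the finished dictionary is then a simple lookup. Here is a sketch using a small excerpt of the mapping, splitting each tag into a key/value pair; the helper name is mine:

```python
# Small excerpt of the Overture-category -> OSM-tag dictionary from the post.
overture_osm_dict = {
    "forest": ["landuse=forest"],
    "thai_restaurant": ["amenity=restaurant", "cuisine=thai"],
    "barber": ["shop=hairdresser"],
}

def osm_tags_for(category):
    """Look up the OSM tags for an Overture category as (key, value) pairs.

    Unmatched categories come back as ("UNKNOWN", None) so they can be
    routed to manual review instead of being written to the map.
    """
    tags = overture_osm_dict.get(category, ["UNKNOWN"])
    return [tuple(t.split("=", 1)) if "=" in t else (t, None) for t in tags]

print(osm_tags_for("thai_restaurant"))
# → [('amenity', 'restaurant'), ('cuisine', 'thai')]
```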

The full mapping is available to download as a GitHub gist, and I hope to get feedback on it so we may arrive at a more officially agreed-upon translation of the tags.

Conclusion

These POIs offer a lot of opportunity to improve one of the categories most often cited as lacking in OSM. The quality is not perfect, whether in location accuracy or proper tagging, but it is at least professionally curated. Nothing beats crowdsourcing (which is how many of these POIs, sourced from Facebook business pages or Foursquare check-ins, were generated), and OSM is the best spatial crowdsourcing platform in the world.

Some data needs special analysis. For example, I asked the AI to help with a case I could not verify without context: a structure_and_geography POI where the AI noticed that the Turkish name contains the Turkish word for “harbor” and recommended the tag “natural=harbor”.

Before we can start finding ways to validate the data and ingest it into the map on a case-by-case basis, we need a good basis for the tagging. The user can always modify the tags to be more appropriate before confirming and sending an OSM changeset, but presenting a good first guess reduces friction and increases the success rate.

Location: Schönegg, Oberarth, Goldau, Arth, Schwyz, 6410, Switzerland

Discussion

Comment from 4004 on 3 August 2023 at 14:26

Good write up, interesting application of OpenAI’s platform. I wonder if Overture themselves are able to give pointers on how they came up with their tags - unless they’ve just taken all the business categories on Facebook or something

Comment from SomeoneElse on 3 August 2023 at 17:17

The most important challenge in getting this data into OSM is making sure the place labels in Overture have an equivalent in OSM.

No, the biggest challenge is to make sure that the data proposed to be added isn’t utter garbage. See the discussion in the forum thread at https://community.openstreetmap.org/t/overturemaps-org-big-businesses-osmf-alternative/6760/271 and elsewhere.

With regard to “what this data was used for”, see the threads at https://en.osm.town/@migurski@mastodon.social/110804743862527535 . It sounds like the FB data was basically used to drive potential customers to Facebook; quality wasn’t really a factor in that. Having a high false positive rate wasn’t a problem with that use case; some people just got (even) more spam from Facebook than they might have done otherwise.

There is a “confidence rating” in the released data, but a quick test at one location (York Minster, a large cathedral dating from the 1200s) shows it not to be especially helpful.

The POI for York Minster appears with a confidence of about 0.9, but unfortunately so does a statue (which exists, but not in that location) and a childcare facility (that certainly does not exist there either). A car valeting company appears with a “confidence” of about 0.88. I have never seen cars being cleaned in the north transept.

There may be some benefit in using some of this data as an “aide to survey”, but it certainly isn’t any use for e.g. maproulette or (heaven help us) some “AI” attempt at adding data, which would simply come up with plausible OSM tags for something that simply does not exist.

Comment from cbeddow on 3 August 2023 at 18:15

@SomeoneElse - I wouldn’t call that the biggest challenge at this stage. You can’t start to verify if the data is not connected. If some just do not have a good or easy tag match, it’s already thrown out probably. Validating the rest is not difficult per se (not a crazy math equation to solve), but rather, a long, slow, and careful effort. We don’t need to make sure the dataset itself is not utter garbage, only individual line items. That’s quite a small effort for individual users on a local scale of let’s say a 1km radius of your home/work.

If you think quality is not a factor in FB Places data, yikes.

I do not mention the confidence rating because I do not see a good use of it. Without some kind of survey, confidence is unverifiable to OSM users.

In the end I perfectly agree though: this is useful as an aid to survey. That is well put.

Which AI would attempt to add the data? I have not seen AI make changesets myself. AI is excellent for converting 1000 labels to potential equivalent tags, which for now I, and “someone else” (anybody interested), can help manually check for the right conversion, before then looking for existing matches on OSM to know what to throw out as already existing, then whittling down to opportunities worth validating.

Comment from DarkDays on 4 August 2023 at 08:23

I’ve started doing this the human way.

And I’ve written some Python to parse that into a dict, but I’ve not posted it anywhere.

Comment from cbeddow on 4 August 2023 at 11:56

@CjMalone do you mind if I update my JSON dict with some of your tags, then merge the rest of mine into your table? We can keep it evolving until it’s correct.

Comment from DarkDays on 4 August 2023 at 12:03

Yeah it’s purposely on the Wiki so that anyone can change it and improve it. I thought about having it in git, which would have been easier for consumption, but I wanted it more accessible.

Comment from SomeoneElse on 9 August 2023 at 12:57

If you think quality is not a factor in FB Places data, yikes.

“yikes” doesn’t even begin to cover it! Basically, one of the following is presumably the case:

  1. the OSM community has misinterpreted the data, and realistically Facebook don’t think that it’s 88% likely that there is a car valeting service operating in the north transept of York Minster.

  2. the data is very low quality (in OSM terms) but actually perfectly serviceable for Facebook’s data use, which was to drive potential customers to Facebook.

  3. Facebook actually think that this is “high quality data” (in OSM terms) because their view of “quality” does not match OSM’s.

I suspect (but obviously don’t know for sure) that “2” is what is actually going on here. It doesn’t mean that other Overture Maps members don’t have access to better data, just that that has not been made available in this release.

Comment from cbeddow on 9 August 2023 at 14:16

@SomeoneElse

  1. As I personally wrote, the confidence scores are not worth trying to use: there is no reference point.

  2. The data quality is variable by region for sure, but so is OSM’s. I am not sure we could say it’s lower or higher quality than the POIs already on FB (in my area in Switzerland, with a quite active community, many are 5+ years old and there is no system nor paid team to maintain them).

  3. I am not sure OSM has a definition of quality? Is there any written method of evaluating the quality for OSM?

Bottom line, all POIs are questionable until checked by a person, which is exactly the opportunity these present.

Comment from Minh Nguyen on 13 August 2023 at 16:33

Bottom line, all POIs are questionable until checked by a person, which is

The line below this line is whether a local community considers this dataset to be a good use of their time versus other potential data sources as a reference point for verification (either in person or in an armchair). There isn’t a single global answer to that question.

Comment from wille on 15 August 2023 at 19:18

Hi Chris! I’ve done a quality analysis and found the confidence score is quite precise: https://observablehq.com/d/9847c08c46f56ed6

Comment from cbeddow on 15 August 2023 at 20:44

Wow! Nice work @wille!
