Country data dumps from OpenStreetMap in GeoJSON format
This is not a script you simply download on your laptop and run. It requires lots of RAM, CPU, disk and time to run.
With such huge data, performance is a core concern. Every step must be tuned to avoid waste.
Sometimes, it is not even clear whether an area is an independent country or not. Take Kosovo for example. Half of the world considers it a sovereign country, the other half considers it a part of Serbia. See https://en.wikipedia.org/wiki/International_recognition_of_Kosovo for more details.
For example, here is an enclave of Belgium in the Netherlands. It's a single city called Baarle divided into two countries like a jigsaw puzzle.
Or here, that's a real house number in Vietnam:
Or here, houses in Mannheim center do not belong to streets (which are unnamed) but to "blocks":
There are cases where it's not clear where the actual border is, because of conflicts or disputed territory.
Postal codes are also crazy. Some countries use them, others don't. Some are numeric, others are not. Some are aligned with administrative borders, others aren't. Streets might belong to different postal codes on different sides or segments...
If one thing is sure, it's that addresses are extremely diverse and full of unexpected things.
"Closed Ways" can be interpreted as either a "way" or an "area", depending on what they represent.
By default, osmium-tool exports closed ways as both LineString and MultiPolygon.
This would cause a lot of duplication: every house, for example, would be present twice in the resulting dataset.
Therefore, a heuristic is applied. The upside is no duplicates, the downside is that some closed ways might be lost if they do not fit any of the categories defined in **
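To make the trade-off concrete, here is a minimal sketch of what such a heuristic could look like. The tag keys below are hypothetical and chosen only for illustration; they are not the categories this project actually uses.

```python
# Hypothetical tag keys, for illustration only; the project's real
# categories are defined elsewhere and may differ.
AREA_KEYS = {"building", "landuse", "natural", "leisure", "amenity"}
LINEAR_KEYS = {"highway", "barrier", "waterway", "railway"}


def classify_closed_way(tags):
    """Return "area", "line", or None (dropped) for a closed way's tags."""
    # An explicit area=yes/no tag overrides the key-based guess.
    if tags.get("area") == "yes":
        return "area"
    if tags.get("area") == "no":
        return "line"
    keys = set(tags)
    if keys & AREA_KEYS:
        return "area"
    if keys & LINEAR_KEYS:
        return "line"
    # Matches no category: this is the case where a closed way gets lost.
    return None
```

With these assumed sets, a closed way tagged as a building is exported once as an area, a roundabout tagged as a highway is exported once as a line, and a closed way carrying only unrecognized tags is dropped.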
At the beginning, we used the extracts from Geofabrik. However, they were not accurate enough for us. Geofabrik provides "simplified country polygons". This is good for average usage, but in our case it caused issues for address extraction. In areas close to the country borders, addresses would be assigned to the neighboring country or, worse, be missing. Extracting addresses precisely requires country extracts that match the country boundaries exactly, which is why they are built here.
Building the country extracts is also a two-step procedure. First, the planet is divided into (sub-)regions, then each one is further split into countries. This is a balance between run time and memory consumption: extracting all countries at once consumes too much RAM, while extracting countries one after another takes too long, since every pass has to read the whole planet again.
- Fetch the whole OSM planet via torrent
- Extract all country borders
  - Build extracts/{COUNTRY_CODE}-borders.geojson for each country, containing its border polygon
  - Build temp/{CONTINENT}-borders.geojson for each continent by bundling some countries
- Extract the continents
- Extract the countries from continents
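
The last two steps can be sketched as generating one config file per pass for `osmium extract`, which accepts a JSON config listing several extracts to cut in a single read of the input. This is only an illustration under assumptions: the continent grouping, the output file names and the helper `build_extract_config` are hypothetical, and only the border file paths follow the naming used above.

```python
import json


def build_extract_config(names, borders_dir, output_dir):
    """Build an `osmium extract` config dict with one extract per name."""
    return {
        "directory": output_dir,
        "extracts": [
            {
                # Hypothetical output name, not the project's convention.
                "output": f"{name}.osm.pbf",
                "polygon": {
                    "file_name": f"{borders_dir}/{name}-borders.geojson",
                    "file_type": "geojson",
                },
            }
            for name in names
        ],
    }


# Hypothetical grouping; the real continent-to-country bundling may differ.
CONTINENTS = {"europe": ["DE", "FR"], "asia": ["VN", "JP"]}

# Step 1: one pass over the planet, one extract per continent.
planet_config = build_extract_config(sorted(CONTINENTS), "temp", "temp")

# Step 2: one pass per continent, one extract per country in it.
country_configs = {
    continent: build_extract_config(countries, "extracts", "extracts")
    for continent, countries in CONTINENTS.items()
}

print(json.dumps(planet_config, indent=2))
```

Each config would then be written to disk and passed to `osmium extract --config <file> <input>`. The planet is read only once, and each country pass reads a much smaller continent file, which keeps the number of extracts held in memory at any one time small.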