Monday, December 27, 2010

Free Shapefiles of Countries of the World (3)




This is part three of the comparison of free shapefiles of countries of the world. After feedback, new sources, blog posts and tweets (thanks everyone) in response to my previous posts (part 1 and part 2), I decided to show a few more things.

Edited: Note that part 4 covers an additional shapefile, referred to as 13 here.

Let's now look at the north of Italy, where several countries are situated: Italy itself, France, Switzerland, Austria, Germany, Slovenia, Croatia, and some very small countries:
  • San Marino
  • Liechtenstein
  • Monaco
  • Vatican City
And I can hardly believe what I saw today: Vatican City has been moved by some shapefile creators...

Vatican City

With respect to Vatican City, all shapefiles that I found are wrong. Below is an image, created with Google Earth, showing boundaries. The yellow one is Google's, so not from a shapefile. It is accurate. I really want a shapefile with Google's country borders! I haven't found one yet, not even in KML format. All the others displayed (3, 5, 6, 7, 9) are not accurate, but acceptable. I know that Vatican City is a very small country and that on a world-scale map you cannot expect too much detail, so it makes sense that they display the country border only roughly. So these are all reckoned sufficient... with respect to the requirements; see more about that below.

im3b


But I was really amazed by the other ones, in the zoomed-out image below. Those (1, 2, 8, 12) are all wrong. (Remember that all numbers refer to the numbered sources from the previous blogs, listed completely below.) About 1: we already knew that it was far too rough, and that is actually understandable: it exaggerates small countries to show them on a world-scale map. But 2, 8 and 12 are completely wrong. Especially from 12 (source: Eurostat, europa.eu) you would not expect this. What has happened here...

im3a


Some sources (4, 11) do not have Vatican City at all.

Monaco

For Monaco it is a little better, in the sense that countries are not moved by 24 kilometers. But most shapefiles are more or less wrong, 12 being the only exception.

im2c

The yellow line is Google again, and 12 is the cyan one following it. All the rest are wrong.

Liechtenstein and San Marino

The other small countries, San Marino and Liechtenstein, are both displayed more or less acceptably in all but 1, 2 and 8 (all omitted in the pictures; for Liechtenstein, 8 is OK). Number 12 (cyan) also follows Google the most closely here.
im3d

Based on these pictures, 12 (from Eurostat) follows Google's country borders the most closely, with one amazing exception: Vatican City, where it is totally wrong. Among the rest of the sources there is no real winner.

Elba

Not a country, but also compared: the island of Elba is another example where the vertices of source 12 (Eurostat) follow the borders of the island quite closely. Cyan is the 3M version, blue the 10M, darker (and paler) blue the 20M. In the 60M version Elba is omitted. I agree with all of this. The other borders, magenta here below, are all worse. And I omitted 1, 2 and 8 here because they are even more wrong.

intersections_3e.png

Requirements

In the last two blogs I made remarks about the quality, but I didn't really list them as formal requirements. So let's start with that now. They are composed on the fly, and maybe a bit subjective, I know.
  • NL1: The five inhabited islands in the Waddensea in the Netherlands should be present
  • NL2: The inner lake in the Netherlands should be included
  • NL3: The polders in the Netherlands should be visualized correctly: absence of non-existing polders and presence of polders that have existed since 1968 (sometimes you have to make requirements quite explicit...)
  • CS1: The Caspian Sea should be there
  • DB1: Countries should be one multi_polygon per country, not separate polygons
  • DB2: All (multi)geometries should be valid within SQL Server 2008
  • DB3: All (multi)geometries should be valid within PostGIS
  • SM1: Vatican City should not be moved and (probably) not be omitted
  • SM2: Monaco's borders should follow real borders
  • SM3: San Marino's borders should follow real borders
  • SM4: Liechtenstein's borders should follow real borders
  • SM5: Elba's borders should be followed

In the table below the requirement list is crossed against the shapefiles.

#           | NL1 | NL2 | NL3 | CS1                 | DB1 | DB2 | DB3 | SM1       | SM2        | SM3        | SM4        | SM5
1           | OK  | OK  | -   | OK                  | OK  | 1   | 1   | Enlarged  | Enlarged   | Enlarged   | Enlarged   | Very poor
2           | OK  | OK  | -   | OK                  | OK  | 4   | 4   | 26 km off | Very poor  | Very poor  | Sufficient | Very poor
3           | OK  | OK  | OK  | OK                  | OK  | 34  | 5   | Sufficient | Poor      | Sufficient | Good       | Sufficient
4           | OK  | OK  | -   | polygons            | OK  | 1   | 6   | Omitted   | Poor       | Sufficient | Sufficient | Poor
5           | OK  | OK  | -   | OK                  | OK  | 5   | 4   | Sufficient | Poor      | Sufficient | Sufficient | Poor
6           | 2   | OK  | -   | -                   | OK  | 3   | OK  | Sufficient | Poor      | Sufficient | Sufficient | Poor
7           | 2   | OK  | OK  | -                   | OK  | 2   | 7   | Sufficient | Poor      | Sufficient | Sufficient | Poor
8           | 2   | OK  | -   | OK                  | OK  | OK  | OK  | 26 km off | Poor       | Poor       | Sufficient | Very poor
9           | OK  | -   | -   | OK                  | OK  | 17  | 9   | Sufficient | Poor      | Sufficient | Sufficient | Sufficient
9b          | OK  | -   | -   | OK                  | OK  | 3   | 3   | Sufficient | Poor      | Sufficient | Sufficient | Sufficient
9c          | 0   | -   | -   | OK                  | OK  | OK  | 1   | Sufficient | Poor      | Omitted    | Sufficient | Omitted
11          | 0   | -   | -   | OK                  | OK  | OK  | OK  | Omitted   | Poor       | Sufficient | Good       | Omitted
12 (3m)     | OK  | -   | -   | OK                  | OK  | 13  | 1   | 24 km off | Excellent  | Excellent  | Excellent  | Excellent
12 (10m)    | OK  | -   | -   | OK                  | OK  | 4   | OK  | 24 km off | Good       | Good       | Good       | Good
12 (20m)    | 2   | -   | -   | OK                  | OK  | 2   | OK  | 24 km off | Good       | Sufficient | Sufficient | Sufficient
12 (60m)    | 0   | -   | -   | OK                  | OK  | 2   | OK  | 24 km off | Sufficient | Poor       | Poor       | Omitted
13 (Edited) | OK  | -   | -   | as separate country | OK  | 47  | OK  | Excellent | Poor       | Excellent  | Excellent  | Excellent

All sources compared by image

The images below show the same picture for all shapefiles, as in the previous blog, but now showing these four small countries, Italy, and parts of the surrounding countries.
i1
  • (1)
  • from aprs
  • file world.shp

i2
  • (2)
  • from brothersoft, serving blue marble data
  • file World_countries_shp.shp
i3
  • (3)
  • from Geodan
  • file WorldCountries.shp

i4

i5
i6
  • (6)
  • from openmap here, called cntry02
  • file cntry02.shp
i7
  • (7)
  • from ESRI data, from this site
  • file cntry08.shp
i8
i9a
i9b
i9c
(10). Empty
i11
  • (11)
  • from huebler
  • file world_adm0.shp 
i12_3
  • (12 1:3m)
  • from eurostat
  • file cntr_rg_03m_2006.SHP
i12_10m
  • (12 1:10m)
  • from eurostat
  • file cntr_rg_10m_2006.SHP
i12_10m
  • (12 1:20m)
  • from eurostat
  • file cntr_rg_20m_2006.SHP
i12-60m
  • (12 1:60m)
  • from eurostat
  • file cntr_rg_60m_2006.SHP


Conclusion

Again thanks to commenter Alectora, who suggested Eurostat; it is a nice set of four datasets at different scales. The 1:3 million version is excellent by most measures, but it is a pity that Vatican City is displaced and that the IJsselmeer in the Netherlands is omitted. And that some polygons are not valid...

I still think a good and free shapefile of the world would be welcome.


Friday, December 24, 2010

Doxygen and QuickBook




Doxygen and QuickBook are both great tools for generating documentation from C++ code. But their cooperation is quite complex... let's say, a challenge.

Quickbook

QuickBook (QBK) is based on BoostBook, which used to be the documentation tool for the Boost libraries. BoostBook is in turn based on DocBook.

With QuickBook (and also BoostBook) you can make good looking library documentation describing headers, examples, syntax highlighting, etc. Look at the QBK website for more information. For example, I like the callouts.

QuickBook input files (extension .qbk) are plain text files with wiki-style commands. However, when using QuickBook alone it is not maintainable to document hundreds of classes and functions; you really need a link to the source code. And that is what Doxygen provides.

Doxygen

Doxygen is a standalone tool, accepting C++ source code (it accepts other languages as well) with JavaDoc-style comments to document function parameters, etc. Look at the website for a manual. While Doxygen does a great job, the generated documentation is often considered a bit too technical. The review report included:

Using Doxygen alone as a documentation tool has its (known) shortcomings when it comes to generic libraries.

And the individual reviews mentioned things such as: "I think doxygen is convenient for developers but not very useful for users, because of the high structure/content ratio.", "I really did not like the doxygen generated content. I found it difficult to find the real documentation among all the doxygen generated fluff." and "I have a general problem with doxygen generated docu with regard to generic libraries. It not easy to separate the important classes and fuctions from the implementation details. GGL helps with this through Grouping by modules and a overview Page, but it could be even better. I think the Modules should them self be further grouped, and the most important should be highlighted."

This does not mean that Doxygen is wrong or bad, and maybe we did not spend enough time to configure it right. An individual review also mentioned: "Doc: Excellent. Looks very nice. Only an index is missing - but extensive Doxygenation including Doxygen comments are far above what passes for documentation in many packages". We did our best at least.

Anyway, people do like QuickBook generated documentation, so we, with Boost.Geometry, are moving to QuickBook. But we want to keep Doxygen there to link with the source code.

Doxygen -> QuickBook via XSLT

It is certainly possible to let Doxygen and QuickBook work together. Doug Gregor wrote here:

(...) If I haven't scared you yet, I'll put it in a diagram:

C++ sources -> Doxygen XML -> BoostBook XML -\
                                              -> Merged BoostBook XML
Quickbook sources -> BoostBook XML ----------/

Then, for the output:

Merged                      /--> HTML
BoostBook --> DocBook XML ------> FO --------> PDF
XML                         \--> Man pages

Look complex? It is. (...)

So it is complex. Especially the XSLT file to convert from Doxygen XML to BoostBook XML proved complex and sometimes incomplete. We (Mateusz) spent time on it and achieved good documentation, but it remained really complex. And in the end you don't get QuickBook, you get BoostBook... Suppose you want something different, for example adding "complexity" information to the documentation: it means changing the XSLT files again.

What is in the docs?

I like to see in documentation, per function or per class:
  • What it does
  • The header where the function or class is found
  • Its parameters (for a function)
  • Its return type (for a function)
  • Its limitations
  • An example (this is very important)
  • An image (very important for a geometry-library, probably not for all libraries) to show what it is doing
  • Behaviour in various combinations
  • Its complexity
The cplusplus.com documentation has a little example for every function or method; I like that. It is great for programmers who program by example. Examples need to be compilable; they need to be extracted from real compiled code, otherwise old constructs, other obsolete things or typos will stay there unnoticed. You see that often.

About behaviour: within Boost.Geometry, one function (e.g. distance) can do many things (distance between two points, between point-and-polygon, etc). We also have multiple dimensions (2D, 3D) and coordinate systems. The behaviour of the functions might be different in different circumstances. Therefore, we want to have a possibility to document these differences explicitly.

Besides this, we often have two overloads, one with a so-called strategy, giving the library user the tools to change default calculation methods, or one without, so sticking to the defaults. Besides these differences, these overloads are the same. So documentation is the same. Where does the documentation come from? From the sources. Do we want to have the same documentation duplicated for two functions? No. Or probably not.

Don't repeat yourself, and macros

Doxygen does have an overload option, but it seems to work only for overloaded member functions. We are overloading free functions, which are often heavily templated. I didn't succeed in getting Doxygen's overload construct to work with our code.

Anyway, Doxygen provides a great utility: an alias. This is a sort of macro, expanding to whatever you specify. So when we think we are repeating ourselves (e.g. writing for a template parameter "Coordinate system (e.g. cs::cartesian)", which is needed in many places), we can introduce an alias there, param_macro_coorsystem, and use it in the documentation.

The result looks cryptic, but at least you don't repeat yourself and if you want to change that sentence, you do it in one place.
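Such aliases are declared in the Doxyfile. A minimal sketch of how aliases of this kind might be declared; the alias names follow this post, but the expansions shown here are invented for illustration, not our real ones:

# Doxyfile (sketch; expansions are illustrative)
ALIASES += param_macro_coorsystem="Coordinate system (e.g. cs::cartesian)"
ALIASES += brief_calc{1}="Calculates the \1 of a geometry"
ALIASES += return_calc{1}="The calculated \1"

With this, writing \return_calc{area} in a comment block expands to "The calculated area".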

To make the picture complete: QuickBook also provides a great utility: a macro. It does about the same thing: it is a placeholder which is replaced by QuickBook with the actual text.
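In QuickBook this looks like the following sketch; __sph__ and qbk_ret appear later in this post, but the expansions here are invented for illustration (QuickBook uses [def] for plain macros and [template] for parameterized ones):

[def __sph__ in a spherical coordinate system]
[template qbk_ret[what] returns [what]]

So [qbk_ret 0] would expand to "returns 0".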

So we can actually choose: write plain English, use an alias, use a macro. Or mix them.

Repeating yourself is usually not good, but in documentation it cannot always be avoided. Refactoring all repetitions out of comments and document fragments results in a very cryptic developer-defined language. Actually, it already looks cryptic now; read on...

Doxygen -> QuickBook via parser

As described above, the XSLT process to go from Doxygen to BoostBook was not completely satisfactory for us, with all our requirements for the documentation. It probably can be done, but it requires high XSLT skills, and who nowadays still writes XSLT? There are many libraries for parsing XML, in C++ (and in other languages), and for a C++ programmer it is quite easy to parse the Doxygen XML and output QuickBook. So that is what we did... We developed a new tool, doxygen_xml2qbk. It goes from the XML files generated by Doxygen to QBK.

The tool might be more useful in general (for more Boost libraries), but currently it does its job for Boost.Geometry.

It is currently (lightly) based on RapidXML (great library, by the way, and super fast).

The markup

So we document functions like this; I take the area function as an example:

/*!
\brief \brief_calc{area}
\ingroup area
\details \details_calc{area}. \details_default_strategy

So these lines use macros to say that it is calculating the area (brief), and again that it is doing that (details). The differences between brief and details are Doxygen differences, sometimes convenient, sometimes not. Then a remark-macro is added saying that this function takes the default strategy. That is the case for a lot of functions; therefore it is a macro, to avoid repetition. Then it continues:

\tparam Geometry \tparam_geometry
\param geometry \param_geometry
\return \return_calc{area}

So here the parameters are described. There is a template parameter (Doxygen's tparam) and a parameter (Doxygen's param). Note that they are similar: the parameter geometry (lower case) is of the type Geometry (camel case). This is the case for most of the functions, accepting sometimes one, sometimes two geometries. Therefore macros (\tparam_geometry) are introduced here... Below we will see that they can be handled on the same line, though not everyone is convinced of that feature. It might go away or be made flexible.

Avoiding repetition: it is the same for the return value, after Doxygen's \return. It would be possible to write here: "the function area calculates the area of the input geometry". For the function length we would write (about) the same comment (replacing area with length). For perimeter the same. Et cetera. So, to avoid repetition, we created the cryptic macro \return_calc, taking a literal string as parameter (in this case "area").

We continue:

\qbk{example,area_polygon}
\qbk{example,area_polygon_spherical}

So this is a Doxygen alias (\qbk) which generates an XML node called <qbk.example>; it generates two of them, one per line. Such an XML node is recognized by our new tool doxygen_xml2qbk, and that tool generates, as expected, an example section. So these two lines generate this QuickBook syntax:

[heading Examples]
[area_polygon]
[area_polygon_spherical]

Where area_polygon can be found in one of the examples, using a QuickBook construct to show a syntax highlighted example.

We continue:

\qbk{behavior,__0dim__:[qbk_ret 0]}
\qbk{behavior,__1dim__:[qbk_ret 0]}
\qbk{behavior,__2dim__:[qbk_ret the area]}
\qbk{behavior,__cart__:[qbk_ret the area] __cs_units__}
\qbk{behavior,__sph__:[qbk_ret the area] __sph1__}
\qbk{behavior,__rev__:[qbk_ret the negative area]}

This all goes to an xml.behavior node, and then to a QuickBook behavior section: a table describing the behaviour for 0 dimensions (point), 1 dimension (linear), etc. All this is macro'd, where __sph__ is a QuickBook macro and [qbk_ret] is a QuickBook macro as well. They can be nested. qbk_ret just says "returns". It has the same length; the only advantage is that you don't repeat yourself in words, but now in macros.

OK, using two macro systems intertwined, some macros having a specific meaning for our converter tool: this is quite complex, and I realize that. But on the other hand, if everything were written out in words here, it would probably be unmaintainable.

The results

Let's finally show the results here, as screen-dumps.

Doxygen generates this piece:
q1a

To be honest, we must say that examples (using syntax highlighting) within Doxygen are certainly possible as well.

With Doxygen-conversion-QuickBook the following piece is generated:

qb1b


So we see what our \qbk alias does here... It generates the sections behavior, complexity, examples, and might generate more.

I like it, despite its complexity.


Wednesday, December 22, 2010

Intersections (2), "recursive polygons"




The Boost.Geometry review report said:

Testing: several reviewers mentioned the need for a thorough testing framework allowing the verification of the correctness of the algorithms in a wide range of use cases. Different test strategies need to be employed, such as high volume and random tests, known border case tests, tests using different numeric precision types, etc.

This blog will discuss a part of this: a program called recursive_polygon.cpp, which is a test in the category "high volume and random", using different numeric precision types.

Recursively made polygons

In this test program, polygons are created in a recursive way. In the first step two polygons are created, at random places, each either a box or a triangle (in this case: a box with one coordinate omitted). The first version of the program was called recursive_boxes because it initially did only boxes. Those two polygons are unioned and the result is either a polygon or a multi-polygon. In the first step, it is probably a multi-polygon.

After this, two other such polygons, based on random coordinates, are unioned and result in another multi-polygon.

In the second step, the results of these two earlier unions are unioned again. This delivers a third multi-polygon. And so the process goes on, each time creating more complex multi-polygons; after a while holes and self-tangencies are generated.

The figure below shows the idea more clearly.

i2a

In this figure, a field of 3x3 is used and intersections are done up to level 3.

The recursive test structure and sequence can be compared to a genealogical Ahnentafel: the unions to marriages, the number 15 to the proband, having two parents, four grandparents, eight great-grandparents, et cetera.

Checking results

While we can run the program and see the results (the program optionally creates an SVG file), we have to have a mechanism to check if the results are correct.

Luckily we are doing geometry, we are doing mathematics, we are doing set theory, so we can use that to check our results.

Checks are done in each step, so during the processing of two polygons. Besides the union (u), an intersection (i) is calculated. The areas are calculated of the original polygons (p and q), of the union, and of the intersection. It must always hold that the area of the union equals the sum of the areas of both input polygons minus the area of the intersection. So: A(u) == A(p) + A(q) - A(i). A simple check, all done by the library itself.

So we can run the check thousands of times, for various levels, field sizes, boxes and triangles. It now also runs for counterclockwise polygons and open polygons.

It is not the case that it ran without errors the first time... There have been many errors, especially in the self-tangencies: points where two polygons in a multi-polygon touch each other. That has all been solved, so running this test was absolutely valuable.

Seven levels

After seven levels in a field of 10x10 we might get the next results:

i2c

At the left the two input polygons (green and blue), at the right the union in red. After these seven levels in the 10x10 field, the unions are that large that they nearly completely fill up the whole area.

Last month I did most tests in this configuration, 10x10 and up to 7 levels. 7 levels results in 255 tests, which is 2^8 - 1.

Twelve levels

Twelve levels results in 2^13 - 1 = 8191 unions and shows in a 100x100 field like this:
1c

and we can go further and further, but for Boost.Geometry the tests do not give problems anymore. It all runs fine. This test (so 8191 unions and intersections, ending in this complexity) runs in two seconds on my current machine (actually, to be precise, it does each union twice, because the overlay function is reused in other tests as well).

Unit test

It is a sort of unit test, because the program checks itself, and I also use it as one. But note that the program is based on random input, and that a 30-level test results in 2^30 - 1 unions and intersections (~1G). So it runs for a while, and including this in the standard Boost test suite would not be appreciated. So this recursive_polygons can be run manually, and there are parameters to specify the number of levels, et cetera.

By the way, there are many real unit tests within Boost.Geometry: there are currently 118 source files using the (great!) Boost.Test library, and some of them use polygons created by this test program as their input.

Conclusion

This recursive polygon unit test has been very valuable and, besides that, such tests are nice to program. The whole program, links are given above, is rather short, and it is of course Open Source. This test might be useful for other libraries as well.

Sunday, December 19, 2010

Intersections (1)




The boolean operations are probably the most interesting parts of the Boost.Geometry library.

Boolean operations are, based on two polygons:
  • intersection, returning polygons containing areas present in both input polygons
  • union, returning polygons containing all areas present in either one of the input polygons
  • difference, returning polygons containing areas present in one of the input polygons, but not in the other
  • symmetric difference, returning polygons containing areas present in either one of the input polygons, but not in both
In other words, boolean operations are spatial implementations of mathematical set theoretic operations. Based on two polygons A and B:
  • intersection: A and B, also written as A ∩ B
  • union: A or B, also written as A ∪ B
  • difference: A and not B, also written as A \ B
  • symmetric difference: A xor B, also written as A ∆ B
The difference operation (also known as the relative complement) is the only one of these that is not commutative: A \ B results in a different geometry than B \ A.

The basics

1a

This blog will not explain the whole algorithm; that would be far too long. Future blogs or a future article will explain more details. The algorithm used in Boost.Geometry is an adapted version of the Weiler-Atherton algorithm. The basics are quite simple and explained here, and also in adapted versions such as Greiner-Hormann's. Note that I have adapted the algorithm myself, such that intersection points are not inserted into the originals but stored separately. This is essential because we don't want to change the input polygons, and we don't want to copy them either. I have made more adaptations, such that all situations including self-tangencies (in some articles called degeneracies) and robustness are handled.

So only the very basics in this blog. There are two input polygons, A and B (here: green and blue), both oriented clockwise. The arrows indicate the orientation at the starting points. Intersection points (orange) are calculated. During the calculation of the intersection points, traversal information is gathered: at an intersection (usually: go to the right), should traversal continue following the green or the blue outline? For example, at point 0 above, it is determined that for an intersection the green line should be followed, and for a union the blue line; both until the next intersection point is encountered. That requires sorting, which is used heavily within the algorithm.

Turning right for intersections or left for unions is also visualized below, from an older part of the GGL-doc:

1f

How traversal information is figured out, and how intersection points are sorted, will be explained in another blog, maybe much later.

So intersection points and turn information are basically all that is necessary up to this point. The traverse function uses this information, and adds points to the accumulated output polygon. Added points can be either intersection points or original vertices. The output intersection then looks like this (slightly shifted by hand to show it better):

1c

Note that it can become (often much) more complicated, holes, intersections at vertices, self-tangencies, polygons-within-holes, multi-polygons, etc. 

Here the process for the intersection of a polygon with a polygon with two holes:
qg

Note that the numbers of the intersection points are shown in the order found, not in the order of traversal. It is not possible to number intersection points in traversal order, because that order differs between intersection and union (check this in the picture above). The (manually added) red arrows depict the traversal for the intersection, starting at the red "0".

Note also that holes are not a big addition to the algorithm, provided they are oriented counterclockwise in clockwise polygons (and vice versa). This is an assumption made within Boost.Geometry. If this is the case, traversal can always take its routes in the same way. The union operation would create two holes, which are different from the original holes and stored automatically in the correct orientation.

All boolean algorithms are the same

I mean to say here: intersection, union, and (symmetric) difference are all the same.

We have seen above that for intersection and union the same information is gathered. Only one variable, the traversal direction, determines the output. So, easy: we have one algorithm with which we calculate intersection or union.

Now for the difference: it turns out that if we reverse one of the polygons (e.g. the green one), such that it is oriented counterclockwise, the difference is calculated automatically by calculating the intersection of that reversed A with the still-normal B. That is not so strange, because it follows from a mathematical rule: the difference of A and B is the same as the intersection of the complement of A with B. And the complement of A (Ac) is just its reverse. Voila, we get another operation from the same algorithm.

The complement (from older GGL-doc) of polygon A is the whole world but A.
1e

The symmetric difference, A xor B, is just the addition of the two differences: it is also defined as (A and not B) unioned with (B and not A).

Concluding: all the operations intersection, union, difference and symmetric difference are calculated with this one algorithm. The algorithm consists of several sub-algorithms, relevant for all operations, and I hope to describe them later in more detail.

Reversing

How do we take the reverse in the algorithm? In the past, I cheated and reversed the polygon before running the algorithm. That was easy, but it has two drawbacks: 1) the input must be copyable and 2) it takes time to reverse. So we don't do this anymore. We just iterate through it in reverse. That is easy enough, using an iterator walking backwards through the vertices. Or, in the case of Boost.Geometry, we use a reversible_range, as explained in a previous blog. It is all implemented using specializations and meta-programming.

The difference is shown below. On the left side the result, again shifted slightly to make things more clear. The right side shows the same but in a different way: the intersection of the complement of A (green) with B (blue).

1b2

The one algorithm handles all geometries

We have seen above that one algorithm is enough and that we can walk in reverse through the polygons to support differences. What if we have counterclockwise polygons? The answer is the same: we walk through them using our reversible ranges, which are in fact Boost.Range reverse_range adaptors. So we walk backwards through them while getting intersection points, later when vertices are necessary for the output, and actually everywhere they are used.

So we can handle clockwise and counterclockwise input polygons; we can even handle one clockwise and one counterclockwise, and the output can also be clockwise or counterclockwise as desired. That last feature is implemented by appending points during traversal to the back of the output rings, or inserting them at the front.

The algorithm also handles combinations of rectangles and polygons, because a rectangle can be seen as a polygon. The only difference is that the calculation of intersection points is done more efficiently. Furthermore, multi-polygons can be handled. Note that the intersection of two polygons can result in a multi-polygon.

So this seems a litany of what the algorithm can handle, and it does not stop here. Because Boost.Geometry algorithms are independent of their input point types, and also independent of their input polygon types and other geometry types, the algorithm can handle any geometry type based on any point type. So it can handle 4-byte floats but also 8-byte doubles or high-precision types like TTMath. It can handle boost::geometry polygons but also ESRI polygons, provided those polygons are adapted to Boost.Geometry.

Another aspect of this algorithm is that it is coordinate-system agnostic. It can handle cartesian polygons, but can also calculate intersections of spherical polygons. This is possible because the algorithm is built on sub-algorithms such as calculating the intersection points, and finding the side (left/right) of a point with respect to two other points. If those sub-algorithms are provided (currently for spherical they are not...), the intersection algorithm will handle them.

The fastest...

Finally, the implementation of the boolean operations in Boost.Geometry is also the most efficient intersection implementation that I know of. If people don't believe this, or want to test it, or help benchmarking: there are Open Source benchmarks. Other people are more independent than I am; help would be welcome.

Below, the intersection of two multi-polygons with self-tangencies; the intersection is again shifted slightly. This comes from one of the robustness tests.

1d

Final remarks

Using the reversible_range for the (symmetric) difference, instead of reversing geometries, was implemented only yesterday. So all the above was planned but, until yesterday, not implemented as it is today. Another step forward.

The review report (from now more than a year ago) said:

Boolean operations: while the library provides a set of Boolean operations those seem to be not complete and/or robust in terms of clockwise/counterclockwise geometries, closed/open polygons. Robust Boolean operations are a strong requirement, so this should be fixed as reported by at least one reviewer.

This point is now dealt with. There are still a few tweaks to be done, but they will disappear in the coming weeks. I didn't discuss robustness in this blog, nor open polygons, but these issues are already addressed too.

Don't think you can implement boolean operations easily, even though the basic algorithm is quite simple... There are about 100 cases for the extraction of turn information from two intersecting segments. Besides that, dozens of cases are necessary for self-tangencies; in GIS they are not so ubiquitous, but in games they are certainly important. I started the algorithm in January 2008 and rewrote it twice after that: once before the review, and once after it because there were issues with the implementation, self-tangencies, and robustness. Now it should be ready for production.

The implementation is heavily based on C++ generic programming: templates, specialization and meta-programming. I don't think it is easily translatable into another language (except perhaps D). Also in C#, where specializations (at runtime) are possible according to one of my previous blogs, metaprogramming is not possible: it is a compile-time mechanism.


Monday, December 13, 2010

Range adaptors


I seem to like doing blogs in series... (Or ranges ;-) .) See my previous blog about ranges here. This is a short bonus.

Since Boost 1.43, released in May 2010 just before BoostCon'10, the Boost.Range library has had Range Adaptors. Range adaptors enable users to change range behaviour on the fly, for example to traverse a range in reverse order.

Reversing through geometries

See this small and working snippet (P can be any 2D point type, for example geometry::model::d2::point_xy&lt;double&gt;):

using namespace boost;
geometry::model::linestring<P> ls;
geometry::read_wkt("linestring(1 1,2 2,3 3,4 4)", ls);
std::cout << geometry::wkt(ls) << std::endl;
std::cout << geometry::wkt(ls | adaptors::reversed) << std::endl;

This will print out:

LINESTRING(1 1,2 2,3 3,4 4)
LINESTRING(4 4,3 3,2 2,1 1)

So the original one and the reversed one. It works great, and with great syntax.

To make this work, I had to adapt the Boost.Range reverse_range to Boost.Geometry, such that Boost.Geometry knows the adapted range is a linestring. In this case it is a linestring, but it might also be a ring (also a range), or a multi-point (also a range), etc. So I adapted it like this:

namespace traits
{

template<typename Geometry>
struct tag<boost::range_detail::reverse_range<Geometry> >
{
    typedef typename geometry::tag<Geometry>::type type;
};

}

Just meta-programming: if a reverse_range is specified, its tag is defined as the tag of the unreversed geometry. This works for all these geometries because all of them are very light-weight, essentially only requiring the tag metafunction in namespace traits. (Well, one exception: the linear_ring (ring for short) may also optionally define point order and closure.)

We do the same for the other Boost.Range adaptors, where relevant, and then have filtered, sliced, strided, uniqued, and reversed all out of the box. Great! I already liked Boost.Range, and this new feature is cool as well.

Why ranges are cool, revisited

We see here again why ranges can be preferred over iterators, and we make some other statements:

First we state in another way (than in the previous blog) that ranges are like normal objects:
Ranges are first class citizens, while iterators should be implementation details

And then we make a statement about the cool adaptors we encountered in Boost.Range:
A range allows overloading operators, in contrast with iterators

reversible_view

Before range adaptors were there in May 2010, I developed a small utility class called reversible_view, through which ranges are traversed in either forward or backward direction.

This reversible_view class is still there, but now greatly simplified: the reverse case is expressed in terms of the reverse_range adaptor. It is now defined like this:

enum iterate_direction { iterate_forward, iterate_reverse };

template <typename Range, iterate_direction Direction>
struct reversible_view {};

template <typename Range>
struct reversible_view<Range, iterate_forward>
{
    typedef identity_view<Range> type;
};

template <typename Range>
struct reversible_view<Range, iterate_reverse>
{
    typedef boost::range_detail::reverse_range<Range> type;
};


All meta-functions. Seemingly empty classes, but of great value. In the first version they were not meta-functions but normal classes, derived from something or defining something, and therefore they needed their own constructors, etc. These versions are only typedefs; they do not define any code, and are therefore simpler and therefore better. They exist as classes because C++ has no templated typedefs.


closeable_view

Boost.Geometry also has a closeable view, but alas I didn't find this among the Boost.Range adaptors, for obvious reasons (the adaptors strip, replace or adapt, but do not add things). So the closeable_view is still there, but now also expressed as meta-functions: the actively-closing one is a real implementation, the already-closed one is just another type definition.


Sunday, December 12, 2010

Range, return types and references


During BoostCon 2009 the keynote speaker was Andrei Alexandrescu, who wrote the book Modern C++ Design in 2001. Andrei gave a fabulous presentation named "Iterators Must Go", in which he pointed out several serious drawbacks of iterators. Iterators have been the basic mechanism for using the Standard Template Library classes, so this speech was revolutionary, and enjoyable. Andrei not only criticized iterators, he also gave an alternative: ranges. For him, ranges are the basic building block.

The keynote was the forerunner of an article that Andrei wrote, now with the title "On Iteration". The title changed into a more subtle one, but the message is more or less the same. I quote: "STL iterators are marred by lack of safety, difficulty of usage, difficulty of definition".

Before BoostCon '09 with its famous keynote, I already knew the Boost.Range library quite well. Most of the algorithms in Boost.Geometry are based on ranges; we use Boost.Range everywhere. Ranges are very nice. A std::vector is a range, a std::pair of iterators is also handled as a range, a boost::array is also handled as a range, and you can create your own ranges. Yes, like iterators, but ranges are more elegant.

Andrei Alexandrescu mentioned Boost.Range during his keynote, of course. He said that Boost.Range was a good start, but should be worked out much further. Until now, that has not been realized as far as I know. But even without changes, Boost.Range is far more convenient than iterators.

Iterators

People reading this blog will know iterators; otherwise they would have stopped reading here. However, I like to explain the very basics shortly because of the point that Andrei made: iterators are unnecessarily difficult. I've always found that. This is how standard iterators work:

std::vector<int> a;
// fill a
for (std::vector<int>::const_iterator it = a.begin(); it != a.end(); ++it)
// do something with *it

When I first worked with the standard library, somewhere in 199X, I refused to do this and wrote:

for (int i = 0; i < a.size(); i++)
// do something with a[i]

This is half the size, easier to write, easier to read. But it only works for random-access containers, and it can be less efficient, so it should not be done like that.

Note also that C++ programmers always write ++it and not it++, because for non-trivial iterators it is slightly (and detectably) more efficient.

Boost.Ranges

A Boost.Range is a sort of view on a std::vector, or on any other iterable instance. Boost.Range still defines and works with iterators; it is another level of abstraction above the standard library. With Boost.Range you iterate like this (still over vector a):

for (boost::range_iterator<std::vector<int> const>::type it = boost::begin(a); it != boost::end(a); ++it)
// do something with *it

Hey, this is even longer! Yes, in this sample that is true, and in general it is true, but ranges are normally (at least within Boost.Geometry) used in a template environment, where it becomes:

for (typename boost::range_iterator<Range const>::type it = boost::begin(a); it != boost::end(a); ++it)

Not shorter but more generic: a Range can be anything.

There are shortcuts as well, with BOOST_FOREACH or BOOST_AUTO. They are macros, so often avoided, but they make things more readable and can save template metaprogramming, see below. But first we will see why ranges are so convenient.

From Iterators to Ranges

Boost.Geometry supports geometry algorithms such as distance. They are defined in a generic way, so


template <typename Geometry1, typename Geometry2>
double distance(Geometry1 const& g1, Geometry2 const& g2)

(The real version does not have the double there but a metafunction). With distance users can calculate the distance between two points, or a point and a line (linestring), or a point and a polygon, etc.

The earlier version of Boost.Geometry (then called GGL), before moving to ranges, could also calculate the distance to a part of a linestring by specifying two iterators:

template <typename Geometry1, typename Iterator>
double distance(Geometry1 const& g1, Iterator it1, Iterator it2)

This is great functionality, but it is completely inconvenient to implement:
  • all other functions are generic, having two arguments; why does this one have three arguments?
  • all other functions are reversible: the distance from a point to a line and from a line to a point both result in the same implementation. For iterator support we would need to create an overload having the iterators first
  • there are more versions (for example: with a strategy); we would need those overloads for those too, leading to a combinatorial explosion (and there are more algorithms too...)
When we discussed the GGL previews on the Boost list, Mathias Gaunard suggested using ranges instead of iterator pairs; we took this over and were very glad with this change.

This is quite an argument for ranges instead of iterators in a template-library environment, and this is why I emphasize it at this point.

So let me state: Ranges are compatible with other objects, while iterator-pairs are not.

References

After this lengthy introduction we consider how iterators and ranges are passed.

Iterators are normally passed by value. The main reason for this is explained nicely on StackOverflow: by passing by value you can create local copies, like v.begin().

Ranges are normally passed by reference. Ranges might be non-copyable containers, so passing by reference is required. Besides the non-copyable issue, a std::vector is a range, and you definitely don't want to pass a std::vector by value.

Boost.Geometry's polygon

Boost.Geometry contains models of geometries like linestring and ring. They are defined as a std::vector or a std::deque, but handled in the code as a range. Any linestring or ring fulfilling the Range Concept can be handled by Boost.Geometry; at least, that is the basic idea. A polygon is more complex because, in the definition of OGC and Boost.Geometry, it might contain holes. So a polygon has an exterior ring and zero or more interior rings defining the holes. The exterior ring fulfills the Ring Concept; the interior rings form a Range of Rings, so a Range of Ranges, a collection of rings.

Boost.Geometry contains a free function exterior_ring which returns the exterior ring by reference. It also contains a free function interior_rings where the collection of interior rings is returned by reference. Both functions have a const and a non-const version. Of course they don't return the rings by value: polygons can be huge geometries; I recently processed a polygon of 6 megabytes (in WKT format). They should never be returned by value.

Other polygons

With Boost.Geometry, geometry classes of other libraries can be adapted, such that Boost.Geometry can run algorithms on them as if they were its own geometry classes. There are examples showing this with libgd's points, wxWidgets' points, Qt's polygons, etc. With traits classes these geometry classes can be adapted.

However, until today, polygons were required to return the exterior ring and interior rings by reference. But what if those geometries do not use ranges?

We encountered this adapting Boost.Polygon, our sister library within the Boost family. Boost.Polygon defines a polygon-with-holes concept, where iterators begin_holes and end_holes are returned.

So after successfully adapting Boost.Polygon's point, rectangle and polygon (without holes, so we call it a ring) to our library, I started with the polygon with holes. It is no problem that Boost.Polygon does not use the range concept internally, because we can build a sort of proxy, simulating a range. So the adaptation of a Boost.Polygon polygon-with-holes consists of defining:
  • an iterator to iterate through holes
  • a ring-proxy with an iterator-pair (point-begin, point-end)
  • an interior-ring-collection-proxy with an iterator-pair (holes-begin, holes-end).
But there was one problem:

Returning local references?

Local variables can of course not be returned by reference. So you can create ring proxies, but they cannot be returned... that is to say, they cannot be returned by reference. Of course, they can be returned by value.

There are several solutions to this problem:
  1. Change the Boost.Geometry polygon concept to work with iterators, so let it return iterators. This would be quite a lot of rework, and sounds like going ten years back. So discarded.
  2. Change Boost.Polygon's polygon concept. But that is of course not desired and not possible; Boost.Polygon is now an incorporated library. Besides that, Boost.Polygon is just an example here: many polygon libraries will have polygon types without ranges internally. For example, shapelib's geometry is just a set of pointers... Boost.Geometry should handle any geometry. So discarded.
  3. Let Boost.Geometry's polygon concept always return ranges as values. This is possible, but the return value for its own polygon should then not be the ring itself but a proxy containing a reference to the ring.
  4. Let Boost.Geometry's polygon concept return ranges sometimes as values, sometimes as references.
I chatted with my companion Bruno Lalande about this. He was of the opinion that we definitely should not go back to iterators, and he suggested the fourth solution. Let me quote Bruno: in Boost.Geometry, generic functions like exterior_ring() or interior_rings() that only forward the job to the actual adapted function shouldn't make any assumption on their return type. Currently they seem to be generic in this regard since they get the return type by calling a metafunction, but they're not fully generic because they arbitrarily add a "&" to it. Boost.Range returns ranges by reference and by value.

Andrei Alexandrescu returns ranges in his article, in various places. We might change our own polygon type, returning proxies to ranges, and returning them by value; this would be solution 3. However, as long as this is not necessary, we prefer solution 4, to avoid an unnecessary proxy in between, and to enable range-based polygon types to return references to ranges.

Exterior rings should sometimes be returned as values, and sometimes as references.

So I changed the Boost.Geometry library into range-return-type agnostic behaviour.

So let me summarize: Ranges can be returned as values or as references

More metaprogramming

To implement solution 4 mentioned above, Boost.Geometry's polygon needed two characters extra: two ampersands. So its traits function ring_type now reads:
template
<
    typename Point, bool ClockWise, bool Closed,
    template<typename, typename> class PointList,
    template<typename, typename> class RingList,
    template<typename> class PointAlloc, template<typename> class RingAlloc
>
struct ring_type
<
    model::polygon
        <
            Point, ClockWise, Closed,
            PointList, RingList, PointAlloc, RingAlloc
        >
>
{
    typedef typename model::polygon
        <
            Point, ClockWise, Closed,
            PointList, RingList,
            PointAlloc, RingAlloc
        >::ring_type& type;
};

where the ampersand is added... Meaning, the ring-returning-type is now a reference!

Of course some more changes were necessary, but not many more. The free function exterior_ring now uses a meta-function to define its return type. That meta-function uses the traits function ring_type, so it returns sometimes a reference, sometimes a value. For const and non-const, another metafunction is created:
template <typename T>
struct ensure_const_reference
{
    typedef typename mpl::if_
        <
            typename boost::is_reference<T>::type,
            typename boost::add_reference
                <
                    typename boost::add_const
                        <
                            typename boost::remove_reference<T>::type
                        >::type
                >::type,
            T
        >::type type;
};

I like metaprogramming. If the type is a reference, this removes the reference, adds a const, and adds the reference again. If it is not a reference, it keeps it unchanged. This is type calculation: metaprogramming.

Iterating through const/non const collections of values/references

At every place where Boost.Range iterators were used to iterate through interior rings, a change was necessary. This would have been possible with another metafunction, but Bruno suggested using BOOST_AUTO here. This avoids new metafunctions and makes all iterations more readable. It works perfectly for both const and non-const iterators, so in fact it is all much simpler than before. Loops are now written like this:

for (BOOST_AUTO(it, boost::begin(interior_rings(polygon)));
     it != boost::end(interior_rings(polygon));
     ++it)


This BOOST_AUTO macro, which anticipates the C++0x auto keyword, saves defining and declaring metafunctions all the time. It is a macro (yuck) but very valuable here, also because if we change things another time, this code can stay as it is.

One thing is noteworthy: there are two calls to interior_rings, returning (in the return-by-value case) two copies. Because these copies are expected to be proxies containing references to the original (single) object, the iterator behaviour will be all right: they both point to the same original range.

However, better safe than sorry, better defensive, so we should write here:

BOOST_AUTO(irings, interior_rings(polygon));
for (BOOST_AUTO(it, boost::begin(irings)); it != boost::end(irings);  ++it)

which might be also more efficient in some cases.

Conclusions

  • Ranges are great.
  • Ranges are compatible with other objects, while iterator-pairs are not
  • Ranges can be returned as values or as references
  • Boost.Geometry's polygon concept uses ranges, allowing adaption of other polygon types using implementation of range proxies

Wednesday, December 8, 2010

Free Shapefiles of Countries of the World (2)


Note: read also part 3 and part 4 of this blog.

After input on my recent blog (thanks!) and after finding two more sources myself, here is another post about this.

To judge the quality of a dataset, I use my own country (the Netherlands) as the primary measure. I have a mental map of my country and know how it should look. In many of the datasets I found, it looks completely destroyed; personally, I would therefore discard those at once. Other datasets depict my country quite precisely. Maybe it is not only because I live here, but also because of the topography: it has a large inner lake, it has some islands, and it has a distinctive shape.

So here the previous datasets, and four more, and some variations, come again...

world1
  • (1)
  • from aprs
  • file world.shp
  • from 2002
  • 244 countries
  • Missing a polder that has existed since 1968

world2
  • (2)
  • from brothersoft, serving blue marble data
  • file World_countries_shp.shp
  • from 1996
  • 239 countries
  • The shape is completely destroyed

world3
  • (3)
  • from Geodan
  • file WorldCountries.shp
  • from 2003
  • 253 countries
  • The Netherlands are depicted correctly

world4
  • (4)
  • from mapping hacks
  • file world_borders.shp
  • from 2004
  • 251 countries, 3784 polygons
  • Showing a polder which was never realized

world5
  • (5)
  • from thematic mapping
  • file TM_WORLD_BORDERS-0.3.shp
  • from 2008
  • 246 countries
  • Showing a polder which was never realized

world5b
  • (5b)
  • also from thematic mapping but the simplified version
  • file TM_WORLD_BORDERS_SIMPL-0.3.shp
  • also from 2008
  • also 246 countries
  • Badly simplified

world6
  • (6)
  • from openmap here, called cntry02
  • file cntry02.shp
  • from 2002
  • 251 countries
  • Showing a polder which was never realized, and missing three islands

world7
  • (7)
  • from this site ESRI data
  • file cntry08.shp
  • from 2008
  • 249 countries
  • Missing three (inhabited) islands in the north. For the rest it looks like a good map at this scale

world8
  • (8)
  • from vdstech via this blog
  • file world.shp
  • from 2001
  • 251 countries
  • Badly simplified

world9
  • (9)
  • from natural earth
  • file 10m-admin-0-countries.shp 
  • from 2009
  • 251 countries
  • Missing the inner lake, the IJsselmeer

world9b
  • (9b)
  • from natural earth
  • file 50m_admin_0_countries.shp 
  • from 2010
  • 240 countries
  • Missing the inner lake, the IJsselmeer

world9c
  • (9c)
  • from natural earth
  • file 110m_admin_0_countries.shp 
  • from 2010
  • 177 countries
  • Simplified too much

(10). I wanted to include VMAP0 / VMAP1 here. However, my little research concerns a single shapefile of countries of the world. VMAP0 is great, but letting users figure out all the detailed files and merge them for the four parts of the world is not the purpose I started this research with... so no picture here. Thanks for the tip anyway.
world11
  • (11)
  • from huebler
  • file world_adm0.shp 
  • from 2002
  • 209 countries
  • Missing the inner lake the IJsselmeer


Based on this series, (3) would probably be the best choice, though it is rather heavily detailed. Not so surprisingly, it was created in the Netherlands... so we cannot really judge its quality from this series alone. Furthermore, (7) would be a reasonable choice, but we saw in the previous blog that (7) is wrong with respect to the Caspian Sea. The new dataset (9b) looks quite good to me for a map at world scale, but alas it misses an important lake. We might see more in another blog, for example showing the countries around the Caspian Sea, or the countries of the former republic of Yugoslavia.


Conclusion

I think a good and free shapefile of the world is still welcome.