This makes it possible to regenerate the test output. It also adds an update_test_data target to the Makefile.
Dictionary order is not stable in Python < 3.6, so we need to sort by key to get consistent results. The LogHandler output also differs on older Python versions. Also, don't stop running the Python tests after the first error.
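
A minimal sketch of the sort-by-key approach mentioned above. The helper name format_stable and the sample data are hypothetical and not taken from this change; the point is only that iterating over sorted(d.items()) gives the same output regardless of how the interpreter orders the dictionary.

```python
def format_stable(d):
    """Render a dict as 'key=value' pairs in sorted key order.

    Hypothetical helper for illustration: sorting the items makes the
    output independent of the dictionary's internal ordering.
    """
    return ", ".join("{}={}".format(k, v) for k, v in sorted(d.items()))


# The same mapping built in two different insertion orders
first = {"b": 2, "a": 1, "c": 3}
second = {"c": 3, "b": 2, "a": 1}

# Both render identically because keys are sorted before formatting
assert format_stable(first) == format_stable(second)
print(format_stable(first))
```

Sorting at output time keeps the test data deterministic without requiring a particular Python version.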