cogl_matrix_project_points and cogl_matrix_transform_points had an
optimization for the common case where the stride parameters exactly
match the size of the corresponding structures. The code for both when
generated by gcc with -O2 on x86-64 use two registers to hold the
addresses of the input and output arrays. In the strided version these
pointers are incremented by adding the value of a register and in the
packed version they are incremented by adding an immediate value. I
think the difference in cost here would be negligible and it may even
be faster to add a register.
Also GCC appears to retain the loop counter in a register for the
strided version but in the packed version it can optimize it out and
directly use the input pointer as the counter. I think it would be
possible to reorder the code a bit to explicitly use the input pointer
as the counter if this were a problem.
Getting rid of the packed versions tidies up the code a bit and it
could potentially be faster if the code differences are small and we
get to avoid an extra conditional in cogl_matrix_transform_points.