Also, optimizations you will never be able to pull off in assembly:
Updating ahash to a newer version got us to 18x.
Changing from f64::powf() to f64::powi() helped get us to 19x.
Reuse of intermediate allocations (in the case where you’re doing multiple segmentations in a row) got performance to 25x faster than Python.
And finally, the adoption of the triangular matrix approach got us to 95x.
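For a rough idea of what the last three items might look like in code, here is a minimal sketch. It is not the original project's implementation: `length_penalty`, `Triangular`, and all of their methods are hypothetical names, and the source does not say what the `powi` call computes or what the matrix stores. The sketch only assumes a table indexed by (start, end) with start <= end, so that roughly half of a square matrix never needs to be stored.

```rust
/// Hypothetical scoring helper. `powi` takes an integer exponent and is
/// typically cheaper than the general-purpose `powf`.
pub fn length_penalty(base: f64, len: usize) -> f64 {
    base.powi(len as i32)
}

/// Upper-triangular table stored in one flat Vec.
/// Only cells with start <= end exist, so n * (n + 1) / 2 cells are kept
/// instead of n * n.
pub struct Triangular {
    n: usize,
    cells: Vec<f64>,
}

impl Triangular {
    pub fn new(n: usize) -> Self {
        Self { n, cells: vec![0.0; n * (n + 1) / 2] }
    }

    /// Prepare the table for a new input of size `n`. `clear` keeps the
    /// existing heap buffer, so repeated calls reallocate only when the
    /// table has to grow.
    pub fn reset(&mut self, n: usize) {
        self.n = n;
        self.cells.clear();
        self.cells.resize(n * (n + 1) / 2, 0.0);
    }

    /// Map (start, end) to a flat offset. Row r holds n - r cells, so the
    /// rows before `start` occupy start * n - start * (start - 1) / 2 cells.
    fn index(&self, start: usize, end: usize) -> usize {
        assert!(start <= end && end < self.n);
        start * self.n - start * start.saturating_sub(1) / 2 + (end - start)
    }

    pub fn get(&self, start: usize, end: usize) -> f64 {
        self.cells[self.index(start, end)]
    }

    pub fn set(&mut self, start: usize, end: usize, value: f64) {
        let i = self.index(start, end);
        self.cells[i] = value;
    }

    /// Number of cells currently in use.
    pub fn cell_count(&self) -> usize {
        self.cells.len()
    }

    /// Size of the underlying allocation, exposed to show buffer reuse.
    pub fn capacity(&self) -> usize {
        self.cells.capacity()
    }
}
```

Keeping one `Triangular` around and calling `reset` between inputs is the "reuse of intermediate allocations" idea: the buffer is allocated once and then recycled across consecutive runs.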