"To understand the speedup of Winograd over direct convolution, consider an example. With a 3×3 filter, direct convolution needs 9 FMA (fused multiply-accumulate) operations to compute one output element, exactly one per filter element. With the 2×2 Winograd transform considered here, a 2×2 block of output pixels is computed in one shot. This requires working on a 4×4 block of inputs (the “field of view” of those four outputs) and transforming the 3×3 filter to 4×4 as well. The transformed inputs and the transformed filter are multiplied element-wise, which takes 16 FMA operations. Since this yields 4 outputs at once, the cost is 4 FMAs per output. Compared to the 9 FMAs of direct convolution, this is an algorithmic speedup of 9/4 = 2.25×."
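The arithmetic above can be checked with a minimal sketch of one 2×2 output tile. The quoted passage does not give the transform matrices, so this sketch assumes the commonly used F(2×2, 3×3) matrices `BT`, `G`, and `AT`; the function names are illustrative only. The counts returned cover only the multiplication stage, as in the passage, and leave out the additions performed by the input, filter, and output transforms.

```python
from fractions import Fraction

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

# Assumed transform matrices for F(2x2, 3x3); not taken from the quoted text.
H = Fraction(1, 2)
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]  # input transform
G = [[1, 0, 0], [H, H, H], [H, -H, H], [0, 0, 1]]                  # filter transform
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]                                # output transform

def winograd_2x2_3x3(d, g):
    """Compute a 2x2 output tile from a 4x4 input tile d and a 3x3 filter g."""
    U = matmul(matmul(G, g), transpose(G))      # filter transformed to 4x4
    V = matmul(matmul(BT, d), transpose(BT))    # input tile transformed to 4x4
    # Element-wise product: 4 * 4 = 16 multiplications for 4 outputs.
    M = [[U[i][j] * V[i][j] for j in range(4)] for i in range(4)]
    Y = matmul(matmul(AT, M), transpose(AT))    # back to the 2x2 output tile
    return Y, 16

def direct_2x2_3x3(d, g):
    """Same 2x2 tile by direct convolution: 9 multiplications per output."""
    Y = [[sum(d[i + r][j + c] * g[r][c] for r in range(3) for c in range(3))
          for j in range(2)] for i in range(2)]
    return Y, 2 * 2 * 9

d = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
g = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
wino_out, wino_mults = winograd_2x2_3x3(d, g)
direct_out, direct_mults = direct_2x2_3x3(d, g)
print(wino_out == direct_out)      # True
print(direct_mults / wino_mults)   # 2.25
```

Exact `Fraction` arithmetic is used so the two results can be compared for equality. In a real implementation the filter transform would typically be computed once and reused across tiles, which is part of why the passage counts only the 16 element-wise multiplications.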