Yo prefiero un ligero cambio de notación debido a la cantidad de $n$'s que aparecen en el original. Deje $\alpha$ $\beta$ designar a las imágenes. Deje $i$ $j$ cada designar pares de índices en la imagen de filas y columnas. (Indexación va de $1$ $m$para las filas y $1$ $n$por columnas). Deje $h$ designar una relación de índice par (de modo que sus dos entradas son números enteros, cualquiera de los cuales puede ser negativo), también conocido como offset. Entonces, por definición, el valor de la experimental de la cruz-variograma de estas imágenes en un desplazamiento $h$ es
$$\gamma_{\alpha,\beta}(h)=\frac{1}{2n(h)}\sum_{i}\left(\alpha[i+h]-\alpha[i]\right)\left(\beta[i+h] - \beta[i]\right).$$
The sum ranges over all indexes $i$ for which both $i$ and $i+h$ are valid indexes into both images; $n(h)$ is the number of such indexes (easily computed in the same way by taking a similar sum of $1$'s).
By expanding the summand algebraically the calculation is reduced to the problem of obtaining
$$\sum_{i}\alpha[i+h]\beta[i]$$
for various $h$, both positive and negative, ranging from $(1-m,1-n)$ through $(m-1,n-1)$.
Let us say that the reversal of an image negates the indexes; that is, the value of the reversal of $\alpha$ at the pixel $h$ is the value of $\alpha$ at $(m+1,n+1)-h$.
Such a sum can be seen as the reversal of the convolution of the reversal of $\alpha$ with $\beta$. It is best computed using discrete Fourier transforms after first padding each image to the right and down with zeros. The padding must extend to the range of the largest $h$ for which $\gamma$ needs to be computed. Convolutions with Fourier transforms are obtained by taking the inverse Fourier transform (itself a scalar multiple of the FT) of the product of the FTs.
Direct computation of the variogram via its definition for a pair of $m$ by $n$ images requires up to $m n$ products and sums for each value of $h$. Typically $S(m, n)$ values of $h$ are needed. The direct algorithm therefore has $O(m^2 n^2)$ computational cost, which is ridiculously large for moderate (megapixel) images. The discrete Fourier transform costs at most $O(2m 2n \log(2m 2n))$ (assuming the maximum range of offsets $h$) and has to be applied only a constant number of times (3). The reversals and paddings cost $O(2m 2n)$. Thus the total cost is still only $S(12 m n \log(4 m n))$, a huge improvement.
As a simple example, take $\alpha$ and $\beta$ to be the matrices
1 2
3 4
and
5 6
7 8
After padding with zeros to the right and down (by two columns and two rows) and reversing $\alpha$, multiplying these two 4 by 4 matrices componentwise, and taking the inverse Fourier transform, we get
8 0 14 23
0 0 0 0
18 0 20 39
30 0 38 70
Rotating this right by 2 columns and 2 rows and reversing gives
0 0 0 0
0 8 23 14
0 30 70 38
0 18 39 20
If you think of the new row and column indexes ranging $-2, -1, 0, 1$, this new matrix is exactly $\sum_{i}\alpha[i+h]\beta[i]$ (indexed by $h$). For example, the $h = (0,1)$ entry is 38 and indeed
$$\alpha[1,2]\beta[1,1] + \alpha[2,2]\beta[2,1] = 2 \times 5 + 4 \times 7 = 38.$$