I'm learning deep learning, and I've been working through a tutorial on autoencoders using MNIST. It's fairly straightforward, and I think I understand it reasonably well so far. Here is the tutorial's code:
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
from keras.layers import Input, Dense, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model
#Load Dataset
(x_train, _), (x_test, _) = mnist.load_data()
#Scale Dataset values to lie between 0 and 1
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
x_train = np.reshape(x_train, (len(x_train), 28, 28, 1))
x_test = np.reshape(x_test, (len(x_test), 28, 28, 1))
#Add noise to the MNIST dataset by sampling random values from a Gaussian
#distribution with np.random.normal() and adding them to the original images
noise_factor = 0.5
x_train_noisy = x_train + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_train.shape)
x_test_noisy = x_test + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_test.shape)
x_train_noisy = np.clip(x_train_noisy, 0., 1.)
x_test_noisy = np.clip(x_test_noisy, 0., 1.)
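As a quick sanity check of the noising step (a toy sketch with made-up values, not from the tutorial), the additive Gaussian noise can push pixel values outside [0, 1], which is exactly why the np.clip calls are needed:

```python
import numpy as np

# Toy example: add Gaussian noise to a constant 0.5 "image" and clip
# the result back into the valid [0, 1] pixel range.
rng = np.random.default_rng(0)
clean = np.full((4, 4), 0.5)
raw_noisy = clean + 0.5 * rng.normal(size=clean.shape)
noisy = np.clip(raw_noisy, 0.0, 1.0)

print(raw_noisy.min(), raw_noisy.max())  # can fall outside [0, 1]
print(noisy.min(), noisy.max())          # always within [0, 1]
```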
# Model Construction
input_img = Input(shape=(28, 28, 1))
x = Conv2D(32, (3, 3), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)
# At this point the representation is (7, 7, 32)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)
autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder.summary()
This builds the following network:
Model: "model_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_3 (InputLayer) (None, 28, 28, 1) 0
_________________________________________________________________
conv2d_11 (Conv2D) (None, 28, 28, 32) 320
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 14, 14, 32) 0
_________________________________________________________________
conv2d_12 (Conv2D) (None, 14, 14, 32) 9248
_________________________________________________________________
max_pooling2d_6 (MaxPooling2 (None, 7, 7, 32) 0
_________________________________________________________________
conv2d_13 (Conv2D) (None, 7, 7, 32) 9248
_________________________________________________________________
up_sampling2d_5 (UpSampling2 (None, 14, 14, 32) 0
_________________________________________________________________
conv2d_14 (Conv2D) (None, 14, 14, 32) 9248
_________________________________________________________________
up_sampling2d_6 (UpSampling2 (None, 28, 28, 32) 0
_________________________________________________________________
conv2d_15 (Conv2D) (None, 28, 28, 1) 289
=================================================================
Total params: 28,353
Trainable params: 28,353
Non-trainable params: 0
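Incidentally, the parameter counts in the summary are easy to verify by hand (a small sketch I wrote for my own bookkeeping): a Conv2D layer has (kernel_h * kernel_w * in_channels + 1) * filters weights, and pooling/upsampling layers have none. Note that none of this depends on the spatial size of the input:

```python
# Parameters of a Conv2D layer: one (kh x kw x in_channels) kernel plus
# a bias per output filter. Pooling and upsampling layers add nothing.
def conv_params(kh, kw, in_ch, filters):
    return (kh * kw * in_ch + 1) * filters

total = (conv_params(3, 3, 1, 32)     # 320
         + conv_params(3, 3, 32, 32)  # 9248
         + conv_params(3, 3, 32, 32)  # 9248
         + conv_params(3, 3, 32, 32)  # 9248
         + conv_params(3, 3, 32, 1))  # 289
print(total)  # 28353
```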
So far, so good. I can train it with
autoencoder.fit(x_train_noisy, x_train,
epochs=10,
batch_size=128,
shuffle=True
)
and it converges to a solution with a loss of roughly 0.099 (and about the same on the test set).
But then I tried padding my input images. I did this to simplify the construction of deeper autoencoders, but that's beside the point here; the point is that even padding the images slightly, while keeping everything else constant, changed the results considerably:
x_train_padded = np.array([np.pad(x, ((2, 2), (2, 2), (0, 0)), mode='edge') for x in x_train])
x_test_padded = np.array([np.pad(x, ((2, 2), (2, 2), (0, 0)), mode='edge') for x in x_test])
x_train_padded_noisy = x_train_padded + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_train_padded.shape)
x_test_padded_noisy = x_test_padded + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_test_padded.shape)
x_train_padded_noisy = np.clip(x_train_padded_noisy, 0., 1.)
x_test_padded_noisy = np.clip(x_test_padded_noisy, 0., 1.)
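For clarity, np.pad with mode='edge' replicates the nearest border pixels outward rather than padding with zeros. A tiny toy example (values made up for illustration):

```python
import numpy as np

# 3x3 toy "image": mode='edge' copies the nearest border value outward.
img = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
padded = np.pad(img, ((1, 1), (1, 1)), mode='edge')
print(padded)
# [[1 1 2 3 3]
#  [1 1 2 3 3]
#  [4 4 5 6 6]
#  [7 7 8 9 9]
#  [7 7 8 9 9]]
```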
# Model Construction
input_img_padded = Input(shape=(32, 32, 1))
x = Conv2D(32, (3, 3), activation='relu', padding='same')(input_img_padded)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)
# At this point the representation is (8, 8, 32)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)
autoencoder_padded = Model(input_img_padded, decoded)
autoencoder_padded.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder_padded.summary()
As you can see, the network is essentially the same as before (same convolutional layers, same total number of parameters):
Model: "model_4"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_4 (InputLayer) (None, 32, 32, 1) 0
_________________________________________________________________
conv2d_16 (Conv2D) (None, 32, 32, 32) 320
_________________________________________________________________
max_pooling2d_7 (MaxPooling2 (None, 16, 16, 32) 0
_________________________________________________________________
conv2d_17 (Conv2D) (None, 16, 16, 32) 9248
_________________________________________________________________
max_pooling2d_8 (MaxPooling2 (None, 8, 8, 32) 0
_________________________________________________________________
conv2d_18 (Conv2D) (None, 8, 8, 32) 9248
_________________________________________________________________
up_sampling2d_7 (UpSampling2 (None, 16, 16, 32) 0
_________________________________________________________________
conv2d_19 (Conv2D) (None, 16, 16, 32) 9248
_________________________________________________________________
up_sampling2d_8 (UpSampling2 (None, 32, 32, 32) 0
_________________________________________________________________
conv2d_20 (Conv2D) (None, 32, 32, 1) 289
=================================================================
Total params: 28,353
Trainable params: 28,353
Non-trainable params: 0
_________________________________________________________________
However, when I train this network with
autoencoder_padded.fit(x_train_padded_noisy, x_train_padded,
epochs=10,
batch_size=128,
shuffle=True
)
it reaches a loss of roughly 0.075, almost 25% lower than before. Why is that? The only change was a couple of extra pixels around the border of the image. Of course I'm happy about the improvement, but it confuses me a lot, and I'd like to understand why it happens.
By the way, I repeated the training several times, in case some random effect was responsible for the difference, but I got roughly the same results each time.
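To make my confusion concrete, here is the kind of back-of-the-envelope check I've been considering (a hypothetical toy calculation, not from the tutorial): binary cross-entropy is averaged over all output pixels, so if the extra border pixels are nearly trivial to reconstruct, they could dilute the average loss even when the reconstruction of the digit itself is unchanged:

```python
import numpy as np

def mean_bce(y_true, y_pred, eps=1e-7):
    # Per-pixel binary cross-entropy, averaged over every pixel.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred)
                    + (1.0 - y_true) * np.log(1.0 - y_pred))

# Toy 28x28 "digit" region with a fixed reconstruction error.
center_true = np.full((28, 28), 0.5)
center_pred = np.full((28, 28), 0.6)
loss_28 = mean_bce(center_true, center_pred)

# Same center embedded in a 32x32 image whose 2-pixel border is
# reconstructed perfectly (all zeros). Only the averaging denominator
# and the easy border pixels change.
true_32 = np.zeros((32, 32)); true_32[2:30, 2:30] = center_true
pred_32 = np.zeros((32, 32)); pred_32[2:30, 2:30] = center_pred
loss_32 = mean_bce(true_32, pred_32)

print(loss_28, loss_32)  # loss_32 is lower, roughly by the 784/1024 pixel ratio
```

I don't know whether this is the whole story, but I find it suspicious that the pixel ratio 784/1024 ≈ 0.77 is close to the loss ratio 0.075/0.099 I observed.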