Thanks for the great question, we will soon publish an FAQ that answers it.
Why is the metric value on the validation dataset sometimes better than the one on the training dataset?
This happens because the auto-generated numerical features that are based on categorical features are calculated differently for the training dataset and for the validation dataset.
For the training dataset, the feature is calculated differently for every object. For object i, the feature is calculated based on data from the first i-1 objects (the first i-1 objects in some random permutation of the training dataset).
For the validation dataset, the same feature is calculated using data from all objects of the training dataset.
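A minimal sketch of this idea (not CatBoost's actual implementation; the function names and the smoothing prior are illustrative) might look like this:

```python
import numpy as np

def ordered_target_encoding(cat_values, targets, prior=0.5):
    """Training-time encoding: the value for object i uses only the
    targets of the objects that precede it in a random permutation."""
    perm = np.random.permutation(len(cat_values))
    sums, counts = {}, {}
    encoded = np.empty(len(cat_values))
    for i in perm:
        c = cat_values[i]
        # Encode object i using only the statistics accumulated so far,
        # smoothed with a prior (an assumed formula for illustration).
        encoded[i] = (sums.get(c, 0.0) + prior) / (counts.get(c, 0) + 1)
        # Only now add object i's target to the running statistics.
        sums[c] = sums.get(c, 0.0) + targets[i]
        counts[c] = counts.get(c, 0) + 1
    return encoded

def full_target_encoding(train_cats, train_targets, val_cats, prior=0.5):
    """Validation-time encoding: statistics come from ALL training
    objects, so every encoded value is based on more data."""
    sums, counts = {}, {}
    for c, t in zip(train_cats, train_targets):
        sums[c] = sums.get(c, 0.0) + t
        counts[c] = counts.get(c, 0) + 1
    return np.array([(sums.get(c, 0.0) + prior) / (counts.get(c, 0) + 1)
                     for c in val_cats])
```

Early objects in the permutation are encoded from very few (or zero) preceding objects, so their training-time feature values are noisy, while the validation-time values are always computed from the full training set.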
The feature that is calculated using data from all objects of the training dataset uses more data than the feature that is calculated only on part of the dataset. For this reason this feature is more powerful, and a more powerful feature results in a better loss value.
Thus, the loss value on the validation dataset might be better than the loss value on the training dataset, because the validation dataset has more powerful features.
The algorithm used to calculate these auto-generated numerical features, together with its theoretical foundations, is described in the following materials:
https://tech.yandex.com/catboost/doc/dg/concepts/educational-materials-papers-docpage/ (the first two papers) and here:
https://tech.yandex.com/catboost/doc/dg/concepts/educational-materials-videos-docpage/ (the second video).