So a good study would look like finding a good combination of choices for the following:
0) choosing a baseline performance for your study (and optionally performing a baseline analysis)
1) preprocessing: raw signal vs. FFT (Hann/Hamming window, frame size, window size, overlap); see the sketch after this list
2) architecture: searching for the best network architecture
3) identifying the likely causes of overfitting and measuring their impact on the final quality
4) "theoretical maximum" quality, probably a kind of analysys of the data variance across people, data noise (maybe by trying to soften the data) and label noise (how often similar data leads to different labels).
You can use a small part of the dataset for most of these studies, so the network trains very fast (within several minutes on modern GPUs).
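As a sketch of what "a small part of the dataset" could mean in practice, one option is a class-balanced subsample; small_subset, per_class, and the random stand-in data below are hypothetical, not tied to any particular dataset:

# Take a small, class-balanced slice of the data so one training run
# finishes in minutes.
import numpy as np

def small_subset(X, y, per_class=200, seed=0):
    """Keep at most `per_class` examples of every label, preserving class balance."""
    rng = np.random.default_rng(seed)
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        keep.append(rng.choice(idx, size=min(per_class, len(idx)), replace=False))
    keep = np.concatenate(keep)
    rng.shuffle(keep)
    return X[keep], y[keep]

# Example with random stand-in data: 5 classes, about 200 examples each kept.
X = np.random.randn(10000, 64)
y = np.random.randint(0, 5, size=10000)
X_small, y_small = small_subset(X, y, per_class=200)
print(X_small.shape, y_small.shape)

Keeping the same fixed seed across all the ablation runs makes the comparisons between preprocessing and architecture variants fair.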