Could also try preprocessing:
I removed prefixed whitespace and trimmed leading/trailing whitespace in all fields
replaced runs of 3+ spaces with newlines
deleted all runs of 2+ spaces
dropped poems with <100 characters (generally a scrape error)
removed Unicode junk
serialized each poem as title+author+tags (if any) / poem / <|endoftext|> (i.e. the inline metadata trick, which may allow better learning and a small degree of control in conditional generation)
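
The steps above could be sketched in Python roughly as follows. This is only an illustration under several assumptions, not the original script: the names `clean_field`, `clean_header` and `serialize` are hypothetical, "Unicode junk" is taken to mean control, format, private-use and unassigned characters, and the `|` and `,` separators in the metadata line are a guess, since the notes only give the layout schematically as title+author+tags / poem / <|endoftext|>.

```python
import re
import unicodedata

END_OF_TEXT = "<|endoftext|>"
MIN_POEM_CHARS = 100  # threshold taken from the list of steps


def clean_field(text):
    """Remove Unicode junk and tidy whitespace in a single field."""
    text = unicodedata.normalize("NFC", text)
    # Assumption: "junk" = Unicode category C* (control, format, private use,
    # surrogate, unassigned), except newline and tab, which are kept.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    text = re.sub(r" {3,}", "\n", text)  # 3+ spaces become a line break
    text = re.sub(r" {2,}", "", text)    # delete runs of 2+ spaces, as in the notes
    # Strip prefixed/trailing whitespace on every line, then on the whole field.
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(lines).strip()


def clean_header(text):
    """Clean a metadata field and force it onto one line."""
    return " ".join(clean_field(text).split())


def serialize(title, author, tags, poem):
    """Return one training record, or None if the poem is too short."""
    poem = clean_field(poem)
    if len(poem) < MIN_POEM_CHARS:
        return None  # most likely a scrape error
    parts = [clean_header(title), clean_header(author)]
    tags = [clean_header(t) for t in tags if clean_header(t)]
    if tags:
        parts.append(",".join(tags))
    # Hypothetical layout: metadata line, poem body, end-of-text marker.
    header = "|".join(parts)
    return f"{header}\n{poem}\n{END_OF_TEXT}\n"
```

Note that deleting runs of 2+ spaces (rather than collapsing them to one space) follows the notes literally; it can join two words together, so collapsing to a single space may be the safer variant.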