Hi everyone,
I have a dataset of more than 7,000 bibliographic records from a domain-specific search I ran in Scopus.
My goal is to group these records (using all titles + abstracts) in a meaningful way.
My question is:
Can you recommend any state-of-the-art (SOTA) framework, workflow, pipeline, article, book, or tutorial for this task?
I also have several more specific questions:
(1) I have already tried CountVec + HDP, LSI, and LDA with Gensim, and the LDA results were good enough.
However, TF-IDF + NMF also produced "coherent" topics.
So how do I justify choosing one of these combinations over another, e.g., CountVec + LDA or TF-IDF + NMF?
Are CountVec and TF-IDF each appropriate only for some of these methods, but not all?
(2) If I use embeddings instead of a BOW approach, are some combinations appropriate while others are not?
For example, is Word2Vec + LDA appropriate, but SentVec + NMF not?
(3) Finally, are any of these approaches (BOW or embeddings) more appropriate for a specific clustering technique (e.g., BOW + K-Means, embeddings + affinity propagation)?
Also, would it be appropriate to run topic modeling within each cluster?
If so, which combination would be appropriate in this case?
I am writing an article and, as you know, reviewers can be very picky, which is why I am being so detailed.
I hope my questions make sense.
If you have any advice, I would really appreciate it!
Thank you!
Carlos.