Carlos A.
Hi everyone,
I have a dataset of more than 7,000 bibliographic records from a domain-specific search I ran in Scopus.
My goal is to group these data (all titles + abstracts) in a meaningful way.
My question is:
Can you recommend any state-of-the-art (SOTA) framework, workflow, pipeline, article, book, or tutorial for this task?
I also have several more specific questions:
(1) I have already tried CountVec + HDP, LSI, and LDA with Gensim, and LDA was good enough.
However, TF-IDF + NMF also gave "coherent" topics.
So how do I justify using any one of these combinations, e.g., CountVec + LDA or TF-IDF + NMF?
Are CountVec and TF-IDF each appropriate for only some of these methods, but not all?
(2) If I use embeddings instead of a BOW approach, are some combinations appropriate while others are not?
For example, Word2Vec + LDA? But not SentVec + NMF?
(3) Finally, are any of these approaches (BOW or embeddings) more appropriate for a specific clustering technique (e.g., BOW + K-Means, embeddings + affinity propagation)?
Moreover, would it be appropriate to perform Topic Modeling in each cluster?
If so, which combination would be appropriate in this case?
I am writing an article and, as you know, reviewers are very picky, which is why I am being so detailed.
I hope my questions make sense.
If you have any advice, I would really appreciate it!
Thank you!
Carlos.
Building topic models is largely an empirical exercise.
A method that works well on some datasets might not work on others.
And regarding your last point, it is okay to perform topic modelling within each of the clusters as well.