Neat idea, but I'm not sure Euclidean distance is valid on what's essentially a categorical variable. Try a clustering algorithm designed for categorical or mixed data, such as K-prototypes [1], or swap Euclidean distance for Gower distance.<p>[1] <a href="https://pdfs.semanticscholar.org/d42b/b5ad2d03be6d8fefa63d25d02c0711d19728.pdf" rel="nofollow">https://pdfs.semanticscholar.org/d42b/b5ad2d03be6d8fefa63d25...</a><p>Edit: Thinking about it more, you could treat the cards in each deck as a bag of words and run LDA on it. Alternatively, create embeddings and cluster those (just keep in mind that skip-grams aren't meaningful for decks of cards, since a deck has no inherent card order).
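<p>To make the Gower suggestion concrete, here's a rough sketch in plain Python. The deck fields (class, archetype, average cost) and the `gower_distance` helper are made up for illustration and aren't from the original post. Categorical features contribute 0 on a match and 1 on a mismatch, numeric features contribute the absolute difference scaled by that feature's range, and the result is the unweighted mean.

```python
def gower_distance(a, b, is_categorical, ranges):
    """Unweighted Gower distance between two records with mixed feature types.

    Categorical features: 0 if equal, 1 otherwise.
    Numeric features: |a - b| divided by the feature's range over the dataset.
    """
    total = 0.0
    for x, y, categorical, feature_range in zip(a, b, is_categorical, ranges):
        if categorical:
            total += 0.0 if x == y else 1.0
        elif feature_range:
            total += abs(x - y) / feature_range
        # A numeric feature with zero range is constant and contributes nothing.
    return total / len(a)


# Hypothetical deck summaries: (class, archetype, average card cost).
decks = [
    ("mage", "control", 4.0),
    ("mage", "aggro", 2.0),
    ("rogue", "aggro", 2.5),
]
is_categorical = [True, True, False]

# The range is only used for the numeric column.
costs = [deck[2] for deck in decks]
ranges = [0.0, 0.0, max(costs) - min(costs)]

# Pairwise distance matrix, usable with any clustering method that
# accepts precomputed distances (e.g. hierarchical clustering).
distances = [
    [gower_distance(a, b, is_categorical, ranges) for b in decks]
    for a in decks
]

for row in distances:
    print([round(d, 3) for d in row])
```

Unlike Euclidean distance on integer-encoded card IDs, this never treats "card 17" as being closer to "card 18" than to "card 200"; only equality matters for the categorical columns.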