Channel: Data Science, Analytics and Big Data discussions - Latest topics

Clustering with mixed variables, where some categorical variables have about 10,000 categories


@codehunter wrote:

I am trying to find outliers using clustering analysis.

Data size: more than 50 million records

Total columns: 50 [39 categorical, 12 numerical]

Domain: Healthcare

Problem:

  • About 5-6 categorical variables have more than 10,000 possible values each.
  • About 12-14 variables have about 15 possible categories each.

  1. Is clustering the right way to look for outliers in this scenario?

  2. What are the best feature engineering methods (feature selection and dimensionality reduction) in this case?

  3. Is it advisable to run k-means after converting all the categorical variables to numerical ones? If yes, any ideas and pointers on how to do that would help.

  4. Is it advisable to use k-prototypes? If yes, is it reliable and mature enough to work with? Any theory and pointers to the code base are appreciated.
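
To make question 3 concrete: one common way to turn a high-cardinality categorical column into a number is frequency encoding, where each category is replaced by its share of the rows. The sketch below is only an illustration under assumptions; the function name and the sample codes are hypothetical, and it uses only the standard library. Note that two different categories with the same frequency end up with the same value.

```python
from collections import Counter

def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]

# Hypothetical diagnosis-code column; a real one would have ~10,000 distinct values.
codes = ["A10", "B20", "A10", "C30", "A10", "B20"]
encoded = frequency_encode(codes)
```

Rare categories get small values under this encoding, which may or may not be what is wanted when looking for outliers.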

K-prototypes : https://github.com/nicodv/kmodes/blob/master/kmodes/kprototypes.py
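
For context, k-prototypes (Huang's algorithm) compares a record with a cluster prototype using the squared Euclidean distance over the numerical columns plus a weight gamma times the number of mismatched categorical columns. The following is a minimal standalone sketch of that measure, not the kmodes implementation; the function name, the sample values, and the choice of gamma are hypothetical.

```python
def mixed_dissimilarity(a_num, a_cat, b_num, b_cat, gamma=1.0):
    """Squared Euclidean distance on numerical parts plus
    gamma times the count of categorical mismatches."""
    numeric_part = sum((x - y) ** 2 for x, y in zip(a_num, b_num))
    categorical_part = sum(x != y for x, y in zip(a_cat, b_cat))
    return numeric_part + gamma * categorical_part

# Hypothetical record and prototype: two numerical and two categorical columns.
record_num, record_cat = [1.0, 2.0], ["A10", "clinic"]
proto_num, proto_cat = [1.0, 4.0], ["A10", "hospital"]

d = mixed_dissimilarity(record_num, record_cat, proto_num, proto_cat, gamma=0.5)
```

The numerical columns would normally be scaled first, since gamma only balances the two parts sensibly when the numerical distances are on a comparable scale.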

Any other sample code would help.

I am looking for ideas and direction on how to approach this problem, and I am using Python for coding.

Posts: 1

Participants: 1
