@codehunter wrote:
I am trying to find outliers using clustering analysis.
Data size: more than 50 million records
Total columns: 50 [39 categorical, 12 numerical]
Domain: Healthcare
Problem:
- About 5-6 categorical variables have more than 10,000 possible values
- About 12-14 have roughly 15 possible categories
Is clustering the right way to look for outliers in this scenario?
What are the best feature engineering methods (feature selection and dimensionality reduction) in this case?
Is it advisable to run k-means after converting all the categorical variables to numerical ones? If so, any ideas and pointers on how to do the conversion would help.
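To make the conversion question concrete, one option I have seen mentioned for high-cardinality columns is frequency encoding, where each category is replaced by its relative frequency in the column. Below is a minimal sketch of that idea using only the standard library. The function name and the toy column are hypothetical, and I am not claiming this is the right encoding for k-means on this data, only an illustration of what "converting categorical into numerical" could mean without creating 10,000 one-hot columns.

```python
from collections import Counter


def frequency_encode(values):
    """Replace each category with its relative frequency in the column.

    Hypothetical helper for illustration; not taken from any library.
    """
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


# Toy stand-in for a high-cardinality categorical column (made-up values).
diagnosis_codes = ["A10", "B20", "A10", "C30", "A10", "B20"]
encoded = frequency_encode(diagnosis_codes)
print(encoded)
```

A side effect of this encoding is that rare categories get small values, and distinct categories with the same count become indistinguishable, so it may or may not suit outlier detection.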
Is K-prototypes advisable? If so, is it reliable and mature enough to work with? Any theory and pointers to the code base would be appreciated.
K-prototypes: https://github.com/nicodv/kmodes/blob/master/kmodes/kprototypes.py
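My current understanding of the theory is that K-prototypes measures the distance between a record and a cluster prototype as the squared Euclidean distance on the numerical columns plus a weight (usually called gamma) times the number of mismatching categorical columns. The sketch below only illustrates that understanding. It is not the kmodes implementation, and the function name, the toy record, and the gamma value are all assumptions made for the example.

```python
def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Distance between one record and one prototype on mixed data.

    Numeric part: squared Euclidean distance.
    Categorical part: count of mismatching categories, weighted by gamma.
    Hypothetical helper for illustration only.
    """
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric_part + gamma * categorical_part


# Made-up record and prototype: two numerical and two categorical columns.
record_num, record_cat = [1.0, 2.0], ["F", "clinic"]
proto_num, proto_cat = [0.0, 2.0], ["F", "hospital"]

d = mixed_dissimilarity(record_num, record_cat, proto_num, proto_cat, gamma=0.5)
print(d)
```

If this is right, the numerical columns would need scaling first, and the choice of gamma decides how much a categorical mismatch counts relative to a numerical difference.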
Any other sample code would help.
I am looking for ideas and direction on how to approach this problem, and I am using Python for coding.