DEALING WITH MULTIPLE LEVELS IN QUALITATIVE DATA IN MACHINE LEARNING

@srikanthgirijala wrote:

While performing data analysis, we often encounter categorical variables with numerous levels, for example zip code, state, or city.
An analyst's usual approach is to keep only the high-frequency levels by setting a frequency threshold and ignoring the rest, to combine the remaining levels into one level, or to drop the variable altogether and go ahead
with model development.

Is any of these the right approach to dealing with categorical variables?
Could there be another approach that extracts and retains the right information, which these levels would otherwise hide?

This short note discusses a suitable approach to dealing with qualitative variables that have numerous levels, usually more than 50.

One way to find the importance of levels is to calculate their frequency and response rate. The brief steps are: (i) calculate the frequency of every level
within the variable; (ii) add the response rate of each level against the target variable.
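
The two steps above can be sketched in plain Python as follows. This is only a minimal illustration: the function name level_summary, the toy zip codes and the 0/1 default flags are made up for the example and are not from any real data set.

```python
from collections import Counter, defaultdict

def level_summary(levels, targets):
    """Return {level: (frequency_share, response_rate)}.

    levels  -- one categorical value per row (e.g. a zip code)
    targets -- one 0/1 outcome per row (1 = positive class, e.g. default)
    """
    counts = Counter(levels)            # step (i): rows per level
    positives = defaultdict(int)        # step (ii): positive outcomes per level
    for level, target in zip(levels, targets):
        positives[level] += target
    total = len(levels)
    return {
        level: (counts[level] / total, positives[level] / counts[level])
        for level in counts
    }

# Made-up toy rows, for illustration only.
zip_codes = ["A", "A", "A", "B", "B", "C"]
defaults = [1, 0, 0, 1, 1, 0]
summary = level_summary(zip_codes, defaults)
```

Here summary["A"] holds a frequency share of 0.5 and a response rate of one third. Because the response rate is computed from the target, it is safer to calculate it on the training rows only.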

This combination can yield very valuable information about the levels and their impact.

Let's take a practical example. Imagine we are looking at 30,000 US zip codes (both 3-digit and 5-digit)
and we are trying to predict customer loan default. (This is just a hypothetical example and does not in any way represent actual numbers.)

These zip codes can be segmented based on whether the frequency of customers is high or low and on the distribution of defaults within the zip code (High/Medium/Low).
Consider the numbers below for illustration:

Zip code    Frequency    Response
ZX345       3%           60%

By observing the above data, zip code ZX345 can easily be placed in the High Response, Low Frequency bin. Note that High and Low are subjective and might need
business or domain expertise. As a safe default, however, take the range of frequency and of response rate
and divide each into two or three logical bins. For instance, if the maximum frequency is 12% of the population and the minimum is 1%, and we choose three bins, each bin covers 4 percentage points,
as shown below (two bins, High and Low, can be used instead to avoid complexity).

Bin       Frequency    Response rate (% of positive class)
Low       1% - 4%      1% - 20%
Medium    5% - 8%      21% - 40%
High      9% - 12%     41% - 60%
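
As a rough sketch of this binning, the snippet below splits a range into equal-width bins and labels a level by its frequency and response rate. The helper names equal_width_bin and segment are hypothetical, and the ranges (1% to 12% for frequency, 1% to 60% for response rate) are simply the illustrative numbers used in this post. In practice the cut points may be better set with domain input.

```python
def equal_width_bin(value, lowest, highest, labels=("Low", "Medium", "High")):
    """Assign value to one of len(labels) equal-width bins spanning [lowest, highest]."""
    if highest <= lowest:
        raise ValueError("highest must be greater than lowest")
    width = (highest - lowest) / len(labels)
    index = int((value - lowest) / width)
    # Clamp so the maximum (and any out-of-range value) lands in an end bin.
    index = min(max(index, 0), len(labels) - 1)
    return labels[index]

def segment(frequency_pct, response_pct):
    """Label a level using the illustrative ranges from this post (assumed, not real)."""
    freq_bin = equal_width_bin(frequency_pct, 1, 12)   # frequency: 1% to 12%
    resp_bin = equal_width_bin(response_pct, 1, 60)    # response rate: 1% to 60%
    return f"{freq_bin} Frequency / {resp_bin} Response"

# The hypothetical zip code from the example: 3% frequency, 60% response.
zx345 = segment(3, 60)
```

For whole-number percentages this reproduces the bins listed above, so zx345 comes out as "Low Frequency / High Response".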

Each level in the category is coded based on its frequency bin and its response-rate bin. With three bins each, this approach shrinks thousands of levels in a variable
to 9 levels (3 x 3), or to 8 columns when converted to dummies for model development.
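
To make the 9 levels and 8 dummy columns concrete, here is a small sketch. It assumes three bins on each axis, and the combined names such as "Low_High" (low frequency, high response) and the function dummy_encode are made up for illustration. The first combined level is dropped as the baseline, which is why 9 levels need only 8 columns.

```python
from itertools import product

BINS = ("Low", "Medium", "High")
# All 3 x 3 = 9 combined levels, written as "<frequency bin>_<response bin>".
COMBINED_LEVELS = [f"{freq}_{resp}" for freq, resp in product(BINS, BINS)]

def dummy_encode(level, levels=COMBINED_LEVELS):
    """One-hot encode a combined level, dropping the first level as the baseline."""
    if level not in levels:
        raise ValueError(f"unknown level: {level}")
    return [1 if level == candidate else 0 for candidate in levels[1:]]

# A low-frequency, high-response level such as the hypothetical ZX345.
row = dummy_encode("Low_High")
```

The baseline level "Low_Low" is represented by all eight columns being zero. Libraries such as pandas offer the same encoding through get_dummies with drop_first=True.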

This would help anyone working on data science projects to extract information from variables
that would otherwise be ignored or handled inappropriately. The approach is useful when the requirement is to dig deep into low-frequency, high-impact category levels.
