@doraemon_z2000 wrote:
Hopefully I’ve articulated my queries as clearly as possible. I’ve provided sample data below (assume the full dataset has 100,000+ rows).
Data characteristics
- Each row is a unique observation
- The first 3 columns depict the raw data. “Purchased a car” is the outcome (1 or 0)
- Remaining columns represent extracted features
Objective
- To build a binary classification model (I’m using logistic regression) to predict whether a person is likely to buy a car
Feature extraction
- While analysing the raw data, I constructed additional features that tell me how many people within a similar age range purchased/didn’t purchase a car. For example, for the first entry (age 22), +5% and -5% round to 23 and 21 respectively. The two “No. of people” features count how many people within that age range did or did not purchase a car (aggregated across the whole dataset)
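A minimal sketch of the derivation described above, in plain Python. It assumes inclusive age bounds and round-half-up rounding, both inferred from the sample rows (e.g. age 50 gives 48 and 53); the names `age_band` and `band_counts` are hypothetical. Counts are taken over the whole input, so each row also counts itself.

```python
from bisect import bisect_left, bisect_right


def age_band(age, pct=5):
    """Return (lower, upper) bounds at -pct% / +pct%, rounded half up.

    Integer arithmetic avoids floating-point surprises at .5 boundaries.
    """
    lower = (age * (100 - pct) + 50) // 100
    upper = (age * (100 + pct) + 50) // 100
    return lower, upper


def band_counts(ages, purchased, pct=5):
    """For each row, count buyers and non-buyers whose age is in the row's band."""
    buyers = sorted(a for a, y in zip(ages, purchased) if y == 1)
    non_buyers = sorted(a for a, y in zip(ages, purchased) if y == 0)

    def count_in(sorted_ages, lo, hi):
        # Number of values v with lo <= v <= hi in a sorted list.
        return bisect_right(sorted_ages, hi) - bisect_left(sorted_ages, lo)

    rows = []
    for age in ages:
        lo, hi = age_band(age, pct)
        rows.append((lo, hi, count_in(buyers, lo, hi), count_in(non_buyers, lo, hi)))
    return rows


# The 12 sample rows from the Data table.
ages = [22, 33, 21, 33, 29, 18, 50, 52, 33, 61, 63, 77]
purchased = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

features = band_counts(ages, purchased)
for age, row in zip(ages, features):
    print(age, row)
```

Sorting once and using binary search keeps the cost at roughly O(n log n), which matters at 100,000+ rows where a naive pairwise comparison would be quadratic.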
What I need advice on
- Is this method of feature extraction sound, and are there any potential gotchas I ought to be aware of? I’ve pored over numerous forums and articles on feature engineering, but I couldn’t find much literature on derived features, in particular those that are aggregated from the raw data
- When it comes to splitting the dataset into test/train … how should this feature be treated? If I were to split the raw data into test/train before feature derivation, the “No. of people” features would look vastly different between test and train
- Is this misuse of feature extraction?
A possible option … but is this sound?
- Split the raw data into test/train
- For the train data, extract these additional “No. of people” features
- For the test data, extract these additional “No. of people” features using the train data. This will ensure that the “No. of people” counts used when validating on the test data are reflective of the training dataset
- When predicting any new observations, the “No. of people” features would need to be computed based on the test dataset
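The first three steps of this option can be sketched as a fit/transform pair, so that counts are learned from the training rows only and then looked up for test rows. This is an illustration under assumptions, not a recommendation: `AgeBandCounter` is a hypothetical helper, the bounds use the same inclusive, round-half-up rule inferred from the sample table, and the split here is a fixed "every fourth row" pick for reproducibility, where a real split would be randomized.

```python
from bisect import bisect_left, bisect_right


class AgeBandCounter:
    """Hypothetical helper: learns buyer / non-buyer ages from training rows."""

    def __init__(self, pct=5):
        self.pct = pct

    def fit(self, ages, purchased):
        # Only the rows passed here ever contribute to the counts.
        self.buyers_ = sorted(a for a, y in zip(ages, purchased) if y == 1)
        self.non_buyers_ = sorted(a for a, y in zip(ages, purchased) if y == 0)
        return self

    @staticmethod
    def _count(sorted_ages, lo, hi):
        return bisect_right(sorted_ages, hi) - bisect_left(sorted_ages, lo)

    def transform(self, ages):
        out = []
        for age in ages:
            lo = (age * (100 - self.pct) + 50) // 100
            hi = (age * (100 + self.pct) + 50) // 100
            out.append((self._count(self.buyers_, lo, hi),
                        self._count(self.non_buyers_, lo, hi)))
        return out


ages = [22, 33, 21, 33, 29, 18, 50, 52, 33, 61, 63, 77]
purchased = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Step 1: split the raw data (fixed pick: rows 3, 7 and 11 become the test set).
is_test = [i % 4 == 3 for i in range(len(ages))]
train_ages = [a for a, t in zip(ages, is_test) if not t]
train_y = [y for y, t in zip(purchased, is_test) if not t]
test_ages = [a for a, t in zip(ages, is_test) if t]

# Step 2: derive the train features from the train rows.
counter = AgeBandCounter().fit(train_ages, train_y)
train_features = counter.transform(train_ages)

# Step 3: derive the test features from the *train* counts.
test_features = counter.transform(test_ages)
print(test_features)
```

Note that with this scheme a training row still counts its own outcome in its own feature, while a test row never does, so the two sets are not computed on quite the same footing. In this small example the test row with age 77 gets (0, 0), whereas the full-data table gives it (0, 1).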
Data
+-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
| Raw data                          | Derived                                                                                                                        |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| Age | Location  | Purchased a car | Age - 5% | Age + 5% | No. of people within ages (+/-5% who purchased) | No. of people within ages (+/-5% who did not purchase) |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 22  | Penrith   | 1               | 21       | 23       | 2                                               | 0                                                      |
| 33  | Peakhurst | 1               | 31       | 35       | 2                                               | 1                                                      |
| 21  | Peakhurst | 1               | 20       | 22       | 2                                               | 0                                                      |
| 33  | Peakhurst | 1               | 31       | 35       | 2                                               | 1                                                      |
| 29  | Peakhurst | 1               | 28       | 30       | 1                                               | 0                                                      |
| 18  | Penrith   | 1               | 17       | 19       | 1                                               | 0                                                      |
| 50  | Penrith   | 0               | 48       | 53       | 0                                               | 2                                                      |
| 52  | Penrith   | 0               | 49       | 55       | 0                                               | 2                                                      |
| 33  | Penrith   | 0               | 31       | 35       | 2                                               | 1                                                      |
| 61  | Penrith   | 0               | 58       | 64       | 0                                               | 2                                                      |
| 63  | Penrith   | 0               | 60       | 66       | 0                                               | 2                                                      |
| 77  | Penrith   | 0               | 73       | 81       | 0                                               | 1                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+