Channel: Data Science, Analytics and Big Data discussions - Latest topics

Query regarding ML feature extraction and aggregated features

@doraemon_z2000 wrote:

Hopefully I’ve articulated my queries clearly. I’ve provided sample data below (assume the real dataset has 100,000+ rows).

Data characteristics

  • Each row is a unique observation
  • The first 3 columns depict the raw data. “Purchased a car” is the outcome (1 or 0)
  • Remaining columns represent extracted features

Objective

  • To build a binary classification model (I’m using logistic regression) to predict whether a person is likely to buy a car

Feature extraction

  • While analysing the raw data, I constructed additional features that tell me how many people within a similar age range purchased or didn’t purchase a car. For example, for the first entry (age 22), the +/-5% range is 21 to 23 after rounding. The two “No. of people” features count how many people within that age range did or did not purchase a car (aggregated across the whole dataset)
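
To make the feature definition concrete, here is a minimal sketch in plain Python that reproduces the derived columns for the sample rows. The function names `age_band` and `band_counts` are made up for illustration, and the round-half-up rule is an assumption inferred from the table values (for instance, 50 + 5% = 52.5 is shown as 53).

```python
# Sample rows from the post: (age, location, purchased)
rows = [
    (22, "Penrith", 1), (33, "Peakhurst", 1), (21, "Peakhurst", 1),
    (33, "Peakhurst", 1), (29, "Peakhurst", 1), (18, "Penrith", 1),
    (50, "Penrith", 0), (52, "Penrith", 0), (33, "Penrith", 0),
    (61, "Penrith", 0), (63, "Penrith", 0), (77, "Penrith", 0),
]


def age_band(age, pct=5):
    """Return (low, high) = age -/+ pct%, rounded half up to whole years.

    Integer arithmetic avoids floating-point surprises at exact .5 values.
    The rounding rule itself is an assumption inferred from the table.
    """
    low = (age * (100 - pct) + 50) // 100
    high = (age * (100 + pct) + 50) // 100
    return low, high


def band_counts(age, reference_rows, pct=5):
    """Count buyers and non-buyers in reference_rows whose age is in the band."""
    low, high = age_band(age, pct)
    in_band = [bought for a, _, bought in reference_rows if low <= a <= high]
    buyers = sum(in_band)
    return buyers, len(in_band) - buyers


# One tuple per row: (age - 5%, age + 5%, no. who purchased, no. who did not)
features = [age_band(a) + band_counts(a, rows) for a, _, _ in rows]
```

For the first row this gives `(21, 23, 2, 0)`, matching the table. As in the table, each row's own outcome is included in its counts. The nested scan is quadratic in the number of rows, so for 100,000+ rows a grouped or sorted approach would likely be needed.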

What I need advice on

  • Is this method of feature extraction sound, and are there any potential gotchas I ought to be aware of? I’ve pored over numerous forums and articles on feature engineering, but I couldn’t find much literature on derived features, in particular those aggregated from the raw data
  • When it comes to splitting the dataset into test/train, how should these features be treated? If I were to split the raw data into test/train before feature derivation, the “No. of people” features would look vastly different between test and train
  • Is this misuse of feature extraction?

A possible option … but is this sound?

  • Split the raw data into test/train
  • For the train data, extract these additional “No. of people” features
  • For the test data, extract these additional “No. of people” features using the train data. This will ensure that the “No. of people” counts used when validating on test data are reflective of the training dataset
  • When predicting any new observations, the “No. of people” features would need to be computed based on the test dataset
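
The first three steps of this option can be sketched as follows, in plain Python. This is only an illustration under stated assumptions: the 75/25 split, the fixed seed, and the helper name `band_counts` are hypothetical, and the round-half-up rule is inferred from the sample table. It does not cover scoring new observations.

```python
import random

# (age, purchased) pairs taken from the sample table in the post
rows = [(22, 1), (33, 1), (21, 1), (33, 1), (29, 1), (18, 1),
        (50, 0), (52, 0), (33, 0), (61, 0), (63, 0), (77, 0)]


def band_counts(age, reference, pct=5):
    """Buyers / non-buyers in `reference` with age within +/- pct% of `age`."""
    low = (age * (100 - pct) + 50) // 100   # round half up (assumed rule)
    high = (age * (100 + pct) + 50) // 100
    outcomes = [bought for a, bought in reference if low <= a <= high]
    return sum(outcomes), len(outcomes) - sum(outcomes)


# Step 1: split the raw data first (hypothetical 75/25 split, fixed seed).
shuffled = rows[:]
random.Random(0).shuffle(shuffled)
cut = int(len(shuffled) * 0.75)
train, test = shuffled[:cut], shuffled[cut:]

# Steps 2 and 3: derive the count features for BOTH sets from the
# training rows only, so no test outcomes feed into any feature.
train_features = [band_counts(age, train) for age, _ in train]
test_features = [band_counts(age, train) for age, _ in test]
```

With this arrangement the test rows never contribute to the counts. Each training row still counts its own outcome, whereas a test row does not, which is one way the feature can end up looking different between the two sets.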

Data

+-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
| Raw data                          | Derived                                                                                                                        |
+-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
| Age | Location  | Purchased a car | Age - 5% | Age + 5% | No. of people within +/-5% of age who purchased | No. of people within +/-5% of age who did not purchase |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 22  | Penrith   | 1               | 21       | 23       | 2                                               | 0                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 33  | Peakhurst | 1               | 31       | 35       | 2                                               | 1                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 21  | Peakhurst | 1               | 20       | 22       | 2                                               | 0                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 33  | Peakhurst | 1               | 31       | 35       | 2                                               | 1                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 29  | Peakhurst | 1               | 28       | 30       | 1                                               | 0                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 18  | Penrith   | 1               | 17       | 19       | 1                                               | 0                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 50  | Penrith   | 0               | 48       | 53       | 0                                               | 2                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 52  | Penrith   | 0               | 49       | 55       | 0                                               | 2                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 33  | Penrith   | 0               | 31       | 35       | 2                                               | 1                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 61  | Penrith   | 0               | 58       | 64       | 0                                               | 2                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 63  | Penrith   | 0               | 60       | 66       | 0                                               | 2                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 77  | Penrith   | 0               | 73       | 81       | 0                                               | 1                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
