@bgarcial wrote:
I have the following dataset, represented as a NumPy array:

```python
direccion_viento_pos
Out[32]:
array([['S'],
       ['S'],
       ['S'],
       ...,
       ['SO'],
       ['NO'],
       ['SO']], dtype=object)
```

The dimension of this array is:

```python
direccion_viento_pos.shape  # (17249, 8)
```

I am using Python and scikit-learn to encode these categorical variables in this way:
```python
from __future__ import unicode_literals
import pandas as pd
import numpy as np
# from sklearn import preprocessing
# from matplotlib import pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
```

Then I create a label encoder object:
```python
labelencoder_direccion_viento_pos = LabelEncoder()
```

I take column 0 (the only column) of `direccion_viento_pos` and apply the `fit_transform()` method to all of its rows:

```python
direccion_viento_pos[:, 0] = labelencoder_direccion_viento_pos.fit_transform(direccion_viento_pos[:, 0])
```

My `direccion_viento_pos` now looks like this:

```python
direccion_viento_pos[:, 0]
array([5, 5, 5, ..., 7, 3, 7], dtype=object)
```

At this point, each row/observation of `direccion_viento_pos` has a numeric value, but I want to avoid the problem of implied weight: some rows now carry a higher value than others even though the categories have no natural order. For this reason, I create dummy variables, which according to this reference are:
> A dummy variable or indicator variable is an artificial variable created to represent an attribute with two or more distinct categories/levels.
Then, in my `direccion_viento_pos` context, I have 8 values:

- `SO` - Suroeste (southwest)
- `SE` - Sureste (southeast)
- `S` - Sur (south)
- `N` - Norte (north)
- `NO` - Noroeste (northwest)
- `NE` - Noreste (northeast)
- `O` - Oeste (west)
- `E` - Este (east)

That is, 8 categories.
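As a side note on the label-encoding step above: `LabelEncoder` assigns the integer codes in alphabetical order of the class labels, which is what produces the `5`, `7`, and `3` values shown earlier for `S`, `SO`, and `NO`. A minimal sketch with just the 8 category labels:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# The 8 wind-direction categories from the dataset
categories = np.array(['SO', 'SE', 'S', 'N', 'NO', 'NE', 'O', 'E'])

le = LabelEncoder()
codes = le.fit_transform(categories)

# le.classes_ is sorted alphabetically, so the integer assigned to each
# label is its position in that sorted order:
mapping = dict(zip(le.classes_, range(len(le.classes_))))
print(mapping)
# {'E': 0, 'N': 1, 'NE': 2, 'NO': 3, 'O': 4, 'S': 5, 'SE': 6, 'SO': 7}
```

This makes clear that the numbers are arbitrary ranks, not magnitudes, which is exactly why the one-hot step below is needed.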
Next, I create a `OneHotEncoder` object with the `categorical_features` parameter, which specifies which features will be treated as categorical variables:

```python
onehotencoder = OneHotEncoder(categorical_features=[0])
```

And I apply this `onehotencoder` to our `direccion_viento_pos` matrix:

```python
direccion_viento_pos = onehotencoder.fit_transform(direccion_viento_pos).toarray()
```

My `direccion_viento_pos`, with its categorical variable encoded, now looks like this:

```python
direccion_viento_pos
array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])
```

So, up to here, I have created dummy variables for each category.
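For reference, in newer scikit-learn releases the `categorical_features` parameter of `OneHotEncoder` was deprecated and later removed; the encoder is now applied directly to the string column, with no `LabelEncoder` pass needed first. A minimal sketch, using a made-up four-row sample in place of the real `direccion_viento_pos` data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical small sample standing in for direccion_viento_pos
winds = np.array([['S'], ['S'], ['SO'], ['NO']], dtype=object)

# Modern usage: encode the string column directly, no LabelEncoder needed
enc = OneHotEncoder()
dummies = enc.fit_transform(winds).toarray()

# One column per category seen in the data, sorted alphabetically
print(enc.categories_)  # [array(['NO', 'S', 'SO'], dtype=object)]
print(dummies.shape)    # (4, 3)
```

Each row of `dummies` has exactly one `1.`, marking that row's category.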
I wanted to narrate this whole process in order to arrive at my question:

If these dummy-encoded variables are already in the 0-1 range, is it necessary to apply `MinMaxScaler` feature scaling to them?

Some say it is not necessary to scale these dummy variables. Others say it is necessary, because we want accurate predictions.

I ask because when I apply `MinMaxScaler` with `feature_range=(0, 1)`, my values change in some positions, despite still staying within this scale.

Which is the best option for my `direccion_viento_pos` dataset?
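As a sanity check on that last observation, here is a sketch (with a made-up 0/1 dummy matrix) of what `MinMaxScaler` does to such columns. A column whose minimum is 0 and maximum is 1 is mapped onto itself by `feature_range=(0, 1)`, so scaling pure dummy columns is a no-op; any values that change must come from other, non-dummy columns in the matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical dummy matrix: each column contains only 0s and 1s
dummies = np.array([[1., 0., 0.],
                    [0., 1., 0.],
                    [0., 0., 1.],
                    [1., 0., 0.]])

# Per column, MinMaxScaler computes (x - min) / (max - min);
# with min=0 and max=1 this is the identity transform
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(dummies)

print(np.allclose(scaled, dummies))  # True
```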
Posts: 6
Participants: 3
