Channel: Data Science, Analytics and Big Data discussions - Latest topics
Viewing all articles
Browse latest Browse all 4448

Dealing with categorical variables - Looking for recommendations


@bgarcial wrote:

I have the following dataset, in which the Direccion del viento (Pos) column has categorical values.

In total, Direccion del viento (Pos) has 8 categories:

  • SO - Sur oeste (southwest)
  • SE - Sur este (southeast)
  • S - Sur (south)
  • N - Norte (north)
  • NO - Nor oeste (northwest)
  • NE - Nor este (northeast)
  • O - Oeste (west)
  • E - Este (east)

Then I convert this dataframe to a NumPy array and I get:

dtype: bool

Since the values are character strings and I want numeric values, I need to encode the categorical variable, that is, convert the text into numbers.

Then I perform two steps:

  1. I use LabelEncoder() to encode the values as numbers, one number per category.

Label encoding simply converts each value in a column to a number:

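from sklearn.preprocessing import LabelEncoder
# Replace the strings in column 0 with integer labels (categories are sorted alphabetically)
labelencoder_direccion_viento_pos = LabelEncoder()
direccion_viento_pos[:, 0] = labelencoder_direccion_viento_pos.fit_transform(direccion_viento_pos[:, 0])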
  2. I use one-hot encoding to convert each category value into a new column and assign a 1 or 0 (True/False) value to that column:

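    from sklearn.preprocessing import OneHotEncoder
    # Note: categorical_features was deprecated in scikit-learn 0.20 and removed in 0.22
    onehotencoder = OneHotEncoder(categorical_features=[0])
    direccion_viento_pos = onehotencoder.fit_transform(direccion_viento_pos).toarray()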
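
The two steps above can be sketched on a small example. This is only an illustration under assumptions: the `sample` array is made up (it is not the original dataset), and it assumes scikit-learn 0.20 or later, where `OneHotEncoder` accepts string categories directly and needs a 2D input.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical sample values, not the original dataset
sample = np.array(["SO", "SE", "S", "N", "NO", "NE", "O", "E", "SO"])

# Step 1: label encoding maps each category to an integer (sorted order)
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(sample)

# Step 2: one-hot encoding expects a 2D input, one column per feature
onehot_encoder = OneHotEncoder()
onehot = onehot_encoder.fit_transform(sample.reshape(-1, 1)).toarray()

print(label_encoder.classes_)  # the 8 categories in sorted order
print(labels)                  # one integer per row
print(onehot.shape)            # (rows, number of categories)
```

In this sketch the one-hot step works on the raw strings, so the label-encoding step is shown only for comparison; each row of `onehot` contains exactly one 1.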

This way, I get these new values:

array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

Then I convert this direccion_viento_pos array to a dataframe to visualize it better:

# Turn the array into a dataframe with named columns (same sorted order the encoder uses)
cols = ['E', 'N', 'NE', 'NO', 'O', 'S', 'SE', 'SO']
df_direccion_viento = pd.DataFrame(direccion_viento_pos, columns=cols)
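
Instead of typing the column names by hand, they can be read from the fitted encoder. The following is a sketch under assumptions: the `sample` array and the `df_sample` name are hypothetical, and it relies on the `categories_` attribute that `OneHotEncoder` exposes in scikit-learn 0.20 and later.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column sample, not the original dataset
sample = np.array(["SO", "N", "E", "S", "NE", "O", "SE", "NO"]).reshape(-1, 1)

encoder = OneHotEncoder()
encoded = encoder.fit_transform(sample).toarray()

# categories_ holds one array of sorted categories per input column
cols = list(encoder.categories_[0])
df_sample = pd.DataFrame(encoded, columns=cols)
print(df_sample.head())
```

Reading the names from the encoder keeps the column labels aligned with the encoded columns even if the set of categories changes.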

This gives me a new column for each category value, with a 1 or 0 (True/False) in each row.

If I use the pandas.get_dummies() function, I get the same result.
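
For comparison, a minimal `pandas.get_dummies()` sketch, assuming a small made-up dataframe `df` (hypothetical values, not the original data). The `dtype=float` argument is passed so the output is 0.0/1.0 rather than the boolean values recent pandas versions return by default.

```python
import pandas as pd

# Hypothetical sample, not the original dataset
df = pd.DataFrame({"Direccion del viento (Pos)": ["SO", "SE", "S", "N", "SO"]})

# One indicator column per category present in the data
dummies = pd.get_dummies(df["Direccion del viento (Pos)"], dtype=float)
print(dummies)
```

Note that `get_dummies()` only creates columns for categories that actually appear in the data, so this sample yields four columns rather than eight.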

My question is:
Is this the best way to deal with these categorical variables?
Doesn't having a column for each category, with zeros in most of them, introduce bias or noise into the data when machine learning algorithms are applied?

I’ve recently started reading about it in this article, but I would appreciate any guidance on this.

Posts: 1

Participants: 1
