Whenever we feed text into a computer, it encodes the text as 0s and 1s, which humans cannot read directly. The computer interprets these numbers as instructions for displaying text, sound, images, and so on in a form that is meaningful to people. Similarly, when we send data to a machine learning (ML) model, we need to supply it in the proper format, since algorithms only understand numbers. Categorical variables cannot be passed to a model as raw text, yet they contain valuable information about the data. In this article, we will learn how to encode categorical variables as numeric values with Pandas and Scikit-learn.
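Categorical variables are generally represented as ‘strings’ or ‘categories’ and take a finite number of values. Two examples used later in this article are a person’s degree and a car’s body style.
There are two types of categorical data, ordinal and nominal. Ordinal data has an inherent order (such as a person’s degree), whereas nominal data has none (such as a car’s body style).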
Now that we know what categorical variables are, let’s look at the options for encoding them using Pandas and Scikit-learn.

Find and replace

The simplest method of encoding categorical data is find and replace. The Pandas replace() method replaces each matching value in a column with a new value. Here’s how it works:

Suppose a dataset has a column named “number of cylinders” and no car in it has more than 4 cylinders, so the values in this column cannot exceed 4. The problem is that these values are written as text, such as “two” or “one”. We can directly replace these text values with their numeric equivalents by using the replace() function provided by Pandas.
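The original code for this step is not preserved, so the following is only a minimal sketch of the idea. The column name `num_cylinders`, the sample values, and the `cylinder_map` dictionary are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: cylinder counts stored as text (not the article's dataset)
df = pd.DataFrame({"num_cylinders": ["two", "four", "one", "three", "four"]})

# Mapping dictionary from each text value to its numeric equivalent
cylinder_map = {"one": 1, "two": 2, "three": 3, "four": 4}

# replace() swaps every matching value for its mapped number;
# astype(int) makes the resulting integer type explicit
df["num_cylinders"] = df["num_cylinders"].replace(cylinder_map).astype(int)

print(df["num_cylinders"].tolist())
```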
Here, we create a mapping dictionary that maps each text value to its numeric value. This approach is very useful when dealing with ordinal data because we need to maintain the order of the categories. In the earlier example of “a person’s degree”, we can map the highest degree to the largest number and the lowest degree to the smallest number.

Label encoding

In this approach, each label is assigned a unique integer based on alphabetical ordering. We can implement this using the Scikit-learn library.
This dataset contains some null values, so it’s important to remove them. Let’s look at the data type of these features:
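The article's dataset is not reproduced here, so the sketch below uses a tiny hypothetical stand-in with invented values to show how null rows might be dropped and the column types inspected. Only the column names `symboling` and `body_style` come from the text; `make` and all of the values are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the car dataset (values invented for illustration)
df = pd.DataFrame({
    "symboling": [3, 1, 2, 0],
    "make": ["audi", "bmw", None, "volvo"],
    "body_style": ["convertible", "sedan", "hatchback", "wagon"],
})

# Drop every row that contains at least one null value
df = df.dropna()

# Inspect the data type of each remaining column
print(df.dtypes)
```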
We can see that almost all the variables are represented by the object data type, except the “symboling” column. Let’s encode the “body_style” column:
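As a rough sketch of this step, the code below applies Scikit-learn's `LabelEncoder` to a hypothetical `body_style` column. The sample values are invented, and "hardtop" is an assumed category included only so that the alphabetical codes line up with those discussed in the text.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical body_style values standing in for the article's column
df = pd.DataFrame(
    {"body_style": ["convertible", "hatchback", "sedan", "wagon", "hardtop", "sedan"]}
)

encoder = LabelEncoder()
# fit_transform sorts the unique labels and maps each one to its index
df["body_style_code"] = encoder.fit_transform(df["body_style"])

# classes_ lists the labels in the order of their integer codes
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
```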
Since label encoding uses alphabetical ordering, “convertible” has been encoded as 0, “hatchback” as 2, and “sedan” as 3. There must be another category in body_style that was encoded as 1. The categories in the “body_style” column have no inherent order, yet label encoding ranks them alphabetically. Because of this artificial order, the model may learn a relationship between the categories that does not actually exist.

One-hot encoding

We generally use one-hot encoding to overcome this disadvantage of label encoding. The strategy is to convert each category into its own column and assign it a value of 1 or 0. This is the process of creating dummy variables. Let’s see how we can implement it in Python:
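The original implementation is not preserved, so this is a minimal sketch that assumes Pandas' `get_dummies()` is an acceptable way to produce the dummy columns. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical body_style values (not the article's dataset)
df = pd.DataFrame({"body_style": ["convertible", "hatchback", "sedan", "sedan"]})

# One new column per category; dtype=int gives 1/0 rather than True/False
one_hot = pd.get_dummies(df["body_style"], dtype=int)
print(one_hot)
```

Each row contains exactly one 1, in the column that matches its original category.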
In the resulting table, every unique category is assigned a new column. If a category is present, its column holds 1; otherwise it holds 0. Because each row has exactly one 1 across the new columns, the value of any one column can be predicted from the remaining columns. This is known as the dummy variable trap, and it arises when variables are highly correlated with one another. It leads to a collinearity problem, which causes issues in various regression models.

There’s another problem with this method: if there are many unique categories, encoding them creates many extra columns. This eventually increases model complexity and training time, as it takes longer to analyze the relationships between the variables.

Converting categorical data to numerical data using Pandas

The following methods can be used to convert categorical data to numeric data using Pandas.

Method 1: Using get_dummies()

Syntax: pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
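The following sketch shows `get_dummies()` applied to a whole DataFrame. The `fuel` and `doors` columns and their values are hypothetical. Passing `drop_first=True` is one common way to sidestep the dummy variable trap described earlier; the source does not say whether the original example used it.

```python
import pandas as pd

# Hypothetical data with one categorical and one numeric column
df = pd.DataFrame({
    "fuel": ["gas", "diesel", "gas", "electric"],
    "doors": [2, 4, 4, 2],
})

# columns= selects what to encode and prefix= names the new columns;
# drop_first=True drops the first category (alphabetically, "diesel")
encoded = pd.get_dummies(
    df, columns=["fuel"], prefix="fuel", drop_first=True, dtype=int
)
print(encoded)
```

A row with 0 in every remaining `fuel_` column then stands for the dropped category.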
Method 2: Using replace()

Syntax: DataFrame.replace(to_replace, value)
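As an illustrative sketch, `replace()` can also be called on a whole DataFrame with a nested dictionary, so that each column gets its own mapping. The column names and values below are hypothetical.

```python
import pandas as pd

# Hypothetical data (invented for illustration)
df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "membership": ["gold", "silver", "gold", "bronze"],
})

# Nested dictionary: outer keys are column names,
# inner dictionaries map each old value to its new value
encoded = df.replace({
    "gender": {"male": 0, "female": 1},
    "membership": {"bronze": 0, "silver": 1, "gold": 2},
})
print(encoded)
```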
Convert the same data using a different approach:
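The source does not preserve which alternative was shown. One possibility, sketched here as an assumption, is to pass two parallel lists to `replace()` instead of a dictionary. The `gender` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data (invented for illustration)
df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# Parallel lists: the i-th old value is replaced by the i-th new value
df["gender"] = df["gender"].replace(["male", "female"], [0, 1])
print(df["gender"].tolist())
```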
Converting categorical data to numerical data in Scikit-learn can be done in the following ways:

Method 1: Label encoding

Scikit-learn provides the LabelEncoder class, which assigns an integer to each category as described in the label encoding section.

Method 2: One-hot encoding

Scikit-learn provides the OneHotEncoder class, which creates one binary column per category.

Which encoding technique to use?

To know which encoding technique to use, we first need to understand our data well. We then need to decide which model to apply. For example, if there are more than 15 categorical features and we decide to use the support vector machine (SVM) algorithm, training time might increase because SVM is slow to train, and feeding it many separate features adds to the model’s complexity and training time.

Below are some key points to note when choosing an encoding technique:

Use the find and replace method when the data is ordinal and the order of the categories must be preserved.

Use one-hot encoding when the data is nominal and the number of unique categories is small enough that the extra columns remain manageable.

Use label encoding when one-hot encoding would create too many columns, bearing in mind that the alphabetical integer codes can suggest an order that does not exist.
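To make the trade-off concrete, here is a small sketch that runs the two Scikit-learn encoders mentioned above on the same hypothetical nominal feature. The `colors` array is invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical nominal feature (values invented for illustration)
colors = np.array(["red", "green", "blue", "green"])

# Label encoding: a single integer column, coded in alphabetical order
labels = LabelEncoder().fit_transform(colors)

# One-hot encoding: one binary column per category.
# OneHotEncoder expects 2-D input, hence the reshape; the result is a
# sparse matrix by default, so toarray() makes it dense for printing.
one_hot = OneHotEncoder().fit_transform(colors.reshape(-1, 1)).toarray()

print(labels)
print(one_hot)
```

Label encoding keeps a single column but implies blue < green < red, while one-hot encoding avoids that implied order at the cost of three columns.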
We have explored the various ways to encode categorical data, along with their issues and suitable use cases. To summarize, encoding is a crucial and unavoidable part of feature engineering. It’s important to know the advantages and limitations of each method so that the model can learn properly.

How do you convert numeric data to categorical data?

At first thought, converting numeric data to categorical data seems like an easy problem. One simple approach is to divide the raw source data into equal intervals. For example, if the values run from 60.0 to 78.0, the range is 78.0 - 60.0 = 18.0.
How do you convert numerical data to categorical data in Pandas?

The Pandas cut function, pd.cut(), is a great way to transform continuous data into categorical data. For example, ages can be binned as follows: 0 to 2 = 'Toddler/Baby', 3 to 17 = 'Child', 18 to 65 = 'Adult', 66 to 99 = 'Elderly'.

How do you make a numeric variable categorical in Python?

Step 1 - Import the library: import pandas as pd. ...
Step 2 - Setting up the data: We have created a dictionary and passed it through pd.DataFrame to create a dataframe with the columns 'name', 'episodes', 'gender'. ...
Step 3 - Making dummy variables and printing the final dataset.

How do you change data type in Python?

It is possible to convert a string to an integer in Python using the int() function, while the str() function converts an integer to a string. Integers can be converted to floats using float(), and floats can be changed to integers, although this can cause data loss.
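The pd.cut() binning described above can be sketched as follows. The sample ages are hypothetical, and the bin edges are an assumption chosen so that the four ranges from the text fall into right-inclusive intervals.

```python
import pandas as pd

# Hypothetical ages (invented for illustration)
ages = pd.Series([1, 10, 30, 70])

# Intervals are right-inclusive by default: (-1, 2], (2, 17], (17, 65], (65, 99]
bins = [-1, 2, 17, 65, 99]
labels = ["Toddler/Baby", "Child", "Adult", "Elderly"]

age_group = pd.cut(ages, bins=bins, labels=labels)
print(age_group.tolist())
```

Starting the first edge at -1 keeps an age of 0 inside the first bin, because the left edge of each interval is excluded.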