-encode- it


Over lunch today, a friend asked how to generate a new variable that will have unique numeric IDs corresponding to the string values in an existing variable.  The first command that comes to mind is -encode-*. -encode- generates a numeric variable from a string variable and uses the string values as labels for the generated numeric values. Its partner, -decode-, does the reverse. To illustrate, let’s use the overused** auto.dta:

/***********************************************/
sysuse auto, clear
encode make, gen(make_id)
/***********************************************/

By default, the numeric codes are assigned in the alphabetical (sort) order of the string values, so the first make alphabetically gets 1, the second gets 2, and so on.
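
To see what was actually created, here is a minimal sketch that lists the value label and then goes back the other way with -decode-. The name make_str is only an illustrative choice of mine; and since no label() option is given, -encode- names the value label after the new variable, which is why -label list- asks for make_id.

/***********************************************/
sysuse auto, clear
encode make, gen(make_id)

* make_id is numeric; its value label is also called make_id
describe make_id
label list make_id

* -decode- rebuilds a string variable from the value label
decode make_id, gen(make_str)
assert make_str == make
/***********************************************/

The -assert- stays silent when every decoded string matches the original, and stops with an error otherwise.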

What -encode- does is save you from writing longer code, such as:

/***********************************************/
sysuse auto, clear
gen byte make_id = .
replace make_id = 1 if make == "AMC Concord"
replace make_id = 2 if make == "AMC Pacer"
.
.
.
replace make_id = 74 if make == "Volvo 260"
/* By the time you get here, you could have finished an episode of
"The Big Bang Theory" */


label define make 1 "AMC Concord" 2 "AMC Pacer" …
/* and another episode here */

/***********************************************/

or the more complex, and equally unnecessary, loop:

/***********************************************/
sysuse auto, clear
levelsof make, local(l)
gen byte make_id = .
local id = 1
foreach i of local l {
    /* assign the next integer to this make and add it to the label */
    replace make_id = `id' if make == "`i'"
    label define make `id' "`i'", add
    local id = `id' + 1
}
label values make_id make
/***********************************************/

Another way is to use the group() function of -egen-:

/***********************************************/
sysuse auto, clear
egen make_id = group(make)
/***********************************************/

Used this way, you still have to create and attach the value labels to make_id yourself. As Nick Cox points out in the comments, though, -group()- has a -label- option that takes care of this.
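
A minimal sketch of that option, for comparison with -encode-. With -label-, -egen- should build the value label from the string values and attach it in one step. I have not named the label explicitly here, so -describe- is used to show whichever value label ends up attached.

/***********************************************/
sysuse auto, clear
egen make_id = group(make), label

* show the storage type and the value label attached to make_id
describe make_id

* the first few makes next to their labeled codes
list make make_id in 1/5
/***********************************************/

As with -encode-, the codes run from 1 upward in the sort order of make.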

See -help encode- for more options and for its counterpart -decode-.


*I came across -encode- in Christopher Baum’s An Introduction to Stata Programming


**Does one lose a byte when data is overused? Sort of the ‘wear-and-tear’ we see in most things that aren’t invisible.

3 Responses

  1. Two points:

    1. If you were using -egen, group()- it has an option -label-. So you don’t need to do much extra work to attach labels.

    2. -multencode- from SSC is one way of ensuring that the same association between values and value labels is used for several variables.

  2. I sometimes encode manually by doing:
    contract make
    gen make_id=[_n]

    This works well when used with append and merge and I tend to use it when I want the same serial numbers to be harmonized across many files but I don’t care very much about the labels. For instance, I sometimes do this kind of thing when dealing with IMDb, which is both huge and relational.

    • I also use the _n subscript when labels are not necessary, but I have yet to use -contract-. What I used to do was delete duplicates using -duplicates drop- and then drop the extra variables. Learned something new today… thanks.
