-encode- it


Over lunch today, a friend asked how to generate a new variable holding a unique numeric ID for each string value in an existing variable. The first command that comes to mind is -encode-*. -encode- generates a numeric variable from a string variable and uses the string values as value labels for the generated numeric codes. Its partner, -decode-, does the reverse. To illustrate, let’s use the overused** auto.dta:

/***********************************************/
sysuse auto, clear
encode make, gen(make_id)
/***********************************************/

By default, the codes are assigned in the alphabetical order of the string values: the first value in sorted order gets 1, the next gets 2, and so on.
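
To check the mapping for yourself, something along these lines should do. It is only a sketch, and it relies on -encode- naming the value label after the new variable (make_id here) when no -label()- option is given:

/***********************************************/
sysuse auto, clear
encode make, gen(make_id)

* first few observations, with labels and then with the underlying codes
list make make_id in 1/5
list make make_id in 1/5, nolabel

* the full code-to-string mapping stored in the value label
label list make_id
/***********************************************/

The -nolabel- listing is the quickest way to confirm that make_id really is numeric underneath, even though it displays like the original string.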

What -encode- does is save you from writing longer code, such as:

/***********************************************/
sysuse auto, clear
gen byte make_id = .
replace make_id = 1 if make == "AMC Concord"
replace make_id = 2 if make == "AMC Pacer"
.
.
.
replace make_id = 74 if make == "Volvo 260"
/* By the time you get here, you could have finished an episode of
"The Big Bang Theory" */

label define make 1 "AMC Concord" 2 "AMC Pacer" …
/* and another episode here */
label values make_id make

/***********************************************/

or the more complex but unnecessary

/***********************************************/
sysuse auto, clear
levelsof make, local(l)
gen byte make_id = .
local id = 1
foreach i of local l {
    replace make_id = `id' if make == "`i'"
    label define make `id' "`i'", add
    local id = `id' + 1
}
label values make_id make
/***********************************************/

Another way is to use the -group()- function of -egen-. Example:

/***********************************************/
sysuse auto, clear
egen make_id = group(make)
/***********************************************/

As written, this leaves you to create and attach the value labels to make_id yourself. As Nick Cox pointed out in his comment, however, -group- has a -label- option that takes care of that.
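
A minimal sketch of that variant follows. The -decode- check at the end is my own addition to confirm that the labels match the original strings, and make_check is just an illustrative name:

/***********************************************/
sysuse auto, clear
egen make_id = group(make), label

* confirm that a value label is attached to make_id
describe make_id

* optional sanity check: the labels should reproduce the original strings
decode make_id, gen(make_check)
assert make_check == make
/***********************************************/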

See -help encode- for more options and for its counterpart -decode-.
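
Two of those options in action, as a rough sketch rather than a recipe. The names make_str, origin_str, origin_lbl and origin_id are made up for illustration. The first part round-trips through -decode-. The second uses -label()- so that -encode- follows a value label defined beforehand instead of alphabetical order:

/***********************************************/
sysuse auto, clear
encode make, gen(make_id)

* -decode- turns the labeled numeric variable back into a string
decode make_id, gen(make_str)
assert make_str == make

* build a string variable to play with: "Domestic" or "Foreign"
decode foreign, gen(origin_str)

* define the label first, then tell -encode- to use it
label define origin_lbl 1 "Foreign" 2 "Domestic"
encode origin_str, gen(origin_id) label(origin_lbl)

* the codes follow origin_lbl, not alphabetical order
tabulate origin_id, nolabel
/***********************************************/

Without -label()-, "Domestic" would have been coded 1 and "Foreign" 2.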


*I came across -encode- in Christopher Baum’s An Introduction to Stata Programming.


**Does one lose a byte when data is overused? Sort of the ‘wear-and-tear’ we see in most things that aren’t invisible.