That dummy

“There are only 10 types of people in the world —
those who understand binary, and those who don’t.”

Dummy variables are almost always defined using the two digits of the binary system: 0 and 1. To illustrate how to create dummies, we will use the dataset auto.dta, available on Stata’s website.

webuse auto
/* -webuse- loads a dataset from the Stata website. Type “help webuse” for more details. */

gen fuelecon=(mpg<=20)
/* define fuelecon=1 if the condition mpg<=20 (miles per gallon) holds and fuelecon=0 if it does not */

This is equivalent to:

gen fuelecon=1 if mpg<=20
/* define fuelecon=1 if mpg<=20 (miles per gallon) */
replace fuelecon=0 if mpg>20
/* and fuelecon=0 if mpg>20 (miles per gallon) */
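
To check that the two approaches really agree, here is a minimal sketch that builds the dummy both ways and compares them. The names fuelecon_a and fuelecon_b are made up for this illustration so that neither clashes with fuelecon above; -assert- stops with an error if the two ever differ.

```stata
webuse auto, clear

* One-line version: the logical expression evaluates to 1 (true) or 0 (false)
gen byte fuelecon_a = (mpg <= 20)

* Two-step version: set the 1s first, then fill in the 0s
gen byte fuelecon_b = 1 if mpg <= 20
replace fuelecon_b = 0 if mpg > 20

* Silent if the two dummies match in every observation
assert fuelecon_a == fuelecon_b
```

Storing the dummy as a byte is optional; it only saves memory compared with the default float.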

CAUTION: Missing values. Stata treats a missing value (.) as larger than any number, so a condition such as rep78>4 is true when rep78 is missing. See the example below.

tab rep78, m

      rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.70        2.70
          2 |          8       10.81       13.51
          3 |         30       40.54       54.05
          4 |         18       24.32       78.38
          5 |         11       14.86       93.24
          . |          5        6.76      100.00
------------+-----------------------------------
      Total |         74      100.00

gen repmorethan4=(rep78>4)

tab repmorethan4, m

      rep~4 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         58       78.38       78.38
          1 |         16       21.62      100.00
------------+-----------------------------------
      Total |         74      100.00

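We have just instructed Stata to code the 5 cars with missing values as if they have been repaired more than 4 times: the 16 ones are the 11 cars with rep78==5 plus the 5 missing. Not cool. There are two solutions. The first is to add the missing values as a condition, which codes the missing values as 0 instead of 1:

replace repmorethan4=(rep78>4 & rep78~=.)

The second is to use the -if- qualifier, which leaves the missing values as missing in the dummy. Note that a -replace- with -if- would skip the cars with missing rep78 and leave their wrong 1s in place, so drop the variable and generate it again:

drop repmorethan4
gen repmorethan4=rep78>4 if rep78~=.

or

drop repmorethan4
gen repmorethan4=rep78>4 if rep78<.

Coding the missing values as 0 is correct only if we know that a missing value means that the car was not repaired. Unfortunately, without prior information, a missing value could also mean that the information is indeed missing, in which case the dummy should be missing too. ALWAYS KNOW WHAT MISSING VALUES MEAN AND KNOW WHERE THEY GO.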
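
As a rough sketch of the two choices side by side, the lines below build one dummy that sends missing values to 0 and one that keeps them missing, then check where the missing values went. The variable names rep_gt4_zero and rep_gt4_miss are invented for this example. It uses the missing() function rather than rep78~=. because Stata also has extended missing values (.a to .z), which are larger than . and would slip past a ~=. test; rep78<. and !missing(rep78) exclude all of them.

```stata
webuse auto, clear

* Choice 1: a missing repair record is coded as 0
gen byte rep_gt4_zero = (rep78 > 4 & !missing(rep78))

* Choice 2: a missing repair record stays missing in the dummy
gen byte rep_gt4_miss = (rep78 > 4) if !missing(rep78)

* Confirm where the missing values went
assert rep_gt4_zero == 0 if missing(rep78)
assert missing(rep_gt4_miss) if missing(rep78)

* Cross-tabulate the two versions, showing missing values
tab rep_gt4_zero rep_gt4_miss, m
```

Which version is right depends on what a missing rep78 means in your data, not on the syntax.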

If we have many categories, it is easier to use the -tab- command and gen() option. For example, if we want to create dummy variables for each of the 5 values of rep78, we type:

tab rep78, gen(rep78_)

This command will create 5 dummy variables: rep78_1, rep78_2, rep78_3, rep78_4, and rep78_5. rep78_1 is 1 if rep78==1 and 0 otherwise, and so on up to rep78_5, which is 1 if rep78==5 and 0 otherwise. No variable is created for the missing values, and the cars with missing rep78 are set to missing in all 5 dummies. If you want to create a variable for the missing values, just specify the missing option for -tab-:

tab rep78, gen(rep78_) m
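
The sketch below runs both versions back to back so the difference is visible. It assumes a fresh copy of the data and drops the first set of dummies before creating the second, since -tab, gen()- will refuse to overwrite variables that already exist. With the missing option, the missing category sorts last, so its dummy should be the sixth one, rep78_6.

```stata
webuse auto, clear

* Without the missing option: 5 dummies
tab rep78, gen(rep78_)

* How many cars have no value in the first dummy?
count if missing(rep78_1)

* Remove the first set so the same stub can be reused
drop rep78_*

* With the missing option: a 6th dummy flags the missing values
tab rep78, gen(rep78_) m

* rep78_6 should be 1 exactly where rep78 is missing
assert rep78_6 == missing(rep78)
describe rep78_*
```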
