creative destruction: collapse and contract

Creative destruction, a term coined by Joseph Schumpeter in Capitalism, Socialism, and Democracy, refers to the process by which new innovations kill off old, inefficient products or processes. That is not what we are talking about here. We mean destroying data to create more useful information. By destroying, we mean altering the data currently loaded in memory, with no undo button to rely on. When you load or open data in Stata, Stata stores the data in your machine’s RAM. Any changes you make are therefore not permanent, or saved to your hard drive, until you call -save-. Still, be careful not to overwrite your raw data files.
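Since there is no undo button, one common safeguard is to wrap destructive commands in -preserve- and -restore-. The lines below are only a sketch of that idea using the auto data from Stata’s website; the new variable names mean_mpg, n, and count are made up for this illustration.

```stata
webuse auto, clear

/* take a snapshot of the data in memory before destroying anything */
preserve

/* collapse: replace the data with one row per value of foreign */
collapse (mean) mean_mpg=mpg (count) n=mpg, by(foreign)
list

/* bring back the original car-level data */
restore

preserve

/* contract: replace the data with the frequency of each rep78-foreign combination */
contract rep78 foreign, freq(count)
list

restore
```

After the final -restore-, the data in memory are the original cars again, so nothing has been lost.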

That dummy

“There are only 10 types of people in the world —
those who understand binary, and those who don’t.”

It is almost always the case that dummy variables are defined using the two digits of the binary system: 0 and 1. To illustrate how to create dummies, we will use the auto.dta data available on Stata’s website.

webuse auto
/* -webuse- loads dataset from Stata web site. Type “help webuse” for more details. */

gen fuelecon=(mpg<=20)
/* define fuelecon=1 if the condition mpg<=20 (miles per gallon) holds and fuelecon=0 if it does not */

This is equivalent to:

gen fuelecon=1 if mpg<=20
/* define fuelecon=1 if mpg<=20 (miles per gallon) */
replace fuelecon=0 if mpg>20
/* and fuelecon=0 if mpg>20 (miles per gallon) */
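If you want to convince yourself that the two approaches give the same result, a quick check along these lines may help. This is only a sketch: fuelecon_alt is a made-up name for the second version, and the byte storage type is an optional choice that was not part of the commands above.

```stata
webuse auto, clear

/* one-line version: the logical expression evaluates to 1 or 0 */
gen byte fuelecon = (mpg<=20)

/* two-step version under a different (hypothetical) name */
gen byte fuelecon_alt = 1 if mpg<=20
replace fuelecon_alt = 0 if mpg>20

/* -assert- stops with an error if the two variables ever differ */
assert fuelecon == fuelecon_alt

tab fuelecon
```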

CAUTION: Missing values. Stata treats numeric missing values as larger than any nonmissing number, so a condition such as x>4 is true when x is missing. See the example below.

tab rep78, m

      rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.70        2.70
          2 |          8       10.81       13.51
          3 |         30       40.54       54.05
          4 |         18       24.32       78.38
          5 |         11       14.86       93.24
          . |          5        6.76      100.00
------------+-----------------------------------
      Total |         74      100.00

gen repmorethan4=(rep78>4)

tab repmorethan4, m

      rep~4 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         58       78.38       78.38
          1 |         16       21.62      100.00
------------+-----------------------------------
      Total |         74      100.00

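We have just instructed Stata to code the cars with a missing rep78 as if their repair record were greater than 4. Not cool. The solution is to deal with the missing values explicitly, either inside the condition or with the -if- qualifier.

To code the missing values as 0:

replace repmorethan4=(rep78>4 & rep78~=.)

To leave them missing instead, start over and use -if-. A plain -replace- with -if- would simply skip the cars with a missing rep78 and leave them coded as 1.

drop repmorethan4
gen repmorethan4=rep78>4 if rep78<.
/* if rep78~=. also works here; rep78<. additionally excludes the extended missing values .a to .z */

With the first command, the missing values are coded as 0 instead of 1. Note that this is correct only if we know that a missing value means that the car was not repaired. Unfortunately, without prior information, a missing value could also mean that the information really is missing, and in that case the second approach, which keeps the dummy missing, is the safer one. ALWAYS KNOW WHAT MISSING VALUES MEAN AND KNOW WHERE THEY GO.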
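One way to see where the missing values went is to cross-tabulate the new dummy against the original variable. The sketch below assumes you want the dummy to stay missing when rep78 is missing; the missing() function is simply another way of writing the condition and is my choice for this illustration.

```stata
webuse auto, clear

/* leave the dummy missing wherever rep78 is missing */
gen byte repmorethan4 = (rep78>4) if !missing(rep78)

/* cross-tabulate against the original variable, keeping missing values visible */
tab rep78 repmorethan4, m

/* no car with a missing rep78 should be coded as 1 */
count if repmorethan4==1 & missing(rep78)
assert r(N)==0
```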

If we have many categories, it is easier to use the -tab- command and gen() option. For example, if we want to create dummy variables for each of the 5 values of rep78, we type:

tab rep78, gen(rep78_)

This command will create 5 dummy variables: rep78_1, rep78_2, rep78_3, rep78_4, and rep78_5. rep78_1 is 1 if rep78==1 and 0 otherwise, and so on up to rep78_5, which is 1 if rep78==5 and 0 otherwise. No variable is created for the missing values. If you do want a dummy for the missing values, just specify the missing option for -tab-, which adds a sixth variable, rep78_6:

tab rep78, gen(rep78_) m
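A simple sanity check on the generated dummies is that every car with a nonmissing rep78 falls in exactly one category. The following is a sketch of such a check; ndummies is a made-up variable name.

```stata
webuse auto, clear

tab rep78, gen(rep78_)

/* add up the dummies across each row; rowtotal() treats missing as 0 */
egen ndummies = rowtotal(rep78_*)

/* exactly one dummy should be switched on when rep78 is not missing */
assert ndummies==1 if !missing(rep78)

/* look at the number of observations and the mean of each dummy */
sum rep78_*
```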

Ways to count the number of unique values in a variable

There are at least 3 convenient ways to count the number of distinct values contained in a variable: -tab-, -inspect-, and -codebook-.

tab varname, nofreq
display r(r)

The nofreq option suppresses the frequency table. Besides displaying output in the Results window, Stata stores the results of some commands so that you can use them in subsequent commands. Results of r-class commands, such as -tab-, are stored in r(). In the example above, display r(r) returns the number of rows in the table, that is, the number of distinct values of varname (missing values are not counted unless you add the missing option). The problem with using -tab- to count distinct values is its row limit: 12,000 rows (Stata/MP and Stata/SE), 3,000 rows (Stata/IC), or 500 rows (Small Stata).
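Applied to rep78 in the auto data, the two lines look like this. The second call is an addition for illustration and shows how the missing option changes the count.

```stata
webuse auto, clear

tab rep78, nofreq
display r(r)        /* distinct nonmissing values of rep78 */

tab rep78, nofreq missing
display r(r)        /* missing now counts as one more row */
```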

inspect varlist
display r(N_unique)

Besides reporting the number of unique values, -inspect- also reports the number of negative, zero, positive, and missing values, and it draws a small histogram. If the number of unique values is 99 or fewer, -inspect- reports the actual number in its output, so there is no need to display r(N_unique). If there are more than 99, the output only says “More than 99 unique values”. According to the -inspect- documentation, r(N_unique) is then stored as missing (.), so the second line will not give you the count either; use -tab- or -codebook- in that case.
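To see everything -inspect- leaves behind, you can follow it with -return list-. A short sketch on rep78 from the auto data:

```stata
webuse auto, clear

inspect rep78

/* show all results stored in r() by the last command */
return list

display r(N_unique)
```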

codebook varlist

-codebook- also provides other summaries besides the number of unique values: the type of variable (numeric, etc.), the range of values, the mean, the standard deviation, the number of missing values, and some percentiles.

Note: If varlist is not specified in -inspect- and -codebook-, the commands will return the reports for all variables.
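As an illustration on the auto data, the sketch below runs the full report for one variable and then the compact option, which prints one line per variable with a column for the number of unique values. The choice of variables is arbitrary.

```stata
webuse auto, clear

/* full report for a single variable */
codebook rep78

/* one line per variable, including the number of unique values */
codebook rep78 mpg foreign, compact
```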