break me


Sometimes we want to break a continuous variable into a smaller set of categories—into evenly spaced or equally sized groups, or groups based on limits we specify, or groups based on another variable or a set of variables.

Let us take for example the variable price of cars in auto.dta.
sysuse auto.dta, clear    // open a Stata built-in data
summ price



Suppose we want to break price into 5 categories according to different ways of grouping: 1 evenly spaced, another equally sized, and so on. We create the following variables: price_cat1, price_cat2, ..., price_cat4.
  1. price_cat1 groups the prices into evenly spaced intervals, say, by 3300.
  2. price_cat2 groups the prices into 5 equally sized groups. Equally is used loosely here. In our example data there are 74 cars, 1 group will definitely have 1 less car in it.
  3. price_cat3 groups the prices into categories bounded to the left by 0, 5000, 8000, 10000. Why those boundaries? Because I can… :p
  4. price_cat4 groups the prices based on the variable rep78.

This is a walk in the park if you have already mastered the generate and replace commands and the qualifier if. But, even better (a walk in the park with ice cream and a puppy), are the shortcuts. For each case, we first use if followed by shorter ways to get the same result.
*** Case 1: groups the prices into evenly spaced intervals
* using the if qualifier
gen price_cat1 = .
replace price_cat1 = 1 if price>=0 & price<3300
replace price_cat1 = 2 if price>=3300 & price<6600
replace price_cat1 = 3 if price>=6600 & price<9900
replace price_cat1 = 4 if price>=9900 & price<13200
replace price_cat1 = 5 if price>=13200 & price<16500

. tab price_cat1, m

 price_cat1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.70        2.70
          2 |         55       74.32       77.03
          3 |          7        9.46       86.49
          4 |          6        8.11       94.59
          5 |          4        5.41      100.00
------------+-----------------------------------
      Total |         74      100.00

* using the egen function cut()
drop price_cat1
egen price_cat1 = cut(price), at(0(3300)16500)

The generated codes for the categories correspond to the left ends of the intervals. In the at() option, you may also list of breakpoints separated by spaces or commas instead of using the interval syntax for numlist, e.g. at(0 3300 6600 9900 13200 16600). Adding the option icodes will generate codes 0,1,..,5 instead of the left ends.
. tab price_cat1, m

 price_cat1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          2        2.70        2.70
       3300 |         55       74.32       77.03
       6600 |          7        9.46       86.49
       9900 |          6        8.11       94.59
      13200 |          4        5.41      100.00
------------+-----------------------------------
      Total |         74      100.00

That’s it! The 6 lines we wrote earlier can be written concisely in 1 line. Be careful however with the breakpoints. In our example above, the intervals are inclusive of the left end, i.e., from a up to (but excluding b) or, in set notation, [a,b). In some cases, you may want that the interval is inclusive of the right end, (a,b]. In the last case and where there are observations exactly at the breakpoints, egen cut() could be erroneous. We can use instead the function recode() or autocode() for the case (a,b].
* using recode() or autocode() function
gen price_cat1b = recode(price, 3300, 6600, 9900, 13200, 16500)
gen price_cat1c = autocode(price, 5, 0, 16500)

. tab price_cat1b, m

price_cat1b |      Freq.     Percent        Cum.
------------+-----------------------------------
       3300 |          2        2.70        2.70
       6600 |         55       74.32       77.03
       9900 |          7        9.46       86.49
      13200 |          6        8.11       94.59
      16500 |          4        5.41      100.00
------------+-----------------------------------
      Total |         74      100.00

. tab price_cat1c, m

price_cat1c |      Freq.     Percent        Cum.
------------+-----------------------------------
       3300 |          2        2.70        2.70
       6600 |         55       74.32       77.03
       9900 |          7        9.46       86.49
      13200 |          6        8.11       94.59
      16500 |          4        5.41      100.00
------------+-----------------------------------
      Total |         74      100.00

Both commands generate the same results (only because we want them to). In recode(), Stata compares the first argument (price in our case) with the second argument (3300, the first right end). If the price is less than or equal to 3300 it will return 3300. If not, then Stata compares the price with the third argument (the next boundary), and so on. Using the longhand with the if qualifier, this is equivalent to:
gen price_cat1b = 3300 if price<=3300
replace price_cat1b = 6600 if price>3300 & price<=6600
replace price_cat1b = 9900 if price>6600 & price<=9900
.
.
.

In autocode(), Stata breaks price (first argument) into 5 evenly spaced groups (second argument) from 0 to 16500 (last 2 arguments). If the price is less than or equal to the first generated breakpoint (3300), it will fall in this category, and so on. Thus, by using autocode(), we have instructed Stata to automatically generate the breakpoints that will yield evenly spaced intervals (in this case the same as our manually generated breakpoints) and then automatically assign the prices into the categories.

Let us save Case 2: groups the prices into “equally” sized groups for another post…

 

2 Responses

  1. The user-written procedure “katego”, by Andres Gonzales Rangel (see SSC), comes in handy. It generates the value labels automatically.

  2. Mitch:

    . gen price_cat1 = 1 + floor(price/3300)

    gets you there directly. ceil(price/3300) is also possible. Watch for small differences if values equal bin limits depending on whether you want < or <=. I'd argue, or agree, that multiplying by 3300 gives values that are nearer self-explanatory.

    One key detail about using floor() or ceil() directly is that it what happens at bin limits (do values equal to limits get classified down or up?) is explicit once you understand what those functions do. Otherwise you have to consult the documentation or even the code to be clear.

    For an accessible Tip on floor and ceiling, see http://www.stata-journal.com/sjpdf.html?articlenum=dm0002

Leave a Reply