Sometimes we want to break a continuous variable into a smaller set of categories—into evenly spaced or equally sized groups, or groups based on limits we specify, or groups based on another variable or a set of variables.

Let us take for example the variable

`price`

of cars in *auto.dta*.

sysuse auto.dta, clear // open a Stata built-in data summ price

Suppose we want to break

`price`

into 5 categories according to different ways of grouping: 1 evenly spaced, another equally sized, and so on. We create the following variables: `price_cat1, price_cat2, ..., price_cat4.`

`price_cat1`

groups the prices into evenly spaced intervals, say, by 3300.`price_cat2`

groups the prices into 5 equally sized groups.*Equally*is used loosely here. In our example data there are 74 cars, 1 group will definitely have 1 less car in it.`price_cat3`

groups the prices into categories bounded to the left by 0, 5000, 8000, 10000. Why those boundaries? Because I can… :p`price_cat4`

groups the prices based on the variable`rep78`

.

This is a walk in the park if you have already mastered the

`generate`

and `replace`

commands and the qualifier `if`

. But, even better (a walk in the park with ice cream and a puppy), are the shortcuts. For each case, we first use `if`

followed by shorter ways to get the same result.*** Case 1: groups the prices into evenly spaced intervals * using the`if`

qualifier gen price_cat1 = . replace price_cat1 = 1 if price>=0 & price<3300 replace price_cat1 = 2 if price>=3300 & price<6600 replace price_cat1 = 3 if price>=6600 & price<9900 replace price_cat1 = 4 if price>=9900 & price<13200 replace price_cat1 = 5 if price>=13200 & price<16500 . tab price_cat1, m price_cat1 | Freq. Percent Cum. ------------+----------------------------------- 1 | 2 2.70 2.70 2 | 55 74.32 77.03 3 | 7 9.46 86.49 4 | 6 8.11 94.59 5 | 4 5.41 100.00 ------------+----------------------------------- Total | 74 100.00 * using the`egen`

function`cut()`

drop price_cat1 egen price_cat1 = cut(price), at(0(3300)16500)

The generated codes for the categories correspond to the left ends of the intervals. In the

`at()`

option, you may also list of breakpoints separated by spaces or commas instead of using the interval syntax for numlist, e.g. `at(0 3300 6600 9900 13200 16600)`

. Adding the option `icodes`

will generate codes 0,1,..,5 instead of the left ends.. tab price_cat1, m price_cat1 | Freq. Percent Cum. ------------+----------------------------------- 0 | 2 2.70 2.70 3300 | 55 74.32 77.03 6600 | 7 9.46 86.49 9900 | 6 8.11 94.59 13200 | 4 5.41 100.00 ------------+----------------------------------- Total | 74 100.00

That’s it! The 6 lines we wrote earlier can be written concisely in 1 line. Be careful however with the breakpoints. In our example above, the intervals are inclusive of the left end, i.e., from

**a**up to (but excluding

**b**) or, in set notation,

**[a,b)**. In some cases, you may want that the interval is inclusive of the right end,

**(a,b]**. In the last case and where there are observations exactly at the breakpoints,

`egen cut()`

could be erroneous. We can use instead the function `recode()`

or `autocode()`

for the case **(a,b]**.

* using`recode()`

or`autocode()`

function gen price_cat1b = recode(price, 3300, 6600, 9900, 13200, 16500) gen price_cat1c = autocode(price, 5, 0, 16500) . tab price_cat1b, m price_cat1b | Freq. Percent Cum. ------------+----------------------------------- 3300 | 2 2.70 2.70 6600 | 55 74.32 77.03 9900 | 7 9.46 86.49 13200 | 6 8.11 94.59 16500 | 4 5.41 100.00 ------------+----------------------------------- Total | 74 100.00 . tab price_cat1c, m price_cat1c | Freq. Percent Cum. ------------+----------------------------------- 3300 | 2 2.70 2.70 6600 | 55 74.32 77.03 9900 | 7 9.46 86.49 13200 | 6 8.11 94.59 16500 | 4 5.41 100.00 ------------+----------------------------------- Total | 74 100.00

Both commands generate the same results (only because we want them to). In

`recode()`

, Stata compares the first argument (`price`

in our case) with the second argument (3300, the first right end). If the price is less than or *equal to*3300 it will return 3300. If not, then Stata compares the price with the third argument (the next boundary), and so on. Using the longhand with the

`if`

qualifier, this is equivalent to:gen price_cat1b = 3300 if price<=3300 replace price_cat1b = 6600 if price>3300 & price<=6600 replace price_cat1b = 9900 if price>6600 & price<=9900 . . .

`In autocode()`

, Stata breaks `price`

(first argument) into 5 evenly spaced groups (second argument) from 0 to 16500 (last 2 arguments). If the price is less than or equal to the first generated breakpoint (3300), it will fall in this category, and so on. Thus, by using `autocode()`

, we have instructed Stata to automatically generate the breakpoints that will yield evenly spaced intervals (in this case the same as our manually generated breakpoints) and then automatically assign the prices into the categories.Let us save

*Case 2: groups the prices into “equally” sized groups*for another post…

Filed under: Basic functions Tagged: | categories, cut, egen, groups, recode

Aldo Benini, on 9 January 2015 at 12:21 AM said:The user-written procedure “katego”, by Andres Gonzales Rangel (see SSC), comes in handy. It generates the value labels automatically.

Nick Cox, on 7 January 2015 at 8:44 PM said:Mitch:

. gen price_cat1 = 1 + floor(price/3300)

gets you there directly. ceil(price/3300) is also possible. Watch for small differences if values equal bin limits depending on whether you want < or <=. I'd argue, or agree, that multiplying by 3300 gives values that are nearer self-explanatory.

One key detail about using floor() or ceil() directly is that it what happens at bin limits (do values equal to limits get classified down or up?) is explicit once you understand what those functions do. Otherwise you have to consult the documentation or even the code to be clear.

For an accessible Tip on floor and ceiling, see http://www.stata-journal.com/sjpdf.html?articlenum=dm0002