_n, its big brother _N, and Super -bysort-


_n and _N are Stata system variables—they exist whether you like them or not. They are also referred to as “underscore variables” for the obvious reason that they are written as _variable. Little  _n contains the line number of the current observation, while its big brother _N contains the total number of observations in the data.

_n is often used to generate unique codes for each observation:

gen code=_n          /* generates the variable code that contains the integer 1 (for the first obervation, _n==1) to _N (for the last observation, _n==_N) */

Or to refer to neighboring observations (also called subscripting):

gen gdplag=gdp[_n-1]          /* generates the variable gdplag, which is equal to the preceding observation’s gdp */

gen gdpgrowth=(gdp /gdp[_n-1] – 1)*100         /* generates the variable gdpgrowth, the growth rate for the variable gdp */

[Note: The gdplag and gdpgrowth for the first observation (_n==1) will be missing since observation [_n-1] does not exist for _n==1.]

Make sure, however, that you refer to the right neighbor! For example, if you are calculating the growth rate of variable gdp between 1999 and 2000, gdp must be in order such that the gdp subscripted by [_n-1] is the gdp for 1999. This is easily addressed by invoking the -sort- command, “sort year,” before generating the growth rate variable. There is another complication, however, when you are calculating this for different groups of observations, say by country. Will “sort country year” before generating the variable suffice? No. Why? Because the [_n-1] for the first observation of country B refers to the last observation of country A. Here is where Super -bysort- comes to the rescue:

bysort country year: gen gdplag=gdp[_n-1] 

bysort country year: gen gdpgrowth=(gdp /gdp[_n-1] – 1)*100       

// Another syntax for bysort is: by country (year), sort: … 

_n may also be used to keep the nth observation by group:

bysort householdid: keep if _n==1         /* keeps the first observation for each householdid */

Big brother _N, on the other hand, may be used to generate  a variable that contains the number of observations by group:

bysort householdid: gen householdsize=_N         /* generates the variable householdsize, which is equal to the number of observations  for each householdid. */

What we have illustrated above are just a few examples to showcase the potential of underscore variables _n and _N. For sure, you will find other uses of _n and _N. Another underscore variable is the beautiful number π, which, as you would’ve guessed, is written as _pi.

One Response

  1. Hi statadaily, i think you can also try

    sort year
    tsset gdp /*this will identify gdp as a time series variable */
    gen gdp_previous=l.gdp

Leave a Reply