Regression discontinuity design in Stata (Part 1)


Regression discontinuity design (RDD), introduced by Thistlethwaite and Campbell (1960), is increasingly used to evaluate the impacts of development programs. Lee and Lemieux (2010), Imbens and Lemieux (2007), and Cook (2008) provide comprehensive reviews of the design and its applications in the social sciences. This post summarizes the main ideas. Part 2 compares user-written Stata estimation packages, and Part 3 discusses validation or falsification tests.

RDD is a quasi-experimental method for evaluating program impact when observation units (for example, households) can be sorted along a continuous metric (for example, income), called the running or assignment variable, and program assignment is based on a predetermined threshold or cutoff on that variable. Observations just below the cutoff are deemed similar to, and therefore a good comparison group for, those just above it. In the absence of the program, one would expect the outcome variable to change smoothly with small changes in the running variable. A large jump in the outcome observed precisely at the cutoff after program intervention can therefore be attributed to the program itself.

Among the advantages of RDD is that its validity rests on weaker assumptions than other non-experimental impact evaluation methods require, as shown by Hahn, Todd, and van der Klaauw (2001).

The main caveat is that program impact is estimated locally, using observations very close to the cutoff, so the generalizability of the estimated effect is limited. While evaluation results from RDD have strong internal validity, considered by many as second only to that of a randomized controlled trial (RCT), their external validity is limited to observation units near the eligibility threshold.

RDD can be characterized as estimating whether an outcome variable exhibits a discontinuous jump precisely at the cutoff of the running variable. The magnitude of the jump may be estimated using a local regression that limits the observations to a specified bandwidth around the cutoff, within which the functional form is approximately linear. The figures below illustrate a local linear regression RDD before and after program participation on simulated data within a specified bandwidth, h. In the right panel, the discontinuous jump at the cutoff, tau, is the estimated program impact.

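[Figure: two panels, "Before program participation" (left) and "After program participation" (right), each plotting the outcome variable (Y) against the assignment variable (X)]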

Drawing the graphs above:
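// Before program participation
clear                           // start from an empty dataset so -set obs- works
set seed 2
set obs 1000
range x_obs -1 1 1000           // running variable, evenly spaced on [-1, 1]
g y_pre = x_obs^3 + rnormal()   // smooth outcome, no jump at the cutoff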

tw scatter y_pre x_obs, msize(small) mcolor(gs10) ///
    || lfit y_pre x_obs if inrange(x_obs, -.35, .35), ///
        range(-.35 .35) lcolor(black) lw(thick) ///
        xline(0, lpattern(-)) ///
        yt("Outcome variable (Y)") ///
        xt("Assignment variable (X)") ///
        t("Before program participation") ///
        legend(off)  ///
        xline(-.35 .35, lp(-) lc(gs10)) ///
        text(-4.5 -.35 "{it:-h}") ///
        text(-4.5 +.35 "{it:h}")

// After program participation 
set seed 2
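local tau = 1.25
cap drop y_post
// Same smooth outcome as before, plus a jump of tau at and above the cutoff
g y_post = x_obs^3 + rnormal()
replace y_post = y_post + `tau' if x_obs >= 0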

tw scatter y_post x_obs, msize(small) mcolor(gs10) yl(-4(2)4) ///
     || lfit y_post x_obs if inrange(x_obs, -.35, 0), ///
        range(-.35 0 ) lcolor(black) lw(thick) ///
     || lfit y_post x_obs if inrange(x_obs, 0, .35), ///
        range(0  .35 ) lcolor(black) lw(thick) ///
        xline(0, lpattern(-)) ///
        yt("Outcome variable (Y)") ///
        xt("Assignment variable (X)") ///
        t("After program participation") ///
        legend(off) ///
        text(.9 0.05 "{&tau}", size(*2)) ///
        xline(-.35 .35, lp(-) lc(gs10)) ///
        text(-4.5 -.35 "{it:-h}") ///
        text(-4.5 +.35 "{it:h}")
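
The jump drawn in the right panel can also be estimated numerically. The following is a minimal sketch using only built-in commands, not any of the RD packages. It assumes a sharp design with the cutoff at 0, the hand-picked bandwidth h = 0.35 from the graphs, and a uniform kernel (every observation inside the bandwidth gets equal weight). The variable name treat is introduced here for illustration.

```stata
// Sketch: estimate tau with a local linear regression (base Stata only)
clear
set seed 2
set obs 1000
range x_obs -1 1 1000

local tau = 1.25                      // true jump built into the simulation
local h   = 0.35                      // bandwidth, chosen by hand (assumption)
g byte treat = x_obs >= 0             // sharp assignment at the cutoff 0
g y_post = x_obs^3 + `tau'*treat + rnormal()

// Separate intercepts and slopes on each side of the cutoff,
// using only observations with |x_obs| <= h
regress y_post i.treat##c.x_obs if abs(x_obs) <= `h', vce(robust)

// The coefficient on 1.treat is the estimated jump at x_obs = 0
display "Estimated tau = " _b[1.treat] "  (SE = " _se[1.treat] ")"
```

Because the running variable is centered at the cutoff, the coefficient on 1.treat is the difference between the two fitted lines at x_obs = 0, which is the quantity labeled tau in the graph.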

How should one select the bandwidth, h, within which to estimate tau? The choice is a tradeoff between bias and variance. A narrow bandwidth has lower bias, because the observations used are close to the cutoff, where a linear approximation to the functional form is more accurate, but higher variance, because fewer observations are used. A wide bandwidth has the opposite properties: more observations reduce the variance, but observations far from the cutoff increase the bias. An optimal h balances the two. Bandwidth selection procedures have been proposed by, among others, Imbens and Kalyanaraman (2012), Calonico, Cattaneo, and Titiunik (2014), and Ludwig and Miller (2007), which we will refer to as the IK, CCT, and CV (cross-validation) bandwidths, respectively.
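
To see the tradeoff on the simulated data, one can re-estimate the jump over a grid of bandwidths. This is only an illustrative sketch: the grid values are arbitrary choices, not IK, CCT, or CV bandwidths, and the regression uses a uniform kernel with built-in commands.

```stata
// Sketch: how the estimate and its standard error move with the bandwidth
clear
set seed 2
set obs 1000
range x_obs -1 1 1000
g byte treat = x_obs >= 0
g y_post = x_obs^3 + 1.25*treat + rnormal()   // true jump is 1.25

// Hand-picked grid of bandwidths (assumption, for illustration only)
foreach h of numlist 0.1 0.2 0.35 0.5 1 {
    quietly regress y_post i.treat##c.x_obs if abs(x_obs) <= `h', vce(robust)
    display "h = " %4.2f `h' "   N = " %4.0f e(N) ///
        "   tau_hat = " %6.3f _b[1.treat] "   SE = " %6.3f _se[1.treat]
}
```

The narrowest bandwidth uses the fewest observations and so should show the largest standard error, while the widest one fits a straight line to a cubic over the full range, which is where bias from the linear approximation enters.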

In Stata, there are at least three user-written RD estimation packages: (1) Austin Nichols’s -rd- (ssc install rd); (2) CCT’s -rdrobust- (ssc install rdrobust); and (3) Boris Kaiser’s -rdcv- (ssc install rdcv). A comparison of these will be presented in Part 2.
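
As a preview, a minimal call to one of these packages on the simulated data might look like the sketch below. It assumes -rdrobust- has already been installed from SSC and uses only its c() and h() options. The output will depend on the installed version, so no results are shown here.

```stata
// Sketch: same simulated data, estimated with -rdrobust-
// Requires a one-time install: ssc install rdrobust
clear
set seed 2
set obs 1000
range x_obs -1 1 1000
g y_post = x_obs^3 + 1.25*(x_obs >= 0) + rnormal()

// c(0) sets the cutoff; the bandwidth is left to the package's
// default data-driven selector
rdrobust y_post x_obs, c(0)

// Fixing the bandwidth by hand instead, to compare with h = 0.35
rdrobust y_post x_obs, c(0) h(0.35)
```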

 

Note: The discussion above is recycled from what I wrote for the report “Keeping children healthy and in school: Evaluating the Pantawid Pamilya Using Regression Discontinuity Design” (2014). The full report was written with Dr. Babes Orbeta, Mico del Mundo, Melba Tutor, Mai Valera, and Dama Yarcia. We benefited a lot from Matias Cattaneo (University of Michigan), who provided a short course on regression discontinuity here in Manila in September 2014 through the Asian Development Bank, and from Jed Friedman (World Bank) during the technical review sessions.

 

References:

Bloom, Howard. 2012. Modern Regression Discontinuity Analysis. Journal of Research on Educational Effectiveness, 5(1):43-82.

Calonico, S., M. D. Cattaneo, and R. Titiunik. 2014. Robust Nonparametric Confidence Intervals for Regression-Discontinuity Designs. Econometrica, 82(6):2295–2326.

Calonico, Sebastian, Matias Cattaneo, and Rocio Titiunik. 2014b. Robust data-driven inference in the regression discontinuity design. The Stata Journal, vv(ii): 1–36.

Cook, Thomas. 2008. Waiting for Life to Arrive: A history of the regression-discontinuity design in Psychology, Statistics and Economics. Journal of Econometrics, 142(2): 636–654.

Hahn, Jinyong, Petra Todd, and Wilbert van der Klaauw. 2001. Identification and Estimation of Treatment Effects with a Regression Discontinuity Design. Econometrica, 69(1): 201–209.

Imbens, Guido and Karthik Kalyanaraman. 2012. Optimal bandwidth choice for the regression discontinuity estimator. Review of Economic Studies, 79: 933–959.

Imbens, Guido and Thomas Lemieux. 2007. “Regression Discontinuity Designs: A Guide to Practice.” NBER Working Paper 13039. http://nber.org/papers/w13039

Lee, David S., and Thomas Lemieux. 2010. “Regression Discontinuity Designs in Economics.” Journal of Economic Literature, 48(2): 281–355. http://www.aeaweb.org/articles.PhP?doi=10.1257/jel.48.2.281

McCrary, Justin. 2008. Manipulation of the Running Variable in the Regression Discontinuity Design: A Density Test. Journal of Econometrics, 142(2): 698–714.

Nichols, Austin. 2011. rd 2.0: Revised Stata module for regression discontinuity estimation. http://ideas.repec.org/c/boc/bocode/s456888.html

Thistlethwaite, D.L., and Campbell, D.T. 1960. Regression-Discontinuity Analysis: An Alternative to the Ex Post Facto Experiment. Journal of Educational Psychology, 51: 309–317.

5 Responses

  1. the following command may be wrong: g y_post= x +`tau’ + rnormal() if x=0. There are no observations with x==0.

  2. Good job! Looking foward to part II!

  3. waiting for part II of RD; your work through is contribution is amazing!!!

    All the best

  4. Can you explain in detail how to get RD estimates in stata?
