Regression discontinuity design in Stata (Part 1)

The use of regression discontinuity design (RDD), introduced by Thistlethwaite and Campbell (1960), in evaluating the impacts of development programs has been growing. Lee and Lemieux (2010), Imbens and Lemieux (2007), and Cook (2008) provide comprehensive reviews of RDD and its applications in the social sciences. This post provides a summary. Part 2 compares user-written Stata estimation packages, and Part 3 discusses validation (falsification) tests.

Bootstrapping Gini

Income inequality in the Philippines, as measured by the Gini coefficient, declined from 46.05 to 44.84 between 2003 and 2009.[1] Is the observed difference in the Gini coefficient a real reduction in income inequality, or is it only due to sampling variation?

A friend asked me a related question a few weeks ago: did I know of a Stata command that tests the significance of the difference between two Gini coefficients? Too lazy to think about it, I shrugged the problem off with a ‘no’ and never bothered to search for a solution. The problem came back to me while reading the bootstrapping section of NetCourse 151, Lecture 3. Indeed, when I searched further, I found that a number of papers have used bootstrapping for this purpose. Examples include Mills and Zandvakili (1997) and Biewen (2002).[2][3]

Using Stephen Jenkins’s -ineqdeco- (“ssc install ineqdeco”) to calculate the Gini coefficients, I ran the following bootstrapping exercise on a hypothetical dataset:

/* Program to calculate difference in Gini coefficients */
cap program drop ginidiff
program ginidiff, rclass
    qui ineqdeco income if year == 1
    return scalar gini1 = r(gini)
    qui ineqdeco income if year == 2
    return scalar gini2 = r(gini)
    return scalar diff = return(gini1) - return(gini2)
end

/* Generate hypothetical dataset */
clear
set obs 200
gen year = 1 in 1/100
replace year = 2 in 101/200
set seed 56453
gen income = runiform()*1000 if year == 1
set seed 86443
replace income = runiform()*1000 if year == 2

/* Apply bootstrap */
assert income < .    /* Make sure no missing values */
set seed 873023
bootstrap r(diff), reps(1000) nodots : ginidiff

program drop ginidiff

Whether this straightforward application of bootstrapping is the best solution is another story (and the more important one). For a discussion of other proposed methods and of the limitations of bootstrapping in this context, see, for example, Palmitesta et al. (2000) and Van Kerm (2002).[4][5]

[1] Thanks to Shiel Velarde (World Bank Manila) for the Gini estimates. These numbers are based on the Family Income and Expenditure Survey (FIES).

[2] Mills, J. and Zandvakili, S. (1997). “Statistical Inference via Bootstrapping for Measures of Inequality”. Journal of Applied Econometrics 12 (2): 133–150. (Working Paper version can be downloaded here)

[3] Biewen, M. (2002). “Bootstrap inference for inequality, mobility and poverty measurement.” Journal of Econometrics 108 (2002): 317–342.

[4] Palmitesta, P. et al (2000). “Confidence Interval Estimation for Inequality Indices of the Gini Family.” Computational Economics

[5] Van Kerm, P. (2002). “Inference on inequality measures: A Monte Carlo experiment.” Journal of Economics 9(Supplement1): 283-306. (Working Paper version can be downloaded here)

Can’t get enough of factor variables

In relation to yesterday’s blog post, here are three links to useful sites/files that provide very good introductions to Stata’s factor variables:

Christopher Baum’s Factor Variables and Marginal Effects in Stata 11

Michael Mitchell’s Stata Tidbit of the Week – What the ##?

UCLA ATS’s What’s new in Stata 11: Factor variables, margins and interactions

Getting to know “factor variables”

This is an update to the earlier post i. without the prefix -xi-. So the i.’s (or “i options” as Joe Glass called them) have a name. Stata calls them “factor variables”, and there is more to them than i. alone. See -help fvvarlist- for the documentation and some very helpful examples.
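
As a quick illustration of what factor-variable notation offers beyond a plain i., here is a minimal sketch. It assumes Stata 11 or later and uses the auto dataset that ships with Stata, chosen here only for illustration. The ib#. operator changes the base category, and ## adds the main effects together with their interaction:

```stata
sysuse auto, clear

* ib3. makes rep78 == 3 the omitted (base) category instead of the lowest value
regress price ib3.rep78

* ## expands to both main effects plus their interaction;
* c. marks mpg as a continuous variable
regress price i.foreign##c.mpg
```

Both of these required the -xi- prefix (or hand-made variables) before factor variables were introduced.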

i. without the prefix -xi-

This blog (along with the pile of books to read, the GRE reviewers to go through, the TV shows to watch…this list will never end) has been neglected for more than a week. Anyway… here is i. .

i. is usually used with the -xi- prefix. But it may also be used without -xi-. i. allows you to include dummy variables for a categorical variable without explicitly generating new variables for each category.

For example:

sysuse auto, clear
reg price i.rep78

is the same as generating dummy variables for each of the n=5 categories of rep78 first and including n-1 of them in the regression:

sysuse auto, clear
tab rep78, gen(rep78_)
reg price rep78_2 rep78_3 rep78_4 rep78_5

By default, the category with the lowest value (in this case, rep78==1) is omitted, and no new variables are generated by the command above. Without the -xi- prefix, however, i. supports only one of the four kinds of dummy-variable creation that -xi- allows. With -xi-, it is possible to specify interactions directly and to choose the omitted category. See -help xi-.

Reading IMEUS

I am currently reviewing econometrics by reading Christopher Baum’s An Introduction to Modern Econometrics Using Stata.* Although linear regression is not discussed until chapter 4, chapters 1 to 3 (particularly chapter 3) are equally important (I finished chapter 3 last week…long way to go given my current rate of <1 chapter a week). After all, before anyone complicates their life with all those regressions, they need to be sure that the data are “clean” and structured in a way that fits the analysis required. In fact, I think the time spent on data management is far greater than the time spent on the actual data analysis.

Here are some of the things I remember about the first three chapters of the book. The discussion of Stata’s features in chapter 1 made me appreciate Stata even more. Chapter 2 outlines the basic tools one needs to learn to work efficiently with Stata. This chapter gave me very helpful tips on how to handle missing data and dates. And in chapter 3, I especially liked the section on data validation, which introduced me to the command -assert-. If only I had known -assert- when I was “cleaning” the QIDS dataset (7 years ago), it would have been easier to do all those consistency checks. Instead of running each line one by one, I could have run the whole do-file, and -assert- would have stopped it and let me know whenever a condition was not met. So much for regret…
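
To show the idea, here is a minimal sketch of -assert- used for consistency checks. It uses the auto dataset that ships with Stata as a stand-in (the QIDS data are not used here), and the particular checks are only illustrative:

```stata
sysuse auto, clear

* Each -assert- is silent when the condition holds for every observation
assert price > 0
assert inlist(foreign, 0, 1)
assert inrange(rep78, 1, 5) if !missing(rep78)

* This check fails because rep78 has missing values. In a do-file, a bare
* -assert- would stop execution here; -capture- lets the sketch continue
* and leaves the return code in _rc
capture assert !missing(rep78)
display "return code: " _rc
```

A nonzero return code means the check failed, and without -capture- the do-file halts at that line.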

On to the next chapter…

*Thanks to Ellen and Jo Cain for leaving me this book before they left for the US :)

Pairwise comparison of means

The -ttest- command allows comparison of means between groups. Its syntax is:

ttest varname [if] [in] , by(groupvar) [options1]

However, this only works if groupvar has exactly 2 distinct groups. What would you do if you have more than 2 groups and want to compare the means for each pairwise combination? This problem was presented to me recently. Since I am not aware of a single command that does this, the solution seems to be to loop over pairs of groups.

Below, I used student2.dta (used in Statistics with STATA: Version 12) to illustrate one way of solving the problem. In this example, I want to (1) test whether mean gpa is the same across students taking different majors, and (2) save the results I need into a tab-delimited text file. Since there are 7 groups in major (coded as 1,2,…,7), 21 pairs of means will be tested.

Since comparing the means of gpa for i=1 and j=2 is the same as comparing the means for i=2 and j=1, an -if- condition is specified so that these duplicates are excluded. The -makematrix- (Nick Cox) command produces a matrix of the results of the command specified after the “:”. Here, we have specified that it will keep 3 saved results from -ttest-: (1) the mean for the 1st group, (2) the mean for the 2nd group, and (3) the two-sided p-value. -matrix colnames-, on the other hand, sets the name of each column. If it is not specified, the default column names are the names of the saved results: “r(mu_1),” “r(mu_2),” and “r(p)”. Lastly, -mat2txt- (Michael Blasnik and Ben Jann) writes the matrix into a text file. Note that -mat2txt- needs to be installed. To install, type:

ssc install mat2txt
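
The pairwise loop described above can be sketched as follows. This is only a minimal sketch under stated assumptions: it fills the matrix with plain -matrix- commands rather than -makematrix-, it assumes student2.dta is in the working directory with major coded 1 to 7, and the output file name ttest_gpa.txt is hypothetical.

```stata
* Assumes student2.dta is in the working directory, with major coded 1,...,7
use student2, clear

* One row per pair: 7 groups give 7*6/2 = 21 rows
matrix results = J(21, 5, .)
matrix colnames results = major_i major_j mean_i mean_j p_value

local row = 0
forvalues i = 1/6 {
    * Start j above i so that each pair is tested only once
    local start = `i' + 1
    forvalues j = `start'/7 {
        local ++row
        quietly ttest gpa if inlist(major, `i', `j'), by(major)
        * Fill the row with the pair, the two group means, and the two-sided p-value
        matrix results[`row', 1] = (`i', `j', r(mu_1), r(mu_2), r(p))
    }
}

matrix list results

* Write the matrix to a text file (the file name is hypothetical)
mat2txt, matrix(results) saving(ttest_gpa.txt) replace
```

Starting the inner loop at i+1 plays the same role as the -if- condition mentioned above: it skips the duplicate (j,i) comparisons.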

Now, what if I need to test the means for more than 1 variable, say for both gpa and study?  I can just add another loop for this:
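
A minimal sketch of that extra loop, under the same assumptions as before (student2.dta in the working directory, major coded 1 to 7, plain -matrix- commands instead of -makematrix-). The matrix names res_gpa and res_study are hypothetical:

```stata
* Assumes student2.dta is in the working directory, with major coded 1,...,7
use student2, clear

foreach v of varlist gpa study {
    * One results matrix per variable, one row per pair of majors
    matrix res_`v' = J(21, 5, .)
    matrix colnames res_`v' = major_i major_j mean_i mean_j p_value

    local row = 0
    forvalues i = 1/6 {
        local start = `i' + 1
        forvalues j = `start'/7 {
            local ++row
            quietly ttest `v' if inlist(major, `i', `j'), by(major)
            matrix res_`v'[`row', 1] = (`i', `j', r(mu_1), r(mu_2), r(p))
        }
    }

    matrix list res_`v'
}
```
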

The code above tests for means of gpa and study between each pair of groups in major.

I am sure there is a shorter and better way to do this. Until I find that solution, I will have to bear with what I have come up with.

Note: The code above is “wrapped”. If the line is long, its continuation is indented in the next line. Thanks to Cuong Nguyen for giving me the opportunity to learn something new over the weekend.

-logit- and -logistic-, what’s the difference

Both -logit- and -logistic- are used to estimate binary logistic regression models. Their difference lies in what they report: -logit- reports coefficients, while -logistic- reports odds ratios. Since an odds ratio can be computed from the corresponding coefficient, b, as e^b, the choice between the two is just a matter of preference. To illustrate, I have posted below the results of -logit- and -logistic- using womenwk.dta, which is used in Christopher Baum’s An Introduction to Modern Econometrics Using Stata.
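
For readers without womenwk.dta at hand, the same comparison can be sketched with the auto dataset that ships with Stata. The dataset and the model (foreign on mpg and weight) are stand-ins chosen for illustration, not the original example:

```stata
sysuse auto, clear

* -logit- reports coefficients (log odds)
logit foreign mpg weight

* Exponentiating a coefficient gives the corresponding odds ratio
display exp(_b[mpg])

* -logistic- fits the same model but reports odds ratios
logistic foreign mpg weight

* -logit- with the or option also reports odds ratios
logit foreign mpg weight, or
```

The value shown by -display- should match the odds ratio for mpg in the -logistic- output.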

-tabout- and -svy-

Yesterday, I was trying to create tables from a survey dataset. Given the number of variables (and the possibility that I will repeat the same process many more times), doing it by hand (i.e., copy-pasting results from -tab- to Excel) was not an option. For this task, I turned to Ian Watson’s -tabout-. This is probably the best user-written Stata command for creating very neat tables and exporting them into text files (that spreadsheets, such as Excel, can read). Since I am using a complex survey dataset, I checked whether -svy- allows -tabout-, i.e., whether I can write something like: svy, subpop(var): tabout vars…

This is not possible. But -tabout- has the svy option that makes use of the  survey design variables specified in -svyset-. First problem solved.

My second problem was how to generate estimates for subsamples using -tabout-. In Use subpop() to generate subsample estimates using a survey data, we said that the subpop() option of -svy-, not the -if- qualifier, provides the correct standard errors. But -tabout- does not have a subpop option, only the -if- and -in- qualifiers. Fortunately, when the svy option is used in -tabout-, the -if- and -in- qualifiers work the same way as the subpop option (see note below). Second problem solved.

To install -tabout-, type: ssc install tabout

The -tabout- command (how to use it, problems and errors in using it, etc.) is well discussed on Statalist. The best way to start learning about -tabout- is by reading Publication quality tables in Stata: a tutorial for the tabout program (Watson 2007).

Note: Thanks to Ian Watson for pointing out footnote #3 (which I had missed) in Publication quality tables in Stata: a tutorial for the tabout program, page 3.

Use subpop() to generate subsample estimates using a survey data

Suppose you have complex survey data and you want to generate estimates for a specific subgroup, say females (coded as female==1). The -if- qualifier seems like the obvious choice to exclude the male population (female==0):

svy: tab agegroup if female==1, ci

Unfortunately, this is not correct. The correct way of generating estimates for subpopulations is to use the subpop() option of -svy-. The difference lies in how Stata treats the excluded category when calculating the standard errors. With subpop(), the excluded cases (in our example, males) are still included in the calculation of the standard errors, as they should be. Thus:

svy, subpop(female): tab agegroup, ci

For the math of all of these, see Stata’s Survey Data Reference Manual: subpopulation estimation (pp. 53-58, Stata 11 documentation). I also find section 4 of Jeff Pitblado’s Survey Data Analysis in Stata (2009) helpful.
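
To try the two approaches side by side, here is a minimal sketch using the nhanes2 example dataset from Stata’s website. It assumes internet access for -webuse- and that the dataset comes with its survey design already declared; the variables agegrp and female are from that dataset, not from the example above:

```stata
webuse nhanes2, clear

* Show the survey design settings stored with the dataset
svyset

* -if- drops the excluded cases before the variance is computed
svy: tabulate agegrp if female == 1, ci

* subpop() keeps the full design and estimates within the subpopulation
svy, subpop(female): tabulate agegrp, ci
```

The point estimates are the same in both runs. Compare the reported numbers of strata and PSUs and the confidence intervals; these can coincide when every PSU contains members of the subpopulation, but subpop() is the approach that remains valid in general.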