Regression discontinuity design in Stata (Part 1)


There has been a growing use of regression discontinuity design (RDD), introduced by Thistlewaite and Campbell (1960), in evaluating impacts of development programs. Lee and Lemieux (2010), Imbens and Lemieux (2007), and Cook (2008) provide comprehensive reviews of regression discontinuity design and its applications in the social sciences. This provides a summary. In Part 2, a comparison of user-written Stata estimation packages is provided. In Part 3, validation or falsification tests are discussed.
Continue reading

Using Stata to make sense of my Uber data


I tried Uber in late May and since then it has been 131 Uber rides covering 1,200 kilometers and 80 hours on the road. Uber (and GrabTaxi) has eliminated the wait under the heat (and rain) and the dealing with the assholeness of most taxi drivers here in Metro Manila. But what I love most about Uber, apart from their customer service, is the data they send. Trip receipts are automatically sent as soon as the trip has ended. These do not only show how much I am charged but include time, distance, fare disaggregated by time and distance, and many more. GrabTaxi receipts, on the other hand, only show amount paid and manually encoded by drivers.
Continue reading

The adventures of tin()


Using the if qualifer with time-series data is tricky. Until you meet tin(). Let us use quarterly German macro data, lutkepohl2, from Stata website to illustrate.
Continue reading

Fun with maps in Stata


Today I was toying with Kevin Crow’s -shp2dta- and Maurizio Pisati’s -spmap-. -shp2dta- transforms spatial datasets in shapefile format into Stata data format and -spmap- graphs data onto maps. For starters, I browsed through Stata’s FAQ on -spmap- and Friedrich Huebler’s blog post “Guide to creating maps with Stata”, where instructions are clearly detailed. And, of course, I also checked out -help spmap- and -help shp2dta-.

It is surprising to find how easy it is to represent data onto maps using Stata. I know people who hire GIS specialists or buy expensive GIS software to draw the simplest of maps while they are armed with Stata. I have nothing against GIS experts or GIS software packages. Sure they are needed for some sophisticated GIS applications. But for very basic maps, I think -spmap- is more than enough.

The availability of basic geo-referenced database is also becoming less and less of a constraint. Global Administrative Areas (GADM), for example, provides free access to GIS datasets for almost all countries. PhilGIS* also provides free access to GIS data on the Philippines.

Using free shapefile datasets from GADM and provincial poverty data** from the Philippine National Statistics Coordination Board (NSCB), here is my first map:



The map shows poverty incidence in the Philippines by province for 2003 and 2009. Each shade of red represents a quartile (the default), where the darkest shade corresponds to the provinces with the highest incidence rates.

Here’s how I created the map:
cd c:/map    /* Change the directory to where the shapefiles are */

/* Convert shapefiles into Stata data */
shp2dta using PHL_adm1, database(phdb) coordinates(phxy) ///
    genid(id) replace

/* This created 2 Statadata files phdb.dta (master dataset) 
and phxy.dta (basemap dataset), where the variable id in the master
dataset corresponds to the variable _ID in the basemap dataset. */

use phdb.dta, clear /* Load the master dataset */
merge 1:1 ID_1 using prov.dta 
             /* Merge poverty data into the masterfile */
assert _merge==3 /* Make sure all provinces are merged */
drop _merge

/* Draw the maps */
local year 2003 2009
foreach y of local year{
    cap graph drop pov`y'
    spmap pov`y' using phxy.dta, ///
    id(id) fcolor(Reds)  ///
    title("Poverty Incidence `y'") ///
    name(pov`y') nodraw
    }
graph combine pov2003 pov2009

  • PhilGIS has incorporated Philippine Standard Geographic Classification (PSGC) codes into the barangay (village) level dataset. This is very helpful. I have not used PhilGIS data yet because Stata returns the following error when I apply -shp2dta- using the data from PhilGIS:

st_addvar():  3300  argument out of range
read_dbf():     –  function returned error
<istmt>:     –  function returned error

I am waiting for a response from Stata Technical support. A previous Statalist post suggests that the error has something to do with strings exceeding the limit.

** I had to create a Stata data file out of the table in NSCB’s website and add the unique IDs for each province so that I can merge it later with the ‘master’ dataset. Data for Surigao del Norte and Maguindanao were used for Dinagat Islands and Shariff Kabunsuan, respectively, as they were not reported separately for these newly created provinces.

Functions and marker symbols


Stata’s -graph twoway function- draws the line plot of a specified function. Example, the half of a parabola with the equation y = x^2 can be drawn by typing:

tw function y = 4*x^2


The default is that the function is drawn over the range [0,1]. If you want to draw the other side of the parabola or change the range, you can specify the range as follows:

tw function y = 4*x^2, range(-2 2)



It is easy to change the line attributes of the plot. For example, if you want to change the color, width, and pattern of the line use lcolor, lwidth, and lpattern options:

tw function y = 4*x^2, range(-2 2) lcolor(red) lwidth(medthick) lpattern(-)



These options are particularly helpful when you have many functions to plot. For example:

tw function y = x^2, range(-2 2) || ///
function y = x^3, range(-2 2) lpattern(-) || ///
function y = x^4, range(-2 2) lpattern(.-) ///
legend(label(1 “y = x{sup:2}”) ///
label(2 “y = x{sup:3}”)  ///
label(3 “y = x{sup:4}”) ///
cols(3) pos(5) ring(0) region(lcolor(none)))


[Note: For more text in graph options (e.g. bold, Greek letters, font type), see -help graph text-. See also Writing Greek letters and other symbols in graphs.]

But suppose you prefer to use marker symbols rather than line patterns to differentiate the line plots, how can you specify this option? Use the recast option (-help advanced_options-):

tw function y = x^2, range(-2 2) recast(connected) msymbol(O) || ///
function y = x^3, range(-2 2) recast(connected) msymbol(T) n(20) || ///
function y = x^4, range(-2 2) recast(connected) msymbol(S) n(20)  ///
legend(label(1 “y = x{sup:2}”) ///
label(2 “y = x{sup:3}”)  ///
label(3 “y = x{sup:4}”) ///
cols(3) pos(5) ring(0) region(lcolor(none)))


The recast option will tell Stata to treat the plot as a new plot. In our case, we specified that the new plot type is a connected graph, for which you can specify marker symbols. What is “n(#)” for? This tells Stata to draw the plot at 20 points. If this were not specified, as in the plot for y = x^2, the plot will be connected by 300 (the default) markers — ugly.

Combining charts


You can combine separate Stata graphs into one graph by using -combine-. We will illustrate how this works using census.dta included in Stata. (The not-so-neat) Figure 1 below is the combination of three separate graphs. How was Figure 1 created?



sysuse census.dta

/* First, create separate graphs and and store each of them into memory by using the name() option. Note that name() and saving() options are two different things. saving() saves the graph permanently (of course, until you erase it!) into a disk, while name() temporarily saves the graph into memory. */

/*-nodraw- supresses the display of the individual graphs. */

graph bar marriage, over(region) title({bf:A}, size(huge) ring(0) pos(1)) name(panela, replace) nodraw

tw scatter marriage divorce, title({bf:B}, size(huge) ring(0) pos(1))  name(panelb, replace) nodraw

graph hbar pop, over(state2) title({bf:C}, size(huge) ring(0) pos(1)) name(panelc, replace) nodraw

/* Now, we combine the graphs. */

graph combine panela panelb, cols(1)


/* We can also change the number of columns or add an overall title for the combined graph… */

graph combine panela panelb, cols(2) title(“Senseless Graphs“)


/* …or overall y- and x- axis titles. The code below draws figure 1. */

graph combine panela panelb panelc, cols(2) l1title(“Left-side title”) b1title(“Bottom title”) title(“This is Figure 1”)

/* There are other options for combine-, see “help graph_combine”. Also,the x- and y- labels and titles (particularly of panel C in figure 1) can be formatted before they are combined to make them look cleaner and prettier. */

Graph Recorder: how it works


The graph recorder records all edits that you make in the graph editor so that you can apply the same edits on other charts without manually going through the same process over and over again.

Once you click the Start Recording button , all the clicking and typing that you do within the graph editor is saved until you end the recording by clicking the same button. It is easy: Start-Name-Play. To see how this works, click the following link: Recording illustration

There are two other ways to replay a saved recording. One is to play the recording using the command line by typing “graph play recordingname“. This will apply the recorded edits to the current graph. Another is to add the play() option in your graph command. For example, to replay auto:

sysuse auto
tw scatter price mpg
graph play auto

OR

sysuse auto
tw scatter price mpg, play(auto)

 


I learned about Jing and Screencast.com through this blog post: Essiential productivity tools for the academic (Brian Perron).

The order() suboption makes label() redundant


Before, I thought that the order() suboption for graphic legend() option is only used to specify which legend keys will be shown and in which order you want them to be displayed. But recently I learned that it also makes the label() suboption redundant. To illustrate, we use census.dta which is installed with Stata.

sysuse census.dta, clear

#delimit ;
tw (scatter  marriage divorce, mlabel(state2))
(lfit marriage divorce),
ytitle(“Number of marriages”)
ylabel(, angle(0))
legend(label(2 “Fitted regression line”) order(2) ring(0) pos(5))
;
#delimit cr

#delimit ;
tw (scatter  marriage divorce, mlabel(state2))
(lfit marriage divorce),
ytitle(“Number of marriages”)
ylabel(, angle(0))
legend(order(2 “Fitted regression line”) ring(0) pos(5))
;
#delimit cr

Both commands above will result in the same graph (see below). The legend key for the first- tw- plot, which by default is the y-variable name or label (if exists),  is not displayed.



Another advantage of using order() over label() is that label(), by default, is limited to hold 15 keys.


Thanks to Derek Wagner of Stata for his reply to our query on maximum number of keys using label() and for pointing out the capabilities of order().
Before, I thought that the order() suboption for graphic legend() option is only used to specify which legend keys will be shown and in which order you want them to be displayed. But recently I learned that it also makes the label() suboption redundant. To illustrate, we use census.dta which is installed with Stata. 

sysuse census.dta, clear

#delimit ;
tw (scatter  marriage divorce, mlabel(state2))
(lfit marriage divorce),
ytitle(“Number of marriages”)
ylabel(, angle(0))
legend(label(2 “Fitted regression line”) order(2) ring(0) pos(5))
;
#delimit cr

#delimit ;
tw (scatter  marriage divorce, mlabel(state2))
(lfit marriage divorce),
ytitle(“Number of marriages”)
ylabel(, angle(0))
legend(order(2 “Fitted regression line”) ring(0) pos(5))
;
#delimit cr

Both commands above will give the same graph.

Another advantage of using order() over label() is that label(), by default, is limited to hold 15 keys.

______________________

Thanks to Derek Wagner of Stata for his reply to our query on maximum number of keys using label().

Writing Greek letters and other symbols in graphs


Greek letters, math symbols, and other symbols (such as Copyright and Trademark symbols) can be incorporated in Stata graphs with the use of SMCL tags. SMCL (Stata Markup and Control Language; pronounced as “smickle”) is used to modify all text output in Stata.

The tag for Greek letters or symbols are of the form {&name}. To illustrate, we will use auto.dta:


#delimit ;

sysuse auto;  /* see note below */

sum price, meanonly;
local p=r(mean);
sum mpg, meanonly;
local m=r(mean);

tw scatter price mpg, xline(m') yline(p’) note(“The verical and horizontal lines correspond” “to {&mu}{subscript:mpg} and {bf:{&mu}{subscript:price}}, respectively.”, pos(2) ring(0)) size(vlarge);

#delimit cr




{&mu} –> lower case Greek letter mu
{subscript:mpg} –> display mpg as a subscript
{bf:{&mu}{subscript:price}} –> display mu with subscript price as bold

For the complete list of symbols, type: “help graph_text”.

Note: -sysuse- is used to to load example datasets that are installed with Stata. To list all data  installed with Stata, type: “sysuse dir”.

Drawing scatter plots


The graph command -twoway scatter- (or -tw scatter-) draws scatter plots. Here we draw the scatter plots of the share of electronics in total export of the Philippines and Malaysia over time.

In Figure 0, we draw a scatter plot of Philippine export shares in electronics. If not specified, Stata will use default options for marker color, size, and shape, axis titles, etc.

tw scatter  exportshare year if reporter==”Philippines”, title(“Figure 0”)

In Figure 1, we specified marker and axis options and added a note.

#delimit ;
tw (scatter  exportshare year if reporter==”Philippines”,
ytitle(“Export Share in Electronics* %”) ylabel(0(10)60, angle(0))
xtitle(“”) xlabel(1965(10)1995 2007)
title(“Figure 1”)
note(“*SITC 2-digit category 77”)
;

In Figure 2, we added Malaysia’s export shares for the same period. By default, Stata will use new colors for added graphs unless this is specified. Here we moved the title to the bottom of the chart by using the suboption pos, which follows clock positions. The default for the title is at 12 o’clock position, pos(12); but we moved it to the 6 o’clcok position, pos(6). We also added the legend() option.

#delimit ;
tw (scatter  exportshare year if reporter==”Philippines”,
msize(large) mlabel(year))
(scatter  exportshare year if reporter==”Malaysia”,
msize(large) mlabel(year)),
ytitle(“Export Share in Electronics* %”) ylabel(0(10)60, angle(0))
xtitle(“”) xlabel(“”)
title(“Figure 2”, pos(6))
note(“*SITC 2-digit category 77”)
legend(label(1 “Philippines”) label(2 “Malaysia”))
;

In Figure 3, we specified marker colors and labels, draw lines connecting the dots (connect), and changed the position of marker labels (mlabpos). We also made changes on the legend(): cols(1) tells Stata to present the legend in 1 column, pos(9) moves the legend to the 9 o’clock position, and ring(0) moves the legend inside the chart.

#delimit ;
tw (scatter  exportshare year if reporter==”Philippines”,
msize(large) mcolor(dkgreen) mlabel(year) mlabpos(12) connect(l))
(scatter  exportshare year if reporter==”Malaysia”,
msize(large) mcolor(dkorange) mlabel(year) mlabpos(6) mlabcolor(black) connect(l)),
ytitle(“Export Share in Electronics* %”) ylabel(0(10)60, angle(0))
xtitle(“”) xlabel(“”)
title(“Figure 3″, pos(10) ring(0))
note(”          *SITC 2-digit category 77″)
legend(cols(1) label(1 “Philippines”) label(2 “Malaysia”) pos(9) ring(0))
;

In Figure 4, we use export share as weights.

#delimit ;
tw (scatter  exportshare year [aw=exportshare],
msymbol(Oh)mcolor(red) mlabpos(12)),
ytitle(“Export Share in Electronics* %”) ylabel(0(10)60, angle(0))
xtitle(“”) xlabel(1965(10)1995 2007)
title(“Figure 4”, pos(5) ring(0))
note(“*SITC 2-digit category 77”)
;