UN Comtrade API in Stata


UN Comtrade is the largest repository of disaggregated trade statistics. It offers free access to detailed annual trade data starting from 1962 and monthly trade data from 2010. Free access is limited to 50,000 records per query; this limit is relaxed in some cases, but the API currently allows a maximum of 50,000 records per query for all users.

The UN Comtrade data extraction API (currently in beta) is publicly available. How can we exploit it to download Comtrade data directly from Stata?
Continue reading
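As a teaser, here is a minimal sketch of what such a download could look like with Stata's -copy- command. The endpoint and query parameters below follow the beta API's GET pattern at the time of writing, but treat them as assumptions and check the current API documentation before use:

```stata
* Sketch: fetch one query's worth of Comtrade data (max 50,000 records)
* as CSV and read it in. Parameters (r=reporter, p=partner, ps=period,
* px=classification, fmt=format) are assumptions to verify against the docs.
local base  "http://comtrade.un.org/api/get?"
local query "max=50000&type=C&freq=A&px=HS&ps=2013&r=608&p=all&rg=all&cc=TOTAL&fmt=csv"
copy "`base'`query'" comtrade.csv, replace
insheet using comtrade.csv, comma clear names
```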

Using Stata to make sense of my Uber data


I tried Uber in late May and since then I have taken 131 Uber rides covering 1,200 kilometers and 80 hours on the road. Uber (and GrabTaxi) has eliminated the wait under the heat (and rain) and the dealing with the assholeness of most taxi drivers here in Metro Manila. But what I love most about Uber, apart from their customer service, is the data they send. Trip receipts are automatically sent as soon as the trip has ended. These do not only show how much I am charged but also include time, distance, fare disaggregated by time and distance, and much more. GrabTaxi receipts, on the other hand, only show the amount paid, manually encoded by drivers.
Continue reading

Import data from Excel sheets


How do we import data from all sheets in a number of Excel files? Each Excel file has a different number of sheets with names of no discernible pattern, but (thankfully) each sheet has the same structure: the first observation is in the same row and the columns correspond to each other. An example is the set of 17 Excel files of census data of barangays (villages) that was provided to me. Each Excel file corresponds to one region and within each file are sheets corresponding to the province in the region.  How do we consolidate all sheets in all files into one data file?

Continue reading
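One way this can be sketched, a preview of the approach, is with -import excel, describe-, which reports the sheet names of a workbook in r(). The folder pattern, cell range, and variable name below are my assumptions, not part of the original post:

```stata
* Sketch: consolidate every sheet of every .xlsx file in the current folder.
* Assumes all sheets share one structure, with data starting in row 2 (A2).
local files : dir . files "*.xlsx"
tempfile all
clear
save `all', emptyok
foreach f of local files {
    * -import excel, describe- stores the sheet count and names in r()
    import excel using "`f'", describe
    local nsheets = r(N_worksheet)
    forvalues s = 1/`nsheets' {
        import excel using "`f'", describe    // r() was cleared; re-describe
        local sheet = r(worksheet_`s')
        import excel using "`f'", sheet("`sheet'") cellrange(A2) clear
        gen source = "`f':`sheet'"            // keep track of origin
        append using `all'
        save `all', replace
    }
}
use `all', clear
```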

Put anything anywhere in Excel without sweat


putexcel has recently become a very good friend. For those who find comfort in working with tables in Excel after data processing or estimation in Stata (or who work with people who do—yes, there are others who don’t find comfort in this) and are already on Stata 13, learning putexcel can be very helpful (put an end to copy-pasting!). A number of user-written commands, such as outreg, outreg2, and tabout, are already available for similar purposes. What sets putexcel apart is its user-friendliness and flexibility. You can put anything anywhere in Excel without sweat.
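A minimal sketch of the idea, using Stata 13's putexcel syntax (the cell addresses, scalar names, and file name here are mine, chosen for illustration):

```stata
* Sketch: write a tiny summary table to Excel with -putexcel- (Stata 13)
sysuse auto, clear
quietly summarize price
scalar pmean = r(mean)        // store results before writing them out
scalar pn    = r(N)
* First call creates the file; -modify- appends to the existing sheet
putexcel A1=("Variable") B1=("Mean") C1=("N") using results, replace
putexcel A2=("price") B2=(pmean) C2=(pn) using results, modify
```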

Continue reading

Tell me, where did I go wrong


If you are Filipino, you are most likely singing the title by now :)

Looking for a missing bracket, a misplaced comma, or a space that shouldn’t be there—or debugging in general—can be a pain sometimes. When the usual error message fails to point out where you messed up, try turning trace on to track down the error. trace literally traces the execution of programs: it echoes the lines that Stata executes internally. Reading through the whole thing on your Results window can be daunting, but you don’t really have to go through all of it. You just need to see at which line execution stops and why it stopped there. To turn on trace, type:

set trace on
And… don’t forget to turn it off when you no longer need it. The output can get really long.

set trace off
Next time you wonder where you went wrong, use trace before you lose your mind. For more options, see help set trace.

Now, continue with the singing… “what did I do to make you change your mind completely…”

Blah-blah-blah about do-files


Data analysis in Stata may be carried out interactively using the Command window, by using the drop-down menu, or by executing a set of Stata commands written in a do-file. Most likely you will use a combination of these rather than one method exclusively. Making use of the Command window or the GUI is an easy way to get acquainted with Stata.

Using Stata interactively allows you to use Stata on the fly by submitting a command line in the Command window or clicking options from the drop down menu. You can load, explore, manage, and analyse your data in this manner. Clicking away is particularly useful if you do not know the command; and typing away is convenient particularly in the early part of analysis when you are in the process of getting to know your data or when you are just figuring out what to do.

The drawback, however, is that what you do interactively is lost when you exit Stata. You may have saved your output (a table or a chart) but the process from which it resulted is gone and replicating or modifying your work may prove difficult and often impossible. Also, it will be difficult to audit your work.

The advice to keep an organized, self-explanatory record (a do-file or a set of do-files) of the processes you have been through throughout your work—from loading your data to cleaning it to producing tables and charts—cannot be overemphasized. Easier said than done.
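As one possible starting point—a sketch only, with file names and structure that are my own conventions rather than anything prescribed—a master do-file can tie the whole record together:

```stata
* 00_master.do -- sketch of a self-documenting pipeline, raw data to output
version 13            // lock the interpretation of commands to this release
clear all
set more off
capture log close
log using analysis, text replace

do 01_import.do       // load the raw data
do 02_clean.do        // fix names, labels, and missing values
do 03_analysis.do     // produce the tables and charts

log close
```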

Bootstrapping Gini


Income inequality in the Philippines, as measured by the Gini coefficient, declined from 46.05 to 44.84 between 2003 and 2009.[1] Is the observed difference in the Gini coefficient a real reduction in income inequality, or is it only due to sampling variation?

A friend asked me a question related to this a few weeks ago. She asked if I knew a Stata command that tests the significance of the difference between two Gini coefficients. Too lazy to think about it, I just shrugged off the problem with a ‘no’ and never bothered to search for a solution. The problem came back to me while reading the bootstrapping section of NetCourse 151, Lecture 3. Indeed, when I searched further, a number of studies have used bootstrapping for this purpose. Examples include Mills and Zandvakili (1997) and Biewen (2002).[2][3]

Using Stephen Jenkins’s -ineqdeco- (ssc install ineqdeco) to calculate the Gini coefficients, I did the following bootstrapping exercise on a hypothetical dataset:

/*************************************************************/
/* Program to calculate the difference in Gini coefficients */
clear
cap program drop ginidiff
program ginidiff, rclass
    qui ineqdeco income if year == 1
    return scalar gini1 = r(gini)
    qui ineqdeco income if year == 2
    return scalar gini2 = r(gini)
    return scalar diff = return(gini1) - return(gini2)
end

/* Generate hypothetical dataset */
set obs 200
gen year = 1 in 1/100
replace year = 2 in 101/200
set seed 56453
gen income = runiform()*1000 if year == 1
set seed 86443
replace income = runiform()*1000 if year == 2

/* Apply bootstrap */
assert income < .    /* Make sure no missing values */
set seed 873023
bootstrap r(diff), reps(1000) nodots : ginidiff

program drop ginidiff
exit
/*************************************************************/

Whether this straightforward application of bootstrapping is the best solution is another story (and the more important one). For a discussion of other proposed methods and of the limitations of bootstrapping in this context, see for example, Palmitesta, P. et al (2000) and Van Kerm (2002).[4][5]


[1] Thanks to Shiel Velarde (World Bank Manila) for the Gini estimates. These numbers are based on the Family Income and Expenditure Survey (FIES).

[2] Mills, J. and Zandvakili, S. (1997). “Statistical Inference via Bootstrapping for Measures of Inequality”. Journal of Applied Econometrics 12 (2): 133–150. (Working Paper version can be downloaded here)

[3] Biewen, M. (2002). “Bootstrap inference for inequality, mobility and poverty measurement.” Journal of Econometrics 108: 317–342.

[4] Palmitesta, P. et al (2000). “Confidence Interval Estimation for Inequality Indices of the Gini Family.” Computational Economics 16(1-2): 137–147.

[5] Van Kerm, P. (2002). “Inference on inequality measures: A Monte Carlo experiment.” Journal of Economics 9(Supplement1): 283-306. (Working Paper version can be downloaded here)

Website tables into Stata


That last post bit me (in a good way)… speaking of returns! The downside: no time for blogging. During that time I also set aside the Stata course I enrolled in. The lecture notes are easy to follow and the exercises are very helpful. I just started going through Lecture 3, but there are no longer course leaders to ask when I get stuck. On the upside, I have learned something new while working on other things. The subject of this post is one example.

This was my problem: I need barangay (village) information (population, code, type) for all municipalities. The information is available on NSCB’s PSGC (Philippine Standard Geographic Codes) website, but it is organized by municipality. It would not have been a problem if NSCB provided a data file (in text or any data format) that is readily downloadable. Copy-pasting the tables is out of the question—the Philippines has 1,496 municipalities! Thanks to Google for pointing me to Eric Booth’s reply to Friedrich Huebler’s query on reading HTML files with Stata at Statalist.

Using Eric Booth’s example as a guide, below I illustrate how I grabbed the barangay table for a single municipality. In honor of Manny Pacquiao‘s win over Shane Mosley last Sunday, I will use as an example the municipality of Kiamba, where Manny Pacquiao resides as Congressman of the province of Sarangani.* The output I need is a Stata dataset with all 19 barangays of Kiamba, their corresponding PSGC, barangay type (whether urban or rural), and population (as in the table below).

===========================================================

/* Copy web page and save to a text file. The PSGC code for Kiamba is "128003000", where the first 2 digits correspond to the region code and the next 2 digits to the province code. */

local m "128003000"
local nscb "http://www.nscb.gov.ph/activestats/psgc/municipality.asp?"
local r = substr("`m'",1,2)
local p = substr("`m'",3,2)
copy "`nscb'muncode=`m'&regcode=`r'&provcode=`p'" "m`m'.txt", text

/* Delete unnecessary HTML codes and replace quotation marks (\Q) with "%". I tried not to replace the quotation marks and learned that there is a reason why Eric Booth got rid of them. */
filefilter "m`m'.txt" "xm`m'.txt", from("</td>") to("") replace
filefilter "xm`m'.txt" "m`m'.txt", from("\Q") to("%") replace
erase "xm`m'.txt"

/* Load text file */
intext using m`m'.txt, gen(nscb) clear

/* Drop unnecessary rows and columns */
drop in 1/160
keep nscb1

/* Flag lines to keep */
local brgy "<td width=%255% class=%dataCell%>"
local psgc "<td width=%130% align=%center% class=%dataCell%>"
local popn "<td width=%115% align=%center% class=%dataCell%>"

/* -findval- (Stas Kolenikov) is from the SSC archive. Type -ssc install findval- */
findval "`brgy'", substr gen(flag_brgy)
findval "`psgc'", substr gen(flag_psgc)
findval "Urban", substr gen(flag_urban)
findval "Rural", substr gen(flag_rural)
findval "`popn'", substr gen(flag_popn)

drop if (flag_brgy + flag_urban + flag_rural + flag_psgc + flag_popn == 0)

/* Create required variables */
gen brgy = subinstr(nscb1, "`brgy'", "", 1) if flag_brgy
gen psgc = subinstr(nscb1, "`psgc'", "", 1) if flag_psgc
gen btyp = "Urban" if flag_urban
replace btyp = "Rural" if flag_rural
gen popn = real(subinstr(nscb1, "`popn'", "", 1)) if flag_popn

label var brgy "Barangay name"
label var psgc "PSGC codes"
label var btyp "Urban/Rural"
label var popn "Population"

/* Align data */
keep brgy psgc btyp popn
replace psgc = psgc[_n+1]
replace btyp = btyp[_n+3]
replace popn = popn[_n+4]
drop if brgy == ""

/* Delete trailing spaces */
replace brgy = ltrim(brgy)
replace psgc = ltrim(psgc)
replace btyp = ltrim(btyp)

/* Checks */
assert mi(psgc) == 0    /* No missing code */
assert length(psgc) == 8 | length(psgc) == 9    /* Length of codes within range */
bysort psgc: assert _N == 1    /* Barangay codes unique */

list



===========================================================

I then loop the code over all municipality codes for which I have data. This may not be the most elegant way to do it, as Eric Booth pointed out, but it works, and that is all that matters (most of the time).
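The loop can be sketched roughly as follows. This assumes the steps above are saved in a do-file (here called scrape_mun.do, my name for it) amended to read the municipality code from its first argument `1'; the code list is illustrative only:

```stata
* Sketch: run the scraping steps for several municipalities and append
* the results into one dataset. Codes below are placeholders.
local muncodes "128003000 128004000 128005000"
tempfile all
clear
save `all', emptyok
foreach m of local muncodes {
    do scrape_mun.do `m'     // runs the steps above for municipality `m'
    append using `all'
    save `all', replace
}
use `all', clear
```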


*If it were a ‘running backwards’ contest, Shane Mosley would have won it against Manny Pacquiao. Alas, it was boxing.

*Kiamba is also where we used to spend summer vacations with Apong Baket and Apong Lakay (grandmother and grandfather, respectively, in Ilocano, a language used in the northern part of the Philippines).

*The code above can be downloaded here.

Trying out Stata’s NetCourse 151


I just enrolled in Stata’s “NetCourse 151: Introduction to Stata Programming”*, which starts tomorrow. This decision came after accepting the fact that I don’t have the discipline to finish Christopher Baum’s “An Introduction to Stata Programming” on my own. I may already know (or maybe not!) some of the course content, but learning Stata programming in a more structured way is one good incentive to enroll… and the price is an incentive (though of course not as good an incentive as the prospect of what I can possibly learn) to complete the course. Looking forward to learning new things. :)


*Still waiting to hear from a salesperson; Stata’s email said this would come shortly.

Saving variable labels before -collapse-


collapse literally collapses the dataset into a dataset of summary statistics. After collapse, the dataset in memory is lost unless -preserve- was declared. Also, the labels of all variables in clist are replaced with “(stat) varname”, where stat can be mean, sum, etc. (see help collapse).

Instead of retyping all the variable labels, you can use the extended macro function var label (see help extended_fcn) to save the variable label of each variable in the varlist into a local macro before collapse, then restore the labels using label var (see help label). Example:
sysuse auto, clear

/* Save each variable's label in a local macro before -collapse- */
foreach var of varlist * {
    local vlab`var' : var label `var'
}

collapse price - gear_ratio, by(foreign)

/* Restore the saved labels to the collapsed variables */
foreach var of varlist * {
    label var `var' "`vlab`var''"
}