Website tables into Stata

That last post bit me (in a good way)…speaking of returns! The downside: no time for blogging. During that time I have also set aside the Stata course work I enrolled in. The lecture notes are easy to follow, and the exercises are very helpful. I just started going through lecture 3, but no more course leaders to ask when I get stuck. On the upside, I have also learned something new while working on other things. The subject of this post is one example.

This was my problem: I need barangay (village) information (population, code, type) for all municipalities. The information is available in NSCB’s PSGC (Philippine Standard Geographic Codes) website, but these are organized by municipality. It would not have been a problem if NCSB provided a data file (in text or any data format) that is readily downloadable. Copy-pasting the tables is out of the question—the Philippines has 1,496 municipalities! Thanks to Google for pointing me to Eric Booth’s reply to Friedrich Huebler’s query on reading HTML files with Stata at Statalist.

Using Eric Booth’s example as guide, below I illustrate how I grabbed the barangay table for a single municipality. In honor of Manny Pacquiao‘s win over Shane Mosley last Sunday, I will use as example the municipality of Kiamba, where Manny Pacquiao resides as Congressman of the province of Sarangani.* The output I need is a Stata dataset with all the 19 barangays of Kiamba, their corresponding PSGC, barangay type (whether urban or rural), and population (as in the table below).


/* Copy web page and save to a text file. The PSGC code for Kiamba is “128003000”, where the first 2 digits correspond to the region code, the next 2 digits correspond to the province code. */

local m “128003000”
local nscb “”
local r = substr(“m'",1,2)
local p = substr("
copy “nscb'muncode=m’&regcode=r'&provcode=p'”  “m`m’.txt”, text

/* Delete unnecessary HTML codes and replace quotation marks (\Q) with “%”. I tried not to replace the quotation marks and learned that there is a reason why Eric Booth got rid of them. */
filefilter “mm'.txt" "xmm’.txt”, from(“</td>”) to(“”) replace
filefilter “xmm'.txt" "mm’.txt”, from(“\Q”) to(“%”) replace
erase “xm`m’.txt”

/* Load text file */
intext using m`m’.txt, gen(nscb) clear

/* Drop unnecessary rows and columns */
drop in 1/160
keep nscb1

/* Flag lines to keep */
local brgy “<td width=%255% class=%dataCell%>”
local psgc “<td width=%130% align=%center% class=%dataCell%>”
local popn “<td width=%115% align=%center% class=%dataCell%>”

/* -findval- (Stas Kolenikov) is from SSC archive. Type -ssc install findval- */

findval “brgy'", substr gen(flag_brgy)
findval "
psgc'”, substr gen(flag_psgc)
findval “Urban”, substr gen(flag_urban)
findval “Rural”, substr gen(flag_rural)
findval “`popn'”, substr gen(flag_popn)

drop if (flag_brgy + flag_urban + flag_rural + flag_psgc + flag_popn == 0)

/* Create required variables */
gen brgy = subinstr(nscb1, “brgy'", "", 1) if flag_brgy
gen psgc = subinstr(nscb1, "
psgc'”, “”, 1) if flag_psgc
gen btyp = “Urban” if flag_urban
replace btyp = “Rural” if flag_rural
gen popn = real(subinstr(nscb1, “`popn'”, “”, 1)) if flag_popn

label var brgy “Barangay name”
label var psgc “PSGC codes”
label var btyp “Urban/Rural”
label var popn “Population”

/* Align data */
keep brgy psgc btyp popn
replace psgc = psgc[_n+1]
replace btyp = btyp[_n+3]
replace popn = popn[_n+4]
drop if brgy == “”

/* Delete trailing spaces */
replace brgy = ltrim(brgy)
replace psgc = ltrim(psgc)
replace btyp = ltrim(btyp)

/* Checks /
assert mi(psgc) == 0 /
No missing code /
assert length(psgc) == 8 | length(psgc) == 9 /
Length of codes within range /
bysort psgc: assert _N == 1    /
Barangay codes unique */



I then loop the codes over all muncipality codes for which I have data. This may not be the most elegant way to do this as Eric Booth pointed out but it works, and that is all that matters (most of the time).

*If it were a ‘running backwards’ contest, Shame Mosley would have won it against Manny Pacquiao. Alas, it was boxing.

*Kiamba is also where we use to spend summer vacations with Apong Baket and Apong Lakay (grandmother and grandfather, respectively, in Ilocano, a language used in the northern part of the Philippines).

*The code above can be downloaded here.

2 Responses

  1. If you want a more generalizable way of doing this, especially in a situation where you need to pass cookies to pages or navigate jquery, you can use Python (mechanize and beautifulsoup packages) to output to csv files.

Leave a Reply