Website tables into Stata

That last post bit me (in a good way)…speaking of returns! The downside: no time for blogging. During that time I have also set aside the Stata course work I enrolled in. The lecture notes are easy to follow, and the exercises are very helpful. I just started going through lecture 3, but no more course leaders to ask when I get stuck. On the upside, I have also learned something new while working on other things. The subject of this post is one example.

This was my problem: I need barangay (village) information (population, code, type) for all municipalities. The information is available in NSCB’s PSGC (Philippine Standard Geographic Codes) website, but these are organized by municipality. It would not have been a problem if NCSB provided a data file (in text or any data format) that is readily downloadable. Copy-pasting the tables is out of the question—the Philippines has 1,496 municipalities! Thanks to Google for pointing me to Eric Booth’s reply to Friedrich Huebler’s query on reading HTML files with Stata at Statalist.

Using Eric Booth’s example as guide, below I illustrate how I grabbed the barangay table for a single municipality. In honor of Manny Pacquiao‘s win over Shane Mosley last Sunday, I will use as example the municipality of Kiamba, where Manny Pacquiao resides as Congressman of the province of Sarangani.* The output I need is a Stata dataset with all the 19 barangays of Kiamba, their corresponding PSGC, barangay type (whether urban or rural), and population (as in the table below).


/* Copy web page and save to a text file. The PSGC code for Kiamba is “128003000”, where the first 2 digits correspond to the region code, the next 2 digits correspond to the province code. */

local m “128003000”
local nscb “”
local r = substr(“m'",1,2)
local p = substr("
copy “nscb'muncode=m’&regcode=r'&provcode=p'”  “m`m’.txt”, text

/* Delete unnecessary HTML codes and replace quotation marks (\Q) with “%”. I tried not to replace the quotation marks and learned that there is a reason why Eric Booth got rid of them. */
filefilter “mm'.txt" "xmm’.txt”, from(“</td>”) to(“”) replace
filefilter “xmm'.txt" "mm’.txt”, from(“\Q”) to(“%”) replace
erase “xm`m’.txt”

/* Load text file */
intext using m`m’.txt, gen(nscb) clear

/* Drop unnecessary rows and columns */
drop in 1/160
keep nscb1

/* Flag lines to keep */
local brgy “<td width=%255% class=%dataCell%>”
local psgc “<td width=%130% align=%center% class=%dataCell%>”
local popn “<td width=%115% align=%center% class=%dataCell%>”

/* -findval- (Stas Kolenikov) is from SSC archive. Type -ssc install findval- */

findval “brgy'", substr gen(flag_brgy)
findval "
psgc'”, substr gen(flag_psgc)
findval “Urban”, substr gen(flag_urban)
findval “Rural”, substr gen(flag_rural)
findval “`popn'”, substr gen(flag_popn)

drop if (flag_brgy + flag_urban + flag_rural + flag_psgc + flag_popn == 0)

/* Create required variables */
gen brgy = subinstr(nscb1, “brgy'", "", 1) if flag_brgy
gen psgc = subinstr(nscb1, "
psgc'”, “”, 1) if flag_psgc
gen btyp = “Urban” if flag_urban
replace btyp = “Rural” if flag_rural
gen popn = real(subinstr(nscb1, “`popn'”, “”, 1)) if flag_popn

label var brgy “Barangay name”
label var psgc “PSGC codes”
label var btyp “Urban/Rural”
label var popn “Population”

/* Align data */
keep brgy psgc btyp popn
replace psgc = psgc[_n+1]
replace btyp = btyp[_n+3]
replace popn = popn[_n+4]
drop if brgy == “”

/* Delete trailing spaces */
replace brgy = ltrim(brgy)
replace psgc = ltrim(psgc)
replace btyp = ltrim(btyp)

/* Checks /
assert mi(psgc) == 0 /
No missing code /
assert length(psgc) == 8 | length(psgc) == 9 /
Length of codes within range /
bysort psgc: assert _N == 1    /
Barangay codes unique */



I then loop the codes over all muncipality codes for which I have data. This may not be the most elegant way to do this as Eric Booth pointed out but it works, and that is all that matters (most of the time).

*If it were a ‘running backwards’ contest, Shame Mosley would have won it against Manny Pacquiao. Alas, it was boxing.

*Kiamba is also where we use to spend summer vacations with Apong Baket and Apong Lakay (grandmother and grandfather, respectively, in Ilocano, a language used in the northern part of the Philippines).

*The code above can be downloaded here.

Splitting numbers

For a very long time I have used the string function -substr- to split all sorts of codes into components. For example, if I want to split the Philippine Standard Geographic Codes (PSGC) into smaller geographical units, I would write the following codes (see note below):

tostring psgccode, gen(str_psgc) format(“%09.0f”)
gen reg=substr(str_psgc, 1, 2)
gen prov=substr(str_psgc, 3, 2)
gen mun=substr(str_psgc, 5, 2)
gen bgy=substr(str_psgc, 7, 3)

/* Thanks to Nick Cox for pointing out a booboo I have committed in an earlier version of the codes above in his comment below */

This is not to downplay the capabilities of -substr-. In fact, -substr- is all you need if you only need a part of a string, e.g. first n digits (see related post). But today I found a more convenient way of splitting numbers by using -nsplit- (Dan Blanchette). -nsplit- creates new numeric variables to split a numeric variable according to digit pattern. By using -nsplit-, the block of codes above can be shortened into a single line as follows:

nsplit psgccode, digits(2 2 2 3) gen(reg prov mun bgy)

Another advantage of -nsplit- is that you do not have to worry about leading 0’s. For example, here are the PSGC codes for Bgy. Cannery Site, Polomolok, South Cotabato, Region XII (126312024) and Bgy. Sto. Domingo, Milaor Camarines Sur, Region V (051721019), and the generated codes after the split:

You may install -nsplit- by typing: “ssc install nsplit”

Note:  The PSGC is a 9-digit code that represents all geographic units in the Philippines — from the largest (region) to the smallest (barangay or village). It is composed of (in this order) the 2-digit region code, the 2-digit province code, the 2-digit municipaltiy code, and the 3-digit barangay code. PSGC codes are available at the National Statistical Coordination Board.