Splitting numbers

For a very long time I have used the string function -substr- to split all sorts of codes into components. For example, to split the Philippine Standard Geographic Codes (PSGC) into smaller geographical units, I would write the following code (see note below):

tostring psgccode, gen(str_psgc) format(%09.0f)
gen reg=substr(str_psgc, 1, 2)
gen prov=substr(str_psgc, 3, 2)
gen mun=substr(str_psgc, 5, 2)
gen bgy=substr(str_psgc, 7, 3)

/* Thanks to Nick Cox, whose comment below pointed out a booboo in an earlier version of the code above */

This is not to downplay the capabilities of -substr-. In fact, -substr- is all you need if you only need a part of a string, e.g. the first n digits (see related post). But today I found a more convenient way of splitting numbers: -nsplit- (Dan Blanchette). -nsplit- creates new numeric variables by splitting a numeric variable according to a digit pattern. With -nsplit-, the block of code above can be shortened to a single line:

nsplit psgccode, digits(2 2 2 3) gen(reg prov mun bgy)
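
Because a one-line split hides what happens to each digit, it may be worth checking the result. The sketch below is only an illustration: it assumes -nsplit- is already installed, that psgccode holds whole numbers stored as a long, and it uses two example PSGC codes typed in by hand. It first counts codes that have neither 8 nor 9 digits, then rebuilds each code from its four parts and asserts that the rebuilt value matches the original.

```stata
* Sketch only: assumes -nsplit- is installed (ssc install nsplit)
* and that psgccode is stored as a long so no digits are lost.
clear
input long psgccode
126312024
51721019
end

* Count codes that do not have 8 or 9 digits
* (8 digits = a 9-digit code whose leading zero is not stored).
count if !inrange(psgccode, 10000000, 999999999)

* Split, then confirm that the four parts rebuild the original code.
nsplit psgccode, digits(2 2 2 3) gen(reg prov mun bgy)
assert reg*10000000 + prov*100000 + mun*1000 + bgy == psgccode
```

If the assert stops with an error, at least one code was not split the way the digit pattern suggests and should be inspected by hand.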

Another advantage of -nsplit- is that you do not have to worry about leading zeros. Take the PSGC codes for Bgy. Cannery Site, Polomolok, South Cotabato, Region XII (126312024) and Bgy. Sto. Domingo, Milaor, Camarines Sur, Region V (051721019). Stored as a number, the second code loses its leading zero and has only eight digits, yet -nsplit- still splits it into the right components.
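
Why a dropped leading zero is harmless can be seen with plain arithmetic. The sketch below is not how -nsplit- is implemented; it only uses Stata's built-in floor() and mod() functions to peel the digits off from the right, and the variable names ending in _a are made up for this illustration. It assumes psgccode is stored as a long.

```stata
* Sketch only: split a numeric PSGC by arithmetic, counting digits from the right.
clear
input long psgccode
126312024
51721019
end

generate reg_a  = floor(psgccode / 1e7)            // digits left of the last 7
generate prov_a = mod(floor(psgccode / 1e5), 100)  // next 2 digits
generate mun_a  = mod(floor(psgccode / 1e3), 100)  // next 2 digits
generate bgy_a  = mod(psgccode, 1000)              // last 3 digits

list psgccode reg_a prov_a mun_a bgy_a
* Row 1: reg_a 12, prov_a 63, mun_a 12, bgy_a 24
* Row 2: reg_a  5, prov_a 17, mun_a 21, bgy_a 19
```

Since the digits are counted from the right, an 8-digit value such as 51721019 simply yields a one-digit region code (5), which is the same number as 05.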

You may install -nsplit- by typing: “ssc install nsplit”

Note: The PSGC is a 9-digit code that represents all geographic units in the Philippines, from the largest (region) to the smallest (barangay or village). It is composed of (in this order) the 2-digit region code, the 2-digit province code, the 2-digit municipality code, and the 3-digit barangay code. PSGC codes are available at the National Statistical Coordination Board.

6 Responses

  1. tnx dear
    it s very useful

  2. […] similar solution was suggested in Splitting numbers before nsplit was discussed for numeric […]

  3. When writing -split- (which in turn leaned on earlier work with Michael Blasnik) I thought about extending it to cover variables without embedded delimiters, but decided that would make the syntax too complicated, if only because that is really a different problem, although sharing the same name.

    In any case the key problem with implementing a syntax like that of -nsplit- is deciding what to do when the numbers don’t match the pattern specified. It’s arguable that those are malformed and the result should be returned as missing.

    You’ve described that -nsplit- indulges one fewer digit, which is a feature in your case, and that behaviour is documented by way of examples, but it could be precisely what is not wanted in other problems. What about one more digit, two fewer digits, etc.?

    Naturally, the bottom line is just standard: be careful.

    P.S. By the way, I imagine that your -substr()- syntax was more like

    gen reg=substr(str_psgc, 1, 2)
    gen prov=substr(str_psgc, 3, 2)
    gen mun=substr(str_psgc, 5, 2)
    gen bgy=substr(str_psgc, 7, 3)

    as the last argument of -substr()- is the length of the substring, not the position of the last character to be extracted.

    • Thanks for pointing out the -substr- booboo. I really need to keep in mind the ‘standard’ tip you just mentioned: be careful.

      -nsplit- is really convenient for most of what I do. But if it fails, -substr- is always my reliable choice (for now).

  4. wow! this is really useful — saves you 4 lines in the do-file! does it apply to numeric vbles too? and vbles with delimiters?

    • It is for numeric variables. If you have a string variable with character delimiters (or separated by spaces), there is the Stata command -split-. For example:

      split stringvariable, gen(stubname) parse(delimiters)

      See “help split”
