dates in Starbucks

In yesterday’s post, one of the needed cleaning was to extract date and day of week from the string:
. list date in 1/5

     |                                       date |
  1. |  Date: August 31, 2015 at 1:42:41 PM GMT+8 |
  2. | Date: August 24, 2015 at 12:36:55 PM GMT+8 |
  3. |    Date: July 27, 2015 at 2:51:27 PM GMT+8 |
  4. |    Date: July 20, 2015 at 2:45:43 PM GMT+8 |
  5. |    Date: July 20, 2015 at 2:07:49 PM GMT+8 |

Continue reading

Splitting numbers

For a very long time I have used the string function -substr- to split all sorts of codes into components. For example, if I want to split the Philippine Standard Geographic Codes (PSGC) into smaller geographical units, I would write the following codes (see note below):

tostring psgccode, gen(str_psgc) format(“%09.0f”)
gen reg=substr(str_psgc, 1, 2)
gen prov=substr(str_psgc, 3, 2)
gen mun=substr(str_psgc, 5, 2)
gen bgy=substr(str_psgc, 7, 3)

/* Thanks to Nick Cox for pointing out a booboo I have committed in an earlier version of the codes above in his comment below */

This is not to downplay the capabilities of -substr-. In fact, -substr- is all you need if you only need a part of a string, e.g. first n digits (see related post). But today I found a more convenient way of splitting numbers by using -nsplit- (Dan Blanchette). -nsplit- creates new numeric variables to split a numeric variable according to digit pattern. By using -nsplit-, the block of codes above can be shortened into a single line as follows:

nsplit psgccode, digits(2 2 2 3) gen(reg prov mun bgy)

Another advantage of -nsplit- is that you do not have to worry about leading 0’s. For example, here are the PSGC codes for Bgy. Cannery Site, Polomolok, South Cotabato, Region XII (126312024) and Bgy. Sto. Domingo, Milaor Camarines Sur, Region V (051721019), and the generated codes after the split:

You may install -nsplit- by typing: “ssc install nsplit”

Note:  The PSGC is a 9-digit code that represents all geographic units in the Philippines — from the largest (region) to the smallest (barangay or village). It is composed of (in this order) the 2-digit region code, the 2-digit province code, the 2-digit municipaltiy code, and the 3-digit barangay code. PSGC codes are available at the National Statistical Coordination Board.

Truncating numbers

In Preserving numerical format after string transformation, we used -substr- to generate a new variable that contains the first 2-digits of a numerical code. We have transformed our numeric variable, code, into a string because, as we said, P-E-M-D-A-S  will not do the trick. To recall, we used the following command:

gen string3=substr(string(code, “%06.0f“),1,2)

While it is true that the basic arithmetic operations will not help, there exists a function in Stata that will return the same first 2-digit code. This is the int() function. int(x) returns the integer obtained from dropping the digits after the decimal point of number x. For example:

gen string4=int(code/10000)
/* This will return the first 2 digits of the 6-digit code. */

Thus, if code==250001, string4==25; and if code==10001, string4==1.

[Note: int() does not round the number x to the nearest integer; for this use the function round().For other math functions, type: “help math functions”.]

Now, suppose you want to keep the last 4 digits. Will int() help? Yes:

gen string5=code – int(code/10000)*10000

If code==250001, string5==1 (string5==250001-250000); and if code==10001, string5==1 (string5==10001-10000).

Similarly, you can use modular arithmetic (also called “remainder arithmetic”). In Stata, the modulo function is mod():

gen string6=mod(code, 10000)

If code==250001, mod(250001,10000)==1.

Preserving numerical format after string transformation

Here is one example where you need to preserve the numerical format for strings. Suppose you have a 6-digit numeric observation ID, code, where the first 2 digits represent geographic code and the last 4 digits represent unique observation codes, and you want to generate a new variable, reg, that represents the 2-digit geographic code. The entries in variable code, with numeric format %06.0f will look like: 010001, 010002,…, 250001,…,999999. For variable reg, entries will be: 01, 02,…,99.

[Note: The format %06.0f means that code is fixed as a 6-digit number with leading zeros (i.e. if code is less than 100000, it has 0’s before the first non-zero digit) and with nothing after the decimal point.]

How will you you go about this? First, you need to transform the numeric code into a string. Why? Because there is no basic number operations that returns the first 2-digit of a number. You cannot P-E-M-D-A-S your way out of this. And, second,you take the subset of the string and call it something else. In Stata, this involves the following commands:

tostring code, gen(string1) format(“%06.0f“)        /generates the a string variable string1 and preserves the format with leading zeros/

gen string2=substr(string1,1,2)          /* generates string variable string2. It is subset of the string string1, starting at element 1 with length 2 (in short, the first 2 digits). */

What happens if you only write “tostring code, gen(string1)”? This command will return the string without the leading zeros. For example, from 010001 to “10001.” Then, for observations with code<100000, the “gen string2=substr(string1,1,2)” will return the 2nd and 3rd digit of the code. You’re screwed!

Another way is to use:

gen string1=string(code, “%06.0f“)                 /* generates string variable string1 and preserves the format with leading zeros */

gen string2=substr(string1,1,2)

Or (the most elegant of all):

gen string3=substr(string(code, “%06.0f“),1,2)

In Stata, there are many ways to solve a problem, like there are many ways to prove the Pythagorean theorem. And, like the Pythagorean theorem proofs, there are the very long ones, the shorter ones, and the one that is the most elegant of all.