Here is one example where you need to preserve the numerical format for strings. Suppose you have a 6-digit numeric observation ID, *code*, where the first 2 digits represent a geographic code and the last 4 digits represent unique observation codes, and you want to generate a new variable, *reg*, that represents the 2-digit geographic code. The entries in variable *code*, with numeric format *%06.0f*, will look like: 010001, 010002, …, 250001, …, 999999. For variable *reg*, entries will be: 01, 02, …, 99.

[Note: The format *%06.0f* means that *code* is **f**ixed as a **6**-digit number with leading zeros (i.e., if *code* is less than 100000, it has 0's before the first non-zero digit) and with nothing after the decimal point.]
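To see what the format does before touching any strings, here is a minimal sketch. The four values are made-up toy data, not from a real dataset; only the variable name *code* comes from the example above.

```stata
* Toy data (hypothetical values) for the 6-digit ID described above
clear
input long code
10001
10002
250001
999999
end

* Apply the display format: width 6, leading zeros, no decimals
format code %06.0f
list code

* The same format works on a single number
display %06.0f 10001    // shows 010001
```

Note that the format only changes how the number is displayed. The stored value is still the number 10001.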

How will you go about this? First, you need to transform the numeric code into a string. Why? Because there is no basic arithmetic operation that directly returns the first 2 digits of a number with the leading zero intact. You cannot P-E-M-D-A-S your way out of this. Second, you take a subset of the string and call it something else. In Stata, this involves the following commands:

**tostring** *code*, **gen**(*string1*) format(*%06.0f*) /* generates the string variable *string1* and preserves the format with leading zeros */

**gen** *string2* = **substr**(*string1*, *1*, *2*) /* generates string variable *string2*. It is a subset of the string *string1*, starting at element 1 with length 2 (in short, the first 2 digits). */

What happens if you only write "**tostring** *code*, **gen**(*string1*)"? This command will return the string without the leading zeros. For example, 010001 becomes "10001". Then, for observations with *code* < 100000, "**gen** *string2* = **substr**(*string1*, *1*, *2*)" will return the 2nd and 3rd digits of the code. You're screwed!
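The pitfall is easy to reproduce side by side. This is a sketch on two made-up observations; the variable names *nofmt*, *withfmt*, *bad* and *good* are hypothetical and only there for illustration.

```stata
* Toy data (hypothetical values): one code below 100000, one above
clear
input long code
10001
250001
end
format code %06.0f

* Without format(), tostring does not use the leading-zero format
tostring code, gen(nofmt)

* With format(), the leading zero survives the conversion
tostring code, gen(withfmt) format(%06.0f)

* Take the first 2 characters of each string
gen bad  = substr(nofmt, 1, 2)     // "10" for code 010001: wrong
gen good = substr(withfmt, 1, 2)   // "01" for code 010001: right

list
```

For the second observation both versions agree ("25"), which is exactly why this mistake can go unnoticed until you look at the codes below 100000.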

Another way is to use:

**gen** *string1* = **string**(*code*, "*%06.0f*") /* generates string variable *string1* and preserves the format with leading zeros */

**gen** *string2* = **substr**(*string1*, *1*, *2*)

Or (the most elegant of all):

**gen** *string3* = **substr**(**string**(*code*, "*%06.0f*"), *1*, *2*)
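Put together on toy data, the one-liner can create the geographic code *reg* directly. This is only a sketch: the three values are made up, and the final checks are an optional sanity test, not part of the original recipe.

```stata
* Toy data (hypothetical values)
clear
input long code
10001
250001
999999
end

* One step: format the number as a 6-character string, keep the first 2
gen reg = substr(string(code, "%06.0f"), 1, 2)

* Optional sanity checks: every reg is exactly 2 characters long,
* and the first observation keeps its leading zero
assert strlen(reg) == 2
assert reg == "01" in 1

list code reg
```

Here *reg* is a string variable, so "01" keeps its leading zero.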

In Stata, there are many ways to solve a problem, like there are many ways to prove the Pythagorean theorem. And, like the Pythagorean theorem proofs, there are the very long ones, the shorter ones, and the one that is the most elegant of all.
