Splitting strings


In -destring- complication, Anup asked how to split a string variable. In his case, he has a variable of the form 28-18-0018-02183100-02-O-B, where 28 is the state code, 18 is the district code, 0018 is the subdistrict code, and 02183100 is the village code. His problem is how to extract the state, district, and other codes from the variable as separate variables and label them accordingly.

In response, Freddy provided a solution using the substr function, assuming that each part of the code always has the same number of characters, i.e., a district code is always 2 characters and a subdistrict code is always 4 characters.
gen state = substr(yourvariablename, 1, 2)
gen district = substr(yourvariablename, 4, 2)
gen subdistrict = substr(yourvariablename, 7, 4)
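
The same pattern extends to the village code, which in the example above occupies 8 characters starting at position 12. A minimal sketch on a one-observation toy dataset, where the variable name code is a placeholder for illustration and not part of the original thread:

clear
input str26 code
"28-18-0018-02183100-02-O-B"
end
* substr(s, start, length): 8 characters starting at position 12
gen village = substr(code, 12, 8)
list code village

The result is still a string, so the leading zero of the village code is kept.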

A similar solution was suggested in Splitting numbers, before nsplit was brought up as an option for numeric variables.

But what if the codes are not all of the same length but are separated by a character, such as a hyphen: 8-18-18-02183100-02-O-B and 8-18-018-02183100-02-O-B? In this case, split would be helpful. split literally splits string variables into parts, using a specified character or string as the separator. The basic syntax for split is (see help split):

split stringvariable, parse(stringseparator)

To split a code of the form 8-18-0018-02183100-02-O-B into 7 parts using the hyphen as a parser:

split yourvariablename, parse(-)

This will create 7 new variables: yourvariablename1, yourvariablename2, and so on. You may specify a different prefix using the gen() option. You may also want to rename the new variables afterwards.
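
As a rough sketch of how this might look, using the two codes of unequal length mentioned earlier. The variable name code and the prefix part are placeholders for illustration, and the grouped form of rename assumes Stata 12 or later:

clear
input str26 code
"8-18-18-02183100-02-O-B"
"8-18-018-02183100-02-O-B"
end
* gen(part) names the new variables part1 to part7 instead of code1 to code7
split code, parse(-) gen(part)
* give the first four parts meaningful names
rename (part1 part2 part3 part4) (state district subdistrict village)
list state district subdistrict village

The new variables are strings. Converting them to numeric (for example with destring) drops leading zeros, so 018 and 18 would become the same value; whether that matters depends on how the codes are defined.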

In splitting variables, string or numeric, I would like to echo Nick Cox’s comment in Splitting numbers, “the bottom line is just standard: be careful.”
