-destring- complication

Source: http://xkcd.com/208/

Converting string to numeric variables is easy with -destring- (-help destring-). But when -destring- returns “income contains nonnumeric characters; no generate,” it is an unwelcome complication. This tells you that there is a nonnumeric character in a variable that you expect to be all numeric, but it does not tell you what the character(s) is(are) exactly (like the doctor telling you ‘you are sick [full stop]’). There are two ways to deal with this. First is to use the force option, which converts all nonnumeric strings into missing values. This must be done with CAUTION. Second is to use the ignore() option to specify nonnumeric characters to take out. This must also be done with CAUTION. But to use ignore(), you must know what the specific nonnumeric characters are.

Nonnumeric characters are often easy to spot if you are working with a small dataset or the same character(s) appear in all observations. In this case you can -browse- or -list- the data or use -tab- (if you have few distinct values). Manually looking for the nonnumeric characters becomes a complication, however, if you have a huge dataset and the character(s) appear only in very few cases (for example a single “-” in the middle of a dataset with thousands of distinct observations).

Why are there nonnumeric charcters in a suppose-to-be numeric variable in the first place? There could be embedded spaces or the codes used to indicate missing values are not among the Stata’s 27 numeric missing values. There could be other reasons, including encoding errors. I usually encounter different codes for missing values including “na” (and all its variants), “no data”, or “-“, and this used to give me a headache until I figured out what what ‘regular expressions’ are (-help regexm-).

Below is an illustration.

clear
input str6 income
“9747”
“1,234”
“938.9”
“8344”
“2398”
“-”
“53822”
“na”
“$28477”
“n/a”
end

What we want is a command that will show us what the unwanted characters are, that is, nonnumeric characters excluding the decimal point “.” (except when you expect series of decimal points such as “..”). The condition in the following -tab- command does so.

tab income if regexm(income, “[^0-9 .]”)

destring income, ignore(“$” “-” “,” “na” “n/a”) gen(n_income)
list


It is tempting to overuse -regexm-, but it is not necessary in cases where the characters are obvious.

See also Stata’s FAQ: What are regular expressions and how can I use them in Stata? (Kevin S. Turner, StataCorp).

17 Responses

  1. […] In -destring- complication, Anup asked how to split a string variable. In his case, he has a variable of the form 28-18-0018-02183100-02-O-B where 28 represents state code, 18 represents districts code, 0018 represents subdistricts code and 02183100 represents village code. His problem is how to extract the state, districts, etc. codes separately from the variable and label all the code accordingly. […]

  2. hi
    I have a quick question, suppose if a variable is in the form of number of this (28-18-0018-02183100-02-O-B) and 28 represents state code, 18 represents districts code, 0018 represents sub districts code and 02183100 represents village code and this coding is different for different state, districts etc. i just want to know that how can i extract the state, districts etc codes separately from the variable ( mentioned above) and label all the code according to there respective state in Stata? Any suggestions would be appreciated.

    • hi anup,

      If your variables always have the same lenght of characters this is an easy one. just use substr() to cut your variable into pieces.

      like

      gen state = substring(yourvariable,1,2)
      gen district = substring(yourvariable,4,2)
      gen subdistrict = substring(yourvariable,7,4)

      and so forth…

      all the best
      freddy

  3. Hi,

    Following on this, I have a quick question: What if all values of a variable are 12 characters. When all the characters are digit I get something like 123456789123 for a 12 digit value or 000000000123 for a 3 digit value; however, some of the values are negative and less than 12 digits, e.g., 00000-123584. When I – destring – this variable obviously values with negative sign in the middle are not easy to deal with. Any suggestion would be appreciated.

    • hi amir,

      here is one solution:
      /*
      (1) generate a new variable “neg” which takes 1 if observation in variable v1 contains “-”
      (2) ignore “-” when ‘destringing’ v1
      (3) put back the negative sign to those with neg==1
      */
      gen neg = regexm(v1, “-“)
      destring v1, ignore(“-“) replace
      replace v1 = -v1 if neg==1
      drop neg

      hth :)
      mitch

  4. Had the same problem, but thanks to this article I was able to cope with this shit!!!

    I used the force-method and identified my missings using:

    tab ID(the datasets ID-V)ariable Var1(destringed, force) if Var1==., m

    then I could simply check what caused Stata to see those entries as non-numeric….

    Good guy Mitch Abdon!!!

  5. I was trying to convert my “numeric” string to actual string using destring and it kept giving me error. Thank God I found your article or I would be stuck for good amount of time.

  6. I have a slightly different problem for which I’ve had trouble finding an answer. I have a variable which is largely numeric with the exception of some “&” and “-” characters. If my dictionary describes them as numeric variables, these values are dropped. If I describe the variable as string and either encode or destring it, the nonnumeric values become missing. The problem is that these & and – values are legitimate codes left over from the days of multipunching with meaningful interpretations that I’d like to convert to numeric codes. If I force or ignore them, they become missing values and I can’t distinguish them from the cases that are legitimately missing. How can I read these values and save them in numeric form?

    • Hi Ken..

      Stata has more than 1 code for missing values (-help missing-). You may read them as string first and replace “&” to one of the codes for missing values (eg “.a” or “.b”) then use -destring- and “-” to another missing value code (eg “.z” or “.”).

      Best,
      Mitch

      • Hi Mitch. Thanks for the suggestion. I’ll give it a try but I have one question. Ultimately, I want to use the values assigned to “&” and “-” as numeric codes. (Ultimately, “&” will become 10 and “-” will become 11–both nonmissing). I thought that when I destring the variable, the nonnumeric values will be missing values automatically. Is there a way to get Stata to change a particular missing value to an integer value?

  7. thanks, great tip.

  8. Also another way of getting to see what in myvar won’t convert to string without force is to add

    … if missing(real(myvar))

    This catches more than your regular expression, but often that is useful too.

  9. Just a comment as the original author of -destring-, which is now an official command: The purpose of -destring- is emphatically not just converting string to numeric variables, _without qualification_. At the time it was written, -real()- and -encode- were already available for precisely that purpose, -real()- asking no questions, but just ignoring non-numeric characters, and -encode- always doing that, because whatever was in a string variable could be preserved in value labels. The main purpose of -destring- was essentially to deal with variables that should be numeric but by mistake were in string form. So, it’s not a complication that -destring- refuses to act when non-numeric characters are present; it is intended behaviour — and I suggest that you are misinterpreting the purpose of -destring- slightly.

    That said, there are common reasons why -destring- won’t act without forcing (e.g. commas to show where fractions begin) and there’s scope for arguing about how friendly it should be when refusing to act. As said, I’ve provided. as a user, a command -charlist- which may help diagnosis here. But I don’t agree that it would be ideal behaviour for -destring- to provide a list of unacceptable characters so that the user could then specify what they are in -ignore()-. It would be better if the user looked carefully at their data to work out where these characters occur and why they occur. After all, a clear implication of -destring- refusing is that the user has inaccurate ideas about the form of the data. It may not seem so, but it is a clear and designed feature of commands like -destring- to make it difficult to change your data, except by using an option like -force-, whose name contains its own warning.

  10. tab income if regexm(income, “[^0-9 .]“) is really clever! Great tip!
    P.S. I would love to see more regular expression tutorials, guides in Stata!

  11. Great tip. Also NJC’s -charlist- from SSC is useful for hunting down these types of characters — especially hidden space chars.

Leave a Reply