-destring- uncomplicated

In a comment to the previous post destring complication: negative numbers, Nick Cox pointed out “the most important advise” in using destring: “never destring, replace unless you are absolutely sure that you are right or are willing to do things again if you made the wrong decision. The generate() option is there for a purpose.”

In addition, his comment point to simpler solution than using regular expressions.

Continue reading

-destring- complication: negative numbers

Less than 2 hours flight…

In a Stata training, one of the students wondered why after importing an Excel file of financial indicators into Stata some were read as strings. A quick browse at the data indicates the presence of hyphens (“-“) and that these were used in different ways: one to indicate a negative number and another to indicate a missing observation.


How do we convert these variables to numeric as destring returns an error?

Continue reading

-destring- complication

Source: http://xkcd.com/208/

Converting string to numeric variables is easy with -destring- (-help destring-). But when -destring- returns “income contains nonnumeric characters; no generate,” it is an unwelcome complication. This tells you that there is a nonnumeric character in a variable that you expect to be all numeric, but it does not tell you what the character(s) is(are) exactly (like the doctor telling you ‘you are sick [full stop]’). There are two ways to deal with this. First is to use the force option, which converts all nonnumeric strings into missing values. This must be done with CAUTION. Second is to use the ignore() option to specify nonnumeric characters to take out. This must also be done with CAUTION. But to use ignore(), you must know what the specific nonnumeric characters are.

Nonnumeric characters are often easy to spot if you are working with a small dataset or the same character(s) appear in all observations. In this case you can -browse- or -list- the data or use -tab- (if you have few distinct values). Manually looking for the nonnumeric characters becomes a complication, however, if you have a huge dataset and the character(s) appear only in very few cases (for example a single “-” in the middle of a dataset with thousands of distinct observations).

Why are there nonnumeric charcters in a suppose-to-be numeric variable in the first place? There could be embedded spaces or the codes used to indicate missing values are not among the Stata’s 27 numeric missing values. There could be other reasons, including encoding errors. I usually encounter different codes for missing values including “na” (and all its variants), “no data”, or “-“, and this used to give me a headache until I figured out what what ‘regular expressions’ are (-help regexm-).

Below is an illustration.

input str6 income

What we want is a command that will show us what the unwanted characters are, that is, nonnumeric characters excluding the decimal point “.” (except when you expect series of decimal points such as “..”). The condition in the following -tab- command does so.

tab income if regexm(income, “[^0-9 .]”)

destring income, ignore(“$” “-” “,” “na” “n/a”) gen(n_income)

It is tempting to overuse -regexm-, but it is not necessary in cases where the characters are obvious.

See also Stata’s FAQ: What are regular expressions and how can I use them in Stata? (Kevin S. Turner, StataCorp).