-destring- complication: negative numbers


Less than 2 hours flight…

In a Stata training, one of the students wondered why after importing an Excel file of financial indicators into Stata some were read as strings. A quick browse at the data indicates the presence of hyphens (“-“) and that these were used in different ways: one to indicate a negative number and another to indicate a missing observation.

hyphen

How do we convert these variables to numeric as destring returns an error?


. destring  GRChange  NIChange, replace
GRChange contains nonnumeric characters; no replace
NIChange contains nonnumeric characters; no replace

First, we use charlist to check whether “-” is the only non-numeric character that complicates our use of destring:
. charlist GRChange
 -.0123456789

. charlist Niinmill
-.0123456789

. charlist NIChange
-.0123456789

charlist results show that for the variables we checked “-” is all that gives us trouble. What happens when we use the ignore() option of destring?
. destring GRChange Niinmill NIChange, replace ignore(-)
GRChange: characters - removed; replaced as double
(47 missing values generated)
Niinmill: characters - removed; replaced as double
(1 missing value generated)
NIChange: characters - removed; replaced as double
(393 missing values generated)

destring_wrong

Yes, the variables GRChange Niinmill NIChange were converted to numeric BUT we messed the data. BIG TIME. We cannot ignore the “-” that indicates a negative value.

One solution is to tag the negative values before using destring, ignore(-). How do we discriminate between the hypthens used to indicate negative values and those that indicate missing values? In the data that we have, negative observations are those that starts with “-” and followed by a number or a decimal point (e.g., -2.5, -0.135, -.34) and missing observations are those with only “-” (there are no observations with “-.”).

Following the patterns we observed above, we use regular expressions to generate a new variable that identifies the negative observations, for example:

gen tagNEGGRchange = regexm(GRChange, "^-[. 0-9]")

This creates the variable tagNEGGRchange with values 1 for negative observations and 0 otherwise. Then we can destring the GRChange ignoring “-” and creating the new variable GRChangeNew.

destring GRChange, ignore(-) gen(GRChangeNew)

Then finally bring back the “-” to the originally negative observations by negating the “destringed” values using the generated tag as qualifier:

replace GRChangeNew = -GRChangeNew if tagNEGGRchange==1

destrin_ok

See also -destring- complication.

Later… let’s catch that elusive bandwidth in this part of the world to post this in the world wide web.

2 Responses

  1. Backticks to emphasise code didn’t work. I must have misread something. Sorry about that.

  2. It’s good to document this: as you spell out clearly, doing the wrong thing would bite you hard.

    Once you know the problem it’s easy to suggest a solution. With perfect hindsight, you can just replace isolated hyphens with empty strings or `”.”`

    `replace whatever = “” if whatever == “-“`

    Then there is nothing to ignore and no need for anything more complicated. My personal bias is that very many uses of regular expressions could be replaced with something simpler, although I am a fan when they are the best solution, which is also common.

    My standard advice is to check surprising refusals to `destring` with

    `tab whatever if missing(real(whatever))`

    In a recent problem for me, that isolated one fixable string in one of about 600,000 observations, for which looking through the dataset was unattractive.

    The most important advice is just this: never `destring, replace` unless you are absolutely sure that you are right or are willing to do things again if you made the wrong decision. The `generate()` option is there for a purpose.

Leave a Reply