Posted on 22 April 2014 by Mitch Abdon

run me ctrl-d

in pc, in ios

’tis cmd-shift-d

Filed under: Basic functions | Tagged: do, ios, keyboard shortcut, mac, run | 1 Comment »

Posted on 12 April 2014 by Mitch Abdon

In -destring- complication, Anup asked how to split a string variable. In his case, he has a variable of the form 28-18-0018-02183100-02-O-B where 28 represents state code, 18 represents districts code, 0018 represents subdistricts code and 02183100 represents village code. His problem is how to extract the state, districts, etc. codes separately from the variable and label all the code accordingly.

In response, Freddy provided a solution using the `substr` function, assuming that each part of the code always has the same number of characters, e.g., a district code is always 2 characters and a subdistrict code is always 4 characters:

gen state = substr(yourvariablename, 1, 2)
gen district = substr(yourvariablename, 4, 2)
gen subdistrict = substr(yourvariablename, 7, 4)

A similar solution was suggested in Splitting numbers before `nsplit` was discussed for numeric variables.

But how about when the codes are not of the same length but are separated by a character, such as a hyphen: 8-18-18-02183100-02-O-B and 8-18-018-02183100-02-O-B? In this case, `split` would be helpful. `split` literally splits string variables into parts using a specified character or string as a separator. The basic syntax for `split` is (see `help split`):

`split stringvariable, parse(stringseparator)`

To split a code of the form 8-18-0018-02183100-02-O-B into 7 parts using the hyphen as a parser:

`split yourvariablename, parse(-)`

This will create 7 new variables: `yourvariablename1`, `yourvariablename2`, and so on. You may specify a new prefix using the `gen()` option. You may also want to rename the variables afterwards.

In splitting variables, string or numeric, I would like to echo Nick Cox’s comment in Splitting numbers: “the bottom line is just standard: be careful.”
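Here is a minimal sketch of `split` with a new prefix and a rename afterwards. The variable name `code`, the prefix `part`, and the two sample values are made up for illustration; only the first four parts are renamed.

```stata
* Hypothetical data: two hyphen-separated codes with parts of different lengths
clear
input str30 code
"28-18-0018-02183100-02-O-B"
"8-18-18-02183100-02-O-B"
end

* Split on the hyphen; new variables are named part1, part2, ..., part7
split code, parse(-) generate(part)

* Give the first four parts meaningful names
rename (part1 part2 part3 part4) (state district subdistrict village)

list state district subdistrict village
```

The new variables are still strings; if numeric codes are needed, -destring- can be applied afterwards, with the usual care about leading zeros.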

Filed under: Basic functions | Tagged: split, string | Leave a comment »

Posted on 31 March 2014 by Mitch Abdon

In And we’re rolling, rolling; rolling on the river, Hasan asked how he could “keep only those values that were calculated using at least 3 observations” after he calculated the 4 period rolling standard deviation of a set of observations. One solution is to tag the periods in which more than 1 observation within the window (in this case, 4 periods) is missing, then set the calculated standard deviations for these periods to missing.

Two things to note are:

(1) `rolling` requires that your data has been declared as a time-series dataset (see `help tsset`). Time-series operators, such as L. for lags, are allowed.

(2) The `keep()` option in `rolling` allows you to keep the date variable, which you can use as an identifier when merging files.

Here is an illustration (assuming nonrecursive analysis):

clear
set obs 20
set seed 1
gen date = _n
gen v1 = 1+int((100)*runiform())
gen v2 = v1
replace v2 = . in 1/4
replace v2 = . in 10/12
replace v2 = . in 18/20
tsset date

rolling sd2 = r(sd), window(4) keep(date) saving(f2, replace): sum v2
merge 1:1 date using f2, nogenerate

gen tag = missing(l3.v2) + missing(l2.v2) + missing(l1.v2) + missing(v2) > 1
gen sd = sd2 if tag==0

In the first block, we created an artificial data set of 20 uniformly distributed random integers between 1 and 100, replaced some observations to missing, and told Stata that we are dealing with a time-series data set.

In the second block, we calculated the 4 window rolling standard deviation. By using the `saving()` option rather than `clear`, we have not replaced the current data in memory and saved the resulting dataset from the `rolling` command in f2.dta. We merged this to our current data.

In the last block, we generated the variable tag that returns 1 if the expression `missing(l3.v2) + missing(l2.v2) + missing(l1.v2) + missing(v2) > 1` is true, i.e., if the number of missing observations within the 4 period window is more than 1. Otherwise, tag is 0. Finally, in the last line, we created a new variable `sd` that is missing if the number of observations used in each window is less than 3.

Filed under: Basic functions | Tagged: missing observation, rolling, standard deviation | 1 Comment »

Posted on 3 November 2013 by Mitch Abdon

Relational operators (>, <, >=, <=, ==, !=) evaluate to 1 if the expression is true and 0 if false. Given this definition, a dummy variable can be created using, for example:

Instead of the longer alternative:

Why bother with the if-not-missing statement? If this statement is excluded, i.e.,

and
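As a minimal sketch of why the if-not-missing condition matters (the variable `income`, the cutoff of 1000, and the names `high1` and `high2` are hypothetical, not taken from the original post): Stata treats numeric missing values as larger than any number, so a comparison such as `income > 1000` evaluates to 1 when `income` is missing.

```stata
* Hypothetical data with one missing value
clear
input income
500
1500
.
end

* With the condition: missing income stays missing in the dummy
gen byte high1 = income > 1000 if !missing(income)

* Without the condition: missing income is coded 1, because . > 1000 is true
gen byte high2 = income > 1000

list
```

In this sketch, the third observation has `high1` missing but `high2` equal to 1, which is rarely what one wants.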

Filed under: Basic functions | Tagged: dummy, relational operators | 2 Comments »

Posted on 20 October 2011 by Mitch Abdon

I once was asked what is wrong with code similar to the one below:

gen asean4 = 1 if countryname == "Indonesia" | "Malaysia" | "Philippines" | "Thailand"

This is a common mistake. It is understandable to assume that repeating the left side of the expression, in this case ‘countryname’, is redundant. Alas, Stata requires it and the correct syntax is:

gen asean4 = 1 if countryname == "Indonesia" | countryname == "Malaysia" ///

| countryname == "Philippines" | countryname == "Thailand"

But we can do better by using the built-in function inlist(). Learning a little bit more about Stata’s built-in functions can be very convenient (sometimes necessary): shorter code, faster processing, more facebook time. Using inlist(), the equivalent code is:

gen asean4 = 1 if inlist(countryname, "Indonesia", "Malaysia", "Philippines", "Thailand")

inlist() may also be used for numeric values. For example:

gen asean4 = 1 if inlist(countrycode, 360, 458, 608, 764)

The difference between using numeric and string values is in the number of allowable elements in the list (number of countries in our example). For numeric values, 254 elements are allowed and for string values, only 9. See -help inlist-.
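Since inlist() itself evaluates to 1 or 0, it can also be used directly as the right-hand side when a 0/1 dummy is preferred over a 1/missing variable. A minimal sketch, with a made-up three-country dataset:

```stata
* Hypothetical data
clear
input str12 countryname
"Indonesia"
"Japan"
"Thailand"
end

* inlist() returns 1 if countryname matches any listed value, 0 otherwise
gen byte asean4 = inlist(countryname, "Indonesia", "Malaysia", "Philippines", "Thailand")

list
```

For string lists longer than the limit, one option is to join several inlist() calls with |.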

Filed under: Basic functions | Tagged: inlist | 15 Comments »

Posted on 16 July 2011 by Mitch Abdon

‘o save me’ memory cries

heed and keep data light

byte is all that’s necessary

if variable

choose the optimal data type

to save memory and keep data light

but if you’re not sure what is best

trust Stata and use -compress-

-compress- reduces memory

by demoting the type of

if

-compress- stores v as str3 not str20

if

-compress- stores
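As a small illustration of the verse above (the variables `v` and `x` and their contents are made up for this sketch), -compress- demotes a str20 variable that only ever holds three characters, and a double that only holds small integers:

```stata
* Hypothetical data stored in wastefully large types
clear
set obs 3
gen str20 v = "abc"
gen double x = _n

describe

* Let Stata pick the smallest type that holds the data without loss
compress

describe
```

After -compress-, `v` should be stored as str3 and `x` as byte; the values themselves are unchanged.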

Filed under: Basic functions | Tagged: compress, memory | 2 Comments »

Posted on 10 June 2011 by Mitch Abdon

Source: http://xkcd.com/208/

Converting string to numeric variables is easy with -destring- (-help destring-). But when -destring- returns “income contains nonnumeric characters; no generate,” it is an unwelcome complication. This tells you that there is a nonnumeric character in a variable that you expect to be all numeric, but it does not tell you exactly what the character or characters are (like the doctor telling you ‘you are sick [full stop]’). There are two ways to deal with this. First is to use the force option, which converts all nonnumeric strings into missing values. This must be done with CAUTION. Second is to use the ignore() option to specify nonnumeric characters to take out. This must also be done with CAUTION. But to use ignore(), you must know what the specific nonnumeric characters are.

Nonnumeric characters are often easy to spot if you are working with a small dataset or the same character(s) appear in all observations. In this case you can -browse- or -list- the data or use -tab- (if you have few distinct values). Manually looking for the nonnumeric characters becomes a complication, however, if you have a huge dataset and the character(s) appear only in very few cases (for example a single “-” in the middle of a dataset with thousands of distinct observations).

Why are there nonnumeric characters in a supposed-to-be numeric variable in the first place? There could be embedded spaces, or the codes used to indicate missing values are not among Stata’s 27 numeric missing values. There could be other reasons, including encoding errors. I usually encounter different codes for missing values, including “na” (and all its variants), “no data”, or “-”, and this used to give me a headache until I figured out what ‘regular expressions’ are (-help regexm-).

Below is an illustration.

clear
input str6 income
"9747"
"1,234"
"938.9"
"8344"
"2398"
"-"
"53822"
"na"
"$28477"
"n/a"
end

What we want is a command that will show us what the unwanted characters are, that is, nonnumeric characters excluding the decimal point “.” (except when you expect series of decimal points such as “..”). The condition in the following -tab- command does so.

tab income if regexm(income, "[^0-9 .]")

destring income, ignore("$" "-" "," "na" "n/a") gen(n_income)

list

It is tempting to overuse -regexm-, but it is not necessary in cases where the characters are obvious.
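As one alternative that avoids regular expressions, here is a minimal sketch using the real() function, which returns missing when a string cannot be read as a number. The three sample values and the variable name `bad` are made up for illustration. Note that this flags the offending observations, not the specific characters.

```stata
* Hypothetical data: one clean value and two problem values
clear
input str6 income
"9747"
"1,234"
"na"
end

* real() returns missing (.) for strings that are not valid numbers
gen byte bad = missing(real(income))

list income if bad
```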

See also Stata’s FAQ: What are regular expressions and how can I use them in Stata? (Kevin S. Turner, StataCorp).

Filed under: Basic functions | Tagged: charlist, destring, regexm, regular expressions | 17 Comments »

Posted on 19 May 2011 by Mitch Abdon

I just learned about -rolling- today. Thanks to a friend for asking about moving averages and standard deviations yesterday. The problem was how to generate a new variable that contains the average and standard deviation of the previous 10 periods. For example, the generated data for 1961 would be the average and the standard deviation for the period 1951 to 1960. I knew -tssmooth ma- can be used for moving averages, but I was not aware of a similar command for standard deviations, so I did the following exercise for the moving standard deviation yesterday:

/* Create hypothetical data */
clear
set obs 50
gen year = 1951 if _n==1
replace year = year[_n-1] + 1 if _n!=1
set seed 528
gen data1 = runiform()
set seed 285
gen data2 = runiform()

/* Calculate moving standard deviation */
sort year
foreach d of varlist data* {
    qui gen sd`d' = .
    local N = _N
    local i = 1
    local j = 10
    forvalues k = 11/`N' {
        qui sum `d' in `i'/`j'
        qui replace sd`d' = r(sd) if _n==`k'
        local i = `i' + 1
        local j = `j' + 1
    }
}

I should have googled first. If I had, I would have found Nick Cox’s reply to the Statalist post “calculating moving standard deviation” by Ravi Yatawara, where he suggested -rolling-. By reshaping the data into panel format and applying -xtset-, I can now use -rolling-.

/* Create the same hypothetical data as above */

reshape long data, i(year) j(group 1 2)

xtset group year

/* Calculate moving standard deviation */

rolling sd=r(sd), window(10) keep(group) clear: sum data

gen year = end + 1

keep group year sd

It is also possible to generate more than one statistic. For example, if I also want to calculate the moving average, I can write:

rolling sd=r(sd) mean=r(mean), window(10) keep(group) clear: sum data

See -help rolling- for more options.
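Another statistic that may be worth collecting is r(N), the number of nonmissing observations -summarize- used in each window, which can then be used to blank out results based on too few observations. A minimal sketch on a single made-up series (the variable names, the missing-value pattern, and the cutoff of 8 are all assumptions for illustration):

```stata
* Hypothetical data: 30 years of one series with a few gaps
clear
set obs 30
set seed 1
gen year = 1950 + _n
gen x = runiform()
replace x = . in 5/7
tsset year

* Collect sd, mean, and the number of observations used in each window
rolling sd = r(sd) mean = r(mean) n = r(N), window(10) clear: summarize x

* Discard standard deviations computed from fewer than 8 observations
replace sd = . if n < 8

list start end sd mean n
```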

The most important advantage of -rolling- (aside from its simplicity), I think, is that you do not have to worry about the order of your data because -xtset- or -tsset- already took charge of that. Note that by using -in- in my unnecessary code, I have to make sure that the data is sorted by year, otherwise I will be getting the standard deviations for the wrong time periods.

Lesson: Google first!

*“Proud Mary” [not “Rolling” as I used to think] by Tina Turner.

Filed under: Basic functions | Tagged: rolling, standard deviation, tssmooth ma | 15 Comments »

Posted on 1 March 2011 by Mitch Abdon

This is an update to the earlier post i. without the prefix -xi-. So the

Filed under: Basic functions, Data Management, Econometrics / Statistics | Tagged: factor variables, fvvarlist | Leave a comment »

Posted on 1 March 2011 by Mitch Abdon

This blog (including the pile of books to read, the GRE reviewers to go through, the TV shows to watch…this list will never end) has been neglected for more than a week. Anyway… here is

For example:

is the same as generating dummy variables for each of the n=5 categories of

By default, the category with the lowest value (in this case, n=1) is omitted. No new variables are generated using the command above. Without the -xi- prefix, however, the use of

Filed under: Basic functions, Econometrics / Statistics | Tagged: factor variables, xi | 6 Comments »