Getting to know “factor variables”

This is an update to the earlier post i. without the prefix -xi-. So the i.‘s (or “i options” as Joe Glass called it) have a name. Stata calls them “factor variables” and there is more to them than i. .See -help fvvarlist- for the documentation and some very helpful examples.

World Bank’s open data policy and -wbopendata-

Last year, a friend from the World Bank (Manila) sent an email about World Bank’s open access policy that allows free download access to thousands of indicators from the World Bank data catalog. As I always had access to World Bank data sets via our institution’s subscription, I took this information for granted. This is not to say that I ignore the implications of this initiative. The World Bank model puts pressure on governments and other development agencies to follow. It is odd that there are still countries in the world today where economic data, such as GDP or inflation data, are not made public.

It is only a matter of time that applications, not only to automate data download, but also to present these wealth of information in ingenious ways will be made available. ESRI, for example, published a free web application that maps any one of more than a thousand economic and financial indicators for any region of the world. In the screen shot below, the size of the bubbles represents workers’ remittances inflows to countries in Asia.

To bring out the best of ideas, the World Bank initiated the “Apps for Development” competition, a challenge to software developers and development practitioners to create innovative apps using World Bank data (vote for your favorite apps here).

For Stata users, -wbopendata- (J.P. Azevedo 2011) is the module to access data from World Bank data catalog. -wbopendata- is easy to use but note that it requires an internet connection. First, install -wbopendata- via SSC:

ssc install wbopendata

-wbopendata- allows you to download (i) all indicators for a specific country for all years or (ii) a specific indicator for all countries and for all years or (iii) a set of indicators within a specified topic for all countries and for all years. -wbopendata- loads the data into the Stata memory. For example, to download all data available for the Philippines for all years, type:

wbopendata, country(phl) clear

This returns data for 972 indicators from 1960 to the latest year available. The default data display is in wide format. To display the data in a long format, use the ‘long’ option:

wbopendata, country(phl) long clear

To download GDP per capita (in constant PPP $) for all countries, type:

wbopendata, indicator(ny.gdp.pcap.pp.kd) clear

Lastly, to download all indicators under the topic “Poverty” for all countries, type:

wbopendata, topics(11) clear

The list of countries, topics, and indicators and their corresponding codes are documented in the help file (see -help wbopendata-). -wbopendata- also has other options not mentioned here.

ComtradeTools and Stata: Automating UN Comtrade data downloads

ComtradeTools (developed by UN) is a command line program that allows you to obtain data via UN Web Services and to convert the data into CSV format (or to an SQL server). You can use Stata’s -shell- command to run ComtradeTools.

First, read the instructions here. Download the latest version (March 2010) of ComtradeTools here and the required Microsoft .NET Framework (version 1.1.4322.573) here.

The command line parameters are listed in /help which you can copy into a text file. To copy ComtradeTools /help contents to ComtradeTools_Help.txt:

cd “C:\Program Files\UNSD\ComtradeTools\”
shell ComtradeTools /help >>U:\Data\Comtrade\ComtradeTools_Help.txt

/help does not enumerate all possible entries for each parameter, but you can find them in UN Comtrade’s website. Some are listed below:

/r: Reporter Code. To get the list of countries and their corresponding codes:

copy U:\Data\Comtrade\Countrylist.xml //You can open this in Excel

/y: Year. 4-digit year.

/px: Classification.
SITC Rev.3=S3
SITC Rev.2=S2
SITC Rev.1=S1

/cc: Commodity Code. For example, to get commodity codes for HS1996 and SITC Rev. 2:

local classification H1 S2
foreach c of local classification{
copy'" <em>U:\Data\Comtrade\Commoditylist_</em>c’.xml

/rg: Trade Flow.

Try to download trade data (SITC Rev. 2) for the Philippines for the year 2007 and load it to Stata.

shell ComtradeTools /r:608 /y:2007 /px:S2 /action:DownloadAndConvertToCSV /outputDirectory:U:\Data\Comtrade\
insheet using U:\Data\Comtrade\S22007608_CSV.txt, comma clear

Note that if you do not specify /outputDirectory, the files will be saved in the current working directory, that is, where the program ComtradeTools is saved.

Now, it is easy to download data for multiple countries and years by using loops. For example, to download trade data (SITC Rev. 2) for the Philippines every 5 years from 1980 to 2005 and save them as dta files:

forvalues y=1980(5)2005{
shell ComtradeTools /r:608 /y:y' <strong>/px:</strong><em>S2</em> <strong>/action:</strong>DownloadAndConvertToCSV <strong>/outputDirectory:</strong><em>U:\Data\Comtrade\</em>
<strong>insheet </strong>using <em>U:\Data\Comtrade\S2</em>
y’608_CSV.txt, comma clear
save U:\Data\Comtrade\S2y'<em>608.dta</em>, replace
<strong>erase </strong><em>U:\Data\Comtrade\S2</em>

Reading IMEUS

I am currently reviewing econometrics by reading Christopher Baum’s An Introduction to Modern Econometrics Using Stata.* Although linear regression is not discussed until chapter 4, chapters 1 to 3 (particularly chapter 3) are equally important (I finished chapter 3 last week…long way to go given my current rate of <1 chapter a week). Afterall, before anyone complicates his life with all those regressions, one needs to be sure that the data is “clean” and structured in such a way that it is fit for the analyis required. In fact, I think that the time spent on data management is so much more than the time spent on the actual data analysis.

Here are some of the things I remember about the first three chapters of the book. The discussion of Stata’s features in chapter 1 made me appreciate Stata even more. Chapter 2 outlines the basic tools one needs to learn to efficiently work with Stata. This chapter provided me very helpful tips on how to handle missing data and dates. And in chapter 3, I especially liked the section on data validation, which introduced me to the command -assert-. If only I knew -assert- when I was “cleaning” the QIDS dataset (7 years ago), it would have been easier to do all those consistency checks. Instead of running each line one by one, using -assert- would have allowed me to run the whole do-file but will stop and let me know whenever conditions are not met. So much for regret…

On to the next chapter…

*Thanks to Ellen and Jo Cain for leaving me this book before they left for the US :)

Find that variable fast with -lookfor-

Looking for a certain variable in a dataset with so many variables, even if you yourself have created these variables, can sometimes test your patience. Use -lookfor- to minimize that stress. But, if in the first place you have created variable names and variable labels that do not make sense—for example, v1 with label “variable 1“—you deserve that stress!

-lookfor- is like Ctrl+f in some applications. It helps you find the variables whose names or labels (or a part of their names or labels) match your string search criteria. -lookfor- searches for any match in all variable names and labels in the dataset and returns the list of resulting variable names and their corresponding descriptions. Example:

sysuse census.dta, clear
lookfor pop
lookfor POP
lookfor census

Notice that -lookfor- is not case sensitive—looking for “pop” is the same as looking for “POP.” Alternately, to look for all occurrences of “pop” or “census,” you may type:

lookfor pop census

See “help lookfor” for more.

Don’t forget to take notes

Writing notes in your data is always a good idea. No matter how many hours you have spent working on your data, there is always the possibility of forgetting important details, such as, data source, definitions, adjustments you made, etc. To show how to create notes in your data, here is an example using Stata’s lifeexp.dta.

webuse lifeexp

-notes- will return the following:

1.  Source: The World Bank Group, Learning Modules,

1.  Population Growth rate, average annual growth % 1980-1998

1.  Life Expectancy at Birth, 1998

1.  GNP per capita, 1998

1.  Access to safe water, % of population, 1990-96

How were these notes written?

notes: Source: The World Bank Group, Learning Modules,
notes popgrowth: Population Growth rate, average annual growth % 1980-1998
notes lexp: GNP per capita, 1998
notes safewater: Access to safe water, % of population, 1990-96

Now, create your own notes (and don’t forget to save).

Another case for -insheet-

Recently, a friend showed me a text file with data that looks like the one below (the data below is a subset of auto.dta from Stata website and saved as auto.txt):

“AMC Concord,4099,22,3,2.5”
“AMC Pacer,4749,17,3,3”
“AMC Spirit,3799,22,,3”
“Buick Century,4816,20,3,4.5”
“Buick Electra,7827,15,4,4”
“Buick LeSabre,5788,18,3,4”
“Buick Opel,4453,26,,3”
“Buick Regal,5189,20,3,2”
“Buick Riviera,10372,16,3,3.5”
“Buick Skylark,4082,19,3,3.5”
“Cad. Deville,11385,14,3,4”
“Cad. Eldorado,14500,14,2,3.5”
“Cad. Seville,15906,21,3,3”
“Chev. Chevette,3299,29,3,2.5”
“Chev. Impala,5705,16,4,4”

The data seems to be comma-delimited but -insheet- will fail to load this data properly. Why? Because of the quotation marks (). -insheet- will read everything inside a ” “ as a single string. The solution therefore is to delete all . To do this, you may open the text file using a notepad and then find-replace (Ctrl+H) all with a blank. Without the , you may now use -insheet- to load the file:

insheet using auto.txt, comma names
/* the option “comma” is optional /
the option “names” indicate the the first line of the text file — make,price,mpg,rep78,headroom — are the variable names */

An alternate solution involves 3 steps, but without using other programs (i.e. text editors) besides Stata:

(1) load the original text file using -insheet-;

insheet using auto.txt, nonames clear
/* -insheet- will read the data as a single variable and by specifying the option “nonames”, it will call the variable v1. This is important; otherwise, the variable name will be the concatenated variables names without the commas (because commas are not allowed in variable names): makepricempgrep78headroom, which will be difficult to separate. */

(2) save it (as a new file or replace the old file) without the quotation marks using -outsheet-; and,

outsheet using auto.txt, noquote nonames replace
/* without the “noquote” option, the quotation marks will be retained /
the “nonames” options will exclude the generated variable name “v1” */

(3) load the correct file comma-delimited file using -insheet-

insheet using auto.txt, comma names clear
/* this should load the file correctly */

This is particularly helpful when you have a huge text file that makes it difficult or impossible to open in text file editors such as Notepad.

Ways to count the number of unique values in a variable

There are at least 3 convenient ways to count the number of distinct values contained in a variable: -tab-, -inspect-, and -codebook-.

tab varname, nofreq
display r(r)

The option nofreq supresses the reporting of the frequency table. Besides displaying output in the results window, Stata stores the results of some commands so that you can use them in subsequent commands. Results of r-class commands, such as -tab-, are stored in r(). In the expample above, display r(r) returns the number of rows in the table, that is, the number of unique observations for variable varname. The problem with using -tab- to count the unique number of values is its row limits: 12,000 rows (Stata/MP and Stata/SE), 3,000 rows (Stata/IC), or 500 rows (Small Stata).

inspect varlist
display r(N_unique)

Besides reporting the number of unique values, -inspect- also reports: the number of negative, zero,  positive, and missing values. It also draws a histogram. There is no need for r(N_unique) if the number of unique values is less than or equal to 99 as -inspect- reports the actual number. But if the number of unique values is more than 99, it will return “More than 99 unique values”. In this case, you need to type the second line.

codebook varlist

-codebook- also provide other summaries besides unique values: type of variable (numeric, etc), the range of values, mean, standard deviation, missing values, and some percentiles.

Note: If varlist is not specified in -inspect- and -codebook-, the commands will return the reports for all variables.

Loading and storing data to and from Stata and Mata

mkmat varlist, matrix(M)    /creates Stata matrix M/
M=st_matrix(“M”)        /loads matrix M to Mata/

st_matrix(“M”, M)
svmat M    /loads matrix M back to Stata/


M=st_data(.,.)   /reads current Stata data as matrix M in Mata/

st_store(.,1..cols(M), M)    /loads matrix M back to Stata/

My -fpref-: anti -fren-

Yesterday, I used -fren- to delete the prefix “mus” in the file names of all data and do-files used in the book “Microeconometrics Using Stata” (MUS). Now, I want them back! -fren- is not helpful. So, here is the little program -fpref- that I wrote (my first ado-file) to add a prefix to file names by batch:


[Note: I had to paste it as an object to retain indentations. While indentations are purely aesthetics, a program (like a mathematical proof) doesn’t look right without them. Have not found a way yet to keep indentations (and font style) here for free.]

The first line “capture prog drop fpref” drops the program called -fpref-, if it exists. If it does not exist, Stata will not return an error and will continue to read the next line because of the command -capture- (this line can be deleted in the final version of the ado file). In the next line “prog define fpref”, the command is named as -fpref-. By typing “version 10” in the third line, the command that was named -fpref- is compatible with Stata 10 or newer versions of Stata. The next line defines the syntax. The body of the program follows the syntax. In this program, the first part of the body, with the while{} and if{} loops, returns an error if the operating system used is not Windows or the prefix is missing or separated by blank spaces. The second part of the body concatenates prefix to the old file names, which we have listed in a local macro. Finally, a Stata program always ends with an “end”.

I have saved my fpref.ado (and its corresponding help file fpref.sthlp) and changed my directory. I can now type “help fpref”, which will open the -help- window with the following information:

-help fpref-

To bring back the prefix “mus” to my files:

fpref dta, prefix(mus)  /* adds the prefix “mus” to all *.dta files in the current directory */

fpref do, prefix(mus)  /* adds the prefix “mus” to all *.do files in the current directory */

[Note: fileextensions can by anything, e.g., doc, xls, txt; and prefix can be any character or string (without spaces) that is allowed in a file name.]

Now, all my MUS file names are as they were. Happiness…