creative destruction: collapse and contract

Creative destruction, coined by Joseph Schumpeter in Capitalism, Socialism, and Democracy, refers to the process by which new innovations kill old inefficient products or processes. But we are not talking about that but instead, of destroying data to create more useful information. By destroying, we mean altering the data currently loaded in memory with no undo button to rely to. When you load or open data into Stata, Stata stores the data in your machine’s RAM. Any changes made, therefore, are not permanent or saved in your hard drive until you call on save, but still be careful that you do not overwrite your raw data files.
Continue reading

dates in Starbucks

In yesterday’s post, one of the needed cleaning was to extract date and day of week from the string:
. list date in 1/5

     |                                       date |
  1. |  Date: August 31, 2015 at 1:42:41 PM GMT+8 |
  2. | Date: August 24, 2015 at 12:36:55 PM GMT+8 |
  3. |    Date: July 27, 2015 at 2:51:27 PM GMT+8 |
  4. |    Date: July 20, 2015 at 2:45:43 PM GMT+8 |
  5. |    Date: July 20, 2015 at 2:07:49 PM GMT+8 |

Continue reading

Using Stata to make sense of my Uber data

I tried Uber in late May and since then it has been 131 Uber rides covering 1,200 kilometers and 80 hours on the road. Uber (and GrabTaxi) has eliminated the wait under the heat (and rain) and the dealing with the assholeness of most taxi drivers here in Metro Manila. But what I love most about Uber, apart from their customer service, is the data they send. Trip receipts are automatically sent as soon as the trip has ended. These do not only show how much I am charged but include time, distance, fare disaggregated by time and distance, and many more. GrabTaxi receipts, on the other hand, only show amount paid and manually encoded by drivers.
Continue reading

-destring- uncomplicated

In a comment to the previous post destring complication: negative numbers, Nick Cox pointed out “the most important advise” in using destring: “never destring, replace unless you are absolutely sure that you are right or are willing to do things again if you made the wrong decision. The generate() option is there for a purpose.”

In addition, his comment point to simpler solution than using regular expressions.

Continue reading

-destring- complication: negative numbers

Less than 2 hours flight…

In a Stata training, one of the students wondered why after importing an Excel file of financial indicators into Stata some were read as strings. A quick browse at the data indicates the presence of hyphens (“-“) and that these were used in different ways: one to indicate a negative number and another to indicate a missing observation.


How do we convert these variables to numeric as destring returns an error?

Continue reading

Import data from Excel sheets

How do we import data from all sheets in a number of Excel files? Each Excel file has a different number of sheets with names of no discernible pattern, but (thankfully) each sheet has the same structure: the first observation is in the same row and the columns correspond to each other. An example is the set of 17 Excel files of census data of barangays (villages) that was provided to me. Each Excel file corresponds to one region and within each file are sheets corresponding to the province in the region.  How do we consolidate all sheets in all files into one data file?

Continue reading

No -usespss- for Mac

In Reading SPSS data file into Stata, I describe Sergiy Radyakin’s -usespss- that loads SPSS data (.sav) into Stata. I was on Windows then. -usespss- is unfortunately not available for Mac OS. StatTransfer could easily do this if you have the software. Another option is to use R.

A quick Google search led me to a simple R routine that does exactly this. Following the steps outlined in Daniel Marcelino’s Loading SPSS (.sav) into Stata, I managed to covert an SPSS data set I downloaded from IDRA UCLA website into a Stata .dta file.

Try it out! It is easy to follow.

An interesting extension is Gabriel Rossman’s importspss.ado (requires R). It implements the R routine as an ado-file.

Error in reading DBF

In Fun with maps in Stata I noted an error that occurs when trying to convert shapefiles from PhilGIS to Stata data files using -shp2dta- (Kevin Crow). It turns out that the column widths are more than 244 (even though the strings in the cells are not long). The easiest way to get around this is to open the dbf file using another program and change the column widths.[1] In my case, I used MS Excel.[2] If you are using Excel 2003, there is a ‘save as dbf’ option. But for some reason, this option is not available in Excel 2007 (what were the geniuses at Microsoft thinking?). Here is where the ExcelToDBF add-in, developed by Andrea Bonfiglio, comes to the rescue.[3] Once you have download and installed the add-in, the ‘DBF IV’ option will be added to the ‘Save As’ menu. To illustrate, here is a screencast.

[1] Some Statalist posts hint that one can use Mata. Didn’t try that.
[2] Al Tongco of PhilGIS also suggested the free spreadsheet program OpenOffice Calc.
[3] I have also tried 2 other add-ins (SaveDBFIV and XLSX2DBF) but they returned an error when I tried to convert a huge file.

Stata-MySQL a first encounter

In late April, I had my first encounter with .myd, .myi, and .frm files. The challenge was to read these directly from Stata so that users can easily work on the data using Stata. I was told that these are SQL databases. I initially thought that this can easily be read using a straightforward use of -odbc- command. But the file extensions .myd, .myi, and .frm are not among the choices avilable in Windows ODBC Data Source Administrator*. Google…google…google.

Thanks to Andrew Dyck’s post Connect to MySQL database using Stata. His Step 0 and Step 1 is exactly the second half of what I needed to do. But, as I had no idea how SQL works, it took me a while to understand what I needed to do in the first half. I confused the term ‘MySQL server’ to a specific physical machine other than my own PC. Eventually I figured I need MySQL server installed. For those who have their first encounters like me, here is the first half: Step -2 and Step -1.

Step -2: Download/Install the MySQL server**
Not knowing any better, I just used all default settings during the installation, and  typed “pass” as password when asked. I did not bother to change the default username “root”.

You did OK if you can open the MySQL Command Line Client (screen shot below) from the start menu. As you can see (or maybe not because they are too small) from the screenshot, I typed “show databases;” to see the databases that the system has access to. I found that only the default MySQL databases are in the list. Why can’t it see my database? Because my folder was in the desktop.

I guess it is like Stata. If you put your data or ado-file in the wrong place, Stata will not be able to find it. Where is the right place?

Step -1: Paste the folder with your data in the right place
That right place can be found in my.ini file (in my case, my.ini file is in “C:\Program Files\MySQL\MySQL Server 5.5″).

#Path to the database root
datadir=”C:\ProgramData\MySQL\MySQL Server 5.5\data”

Note that C:\ProgramData\ is, by default, a hidden folder. If you can’t find it, it is either you have not changed the settings in folder options or you have no admin rights.

Can you change this data directory? Yes. First, you need to stop MySQL from running (closing the MySQL Client window is not enough). You can use Windows task manager to end the program mysqld.exe. Second, move the entire data folder from old path to the new path. Third, change the data directory path in my.ini. In my case, my new data directory path is:


Now, you can move your folder (with the .myd, .myi, and .frm files) to this new data directory. Reboot.

Follow Andrew Dyck’s Step 0 and Step 1. Note that the connectionstring() used in Step 1 was only added to -odbc- late last year (see Stata’s blog). Make sure your Stata is up-to-date.

It also helps to learn a few SQL syntax so that you can load only selected lines or those that meet specific conditions, or load everything (see MySQL documentation). To illustrate, we use the built-in MySQL database information_schema.

Example 1: Load rows 1-20 of table TABLES
odbc load, exec(“SELECT * FROM TABLES LIMIT 20“) conn(“DRIVER={MySQL ODBC 5.1 Driver};SERVER=localhost;DATABASE=information_schema;UID=root;PWD=pass;”) clear

Example 2: Load observations where the variable ROW_FORMAT==”Fixed”
odbc load, exec(“SELECT * FROM TABLES WHERE ROW_FORMAT=’Fixed'”) conn(“DRIVER={MySQL ODBC 5.1 Driver};SERVER=localhost;DATABASE=information_schema;UID=root;PWD=pass;”) clear

Example 3: Load everything from table TABLES
odbc load, exec(“SELECT * FROM TABLES“) conn(“DRIVER={MySQL ODBC 5.1 Driver};SERVER=localhost;DATABASE=information_schema;UID=root;PWD=pass;”) clear

*Control Panel–>Administrtive Tools–>Data Sources (ODBC)

** Complications arise if you have no admin rights to the PC. I am using a 64-bit Windows 7 machine; I am not sure if this works the same in other platforms as well.

-encode- it

Over lunch today, a friend asked how to generate a new variable that will have unique numeric IDs corresponding to the string values in an existing variable.  The first command that comes to mind is -encode-*. -encode- generates a numeric variable from a string variable and uses the string values as labels for the generated numeric values. Its partner, -decode-, does the reverse. To illustrate, let’s use the overused** auto.dta:

sysuse auto, clear
encode make, gen(make_id)

By default, the order of the number generated corresponds to the alphabetical order of the string variable.

What -encode- does is to save you from writing longer codes, such as:

sysuse auto, clear
gen byte make_id = .
replace make_id = 1 if make == “AMC Concord”
replace make_id = 2 if make == “AMC Pacer”
replace make_id = 74 if make == “Volvo 260”
/* By the time you get here, you could have finished an episode of
“The Big Bang Theory” */

label define make 1 “AMC Concord” 2 “AMC Pacer” …
/* and another episode here */


or the more complex but unnecessary

sysuse auto, clear
levelsof make, local(l)
gen byte make_id = .
local id = 1
foreach i of local l{
replace make_id = id' if <em>make </em>== "i'”
label define make id' "i'”, add
local id = `id’ + 1
label values make_id make

Another way is to use -group- under -egen-. Example:

sysuse auto, clear
egen make_id = group(make)

But then you still have to create and attach the value labels to make_id. Nick Cox pointed out in his comment that -group- has a -label- option.

See -help encode- for more options and for its counterpart -decode-.

*I came across -encode- in Christopher Baum’s An Introduction to Stata Programming

**Does one lose a byte when data is overused? Sort of the ‘wear-and-tear’ we see in most things that aren’t invisible.