creative destruction: collapse and contract


Creative destruction, coined by Joseph Schumpeter in Capitalism, Socialism, and Democracy, refers to the process by which new innovations kill old inefficient products or processes. But we are not talking about that but instead, of destroying data to create more useful information. By destroying, we mean altering the data currently loaded in memory with no undo button to rely to. When you load or open data into Stata, Stata stores the data in your machine’s RAM. Any changes made, therefore, are not permanent or saved in your hard drive until you call on save, but still be careful that you do not overwrite your raw data files.

collapse and contract are two of the commands that destroy the current data in memory to create a new data that is more useful for the user. What is the difference between the two? collapse creates a dataset of summary statistics (e.g., mean, median, iqr) while contract creates a dataset of frequences and percentages.

contract is less used than collapse, at least in my experience, because frequency tables are usually generated for at most 2 variables and this can be done using tabulate. For example, using my Uber rides data:
. tab  timecat

    timecat |      Freq.     Percent        Cum.
------------+-----------------------------------
    5-7:30a |         26       16.46       16.46
 7:30-9:30a |         18       11.39       27.85
9:30a-4:30p |         79       50.00       77.85
 4:30-9:30p |         30       18.99       96.84
  9:30p-12a |          5        3.16      100.00
------------+-----------------------------------
      Total |        158      100.00

. tab day timecat

           |                        timecat
       day |   5-7:30a  7:30-9:30  9:30a-4:3  4:30-9:30  9:30p-12a |     Total
-----------+-------------------------------------------------------+----------
    Sunday |         0          2          5          0          0 |         7 
    Monday |         5          1         19          5          1 |        31 
   Tuesday |         7          1         13          2          1 |        24 
 Wednesday |         4          1         15          9          2 |        31 
  Thursday |         7          3         10          5          1 |        26 
    Friday |         3          7         11          6          0 |        27 
  Saturday |         0          3          6          3          0 |        12 
-----------+-------------------------------------------------------+----------
     Total |        26         18         79         30          5 |       158 

tabulate is not destructive. It returns a table as above in the results window and keeps the data in memory in tact. tabulate, however, is limited up to two-way contingency tables. Suppose you want to look at frequency table for combinations of more than 2 categorical variables, for example, similar to the output of proc freq in SAS that allows mulitple variables (this was pointed to me by Utsav, a former colleague in ADB).

The example below shows the frequency, cumulative frequency, percentage, and cumulative percentage of all combinations of categorical variables day, peak, and surgedum.
. contract  day peak surgedum, cfreq(cumfreq) percent(percentage) cpercent(cumpercent)
. list

     +---------------------------------------------------------------------+
     |       day   peak   surgedum   _freq   cumfreq   percen~e   cumper~t |
     |---------------------------------------------------------------------|
  1. |    Sunday      0          0       5         5       3.16       3.16 |
  2. |    Sunday      0          1       2         7       1.27       4.43 |
  3. |    Monday      0          0      21        28      13.29      17.72 |
  4. |    Monday      0          1       5        33       3.16      20.89 |
  5. |    Monday      1          0       3        36       1.90      22.78 |
     |---------------------------------------------------------------------|
  6. |    Monday      1          1       2        38       1.27      24.05 |
  7. |   Tuesday      0          0      20        58      12.66      36.71 |
  8. |   Tuesday      0          1       1        59       0.63      37.34 |
  9. |   Tuesday      1          0       3        62       1.90      39.24 |
 10. | Wednesday      0          0      20        82      12.66      51.90 |
     |---------------------------------------------------------------------|
 11. | Wednesday      0          1       1        83       0.63      52.53 |
 12. | Wednesday      1          0      10        93       6.33      58.86 |
 13. |  Thursday      0          0      18       111      11.39      70.25 |
 14. |  Thursday      1          0       5       116       3.16      73.42 |
 15. |  Thursday      1          1       3       119       1.90      75.32 |
     |---------------------------------------------------------------------|
 16. |    Friday      0          0      13       132       8.23      83.54 |
 17. |    Friday      0          1       2       134       1.27      84.81 |
 18. |    Friday      1          0       7       141       4.43      89.24 |
 19. |    Friday      1          1       5       146       3.16      92.41 |
 20. |  Saturday      0          0      12       158       7.59     100.00 |
     +---------------------------------------------------------------------+

Note that by using contract, what is left of your data in memory is that table above. This, of course, is a nonissue if you have written all your cleaning and other data management stuff in a do-file and you have kept your raw data files in their original form.

collapse, on the other hand, creates a database of commonly used summary statistics. For example:
. collapse (mean) charged kph (max) maxkph=kph (min) minkph=kph, by(day)
. format  charged kph maxkph minkph %4.1f
. list

     +----------------------------------------------+
     |       day   charged    kph   maxkph   minkph |
     |----------------------------------------------|
  1. |    Sunday     158.4   25.3     36.1     16.5 |
  2. |    Monday     183.7   17.5     31.2      4.6 |
  3. |   Tuesday     191.5   15.6     27.8      9.0 |
  4. | Wednesday     160.8   15.1     31.2      7.6 |
  5. |  Thursday     215.6   14.8     25.9      5.7 |
     |----------------------------------------------|
  6. |    Friday     160.8   14.0     21.8      5.2 |
  7. |  Saturday     141.6   16.8     35.3      9.1 |
     +----------------------------------------------+

But if you only need the summary table, there is actually no need to collapse your data. You can use table to arrive at the same table:
. table day, c(mean charged mean kph max kph min kph) format(%4.1f)

----------------------------------------------------------------------
      day | mean(charged)      mean(kph)       max(kph)       min(kph)
----------+-----------------------------------------------------------
   Sunday |         158.4           25.3           36.1           16.5
   Monday |         183.7           17.5           31.2            4.6
  Tuesday |         191.5           15.6           27.8            9.0
Wednesday |         160.8           15.1           31.2            7.6
 Thursday |         215.6           14.8           25.9            5.7
   Friday |         160.8           14.0           21.8            5.2
 Saturday |         141.6           16.8           35.3            9.1
----------------------------------------------------------------------

See help contract and collapse for more options.

Again, do not overwrite your raw data files!

TBBT Season 9 on TV. Leonard and Penny are finally married… =D

One Response

  1. The words “destroy” and “destruction” are useful if only, but not only, as a reminder to users that the original dataset disappears. But, as your post emphasises, there is an entirely positive aspect to this.

    Although it’s customary to talk about “data analysis”, at least since the 1960s, earlier scientists often (and I think more commonly!) talked about “data reduction”, meaning that original measurements need to be boiled down to a much smaller presentation or results set (resultsset!). That’s still a very useful expression. I tend to explain -collapse- and -contract- and indeed others such as -statsby- as reduction commands.

    I wrote the predecessor of -contract- partly because the -collapse- command was (then) so slow for the same task. Another motive for using -contract- is to get data in a form suitable for loglinear models for categorical data (in which frequency is the response variable and you have categorical covariates). For this purpose explicit zeros (for cross-combinations of covariates that do not occur) are crucial and that is a specific task where -contract- does something that -collapse- cannot do.

Leave a Reply