Using Stata to make sense of my Uber data


I tried Uber in late May and since then it has been 131 Uber rides covering 1,200 kilometers and 80 hours on the road. Uber (and GrabTaxi) has eliminated the wait under the heat (and rain) and the dealing with the assholeness of most taxi drivers here in Metro Manila. But what I love most about Uber, apart from their customer service, is the data they send. Trip receipts are automatically sent as soon as the trip has ended. These do not only show how much I am charged but include time, distance, fare disaggregated by time and distance, and many more. GrabTaxi receipts, on the other hand, only show amount paid and manually encoded by drivers.

Uber sends receipts via email that goes straight to my Mac Mail app. From Mail, it is easy to export filtered mails to a single text file:
1) Filter “Uber receipts” mails using the search tab
2) Edit > Select All (or cmd+a)
3) File > Save As … (name the file, uncheck “Include attachments”, and choose “Plain Text” format)

I have not tried doing this from Gmail or Yahoo mail via their web apps. Another way to save the the receipts as text files is to read them directly from the folder where they are saved. In any case, at least in Mac Mail App, the steps above work.

Below are sample lines from the text file (some details are replaced by xxxxx):
From: Uber Receipts <receipts.manila@uber.com>
Subject: Your Monday afternoon trip with Uber
Date: August 31, 2015 at 1:42:41 PM GMT+8
To: xxxxx@yahoo.com
 
AUGUST 31, 2015 
₱104.00 Thanks for choosing Uber, Mitch 
 
 
01:25pm
xxxxx, Metro Manila, Philippines 
01:42pm
xxxxx, Metro Manila, Philippines 
CAR
uberX 
KILOMETERS
5.36 
TRIP TIME
00:17:01 

FARE BREAKDOWN

Base Fare 
40.00
Distance 
30.53
Time 
34.03
Subtotal
₱104.56
Rounding Down 
-0.56 
CHARGED 
 
Personal •••• xxxxx 
₱104.00 

How do we convert this into a tidy data that we can play with? First, we use insheet or import delimited to load the text file into memory. Everything is stored in 1 column, v1:
. list in 1/30, sep(31) compress
     +------------------------------------------------+
     |                                             v1 |
     |------------------------------------------------|
  1. | From: Uber Receipts <receipts.manila@uber.com> |
  2. |  Subject: Your Monday afternoon trip with Uber |
  3. |      Date: August 31, 2015 at 1:42:41 PM GMT+8 |
  4. |                             To: xxxx@yahoo.com |
  5. |                                AUGUST 31, 2015 |
  6. |      ₱104.00 Thanks for choosing Uber, Mitch |
  7. |                                        01:25pm |
  8. |               xxxxx, Metro Manila, Philippines |
  9. |                                        01:42pm |
 10. |               xxxxx, Metro Manila, Philippines |
 11. |                                            CAR |
 12. |                                          uberX |
 13. |                                     KILOMETERS |
 14. |                                           5.36 |
 15. |                                      TRIP TIME |
 16. |                                       00:17:01 |
 17. |                                 FARE BREAKDOWN |
 18. |                                      Base Fare |
 19. |                                          40.00 |
 20. |                                       Distance |
 21. |                                          30.53 |
 22. |                                           Time |
 23. |                                          34.03 |
 24. |                                       Subtotal |
 25. |                                      ₱104.56 |
 26. |                                  Rounding Down |
 27. |                                          -0.56 |
 28. |                                        CHARGED |
 29. |                    Personal •••• xxxxx |
 30. |                                      ₱104.00 |
     +------------------------------------------------+

Next, we use Stas Kolenikov’s findval (ssc install findval), to identify rows that we will use as references to grab the information we need. For example:

findval "From:", substr gen(from)
gen id = sum(from)

generates a new variable from that returns 1 if the row contains the string “From:”. We can use this then as the indicator for the first line of each trip receipt and from which we can create id per trip.

We can use the following:

findval "Date:", substr gen(datetime)
gen date = v1 if datetime==1

to indicate the row from where we can get the trip date. And the following:

findval "KILOMETERS", substr gen(kmx)
gen km = v1[_n+1] if kmx==1

to indicate the row from where we can get the distance covered during the trip, which is in the line next to the reference row.

We do this for all the rows from where we can get the information we need. The reference rows may also differ depending whether your trip is in surge pricing, or in a promo, or includes other charges (example, toll fee). Yes, it could get messier!

The process of cleaning the data does not end here. What we need in the end is a clean data, that is, each row represents a unique trip and each column represents an attribute of that trip (date, distance, time, fare, etc.). We also want that the variables are stored in the the type we need them to be. For example, fare and distance, among others, must be numeric.

The process we did so far leads us to something like this below, where a single trip is encoded in multiple rows.
. list from datetime kmx date km in 1/30, sep(40) compress

     +-----------------------------------------------------------------------+
     | from   da~me   kmx                                        date     km |
     |-----------------------------------------------------------------------|
  1. |    1       0     0                                                    |
  2. |    0       0     0                                                    |
  3. |    0       1     0   Date: August 31, 2015 at 1:42:41 PM GMT+8        |
  4. |    0       0     0                                                    |
  5. |    0       0     0                                                    |
  6. |    0       0     0                                                    |
  7. |    0       0     0                                                    |
  8. |    0       0     0                                                    |
  9. |    0       0     0                                                    |
 10. |    0       0     0                                                    |
 11. |    0       0     0                                                    |
 12. |    0       0     0                                                    |
 13. |    0       0     1                                               5.36 |
 14. |    0       0     0                                                    |

How do we collapse information for a single trip into one line? What I did was to fill, for each column, all missing rows with the value of the non-missing row:
. by id (date), sort: replace date = date[_N] if date[_N]!=""
. by id (km), sort: replace km = km[_N] if km[_N]!=""

. list from datetime kmx date km in 1/30, sep(40) compress

     +-----------------------------------------------------------------------+
     | from   da~me   kmx                                        date     km |
     |-----------------------------------------------------------------------|
  1. |    0       0     0   Date: August 31, 2015 at 1:42:41 PM GMT+8   5.36 |
  2. |    0       0     0   Date: August 31, 2015 at 1:42:41 PM GMT+8   5.36 |
  3. |    0       0     0   Date: August 31, 2015 at 1:42:41 PM GMT+8   5.36 |
  4. |    0       0     0   Date: August 31, 2015 at 1:42:41 PM GMT+8   5.36 |
  5. |    0       0     0   Date: August 31, 2015 at 1:42:41 PM GMT+8   5.36 |
  6. |    0       0     0   Date: August 31, 2015 at 1:42:41 PM GMT+8   5.36 |
  7. |    0       0     0   Date: August 31, 2015 at 1:42:41 PM GMT+8   5.36 |
  8. |    0       0     0   Date: August 31, 2015 at 1:42:41 PM GMT+8   5.36 |
  9. |    0       0     0   Date: August 31, 2015 at 1:42:41 PM GMT+8   5.36 |
 10. |    0       0     0   Date: August 31, 2015 at 1:42:41 PM GMT+8   5.36 |
 11. |    0       0     0   Date: August 31, 2015 at 1:42:41 PM GMT+8   5.36 |
 12. |    0       0     1   Date: August 31, 2015 at 1:42:41 PM GMT+8   5.36 |
 13. |    0       1     0   Date: August 31, 2015 at 1:42:41 PM GMT+8   5.36 |
 14. |    0       0     0   Date: August 31, 2015 at 1:42:41 PM GMT+8   5.36 |

From here it is easy to keep one row for each trip. One way to do that is to keep the last observation:

by id, sort: keep if _n==_N

Are we done? Not quite. More janitorial work are still needed: dropping unnecessary columns, changing string to numeric using destring, getting rid of weird characters such as “DZ” using subinstr(), formatting date and time variables, and generating variables such as average speed and indicator for peak hours.

So what is the point of all of these… despite knowing how bad the traffic is here in Metro Manila, it still surprised some people when I showed them how slow it can go. My average speed for all 131 rides is 16kph. This goes down to 11kph during peak hours ( 7:30-9:30am and 4:30pm-9:30pm). Off-peak average speed is 17kph. To put into context, average speed in New York is 28kph (17.6mph).

uberscatter

averagespeedtime

whattgif
***********************************
* Draw graphs
***********************************

*** Can't get pass 40...
sum kph
loc meankph = string(round(`r(mean)', 1))
sum kph if peak==1
loc meankphpeak = string(round(`r(mean)', 1))
sum kph if peak==0
loc meankphnpeak = string(round(`r(mean)', 1))

scatter kph d, ms(dh) || ///
    scatter kph d if peak==1, ms(d) mc(brown) ///
    xsize(2) ysize(2) xl(, labsize(*.5)) ///
    yl(0(20)80 `meankphpeak' `meankphnpeak')  ///
    yline(`meankphpeak', lp(-) lcolor(maroon)) ///
    yline(`meankphnpeak', lp(-) lcolor(emerald)) ///
    yt("Average speed (kph)") xt("") ///
    legend(order(2 "Peak hours") ring(0) pos(10) size(*.6) ///
        col(5) symxsize(*.2) symysize(*.2) region(lcolor(none))) 

*** Go to work early, leave late!
graph bar (mean) kph if (day!=0 & day!=6) & holiday==0, ///
    over(timecat) ///
    asyvar bargap(10)  blabel(bar, format(%4.0f) pos(outside)) ///
    yt("Average speed (kph)", size(*.7)) ///
    xsize(2) ysize(2) ///
    legend(col(5) symxsize(*.2) symysize(*.5) size(*.5) region(lcolor(none))) ///
    title("Go to work early, leave late!", size(*.8)) ///
    subtitle("Weekdays (excluding holidays)", size(*.5)) 

*** What TGIF?!
graph bar (mean) kph if holiday==0, ///
    over(day, relabel(1 "sun" 2 "mon" 3 "tue" 4 "wed" 5 "thu" 6 "fri" 7 "sat"))  ///
    yt("Average speed (kph)", size(*.7)) ///
    xsize(2) ysize(2) ///
    title("What TGIF?!", size(*.8)) 

What have I learned from my Uber data so far…

That there is no point bragging about owning a high-powered car
That cursing binge happens almost twice a month (when average speed falls below 8kph)
That Friday sucks
That I have wasted on average 1.25 hours in traffic per day that I am on the road
That over 3 workdays per month over the last 3 months was lost to traffic
That if the same trend persists, i am implicitly being taxed half a month worth of productivity for a year (may I refund this from the BIR? DOTC?)

Assuming that my travel pattern is average (which is not because I work from home half of the time and as much as possible I avoid peak hours… leaving at 6am for a 9am meeting… better to lounge in Starbucks than being stuck in traffic), 3 work days lost to traffic per month multiplied by the millions of commuters is fatal to the economy. And cursing binge is fatal to sanity!

Tax payers (with taxable income of at least P500,000 annually) already lose 3 months of the year worth of income to income taxes (add to that VAT), must they also need to lose half a month to traffic?

Related Facebook posts:

Airport taxis: why are they making us pay more?

Patak bawat kilometro bawat minuto

The time-wasted-due-to-traffic question and Uber data

Mga araw at oras na mura ka ng mura…

Para saan ang pangharurot na makina ng kotse mo…

4 Responses

  1. […] }); Great TR! For Filipino traffic insights, you might want to read this (i.e. average speed in traffic is 10 miles/hour for that time of day for this guy's Uber rides, […]

  2. Uber cool!

  3. Thanks for your post! I clean a lot of data for public health use and the examples of findval and keep if _n = _N were great. More like this please!

Leave a Reply