Mini Project 01: Fiscal Characteristics of Major US Public Transit Systems
Introduction
This project was taken as inspiration from a YouTuber City Nerd, who created a video about the “10 Transit Services That Do Huge Numbers at the Farebox”. Farebox Recovery is the fraction of revenues raised from fares instead of taxes.
For this project, we will use data from the National Transit Database as our primary source. The following tables/reports will be used:
The 2022 Fare Revenue table, in which we will mostly explore revenue.
The lastest Monthly Ridership tables, in which we will mostly explore the Unlinked Passenger Trips (UPT) and Vehicle Revenue Miles (VRM).
We will use the 2022 version of all reports, as up-to-date data and newer reports are often uploaded on a lag.
Getting Started
First, we must download, clean, and join the tables.
Unfortunately, the code that allowed the immediate download of the data sets did not work for me. So instead, I had to download the data and import it into R, which is seen in my code below.
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.4.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)library(readr)FARES <-read_excel("Miniproject001/2022 Fare Revenue (1).xlsx") |>select(-`State/Parent NTD ID`, -`Reporter Type`,-`Reporting Module`,-`TOS`,-`Passenger Paid Fares`,-`Organization Paid Fares`) |>filter(`Expense Type`=="Funds Earned During Period") |>select(-`Expense Type`) |>group_by(`NTD ID`, # Sum over different `TOS` for the same `Mode``Agency Name`, # These are direct operated and sub-contracted `Mode`) |># of the same transit modality# Not a big effect in most munis (significant DO# tends to get rid of sub-contractors), but we'll sum# to unify different passenger experiencessummarize(`Total Fares`=sum(`Total Fares`)) |>ungroup()
`summarise()` has grouped output by 'NTD ID', 'Agency Name'. You can override
using the `.groups` argument.
Rows: 3744 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Agency, City, State, Organization Type, Reporter Type, UZA Name, M...
dbl (13): NTD ID, Report Year, UACE Code, Primary UZA Population, Agency VOM...
lgl (7): Vehicle Operations Questionable, Vehicle Maintenance Questionable,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
`summarise()` has grouped output by 'NTD ID'. You can override using the `.groups` argument.
Now, we must extract the monthly transit numbers. My computer struggled to read the whole file for ridership, so I split it into two respective files, one for UPT and one for VRM. This is seen in my code below.
TRIPS <-read_excel("Miniproject001/Ridership_UPT.xlsx") |>filter(`Mode/Type of Service Status`=="Active") |>select(-`Legacy NTD ID`, -`Reporter Type`, -`Mode/Type of Service Status`, -`UACE CD`, -`TOS`) |>pivot_longer(-c(`NTD ID`:`3 Mode`), names_to="month", values_to="UPT") |>drop_na() |>mutate(month=my(month)) # Parse _m_onth _y_ear date specsMILES <-read_excel("Miniproject001/Ridership_VRN.xlsx") |>filter(`Mode/Type of Service Status`=="Active") |>select(-`Legacy NTD ID`, -`Reporter Type`, -`Mode/Type of Service Status`, -`UACE CD`, -`TOS`) |>pivot_longer(-c(`NTD ID`:`3 Mode`), names_to="month", values_to="VRM") |>drop_na() |>group_by(`NTD ID`, `Agency`, `UZA Name`, `Mode`, `3 Mode`, month) |>summarize(VRM =sum(VRM)) |>ungroup() |>mutate(month=my(month)) # Parse _m_onth _y_ear date specs
`summarise()` has grouped output by 'NTD ID', 'Agency', 'UZA Name', 'Mode', '3
Mode'. You can override using the `.groups` argument.
Here, I decided to rename three of the columns in my table, one to remove spaces so it is easier to manipulate in code, and the other to have more common names for easier understanding. Thus, I wanted:
First, I needed to know what the unique codes were used in the Mode column in our data set. To do this, I ran the following command, which produced the following 18 codes:
distinct(USAGE, Mode)
# A tibble: 18 × 1
Mode
<chr>
1 DR
2 FB
3 MB
4 SR
5 TB
6 VP
7 CB
8 RB
9 LR
10 YR
11 MG
12 CR
13 AR
14 TR
15 HR
16 IP
17 PB
18 CC
Using the Glossary on the National Transit Database website, I forced search each of these codes to find their corresponding name. Then, I recoded my table to display these names.
# A tibble: 1 × 2
Agency VRM
<chr> <dbl>
1 MTA New York City Transit 10832855350
Thus, the transit agency that had the most total VRM in our data set was MTA New York City Transit, with a Vehicle Revenue Miles of 10832855350. This makes sense, as public transit easily spans the entirety of New York City, whether it be the subway, buses, the ferry, or more. Since public transit is prevalent in our lives as New Yorkers, with many not even owning their own vehicle and solely relying on public transit, it is understandable that NYC is top of the list.
What transit mode had the most total VRM in our data set?
So, in May 2024, 237383777 unlinked passenger trips were taken on the NYC subway (Heavy Rail). Again, I believe that this checks out, especially with the way many New Yorkers rely on the subway to get around Manhatten during the work day.
How much did NYC subway ridership fall between April 2019 and April 2020?
# A tibble: 1 × 1
april_difference
<dbl>
1 -296416858
Here, we can see that 296416858 less riders used the NYC Subway in April 2020 than April 2019. This makes a lot of sense, as April 2020 was the true start of lock down we faced because of the Covid-19 Pandemic. Whether riders were terrified to leave their homes in fear of contracting the virus or they were listening to the mandate to stay inside unless it is for an emergency, it is not a shock that the MTA had almost 300 million less riders in comparison.
To further prove this point, let’s consider the same comparison, but for May 2019 and May 2020.
# A tibble: 1 × 1
may_difference
<dbl>
1 -295050652
The month of May also experienced close to 300 million less riders in May 2020 than May 2019. I am sure if we were to investigate the following months as well, we would come to a similar conclusion. It is truly interesting to see the major impact that the pandemic caused in such a short period of time!
Task 4: Explore and Analyze
What Metro Area has the most total amount of Unlinked Passenger Trips? Do they also have the most total Vehicle Revenue Miles?
# A tibble: 1 × 2
Metro_Area most_miles
<chr> <dbl>
1 New York--Jersey City--Newark, NY--NJ 21190345637
The Metro Area that had the most total amount of Unlinked Passenger Trips was New York–Jersey City–Newark, NY–NJ, with 84020935224 trips. This metro area also has the most total Vehicle Revenue Miles, with 21190345637 miles. It is interesting to see that this metro area is dominant on both lists, especially with the NYC MTA also leading other lists prior.
What Date had the most total amount of Vehicle Revenue Miles? What date had the least?
The date with the most total Vehicle Revenue Miles is October 1, 2019. The date with the least is April 1, 2020. Though I am unsure why October 1, 2019 was a popular date to travel, as there was no holidays or world events happen that day, it makes sense why April 1, 2020 was the lowest, due to COVID.
Which Metro Area contributed the most to the busiest day of the year in regard to Unlinked Passenger Trips? ? Which contributed the least?
# A tibble: 1 × 2
Metro_Area most_metro
<chr> <dbl>
1 Decatur, AL 25944
On the busiest day of the year for Vehicle Revenue Miles, New York–Jersey City–Newark, NY–NJ contributed the most with 87400676 miles. However, Decatur, AL contributed the least, with only 25944 miles. The New York–Jersey City–Newark, NY–NJ area providing the most to this list is not shocking, as it is also the leader on the list for total VRM.
Task 5: Table Summarization
Now, we will create a new table from USAGE that also has annual total UPT and VRM for 2022.
`summarise()` has grouped output by 'NTD ID', 'Agency', 'Metro_Area', 'Mode',
'Unlinked_Passenger_Trips'. You can override using the `.groups` argument.
When I attempted to run the given code the first time, I discovered that the tables were not joining together with all of their values. I quickly realized that this was because the Modes in the Financial table was still known as their short names. For example, the “Heavy Rail” was still referred to as “HR”. Thus, I ran the same code that we did on the USAGE table earlier to manipulate these codes to reflect their proper names. After this, the table formed with no issue.
`summarise()` has grouped output by 'Agency'. You can override using the
`.groups` argument.
print(MTSUPT)
# A tibble: 1,129 × 3
# Groups: Agency [525]
Agency Mode mosttsupt
<chr> <chr> <dbl>
1 MTA New York City Transit Heavy Rail 1.79e9
2 MTA New York City Transit Bus 4.59e8
3 Los Angeles County Metropolitan Transportation Authority Bus 1.94e8
4 Chicago Transit Authority Bus 1.40e8
5 New Jersey Transit Corporation Bus 1.13e8
6 Chicago Transit Authority Heavy Rail 1.04e8
7 MTA Bus Company Bus 1.00e8
8 Washington Metropolitan Area Transit Authority Heavy Rail 9.84e7
9 Southeastern Pennsylvania Transportation Authority Bus 9.66e7
10 Washington Metropolitan Area Transit Authority Bus 8.99e7
# ℹ 1,119 more rows
The transit system, agency and mode, that had the most UPT in 2022 was the MTA New York City Transit with their Heavy Rail, with a total UPT of 1793073801. This number is way larger than the following on the list, which interestly enough also belongs to the MTA New York City transit and their Bus system.
Which transit system (agency and mode) had the highest farebox recovery, defined as the highest ratio of Total Fares to Expenses?
`summarise()` has grouped output by 'Agency'. You can override using the
`.groups` argument.
print(highest_farebox)
# A tibble: 146 × 3
# Groups: Agency [112]
Agency Mode highest_ratio
<chr> <chr> <dbl>
1 Anaheim Transportation Network Bus 0.865
2 City of Gainesville, FL Bus 0.548
3 MTA New York City Transit Heav… 0.435
4 Massachusetts Bay Transportation Authority Heav… 0.375
5 Woods Hole, Martha's Vineyard and Nantucket Steamship Au… Ferr… 0.335
6 Metro-North Commuter Railroad Company, dba: MTA Metro-No… Comm… 0.331
7 Centre Area Transportation Authority Bus 0.324
8 MTA Long Island Rail Road Comm… 0.286
9 Southeastern Pennsylvania Transportation Authority Heav… 0.253
10 Regional Transportation Commission of Southern Nevada Bus 0.252
# ℹ 136 more rows
Here, we can clearly see that the transit system with the highest farebox recovery is Anaheim Transportation Network with their bus transportation. The fairbox recovery for each transit system is found by dividing their Total Fares by their Expenses. Interestingly enough, Anaheim Transportation Network and their buses have a ratio of .865, which is about a 57% greater than the second place spot, which is the City of Gainesville, FL with their buses!
Which transit system (agency and mode) has the lowest expenses per UPT?
`summarise()` has grouped output by 'Agency'. You can override using the
`.groups` argument.
print(low_expense_UPT)
# A tibble: 146 × 3
# Groups: Agency [112]
Agency Mode lowest_exp_p_upt
<chr> <chr> <dbl>
1 Anaheim Transportation Network Bus 12.8
2 University of Georgia Bus 14.9
3 University of Michigan Parking and Transportation Ser… Bus 16.2
4 Town of Blacksburg Bus 17.5
5 Ames Transit Agency Bus 21.5
6 Centre Area Transportation Authority Bus 23.5
7 MTA New York City Transit Heav… 31.1
8 Greater Lafayette Public Transportation Corporation Bus 31.5
9 San Diego Metropolitan Transit System Ligh… 31.6
10 Champaign-Urbana Mass Transit District Bus 39.8
# ℹ 136 more rows
Similar to the last inquiry , the transit system that has the lowest expenses per UPT is Anaheim Transportation Network with their bus transportation. Their ratio of expenses per UPT is 12.8, which is drastically less than the highest. The highest expense per UPT is Northeast Illinois Regional Commuter Railroad Corporation and their Cable Cars, with a ratio of 600, which is an extreme difference between the two.
Which transit system (agency and mode) has the highest total fares per UPT?
`summarise()` has grouped output by 'Agency'. You can override using the
`.groups` argument.
print(high_fare_UPT)
# A tibble: 146 × 3
# Groups: Agency [112]
Agency Mode high_fare_p_upt
<chr> <chr> <dbl>
1 New Jersey Transit Corporation Bus 536.
2 Northeast Illinois Regional Commuter Railroad Corporat… Comm… 265.
3 Metro-North Commuter Railroad Company, dba: MTA Metro-… Comm… 161.
4 New Jersey Transit Corporation Comm… 119.
5 MTA New York City Transit Comm… 112.
6 Massachusetts Bay Transportation Authority Bus 107.
7 Massachusetts Bay Transportation Authority Comm… 84.4
8 Woods Hole, Martha's Vineyard and Nantucket Steamship … Ferr… 78.0
9 MTA Long Island Rail Road Comm… 73.9
10 Peninsula Corridor Joint Powers Board Comm… 67.5
# ℹ 136 more rows
The transit system with the highest total fares per UPT is New Jersey Transit Corporation and their buses, with a total fare of 536 to one Unlinked Passenger Trip. This is more than double the second place spot, which goes to Northeast Illinois Regional Commuter Railroad Corporation and their Commuter Bus, with a total fare of 265 to one Unlinked Passenger Trip.
Which transit system (agency and mode) has the lowest expenses per VRM?
Interestingly enough, I had to change my “Expenses” column name to “expense” for my code to recognize the column instead of the table we established prior, hence the reasoning for that code below.
`summarise()` has grouped output by 'Agency'. You can override using the
`.groups` argument.
print(low_exp_vrm)
# A tibble: 146 × 3
# Groups: Agency [112]
Agency Mode low_exp_p_vrm
<chr> <chr> <dbl>
1 Interurban Transit Partnership Bus 77.2
2 City of El Paso Bus 85.8
3 Des Moines Area Regional Transit Authority Bus 86.3
4 Central Florida Regional Transportation Authority Bus 87.5
5 San Francisco Bay Area Rapid Transit District Heavy Rail 88.8
6 Transportation District Commission of Hampton Roads Bus 88.9
7 City of Gainesville, FL Bus 90.1
8 Greater Lafayette Public Transportation Corporation Bus 90.5
9 Ames Transit Agency Bus 92.0
10 Delaware Transit Corporation Bus 92.4
# ℹ 136 more rows
The transit system with the lowest expenses per Vehicle Revenue Miles was Interurban Transit Partnership and their Bus transportation, with an expense of 77.2 per Vehicle Revenue Mile. This is far cheaper than the leader of this category, which is the New York City Department of Transportation and their Ferryboat, with an expense of 771 per vehicle revenue mile. Interestingly enough, out of the ten highest expenses on the list, nine of the most expensive Modes of transportation is the Ferryboat, with only the Cable Car in San Francisco coming in at fourth. This shows that a Ferryboat is most-likely the most expensive transportation mode for transit agencies to run in comparison to its vehicle revenue miles.
Which transit system (agency and mode) has the highest total fares per VRM?
`summarise()` has grouped output by 'Agency'. You can override using the
`.groups` argument.
print(high_fare_VRM)
# A tibble: 146 × 3
# Groups: Agency [112]
Agency Mode high_fare_p_vrm
<chr> <chr> <dbl>
1 Washington State Ferries Ferr… 1120.
2 Woods Hole, Martha's Vineyard and Nantucket Steamship … Ferr… 829.
3 New York City Economic Development Corporation Ferr… 188.
4 Anaheim Transportation Network Bus 170.
5 Massachusetts Bay Transportation Authority Ligh… 131.
6 Port Authority Trans-Hudson Corporation Heav… 119.
7 Massachusetts Bay Transportation Authority Heav… 101.
8 MTA Long Island Rail Road Comm… 98.6
9 Metro-North Commuter Railroad Company, dba: MTA Metro-… Comm… 94.4
10 MTA New York City Transit Heav… 91.5
# ℹ 136 more rows
The transit system with the highest total fares per vehicle revenue miles is Washington State Ferries, with a ratio of 1120 total fares to one vehicle revenue mile. An interesting correlation to the last inquiry is that the top three highest total fares per VRM belong to ferryboats. Furthermore, the highest value on the list is almost 500% greater than the third value, which is 188, and that percentage only grows as that list goes further down.
Conclusion
In regard to what transit system in the country is the most effective, it depends on which data points you use to weigh your opinion. If you are considering which system has the highest values in URM and UPT, meaning it is the most frequently used, the answer would be the major transit system of NYC’s MTA Transit. Not only is its heavy rail system extremely prevalent in these numbers, which makes sense since numerous rails span the lengths of New York City and are used daily by many, but the NYC MTA Bus also holds rank compared to others.
However, if you are considering which transit system is the most cost effective, it would have to be the Anaheim Transportation Network. Since a high farebox recovery ratio indicates that a transit system is profitable, Anaheim Transportation Network bus’ numbers alone show that this system is highly more profitable than the rest. This agency also leads margin of lowest expenses per unlinked passenger trip, which shows that not only is this system great at making money, it also great at retaining it and not spending the majority of it on expenses. However, while this is great from a business standpoint, as a passenger this transportation agency ranks high on the total fares per vehicle revenue mile scale, as it is in 4th place. This means that this transportation is fairly expensive to ride compared to the others.
In my opinion, I believe that the MTA New York City Transit is the most effective in the country, due to the enormous scale it can run on at a moderate price compared to other agencies.