Transport for London (TfL) is a statutory body responsible for London's public transport network, managing buses, the Underground, the Docklands Light Railway, the Overground, and major roads. Their 'Open Data' policy means that they share much of their internal data with the public, which they say currently powers over 600 apps for Londoners.
One interesting data source they share with the public is Santander Cycle (also known colloquially as Boris Bikes) usage data. Every bike trip is recorded, and the data runs from 2015 all the way up to 2025. It is organised into unwieldy weekly CSV files for download: each row of this data is one bike trip, with each trip starting from a particular bike station. That adds up to 9.2 million station-hours, 800 bike stations, and 144 weekly CSVs. See an example of the data below.
| Start Date | StartStation Name | End Date | EndStation Name | Duration |
|:-----------------|:---------------------------------|:-----------------|:------------------------------------|-----------:|
| 10/01/2016 00:00 | Drury Lane, Covent Garden | 10/01/2016 00:04 | Frith Street, Soho | 240 |
| 10/01/2016 00:00 | Pott Street, Bethnal Green | 10/01/2016 00:05 | Victoria Park Road, Hackney Central | 300 |
| 10/01/2016 00:00 | Harrington Square 2, Camden Town | 10/01/2016 00:20 | Baylis Road, Waterloo | 1200 |
| 10/01/2016 00:01 | Canton Street, Poplar | 10/01/2016 00:14 | Hewison Street, Old Ford | 780 |
| 10/01/2016 00:01 | Cephas Street, Bethnal Green | 10/01/2016 00:11 | Brick Lane Market, Shoreditch | 600 |

We can take each row and aggregate this data up so that we can see the seasonality trends across a few years:
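As a rough sketch of that aggregation (column names taken from the sample table above, with a tiny hand-made dataframe standing in for the real 9.2 million rows), we could count trips per week:

```python
import pandas as pd

# Illustrative stand-in for the raw trip rows (real files have many more columns).
trips = pd.DataFrame({
    "Start Date": ["10/01/2016 00:00", "10/01/2016 00:01", "12/07/2016 08:30"],
    "Duration": [240, 780, 600],
})
trips["start"] = pd.to_datetime(trips["Start Date"], format="%d/%m/%Y %H:%M")

# One row per trip in, one count per calendar week out.
weekly = (
    trips.set_index("start")
         .resample("W")["Duration"]
         .count()
         .rename("trips")
)
print(weekly.head())
```

Plotting a series like this over several years is what reveals the seasonal cycle in usage.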
This dataset now gives us a glimpse into bike usage across London (it doesn't contain every bike trip in London, but we can expect that Boris Bike usage is related to overall bike usage). For a Causal Data Science enthusiast, the natural next question is: how can we use this dataset to answer some interesting causal questions? What events occur that have a large impact on cycle trips? What are some common large-scale disruptions that leave people unable to take the tube? How do workers show the value of their labour to their employers by withholding it? Strikes!
In this article I will be analysing the causal impact of major tube strikes on cycle usage in London. Historic strikes are somewhat hard to pin down across the internet, but luckily for me there is a FOI request into strike action, which gives us dates of strike action at a line level between 2014 and 2018.
As the data starts off as one row for every bike trip across all bike stations in London, we have some work to do to get it into a format we can use. We have 144 weekly CSVs that we convert to Parquet files to help with memory constraints. We then combine all these Parquet files into one large dataframe and group by bike station and hour.
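A minimal sketch of that station-hour grouping (the dataframe and column names here are illustrative stand-ins, not the real basefile) looks like this:

```python
import pandas as pd

# Stand-in for the combined trip-level dataframe.
trips = pd.DataFrame({
    "station_id": [1, 1, 1, 2],
    "start_time": pd.to_datetime([
        "2016-01-10 09:05", "2016-01-10 09:40",
        "2016-01-10 10:15", "2016-01-10 09:20",
    ]),
})

# Floor each trip to its hour, then count trip starts per (station, hour),
# matching the station_id / trips_start / ts schema shown below.
station_hour = (
    trips.assign(trips_start=trips["start_time"].dt.floor("h"))
         .groupby(["station_id", "trips_start"])
         .size()
         .rename("ts")
         .reset_index()
)
print(station_hour)
```
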
| station_id | trips_start | ts |
|-------------:|:--------------------|-----:|
| 1 | 2016-01-10 09:00:00 | 4 |
| 1 | 2016-01-10 10:00:00 | 1 |
| 1 | 2016-01-10 11:00:00 | 2 |
| 1 | 2016-01-10 12:00:00 | 2 |
| 1 | 2016-01-10 13:00:00 | 2 |

TfL also provide coordinates for each bike station. We join the coordinates to their corresponding H3 cell. H3 is a hexagonal grid system that is used by Uber and is useful for many spatial analysis tasks. The plot below shows how bike trips are distributed across London.

We can now aggregate the trip data up to H3 cell-day level, along with some confounders that we think also affect cycling usage in London. These include weather and seasonality features.
# Process in chunks to avoid memory spikes
chunk_size = 100_000
h3_cells = []
for i in range(0, len(bf), chunk_size):
    chunk = bf.iloc[i:i+chunk_size]
    h3_cells.extend([h3.latlng_to_cell(lat, lon, 8) for lat, lon in zip(chunk["lat"], chunk["lon"])])
    print(f"  Processed {min(i+chunk_size, len(bf)):,} / {len(bf):,}")
bf["h3_cell"] = h3_cells

# Aggregate to cell-day
bf["day"] = pd.to_datetime(bf["trips_start"]).dt.date
cell_day = (
    bf.groupby(["h3_cell", "day"])
    .agg(
        total_trips = ("ts", "sum"),
        frac_exposed = ("strike_exposed", "mean"),
        n_stations = ("station_id", "nunique"),
        temperature_2m = ("temperature_2m", "mean"),
        precipitation = ("precipitation", "mean"),
        is_weekend = ("is_weekend", "first"),
        is_bank_holiday = ("is_bank_holiday", "first"),
        is_school_holiday = ("is_school_holiday", "first"),
        days_to_next_strike = ("days_to_next_strike", "first"),
        days_since_last_strike = ("days_since_last_strike", "first"),
        month = ("month", "first"),
        year = ("year", "first"),
        doy = ("doy", "first"),
        lat = ("lat", "mean"),
        lon = ("lon", "mean"),
    )
    .reset_index()
)
This means that every row of our dataset now contains all Santander bike trips for each day and each H3 cell. We have 172 cells observed across 1,192 days.
We also filtered so that only cells with at least one tube stop within 500m were included – this is necessary to satisfy the Positivity Assumption, which states that every unit has to have a non-zero probability of both treatment and control. If a cell has no tube stops within 500m, it has effectively no chance of treatment (we can reasonably assume that a commuter who can't use the tube because of strikes would walk up to 500m to use a Santander bike).

cell_day = cell_day[cell_day["n_tube_within_500m"] >= 1].copy()

This gives us a cell-day dataset with 62 H3 cells, 66,039 rows, and 98.4% of cells ever treated.
Next we can define our outcome and treatment variables. As each cell will have differing levels of expected bike usage, we create our outcome variable to be relative to each cell's capacity – the total trips for each cell on each day divided by the number of bike stations in that cell. We take the log so that our coefficient tells us about proportional changes rather than absolute ones and so that the statistical assumptions of the regression are satisfied, and we add one so that quiet cell-days with zero recorded trips are included in the analysis rather than silently dropped.
\[
Y_{i,t} = \log\left(1 + \frac{\text{Total Bike Trips in cell } i \text{ on day } t}{\text{Number of Bike Stations in cell } i}\right)
\]
We can calculate the outcome variable in Python with the following code.

cell_day["y_per_station_log1p"] = np.log1p(cell_day["total_trips"] / cell_day["n_stations"])

Defining the treatment variable for strike exposure isn't as straightforward. We know which tube lines were striking on each day – but this information doesn't neatly map to each cell, as each tube line snakes across London. When considering the question of what happens to bike usage when tube lines are not operational, it is helpful to first identify when bike stations are "close" to tube stations that are being affected by strikes. We have defined a bike station to be affected by a strike if it is within 400m of a tube station that serves one of the striking lines.
We then define an H3 cell to be strike-affected if any bike station within that cell is strike-affected. This is now our treatment variable.
\[
T_{i,t} =
\begin{cases}
1, & \text{if cell } i \text{ is strike-exposed on day } t \\
0, & \text{otherwise}
\end{cases}
\]
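The station-to-line mapping itself isn't shown in the article's snippets; one way it could be built with a 400m haversine cut-off might look like the sketch below (all names, coordinates, and column names are made-up illustrations, not the real TfL inputs):

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_000

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

# Hypothetical inputs: bike docks, and tube stations with the lines they serve.
bike = pd.DataFrame({"station_id": [1, 2],
                     "lat": [51.5145, 51.5300],
                     "lon": [-0.1220, -0.1240]})
tube = pd.DataFrame({"tube_station": ["Holborn"],
                     "affected_line": ["Central"],
                     "lat": [51.5174], "lon": [-0.1200]})

# Cross-join and keep pairs within 400m: each surviving row links a bike
# station to a tube line whose strike can expose it.
pairs = bike.merge(tube, how="cross", suffixes=("", "_tube"))
pairs["dist_m"] = haversine_m(pairs["lat"], pairs["lon"],
                              pairs["lat_tube"], pairs["lon_tube"])
station_line_map = pairs.loc[pairs["dist_m"] <= 400,
                             ["station_id", "affected_line"]]
print(station_line_map)
```

A KD-tree (as used later for the central-station filter) would scale this better than a cross-join, but the cross-join keeps the 400m rule easy to read.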
To construct this treatment variable for our dataset, we first have to create a strike-affected column for our station-level data. We do this using the following function, which takes in our station-hour data, a dataframe telling us which lines were striking on each day, and a dataframe telling us which stations are linked to each striking line.
def attach_strikes_to_base(
    base: pd.DataFrame,
    strikes_daily: pd.DataFrame,
    station_line_map: pd.DataFrame,
) -> pd.DataFrame:
    """
    Attach a binary strike_exposed indicator to the station-hour panel.
    A station-hour is treated (strike_exposed = 1) if any Underground line
    serving that station is on strike on that day.
    base must have columns: station_id, trips_start (datetime), ts (numeric trip count).
    """
    df = base.copy()
    df["date"] = pd.to_datetime(df["trips_start"]).dt.floor("D")
    station_day_treat = (
        strikes_daily
        .merge(station_line_map[["station_id", "affected_line"]], on="affected_line", how="inner")
        .drop_duplicates(subset=["station_id", "date"])
        .assign(strike_exposed=1)
        [["station_id", "date", "strike_exposed"]]
    )
    df = df.merge(station_day_treat, on=["station_id", "date"], how="left")
    df["strike_exposed"] = df["strike_exposed"].fillna(0).astype(int)
    return df.drop(columns=["date"])

When we aggregate the station-hour dataframe to cell-day level, we take the mean of the strike_exposed column into a new column frac_exposed, and any cell-days with a positive frac_exposed become treated.
cell_day["treated"] = (cell_day["frac_exposed"] > 0).astype(int)

More detail on the data wrangling can be found on
Now that we've defined our outcome and treatment variables, let's take a step back and talk about the underlying causal theory that underpins all the results we'll arrive at in this article.
What is the question we want to ask?
The causal mechanism underlying our analysis is substitution. When a tube line strikes, commuters who would normally travel underground are displaced and must find an alternative. We argue that for commuters near major interchange stations, Santander Bikes represent the most accessible alternative: they are available without pre-registration, priced for short trips, and physically present at the stations where displaced commuters emerge. This substitution story is what connects our treatment variable to our outcome through a credible causal pathway rather than mere correlation.
Strike occurs → tube commuters can't travel → those commuters look for alternatives → some walk to a nearby Santander dock → bike trips increase. Each arrow in that chain is a step in the mechanism. Without it, even a statistically significant result is just a correlation with a story attached. With it, you have a reason to believe the effect is real.
The causal mechanism we are describing can be expressed by the following structural causal model.

Because strike timing is determined by labour negotiations rather than by anything related to cycling demand, we have good reason to believe that strike days are not systematically different from non-strike days in ways that would independently affect bike usage. A strike called on a Tuesday in January isn't called because January Tuesdays are unusually good or bad for cycling – it's called because a wage negotiation broke down. This makes the counterfactual comparison credible: the bike usage we observe on comparable non-strike days is a reasonable approximation of what would have happened on strike days had the strike not occurred.
Now that we have our causal mechanism stated, we can carry on with our causal analysis. But before we do that, let's go through some of the important building blocks of causal inference – the potential outcomes framework.
Potential Outcomes
The fundamental problem of causal inference is that we don't observe counterfactual outcomes – we never know what would have happened to bike usage on a strike day had that strike not occurred. This is by definition unobservable.
In an ideal world, we would observe both potential outcomes for each unit: \(Y_{i,t}(0)\), which is the potential outcome if cell \(i\) had not experienced a strike on day \(t\), and \(Y_{i,t}(1)\), which is the potential outcome if it did experience a strike. From here we can define the individual treatment effect for cell \(i\) on day \(t\), which is the difference between the two potential outcomes:
\[
\tau_{i,t} = Y_{i,t}(1) - Y_{i,t}(0)
\]
We would love to know this quantity for each observation, but as mentioned above, we only ever observe one of the two potential outcomes. The logical next step is to average this effect over all units. This is the Average Treatment Effect (ATE):
\[
ATE = E[Y_{i,t}(1) - Y_{i,t}(0)] = E[\tau_{i,t}]
\]
This is the expected treatment effect for a randomly chosen unit from the full panel. In our setting, it answers: for a randomly chosen cell-day in our panel, what is the expected change in log bike trips per station if that cell-day were to become strike-exposed?
We can also define another treatment effect: the Average Treatment Effect on the Treated (ATT):
\[
ATT = E[Y_{i,t}(1) - Y_{i,t}(0) \mid D_i = 1] = E[\tau_{i,t} \mid D_i = 1]
\]
Where \(D_i\) is the treatment indicator. This shifts the focus onto units that were actually treated: for a cell-day that was actually strike-exposed, what was the causal effect of that exposure?
Naive Treatment Effect
Before we get into how we estimate these quantities using robust causal methods, we can first illustrate what goes wrong when we estimate the ATE naively. To do this as simply as possible, we could estimate the ATE as the difference in sample means between the treated and control observations. That is,
\[
\tau^{naive} = \overline{Y}_{D=1} - \overline{Y}_{D=0}
\]
print(f"Naive diff : {np.expm1(cell_day.loc[cell_day['treated']==1,'y_per_station_log1p'].mean() - cell_day.loc[cell_day['treated']==0,'y_per_station_log1p'].mean())*100:+.1f}%")

In our data, this gives a naive difference of +5.5%. Cells with any strike exposure have considerably higher log bike trips per station than cells without. But this isn't a credible causal estimate. We can decompose the naive difference algebraically to see exactly what it is estimating:
\[
\overline{Y}_{D=1} - \overline{Y}_{D=0} = \underbrace{E[Y_{i,t}(1) - Y_{i,t}(0) \mid D_i = 1]}_{ATT} + \underbrace{E[Y_{i,t}(0) \mid D_i = 1] - E[Y_{i,t}(0) \mid D_i = 0]}_{\text{selection bias}}
\]
The first term is the ATT, what we want. The second term is selection bias – the difference in control potential outcomes between treated and untreated units. In our case, this bias is likely positive: cells that are strike-exposed are near tube lines, which means they are in denser, more central areas of London that have higher baseline bike usage regardless of any strike. The naive estimate conflates the effect of strikes with the pre-existing advantage of centrally located cells.
Eliminating this selection bias is the entire job of the methods that follow.
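A toy simulation (with invented numbers, unrelated to the article's data) makes the decomposition concrete: when treatment is concentrated in high-baseline "central" cells, the naive difference equals the ATT plus a positive selection bias, exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Central cells (the only ones that can be treated) have higher baselines.
central = rng.random(n) < 0.3
y0 = np.where(central, 5.0, 3.0) + rng.normal(0, 1, n)  # control outcome
tau = 0.5                                               # true ATT
d = central & (rng.random(n) < 0.5)                     # treated subset
y = y0 + tau * d                                        # observed outcome

naive = y[d].mean() - y[~d].mean()
att = tau
selection_bias = y0[d].mean() - y0[~d].mean()
print(f"naive={naive:.2f}  ATT={att:.2f}  selection bias={selection_bias:.2f}")
```

The naive difference is far larger than the true effect, and the gap is entirely the difference in baseline outcomes between treated and untreated units.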
Panel Data
Our dataset has a structure that is particularly well-suited to addressing selection bias. It is a panel. A panel dataset observes the same units repeatedly over time. Our specific panel has the following structure:
\[
\{ X_{i,t}, D_{i,t}, Y_{i,t} \}
\]
Where \(i\) indexes our N = 172 H3 cells and \(t\) indexes the T = 1,192 days observed in our dataset, giving us N × T total observations.
The key insight that panel data provides is this: if we observe the same cell on multiple days, we can separate the time-invariant component of that cell's outcome from the day-specific variation. A cell near Bank station is always going to be busier than a cell near Pimlico – that is a permanent feature of the cell's location, not something that changes with strikes. Panel methods let us account for this permanent feature without ever having to measure it directly.
We can use the inherent setup of the panel data to model the treatment effect using a two-way fixed effects model. This is a generalisation of the traditional Difference-in-Differences method. The model is set up in the following way:
\[
Y_{i,t} = \alpha_{i} + \lambda_{t} + \tau D_{i,t} + \beta X_{i,t} + \epsilon_{i,t}
\]
Where \(Y_{i,t}\) is our outcome variable for cell \(i\) on day \(t\), \(\alpha_i\) is the fixed effect for cell \(i\), \(\lambda_t\) is the fixed effect for day \(t\), \(\tau\) is the causal treatment effect, \(D_{i,t}\) is the treatment indicator, \(\beta\) are the coefficients on the covariates \(X_{i,t}\), and \(\epsilon_{i,t}\) are our errors.
In this model, we have two fixed effects, \(\alpha_i\) and \(\lambda_t\), which act as dummy variables for each cell and each day. The cell fixed effect absorbs all time-invariant cell characteristics (all the geographical features of cell \(i\) that don't change over time) and the date fixed effect absorbs all cell-invariant variation (day-specific shocks common to all cells). This is equivalent to demeaning within each cell and within each date, which removes all time-invariant cell characteristics and common day-level shocks.
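That demeaning equivalence can be checked on a small synthetic balanced panel (all numbers below are invented for illustration): one-shot two-way demeaning recovers the same \(\tau\) as the full dummy-variable regression.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, tau = 10, 20, 0.3

alpha = rng.normal(0, 1, N)               # cell fixed effects
lam = rng.normal(0, 1, T)                 # day fixed effects
D = (rng.random((N, T)) < 0.2).astype(float)
Y = alpha[:, None] + lam[None, :] + tau * D + rng.normal(0, 0.1, (N, T))

# Within transformation: subtract cell means and day means, add grand mean.
def demean(M):
    return M - M.mean(1, keepdims=True) - M.mean(0, keepdims=True) + M.mean()

Yd, Dd = demean(Y), demean(D)
tau_within = (Yd * Dd).sum() / (Dd ** 2).sum()

# Same estimate from the explicit dummy-variable regression.
X = np.column_stack([
    D.ravel(),
    np.kron(np.eye(N), np.ones((T, 1))),  # cell dummies
    np.kron(np.ones((N, 1)), np.eye(T)),  # day dummies
])
beta, *_ = np.linalg.lstsq(X, Y.ravel(), rcond=None)
print(f"within: {tau_within:.4f}  dummies: {beta[0]:.4f}")
```

The one-shot demeaning is exact here because the panel is balanced; unbalanced panels need the iterative within-estimator that regression software applies for you.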
We can simply run this regression using the ols function from statsmodels.formula.api:
twfe = smf.ols(
    """y_per_station_log1p ~ treated
    + temperature_2m + precipitation
    + is_weekend + is_bank_holiday + is_school_holiday
    + days_to_next_strike + days_since_last_strike
    + C(h3_cell) + C(date_str)""",
    data=cell_day,
).fit(
    cov_type="cluster",
    cov_kwds={"groups": cell_day["h3_cell"]},
)

Note how we can't rely on ordinary OLS standard errors, as the observations from the same cell across different days are correlated. If we ignored this correlation and used standard OLS standard errors, we would systematically understate the uncertainty in \(\hat{\tau}\), producing confidence intervals that are too narrow and p-values that are too small. We address this with the standard solution of clustering errors at the cell level. This allows for arbitrary correlation between the residuals \(\epsilon_{i,t}\) and \(\epsilon_{i,s}\) for the same cell \(i\) across any two dates \(t\) and \(s\), while maintaining the assumption of independence across cells.
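To see why clustering matters, here is a self-contained toy comparison (simulated data, not the article's panel) of conventional and cluster-robust (Liang–Zeger) standard errors when every observation in a cell shares a common shock:

```python
import numpy as np

rng = np.random.default_rng(2)
G, T = 50, 40                        # cells (clusters) and days
d = rng.random(G) < 0.5              # cell-level treatment
u = rng.normal(0, 1, G)              # shared cell shock -> within-cell correlation
y = (0.2 * d[:, None] + u[:, None] + rng.normal(0, 1, (G, T))).ravel()
x = np.repeat(d.astype(float), T)

X = np.column_stack([np.ones_like(x), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Conventional OLS variance (assumes iid errors).
v_ols = resid @ resid / (len(y) - 2) * XtX_inv

# Cluster-robust variance: sum the outer products of per-cell score vectors.
meat = np.zeros((2, 2))
for g in range(G):
    sl = slice(g * T, (g + 1) * T)
    s = X[sl].T @ resid[sl]
    meat += np.outer(s, s)
v_cl = XtX_inv @ meat @ XtX_inv

print(f"naive SE: {np.sqrt(v_ols[1, 1]):.3f}  clustered SE: {np.sqrt(v_cl[1, 1]):.3f}")
```

With a within-cell correlation of 0.5 and 40 days per cell, the clustered standard error comes out several times larger than the naive one, which is exactly the understatement the cov_type="cluster" option guards against.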
Results
Our TWFE method gives us an increase of 3.95% in Santander bike usage on strike days, with a p-value of 0.097.
Before we dive deeper into these results, we first focus on some changes we made to our data to tighten the causal mechanisms that we want to understand.
Having established that every cell in our analysis must have at least one tube station within 500 metres – our positivity condition – we apply a stronger restriction motivated by the causal mechanism itself. Not all tube stations generate equal commuter displacement when they strike. The 42 stations we focus on are the major interchange stations of central London: Bank, Liverpool Street, King's Cross, Waterloo, Victoria, and their neighbours. These are the stations where thousands of commuters converge every morning, where Santander Bike docks are densest, and where the substitution from tube to bike is most frictionless – a displaced commuter walks out of a closed station and finds a rack of bikes within metres.
At more peripheral stations, even where a Santander dock exists nearby, the displacement mechanism is weaker. Fewer commuters are purely tube-dependent, and the walking distance to a bike dock is more likely to exceed what a time-pressured commuter will tolerate. Restricting to the 32 cells within 800 metres of these 42 major interchange stations is therefore a deliberate focus on the geographic population where both the demand shock from the strike and the supply response from the bike network are sufficiently concentrated for the substitution effect to be detectable.
# Get centroids of all unique cells in cell_day
unique_cells = cell_day["h3_cell"].unique()
cell_centroids = pd.DataFrame([
    {"h3_cell": c,
     "lat": h3.cell_to_latlng(c)[0],
     "lon": h3.cell_to_latlng(c)[1]}
    for c in unique_cells
])

# Build KD-tree over the 42 station coordinates
station_coords = np.radians(CENTRAL_42[["lat", "lon"]].values)
tree = cKDTree(station_coords)

# Query each cell centroid
cell_coords = np.radians(cell_centroids[["lat", "lon"]].values)
radius_rad = 0.8 / 6371.0  # 800m in radians

# For each cell, find the distance to the nearest of the 42 stations
nearest_dist_rad, _ = tree.query(cell_coords, k=1)
cell_centroids["dist_to_central_42_km"] = nearest_dist_rad * 6371.0
cell_centroids["near_central_42"] = nearest_dist_rad <= radius_rad
central_cells = set(
    cell_centroids.loc[cell_centroids["near_central_42"], "h3_cell"]
)

# ── Filter ─────────────────────────────────────────────────────
cell_day_central = cell_day[
    cell_day["h3_cell"].isin(central_cells)
].copy()

Days 300 days away from any strike have very different seasonal characteristics from strike days, and have no causal relevance to the comparison. Including them forces the date fixed effects to span a wide seasonal range, and the cell fixed effects are estimated from a period that isn't directly relevant to the comparison. By restricting to a local window of 45 days around each strike date we create a cleaner experiment: the control days look more like the counterfactual for the treated days, and seasonal confounding is reduced.
sub = cell_day_central[cell_day_central["days_to_nearest"] <= 45].copy()

We now have four different versions of the basefile, each with an increasingly powerful signal-to-noise ratio.
| Basefile Version | Rows | Treated % |
|------------------------------------------:|:-----------|-------------:|
| Only cells within 500m of a tube stop | 66,039 | 0.82 |
| Only cells close to Central Stations | 34,590 | 0.94 |
| Only days within 45 days of strike days | 16,799 | 1.95 |

The plot shows the different TWFE estimates across the different basefile specifications. The most causally powerful setup of our panel data achieves an estimated treatment effect of 3.95% with a p-value of 0.097.

Our p-value is above the p=0.05 threshold that is conventionally used. This means that a result as large as our 3.95% increase could be obtained by chance 9.7% of the time. Although our p-value is above the standard benchmark, we can see that our three estimates are consistently positive, and the width of the confidence interval reflects the limited number of strike events in the FOI data, not the absence of an effect.
Causal Inference Assumptions
Before getting too carried away with these results, we have to stop and consider the assumptions that have to hold for the TWFE estimate to have a causal interpretation.
Positivity/Overlap requires that every unit has a non-zero probability of being treated. We have addressed this by making sure that every cell in the panel has at least one tube stop within 500m.
Parallel trends requires that, in the absence of strikes, treated and control cells would have experienced the same time trend in bike usage. This is plausible in our setting because strike timing is determined by labour negotiation dynamics – the decision to strike on a particular date is driven by bargaining outcomes between TfL management and the unions, not by anything related to the underlying trajectory of bike usage.
No anticipation requires that cells don't change their behaviour before treatment occurs – that the announcement of a strike doesn't itself alter bike usage in the days before the strike. This is partially addressed by the inclusion of days_to_next_strike as a covariate in the controlled specification, which captures any systematic pre-strike trend. We note that for genuinely unannounced strikes, the no-anticipation assumption is automatically satisfied.
SUTVA (Stable Unit Treatment Value Assumption, Rubin 1980) requires that the potential outcomes of one cell don't depend on the treatment status of other cells. This is the assumption most likely to be violated in our setting: a strike displaces commuters across a wide geographic area, potentially affecting bike usage in cells beyond those directly adjacent to striking lines. SUTVA violations will attenuate our estimate towards zero, meaning our +3.95% should be interpreted as a lower bound on the true effect for the most directly exposed cells.
Closing Remarks
This article set out to answer a simple question: do London tube strikes push commuters onto Santander Bikes? The answer, based on a two-way fixed effects analysis of four years of TfL open data, is yes – but arriving at that answer was considerably less straightforward than the clean result might suggest.
Working with real-life data is never easy. While parsing the 144 weekly CSVs into a format I could use to answer the question, I had to reconcile inconsistent column schemas across data releases, correct a silent naming mismatch between strike line identifiers, and rebuild the spatial mapping between bike stations and tube stops several times.
This was all before considering the different causal assumptions necessary to build a credible argument. Coming from an ML background, I also spent a non-trivial amount of time investigating meta-learners (S-, T-, and X-learners, a family of predictive machine learning methods for estimating treatment effects) for this problem. These could have given us richer insight – the conditional average treatment effect, or CATE, which would tell us how the treatment effect varies across London.
I learned the hard way that the tool didn't fit the problem. Panel data with a recurring binary treatment and a strong geographic identification story wants a fixed effects regression, not a cross-sectional ML estimator.



