Bachelor Thesis — Mariano Cánovas

The Work

Cities face growing mobility challenges, and Bicycle-Sharing Systems are part of the solution, but only if users can find a bike when they need one. In NYC's Citi Bike docked system, a bike is only available if someone else has previously returned it to a dock, making forecasting a complex problem of predicting both pickups and returns.

I approached this as a time-series problem, comparing Random Forest (a strong baseline for station-level predictions, per Ashqar et al. 2017) and LSTM (better at sequential dependencies) across windows from 15 minutes to 8 hours.

One key challenge: Citi Bike's datasets record trips but not direct availability. I leveraged a GBFS inventory snapshot to reconstruct availability patterns from 1.1M trip records, an approach none of the prior literature had documented.
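One way to picture the reconstruction is as a backward walk from a known inventory count. The sketch below is an illustration under stated assumptions, not the thesis code: the function name `reconstruct_availability`, the event format, the capacity clamp, and all example values are hypothetical. It treats each trip end at the station as +1 bike and each trip start as -1, then undoes those events one by one starting from the snapshot count.

```python
from datetime import datetime


def reconstruct_availability(events, snapshot_time, snapshot_bikes, capacity):
    """Rebuild bike counts before a snapshot by undoing trip events.

    events: list of (timestamp, delta), delta = +1 for a return at this
            station, -1 for a pickup (hypothetical format).
    Returns a chronological list of (timestamp, bikes available right
    after that event).
    """
    # Only events up to the snapshot can be undone from it; newest first.
    past = sorted((e for e in events if e[0] <= snapshot_time), reverse=True)
    level = snapshot_bikes
    series = []
    for timestamp, delta in past:
        series.append((timestamp, level))  # count right after this event
        level -= delta                     # undo the event
        # Assumption: clamp to physical limits, since trips alone do not
        # record operator rebalancing.
        level = max(0, min(capacity, level))
    series.reverse()
    return series


# Made-up example: four events, then a snapshot reporting 5 bikes at 09:00.
events = [
    (datetime(2024, 5, 1, 8, 0), -1),   # pickup
    (datetime(2024, 5, 1, 8, 10), +1),  # return
    (datetime(2024, 5, 1, 8, 20), -1),  # pickup
    (datetime(2024, 5, 1, 8, 30), -1),  # pickup
]
series = reconstruct_availability(
    events, datetime(2024, 5, 1, 9, 0), snapshot_bikes=5, capacity=20
)
levels = [bikes for _, bikes in series]  # [6, 7, 6, 5]
```

The same walk can run forward from the snapshot for later events. Bikes moved by rebalancing trucks never appear in trip records, so a reconstruction like this drifts the further it gets from the snapshot.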

Core Contributions

Novel reconstruction of station-level availability from trip records using a GBFS snapshot
Explicit data leakage prevention through lagged availability variables
Multicollinearity analysis (VIF) to remove redundant features
Comparison across six prediction horizons: 15 min to 8 hours
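To make the VIF step concrete, here is a small stdlib-only sketch. It relies on the standard identity that a feature's VIF equals the matching diagonal entry of the inverse correlation matrix. The feature names `temperature` and `feels_like`, the data values, and the helper names are illustrative assumptions, not taken from the thesis.

```python
from statistics import mean, pstdev


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    n = len(columns[0])
    standardized = []
    for col in columns:
        m, s = mean(col), pstdev(col)
        standardized.append([(v - m) / s for v in col])
    k = len(columns)
    return [
        [sum(a * b for a, b in zip(standardized[i], standardized[j])) / n
         for j in range(k)]
        for i in range(k)
    ]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    k = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vif(features):
    """VIF per feature: diagonal of the inverse correlation matrix."""
    names = list(features)
    inverse = invert(correlation_matrix([features[n] for n in names]))
    return {name: inverse[i][i] for i, name in enumerate(names)}


# Made-up data: feels_like nearly duplicates temperature, humidity does not.
features = {
    "temperature": [10, 12, 15, 18, 21, 25, 28, 30],
    "feels_like": [10.3, 11.7, 14.7, 18.3, 21.3, 24.7, 27.7, 30.3],
    "humidity": [60, 45, 70, 50, 65, 40, 75, 55],
}
scores = vif(features)
```

In this toy data the two temperature columns get very large VIF values while humidity stays near 1, so one of the temperature columns would be the candidate for removal. Cut-offs of 5 or 10 are common rules of thumb.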

Key Results

Base Comparison: LSTM Wins All. Outperforms RF at every horizon.
Long-term: LSTM by 2x at 8 hours.
Best Accuracy: R² = 0.939 (RF optimized).
Stable: R² = 0.577 (LSTM at 8h).

LSTM outperforms Random Forest in almost every time window, and the gap increases as we move from short-term to longer-term forecasts. At the 8-hour window, LSTM's error is about half of Random Forest's, and its R² remains positive while RF's turns negative, meaning RF is performing worse than simply guessing the average.
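A negative R² can look odd, so here is a tiny worked example with made-up numbers (not thesis results). R² compares a model's squared error with the squared error of always predicting the mean: it is 0 when the two match and drops below 0 when the model is worse.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [2, 4, 6, 8]      # illustrative bike counts
mean_guess = [5, 5, 5, 5]  # always predict the average
poor_model = [8, 6, 4, 2]  # predictions that move the wrong way

baseline_r2 = r_squared(actual, mean_guess)  # 0.0
poor_r2 = r_squared(actual, poor_model)      # -3.0
```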

Figure: Model Performance Comparison: MAE Across Time Windows

What Matters

Initial models had a problem: current availability dominated predictions because it already contains direct information about the target, a classic case of data leakage. Switching to lagged variables fixed this. After the fix, feature importance shifted meaningfully: past availability became less useful for extended forecasts, while environmental factors like humidity gained relevance over longer horizons.
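The lagging idea can be sketched in a few lines. This is a simplified illustration, not the thesis pipeline: the helper name `lagged_rows`, the lag choices, and the data are assumptions. The point is that every feature comes from a strictly earlier time step than the target, so the model never sees the value it is asked to predict.

```python
def lagged_rows(series, lags):
    """Pair each target with availability values from earlier steps only.

    series: availability per time step, oldest first.
    lags:   positive step offsets, e.g. (1, 2) for one and two steps back.
    """
    if min(lags) < 1:
        raise ValueError("a lag of 0 would leak the target into the features")
    rows = []
    for t in range(max(lags), len(series)):
        features = [series[t - lag] for lag in lags]
        rows.append((features, series[t]))
    return rows


availability = [5, 6, 4, 3, 7]  # made-up counts per time step
rows = lagged_rows(availability, lags=(1, 2))
# rows == [([6, 5], 4), ([4, 6], 3), ([3, 4], 7)]
```

For a longer forecast horizon the smallest lag would grow with the horizon, which helps explain why past availability carries less signal in extended forecasts.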

To push Random Forest further, I added rolling averages of recent bike availability. These rolling features act like a memory of the station's recent activity, basically what LSTM does naturally. The 15-minute prediction error dropped from 1.25 to 0.79.
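A rolling feature of this kind can be sketched as a trailing mean over the previous few observations. The helper name, window size, and data are illustrative assumptions, since the windows used in the thesis are not stated here. The mean excludes the current step so the feature stays free of the leakage described above.

```python
def trailing_mean(series, window):
    """Mean of the `window` values strictly before each position.

    Positions without enough history get None.
    """
    result = []
    for t in range(len(series)):
        if t < window:
            result.append(None)
        else:
            result.append(sum(series[t - window:t]) / window)
    return result


availability = [4, 6, 8, 10, 12]  # made-up counts per time step
rolling = trailing_mean(availability, window=2)
# rolling == [None, None, 5.0, 7.0, 9.0]
```

With pandas, the equivalent would typically be `series.shift(1).rolling(window).mean()`, where the shift keeps the current value out of its own average.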

Why This Matters

This thesis demonstrates that machine learning can effectively predict short- to long-term bike availability in a large-scale docked BSS like Citi Bike. LSTM consistently outperforms Random Forest beyond one-hour horizons, while RF remains a strong option for rapid predictions, especially with rolling features.

The GBFS reconstruction approach offers a practical solution for any system without direct availability tracking. Incorporating temporal, spatial, and environmental factors improved accuracy: lagged availability dominates short forecasts, while weather variables gain relevance over longer horizons. Overall, these insights can support better bike redistribution planning.

Reflection & Limitations

A key limitation: I conducted this analysis on a single high-usage station (1 Ave & E 110 St). Although the station was chosen to reflect potential availability challenges, the study does not directly verify how often it experiences actual unavailability events, and model performance may not fully generalize to stations with different demand patterns.

The long-term forecast windows also contained comparatively few records, which likely reduced statistical robustness. Future work should validate these findings across multiple stations over longer periods, and explore whether the approach scales to all 2,000+ stations in the network.
