The Work
Cities face growing mobility challenges, and Bicycle-Sharing Systems (BSS) are part of the solution, but only if users can find a bike when they need one. In NYC's Citi Bike docked system, a bike is available at a station only if someone has previously returned it to a dock there, so forecasting availability means predicting both pickups and returns.
I approached this as a time-series problem, comparing Random Forest (a strong baseline for station-level predictions, per Ashqar et al. 2017) with LSTM, a Long Short-Term Memory network that is better at capturing sequential dependencies, across prediction windows from 15 minutes to 8 hours.
One key challenge: Citi Bike's datasets record trips but do not record availability directly. I used a GBFS (General Bikeshare Feed Specification) inventory snapshot to reconstruct availability patterns from 1.1M trip records, an approach none of the prior literature had documented.
- Novel reconstruction of station-level availability from trip records using a GBFS snapshot
- Explicit data leakage prevention through lagged availability variables
- Multicollinearity analysis (VIF) to remove redundant features
- Comparison across six prediction horizons: 15 min to 8 hours
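The reconstruction idea can be sketched as follows. This is a minimal illustration, not the thesis code: it assumes a single snapshot with a known timestamp and bike count, treats every trip start as -1 and every trip end as +1 for the station, and ignores rebalancing trucks and dock capacity. The function name, the event format, and the sample data are all hypothetical.

```python
from datetime import datetime
from itertools import accumulate


def reconstruct_availability(events, snapshot_time, snapshot_bikes):
    """Rebuild a station's bike count around one known inventory snapshot.

    events: list of (timestamp, delta) for a single station, where delta is
            +1 for a trip ending there (return) and -1 for a trip starting
            there (pickup). This event format is an assumption.
    Returns a time-sorted list of (timestamp, bikes_available_after_event).
    """
    events = sorted(events)
    # Running net flow after each event, in time order.
    running = list(accumulate(delta for _, delta in events))
    # Net flow recorded up to (and including) the snapshot time.
    flow_at_snapshot = sum(d for t, d in events if t <= snapshot_time)
    # Shift the running total so it matches the snapshot count at snapshot time.
    offset = snapshot_bikes - flow_at_snapshot
    return [(t, offset + r) for (t, _), r in zip(events, running)]


def at(hour, minute):
    # Hypothetical date, used only to build sample timestamps.
    return datetime(2024, 1, 1, hour, minute)


events = [(at(8, 0), -1), (at(8, 10), +1), (at(8, 20), -1), (at(8, 40), -1)]
series = reconstruct_availability(events, at(8, 30), snapshot_bikes=5)
print([bikes for _, bikes in series])  # [5, 6, 5, 4]
```

Because bikes moved by rebalancing never appear in trip records, a reconstruction like this can drift from the true count the further it gets from the snapshot.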
Key Results
LSTM outperforms Random Forest in almost every time window, and the gap increases as we move from short-term to longer-term forecasts. At the 8-hour window, LSTM's error is about half of Random Forest's, and its R² remains positive while RF's turns negative, meaning RF is performing worse than simply guessing the average.
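To see why a negative R² means "worse than the average", recall that R² compares a model's squared error with the squared error of always predicting the mean. The sketch below uses made-up toy numbers, not results from the thesis.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


observed = [2, 4, 6, 8]        # toy availability values (illustrative only)
mean_baseline = [5, 5, 5, 5]   # always predict the mean of the observations
poor_model = [8, 6, 4, 2]      # predictions that move in the wrong direction

r2_baseline = r_squared(observed, mean_baseline)  # 0.0
r2_poor = r_squared(observed, poor_model)         # -3.0
```

Predicting the mean scores exactly 0, so any model whose squared error exceeds that baseline ends up below zero.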
What Matters
Initial models had a problem: current availability dominated predictions because it already contains direct information about the target, a classic case of data leakage. Switching to lagged variables fixed this. After the fix, feature importance shifted meaningfully: past availability became less useful for extended forecasts, while environmental factors like humidity gained relevance over longer horizons.
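A minimal sketch of the lagging idea, assuming availability is sampled at regular intervals so that a lag can be expressed as a number of steps. The function name and sample values are hypothetical; with pandas, the same effect would typically come from `Series.shift`.

```python
def make_lagged_pairs(availability, lag_steps):
    """Pair each target value with the value observed lag_steps earlier.

    The feature only uses information that was available lag_steps before
    the target time, so the target cannot leak into its own predictor.
    """
    if lag_steps < 1:
        raise ValueError("lag_steps must be at least 1 to avoid leaking the target")
    return [
        (availability[i - lag_steps], availability[i])
        for i in range(lag_steps, len(availability))
    ]


counts = [5, 6, 5, 4, 7]  # toy bike counts at consecutive time steps
pairs = make_lagged_pairs(counts, lag_steps=2)
print(pairs)  # [(5, 5), (6, 4), (5, 7)]
```

Each pair is (feature, target). The first `lag_steps` observations have no earlier value to draw on, so they are dropped rather than filled.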
To push Random Forest further, I added rolling averages of recent bike availability. These rolling features act as a memory of the station's recent activity, which is roughly what an LSTM learns on its own. The 15-minute prediction error dropped from 1.25 to 0.79.
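One way such a rolling feature could be built is sketched below. It assumes regularly sampled counts and averages only values strictly before the current step, so the feature stays consistent with the leakage fix; whether the thesis used exactly this window definition is an assumption. The function name and sample values are hypothetical.

```python
def trailing_mean(values, window):
    """Mean of the `window` values strictly before each position.

    Positions without a full window of history get None.
    """
    out = []
    for i in range(len(values)):
        if i < window:
            out.append(None)
        else:
            past = values[i - window:i]  # excludes values[i] itself
            out.append(sum(past) / window)
    return out


counts = [4, 6, 8, 10, 12]  # toy bike counts at consecutive time steps
features = trailing_mean(counts, window=3)
print(features)  # [None, None, None, 6.0, 8.0]
```

A tree-based model has no built-in notion of order, so summarising recent history into a column like this is a common way to hand it temporal context.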
Why This Matters
This thesis demonstrates that machine learning can effectively predict short- to long-term bike availability in a large-scale docked BSS like Citi Bike. LSTM consistently outperforms Random Forest beyond one-hour horizons, while RF remains a strong option for rapid predictions, especially with rolling features.
The GBFS reconstruction approach offers a practical solution for any system without direct availability tracking. Incorporating temporal, spatial, and environmental factors improved accuracy: lagged availability dominates short forecasts, while weather variables gain relevance over longer horizons. Overall, these insights can support better bike redistribution planning.
Reflection & Limitations
A key limitation: I conducted this analysis on a single high-usage station (1 Ave & E 110 St). Although I chose it to reflect potential availability challenges, I did not directly verify how often this station experiences actual unavailability events, so model performance may not fully generalize to stations with different demand patterns.
The long-term forecast windows also contained comparatively few records, which likely reduced statistical robustness. Future work should validate these findings across multiple stations over longer periods, and explore whether the approach scales to all 2,000+ stations in the network.