Thesis Publication

ML-Based Bike Forecasting

Station-level availability predictions for the Citi Bike system

Best Short-term: 0.79 MAE (15-minute prediction error)
Network Scale: 2,218 stations, 3.7M time-series records
Key Finding: LSTM ≫ RF (deep learning excels long-term)

The Work

Cities face growing mobility challenges, and bicycle-sharing systems (BSS) are part of the solution, but only if users can find a bike when they need one. In NYC's docked Citi Bike system, a bike is available at a station only if someone has previously returned one to a dock there, so forecasting availability means predicting both pickups and returns.

I approached this as a time-series problem, comparing Random Forest (a strong baseline for station-level predictions, per Ashqar et al. 2017) and LSTM (better at sequential dependencies) across windows from 15 minutes to 8 hours.

One key challenge: Citi Bike's public datasets record trips, not station availability. I used a GBFS inventory snapshot as a reference point to reconstruct availability patterns from 1.1M trip records, an approach I did not find documented in the prior literature.
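The reconstruction idea can be sketched as follows. This is a minimal illustration, not the thesis code: it assumes each trip carries a start time, start station, end time, and end station, and that a single snapshot gives the bike count at a known time. The function name, the tuple layout, and the clipping to `[0, capacity]` are assumptions made for the example. Trips alone also miss operator rebalancing, so the result is an estimate.

```python
from datetime import datetime, timedelta


def reconstruct_availability(trips, station, snapshot_time, snapshot_bikes,
                             capacity, start, end,
                             step=timedelta(minutes=15)):
    """Estimate bikes docked at `station` on a regular time grid.

    trips: iterable of (started_at, start_station, ended_at, end_station).
    Returns a list of (timestamp, estimated_bikes) tuples.
    """
    # Each departure removes one bike from the station; each arrival adds one.
    events = []
    for started_at, start_station, ended_at, end_station in trips:
        if start_station == station:
            events.append((started_at, -1))
        if end_station == station:
            events.append((ended_at, +1))

    def net_flow_until(t):
        # Net change in docked bikes caused by all events up to and including t.
        # A linear scan is fine for a sketch; large data would want a cumulative sum.
        return sum(delta for ts, delta in events if ts <= t)

    # Anchor the series so that it matches the snapshot at snapshot_time.
    offset = snapshot_bikes - net_flow_until(snapshot_time)

    series = []
    t = start
    while t <= end:
        bikes = offset + net_flow_until(t)
        # Clip to the physically possible range (assumed handling of drift).
        series.append((t, max(0, min(capacity, bikes))))
        t += step
    return series


# Toy data (made up): three trips touching hypothetical station "A".
trips = [
    (datetime(2024, 1, 1, 8, 0), "A", datetime(2024, 1, 1, 8, 10), "B"),
    (datetime(2024, 1, 1, 8, 5), "B", datetime(2024, 1, 1, 8, 20), "A"),
    (datetime(2024, 1, 1, 8, 40), "A", datetime(2024, 1, 1, 8, 50), "B"),
]
series = reconstruct_availability(
    trips, "A",
    snapshot_time=datetime(2024, 1, 1, 9, 0), snapshot_bikes=4, capacity=10,
    start=datetime(2024, 1, 1, 8, 0), end=datetime(2024, 1, 1, 9, 0),
)
```

Because the series is anchored to the snapshot rather than to an assumed starting inventory, the snapshot can sit anywhere in the period, including at its end.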

Core Contributions
  • Novel reconstruction of station-level availability from trip records using a GBFS snapshot
  • Explicit data leakage prevention through lagged availability variables
  • Multicollinearity analysis (VIF) to remove redundant features
  • Comparison across six prediction horizons: 15 min to 8 hours
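For reference, the variance inflation factor used in the multicollinearity analysis is conventionally defined per feature as below. The symbols are the standard textbook ones, not notation taken from the thesis.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

Here $R_j^2$ is the coefficient of determination obtained by regressing feature $j$ on all the other features. A value of 1 means the feature is uncorrelated with the rest. Values above roughly 5 to 10 are a common rule of thumb for redundancy; the cutoff used in the thesis is not stated here.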

Key Results

Base Comparison: LSTM Wins All (outperforms RF at every horizon)
Long-term: LSTM by 2x (at 8 hours)
Best Accuracy: R² = 0.939 (RF optimized)
Stable: R² = 0.577 (LSTM at 8h)

LSTM outperforms Random Forest in almost every time window, and the gap widens as the forecast horizon grows. At the 8-hour window, LSTM's error is about half of Random Forest's, and its R² remains positive while RF's turns negative, meaning RF performs worse than always predicting the mean availability.
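To see why a negative R² means "worse than predicting the mean", here is a small worked example. The numbers are invented for illustration and are not thesis results. The function names are hypothetical, and the formulas are the standard definitions of MAE and R².

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, where SS_tot is the error of the mean predictor."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


observed = [2, 4, 6, 8]            # made-up bike counts
mean_guess = [5, 5, 5, 5]          # always predict the average
poor_model = [8, 6, 4, 2]          # systematically wrong predictions

print(r_squared(observed, mean_guess))  # 0.0: exactly as good as the mean
print(r_squared(observed, poor_model))  # -3.0: worse than the mean
print(mean_absolute_error(observed, poor_model))  # 4.0
```

An R² of 0 is the break-even point against the mean predictor, so any negative value indicates the model's squared error exceeds that of the constant baseline.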

Model Performance Comparison: MAE Across Time Windows
[Chart: MAE (bikes) by forecast window (15m, 30m, 1h, 2h, 4h, 8h) for LSTM and Random Forest]

What Matters

Initial models had a problem: current availability dominated predictions because it already contains direct information about the target, a classic case of data leakage. Switching to lagged variables fixed this. After the fix, feature importance shifted meaningfully: past availability becomes less useful for extended forecasts, while environmental factors like humidity gain relevance over longer horizons.
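One way to build leakage-free lagged features is sketched below, assuming availability sits on a regular time grid (for example, one value every 15 minutes). The function name and the `horizon` / `n_lags` parameters are hypothetical; the exact lag structure used in the thesis is not specified here. The point is that every feature is observed at least `horizon` steps before the target it predicts.

```python
def make_lagged_rows(availability, horizon, n_lags=3):
    """Pair each target with availability values known `horizon` steps earlier.

    availability: list of bike counts on a regular time grid.
    Returns a list of (lag_features, target) tuples, where lag_features[0]
    is the most recent value that would be known at prediction time.
    """
    rows = []
    first_target = horizon + n_lags - 1  # earliest index with full history
    for t in range(first_target, len(availability)):
        # Only values at or before t - horizon are used, never the target itself.
        lags = [availability[t - horizon - k] for k in range(n_lags)]
        rows.append((lags, availability[t]))
    return rows


# Made-up series; with horizon=2 the newest feature is two steps old.
rows = make_lagged_rows([10, 9, 8, 7, 6, 5], horizon=2, n_lags=2)
for features, target in rows:
    print(features, "->", target)
```

With 15-minute steps, `horizon=2` would correspond to a 30-minute forecast window.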

To push Random Forest further, I added rolling averages of recent bike availability. These rolling features act as a memory of the station's recent activity, which is roughly what an LSTM learns on its own. The 15-minute MAE dropped from 1.25 to 0.79 bikes.
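A rolling-average feature of this kind can be sketched as below. The window length and the function name are assumptions for illustration. The mean is taken over values strictly before the current position, so the feature does not reintroduce the leakage described earlier.

```python
from collections import deque


def trailing_mean(values, window):
    """Mean of the `window` values strictly before each position.

    Returns None for positions that do not yet have `window` past values.
    """
    out = []
    recent = deque(maxlen=window)  # holds only the last `window` past values
    for v in values:
        out.append(sum(recent) / window if len(recent) == window else None)
        recent.append(v)  # the current value becomes history for the next step
    return out


# Made-up availability series with a 2-step window.
print(trailing_mean([4, 6, 8, 10], window=2))  # [None, None, 5.0, 7.0]
```

In a pandas pipeline, the same effect is usually achieved by shifting the series by one step before applying a rolling mean.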

Why This Matters

This thesis demonstrates that machine learning can effectively predict short- to long-term bike availability in a large-scale docked BSS like Citi Bike. LSTM consistently outperforms Random Forest beyond one-hour horizons, while RF remains a strong option for rapid predictions, especially with rolling features.

The GBFS reconstruction approach offers a practical solution for any system without direct availability tracking. Incorporating temporal, spatial, and environmental factors improved accuracy: lagged availability dominates short forecasts, while weather variables gain relevance over longer horizons. Overall, these insights can support better bike redistribution planning.

Reflection & Limitations

A key limitation: I conducted this analysis on a single high-usage station (1 Ave & E 110 St). While chosen to reflect potential availability challenges, the study does not directly verify how often this station experiences actual unavailability events, so model performance may not fully generalize to stations with different demand patterns.

The long-term forecast windows also contained comparatively few records, which likely reduced statistical robustness. Future work should validate these findings across multiple stations over longer periods, and explore whether the approach scales to all 2,000+ stations in the network.