
Senior Data Engineer
atUberOn Uber's ride-hailing data layer, I built the modeled datasets and metric definitions that turn fragmented session events into trusted, decision-ready data — supporting pricing, dispatch, and experimentation across global markets.
- Modeled end-to-end rider lifecycle data (shopping → matching → trip completion), enabling analytics across 200M+ monthly active users and ~3B+ quarterly trips.
- Engineered scalable lakehouse pipelines (Bronze/Silver/Gold) using Spark, Python, SQL, and AWS S3, processing multi-terabyte daily datasets from thousands of upstream sources.
- Integrated 5+ heterogeneous sources — rider events, driver events, pricing services, dispatch logs, and trip records — into unified analytical tables for downstream analytics and ML.
- Established freshness, completeness, duplication, and schema-drift checks across 100+ datasets, improving reliability for production use cases.
- Refined incremental processing strategies for ~10–15% late-arriving trip records, reducing recomputation overhead and improving pipeline efficiency.
- Tuned Spark workloads through partitioning and join strategies, reducing average job runtime by 20–30% across recurring batch pipelines.
- Enabled experiment-ready datasets supporting 100+ concurrent A/B tests, accelerating iteration on marketplace features.
- Aligned metric definitions across 10+ cross-functional teams and improved documentation, lineage, and ownership tracking across 100+ datasets.


