Available for senior data engineering roles

Michael Ma
Senior Data Engineer

Lakehouse pipelines, trusted metrics, and experimentation-grade datasets at marketplace scale.

Arlington, TX · 9+ years · Uber · Airbnb · DoorDash
200M+ MAPCs at Uber
500M+ Searches at Airbnb
100+ Datasets owned
About

Building the data foundations behind marketplaces that move billions of decisions a year.

I'm a Senior Data Engineer with nearly a decade of experience building the data foundations behind some of the largest consumer marketplaces on the internet. My focus is the unglamorous-but-load-bearing work that makes products measurable and improvable: modeled tables, curated metrics, lakehouse pipelines, data quality, and experimentation-ready datasets.

Across Uber, Airbnb, and DoorDash, I've worked between raw operational events and the people consuming them — product managers shipping marketplace levers, scientists running A/B tests, engineers debugging regressions, and operations teams monitoring marketplace health. I migrate and refactor pipelines, model business entities, harden data quality, and make datasets discoverable so teams answer the same question with the same logic.

I care about reliability, freshness, and trust at scale. I enjoy work where small improvements to correctness and discoverability compound — because they sit underneath pricing reads, ETA models, ranking changes, dashboards, and operational decisions across global markets.

Skills

The stack I reach for.

Pragmatic, production-tested choices across compute, storage, orchestration, streaming, and governance.

Languages
  • Python
  • SQL
  • PySpark
  • JavaScript
  • TypeScript
  • Java
  • Scala
Cloud & Infrastructure
  • AWS S3
  • AWS EMR
  • AWS Lambda
  • AWS Glue
  • AWS CDK
  • Azure Synapse
  • Terraform
  • Docker
  • Kubernetes
Data Platforms
  • Databricks
  • Snowflake
  • Redshift
  • Hadoop
  • Lake Formation
Orchestration
  • Airflow
  • Prefect
  • dbt
Streaming
  • Kafka
  • Kinesis
  • Spark Streaming
  • Flink
ML & BI
  • MLflow
  • Feature Store
  • Tableau
  • Power BI
Governance & Compliance
  • Data Catalog
  • Lineage
  • PHI
  • SOC 2
  • HIPAA
Data Modeling
  • Star Schema
  • Snowflake Schema
  • Medallion Architecture
  • MDM
Experience

Nine years across three of the largest consumer marketplaces.

The same throughline: making fragmented operational data discoverable, trustworthy, and ready for product and experimentation decisions.

Uber
Jan 2022 – Present

Senior Data Engineer at Uber

Ride Session Analytics Platform · 200M+ MAPCs · 3.75B+ quarterly trips · 20,000+ critical pipelines

On Uber's ride-hailing data layer, I built the modeled datasets and metric definitions that turn fragmented session events into trusted, decision-ready data — supporting pricing, dispatch, and experimentation across global markets.

  • Modeled end-to-end rider lifecycle data (shopping → matching → trip completion), enabling analytics across 200M+ monthly active users and 3.75B+ quarterly trips.
  • Engineered scalable lakehouse pipelines (Bronze/Silver/Gold) using Spark, Python, SQL, and AWS S3, processing multi-terabyte daily datasets from thousands of upstream sources.
  • Integrated 5+ heterogeneous sources — rider events, driver events, pricing services, dispatch logs, and trip records — into unified analytical tables for downstream analytics and ML.
  • Established freshness, completeness, duplication, and schema-drift checks across 100+ datasets, improving reliability for production use cases.
  • Refined incremental processing strategies for the ~10–15% of trip records that arrive late, reducing recomputation overhead and improving pipeline efficiency.
  • Tuned Spark workloads through partitioning and join strategies, reducing average job runtime by 20–30% across recurring batch pipelines.
  • Built experiment-ready datasets supporting 100+ concurrent A/B tests, accelerating iteration on marketplace features.
  • Aligned metric definitions across 10+ cross-functional teams and improved documentation, lineage, and ownership tracking across 100+ datasets.
Spark · PySpark · Python · SQL · AWS S3 · Hudi-style Lakehouse · Airflow · Kafka · Flink · Pinot · Databook-style Catalog
Airbnb
Jun 2019 – Dec 2021

Data Engineer at Airbnb

Search & Discovery: Flexible Search · 500M+ flexible-date searches in 2021 · ~99% of conversions via search

On Airbnb's Search & Discovery surface, I built feature pipelines, validation, and experiment-ready datasets that powered ranking, personalization, and the 2021 wave of flexibility-focused discovery — Flexible Dates, Flexible Matching, Flexible Destinations, and "I'm Flexible."

  • Developed data pipelines supporting Search & Discovery ranking and personalization systems, which drove ~99% of booking conversions through search and recommendation flows.
  • Enabled analytics for Flexible Search features (Flexible Dates, Flexible Destinations, "I'm Flexible") with scalable feature data pipelines covering 500M+ searches in 2021.
  • Constructed Airflow-based ETL using Spark, Hive, and Presto to generate feature datasets across billions of rows of listing and user-interaction data.
  • Applied data validation and quality checks to ensure the reliability of experiment-critical datasets used in hundreds of concurrent A/B tests.
  • Integrated batch and near-real-time pipelines using Kafka and CDC patterns, improving data freshness for search indexing and personalization.
  • Refactored DAG structures and reduced Airflow orchestration overhead, improving pipeline execution efficiency by ~25% in high-volume workflows.
  • Partnered with ML engineers and analysts to deliver feature tables for ranking models and recommendation systems.
  • Maintained consistency across dozens of upstream and downstream datasets so ML features and business metrics stayed aligned.
Airflow · Spark · Hive · Presto · Scala · PySpark · Kafka · CDC · EMR · Wall (DQ)
DoorDash
Jun 2016 – May 2019

Data Engineer at DoorDash

Food Delivery Marketplace · 27.6% U.S. consumer-spend share by 2019 · millions of deliveries · three-sided marketplace

At DoorDash, I helped move the company off ad-hoc production-DB queries and onto a real warehouse — building ETL, dimensional models, and validation across the consumer–merchant–dasher marketplace during its hyper-growth from regional player to category leader.

  • Implemented ETL pipelines for the Food Delivery Marketplace, integrating consumer, merchant, and dasher data into centralized datasets used across business and operations teams.
  • Contributed to building a centralized data warehouse, supporting analytics across millions of delivery records during rapid marketplace expansion.
  • Modeled core business entities using star-schema dimensional modeling to support reporting on delivery lifecycle, fulfillment efficiency, and operational KPIs.
  • Supported datasets used in experimentation frameworks — including switchback testing across regions and time windows — for more accurate evaluation of marketplace changes.
  • Performed data validation, historical backfills, and schema updates across large-scale datasets, improving the reliability and consistency of production data.
  • Collaborated with analytics and operations teams to deliver datasets and reports supporting daily decision-making in a high-growth environment.
ETL · SQL · Python · AWS · Star Schema · Dimensional Modeling · Switchback Experimentation · Data Warehouse
Education
Degree
Bachelor of Science in Computer Science
University of Houston
2012 – 2016