Revamping Meta's Data Ingestion: A Q&A on Massive-Scale Migration

Source: engineering.fb.com

Meta recently overhauled its data ingestion system to handle the immense scale of its social graph, moving from a legacy architecture to a more resilient and efficient service. This Q&A explores the challenges, strategies, and key decisions behind the successful migration of thousands of jobs spanning petabytes of MySQL data.

1. What prompted Meta to overhaul its data ingestion system?

Meta's social graph is stored in one of the largest MySQL deployments in the world, and petabytes of data are scraped from it incrementally into an analytics warehouse every day. As operations scaled, the legacy system, built on customer-owned pipelines, became unstable under stringent data landing time requirements. Teams across the company depend on fresh snapshots for everything from day-to-day decisions to machine learning and product development. The old architecture worked well at smaller scale but couldn't keep up with hyperscale demands. To maintain reliability and efficiency, Meta migrated to a self-managed data warehouse service that simplifies the architecture while delivering consistent, low-latency data at massive scale. The move was essential to keep analytics and downstream data products functioning smoothly as Meta grows.

2. How did Meta ensure a seamless transition during the migration?

A smooth transition required meticulous tracking of the migration lifecycle for thousands of jobs, along with robust rollout and rollback controls. Meta established clear verification criteria and success gates for each job before it could advance to the next lifecycle stage. This included no data quality issues (exact row counts and checksums matching between old and new systems), no landing latency regression (the new system had to match or improve data delivery speed), and no resource utilization regression. By enforcing these gates, Meta prevented data integrity problems and minimized operational risk. The team also built infrastructure to quickly revert any job to the legacy system if issues arose, ensuring that the broader ingestion pipeline remained stable throughout the transition.
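
The post describes these gates in prose only. As a rough illustration, here is a minimal Python sketch of the three checks, assuming simple per-job metrics; the JobMetrics fields and function names are hypothetical, not Meta's actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class JobMetrics:
    """Snapshot of one ingestion job's output on a given system.

    All fields are illustrative; the post names the three criteria
    but not how Meta records them.
    """
    row_count: int
    checksum: str           # e.g., a digest over the landed rows
    landing_latency_s: float
    cpu_core_hours: float

def passes_gates(old: JobMetrics, new: JobMetrics) -> bool:
    """True only if the new system shows no regression on any gate:
    exact data quality, landing latency, and resource utilization."""
    data_ok = (new.row_count == old.row_count
               and new.checksum == old.checksum)
    latency_ok = new.landing_latency_s <= old.landing_latency_s
    resources_ok = new.cpu_core_hours <= old.cpu_core_hours
    return data_ok and latency_ok and resources_ok
```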

3. What was the migration lifecycle, and what steps did each job follow?

Meta defined a clear migration lifecycle for each data ingestion job to maintain data integrity and operational reliability. The process began with a small-scale test where a job ran on the new system while still running on the old system, allowing side-by-side comparison. After passing initial verification, the job moved to a shadow phase where its output was validated against the legacy system without affecting consumers. Once verified, the job was promoted to production: it became the primary source of data, but the old system was kept on standby. Finally, after a stable period (e.g., multiple cycles without issues), the old system was deprecated and removed. Each step required meeting success criteria—like exact row counts, checksums, latency, and resource usage—before progressing. This phased approach allowed Meta to catch problems early and roll back any job if necessary.
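
As a sketch of this lifecycle, the stages and promotion rules might look like the state machine below. The stage names follow the description above; the rollback-to-test behavior and the stable-cycle threshold are assumptions, since the post only says jobs could be reverted and that legacy was removed after "multiple cycles without issues":

```python
from enum import Enum

class Stage(Enum):
    TEST = 1        # runs side by side with the legacy pipeline
    SHADOW = 2      # output validated; consumers still on legacy
    PRODUCTION = 3  # new system is primary, legacy on standby
    DEPRECATED = 4  # legacy pipeline removed

def advance(stage: Stage, gates_passed: bool, stable_cycles: int = 0,
            required_stable_cycles: int = 3) -> Stage:
    """Promote a job one stage, or roll it back on a failed gate.

    required_stable_cycles is an assumed threshold standing in for
    the "stable period" the post describes.
    """
    if not gates_passed:
        # A failed gate sends the job back to the legacy-backed stage.
        return Stage.TEST
    if stage is Stage.PRODUCTION and stable_cycles < required_stable_cycles:
        return Stage.PRODUCTION  # keep legacy on standby a while longer
    order = [Stage.TEST, Stage.SHADOW, Stage.PRODUCTION, Stage.DEPRECATED]
    return order[min(order.index(stage) + 1, len(order) - 1)]
```

The point of the shape, not the details: promotion is always gated, and every stage before deprecation has a cheap path back to the legacy system.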

4. What verification criteria guaranteed data correctness after migration?

To ensure the new system delivered identical data, Meta compared both row counts and checksums between the outputs of the legacy and new architectures. A matching row count alone isn't enough, since two different sets of records can produce the same count. Checksums (such as MD5 or SHA hashes) confirm that exactly the same data is present. The team also monitored landing latency, requiring the new system to be at least as fast as the old one, and tracked resource utilization to catch unintended spikes. Together, these three criteria (data quality, latency, and resource usage) formed the gates: only jobs that met all of them could move forward in the migration lifecycle, ensuring that downstream analytics, reporting, and machine learning models received consistent, high-quality data without performance regressions.
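
The post doesn't specify the checksum mechanism beyond "row counts and checksums." One common way to fingerprint a MySQL table for this kind of comparison is an order-independent BIT_XOR over per-row CRC32 hashes; the sketch below uses that idiom, with illustrative host, table, and column names, and is not Meta's actual method:

```python
import mysql.connector  # pip install mysql-connector-python

# Order-independent table fingerprint: row count plus a BIT_XOR of
# per-row CRC32 hashes. A common MySQL comparison idiom, shown here
# only as an example. COALESCE guards against NULL columns being
# silently skipped by CONCAT_WS.
FINGERPRINT_SQL = """
    SELECT COUNT(*) AS row_count,
           BIT_XOR(CRC32(CONCAT_WS('#',
               COALESCE(id, ''), COALESCE(payload, '')))) AS checksum
      FROM daily_snapshot
"""

def fingerprint(host: str) -> tuple:
    """Return (row_count, checksum) for the snapshot table on one host."""
    conn = mysql.connector.connect(host=host, user="reader",
                                   password="...", database="warehouse")
    cur = conn.cursor()
    cur.execute(FINGERPRINT_SQL)
    row = cur.fetchone()
    cur.close()
    conn.close()
    return row

# Both systems must agree exactly before the job can advance.
assert fingerprint("legacy-host") == fingerprint("new-host")
```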

5. How does the new architecture differ from the legacy system?

The legacy system relied on customer-owned pipelines: individual teams managed their own ingestion flows from MySQL to the warehouse. While flexible at small scale, this led to fragmentation, duplicated effort, and instability as usage grew. The new architecture adopts a self-managed data warehouse service that centralizes ingestion. Instead of each team maintaining custom pipelines, Meta built a shared, simplified system that handles scraping, transformation, and loading automatically. This shift reduces operational overhead, improves efficiency, and lets the service scale to hyperscale loads. The new design also makes it easier to apply global monitoring, rollouts, and rollbacks, which was key to the migration's success. By decoupling ingestion from individual teams, Meta can focus on optimizing the shared infrastructure for performance and reliability.
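
To make the contrast concrete, here is a hypothetical example of what a team might declare under the centralized model. Every field name is an assumption, since the post doesn't publish the service's interface; the point is that teams describe a dataset while the shared service owns execution:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IngestionJobSpec:
    """What a team declares under the centralized model (illustrative).

    Under the legacy model each team also owned the scrape/transform/
    load code; here they only describe the dataset, and the shared
    service handles execution, monitoring, and rollback.
    """
    source_shard_set: str      # e.g., the MySQL tier holding the data
    table: str                 # source table to scrape incrementally
    warehouse_namespace: str   # destination in the analytics warehouse
    landing_deadline_utc: str  # data must land by this time each day

job = IngestionJobSpec(
    source_shard_set="social_graph_mysql",
    table="friend_edges",
    warehouse_namespace="analytics.graph",
    landing_deadline_utc="06:00",
)
```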

6. What factors most influenced Meta's architectural decisions during the migration?

Three key factors shaped Meta's choices: reliability at scale, operational simplicity, and verifiability. First, the system had to handle petabytes of daily data with strict landing time guarantees, so reliability was paramount. Second, moving from complex customer-owned pipelines to a self-managed service simplified operations and reduced the burden on engineers. Third, every migration step required strong verification (exact row counts and checksums, latency comparisons, and resource tracking) to ensure no regressions. These factors drove the adoption of a phased lifecycle, robust rollback capabilities, and automated validation. The ability to monitor and quickly revert any of the thousands of migrating jobs was equally critical, as sketched below. By focusing on these pillars, Meta built a system that holds up at hyperscale and makes future migrations easier.
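
A minimal sketch of such a fleet-wide sweep, reusing the passes_gates() check from the Q2 sketch; the metrics_for and revert_to_legacy hooks are hypothetical stand-ins for Meta's internal tooling:

```python
def sweep_and_revert(jobs, metrics_for, revert_to_legacy):
    """Scan every migrated job and revert any that shows a regression.

    metrics_for(job) is assumed to return (old, new) JobMetrics pairs,
    and revert_to_legacy(job) to flip a job back to the legacy system;
    both are illustrative hooks, not documented APIs.
    """
    reverted = []
    for job in jobs:
        old, new = metrics_for(job)
        if not passes_gates(old, new):
            revert_to_legacy(job)
            reverted.append(job)
    return reverted
```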
