Data platform 2022: Global expansion in petabytes

By Youngwan Lim, Michael Sommers, Eddard (Hyo Kun) Park, Thimma Reddy Kalva, Ibrahim Durmus, Martin (Yuxiang) Hu, Enhua Tan

Coupang delivers millions of products each year at rocket speed and scale for a growing number of customers. Behind the magic of our game-changing services, such as next-day delivery, is a complex data platform that is constantly evolving under the careful supervision of our talented engineers across the globe.

Our vision for the data platform is to provide robust and user-friendly tools that empower internal users to transform large raw datasets into analytics that help us make critical and time-sensitive business decisions. The users of the data platform are from approximately 50 different teams and include everyone from engineers and business analysts to C-level executives and external users such as suppliers and advertisers.

As you can imagine, the data platform supports hundreds of users and petabytes of data. The platform processes over five thousand jobs and extracts over 2 TB of data from approximately 70 different sources every day. On top of that, we expect these numbers to increase severalfold as we expand and grow our business.

In this post, we discuss how our data platform has evolved since 2019, focusing on the ingestion, machine learning, and experiment platforms. Stay tuned for part 2 of this series, where we dive into details about the analytics platform.

Let’s start off by examining our ingestion platform, the gateway between our data sources and data platform. Its goal is to efficiently ingest raw data in varying schemas from a wide range of data sources into a single data lake.

Previously, data ingestion relied on hand-crafted, source-specific pipelines built on an ad-hoc basis. However, with the enormous growth of data volume and analytic calculations, building new extraction and load pipelines for each source became difficult to manage at scale and limited the on-demand analytics needed to make urgent business decisions.

In addition, these extraction pipelines required significant maintenance effort whenever upstream schemas changed. Such dependence on manual maintenance increased the chances of failures caused by sudden data surges and skews.

Although this system worked for us in the early stages of the business, it became error-prone and inefficient with our massive data growth.

To solve the challenges mentioned above and to move away from a high-touch service model, we introduced a scalable and self-service data ingestion platform.

We built the universal data ingestion (UDI) framework to be source-agnostic, fully automated, and user-driven (not data engineer-driven). The framework also aims to standardize ingestion processes for our 70 different data sources. UDI was designed with the three principles that we refer to as the 3Vs: Velocity, Variety, and Volume.

Velocity. To increase velocity, UDI automatically orchestrates ingestion processes that were previously manual and time-consuming. With UDI, ingestion is simple and requires little to no engineering time.

For example, data batches go through an extensive inspection before ingestion. We check that the source data can be accessed securely according to our security protocols. For structured data use cases, we check the data schema so that only the relevant sections are pulled. These and other comprehensive checking and filtering processes are automated by UDI.
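As a rough sketch of the kind of automated inspection involved (the class, field, and column names here are illustrative assumptions, not UDI's actual API), a pre-ingestion check might look like this:

```python
# Hypothetical pre-ingestion inspection: verify source access and keep only
# the schema sections a downstream job needs. Names are illustrative only.
from dataclasses import dataclass


@dataclass
class SourceBatch:
    source_id: str
    credentials_ok: bool   # result of the automated security/access check
    schema: dict           # column name -> type reported by the source


REQUIRED_COLUMNS = {"order_id", "updated_at"}  # assumed "relevant sections"


def inspect_batch(batch: SourceBatch) -> dict:
    """Run the automated checks a batch must pass before ingestion."""
    if not batch.credentials_ok:
        raise PermissionError(f"{batch.source_id}: source access check failed")
    # For structured sources, keep only the columns the downstream job needs.
    relevant = {c: t for c, t in batch.schema.items() if c in REQUIRED_COLUMNS}
    missing = REQUIRED_COLUMNS - relevant.keys()
    if missing:
        raise ValueError(f"{batch.source_id}: missing columns {missing}")
    return relevant
```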

In addition, UDI automatically estimates the batch size for each ingestion. This is especially important for RDBMS-based data sources that contain critical business data: to avoid placing excessive load on the source, UDI estimates an efficient yet safe amount of data to pull in each batch.
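A minimal sketch of such an estimate, assuming simple per-batch byte and row limits (the thresholds and helper function below are hypothetical, not UDI's actual logic):

```python
# Illustrative batch-size estimation for an RDBMS source. The limits and
# table statistics are assumptions chosen only for demonstration.
def estimate_batch_size(total_rows: int,
                        avg_row_bytes: int,
                        max_batch_bytes: int = 256 * 1024 * 1024,
                        max_source_load_rows: int = 500_000) -> int:
    """Pick a batch size that respects both memory and source-load limits."""
    rows_by_bytes = max_batch_bytes // max(avg_row_bytes, 1)
    batch_rows = min(rows_by_bytes, max_source_load_rows, total_rows)
    return max(batch_rows, 1)


# e.g. a 40M-row table with ~200-byte rows would be pulled in ~500k-row batches
print(estimate_batch_size(total_rows=40_000_000, avg_row_bytes=200))
```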

Variety. Our initial pipelines focused on connecting MySQL to Hive, the most common configuration in our production environment. To support a more diverse range of data sources, we built self-service ingestion as a plug-in framework on top of the Hadoop ecosystem. Users can add support for new data sources at incremental cost and with minimal development effort by adapting our reusable plug-in framework.
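To illustrate the idea, a source plug-in contract could look roughly like the following; the class and method names are assumptions for this sketch, not the real framework interfaces:

```python
# Hypothetical plug-in contract for adding a new data source to a
# self-service ingestion framework. Not the actual UDI SDK.
from abc import ABC, abstractmethod
from typing import Iterator


class SourcePlugin(ABC):
    """Contract each new data source implements to join the ingestion platform."""

    @abstractmethod
    def read_schema(self) -> dict:
        """Return the source schema as column -> type."""

    @abstractmethod
    def read_batch(self, batch_size: int) -> Iterator[dict]:
        """Yield records for one ingestion batch."""


class MySqlPlugin(SourcePlugin):
    """Example plug-in for the MySQL-to-Hive path; internals are stubbed."""

    def __init__(self, connection_uri: str, table: str):
        self.connection_uri = connection_uri
        self.table = table

    def read_schema(self) -> dict:
        # In practice this would query information_schema; stubbed here.
        return {"id": "bigint", "updated_at": "timestamp"}

    def read_batch(self, batch_size: int) -> Iterator[dict]:
        # In practice this would page through the table; stubbed here.
        yield from ({"id": i, "updated_at": None} for i in range(batch_size))
```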

Volume. From the very first phase, UDI was designed for high scalability and followed open standards. To efficiently transform our large volumes of data into analytics, we employ several scalable processing systems.

In the future, we want to introduce a streaming solution, including robust event-based ingestion and generic event consumers. Additionally, we want to gradually deprecate the direct change data capture (CDC) sync process to adopt a strongly typed log-based ingestion process. Lastly, we will continue to seek ways to enhance the end-to-end customer experience by improving the self-service functions of the ingestion platform.

Now we will go through how we use the ingested data to train machine learning (ML) models that provide us with valuable insights. This section focuses on our internal machine learning platform, used by Coupang data scientists and ML engineers to seamlessly build, train, and deploy ML models at scale.

Previously, data scientists and ML engineers across different teams managed their own ML infrastructures. This negatively impacted our engineering efficiency in two ways. First, each team spent large amounts of time setting up similar data science environments utilizing the same packages and tools. Second, each team separately prepared data, built feature pipelines, and trained and deployed models without a standardized process. Due to a lack of standardization and central management of resources, it was difficult for teams to adequately access GPUs. Such redundant engineering efforts resulted in inefficient resource utilization and low ML throughput.

We needed to standardize the model building and GPU utilization processes in a simple and user-friendly platform.

The ML platform was built as a scalable training platform that supports popular ML frameworks and tools, such as TensorFlow and PyTorch, for building feature pipelines and running offline training. The platform offers data preparation capabilities, interactive notebook environments, distributed training on GPUs, hyperparameter tuning, and model serving.

Our ML workflow consists of data preparation, model training, and predictions, each of which we will detail in the following sections.

We leveraged our existing data pipeline orchestrator based on Airflow to produce the high-quality feature and label datasets required for model training. This orchestrator was already fully integrated with our data store and large-scale data processing engines such as Spark, Hive, and Presto. We also developed an SDK to easily synchronize data across the training platform. Together, the orchestrator and data store SDK allow engineers to create, manage, and share large training datasets.
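As an illustration, a feature pipeline on an Airflow-based orchestrator might be expressed along these lines; the DAG id, task callables, and SDK call are hypothetical placeholders rather than our actual pipeline code:

```python
# Hypothetical Airflow DAG: build a feature dataset, then publish it to the
# training platform via an (assumed) data store SDK.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def build_features(**_):
    # Placeholder for a Spark/Hive job that joins raw tables into features.
    print("building feature dataset")


def publish_to_training_store(**_):
    # Placeholder for the SDK call that syncs features to the training platform.
    print("publishing features to the training data store")


with DAG(
    dag_id="search_ranking_features",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    features = PythonOperator(task_id="build_features",
                              python_callable=build_features)
    publish = PythonOperator(task_id="publish_features",
                             python_callable=publish_to_training_store)
    features >> publish
```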

The model training and development platform was built on open-source tools and aims to standardize training environments for all engineers at Coupang.

To further reduce the time between model training and inference, we built model deployment tools and a model serving infrastructure. We utilized our in-house container orchestration platform for Kubernetes and customized open-source tools to support model serving. Engineers can simply write a Jsonnet spec with autoscaling settings and expose REST and gRPC inference endpoints to serve their models.

Currently, the Search Ranking team uses the ML platform to train BERT and DNN models to improve query understanding, search results, and purchase rates. The Advertising team trains deep interest networks to improve click-through rates and conversion rates. The Marketing team uses it to present customers with personalized recommendations and promotions to improve engagement and conversions to paid memberships.

Our next goal is to improve the model store, where engineers store and manage varying versions of their ML models. We are also collaborating with internal teams to incorporate a feature store and enhance integration of the serving layer with the prediction store.

In this last section, we will introduce our experiment platform and how it helps us make data-driven decisions that enhance the customer experience across the globe.

Adding new features to any app is tricky because their impact can be difficult to measure. To establish a scientific and customer-centered approach to feature launches, Coupang uses A/B testing. Every feature on our app, from the delivery preference page to the item recommendations carousel, is born through rigorous A/B testing.
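For context, one common way to assign customers to test groups deterministically is to hash the customer and experiment identifiers into a stable bucket; the sketch below is illustrative only and is not Coupang's actual assignment logic:

```python
# Illustrative deterministic A/B bucketing: the same customer always lands in
# the same group for a given experiment.
import hashlib


def assign_group(customer_id: str, experiment_id: str, treatment_pct: int = 50) -> str:
    """Hash the customer and experiment ids into a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{experiment_id}:{customer_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < treatment_pct else "control"


print(assign_group("customer-123", "cart-page-redesign"))
```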

Because A/B tests are conducted on actual customers, an ineffective feature can negatively impact sales by blocking customers in the purchase funnel. Our first challenge was to quickly catch and automatically stop tests that could damage our business.

The second challenge we tackled was engineering efficiency in launching A/B test monitoring formulas. It took at least two weeks and significant engineering resources to implement the complex monitoring formulas requested by data scientists. Examples of these jobs include monitoring each A/B test for misassigned groups, guardrail latency, flat exposure logs, no traffic, and more. Furthermore, engineers needed to restart the monitoring job whenever test queries were modified midway through an experiment.

Lastly, our experiment platform handles 3 billion events and a thousand live experiments per day. As Coupang expands internationally, we must scale out to service multiple regions, app versions, and app permutations. These additions greatly increased the engineering complexity of calculating metrics for all our A/B tests in an efficient manner.

Previously, the test owner was notified if buyer conversion was negatively impacted by a significant margin with p-value ≤ 0.001. If the owner took no action within 12 hours, the A/B test was automatically stopped. Although this system worked, we wanted to detect bad A/B tests in critical domains faster and stop them automatically.

At first, we created a force-stop system that ran as a batch job every 4 hours to stop bad tests. However, 4 hours was still a lengthy wait, especially with potential sales on the line. Also, this system still relied on manual intervention from test owners.

To make decisions faster and automatically, we launched a mission-critical circuit breaker that detects bad experiments in real time and stops them from hurting conversion rates in our main domains.

In the circuit breaker architecture, each Spark job calculates metrics every 2 minutes and stores them in S3 or MySQL. We decided on a 2-minute interval because of streaming job initialization times. Based on its configuration, the circuit breaker filters and deduplicates data to reduce computation. If the circuit breaker metric is negative by a significant margin with p-value ≤ 0.001 for five consecutive intervals, the A/B test is automatically stopped and an alert is sent to the test owner. With the circuit breaker, negative A/B tests are terminated without any manual intervention in approximately 10 minutes.
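A simplified sketch of that decision rule, using the thresholds described above with a one-sided two-proportion z-test (the data layout and the choice of statistical test are assumptions for illustration, not the exact production implementation):

```python
# Simplified circuit-breaker decision: stop the test when the treatment
# conversion rate is significantly worse (one-sided p <= 0.001) for five
# consecutive 2-minute intervals.
from math import sqrt
from statistics import NormalDist

P_THRESHOLD = 0.001
CONSECUTIVE_LIMIT = 5


def negative_p_value(conv_t: int, n_t: int, conv_c: int, n_c: int) -> float:
    """One-sided p-value that the treatment converts worse than control."""
    p_t, p_c = conv_t / n_t, conv_c / n_c
    pooled = (conv_t + conv_c) / (n_t + n_c)
    se = sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    if se == 0:
        return 1.0
    z = (p_t - p_c) / se
    return NormalDist().cdf(z)  # small when treatment is clearly worse


def should_stop(interval_metrics: list) -> bool:
    """interval_metrics: (conv_treat, n_treat, conv_control, n_control) per interval."""
    streak = 0
    for conv_t, n_t, conv_c, n_c in interval_metrics:
        worse = negative_p_value(conv_t, n_t, conv_c, n_c) <= P_THRESHOLD
        streak = streak + 1 if worse else 0
        if streak >= CONSECUTIVE_LIMIT:
            return True
    return False
```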

Figure 5 shows one example where the circuit breaker prevented a significant drop in sales by suspending an A/B test within a few minutes. Without the circuit breaker, conversion rates in the purchase funnel may have dropped by approximately 20% on the cart page for at least 4 hours.

Our next challenge was to add new monitoring jobs without consuming large amounts of engineering resources. To maximize efficiency and flexibility, we designed a query-based monitoring system that exposes all metrics data from MySQL to a separate monitoring solution. By leveraging this new monitoring system, monitoring jobs with complex formulas could be transformed into simple queries.

The detailed architecture of the monitoring system can be found in Figure 6. First, the monitoring solution periodically collects data from the metric-exposer, which contains A/B test metric data. Then, the rule-based monitoring system runs queries every minute to detect A/B tests that meet certain criteria. The rule-based monitoring system passes the detected A/B tests to the message-generator, which acts according to the user config, such as notifying the test owners or halting the A/B test.

With the new monitoring system, engineers can write a simple query like the one below to add a monitoring alert.
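As an illustration (the table names, columns, and registration format below are assumptions, not the actual metric schema), a rule that flags experiments receiving no traffic might be expressed as a query over the exposed metric tables:

```python
# Illustrative monitoring rule: a query over the exposed A/B metric data plus
# a hypothetical registration payload. Names and schema are assumptions.
NO_TRAFFIC_RULE = """
SELECT experiment_id
FROM ab_test_metrics
WHERE window_start >= NOW() - INTERVAL 30 MINUTE
GROUP BY experiment_id
HAVING SUM(exposures) = 0
"""

# The real system wires the query, schedule, and action (notify or halt)
# through the rule-based monitoring service; this payload is only a sketch.
monitoring_rule = {
    "name": "no-traffic-alert",
    "query": NO_TRAFFIC_RULE,
    "schedule": "every 1 minute",
    "action": "notify_test_owner",
}
```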

Finally, to accommodate our international expansion and business diversification, we employed Apache ZooKeeper to reliably serve these needs.
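As a rough illustration of the pattern, a service could watch a per-region configuration znode with a ZooKeeper client such as kazoo; the znode path, payload format, and ensemble addresses below are assumptions, not our production setup:

```python
# Hypothetical sketch: watch a per-region experiment-config znode and reload
# settings whenever it changes. Paths and payloads are illustrative only.
import json

from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # assumed ensemble
zk.start()

CONFIG_PATH = "/experiments/config/kr"  # assumed per-region znode


@zk.DataWatch(CONFIG_PATH)
def on_config_change(data, stat):
    """Called whenever the experiment config znode changes."""
    if data is None:
        return
    config = json.loads(data.decode("utf-8"))
    print(f"loaded {len(config.get('experiments', []))} experiments, "
          f"version {stat.version}")
```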

We want to prepare for our international customers and business endeavors by improving our experiment implementation. Our next goal is to reduce implementation complexity by developing feature flags, dynamic config override, and dynamic targeting.

In this post, we discussed the improvements we made to our data platform over the past years to handle our ever-growing customer base and data volume. Stay tuned for part 2 of this series, which will discuss our analytics platform in more detail.

If building a data platform at a large data-driven company like Coupang excites you, see our open positions.
