Zendesk's migration from DynamoDB to MySQL and S3

Zendesk has cut its data storage costs by more than 80% by migrating from DynamoDB to a combination of MySQL and Amazon S3. Escalating costs under the former system drove the company to seek a more economical and scalable storage solution. The reduction addresses the financial sustainability of the system and improves operational efficiency, offering a benchmark for similar enterprises facing high data management costs.

Initial use of DynamoDB

Zendesk initially employs DynamoDB for storing event-stream data, taking advantage of its managed-service capabilities, which include built-in security, backup and restore, and in-memory caching. Although the platform works well at first, its costs escalate as the customer base expands. The need for more advanced querying, which requires Global Secondary Indexes (GSIs), compounds these costs and makes the system financially untenable.

In response to rising costs, Zendesk shifts to a provisioned billing model, initially reducing expenses by 50%. Despite this considerable reduction, the growing demands of the architecture render the solution unsustainable, compelling Zendesk to explore other options.

Zendesk evaluates multiple technologies, including S3, Hudi, Elasticsearch, and MySQL, seeking a combination of functionality and cost-efficiency. The team rules out Hudi because of its operational complexity and an unacceptable 24-hour delay in data availability. It likewise dismisses Elasticsearch because its costs are on par with those of DynamoDB, offering no financial advantage.

Adoption of MySQL and S3

After thorough consideration, Zendesk opts for a tiered storage solution involving MySQL and S3. MySQL serves as a buffer for logs arriving from Apache Kafka and also stores metadata, while S3 handles large-scale data storage in economical batches of 10,000 logs per file. Logs older than four hours are systematically purged from the MySQL buffer to maintain efficiency and cost-effectiveness. This architecture supports data retrieval primarily on a chronological basis, balancing accessibility against cost.
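The buffer-and-batch flow described above can be sketched with in-memory Python structures standing in for MySQL and S3. This is only an illustration under stated assumptions: the class name, the file key scheme, the metadata layout, and the choice to purge only rows that have already been written to a file are hypothetical and are not details of Zendesk's implementation. Only the batch size and the four-hour retention come from the article.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

BATCH_SIZE = 10_000               # logs per S3 file, per the article
BUFFER_TTL = timedelta(hours=4)   # buffer retention, per the article


@dataclass
class TieredStore:
    """In-memory stand-in: `buffer` plays MySQL, `files` plays S3 (hypothetical)."""
    buffer: list = field(default_factory=list)    # (timestamp, log) rows, oldest first
    files: dict = field(default_factory=dict)     # file key -> list of logs
    metadata: list = field(default_factory=list)  # (file key, first_ts, last_ts)
    flushed: int = 0                              # buffer rows already written to a file

    def append(self, ts: datetime, log: dict) -> None:
        """Buffer one log; assumes timestamps arrive in non-decreasing order."""
        self.buffer.append((ts, log))
        if len(self.buffer) - self.flushed >= BATCH_SIZE:
            self._flush()

    def _flush(self) -> None:
        """Write the oldest unflushed batch to a new file and record its time range."""
        batch = self.buffer[self.flushed:self.flushed + BATCH_SIZE]
        key = f"logs/{len(self.files):08d}.json"  # hypothetical key scheme
        self.files[key] = [log for _, log in batch]
        self.metadata.append((key, batch[0][0], batch[-1][0]))
        self.flushed += len(batch)

    def purge(self, now: datetime) -> None:
        """Drop buffered rows older than the TTL, but only if already flushed."""
        cutoff = now - BUFFER_TTL
        drop = 0
        while drop < self.flushed and self.buffer[drop][0] < cutoff:
            drop += 1
        del self.buffer[:drop]
        self.flushed -= drop
```

Keeping unflushed rows in the buffer regardless of age is a design choice of this sketch, made so that a purge can never discard data that has not yet reached a file.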

Queries begin by consulting the MySQL metadata table and then run parallel S3 Select queries against the files that the metadata identifies. This setup significantly reduces data retrieval times and streamlines the querying process.
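A rough sketch of this two-step lookup follows. The metadata rows, file keys, and helper names are hypothetical, and a dictionary stands in for S3 so that the example runs locally; in a real system the per-file step would be an S3 Select request rather than a list filter.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical metadata rows: (file key, first_ts, last_ts), epoch seconds.
METADATA = [
    ("logs/a.json", 0, 99),
    ("logs/b.json", 100, 199),
    ("logs/c.json", 200, 299),
]

# Stand-in for file contents in S3 (assumed layout, for illustration only).
FAKE_S3 = {
    "logs/a.json": [{"ts": 50, "user_id": 1}],
    "logs/b.json": [{"ts": 150, "user_id": 2}, {"ts": 180, "user_id": 1}],
    "logs/c.json": [{"ts": 250, "user_id": 3}],
}


def files_for_range(start, end):
    """Step 1: use the metadata to find files whose time range overlaps the query."""
    return [key for key, first, last in METADATA if first <= end and last >= start]


def select_from_file(key, start, end):
    """Step 2 (per file): stand-in for an S3 Select query filtering by timestamp."""
    return [row for row in FAKE_S3[key] if start <= row["ts"] <= end]


def query(start, end):
    """Run the per-file selects in parallel and merge the results by timestamp."""
    keys = files_for_range(start, end)
    with ThreadPoolExecutor(max_workers=8) as pool:
        parts = list(pool.map(lambda k: select_from_file(k, start, end), keys))
    rows = [row for part in parts for row in part]
    return sorted(rows, key=lambda r: r["ts"])
```

Because each file covers a distinct time range, the metadata lookup keeps the fan-out small for chronological queries: only files overlapping the requested window are touched.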

Challenges and innovations in querying

Querying challenges

Shane Hender, Group Tech Lead at Zendesk, identifies a key challenge in the new system: querying logs based on non-timestamp fields such as user IDs. Such queries potentially require scanning all relevant S3 data within a specified time range, complicating parallel processing and impacting performance.

Advanced data structures

To handle multi-field queries efficiently, Zendesk incorporates Bloom filters and Count-Min Sketch data structures. These structures help identify which S3 files are relevant to a query, significantly reducing the need for data duplication and improving query performance. The team stores serialized versions of these structures in an additional table, which is consulted to determine the specific S3 files to query, thus optimizing the querying process for multiple fields.
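To illustrate how a per-file Bloom filter can prune the set of S3 files to scan, here is a minimal sketch. The filter size, hash construction, table layout, and function names are assumptions made for the example and do not describe Zendesk's actual implementation. A Count-Min Sketch, which estimates frequencies rather than membership, is not shown.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3, bits=None):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        size = (num_bits + 7) // 8
        self.bits = bytearray(bits) if bits is not None else bytearray(size)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

    def to_bytes(self):
        """Serialized form, suitable for storing in a table column."""
        return bytes(self.bits)


# Hypothetical "additional table": file key -> serialized filter of user IDs.
filter_table = {}
for key, user_ids in {"logs/a.json": [1, 2], "logs/b.json": [3]}.items():
    bf = BloomFilter()
    for uid in user_ids:
        bf.add(uid)
    filter_table[key] = bf.to_bytes()


def candidate_files(user_id):
    """Return files that may contain the user; files ruled out are skipped."""
    return [key for key, blob in filter_table.items()
            if BloomFilter(bits=blob).might_contain(user_id)]
```

A file that the filter rules out definitely holds no matching logs, so it can be skipped safely. A file that the filter lets through may still be a false positive, so the actual query against that file remains the source of truth.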

Post-migration cost and performance metrics

Post-migration, Zendesk has cut its storage costs to less than 20% of what it paid for provisioned DynamoDB. MySQL, specifically Amazon Aurora, now accounts for more than 90% of these reduced costs, with S3 and S3 Select comprising less than 10%. The redesigned system delivers query latencies between 200 and 500 milliseconds, with occasional spikes into seconds. Zendesk continues to refine these metrics to achieve even greater efficiency and cost-effectiveness, setting a precedent for data storage strategies in the tech industry.
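As a back-of-the-envelope check, the percentages reported in this article can be chained together. The baseline of 100 units is an arbitrary normalization, and the calculation assumes the workload is comparable across all three stages, which the article does not state. Only the ratios come from the text.

```python
# Normalized cost units; the baseline value is an assumption, not a real figure.
baseline = 100.0                       # cost before the provisioned billing change
provisioned = baseline * (1 - 0.50)    # provisioned billing cut expenses by 50%
migrated_ceiling = provisioned * 0.20  # post-migration: under 20% of provisioned

# Split of the post-migration cost, evaluated at the stated bounds.
aurora_share = migrated_ceiling * 0.90      # MySQL accounts for more than 90%
s3_share = migrated_ceiling - aurora_share  # S3 and S3 Select: less than 10%

print(f"provisioned DynamoDB: {provisioned:.0f} units")
print(f"post-migration ceiling: {migrated_ceiling:.0f} units")
print(f"of which MySQL at the 90% bound: {aurora_share:.0f} units")
```

Under these assumptions, the post-migration ceiling works out to about a tenth of the cost before the billing change.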