Figma has announced FigCache, an in-house Redis proxy service that centralizes and streamlines the company's Redis access tier. As described by software engineer Kevin Lin, the project targets a range of operational problems as much as raw performance. This article covers the motivations behind FigCache, its design, the migration, and what the approach implies for Figma's caching strategy.
A Growing Threat to Site Availability
Figma built FigCache because scalability and reliability gaps in its Redis platform were increasingly threatening site availability. Kevin Lin, the software engineer behind the project, explains that connection volumes were reaching hard limits, and that rapid scale-ups of client services triggered thundering-herd connection failures that degraded availability. The team first tried service-specific workarounds, but these only masked the underlying structural problems.
The Decision to Build: Overcoming Limitations
Figma chose to build an in-house proxy rather than adopt an existing open-source one because of the limitations of the available options. Lin notes that existing solutions lacked the semantic awareness needed to implement runtime guardrails and define custom commands. Figma also had to support a fragmented base of existing clients, which called for a proprietary layer that handles client variants transparently. Building FigCache gave the team a solution tailored to those needs, with the extensibility and control they wanted.
Technical Insights: FigCache's Design and Architecture
FigCache is a stateless service built on ResPC, a Go library providing an RPC framework over the Redis Serialization Protocol (RESP). The proxy separates a frontend layer for connection management and protocol-aware command parsing from a backend layer for connection multiplexing and command execution against upstream clusters. This separation enables new behaviors to be introduced at either layer without disrupting the other, making the system highly extensible.
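To make the frontend's protocol-aware parsing concrete, here is a minimal sketch of decoding one RESP request, assuming the client sends the standard framing of an array of bulk strings. The function names `encode_command` and `parse_command` are hypothetical illustrations and are not part of ResPC, whose API the source does not describe.

```python
from typing import List, Tuple


def encode_command(*args: bytes) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    out = [b"*%d\r\n" % len(args)]
    for arg in args:
        out.append(b"$%d\r\n%s\r\n" % (len(arg), arg))
    return b"".join(out)


def _read_line(buf: bytes, pos: int) -> Tuple[bytes, int]:
    """Return the bytes up to the next CRLF and the position just past it."""
    end = buf.index(b"\r\n", pos)  # raises ValueError if no CRLF is found
    return buf[pos:end], end + 2


def parse_command(buf: bytes) -> Tuple[List[bytes], int]:
    """Parse one RESP array of bulk strings from the start of buf.

    Returns (arguments, bytes_consumed). Raises ValueError on malformed or
    incomplete input.
    """
    header, pos = _read_line(buf, 0)
    if not header.startswith(b"*"):
        raise ValueError("expected RESP array")
    count = int(header[1:])
    args: List[bytes] = []
    for _ in range(count):
        size_line, pos = _read_line(buf, pos)
        if not size_line.startswith(b"$"):
            raise ValueError("expected bulk string")
        size = int(size_line[1:])
        data = buf[pos:pos + size]
        if len(data) != size or buf[pos + size:pos + size + 2] != b"\r\n":
            raise ValueError("truncated bulk string")
        args.append(data)
        pos += size + 2
    return args, pos
```

Once a request is decoded into a command name and arguments, a proxy can inspect it before forwarding, which is the kind of semantic awareness that guardrails and custom commands depend on.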
One of the more unusual design choices is the backend configuration, which is expressed as a Starlark program evaluated at runtime. Operators can change routing logic, key-prefix-based rejection rules, and command-type splitting purely through configuration, without redeploying server binaries. Teams facing similar problems may find this dynamic configuration model worth studying.
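The source does not show Figma's actual Starlark configuration, so the following is only a sketch of what such rules might compute. It is written in Python, of which Starlark is a dialect. Every name here (`REJECTED_PREFIXES`, `PREFIX_ROUTES`, `route`, the cluster names) is hypothetical, and treating command-type splitting as a read/write split is an assumption.

```python
from typing import Optional

# Hypothetical rule set. In a Starlark-driven proxy, rules like these would be
# loaded and evaluated at runtime rather than compiled into the server binary.
REJECTED_PREFIXES = ("tmp:", "debug:")
PREFIX_ROUTES = {"session:": "cache-sessions"}
DEFAULT_CLUSTER = "cache-main"
READ_COMMANDS = {"GET", "MGET", "EXISTS", "TTL"}


def route(command: str, key: str) -> Optional[str]:
    """Return the upstream target for a command, or None to reject it."""
    # Key-prefix-based rejection rule.
    if key.startswith(REJECTED_PREFIXES):
        return None
    # Key-prefix-based routing, falling back to a default cluster.
    cluster = DEFAULT_CLUSTER
    for prefix, target in PREFIX_ROUTES.items():
        if key.startswith(prefix):
            cluster = target
            break
    # Command-type splitting (assumed here to mean reads versus writes).
    role = "replica" if command.upper() in READ_COMMANDS else "primary"
    return f"{cluster}/{role}"
```

Because the rules are plain data plus a small function, changing a prefix or adding a route is an edit to configuration rather than to the proxy itself.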
Addressing Redis Cluster Limitations
FigCache also handles a class of problems that Redis Cluster normally surfaces to clients as errors. It intercepts eligible multi-shard pipelines and executes them internally as a parallelized scatter-gather, so those errors never reach the application. Clients can therefore issue such pipelines without knowing how keys are distributed across shards.
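A scatter-gather of this kind can be sketched as follows. The slot calculation follows the Redis Cluster specification (CRC16 of the key, or of its hash tag, modulo 16384). The `scatter_gather` function, the `shard_for_slot` mapping, and the `fetch` callback are hypothetical stand-ins for illustration, not FigCache's implementation.

```python
import binascii
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence

SLOT_COUNT = 16384  # number of hash slots in Redis Cluster


def key_slot(key: bytes) -> int:
    """Redis Cluster slot: CRC16 (XMODEM) of the key, or of its hash tag."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:  # only a non-empty {tag} counts
            key = key[start + 1:end]
    return binascii.crc_hqx(key, 0) % SLOT_COUNT


def scatter_gather(
    keys: Sequence[bytes],
    shard_for_slot: Callable[[int], str],
    fetch: Callable[[str, List[bytes]], List[object]],
) -> List[object]:
    """Split a multi-key read by shard, run shards in parallel, restore order."""
    # Scatter: remember which positions in the request belong to each shard.
    by_shard: Dict[str, List[int]] = {}
    for index, key in enumerate(keys):
        by_shard.setdefault(shard_for_slot(key_slot(key)), []).append(index)

    results: List[object] = [None] * len(keys)
    with ThreadPoolExecutor(max_workers=max(1, len(by_shard))) as pool:
        futures = {
            shard: pool.submit(fetch, shard, [keys[i] for i in indexes])
            for shard, indexes in by_shard.items()
        }
        # Gather: write each shard's replies back to their original positions.
        for shard, future in futures.items():
            for i, value in zip(by_shard[shard], future.result()):
                results[i] = value
    return results
```

The caller sees one ordered list of replies, as if the keys had lived on a single node. A real proxy would also need to decide which pipelines are eligible and how to report a failure on one shard, which this sketch leaves out.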
Migration Strategy and Lessons Learned
The migration was designed to be reversible at every stage: feature flags allowed instant reversion without code changes or binary deployments. For large workloads such as Figma's main API service, traffic was shifted incrementally across independent domains rather than switched all at once. Moving in small, reversible steps limits the impact of any single problem, a lesson that applies to other teams planning similar migrations.
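The per-domain, flag-gated cutover can be sketched in a few lines. The domain names and the `choose_path` helper are hypothetical. In practice the flags would come from a feature-flag service so that they can be flipped without a deployment.

```python
from typing import Dict


def choose_path(domain: str, flags: Dict[str, bool]) -> str:
    """Pick the Redis access path for a request based on its domain's flag."""
    # Domains without an explicit flag stay on the legacy direct path.
    return "proxy" if flags.get(domain, False) else "direct"


# Hypothetical flag state: one switch per independent traffic domain.
flags = {"comments": True, "files": False}
comments_path = choose_path("comments", flags)  # routed through the proxy
files_path = choose_path("files", flags)        # still connects directly

# Reverting a domain is a flag change, not a code change or binary deployment.
flags["comments"] = False
reverted_path = choose_path("comments", flags)
```

Defaulting unknown domains to the direct path keeps the legacy behavior as the fallback, so a missing or mistyped flag cannot silently move traffic onto the new tier.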
Broader Implications and Future Directions
FigCache goes beyond rethinking how data flows into Redis: it centralizes the Redis access tier itself. This centralization reduces coupling, improves scalability, and isolates components from one another's failure modes. Figma's decision to support alternative backends such as AWS MemoryDB and its own Postgres stack behind the same RESP-based interface also points to a more flexible and adaptable caching strategy.
The Choice Between Build and Buy
Whether to build or buy this kind of infrastructure is a common question for engineering teams. Sneha Wasankar, writing on dev.to, notes that the choice of cache-aside, write-through, or write-behind patterns often matters less than the reliability of the infrastructure beneath them. Figma's post argues that at sufficient scale, the infrastructure itself becomes the product, which is the case for investing in a robust and extensible solution.
Conclusion: A Thoughtful Takeaway
Building a proxy that is transparent to existing clients while being extensible enough to absorb future requirements is a hard constraint to satisfy. The approach may not generalize beyond Figma's specific combination of languages, deployment patterns, and operational history, but its design choices and migration strategy offer useful lessons for teams working to improve caching and overall system reliability at scale.