Snowflake is the best thing to happen to data professionals since Excel
Snowflake has transformed enterprise data management, arguably the most significant leap since Excel. It offers an on-demand data warehouse with a pay-as-you-go model, seamless scalability from zero to infinity, and no need for configuration or tuning, enabling non-techies, for the first time in history, to provision and use an analytical database completely on their own. Affordable for most workloads, it leverages the familiar 50-year-old SQL standard with transactional semantics. At the time, the only comparable technology was BigQuery, which actually predated Snowflake by a few years; because it was tightly bound to GCP, it was never as widely known, though no less successful.
A significant number of workloads and datasets cannot be shipped to Snowflake
Despite its strengths, Snowflake has limitations. Arguably the main one is that it cannot run workloads locally: all data must be shipped to Snowflake’s own cloud account. A significant number of datasets and workloads require local processing for diverse reasons: testing data transformation logic during development cycles, data residency and sovereignty constraints, credit overruns, prohibitively high costs for certain workloads or datasets, the impracticality of shipping certain datasets or workloads to Snowflake’s cloud account, shift-left initiatives, operating in clouds or cloud regions where the Snowflake service is not yet available, prepaid cloud credits that cannot be spent on Snowflake, and more. While these may seem niche, Snowflake’s ~$4B ARR, projected to reach ~$10B by decade’s end, underscores the growing significance of these issues and the need to run Snowflake workloads locally.
Historically, it was impractical to build a system that could run Snowflake workloads natively
Until recently, building a system to run Snowflake workloads natively was impractical, demanding substantial capital and effort. Snowflake itself has invested ~$1B and a decade of engineering into building out its platform. However, recent advancements have made this feasible for a small, dedicated team. This is exactly what Embucket is working on.
The following recent advancements put it within reach for our small team
- Apache Iceberg and Table Buckets: Apache Iceberg, alongside AWS’s and Cloudflare’s “table buckets,” has commoditized the data storage layer. This functionality is being integrated into cloud storage foundations, eliminating the need for custom development, and is expected to be ubiquitous across cloud providers by next year.
- AI-Driven Engineering: Modern AI accelerates development, enabling a small team to build a wire-compatible implementation of well-documented software that has a sufficient number of examples. This dramatically reduces the resources required for the complex engineering work around wire compatibility. Embucket's system architecture is radically different from Snowflake's, so here we haven't seen much acceleration yet, but we are optimistic about what next year might bring with the newest models.
- Cloud Provider Hardware Innovations: Cloud providers are no longer at the mercy of Intel and AMD. Their custom-designed Arm-based monster compute instances deliver linearly priced performance, reducing reliance on traditional clustering and MPP and essentially marginalizing them from mainstream to niche. We are excited about the DataFusion-on-Ray and Ballista projects, but for the foreseeable future, we are fanatically single-instance only.
- Apache DataFusion and Rust: Apache DataFusion is emerging as the “PostgreSQL of analytical databases,” powering a new generation of data infrastructure startups (see the first sketch after this list). Rust, with its high productivity, AI compatibility, performance, and vibrant community, is the ideal language for data infrastructure. Unlike C/C++ (efficient but low-productivity) or JVM/scripting languages (high-productivity but inefficient), Rust bridges the gap. The Rust ecosystem around DataFusion, Arrow, Iceberg, and Parquet is thriving, shaping the future of data infrastructure.
- SlateDB and Stateless Architecture: Embucket leverages SlateDB for a diskless, stateless, single-process Lakehouse engine (a second sketch follows this list). This design simplifies operations, making it nearly foolproof for non-technical users. In the worst case, restarting an instance loses no committed data, nor even user preferences, approaching Snowflake’s ease of use.
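
To make the DataFusion point concrete, here is a minimal sketch of an in-process analytical query in Rust. The table name, file path, and column names are illustrative placeholders, not anything from Embucket; the calls shown (`SessionContext::new`, `register_parquet`, `sql`, `show`) are standard DataFusion APIs, though exact signatures vary by release.

```rust
// A whole analytical "deployment" in one binary: no server, cluster, or tuning.
// Assumed Cargo deps: datafusion, tokio (features "rt-multi-thread", "macros").
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // In-process session, the single-instance model described above.
    let ctx = SessionContext::new();

    // Expose a local Parquet file as a SQL table ("trips.parquet" is a placeholder).
    ctx.register_parquet("trips", "trips.parquet", ParquetReadOptions::default())
        .await?;

    // Plain SQL executed by a vectorized, Arrow-native engine.
    let df = ctx
        .sql("SELECT vendor, COUNT(*) AS n FROM trips GROUP BY vendor ORDER BY n DESC")
        .await?;
    df.show().await?;
    Ok(())
}
```

That a query engine of this quality ships as an embeddable library, rather than a server you operate, is precisely what lets a small team focus on the Snowflake-compatible layers above it.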
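
And a rough sketch of the diskless pattern SlateDB enables: a key-value store whose only durable state lives in an object store, so a restarted process recovers everything committed. This is illustrative of the architecture, not Embucket's actual code; the SlateDB API is still evolving, so method names and signatures may differ in current releases, and the in-memory object store stands in for S3.

```rust
// Diskless state: SlateDB persists everything to an object store, so losing
// the instance loses no committed data. API shown as in early SlateDB
// releases and may differ today. Assumed Cargo deps: slatedb, object_store, tokio.
use std::sync::Arc;

use object_store::{memory::InMemory, path::Path, ObjectStore};
use slatedb::db::Db;

#[tokio::main]
async fn main() {
    // InMemory stands in for S3 here; production would point at a bucket.
    let store: Arc<dyn ObjectStore> = Arc::new(InMemory::new());

    // All durable state lives under this object-store prefix.
    let db = Db::open(Path::from("/embucket/state"), store)
        .await
        .expect("open failed");

    // A hypothetical piece of engine state, e.g. a catalog pointer.
    db.put(b"catalog/current_version", b"42").await;

    let got = db.get(b"catalog/current_version").await.expect("read failed");
    assert_eq!(got.as_deref(), Some(b"42".as_slice()));

    db.close().await.expect("close failed");
}
```

Because the write-ahead log and SSTs live in the bucket rather than on local disk, "restart it and carry on" is a safe operational model, which is what makes the engine nearly foolproof for non-technical users.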
Embucket is being open sourced today and can already run several Snowflake DBT workloads natively
These advancements have enabled Embucket’s small team to build a data platform rapidly and on a modest budget. It already runs many Snowflake workloads natively with DBT support. While it’s not yet ready for production use, for adventurous Rustaceans it might be good enough for experimentation. The entire codebase is being open sourced today. If you are attending DataCouncil.ai this week, please come say hi at our booth.