Starburst

Last Updated: 14th June 2024
Analyst Coverage: Philip Howard

Starburst was incorporated in 2017 and is based in Boston, Massachusetts. It was formerly known as Project Flex. Its founders (Justin Borgman, Martin Traverso, Matthew Fuller, Dain Sundstrom, Kamil Bajda-Pawlikowski and David Phillips) previously helped create a predecessor product, the open-source MPP SQL query engine PrestoSQL, which is now known as Trino and which emerged from work at Facebook. Starburst is backed by institutional investors and in March 2022 conducted a financing round, raising $414 million at a valuation of $3.35 billion from ten investors including A16Z, Index Ventures and Salesforce Ventures. At the time of writing, the company has around 500 employees and over 250 corporate customers, including Sky, Bank of America, DoorDash, HSBC, Gilead, Expedia, Tesla, Comcast and Carrefour. It also has over 200 corporate partners, ranging from consulting firms to technology partners.

Company Info

Headquarters: 24 School St. 2nd Floor, Boston, Massachusetts 02108, USA

Starburst – A Data Fabric Foundation Technology

Last Updated: 5th March 2024
Mutable Award: Gold 2024

What is it?

Fig 01 - Starburst data lake analytics platform

Starburst offers an open data lakehouse built around its enhanced version of Trino, a distributed massively parallel processing (MPP) SQL query execution engine, to process petabyte-scale data at high concurrency. It should not be confused with a database: it is a query engine with a commercial layer of management and security functions on top of the core SQL engine.

Starburst supports optionality throughout the stack by offering a choice of open file formats (Parquet, ORC and Avro) and four popular table formats (Apache Iceberg, Delta Lake, Apache Hudi and Apache Hive). It connects to diverse data sources, whether these are within one or more cloud data lakes or databases, on-premises or hybrid. It supports a wide variety of data sources, including SQL databases like Oracle or PostgreSQL, as well as Snowflake, Redshift, BigQuery, Kafka and more. The company has over 50 connectors at present, with more in the pipeline.

Fig 02 - Starburst data lake analytics platform

Starburst includes a data catalogue and supports aspects of data governance, including data lineage. The catalogue is used to form a map of the data assets in an enterprise, which might comprise tables, database views or materialised views, presented as Data Products representing business objects like “customer”, “asset” or “product”. Its enhanced query engine acts as an overarching optimiser, deciding how best to parcel out user queries to the various underlying source systems. Users see a data marketplace of Data Products and can then submit queries, which the Starburst optimiser fulfils in the most efficient way that it can.

There are two product flavours: Starburst Galaxy is a fully managed SaaS offering while Starburst Enterprise is suitable for those needing a highly tailored offering with multiple deployment options between on-premises, hybrid, and multi-cloud.

Customer Quotes

“Starburst is an essential part of our overarching data mesh initiative that gives the user the flexibility to access data through a single point rather than having to go around to ten different data sources.”
Ritesh Ranjan, Lead Data Architect, Sky

“We’re on a journey to democratize data as much as possible, because there’s so much we don’t know and so many elements we have not tapped into. With Starburst, there’s so much more we can explore to drive decisions and insights.”
Sachin Gopalakrishnan Menon, Senior Director of Data, Priceline

What does it do?

Starburst provides a key component of a data fabric or data mesh architecture, allowing queries to be executed in a distributed fashion across a wide variety of source systems or a centralised data lake, whether these are in one or more public or private clouds, on-premises, or a mixture of these. The technology handles the details of high-performance, high-concurrency query execution, while end users can access their data via familiar analytic products like Tableau, Power BI, ThoughtSpot, Looker and others. The product provides tools for building a data marketplace of Data Products, which can then be accessed by business consumers. It also offers enhanced query execution, for example providing substantial performance gains over open-source Trino for memory-intensive queries.
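As a rough illustration of what a single query spanning several source systems looks like, the sketch below uses Python's built-in sqlite3 module as a stand-in. It is an analogy only and is not Starburst or Trino: two separately attached in-memory databases play the role of two source systems, and the names crm, sales, customer and orders are hypothetical.

```python
import sqlite3

# Two independent in-memory SQLite databases stand in for two source systems.
# "crm" and "sales", and every table and column name, are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS sales")

conn.execute("CREATE TABLE crm.customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE sales.orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO crm.customer VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex")])
conn.executemany("INSERT INTO sales.orders VALUES (?, ?)",
                 [(1, 120.0), (1, 80.0), (2, 50.0)])

# One SQL statement joins data held in both "systems"; neither table is
# copied into the other database first.
FEDERATED_QUERY = """
SELECT c.name, SUM(o.amount) AS total
FROM crm.customer AS c
JOIN sales.orders AS o ON o.customer_id = c.id
GROUP BY c.name
ORDER BY c.name
"""


def totals_by_customer(connection):
    """Run the cross-source join and return (name, total) rows."""
    return connection.execute(FEDERATED_QUERY).fetchall()


print(totals_by_customer(conn))
```

The qualified names (source.table) are what let one statement address more than one system. A distributed engine applies the same idea at far larger scale and decides which parts of the work to push down to each source.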

The technology allows for fine-grained security and complete audit trails, and supports data domain ownership by data teams or business users. There is column masking and row-level filter capability. Data lineage can be visualised so users can trace the sources of data.
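The idea behind column masking and row-level filtering can be sketched in a few lines of plain Python. This is a conceptual illustration under stated assumptions, not Starburst's policy syntax or implementation: the roles, the policy structure and the mask_email helper are all hypothetical.

```python
def mask_email(value):
    """Hide the local part of an email address, keeping only the domain."""
    _, _, domain = value.partition("@")
    return "***@" + domain


# Hypothetical policies keyed by role: a row filter decides which rows are
# visible, and column masks rewrite sensitive values in the rows that remain.
POLICIES = {
    "analyst_uk": {
        "row_filter": lambda row: row["country"] == "UK",
        "column_masks": {"email": mask_email},
    },
    "admin": {
        "row_filter": lambda row: True,
        "column_masks": {},
    },
}


def apply_policy(rows, role):
    """Return the rows a role may see, with masked columns rewritten."""
    policy = POLICIES[role]
    visible = []
    for row in rows:
        if not policy["row_filter"](row):
            continue  # row-level filter: drop rows the role may not see
        masked = dict(row)  # copy so the source data is left untouched
        for column, mask in policy["column_masks"].items():
            masked[column] = mask(masked[column])
        visible.append(masked)
    return visible


CUSTOMERS = [
    {"name": "Ann", "email": "ann@example.com", "country": "UK"},
    {"name": "Bob", "email": "bob@example.org", "country": "US"},
]

print(apply_policy(CUSTOMERS, "analyst_uk"))
print(apply_policy(CUSTOMERS, "admin"))
```

In a query engine such rules would be enforced centrally at query time, so every tool connecting through the engine sees the same filtered and masked result.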

Furthermore, Starburst has expanded its capabilities to offer PyStarburst and real-time streaming ingestion. With PyStarburst, users can leverage the standard Python DataFrame API to create complex transformation pipelines, build data apps, and interact with data using Python without moving data to the system where their application code runs. With streaming ingestion, customers can leverage Kafka to hydrate their data lake in near real time, ensuring applications have the most up-to-date insights for their users. Support for fully managed solutions, such as Confluent Cloud, is planned.
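One way a DataFrame API can avoid moving data to the application is to record operations lazily and translate them into SQL that runs inside the query engine. The toy class below sketches that general pattern. It is not PyStarburst's API; the LazyFrame class, its methods and the table name lake.sales.orders are hypothetical, and the assumption that DataFrame calls are compiled to SQL is an illustration rather than a description of the product's internals.

```python
class LazyFrame:
    """A toy lazy DataFrame: it records operations and only emits SQL."""

    def __init__(self, table, columns=None, predicates=None):
        self.table = table
        self.columns = list(columns or [])
        self.predicates = list(predicates or [])

    def select(self, *columns):
        # Return a new frame; nothing is fetched from the source here.
        return LazyFrame(self.table, columns, self.predicates)

    def filter(self, predicate):
        # Predicates accumulate and are later combined with AND.
        return LazyFrame(self.table, self.columns, self.predicates + [predicate])

    def to_sql(self):
        """Compile the recorded operations into a single SQL statement."""
        projection = ", ".join(self.columns) if self.columns else "*"
        sql = f"SELECT {projection} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.predicates)
        return sql


orders = LazyFrame("lake.sales.orders")
big_orders = orders.filter("amount > 100").select("customer_id", "amount")
print(big_orders.to_sql())
```

Because only the generated SQL text leaves the application, filtering and projection happen where the data lives and only the final result needs to travel back.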

Lastly, Starburst will also be available as an embedded solution in a new Dell appliance, helping organisations gain all the benefits of Starburst across their on-premises and cloud data estate.

Why should you care?

Starburst provides a significant component of a data fabric architecture, as it allows for true distributed queries across multiple sources without moving complete copies of the source data around. Depending on the data and analytics strategy of an organisation, Starburst can arguably either complement or replace existing data warehouses (such as Snowflake or Teradata) and especially data lake environments (such as Databricks). For organisations looking to modernise their Hadoop instances, Starburst can improve Hadoop performance by replacing Apache Hive or Apache Impala, or it can serve as a full-fledged open data lakehouse in a new cloud-centric data lake approach built with Starburst and Iceberg or other open table formats. In general, however, it sits alongside existing data warehouses or data lakes.

Some Starburst customers report impressive results. DoorDash, a US food delivery firm, initially relied on separate Snowflake and Databricks data sources but implemented Starburst as a single point of data access to both. It runs over 1,000 queries across 250 TB of distributed data each day, and reports a tenfold improvement in query runtime as well as reduced costs.

The bottom line

Starburst provides a key element of a distributed data fabric: a true distributed SQL query engine that can execute queries across multiple source systems, whether they are on-premises or in one or more clouds, at massive scale and high concurrency. The company's rapid growth and the high valuation at its last financing round are a testament to the fact that many prestigious customers are deriving value from its technology.
