Starburst
Analyst Coverage: Philip Howard
Starburst was incorporated in 2017 and is based in Boston, Massachusetts. It was formerly known as Project Flex, and its founders (Justin Borgman, Martin Traverso, Matthew Fuller, Dain Sundstrom, Kamil Bajda-Pawlikowski and David Phillips) previously helped create a predecessor product, the open-source MPP SQL query engine PrestoSQL, which is now known as Trino. This product emerged from work at Facebook. Starburst is backed by institutional investors and in March 2022 conducted a financing round, raising $414 million at a valuation of $3.35 billion from ten investors including A16Z, Index Ventures and Salesforce Ventures. The company has around 500 employees at the time of writing and has accumulated over 250 corporate customers. These include Sky, Bank of America, DoorDash, HSBC, Gilead, Expedia, Tesla, Comcast and Carrefour. The company has over 200 corporate partners, from consulting firms to technology partners.
Starburst – A Data Fabric Foundation Technology
Last Updated: 5th March 2024
Mutable Award: Gold 2024
Starburst offers an open data lakehouse that anchors on its enhanced offering of the Trino distributed massively parallel processing (MPP) SQL query execution engine to process petabyte scale data at high concurrency. It should not be confused with a database. It provides a commercial layer of management and security functions over the core SQL query engine.
Starburst supports optionality throughout the stack by providing a choice of open file formats such as Parquet, ORC, and Avro, and of all four popular table formats: Apache Iceberg, Delta Lake, Apache Hudi, and Apache Hive. It connects to diverse data sources, whether these are within one or more cloud data lakes or databases, on-premises or hybrid. It supports a wide variety of data sources including SQL databases like Oracle or PostgreSQL, Snowflake, Redshift, BigQuery, Kafka and more. The company has over 50 connectors at present, with more in the pipeline.
Starburst has a form of data catalogue and even has support for aspects of data governance including data lineage. This is used to form a map of the data assets in an enterprise, which might comprise tables, database views, or materialised views in the form of Data Products of business objects like “customer”, “asset” or “product”. Its enhanced query engine is essentially a sort of uber-optimiser, deciding how best to parcel out user queries to the various underlying source systems. Users see a kind of data marketplace of data products, and can then make inquiries which the Starburst optimiser fulfils in the most efficient way that it can.
There are two product flavours: Starburst Galaxy is a fully managed SaaS offering while Starburst Enterprise is suitable for those needing a highly tailored offering with multiple deployment options between on-premises, hybrid, and multi-cloud.
Customer Quotes
“Starburst is an essential part of our overarching data mesh initiative that gives the user the flexibility to access data through a single point rather than having to go around to ten different data sources.”
Ritesh Ranjan, Lead Data Architect, Sky
“We’re on a journey to democratize data as much as possible, because there’s so much we don’t know and so many elements we have not tapped into. With Starburst, there’s so much more we can explore to drive decisions and insights.”
Sachin Gopalakrishnan Menon, Senior director of data, Priceline
Starburst provides a key component of a data fabric or data mesh architecture, allowing queries to be executed in a distributed fashion across a wide variety of source systems or a centralised Data Lake, whether these are in one or more public or private clouds or are on-premise, or a mixture of these. The technology deals with the nitty gritty of high-performance and high-concurrency query execution, while end users can access their data via familiar analytic products like Tableau, PowerBI, ThoughtSpot, Looker, and others. The product provides tools to allow a data marketplace of Data Products to be built, and this marketplace can then be accessed by business consumers. The product offers enhanced query execution, for example providing substantial performance gains over open-source Trino for memory-intensive queries.
The technology allows for fine-grained security and complete audit trails, and supports data domain ownership by data teams or business users. There is column masking and row-level filter capability. Data lineage can be visualised so users can trace the sources of data.
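To make column masking and row-level filtering concrete, here is a minimal sketch in plain Python. It does not use Starburst's own policy syntax or API; the names apply_policy, mask_email and the sample customer rows are hypothetical and chosen purely for illustration. The idea shown is simply that a row filter decides which records a user may see, while a column mask rewrites sensitive values in the records that remain.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


def mask_email(value: object) -> str:
    """Hide the local part of an email address, keeping only the domain."""
    _, sep, domain = str(value).partition("@")
    return "***@" + domain if sep else "***"


def apply_policy(
    rows: Iterable[Row],
    row_filter: Callable[[Row], bool],
    column_masks: Dict[str, Callable[[object], object]],
) -> List[Row]:
    """Return only the rows the filter allows, with masked columns rewritten."""
    visible = []
    for row in rows:
        if not row_filter(row):
            continue  # row-level filter: the user never sees this record
        visible.append(
            {
                name: column_masks[name](value) if name in column_masks else value
                for name, value in row.items()
            }
        )
    return visible


# Hypothetical sample data and policy: an EU analyst sees EU rows only,
# and email addresses are masked.
customers = [
    {"name": "Ada", "region": "EU", "email": "ada@example.com"},
    {"name": "Bob", "region": "US", "email": "bob@example.com"},
]
eu_analyst_view = apply_policy(
    customers,
    row_filter=lambda r: r["region"] == "EU",
    column_masks={"email": mask_email},
)
```

In a real deployment such policies would be defined centrally and enforced by the query engine rather than in application code; the sketch only illustrates the effect on the result set.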
Furthermore, Starburst has expanded its capabilities to offer PyStarburst and real-time streaming ingestion. With PyStarburst, users can leverage the standard Python DataFrame API to create complex transformation pipelines, build data apps, and interact with data using Python without moving data to the system where their application code runs. With streaming ingestion, customers can leverage Kafka to hydrate their data lake in near real-time, ensuring applications have the most up-to-date insights for their users. Support for fully managed solutions, such as Confluent Cloud, is also planned.
Lastly, Starburst will also be available as an embedded solution in a new Dell appliance, helping organisations gain all the benefits of Starburst across their on-premises and cloud data estate.
Starburst provides a significant component of a data fabric architecture, as it allows for true distributed queries from multiple sources without moving complete copies of the source data around. Depending on the data and analytics strategy of an organisation, Starburst can arguably either complement or replace existing data warehouses (Snowflake or Teradata) and especially data lake environments (such as Databricks). For organisations looking to modernize their Hadoop instances, Starburst can optimise the performance of Hadoop by replacing Apache Hive or Apache Impala, or it can serve as a full-fledged open data lakehouse in a new cloud-centric data lake approach built with Starburst and Iceberg or other open table formats. However, in general, it sits alongside existing data warehouses or data lakes.
Some Starburst customers report impressive results. DoorDash, a US food delivery firm, initially relied on separate Snowflake and Databricks data sources but implemented Starburst as a single point of data access to both. They run over 1,000 queries across 250 TB of distributed data each day, reporting a ten times runtime speed improvement in their queries as well as reducing costs.
The bottom line
Starburst provides a key element of a distributed data fabric: a true distributed SQL query engine that can execute queries across multiple source systems, whether they are on-premises or in one or more clouds, at massive scale and high concurrency. Its rapid growth and high valuation at its last financing round are a testament to the fact that many prestigious customers are deriving value from its technology.
Starburst Presto
Last Updated: 1st July 2020
Presto is an open source distributed “SQL on Anything” engine for running interactive analytic queries. It has the ANSI standard SQL engine you would expect from a database. It doesn’t include its own storage mechanisms, but it allows you to query data in any storage system, be it distributed storage or a database, fully separating compute from storage. This reflects the current trend towards a separation between the two. The corollary to this is that you can use whatever storage engine (see Figure 1), or combination of storage engines, is suitable for your application. The company tells us that it takes between one and three months to support additional storage options such as Db2, Greenplum or Vertica, and the company is continuously working with the open source community as well as its customers to add new connectors, based on demand.
This approach means that you can scale compute separately – there is an autoscaling feature – from your storage requirements. You can also use the front-end business intelligence tool of your choice. In turn, this means that Starburst Enterprise Presto is most commonly deployed to support query federation across multiple data sources.
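Query federation can be illustrated with a small sketch. The code below is not Presto's connector SPI; Connector, InMemoryConnector and federated_join are hypothetical names, and the two "sources" are just Python lists standing in for, say, a data lake and a relational database. It shows the general shape of the idea: the engine owns no storage, reads rows through a uniform connector interface, and performs the join itself.

```python
from typing import Dict, Iterable, Iterator, List, Protocol

Row = Dict[str, object]


class Connector(Protocol):
    """Hypothetical minimal interface every storage source must offer."""

    def scan(self, table: str) -> Iterable[Row]: ...


class InMemoryConnector:
    """Stand-in for a real source; tables are held as lists of dicts."""

    def __init__(self, tables: Dict[str, List[Row]]) -> None:
        self._tables = tables

    def scan(self, table: str) -> Iterable[Row]:
        return iter(self._tables[table])


def federated_join(
    left: Connector, left_table: str,
    right: Connector, right_table: str,
    key: str,
) -> Iterator[Row]:
    """Hash-join two tables that live in different sources on a shared key."""
    index: Dict[object, List[Row]] = {}
    for row in right.scan(right_table):      # build side
        index.setdefault(row[key], []).append(row)
    for row in left.scan(left_table):        # probe side
        for match in index.get(row[key], []):
            yield {**match, **row}


# Hypothetical example: orders live in a "lake", customers in a "database".
lake = InMemoryConnector({"orders": [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 3},
]})
database = InMemoryConnector({"customers": [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Bob"},
]})
joined = list(federated_join(lake, "orders", database, "customers", "customer_id"))
```

Because the join logic depends only on the connector interface, swapping one storage engine for another would not change the query side, which is the separation of compute and storage described above.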
Presto is available under an Apache license, and Starburst provides commercial support for it as well as offering Starburst Enterprise Presto. The company is a major contributor to the Presto project (in fact, the founders of Presto are also founders of Starburst), as are companies such as Facebook (which developed Presto in the first place), Slack, Grubhub, Comcast, and FINRA. The product may be deployed in the cloud, on-premises, or in hybrid environments.
Customer Quotes
“Presto is amazing. Our lead engineer got it into production in just a few days. It’s an order of magnitude faster than Hive in most of our use cases. It reads directly from HDFS, so unlike Redshift, there isn’t a lot of ETL before you can use it. It just works.”
Airbnb
“FINRA monitors market data for trading fraud. Starburst Presto separates compute and storage, making it possible to scale economically and analyze 25PB of data – 100B rows of new data per day from 25+ sources.”
FINRA
Presto is a massively parallel distributed system that runs on a cluster of machines. A full installation includes a coordinator (which enables high availability) and multiple workers, as illustrated in Figure 2. Queries are submitted from a client such as the Presto CLI (command line interface) to the coordinator. The coordinator parses, analyses and plans the query execution, then distributes the processing to the workers. Specialised connectors are available for Cassandra, MySQL, Google BigQuery, Elasticsearch, Oracle, MongoDB, Snowflake, PostgreSQL and many others, while there is also ODBC and JDBC support. There are Presto client libraries that support C, Go, Java, Node.js, PHP, Python, R and Ruby. Also notable are the in-memory capabilities, the use of vectorised columnar processing and integration with Kubernetes, which allows deployment on any cloud and on-premises.
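The coordinator and worker roles can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Presto's scheduler: threads stand in for worker nodes, a list of numbers stands in for a table, and make_splits, worker_partial and coordinator_average are hypothetical names. It shows the pattern described above: the coordinator divides the input into splits, workers compute partial results in parallel, and the coordinator combines them into the final answer.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def make_splits(rows: Sequence[float], n_workers: int) -> List[Sequence[float]]:
    """Coordinator side: divide the input into roughly equal splits."""
    size = max(1, -(-len(rows) // n_workers))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def worker_partial(split: Sequence[float]) -> Tuple[float, int]:
    """Worker side: compute a partial aggregate (sum, count) for one split."""
    return (sum(split), len(split))


def coordinator_average(rows: Sequence[float], n_workers: int = 4) -> float:
    """Plan the work, fan it out to workers, then merge the partial results."""
    splits = make_splits(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker_partial, splits))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else 0.0


average = coordinator_average([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], n_workers=3)
```

An average is computed from partial sums and counts rather than from partial averages, because only the former can be merged correctly when splits differ in size.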
The product does not currently support push-down query capability but the company intends to introduce this in 2020. This will be two-way to the extent that you push-down when that is appropriate but refrain from doing so if the source database is overworked.
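The "two-way" push-down behaviour described here amounts to a planning decision, which the following sketch illustrates. Everything in it is an assumption made for illustration: the SourceStatus fields, the load metric, and the 0.8 threshold are hypothetical and are not taken from the product.

```python
from dataclasses import dataclass


@dataclass
class SourceStatus:
    """Hypothetical snapshot of what the engine knows about a source system."""
    supports_predicate: bool  # can the source evaluate this filter itself?
    load: float               # assumed load metric: 0.0 idle, 1.0 saturated


def should_push_down(status: SourceStatus, load_threshold: float = 0.8) -> bool:
    """Push the filter to the source only if it can run it and is not overworked."""
    return status.supports_predicate and status.load < load_threshold


def plan_filter(status: SourceStatus) -> str:
    """Describe where the filter will be evaluated under this simple rule."""
    return "filter at source" if should_push_down(status) else "filter in engine"


idle_source = SourceStatus(supports_predicate=True, load=0.2)
busy_source = SourceStatus(supports_predicate=True, load=0.95)
```

With these inputs, plan_filter(idle_source) chooses to filter at the source, while plan_filter(busy_source) keeps the work in the query engine to avoid adding load to an overworked database.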
A major feature of Starburst Enterprise Presto is that it offers a cost-based optimiser that is the result of a collaboration between what is now Starburst and Facebook, as opposed to the less capable optimiser used in standard Presto distributions. It has been designed specifically for Presto, as opposed to the Apache Calcite project, which is more of a generic optimiser. Another major feature that was previously contributed by Teradata is spill-to-disk, which is designed to support query processing when you run out of memory. There are a number of other in-memory engines which grind to a halt if you run out of memory. Workload management capabilities are provided along with resource groups.
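Spill-to-disk can be illustrated with a small external aggregation. The sketch below is not Presto's implementation; count_with_spill and its helper functions are hypothetical, and the memory budget is simplified to a maximum number of distinct keys held at once. When the budget is exceeded, the partial counts are written to disk as a sorted run and memory is freed; at the end the runs are merged in a streaming fashion, so the query completes instead of failing when memory runs out.

```python
import heapq
import json
import tempfile
from collections import Counter
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple


def _write_run(counts: Counter, spill_dir: str) -> str:
    """Write partial counts to disk as a run sorted by key; return its path."""
    with tempfile.NamedTemporaryFile(
        "w", dir=spill_dir, suffix=".run", delete=False
    ) as f:
        for key in sorted(counts):
            f.write(json.dumps([key, counts[key]]) + "\n")
        return f.name


def _read_run(path: str) -> Iterator[Tuple[str, int]]:
    """Stream (key, count) pairs back from one spilled run."""
    with open(path) as f:
        for line in f:
            key, count = json.loads(line)
            yield key, count


def count_with_spill(
    keys: Iterable[str], max_keys_in_memory: int, spill_dir: str
) -> Iterator[Tuple[str, int]]:
    """Count occurrences per key, spilling to disk when memory is exhausted."""
    counts: Counter = Counter()
    runs: List[str] = []
    for key in keys:
        counts[key] += 1
        if len(counts) > max_keys_in_memory:   # budget exceeded: spill
            runs.append(_write_run(counts, spill_dir))
            counts.clear()
    if counts:
        runs.append(_write_run(counts, spill_dir))
    # Merge the sorted runs lazily and add up counts for identical keys.
    merged = heapq.merge(*(_read_run(p) for p in runs))
    for key, group in groupby(merged, key=lambda pair: pair[0]):
        yield key, sum(count for _, count in group)


with tempfile.TemporaryDirectory() as spill_dir:
    spilled_counts = dict(
        count_with_spill(["a", "b", "a", "c", "b", "a"], 2, spill_dir)
    )
```

The result is the same as an in-memory count, at the cost of extra disk I/O, which is the trade-off spill-to-disk makes in exchange for not halting the query.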
The product has strong security capabilities, with support for LDAP and Kerberos, and you can inherit security details from the storage environment. In addition, Starburst ensures Presto security & governance with role-based access control, data masking and encryption (both at rest and in motion), column and row level security, and integration with Apache Ranger. And finally, the company has recently introduced Starburst Mission Control as a management console to manage Starburst Enterprise Presto clusters across platforms and data sources. It allows you to create, access, and manage multiple clusters, even across hybrid cloud environments, from a single intuitive user interface.
It is currently available on AWS and Kubernetes, which covers both cloud and on-premises deployments.
There are two questions to answer. Firstly, why choose Presto? And secondly, why prefer Starburst Enterprise Presto to other versions of Presto? In the first case, the ability to scale storage and compute separately is a major benefit, as is the ability to have heterogeneous storage engines, with built-in query federation. Also relevant is that there is no vendor lock-in: if you want to change your storage engine then Presto can accommodate that.
As far as Starburst is concerned, the key reasons for adopting this version of Presto are exactly the same as apply to other commercially supported open source products: you get support, high availability, enterprise connectors, security and the latest performance improvements.
The Bottom Line
Starburst was created because the founders believed that there was a large market opportunity to create an enterprise-grade version of Presto. This is undoubtedly true. However, it is worth bearing in mind that Starburst has its origins in Teradata: a company that has had decades of experience in optimising analytic performance. This experience is evident in the various Starburst Enterprise Presto offerings.