Best Practices to make the most of your SQL Queries running on Presto / Trino
Processing big data is one of the marks of this digital era. Companies need to process and analyze huge amounts of collected data to gain actionable information to establish their competitive advantage. The increasing use of IoT, logs, artificial intelligence and machine learning solutions results in massive amounts of data to be queried and analyzed.
Presto (now Trino) SQL uses big data analytics to query data where it resides (data lake architecture). But how can you accelerate your Presto SQL query engine? This post shows you tips and best practices for making the most of your Presto queries.
What is Trino (FKA Presto SQL)?
Presto /Trino SQL is a distributed SQL query engine. Its goal is to efficiently query data from disparate sources where the data resides. Trino can query data sources of all sizes and formats, providing fast time-to-market and fast analytics in an open-source solution.
Benefits and Features of Trino SQL
Designed for performance
Presto is designed to query vast amounts of data by using distribution, created as an alternative to Apache Hive and Hadoop. Trino features like distributed l processing, multi-threaded execution models, dynamic filtering and cost-based optimizers result in faster and cost-effective insights into data.
Data-source agnostic
While originally created to query data from Hadoop Distributed File Services, PrestoDB and Trino query engines can query data from an array of data sources. This includes object storage, relational database management systems, and NoSQL databases. Thus, it works with relational and non-relational data sources like:
Non-relational:
- Amazon S3
- Hadoop
- HDFS
- MongoDB
Relational databases:
- PostgresSQL
- MS SQL Server
Queries data where it lives
Presto (and Trino) queries the data wherever it is stored, embracing the data lake architecture. Therefore, it doesn’t require migrating the data to another location to process it there. Since you can query data wherever it sits, it is an alternative to expensive data transfer processes like ETL.
6 Best Practices for Presto (includes practical tips from Github)
While Presto is faster than other query engines, complex queries take time to run. A tuned Presto cluster can run queries in minutes. Here are several tips you can follow to accelerate your Presto SQL query engine.
Accelerating Trino queries
Use Presto / Trino Web Interface
Use Presto / Trino’s web interface (UI) that allows you to monitor and manage the cluster and queries. At the top of the page, you can find the list of queries currently running. Different tabs give you information about the queries, which allow you to check how operators are performing and plan queries live. You can use the interface high-level statistics to compare and optimize queries.
Keep table statistics up to date
Updating the table statistics can help to keep implementation costs low since Presto engine picks the ones with the lowest costs. You can use the command ANALYZE to update table statistics for a specific table.
Since both PrestoDB and Trino are open source, users are developing improvements and posting them on sites like Github and Stack overflow. Here are some of the best practices we found in Github:
Filter each table first
This will enable you to prevent large JOINs. Since Presto joins tables in the order they are listed, joining the smaller tables first and then the larger can improve performance.
Prepare your data format
Presto only holds the amount of data relevant to the query in process. Formatting as Optimized Row Columnar (ORC) facilitates this process by compressing each column separately.
Size up your files
Contrary to what you may think, multiple small files create latency issues, as it creates multiple requests and handling for each file. That being said, compressing your data can ensure better performance.
Commands to avoid when possible
- GROUP BY queries take a lot of time. If you need to use this command, first order the columns by the highest number of unique values to the lowest.
- When working with massive datasets, use approx_distinct () instead of count (distinct)
- Try to avoid UNIONs as much as possible to optimize performance.
- Try to avoid nested sub-queries, instead, use WITH statements.
You can learn more about how to tune Presto queries in this video of the Presto Training Series carried on by Trino.
Presto & Trino best practices
- Configuring the coordinator and workers to run on separate servers can help your production deployments.
- Remember to adjust Presto’s memory configuration according to the available resources of your server or instance. Also, since there is not a default memory percentage, assign each java configuration node with 80% of the available physical memory.
- If using object storage, like S3, consider formatting your data as an ORC file format.
- Check for coordinator node overload, which can happen if the Presto cluster has more than 50 worker nodes.
Summary
Querying with PrestoDB or Trino offers you a fast way to analyze very large datasets from heterogeneous sources in a fast and effective way. Trino features, like querying the data where it resides, save processing time since you don’t have to migrate all the data from the sources to a central location for processing. However, some commands for complex queries can take a long time to process. Hopefully, the tips mentioned in this article can help you accelerate your Trino queries.