Quick summary: Teams migrating to Azure Synapse Analytics from SQL Server must rethink their approaches to data distribution, querying, and performance tuning to optimize use of Synapse’s capabilities.
A core strength of Azure Synapse Analytics is its flexibility, a key advantage for organizations moving to the cloud. That flexibility allows for quick insights across data in various formats and sources, using languages most developers would already be familiar with. Additionally, its approach to managing compute and storage separately offers financial flexibility, ensuring cost-effective and scalable resource utilization. These benefits, from both a technical and financial perspective, make Synapse a smart choice for many companies.
However, teams migrating from SQL Server to Synapse face a fundamental shift in their data warehouse approach. Unlike SQL Server, Synapse is a massively parallel processing (MPP) system, with data spread across 60 distributions, managed by 1 to 60 compute nodes. For companies transitioning to Synapse, this difference means that they must have a deep understanding of distributed architectures and use different strategies for modeling, querying, indexing, and partitioning. Additionally, teams with a SQL Server background may be surprised to find that familiar performance tuning tools aren’t available in Synapse.
Commonly, when encountering performance issues, the first assumption might be to simply increase the service level for greater processing power and concurrency. However, this approach can be both expensive and inefficient if the real solution lies in a holistic reevaluation of your strategy.
Migrating to Synapse is not a straightforward “lift-and-shift” task, especially for developers without experience with MPP architectures. By shifting your approach and upskilling your team, you can ensure a more effective use of Synapse’s capabilities, thereby maximizing your ROI and fully unlocking its power.
Companies transitioning from SQL Server to Synapse must have a deep understanding of distributed architectures and use different strategies for modeling, querying, indexing, and partitioning.
Key differences and strategic approaches
1. Distributions vs. indexes: understanding data movement
In SQL Server, performance tuning is often focused on indexing. However, in Synapse, distributions play a more significant role. Ineffective distribution strategies result in the rows of frequently joined tables not being collocated on the Compute nodes. This leads to data movement, where necessary rows are copied into the shared tempDB. The process not only consumes resources inefficiently, but also negates the benefits of its indexes, as they do not “follow” the data into this shared resource.
Recommendation: Carefully plan your data distribution based on common usage patterns and the type of data.
2. Optimizing hash distributions
Unlike SQL Server, Synapse offers hash-distributed tables, giving you control over how your data is balanced across distributions. It is important to choose a hash key column that minimizes data movement, especially in instances where the table joins with another on the same column. Equally important is ensuring that your choice balances load across the nodes to ensure parallel processing; your queries can only be as fast as your slowest distribution.
Recommendation: Carefully select hash columns that are frequently used in joins, and regularly monitor for skew as your data grows.
3. Understanding columnstore indexes and partitions
In SQL Server, clustered columnstore indexes (CCIs) are best justified when the table reaches about 1 million rows. However, due to Synapse’s architecture, each of the 60 distributions needs to justify a CCI independently.
Recommendation: Be mindful of the increased data requirements when implementing CCI and partitions in Synapse. Ensure the column used for partitioning is different from the hash distribution column.
4. Adopting denormalized architectures
In Synapse, the shift from a traditional relational model to a denormalization strategy, such as the one big table (OBT) approach, helps minimize data movement and optimize query performance. Unfortunately, without the graphical execution plan familiar to SQL Server developers, it becomes significantly harder to fine-tune queries and effectively design denormalized tables, particularly when determining optimal indexing strategies.
Recommendation: Optimize query performance in Synapse by breaking down complex queries into smaller segments with “Create Table As Select” (CTAS) to control the data distribution and indexing between subsequent steps. Additionally, leverage dynamic management views (DMVs) for detailed performance monitoring in the absence of graphical execution plans.
Navigating the shift
The strategies that served you well with SQL Server may not translate to Azure Synapse Analytics. It requires reevaluation of query optimization, architecture, and monitoring to adjust to a distributed architecture. Equipping your team with the right knowledge and tools is essential for navigating the MPP environment successfully. Understanding and implementing these techniques is key to achieving optimal performance and cost-efficiency in Synapse, enabling you to fully utilize its potential to convert your data into valuable insights.
Put your data to work for you
We bring together the four elements that transform your data into a strategic asset—and a competitive advantage:
Like what you see?
Jennifer Pritchard is a Senior Developer in Logic20/20’s Advanced Analytics practice.