Which Data Warehouse is the right choice - Redshift or Snowflake?

Nidhi Vichare
17 minute read
January 30, 2021
Snowflake
Distributed storage systems
Data Engineering
Data Warehouse
Databases
Metadata-driven design

What is a modern data warehouse? A modern data warehouse lets you bring together all your data at any scale and get insights through analytical dashboards, operational reports, or advanced analytics for all your users.

Redshift vs Snowflake

Snowflake separates compute usage from storage, while Redshift bundles the two together. Concurrency scaling is automatically included with all editions of Snowflake. Redshift includes a daily allowance of concurrency scaling and charges by the second once usage exceeds it.

In this blog post I will be sharing my opinion on the areas where I see Snowflake being better than Redshift, and where Redshift has advantages over Snowflake.

Considerations while choosing the Data Warehouse

  • Platform Use Cases
  • Integration with Cloud Providers
  • Availability & Geolocation of the data
  • Volume of data
  • Dedicated engineering resources for the support and maintenance
  • Scalability: Horizontally vs. Vertically
  • Security
  • Pricing models
  • Performance

Platform Use Cases

AWS Redshift is an enterprise-level, petabyte-scale, fully managed data warehousing service. With Redshift, we can query petabytes of structured and semi-structured data across the data warehouse and data lake with standard SQL. Redshift lets users save the results of their queries back to the S3 data lake in open formats, like Apache Parquet, for further analysis with other analytics services such as Amazon EMR, Amazon Athena, and Amazon SageMaker. Redshift allows multiple integrations with different technologies, especially with tools on the AWS platform. Unlike Snowflake, Redshift assumes that user data already resides in AWS S3. AQUA (Advanced Query Accelerator) is a new distributed, hardware-accelerated cache that AWS claims can make Redshift up to 10x faster than any other cloud data warehouse.
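As a quick illustration of that S3 round-trip, here is a minimal UNLOAD sketch; the bucket, table, and IAM role ARN are placeholders:

```sql
-- Hypothetical example: write query results back to the S3 data lake
-- in Parquet so Athena/EMR/SageMaker can pick them up
UNLOAD ('SELECT * FROM sales')
TO 's3://my-data-lake/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS PARQUET;
```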

Snowflake is a fully managed data platform service with solutions for data warehousing, data lakes, data engineering, data science, data application development, and securely sharing and consuming shared data. Snowflake uses cloud-based data storage that enables enterprise users to store and analyze data on cloud platforms. Snowflake has run on Amazon S3 since 2014, on Microsoft Azure since 2018, and on Google Cloud Platform since 2019. Snowflake offers benefits such as global data replication, which means users can move data to any cloud in any geography. Whether you're a business or technology professional, Snowflake promises the performance, flexibility, and near-infinite scalability to easily load, integrate, analyze, and securely share your data. It can power a near-unlimited number of concurrent workloads.

Snowflake operates as a virtual data lake. Snowflake provides analytical capability across various cloud platforms, which means that companies can securely keep data and applications irrespective of the platform. Snowflake is cloud-agnostic, and its virtual nature makes it very useful and functional for big business users. Virtual warehouses can be used to load data or run queries, and can perform both jobs concurrently. Snowflake virtual warehouses can be scaled up or down on command and can be suspended when not in use to decrease compute expenses. So if a company wants to cut query wait times, or load data faster to provide a hassle-free end-user experience, Snowflake offers various virtual warehouse sizes to match the usage needs.
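A minimal sketch of that suspend/resume behavior; the warehouse name and size are illustrative:

```sql
-- Create a virtual warehouse that pauses itself when idle,
-- so you stop paying for compute between query bursts
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 300   -- suspend after 300 seconds of inactivity
  AUTO_RESUME = TRUE;  -- wake up automatically on the next query
```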

Everything in Snowflake can be done with SQL. Creating Snowflake-specific objects like a Snowpipe or a File Format can be done with SQL, and you can even load data with SQL. The multi-statement transaction feature in Snowflake allows for ELT workloads where you want a big process that touches multiple tables to complete in an atomic way.
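A hedged sketch of such an atomic ELT step (the table names are hypothetical):

```sql
-- Either both statements commit or neither does
BEGIN;
  DELETE FROM dim_customer WHERE load_date = CURRENT_DATE;
  INSERT INTO dim_customer SELECT * FROM stg_customer;
COMMIT;
```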

The global caching provided by Snowflake can save time and money, especially if you're using a BI tool with frequently accessed, slowly changing dashboards. Snowflake caches results at a global level. This means that if I run a complicated query that takes hours and you wanted to run the same query, Snowflake would return your results in a fraction of the time by using the cache. This is because it caches results not at the user level but at the query level.

Zero-copy cloning is one of the coolest features of Snowflake. Let's say that you've built a new ELT pipeline and you want to run it in Dev with some fresh data from Production. In Snowflake you can clone the table from Prod to Dev, and it will instantly make a copy of the table without duplicating the underlying storage. Since the underlying storage isn't physically duplicated, the cloning process is nearly instantaneous and you don't have to pay for the additional storage. When you make changes to the Dev table, you only pay for the blocks you modify.
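The clone itself is a one-liner; in this illustrative sketch, prod and dev are separate databases:

```sql
-- Instant, storage-free copy of the production table into Dev;
-- you pay only for blocks that later diverge from the original
CREATE TABLE dev.public.orders CLONE prod.public.orders;
```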

Snowflake's answer to doing ML is to hand the data science workload off to another tool or use external functions. External functions are a good option, but it's not the same as having native ML functionality in your database as you do with Redshift.
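For contrast, Redshift ML (in preview at the time of writing) lets you train a model with plain SQL. A rough sketch, with all names, columns, and ARNs hypothetical:

```sql
-- Train a churn model inside Redshift; SageMaker does the heavy
-- lifting behind the scenes and exposes a SQL prediction function
CREATE MODEL customer_churn
FROM (SELECT age, tenure_months, monthly_spend, churned
      FROM customer_activity)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
SETTINGS (S3_BUCKET 'my-redshift-ml-bucket');
```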

If you want to stream data into Snowflake, your best option is to use a Snowpipe. Snowpipe is micro-batch, however, and not really a streaming option. In reality, I think it takes about 2 minutes on average before you see your data in your Snowflake table when using a Snowpipe.
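A minimal Snowpipe sketch, assuming an external stage and target table already exist (all names are illustrative):

```sql
-- Auto-ingest micro-batches as files land in the stage
CREATE PIPE raw.public.events_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw.public.events
  FROM @raw.public.events_stage
  FILE_FORMAT = (TYPE = 'JSON');
```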

Vertically scaling warehouses to improve query performance is a manual process with Snowflake (an ALTER WAREHOUSE command), and when you do scale up, in-flight queries still don't get the additional resources. Redshift automatically provisions resources on the fly in the background if a query is taking too long. With the Concurrency Scaling feature, you can support virtually unlimited concurrent users and concurrent queries, with consistently fast query performance: when concurrency scaling is enabled, Amazon Redshift automatically adds cluster capacity when you need it to process an increase in concurrent read queries. Amazon Redshift also uses a machine learning algorithm to analyze each eligible query and predict its execution time; by default, WLM dynamically assigns a value for the short query acceleration (SQA) maximum runtime based on analysis of your cluster's workload.
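For reference, the manual Snowflake scale-up mentioned above looks like this (warehouse name illustrative):

```sql
-- Vertical scaling is one statement, but only queries submitted
-- after the change benefit from the larger size
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';
```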

If you have data sitting in a Redshift table that hasn't been modified in the last 90 days, you can pay a lower monthly rate for it by unloading it from database tables to a set of files in an Amazon S3 bucket and applying storage lifecycle policies. In Snowflake, data storage lifecycle management is not an option, so you pay a single rate. This may have an impact on your storage costs if you land a bunch of data in your warehouse and never touch it again.

By using federated queries in Amazon Redshift, you can query and analyze data across operational databases, data warehouses, and data lakes. With the Federated Query feature, you can integrate queries from Amazon Redshift on live data in external databases with queries across your Amazon Redshift and Amazon S3 environments. Snowflake does not have this option today.
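A sketch of setting up a federated query over a live Aurora PostgreSQL database; the endpoint, ARNs, and names are placeholders:

```sql
-- Map a live PostgreSQL schema into Redshift; queries against it
-- run federated, joining live operational data with warehouse tables
CREATE EXTERNAL SCHEMA postgres_live
FROM POSTGRES
DATABASE 'orders' SCHEMA 'public'
URI 'orders-cluster.cluster-abc123.us-east-1.rds.amazonaws.com'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:pg-creds';
```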

Snowflake provides a great web UI that is super easy to use, with tabbed worksheets and the ability to have multiple worksheets open at the same time. Saving, searching, and opening worksheets is much easier and more intuitive in Snowflake. You can search for database objects using the hierarchical object viewer, assume a different security role, or change your database/schema context.

Integration with Cloud Providers

  • AWS Redshift integrates with a multitude of AWS services like Athena, Glue, SageMaker, DynamoDB, CloudWatch, etc. So if you are looking to use a data warehouse within AWS, then Redshift is probably your best choice. All you have to do is Extract, Transform, Load (ETL) into the warehouse and start performing analytics.

  • Redshift itself doesn’t support schema-on-read. You can access the Glue catalog from Redshift or use Spectrum to access data stored on S3; nevertheless, joins between Spectrum or Athena tables and internal Redshift tables happen in Redshift, so query performance depends on the size of the Redshift cluster and can vary with context.

  • Snowflake does not have similar integrations, which makes it more challenging for clients to use tools like Kinesis, Glue, or Athena when integrating their data warehouse with their data lake architecture. Snowflake, on the other hand, integrates with tools like IBM Cognos, Informatica, Power BI, Qlik, Apache Spark, Tableau, and a few others, which can be helpful for analytics processes.

  • Snowflake gives native support for JSON documents, providing built-in functions and querying for JSON data, as sketched below. In contrast, there is limited support for JSON in AWS Redshift, as reported by users.
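A small sketch of that native JSON support, assuming a hypothetical events table: load documents into a VARIANT column and query them with path notation:

```sql
-- Store semi-structured documents as-is, no upfront schema
CREATE TABLE raw_events (payload VARIANT);

-- Path notation plus casts pull typed fields straight out of the JSON
SELECT payload:user.id::STRING      AS user_id,
       payload:event_type::STRING   AS event_type
FROM raw_events
WHERE payload:event_type = 'click';
```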

Availability & Geolocation of the data

Data warehouse architecture is rapidly changing. Companies are increasingly moving towards cloud-based data warehouses with a lower upfront cost, improved scalability and performance instead of traditional on-premise systems. According to HGInsights as of November 2020:

  • Snowflake has been adopted by 4,663 companies, growing +83% over last year. Snowflake is available on all cloud providers (AWS, GCP, and Azure) across the Americas, Europe, and Asia Pacific regions.
  • Redshift has been adopted by 13,886 companies, growing 3% over last year. Redshift is available in the Americas, Europe, Middle East, Africa, and Asia Pacific regions.
  • BigQuery has been adopted by 9,184 companies, growing 31% over last year. BigQuery is available across the Americas, Europe, and Asia Pacific regions.

Currently, Amazon Redshift only supports Single-AZ deployments. You can run data warehouse clusters in multiple AZs by loading data into two Amazon Redshift clusters in separate AZs from the same set of Amazon S3 input files. With Redshift Spectrum, you can spin up multiple clusters across AZs and access data in Amazon S3 without having to load it into your cluster. In addition, you can restore a data warehouse cluster to a different AZ from your cluster snapshots.

Amazon Redshift will automatically detect and replace a failed node in your data warehouse cluster. The cluster is unavailable for queries and updates until a replacement node is provisioned and added to the database. Amazon Redshift makes your replacement node available immediately and loads your most frequently accessed data from S3 first, so you can resume querying your data as quickly as possible. Single-node clusters do not support data replication; in the event of a drive failure, you will need to restore the cluster from a snapshot on S3. At least two nodes are recommended for production.

Snowflake offers built-in high availability, built-in data protection, and built-in service protection against node failures. Providing an even higher degree of data protection and service resilience within the same deployment region, Snowflake provides standard failover protection across three availability zones (including the primary active zone). Your data and business are protected: as you ingest your data, it is synchronously and transparently replicated across availability zones. This protection is automatically extended to customers at no added charge.

Volume of data

Amazon Redshift, Google BigQuery, Snowflake, and Hadoop-based solutions support a dataset size up to multiple petabytes in an optimal manner.

Dedicated engineering resources for the support and maintenance

The selection criteria are broken down into:

  • Setup engineering resources
  • Ongoing maintenance
  • Building skillsets in your team

Let's talk about setup, maintenance and management of Redshift and Snowflake.

  • Redshift: Sizing of the cluster is required; compute and storage are coupled; dataflows must be designed to match resource size; requires maintenance like data vacuuming and analyzing (see the sketch after this list); difficult to manage without a skilled architect, as management requires a time investment.
  • Snowflake: No sizing required, as compute and storage are separate; you must choose the cloud provider; low maintenance with rapid provisioning of compute capacity; zero management for end users.
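The Redshift maintenance referenced in the first bullet typically boils down to statements like these (table name illustrative):

```sql
-- Reclaim space from deleted rows and restore sort order
VACUUM FULL sales;
-- Refresh planner statistics so the optimizer picks good plans
ANALYZE sales;
```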

Scalability: Horizontally vs. Vertically

Horizontal scalability refers to the addition of more machines, whereas vertical scalability means the addition of resources into a single node to increase its capability.

  • Redshift: Redshift can be scaled up and down by adding or removing nodes in the existing cluster. Redshift has a local storage configuration and cannot scale compute independently; resizing requires a cluster reconfiguration that takes several hours, leaving the cluster in a read-only state while data redistribution takes place.
  • Snowflake: There is no such concern, as compute and storage are independent, with processes already built in for scaling vertically and horizontally. The metadata service also scales as needed; trillions of rows can be sliced with ease, and multiple concurrent users can access the data.

Snowflake offers instant scaling without needing to redistribute data or interrupt users. Snowflake's auto concurrency allows users to set a minimum and maximum cluster count, and clusters will scale automatically over this range when demand is high.
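A sketch of configuring that min/max range on a multi-cluster warehouse (an Enterprise edition feature; the warehouse name is illustrative):

```sql
-- Snowflake adds or removes clusters within this range as
-- concurrent demand rises and falls
ALTER WAREHOUSE analytics_wh SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'STANDARD';
```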

Redshift can also scale, but not as instantly as Snowflake; adding new nodes to the cluster takes anywhere from minutes to hours. So when it comes to scaling, Snowflake has an obvious advantage over Redshift. Redshift also offers a mechanism called concurrency scaling that can increase cluster capacity automatically when there is an increase in concurrent read query load.

Security

Both Redshift and Snowflake offer a strong security model, including encryption at rest and in transit, role-based access control, and network isolation options.

Pricing

Redshift delivers more compute per dollar, saving you money for the same amount of total compute time. At a general level, if we look at the pricing models, Redshift is cheaper for on-demand pricing, and costs can be reduced further with Reserved Instances. The trend is that smaller companies lean toward Amazon Redshift due to its simple usability and affordable pricing.

Snowflake, on the other hand, has a dynamic cost model that depends on the workload: each virtual warehouse is billed separately based on its usage pattern, with compute and storage priced independently. Large enterprises can find value in Snowflake because compute and storage scale separately, which can bring overall prices down. Regarding pricing models, with Snowflake you pay credits per hour for each virtual warehouse plus the data storage cost, which is normally negligible and aligned with your cloud provider's costs. Basically, you pay only when machines are up and running executing your queries, and the total cost mostly depends on your usage pattern and on the fact that your virtual warehouses are suspended when not in use. Snowflake provides an auto-suspension option to switch off a virtual warehouse when no queries have executed for a certain amount of time.

The reason we adopted Snowflake was mostly to overcome limitations you might find in other products like Athena and Redshift (Athena is an interactive query service for data in S3, not a full data warehouse).

Performance

Performance depends on the size and structure of your data, the types of queries you run, and your usage pattern. Data has to be wisely modelled and optimised for consumption, and in some cases role-based restrictions might come in handy. This can help you keep control of who is accessing your DW (and how) and prevent data from being consumed in the wrong way.

Since both the databases use different architectures and behave differently as per the type of queries run, it is tough to declare a clear winner when it comes to performance.

On raw run times for unoptimized queries, Snowflake offers higher performance. Redshift run times for unoptimized queries generally include a long query-compilation step, and the same query runs much faster when executed frequently.

If you know your data, Redshift offers different ways of tuning both your queries and data structure to attain significant gains. Redshift offers SORTKEY and DISTKEY clauses which can be used while setting up the data. If used effectively, these clauses can provide a significant reduction in run times for queries involving JOINs and WHERE clauses.
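An illustrative DDL sketch using both clauses (the schema is hypothetical):

```sql
-- DISTKEY co-locates rows that share a join key on the same node;
-- SORTKEY lets range filters on sale_date skip disk blocks
CREATE TABLE sales (
  sale_id     BIGINT,
  customer_id BIGINT,
  sale_date   DATE,
  amount      DECIMAL(12,2)
)
DISTKEY (customer_id)
SORTKEY (sale_date);
```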

Snowflake automatically micro-partitions data and offers a CLUSTER BY clause (rather than PARTITION BY) that helps optimize queries with WHERE clauses, but optimization over JOIN queries is limited in Snowflake.
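A minimal sketch of defining a clustering key on an existing table (table and column are illustrative):

```sql
-- Tells Snowflake to keep micro-partitions organized by sale_date,
-- improving pruning for WHERE filters on that column
ALTER TABLE sales CLUSTER BY (sale_date);
```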

Both databases share the features below:

a. Columnar data storage reduces the amount of data that must be read from the table, cutting disk I/O significantly

b. MPP (Massive parallel processing) architecture

c. Stores compressed data to reduce disk I/O, which gets uncompressed during query execution

d. Query results get cached at the leader node in the case of Redshift, while Snowflake caches query results in its services layer and raw table data at the compute nodes (local disk). This avoids re-execution of repeated queries against the data

e. Redshift offers a variety of techniques to optimize database performance, like distribution/sort keys, partitioning, and data distribution style. Snowflake, however, offers few manual optimization techniques. That simplicity may be good for people who find tuning complex, but it also takes the opportunity away from experts.

Best Practices for a Warehouse Design

  • Metadata management – Documenting the metadata related to all source tables, staging tables, and derived tables is critical in deriving actionable insights from your data. It is possible to design the ETL tool such that even the data lineage is captured. Some widely popular ETL tools also do a good job of tracking data lineage.
  • Logging – Logging is another aspect that is often overlooked. Having a centralized repository where logs can be visualized and analyzed can go a long way in fast debugging and creating a robust ETL process.
  • Joining data – Most ETL tools have the ability to join data in extraction and transformation phases. It is worthwhile to take a long hard look at whether you want to perform expensive joins in your ETL tool or let the database handle that. In most cases, databases are better optimized to handle joins.
  • Keeping the transaction database separate – The transaction database needs to be kept separate from the extract jobs and it is always best to execute these on a staging or a replica table such that the performance of the primary operational database is unaffected.
  • Monitoring/alerts – Monitoring the health of the ETL/ELT process and having alerts configured is important in ensuring reliability.
  • Point of time recovery – Even with the best of monitoring, logging, and fault tolerance, these complex systems do go wrong. Having the ability to recover the system to previous states should also be considered during the data warehouse process design.

In Summary ...

In my opinion, Snowflake vs Redshift vs BigQuery is a choice you have to make based on the use cases you need to support. Snowflake offers the best-in-class cloud-agnostic data warehouse option. If your use cases include ML, federated queries, cross-region replication, dynamic scaling of queries, or saving cost through storage lifecycle management, then Redshift or BigQuery are better options.

Further Reading

🔗 Read more about Snowflake here

🔗 Read more about Cassandra here

🔗 Read more about Elasticsearch here

🔗 Read more about Kafka here

🔗 Read more about Spark here

🔗 Read more about Data Lakes Part 1 here

🔗 Read more about Data Lakes Part 2 here

🔗 Read more about Data Lakes Part 3 here

🔗 Read more about Redshift vs Snowflake here

🔗 Read more about Best Practices on Database Design here