6+2 setup
Recently, we split our provisioned Redshift from a single 8-node cluster into a 6+2 dual-cluster setup. The main considerations were:
- Cost
- The total number of nodes stays the same, but with the 6+2 dual-cluster setup we can configure Concurrency Scaling independently on each cluster.
- When a Redshift cluster activates Concurrency Scaling, the scaling cluster has the same number of nodes as the source cluster. Hence, an 8-node cluster activating Concurrency Scaling is more expensive than a 6-node (or 2-node) cluster doing the same.
- Workload isolation
- Before the split, our ETL workload shared the single cluster with BI and other service accounts, so our ETL pipeline was occasionally impacted by other use cases.
- For example: a user from another team once set up an automated task that fired a lot of simple queries, not realizing that queries run via Python connectors do not auto-commit by default. The open transactions held locks on a table, which ended in a deadlock when dbt tried to full-refresh that table (see the sketch after this list).
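To make the lock behavior concrete, here is a minimal SQL sketch of that failure mode. The table name is made up, and the statements mirror what a Python connector with auto-commit disabled effectively runs:

```sql
-- Session A: roughly what a Python connector with autocommit=False does.
-- The SELECT opens an implicit transaction that is never committed, so the
-- shared lock on the table is held until the session commits or disconnects.
BEGIN;
SELECT count(*) FROM analytics.orders;  -- hypothetical table
-- ... no COMMIT / ROLLBACK ...

-- Session B: dbt's full refresh needs an exclusive lock to drop and recreate
-- the table, so it queues up behind Session A indefinitely.
DROP TABLE analytics.orders;

-- Diagnosis on the cluster: find open transactions holding locks, then
-- terminate the offending session if necessary.
SELECT xid, pid, lock_mode, relation, granted
FROM svv_transactions
WHERE relation IS NOT NULL;

SELECT pg_terminate_backend(12345);  -- placeholder PID from the query above
```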
The 6-node cluster is now the ETL cluster, while the 2-node cluster is the BI cluster.
Bridged by Data Share
Redshift has a feature called Data Share that enables data sharing between a Producer and a Consumer cluster. In our case, the ETL cluster is the Producer, and the BI cluster is the Consumer.
Data is not stored on the Consumer cluster at all. The Consumer reads straight from the Producer's data, much like a View, but without taking locks on the Producer's tables.
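For reference, the setup looks roughly like this. The share, schema, and database names below are placeholders, and the namespace GUIDs come from each cluster, so treat it as a sketch rather than our exact configuration:

```sql
-- On the producer (ETL) cluster: create the share and add objects to it.
CREATE DATASHARE bi_share;
ALTER DATASHARE bi_share ADD SCHEMA analytics;
ALTER DATASHARE bi_share ADD ALL TABLES IN SCHEMA analytics;

-- Authorize the consumer (BI) cluster by its namespace GUID.
GRANT USAGE ON DATASHARE bi_share TO NAMESPACE '<consumer-namespace-guid>';

-- On the consumer (BI) cluster: mount the share as a local database.
CREATE DATABASE shared_etl FROM DATASHARE bi_share OF NAMESPACE '<producer-namespace-guid>';

-- BI queries then read the producer's data via three-part names.
SELECT count(*) FROM shared_etl.analytics.orders;
```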
Problems:
- Cold start
- Data Share works best with incremental tables. When a table is shared from Producer to Consumer, the Consumer stores metadata about the exact locations where the data blocks live. So when the Producer table changes drastically (e.g. dropped and recreated), the next query the Consumer fires at that table takes extra time to re-collect the metadata. This only happens on the first query from the Consumer cluster; a simple warm-up mitigation is sketched after this list.
- If you think about it, there is a flaw in this design. The Producer already knows that the Consumer is reading from it. Hence, whenever the data changes, it should do a “metadata push” to the Consumer, instead of waiting for the Consumer to do a “metadata pull”. 🤔
- Random SQL errors
- Another problem we faced: BI reports that previously worked started throwing SQL errors after the switch to 6+2. The error was something like “invalid table reference”.
- Nonetheless, this is a super rare error, and it was solved by using an alias (FROM table_a AS a).
- This is peanuts compared to the shit that Redshift Spectrum gave us. 💀
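Going back to the cold-start issue: one mitigation we can sketch (an assumption on our side, not a documented Data Share feature) is to warm the share ourselves. Right after the Producer finishes a drop-and-recreate, fire a cheap query at the shared table from the Consumer, for example from a dbt post-hook or a small scheduled job, so the metadata pull happens before any BI user does. Names below reuse the placeholder objects from the earlier sketch:

```sql
-- Hypothetical warm-up query, run on the consumer (BI) cluster right after
-- the producer rebuilds analytics.orders. The first query pays the metadata
-- refresh cost, so subsequent BI queries do not.
SELECT 1 FROM shared_etl.analytics.orders LIMIT 1;
```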