Amazon Redshift Consultancy
Amazon says: “Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more.”
Although Amazon Redshift is based on a very old version of PostgreSQL (8.0.2, released in 2005), Amazon Redshift and PostgreSQL have a number of significant differences.
Storage: Row vs Column
Traditional databases like Postgres store data in rows; Redshift stores data in columns. Because every value in a column shares a single datatype, a whole column can be compressed tightly and loaded into memory relatively quickly.
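A minimal Python sketch (not Redshift's actual storage format) makes the layout difference concrete: the same three records held row-wise and column-wise, where an analytical query only needs one column.

```python
# Illustrative only - this is not how either database stores bytes on disk.
rows = [
    (1, "alice", 34),
    (2, "bob", 28),
    (3, "carol", 41),
]

# Row store: each tuple is kept together; reading one column touches every row.
ages_from_rows = [row[2] for row in rows]

# Column store: each column is a contiguous block of a single datatype; a query
# such as AVG(age) reads only the block it needs.
columns = {
    "id":   [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "age":  [34, 28, 41],
}
ages_from_columns = columns["age"]

assert ages_from_rows == ages_from_columns  # same data, different access cost
```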
OLTP vs. OLAP
Traditional databases are good at OLTP (Online Transaction Processing); Redshift is optimised for OLAP (analytical/reporting/data warehouse load), and makes a lot of tradeoffs along the way
This is a key attribute of Redshift – it has a Massively Parallel Processing (MPP) architecture spread over a number of compute nodes. Having data stored as columns of a single datatype optimises the compression strategy and dramatically reduces I/O overhead.
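To see why a single-datatype column compresses so well, here is a hedged Python sketch of run-length encoding, one of several encodings a columnar warehouse can apply (the column name and data are invented for illustration):

```python
from itertools import groupby

def run_length_encode(column):
    """Compress a column into (value, run_length) pairs - effective because a
    column holds one datatype and often contains long runs of repeated values."""
    return [(value, len(list(group))) for value, group in groupby(column)]

# A low-cardinality column (here, a hypothetical country code) compresses
# dramatically once like values sit next to each other:
country = ["DE"] * 1000 + ["FR"] * 800 + ["UK"] * 1200
encoded = run_length_encode(country)
# 3000 stored values shrink to 3 (value, count) pairs.
```

The same idea underlies why sorting and column layout interact: the more ordered a column, the longer the runs and the less I/O a scan incurs.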
This is a difficult topic because use-cases can differ enormously, so these figures are only illustrative.
A traditional Postgres database can reasonably handle up to 5 TB; bespoke approaches using sharding and other technologies can reach Petabytes.
Redshift is Petabyte scale
Postgres can scale vertically, or horizontally via read-only replicas, each of which is a complete copy of the database. It is also possible to devise manual solutions using partitioning and sharding, but this adds development and support complexity, and performance also needs to be considered.
Redshift is designed out of the box to deliver horizontal scaling by adding extra compute nodes
Unique, primary key, and foreign key constraints are informational only. These constraints may be defined (to help the query optimiser) but are not enforced – so the developer must take responsibility for enforcing them, and if the data violates a constraint, queries may return wrong results!
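A small Python sketch of the failure mode (the tables and values are invented): a "primary key" duplicated by a faulty load is not rejected, and a join silently fans out, doubling an aggregate.

```python
# Redshift would accept the duplicate row below because primary keys are
# informational only; this simulates the resulting wrong answer.
customers = [
    {"id": 1, "name": "alice"},
    {"id": 1, "name": "alice"},   # duplicate the database did not reject
]
orders = [{"id": 100, "customer_id": 1, "amount": 50.0}]

# Inner join on customer_id: the duplicate key matches twice.
joined = [
    (o["id"], c["name"], o["amount"])
    for o in orders
    for c in customers
    if c["id"] == o["customer_id"]
]
total = sum(amount for _, _, amount in joined)
# total is 100.0 rather than 50.0 - exactly the "wrong results" an enforced
# constraint would have prevented.
```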
Redshift does not support indexes. However queries can be optimised using sort and distribution keys.
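The reason sort keys can stand in for indexes is that Redshift keeps per-block min/max metadata (zone maps) and skips blocks that cannot contain the value sought. A hedged Python sketch of the idea (the block size and layout here are invented, not Redshift's 1 MB blocks):

```python
BLOCK_SIZE = 4  # toy block size for illustration only

def build_zone_maps(sorted_column):
    """Split a column into blocks and record (min, max) metadata per block."""
    blocks = [sorted_column[i:i + BLOCK_SIZE]
              for i in range(0, len(sorted_column), BLOCK_SIZE)]
    return [(min(b), max(b), b) for b in blocks]

def scan(zone_maps, target):
    """Read only blocks whose min/max range could contain the target."""
    blocks_read = 0
    hits = []
    for lo, hi, block in zone_maps:
        if lo <= target <= hi:
            blocks_read += 1
            hits += [v for v in block if v == target]
    return hits, blocks_read

# Column sorted on the sort key: most blocks are skipped without being read.
zone_maps = build_zone_maps(list(range(100)))
hits, blocks_read = scan(zone_maps, 42)
# Only 1 of 25 blocks is read; on unsorted data the min/max ranges overlap
# and far more blocks would have to be scanned.
```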
Redshift supports fewer datatypes than Postgres (e.g. JSON, XML and Interval are missing).
When to use Redshift?
This is very much use-case dependent, but we use the following pointers to help make a decision. In many cases the real test is to set up a proof of concept and run performance tests.
Consider Postgres if:
- OLTP load – or light/medium reporting/analytical
- Forecast size less than 10TB (ballpark depending on many things – could be lower or much higher)
- You want control to host on-premise or in the cloud or change as you wish
- Cost is a significant constraint
- Tables have a modest number and size of columns (e.g. dozens, not hundreds or thousands, and few very wide columns)
- You need Postgres features not present in Redshift – e.g. missing data types or other technologies
Consider Redshift if:
- OLAP load – especially heavy duty/high performance
- Forecast size Petabyte+
- You are comfortable with vendor lock-in
- Cost is not a problem
- You can ensure unique key constraints and referential integrity
- You are comfortable maintaining skills in-house for (another) database technology
- Tables are very wide – hundreds/thousands of columns, including a lot of (wide) text columns