The Affordable Data Infrastructure Stack

The goal of your data infrastructure is to enable data-driven decision making and give everyone access to accurate, comprehensive and timely data, while keeping things compliant. This particular stack is optimised for reporting and analytics, while keeping costs low.

The basic layer cake of the data infrastructure stack looks like this:

  • Data sources
  • ETL
  • Data warehouse
  • Reporting
  • Integrations

Data sources

Data sources are typically your production data, front-end events, and third party tools like CRMs, banks and other partners. Most data sources make it easy to connect your ETL tool and are suited to log-based or key-based incremental replication.

You also need a solution for front-end event tracking, like Google Tag Manager or Segment. Segment makes it really easy to capture front-end events and send those to the analytics platform of your choice. With Segment it’s easy to switch from Amplitude to Mixpanel for product analytics. They also have a user consent system, giving you the ability to manage cookie consent easily and compliantly.

10,000 monthly users costs $120 per month with segment.

ETL

The job of the ETL (extract, transform and load) layer is to move the data from your data sources into the data warehouse. Either via batch jobs or data streaming. Since I’m optimising for cost efficient reporting and analytics, ETL tools that rely on batch jobs are effective solutions with wide data source coverage.

The ETL layer can be handled by a SaaS solution such as Fivetran or Stitch, or built by in tools like Apache Airflow, AWS Glue or Microsoft SSIS. I usually go for Stitch, as it’s cheaper than Fivetran, especially if your database enables an efficient replication strategy. Stitch is also easier to set up and maintain than Apache Airflow. 100 million rows replicated per month costs $750.

Segment also offers some ETL capabilities from SaaS tools like Intercom, Salesforce, Hubspot, which would otherwise be a reason to go with Fivetran over Stitch.

Data warehouse

The data from the ETL process is sent to a data warehouse like Snowflake, AWS Redshift, Azure Synapse Analytics or Google BigQuery.

Google BigQuery tends to be cheaper, has a user friendly interface, lots of tooling and offers a handy integration with Google Sheets, and is therefore my go-to. For an average sized company, expect to pay $100 per month.

The data team will prepare reporting tables that enrich the data in various ways that are deployed to the data warehouse. A dedicated tool like DBT is a popular way to manage this, but I find that running things in Github and BigQuery, with a few cloud functions and scheduled queries is usually more than sufficient.

You often need to add some additional data sources manually. BigQuery makes it easy to connect a Google Sheet to the data warehouse, or to use a Cloud Function to import an Excel sheet every day.

Reporting

The reporting layer is where most users will interact with the data. Wise used Looker, which is versatile and comes with version control built in through LookML, but with prices starting from $35,000, it has gotten too expensive for mid-sized companies since the Google Acquisition. Instead, I prefer Metabase. It offers the same powerful capabilities as Looker, at only $5 per user per month with no minimum committments.

In addition to Metabase, you need a tool to analyse customer behaviour. Mixpanel and Amplitude are good options with similar feature sets. Amplitude has a great free plan with everything you need to do user funnels.

Integrations

Finally, I use Segment to send out data to tools I want to integrate into our data flow. You can send user events or revenue figures to tools like Intercom and Salesforce.

Total cost

For a business replicating 100 million rows per month, with 20,000 tracked users in Segment, 100 people using Metabase, five data analysts, and with the free version of Amplitude, here’s the cost:

  • $700 for Stitch
  • $240 for Segment
  • $100 for BigQuery
  • $50 for Github
  • $500 for Metabase

Total: $1590 per month.


Comments

Leave a Reply

Your email address will not be published. Required fields are marked *