Getting Started Building a Data Platform

Ever wonder what a data platform is and if your company needs one?

If the idea of hiring a data team to build and manage an enterprise data platform feels overwhelming, you’re not alone.

Let me break down how you can get started from zero and build up data capabilities at your company, one step at a time.

Nowadays every company is a data company. From marketing and sales to product usage and customer support, all aspects of your business generate data.

And that data is waiting to be activated. Turn it into reports to drive decisions. Build dashboards to support operations. Offer new product features and services to your customers that are directly driven by data and automation.

Getting started is much more a cultural challenge than a technical one.

Start small. Secure your first successes by doing the work manually, without worrying about big investments in scalability.

But once you see concrete value and it becomes painfully clear that technology is holding you back, you know it’s time to build out your data capabilities.

Where do you go from here? Can you buy an off-the-shelf solution? Do you hire a data engineer? Do you need a dedicated data team or can your existing engineers handle it?

You can break down data infrastructure into four main layers:

  1. Ingestion: Connect data sources, extract data and load it into a central repository
  2. Storage: Store data in a structured format that is optimized for analytical workloads
  3. Transformation: Clean, enrich, and transform data to make it practical to work with and ensure consistent definitions of key metrics across the organization
  4. Business Intelligence: Create dashboards, reports, and alerts to share insights internally and with your customers and partners
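To make the four layers concrete, here is a minimal sketch in Python. It assumes a hypothetical CSV export as the data source and uses an in-memory SQLite database as a stand-in for the analytical store; the table and column names are invented for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (the ingestion input).
RAW_ORDERS = """order_id,region,amount
1,EU,120.50
2,US,80.00
3,EU,42.25
"""

# 1. Ingestion: extract rows from the source and load them centrally.
rows = list(csv.DictReader(io.StringIO(RAW_ORDERS)))

# 2. Storage: an in-memory SQLite database stands in for the warehouse.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (:order_id, :region, :amount)", rows)

# 3. Transformation: one shared definition of "revenue by region",
# so every report uses the same metric logic.
con.execute("""
    CREATE VIEW revenue_by_region AS
    SELECT region, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
""")

# 4. Business Intelligence: a report is just a query over the view.
for region, revenue in con.execute(
    "SELECT region, revenue FROM revenue_by_region ORDER BY region"
):
    print(region, revenue)  # EU 162.75 / US 80.0
```

In practice each layer may be a different tool, but the shape stays the same: load, store, define metrics once, then query.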

There are many different tools to address all of these layers.

Focus on solving concrete problems you are experiencing and add new tools only when they directly solve a problem.

Start with querying your data directly where it is. Introduce tools to load data into a central repository only when the complexity and volume make this impractical.

You don’t need to address all data sources at once. Focus on the ones creating problems. Accept manual workarounds when practical.

Your data tooling should be able to query data across different data sources. You don’t need to worry about ingestion if directly querying a Postgres database and a Google Sheet gets the job done.
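Tools like DuckDB can join a database table with a spreadsheet export in a single query. As a minimal stand-in sketch using only the Python standard library (SQLite standing in for Postgres, an inline CSV standing in for the sheet; all names are hypothetical), the idea looks like this:

```python
import csv
import io
import sqlite3

# An existing application database (stand-in for Postgres).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex")])

# A spreadsheet export (stand-in for a Google Sheet).
SHEET = """customer_id,plan
1,pro
2,free
"""

# Query across both sources on the fly: stage the sheet rows in a
# temporary table, then join -- no standing ingestion pipeline needed.
con.execute("CREATE TEMP TABLE sheet (customer_id INTEGER, plan TEXT)")
con.executemany("INSERT INTO sheet VALUES (:customer_id, :plan)",
                csv.DictReader(io.StringIO(SHEET)))

for name, plan in con.execute("""
    SELECT c.name, s.plan
    FROM customers c JOIN sheet s ON s.customer_id = c.id
    ORDER BY c.id
"""):
    print(name, plan)  # Acme pro / Globex free
```

The temporary table disappears with the connection; nothing about this commits you to a permanent ingestion layer.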

Chances are you already store your data in a database such as PostgreSQL or MySQL. If you’re not having performance issues, there’s no need to introduce a separate database for analytical workloads.

Only if performance or cost becomes an issue should you start addressing it.

Storage is a critical component since it’s where the actual data lives. Data outlives applications built on top of it. Pick an established and open standard to store data.

Keep in mind that there is no one-size-fits-all solution. You might need multiple data stores optimized for different use cases. You’ll know what you’re looking for when you act on concrete problems instead of trying to find a solution for hypothetical future problems.

Start delivering value before adding a separate data transformation step. Introduce a dedicated data transformation layer when queries start taking too long, or when metrics become unreliable and hard to maintain because the same logic is repeated in many places.

A few materialized views in your database can take you a long way.
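In Postgres that is `CREATE MATERIALIZED VIEW` plus a periodic `REFRESH MATERIALIZED VIEW`. Here is a minimal sketch of the same idea, using SQLite (which has no materialized views) as a stand-in: the metric is defined once and precomputed into a table that dashboards read cheaply. Table and metric names are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INTEGER, kind TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [(1, "signup"), (1, "login"), (2, "signup")])

# The shared metric definition, written once instead of being
# repeated in every dashboard query.
METRIC_SQL = """
    SELECT kind, COUNT(*) AS n
    FROM events
    GROUP BY kind
"""

def refresh_event_counts(con):
    # Recompute the precomputed table, analogous to
    # REFRESH MATERIALIZED VIEW in Postgres.
    con.execute("DROP TABLE IF EXISTS event_counts")
    con.execute(f"CREATE TABLE event_counts AS {METRIC_SQL}")

refresh_event_counts(con)

# Dashboards now read the cheap precomputed table.
print(dict(con.execute("SELECT kind, n FROM event_counts")))
```

Refreshing on a schedule (cron, or your database’s own scheduler) is usually enough long before you need an orchestration tool.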

You’ll know it’s time to look into real-time stream processing, data lineage and orchestration tools once you experience the issues that these tools are designed to solve.

Many software companies start out by building custom analytics features. As you use data to drive operations and user-facing functionality, building custom solutions for every new workflow and view of the data becomes slow and expensive.

Introduce a data visualization tool to quickly build analytics dashboards and reports. This is a great first step and enables a single data analyst to deliver a lot of value without introducing any other data infrastructure.

I built Shaper to help companies in exactly this situation.

Shaper is a simple interface on top of DuckDB that allows you to build analytics dashboards and automate data workflows with only SQL.

Thanks to DuckDB, it’s easy to query data across various sources ranging from databases to CSV files and Google Sheets.

You can go a long way before having to add more layers to your data stack.

Give it a try and let me know what you think.