Data is continuing to prove to be a valuable asset for businesses of all sizes.
I say that both because consulting firms like McKinsey have found in their research that companies using AI and analytics can attribute 20% of their earnings to it, and because I have been able to consult for several clients and help them find new revenue sources as well as cost reduction opportunities.
There is one catch.
You will need to develop some form of data infrastructure or update your current one to make sure you can fully harness all the benefits that the modern data world has to offer.
Just to clarify, I don’t mean you need to use the fanciest and most expensive data tooling. Sometimes I have steered clients to much simpler and more cost-effective data analytics tooling.
In this article, we will discuss what you should avoid when building your data architecture and which questions you should be asking yourself as you build out your future data infrastructure.
What To Avoid When Developing Your Data Analytics Infrastructure
When you’re developing your data architecture, there are a lot of pitfalls you can fall into. We will talk about a few of them here.
The big focus here is avoiding what is known as technical debt. Using data is great when you have a smooth system that is easy to manage and cost-effective. However, we have come across several examples of companies using ineffective methods for running data pipelines and managing their data. These range from patchwork ETL pipelines to using Excel to orchestrate data transformations.
Let’s dig into some of these examples.
The Patchwork Cron Job Nightmare
Look, I know plenty of people who have done it.
They have tried to create their very own meta-database and job management framework to run their data pipelines.
These often end up being very complex data pipelines that are held together by a few overworked data engineers struggling to keep everything running.
This style of architecture remains a popular solution because it is so easy to set up. Depending on how complex your pipelines are, how many you have, and how often they run, it really might not be the worst solution to have a few scripts that run your pipelines once or twice a quarter.
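To make this concrete, here is a rough sketch of what one of those cron-driven scripts tends to look like (the crontab entry, file paths, and table names below are purely hypothetical):

```python
# A typical "patchwork" pipeline script, kicked off by a crontab entry such as:
#   0 6 1 */3 *  /usr/bin/python3 /opt/etl/quarterly_sales_load.py
# All paths and table names here are made up for illustration.

import sqlite3
import pandas as pd

def run_quarterly_load():
    # Extract: pull a CSV export dropped off by another team
    df = pd.read_csv("/data/exports/sales_latest.csv")

    # Transform: a couple of ad-hoc cleanups
    df["order_date"] = pd.to_datetime(df["order_date"])
    df = df[df["amount"] > 0]

    # Load: append into a local reporting database
    with sqlite3.connect("/data/warehouse/reporting.db") as conn:
        df.to_sql("sales", conn, if_exists="append", index=False)

if __name__ == "__main__":
    run_quarterly_load()
```

This is perfectly fine for a quarterly report while the pipeline count stays small.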
But.
Eventually, as your teams start needing daily updates and live data feeds, you will have to switch over to a more mature and modern solution.
There are a whole host of solutions here that we will discuss. But just know, to mature in your data strategy, you will need to step away from this patchwork style architecture.
It creates a lot of technical debt, and then you will need to hire someone like me to come in and untangle your data pipeline mess.
Running Your Data Pipelines Too Often
Many data pipelines these days run on the cloud. This makes them easy to work with while your teams are remote, and it means you don’t need to mess with on-premise servers.
It can also lead to a lot of unnecessary costs when your engineers decide to run a data pipeline once every hour even though it supports a report or dashboard that is only viewed once a week.
Now, depending on how you have your ETL set up, this could be very expensive, especially if the query reruns over all of your data every time.
This could lead to tens if not hundreds of thousands of dollars in extra costs annually.
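As a quick back-of-the-envelope illustration (the per-run cost here is completely made up), compare an hourly schedule to a weekly one for a dashboard that only gets looked at once a week:

```python
# Back-of-envelope math on pipeline cadence, using made-up numbers:
# assume each run does a full scan that costs $5 in warehouse compute.
cost_per_run = 5.00

hourly_runs_per_year = 24 * 365        # 8,760 runs
weekly_runs_per_year = 52              # 52 runs

print(f"Hourly schedule: ${cost_per_run * hourly_runs_per_year:,.0f} per year")  # $43,800
print(f"Weekly schedule: ${cost_per_run * weekly_runs_per_year:,.0f} per year")  # $260
```

Multiply that gap across dozens of pipelines and it is easy to see where the extra tens of thousands of dollars come from.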
When you develop ETLs and data pipelines, it is important to understand why and what you are building them for.
This helps ensure that your data engineers and developers pick the right cadence as well as the right tools for the job.
The “Let’s Just Do It All In Excel And Macros” Approach
Excel is a great tool for doing ad-hoc analysis. It’s easy to create formulas, throw together quick charts, and do some quick slicing and dicing.
One thing I feel it doesn’t do well is act as a data pipelining tool or a key component in your data infrastructure. By that I mean I have seen engineers develop systems that somehow have Excel orchestrating entire data pipelines through a combination of macros and scripts. I have nothing against Excel macros.
They have their place. But overall, they don’t tend to be a good choice when it comes to developing enterprise data infrastructure.
This is because the code in macros is often isolated in that Excel document and isn’t built to easily integrate with other complex external systems. Compare this to a language like Python, which can easily be used to create data pipelines, run models, or act as an API.
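For comparison, here is a minimal sketch of the same kind of work done in Python instead of an Excel macro (the file, table, and column names are hypothetical). The logic lives in version-controlled code that can talk to databases, APIs, and schedulers directly:

```python
import sqlite3
import pandas as pd

# Extract: read the same workbook the macro used to crunch (hypothetical file)
orders = pd.read_excel("monthly_orders.xlsx", sheet_name="raw")

# Transform: the aggregation the macro did with formulas and VBA
summary = (
    orders.groupby("region", as_index=False)["amount"]
          .sum()
          .rename(columns={"amount": "total_amount"})
)

# Load: write the result somewhere other systems can reach it
with sqlite3.connect("reporting.db") as conn:
    summary.to_sql("regional_summary", conn, if_exists="replace", index=False)
```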
Creating data infrastructure in Excel can actually sometimes look very similar to the cron-run option. It comes across as just as unruly and will require you to hire a data engineering or data science consultant to fix the mess.
Ok, let’s step away from what you should avoid.
What To Consider When You’re Building Modern Data Architecture
To pick the right data tools for your data strategy, you first need to have a data strategy.
For example, if your team is just starting out and you are looking to create a monthly report that pulls from 5-6 different data sources, you already know you probably won’t need a streaming data solution.
So step one to picking the correct architecture is deciding what you want to do with your data.
If you don’t already have an idea of what your team wants to do, then you can check out our guide for creating a data strategy.
Once you have a data strategy you can start to assess where your team currently is and where it is going.
There is a broad range of tools when it comes to setting up your data strategy.
Overall, you need to consider a few key questions.
- What are your data goals?
- How will you get your data from your source systems to your data warehouse/data lake(house)?
- What data warehouse will you use?
- What tool will you use for your data visualization?
- And do you plan to do any form of machine learning or data science?
These are some of the core questions you will need to answer.
Our team is currently working on a data strategy book that will help answer these questions in detail; you can sign up here to get it as soon as it comes out.
But let’s start to answer the first question.
How will you get your data from your source systems to your data warehouse/data lake(house)?
The very first question is one of the more important ones you will answer, because the solution you pick here will define how much work and maintenance your future engineers will need to do.
For example, earlier in this article, we discussed how using crontab to create a patchwork system that is all duct-taped together will be very challenging to work with.
But there are more modern data pipelines and ETL solutions that your team can use to avoid many of the problems that occur when you try to develop your own data infrastructure.
ETLs
ETLs are the classic data engineering bread-and-butter method for getting data from point A to point B. This is the process of extracting, transforming, and then loading data into your data warehouse or data lake.
ETL tools range from low-code options, which we will discuss in the Python frameworks section, to drag-and-drop options like SSIS and Talend.
ELTs
ELTs are very similar to ETLs except for the order in which the steps are done. Instead of your standard ETL, where the transform is done in the middle, an ELT focuses on just getting the data loaded first.
Then, later on, a transform may be applied if required.
There are a lot of tools trying to push this method. For example, Fivetran has been pushing hard to gain traction. In theory, you can also create an ELT with other tools like the Python frameworks we will be discussing shortly.
The difference is that many of the tools such as Fivetran already have connectors to popular data sources, so you don’t need to write any code to connect to things like Salesforce.
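Here is a minimal sketch of the ELT ordering, using pandas and SQLite as a stand-in for a managed warehouse (the file, table, and column names are hypothetical). Notice that the raw data lands first and the transform happens later, in SQL, inside the warehouse:

```python
import sqlite3
import pandas as pd

# Load first: dump the raw export into the warehouse untouched
raw = pd.read_csv("salesforce_export.csv")
conn = sqlite3.connect("warehouse.db")
raw.to_sql("raw_salesforce_accounts", conn, if_exists="replace", index=False)

# Transform later (and only if needed), using SQL inside the warehouse
conn.executescript("""
    DROP TABLE IF EXISTS stg_accounts;
    CREATE TABLE stg_accounts AS
    SELECT id, name, UPPER(country) AS country
    FROM raw_salesforce_accounts
    WHERE is_deleted = 0;
""")
conn.close()
```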
Python Frameworks (These Could Also Fall Under ETL/ELT)
Python libraries like Airflow and Luigi (just to name two) have become very popular over the past few years. The reason for their popularity is their ability to balance customization with pre-built metadata databases, logging, dependency management, and several other bells and whistles these Python frameworks provide for building ETLs.
People were already building hacky cron and python data pipelines.
So once many of them were provided with a framework like Airflow, it quickly caught on.
What makes these tools unique is the fact that you can so easily use plain Python to define your pipelines while still getting scheduling, logging, and dependency management out of the box.
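As a small illustration, here is what a basic Airflow DAG might look like (this assumes Airflow 2-style imports, and the DAG name and task bodies are hypothetical). The framework handles the schedule, retries, logging, and the dependency between the two tasks:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Pull data from a source system (placeholder logic)
    print("extracting data...")


def load():
    # Push the extracted data into the warehouse (placeholder logic)
    print("loading data...")


with DAG(
    dag_id="daily_sales_pipeline",     # hypothetical pipeline name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task          # load only runs after extract succeeds
```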
Streaming Tools
Data pipelines aren’t limited to batch-style tools like Fivetran or Airflow. Another popular method for getting data from point A to point B is using streaming tools.
Using streaming technology like Kafka, Kinesis, and Rockset allows you to stream events from any source and then run transforms or data science models over the data as it moves through the stream.
Often streaming data will initially be loaded into a data lake, but it could also be loaded into a data warehouse. The reason developers usually pick a streaming solution is that the business or customer needs to know the information right away. For example, let’s say your company manages multiple factories around the world, and if a machine goes down in any of them, it could cause major delays or other issues.
Would you want to learn about a likely failure the next day when the batch ETL job runs, or right as it is happening?
Streaming provides the ability for data engineers to create systems that can provide up-to-date information.
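As a rough sketch of what that can look like in code, here is a tiny consumer built with the kafka-python library that watches a hypothetical machine-telemetry topic and flags readings that look like an imminent failure (the topic name, fields, and threshold are all made up for illustration):

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Subscribe to a (hypothetical) stream of machine telemetry events
consumer = KafkaConsumer(
    "factory-machine-telemetry",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# React to each event as it arrives instead of waiting for a nightly batch job
for message in consumer:
    event = message.value
    if event.get("vibration_level", 0) > 0.8:  # made-up failure threshold
        print(f"Possible failure on machine {event.get('machine_id')} "
              f"at {event.get('timestamp')}")
```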
We won’t go into this now. But one of the questions I get all the time is, why not have all your data live all the time?
This will require a separate article. However, the amount of technical expertise currently required to maintain, use, and migrate to streaming is substantial. So if it doesn’t make sense to stream your data all the time, you probably shouldn’t.
Do You Need To Modernize Your Data Analytics Architecture?
I will spend more time diving into some of the other questions in the future, as well as in the data analytics strategy guide I will be putting out.
But, if you need to ask some questions about your data infrastructure today, or you want to know how you could use your data to help increase your revenue or reduce your costs, then feel free to set up some time with me.
Thanks for reading! If you want to read more about data consulting, big data, and data science, then click below.
How To Prepare For Your Data Engineering Interview
Developing A Data Analytics Strategy For Small Businesses And Start-ups
3 Ways To Improve Your Data Science Teams Efficiency
4 SQL Tips For Data Scientists
How To Improve Your Data-Driven Strategy
What Is A Data Warehouse And Why Use It