Companies of all sizes have embraced using data to make decisions. However, according to a 2019 report from Goldman Sachs, it’s actually quite difficult for businesses to use data to build a sustainable competitive advantage.

Our team has worked with and for companies across industries. We have seen the good, the bad and the ugly of data strategy. We have seen teams implement successful data lifecycles, dashboards, machine learning models and metrics. We have also been brought in to untangle, delete, migrate and upgrade entire data systems.

Throughout these projects we have seen several issues pop up repeatedly: a lack of data governance, bad data, overly complex Excel workbooks, a lack of alignment between data teams and the business, and an overabundance of dashboards leading to confused decision-making.

All of these data issues compound over time and slowly erode a team’s or company’s ability to trust and use their data.

In this article we will discuss some of these issues as well as possible solutions your teams can implement to improve your overall data lifecycle.


Having Inconsistent Data And Sources Of Truth

With all the various business workflows, it is inevitable that the same data gets entered in multiple places. One team might use Salesforce for one business process while another might use Workday.

This in turn can lead to inconsistent data being entered at each step. Whether this is due to timing or human error isn’t the point. Once these inconsistencies start entering a company’s various data warehouses, they can wreak havoc on reporting.

Just ask any BI engineer or analyst who has delivered a report to a director, only to find that it didn’t match another report built on a slightly different slice of the data.

Companies suffer from this regardless of their size or data maturity. Our team has worked at multiple companies and consulted across industries, and every one of them faces this issue.


Solution #1 - Set Up A Data Governance Strategy

This is usually solved by a data governance strategy. Data governance involves managing data’s availability, usability, integrity and security.

How you deploy this data governance strategy depends on whether you want a tight, centralized data process or more decentralized, separate processes that occasionally meet to confirm their core data models don’t overlap.

Data governance is far from a sexy term. It’s not data science or machine learning, and yet it is the foundation of your data science and machine learning departments. Not having a solid handle on your core sources of truth can lead to a lot of conflicting strategies.

Solution #2 - Create A Central Data Team To Make Decisions Quickly

The other strategy is a little less suited to large companies that want to move quickly: develop a core data team whose only focus is to develop and manage data sets. This strategy works when your company is small, because the data team itself will also be small.

That means when decisions need to be made about sources of truth and data integrity, they happen quickly. There is no need to coordinate multiple teams or hold centralized meetings every month.


Managing Complex Business Decisions In Excel

Excel and spreadsheets continue to drive billion-dollar decisions in companies across the globe. This reliance on Excel has also led to million- and billion-dollar mistakes by even the smartest companies.

For example, in 2008 Barclays agreed to purchase Lehman Brothers, except spreadsheet errors led them to eat losses on contracts they did not intend to buy. The detailed spreadsheet of Lehman assets contained approximately 1,000 rows that needed to be converted into a PDF. The original Excel version, however, had hidden rows containing 179 contracts that Barclays did not want. The notes flagging those unwanted contracts were not transferred to the PDF, but the hidden rows were. As a result, Barclays had to buy 179 contracts it did not want.

And in 2012, JP Morgan lost nearly 6 billion dollars due to what came down mostly to Excel errors.

Excel is a very versatile data tool that can help teams manage a lot of workflows. However, spreadsheets are also very error prone because of complex design, human error and how Excel operates in general.

To avoid these errors, there are a few strategies your team can put into place.

Solution #1 - Treat Excel Like Engineers Treat Code

If your Excel workbook is being used to make large decisions, then you should treat it the same way engineers treat code.

This means there should be Excel reviews and test cases. It may seem like overkill, but Excel is very much like code. In fact, some people even consider it a fourth-generation programming language. That means it is prone to errors in its logic, its process and the inputs it is given.

In turn, it should be treated as such.

Don’t just trust that an analyst, regardless of how smart they are, will make a perfect Excel sheet. Brilliant programmers write bugs, and so do brilliant analysts.

Thus, a logic review, test cases and sanity checks should be put into place to reduce these kinds of errors. This all might seem unnecessary, until your company loses massive amounts of money due to bad information.
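To make this concrete, here’s a minimal sketch of what an automated spreadsheet test could look like in Python using openpyxl. The file name, sheet name and cell ranges are hypothetical stand-ins for your own workbook; the idea is that totals and hidden rows get verified programmatically instead of by eye.

```python
# A minimal sketch of a spreadsheet "test case" in Python using openpyxl.
# The file name, sheet name and cell ranges are hypothetical examples.
from openpyxl import load_workbook

# data_only=True reads the cached results of formulas rather than the formulas
wb = load_workbook("quarterly_model.xlsx", data_only=True)
sheet = wb["Summary"]

# Test 1: the grand total cell should equal the sum of the line-item cells
line_items = [row[0].value for row in sheet["B2:B50"] if row[0].value is not None]
grand_total = sheet["B51"].value
assert abs(sum(line_items) - grand_total) < 0.01, "Grand total does not match line items"

# Test 2: flag hidden rows that could silently carry data (the Barclays lesson)
hidden_rows = [idx for idx, dim in sheet.row_dimensions.items() if dim.hidden]
assert not hidden_rows, f"Hidden rows found: {hidden_rows}"

print("All spreadsheet checks passed")
```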

Solution #2 - Automate Excel Using Python And SQL

Automate and develop clean data workflows for business processes that are well defined and can be converted into SQL and code.

A lot of reporting requires copy-pasting data from a report or database table. Instead of copy-pasting, your team can develop solutions that produce the data outputs automatically.

This can be done using a combination of SQL, code and spreadsheets with limited manual manipulation. Code is still error prone. However, it can often be written in ways that limit the number of errors and inputs. That is not the case with Excel; Excel’s flexibility is exactly what makes it error prone.
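As an example, here’s a minimal sketch of what one of these automated workflows might look like. The connection string, table and column names are hypothetical; the point is that the business logic lives in one reviewed SQL query, and the workbook is generated rather than assembled by hand.

```python
# A minimal sketch of replacing a copy-paste report with SQL plus Python.
# The connection string, table and column names are hypothetical examples.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/analytics")

# The business logic lives in one reviewed query instead of scattered formulas
query = """
    SELECT region,
           DATE_TRUNC('month', order_date) AS month,
           SUM(order_total)                AS revenue
    FROM orders
    GROUP BY region, DATE_TRUNC('month', order_date)
    ORDER BY region, month
"""

report = pd.read_sql(query, engine)

# Generate the finished workbook; no manual copy-pasting into Excel
report.to_excel("monthly_revenue_report.xlsx", index=False)
```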

Which solution is best depends on the complexity of the process, its repetitiveness and how large the decisions being made from the data are.


Too Many Dashboards And Not Enough Purpose

Dashboards can be very powerful tools to help distill information and provide insights and guidance.

However, the ease of development and the popularization of dashboards have led to companies often being overwhelmed by them. These dashboards don’t always align with company priorities, they may not be concise enough to make decisions from, and sometimes they are just filled with vanity metrics.

Just because dashboards are easy to make doesn’t mean teams should push them out quickly. Each dashboard should be meaningful and useful.

Dashboards should drive some sort of action as well as inform.

Solution #1 - Ask Why You’re Actually Building The Dashboards

To avoid developing too many dashboards, teams should put a process in place for figuring out why they are building each one. Often we find that dashboards are built simply because a manager or director asked for one, without a thought-out plan.

These dashboards end up in the dashboard graveyard. They are used once, or maybe for a few months, and then suddenly forgotten. In those cases an Excel report might have been better.

This can be avoided by stopping and asking why your team is building the dashboard in the first place, and what action it is meant to drive.

Trying To Analyze Too Much Data Without Enough Action

One of the more common patterns we have noticed is that most companies, after a certain size and complexity, have more data than they know what to do with.

This can lead to a couple of problems.

Some companies try to pull in all their data from every source at once in a massive data warehouse project. This is expensive, time-consuming and can be discouraging.

Companies end up spending hundreds of thousands to millions of dollars developing a data warehouse, only to realize it will take two years and twice the budget they planned for.

Solution #1 - Shoot For Small Quick Data Victories

Instead of trying to bring in all the data at once, figure out a few key areas your company wants to know more about.

From there, your teams should develop a plan for how you will answer those questions. Will it be a dashboard, a report or something similar?

This will then drive the first versions of your data warehouse.

Companies should spend a good amount of time planning their data warehouse design, their main entities, their data governance and so on. However, if you spend too much time planning and only get the framework up, business owners will eventually get restless.

Thus, having some little victories along the way can keep people motivated.


Using Low Quality And Incomplete Data

Garbage in, garbage out.

If you have worked in the data industry for a while, you will be accustomed to hearing this saying.

A more verbose way of putting it: good decisions based on bad data are actually bad decisions.

Putting bad data into your dashboards, reports and models will only provide invalid results.

We have seen companies spend months and sometimes years using data that everyone assumed was right and no one questioned. With a little digging, they would have found incomplete, inaccurate and untrustworthy data.

Every company has bad data; the only question is whether they are using it to make million-dollar decisions.

Solution #1 - Create An Automated QA System

Data is error prone. Much of it is input by humans and is thus very prone to mistakes.

In turn, your team will need to develop some form of automated data QA system to manage those errors.

The data in most systems is far too large to check manually. Thankfully, most of the quality checks a user would perform can be written down and placed into an automated QA system. This can save hundreds to thousands of hours company-wide and increase accuracy.

These checks don’t always need to be complex. In fact, you can create straightforward checks, like ensuring that all the countries being input are real. Teams can also develop more dynamic checks, like testing that the total spending every month stays within a specific range.

Overall, having these initial column-level and aggregate checks can be very helpful.
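Here’s a minimal sketch of what both kinds of checks could look like in Python with pandas. The file, column names, country list and spend range are all hypothetical; swap in your own reference data and thresholds.

```python
# A minimal sketch of automated data QA checks in Python with pandas.
# The file, column names, country list and spend range are hypothetical examples.
import pandas as pd

VALID_COUNTRIES = {"US", "CA", "MX", "GB", "DE", "FR"}  # your reference list
EXPECTED_MONTHLY_SPEND = (500_000, 2_000_000)           # expected bounds in dollars

df = pd.read_csv("transactions.csv", parse_dates=["transaction_date"])

# Column-level check: every country code must appear in the reference list
bad_countries = set(df["country"]) - VALID_COUNTRIES
assert not bad_countries, f"Unknown country codes: {bad_countries}"

# Aggregate check: total spend per month should stay within the expected range
monthly_spend = df.groupby(df["transaction_date"].dt.to_period("M"))["amount"].sum()
low, high = EXPECTED_MONTHLY_SPEND
out_of_range = monthly_spend[(monthly_spend < low) | (monthly_spend > high)]
assert out_of_range.empty, f"Months with spend outside the expected range:\n{out_of_range}"
```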

Solution #2 - Always Use Integration Tests

When developing dashboards, especially dashboards with calculated fields, blended data sets and filters, teams should create integration tests to make sure the logic inside the dashboard tool didn’t change the initial data.

Personally, we feel you should keep as much of the logic in one place as possible, such as in the query itself. Since that isn’t what usually happens, teams should make sure the data they see in the dashboard matches the data they put in.

There are lots of simple integration tests you can use. For example, perhaps you know that the data set should contain 1,000 people. From there, your team can create a test dashboard that counts how many people are in the data set. This matters especially once you start using functionality like blending in Tableau.

Blending can cause a lot of problems if you don’t think through your final data. To avoid these problems, it is best to build in some basic integration tests, as sketched below.
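Here’s a minimal sketch of one such integration test in Python. The connection string, table, column names and extract file are hypothetical; the test simply confirms that the dashboard’s data matches the source of truth on a count you already know.

```python
# A minimal sketch of an integration test that compares a dashboard's data
# against the source query it was built from. The connection string, table,
# column names and extract file are hypothetical examples.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/analytics")

# Count of people according to the source of truth
source_count = pd.read_sql(
    "SELECT COUNT(DISTINCT person_id) AS n FROM people", engine
)["n"].iloc[0]

# Count of people according to the data behind the dashboard (e.g. an export)
dashboard_extract = pd.read_csv("dashboard_extract.csv")
dashboard_count = dashboard_extract["person_id"].nunique()

# If blending or calculated fields dropped or duplicated rows, fail loudly
assert source_count == dashboard_count, (
    f"Dashboard shows {dashboard_count} people but the source has {source_count}"
)
```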


Conclusion

Using data to make better decisions gives companies a competitive advantage. However, this depends on the quality of the data and the robustness of the data processes set up around it.

Simply creating dashboards, data warehouses and machine learning models is not sufficient to make data-driven decisions. Teams need to consider their data life-cycles and the processes used to manage each step. Creating test cases, clear goals and processes can help improve your team’s performance and strategy. No one wants to get bogged down with too much process and bureaucracy, but having no plan or strategy at all for your team’s data life-cycle will also fail.

If you would like to read more about data science and data engineering, check out the articles and videos below.

4 SQL Tips For Data Scientists

How To Analyze Data Without Complicated ETLs For Data Scientists

What Is A Data Warehouse And Why Use It

Kafka Vs RabbitMQ

SQL Best Practices — Designing An ETL Video

5 Great Libraries To Manage Big Data With Python