5 Key Factors to Remember with Third-Party Data Integration

In the ever-evolving realm of analytics professionals, honing skills and expanding expertise is intricately woven with the art of daring to explore uncharted territories, welcoming fresh methodologies, and gleaning invaluable wisdom from the lessons of our missteps. By navigating pitfalls and experimenting with different processes, a lot of trial and error was performed until we found the best operational process to consider when implementing third party integrations.

Third Party Data Integration

Bringing third party data into a single dashboard or platform can seem to be straightforward but can actually be surprisingly challenging. It might seem relatively simple to pull in data from various sources and merge them into a single data source to create a dashboard.

However, neglecting specific crucial steps can lead to unexpected complications and issues that may come back to haunt you later. Here are the main lessons we’ve learned after managing several projects that integrate anywhere from 2 to over 20 data sources, each with its own third party vendor.

#1. Establish and maintain clear communication lines with third party vendors When you start working with any vendors, make sure you know:

Who is responsible for communicating with the vendors (is it you, or your stakeholder, or someone else?)
Who is responsible for making changes/updates to the data source if needed (and it VERY likely will be needed, see next tips for details)
Will a regular cadence of meetings be needed to keep up to date with data changes?
How frequent are changes
Establish a unified communication methodology/tracking system to ensure everyone is on the same page.

Effective communication plays a vital role when collaborating with third-party vendors, as they often manage multiple clients and maintain a demanding schedule. Relying on an intermediary approach to relay information can be arduous for both parties, potentially resulting in increased effort and the need for extensive revisions.

#2. Spend the time to align/map the key fields between all data sources.

The same data may be labeled differently across sources. Vendors may hand over data sets that don’t map to your existing data sets, or other third-party data sets, 1:1. For example, if you’re trying to tie together multiple datasets from different insurance companies, what one lists as “billed amount” might be labeled “adjusted amount” in another dataset from a different insurance company, or they might not be the same at all.

The data aggregation may be structured in different time frames between sources. One data source might send data at the daily level, while another might send it once a year at the yearly-level totals, and you’ll have to figure out how to create weekly reporting out of both of those sets.

It’s tempting to assume that the vendors know what data you are trying to tie in and that they mapped everything out in the structure you need, but in my experience, that almost never happens and there is significant work involved in making sure all the fields are mapped, correctly. If you skimp on this, you’ll probably have several hours of rework and blame/confusion/frustration that can easily derail your deliverable timeline (and, frankly, your sanity).

I recommend that you take plenty of time at this step to figure out what structure you need, and wherever possible, ask your vendors to send the data in that format. Ask them as many questions as you have to, to make sure that you understand the data structure and how to best make it apples-to-apples with your other data sources.

#3. Set up the structure for sustainability, not a quick fix.

If your vendors have data structured in spreadsheets, you might think, “Great! I’ll just load this data into a table that refreshes regularly, and pull that into my data source!”
What you might not anticipate are the dozens of surprise updates/additions/issues that come with ingesting data in a spreadsheet format.

Someone goes in and changes the date format for a single row of data?
Your pipeline breaks.
Someone accidentally updates the name of the tab that the data is on?
Your pipeline breaks.

You can reconfigure setting permissions in the sheets but those can change, and manually updating adds the risk of human errors. In other words, using csv files or spreadsheets is not sustainable in the long run if you want a dashboard/report that works with minimal maintenance.

I recommend finding ways to reduce human error and other delays as much as possible by, for example, using APIs that pull directly from the data source.

The API’s are automatic, much closer to the source of truth, and can’t be changed by just anyone. It might be an upfront investment of time and resources to set up APIs, but trust me, they will save you lots of time and effort by making your data more reliable.

‍

#4. Maintain quality checks to track data issues

Even with an API, it’s important to do a thorough QA of your data against the established source of truth and document your findings. Sharing this documentation with your team/stakeholders is essential because you will be demonstrating that you don’t just care about a one-time setup, but that you care about the integrity of your data and accuracy of deliverables. The dashboard is useless if the data isn’t accurate within the established threshold.

Also, don’t rely on a one-time QA: you’ll probably need regular quality checks to make sure the data continues to be accurate and updating correctly.

There are ways to automate this, for example, by setting up anomaly alerts in your dashboarding tool or pipeline tool like Databricks, Tidal, or Alteryx to send alerts if a refresh fails, or if data or conversion rate falls outside a certain threshold of variance.

Sometimes, however, you will need to manually spot-check the data at regular intervals. I know it’s tedious, but it will go a long way in setting you up for success with your data reliability and peace of mind.

#5. Build and maintain documentation of caveats/definitions

As you near the completion of your deliverable, developers and users alike will encounter numerous essential points to remember. Over time, you may forget the rationale behind setting up certain fields as equivalent, and end users could also struggle to recall the meanings of specific terms, especially if they differ across various vendors or sources. To address these challenges, it is crucial to document all the pertinent caveats and key definitions in a readily accessible location, ensuring that all users can easily refer to and benefit from this valuable information. Keeping comprehensive documentation will streamline processes and enhance overall efficiency in utilizing the deliverable.

I like to create a glossary tab or “companion” doc for my dashboards that details the definitions of dimensions and metrics. Another option is to add tooltips/captions to visuals that detail important notes to indicate definition or special caveats of the data. e.g. “data isn’t available at the daily level for X source, so the full year data is divided by the number of days in the year as an approximation” or “Amount = billed amount for Source X and adjusted amount for Source Y”.

I hope this crash course (as in, a summary of my crashes that I turned into a course) helps you avoid some of my biggest hurdles in learning how to work with third party vendors and sources to create dashboards that provide valuable insights. By sharing my significant challenges and the tactics that transformed them into valuable insights, I hope to empower you to sidestep similar hurdles and harness the complete potential of crafting insightful dashboards that foster success.