Combining Dirty CRM Data for Insights


In the real world, sales data isn't perfect. You will often encounter datasets spread across multiple tables without a proper relationship (i.e., a join key). With some clever data engineering, however, you can still get what you need. In this tutorial, we'll walk through how we handle such customer data using various matching and deduplication strategies.

Step 1: Identify Matching Fields

Your starting point is determining which fields can be used for the join. These fields must be present in all the tables and contain overlapping data, making them candidates for pseudo foreign keys. For example, if every table has an 'email' or 'product' field, we can use these fields as the base for the join.

Pay close attention to fields that are similar but may be named differently, or are not exactly one-to-one. You can still use these effectively in the hierarchical join strategy explained later. For example, in one of our cases, the 'product' field on the lead was the initially requested product, but in the orders table it was the actual purchased product, which could differ if the customer changed their mind.
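Before committing to a candidate field, it helps to measure how well it actually overlaps between the tables. Here is a minimal sketch using SQLite from Python; the table names (leads, orders) and the email field are illustrative assumptions, not a prescription:

```python
import sqlite3

# Hypothetical mini versions of two CRM exports with no shared key column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE leads  (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO leads  VALUES (1, 'a@x.com'), (2, 'b@x.com'), (3, 'c@x.com');
    INSERT INTO orders VALUES (10, 'a@x.com'), (11, 'b@x.com');
""")

# Count distinct lead emails, and how many of them also appear in orders.
total, matched = conn.execute("""
    SELECT COUNT(DISTINCT l.email),
           COUNT(DISTINCT CASE WHEN o.email IS NOT NULL THEN l.email END)
    FROM leads l LEFT JOIN orders o ON o.email = l.email
""").fetchone()
print(f"{matched}/{total} lead emails also appear in orders")  # 2/3
```

A high overlap ratio suggests the field is a usable join candidate; a low one means it should only appear deep in the hierarchical key, if at all.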

Step 2: Create Pseudo Foreign Key Link Tables

Once you have your matching fields or nearly matching fields, it's time to create your pseudo foreign key link tables. These are datasets that imitate the function of a foreign key through grouping distinct records based on the identified matching field/s. Consider each unique combination of your matching fields and make them your pseudo foreign keys.

Let's say you have three fields in common across the tables - 'email', 'create_time', and 'product'. In this case, your grouped sets based on distinct records would look something like this:

  • Set 1: Leads grouped by email, create_time, and product
  • Set 2: Leads grouped by email and create_time
  • Set 3: Leads grouped by email only

Generate these grouping sets using the SQL GROUP BY clause or the equivalent in your programming language or tool of choice. At Sales Insights we recommend, and work heavily in, SQL.

Note: If working in SQL, we recommend putting each of these sets into its own table (or temporary table) for later joining, rather than trying to do it all in one pass. This lets you inspect the data and simplifies your final query.
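As a concrete sketch of building link tables, the following uses SQLite from Python. The table and column names (leads, email, create_time, product) are assumptions standing in for a real CRM export:

```python
import sqlite3

# Build pseudo foreign key link tables with GROUP BY.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE leads (id INTEGER PRIMARY KEY, email TEXT,
                        create_time TEXT, product TEXT);
    INSERT INTO leads VALUES
        (1, 'a@x.com', '2024-01-01', 'widget'),
        (2, 'a@x.com', '2024-01-01', 'gadget'),
        (3, 'b@x.com', '2024-01-02', 'widget');

    -- Set 1: the longest pseudo key (email, create_time, product)
    CREATE TABLE lead_link_full AS
        SELECT email, create_time, product, COUNT(*) AS n_leads
        FROM leads GROUP BY email, create_time, product;

    -- Set 3: the shortest pseudo key (email only)
    CREATE TABLE lead_link_email AS
        SELECT email, COUNT(*) AS n_leads
        FROM leads GROUP BY email;
""")

n_full = conn.execute("SELECT COUNT(*) FROM lead_link_full").fetchone()[0]
n_email = conn.execute("SELECT COUNT(*) FROM lead_link_email").fetchone()[0]
print(n_full, n_email)  # 3 2
```

Any group whose n_leads is greater than 1 is not yet a perfect key, which is exactly what Step 3's deduplication strategies address.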

Step 3: Choose a Deduplication Strategy

If any of your groups generate duplicate record IDs, they do not form a perfect key. In such cases, you need to apply a deduplication strategy to get unique records. Here are a few strategies you can use:

  1. First Record: Choose the first matching record from the set, e.g. via min(id).
  2. Last Record: Choose the last matching record from the set, e.g. via max(id).
  3. Ordered Record: Assign an order to the records in each table and match them up positionally. Use this when the number of records you are attempting to join is identical in both tables.
  4. Nearest Record: Select the record closest to the target record, typically based on a sequential number such as id or a temporal field such as date/time. This is the most complicated strategy; SQL window functions are the usual way to implement it.

Each of these strategies ensures that you have a unique pseudo foreign key for each record that you can next use to join.
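To make strategies 1 and 4 concrete, here is a small sketch in SQLite (window functions require SQLite 3.25+, which ships with modern Python). The leads/orders tables and their columns are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE leads (id INTEGER PRIMARY KEY, email TEXT, created TEXT);
    INSERT INTO leads VALUES
        (1, 'a@x.com', '2024-01-01'),
        (2, 'a@x.com', '2024-01-05');
    CREATE TABLE orders (id INTEGER PRIMARY KEY, email TEXT, created TEXT);
    INSERT INTO orders VALUES (10, 'a@x.com', '2024-01-04');
""")

# Strategy 1, First Record: keep the min(id) lead per email.
first = conn.execute(
    "SELECT email, MIN(id) FROM leads GROUP BY email").fetchall()

# Strategy 4, Nearest Record: rank each order's candidate leads by date
# distance with a window function, then keep only the closest (rank 1).
nearest = conn.execute("""
    SELECT order_id, lead_id FROM (
        SELECT o.id AS order_id, l.id AS lead_id,
               ROW_NUMBER() OVER (
                   PARTITION BY o.id
                   ORDER BY ABS(julianday(o.created) - julianday(l.created))
               ) AS rn
        FROM orders o JOIN leads l ON l.email = o.email
    ) WHERE rn = 1
""").fetchall()
print(first)    # [('a@x.com', 1)]
print(nearest)  # [(10, 2)] -- the Jan 5 lead is closer to the Jan 4 order
```

Note how the two strategies disagree here: First Record picks lead 1, while Nearest Record picks lead 2. Which is right depends on your business case.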

Step 4: Join in Order of Longest Key First

With your deduplicated pseudo keys ready, you can start joining your link tables to look up the related records. Start with the link table with the most fields and progressively join each shorter group. This matches the most specific cases first, then falls back to more general matches, which simplifies the join and improves performance.

One essential tip: always bear data quality in mind. Assess the quality and reliability of the pseudo keys you create, and be conscious of the potential inaccuracies introduced during the joining process. Use your business expertise to gauge which fields are most reliable for matching.
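One way to express this longest-key-first fallback in a single query is a chain of LEFT JOINs combined with COALESCE, which takes the first non-NULL match. This is a sketch under assumed names (link_full, link_email, and a product-changed order), not the only way to structure it:

```python
import sqlite3

# Pre-deduplicated link tables (one lead_id per key), as produced in Step 3.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, email TEXT, product TEXT);
    INSERT INTO orders VALUES
        (10, 'a@x.com', 'widget'),   -- matches on the full key
        (11, 'a@x.com', 'deluxe');   -- product changed: falls back to email
    CREATE TABLE link_full  (email TEXT, product TEXT, lead_id INTEGER);
    INSERT INTO link_full  VALUES ('a@x.com', 'widget', 1);
    CREATE TABLE link_email (email TEXT, lead_id INTEGER);
    INSERT INTO link_email VALUES ('a@x.com', 1);
""")

# COALESCE prefers the most specific key and falls back to the general one.
rows = conn.execute("""
    SELECT o.id, COALESCE(f.lead_id, e.lead_id) AS lead_id
    FROM orders o
    LEFT JOIN link_full  f ON f.email = o.email AND f.product = o.product
    LEFT JOIN link_email e ON e.email = o.email
    ORDER BY o.id
""").fetchall()
print(rows)  # [(10, 1), (11, 1)]
```

Keeping each link table deduplicated before this step is what guarantees the LEFT JOINs cannot fan out an order into multiple rows.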

A Case Study: Joining Leads and Orders Datasets

In a recent case, we were dealing with a leads dataset and an orders dataset for a company. The challenge was that the orders table had no lead_id relation to the leads, and there could be multiple leads and multiple orders associated with the same email address. Typically, the lead's created_date and the order's created_date matched exactly, but not always: for example, the lead may come in today while the client completes the order a few days later.

Despite these complexities, we were able to successfully join the datasets using the methods outlined in this tutorial. We started by identifying 'email', 'product' and various 'date' fields as common fields in both datasets. We then created pseudo foreign keys by grouping distinct records based on these fields.

Given the possibility of multiple leads and orders for the same email, we had to carefully choose our deduplication strategy. We decided to use the 'First Record' strategy for leads, selecting the earliest lead for each email and grouping of fields. This approach made sense for our business case, as we were interested in the initial lead that resulted in each order.

Finally, we joined the datasets using our pseudo foreign keys. We started with the longest key (in this case, the 'email', 'order_date', and 'product' fields) and progressively joined with shorter keys until all the records were matched.

The result was a robust, unified dataset that provided a comprehensive view of the leads-to-orders journey. Despite the lack of proper foreign keys, we were able to link the leads to the orders, providing valuable insights into the sales funnel and customer behavior.

In some cases, the linking will not be perfect. For some orders, the linked lead may not be the one a human would have chosen. Usually, however, these small discrepancies do not have a meaningful statistical impact on the final results.

Wrapping up

This exercise demonstrates that with some clever data engineering, it's possible to join disparate datasets and extract meaningful insights, even when the data is not perfectly structured.

Regardless of whether you’re dealing with sales, leads, or other kinds of databases, knowing how to connect disparate datasets is crucial for gaining the most insights from your data.

Our client was able to get a unified data set and understand the conversion rate of their leads to orders, something they were not able to get properly before because their CRM system did not export proper foreign keys.


PredictableRevOps.org maintains the documentation of the PRO Framework and provides training certification for Partners and clients that wish to use PRO within their businesses.

PRO is sponsored by SalesInsights.io, a reporting and dashboard tool implementing the PRO Framework.

© 2024 Predictable Revenue Operations