How to Tackle Duplicates in Your CRM
No matter the tech stack, the pattern is consistent: when identity resolution is done well, teams move faster, reporting becomes trustworthy, and customer engagement gets dramatically more effective. When it’s done poorly (or treated as a one-time cleanup), the downstream cost shows up everywhere.
This post lays out a practical, field-tested framework you can use to set up identity resolution in a way that’s accurate, governable, and built to improve over time.
What identity resolution actually is
Identity resolution answers a deceptively simple question:
When do two records represent the same real-world entity?
That entity might be:
A person (Leads, Contacts)
A company (Accounts)
A location (office/site records)
Or a stitched profile in a CDP that spans systems
Getting this right affects:
Sales productivity
Marketing attribution accuracy
Pipeline reporting integrity
Personalization and customer experience
Executive confidence in analytics
1) Normalize data before you match anything
Matching raw CRM data directly is a mistake. Real-world data contains formatting noise:
Mixed casing
Punctuation
Legal suffixes
URL parameters
Embedded unit numbers
Hyphens and whitespace
The first step is to transform key fields into a lowest-common-denominator format before doing any similarity analysis. The normalized value becomes a durable “comparison key” (or part of one).
Company name normalization (example approach)
Common transformations:
Trim leading/trailing spaces
Standardize case (UPPER or LOWER)
Replace “&” with “AND”
Remove legal suffixes (INC, LLC, CORP, LTD, etc.)
Remove punctuation
Remove whitespace
Result: a canonical string that reduces variation and improves match quality.
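Here's a minimal sketch of these transformations in Python; the suffix list is illustrative, not exhaustive, and real data will need a longer one:

```python
import re

# Illustrative suffix list; extend it for your own data.
LEGAL_SUFFIXES = {"INC", "LLC", "CORP", "LTD", "CO", "GMBH", "PLC"}

def normalize_company_name(raw: str) -> str:
    """Produce a canonical comparison key for a company name."""
    value = raw.strip().upper()
    value = value.replace("&", "AND")
    value = re.sub(r"[^\w\s]", "", value)                      # drop punctuation
    tokens = [t for t in value.split() if t not in LEGAL_SUFFIXES]
    return "".join(tokens)                                     # drop whitespace

print(normalize_company_name("Acme Widgets, Inc."))   # ACMEWIDGETS
print(normalize_company_name("Acme & Widgets LLC"))   # ACMEANDWIDGETS
```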
Website & domain normalization (example approach)
Recommended steps:
Trim and lowercase
Remove protocol (http://, https://)
Remove www.
Remove everything after /, ?, or #
Extract the core domain
This eliminates the “noise” and focuses on the identity anchor: the domain.
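A sketch of the same idea for websites; this handles the simple cases, while subdomains and country-code TLDs may need a public-suffix library:

```python
import re

def normalize_domain(raw: str) -> str:
    """Reduce a website value to its core domain."""
    value = raw.strip().lower()
    value = re.sub(r"^https?://", "", value)    # drop protocol
    value = re.sub(r"^www\.", "", value)        # drop www.
    return re.split(r"[/?#]", value)[0]         # drop path, query, fragment

print(normalize_domain("HTTPS://www.Example.com/products?utm_source=x"))  # example.com
```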
Name normalization (person-level)
Useful transformations:
Trim
Uppercase
Remove punctuation
Remove hyphens
Remove whitespace
Remove suffixes (JR, SR, II, III)
Then create a derived key:
name_dedupe_key = first_name_norm + last_name_norm
Important: name-based keys should never be used alone for merges. They are supporting signals that must be paired with a stronger identifier (email, phone, LinkedIn).
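As an illustration, here's how the derived key might be built (the suffix list is an example, and per the caveat above the key is only a supporting signal):

```python
import re

NAME_SUFFIXES = {"JR", "SR", "II", "III", "IV"}   # illustrative

def normalize_name_part(raw: str) -> str:
    value = raw.strip().upper()
    value = re.sub(r"[^\w\s]", "", value)                     # punctuation and hyphens
    tokens = [t for t in value.split() if t not in NAME_SUFFIXES]
    return "".join(tokens)                                    # drop whitespace

def name_dedupe_key(first_name: str, last_name: str) -> str:
    return normalize_name_part(first_name) + normalize_name_part(last_name)

print(name_dedupe_key("  Mary-Anne ", "O'Brien Jr."))   # MARYANNEOBRIEN
```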
Email normalization
Email transformations:
Trim
Lowercase
Extract from display strings (Name <email@domain.com>)
Remove surrounding punctuation
Be cautious with generic inboxes like info@, sales@, support@. Those require corroborating attributes before matching due to high false-positive risk.
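A rough sketch of email normalization, including a simple flag for generic inboxes (the regex and inbox list are illustrative):

```python
import re

GENERIC_INBOXES = {"info", "sales", "support", "contact", "admin"}   # illustrative

def normalize_email(raw: str) -> str | None:
    """Extract and normalize an address, including from display strings."""
    value = raw.strip().lower()
    match = re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", value)
    return match.group(0).strip(".,;") if match else None

def is_generic_inbox(email: str) -> bool:
    """Flag shared inboxes that need corroborating attributes before matching."""
    return email.split("@", 1)[0] in GENERIC_INBOXES

print(normalize_email("Jane Doe <Jane.Doe@Example.com>"))   # jane.doe@example.com
print(is_generic_inbox("info@example.com"))                 # True
```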
LinkedIn slug extraction (high-value identifier)
Best practice:
Lowercase
Remove query strings and fragments
Remove trailing /
Extract slug after /in/ or /company/
Normalize separators
Once consistently extracted, LinkedIn slugs can be excellent identity anchors, especially for Accounts (company pages) and Contacts (person profiles).
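A possible extraction routine; the underscore-to-hyphen step is one illustrative way to normalize separators:

```python
import re

def extract_linkedin_slug(raw: str) -> str | None:
    """Pull the person or company slug out of a LinkedIn URL."""
    value = raw.strip().lower()
    value = re.split(r"[?#]", value)[0]                   # drop query string / fragment
    value = value.rstrip("/")                             # drop trailing slash
    match = re.search(r"/(?:in|company)/([^/]+)", value)
    return match.group(1).replace("_", "-") if match else None

print(extract_linkedin_slug("https://www.linkedin.com/in/Jane-Doe-12345/?trk=x"))  # jane-doe-12345
print(extract_linkedin_slug("https://linkedin.com/company/Acme_Corp"))             # acme-corp
```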
Address normalization (where most teams get stuck)
Address data is noisy and inconsistent. For best results:
Strip embedded unit/suite info for street-level matching
Standardize street types (ST, AVE, BLVD, RD)
Standardize directional tokens (N, S, E, W, NE, NW, etc.)
Normalize numeric forms (FIRST → 1ST)
Normalize PO BOX formats consistently
A strong best practice is to maintain two derived fields:
address1_street_norm (unit removed; best for entity matching)
address1_full_norm (unit retained; best for location matching)
Address matching should always be paired with at least one other attribute (company name, domain, LinkedIn, phone).
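Here's a simplified sketch that produces both derived fields; the street-type and directional maps are illustrative, and production address standardization usually leans on a dedicated library or postal reference data:

```python
import re

# Illustrative token maps; a real implementation would be far more complete.
STREET_TYPES = {"STREET": "ST", "AVENUE": "AVE", "BOULEVARD": "BLVD", "ROAD": "RD"}
DIRECTIONS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W"}
UNIT_PATTERN = r"\b(?:SUITE|STE|UNIT|APT|#)\s*\S+"

def normalize_address(raw: str, keep_unit: bool = False) -> str:
    value = raw.strip().upper()
    value = re.sub(r"[^\w\s#]", "", value)               # drop punctuation
    if not keep_unit:
        value = re.sub(UNIT_PATTERN, "", value)          # strip unit/suite info
    tokens = [DIRECTIONS.get(t, STREET_TYPES.get(t, t)) for t in value.split()]
    return " ".join(tokens)

raw = "123 North Main Street, Suite 400"
print(normalize_address(raw))                  # 123 N MAIN ST            (street_norm)
print(normalize_address(raw, keep_unit=True))  # 123 N MAIN ST SUITE 400  (full_norm)
```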
2) Don’t rely on a single attribute
Using only one field (like Company Name) is rarely sufficient. Strong matching comes from combining attributes.
Person-level composite examples
Email + Name
Phone + Name
LinkedIn slug + Name
Account-level composite examples
Domain + Company Name
LinkedIn company slug + Company Name
Address1 + Company Name
Address1 + Phone
Identity resolution improves dramatically when you use layered, multi-attribute signals rather than single-field equality.
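A sketch of what layered composite keys could look like; the *_norm field names assume the normalization step above:

```python
def account_match_keys(record: dict) -> set[tuple]:
    """Build layered composite keys from the *_norm fields created earlier."""
    keys = set()
    if record.get("domain_norm") and record.get("company_name_norm"):
        keys.add(("domain+name", record["domain_norm"], record["company_name_norm"]))
    if record.get("linkedin_slug") and record.get("company_name_norm"):
        keys.add(("linkedin+name", record["linkedin_slug"], record["company_name_norm"]))
    if record.get("address1_street_norm") and record.get("phone_norm"):
        keys.add(("address+phone", record["address1_street_norm"], record["phone_norm"]))
    return keys

a = {"domain_norm": "acme.com", "company_name_norm": "ACMEWIDGETS"}
b = {"domain_norm": "acme.com", "company_name_norm": "ACMEWIDGETS", "phone_norm": "5550100"}
print(account_match_keys(a) & account_match_keys(b))   # shared key -> match candidate
```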
3) Use similarity scoring to handle typos and small variations
Exact matching alone will miss real duplicates.
Similarity scoring (e.g., Levenshtein distance) helps detect records that are “close enough” when values differ by minor spelling variation.
Best practices:
Test multiple thresholds (80%, 85%, 90%, 95%)
Measure precision vs. recall
Treat similarity as a signal, not an automatic merge trigger
Require corroboration from a second attribute when risk is high
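For illustration, here's a compact edit-distance similarity score; in practice a library such as rapidfuzz, paired with blocking keys, is the usual choice:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (ca != cb)))      # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    return 1.0 if not a and not b else 1 - levenshtein(a, b) / max(len(a), len(b))

print(similarity("ACMEWIDGETS", "ACMEWIDGET"))   # ~0.91 -> clears a 90% threshold
```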
4) Run distribution analysis before you execute merges
This is one of the highest-leverage steps and it’s often skipped.
Before activating any matching rule, run frequency and distribution analysis to quantify impact:
How many duplicate clusters are produced
How many total records are involved
Cluster size distribution (pairs vs. clusters of 5, 10, or 50+ records)
The largest clusters (anomaly detection)
% of the database impacted
Large clusters are not automatically “bad,” but they are always worth inspecting. They can indicate things like shared corporate domains, large parent-child hierarchies, franchise models, or broad identifiers that need special handling.
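A small sketch of the kind of pre-merge report this step produces; the cluster data here is hypothetical:

```python
from collections import Counter

def cluster_report(clusters: dict[str, list[str]], total_records: int) -> None:
    """Summarize a matching rule's impact before any merges run.
    `clusters` maps a match key to the record IDs sharing it (size >= 2)."""
    sizes = [len(ids) for ids in clusters.values()]
    impacted = sum(sizes)
    print(f"Duplicate clusters: {len(sizes)}")
    print(f"Records involved:   {impacted} ({impacted / total_records:.1%} of database)")
    print(f"Size distribution:  {dict(sorted(Counter(sizes).items()))}")
    print("Largest clusters (inspect before merging):")
    for key, ids in sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)[:5]:
        print(f"  {key}: {len(ids)} records")

clusters = {"acme.com": ["001A", "001B", "001C"], "globex.com": ["002A", "002B"]}
cluster_report(clusters, total_records=10_000)
```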
5) Survivorship rules are as important as matching rules
Identifying duplicates is only half the problem. Once you’ve identified candidate matches, you need rules for:
Which record becomes the “master”
Which attributes carry forward
What happens when values conflict
Whether certain systems override others
Whether recency beats completeness (or vice versa)
Survivorship is ultimately a business policy decision, and it should be documented clearly before merges become automated.
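As one illustrative policy (completeness first, recency as tie-breaker), survivorship selection might look like this:

```python
from datetime import datetime

def pick_master(records: list[dict]) -> dict:
    """One illustrative policy: the most complete record wins, recency breaks ties.
    The real policy is a business decision and should be documented."""
    def completeness(r: dict) -> int:
        return sum(1 for v in r.values() if v not in (None, ""))
    return max(records, key=lambda r: (completeness(r), r.get("last_modified", datetime.min)))

a = {"id": "001A", "email": "jane@acme.com", "phone": None,
     "last_modified": datetime(2024, 1, 5)}
b = {"id": "001B", "email": "jane@acme.com", "phone": "555-0100",
     "last_modified": datetime(2023, 6, 1)}
print(pick_master([a, b])["id"])   # 001B -- more complete, even though older
```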
6) Model transactions separately from identity
Sometimes teams create duplicate entity records to represent transactional activity (registrations, referrals, submissions, etc.). That approach pollutes identity.
Best practice: model transactional activity in child objects with N:1 relationships to the entity record (Person/Account).
This keeps identity clean while preserving the ability to report on transactions.
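A minimal sketch of the shape of that model; the object and field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Account:
    account_id: str
    name: str

@dataclass
class Registration:          # child object: N registrations -> 1 account
    registration_id: str
    account_id: str          # lookup to the single identity record
    event: str

acme = Account("001A", "Acme Widgets")
regs = [Registration("R1", "001A", "Webinar 2024"),
        Registration("R2", "001A", "Annual Conference")]
# One identity record, many transactions -- no duplicate Accounts required.
print(sum(r.account_id == acme.account_id for r in regs))   # 2
```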
7) Make identity resolution continuous, not static
Identity resolution is not a one-time cleanup. It’s a living system that requires ongoing evaluation because the world changes:
New data sources are introduced
Acquisition lists arrive in new formats
Enrichment vendors update schema/coverage
Teams change required fields and processes
Companies rename, merge, split, or rebrand
International data introduces new address and phone patterns
What ongoing evaluation looks like in practice
Monitor
Cluster growth trends over time
Largest clusters (monthly review)
Match rates by rule and by source system
False positives and false negatives (via sampling)
Refine
Similarity thresholds (quarterly tuning)
Blocking keys and composite rules
Exception handling for shared identifiers (franchises, shared domains, etc.)
New derived fields as your CRM/CDP schema evolves
Govern
Version your rule sets
Require distribution analysis before any rule change goes live
Document survivorship policy decisions
Maintain clear ownership and change control
This is how identity resolution stays accurate as your data ecosystem evolves.
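As a sketch of the "monitor" side, match rate by source system might be tracked with something like this (field names are assumptions):

```python
from collections import defaultdict

def match_rate_by_source(records: list[dict]) -> dict[str, float]:
    """Share of incoming records per source system that matched an existing identity."""
    totals, matched = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["source_system"]] += 1
        matched[r["source_system"]] += bool(r.get("matched_master_id"))
    return {src: matched[src] / totals[src] for src in totals}

batch = [
    {"source_system": "webform", "matched_master_id": "001A"},
    {"source_system": "webform", "matched_master_id": None},
    {"source_system": "list_import", "matched_master_id": "002B"},
]
print(match_rate_by_source(batch))   # {'webform': 0.5, 'list_import': 1.0}
```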
Final takeaway
Effective identity resolution is built on five pillars:
Field normalization
Multi-attribute matching
Similarity scoring
Distribution analysis before execution
Ongoing evaluation and refinement
When treated as an operational discipline (not a one-time project), identity resolution becomes the invisible foundation that enables trustworthy reporting, better activation, and a much healthier CRM.