How to Tackle Duplicates in Your CRM

No matter the tech stack, the pattern is consistent: when identity resolution is done well, teams move faster, reporting becomes trustworthy, and customer engagement gets dramatically more effective. When it’s done poorly (or treated as a one-time cleanup), the downstream cost shows up everywhere.

This post lays out a practical, field-tested framework you can use to set up identity resolution in a way that’s accurate, governable, and built to improve over time.

What identity resolution actually is

Identity resolution answers a deceptively simple question:

When do two records represent the same real-world entity?

That entity might be:

  • A person (Leads, Contacts)

  • A company (Accounts)

  • A location (office/site records)

  • Or a stitched profile in a CDP that spans systems

Getting this right affects:

  • Sales productivity

  • Marketing attribution accuracy

  • Pipeline reporting integrity

  • Personalization and customer experience

  • Executive confidence in analytics

1) Normalize data before you match anything

Matching raw CRM data directly is a mistake. Real-world data contains formatting noise:

  • Mixed casing

  • Punctuation

  • Legal suffixes

  • URL parameters

  • Embedded unit numbers

  • Hyphens and whitespace

The first step is to transform key fields into a lowest-common-denominator format before doing any similarity analysis. The normalized value becomes a durable “comparison key” (or part of one).

Company name normalization (example approach)

Common transformations:

  • Trim leading/trailing spaces

  • Standardize case (UPPER or LOWER)

  • Replace “&” with “AND”

  • Remove legal suffixes (INC, LLC, CORP, LTD, etc.)

  • Remove punctuation

  • Remove whitespace

Result: a canonical string that reduces variation and improves match quality.
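As a rough sketch, the steps above might look like this in Python. The suffix list here is illustrative only and should be extended for your markets (GMBH, SA, PTY, and so on):

```python
import re

# Illustrative suffix list; extend for your markets (GMBH, SA, PTY, ...).
LEGAL_SUFFIXES = {"INC", "LLC", "CORP", "LTD", "CO", "COMPANY", "INCORPORATED", "LIMITED"}

def normalize_company_name(raw: str) -> str:
    """Collapse a raw company name into a canonical comparison key."""
    s = raw.strip().upper()                 # trim + standardize case
    s = s.replace("&", " AND ")             # "&" -> "AND"
    s = re.sub(r"[^A-Z0-9\s]", " ", s)      # drop punctuation
    tokens = [t for t in s.split() if t not in LEGAL_SUFFIXES]
    return "".join(tokens)                  # remove whitespace
```

For example, "Acme, Inc." and "ACME" both reduce to the same key, as do "Smith & Sons LLC" and "Smith and Sons".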

Website & domain normalization (example approach)

Recommended steps:

  • Trim and lowercase

  • Remove protocol (http://, https://)

  • Remove www.

  • Remove everything after /, ?, or #

  • Extract the core domain

This eliminates the “noise” and focuses on the identity anchor: the domain.
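A minimal Python sketch of these steps. Note that fully extracting a registered domain from arbitrary subdomains requires a public-suffix list, which this deliberately skips:

```python
import re

def normalize_domain(raw: str) -> str:
    """Reduce a URL or bare domain string to its core domain."""
    s = raw.strip().lower()                      # trim and lowercase
    s = re.sub(r"^https?://", "", s)             # remove protocol
    s = re.sub(r"^www\.", "", s)                 # remove www.
    s = re.split(r"[/?#]", s, maxsplit=1)[0]     # drop path, query, fragment
    return s
```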

Name normalization (person-level)

Useful transformations:

  • Trim

  • Uppercase

  • Remove punctuation

  • Remove hyphens

  • Remove whitespace

  • Remove suffixes (JR, SR, II, III)

Then create a derived key:

name_dedupe_key = first_name_norm + last_name_norm

Important: name-based keys should never be used alone for merges. They are supporting signals that must be paired with a stronger identifier (email, phone, LinkedIn).
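The transformations and derived key might be sketched as follows. The suffix set and the "|" separator are assumptions; the separator guards against boundary collisions such as ANN + ADAMS colliding with ANNA + DAMS:

```python
import re

NAME_SUFFIXES = {"JR", "SR", "II", "III", "IV"}   # illustrative suffix set

def normalize_name_part(raw: str) -> str:
    """Normalize one name component (first or last)."""
    s = raw.strip().upper()                        # trim + uppercase
    s = re.sub(r"[^A-Z\s]", "", s)                 # punctuation and hyphens
    tokens = [t for t in s.split() if t not in NAME_SUFFIXES]
    return "".join(tokens)                         # remove whitespace

def name_dedupe_key(first: str, last: str) -> str:
    # Supporting signal only: never merge on this key without a stronger
    # identifier (email, phone, LinkedIn).
    return normalize_name_part(first) + "|" + normalize_name_part(last)
```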

Email normalization

Email transformations:

  • Trim

  • Lowercase

  • Extract from display strings (Name <email@domain.com>)

  • Remove surrounding punctuation

Be cautious with generic inboxes like info@, sales@, support@. Those require corroborating attributes before matching due to high false-positive risk.
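A sketch combining extraction, normalization, and a generic-inbox check; the inbox list is illustrative, not exhaustive:

```python
import re

GENERIC_INBOXES = {"info", "sales", "support", "admin", "contact", "hello"}

def normalize_email(raw: str):
    """Extract and normalize an email address; None if nothing parseable."""
    m = re.search(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", raw)   # handles "Name <email>"
    return m.group(0).lower() if m else None

def is_generic_inbox(email: str) -> bool:
    # High false-positive risk: require corroborating attributes before matching.
    return email.split("@", 1)[0] in GENERIC_INBOXES
```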

LinkedIn slug extraction (high-value identifier)

Best practice:

  • Lowercase

  • Remove query strings and fragments

  • Remove trailing /

  • Extract slug after /in/ or /company/

  • Normalize separators

Once consistently extracted, LinkedIn slugs can be excellent identity anchors, especially for Accounts (company pages) and Contacts (person profiles).
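The extraction steps might be sketched as follows; treating underscore-to-hyphen normalization as an assumption about separator noise:

```python
import re

def linkedin_slug(raw: str):
    """Extract the slug after /in/ or /company/ from a LinkedIn URL."""
    s = raw.strip().lower()                        # lowercase
    s = re.split(r"[?#]", s, maxsplit=1)[0]        # remove query string/fragment
    s = s.rstrip("/")                              # remove trailing /
    m = re.search(r"/(?:in|company)/([^/]+)$", s)  # extract the slug
    # Normalizing "_" to "-" is an assumption about separator noise.
    return m.group(1).replace("_", "-") if m else None
```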

Address normalization (where most teams get stuck)

Address data is noisy and inconsistent. For best results:

  • Strip embedded unit/suite info for street-level matching

  • Standardize street types (ST, AVE, BLVD, RD)

  • Standardize directional tokens (N, S, E, W, NE, NW, etc.)

  • Normalize numeric forms (FIRST → 1ST)

  • Normalize PO BOX formats consistently

A strong best practice is to maintain two derived fields:

  • address1_street_norm (unit removed; best for entity matching)

  • address1_full_norm (unit retained; best for location matching)

Address matching should always be paired with at least one other attribute (company name, domain, LinkedIn, phone).
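A simplified sketch that produces both derived fields from one function. The token maps are illustrative; production systems typically follow a full standard such as USPS Publication 28:

```python
import re

# Illustrative token maps; see USPS Publication 28 for the full standard.
STREET_TYPES = {"STREET": "ST", "AVENUE": "AVE", "BOULEVARD": "BLVD",
                "ROAD": "RD", "DRIVE": "DR", "LANE": "LN"}
DIRECTIONALS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
                "NORTHEAST": "NE", "NORTHWEST": "NW",
                "SOUTHEAST": "SE", "SOUTHWEST": "SW"}
ORDINALS = {"FIRST": "1ST", "SECOND": "2ND", "THIRD": "3RD"}
UNIT_DESIGNATORS = {"STE", "SUITE", "UNIT", "APT", "APARTMENT",
                    "FL", "FLOOR", "RM", "ROOM"}

def normalize_address(raw: str, keep_unit: bool = True) -> str:
    """keep_unit=True -> address1_full_norm; False -> address1_street_norm."""
    s = re.sub(r"[^\w\s#]", " ", raw.strip().upper())
    out, skip_next = [], False
    for t in s.split():
        if skip_next:                              # number after a unit designator
            skip_next = False
            continue
        if not keep_unit and (t in UNIT_DESIGNATORS or t.startswith("#")):
            skip_next = t in UNIT_DESIGNATORS      # also drop the unit number
            continue
        out.append(STREET_TYPES.get(t, DIRECTIONALS.get(t, ORDINALS.get(t, t))))
    return " ".join(out)
```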

2) Don’t rely on a single attribute

Using only one field (like Company Name) is rarely sufficient. Strong matching comes from combining attributes.

Person-level composite examples

  • Email + Name

  • Phone + Name

  • LinkedIn slug + Name

Account-level composite examples

  • Domain + Company Name

  • LinkedIn company slug + Company Name

  • Address1 + Company Name

  • Address1 + Phone

Identity resolution improves dramatically when you use layered, multi-attribute signals rather than single-field equality.
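One way to sketch layered composite keys and candidate generation. Field names like domain_norm are assumptions carried over from the normalization steps above:

```python
from collections import defaultdict

def account_match_keys(rec: dict) -> set:
    """Layered composite keys for an Account (fields assumed pre-normalized)."""
    keys = set()
    if rec.get("domain_norm") and rec.get("company_norm"):
        keys.add(("domain+name", rec["domain_norm"], rec["company_norm"]))
    if rec.get("linkedin_slug") and rec.get("company_norm"):
        keys.add(("linkedin+name", rec["linkedin_slug"], rec["company_norm"]))
    if rec.get("address1_street_norm") and rec.get("phone_norm"):
        keys.add(("address+phone", rec["address1_street_norm"], rec["phone_norm"]))
    return keys

def duplicate_candidates(records: list) -> dict:
    """Bucket records sharing any composite key. Buckets with 2+ records are
    merge candidates for review, not automatic merges."""
    buckets = defaultdict(list)
    for rec in records:
        for key in account_match_keys(rec):
            buckets[key].append(rec["id"])
    return {k: ids for k, ids in buckets.items() if len(ids) > 1}
```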

3) Use similarity scoring to handle typos and small variations

Exact matching alone will miss real duplicates.

Similarity scoring (e.g., Levenshtein distance) helps detect records that are “close enough” when values differ by minor spelling variation.

Best practices:

  • Test multiple thresholds (80%, 85%, 90%, 95%)

  • Measure precision vs. recall

  • Treat similarity as a signal, not an automatic merge trigger

  • Require corroboration from a second attribute when risk is high
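Levenshtein distance and a length-normalized score can be sketched as:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Length-normalized similarity in [0, 1]; 1.0 means identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

For threshold testing, "MICROSOFT" vs "MICROSFT" scores about 0.89, so it survives an 85% cutoff but not a 90% one; that is exactly the kind of boundary case worth sampling when measuring precision vs. recall.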

4) Run distribution analysis before you execute merges

This is one of the highest-leverage steps, and it is often skipped.

Before activating any matching rule, run frequency and distribution analysis to quantify impact:

  • How many duplicate clusters are produced

  • How many total records are involved

  • Cluster size distribution (pairs vs 5 vs 10 vs 50+)

  • The largest clusters (anomaly detection)

  • % of the database impacted

Large clusters are not automatically “bad,” but they are always worth inspecting. They can indicate things like shared corporate domains, large parent-child hierarchies, franchise models, or broad identifiers that need special handling.
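A small sketch of the kind of pre-merge report this implies, assuming clusters arrive as a mapping of match key to record IDs:

```python
from collections import Counter

def cluster_report(clusters: dict, total_records: int) -> dict:
    """Quantify a matching rule's impact before any merge is executed."""
    sizes = sorted((len(ids) for ids in clusters.values() if len(ids) > 1),
                   reverse=True)
    involved = sum(sizes)
    return {
        "duplicate_clusters": len(sizes),
        "records_involved": involved,
        "size_distribution": dict(sorted(Counter(sizes).items())),
        "largest_clusters": sizes[:5],       # anomaly candidates to inspect
        "pct_of_database": round(100 * involved / total_records, 2)
                           if total_records else 0.0,
    }
```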

5) Survivorship rules are as important as matching rules

Identifying duplicates is only half the problem. Once you’ve identified candidate matches, you need rules for:

  • Which record becomes the “master”

  • Which attributes carry forward

  • What happens when values conflict

  • Whether certain systems override others

  • Whether recency beats completeness (or vice versa)

Survivorship is ultimately a business policy decision, and it should be documented clearly before merges become automated.
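One illustrative survivorship policy, not a recommendation, might look like the following; the field names and the ordering (source trust, then completeness, then recency) are assumptions your business may well reverse:

```python
def pick_master(records: list, source_priority: list) -> dict:
    """Pick the surviving record: trusted source first, then field
    completeness, then recency. The ordering is a business decision."""
    rank = {src: i for i, src in enumerate(source_priority)}

    def sort_key(r):
        completeness = sum(1 for v in r.values() if v not in (None, ""))
        return (rank.get(r.get("source"), len(rank)),  # lower = more trusted
                -completeness,                          # more populated fields wins
                -(r.get("updated_at") or 0))            # newer (epoch secs) wins

    return min(records, key=sort_key)
```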

6) Model transactions separately from identity

Sometimes teams create duplicate entity records to represent transactional activity (registrations, referrals, submissions, etc.). That approach pollutes identity.

Best practice: model transactional activity in child objects with N:1 relationships to the entity record (Person/Account).

This keeps identity clean while preserving the ability to report on transactions.
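A minimal sketch of the shape this implies, using hypothetical Person and Registration objects:

```python
from dataclasses import dataclass

@dataclass
class Person:                      # one identity record per real person
    person_id: str
    email_norm: str

@dataclass
class Registration:                # transactional child object
    registration_id: str
    person_id: str                 # N:1 foreign key back to the Person
    event: str

# Two registrations, one identity: no duplicate Person rows are created,
# and transactional reporting still has every event.
jane = Person("p-1", "jane.doe@example.com")
regs = [
    Registration("r-1", jane.person_id, "Webinar A"),
    Registration("r-2", jane.person_id, "Conference B"),
]
```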

7) Make identity resolution continuous, not static

Identity resolution is not a one-time cleanup. It’s a living system that requires ongoing evaluation because the world changes:

  • New data sources are introduced

  • Acquisition lists arrive in new formats

  • Enrichment vendors update schema/coverage

  • Teams change required fields and processes

  • Companies rename, merge, split, or rebrand

  • International data introduces new address and phone patterns

What ongoing evaluation looks like in practice

Monitor

  • Cluster growth trends over time

  • Largest clusters (monthly review)

  • Match rates by rule and by source system

  • False positives and false negatives (via sampling)

Refine

  • Similarity thresholds (quarterly tuning)

  • Blocking keys and composite rules

  • Exception handling for shared identifiers (franchises, shared domains, etc.)

  • New derived fields as your CRM/CDP schema evolves

Govern

  • Version your rule sets

  • Require distribution analysis before any rule change goes live

  • Document survivorship policy decisions

  • Maintain clear ownership and change control

This is how identity resolution stays accurate as your data ecosystem evolves.

Final takeaway

Effective identity resolution is built on five pillars:

  • Field normalization

  • Multi-attribute matching

  • Similarity scoring

  • Distribution analysis before execution

  • Ongoing evaluation and refinement

When treated as an operational discipline (not a one-time project), identity resolution becomes the invisible foundation that enables trustworthy reporting, better activation, and a much healthier CRM.