Finding similar words

We had a situation in which application users were not checking the customer list before adding new customers, which was resulting in multiple entries for the same customer. We wanted to make it so that when they tried to add a new customer, they would be shown a list of possible duplicates.

It turned out that finding duplicates wasn’t as easy as I thought. Plain old text matching wasn’t good enough, as our users are quite creative in the way they find variant spellings of the customer names. SQL Server has the DIFFERENCE feature, as well as full text searching, which includes a thesaurus, as well as fuzzy searching using CONTAINS. Aside: Why do we still type SQL keywords in uppercase, even though it’s not necessary and looks really ugly?

Anyway, using SQL Server wasn’t really an option, as we needed speed, so had to do all this in the client application, without making repeated round-trips to the server every time they typed something in the customer name text box.

Some research turned up the Metaphone algorithm (and its successor, Double Metaphone), which replace all words with an approximate phonetic equivalent, and use that for matching. I found a C# implementation, and began to play.

The code as it stood worked OK, but had some deficiencies that I wanted to avoid. The most important of these was that it only really worked for single words. Now in all fairness, the sample project there was not intended to be full production project, but rather a sample, so this isn’t a criticism of the author, but it did fall quite short of what I wanted. It also used a HashTable to store the data, which involved a fair amount of casting.

To cut a long story short, I ended up with the following class…

using System;
using System.Collections.Generic;
using System.Linq;
using nullpointer.Metaphone;

namespace CheckMultipleWords.Helpers {
  public class SimilarityChooser<T, To> where To : SimilarityChooserEntityInterface {
    private readonly Dictionary<string, List<To>> _map;

    public SimilarityChooser(
      IEnumerable<T> entities,
      Func<T, string> getWord,
      Func<T, To> toOverview
    ) {
      _map = new Dictionary<string, List<To>>();
      foreach (T entity in entities) {
        foreach (string word in getWord(entity).Split(' ')) {
          DoubleMetaphone mp = new DoubleMetaphone(word);
          AddToMap(_map, mp.PrimaryKey, entity, toOverview);
          if (mp.AlternateKey != null) {
            AddToMap(_map, mp.AlternateKey, entity, toOverview);
          }
        }
      }
    }

    private static void AddToMap(
      IDictionary<string, List<To>> wordsMap,
      string key,
      T c,
      Func<T, To> convert
    ) {
      if (!wordsMap.ContainsKey(key)) {
        wordsMap[key] = new List<To> { convert(c) };
      } else {
        wordsMap[key].Add(convert(c));
      }
    }

    /// <summary>
    /// Find similar words to the one passed in
    /// </summary>

    /// <param name="word">The word to be used for comparison</param>
    /// <returns></returns>
    public IEnumerable<To> FindSimilar(string word) =>
      word
        .Split(' ')
        .Select(bit => new DoubleMetaphone(bit))
        .Where(mp => _map.ContainsKey(mp.PrimaryKey))
        .SelectMany(mp => _map[mp.PrimaryKey])
        .GroupBy(c => c.ID)
        .OrderByDescending(g => g.Count())
        .Select(g => g.First())
        .ToList();
  }

  public class SimilarityChooser {
    /// <summary>
    /// Create a new instance of a SimilarityChooser, which allows you to find similar words to those in the collection
    /// </summary>

    /// <param name="entities">A collection of entities that contains a string property to be used for the similarity search</param>
    /// <param name="getWord">A function that selects the string property on the entity</param>
    /// <param name="toOverview">A function that converts the entity to a lightweight version (use e => e if you want to store the full entity)</param>
    public static SimilarityChooser<T, To> CreateMap<T, To>(
      IEnumerable<T> entities,
      Func<T, string> getWord,
      Func<T, To> toOverview
    ) where To : SimilarityChooserEntityInterface =>
      new SimilarityChooser<T, To>(entities, getWord, toOverview);
  }
}

This uses the following interface, which ensures that the entity has an ID property that uniquely identifies it…

  public interface SimilarityChooserEntityInterface {
    int ID { get; set; }
  }

The way it works is that you first pass in the data to create the map. This requires the entity collection (customers in our case), a Func to get the string property to be used for comparison (such as the customer name) and a Func that creates a lightweight version of the entity that is stored in the map. The reason for the last item is that when we find possible duplicates, we might want to display more data than just the name, and will almost certainly want an identifier, such as the entity’s database ID. This should become clearer when you see the sample code below.

Creating the map can be done either from the constructor, or by a factory method. The following two are equivalent…

      SimilarityChooser<Customer, CustomerNameOverview> chooser
        = SimilarityChooser.CreateMap(
          CustomerRepository.GetAll(),
          c => c.Name,
          c => c.ToCustomerNameOverview()
        );

      SimilarityChooser<Customer, CustomerNameOverview> chooser
        = new SimilarityChooser<Customer, CustomerNameOverview>(
          CustomerRepository.GetAll(),
          c => c.Name,
          c => c.ToCustomerNameOverview()
        );

C# doesn’t infer generic types for constructors, so the first version is cleaner, as it doesn’t require generic parameters on the factory method.

As you can see, the second parameter is a simple Func that allows the class to access the string property that will be used for the comparison

The CustomerNameOverview class implements the the SimilarityChooserEntityInterface interface, and contains the database ID, the customer name and a string property for some extra details that will be displayed to the application user…

  public class CustomerNameOverview : SimilarityChooserEntityInterface {
    public int ID { get; set; }
    public string Name { get; set; }
    public string Details { get; set; }
  }

The Customer entity has the ToCustomerNameOverview method…

    public CustomerNameOverview ToCustomerNameOverview() =>
      new CustomerNameOverview {
        ID = ID,
        Name = Name,
        Details = CustomerType + " - " + City + ", " + Country
      };

If the entity is of modest size, you could simply store the entire entity in the map, in which case the third parameter would look like this…

          c => c

…however, this would require you to add the SimilarityChooserEntityInterface interface to your entity.

Using the class to find similar customers is now very simple…

IEnumerable<MatchingCustomerOverview> duplicates = _chooser.FindSimilar(name);

The collection returned can be bound to the view…Showing duplicate customers

The challenge of binding this with MVVM was a story in itself, which will hopefully be the subject of a separate blog post.

Be First to Comment

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.