Company Name Matching Engine

Cancelled, Posted Mar 31, 2012, Paid on delivery

I often need to match company names between two separate, large CSV files. Matching company names well is not a trivial task. Several algorithms and techniques should be considered, including Levenshtein edit distance, Smith-Waterman distances, Jaccard token distance, and weighting common company-name tokens differently from uncommon ones.
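As a rough illustration of two of these measures, here is a minimal Python sketch of Levenshtein edit distance and a token-level Jaccard score. The function names are my own, and the whitespace tokenization is a simplifying assumption rather than a recommendation. Note that it returns Jaccard similarity; the corresponding distance is 1 minus that value.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def jaccard(a, b):
    """Jaccard similarity of lowercased, whitespace-separated token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # assumption: two empty names count as identical
    return len(ta & tb) / len(ta | tb)
```

For instance, `levenshtein("kitten", "sitting")` returns 3. Neither measure knows which tokens are informative, which is why token weighting matters.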

For example, given company names such as:

DSZ Investments, LLC

D.S.Z Investment Company

DSZ Investments, L.L.C

DSG Investments, LLC

The first three should be considered the same company, but the fourth should be considered a separate company even though its edit distance from the first is only one character. The common token "Company" has to carry very low weight in the match, whereas the uncommon token "DSG" must carry much more weight because of its rarity.
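To make the weighting idea concrete, here is a minimal Python sketch of a weighted Jaccard score over normalized tokens. Everything in it is an assumption for illustration: the hand-written `COMMON_TOKEN_WEIGHTS` table, the crude plural stripping, and the function names. In a real tool the weights would more likely be derived from token frequencies across the full files.

```python
import re

# Hypothetical weights for common legal-form tokens; any other token weighs 1.0.
COMMON_TOKEN_WEIGHTS = {
    "llc": 0.1, "company": 0.1, "inc": 0.1, "corp": 0.1, "co": 0.1, "ltd": 0.1,
}


def tokens(name):
    """Lowercase, drop periods (D.S.Z -> dsz), split on other punctuation,
    and crudely strip a trailing 's' from words longer than three letters."""
    text = name.lower().replace(".", "")
    words = re.split(r"[^a-z0-9]+", text)
    return {w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words if w}


def weight(token):
    return COMMON_TOKEN_WEIGHTS.get(token, 1.0)


def weighted_similarity(a, b):
    """Weighted Jaccard: weight of shared tokens over weight of all tokens."""
    ta, tb = tokens(a), tokens(b)
    total = sum(weight(t) for t in ta | tb)
    if total == 0:
        return 0.0
    return sum(weight(t) for t in ta & tb) / total
```

Compared against "DSZ Investments, LLC", this sketch scores "D.S.Z Investment Company" at about 0.91, "DSZ Investments, L.L.C" at 1.0, and "DSG Investments, LLC" at about 0.35, so a cutoff somewhere around 0.8 (also an assumption) would separate the fourth name from the first three.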

Attached to this post is a highly relevant document that I have read. The principles in it should be codified and integrated into the project.

Experience doing this type of matching or designing these types of algorithms would be very helpful. I work in a Unix environment and am looking for a command-line tool that can be run from the bash shell.
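To give a sense of the shape of tool I have in mind, here is a minimal Python sketch of a command-line matcher. It is not a specification: the `--column` and `--threshold` options, their defaults, and all function names are hypothetical, and the scorer is only a stand-in based on the standard library's `difflib.SequenceMatcher`. A plain character-level score like this would wrongly match DSG with DSZ, so it would need to be replaced with a token-weighted scorer.

```python
import argparse
import csv
import sys
from difflib import SequenceMatcher


def score(a, b):
    """Stand-in similarity in [0, 1]; swap in a token-weighted scorer."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_matches(names_a, names_b, threshold):
    """Yield (name_a, best name_b, score) for each name_a whose best score >= threshold."""
    if not names_b:
        return
    for a in names_a:
        best = max(names_b, key=lambda b: score(a, b))
        s = score(a, best)
        if s >= threshold:
            yield a, best, s


def read_column(path, column):
    """Read one named column from a CSV file that has a header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row[column] for row in csv.DictReader(handle)]


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Match company names across two CSV files.")
    parser.add_argument("file_a")
    parser.add_argument("file_b")
    parser.add_argument("--column", default="company")  # hypothetical header name
    parser.add_argument("--threshold", type=float, default=0.85)  # assumed cutoff
    args = parser.parse_args(argv)

    names_a = read_column(args.file_a, args.column)
    names_b = read_column(args.file_b, args.column)
    writer = csv.writer(sys.stdout)
    writer.writerow(["name_a", "name_b", "score"])
    for a, b, s in best_matches(names_a, names_b, args.threshold):
        writer.writerow([a, b, f"{s:.3f}"])
    return 0

# To run from bash, call main() under the usual `if __name__ == "__main__":` guard.
```

This compares every name in the first file against every name in the second, which will be too slow for large files. A real tool would need some way of narrowing the candidate pairs first.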

Please review the attached document and let's get the conversation going. Canned replies will be ignored.

Thanks for your interest in this project.

Script Installation, Shell Script

Project ID: #2727519

About the project

1 proposal, Remote project, Active Apr 22, 2012

1 freelancer is bidding an average of $636 for this job

AnkSoftware

See private message.

$635.8 USD in 20 days
(4 reviews)
5.0