Acofarma: identification of potential market

Problem

Identify the same businesses across one hundred thousand records when their information does not match exactly.

Solution

Address normalization with Google Maps Platform, followed by deduplication that uses several scoring criteria to find matching candidates.

Results

98% of the records are matched to their peers; the remaining 2% must be reviewed manually before a final decision is made. This saves time and yields better results.

Context

Measuring a company's market penetration is difficult when there is no panel to help calculate it. In a B2B business, we need a database of every business in the market in question and, separately, a list of all our customers. How do we deduplicate identical customers across several files when their identifying information (NIF, addresses, phone numbers, email…) differs and varies in quality?

Process for obtaining peer candidates

Input

We start from eight internal and external (reference) files with information on pharmacies. We audit each file (descriptive fields and values) and map the fields across the files. We then select the fields we will use to find peer candidates: address, NIF, email and telephone. We know that the files contain inconsistencies across these four identifying fields.

Standardization, Deduplication and Enrichment Processes

We tag each record with its source file and prepare the data for address normalization with the Google Maps Platform service. The resulting file contains records with normalized, geocoded addresses. Not all records can be normalized, because some addresses are outdated, partial or misspelled.
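The normalization step can be sketched as follows. This is a minimal illustration, assuming the Geocoding API of Google Maps Platform is the service in question; the helper names `build_geocode_url` and `parse_geocode_response` and the output field names are hypothetical, and the sample payload in the usage note is made up for illustration.

```python
from typing import Optional
from urllib.parse import urlencode

# Public endpoint of the Google Maps Platform Geocoding API (JSON output).
GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address: str, api_key: str) -> str:
    """Build the request URL for one raw address (hypothetical helper)."""
    return GEOCODE_URL + "?" + urlencode({"address": address, "key": api_key})


def parse_geocode_response(payload: dict) -> Optional[dict]:
    """Extract the normalized address and coordinates from a response.

    Returns None when the address could not be normalized, which is the
    case for old, partial or misspelled addresses.
    """
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    best = payload["results"][0]  # assumption: keep only the first result
    location = best["geometry"]["location"]
    return {
        "normalized_address": best["formatted_address"],
        "lat": location["lat"],
        "lng": location["lng"],
    }
```

For example, a payload such as `{"status": "ZERO_RESULTS", "results": []}` yields `None`, so the record stays in the file without a normalized address and has to rely on the other identifying fields.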

Then we match the internal files against the external reference files and compute a score for each candidate match, according to the following criteria:

Equal geo-coordinates: 10 points
Distance under 32 m: 3 points
Name similarity (string distance): 3 points
Normalized address similarity (string distance): 3 points
Same NIF: 4 points
Same phone number: 3 points
Same email: 3 points
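The criteria above can be sketched as a scoring function. This is an illustration under stated assumptions, not the implementation used in the project: the `Record` type, the use of `difflib.SequenceMatcher` for name and address distance, the 0.85 similarity threshold, and the choice to treat "equal coordinates" and "under 32 m" as mutually exclusive are all assumptions. Only the point values and the 32 m limit come from the list.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt
from typing import Optional


@dataclass
class Record:
    """Hypothetical record layout holding the identifying fields."""
    name: str = ""
    address: str = ""  # normalized address
    lat: Optional[float] = None
    lng: Optional[float] = None
    nif: str = ""
    phone: str = ""
    email: str = ""


def haversine_m(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in metres between two points."""
    earth_radius_m = 6_371_000.0
    dlat, dlng = radians(lat2 - lat1), radians(lng2 - lng1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy string match; the threshold is an assumed value."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return False  # missing values never earn points
    return SequenceMatcher(None, a, b).ratio() >= threshold


def same(a: str, b: str) -> bool:
    """Exact match on non-empty values, ignoring case and outer spaces."""
    a, b = a.strip().lower(), b.strip().lower()
    return bool(a) and a == b


def score(a: Record, b: Record) -> int:
    """Sum the points of every criterion the pair satisfies."""
    points = 0
    if None not in (a.lat, a.lng, b.lat, b.lng):
        if (a.lat, a.lng) == (b.lat, b.lng):
            points += 10  # equal geo-coordinates
        elif haversine_m(a.lat, a.lng, b.lat, b.lng) < 32:
            points += 3   # distance under 32 m
    if similar(a.name, b.name):
        points += 3
    if similar(a.address, b.address):
        points += 3
    if same(a.nif, b.nif):
        points += 4
    if same(a.phone, b.phone):
        points += 3
    if same(a.email, b.email):
        points += 3
    return points
```

Under these assumptions, two records that agree on every field reach the maximum of 26 points, while a pair that shares only a nearby location scores 3. Summing independent signals lets a pair with a wrong or missing NIF still surface as a candidate when coordinates, name and phone agree.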

Although this step identifies a high percentage of records as identical or highly likely matches, too many records still remain for manual review. We prepare a new file with these records and this time use the Google Places service. Its cost per record is higher, but it returns more detail, which helps us refine the initial score:

  • Location details: Precise addresses, GPS coordinates, location on maps, and photos of specific locations.
  • Hours: Opening and closing times, special hours (if any), and days of operation.
  • Opinions and reviews: User comments about the location, ratings, and personal experiences that can help other users make informed decisions.
  • Additional information: The type of business (restaurant, hotel, museum, etc.), contact numbers, websites, and any other relevant details about that specific place.

Output

In the end, we have identified the matching pharmacies in an initial file of about 100,000 records. The sales team only has to review and decide on fewer than 2% of the records.
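How scored records might be split between automatic matches and the manual review queue can be sketched as follows. The function names and both thresholds (`match_at`, `review_at`) are hypothetical values chosen for illustration; the source does not state the cut-offs that were actually used.

```python
from collections import Counter
from typing import Iterable


def triage(score: int, match_at: int = 13, review_at: int = 6) -> str:
    """Classify one score; both thresholds are assumed values."""
    if score >= match_at:
        return "match"      # accepted automatically as the same business
    if score >= review_at:
        return "review"     # sent to the sales team for a final decision
    return "no_match"       # treated as a different business


def review_share(scores: Iterable[int]) -> float:
    """Fraction of records that end up in the manual review queue."""
    labels = Counter(triage(s) for s in scores)
    total = sum(labels.values())
    return labels["review"] / total if total else 0.0
```

Tracking the review share after each refinement pass shows whether the extra cost of a richer service is reducing the manual workload.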