The Challenges of Cleaning, De-Duping and Cross-Referencing Legal Entity Databases

When working with multiple sources of legal entity data (data vendors, exchanges, regulators, rating agencies and LOUs), it can be very challenging to clean, de-dupe and cross-reference them all.

Lack of Unique Identifiers

Matching records from different data sources is particularly challenging when the sources share no unique identifier. Even with a clean, well-formatted set of information, there is plenty of work to be done to match up each record accurately and precisely. For example, while the IRS provides a free download of all the entities with a GIIN (Global Intermediary Identification Number), that list alone is not enough to match an entity with a GIIN to the corresponding entity in your database. Not only does the GIIN not link to any proprietary identifiers, it does not link to any public identifiers either, such as the LEI (Legal Entity Identifier) or CIK (Central Index Key).

Lack of Consistency

Even in the absence of a consistent legal entity identifier, name matching would be a fairly simple task if all entities were listed by their formal legal name. In reality, matching entities based on the names provided is particularly difficult. Names are frequently recorded “as entered” by the relationship manager, customer or counterparty, so the recorded name often differs from the registered one. Some entities may be recorded using their local name, some with abbreviations, others with old names. Furthermore, many databases have entries in the name field that are “overloaded” with geographical and other modifiers, producing a concatenation of disparate information, e.g., “BlackRock MultiAsset Portfolio III (Exclusively for Qualified Institutional Investors with ReSale Restrictions for the Japanese Investors).” Without any normalization or validation to match up the entity with its formal legal name, there is no consistent name to compare to existing counterparties.
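As a rough illustration of the kind of normalization described above, the following Python sketch reduces an "as entered" name to a simplified comparison key. The function name normalize_name and the short LEGAL_FORMS list are hypothetical and exist only for this example; a production list of legal forms would be much longer and jurisdiction-specific, and dropping parenthetical text is an assumption that will not suit every dataset.

```python
import re

# Hypothetical, deliberately short list of legal-form suffixes (illustration only).
LEGAL_FORMS = {"inc", "incorporated", "corp", "corporation",
               "ltd", "limited", "llc", "plc", "ag", "gmbh"}

def normalize_name(raw: str) -> str:
    """Return a simplified comparison key for an entity name."""
    # Assumption: parenthetical text is a modifier, not part of the legal name.
    name = re.sub(r"\([^)]*\)", " ", raw)
    # Case-fold, then replace punctuation with spaces.
    name = re.sub(r"[^\w\s]", " ", name.casefold())
    tokens = name.split()
    # Strip trailing legal-form tokens such as "inc" or "ltd".
    while tokens and tokens[-1] in LEGAL_FORMS:
        tokens.pop()
    return " ".join(tokens)
```

With this sketch, "Acme Holdings, Inc." and "ACME HOLDINGS INCORPORATED" (made-up names) both reduce to the same key, and the overloaded BlackRock entry quoted above loses its parenthetical modifier. A key like this is only a starting point for comparison; it does not validate the name against a register.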

Regional and Language Variations

Name matching gets even more complicated once you account for name variations based on language. In the IRS GIIN database, for example, entity names for countries whose languages are written in the Latin alphabet are mostly given in the native language: French in France, German in Germany and Turkish in Turkey (e.g. BNP Paribas Obli Revenus). By contrast, the information about Chinese, Greek and Saudi Arabian entities is generally presented in English (e.g. Zhejiang Yongkang Rural Cooperative Bank). Russian and Ukrainian entities fall somewhere in between: they are written mostly in English, though proper nouns and other non-translatable words are in the native language (e.g. PrJSC IC PZU Ukraine Life Insurance).
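One small, partial mitigation for Latin-alphabet names is to fold accented characters to their base letters before comparing, so that a name stored with diacritics in one source and without them in another still lines up. The sketch below uses Python's standard unicodedata module; the function name fold_diacritics and the small SPECIAL_CASES table are assumptions made for this example. It does not translate names or transliterate non-Latin scripts.

```python
import unicodedata

# Letters with no Unicode decomposition need an explicit mapping.
# This table is illustrative and far from complete.
SPECIAL_CASES = str.maketrans({"ı": "i", "ß": "ss", "ø": "o", "ł": "l"})

def fold_diacritics(text: str) -> str:
    """Replace accented Latin letters with their unaccented base letters."""
    text = text.translate(SPECIAL_CASES)
    # NFKD splits a letter such as "é" into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

For instance, fold_diacritics("Société Générale") yields "Societe Generale". Names that have already been translated into English, as with the Chinese example above, cannot be reconciled this way and need a separate alias or translation lookup.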

Limited Attribute Coverage

When the name alone is not enough to support good-quality record matching, the absence of other data attributes, such as registration address, city, state or province, further reduces confidence in any match. If you are working with the GIIN dataset, all the IRS provides is the name and country.
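To make the constraint concrete, here is a minimal Python sketch of matching when name and country are the only attributes available. It restricts candidates to the same country and scores name similarity with the standard library's difflib.SequenceMatcher. The function best_match, the (entity_id, name, country) tuple layout, the use of two-letter country codes and the 0.85 threshold are all assumptions chosen for illustration, not values taken from any particular dataset.

```python
from difflib import SequenceMatcher

def best_match(name, country, candidates, threshold=0.85):
    """Return (entity_id, score) for the best same-country name match.

    candidates: iterable of (entity_id, name, country) tuples (hypothetical layout).
    Returns (None, score) when no candidate reaches the threshold.
    """
    best_id, best_score = None, 0.0
    for entity_id, cand_name, cand_country in candidates:
        # Country is the only attribute besides the name, so use it as a filter.
        if cand_country != country:
            continue
        score = SequenceMatcher(None, name.casefold(), cand_name.casefold()).ratio()
        if score > best_score:
            best_id, best_score = entity_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score
```

With so few attributes, two distinct entities with similar names in the same country are indistinguishable to a scorer like this, which is why matches near the threshold generally call for manual review.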