Legalese is defined as 'the specialized language of the legal profession'. Lawyers are familiar with the specific terms typically used in legal documents such as contracts, court decisions and opinions. When contracts are processed with NLP techniques, the models used should likewise be familiar with these domain-specific terms and the way they are used.

One way to achieve this is to train a Word2Vec model on a large corpus of legal documents. The Word2Vec algorithm assigns a vector, typically with 100 to 300 dimensions, to each individual word in the corpus. The remarkable finding is that words used in similar contexts tend to receive similar vectors, meaning that their vectors lie close to each other. The importance of the context in which words occur had already been described by J.R. Firth:

You shall know a word by the company it keeps

John Rupert Firth in 1957
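
What "close to each other" means for word vectors is commonly measured with cosine similarity. The sketch below illustrates that measure in plain Python. The three-dimensional vectors and the words in `toy_vectors` are invented for illustration and do not come from the trained model, which would supply vectors with 100 to 300 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, made up for this example only.
toy_vectors = {
    "contract": [0.9, 0.8, 0.1],
    "agreement": [0.8, 0.9, 0.2],
    "bicycle": [0.1, 0.2, 0.9],
}

# Words used in similar contexts should score closer to 1.0.
sim_related = cosine_similarity(toy_vectors["contract"], toy_vectors["agreement"])
sim_unrelated = cosine_similarity(toy_vectors["contract"], toy_vectors["bicycle"])

print(f"contract vs agreement: {sim_related:.3f}")
print(f"contract vs bicycle:   {sim_unrelated:.3f}")
```

In this toy setup the pair "contract" and "agreement" scores much higher than "contract" and "bicycle", which is the pattern one would hope to see for terms that share a legal context.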

Because each word has been assigned its own vector, it is possible to perform calculations with words. The most famous calculation:

King - Man + Woman = Queen
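In this calculation, King and Woman are positive input and Man is negative input. When you specify positive and/or negative input, the vectors of the positive words are added up and the vectors of the negative words are subtracted. The words whose vectors lie closest to the outcome are returned as the result, which is in many cases appropriate and sometimes even spectacular.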

A Word2Vec model has been trained on a large corpus of Dutch legal documents using the well-known Gensim library. If you want to find similarities based on positive and negative input, try the model! Because the model was trained on Dutch text, your input should be in Dutch.

Would you like to know more? Please reach out!

Find similarities »