Specifically, spaCy has an entity label called LAW, which I was very interested in (see the short example below).

Our annotation process was the following:

- Decide which kinds of entities we want (in our case, similar to those of another project),
- Train a model on different data with similar entities,
- Apply the model to 500 randomly chosen decisions to pre-annotate them,
- Upload the decisions and their pre-annotations to the tagging platform,
- Generate a report showing all errors, i.e. discrepancies between the tags and the predictions of the model (see the sketch below),
- Manually identify patterns among the errors (some are annotation errors, some are prediction errors),
- Manually fix the annotations in the tagging platform, following the manually discovered patterns.

English only (cased and uncased), available in different sizes (meaning different flavors of slowness).

It is very likely that these decisions, which are centralized and available as scanned PDFs, will be among the first to be added to the future open data database of legal cases maintained by our Supreme Court⁷.

In those cases, if we check the French company registry, the corporate form will appear inside the commercial name.

Another choice we considered was doccano: we tested 3 paid SaaS tagging platforms, and 2 of them had buggy APIs and a broken UX.

Other attempts at exclusive access to legal decisions followed and failed⁶.

Here and there, it also appeared that out-of-the-box performance on small languages is not that high¹³. It again shows the importance of the open-source ecosystem, because all the tests below (except spaCy) have been performed by changing a single line of code, all the libraries being able to talk to each other… wonderful!

It can't be shared, as that would violate the GDPR.

They use SentencePiece to analyze the multilingual dataset and find significant subwords, meaning that a word can be split into several subwords (see the tokenizer sketch below). We fine-tuned the bert-base-multilingual-cased model for 78 epochs in less than a day.

I want explicit dates, such as "24 October 2018". We won't provide more details on that part, as it is out of the scope of this article.

It is indeed an interesting shift in the legal publishing industry, one that should perhaps be considered properly while everyone around fantasizes about legal bots and digital labor…

The human part is one of the least documented and yet most important parts of the project. Through this project and some other machine-learning-related ones, the role of the data team is evolving into something new: creating data for algorithms and auditing algorithm outputs (so far, their role was more about providing support on editorial products).

Moreover, we think that with a larger dataset for pre-training (or a fine-tuning step on in-domain data), the results may be even higher, as found by our Supreme Court in its NER paper. Everyone can freely access them without having to pay anything to anyone.

And then came the MultiFiT paper¹⁴, showing that the performance of mBERT can be significantly beaten by a smart use of monolingual models¹⁵.

Entrepreneurs needed access to commercial decisions; that is why the clerks' consortium invested in infrastructure to scan, centralize, and distribute all legal decisions as scanned PDFs. It is a paid service, priced to cover the costs. In this context, the clerks' association tries to improve things for everybody.

If you have to choose an annotation tool, my advice would be to always keep in mind that it is basically a make-or-buy decision, with all the nuances in between. Therefore, connecting them with DKPro Core requires a middleware tool.
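As a quick illustration of the LAW label mentioned at the top of this section, here is a minimal sketch assuming the pretrained en_core_web_sm English model, which uses the OntoNotes label set (whether a given span is actually caught depends on the model and the text):

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("The court relied on the Treaty on the Functioning of the "
          "European Union and on the Commercial Code.")

# LAW is one of the labels the pretrained English pipelines can emit,
# alongside PERSON, ORG, DATE, etc.
for ent in doc.ents:
    print(ent.text, ent.label_)
```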
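The error report mentioned in the annotation workflow above boils down to comparing manual tags with model predictions span by span; here is a minimal sketch, with offsets and labels entirely hypothetical:

```python
def discrepancies(gold, predicted):
    """Spans present in one set but not the other, i.e. candidate errors
    to review manually (either annotation or prediction mistakes)."""
    gold, predicted = set(gold), set(predicted)
    return {"missed_by_model": gold - predicted,
            "spurious_predictions": predicted - gold}

# Spans encoded as (start_offset, end_offset, label); values are made up.
manual_tags = {(0, 12, "PERS"), (20, 35, "ORG")}
model_preds = {(0, 12, "PERS"), (40, 50, "DATE")}
print(discrepancies(manual_tags, model_preds))
```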
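To make the subword splitting concrete, here is a small sketch using the Hugging Face transformers tokenizer shipped with bert-base-multilingual-cased; the example word is arbitrary, and the exact split depends on the learned vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# A word absent from the vocabulary is split into several subword pieces;
# continuation pieces are prefixed with "##".
print(tokenizer.tokenize("anonymisation"))
```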
As explained below, we are also helping several European administrations anonymize their legal decisions, to push open data in justice forward.

An automatic analysis would be similar to applying an active learning approach, which is known to have many limitations in real-life setups⁹ (see the toy sketch below).

Can we teach machines to read and understand human language?

As shown previously, the spaCy model seems limited compared to the performance of today's large models.
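For context, the core ingredient of most active learning setups is uncertainty sampling: send the examples the model is least confident about to annotators first. A toy sketch, with all names and scores hypothetical:

```python
def least_confident(scored_docs, k):
    """Return the k documents with the lowest model confidence, i.e. the
    ones an active learning loop would route to human annotators first."""
    return sorted(scored_docs, key=lambda doc: doc[1])[:k]

# Hypothetical (document id, confidence of the model's best prediction) pairs.
scores = [("decision-001", 0.99), ("decision-002", 0.41), ("decision-003", 0.73)]
print(least_confident(scores, k=2))  # decision-002 and decision-003 first
```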