
Announcing the Patent Phrase Similarity Dataset


Patent documents typically use legal and highly technical language, with context-dependent terms that may have meanings quite different from colloquial usage and even between different documents. The process of using traditional patent search methods (e.g., keyword searching) to search through the corpus of over one hundred million patent documents can be tedious and result in many missed results due to the broad and non-standard language used. For example, a "soccer ball" may be described as a "spherical recreation device", "inflatable sportsball", or "ball for ball game". Additionally, the language used in some patent documents may obfuscate terms to their advantage, so more powerful natural language processing (NLP) and semantic similarity understanding can give everyone access to do a thorough search.

The patent domain (and more general technical literature like scientific publications) poses unique challenges for NLP modeling due to its use of legal and technical terms. While there are several commonly used general-purpose semantic textual similarity (STS) benchmark datasets (e.g., STS-B, SICK, MRPC, PIT), to the best of our knowledge, there are currently no datasets focused on technical concepts found in patents and scientific publications (the somewhat related BioASQ challenge contains a biomedical question answering task). Moreover, with the continuing growth in size of the patent corpus (millions of new patents are issued worldwide every year), there is a need to develop more useful NLP models for this domain.

Today, we announce the release of the Patent Phrase Similarity dataset, a new human-rated contextual phrase-to-phrase semantic matching dataset, and the accompanying paper, presented at the SIGIR PatentSemTech Workshop, which focuses on technical terms from patents. The Patent Phrase Similarity dataset contains ~50,000 rated phrase pairs, each with a Cooperative Patent Classification (CPC) class as context. In addition to similarity scores that are typically included in other benchmark datasets, we include granular rating classes similar to WordNet, such as synonym, antonym, hypernym, hyponym, holonym, meronym, and domain related. This dataset (distributed under the Creative Commons Attribution 4.0 International license) was used by Kaggle and USPTO as the benchmark dataset in the U.S. Patent Phrase to Phrase Matching competition to draw more attention to the performance of machine learning models on technical text. Initial results show that models fine-tuned on this new dataset perform substantially better than general pre-trained models without fine-tuning.

The Patent Phrase Similarity Dataset

To better train the next generation of state-of-the-art models, we created the Patent Phrase Similarity dataset, which includes many examples to address the following problems: (1) phrase disambiguation, (2) adversarial keyword matching, and (3) hard negative keywords (i.e., keywords that are unrelated but received a high similarity score from other models). Some keywords and phrases can have multiple meanings (e.g., the phrase "mouse" may refer to an animal or a computer input device), so we disambiguate the phrases by including CPC classes with each pair of phrases. Also, many NLP models (e.g., bag of words models) will not do well on data with phrases that have matching keywords but are otherwise unrelated (adversarial keywords, e.g., "container section" → "kitchen container", "offset table" → "table fan"). The Patent Phrase Similarity dataset is designed to include many examples of matching keywords that are unrelated through adversarial keyword match, enabling NLP models to improve their performance.

Each entry in the Patent Phrase Similarity dataset contains two phrases, an anchor and a target, a context CPC class, a rating class, and a similarity score. The dataset contains 48,548 entries with 973 unique anchors, split into training (75%), validation (5%), and test (20%) sets. When splitting the data, all entries with the same anchor are kept together in the same set. There are 106 different context CPC classes, and all of them are represented in the training set.

Anchor             Target               Context   Rating           Score
acid absorption    absorption of acid   B08       exact            1.0
acid absorption    acid immersion       B08       synonym          0.75
acid absorption    chemically soaked    B08       domain related   0.25
acid absorption    acid reflux          B08       not related      0.0
gasoline blend     petrol blend         C10       synonym          0.75
gasoline blend     fuel blend           C10       hypernym         0.5
gasoline blend     fruit blend          C10       not related      0.0
faucet assembly    water tap            A22       hyponym          0.5
faucet assembly    water supply         A22       holonym          0.25
faucet assembly    school assembly      A22       not related      0.0

A small sample of the dataset with anchor and target phrases, context CPC class (B08: Cleaning, C10: Petroleum, gas, fuel, lubricants, A22: Butchering, processing meat/poultry/fish), a rating class, and a similarity score.
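
For readers who want to explore the data, the sketch below shows one way to load the dataset and reproduce an anchor-grouped split. The file name and column names are assumptions based on the description above, not the official schema, so check the released files before using them.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical file name; see the dataset release for the actual files/schema.
df = pd.read_csv("patent_phrase_similarity.csv")

# Keep all rows that share an anchor in the same split, mirroring the rule
# described above (75% train here; the held-out 25% could be further divided
# into validation and test).
splitter = GroupShuffleSplit(n_splits=1, train_size=0.75, random_state=0)
train_idx, heldout_idx = next(splitter.split(df, groups=df["anchor"]))
train_df, heldout_df = df.iloc[train_idx], df.iloc[heldout_idx]

# Sanity check: no anchor appears in both splits.
assert set(train_df["anchor"]).isdisjoint(set(heldout_df["anchor"]))
```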

Generating the Dataset

To generate the Patent Phrase Similarity data, we first process the ~140 million patent documents in the Google Patents corpus and automatically extract important English phrases, which are typically noun phrases (e.g., "fastener", "lifting assembly") and functional phrases (e.g., "food processing", "ink printing"). Next, we filter and keep phrases that appear in at least 100 patents, and randomly sample around 1,000 of these filtered phrases, which we call anchor phrases. For each anchor phrase, we find all of the matching patents and all of the CPC classes for those patents. We then randomly sample up to four matching CPC classes, which become the context CPC classes for the specific anchor phrase.
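
The filtering and context-sampling steps above are simple to express in code. A minimal sketch, assuming precomputed mappings from phrases to patents and from patents to CPC classes (both hypothetical inputs, not part of the released dataset):

```python
import random

def sample_anchor_contexts(phrase_to_patents, patent_to_cpc,
                           min_patents=100, n_anchors=1000,
                           max_contexts=4, seed=0):
    """Pick anchor phrases and up to `max_contexts` CPC context classes each."""
    rng = random.Random(seed)
    # Keep only phrases that appear in at least `min_patents` patents.
    frequent = [p for p, patents in phrase_to_patents.items()
                if len(patents) >= min_patents]
    # Randomly sample the anchor phrases from the filtered set.
    anchors = rng.sample(frequent, min(n_anchors, len(frequent)))
    contexts = {}
    for anchor in anchors:
        # All CPC classes of all patents matching this anchor...
        cpcs = sorted({c for pat in phrase_to_patents[anchor]
                       for c in patent_to_cpc[pat]})
        # ...of which up to `max_contexts` become the anchor's context classes.
        contexts[anchor] = rng.sample(cpcs, min(max_contexts, len(cpcs)))
    return contexts
```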

We use two different methods for pre-generating target phrases: (1) partial matching and (2) a masked language model (MLM). For partial matching, we randomly select phrases from the entire corpus that partially match with the anchor phrase (e.g., "abatement" → "noise abatement", "material formation" → "formation material"). For MLM, we select sentences from the patents that contain a given anchor phrase, mask it out, and use the Patent-BERT model to predict candidates for the masked portion of the text. Then, all of the phrases are cleaned up, which includes lowercasing and the removal of punctuation and certain stopwords (e.g., "and", "or", "said"), and sent to expert raters for review. Each phrase pair is rated independently by two raters skilled in the technology area. Each rater also generates new target phrases with different ratings. Specifically, they are asked to generate some low-similarity and unrelated targets that partially match with the original anchor, and/or some high-similarity targets. Finally, the raters meet to discuss their ratings and come up with the final ratings.
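
As a rough illustration of the MLM step, the sketch below uses the Hugging Face fill-mask pipeline with a generic checkpoint; the model name, the single-token mask, and the toy sentence are placeholders rather than the exact Patent-BERT setup used to build the dataset.

```python
from transformers import pipeline

# Generic checkpoint standing in for Patent-BERT; any BERT-style masked
# language model exposes the same fill-mask interface.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# A patent-style sentence containing an anchor phrase ("noise abatement"),
# with the span to replace masked out.
sentence = "The noise [MASK] device is mounted adjacent to the engine."
for candidate in fill_mask(sentence, top_k=5):
    print(candidate["token_str"], round(candidate["score"], 3))
```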

Dataset Evaluation

To evaluate its usefulness, the Patent Phrase Similarity dataset was used in the U.S. Patent Phrase to Phrase Matching Kaggle competition. The competition was very popular, drawing about 2,000 competitors from around the world. A variety of approaches were successfully used by the top scoring teams, including ensemble models of BERT variants and prompting (see the full discussion for more details). The table below shows the best results from the competition, as well as several off-the-shelf baselines from our paper. The Pearson correlation metric was used to measure the linear correlation between the predicted and true scores, which is a helpful metric to target for downstream models so they can distinguish between different similarity ratings.

The baselines in the paper can be considered zero-shot in the sense that they use off-the-shelf models without any further fine-tuning on the new dataset (we use these models to embed the anchor and target phrases separately and compute the cosine similarity between them). The Kaggle competition results demonstrate that by using our training data, one can achieve significant improvements compared with existing NLP models. We have also estimated human performance on this task by comparing a single rater's scores to the combined score of both raters. The results indicate that this is not a particularly easy task, even for human experts.

Model                        Training     Pearson correlation
word2vec                     Zero-shot    0.44
Patent-BERT                  Zero-shot    0.53
Sentence-BERT                Zero-shot    0.60
Kaggle 1st place single      Fine-tuned   0.87
Kaggle 1st place ensemble    Fine-tuned   0.88
Human                        —            0.93

Performance of popular models with no fine-tuning (zero-shot), models fine-tuned on the Patent Phrase Similarity dataset as part of the Kaggle competition, and single human performance.
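
The zero-shot baseline setup is straightforward to reproduce. A minimal sketch, assuming the sentence-transformers library and a generic checkpoint rather than the exact models benchmarked above:

```python
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer, util

# Generic checkpoint as a stand-in for the zero-shot baselines in the table.
model = SentenceTransformer("all-MiniLM-L6-v2")

# A few pairs from the sample table above, with their true similarity scores.
anchors = ["acid absorption", "acid absorption", "acid absorption", "gasoline blend"]
targets = ["absorption of acid", "acid immersion", "acid reflux", "fruit blend"]
gold = [1.0, 0.75, 0.0, 0.0]

# Embed anchor and target phrases separately, then score each pair by the
# cosine similarity of its embeddings (the zero-shot recipe described above).
anchor_emb = model.encode(anchors, convert_to_tensor=True)
target_emb = model.encode(targets, convert_to_tensor=True)
preds = util.cos_sim(anchor_emb, target_emb).diagonal().tolist()

# Pearson correlation between predicted and true scores, the competition metric.
print(pearsonr(preds, gold))
```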

Conclusion and Future Work

We present the Patent Phrase Similarity dataset, which was used as the benchmark dataset in the U.S. Patent Phrase to Phrase Matching competition, and demonstrate that by using our training data, one can achieve significant improvements over existing NLP models.

More challenging machine learning benchmarks can be generated from the patent corpus, and patent data has made its way into many of today's most-studied models. For example, the C4 text dataset used to train T5 contains many patent documents. The BigBird and LongT5 models also use patents via the BIGPATENT dataset. The availability, breadth, and open usage terms of full-text data (see Google Patents Public Datasets) make patents a unique resource for the research community. Possibilities for future tasks include massively multi-label classification, summarization, information retrieval, image-text similarity, citation graph prediction, and translation. See the paper for more details.

Acknowledgements

This work was possible through a collaboration with Kaggle, Satsyil Corp., USPTO, and MaxVal. Thanks to contributors Ian Wetherbee from Google, Will Cukierski and Maggie Demkin from Kaggle. Thanks to Jerry Ma, Scott Beliveau, and Jamie Holcombe from USPTO, and Suja Chittamahalingam from MaxVal for their contributions.


