Please use this identifier to cite or link to this item:
https://www.um.edu.mt/library/oar/handle/123456789/135519

| Title: | Automated detection of potential patent infringement |
| Authors: | Seracino, Jake (2024) |
| Keywords: | Information retrieval; Patent searching; Information storage and retrieval systems; Vector spaces |
| Issue Date: | 2024 |
| Citation: | Seracino, J. (2024). Automated detection of potential patent infringement (Master’s dissertation). |
| Abstract: | Due to potential legal actions and associated penalties, patent infringement can have significant financial ramifications for companies and individuals. Consequently, substantial human effort is invested in searching patent databases to identify relevant documents so that possible breaches of intellectual property are flagged. However, existing patent retrieval systems underperform on key metrics used to measure their effectiveness, creating a need for better solutions. This research addresses this gap by investigating various aspects of patent retrieval to develop a single system that integrates the most effective approaches. We explore different means of query expansion for the domain of patent retrieval, how patent classification information can be leveraged during retrieval, and the benefit of different embedding models for patent semantic search, and we compare various re-ranking approaches to optimize the precision of the system. These aspects were evaluated on the CLEF-IP 2011 dataset using the metrics R@10, R@100, R@1000, MAP, PRES@10, PRES@100, PRES@1000 and query duration. Our experiments show that using WordNet and word embeddings for query expansion yields inferior results. Conversely, adding query terms from within the query patent improves the system’s overall performance. We also confirm the findings of prior literature that leveraging classification information during retrieval greatly improves the system’s results. Additionally, we found that IPC vector-based retrieval produces a better ranking of the retrieved set than filtering by classification codes, but the latter produces better overall recall. Another finding of this study is that ANNOY-based retrieval using embeddings produced by PatentSBERTa matched the performance of keyword-based retrieval and exceeded it on the R@1000 metric. On the other hand, vectors produced from averaged word embeddings and Doc2Vec fared much worse. Finally, we found that combining GCS, TF-IDF-based similarity and PatentSBERTa-based similarity to rank retrieved patents generally yields the best results, and that using such an approach with classification code-based filtering produced the best-performing system included in this study. All of these findings were found to be statistically significant using the one-tailed Wilcoxon Signed Rank Test with an alpha level of 0.05. |
| Description: | M.Sc.(Melit.) |
| URI: | https://www.um.edu.mt/library/oar/handle/123456789/135519 |
| Appears in Collections: | Dissertations - FacICT - 2024; Dissertations - FacICTAI - 2024 |
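
The abstract mentions re-ranking by combining several similarity signals and evaluating with recall at a cutoff (R@k) and MAP. The record does not say how the signals are combined, so the sketch below is only an illustration under stated assumptions: it assumes a weighted sum of min-max normalised scores, and the signal names (`tfidf`, `sbert`), the weights, and the document identifiers are all hypothetical. The metric functions follow the standard definitions of recall at k and average precision (MAP being the mean of average precision over all queries).

```python
from typing import Dict, List, Set


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score set maps to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse_and_rank(signals: Dict[str, Dict[str, float]],
                  weights: Dict[str, float]) -> List[str]:
    """Rank documents by a weighted sum of normalised signals.

    The weighted-sum fusion is an assumption for illustration only;
    it is not taken from the dissertation.
    """
    fused: Dict[str, float] = {}
    for name, scores in signals.items():
        for doc, s in min_max(scores).items():
            fused[doc] = fused.get(doc, 0.0) + weights[name] * s
    # Highest fused score first; ties broken by document id.
    return sorted(fused, key=lambda d: (-fused[d], d))


def recall_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & relevant) / len(relevant)


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of precision values at each rank where a relevant doc occurs."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Toy data: hypothetical scores for three candidate patents.
signals = {
    "tfidf": {"EP1": 0.9, "EP2": 0.4, "EP3": 0.1},
    "sbert": {"EP1": 0.2, "EP2": 0.8, "EP3": 0.3},
}
weights = {"tfidf": 0.5, "sbert": 0.5}
ranking = fuse_and_rank(signals, weights)
relevant = {"EP1"}
print(ranking, recall_at_k(ranking, relevant, 2), average_precision(ranking, relevant))
```

Normalising each signal before summing keeps one score scale from dominating the others; other fusion schemes (for example rank-based ones) would slot into `fuse_and_rank` without changing the metric functions.
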
Files in This Item:
| File | Description | Size | Format | |
|---|---|---|---|---|
| 2419ICTICS520005062026_1.PDF | Restricted Access | 1.79 MB | Adobe PDF | |
Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.
