{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:10:57Z","timestamp":1750219857189,"version":"3.41.0"},"reference-count":4,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2022,6,1]],"date-time":"2022-06-01T00:00:00Z","timestamp":1654041600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["SIGIR Forum"],"published-print":{"date-parts":[[2022,6]]},"abstract":"<jats:p>\n            This thesis addresses the problem of improving text spotting systems, which aim to detect and recognize text in unrestricted images (e.g., a street sign, an advertisement, a bus destination, etc.). The goal is to improve the performance of off-the-shelf vision systems by exploiting the semantic information derived from the image itself. The rationale is that knowing the content of the image or the visual context can help to decide which words are the correct candidate words. 
For example, the fact that an image shows a coffee shop makes it more likely that a word on a signboard reads as\n            <jats:italic>Dunkin<\/jats:italic>\n            and not\n            <jats:italic>unkind.<\/jats:italic>\n          <\/jats:p>\n          <jats:p>We address this problem by drawing on successful developments in natural language processing and machine learning, in particular learning to re-rank and neural networks, to present post-processing frameworks that improve state-of-the-art text spotting systems without the need for costly data-driven re-training or tuning procedures.<\/jats:p>\n          <jats:p>\n            Discovering the degree of semantic relatedness between candidate words and their image context is a task related to assessing the semantic similarity between words or text fragments. However, semantic relatedness is more general than similarity (e.g.,\n            <jats:italic>car, road<\/jats:italic>\n            , and\n            <jats:italic>traffic light<\/jats:italic>\n            are related but not similar) and requires certain adaptations. To meet the requirements of this broader notion of semantic relatedness, we develop two approaches to learn the semantic relatedness of the spotted word and its environmental context: word-to-word (object) and word-to-sentence (caption). In the word-to-word approach, word-embedding-based re-rankers are developed. The re-ranker takes the words from the text spotting baseline and re-ranks them based on the visual context from the object classifier. 
For the second, an end-to-end neural approach is designed that exploits the image description (caption) at the sentence level as well as the word level (objects) and re-ranks the candidate words based not only on the visual context but also on the co-occurrence between them.\n          <\/jats:p>\n          <jats:p>\n            As an additional contribution, to meet the requirements of data-driven approaches such as neural networks, we propose a visual context dataset for this task, in which the publicly available COCO-text dataset\n            <jats:sup>1<\/jats:sup>\n            has been extended with information about the scene (including the objects and places appearing in the image) to enable researchers to include the semantic relations between text and scene in their text spotting systems, and to offer a common evaluation baseline for such approaches.\n          <\/jats:p>\n          <jats:p>\n            <jats:bold>Awarded by:<\/jats:bold>\n            Universitat Polit\u00e8cnica de Catalunya, Barcelona, Spain, on 10 September 2020.\n          <\/jats:p>\n          <jats:p>\n            <jats:bold>Supervised by:<\/jats:bold>\n            Llu\u00eds Padr\u00f3 and Francesc Moreno-Noguer.\n          <\/jats:p>\n          <jats:p>\n            <jats:bold>Available at:<\/jats:bold>\n            https:\/\/upcommons.upc.edu\/handle\/2117\/334952.\n          <\/jats:p>","DOI":"10.1145\/3582524.3582542","type":"journal-article","created":{"date-parts":[[2023,1,27]],"date-time":"2023-01-27T17:06:33Z","timestamp":1674839193000},"page":"1-2","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Enhancing Scene Text Recognition with Visual Context Information"],"prefix":"10.1145","volume":"56","author":[{"given":"Ahmed","family":"Sabir","sequence":"first","affiliation":[{"name":"Universitat Polit\u00e8cnica de Catalunya, Barcelona, 
Spain"}]}],"member":"320","published-online":{"date-parts":[[2023,1,27]]},"reference":[{"key":"e_1_2_1_1_1","first-page":"271","volume-title":"Artificial Intelligence Research and Development","author":"Sabir Ahmed","year":"2018","unstructured":"Ahmed Sabir, Francesc Moreno-Noguer, and Llu\u00eds Padr\u00f3. Enhancing text spotting with a language model and visual context information. In Artificial Intelligence Research and Development, pages 271--280. IOS Press, 2018a."},{"key":"e_1_2_1_2_1","first-page":"68","volume-title":"Asian Conference on Computer Vision","author":"Sabir Ahmed","year":"2018","unstructured":"Ahmed Sabir, Francesc Moreno-Noguer, and Llu\u00eds Padr\u00f3. Visual re-ranking with natural language understanding for text spotting. In Asian Conference on Computer Vision, pages 68--82. 
Springer, 2018b."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1346"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW50498.2020.00279"}],"container-title":["ACM SIGIR Forum"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3582524.3582542","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3582524.3582542","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:47:16Z","timestamp":1750178836000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3582524.3582542"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,6]]},"references-count":4,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2022,6]]}},"alternative-id":["10.1145\/3582524.3582542"],"URL":"https:\/\/doi.org\/10.1145\/3582524.3582542","relation":{},"ISSN":["0163-5840"],"issn-type":[{"type":"print","value":"0163-5840"}],"subject":[],"published":{"date-parts":[[2022,6]]},"assertion":[{"value":"2023-01-27","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}