{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,2]],"date-time":"2025-08-02T17:38:43Z","timestamp":1754156323999,"version":"3.41.2"},"reference-count":54,"publisher":"Emerald","issue":"1","license":[{"start":{"date-parts":[[2019,9,2]],"date-time":"2019-09-02T00:00:00Z","timestamp":1567382400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.emerald.com\/insight\/site-policies"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["JD"],"published-print":{"date-parts":[[2019,9,2]]},"abstract":"<jats:sec><jats:title content-type=\"abstract-subheading\">Purpose<\/jats:title><jats:p>The purpose of this paper is to present a language-agnostic approach to facilitate the discovery of \u201cparallel passages\u201d stored in historic and cultural heritage digital archives.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Design\/methodology\/approach<\/jats:title><jats:p>The authors explore a novel and relatively simple approach, using a character-based statistical language model combined with a tailored version of the Basic Local Alignment Search Tool to extract exact and approximate string patterns shared between groups of documents.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Findings<\/jats:title><jats:p>The approach is applicable to a wide range of languages, and compensates for variability in the text of the documents arising from differences in dialect, authorship and language change over time, and from transcription and optical character recognition errors introduced during digitisation.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Research limitations\/implications<\/jats:title><jats:p>A number of case studies demonstrate that the approach is practical and generalisable to a wide range of archives with documents in different languages, 
domains and of varying quality.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Practical implications<\/jats:title><jats:p>The approach described can be applied to any digital archive of modern and contemporary texts. This makes the approach applicable not only to digital archives recording historic texts, but also to those composed of more recent news articles, for example.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Social implications<\/jats:title><jats:p>The analysis of \u201cparallel passages\u201d enables researchers to quantify the presence and extent of text-reuse in a collection of documents, which can provide useful data on author style, text genres and cultural contexts.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Originality\/value<\/jats:title><jats:p>The approach is novel and addresses a need among humanities researchers for tools that can identify similar documents and local similarities represented by shared text sequences in a potentially vast archive of documents. 
As far as the authors are aware, no tools currently exist that provide the same level of tolerance to the language of the documents.<\/jats:p><\/jats:sec>","DOI":"10.1108\/jd-10-2018-0175","type":"journal-article","created":{"date-parts":[[2019,9,16]],"date-time":"2019-09-16T05:44:36Z","timestamp":1568612676000},"page":"271-289","source":"Crossref","is-referenced-by-count":2,"title":["Comparing \u201cparallel passages\u201d in digital archives"],"prefix":"10.1108","volume":"76","author":[{"given":"Martyn","family":"Harris","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mark","family":"Levene","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Dell","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Dan","family":"Levene","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"140","reference":[{"key":"key2020021910240234700_ref001","unstructured":"ABBILDUNGB (2018), \u201cAbbyy finereader 14\u201d, available at: www.abbyy.com\/en-gb\/finereader\/compare-documents\/ (accessed 16 March 2018)."},{"volume-title":"Mining Text Data","year":"2012","key":"key2020021910240234700_ref002"},{"issue":"2","key":"key2020021910240234700_ref003","first-page":"179","article-title":"A maximum likelihood approach to continuous speech recognition","volume":"5","year":"1983","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI"},{"issue":"11","key":"key2020021910240234700_ref004","first-page":"965","article-title":"A survey of practical algorithms for suffix tree construction in external memory","volume":"40","year":"2010","journal-title":"Software: Practice and Experience"},{"issue":"1","key":"key2020021910240234700_ref005","first-page":"1137","article-title":"A neural probabilistic language 
model","volume":"3","year":"2003","journal-title":"Journal of Machine Learning Research"},{"key":"key2020021910240234700_ref006","unstructured":"bib (2016), \u201cBibleworks \u2013 bible software\u201d, available at: www.bibleworks.com\/classroom\/1\/_10\/ (accessed 18 May 2016)."},{"issue":"2","key":"key2020021910240234700_ref007","first-page":"14","article-title":"Unsupervised detection and visualisation of textual reuse on ancient Greek texts","volume":"1","year":"2010","journal-title":"Journal of the Chicago Colloquium on Digital Humanities and Computer Science"},{"year":"1794","key":"key2020021910240234700_ref008","article-title":"A short account of the malignant fever lately prevalent in Philadelphia \u2026: to which are added, accounts of the plague in London and Marseilles; and a list of the dead, from August 1, to the middle of December, 1793"},{"issue":"4","key":"key2020021910240234700_ref009","doi-asserted-by":"crossref","first-page":"359","DOI":"10.1006\/csla.1999.0128","article-title":"An empirical study of smoothing techniques for language modeling","volume":"13","year":"1999","journal-title":"Computer Speech and Language"},{"key":"key2020021910240234700_ref010","unstructured":"chi (2016), \u201cChinese text project \u2013 parallel-passages\u201d, available at: http:\/\/ctext.org\/tools\/parallel-passages (accessed 18 May 2016)."},{"issue":"2","key":"key2020021910240234700_ref011","doi-asserted-by":"crossref","first-page":"383","DOI":"10.1353\/apa.2012.0010","article-title":"Intertextuality in the digital age","volume":"142","year":"2012","journal-title":"Transactions of the American Philological Association"},{"key":"key2020021910240234700_ref012","doi-asserted-by":"crossref","unstructured":"de Jong, M. 
(2007), \u201cIsaiah among the ancient near eastern prophets: a comparative study of the earliest stages of the Isaiah tradition and the neo-Assyrian prophecies\u201d, Supplements to the Vetus Testamentum, Book 117, Brill Academic Publishing, Leiden, pp. 1-399.","DOI":"10.1163\/ej.9789004161610.i-524"},{"key":"key2020021910240234700_ref013","unstructured":"dif (2018), \u201cDiffchecker\u201d, available at: www.diffchecker.com\/ (accessed 16 March 2018)."},{"key":"key2020021910240234700_ref014","unstructured":"dif (2016), \u201cDiff doc tool\u201d, available at: www.softinterface.com\/MD\/Document-Comparison-Software.htm (accessed 8 February 2016)."},{"issue":"7","key":"key2020021910240234700_ref015","doi-asserted-by":"crossref","first-page":"1858","DOI":"10.1109\/TIT.2003.813506","article-title":"A new metric for probability distributions","volume":"49","year":"2003","journal-title":"IEEE Transactions on Information Theory"},{"issue":"2","key":"key2020021910240234700_ref016","doi-asserted-by":"crossref","first-page":"193","DOI":"10.1177\/0957926592003002004","article-title":"Discourse and text: linguistic and intertextual analysis within discourse analysis","volume":"3","year":"1992","journal-title":"Discourse & Society"},{"issue":"1","key":"key2020021910240234700_ref017","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1177\/030751330308900104","article-title":"The middle kingdom offering formulas: a challenge","volume":"89","year":"2003","journal-title":"The Journal of Egyptian Archaeology"},{"article-title":"ETRAP (electronic text reuse acquisition project): a research group implementing the ehumanities A.C.I.D. 
paradigm","volume-title":"Digital Humanities Summit 2015","year":"2015","key":"key2020021910240234700_ref018"},{"volume-title":"Bayesian Methods: A Social and Behavioral Sciences Approach","year":"2014","key":"key2020021910240234700_ref019"},{"issue":"2","key":"key2020021910240234700_ref020","doi-asserted-by":"crossref","first-page":"95","DOI":"10.1086\/370475","article-title":"Aramaic dialect problems","volume":"52","year":"1936","journal-title":"The American Journal of Semitic Languages and Literatures"},{"volume-title":"Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology","year":"1997","key":"key2020021910240234700_ref021"},{"first-page":"165","article-title":"The anatomy of a search and mining system for digital humanities","year":"2014","key":"key2020021910240234700_ref022"},{"issue":"3","key":"key2020021910240234700_ref023","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3195727","article-title":"Finding \u2018parallel passages\u2019 in cultural heritage archives","volume":"11","year":"2018","journal-title":"Journal on Computing and Cultural Heritage"},{"issue":"3","key":"key2020021910240234700_ref024","first-page":"223","article-title":"Techniques of quotation in clement of Alexandria: a view of ancient literary working methods","volume":"50","year":"1996","journal-title":"Vigiliae Christianae"},{"first-page":"77","volume-title":"N-Gram Feature Selection for Authorship Identification","year":"2006","key":"key2020021910240234700_ref025"},{"key":"key2020021910240234700_ref026","doi-asserted-by":"crossref","unstructured":"Kanaris, I., Kanaris, K. and Stamatatos, E. (2006), \u201cSpam detection using character n-grams\u201d, in Antoniou, G., Potamias, G., Spyropoulos, C. and Plexousakis, D. (Eds), Advances in Artificial Intelligence, Springer, Berlin and Heidelberg, pp. 
95-104.","DOI":"10.1007\/11752912_12"},{"first-page":"2741","article-title":"Character-aware neural language models","year":"2015","key":"key2020021910240234700_ref027"},{"first-page":"180","article-title":"Named entity recognition with character-level models","year":"2003","key":"key2020021910240234700_ref028"},{"first-page":"472","article-title":"A computational model of text reuse in ancient literary texts","year":"2007","key":"key2020021910240234700_ref029"},{"volume-title":"Curse Or Blessing: What\u2019s in the Magic Bowl?","year":"2002","key":"key2020021910240234700_ref030"},{"edition":"2nd ed.","volume-title":"An Introduction to Search Engines and Web Navigation","year":"2010","key":"key2020021910240234700_ref031"},{"issue":"8","key":"key2020021910240234700_ref032","first-page":"707","article-title":"Binary codes capable of correcting deletions, insertions and reversals","volume":"10","year":"1966","journal-title":"Soviet Physics Doklady"},{"key":"key2020021910240234700_ref033","unstructured":"log (2016), \u201cLogos bible software series x tour: parallel passages\u201d, available at: www.logos.com\/media\/tour\/ParallelPassages.htm (accessed 18 May 2016)."},{"issue":"1-2","key":"key2020021910240234700_ref036","first-page":"73","article-title":"Character n-gram tokenization for European language text retrieval","volume":"7","year":"2004","journal-title":"Information Retrieval"},{"key":"key2020021910240234700_ref034","doi-asserted-by":"crossref","unstructured":"Ma, J. and Zhang, L. (2010), \u201cModern blast programs\u201d, in Heath, L. and Ramakrishnan, N. (Eds), Problem Solving Handbook in Computational Biology and Bioinformatics, Springer, Boston, MA, pp. 
3-19.","DOI":"10.1007\/978-0-387-09760-2_1"},{"issue":"3","key":"key2020021910240234700_ref035","doi-asserted-by":"crossref","first-page":"341","DOI":"10.1162\/coli_a_00002","article-title":"Generating phrasal and sentential paraphrases: a survey of data-driven methods","volume":"36","year":"2010","journal-title":"Computational Linguistics"},{"first-page":"746","article-title":"Linguistic regularities in continuous space word representations","year":"2013","key":"key2020021910240234700_ref037"},{"issue":"2","key":"key2020021910240234700_ref038","doi-asserted-by":"crossref","first-page":"209","DOI":"10.1093\/llc\/18.2.209","article-title":"What is text analysis, really?","volume":"18","year":"2003","journal-title":"Literary and Linguistic Computing"},{"key":"key2020021910240234700_ref039","unstructured":"Rommel, T. (2007), \u201cLiterary studies\u201d, A Companion to Digital Humanities, in Siemens, R. and Schreibman, S. (Eds), Blackwell Publishing Ltd., Oxford, pp. 88-96."},{"key":"key2020021910240234700_ref040","unstructured":"Schonfeld, R.C. and Rutner, J. 
(2012), \u201cSupporting the changing research practices of historians\u201d, Final Report from Ithaka S+R, available at: https:\/\/sr.ithaka.org\/wp-content\/uploads\/2015\/08\/supporting-the-changing-research-practices-of-historians.pdf (accessed 13 June 2016)."},{"issue":"1","key":"key2020021910240234700_ref041","doi-asserted-by":"crossref","first-page":"81","DOI":"10.1504\/IJBRA.2008.017165","article-title":"The generalised k-truncated suffix tree for time-and space-efficient searches in multiple DNA or protein sequences","volume":"4","year":"2008","journal-title":"International Journal of Bioinformatics Research and Applications"},{"key":"key2020021910240234700_ref042","unstructured":"SHE (2016), \u201cShebanq (system for hebrew text: annotations for queries and markup)\u201d, available at: https:\/\/shebanq.ancient-data.org\/ (accessed 18 May 2016)."},{"issue":"1","key":"key2020021910240234700_ref043","first-page":"1","article-title":"Intrinsic plagiarism detection using character n-gram profiles","volume":"2","year":"2009","journal-title":"Threshold"},{"key":"key2020021910240234700_ref044","unstructured":"Strauss, D. and Eliot, G. (1860), \u201cThe life of Jesus: critically examined\u201d, Number v. 1 in The Life of Jesus. C. 
Blanchard."},{"first-page":"64","article-title":"User needs for enhanced engagement with cultural heritage collections","year":"2012","key":"key2020021910240234700_ref045"},{"key":"key2020021910240234700_ref046","unstructured":"tes (2018), \u201cTesserae\u201d, available at: http:\/\/tesserae.caset.buffalo.edu\/index.php (accessed 16 March 2018)."},{"key":"key2020021910240234700_ref047","unstructured":"VMB (2014), \u201cVMBA: virtual magic bowl archive\u201d, available at: www.southampton.ac.uk\/vmba\/ (accessed 28 January 2014)."},{"key":"key2020021910240234700_ref048","unstructured":"wel (2017a), \u201cWellcome trust collections \u2013 UK medical heritage library\u201d, available at: http:\/\/wellcomelibrary.org\/collections\/digital-collections\/uk-medical-heritage-library\/ (accessed 7 January 2017)."},{"article-title":"Wellcome trust UK medical library project: Wellcome grant","year":"2017","author":"wel","key":"key2020021910240234700_ref049"},{"key":"key2020021910240234700_ref050","unstructured":"wol (2016), \u201cWord length distribution in various languages\u201d, available at: https:\/\/reference.wolfram.com\/language\/example\/WordLengthDistributioninVariousLanguages.html (accessed 11 January 2016)."},{"first-page":"372","article-title":"Effects of out of vocabulary words in spoken document retrieval (poster session)","year":"2000","key":"key2020021910240234700_ref051"},{"issue":"3","key":"key2020021910240234700_ref052","first-page":"137","article-title":"Statistical language models for information retrieval: a critical review. 
Foundations and Trends\u00ae in Information Retrieval","volume":"2","year":"2008"},{"volume-title":"Statistical Language Models for Information Retrieval","year":"2009","key":"key2020021910240234700_ref053"},{"issue":"2","key":"key2020021910240234700_ref054","doi-asserted-by":"crossref","first-page":"179","DOI":"10.1145\/984321.984322","article-title":"A study of smoothing methods for language models applied to information retrieval","volume":"22","year":"2004","journal-title":"ACM Transactions on Information Systems"}],"container-title":["Journal of Documentation"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/JD-10-2018-0175\/full\/xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/JD-10-2018-0175\/full\/html","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,24]],"date-time":"2025-07-24T22:35:14Z","timestamp":1753396514000},"score":1,"resource":{"primary":{"URL":"http:\/\/www.emerald.com\/jd\/article\/76\/1\/271-289\/432419"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,9,2]]},"references-count":54,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2019,9,2]]}},"alternative-id":["10.1108\/JD-10-2018-0175"],"URL":"https:\/\/doi.org\/10.1108\/jd-10-2018-0175","relation":{},"ISSN":["0022-0418"],"issn-type":[{"type":"print","value":"0022-0418"}],"subject":[],"published":{"date-parts":[[2019,9,2]]}}}