{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,20]],"date-time":"2026-02-20T23:18:24Z","timestamp":1771629504726,"version":"3.50.1"},"reference-count":29,"publisher":"Emerald","issue":"3","license":[{"start":{"date-parts":[[2014,7,1]],"date-time":"2014-07-01T00:00:00Z","timestamp":1404172800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.emerald.com\/insight\/site-policies"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2014,7,1]]},"abstract":"<jats:sec>\n               <jats:title content-type=\"abstract-heading\">Purpose<\/jats:title>\n               <jats:p> \u2013 The purpose of this paper is to propose an automatic metadata extraction and retrieval system that extracts bibliographical information from digital academic documents in Portable Document Format (PDF). <\/jats:p>\n            <\/jats:sec>\n            <jats:sec>\n               <jats:title content-type=\"abstract-heading\">Design\/methodology\/approach<\/jats:title>\n               <jats:p> \u2013 The authors use PDFBox to extract text and font size information, a rule-based method to identify titles, and a Hidden Markov Model (HMM) to extract the titles and authors. Finally, the extracted titles and authors (possibly incorrect or incomplete) are sent as query strings to digital libraries (e.g. ACM, IEEE, CiteSeerX, SDOS, and Google Scholar) to retrieve the rest of the metadata. <\/jats:p>\n            <\/jats:sec>\n            <jats:sec>\n               <jats:title content-type=\"abstract-heading\">Findings<\/jats:title>\n               <jats:p> \u2013 Four experiments are conducted to examine the feasibility of the proposed system. The first experiment compares two different HMMs: a multi-state model and a one-state model (the proposed model). 
The result shows that the one-state model performs comparably to the multi-state model but is better suited to handling unknown states in real-world data. The second experiment shows that the proposed model (without the aid of online queries) performs as well as other researchers' models on the Cora paper header dataset. The third experiment examines the performance of the system on a small dataset of 43 real PDF research papers. The result shows that the proposed system (with online queries) performs well on bibliographical data extraction and even outperforms the free citation management tool Zotero 3.0. Finally, the fourth experiment uses a larger dataset of 103 papers to compare the system with Zotero 4.0. The result shows that the system significantly outperforms Zotero 4.0. The feasibility of the proposed model is thus supported. <\/jats:p>\n            <\/jats:sec>\n            <jats:sec>\n               <jats:title content-type=\"abstract-heading\">Research limitations\/implications<\/jats:title>\n               <jats:p> \u2013 In academic terms, the system is distinctive in two respects. First, it uses only the Cora header set for HMM training, without other tagged datasets or gazetteer resources, which keeps the system light and scalable. Second, the system is workable and can be applied to extracting metadata from real-world PDF files. The extracted bibliographical data can then be imported into citation software such as EndNote or RefWorks to increase researchers\u2019 productivity. <\/jats:p>\n            <\/jats:sec>\n            <jats:sec>\n               <jats:title content-type=\"abstract-heading\">Practical implications<\/jats:title>\n               <jats:p> \u2013 In practical terms, the system can outperform an existing tool, Zotero v4.0. 
This gives practitioners a good opportunity to develop similar products for real applications, though doing so may require some knowledge of HMM implementation. <\/jats:p>\n            <\/jats:sec>\n            <jats:sec>\n               <jats:title content-type=\"abstract-heading\">Originality\/value<\/jats:title>\n               <jats:p> \u2013 The HMM implementation is not novel. What is innovative is that it combines two HMMs. The main model is adapted from Freitag and McCallum (1999), and the authors add word features of the Nymble HMM (Bikel <jats:italic>et al.<\/jats:italic>, 1997) to it. The system is workable even without manually tagging datasets before training the model (the authors use only the Cora dataset for training and test on real-world PDF papers), which differs significantly from what other works have done so far. The experimental results provide sufficient evidence of the feasibility of the proposed method in this respect.<\/jats:p>\n            <\/jats:sec>","DOI":"10.1108\/prog-12-2011-0059","type":"journal-article","created":{"date-parts":[[2014,7,10]],"date-time":"2014-07-10T17:20:35Z","timestamp":1405012835000},"page":"293-313","source":"Crossref","is-referenced-by-count":4,"title":["Extracting bibliographical data for PDF documents with HMM and external resources"],"prefix":"10.1108","volume":"48","author":[{"given":"Wen-Feng","family":"Hsiao","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Te-Min","family":"Chang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Erwin","family":"Thomas","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"140","reference":[{"key":"key2020123002355894200_b1","doi-asserted-by":"crossref","unstructured":"Bikel, D.M.\n               , \n                  Miller, S.\n               , \n                  Schwartz, R.\n                and \n  
                Weischedel, R.\n                (1997), \u201cNymble: a high-performance learning name-finder\u201d, Proceedings of the Fifth Conference on Applied Natural Language Processing, pp. 194-201.","DOI":"10.3115\/974557.974586"},{"key":"key2020123002355894200_b2","unstructured":"Covington, M.A.\n                (2003), \u201cET: an efficient tokenizer in ISO prolog\u201d, technical report, The University of Georgia, Athens, GA available at: www.ai.uga.edu\/mc\/ET\/et.pdf (accessed September 1, 2011)."},{"key":"key2020123002355894200_b3","doi-asserted-by":"crossref","unstructured":"Day, M.-Y.\n               , \n                  Tsai, R.T.-H.\n               , \n                  Sung, C.-L.\n               , \n                  Hsieh, C.-C.\n               , \n                  Lee, C.-W.\n               , \n                  Wu, S.-H.\n               , \n                  Wu, K.-P.\n               , \n                  Ong, C.-S.\n                and \n                  Hsu, W.-L.\n                (2007), \u201cReference metadata extraction using a hierarchical knowledge representation framework\u201d, Decision Support Systems \n               43 No. 1, pp. 152-167.","DOI":"10.1016\/j.dss.2006.08.006"},{"key":"key2020123002355894200_b4","unstructured":"Freitag, D.\n                and \n                  Mccallum, A.K.\n                (1999), \u201cInformation extraction with HMMs and shrinkage\u201d, Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction, pp. 31-36."},{"key":"key2020123002355894200_b5","doi-asserted-by":"crossref","unstructured":"Gao, L.\n               , \n                  Qi, X.\n               , \n                  Tang, Z.\n               , \n                  Lin, X.\n                and \n                  Liu, Y.\n                (2012), \u201cWeb-based citation parsing, correction and augmentation\u201d, JCDL\u203212: Proceedings of the 12th ACM\/IEEE-CS Joint Conference on Digital Libraries, pp. 
295-304.","DOI":"10.1145\/2232817.2232872"},{"key":"key2020123002355894200_b6","doi-asserted-by":"crossref","unstructured":"Gao, L.\n               , \n                  Tang, Z.\n               , \n                  Lin, X.\n               , \n                  Liu, Y.\n               , \n                  Qiu, R.\n                and \n                  Wang, Y.\n                (2011), \u201cStructure extraction from PDF-based book documents\u201d, JCDL-11 Proceeding of the 11th Annual International ACM\/IEEE Joint Conference on Digital Libraries, pp. 11-20.","DOI":"10.1145\/1998076.1998079"},{"key":"key2020123002355894200_b7","doi-asserted-by":"crossref","unstructured":"Giuffrida, G.\n               , \n                  Shek, E.C.\n                and \n                  Yang, J.\n                (2000), \u201cKnowledge-based metadata extraction from PostScript files\u201d, Proceedings of the Fifth ACM Conference on Digital Libraries, pp. 77-84.","DOI":"10.1145\/336597.336639"},{"key":"key2020123002355894200_b8","doi-asserted-by":"crossref","unstructured":"Granitzer, M.\n               , \n                  Hristakeva, M.\n               , \n                  Knight, R.\n               , \n                  Jack, K.\n                and \n                  Kern, R.\n                (2012), \u201cA comparison of layout based bibliographic metadata extraction techniques\u201d, Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics WIMS'12, Craiova, June 13-15, available at: http:\/\/doi.acm.org\/10.1145\/2254129.2254154 (accessed September 1, 2013).","DOI":"10.1145\/2254129.2254154"},{"key":"key2020123002355894200_b9","doi-asserted-by":"crossref","unstructured":"Groza, T.\n               , \n                  Grimnes, G.A.\n               , \n                  Handschuh, S.\n                and \n                  Decker, S.\n                (2011), \u201cFrom raw publications to linked data\u201d, Knowledge and Information 
Systems, Vol. 34 No. 1, pp 1-21, available at: www.springerlink.com\/content\/98715029650435t6\/ (accessed September 1, 2013)","DOI":"10.1007\/s10115-011-0473-6"},{"key":"key2020123002355894200_b10","doi-asserted-by":"crossref","unstructured":"Han, H.\n               , \n                  Giles, C.L.\n               , \n                  Manavoglu, E.\n               , \n                  Zha, H.\n               , \n                  Zhang, Z.\n                and \n                  Fox, E.A.\n                (2003), \u201cAutomatic document metadata extraction using support vector machines\u201d, Third ACM\/IEEE-CS Joint Conference on Digital Libraries (JCDL'03), pp. 37-48.","DOI":"10.1109\/JCDL.2003.1204842"},{"key":"key2020123002355894200_b11","doi-asserted-by":"crossref","unstructured":"Hu, Y.\n               , \n                  Li, H.\n               , \n                  Cao, Y.\n               , \n                  Li, T.\n               , \n                  Meyerzon, D.\n                and \n                  Zheng, Q.\n                (2006), \u201cAutomatic extraction of titles from general documents using machine learning\u201d, Information Processing and Management, Vol. 42 No. 5, pp. 1276-1293.","DOI":"10.1016\/j.ipm.2005.12.001"},{"key":"key2020123002355894200_b12","unstructured":"Kohavi, R.\n                and \n                  Provost, F.\n                (1998), \u201cGlossary of terms\u201d, Editorial for the Special Issue on Applications of Machine Learning and the Knowledge Discovery Process, Vol. 30 Nos 2\/3, pp. 
271-274, available at: http:\/\/robotics.stanford.edu\/\u223cronnyk\/glossary.html (accessed September 1, 2013)"},{"key":"key2020123002355894200_b13","doi-asserted-by":"crossref","unstructured":"Lipinski, M.\n               , \n                  Yao, K.\n               , \n                  Breitinger, C.\n               , \n                  Beel, J.\n                and \n                  Gipp, B.\n                (2013), \u201cEvaluation of header metadata extraction approaches and tools for scientific PDF documents\u201d, JCDL'13 Proceedings of the 13th ACM\/IEEE-CS Joint Conference on Digital Libraries, pp. 385-386.","DOI":"10.1145\/2467696.2467753"},{"key":"key2020123002355894200_b14","doi-asserted-by":"crossref","unstructured":"McCallum, A.\n               , \n                  Nigam, K.\n               , \n                  Rennie, J.\n                and \n                  Seymore, K.\n                (2000), \u201cAutomating the construction of Internet portals with machine learning\u201d, Information Retrieval Journal, Vol. 3 No. 2, pp. 127-163.","DOI":"10.1023\/A:1009953814988"},{"key":"key2020123002355894200_b15","unstructured":"Marinai, S.\n                (2009), \u201cMetadata extraction from PDF documents for digital library ingest\u201d, Proceeding of 10th International Conference on Document Analysis and Recognition, pp. 251-255."},{"key":"key2020123002355894200_b16","unstructured":"Moens, M.\n                (2006), Information Extraction: Algorithms and Prospects in a Retrieval Context, Springer, Dordrecht."},{"key":"key2020123002355894200_b19","unstructured":"Peng, F.\n                and \n                  McCallum, A.\n                (2004), \u201cAccurate information extraction from research papers using conditional random fields\u201d, Proceedings of HLT-NAACL04, pp. 
329-336."},{"key":"key2020123002355894200_b20","doi-asserted-by":"crossref","unstructured":"Porter, M.F.\n                (1980), \u201cAn algorithm for suffix stripping\u201d, Program, Vol. 14 No. 3, pp 130-137, available at: http:\/\/cpro-documents-management.googlecode.com\/svn\/trunk\/2.%20Requirement\/Algorithm\/Stem.pdf (accessed September 1, 2011).","DOI":"10.1108\/eb046814"},{"key":"key2020123002355894200_b21","unstructured":"Schmid, H.\n                (2008), Tokenizing, In Anke L\u00fcdeling and Merja Kyt\u00f6, editors: Corpus Linguistics, An International Handbook, Mouton de Gruyter, Berlin, available at: www.coli.uni-saarland.de\/\u223cschulte\/Teaching\/ESSLLI-06\/Referenzen\/Tokenisation\/schmid-hsk-tok.pdf (accessed September 1, 2011)"},{"key":"key2020123002355894200_b22","doi-asserted-by":"crossref","unstructured":"Sebastiani, F.\n                (2002), \u201cMachine learning in automated text categorization\u201d, ACM Computing Surveys, Vol. 34 No. 1, pp. 1-47.","DOI":"10.1145\/505282.505283"},{"key":"key2020123002355894200_b23","doi-asserted-by":"crossref","unstructured":"Wei, W.\n               , \n                  King, I.\n                and \n                  Lee, J.H.-M.\n                (2007), \u201cBibliographic attributes extraction with layer-upon-layer tagging\u201d, Ninth International Conference on Document Analysis and Recognition, ICDAR2007, Vol. 2, pp. 804-808.","DOI":"10.1109\/ICDAR.2007.4377026"},{"key":"key2020123002355894200_b24","doi-asserted-by":"crossref","unstructured":"Yang, H.\n               , \n                  Onda, N.\n               , \n                  Kashimura, M.\n                and \n                  Ozawa, S.\n                (1999), \u201cExtraction of bibliography information based on image of book cover\u201d, Proceeding of 10th International Conference on Image Analysis and Processing, pp. 
921-926.","DOI":"10.1109\/ICIAP.1999.797713"},{"key":"key2020123002355894200_b25","unstructured":"Yin, P.\n               , \n                  Zhang, M.\n               , \n                  Deng, Z.H.\n                and \n                  Yang, D.Q.\n                (2005), \u201cMetadata extraction from bibliographies using bigram HMM\u201d, Lecture Notes in Computer Science, Vol. 3334 pp. 310-319, available at: http:\/\/link.springer.com\/chapter\/10.1007%2F978-3-540-30544-6_33 (accessed September 1, 2011)."},{"key":"key2020123002355894200_b26","doi-asserted-by":"crossref","unstructured":"Zhai, C.X.\n                (2008), \u201cStatistical language models for information retrieval a critical review\u201d, Foundations and Trends in Information Retrieval, Vol. 2 No. 3, pp. 137-213.","DOI":"10.1561\/1500000008"},{"key":"key2020123002355894200_b27","unstructured":"Zhang, M.\n               , \n                  Yang, D.\n               , \n                  Deng, Z.H.\n               , \n                  Feng, Y.\n               , \n                  Wang, W.\n               , \n                  Zhao, P.\n               , \n                  Wu, S.\n               , \n                  Wang, S.\n                and \n                  Tang, S.W.\n                (2004), \u201cPKUSpace: a collaborative platform for scientific researching, advances in web-based learning\u201d, Lecture Notes in Computer Science, Vol. 3143, pp. 
245-260, available at: http:\/\/link.springer.com\/chapter\/10.1007%2F978-3-540-27859-7_16 (accessed September 1, 2011)."},{"key":"key2020123002355894200_b28","unstructured":"Zotero 3.0\n                (2011), Manage your research and bibliographies, available at: www.zotero.org\/ (accessed September 1, 2011)."},{"key":"key2020123002355894200_b29","unstructured":"Zotero 4.0\n                (2014), Manage your research and bibliographies, available at: www.zotero.org\/ (accessed April 2, 2014)."},{"key":"key2020123002355894200_frd1","unstructured":"Apache PDFBox 1.8.4 (2014), Apache PDFBox \u2013 A Java PDF Library\n               , available at: http:\/\/pdfbox.apache.org\/downloads.html (accessed March 1, 2014)."},{"key":"key2020123002355894200_frd2","unstructured":"PDFMiner\n                (2014), Python PDF parser and analyzer, available at: www.unixuser.org\/\u223ceuske\/python\/pdfminer\/index.html (accessed April 2, 2014)."}],"container-title":["Program"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/PROG-12-2011-0059\/full\/xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/PROG-12-2011-0059\/full\/html","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,24]],"date-time":"2025-07-24T21:57:39Z","timestamp":1753394259000},"score":1,"resource":{"primary":{"URL":"http:\/\/www.emerald.com\/dta\/article\/48\/3\/293-313\/332775"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2014,7,1]]},"references-count":29,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2014,7,1]]}},"alternative-id":["10.1108\/PROG-12-2011-0059"],"URL":"https:\/\/doi.org\/10.1108\/prog-12-2011-0059","relation":{},"ISSN":["0033-0337"],"issn-type":[{"value":"0033-0337","type":"print"}],"subject":[],"published"
:{"date-parts":[[2014,7,1]]}}}