Kurdish Optical Character Recognition
DOI:
https://doi.org/10.25079/ukhjse.v2n1y2018.pp18-27
Keywords:
Optical Character Recognition, Character Segmentation, Upper Contour Labeling, Kurdish OCR, Kurdish NLP
Abstract
Currently, no offline tool is available for Optical Character Recognition (OCR) in Kurdish. Kurdish is spoken in different dialects and uses several scripts for writing. The Persian/Arabic script is widely used among these dialects. It is written from right to left (RTL), is cursive, and uses unique diacritics. These features, particularly the last two, complicate the segmentation stage of a Kurdish OCR system. In this article, we introduce an enhanced character-segmentation-based method that addresses these characteristics. We applied the method to text-only images and tested the Kurdish OCR using documents with different fonts, font sizes, and image resolutions. In these experiments, the proposed method achieved an average character recognition accuracy of 90.82%.
License
Authors who publish with this journal agree to the following terms:
1. Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License [CC BY-NC-ND 4.0] that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).