Kurdish Optical Character Recognition

Keywords: Optical Character Recognition, Character Segmentation, Upper Contour Labeling, Kurdish OCR, Kurdish NLP

Abstract

Currently, no offline tool is available for Optical Character Recognition (OCR) in Kurdish. Kurdish is spoken in several dialects and written in several scripts, of which the Persian/Arabic script is widely used. This script is written from right to left (RTL), is cursive, and uses unique diacritics. These features, particularly the last two, affect the segmentation stage of a Kurdish OCR system. In this article, we introduce an enhanced character-segmentation-based method that addresses these characteristics. We applied the method to text-only images and tested the Kurdish OCR using documents with different fonts, font sizes, and image resolutions. In the experiments, the proposed method achieved an average character recognition accuracy of 90.82%.
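The abstract does not state how the character recognition accuracy was computed. As a minimal sketch, assuming one common convention (accuracy as one minus the character edit distance divided by the reference length), a per-document score could be obtained as below. The function names `levenshtein` and `character_accuracy` are illustrative and are not taken from the article.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn hypothesis into reference."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR output
                current[j - 1] + 1,      # extra character in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Character accuracy in percent (assumed definition, not the article's)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = levenshtein(reference, hypothesis)
    # Clamp at zero in case the OCR output has more errors than reference characters.
    return max(0.0, 1.0 - errors / len(reference)) * 100.0
```

Python strings are compared code point by code point here, so the ground truth and the OCR output should use the same Unicode normalization before scoring. An average over a test set would then be the mean of the per-document values.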

Published
2018-06-30
How to Cite
Yaseen, R., & Hassani, H. (2018, June 30). Kurdish Optical Character Recognition. UKH Journal of Science and Engineering, 2(1), 18-27. https://doi.org/10.25079/ukhjse.v2n1y2018.pp18-27
Section
Research Articles