Improving Kurdish Web Mining through Tree Data Structure and Porter’s Stemmer Algorithms

Keywords: Kurdish text classification, Porter's stemmer algorithm, Stemming, Tree data structure

Abstract

Stemming is one of the most important preprocessing techniques for enhancing the accuracy of text classification. Its key purpose is to merge words that share the same stem, which reduces the high dimensionality of the feature space. A smaller feature space shortens the time needed to construct a model and reduces memory use. In this paper, a new stemming approach is explored for enhancing Kurdish text classification performance. The proposed approach combines a tree data structure with Porter's stemmer algorithm. The system is assessed using Support Vector Machine (SVM) and Decision Tree (C4.5) classifiers to compare performance before and after applying the suggested stemmer. Furthermore, the usefulness of using stop words is considered before and after implementing the suggested approach.
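As a rough illustration of how merging words with a shared stem shrinks the feature space, the sketch below stores suffixes in a small tree (a trie over reversed suffixes) and strips the longest match. It is not the stemmer proposed in the paper: the class `SuffixTrie`, the function `stem`, the `min_stem` threshold, and the toy English suffix list are all assumptions made for the example, and no Kurdish affix rules are implied.

```python
from collections import Counter


class SuffixTrie:
    """Hypothetical trie over reversed suffixes; lookup walks a word from its last character."""

    def __init__(self, suffixes):
        self.root = {}
        for suffix in suffixes:
            node = self.root
            for ch in reversed(suffix):
                node = node.setdefault(ch, {})
            node[None] = True  # None key marks the end of a complete suffix

    def longest_suffix_length(self, word, min_stem=3):
        """Length of the longest stored suffix that leaves at least min_stem characters."""
        node, best = self.root, 0
        for depth, ch in enumerate(reversed(word), start=1):
            if ch not in node:
                break
            node = node[ch]
            if None in node and len(word) - depth >= min_stem:
                best = depth
        return best


def stem(word, trie):
    """Strip the longest matching suffix, or return the word unchanged."""
    cut = trie.longest_suffix_length(word)
    return word[:-cut] if cut else word


# Toy English suffixes, chosen only for illustration.
TOY_SUFFIXES = ["s", "es", "ing", "ed"]
trie = SuffixTrie(TOY_SUFFIXES)

tokens = ["walk", "walks", "walked", "walking", "jump", "jumps", "jumping"]
raw_features = Counter(tokens)                           # one feature per surface form
stemmed_features = Counter(stem(t, trie) for t in tokens)  # one feature per stem

print(len(raw_features), "->", len(stemmed_features))
```

In this toy vocabulary, seven distinct surface forms collapse into two stems, which is the kind of dimensionality reduction the abstract refers to.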

Author Biographies

Ari M. Saeed, Department of Computer Science, College of Science, University of Halabja, Kurdistan Region - F.R. Iraq

Ari M. Saeed is an Assistant Lecturer and a Researcher at Halabja University, College of Science, Computer Department. He holds an M.Sc. in computer engineering from Lefke University, Cyprus (2013-2015). His interests include machine learning and artificial intelligence.

Tarik A. Rashid, Department of Computer Science and Engineering, School of Science and Engineering, University of Kurdistan Hewler, Kurdistan Region - F.R. Iraq

Dr. Tarik Ahmed Rashid received his Ph.D. degree in Computer Science and Informatics from the College of Engineering, Mathematical and Physical Sciences, University College Dublin (UCD) (2001-2006). He pursued his Post-Doctoral Fellowship at the Computer Science and Informatics School, College of Engineering, Mathematical and Physical Sciences, University College Dublin (UCD) from 2006 to 2007. He was a Professor at Salahaddin University-Erbil, Hawler, Kurdistan. He joined the University of Kurdistan Hewlêr (UKH) in 2017.

Arazo M. Mustafa, College of Basic Education, University of Kirkuk, F.R. Iraq

Arazo Mohamed Mustafa is currently a Lecturer at the College of Basic Education, University of Kirkuk, Iraq. She holds a B.Sc. in Computer Sciences from the University of Kirkuk, Iraq (2007) and an M.Sc. degree in Computer Science from the University of Sulaimani, Kurdistan Region, Iraq (2017). From 2008 to 2013 she worked as a software engineer at the Directorate of Municipalities of Kirkuk in the Projects Department of the Ministry of Construction, Housing and Public Municipalities. Her research interests are in the areas of artificial intelligence, programming languages, pattern recognition, machine learning, and data analysis.

Polla Fattah, Department of Computer Science and Engineering, School of Science and Engineering, University of Kurdistan Hewler, Kurdistan Region - F.R. Iraq

Polla Fattah is an Assistant Lecturer at Salahaddin University-Erbil and a visiting lecturer at Kurdistan University-Hawler. He is interested in data mining, machine learning, and optimization problems, especially time series analysis and deep learning.

Birzo Ismael, UKH, Computer Science and Engineering

Birzo Ismael holds a B.Sc. degree in Computer Science from Kingston University in London and an M.Sc. degree in Software Engineering from the same university. He joined the Department of Computer Science and Engineering in September 2016. His history with UKH goes back to June 2011, when he first joined as a software developer; in April 2013 he took up the role of Director of IT Admin., which he held until joining the CSE department as a lecturer in September 2016.

Published
2018-06-30
How to Cite
Saeed, A., Rashid, T., Mustafa, A., Fattah, P., & Ismael, B. (2018, June 30). Improving Kurdish Web Mining through Tree Data Structure and Porter’s Stemmer Algorithms. UKH Journal of Science and Engineering, 2(1), 48-54. https://doi.org/10.25079/ukhjse.v2n1y2018.pp48-54
Section
Research Articles