Methods of Intellectual Text Analysis
DOI:
https://doi.org/10.15802/stp2023/295252
Keywords:
natural language texts, intellectual text processing, frequency analysis, stemming, syntactic analysis, neural networks
Abstract
Purpose. Natural language text processing techniques are used to solve a wide range of tasks. One of the most difficult tasks when working with natural language texts in different languages is identifying indicators that can later be used to determine a text's authorship. The problem remains relevant because there is no unified tool or method for working with texts in different languages. Working with Ukrainian texts requires taking into account the language's peculiarities of word and sentence construction to obtain the best result. The main purpose of this article is to analyze existing text processing methods, their features, and their effectiveness in working with texts in different languages.
Methodology. Natural language text processing methods are systematized by type and format, according to the tools and approaches they use. For each method, its features, effectiveness, scope, and limitations are considered. System analysis was used to form the final characterization of each method, taking into account its purpose and capabilities.
Findings. The study identified the methods used for the intellectual analysis of texts in different languages, along with their scope, their effectiveness across languages, and their strengths and weaknesses. This makes it possible to choose an effective toolkit for working with Ukrainian texts. It has been established that intellectual text processing is a complex task that requires an individual approach to each language in order to take its peculiarities into account and obtain the best result.
Originality. A basis is formed for choosing an effective method for working with Ukrainian-language texts. Existing methods of intellectual text processing, their application features, capabilities, and efficiency in working with texts in different languages are analyzed and systematized.
Practical value. The work identified the most promising, effective, and appropriate methods of intellectual analysis of natural language texts, so that they can be used to process Ukrainian-language texts in the future.
License
Copyright (c) 2023 Science and Transport Progress
This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright and Licensing
This journal provides open access to all of its content.
As such, copyright for articles published in this journal is retained by the authors, under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0). The CC BY license permits commercial and non-commercial reuse. Such access is associated with increased readership and increased citation of an author's work. For more information on this approach, see the Public Knowledge Project, the Directory of Open Access Journals, or the Budapest Open Access Initiative.
The CC BY 4.0 license allows users to copy, distribute, and adapt the work in any way, provided that they give appropriate credit to the author. Therefore, the editorial board of the journal does not object to published materials being placed in third-party repositories. To protect manuscripts from misappropriation, any reuse should reference the original version of the work.