A Dual Approach to Establishing the Authority of Technical Natural Language Texts and Their Components

Purpose. The study aims to test the hypothesis that plagiarism can be detected by methods of establishing the authorship of a text, without using a text bank and direct comparison against it.
Methodology. Constructive-production models of the processes of establishing the authorship of technical texts have been developed for two methods. The first method forms a text model as a set of formal substitution rules with probabilistic weights (as in stochastic formal grammars), which reflects the syntactic features and patterns of the author's text formation. The degree of similarity between the text under study and another text is determined by comparing their models. The second method is the classical approach to detecting borrowings (plagiarism): the text under study is compared directly with an existing text bank, repeated text fragments are highlighted, and the degree of originality is determined. Experiments were conducted to establish the correlation between the results of the two methods. The experimental base consisted of 509 text sections of theses of students majoring in «Software Engineering».
Findings. The experimental studies established a high correlation between the results of the two methods. Correlation coefficients in the range from 0.75 to 1.0, with an average value of 0.88, were obtained provided that borrowings are counted for text fragments at least five words long.
Originality. For the first time, the authors have identified the possibilities of, and proposed methods for, indirect plagiarism detection without a large text bank. The essence of the model is to formalize the representation of the author's sentence syntax by a set of substitution rules with probabilistic weights.
Practical value. The results expand the possibilities for detecting borrowings and increase the effectiveness of the corresponding methods. Recommendations on the parameters of classical borrowing detection methods have been obtained; in particular, it is recommended to count text fragments at least five words long as a rational parameter when using borrowing detection systems. The possibilities of text authorship detection methods previously tested on fiction texts are extended to technical texts.


Introduction
The problem of identifying similarities and differences between texts by different authors remains relevant because of the difficulty of identifying commonalities that are not direct textual coincidences. A particular difficulty is handling the specific characteristics of a given language, which significantly complicates the task and makes it impossible to create a unified toolkit.
Text attribution currently draws on approaches from pattern recognition theory, mathematical statistics and probability theory, neural network algorithms, cluster analysis, and many others. However, none of these methods is sufficiently effective: they cannot work with texts in different languages and on different topics, and they do not capture the stylistic features of the author to a sufficient extent.
Consider two tasks in processing natural language texts: 1) the task of identifying borrowings (establishing the authorship of individual parts of a text: phrases, sentences, paragraphs, or the text as a whole). Given a text, it is necessary to highlight those parts that already occur in earlier texts by other authors and to establish the degree of originality of the text; 2) the task of establishing the authorship of a text from the style and other features of the author's text.
To solve the first task (identifying borrowings), a complete bank of texts is needed for comparison. Given the huge number of texts and the variety of repositories and storage methods (in particular, file formats), there is a certain probability of erroneous decisions, i.e. some of the borrowings will not be detected.
For the second task (establishing syntactic similarity), a certain number of the author's texts is needed, or at least one text of sufficient length. In this case no specific borrowings are identified; instead, conclusions are drawn about the text as a whole. In earlier work, the authors drew attention to the specific features of the language of an individual author [15,26].
This paper studies the correlation between the results of solving these two tasks. To this end, the constructive-synthesizing modeling approach proposed by the authors is applied to both tasks, and the corresponding methods are developed.
Both tasks are solved using software tools that implement the constructors presented in this article. The tools are multi-parametric and solve the tasks with varying degrees of accuracy. The final decision must be made by an expert. The results of checking by one method (the solution of one task) may be sufficient for the expert; for a more objective decision, both methods can be used.

Related works
One of the problems considered in this paper is the identification of borrowings. Borrowing is a fairly common problem in academic fields, including scientific articles, publications, inventions, etc. [1]. Plagiarism comes in many varieties, for example self-plagiarism (publishing the same or very similar articles in several journals) or using the texts of other authors. This phenomenon can be observed in both academic and non-academic environments. Academic plagiarism is one of the most serious forms of academic misconduct and negatively affects the educational institution and its employees. Research articles that contain plagiarism interfere with the scientific process [25], and the existence of plagiarism can have serious consequences. Plagiarism in research articles can significantly affect the work of specialists in various fields; for example, plagiarism in the medical field can threaten the safety of patients [25]. In addition, plagiarism wastes scientific resources: even detecting, investigating and punishing plagiarized research articles requires considerable effort from academics, institutions and funders [25].
There are many methods of detecting and working with borrowings. They can be grouped as follows: lexical detection methods [4] (working only with symbols or their sequences of a certain length [9] in a document, or with words [5]); detection methods based on lemmas [8], on syntax (working with the syntactic structure of a sentence, i.e. parts of speech) [10], and on grammars [15,27]; detection methods based on semantics [13,20] and on comparing particular sequences of words [21] or sentences [6]; and detection methods based on ideas and content, which go beyond the analysis of the text of the document to consider, for example, its mathematical component [14], citations [22] and images [18].
Manually checking a suspected document for plagiarism against different source documents is an extremely difficult and time-consuming process [1]. The use of computer systems is therefore appropriate. The plagiarism detection tools proposed so far can detect different types of plagiarism; however, the detection of plagiarism in a text still depends on experts [19].
In Ukraine and other countries, tools for detecting plagiarism and borrowings have been introduced in the academic environment and universities. However, even when they work with sufficient efficiency and credibility, they cannot cover all sources of plagiarism, because the number of sources is constantly growing and they are freely accessible on the Internet.
A hypothesis is put forward about the possibility of identifying borrowings by methods related to establishing the authorship of texts based on the analysis of the author's existing text.
The second problem addressed in this research is establishing the authorship of a text. Accurate and reliable authorship attribution requires a corpus of texts by different authors, from which the style characteristic of each author can be established and then used to establish the authorship of other texts. Methods for solving this problem belong to the same groups as methods for identifying borrowings. Because they formalize the text, they are widely applicable to different languages of the world. Examples of the range of methods and approaches include the use of neural networks for Ukrainian texts [17], a genetic algorithm for Turkish texts [10], establishing the authorship of ancient texts in Hindi [22], using the peculiarities of parts of speech and of different stylistic usage [7], as well as the specifics of working with short texts [2] and even text messages [12].
However, no single method or combination of methods yet gives 100% accuracy in determining authorship.

Purpose
The research aims to test the hypothesis that plagiarism can be detected by methods of establishing the authorship of a text, without using a text bank and direct comparison against it.

Methodology
The processes of identifying the authorship of texts using constructive-production modeling. To determine the authorship of texts, a constructive-production approach to modeling sentence structure is used. The process consists of the sequential work of the following constructors: the constructor-converter of natural language text into tagged text, the constructor-converter of tagged text into a set of formal substitution rules with a probability measure, and the constructor-measurer of the degree of similarity of two texts. They are described in more detail in the authors' previous article [23].
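The chain of constructors described above can be illustrated with a minimal sketch. It is not the authors' implementation (that is given in [23]); it assumes that part-of-speech tagging has already been done, that each substitution rule rewrites one tag into the next tag of the sentence, and that similarity is measured as the overlap of rule weights. The names `build_rules` and `similarity` and the boundary markers `<S>` and `</S>` are hypothetical.

```python
from collections import Counter, defaultdict


def build_rules(tagged_sentences):
    """Turn tag sequences into rules (left, right) -> probability.

    Assumption: a rule rewrites one tag into the tag that follows it,
    with "<S>" and "</S>" marking sentence boundaries.
    """
    counts = defaultdict(Counter)
    for tags in tagged_sentences:
        chain = ["<S>", *tags, "</S>"]
        for left, right in zip(chain, chain[1:]):
            counts[left][right] += 1
    rules = {}
    for left, rights in counts.items():
        total = sum(rights.values())
        for right, n in rights.items():
            # Weights of all rules with the same left-hand side sum to 1.
            rules[(left, right)] = n / total
    return rules


def similarity(rules_a, rules_b):
    """Similarity in [0, 1]: shared probability mass, averaged over left-hand sides."""
    lefts = {left for left, _ in rules_a} | {left for left, _ in rules_b}
    if not lefts:
        return 0.0
    score = 0.0
    for left in lefts:
        rights = {r for (l, r) in rules_a if l == left}
        rights |= {r for (l, r) in rules_b if l == left}
        score += sum(
            min(rules_a.get((left, r), 0.0), rules_b.get((left, r), 0.0))
            for r in rights
        )
    return score / len(lefts)


if __name__ == "__main__":
    text_a = [["NOUN", "VERB"], ["NOUN", "VERB", "NOUN"]]
    text_b = [["NOUN", "VERB", "ADJ", "NOUN"]]
    print(similarity(build_rules(text_a), build_rules(text_b)))
```

Two texts with identical rule sets score 1, and texts with no rule in common score 0; any richer rule form or distance measure would replace these two functions without changing the overall pipeline.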
Constructive-synthesizing modeling of borrowing detection processes. The constructor of the graph representation of texts. To speed up text comparison, it is proposed to use a constructor-converter of natural language text into a graph. The idea of the constructor is to build, from the text, a graph structure that contains all the chains present in the input text and no extraneous ones.
The constructor has the form $C_g = \langle M_g, \Sigma_g, \Lambda_g \rangle$, where $M_g$ is an extensible carrier that includes sets of graph constructions, language constructions (words, sentences, etc.) and their elements. A loaded graph is denoted $G = \langle V, E \rangle$, where $V$ and $E$ are the sets of vertices and arcs loaded with attributes. Each set contains an empty element.
The graph has the following attributes: $v\_start$ is the starting vertex of the graph, $v\_last$ is the last added vertex, $v\_current$ is the current vertex during graph formation, and $l\_amount$ is the number of cycles that include the starting vertex.
Statement of operations. A substitution relation and several specified operations are used to construct the graph [25]: definition of an arc by its incident vertices; finding a vertex whose content weight attribute equals a given value; execution of n operations from a given list; calculation of the cardinality of a set; addition of two numbers; union of graphs; partial and complete removal.
The complete ontology of the graph constructor, as well as the specification of the graph construction rules, are presented in [26].
The goal of construction is to construct a graph structure that corresponds to a given text structure.
The graph constructor is limited by the text construction.The number of graphs depends on the number of different characters in the text.
Initial conditions for the construction of graphs: $\sigma$, the non-terminal from which the derivation begins.
Construction completion condition: the form does not contain non-terminals and each text construction element corresponds to a graph construction element.
Text comparison is performed as follows. Two texts are input, one as a string and the other as an ordered set of graphs, and the string is compared character by character with the text in the graph representation. A graph is selected whose starting vertex has a content attribute equal to the current character of the text. Once this equality is established, the graph is traversed in the order that corresponds to the order of characters in the text string. If that order cannot be found in the graph, the comparison moves to the next word in the string and to another graph in the set. The result of the comparison is a set of strings: the fragments that occur in both input texts. The percentage of borrowings can then be calculated as the ratio of the total length of the found fragments to the length of the input text string.
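The comparison procedure can be sketched as follows. This is a simplified stand-in rather than the authors' constructor: it works on words instead of characters, stores the chains of the source text in nested dictionaries (one top-level entry per distinct first word, in the role of the starting vertices), and measures lengths in words. The names `build_graphs`, `find_borrowings` and `max_words` are hypothetical; `param` follows the paper's meaning of the minimum fragment length in words.

```python
def build_graphs(source_text, max_words=50):
    """Index every word chain of the source text in nested dictionaries.

    Assumption: chains are capped at `max_words` words to bound memory.
    """
    words = source_text.split()
    graphs = {}
    for start in range(len(words)):
        node = graphs
        for word in words[start:start + max_words]:
            node = node.setdefault(word, {})
    return graphs


def find_borrowings(text, graphs, param=5):
    """Return (fragments, share) for `text` checked against `graphs`.

    `fragments` holds half-open word spans (start, end) of at least
    `param` words that also occur in the source; `share` is the
    fraction of words of `text` covered by those spans.
    """
    words = text.split()
    fragments = []
    i = 0
    while i < len(words):
        node, j = graphs, i
        # Follow the chain while the graph has a vertex for the next word.
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
        if j - i >= param:
            fragments.append((i, j))
            i = j  # continue after the recorded fragment
        else:
            i += 1  # too short: move to the next word
    borrowed = sum(end - start for start, end in fragments)
    share = borrowed / len(words) if words else 0.0
    return fragments, share


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog"
    checked = "a quick brown fox jumps over fences"
    print(find_borrowings(checked, build_graphs(source), param=5))
```

Because the source is indexed once, each checked text is scanned in a single pass, which matches the stated motivation of speeding up comparison.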
Let us formalize these processes using the comparison constructor, where $X_{int}$ is the set of integer vectors that indicate the start and end positions of the fragments that coincide in the two texts, and $R$ is a set of real numbers that includes the percentage of borrowings in the texts checked for originality.
Statement of operations. A substitution relation and several specified operations are used to compare the text presented as a string with the text in the graph representation: $f$ is a list of numbers in which each odd element is the start position of a borrowed fragment in the text and each even element is its end position; $param$ is the minimum number of words in a fragment that is considered plagiarism; the logical «and» operation on operands $a$ and $b$ gives the result $c$.
The purpose of the construction is to establish the degree of similarity of the texts by comparing the text in the form of a string with the text in the graph representation built by the constructor $C_g$. The initial conditions of the construction are: $t$, the text in the form of a string in which borrowings are sought; $i = 0$, the number of the character in the text $t$ from which the comparison begins; $\{G_i\}$, the text in the form of an ordered set of graphs; and the initial non-terminal (axiom).
Construction completion condition: getting a number from 0 to 1 that reflects the similarity of two texts represented as a string and a set of graphs.
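As an illustration of how the list $f$ and the threshold $param$ lead to a number from 0 to 1, here is a minimal sketch. It assumes that positions in $f$ are character indices with an exclusive end, and that fragments shorter than $param$ words are ignored at this stage; both are assumptions, and the name `degree_of_borrowing` is hypothetical.

```python
def degree_of_borrowing(t, f, param=5):
    """Degree of borrowing of text `t` given the flat position list `f`.

    f = [begin_1, end_1, begin_2, end_2, ...]: odd elements (1st, 3rd, ...)
    are start positions, even elements are end positions (exclusive here,
    which is an assumption of this sketch).
    """
    if len(f) % 2 != 0:
        raise ValueError("f must contain begin/end pairs")
    borrowed = 0
    for begin, end in zip(f[0::2], f[1::2]):
        fragment = t[begin:end]
        # Count the fragment only if it has at least `param` words.
        if len(fragment.split()) >= param:
            borrowed += end - begin
    return borrowed / len(t) if t else 0.0


if __name__ == "__main__":
    t = "one two three four five six seven"
    print(degree_of_borrowing(t, [0, 23], param=5))
```

The result is 0 when no fragment reaches the threshold and 1 when a single qualifying fragment covers the whole string.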
The specification of the comparison constructor includes a set of production rules $s_i$, operations on the text in the form of a string and on the graph, and operations that record the borrowed fragments $f$ in the text $t$. The rules work as follows.
If the text string has not been processed completely, that is, the current position is less than the length of the text $t$ (checked using $g_{1,1}$, $g_{1,2}$), the current character of the string is connected with a graph. For a connected graph, the connection is extended to the successive characters of the text string. As long as the graph has a vertex for each subsequent character of the text, the constructor moves to the next graph vertex and the next text character. If no match is found for the current character of the text string among the graph vertices and several characters have already been processed, then by rule $s_{4,2}$ the constructor writes the boundaries of the processed fragment to the list $f$ and moves to the next word in the text ($s_{4,1}$).
If no match is found for the current character of the text string among the graph vertices and no preceding characters have been processed, the constructor moves to the next word in the text.
The interpretation is the established correspondence between the operations of the constructor and the algorithms of some algorithmic structure. Among these algorithms, step 2.2 tests $q_c$ against $param$ and, when the condition holds, updates $q$.
Experimental studies. The constructive-production models of the processes of establishing text authorship described above, and their software implementations, are applied to test experimentally the hypothesis of a possible statistical relationship between the results of solving the two tasks: identifying borrowings, and establishing the authorship of a text from the style and other features of the author's text.
The purpose of the experiment. To determine whether the method of establishing text authorship with a constructor that reflects sentence structure is suitable for the task of detecting borrowings (in a broader sense, plagiarism).
The experimental base consists of 16 text files in docx format, which are documentation for diploma projects of the OKR «Bachelor» in the direction 6.050103 «Software engineering», DNUZT-2018 (file sizes from 0.7 MB to 27.3 MB). Each file contains 28-33 structural sections. Each section is saved in a separate txt file. The total number of text sections (files) is 509.
The technical characteristics of the PC do not affect the results of the experiment.
Methodology of the experiment. The experiment consists of three logical parts: 1) determining the percentage of borrowings in the text using the graph text representation model [12,25]; 2) determining the percentage of borrowings by analyzing the author's style; 3) calculating the correlation coefficient between results 1 and 2.
Part I has the following stages:
1) automated analysis of the document structure, based on the XML structure of its file, in which headings formatted with the built-in heading styles determine the section boundaries [16]; formation of txt files containing the texts of the individual sections. When the files are created, the section texts undergo preliminary processing: removal of control symbols, conversion to a single case, unification of punctuation marks, etc.;
2) construction of the graph representation of the section text files of the i-th document;
3) setting the parameters and comparing the set of txt files of the j-th document with the graph representation of the sections of the i-th document;
4) forming a summary table of results (Table 1);
5) assigning new values to the comparison parameters and repeating steps 3-4;
6) transition to the (j+1)-th document.
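The preliminary processing named in stage 1 can be sketched as follows. The exact rules used in the experiment are not given in the text, so the punctuation mapping and the function name `preprocess` are assumptions chosen only to illustrate the three operations (removal of control symbols, conversion to a single case, unification of punctuation).

```python
import re
import unicodedata

# Assumed unification table: typographic quotes, dashes and the ellipsis
# are mapped to plain ASCII equivalents.
PUNCT_MAP = str.maketrans({
    "\u00ab": '"', "\u00bb": '"',   # angle quotes
    "\u201c": '"', "\u201d": '"', "\u201e": '"',
    "\u2018": "'", "\u2019": "'",
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",
})


def preprocess(section_text):
    """Normalize one section before it is written to a txt file."""
    text = section_text.translate(PUNCT_MAP)
    # Line breaks, tabs and other whitespace become single spaces.
    text = re.sub(r"\s+", " ", text)
    # Remaining control and format characters (Unicode category C*) are dropped.
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("C"))
    return text.strip().lower()


if __name__ == "__main__":
    print(preprocess("\u00abHello\u00bb\tWorld\x07\u2014 ok\n"))
```

Applying the same normalization to both compared texts keeps formatting differences from hiding otherwise identical fragments.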
Part II consists of the following stages (with the constructors added):
1) the same as step 1 of the first part of the experiment;
2) conversion of the text from the txt file into tagged text indicating parts of speech, number and gender, using the first constructor-converter CP;
3) forming the rules of the stochastic constructor from the tagged text with the second constructor-converter CT;
4) calculating, with the constructor-measurer CE, the similarity of the two stochastic constructors that reflect the syntactic structure of the texts being compared;
5) forming a summary table of results.
Table 1 presents the results of the sequential comparison of the corresponding sections of two diplomas (P1, P2, ..., P30) by the two methods described above, as the percentage of coincidence between them.
Because the two approaches compare differently (the first uses sentences, whereas the second uses word sequences without regard to sentence boundaries), the graph constructor was run with different comparison parameters: the type of fragment and the minimum length at which a fragment can be considered borrowed (from 3 to 7).
The result of the constructor-measurer based on the syntactic structure of the sentences in two corresponding sections is given in the «sentence» row of Table 1, converted to percentages. A zero similarity for some sections usually arises because they are too small to reflect the similarity reliably. In this work the fragment type is the word, and the minimum length ranges from 3 to 7 words. This range was determined from the results of research [16], the data of which are partially shown in Table 2.
Since the experimental base of this study contains scientific-style texts, which mostly have a complete grammatical basis and secondary clauses, the minimum sentence length is three words.
As for the maximum length, the data in Table 2 suggest taking 7-9. However, the second part of the experiment, carried out in parallel, indicates that a maximum of seven is sufficient.

Discussion of experimental results
The similarity results obtained for the 16 diploma theses were compared, and the correlation coefficient between the results of the two approaches was calculated. The comparison was made for word sequences of different lengths, from 3 to 7 words, with the following results.
For sequences of 3 words, the average correlation coefficient was 0.00053, an unsatisfactory result that demonstrates a large discrepancy between the results of the two methods.
With sequences of 4 words, the reliability of the results improved significantly: the average value is 0.82, which allows us to say that the analysis starting with 4 words is reliable and reflects the real state of affairs.
In the experiments with sequences of 5 and 6 words, the results likewise support using precisely these lengths as the most informative. The average correlation coefficient is 0.88 for a sequence length of 5 words and 0.82 for a length of 6 words.
With sequences of 7 words, the result is similar to that for 3 words and indicates that such sequences are impractical: the average correlation coefficient is 0.000531, an unsatisfactory result that indicates a strong discrepancy between the two methods.
Overall, the two methods show a sufficient correlation, and the sequence lengths required for a reliable reflection of the author's style have been identified.
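The correlation coefficients discussed above can be computed from two equally long lists of per-section similarity percentages. The text does not state which coefficient was used; the sketch below assumes the Pearson coefficient, and the function name `pearson` and the sample values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical per-section percentages from the two methods.
    graph_method = [12.0, 40.5, 3.2, 77.0, 25.1]
    syntax_method = [15.0, 38.0, 5.5, 70.2, 30.0]
    print(pearson(graph_method, syntax_method))
```

Repeating the calculation for each minimum fragment length from 3 to 7 words and averaging over document pairs would give one value per length, as in the discussion.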

Originality and practical value
The research was carried out on technical texts about programming. The method is expected to be applicable to texts from other technical fields as well, but this should be supported by relevant experiments.
For small volumes of text, only a weak correlation was observed between the results of identifying borrowings and of establishing text authorship, which requires further research. There are no clear boundaries between small and large volumes of text. According to the results of the experiments, for a satisfactory result the minimum volume of text should be 10 characters and consist of 5 sentences.
When choosing text samples, one should take into account that the author's style can change with the passage of time, with different text topics, and with changes in the templates commonly used for forming texts.
We believe that the proposed method of establishing text authorship can be widely and effectively used for various Slavic languages. For other languages, such as English, where words carry fewer attributes (which is compensated by sentence-building patterns and by more formal requirements for sentence structure), the capabilities of the proposed methods are significantly weakened.
The presented method of establishing text authorship can also be used to identify the presence of a large volume of borrowings. This can serve as a reason for further, more thorough verification by software tools or with the involvement of experts.
Checking a text for plagiarism requires a large bank of texts by other authors in which to detect the borrowings, which can be difficult because of the constant growth in the number of freely accessible materials and the variety of forms and formats in which they are presented. Unlike well-known programs for identifying borrowings and establishing authorship, the proposed approach requires only a relatively small volume of the author's texts and does not need a large text bank.
In the course of working with explanatory notes for students' diplomas in the programming field, it was established that the approach works only with natural language text. The method did not work with sections that include program code, which could not be processed in this way. In the future, it is planned to develop a constructor that can process texts in formal languages.

Findings
The work confirms the hypothesis of a close connection between the two tasks, the methods of solving them, and their results in establishing the authorship of technical texts and identifying borrowings. It was established that the correlation coefficient between the results can exceed 0.9.
A constructive-production model of establishing text authorship was developed, based on the analysis of the features and regularities of the author's style of sentence formation. The essence of the model is to formalize the representation of the author's sentence syntax by a set of substitution rules with probabilistic weights. The obtained results show that the proposed method is highly efficient compared to the methods used earlier [3].
A model of technical texts was developed that takes the author's style into account. By reflecting the unique stylistic and linguistic features of the author's language, it significantly simplifies the process of identifying borrowings and establishing authorship, because only one work by the author is needed instead of a whole corpus of texts that are possible sources of borrowing.
The experiments determined the value of the rational parameter, the minimum length of a text fragment (in words) that should be considered a borrowing: it is equal to five words. This should be treated as a recommendation for the use of any plagiarism detection program.
The proposed method can be used both to solve the problems of finding borrowings and to establish the probable authorship of the text.


$\Lambda_g$ is a set of CIS statements. Statements about the carrier. The carrier includes sets of terminal and non-terminal elements and a set of graph constructions; $V$, $E$ are the sets of vertices and arcs with their attributes. A vertex $v$ has the attributes $w_v = \langle id, content \rangle$, where $id$ is an identifier that takes integer values and $content$ is a part of the text structure. An arc $e$ has the attributes $w_e = \langle id, routes, start, end \rangle$, where $id$ is an identifier that takes integer values, $routes$ is the set of numbers of the paths that include the arc (it indicates the order of traversal of the graph), and $start$, $end$ are the vertices incident to the arc $e$.

$M_C$ is a carrier that includes sets of terminal ($T_C$) and non-terminal ($N_C$) symbols; $\Sigma_C$ is a set of operations and relations on the elements of $M_C$; $\Lambda_C$ is a set of CIS statements; $t$ is the text in the form of a string, a sequence of characters $t_i$ (reformatted text in which the paragraph and line break symbols are replaced by spaces); $\{G_i\}$ is the text presented as an ordered set of graphs, which is the result of the work of $C_g$; $param$ is the minimum number of words in a fragment that is considered a borrowing. Statements about the carrier.

..., $s_5$, which allow forming a vector of positions (integers).

... ($s_{2,2}$) is performed with successive characters of the text string, matching the graph vertex with load $t_i$ to the character of the text; $s_{2,1}$ ensures progress along the text string.

... is the degree of borrowing of the string texts relative to the texts in the graph representation.

Fig. 1. The sequence of performing the search for borrowings in natural language texts using the graph constructor