Disputed Authorship Resolution through Using Relative Empirical Entropy
for Markov Chains of Letters in Human Language Texts
D.V. Khmelev
Summary
The problem of disputed authorship resolution is solved here by the formal analysis of texts. The method of the analysis is based on the Markov Model for the sequence of letters in text. We assume that the frequencies of letter pairs are very specific for an author. This assumption is checked in the large statistical experiment which was carried out for 386 text samples (stories, novels, and their combination) over stories and novels of 82 Russian fiction writers.
1. Introduction
First attempts in the search of specific quantitative structures for individual author style took place in the beginning of the twentieth century (Morozov, 1915). A famous mathematician Markov (1916) criticized the method offered by N.A. Morozov. A.A. Markov argued that characteristics of writer’s style (individual word frequencies, e.g., the distribution of negative word “ne”) offered by Morozov (1915) are unstable. A.A. Markov had given an example of good statistical approach (Markov, 1913). In this work A.A. Markov studied the distribution of vowels and consonants among initial 20000 letters of “Evgenij Onegin”. This paper (Markov, 1913) is the first application of “events which are linked to the chain”. In the last century this object received the name “Markov chains”. In linguistic studies this object is said to be the Markov Model. Markov (1913) presents not only a historical basis for the present article, but also a practical one.
Modern methods of disputed authorship resolution are reviewed by Milov (1994) in Russia and Holmes (1998) in West. All methods described there has a common disadvantage although there is the amazing variety of them. None of these methods have been tested by use of large number of authors. Moreover, the testing of most of these methods is impossible because they have informal steps which need a human participation (for example, some of them require manual separation of words into various classes). Clearly, such an approach prevents checking of a large number of authors and, indeed, all of the methods are used for analysis of a small number of authors and for a limited number of texts.
Fomenko et al. (1996) have another approach. They select a simple parameter of author’s style: the fraction of functional words. They check the stability of this fraction over large number of Russian writers. So this fraction is specific enough for each author and is called by Fomenko et al. (1996) an author’s invariant.
Another method of disputed authorship resolution is offered in the present work. This method is totally independent of all methods described by Fomenko (1996), Holmes (1998) and Milov (1994). Here we shall not apply this method to resolution of well-known problems of disputed authorship for specific texts. The present article is mainly theoretical and is aimed on testing of the offered method. The method is based on Markov Model for the sequence of letters.
Assume we have wide range of large texts in Russian of known authorship. A text under question is known to be written by one of these writers but we do not know who the true author is. Our goal is to find the author who is closest to the text under question according to the proposed measure. Note also that for control purposes we do know a real author for each of "doubtful" texts.
Consider a sequence of letters of text as a Markov chain. The matrices of transition frequencies of letter pairs are calculated over all texts of each author. Therefore we know (approximately) the probability of transition from one letter to another for each author. The true author of an anonymous text is calculated with the principle of maximal likelihood, i.e., for each matrix we calculate the probability of the anonymous text, we choose the author with the maximal corresponding probability and the chosen author is supposed to be the true author. Suppose we take the logarithm of each probability, change the sign, and divide it by the length of anonymous text; then each of the obtained numbers is called the relative empirical entropy. Relative empirical entropy is more convenient for computing. The true author has the minimal relative entropy.
This method is amazingly precise. We discuss here the features of methods application and compare it to another method which is based on isolated letter frequencies in the text. The method is checked on 386 text samples of 82 writers. There are a lot of well-known Russian writers of the nineteenth and twentieth centuries among them.
2. Mathematical Foundations
2.1 Markov Model. Suppose A is a set of letters. Let Ak be a set of words of length k over alphabet A and A*=È k>0 Ak. By | f | denote the length of fÎ A* .
Consider the following formalization of the problem of true writer recognition. Take n sets Ci, where i=0,...,n-1. The set Ci contains sequences fi,jÎ A*, where j=1,...,mi, i.e., Ci={ fi,j | j=1,...,mi}. We shall assign to xÎ A* the set Ci.
Let fi,j be a sample from Markov chain with transition matrix P i. Now we shall define the estimate Pi for P i. Let hi,j,kl be the number of letter pairs (transitions) k® l in fi,j. We have hi,kl=å jhi,j,kl and hi,k=å lhi,kl. By definition, put Pikl = hi,kl / hi,k. Let Zi be a set of pairs (k,l) such that Pikl>0.
Also, let x be a sample from a Markov chain with transition matrix Pq , where q is unknown number in {1,...,n}.
Let n kl denote the number of transitions k® l in x. By definition, put n k=å l n kl. Consider
L i(x) = -å (k,l) n kl´ ln(n kl /(Pikl´ n k)),
where the sum is calculated over (k,l)Î Zi . Let us remark that if Zi contains all possible pairs (k,l), then L i(x) is the minus logarithm of the probability of the sequence x when the sequence x is a Markov chain with transition matrix Pi. The normalized number L i(x) / | x | is called the relative empirical entropy with respect to the set Ci. Define the maximum likelihood estimate t(x) of the unknown q by the rule
t(x)=argmini=0,...,n-1 L i(x) / | x |. (2.1)
We shall not discuss or prove any mathematical properties of estimate (2.1) although these properties are interesting for statistics (Ivchenko and Medvedev, 1992, p. 224). But we shall prove that estimate (2.1) is amazingly effective in true author recognition.
2.2 Computational procedure. Take A={small cirillic letters}È {space symbol}. Suppose we have quite a large texts of n authors in Russian. Let gi,j denote the j-th text of the author i. The text gi,j is a sequence of symbols of extendent alphabet B, where the alphabet B contains punctuation symbols, capital letters, Latin letters etc (set B is extended ASCII).
To each fragment gi,jÎ B* assign the sequence F(gi,j)Î A*. Let F be the map from B* to A* such that all capital letters convert to lower letters, all hyphens are attached, all punctuation and extra space symbols drop, a delimiter space symbol between words stays, and a space symbol is inserted at the beginning and at the end of the fragment if absent.
Besides, consider the map G: B*® A*. The map G has the same structure as F, but the map G drops all words with capital letter in text gi,j . In particular, if
y=“Krome togo, my budem rassmatrivat’ funktsiju G,”,
then F( y)=“ krome togo my budem rassmatrivat’ funktsiju ”,
and G( y)=“ togo my budem rassmatrivat’ funktsiju ”.
Now suppose an text yÎ B* is written by one of n authors, but the true author is unknown. How do we find the true author of y? Using estimate (2.1) with sequences x=F( y) and x=G( y) we obtain two ways for determining the author:
1) the true author is t(F( y)),
2) the true author is t(G( y)).
Note that estimates t(F( y)) and t(G( y)) use only information on letter pairs. Estimates t(F( y)) and t(G( y)) are independent of word order because all words are delimited by space symbols. Perhaps, t(F( y)) and t(G( y)) are based on mediated information on morphemic units (and their combinations within word forms), and surely use no syntactical and phraseological information (methods described by Milov (1994) are based on syntactical information).
Formally any natural language text is not a Markov chain, i.e., this hypothesis is rejected by decision rules of statistics. But one can formally calculate and apply estimate (2.1) to human language texts taking into account stability of estimates obtained by the principle of maximal likelihood. In the present paper we show that estimate (2.1) gives the true author in most cases.
2.3 Frequency analysis for isolated letters. Assume that the sequences fi,j and x are sample from i.i.d.r.v. taking values in A. By the previous statement, estimate (2.1) takes the following form
e(x) = argmini G i(x) / | x |, (2.2)
where
G i(x) = -å k n kln((n k´ hi)/(hi,k´ n )),
and the sum is taken over k such that n k >0 and n =å k n k, hi=å k hi,k . Note that estimate (2.2) is usually called frequency analysis. Our calculation show that e(x) is significantly worse than t(x).
3. An example of research
Consider an example of our approach for checking the quality of the method for disputed authorship resolution. We shall check the estimate t(F( y)) on texts of K. Bulychev, A. Volkov, N.V. Gogol’ and V. Nabokov.
We offer the following way of checking. Choose a control text y i for each author (i = 0,1,2,3), calculate estimates P i for matrices P i with the other texts fi,j, and calculate t(F( y i)). If the estimate is good then for each author t(F( y i))=i.
0) K. Bulychev: Umenie kidat’ mjach ( y0); Beloe plat’e zolushki (g0,1); Velikij dukh i begletsy (g0,2); Glubokouvazhaemyj mikrob (g0,3); Zakon dlja drakona (g0,4); Ljubimets [Sponsory] (g0,5); Marsianskoe zel’e (g0,6); Miniatjury (g0,7); “Mozhno poprosit’ Ninu?” (g0,8); Na dnjakh zemletrjasenie v Ligone (g0,9); Pereval (g0,10); Pokazanija Oli N. (g0,11); Pominal’nik XX veka (g0,12); Raskopki kurganov v doline Repedelkinok (g0,13); Trinadtsat’ let puti (g0,14); Smert’ etazhom nizhe (g0,15);
1) A. Volkov: Sem’ podzemnykh korolej ( y1); Volshebnik izumrudnogo goroda (g1,1); Urfin Dzhjus i ego derevjannye soldaty (g1,2); Ognennyj bog Marranov (g1,3); Genial’nyj pen’ (g1,4); Na vojne, kak na vojne (g1,5); O chem molchali gazety... (g1,6); Prestuplenie i nakazanie (g1,7); Epilog (g1,8); Zheltyj Tuman (g1,9); Tajna zabroshennogo zamka (g1,10);
2) N.V. Gogol’: Rasskazy i povesti ( y2, nazvanija povestej: “Povest’ o tom, kak possorilsja Ivan Ivanovich s Ivanom Nikiforovichem”, “Starosvetskie pomeschiki”, “Vij”, “Zapiski sumasshedshego”); Revizor (g2,1); Taras Bul’ba (g2,2); Vechera na khutore bliz Dikan’ki (g2,3);
3) V. Nabokov: Drugie berega ( y3); Korol’, dama, valet (g3,1); Lolita (g3,2); Mashen’ka (g3,3); Rasskazy (g3,4); Nezavershennyj roman (g3,5).
For example, A. Volkov has a control text y1, i.e., “Sem’ podzemnykh korolej.” The other texts of A. Volkov are used in computing of Pi.
The results of calculations are presented in the following table 1.
Table 1
N
Author
c1
c2
c3
c4
0
K. Bulychev
0
15
2345689
75161
1
A. Volkov
0
8
1733165
233418
2
N.V. Gogol’
0
3
723812
243767
3
V. Nabokov
0
5
1658626
367179
The column c2 contains the total number of files with texts of each author. Note that the number of files and the number of texts are different in the following two cases. Firstly, some texts of the author can be collected in one file (this is happen in A. Volkov case: three stories “Zheltyj Tuman”, “Tajna zabroshennogo zamka” and “Ognennyj bog Marranov” are collected in one file). Secondly, one large text can be cut into smaller parts (recall this when you study the table 2).
The column c3 contains the total number of symbols (cyrillic letters and spaces) in F(gi,j): c3=å j |F(gi,j)|. The column c4 contains the total number of symbols in F( y i), i.e., c4=|F( y i)|. For example, the total volume å j F( g0,j ) of texts for K. Bulychev is 2'345'689. The volume of F( y1), i.e., the number of symbols of alphabet A in control story “Umenie kidat’ mjach” is 75'161.
The column c1 in line j contains the L j(F( y j)) rank. The ranking of numbers {L i(F( y j)) | i = 0,1,2,3}. The rank of L j(F( y j)) is its number (counting from 0) in set of numbers {L i(F( y j)) | i = 0,1,2,3} lined up from smallest to biggest. For example, if j=1 and L i lined up in the following order L 0£ L 3£ L 2£ L 1then L 1 has a rank of 3. If j=0 and L i lined up in the same order L 0£ L 3£ L 2£ L 1 then L 0 has a rank of 0. The rank of L j(F( y j)) in set of numbers {L i(F( y j) | i=0,1,2,3} equals the rank of L j(F( y j))/|F( y j)| among numbers {L i(F( y j))/|F( y j)| | i=0,1,2,3}. Put in lines j=0,1,2,3 of the following matrix the numbers L i(F( y j))/|F( y j)|, i=0,1,2,3:
2.484569
2.508425
2.504301
2.493778
L =
2.501061
2.473907
2.516797
2.492874
2.499033
2.504508
2.480202
2.483829
2.541367
2.538101
2.548842
2.520018
Rank L i in each line.
0
3
2
1
R =
2
0
3
1
2
3
0
1
2
1
3
0
The required numbers of column c1 are the diagonal in matrix R. Using (2.1), we obtain that t(F( y j))=j iff the number L j(F( y j))/|F( y j)| in the set {L i(F( y j))/|F( y j)| | i=0,1,2,3} has a rank of 0. Therefore, suppose we find 0 in column c1 of table 1; then the authorship of the control text is established correctly. Note from the data of table 1 that the authorship is established absolutely correctly.
Before we discuss this result we shall explain why we define column c1 in such a complicated way. Suppose the authorship is not established correctly, i.e., t(F( y j)) ¹ j; then we are interested in how close to the correct author we are. If L j(F( y j))/|F( y j)| has a rank of 1 in the set {L i(F( y j))/|F( y j)| | i=0,1,2,3}, then the error is just in one author. Clearly this case is much better than another case where L j(F( y j))/|F( y j)| is of rank 3. (This note is useful in study of Table 2).
Besides, the matrix R has some interesting interpretations. For example, it follows from the first line that the list of author-candidates for the control text “Umenie kidat’ mjach” of K. Bulychev is K. Bulychev, V. Nabokov, N. Gogol’ and A. Volkov. In following two lines V. Nabokov is also the second candidate to control texts of A. Volkov and N. Gogol’. Is it the corollary of historical position of V. Nabokov between N. Gogol’ and A. Volkov with K. Bulychev? If it is true then our method is sensitive to historical epoch of the text. This hypothesis is supported by the data of the last line of matrix R. The list of candidates to the control text of V. Nabokov is V. Nabokov, A. Volkov, K. Bulychev, and N. Gogol’. If the pair of A. Volkov and K. Bulychev is broken by N. Gogol’ then it contradicts the hypothesis. Nevertheless, there are other possible interpretations of matrix R and the author of present paper does not insist on the one above.
It is interesting to know how the matrix R depend on
a) the number and volume of training samples,
b) the homogeneity in genre,
c) the homogeneity in theme,
d) the size of control text,
e) the unit of analysis (the level of letters, words or sentences)
and many other parameters. Here we give some information on point a) and d). Our method works well (i.e. the diagonal of matrix R is zero) when the total size of training samples is more than 100 thousand ASCII symbols and the size of control text is more than 100 thousand ASCII symbols.
Returning to discussion on table 1. The authorship of all control texts is established absolutely correctly. The result is completely unexpected because we use such primitive information as the frequencies of letter pairs. Moreover, a simple computer study (we omit this results here) showed that if there are a small number of author-candidates (less than 6) even estimate (2.2) gives good results. Note that estimate (2.2) is based just on the isolated letter frequencies! The full research is described in the following section. The full study shows that estimate (2.1) is stable even for large number of authors.
4. Results of extended research
Consider n=82 authors of the nineteenth and twentieth centuries given in the electronic library at http://www.lib.ru. Various authors have from 1 to 30 different text samples e.g. Arkadij and Boris Strugatskij have 30 texts. One author, Boris Strugatskij has just one story. In this case the story is cut into two separate texts, one of which acts as a control. The total size of texts for each of chosen authors exceeds 100000 ASCII symbols. The total number of novels, stories and short stories exceeds 1000. These texts are combined in 386 text samples. The total size of data is 128´ 106 ASCII symbols.
For each author we make a list gi,j of texts for estimates Pi and for each author we keep a control "doubtful" text y i. The control text y i is not used in each of calculations of Pi. The total number of control text samples is 82 and the total number of text samples in the training set is 304=386-82. Following research design presented in previous section, we studied the quality of estimates t(F(× )), t(G(× )), e(F(× )), e(G(× )) for all 82 writers. For brevity we shall give the big table for t(G(× )) only. This table is filled out in a similar way with table 1. Again, for brevity we omit corresponding tables L and R.
Table 2
N
Author
c1
c2
c3
c4
0
K. Bulychev
0
15
2007724
64741
1
O. Avramenko
0
6
1733113
223718
2
A. Bol’nykh
0
6
1294721
373611
3
A. Volkov
0
8
1478932
202495
4
G. Glazov
0
5
1398323
184593
5
M. and S. Djachenko
0
5
1754213
197039
6
A. Etoev
0
5
267096
80358
7
A. Kabakov
0
4
905502
222278
8
V. Kaplan
0
6
515029
129608
9
S. Kazmenko
3
4
1846161
156768
10
V. Klimov
0
3
250231
179903
11
I. Krashevskij
0
2
1183722
481795
12
I. Kublitskaja
0
1
282377
170469
13
L. Kudrjavtsev
1
3
583239
179093
14
A. Kurkov
0
6
628041
218726
15
Ju. Latynina
10
2
2628781
283565
16
A. Lazarevich
46
3
310553
94629
17
A. Lazarchuk
0
5
2395669
210151
18
S. Lem
0
7
1568013
343519
19
N. Leonov
0
2
568854
279377
20
S. Loginov
14
13
1998543
159247
21
E. Lukin
0
4
602216
125694
22
V. Chernjak
0
2
920056
201636
23
A.P. Chekhov
0
2
662801
343694
24
I. Khmelevskaja
0
4
1524905
203684
25
L. and E. Lukin
0
3
837198
122999
26
S. Luk’janenko
0
14
3682298
483503
27
N. Markina
0
1
266297
93647
28
M. Naumova
0
3
306514
337821
29
S. Pavlov
0
2
751836
453448
30
B. Rajnov
0
4
1405994
420256
31
N. Rerikh
0
3
1011285
211047
32
N. Romanetskij
2
6
1305096
117147
33
A. Romashov
0
1
88434
87744
34
V. Rybakov
0
6
715406
121497
35
K. Serafimov
0
1
186424
75276
36
I. Sergievskaja
0
1
109118
50786
37
S. Scheglov
10
2
253732
55188
38
A. Schegolev
0
2
848730
105577
39
V. Shinkarev
29
2
156667
80405
40
K. Sitnikov
0
7
419872
109116
41
S. Snegov
0
2
824423
408984
42
A. Stepanov
0
5
1223980
93707
43
A. Stoljarov
11
1
350053
137135
44
R. Svetlov
0
2
454638
268472
45
A. Sviridov
63
3
660413
235439
46
E. Til’man
0
2
705352
464685
47
D. Truskinovskaja
0
8
2005238
118351
48
A. Tjurin
0
18
4109050
110237
49
V. Jugov
0
5
829209
66657
50
A. Molchanov
0
1
398487
206541
51
F.M. Dostoevskij
1
3
613825
88582
52
N.V. Gogol’
0
3
638339
215540
53
D. Kharms
0
2
199449
114889
54
A. Zhitinskij
0
2
2137325
543037
55
E. Khaetskaja
2
2
723167
204091
56
V. Khlumov
0
3
788562
183358
57
V. Kunin
0
3
1335918
296463
58
A. Melikhov
0
1
615548
458086
59
V. Nabokov
0
5
1522633
342774
60
Ju. Nikitin
0
2
1342176
702383
61
V. Segal’
0
2
320218
75917
62
V. Jan
0
1
507502
600636
63
A. Tolstoj
0
1
129664
97842
64
I. Efremov
0
1
536604
256521
65
E. Fedorov
0
1
1120665
221388
66
O. Grinevskij
0
1
158762
96085
67
N. Gumilev
0
1
70181
71042
68
L.N. Tolstoj
0
1
1225242
199903
69
V. Mikhajlov
0
1
254464
84135
70
Ju. Nesterenko
0
1
352988
71075
71
A.S. Pushkin
0
1
170380
57143
72
L. Reznik
0
1
115925
79628
73
M.E. Saltykov-Schedrin
0
1
239289
101845
74
V. Shukshin
0
1
309524
66756
75
S. M. Solov’ev
0
1
2345807
160002
76
A. Kats
0
1
841898
81830
77
E. Kozlovskij
1
1
849038
889560
78
S. Esenin
0
1
219208
44855
79
A. Strugatskij
0
1
151246
51930
80
A. and B. Strugatskij
0
29
6571689
345582
81
B. Strugatskij
0
1
298832
261206
First note that the number of right answers (zeroes in column c1) is very high: 69. The true author is second in the candidate list (a 1 in column c1) in 3 cases: L. Kudrjavtsev, F.M. Dostoevskij and E. Kozlovskij. The true author is third (c1=2) in 2 cases: N. Romanetskij and E. Khaetskaja. Only one author (S. Kazmenko) has the fourth position in list of candidates (c1=3). The error of recognition is very high in case of the other 7 authors (Ju. Latynina, A. Lazarevich, S. Loginov, S. Scheglov, V. Shinkarev, A. Stoljarov, A. Sviridov). They are not in the list of top ten of candidates.
The average rank is a sum of numbers in column c1 divided by the total number of writers 82. The average rank is a measure for error of estimate t(G(×)). Here the average rank is equal to
2.35» (3´ 1+2´2+1´3+2´10+1´11+1´14+1´29+1´46+1´63) / 82
All these numbers are given in table 3 in column for t(G(×)). Suppose we drop 7 hardly recognizable authors; then the average rank is 0.13» 2/15=(3´1+2´2+1´3) / 75.
Now note that the method is applicable to poetry (A.S. Pushkin, S. Esenin and N. Gumilev). Further note that Polish writers (S. Lem and I. Khmelevskaja) are recognizable although their stories are translated from Polish to Russian. Finally note that hardly recognizable authors are not well-known classics.
The following table 3 contains results of similiar research with estimates t(F(x)), e(F(x)), e(G(x)) on the same texts and authors.
Table 3
c1
t(F(×))
t(G(×))
e(F(×))
e(G(×))
0
57
69
1
2
1
4
3
8
8
2
4
2
7
13
3
4
1
2
2
4
0
0
3
7
³ 5
13
7
61
50
Average
3.50
2.35
13.95
12.37
Note that the frequency analysis for isolated letters works badly (we have at most 2 exact answers in the best case). Nevertheless, it gives some information on author because if the true author is chosen at random then the average result in columns e(F(×)) and e(G(×)) should be about 40. Note also that the dropping of words with a capital letter makes the results better (even in case of frequency analysis). Obviously, columns with function G(×) perform better than columns with function F(×).
5. Conclusion
Note from the data of table 3 that estimate (2.1) is better than (2.2) and estimate (2.1) gives right author in greater number of cases (84% against 3%). Note that estimate (2.1) uses information on pairs of letters, but estimate (2.2) uses only information on isolated letters frequency distribution. Therefore the superiority of estimate (2.1) is expected. We stress that the precision of estimate (2.1) is interesting. For instance, the method of author’s invariant (Fomenko et al., 1996) can not distinguish more than 10 writers (here we consider more than 80 writers). Undoubtedly this precision should attract the greater attention to the present method.
Note that the quality of author’s recognition increases significantly when we drop words with capital letter. This phenomenon needs an explanation.
As we mentioned above A.A. Markov was interested in problem of disputed author resolution. It is remarkable that his own idea on "events which are linked to a chain" leads to new approaches to this problem today.
The author is grateful to M.I. Grinchuk for fruitful discussions. Also the author thanks A.T. Fomenko and G.V. Nosovskij for attention to this work and discussions on results. Also the author is grateful to A.A. Polikarpov for discussions which influenced the final shape of this paper (including its translation into English). Author thanks F.J. Tweedie for some language corrections and references.
References
Fomenko, V.P. and Fomenko, T.G. (1996). Avtorskij invariant russkikh literaturnykh tekstov. [Predislovie A.T. Fomenko.] (Author's Quantative Invariant For Russian Literary Texts [Commentary by Academician A.T.Fomenko]). In: Fomenko, A.T. Novaja khronologija Gretsii: Antichnost’ v srednevekov’e. T. 2. M.: Izd-vo MGU, p.768–820. (in Russian).
Holmes, D.I. (1998). The Evolution of Stylometry in Humanities Scholarship. In: Literary and Linguistic Computing, Vol. 13, No. 3.
Ivchenko, G.I. and Medvedev, Yu.I. Matematicheskaja statistika. (Mathematical Statistics) 2nd edition. M.: Vysshaja shkola, 1992. (in Russian).
Markov, A.A. (1913). Primer statisticheskogo issledovanija nad tekstom “Evgenija Onegina” illjustrirujuschij svjaz’ ispytanij v tsep’. (An example of statistical study on text of “Eugeny Onegin” illustrating the linking of events to a chain). In: Izvestija Imp.Akad.nauk, serija VI, T.X, N3, c.153. (in Russian).
Markov, A.A. (1916). Ob odnom primenenii statisticheskogo metoda (On some application of statistical method). In: Izvestia Akademii Nauk. (Russia). Ser.6, vol.X, N4, p.239 (in Russian).
Milov, L.V. (editor) (1994). Ot Nestora do Fonvizina. Novye metody opredelenija avtorstva. (From Nestor to Fonvizin. New methods of determining of authorship) M.: Progress publishing. (in Russian).
Morozov, N.A. (1915). Lingvisticheskie spektry (Linguistic spectrums). In: Izvestia Akademii Nauk (Russia), (Section of Russian Language), Books 1-4, vol.XX, (in Russian).