TRUE WRITER RECOGNITION


Disputed Authorship Resolution through Using Relative Empirical Entropy
for Markov Chains of Letters in Human Language Texts

D.V. Khmelev

Summary

The problem of disputed authorship resolution is addressed here by formal analysis of texts. The analysis is based on a Markov model for the sequence of letters in a text. We assume that the frequencies of letter pairs are highly specific to an author. This assumption is checked in a large statistical experiment carried out on 386 text samples (stories, novels, and their combinations) drawn from the works of 82 Russian fiction writers.

1. Introduction

The first attempts to find specific quantitative structures in an individual author's style date from the beginning of the twentieth century (Morozov, 1915). The famous mathematician A.A. Markov (1916) criticized the method proposed by N.A. Morozov. Markov argued that the characteristics of a writer's style offered by Morozov (1915), namely individual word frequencies such as the distribution of the negative particle “ne”, are unstable. Markov had earlier given an example of a sound statistical approach (Markov, 1913), in which he studied the distribution of vowels and consonants among the first 20000 letters of “Evgenij Onegin”. That paper is the first application of “events which are linked to a chain”, an object that later received the name “Markov chain”; in linguistic studies it is called the Markov model. Markov (1913) thus provides not only a historical basis for the present article, but also a practical one.

Modern methods of disputed authorship resolution are reviewed by Milov (1994) in Russia and by Holmes (1998) in the West. Despite their great variety, all the methods described there share a common disadvantage: none of them has been tested on a large number of authors. Moreover, most of them cannot be tested in this way, because they contain informal steps that require human participation (for example, some require manual separation of words into various classes). Such steps prevent checking on a large number of authors, and indeed all of these methods have been used to analyse only a small number of authors and a limited number of texts.

Fomenko et al. (1996) take another approach. They select a simple parameter of an author's style, the fraction of function words, and check the stability of this fraction over a large number of Russian writers. They find this fraction to be specific enough to each author and call it the author's invariant.

Another method of disputed authorship resolution is proposed in the present work. This method is independent of all the methods described by Fomenko et al. (1996), Holmes (1998) and Milov (1994). We shall not apply it here to well-known cases of disputed authorship of specific texts. The present article is mainly theoretical and is aimed at testing the proposed method, which is based on a Markov model for the sequence of letters.

Assume we have a wide range of large texts in Russian of known authorship. A text in question is known to have been written by one of these writers, but we do not know which. Our goal is to find the author who is closest to the text in question according to the proposed measure. Note that for control purposes we do know the real author of each of the "doubtful" texts.

Consider the sequence of letters of a text as a Markov chain. The matrix of transition frequencies of letter pairs is calculated over all texts of each author, so that we know (approximately) the probability of transition from one letter to another for each author. The true author of an anonymous text is determined by the principle of maximum likelihood: for each matrix we calculate the probability of the anonymous text, and the author with the maximal probability is taken to be the true author. If we take the logarithm of each probability, change its sign, and divide it by the length of the anonymous text, each of the resulting numbers is called the relative empirical entropy. Relative empirical entropy is more convenient for computing, and the true author is expected to have the minimal relative entropy.

The method turns out to be remarkably precise: as reported in Section 4, its best variant names the true author for 69 of 82 control texts. We discuss the details of its application and compare it with another method, based on the frequencies of isolated letters in the text. The method is checked on 386 text samples of 82 writers, among them many well-known Russian writers of the nineteenth and twentieth centuries.

2. Mathematical Foundations

2.1 Markov Model. Suppose $A$ is a set of letters. Let $A^k$ be the set of words of length $k$ over the alphabet $A$, and let $A^*=\bigcup_{k>0} A^k$. By $|f|$ denote the length of $f\in A^*$.

Consider the following formalization of the problem of true writer recognition. Take $n$ sets $C_i$, where $i=0,\dots,n-1$. The set $C_i$ contains sequences $f_{i,j}\in A^*$, where $j=1,\dots,m_i$, i.e., $C_i=\{f_{i,j} \mid j=1,\dots,m_i\}$. The task is to assign a given $x\in A^*$ to one of the sets $C_i$.

Let $f_{i,j}$ be a sample from a Markov chain with transition matrix $\Pi^i$. We now define the estimate $P^i$ of $\Pi^i$. Let $h_{i,j,kl}$ be the number of letter pairs (transitions) $k\to l$ in $f_{i,j}$. Put $h_{i,kl}=\sum_j h_{i,j,kl}$ and $h_{i,k}=\sum_l h_{i,kl}$. By definition, put $P^i_{kl}=h_{i,kl}/h_{i,k}$. Let $Z_i$ be the set of pairs $(k,l)$ such that $P^i_{kl}>0$.

Also, let $x$ be a sample from a Markov chain with transition matrix $\Pi^q$, where $q$ is an unknown number in $\{0,\dots,n-1\}$.

Let $n_{kl}$ denote the number of transitions $k\to l$ in $x$. By definition, put $n_k=\sum_l n_{kl}$. Consider

$$L_i(x) = -\sum_{(k,l)} n_{kl}\,\ln\frac{n_{kl}}{P^i_{kl}\,n_k},$$

where the sum is calculated over $(k,l)\in Z_i$. Let us remark that if $Z_i$ contains all possible pairs $(k,l)$, then $L_i(x)$ is the minus logarithm of the probability of the sequence $x$ when the sequence $x$ is a Markov chain with transition matrix $P^i$. The normalized number $L_i(x)/|x|$ is called the relative empirical entropy with respect to the set $C_i$. Define the maximum likelihood estimate $t(x)$ of the unknown $q$ by the rule

$$t(x)=\arg\min_{i=0,\dots,n-1} \frac{L_i(x)}{|x|}. \qquad (2.1)$$

We shall not discuss or prove any mathematical properties of estimate (2.1), although these properties are of interest for statistics (Ivchenko and Medvedev, 1992, p. 224). Instead, we shall show empirically that estimate (2.1) is remarkably effective in true author recognition.
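The procedure of this subsection can be sketched in Python as follows. This is only a minimal illustration, not the program used for the experiments. It scores a text by the plain minus log-probability per symbol, $-\sum n_{kl}\ln P^i_{kl}/|x|$, following the verbal description in the Introduction, and it skips pairs outside $Z_i$; both choices, as well as the function names `pair_counts`, `train`, `relative_entropy` and `attribute`, are assumptions made for the sketch.

```python
import math
from collections import Counter


def pair_counts(text):
    """Count letter transitions k -> l in an already normalized text."""
    return Counter(zip(text, text[1:]))


def train(texts):
    """Estimate P[k, l] = h_kl / h_k from all texts of one author."""
    h_kl = Counter()
    for f in texts:
        h_kl.update(pair_counts(f))
    h_k = Counter()
    for (k, _l), c in h_kl.items():
        h_k[k] += c
    return {(k, l): c / h_k[k] for (k, l), c in h_kl.items()}


def relative_entropy(x, P):
    """Minus log-probability of x under P per symbol; unseen pairs are skipped."""
    total = 0.0
    for pair, n_kl in pair_counts(x).items():
        if pair in P:
            total -= n_kl * math.log(P[pair])
    return total / len(x)


def attribute(x, models):
    """Return the author index with the smallest relative entropy for x."""
    return min(models, key=lambda i: relative_entropy(x, models[i]))


# Toy usage with two artificial "authors" over the alphabet {a, b, space}.
models = {0: train([" abab abab "]), 1: train([" aabb aabb "])}
print(attribute(" abab ", models))
```

Skipping pairs that never occur in an author's training texts mirrors the restriction of the sum to $Z_i$. On tiny samples this favours sparse models, so the sketch is meaningful only when the training texts are large enough for nearly all pairs to occur.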

2.2 Computational procedure. Take $A=\{\text{lowercase Cyrillic letters}\}\cup\{\text{space symbol}\}$. Suppose we have fairly large texts in Russian by $n$ authors. Let $g_{i,j}$ denote the $j$-th text of author $i$. The text $g_{i,j}$ is a sequence of symbols of an extended alphabet $B$, which also contains punctuation symbols, capital letters, Latin letters, etc. (the set $B$ is extended ASCII).

To each fragment $g_{i,j}\in B^*$ assign the sequence $F(g_{i,j})\in A^*$. Here $F$ is the map from $B^*$ to $A^*$ under which all capital letters are converted to lowercase, words split by hyphens are joined, all punctuation and extra space symbols are dropped, a single delimiting space between words is kept, and a space symbol is inserted at the beginning and at the end of the fragment if absent.

Besides, consider the map $G\colon B^*\to A^*$. The map $G$ has the same structure as $F$, but it additionally drops all words containing a capital letter in the text $g_{i,j}$. In particular, if

$y$ = “Krome togo, my budem rassmatrivat’ funktsiju G,”,

then $F(y)$ = “ krome togo my budem rassmatrivat’ funktsiju ”,

and $G(y)$ = “ togo my budem rassmatrivat’ funktsiju ”.
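The two maps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the original preprocessing code: the alphabet $A$ is taken to be the 33 lowercase Russian letters plus the space, a word is dropped by `G` if it contains any uppercase character, hyphens are simply deleted so that the parts they connect are joined, and the Cyrillic sentence used below is our back-transliteration of the example above. The helper names `LETTERS`, `_clean` and `_finish` are hypothetical.

```python
# Assumption: alphabet A = 33 lowercase Russian letters (plus the space).
LETTERS = set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")


def _clean(word):
    """Lowercase a word and keep only letters of A (hyphens, punctuation vanish)."""
    return "".join(ch for ch in word.lower() if ch in LETTERS)


def _finish(words):
    """Join words with single spaces and put one space at both ends."""
    if not words:
        return " "
    return " " + " ".join(words) + " "


def F(text):
    """Map a raw text over B to a sequence over A."""
    cleaned = (_clean(w) for w in text.split())
    return _finish([w for w in cleaned if w])


def G(text):
    """Like F, but first drop every word that contains a capital letter."""
    kept = [w for w in text.split() if not any(ch.isupper() for ch in w)]
    return F(" ".join(kept))


y = "Кроме того, мы будем рассматривать функцию G,"
print(repr(F(y)))  # ' кроме того мы будем рассматривать функцию '
print(repr(G(y)))  # ' того мы будем рассматривать функцию '
```

Dropping a word as soon as it contains any capital letter removes proper names and sentence-initial words alike; whether the original implementation tested only the first letter is not stated in the text.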

Now suppose a text $y\in B^*$ is written by one of the $n$ authors, but the true author is unknown. How do we find the true author of $y$? Using estimate (2.1) with the sequences $x=F(y)$ and $x=G(y)$, we obtain two ways of determining the author:

1) the true author is $t(F(y))$,

2) the true author is $t(G(y))$.

Note that the estimates $t(F(y))$ and $t(G(y))$ use only information on letter pairs. They are independent of word order, because all words are delimited by space symbols. Perhaps they draw indirectly on information about morphemic units (and their combinations within word forms), but they certainly use no syntactic or phraseological information (the methods described by Milov (1994), by contrast, are based on syntactic information).

Formally, no natural language text is a Markov chain: this hypothesis is rejected by statistical tests. Nevertheless, one can formally calculate estimate (2.1) and apply it to human language texts, relying on the stability of estimates obtained by the principle of maximum likelihood. In the present paper we show that estimate (2.1) gives the true author in most cases.

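2.3 Frequency analysis for isolated letters. Assume that the sequences $f_{i,j}$ and $x$ are samples of independent identically distributed random variables taking values in $A$. Under this assumption, estimate (2.1) takes the following form:

$$e(x) = \arg\min_i \frac{G_i(x)}{|x|}, \qquad (2.2)$$

where

$$G_i(x) = -\sum_k n_k \ln\frac{n_k\,h_i}{h_{i,k}\,n},$$

the sum is taken over $k$ such that $n_k>0$, and $n=\sum_k n_k$, $h_i=\sum_k h_{i,k}$. As in Section 2.1, $n_k$ and $h_{i,k}$ are the counts for the letter $k$ in $x$ and in the texts of $C_i$ respectively; here $n$ denotes the total count over all letters of $x$, not the number of authors. Estimate (2.2) is usually called frequency analysis. Our calculations show that $e(x)$ is significantly worse than $t(x)$.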
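For comparison, frequency analysis of isolated letters can be sketched in the same style. As before, this is only an illustration: the score is the minus log-probability per symbol under independent letters, letters never seen in an author's training texts are skipped, and the names `letter_model`, `letter_entropy` and `attribute_by_letters` are assumptions of the sketch.

```python
import math
from collections import Counter


def letter_model(texts):
    """Estimate letter probabilities h_{i,k} / h_i from all texts of one author."""
    h = Counter()
    for f in texts:
        h.update(f)
    total = sum(h.values())
    return {k: c / total for k, c in h.items()}


def letter_entropy(x, p):
    """Minus log-probability of x per symbol for i.i.d. letters; unseen letters skipped."""
    counts = Counter(x)
    return -sum(n * math.log(p[k]) for k, n in counts.items() if k in p) / len(x)


def attribute_by_letters(x, models):
    """Return the author index with the smallest per-symbol score for x."""
    return min(models, key=lambda i: letter_entropy(x, models[i]))


# Toy usage: author 0 prefers "a", author 1 prefers "b".
models = {0: letter_model(["aaab"]), 1: letter_model(["abbb"])}
print(attribute_by_letters("aab", models), attribute_by_letters("abb", models))
```

The only difference from the letter-pair sketch is the statistic being counted, which makes it easy to run both estimates on the same normalized texts.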

3. An example of research

Consider an example of how we check the quality of the method for disputed authorship resolution. We shall check the estimate $t(F(y))$ on texts of K. Bulychev, A. Volkov, N.V. Gogol’ and V. Nabokov.

We check it in the following way. Choose a control text $y^i$ for each author ($i=0,1,2,3$), calculate the estimates $P^i$ of the transition matrices from the author's other texts $f_{i,j}=F(g_{i,j})$, and calculate $t(F(y^i))$. If the estimate is good, then $t(F(y^i))=i$ for each author.

0) K. Bulychev: Umenie kidat’ mjach ( y⁰); Beloe plat’e zolushki (g_0,1); Velikij dukh i begletsy (g_0,2); Glubokouvazhaemyj mikrob (g_0,3); Zakon dlja drakona (g_0,4); Ljubimets [Sponsory] (g_0,5); Marsianskoe zel’e (g_0,6); Miniatjury (g_0,7); “Mozhno poprosit’ Ninu?” (g_0,8); Na dnjakh zemletrjasenie v Ligone (g_0,9); Pereval (g_0,10); Pokazanija Oli N. (g_0,11); Pominal’nik XX veka (g_0,12); Raskopki kurganov v doline Repedelkinok (g_0,13); Trinadtsat’ let puti (g_0,14); Smert’ etazhom nizhe (g_0,15);

1) A. Volkov: Sem’ podzemnykh korolej ( y¹); Volshebnik izumrudnogo goroda (g_1,1); Urfin Dzhjus i ego derevjannye soldaty (g_1,2); Ognennyj bog Marranov (g_1,3); Genial’nyj pen’ (g_1,4); Na vojne, kak na vojne (g_1,5); O chem molchali gazety... (g_1,6); Prestuplenie i nakazanie (g_1,7); Epilog (g_1,8); Zheltyj Tuman (g_1,9); Tajna zabroshennogo zamka (g_1,10);

2) N.V. Gogol’: Rasskazy i povesti ( y², titles of the stories: “Povest’ o tom, kak possorilsja Ivan Ivanovich s Ivanom Nikiforovichem”, “Starosvetskie pomeschiki”, “Vij”, “Zapiski sumasshedshego”); Revizor (g_2,1); Taras Bul’ba (g_2,2); Vechera na khutore bliz Dikan’ki (g_2,3);

3) V. Nabokov: Drugie berega ( y³); Korol’, dama, valet (g_3,1); Lolita (g_3,2); Mashen’ka (g_3,3); Rasskazy (g_3,4); Nezavershennyj roman (g_3,5).

For example, the control text of A. Volkov is $y^1$, i.e., “Sem’ podzemnykh korolej”. The other texts of A. Volkov are used in computing $P^1$.
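The checking scheme just described (hold out one control text per author, train on the rest, and see where the true author lands) can be sketched as follows. The sketch is self-contained and purely illustrative: the scoring is the minus log-probability per symbol with unseen pairs skipped, the input texts are assumed to be already normalized, and the names `model`, `score` and `true_author_rank` are hypothetical.

```python
import math
from collections import Counter


def model(texts):
    """Transition frequencies P[k, l] estimated from an author's training texts."""
    pairs = Counter(p for t in texts for p in zip(t, t[1:]))
    firsts = Counter()
    for (k, _l), c in pairs.items():
        firsts[k] += c
    return {kl: c / firsts[kl[0]] for kl, c in pairs.items()}


def score(x, P):
    """Minus log-probability of x per symbol under P; unseen pairs are skipped."""
    n = Counter(zip(x, x[1:]))
    return -sum(c * math.log(P[kl]) for kl, c in n.items() if kl in P) / len(x)


def true_author_rank(i, control, training):
    """Rank (0 = best) of author i among all candidates for the control text.

    `training` maps each author to the texts that remain after the control
    text has been held out.
    """
    models = {a: model(texts) for a, texts in training.items()}
    ordered = sorted(models, key=lambda a: score(control, models[a]))
    return ordered.index(i)


# Toy usage with two artificial "authors".
training = {0: [" abab abab "], 1: [" aabb aabb "]}
print(true_author_rank(0, " abab ", training))
```

A rank of 0 means that the held-out text is attributed to its true author; larger ranks say how many wrong candidates are placed ahead of the true one.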

The results of the calculations are presented in Table 1.

Table 1

N   Author        c₁   c₂   c₃        c₄
0   K. Bulychev   0    15   2345689   75161
1   A. Volkov     0    8    1733165   233418
2   N.V. Gogol’   0    3    723812    243767
3   V. Nabokov    0    5    1658626   367179

The column c₂ contains the total number of files with texts of each author. The number of files and the number of texts can differ in two cases. Firstly, several texts of an author can be collected in one file (this happens in the case of A. Volkov: the three stories “Zheltyj Tuman”, “Tajna zabroshennogo zamka” and “Ognennyj bog Marranov” are collected in one file). Secondly, one large text can be cut into smaller parts (recall this when studying Table 2).

The column c₃ contains the total number of symbols (Cyrillic letters and spaces) in the training texts: c₃ = $\sum_j |F(g_{i,j})|$. The column c₄ contains the number of symbols in the control text, i.e., c₄ = $|F(y^i)|$. For example, the total volume $\sum_j |F(g_{0,j})|$ of the texts of K. Bulychev is 2'345'689, and the volume of $F(y^0)$, i.e., the number of symbols of the alphabet $A$ in the control story “Umenie kidat’ mjach”, is 75'161.

The column c₁ in line $j$ contains the rank of $L_j(F(y^j))$, that is, its position (counting from 0) in the set of numbers $\{L_i(F(y^j)) \mid i=0,1,2,3\}$ arranged from smallest to largest. For example, if $j=1$ and the numbers are ordered as $L_0\le L_3\le L_2\le L_1$, then $L_1$ has rank 3. If $j=0$ and the numbers are ordered in the same way, $L_0\le L_3\le L_2\le L_1$, then $L_0$ has rank 0. The rank of $L_j(F(y^j))$ in the set $\{L_i(F(y^j)) \mid i=0,1,2,3\}$ equals the rank of $L_j(F(y^j))/|F(y^j)|$ among the numbers $\{L_i(F(y^j))/|F(y^j)| \mid i=0,1,2,3\}$. Put the numbers $L_i(F(y^j))/|F(y^j)|$, $i=0,1,2,3$, in lines $j=0,1,2,3$ of the following matrix:

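$$L = \begin{pmatrix}
2.484569 & 2.508425 & 2.504301 & 2.493778 \\
2.501061 & 2.473907 & 2.516797 & 2.492874 \\
2.499033 & 2.504508 & 2.480202 & 2.483829 \\
2.541367 & 2.538101 & 2.548842 & 2.520018
\end{pmatrix}$$

Ranking the numbers in each line of $L$ gives the matrix

$$R = \begin{pmatrix}
0 & 3 & 2 & 1 \\
2 & 0 & 3 & 1 \\
2 & 3 & 0 & 1 \\
2 & 1 & 3 & 0
\end{pmatrix}$$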

The required numbers of column c₁ are the diagonal of the matrix $R$. By (2.1), $t(F(y^j))=j$ if and only if the number $L_j(F(y^j))/|F(y^j)|$ has rank 0 in the set $\{L_i(F(y^j))/|F(y^j)| \mid i=0,1,2,3\}$. Therefore a 0 in column c₁ of Table 1 means that the authorship of the control text is established correctly. Table 1 shows that the authorship is established correctly for all four control texts.
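The ranks can be recomputed directly from the relative entropies. The following Python sketch copies the sixteen values of the matrix $L$, ranks each line, and reads column c₁ off the diagonal; the helper name `ranks` is ours.

```python
# Relative entropies from the matrix L: line j is control text j, column i is author i.
L = [
    [2.484569, 2.508425, 2.504301, 2.493778],
    [2.501061, 2.473907, 2.516797, 2.492874],
    [2.499033, 2.504508, 2.480202, 2.483829],
    [2.541367, 2.538101, 2.548842, 2.520018],
]


def ranks(row):
    """Rank of each entry within its row, counting from 0 for the smallest."""
    order = sorted(range(len(row)), key=row.__getitem__)
    result = [0] * len(row)
    for rank, i in enumerate(order):
        result[i] = rank
    return result


R = [ranks(row) for row in L]
c1 = [R[j][j] for j in range(len(R))]  # diagonal of R = column c1 of Table 1

for row in R:
    print(row)
print("c1 =", c1)
```

Running the sketch reproduces the matrix $R$ line by line and gives a diagonal of zeros, in agreement with column c₁ of Table 1.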

Before discussing this result, we explain why column c₁ is defined in such a complicated way. Suppose the authorship is not established correctly, i.e., $t(F(y^j))\ne j$; then we are interested in how close to the correct author we are. If $L_j(F(y^j))/|F(y^j)|$ has rank 1 in the set $\{L_i(F(y^j))/|F(y^j)| \mid i=0,1,2,3\}$, then only one wrong author is ranked ahead of the true one. Clearly this case is much better than the case where $L_j(F(y^j))/|F(y^j)|$ has rank 3. (This remark is useful in the study of Table 2.)

Besides, the matrix $R$ has some interesting interpretations. For example, it follows from the first line that the list of candidate authors for the control text “Umenie kidat’ mjach” of K. Bulychev is, in order, K. Bulychev, V. Nabokov, N. Gogol’ and A. Volkov. In the following two lines V. Nabokov is also the second candidate for the control texts of A. Volkov and N. Gogol’. Is this a consequence of the historical position of V. Nabokov between N. Gogol’ on the one hand and A. Volkov and K. Bulychev on the other? If so, our method is sensitive to the historical epoch of the text. This hypothesis is supported by the last line of the matrix $R$: the list of candidates for the control text of V. Nabokov is V. Nabokov, A. Volkov, K. Bulychev and N. Gogol’. Had N. Gogol’ come between A. Volkov and K. Bulychev, the hypothesis would have been contradicted. Nevertheless, other interpretations of the matrix $R$ are possible, and the author of the present paper does not insist on the one above.

It is interesting to know how the matrix $R$ depends on

a) the number and volume of training samples,

b) the homogeneity in genre,

c) the homogeneity in theme,

d) the size of control text,

e) the unit of analysis (the level of letters, words or sentences)

and many other parameters. Here we give some information on points a) and d). Our method works well (i.e., the diagonal of the matrix $R$ is zero) when the total size of the training samples exceeds 100 thousand ASCII symbols and the size of the control text exceeds 100 thousand ASCII symbols.

Let us return to the discussion of Table 1. The authorship of all four control texts is established correctly. This result is unexpected, because we use such primitive information as the frequencies of letter pairs. Moreover, a simple computer study (we omit its results here) showed that if the number of candidate authors is small (fewer than 6), even estimate (2.2) gives good results, although it is based only on the frequencies of isolated letters. The full study is described in the following section; it shows that estimate (2.1) remains stable even for a large number of authors.

4. Results of extended research

Consider $n=82$ authors of the nineteenth and twentieth centuries whose works are available in the electronic library at http://www.lib.ru. The authors have from 1 to 30 different text samples each; for example, Arkadij and Boris Strugatskij have 30 texts. One author, Boris Strugatskij, has just one story; in this case the story is cut into two separate texts, one of which acts as the control. The total size of the texts of each chosen author exceeds 100000 ASCII symbols. The total number of novels, stories and short stories exceeds 1000. These texts are combined into 386 text samples. The total size of the data is $128\times 10^6$ ASCII symbols.

For each author we make a list $g_{i,j}$ of texts for the estimate $P^i$, and for each author we keep a control "doubtful" text $y^i$. The control text $y^i$ is not used in any of the calculations of $P^i$. The total number of control text samples is 82, and the total number of text samples in the training set is 304 = 386 - 82. Following the research design presented in the previous section, we studied the quality of the estimates $t(F(\cdot))$, $t(G(\cdot))$, $e(F(\cdot))$ and $e(G(\cdot))$ for all 82 writers. For brevity we give the big table for $t(G(\cdot))$ only. This table is filled out in the same way as Table 1. Again for brevity, we omit the corresponding matrices $L$ and $R$.

Table 2

N   Author                  c₁   c₂   c₃        c₄
0   K. Bulychev             0    15   2007724   64741
1   O. Avramenko            0    6    1733113   223718
2   A. Bol’nykh             0    6    1294721   373611
3   A. Volkov               0    8    1478932   202495
4   G. Glazov               0    5    1398323   184593
5   M. and S. Djachenko     0    5    1754213   197039
6   A. Etoev                0    5    267096    80358
7   A. Kabakov              0    4    905502    222278
8   V. Kaplan               0    6    515029    129608
9   S. Kazmenko             3    4    1846161   156768
10  V. Klimov               0    3    250231    179903
11  I. Krashevskij          0    2    1183722   481795
12  I. Kublitskaja          0    1    282377    170469
13  L. Kudrjavtsev          1    3    583239    179093
14  A. Kurkov               0    6    628041    218726
15  Ju. Latynina            10   2    2628781   283565
16  A. Lazarevich           46   3    310553    94629
17  A. Lazarchuk            0    5    2395669   210151
18  S. Lem                  0    7    1568013   343519
19  N. Leonov               0    2    568854    279377
20  S. Loginov              14   13   1998543   159247
21  E. Lukin                0    4    602216    125694
22  V. Chernjak             0    2    920056    201636
23  A.P. Chekhov            0    2    662801    343694
24  I. Khmelevskaja         0    4    1524905   203684
25  L. and E. Lukin         0    3    837198    122999
26  S. Luk’janenko          0    14   3682298   483503
27  N. Markina              0    1    266297    93647
28  M. Naumova              0    3    306514    337821
29  S. Pavlov               0    2    751836    453448
30  B. Rajnov               0    4    1405994   420256
31  N. Rerikh               0    3    1011285   211047
32  N. Romanetskij          2    6    1305096   117147
33  A. Romashov             0    1    88434     87744
34  V. Rybakov              0    6    715406    121497
35  K. Serafimov            0    1    186424    75276
36  I. Sergievskaja         0    1    109118    50786
37  S. Scheglov             10   2    253732    55188
38  A. Schegolev            0    2    848730    105577
39  V. Shinkarev            29   2    156667    80405
40  K. Sitnikov             0    7    419872    109116
41  S. Snegov               0    2    824423    408984
42  A. Stepanov             0    5    1223980   93707
43  A. Stoljarov            11   1    350053    137135
44  R. Svetlov              0    2    454638    268472
45  A. Sviridov             63   3    660413    235439
46  E. Til’man              0    2    705352    464685
47  D. Truskinovskaja       0    8    2005238   118351
48  A. Tjurin               0    18   4109050   110237
49  V. Jugov                0    5    829209    66657
50  A. Molchanov            0    1    398487    206541
51  F.M. Dostoevskij        1    3    613825    88582
52  N.V. Gogol’             0    3    638339    215540
53  D. Kharms               0    2    199449    114889
54  A. Zhitinskij           0    2    2137325   543037
55  E. Khaetskaja           2    2    723167    204091
56  V. Khlumov              0    3    788562    183358
57  V. Kunin                0    3    1335918   296463
58  A. Melikhov             0    1    615548    458086
59  V. Nabokov              0    5    1522633   342774
60  Ju. Nikitin             0    2    1342176   702383
61  V. Segal’               0    2    320218    75917
62  V. Jan                  0    1    507502    600636
63  A. Tolstoj              0    1    129664    97842
64  I. Efremov              0    1    536604    256521
65  E. Fedorov              0    1    1120665   221388
66  O. Grinevskij           0    1    158762    96085
67  N. Gumilev              0    1    70181     71042
68  L.N. Tolstoj            0    1    1225242   199903
69  V. Mikhajlov            0    1    254464    84135
70  Ju. Nesterenko          0    1    352988    71075
71  A.S. Pushkin            0    1    170380    57143
72  L. Reznik               0    1    115925    79628
73  M.E. Saltykov-Schedrin  0    1    239289    101845
74  V. Shukshin             0    1    309524    66756
75  S.M. Solov’ev           0    1    2345807   160002
76  A. Kats                 0    1    841898    81830
77  E. Kozlovskij           1    1    849038    889560
78  S. Esenin               0    1    219208    44855
79  A. Strugatskij          0    1    151246    51930
80  A. and B. Strugatskij   0    29   6571689   345582
81  B. Strugatskij          0    1    298832    261206

First note that the number of right answers (zeroes in column c₁) is very high: 69 of 82. The true author is second in the candidate list (a 1 in column c₁) in 3 cases: L. Kudrjavtsev, F.M. Dostoevskij and E. Kozlovskij. The true author is third (c₁=2) in 2 cases: N. Romanetskij and E. Khaetskaja. Only one author (S. Kazmenko) takes the fourth position in the list of candidates (c₁=3). The recognition error is very large for the remaining 7 authors (Ju. Latynina, A. Lazarevich, S. Loginov, S. Scheglov, V. Shinkarev, A. Stoljarov, A. Sviridov): none of them is among the top ten candidates for their own control text.

The average rank is the sum of the numbers in column c₁ divided by the total number of writers, 82. The average rank is a measure of the error of the estimate $t(G(\cdot))$. Here the average rank is equal to

$$\frac{3\times 1+2\times 2+1\times 3+2\times 10+1\times 11+1\times 14+1\times 29+1\times 46+1\times 63}{82}\approx 2.35.$$

All these numbers are given in Table 3 in the column for $t(G(\cdot))$. If we drop the 7 hardly recognizable authors, the average rank is $(3\times 1+2\times 2+1\times 3)/75 = 2/15\approx 0.13$.
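The arithmetic can be checked with a few lines of Python. The list below contains the thirteen nonzero entries of column c₁ of Table 2 as enumerated in the text; the variable names are ours.

```python
# Nonzero ranks from column c1 of Table 2; the remaining 69 ranks are 0.
nonzero_ranks = [1, 1, 1, 2, 2, 3, 10, 10, 11, 14, 29, 46, 63]
n_authors = 82

# Average rank over all 82 authors.
average = sum(nonzero_ranks) / n_authors

# Dropping the 7 authors with rank 10 or more leaves 75 authors.
easy = [r for r in nonzero_ranks if r < 10]
average_easy = sum(easy) / (n_authors - 7)

print(round(average, 2), round(average_easy, 2))
```

The sum of the nonzero ranks is 193, so the two averages are 193/82 and 10/75.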

Now note that the method is applicable to poetry (A.S. Pushkin, S. Esenin and N. Gumilev). Further, note that the Polish writers (S. Lem and I. Khmelevskaja) are recognizable although their stories are translated from Polish into Russian. Finally, note that the hardly recognizable authors are not well-known classics.

Table 3 contains the results of similar research with the estimates t(F(x)), e(F(x)) and e(G(x)) on the same texts and authors.

Table 3

c₁        t(F(·))   t(G(·))   e(F(·))   e(G(·))
0         57        69        1         2
1         4         3         8         8
2         4         2         7         13
3         4         1         2         2
4         0         0         3         7
≥ 5       13        7         61        50
Average   3.50      2.35      13.95     12.37

Note that the frequency analysis for isolated letters works badly (at most 2 exact answers, in the best case). Nevertheless, it gives some information on the author, because if the true author were ranked at random, the average result in the columns $e(F(\cdot))$ and $e(G(\cdot))$ would be about 40. Note also that dropping words with a capital letter improves the results (even in the case of frequency analysis): the columns with the function $G(\cdot)$ perform better than the columns with the function $F(\cdot)$.
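As a quick check of the figure "about 40", assume that a method carrying no information places the true author uniformly at random among the 82 positions $0,\dots,81$ (this uniformity is our assumption). The expected value of c₁ is then

$$\mathbb{E}[c_1]=\frac{1}{82}\sum_{r=0}^{81} r=\frac{81}{2}=40.5,$$

so the averages 13.95 and 12.37 in Table 3 lie well below the chance level.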

5. Conclusion

Table 3 shows that estimate (2.1) is better than estimate (2.2): in the best case it gives the right author for 69 of 82 control texts (84%), against 2 of 82 (about 2%) for estimate (2.2). Estimate (2.1) uses information on pairs of letters, whereas estimate (2.2) uses only the frequency distribution of isolated letters, so the superiority of estimate (2.1) is to be expected. What is notable is the precision of estimate (2.1). For instance, the method of the author's invariant (Fomenko et al., 1996) cannot distinguish more than 10 writers, whereas here we consider more than 80. This precision should attract greater attention to the present method.

Note that the quality of author recognition increases significantly when we drop words with a capital letter. This phenomenon needs an explanation.

As we mentioned above, A.A. Markov was interested in the problem of disputed authorship resolution. It is remarkable that his own idea of "events which are linked to a chain" leads to new approaches to this problem today.

The author is grateful to M.I. Grinchuk for fruitful discussions. The author also thanks A.T. Fomenko and G.V. Nosovskij for their attention to this work and for discussions of the results, and is grateful to A.A. Polikarpov for discussions which influenced the final shape of this paper (including its translation into English). The author thanks F.J. Tweedie for some language corrections and references.

References

Fomenko, V.P. and Fomenko, T.G. (1996). Avtorskij invariant russkikh literaturnykh tekstov. [Predislovie A.T. Fomenko.] (Author's Quantitative Invariant for Russian Literary Texts [Commentary by Academician A.T. Fomenko]). In: Fomenko, A.T. Novaja khronologija Gretsii: Antichnost’ v srednevekov’e. T. 2. M.: Izd-vo MGU, p.768–820. (in Russian).

Holmes, D.I. (1998). The Evolution of Stylometry in Humanities Scholarship. In: Literary and Linguistic Computing, Vol. 13, No. 3.

Ivchenko, G.I. and Medvedev, Yu.I. (1992). Matematicheskaja statistika (Mathematical Statistics). 2nd edition. M.: Vysshaja shkola. (in Russian).

Markov, A.A. (1913). Primer statisticheskogo issledovanija nad tekstom “Evgenija Onegina” illjustrirujuschij svjaz’ ispytanij v tsep’ (An example of statistical study on the text of “Eugene Onegin” illustrating the linking of events to a chain). In: Izvestija Imp.Akad.nauk, serija VI, T.X, N3, p.153. (in Russian).

Markov, A.A. (1916). Ob odnom primenenii statisticheskogo metoda (On an application of the statistical method). In: Izvestia Akademii Nauk (Russia), Ser.6, vol.X, N4, p.239. (in Russian).

Milov, L.V. (editor) (1994). Ot Nestora do Fonvizina. Novye metody opredelenija avtorstva (From Nestor to Fonvizin. New methods of determining authorship). M.: Progress publishing. (in Russian).

Morozov, N.A. (1915). Lingvisticheskie spektry (Linguistic spectra). In: Izvestia Akademii Nauk (Russia), (Section of Russian Language), Books 1-4, vol.XX. (in Russian).
