STYLOMETRY
Stylometry is the application of the study of linguistic style, usually to written language. In recent years it has also been applied successfully to music and to fine-art paintings.
Stylometry is often used to attribute authorship to anonymous or disputed documents. It has legal as well as academic and literary applications, ranging from the question of the authorship of Shakespeare's works to forensic linguistics.
An early example is Lorenzo Valla's 1439 proof that the Donation of Constantine was a forgery, an argument based partly on a comparison of its Latin with that used in authentic 4th-century documents.
Methods
Modern stylometry relies heavily on computers for statistical analysis and artificial intelligence, and on access to the growing corpus of texts available via the Internet.
Whereas stylometry in the past emphasized the rarest or most striking elements of a text, contemporary techniques can isolate identifying patterns even in common parts of speech.
Frequency in chunks
In one method of determining style, the text is analyzed to find its 50 most common words. The text is then broken into 5,000-word chunks, and each chunk is analyzed to find the frequency of those 50 words within it. This yields a 50-number profile for each chunk, which places the chunk at a point in a 50-dimensional space. That space is flattened into a plane using principal component analysis (PCA), producing a display of points that reflects an author's style. If two literary works are plotted on the same plane, the resulting pattern may indicate whether they were written by the same author or by different authors.
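The procedure above can be sketched in a few functions. This is a minimal illustration using only the Python standard library, not a reference implementation: the function names, the simple regular-expression tokenizer, the use of relative rather than raw frequencies, and the power-iteration approach to PCA are all assumptions made for the example.

```python
import math
import random
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def chunk_profiles(text, n_words=50, chunk_size=5000):
    """Return the most common words and one frequency vector per full chunk."""
    tokens = tokenize(text)
    vocab = [word for word, _ in Counter(tokens).most_common(n_words)]
    profiles = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        counts = Counter(tokens[start:start + chunk_size])
        profiles.append([counts[word] / chunk_size for word in vocab])
    return vocab, profiles

def project_2d(profiles, iterations=200, seed=0):
    """Project the profiles onto their first two principal components."""
    n, d = len(profiles), len(profiles[0])
    means = [sum(row[j] for row in profiles) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in profiles]
    # Scatter matrix: proportional to the covariance, so it has the same eigenvectors.
    scatter = [[sum(r[i] * r[j] for r in centered) for j in range(d)]
               for i in range(d)]
    rng = random.Random(seed)
    axes = []
    for _ in range(2):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):  # power iteration toward the top eigenvector
            w = [sum(scatter[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left in the remaining directions
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * scatter[i][j] * v[j]
                         for i in range(d) for j in range(d))
        # Deflate so that the next pass finds the following component.
        scatter = [[scatter[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
                   for i in range(d)]
        axes.append(v)
    return [[sum(r[j] * axis[j] for j in range(d)) for axis in axes]
            for r in centered]
```

Calling `chunk_profiles` on the concatenated text of two works and passing the result to `project_2d` gives one (x, y) point per chunk, which can then be plotted. In practice a numerical library would normally replace the hand-written PCA step.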
Neural networks
Neural networks have been used to analyze the authorship of texts. One such network was built with links of random initial strength and presented with training texts of known authorship. Each time the network guessed incorrectly, it adjusted the strengths of its links, and this continued until it could correctly identify the known texts. Once training was complete, the network could attribute further texts by the authors it had been trained on.
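The training loop described above (random initial link strengths, adjusted only after a wrong guess) can be illustrated with a single-layer perceptron. This is a sketch under stated assumptions rather than a reconstruction of the network mentioned in the text: the architecture, the learning rate, the two-author setup, and the idea that each text is represented by a short vector of word frequencies are all hypothetical.

```python
import random

def predict(weights, bias, features):
    """Return 1 or 0 (two hypothetical authors) for one feature vector."""
    total = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, labels, epochs=100, rate=0.1, seed=0):
    """Start from random link strengths and adjust them after each wrong guess."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.5, 0.5) for _ in samples[0]]
    bias = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        mistakes = 0
        for features, label in zip(samples, labels):
            guess = predict(weights, bias, features)
            if guess != label:
                mistakes += 1
                step = rate * (label - guess)  # +rate or -rate
                weights = [w + step * x for w, x in zip(weights, features)]
                bias += step
        if mistakes == 0:  # every known text is now identified correctly
            break
    return weights, bias
```

For example, with `samples` holding per-text frequencies of two chosen words and `labels` holding 0 or 1 for the known author, `train_perceptron(samples, labels)` returns weights that `predict` can apply to a new text. A single-layer network of this kind only succeeds when the two authors' feature vectors are linearly separable.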
Genetic algorithms
The genetic algorithm is another artificial intelligence technique used in stylometry. The method starts out with a set of rules (100 in this description). An example rule might be, "If 'but' appears more than 1.7 times in every thousand words, then the text is by author X". The program is presented with text and uses the rules to determine authorship. The rules are tested against a set of known texts, and each rule is given a fitness score. The 50 rules with the lowest scores are thrown out. The remaining 50 rules are given small changes, and 50 new rules are introduced. This is repeated until the evolved rules correctly attribute the texts.
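The cycle of scoring, discarding, mutating, and replenishing can be sketched as follows. The representation of a rule as a (word, threshold, author) tuple, the fitness definition (the share of known texts on which the rule fires exactly for its named author), the Gaussian mutation, and all function names are assumptions made for illustration; the text does not specify them.

```python
import random

def rate_per_thousand(tokens, word):
    """Occurrences of `word` per thousand tokens."""
    return 1000.0 * tokens.count(word) / len(tokens)

def random_rule(rng, words, authors, max_rate=50.0):
    """A rule (word, threshold, author): above the threshold, the text is by author."""
    return (rng.choice(words), rng.uniform(0.0, max_rate), rng.choice(authors))

def fitness(rule, corpus):
    """Share of known texts on which the rule fires exactly when it should."""
    word, threshold, author = rule
    correct = 0
    for tokens, true_author in corpus:
        fires = rate_per_thousand(tokens, word) > threshold
        if fires == (true_author == author):
            correct += 1
    return correct / len(corpus)

def evolve(corpus, words, authors, pop_size=100, generations=50, seed=0):
    """Evolve rules and return the best one seen."""
    rng = random.Random(seed)

    def score(rule):
        return fitness(rule, corpus)

    population = [random_rule(rng, words, authors) for _ in range(pop_size)]
    best = max(population, key=score)
    for _ in range(generations):
        if score(best) == 1.0:  # a rule already attributes every known text
            break
        # Throw out the lower-scoring half of the rules.
        survivors = sorted(population, key=score, reverse=True)[:pop_size // 2]
        # Give each survivor a small change to its threshold.
        mutated = [(w, max(0.0, t + rng.gauss(0.0, 1.0)), a)
                   for w, t, a in survivors]
        # Introduce new random rules to restore the population size.
        fresh = [random_rule(rng, words, authors)
                 for _ in range(pop_size - len(mutated))]
        population = mutated + fresh
        challenger = max(population, key=score)
        if score(challenger) > score(best):
            best = challenger
    return best
```

Here `corpus` is a list of `(tokens, author)` pairs for texts of known authorship, and `words` is the list of candidate words that rules may test. Tracking the best rule separately is a design choice of this sketch, since mutating every survivor could otherwise lose a good rule.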
Rare pairs
One method for identifying style, called "rare pairs", relies on individual habits of collocation. For a particular author, the use of certain words may idiosyncratically entail the use of other, predictable words.
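One possible reading of this idea is to count word pairs that occur close together, keep those an author uses repeatedly but that are absent from a reference corpus, and then look for the same pairs in a disputed text. The sketch below follows that reading only; the window size, the thresholds, and the function names are hypothetical, and the text does not define the method in this much detail.

```python
from collections import Counter

def pair_counts(tokens, window=5):
    """Count unordered word pairs that occur fewer than `window` positions apart."""
    counts = Counter()
    for i, first in enumerate(tokens):
        for second in tokens[i + 1:i + window]:
            if first != second:
                counts[tuple(sorted((first, second)))] += 1
    return counts

def rare_pairs(author_tokens, reference_tokens, window=5,
               min_author=2, max_reference=0):
    """Pairs the author repeats but that are rare in the reference corpus."""
    author = pair_counts(author_tokens, window)
    reference = pair_counts(reference_tokens, window)
    return {pair: n for pair, n in author.items()
            if n >= min_author and reference[pair] <= max_reference}

def shared_rare_pairs(disputed_tokens, signature, window=5):
    """Which of the author's rare pairs also appear in a disputed text."""
    disputed = pair_counts(disputed_tokens, window)
    return sorted(pair for pair in signature if disputed[pair] > 0)
```

A larger number of shared rare pairs would count as evidence for common authorship, though how much weight to give it depends on the size and choice of the reference corpus.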