6 February 2011

Instructions for Analyze


Introduction

Here's a familiar text, Abraham Lincoln's Gettysburg address:
Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation or any nation so conceived and so dedicated can long endure. We are met on a great battlefield of that war. We have come to dedicate a portion of that field as a final resting-place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But in a larger sense, we cannot dedicate, we cannot consecrate, we cannot hallow this ground. The brave men, living and dead who struggled here have consecrated it far above our poor power to add or detract. The world will little note nor long remember what we say here, but it can never forget what they did here. It is for us the living rather to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us--that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion--that we here highly resolve that these dead shall not have died in vain, that this nation under God shall have a new birth of freedom, and that government of the people, by the people, for the people shall not perish from the earth.
What is meant by an analysis of this text? Many analyses are possible, but only a small number can be implemented on a machine.

Here a text is an ordered (i.e. indexed 0 ... 266) collection of text elements, that is, elements that can be printed on a sheet of paper. An analysis depends on the way the elements are defined. The elements in the above text can be defined as:

  1. an unbroken sequence of bytes,
  2. a sequence of the lowercase letters a-z with other characters ignored, or
  3. a sequence of blank-separated words.
Interpretation 1 might be useful for someone transmitting the bytes (hexadecimal 20 is the most frequent byte), interpretation 2 for a cryptographer (the letter 'q' occurs only once), and interpretation 3 for a literary critic ("we cannot" occurs three times).
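
The three interpretations can be sketched in a few lines of Python. This is only an illustration of the idea, not the code Analyze uses; the sample sentence is one line of the address, and the handling of capitals and punctuation in the word list is an assumption made for the sketch.

```python
import re
from collections import Counter

text = ("But in a larger sense, we cannot dedicate, we cannot consecrate, "
        "we cannot hallow this ground.")

# Interpretation 1: every byte is an element.
byte_elements = list(text.encode("ascii"))

# Interpretation 2: lowercase letters a-z only; all other characters are ignored.
letter_elements = [c for c in text.lower() if "a" <= c <= "z"]

# Interpretation 3: blank-separated words. Lowercasing and stripping
# punctuation are assumptions of this sketch.
word_elements = [w for w in (re.sub(r"[^a-z]", "", t) for t in text.lower().split()) if w]

# The most frequent byte in this sentence is the blank, hexadecimal 20.
most_common_byte = Counter(byte_elements).most_common(1)[0][0]
print(hex(most_common_byte))
print(len(byte_elements), len(letter_elements), len(word_elements))
```

The same bytes give three different element lists, so any later count depends on which interpretation was chosen.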

The conversion of the bytes in a computer file into elements is called filtering. Analyze provides several built-in filters to define text elements. Moreover, it can incorporate filters written in the Java programming language by the user. Analyze is multi-lingual and can analyze and present results in foreign languages, such as Hebrew. For example, a filter (Tanach) is provided to analyze the Hebrew Bible, the Tanach, in Hebrew words.

Analyze determines all the singletons and tuples in a text. A singleton is an element that occurs only once in the text. A tuple is an ordered collection of one or more elements that occurs two or more times in a text. An n-tuple is a tuple of n elements. For example, in the above text the word "government" is a singleton and the words "we cannot" are a 2-tuple. The number of times a specified tuple occurs is called the tuple's count. The count of any singleton is 1; the count of any tuple is 2 or more. For example, the count of the 2-tuple "we cannot" is 3.
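
The definitions above can be tried on a small word list. The following is a minimal Python sketch, assuming the elements are already available as a list; the function names are invented for the illustration and are not part of Analyze.

```python
from collections import Counter

def count_groups(elements, n):
    """Count every run of n consecutive elements."""
    return Counter(tuple(elements[i:i + n]) for i in range(len(elements) - n + 1))

def singletons(elements):
    """Elements that occur exactly once."""
    return sorted(e for (e,), c in count_groups(elements, 1).items() if c == 1)

def n_tuples(elements, n):
    """Groups of n consecutive elements that occur two or more times."""
    return {g: c for g, c in count_groups(elements, n).items() if c >= 2}

words = ("but in a larger sense we cannot dedicate we cannot consecrate "
         "we cannot hallow this ground").split()

print(singletons(words))
print(n_tuples(words, 1))
print(n_tuples(words, 2))
```

In this one sentence the only 2-tuple is "we cannot", with a count of 3, and every word other than "we" and "cannot" is a singleton.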

Another application of Analyze is to determine if a text contains a random collection of elements such as bytes or characters. The statistics for a random source are computed, displayed, and compared with those of the text to determine if the text has random characteristics.

Analyze can analyze and compare up to 5 different texts. For example, it can determine what tuples are present in texts A and B and not in text C. It can also determine a quantitative measure, the difference in mean motions of the log likelihood ratio, of the extent to which the texts have differing statistical characteristics.


Installation

A modern browser such as Firefox or Internet Explorer 6+ is required to display outputs.

Although Analyze can be run as a Java applet over the web, the most flexible (and preferred) way of running it is from a Java .jar file. This file, Analyze.jar, can be downloaded from the web page.

Double click on the Analyze.jar file. If a disclaimer window pops up, the installation is complete and you should click on the "Instructions" item of the "Help" menu for further information.

Installing the Java Runtime Environment (JRE)
If nothing happens when you click on the Analyze.jar file, the Java Runtime Environment (JRE) must be installed on your computer. Installation of the JRE is fast and cheap (FREE). In your browser, go to http://www.oracle.com/technetwork/java/index.html. Push the "Get Java" button in the box on the middle, right-hand side of the page. On the resulting page, look at the box labeled "Java Platform, Standard Edition" (currently Java SE 6 Update 23). Press the "Download JRE" button in this box. (Do not download the entire JDK, because it isn't needed and it will take a lot of space on your computer.) Specify your operating system and check the licensing agreement box, then press "Continue". Click on the link to download and install the JRE software.
Websites and JRE versions change frequently, so anticipate minor differences.

Double click again on the Analyze.jar file. If a disclaimer window pops up, the installation is complete and you can click on the "Instructions" item of the "Help" menu for further information.

Notes
When Analyze is run, a file Analyze.X.Y.ini is placed in your user home directory, where X.Y is the version number. This file contains your personal settings for Analyze, including your acceptance of the disclaimer. If you delete it, Analyze will return to its default settings.


Example

The easiest way to see what Analyze can do is to try the following example. Assume that Analyze.jar has been placed in a directory ANALYZE.

Double click on Analyze.jar. A disclaimer window should appear. If it doesn't, verify that the Java Runtime Environment is installed. Read the disclaimer, think about it, and click "Accept".

An application window titled "Analyze (X.Y)" should appear. X.Y is the version number. Note the blue "Help" button at the right end of the menu bar. Pull down the "Help" menu to see information about Analyze and these instructions.

Pull down the "File" menu and click on the "Set" menu item. A popup window titled "Data name, description, and output directory" will appear. Press the blue "Change directory" button and search for the ANALYZE directory that contains Analyze.jar. Click the directory name and push "Set". Then press the "Set" button in the "Data name, description, and output directory" window. The ANALYZE directory appears after the label "Output" in the Analyze window. Outputs from Analyze will now be placed in the ANALYZE directory.

Notice

Analyze is designed to automatically display results with your default browser. This will occur only if the complete file path of the output directory contains no blanks. For example, the Windows output directory path C:\Documents and Settings\User\Desktop contains blanks, which prevent the browser from opening and cause an error window to pop up.

When this happens, clicking "OK" allows operation to continue, but the viewing of the results with a browser must be done manually by locating and clicking on the file.

For best operation choose an output directory that has no blanks in its complete path name.

Pull down and click on the "Help:Extract examples" menu item. Look in the ANALYZE directory and observe that a new subdirectory, Examples, has been created. Five example files have been placed in it: Gettysburg.txt, IHaveADream.txt, AlphabeticRandom.txt, AlphabeticRandom2.txt, and Genesis.xml. These files are not necessary to run Analyze and may be deleted when you're familiar with the program.

Pull down and click on the "File:Read" item. A file choosing window will appear. Move to the directory ANALYZE until you can see the file Examples/Gettysburg.txt. Click on Gettysburg.txt and then on "Read" at the bottom of the file choosing window. The file choosing window will disappear, leaving the original Analyze window, but with "Gettysburg.txt" appearing below the "File" menu. This is the file that will be analyzed.

Next, a filter to define the elements of the text must be selected. Note that the words "Filter: Bytes" appear in the left middle of the Analyze window. This indicates that the text would be interpreted as a series of hexadecimal bytes. For this example, however, the text will be viewed as a collection of words. Pull down the "Filter" menu and click on the "Words" item. The Analyze window should now say "Filter: Words". A label saying "No pruning" should appear to the right of the "Filter: Words" label. If it doesn't, pull down the "Filter" menu and select the "Disable pruning" menu item. Bravely push on the "Analyze" button.

Results
With luck, a browser window should appear, possibly over the Analyze window. The title of this window should be "Analyze Summary: Gettysburg.Words". If you don't get this window, verify that you have Firefox, Internet Explorer 6.0+, or any modern browser.

Before examining the results in the browser window, look into the ANALYZE directory. Two new items can be found there: A folder Gettysburg.Words and a file Analyze.xsl. Analyze.xsl is a critical XSL file that defines the formatting of Analyze outputs, which are in XML and difficult to understand without it. Move to the Gettysburg.Words folder. Five files with an .xml extension should be present. These are the results of the analysis and can be examined with the browser after exiting the Analyze program, usually starting with the Summary.xml file. Close the directory window.

Go back to the browser window that popped up after running Analyze. Move to the "Summary" section and click on the "Singletons" link. The resulting display (from the Singletons1of1.xml in the Gettysburg.Words folder) gives the words that appear only once in the text. Go back to the summary page, either by pushing the browser "Back" button or clicking on the "Tuples" link in the upper right side of the window.

Look at the "Tuples" section of the summary, under the "1 tuples" link. The most frequent word is "that", appearing 13 times; the least frequent word appearing more than once is "from". Click on the "1 tuples" link. The display (from the Histogram.1.xml in the Gettysburg.Words folder) shows all words appearing more than once in the text and indicates how often they appeared. The word "dedicated" appears 4 times. Again, go back to the summary.

Now look at the "Tuples" section of the summary, under the "3 tuples" link. The only repeated group of 3 words, a 3-tuple, is "dedicated to the" appearing 2 times. Click on the "3 tuples" link. The display (from the Histogram.3.xml in the Gettysburg.Words folder) shows this 3-tuple with a count of 2, as expected. Close the browser window and return to the Analyze window.

Labelling results
Labelling results is important for any serious study. Pull down and click the "File:Reset" item. This resets the program to process the same file with the same filter as before. Pull down and click the "File:Set" item again. The popup window titled "Data name, description, and output directory" will appear. Enter some reasonable name in the field after "Name", e.g. "Gettysburg". Then enter some description of the analysis in the field below the label "Description". Note that the "Output directory" is given as the ANALYZE directory that you set earlier. Press the "Set" button in the "Data name, description, and output directory" window.

The Analyze window now has "Gettysburg" appearing as a major label. Press the "Analyze" button and examine the results in the browser window. The results are better labelled and now appear in the folder Gettysburg rather than the Gettysburg.Words folder as before. To avoid confusion, delete the Gettysburg.Words folder.

Analyze remembers what you've been doing. To see this, exit Analyze by clicking the close box in the upper right hand corner or by selecting the red "Exit" entry in the "File" menu. Then re-run Analyze by double clicking on the Analyze.jar file as before. All the previous settings have been retained. Pushing the "Analyze" button re-generates the previous outputs.

Viewing selected words in the text
Pull down the "View" menu and click on the "Text" item. A popup window labelled "Select text to view" will appear. Click the check box to the right of the word "Text" and enter 0 into the field labelled "Index(0-267)". Place a zero in the field labelled "Objects before the selected object" and 267 in the field labelled "Objects after the selected object". Uncheck the box to the right of the "Add space every" label, then push the "Show" button. A browser window should pop up giving the entire text. This is the text as seen by Analyze with the "Words" filter. The text is without capitalization and punctuation. The browser is displaying a file Text.0.0.xml in a folder Text that has been placed in the Gettysburg folder. Close the browser.

Pull down the "View" menu and click on the "Tuple" item. A popup window labelled "Select tuple to view" will appear. Do not click the checkbox before the "Require an exact match" label. Enter the word "dedicated" in the field following the label "Tuple", then push the "Show" button. A browser window should pop up giving the results of the search. The browser is displaying a file Search.xml in a folder Tuples that has been placed in the Gettysburg folder. The word "dedicated" appears in three forms in the text, first as a single word, second as part of the 2-tuple "dedicated to", and third as part of the 3-tuple "dedicated to the".

Suppose we'd like to find all occurrences of the 3-tuple "dedicated to the ". Highlight the words "dedicated to the " in the browser and select "Copy" with a right-mouse click in the browser. Keep the trailing space because each word is assumed to have a trailing space. Close the browser window and return to the still-open "Select tuple to view" window. This time check the "Require an exact match" checkbox. Click on the "Tuple" field, right click, and select "Paste" to put "dedicated to the " into the field. (You could have typed the tuple directly in the window; however, this can be difficult with texts in a foreign language with strange keyboard layouts.) Press the "Show" button. A new browser window will pop up displaying the locations of each occurrence of the 3-tuple "dedicated to the ". The browser is displaying the contents of a file Tuple.3.0.20.xml in a folder Tuples in the Gettysburg folder.

The browser shows that the tuple appears twice in the text, once at index 193 and once at index 20. These indices correspond to the 194th and 21st words in the text. That is, the indices start at 0. Since only one text is being studied, the text number is zero. (The file name Tuple.3.0.20.xml specifies the tuple size (3), the text (0), and the index (20) of one of the tuple locations.) The combination of a text number and an index value can be concatenated with a period to give a "Position string". For example, "0.193" is the position string for the last occurrence of "dedicated to the ". Close the browser and close the "Select tuple to view" window by clicking on the close box in the upper right window corner.
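
The idea of a position string can be illustrated with a short Python sketch. It assumes each text is a list of word elements and that indices start at 0, as described above; the function names and the sample sentence are invented for the illustration.

```python
def find_positions(texts, pattern):
    """Return 'text.index' position strings for every occurrence of pattern."""
    n = len(pattern)
    return [f"{t}.{i}"
            for t, elements in enumerate(texts)
            for i in range(len(elements) - n + 1)
            if elements[i:i + n] == pattern]

def parse_position(position):
    """Split a position string such as '0.193' into (text number, index)."""
    text_number, index = position.split(".")
    return int(text_number), int(index)

sample = "we are dedicated to the task and dedicated to the cause".split()
print(find_positions([sample], ["dedicated", "to", "the"]))
print(parse_position("0.193"))
```

With a single text the text number is always 0, so only the index part of the position string varies.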

Pull down the "View" menu and click on the "Text" item. This time click the check box to the right of the word "Position string" and enter 0.193 into the field labelled "Position string(0.0-0.267)". Place a 10 in the field labelled "Objects before the selected object" and 10 in the field labelled "Objects after the selected object". Pull down the "Number of tuples to highlight" to the number 3 because we're looking for a 3-tuple. Push the "Show" button. A browser window will pop up showing the text surrounding the "dedicated to the " 3-tuple. The words "dedicated to the" are highlighted in blue. The browser is displaying a file Text.0.193.xml in a folder Text. (The file name Text.0.193.xml specifies the text (0), and the index (193) of the specified tuple location.)

The location and text of a singleton (a word that occurs only once) can be found the same way. Return to the Summary.xml document and click on the "Singletons" link. This displays a file Singletons.1of1.xml in the Gettysburg folder. Highlight the word "government " including the space and copy it to the clipboard. Return to Analyze and select the "Tuple" entry of the "View" menu. Check the "Require an exact match" checkbox and paste from the clipboard into the "Tuple" field. The browser displays the singleton "government" in a file Singleton.0.251.xml in a folder Tuples in the Gettysburg folder. "government " is at position string value 0.251. The words adjacent to "government" can be displayed in the usual way with the "View:Text" item.

Thus, the locations and adjacent text of any tuple found in the analysis can be displayed.

The function of the "Analysis" menu will be described later. Its function applies only when more than one text has been analyzed.


Testing for randomness

Texts from random sources have statistical properties that distinguish them from non-random sources. Analyze can perform some elementary statistical tests to determine if a text is from a random source. Testing for randomness is possible only if the elements of a text come from a finite and known set of possible elements, an alphabet. The Bytes and Alphabetic filters produce elements from alphabets of size 256 and 26 respectively. Outputs from the Words filter are non-alphabetic because the number of words is unbounded. Thus, the randomness testing can only be done with Bytes and Alphabetic filters, and not with the Words filter.

Analyze assumes that elements in random texts have a uniform distribution and are independent from one element to the next. That is, the probability of any 1-tuple is 1/n where n is the number of elements in the alphabet. Further, the probability of a particular element doesn't depend on the adjacent elements. Such random texts have tuple frequency distributions and statistics that follow the binomial distribution.

Example with a non-random text
Analyze the file Gettysburg.txt with the Alphabetic filter. To do this the "File:Clear" menu item must be clicked to allow resetting the filter. Set the data label as "GettysburgAlphabetic" to avoid mixing results from the previous example. Use "File:Read" to re-read Gettysburg.txt and "Filter:Alphabetic" to set the filter. Then push "Analyze".

Because the text is in a natural language, English, it isn't random. Look at the browser window (/GettysburgAlphabetic/Summary.xml) under the major title "Summary". A column labelled "Random" appears there, with the contents "false". This means that Analyze has found the text to be non-random by the method described later.

Under the major title "Tuples" examine the table following the "1-tuples" link. The table contains two columns, "Expected tuples" and "Ratio", that were not present in the previous non-alphabetic analysis. The "Expected tuples" column shows the number of 1-tuples, 26, expected if the text were random. The text has fewer 1-tuples (22) than expected.

"Ratio" gives the ratio of the tuples found (22) to the expected tuples (26) when this ratio can be computed.

Under the 14 tuples link, the "Expected tuples" column shows that 0.0 tuples were expected. The "Ratio" column is -1.00. This is a flag indicating that the expected number of tuples is so small that it could not be computed even with double precision arithmetic. Hence the "Ratio" (the number of tuples found divided by the expected number of tuples) could not be computed.

Looking up the table, the first tuple size having a "Ratio" that is not -1.0 is 10. For 10 tuples, 3.5 tuples are expected, 10 actually occurred, yielding a Ratio of 2.87. Looking up the table further, the first tuple size having an "Expected tuples" value greater than 1 is 5 tuples, with 1.4 tuples expected, 116 tuples found, with a "Ratio" of 2104.53.

Analyze labels a text as non-random if "Ratio" is greater than 2 for the largest tuple size having the "Expected tuples" greater than or equal to 1.

For this text, the randomness decision is made at tuple size 5 because the "Expected tuples" at this size is greater than 1.0; all larger tuples have "Expected tuples" less than 1. Since the ratio is greater than 2.0, the text is taken as non-random as indicated by the "false" in the "Random" column under the "Summary" major title.
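
The decision rule just described can be written out as a small Python sketch. It is a reading of the rule as stated here, not the Analyze source; the table rows are made-up numbers used only to exercise the rule, and returning None when no tuple size qualifies is an assumption of the sketch.

```python
def is_random(rows):
    """Apply the Ratio rule to rows of (tuple_size, tuples_found, expected_tuples).

    The decision is made at the largest tuple size whose expected tuples
    are at least 1; the text is taken as non-random if the ratio
    found / expected is greater than 2 at that size.
    """
    eligible = [row for row in rows if row[2] >= 1.0]
    if not eligible:
        return None  # assumption: no decision is possible
    size, found, expected = max(eligible, key=lambda row: row[0])
    return found / expected <= 2.0

# Hypothetical summary tables, for illustration only.
language_like = [(1, 20, 26.0), (2, 150, 300.0), (3, 90, 12.0), (4, 40, 0.4)]
noise_like = [(1, 26, 26.0), (2, 676, 676.0), (3, 2300, 2400.0), (4, 95, 100.0), (5, 4, 0.5)]

print(is_random(language_like))  # decided at size 3, ratio 7.5
print(is_random(noise_like))     # decided at size 4, ratio 0.95
```

Only one row of the table takes part in the decision, which is one reason the test should be read as indicative rather than definitive.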

Click on the link to "1-tuples" under the "Tuples" major heading. Three useful statistics are shown. The quantity "Tuples" is the number of 1-tuples found, 22 out of a possible 26, as shown previously on the "Summary" page. The "Expected count", 44.2, is the expected count for every 1-tuple if the text were random. The counts in the table below range from 165 for "e" to 3 for "k", hence the distribution of 1-tuples isn't uniform. The "Std. dev. count", 6.6, is the standard deviation of counts expected for a random text. With a random text, a large percentage of the counts (68.3%) would be in the range 44.2 +- 6.6, that is, between 37.7 and 50.8. Clearly, the results don't have this property.

Example with a random text
Analyze the file AlphabeticRandom.txt with the Alphabetic filter. Set the data label as "Random" to avoid mixing results from the previous example. This file contains 10,000 letters generated from a noise source. Look at the browser window (/Random/Summary.xml) under the major title "Summary". The column labelled "Random" contains "true" indicating that Analyze has found the text to be random.

Look under the "Tuples" major title. In the previous, non-random example the largest tuple size (with only 1149 text elements total) was 14. Here the largest tuple size is 5, even with 10,000 letters of text. The "Tuples" column contains an * for tuple sizes 1 and 2. This indicates that all possible tuples of this size have been found. For all tuple sizes, the actual tuple counts, "Tuples", closely match the "Expected tuples", suggesting that the text is random.

Click on the link to "1-tuples" under the "Tuples" major heading. The "Count" values range from 339 to 420. The "Expected count", 384.6, is near the center of this range. The "Std. dev. count" is 19.2, so that for a random text a large percentage (68.3%) of the counts would be in the range 384.6 +- 19.2, that is, between 365.4 and 403.8. The results approach this property.
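
The "Expected count" and "Std. dev. count" quoted above agree with the mean and standard deviation of a binomial distribution with 10,000 trials and probability 1/26. The following Python sketch shows that arithmetic; that Analyze computes the values exactly this way is an assumption. The 68.3% figure is the usual one-standard-deviation share of a normal approximation.

```python
import math

def expected_count_stats(num_elements, alphabet_size):
    """Mean and standard deviation of a 1-tuple's count in a uniform random text."""
    p = 1.0 / alphabet_size
    mean = num_elements * p
    std = math.sqrt(num_elements * p * (1.0 - p))
    return mean, std

mean, std = expected_count_stats(10_000, 26)
print(f"{mean:.1f} +- {std:.1f}")               # 384.6 +- 19.2
print(f"{mean - std:.1f} to {mean + std:.1f}")  # 365.4 to 403.8
```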

Click on the link to "5-tuples". The "Count" value is two for all 4 tuples found. This is the minimum count for any tuple. The "Expected count" and "Std. dev. count" are zero, indicating a very small count is expected. Although the counts are greater than the expected values, the low number of tuples suggests that these tuples representing 40 letters (4 tuples of 5 letters 2 times) out of 10,000 letters represent usual extremes, often called "flyers", in the data.

Notes on randomness testing

Analyze provides one very simple statistical test to determine if a text is random. This test, based on the "Ratio" statistic, is indicative, but certainly not definitive.

The best use of Analyze for randomness testing is to examine in detail tuples of large size to see if they have recognizable values. For example, analysis of the bytes of a text might show that the largest tuples contain either all x00 or all xFF bytes. This would be a source for suspicion that the text was not random. Such an anomaly might not be found by frequency analysis or by correlation methods unless the tuple size was very large.


Multiple texts

Before comparing texts, analyze the file IHaveADream.txt, which contains Martin Luther King, Jr.'s "I have a dream" speech, using the "Words" filter. Label the data "Dream" to avoid confusion. The results in the browser window have the same format as those for the Gettysburg address.

Reset the program with the "File:Clear" menu item. Press the "File:Read" menu item and read Gettysburg.txt. Press the "File:Read" menu item a second time and read IHaveADream.txt. Press the "File:Set" menu item and label the data "Speeches". The filter should be "Words", and the label "No pruning." should appear on the same line as the "Filter: Words". Press "Analyze" for a comparison of the two texts.

A Summary result should appear in the browser window. Under the "Data" major title a box appears with both files listed. The first text, Gettysburg.txt, will be considered text 0; the second text, IHaveADream.txt, will be considered text 1.

Under the "Summary" major title, the box contains rows for both files and for the overall result. The number of elements and singletons sum to the result given in the "Overall" row. The number of tuples shown for each text is the same as when the text was processed individually. However, the overall number of tuples, 602, is not the sum (607) of the tuples in text 0 and the tuples in text 1. This is because a tuple can occur both in text 0 and in text 1. If it does, it is shown as a tuple in each text, but only once in the "Overall" row. The number of singletons in each text is less than when the text was processed individually. This is due to the merging of singletons in the individual texts into "Overall" tuples in the multitext analysis.
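
Both effects can be seen in a toy Python sketch restricted to 1-tuples. The two word lists are invented for the illustration, and the bookkeeping is an interpretation of the description above rather than the code Analyze uses.

```python
from collections import Counter

text0 = "we cannot dedicate we cannot consecrate this ground".split()
text1 = "we have a dream that this nation will rise we have hope".split()

def repeated(counts):
    """Elements with a count of 2 or more."""
    return {word for word, count in counts.items() if count >= 2}

counts0, counts1 = Counter(text0), Counter(text1)
overall = counts0 + counts1

tuples0 = repeated(counts0)        # tuples of text 0 on its own
tuples1 = repeated(counts1)        # tuples of text 1 on its own
tuples_overall = repeated(overall)

print(sorted(tuples0), sorted(tuples1), sorted(tuples_overall))
```

Here "we" is a tuple in each text but is counted once overall, while "this" is a singleton in each text on its own and becomes an overall tuple when the texts are analyzed together.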


Comparing texts

Texts are compared by determining what tuples they share and what tuples are unique to a particular text.

Pull down the "Analysis" menu to the "Set comparison" menu item. A popup window appears labelled "Set comparison". In the window are two pull-down lists, one for each text. For the first text the alternatives are:
  • Include tuples from Gettysburg.
  • Exclude tuples from Gettysburg.
  • Ignore Gettysburg completely.
The alternatives for the second list are similar but are for IHaveADream.

Set the alternatives so that both say "Include" and press the "Set" button. The results list all tuples that are in both Gettysburg.txt and IHaveADream.txt. Clicking on the "3-tuples" link gives a list of all 3-tuples that are in both texts.

To verify that this is true, highlight and copy the 3-tuple "that all men " (Keep the trailing space!) to the clipboard, and pull down the "View:Tuple" menu. Check the "Require exact match" checkbox, paste the text into the "Tuple:" field and push "Show". In the browser window the first table shows that this tuple appears once in Gettysburg.txt and twice in IHaveADream.txt. That is, it occurs in both texts, as expected. Go back to the "Select tuple to view" window and paste "that all men " into the "Tuple:" field without checking the "Require exact match" checkbox. This time, four alternative texts are shown in the browser window: "that all men ", "that all men are ", "that all men are created ", and "that all men are created equal ". Select the latter and copy it to the clipboard. Paste it into the "Select tuple to view" window's "Tuple:" field. This time check the "Require exact match" checkbox and press "Show". The browser window shows this 6-tuple also appears in both texts, and is listed in the 6-tuple link of the comparison.

Tuples can have a count of 1 in the comparison analysis of multiple texts. The actual count of the tuple in the overall text is 2 or more.

The comparison results have been put in a directory ANALYZE/Speeches/Comparison.11. The files in this directory are Comparison.11.xml and files of the form ComparisonHistogram.11.n.xml, where n corresponds to the tuple size. These files are available for study after the program has been exited. The comparison code 11 shows that there are two texts because it has two digits. The digit coding, one digit per text in text order, is:
  • 1: include tuples from the text.
  • 0: exclude tuples from the text.
  • x: ignore the text completely.

For example, a 01x1 comparison code would indicate that 4 texts have been studied and that the results are tuples that are in texts 1 and 3 and not in text 0. Text 2 is ignored in the comparison.
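
A comparison code can be applied to sets of tuples with a few lines of Python. The sketch assumes, as the 01x1 example implies, that 1 means include, 0 means exclude, and x means ignore; the function name and the sample sets are invented for the illustration.

```python
def compare(code, tuple_sets):
    """Return the tuples present in every included text and in no excluded text."""
    if len(code) != len(tuple_sets):
        raise ValueError("one code digit is needed per text")
    included = [s for digit, s in zip(code, tuple_sets) if digit == "1"]
    excluded = [s for digit, s in zip(code, tuple_sets) if digit == "0"]
    if not included:
        return set()  # nothing is included, so the comparison is empty
    result = set.intersection(*included)
    for s in excluded:
        result -= s
    return result

# Hypothetical tuple sets for texts 0 to 3.
sets = [{"a", "c"}, {"b", "c", "d"}, {"z"}, {"c", "d", "e"}]
print(compare("01x1", sets))  # in texts 1 and 3, not in text 0; text 2 ignored
```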

To find the tuples unique to IHaveADream.txt set the alternatives so that the first pulldown list says "Exclude tuples from Gettysburg" and the second says "Include tuples from IHaveADream" and press the "Set" button. The browser results list all tuples that are not in Gettysburg.txt but are in IHaveADream.txt. These are tuples that King used but Lincoln didn't. The comparison results have been put in a directory ANALYZE/Speeches/Comparison.01. The files in this directory are Comparison.01.xml and files of the form ComparisonHistogram.01.n.xml, where n corresponds to the tuple size.

The "Ignore ... completely" option sets the comparison so that it doesn't consider anything about tuples in the specified file. This option is helpful in constructing pairwise comparisons when more than 2 texts are being studied. If too many files are set to "Ignore ... completely" or no "Include tuples from ..." is set, a warning "Empty comparison" appears in blue in the lower right corner of the "Set comparison" window.


Evaluating differences in texts

Analyze provides a quantitative measure of the difference between two texts. The measure provided is the difference in mean motions of the log likelihood ratio. The following description is crude, but suggestive of the theory behind the difference.
The difference in mean motions of the log likelihood ratio, E(l|A) - E(l|B), tells how much each tuple can change our opinion of whether text A or text B is present.
Analyze provides the above statistic to evaluate differences between two texts.

Differences between two English texts
Analyze Gettysburg.txt and IHaveADream.txt with the "Words" filter giving the data name as "Speeches" as before. Pull down the "Analysis" menu to the "Difference" item submenu and click on the submenu's "Difference" item. A pop up window labelled "Difference texts" will appear with two pulldown lists. Set the first list to "Take Gettysburg as text A.", the second to "Take IHaveADream as text B". Leave the pulldown lists "Maximum tuple size" and "Minimum difference tuple counts" set at 1. Push the "Set" button.

The browser window should give the difference results. Results are limited to tuples of size 1, the "Maximum tuple size". Because the probabilities p(x(i)|A) and p(x(i)|B) are estimated based on the frequency of the tuple x(i) in the texts being analyzed, both probabilities can't be estimated for tuples that occur only in one text. Thus any tuple considered in the difference must have a minimum count in each text of 1, the "Minimum difference tuple Count". With this restriction, the number of 1-tuples in the difference is 70 out of the 199 possible 1-tuples. The probabilities are computed as fractions of the element counts within the difference tuples, not as fractions of the overall counts. The meaning of the "Excluded tuples" and "Non-excluded tuples" items will be explained later.

The "Difference of the mean motions", 0.862, gives the basic measure of the differences between the texts. This is a "large" value for this quantity, as will be seen with a later example. The following box gives useful statistics that are generally self-explanatory. The "% elements in difference", 69.0% in A, 49.6% in B, should always be noted. Small values for either of these percentages indicate that too few tuples are in the difference, usually because of too high a minimum required count. The "Expected number of elements (0.500 -> 0.999)" gives, for each text, the expected number of tuples needed to move an initial probability of 0.500 for that text to a probability of 0.999. That is, if the texts are equiprobable at the start, about 22.4 tuples will be required to establish that text A is present with a probability of 0.999, and about 33.5 tuples are required to establish that text B is present with the same probability. These counts are compensated for the fact that not all elements are in the difference.

A table of "Influential tuples" follows. These are tuples that have the largest contributions to the difference of the mean motions. The first tuple, "of", occurs 5 times in text A out of 185 elements in the difference, so p("of"|A) is taken as 0.02703. Similarly, "of" occurs 96 times in text B out of 773 elements in the difference, so p("of"|B) is taken as 0.12419. l("of") is -1.52488. The difference of the p("of"|A)l("of") and p("of"|B)l("of") is 0.148, the largest such difference of all tuples. Hence "of" is the most influential tuple in the difference in the sense of having a large l(x) and large probabilities p("of"|A), p("of"|B). The sign and magnitude of l(x) tell how a tuple modifies the final probability of a text. For example, the word "of" suggests the text is King's, the word "here" suggests the text is Lincoln's.
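The arithmetic for "of" can be checked with a few lines of Java. This is only a sketch and is not taken from Analyze: the class and method names are hypothetical, and the natural logarithm is assumed for l(x). The counts are the ones quoted above.

```java
class InfluenceSketch {
    // Log likelihood ratio l(x) = ln[ p(x|A) / p(x|B) ] (natural log assumed).
    static double logLikelihoodRatio(double pA, double pB) {
        return Math.log(pA / pB);
    }

    // Contribution of one tuple to the difference of the mean motions:
    // p(x|A) l(x) - p(x|B) l(x).
    static double contribution(double pA, double pB) {
        return (pA - pB) * logLikelihoodRatio(pA, pB);
    }

    public static void main(String[] args) {
        double pA = 5.0 / 185.0;   // "of" in text A, counts quoted in the text
        double pB = 96.0 / 773.0;  // "of" in text B, counts quoted in the text
        System.out.printf("p(of|A)      = %.5f%n", pA);
        System.out.printf("p(of|B)      = %.5f%n", pB);
        System.out.printf("l(of)        = %.5f%n", logLikelihoodRatio(pA, pB));
        System.out.printf("contribution = %.3f%n", contribution(pA, pB));
    }
}
```

Computed from the unrounded fractions, l("of") agrees with the quoted -1.52488 to within about 0.0002, and the contribution rounds to 0.148.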

The field "Maximum tuple size" in the pop up window allows tuples up to the specified size to be considered in the difference. Tuple sizes greater than 1 complicate the theory and generally have a small effect.

The field "Minimum difference tuple Count" in the pop up window controls the minimum number of counts that a tuple must have in each text to be considered in the difference. Increasing this number increases the quality of the estimates of p(x|A), p(x|B), but reduces the number of tuples in the difference. The balance between these two considerations must be found empirically for a particular pair of texts.

Another problem in examining the difference in texts is the presence of tuples that the analyst believes produce an "unfair" contrast between the texts. Very often these tuples are proper nouns and are specific to the particular texts. Suppose that we analyze two short stories to see if they come from the same author. The first, A, has "Joe" as a primary character and "Tom" as a very minor character. The second text, B, has "Tom" as a major character and "Joe" as a minor character. The tuples "Joe" and "Tom" will be very influential because their frequencies differ greatly between the texts, and they will increase the difference in the mean log likelihood ratios. This increase is "unfair" in the sense that the two names are specific to the stories and not to the author's vocabulary. In this case, it's desirable to exclude the tuples "Joe" and "Tom" from the analysis and then see if the texts are different.

The two texts here do not appear to have such "unfair" tuples. Suppose, however, that we want to exclude simple words such as "of", "the", and "a" from the difference between Gettysburg.txt and IHaveADream.txt. Go to the "Analysis:Difference" submenu and click on the "Exclude tuples" menu item. (You must run the "Difference" item of the "Analysis:Difference" submenu first.) A pop up menu labelled "Select tuple to exclude" will appear. Type "of " in the "Tuple:" field. (Don't forget the trailing blank.) Press the "Exclude tuple" button. Continue in this manner with the words "the " and "a ". Then push the "Show results" button. The browser window should pop up with difference results, with and without the exclusions. The exclusion of these three tuples has reduced the difference in mean motions from 0.862 to 0.787. It has also reduced the percentage of elements in the difference from 69.0% to 60.4% for text A, and from 49.6% to 34.8% for text B. With the exclusions, fewer tuples (18.5) are required to reach a final probability of A of 0.999, but more (117.4) are required to reach a final probability of B of 0.999. If you accidentally exclude a tuple that you didn't want excluded, excluding the tuple one more time will re-include it in the difference analysis.

Sometimes excluding a whole block of tuples will be convenient. This can be accomplished by setting the number of tuples to exclude in the "Exclude the first ____ most influential tuples." field in the "Select tuples to exclude" window. Go to the "Analysis:Difference" submenu and click on the "Exclude tuples" menu item again. This time enter 10 in the "Exclude the first ____ most influential tuples." field. Push "Show results" and observe the browser window. The first 10 most influential tuples have been excluded. If you are unhappy with the exclusions, re-push the "Difference" item of the "Analysis:Difference" submenu.

In any difference analysis, experiment with exclusion of tuples that you suspect are "unfair" in some sense. Often the exclusion has a small effect.
Differences of random texts
Analyze the two texts AlphabeticRandom.txt and AlphabeticRandom2.txt with the "Alphabetic" filter. These texts contain random alphabetic characters taken from a noise source. Label the results "TwoRandom". (Note that both texts and the overall text are found to be random in the Summary.) Perform the "Difference" analysis. In the browser window the difference of the mean motions is 0.004, a "small" value. Over 3600 tuples would be required to distinguish between the two texts. This is the expected result for two unrelated random texts.
Differences between an English text and a random text
Analyze the two texts IHaveADream.txt and AlphabeticRandom.txt with the "Alphabetic" filter. This compares a natural language text with a random text. Label the results "EnglishRandom". (In the Summary, IHaveADream.txt is found to be non-random, AlphabeticRandom.txt is found to be random, and the overall text is non-random.) Perform the "Difference" analysis. In the browser window the difference of the mean motions is 0.987, a "large" value. This shows the well-known result that natural languages have non-random frequency distributions and are easily distinguished from random texts.


Special topics


Program crashes due to memory errors

Although Analyze has been carefully programmed, for certain data and filters it will crash due to memory errors. These errors occur when the Java interpreter runs out of memory. Analyze monitors the memory available and posts a "Memory alert" if less than 1 megabyte of memory remains; however, this warning isn't always given. The most reliable way to detect memory errors is to open the Java console and look there for messages if the program crashes. Memory errors cause the printout java.lang.OutOfMemoryError to appear on the Java console. Unlike many Java errors, this error can't reliably be caught and repaired.

Memory errors are most likely to occur with large data sets having large tuples (size 50 or so). To explore this problem, open the Java Console and analyze the Analyze.jar file with the Bytes filter. Unless you're very lucky, a warning message will pop up and java.lang.OutOfMemoryError will be printed on the Java Console.

Memory errors can be reduced by calling the Java interpreter with a flag to increase its available memory. The following command is effective when running from the Analyze.jar file:

java -Xmx1024M -jar Analyze.jar
Try re-analyzing Analyze.jar with the Bytes filter, starting the program with the above command. (This command line can be placed in a RunAnalyze.bat file for convenience.) On a machine with 1 gigabyte of internal memory, the analysis will complete without a memory error. Analyze.jar contains more than 200,000 bytes with tuples up to size 58. It is distinctly non-random. Random texts of several megabytes are easily handled because their maximum tuple size is much smaller.
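To confirm that the -Xmx flag took effect, the heap ceiling can be read from inside any Java program. The following is a minimal sketch that uses only the standard java.lang.Runtime class. The class name, the method names, and the 1 megabyte threshold test are illustrative assumptions; how Analyze itself monitors memory is not described here.

```java
class MemoryCheckSketch {
    static final long ONE_MEGABYTE = 1024L * 1024L;

    // Estimate of the memory the JVM could still obtain:
    // the heap ceiling minus what is currently in use.
    static long availableBytes() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used;
    }

    // Hypothetical alert rule: fewer bytes available than the threshold.
    static boolean isLow(long availableBytes, long thresholdBytes) {
        return availableBytes < thresholdBytes;
    }

    public static void main(String[] args) {
        long max = Runtime.getRuntime().maxMemory();
        System.out.println("Heap ceiling (MB): " + max / ONE_MEGABYTE);
        System.out.println("Available (MB):    " + availableBytes() / ONE_MEGABYTE);
        if (isLow(availableBytes(), ONE_MEGABYTE)) {
            System.out.println("Memory alert");
        }
    }
}
```

Running this class with and without -Xmx1024M should show a different heap ceiling in each case.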


Tuple pruning

Consider the text 12ABCD34ABCD56. ABCD occurs twice and is a 4-tuple. As a consequence of ABCD being a tuple, 4 1-tuples (A, B, C, and D), 3 2-tuples (AB, BC, CD), 2 3-tuples (ABC and BCD) are also tuples. Given that ABCD is a tuple, these other tuples aren't very interesting and are called redundant tuples.
A redundant tuple is a tuple which
  1. is included inside a larger tuple and
  2. is not found in the text other than inside the larger tuple.
Pruning removes redundant tuples.
Pruning the text 12ABCD34ABCD56 yields only 1 4-tuple, ABCD, eliminating 9 redundant tuples.

Suppose the text was 12ABCD34ABCD56BCD. The tuple BCD occurs outside the tuple ABCD and is no longer redundant. The 1-tuples (B, C, D) and the 2-tuples (BC, CD) are still redundant because they occur only in BCD. The 1-tuple A, the 2-tuple AB, and the 3-tuple ABC are still redundant because they're only in ABCD. Thus the non-redundant tuples are BCD and ABCD. Pruning this text leaves only two tuples, BCD and ABCD. Thus pruning removes redundant tuples to reduce the number of tuples an analyst must examine.
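The two examples above can be reproduced with a brute-force sketch. It is not the algorithm used by Analyze, which is not described here. It assumes that each character is one element and that a tuple is any sequence of elements occurring at least twice, and it applies the two-part definition of a redundant tuple literally. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PruneSketch {
    // Every substring that occurs at least twice, with its start positions.
    static Map<String, List<Integer>> repeatedTuples(String text) {
        Map<String, List<Integer>> all = new TreeMap<>();
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + 1; j <= text.length(); j++) {
                all.computeIfAbsent(text.substring(i, j), k -> new ArrayList<>()).add(i);
            }
        }
        all.values().removeIf(positions -> positions.size() < 2);
        return all;
    }

    // True if every occurrence of the small tuple lies inside
    // some occurrence of the large tuple.
    static boolean coveredBy(List<Integer> smallPos, int smallLen,
                             List<Integer> largePos, int largeLen) {
        for (int s : smallPos) {
            boolean inside = false;
            for (int l : largePos) {
                if (l <= s && s + smallLen <= l + largeLen) {
                    inside = true;
                    break;
                }
            }
            if (!inside) return false;
        }
        return true;
    }

    // Tuples that remain after redundant tuples are removed.
    static List<String> prune(String text) {
        Map<String, List<Integer>> tuples = repeatedTuples(text);
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> t : tuples.entrySet()) {
            boolean redundant = false;
            for (Map.Entry<String, List<Integer>> big : tuples.entrySet()) {
                if (big.getKey().length() > t.getKey().length()
                        && big.getKey().contains(t.getKey())
                        && coveredBy(t.getValue(), t.getKey().length(),
                                     big.getValue(), big.getKey().length())) {
                    redundant = true;
                    break;
                }
            }
            if (!redundant) kept.add(t.getKey());
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(prune("12ABCD34ABCD56"));     // prints [ABCD]
        System.out.println(prune("12ABCD34ABCD56BCD"));  // prints [ABCD, BCD]
    }
}
```

The search is far too slow for real texts; it is meant only to make the definition concrete.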

Take a look at the earlier analysis of IHaveADream.txt with the "Words" filter and labelled Dream.

Tuple size  No. of tuples  Tuples
12          1              "i have a dream today i have a dream that one day "
11          2              "i have a dream today i have a dream that one ", "have a dream today i have a dream that one day "
10          3              "a dream today i have a dream that one day ", "have a dream today i have a dream that one ", "i have a dream today i have a dream that"
9           4              "a dream today i have a dream that one", "dream today i have a dream that one day", "have a dream today i have a dream that", "i have a dream today i have a dream"
8           5 + 2          "a dream today i have a dream that", "dream today i have a dream that one", "have a dream today i have a dream", "i have a dream today i have a", "today i have a dream that one day", "we can never be satisfied as long as", "with this faith we will be able to "

The tuples of sizes 9 to 11 and the first five 8-tuples in the above table clearly result from the single 12-tuple. They are redundant in the sense that if you have the 12-tuple "i have a dream today i have a dream that one day " you also have all of these smaller tuples.

Re-analyze the file IHaveADream.txt with the "Words" filter; this time click the "Filter:Enable pruning" menu item. Label the data "DreamPruned", to avoid confusion. In the browser window the tuple sizes range from 1 to 12, as in Dream. However, no 9, 10, or 11 tuples are shown. Only 2 8-tuples are shown: "we can never be satisfied as long as", "with this faith we will be able to ". The redundant tuples have been removed.

Look in the Summary on the right side just above the initial double lines for the words "Tuples are pruned.". This labels the fact that pruning has been used. Under the Summary major title, observe the words "265 of 541 tuples were pruned.". Here pruning has hidden 265 tuples, almost half of the total of 541 tuples, to reduce the number of tuples shown to the analyst.

Tuple pruning saves the analyst a lot of effort and pruning is recommended for non-alphabetic filters. Pruning is not available for alphabetic filters (such as "Bytes", "Alphabetic").


Working with Hebrew text

Analyze can analyze texts written in any language expressible in Unicode, regardless of the text direction. The filter determines how the input bytes are converted to elements in the language and the way the text is displayed. A demonstration of this capability is provided for the Hebrew language with a text from the Hebrew Bible, as contained in the Unicode/XML Westminster Leningrad Codex (WLC).
The SBL Hebrew font (SBL_Hbrw.ttf) is required to display the Hebrew text. It can be downloaded free from www.sbl-site.org/Resources/Resources_BiblicalFonts.aspx and installed in the usual manner.
The example text, Genesis.xml, is the book of Genesis transcribed from the WLC site. Other books of the Hebrew Bible are available in this format from the WLC site's Books directory, http://www.tanach.us/Books/.

The filter "Tanach" converts a Unicode/XML Tanach file, such as Genesis.xml, which contains vowelized and cantillated text, into words for Analyze. A number of options for this conversion are available from a popup window which appears after the "Analyze" button is pushed. The text can be analyzed with or without vowels. Cantillation marks are always removed. Text can be taken from either the ketib or qere variants.

A fuller description of this filter is printed in the Summary. The filter preserves the conventional chapter and verse nomenclature, so that a position string "Gen 2:10.1" refers to Genesis Chapter 2, Verse 10, word 1.
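A position string of this form is easy to take apart in a program. The sketch below assumes the layout "Book Chapter:Verse.Word", inferred only from the examples in this documentation; the exact format is specified by the filter's description in the Summary. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PositionStringSketch {
    // Assumed layout: book abbreviation, chapter, verse, word, e.g. "Gen 2:10.1".
    private static final Pattern FORMAT =
            Pattern.compile("(\\S+) (\\d+):(\\d+)\\.(\\d+)");

    // Returns {chapter, verse, word}, or null if the string does not match.
    static int[] parse(String position) {
        Matcher m = FORMAT.matcher(position.trim());
        if (!m.matches()) return null;
        return new int[] {
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3)),
            Integer.parseInt(m.group(4))
        };
    }

    public static void main(String[] args) {
        int[] p = parse("Gen 2:10.1");
        System.out.println("Chapter " + p[0] + ", Verse " + p[1] + ", word " + p[2]);
    }
}
```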

Name the data Genesis and run Analyze on Genesis.xml with the "Tanach" filter, preferably with pruning enabled. Accept the default values in the Tanach filter options window when it appears. In the browser window the Hebrew text is displayed from right to left in the SBL Hebrew font. With this filter, as with the "Words" filter, words have a trailing blank. However, because the language is right-to-left, the trailing blank corresponding to a word is on the left end of the word. The function of the Analyze commands and the contents of its outputs remain the same; only the presentation of the text has changed.

Activate the "View:Text" command. Check the "Position string" checkbox and place "Gen 1:1.1" in the "Position string" field. Note that the range of position strings is shown. The last position string of Genesis is "Gen 50:26.10", the 10th word in Verse 26 of Chapter 50. Press "Show" and the first 10 words of Genesis will appear in the browser window.

Activate the "View:Tuple" command. If your keyboard can be used for Hebrew characters, a blue label saying "Input Hebrew (Israel) characters." will appear. When you type into the "Tuple" field, the letters appear from right to left and are in the SBL Hebrew font. If you know Hebrew and know how your keyboard assigns keys to Hebrew letters, you can enter a word into the "Tuple" field by typing. Alternatively, copy a simple Hebrew word from the summary display on the browser into the clipboard (right mouse button, "Copy"), taking care to sweep from right to left. Then click on the "Tuple" field in the "Select tuple to view" window and paste the word there (right mouse button, "Paste"). Do not click the "Require an exact match" checkbox. Press "Show". The browser will display a list of words containing the selected word.

Choose one of the resulting tuples, copy it from the browser to the clipboard (keep the trailing blank!), and paste it into the "Tuple" field as before. This time click the "Require an exact match" checkbox. Press "Show". A list of all occurrences of this tuple will appear. Note that the "Position string" field in the table gives the position in the conventional Chapter and Verse notation. Caution: Word positions are for the position of a word in the Hebrew text, not the English text!


Theory of text differences

Analyze provides a quantitative measure of the difference between two texts. The measure provided is the difference in mean motions of the log likelihood ratio. The following description quickly summarizes the theory.

Suppose that we are given n tuples {x(0)...x(n-1)} = x from one of two texts, A or B, with the purpose of determining which text, A or B, is the source. Before we look at the tuples, the probability that text A is present is P0(A). The probability of text B being present, P0(B), is 1 - P0(A).

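After the tuples are examined, the probability of text A being present is proportional to P(A, x), and that of text B to P(B, x), where

P(A, x) = P( x | A ) P0(A)

P(B, x) = P( x | B ) P0(B)

Dividing P(A, x) by the common normalizing factor P(A, x) + P(B, x) gives the a posteriori probability of A. This factor cancels in the ratio P(A, x)/P(B, x) used below.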

If the tuples are independent

P( {x(0)...x(n-1)} | A) = p( x(0) | A) p( x(1) | A) ... p( x(n-1) | A)

P( {x(0)...x(n-1)} | B) = p( x(0) | B) p( x(1) | B) ... p( x(n-1) | B)

p(x(i)|A) is the probability of tuple x(i) in text A, and p(x(i)|B) is the probability of tuple x(i) in text B. The ratio P(A, x)/P(B, x) is
P(A, x)/P(B, x) = {p( x(0) | A)/p( x(0) | B)} {p( x(1) | A)/p( x(1) | B)} ... {p( x(n-1) | A)/p( x(n-1) | B)} * {P0(A)/P0(B)}
Taking the natural logarithm of this equation gives
ln{P(A, x)/P(B, x)} = sum [ ln{ p( x(i) | A)/p( x(i) | B) } ] over all tuples + ln{P0(A)/P0(B)}
The log odds ratio transforms probabilities from a range of 0 to 1 to a range from minus infinity to plus infinity. It's the preferred scale for most detection studies. Given a probability p, the log odds ratio of p is
L( p ) = ln [ p / ( 1 - p ) ]
Large, positive values of the log odds ratio correspond to p near 1. Large, negative values of the log odds ratio correspond to p near 0. L = 0 corresponds to p = 0.5. Let the log odds ratio of P0(A) be L( P0(A) ) = L0. Let the log odds ratio of the a posteriori probability of A be L, so that L = ln{P(A, x)/P(B, x)}.
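The log odds ratio and its inverse take only a line each. The sketch below is illustrative and is not taken from Analyze; the class and method names are hypothetical.

```java
class LogOddsSketch {
    // L(p) = ln[ p / (1 - p) ]
    static double logOdds(double p) {
        return Math.log(p / (1.0 - p));
    }

    // Inverse of L(p): p = 1 / (1 + exp(-L))
    static double probability(double logOdds) {
        return 1.0 / (1.0 + Math.exp(-logOdds));
    }

    public static void main(String[] args) {
        System.out.println("L(0.5)   = " + logOdds(0.5));
        System.out.println("L(0.999) = " + logOdds(0.999));
        System.out.println("p(L = 2) = " + probability(2.0));
    }
}
```

A probability of 0.999 corresponds to a log odds ratio of ln(999), about 6.9, which is the distance the log odds ratio must travel from 0 in the "0.500 -> 0.999" statistic.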

Given a tuple x(i), the log likelihood ratio (of x(i)) is:

l( x(i) ) = ln [ p(x(i)|A) / p(x(i)|B) ]
Thus
L = sum l( x(i) ) over all tuples + L0.
L0 is the a priori log odds ratio of A, that is, the log odds ratio of A before the tuples have been processed. L is the a posteriori log odds ratio of A, that is, the log odds ratio of A after the tuples have been processed. Hence positive values of l( x(i) ) increase L and suggest that text A is present, while negative values decrease L and suggest that text B is present.

The mean motion of the log likelihood ratio (in A) is:

E( l | A) = sum [ p(x(i)|A) l(x(i)) ] over all tuples x(i).
This is the expected change in the log odds ratio per tuple given that text A is present. The expected value of the final log odds ratio after n tuples when text A is present is L0 + n*E( l | A). Large positive values of E( l | A) indicate that each tuple moves the final log odds ratio strongly in favor of text A. Likewise, the mean motion of the log likelihood ratio in B is:
E( l | B) = sum [ p(x(i)|B) l(x(i)) ] over all tuples x(i).
This is the expected change in the log odds ratio per tuple given that text B is present. The expected value of the final log odds ratio after n tuples when text B is present is L0 + n*E( l | B). Large negative values of E( l | B) indicate that each tuple moves the final log odds ratio strongly in favor of text B.
The difference in mean motions of the log likelihood ratio, E( l | A) - E( l | B), tells how much, on the average, tuples can change our opinion of whether text A or text B is present.
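The formulas of this section can be collected into a short sketch. It is not the code used by Analyze. The class and method names are hypothetical, the counts in main are made up for illustration, and the compensation Analyze applies for elements outside the difference is omitted. Only tuples present in both texts are used, and probabilities are fractions of the counts within those tuples.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MeanMotionSketch {
    // Returns { E(l|A), E(l|B) } over the tuples present in both count maps.
    static double[] meanMotions(Map<String, Integer> countsA, Map<String, Integer> countsB) {
        double totalA = 0, totalB = 0;
        for (String t : countsA.keySet()) {
            if (countsB.containsKey(t)) {
                totalA += countsA.get(t);
                totalB += countsB.get(t);
            }
        }
        double eA = 0, eB = 0;
        for (String t : countsA.keySet()) {
            if (!countsB.containsKey(t)) continue;
            double pA = countsA.get(t) / totalA;
            double pB = countsB.get(t) / totalB;
            double l = Math.log(pA / pB);  // log likelihood ratio l(x)
            eA += pA * l;
            eB += pB * l;
        }
        return new double[] { eA, eB };
    }

    // Expected number of tuples to move from probability 0.5 (L0 = 0)
    // to the target probability, given the mean motion per tuple.
    static double expectedTuples(double meanMotion, double target) {
        double L = Math.log(target / (1.0 - target));
        return L / Math.abs(meanMotion);
    }

    public static void main(String[] args) {
        Map<String, Integer> a = new LinkedHashMap<>();  // made-up counts
        a.put("of ", 1);
        a.put("here ", 3);
        Map<String, Integer> b = new LinkedHashMap<>();  // made-up counts
        b.put("of ", 3);
        b.put("here ", 1);
        double[] e = meanMotions(a, b);
        System.out.println("E(l|A) = " + e[0]);
        System.out.println("E(l|B) = " + e[1]);
        System.out.println("Difference of the mean motions = " + (e[0] - e[1]));
        System.out.println("Tuples to reach 0.999 for A = " + expectedTuples(e[0], 0.999));
    }
}
```

With these made-up counts the two mean motions are equal and opposite, and their difference is ln 3.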


Creating and installing filters

Analyze comes with several useful filters, but some users might want other filters for their applications. Analyze lets users add their own filters, which must be written in the Java programming language. Filters can be added only when Analyze is run as an application, that is, from Analyze.jar.
Users familiar with the basics of Java programming will have no trouble constructing their own filters.
Run Analyze and press the "Help:Extract filters" command. The folder Analyze/Filter will appear in your ANALYZE directory, containing the files Alphanumeric32.class, Alphanumeric32.java, FilterCore.java, and FilterInterface.java. Alphanumeric32.class and Alphanumeric32.java are the class and source files for a filter that processes bytes containing the letters a-z and numbers 2-7. It provides an example of a user-designed filter. FilterCore.java contains the source for a superclass that simplifies user-designed filters. It is not necessary for a user-designed filter to use this superclass, however. All filters must satisfy the requirements specified in the Java interface FilterInterface, which is carefully documented.
Adding a filter
Before creating a filter, here's how filters are added to the program. Run Analyze from Analyze.jar and click on the "Filter:Add a filter" menu item. An "Add a filter" popup window should appear. Type "Analyze.Filter.Alphanumeric32" in the text field. This is the Java class name for the class in ANALYZE/Analyze/Filter/Alphanumeric32.class. Note that the directory separators are "." and that the extension ".class" does not appear. Push the "Set" button. A file choosing window labelled "Locate the .class file for Analyze.Filter.Alphanumeric32" will appear. Move to the directory ANALYZE/Analyze/Filter and click on Alphanumeric32.class. Pull down the "Filter" menu and observe that "Alphanumeric32" is now one of the available filters. Note that the Classpath variable doesn't need to be set to add a pre-existing class.

Deleting a filter
Activate the "Filter:Delete filters" menu item. A list of filters preceded by checkboxes will appear. Note that "Alphanumeric32" is present on the list and that its full Java class name and location are given. Check all the boxes and press "Delete". Note that all filters have been removed from the "Filter" menu. Don't panic. Activate the "Filter:Restore defaults" menu item and observe that the initial filters, but not "Alphanumeric32", are back on the list.
Creating a filter
The following discussion assumes a familiarity with the Java programming language and that the current Java Software Development Kit (SDK) is installed on your machine. The SDK is available for free at the same web site from which the JRE was obtained.

In Analyze, text elements are represented by Java objects of type Object, the most general type of object in the Java language. Filtering is the process of taking a byte array, byte[], and converting it into an ordered vector of objects in a Java Vector object. The objects in the Vector can be defined any way desired and don't even need to be of the same class. In processing, the objects in the Vector are treated as type Object and compared based on their toString() methods. That is, two objects are equal if the Strings returned by their toString() methods compare as equal under the String.compareTo() method.

The program provides the byte array to the filter by calling setArray(byte[] InputArray, int Length). The filter returns the user-filled Vector when the program later calls the getVector() method.

Each text is represented by the Java Vector described above. All the texts are contained in a Java array of vectors, Vector[], of dimension 5. Particular elements in the data are described by their text number, the index of the Vector corresponding to the text in the Vector[], and their index, their index in that Vector. The getPosition(Object[] Data, int Text, int Index) method returns a String, the position string, identifying the position of the text element at index Index in text Text. This can be a simple combination of the provided Text and Index values as in Alphanumeric32 and as implemented by FilterCore.java. Alternatively, the method can access the specified object and construct the String from its contents, as in "WHIBHS".

The three critical filter methods are setArray, getVector, and getPosition. Other methods describe the filter and set the output format and are trivially implemented. The super class FilterCore.java simplifies implementing these methods.

In designing a filter, the user must determine if the text elements are alphabetic or non-alphabetic. If text elements are from a finite alphabet, the method getAlphabetSize() should return the size of the alphabet. Otherwise it should return zero, indicating the text is non-alphabetic. Alphabetic filters should return false from isPrunable() to avoid the possibility of calculating randomness statistics on pruned data.
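The methods described above can be sketched as a small stand-alone class. This is not a working Analyze filter: it does not implement Analyze.Filter.FilterInterface, whose exact signatures and remaining methods must be taken from the extracted FilterInterface.java. The class name, the generic Vector type, and the position string format are assumptions made for illustration. Only the method names follow the text.

```java
import java.util.Vector;

class LowercaseLetterFilterSketch {
    private final Vector<Object> elements = new Vector<>();

    // Converts the first 'length' bytes into elements, keeping only
    // the letters a-z (upper case is folded to lower case).
    public void setArray(byte[] inputArray, int length) {
        elements.clear();
        for (int i = 0; i < length; i++) {
            char c = Character.toLowerCase((char) (inputArray[i] & 0xFF));
            if (c >= 'a' && c <= 'z') {
                elements.add(String.valueOf(c));
            }
        }
    }

    // Returns the ordered elements produced by setArray.
    public Vector<Object> getVector() {
        return elements;
    }

    // Simple position string built from the text number and element index
    // (hypothetical format; the data argument is not used here).
    public String getPosition(Object[] data, int text, int index) {
        return text + ":" + index;
    }

    // The elements come from a finite alphabet of 26 letters.
    public int getAlphabetSize() {
        return 26;
    }

    // Alphabetic filters should not be pruned.
    public boolean isPrunable() {
        return false;
    }
}
```

Because elements are compared through toString(), storing each letter as a String is sufficient here; a richer element class would only need a suitable toString() method.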

The setFrame(Frame F) method provides a Java Frame in which to put a graphical user interface, if desired. It is called before the setArray method.

Carefully study the example sources, particularly the FilterInterface specification, before writing your own filter.
Assuming you're really ready, here's the procedure to create your own class. Set the Java Classpath variable so that it points to your working directory ANALYZE, not to the extracted Analyze subdirectory. Compile the FilterCore.java and FilterInterface.java sources to form their .class files.

Create a directory such as MyFilters in the ANALYZE directory. Write your filter as a Java class that satisfies the FilterInterface specification given by its class name Analyze.Filter.FilterInterface. Make your filter an extension of the Analyze.Filter.FilterCore class if you want. Suppose your filter class is in ANALYZE/MyFilters/Filter1.java with a class name MyFilters.Filter1. Then it can be compiled and added to the filter menu with the "Filter:Add a filter" command as with the Alphanumeric32.


Output file types

Data
Data from Analyze is placed in XML files for immediate viewing by the browser and for later study with a browser. Knowing the nomenclature and positions of these files will often save re-running the program.

Suppose the output directory is ANALYZE and that the name of the present project is MyProject. Ten file types are potentially produced by Analyze and are given by the following table. (The file type is given in the fourth line down from the upper left corner of each browser page. For example, in a Summary.xml file, the line reads "Analyze.xsl : /Summary".)

Location                         File name                     Contents
ANALYZE/MyProject                Summary.xml                   Overall summary of data, starting point for histograms.
ANALYZE/MyProject                Histogram.x.xml               Histogram of x-tuples.
ANALYZE/MyProject                Singletons.xofy.xml           Singletons in y separate files, this is the x-th file.
ANALYZE/MyProject/Text           Text.x.y.xml                  Display of text x centered on element y.
ANALYZE/MyProject/Tuples         Search.xml                    Results of a "Tuple" search without the "Require exact match" checkbox selected. This is a temporary file that is overwritten with every such search.
ANALYZE/MyProject/Tuples         Tuple.x.y.z.xml               Results of a "Tuple" search with the "Require exact match" checkbox selected. This gives the results for an x-tuple found in element z of text y.
ANALYZE/MyProject/Tuples         Singleton.x.y.xml             Results of a "Tuple" search which found a singleton, element x of text y.
ANALYZE/MyProject/Comparison.xy  Comparison.uv.xml             Summary of the comparison of texts with comparison code "uv".
ANALYZE/MyProject/Comparison.xy  ComparisonHistogram.uv.z.xml  Histogram of z-tuples from the comparison of texts with comparison code "uv".
ANALYZE/MyProject/Difference     Difference.uv.x.y.xml         Difference statistics for texts u and v, with maximum tuple size x and with minimum tuple count y.
ANALYZE/**                       Analyze.xsl.xml               XSL file controlling the formatting of the above XML files. This file is placed in all subdirectories. With expertise and care, this file can be modified to format the above results differently.

Other
Analyze places other files on your machine, either automatically or as a result of an Analyze command. You may want to know what they do. Suppose that your user home directory is USERHOME and that the output directory of Analyze is ANALYZE as before.

Location                 File name  Contents
USERHOME                 Analyze.X.Y.ini  Initialization file for Analyze containing your current settings. If this file isn't found, a new, default version will be written when Analyze is run.
ANALYZE/Instructions     Instructions.html  This documentation, installed and displayed by the "Help:Instructions" command.
ANALYZE/Examples         Gettysburg.txt, IHaveADream.txt, AlphabeticRandom.txt, AlphabeticRandom2.txt, Genesis.xml  Example files installed by the "Help:Extract examples" command.
ANALYZE/Analyze/Filter   Alphanumeric32.class, Alphanumeric32.java, FilterCore.java, FilterInterface.java  Java files to create user filters installed by the "Help:Extract filters" command.

© C. V. Kimball 2011