A collection of essays is to be accompanied by visualizing graphics for its readers. The graphics should serve both as a thematic and structural overview of each text and as a means of placing the essay in question in relation to the other essays in the book. They should be both an abbreviation of the text and the key to decoding the complex issues under discussion.
The difficulty in developing appropriate graphics arises from the level of discussion of the key themes. In some cases the relation between essays is apparent, since a few of the essays have close thematic links. However, others focus so strongly on one topic that they more or less stand alone. In yet other cases, common keywords allow for surprising cross-references between two seemingly unrelated essays. The challenge is to find forms of graphical and/or typographical representation of the essays that are both appealing and informative. Care must be taken that the reader is not overwhelmed by either the number or the scope of the graphics. At the same time, each graphic must sufficiently summarise the text, and, in the interest of cohesion, each must be generated according to the same basic principles. Taking all of these issues into account, we have attempted to create a system which automatically generates graphics according to predefined rules.
A data graphic – like a bar chart – depicts quantifiable data. Any automated generation of content graphics requires an automated, quantifiable transformation of the (mostly textual) information in the essays. Unlike human beings, our current computing systems can "read" or interpret essays only inadequately, which means that, in our work, we have had to rely exclusively on the data presented in each of the texts, making no allowances for data that might be derived from interpretative reading or association.
A significant constraint in developing appropriate graphics arises from the manner of data collection. In what follows, we would like to provide a brief overview of all the types of data under consideration.
It is possible to divide data extracted from essays into two main groups: data that must be collected "manually" (in our case, using human intelligence), and data that can be captured automatically by machine intelligence.
Keywords are terms that identify an essay in a special way or that have increased significance for it. If a person researches a subject using one of these keywords, the search will produce a list of results that should include the essay to which the keyword is attached. Keywords may already exist in an essay or be added to it. In either case, they represent a special marker set manually by an author or cataloguer. Keywords are subjective – they reflect the specific perspective and background of the person who declares them.
Examples of essay metadata are the essay’s author, the time period in which an essay was written, the essay language, the number or type of images it uses, the text genre, the intended audience, the essay length (in characters, words, sentences, sections, pages), file format, typeface, etc. We find that it is possible to automate the collection of some of these metadata. For example, the language in which an article is written can be determined by comparing the words used in the article with the vocabularies of known languages. If the article contains words associated with the region where Romanian is spoken, it was most likely written in Romanian [ad.01]. If the words are unknown to the computer, it will not be able to determine the language.
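The word-list comparison described above can be sketched roughly as follows. This is only a minimal illustration, not the procedure used in the project: the function name guess_language and the tiny vocabularies in KNOWN_WORDS are assumptions made for the example, and a real system would need far larger word lists.

```python
import re

# Hypothetical, deliberately tiny vocabularies (assumption for illustration).
KNOWN_WORDS = {
    "english": {"the", "and", "is", "of", "that", "it"},
    "german": {"der", "und", "ist", "das", "nicht", "ein"},
    "romanian": {"este", "care", "pentru", "sau", "nu", "mai"},
}

def guess_language(text, known_words=KNOWN_WORDS):
    """Return the language whose word list matches the most tokens, or None."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        language: sum(1 for token in tokens if token in vocabulary)
        for language, vocabulary in known_words.items()
    }
    best = max(scores, key=scores.get)
    # If no token is known at all, the language cannot be determined.
    return best if scores[best] > 0 else None
```

A text that shares no words with any list yields None, which corresponds to the case in which the words are unknown to the computer.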
For the most part, the time period during which an essay was written can be retrieved automatically – provided that the text was written on a computer and the word-processing program kept track of the writing sessions. If images are available in digital form, it is also possible to automate the processing of image metadata, since in such cases image file formats, dimensions, color depths and resolutions are known. However, image content is meta-information that needs to be classified manually. Consequently, classification of metadata as manual or automatic must be performed on a case-by-case basis. Although the length of a text is meta-information, it can be determined automatically in such great detail that it transitions to the domain of automatically collectable metadata.
An essay is usually stored electronically in the form of alphanumerical character sequences, which computers can easily process. Consequently, data collection can be easily automated when the data pertains to the number of characters, words, sentences, sections and pages in an essay, or to the frequency of words and characters (all other frequencies generate insignificant results). However, this data has only limited significance for discovering the content of the essay. Statistical analyses are much more useful for comparing several essays and their relations. Essay statistics can be compiled in order to compare between or among essays, to assess one essay against several others, or to analyze the entire essay collection or book.
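As a rough sketch of how such counts might be gathered, assuming a plain-text essay and very simple rules for what counts as a word or a sentence (the name essay_statistics and both regular expressions are assumptions for this example):

```python
import re
from collections import Counter

def essay_statistics(text):
    """Collect simple, automatically countable statistics for one essay."""
    # Assumption: a word is any run of letters or digits, compared in lower case.
    words = re.findall(r"\w+", text.lower())
    # Assumption: sentences end at '.', '!' or '?'; abbreviations are not handled.
    sentences = [part for part in re.split(r"[.!?]+", text) if part.strip()]
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
        "word_frequency": Counter(words),
    }
```

Running the same function over every essay in the collection gives comparable figures, which is what the comparison between essays relies on.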
Computer-linguistic processes can be used to evaluate grammatical and structural information in an essay. However, they require that the grammar of the language be well known and that the author of the text actually use it. A collection of purely associative words, even if they represent a statement of sorts, cannot be analyzed linguistically. And even if a linguistic analysis yields results, these results say little about the subject of the essay. They are better suited to understanding text genre or the complexity of a given author’s style [ad.02].
We have concentrated on data that was easy to collect automatically because otherwise too much subjective handwork would have been needed before graphics could be generated. The frequency of words both in an essay and in comparison to other essays provided us with a basis for graphics generation. In order to create detailed graphics with the greatest possible thematic significance, we disregarded metadata and focused solely on word frequency. In order to remove from the statistics an essay’s non-thematic filler words (it, and, one, is, that, etc.), we used a manually compiled stop word list as a filter in the preliminary stage of the project. The subjective influence of such a list is much smaller than that of manually assigning keywords.
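The stop word filtering can be illustrated with a short sketch. The stop word set below contains only the five filler words named above plus a few more added for the example; the project's actual list is not reproduced here, and the function name thematic_frequencies is an assumption.

```python
import re
from collections import Counter

# Illustrative subset only; the first five words are the ones named in the text,
# the rest are assumptions added for the example.
STOP_WORDS = {"it", "and", "one", "is", "that", "the", "a", "of"}

def thematic_frequencies(text, stop_words=STOP_WORDS):
    """Count word frequencies in an essay after removing filler words."""
    words = re.findall(r"\w+", text.lower())
    return Counter(word for word in words if word not in stop_words)
```

Applied to "The city is a network and the network is the city", only "city" and "network" remain, each with a count of two.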
The identification of a "secondary author" was another vital criterion for the development of suitable visualizing graphics. For each examined word in an essay, the secondary author was identified as the author who, among all other essays, uses that word most frequently in his or her essay. Each examined word in the essay in question thus establishes a thematic link (at least in statistical terms) to another essay by referencing the essay in which the word has the greatest importance apart from the one in question.
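One possible reading of this rule is sketched below. It assumes one essay per author, word frequencies that have already been counted, and raw counts as the measure of "most frequently"; normalising by essay length would be an equally plausible variant. The function name secondary_authors and the data layout are assumptions for the example.

```python
def secondary_authors(essays, author):
    """Map each word of `author`'s essay to its secondary author.

    `essays` maps an author name to a dict of word -> count for that
    author's essay (assumption: exactly one essay per author).
    """
    result = {}
    for word in essays[author]:
        best_author, best_count = None, 0
        for other, frequencies in essays.items():
            if other == author:
                continue  # the essay in question never links to itself
            count = frequencies.get(word, 0)
            if count > best_count:
                best_author, best_count = other, count
        # None means that no other essay uses the word at all.
        result[word] = best_author
    return result
```

For example, if author A uses "city" three times, B five times and C once, the secondary author of "city" in A's essay is B. Ties go to whichever essay is encountered first, which is a simplification.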