RAA - textanalyzer

textanalyzer / 0.1

Short description: Extract important words in a text
Category: Library/textproc
Status: prototype
Created: 2004-09-21 20:20:57 GMT
Last update: 2004-09-21 20:20:57 GMT
Owner: Martin Ankerl
Homepage: http://martin.ankerl.com/
Download: http://martin.ankerl.com/files/textanalyze.rb
License: PublicDomain
Dependency:
None
Description:

TextAnalyzer - Automatically Extracts Characteristic Words

TextAnalyzer is a text analysis tool that finds the words that are characteristic of a given input file. It is language-independent and even seems to work well with HTML files.

This program is only a small prototype that shows that the technique seems to work. It’s public domain, so feel free to do whatever you like with it.

Example

  1. Build an index from a reasonably large amount of data; it should be much larger than the text you want to analyze. For example, I indexed 76 of Grimm’s fairy tales with this command:
        % cat *.txt | ruby ../textanalyze.rb c
    

    This creates the file wordcount.dat, which contains the number of occurrences of each word.

  2. To find out which words are characteristic for a text, analyze it against the previously gathered data. For example:
        % cat "Little Red Riding Hood.txt" | ruby ../textanalyze.rb a
    

    This produces the output

        hood, grandma, riding, hunter, red
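
To make step 1 more concrete, here is a minimal sketch of how per-word counts could be gathered in Ruby. It is not taken from the actual script: the helper name count_words and the way words are split (lowercased runs of letters) are assumptions, and the real format of wordcount.dat is not shown here.

```ruby
# Hypothetical helper (not from the original script): count how often
# each word occurs in a string.
def count_words(text)
  counts = Hash.new(0)
  # Assumption: a "word" is any run of letters, compared case-insensitively.
  text.downcase.scan(/[[:alpha:]]+/) { |word| counts[word] += 1 }
  counts
end

# Build counts for a tiny stand-in corpus.
corpus_counts = count_words("The wolf met the girl. The girl wore a red hood.")
corpus_counts.each { |word, count| puts "#{word}: #{count}" }
```

In practice the counts would be built from a much larger body of text and saved to disk, so that the analysis step can reuse them.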
    

Other Uses

The previous example may look like a toy, but the technique certainly has many useful applications. Here are some ideas:

  • Quickly find out what an unknown text is about
  • Automatically extract important words from blog entries
  • Find out what a text is about by reading just 5 words
  • Automatically create very short descriptions for a large number of documents

The currently implemented algorithm even works well with HTML files (to my own surprise; actually, I am surprised that it works at all…).

Algorithm

The main idea is quite simple: the algorithm assumes that important words are:

  1. Often used in the to-be-analyzed text
  2. Seldom used in other texts

For example, the second condition ensures that words like ‘the’, ‘and’ etc. are not considered important.
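
The two conditions can be sketched in Ruby as a simple ratio: how frequent a word is in the analyzed text, divided by how frequent it is in the reference data. This is only one possible reading of the idea; the exact scoring used by the prototype is not described here, so the helper names (word_counts, characteristic_words), the add-one smoothing, and the tie-breaking are all assumptions.

```ruby
# Hypothetical helper: count how often each word occurs in a string.
def word_counts(text)
  counts = Hash.new(0)
  text.downcase.scan(/[[:alpha:]]+/) { |word| counts[word] += 1 }
  counts
end

# Hypothetical scoring (an assumption, not the prototype's formula):
# relative frequency in the text divided by the smoothed relative
# frequency in the reference corpus. Frequent here + rare elsewhere
# gives a high score.
def characteristic_words(text, corpus_counts, top = 5)
  text_counts  = word_counts(text)
  text_total   = text_counts.values.sum.to_f
  corpus_total = corpus_counts.values.sum.to_f

  scores = text_counts.map do |word, count|
    text_freq   = count / text_total
    # Add-one smoothing so words missing from the corpus do not divide by zero.
    corpus_freq = (corpus_counts.fetch(word, 0) + 1) / (corpus_total + 1)
    [word, text_freq / corpus_freq]
  end

  # Highest score first; ties are broken alphabetically.
  scores.sort_by { |word, score| [-score, word] }.first(top).map(&:first)
end

corpus = word_counts("the cat and the dog and the bird the the and")
puts characteristic_words("the hood and the red hood the hood", corpus, 2).join(", ")
```

Because ‘the’ and ‘and’ are common in the reference counts, their ratio stays low even though they occur often in the analyzed text.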
