Rediscovering My Adjacency Finder

My adjacency finder project came up in conversation today, which reminded me that I had written such a thing, so I decided to play with it again.

Here is the sample text (I've separated it from the code because my JS-based Ruby parser can't handle heredocs):

                  
                  checktext = <<-EOF
                  
                  TODAY'S NEWS
                  
                  (By Alex Wong -- Getty Images)
                    Enlarge Photo    
                  TOOLBOX
                   Resize Text
                   Save/Share +
                  Print This E-mail This
                  COMMENT 
                  washingtonpost.com readers have posted 1 comment about this item.
                  View All Comments »
                  POST A COMMENT
                  You must be logged in to leave a comment. Log in | Register
                   Why Do I Have to Log In Again?
                  
                  
                  
                   Discussion Policy
                  WHO'S BLOGGING
                  » Links to this article
                  Friday, June 27, 2008; Page C12
                  High Court Overturns District's Handgun Ban
                  
                  · The U.S. Supreme Court yesterday overturned the District of 
                  Columbia's 32-year-old ban on handguns. In a 5-4 decision, 
                  the justices ruled that Americans have a right to own guns to 
                  defend themselves in their homes.
                  
                  The Second Amendment to the Constitution says, "A well regulated 
                  militia, being necessary to the security of a free state, the 
                  right of the people to keep and bear arms, shall not be infringed." 
                  The issue before the court was whether that language, 
                  adopted in 1791, protected an individual's right to own guns 
                  or was somehow tied to service in a state militia (military force).
                  
                  The District's law, among the strictest in the nation, in part barred 
                  residents from owning handguns unless they had one before the 
                  law took effect in 1976.
                  
                  In its first major statement on individual gun rights in U.S. history, 
                  the high court said yesterday that the Second Amendment gives 
                  residents the right to defend themselves at home.
                  
                  Although ruling that a total ban on handguns is unconstitutional, 
                  the justices said that some gun restrictions (such as keeping 
                  firearms out of schools and forbidding felons and mentally ill 
                  people from having them) are allowed.
                  
                  Read more about this on A1.
                  
                  EOF
                  
                  
                  

And the code to run it:

                  
                  require 'adjacency_finder'
                  #require 'pp'
                  require 'rubygems'
                  
                  checktext = "..." # From above
                  
                  t = VZV::AdjacencyFinder.new(checktext)
                  common_words = []
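                  # Load the common-words list, one word per line, stripped down to letters and digits.
                  # File.foreach closes the file when it finishes reading.
                  File.foreach("common_words.txt") do |line|
                    common_words << line.gsub(/[^A-Za-z0-9]/, "")
                  end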
                  
                  t.duplicate_words.each do |word, node_collection|
                    # Skip words with a ratio above 0.7 and anything on the common-words list
                    next if node_collection.frequency_to_average_distance > 0.7 || common_words.include?(word)
                  
                    data = []
                    use_data = []
                    position_index = (t.average_positioning(word) * 100).to_i
                    data[position_index] = node_collection.frequency_to_average_distance
                  
                  
                  
                    data.each do |n|
                      use_data << (n.nil? ? 0 : n)
                    end
                  
                    puts "Word: #{word}"
                    puts "AvPos: #{t.average_positioning(word)}"
                    puts "AvDis: #{node_collection.average_distance}"
                    puts "Frequency to Average Distance: #{node_collection.frequency_to_average_distance}"
                    puts 
                  end
                  
                  

And the output:

                  
                  Word: gun
                  AvPos: 0.761029411764706
                  AvDis: 38.0
                  Frequency to Average Distance: 0.0263157894736842
                  
                  Word: defend
                  AvPos: 0.378676470588235
                  AvDis: 123.0
                  Frequency to Average Distance: 0.00813008130081301
                  
                  Word: guns
                  AvPos: 0.371323529411765
                  AvDis: 59.0
                  Frequency to Average Distance: 0.0169491525423729
                  
                  Word: law
                  AvPos: 0.643382352941177
                  AvDis: 20.0
                  Frequency to Average Distance: 0.05
                  
                  Word: handguns
                  AvPos: 0.441176470588235
                  AvDis: 100.666666666667
                  Frequency to Average Distance: 0.0298013245033113
                  
                  Word: yesterday
                  AvPos: 0.283088235294118
                  AvDis: 139.0
                  Frequency to Average Distance: 0.00719424460431655
                  
                  Word: log
                  AvPos: 0.154411764705882
                  AvDis: 8.0
                  Frequency to Average Distance: 0.125
                  
                  Word: ban
                  AvPos: 0.279411764705882
                  AvDis: 108.666666666667
                  Frequency to Average Distance: 0.0276073619631902
                  
                  Word: court
                  AvPos: 0.307598039215686
                  AvDis: 84.6666666666667
                  Frequency to Average Distance: 0.0708661417322835
                  
                  Word: districts
                  AvPos: 0.257352941176471
                  AvDis: 104.0
                  Frequency to Average Distance: 0.00961538461538462
                  
                  Word: amendment
                  AvPos: 0.404411764705882
                  AvDis: 110.0
                  Frequency to Average Distance: 0.00909090909090909
                  
                  Word: themselves
                  AvPos: 0.382352941176471
                  AvDis: 123.0
                  Frequency to Average Distance: 0.00813008130081301
                  
                  Word: residents
                  AvPos: 0.680147058823529
                  AvDis: 37.0
                  Frequency to Average Distance: 0.027027027027027
                  
                  Word: militia
                  AvPos: 0.433823529411765
                  AvDis: 52.0
                  Frequency to Average Distance: 0.0192307692307692
                  
                  Word: comment
                  AvPos: 0.0790441176470588
                  AvDis: 13.5
                  Frequency to Average Distance: 0.444444444444444
                  
                  Word: justices
                  AvPos: 0.338235294117647
                  AvDis: 149.0
                  Frequency to Average Distance: 0.00671140939597315
                  
                  Word: state
                  AvPos: 0.466911764705882
                  AvDis: 42.0
                  Frequency to Average Distance: 0.0238095238095238
                  
                  

I'd say it's doin' something right.

I've reposted my original project announcement from my old blog at the end of this post.

Contents below originally posted on October 5, 2007 at Vazav.com, before it was killed by a WordPress exploit:

Last night I was aglow with thoughts. For some reason, while just on the edge of sleep, I was able to visualize the process of analyzing text. Now, for most of you, this is probably a fairly simple procedure, but my skills in algorithms and matrix theory aren't quite as great as they should be, so for me this was a minor revelation.

So this morning I wrote a small library for picking apart text. Given a sample of text, the library returns information such as the average positioning of "interesting" words, the frequency of those words, and the ratio of a word's frequency to the average distance between its occurrences. The math is pretty simple... but the fun part is what you can deduce from this.

For instance, if a subject has a very high frequency-to-average-distance ratio (F:AD), the subject is mentioned repeatedly in a very small space. If the average positioning (AP) is close to zero, the subject tends to appear near the beginning of the text; if it's closer to one, it tends to appear towards the end. If the F:AD ratio is high and the AP is around 0.5, the subject is dense near the middle.
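
To make those two measures concrete, here is a minimal sketch of how they could be computed for a single word. It is not the adjacency_finder source: `word_metrics` is a hypothetical helper, positions are assumed to be word indexes, distance is assumed to be the gap between consecutive occurrences, and frequency is assumed to be a plain occurrence count. The library may well define these differently, so the numbers will not necessarily match its output.

```ruby
# Hypothetical helper, not part of adjacency_finder.
# Assumptions: positions are word indexes, distance is the gap between
# consecutive occurrences, and frequency is the occurrence count.
def word_metrics(text, word)
  # Lowercase the text and split it into simple word tokens.
  tokens = text.downcase.scan(/[a-z0-9']+/)
  positions = tokens.each_index.select { |i| tokens[i] == word }
  return nil if positions.size < 2 # a word needs at least one gap

  # Gaps between consecutive occurrences, measured in words.
  gaps = positions.each_cons(2).map { |a, b| b - a }
  average_distance = gaps.sum.to_f / gaps.size

  {
    # Near 0.0 means early in the text, near 1.0 means late.
    average_position: positions.sum.to_f / positions.size / tokens.size,
    average_distance: average_distance,
    # Assumed definition: occurrence count divided by the average gap.
    frequency_to_average_distance: positions.size / average_distance
  }
end

sample = "gun law gun filler filler filler filler filler filler law"
p word_metrics(sample, "gun")
p word_metrics(sample, "law")
```

In this toy sample, "gun" is packed into the first few words, so it gets a low average position and a high ratio, while "law" is spread across the whole text and gets an average position of 0.5 with a much lower ratio.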

I've attached a graph with an analysis run ignoring common words. Fun stuff.

Contents below originally posted on October 6, 2007 at Vazav.com:

I ran a sample essay through my analyzer and got some pretty good numbers. I then generated a graph to help show me what those numbers meant (my first time using Gruff... not a lot of docs out there). I figured an essay was a good choice because essays are meant to have a certain structure, and I was interested in seeing what that looked like.

Typically, an essay has an intro, a body, and a conclusion. Most people will repeat the intro material in the conclusion, and have sprinklings of information throughout the main body. Those sprinklings should be related to the argument being presented.

I chose the K-Means method for determining clusters. Since there wasn't a good implementation around in Ruby, I wrote my own. The results were satisfactory.
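
For readers who haven't met K-Means before, the idea can be sketched in a few lines of Ruby. This is not the implementation mentioned above, which isn't shown in the post; it is a minimal one-dimensional version of the standard algorithm. It assumes the values being clustered are single numbers such as average positions, and `kmeans_1d`, its parameters, and its evenly spread starting centroids are all illustrative choices.

```ruby
# Minimal one-dimensional k-means sketch (standard Lloyd iteration).
# Illustrative only: the function name, parameters, and the evenly spread
# starting centroids are assumptions, not the post's actual code.
def kmeans_1d(values, k, iterations: 20)
  sorted = values.sort
  # Deterministic start: pick k values spread evenly across the sorted data.
  centroids = Array.new(k) { |i| sorted[(i * (sorted.size - 1)) / [k - 1, 1].max] }

  iterations.times do
    # Assignment step: group each value under the index of its nearest centroid.
    clusters = values.group_by do |v|
      centroids.each_index.min_by { |i| (v - centroids[i]).abs }
    end

    # Update step: move each centroid to the mean of its members.
    # A centroid with no members stays where it is.
    updated = centroids.each_with_index.map do |centroid, i|
      members = clusters[i]
      members ? members.sum.to_f / members.size : centroid
    end

    break if updated == centroids # converged, nothing moved
    centroids = updated
  end

  centroids
end

# Made-up average positions with three visible groups.
positions = [0.05, 0.08, 0.1, 0.45, 0.5, 0.55, 0.9, 0.95]
p kmeans_1d(positions, 3)
```

With made-up positions like these, the three centroids settle near the start, middle, and end of the text, which is roughly the intro, body, and conclusion structure described above.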

Essay Analyzed

Generated Graph for English Paper

Run Results

The source is available here.