Rediscovering My Adjacency Finder

My adjacency finder project came up in conversation today, which reminded me that I had written such a thing, so I decided to play with it again.

Here's the sample text (I've separated it from the code because my JS-based Ruby parser can't handle heredocs):

                
                checktext = <<-EOF
                
                TODAY'S NEWS
                
                (By Alex Wong -- Getty Images)
                  Enlarge Photo    
                TOOLBOX
                 Resize Text
                 Save/Share +
                Print This E-mail This
                COMMENT 
                washingtonpost.com readers have posted 1 comment about this item.
                View All Comments »
                POST A COMMENT
                You must be logged in to leave a comment. Log in | Register
                 Why Do I Have to Log In Again?
                
                
                
                 Discussion Policy
                WHO'S BLOGGING
                » Links to this article
                Friday, June 27, 2008; Page C12
                High Court Overturns District's Handgun Ban
                
                · The U.S. Supreme Court yesterday overturned the District of 
                Columbia's 32-year-old ban on handguns. In a 5-4 decision, 
                the justices ruled that Americans have a right to own guns to 
                defend themselves in their homes.
                
                The Second Amendment to the Constitution says, "A well regulated 
                militia, being necessary to the security of a free state, the 
                right of the people to keep and bear arms, shall not be infringed." 
                The issue before the court was whether that language, 
                adopted in 1791, protected an individual's right to own guns 
                or was somehow tied to service in a state militia (military force).
                
                The District's law, among the strictest in the nation, in part barred 
                residents from owning handguns unless they had one before the 
                law took effect in 1976.
                
                In its first major statement on individual gun rights in U.S. history, 
                the high court said yesterday that the Second Amendment gives 
                residents the right to defend themselves at home.
                
                Although ruling that a total ban on handguns is unconstitutional, 
                the justices said that some gun restrictions (such as keeping 
                firearms out of schools and forbidding felons and mentally ill 
                people from having them) are allowed.
                
                Read more about this on A1.
                
                EOF
                
                
                

And the code to run it:

                
                require 'adjacency_finder'
                #require 'pp'
                require 'rubygems'
                
                checktext = "..." # From above
                
                t = VZV::AdjacencyFinder.new(checktext)
                
                # Load the common-word list, one word per line, stripping anything
                # that isn't a letter or digit (including the trailing newline).
                common_words = File.readlines("common_words.txt").map do |line|
                  line.gsub(/[^A-Za-z0-9]/, "")
                end
                
                t.duplicate_words.each do |word, node_collection|
                  # Skip words with a very high F:AD ratio and words on the common list.
                  next if node_collection.frequency_to_average_distance > 0.7 || common_words.include?(word)
                
                  # Sparse array holding the word's F:AD ratio at its position (0-100),
                  # with the gaps filled by zeros. Neither array is used by the output below.
                  data = []
                  position_index = (t.average_positioning(word) * 100).to_i
                  data[position_index] = node_collection.frequency_to_average_distance
                  use_data = data.map { |n| n.nil? ? 0 : n }
                
                  puts "Word: #{word}"
                  puts "AvPos: #{t.average_positioning(word)}"
                  puts "AvDis: #{node_collection.average_distance}"
                  puts "Frequency to Average Distance: #{node_collection.frequency_to_average_distance}"
                  puts 
                end
                
                

and the output:

                
                Word: gun
                AvPos: 0.761029411764706
                AvDis: 38.0
                Frequency to Average Distance: 0.0263157894736842
                
                Word: defend
                AvPos: 0.378676470588235
                AvDis: 123.0
                Frequency to Average Distance: 0.00813008130081301
                
                Word: guns
                AvPos: 0.371323529411765
                AvDis: 59.0
                Frequency to Average Distance: 0.0169491525423729
                
                Word: law
                AvPos: 0.643382352941177
                AvDis: 20.0
                Frequency to Average Distance: 0.05
                
                Word: handguns
                AvPos: 0.441176470588235
                AvDis: 100.666666666667
                Frequency to Average Distance: 0.0298013245033113
                
                Word: yesterday
                AvPos: 0.283088235294118
                AvDis: 139.0
                Frequency to Average Distance: 0.00719424460431655
                
                Word: log
                AvPos: 0.154411764705882
                AvDis: 8.0
                Frequency to Average Distance: 0.125
                
                Word: ban
                AvPos: 0.279411764705882
                AvDis: 108.666666666667
                Frequency to Average Distance: 0.0276073619631902
                
                Word: court
                AvPos: 0.307598039215686
                AvDis: 84.6666666666667
                Frequency to Average Distance: 0.0708661417322835
                
                Word: districts
                AvPos: 0.257352941176471
                AvDis: 104.0
                Frequency to Average Distance: 0.00961538461538462
                
                Word: amendment
                AvPos: 0.404411764705882
                AvDis: 110.0
                Frequency to Average Distance: 0.00909090909090909
                
                Word: themselves
                AvPos: 0.382352941176471
                AvDis: 123.0
                Frequency to Average Distance: 0.00813008130081301
                
                Word: residents
                AvPos: 0.680147058823529
                AvDis: 37.0
                Frequency to Average Distance: 0.027027027027027
                
                Word: militia
                AvPos: 0.433823529411765
                AvDis: 52.0
                Frequency to Average Distance: 0.0192307692307692
                
                Word: comment
                AvPos: 0.0790441176470588
                AvDis: 13.5
                Frequency to Average Distance: 0.444444444444444
                
                Word: justices
                AvPos: 0.338235294117647
                AvDis: 149.0
                Frequency to Average Distance: 0.00671140939597315
                
                Word: state
                AvPos: 0.466911764705882
                AvDis: 42.0
                Frequency to Average Distance: 0.0238095238095238
                
                

I'd say it's doin' something right.

I've reposted my original project announcement from my old blog at the end of this post.

Contents below originally posted on October 5, 2007 at Vazav.com, before it was killed by a WordPress exploit:

Last night I was aglow with thoughts. For some reason, while just on the edge of sleep, I was able to visualize the process of analyzing text. Now, for most of you, this is probably a fairly simple procedure, but my skills in algorithms and matrix theory aren't quite as great as they should be, so for me this was a minor revelation.

So this morning I wrote a small library for picking apart text. When given a sample of text, the library returns bits of information such as: the average positioning of "interesting" words, the frequency of those words, and the ratio of a word's frequency to the average distance between its occurrences. The math is pretty simple... but the fun part is what you can deduce from this.

For instance, if a subject has a very high frequency-to-average-distance (F:AD) ratio, that means that the subject is mentioned repeatedly in a very small space. If the average positioning (AP) is close to zero, then it appears at the beginning of the text. If it's closer to one, then it's towards the end. If the F:AD ratio is high, and the AP is around 0.5, then the subject is dense near the middle.
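
To make those two numbers concrete, here's a minimal Ruby sketch of how they could be computed. This is not the library's code: `word_stats` is a hypothetical helper, and it assumes that "frequency" is a plain occurrence count and that "distance" is the gap, in words, between consecutive occurrences. The library may define either one differently, so don't expect identical numbers.

```ruby
# Hypothetical helper for illustration; not part of adjacency_finder.
def word_stats(text)
  # Lowercase and strip punctuation so "Court" and "court," count as one word.
  words = text.downcase.gsub(/[^a-z0-9\s]/, "").split
  positions = Hash.new { |hash, word| hash[word] = [] }
  words.each_with_index { |word, index| positions[word] << index }

  stats = {}
  positions.each do |word, idxs|
    next if idxs.size < 2 # only repeated words have a distance to measure

    # Assumption: distance = gap between consecutive occurrences, in words.
    gaps = idxs.each_cons(2).map { |a, b| b - a }
    average_distance = gaps.sum.to_f / gaps.size

    stats[word] = {
      # Mean word index scaled to 0..1: near 0 is the start, near 1 the end.
      average_position: idxs.sum.to_f / idxs.size / words.size,
      average_distance: average_distance,
      # Assumption: frequency = number of occurrences.
      frequency_to_average_distance: idxs.size / average_distance
    }
  end
  stats
end

sample = "gun law gun law filler filler filler filler filler court"
word_stats(sample).each { |word, s| puts "#{word}: #{s.inspect}" }
```

In the sample string, "filler" is repeated back to back, so it gets the highest ratio, while "court" appears only once and is skipped.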

I've attached a graph with an analysis run ignoring common words. Fun stuff.

Contents below originally posted on October 6, 2007 at Vazav.com:

I ran a sample essay through my analyzer and got some pretty good numbers. I then generated a graph to help show me what those numbers meant (my first time using Gruff... not a lot of docs out there). I figured an essay was a good choice because essays are meant to have a certain structure, and I was interested in seeing what that looked like.

Typically, an essay has the intro, body, and conclusion. Most people will repeat the intro material in the conclusion, and have sprinklings of information throughout the main body. Those sprinklings should be related to the argument being presented.

I chose the K-Means method for determining clusters. Since there wasn't a good one around in Ruby, I wrote my own. The results were satisfactory.
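
For anyone who hasn't seen K-Means before, here's a minimal one-dimensional sketch in Ruby. It is not the implementation I wrote back then; `kmeans_1d` is a hypothetical name, the evenly spread starting centroids are a simplification (the usual approach picks them at random), and the input values are made up to look like average positions.

```ruby
# Minimal 1-D k-means (Lloyd's algorithm). A sketch, not the original code.
def kmeans_1d(values, k, max_iterations = 100)
  sorted = values.sort
  # Assumption: deterministic start, spreading centroids across the sorted data.
  centroids = Array.new(k) do |i|
    sorted[(i * (sorted.size - 1)) / [k - 1, 1].max].to_f
  end

  clusters = []
  max_iterations.times do
    # Assignment step: each value joins its nearest centroid.
    clusters = Array.new(k) { [] }
    values.each do |v|
      nearest = (0...k).min_by { |i| (v - centroids[i]).abs }
      clusters[nearest] << v
    end

    # Update step: move each centroid to the mean of its cluster.
    updated = clusters.each_with_index.map do |cluster, i|
      cluster.empty? ? centroids[i] : cluster.sum.to_f / cluster.size
    end
    break if updated == centroids # converged: nothing moved
    centroids = updated
  end

  [centroids, clusters]
end

# Made-up average positions: a group near the intro, the middle, and the end.
positions = [0.05, 0.08, 0.10, 0.48, 0.50, 0.52, 0.90, 0.93]
centroids, clusters = kmeans_1d(positions, 3)
clusters.each_with_index do |cluster, i|
  puts "Cluster #{i} (center #{centroids[i].round(3)}): #{cluster.inspect}"
end
```

With three clusters, the made-up positions split into an intro group, a body group, and a conclusion group, which is the kind of shape I'd expect from an essay.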

Essay Analyzed

Generated Graph for English Paper

Run Results

The source is available here.
