Rediscovering My Adjacency Finder
After my adjacency finder project came up in conversation today, I suddenly remembered I wrote such a thing, and decided to play with it again.
Here is the sample text (I've separated it from the code because my JS-based Ruby parser can't handle heredocs):
checktext = <<-EOF
TODAY'S NEWS
(By Alex Wong -- Getty Images)
Enlarge Photo
TOOLBOX
Resize Text
Save/Share +
Print This E-mail This
COMMENT
washingtonpost.com readers have posted 1 comment about this item.
View All Comments »
POST A COMMENT
You must be logged in to leave a comment. Log in | Register
Why Do I Have to Log In Again?
Discussion Policy
WHO'S BLOGGING
» Links to this article
Friday, June 27, 2008; Page C12
High Court Overturns District's Handgun Ban
· The U.S. Supreme Court yesterday overturned the District of
Columbia's 32-year-old ban on handguns. In a 5-4 decision,
the justices ruled that Americans have a right to own guns to
defend themselves in their homes.
The Second Amendment to the Constitution says, "A well regulated
militia, being necessary to the security of a free state, the
right of the people to keep and bear arms, shall not be infringed."
The issue before the court was whether that language,
adopted in 1791, protected an individual's right to own guns
or was somehow tied to service in a state militia (military force).
The District's law, among the strictest in the nation, in part barred
residents from owning handguns unless they had one before the
law took effect in 1976.
In its first major statement on individual gun rights in U.S. history,
the high court said yesterday that the Second Amendment gives
residents the right to defend themselves at home.
Although ruling that a total ban on handguns is unconstitutional,
the justices said that some gun restrictions (such as keeping
firearms out of schools and forbidding felons and mentally ill
people from having them) are allowed.
Read more about this on A1.
EOF
And the code to run it:
require 'adjacency_finder'
# require 'pp'
require 'rubygems'

checktext = "..." # From above
t = VZV::AdjacencyFinder.new(checktext)

# Load the common-word list, one word per line, stripping punctuation and newlines.
common_words = []
File.foreach("common_words.txt") do |line|
  common_words << line.gsub(/[^A-Za-z0-9]/, "")
end

t.duplicate_words.each do |word, node_collection|
  # Skip words above the 0.7 ratio cutoff and anything in the common-word list.
  next if node_collection.frequency_to_average_distance > 0.7 || common_words.include?(word)

  # Bucket the ratio by position (0 to 100). This is not used by the printed output below.
  data = []
  use_data = []
  position_index = (t.average_positioning(word) * 100).to_i
  data[position_index] = node_collection.frequency_to_average_distance
  data.each do |n|
    use_data << (n.nil? ? 0 : n)
  end

  puts "Word: #{word}"
  puts "AvPos: #{t.average_positioning(word)}"
  puts "AvDis: #{node_collection.average_distance}"
  puts "Frequency to Average Distance: #{node_collection.frequency_to_average_distance}"
  puts
end
And the output:
Word: gun
AvPos: 0.761029411764706
AvDis: 38.0
Frequency to Average Distance: 0.0263157894736842
Word: defend
AvPos: 0.378676470588235
AvDis: 123.0
Frequency to Average Distance: 0.00813008130081301
Word: guns
AvPos: 0.371323529411765
AvDis: 59.0
Frequency to Average Distance: 0.0169491525423729
Word: law
AvPos: 0.643382352941177
AvDis: 20.0
Frequency to Average Distance: 0.05
Word: handguns
AvPos: 0.441176470588235
AvDis: 100.666666666667
Frequency to Average Distance: 0.0298013245033113
Word: yesterday
AvPos: 0.283088235294118
AvDis: 139.0
Frequency to Average Distance: 0.00719424460431655
Word: log
AvPos: 0.154411764705882
AvDis: 8.0
Frequency to Average Distance: 0.125
Word: ban
AvPos: 0.279411764705882
AvDis: 108.666666666667
Frequency to Average Distance: 0.0276073619631902
Word: court
AvPos: 0.307598039215686
AvDis: 84.6666666666667
Frequency to Average Distance: 0.0708661417322835
Word: districts
AvPos: 0.257352941176471
AvDis: 104.0
Frequency to Average Distance: 0.00961538461538462
Word: amendment
AvPos: 0.404411764705882
AvDis: 110.0
Frequency to Average Distance: 0.00909090909090909
Word: themselves
AvPos: 0.382352941176471
AvDis: 123.0
Frequency to Average Distance: 0.00813008130081301
Word: residents
AvPos: 0.680147058823529
AvDis: 37.0
Frequency to Average Distance: 0.027027027027027
Word: militia
AvPos: 0.433823529411765
AvDis: 52.0
Frequency to Average Distance: 0.0192307692307692
Word: comment
AvPos: 0.0790441176470588
AvDis: 13.5
Frequency to Average Distance: 0.444444444444444
Word: justices
AvPos: 0.338235294117647
AvDis: 149.0
Frequency to Average Distance: 0.00671140939597315
Word: state
AvPos: 0.466911764705882
AvDis: 42.0
Frequency to Average Distance: 0.0238095238095238
I'd say it's doin' something right.
I've reposted my original project announcement from my old blog at the end of this post.
Contents below originally posted on October 5, 2007 at Vazav.com, before it was killed by a WordPress exploit:
Last night I was aglow with thoughts. For some reason, while just on the edge of sleep, I was able to visualize the process of analyzing text. Now, for most of you, this is probably a fairly simple procedure, but my skills in algorithms and matrix theory aren't quite as great as they should be, so for me this was a minor revelation.
So this morning I wrote a small library for picking apart text. When given a sample of text, the library returns bits of information such as: the average positioning of "interesting" words, the frequency of those words, and the ratio of a word's frequency to the average distance between its occurrences. The math is pretty simple... but the fun part is what you can deduce from this.
For instance, if a subject has a very high frequency-to-average-distance ratio (F:AD), that means that the subject is mentioned repeatedly in a very small space. If the average positioning (AP) is close to zero, then it appears at the beginning of the text. If it's closer to one, then it's towards the end. If the F:AD ratio is high, and the AP is around 0.5, then the subject is dense near the middle.
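To make those two numbers concrete, here is a minimal sketch of how they could be computed for a single word. It is not the library's code: word_stats is a hypothetical helper, and it assumes that frequency is a plain occurrence count, that distance is measured in words between consecutive occurrences, and that position is normalized to the range 0 to 1. The library may define any of these differently.

```ruby
# Hypothetical helper, not part of adjacency_finder.
# Assumptions: frequency = number of occurrences, distance = gap in words
# between consecutive occurrences, position normalized to 0.0..1.0.
def word_stats(text, word)
  words = text.downcase.scan(/[a-z0-9']+/)
  positions = words.each_index.select { |i| words[i] == word }
  return nil if positions.size < 2 # a distance needs at least two occurrences

  gaps = positions.each_cons(2).map { |a, b| b - a }
  average_distance = gaps.sum.to_f / gaps.size
  average_position = (positions.sum.to_f / positions.size) / (words.size - 1)

  {
    frequency: positions.size,
    average_distance: average_distance,
    average_position: average_position,
    frequency_to_average_distance: positions.size / average_distance
  }
end

stats = word_stats("gun law gun court ban gun", "gun")
puts stats[:average_position]              # near 0.5: centered in the text
puts stats[:frequency_to_average_distance] # higher means more tightly packed
```

A word repeated many times in a short stretch gets small gaps and therefore a large ratio, while a word mentioned once at each end of the text gets one large gap and a small ratio.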
I've attached a graph with an analysis run ignoring common words. Fun stuff.
Contents below originally posted on October 6, 2007 at Vazav.com:
I ran a sample essay through my analyzer and got some pretty good numbers. I then generated a graph to help show me what those numbers meant (my first time using Gruff... not a lot of docs out there). I figured an essay was a good choice because essays are meant to have a certain structure, and I was interested in seeing what that looked like.
Typically, an essay has an intro, a body, and a conclusion. Most people will repeat the intro material in the conclusion, and have sprinklings of information throughout the main body. Those sprinklings should be related to the argument being presented.
I chose the K-Means method for determining clusters. Since there wasn't a good implementation around in Ruby, I wrote my own. The results were satisfactory.
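For readers who haven't seen K-Means, here is a minimal one-dimensional sketch of the standard algorithm (Lloyd's iteration). It is not the implementation used for the graph: kmeans_1d is a hypothetical name, the evenly spaced starting centroids are a choice made here to keep the example deterministic, and the assumption that the clustered values are average positions between 0 and 1 is only a guess at the kind of data involved.

```ruby
# Hypothetical 1-D k-means sketch (Lloyd's algorithm); not the original code.
# Returns [centroids, clusters].
def kmeans_1d(values, k, max_iterations = 100)
  sorted = values.sort
  # Deterministic start: k evenly spaced samples from the sorted values.
  centroids = Array.new(k) do |i|
    sorted[(i * (sorted.size - 1)) / [k - 1, 1].max].to_f
  end
  clusters = []

  max_iterations.times do
    # Assignment step: each value joins its nearest centroid.
    clusters = Array.new(k) { [] }
    values.each do |v|
      nearest = (0...k).min_by { |i| (v - centroids[i]).abs }
      clusters[nearest] << v
    end

    # Update step: move each centroid to the mean of its cluster.
    updated = clusters.each_with_index.map do |cluster, i|
      cluster.empty? ? centroids[i] : cluster.sum.to_f / cluster.size
    end
    break if updated == centroids # converged
    centroids = updated
  end

  [centroids, clusters]
end

# Assumed input: average positions of a few words, clustered into two groups.
centroids, clusters = kmeans_1d([0.05, 0.08, 0.1, 0.9, 0.95], 2)
puts clusters.inspect
```

With positions like these, the two clusters line up with words concentrated near the start of the text and words concentrated near the end, which is the intro and conclusion pattern described above.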
Generated Graph for English Paper
The source is available here.