Introduction
In an earlier article we introduced distance metrics and showed how these could be implemented in ruby. At the time we mentioned that such metrics can be used in nearest-neighbour models and in generating recommendation systems. We aim to develop these ideas further in this blog post, and to develop a recommendation system based on said metrics.
The algorithms implemented in this article are inspired by the python versions presented by Toby Segaran in Collective Intelligence (Amazon).
We adapt these algorithms, formulate them in ruby and apply them to a whimsical dataset containing class scores for different students in Hogwart's School for Witchcraft and Wizardry. We use the data to generate predicted scores for the different classes which the student has not taken. Whilst the dataset is whimsical, the algorithms presented are sound, and could be legitimately applied to any real dataset.
Let's get started!
Introducing the Hogwarts data
Our dataset is defined inside a static ruby hash named CLASS_SCORES
. The full data can be seen
on GitHub, but we reproduce a small extract below
to illustrate the format:
CLASS_SCORES = {
"Harry Potter" => {
"Alchemy" => nil,
"Arithmancy" => nil,
"Astronomy" => 6,
"Care of Magical Creatures" => 8,
"Charms" => 5,
"Defence Against the Dark Arts" => 10,
"Divination" => 4,
"Herbology" => 7,
"Muggle Studies" => nil,
"Potions" => 2,
"Study of Ancient Runes" => nil,
"Transfiguration" => 7,
},
"Hermione Granger" => {
"Alchemy" => 9,
"Arithmancy" => 10,
"Astronomy" => 9,
"Care of Magical Creatures" => 9,
"Charms" => 10,
"Defence Against the Dark Arts" => 9,
"Divination" => 1,
"Herbology" => 10,
"Muggle Studies" => 10,
"Potions" => 3,
"Study of Ancient Runes" => 9,
"Transfiguration" => 10,
}, …
The algorithm will use this ruby hash, CLASS_SCORES
, but it will be useful to consider the representation of the data in tabular form:
Harry Potter | Hermione Granger | Ron Weasley | Neville Longbottom | Draco Malfoy | Vincent Crabbe | Greggory Goyle | Ginny Weasley | Cho Chang | Cedric Diggory | |
---|---|---|---|---|---|---|---|---|---|---|
Alchemy | - | 9 | - | 4 | - | 1 | 2 | 8 | - | 6 |
Arithmancy | - | 10 | - | - | 7 | 2 | 2 | - | 8 | - |
Astronomy | 6 | 9 | 6 | - | - | - | - | - | 9 | 8 |
Magical Creatures | 8 | 9 | 7 | 5 | 4 | 2 | 3 | 8 | - | 8 |
Charms | 5 | 10 | 7 | 4 | 7 | - | - | 9 | 7 | 7 |
Dark Arts | 10 | 9 | 8 | 4 | 3 | 1 | 2 | 9 | 7 | 6 |
Divination | 4 | 1 | 7 | 5 | - | - | - | - | - | - |
Herbology | 7 | 10 | 8 | 10 | 8 | 3 | 2 | 9 | 8 | 6 |
Muggle Studies | - | 10 | - | - | - | - | - | 9 | - | - |
Potions | 2 | 3 | 3 | 1 | 10 | 9 | 8 | 3 | 6 | 3 |
Ancient Runes | - | 9 | - | - | 7 | 3 | - | - | 9 | - |
Transfiguration | 7 | 10 | 7 | 6 | 7 | 2 | 3 | 8 | 9 | 3 |
In this format each row of the table represents the scores in a particular class and each column represents the scores for a single student.
So we can consider a single student to be represented by a vector (or array) of scores, e.g. Ginny Weasley can be represented by the vector
[8, nil, nil, 8, 9, 9, nil, 9, 9, 3, nil, 8]
. There are two points worth mentioning here:
- The order is obviously important. The 3 in Ginny's array is at index 9 because it relates to her score in Potions class, so the index of a value is meaningful. As long as two student vectors use the same indexing then they can be directly compared.
-
There are
nil
values in the vector for any classes that the student does not take.
That 3 in Potions is really bugging Ginny, so she is considering switching this subject for one where she might achieve a better score. There are four subjects which she is not currently studying: Arithmancy, Astronomy, Divination and Study of Ancient Runes. Can we advise her which subject she would be best to switch to?
Finding similar students
The first approach Ginny might attempt is to find another student with similar interests and see if they could recommend an alternative class to her. How do we find a similar student? Well, we can use the distance metrics discussed in our previous post to compare Ginny's vector to that of each of the other students, we can then identify the student who is closest to Ginny and take their recommendation.
To drive this process we will introduce a Recommender
class which will be parameterized by the class_scores
data structure
and the distance_measure
to use in calculations:
require_relative './metrics'
class Recommender
attr_reader :class_scores, :distance_measure
def initialize(class_scores = {}, distance_measure = Metrics.method(:euclidean_distance))
@class_scores = class_scores
@distance_measure = distance_measure
end
The class_scores
attribute is just the Hash
of scores we introduced in the previous section.
The distance_measure
is a method (callable) that should take two equal-length vectors and return a distance between them.
We can reuse many of the implementations discussed in our
previous post, but the current class will default to using
the euclidean_distance
when no other measure is specified. I reproduce the implementation of this particular measure here, for convenience:
def self.euclidean_distance(a,b)
return unless a.any? && (a.size == b.size) # We need two vectors of equal lenght to calculate this distance
diff_squared = (0..a.size-1).reduce(0) do |sum, i|
sum + (a[i] - b[i])**2
end
Math.sqrt(diff_squared)
end
With the Recommender
class set up in this way we can now look at how we might find the other students most similar to Ginny.
The following code snippet from the Recommender
class shows how we can implement a top_k_matches
method. The method takes a single
student name (the target student) and the number, k
, of matches we want to retrieve. It returns the k
other students who are deemed to resemble
our target student most closely, according to the distance metric being considered. The implementation is below:
…
def top_k_matches(person, k = class_scores.size)
class_scores.keys.inject({}) do |scores, other_person|
next scores if other_person == person
scores[other_person] = similarity(person, other_person)
scores
end.compact.sort_by do |_key, value|
-1*value
end[0..k-1].to_h
end
private
# The similarity score lies in the closed interval [0, 1].
# A score of 1 indicates that the items are identical.
# A score of 0 indicates that they are infinitely distant.
def similarity(person, other_person)
1.0/(1.0 + distance(person, other_person))
end
def distance(person, other)
# Must only include class dimensions where both students have scores
person_vec, other_vec = class_scores[person].values.each_with_index.map do |score, i|
next unless score && class_scores[other].values[i]
[score, class_scores[other].values[i]]
end.compact.transpose
# Return arbitrarily large value if no overlapping scores
# Effectively mimicing a distance of inifinity.
return 10_000 if [person_vec, other_vec].any?(&:empty?)
distance_measure.call(person_vec, other_vec)
end
end
The top_k_matches
method, itself, is pretty straighforward. It loops through each of the students in class_scores
and calculates
the similarity between that student and the target student, taking care not to compare the target student with themselves. The resulting similarities are ordered and
we return the top k
most similar.
def top_k_matches(person, k = class_scores.size)
class_scores.keys.inject({}) do |scores, other_person|
next scores if other_person == person # Don't compare person to themselves
scores[other_person] = similarity(person, other_person)
scores
end.compact.sort_by do |_key, value|
-1*value # Order by decreasing similarity score
end[0..k-1].to_h # Extract the top k, and output as a hash
end
Perhaps more interesting are the manipulations we have to carry out to determine the similarity scores once we have the two student vectors. The distance measures
introduced in the previous post generally require two equal-length, numeric arrays. In this respect, the presence of nil
values in the arrays would make
it impossible to calculate a distance.
What do the nil
values represent again? A student vector will have a nil
value when they do not study the class at that particular index.
So if one student has a score for a particular subject and the other student does not study that subject, i.e. has a nil
, then we cannot say much about the
similarity, or difference, of the students based on that subject. As a result, let's omit the elements for each index where either, or both, of our vectors have a nil
value. This is what we achieve in the first part of the distance
method:
def distance(person, other)
# Must only include class dimensions where both students have scores
person_vec, other_vec = class_scores[person].values.each_with_index.map do |score, i|
next unless score && class_scores[other].values[i]
[score, class_scores[other].values[i]]
end.compact.transpose
# Return arbitrarily large value if no overlapping scores
# Effectively mimicing a distance of inifinity.
return 10_000 if [person_vec, other_vec].any?(&:empty?)
distance_measure.call(person_vec, other_vec)
end
The input vectors, person
and other
, are transformed into shorter vectors, person_vec
and other_vec
, where we
have sliced out any index for which either student has a nil
entry. We can then pass these shorter arrays into our established distance functions to determine the distance
between the two students in this new mutual subspace.
One other complication that we must consider is when the two students do not share any classes in common. In such cases both person_vec
and
other_vec
will be empty, and we cannot determine the distance between two empty arrays. In this case, if the students share no common classes, we will assume
that they are very dissimilar, so we will return an arbitrarily large distance in this case (to approximate a distance of infinity).
So far we have determined the distance between our two students, but what we actually want is a similarity score. To convert our distance
into a
similarity we can use the following:
def similarity(person, other_person)
1.0/(1.0 + distance(person, other_person))
end
Hopefully you can see that if the two students are identical then the distance
should evaluate to zero, in which case the similarity
will equal 1.
By contrast, when the distance
evaluates to a very large value then the similarity
will tend towards 0. In this way, the similarity measure will range between 0 and 1.
With the code described we can now calculate the top 3 most similar students to Ginny:
require_relative './hogwarts_class_scores'
require_relative './recommender'
recommender = Recommender.new(CLASS_SCORES)
puts recommender.top_k_matches("Ginny Weasley", 3)
# {"Ron Weasley"=>0.2612038749637414, "Hermione Granger"=>0.25, "Cho Chang"=>0.18660549686337075}
OK cool! So it looks like brother Ron is the best match. So Ginny could ask Ron for his advice, and he will likely recommend that she study Divination. But would you trust Ron's advice??
Recommending classes
The approach in the previous section might work fairly well for Ginny, to help her choose which class to switch to. But because she is only asking one other student there is the possibility that the student she chooses (Ron, in this case) has very peculiar tastes, and their recommendation ends up not being a great fit.
What is more, Ron himself does not study Arithmancy or Ancient Runes, so he cannot make an informed recommendation on the things which he has not experienced himself.
To overcome these shortcomings we need a new approach. Rather than taking the recommendation from a single student, we could, instead, take recommendations from many (or all students) who have studied a subject and calculate an average score for that subject. But in this process it would also be sensible to apply a greater weight to the recommendations provided by other students who are more similar to Ginny, so we weight the contribution from each student by their similarity to Ginny. The ruby implementation for this calculation is shown below:
def get_recommendations(person)
class_scores[person].inject({}) do |memo, (class_name, score)|
next memo if score # We only want to recommend classes for which we have no score registered
total_similarity = 0
memo[class_name] = class_scores.keys.inject(0) do |avg, other_person|
next avg unless class_scores[other_person][class_name] # We are only interested in students who have taken this class, and have a score
similarity = similarity(person, other_person) # The similarity between students will be the weighting factor
total_similarity += similarity # Keep a running total of similarity to divide by at the end
avg += class_scores[other_person][class_name] * similarity # Tally the weighted contributions
end
memo[class_name] = memo[class_name] / total_similarity # Divide by the sum of the weights
memo
end
end
We loop over the class scores for the target student, but we only concern ourselves with the classes for which the student has no score. For each of these classes we loop over all the other students who have registered a score in that class, and calculate a weighted average of their scores for the class, where the weighting factor is the similarity to the target student. That's it.
Let's see what recommendations we now generate for Ginny using this weighted average
require_relative './hogwarts_class_scores'
require_relative './recommender'
require_relative './metrics'
recommender = Recommender.new(CLASS_SCORES, Metrics.method(:euclidean_distance))
puts recommender.get_recommendations(student)
# {"Arithmancy"=>7.492996951956632, "Astronomy"=>7.566161526023427, "Divination"=>4.168756010451042, "Study of Ancient Runes"=>8.094049420209119}
Based on this model, Ginny would be best advised to swith to Study of Ancient Runes, as we estimate that she will achieve a score of about 8.09.
Perhaps Euclidean distance is not the best distance measure in this higher-dimensional space
(see here, for example)?
Not to fear, our algorithm should work equally well with other distance measures. Let's consider the same calculation based on the manhattan_distance
:
recommender = Recommender.new(CLASS_SCORES, Metrics.method(:manhattan_distance))
puts recommender.get_recommendations(student)
# {"Arithmancy"=>7.653125000000001, "Astronomy"=>7.508417508417509, "Divination"=>4.235938089845225, "Study of Ancient Runes"=>8.184615384615384}
So we can see slightly different numbers in our predicted scores, but the same conclusion: the best option for Ginny would be to switch to Ancient Runes.
Conclusion
That was our simple recommendation system in ruby; these algorithms can be used to generate recommendations from any user-item dataset. In this case, our recommendation system is driven by the idea of similar users; and the technique is known as user-based filtering. Perhaps you can appreciate that we could easily invert our dataset and, instead, consider our items (in our case, the classes) to be represented by vectors of individual student scores. We could then proceed to look for similar classes rather than similar students, and use these similar classes to generate our recommendations. Such an approach is know as item-based filtering, and in some cases this approach will make more sense.
There are many shortcomings with the model we have presented. Perhaps most significantly, is the need to calculate the distance between our target user and every other user in our database. As the number of users increase this may become impractical. Also if we have a completely new user, with few or no scores, we can't make any meaningful recommendations.
Finally our dataset is relatively dense, in that each student studies a pretty large subset of available classes. In such dense datasets it is possible to calculate meaningful user distances. But imagine a context like movie recommendations, where each user has rated only a very small subset of the available items. In such cases our dataset is sparse, and the idea of distance between our user vectors becomes a little more suspect.
There are various other recommendation models which address these different problems, but further exploration of these algorithms will have to wait for a future post.
I hope you enjoyed this article and found it useful. f you would like to be kept up-to-date when we publish new articles please subscribe.
References
- Collective Intelligence by Toby Sagaran, available on Amazon
- Previous article on implementing distance measures in ruby
- GitHub repo with the algorithm implementation as discussed in this post
- Blog post ranking Hogwarts classes
- Blog post discussing when Euclidean distance is not a good fit
Comments
Thank you for sharing. Just a tiny note: you referenced the CLASS_PREFERENCES but in the provided code you rather used CLASS_SCORES one.
Thanks for reading, and thanks for picking up on this mistake. That has now been corrected.
Got your own view or feedback? Share it with us below …