Set
class and consider under what circumstances you might wish to consider an actual Set
over an Array
.
Introduction
The ruby language has always valued developer productivity and eloquent, expressive code. Arrays are a cornerstone of most
programming languages and, as such, the ruby Array
is packed full of syntactic delights. Amongst the extensive
Array
interface are a number of set operations that
provide some excellent additions to your arsenal for writing terse and expressive ruby code.
In this post we will look at the set operations offered on the ruby Array
class, where you might choose to use
these operations and when you might want to opt for an actual Set
.
Array set operations
A set is a fundamental data construct that contains zero or more members and guarantees the uniqueness of those members. There are a number of common set operations that are supported by ruby arrays. The operations to be discussed in this post are union, intersection and difference. The last of these is commonly discussed in the context of ruby array set operations and, whilst a useful operation, I contest that ruby's implementation is not strictly a set operation, but more on that later.
Let's look at each of these operations in turn.
Union
The union of two sets $A$ and $B$, $A \cup B$, will return all the members that are either in set $A$, in set $B$ or in both.
It can be visualized as shown in the Venn diagram.
If $A = \{ 1, 2, 3, 4 \}$ and $B = \{ 2, 4, 6, 8 \}$, then the union $A \cup B = \{ 1, 2, 3, 4, 6, 8 \}$.
In ruby we can form the union of two arrays using the |
operator as follows:
ministry = ["weasley", "fudge", "shacklebolt", "weasley"]
gryffindor = ["weasley", "granger", "potter", "longbottom", "weasley"]
ministry | gryffindor # Returns ["weasley", "fudge", "shacklebolt", "granger", "potter", "longbottom"]
Whilst arrays do not require that the entries are unique, this is a fundamental requirement of a set. As a result, our union operation
on our arrays, ministry | gryffindor
, only returns the unique entries. This fact gives us an interesting alternative to
#uniq
for returning the unique elements in an array:
A = ["foo", "bar", "foo", "baz", "foo", "bar"]
A.uniq # Returns ["foo", "bar", "baz"]
A | [] # Returns ["foo", "bar", "baz"]
Asides from this slightly esoteric application the array union operator provides a neat way to incrementally build lists of distinct items. For example, imagine a helper method for building an HTML element that could be embellished with a number of different classes
module CustomTagHelper
def build_tag(text_content, options = {})
content_tag(:span, text_content, {
class: [
"custom_class_a",
"custom_class_b"
] | (options[:class] || "").split(" ")
}.merge(options.slice(:style, :title)))
end
end
In this example we make use of the content_tag
view-helper method exposed by Rails (see
here). Our
CustomTagHelper
module defines our own helper method build_tag
, which takes a string of content and will
stick it into a span
tag. It also allows the caller to pass a number of options which will be used to embellish the tag with a limited set
of HTML attributes. The style
and title
attributes can be passed in the options hash and will be added as attributes to
the element. More interestingly, the client code can also pass a string of classes within the options[:class]
, and these will be merged
into a list of default classes which are applied to the element. Building a unique array of HTML class names is a frequent task, and the array union
operator has proven a very useful and concise tool to use in such cases.
Intersection
The intersection of two sets $A$ and $B$, $A \cap B$, will return all the members that are present in both set $A$ and $B$.
It can be visualized as shown in the Venn diagram with the red-coloured region representing the intersection.
If $A = \{ 1, 2, 3, 4 \}$ and $B = \{ 2, 4, 6, 8 \}$, then the intersection $A \cap B = \{ 2, 4 \}$.
In ruby we can form the union of two arrays using the &
operator as follows:
ministry = ["weasley", "fudge", "shacklebolt", "weasley"]
gryffindor = ["weasley", "granger", "potter", "longbottom", "weasley"]
gryffindor & ministry # Returns ["weasley"]
As for the union operation our intersection operation, gryffindor & ministry
, will remove all duplicate entries in the resulting array.
And once again, this provides another interesting way to get the unique elements in an array:
ministry = ["weasley", "fudge", "shacklebolt", "weasley"]
ministry & ministry # Returns ["weasley", "fudge", "shacklebolt"]
ministry.uniq # Returns ["weasley", "fudge", "shacklebolt"]
The intersection operator can be useful when filtering some user-provided values against an allow-list of accepted terms. For example, we could imagine the following code appearing in a Rails controller:
class FoosController < ApplicationController
ALLOWED_PROPERTIES = %w( name description updated_at )
def index
sanitized_properties = params[:properties] & ALLOWED_PROPERTIES
@foos = Foo.all.pluck(*sanitized_properties)
end
end
In this example we can imagine that users will hit the FoosController#index
endpoint with a list of :properties
that they wish to retrieve for each Foo
instance. It is probably a bad idea to allow users to see any detail they choose,
so to restrict the properties that they can actually retrieve we filter the user-supplied params[:properties]
through the allow-list
defined by ALLOWED_PROPERTIES
. I have found the array intersection operation a neat tool to use in similar filtering situations.
Difference
The difference between sets $A$ and $B$, $B-A$, will return the members that are in set $B$
that are not in set $A$. It can be visualized as the coloured region in the Venn diagram.
If $A = \{ 1, 2, 3, 4 \}$ and $B = \{ 2, 4, 6, 8 \}$, then the difference $B - A = \{ 6, 8 \}$.
In ruby we can take the difference between two arrays as follows:
ministry = ["weasley", "fudge", "shacklebolt", "weasley"]
gryffindor = ["weasley", "granger", "potter", "longbottom", "weasley"]
gryffindor - ministry # Returns ["granger", "potter", "longbottom"]
In this example, the elements that are in gryffindor
but not in ministry
are returned. However, I am unsure if we can really classify
this as a set operation, because duplicates in the first array can remain after the operation (i.e. this method does not return a set, small s). Consider, for example:
gryffindor = ["weasley", "granger", "potter", "longbottom", "weasley"]
gryffindor - ["shacklebolt", "potter"] # Returns ["weasley", "granger", "longbottom", "weasley"]
Here the duplicate weasley
entry remains in the output. So it seems that the difference operator (-
) does not return a set.
Interestingly the ruby v2.7 docs for this operator, make an acknowledgement of
this non-set-like behaviour with this remark:
If you need set-like behavior, see the library class Set.
However, this note is actually removed from the documentation in the ruby v3+ docs.
Not sure why, but there you go. Nonetheless, this remark in the documentation does get us thinking: If we are looking for set-like behaviour why aren't we
using the Set
class rather than using the Array
class as our proverbial hammer?
When to use Set
The truth is, probably, that we stick with the Array
class due to its familiarity and flexibility.
- Familiar because, as a data structure, arrays are ubiquitous in programming and all developers will be comfortable reasoning in terms of arrays.
- Flexible because the
Array
interface is more extensive than that ofSet
. This means that, for better or worse, we can easily reuse the same array in multiple different ways. We can treat our array like a set (albeit without any in-built uniqueness guarantees), but then we can go ahead an use array methods also, like indexing into the array etc.
So if arrays provide all the tools we need, why would I ever choose to use a Set
?
Asides from the semantic benefits to our code, the main powers of Set
are two-fold:
- Entries are guaranteed to be unique
- Checking for inclusion in the set is extremenly fast
Set
A
contains a member, then the call
A.include?(member)
will be very fast. And as an extension to this, the set operations for #subset?
and
#superset?
will also be very efficient.
A simple benchmark
To give an idea of the relative performance merits of Set
versus Array
we can run a very simple benchmark:
require 'benchmark'
require 'set'
def generate_random_string(len=6)
(0..len).map{ ('a'..'z').to_a[rand(26)] }.join('')
end
n=100_000
big_set = nil
big_arr = nil
Benchmark.bmbm do |x|
x.report("Building Set") { big_set = Set.new; n.times { big_set << generate_random_string } }
x.report("Building Array") { big_arr = Array.new; n.times { big_arr << generate_random_string } }
x.report("Set#include?") { 1_000.times{ big_set.include?(generate_random_string) } }
x.report("Array#include?") { 1_000.times{ big_arr.include?(generate_random_string) } }
end
This benchmark code defines a generate_random_string
method which builds a 6 character string with lowercase letters.
We call this method in two separate loops to populate a large Set
and a large Array
. We report on the time
taken to build these two separate data structures. The benchmark then repeatedly (1000 times) tries to find a random string in each of the
big_set
and the big_array
. Again, we report on the time it takes to do the 1000 searches against the Set
and how long it takes to complete the searches against the Array
. The results are below:
user system total real
Building Set 3.152817 0.019997 3.172814 ( 3.174221)
Building Array 2.948688 0.016002 2.964690 ( 2.965641)
Set#include? 0.027470 0.000000 0.027470 ( 0.027548)
Array#include? 6.498631 0.007982 6.506613 ( 6.509682)
We should not read too much into the absolute values here, as we make no attempt to understand the impact of the calls to generate_random_string
.
However, these costs should be the same for both the Array
and the Set
loops, so the relative timings should be significant.
We can see that building the Set
burns more time than building the Array
. This stands to reason, as we build the Set
there
will need to be internal checks to ensure that we are not adding a duplicate entry. By contrast, the Array
does not care if the entry is a duplicate,
so adding an element to an Array
is quicker than adding to a Set
of equivalent size.
When it comes to presence checks (i.e. #include?
) we can see tha the Set
absolutely smokes (technical term) the Array
.
When a Set
checks for existence it can use a hash-based lookup, but the Array
needs to check every item, which can get quite slow
then the array is large.
Summary
In this post we have introduced the set operations for union (|
) and intersection (&
) that are exposed on ruby arrays, and we have
seen a couple of simple scenarios where these operators prove their worth.
We have also discussed the difference operator (-
) which, while convenient, is perhaps miscategorised as a set operation.
Finally we took a look at why you might choose to use a Set
rather than an Array
, with efficient lookups over large unique collections
being the primary use case for the Set
class.
In my own experience I use the Set
class very infrequently, typically reaching for an Array
instead.
It seems that most array instances are small thow-away data constructs, for which the true value of Set
will not be very evident.
However, in writing this post I have resolved to keep an open mind to using the Set
class more liberally in my day-to-day programming.
Even when the lookup efficiency is not a significant, enforcing of a unique-membered-collection has undoubtedly got some benefits, even just from a
semantic perspective.
References
- Ruby docs for
Array
class - Ruby docs for
Set
class - Wikipedia entry for set union operation
- Wikipedia entry for set intersection operation
- Wikipedia entry for set difference operation
- Please note that the Venn diagrams reproduced in this blog post have been borrowed from Wikipedia
- Rails documentation for the content_tag view helper method
- Ruby v2.7 docs for
Array
-
operator - Wikipedia entry for Venn diagrams
Comments
There are no existing comments
Got your own view or feedback? Share it with us below …