# How complex is your Ruby? Time for some lexical variability analysis

How rich a subset of Ruby are you using? To what extent are you exploring the language space? I've been performing some lexical analysis on Ruby software to provide a first-order answer to those questions.

I'm only considering the lexical richness of a program for the time being. Using concepts from information theory (briefly explained below), we can obtain the "lexical complexity" of the source code.

### Some results

#### The simplest files in stdlib

Here are the three (not totally trivial) lexically simplest files in the stdlib and their "lexical complexities":

- shell/version.rb: 0.126 bits (17 tokens of 9 different types)
- irb/version.rb: 0.216 bits (17 tokens of 9 different types)
- irb/lc/error.rb: 0.305 bits (93 tokens of 10 different types)

Both irb/version.rb and shell/version.rb look like

    module SomeModule
      @RELEASE_VERSION = "version"
      @LAST_UPDATE_DATE = "release date"
    end

It's no wonder: they were both written by ISHITSUKA Keiju.

#### Other lexically simple files

- English.rb: 0.616 bits (100 tokens of 4 different types)
- tkclass.rb: 0.619 bits (150 tokens of 10 different types)
- i686-linux/rbconfig.rb: 0.631 bits (1616 tokens of 41 different types)

English.rb is but a sequence of

    alias $READABLE_NAME $SOME_UGLY_PERLISH_VAR

tkclass.rb consists mostly of assignments like

    TopLevel = TkToplevel

and i686-linux/rbconfig.rb has got a lot of lines resembling

    CONFIG["ruby_install_name"] = "ruby"

The "lexical complexity" above represents the amount of information carried by a token (type) if we know the type of the token immediately preceding it. In all three cases there are, on average, fewer than 2 choices for the next token. This is to be expected in English.rb and tkclass.rb: after all, they only have 4 and 10 different token types, respectively. The case of rbconfig.rb is more interesting: there are 41 distinct token types, but the lexical variability is very low; in other words, the code is very repetitive.
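
One way to read those numbers: a per-token entropy of H bits corresponds to 2**H equally likely choices, so each "lexical complexity" can be seen as an effective branching factor. A quick check, using the three entropies listed above:

```ruby
# 2**H gives the effective number of equally likely choices for the
# next token ("perplexity"); H values taken from the list above.
perplexities = [0.616, 0.619, 0.631].map{|h| 2 ** h }
# all three come out between 1.5 and 1.6: fewer than 2 effective choices
```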

#### The most complex files in stdlib

- tk/timer.rb: 2.411 bits (2656 tokens of 70 different types)
- resolv.rb: 2.415 bits (8556 tokens of 85 different types)
- optparse.rb: 2.505 bits (5903 tokens of 84 different types)

So the "lexically richest/most complex" file in the stdlib is optparse.rb. In general, bigger files tend to use more token types, but past a certain length that stops increasing the lexical variability, as the code becomes somewhat repetitive.

### Some techniques

If you don't feel like reading the explanation and code below, here are the four things that stand out the most:

- Using the block form of Hash.new to assign unique IDs to objects:

        OIDS = Hash.new{|h,k| h[k] = OIDS.size}

- Another way to cache the results of a method call with an anonymous hash (could easily be metaprogrammed):

        class Foo
          foo_cache = {}
          define_method(:foo) do |*a|
            if foo_cache.include?(a)
              foo_cache[a]
            else
              ret = nil # compute ret
              foo_cache[a] = ret
            end
          end
        end

- Computing a weighted average:

        weights = [1, 2, 3, 2, 4, 2, 1, 3]
        weighted_avg = weights.inject([0, 0]) do |(sum, idx), weight|
          [sum + weight * func(idx), idx + 1]
        end.first

- Beware of Matrix: it's slow, and the interface is a bit rough at times (messy failures when requesting out-of-bound rows/columns, Vector objects are not enumerable...); consider narray if you need performance.
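
The first of those techniques can be seen in action with a couple of hypothetical keys (any objects would do):

```ruby
# Block form of Hash.new: unseen keys get the next free ID,
# known keys return the ID assigned on first sight.
OIDS = Hash.new{|h,k| h[k] = OIDS.size}

OIDS[:foo]  # => 0
OIDS[:bar]  # => 1
OIDS[:foo]  # => 0 (same ID as before)
```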

## How to compute the lexical complexity of Ruby code

### Modelling random Ruby programs

To begin with, since I'm just considering the vocabulary used in a given program, source code is just taken as a sequence of token types, so, given

    a = 1

I'll operate on

    tIDENTIFIER '=' tINTEGER '\n'

Luckily, ruby can provide that information with the -y command-line option, along with lots of other details about the parsing process (i.e. how the LALR(1) automaton works):

    batsman@tux-chan:/tmp$ ruby -y -c -e "a = 1" 2>&1
    Starting parse
    Entering state 0
    Reducing stack by rule 1 (line 328), -> @1
    Stack now 0
    Entering state 2
    Reading a token: Next token is token tIDENTIFIER ()
    Shifting token tIDENTIFIER, Entering state 34
    Reading a token: Next token is token '=' ()
    ...
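
That trace can be reduced to the bare token-type sequence with a simple scan; a minimal sketch (it assumes the 1.8-era trace format shown above):

```ruby
# Extract the token-type sequence from `ruby -y` debug output.
# Assumes trace lines like the ones shown above.
def token_types(yydebug_output)
  yydebug_output.scan(/^Reading a token: Next token is token (\S+)/).flatten
end

trace = <<EOS
Reading a token: Next token is token tIDENTIFIER ()
Shifting token tIDENTIFIER, Entering state 34
Reading a token: Next token is token '=' ()
EOS
token_types(trace)  # => ["tIDENTIFIER", "'='"]
```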

We can turn an arbitrary Ruby program into a sequence of token types; so what? Keeping in mind that we would like to arrive at some sort of "lexical complexity" index, the method must at least satisfy these properties:

- the more token types in the program, the higher the index
- if we can predict the type of the next token based on the previous ones, the "lexical complexity" goes down

These two basic properties can be satisfied by modelling Ruby code using a Markov chain and calculating the associated entropy rate: I'll explain how that works with a bit of theory and a simple example, and then apply that to the token information ruby can provide.

#### Markov chains of tokens

(Scroll down to the code if you really, really dislike equations.)

As said above, the program is just a sequence of token types at this level:

    X(1), X(2), ..., X(n)

Given *n* symbols, we'd like to predict what the next one will be, that is, the probability that it is of a given type *j*:

    P( X(n+1) = j | X(1), ..., X(n) )

In the following model, I'll make the simplifying assumption that a token depends only
on the one preceding it. This is of course untrue: the probability of the next token being
*end* is higher in

    class Foo; def foo; end

than in

    puts 1

Indeed, the probability of finding an *end* token in a syntactically valid Ruby program is
exactly 0 if we have no open *class*, *module*, *if* etc.

So, to what extent does that assumption contaminate the results I'll obtain? It turns out that it doesn't really matter that much. By assuming that the next token depends only on the preceding one, I'm discarding all the information carried by the earlier tokens: this means that the process misses some opportunities to identify redundancy (remember the second property above), so the measure of "lexical complexity" I'll get will be higher than the actual one. In other words, I'm just computing an upper bound for the complexity. But this doesn't invalidate the model as a tool to assess the relative lexical complexities of two Ruby programs. (Besides, if enough people are interested I'll show another way to estimate the lexical complexity that doesn't discard that much information.)

The assumption that a token depends only on the preceding one can be written as

    P( X(n+1) = j | X(n) = i(n), ..., X(1) = i(1) ) = P( X(n+1) = j | X(n) = i(n) )

which is read as "the probability of X(n+1) being *j* given the values of all the preceding tokens is the same as that when I only know the value of the one token immediately before."

#### Probability transition matrix

We can write the probability that a token is *j* if the preceding one was *i* as

    P(i,j) = P( X(n+1) = j | X(n) = i )

That's the probability of a transition between *i* and *j*; the matrix encompassing all such probabilities is P = ( P(i,j) ).

Suppose that we know the probabilities of a given token being of
any of the possible token types (e.g. 50% probability of being
**tIDENTIFIER**, 50% of it being **tINTEGER**), i.e. the probability
of type *i* is p(i). The probabilities for the next symbol are then

    p'(j) = sum over i of p(i) * P(i,j)        (in matrix form, p' = p P)
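
A quick sketch of that step with the stdlib Matrix class (the 50/50 initial distribution and the transition probabilities below are made up for illustration):

```ruby
require 'matrix'

p0 = Matrix.row_vector([0.5, 0.5])   # P(tIDENTIFIER), P(tINTEGER)
pm = Matrix[ [0.2, 0.8],             # made-up transition probabilities
             [0.4, 0.6] ]
p1 = (p0 * pm).to_a.first            # distribution of the next token
# p1 is close to [0.3, 0.7]
```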

#### Stationary distribution

The distribution of probabilities *mu* such that

    mu = mu P

is called a "stationary distribution". Given some conditions (if the process is aperiodic and
you can get from any state to any other in a finite number of steps), there's only one
stationary distribution, and the distribution of X(n) tends to it when
*n* tends to infinity.

#### Example

Let's imagine that we have only two token types, *a* and *b*, and we know the
probability transition matrix is

    P = [ 0.2  0.8 ]
        [ 0.4  0.6 ]

That is, *b* follows *a* with a probability of 80%, we get *a* after *a* 20% of the time,
the probability of getting *a* after *b* is 40%, and that of *b* after *b* is 60%.

By solving mu = mu P (that is, mu(a) = 0.2 mu(a) + 0.4 mu(b) with mu(a) + mu(b) = 1), we get

    mu = (1/3, 2/3)

In other words, if we generate an infinite number of symbols using that process, one third of
them will be of type *a*, and two thirds of type *b*.

There's another way to get that result without actually solving the above equation: just multiply the transition matrix by itself repeatedly, and each row will progressively approach *mu*:

    require 'matrix'
    m = Matrix[ [0.2, 0.8],
                [0.4, 0.6] ]  # => Matrix[[0.2, 0.8], [0.4, 0.6]]
    m * m * m * m * m * m * m # => Matrix[[0.3333248, 0.6666752], [0.3333376, 0.6666624]]

### A measure of the lexical complexity

Using Markov processes as presented above, I take the amount of information added by a single token (in an infinite sequence of tokens generated according to the model) as a measure of the lexical complexity of an arbitrary Ruby program.

How to compute that? It's the weighted average of the "surprise" caused by a token when we know the preceding one. This is of course weighted with the probabilities from the stationary distribution. (In mathematical terms, I'll be using the entropy rate of the Markov process matching a given text).
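
Continuing the two-token example from above, the entropy-rate computation works out like this (plain arrays instead of Matrix for brevity; the stationary distribution (1/3, 2/3) is the one derived earlier):

```ruby
LOG2 = Math.log(2)

# Shannon entropy (in bits) of one row of the transition matrix
def row_entropy(row)
  row.inject(0){|s, p| p > 0 ? s - p * Math.log(p) / LOG2 : s }
end

pm = [ [0.2, 0.8],     # transition matrix from the example above
       [0.4, 0.6] ]
mu = [1.0/3, 2.0/3]    # its stationary distribution

# weighted average of the per-row entropies = entropy rate
rate = mu.zip(pm).inject(0){|s, (p, row)| s + p * row_entropy(row) }
rate                   # => about 0.888 bits per token
```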

### The code

The transition matrix is computed by recording for each token the type of the token preceding it:

    require 'matrix'

    def transition_matrix(filename)
      indices = Hash.new{|h,k| h[k] = indices.size}
      probs = []
      last_tok = nil
      ntokens = 0
      freqs = []
      IO.popen("ruby -y -c #{filename} 2>&1") do |f|
        f.each do |line|
          if md = /^Reading a token: Next token is token (\S+)/.match(line)
            tok = md[1]
            ntokens += 1
            freqs[indices[tok]] ||= 0
            freqs[indices[tok]] += 1
            if last_tok
              probs[indices[last_tok]] ||= []
              probs[indices[last_tok]][indices[tok]] ||= 0
              probs[indices[last_tok]][indices[tok]] += 1
            end
            last_tok = tok
          end
        end
      end
      if ntokens == 0
        return [nil, nil, nil, 0]
      end
      probs.map! do |row|
        sum = row.inject(0){|s,x| s + (x || 0)}
        row.map{|x| 1.0 * (x || 0) / sum }
      end
      freqs = freqs.map{|p| 1.0 * p / ntokens }
      cols = [probs.size, probs.map{|x| x.size}.max].max
      probs << [0] * cols if probs.size < cols
      [Matrix.rows(probs.map{|row| row + [0] * (cols - row.size)}),
       freqs, indices, ntokens]
    end

The entropy rate is calculated by obtaining the stationary distribution if there's one, or just taking the observed token type probabilities otherwise, and then computing the weighted average of the entropies of each row in the transition matrix.

    LOG2 = Math.log(2)

    def H(probs)
      probs.to_a.inject(0) do |s,x|
        if x < 1e-6 # for continuity
          s
        else
          s - 1.0 * x * Math.log(x) / LOG2
        end
      end
    end

    class Object
      _stationary_distribution_cache = {}
      max_iterations = 10
      define_method(:stationary_distribution) do |t_matrix, *rest|
        if _stationary_distribution_cache.has_key?(t_matrix)
          _stationary_distribution_cache[t_matrix]
        else
          key = t_matrix    # keep the original matrix as the cache key
          threshold = rest[0] ? rest[0] : 0.001
          2.times{ t_matrix = t_matrix * t_matrix }
          iterations = 0
          loop do
            t_matrix = t_matrix * t_matrix
            # should verify that all rows are close, but the following weaker
            # test will do
            d1 = (t_matrix.row(0) - t_matrix.row(t_matrix.row_size - 1)).to_a.inject(0){|s,x| s + x.abs}
            d2 = (t_matrix.row(0) - t_matrix.row(t_matrix.row_size/2)).to_a.inject(0){|s,x| s + x.abs}
            break if d1 < threshold && d2 < threshold || iterations > max_iterations
            iterations += 1
          end
          return nil if iterations >= max_iterations
          _stationary_distribution_cache[key] = t_matrix.row(0).to_a
        end
      end
    end

    def entropy_rate(t_matrix)
      mu = stationary_distribution(t_matrix)
      return nil if mu.nil?
      ret = 0
      mu.each_with_index do |p, row|
        ret += p * H(t_matrix.row(row))
      end
      ret
    end

    t_matrix, freqs, indices, tokens = transition_matrix("~/usr/lib/ruby/1.8/webrick/version.rb")
    #t_matrix, freqs, indices, tokens = transition_matrix("~/usr/lib/ruby/1.8/weakref.rb")
    #t_matrix, freqs, indices, tokens = transition_matrix("~/usr/lib/ruby/1.8/erb.rb")
    entropy_rate(t_matrix)                    # => nil
    h_eq = Math.log(t_matrix.row_size) / LOG2 # => 3.0
    tokens                                    # => 11
    avg_entropy = freqs.inject([0,0]) do |(s,idx), p|
      [s + p * H(t_matrix.row(idx)), idx + 1]
    end.first
    avg_entropy                               # => 0.454545454545455

Finally, here's the code that performs that analysis for all the files in RUBYLIBDIR and saves the obtained information:

    #############################################################################
    # compute entropy rates for the files under ARGV[0] || stdlib
    require 'find'
    require 'rbconfig'
    require 'yaml'

    dir = ARGV[0] || Config::CONFIG["rubylibdir"]
    $stdout.sync = true
    begin
      statistics = {}
      Dir.chdir(dir) do
        Find.find(".") do |file|
          next unless /\.rb$/ =~ file
          #next unless %r{^./rss.rb$} =~ file
          groupname = File.dirname(file).split(%r{/})[1]
          groupname = File.basename(file)[/[^.]+/] if groupname.nil?
          puts "Processing #{file} (under #{groupname})"
          t_matrix, freqs, indices, tokens = transition_matrix(file)
          if t_matrix
            h_rate = entropy_rate(t_matrix)
            if h_rate
              mu = stationary_distribution(t_matrix)
            else
              h_rate = freqs.inject([0,0]) do |(s,idx), p|
                [s + p * H(t_matrix.row(idx)), idx + 1]
              end.first
              mu = freqs
            end
            h_iid = H(mu)
            h_eq1 = mu.inject([0,0]) do |(s,idx), p|
              non_zero = t_matrix.row(idx).to_a.select{|x| x > 0}.size
              non_zero = [1, non_zero].max
              [s + p * Math.log(non_zero) / LOG2, idx + 1]
            end.first
            h_eq2 = Math.log(t_matrix.row_size) / LOG2
            (statistics[groupname] ||= []) << [file, h_rate, h_eq1, h_iid, h_eq2, tokens]
            p [file, h_rate, h_eq1, h_iid, h_eq2, tokens]
            # [file,
            #  h_rate  either the entropy rate of the markov chain if there is
            #          a stationary distribution, or the weighted sum of the
            #          entropies for each row in the transition matrix, based on
            #          the observed frequencies for the prev token type
            #  h_eq1   entropy if all possible (as observed by an instance of the
            #          process) choices given the prev token are equidistributed
            #  h_iid   we ignore the markovian process and consider this as a
            #          stochastic process of indep. identically distributed
            #          random variables (so all tokens are considered independent)
            #  h_eq2   entropy per token if we consider them all equally probable
            #  tokens  number of tokens
            # ]
          end
        end
      end
    ensure
      File.open("lexical.entropy", "w"){|f| Marshal.dump(statistics, f) }
      puts "=" * 80
      puts statistics.to_yaml
      raise if $!
    end

By the way, the lexical complexity of these scripts is 2.303 bits, with 1118 tokens of 59 different types. That places them relatively close to optparse.rb.

### Download

Stand-alone lexical complexity estimator: lexical_complexity.rb

### Blah - Eaten-Alive (2005-12-20 (Tue) 12:36:14)

Run it against other major versions of Ruby, and compare. (1.6 -> 1.8, 1.8 -> 1.9)

### i should say - himm (2005-12-19 (Mon) 23:44:06)

you are sick.

### jo (2005-12-29 (Thu) 04:34:25)

A tool to analyze C programs is cil.sf.net, a full C parser written in OCaml. (Perhaps also see nekovm.org for a virtual machine based on OCaml.)


Keyword(s):[blog] [ruby] [complexity] [lexical] [information] [theory] [entropy] [markov] [chain] [stochastic] [process]

References:[Ruby] [Unexplored corners of Ruby's syntax] [Analyzing the stdlib]