
How complex is your Ruby? Time for some lexical variability analysis

How rich a subset of Ruby are you using? To what extent are you exploring the language space? I've been performing some lexical analysis on Ruby software to provide a first-order answer to those questions.

I'm only considering the lexical richness of a program for the time being. Using concepts from information theory (briefly explained below), we can obtain the "lexical complexity" of the source code.

Some results

The simplest files in stdlib

Here are the three (not totally trivial) lexically simplest files in the stdlib and their "lexical complexities":

  • irb/version.rb: 0.216 bits (17 tokens of 9 different types)
  • shell/version.rb: 0.126 bits (17 tokens of 9 different types)
  • irb/lc/error.rb: 0.305 bits (93 tokens of 10 different types)

Both irb/version.rb and shell/version.rb look like

module SomeModule
  @RELEASE_VERSION = "version"
  @LAST_UPDATE_DATE = "release date"
end

It's no wonder: they were both written by ISHITSUKA Keiju.

Other lexically simple files

  • English.rb: 0.616 bits (100 tokens of 4 different types)
  • tkclass.rb: 0.619 bits (150 tokens of 10 different types)
  • i686-linux/rbconfig.rb: 0.631 bits (1616 tokens of 41 different types)

English.rb is but a sequence of

 alias $READABLE_NAME $SOME_UGLY_PERLISH_VAR

tkclass.rb consists mostly of assignments like

 TopLevel = TkToplevel

and i686-linux/rbconfig.rb has got a lot of lines resembling

 CONFIG["ruby_install_name"] = "ruby"

The "lexical complexity" above represent the amount of information carried by a token (type) if we know the type of the token immediately preceding it. In all three cases, there are on average less than 2 choices. This can be expected in English.rb and tkclass.rb: after all, they only have 4 and 10 different tokens, respectively. The case of rbconfig.rb is more interesting: there are 41 distinct token types, but the lexical variability is very low; in other words, the code is very repetitive.

The most complex files in stdlib

  • tk/timer.rb: 2.411 bits (2656 tokens of 70 different types)
  • resolv.rb: 2.415 bits (8556 tokens of 85 different types)
  • optparse.rb: 2.505 bits (5903 tokens of 84 different types)

So the "lexically richest/most complex" file in the stdlib is optparse.rb. In general, bigger files tend to use more token types. But length detracts from the lexical variability at some point, as the code becomes a bit repetitive.

Some techniques

If you don't feel like reading the explanation and code below, here are the four things that stand out the most:

  • Using the block form of Hash.new to assign unique IDs to objects (see the short demo after this list):
 OIDS = Hash.new{|h,k| h[k] = OIDS.size}

  • Another way to cache the results of a method call, using a hash captured in the method's closure (this could easily be metaprogrammed):
class Foo
  foo_cache = {}
  define_method(:foo) do |*a|
    if foo_cache.include?(a)
      foo_cache[a]
    else
      ret = nil # compute the actual value here; nil is just a placeholder
      foo_cache[a] = ret
    end
  end
end

  • Computing a weighted average:
weights = [1, 2, 3, 2, 4, 2, 1, 3]
weighted_avg = weights.inject([0,0]) do |(sum,idx), weight|
  [sum + weight * func(idx), idx + 1]  # func(idx): the value associated with the idx-th weight
end.first

  • Beware of Matrix; it's slow and the interface is a bit rough at times (messy failures when requesting out-of-bound rows/columns, Vector objects are not enumerable...). For heavier numeric work, narray is worth a look.
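
For instance, the Hash.new trick from the first item hands out consecutive IDs as objects are first seen; a tiny demo (mine, not from the original scripts):

 OIDS = Hash.new{|h,k| h[k] = OIDS.size}
 OIDS[:foo]    # => 0
 OIDS[:bar]    # => 1
 OIDS[:foo]    # => 0, already registered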

How to compute the lexical complexity of Ruby code


Modelling random Ruby programs

To begin with, since I'm only considering the vocabulary used in a given program, the source code is treated simply as a sequence of token types. So, given

 a = 1

I'll operate on

 tIDENTIFIER '=' tINTEGER '\n'

Luckily, ruby can provide that information with the -y command-line option, along with lots of other details about the parsing process (i.e. how the LALR(1) automaton works):

 batsman@tux-chan:/tmp$ ruby -y -c -e "a = 1" 2>&1
 Starting parse
 Entering state 0
 Reducing stack by rule 1 (line 328), -> @1
 Stack now 0
 Entering state 2
 Reading a token: Next token is token tIDENTIFIER ()
 Shifting token tIDENTIFIER, Entering state 34
 Reading a token: Next token is token '=' ()
 ...
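
Extracting the token sequence then boils down to grepping those "Reading a token" lines. Here's a minimal sketch of the idea (the full code further down does the same thing while also counting transitions):

tokens = []
IO.popen(%{ruby -y -c -e "a = 1" 2>&1}) do |f|
  f.each do |line|
    if md = /^Reading a token: Next token is token (\S+)/.match(line)
      tokens << md[1]
    end
  end
end
tokens    # e.g. ["tIDENTIFIER", "'='", "tINTEGER", ...]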

We can turn an arbitrary Ruby program into a sequence of token types; so what? Keeping in mind that we would like to arrive at some sort of "lexical complexity" index, the method must at least satisfy these properties:

  • the more token types in the program, the higher the index
  • if we can predict the type of the next token based on the previous ones, the "lexical complexity" goes down

These two basic properties can be satisfied by modelling Ruby code using a Markov chain and calculating the associated entropy rate: I'll explain how that works with a bit of theory and a simple example, and then apply that to the token information ruby can provide.

Markov chains of tokens

(Scroll down to the code if you really, really dislike equations.)

As said above, at this level the program is just a sequence of token types $\{x_1, x_2, \ldots, x_n\}$. Given $n$ symbols, we'd like to predict what the next one will be, that is, the probability that it is of a given type

$$\Pr\{X_{n+1} = x_j \mid X_1 = x_1, \ldots, X_n = x_n\}$$

In the following model, I'll make the simplifying assumption that a token depends only on the one preceding it. This is of course untrue: the probability of the next token being end is higher in

 class Foo; def foo; end 

than in

 puts 1

Indeed, the probability of finding an end token in a syntactically valid Ruby program is exactly 0 if we have no open class, module, if etc.

So, to what extent does that assumption contaminate the results? It turns out that it doesn't really matter that much. By assuming that the next token depends only on the preceding one, I'm discarding all the information carried by the earlier tokens: this means that the process misses some opportunities to identify redundancy (remember the second property above), so the measure of "lexical complexity" I get will be higher than the actual one. In other words, I'm just computing an upper bound for the complexity. But this doesn't invalidate the model as a tool to assess the relative lexical complexities of two Ruby programs. (Besides, if enough people are interested I'll show another way to estimate the lexical complexity that doesn't discard that much information.)

The assumption that a token depends only on the preceding one can be written as

$$\Pr\{X_{n+1} = j \mid X_1 = x_1, \ldots, X_n = x_n\} = \Pr\{X_{n+1} = j \mid X_n = x_n\}$$

which is read as "the probability of $X_{n+1}$ being $j$ given the values of all the preceding tokens is the same as when we only know the value of the token immediately before it."

Probability transition matrix

We can write the probability that a token is of type $j$ when the preceding one was of type $i$ as

$$P_{ij} = \Pr\{X_{n+1} = j \mid X_n = i\}$$

That's the probability of a transition from $i$ to $j$; the matrix encompassing all such probabilities is $P = [P_{ij}]$.

Suppose that we know the probabilities of a given token being of any of the possible token types (e.g. a 50% probability of it being tIDENTIFIER and 50% of it being tINTEGER), i.e. the probability of type $x_n$ is $p(x_n)$. The probabilities for the next symbol are

$$p(x_{n+1}) = \sum_{x_n} p(x_n) \, P_{x_n x_{n+1}}$$
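
In code, that propagation is just a row vector times the transition matrix. A quick sketch with the stdlib Matrix class, using the made-up 2x2 matrix from the example further down:

require 'matrix'
P = Matrix[ [0.2, 0.8], [0.4, 0.6] ]    # transition matrix: rows = current type, columns = next type
p_n = Matrix.row_vector([0.5, 0.5])     # current distribution over the two token types
p_n * P                                 # => Matrix[[0.3, 0.7]] (modulo floating point noise)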

Stationary distribution

The distribution of probabilities $\mu$ such that $\mu P = \mu$ is called a "stationary distribution". Given some conditions (if the process is aperiodic and you can get from any state to any other in a finite number of steps), there's only one stationary distribution, and the distribution of $X_n$ tends to it as $n$ tends to infinity.

Example

Let's imagine that we have only two token types, a and b, and we know the probability transition matrix is

$$P = \begin{bmatrix} 0.2 & 0.8 \\ 0.4 & 0.6 \end{bmatrix}$$

That is, a is followed by b with probability 0.8 and by another a with probability 0.2, while b is followed by a with probability 0.4 and by another b with probability 0.6.

By solving $\mu P = \mu$, we get

$$\mu = \left[ \tfrac{1}{3}, \ \tfrac{2}{3} \right]$$

In other words, if we generate an infinite number of symbols using that process, one third of them will be of type a, and two thirds of type b.
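
For a chain with only two states this is easy to check by hand (a little derivation of mine, which only restates the equation above): $\mu P = \mu$ together with $\mu_a + \mu_b = 1$ gives

$$\mu_a = \frac{P_{ba}}{P_{ab} + P_{ba}} = \frac{0.4}{0.8 + 0.4} = \frac{1}{3}, \qquad \mu_b = 1 - \mu_a = \frac{2}{3}$$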

There's another way to get that result without actually solving the above equation. Just by multiplying the transition matrix by itself repeatedly, each row will progressively approach $\mu$:

require 'matrix'
m = Matrix[ [0.2, 0.8], [0.4, 0.6] ]               # => Matrix[[0.2, 0.8], [0.4, 0.6]]
m * m * m * m * m * m * m                          # => Matrix[[0.3333248, 0.6666752], [0.3333376, 0.6666624]]

A measure of the lexical complexity

Using Markov processes as presented above, I take the amount of information added by a single token (in an infinite sequence of tokens generated according to the model) as a measure of the lexical complexity of an arbitrary Ruby program.

How to compute that? It's the weighted average of the "surprise" caused by a token when we know the preceding one. This is of course weighted with the probabilities from the stationary distribution. (In mathematical terms, I'll be using the entropy rate of the Markov process matching a given text).
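
Written out explicitly (my notation, but it matches the definitions above), the quantity being computed is the entropy rate

$$H(\mathcal{X}) = \sum_i \mu_i \, H(P_{i\cdot}) = -\sum_i \mu_i \sum_j P_{ij} \log_2 P_{ij}$$

where $\mu$ is the stationary distribution and $P_{i\cdot}$ is the $i$-th row of the transition matrix.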

The code

The transition matrix is computed by recording for each token the type of the token preceding it:

require 'matrix'

def transition_matrix(filename)
  indices = Hash.new{|h,k| h[k] = indices.size}
  probs = []
  last_tok = nil
  ntokens = 0
  freqs = []
  IO.popen("ruby -y -c #{filename} 2>&1") do |f|
    f.each do |line|
      if md = /^Reading a token: Next token is token (\S+)/.match(line)
        tok = md[1]
        ntokens += 1
        freqs[indices[tok]] ||= 0
        freqs[indices[tok]] += 1
        if last_tok
          probs[indices[last_tok]] ||= []
          probs[indices[last_tok]][indices[tok]] ||= 0
          probs[indices[last_tok]][indices[tok]] += 1
        end
        last_tok = tok
      end
    end
  end
  if ntokens == 0
    return [nil, nil, nil, 0]
  end
  probs.map! do |row|
    row ||= []  # token types never seen as a predecessor get an empty row
    sum = row.inject(0){|s,x| s + (x || 0)}
    sum.zero? ? [] : row.map{|x| 1.0 * (x || 0) / sum }
  end

  freqs = freqs.map{|p| 1.0 * p / ntokens }
  cols = [probs.size, probs.map{|x| x.size}.max].max
  probs << [0] * cols while probs.size < cols
  [Matrix.rows(probs.map{|row| row + [0] * (cols - row.size)}), freqs, indices, ntokens]
end

The entropy rate is calculated by obtaining the stationary distribution if there's one, or just taking the observed token type probabilities otherwise, and then computing the weighted average of the entropies of each row in the transition matrix.

LOG2 = Math.log(2)
def H(probs)
  probs.to_a.inject(0) do |s,x|
    if x < 1e-6 # for continuity
      s
    else
      s - 1.0 * x * Math.log(x) / LOG2
    end
  end
end
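
A couple of quick sanity checks (mine, not part of the original script):

H([0.5, 0.5])    # 1.0: a fair two-way choice carries one bit
H([1.0])         # 0.0: a certain outcome carries no information
H([0.25] * 4)    # 2.0: four equally likely choices need two bits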

class Object
  _stationary_distribution_cache = {}
  max_iterations = 10

  define_method(:stationary_distribution) do |t_matrix, *rest|
    if _stationary_distribution_cache.has_key?(t_matrix)
      _stationary_distribution_cache[t_matrix]
    else
      key = t_matrix  # remember the original matrix; t_matrix gets reassigned below
      threshold = rest[0] ? rest[0] : 0.001
      2.times{ t_matrix = t_matrix * t_matrix }
      iterations = 0
      loop do
        t_matrix = t_matrix * t_matrix
        # should verify that all the rows are close, but the following weaker
        # test will do
        d1 = (t_matrix.row(0) - t_matrix.row(t_matrix.row_size - 1)).to_a.inject(0){|s,x| s + x.abs}
        d2 = (t_matrix.row(0) - t_matrix.row(t_matrix.row_size/2)).to_a.inject(0){|s,x| s + x.abs}
        break if d1 < threshold && d2 < threshold || iterations > max_iterations
        iterations += 1
      end
      return nil if iterations >= max_iterations
      _stationary_distribution_cache[key] = t_matrix.row(0).to_a
    end
  end
end

def entropy_rate(t_matrix)
  mu = stationary_distribution(t_matrix)
  return nil if mu.nil?
  ret = 0
  mu.each_with_index do |p, row|
    ret += p * H(t_matrix.row(row))
  end
  ret
end
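
As a quick check (mine, not in the original post), these two functions recover the stationary distribution of the two-token example above and give its entropy rate:

toy = Matrix[ [0.2, 0.8], [0.4, 0.6] ]
stationary_distribution(toy)    # close to [1.0/3, 2.0/3]
entropy_rate(toy)               # about 0.89 bits per token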

t_matrix, freqs, indices, tokens = transition_matrix("~/usr/lib/ruby/1.8/webrick/version.rb")
#t_matrix, freqs, indices, tokens = transition_matrix("~/usr/lib/ruby/1.8/weakref.rb")
#t_matrix, freqs, indices, tokens = transition_matrix("~/usr/lib/ruby/1.8/erb.rb")

entropy_rate(t_matrix)                             # => nil
h_eq = Math.log(t_matrix.row_size) / LOG2          # => 3.0
tokens                                             # => 11
avg_entropy = freqs.inject([0,0]) do |(s,idx), p|
  [s + p * H(t_matrix.row(idx)), idx + 1]
end.first
avg_entropy                                        # => 0.454545454545455

Finally, here's the code that performs that analysis for all the files in RUBYLIBDIR and saves the obtained information:

#############################################################################
# compute entropy rates for the files under ARGV[0] || stdlib

require 'find'
require 'rbconfig'
require 'yaml'

dir = ARGV[0] || Config::CONFIG["rubylibdir"]

$stdout.sync = true
begin
  statistics = {}
  Dir.chdir(dir) do
    Find.find(".") do |file|
      next unless /\.rb$/ =~ file
      #next unless %r{^./rss.rb$} =~ file
      groupname = File.dirname(file).split(%r{/})[1]
      groupname = File.basename(file)[/[^.]+/] if groupname.nil?
      puts "Processing #{file} (under #{groupname})"

      t_matrix, freqs, indices, tokens = transition_matrix(file)
      if t_matrix
        h_rate = entropy_rate(t_matrix)
        if h_rate
          mu = stationary_distribution(t_matrix)
        else
          h_rate = freqs.inject([0,0]) do |(s,idx), p|
            [s + p * H(t_matrix.row(idx)), idx + 1]
          end.first
          mu = freqs
        end

        h_iid = H(mu)
        h_eq1 = mu.inject([0,0]) do |(s,idx), p|
          non_zero = t_matrix.row(idx).to_a.select{|x| x > 0}.size
          non_zero = [1, non_zero].max
          [s + p * Math.log(non_zero) / LOG2, idx + 1]
        end.first
        h_eq2 = Math.log(t_matrix.row_size) / LOG2

        (statistics[groupname] ||= []) << [file, h_rate, h_eq1, h_iid, h_eq2, tokens]
        p [file, h_rate, h_eq1, h_iid, h_eq2, tokens]
        # [file, 
        #  h_rate  either the entropy rate of the markov chain if there is
        #          a stationary distribution or the weighted sum of the
        #          entropies for each row in the transition matrix, based on
        #          the observed frequencies for the prev token type
        #  h_eq1   entropy if all possible (as observed by an instance of the
        #          process) choices given the prev token are equidistributed
        #  h_iid   we ignore the markovian process and consider this as a
        #          stochastic process of indep. identically distributed
        #          random variables (so all tokens are considered independent)
        #  h_eq2   entropy per token if we consider them all equally probable
        #  tokens  number of tokens
        # ]
      end
    end
  end

ensure
  File.open("lexical.entropy", "w"){|f| Marshal.dump(statistics, f) }
  puts "=" * 80
  puts statistics.to_yaml
  raise if $!
end

By the way, the lexical complexity of these scripts is 2.303 bits, with 1118 tokens of 59 different types. That places them relatively close to optparse.rb.

Download

Stand-alone lexical complexity estimator: lexical_complexity.rb


Blah - Eaten-Alive (2005-12-20 (Tue) 12:36:14)

Run it against other major versions of Ruby, and compare. (1.6 -> 1.8, 1.8 -> 1.9)


i should say - himm (2005-12-19 (Mon) 23:44:06)

you are sick.


jo 2005-12-29 (Thu) 04:34:25

A tool to analyze C programs is cil.sf.net which is a full C parser written in OCAML. (Perhaps also see nekovm.org for a virtual machine based on OCAML).
