
Estimating how many people are subscribed to my RSS feeds

Wanting to know how many people are reading my RSS feeds, I wrote a couple of scripts to analyze my httpd logs. Their only merit, if any, is that they use a cache (with good old Marshal) to avoid processing the same data twice, so they still terminate in a fraction of a second when I run them against my >200MB access.log file, which I update with my append-only rsync substitute.
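The cache amounts to little more than Marshal'ing a small array of state (the byte offset reached in the log, plus whatever data the next run needs) and reloading it on startup. A minimal sketch of the pattern; the file names and the cache layout here are made up for illustration:

```ruby
require 'tmpdir'

# Throwaway cache file; the real scripts take the path as ARGV[0].
cache = File.join(Dir.mktmpdir, "example.cache")

# Load the previous run's state, falling back to defaults on the first run
# (or if the cache is unreadable).
skip_bytes, counts = begin
  File.open(cache, "rb") { |f| Marshal.load(f) }
rescue
  [0, {}]
end

# ...seek to skip_bytes in the log and process only the new lines here...
skip_bytes += 1024            # pretend we consumed 1KB of new log data
counts["feed.rss"] = 42       # pretend we tallied a feed

# Persist the state so the next run starts where this one stopped.
File.open(cache, "wb") { |f| Marshal.dump([skip_bytes, counts], f) }

reloaded = File.open(cache, "rb") { |f| Marshal.load(f) }
```

Since only bytes past the saved offset are read, the run time is proportional to the new log data, not to the full log.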

The bloglines index

bloglines seems to be one of the most successful online aggregators, and its bots leave some interesting information in the Referer field (which e.g. Google's Feedreader doesn't), so it's quite a good way to measure a site's growth.

bloglines.png

The script aggregates requests originating from bloglines within a one-hour interval, and saves the information needed by the next run (how much of the file has been processed, plus the data corresponding to the last requests). Pretty simple:

#!/usr/bin/env ruby
# Copyright (C) 2006 Mauricio Fernandez <mfp@acm.org> http://eigenclass.org
# Use and distribution under the same terms as Ruby.
#
# Use as in
#   ruby bloglines.rb bloglines.cache >> bloglines.dat
# the script will read ./access.log.

require 'time'

INTERVAL = 3600  # seconds

$stderr.sync = true

if ARGV.size < 1
  puts <<-EOS
  ruby bloglines.rb <cache>
  EOS
  exit
end

skip_bytes = 0
last_date = Time.at(0)
saved_data = {}
  
if File.exist? ARGV[0]
  # Fall back to the defaults above if the cache is corrupt; a bare
  # "rescue nil" would nil out all three variables.
  skip_bytes, last_date, saved_data =
    (Marshal.load(File.read(ARGV[0])) rescue [0, Time.at(0), {}])
end

blog_subscribers = Hash.new{|h,k| h[k] = {} }
blog_subscribers.update(saved_data)
bloglines_re = %r{^(\S+).*?\[([^\]]*)\].*GET (\S+) .*http://www.bloglines.com; ([0-9]+) subscriber}

fsize = tentative_fsize = skip_bytes
block_first = last_date
$stderr.puts "Skipping #{skip_bytes} bytes."
File.open("access.log") do |f|
  f.pos = skip_bytes
  f.each_with_index do |l, done|
    tentative_fsize += l.size
    $stderr.print "#{done}\r" if done % 100 == 0
    next unless md = bloglines_re.match(l)
    ip, date, feed, subscribers = md.captures
    date = Time.parse(date.gsub(%r{/}, " ").gsub(/(200.):/, '\1 '))
    if date - block_first > INTERVAL
      block_first = date
      #puts "=" * 80
    end
    blog_subscribers[block_first][feed] = subscribers.to_i
    fsize = tentative_fsize
  end
end

File.open(ARGV[0], "w") do |f|
  retained = blog_subscribers.to_a.select{|k,v| k >= block_first - INTERVAL}
  subs_info = Hash[*retained.flatten]
  Marshal.dump([fsize, block_first, subs_info], f)
end

subscriber_stats = []
last = 0
blog_subscribers.keys.sort.each do |k|
  count = blog_subscribers[k].inject(0){|s,(feed,n)| s + n}

  # assume we cannot lose over 5%: we'll consider that was due to bloglines
  # dropping some feed
  if count >= 0.95 * last 
    subscriber_stats << [k, count]
    last = count
  end
end

subscriber_stats.each do |date, count|
  puts "%s    %4d" % [date.strftime("%Y-%m-%d:%H:%M:%S"), count]
end

(Feel free to modify it to open the cache in binary mode on win32; I didn't feel like changing Marshal.load(File.read(ARGV[0])).)
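For the record, the win32-safe variant would read the cache through an IO opened in binary mode instead of slurping it with File.read; something along these lines (a sketch, not tested on win32):

```ruby
require 'tmpdir'

# Equivalent to Marshal.load(File.read(path)), but opened in binary mode so
# win32 doesn't mangle the marshalled bytes with newline translation.
def load_cache(path)
  File.open(path, "rb") { |f| Marshal.load(f) }
end

# Round-trip check with a throwaway cache file.
path = File.join(Dir.mktmpdir, "example.cache")
File.open(path, "wb") { |f| Marshal.dump([0, Time.at(0), {}], f) }
state = load_cache(path)
```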

The above script reads access.log, and dumps new data to stdout (I just append it to bloglines.dat). You can collect that information and plot it with gnuplot quite easily:

set terminal postscript color
set xdata time
set timefmt "%Y-%m-%d:%H:%M:%S"
set format x "%m/%d"

set ylabel "bloglines subscribers"
set output "bloglines.ps"
set y2label ""
plot "bloglines.dat" using 1:2 with lines title ""

Direct subscribers

Also of interest is the readership hitting your RSS feeds directly. In this case, I'm just measuring the number of unique IPs the feeds are fetched from in one day. This figure deviates from the actual number of readers in two ways:

  • it includes requests originating from online aggregators and such services
  • the script doesn't even try to account for the multiplicity of a request (a single fetch may stand for several readers, or several readers may share one IP)

Requests from external aggregator services seem to amount to only a small fraction of the total, so the estimate shouldn't differ that much from the actual number of desktop news aggregators tracking my RSS.
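Stripped of the caching and file handling, the measurement itself boils down to collecting the set of distinct client IPs per day. A self-contained sketch; the log lines and the simplified regexp are fabricated for illustration:

```ruby
require 'set'

# Made-up common-log-format lines: 1.2.3.4 fetches the feed twice in one day.
lines = [
  '1.2.3.4 - - [23/Mar/2006:09:11:31 +0100] "GET /hiki.rb?c=rss; HTTP/1.1" 200 -',
  '1.2.3.4 - - [23/Mar/2006:18:00:05 +0100] "GET /hiki.rb?c=rss; HTTP/1.1" 200 -',
  '5.6.7.8 - - [23/Mar/2006:12:30:00 +0100] "GET /hiki.rb?c=rss; HTTP/1.1" 200 -',
]

re = %r{^(\S+) .*\[([^:]+)}   # capture the IP and the dd/Mon/yyyy date part

ips_per_day = Hash.new { |h, k| h[k] = Set.new }
lines.each do |l|
  next unless md = re.match(l)
  ip, date = md.captures
  ips_per_day[date] << ip     # a Set deduplicates repeat fetches for free
end

counts = ips_per_day.transform_values(&:size)
# counts["23/Mar/2006"] => 2 (the duplicate hit from 1.2.3.4 counts once)
```

The script below does the same thing with a nested Hash instead of a Set, on top of the incremental log processing shown earlier.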

#!/usr/bin/env ruby
# Copyright (C) 2006 Mauricio Fernandez <mfp@acm.org> http://eigenclass.org
# Use and distribution under the same terms as Ruby.
#
# Use as in
#    ruby direct-subscribers.rb direct-subscribers.cache
# The script will read ./access.log and (re)write ./direct-subscribers.dat.

require 'time'
require 'date'

# Change to match your RSS URI
RSS_RE = %r{^(\S+) eigenclass.org - \[([^:]+)[^\]]+\] "GET /hiki.rb\?c=rss;}

$stderr.sync = true

if ARGV.size < 1
  puts <<-EOS
  ruby direct-subscribers.rb <cache>
  EOS
  exit
end

skip_bytes = 0
last_date = "01/Jan/1970"
saved_data = {}
  
if File.exist? ARGV[0]
  # Fall back to the defaults above if the cache is corrupt; a bare
  # "rescue nil" would nil out all three variables.
  skip_bytes, last_date, saved_data =
    (Marshal.load(File.read(ARGV[0])) rescue [0, "01/Jan/1970", {}])
end

subscribers = Hash.new{|h,k| h[k] = {} }
subscribers.update(saved_data)

dump_since = last_date

fsize = tentative_size = skip_bytes
File.open("access.log") do |f|
  f.pos = skip_bytes
  f.each_with_index do |l, done|
    tentative_size += l.size
    $stderr.print "Done: #{done}\r" if done % 100 == 0
    next unless md = RSS_RE.match(l)
    ip, date = md.captures
    subscribers[date][ip] = true # any will do
    if date != last_date 
      fsize = tentative_size 
      last_date = date
    end
  end
end


MONTHS = %w[Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec]
format_date = lambda do |date|
  day, month, year = date.split(%r{/})
  "%4d-%02d-%02d" % [year.to_i, MONTHS.index(month) + 1, day.to_i]
end

File.open(ARGV[0], "w") do |f|
  subs_info = Hash[*subscribers.to_a.sort_by{|date, _| format_date[date]}.last]
  Marshal.dump([fsize, last_date, subs_info], f)
end

odata = {}

subscribers.each do |date, subs|
  day, month, year = date.split(%r{/})
  odata[format_date[date]] = subs.size
end

olddata = (File.readlines("direct-subscribers.dat") rescue [])
File.open("direct-subscribers.dat", "w") do |f|
  f.puts olddata[0..-2]

  beginning = format_date[dump_since]
  odata.keys.sort.each do |date|
    next unless date >= beginning
    f.puts "%s    %4d" % [date, odata[date]]
  end
end


Bloglines as 93 subscribers - Derek at CD Baby (2006-03-23 (Thr) 09:11:31)

I read you through Bloglines.

It tells me your feed has 93 subscribers at Bloglines.

mfp 2006-03-23 (Thr) 09:15:35

There are several feeds for eigenclass.org (RSS 2.0 and RDF formats, rich RSS selection expressions...); the one you're subscribed to is but one of them. Overall, about 210 people are subscribed to eigenclass' feeds on bloglines (see the graph). This is why the above script aggregates the subscribers corresponding to different feeds in a given interval.


Last modified:2006/03/23 05:30:21
Keyword(s):[blog] [ruby] [rss] [readership] [snippet] [subpar] [frontpage]