#168 Feed Parsing
Below is the home page of a Rails blogging application. We’d like to make the site a little more useful by integrating information from another site into it. To do this we’ll add a list of links on the page that link to another Rails-related site, say a list of the most recent ASCIIcasts.
If we visit asciicasts.com we’ll see a list of the most recent episodes on the home page, but how should we get this data onto our site? We could use screen scraping: grabbing the HTML from the page and parsing it to extract the data we want. This works, but it is fragile. For example, if the site owner changes the structure of the page, our parsing code may no longer be able to extract the right data.
A much better way, where it’s available, is to use an RSS feed. The ASCIIcasts site has an RSS feed containing a list of the episodes so instead of scraping the site, we can just pull the data we need from the feed.
Feedzirra
There are a number of ways of parsing an RSS feed in Ruby, but one of the best is a gem called Feedzirra. The main advantage of Feedzirra is its speed; it parses feeds very quickly, but it is also useful as it can parse many different types of feed.
To install Feedzirra we first need to make sure that http://gems.github.com is in our list of gem sources. If not, we’ll need to add it.
gem sources -a http://gems.github.com
Now we can install the gem:
sudo gem install pauldix-feedzirra
Several dependencies will be installed alongside the gem. Once everything’s installed we’ll need to add a reference to the gem in our application’s /config/environment.rb file.
config.gem "pauldix-feedzirra", :lib => "feedzirra", :source => "http://gems.github.com"
That’s it. We’re ready to start parsing RSS feeds in our application.
Getting The Feed
We’re going to show the feed on the home page of our site, but we don’t want to fetch the feed every time a user visits that page, as fetching and parsing it are expensive operations that take time to run. It would be better to cache the feed locally. There are various ways we could cache the feed’s data; we’re going to store it in the database and create a new model to represent an entry in the feed. We’ll call this model feed_entry. The model will have five attributes to store the entry’s data: name to store the headline, summary to store the content, url to store the entry’s link, published_at for the time the entry was published, and guid to store the entry’s unique identifier so that we can check for duplicates.
We’ll generate our model with
script/generate model feed_entry name:string summary:text url:string published_at:datetime guid:string
Then migrate the database to generate the table.
rake db:migrate
The logic for parsing the feed and updating the entries will be added to the FeedEntry class. To start off we’ll need a method that parses the feed and adds any new entries to the database. For this we’ll write a class method called update_from_feed.
def self.update_from_feed(feed_url)
  feed = Feedzirra::Feed.fetch_and_parse(feed_url)
  feed.entries.each do |entry|
    unless exists? :guid => entry.id
      create!(
        :name         => entry.title,
        :summary      => entry.summary,
        :url          => entry.url,
        :published_at => entry.published,
        :guid         => entry.id
      )
    end
  end
end
The method takes one parameter: the URL of the feed that Feedzirra will parse. It fetches the feed and parses it, then loops through each entry and adds it to the database unless it’s already there. The method uses ActiveRecord’s exists? method to search for an entry by its guid to see if the entry is already in the database.
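The duplicate check can be tried outside Rails. What follows is only a minimal sketch: it assumes a hypothetical in-memory EntryStore class in place of the feed_entries table and a plain Struct in place of a Feedzirra entry. None of these names come from Feedzirra or ActiveRecord.

```ruby
# Hypothetical stand-in for a parsed feed entry (not a Feedzirra class).
Entry = Struct.new(:id, :title, :url)

# Hypothetical in-memory replacement for the feed_entries table.
class EntryStore
  def initialize
    @rows = {}
  end

  # Plays the role of the exists?(:guid => ...) check in update_from_feed.
  def exists?(guid)
    @rows.key?(guid)
  end

  # Adds only the entries whose guid has not been stored yet.
  def add_entries(entries)
    entries.each do |entry|
      next if exists?(entry.id)
      @rows[entry.id] = { :name => entry.title, :url => entry.url }
    end
  end

  def count
    @rows.size
  end
end

store = EntryStore.new
first_fetch  = [Entry.new("ep-1", "First",  "http://example.com/1"),
                Entry.new("ep-2", "Second", "http://example.com/2")]
second_fetch = first_fetch + [Entry.new("ep-3", "Third", "http://example.com/3")]

store.add_entries(first_fetch)
store.add_entries(second_fetch) # ep-1 and ep-2 are skipped as duplicates
puts store.count
```

Running the update twice leaves three entries rather than five, which is the behaviour we want from update_from_feed when the same feed is fetched repeatedly.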
We can now go into the console and try out our new method to get the entries from the ASCIIcasts feed into our database.
>> FeedEntry.update_from_feed("http://asciicasts.com/episodes.xml")
There will be a delay of a few seconds while the feed is fetched and parsed, and then we should see a long array of Feedzirra entry objects returned. Once it’s finished we should have our entries in the database.
>> FeedEntry.count
=> 61
If we were to run the command again it would only add any new entries that had been created since the last time we ran it. To keep the feed up to date we could set up a cron job to fetch the feed at a regular interval. If we were to do this we could use the Whenever gem that was covered in episode 164.
Now that we have our feed entries in the database we’ll modify our view code to show the most recent entries. At the top of the articles index view we’ll add the following code to render the ten most recent entries.
<div id="recent_episodes">
  <h3>Recent ASCIIcasts Episodes</h3>
  <ul>
    <% for entry in FeedEntry.all(:limit => 10, :order => "published_at desc") %>
      <li><%= link_to h(entry.name), entry.url %></li>
    <% end %>
  </ul>
</div>
With a little CSS we can style the div and make the list appear on the right-hand side of the articles page.
#recent_episodes {
  float: right;
  border: solid 1px #666;
  margin: 8px 0 16px 16px;
  padding: 4px;
  background-color: #DDD;
}

#recent_episodes h3 {
  margin: 0;
  font-size: 1em;
}

#recent_episodes ul {
  list-style: none;
  margin-left: 8px;
  padding-left: 0;
}

#recent_episodes a {
  font-size: 0.9em;
}
Our page now has a panel on it showing the most recent episodes.
More Frequent Updates
The code we’ve written works well when we don’t need to check for updates to the feed very often, but if we had to check every ten minutes or so then this isn’t the most efficient way to do it. We have to get the full feed every time and most of the time the data in it won’t have changed from the last time, so we’re wasting time and bandwidth by always pulling the whole feed back.
Thankfully Feedzirra provides a way of getting updates for a feed. If you look at the example code for Feedzirra you can see that there is a method that will get only the entries for the feed that have been updated since the feed was last retrieved.
# updating a single feed
updated_feed = Feedzirra::Feed.update(feed)
The update method uses ETags to determine whether the feed has changed since it was last fetched, and will only download and reparse it if it has. To get the new entries there’s a new_entries method that will return a collection of the new entries.
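To make the ETag idea concrete, here is a minimal sketch of a conditional GET built with Ruby’s standard Net::HTTP library. It is not Feedzirra’s implementation. The helper names build_conditional_get and feed_changed? and the ETag value are made up for illustration, and no request is actually sent.

```ruby
require "net/http"
require "uri"

# Build a GET request that asks the server to skip the body if the feed's
# ETag still matches the one we saved on the previous fetch.
def build_conditional_get(feed_url, etag = nil)
  uri = URI.parse(feed_url)
  request = Net::HTTP::Get.new(uri.request_uri)
  request["If-None-Match"] = etag if etag
  request
end

# A 304 (Not Modified) status means the feed is unchanged, so there is
# nothing to download or parse. Any other status is treated as "changed".
def feed_changed?(response)
  response.code.to_s != "304"
end

# The ETag value here is invented; a real one comes from the ETag header
# of an earlier response.
request = build_conditional_get("http://asciicasts.com/episodes.xml", '"abc123"')
puts request["If-None-Match"]
```

When the server replies with 304 the response has no body, which is where the savings in time and bandwidth come from.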
I was unable to get this working while writing the test application, but I’ll show you the code that should work and enable you to get frequent updates from a feed. What we’re going to do is add another method to our FeedEntry class to go with the update_from_feed method we created earlier. This method will repeatedly poll the feed and add any updated entries to the database.
Our new method will reuse the code that adds the entries to the database, so we’ll start by extracting this code into its own method.
class FeedEntry < ActiveRecord::Base
  def self.update_from_feed(feed_url)
    feed = Feedzirra::Feed.fetch_and_parse(feed_url)
    add_entries(feed.entries)
  end

  def self.add_entries(entries)
    entries.each do |entry|
      unless exists? :guid => entry.id
        create!(
          :name         => entry.title,
          :summary      => entry.summary,
          :url          => entry.url,
          :published_at => entry.published,
          :guid         => entry.id
        )
      end
    end
  end
  # A bare "private" keyword has no effect on methods defined with "def self.",
  # so the class method is made private explicitly.
  private_class_method :add_entries
end
Now we can write our new method, which we’ll call update_from_feed_continuously.
def self.update_from_feed_continuously(feed_url, delay_interval = 15.minutes)
  feed = Feedzirra::Feed.fetch_and_parse(feed_url)
  add_entries(feed.entries)
  loop do
    sleep delay_interval.to_i
    feed = Feedzirra::Feed.update(feed)
    add_entries(feed.new_entries) if feed.updated?
  end
end
This method is similar to the update_from_feed method but it takes an additional parameter that specifies how often the feed should be polled. It starts by getting the full feed and adding the entries then enters an endless loop that sleeps for the specified period before checking to see if the feed has been updated and, if so, adding any new entries to the database.
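The shape of this polling loop can be exercised without Feedzirra or a database. The sketch below makes several assumptions: poll_feed, its max_polls limit and the injected sleeper are hypothetical names that exist only so the example terminates and runs instantly, and the lambdas stand in for fetching the feed and saving entries. Unlike the method above, it polls first and then sleeps.

```ruby
# Hypothetical poller. `fetch` is any callable returning the entries that are
# new since the previous call; `store` is any callable that saves them.
# `max_polls` and `sleeper` exist only to keep this demo bounded and fast.
def poll_feed(fetch, store, delay_interval, max_polls = nil,
              sleeper = lambda { |seconds| sleep(seconds) })
  polls = 0
  loop do
    new_entries = fetch.call
    store.call(new_entries) unless new_entries.empty?
    polls += 1
    break if max_polls && polls >= max_polls
    sleeper.call(delay_interval)
  end
  polls
end

# Simulated results of three successive polls; the second finds nothing new.
batches = [["ep-1", "ep-2"], [], ["ep-3"]]
saved = []

fetch    = lambda { batches.shift || [] }
store    = lambda { |entries| saved.concat(entries) }
no_sleep = lambda { |_seconds| } # skip the real wait in this demo

polls = poll_feed(fetch, store, 15 * 60, 3, no_sleep)
puts saved.inspect
```

Passing the fetch, store and sleep steps in as callables keeps the loop itself easy to test; in the real method those roles are played by Feedzirra::Feed.update, add_entries and Kernel#sleep.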
So we now have two methods for getting entries from an RSS feed: one that is suitable for a cron job, and another that can run in a daemonized process and is more suitable when a feed needs to be checked for updates frequently.
It should be noted that a loop is not the best way to do a daemonized process. For a better approach look at Episode 129 which shows how to use the daemons gem to create a background process.