#173
Aug 03, 2009

Screen Scraping with ScrAPI

Screen scraping is not pretty, but sometimes it's your only option to extract content from an external site. In this episode I show you how to fetch product prices using ScrAPI.
Tags: plugins
Download (33.9 MB, 15:23)
alternative download for iPod & Apple TV (19.7 MB, 15:23)

Resources

sudo gem install scrapi
# config/environment.rb
config.gem "scrapi"

# models/product.rb
def self.fetch_prices
  scraper = Scraper.define do
    process "div.firstRow div.priceAvail>div>div.PriceCompare>div.BodyS", :price => :text
    result :price
  end
  
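  # Only look up products that do not have a price yet.
  find_all_by_price(nil).each do |product|
    uri = URI.parse("http://www.walmart.com/search/search-ng.do?search_constraint=0&ic=48_0&search_query=" + CGI.escape(product.name) + "&Find.x=0&Find.y=0&Find=Find")
    # The scrape can come back nil when the selector matches nothing, so guard before extracting the number.
    price_text = scraper.scrape(uri)
    product.update_attribute :price, price_text[/[.0-9]+/] if price_text
  end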
end
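
The `[/[.0-9]+/]` step in `fetch_prices` keeps only the first run of digits and dots from the scraped text. Below is a minimal sketch of that step on its own, assuming the scraped value may be nil when nothing matches; `extract_price` and the sample strings are hypothetical and not part of the episode's code.

```ruby
# Hypothetical helper (not from the episode): isolates the regex step
# used in fetch_prices and tolerates a nil scrape result.
def extract_price(text)
  return nil if text.nil?
  # String#[] with a regex returns the first match, or nil if there is none.
  text[/[.0-9]+/]
end

puts extract_price('$24.99')
p extract_price('out of stock')
p extract_price(nil)
```

Note that the pattern stops at the first character that is not a digit or a dot, so a value written with a thousands separator, such as 1,299.00, would be cut off at the comma.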

# scrapitest.rb
require 'rubygems'
require 'scrapi'

scraper = Scraper.define do
  array :items
  process "div.item", :items => Scraper.define {
    process "a.prodLink", :title => :text, :link => "@href"
    process "div.priceAvail>div>div.PriceCompare>div.BodyS", :price => :text
    result :price, :title, :link
  }
  result :items
end

uri = URI.parse("http://www.walmart.com/search/search-ng.do?search_constraint=0&ic=48_0&search_query=lost+third+season&Find.x=0&Find.y=0&Find=Find")
scraper.scrape(uri).each do |product|
  puts product.title
  puts product.price
  puts product.link
  puts
end
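
The test script hardcodes an already-escaped query (`lost+third+season`). A small sketch of building the same URL from a plain-text query with `CGI.escape`, as `fetch_prices` does; `SEARCH_PREFIX`, `SEARCH_SUFFIX`, and `search_uri` are hypothetical names introduced here, and the URL is simply the one from the notes above, with no claim that the site still answers at it.

```ruby
require 'cgi'
require 'uri'

# Hypothetical constants: the two halves of the search URL from the notes above.
SEARCH_PREFIX = "http://www.walmart.com/search/search-ng.do?search_constraint=0&ic=48_0&search_query="
SEARCH_SUFFIX = "&Find.x=0&Find.y=0&Find=Find"

# Hypothetical helper: escapes the query so spaces and characters such as "&"
# cannot break the query string, then parses the result into a URI.
def search_uri(query)
  URI.parse(SEARCH_PREFIX + CGI.escape(query) + SEARCH_SUFFIX)
end

puts search_uri("lost third season")
```

The returned object can be passed to `scraper.scrape` in place of the hand-written `uri`.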

24 comments

1. elad Aug 03, 2009 at 01:18

You are actually a mind reader !!! Thanks..


2. Blake Aug 03, 2009 at 04:39

Cool! Thanks.


3. Ray Aug 03, 2009 at 05:00

I would have to agree with elad. You must have a crystal ball for rails developers. Thanks for the episode.


4. Tevez Aug 03, 2009 at 05:59

I like this episode, it inspires me a lot, really thanks!


5. Stewie Aug 03, 2009 at 07:02

Hi,

Firstly, thanks! I look forward to each week's episode.

I'm not sure what goes wrong but this is the output of the scrapitest.rb

/usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/reader.rb:216:in `parse_page': Scraper::Reader::HTMLParseError: Unable to load /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/../tidy/libtidy.dylib (Scraper::Reader::HTMLParseError)
from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/base.rb:865:in `document'
from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/base.rb:749:in `scrape'
from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/base.rb:347:in `scrape'
from scrapitest.rb:10

gem list scrapi gives: 1.2.0

I will try to fix it and post my solution here.


6. Stewie Aug 03, 2009 at 07:26

Back again,

It's a 64-bit problem.
This guy has a quick and dirty fix for it. I did not use it; I will wait until the gem is improved so that it no longer includes tidy/tidylib.dll and tidy/tidylib.so, as they are hopefully in the middle of removing tidy/.

http://anti.teamidiot.de/nusse/2009/05/scrapi_libtidyso_fail/

Regardless, it's quite a nice gem and another good episode.


7. Roland Aug 03, 2009 at 09:58

For me, Nokogiri (http://github.com/tenderlove/nokogiri/tree/master) does the job pretty well.


8. Henning Aug 03, 2009 at 10:55

If you do not want to replace FireBug with FireQuark, you can use the http://www.selectorgadget.com/ bookmarklet to interactively build a unique CSS selector for any element on a page. This also works in Safari.


9. RORgasm Aug 03, 2009 at 11:03

Hey Ryan, I would actually suggest taking a look at Hpricot... I've done a few applications that required quite a bit of scraping (legal of course :) ) and found Hpricot to be a stable, good solution. The Hpricot API also uses familiar CSS selectors for convenience... Unless I'm missing something, is there anything that ScrAPI offers that Hpricot doesn't?


10. zhon Aug 03, 2009 at 13:27

Thanks for another great 'cast. I have been scraping with mechanize/nokogiri and like it (except that installing is painful). I used (and still occasionally use) watir to scrape. As always, it is good to see a new tool.

I would love to see a 'cast where you navigate and scrape a site that includes Javascript.


11. elad Aug 03, 2009 at 13:29

@Henning, thanks for the selectorgadget link, just what I needed, because somehow FireQuark doesn't work on the latest Firefox, version 3.5.1.


12. Garrett Aug 03, 2009 at 14:02

Hey Ryan,

This episode seems to freeze both audio and video around 3:12. Just thought you might want to know!


13. Garrett Aug 03, 2009 at 14:06

Clarification:

It seems to work fine on site, but it wasn't working when I tried to download from the RSS feed.


14. plotti Aug 03, 2009 at 14:40

Excellent screencast, scraping with ruby and scrAPI seems like so much fun. Can't wait to try it out tomorrow! Big thanks!


15. _fa Aug 03, 2009 at 15:55

Great screencast.

I get a problem though running Product.fetch_prices

"
You have a nil object when you didn't expect it!
You might have expected an instance of ActiveRecord::Base.
The error occurred while evaluating nil.[]
"

Any clues?


16. chetan conikee Aug 03, 2009 at 17:03

Bates, you listened .... :)

One more request, hope you could have another installment with nokogiri and mechanize

Thanks
Chetan


17. Thomas Evan Lecklider Aug 03, 2009 at 18:04

I'm with @RORgasm on Hpricot. It uses CSS or Xpath selectors and has great block handling for multiple elements. Behaves similarly to jQuery on the traversal end.

As always, thanks for the great screencast!


18. Nakul (quarkruby) Aug 04, 2009 at 03:30

@elad A new version of firequark (compatible with ff3.5) is here: http://www.quarkruby.com/assets/2009/8/4/firequark-3.5.2.xpi


19. elad Aug 04, 2009 at 14:42

@Nakul thanks!!!


20. Brett Aug 04, 2009 at 16:58

I have the latest version of scrapi installed, however for some reason when I try running the scrapitest.rb code, I receive the following error:

NameError: uninitialized constant Scraper

at top level in scrapitest1.rb at line 5
copy output
Program exited with code #1 after 0.14 seconds.

Any ideas?


21. chimere Aug 05, 2009 at 10:19

I've been playing with scRUBYt and FireWatir lately, they've given me much joy. I'll be looking forward to your screencast on scRUBYt when you do get it to run. Salute!


22. Ludger Aug 07, 2009 at 08:20

Yes, how does hpricot compare to ScrAPI? How about their speeds in comparison?


23. Ludger Aug 09, 2009 at 13:16

And of course: THANK YOU very much for these ultra high quality screencasts. I am so glad that I have this very convenient source of know how.

One question I have:
So far I am not very comfortable with the concept of the Ruby symbols. Most of the time I know how to modify existing code, but so far I was not able to find a text explaining the concept of Ruby symbols sufficiently.

...

This text that I just found helps somewhat: http://glu.ttono.us/articles/2005/08/19/understanding-ruby-symbols
and comments 1 and 12 on that page indicate that there are special Rails aspects of Ruby symbols, but the article is not intended to cover Rails.


24. Serdar Soydemir Aug 16, 2009 at 15:23

If you try it on Windows and get an error related to "libtidy.so", just delete the libtidy.so file in folder "ruby/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/tidy". This will force scrapi to use "libtidy.dll" in the same folder...


25. ning Aug 23, 2009 at 03:59

@Ryan, what is the advantage of using scrAPI? Why not just use Hpricot?


26. mark mcdonald Aug 29, 2009 at 14:57

if you're getting this error:

./scrapi.rb:5: uninitialized constant Scraper (NameError)
from /opt/ruby-1.8.7-p72/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `gem_original_require'

then add the line gem 'scrapi' before the require:

require 'rubygems'
gem 'scrapi'
require 'scrapi'

-mark


27. David McNally Sep 08, 2009 at 03:42

Hi Ryan, love the screencasts, don't love the spam getting through your filters in the comments.

Hopefully you can improve this and share how you did it.

Thanks


28. Cezar Sep 17, 2009 at 11:27

Interesting topic, even though I am not really impressed by scrapi. I hope to see some alternatives (maybe hpricot).

Thanks for another great screencast!


29. Kevin Sep 26, 2009 at 22:40

Is it possible to scrape password protected pages, or will ScRUBYt! be required?

Thanks.


30. Eric Nov 04, 2009 at 00:09

In my case it works all fine.

And pls kill this spam :-)

Eric


31. Arsyuta Nov 06, 2009 at 06:32

Not bad, the API is very interesting for the internet.


32. Rafael Barbolo Nov 21, 2009 at 02:49

For those having problems with libtidy.dylib:

http://exceptionz.wordpress.com/2009/11/03/scrapi-on-snow-leopard/


33. eatmydust Dec 03, 2009 at 01:44

Scrapi works well, but there is a problem with UTF-8 characters, e.g. German "Umlaute" like ö, ä, ü.
Scrapi messes them up.
In the scrape cheat sheet there is a hint that one can call:
myscraper=scraper.scrape(uri, :parse_options)
where :parse_options should have something to do with tidy, i.e. scrapi should be able to deal with utf-8 characters.
Has anybody done this ?
I don't see how to use :parse_options.
Please post an example of working code which uses those :parse_options! Thanks.


34. Gerson Seifert Dec 04, 2009 at 12:57

use tidy_ffi, works like a charm


37. Phil Aug 13, 2010 at 02:03

I can't get scrapi running under Snow Leopard, as it seems that not only do you need a new libtidy.dylib, you also need a new .so, which I can't seem to find anywhere!

I am not sure why scrapi requires all these binaries and doesn't just use what is installed.

