#173 Screen Scraping with ScrAPI
Aug 03, 2009 | 15 minutes | Plugins
Screen scraping is not pretty, but sometimes it's your only option to extract content from an external site. In this episode I show you how to fetch product prices using ScrAPI.
- Download:
- source code: Project Files in Zip (97.4 KB)
- mp4: Full Size H.264 Video (27 MB)
- m4v: Smaller H.264 Video (17.2 MB)
- webm: Full Size VP8 Video (44.7 MB)
- ogv: Full Size Theora Video (36.8 MB)
You are actually a mind reader! Thanks.
Cool! Thanks.
I would have to agree with elad. You must have a crystal ball for rails developers. Thanks for the episode.
I like this episode, it inspires me a lot, really thanks!
Hi,
Firstly, thanks! I look forward to each week's episode.
I'm not sure what goes wrong, but this is the output of scrapitest.rb:
/usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/reader.rb:216:in `parse_page': Scraper::Reader::HTMLParseError: Unable to load /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/../tidy/libtidy.dylib (Scraper::Reader::HTMLParseError)
from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/base.rb:865:in `document'
from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/base.rb:749:in `scrape'
from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/base.rb:347:in `scrape'
from scrapitest.rb:10
gem list scrapi gives: 1.2.0
I will try to fix it and post my solution here.
Back again,
It's a 64-bit problem.
This guy has a quick and dirty fix for it. I did not use it. I will wait until the gem is improved to no longer bundle tidy/libtidy.dll and tidy/libtidy.so, as they are hopefully in the middle of removing tidy/.
http://anti.teamidiot.de/nusse/2009/05/scrapi_libtidyso_fail/
Regardless, it's quite a nice gem and another good episode.
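To confirm whether the interpreter loading the bundled libtidy is itself 64-bit, a quick check like the one below may help. It is only a diagnostic sketch: `interpreter_bits` is a hypothetical helper name, and it says nothing about which architecture the bundled library was built for.

```ruby
require 'rbconfig'

# Hypothetical helper: pointer width of the running interpreter, in bits.
# Packing a (null) pointer yields a string exactly as wide as a native pointer.
def interpreter_bits
  [nil].pack('p').bytesize * 8
end

puts "Ruby #{RUBY_VERSION} on #{RbConfig::CONFIG['host_cpu']} (#{interpreter_bits}-bit)"
```

If this reports 64-bit while the shipped library is a 32-bit build, the load failure in the trace above would be expected.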
For me, Nokogiri (http://github.com/tenderlove/nokogiri/tree/master) does the job pretty well.
If you do not want to replace FireBug with FireQuark, you can use the http://www.selectorgadget.com/ bookmarklet to interactively build a unique CSS selector for any element on a page. This also works in Safari.
Hey Ryan, I would actually suggest taking a look at Hpricot. I've done a few applications that required quite a bit of scraping (legal, of course :) ) and found Hpricot to be a stable, good solution. The Hpricot API also uses familiar CSS selectors for convenience. Unless I'm missing something, is there anything that ScrAPI offers that Hpricot doesn't?
Thanks for another great 'cast. I have been scraping with mechanize/nokogiri and like it (except that installing is painful). I used (and still occasionally use) Watir to scrape. As always, it is good to see a new tool.
I would love to see a 'cast where you navigate and scrape a site that includes JavaScript.
@Henning, thanks for the selectorgadget link, just what I needed, because somehow FireQuark doesn't work on the latest Firefox, version 3.5.1.
Hey Ryan,
This episode seems to freeze both audio and video around 3:12. Just thought you might want to know!
Clarification:
It seems to work fine on the site, but it wasn't working when I tried to download from the RSS feed.
Excellent screencast, scraping with Ruby and scrAPI seems like so much fun. Can't wait to try it out tomorrow! Big thanks!
Great screencast.
I get a problem, though, when running Product.fetch_prices:
"
You have a nil object when you didn't expect it!
You might have expected an instance of ActiveRecord::Base.
The error occurred while evaluating nil.[]
"
Any clues?
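One possible cause, assuming the price is pulled out of the scraped text by indexing it with a regular expression: when the selector matches nothing, the scraper result is nil, and indexing nil raises exactly this error. A minimal nil-safe sketch follows; `extract_price` is a hypothetical helper, not part of scrAPI or the episode's code.

```ruby
# Hypothetical helper: pull the first decimal number out of scraped text.
# Returns nil instead of raising when the scraper found nothing.
def extract_price(text)
  return nil if text.nil?
  text[/\d+(?:\.\d+)?/]
end

p extract_price('$19.99')  # => "19.99"
p extract_price(nil)       # => nil
```

With a guard like this, a product whose page layout does not match the selector is skipped (its price stays nil) rather than aborting the whole run.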
I'm with @RORgasm on Hpricot. It uses CSS or Xpath selectors and has great block handling for multiple elements. Behaves similarly to jQuery on the traversal end.
As always, thanks for the great screencast!
@elad A new version of FireQuark (compatible with FF 3.5) is here: http://www.quarkruby.com/assets/2009/8/4/firequark-3.5.2.xpi
@Nakul thanks!!!
I have the latest version of scrapi installed, however for some reason when I try running the scrapitest.rb code, I receive the following error:
NameError: uninitialized constant Scraper
at top level in scrapitest1.rb at line 5
Program exited with code #1 after 0.14 seconds.
Any ideas?
I've been playing with scRUBYt and FireWatir lately, they've given me much joy. I'll be looking forward to your screencast on scRUBYt when you do get it to run. Salute!
Yes, how does hpricot compare to ScrAPI? How about their speeds in comparison?
And of course: THANK YOU very much for these ultra high quality screencasts. I am so glad that I have this very convenient source of know how.
One question I have:
So far I am not very comfortable with the concept of Ruby symbols. Most of the time I know how to modify existing code, but I have not yet been able to find a text that explains the concept of Ruby symbols sufficiently.
...
This text that I just found helps somewhat: http://glu.ttono.us/articles/2005/08/19/understanding-ruby-symbols
Comments 1 and 12 on that page indicate that there are special Rails aspects of Ruby symbols, but the article is not intended to cover Rails.
If you try it on Windows and get an error related to "libtidy.so", just delete the libtidy.so file in folder "ruby/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/tidy". This will force scrapi to use "libtidy.dll" in the same folder...
@Ryan What is the advantage of using scrAPI? Why not just use Hpricot?
If you're getting this error:
./scrapi.rb:5: uninitialized constant Scraper (NameError)
from /opt/ruby-1.8.7-p72/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `gem_original_require'
then add the gem 'scrapi' line before the require:
require 'rubygems'
gem 'scrapi'
require 'scrapi'
-mark
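The same activate-then-require order can be wrapped so that a missing gem produces a readable message instead of a NameError further down. This is only a sketch: `require_gem` is a hypothetical helper, and it relies on RubyGems raising a LoadError subclass when a gem cannot be activated.

```ruby
require 'rubygems'

# Hypothetical helper: activate a gem, then require its library.
# Returns true on success and false (with a warning) if either step fails.
def require_gem(name, lib = name)
  gem name      # activate the installed gem so its lib directory is on the load path
  require lib   # then load the library itself
  true
rescue LoadError => e
  warn "Could not load #{name}: #{e.message}"
  false
end

puts "scrapi available: #{require_gem('scrapi')}"
```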
Hi Ryan, love the screencasts, don't love the spam getting through your filters in the comments.
Hopefully you can improve this and share how you did it.
Thanks
Interesting topic, even though I am not really impressed by scrapi. I hope to see some alternatives (maybe Hpricot).
Thanks for another great screencast!
Is it possible to scrape password protected pages, or will ScRUBYt! be required?
Thanks.
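If the protection is plain HTTP Basic authentication (an assumption; a form-based login would need a cookie session instead), Ruby's standard Net::HTTP can fetch the page, and the resulting HTML string can then be handed to whichever parser you use. The helper names and the URL below are hypothetical.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build a GET request carrying HTTP Basic credentials.
# Nothing is sent over the network at this point.
def authenticated_request(url, user, password)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  request.basic_auth(user, password)
  [uri, request]
end

# Hypothetical helper: send the request and return the response body as a string.
def fetch_protected(url, user, password)
  uri, request = authenticated_request(url, user, password)
  Net::HTTP.start(uri.host, uri.port, :use_ssl => uri.scheme == 'https') do |http|
    http.request(request).body
  end
end

# Usage (performs a real request, so it is left commented out):
# html = fetch_protected('http://example.com/prices', 'user', 'secret')
```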
In my case it works all fine.
And please kill this spam :-)
Eric
Not bad, the API is very interesting for the internet.
For those having problems with libtidy.dylib:
http://exceptionz.wordpress.com/2009/11/03/scrapi-on-snow-leopard/
Scrapi works well, but there is a problem with UTF-8 characters, e.g. German "Umlaute" like ö, ä, ü.
Scrapi messes them up.
In the scrape cheat sheet there is a hint that one can call:
myscraper=scraper.scrape(uri, :parse_options)
where :parse_options should have something to do with tidy, i.e. scrapi should be able to deal with utf-8 characters.
Has anybody done this?
I don't see how to use :parse_options.
Please post an example of working code which uses those :parse_options! Thanks.
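This is not an answer to the :parse_options question, but as a stopgap the scraped strings can be transcoded after the fact. The sketch below assumes the page was served as ISO-8859-1 and that the bytes came through untouched; it needs Ruby 2.0 or later, and `latin1_to_utf8` is a hypothetical helper.

```ruby
# Hypothetical helper: reinterpret raw ISO-8859-1 bytes and transcode them to UTF-8.
def latin1_to_utf8(raw)
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# "\xFC" is the single Latin-1 byte for the letter u with umlaut.
puts latin1_to_utf8("M\xFCnchen".b)
```

If the umlauts are still wrong after this, the source encoding is probably something other than ISO-8859-1, or the bytes were already altered during parsing.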
Use tidy_ffi; it works like a charm.
I can't get scrapi running under Snow Leopard, as it seems that not only do you need a new libtidy.dylib, you also need a new .so, which I can't seem to find anywhere!
I am not sure why scrapi requires all these binaries and doesn't just use what is installed.
Excellent screencast, thanks!
Eduardo M. - Internal Development
Unihost Brasil
very interesting once again!
Would love to see an updated version now that scrAPI is no longer maintained.
Thank you for another cast.
Have a nice day