#173 Screen Scraping with ScrAPI

Aug 03, 2009 | 15 minutes | Plugins

Screen scraping is not pretty, but sometimes it's your only option to extract content from an external site. In this episode I show you how to fetch product prices using ScrAPI.



elad almost 16 years ago

You are actually a mind reader !!! Thanks..

Blake almost 16 years ago

Cool! Thanks.

Ray almost 16 years ago

I would have to agree with elad. You must have a crystal ball for rails developers. Thanks for the episode.

Tevez almost 16 years ago

I like this episode, it inspires me a lot, really thanks!

Stewie almost 16 years ago

Hi,

Firstly, thanks! I look forward to each week's episode.

I'm not sure what goes wrong but this is the output of the scrapitest.rb

/usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/reader.rb:216:in `parse_page': Scraper::Reader::HTMLParseError: Unable to load /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/../tidy/libtidy.dylib (Scraper::Reader::HTMLParseError)
from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/base.rb:865:in `document'
from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/base.rb:749:in `scrape'
from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/base.rb:347:in `scrape'
from scrapitest.rb:10

gem list scrapi gives: 1.2.0

I will try to fix it and post my solution here.

Stewie almost 16 years ago

Back again,

It's a 64-bit problem.
This guy has a quick and dirty fix for it, which I did not use. I will wait until the gem is improved to no longer bundle tidy/libtidy.dll and tidy/libtidy.so, as they are hopefully in the middle of removing tidy/.

http://anti.teamidiot.de/nusse/2009/05/scrapi_libtidyso_fail/

Regardless, it's quite a nice gem and another good episode.

Roland almost 16 years ago

For me, Nokogiri (http://github.com/tenderlove/nokogiri/tree/master) does the job pretty well.

Henning almost 16 years ago

If you do not want to replace FireBug with FireQuark, you can use the http://www.selectorgadget.com/ bookmarklet to interactively build a unique CSS selector for any element on a page. This also works in Safari.

RORgasm almost 16 years ago

Hey Ryan, I would actually suggest taking a look at Hpricot... I've done a few applications that required quite a bit of scraping (legal of course :) ) and found Hpricot to be a stable, good solution. The Hpricot API also uses familiar CSS selectors for convenience... Unless I'm missing something, is there anything that ScrAPI offers that Hpricot doesn't?

zhon almost 16 years ago

Thanks for another great 'cast. I have been scraping with mechanize/nokogiri and like it (except that installing is painful). I used (and still occasionally use) watir to scrape. As always, it is good to see a new tool.

I would love to see a 'cast where you navigate and scrape a site that includes Javascript.

elad almost 16 years ago

@Henning, thanks for the selectorgadget link, just what I needed, because somehow FireQuark doesn't work on the latest Firefox, version 3.5.1.

Garrett almost 16 years ago

Hey Ryan,

This episode seems to freeze both audio and video around 3:12. Just thought you might want to know!

Garrett almost 16 years ago

Clarification:

It seems to work fine on site, but it wasn't working when I tried to download from the RSS feed.

plotti almost 16 years ago

Excellent screencast, scraping with Ruby and scrAPI seems like so much fun. Can't wait to try it out tomorrow! Big thanks!

_fa almost 16 years ago

Great screencast.

I get a problem, though, when running Product.fetch_prices:

"
You have a nil object when you didn't expect it!
You might have expected an instance of ActiveRecord::Base.
The error occurred while evaluating nil.[]
"

Any clues?

Thomas Evan Lecklider almost 16 years ago

I'm with @RORgasm on Hpricot. It uses CSS or XPath selectors and has great block handling for multiple elements. It behaves similarly to jQuery on the traversal end.

As always, thanks for the great screencast!

Nakul (quarkruby) almost 16 years ago

@elad A new version of firequark (compatible with ff3.5) is here: http://www.quarkruby.com/assets/2009/8/4/firequark-3.5.2.xpi

elad almost 16 years ago

@Nakul thanks!!!

Brett almost 16 years ago

I have the latest version of scrapi installed, however for some reason when I try running the scrapitest.rb code, I receive the following error:

NameError: uninitialized constant Scraper

at top level in scrapitest1.rb at line 5
Program exited with code #1 after 0.14 seconds.

Any ideas?

chimere almost 16 years ago

I've been playing with scRUBYt and FireWatir lately, they've given me much joy. I'll be looking forward to your screencast on scRUBYt when you do get it to run. Salute!

Ludger almost 16 years ago

Yes, how does hpricot compare to ScrAPI? How about their speeds in comparison?

Ludger almost 16 years ago

And of course: THANK YOU very much for these ultra high quality screencasts. I am so glad that I have this very convenient source of know-how.

One question I have:
So far I am not very comfortable with the concept of Ruby symbols. Most of the time I know how to modify existing code, but I have not yet been able to find a text that explains the concept of Ruby symbols sufficiently.

...

This text that I just found helps somewhat: http://glu.ttono.us/articles/2005/08/19/understanding-ruby-symbols
Comments 1 and 12 on that page indicate that there are special Rails aspects of Ruby symbols, but the article is not intended to cover Rails.
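
For anyone with the same question, here is a minimal sketch of the core idea in plain Ruby: a symbol is an immutable name, and every use of the same symbol refers to one and the same object, which is why symbols are commonly used as hash keys and option names. The `describe` method and its `:name` / `:price` options are hypothetical, made up only for this illustration.

```ruby
# Two strings with the same text are equal, but they are separate objects.
a = String.new("price")
b = String.new("price")
puts a == b         # true  (same characters)
puts a.equal?(b)    # false (different objects)

# A symbol with a given name is always the very same object.
puts :price.equal?(:price)   # true

# Symbols and strings convert into each other.
puts :price.to_s             # price
puts "price".to_sym.inspect  # :price

# Typical use: symbols as hash keys naming options.
# `describe` is a hypothetical helper, not part of any library.
def describe(options)
  "#{options[:name]} costs #{options[:price]}"
end

puts describe(:name => "Widget", :price => "9.99")  # Widget costs 9.99
```

The `:name => "Widget"` form is the same pattern seen in Rails calls and in scraper definitions: the symbol is only a label, and the value on the right is the data.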

Serdar Soydemir almost 16 years ago

If you try it on Windows and get an error related to "libtidy.so", just delete the libtidy.so file in folder "ruby/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/tidy". This will force scrapi to use "libtidy.dll" in the same folder...

ning almost 16 years ago

@Ryan What is the advantage of using scrAPI? Why not just use Hpricot?

mark mcdonald over 15 years ago

if you're getting this error:

./scrapi.rb:5: uninitialized constant Scraper (NameError)
from /opt/ruby-1.8.7-p72/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `gem_original_require'

then add gem 'scrapi' before requiring scrapi:

require 'rubygems'
gem 'scrapi'
require 'scrapi'

-mark

David McNally over 15 years ago

Hi Ryan, love the screencasts, don't love the spam getting through your filters in the comments.

Hopefully you can improve this and share how you did it.

Thanks

Cezar over 15 years ago

Interesting topic, even though I am not really impressed by scrapi. I hope to see some alternatives (maybe hpricot).

Thanks for another great screencast!

Kevin over 15 years ago

Is it possible to scrape password protected pages, or will ScRUBYt! be required?

Thanks.
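
One library-independent way to approach this is to fetch the protected page yourself and hand the resulting HTML to whichever scraper you prefer. The sketch below assumes the page uses HTTP Basic authentication (a form-based login would need a different approach), and the URL, user name, and password are placeholders. It uses only Ruby's standard Net::HTTP, and `build_authenticated_request` is a hypothetical helper name.

```ruby
require 'net/http'
require 'uri'

# Build a GET request carrying HTTP Basic credentials.
# Hypothetical helper: the name and arguments are for illustration only.
def build_authenticated_request(url, user, password)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  request.basic_auth(user, password)  # sets the Authorization header
  request
end

# Placeholder URL and credentials.
url = "http://example.com/protected/prices"
request = build_authenticated_request(url, "username", "secret")
puts request['Authorization'].start_with?("Basic ")  # true

# Sending is left commented out because the URL above is a placeholder.
# uri = URI.parse(url)
# response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
# html = response.body  # pass this string on to the scraping step
```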

Eric over 15 years ago

In my case it works all fine.

And pls kill this spam :-)

Eric

Arsyuta over 15 years ago

Not bad, the API is very interesting for the internet.

Rafael Barbolo over 15 years ago

For those having problems with libtidy.dylib:

http://exceptionz.wordpress.com/2009/11/03/scrapi-on-snow-leopard/

eatmydust over 15 years ago

Scrapi works well, but there is a problem with UTF-8 characters, e.g. German "Umlaute" like ö, ä, ü.
Scrapi garbles them.
In the scrape cheat sheet there is a hint that one can call:
myscraper=scraper.scrape(uri, :parse_options)
where :parse_options should have something to do with tidy, i.e. scrapi should be able to deal with utf-8 characters.
Has anybody done this ?
I don't see how to use :parse_options.
Please post an example of working code which uses those :parse_options! Thanks.
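
As a way to diagnose this kind of garbling, the sketch below shows what umlauts look like when UTF-8 bytes are decoded as ISO-8859-1, and how to reverse that. That this is what happens inside scrAPI or Tidy is an assumption, not something confirmed in this thread, and the example does not use `:parse_options` at all. It relies on the String encoding methods of Ruby 1.9 and later, and both helper names are hypothetical.

```ruby
# Simulate the damage: label UTF-8 bytes as ISO-8859-1, then transcode
# to UTF-8, as a pipeline with the wrong input encoding would.
def misread_as_latin1(utf8_text)
  utf8_text.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# Reverse it: transcode back to ISO-8859-1 to recover the original
# bytes, then label those bytes as the UTF-8 they always were.
def repair_mojibake(garbled)
  garbled.encode("ISO-8859-1").force_encoding("UTF-8")
end

original = "\u00F6\u00E4\u00FC"   # ö ä ü
garbled  = misread_as_latin1(original)

puts garbled                               # Ã¶Ã¤Ã¼
puts repair_mojibake(garbled) == original  # true
```

If scraped text shows the "Ã¶" pattern, the input encoding is being guessed wrongly somewhere, and the lasting fix is to tell the parser the page is UTF-8 rather than to repair the strings afterwards.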

Gerson Seifert over 15 years ago

use tidy_ffi, works like a charm

Phil almost 15 years ago

I can't get scrapi running under Snow Leopard, as it seems that not only do you need a new libtidy.dylib, you also need a new .so, which I can't seem to find anywhere!

I am not sure why scrapi requires all these binaries and doesn't just use what is installed.

Unihost Brasil Servidores de Hospedagem over 14 years ago

Excellent screencast, thanks!

Eduardo M. - Internal Development
Unihost Brasil

ecoologic over 13 years ago

very interesting once again!

Barnett Klane over 11 years ago

Would love to see an updated version now that scrAPI is no longer maintained.

SempiHost over 10 years ago

Thank you for another cast.
Have a nice day
Website hosting