#190
Nov 30, 2009

Screen Scraping with Nokogiri

Screen scraping is easy with Nokogiri and SelectorGadget.
Tags: tools

Resources

sudo gem install nokogiri -- --with-xml2-include=/usr/local/include/libxml2 --with-xml2-lib=/usr/local/lib

# nokogiri_test.rb
require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = "http://www.walmart.com/search/search-ng.do?search_constraint=0&ic=48_0&search_query=batman&Find.x=0&Find.y=0&Find=Find"
doc = Nokogiri::HTML(open(url))
puts doc.at_css("title").text
doc.css(".item").each do |item|
  title = item.at_css(".prodLink").text
  price = item.at_css(".PriceCompare .BodyS, .PriceXLBold").text[/\$[0-9\.]+/]
  puts "#{title} - #{price}"
  puts item.at_css(".prodLink")[:href]
end
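
The price pattern in the script above keeps the dollar sign and returns a string, or nil when nothing matches. As a minimal sketch, assuming you would rather store a number, the extraction could be pulled into a small helper. `extract_price` is a hypothetical name, not something from the episode, and the comma handling is an assumption about how prices might be formatted on the page.

```ruby
require 'bigdecimal'

# Hypothetical helper (not from the episode): find the first dollar amount
# in a chunk of scraped text and return it as a BigDecimal, or nil if the
# text is nil or contains no price.
def extract_price(text)
  return nil if text.nil?
  # Capture the digits after "$"; allow thousands separators and decimals.
  amount = text[/\$(\d[\d,]*(?:\.\d+)?)/, 1]
  amount && BigDecimal(amount.delete(","))
end

puts extract_price("Our price: $14.96").to_s("F")  # prints 14.96
p extract_price("Out of stock")                     # prints nil
```

Returning nil instead of raising lets the caller decide what to do with items that have no visible price.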

# lib/tasks/product_prices.rake
desc "Fetch product prices"
task :fetch_prices => :environment do
  require 'cgi'       # CGI.escape, used to encode the product name
  require 'nokogiri'
  require 'open-uri'

  # Look up a price for every product that does not have one yet.
  Product.find_all_by_price(nil).each do |product|
    url = "http://www.walmart.com/search/search-ng.do?search_constraint=0&ic=48_0&search_query=#{CGI.escape(product.name)}&Find.x=0&Find.y=0&Find=Find"
    doc = Nokogiri::HTML(open(url))
    # Take the first price on the results page, digits and decimal point only.
    price = doc.at_css(".PriceCompare .BodyS, .PriceXLBold").text[/[0-9\.]+/]
    product.update_attribute(:price, price)
  end
end
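
Interpolating the escaped name into a long URL string works, but the query can also be assembled from a hash. The following is a sketch using `URI.encode_www_form` from Ruby's standard library. `search_url` and `SEARCH_BASE` are hypothetical names introduced for illustration, and the parameters are simply copied from the URL above.

```ruby
require 'uri'

# Hypothetical constant and helper, not part of the episode's code.
SEARCH_BASE = "http://www.walmart.com/search/search-ng.do"

# Build the search URL for a product name. encode_www_form escapes each
# value the same way CGI.escape does (spaces become "+", "&" becomes "%26").
def search_url(name)
  query = URI.encode_www_form(
    "search_constraint" => 0,
    "ic"                => "48_0",
    "search_query"      => name,
    "Find.x"            => 0,
    "Find.y"            => 0,
    "Find"              => "Find"
  )
  "#{SEARCH_BASE}?#{query}"
end

puts search_url("batman & robin")
```

Keeping the parameters in a hash makes it easier to see which part of the URL varies per product.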

37 comments

1. Sergio Burgueño Nov 30, 2009 at 00:10

Great!, thanks a lot!


2. Steve Nov 30, 2009 at 00:29

Love the idea of scraping websites. Can't wait till next week!


3. Nils Riedemann Nov 30, 2009 at 00:47

After some problems setting up nokogiri it is really awesome. XML parsing and screen scraping as simple as possible. And fast!


4. igor Nov 30, 2009 at 01:24

Great, another library for doing this.
I really like this stuff, thanks!


5. Thibaut Barrère Nov 30, 2009 at 01:37

@Jamie - sure, it helps to ask for permission. That's what I do at least :)

@Ryan - thanks for another great episode!


6. David Nov 30, 2009 at 03:40

Good stuff, thanks Ryan!


7. Ivan Nov 30, 2009 at 04:36

Hi, Ryan! Thank you for one more great screencast. I'd like to translate your casts into Russian. If you agree, please contact me, because I tried to reach you but got just this:
Delivery to the following recipient failed permanently:

    feedback@railscasts.com

Technical details of permanent failure:
The recipient server did not accept our requests to connect. [mx1.sub4.homie.mail.dreamhost.com.railscasts.com. (0): Destination address required]
[mx2.sub4.homie.mail.dreamhost.com.railscasts.com. (0): Destination address required]


8. Espen Antonsen Nov 30, 2009 at 05:26

Anyone done a performance comparison of Nokogiri vs Hpricot?


9. Eric Nov 30, 2009 at 06:51

Hi Ryan,

Great screencast and I look forward to the next episode featuring Mechanize as most of us will likely need the ability to interact with the website being scraped.

That said, I've opted against Mechanize in favor of Celerity given Mechanize's lack of support for Javascript in today's jQuery/Prototype, etc world.

Sure there are generally workarounds to bypass the Javascript (fun) and Watir (though I prefer Celerity's faceless browser). Perhaps highlighting this weakness of Mechanize in your screencast will encourage the addition of such support...that and better documentation. :-)


10. rATRIJS Nov 30, 2009 at 07:15

Great episode. This looks cleaner than ScrAPI.

Btw I didn't have to provide libxml path when installing nokogiri. gem install nokogiri worked like a charm. I'm using Windows with cygwin environment.


11. Sam Millar Nov 30, 2009 at 07:35

I don't think I'll be using ScreenScraping any time soon in my apps, however SelectorGadget looks like a great tool which may come in handy for me at some point.

Thanks Ryan.


12. Matt Rust Nov 30, 2009 at 07:54

Would this work for getting football statistics or is there a better way to do that?

Thanks!!


13. Chris K Nov 30, 2009 at 09:14

Ryan, I have posted this before but it may have gotten lost among all the spam. Since spam really disables a fruitful conversation (or even just reading) of this little forum, I think you have to attack the problem seriously. Here is my funny but possibly quite powerful solution:

I have a solution for the spam. Since most of us know at least a bit about Rails (otherwise we wouldn't give a rat's ... about the Railscasts), ask a simple question in addition to a Captcha, for example:

Fill in the blank:

validates_xxxxx_of :firstname, :lastname

or something more funny (and political LOL):

validates_xxxxx_xxx :smartpresidents, :in => "George W. Bush"

(...exclusion of that is LOL)


14. Anlek Nov 30, 2009 at 09:35

Hello Ryan,
Amazing screencast, you always cover the items I am working on next!
Keep up the great work!


15. brookr Nov 30, 2009 at 11:05

Great 'cast! Relevant as ever. SelectorGadget is an awesome tool, thanks a bunch for highlighting that.


16. Benjamin Lewis Nov 30, 2009 at 16:10

Very nice, thank you Ryan.

Good job with the comment spam too.


17. Mislav Dec 01, 2009 at 06:28

Here is the same Walmart scraper rewritten with my nifty "Scraper" class:
http://gist.github.com/246309

http://github.com/mislav/scraper


18. Dobril Bojilov Dec 01, 2009 at 10:50

Thanks Ryan :) Again, great screencast!


19. Patrick Dec 01, 2009 at 11:31

@Espen: Here you go - http://www.rubyinside.com/ruby-xml-performance-benchmarks-1641.html


20. Darryl Dec 01, 2009 at 21:29

Hi Ryan

Thanks for all the great screencasts.

You make it look effortless!

D


21. Robert Dec 02, 2009 at 06:14

Great screencast Ryan! (as always). Btw, since there are more spam comments than actual useful ones, do you mind adding captcha support? We certainly won't mind it :)


22. Memiux Dec 03, 2009 at 10:56

Does reporting spam really work, and is it worth it?

I would like to be a moderator or a spam reporter in chief here on railscasts :)

As someone said before, you're my role model,
thanks.


23. Alexis N. Mueller Dec 03, 2009 at 12:47

This may or may not have been covered in a previous episode, but when I try to run the Ruby on Rails test script in TextMate, I get the error shown below. The script runs using 'ruby nokogiri_test.rb'. I am running Snow Leopard. Thoughts?

Error:
/Applications/TextMate.app/Contents/SharedSupport/Support/lib/io.rb:38:in `exhaust': undefined method `first' for nil:NilClass (NoMethodError)
    from /Applications/TextMate.app/Contents/SharedSupport/Support/lib/tm/process.rb:227:in `run'
    from /Applications/TextMate.app/Contents/SharedSupport/Support/lib/tm/executor.rb:211:in `parse_version'
    from /Applications/TextMate.app/Contents/SharedSupport/Support/lib/tm/executor.rb:98:in `run'
    from /Applications/TextMate.app/Contents/SharedSupport/Bundles/Ruby.tmbundle/Support/RubyMate/run_script.rb:93


24. Alexis N. Mueller Dec 03, 2009 at 14:45

Never mind. I found the answer at http://wiki.macromates.com/Troubleshooting/SnowLeopard

I guess I should always look a bit more...


25. Ronald H. Dec 04, 2009 at 13:29

I have a really important question to ask about RUBY ON RAILS. I'm a newbie who has been doing research on ROR for about 3 months or so, and my question is: how do you style an application? I've seen a number of screencasts but not one seems to address this issue, which is a shame because it's something every programmer will have to do. I would like it if you could cover this topic in a screencast to help newbies like myself understand how to make things look better on a presentation basis. Thanks.


26. James Edward Gray II Dec 05, 2009 at 09:21

Just a slight comment on the regular expression used in the screencast: /[0-9\.]+/. First, a . is not a special character inside a character class ([…]), so you can drop the backslash. Also, there's a shortcut in regular expressions for the character class [0-9], namely \d, which is usable inside or outside of a character class. Thus, the expression can be simplified to /[\d.]+/. Just an FYI.

Thanks for another great episode!
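
A quick way to check the claim above, as a sketch: run both patterns over a few strings and compare what they extract. The sample strings are made up for illustration, and the constant names are hypothetical.

```ruby
# The pattern from the screencast and the simplified form suggested above.
ORIGINAL   = /[0-9\.]+/
SIMPLIFIED = /[\d.]+/

# Invented sample inputs; the third has no digits or dots at all.
SAMPLES = ["$14.96", "Price: 5", "no digits here", "v1.2.3 beta"]

SAMPLES.each do |text|
  # String#[] with a regex returns the first match, or nil if none.
  puts "#{text.inspect}: #{text[ORIGINAL].inspect} / #{text[SIMPLIFIED].inspect}"
end
```

Both columns should agree for every sample, since in Ruby \d matches the same characters as 0-9.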


27. mrbrdo Dec 07, 2009 at 06:47

Sometimes you need to scrape AJAX-heavy sites and scraping using traditional methods is not an option.

I would like to mention HtmlUnit here. It's a Java tool for website testing, and it implements a GUI-less browser with pretty good Javascript support. If anyone runs into a problem where they need to scrape an AJAX-heavy site and they can't manage with approaches like those mentioned in this railscast, I would recommend they take a look at HtmlUnit. The way I use it is with crontab once a day: I fetch IDs/URLs (which change often), write them to a file or DB, and use nokogiri to actually scrape the data.
I must note, though, that HtmlUnit isn't really fast, so avoid it when you can.


28. Henrik Hodne Dec 07, 2009 at 10:52

You can use \d instead of 0-9 to match digits in regular expressions. \D matches non-digit characters.
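
As a small illustration of the two shorthands (the sample strings and the method names are invented for this sketch):

```ruby
# \D matches any non-digit, so removing every \D leaves only the digits.
def digits_only(text)
  text.gsub(/\D/, "")
end

# \d+ matches the first unbroken run of digits.
def first_digits(text)
  text[/\d+/]
end

puts digits_only("1,299 reviews").to_i  # prints 1299
puts first_digits("$14.96")             # prints 14
```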


29. Mike J Dec 07, 2009 at 11:43

Running the above example I get the following error:
undefined method `text' for nil:NilClass (NoMethodError)

If I just do the following:
'puts doc' then I get the text below, which does not include the title and only seems to display the commented-out code in the source HTML

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<!--[if lt IE 7]>
<link href="http://i2.walmartimages.com/css/global_ie6.css" rel="stylesheet" type="text/css">
<![endif]--><!--[if IE 7]>
<link href="http://i2.walmartimages.com/css/global_ie7.css" rel="stylesheet" type="text/css">
<![endif]--><!--[if lt IE 7]>
<link href="http://i2.walmartimages.com/css/pagination_ie6.css" rel="stylesheet" type="text/css">
<![endif]--><!--[if IE 7]>
<link href="http://i2.walmartimages.com/css/pagination_ie.css" rel="stylesheet" type="text/css">
<![endif]--><!-- start /include/static/kill_frames.jsp --><!-- end /include/static/kill_frames.jsp --><!--[if lt IE 7]>
<iframe id="overlay" src="/overlay/overlay_iframe_default_src.jsp?bv_enabled=false" name="overlay" frameborder="0" scrolling="no"></iframe>
<![endif]--><!--[if IE 7]>
<iframe id="overlay" src="/overlay/overlay_iframe_default_src.jsp?bv_enabled=false" name="overlay" frameborder="0" scrolling="no" allowTransparency="yes"></iframe>
<![endif]--><!-- Start: Module G0040: Primary Navigation --><!-- Site Header start --><!--[if lt IE 7]>
<iframe id="dropmenuiframe" src="/blank.html" style="z-index:20;display:none;position:absolute"></iframe>
<![endif]--><!--[if IE 7]>
...


30. Jonsey Dec 08, 2009 at 16:27

Excellent, thank you, only four hours ago I came up with an idea that needed exactly this.


31. TJ Koblentz Dec 13, 2009 at 00:28

Thanks, Ryan! Awesome screencast as usual.


34. RandiR May 11, 2010 at 09:17

Good tool to have.

There are also some open source sample scripts at

http://www.biterscripting.com/samples_internet.html

I use them often.


35. SErkan May 27, 2010 at 16:04

i had to write here


36. Feedback May 27, 2010 at 16:04

i had to write here sites.


39. Martin Jun 09, 2010 at 21:44

I tried it on Rails 3 beta4 according to your example, but it fails when run as a rake task and passes when run with "ruby test.rb". The same rake task passes on Rails 2.3.5.
Ruby version is 1.8.7. OS is Ubuntu 9.04.

May I seek your help to solve it?


47. Gary Jun 27, 2010 at 19:15

Three blonde women were stranded on an island. While trying to dig their way out, one of them came across a buried lamp. Suddenly a genie appears and offers to grant each one of them one wish, in return for saving him.

