#190 Screen Scraping with Nokogiri
Nov 30, 2009 | 13 minutes | Tools
Screen scraping is easy with Nokogiri and SelectorGadget.
- Download:
- source code: Project Files in Zip (97.9 KB)
- mp4: Full Size H.264 Video (29.2 MB)
- m4v: Smaller H.264 Video (16.7 MB)
- webm: Full Size VP8 Video (38.5 MB)
- ogv: Full Size Theora Video (36.2 MB)
Great, thanks a lot!
Love the idea of scraping websites. Can't wait till next week!
After some problems setting up nokogiri it is really awesome. XML parsing and screen scraping as simple as possible. And fast!
Great, another library for doing this.
I really like this stuff, thanks!
@Jamie - sure, it helps to ask for permission. That's what I do at least :)
@Ryan - thanks for another great episode!
Good stuff, thanks Ryan!
Hi, Ryan! Thank you for one more great screencast. I'd like to translate your casts into Russian. If you agree, please contact me. I tried to email you, of course, but got just this:
Delivery to the following recipient failed permanently:
feedback@railscasts.com
Technical details of permanent failure:
The recipient server did not accept our requests to connect. [mx1.sub4.homie.mail.dreamhost.com.railscasts.com. (0): Destination address required]
[mx2.sub4.homie.mail.dreamhost.com.railscasts.com. (0): Destination address required]
Anyone done a performance comparison of Nokogiri vs Hpricot?
Hi Ryan,
Great screencast and I look forward to the next episode featuring Mechanize as most of us will likely need the ability to interact with the website being scraped.
That said, I've opted against Mechanize in favor of Celerity given Mechanize's lack of support for Javascript in today's jQuery/Prototype, etc world.
Sure there are generally workarounds to bypass the Javascript (fun) and Watir (though I prefer Celerity's faceless browser). Perhaps highlighting this weakness of Mechanize in your screencast will encourage the addition of such support...that and better documentation. :-)
Great episode. This looks cleaner than ScrAPI.
Btw I didn't have to provide libxml path when installing nokogiri. gem install nokogiri worked like a charm. I'm using Windows with cygwin environment.
I don't think I'll be using ScreenScraping any time soon in my apps, however SelectorGadget looks like a great tool which may come in handy for me at some point.
Thanks Ryan.
Would this work for getting football statistics or is there a better way to do that?
Thanks!!
Ryan, I have posted this before but it may have gotten lost among all the spam. Since spam really disables a fruitful conversation (or even just reading) of this little forum, I think you have to attack the problem seriously. Here is my funny but possibly quite powerful solution:
I have a solution for the spam. Since most of us know at least a bit about Rails (otherwise we wouldn't give a rat's ... about the Railscasts), ask a simple question in addition to a Captcha, for example:
Fill in the blank:
validates_xxxxx_of :firstname, :lastname
or something more funny (and political LOL):
validates_xxxxx_xxx :smartpresidents, :in => "George W. Bush"
(...exclusion of that is LOL)
Hello Ryan,
Amazing screen cast, You always cover the items I am working on next!
Keep up the great work!
Great 'cast! Relevant as ever. SelectorGadget is an awesome tool, thanks a bunch for highlighting that.
Very nice, thank you Ryan.
Good job with the comment spam too.
Here is the same Walmart scraper rewritten with my nifty "Scraper" class:
http://gist.github.com/246309
http://github.com/mislav/scraper
Thanks Ryan :) Again, a great screencast!
@Espen: Here you go - http://www.rubyinside.com/ruby-xml-performance-benchmarks-1641.html
Hi Ryan
Thanks for all the great screencasts.
You make it look effortless!
D
Great screencast Ryan! (As always.) Btw, since there are more spam comments than actual useful ones, do you mind adding captcha support? We certainly won't mind it :)
Does reporting spam really work?
I would like to be a moderator or spam reporter in chief here on Railscasts :)
As someone said before, you're my role model,
thanks.
This may or may not have been covered in a previous episode, but when I try to run the Ruby on Rails test script in TextMate, I get the error shown below. The script runs using 'ruby nokogiri_test.rb'. I am running Snow Leopard. Thoughts?
Error:
/Applications/TextMate.app/Contents/SharedSupport/Support/lib/io.rb:38:in `exhaust': undefined method `first' for nil:NilClass (NoMethodError)
from /Applications/TextMate.app/Contents/SharedSupport/Support/lib/tm/process.rb:227:in `run'
from /Applications/TextMate.app/Contents/SharedSupport/Support/lib/tm/executor.rb:211:in `parse_version'
from /Applications/TextMate.app/Contents/SharedSupport/Support/lib/tm/executor.rb:98:in `run'
from /Applications/TextMate.app/Contents/SharedSupport/Bundles/Ruby.tmbundle/Support/RubyMate/run_script.rb:93
Never mind. I found the answer at http://wiki.macromates.com/Troubleshooting/SnowLeopard
I guess I should always look a bit more...
I have a really important question to ask about Ruby on Rails. I'm a newbie who has been researching Rails for about 3 months, and my question is: how do you style an application? I've seen a number of screencasts, but not one seems to address this issue, which is a shame because it's something every programmer will have to do. I would like it if you could cover this topic in a screencast to help newbies like myself understand how to make things look better on the presentation side. Thanks.
Just a slight comment on the regular expression used in the screencast: /[0-9\.]+/. First, a . is not a special character inside a character class ([…]), so you can drop the backslash. Also, there's a shortcut for the character class [0-9], namely \d, usable inside or outside of a character class. Thus, the expression can be simplified to /[\d.]+/. Just an FYI.
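To illustrate the point, here is a minimal sketch comparing the two patterns. The sample string and variable names are made up for illustration; the actual text scraped in the screencast is not shown in this thread.

```ruby
# Hypothetical sample text standing in for a scraped price string.
price_text = "Price: $29.88 each"

original   = /[0-9\.]+/  # pattern quoted from the screencast
simplified = /[\d.]+/    # same pattern with \d and an unescaped dot

# String#[] with a regexp returns the first match, or nil if there is none.
original_match   = price_text[original]
simplified_match = price_text[simplified]

puts original_match
puts simplified_match
puts original_match == simplified_match
```

Both patterns pick out the same substring here, since a dot inside a character class is literal whether or not it is escaped.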
Thanks for another great episode!
Sometimes you need to scrape AJAX-heavy sites, and scraping using traditional methods is not an option.
I would like to mention HtmlUnit here. It's a Java tool for website testing, and it implements a GUI-less browser with pretty good Javascript support. If anyone runs into a problem where they need to scrape an AJAX-heavy site and they can't manage with approaches like those mentioned in this railscast, I would recommend they take a look at HtmlUnit. The way I use it is with crontab once a day: I fetch IDs/URLs (which change often), write them to a file or DB, and use nokogiri to actually scrape the data.
I must note though, that HtmlUnit isn't really fast, so avoid when you can.
You can use \d instead of 0-9 to match digits in regular expressions. \D matches non-digit characters.
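As a rough sketch of how \d and \D differ in practice (the input string is invented for the example):

```ruby
# Hypothetical input; any string mixing digits and other characters works.
price_text = "$1,299"

# \d+ matches each run of consecutive digits.
digit_runs = price_text.scan(/\d+/)

# \D matches every non-digit, so deleting those leaves only the digits.
digits_only = price_text.gsub(/\D/, "")

puts digit_runs.inspect
puts digits_only
```

Note that stripping with \D also removes a decimal point, so it only suits whole-number values.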
Running the above example I get the following error:
undefined method `text' for nil:NilClass (NoMethodError)
If I just do the following:
'puts doc', then I get the following text, which does not include the title and only seems to display the commented-out code in the source HTML:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<!--[if lt IE 7]>
<link href="http://i2.walmartimages.com/css/global_ie6.css" rel="stylesheet" type="text/css">
<![endif]--><!--[if IE 7]>
<link href="http://i2.walmartimages.com/css/global_ie7.css" rel="stylesheet" type="text/css">
<![endif]--><!--[if lt IE 7]>
<link href="http://i2.walmartimages.com/css/pagination_ie6.css" rel="stylesheet" type="text/css">
<![endif]--><!--[if IE 7]>
<link href="http://i2.walmartimages.com/css/pagination_ie.css" rel="stylesheet" type="text/css">
<![endif]--><!-- start /include/static/kill_frames.jsp --><!-- end /include/static/kill_frames.jsp --><!--[if lt IE 7]>
<iframe id="overlay" src="/overlay/overlay_iframe_default_src.jsp?bv_enabled=false" name="overlay" frameborder="0" scrolling="no"></iframe>
<![endif]--><!--[if IE 7]>
<iframe id="overlay" src="/overlay/overlay_iframe_default_src.jsp?bv_enabled=false" name="overlay" frameborder="0" scrolling="no" allowTransparency="yes"></iframe>
<![endif]--><!-- Start: Module G0040: Primary Navigation --><!-- Site Header start --><!--[if lt IE 7]>
<iframe id="dropmenuiframe" src="/blank.html" style="z-index:20;display:none;position:absolute"></iframe>
<![endif]--><!--[if IE 7]>
...
Did you solve it? I have the same problem: when there isn't any data, it gives me nil.
Excellent, thank you, only four hours ago I came up with an idea that needed exactly this.
Thanks, Ryan! Awesome screencast as usual.
Good tool to have.
There are also some open source sample scripts at
http://www.biterscripting.com/samples_internet.html
I use them often.
I tried it on rails 4 beta according to your example, but it fails when run as a rake task and passes when run with "ruby test.rb". The same rake task passes on Rails 2.3.5.
Ruby version is 1.8.7. OS is Ubuntu 9.04.
May I ask for your help in solving it?
I tried it on rails 3 beta4 according to your example, but it fails when run as a rake task and passes when run with "ruby test.rb". The same rake task passes on Rails 2.3.5.
Ruby version is 1.8.7. OS is Ubuntu 9.04.
May I ask for your help in solving it?
In the bigger movie (the one that is 43MB), at 5:02, when the program runs using TextMate, the result is not shown when the movie is played inside the browser using the current QuickTime 10.0. The small video does show the result.
If VLC is used, then both the large video and the small video show the result. Maybe it is some format issue. Could the large video be run through Handbrake at the same size so that it works for QuickTime, iPod and iTV?
Also, the slightly greyish background gives the text poor contrast. A black background would be best.
@Ronald H Style an app? As in design? The css is under public/stylesheets
You can treat the erb files like any html files.
github signin = WIN!!!!
So I think this is a simple question...
How do I get the awesome nokogiri purple header / document return screen?
beautifully explained, thank you