After some problems setting up nokogiri it is really awesome. XML parsing and screen scraping as simple as possible. And fast!
Great, another library for doing this.
I really like this stuff, thanks!
@Jamie - sure, it helps to ask for permission. That's what I do at least :)
@Ryan - thanks for another great episode!
Hi, Ryan! Thank you for one more great screencast. I'd like to translate your casts into Russian. If you agree, please contact me, because I tried to email you but got just this:
Delivery to the following recipient failed permanently:
feedback@railscasts.com
Technical details of permanent failure:
The recipient server did not accept our requests to connect. [mx1.sub4.homie.mail.dreamhost.com.railscasts.com. (0): Destination address required]
[mx2.sub4.homie.mail.dreamhost.com.railscasts.com. (0): Destination address required]
Anyone done a performance comparison of Nokogiri vs Hpricot?
Hi Ryan,
Great screencast and I look forward to the next episode featuring Mechanize as most of us will likely need the ability to interact with the website being scraped.
That said, I've opted against Mechanize in favor of Celerity given Mechanize's lack of support for Javascript in today's jQuery/Prototype, etc world.
Sure, there are generally workarounds to bypass the Javascript (fun), and there is Watir (though I prefer Celerity's faceless browser). Perhaps highlighting this weakness of Mechanize in your screencast will encourage the addition of such support...that and better documentation. :-)
Great episode. This looks cleaner than ScrAPI.
Btw I didn't have to provide libxml path when installing nokogiri. gem install nokogiri worked like a charm. I'm using Windows with cygwin environment.
I don't think I'll be using ScreenScraping any time soon in my apps, however SelectorGadget looks like a great tool which may come in handy for me at some point.
Thanks Ryan.
Would this work for getting football statistics or is there a better way to do that?
Thanks!!
Ryan, I have posted this before but it may have gotten lost among all the spam. Since spam really disables a fruitful conversation (or even just reading) of this little forum, I think you have to attack the problem seriously. Here is my funny but possibly quite powerful solution:
I have a solution for the spam. Since most of us know at least a bit about Rails (otherwise we wouldn't give a rat's ... about the Railscasts), ask a simple question in addition to a CAPTCHA, for example:
Fill in the blank:
validates_xxxxx_of :firstname, :lastname
or something more funny (and political LOL):
validates_xxxxx_xxx :smartpresidents, :in => "George W. Bush"
(...exclusion of that is LOL)
Hello Ryan,
Amazing screencast. You always cover the items I am working on next!
Keep up the great work!
Great 'cast! Relevant as ever. SelectorGadget is an awesome tool, thanks a bunch for highlighting that.
Very nice, thank you Ryan.
Good job with the comment spam too.
Here is the same Walmart scraper rewritten with my nifty "Scraper" class:
http://gist.github.com/246309
http://github.com/mislav/scraper
@Espen: Here you go - http://www.rubyinside.com/ruby-xml-performance-benchmarks-1641.html
Hi Ryan
Thanks for all the great screencasts.
You make it look effortless!
D
Great screencast, Ryan (as always)! Btw, since there are more spam comments than actually useful ones, do you mind adding CAPTCHA support? We certainly won't mind it :)
Does reporting spam really work, and is it worth it?
I would like to be a moderator or spam reporter in chief here on Railscasts :)
As someone said before, you're my role model,
thanks.
This may or may not have been covered in a previous episode, but when I try to run the Ruby on Rails test script in TextMate, I get the error shown below. The script runs using 'ruby nokogiri_test.rb'. I am running Snow Leopard. Thoughts?
Error:
/Applications/TextMate.app/Contents/SharedSupport/Support/lib/io.rb:38:in `exhaust': undefined method `first' for nil:NilClass (NoMethodError)
    from /Applications/TextMate.app/Contents/SharedSupport/Support/lib/tm/process.rb:227:in `run'
    from /Applications/TextMate.app/Contents/SharedSupport/Support/lib/tm/executor.rb:211:in `parse_version'
    from /Applications/TextMate.app/Contents/SharedSupport/Support/lib/tm/executor.rb:98:in `run'
    from /Applications/TextMate.app/Contents/SharedSupport/Bundles/Ruby.tmbundle/Support/RubyMate/run_script.rb:93
Never mind. I found the answer at http://wiki.macromates.com/Troubleshooting/SnowLeopard
I guess I should always look a bit more...
I have a really important question to ask about Ruby on Rails. I'm a newbie who has been researching RoR for about 3 months, and my question is: how do you style an application? I've seen a number of screencasts, but not one seems to address this issue, which is a shame because it's something every programmer will have to do. I would like it if you could cover this topic in a screencast to help newbies like myself understand how to make things look better on the presentation side. Thanks.
Just a slight comment on the regular expression used in the screencast: /[0-9\.]+/. First, a . is not a special character inside a character class ([…]), so you can drop the backslash. Also, there's a shortcut in regular expressions for the character class [0-9], namely \d, which is usable inside or outside of a character class. Thus, the expression can be simplified to /[\d.]+/. Just an FYI.
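To make the point above concrete, here is a minimal sketch comparing the two patterns. The sample string is made up for illustration and is not taken from the screencast.

```ruby
# Hypothetical price text, standing in for what a scraped node might contain.
sample = "In stores: $54.88"

original   = /[0-9\.]+/   # pattern as quoted in the comment above
simplified = /[\d.]+/     # same character class, written with \d and a bare dot

# String#[] with a regex returns the first match, or nil if there is none.
price = sample[simplified]

puts price
puts sample[original] == sample[simplified]
```

On this input both patterns pick out the same substring, so the shorter form is a drop-in replacement here.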
Thanks for another great episode!
Sometimes you need to scrape AJAX-heavy sites, and scraping using traditional methods is not an option.
I would like to mention HtmlUnit here. It's a Java tool for website testing, and it implements a GUI-less browser with pretty good Javascript support. If anyone runs into a problem where they need to scrape an AJAX-heavy site and can't manage with approaches like those mentioned in this Railscast, I would recommend they take a look at HtmlUnit. The way I use it is with crontab once a day: I fetch IDs/URLs (which change often), write them to a file or DB, and then use nokogiri to actually scrape the data.
I must note, though, that HtmlUnit isn't really fast, so avoid it when you can.
You can use \d instead of 0-9 to match digits in regular expressions. \D matches non-digit characters.
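A small sketch of \d and \D side by side, using a made-up phone number as input:

```ruby
phone = "(555) 123-4567"             # made-up input for illustration

# \D matches any non-digit, so replacing it with "" keeps only the digits.
digits_only = phone.gsub(/\D/, "")

# \d+ matches each run of consecutive digits.
groups = phone.scan(/\d+/)

puts digits_only
p groups
```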
Running the above example I get the following error:
undefined method `text' for nil:NilClass (NoMethodError)
If I just do the following:
'puts doc', then I get the text below, which does not include the title and only seems to show the commented-out code from the source HTML:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<!--[if lt IE 7]>
<link href="http://i2.walmartimages.com/css/global_ie6.css" rel="stylesheet" type="text/css">
<![endif]--><!--[if IE 7]>
<link href="http://i2.walmartimages.com/css/global_ie7.css" rel="stylesheet" type="text/css">
<![endif]--><!--[if lt IE 7]>
<link href="http://i2.walmartimages.com/css/pagination_ie6.css" rel="stylesheet" type="text/css">
<![endif]--><!--[if IE 7]>
<link href="http://i2.walmartimages.com/css/pagination_ie.css" rel="stylesheet" type="text/css">
<![endif]--><!-- start /include/static/kill_frames.jsp --><!-- end /include/static/kill_frames.jsp --><!--[if lt IE 7]>
<iframe id="overlay" src="/overlay/overlay_iframe_default_src.jsp?bv_enabled=false" name="overlay" frameborder="0" scrolling="no"></iframe>
<![endif]--><!--[if IE 7]>
<iframe id="overlay" src="/overlay/overlay_iframe_default_src.jsp?bv_enabled=false" name="overlay" frameborder="0" scrolling="no" allowTransparency="yes"></iframe>
<![endif]--><!-- Start: Module G0040: Primary Navigation --><!-- Site Header start --><!--[if lt IE 7]>
<iframe id="dropmenuiframe" src="/blank.html" style="z-index:20;display:none;position:absolute"></iframe>
<![endif]--><!--[if IE 7]>
...
Excellent, thank you, only four hours ago I came up with an idea that needed exactly this.
Good tool to have.
There are also some open source sample scripts at
http://www.biterscripting.com/samples_internet.html
I use them often.
Thank you for the information you provide.
I tried it on Rails 3 beta4 according to your example, but it fails when run as a rake task and passes when run with "ruby test.rb". The same rake task passes on Rails 2.3.5.
Ruby version is 1.8.7; OS is Ubuntu 9.04.
May I ask for your help to solve it?
Thanks for posting this. Very nice recap of some of the key points in my talk. I hope you and your readers find it useful! Thanks again
This is all very new to me and this article really opened my eyes. Thanks for sharing your wisdom with us.
Hey everyone. I know this is an old screencast, but I wanted to add a little something to it.
I enjoyed your article here, mate. In fact, I'm a fan of the site in general, to be very honest. It's the fourth occasion I've been back here, but I kept forgetting to save the site in my bookmarks, so I had to keep going through the search engines to find it. SAVED this time, haha. Best of luck.
Good post. I am also going to write a blog post about this...I enjoyed reading your post and I like your take on the issue. Thanks.
Thanks for the great screencast. I have become a huge fan of this website and I really can't wait to read your next posts! Thanks for your work and for sharing your information. I'm going to download it.
Thanks for sharing your article. I really enjoyed it. I put a link to it on my site so other people can read it. My readers have about the same interests.
Good post, I can’t say that I agree with everything that was said, but very good information overall:)






