#191 Mechanize
Dec 07, 2009 | 10 minutes | Tools
Mechanize extends the power of Nokogiri, allowing you to interact with multiple pages on a site: click links, submit forms, etc.
- Download:
- source code: Project Files in Zip (93.6 KB)
- mp4: Full Size H.264 Video (19.3 MB)
- m4v: Smaller H.264 Video (12 MB)
- webm: Full Size VP8 Video (29.8 MB)
- ogv: Full Size Theora Video (29.2 MB)
I've been wanting to use Mechanize for a while! Great to see a screencast on it!
Thanks ryanb!
Sweet! Thanks again for your consistent screen casts!
Is that your real wish-list?
Hi Ryan,
Love your work! Small thing though - you promised to stick the line for getting the console history in the show notes, and I don't see it ...
@Steve, nope, not a real wish list so don't get me anything from it. ;)
@Chris, oops, it's up there now.
Looks awesome and I think this will save me a lot of time. Does this work with JavaScript at all, or only with straight websites? Some stupid websites submit their forms through JavaScript only; if it could handle that it would be amazing...
That irb copy function is pretty neat, thanks for the tip!
It was very cool when it all came together and you proved how easy it was to simply add products to your site and almost instantly update the prices. Put a smile on my face.
@Sam
That was exactly my reaction. Reminds me of how I felt when I first watched the "create a blog in 15 minutes" screencast :)
@Sam, @Godfrey
I felt the same way! I couldn't help but laugh -- I worked with Mechanize on Python a while back and it just did not seem that easy :)
@Ryan
You have a real talent for presenting this stuff. I really appreciate the time you put into it!
I'll just leave this here as well (already commented in #190).
If anyone needs to scrape AJAX-heavy sites and HTML parsers just don't cut it, you might want to take a look at the HtmlUnit library. Sadly, it's only available for Java, but it's the only library capable of JavaScript that I found.
Most of the time you wouldn't need this, but if a site uses a lot of ajax and some obfuscated javascript, and changes a lot, it might be the only way.
Ryan does it again. This is exactly what I need for an app I'm working on now.
Is there any way to deal with captchas with Mechanize? It seems like more and more pages, particularly any type of form submission, have a captcha.
Because someone is sure to bring up the black hat type stuff that can be done with mechanize, I assure you my intentions are purely white hat.
I put this at the bottom of my ~/.irbrc file to quickly access this command history:
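# Note: Array#split comes from ActiveSupport, so this works as-is in a
# Rails console; in plain irb, require 'active_support/core_ext/array' first.
def hist
  puts Readline::HISTORY.entries.split("exit").last[0..-2].join("\n")
end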
HTH,
Chip
Need to invoice? http://invoicethat.com
Maybe it is a stupid question, but would this not be possible to do with Webrat?
Awesome screencasts, thanks a lot.
I'm trying to use the history command but I'm getting:
NoMethodError: private method `split' called for #<Array:0x10122ea28>
This commit from your dotfiles does work somewhat better, but it lists everything in my .irb_history, so it doesn't contain exit lines to split at?
http://github.com/ryanb/dotfiles/commit/78c149fb7e9ac1f2d89ed3a7518aee293b63b747
How would you get access to the Nokogiri object of a given web page once you are "authenticated", in order to scrape non-form / non-link data from the page?
I've been using Mechanize to scrape web content for a while and it's extremely convenient.
What I noticed, though, is that if you keep the agent alive for multiple requests (like looping through pagination) it starts consuming more and more memory. My guess is that the Mechanize agent is storing the pages previously loaded even if you clear the variable holding said page.
Anyone know how to deal with this? Can you clear the 'cache' so to speak?
Thanks Ryan, great one !
I'd like to ask for some help...
Actually I need to forge a cookie. My rails application is a kind of proxy between the user and another webapp. I need to preserve the session of the end-webapp, through the entire user session on MY Rails app.
Hence, I'm creating a new agent object for each new request, and I need to re-create the cookie with the previous session ID.
I'm struggling with Mechanize::CookieJar and stuff, but no luck yet...
Any idea ?
Wow, lots of spam despite your Rails captcha...
Anyway, I found a way to hack around my problem :
agent.cookie_jar.jar['mydomain'] = {'/' => {'PHPSESSID' => WWW::Mechanize::Cookie.new('PHPSESSID', previous_session_id)}}
However, #jar is not documented... I wonder if it will stay ok with upgrades...
Thanks. That's cool
For anyone struggling with Mechanize's memory usage like I was, you can limit the maximum number of pages it retains in memory by setting the agent's "max_history", for example:
agent = WWW::Mechanize.new
agent.max_history = 20
Great. You are showing how powerful Mechanize is and a "first touch" of it in a clear way.
Thanks for the screen casts. The quality is top-notch.
Great work Ryan,
It would be great if you could come up with a screencast on making screencasts. Till then, could you share the tools you use to make your screencasts?
Thanks
Looks like SPAM is taking over again :-( Was nice there for a while.
Have you considered a plugin like Rakismet?
I'm going to fork the Railscasts repo, make some changes to hide comments if they appear spammy, until you flag them as ok. That way, even if spam still gets entered, at least it won't show up.
thanks
great episode once again! Thanks so much!
thanks a lot
Well done, I like it. Nice to get used to what Mechanize is, and how to use it!
Thanks a lot!
Just a note ...
If you're having problems getting a form to submit ... try using the click_button form method instead.
http://mechanize.rubyforge.org/mechanize/WWW/Mechanize/Form.html
I get an error when I try out the history code.
Readline::HISTORY.entries.split("exit").last[0..-2]
which results in:
NoMethodError: private method `split' called for #<Array:0x1500554>
But it works if I do this:
Readline::HISTORY.entries[0..-2].join("\n")
Nice history tip, Ryan.
I created a method in my .irbrc that easily allows me to get the history of this or previous sessions as well as being able to limit to a certain number of lines ala
hist :all, 25
You can get the code at http://gist.github.com/272588
Is it possible to use mechanize to fetch images and such from a web site or is there a better gem/plugin suited for this?
Is there a way to use Mechanize to direct a user to their email provider, plug their username into the form, and tab down to the password field? So essentially they sign up and I redirect them to their email account, where they just enter their password since I provided the username?
Array#split is an ActiveSupport extension. If you want to use the history code in irb be sure to require 'active_support/core_ext/array'
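For irb sessions without ActiveSupport, the same "everything since the last exit" idea can be sketched in plain Ruby. This is only a sketch: `lines_since_last_exit` is a made-up helper name, and it takes the history as an ordinary array so it can be tried without Readline.

```ruby
# Hypothetical helper (not from the show notes): returns the entries typed
# after the most recent "exit", minus the final entry (the call that asked
# for the history in the first place).
def lines_since_last_exit(entries)
  last_exit = entries.rindex("exit")
  current = last_exit ? entries[(last_exit + 1)..-1] : entries
  current[0..-2]
end

# Invented sample history for illustration.
sample = [
  "require 'mechanize'",
  "exit",
  "agent = Mechanize.new",
  "agent.get('http://example.com')",
  "hist"
]
puts lines_since_last_exit(sample).join("\n")
```

In ~/.irbrc this could be called as `lines_since_last_exit(Readline::HISTORY.to_a)`.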
On an ASP-generated page, the form's submit script automatically fills in 2 form variables:
theForm.__EVENTTARGET.value = eventTarget;
theForm.__EVENTARGUMENT.value = eventArgument;
before submit. How to define them in Mechanize? thx!
additional info: http://www.xefteri.com/articles/show.cfm?id=18
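Since that script only fills two hidden inputs and then submits the form, a scraper can set the same two fields itself before posting. The sketch below uses only Ruby's standard library to show the resulting POST body; `postback_body`, the `__VIEWSTATE` value and the control name are all made up for illustration. With Mechanize, the equivalent would presumably be assigning the same two field names on the form object before submitting it.

```ruby
require "uri"

# Hypothetical helper: merges the two postback fields into whatever hidden
# fields were already scraped from the page, then form-encodes the result.
def postback_body(hidden_fields, event_target, event_argument = "")
  fields = hidden_fields.merge(
    "__EVENTTARGET"   => event_target,
    "__EVENTARGUMENT" => event_argument
  )
  URI.encode_www_form(fields)
end

# Invented values; a real page supplies its own __VIEWSTATE and control names.
puts postback_body({ "__VIEWSTATE" => "abc123" }, "ctl00$NextPage")
```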
I took almost an hour to go through each and every comment in here. I would be glad if someone could explain how to handle captchas using Perl. It need not solve the captcha on its own; I just need a way to save the image and send it to a decaptcha API.
Thanks for this, will spend some time learning.... This is the first I'd read about mechanize so looking forward to using it...
If you are getting errors like this:
I believe
WWW::Mechanize.new
is now deprecated and should be
Mechanize.new
instead.
Source: https://webrat.lighthouseapp.com/projects/10503/tickets/368-www-in-wwwmechanize-deprecated
Thanks for this episode.
I am actually trying to scrape a website full of AJAX, and Mechanize does not really work well in this case. I have found that some people are using Watir or Capybara to do so. Do you know of any other, simpler solutions?
Using capybara for web scraping
You can replace capybara-mechanize with any other headless capybara driver, e.g. poltergeist or capybara-webkit
I am trying to click a list of links with Mechanize gem, but apparently Mechanize's
links_with(criteria)
is not properly filtering based on the criteria. For debugging purposes, I am only printing out the link. The following script is printing out most (all?) links on the page:
And if I change the
(:text => /[VIEW FULL PROFILE]/)
to
(:text => "VIEW FULL PROFILE")
then no link at all gets printed.
I can't understand what I am doing wrong. Any thoughts?
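One likely cause, sketched in plain Ruby below: square brackets in a regular expression form a character class, so /[VIEW FULL PROFILE]/ matches any text containing at least one of those characters (a space included), not the whole phrase. The string form probably fails for the opposite reason: assuming Mechanize compares each criterion with `===`, a String criterion has to equal the link text exactly, so stray whitespace around the text breaks it. The link labels and the `matching` helper are invented for illustration.

```ruby
# Invented link labels; the real page's link text is not shown in the question.
LABELS = ["Log in", "PRICING", "about", "  VIEW FULL PROFILE\n"]

CHAR_CLASS = /[VIEW FULL PROFILE]/ # any ONE of V I E W F U L P R O or a space
PHRASE     = /VIEW FULL PROFILE/   # the literal phrase anywhere in the text
EXACT      = "VIEW FULL PROFILE"   # String#=== is plain equality

# Hypothetical stand-in for the filtering step, using === on each label.
def matching(labels, criterion)
  labels.select { |text| criterion === text }
end

p matching(LABELS, CHAR_CLASS) # matches nearly everything
p matching(LABELS, PHRASE)     # matches only the profile link
p matching(LABELS, EXACT)      # matches nothing because of the whitespace
```

Dropping the square brackets, as in `:text => /VIEW FULL PROFILE/`, is probably what was intended.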
I need help.
I am getting nil form value. I am getting error: undefined method `q=' for nil:NilClass
Thanks in advance!