#191 Mechanize

Dec 07, 2009 | 10 minutes | Tools

Mechanize extends the power of Nokogiri, allowing you to interact with multiple pages on a site: clicking links, submitting forms, and so on.

Ryan over 15 years ago

I've been wanting to use Mechanize for a while! Great to see a screencast on it!

Thanks ryanb!

Steve over 15 years ago

Sweet! Thanks again for your consistent screen casts!
Is that your real wish-list?

chris over 15 years ago

Hi Ryan,
Love your work! Small thing though: you promised to put the line for getting the console history in the show notes, and I don't see it...

Ryan Bates over 15 years ago

@Steve, nope, not a real wish list, so don't get me anything from it. ;)

@Chris, oops, it's up there now.

Richard over 15 years ago

Looks awesome, and I think this will save me a lot of time. Does this work with JavaScript at all, or only with plain HTML sites? Some sites submit their forms through JavaScript only; if it could handle that it would be amazing...

Sam Millar over 15 years ago

That irb copy function is pretty neat, thanks for the tip!

It was very cool when it all came together and you proved how easy it was to add products to your site and almost instantly update the prices. It put a smile on my face.

Godfrey Chan over 15 years ago

@Sam

That was exactly my reaction. Reminds me of how I felt when I first watched the "create a blog in 15 minutes" screencast :)

Jeff Tucker over 15 years ago

@Sam, @Godfrey
I felt the same way! I couldn't help but laugh -- I worked with Mechanize on Python a while back and it just did not seem that easy :)

@Ryan
You have a real talent for presenting this stuff. I really appreciate the time you put into it!

mrbrdo over 15 years ago

I'll just leave this here as well (already commented in #190).

If anyone needs to scrape AJAX-heavy sites and HTML parsers just don't cut it, you might want to take a look at the HtmlUnit library. Sadly, it's only available for Java, but it's the only library capable of running JavaScript that I found.
Most of the time you wouldn't need this, but if a site uses a lot of ajax and some obfuscated javascript, and changes a lot, it might be the only way.

adrenally fatigued over 15 years ago

Ryan does it again. This is exactly what I need for an app I'm working on now.

Is there any way to deal with captchas with Mechanize? It seems like more and more pages, particularly any type of form submission, have a captcha.

Because someone is sure to bring up the black hat type stuff that can be done with mechanize, I assure you my intentions are purely white hat.

Chip Castle over 15 years ago

I put this at the bottom of my ~/.irbrc file to quickly access this command history:

def hist
  puts Readline::HISTORY.entries.split("exit").last[0..-2].join("\n")
end

HTH,
Chip

Need to invoice? http://invoicethat.com

eltados over 15 years ago

Maybe it is a stupid question, but would this not be possible to do with Webrat?

lmjabreu over 15 years ago

Awesome screencasts, thanks a lot.

Kieran P over 15 years ago

I'm trying to use the history command but I'm getting:

NoMethodError: private method `split' called for #<Array:0x10122ea28>

This commit from your dotfiles works somewhat better, but it lists everything in my .irb_history, so it doesn't contain exit lines to split at?

http://github.com/ryanb/dotfiles/commit/78c149fb7e9ac1f2d89ed3a7518aee293b63b747

Peter over 15 years ago

How would you get access to the Nokogiri object of a given web page once you are "authenticated", in order to scrape non-form / non-link data from the page?

Peter D over 15 years ago

I've been using Mechanize to scrape web content for a while and it's extremely convenient.

What I noticed though is if you keep the agent alive for multiple requests (like looping through pagination) it starts consuming more and more memory. My guess is Mechanize agent is storing the pages previously loaded even if you clear the variable holding said page.

Anyone know how to deal with this? Can you clear the 'cache' so to speak?

Simon Cookie Lover over 15 years ago

Thanks Ryan, great one !

I'd like to ask for some help...

Actually I need to forge a cookie. My rails application is a kind of proxy between the user and another webapp. I need to preserve the session of the end-webapp, through the entire user session on MY Rails app.

Hence, I'm creating a new agent object for each new request, and I need to re-create the cookie with the previous session ID.

I'm struggling with Mechanize::CookieJar and stuff, but no luck yet...

Any idea ?

Simon Cookie Lover over 15 years ago

Wow lots of spams despite your Rails-captcha...

Anyway, I found a way to hack around my problem :

agent.cookie_jar.jar['mydomain'] = {'/' => {'PHPSESSID' => WWW::Mechanize::Cookie.new('PHPSESSID', previous_session_id)}}

However, #jar is not documented... I wonder if it will stay ok with upgrades...

Susann over 15 years ago

Thanks. That's cool

Peter D over 15 years ago

For anyone struggling with Mechanize's memory usage like I was, you can limit the maximum number of pages it retains in memory by setting the agent's "max_history", for example:

agent = WWW::Mechanize.new
agent.max_history = 20

Juan Medín Piñeiro over 15 years ago

Great. You are showing how powerful Mechanize is and a "first touch" of it in a clear way.

Thanks for the screen casts. The quality is top-notch.

pankaj over 15 years ago

Great work Ryan,
It would be great if you could come up with a screencast on making screencasts. Till then could you share the tools you use to make your screencasts.

Thanks

Kieran P over 15 years ago

Looks like SPAM is taking over again :-( Was nice there for a while.

Have you considered a plugin like Rakismet?

I'm going to fork the Railscasts repo, make some changes to hide comments if they appear spammy, until you flag them as ok. That way, even if spam still gets entered, at least it won't show up.

sh over 15 years ago

thanks

Lin He over 15 years ago

great episode once again! Thanks so much!

for-sec over 15 years ago

thanks a lot

Nils over 15 years ago

Well done, I like it. Nice to get a feel for what Mechanize is and how to use it!

Thanks a lot!

austin_web_developer over 15 years ago

Just a note ...

If you're having problems getting a form to submit ... try using the click_button form method instead.

http://mechanize.rubyforge.org/mechanize/WWW/Mechanize/Form.html

Chong Km over 15 years ago

I get an error when I try out the history code.

Readline::HISTORY.entries.split("exit").last[0..-2]

which results in:

NoMethodError: private method `split' called for #<Array:0x1500554>

But it works if I do this:

Readline::HISTORY.entries[0..-2].join("\n")

AaronH over 15 years ago

Nice history tip, Ryan.

I created a method in my .irbrc that easily allows me to get the history of this or previous sessions as well as being able to limit to a certain number of lines ala

hits :all, 25

You can get the code at http://gist.github.com/272588

Branden over 15 years ago

Is it possible to use mechanize to fetch images and such from a web site or is there a better gem/plugin suited for this?

Steve over 15 years ago

Is there a way to use Mechanize to direct a user to their email provider, plug their username into the form, and tab down to the password field? So essentially they sign up and I redirect them to their email account, where they just enter their password since I provided the username?

Michael Sepcot over 15 years ago

Array#split is an ActiveSupport extension. If you want to use the history code in irb be sure to require 'active_support/core_ext/array'
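
For anyone who would rather not load ActiveSupport in irb, here is a minimal plain-Ruby sketch of the same idea. The method name `last_session` and the sample entries are made up for illustration; it takes the history as an ordinary array so it can be tried outside irb.

```ruby
# Returns the entries typed since the last "exit", minus the final entry
# (the call that asked for the history). Plain Ruby, no ActiveSupport.
def last_session(entries)
  last_exit = entries.rindex("exit")
  session = last_exit ? entries[(last_exit + 1)..-1] : entries
  session[0..-2]
end

# Hypothetical history: one old session, then the current one.
sample = ["a = 1", "exit", "b = 2", "puts b", "hist"]
puts last_session(sample).join("\n")

# In irb this would presumably be called as:
#   puts last_session(Readline::HISTORY.to_a).join("\n")
```

If the history contains no "exit" line at all, the sketch falls back to returning everything except the last entry.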

Horace Ho over 15 years ago

On an ASP-generated page, the form's submit handler automatically fills in two form variables:

theForm.__EVENTTARGET.value = eventTarget;
theForm.__EVENTARGUMENT.value = eventArgument;

before submitting. How do I define them in Mechanize? Thanks!
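
A rough sketch of the idea, not a tested Mechanize recipe: since the page's JavaScript only copies two values into hidden fields before posting, a scraper can send those two fields as ordinary form parameters. The control name below is hypothetical; the real value would be whatever the page passes to its postback function. The example uses only Ruby's standard library to build the POST body.

```ruby
require 'uri'

# Hypothetical values standing in for what the page's JavaScript would set.
event_target   = 'ctl00$Main$btnNext'
event_argument = ''

params = {
  '__EVENTTARGET'   => event_target,
  '__EVENTARGUMENT' => event_argument
}

# Encode the fields the way a browser would for a form POST.
body = URI.encode_www_form(params)
puts body

# Round-trip to confirm both fields survive encoding.
decoded = URI.decode_www_form(body)
```

With Mechanize itself, the same effect can usually be had by assigning the two hidden fields on the form object (for example `form['__EVENTTARGET'] = ...`) before submitting, leaving the page's other hidden fields as they were loaded.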

additional info: http://www.xefteri.com/articles/show.cfm?id=18

Janice L. Evangelista over 14 years ago

I took almost an hour to go through each and every comment here. I would be glad if someone could reply with how to handle captchas using Perl. It need not solve the captcha on its own; I just need a way to save the image and send it to a decaptcha API.

Sim only over 14 years ago

Thanks for this, will spend some time learning.... This is the first I'd read about mechanize so looking forward to using it...

Peter Fisher-Duke almost 14 years ago

If you are getting errors like this:

          ruby
        
> agent = WWW::Mechanize.new
NameError: uninitialized constant Object::WWW
        from (irb):7
        from /.rvm/rubies/ruby-1.9.2-p180/bin/irb:16:in `<main>'

I believe WWW::Mechanize.new is now deprecated and should be Mechanize.new instead:

          ruby
        
> agent = Mechanize.new
 => #<Mechanize:0x99960c0 @agent=#<Mechanize::HTTP::Agent:0x999605c

Source: https://webrat.lighthouseapp.com/projects/10503/tickets/368-www-in-wwwmechanize-deprecated

frencesco over 13 years ago

Thanks for this episode.

I am actually trying to scrape a website full of Ajax, and Mechanize does not really work well in this case. I have found that some people are using Watir or Capybara to do so. Do you know of any other, simpler solutions?

Leo Gallucci about 12 years ago

Using capybara for web scraping

          Gemfile
        
gem 'capybara', '~> 2.1'
gem 'capybara-mechanize', '~> 1.1'

          ruby
        
require 'capybara'
require 'capybara/mechanize'

Capybara.configure do |config|
  config.run_server = false
  config.default_driver = :mechanize
  config.app = "" # to avoid this error: ArgumentError: mechanize requires a rack application, but none was given
  config.app_host = "http://railscasts.tadalist.com"
end

# Including Capybara::DSL in the global scope is not recommended but for the sake of this example:
include Capybara::DSL

visit "/session/new"
fill_in "password", :with => "secret"
click_button("Sign in")
# etc.. capybara cheat sheet: https://gist.github.com/zhengjia/428105

You can replace capybara-mechanize with any other headless capybara driver, e.g. poltergeist or capybara-webkit

Sergio Schuler over 11 years ago

I am trying to click a list of links with the Mechanize gem, but apparently Mechanize's links_with(criteria) is not filtering properly based on the criteria. For debugging purposes, I am only printing out the link text.

The following script is printing out most (all?) links on the page:

          ruby
        
require 'mechanize'

agent = Mechanize.new
url = "http://www.fearlessphotographers.com/location/470/sul-do-brasil"

agent.get(url)
agent.page.links_with(:text => /[VIEW FULL PROFILE]/).each do |link|
    puts link.text
end

And if I change the (:text => /[VIEW FULL PROFILE]/) to (:text => "VIEW FULL PROFILE") then no link at all gets printed.

I can't understand what I am doing wrong. Any thoughts?
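
One likely cause, shown as a plain-Ruby sketch with made-up link texts rather than the real page: square brackets in a regular expression form a character class, so /[VIEW FULL PROFILE]/ matches any text containing at least one of those letters or a space. Dropping the brackets matches the phrase itself. The string form probably fails because, assuming a string criterion has to equal the whole link text, stray whitespace around the text would defeat it.

```ruby
# Hypothetical link texts standing in for a scraped page.
link_texts = ["VIEW FULL PROFILE", "Home", "CONTACT", "about us", " VIEW FULL PROFILE \n"]

char_class = /[VIEW FULL PROFILE]/  # any ONE of the characters V, I, E, W, space, F, U, L, P, R, O
phrase     = /VIEW FULL PROFILE/    # the literal phrase

by_class  = link_texts.grep(char_class)
by_phrase = link_texts.grep(phrase)

puts "character class matches: #{by_class.inspect}"
puts "phrase matches:          #{by_phrase.inspect}"

# Exact string comparison fails when the text carries extra whitespace.
exact = link_texts.select { |text| "VIEW FULL PROFILE" == text }
puts "exact matches:           #{exact.inspect}"
```

In the script above, that would mean trying :text => /VIEW FULL PROFILE/ as the criterion.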

Raj Kumar Goyal almost 11 years ago

I need help.

          ruby
        
@web_agent = Mechanize.new
page = @web_agent.get 'https://www.google.com/alerts'
form = page.form_with :name => 'f1'

form.q = "search keyword"

The form comes back as nil, so I get this error: undefined method `q=' for nil:NilClass

Thanks in advance!