#396 Importing CSV and Excel
Dec 08, 2012 | 10 minutes |
Plugins
Allow users to import records into the database by uploading a CSV or Excel document. Here I show how to use Roo to parse these files and present a solution for validations.
- Download:
- source codeProject Files in Zip (90.8 KB)
- mp4Full Size H.264 Video (24.6 MB)
- m4vSmaller H.264 Video (12 MB)
- webmFull Size VP8 Video (14.8 MB)
- ogvFull Size Theora Video (29.5 MB)
I ran into a problem immediately, and from what I can tell it is because I am using a pg_search enabled Document model I am trying to import. I also have another form_tag that I use to search Documents on the index page. Everytime I try to import I am getting.
If I remove the include PgSearch section from the Document model the CSV import works perfectly fine. How can I resolve this issue?
The relevant portion from Document.rb:
Update. I went a little farther into the Screencast to see if the code would eventually work. After failing on the very first import, I spit out a CSV file with the IDs, and all columns and with the following code, it works now.
Awesome, I don't know how you do it, in fact I was quite sceptical at the beginning but now I'm convinced you are a mind reader. Quick tell me what number I'm thinking about.
Why do
require 'csv'in application.rb and not in product.rb where it's actually used?I have the same concern, is something about performance or better practices? Anyone?
Thanks for the episode!
Great example, how would we accommodate custom mapping vs static header mapping?
This solution is not the best idea for the situation where you have large csv files (like 40k+ rows)
First of all it will parse large files synchronously - which means you have to wait for the webserver response till parsing is done
The second major problem is that each row is inserted in separate transaction - which is obviously slow as hell
So,as for me, I would not recommend to use this technic for large files
One more thing - if gem you using for CSV parsing loads entire file into memory - this memory won't be freed up after conversion (it is one of the major Ruby's problems)
what do you think about using this solution mixed whit delayed job for large cvs?
Delayed Job effectively solves only response delay problem.
If you plan to use Active record for data import - expect poor performance
insertion of 25k records using AR takes about 10 minutes
A little bit better situation with https://github.com/zdennis/activerecord-import
insertion of 25k records takes about 3 minutes
Fastest way is to build and execute raw sql request
insertion of 25k records takes about 30 sec
Memory efficiency is about 1-3x CSV size (slower - less memory , faster - more memory)
If you plan to import CSVs frequently , you will face inefficient garbage collection problem. To free up memory you need to kill Dj worker
We are using following scheme in production - spawn Dj worker , import couple CSV files with raw queries, respawn worker
And one more thing - don't use Delayed job with rails directly ! it will load full rails environment for each Dj worker and it's a lot of memory for no reason
How do I not use delayed_job directly with rails?
Please expand on not using Delayed_job with rails directly
You might also want to check out this gem for importing CSV:
https://github.com/tilo/smarter_csv
it addresses the issue of reading large CSV files (I tried it with more than a million rows), and can process them in chunks, e.g. for creating Resque jobs (if you want to de-couple the processing from the CSV-file reading).
Why don't you use the find or create syntax in the product model instead of doing find || new ?
There is also a gem called CsvRecord that serves as a persistance layer between csv files and Ruby classes, either importing or exporting. It also implements validation and associations and it can be used alongside ActiveRecord.
https://github.com/lukasalexandre/csv_record
Why would you use
over
Because it's faster
find_or_initialize_by_id - is a dinamic finder, it uses SQL request + method_missing chain to return the object
Class hierarhy traversal (to hit correct method_missing) is much much slower than "|| new"
I built a Rails-based client portal for a client, and I initially used Roo to permit them to upload the spreadsheets full of data that the portal needed to parse. I later switched to RubyXL. I ran into severe constraints with both packages, on large spreadsheets (where "large" means more than 50,000 rows of data, in my case). Both packages performed poorly with large XLSX files, which stands to reason, given how those files are encoded. I had memory use issues and CPU spikes, severe at times. These issues were especially challenging on the production system (a Linode VPS).
My ultimate solution was to use the Gnumeric spreadsheet tool "ssconvert" to convert the XLSX files to CSV format; from there, the standard Ruby CSV module worked fine. The same Resque job that initially used Roo to parse the spreadsheets just runs "ssconvert" and produces CSVs in a temp directory, before continuing its work. This solution isn't always suitable, of course; if you need to preserve the formulae in the spreadsheets, CSV won't work. In my case, I just needed the data in the columns.
Roo and RubyXL are great for some purposes. In fact, I'd have preferred to use them, rather than my hack solution. But my wishes were thwarted by the sheer size of the spreadsheets I had to import. If you run into a problem like that, using "ssconvert" to convert the Excel spreadsheets to CSV might help.
Thanks for the tip. I'm also interested in parsing tens of thousands of lines of excel data in rails.
Could you say more about how you set up Gnumeric/ssconvert on your production server? We're running Ubuntu 12 and I'm not seeing an obvious way to get the command-line ssconvert program installed without installing Gnumeric and all of the Gnome packages it depends on.
I'm the only one who have encodings issue? There's a method to force encoding to roo?
Pass encoding options to CSV is possible,
but with roo?
I've also had many frustrating experiences with encoding issues in user generated files. You might want to look at https://github.com/chrisgogreen/roo which allows CSV options, or my fork which incorporates CSV options and implements filthy brute force encoding to UTF-8: https://github.com/aew/roo.
Speaking of which - I've only ever been able to overcome pc/mac encoding issues in user generated files using iconv. String#encode or force_encoding don't seem to do it. I must be missing something as I've run into this repeatedly on different projects and have yet to figure out the "rails way" of dealing with it. I guess I'll keep staring at iconv deprecation warnings for the foreseeable future.
Agree with other comments on large files - roo / active record will be slow and have a large memory footprint which can get you in trouble using delayed_job on Heroku.
You might also want to check out this gem for importing CSV:
https://github.com/tilo/smarter_csv
It gives more control over how the CSV is imported, e.g. can manipulate, re-write, or replace column-headers; it can read data in chunks for more efficient post-processing (e.g. with Resque); it can deal with non-standard record separators; and more..
Thanks Ryan!!! Nice Episode..
A small question when I was watching:
What plugin do you use to swap between different applications(Safari, Terminal, and so on)? Currently, I used Quicksilver, but sometimes it conflicts with VI command. Any clues?
Just wanted to point out to everyone - if you happen to be using the google-api-client Roo has a name conflict with 'Google' and breaks the api client.
Took a little while to track that down..
Great topic.
One problem I've had with importing CSV files from real users is often they aren't always valid, i.e., they include extended characters, are encoded weirdly, or are otherwise malformed leading to many parsing errors. Finally if they include headers at all they do not match that names of the attributes in the Rails model.
The users who send me files like these are also the least technical so hardest for them to clean up themselves or understand why files are invalid.
Would like to see more about error handling, or know what people do with these real world messy files. Thanks.
check out smarter_csv for easy renaming of the headers to match your model.
Agreed - this seems like a big issue per my comment above, and I'm really curious as to how an experienced Rails dev would tackle it.
This works excellent for Excel files.
But how do I import a csv file that have ";" as a column separator instead of the default value ","?
I'd like to know this also, I'm often given tab separated files and various text enclosures.
check out the other Gem I mentioned in a previous comment - it gives full control over delimiters, column headers, etc..
Thanks for this fork! https://github.com/chrisgogreen/roo
Thanks for uploading this video, but how do I parse tab delimited instead of comma ?
It'll be something like:
CSV.parse("text you are processing", column separator, row separator)
For anybody that's getting bitten by the "redundant UTF-8 sequence" bug when attempting to upload an Excel file (which is likely causing the error Ryan refers to), a fix has been implemented in the 3-2-stable branch of rails, but not released yet. In order to get this goodness into your own app, simply create this file:
...and be sure te
requireit. You should be good to go, after that.Im having problems getting it to import without adding in new records rather than updating existing. Sometimes it doesn't appear to be doing anything. I cant figure why though!
Do you mean that it adds the same data in the file again instead of updating? If so, Change the find_by in the model to another field instead of id. For me, I use first name & last name seeing as this would not be one of the changes made to the uploaded file.
I'm trying to import an excel xml that was created like 362-exporting-csv-and-excel http://railscasts.com/episodes/362-exporting-csv-and-excel?view=asciicast
I get the following error
c:/ruby/lib/ruby/gems/1.8/gems/parseexcel-0.5.2/lib/parseexcel/
olestorage.rb:354:in `get_header': OLE::UnknownFormatError
(OLE::UnknownFormatError)
ruby-roo/ parseexcel cannot read the excel xml file type?
I switched over to strong_parameters and this has broken the basic CSV import method outlined in this screencast I am using on a Documents model. Wondering what syntax I would need to use to fix this line to work with strong_parameters.
document.attributes = row.to_hash.slice(*accessible_attributes)I am trying to import largish Excel files (~2MB) that are generated by an analysis program that we run. I am running into an issue where the open operation is exhausting the memory on my computer (I start with about 8 GiB free).
I have tried both Roo and the underlying Spreadsheet gem. Has anyone had any luck opening larger files?
Turns out the issue was really with irb:
https://github.com/zdavatz/spreadsheet/issues/8
Seems to work fine from scripts.
After import I need to populate one of the columns with the foreign key so that I can associate each row uploaded from the file with the user that uploaded it. However, the foreign key (restuarant_id) is stored in a session when the user logs in. seeing that I can't or should't access sessions in the model, how do I accomplish this?
I get the error uninitialized constant Model::Excel. Anyone else have this problem?
I found a fix for it just now. I called up Excel/Csv classes from the Roo object. Seems to be working ok
Thank you for this! I was getting very frustrated with not being able to get even the example to work correctly and this solved my problem.
Yeah, i have it too, but with Excelx. The problem occurs with ruby 1.9.3, but not in 1.9.2 Anyone know about this?
Use:
Just checked on both 1.9.2 and 1.9.3, works fine.
http://ericlondon.com/posts/246-importing-and-converting-csv-data-into-a-rails-model-with-a-slick-active-admin-interface
how to use this functionality to import data for model associations
This is my case
class SampleRequest < ActiveRecord::Base
has_one :sample_request_text_excerpt
accepts_nested_attributes_for :sample_request_text_excerpt
end
class SampleRequestTextExcerpt < ActiveRecord::Base
belongs_to :sample_request
end
In the sample_request model, asset type field would be text and image , for text type different excel sheet and for image type different excel sheet , only certain field would be extra in the sample_request_text_excerpt model.
For single model no issues , while try to use for associations
this line showing error
sample_request.attributes = row.to_hash
while uploading the text spreadsheet error displayed
put the below file 'import.rake' in lib/tasks
then 'rake import_hosts'
require 'csv'
task :import_hosts =>[:environment] do
file="db/tmp/hostssmall.csv"
additional_rows_to_skip = 1
CSV.foreach(file) do |row|
if additional_rows_to_skip > 0
additional_rows_to_skip-=1
else
p = Host.create!({
:ref => row[0],
:name => row[1],
:address => row[2],
}
)
end
end
end
the import is seemingly working ok for me but then the values of the attributes are saved as nil. any thoughts?
INSERT INTO "addresses" ("baths", "beds", "building_name", "city", "created_at", "date_available", "full_address", "laundry_in_building", "laundry_in_unit", "listing_notes", "parking", "pets", "price", "showing_contact", "showing_contact_number", "showing_instructions", "state", "status", "unit_number", "updated_at", "utilities_included", "zipcode") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22) RETURNING "id" [["baths", nil], ["beds", nil], ["building_name", nil], ["city", nil], ["created_at", Sun, 03 Mar 2013 16:52:17 UTC +00:00], ["date_available", nil], ["full_address", nil], ["laundry_in_building", nil], ["laundry_in_unit", nil], ["listing_notes", nil], ["parking", nil], ["pets", nil], ["price", nil], ["showing_contact", nil], ["showing_contact_number", nil], ["showing_instructions", nil], ["state", nil], ["status", nil], ["unit_number", nil], ["updated_at", Sun, 03 Mar 2013 16:52:17 UTC +00:00], ["utilities_included", nil], ["zipcode", nil]]
I just had this same issue. My problem was that I was searching for row["id"] but my spreadsheet had "ID" instead.
You know this is awesome sauce.
The most valuable thing however was
product.attributes = row.to_hash.slice(*accessible_attributes)The thought is simple, it's mass assignment basics but would have never thought of it when importing excel files!
Super.
Great 'cast. . . I am attempting to use the same technique to import XML.
Does anyone have pointers on how to work with XML using this approach?
Thanks in advance.
correction with open_spreadsheet method
def self.open_spreadsheet(file)
case File.extname(file.original_filename)
when ".csv" then Roo::Csv.new(file.path, nil, :ignore)
when ".xls" then Roo::Excel.new(file.path, nil, :ignore)
when ".xlsx" then Roo::Excelx.new(file.path, nil, :ignore)
else raise "Unknown file type: #{file.original_filename}"
end
end
My import button is not showing up in any browsers. I used this snippet in my customer.rb
and this in my index.html.erb
Any thoughts?
I can't get it to work when i add this into my app.
undefined method `import' for
i have gone at it over and over -- and nothing.
-- though I can get it to work using the supplied source code.
please assist -- i have lots of content entered in good cspreadhseets taht I need to enter into my site.
thx,
sg
First sign in through GitHub to post a comment.