May 12, 2019

Using the Postgres Copy command in Ecto Migrations

Today’s post is (super) short: we’re going to look at how we can use the Postgres COPY command within an Ecto migration to populate a lookup table.

Motivation

Why would you want to do this? Well, to be honest, often you wouldn’t. If your lookup table doesn’t hold a huge amount of data, say a list of product categories, just inserting the data explicitly within the migration works fine, i.e.

def up do
  Repo.insert(%Category{name: "Hammers"})
  Repo.insert(%Category{name: "Widgets"})
  ...
end

def down do
  Repo.delete_all(Category)
end

If your source data is in a file, you could still do something like the above. You would just read the data from the file instead of specifying it in the migration.
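For instance, here is a rough sketch of that approach, assuming the category names live one per line in a plain text file at priv/repo/categories.txt (the file name and format here are just placeholders):

def up do
  # Read the lookup data from a file shipped in priv/repo instead of
  # hard-coding it in the migration.
  directory = Application.app_dir(:your_app, "priv/repo")

  directory
  |> Path.join("categories.txt")
  |> File.stream!()
  |> Stream.map(&String.trim/1)
  |> Enum.each(fn name -> Repo.insert(%Category{name: name}) end)
end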

But what if you are dealing with a significant amount of data? Recently I needed to insert ~900k lookup records into a table, and inserting that many records one by one would take forever!

Using COPY within a migration

Postgres has a COPY command, designed exactly for importing or exporting large amounts of data to or from a file.

And luckily, it is pretty easy to use within a migration:

def up do
  directory = Application.app_dir(:your_app, "priv/repo")
  file_path = Path.join(directory, "products.csv")
  import_product_data_from(file_path)
end

def down do
  Repo.delete_all("products")
end

def import_product_data_from(file_path) do
  sql = """
    COPY products(name, external_product_id, year, manufac_id)
    FROM STDIN
    WITH DELIMITER ',' CSV HEADER
  """

  # Build a stream that feeds data into the COPY ... FROM STDIN statement.
  stream = Ecto.Adapters.SQL.stream(Repo, sql)

  # COPY must run inside a transaction; stream the CSV file straight into it.
  Repo.transaction(fn ->
    File.stream!(file_path)
    |> Enum.into(stream)
  end)
end

With the above, all we are doing is specifying the COPY statement in our migration and streaming the file into it. We can use COPY just as we would with Postgres directly, and by combining it with an Ecto stream we speed up the migration significantly.
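For reference, the file is expected to match the options given to COPY: comma-delimited, with a header row (which the HEADER option tells Postgres to skip) and columns in the same order as the column list. Something along these lines, with made-up values:

name,external_product_id,year,manufac_id
Claw Hammer,EXT-0001,2019,42
Sledge Hammer,EXT-0002,2019,42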

When dealing with a large amount of data, this is still going to be a slow operation; it takes around 5 minutes for the 900k rows I am dealing with to get into the database. That’s not great, but it’s much better than individual Repo.insert calls, which for this amount of data would not be realistic.

Summary

As we’ve seen, it’s pretty easy to insert a large amount of data directly within a migration. I’m still not sure how I feel about it, however. It might make sense to keep the data insertion out of the migration and use an external script instead, as sketched below. That way running the migration stays quick, the downside being that you now have more steps to get your database up and populated with data. Like most things in life, it’s a tradeoff!
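If you do go the external script route, one option is a plain .exs script that reuses the same streaming approach and is run on demand with mix run. A minimal sketch, assuming the same products.csv file and a YourApp.Repo module (both names are placeholders):

# priv/repo/import_products.exs
# Run with: mix run priv/repo/import_products.exs
alias YourApp.Repo

csv_path =
  :your_app
  |> Application.app_dir("priv/repo")
  |> Path.join("products.csv")

sql = """
  COPY products(name, external_product_id, year, manufac_id)
  FROM STDIN
  WITH DELIMITER ',' CSV HEADER
"""

# Same pattern as in the migration: stream the file into COPY ... FROM STDIN
# inside a transaction.
Repo.transaction(fn ->
  File.stream!(csv_path)
  |> Enum.into(Ecto.Adapters.SQL.stream(Repo, sql))
end)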

Thanks for reading and I hope you enjoyed the post!


