Getting Big Data From Redshift

28 August 2016


As part of a machine learning data collection task, I wanted to retrieve ~350 GB of data from Redshift into an AWS server in order to load the data into an IPython notebook and test some machine learning models.

Get It All With Python (Attempt #1)


Using the psycopg2 library to work with Redshift, I tried to fetch the whole result (‘fetchall’) into the server, using a simple Python Docker container. I kicked off the script, only to find the server’s RAM bursting, practically killing the Docker container and the host’s storage. It made sense that the server couldn’t cope with loading such a big data set into its RAM.
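A minimal sketch of that first attempt, assuming a hypothetical events table and placeholder connection details (none of these names come from the original setup):

import psycopg2

# Placeholder connection string; endpoint, database and credentials are illustrative.
DSN = "host=<redshift-endpoint> port=5439 dbname=analytics user=ml_user password=<secret>"

conn = psycopg2.connect(DSN)
with conn.cursor() as cur:
    cur.execute("SELECT * FROM events;")
    rows = cur.fetchall()  # tries to hold the entire ~350 GB result set in the server's RAM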

Server Side Cursor (Attempt #2)


In the psycopg2 documentation I found out about server-side cursors, a wrapper over the DECLARE & FETCH SQL keywords, which lets the user scroll/iterate over the retrieved result set, loading it into memory chunk by chunk (the number of rows per chunk is set by the cursor’s itersize attribute).
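A sketch of the server-side cursor variant, again with made-up table, file and connection names; the itersize value is just an example:

import psycopg2

DSN = "host=<redshift-endpoint> port=5439 dbname=analytics user=ml_user password=<secret>"
conn = psycopg2.connect(DSN)

# Passing a name makes psycopg2 DECLARE a server-side cursor and FETCH from it in chunks.
with conn.cursor(name="big_fetch") as cur, open("events.tsv", "w") as out:
    cur.itersize = 100000  # rows pulled per round trip
    cur.execute("SELECT * FROM events;")
    for row in cur:  # psycopg2 fetches the next chunk behind the scenes
        out.write("\t".join(map(str, row)) + "\n")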

Trying this solution did solve one problem, but created a new one:

“Exceeded the maximum size allowed for the total set of cursor data: 38400MB”

It seems Redshift has cursor constraints dependent on the cluster size.

“When the first row of a cursor is fetched, the entire result set is materialized on the leader node. If the result set does not fit in memory, it is written to disk as needed. To protect the integrity of the leader node, Amazon Redshift enforces constraints on the size of all cursor result sets, based on the cluster’s node type.”

Fetching Into A Table (Attempt #3)

So if I can’t fetch the data directly from the database, I might try another approach: select the data into a new table, and fetch it from there later.

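Roughly what this looks like with psycopg2; the target table name is made up:

import psycopg2

DSN = "host=<redshift-endpoint> port=5439 dbname=analytics user=ml_user password=<secret>"
conn = psycopg2.connect(DSN)
conn.autocommit = True  # commit the statement immediately instead of holding a transaction open

with conn.cursor() as cur:
    # Materialize the result set as a new table inside Redshift instead of fetching it.
    cur.execute("CREATE TABLE events_for_ml AS SELECT * FROM events;")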

In this case I don’t need the cursor anymore, since the data stays in Redshift.

Trying this approach quickly failed, since the Redshift cluster didn’t have enough disk space to hold the new table.

Unload Is A Winner (Attempt #4)

I realised loading into a table is not really necessary. I can unload the data directly into S3; there’s no need to hold it in a table first. From S3 I can easily download it to the server or a local machine.

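A sketch of the UNLOAD approach; the bucket, prefix and IAM role are placeholders:

import psycopg2

DSN = "host=<redshift-endpoint> port=5439 dbname=analytics user=ml_user password=<secret>"
conn = psycopg2.connect(DSN)
conn.autocommit = True

# Redshift writes the query result straight to S3, in parallel, from the compute nodes.
unload_sql = """
    UNLOAD ('SELECT * FROM events')
    TO 's3://my-ml-bucket/exports/events_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole'
    GZIP
    ALLOWOVERWRITE
"""
with conn.cursor() as cur:
    cur.execute(unload_sql)

From there, something like aws s3 cp s3://my-ml-bucket/exports/ ./events/ --recursive (paths are placeholders) pulls the compressed files down to the server or a local machine.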

No cursors and their constraints, no RAM or disk space limitations. Just using the right tool for the problem.
