Quickly Transform Huge CSV Files Using AWS Lambda with Amazon S3


When building micro-functionalities, you don't want to worry about server management, logging, performance, or scaling. That's where AWS Lambda comes into play.

Processing CSV Files

Say you have an application that lets users upload CSV files to Amazon S3, and each uploaded file needs to be processed afterward. To avoid blocking your application and to rate-limit the traffic, the processing should run as a background job.

Since AWS added S3 event notifications (S3 triggers), it is easy to specify that every time a file is uploaded to a specific bucket, an action is triggered, in this case an AWS Lambda function.
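As a rough sketch, the notification can be wired up with Boto3 along the following lines. The bucket name matches the one used later in this post, but the function ARN, region, and account ID are placeholders, and the function must already have granted S3 permission to invoke it (for example via lambda add-permission):

```python
import boto3

s3 = boto3.client("s3")

# Invoke the Lambda function whenever a .csv object is created in the bucket.
s3.put_bucket_notification_configuration(
    Bucket="amazon-search-terms",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                # Placeholder ARN; replace with your function's ARN.
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-csv",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": ".csv"}]}
                },
            }
        ]
    },
)
```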

AWS Lambda Function

To read and process S3 files we're going to use the Amazon Web Services (AWS) SDK for Python, Boto3.
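A minimal sketch of such a handler is shown below. The per-row transformation (trimming and lower-casing each field) is only an illustration; swap in whatever processing your application needs:

```python
import csv
import io
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    # The S3 trigger passes the bucket and object key in the event record.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])

    # Skip files we already produced, so the trigger doesn't loop on itself.
    if key.endswith("_processed.csv"):
        return {"statusCode": 200, "skipped": key}

    # Read the uploaded CSV into memory (fine for small and medium files).
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    reader = csv.reader(io.StringIO(body))

    # Apply a placeholder transformation to every row.
    output = io.StringIO()
    writer = csv.writer(output)
    for row in reader:
        writer.writerow([field.strip().lower() for field in row])

    # Write the result back to the same bucket under a new key,
    # e.g. search-terms.csv -> search-terms_processed.csv.
    processed_key = key.replace(".csv", "_processed.csv")
    s3.put_object(Bucket=bucket, Key=processed_key, Body=output.getvalue())

    return {"statusCode": 200, "processed_key": processed_key}
```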

This function reads a CSV file uploaded to amazon-search-terms and writes a new processed file to the same bucket, with _processed.csv appended to the name, which keeps processed files separate from their originals.

AWS Lambda Limitations

You can now configure your AWS Lambda functions to run up to 15 minutes per execution. Previously, the maximum execution time (timeout) for a Lambda function was 5 minutes.
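If your function was created with a lower timeout, you can raise it to the new 15-minute maximum with Boto3; the function name below is just a placeholder:

```python
import boto3

lambda_client = boto3.client("lambda")

# Raise the function timeout to the 15-minute maximum (900 seconds).
lambda_client.update_function_configuration(
    FunctionName="process-csv",  # placeholder function name
    Timeout=900,
)
```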

If you're dealing with files over 1 GB, consider AWS Athena, or optimize your AWS Lambda function to read the file as a stream instead of loading the whole file into memory.
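As a sketch of the streaming approach, instead of calling .read() on the response body you can iterate over it line by line, so only a small chunk is held in memory at any time:

```python
import csv

import boto3

s3 = boto3.client("s3")


def process_large_csv(bucket, key):
    # Stream the object instead of loading it fully into memory.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]

    # iter_lines() yields raw bytes; decode each line before parsing it as CSV.
    reader = csv.reader(line.decode("utf-8") for line in body.iter_lines())

    for row in reader:
        # Process one row at a time; for a large output, upload the results
        # with a multipart upload instead of buffering them all in memory.
        yield [field.strip().lower() for field in row]
```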

AWS Lambda Execution Role

To read/write data in AWS S3, your AWS Lambda function needs an execution role with the right policies attached, or you'll get an "Access Denied" error when you try to open a file for reading or writing.

You need to attach the IAMReadOnlyAccess and AmazonS3FullAccess policies; you can find an example below, which is for learning purposes only.
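As an illustration, the two AWS-managed policies can be attached to the function's execution role with Boto3; the role name here is just a placeholder:

```python
import boto3

iam = boto3.client("iam")

# Attach the AWS-managed policies to the Lambda execution role.
# Full S3 access is convenient while learning, but in production you should
# scope the policy down to the specific bucket.
for policy_arn in (
    "arn:aws:iam::aws:policy/IAMReadOnlyAccess",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
):
    iam.attach_role_policy(
        RoleName="process-csv-role",  # placeholder role name
        PolicyArn=policy_arn,
    )
```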

Conclusion

I have to say I'm pretty happy with the result, since the files I'm processing are all under 600 MB and each one takes an average of 5–10 minutes to process; for anything bigger than that, you can look into AWS Athena and AWS SQS.