When we work on any data science project, one essential step is loading data, for example from an API, into memory so we can process it.
That step can cause problems, and one of them is having too much data to process. If the size of our data is larger than the size of our available memory (RAM), we may not be able to load it at all, let alone finish the project.
So, what to do then?
There are different options to solve the problem of big data and small memory. These solutions cost either time or money.
- Money-costing solution: One possible solution is to buy a new computer with a more powerful CPU and more RAM that is capable of handling the entire dataset. Or, rent cloud or virtual machines and arrange them into a cluster to handle the workload.
- Time-costing solution: Your RAM might be too small to handle your data, but often, your hard drive is much larger than your RAM. So, why not just use it? Using the hard drive to deal with your data will make processing much slower, because even an SSD is slower than RAM.
Now, both those solutions are very valid, that is, if you have the resources to do so. If you have a big budget for your project or the time is not a constraint, then using one of those techniques is the simplest and most straightforward answer.
What if you can’t? What if you’re working on a budget? What if your data is so big, loading it from the drive will increase your processing time 5X or 6X or even more? Is there a solution to handling big data that doesn’t cost money or time?
I am glad you asked (or was it me who asked?).
There are some techniques that you can use to handle big data that don’t require spending any money or having to deal with long loading times. This article will cover 3 techniques that you can implement using Pandas to deal with large datasets.
Technique №1: Compression
The first technique we will cover is compressing the data. Compression here doesn’t mean putting the data in a ZIP file; it instead means storing the data in the memory in a compressed format.
In other words, compressing the data means finding a different way to represent it that uses less memory. There are two types of data compression: lossless and lossy. Both types only affect the loading of your data and won’t cause any changes in the processing section of your code.
Lossless compression doesn’t cause any losses in the data. That is, the original data and the compressed data are semantically identical.
For the remainder of this article, I will use a dataset that contains COVID-19 cases in the United States, broken down by county.
You can perform lossless compression on your data frames in 3 ways:
- Load specific columns
The dataset I am using has the following structure:
    import pandas as pd
    data = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")
    data.sample(10)
Loading the entire dataset takes 111 MB of memory!
However, I really only need two columns of this dataset, the county and cases columns, so why would I load the entire dataset? Loading only the two columns I need requires 36 MB, which is about 32% of the original footprint, or a decrease of roughly 68% in memory usage.
Pandas can load only the columns I need through the usecols parameter of read_csv.
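A minimal sketch of loading specific columns follows. To stay runnable without a download, it reads a tiny in-memory CSV instead of the real file; the rows are made up, and the column names are an assumption about the dataset's layout.

```python
import io
import pandas as pd

# Tiny stand-in for us-counties.csv. The rows are invented for illustration,
# and the column names are assumed to match the real file.
csv_text = """date,county,state,fips,cases,deaths
2020-03-01,Snohomish,Washington,53061,2,0
2020-03-01,King,Washington,53033,11,3
2020-03-01,Cook,Illinois,17031,3,0
"""

# Load every column, then only the two needed for the analysis.
full = pd.read_csv(io.StringIO(csv_text))
slim = pd.read_csv(io.StringIO(csv_text), usecols=["county", "cases"])

# deep=True counts the bytes held by the string values, not just the pointers.
full_bytes = full.memory_usage(deep=True).sum()
slim_bytes = slim.memory_usage(deep=True).sum()
print(f"all columns: {full_bytes} bytes, two columns: {slim_bytes} bytes")
```

With the real file, the same `usecols` argument can be passed alongside the URL, so the unused columns are never materialized in the data frame.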
- Manipulate datatypes
Another way to decrease the memory usage of our data is to truncate numerical items in the data. For example, whenever we load a CSV into a data frame, a column that contains integers is stored as int64, which takes 64 bits (8 bytes) per value. However, we can truncate that and use other int formats to save some memory.
int8 can store integers from -128 to 127.
int16 can store integers from -32768 to 32767.
int32 can store integers from -2147483648 to 2147483647.
int64 can store integers from -9223372036854775808 to 9223372036854775807.
If you know that the numbers in a particular column will never be higher than 32767, you can use an int16 and reduce the memory usage of that column by 75% compared with int64.
So, assume that the number of cases in each county can’t exceed 32767 (which is not true in real life); then we can truncate that column to int16 instead of int64.
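The downcast could look like the sketch below. The case counts are made up, and the range check before the cast is an added safeguard rather than something pandas requires.

```python
import numpy as np
import pandas as pd

# Made-up case counts, stored as int64, the type read_csv picks for integers.
cases = pd.Series([2, 11, 3, 250, 31000], name="cases", dtype="int64")

# Only downcast when every value fits in the target type's range.
limits = np.iinfo(np.int16)
assert cases.min() >= limits.min and cases.max() <= limits.max

cases_small = cases.astype("int16")

print(cases.memory_usage(index=False))        # 8 bytes per value
print(cases_small.memory_usage(index=False))  # 2 bytes per value
```

When the safe range is known before loading, passing `dtype={"cases": "int16"}` to `read_csv` avoids creating the int64 column in the first place.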
- Sparse columns
If the data has one or more columns with lots of empty values stored as NaN, you can save memory by using a sparse column representation, so you won’t waste memory storing all those empty values.
Assume the county column has some NaN values and I just want to skip over them; I can do that easily with a sparse series.
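As a rough illustration of the sparse representation, the sketch below uses a made-up numeric column that is mostly NaN (not the county column itself) and converts it with `pd.SparseDtype`.

```python
import numpy as np
import pandas as pd

# Hypothetical column that is mostly empty: 1,000 rows, only 10 real values.
values = np.full(1000, np.nan)
values[::100] = 1.0
dense = pd.Series(values)

# SparseDtype keeps only the entries that differ from the fill value (NaN).
sparse = dense.astype(pd.SparseDtype("float64", np.nan))

print(dense.memory_usage(index=False))   # 8 bytes for each of the 1,000 rows
print(sparse.memory_usage(index=False))  # the 10 stored values plus their positions
print(sparse.sparse.density)             # fraction of rows actually stored
```

The conversion is lossless: `sparse.sparse.to_dense()` gives back the original series, NaN values included.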
What if performing lossless compression wasn’t enough? What if I need to compress my data even more? In this case, you can use lossy compression, where you give up some accuracy in your data for the sake of memory usage.
You can perform lossy compression in two ways: modify numeric values and sampling.
- Modifying numeric values: Sometimes, you don’t need full accuracy in your numeric data, so you can truncate the values to a lower-precision type.
- Sampling: Maybe you want to prove that some states have higher COVID cases than others, so you take a sample of some counties to see which states have more cases. Doing that is considered lossy compression because you’re not considering all rows.
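Both lossy options can be sketched on made-up data. The column names, the float32 target type, and the 10% sampling fraction are illustrative choices, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up data: 10,000 rows with a state label and a float64 measurement.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "state": rng.choice(["Washington", "Illinois", "Texas"], size=10_000),
    "rate": rng.random(10_000),
})

# Lossy option 1: lower the precision from float64 to float32.
rate32 = df["rate"].astype("float32")
print(df["rate"].memory_usage(index=False), rate32.memory_usage(index=False))

# Lossy option 2: keep a random 10% of the rows.
sample = df.sample(frac=0.1, random_state=0)
print(len(df), len(sample))
```

Fixing `random_state` makes the sample reproducible, which helps when comparing results between runs.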
Technique №2: Chunking
Another way to handle large datasets is by chunking them. That is, cutting a large dataset into smaller chunks and then processing those chunks individually. After all the chunks have been processed, you can combine the per-chunk results to calculate the final findings.
This dataset contains 1923 rows.
Let’s assume I want to find the county with the highest number of cases. I can divide my dataset into chunks of 100 rows, process each of them individually, and then take the maximum of the smaller results.
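One way to sketch the chunked maximum is with the `chunksize` parameter of `read_csv`. The CSV below is generated in memory with invented county names and case counts, so only the pattern carries over to the real file.

```python
import io
import pandas as pd

# Made-up stand-in for the CSV file: 500 rows with assumed column names.
csv_text = "county,cases\n" + "\n".join(
    f"County{i},{(i * 37) % 1000}" for i in range(1, 501)
)

best_county, best_cases = None, -1

# chunksize makes read_csv yield DataFrames of at most 100 rows each,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=100):
    top = chunk.loc[chunk["cases"].idxmax()]
    if top["cases"] > best_cases:
        best_county, best_cases = top["county"], int(top["cases"])

print(best_county, best_cases)
```

Only the running best is kept between chunks, so the memory needed stays roughly constant no matter how many rows the file has.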
Technique №3: Indexing
Chunking is excellent if you need to load your dataset only once, but if you need to pull different subsets of the data repeatedly, then indexing is the way to go.
Think of indexing as the index of a book; you can know the necessary information about an aspect without needing to read the entire book.
For example, let’s say I want to get the cases for a specific state. In this case, chunking would make sense; I could write a simple function that accomplishes that.
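A chunk-based version of such a function might look like the sketch below. The signature of `get_state_info`, the sample rows, and the column names are assumptions made for illustration.

```python
import io
import pandas as pd

# Made-up stand-in for the CSV file; column names are assumed.
csv_text = """county,state,cases
King,Washington,11
Cook,Illinois,3
Snohomish,Washington,2
Harris,Texas,5
"""

def get_state_info(source, state, chunksize=2):
    """Collect the rows for one state by scanning the file chunk by chunk."""
    matches = [
        chunk[chunk["state"] == state]
        for chunk in pd.read_csv(source, chunksize=chunksize)
    ]
    return pd.concat(matches, ignore_index=True)

washington = get_state_info(io.StringIO(csv_text), "Washington")
print(washington)
```

Every call still parses the whole file, even though only the matching rows are kept.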
Indexing vs. chunking
In chunking, you need to read all data, while in indexing, you just need a part of the data.
So, my small function loads all the rows in each chunk but only cares about the ones for the state I want. That leads to significant overhead. I can avoid this by using a database alongside Pandas. The simplest one I can use is SQLite.
To do that, I first need to load my data frame into an SQLite database.
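Loading the data frame into SQLite could be sketched as follows. The table name `cases`, the index name `idx_state`, and the sample rows are illustrative assumptions.

```python
import sqlite3
import pandas as pd

# Made-up rows; column names are assumed.
data = pd.DataFrame({
    "county": ["King", "Cook", "Snohomish", "Harris"],
    "state": ["Washington", "Illinois", "Washington", "Texas"],
    "cases": [11, 3, 2, 5],
})

# ":memory:" keeps this sketch self-contained; a file path would put
# the database on disk instead.
conn = sqlite3.connect(":memory:")
data.to_sql("cases", conn, index=False, if_exists="replace")

# The index lets SQLite find one state's rows without scanning the table.
conn.execute("CREATE INDEX IF NOT EXISTS idx_state ON cases (state)")
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM cases").fetchone()[0]
print(row_count)
```

For a file that does not fit in memory, the same `to_sql` call can be run once per chunk with `if_exists="append"`, so the full data frame never has to exist at once.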
Then I need to rewrite my get_state_info function and use the database in it.
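A database-backed `get_state_info` might look like this sketch. It builds a small in-memory database first so it runs on its own; the table name, index name, rows, and function signature are all assumptions.

```python
import sqlite3
import pandas as pd

# Build a small in-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "county": ["King", "Cook", "Snohomish", "Harris"],
    "state": ["Washington", "Illinois", "Washington", "Texas"],
    "cases": [11, 3, 2, 5],
}).to_sql("cases", conn, index=False)
conn.execute("CREATE INDEX idx_state ON cases (state)")

def get_state_info(conn, state):
    """Fetch only the rows for one state; the other rows never reach pandas."""
    query = "SELECT * FROM cases WHERE state = ?"
    return pd.read_sql_query(query, conn, params=(state,))

washington = get_state_info(conn, "Washington")
print(washington)
```

The `?` placeholder passes the state name as a parameter instead of formatting it into the SQL string, which avoids quoting problems.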
By doing that, I can decrease the memory usage by 50%.
Handling big datasets can be such a hassle, especially if they don’t fit in your memory. Some solutions cost either time or money, and if you have the resources, they may be the simplest, most straightforward approach.
However, if you don’t have the resources, you can use some techniques in Pandas to decrease the memory usage of loading your data: techniques such as compression, indexing, and chunking.