How to load sequences from a dataframe in batches?

  • 2 April 2024
I'm loading sequences from a file that has 11 million rows, and I would like to load it in batches, where each batch is a dataframe. As I load each batch, the rows already in the sequence are deleted, so only the last batch remains. How can I load each batch without deleting the previous ones?


This is the code I use to read the file in batches and load rows into the sequence. It is assumed that the sequence is already created.



Karina Saylema

Hi @Karina Saylema 

Your code seems correct.
Could you make sure that “row_number” is unique? 
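
One way to check this, as a minimal sketch: the helper name `find_duplicate_row_numbers` and the small in-memory sample are made up for illustration, and it assumes the file has a `row_number` column and uses `;` as the separator. It reads only that column in chunks and collects any value seen more than once.

```python
import io

import pandas as pd


def find_duplicate_row_numbers(csv_source, column="row_number", sep=";", chunksize=10000):
    """Return the sorted values of `column` that occur more than once."""
    seen = set()
    duplicates = set()
    # Read only the column of interest, one chunk at a time, to limit memory use.
    for chunk in pd.read_csv(csv_source, sep=sep, usecols=[column], chunksize=chunksize):
        for value in chunk[column].tolist():
            if value in seen:
                duplicates.add(value)
            else:
                seen.add(value)
    return sorted(duplicates)


# Hypothetical sample standing in for the real file: row_number 0 appears twice.
sample = io.StringIO("row_number;value\n0;a\n1;b\n0;c\n2;d\n")
dupes = find_duplicate_row_numbers(sample, chunksize=2)
print(dupes)
```

An empty result means every row number is unique. For an 11 million row file the `seen` set holds every row number in memory, so this is meant as a one-off check.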

If you try the following batch loop, do you still encounter the same issue?

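import pandas as pd

batch_size = 10000
external_id = 'TEST_UPLOAD4'
start_row = 0  # Row number at which the next chunk starts

for chunk in pd.read_csv(file_input, sep=';', chunksize=batch_size):
    chunk.drop(columns=["row_number"], inplace=True)
    # Give each chunk row numbers that continue from the previous chunk
    chunk.index = range(start_row, start_row + len(chunk))
    # Assumption: the call name was cut off in this post; insert_dataframe on an existing Cognite client is assumed
    client.sequences.data.insert_dataframe(chunk, external_id=external_id, dropna=False)
    start_row += len(chunk)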


@Karina Saylema 

I think what’s happening is that chunk.set_index() resets the index for each chunk, so every batch carries the same row indexes as the previous one and the sequence rows get overwritten.
I’m not a pandas expert, but explicitly setting the index based on your batch size might be what you are looking for.
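
A rough sketch of that idea, not tested against a real sequence: the generator name `iter_indexed_chunks` and the tiny in-memory sample are hypothetical, and the actual insert call is left out. Each chunk gets an index that continues where the previous chunk ended, so no two batches share a row number.

```python
import io

import pandas as pd


def iter_indexed_chunks(csv_source, sep=";", chunksize=10000):
    """Yield chunks whose index continues from the end of the previous chunk."""
    start_row = 0
    for chunk in pd.read_csv(csv_source, sep=sep, chunksize=chunksize):
        # Overwrite whatever index the chunk has with a running row number.
        chunk.index = range(start_row, start_row + len(chunk))
        # Advance by the real chunk length so a short final chunk is handled too.
        start_row += len(chunk)
        yield chunk


# Hypothetical sample standing in for the real file: five rows, read two at a time.
sample = io.StringIO("value\na\nb\nc\nd\ne\n")
indexes = [list(chunk.index) for chunk in iter_indexed_chunks(sample, chunksize=2)]
print(indexes)
```

Each yielded chunk could then be passed to whatever call inserts rows into the sequence. Because the index keeps counting up across chunks, a later batch would not reuse the row numbers of an earlier one.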

Thanks @Andrian Gasper, @Jason Dressel for your help 

