How to load sequences from a dataframe in batches?

Question

Hi everyone,

I'm loading sequences from a file that has 11 million rows, and I would like to load it in batches, where each batch is a dataframe. Each time I load a batch, the rows already in the sequence are deleted, so only the last batch remains. How can I load each batch without deleting the previous ones?

This is the code I use to read the file in batches and load rows into the sequence. It is assumed that the sequence is already created.

Regards,

Karina Saylema

@Aditya Kotiyal @HanishSharma @Jason Dressel @Jairo Salaya @Liliana Sierra

Andrian Gasper · Accepted Answer

Hi @Karina Saylema

Your code seems correct. Could you make sure that “row_number” is unique? If you try the code below, do you encounter the same issue?

    import pandas as pd

    # file_input and client are assumed to be defined already
    batch_size = 10000
    external_id = 'TEST_UPLOAD4'
    start_row = 0  # Initialize a starting row number

    for chunk in pd.read_csv(file_input, sep=';', chunksize=batch_size):
        chunk.drop(columns=["row_number"], inplace=True)
        # Give each chunk row numbers that continue from the previous chunk
        chunk.index = range(start_row, start_row + len(chunk))
        client.sequences.data.insert_dataframe(chunk, external_id=external_id, dropna=False)
        start_row += batch_size
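One way to check whether “row_number” is unique without loading all 11 million rows at once is to scan the file in chunks and collect any value seen more than once. This is only a sketch: the helper name `find_duplicate_row_numbers` and the sample data are made up for illustration, and it assumes the column is called `row_number` and the file is `;`-separated, as in the code above.

```python
import io

import pandas as pd


def find_duplicate_row_numbers(csv_source, column="row_number", sep=";", chunksize=10_000):
    """Return the sorted values of `column` that occur more than once (hypothetical helper)."""
    seen = set()
    duplicates = set()
    for chunk in pd.read_csv(csv_source, sep=sep, usecols=[column], chunksize=chunksize):
        values = chunk[column]
        # Values repeated inside the current chunk
        duplicates.update(values[values.duplicated()].tolist())
        # Values that already appeared in an earlier chunk
        duplicates.update(seen.intersection(values.tolist()))
        seen.update(values.tolist())
    return sorted(duplicates)


# Small made-up sample: row numbers 0 and 1 each appear twice
sample = io.StringIO("row_number;value\n0;a\n1;b\n2;c\n1;d\n0;e\n")
duplicate_rows = find_duplicate_row_numbers(sample, chunksize=2)
print(duplicate_rows)
```

An empty list would mean every row number is unique. Any value in the list is a row that a later insert would overwrite.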

Jason Dressel · Answer

@Karina Saylema

I think what’s happening is that chunk.set_index() is resetting the index for each chunk, which results in the sequence rows being overwritten because the row indexes are the same as in the prior batch.

I’m not a pandas expert, but something like explicitly setting the index based on your batch size might be what you are looking for:

https://pandas.pydata.org/docs/reference/api/pandas.RangeIndex.html

Hope this helps,

Jason
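The RangeIndex idea can be tried with pandas alone, before anything is sent to the sequence. The sketch below gives each chunk an index that continues where the previous chunk ended, so no two chunks share a row number. The generator name `iter_offset_chunks` and the sample data are made up for illustration; in the real script each yielded chunk would be passed to client.sequences.data.insert_dataframe as in the accepted answer.

```python
import io

import pandas as pd


def iter_offset_chunks(csv_source, sep=";", chunksize=10_000):
    """Yield chunks whose index continues from the previous chunk (hypothetical helper)."""
    start_row = 0
    for chunk in pd.read_csv(csv_source, sep=sep, chunksize=chunksize):
        # Row numbers start_row .. start_row + len(chunk) - 1
        chunk.index = pd.RangeIndex(start=start_row, stop=start_row + len(chunk))
        # Advance by the actual chunk length, which also covers a short final chunk
        start_row += len(chunk)
        yield chunk


# Made-up sample with 5 rows, read in chunks of 2
sample = io.StringIO("value\n" + "\n".join(str(v) for v in range(5)))
row_numbers = [list(chunk.index) for chunk in iter_offset_chunks(sample, chunksize=2)]
print(row_numbers)
```

Because the row numbers never repeat across chunks, a later batch should not overwrite rows written by an earlier one.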
