Solved

How to load sequences from a dataframe in batches?

  • 2 April 2024
  • 4 replies
  • 51 views

Hi everyone,

 

I'm loading sequences from a file that has 11 million rows, and I would like to load it in batches, where each batch is a dataframe. As I load each batch, the rows the sequence already had are overwritten, leaving only the last batch. How can I load each batch without deleting the previous ones?

 

This is the code I use to read the file in batches and load rows into the sequence. It is assumed that the sequence is already created.
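
(The snippet itself is not reproduced in the post. As a hedged illustration only: sequence rows are keyed by the dataframe index, so if every batch is inserted with the same 0 to n-1 index, each insert overwrites the previous one. A minimal sketch of that failure mode, assuming file_input, external_id and an authenticated client are already defined:)

import pandas as pd

batch_size = 10000
for chunk in pd.read_csv(file_input, sep=';', chunksize=batch_size):
    # Hypothetical: resetting the index gives every chunk row numbers 0..len(chunk)-1,
    # so each insert targets the same sequence rows and overwrites the previous batch
    chunk = chunk.reset_index(drop=True)
    client.sequences.data.insert_dataframe(chunk, external_id=external_id, dropna=False)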

 

Regards,

Karina Saylema

@Aditya Kotiyal  @HanishSharma  @Jason Dressel  @Jairo Salaya @Liliana Sierra 

 

 

 


Best answer by Andrian Gasper 2 April 2024, 15:35


4 replies


Hi @Karina Saylema 

Your code seems correct.
Could you make sure that “row_number” is unique?

If you try the code below, do you encounter the same issue?
 

import pandas as pd

# `client` is an authenticated CogniteClient; `file_input` is the path to the CSV file
batch_size = 10000
external_id = 'TEST_UPLOAD4'
start_row = 0  # initialize a starting row number

for chunk in pd.read_csv(file_input, sep=';', chunksize=batch_size):
    # Drop the original row_number column and give this chunk an index that
    # continues where the previous chunk ended, so batches never overlap
    chunk.drop(columns=["row_number"], inplace=True)
    chunk.index = range(start_row, start_row + len(chunk))
    client.sequences.data.insert_dataframe(chunk, external_id=external_id, dropna=False)
    start_row += batch_size

 


@Karina Saylema 

I think what’s happening is that chunk.set_index() resets the index for each chunk, so the row indexes are the same as in the prior batch and the sequence rows are overwritten.
I’m not a pandas expert, but something like explicitly setting the index based on your batch size might be what you are looking for: https://pandas.pydata.org/docs/reference/api/pandas.RangeIndex.html
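
(For illustration only, not from the original reply: a minimal sketch of that approach with pandas.RangeIndex, assuming the same file_input and external_id placeholders as above:)

import pandas as pd

batch_size = 10000
start_row = 0
for chunk in pd.read_csv(file_input, sep=';', chunksize=batch_size):
    # Give this chunk row numbers that continue where the previous chunk ended
    chunk.index = pd.RangeIndex(start=start_row, stop=start_row + len(chunk))
    client.sequences.data.insert_dataframe(chunk, external_id=external_id, dropna=False)
    start_row += len(chunk)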

Hope this helps,
Jason

Thanks @Andrian Gasper, @Jason Dressel for your help 


@Karina Saylema You can mark this question as answered if you believe it is. :)
