Skip to main content
Solved

Excel Extractor


Pablo Palacios
Seasoned
Forum|alt.badge.img+3

Hello:
I am using DB Extractor  v 3.4.3 to read columns from an Excel file but it gives me the following error:

 

2024-08-20 18:53:06.550 UTC [ERROR   ] QueryExecutor_0 - Unexpected error in query1: Could not read file C:\Cognite\SAT-SX4 - 2023-03-21.xls: required package 'xlsx2csv' not found.
Please install using the command `pip install xlsx2csv`.

The following is my config file:

 

 databases:
   - type: spreadsheet    
     name: "DSN_FT"
     path: "C:\\Cognite\\SAT-SX4 - 2023-03-21.xls"

 queries:
   -    
     name: query1     
     database: "DSN_FT"
     sheet: "P_VSD"     
     query: 
       #"SELECT STS PreCOM FROM SAT-SX4 - 2023-03-21.xls"
       "SELECT * FROM P_VSD"

     destination:
       type: raw
       database: FTV
       table: AE_MOLYNOR

     primary-key: "{EventID}"  

What am I doing wrong? Can you help me?

 

Best answer by Dilini Fernando

Hi @Pablo Palacios,

The issue has been resolved in the latest version, 3.4.7. Please check and let me know if you encounter any problems.

View original
Did this topic help you find an answer to your question?

10 replies

Dilini Fernando
Seasoned Practitioner
Forum|alt.badge.img+2
  • Seasoned Practitioner
  • 670 replies
  • August 27, 2024

Hi @Pablo Palacios,

Thank you for bringing this to our attention. The engineering team has been informed, and I will update you as soon as I have an update on this.


Dilini Fernando
Seasoned Practitioner
Forum|alt.badge.img+2
  • Seasoned Practitioner
  • 670 replies
  • Answer
  • August 28, 2024

Hi @Pablo Palacios,

The issue has been resolved in the latest version, 3.4.7. Please check and let me know if you encounter any problems.


Pablo Palacios
Seasoned
Forum|alt.badge.img+3

Thank you very much, the problem has been solved


Pablo Palacios
Seasoned
Forum|alt.badge.img+3

The extraction from an Excel sheet works fine for me, now I have the problem that I want to extract data from different Excel sheets, the data is sent to CDF in RAW in different tables, but only both tables are overwritten by the second query, what am I doing wrong?

 

databases:
  - 
    type: spreadsheet
    name: "DBexcel"
    path: "C:\\Users\\ppalacios\\Documents\\Proyectos DPA\\RA Software\\FT DataMosaix\\SQM SX5\\SAT\\Protocolo SAT.xlsx"

queries:

  - name: query1 
    database: "DBexcel"
    sheet: "VDF"     
    query:  >
      SELECT * FROM VDF                   
    destination:
      type: raw
      database: "SX5"
      table: "SAT_VDF"
    primary-key: "{row_number}"   

  - name: query2 
    database: "DBexcel"
    sheet: "MTR"           
    query:  >
      SELECT * FROM MTR
    destination:
      type: raw
      database: "SX5"
      table: "SAT_MTR"
    primary-key: "{row_number}"

 


Pablo Palacios
Seasoned
Forum|alt.badge.img+3

As additional information, the extractor always sends me the first tab of the excel to the RAW and writes it in all the tables of the RAW


Pablo Palacios
Seasoned
Forum|alt.badge.img+3

hi @Dilini Fernando:

Could you see with the equipment what I mentioned above? Although the Excel extractor works now, it is not possible to extract information from other Excel sheets, only from the first one. In the extractor I defined it according to the documentation, but it does not work.


Forum|alt.badge.img

Hi and thanks for the report @Pablo Palacios!

Looks like we had a bug in the spreadsheet provider of the DB Extractor which I've attempted to fix. There will be a v3.4.8 release coming that should _actually_ respect the configured sheet name to extract from and not default to the first sheet in the list.


Pablo Palacios
Seasoned
Forum|alt.badge.img+3
  • Author
  • Seasoned
  • 31 replies
  • September 10, 2024

Thanks @Thomas Sjølshagen 

version 3.4.8 has already been released. I tried it and the other Excel sheets still cannot be read. Will another version come out with this improvement?


Hi Pablo,

I tested myself and I can confirm that the extractor is working fine when doing multiple queries ponting to different sheets of the same spreadsheet. (DB Extractor v3.5.0)

Example config:

logger:
  console:
    level: INFO

  # file:
  #   level: INFO
  #   path: "file.log"


# Read all of these from environment variables
cognite:
  host: ${COGNITE_BASE_URL}
  project: ${COGNITE_PROJECT}
  idp-authentication:
    client-id: ${COGNITE_CLIENT_ID}
    secret: ${COGNITE_CLIENT_SECRET}
    token-url: ${COGNITE_TOKEN_URL}
    scopes:
     - ${COGNITE_BASE_URL}/.default

extractor:
#   (Optional) Run mode of the extractor. It can be "single" or "continuous". Default value is "single".
#   `continuous` : runs the configured queries using the configured schedules per endpoint
#   `single` : will run the queries once per endpoint
#   mode: continuous

#   (Optional) Number of rows to fetch from sources before uploading
  upload-queue-size: 100_000

# List of databases
databases:
  - type: spreadsheet 
    name: local_excel
    path: /path/to/file/test.xlsx


# List of queries
queries:

  - name: test_excel_1
    database: local_excel
    sheet: test_1

    query: >
      SELECT
        *
      FROM
        test_1

    destination:
      type: raw
      database: "db_excel"
      table: "local_test_1"

    primary-key: "{row_number}"

  - name: test_excel_2
    database: local_excel
    sheet: test_2

    query: >
      SELECT
        *
      FROM
        test_2

    destination:
      type: raw
      database: "db_excel"
      table: "local_test_2"

    primary-key: "{row_number}"





Logs from the extractor run:
 

2024-09-18 11:58:37.488 UTC [INFO    ] MainThread - Successfully ingested environment variables from /Users/rogerio.junior@cognitedata.com/workspace/db-extractor/.env
2024-09-18 11:58:37.497 UTC [WARNING ] MainThread - No config type specified, default to local
2024-09-18 11:58:37.497 UTC [INFO    ] MainThread - Loaded local config file
2024-09-18 11:58:37.498 UTC [INFO    ] MainThread - No extraction pipeline configured
2024-09-18 11:58:37.498 UTC [INFO    ] MainThread - Cognite DB Extractor version 3.5.0
2024-09-18 11:58:37.498 UTC [INFO    ] MainThread - Starting extraction of 2 queries
2024-09-18 11:58:37.498 UTC [INFO    ] MainThread - Extraction will run once. To run the extractor continuously use continuous mode
2024-09-18 11:58:37.571 UTC [INFO    ] QueryExecutor_0 - Starting spreadsheet extraction
2024-09-18 11:58:37.571 UTC [INFO    ] QueryExecutor_1 - Starting spreadsheet extraction
2024-09-18 11:58:37.572 UTC [INFO    ] QueryExecutor_0 - Starting extraction of test_excel_1
2024-09-18 11:58:37.572 UTC [INFO    ] QueryExecutor_1 - Starting extraction of test_excel_2
2024-09-18 11:58:37.600 UTC [INFO    ] QueryExecutor_1 - Processing Excel file: 27 rows, 7 columns
2024-09-18 11:58:37.603 UTC [INFO    ] QueryExecutor_0 - Processing Excel file: 59 rows, 4 columns
2024-09-18 11:58:37.607 UTC [INFO    ] QueryExecutor_0 - No more rows for test_excel_1. 59 rows extracted in 0.0352 seconds
2024-09-18 11:58:37.608 UTC [INFO    ] QueryExecutor_1 - No more rows for test_excel_2. 27 rows extracted in 0.0362 seconds
2024-09-18 11:58:38.077 UTC [INFO    ] QueryExecutor_0 - Uploaded 86 raw rows
2024-09-18 11:58:38.078 UTC [INFO    ] QueryExecutor_1 - Uploaded 0 raw rows

Please find attached the spreadsheet file I used for this test. Remeber to change the path in the database config, if you would like to test this yourself.

 Regards,

Rogerio Junior
Cognite Engineering


Pablo Palacios
Seasoned
Forum|alt.badge.img+3
  • Author
  • Seasoned
  • 31 replies
  • September 20, 2024

Now with this version of the extractor it works as it should, thank you all very much @Rogerio Júnior 


Reply


Cookie Policy

We use cookies to enhance and personalize your experience. If you accept you agree to our full cookie policy. Learn more about our cookies.

 
Cookie Settings