Solved

File extractor - SharePoint Online documentation

  • 30 October 2023

Hi,

I was wondering where I can find information on how to set up permissions for a file extractor that should only read files from a few SharePoint sites, i.e., without giving the extractor access to all sites. I have solved it using the Graph API, but this is quite cumbersome, so I was wondering whether we have documentation on how this can be set up. I’ll post my solution below to show how I solved the issue:

_______________________________________________________________________________

You will need to create two applications: one that the file extractor uses, and one that is used to grant access to the file extractor app.

  1. Register the File Extractor App:
    • Go to the Azure Active Directory portal.
    • Click on "Azure Active Directory" then click on "App registrations".
    • Click on "New registration" to register a new app.
  2. Grant API Permissions:
    • Once your app is registered, click on the app's name to go to its dashboard.
    • Click on "API permissions" in the left panel.
    • Click "Add a permission".
    • Choose "Microsoft Graph" and then "Application permissions".
    • Look for the "Sites.Selected" permission and add it.
  3. Admin Consent:
    • For application permissions, an administrator needs to provide consent for the app to have the specified permissions.
    • In the "API permissions" tab, after adding the required permissions, you'll see the "Grant admin consent for [Your Organization]" button. Click it. This will grant the necessary permissions at the tenant level.
  4. Grant Access to the Specific SharePoint Site:
    • Now, even with admin consent, for "Sites.Selected", you still need to specify which sites your app can access.
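    • Create a second (admin) app with the Sites.FullControl.All (Application) and Sites.Read.All (Application) permissions and a client secret, and grant admin consent for it as in step 3. This app is only used to grant access to the File Extractor app.
  5. Give access using the Graph API
    • Fill in the IDs and secret of the admin app you just created, as well as the ID and name of the File Extractor app that should get access: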
    # GIVE ACCESS
    #AAD AppOnly for Graph API
    $tenantId="{tenantId}"
    $aadClientId = "{adminClientId}"
    $aadClientSecret = "{adminClientSecret}"
    $file_extractor_app_id = "{file_extractor_app_id}"
    $file_extractor_app_name = "{file_extractor_app_name}"
    $siteName = "Testsite"

    $scopes =  "https://graph.microsoft.com/.default"
    $loginURL = "https://login.microsoftonline.com/$tenantId/oauth2/v2.0/token"

    $Token = Invoke-RestMethod -Method Post -Uri $loginURL -Body @{grant_type="client_credentials";client_id=$aadClientId;client_secret=$aadClientSecret;scope=$scopes}
    $headers  = @{'Authorization'="$($Token.token_type) $($Token.access_token)"}

    # FIND SITE ID
    # Look up the site by hostname and path (replace 1bbzmd.sharepoint.com with your own SharePoint domain)
    $uri = "https://graph.microsoft.com/v1.0/sites/1bbzmd.sharepoint.com:/sites/$siteName"
    $response = Invoke-WebRequest -Method Get -Headers $headers -Uri $uri
    $jsonObject = $response.Content | ConvertFrom-Json
    $siteId = $jsonObject.id


    # Define the URI and body
    $uri = "https://graph.microsoft.com/v1.0/sites/$siteId/permissions"
    $body = @"
    {
        "roles": [
            "read"
        ],
        "grantedToIdentities": [
            {
                "application": {
                    "id": "$file_extractor_app_id",
                    "displayName": "$file_extractor_app_name"
                }
            }
        ]
    }
    "@


    # Perform the POST request using Invoke-WebRequest
    $response = Invoke-WebRequest -Uri $uri -Method POST -Body $body -Headers $headers -ContentType "application/json"

    # Output the HTTP status code
    $response.StatusCode

    # If you also want to see the status description:
    $response.StatusDescription

    # And if you want to see the content (body) of the response:
    $response.Content

 

You should get something like this:


You can now check whether the File Extractor App has access by running this script in PowerShell:

#CHECK ACCESS
#AAD AppOnly for Graph API
$tenantId="{tenantId}"
$aadClientId = "{fileExtractorClientId}"
$aadClientSecret = "{fileExtractorClientSecret}"
$siteName = "Testsite"

$scopes =  "https://graph.microsoft.com/.default"
$loginURL = "https://login.microsoftonline.com/$tenantId/oauth2/v2.0/token"


$body = @{grant_type="client_credentials";client_id=$aadClientId;client_secret=$aadClientSecret;scope=$scopes}

$Token = Invoke-RestMethod -Method Post -Uri $loginURL -Body $body
$headerParams  = @{'Authorization'="$($Token.token_type) $($Token.access_token)"}

#Graph API call to get site (replace 1bbzmd.sharepoint.com with your own SharePoint domain)
Invoke-WebRequest -Method Get -Headers $headerParams -Uri "https://graph.microsoft.com/v1.0/sites/1bbzmd.sharepoint.com:/sites/$siteName"


The application should now have the correct rights if it was able to return the information about the site. Change the siteName to another site to see that it doesn’t have access to all sites.
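
For anyone who would rather script this outside PowerShell, here is a minimal Python sketch of the same Graph calls (token, site lookup, grant), using only the standard library. The function names (`build_token_request`, `site_lookup_url`, `build_grant_request`, `send`) are made up for illustration, and the endpoints and payload are simply the ones from the scripts above, so treat it as a starting point rather than a finished tool.

```python
import json
import urllib.parse
import urllib.request

GRAPH = "https://graph.microsoft.com/v1.0"


def build_token_request(tenant_id, client_id, client_secret):
    """Return (url, form_body) for the client-credentials token call."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "https://graph.microsoft.com/.default",
    }
    return url, urllib.parse.urlencode(form).encode()


def site_lookup_url(hostname, site_name):
    """URL that resolves a site by hostname and path, as in the scripts above."""
    return f"{GRAPH}/sites/{hostname}:/sites/{urllib.parse.quote(site_name)}"


def build_grant_request(site_id, app_id, app_name, role="read"):
    """Return (url, json_body) granting `role` on one site to one app."""
    body = {
        "roles": [role],
        "grantedToIdentities": [
            {"application": {"id": app_id, "displayName": app_name}}
        ],
    }
    return f"{GRAPH}/sites/{site_id}/permissions", json.dumps(body).encode()


def send(url, data=None, token=None, content_type=None):
    """Send a request (POST if data is given, else GET) and parse the JSON reply."""
    headers = {}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    if content_type:
        headers["Content-Type"] = content_type
    req = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

To grant access, you would get a token for the admin app with `send(*build_token_request(...))`, look up the site with `send(site_lookup_url(...), token=...)`, and then POST the result of `build_grant_request(site["id"], ...)` with `content_type="application/json"`. To check access, do the token and site lookup steps with the File Extractor app's ID and secret instead.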

  6. Download and configure the file extractor
    • Create a .env file that looks like this:
COGNITE_CLIENT_ID="{COGNITE_APP_CLIENT_ID}"
COGNITE_CLIENT_SECRET="{COGNITE_APP_CLIENT_SECRET}"
COGNITE_PROJECT="{COGNITE_PROJECT}"
TENANT_ID="{TENANT_ID}"
CDF_CLUSTER="{CDF_CLUSTER}"
COGNITE_BASE_URL="{COGNITE_BASE_URL}" # e.g. https://greenfield.cognitedata.com
COGNITE_TOKEN_UTL="https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token"
SP_CLIENT_ID="{SHAREPOINT_FILE_EXTRACTOR_APP_CLIENT_ID}"
SP_CLIENT_SECRET="{SHAREPOINT_FILE_EXTRACTOR_APP_CLIENT_SECRET}"
SP_AZURE_TENANT_ID="{SP_AZURE_TENANT_ID}"
SP_BASE_URL="[sharepoint-domain].sharepoint.com"
SP_SITE="{sharepointSite}"
SP_DOCUMENT_LIBRARY="{sharepointSubSite/DocumentLibrary}"
SP_DATASET_EXTERNAL_ID="{SP_DATASET_EXTERNAL_ID}"


Use it with the following config file, and add other settings, such as extraction pipelines, as needed:

# Configuration template for the Cognite File Extractor version 0.1.0
#
# The config schema supports general interpolation with environment variables.
# To set a config parameter to an environment variable, use the ${ENV_VAR}
# syntax.
#
# Example: to set the client-password field to the content of the MY_PASSWORD
# environment variable, use
#   client-password: ${MY_PASSWORD}
#
# For optional parameters, the default values are provided as comments.
# Uncomment them to change their values. For most scenarios the default values
# should not be changed.


# (Optional) Configure logging to standard out (console) and/or file. Level can
# be DEBUG, INFO, WARNING or CRITICAL
logger:
  # Logging to console/terminal. Remove or comment out to disable terminal
  # logging
  console:
    level: INFO

  # Logging to file. Include to enable file logging
  #file:
    #level: INFO
    #path: "/path/to/file.log"

    # (Optional) Log retention (in days).
    #retention: 7


# Information about CDF project
cognite:
  # Read these from environment variables
  host: ${COGNITE_BASE_URL}
  project: ${COGNITE_PROJECT}

  idp-authentication:
    # OIDC client ID
    client-id: ${COGNITE_CLIENT_ID}

    # URL to fetch OIDC tokens from
    token-url: ${COGNITE_TOKEN_UTL}

    # Alternatively, you can specify an Azure tenant to generate token-url automatically
    #tenant: azure-tenant-uuid

    # OIDC client secret - either this or certificate is required
    secret: ${COGNITE_CLIENT_SECRET}

    # Uncomment to use a key/certificate pair instead of client secret to authenticate
    #certificate:
      # Path to key and certificate
      #path: /path/to/key_and_cert.pem

      # Authority URL (either this or tenant is required)
      #authority-url: https://url.com/

    # List of OIDC scopes to request
    scopes:
      - ${COGNITE_BASE_URL}/.default


  # Data set to attach uploaded files to. Either use CDF IDs (integers) or
  # user-given external ids (strings)
  data-set:
    #id: 1234
    external-id: ${SP_DATASET_EXTERNAL_ID}


# (Optional) Extractor performance tuning.
extractor:
  # (Optional) Schedule for triggering runs, i.e. the extractor will run continuously.
  # Can be either cron type or an interval. Supported units for intervals are
  # seconds (s), minutes (m), hours (h) and days (d).
  #schedule:
  #  type: cron
  #  expression: "*/5 8-16 * * 1-5"

  schedule:
    type: interval
    expression: 10s

  # (Optional) Where to store extraction states (progress) between runs.
  # Required for incremental load to work.
  #state-store:
    # Uncomment to use a local json file for state storage
    #local:
      #path:

      # (Optional) Save interval (in seconds) for intermediate file saves. A
      # final save will also be made on extractor shutdown.
      #save-interval: 30

    # Uncomment to use a RAW table for state storage
    #raw:
      # RAW database and table to use
      #database:
      #table:

      # (Optional) Upload interval (in seconds) for intermediate uploads. A
      # final upload will also be made on extractor shutdown.
      #upload-interval: 30

# Information about files to extract
files:
  # (Optional) A list of extensions to fetch. If included, only files matching
  # these extensions will be uploaded.
  #extensions:
  #  - .pdf
  #  - .tiff

  # (Optional) Maximum number of consecutive errors before crashing extractor
  #errors-threshold: 5

  # (Optional) Maximum file size of files before skipping upload. Default: no
  # maximum.
  #max-file-size: 64Mb

  # (Optional) Whether to let a single failure crash the extractor or not
  #fail-fast: false

  # (Optional) Include metadata in file upload
  #with-metadata: false

  # (Optional) CDF Labels to attach to uploaded files
  #labels:
  #  - label1
  #  - label2

  # (Optional) CDF Security Categories to assign to uploaded files
  #security-categories:
  #  - 123456
  #  - 234567

  # (Optional) Prefix added to the directory property on files in CDF.
  #directory-prefix: "/my/files"

  # Information about file provider
  file-provider:
    # Provider type. Supported types include local, sharepoint_online,
    # gcp_cloud_storage, azure_blob_storage, aws_s3, smb_protocol, ftp and sftp.
    type: sharepoint_online

    # User login for Sharepoint server
    client-id: ${SP_CLIENT_ID}
    client-secret: ${SP_CLIENT_SECRET}

    # Azure tenant ID
    tenant-id: ${SP_AZURE_TENANT_ID}

    # Base URL for Sharepoint server
    base-url: ${SP_BASE_URL}

    # Sharepoint site
    site: ${SP_SITE}

    # Document library to fetch from
    document-library: ${SP_DOCUMENT_LIBRARY}

    # (Optional) Whether to traverse into subfolders or not
    #with-subfolders: false
  7. Run the extractor
    • Run the extractor with the config file you just created
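
Since the config relies on `${VAR}` interpolation, a missing or empty variable is an easy mistake to make. Below is a small pre-flight check in Python, as a sketch only: `parse_env` and `missing_variables` are hypothetical helper names, the .env parsing is deliberately simplistic, and it only compares the two files statically. It does not say anything about how the extractor itself loads environment variables.

```python
import re

# Matches ${NAME} placeholders as used in the config file above.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def parse_env(text):
    """Parse simple KEY="value" lines; tolerates spaces around '=' and inline comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        raw = raw.strip()
        if raw.startswith('"'):
            # Quoted value: keep everything up to the closing quote.
            raw = raw[1:].split('"', 1)[0]
        else:
            # Unquoted value: drop a trailing inline comment.
            raw = raw.split(" #", 1)[0].strip()
        values[key.strip()] = raw
    return values


def missing_variables(config_text, env):
    """Return names used as ${NAME} in non-comment config lines that are absent or empty in env."""
    referenced = set()
    for line in config_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip commented-out options and the ${ENV_VAR} example in the header
        referenced.update(PLACEHOLDER.findall(line))
    return sorted(name for name in referenced if not env.get(name))
```

Reading both files and calling `missing_variables(config_text, parse_env(env_text))` should return an empty list when every placeholder in the config has a non-empty value in the .env file.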

_______________________________________________________________________________

Hi @Kristian Gjestad Vangsnes

Thanks a lot for your question and the information you have shared. We are currently working on some documentation for the SharePoint extractor. Hopefully that’ll help answer your question. I’ll let you know when it’s out!

Best,

Carin


Hi @Kristian Gjestad Vangsnes  

Please check the link below and see if it is helpful.

 

Hi @Carin Meems 

I am also looking for resources for the SharePoint extractor. Thanks for your help :)

 


Hi @Kristian Gjestad Vangsnes,

As @Rajendra Pasupuleti has noticed, the documentation has been added to Hub. Please use that for your reference. Hope it helps!

 

Hi @Rajendra Pasupuleti, this is all we have, currently. Is there any specific information you’re looking for? 


From my reading of the “Configure SharePoint Extractor” page, it gives the extractor full access to all sites. As mentioned, that is too broad in most cases; we simply want to give access to certain sites. If there is a way to do that with the same method, that would be greatly appreciated.


Hi @Kristian Gjestad Vangsnes,

Cognite has updated the documentation. Please take a look and let me know if you have more questions.

Br,
Dilini

 

 

 

 


Hi @Kristian Gjestad Vangsnes,

I hope the above helped. As of now, I’m closing this topic. Please feel free to create a new post if you have any questions.

Best regards,
Dilini

