Hi,
I was wondering where I can find information about how to set up permissions for a file extractor that reads files from only a few SharePoint sites, i.e., without giving the extractor access to all sites. I have solved it using the Graph API, but this is quite cumbersome, so I was wondering if there is documentation on how this can be set up. I'll post my solution below for clarity:
_______________________________________________________________________________
You will need to create two applications: one used by the file extractor itself, and one admin application used to grant the file extractor access.
- Register the File Extractor App:
  - Go to the Azure Active Directory portal.
  - Click on "Azure Active Directory", then click on "App registrations".
  - Click on "New registration" to register a new app.
- Grant API Permissions:
  - Once your app is registered, click on the app's name to go to its dashboard.
  - Click on "API permissions" in the left panel.
  - Click "Add a permission".
  - Choose "Microsoft Graph" and then "Application permissions".
  - Look for the "Sites.Selected" permission and add it.
- Admin Consent:
  - For application permissions, an administrator needs to provide consent for the app to have the specified permissions.
  - In the "API permissions" tab, after adding the required permissions, you'll see the "Grant admin consent for [Your Organization]" button. Click it. This will grant the necessary permissions at the tenant level.
- Grant Access to the Specific SharePoint Site:
  - Even with admin consent, "Sites.Selected" still requires you to specify which sites your app can access.
  - Create a new admin app with the Sites.FullControl.All (Application) and Sites.Read.All (Application) permissions and a client secret.
  - Grant access using the Graph API with the script below. Fill in the IDs and secret of the admin app you created, as well as the ID and name of the app you need to give access to.
# GIVE ACCESS
#AAD AppOnly for Graph API
$tenantId="{tenantId}"
$aadClientId = "{adminClientId}"
$aadClientSecret = "{adminClientSecret}"
$file_extractor_app_id = "{file_extractor_app_id}"
$file_extractor_app_name = "{file_extractor_app_name}"
$siteName = "Testsite"
$scopes = "https://graph.microsoft.com/.default"
$loginURL = "https://login.microsoftonline.com/$tenantId/oauth2/v2.0/token"
$Token = Invoke-RestMethod -Method Post -Uri $loginURL -Body @{grant_type="client_credentials";client_id=$aadClientId;client_secret=$aadClientSecret;scope=$scopes}
$headers = @{'Authorization'="$($Token.token_type) $($Token.access_token)"}
# FIND SITE ID
# Define the URI for the site lookup
$uri = "https://graph.microsoft.com/v1.0/sites/1bbzmd.sharepoint.com:/sites/$siteName"
$response = Invoke-WebRequest -Method Get -Headers $headers -Uri $uri
$jsonObject = $response.Content | ConvertFrom-Json
$siteId = $jsonObject.id
# Define the URI and body
$uri = "https://graph.microsoft.com/v1.0/sites/$siteId/permissions"
$body = @"
{
  "roles": [
    "read"
  ],
  "grantedToIdentities": [
    {
      "application": {
        "id": "$file_extractor_app_id",
        "displayName": "$file_extractor_app_name"
      }
    }
  ]
}
"@
# Perform the POST request using Invoke-WebRequest
$response = Invoke-WebRequest -Uri $uri -Method POST -Body $body -Headers $headers -ContentType "application/json"
# Output the HTTP status code
$response.StatusCode
# If you also want to see the status description:
$response.StatusDescription
# And if you want to see the content (body) of the response:
$response.Content
If the request succeeds, you should get back a 201 Created status and a JSON body describing the new permission.
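For reference, a successful response to the permissions request returns a JSON body roughly like the following (the id value is a placeholder; the exact shape may vary slightly between Graph API versions):

```json
{
  "id": "{permissionId}",
  "roles": [
    "read"
  ],
  "grantedToIdentities": [
    {
      "application": {
        "id": "{file_extractor_app_id}",
        "displayName": "{file_extractor_app_name}"
      }
    }
  ]
}
```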
You can now check whether the File Extractor App has access by running this script in PowerShell:
#CHECK ACCESS
#AAD AppOnly for Graph API
$tenantId="{tenantId}"
$aadClientId = "{fileExtractorClientId}"
$aadClientSecret = "{fileExtractorClientSecret}"
$siteName = "Testsite"
$scopes = "https://graph.microsoft.com/.default"
$loginURL = "https://login.microsoftonline.com/$tenantId/oauth2/v2.0/token"
$body = @{grant_type="client_credentials";client_id=$aadClientId;client_secret=$aadClientSecret;scope=$scopes}
$Token = Invoke-RestMethod -Method Post -Uri $loginURL -Body $body
$headerParams = @{'Authorization'="$($Token.token_type) $($Token.access_token)"}
#Graph API call to get site
Invoke-WebRequest -Method Get -Headers $headerParams -Uri "https://graph.microsoft.com/v1.0/sites/1bbzmd.sharepoint.com:/sites/$siteName"
If the request returns the site information, the application has the correct rights. Change $siteName to another site to verify that the app does not have access to all sites.
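For completeness, the same grant flow can also be sketched in Python using only the standard library. It makes the same Graph API calls as the PowerShell scripts above; all IDs, secrets, and the SharePoint hostname are placeholders you need to fill in:

```python
# Sketch of the Sites.Selected grant flow in Python (stdlib only).
# Same endpoints as the PowerShell scripts; IDs and secrets are placeholders.
import json
import urllib.parse
import urllib.request

GRAPH = "https://graph.microsoft.com/v1.0"

def get_app_token(tenant_id: str, client_id: str, client_secret: str) -> str:
    """App-only (client credentials) token for Microsoft Graph."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    data = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "https://graph.microsoft.com/.default",
    }).encode()
    with urllib.request.urlopen(urllib.request.Request(url, data=data)) as resp:
        return json.load(resp)["access_token"]

def build_permission_body(app_id: str, app_name: str, roles=("read",)) -> dict:
    """Request body for POST /sites/{siteId}/permissions."""
    return {
        "roles": list(roles),
        "grantedToIdentities": [
            {"application": {"id": app_id, "displayName": app_name}},
        ],
    }

def grant_site_access(token, hostname, site_name, app_id, app_name):
    """Resolve the site ID, then grant the extractor app read access to it."""
    headers = {"Authorization": f"Bearer {token}",
               "Content-Type": "application/json"}
    req = urllib.request.Request(
        f"{GRAPH}/sites/{hostname}:/sites/{site_name}", headers=headers)
    with urllib.request.urlopen(req) as resp:
        site_id = json.load(resp)["id"]
    body = json.dumps(build_permission_body(app_id, app_name)).encode()
    req = urllib.request.Request(f"{GRAPH}/sites/{site_id}/permissions",
                                 data=body, headers=headers, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Call `get_app_token` with the admin app's credentials, then pass the token to `grant_site_access` together with the extractor app's ID and display name.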
- Download and configure the file extractor:
  - Create a .env file that looks like this:
COGNITE_CLIENT_ID="{COGNITE_APP_CLIENT_ID}"
COGNITE_CLIENT_SECRET="{COGNITE_APP_CLIENT_SECRET}"
COGNITE_PROJECT="{COGNITE_PROJECT}"
TENANT_ID="{TENANT_ID}"
CDF_CLUSTER="{CDF_CLUSTER}"
COGNITE_BASE_URL="{COGNITE_BASE_URL}" # e.g. https://greenfield.cognitedata.com
COGNITE_TOKEN_UTL="https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token"
SP_CLIENT_ID="{SHAREPOINT_FILE_EXTRACTOR_APP_CLIENT_ID}"
SP_CLIENT_SECRET="{SHAREPOINT_FILE_EXTRACTOR_APP_CLIENT_SECRET}"
SP_AZURE_TENANT_ID="{SP_AZURE_TENANT_ID}"
SP_BASE_URL="[sharepoint-domain].sharepoint.com"
SP_SITE="{sharepointSite}"
SP_DOCUMENT_LIBRARY="{sharepointSubSite/DocumentLibrary}"
SP_DATASET_EXTERNAL_ID="{SP_DATASET_EXTERNAL_ID}"
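As a quick sanity check before the first run, you can verify that the .env file defines every variable the config file below references. This is a minimal sketch; it assumes simple KEY="value" lines like the ones above:

```python
# Minimal .env sanity check: parse KEY="value" lines and report which of the
# variables referenced by the extractor config are missing or empty.
import re

REQUIRED = [
    "COGNITE_CLIENT_ID", "COGNITE_CLIENT_SECRET", "COGNITE_PROJECT",
    "COGNITE_BASE_URL", "COGNITE_TOKEN_UTL", "SP_CLIENT_ID",
    "SP_CLIENT_SECRET", "SP_AZURE_TENANT_ID", "SP_BASE_URL",
    "SP_SITE", "SP_DOCUMENT_LIBRARY", "SP_DATASET_EXTERNAL_ID",
]

def parse_env(text: str) -> dict:
    """Parse simple KEY=value lines; strips optional quotes and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        m = re.match(r'([A-Za-z_][A-Za-z0-9_]*)\s*=\s*"?([^"]*)"?$', line)
        if m:
            env[m.group(1)] = m.group(2).strip()
    return env

def missing_keys(text: str) -> list:
    """Names from REQUIRED that are absent or empty in the .env text."""
    env = parse_env(text)
    return [key for key in REQUIRED if not env.get(key)]
```

Run `missing_keys(open(".env").read())`; an empty list means all referenced variables are set.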
Use it with the following config file; add other settings, such as extraction pipelines, as needed:
# Configuration template for the Cognite File Extractor version 0.1.0
#
# The config schema supports general interpolation with environment variables.
# To set a config parameter to an environment variable, use the ${ENV_VAR}
# syntax.
#
# Example: to set the client-password field to the content of the MY_PASSWORD
# environment variable, use
# client-password: ${MY_PASSWORD}
#
# For optional parameters, the default values are provided as comments.
# Uncomment them to change their values. For most scenarios the default values
# should not be changed.
# (Optional) Configure logging to standard out (console) and/or file. Level can
# be DEBUG, INFO, WARNING or CRITICAL
logger:
  # Logging to console/terminal. Remove or comment out to disable terminal
  # logging
  console:
    level: INFO
  # Logging to file. Include to enable file logging
  #file:
    #level: INFO
    #path: "/path/to/file.log"
    # (Optional) Log retention (in days).
    #retention: 7

# Information about CDF project
cognite:
  # Read these from environment variables
  host: ${COGNITE_BASE_URL}
  project: ${COGNITE_PROJECT}
  idp-authentication:
    # OIDC client ID
    client-id: ${COGNITE_CLIENT_ID}
    # URL to fetch OIDC tokens from
    token-url: ${COGNITE_TOKEN_UTL}
    # Alternatively, you can specify an Azure tenant to generate token-url automatically
    #tenant: azure-tenant-uuid
    # OIDC client secret - either this or certificate is required
    secret: ${COGNITE_CLIENT_SECRET}
    # Uncomment to use a key/certificate pair instead of client secret to authenticate
    #certificate:
      # Path to key and certificate
      #path: /path/to/key_and_cert.pem
    # Authority URL (either this or tenant is required)
    #authority-url: https://url.com/
    # List of OIDC scopes to request
    scopes:
      - ${COGNITE_BASE_URL}/.default
  # Data set to attach uploaded files to. Either use CDF IDs (integers) or
  # user-given external ids (strings)
  data-set:
    #id: 1234
    external-id: ${SP_DATASET_EXTERNAL_ID}

# (Optional) Extractor performance tuning.
extractor:
  # (Optional) Schedule for triggering runs, i.e. the extractor will run
  # continuously. Can be either cron type or an interval. Supported units for
  # intervals are seconds (s), minutes (m), hours (h) and days (d).
  #schedule:
    #type: cron
    #expression: "*/5 8-16 * * 1-5"
  schedule:
    type: interval
    expression: 10s
  # (Optional) Where to store extraction states (progress) between runs.
  # Required for incremental load to work.
  #state-store:
    # Uncomment to use a local json file for state storage
    #local:
      #path:
      # (Optional) Save interval (in seconds) for intermediate file saves. A
      # final save will also be made on extractor shutdown.
      #save-interval: 30
    # Uncomment to use a RAW table for state storage
    #raw:
      # RAW database and table to use
      #database:
      #table:
      # (Optional) Upload interval (in seconds) for intermediate uploads. A
      # final upload will also be made on extractor shutdown.
      #upload-interval: 30

# Information about files to extract
files:
  # (Optional) A list of extensions to fetch. If included, only files matching
  # these extensions will be uploaded.
  #extensions:
    #- .pdf
    #- .tiff
  # (Optional) Maximum number of consecutive errors before crashing extractor
  #errors-threshold: 5
  # (Optional) Maximum file size of files before skipping upload. Default: no
  # maximum.
  #max-file-size: 64Mb
  # (Optional) Whether to let a single failure crash the extractor or not
  #fail-fast: false
  # (Optional) Include metadata in file upload
  #with-metadata: false
  # (Optional) CDF Labels to attach to uploaded files
  #labels:
    #- label1
    #- label2
  # (Optional) CDF Security Categories to assign to uploaded files
  #security-categories:
    #- 123456
    #- 234567

# Information about file provider
file-provider:
  # Provider type. Supported types include local, sharepoint_online,
  # gcp_cloud_storage, azure_blob_storage, aws_s3, smb_protocol, ftp and sftp.
  type: sharepoint_online
  # (Optional) Prefix added to the directory property on files in CDF.
  #directory-prefix: "/my/files"
  # Client credentials for the SharePoint file extractor app
  client-id: ${SP_CLIENT_ID}
  client-secret: ${SP_CLIENT_SECRET}
  # Azure tenant ID
  tenant-id: ${SP_AZURE_TENANT_ID}
  # Base URL for Sharepoint server
  base-url: ${SP_BASE_URL}
  # Sharepoint site
  site: ${SP_SITE}
  # Document library to fetch from
  document-library: ${SP_DOCUMENT_LIBRARY}
  # (Optional) Whether to traverse into subfolders or not
  #with-subfolders: false
- Run the extractor with the config file you just created.
_______________________________________________________________________________