# Create Dataset
## Introduction

The Create Dataset action in CloudFiles Document AI enables you to group multiple files into a single logical dataset and process them together using AI. A dataset represents a collection of files that can be versioned, queried, and compared as a whole. This makes it ideal for use cases where insights need to be drawn across multiple documents rather than from a single file.

Just like individual documents must be processed before they can be queried, datasets must first be created and processed so that their contents become searchable and queryable through CloudFiles AI. This action is a foundational step for workflows that involve querying across multiple files, comparing data between documents, or asking high-level questions that span an entire dataset using docid\ dm3eh gzaocyoqd5y0r8d or docid\ ok1o1lpb07zvaclnrbgvu actions.

## What This Action Does

This action runs asynchronously, meaning it does not provide immediate output. The Create Dataset action processes multiple files together as a single dataset, preparing them for intelligent, cross-document querying within CloudFiles Document AI. Instead of returning output directly, it publishes a docid\ fe88tt1zz yvq yzv ens event once all dataset resources have been successfully processed. During processing, each file in the dataset is analyzed individually. The docid\ fe88tt1zz yvq yzv ens event includes the datasetId (a unique CloudFiles identifier), which is essential for querying the dataset in subsequent CloudFiles Document AI actions.

## Example Scenario

Consider a scenario where you need to analyze multiple KYC documents associated with an account or contact in Salesforce, such as passports, national identity cards, address proofs, and utility bills. Instead of processing each file individually, you may want to treat all these documents as a single logical unit and ask questions like:

* “Do all documents belong to the same person?”
* “Compare the address across all submitted documents.”
* “Is there any mismatch in
nationality or date of birth?”

To enable this process, you would:

1. Create a flow that collects multiple related files (for example, all KYC documents attached to a contact or uploaded via a CloudFiles widget) and uses the Create Dataset action to group and process them together.
2. Set up another flow triggered by the Dataset Created event, which references the datasetId and contextual information (such as the originating Salesforce record).
3. Use docid\ dm3eh gzaocyoqd5y0r8d or docid\ ok1o1lpb07zvaclnrbgvu actions in that flow to ask questions, compare values across files, or extract consolidated insights from the dataset.

## Input Parameters

In Flow Builder, search for CloudFiles Create Dataset under the CloudFiles category and configure the following inputs.

### Context (optional)

An optional identifier to track the source of the event or any other intended or necessary details. This value is available in the corresponding output, i.e., in the corresponding Dataset Created event details.

The context parameter helps identify the origin of a dataset when the Create Dataset action is used across multiple flows. For example, when creating a dataset from documents attached to a contact or account, you can pass the record’s Id as the context. This value is included in the docid\ fe88tt1zz yvq yzv ens event, allowing downstream flows to easily associate the dataset with the correct Salesforce record or process.

### Name (required)

A human-readable name for the dataset. This helps identify the dataset in queries, events, and version history.

### Dataset Id (optional)

If you want your new dataset to include files from an existing dataset, provide that dataset's Id here. The action will copy all resources from the referenced dataset into the newly created one, in addition to whatever files you specify in the Resources input. This is useful when you want to build on a previous dataset: for example, a contact submits additional KYC documents and you want to create a fresh dataset that includes both the previously
submitted files and the new ones. Note that this always creates a new dataset; it does not modify or append to the existing one.

### Resources (required)

The Resources input specifies the files that will be included in the dataset. This parameter must be an Apex-defined collection of docid\ mr7u7qdhigoasucpf nlr (cldfs resource) objects; only values provided in this collection are accepted by the Create Dataset action. Each cldfs resource represents a single file source. You can include files from Salesforce Files or from external storage (via CloudFiles Document Management), but not a mix of both.

#### Option A: Adding Salesforce Files

For each Salesforce file you want to include, set these properties on a cldfs resource variable (please note the parameter values are case-sensitive):

* library: salesforce
* fileId: the ContentDocumentId of the Salesforce file to be processed.

#### Option B: Adding External Storage Files

If you use the CloudFiles Document Management package, you can also include files from connected external storage. Set these properties:

* library: the external storage type you are using. Possible values are sharepoint, google (for Google Drive), onedrive, dropbox, box, azure, cloudfiles (for AWS S3). Please note the values are case-sensitive.
* driveId: the Id of the drive where the document resides. This is important for Google Drive and SharePoint libraries only. The drive Id is a unique identifier for a storage location: in SharePoint, it represents a document library within a site, while in Google Drive, it identifies a user's drive or shared drive.
* fileId: the unique identifier (resource Id) of the file to be processed.

#### Example: Building the Resources Collection in Flow Builder

**Scenario A: Looping through Salesforce files attached to a record**

Use this approach when
the files you want to include are Salesforce Files linked to a record (e.g., all files attached to a contact or account via ContentDocumentLink).

Step 1: Retrieve the files. Add a Get Records element to query ContentDocumentLink where LinkedEntityId equals the Id of your source record (e.g., a ContactId). Store the results in a collection variable; this gives you all the files attached to that record.

Step 2: Create your working variables. You need two new variables:

* A single variable of type cldfs resource; call it something like var_singleResource. This will be reused on every pass through the loop.
* A collection variable of type cldfs resource; call it something like var_resourcesCollection. This is what you'll ultimately pass to the Create Dataset action.

Step 3: Add a Loop element. Drag in a Loop element and configure it to iterate over the ContentDocumentLink collection from Step 1. On each iteration, Flow automatically gives you access to the current ContentDocumentLink record's fields.

Step 4: Assign values inside the loop. Inside the loop, add an Assignment element that sets the following on var_singleResource:

* library → salesforce
* fileId → the ContentDocumentId from the current loop iteration

Step 5: Add the resource to the collection. Still inside the loop, add another Assignment element (or extend the same one) that adds var_singleResource to var_resourcesCollection. The loop repeats Steps 4–5 for every file attached to the record. By the time it finishes, var_resourcesCollection contains a cldfs resource entry for each file, regardless of whether there were 2 files or 20.

Step 6: Pass the collection to the Create Dataset action. After the loop, connect the CloudFiles Create Dataset action and map var_resourcesCollection to its Resources input. The action now receives the complete, dynamically built list of files.

**Scenario B: Passing external storage files**

Use this approach when your files live in an external storage service connected through the CloudFiles Document
Management package (e.g., SharePoint, Google Drive, Dropbox, Box, OneDrive, Azure, or AWS S3).

Use an action like docid 9mdumpjrzqenyvrixspbd to fetch the files from your target folder in the connected storage. This action outputs an Apex-defined collection of docid\ mr7u7qdhigoasucpf nlr (cldfs resource), so you can pass it directly to the Create Dataset action.

Note: the docid 9mdumpjrzqenyvrixspbd action returns all resources within a folder, both files and sub-folders. The Create Dataset action only accepts files, so passing folder resources will cause errors. Filter such collections so that only file resources are added.

### Description (optional)

An optional description to provide context about the dataset's purpose or contents.

## Output Parameters

The Apex action does not return an output in the flow where it is used. Instead, for every dataset processed, a Dataset Created event is published. This event signals the completion of file processing and can be used to trigger platform event flows to perform actions such as docid\ dm3eh gzaocyoqd5y0r8d or docid\ ok1o1lpb07zvaclnrbgvu. If the action fails for some reason, an Error Event is published; this event can be used in a Decision element to diagnose and handle the error.
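The Scenario A loop described above can be sketched in plain Python as a mental model. This is not Flow or Apex code; the dict keys mirror the documented cldfs resource properties (library, fileId), and the record data is made up for illustration.

```python
# Illustrative model of Scenario A: build the Resources collection by
# looping over ContentDocumentLink-like records (not actual Flow/Apex code).

def build_salesforce_resources(content_document_links):
    """Return one resource entry per attached Salesforce file."""
    resources = []
    for link in content_document_links:            # Step 3: loop element
        resource = {                               # Step 4: assignment
            "library": "salesforce",               # case-sensitive value
            "fileId": link["ContentDocumentId"],
        }
        resources.append(resource)                 # Step 5: add to collection
    return resources                               # Step 6: pass to the action

# Two files attached to the same contact (made-up Ids)
links = [
    {"ContentDocumentId": "069XX0000001", "LinkedEntityId": "003XX0000001"},
    {"ContentDocumentId": "069XX0000002", "LinkedEntityId": "003XX0000001"},
]
print(build_salesforce_resources(links))
```

The same shape applies whether the record has 2 attachments or 20: the collection grows one entry per loop iteration.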
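The filtering requirement for Scenario B can be sketched the same way. The `type` field used here is hypothetical; check the actual fields exposed by the cldfs resource objects in your org before filtering on them.

```python
# Illustrative sketch: a folder listing returns files AND sub-folders, but
# Create Dataset accepts only files, so folder entries must be dropped.
# The "type" field is a hypothetical marker, not a documented property.

def keep_only_files(resources):
    """Drop any resource that is not a file."""
    return [r for r in resources if r.get("type") == "file"]

listing = [
    {"type": "file",   "library": "sharepoint", "driveId": "d1", "fileId": "f1"},
    {"type": "folder", "library": "sharepoint", "driveId": "d1", "fileId": "sub1"},
    {"type": "file",   "library": "sharepoint", "driveId": "d1", "fileId": "f2"},
]
print([r["fileId"] for r in keep_only_files(listing)])
```

In Flow Builder the equivalent is a Decision or Filter step between the listing action and the Create Dataset action.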
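Finally, the asynchronous handoff can be modeled as follows: the action publishes a Dataset Created event carrying the datasetId and whatever context you passed in, and a downstream platform-event-triggered flow uses that context to find the originating record. The event shape shown here is a simplified assumption, not the exact payload.

```python
# Illustrative sketch of the async pattern: consume a Dataset Created event
# in a downstream handler. Field names are assumed for illustration.

def handle_dataset_created(event):
    """Route the new dataset back to the record that produced it."""
    dataset_id = event["datasetId"]   # needed for later Doc AI query actions
    record_id = event["context"]      # e.g. the Contact Id passed as context
    return {"query_target": dataset_id, "salesforce_record": record_id}

event = {"datasetId": "ds_123", "context": "003XX0000001"}
print(handle_dataset_created(event))
```

This is why passing a record Id as the context input matters: without it, the event alone does not tell the downstream flow which Salesforce record the dataset belongs to.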