Datasets
In many business workflows, decisions rely on multiple related inputs rather than a single file. For example, a customer onboarding package may include identification documents, agreements, and tax forms, along with customer data from Salesforce. Similarly, a procurement transaction may involve quotes, purchase orders, invoices, and related system records.

When these inputs are analyzed individually, it becomes difficult to understand how they relate to each other. Comparing information across documents and system data often requires manual effort, and automation becomes harder when insights depend on multiple sources together. CloudFiles Datasets solve this by allowing you to group documents and Salesforce data into a single collection and analyze them collectively using AI.

How Datasets Work

A dataset represents a unified collection of all inputs relevant to a workflow or process, including both documents and Salesforce data. Rather than treating each file or data source in isolation, CloudFiles brings them together into a single context, enabling them to be analyzed as one cohesive unit. This enables two powerful capabilities:

- Cross-document analysis – AI can analyze multiple files together to compare values, detect duplicates, identify inconsistencies, or extract insights across documents.
- Document + Salesforce data analysis – Queries can combine information from documents with Salesforce data, allowing workflows that depend on both structured and unstructured information.

By working at the dataset level, CloudFiles enables workflows where the value comes from understanding how different inputs relate to each other, rather than simply extracting information from a single file.

Datasets are created using the docid\ gmwc xazlo4a9qanfcwih flow action, which generates a unique dataset ID. Documents can then be associated with this dataset so they can be analyzed collectively. AI queries can be executed across the dataset using the docid\ dm3eh gzaocyoqd5y0r8d or docid\ ok1o1lpb07zvaclnrbgvu actions, allowing natural language prompts to reference multiple documents at once.

Industry Use Cases

Datasets are useful in any scenario where multiple documents contribute to a single business decision:

- Legal teams can use datasets to compare different versions of contracts and identify changes before final approval.
- Procurement teams can group invoices, purchase orders, and supporting documents for a transaction and verify that values match across files.
- Financial institutions can analyze multiple onboarding documents together to validate identity information and ensure consistency.
- Datasets are also valuable in audit and compliance workflows, where reviewers often need to analyze document collections rather than individual files.

Example: Detecting Duplicate Purchase Orders

Upload multiple documents to the dataset and use the dataset playground to run queries across all files in that dataset. For example, suppose you upload a dataset containing multiple purchase order (PO) documents submitted by different schools or departments to the same vendor. These documents may include duplicate purchase orders, corrected versions, or resubmitted orders for the same purchase. You can then ask queries that analyze information across all documents in the dataset, such as identifying duplicate purchase orders based on key fields like PO number, vendor, or order details.

Example query:

Instruction: Detect duplicate purchase orders.

Task: Review all documents in the dataset and identify purchase orders that have the same PO/order number, vendor name, and delivery organization. Return the PO numbers that appear to be duplicates. If multiple duplicates exist, list them as separate entries.

Output:

[{"po number": "121212", "vendor name": "CloudFiles Learning Solutions", "delivery organization": "Sample High School (46101)"}]

You can easily test the dataset feature directly within CloudFiles. In the CloudFiles Document AI tab, create a dataset to group your related documents into a single collection. Once the dataset is created and documents are added, you can open the docid 2imyi 69ntxktr6s8yuz to run queries across the dataset.
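Conceptually, the duplicate-detection query above is a group-by over the key fields the prompt names (PO number, vendor name, delivery organization). The sketch below shows that logic in plain Python. It is illustrative only: the record structure and the extraction step that would produce these records are assumptions, and in CloudFiles this comparison is performed by the AI query itself, not by client code.

```python
from collections import defaultdict

def find_duplicate_pos(records):
    """Group extracted PO records by (PO number, vendor, delivery org)
    and return only the groups that span more than one document."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["po number"], rec["vendor name"], rec["delivery organization"])
        groups[key].append(rec)
    return [recs for recs in groups.values() if len(recs) > 1]

# Hypothetical records, as if extracted from three PO documents in a dataset.
records = [
    {"po number": "121212", "vendor name": "CloudFiles Learning Solutions",
     "delivery organization": "Sample High School (46101)", "source": "PO_original.pdf"},
    {"po number": "121212", "vendor name": "CloudFiles Learning Solutions",
     "delivery organization": "Sample High School (46101)", "source": "PO_resubmitted.pdf"},
    {"po number": "343434", "vendor name": "CloudFiles Learning Solutions",
     "delivery organization": "Sample High School (46101)", "source": "PO_other.pdf"},
]

duplicates = find_duplicate_pos(records)
# One duplicate group: PO 121212 appears in two documents.
```

The dataset-level query lets the AI perform this comparison across unstructured documents directly from a natural-language prompt, without you extracting fields or writing matching code yourself.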