Block: Extract Data
Define the specific information your AI Agent should retrieve from webpages and the structure for this data.
The Extract Data
block is where you tell the AI Agent precisely what information to pull from the current webpage and how to structure it. This is the core of most data collection workflows.
Purpose
Use the Extract Data
block to:
- Define a schema (a set of fields) for the data you want to collect.
- Instruct the agent to find and extract specific pieces of text, numbers, or other details from a page.
- Scrape data from a single item (e.g., product details on a product page).
- Extract a list of multiple items from a single page (e.g., all reviews, all products in a list).
- Transform or format data as it’s being extracted (e.g., translate text, format dates - by providing clear instructions).
This block follows after navigation blocks like Open Websites
, Follow Links
, or Explore Content
.
Configuration
-
What information should we scrape from each URL?
- This is the first crucial choice that determines how the agent approaches data extraction on the page.
- A list of items: Choose this if the page contains multiple similar objects you want to extract (e.g., a list of user reviews, search results, products on a category page). The agent will try to identify each item and extract the defined attributes for all of them.
- A single item: Choose this if the page is focused on one primary object and you want to extract various attributes of that single object (e.g., details of one specific product on its dedicated page, contact information from a contact page).
-
Defining Attributes (The Data Table):
- Once you’ve selected “A list of items” or “A single item,” you’ll define the structure of your data in a table-like interface.
- NAME (Column 1): This is the internal name or key for each piece of data you want to extract. It can be anything descriptive (e.g.,
product_name
,review_text
,price
). This name will be used as the column header or JSON key in your results. - EXAMPLE VALUE OR A LONGER DESCRIPTION (Column 2): This is a critical field. Here, you provide a clear description or an example of the data you want the AI to find for the corresponding NAME. The more detailed and unambiguous your description and examples are, the more accurate the extraction will be.
- Good example: For
NAME: author_name
, Description:The full name of the person who wrote the review, e.g., John Doe.
- Good example: For
NAME: rating
, Description:The star rating the user gave, as a number (e.g., 4 or 5).
- Good example: For
-
Add additional instructions and guidance to the AI when extracting data:
- This is a text field below the attribute table where you can provide general instructions that apply to the entire extraction process on the page.
- Use this to:
- Give context (e.g., “You are on a product page. Extract the following details.”).
- Specify how to handle missing data (e.g., “If you can’t find the target data, leave the corresponding field empty.”).
- Request data transformation (e.g., “Translate all extracted text into English.”).
- Provide negative constraints (e.g., “Do not extract prices from the ‘related items’ section.”).
-
Using Variables:
- You can use variables (from an
Apply Variables
block) within the “EXAMPLE VALUE OR A LONGER DESCRIPTION” column or in the “Additional instructions” field to make your extraction prompts dynamic. - Example: If you have a variable
{{brand_name}}
, you could have a description:Find the price for the {{brand_name}} product.
- You can use variables (from an
-
Advanced Options (Edit data shape):
- This mode allows technically proficient users to define the extraction schema directly in JSON format.
- This is powerful for:
- Creating nested data structures (e.g., an “author” object with “name” and “location” sub-fields).
- Fine-tuning the exact output format.
- If you use this, the visual editor is reflects the JSON structure.
[Screenshot: Extract Data block configuration showing item type selection and attribute table]
Examples
Example 1: Extracting Product Details (Single Item)
- Agent is on
https://example.com/product/123
- Extraction Type:
A single item
- Attributes:
NAME: product_title
,DESCRIPTION: The main name of the product.
NAME: price
,DESCRIPTION: The current selling price, e.g., $19.99.
NAME: stock_status
,DESCRIPTION: Availability status, e.g., 'In Stock' or 'Out of Stock'.
- Additional Instructions:
If a discount price is shown, extract that as the price.
Example 2: Extracting a List of Reviews
- Agent is on a product page with multiple reviews.
- Extraction Type:
A list of items
- Attributes:
NAME: reviewer_name
,DESCRIPTION: Name of the person who wrote the review.
NAME: review_date
,DESCRIPTION: Date the review was published, in YYYY-MM-DD format.
NAME: review_text
,DESCRIPTION: The full body text of the user's review.
NAME: star_rating
,DESCRIPTION: The number of stars given, e.g., 4 or 5.
- Additional Instructions:
Translate all reviews into English. If star rating is not found, leave empty.
Key Considerations
- Clarity is King: The accuracy of the AI’s extraction heavily depends on how clearly and unambiguously you define your fields and instructions. Provide examples whenever possible.
- One Page, One Schema: The schema defined in one
Extract Data
block applies to the single page the agent is currently on. If you navigate to a different page with a different structure, you’ll need anotherExtract Data
block. - Iterative Refinement: You might need to run your workflow, check the results, and then refine your descriptions or instructions in the
Extract Data
block to improve accuracy. - JSON for Complexity: Don’t hesitate to use the Advanced (JSON) mode if you need nested data or very specific output structures that are hard to define in the table.
- Chaining Extractions: Field names defined in an
Extract Data
block can be used as variables in subsequent blocks within the same iteration (e.g., if processing a list of items).
The Extract Data
block is where the magic of data collection happens in Jsonify!