Empower your AI apps with clean data from any website. Featuring advanced scraping, crawling, and data extraction capabilities.
This repository is in development, and we're still integrating custom modules into the monorepo. It's not fully ready for self-hosted deployment yet, but you can run it locally.
WebDataWizard is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data. We crawl all accessible subpages and give you clean data for each. No sitemap required. Check out our documentation for more details.
We provide an easy-to-use API with our hosted version. You can find the playground and documentation at WebDataWizard.dev. You can also self-host the backend if you'd like.
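To make the API shape concrete, here is a minimal sketch of a scrape request over plain HTTP. The endpoint path and payload fields below are assumptions for illustration; the hosted documentation is the authoritative reference.

```python
# Minimal sketch of a scrape request against the hosted API.
# The endpoint path and field names are assumptions for illustration.
import requests

response = requests.post(
    "https://api.WebDataWizard.dev/v1/scrape",  # assumed endpoint
    headers={"Authorization": "Bearer fc-YOUR_API_KEY"},
    json={"url": "https://WebDataWizard.dev", "formats": ["markdown"]},
)
response.raise_for_status()
print(response.json())  # clean markdown plus page metadata
```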
- LLM-ready formats: markdown, structured data, screenshot, HTML, links, metadata
- The hard stuff: proxies, anti-bot mechanisms, dynamic content (JS-rendered), output parsing, orchestration
- Customizability: exclude tags, crawl behind auth walls with custom headers, max crawl depth, and more
- Media parsing: PDFs, DOCX, images
- Reliability first: designed to get the data you need, no matter how hard it is
- Actions: click, scroll, input, wait, and more before extracting data
WebDataWizard allows you to perform various actions on a web page before scraping its content. This is particularly useful for interacting with dynamic content, navigating through pages, or accessing content that requires user interaction.
Here is an example of how to use actions to navigate to google.com, search for WebDataWizard, click on the first result, and take a screenshot.
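The request below sketches that flow against the HTTP API. The `actions` array, action types, and field names are illustrative assumptions rather than the definitive schema; check the documentation for the exact contract.

```python
# Hypothetical sketch: drive the page with actions before scraping.
# Action types and field names are assumptions for illustration only.
import requests

response = requests.post(
    "https://api.WebDataWizard.dev/v1/scrape",  # assumed endpoint
    headers={"Authorization": "Bearer fc-YOUR_API_KEY"},
    json={
        "url": "https://google.com",
        "formats": ["markdown"],
        "actions": [
            {"type": "wait", "milliseconds": 2000},
            {"type": "click", "selector": "textarea[title=Search]"},
            {"type": "write", "text": "WebDataWizard"},
            {"type": "press", "key": "ENTER"},
            {"type": "wait", "milliseconds": 3000},
            {"type": "click", "selector": "h3"},  # open the first result
            {"type": "screenshot"},               # capture the final page
        ],
    },
)
response.raise_for_status()
print(response.json())
```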
```bash
pip install WebDataWizard-py
```
```python
from WebDataWizard.WebDataWizard import WebDataWizardApp

app = WebDataWizardApp(api_key="fc-YOUR_API_KEY")

# Scrape a website:
scrape_status = app.scrape_url(
    'https://WebDataWizard.dev',
    params={'formats': ['markdown', 'html']}
)
print(scrape_status)

# Crawl a website:
crawl_status = app.crawl_url(
    'https://WebDataWizard.dev',
    params={
        'limit': 100,
        'scrapeOptions': {'formats': ['markdown', 'html']}
    },
    poll_interval=30
)
print(crawl_status)
```
With LLM extraction, you can easily extract structured data from any URL. We support Pydantic schemas to make it even easier. Here is how to use it:
```python
from typing import List

from pydantic import BaseModel, Field

from WebDataWizard.WebDataWizard import WebDataWizardApp

app = WebDataWizardApp(api_key="fc-YOUR_API_KEY")

class ArticleSchema(BaseModel):
    title: str
    points: int
    by: str
    commentsURL: str

class TopArticlesSchema(BaseModel):
    top: List[ArticleSchema] = Field(..., max_items=5, description="Top 5 stories")

data = app.scrape_url('https://news.ycombinator.com', params={
    'formats': ['extract'],
    'extract': {
        'schema': TopArticlesSchema.model_json_schema()
    }
})
print(data["extract"])
```
To install the WebDataWizard Node SDK, you can use npm:
```bash
npm install @mendable/WebDataWizard-js
```
- Get an API key from WebDataWizard.dev
- Set the API key as an environment variable named `WebDataWizard_API_KEY` or pass it as a parameter to the `WebDataWizardApp` class.
```js
import WebDataWizardApp, { CrawlParams, CrawlStatusResponse } from '@mendable/WebDataWizard-js';

const app = new WebDataWizardApp({apiKey: "fc-YOUR_API_KEY"});

// Scrape a website
const scrapeResponse = await app.scrapeUrl('https://WebDataWizard.dev', {
  formats: ['markdown', 'html'],
});

if (scrapeResponse) {
  console.log(scrapeResponse)
}

// Crawl a website
const crawlResponse = await app.crawlUrl('https://WebDataWizard.dev', {
  limit: 100,
  scrapeOptions: {
    formats: ['markdown', 'html'],
  }
} as CrawlParams, true, 30) as CrawlStatusResponse;

if (crawlResponse) {
  console.log(crawlResponse)
}
```
With LLM extraction, you can easily extract structured data from any URL. We support Zod schemas to make it even easier. Here is how to use it:
```js
import WebDataWizardApp from "@mendable/WebDataWizard-js";
import { z } from "zod";

const app = new WebDataWizardApp({
  apiKey: "fc-YOUR_API_KEY"
});

// Define a schema to extract contents into
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe("Top 5 stories on Hacker News"),
});

const scrapeResult = await app.scrapeUrl("https://news.ycombinator.com", {
  formats: ["extract"],
  extract: { schema: schema },
});

console.log(scrapeResult.extract);
```
WebDataWizard is open source, available under the AGPL-3.0 license.
To deliver the best possible product, we offer a hosted version of WebDataWizard alongside our open-source offering. The cloud solution allows us to continuously innovate and maintain a high-quality, sustainable service for all users.
WebDataWizard Cloud is available at WebDataWizard.dev and offers a range of features that are not available in the open-source version.
We love contributions! Please read our contributing guide before submitting a pull request. If you'd like to self-host, refer to the self-hosting guide.
It is the sole responsibility of the end users to respect websites' policies when scraping, searching and crawling with WebDataWizard. Users are advised to adhere to the applicable privacy policies and terms of use of the websites prior to initiating any scraping activities. By default, WebDataWizard respects the directives specified in the websites' robots.txt files when crawling. By utilizing WebDataWizard, you expressly agree to comply with these conditions.