
🔥 WebDataWizard

Empower your AI apps with clean data from any website. Featuring advanced scraping, crawling, and data extraction capabilities.

This repository is in development, and we're still integrating custom modules into the monorepo. It's not fully ready for self-hosted deployment yet, but you can run it locally.

What is WebDataWizard?

WebDataWizard is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data. We crawl all accessible subpages and give you clean data for each, no sitemap required.

How to use it?

We provide an easy-to-use API with our hosted version. You can find the playground and documentation at WebDataWizard.dev. You can also self-host the backend if you'd like.
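
If you prefer to call the API directly, a request looks roughly like the sketch below, using Python's requests library. The endpoint path and payload shape are assumptions modeled on the SDK examples later in this README, not confirmed API details:

import requests

# NOTE: the endpoint path and payload shape here are assumptions for
# illustration -- check the hosted docs for the real API contract.
response = requests.post(
    "https://api.WebDataWizard.dev/v1/scrape",           # assumed endpoint
    headers={"Authorization": "Bearer fc-YOUR_API_KEY"},
    json={"url": "https://WebDataWizard.dev", "formats": ["markdown", "html"]},
)
print(response.json())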

Powerful Capabilities

  • LLM-ready formats: markdown, structured data, screenshot, HTML, links, metadata
  • The hard stuff: proxies, anti-bot mechanisms, dynamic content (js-rendered), output parsing, orchestration
  • Customizability: exclude tags, crawl behind auth walls with custom headers, cap crawl depth, and more (see the sketch after this list)
  • Media parsing: PDFs, DOCX files, images
  • Reliability first: designed to get the data you need, no matter how hard it is
  • Actions: click, scroll, input, wait and more before extracting data
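
For example, the customizability options above could be combined in a single crawl call. This is a minimal sketch with the Python SDK; the parameter names (maxDepth, excludeTags, headers) are assumptions based on the capability list, so check the docs for the exact spelling:

from WebDataWizard.WebDataWizard import WebDataWizardApp

app = WebDataWizardApp(api_key="fc-YOUR_API_KEY")

# Parameter names below are assumed for illustration, not a confirmed schema.
crawl_status = app.crawl_url(
    'https://WebDataWizard.dev',
    params={
        'maxDepth': 2,                             # assumed: cap crawl depth
        'scrapeOptions': {
            'formats': ['markdown'],
            'excludeTags': ['nav', 'footer'],      # assumed: strip unwanted tags
            'headers': {'Cookie': 'session=YOUR_SESSION'},  # assumed: crawl behind auth
        },
    },
)
print(crawl_status)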

Interacting with the page with Actions (Cloud-only)

WebDataWizard allows you to perform various actions on a web page before scraping its content. This is particularly useful for interacting with dynamic content, navigating through pages, or accessing content that requires user interaction.

Here is an example of how to use actions to navigate to google.com, search for WebDataWizard, click on the first result, and take a screenshot.
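
Since the action schema isn't documented in this repository, the following is a minimal sketch of that flow with the Python SDK. The action names and fields are assumptions drawn from the capability list above (click, scroll, input, wait), not a confirmed schema:

from WebDataWizard.WebDataWizard import WebDataWizardApp

app = WebDataWizardApp(api_key="fc-YOUR_API_KEY")

# Action names and fields are assumed for illustration -- consult the
# hosted docs for the exact action schema.
result = app.scrape_url(
    'https://google.com',
    params={
        'formats': ['markdown'],
        'actions': [
            {'type': 'wait', 'milliseconds': 2000},      # let the page load
            {'type': 'input', 'text': 'WebDataWizard'},  # type the search query
            {'type': 'press', 'key': 'ENTER'},           # submit the search
            {'type': 'wait', 'milliseconds': 3000},      # wait for results
            {'type': 'click', 'selector': 'h3'},         # open the first result
            {'type': 'screenshot'},                      # capture the page
        ],
    },
)
print(result)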

Using the Python SDK

Installation

pip install WebDataWizard-py

Scrape and crawl a website

from WebDataWizard.WebDataWizard import WebDataWizardApp

app = WebDataWizardApp(api_key="fc-YOUR_API_KEY")

# Scrape a website:
scrape_status = app.scrape_url(
  'https://WebDataWizard.dev', 
  params={'formats': ['markdown', 'html']}
)
print(scrape_status)

# Crawl a website:
crawl_status = app.crawl_url(
  'https://WebDataWizard.dev', 
  params={
    'limit': 100, 
    'scrapeOptions': {'formats': ['markdown', 'html']}
  },
  poll_interval=30
)
print(crawl_status)
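
Once the crawl finishes, each crawled page should come back as its own document. Here is a quick sketch of walking the results, assuming the response carries a data list of per-page documents (an assumption for illustration, not a documented contract):

# Assumes crawl_status is the dict returned by crawl_url above and that it
# holds a 'data' list of per-page documents -- an assumption, not a
# documented contract.
for page in crawl_status.get('data', []):
    print(page.get('metadata', {}).get('sourceURL'))  # where the page came from
    print(page.get('markdown', '')[:200])             # preview of the clean markdown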

Extracting structured data from a URL

With LLM extraction, you can easily extract structured data from any URL. We support Pydantic schemas to make it even easier for you. Here is how to use it:

from typing import List

from pydantic import BaseModel, Field

from WebDataWizard.WebDataWizard import WebDataWizardApp

app = WebDataWizardApp(api_key="fc-YOUR_API_KEY")

class ArticleSchema(BaseModel):
    title: str
    points: int
    by: str
    commentsURL: str

class TopArticlesSchema(BaseModel):
    top: List[ArticleSchema] = Field(..., max_length=5, description="Top 5 stories")

data = app.scrape_url('https://news.ycombinator.com', {
    'formats': ['extract'],
    'extract': {
        'schema': TopArticlesSchema.model_json_schema()
    }
})
print(data["extract"])

Using the Node SDK

Installation

To install the WebDataWizard Node SDK, you can use npm:

npm install @mendable/WebDataWizard-js

Usage

  1. Get an API key from WebDataWizard.dev
  2. Set the API key as an environment variable named WebDataWizard_API_KEY, or pass it as a parameter to the WebDataWizardApp class.

import WebDataWizardApp, { CrawlParams, CrawlStatusResponse } from '@mendable/WebDataWizard-js';

const app = new WebDataWizardApp({apiKey: "fc-YOUR_API_KEY"});

// Scrape a website
const scrapeResponse = await app.scrapeUrl('https://WebDataWizard.dev', {
  formats: ['markdown', 'html'],
});

if (scrapeResponse) {
  console.log(scrapeResponse)
}

// Crawl a website
const crawlResponse = await app.crawlUrl('https://WebDataWizard.dev', {
  limit: 100,
  scrapeOptions: {
    formats: ['markdown', 'html'],
  }
} as CrawlParams, true, 30) as CrawlStatusResponse;

if (crawlResponse) {
  console.log(crawlResponse)
}

Extracting structured data from a URL

With LLM extraction, you can easily extract structured data from any URL. We support Zod schemas to make it even easier for you. Here is how to use it:

import WebDataWizardApp from "@mendable/WebDataWizard-js";
import { z } from "zod";

const app = new WebDataWizardApp({
  apiKey: "fc-YOUR_API_KEY"
});

// Define schema to extract contents into
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe("Top 5 stories on Hacker News"),
});

const scrapeResult = await app.scrapeUrl("https://news.ycombinator.com", {
  extractorOptions: { extractionSchema: schema },
});

console.log(scrapeResult.data["llm_extraction"]);

Open Source vs Cloud Offering

WebDataWizard is open source, available under the AGPL-3.0 license.

To deliver the best possible product, we offer a hosted version of WebDataWizard alongside our open-source offering. The cloud solution allows us to continuously innovate and maintain a high-quality, sustainable service for all users.

WebDataWizard Cloud is available at WebDataWizard.dev and offers a range of features that are not available in the open-source version.

Contributing

We love contributions! Please read our contributing guide before submitting a pull request. If you'd like to self-host, refer to the self-hosting guide.

It is the sole responsibility of the end users to respect websites' policies when scraping, searching and crawling with WebDataWizard. Users are advised to adhere to the applicable privacy policies and terms of use of the websites prior to initiating any scraping activities. By default, WebDataWizard respects the directives specified in the websites' robots.txt files when crawling. By utilizing WebDataWizard, you expressly agree to comply with these conditions.
