Commit

Update readme
Menghuan1918 committed May 31, 2024
1 parent 994e631 commit 1d23e67
Showing 2 changed files with 23 additions and 3 deletions.
12 changes: 11 additions & 1 deletion README.md
@@ -1,5 +1,7 @@
# pdfdeal

For better RAG!

🗺️ ENGLISH | [简体中文](README_CN.md)

Makes PDFs easier to deal with: extracts readable text, uses OCR to recognise text in images, and cleans up the formatting, making the result more suitable for knowledge base construction.
@@ -13,7 +15,15 @@ Its going to use [easyocr](https://github.com/JaidedAI/EasyOCR) to recognise the
## Support for Doc2x

Added support for Doc2x, which currently has a daily 500-page **free** usage quota, and its recognition of tables/formulas is excellent. You can also use Doc2x support module alone to convert pdf to markdown/latex/docx directly. See [Doc2x Support](./docs/doc2x.md).
Added support for Doc2x, which currently has a daily 500-page **free** usage quota, and its recognition of tables/formulas is excellent.

You can also use the Doc2x support module **alone** to convert PDF to markdown/latex/docx directly, as shown below. See [Doc2x Support](./docs/doc2x.md) for more.

```python
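from pdfdeal.doc2x import Doc2x

# Placeholder: replace with your own Doc2x API key
client = Doc2x(api_key="your_api_key")
# Convert the PDF and save the result under ./output
client.pdf2file(pdf_file="./ppt/test.pdf", output_path="./output", output_format="md_dollar", ocr=True)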
```

## Usage
See the [example code](https://github.com/Menghuan1918/pdfdeal?tab=readme-ov-file#processes-all-the-files-in-a-file-and-saves-them-in-the-output-folder).
14 changes: 12 additions & 2 deletions README_CN.md
@@ -1,5 +1,7 @@
# pdfdeal

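For better RAG results! (Admittedly, this reads a lot like machine translation.)

Deal with PDFs more easily: extract readable text, use OCR to recognise text in images, and clean up the formatting, making the result more suitable for building a knowledge base.

Uses [easyocr](https://github.com/JaidedAI/EasyOCR) to recognise images and add the recognised text to the original text. The output format can be set to PDF, which ensures that the text in the new PDF has the same page count as the original. In theory, using the processed PDF with knowledge base applications (such as [Dify](https://github.com/langgenius/dify) and [FastGPT](https://github.com/labring/FastGPT)) can achieve a better recognition rate.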
@@ -11,10 +13,18 @@
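## Support for Doc2x

Added support for Doc2x, which currently has a daily **free** usage quota of 500 pages, and its recognition of tables/formulas is excellent. You can also use the Doc2x support module on its own to convert PDF directly to formats such as markdown/latex/docx. See [Doc2x Support](./docs/doc2x_cn.md).
Added support for Doc2x, which currently has a daily **free** usage quota of 500 pages, and its recognition of tables/formulas is excellent.

## Usage
You can also use the Doc2x support module **on its own** to convert PDF directly to formats such as markdown/latex/docx, as shown below. For details, see [Doc2x Support](./docs/doc2x_cn.md).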

```python
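from pdfdeal.doc2x import Doc2x

# Placeholder: replace with your own Doc2x API key
client = Doc2x(api_key="your_api_key")
# Convert the PDF and save the result under ./output
client.pdf2file(pdf_file="./ppt/test.pdf", output_path="./output", output_format="md_dollar", ocr=True)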
```

## Usage
[Example code](https://github.com/Menghuan1918/pdfdeal/blob/main/README_CN.md#%E5%B0%86%E6%96%87%E4%BB%B6%E5%A4%B9%E4%B8%AD%E7%9A%84%E6%89%80%E6%9C%89%E6%96%87%E4%BB%B6%E8%BF%9B%E8%A1%8C%E5%A4%84%E7%90%86%E5%B9%B6%E6%94%BE%E7%BD%AE%E5%9C%A8output%E6%96%87%E4%BB%B6%E5%A4%B9%E4%B8%AD)

### Installation
Install from PyPI: