forked from iwsfutcmd/ideograph
Showing 7 changed files with 80 additions and 11 deletions.
@@ -102,3 +102,6 @@ venv.bak/

# mypy
.mypy_cache/

# vscode
.vscode/
@@ -1,2 +1,38 @@
# ideograph
Ideograph lookup by components

A tool to look up ideographs by their components. At the moment, it only contains Han characters, but it could be expanded to include other ideographic scripts such as Tangut or Sumero-Akkadian Cuneiform.

## Installation

```bash
$ pip install ideograph
```

## Usage

*ideograph* consists of a single function `find()`, which takes a string of ideograph components and returns a set of ideographs that include all of those components.

Characters in the component string that are not ideographic components are ignored.

Note that the current implementation is quite strict and relies on visual distinction for components rather than etymological connection: e.g. "人" ≠ "亻".

*ideograph* can either be used from the command line:

```bash
$ ideograph 木日勿
䵘楊鸉𣝻𣿘𥂸𥠜𦼴𩁒𪳷𬬍
```

or imported as a Python package:

```python3
>>> import ideograph
>>> ideograph.find("木日勿")
{'𣿘', '𣝻', '𥠜', '𪎥', '𩁒', '𪎧', '𥟘', '𣓗', '楊', '𣓾', '𬬍', '𪳷', '𦼴', '鸉', '䵘', '𥂸'}
```
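
The lookup semantics described above (every component must be present, non-component characters are ignored, and "人" is distinct from "亻") can be illustrated with a small set-intersection sketch. This is not the package's actual implementation: `build_index`, `find_sketch`, and the toy decomposition table are hypothetical names and data used only for illustration, and returning an empty set when no known component is given is an assumption.

```python
# Toy decomposition data, for illustration only (not taken from cjkvi-ids).
TOY_DECOMPOSITIONS = {
    "楊": "木日勿",
    "明": "日月",
    "林": "木木",
    "休": "亻木",
    "从": "人人",
}


def build_index(decompositions):
    """Map each component to the set of characters that contain it."""
    index = {}
    for char, components in decompositions.items():
        for component in components:
            index.setdefault(component, set()).add(char)
    return index


def find_sketch(index, components):
    """Return the characters that contain every known component given."""
    # Characters that are not known components are ignored.
    sets = [index[c] for c in components if c in index]
    if not sets:
        # Assumption: no usable components means no results.
        return set()
    return set.intersection(*sets)


index = build_index(TOY_DECOMPOSITIONS)
print(find_sketch(index, "木日勿"))  # {'楊'}
print(find_sketch(index, "人木"))    # set()
```

The second call returns nothing because "休" is indexed under "亻", not "人", which mirrors the strictness noted above.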

## Data

Character components are derived from the [cjkvi-ids database](https://github.com/cjkvi/cjkvi-ids) (included in this Git repository as a submodule), specifically the `ids-cdp.txt` data file. As some components do not currently have a Unicode code point assigned to them, they are given code points in the Private Use Area of Unicode. Note that because of this, some of these characters may be returned by the `find()` function.
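
Callers who do not want Private Use Area characters in their results could filter them out afterwards. The sketch below relies on the standard `unicodedata` module, which reports the general category `Co` for private-use code points; `is_private_use` and `drop_private_use` are hypothetical helper names and are not part of *ideograph*.

```python
import unicodedata


def is_private_use(ch):
    """True if `ch` is a Private Use code point (general category Co)."""
    return unicodedata.category(ch) == "Co"


def drop_private_use(chars):
    """Return a new set without Private Use Area characters."""
    return {ch for ch in chars if not is_private_use(ch)}


# Example with a hand-built set; in practice the input would be the
# set returned by ideograph.find().
sample = {"楊", "\ue000", "\U000f0000"}
print(drop_private_use(sample))  # {'楊'}
```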

Data is stored in a sqlite3 database, which can be regenerated from cjkvi-ids data by running the `generate_data.py` script.
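
The schema of the bundled database is not shown in this commit, so the following is only a sketch of how a component lookup could be expressed against sqlite3. The table `char_components`, its columns, and the functions `make_db` and `find_in_db` are assumptions made for illustration; they are not the output of `generate_data.py`.

```python
import sqlite3


def make_db(decompositions):
    """Build an in-memory database with one row per (component, ideograph)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema, not the one used by the package.
    conn.execute("CREATE TABLE char_components (component TEXT, ideograph TEXT)")
    rows = [
        (component, char)
        for char, components in decompositions.items()
        for component in set(components)
    ]
    conn.executemany("INSERT INTO char_components VALUES (?, ?)", rows)
    return conn


def find_in_db(conn, components):
    """Return ideographs that contain every distinct component given.

    Unlike the behaviour described in the README, this sketch does not
    ignore unknown characters: an unknown component yields no results.
    """
    wanted = sorted(set(components))
    if not wanted:
        return set()
    placeholders = ",".join("?" * len(wanted))
    query = (
        "SELECT ideograph FROM char_components "
        f"WHERE component IN ({placeholders}) "
        "GROUP BY ideograph HAVING COUNT(DISTINCT component) = ?"
    )
    return {row[0] for row in conn.execute(query, (*wanted, len(wanted)))}


conn = make_db({"楊": "木日勿", "明": "日月", "林": "木木"})
print(find_in_db(conn, "木日勿"))  # {'楊'}
```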
@@ -0,0 +1,13 @@
#!/usr/bin/env python3

import sys
import argparse

import ideograph

parser = argparse.ArgumentParser(description="Find ideographs by components.")
parser.add_argument("components", type=str, help="components to search for")
args = parser.parse_args()
output = ideograph.find(args.components)
sys.stdout.write("".join(sorted(output)))
sys.exit(0)
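
For comparison, the same command-line logic can be wrapped in a function so that it can be exercised without installing the package. This is a sketch and not part of the commit: `build_parser` and `main` are hypothetical names, the `find` callable is injected (in real use it would be `ideograph.find`), and the trailing newline is an addition that the script above does not write.

```python
import argparse
import sys


def build_parser():
    """Create the same single-argument parser the script uses."""
    parser = argparse.ArgumentParser(description="Find ideographs by components.")
    parser.add_argument("components", type=str, help="components to search for")
    return parser


def main(argv, find, out=sys.stdout):
    """Parse `argv`, look up the components with `find`, and write the result."""
    args = build_parser().parse_args(argv)
    # Sort for stable output; the newline is an assumption of this sketch.
    out.write("".join(sorted(find(args.components))) + "\n")
    return 0
```

A script could then end with `sys.exit(main(sys.argv[1:], ideograph.find))`.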
Binary file not shown.
@@ -0,0 +1,22 @@
import setuptools

with open("README.md", "r") as fh:
    long_description = fh.read()

setuptools.setup(
    name="ideograph",
    version="1.0.0",
    author="Ben Yang",
    author_email="[email protected]",
    description="Tool for finding ideographic (e.g. Han) characters from their components",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/iwsfutcmd/ideograph",
    packages=setuptools.find_packages(),
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
    ],
    scripts=["bin/ideograph"],
)