rearranging to make ready for PyPi
iwsfutcmd committed Mar 29, 2019
1 parent 2a06776 commit fd92940
Showing 7 changed files with 80 additions and 11 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -102,3 +102,6 @@ venv.bak/

# mypy
.mypy_cache/

# vscode
.vscode/
38 changes: 37 additions & 1 deletion README.md
@@ -1,2 +1,38 @@
# ideograph
Ideograph lookup by components

A tool to look up ideographs by their components. At the moment, it only contains Han characters, but it could be expanded to include other ideographic scripts such as Tangut or Sumero-Akkadian Cuneiform.

## Installation

```bash
$ pip install ideograph
```

## Usage

*ideograph* exposes a single function, `find()`, which takes a string of ideograph components and returns the set of ideographs that contain all of those components.

Characters in the component string that are not ideographic components are ignored.

Note that the current implementation is quite strict and relies on visual distinction for components rather than etymological connection: e.g. "人" ≠ "亻".

*ideograph* can either be used from the command line:

```bash
$ ideograph 木日勿
䵘楊鸉𣝻𣿘𥂸𥠜𦼴𩁒𪳷𬬍
```

or imported as a Python package:

```python3
>>> import ideograph
>>> ideograph.find("木日勿")
{'𣿘', '𣝻', '𥠜', '𪎥', '𩁒', '𪎧', '𥟘', '𣓗', '', '𣓾', '𬬍', '𪳷', '𦼴', '', '', '𥂸'}
```
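
The lookup described above amounts to intersecting, per component, the set of ideographs that contain it. The following is a minimal sketch of that idea, not the package's actual implementation: `TOY_INDEX` and `toy_find` are hypothetical names, and the handful of entries are illustrative rather than taken from the real database.

```python
# Toy stand-in for the real component index: component -> ideographs containing it.
# These entries are illustrative only and are not read from ids-data.db.
TOY_INDEX = {
    "木": {"楊", "林", "杳"},
    "日": {"楊", "明", "杳"},
    "勿": {"楊", "物"},
}


def toy_find(components):
    """Return the ideographs that contain every known component in `components`."""
    # Characters without an index entry are skipped, mirroring the README's
    # statement that non-component characters are ignored.
    sets = [TOY_INDEX[c] for c in components if c in TOY_INDEX]
    if not sets:
        return set()
    return set.intersection(*sets)


print(sorted(toy_find("木日勿")))  # only 楊 contains all three components
print(sorted(toy_find("木, 日")))  # the comma and space are ignored
```

Because the index is keyed on exact characters, a visually distinct form such as "亻" would need its own entry and would not match "人", which is the strictness the README mentions.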

## Data

Character components are derived from the [cjkvi-ids database](https://github.com/cjkvi/cjkvi-ids) (included in this Git repository as a submodule), specifically the `ids-cdp.txt` data file. As some components do not currently have a Unicode code point assigned to them, they are given code points in the Private Use Area of Unicode. Note that because of this, some of these characters may be returned by the `find()` function.

Data is stored in a sqlite3 database, which can be regenerated from cjkvi-ids data by running the `generate_data.py` script.
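
As a rough illustration of how such a database could be built, the sketch below inverts a small decomposition mapping and stores it in an in-memory SQLite table. The table name `ids_data` and the columns `ideo` and `ids` are taken from the query in `ideograph.py`; everything else (the sample decompositions, the column types, storing the ideographs as one concatenated string) is an assumption and may differ from what `generate_data.py` actually does.

```python
import sqlite3
from collections import defaultdict

# Hypothetical decompositions: ideograph -> its fully expanded components.
decompositions = {
    "楊": ["木", "昜", "日", "勿"],
    "林": ["木"],
    "明": ["日", "月"],
}

# Invert the mapping: component -> ideographs that contain it.
reverse = defaultdict(list)
for ideo, parts in decompositions.items():
    for part in parts:
        reverse[part].append(ideo)

# Assumed schema: one row per component, with the containing ideographs
# concatenated into a single string.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ids_data (ideo TEXT PRIMARY KEY, ids TEXT)")
conn.executemany(
    "INSERT INTO ids_data VALUES (?, ?)",
    [(part, "".join(ideos)) for part, ideos in reverse.items()],
)

row = conn.execute("SELECT ids FROM ids_data WHERE ideo = ?", ("木",)).fetchone()
print(row[0])  # 楊林
```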
13 changes: 13 additions & 0 deletions bin/ideograph
@@ -0,0 +1,13 @@
#!/usr/bin/env python3

import sys
import argparse

import ideograph

parser = argparse.ArgumentParser(description="Find ideographs by components.")
parser.add_argument("components", type=str, help="components to search for")
args = parser.parse_args()
output = ideograph.find(args.components)
sys.stdout.write("".join(sorted(output)))
sys.exit(0)
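
One way to make a script like this easier to exercise without a shell is to wrap the argument parsing in a function that accepts an explicit argument list. This is only a sketch of that pattern: `main` and `fake_find` are hypothetical names, and `fake_find` is a fixed stand-in for `ideograph.find()` so the block runs on its own.

```python
import argparse


def fake_find(components):
    # Stand-in for ideograph.find(); returns a fixed set for illustration.
    return {"楊", "䵘"} if components else set()


def main(argv=None, find=fake_find):
    parser = argparse.ArgumentParser(description="Find ideographs by components.")
    parser.add_argument("components", type=str, help="components to search for")
    args = parser.parse_args(argv)
    # Sorting by code point gives stable output regardless of set ordering.
    return "".join(sorted(find(args.components)))


print(main(["木日勿"]))  # 䵘楊
```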
1 change: 0 additions & 1 deletion generate_data.py
@@ -52,7 +52,6 @@ def recursive_breakup(charset):
     else:
         return recursive_breakup(output)

-# data = {ideo: list(recursive_breakup(data[ideo]) - {ideo}) for ideo in data}
 data = {ideo: list(recursive_breakup(data[ideo]) - {ideo}) for ideo in data}
 reverse_data = defaultdict(list)
 for ideo in data:
14 changes: 5 additions & 9 deletions __init__.py → ideograph.py
@@ -11,12 +11,8 @@
 def find(components):
     ideostr = f"{tuple(components)}" if len(components) > 1 else f"('{components}')"
     cursor.execute(f"SELECT ids FROM ids_data WHERE ideo in {ideostr}")
-    output = set.intersection(*[set(r[0]) for r in cursor.fetchall()])
-    return output
-
-if __name__ == "__main__":
-    components = sys.argv[1]
-    output = find(components)
-    sys.stdout.write("".join(sorted(output)))
-    conn.close()
-    sys.exit(0)
+    try:
+        output = set.intersection(*[set(r[0]) for r in cursor.fetchall()])
+    except TypeError:
+        output = set()
+    return output
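
The `try`/`except TypeError` added here covers the case where the query returns no rows: `set.intersection` called with no arguments raises `TypeError`. The sketch below shows that behavior and one possible alternative that checks for an empty result explicitly and uses SQL placeholders rather than formatting a Python tuple into the query, which may matter if the input contains quote characters. `find_sketch` and the three sample rows are hypothetical and not part of the commit.

```python
import sqlite3

# Calling set.intersection with no arguments raises TypeError; this is the
# situation the try/except in the diff guards against.
try:
    set.intersection(*[])
    raised = False
except TypeError:
    raised = True
print(raised)  # True

# Tiny in-memory table with the same table and column names as the query
# in the diff; the rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ids_data (ideo TEXT PRIMARY KEY, ids TEXT)")
conn.executemany(
    "INSERT INTO ids_data VALUES (?, ?)",
    [("木", "楊林"), ("日", "楊明"), ("勿", "楊物")],
)


def find_sketch(components):
    chars = list(dict.fromkeys(components))  # de-duplicate, keep order
    if not chars:
        return set()
    # One "?" per character, so the values are bound instead of interpolated.
    placeholders = ", ".join("?" for _ in chars)
    rows = conn.execute(
        f"SELECT ids FROM ids_data WHERE ideo IN ({placeholders})", chars
    ).fetchall()
    if not rows:
        return set()
    return set.intersection(*(set(r[0]) for r in rows))


print(find_sketch("木日勿"))  # {'楊'}
```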
Binary file modified ids-data.db
Binary file not shown.
22 changes: 22 additions & 0 deletions setup.py
@@ -0,0 +1,22 @@
import setuptools

with open("README.md", "r") as fh:
    long_description = fh.read()

setuptools.setup(
    name="ideograph",
    version="1.0.0",
    author="Ben Yang",
    author_email="[email protected]",
    description="Tool for finding ideographic (e.g. Han) characters from their components",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/iwsfutcmd/ideograph",
    packages=setuptools.find_packages(),
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
    ],
    scripts=["bin/ideograph"],
)
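
A note on the `packages=` argument: `setuptools.find_packages()` only discovers directories that contain an `__init__.py`. If the repository layout is as this diff suggests, with a single top-level `ideograph.py`, that module would typically be declared through `py_modules` instead. The fragment below sketches those keyword arguments as a plain dictionary; `setup_kwargs` is a hypothetical name, the layout is an assumption, and this is not the author's `setup.py`.

```python
# Hypothetical keyword arguments for setuptools.setup(), assuming the project
# ships one top-level module file named ideograph.py.
setup_kwargs = dict(
    name="ideograph",
    version="1.0.0",
    py_modules=["ideograph"],   # declares ideograph.py as a single-file module
    scripts=["bin/ideograph"],  # same script entry as in the diff
)

print(setup_kwargs["py_modules"])  # ['ideograph']
```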
