
New book 《自然语言处理入门》 (Introduction to Natural Language Processing) released alongside v1.7.5 🔥: http://nlp.hankcs.com/book.php
hankcs committed Oct 17, 2019
1 parent a7c05c7 commit 422077b
Showing 56 changed files with 3,224 additions and 13 deletions.
16 changes: 12 additions & 4 deletions README.md
@@ -9,7 +9,7 @@ HanLP: Han Language Processing

------

- HanLP is an NLP toolkit built from a series of models and algorithms. Led by 大快搜索 and fully open source, it aims to popularize natural language processing in production environments. HanLP is feature-complete, efficient, cleanly architected, trained on up-to-date corpora, and customizable.
+ HanLP is an NLP toolkit built from a series of models and algorithms. Led by 大快搜索 and fully open source, it aims to popularize natural language processing in production environments. HanLP is feature-complete, efficient, cleanly architected, trained on up-to-date corpora, and customizable. Its algorithms have been vetted by industry and academia, and the companion book [《自然语言处理入门》](http://nlp.hankcs.com/book.php) has been published.

HanLP provides the following features:

@@ -64,7 +64,7 @@ HanLP provides the following features:

## Project Homepage

- [Online Demo](http://hanlp.com/) [Python API](https://github.com/hankcs/pyhanlp) [Solr & Lucene Plugin](https://github.com/hankcs/hanlp-lucene-plugin) [Citing](https://github.com/hankcs/HanLP/wiki/papers) [More Info](https://github.com/hankcs/HanLP/wiki)
+ [《自然语言处理入门》🔥](http://nlp.hankcs.com/book.php) [Companion Code](https://github.com/hankcs/HanLP/tree/v1.7.5/src/test/java/com/hankcs/book) [Online Demo](http://hanlp.com/) [Python API](https://github.com/hankcs/pyhanlp) [Solr & Lucene Plugin](https://github.com/hankcs/hanlp-lucene-plugin) [Forum](https://bbs.hankcs.com/) [Citing](https://github.com/hankcs/HanLP/wiki/papers) [More Info](https://github.com/hankcs/HanLP/wiki)

------

@@ -78,7 +78,7 @@ HanLP provides the following features:
<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
-   <version>portable-1.7.4</version>
+   <version>portable-1.7.5</version>
</dependency>
```
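
With the dependency on the classpath, a minimal smoke test might look like the sketch below. It mirrors the HelloWord example added later in this commit; the class name here is illustrative, not part of the commit:

```java
import com.hankcs.hanlp.HanLP;

public class QuickStart
{
    public static void main(String[] args)
    {
        // Prints the segmented Term list with part-of-speech tags.
        System.out.println(HanLP.segment("王国维和服务员"));
    }
}
```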

@@ -732,10 +732,18 @@ HanLP.Config.enableDebug();
* Role-tagging-based named entity recognition depends heavily on dictionaries, so dictionary quality strongly affects recognition quality.
* These dictionaries share similar formats and principles; read the [related articles](http://www.hankcs.com/category/nlp/ner/) or the code to modify them (a sketch follows below).

- If that solves your problem, you are welcome to submit a pull request; this is why I keep the plain-text dictionaries in the repository. Many hands make light work!
+ If questions remain, consult the corresponding chapters of [《自然语言处理入门》](http://nlp.hankcs.com/book.php). If that solves your problem, you are welcome to submit a pull request; this is why I keep the plain-text dictionaries in the repository. Many hands make light work!
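
The dictionaries under data/dictionary are plain text. As a rough sketch of patching entries at runtime, here is the CustomDictionary API, which is a related but separate mechanism from the NER role dictionaries; the word 攻城狮 and its tag below are hypothetical:

```java
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.dictionary.CustomDictionary;

public class DictionaryPatchDemo
{
    public static void main(String[] args)
    {
        // Entries follow the "word nature frequency" layout used by the
        // plain-text dictionaries in this repository, e.g. "上有天堂 nz 3".
        CustomDictionary.insert("攻城狮", "nz 1024"); // hypothetical new word
        System.out.println(HanLP.segment("攻城狮逆袭"));
    }
}
```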

------

## [《自然语言处理入门》](http://nlp.hankcs.com/book.php)

![img](http://file.hankcs.com/img/nlp-book-squre.jpg)

A companion NLP primer for HanLP, giving equal weight to fundamental theory and production code, with parallel implementations in Python and Java. Starting from basic concepts, it works through the algorithms and engineering behind the popular problems of Chinese word segmentation, part-of-speech tagging, named entity recognition, information extraction, text clustering, text classification, and syntactic parsing. By explaining multiple algorithms for each problem, it compares their strengths, weaknesses, and applicable scenarios, and walks through mature production-grade code in detail, helping you truly apply natural language processing in production environments.

[《自然语言处理入门》](http://nlp.hankcs.com/book.php) carries forewords and recommendations from Xia Zhihong (founding chair of the Department of Mathematics at the Southern University of Science and Technology), Zhou Ming (deputy director of Microsoft Research Asia), Li Hang (director of the ByteDance AI Lab), Liu Qun (chief scientist of speech and semantics at Huawei's Noah's Ark Lab), Wang Bin (head of the Xiaomi AI Lab and its chief NLP scientist), Zong Chengqing (researcher at the Institute of Automation, Chinese Academy of Sciences), Liu Zhiyuan (associate professor at Tsinghua University), Zhang Huaping (associate professor at the Beijing Institute of Technology), and 52nlp. My thanks to all of these teachers; I hope this project and this book can be a "butterfly effect" for your engineering and study, helping you emerge as a butterfly on the road of NLP.

## Copyright

### Apache License Version 2.0
2 changes: 0 additions & 2 deletions data/dictionary/custom/现代汉语补充词库.txt
@@ -2566,7 +2566,6 @@ t恤 n 4
上晶脉 n 2
上智下愚 nz 3
上替下陵 nz 3
- 上有 v 3
上有天堂 nz 3
上有所好 l 3
上有政策 n 6
@@ -182815,7 +182814,6 @@ t恤 n 4
进利除害 nz 3
进到 v 3
进制 v 89
- 进前 v 3
进前一步 i 3
进前一级 l 3
进化树 gb 3
2 changes: 1 addition & 1 deletion pom.xml
@@ -4,7 +4,7 @@

    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
-   <version>1.7.4</version>
+   <version>1.7.5</version>

    <name>HanLP</name>
    <url>http://www.hankcs.com/</url>
31 changes: 31 additions & 0 deletions src/test/java/com/hankcs/book/ch01/HelloWord.java
@@ -0,0 +1,31 @@
/*
 * <author>Han He</author>
 * <email>[email protected]</email>
 * <create-date>2018-05-18 5:38 PM</create-date>
 *
 * <copyright file="HelloWord.java">
 * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/
 * This source is subject to Han He. Please contact Han He for more information.
 * </copyright>
 */
package com.hankcs.book.ch01;

import com.hankcs.hanlp.HanLP;

/**
 * 《自然语言处理入门》 1.6 Open-Source Tools
 * Companion book: http://nlp.hankcs.com/book.php
 * Q&A forum: https://bbs.hankcs.com/
 *
 * @author hankcs
 * @see <a href="http://nlp.hankcs.com/book.php">《自然语言处理入门》</a>
 * @see <a href="https://bbs.hankcs.com/">Q&A forum</a>
 */
public class HelloWord
{
    public static void main(String[] args)
    {
        HanLP.Config.enableDebug(); // the first run builds the model cache automatically; debug mode is enabled so it has something to say while you wait :-)
        System.out.println(HanLP.segment("王国维和服务员"));
    }
}
152 changes: 152 additions & 0 deletions src/test/java/com/hankcs/book/ch02/AhoCorasickDoubleArrayTrieSegmentation.java
@@ -0,0 +1,152 @@
/*
 * <author>Han He</author>
 * <email>[email protected]</email>
 * <create-date>2018-05-28 5:59 PM</create-date>
 *
 * <copyright file="AhoCorasickDoubleArrayTrieSegmentation.java">
 * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/
 * This source is subject to Han He. Please contact Han He for more information.
 * </copyright>
 */
package com.hankcs.book.ch02;

import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie;
import com.hankcs.hanlp.collection.trie.DoubleArrayTrie;
import com.hankcs.hanlp.corpus.io.IOUtil;
import com.hankcs.hanlp.dictionary.CoreDictionary;

import java.io.IOException;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.TreeMap;

/**
 * 《自然语言处理入门》 2.7 Aho-Corasick Automaton Backed by a Double-Array Trie
 * Companion book: http://nlp.hankcs.com/book.php
 * Q&A forum: https://bbs.hankcs.com/
 *
 * @author hankcs
 * @see <a href="http://nlp.hankcs.com/book.php">《自然语言处理入门》</a>
 * @see <a href="https://bbs.hankcs.com/">Q&A forum</a>
 */
public class AhoCorasickDoubleArrayTrieSegmentation
{
    public static void main(String[] args) throws IOException
    {
        classicDemo();
        for (int i = 1; i <= 10; ++i)
        {
            evaluateSpeed(i);
            System.gc();
        }
    }

    private static void classicDemo() throws IOException
    {
        String[] keyArray = new String[]{"hers", "his", "she", "he"};
        TreeMap<String, String> map = new TreeMap<String, String>();
        for (String key : keyArray)
            map.put(key, key.toUpperCase());
        AhoCorasickDoubleArrayTrie<String> acdat = new AhoCorasickDoubleArrayTrie<String>(map);
        for (AhoCorasickDoubleArrayTrie<String>.Hit<String> hit : acdat.parseText("ushers")) // fetch all matches at once
        {
            System.out.printf("[%d:%d]=%s\n", hit.begin, hit.end, hit.value);
        }
        System.out.println();
        acdat.parseText("ushers", new AhoCorasickDoubleArrayTrie.IHit<String>() // process each match as soon as it is found
        {
            @Override
            public void hit(int begin, int end, String value)
            {
                System.out.printf("[%d:%d]=%s\n", begin, end, value);
            }
        });
    }

    private static void evaluateSpeed(int wordLength) throws IOException
    {
        TreeMap<String, CoreDictionary.Attribute> dictionary = loadDictionary(wordLength);

        AhoCorasickDoubleArrayTrie<CoreDictionary.Attribute> acdat = new AhoCorasickDoubleArrayTrie<CoreDictionary.Attribute>(dictionary);
        DoubleArrayTrie<CoreDictionary.Attribute> dat = new DoubleArrayTrie<CoreDictionary.Attribute>(dictionary);

        String text = "江西鄱阳湖干枯,中国最大淡水湖变成大草原";
        long start;
        double costTime;
        final int pressure = 1000000;
        System.out.printf("长度%d:\n", wordLength); // 长度 = length; the dictionary now holds only words of at least this length

        start = System.currentTimeMillis();
        for (int i = 0; i < pressure; ++i)
        {
            acdat.parseText(text, new AhoCorasickDoubleArrayTrie.IHit<CoreDictionary.Attribute>()
            {
                @Override
                public void hit(int begin, int end, CoreDictionary.Attribute value)
                {
                    // empty callback: we only measure raw matching speed
                }
            });
        }
        costTime = (System.currentTimeMillis() - start) / (double) 1000;
        System.out.printf("ACDAT: %.2f万字/秒\n", text.length() * pressure / 10000 / costTime); // 万字/秒 = 10k chars per second

        start = System.currentTimeMillis();
        for (int i = 0; i < pressure; ++i)
        {
            dat.parseText(text, new AhoCorasickDoubleArrayTrie.IHit<CoreDictionary.Attribute>()
            {
                @Override
                public void hit(int begin, int end, CoreDictionary.Attribute value)
                {
                    // empty callback: we only measure raw matching speed
                }
            });
        }
        costTime = (System.currentTimeMillis() - start) / (double) 1000;
        System.out.printf("DAT: %.2f万字/秒\n", text.length() * pressure / 10000 / costTime);
    }

    /**
     * Load the dictionary, keeping only words of at least the given length
     *
     * @param minLength minimum word length
     * @return the dictionary as a TreeMap
     * @throws IOException if the dictionary file cannot be read
     */
    public static TreeMap<String, CoreDictionary.Attribute> loadDictionary(int minLength) throws IOException
    {
        TreeMap<String, CoreDictionary.Attribute> dictionary =
            IOUtil.loadDictionary("data/dictionary/CoreNatureDictionary.mini.txt");

        Iterator<String> iterator = dictionary.keySet().iterator();
        while (iterator.hasNext())
        {
            if (iterator.next().length() < minLength)
                iterator.remove();
        }
        return dictionary;
    }

    /**
     * Fully-segmenting Chinese word segmentation based on the ACDAT
     *
     * @param text  the text to segment
     * @param acdat the dictionary automaton
     * @return the list of words
     */
    public static List<String> segmentFully(final String text, AhoCorasickDoubleArrayTrie<CoreDictionary.Attribute> acdat)
    {
        final List<String> wordList = new LinkedList<String>();
        acdat.parseText(text, new AhoCorasickDoubleArrayTrie.IHit<CoreDictionary.Attribute>()
        {
            @Override
            public void hit(int begin, int end, CoreDictionary.Attribute value)
            {
                wordList.add(text.substring(begin, end));
            }
        });
        return wordList;
    }
}
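
Note that segmentFully above is never called from main. A hypothetical driver (the class name and sample sentence are illustrative, not part of this commit) could exercise it like this:

```java
package com.hankcs.book.ch02;

import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie;
import com.hankcs.hanlp.dictionary.CoreDictionary;

import java.io.IOException;
import java.util.TreeMap;

public class SegmentFullyDemo
{
    public static void main(String[] args) throws IOException
    {
        // Reuse the loader above; keep every word (minimum length 1).
        TreeMap<String, CoreDictionary.Attribute> dictionary =
            AhoCorasickDoubleArrayTrieSegmentation.loadDictionary(1);
        AhoCorasickDoubleArrayTrie<CoreDictionary.Attribute> acdat =
            new AhoCorasickDoubleArrayTrie<CoreDictionary.Attribute>(dictionary);
        // Full segmentation emits every dictionary word found in the text,
        // overlapping matches included (e.g. both 和服 and 服务, if present).
        System.out.println(AhoCorasickDoubleArrayTrieSegmentation.segmentFully("商品和服务", acdat));
    }
}
```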
89 changes: 89 additions & 0 deletions src/test/java/com/hankcs/book/ch02/AhoCorasickSegmentation.java
@@ -0,0 +1,89 @@
/*
 * <author>Han He</author>
 * <email>[email protected]</email>
 * <create-date>2018-05-28 11:00 AM</create-date>
 *
 * <copyright file="AhoCorasickSegmentation.java">
 * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/
 * This source is subject to Han He. Please contact Han He for more information.
 * </copyright>
 */
package com.hankcs.book.ch02;

import com.hankcs.hanlp.algorithm.ahocorasick.trie.Emit;
import com.hankcs.hanlp.algorithm.ahocorasick.trie.Trie;
import com.hankcs.hanlp.corpus.io.IOUtil;
import com.hankcs.hanlp.dictionary.CoreDictionary;

import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import java.util.TreeMap;

/**
 * 《自然语言处理入门》 2.6 The Aho-Corasick Automaton
 * Companion book: http://nlp.hankcs.com/book.php
 * Q&A forum: https://bbs.hankcs.com/
 *
 * @author hankcs
 * @see <a href="http://nlp.hankcs.com/book.php">《自然语言处理入门》</a>
 * @see <a href="https://bbs.hankcs.com/">Q&A forum</a>
 */
public class AhoCorasickSegmentation
{
    public static void main(String[] args) throws IOException
    {
        classicDemo();
        evaluateSpeed();
    }

    private static void classicDemo()
    {
        String[] keyArray = new String[]{"hers", "his", "she", "he"};
        Trie trie = new Trie();
        for (String key : keyArray)
            trie.addKeyword(key);
        for (Emit emit : trie.parseText("ushers"))
            System.out.printf("[%d:%d]=%s\n", emit.getStart(), emit.getEnd(), emit.getKeyword());
    }

    private static void evaluateSpeed() throws IOException
    {
        // load the dictionary
        TreeMap<String, CoreDictionary.Attribute> dictionary =
            IOUtil.loadDictionary("data/dictionary/CoreNatureDictionary.mini.txt");
        Trie trie = new Trie(dictionary.keySet());

        String text = "江西鄱阳湖干枯,中国最大淡水湖变成大草原";
        long start;
        double costTime;
        final int pressure = 1000000;

        System.out.println("===AC自动机接口==="); // "AC automaton API"
        System.out.println("完全切分"); // "full segmentation"
        start = System.currentTimeMillis();
        for (int i = 0; i < pressure; ++i)
        {
            segmentFully(text, trie);
        }
        costTime = (System.currentTimeMillis() - start) / (double) 1000;
        System.out.printf("%.2f万字/秒\n", text.length() * pressure / 10000 / costTime); // 万字/秒 = 10k chars per second
    }

    /**
     * Fully-segmenting Chinese word segmentation based on the Aho-Corasick automaton
     *
     * @param text       the text to segment
     * @param dictionary the dictionary
     * @return the list of words
     */
    public static List<String> segmentFully(final String text, Trie dictionary)
    {
        final List<String> wordList = new LinkedList<String>();
        for (Emit emit : dictionary.parseText(text))
        {
            wordList.add(emit.getKeyword());
        }
        return wordList;
    }
}
