
New book 《自然语言处理入门》 (Introduction to Natural Language Processing) released alongside v1.7.5 🔥: http://nlp.hankcs.com/book.php
hankcs committed Oct 17, 2019
1 parent a7c05c7 commit 422077b
Showing 56 changed files with 3,224 additions and 13 deletions.
16 changes: 12 additions & 4 deletions README.md
@@ -9,7 +9,7 @@ HanLP: Han Language Processing

------

- HanLP is an NLP toolkit built from a series of models and algorithms. Led by 大快搜索 and fully open source, it aims to popularize natural language processing in production environments. HanLP is feature-complete, efficient, cleanly architected, trained on up-to-date corpora, and customizable.
+ HanLP is an NLP toolkit built from a series of models and algorithms. Led by 大快搜索 and fully open source, it aims to popularize natural language processing in production environments. HanLP is feature-complete, efficient, cleanly architected, trained on up-to-date corpora, and customizable. Its algorithms have been vetted by industry and academia, and the companion book [《自然语言处理入门》](http://nlp.hankcs.com/book.php) has been published.

HanLP provides the following features:

@@ -64,7 +64,7 @@ HanLP provides the following features:

## Project Homepage

- [Online Demo](http://hanlp.com/) [Python API](https://github.com/hankcs/pyhanlp) [Solr & Lucene Plugin](https://github.com/hankcs/hanlp-lucene-plugin) [Citing](https://github.com/hankcs/HanLP/wiki/papers) [More Info](https://github.com/hankcs/HanLP/wiki)
+ [《自然语言处理入门》🔥](http://nlp.hankcs.com/book.php) [Companion Code](https://github.com/hankcs/HanLP/tree/v1.7.5/src/test/java/com/hankcs/book) [Online Demo](http://hanlp.com/) [Python API](https://github.com/hankcs/pyhanlp) [Solr & Lucene Plugin](https://github.com/hankcs/hanlp-lucene-plugin) [Forum](https://bbs.hankcs.com/) [Citing](https://github.com/hankcs/HanLP/wiki/papers) [More Info](https://github.com/hankcs/HanLP/wiki)

------

@@ -78,7 +78,7 @@ HanLP provides the following features:
<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
-   <version>portable-1.7.4</version>
+   <version>portable-1.7.5</version>
</dependency>
```
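
With the dependency on the classpath, a minimal smoke test might look like the sketch below. It mirrors the HelloWord example added later in this commit; the class name here is illustrative, not part of the commit:

```java
import com.hankcs.hanlp.HanLP;

public class QuickStart
{
    public static void main(String[] args)
    {
        // Prints the segmented Term list with part-of-speech tags.
        System.out.println(HanLP.segment("王国维和服务员"));
    }
}
```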

@@ -732,10 +732,18 @@ HanLP.Config.enableDebug();
* Role-tagging-based named entity recognition depends heavily on dictionaries, so dictionary quality strongly affects recognition quality.
* These dictionaries share similar formats and principles; read the [related articles](http://www.hankcs.com/category/nlp/ner/) or the code to modify them (a sketch follows below).

- If that solves your problem, you are welcome to submit a pull request; this is why I keep the plain-text dictionaries in the repository. Many hands make light work!
+ If questions remain, consult the corresponding chapters of [《自然语言处理入门》](http://nlp.hankcs.com/book.php). If that solves your problem, you are welcome to submit a pull request; this is why I keep the plain-text dictionaries in the repository. Many hands make light work!
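
The dictionaries under data/dictionary are plain text. As a rough sketch of patching entries at runtime, here is the CustomDictionary API, which is a related but separate mechanism from the NER role dictionaries; the word 攻城狮 and its tag below are hypothetical:

```java
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.dictionary.CustomDictionary;

public class DictionaryPatchDemo
{
    public static void main(String[] args)
    {
        // Entries follow the "word nature frequency" layout used by the
        // plain-text dictionaries in this repository, e.g. "上有天堂 nz 3".
        CustomDictionary.insert("攻城狮", "nz 1024"); // hypothetical new word
        System.out.println(HanLP.segment("攻城狮逆袭"));
    }
}
```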

------

## [《自然语言处理入门》](http://nlp.hankcs.com/book.php)

![img](http://file.hankcs.com/img/nlp-book-squre.jpg)

A companion NLP primer for HanLP, giving equal weight to fundamental theory and production code, with parallel implementations in Python and Java. Starting from basic concepts, it works through the algorithms and engineering behind the popular problems of Chinese word segmentation, part-of-speech tagging, named entity recognition, information extraction, text clustering, text classification, and syntactic parsing. By explaining multiple algorithms for each problem, it compares their strengths, weaknesses, and applicable scenarios, and walks through mature production-grade code in detail, helping you truly apply natural language processing in production environments.

[《自然语言处理入门》](http://nlp.hankcs.com/book.php) carries forewords and recommendations from Xia Zhihong (founding chair of the Department of Mathematics at the Southern University of Science and Technology), Zhou Ming (deputy director of Microsoft Research Asia), Li Hang (director of the ByteDance AI Lab), Liu Qun (chief scientist of speech and semantics at Huawei's Noah's Ark Lab), Wang Bin (head of the Xiaomi AI Lab and its chief NLP scientist), Zong Chengqing (researcher at the Institute of Automation, Chinese Academy of Sciences), Liu Zhiyuan (associate professor at Tsinghua University), Zhang Huaping (associate professor at the Beijing Institute of Technology), and 52nlp. My thanks to all of these teachers; I hope this project and this book can be a "butterfly effect" for your engineering and study, helping you emerge as a butterfly on the road of NLP.

## Copyright

### Apache License Version 2.0
2 changes: 0 additions & 2 deletions data/dictionary/custom/现代汉语补充词库.txt
@@ -2566,7 +2566,6 @@ t恤 n 4
上晶脉 n 2
上智下愚 nz 3
上替下陵 nz 3
- 上有 v 3
上有天堂 nz 3
上有所好 l 3
上有政策 n 6
@@ -182815,7 +182814,6 @@ t恤 n 4
进利除害 nz 3
进到 v 3
进制 v 89
- 进前 v 3
进前一步 i 3
进前一级 l 3
进化树 gb 3
2 changes: 1 addition & 1 deletion pom.xml
@@ -4,7 +4,7 @@

    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
-   <version>1.7.4</version>
+   <version>1.7.5</version>

    <name>HanLP</name>
    <url>http://www.hankcs.com/</url>
31 changes: 31 additions & 0 deletions src/test/java/com/hankcs/book/ch01/HelloWord.java
@@ -0,0 +1,31 @@
/*
 * <author>Han He</author>
 * <email>[email protected]</email>
 * <create-date>2018-05-18 5:38 PM</create-date>
 *
 * <copyright file="HelloWord.java">
 * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/
 * This source is subject to Han He. Please contact Han He for more information.
 * </copyright>
 */
package com.hankcs.book.ch01;

import com.hankcs.hanlp.HanLP;

/**
 * 《自然语言处理入门》 1.6 Open-Source Tools
 * Companion book: http://nlp.hankcs.com/book.php
 * Q&A forum: https://bbs.hankcs.com/
 *
 * @author hankcs
 * @see <a href="http://nlp.hankcs.com/book.php">《自然语言处理入门》</a>
 * @see <a href="https://bbs.hankcs.com/">Q&A forum</a>
 */
public class HelloWord
{
    public static void main(String[] args)
    {
        HanLP.Config.enableDebug(); // the first run builds the model cache automatically; debug mode is enabled so it has something to say while you wait :-)
        System.out.println(HanLP.segment("王国维和服务员"));
    }
}
152 changes: 152 additions & 0 deletions src/test/java/com/hankcs/book/ch02/AhoCorasickDoubleArrayTrieSegmentation.java
@@ -0,0 +1,152 @@
/*
 * <author>Han He</author>
 * <email>[email protected]</email>
 * <create-date>2018-05-28 5:59 PM</create-date>
 *
 * <copyright file="AhoCorasickDoubleArrayTrieSegmentation.java">
 * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/
 * This source is subject to Han He. Please contact Han He for more information.
 * </copyright>
 */
package com.hankcs.book.ch02;

import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie;
import com.hankcs.hanlp.collection.trie.DoubleArrayTrie;
import com.hankcs.hanlp.corpus.io.IOUtil;
import com.hankcs.hanlp.dictionary.CoreDictionary;

import java.io.IOException;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.TreeMap;

/**
 * 《自然语言处理入门》 2.7 Aho-Corasick Automaton Backed by a Double-Array Trie
 * Companion book: http://nlp.hankcs.com/book.php
 * Q&A forum: https://bbs.hankcs.com/
 *
 * @author hankcs
 * @see <a href="http://nlp.hankcs.com/book.php">《自然语言处理入门》</a>
 * @see <a href="https://bbs.hankcs.com/">Q&A forum</a>
 */
public class AhoCorasickDoubleArrayTrieSegmentation
{
    public static void main(String[] args) throws IOException
    {
        classicDemo();
        for (int i = 1; i <= 10; ++i)
        {
            evaluateSpeed(i);
            System.gc();
        }
    }

    private static void classicDemo() throws IOException
    {
        String[] keyArray = new String[]{"hers", "his", "she", "he"};
        TreeMap<String, String> map = new TreeMap<String, String>();
        for (String key : keyArray)
            map.put(key, key.toUpperCase());
        AhoCorasickDoubleArrayTrie<String> acdat = new AhoCorasickDoubleArrayTrie<String>(map);
        for (AhoCorasickDoubleArrayTrie<String>.Hit<String> hit : acdat.parseText("ushers")) // fetch all matches at once
        {
            System.out.printf("[%d:%d]=%s\n", hit.begin, hit.end, hit.value);
        }
        System.out.println();
        acdat.parseText("ushers", new AhoCorasickDoubleArrayTrie.IHit<String>() // process each match as soon as it is found
        {
            @Override
            public void hit(int begin, int end, String value)
            {
                System.out.printf("[%d:%d]=%s\n", begin, end, value);
            }
        });
    }

    private static void evaluateSpeed(int wordLength) throws IOException
    {
        TreeMap<String, CoreDictionary.Attribute> dictionary = loadDictionary(wordLength);

        AhoCorasickDoubleArrayTrie<CoreDictionary.Attribute> acdat = new AhoCorasickDoubleArrayTrie<CoreDictionary.Attribute>(dictionary);
        DoubleArrayTrie<CoreDictionary.Attribute> dat = new DoubleArrayTrie<CoreDictionary.Attribute>(dictionary);

        String text = "江西鄱阳湖干枯,中国最大淡水湖变成大草原";
        long start;
        double costTime;
        final int pressure = 1000000;
        System.out.printf("长度%d:\n", wordLength); // 长度 = length; the dictionary now holds only words of at least this length

        start = System.currentTimeMillis();
        for (int i = 0; i < pressure; ++i)
        {
            acdat.parseText(text, new AhoCorasickDoubleArrayTrie.IHit<CoreDictionary.Attribute>()
            {
                @Override
                public void hit(int begin, int end, CoreDictionary.Attribute value)
                {
                    // empty callback: we only measure raw matching speed
                }
            });
        }
        costTime = (System.currentTimeMillis() - start) / (double) 1000;
        System.out.printf("ACDAT: %.2f万字/秒\n", text.length() * pressure / 10000 / costTime); // 万字/秒 = 10k chars per second

        start = System.currentTimeMillis();
        for (int i = 0; i < pressure; ++i)
        {
            dat.parseText(text, new AhoCorasickDoubleArrayTrie.IHit<CoreDictionary.Attribute>()
            {
                @Override
                public void hit(int begin, int end, CoreDictionary.Attribute value)
                {
                    // empty callback: we only measure raw matching speed
                }
            });
        }
        costTime = (System.currentTimeMillis() - start) / (double) 1000;
        System.out.printf("DAT: %.2f万字/秒\n", text.length() * pressure / 10000 / costTime);
    }

    /**
     * Load the dictionary, keeping only words of at least the given length
     *
     * @param minLength minimum word length
     * @return the dictionary as a TreeMap
     * @throws IOException if the dictionary file cannot be read
     */
    public static TreeMap<String, CoreDictionary.Attribute> loadDictionary(int minLength) throws IOException
    {
        TreeMap<String, CoreDictionary.Attribute> dictionary =
            IOUtil.loadDictionary("data/dictionary/CoreNatureDictionary.mini.txt");

        Iterator<String> iterator = dictionary.keySet().iterator();
        while (iterator.hasNext())
        {
            if (iterator.next().length() < minLength)
                iterator.remove();
        }
        return dictionary;
    }

    /**
     * Fully-segmenting Chinese word segmentation based on the ACDAT
     *
     * @param text  the text to segment
     * @param acdat the dictionary automaton
     * @return the list of words
     */
    public static List<String> segmentFully(final String text, AhoCorasickDoubleArrayTrie<CoreDictionary.Attribute> acdat)
    {
        final List<String> wordList = new LinkedList<String>();
        acdat.parseText(text, new AhoCorasickDoubleArrayTrie.IHit<CoreDictionary.Attribute>()
        {
            @Override
            public void hit(int begin, int end, CoreDictionary.Attribute value)
            {
                wordList.add(text.substring(begin, end));
            }
        });
        return wordList;
    }
}
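
Note that segmentFully above is never called from main. A hypothetical driver (the class name and sample sentence are illustrative, not part of this commit) could exercise it like this:

```java
package com.hankcs.book.ch02;

import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie;
import com.hankcs.hanlp.dictionary.CoreDictionary;

import java.io.IOException;
import java.util.TreeMap;

public class SegmentFullyDemo
{
    public static void main(String[] args) throws IOException
    {
        // Reuse the loader above; keep every word (minimum length 1).
        TreeMap<String, CoreDictionary.Attribute> dictionary =
            AhoCorasickDoubleArrayTrieSegmentation.loadDictionary(1);
        AhoCorasickDoubleArrayTrie<CoreDictionary.Attribute> acdat =
            new AhoCorasickDoubleArrayTrie<CoreDictionary.Attribute>(dictionary);
        // Full segmentation emits every dictionary word found in the text,
        // overlapping matches included (e.g. both 和服 and 服务, if present).
        System.out.println(AhoCorasickDoubleArrayTrieSegmentation.segmentFully("商品和服务", acdat));
    }
}
```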
89 changes: 89 additions & 0 deletions src/test/java/com/hankcs/book/ch02/AhoCorasickSegmentation.java
@@ -0,0 +1,89 @@
/*
 * <author>Han He</author>
 * <email>[email protected]</email>
 * <create-date>2018-05-28 11:00 AM</create-date>
 *
 * <copyright file="AhoCorasickSegmentation.java">
 * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/
 * This source is subject to Han He. Please contact Han He for more information.
 * </copyright>
 */
package com.hankcs.book.ch02;

import com.hankcs.hanlp.algorithm.ahocorasick.trie.Emit;
import com.hankcs.hanlp.algorithm.ahocorasick.trie.Trie;
import com.hankcs.hanlp.corpus.io.IOUtil;
import com.hankcs.hanlp.dictionary.CoreDictionary;

import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import java.util.TreeMap;

/**
 * 《自然语言处理入门》 2.6 The Aho-Corasick Automaton
 * Companion book: http://nlp.hankcs.com/book.php
 * Q&A forum: https://bbs.hankcs.com/
 *
 * @author hankcs
 * @see <a href="http://nlp.hankcs.com/book.php">《自然语言处理入门》</a>
 * @see <a href="https://bbs.hankcs.com/">Q&A forum</a>
 */
public class AhoCorasickSegmentation
{
    public static void main(String[] args) throws IOException
    {
        classicDemo();
        evaluateSpeed();
    }

    private static void classicDemo()
    {
        String[] keyArray = new String[]{"hers", "his", "she", "he"};
        Trie trie = new Trie();
        for (String key : keyArray)
            trie.addKeyword(key);
        for (Emit emit : trie.parseText("ushers"))
            System.out.printf("[%d:%d]=%s\n", emit.getStart(), emit.getEnd(), emit.getKeyword());
    }

    private static void evaluateSpeed() throws IOException
    {
        // load the dictionary
        TreeMap<String, CoreDictionary.Attribute> dictionary =
            IOUtil.loadDictionary("data/dictionary/CoreNatureDictionary.mini.txt");
        Trie trie = new Trie(dictionary.keySet());

        String text = "江西鄱阳湖干枯,中国最大淡水湖变成大草原";
        long start;
        double costTime;
        final int pressure = 1000000;

        System.out.println("===AC自动机接口==="); // "AC automaton API"
        System.out.println("完全切分"); // "full segmentation"
        start = System.currentTimeMillis();
        for (int i = 0; i < pressure; ++i)
        {
            segmentFully(text, trie);
        }
        costTime = (System.currentTimeMillis() - start) / (double) 1000;
        System.out.printf("%.2f万字/秒\n", text.length() * pressure / 10000 / costTime); // 万字/秒 = 10k chars per second
    }

    /**
     * Fully-segmenting Chinese word segmentation based on the Aho-Corasick automaton
     *
     * @param text       the text to segment
     * @param dictionary the dictionary
     * @return the list of words
     */
    public static List<String> segmentFully(final String text, Trie dictionary)
    {
        final List<String> wordList = new LinkedList<String>();
        for (Emit emit : dictionary.parseText(text))
        {
            wordList.add(emit.getKeyword());
        }
        return wordList;
    }
}
