Skip to content

Files

82 lines (63 loc) · 2.73 KB

README.md

File metadata and controls

82 lines (63 loc) · 2.73 KB

Namignizer

Use a variation of the PTB model to recognize and generate names using the Kaggle Baby Name Database.

API

Namignizer is implemented in Tensorflow 0.8r and uses the python package pandas for some data processing.

How to use

Download the data from Kaggle and place it in your data directory (or use the small training data provided). The example data looks like so:

Id,Name,Year,Gender,Count
1,Mary,1880,F,7065
2,Anna,1880,F,2604
3,Emma,1880,F,2003
4,Elizabeth,1880,F,1939
5,Minnie,1880,F,1746
6,Margaret,1880,F,1578
7,Ida,1880,F,1472
8,Alice,1880,F,1414
9,Bertha,1880,F,1320

But any data with the two columns: Name and Count will work.

With the data, we can then train the model:

train("data/SmallNames.txt", "model/namignizer", SmallConfig)

And you will get the output:

Reading Name data in data/SmallNames.txt
Epoch: 1 Learning rate: 1.000
0.090 perplexity: 18.539 speed: 282 lps
...
0.890 perplexity: 1.478 speed: 285 lps
0.990 perplexity: 1.477 speed: 284 lps
Epoch: 13 Train Perplexity: 1.477

This will as a side effect write model checkpoints to the model directory. With this you will be able to determine the perplexity your model will give you for any arbitrary set of names like so:

namignize(["mary", "ida", "gazorpazorp", "houyhnhnms", "bob"],
  tf.train.latest_checkpoint("model"), SmallConfig)

You will provide the same config and the same checkpoint directory. This will allow you to use a the model you just trained. You will then get a perplexity output for each name like so:

Name mary gives us a perplexity of 1.03105580807
Name ida gives us a perplexity of 1.07770049572
Name gazorpazorp gives us a perplexity of 175.940353394
Name houyhnhnms gives us a perplexity of 9.53870773315
Name bob gives us a perplexity of 6.03938627243

Finally, you will also be able generate names using the model like so:

namignator(tf.train.latest_checkpoint("model"), SmallConfig)

Again, you will need to provide the same config and the same checkpoint directory. This will allow you to use a the model you just trained. You will then get a single generated name. Examples of output that I got when using the provided data are:

['b', 'e', 'r', 't', 'h', 'a', '`']
['m', 'a', 'r', 'y', '`']
['a', 'n', 'n', 'a', '`']
['m', 'a', 'r', 'y', '`']
['b', 'e', 'r', 't', 'h', 'a', '`']
['a', 'n', 'n', 'a', '`']
['e', 'l', 'i', 'z', 'a', 'b', 'e', 't', 'h', '`']

Notice that each name ends with a backtick. This marks the end of the name.

Contact Info

Feel free to reach out to me at knt(at google) or k.nathaniel.tucker(at gmail)