Commit
Oliver Keyes committed Aug 26, 2015
1 parent 8111b2b, commit 7d9baa6
Showing 8 changed files with 175 additions and 7 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,6 @@ | ||
Package: humaniformat | ||
Title: A Parser for Human Names | ||
-Version: 0.0.1 | ||
+Version: 0.2.0 | ||
Author: Oliver Keyes | ||
Maintainer: Oliver Keyes <[email protected]> | ||
Description: Human names are complicated and nonstandard things. Humaniformat attempts to provide functions for parsing those names, making a best-guess attempt to distinguish sub-components such as prefixes, suffixes, middle names and salutations. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
% Generated by roxygen2 (4.1.1): do not edit by hand | ||
% Please edit documentation in R/RcppExports.R | ||
\name{format_period} | ||
\alias{format_period} | ||
\title{Reformat Period-Separated Names} | ||
\usage{ | ||
format_period(names) | ||
} | ||
\arguments{ | ||
\item{names}{a vector of names following this convention. Names that lack periods will | ||
be returned entirely intact, so assuming you don't have (legitimate) periods in names | ||
not following this format, there's no need to worry if your vector has mixed formatting.} | ||
} | ||
\description{ | ||
A common pattern for names is for first and middle names to be represented | ||
by initials. Unfortunately, depending on how the initials are written, this can make parsing difficult; | ||
"G. K. Chesterton" is easy to parse, but "G.K. Chesterton" or "G.K.Chesterton" is not. | ||
\code{format_period} takes names that are period-separated in this fashion and reformats | ||
them to ensure there are spaces between each initial. Periods after any space in the name | ||
are preserved, so "G.K. Chesterton, M.D." does not become "G. K. Chesterton, M. D. ". | ||
} | ||
\examples{ | ||
format_period("G.K.Chesterton") | ||
} | ||
\seealso{ | ||
\code{\link{format_reverse}} for names stored as "Lastname, Firstname", and | ||
\code{\link{parse_names}} to parse the output of this function. | ||
} | ||
|
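The behaviour documented above for `format_period` can be approximated in base R. This is only a rough sketch under stated assumptions: `sketch_format_period` is a hypothetical helper, not part of humaniformat, and it says nothing about how the package's own (Rcpp-backed) implementation works. It simply adds a space after each period that appears before the first space in a name and leaves everything after that space alone.

```r
# Hypothetical helper (NOT part of humaniformat): a base-R approximation of
# the behaviour described in the format_period documentation.
sketch_format_period <- function(names) {
  vapply(names, function(name) {
    if (is.na(name)) {
      return(NA_character_)
    }
    # Find the first space; only the text before it is reformatted.
    first_space <- regexpr(" ", name, fixed = TRUE)
    if (first_space == -1) {
      head <- name
      tail <- ""
    } else {
      head <- substr(name, 1, first_space - 1)
      tail <- substr(name, first_space, nchar(name))
    }
    # Add a space after every period that is followed by another character,
    # so "G.K." becomes "G. K." and "G.K.Chesterton" becomes "G. K. Chesterton".
    head <- gsub("\\.(?=.)", ". ", head, perl = TRUE)
    paste0(head, tail)
  }, character(1), USE.NAMES = FALSE)
}

sketch_format_period(c("G.K. Chesterton", "G.K.Chesterton", "G.K. Chesterton, M.D."))
# [1] "G. K. Chesterton"       "G. K. Chesterton"       "G. K. Chesterton, M.D."
```

Names without periods, such as "Oliver Keyes", pass through this sketch unchanged, which matches the guarantee given in the argument description.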
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
## ----eval=FALSE---------------------------------------------------------- | ||
# library(humaniformat) | ||
# names <- c("Oliver Keyes", "Keyes, Oliver") | ||
# format_reverse(names) | ||
# | ||
# [1] "Oliver Keyes" "Oliver Keyes" | ||
|
||
## ----eval=FALSE---------------------------------------------------------- | ||
# names <- c("G.K. Chesterton", "G.K.Chesterton") | ||
# format_period(names) | ||
# | ||
# [1] "G. K. Chesterton" "G. K. Chesterton" | ||
|
||
## ----eval=FALSE---------------------------------------------------------- | ||
# names <- c("G.K. Chesterton", "G.K.Chesterton") | ||
# names <- format_period(names) | ||
# parsed_chestertons <- parse_names(names) | ||
# str(parsed_chestertons) | ||
# | ||
# 'data.frame': 2 obs. of 6 variables: | ||
# $ salutation : chr "" "" | ||
# $ first_name : chr "G." "G." | ||
# $ middle_name: chr "K." "K." | ||
# $ last_name : chr "Chesterton" "Chesterton" | ||
# $ suffix : chr "" "" | ||
# $ full_name : chr "G. K. Chesterton" "G. K. Chesterton" | ||
|
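The first chunk of the script above relies on `format_reverse` turning "Lastname, Firstname" into "Firstname Lastname" while leaving other names alone. A minimal base-R sketch of that idea follows. `sketch_format_reverse` is a hypothetical name introduced for illustration, and the single-regex approach is an assumption, not the package's actual implementation.

```r
# Hypothetical helper (NOT part of humaniformat): swaps the two halves of a
# comma-separated name and leaves names without a comma untouched.
sketch_format_reverse <- function(names) {
  has_comma <- !is.na(names) & grepl(",", names, fixed = TRUE)
  # Split at the first comma only: "Keyes, Oliver" -> "Oliver Keyes".
  names[has_comma] <- sub("^([^,]*),\\s*(.*)$", "\\2 \\1",
                          names[has_comma], perl = TRUE)
  names
}

sketch_format_reverse(c("Oliver Keyes", "Keyes, Oliver"))
# [1] "Oliver Keyes" "Oliver Keyes"
```

A real implementation would also need to decide what to do with commas that introduce a suffix (as in "Chesterton, M.D."), which this sketch does not attempt.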
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,112 @@ | ||
<!DOCTYPE html> | ||
|
||
<html xmlns="http://www.w3.org/1999/xhtml"> | ||
|
||
<head> | ||
|
||
<meta charset="utf-8"> | ||
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> | ||
<meta name="generator" content="pandoc" /> | ||
|
||
<meta name="author" content="Oliver Keyes" /> | ||
|
||
<meta name="date" content="2015-08-26" /> | ||
|
||
<title>Introduction to humaniformat</title> | ||
|
||
|
||
|
||
<style type="text/css">code{white-space: pre;}</style> | ||
<style type="text/css"> | ||
table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode { | ||
margin: 0; padding: 0; vertical-align: baseline; border: none; } | ||
table.sourceCode { width: 100%; line-height: 100%; } | ||
td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; } | ||
td.sourceCode { padding-left: 5px; } | ||
code > span.kw { color: #007020; font-weight: bold; } | ||
code > span.dt { color: #902000; } | ||
code > span.dv { color: #40a070; } | ||
code > span.bn { color: #40a070; } | ||
code > span.fl { color: #40a070; } | ||
code > span.ch { color: #4070a0; } | ||
code > span.st { color: #4070a0; } | ||
code > span.co { color: #60a0b0; font-style: italic; } | ||
code > span.ot { color: #007020; } | ||
code > span.al { color: #ff0000; font-weight: bold; } | ||
code > span.fu { color: #06287e; } | ||
code > span.er { color: #ff0000; font-weight: bold; } | ||
</style> | ||
<style type="text/css"> | ||
pre:not([class]) { | ||
background-color: white; | ||
} | ||
</style> | ||
|
||
|
||
<link href="data:text/css,body%20%7B%0A%20%20background%2Dcolor%3A%20%23fff%3B%0A%20%20margin%3A%201em%20auto%3B%0A%20%20max%2Dwidth%3A%20700px%3B%0A%20%20overflow%3A%20visible%3B%0A%20%20padding%2Dleft%3A%202em%3B%0A%20%20padding%2Dright%3A%202em%3B%0A%20%20font%2Dfamily%3A%20%22Open%20Sans%22%2C%20%22Helvetica%20Neue%22%2C%20Helvetica%2C%20Arial%2C%20sans%2Dserif%3B%0A%20%20font%2Dsize%3A%2014px%3B%0A%20%20line%2Dheight%3A%201%2E35%3B%0A%7D%0A%0A%23header%20%7B%0A%20%20text%2Dalign%3A%20center%3B%0A%7D%0A%0A%23TOC%20%7B%0A%20%20clear%3A%20both%3B%0A%20%20margin%3A%200%200%2010px%2010px%3B%0A%20%20padding%3A%204px%3B%0A%20%20width%3A%20400px%3B%0A%20%20border%3A%201px%20solid%20%23CCCCCC%3B%0A%20%20border%2Dradius%3A%205px%3B%0A%0A%20%20background%2Dcolor%3A%20%23f6f6f6%3B%0A%20%20font%2Dsize%3A%2013px%3B%0A%20%20line%2Dheight%3A%201%2E3%3B%0A%7D%0A%20%20%23TOC%20%2Etoctitle%20%7B%0A%20%20%20%20font%2Dweight%3A%20bold%3B%0A%20%20%20%20font%2Dsize%3A%2015px%3B%0A%20%20%20%20margin%2Dleft%3A%205px%3B%0A%20%20%7D%0A%0A%20%20%23TOC%20ul%20%7B%0A%20%20%20%20padding%2Dleft%3A%2040px%3B%0A%20%20%20%20margin%2Dleft%3A%20%2D1%2E5em%3B%0A%20%20%20%20margin%2Dtop%3A%205px%3B%0A%20%20%20%20margin%2Dbottom%3A%205px%3B%0A%20%20%7D%0A%20%20%23TOC%20ul%20ul%20%7B%0A%20%20%20%20margin%2Dleft%3A%20%2D2em%3B%0A%20%20%7D%0A%20%20%23TOC%20li%20%7B%0A%20%20%20%20line%2Dheight%3A%2016px%3B%0A%20%20%7D%0A%0Atable%20%7B%0A%20%20margin%3A%201em%20auto%3B%0A%20%20border%2Dwidth%3A%201px%3B%0A%20%20border%2Dcolor%3A%20%23DDDDDD%3B%0A%20%20border%2Dstyle%3A%20outset%3B%0A%20%20border%2Dcollapse%3A%20collapse%3B%0A%7D%0Atable%20th%20%7B%0A%20%20border%2Dwidth%3A%202px%3B%0A%20%20padding%3A%205px%3B%0A%20%20border%2Dstyle%3A%20inset%3B%0A%7D%0Atable%20td%20%7B%0A%20%20border%2Dwidth%3A%201px%3B%0A%20%20border%2Dstyle%3A%20inset%3B%0A%20%20line%2Dheight%3A%2018px%3B%0A%20%20padding%3A%205px%205px%3B%0A%7D%0Atable%2C%20table%20th%2C%20table%20td%20%7B%0A%20%20border%2Dleft%2Dstyle%3A%20none%3B%0A%
20%20border%2Dright%2Dstyle%3A%20none%3B%0A%7D%0Atable%20thead%2C%20table%20tr%2Eeven%20%7B%0A%20%20background%2Dcolor%3A%20%23f7f7f7%3B%0A%7D%0A%0Ap%20%7B%0A%20%20margin%3A%200%2E5em%200%3B%0A%7D%0A%0Ablockquote%20%7B%0A%20%20background%2Dcolor%3A%20%23f6f6f6%3B%0A%20%20padding%3A%200%2E25em%200%2E75em%3B%0A%7D%0A%0Ahr%20%7B%0A%20%20border%2Dstyle%3A%20solid%3B%0A%20%20border%3A%20none%3B%0A%20%20border%2Dtop%3A%201px%20solid%20%23777%3B%0A%20%20margin%3A%2028px%200%3B%0A%7D%0A%0Adl%20%7B%0A%20%20margin%2Dleft%3A%200%3B%0A%7D%0A%20%20dl%20dd%20%7B%0A%20%20%20%20margin%2Dbottom%3A%2013px%3B%0A%20%20%20%20margin%2Dleft%3A%2013px%3B%0A%20%20%7D%0A%20%20dl%20dt%20%7B%0A%20%20%20%20font%2Dweight%3A%20bold%3B%0A%20%20%7D%0A%0Aul%20%7B%0A%20%20margin%2Dtop%3A%200%3B%0A%7D%0A%20%20ul%20li%20%7B%0A%20%20%20%20list%2Dstyle%3A%20circle%20outside%3B%0A%20%20%7D%0A%20%20ul%20ul%20%7B%0A%20%20%20%20margin%2Dbottom%3A%200%3B%0A%20%20%7D%0A%0Apre%2C%20code%20%7B%0A%20%20background%2Dcolor%3A%20%23f7f7f7%3B%0A%20%20border%2Dradius%3A%203px%3B%0A%20%20color%3A%20%23333%3B%0A%7D%0Apre%20%7B%0A%20%20white%2Dspace%3A%20pre%2Dwrap%3B%20%20%20%20%2F%2A%20Wrap%20long%20lines%20%2A%2F%0A%20%20border%2Dradius%3A%203px%3B%0A%20%20margin%3A%205px%200px%2010px%200px%3B%0A%20%20padding%3A%2010px%3B%0A%7D%0Apre%3Anot%28%5Bclass%5D%29%20%7B%0A%20%20background%2Dcolor%3A%20%23f7f7f7%3B%0A%7D%0A%0Acode%20%7B%0A%20%20font%2Dfamily%3A%20Consolas%2C%20Monaco%2C%20%27Courier%20New%27%2C%20monospace%3B%0A%20%20font%2Dsize%3A%2085%25%3B%0A%7D%0Ap%20%3E%20code%2C%20li%20%3E%20code%20%7B%0A%20%20padding%3A%202px%200px%3B%0A%7D%0A%0Adiv%2Efigure%20%7B%0A%20%20text%2Dalign%3A%20center%3B%0A%7D%0Aimg%20%7B%0A%20%20background%2Dcolor%3A%20%23FFFFFF%3B%0A%20%20padding%3A%202px%3B%0A%20%20border%3A%201px%20solid%20%23DDDDDD%3B%0A%20%20border%2Dradius%3A%203px%3B%0A%20%20border%3A%201px%20solid%20%23CCCCCC%3B%0A%20%20margin%3A%200%205px%3B%0A%7D%0A%0Ah1%20%7B%0A%20%20margin%2Dtop%3A%200%3B%0A%20%20font%2Dsize%3A%
2035px%3B%0A%20%20line%2Dheight%3A%2040px%3B%0A%7D%0A%0Ah2%20%7B%0A%20%20border%2Dbottom%3A%204px%20solid%20%23f7f7f7%3B%0A%20%20padding%2Dtop%3A%2010px%3B%0A%20%20padding%2Dbottom%3A%202px%3B%0A%20%20font%2Dsize%3A%20145%25%3B%0A%7D%0A%0Ah3%20%7B%0A%20%20border%2Dbottom%3A%202px%20solid%20%23f7f7f7%3B%0A%20%20padding%2Dtop%3A%2010px%3B%0A%20%20font%2Dsize%3A%20120%25%3B%0A%7D%0A%0Ah4%20%7B%0A%20%20border%2Dbottom%3A%201px%20solid%20%23f7f7f7%3B%0A%20%20margin%2Dleft%3A%208px%3B%0A%20%20font%2Dsize%3A%20105%25%3B%0A%7D%0A%0Ah5%2C%20h6%20%7B%0A%20%20border%2Dbottom%3A%201px%20solid%20%23ccc%3B%0A%20%20font%2Dsize%3A%20105%25%3B%0A%7D%0A%0Aa%20%7B%0A%20%20color%3A%20%230033dd%3B%0A%20%20text%2Ddecoration%3A%20none%3B%0A%7D%0A%20%20a%3Ahover%20%7B%0A%20%20%20%20color%3A%20%236666ff%3B%20%7D%0A%20%20a%3Avisited%20%7B%0A%20%20%20%20color%3A%20%23800080%3B%20%7D%0A%20%20a%3Avisited%3Ahover%20%7B%0A%20%20%20%20color%3A%20%23BB00BB%3B%20%7D%0A%20%20a%5Bhref%5E%3D%22http%3A%22%5D%20%7B%0A%20%20%20%20text%2Ddecoration%3A%20underline%3B%20%7D%0A%20%20a%5Bhref%5E%3D%22https%3A%22%5D%20%7B%0A%20%20%20%20text%2Ddecoration%3A%20underline%3B%20%7D%0A%0A%2F%2A%20Class%20described%20in%20https%3A%2F%2Fbenjeffrey%2Ecom%2Fposts%2Fpandoc%2Dsyntax%2Dhighlighting%2Dcss%0A%20%20%20Colours%20from%20https%3A%2F%2Fgist%2Egithub%2Ecom%2Frobsimmons%2F1172277%20%2A%2F%0A%0Acode%20%3E%20span%2Ekw%20%7B%20color%3A%20%23555%3B%20font%2Dweight%3A%20bold%3B%20%7D%20%2F%2A%20Keyword%20%2A%2F%0Acode%20%3E%20span%2Edt%20%7B%20color%3A%20%23902000%3B%20%7D%20%2F%2A%20DataType%20%2A%2F%0Acode%20%3E%20span%2Edv%20%7B%20color%3A%20%2340a070%3B%20%7D%20%2F%2A%20DecVal%20%28decimal%20values%29%20%2A%2F%0Acode%20%3E%20span%2Ebn%20%7B%20color%3A%20%23d14%3B%20%7D%20%2F%2A%20BaseN%20%2A%2F%0Acode%20%3E%20span%2Efl%20%7B%20color%3A%20%23d14%3B%20%7D%20%2F%2A%20Float%20%2A%2F%0Acode%20%3E%20span%2Ech%20%7B%20color%3A%20%23d14%3B%20%7D%20%2F%2A%20Char%20%2A%2F%0Acode%20%3E%20span%2Est%20%7B%20color%3A%20%23d14%3B%2
0%7D%20%2F%2A%20String%20%2A%2F%0Acode%20%3E%20span%2Eco%20%7B%20color%3A%20%23888888%3B%20font%2Dstyle%3A%20italic%3B%20%7D%20%2F%2A%20Comment%20%2A%2F%0Acode%20%3E%20span%2Eot%20%7B%20color%3A%20%23007020%3B%20%7D%20%2F%2A%20OtherToken%20%2A%2F%0Acode%20%3E%20span%2Eal%20%7B%20color%3A%20%23ff0000%3B%20font%2Dweight%3A%20bold%3B%20%7D%20%2F%2A%20AlertToken%20%2A%2F%0Acode%20%3E%20span%2Efu%20%7B%20color%3A%20%23900%3B%20font%2Dweight%3A%20bold%3B%20%7D%20%2F%2A%20Function%20calls%20%2A%2F%20%0Acode%20%3E%20span%2Eer%20%7B%20color%3A%20%23a61717%3B%20background%2Dcolor%3A%20%23e3d2d2%3B%20%7D%20%2F%2A%20ErrorTok%20%2A%2F%0A%0A" rel="stylesheet" type="text/css" /> | ||
|
||
</head> | ||
|
||
<body> | ||
|
||
|
||
|
||
<div id="header"> | ||
<h1 class="title">Introduction to humaniformat</h1> | ||
<h4 class="author"><em>Oliver Keyes</em></h4> | ||
<h4 class="date"><em>2015-08-26</em></h4> | ||
</div> | ||
|
||
|
||
<p><code>humaniformat</code> is an R package for formatting and parsing human names. With it, you can reformat names in various ways to standardise them and then take those reformatted names and parse them, splitting out salutations, suffixes, and first, middle and last names.</p> | ||
<div id="formatting" class="section level2"> | ||
<h2>Formatting</h2> | ||
<p>Names come in a lot of different formats, and making something that can machine-read all of them is a pretty difficult problem. Instead, <code>humaniformat</code> comes with formatters designed to standardise common formats for names.</p> | ||
<p>Sometimes names are reversed and comma-separated, like “<code>Keyes, Oliver</code>”. For those you can use <code>format_reverse()</code>, which is designed for precisely this class of name. Names that are <em>not</em> comma-separated won’t be touched.</p> | ||
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(humaniformat) | ||
names <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"Oliver Keyes"</span>, <span class="st">"Keyes, Oliver"</span>) | ||
<span class="kw">format_reverse</span>(names) | ||
|
||
[<span class="dv">1</span>] <span class="st">"Oliver Keyes"</span> <span class="st">"Oliver Keyes"</span></code></pre> | ||
<p>Alternatively, we could be dealing with initials rather than full names, and those are period-separated, but not always in the same way. “G.K. Chesterton” and “G.K.Chesterton” are very similar but from a machine’s point of view look very different - the first would be parsed as a first and last name, and the second as a single first name, when the real answer is that we have a first, middle and last name.</p> | ||
<p><code>format_period</code> takes names with this potentially inconsistent formatting and reworks them to ensure that initials are always space-separated. This makes them a lot easier to parse, and a lot easier to deal with in other programming contexts too:</p> | ||
<pre class="sourceCode r"><code class="sourceCode r">names <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"G.K. Chesterton"</span>, <span class="st">"G.K.Chesterton"</span>) | ||
<span class="kw">format_period</span>(names) | ||
|
||
[<span class="dv">1</span>] <span class="st">"G. K. Chesterton"</span> <span class="st">"G. K. Chesterton"</span></code></pre> | ||
</div> | ||
<div id="parsing-names" class="section level2"> | ||
<h2>Parsing names</h2> | ||
<p>Once you’ve got your formatted names (or even if you haven’t - maybe your names came in a standard format) you can parse them. This produces a data.frame of salutations (“Prof”), first names, middle names, last names, and suffixes (“PhD”):</p> | ||
<pre class="sourceCode r"><code class="sourceCode r">names <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"G.K. Chesterton"</span>, <span class="st">"G.K.Chesterton"</span>) | ||
names <-<span class="st"> </span><span class="kw">format_period</span>(names) | ||
parsed_chestertons <-<span class="st"> </span><span class="kw">parse_names</span>(names) | ||
<span class="kw">str</span>(parsed_chestertons) | ||
|
||
<span class="st">'data.frame'</span>:<span class="st"> </span><span class="dv">2</span> obs. of <span class="dv">6</span> variables: | ||
<span class="st"> </span><span class="er">$</span><span class="st"> </span>salutation :<span class="st"> </span>chr <span class="st">""</span> <span class="st">""</span> | ||
$<span class="st"> </span>first_name :<span class="st"> </span>chr <span class="st">"G."</span> <span class="st">"G."</span> | ||
$<span class="st"> </span>middle_name:<span class="st"> </span>chr <span class="st">"K."</span> <span class="st">"K."</span> | ||
$<span class="st"> </span>last_name :<span class="st"> </span>chr <span class="st">"Chesterton"</span> <span class="st">"Chesterton"</span> | ||
$<span class="st"> </span>suffix :<span class="st"> </span>chr <span class="st">""</span> <span class="st">""</span> | ||
$<span class="st"> </span>full_name :<span class="st"> </span>chr <span class="st">"G. K. Chesterton"</span> <span class="st">"G. K. Chesterton"</span></code></pre> | ||
</div> | ||
<div id="features-and-bugs" class="section level2"> | ||
<h2>Features and bugs</h2> | ||
<p>If you have ideas for other features that would make name handling easier, or find a bug, the best approach is to either <a href="https://github.com/Ironholds/humaniformat/issues">report it</a> or <a href="https://github.com/Ironholds/humaniformat/pulls">add it</a>!</p> | ||
</div> | ||
|
||
|
||
|
||
<!-- dynamically load mathjax for compatibility with self-contained --> | ||
<script> | ||
(function () { | ||
var script = document.createElement("script"); | ||
script.type = "text/javascript"; | ||
script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"; | ||
document.getElementsByTagName("head")[0].appendChild(script); | ||
})(); | ||
</script> | ||
|
||
</body> | ||
</html> |
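The vignette explains why formatting matters before parsing: a whitespace-based parser sees "G.K.Chesterton" as one token. The sketch below illustrates that point with a deliberately naive base-R splitter. `sketch_parse_names` is a hypothetical helper, not the package's `parse_names`, and it ignores salutations and suffixes entirely.

```r
# Hypothetical helper (NOT part of humaniformat): a naive whitespace splitter
# that treats the first token as the first name, the last token as the last
# name, and anything in between as the middle name.
sketch_parse_names <- function(names) {
  tokens <- strsplit(trimws(names), "[[:space:]]+")
  pick <- function(f) vapply(tokens, f, character(1))
  data.frame(
    first_name = pick(function(x) if (length(x) >= 1) x[1] else ""),
    middle_name = pick(function(x) {
      if (length(x) >= 3) paste(x[2:(length(x) - 1)], collapse = " ") else ""
    }),
    last_name = pick(function(x) if (length(x) >= 2) x[length(x)] else ""),
    full_name = names,
    stringsAsFactors = FALSE
  )
}

# Without reformatting, the second name collapses into a single "first name".
parsed <- sketch_parse_names(c("G. K. Chesterton", "G.K.Chesterton"))
parsed$first_name
# [1] "G."             "G.K.Chesterton"
parsed$last_name
# [1] "Chesterton" ""
```

Running a period formatter over the input first gives the splitter three tokens for both names, which is the workflow the vignette recommends.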