The original plan was to lift the regexes directly, but I’d forgotten that Standard Ebooks is a GPLv3 codebase, whereas this project is MIT-licensed. We obviously can’t copy everything over directly, so the new plan is that I’ll port my own original contributions, plus anything by other contributors that they agree to contribute.
At Standard Ebooks we use python-titlecase to format a bunch of stuff throughout our productions (thanks!) but we also have some additional rules and changes to meet our specific needs. These start at [redacted]; the comments as a list give you a good overview:
- Uppercase Roman numerals, but only if they are valid Roman numerals and they are not `MIX` (which is much more likely to be an English word than a Roman numeral) or `DI` which may be an Italian word
- Lowercase `and`, `or` even if preceded by punctuation
- pip_titlecase capitalizes all prepositions preceded by parenthesis; we only want to capitalize ones that aren't the first word of a subtitle
  - OK: `From Sergeant Bulmer (of the Detective Police) to Mr. Pendril`
  - OK: `Three Men in a Boat (To Say Nothing of the Dog)`
- Uppercase words preceded by en or em dash
- Lowercase `and`, if it's not the very first word, and not preceded by an em-dash
- Lowercase `the`, if preceded by a dash (like `Puss-in-Boots` or `Jack-in-the-Box`)
- Lowercase "in", if followed by a semicolon (but not words like "inheritance")
- Lowercase `th’`, sometimes used poetically
- Lowercase `o’`
- Uppercase words that begin compound words, like `to-night` (which might appear in poetry)
- Lowercase `from`, `with`, as long as they're not the first word and not preceded by a parenthesis
- Capitalise the first word after an opening quote or italicisation that signifies a work (this relies on SE-specific markup)
- Lowercase `the` if preceded by `vs.`
- Lowercase `de`, `von`, `van`, `le`, `du` as in `Charles de Gaulle`, `Werner von Braun`, etc., and if not the first word and not preceded by an “
- Uppercase word following `Or,`, since it is probably a subtitle
- Uppercase word following `:`, except `or,`, which indicates a kind of subtitle
- Uppercase words after an initial contraction, like `O'Keefe` or `L'Affaire`, but only if there are at least 3 letters after, to prevent catching things like `I'm` or `E're`
- Uppercase letter after `Mc`
- Uppercase first letter after beginning contraction
- Uppercase first letter
- Lowercase `by`
- Lowercase leading `d’`, as in `Marie d’Elle`
- Uppercase `l’` as in `l’Affaire`, but not if it's the first letter
- Uppercase leading `A-` as in `A-Breaking`
- Uppercase some known initialisms
- Lowercase `À` (as in `À La Carte`) unless it's the first word
- Uppercase initialisms
- Uppercase `No.` as in Number
- Lowercase `V.` as in versus in a legal case
- Lowercase `mm` (millimeters, as in `50 mm gun`) unless it's followed by a period in which case it's likely `Mm.` (Monsieurs)
- Lowercase `al-` (as in the Arabic definite article) unless it’s the first word
- …and some special cases
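
To make a few of these rules concrete, here is a minimal sketch of how three of them (Roman numerals, `Mc`, and `the` after `vs.`) could be written as regex post-processing on an already title-cased string. It is written from scratch for illustration and is neither the Standard Ebooks implementation nor python-titlecase code; `post_process`, `fix_roman`, `ROMAN`, and `NOT_NUMERALS` are hypothetical names, and the patterns are deliberately simplified.

```python
import re

# Strict Roman numeral shape (1 to 3999). The pattern alone also matches the
# empty string, but it is only ever applied to non-empty words below.
ROMAN = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})",
    re.IGNORECASE,
)

# Valid numerals that are more likely to be ordinary words (from the list above).
NOT_NUMERALS = {"MIX", "DI"}


def fix_roman(match):
    """Uppercase a word if it is a valid Roman numeral and not excluded."""
    word = match.group(0)
    if ROMAN.fullmatch(word) and word.upper() not in NOT_NUMERALS:
        return word.upper()
    return word


def post_process(text):
    """Apply three illustrative rules to an already title-cased string."""
    # Uppercase valid Roman numerals, skipping MIX and DI.
    text = re.sub(r"\b[IVXLCDM]+\b", fix_roman, text, flags=re.IGNORECASE)
    # Uppercase the letter after a leading "Mc".
    text = re.sub(r"\bMc([a-z])", lambda m: "Mc" + m.group(1).upper(), text)
    # Lowercase "The" when it directly follows "vs."
    text = re.sub(r"(\bvs\.\s+)The\b", r"\1the", text)
    return text
```

For example, `post_process("Henry Viii vs. The Mix of Mcdonald")` gives `Henry VIII vs. the Mix of McDonald`. A real version would need more exclusions and careful rule ordering: as written, this sketch would also uppercase a standalone `mm`, which conflicts with the `mm` rule above.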
Would python-titlecase be interested in any of these? I’d be happy to upstream them as PRs.