budou package¶

Submodules¶

budou.budou module¶

Budou: an automatic organizer tool for beautiful line breaking in CJK

Usage:

budou [--segmenter=<seg>] [--language=<lang>] [--classname=<class>] [--inlinestyle] <source>
budou -h | --help
budou -v | --version

Options:

`-h`, `--help`
	Show this screen.
`-v`, `--version`
	Show version.

`--segmenter=<segmenter>`
	Segmenter to use [default: nlapi].
`--language=<language>`
	Language of the source text.
`--classname=<classname>`
	Class name for output SPAN tags. Use comma-separated value to specify multiple classes.
`--inlinestyle`
	Add `display:inline-block` as inline style attribute.

budou.budou.authenticate(json_path=None)[source]¶

Gets a Natural Language API parser by authenticating the API.

This method is deprecated. Please use budou.parser.get_parser to obtain a parser instead.

Parameters:	json_path (str, optional) – The file path to the service account’s credentials.
Returns:	Parser. (`budou.parser.NLAPIParser`)

budou.budou.main()[source]¶

Budou main method for the command line tool.

budou.budou.parse(source, segmenter='nlapi', language=None, max_length=None, classname=None, attributes=None, inlinestyle=False, **kwargs)[source]¶

Parses input source.

Parameters:

source (str) – Input source to process.
segmenter (str, optional) – Segmenter to use [default: nlapi].
language (str, optional) – Language code.
max_length (int, optional) – Maximum length of a chunk.
classname (str, optional) – Class name of output SPAN tags.
attributes (dict, optional) – Attributes for output SPAN tags.
inlinestyle (bool, optional) – Add display:inline-block as inline style attribute.

Returns:

A dict of results. `chunks` holds a list of chunks (`budou.chunk.ChunkList`) and `html_code` holds the output HTML code.

budou.cachefactory module¶

Budou cache factory class.

class budou.cachefactory.AppEngineMemcache[source]¶

Bases: budou.cachefactory.BudouCache

Cache system with google.appengine.api.memcache backend.

memcache¶

Memcache service.

Type:	`google.appengine.api.memcache`

get(key)[source]¶

Gets a value by a key.

Parameters:	key (str) – Key to retrieve the value.
Returns:	Retrieved value (str or None).

set(key, val)[source]¶

Sets a value in a key.

Parameters:	key (str) – Key for the value. val (str) – Value to set.

class budou.cachefactory.BudouCache[source]¶

Bases: object

Base class for cache system.

get(key)[source]¶

Abstract method: Gets a value by a key.

Parameters:	key (str) – Key to retrieve the value.
Returns:	Retrieved value (str or None).
Raises:	`NotImplementedError` – If it’s not implemented.

set(key, val)[source]¶

Abstract method: Sets a value in a key.

Parameters:	key (str) – Key for the value. val (str) – Value to set.
Raises:	`NotImplementedError` – If it’s not implemented.
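
A minimal sketch of how a backend might fill in the two abstract methods. `BaseCache` and `DictCache` are hypothetical stand-ins defined here so the example runs without the library; they are not part of budou.

```python
class BaseCache(object):
    """Hypothetical stand-in for an abstract cache base class."""

    def get(self, key):
        raise NotImplementedError()

    def set(self, key, val):
        raise NotImplementedError()


class DictCache(BaseCache):
    """Hypothetical in-memory backend that implements both methods."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        # Return None when the key is missing, matching "str or None".
        return self._store.get(key)

    def set(self, key, val):
        self._store[key] = val


cache = DictCache()
cache.set('greeting', 'hello')
print(cache.get('greeting'))
print(cache.get('missing'))
```

Calling `get` or `set` on the base class itself raises `NotImplementedError`, which is the contract the abstract methods above describe.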

class budou.cachefactory.PickleCache(filename)[source]¶

Bases: budou.cachefactory.BudouCache

Cache system with pickle backend.

Parameters:	filename (str) – The file path to the cache file.

filename¶

The file path to the cache file.

Type:	str

DEFAULT_FILE_NAME = '/tmp/budou-cache.pickle'¶

The default path to the cache file.

get(key)[source]¶

Gets a value by a key.

Parameters:	key (str) – Key to retrieve the value.
Returns:	Retrieved value (str or None).

set(key, val)[source]¶

Sets a value in a key.

Parameters:	key (str) – Key for the value. val (str) – Value to set.
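
A minimal sketch of a pickle-backed cache with the same `get`/`set` interface. `FilePickleCache` is a hypothetical class written for illustration; how the library reads and writes its file is not described here, so the load-modify-dump approach is an assumption.

```python
import os
import pickle
import tempfile


class FilePickleCache(object):
    """Hypothetical file-backed cache; not the library's implementation."""

    def __init__(self, filename):
        self.filename = filename

    def _load(self):
        # Treat a missing file as an empty cache.
        if not os.path.exists(self.filename):
            return {}
        with open(self.filename, 'rb') as f:
            return pickle.load(f)

    def get(self, key):
        return self._load().get(key)

    def set(self, key, val):
        data = self._load()
        data[key] = val
        with open(self.filename, 'wb') as f:
            pickle.dump(data, f)


path = os.path.join(tempfile.mkdtemp(), 'demo-cache.pickle')
cache = FilePickleCache(path)
cache.set('k', 'v')
# A second instance pointed at the same file sees the stored value.
print(FilePickleCache(path).get('k'))
```

Rewriting the whole file on every `set` keeps the sketch short; it is not tuned for large caches or concurrent writers.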

budou.cachefactory.load_cache(filename=None)[source]¶

Returns a cache service.

If Google App Engine Standard Environment’s memcache is available, this uses memcache as the backend. Otherwise, this uses pickle to cache the outputs in the local file system.

Parameters:	filename (str, optional) – The file path to the cache file. This is used only when `pickle` is used as the backend.
Returns:	A cache system (`budou.cachefactory.BudouCache`)
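
The fallback described above can be sketched as an import probe. This is an assumption about how availability might be detected; `choose_backend` is a hypothetical helper, not a budou function.

```python
def choose_backend():
    """Returns 'memcache' if the App Engine module imports, else 'pickle'."""
    try:
        # Only importable inside the App Engine Standard Environment
        # (or where its SDK is installed).
        from google.appengine.api import memcache  # noqa: F401
    except ImportError:
        return 'pickle'
    return 'memcache'


print(choose_backend())
```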

budou.chunk module¶

Chunk module as a unit of word segment with helpers.

class budou.chunk.Chunk(word, pos=None, label=None, dependency=None)[source]¶

A unit for word segmentation.

word¶

Surface word of the chunk.

Type:	str

pos¶

Part of speech.

Type:	str, optional

label¶

Label information.

Type:	str, optional

dependency¶

Dependency to neighbor words. None for no dependency, True for dependency to the following word, and False for the dependency to the previous word.

Type:	bool, optional

Parameters:	word (str) – Surface word of the chunk. pos (str, optional) – Part of speech. label (str, optional) – Label information. dependency (bool, optional) – Dependency to neighbor words. `None` for no dependency, `True` for dependency to the following word, and `False` for the dependency to the previous word.

classmethod breakline()[source]¶

Creates breakline Chunk.

Returns:	A chunk (`budou.chunk.Chunk`)

has_cjk()[source]¶

Checks if the word of the chunk contains CJK characters.

This is using unicode codepoint ranges from https://github.com/nltk/nltk/blob/develop/nltk/tokenize/util.py#L149

Returns:	True if the chunk has any CJK character.
Return type:	bool
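
A minimal sketch of a codepoint-range check. The ranges below are an illustrative subset chosen for this example (an assumption); the linked NLTK file lists more ranges, and the library may use a different set.

```python
# Illustrative subset of CJK codepoint ranges (inclusive bounds).
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def has_cjk(word):
    """Returns True if any character of word falls in one of the ranges."""
    return any(
        start <= ord(char) <= end
        for char in word
        for start, end in CJK_RANGES)


print(has_cjk('Google'))
print(has_cjk('使った'))
```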

is_open_punct()[source]¶

Whether the chunk is an open punctuation mark.

Ps: Punctuation, open (e.g. opening bracket characters)
Pi: Punctuation, initial quote (e.g. opening quotation mark)

See also https://en.wikipedia.org/wiki/Unicode_character_property

Returns:	True if it is an open punctuation mark.
Return type:	bool

is_punct()[source]¶

Whether the chunk is a punctuation mark.

Returns:	True if it is a punctuation mark.
Return type:	bool
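
The two punctuation checks above can be sketched with the standard `unicodedata` module, which exposes each character's Unicode general category. Restricting the check to single-character strings is an assumption of this sketch, and these standalone functions are illustrative rather than the library's methods.

```python
import unicodedata


def is_punct(char):
    """True if the character's general category starts with 'P'."""
    return len(char) == 1 and unicodedata.category(char).startswith('P')


def is_open_punct(char):
    """True if the category is Ps (open) or Pi (initial quote)."""
    return len(char) == 1 and unicodedata.category(char) in ('Ps', 'Pi')


for char in ('(', ')', '\u300c', '\u201c', 'a'):
    print(repr(char), is_punct(char), is_open_punct(char))
```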

is_space()[source]¶

Whether the chunk is a space.

Returns:	True if it is a space.
Return type:	bool

serialize()[source]¶

Returns the serialized chunk data as a dictionary.

classmethod space()[source]¶

Creates space Chunk.

Returns:	A chunk (`budou.chunk.Chunk`)

class budou.chunk.ChunkList(*args)[source]¶

Bases: _abcoll.MutableSequence

List of budou.chunk.Chunk with some helpers.

This list accepts only instances of budou.chunk.Chunk.

Example

from budou.chunk import Chunk, ChunkList
chunks = ChunkList(Chunk('abc'), Chunk('def'))
chunks.append(Chunk('ghi'))  # OK
chunks.append('jkl')         # NG

Parameters:	args (list of `budou.chunk.Chunk`) – Initial values included in the list.
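
A minimal sketch of a list that accepts only one element type, built on `collections.abc.MutableSequence` (the Python 3 location of the base class shown above). `Item` and `TypedList` are hypothetical names, and raising `TypeError` for a rejected value is an assumption; the example above only marks that case as NG.

```python
from collections.abc import MutableSequence


class Item(object):
    """Hypothetical element type standing in for a chunk."""

    def __init__(self, word):
        self.word = word


class TypedList(MutableSequence):
    """Hypothetical list that only accepts Item instances."""

    def __init__(self, *args):
        self._items = []
        for arg in args:
            self.append(arg)  # append() routes through insert()

    def __getitem__(self, index):
        return self._items[index]

    def __setitem__(self, index, value):
        self._check(value)
        self._items[index] = value

    def __delitem__(self, index):
        del self._items[index]

    def __len__(self):
        return len(self._items)

    def insert(self, index, value):
        self._check(value)
        self._items.insert(index, value)

    @staticmethod
    def _check(value):
        if not isinstance(value, Item):
            raise TypeError('only Item instances are accepted')


items = TypedList(Item('abc'), Item('def'))
items.append(Item('ghi'))
print([item.word for item in items])
```

Because `MutableSequence` derives `append`, `extend`, and similar methods from `insert`, validating in `insert` and `__setitem__` covers every way of adding an element.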

get_overlaps(offset, length)[source]¶

Returns chunks overlapped with the given range.

Parameters:	offset (int) – Begin offset of the range. length (int) – Length of the range.
Returns:	Overlapped chunks. (`budou.chunk.ChunkList`)
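
A minimal sketch of the overlap test on plain strings. It assumes that `offset` and `length` are character offsets into the concatenated words and that ranges are half-open; neither detail is stated above. The standalone function is illustrative only.

```python
def get_overlaps(words, offset, length):
    """Returns the words whose character span intersects the given range."""
    overlaps = []
    index = 0  # running start offset of the current word
    for word in words:
        word_start, word_end = index, index + len(word)
        # Half-open intervals [start, end) intersect when each one
        # starts before the other ends.
        if word_start < offset + length and offset < word_end:
            overlaps.append(word)
        index = word_end
    return overlaps


words = ['abc', 'def', 'ghi']
print(get_overlaps(words, 2, 2))
```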

html_serialize(attributes, max_length=None)[source]¶

Returns concatenated HTML code with SPAN tag.

Parameters:	attributes (dict) – A map of name-value pairs for attributes of output SPAN tags. max_length (int, optional) – Maximum length of span enclosed chunk.
Returns:	The organized HTML code. (str)
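
A rough sketch of SPAN serialization on plain strings, written for Python 3. It assumes every word is wrapped unless it is longer than `max_length`, and that the result is enclosed in one outer SPAN; the library's actual rules (for example, for non-CJK chunks) may differ. The standalone function is illustrative only.

```python
from html import escape


def html_serialize(words, attributes, max_length=None):
    """Wraps each word in a SPAN unless it is longer than max_length."""
    attr_text = ''.join(
        ' {}="{}"'.format(name, escape(str(value), quote=True))
        for name, value in sorted(attributes.items()))
    parts = []
    for word in words:
        if max_length is not None and len(word) > max_length:
            parts.append(escape(word))  # too long: leave unwrapped
        else:
            parts.append('<span{}>{}</span>'.format(attr_text, escape(word)))
    return '<span>{}</span>'.format(''.join(parts))


print(html_serialize(['今日は', '晴れ。'], {'class': 'w'}))
```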

insert(index, value)[source]¶

Inserts value before index.

resolve_dependencies()[source]¶

Resolves chunk dependencies by concatenating the dependent chunks.
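
A minimal sketch of dependency resolution on `(word, dependency)` pairs, following the meaning of the `dependency` flag documented for `Chunk`: `True` joins a word to the following one, `False` joins it to the previous one. The two-pass approach and the tuple representation are assumptions of this example.

```python
def resolve_dependencies(chunks):
    """Merges (word, dependency) pairs according to their dependency flag."""
    # Pass 1: a True flag attaches the word to the one that follows it.
    forward = []
    pending = ''
    for word, dependency in chunks:
        if dependency is True:
            pending += word
        else:
            forward.append((pending + word, dependency))
            pending = ''
    if pending:
        forward.append((pending, None))  # nothing follows; keep as-is

    # Pass 2: a False flag attaches the word to the one before it.
    merged = []
    for word, dependency in forward:
        if dependency is False and merged:
            merged[-1] += word
        else:
            merged.append(word)
    return merged


chunks = [('「', True), ('今日', None), ('は', False),
          ('晴れ', None), ('。', False)]
print(resolve_dependencies(chunks))
```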

swap(old_chunks, new_chunk)[source]¶

Swaps old consecutive chunks with new chunk.

Parameters:	old_chunks (`budou.chunk.ChunkList`) – List of consecutive Chunks to be removed. new_chunk (`budou.chunk.Chunk`) – A Chunk to be inserted.
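
A minimal sketch of the swap on a plain Python list. It locates the run by the first occurrence of its first element, which is a simplification made for this example; the error handling is also an assumption.

```python
def swap(items, old_items, new_item):
    """Replaces the consecutive run old_items inside items with new_item."""
    if not old_items:
        raise ValueError('old_items must not be empty')
    start = items.index(old_items[0])  # first occurrence only
    end = start + len(old_items)
    if items[start:end] != list(old_items):
        raise ValueError('old_items must appear consecutively in items')
    # Slice assignment removes the run and inserts the new item in place.
    items[start:end] = [new_item]


items = ['a', 'b', 'c', 'd']
swap(items, ['b', 'c'], 'bc')
print(items)
```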

budou.mecabsegmenter module¶

MeCab based Segmenter.

Word segmenter module powered by MeCab. You need to install MeCab to use this segmenter. The easiest way to install MeCab is to run `make install-mecab`. The script downloads the source code from GitHub and builds the tool. It also sets up IPAdic, a standard dictionary for Japanese.

class budou.mecabsegmenter.MecabSegmenter[source]¶

Bases: budou.segmenter.Segmenter

MeCab Segmenter.

tagger¶

MeCab Tagger to parse the input sentence.

Type:	MeCab.Tagger

supported_languages¶

List of supported languages’ codes.

Type:	list of str

segment(source, language=None)[source]¶

Returns a chunk list from the given sentence.

Parameters:	source (str) – Source string to segment. language (str, optional) – A language code.
Returns:	A chunk list. (`budou.chunk.ChunkList`)
Raises:	`ValueError` – If `language` is given and it is not included in `supported_languages`.

supported_languages = set(['ja'])

budou.nlapisegmenter module¶

Natural Language API based Segmenter.

Word segmenter module powered by Cloud Natural Language API. You need to enable the API in your Google Cloud Platform project before you use this module.

Example

Once you have enabled the API, download a service account’s credentials file and set its path as the GOOGLE_APPLICATION_CREDENTIALS environment variable.

$ export GOOGLE_APPLICATION_CREDENTIALS='/path/to/credentials.json'

Alternatively, you can pass the path to your credentials file when obtaining a parser.

parser = budou.get_parser(
    'nlapi', credentials_path='/path/to/credentials.json')

This module includes a caching system so that the same source sentence is not sent to the API more than once, because API requests may incur costs. The caching system is provided by budou.cachefactory, which chooses an appropriate backend based on the environment.
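
A minimal sketch of the caching idea: responses are stored under a key derived from the source and language, so a repeated request never reaches the backend. `CachedRequester`, its key scheme, and the `fetch` callable are hypothetical and only illustrate the pattern.

```python
import hashlib


class CachedRequester(object):
    """Hypothetical wrapper that memoizes responses per (source, language)."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable that performs the real request
        self._cache = {}
        self.request_count = 0   # how many times fetch was actually called

    def _key(self, source, language):
        raw = u'{}:{}'.format(language, source).encode('utf-8')
        return hashlib.sha256(raw).hexdigest()

    def get(self, source, language=None):
        key = self._key(source, language)
        if key not in self._cache:
            self.request_count += 1
            self._cache[key] = self._fetch(source, language)
        return self._cache[key]


# A stand-in for the costly API call.
requester = CachedRequester(lambda source, language: source.split())
requester.get('a b c', 'en')
requester.get('a b c', 'en')
print(requester.request_count)
```

Including the language in the key keeps results for the same text in different languages separate.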

class budou.nlapisegmenter.NLAPISegmenter(cache_filename, credentials_path, use_entity, use_cache, cache_discovery=True, service=None)[source]¶

Bases: budou.segmenter.Segmenter

Natural Language API Segmenter.

service¶

A resource object for interacting with Cloud Natural Language API.

cache_filename¶

File path to the cache file.

Type:	str

supported_languages¶

List of supported languages’ codes.

Type:	list of str

Parameters:

cache_filename (str, optional) – File path to the pickle file for caching. The file is created automatically if it does not exist. In Google App Engine Standard Environment, if the memcache service is available, memcache is used for caching and the pickle file is not generated.
credentials_path (str, optional) – File path to the service account’s credentials file. If no file path is specified, it tries to authenticate with default credentials.
use_entity (bool, optional) – Whether to use entity analysis results to wrap entity names in the output.
use_cache (bool, optional) – Whether to use a cache system.
cache_discovery (bool, optional) – Whether to use the cache to build the natural language API service [default: True]. When using oauth2client >= 4.0.0 or google-auth, its value should be False.
service (googleapiclient.discovery.Resource, optional) – A Resource object for interacting with Cloud Natural Language API. If this is given, the constructor skips the authentication process and uses this service instead.

segment(source, language=None)[source]¶

Returns a chunk list from the given sentence.

Parameters:	source (str) – Source string to segment. language (str, optional) – A language code.
Returns:	A chunk list. (`budou.chunk.ChunkList`)
Raises:	`ValueError` – If `language` is given and it is not included in `supported_languages`.

supported_languages = set([u'ja', u'ko', u'zh', u'zh-CN', u'zh-HK', u'zh-Hant', u'zh-TW'])

budou.parser module¶

Parser modules.

Parser modules have a parse method, which processes the input text into a list of chunks and an HTML snippet.

Examples

import budou
parser = budou.get_parser('nlapi')
results = parser.parse('Google Home を使った。', classname='w')
print(results['html_code'])
# <span>Google <span class="w">Home を</span>
# <span class="w">使った。</span></span>

chunks = results['chunks']
print(chunks[1].word)  # Home を

class budou.parser.MecabParser[source]¶

Bases: budou.parser.Parser

Parser built on Mecab Segmenter (budou.mecabsegmenter.MecabSegmenter).

segmenter¶

Segmenter module.

Type:	`budou.mecabsegmenter.MecabSegmenter`

class budou.parser.NLAPIParser(**options)[source]¶

Bases: budou.parser.Parser

Parser built on Cloud Natural Language API Segmenter (budou.nlapisegmenter.NLAPISegmenter).

Parameters:

cache_filename (str, optional) – The path to the cache file.
credentials_path (str, optional) – The path to the service account’s credentials file.
use_entity (bool, optional) – Whether to use entity analysis results to wrap entity names in the output.
use_cache (bool, optional) – Whether to use a cache system.
service (googleapiclient.discovery.Resource, optional) – A Resource object for interacting with Cloud Natural Language API. If this is given, the constructor skips the authentication process and uses this service instead.

segmenter¶

Segmenter module.

Type:	`budou.nlapisegmenter.NLAPISegmenter`

class budou.parser.Parser[source]¶

Bases: object

Abstract parser class.

segmenter¶

Segmenter module.

Type:	`budou.segmenter.Segmenter`

parse(source, language=None, classname=None, max_length=None, attributes=None, inlinestyle=False)[source]¶

Parses the source sentence to output organized HTML code.

Parameters:	source (str) – Source sentence to process. language (str, optional) – Language code. classname (str, optional) – Class name of output SPAN tags. max_length (int, optional) – Maximum length of a chunk. attributes (dict, optional) – Attributes for output SPAN tags. inlinestyle (bool, optional) – Add `display:inline-block` as inline style attribute.
Returns:	A dictionary containing `chunks` (`budou.chunk.ChunkList`) and `html_code` (str).

class budou.parser.TinysegmenterParser[source]¶

Bases: budou.parser.Parser

Parser built on TinySegmenter Segmenter (budou.tinysegmentersegmenter.TinysegmenterSegmenter).

segmenter¶

Segmenter module.

Type:	`budou.tinysegmentersegmenter.TinysegmenterSegmenter`

budou.parser.get_parser(segmenter, **options)[source]¶

Gets a parser.

Parameters:	segmenter (str) – Segmenter to use. options (dict, optional) – Optional settings.
Returns:	Parser (`budou.parser.Parser`)
Raises:	`ValueError` – If unsupported segmenter is specified.
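
A minimal sketch of name-based dispatch with a `ValueError` for unknown names. The registry and `EchoParser` are hypothetical; how the library maps segmenter names to parser classes is not shown here.

```python
class EchoParser(object):
    """Hypothetical parser used only to populate the registry."""

    def __init__(self, **options):
        self.options = options


PARSERS = {'echo': EchoParser}  # hypothetical name-to-class registry


def get_parser(segmenter, **options):
    """Instantiates the parser registered under the given segmenter name."""
    try:
        parser_class = PARSERS[segmenter]
    except KeyError:
        raise ValueError(
            'Segmenter {} is not supported.'.format(segmenter))
    return parser_class(**options)


parser = get_parser('echo', use_cache=False)
print(parser.options)
```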

budou.parser.parse_attributes(attributes=None, classname=None, inlinestyle=False)[source]¶

Parses attributes.

Parameters:	attributes (dict) – Input attributes. classname (str, optional) – Class name of output SPAN tags. inlinestyle (bool, optional) – Add `display:inline-block` as inline style attribute.
Returns:	Parsed attributes. (dict)
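
A rough sketch of how the three inputs might be merged into one attribute dict. Turning comma-separated class names into a space-separated `class` value and appending to an existing `style` are assumptions of this example, not documented behavior.

```python
def parse_attributes(attributes=None, classname=None, inlinestyle=False):
    """Merges classname and inlinestyle into a copy of attributes."""
    result = dict(attributes or {})
    if classname:
        # Assumption: comma-separated names become space-separated,
        # as the HTML class attribute expects.
        result['class'] = ' '.join(
            name.strip() for name in classname.split(','))
    if inlinestyle:
        style = result.get('style', '')
        if style and not style.rstrip().endswith(';'):
            style += ';'
        result['style'] = style + 'display:inline-block'
    return result


print(parse_attributes(classname='a,b', inlinestyle=True))
```

Copying the input dict first means the caller's `attributes` argument is never modified.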

budou.parser.preprocess(source)[source]¶

Removes unnecessary line breaks and white spaces.

Parameters:	source (str) – Input sentence.
Returns:	Preprocessed sentence. (str)
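
A minimal sketch of this kind of cleanup using regular expressions. Dropping line breaks entirely (rather than replacing them with spaces) and collapsing whitespace runs to a single space are assumptions of this example.

```python
import re


def preprocess(source):
    """Drops line breaks, collapses runs of whitespace, and trims the ends."""
    without_breaks = re.sub(r'[\r\n]+', '', source)
    collapsed = re.sub(r'\s+', ' ', without_breaks)
    return collapsed.strip()


print(preprocess('  今日は\n晴れ。  '))
```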

budou.segmenter module¶

Segmenter module.

class budou.segmenter.Segmenter[source]¶

Bases: object

Base class for Segmenter modules.

segment(source, language=None)[source]¶

Returns a chunk list from the given sentence.

Parameters:	source (str) – Source string to segment. language (str, optional) – A language code.
Returns:	A chunk list. (`budou.chunk.ChunkList`)
Raises:	`NotImplementedError` – If not implemented.
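
A minimal sketch of a subclass that implements `segment` and validates the language the way the concrete segmenters in this package are documented to. `BaseSegmenter` and `WhitespaceSegmenter` are hypothetical stand-ins, and the sketch returns a plain list of strings rather than a `budou.chunk.ChunkList`.

```python
class BaseSegmenter(object):
    """Hypothetical stand-in for an abstract segmenter base class."""

    def segment(self, source, language=None):
        raise NotImplementedError()


class WhitespaceSegmenter(BaseSegmenter):
    """Hypothetical subclass that splits on whitespace."""

    supported_languages = {'en'}

    def segment(self, source, language=None):
        # Reject a language only when one is given and it is unsupported.
        if language and language not in self.supported_languages:
            raise ValueError(
                'Language {} is not supported.'.format(language))
        return source.split()


segmenter = WhitespaceSegmenter()
print(segmenter.segment('line breaking in CJK', language='en'))
```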

budou.tinysegmentersegmenter module¶

TinySegmenter based Segmenter.

Word segmenter module powered by TinySegmenter, a compact Japanese tokenizer originally developed by Taku Kudo. This is built on its Python port (https://pypi.org/project/tinysegmenter3/) developed by Tatsuro Yasukawa.

class budou.tinysegmentersegmenter.TinysegmenterSegmenter[source]¶

Bases: budou.segmenter.Segmenter

TinySegmenter based Segmenter.

supported_languages¶

List of supported languages’ codes.

Type:	list of str

segment(source, language=None)[source]¶

Returns a chunk list from the given sentence.

Parameters:	source (str) – Source string to segment. language (str, optional) – A language code.
Returns:	A chunk list. (`budou.chunk.ChunkList`)
Raises:	`ValueError` – If `language` is given and it is not included in `supported_languages`.

supported_languages = set(['ja'])

budou.tinysegmentersegmenter.is_hiragana(word)[source]¶

Checks if the word is Japanese hiragana.

This is using the unicode codepoint range for hiragana. https://en.wikipedia.org/wiki/Hiragana_(Unicode_block)

Parameters:	word (str) – A word.
Returns:	True if the word is a hiragana.
Return type:	bool
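
A minimal sketch of a range check against the Hiragana Unicode block (U+3040 to U+309F). Requiring every character of the word to be in the block, and returning False for an empty string, are assumptions of this example.

```python
def is_hiragana(word):
    """True if every character lies in the Hiragana block U+3040..U+309F."""
    return bool(word) and all(0x3040 <= ord(char) <= 0x309F for char in word)


print(is_hiragana('ひらがな'))
print(is_hiragana('カタカナ'))
```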

Module contents¶

Package indicator for budou.
