Natural Language Processing Toolkit written in Go
Progress has been slow, but I am fully committed to NLPT, even if there are silent spells. Some of my work on NLP and Go cannot be open-sourced for given periods of time. I am a linguist by trade, so NLP is always something I want to write code for; I now have the chance to work on Go professionally in a more open-source friendly capacity, which bodes well for this project.
siw is the Simple Words project and is the place where I've been playing out some basic ideas in shorter form. If you don't see any progress on nlpt, you can be sure that siw has something brewing.
exp branch is the experimental branch; this will be the default branch until more progress is made on the core architecture (tokenizer, stemmer, and tagger).
This is the tested, stable, and production-ready branch of a research project to write natural language processing tools in Go. NLPT is built up from multiple sub-packages, each separately accessible.
Get it:
go get github.com/jbowles/nlpt
go get -u github.com/jbowles/nlpt
Functionality is separated into sub-packages, which are usable outside the scope of the main NLPT project. Each sub-package is named consistently: the 3-letter prefix nlp + the sub-package name. For example: tokenizer = nlptokenizer, stemmer = nlpstemmer, tagger = nlptagger.
Get a subpackage:
go get github.com/jbowles/nlpt/nlptokenizer
go get -u github.com/jbowles/nlpt/nlptokenizer
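Once fetched, a sub-package is imported by its full path. A hypothetical usage sketch (the Tokenize function here is assumed for illustration, not the documented API):

```go
package main

import (
	"fmt"

	"github.com/jbowles/nlpt/nlptokenizer"
)

func main() {
	// Tokenize is assumed here for illustration only; check the
	// sub-package documentation for the actual exported API.
	fmt.Println(nlptokenizer.Tokenize("the quick brown fox"))
}
```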
Thanks to the Go Berlin users group for letting me rip off their gopher image!
Branches: exp, stable.
Development workflow: exp -> stable -> master
Stability (Experimental, Stable, Production): whether the API is production ready.
Volatility (Radical, Mild, Stable): whether the API is likely to change.
Tests (Nil, Some, Stable): the range of test coverage over the API.
Examples: link to an external repo with more documentation and examples.
NLPT broadly supports minimal functionality for the full range of Unicode code points (up to 4 bytes in UTF-8).
Done:
Bucket: th0s7!e => ["thse"], [0, 7], ["!"] (letter, digit, and punctuation buckets; a minimal sketch of the idea is below)
In Progress:
Run tests and benchmarks:
go test -v
go test -benchmem -bench .
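For illustration, a benchmark that the commands above would pick up might look like this sketch (the tokenize helper is a stand-in so the file is self-contained, not the package's real API):

```go
package nlpt_test

import (
	"strings"
	"testing"
)

// tokenize is a stand-in for the package's tokenizer, included only so the
// benchmark compiles on its own; it is not nlpt's actual API.
func tokenize(s string) []string { return strings.Fields(s) }

// BenchmarkTokenize is the kind of function `go test -benchmem -bench .`
// runs; b.N is scaled automatically by the testing package.
func BenchmarkTokenize(b *testing.B) {
	for i := 0; i < b.N; i++ {
		tokenize("th0s7!e and some more text")
	}
}
```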
Stability: 2 - Stable
Volatility: 2 - Stable
Tests: 2 - Stable
The tokenizer leverages the Go rune type (an alias for int32, representing a Unicode code point). Basically, you can build custom Unicode alphabets that are used for pattern matching instead of regular expressions; a sketch of this idea is shown below. General goals:
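Here is a minimal sketch of the custom-alphabet idea: membership in a set of runes replaces a regular expression. The alphabet type and its methods are hypothetical, not nlpt's API:

```go
package main

import "fmt"

// alphabet is a custom set of runes; membership tests replace regex matching.
type alphabet map[rune]bool

// newAlphabet builds an alphabet from the runes of a string.
func newAlphabet(runes string) alphabet {
	a := make(alphabet)
	for _, r := range runes {
		a[r] = true
	}
	return a
}

// filter keeps only the runes of s that belong to the alphabet.
func (a alphabet) filter(s string) string {
	out := make([]rune, 0, len(s))
	for _, r := range s {
		if a[r] {
			out = append(out, r)
		}
	}
	return string(out)
}

func main() {
	vowels := newAlphabet("aeiouåäö") // any Unicode code points work
	fmt.Println(vowels.filter("tokenizer")) // oeie
}
```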
The Tf-Idf sub-package is not done; I've just been playing with different ways of doing it. There is no full model finished yet, so the first implementation is incomplete. A sketch of the textbook formulation is below.
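For reference, a sketch of one textbook formulation I've been playing with, tf-idf(t, d) = tf(t, d) * log(N / df(t)). This is exploratory, not the finished model:

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// tfidf computes the classic tf-idf weight for a term in one document of a
// corpus: raw term frequency times the log of (corpus size / document
// frequency). Exploratory sketch only.
func tfidf(term string, doc []string, corpus [][]string) float64 {
	tf := 0.0
	for _, w := range doc {
		if w == term {
			tf++
		}
	}
	df := 0.0
	for _, d := range corpus {
		for _, w := range d {
			if w == term {
				df++
				break
			}
		}
	}
	if df == 0 {
		return 0
	}
	return tf * math.Log(float64(len(corpus))/df)
}

func main() {
	corpus := [][]string{
		strings.Fields("the cat sat"),
		strings.Fields("the dog ran"),
	}
	fmt.Println(tfidf("cat", corpus[0], corpus)) // 1 * log(2/1)
}
```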
Stability: 0 - Not Started
Volatility: 0 - Not Started
Tests: 0 - Not Started
Stability: 0 - Not Started
Volatility: 0 - Not Started
Tests: 0 - Not Started
Stability: 0 - Not Started
Volatility: 0 - Not Started
Tests: 0 - Not Started
See the IDF entry in the Stanford Introduction to Information Retrieval book for details.
One reason for knowing about NLP services is that they can be used for testing and comparing results. As more sub-packages become available, tests will be written against these APIs to compare results; a hypothetical example is sketched below.
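Such a comparison test might look like this sketch, where the expected tokens are a fixture captured from an external service and tokenize is a stand-in for the eventual nlpt API (both are assumptions for illustration):

```go
package nlpt_test

import (
	"reflect"
	"strings"
	"testing"
)

// tokenize stands in for the nlpt tokenizer so the test is self-contained.
func tokenize(s string) []string { return strings.Fields(s) }

func TestAgainstServiceFixture(t *testing.T) {
	// Tokens previously captured from an external NLP service for this input.
	want := []string{"the", "cat", "sat"}
	if got := tokenize("the cat sat"); !reflect.DeepEqual(got, want) {
		t.Errorf("tokenize: got %v, want %v", got, want)
	}
}
```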
/*
** DESCRIPTION FROM XEROX **
These tools, called xfst, twolc, and lexc, are used in many linguistic applications such as morphological analysis, tokenisation, and shallow parsing of a wide variety of natural languages. The finite state tools here are built on top of a software library that provides algorithms to create automata from regular expressions and equivalent formalisms and contains both classical operations, such as union and composition, and new algorithms such as replacement and local sequentialisation.
Finite-state linguistic resources are used in a series of applications and prototypes that range from OCR to terminology extraction, comprehension assistants, digital libraries and authoring and translation systems.
The components provided here are:
Tokenization
Morphology
Part of Speech Disambiguation (Tagging)
*/
/*
** DESCRIPTION FROM ALCHEMY **
AlchemyAPI uses natural language processing technology and machine learning algorithms to extract semantic meta-data from content, such as information on people, places, companies, topics, facts, relationships, authors, and languages.
API endpoints are provided for performing content analysis on Internet-accessible web pages, posted HTML or text content.
*/