Project Description
RNNSharp is a recurrent neural network toolkit that can be applied to many different kinds of tasks, such as sequence labeling.

This page introduces what RNNSharp is, how it works and how to use it. To get the latest source code, please visit the [SOURCE CODE] page and download it by clicking the "download" link. To get the demo package, please visit the [DOWNLOADS] page and download the package.

Overview

RNNSharp supports two kinds of recurrent neural network (aka RNN). One is the standard recurrent neural network; the other is the recurrent conditional random field [1], which is built on top of a recurrent neural network.

The standard recurrent neural network is an excellent algorithm for online sequence labeling tasks, such as speech recognition, auto suggestion and so on. It performs better than MEMM and algorithms that use traditional N-gram features.

The recurrent conditional random field (aka recurrent-CRF) is a new type of CRF based on RNN. Compared with the standard RNN above, the recurrent-CRF is designed for offline tasks. Like a CRF, the recurrent-CRF can be used for many different types of sequence labeling tasks, such as word segmentation, named entity recognition and so on. With a similar feature set, it performs better than a linear-chain CRF, since its feature representation is richer.

Supported Feature Types

RNNSharp supports four types of features: template features, context template features, run time features and word embedding features. These features are controlled by the feature configuration file; the following sections introduce what these features are and how to configure them.

Template Features

This type of feature is generated from templates. Given the templates and a corpus, the features are generated automatically. A template feature is binary-valued: if the feature exists for the current token, its value is 1; otherwise, it is 0. This is similar to CRFSharp features. In RNNSharp, TFeatureBin.exe is the console tool that generates this type of feature.

In the template file, each line describes one template, which consists of a prefix, an id and a rule-string. The prefix indicates the template type. So far, RNNSharp only supports unigram (U-type) features, so the prefix is always "U". The id distinguishes different templates, and the rule-string tells the feature builder how to generate features.

# Unigram
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[-1,0]/%x[0,0]
U05:%x[0,0]/%x[1,0]
U06:%x[-1,0]/%x[1,0]
U07:%x[-1,1]
U08:%x[0,1]
U09:%x[1,1]
U10:%x[-1,1]/%x[0,1]
U11:%x[0,1]/%x[1,1]
U12:%x[-1,1]/%x[1,1]
U13:C%x[-1,0]/%x[-1,1]
U14:C%x[0,0]/%x[0,1]
U15:C%x[1,0]/%x[1,1]
 

The rule-string has two forms: one is a constant string, and the other is a macro. The simplest macro form is "%x[row,col]". Row specifies the offset between the current token and the token the feature is generated from, and col specifies the absolute column position in the corpus. Combined macros are also supported, for example "%x[row1,col1]/%x[row2,col2]". When the feature set is generated, each macro is replaced with a specific string. For example, assuming the current token is "York NNP E_LOCATION" in the first record of the training corpus shown in the "Training file format" section below, the generated unigram feature set is as follows:

U01:New
U02:York
U03:are
U04:New/York
U05:York/are
U06:New/are
U07:NNP
U08:NNP
U09:are
U10:NNP/NNP
U11:NNP/VBP
U12:NNP/VBP
U13:CNew/NNP
U14:CYork/NNP
U15:Care/VBP

Although the rule-strings of U07 and U08, and of U11 and U12, expand to the same strings here, the features can still be distinguished by their id strings.

When building features, the builder generates the feature set (like the example above) from the given corpus according to the templates, and saves it into a feature set file.
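
To make the macro expansion concrete, here is a minimal C# sketch (illustrative only, not RNNSharp's internal code) that expands a rule-string such as "U04:%x[-1,0]/%x[0,0]" against a record, where record[row][col] holds the corpus columns and pos is the index of the current token:

using System;
using System.Text.RegularExpressions;

static class TemplateExpander
{
    // Replace every %x[row,col] macro with the column value of the token at offset
    // "row" from the current position; constant text is copied unchanged.
    public static string Expand(string template, string[][] record, int pos)
    {
        return Regex.Replace(template, @"%x\[(-?\d+),(\d+)\]", m =>
        {
            int row = pos + int.Parse(m.Groups[1].Value);
            int col = int.Parse(m.Groups[2].Value);
            return (row >= 0 && row < record.Length) ? record[row][col] : "<OUT_OF_RANGE>";
        });
    }

    static void Main()
    {
        var record = new[]
        {
            new[] { "New",  "NNP" },
            new[] { "York", "NNP" },
            new[] { "are",  "VBP" },
        };
        // Expanding template U04 at the token "York" (pos = 1) prints "U04:New/York".
        Console.WriteLine(Expand("U04:%x[-1,0]/%x[0,0]", record, 1));
    }
}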

In the feature configuration file, the keyword TFEATURE_FILENAME specifies the file name of the template feature set in binary format.

Context Template Features

This type of feature is based on template features combined with context. For example, if the configuration of this feature is "-1,0,1", the generated feature combines the features of the current token with those of its previous token and next token. For instance, if the sentence is "how are you", the feature set generated for the second token combines Feature("how"), Feature("are") and Feature("you").

In the feature configuration file, the keyword TFEATURE_CONTEXT specifies the token context range for this feature.
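
As a rough illustration (an assumption about the mechanism, not RNNSharp's actual code), combining context features can be thought of as concatenating the per-token feature vectors at the configured offsets:

using System;
using System.Collections.Generic;

static class ContextFeatures
{
    // With a context of {-1, 0, 1}, the input for the token at "pos" is the
    // concatenation of the feature vectors of the previous, current and next tokens.
    public static float[] Combine(IList<float[]> tokenFeatures, int pos, int[] context)
    {
        int width = tokenFeatures[0].Length;
        var combined = new float[width * context.Length];
        for (int i = 0; i < context.Length; i++)
        {
            int p = pos + context[i];
            if (p < 0 || p >= tokenFeatures.Count)
                continue; // positions outside the sentence contribute all-zero slots
            Array.Copy(tokenFeatures[p], 0, combined, i * width, width);
        }
        return combined;
    }
}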

Word Embedding Features

This type of feature describes a given token with a real-valued vector. It is very useful when we only have a small labeled corpus but lots of unlabeled corpus. This type of feature is generated by WSDSharp: given a large unlabeled corpus, WSDSharp in train mode generates a vector for each token. Note that the token granularity in the word embedding features and in the training corpus for RNN model encoding should be aligned; otherwise, many tokens in the training corpus will not match any feature. For more details about how to generate word embedding features, please visit the WSDSharp homepage.

In RNNSharp, this feature also supports context: all word embedding features in the context window are combined into a single feature.

In the feature configuration file, the keyword WORDEMBEDDING_FILENAME specifies the encoded word embedding data file generated by WSDSharp, and the keyword WORDEMBEDDING_CONTEXT specifies the token context range for this feature.
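
The sketch below (an illustration under assumptions, not the WSDSharp/RNNSharp file format) shows why token granularity matters: an embedding is found by exact token lookup, and a token that does not match any entry falls back to an all-zero vector:

using System.Collections.Generic;

static class EmbeddingLookup
{
    // Return the embedding vector for "token", or a zero vector of length "dim"
    // when the token is missing. If the corpus and the embedding file are
    // tokenized differently, most lookups miss and the feature carries no signal.
    public static float[] Lookup(Dictionary<string, float[]> embeddings, string token, int dim)
    {
        return embeddings.TryGetValue(token, out var vec) ? vec : new float[dim];
    }
}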

Run Time Features

Unlike the other features, which are generated offline, this feature is generated at run time. So far, it uses the results predicted for previous tokens as a run time feature for the current token.

In the feature configuration file, the keyword RTFEATURE_CONTEXT specifies the context range of this feature.
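
As a minimal sketch of the idea (an assumption about the mechanism, not RNNSharp's implementation), the tag already predicted for an earlier token can be fed back as a one-hot feature for the current token:

static class RunTimeFeatures
{
    // Build a one-hot vector over the tag set from the tag predicted for the
    // previous token; the first token of a sentence has no history, so the
    // vector stays all zero.
    public static float[] PreviousTagOneHot(int[] predictedTags, int pos, int tagCount)
    {
        var feature = new float[tagCount];
        if (pos > 0)
            feature[predictedTags[pos - 1]] = 1.0f;
        return feature;
    }
}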

Feature Configuration File

This file contains the configuration items for features. All of them have been introduced in the sections above; the following is an example. In the console tool, use the -ftrfile parameter to specify the feature configuration file.

#The file name for template feature set
TFEATURE_FILENAME:tfeatures

#The context range for the template feature set. Below, the context is the current token, the next token and the token after that
TFEATURE_CONTEXT: 0,1,2

#The word embedding data file name generated by WSDSharp
WORDEMBEDDING_FILENAME:word_vector.bin

#The context range for word embedding. Below, the context is the previous token, the current token and the next token
WORDEMBEDDING_CONTEXT: -1,0,1

#The run time feature
RTFEATURE_CONTEXT: 0

Training file format

The training corpus contains many records that describe what the model should learn. Each record is split into one or more tokens, and each token has one or more feature dimensions describing it.

In the training file, each record is represented as a matrix and ends with an empty line. In the matrix, each row describes one token and its features, and each column represents one feature dimension. The number of columns must be the same across the entire training corpus.

When RNNSharp encodes a model, if the column size is N, the first N-1 columns are usually used as input data to generate the binary feature set (according to the template file) and train the model. The Nth column (aka the last column) is the answer the model should output. That means, for one record, an ideally encoded model, given the first N-1 columns of every token, should output each token's Nth column as the answer.

Here is an example (a bigger training example file is available in the download section):

 

!	PUN	S
Tokyo	NNP	S_LOCATION
and	CC	S
New	NNP	B_LOCATION
York	NNP	E_LOCATION
are	VBP	S
major	JJ	S
financial	JJ	S
centers	NNS	S
.	PUN	S

!	PUN	S
p	FW	S
'	PUN	S
y	NN	S
h	FW	S
44	CD	S
University	NNP	B_ORGANIZATION
of	IN	M_ORGANIZATION
Texas	NNP	M_ORGANIZATION
Austin	NNP	E_ORGANIZATION


In the above example, the output answer is designed as "POS_TYPE", where POS is the position of the term in the chunk or named entity, and TYPE is the output type of the term.
The example is for labeling named entities. It has two records, and each token has three columns. The first column is the term of the token, the second column is the token's pos-tag, and the third column describes whether the token is a named entity (or part of one) and its type. The first and second columns are input data for encoding the model, and the third column is the ideal output of the model, i.e. the answer.
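
For illustration, here is a small C# sketch (not RNNSharp's own loader) that reads this format: records are separated by empty lines, each row is one token, and the last column of each row is the answer tag while the remaining columns are input features:

using System;
using System.Collections.Generic;
using System.IO;

static class CorpusReader
{
    // Read the corpus as a sequence of records; each record is a list of rows,
    // and each row is the array of columns for one token.
    public static IEnumerable<List<string[]>> ReadRecords(string path)
    {
        var record = new List<string[]>();
        foreach (var line in File.ReadLines(path))
        {
            if (string.IsNullOrWhiteSpace(line))
            {
                if (record.Count > 0) { yield return record; record = new List<string[]>(); }
                continue;
            }
            // Columns are separated by tabs or spaces; the last column is the tag.
            record.Add(line.Split(new[] { '\t', ' ' }, StringSplitOptions.RemoveEmptyEntries));
        }
        if (record.Count > 0)
            yield return record;
    }
}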

For POS, four types are supported:
S: the chunk has only one term
B: the beginning term of the chunk
M: one of the middle terms of the chunk
E: the end term of the chunk

For TYPE, the example contains the following types:
ORGANIZATION: the name of an organization
LOCATION: the name of a location
An output answer without a TYPE means the token is just a normal term, not a named entity or part of one.

Test file format

The test file has a similar format to the training file. The only difference between the training and test files is the last column: in the test file, all columns are input features for the model.

Tag Mapping File

This file maps each tag name to its id. For readability, RNNSharp uses tag names in the corpus; however, for efficiency in encoding and decoding, it maps each tag name to an integer value. The mapping is defined in a file (passed with the -tagfile parameter in the console tool) whose format is: tag id \t tag name

Here is an example:
0    O
1    LOC
2    BRD_ORG
3    ORG_MOD
4    ORG_SUFFIX
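
A small sketch (again just an illustration, not RNNSharp's own parser) of loading this mapping into lookup tables in both directions:

using System;
using System.Collections.Generic;
using System.IO;

static class TagMap
{
    // Parse "tag id <tab> tag name" lines into id->name and name->id dictionaries.
    public static (Dictionary<int, string> IdToName, Dictionary<string, int> NameToId) Load(string path)
    {
        var idToName = new Dictionary<int, string>();
        var nameToId = new Dictionary<string, int>();
        foreach (var line in File.ReadLines(path))
        {
            var parts = line.Split(new[] { '\t', ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length < 2)
                continue;
            int id = int.Parse(parts[0]);
            idToName[id] = parts[1];
            nameToId[parts[1]] = id;
        }
        return (idToName, nameToId);
    }
}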

Console Tool

RNNSharpConsole

RNNSharpConsole.exe is a console tool for recurrent neural network encoding and decoding. The tool supports two running modes: "train" mode trains a model, and "test" mode predicts output tags for a test corpus using a given encoded model.

Encode Model

In this mode, the console tool encodes an RNN model from a given feature set and training/validation corpus. The usage is as follows:

RNNSharpConsole.exe -mode train <parameters>
Parameters for training RNN based model
-trainfile <string>: training corpus file
-validfile <string>: validation corpus for training
-modelfile <string>: encoded model file
-ftrfile <string>: feature configuration file
-tagfile <string>: supported output tagid-name list file
-alpha <float>: learning rate, default is 0.1
-layersize <int>: hidden layer size for training, default is 200
-crf <0/1>: training model by standard RNN(0) or RNN-CRF(1), default is 0
-maxiter <int>: maximum iteration for training, default is 20
-savestep <int>: save a temporary model after every <int> sentences, default is 0

Example: RNNSharpConsole.exe -mode train -trainfile train.txt -validfile valid.txt -modelfile model.bin -tagfile tags.txt -layersize 200 -alpha 0.1 -crf 1 -maxiter 20 -savestep 200K

Decode Model

In this mode, the console tool predicts the output tags of a given corpus. The usage is as follows:

RNNSharpConsole.exe -mode test <parameters>
Parameters for predicting iTagId tag from given corpus
-testfile <string>: test corpus file
-modelfile <string>: encoded model file
-tagfile <string>: supported output tagid-name list file
-ftrfile <string>: feature configuration file
-outfile <string>: result output file

Example: RNNSharpConsole.exe -mode test -testfile test.txt -modelfile model.bin -tagfile tags.txt -ftrfile features.txt -outfile result.txt

TFeatureBin

TFeatureBin.exe generates the template feature set from given template and corpus files. For high-performance access and low memory cost, the generated feature set is stored as a double array trie built by AdvUtils. The usage of this tool is as follows:

TFeatureBin.exe <parameters>
The tool is used to generate template features and build them into DART format
-template <string> : feature template file
-inputfile <string> : file used to generate features
-ftrfile <string> : generated features file
-minfreq <int> : minimum frequency of a feature. Any feature whose frequency is less than this value will be dropped.
-debug <int> : output raw feature set

Here is an example:

TFeatureBin.exe -template template.txt -inputfile train.txt -ftrfile tfeature -minfreq 3 -debug 1

APIs

RNNSharp also provides APIs for developers to integrate it into their own projects. By downloading the source code package and opening the RNNSharpConsole project, you can see how to use the APIs to encode and decode RNN models. Note that before using the RNNSharp APIs, you should add RNNSharp.dll as a reference in your project.
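
The snippet below is only a hypothetical sketch of what decoding through the API might look like; the type and method names (Featurizer, RNNDecoder, Process) are assumptions for illustration and should be checked against the RNNSharpConsole project for the real calls and signatures:

using System;

class DecodeExample
{
    static void Main()
    {
        // Hypothetical calls: load the feature configuration / tag set and the encoded model.
        var featurizer = new Featurizer("features.txt", "tags.txt");   // assumed API
        var decoder = new RNNDecoder("model.bin", featurizer);         // assumed API

        // One test record: each token carries its input columns (term and pos-tag).
        var sentence = new[]
        {
            new[] { "New",  "NNP" },
            new[] { "York", "NNP" },
            new[] { "are",  "VBP" },
        };

        // Predict one output tag per token and print the result.
        var tags = decoder.Process(sentence);                          // assumed API
        Console.WriteLine(string.Join(" ", tags));
    }
}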

References

  1. Recurrent Conditional Random Field For Language Understanding
  2. Recurrent Neural Networks for Language Understanding
  3. RNNLM - Recurrent Neural Network Language Modeling Toolkit

 

