Sentence Alignment Program
Uploaded: 2008-05-07 10:43:58
This is a program that automatically aligns two corpora at the sentence level.
Appendix: Program
with Michael D. Riley
The following code is the core of align. It is a C language program which inputs two text files, with one
token (word) per line. The text files contain a number of delimiter tokens. There are two types of delimiter
tokens: "hard" and "soft". The hard regions (e.g., paragraphs) may not be changed, and there must be
equal numbers of them in the two input files. The soft regions (e.g., sentences) may be deleted (1-0),
inserted (0-1), substituted (1-1), contracted (2-1), expanded (1-2), or merged (2-2) as necessary so that the
output ends up with the same number of soft regions. The program generates two output files. The two
output files contain an equal number of soft regions, each on a line. If the -v command line option is
included, each soft region is preceded by its probability score.
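The input convention described above can be illustrated with a small sketch. It is not part of the original listing: the helper names `count_delimiters` and `hard_regions_match` are hypothetical, and it assumes the tokens have already been read into an array of strings and that a delimiter is recognized by exact string comparison.

```c
#include <string.h>

/* Hypothetical helper (not from the original program): counts how many of
   tokens[0..n-1] are exactly equal to the given delimiter token. */
static int count_delimiters(const char **tokens, int n, const char *delimiter)
{
    int i, count = 0;

    for (i = 0; i < n; i++)
        if (strcmp(tokens[i], delimiter) == 0)
            count++;
    return count;
}

/* Hypothetical check for the requirement that both inputs contain the same
   number of hard delimiters. Returns 1 if the counts agree, 0 otherwise. */
static int hard_regions_match(const char **tokens1, int n1,
                              const char **tokens2, int n2,
                              const char *hard_delimiter)
{
    return count_delimiters(tokens1, n1, hard_delimiter)
        == count_delimiters(tokens2, n2, hard_delimiter);
}
```

With `.PARA` as the hard delimiter, two token arrays that each contain two `.PARA` tokens pass the check even if their counts of soft delimiters differ, which is the case the alignment step is meant to resolve.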
#include <fcntl.h>
#include <malloc.h>
#include <math.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <values.h>
#include <sys/stat.h>
/*
usage:
align_regions -D '.PARA' -d '.End of Sentence' file1 file2
outputs two files: file1.al & file2.al
hard regions are delimited by the -D arg
soft regions are delimited by the -d arg
*/
#define dist(x,y) distances[(x) * ((ny) + 1) + (y)]
#define pathx(x,y) path_x[(x) * ((ny) + 1) + (y)]
#define pathy(x,y) path_y[(x) * ((ny) + 1) + (y)]
#define MAX_FILENAME 256
#define BIG_DISTANCE 2500
/* Dynamic Programming Optimization */
struct alignment {
int x1;
int y1;
int x2;
int y2;
int d;
};
char *hard_delimiter = NULL; /* -D arg */
char *soft_delimiter = NULL; /* -d arg */
int verbose = 0; /* -v arg */
/* utility functions */
char *readchars(), **readlines(), **substrings();
void err();
/*
seq_align by Mike Riley
x and y are sequences of objects, represented as non-zero ints, to be aligned.
dist_funct(x1, y1, x2, y2) is a distance function of 4 args:
dist_funct(x1, y1, 0, 0) gives cost of substitution of x1 by y1.
dist_funct(x1, 0, 0, 0) gives cost of deletion of x1.
dist_funct(0, y1, 0, 0) gives cost of insertion of y1.
dist_funct(x1, y1, x2, 0) gives cost of contraction of (x1,x2) to y1.
dist_funct(x1, y1, 0, y2) gives cost of expansion of x1 to (y1,y2).
dist_funct(x1, y1, x2, y2) gives cost to match (x1,x2) to (y1,y2).
align is the alignment, with (align[i].x1, align[i].x2) aligned
with (align[i].y1, align[i].y2). Zero in align[].x1 and align[].y1
correspond to insertion and deletion, respectively. Non-zero in
align[].x2 and align[].y2 correspond to contraction and expansion,
respectively. align[].d gives the distance for that pairing.
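The calling convention for dist_funct described in the comment above can be sketched as a small dynamic program. This is a minimal illustration and not Riley's seq_align: it returns only the minimum total distance and does not recover the alignment path. The names `toy_dist` and `min_align_cost` are hypothetical, and the toy distance (absolute difference of summed values plus a fixed penalty for anything other than a 1-1 substitution) is an assumption standing in for the program's real scoring.

```c
#include <limits.h>
#include <stdlib.h>

/* Toy distance (an assumption, not the program's score). A zero argument
   means "absent", following the dist_funct convention. */
static int toy_dist(int x1, int y1, int x2, int y2)
{
    int diff = abs((x1 + x2) - (y1 + y2));
    int is_substitution = (x1 != 0 && y1 != 0 && x2 == 0 && y2 == 0);

    return is_substitution ? diff : diff + 10;
}

/* Returns the minimum total distance for aligning x[0..nx-1] with
   y[0..ny-1], or -1 if the table cannot be allocated. */
static int min_align_cost(const int *x, int nx, const int *y, int ny,
                          int (*dist_funct)(int, int, int, int))
{
    /* (dx, dy): number of x and y items consumed by each operation:
       substitution, deletion, insertion, contraction, expansion, merge. */
    static const int steps[6][2] = {
        {1, 1}, {1, 0}, {0, 1}, {2, 1}, {1, 2}, {2, 2}
    };
    int *cost = malloc((size_t)(nx + 1) * (size_t)(ny + 1) * sizeof *cost);
    int i, j, k, result;

    if (cost == NULL)
        return -1;
/* Row-major layout, the same indexing scheme as the dist(x,y) macro. */
#define COST(i, j) cost[(i) * (ny + 1) + (j)]
    for (i = 0; i <= nx; i++) {
        for (j = 0; j <= ny; j++) {
            int best = (i == 0 && j == 0) ? 0 : INT_MAX;

            for (k = 0; k < 6; k++) {
                int dx = steps[k][0], dy = steps[k][1];
                int x1, x2, y1, y2, d;

                if (i < dx || j < dy)
                    continue;   /* not enough items left for this step */
                /* For two-item steps, x1/y1 is the earlier item. */
                x1 = dx >= 1 ? x[i - dx] : 0;
                x2 = dx == 2 ? x[i - 1] : 0;
                y1 = dy >= 1 ? y[j - dy] : 0;
                y2 = dy == 2 ? y[j - 1] : 0;
                d = COST(i - dx, j - dy) + dist_funct(x1, y1, x2, y2);
                if (d < best)
                    best = d;
            }
            COST(i, j) = best;
        }
    }
    result = COST(nx, ny);
#undef COST
    free(cost);
    return result;
}
```

For example, with x = {10, 20} and y = {30}, the cheapest path under `toy_dist` is a single contraction (2-1) of cost 10, since the summed x values match y exactly and only the fixed penalty applies.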