IBM-Text-Analytics-on-Apache-Spark-Bhatia-Thitte

Uploaded 2015-09-22 16:40:04 | 2.04 MB | PDF | 23 pages

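Spark Summit 2014 was held in San Francisco, USA, from June 30 to July 2. Major users of Spark, Shark, Spark Streaming, and related projects and products gathered to discuss the direction of Spark development and how Spark is being used in practice across a wide range of applications.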
Agenda

- Motivation: IBM Text Analytics. Our expectation, experience, solution
- IBM Text Analytics
  - SystemT: high-performance run-time, uses optimized execution plans
  - Information Extraction (IE): deep-parse, lexical semantics, extraction libraries
  - AQL: express lexical semantics as declarative rules using relational algebra
  - Benchmarks: SystemT versus GATE-ANNIE
  - Eclipse / Web based developer tooling: text-analytics life-cycle, map-reduce
- Project *Sparkle*: IBM Text Analytics on Apache Spark (Spark-Java, Shark-UDTF)
- Future work: Scale, Scala, Tooling, Extractors

© 2014 IBM Corporation

IBM Text Analytics - Motivation

An enterprise information extraction system needs to be:

- expressive
- efficient
- transparent
- usable

Our "experience":

- rule-based solutions building on cascading grammars, with expressivity and efficiency issues
- black-box solutions building on statistical learning models, with lack of transparency

Our "solution": a declarative information extraction system with cost-based optimization, a high-performance runtime, and novel development tooling based on a solid theoretical foundation [SIGMOD Record'09, ACL'10, PODS'13, PODS'14].

IBM Text Analytics

[Figure: input documents are fed one at a time to the SystemT Runtime, which applies a cost-based optimized AQL extractor and emits the extracted concepts per document.]

AQL extractor shown on the slide:

    create view IntentToBuy as
    select P.name as product, I.clue as strength
    from Intent I, Product P
    where Follows(I.clue, P.name, 0, 20)
      and Not(ContainsRegex(/\b(not)\b/, LeftContext(I.clue, 10)));

- Declarative SQL-like language: the user specifies tasks in a high-level language, without specifying algorithms for data processing [SIGMOD Record'09, ACL'10]
- High-performance, scalable and embeddable Java runtime [SIGMOD Record'09, ACL'10]
  - Outperforms state-of-the-art systems
  - Document-at-a-time
  - High-throughput
  - Small memory footprint
- Various optimization strategies to choose across execution plans: cost-based optimization for text-centric operations [ICDE'08, ICDE'11]
- Modern pattern discovery tools: AQL development using ML and HCI [EMNLP'08, VLDB'10, ACL'11, CIKM'11, SIGMOD Record'08, ACL'12, EMNLP'12, CHI'13, SIGMOD'13, ACL'13]
- AQL language exposed via InfoSphere BigInsights and Streams
- SystemT Runtime with pre-built extractors ships in 8+ other IBM products

SystemT - high performance run-time, optimized execution

- There are multiple ways to execute a given set of AQL statements
- The optimizer chooses a good plan from among the alternatives
- It employs multiple techniques:
  - AQL rewrite rules
  - Cost-based optimization
  - Global plan rewrite rules
- Extractor plan: a graph of operators
- Operator: a module that performs a specific task, e.g. identifying matches of a regex on a string
- The output of one operator is the input of another

[Figure: AQL language -> Optimizer -> Compiled plan; input documents -> distributed cluster with SystemT -> extracted information. Optimizations named on the slide: shared dictionary matching, regular expression strength reduction, shared regular expression matching, conditional evaluation.]

Information Extraction - highlights

... normalization, rule-based lexical semantics, algebraic operations over textual spans, extensibility via functions, rich extraction libraries

[Figure: layered extraction stack. Labels recoverable from the slide: Social, Log, CRM, Search, Email IE, Financial IE, Piracy IE, Life Sciences IE; Action API; Named Entities, Machine Data Primitives, Financial Primitives; Deep Parser; Noisy Data Normalization; AQL rule constructs: Regexes, Dictionaries, Span Operations, Joins, Predicates, Functions.]

Action API, deep parser and noisy data normalization are work in progress, slated for a future release of IBM BigInsights.

AQL - express lexical semantics as declarative rules

    module IntentExamples;

    -- API imports
    import view Actions from module actionapi as actions;
    import view Roles from module actionapi as roles;

    -- Dictionaries
    create dictionary IntentVerbs with case insensitive as ('want', 'wish', 'intend');
    create dictionary CustomerTerm with case insensitive as ('I', 'we');
    create dictionary IntentSubject with case insensitive as ('agent');
    create dictionary IntentObject with case insensitive as ('theme', 'action_theme');

    -- Join Actions and Roles, and use functions
    create view ClientNeeds as
    select A.sentence, O.value
    from actions A, roles S, roles O
    where Equals(GetText(A.aid), GetText(S.aid))
      and Equals(GetText(A.aid), GetText(O.aid))
      -- Dictionary-based selection predicates
      and MatchesDict('IntentVerbs', A.verbBase)
      and MatchesDict('IntentSubject', S.name)
      and MatchesDict('CustomerTerm', S.value)
      and MatchesDict('IntentObject', O.name);

    output view ClientIntent;

Benchmarks - SystemT vs GATE-ANNIE+

Table 1: Datasets for performance evaluation
(columns: dataset; description of the content; number of documents; document size range; average document size)

- Enron_x: emails randomly sampled from the Enron corpus, of average size x KB (0.5 < x < 100); 1000 documents; x KB +/- 10%; x KB
- WebCrawl: small to medium size web pages representing company news, with HTML tags removed; 1931 documents; 68 B to 388.6 KB; 8.8 KB
- Finance_M: medium size financial regulatory filings; 100 documents; 240 KB to 0.9 MB; 401 KB
- Finance_L: large size financial regulatory filings; 30 documents; 1 MB to 3.4 MB; 1.54 MB

Table 2: Quality of Person on test datasets (each cell is Exact/Partial)

    Dataset         System   Precision (%)   Recall (%)    F1 measure (%)
    Enron meetings  ANNIE    57.05/76.84     48.59/65.46   52.48/70.69
    Enron meetings  T-NE     88.41/92.99     82.39/86.05   85.29/89.71
    Enron meetings  Minkov   81.1/NA         74.9/NA       77.9/NA
    ACE             ANNIE    39.41/78.15     30.39/60.27   34.32/68.06
    ACE             T-NE     93.90/95.82     90.90/92.76   92.38/94.27

Person entity extraction via AQL is superlative.

[Figure a: Throughput on Enron, plotted against average document size (KB); series: ANNIE, ANNIE-Optimized, T-NE.]
[Figure b: Memory utilization on Enron, plotted against average document size (KB); series: T-NE, ANNIE, ANNIE-Optimized.]

Performance* of SystemT is orders of magnitude better.

Systems compared:

- T-NE: using the AQL SystemT run-time
- ANNIE: ANNIE runtime (http://gate.ac.uk/sale/tao/splitch6.html#chap:annie)
- ANNIE-Optimized: ANNIE with the Ontotext Japec transducer
- Minkov: using E. Minkov [EMNLP'05]

* As a function of throughput and memory utilization, as seen on a cluster of 2 x 2.4 GHz, 4-core Intel Xeon CPUs with 64 GB RAM.
+ GATE-ANNIE is a well-known open-source IE system: http://gate.ac.uk/sale/tao/splitch6.html

Eclipse IE - Extraction workflow, AQL editor, extraction design planner

[Screenshot: InfoSphere BigInsights Text Analytics Workflow in Eclipse, with TAProject2/aql/main.aql open in the AQL editor next to the extraction plan. Visible plan steps: Step 1, Select Documents (a. select a document collection, language "en"; b. select the documents to work with); Step 2, Label Examples and Clues. Callouts: "Powerful AQL editor with assistive design planner" and "Guided IE workflow".]

AQL shown in the editor:

    -- start writing your AQL here
    include 'RevenueByDivision/basics.aql';
    include 'RevenueByDivision/concepts.aql';
    include 'RevenueByDivision/refinement.aql';

    create view RevenueLeftContext as
    select LeftContextTok(R.match, 10) as lc
    from Revenue R;

    output view RevenueLeftContext;
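
The IntentToBuy rule from the IBM Text Analytics slide joins intent clues with product names that follow within 20 characters, and discards clues preceded by "not" in their 10-character left context. The Python sketch below imitates that logic so the semantics are easier to follow. It is not SystemT or AQL: the dictionaries `INTENT_CLUES` and `PRODUCT_NAMES` and the helpers `dictionary_matches`, `follows` and `left_context` are hypothetical stand-ins, and the distance is assumed to be measured in characters between the end of the clue and the start of the product name.

```python
import re
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    """Character offsets of a match inside a document."""
    begin: int
    end: int


# Hypothetical dictionaries standing in for the Intent and Product views.
INTENT_CLUES = ["want", "wish", "intend"]
PRODUCT_NAMES = ["Galaxy", "iPhone"]


def dictionary_matches(text: str, terms: List[str]) -> List[Span]:
    """Case-insensitive whole-word dictionary matching."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return [Span(m.start(), m.end())
            for m in re.finditer(pattern, text, re.IGNORECASE)]


def follows(first: Span, second: Span, min_chars: int, max_chars: int) -> bool:
    """True when `second` starts min_chars..max_chars after `first` ends."""
    return min_chars <= second.begin - first.end <= max_chars


def left_context(text: str, span: Span, n: int) -> str:
    """Up to n characters immediately to the left of the span."""
    return text[max(0, span.begin - n):span.begin]


def intent_to_buy(text: str) -> List[Tuple[str, str]]:
    """Return (product, clue) pairs, mirroring the select list of the rule."""
    products = dictionary_matches(text, PRODUCT_NAMES)
    results = []
    for clue in dictionary_matches(text, INTENT_CLUES):
        # Negation filter: skip clues with "not" in the 10-char left context.
        if re.search(r"\b(not)\b", left_context(text, clue, 10)):
            continue
        for product in products:
            if follows(clue, product, 0, 20):
                results.append((text[product.begin:product.end],
                                text[clue.begin:clue.end]))
    return results


if __name__ == "__main__":
    print(intent_to_buy("I want to buy an iPhone today."))
    print(intent_to_buy("I do not want a Galaxy."))
```

The nested loop is a naive join over all clue and product pairs. According to the slides, SystemT instead compiles such rules into an operator graph and lets a cost-based optimizer choose the execution plan, which this sketch does not attempt to model.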
