Spark Programming Guide (Java) - Spark 1.6.2
2016/8/17 Spark Programming Guide - Spark 1.6.2 Documentation
http://spark.apache.org/docs/1.6.2/programmingguide.html
Spark Programming Guide
Overview
Linking with Spark
Initializing Spark
Using the Shell
Resilient Distributed Datasets (RDDs)
Parallelized Collections
External Datasets
RDD Operations
Basics
Passing Functions to Spark
Understanding closures
Example
Local vs. cluster modes
Printing elements of an RDD
Working with Key-Value Pairs
Transformations
Actions
Shuffle operations
Background
Performance Impact
RDD Persistence
Which Storage Level to Choose?
Removing Data
Shared Variables
Broadcast Variables
Accumulators
Deploying to a Cluster
Launching Spark jobs from Java/Scala
Unit Testing
Migrating from pre-1.0 Versions of Spark
Where to Go from Here
Overview
At a high level, every Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.
A second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only "added" to, such as counters and sums.
This guide shows each of these features in each of Spark's supported languages. It is easiest to follow along with if you launch Spark's interactive shell – either bin/spark-shell for the Scala shell or bin/pyspark for the Python one.
Linking with Spark
Spark 1.6.2 works with Java 7 and higher. If you are using Java 8, Spark supports lambda expressions for concisely writing functions; otherwise you can use the classes in the org.apache.spark.api.java.function package.
To write a Spark application in Java, you need to add a dependency on Spark. Spark is available through Maven Central at:
groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.6.2
In addition, if you wish to access an HDFS cluster, you need to add a dependency on hadoop-client for your version of HDFS.
groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
Finally, you need to import some Spark classes into your program. Add the following lines:
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;
Initializing Spark
The first thing a Spark program must do is to create a JavaSparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application.
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);
The appName parameter is a name for your application to show on the cluster UI. master is a Spark, Mesos or YARN cluster URL, or a special "local" string to run in local mode. In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with spark-submit and receive it there. However, for local testing and unit tests, you can pass "local" to run Spark in-process.
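The two configuration lines above can be wrapped into a minimal runnable program for in-process testing; the class name, application name, and the trivial count() job below are illustrative additions, not part of the original guide:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Minimal sketch: run Spark in-process, as you would in a unit test.
public class LocalExample {
    static long countInProcess() {
        SparkConf conf = new SparkConf()
            .setAppName("LocalExample")   // shown on the cluster UI
            .setMaster("local[2]");       // local mode with 2 worker threads
        JavaSparkContext sc = new JavaSparkContext(conf);
        try {
            // A trivial job to confirm the context works:
            return sc.parallelize(Arrays.asList(1, 2, 3)).count();
        } finally {
            sc.stop();                    // always release resources
        }
    }

    public static void main(String[] args) {
        System.out.println(countInProcess()); // prints 3
    }
}
```

On a real cluster the setMaster call would be omitted and the master supplied to spark-submit instead, as described above.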
Using the Shell
In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the variable called sc. Making your own SparkContext will not work. You can set which master the context connects to using the --master argument, and you can add JARs to the classpath by passing a comma-separated list to the --jars argument. You can also add dependencies (e.g. Spark Packages) to your shell session by supplying a comma-separated list of Maven coordinates to the --packages argument. Any additional repositories where dependencies might exist (e.g. Sonatype) can be passed to the --repositories argument. For example, to run bin/spark-shell on exactly four cores, use:
$ ./bin/spark-shell --master local[4]
Or, to also add code.jar to its classpath, use:
$ ./bin/spark-shell --master local[4] --jars code.jar
To include a dependency using Maven coordinates:
$ ./bin/spark-shell --master local[4] --packages "org.example:example:0.1"
For a complete list of options, run spark-shell --help. Behind the scenes, spark-shell invokes the more general spark-submit script.
Resilient Distributed Datasets (RDDs)
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Parallelized Collections
Parallelized collections are created by calling JavaSparkContext's parallelize method on an existing Collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5:
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
Once created, the distributed dataset (distData) can be operated on in parallel. For example, we might call distData.reduce((a, b) -> a + b) to add up the elements of the list. We describe operations on distributed datasets later on.
Note: In this guide, we'll often use the concise Java 8 lambda syntax to specify Java functions, but in older versions of Java you can implement the interfaces in the org.apache.spark.api.java.function package. We describe passing functions to Spark in more detail below.
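On Java 7 and earlier, the lambda (a, b) -> a + b corresponds to an anonymous class implementing Function2 from the org.apache.spark.api.java.function package mentioned in the note. A sketch of that pre-Java-8 form (the class name is illustrative):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;

public class ReduceExample {
    static int sum() {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("ReduceExample").setMaster("local[2]"));
        try {
            JavaRDD<Integer> distData =
                sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
            // Anonymous-class equivalent of distData.reduce((a, b) -> a + b):
            return distData.reduce(new Function2<Integer, Integer, Integer>() {
                @Override
                public Integer call(Integer a, Integer b) {
                    return a + b;
                }
            });
        } finally {
            sc.stop();
        }
    }

    public static void main(String[] args) {
        System.out.println(sum()); // 1 + 2 + 3 + 4 + 5 = 15
    }
}
```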
One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)). Note: some places in the code use the term slices (a synonym for partitions) to maintain backward compatibility.
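The effect of the second parameter can be checked by inspecting the resulting RDD: partitions().size() reports how many partitions it was cut into. A small sketch (names are illustrative):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionsExample {
    static int numPartitions() {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("PartitionsExample").setMaster("local[4]"));
        try {
            // Explicitly cut the dataset into 4 partitions; Spark will run
            // one task per partition when an action is invoked.
            JavaRDD<Integer> rdd =
                sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4);
            return rdd.partitions().size();
        } finally {
            sc.stop();
        }
    }

    public static void main(String[] args) {
        System.out.println(numPartitions()); // 4
    }
}
```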
External Datasets
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
Text file RDDs can be created using SparkContext's textFile method. This method takes a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc URI) and reads it as a collection of lines. Here is an example invocation:
JavaRDD<String> distFile = sc.textFile("data.txt");
Once created, distFile can be acted on by dataset operations. For example, we can add up the sizes of all the lines using the map and reduce operations as follows: distFile.map(s -> s.length()).reduce((a, b) -> a + b).
Some notes on reading files with Spark:
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
All of Spark's file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
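Putting these notes together: the sketch below writes a small local file (so the path exists for both driver and the local-mode "workers"), reads it with an explicit minimum partition count via textFile's second argument, and sums the line lengths as in the map/reduce example above. The file contents and names are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TextFileExample {
    static int totalLineLength() throws IOException {
        // Create a small local file; in local mode the driver and the
        // "workers" share a filesystem, so the path is accessible to both.
        Path file = Files.createTempFile("data", ".txt");
        Files.write(file, Arrays.asList("spark", "is", "fast"));

        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("TextFileExample").setMaster("local[2]"));
        try {
            // The second argument asks Spark for at least 4 partitions.
            JavaRDD<String> lines = sc.textFile(file.toString(), 4);
            // Sum the lengths of all lines, as in the example above.
            return lines.map(s -> s.length()).reduce((a, b) -> a + b);
        } finally {
            sc.stop();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(totalLineLength()); // 5 + 2 + 4 = 11
    }
}
```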