Spark Programming Guide (Java) - Spark 1.6.2
2016/8/17 Spark Programming Guide - Spark 1.6.2 Documentation
http://spark.apache.org/docs/1.6.2/programmingguide.html
Spark Programming Guide
Overview
Linking with Spark
Initializing Spark
Using the Shell
Resilient Distributed Datasets (RDDs)
Parallelized Collections
External Datasets
RDD Operations
Basics
Passing Functions to Spark
Understanding closures
Example
Local vs. cluster modes
Printing elements of an RDD
Working with Key-Value Pairs
Transformations
Actions
Shuffle operations
Background
Performance Impact
RDD Persistence
Which Storage Level to Choose?
Removing Data
Shared Variables
Broadcast Variables
Accumulators
Deploying to a Cluster
Launching Spark jobs from Java/Scala
Unit Testing
Migrating from pre-1.0 Versions of Spark
Where to Go from Here
Overview
At a high level, every Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.
A second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only "added" to, such as counters and sums.
This guide shows each of these features in each of Spark's supported languages. It is easiest to follow along with if you launch Spark's interactive shell – either bin/spark-shell for the Scala shell or bin/pyspark for the Python one.
Linking with Spark
Spark 1.6.2 works with Java 7 and higher. If you are using Java 8, Spark supports lambda expressions for concisely writing functions; otherwise you can use the classes in the org.apache.spark.api.java.function package.
To write a Spark application in Java, you need to add a dependency on Spark. Spark is available through Maven Central at:
groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.6.2
In addition, if you wish to access an HDFS cluster, you need to add a dependency on hadoop-client for your version of HDFS.
groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
Finally, you need to import some Spark classes into your program. Add the following lines:
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;
Initializing Spark
The first thing a Spark program must do is to create a JavaSparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application.
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);
The appName parameter is a name for your application to show on the cluster UI. master is a Spark, Mesos or YARN cluster URL, or a special "local" string to run in local mode. In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with spark-submit and receive it there. However, for local testing and unit tests, you can pass "local" to run Spark in-process.
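The two configuration lines above can be wrapped into a minimal runnable program for in-process testing; the class name, application name, and the trivial count() job below are illustrative additions, not part of the original guide:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Minimal sketch: run Spark in-process, as you would in a unit test.
public class LocalExample {
    static long countInProcess() {
        SparkConf conf = new SparkConf()
            .setAppName("LocalExample")   // shown on the cluster UI
            .setMaster("local[2]");       // local mode with 2 worker threads
        JavaSparkContext sc = new JavaSparkContext(conf);
        try {
            // A trivial job to confirm the context works:
            return sc.parallelize(Arrays.asList(1, 2, 3)).count();
        } finally {
            sc.stop();                    // always release resources
        }
    }

    public static void main(String[] args) {
        System.out.println(countInProcess()); // prints 3
    }
}
```

On a real cluster the setMaster call would be omitted and the master supplied to spark-submit instead, as described above.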
Using the Shell
In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the variable called sc. Making your own SparkContext will not work. You can set which master the context connects to using the --master argument, and you can add JARs to the classpath by passing a comma-separated list to the --jars argument. You can also add dependencies (e.g. Spark Packages) to your shell session by supplying a comma-separated list of Maven coordinates to the --packages argument. Any additional repositories where dependencies might exist (e.g. Sonatype) can be passed to the --repositories argument. For example, to run bin/spark-shell on exactly four cores, use:
$ ./bin/spark-shell --master local[4]
Or, to also add code.jar to its classpath, use:
$ ./bin/spark-shell --master local[4] --jars code.jar
To include a dependency using Maven coordinates:
$ ./bin/spark-shell --master local[4] --packages "org.example:example:0.1"
For a complete list of options, run spark-shell --help. Behind the scenes, spark-shell invokes the more general spark-submit script.
Resilient Distributed Datasets (RDDs)
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Parallelized Collections
Parallelized collections are created by calling JavaSparkContext's parallelize method on an existing Collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5:
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
Once created, the distributed dataset (distData) can be operated on in parallel. For example, we might call distData.reduce((a, b) -> a + b) to add up the elements of the list. We describe operations on distributed datasets later on.
Note: In this guide, we'll often use the concise Java 8 lambda syntax to specify Java functions, but in older versions of Java you can implement the interfaces in the org.apache.spark.api.java.function package. We describe passing functions to Spark in more detail below.
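On Java 7 and earlier, the lambda (a, b) -> a + b corresponds to an anonymous class implementing Function2 from the org.apache.spark.api.java.function package mentioned in the note. A sketch of that pre-Java-8 form (the class name is illustrative):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;

public class ReduceExample {
    static int sum() {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("ReduceExample").setMaster("local[2]"));
        try {
            JavaRDD<Integer> distData =
                sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
            // Anonymous-class equivalent of distData.reduce((a, b) -> a + b):
            return distData.reduce(new Function2<Integer, Integer, Integer>() {
                @Override
                public Integer call(Integer a, Integer b) {
                    return a + b;
                }
            });
        } finally {
            sc.stop();
        }
    }

    public static void main(String[] args) {
        System.out.println(sum()); // 1 + 2 + 3 + 4 + 5 = 15
    }
}
```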
One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)). Note: some places in the code use the term slices (a synonym for partitions) to maintain backward compatibility.
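The effect of the second parameter can be checked by inspecting the resulting RDD: partitions().size() reports how many partitions it was cut into. A small sketch (names are illustrative):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionsExample {
    static int numPartitions() {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("PartitionsExample").setMaster("local[4]"));
        try {
            // Explicitly cut the dataset into 4 partitions; Spark will run
            // one task per partition when an action is invoked.
            JavaRDD<Integer> rdd =
                sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4);
            return rdd.partitions().size();
        } finally {
            sc.stop();
        }
    }

    public static void main(String[] args) {
        System.out.println(numPartitions()); // 4
    }
}
```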
External Datasets
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
Text file RDDs can be created using SparkContext's textFile method. This method takes a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc URI) and reads it as a collection of lines. Here is an example invocation:
JavaRDD<String> distFile = sc.textFile("data.txt");
Once created, distFile can be acted on by dataset operations. For example, we can add up the sizes of all the lines using the map and reduce operations as follows: distFile.map(s -> s.length()).reduce((a, b) -> a + b).
Some notes on reading files with Spark:
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
All of Spark's file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
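Putting these notes together: the sketch below writes a small local file (so the path exists for both driver and the local-mode "workers"), reads it with an explicit minimum partition count via textFile's second argument, and sums the line lengths as in the map/reduce example above. The file contents and names are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TextFileExample {
    static int totalLineLength() throws IOException {
        // Create a small local file; in local mode the driver and the
        // "workers" share a filesystem, so the path is accessible to both.
        Path file = Files.createTempFile("data", ".txt");
        Files.write(file, Arrays.asList("spark", "is", "fast"));

        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("TextFileExample").setMaster("local[2]"));
        try {
            // The second argument asks Spark for at least 4 partitions.
            JavaRDD<String> lines = sc.textFile(file.toString(), 4);
            // Sum the lengths of all lines, as in the example above.
            return lines.map(s -> s.length()).reduce((a, b) -> a + b);
        } finally {
            sc.stop();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(totalLineLength()); // 5 + 2 + 4 = 11
    }
}
```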