How to build AutoML from scratch – Alexander Mamaev – Medium.pdf

Points/C-coins required: 16 · 2019-05-31 09:46:21 · 3.22 MB · PDF

How to build AutoML from scratch, Bridging WebML to model-driven engineering: From document type definitions to meta object facility, Meta-models are a prerequisite for model-driven engineering (MDE) in general and consequently for model-driven web engineering in particular. Various web modelling languages, however, ar
[Figure: H2O compute engine. Load data, exploratory and descriptive analysis, feature engineering, model evaluation; data prep export and model export (Plain Old Java Object) to a production scoring environment such as Kafka or Storm.]

H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data and provides easy productionalization of those models in an enterprise environment.

Productionalization is the main difference between H2O and other frameworks: you can develop your model and features in this framework and then easily integrate them into a production environment such as Kafka, Spark, Storm, etc. You can thereby deploy a model on clusters to process a big data flow.

The H2O environment has API integrations for different platforms such as Java, Scala, R, and Python. The REST API deserves special attention, because it means you can run H2O as an ML web service in Docker and send HTTP requests to it from other microservices.

H2O integrates with different storage platforms, so you can easily connect a SQL database, the Hadoop Distributed File System (HDFS), or S3 storage to your ML pipeline.

H2O AutoML is an easy-to-use AutoML toolkit included in the H2O framework. It lets you build machine learning models with the full power of H2O, without any special knowledge, in just a few lines of code.

H2O lets you use many different algorithms in your ML pipeline, such as XGBoost, H2O GBM, neural networks, and others. The framework can automatically search for optimal hyperparameters by grid search.
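To make the idea of grid search concrete, here is a minimal pure-Python sketch. It is not H2O's implementation: the parameter names (`max_depth`, `learn_rate`) and the `toy_score` function are hypothetical stand-ins for training and validating a real model.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return (best_params, best_score).

    Higher scores are treated as better.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical stand-in for "train a model and return its validation score".
    # By construction it peaks at max_depth=5, learn_rate=0.1.
    return -abs(params["max_depth"] - 5) - 10 * abs(params["learn_rate"] - 0.1)


grid = {"max_depth": [3, 5, 7], "learn_rate": [0.01, 0.1, 0.3]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The cost of an exhaustive grid is the product of the list lengths (here 3 × 3 = 9 evaluations), which is why a time limit matters once each evaluation means training a model.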
H2O can also combine your models into a stacked ensemble, which can improve accuracy on your task. H2O can also train an AutoML pipeline on Spark, which lets you use a cluster to train many different models.

    import h2o
    from h2o.automl import H2OAutoML
    # ...
    x = train.columns
    # ...
    # For binary classification, response should be a factor
    test[y] = test[y].asfactor()
    # ...
    aml.train(x=x, y=y, training_frame=train)
    # ...
    preds = aml.predict(test)
    preds = aml.leader.predict(test)

This is a simple example of using AutoML: here we download a dataset and run a basic AutoML pipeline. H2O then trains many different models, scores them with 5-fold cross-validation, and also creates ensembles of the top models.

    model_id                                              auc        logloss    mean_per_class_error
    StackedEnsemble_AllModels_AutoML_20181022_221411      0.7870176  0.5541308  0.325461
    StackedEnsemble_BestOfFamily_AutoML_20181022_221411   0.7857408  0.5553949  0.3265818
    XGBoost_grid_1_AutoML_20181022_221411_model_3         0.7825571  0.5598532  0.332667
    XGBoost_1_AutoML_20181022_221411                      0.7810665  0.5601261  0.331227
    XGBoost_3_AutoML_20181022_221411                      0.7808475  0.5611616  0.324007
    XGBoost_grid_1_AutoML_20181022_221411_model_4         0.7806241  0.5606613  0.322992
    XGBoost_2_AutoML_20181022_221411                      0.780521   0.5613740  0.3861294

A developer can use the top model from the leaderboard to predict values for the test set. They can also tune parameters such as the stopping metric, sorting metric, n_folds, column weights, time limit, etc.

Azure automated machine learning

[Figure: Azure automated machine learning. A dataset, an optimization metric, and constraints (time/cost) go into automated machine learning, which produces a machine learning model.]

Azure AutoML is a cloud toolkit from Microsoft for using AutoML in the Azure cloud. You can use automated ML in an Azure notebook.

    automl_config = AutoMLConfig(
        # ...
        primary_metric='AUC_weighted',
        max_time_sec=12000,
        # ...
    )
    # ...
    from azureml.core.experiment import Experiment
    # ...
    experiment.submit(automl_config, show_output=True)

Azure AutoML example

You just need to create an experiment with some parameters (task, dataset, the blacklist of algorithms, n_folds, and others) and then run your experiment.
After that, the black magic of AutoML creates and fits the model for your task. You can read more about Azure AutoML here.

Google AutoML

[Figure: How AutoML works. Dataset, AutoML, serve; generate predictions with a REST API.]

Google AutoML is a cloud-based AutoML-as-a-service. You can use Google Cloud to build your own classifier of photos or text by drag and drop.

[Figure: Cloud AutoML Vision. Upload and label images, train your model, evaluate.]

You need to upload your dataset to the platform. Google will then train a model for your task, and you can use it through the Cloud API. It's the simplest way to use ML in your project.

Other projects

On GitHub you can find many different AutoML projects:
Nevergrad: Facebook's derivative-free parameter optimization
NNI: Microsoft toolkit that helps run AutoML experiments on any neural network framework, such as PyTorch, CNTK, Keras, etc.
AutoKeras: toolkit for AutoML in Keras in a few lines of code
auto-sklearn: toolkit for AutoML in sklearn in a few lines of code
TPOT: toolkit for AutoML that uses sklearn and XGBoost models and tunes them with a genetic algorithm
And others.

Competition Auto ML

SDSJ AutoML is a competition of AutoML systems: developers were challenged to create their own AutoML software and upload it to the contest platform, where it is trained and evaluated. This competition differs from a classical Kaggle competition: you must send your code in a Docker container, and the code then runs on a server, trains a model on closed datasets, and tests it. It is similar to ACM ICPC competitions, because you send code instead of a CSV file. The competition datasets are closed and developers can't see them. The model is evaluated on real bank data.

The program receives as input the train or test dataset name, the task type (binary classification or regression), and the calculation time limit.
The program must train an ML model on the train set within the given time and then make predictions for the test set within the same time.

Input data format

The input dataset is a CSV table with the following columns:
line_id: line identifier
target: target value (only for the train set), a real number for the regression task or 0/1 for the classification task
<type>_<feature>: input feature of some data type (type)

Feature data types:
datetime: date in the format 2010-01-01, or date and time in the format 2010-01-01 10:10:10
number: real or integer number (can also contain a categorical feature)
string: string feature (names or text)
id: identifier (like a categorical feature)

Metrics

For every task (dataset) the system computes a task-specific metric (RMSE for regression, ROC-AUC for binary classification). For every task, you get a score from zero to one: 0 for models whose quality is at or below the baseline, 1 for the best solution in this task. The scores for all tasks are then summed.

My solution

Feature engineering

Interesting fact: I did not use any hardware except my MacBook. The whole solution is implemented in several Python scripts. After adding a new feature, I tested and debugged it on small datasets. Then I pushed my code to the platform and looked at how the feature changed my score. I tried to implement the classical workflow of a data scientist. The big challenge was that sometimes we get a big dataset or a low time limit, and our behavior must change. After several iterations of implementing and testing features, I arrived at my final data preprocessing and model training pipeline.

Dataset preparation

If the dataset is big (> 2 GB), we calculate the feature correlation matrix and delete correlated features.
Otherwise, we apply mean target encoding and one-hot encoding.
After that, we select the top 10 features by the coefficients of a linear model (Ridge/LogisticRegression).
We generate new features by pairwise division of the top 10 features.
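The pairwise-division step can be sketched in plain Python as follows. This is an illustration under assumptions, not the author's code: the function name `pairwise_division_features`, the `a_div_b` naming scheme, the `number_0` to `number_9` column names, and the choice to emit NaN when the divisor is zero are all hypothetical.

```python
import math
from itertools import permutations


def pairwise_division_features(row, feature_names):
    """Return {"a_div_b": row[a] / row[b]} for every ordered pair of distinct features."""
    new_features = {}
    for a, b in permutations(feature_names, 2):
        denominator = row[b]
        # Assumption: a zero divisor yields NaN instead of raising an error.
        new_features[f"{a}_div_{b}"] = (
            row[a] / denominator if denominator != 0 else math.nan
        )
    return new_features


# Ten hypothetical "top" features with values 1.0 .. 10.0.
top_features = [f"number_{i}" for i in range(10)]
row = {name: float(i + 1) for i, name in enumerate(top_features)}

generated = pairwise_division_features(row, top_features)
print(len(generated))
print(generated["number_0_div_number_1"])
```

Because both `a / b` and `b / a` are kept, the number of generated columns grows quadratically with the number of input features, so restricting the step to a small selected set keeps the dataset manageable.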
This method generates 90 new features (10^2 - 10) and concatenates them to the dataset.

Model training

If the dataset is small, we train three LightGBM models by k-folds and then blend the predictions from every fold.
If the dataset is big and the time limit is small (5 minutes), we just train linear models (logistic regression or ridge).
Otherwise, we train one big LightGBM (n_estimators=800).

The source code of my solution can be found on my GitHub.

Results

    Place  Team              Final result  Last submission       Attempts
    1      Magic City        6.54599       3 Nov 2018, 23:55
    2      antler            6.18121       3 Nov 2018, 21:09
    3      andrei.dukhourik  6.09792       3 Nov 2018, 18:19
    4      morphism          5.91017       28 Oct 2018, 18:05
           alxmamaey         5.87041       3 Nov 2018, 21:43     106
    5      LFB?              5.83114       3 Nov 2018, 21:55
    6      AgroDrozd         5.77626       3 Nov 2018, 19:43

Private leaderboard

In the final leaderboard I got 5th place. WOHOOO!

Awards

    Kinghiram_Zhang: Not bad, the content is quite comprehensive and systematic...
    向来痴SAS: It's just a 14-page blog post.