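Machine Learning in Practice: Crunchbase Startup Classification (Success and Failure), with Dataset and Code, a Runnable and Reliable Resource - CSDN Library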

43 files in total

png: 27

csv: 11

ipynb: 4

machine learning

python

dataset

jupyternotebook

Uploaded 2022-12-02 17:31:17, 91.09MB RAR


机器学习实战 - 副本.rar (43 files)

机器学习实战 - 副本

acquisitions.csv 2MB

objects.csv 271.62MB

images

Untitled.ipynb 24KB

StartupInvestmentNotebook.ipynb 1.21MB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.016.png 20KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.022.png 90KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.009.png 26KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.007.png 24KB

native_bayes.png 23KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.020.png 12KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.017.png 88KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.018.png 26KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.012.png 45KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.030.png 28KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.014.png 27KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.011.png 18KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.025.png 22KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.013.png 16KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.021.png 14KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.010.png 31KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.006.png 22KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.031.png 31KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.023.png 63KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.005.png 8KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.027.png 23KB

README.md 65KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.008.png 29KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.015.png 18KB

decision_tree.png 23KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.029.png 21KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.019.png 15KB

Aspose.Words.77fd421c-8fc6-4a9f-8309-7408e15a47cf.024.png 57KB

people.csv 9.69MB

relationships.csv 39.39MB

investments.csv 5.13MB

funds.csv 348KB

funding_rounds.csv 12.53MB

machine_learning.ipynb 1.2MB

degrees.csv 11.4MB

milestones.csv 9.32MB

.ipynb_checkpoints

machine_learning-checkpoint.ipynb 1.2MB

offices.csv 10.71MB

ipos.csv 140KB

# A MACHINE LEARNING APPROACH FOR START-UP INVESTMENTS

Full dataset available from: https://www.kaggle.com/justinas/startup-investments

**Abstract**

In this project, a machine learning approach that classifies startups into two classes (successful and unsuccessful) was implemented and explored. The dataset was acquired from Crunchbase. It comprises 11 tables containing information about startups, investors, relationships and founders’ backgrounds in the ecosystem, among much else. Four tables were shortlisted and merged into one dataset. After data transformation and pre-processing, however, a large amount of data had to be dropped because of data sparsity. The final dataset was made up of 61,716 startups and 36 features. Feature scaling was also conducted, which reduced the number of features to seven while keeping the same predictive power. Five supervised machine learning algorithms were applied to the data: Decision Tree, Support Vector Machine, Random Forest, Naïve Bayes and Multilayer Perceptron (MLP). K-means clustering was also applied in combination with them to boost performance. All of the algorithms achieved an accuracy above 90%. However, this can be attributed to the skewed distribution of classes in the dataset. Recall was identified as a more important performance metric, because a strategy that minimises false negatives (misclassifying a successful startup as unsuccessful) should be prioritised: the cost of a missed opportunity is extremely high. The MLP model performed best of all the models, achieving an accuracy of 98%, a precision of 95% and, more surprisingly, a recall of 91%. It was concluded that although the MLP achieved a reasonable recall, it should not be deployed yet because the dataset has limitations, specifically that it is outdated and incomplete.
The model, however, could be used by investors in the initial phase of screening startups, where it could potentially save them a significant amount of time and tedious work.

***Keywords**: startup, investment, machine learning, SVM, DT, RF, NB, MLP, Crunchbase, K-means Clustering*

# Literature Review

Several strategies have been taken to develop an accurate method for predicting the success of early-stage companies. One of the earliest examples in this field, published in 2003, showcases a rule-based expert system for predicting company acquisition [1]. Although the model achieves a success rate of 70%, the effectiveness of the work is very limited because a dataset of only 200 startups was considered. Similarly, H. Littunen and H. Niittykangas analysed 200 companies based in Finland using logistic regression. They looked at the founder’s background, motive and management style, among other attributes [2]. Moreover, other work in the north-east of England studied the survival and failure of manufacturing startups using log-logistic hazard models. It investigated the connection between the survival time of firms, their sizes and the macroeconomic conditions. However, this study too considered only 181 firms [3].

A common strategy for predicting the success or failure of a startup is logistic regression, as shown in the work by R. N. Lussier and S. Pfeifer [4]. In this study, 15 dependent variables were obtained by reviewing 20 previous works. The work has also been broadened and modelled for different countries, including Chile, Croatia and the USA [5],[6], with varying outcomes. For example, in the USA only four of the 15 variables were deemed statistically significant. This study is also very limited by its small sample size.

Work by R.
Nahata, published in 2008, looked specifically at venture capital investment performance and connected it to the reputation of the VC firm. It showed that VC firms with a higher reputation are more likely to help their portfolio companies achieve successful exits [7]. Furthermore, S. Hoenen’s work used a linear regression model to investigate whether patents increase VC investment for biotechnology companies in the United States [14]. The companies studied were incorporated between 1974 and 2011, and the attributes that define the startups included the number of patents owned, investment funding received directly from VCs and regional information, among others.

Most work conducted, especially after the late 2000s, focuses more on data-driven machine learning models. This can be attributed to the increase in the availability of data. Whereas a significant amount of effort was previously required to gather data on startups, companies such as Crunchbase have made it much more accessible, although they charge a significant amount for their services. A study by Yankov et al., published in 2014, uses a questionnaire and multiple machine learning methods to predict the success rate of Bulgarian startups. Decision trees were shown to be the most accurate and were able to extract important startup success factors such as the founder’s background and the company’s competitive advantage [8]. A study published in 2009 [9] investigated the merger and acquisition market in Japan and used an ensemble classifier to predict 600 different cases. The models reportedly achieved a global accuracy of 88% and a precision of around 40% when predicting an acquisition. Furthermore, D. J. McKenzie and D. Sansone compared machine learning methods with domain experts in the field of startup investment. They conducted a business plan competition in Nigeria, which involved 2,056 startups.
The report shows that machine learning techniques achieved an accuracy of 63% in predicting successful startups, while the domain experts achieved 58%. This highlighted that startup investment prediction is a difficult task, and that even human experts in the field struggle significantly to get it right [10].

A more advanced approach by Xiang et al. (2012) compares different ML classifiers trained to predict startup acquisition for companies started between 1970 and 2007 [11]. The startup features used include finance sources, management team information and, more interestingly, news from TechCrunch, an American online publisher focusing on the tech industry and its trends. One major drawback they faced was having to disregard 20,000 companies as a direct result of data sparsity in the dataset. They managed to keep 60,000 startups described by 22 attributes. Furthermore, enriching the dataset using a corpus of over 38,000 news articles was not successful, as only 5,000 of the companies were present in the corpus. The study showed that Bayesian Networks performed better than both logistic regression and SVM. They achieved a precision ranging from 60% to 79.8% for half of the categories. It was concluded that information gained from news outlets improved the overall results.

More sophisticated approaches, which combine supervised learning (Support Vector Machines) and unsupervised learning (clustering) to predict business models with higher growth and a better chance of survival, were also conducted [12]. The study focused on startups in the USA and Germany. It achieved an accuracy of 83.6% when predicting survivability. However, it had major limitations, as it used a dataset of only around 181 businesses. Furthermore, F. R. da Silva Ribeiro Bento proposes a machine learning approach to predict startup success (either a merger & acquisition or an IPO) [13].
The study used SVM, logistic regression and random forests on a dataset acquired from Crunchbase, comprising 80,000 startups from 5 states in the United States (founded between 1985 and 2014). A p
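
The abstract's point that accuracy above 90% can be an artefact of skewed classes, and that recall is the more informative metric, can be illustrated with a small sketch. Everything here is an assumption for illustration only: the 5-in-100 class split, both sets of predictions, and the helper names `confusion_counts` and `scores` are hypothetical and are not taken from the project's data or notebooks.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def scores(y_true, y_pred, positive=1):
    """Compute accuracy, precision and recall from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Guard against division by zero when nothing is predicted positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical toy labels: 1 = successful, 0 = unsuccessful.
# Only 5 of 100 startups are successful (an assumed, illustrative skew).
y_true = [1] * 5 + [0] * 95

# Baseline that labels every startup as unsuccessful.
always_negative = [0] * 100

# Hypothetical classifier: finds 4 of the 5 successes, raises 2 false alarms.
model_pred = [1, 1, 1, 1, 0] + [1, 1] + [0] * 93

baseline_scores = scores(y_true, always_negative)
model_scores = scores(y_true, model_pred)

print("baseline:", baseline_scores)
print("model:   ", model_scores)
```

On these toy labels the always-unsuccessful baseline reaches 95% accuracy with zero recall, while the second set of predictions reaches 97% accuracy with 80% recall. The two accuracy figures are close, and only recall exposes that the baseline misses every successful startup.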
