# A MACHINE LEARNING APPROACH FOR START-UP INVESTMENTS
Full dataset available from: https://www.kaggle.com/justinas/startup-investments
**Abstract**
In this project, a machine learning approach that classifies startups into two classes (successful and unsuccessful) was implemented and explored. The dataset for the project was acquired from Crunchbase. It comprises 11 tables containing information about startups, investors, relationships and founders' backgrounds in the ecosystem, among other information. Four tables were shortlisted and merged into one dataset. After data transformation and pre-processing, however, a large amount of data had to be dropped because of data sparsity. The final dataset was made up of 61,716 startup instances and 36 features. Feature scaling was also conducted, and the number of features was reduced to seven while keeping the same predictive power.
Five supervised machine learning algorithms were used on the data: Decision Tree, Support Vector Machine, Random Forest, Naïve Bayes and Multilayer Perceptron (MLP). K-means Clustering was also applied in combination with them to boost performance.
All machine learning algorithms achieved an accuracy score above 90%. However, this can be attributed to the skewed distribution of classes in the dataset.
Recall was identified as the more important performance metric: because the cost of a missed opportunity is extremely high, a strategy that minimises false negatives (misclassifying a successful startup as unsuccessful) should be prioritised.
The MLP model performed best of all the models, achieving an accuracy of 98%, a precision of 95% and, more surprisingly, a recall of 91%.
It was concluded that although the MLP achieved a reasonable score for recall, it should not be deployed yet, as the dataset used had some limitations, specifically its outdatedness and lack of completeness. The model could, however, be used by investors in the initial phase of screening startups, where it could save them a significant amount of time and tedious work.
***Keywords**: startup, investment, machine learning, SVM, DT, RF, NB, MLP, Crunchbase, K-means Clustering*
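The abstract's point that accuracy is misleading on skewed classes, and that recall matters more, can be illustrated with a minimal sketch. The class counts and predictions below are hypothetical, chosen only for illustration, and are not taken from the project's dataset or results; the `Confusion` class and `confusion` helper are likewise illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts, with 'successful' (1) as the positive class."""
    tp: int  # successful startups predicted successful
    fp: int  # unsuccessful startups predicted successful
    fn: int  # successful startups predicted unsuccessful (missed opportunities)
    tn: int  # unsuccessful startups predicted unsuccessful

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)

    @property
    def precision(self) -> float:
        predicted_pos = self.tp + self.fp
        return self.tp / predicted_pos if predicted_pos else 0.0

    @property
    def recall(self) -> float:
        actual_pos = self.tp + self.fn
        return self.tp / actual_pos if actual_pos else 0.0


def confusion(y_true: list[int], y_pred: list[int]) -> Confusion:
    """Count the four outcomes for binary labels (1 = successful)."""
    pairs = list(zip(y_true, y_pred))
    return Confusion(
        tp=sum(1 for t, p in pairs if t == 1 and p == 1),
        fp=sum(1 for t, p in pairs if t == 0 and p == 1),
        fn=sum(1 for t, p in pairs if t == 1 and p == 0),
        tn=sum(1 for t, p in pairs if t == 0 and p == 0),
    )


# Hypothetical skewed dataset: 60 successful and 940 unsuccessful startups.
y_true = [1] * 60 + [0] * 940

# Baseline that labels every startup as unsuccessful.
baseline_pred = [0] * 1000

# Hypothetical classifier: finds 45 of the 60 successes, raises 10 false alarms.
model_pred = [1] * 45 + [0] * 15 + [1] * 10 + [0] * 930

baseline = confusion(y_true, baseline_pred)
model = confusion(y_true, model_pred)

for name, c in [("baseline", baseline), ("model", model)]:
    print(f"{name}: accuracy={c.accuracy:.3f} "
          f"precision={c.precision:.3f} recall={c.recall:.3f}")
```

In this made-up example the baseline reaches 94% accuracy while finding none of the successful startups, which is why an accuracy above 90% on a skewed dataset says little on its own.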
# Literature Review
Several different strategies have been taken to develop an accurate method for predicting the success of early-stage companies.
One of the earliest examples in this field, published in 2003, showcases a rule-based expert system for predicting company acquisition [1]. Although the model achieves a success rate of 70%, the effectiveness of the work is very limited as a dataset of only 200 startups was considered.
Similarly, H. Littunen and H. Niittykangas analysed 200 companies based in Finland using logistic regression. They looked at the founder's background, motive and management style, among other attributes [2]. Other work, in the north-east of England, studied the survival and failure of manufacturing startups using log-logistic hazard models. It investigated the connection between the survival time of firms, their sizes and the macroeconomic conditions. However, this study too covered a small sample of only 181 firms [3].
A common strategy for predicting the success or failure of a startup is logistic regression, as shown in the work conducted by R. N. Lussier and S. Pfeifer [4]. In this study, 15 dependent variables were obtained by reviewing 20 previous works. The work has also been broadened and modelled for different countries, including Chile, Croatia and the USA [5], [6], with varying outcomes. For example, in the USA only four of the 15 variables were deemed statistically significant. This study is also very limited, as it considers a small sample size.
Work by R. Nahata, published in 2008, looked specifically at venture capital (VC) investment performance and connected it to the reputation of the VC firm. It showed that VC firms with a higher reputation are more likely to help their portfolio companies achieve successful exits [7].
Furthermore, S. Hoenen’s work used a linear regression model to investigate if patents increase VC investment for biotechnology companies in the United States [14]. The companies studied were incorporated between 1974 and 2011, and attributes that define the startups included the number of patents owned, investment funding received directly from VCs and regional information among other attributes.
Most work conducted, especially after the late 2000s, focuses more on data-driven machine learning models. This can be attributed to the increase in the availability of data. Whereas a significant amount of effort was previously required to gather data on startups, companies such as Crunchbase have made it much more accessible, although they charge a significant amount for their services.
A study by Yankov et al., published in 2014, used a questionnaire and multiple machine learning methods to predict the success rate of Bulgarian startups. Decision trees were shown to be the most accurate and were able to extract important startup success factors such as the founder's background and the company's competitive advantage [8].
A study published in 2009 [9] investigated the merger and acquisition market in Japan and used an ensemble classifier to predict 600 different cases. It was reported that the models achieved a global accuracy of 88% and a precision of around 40% when predicting an acquisition.
Furthermore, D. J. McKenzie and D. Sansone compared machine learning methods with domain experts in the field of startup investment. They conducted a business plan competition in Nigeria, which involved 2056 startups. The report shows that machine learning techniques achieved an accuracy of 63% in predicting successful startups, while the domain experts achieved an accuracy of 58%. This highlighted that startup investment prediction is a difficult task, and that even human experts in the field struggle significantly to get it right [10].
A more advanced approach by Xiang et al. (2012) compares different ML classifiers trained to predict startup acquisition for companies started between 1970 and 2007 [11].
Some of the startup features used include finance sources, management team information and, more interestingly, news from TechCrunch, an American online publisher focusing on the tech industry and its trends. One major drawback they faced was having to disregard 20,000 companies as a direct result of data sparsity in the dataset. They managed to keep 60,000 startups described by 22 attributes. Furthermore, enriching the dataset using a corpus of over 38,000 news articles was not successful, as only 5,000 of the companies were present in the corpus. The study showed that Bayesian Networks performed better than both logistic regression and SVM. For half of the categories, they achieved a precision ranging from 60% to 79.8%. It was concluded that information gained from news outlets improved the overall results.
More sophisticated approaches, which combine supervised learning (Support Vector Machines) and unsupervised learning (clustering) to predict business models with higher growth and a better chance of survival, have also been explored [12]. The study focused on startups in the USA and Germany. It achieved an accuracy of 83.6% when trying to predict survivability. However, it had major limitations, as it used a dataset of only around 181 businesses.
Furthermore, F. R. da Silva Ribeiro Bento proposes a machine learning approach to predict startup success (either a merger & acquisition or an IPO) [13]. The study used SVM, logistic regression and random forests on a dataset acquired from Crunchbase, comprising 80,000 startups from 5 states in the United States (founded between 1985 and 2014). A p