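Interview Prep for Software Engineers, Algorithm Engineers, Machine Learning Engineers, and Data Scientists at Overseas and Multinational Tech Companies (.zip resource, CSDN Library)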

431 files in total

md: 381

png: 49

gitignore: 1

83 views, uploaded 2024-03-05 10:49:21, 24.97 MB, ZIP

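Software engineer, algorithm engineer, machine learning engineer, and data scientist are among the most sought-after roles in today's IT industry, particularly at overseas and large multinational companies. The interview is the decisive step in applying for these roles, so it pays to know what is covered and to prepare thoroughly.

Software engineers design, develop, test, and maintain software applications. Interviews may cover programming languages (such as Java, Python, and C++), data structures (arrays, linked lists, trees), algorithms (sorting, searching), operating system principles (processes, threads, memory management), and database knowledge (SQL queries). Software engineering practices such as agile development, version control (Git), and code review are also common topics.

Algorithm engineers need a solid programming foundation together with a strong command of algorithms and data structures. In interviews they may be asked to solve complex problems involving graph theory, dynamic programming, greedy algorithms, and other advanced techniques. Understanding complexity analysis (time and space complexity) is equally important.

Machine learning engineers build and optimize models that learn from data. Basic statistics and probability, such as the Gaussian distribution and Bayes' theorem, are required. The core material is the basic concepts of supervised, unsupervised, and reinforcement learning and the common models (linear regression, logistic regression, SVM, neural networks, K-means, DBSCAN). Interviewers may also examine feature engineering, model evaluation and tuning, and big data tools such as Hadoop and Spark.

Data scientists lean more toward data analysis and business insight: they process large amounts of data and extract useful information from it. The baseline expectations are data preprocessing (cleaning, handling missing values), data visualization (matplotlib, seaborn), statistical modeling (linear models, decision trees, random forests), predictive analysis (such as time series analysis), and data analysis tools (Pandas, NumPy). Command of a programming language (Python or R) and a database query language (such as SQL) is also essential.

Beyond technical ability, interviews at overseas and multinational companies also assess teamwork, communication, project management experience, and cross-cultural adaptability. Candidates should be able to describe their project experience clearly, demonstrate problem-solving ability, and show enthusiasm for new technology and a habit of continuous learning.

To pass interviews for these roles, candidates need a solid technical foundation, broad domain knowledge, good problem-solving technique, and strong soft skills, all built up through continued study and practice.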

软件工程师算法工程师机器学习工程师数据科学家海外外企大厂面试.zip (431 files)

.gitignore 900B

SUMMARY.md 31KB

02_examples.md 30KB

05_deep_learning.md 22KB

11_nlp.md 12KB

17_recommendation.md 11KB

video_recommendation.md 10KB

02_linear_regression.md 10KB

README.md 9KB

recommendation.md 8KB

README.md 7KB

12_llm.md 7KB

200. Number of Islands.md 6KB

76. Minimum Window Substring.md 6KB

README.md 6KB

04_tree.md 6KB

21_product_case.md 6KB

README.md 6KB

113. Path Sum II .md 6KB

01_metrics.md 6KB

03_logistic_regression.md 6KB

912. Sort an Array.md 6KB

ad_click.md 6KB

207 Course Schedule.md 5KB

README.md 5KB

00_ml_math.md 5KB

search_engine.md 5KB

121 Best Time to Buy and Sell Stock.md 5KB

215. Kth Largest Element.md 4KB

06_svm.md 4KB

news_feed.md 4KB

56. Merge Intervals.md 4KB

39 Combination Sum.md 4KB

01_leadership_principles.md 4KB

qa.md 4KB

20. Valid Parentheses.md 4KB

59. Spiral Matrix II.md 4KB

146. LRU Cache.md 3KB

721 Accounts Merge.md 3KB

102 Binary Tree Level Order Traversal.md 3KB

lift.md 3KB

README.md 3KB

5. Longest Palindromic Substring.md 3KB

208 Implement Trie.md 3KB

450. Delete Node in a BST.md 3KB

09_k_means.md 3KB

video_search.md 3KB

239. Sliding Window Maximum.md 3KB

README.md 3KB

787. Cheapest Flights Within K Stops.md 3KB

26. Remove Duplicate Numbers in Array.md 3KB

4. Median of Two Sorted Arrays.md 3KB

98 Validate Binary Search Tree.md 3KB

283. Move Zeroes.md 3KB

31. Next Permutation.md 2KB

236 Lowest Common Ancestor of a Binary Tree.md 2KB

07_knn.md 2KB

24. Swap Nodes in Pairs.md 2KB

704. Binary Search.md 2KB

105 Construct Binary Tree from Preorder and Inorder Traversal.md 2KB

378. Kth Smallest Element in a Sorted Matrix.md 2KB

81. Search in Rotated Sorted Array II.md 2KB

46 Permutation.md 2KB

README.md 2KB

128. Longest Consecutive Sequence.md 2KB

29. Divide Two Integers.md 2KB

131 Palindrome Partitioning.md 2KB

523 Continuous Subarray Sum.md 2KB

job_scheduler.md 2KB

104 Maximum Depth of Binary Tree.md 2KB

55 Jump Game.md 2KB

08_unsuperwised.md 2KB

310 Minimum Height Trees.md 2KB

934. Shortest Bridge.md 2KB

707. Design Linked List.md 2KB

139 Word Break.md 2KB

72 Edit Distance.md 2KB

209. Minimum Size Subarray Sum.md 2KB

151. Reverse Words in a String.md 2KB

224. Basic Calculator.md 2KB

162. Find Peak Element.md 2KB

sentiment_analysis.md 2KB

1456. Maximum Number of Vowels in a Substring of Given Length.md 2KB

739 Daily Temperatures.md 2KB

37 Sodoku Solver.md 2KB

README.md 2KB

148. Sort List.md 2KB

130. Surrounded Regions.md 2KB

34. Find First and Last Position of Element in Sorted Array.md 2KB

417. Pacific Atlantic Water Flow.md 2KB

994. Rotting Oranges.md 2KB

53 Maximum Subarray.md 2KB

3. Longest Substring Without Repeating Characters.md 2KB

973. K Closest Points.md 2KB

206. Reverse Linked List.md 2KB

1143 Longest Common Subsequence.md 2KB

431 entries in total

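# Machine Learning System Design

At its core, ML design means training a **model** to accomplish some task, such as prediction, ranking, or classification.

- Modeling design: the optimization objective, features, data, model architecture, evaluation criteria, and so on
- System design: focused on serving large models online, including the feature store, ANN, ETL pipeline, MLOps, and so on

**Examples**

- YouTube recommendation / DoorDash search box / auto suggestion
- Design a YouTube violent content detection system
- Detecting unsafe content
- Design a monitoring system that measures ML models in real time, including features, score distribution, and QPS

**Business objectives**

- Improve engagement on a feed
- Reduce customer churn
- Return items for a search engine query

**Glossary**

- Impression: a document is seen by the user
- Click-through rate (CTR): the probability that the user clicks document d, given that d was shown
- Engagement: actions taken after a click, such as liking, favoriting, sharing, following the author, or commenting on the document. In e-commerce: adding to cart, placing an order, paying

## Interview process

### Answer framework

- **Clarify requirements**
  - Scenario, functionality, goals, constraints, and how to turn the task into a machine learning problem (for example, framing recommendation as a binary classification model, and why)
- **Data**
  - Scale of the system: what data exists for users and items and at what order of magnitude, and whether data that could serve as features is logged
  - Identify data from two angles: training + label, testing + ground truth
  - Label sources: collected from interactions, human annotation, human annotation assisted by unsupervised methods, augmented data
  - Data discussion: bias, class imbalance, label quality
  - GDPR/privacy, data anonymization, data encryption
  - What to do when the train/test data distribution differs from production
  - What to do when the data distribution changes over time
- **Feature engineering**
  - In practice every ML team has its own embedding set, and teams use each other's embedding sets. How to pre-train, fine-tune, and combine features matters a great deal
  - How to A/B test a feature? Run it on different traffic
- **Model**
  - Model selection, taking system constraints such as prediction latency and memory into account. How to reasonably trade model quality for gains on those constraints
  - For every design choice, compare the pros and cons of the options as you would in a design doc
  - In most scenarios, fallback strategies outside the model are also needed
- **Evaluation**
  - Model evaluation, for example clicks, conversions, whether ads are involved. Is the target GMV or converted orders?
- **Deployment**
  - Server or device
  - All users or a subset of users
  - Statically, dynamically (server or device), or model streaming
- **Serving**
- **Monitoring**
  - Monitor latency, QPS, precision, recall, and similar metrics
- **Maintenance**

### Process

- Draw the block diagram on the whiteboard while telling the interviewer which parts you are going to cover
- Throughout, do not dig into the details of any one part before the overall picture is clear. The interviewer will tell you about the data as the problem is introduced
- For every part, especially the ones you know well, take the initiative to talk, because every part matters
- Close by confirming: "Is there anywhere that you feel I missed?"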
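As a small illustration of the glossary terms defined earlier in this section, the sketch below computes CTR and a post-click engagement rate from a toy impression log. The `Impression` record, its field names, and the sample log are assumptions made for this example; they do not come from the original material, and a real system would read these events from logging infrastructure.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Impression:
    """One exposure of a document to a user (hypothetical log schema)."""
    doc_id: str
    clicked: bool
    engaged: bool  # like / favorite / share / comment; only counted after a click


def ctr(impressions: Sequence[Impression]) -> float:
    """Click-through rate: clicks divided by impressions."""
    if not impressions:
        return 0.0
    return sum(i.clicked for i in impressions) / len(impressions)


def engagement_rate(impressions: Sequence[Impression]) -> float:
    """Share of clicked impressions that led to an engagement action."""
    clicks = [i for i in impressions if i.clicked]
    if not clicks:
        return 0.0
    return sum(i.engaged for i in clicks) / len(clicks)


# Toy log: 4 impressions, 2 clicks, 1 engagement after a click.
log = [
    Impression("d1", clicked=True, engaged=True),
    Impression("d1", clicked=False, engaged=False),
    Impression("d2", clicked=True, engaged=False),
    Impression("d2", clicked=False, engaged=False),
]

print(f"CTR = {ctr(log):.2f}, engagement rate = {engagement_rate(log):.2f}")
```

Conditioning engagement on clicks mirrors the definition in the glossary; whether the production metric is computed per document, per user, or per session is a choice to discuss with the interviewer.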
## Q&A

- How to scale
  - Scaling a general software system (distributed servers, load balancer, sharding, replication, caching, etc.)
  - Train data / KB partitioning
  - Distributed ML
    - Data parallelism (for training)
    - Model parallelism (for training, inference)
    - Asynchronous SGD
    - Synchronous SGD
  - Distributed training
    - Data parallel DT, RPC based DT
  - Scaling data collection
    - Machine translation for 1000 languages
    - NLLB
  - [Embedding -> Deep Hash Embedding](https://zhuanlan.zhihu.com/p/397600084)
  - Monitoring, failure tolerance, updating (below)
  - Auto ML (soft: HP tuning, hard: arch search (NAS))
- Online/offline inconsistency
  - [What are the pitfalls of recommender systems? (Zhihu)](https://www.zhihu.com/question/28247353/answer/2126590086)
- Which storage to use for different kinds of data
- How to design the data pipeline
- Serving
  - Online A/B testing
    - Based on the online metrics, select a significance level α and a power threshold 1 − β
    - Calculate the required sample size per variation
    - The required sample size depends on α, β, and the MDE (minimum detectable effect): the target relative minimum increase over the baseline that should be observed from a test
    - Randomly assign users to control and treatment groups (discuss with the interviewer whether to split on the user level or the request level)
    - Measure and analyze the results using the appropriate test. Also make sure the model does not have any biases
  - If we are serving batch features, they have to be computed offline and served in real time, so we need daily/weekly jobs to generate this data
  - If we are serving real-time features, they need to be fetched or derived at request time, and we need to be aware of scalability and latency issues (load balancing). We may need a feature store to look up features at serve time, and possibly some caching depending on the use case
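The sample-size and assignment steps of the A/B testing outline above can be sketched as follows. This is a minimal example under stated assumptions, not a prescribed method: it uses the common normal-approximation formula for a two-sided test comparing two proportions, treats the MDE as a relative lift over the baseline rate (as the text describes), and the function names, the `experiment` salt, and the example numbers are all hypothetical.

```python
import hashlib
import math
from statistics import NormalDist


def sample_size_per_variation(baseline_rate: float, relative_mde: float,
                              alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per group for a two-sided two-proportion z-test.

    baseline_rate: conversion rate of the control group, e.g. 0.10
    relative_mde:  minimum detectable effect as a relative lift, e.g. 0.10 for +10%
    alpha:         significance level
    power:         1 - beta
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1.0 + relative_mde)
    if relative_mde == 0 or not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0):
        raise ValueError("rates must lie in (0, 1) and the MDE must be non-zero")

    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2.0

    numerator = (z_alpha * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
                 + z_beta * math.sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def assign_group(unit_id: str, experiment: str = "exp_1") -> str:
    """Deterministic 50/50 split by hashing the unit id with an experiment salt.

    Pass a user id for user-level splitting or a request id for request-level
    splitting; the same id always lands in the same group.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"


if __name__ == "__main__":
    n = sample_size_per_variation(baseline_rate=0.10, relative_mde=0.10)
    print(f"users per variation: {n}")
    print(f"user_42 -> {assign_group('user_42')}")
```

Hashing rather than drawing a random number at request time keeps a user's experience consistent across sessions, which matters when the split is at the user level. A smaller MDE or a higher power threshold both increase the required sample size.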
- Monitoring performance
  - Latency (P99 latency every X minutes)
  - Biases and misuses of your model
  - Performance drop
  - Data drift
  - CPU load
  - Memory usage
- Where to run inference: if we run the model on the user's phone or computer, it uses their memory and battery but latency is low. If instead we host the model on our own service, latency and privacy concerns increase, but the user's device is spared the memory and battery cost
- How often to retrain the model: some models need to be retrained every day, some every week, and others monthly or yearly. Always discuss the pros and cons of the retraining regime you choose

## References

- [https://github.com/ByteByteGoHq/ml-bytebytego](https://github.com/ByteByteGoHq/ml-bytebytego)
- [https://developers.google.com/machine-learning/recommendation](https://developers.google.com/machine-learning/recommendation)
- [https://research.facebook.com/blog/2018/5/the-facebook-field-guide-to-machine-learning-video-series/](https://research.facebook.com/blog/2018/5/the-facebook-field-guide-to-machine-learning-video-series/)
- [https://github.com/khangich/machine-learning-interview](https://github.com/khangich/machine-learning-interview)
- [Machine Learning Engineering by Andriy Burkov](https://www.amazon.com/Machine-Learning-Engineering-Andriy-Burkov/dp/1999579577)
- [https://github.com/shibuiwilliam/ml-system-in-actions](https://github.com/shibuiwilliam/ml-system-in-actions)
- [https://github.com/mercari/ml-system-design-pattern](https://github.com/mercari/ml-system-design-pattern)
- [https://github.com/chiphuyen/machine-learning-systems-design](https://github.com/chiphuyen/machine-learning-systems-design)
- [https://github.com/alirezadir/Machine-Learning-Interviews/blob/main/src/MLSD/ml-system-design.md](https://github.com/alirezadir/Machine-Learning-Interviews/blob/main/src/MLSD/ml-system-design.md)
- [https://github.com/ibragim-bad/machine-learning-design-primer](https://github.com/ibragim-bad/machine-learning-design-primer)
- [Grokking the Machine Learning Interview](https://www.educative.io/courses/grokking-the-machine-learning-interview)
- [https://about.instagram.com/blog/engineering/designing-a-constrained-exploration-system](https://about.instagram.com/blog/engineering/designing-a-constrained-exploration-system)
- [https://www.youtube.com/c/BitTiger](https://www.youtube.com/c/BitTiger)
- [A beginner's guide to ML systems, article by Fazzie on Zhihu](https://zhuanlan.zhihu.com/p/608318764)
- [Feedback and data backflow for models in production, article by 想飞的石头 on Zhihu](https://zhuanlan.zhihu.com/p/493080131)
- [https://www.1point3acres.com/bbs/thread-901192-1-1.html](https://www.1point3acres.com/bbs/thread-901192-1-1.html)
- [kuhung/machine-learning-systems-design](https://github.com/kuhung/machine-learning-systems-design)
- [Machine Learning Systems Design Cases & Tests]()
- [ML d
