2023年美赛获奖C类论文_2311717.pdf (2023 MCM/ICM award-winning Problem C paper), CSDN文库 resource page

Uploaded 2024-03-10 17:07:07 · 2.15 MB · PDF

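This description analyzes an award-winning Problem C paper from the 2023 MCM/ICM (Mathematical Contest in Modeling), titled "Riddle of Wordle: Mining the Secret of Number Scores & Solution Words". Wordle is one of the daily puzzles offered by The New York Times, and its simple rules and distinctive propagation properties have made it widely popular. The description covers two main parts of the paper: the models that predict the interval of the number of Twitter reports and the distribution of results, and the model that classifies solution words by difficulty.

### I. Prediction Models

#### 1.1 Twitter Report Number Prediction Model

In this part, the authors combine 3rd-order Gaussian regression with a non-homogeneous Poisson process to predict the trend in the number of Wordle reports on Twitter:

- **3rd-order Gaussian regression**: predicts the trend of the report numbers.
- **Non-homogeneous Poisson process**: predicts the random fluctuations of the report numbers around that trend.

A popularity relaxation function corrects the stochastic process so that it tracks Wordle's changing popularity more closely. After data preprocessing, the team built a model that predicts, at a 75% confidence level, that the number of reports on March 1, 2023 falls in the interval [7654, 20154].

#### 1.2 Solution Word Attribute Analysis

To understand what drives players to choose Hard Mode, the authors extracted 8 word attributes, including the number of letters and letter position. These attributes did not significantly affect the percentage of players choosing Hard Mode. The authors suggest that players' confidence in their own ability and their play mentality may be the main factors behind that choice.

### II. Solution Word Difficulty Classification Model

#### 2.1 Result Distribution Prediction

In the second part, the authors first extract the data features that may affect the distribution of results, such as solution word attributes and the Hard Mode percentage. From these features they build a BP neural network that predicts the future result distribution for a given solution word. To improve the generalization of the predictions, they also build a BP neural network ensemble based on Bagging.

#### 2.2 Prediction Example

Taking "EERIE" as an example, the model predicts the distribution of reported results on March 1, 2023 as (0, 1, ...). Although the full values are not given here, the model can evidently predict the score distribution of an individual solution word and can be adjusted and tuned to the situation at hand.

### III. Summary and Outlook

The models above show that the team explored Wordle in depth. By combining statistical methods with machine learning, the paper predicts both Wordle's popularity trend and player behavior, and it offers useful reference information for game design. Future work may focus on improving the accuracy of the prediction models and on exploring further factors that influence player choices. The paper illustrates the value of innovative methods for practical problems and is a useful reference for researchers interested in the mechanics behind Wordle and similar puzzles.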


Riddle of Wordle: Mining the Secret of Number Scores & Solution Words

Summary

Wordle is a popular puzzle currently offered daily by the New York Times. Its simple rules and clever propagation properties have contributed to its popularity. In this article, we build two prediction models, one for the interval of the number of Twitter reports and one for the distribution of reported results, and we develop a model for classifying the difficulty of solution words.

In TASK 1: After data preprocessing, we build a Wordle report number prediction model based on 3rd-order Gaussian regression and a non-homogeneous Poisson process from a statistical perspective. The Gaussian regression predicts the trend of the report numbers, while the non-homogeneous Poisson process predicts their stochastic fluctuations around that trend. Moreover, we use a popularity relaxation function to correct the stochastic process, which better approximates the change in popularity. At a confidence level of 75%, we predict the interval of the number of reports on March 1, 2023 to be [7654, 20154]. In addition, we extract 8 attributes of words, such as the number of letters and letter location, and find that these attributes have no effect on the percentage of players choosing Hard Mode. Players' confidence in their own ability and their play mentality may be the main reasons for choosing Hard Mode or not.
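
The two-stage idea in TASK 1 can be sketched as follows. This is a minimal illustration, not the paper's fitted model: it assumes "3rd-order Gaussian regression" means a sum of three Gaussian terms, it uses a normal approximation for the Poisson interval, and the coefficients in `TERMS` and the day index are hypothetical values chosen only for demonstration. The popularity relaxation function is not modeled.

```python
import math
from statistics import NormalDist

def gauss3(t, terms):
    """Trend: sum of Gaussian bumps a * exp(-((t - b) / c) ** 2)."""
    return sum(a * math.exp(-((t - b) / c) ** 2) for a, b, c in terms)

def poisson_interval(mean, level=0.75):
    """Central interval for a Poisson count, using the normal approximation
    (reasonable when the mean is large)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * math.sqrt(mean)
    return max(0.0, mean - half_width), mean + half_width

# Hypothetical coefficients (amplitude, centre, width), NOT fitted to the data.
TERMS = [(200000.0, 40.0, 30.0), (80000.0, 100.0, 60.0), (30000.0, 250.0, 200.0)]

day = 420.0  # hypothetical day index of the forecast date
rate = gauss3(day, TERMS)  # Poisson intensity taken from the trend
low, high = poisson_interval(rate, level=0.75)
print(f"rate={rate:.0f}, 75% interval=[{low:.0f}, {high:.0f}]")
```

Pure Poisson noise yields an interval that is narrow relative to the mean, so a much wider interval such as the one reported above would have to come from additional sources of variation that this sketch leaves out.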

In TASK 2: We first extract the data features that affect the distribution of reported results, including word attributes and the Hard Mode percentage. Then, we build a BP neural network to make preliminary predictions of the distribution of guessing results for a given future solution word. To improve the generalization performance of the predictions, we build an integrated BP neural network based on Bagging. We then predict the distribution of the reported results of EERIE on March 1, 2023 as (0, 1, 6, 25, 31, 25, 13) (in %). We have more than 80% confidence that the absolute error of the predicted percentage for each possible result does not exceed 5%.
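
Bagging itself is independent of the base learner, so its mechanics can be sketched without a neural network. In the illustration below a least-squares line stands in for the BP network, which is a deliberate simplification and not the paper's model. The function names and the toy data are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Base learner (stand-in for the BP network): least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample: fall back to a constant model
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def bagged_fit(xs, ys, n_models=25, seed=0):
    """Train one base learner per bootstrap resample of the training set."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagged_predict(models, x):
    """Aggregate by averaging the predictions of all base learners."""
    return sum(slope * x + intercept for slope, intercept in models) / len(models)

# Hypothetical toy data: one feature and one target percentage.
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
ys = [30.0, 29.0, 27.5, 27.0, 25.0, 24.5, 23.0, 22.0]
ensemble = bagged_fit(xs, ys)
print(round(bagged_predict(ensemble, 0.45), 2))
```

Averaging over resamples reduces the variance of an unstable base learner, which is the usual motivation for pairing Bagging with neural networks.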

In TASK 3: First, we build a word difficulty induction model based on K-Means from the distribution of users' reported data, and divide the difficulty into 4 classes. Then, we explore the association between word attributes and difficulty using Pearson's coefficients, and take the attributes with correlation coefficients greater than 0.6 as difficulty classification attributes to build a word difficulty classification model. We find that the frequency of the first and second letters of the solution word, the number of vowels contained in the pronunciation and the number of word properties are highly correlated with the difficulty class. Finally, EERIE is classified into the most difficult class.
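
The two steps of TASK 3 can be sketched as follows. This is a minimal illustration under stated assumptions: the distributions and the attribute values are hypothetical toy data, the toy example uses 2 clusters instead of the paper's 4, and the K-Means initialization (evenly spaced input points) is a simplification chosen to keep the result deterministic.

```python
import math

def kmeans(points, k, iters=50):
    """Plain K-Means with Euclidean distance and a deterministic init."""
    step = len(points) / k
    centers = [list(points[int(i * step)]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical result distributions (percent for 1..6 tries and failure).
dists = [(1, 8, 28, 34, 19, 8, 2), (1, 9, 27, 35, 18, 8, 2),
         (0, 2, 10, 24, 30, 24, 10), (0, 2, 9, 23, 31, 24, 11)]
labels, _ = kmeans(dists, k=2)

# Hypothetical word attribute; keep it if the correlation exceeds 0.6.
repeated_letters = [0, 0, 2, 1]
r = pearson(repeated_letters, labels)
print(labels, round(r, 3), r > 0.6)
```

The cluster labels serve as the induced difficulty classes, and only attributes that pass the correlation threshold would feed the downstream classifier.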

In TASK 4: While exploring the statistical properties of the number of reports, we find that the distribution of the number of reports shows a pattern similar to its trend over time. In addition, we notice that, across the 359 days of reported outcome distribution data, the percentage of games completed in 3 tries fluctuates the most.

Finally, we perform a sensitivity analysis of the model and investigate the effect of changes in its variable parameters on the results.

Keywords: Gaussian regression; Poisson process; BP neural network; K-Means

2023 MCM/ICM Summary Sheet
Problem Chosen: C
Team Control Number: 2311717


Team # 2311717 Page 2 of 25

Contents

1 Introduction
  1.1 Problem Background
  1.2 Restatement of the Problem
  1.3 Literature Review
  1.4 Our Work
2 Assumptions and Justifications
3 Notations
4 Data Pre-processing
5 Task 1: Report Number Prediction Model & Game Mode Selection Analysis
  5.1 Data Exploration
  5.2 Wordle Report Number Prediction Model
  5.3 Analysis of Game Mode Selection
6 Task 2: A Prediction Model for the Distribution of the Reported Results
  6.1 Building the BP Neural Network-based Prediction Model for the Distribution of Word-guessing Results
  6.2 Analysis of Uncertainties Affecting the Model
  6.3 Analysis of the Results of the Prediction Model
7 Task 3: Word Difficulty Classification Model
  7.1 The Establishment of Word Difficulty Classification
  7.2 Analysis of Word Difficulty Classification Results
8 Task 4: Other Interesting Features
9 Sensitivity Analysis
10 Model Evaluation and Further Discussion
  10.1 Strengths
  10.2 Weaknesses
  10.3 Further Discussion
11 Conclusion
References
Letter


1 Introduction

1.1 Problem Background

Homer is a baseball term and an informal American English word. Remarkably, Homer (home run) was searched over 79,000 times on the Cambridge Dictionary website and was searched 65,401 times on May 5. With that, Homer became the Cambridge Dictionary's 2022 Word of the Year. The reason traces back to Wordle, a word-guessing game that is very popular overseas and was all over social media in 2022. Wordle's answer that day was Homer, which was difficult for non-US users who were not familiar with the word.

Wordle is currently a popular daily puzzle offered by The New York Times and has grown in popularity, with more than 60 versions available. Players can choose between "regular mode" and "hard mode". Players attempt to solve the puzzle by guessing a five-letter word in six or fewer attempts, and each guess receives feedback through a change in the color of the tiles (green, yellow, gray). Note: each guess must be a real English word. Guesses that the game does not recognize as words are not allowed.

• A green tile indicates that the letter in that tile is in the word and in the correct location.
• A yellow tile indicates that the letter in that tile is in the word but in the wrong location.
• A gray tile indicates that the letter in that tile is not included in the word.

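The tile rules can be sketched as a small scoring function. This is an illustrative implementation, not code from the paper. The function name `score` and the output encoding (`G`, `Y`, `-`) are hypothetical. It also assumes the usual Wordle handling of repeated letters, which the rules above do not spell out: a letter is marked yellow only while unmatched copies of it remain in the answer.

```python
from collections import Counter

def score(guess, answer):
    """Return per-tile feedback: G = green, Y = yellow, - = gray."""
    result = ["-"] * len(guess)
    remaining = Counter()  # answer letters not matched by a green tile
    # First pass: mark greens and count the unmatched answer letters.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
        else:
            remaining[a] += 1
    # Second pass: mark yellows while unmatched copies are left.
    for i, g in enumerate(guess):
        if result[i] == "-" and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1
    return "".join(result)

print(score("crane", "eerie"))  # -Y--G
print(score("geese", "eerie"))  # -GY-G
```

The two-pass structure matters for answers with repeated letters such as EERIE: greens are consumed first, so a surplus copy of a letter in the guess is shown as gray rather than yellow.
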
1.2 Restatement of the Problem

Considering the background information and the results in the data file, we need to solve the following problems:

• Develop a model to account for changes in the number of reported outcomes and create a prediction interval for the number of reported outcomes on March 1, 2023. Analyze the extent to which attributes of words affect players' mode choices.
• Develop a model to predict the distribution of reported outcomes. Analyze the uncertainty factors that exist in the model and predictions.
• Develop a model to classify solution words by difficulty. Identify the attributes of the words associated with each classification.
• Describe other interesting features of the dataset.

1.3 Literature Review

In recent years, with the popularity of the Internet, social networks have gradually become the main medium for discussing what is happening in the real world. Users generate and disseminate rich data streams on social platforms (e.g. Twitter), which offer insight into current hot events. Popularity modeling and prediction have a wide range of applications in marketing, opinion monitoring, advertising and other scenarios, and time-series-based trend analysis has received much attention in data mining and social network analysis in recent years. This type of research mainly draws on financial and epidemiological models. Shen et al. [1] first established a Reinforced Poisson Process (RPP) model that predicts popularity dynamics using a non-homogeneous Poisson process and accounts for the "rich get richer" effect. Zhao et al. [2] developed the SEISMIC model based on the theory of self-exciting point processes, assuming that past popularity affects the future evolution of the process, and used a doubly stochastic process to describe the contagion of information. Wu et al. [3] proposed a Bayesian network-based popularity prediction model (EPAB) built on temporal, user and network structure characteristics, and introduced the concept of early patterns to relate early feature information to future popularity changes.

However, time series models require the data set to contain timing information, and data sets that do not meet this condition cannot be modeled. Meanwhile, sequential models and deep learning methods based on node behavior dynamics are not suitable for this task, where the forecast relies only on the reported data. On the one hand, the existing data set does not contain specific information such as who the reporters are or how many players there are at any given time, so a node model cannot be built from it. On the other hand, techniques such as deep learning are hard to interpret, cannot explain the trend of popularity change mathematically, and require more training data.

In this paper, we try to extract as much information as possible from the Data File. For the specific application scenario of Wordle, we not only produce an interval prediction for the number of future reports, but also analyze the distribution of report results and the classification of word difficulty.

1.4 Our Work

We put forward three models to mine the information in the reported result data. The structure of our paper is shown in Figure 1.

[Figure 1 is a flowchart. Panels: Data Preprocessing; Task 1: Report Number Prediction Model & Game Mode Selection Analysis; Task 2: The Distribution of the Reported Results Prediction Model; Task 3: Word Difficulty Classification Model.]

Figure 1: The structure of our paper
