Problem Chosen: C
Team Control Number: 2322645
2023 MCM/ICM Summary Sheet
With the rising popularity of Wordle, people have eagerly taken to Twitter
to report their results daily by the tens of thousands. Three very natural
questions arise regarding this data: (1) Can we use this data to predict the
difficulty of a given target word in Wordle? (2) Can we use this data to
predict future Wordle player reporting trends? (3) How does the difficulty
of a given target word affect player reporting and results? In our paper, we
develop a comprehensive Bayesian model consisting of three submodels which
predict the distribution of the number of guesses, the number of results
reported on Twitter, and the number of reporting players playing in hard mode.
Initially, we decompose words into quantifiable traits associated with relevant
difficulty characteristics. Most notably, we formulate a novel Wordle-
specific entropy measure we call Subset Entropy which effectively quantifies
the average amount of information revealed by typical players after initial
guesses. We also develop a method to represent the distribution of player
attempts, and hence the observed difficulty of a word, using just two values
α, β corresponding to the cumulative distribution function of the Beta distribution.
We use a preliminary Lasso regression to isolate the most relevant predictors
of word difficulty, which we then use in our Bayesian model.
Our Bayesian model predicts, for a given date and word, the reported
difficulty of the word, the number of player reports, and the number of players
reporting playing in hard mode. To accomplish these three tasks, it is made
up of three submodels which are conditionally independent given the data,
making it efficient to sample from its posterior using Markov Chain Monte-
Carlo (MCMC).
We find that a higher number of unique letters, higher usage frequency in
English, a higher average number of revealed yellow squares over all guesses,
and higher Subset Entropy all make a word easier for players to guess. We
also find that higher word difficulty decreases the number of player reports.
Under the assumption that the Times chooses words randomly, this can be
interpreted as a causal effect.
Our model is able to predict outcomes for new data and retrodict for old
data. It gives a 95% prediction interval that between 20,238 and 27,876
players will report results for “eerie” on March 1, 2023, and that
it will be in the 50th percentile of difficulty. Most notably, our model does
not just provide such simple point estimates and prediction intervals, but
full posterior distributions.
Keywords: Entropy, Lasso regression, MCMC, Bayesian methods,
Causal inference
How many Wordle words will Wordle guessers guess if
Wordle’s Wednesday Wordle word is “Eerie”?
Contents

1 Introduction
2 Data
   2.1 Data cleaning
   2.2 Wordbank
3 Word & Difficulty Representation
   3.1 Vowels
   3.2 Usage
   3.3 Green & Yellow Tiles
   3.4 Unique Letters
   3.5 Entropies
   3.6 Representation of word difficulty
4 Modeling Methodology
   4.1 Lasso regression
   4.2 Bayesian models for prediction
5 Model Results
   5.1 Interpretation of parameter posteriors
   5.2 Retrodiction
   5.3 Prediction
   5.4 Difficulty representation
6 Model Evaluation
   6.1 Limitations
   6.2 Strengths
7 Conclusion
1 Introduction
Wordle is a language-based game currently owned by the New York Times that became a viral
sensation in early 2022. The goal of the game is simple to understand: At the start of each
day there is a 5-letter target word that players have to guess. Players have six tries to do so,
and attempt to get the word in as few attempts as possible, each time using a valid English
word.
There are 11,881,376 possible 5-letter words if we count every possible sequence of five letters.
Even restricting to words found in English dictionaries, or to those in common usage today, would
only drop the count to around 12,000 and 4,000 words respectively [2]. For a person to randomly
guess a target word in six tries would statistically be almost impossible. This, however, is where
tile color feedback comes into play. For a given guess word, for each letter Wordle returns a
green tile if the corresponding letter is in the target word and in the right location, a yellow
tile if the corresponding letter is in the target word but in the wrong location, and a gray tile if
neither of these is true. With this information, most players are able to guess the word from
thousands of possibilities within six tries.
The game has captured the attention of millions, with people taking to social media to share
their guess results and comment on the difficulty for certain words. One Twitter account that
has popped up as a result of this trend is “@WordleStats”, a bot that tallies all posted Wordle
attempts and the distribution of attempts each day. Via this data, we can discover a wealth
of information about Wordle player behavior. In particular, in this paper we develop a model
which utilizes both the trends in Twitter reporting and the resulting inferred difficulty of target
words gleaned from this data to predict future Wordle statistics.
2 Data
2.1 Data cleaning
Data errors We fixed several errors that appear in the provided data by referencing the
Twitter posts of the @WordleStats Twitter bot. These are logged below for full transparency.
• Day 239: hardmode 3249 → 9249
• Day 314: tash → trash
• Day 500: hardmode 3667 → 2667
• Day 525: clen → clean
• Day 529: Reported players 2569 → 25569
• Day 540: naïve → naive (ï is not a letter in Wordle)
• Day 545: rprobe → probe
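As a rough sketch, these corrections could be applied programmatically as below. The file name and the column names ("Contest number", "Word", "Number of reported results", "Number in hard mode") are our assumptions about the layout of the provided data file, not details from the paper.

```python
import pandas as pd

# Load the provided Wordle data (file name assumed for illustration).
df = pd.read_excel("Problem_C_Data_Wordle.xlsx")

# Corrections listed above, keyed by contest day; column names are assumed.
corrections = {
    239: {"Number in hard mode": 9249},
    314: {"Word": "trash"},
    500: {"Number in hard mode": 2667},
    525: {"Word": "clean"},
    529: {"Number of reported results": 25569},
    540: {"Word": "naive"},
    545: {"Word": "probe"},
}

for day, fixes in corrections.items():
    for column, value in fixes.items():
        df.loc[df["Contest number"] == day, column] = value
```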
Percentages to counts Because the percentages of reports in the different categories are
rounded and do not necessarily sum to 100, we divided the percentages in each row by their sum
to obtain proportions. As our Bayesian model predicts the number of players in each category,
we converted these proportions into counts by applying the following method for each row:
![](https://csdnimg.cn/release/download_crawler_static/89273341/bg4.jpg)
Team # 2322645 Page 4 of 22
1. Multiply the proportions by the number of reports on that day to obtain “counts” with
decimal values
2. Round the counts down
3. Add 1 back to the counts, in order from the count which was rounded down most to that
which was rounded down least, until the total again matches the number of reports on
that day.
This method gives counts which correspond to the given percentages, are integers, and whose
sum is the number of reports on that day.
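A minimal Python sketch of this rounding scheme (the function name and the example percentages are our own illustration):

```python
import numpy as np

def percentages_to_counts(percentages, total_reports):
    """Convert rounded percentages into integer counts summing to total_reports."""
    percentages = np.asarray(percentages, dtype=float)
    proportions = percentages / percentages.sum()   # renormalize so the row sums to 1
    raw = proportions * total_reports               # real-valued "counts"
    counts = np.floor(raw).astype(int)              # step 2: round down
    deficit = total_reports - counts.sum()          # units lost to flooring
    # step 3: add 1 back to the counts that were rounded down the most
    order = np.argsort(raw - counts)[::-1]
    counts[order[:deficit]] += 1
    return counts

# Example: percentages for 1-6 guesses and X on a day with 25,000 reports.
print(percentages_to_counts([1, 5, 23, 33, 24, 11, 3], 25000))
```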
2.2 Wordbank
To model 5-letter words and their properties, we rely on the Stanford GraphBase (SGB)
wordbank of 5757 5-letter words created by Donald Knuth [6], which provides a good approximation
of the set of words that a player could guess and expect as a target. This word bank is then used
in a few different ways. First off, it is used to build the Order Frequency table and the Letter
Frequency table. The Letter Frequency table tells us how often each letter appears in 5-letter
words, and follows what we would expect: S and E are the most common letters, followed by A
and O. The Order Frequency table then shows, given that a certain letter is in a word, what
proportion of the time that letter appears in each position (e.g. given that A is in a word, it is
the 4th letter 24% of the time).
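A minimal sketch of how the two tables can be built from the SGB wordbank; the file name sgb-words.txt and the variable names are assumptions for illustration:

```python
from collections import Counter, defaultdict

# Load the SGB wordbank: one 5-letter word per line (file name assumed).
with open("sgb-words.txt") as f:
    wordbank = [line.strip().lower() for line in f if line.strip()]

# Letter Frequency table: how often each letter appears across all 5-letter words.
letter_freq = Counter(ch for word in wordbank for ch in word)

# Order Frequency table: given that a letter occurs in a word, the proportion of
# occurrences falling in each of the five positions.
position_counts = defaultdict(lambda: [0] * 5)
for word in wordbank:
    for pos, ch in enumerate(word):
        position_counts[ch][pos] += 1

order_freq = {ch: [c / sum(counts) for c in counts]
              for ch, counts in position_counts.items()}

print(letter_freq.most_common(4))  # S and E should lead, followed by A and O
print(order_freq["a"])             # share of occurrences of 'a' in each position
```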
As will be detailed later on, we also use this wordbank in numerous word-specific calculations,
namely computing for a given word t the average number of green, yellow and colored tiles
that are returned on any guess chosen uniformly at random from the SGB wordbank when t is
the actual target word. Additionally, for any word g we compute the average number of colored
tiles returned when a target word t is chosen uniformly at random from the SGB wordbank and g
is guessed, which gives a somewhat naive but reasonable metric to evaluate guess words players
would use. Using this guess word metric, we compile a list of the 30 words with the highest
corresponding average, which we take to be a set of common guess words (see Figure 1). This
list will be used in the calculation of Subset Entropy later on.
3 Word & Difficulty Representation
Many factors contribute towards the difficulty associated with a given word. For example,
“zingy” intuitively seems to be difficult for a variety of reasons: it has uncommon letters
(“z” and “y”), only one “canonical” vowel, and is a generally infrequently used word in English.
A word like “onion”, on the other hand, would also seem to be difficult, despite all of its letters
being fairly common and its usage in everyday spoken language being much higher. The reason
it is perceived as difficult is the repetition of the letters “o” and “n”, which players may
be less likely to guess again once they have already found one position for them. Thus, given a
word, our first task was to list and quantify such characteristics, so that when evaluating word
difficulty later on we could instead simply consider the vector of values corresponding to these
relevant characteristics.
Figure 1: Left: 30 Most common words based on overlap. Right: Frequency of Letters in the SGB wordbank
3.1 Vowels
In all letter-guessing games (beyond Wordle, think Hangman), a typical strategy is to
exploit the high frequency of vowels in words. In Wordle, the presence of a
particular vowel should tend to be discovered faster than that of a non-vowel, leading to a
reasonable assumption that words with more vowels will on average be easier to guess. Thus,
one characteristic computed and considered for each word was the number of vowels it contained
(excluding “y”, as this is a traditionally uncommon letter and hence does not align with the
reasoning given above for why we consider vowels in the first place).
3.2 Usage
It makes sense that words which are used more commonly will be more familiar to people, and
hence easier to guess from given clues. Using the wordfreq library [5] in Python, this value was
easily returned for each word under consideration.
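A minimal sketch computing this usage value, together with the vowel count from Section 3.1; word_frequency is the lookup provided by the wordfreq library, while the helper names are our own:

```python
from wordfreq import word_frequency  # pip install wordfreq

def vowel_count(word: str) -> int:
    """Number of vowels in the word, excluding 'y' (Section 3.1)."""
    return sum(ch in "aeiou" for ch in word)

def usage(word: str) -> float:
    """Usage frequency of the word in English text (Section 3.2)."""
    return word_frequency(word, "en")

for w in ["zingy", "onion", "eerie"]:
    print(w, vowel_count(w), usage(w))
```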
3.3 Green & Yellow Tiles
Another desirable feature of a target word is that there is a high likelihood that after any given
guess Wordle will return a large number of green and yellow squares. Once this occurs, players
get very direct hints that they can immediately put into action. Thus, for any given word, we
computed the average number of green, yellow, and colored (i.e. non-gray) tiles returned
considering the given word as the target and assuming guesses were drawn uniformly at random
from the SGB wordbank.
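A sketch of this computation. The wordle_feedback helper and the toy wordbank are our own illustration; the paper's actual computation averages over the full SGB wordbank.

```python
from statistics import mean

def wordle_feedback(guess: str, target: str):
    """Tile colors for a guess: greens are assigned first, then yellows consume
    the remaining unmatched target letters, mirroring Wordle's handling of
    repeated letters."""
    tiles = ["gray"] * 5
    unmatched = []
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            tiles[i] = "green"
        else:
            unmatched.append(t)
    for i, g in enumerate(guess):
        if tiles[i] != "green" and g in unmatched:
            tiles[i] = "yellow"
            unmatched.remove(g)
    return tiles

def average_tile_counts(target: str, wordbank):
    """Average numbers of green, yellow, and colored (non-gray) tiles when
    guesses are drawn uniformly from the wordbank and `target` is the answer."""
    greens, yellows = [], []
    for guess in wordbank:
        tiles = wordle_feedback(guess, target)
        greens.append(tiles.count("green"))
        yellows.append(tiles.count("yellow"))
    g, y = mean(greens), mean(yellows)
    return g, y, g + y

# Toy example; swapping the roles of guess and target gives the guess-word
# metric used in Section 2.2 to pick the 30 common guess words.
toy_bank = ["eerie", "onion", "zingy", "stare", "crane", "slate"]
print(average_tile_counts("eerie", toy_bank))
```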