
There are many flavors of beam search - I have found that adding explicit checks and constraints (for example, rhyme constraints or required pivot words) makes the resulting proposals generally a lot better. Even with simple Markov chains I see pretty diverse behavior depending on the beam search style.
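As a rough toy sketch of what I mean (the transition table, the banned word, and the pivot word are all made up for illustration, not from any real model):

    # Toy beam search over a small Markov chain with explicit constraint checks:
    # candidates containing a banned word are pruned during expansion, and only
    # finished candidates containing the pivot word are returned.
    import math

    TRANSITIONS = {
        "the": {"cat": 0.5, "dog": 0.3, "moon": 0.2},
        "cat": {"sat": 0.6, "ran": 0.4},
        "dog": {"sat": 0.5, "ran": 0.5},
        "moon": {"sat": 0.7, "ran": 0.3},
        "sat": {"down": 1.0},
        "ran": {"home": 1.0},
        "down": {}, "home": {},
    }

    def violates(seq):
        return "dog" in seq            # toy hard constraint: never emit "dog"

    def has_pivot(seq):
        return "moon" in seq           # toy pivot-word requirement on outputs

    def constrained_beam_search(start, beam_width=3, max_len=4):
        beams = [([start], 0.0)]       # (token sequence, log probability)
        for _ in range(max_len):
            candidates = []
            for seq, score in beams:
                nexts = TRANSITIONS.get(seq[-1], {})
                if not nexts:
                    candidates.append((seq, score))
                    continue
                for tok, prob in nexts.items():
                    new_seq = seq + [tok]
                    if violates(new_seq):          # explicit check: prune now
                        continue
                    candidates.append((new_seq, score + math.log(prob)))
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        return [b for b in beams if has_pivot(b[0])]

    print(constrained_beam_search("the"))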

Some of the better ones I have used were variants of diverse beam search and stochastic beam search, usually combined together. The "classic" / pure variant has generally not been as useful in generative modeling for me; it tends to collapse to basically one or two effective candidates (with maybe some filler words changed) fairly quickly.
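For the stochastic part, one cheap way to do it (a sketch, not any particular paper's formulation) is to replace the usual sort-and-truncate step with sampling candidates without replacement in proportion to exp(score / temperature); diverse beam search additionally penalizes candidates that look too similar to beams already kept, and can be bolted on the same way.

    # Drop-in replacement for the "keep the top-k" step of a beam search:
    # sample beam_width candidates without replacement, weighted by their
    # temperature-scaled scores, instead of always keeping the best ones.
    import math, random

    def stochastic_topk(candidates, beam_width, temperature=1.0):
        """candidates: list of (sequence, log_prob) pairs."""
        pool = list(candidates)
        chosen = []
        while pool and len(chosen) < beam_width:
            best = max(score for _, score in pool)      # numerical stability
            weights = [math.exp((score - best) / temperature)
                       for _, score in pool]
            idx = random.choices(range(len(pool)), weights=weights, k=1)[0]
            chosen.append(pool.pop(idx))
        return chosen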

Also, it seems to generally work better for me in conditional generation than in unconditional generation (e.g. charRNN / some uses of GPT-2). However, things like the "repetition problem" can be removed by construction if you are willing to hack on the beam search just a little bit. See https://badsamples.tumblr.com/post/160777871547/stochastic-s... (stochastic, diverse beam search with a Markov model, IIRC) vs https://badsamples.tumblr.com/post/160767248407/a-markov-arg... (fixed beam search, where I didn't try to remove repetition or do anything special, same Markov setup).
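The repetition hack is basically just another prune rule. Something along these lines (not exactly what went into those posts, and the choice of n is arbitrary here):

    # Prune rule that removes the "repetition problem" by construction:
    # a candidate is discarded as soon as its newest n-gram duplicates one
    # that appeared earlier in the same candidate.
    def repeats_ngram(seq, n=3):
        if len(seq) <= n:
            return False
        newest = tuple(seq[-n:])
        earlier = {tuple(seq[i:i + n]) for i in range(len(seq) - n)}
        return newest in earlier

    # inside the beam expansion loop:
    #     if repeats_ngram(new_seq):
    #         continue    # never even score the repetitive candidate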

Sometimes I also manipulate the probabilities directly with masks and similar tricks, and that also combines fine with beam search in the experiments I have done.
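By "masks" I mean something as simple as zeroing out disallowed tokens in the next-step distribution and renormalizing before the expansion / sampling step (toy vocabulary and numbers here):

    # Zero out disallowed tokens in a next-token distribution and renormalize.
    import numpy as np

    def apply_mask(probs, disallowed, vocab):
        masked = probs.copy()
        for i, tok in enumerate(vocab):
            if tok in disallowed:
                masked[i] = 0.0
        total = masked.sum()
        return masked / total if total > 0 else masked

    vocab = ["the", "cat", "dog", "moon"]
    probs = np.array([0.4, 0.3, 0.2, 0.1])
    print(apply_mask(probs, {"dog"}, vocab))   # [0.5, 0.375, 0.0, 0.125]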

Nucleus sampling works well, and if you don't want to control or constrain the output in unconditional generation, I don't know that beam search really does much. But for conditional generation, or for post-hoc hacks to get more control over a generator, I find beam search variants really useful, especially combined with a specifically conditional architecture.
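(For reference, nucleus / top-p sampling just keeps the smallest set of most-likely tokens whose mass reaches p, renormalizes, and samples from that set; a toy sketch with made-up numbers:)

    # Nucleus (top-p) sampling over a toy next-token distribution.
    import numpy as np

    def nucleus_sample(probs, p=0.9, rng=None):
        rng = rng or np.random.default_rng()
        order = np.argsort(probs)[::-1]              # most likely first
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, p) + 1  # smallest prefix with mass >= p
        keep = order[:cutoff]
        return rng.choice(keep, p=probs[keep] / probs[keep].sum())

    probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
    print(nucleus_sample(probs, p=0.8))              # one of the indices 0, 1, 2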

For example: condition the language model on a particular bag of rhyme words, then use (probably stochastic) beam search to force rhyme pairs at the start and end of lines, probably further modified by input and output masks to "blank out" invalid tokens and to tell the model which tokens will be blanked out. I've used some blend of these tricks in speech, music, and text experiments, and it can be helpful when you have structure that is important to replicate and a simple model with simple sampling just isn't reproducing that structure.
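A concrete version of the line-end half of that rhyme constraint (the rhyme classes are a made-up toy; in practice you would pull them from a pronunciation dictionary) is just another check plugged into one of the beam searches above, applied to completed lines:

    # Check an AABB rhyme scheme on completed lines: every line must end on a
    # known rhyme word, and consecutive pairs of lines must share a rhyme class.
    RHYME_CLASS = {"moon": "A", "june": "A", "spoon": "A",
                   "cat": "B", "hat": "B", "flat": "B"}

    def rhyme_scheme_ok(lines):
        """lines: list of token lists, one per completed line."""
        endings = [line[-1] for line in lines if line]
        if any(tok not in RHYME_CLASS for tok in endings):
            return False
        return all(RHYME_CLASS[a] == RHYME_CLASS[b]
                   for a, b in zip(endings[0::2], endings[1::2]))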

EDIT: One practical reason to do this would be plagiarism detection, especially when fine-tuning on a small corpus. There are ways to get guarantees by construction (https://www.researchgate.net/profile/Pierre_Roy2/publication...), but simple setups using beam searches and tries can also do constraint checks for n-grams of certain lengths. Concretely, set up tries for 1-grams through (n-1)-grams, which are considered "valid" transitions, plus a "bad" trie of full n-grams. Check these tries during generation and throw out any candidate whose latest n-gram appears in the "bad" trie, while still requiring it to match in the "valid" ones.
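A minimal version of that check, with plain sets of n-gram tuples standing in for the tries (corpus and n invented for illustration):

    # Keep candidates whose recent context matches corpus transitions (the
    # "valid" 1..(n-1)-gram sets) but never reproduces a full corpus n-gram.
    def build_ngram_sets(corpus_tokens, n):
        good, bad = set(), set()
        for k in range(1, n):
            good.update(tuple(corpus_tokens[i:i + k])
                        for i in range(len(corpus_tokens) - k + 1))
        bad.update(tuple(corpus_tokens[i:i + n])
                   for i in range(len(corpus_tokens) - n + 1))
        return good, bad

    def candidate_ok(seq, good, bad, n):
        if len(seq) >= n and tuple(seq[-n:]) in bad:
            return False                              # verbatim corpus n-gram
        if len(seq) >= n - 1:
            return tuple(seq[-(n - 1):]) in good      # still a valid transition
        return True

    corpus = "the cat sat on the mat and the dog lay on the rug".split()
    good, bad = build_ngram_sets(corpus, n=4)
    print(candidate_ok("the cat sat on the mat".split(), good, bad, n=4))  # False
    print(candidate_ok("the cat sat on the rug".split(), good, bad, n=4))  # True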

See the line of Max Order work from the Sony CSL lab (formerly run by Francois Pachet) for some examples of this.



