### Participants

A total of 40 people participated in this experiment: 20 undergraduate and graduate students (aged 21–32; 10 women and 10 men) from Yonsei University (Seoul, South Korea) and 20 undergraduate students (aged 18–27; 13 women and seven men) from Pusan National University (Busan, South Korea). We recruited more than twice as many participants as in previous studies using similar stimuli and tasks^{22,24}. Furthermore, we analyzed our data using Bayesian methods, which are less influenced by the number of participants^{26}. All participants had normal color vision and normal or corrected-to-normal visual acuity. The experimental protocol was approved by the Institutional Review Boards at Yonsei and Pusan National Universities, and written informed consent was obtained according to their procedures. All methods were performed in accordance with the relevant guidelines and regulations of the Institutional Review Boards.

### Apparatus and stimuli

All stimuli were generated using the Psychophysics Toolbox Version 3 and its extensions (Brainard, 1997^{27}; Pelli, 1997^{28}) for MATLAB (MathWorks Inc., Natick, MA, USA). At Yonsei University, stimuli were displayed on a CRT monitor at a resolution of 1600 × 1200 with a refresh rate of 85 Hz. At Pusan National University, the same stimuli were displayed on a 27-in. full HD widescreen color monitor at a resolution of 1920 × 1080 with a refresh rate of 120 Hz. At both locations, participants were seated approximately 60 cm from the monitor with their heads on a head-chin rest inside a darkened room.

In each trial, we presented a set of faces, each of which expressed a specific emotion. Faces were morphed via FantaMorph software (Abrosoft), using face stimuli from the Yonsei Face Database^{29}. The database consists of 344 photographs of 17 amateur Korean actors (nine women and eight men) displaying six basic emotions (anger, disgust, fear, happiness, sadness, and surprise) and a neutral facial expression.

The emotional intensity of each photograph in the original study^{29} was evaluated by a minimum of 177 Yonsei University students (177–212) for verification purposes. The evaluators categorized the emotion of each photograph with one of the seven facial labels and assessed its emotional intensity on a linear scale from 0 (very weak) to 10 (very strong). In the current study, we selected eight actors (three women and five men) and used each actor’s happy and angry faces for morphing. We only used the faces of individuals whose mouths were closed for the following reasons. Open-mouthed faces are generally perceived as more intense and extreme, resulting in higher valence and arousal levels^{30}. They also exhibit a greater attentional advantage, especially when expressing anger and happiness^{31,32,33,34}. Using faces with closed mouths only also allowed us to rule out the possibility that any difference between crowds with masked and unmasked people is explained solely by the difference in the perceptual saliency of the mouth area of the face.

Our selection criteria for the actors were as follows: (1) the face stimuli had similar intensity ratings between happy and angry; (2) the most frequent emotion judged by the observers (> 70%) was consistent with the emotional category the actor in the image intended to express (e.g., the most frequent “happy” response to a “happy” expression); (3) none of the other emotional labels were rated as high as happy (e.g., to prevent confusion with surprise or disgust) or angry (e.g., to prevent confusion with fear or sadness). Supplementary Table 1 details the observers’ emotional judgments for the eight face stimuli with both happy and angry emotions, including the emotion judgment frequency and rated emotional intensity (mean and standard deviation).

By morphing the happy and angry face images with the same identity, we created 51 facial emotion morph levels ranging from − 25 (the happiest) to + 25 (the angriest). The morphed face images were linearly interpolated (in 2% increments) between the original happy and angry emotions. The same process was applied to the eight selected identities. The emotion level of 0 corresponded to a neutral face, which was a morph comprising 50% happy and 50% angry faces (Fig. 1a). Different faces were separated by emotional intensity units such that face 1 was one emotional unit happier than face 2. Therefore, the larger the separation between any two morphed faces in emotional units, the easier it was to discriminate them based on their emotional intensity.
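The mapping between morph level and blend proportion can be illustrated with a short Python sketch. This is not the code used to create the stimuli (the morphing itself was done in FantaMorph); the function name `morph_percentages` is hypothetical, and the sketch assumes only that each unit corresponds to a 2% step along the happy-to-angry continuum.

```python
def morph_percentages(level: int) -> tuple[int, int]:
    """Return (happy %, angry %) for a morph level between -25 and +25.

    Assumes a linear continuum in 2% steps: -25 is the original happy
    face, +25 is the original angry face, and 0 is the 50/50 morph.
    """
    if not -25 <= level <= 25:
        raise ValueError("morph level must lie between -25 and +25")
    angry_pct = (level + 25) * 2      # 0% at level -25, 100% at level +25
    happy_pct = 100 - angry_pct
    return happy_pct, angry_pct


# All 51 levels for one identity, from happiest to angriest.
levels = list(range(-25, 26))
```

Under these assumptions, level 0 yields a 50% happy and 50% angry blend, matching the neutral face described above.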

Individual face images (2.27° × 3.37° visual angle in size at Yonsei University and 2.82° × 4.19° visual angle in size at Pusan National University) of a crowd were randomly positioned in an invisible frame (10° × 10° visual angle at the center of a gray background at Yonsei University and 12.4° × 12.4° visual angle at Pusan National University).
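For readers who wish to relate these visual angles to physical sizes on the screen, the standard conversion is sketched below. The helper names are hypothetical, and the default viewing distance of 60 cm is the approximate value reported above, so the resulting sizes are approximate as well.

```python
import math


def visual_angle_to_cm(angle_deg: float, distance_cm: float = 60.0) -> float:
    """Physical extent (cm) that subtends angle_deg at the given distance."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)


def cm_to_visual_angle(size_cm: float, distance_cm: float = 60.0) -> float:
    """Visual angle (degrees) subtended by an extent of size_cm."""
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))


# Approximate side length of the 10 degree invisible frame at 60 cm.
frame_cm = visual_angle_to_cm(10.0)
```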

### Design and procedure

Our experiment used a 3 (set size: one, four, and eight) × 4 (emotion: very happy, somewhat happy, somewhat angry, and very angry) × 3 (mask condition of face sets: neither, either, and both) within-subject design, with 28 repetitions per condition. Therefore, there were 1008 trials in total (3 × 4 × 3 × 28), and the sequence of the trials was randomized for each participant. Participants took a short break after every 200 trials.
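The factorial structure and the resulting trial count can be checked with a minimal Python sketch. It is an illustration rather than the experiment code (which was written in MATLAB); the function name `build_trial_list`, the use of a per-participant seed, and the coding of emotions by their mean emotion units are assumptions.

```python
import itertools
import random
from collections import Counter

SET_SIZES = (1, 4, 8)
EMOTION_MEANS = (-16, -8, 8, 16)   # very happy ... very angry
MASK_CONDITIONS = ("neither", "either", "both")
REPETITIONS = 28


def build_trial_list(seed: int) -> list[tuple[int, int, str]]:
    """Return a shuffled list with every design cell repeated 28 times."""
    cells = list(itertools.product(SET_SIZES, EMOTION_MEANS, MASK_CONDITIONS))
    trials = cells * REPETITIONS
    random.Random(seed).shuffle(trials)   # hypothetical per-participant seed
    return trials


trials = build_trial_list(seed=1)
cell_counts = Counter(trials)
```

Here `len(trials)` equals 3 × 4 × 3 × 28 = 1008, and each of the 36 cells appears exactly 28 times regardless of the seed.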

In each trial, two sets of faces were presented sequentially for 500 ms each (see Fig. 1b for a sample trial). Each set contained one, four, or eight different identity faces according to the set size condition. A black fixation cross on a blank screen was presented for 500 ms between the two displays of facial stimuli. One of the two displays always contained a neutral-emotion group with an average value of 0 emotional units as a control set. The probe set either had a mean of − 16 (very happy), − 8 (somewhat happy), + 8 (somewhat angry), or + 16 (very angry) emotion units. In other words, participants consistently compared a group of faces exhibiting one of the four average emotion intensities with a group of faces displaying a neutral emotion (where the average value was 0). Whether the control or probe set was presented first was randomly determined in every trial.

The minimum and maximum emotion intensities within the same emotion condition were the same across all set sizes, spanning a 10-unit range, except for the set size 1 condition where only the average value was chosen. Therefore, the happy conditions (including both somewhat happy and very happy conditions) contained only happy individual faces with varying intensities, while the angry conditions contained only angry individual faces. We also ensured that individual faces had emotion intensities that were distinct, uniformly distributed within the 10-unit range, and symmetrical around the average emotion. For example, if the emotion units used for the very angry condition in set size 4 were [11, 15, 17, 21], the corresponding emotion units for the same condition in set size 8 would be [11, 12, 14, 15, 17, 18, 20, 21], with varying middle values across trials. These manipulations were implemented to minimize the possibility that one or two faces with intense emotions would evoke faster responses due to saliency. For the neutral-emotion stimuli, the extreme values in the range were identical across trials: − 5 (happy) and 5 (angry). While maintaining the mean value of 0, the middle units varied across trials.
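One way to generate sets that satisfy these constraints is sketched below in Python. This is an assumption-laden illustration, not the sampling routine used in the study: the function name `make_emotion_set` is hypothetical, and the sketch assumes that the middle values are drawn as integer pairs placed symmetrically at ± 1 to ± 4 units around the mean, which is consistent with the examples given above but is not stated explicitly.

```python
import random


def make_emotion_set(mean: int, set_size: int, rng: random.Random) -> list[int]:
    """Return distinct emotion units that are symmetric around `mean`.

    The extremes are fixed at mean - 5 and mean + 5 (a 10-unit range);
    the remaining faces are placed in symmetric pairs at randomly chosen
    offsets of 1 to 4 units (an assumption of this sketch).
    """
    if set_size == 1:
        return [mean]                      # only the average value is shown
    if set_size % 2 != 0 or not 2 <= set_size <= 10:
        raise ValueError("set size must be 1 or an even number up to 10")
    n_inner_pairs = set_size // 2 - 1
    inner_offsets = rng.sample(range(1, 5), n_inner_pairs)
    offsets = [5] + inner_offsets
    units = [mean - k for k in offsets] + [mean + k for k in offsets]
    return sorted(units)


rng = random.Random(0)
very_angry_4 = make_emotion_set(16, 4, rng)   # e.g., extremes 11 and 21
neutral_8 = make_emotion_set(0, 8, rng)       # extremes -5 and 5
```

Because every offset is used once with each sign, the mean of the set equals the target mean by construction, and the extremes are identical across set sizes.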

Importantly, there were three conditions, depending on whether the faces in a set (the probe or control set) wore masks. In the *both* condition, the faces wore masks in both the probe and control sets; in the *either* condition, only one set contained faces wearing masks, while the other set contained faces without masks; and in the *neither* condition, none of the sets contained faces wearing masks. Figure 1b presents a sample trial with the set size of four, when the control set (neutral; 0) was compared with the very happy crowd (− 16) and when only the faces in the control set were wearing masks (i.e., the *either* condition). In this figure, the control set containing individual faces with emotional units of [− 5, − 4, 4, 5] is presented first, followed by a blank screen and the probe set (average = −16) containing individual faces with emotional units of [− 21, − 19, − 13, − 11].

Participants were instructed to fixate on the center of the screen to view the two successive sets of multiple faces and make a keypress as accurately and quickly as possible to indicate which of the two facial crowds they would rather avoid. We explicitly informed participants that the correct answer was the facial crowd showing a more negative emotion on average. Feedback was provided only during the 20 practice trials and then removed for the main experiment. Participants’ responses to the 20 practice trials were not included in the data analysis.

### Statistical analysis

We used the brms package^{35} for the Bayesian mixed model analysis and the bayestestR package^{36} to compare models. We fit separate models to predict the proportion of “more negative” responses and the RTs from the set size, emotion, and mask conditions. The set size and the emotion predictors were treated as continuous predictors, while the facial mask predictor was treated as a categorical predictor. The response was binary: 1 indicated a “more negative” response, and 2 indicated a “less negative” response for a probe set (a more expressive crowd) compared to a control set (a neutral crowd on average). Therefore, logistic regression models were fit to predict the proportion of “more negative” responses using the Bernoulli distribution with a logit link function. The RTs were fit to the shifted lognormal distribution with an identity link function as recommended in the literature^{37}.

For each dependent variable, the full model consisted of the three predictors and a participant-level random intercept. The null model contained only the random intercept of participants. The null model was compared with other models comprising all possible combinations of the main effects or interaction terms. The model comparison results were reported as Bayes factors (BF_{10}), which indicate the extent to which the data are better explained by one model than another^{38}. For example, a BF_{10} of 3 indicates that the observed data are three times more probable under the alternative model, *H*_{1}, than the null model, *H*_{0}. As established in the previous literature, we define 1 < BF_{10} < 3 as anecdotal evidence, 3 < BF_{10} < 10 as moderate evidence, 10 < BF_{10} < 30 as strong evidence, and BF_{10} > 30 as very strong evidence of an effect^{26,39}.
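The evidence categories above can be expressed as a small lookup, sketched here in Python for illustration. The function name `evidence_label` is hypothetical. The thresholds in the text are strict inequalities, so the handling of values that fall exactly on a boundary (3, 10, or 30) and the label for BF_{10} ≤ 1 are assumptions of this sketch.

```python
def evidence_label(bf10: float) -> str:
    """Map a Bayes factor BF10 to the verbal category used in the text."""
    if bf10 <= 0:
        raise ValueError("a Bayes factor must be positive")
    if bf10 > 30:
        return "very strong"
    if bf10 > 10:
        return "strong"
    if bf10 > 3:
        return "moderate"
    if bf10 > 1:
        return "anecdotal"
    # Assumption: values at or below 1 do not favor H1 over H0.
    return "no evidence for H1"
```

For instance, `evidence_label(5.0)` returns `"moderate"`. Values exactly at a threshold fall into the lower category in this sketch.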

Our analysis was conducted in two steps. In the first step, we searched for the best model by comparing models with the null model. In the second step, we analyzed the specific effect of each predictor by calculating the inclusion BFs (BF_{incl}) across matched models, representing evidence for all models *with* a term of interest against all models *without* the term. Since we considered matched models only, the models *without* the term included only the main effect terms that constitute the interaction term^{40}.

To analyze the proportion of “more negative” responses for a probe set, we used a weakly informative prior, Student’s *t* distribution (ν = 4, µ = 0, s = 2.5), for the regression coefficients of the binary response data. The models were fit using eight chains, each with 6000 iterations, including a warm-up of 1000 samples per chain. This resulted in a total of 40,000 Markov chain Monte Carlo samples. For the RT data, we used a normal prior (*Normal*(0,10)), which is also weakly informative. The models were fit using seven chains, each with 10,000 iterations, including a warm-up of 1000 samples per chain, resulting in 63,000 Markov chain Monte Carlo samples. The convergence of Markov chain Monte Carlo chains was validated by the Rhat statistic.
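The reported sample totals follow from the number of chains and the post-warm-up iterations per chain. The arithmetic is shown below as a brief Python check; the helper name `post_warmup_draws` is hypothetical and is not part of brms.

```python
def post_warmup_draws(chains: int, iterations: int, warmup: int) -> int:
    """Total retained draws when warm-up iterations are discarded per chain."""
    return chains * (iterations - warmup)


response_draws = post_warmup_draws(chains=8, iterations=6000, warmup=1000)
rt_draws = post_warmup_draws(chains=7, iterations=10_000, warmup=1000)
```

This gives 8 × 5000 = 40,000 draws for the response models and 7 × 9000 = 63,000 draws for the RT models.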

Additionally, we conducted a post-hoc analysis on the *either* condition only. Trials of the *either* condition were fit to the Bernoulli distribution with a logit link function. Student’s *t* distribution (ν = 4, µ = 0, s = 2.5) for the regression coefficients was used as a prior, and the models were fit using eight chains, each with 6000 iterations, including a warm-up of 1000 samples per chain. When constructing models, we used two population-level predictors: emotion category (happy, angry) and set with facial masks (probe, control).

We neither excluded nor trimmed specific data or trials when modeling the proportion of “more negative” response data. For the RT data, however, trials with RTs three standard deviations above or below the mean in each condition were considered outliers and subsequently excluded. The exclusion rate was 2% on average across conditions (1.95% in total).
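The RT exclusion rule can be sketched as follows. This Python illustration is not the analysis code used in the study; the function name `exclude_rt_outliers` and the dictionary keys `"condition"` and `"rt"` are hypothetical, and the sketch assumes the sample standard deviation computed over all trials in a condition (the text does not specify whether the mean and standard deviation were computed per participant).

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def exclude_rt_outliers(trials: list[dict], n_sd: float = 3.0) -> list[dict]:
    """Drop trials whose RT lies more than n_sd SDs from the condition mean."""
    rts_by_condition = defaultdict(list)
    for trial in trials:
        rts_by_condition[trial["condition"]].append(trial["rt"])

    bounds = {}
    for condition, rts in rts_by_condition.items():
        if len(rts) < 2:
            # An SD is undefined for a single trial, so nothing is excluded.
            bounds[condition] = (-math.inf, math.inf)
        else:
            center, spread = mean(rts), stdev(rts)
            bounds[condition] = (center - n_sd * spread, center + n_sd * spread)

    return [
        trial for trial in trials
        if bounds[trial["condition"]][0] <= trial["rt"] <= bounds[trial["condition"]][1]
    ]


# Toy data: twenty 0.5 s responses and one 10 s response in one condition.
toy = [{"condition": "a", "rt": 0.5} for _ in range(20)]
toy.append({"condition": "a", "rt": 10.0})
kept = exclude_rt_outliers(toy)
exclusion_rate = 1 - len(kept) / len(toy)
```

In the toy data, only the 10 s trial falls outside the three-standard-deviation bounds and is removed.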