Oftentimes the voices of a few drown out the voices of many. We interpret this problem literally and investigate the ability of supervised machine learning models to predict sentiment directly from crowd audio in which multiple speakers talk simultaneously. Starting from a dataset of one-second recordings of individuals saying "yes" or "no", we mix different voices speaking at the same time to create a dataset of crowd audio responses, annotating each mixture with the proportion of "yes" and "no" answers it contains. We then train BLSTM models on the raw audio to predict the average sentiment of the crowd response, for mixture sizes ranging from 1 to 10 constituent voices.
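The mixing step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes equal-length mono clips, mixes them by simple averaging, and derives the sentiment label as the fraction of "yes" voices; the function name and sample rate are hypothetical.

```python
import numpy as np

def mix_crowd(clips, labels):
    """Mix equal-length mono clips into one synthetic crowd recording.

    clips:  list of 1-D float arrays (one-second recordings, same length)
    labels: list of ints, 0 = "no", 1 = "yes", one per clip
    Returns the averaged waveform and the proportion of "yes" voices,
    which serves as the crowd-sentiment regression target.
    """
    stacked = np.stack(clips)           # shape: (n_voices, n_samples)
    mixture = stacked.mean(axis=0)      # average rather than sum, to avoid clipping
    sentiment = float(np.mean(labels))  # fraction of "yes" responses in the crowd
    return mixture, sentiment

# Example with three synthetic "voices" (random noise stand-ins for speech)
rng = np.random.default_rng(0)
clips = [rng.uniform(-1, 1, 16000) for _ in range(3)]  # 1 s at an assumed 16 kHz
mixture, sentiment = mix_crowd(clips, [1, 1, 0])       # two "yes", one "no"
```

Averaging (rather than summing) keeps the mixture within the amplitude range of the inputs, and the same function covers every mixture size from 1 to 10 voices.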