Chasing Waterfalls: Cascade Effects and the AI Competencies Debate
MIMOSA SANDBOX
Experimental Field Notes from the End-User Frontier
Notes from the MIMOSA PROTEST experiment on LLM benchmarks, sandbox resources, preference cascades, and the professional pitfalls of chasing waterfalls.
TL;DR: After running an informal LLM benchmarking experiment, CKS reflects on the pros and cons of preparation, the practical challenges of testing multiple AI models, and professional tensions around domain-specific AI adoption.
Informal testing. Starting with an intuitive or grounded approach can fuel creativity and help bypass cognitive blocks, enabling a later shift to more rigorous, replicable benchmark design.
Sandbox challenges. Testing multiple LLMs through low-tech options like bookmarks, browser windows, and manual processes works, but it’s inefficient and limited to smaller scales. There are built-for-purpose apps and services that can enable serious benchmarking.
Reluctant adopters. AI adoption connects to professional cultures, regulatory and compliance requirements, and the incentives for switching from proven to experimental tools.
Social dynamics. As AI proliferates, gatekeeping (STEM vs. non-STEM expertise), drawbridging (professional caution and resistance), and bandwagoning (hype-driven adoption pressures) are shaping perceptions about labour readiness and skill requirements.
Cascade effects. AI procurement, adoption, and operational decisions can be driven as much by social proof as by genuine functional utility, dynamics shaped by preference rather than objective requirements.
Summary point. Experiment boldly with AI, give it enough space to allow for some initial creative chaos, follow up with detailed reflection, and be aware of judgment-distorting social pressures.
After last week’s MIMOSA PROTEST experiment, CKS did some remedial research on benchmarking practices. Our approach was intuitive and eclectic. The term for what we did - winging it - is something most people have at least a little experience with. In more formal terms, we used a grounded approach, worked the problem from the bottom up, and took logical next steps that made sense in the context of the experiment. The major caveat is that “sense” in our case means judgment equipped with decades of strategic and applied social science research. That doesn’t mean it was unbiased.
Planning is overrated. No, it’s not. But it can get in the way of the creative flow and ideation that play an important role at project onset. Running a benchmarking experiment can be done off the cuff or programmatically. We’ve done more of the former, and we’re now preparing to do more of the latter. Working through the initial results reflexively gives us a good base to work from as we shift to more formal and replicable benchmark design. It can feel and look sloppy, but it’s also a useful and substantive injection of fuel at the start of the process. The trick is to recognize when it’s time to shift gears and get serious. Here we are. Post-experiment review and reflection below.
Benchmark meet sandbox. Developing a credible LLM benchmarking programme means working out and thinking through a grocery list of practical and physical issues. MIMOSA PROTEST involved a lot of browser-based bookmarks and multiple pro and project LLM accounts, run separately in parallel browser windows while iterating prompts and recording observations. It worked, mostly, and it served a dual purpose as a duct tape and baling wire exercise in limiting costs to time spent and exasperation vented. The major issue is that it’s just not efficient, leaves too much room for design flaws and experimental error, and makes the process much bumpier than it needs to be. Testing, in other words, can be messy. It doesn’t have to be. There are tidier options. Find them and work with them.
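For illustration only, here is a minimal sketch of what a tidier option might look like under the hood: a small script that sends the same prompts to several models and logs every response in one place, instead of juggling browser windows. The query functions are hypothetical stand-ins for whatever client or API each service actually exposes; nothing here describes the MIMOSA PROTEST setup itself.

```python
# Hypothetical harness for side-by-side prompt testing (a sketch, not the
# MIMOSA PROTEST setup). Each entry in `models` maps a label to a stand-in
# query function; swap in real client calls for the services being compared.
import csv
from datetime import datetime, timezone
from typing import Callable, Dict, List


def run_benchmark(
    prompts: List[str],
    models: Dict[str, Callable[[str], str]],
    out_path: str = "results.csv",
) -> None:
    """Send every prompt to every model and log responses in a single CSV."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp_utc", "model", "prompt", "response"])
        for prompt in prompts:
            for name, query in models.items():
                response = query(prompt)  # one call per model per prompt
                writer.writerow(
                    [datetime.now(timezone.utc).isoformat(), name, prompt, response]
                )


if __name__ == "__main__":
    # Toy example with placeholder "models" that just echo the prompt.
    run_benchmark(
        prompts=["Summarise the key risks in this clause."],
        models={
            "model_a": lambda p: f"[model_a reply to: {p}]",
            "model_b": lambda p: f"[model_b reply to: {p}]",
        },
    )
```

Even a toy harness like this makes iterations repeatable and keeps a record of what was asked and answered, which is most of what the browser-window method lacked.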
Bonfire of the profanities. Experiments can be ill-conceived, miss their mark, or go right off the rails. Don’t expect too much during the early days. Trial and error, process of elimination, discovery - all part of the process. Curse all you want at fumbles and missteps, but register them as observables that help guardrail and improve everything that comes next. Anecdote: imagine the shriek of excitement when CKS discovered tools and technologies specifically designed to enable effective digital sandbox AI experiments. For industry outsiders not intimately familiar with the minutiae of computing hardware and software, the space for weedy digressions can be pretty limited. Some tech assists are still worth adopting for the efficiencies they bring, even if those gains are partly offset by the time and effort involved in learning how to use them.
For anyone shifting back and forth between models: horizontals like Perplexity, Liminal, and Poe are handy one-stop-shops. The world of user guides and toolkits can be very helpful. There are interesting resources beyond that, like domain-specific applications and high-end IT solutions. This is where the learning curve gets steeper, pockets need to be much deeper, and the implied commitment changes the game. Developing sufficient skill to support other work has to be balanced against the escalating costs of doing so.
Judge, jury, and... practitioner? There’s a lot of opportunity in AI, and a lot of information and expertise floating around. The cautions worth exercising can be summed up with the usual cliches: “seize the day”, “buyer beware”, “you get what you pay for”, “slow is smooth and smooth is fast”, and one of our professional go-to pieces of sage advice, “consider your sources”. Source evaluation is fast becoming a difficult proposition as AI tools generate new, synthetic, hallucinated, polluted, or compromised data. There are some interesting industry archetypes that plug into this, like the “AI agent”, not to mention LLM-as-Judge (see here, here, here, here, and here) and LLM-as-Jury (here, here, here, and here). Functional and clear, if a little awkward for actual flesh-and-blood human jurists trying to work out the role of AI in the courtroom.
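For readers unfamiliar with the pattern, a rough sketch of what LLM-as-Judge involves in practice: one model’s answer is handed to a second model together with a rubric, and the judge returns a structured verdict. The judge function below is a hypothetical stand-in for a real model call, and the rubric and scoring scale are illustrative assumptions, not drawn from the sources linked above.

```python
# Rough sketch of the LLM-as-Judge pattern. `judge_fn` is a hypothetical
# stand-in for a call to a judging model; the rubric and 1-5 scale are
# illustrative assumptions only.
import json
from typing import Callable, Dict

RUBRIC = (
    "Score the ANSWER to the QUESTION from 1 (poor) to 5 (excellent) for "
    'accuracy and clarity. Reply with JSON only: {"score": <int>, "reason": "<str>"}.'
)


def judge_answer(question: str, answer: str, judge_fn: Callable[[str], str]) -> Dict:
    """Ask the judging model to grade an answer and parse its JSON verdict."""
    prompt = f"{RUBRIC}\n\nQUESTION: {question}\n\nANSWER: {answer}"
    verdict = judge_fn(prompt)  # assumes the judge replies in the requested format
    return json.loads(verdict)
```

LLM-as-Jury extends the same idea by putting the question to several judge models and aggregating their scores, much as a human panel would.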
The posture of scientific revolutions. As social scientists, the hardworking staff at CKS pay close attention to perceptions, behaviours, and drivers, which intersect in interesting ways with the technology industry and its anthropomorphisms. There’s no shortage of frames and turns of phrase to consider. For anyone who has invested a lot of time and effort into developing expertise, reputation, and the state of the art, it’s a fair thing to resent newcomers and insta-experts, especially with so much hype built into public talk of AI. Other fields go through the same cycles on a routine basis (see Thomas Kuhn for one overarching variation on the theme). Respect to the builders. A solid understanding of AI and its applications is easier to develop now precisely because there were people clever and inspired enough to figure it out and put in the hard graft. The grumbling about AI “amateurs” is fine. The professionals are entitled.
Adjusting to social shifts and transitions, especially when it comes to democratized access and claims of expertise, can be uncomfortable. A half-baked thought: there’s something of the competent man and the gentleman amateur to all this. What does it mean to be a capable Renaissance-type generalist or a narrowly hyperskilled specialist? Our view is that choices and options on the road ahead aren’t so starkly binary. Keep that in mind when thinking about some of these additional challenges to managing, professionally or otherwise, the pressure to reskill or upskill in an AI-first world:
Gatekeeping. Mileage varies considerably when it comes to accepted definitions and public discussion of the life and death of “expertise” and “technical expertise”, and what they mean to each other. The benchmarking field is shaped by AI developers and technologists, an epistemic professional community whose knowledge and expertise have at least a surface appearance that differs considerably from that of their non-technical contemporaries. Domain professionals bring valuable evaluative standards grounded in their own academic and professional disciplines, real-world complexities, and accountability mechanisms. Those standards aren’t easy to square with the traditional and emerging differentiation between technicals in the STEM sense and the “non-technical” or “lay user” denizens of the wider market for AI technology. The frame is skewed. The consequence is that it potentially keeps the shutters closed and the doors barred against meaningful collaboration between STEM and AHSS (or “SHAPE”) technical specialists and their skillsets.
Drawbridging. Professional standards, caution, conservatism, and yes, resistance also inevitably come into play. AI and legaltech are extraordinarily prominent features of law and legal practice, for example, but the lawyers CKS Actual has spoken to - and more sober takes on the industry - point to a mixed picture of adoption, indifference, and resistance. AI is an experimental realm. Moats and drawbridges are reasonable and useful ways of protecting users from advertent and inadvertent harms. Professional communities like law are perceived to be skittish about new technologies, treating them with considerable skepticism. We don’t want to make too much of the point, but change isn’t universally equated with progress. When proven tools are comfortable knowns and perfectly serviceable, the benefits of change have to be compelling.
Bandwagoning. Then there’s the hype. If there’s anything illusory about change and progress, it might have something to do with market dynamics and adoption pressures, or at least with the opaque social, behavioural and cognitive elements that drive them. Other suspects: the pace of technological change, planned and accelerated obsolescence, technology-assisted benefits to human wellbeing, and the narratives surrounding them. Hype, it’s worth stressing, can and does actually move people, but it alludes to many things, and not all of them are negative or nefarious. A short list of alternatives: energy, inspiration, motivation, aspiration. The issue isn’t so much about keeping up with the Joneses, although that’s part of it. It’s not even whether the hype is manipulative. Of course it is. The more important point is whether the manipulation is to positive or negative ends, in both absolute and relative terms.
Don’t think of a waterfall. This is our last piece of MIMOSA PROTEST-inspired analogical explanation for the day. Chasing waterfalls is an odd turn of phrase that doesn’t visualize well or naturally. Waterfalls, when humans interact with them, don’t readily bring to mind anything that can conceivably be chased. People routinely and tragically get swept over them, do strange things under them, or gaze at them from safer touristy perches. When waterfalls start to run over top of one another, things get complicated. That’s when gatekeeping, drawbridging, and bandwagoning, those avatars of social conformity, stand at the precipice and throttle back access to a bumpy, jagged-edged flow. They serve as a brake of sorts on preference falsification and its consequences. For Timur Kuran and Cass Sunstein, availability cascades are where procurement and adoption decisions demonstrate social proof and risk aversion rather than a commitment to genuine utility. The lesson: chase those waterfalls, but don’t let their thrill and majesty sweep away good judgment. Watch them, study them, marvel at them. Sing along to songs about them. Immerse yourself in them to know them better. Read the lessons they offer to make better decisions.
Get in touch to find out more about how we can help your in-house research team.
CKS is a trading name of Craighead Kellas, a private limited company registered in the United Kingdom (company registration #15995326) and in the United Arab Emirates. ©Craighead Kellas 2025. All Rights Reserved. CraigheadKellas.com.