
#93 – Can AI Be Used As A Foundation for Professional Work?
About Michael Sena
Michael Sena is the CSO and co-founder of Recall, a decentralized skill market for AI where communities fund, rank, and discover the AI solutions they need. Michael helped scale a company from 30 to 1,800 people, co-founded uPort, the first decentralized identity protocol, and co-founded 3Box Labs, where he led development of Ceramic Network. At Recall, Michael focuses on growing the world’s largest AI competitions and giving the community the power to shape and accelerate the future of AI.
Michael’s Links
https://recall.network/
https://x.com/recallnet
https://x.com/dataliquidity
SUMMARY:
The meeting explored the current shortcomings of AI evaluation and proposed practical approaches to trustworthy, continuously refreshed testing to support professional workflows. David Schropfer framed the concern that widely used AI benchmarks may be flawed and thus create misplaced confidence when AI outputs are used as foundations for professional work. He cited investigative findings that many benchmarks have validity problems and stressed the real-world consequences for areas like legal filings and corporate risk.
Michael Sena described benchmarks as the dominant current evaluation infrastructure and argued they are fundamentally flawed because models can overfit to well-known tests and produce misleading high scores. He proposed rotating, fresh evaluations and community-driven, arena-style head-to-head testing to surface real capability and safety attributes over time. Sena explained Recall Network’s approach of designing custom benchmarks for specific capabilities, running open competitions with continuous re-testing, and providing public leaderboards and community validation as a way for organizations to reliably select AI for sensitive use cases.
SHOW NOTES:
#93 – Can AI Be Used As A Foundation for Professional Work?
Welcome back everyone to DIY Cyber Guy
Hair on Fire 3 out of 5
Target: AI Users
As of today, there are no regulations on artificial intelligence. Among other things, that means we need to rely on the companies producing AI-based products and infrastructure to tell us whether those systems have been tested for the accuracy and validity of the information they output.
But what if most of those tests do not actually work?
A new investigation by The Guardian found that experts reviewing more than 440 widely used AI benchmarks discovered flaws that, in the words of the researchers, “undermine the validity of the resulting claims” because “almost all … have weaknesses in at least one area.” Many tests used to judge whether AI systems are safe, aligned with human values, or even capable of basic reasoning may be irrelevant or even misleading. 
That should make everyone uncomfortable, especially if you rely on AI as a foundation of other work that you produce.
Because AI safety is not a theoretical exercise anymore. These systems are already influencing decisions in corporate risk, national security, consumer protection, and cyber defense. If our safety evaluations are fundamentally flawed, then our confidence in these systems is misplaced—and misplaced confidence is how complex systems fail quietly, until they fail catastrophically.
To help us unpack this—and to explore how real-world defense, detection, and accountability mechanisms should work—I am joined today by Michael Sena, founder of recall.network, a company focused on persistent digital evidence collection and trust infrastructure for AI and other distributed systems.
Welcome Michael!
TRANSCRIPT
0:01 – David W. Schropfer
Welcome back, everybody, to DIY Cyber Guy. This is episode 93: Can AI be used as a foundation for professional work? As of today, there is no regulation around artificial intelligence. So this episode is for anybody that uses AI in any capacity. If you’re listening to this, chances are pretty good that you use it in some capacity, whether it’s minor ChatGPT use, something more involved, or something more proprietary. This is a hair on fire three out of five. And the reason that’s so high is because if you’re founding other professional work on an initial blueprint or an initial template that’s produced by AI, you could be founding that on errors or information that simply isn’t valid. So let’s get into that. As of today, as you all know, there’s no regulation around artificial intelligence. Among other things, that means that we need to rely on the company that made that AI product to prove to us, or tell us, that it’s been tested and that it’s giving results that are reliable and valid, something you can use as a basis for other work. A lot of authors, for example, will write the first draft of an article based on AI, and then they will add meat and other content to that article, just as an example. But if that work is based on errors that AI produced in the first place, that author could be in for a problem down the road. So the question here really is: what if most of the tests that people use to make sure these AI products are working correctly don’t work? What if those tests aren’t giving us reliable results? A new investigation by The Guardian found that experts reviewing more than 440 widely used AI benchmarks discovered flaws that, in the words of the researchers, quote, undermine the validity of the resulting claims, because, quote, almost all of these benchmarks have weaknesses in at least one area. Many of the tests used to judge whether AI systems are safe, aligned with human values, or even capable of basic reasoning may actually be irrelevant or even misleading. That should make all of us uncomfortable, especially if you’re using AI as a foundation for professional work that you’re producing.
3:03 – David W. Schropfer
Safety is really not a theoretical exercise anymore. The internet went through the same evolution: at first it functioned, and everybody was really excited about that, and we could go to websites and see information. But then as other issues came up, like security, like accuracy, like spoofing, like all the cybercrime, those issues had to be fixed. And those fixes typically came with heavy-duty, reliable testing. So that’s where we are; we’re in the early stages of that with AI today. These AI systems are already influencing things like corporate risk, consumer protection, and cyber defense. And the point is, if our safety evaluations are fundamentally flawed, then our confidence in these AI systems is also fundamentally flawed. It’s misplaced. And we have to figure out a way to deal with that, and make sure that these systems are something we can rely on before we do other work based on their results. So to help us unpack all of this today, and to explore how real-world defense, detection, and accountability mechanisms should work, I’m joined today by Michael Sena, founder of Recall Network, a company focused on persistent digital evidence collection and trust infrastructure for AI and other distributed systems. So welcome, Michael.
4:27 – Michael Sena
Thanks for having me on David.
4:30 – David W. Schropfer
It’s great to have you here. Let’s start with the basics. What is the state of AI testing? What does that look like? What are some of the third-party products used to test AI systems, and how accurate are they, in your opinion?
4:44 – Michael Sena
Well, I’ll start off with the accuracy piece. I think that’s the whole crux of the conversation we’re having today: the existing systems are not very accurate. How that works today with AI, there’s this infrastructure called benchmarks. I’m sure you might have heard them mentioned on your favorite podcast or on Twitter, but basically a benchmark is a test that evaluates an AI system, whether that’s a model or an agent, on a particular task. These benchmarks are basically standardized tests, and model developers or AI developers are using those as a way to prove whether or not their product is good at these things. There are a few larger open-source projects or infrastructure providers that run these benchmarks, and it seems like every day there are 100 or 1,000 new benchmarks coming online, testing LLMs on everything from niche problems to general understanding and safety and all of these things. So benchmarks are how AI is evaluated today, but they’re fundamentally flawed. I’m happy to talk about that more, but I think that’s the crux of the situation and the point where we are today: there’s all this hype, there’s all this development, there’s all this testing and measurement, but there’s no real way for us to actually… Benchmarks are flawed. Yeah, exactly. There’s no way for us to trust the results. The numbers might look good, but they might not reflect reality.
6:30 – David W. Schropfer
So to take maybe an imperfect analogy: if you had a bunch of benchmarks that determined the safety of an 18-wheeler on a highway, and the benchmarks were flawed, that 18-wheeler could pass the test based on those benchmarks. But it’s still not a safe truck to be on the road.
6:53 – Michael Sena
Yeah, exactly. Another way to think about it, a bit more directly applicable to AI: imagine you had a multiple-choice test and someone gave you the answer key ahead of time and said, memorize the answers. And instead of actually learning the material and upskilling yourself, you went home and just memorized the answers. Then you went into the test and you got 100. And everyone said you were the smartest, and you were paraded around as the smartest. But in reality, the only skill you actually had was the ability to memorize the answers to a standardized test, not actually being good at the thing you were supposed to be tested on.
7:39 – David W. Schropfer
So what is the fix for that? Obviously, not letting somebody get the answers before being tested, which, in your opinion, it sounds like is the case with AI. Or is it a question of changing the nature of the test itself?
7:57 – Michael Sena
I think there are a few parts to the answer. One of the big ones is that we can’t expose AI developers, or the AI themselves, to the evaluations ahead of time. There are a few really well-known benchmarks or evals, and they’re so well known that how an AI does on them is a very marketable thing for the developers. It’s like a feature; it’s right up there with the list of things they lay out. So the first step we need to take is rotating those benchmarks and evals, and continuing to come up with new ones that challenge the LLMs in new ways, without those being exposed to them. We call it benchmaxing: you’re just maxing yourself out to succeed at this one thing. And not only are you not actually developing the skill, but oftentimes you’re doing so at the cost of other things. It is sort of zero-sum, in that you can only optimize for what you choose to optimize, and oftentimes they’re optimizing for these benchmarks at the expense of investing in safety or other generalized capabilities that might really be important to people, but that don’t make their way into the headlines.
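To make the rotation idea concrete, here is a minimal Python sketch. Everything in it, the call_model stub, the exact-match grading, the sample sizes, is invented for illustration; it is not Recall’s or any vendor’s actual harness. The idea is simply to score a model on a familiar public set and on a freshly sampled, never-published rotation, and treat a large gap as a sign of benchmaxing.

```python
# Hypothetical sketch: compare a model's score on a well-known public benchmark
# with its score on a fresh, never-published rotation of items.
import random
from statistics import mean

def call_model(prompt: str) -> str:
    """Placeholder for a real model call; swap in your own API client."""
    return "stub answer"  # stand-in so the sketch runs end to end

def score(items: list[dict]) -> float:
    """Each item is {'prompt': str, 'expected': str}. Exact-match grading
    is deliberately simplistic here, purely for illustration."""
    hits = [1.0 if call_model(it["prompt"]).strip() == it["expected"] else 0.0
            for it in items]
    return mean(hits) if hits else 0.0

def rotating_eval(public_set: list[dict], fresh_pool: list[dict], n: int = 50) -> dict:
    """Sample a fresh, unseen subset each cycle so answers can't be memorized."""
    fresh_sample = random.sample(fresh_pool, min(n, len(fresh_pool)))
    public_score = score(public_set)
    fresh_score = score(fresh_sample)
    return {
        "public_benchmark": public_score,
        "fresh_rotation": fresh_score,
        # A large positive gap suggests the public test was memorized, not learned.
        "overfit_gap": public_score - fresh_score,
    }
```

The design point is the same one made above: the fresh pool is regenerated and kept private, so a model can only do well on it by actually having the capability.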
9:22 – David W. Schropfer
So how do we, as a user group, and let’s just call everybody who uses AI a user group, push back on that? How do we demand different kinds of tests? I was about to say to hold these AI apps to account, but it’s more to make sure that they don’t have, as you pointed out, the test results in advance, so they can’t perform to the results that they know everybody is expecting.
9:53 – Michael Sena
Well, I think everyone that’s an AI power user today is doing this already. I’ll put myself in that bucket. I’m using a lot of different AI agents and models for very particular things, and I don’t look at benchmark scores to figure out what I’m going to use a certain AI for. Through trial and error, through using an AI, if you actually are well versed in what they can and can’t do and know their limits, then when a new AI model comes out you know quite instantaneously: oh, it seems better than this one at doing these types of things, and it’s less good at doing these other types of things. But what about the majority of people that just want to use AI, or the majority of businesses that want to build on AI? Maybe not everyone has the time to test every new model that comes out. So what I think benchmarks are, for most people, is just this proxy for searching for a skill that you want, giving you a ranked list: here are the best AI tools you could use today to help you do that. And right now, there’s just zero confidence, zero trust, and zero accountability in those rankings. So what happens is the majority of people end up using AI tools that are ill-suited and not optimized for their use case. People arrive at this local maximum, and they’re not actually figuring out what the capabilities of AI really are.
11:34 – David W. Schropfer
And that’s part of my personal concern using AI. There’s a big gap between something that’s not optimized for your use case, whatever you’re trying to get out of it, versus a system that’s intentionally trying to tell you what it thinks you want to hear, even to the point of filling in a blank with completely fabricated information, or, of course, going all the way to hallucination: coming up with a person that never existed or an event that never existed in an attempt to give you a complete answer. Those are the types of things that users want to make sure are taken out of the equation. Not optimal is one thing; hallucinating and making up facts is almost the same thing as using malicious software.
12:26 – Michael Sena
I totally agree. And those are the types of things that aren’t typically found in the most popular benchmarks. Things like safety, trust, accountability, honesty.
12:41 – Michael Sena
These are things that actually matter to people, not how it performed on some abstract math question on an exam; more practical things that affect people’s daily interactions and what they can expect from their AI. And so there’s just not enough coverage, and there’s not enough trust. It’s leading people down these paths where they think, because they’re using an AI and it has given them all the answers, that they can begin to trust it, and then it starts to distort reality, which is a real challenge with these things. So, yeah, I think it shows where we are in the development. Everyone’s pushing the limits of AI, but I think society is now starting to demand that we don’t forget about the benchmarks, the requirements that enable us to keep using the new features of these systems, the things we can’t do without. So I completely agree there.
13:49 – David W. Schropfer
So a few episodes ago, we talked about a real-world case where attorneys were using AI to make submissions to the court, and some of these flaws were not proofread out of those documents. It turned out to be extraordinarily embarrassing and even detrimental to the careers of the lawyers who submitted those documents to the court. Courts apparently don’t have a lot of patience for that kind of thing. So let’s say I’m one of the biggest law firms in the nation, and I come to Recall Network and say, look, I understand that I could be on the order of 40, 50, 60 percent more efficient in terms of my use of human capital if I could rely on some element of AI, at least for some of the tasks that go along with running a law firm and producing output. So if that huge law firm came to Recall Network and said, Michael, help me, what would your company do?
14:55 – Michael Sena
Well, first we’d ask them what they really need AI to do. What are the fundamental capabilities they want? Because obviously you can throw AI at everything, and I think within something like a law firm, the gains you get from properly deploying AI, with an organization built around it, are step-function improvements in productivity; a 90 percent improvement in productivity is still an understatement. So really it’s about helping them figure out what they actually need from AI. Then we help them design benchmarks, evals, or public arenas that measure that very specific capability, or set of capabilities, they need. We use publicly available models, and we let the community submit AI agents, which are basically specialized products built for that use case. We define the criteria and we test them. It’s an arena style of competition, where AI are competing head to head at these tasks, with leaderboards and scoreboards and publicly visible charts. And the community plays a role in this too, not only by submitting AI to be tested at that skill, or building it themselves, but also by voting and predicting who’s going to win. We could help them launch ten different arenas if they really want to test ten different skills. But if a skill they need has already been tested on the platform, they’re free to just explore the leaderboards and find out which AI have been proven to be good at that skill. And it’s continuously evolving: we keep testing AI models and agents at these things over time. So it’s not like you did well on this test once and now you’re number one on the leaderboard forever; there is this freshness to it. So, yeah, it’s really an open platform for any AI to be evaluated on any capability, and that forms the foundation of better decision-making, so a Fortune 100 firm can actually trust the infrastructure they’re building their business on and staking their reputation on.
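As a rough illustration of the freshness point, here is a small Python sketch. The MatchResult shape, the 30-day window, and the task name are assumptions made up for this example, not Recall’s actual data model: a leaderboard that only counts recent runs, so an agent has to keep getting re-tested to hold its ranking.

```python
# Hypothetical sketch: a leaderboard where only recent head-to-head results count.
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class MatchResult:
    agent: str
    task: str          # e.g. "contract-clause-review" (illustrative skill name)
    score: float       # 0.0 to 1.0 for this run
    run_at: datetime

def leaderboard(results: list[MatchResult], task: str,
                window_days: int = 30, min_runs: int = 3) -> list[tuple[str, float]]:
    """Rank agents on a task using only runs inside a recent time window.
    Agents with too few recent runs drop off until they are re-tested."""
    cutoff = datetime.utcnow() - timedelta(days=window_days)
    recent = defaultdict(list)
    for r in results:
        if r.task == task and r.run_at >= cutoff:
            recent[r.agent].append(r.score)
    ranked = [(agent, sum(scores) / len(scores))
              for agent, scores in recent.items() if len(scores) >= min_runs]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

The window and minimum-run threshold are the two knobs that express "you did well once" versus "you keep doing well"; stale scores simply age out of the ranking.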
17:19 – David W. Schropfer
And a lot of my listeners probably didn’t know that was even achievable: if you’re going to stake the reputation of a venerable 125-year-old law firm that’s a household name, with a Fortune 100 ranking in terms of revenue, they may not have known that was possible. What would you tell them in terms of a starting point? So I’m a CEO of one of these firms, or maybe a senior executive at one of them, and I’m just going to stick with the law firm example. I want to immerse myself in the concept of using a community to evaluate AI, and the concept of changing benchmarks that continuously rank one product ahead of another, or one system ahead of another, for certain purposes and certain types of output. How do I start to learn about that?
18:11 – Michael Sena
Well, I guess the first step would be to check out the arenas and the benchmarks that we’ve already run on Recall Network. Those range from financial services, like crypto trading, both actual live and paper, to tests on content generation, to tests on willingness to follow instructions regarding punctuation. We did a benchmark that someone wanted that asked which AI model is the best at not using em dashes in its writing, because, if you remember, six months ago every AI model was using em dashes all over everything, and a lot of people stylistically didn’t like that. Which seems funny and kind of quirky, but I think it’s really representative of the power of the platform: whatever you want to know, whatever you want out of your AI, we can design a benchmark around it and rally a community to help evaluate it. So, yeah, I would just say check out what’s already there, see how we’ve run those, and they can just reach right out; it doesn’t take much for us to spin up an arena.
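As a toy example of how narrow such a custom benchmark can be, here is a sketch of an em-dash test in Python. The scoring rule and the model names are made up for illustration and are not the actual Recall benchmark: grade each model by the fraction of its outputs that contain no em dashes, then rank.

```python
# Hypothetical sketch: rank models by how consistently they avoid em dashes.
def em_dash_score(outputs: list[str]) -> float:
    """Fraction of outputs containing zero em dashes (higher is better)."""
    clean = sum(1 for text in outputs if "\u2014" not in text)
    return clean / len(outputs) if outputs else 0.0

# Example: two hypothetical models' outputs on the same prompts.
model_outputs = {
    "model-a": ["Short answer one.", "Plain sentence\u2014with an em dash."],
    "model-b": ["Short answer one.", "Plain sentence, no em dash."],
}
ranking = sorted(model_outputs.items(),
                 key=lambda kv: em_dash_score(kv[1]), reverse=True)
print(ranking)  # model-b ranks first: none of its outputs used an em dash
```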
19:23 – David W. Schropfer
And I want to emphasize to my listeners that Recall Network is not a sponsor. I’m just that interested in what the founder, Michael Sena, has to say about some of these topics, because they are new and they are necessary. And if memory serves, Michael, some of these elements are free on your website for evaluation. Is that right?
19:46 – Michael Sena
Yeah, they are free. We have a menu of predefined tests and a community of 100,000 or more members that all join in and take part in making sure that these arenas and competitions run. So, yeah, it is an open platform.
20:06 – David W. Schropfer
And like so many other things, competition, real-time competition against a new opponent, a fresh opponent, is probably the best way to test the reliability of any one of these systems. So I think your methodology is great.
20:22 – Michael Sena
I mean, competition, not only is it, I think, the best way to actually evaluate models and agents in these head-to-head challenges, but it’s also entertaining. We had an assumption that people would be interested in watching AIs battle at these tasks, but the community response and engagement has been overwhelming. So it’s like eSports for bots, I guess.
20:55 – David W. Schropfer
That kind of vibe. eSports for bots. That’s interesting. All right. So, Michael, it’s been great having you on the podcast.
21:05 – David W. Schropfer
Where can people find out more about what you do?
21:10 – Michael Sena
The best place to get an overview of what we’re doing is definitely by heading to recall.network. That’s our website. On there, you’ll find all sorts of information about what we’re working on, how the system works, links to our socials, like @recallnet on X, and a link to our community Discord chat. And as always, we’re posting updates on social, so that’s usually the best place to find out what we’re working on now. And our blog is pretty active, with case studies, use cases, and just us writing as we go.
21:46 – David W. Schropfer
So the website, the blog, and the social media posts of Recall Network. And I love it that your company is named Recall Network and your URL is conveniently recall.network. I wish every company could do that.
22:01 – Michael Sena
We try to make it easy. Easy recall. There you go.
22:08 – David W. Schropfer
Michael, it’s been great having you on the show. And for my listeners, if you missed any of those links, just go to diycyberguide.com, look for episode 93, that’s nine three, and you will find everything that Michael just said and all the links that he just gave. Michael, thanks again for being on the show.
22:23 – Michael Sena
Thanks, David. It’s been a pleasure.