Which AI is Best for You? Our Summer Project Exploring LLMs
- newRoot Labs

- Sep 4
- 4 min read
by the newRoot Labs team: D Gumaste & J Cast, with J Eselgroth

We can almost guarantee that you have used ChatGPT before, but are you sure that was the strongest model for your needs? This summer, our project was to answer that question so students and professionals can turn to the right AI tool for help. Each Large Language Model (LLM) specializes in different areas and purposes, which means choosing the right one matters. To explore this, we built an automated system that gathered data from two websites (HELM and the LLM Leaderboard), matched model names across both sources, joined the results, and output a CSV file into an AWS bucket.

Learning with LLMs
We were encouraged to use LLMs while completing our work. At first, they helped us move faster than constantly asking people for help. Over time, though, we realized that relying on LLMs also meant sacrificing some fundamental understanding. Using AI saved time, but it sometimes left us with only an application-level grasp of the skills. Even so, we were able to learn more in a shorter timeframe than we would have otherwise, especially as this was our first exposure to many of these tools.
From Excel to Python

Our biggest initial task was to replicate a function from the HELM website. The site displayed mean win rate for efficiency but not for accuracy. Our challenge was figuring out the formula and confirming that it matched the site’s results. With minimal Excel experience, ChatGPT was our first stop. We learned that the calculation used pairwise comparison, where models are evaluated against each other and win rates are averaged. Using ChatGPT, we created an Excel function that nearly matched the leaderboard’s published numbers. The small differences came from rounding: the site only displayed three decimal places, while the underlying calculation likely used more. Once we validated the efficiency formula, we translated it to accuracy as well.
=LET(
rng, $C$2:$BH$80,
r, ROW()-ROW($C$2)+1,
wins, SUM(BYCOL(rng, LAMBDA(c, COUNTIF(c, "<"&INDEX(c, r))))),
ties, SUM(BYCOL(rng, LAMBDA(c, COUNTIF(c, INDEX(c, r)) - 1))),
comps, SUM(BYCOL(rng, LAMBDA(c, COUNTA(c) - 1))),
IF(comps=0, "", (wins + 0.5 * ties) / comps)
)

Breaking It Down
| Function | What's Happening |
|---|---|
| rng | determines the range of the data |
| r | calculates the relative row index |
| wins | counts how many times the current row value “wins” over the other rows |
| ties | counts the ties |
| comps | counts the total number of comparisons |
| IF() | calculates the mean win rate |
Working in Excel first gave us a way to understand the process before coding. Afterward, we built Python software that pulled data from the LLM Leaderboard, joined it with HELM tables, and exported a CSV to an AWS bucket. This became the foundation of our project.
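To give a sense of what the Python side looked like, here is a minimal sketch: it recomputes the pairwise mean win rate (mirroring the Excel formula above), joins the two tables on model name, and uploads the merged CSV with boto3. The file paths, column names, and bucket name are illustrative placeholders, not our production values.

```python
import boto3
import pandas as pd

def mean_win_rates(scores: pd.DataFrame) -> pd.Series:
    """Pairwise mean win rate per model (row): across every benchmark
    column, count how often this model's score beats the others, with
    ties worth half a win (the same logic as the Excel LET formula)."""
    wins = pd.Series(0.0, index=scores.index)
    comps = pd.Series(0.0, index=scores.index)
    for col in scores.columns:
        col_scores = scores[col].dropna()
        for model, value in col_scores.items():
            others = col_scores.drop(model)
            wins[model] += (value > others).sum() + 0.5 * (value == others).sum()
            comps[model] += len(others)
    return wins / comps  # NaN if a model had no comparisons

# Placeholder inputs: in practice these came from the HELM and
# LLM Leaderboard scrapers, keyed by model name.
helm = pd.read_csv("helm_accuracy.csv", index_col="model")
leaderboard = pd.read_csv("llm_leaderboard.csv", index_col="model")

helm["mean_win_rate"] = mean_win_rates(helm.select_dtypes("number"))
merged = helm.join(leaderboard, how="inner", rsuffix="_lb")
merged.to_csv("merged_results.csv")

# Upload the merged CSV to S3 (the bucket name is a placeholder)
boto3.client("s3").upload_file("merged_results.csv", "example-results-bucket", "merged_results.csv")
```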
Web Scraping Challenges

When we started scraping the LLM Leaderboard, we quickly discovered that not all data appeared by default. The table displayed a limited view until certain buttons were clicked. To solve this, we had to figure out how to use Python to launch a Chrome browser instance and simulate clicks on those buttons. In the HTML, each button was represented with attributes that identified it.

Our code located these elements and triggered the clicks automatically. Once we did that, we could access the full dataset.
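To make that concrete, here is a minimal sketch of the approach using Selenium; the URL and the button selector are placeholders standing in for the attributes we found when inspecting the leaderboard's actual markup.

```python
from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://example.com/llm-leaderboard")  # placeholder URL

# Placeholder selector: the real buttons were identified by attributes
# found while inspecting the page's HTML.
wait = WebDriverWait(driver, 10)
buttons = wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "button.show-all-columns"))
)
for button in buttons:
    button.click()  # expand the hidden columns and rows

# With everything expanded, the full table is in the DOM and pandas can parse it.
tables = pd.read_html(StringIO(driver.page_source))
driver.quit()
```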
Matching Models Across Sources
Matching model names between HELM and the LLM Leaderboard turned out to be far harder than scraping. Names were similar but not identical. We designed a matching algorithm with three strategies: a general fuzzy match, tuned fuzzy matching with weighted tokens, and exact version number checks. Most of this code was generated with ChatGPT, and we learned how the pieces fit together. While we can explain the broader approach, applying the algorithm in a completely new environment would still require more practice. Compared to matching, the scraping felt straightforward, and we are confident we could rebuild that part from scratch.
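For readers curious what that looks like in practice, below is a simplified sketch of the idea using only Python's standard library. The normalization rules, version regex, and similarity threshold are illustrative stand-ins; our tuned algorithm also layered weighted-token fuzzy matching on top of this.

```python
import re
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase and strip punctuation so 'Llama-2 (70B)' becomes 'llama 2 70b'."""
    return re.sub(r"[^a-z0-9.]+", " ", name.lower()).strip()

def versions(name: str) -> set[str]:
    """Pull out version-like tokens such as '2', '3.5', or '70b'."""
    return set(re.findall(r"\d+(?:\.\d+)?b?", normalize(name)))

def best_match(name: str, candidates: list[str], threshold: float = 0.8) -> str | None:
    """Return the closest candidate by fuzzy similarity, rejecting any
    candidate whose version numbers disagree with the query."""
    best, best_score = None, 0.0
    for cand in candidates:
        if versions(name) and versions(cand) and versions(name) != versions(cand):
            continue  # exact version check: mismatched versions disqualify
        score = SequenceMatcher(None, normalize(name), normalize(cand)).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= threshold else None

# Example: prints 'Llama 2 70B Chat'
print(best_match("Llama-2 (70B)", ["Llama 2 70B Chat", "Llama 3 70B Instruct"]))
```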
Applying AI to Proposal Work
Not all our work was technical. Another assignment involved grouping sections of contract proposals into categories. Many of these categories were unfamiliar business frameworks or architectures. LLMs were especially useful here. Instead of consulting experts for every term, we asked AI tools to define and explain them. This helped us complete the task, but it also highlighted a drawback: AI can provide fast answers without helping us build intuition for business terminology. Memorization and internalization are key parts of human learning, and LLMs can sometimes shortcut that process.
School vs Workplace Learning with AI
One of our biggest takeaways was the contrast between how AI is used in school versus the workplace. In school, AI use is discouraged, and the approach is bottom-up: build fundamentals first, then advance. At work, AI use is encouraged, and the approach is top-down: reverse engineer solutions, then practice them as complete tasks. Both models are effective in their environments, but they create different learning paths. Experiencing both helped us see how AI changes the way knowledge is built.
Wrapping Up
So which AI tool is right for you? It depends on your priorities — efficiency, accuracy, subject mastery, or reasoning ability. Over the summer, we built an end-to-end pipeline that scraped HELM and the LLM Leaderboard, matched model names, and merged results into an interactive Tableau dashboard. These visuals let you explore the trade-offs across models for yourself. Want to see which AI is best for you? Dive into the dashboards below.
👉 Explore the Interactive Dashboard