Key takeaways
- Most AI value in modern apps comes from quiet wins like smart search, fraud scoring, and auto-fill, not chatbot demos.
- Generative AI (Claude, GPT) and predictive ML (XGBoost, PyTorch) solve different problems. Picking the wrong one wastes the budget.
- Stripe's payments foundation model lifted card-testing detection from 59% to 97%, but tree models still beat deep nets on most tabular fraud data.
- Gartner expects more than 80% of enterprises to have generative AI APIs in production by end of 2026, and 40% of enterprise apps to embed task-specific AI agents.
- Data quality, latency budgets, and unit economics matter more than which model you pick.
The question in 2026 is no longer whether to put AI into an app. It is which kind, where it sits in the request path, and whether the unit economics hold up after the novelty wears off. A lot of the AI features shipped in the last 18 months were demo-driven and got switched off six months later because they cost more per request than they earned. The teams getting durable value pick smaller spots than the headlines suggest. This post is a tour of where AI and ML earn their keep inside mobile and web application builds.
The split nobody talks about clearly: generative vs predictive
Generative AI and predictive ML are different tools. They share the word "AI" in marketing decks and almost nothing else under the hood. Mixing them up is the most common reason an AI roadmap stalls.
Generative models like Anthropic's Claude or OpenAI's GPT family produce new content from a prompt. They are good at drafting, summarising, classifying free text, calling tools, and turning messy input into structured output. They are slow and expensive per call compared to a classical model, and they hallucinate when pushed past their training. We covered the production tradeoffs in the Claude Opus 4.8 deep dive, including caching, latency, and where long-horizon agent behaviour earns its keep.
Predictive ML is the older sibling: gradient-boosted trees (XGBoost, LightGBM), random forests, and smaller neural nets in PyTorch. They take structured signals and output a number or class: fraud probability, churn risk, delivery ETA. They run in milliseconds, cost almost nothing per call once trained, and are easier to explain to a regulator. Our work for fintech clients almost always starts here, not with a chatbot.
Rule of thumb: if the output is text, an image, or a decision that needs reasoning, reach for a generative model. If the output is a probability or score over a clean feature vector, reach for predictive ML. Plenty of features need both inside the same request.
Where AI pulls real weight inside an app
The use cases below earn back their cost fastest. None require a moonshot, and most can ship in a normal sprint cadence if the data is in reasonable shape.
Smart search and ranking
Lexical search returns rows that contain the typed word. Semantic search returns rows that mean the same thing. Embedding-based retrieval is the highest-value AI upgrade for any app with a search bar. A user typing "non slip shoes for trails" should find the right trail shoe even if the listing never uses the phrase. The pattern is small: an embedding model, a vector store, a re-ranker, bolted onto the existing keyword index. We ship this often in ecommerce solutions and SaaS builds because the conversion lift is easy to measure.
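As a minimal sketch of the retrieval step in that pattern, the snippet below ranks listings by cosine similarity to a query vector. The three-dimensional vectors and listing names are invented for illustration; in a real build the vectors would come from an embedding model, live in a vector store, and the top results would then pass through a re-ranker.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def semantic_search(query_vector, index, top_k=2):
    """Return the top_k document ids, most similar first."""
    scored = [
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in index.items()
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


# Hypothetical toy embeddings; real ones have hundreds of dimensions.
LISTINGS = {
    "grippy-trail-runner": [0.9, 0.1, 0.0],
    "leather-office-shoe": [0.1, 0.9, 0.1],
    "waterproof-hiking-boot": [0.7, 0.2, 0.3],
}

# Stands in for the embedding of "non slip shoes for trails".
QUERY = [0.85, 0.05, 0.1]

results = semantic_search(QUERY, LISTINGS)
print(results)
```

The trail runner ranks first even though no listing text is compared at all, which is the point: matching happens on meaning encoded in the vectors, not on shared words.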
Personalisation and recommendations
Recommendations are the oldest commercial ML use case and still one of the best. Smaller teams can now reach respectable quality without a dedicated ML platform, using managed feature stores and pre-trained two-tower models. The hard part has moved from modelling to evaluation: making sure the recommender is not chasing short-term engagement at the cost of retention. The design side lives in the web application examples post.
Fraud detection and risk scoring
Fraud is where predictive ML still beats almost everything else. Device fingerprint, velocity, geolocation, and BIN range feed a tree ensemble that gates the transaction, with humans on the borderline cases. Stripe's payments foundation model lifted card-testing detection from 59% to 97%, but it took tens of billions of labelled transactions to train. For most apps the right answer is a well-tuned XGBoost model plus an LLM on top to write the analyst's notes. This pattern is bread-and-butter in our fintech industry work.
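The gating step can be sketched as a three-way decision on the model's score. This is an illustration only: the function name and both thresholds are hypothetical, and real cut-offs would be tuned against labelled outcomes and review-team capacity.

```python
def gate_transaction(fraud_probability, approve_below=0.20, block_above=0.90):
    """Map a fraud score in [0, 1] to approve, manual_review, or block.

    The thresholds are placeholder values for illustration, not
    recommendations.
    """
    if not 0.0 <= fraud_probability <= 1.0:
        raise ValueError("fraud_probability must be between 0 and 1")
    if fraud_probability >= block_above:
        return "block"
    if fraud_probability < approve_below:
        return "approve"
    # Borderline scores go to a human analyst.
    return "manual_review"


for score in (0.05, 0.50, 0.95):
    print(score, gate_transaction(score))
```

Keeping the thresholds as parameters makes it easy to widen or narrow the manual-review band without retraining the model.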
Image recognition
On-device image models got good enough for real work without a server round trip: receipt scanning, ID verification, defect detection, before-and-after photo grading. Apple's Vision framework and Google's ML Kit run in tens of milliseconds on a mid-range phone. The right architecture is a small on-device model for the common path and a server-side model for the hard cases. Our cross-platform app development team wires both into the same feature flag.
Content moderation
Any app with user-generated content needs moderation, and 2026 is the year it became affordable. Amazon Rekognition runs at around one dollar per thousand images, Azure Content Moderator at $0.75 per thousand. For text, a small classifier or an LLM with a moderation prompt does the bulk of the work, with human reviewers handling appeals. The common mistake is moderating only on upload and never re-scanning when models improve.
Voice
Voice finally crossed the perceptual latency threshold. AssemblyAI prices transcription at $0.015 per minute, and speech-to-speech models from OpenAI and Google brought end-to-end latency under 400 ms. The useful shape is not "talk to the app" but voice as a faster input: dictating a delivery note, narrating a service report, asking a banking app to find a transaction. We see it most often in healthcare and field-service apps, where typing on a phone is painful.
Picking the right model family for the job
The table below is the cheat sheet we hand to PMs at kickoff. It covers the choices that come up in 80% of app builds.
| Job | Reach for | Typical cost |
|---|---|---|
| Free-text classification, summary, extraction | Claude, GPT, or a fine-tuned LLM | $0.001 to $0.05 per call, cacheable |
| Probability scoring on tabular data | XGBoost, LightGBM, or a small PyTorch model | Sub-cent per call after training |
| Semantic search and retrieval | Embedding model plus vector store plus re-ranker | Cents per thousand queries |
| Image classification or detection | On-device CoreML or ML Kit, server fallback | Mostly free on-device, pennies for hard cases |
| Speech to text and back | Hosted ASR plus TTS, or a unified speech model | $0.01 to $0.03 per minute |
| Multi-step task automation | Agent loop on a frontier LLM with tool use | $0.05 to $1 per task, watch context bloat |
For a walkthrough of which features to pick first, the how to launch a mobile app in 2026 guide pairs well with this table.
The data plumbing problem nobody wants to own
Most AI projects fail at the data layer, not the model layer. The failure modes are mundane.
- Event tracking that was never designed as a training signal, with inconsistent naming and missing user IDs.
- PII sprayed across logs, blocking model teams without a long privacy review.
- No feature store, so the same SQL gets rewritten for training and inference and the versions drift apart.
- No ground-truth pipeline. The model ships and nobody collects the labels needed to know whether it is right.
Fixing this is not glamorous, but it separates the apps that keep getting smarter from the ones that ship one AI feature and stop. For teams without a data engineer in-house, our staff augmentation bench can stand up the basics. On compliance, the audit-ready AI agents post covers the decision-trace work most teams skip until a regulator asks.
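The drift problem in the feature-store bullet above comes from computing the same feature in two places. One way to avoid it, sketched here with hypothetical field and feature names, is a single function that both the training job and the live scoring path import.

```python
from datetime import datetime, timedelta


def build_features(transactions, now):
    """Compute 24-hour features for one user.

    Training and inference both call this function, so the feature
    definitions cannot drift apart. Field names are illustrative.
    """
    window_start = now - timedelta(hours=24)
    recent = [t for t in transactions if window_start <= t["at"] <= now]
    amounts = [t["amount"] for t in recent]
    return {
        "txn_count_24h": len(recent),
        "txn_total_24h": sum(amounts),
        "txn_max_24h": max(amounts, default=0.0),
    }


now = datetime(2026, 1, 15, 12, 0)
history = [
    {"at": datetime(2026, 1, 15, 9, 0), "amount": 20.0},
    {"at": datetime(2026, 1, 14, 18, 0), "amount": 35.5},
    {"at": datetime(2026, 1, 10, 8, 0), "amount": 500.0},  # outside window
]

features = build_features(history, now)
print(features)
```

Passing `now` explicitly matters for training: it lets the job rebuild features as of each historical label's timestamp instead of leaking later data.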
What it costs to add AI to a normal app
Costs depend on request volume. The table below shows realistic ranges from production builds in the last year, on top of normal hosting cost.
| Feature | Build | Run at 100k MAU | Run at 1M MAU |
|---|---|---|---|
| Semantic search | 3-6 weeks | $80-$250 / mo | $600-$2,000 / mo |
| Recommendation feed | 4-8 weeks | $150-$500 / mo | $1,500-$5,000 / mo |
| Fraud scoring | 6-10 weeks | $200-$600 / mo | $2,000-$7,000 / mo |
| Image moderation | 2-4 weeks | $50-$300 / mo | $500-$3,000 / mo |
| Voice input | 3-6 weeks | $200-$1,000 / mo | $2,000-$10,000 / mo |
| Generative assistant | 6-12 weeks | $500-$3,000 / mo | $5,000-$40,000 / mo |
The generative assistant row surprises people. A chatbot handling 5% of support tickets sounds cheap until each conversation runs ten tool calls on a 20,000-token context. Caching and per-turn model selection make a 5 to 10x difference. The mobile app design cost breakdown is useful when sizing a fresh build.
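A back-of-envelope sketch of why the context size dominates. Every price here is an assumption for illustration: $3 per million input tokens, cached tokens billed at 10% of that, 90% of the context served from cache, and output tokens ignored. Check your provider's current pricing before reusing the numbers.

```python
def conversation_input_cost(
    calls,
    context_tokens,
    price_per_million,
    cached_fraction=0.0,
    cached_price_ratio=1.0,
):
    """Input-token cost in dollars for one conversation.

    Each call is assumed to resend the full context. cached_fraction is
    the share of that context billed at the discounted cached rate.
    """
    cached = context_tokens * cached_fraction
    fresh = context_tokens - cached
    billable = fresh + cached * cached_price_ratio
    return calls * billable * price_per_million / 1_000_000


# Ten tool calls on a 20,000-token context, hypothetical $3 per million.
baseline = conversation_input_cost(10, 20_000, 3.0)
with_cache = conversation_input_cost(
    10, 20_000, 3.0, cached_fraction=0.9, cached_price_ratio=0.1
)

print(f"no caching:   ${baseline:.3f} per conversation")
print(f"with caching: ${with_cache:.3f} per conversation")
print(f"ratio:        {baseline / with_cache:.1f}x")
```

Under these assumptions a conversation costs roughly $0.60 uncached and $0.11 cached, a little over 5x apart, before any saving from routing simple turns to a cheaper model.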
Build, buy, or borrow
Almost every AI feature sits on someone else's model. The question is which layers you own and which you rent. Rent the weights for commodity tasks: speech recognition, OCR, baseline moderation, embeddings. The hosted versions are good enough and the per-call price keeps falling. Build the layer on top: retrieval logic, feature engineering, evaluation, human review, the fallback when the model is wrong. That is the durable advantage and what a generic vendor cannot ship for you.
The exception is when your data is proprietary and large enough to fine-tune on. A retailer with five years of clean reviews can fine-tune an embedding model and beat a generic one. A regional bank with hundreds of thousands of labelled fraud cases can train a tighter model than any API. The AI industry page outlines how we structure those engagements, and the Tamreeni case study shows the pattern in production.
What to do on Monday morning
For an existing app, pick one user-facing problem where a measurable metric will move (search conversion, fraud loss rate, time-to-first-reply). Pick the simplest model that can solve it. Instrument ground-truth collection before the model goes live, then ship behind a feature flag to 5% of traffic with a clear holdout. Only then talk about scaling. For a new build, bake the data layer into the schema on day one: event names, user IDs, and a feature-store interface are cheap to add early and painful to retrofit. The web app redesign checklist covers the same for an existing product.
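The 5% rollout can be sketched with deterministic hashing, so a given user always lands in the same group without storing an assignment table. The function and feature names are hypothetical; most feature-flag services offer an equivalent out of the box.

```python
import hashlib


def in_rollout(user_id, feature, percent=5):
    """Return True if this user falls inside the rollout percentage.

    Hashing feature and user together gives each feature an
    independent split, and the result is stable across requests.
    """
    key = f"{feature}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return bucket < percent


def assign_group(user_id, feature, percent=5):
    """Label a user as treatment or holdout for metric comparison."""
    return "treatment" if in_rollout(user_id, feature, percent) else "holdout"


users = [f"user-{n}" for n in range(10_000)]
treated = sum(in_rollout(u, "semantic-search") for u in users)
print(f"{treated / len(users):.1%} of users see the new feature")
```

Log the group label alongside the metric you expect to move, so the treatment and holdout comparison is a single query rather than a reconstruction.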
Key takeaways
- Generative AI and predictive ML are different tools. Use Claude or GPT for free-text work, XGBoost-style models for probabilities on structured data.
- The highest-ROI AI features are the boring ones: search, ranking, fraud scoring, on-device image work.
- Hosted APIs handle commodity tasks at fair prices in 2026: image moderation runs around a dollar per thousand images, voice transcription a cent or two per minute.
- Data plumbing and evaluation matter more than model choice.
- Build the layers above the model, rent the model itself, fine-tune only when your data is large and proprietary.
FAQ
What is the difference between generative AI and machine learning in apps?
Generative AI produces new content from a prompt and suits open-ended tasks like drafting, summarising, or classifying free text. Traditional ML trains on labelled examples to output a probability or class on structured data. Most modern apps use both: an LLM for language and a smaller model to score risk or forecast values inside the same request.
Should we use Claude, GPT, or an open-source model?
The choice comes down to latency, cost per token, and whether you need on-premise hosting. Claude and GPT lead on reasoning and tool use. Open-weight models like Llama and Mistral are cheaper at scale and easier to self-host. A reasonable default is to start on a hosted frontier model, then move high-volume paths to a cheaper or open model once prompts are stable.
How much does it cost to add AI to an existing mobile app?
A single AI feature usually lands between $15,000 and $60,000 in build cost, plus monthly run cost that scales with traffic. A semantic search upgrade for a 100k-user app runs about $20,000 to build and $150 a month. A generative assistant with tool use is closer to $80,000 to build and several hundred to several thousand a month.
Is on-device AI worth the extra work over a server API?
Yes, when latency, offline support, or privacy matters. On-device image classification, OCR, and small language models avoid the round trip and keep data on the phone, which matters for healthcare, finance, and field-service apps. Build effort is higher because of device fragmentation, but the UX win is usually worth it.
How do we stop an AI feature from hallucinating in production?
Combine retrieval-augmented generation, structured output, and a deterministic fallback. Ground the model in your own data, force JSON output you validate, and route to a non-AI path when validation fails. Log every input and output, sample weekly, and feed failures back into the prompt. Treat the model as one component, not the whole feature.
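A minimal sketch of the validate-then-fallback step, using only the standard library. The schema (an `answer` string plus a non-empty `source_ids` list) and the fallback message are assumptions for illustration; the model call and the retrieval step are out of scope here.

```python
import json

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "source_ids": list}


def parse_model_output(raw_text):
    """Return the parsed dict if it matches the schema, else None."""
    try:
        data = json.loads(raw_text)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return None
    if not data["source_ids"]:
        return None  # an answer that cites no retrieved document is rejected
    return data


def answer_or_fallback(raw_text, fallback="Let me connect you with support."):
    """Use the model's answer only when it validates; otherwise fall back."""
    parsed = parse_model_output(raw_text)
    if parsed is None:
        return {"answer": fallback, "source_ids": [], "used_fallback": True}
    return {**parsed, "used_fallback": False}


good = answer_or_fallback('{"answer": "Refunds take 5 days.", "source_ids": ["kb-12"]}')
bad = answer_or_fallback("Sure! Refunds usually take about a week.")
print(good["used_fallback"], bad["used_fallback"])
```

Recording `used_fallback` on every response gives the weekly sampling a ready-made failure rate to watch.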
Do we need a dedicated ML team to ship AI features?
For a first feature or two, no. A backend engineer and a PM who understands evaluation can ship semantic search, a recommendation feed, or a generative assistant on hosted APIs. You need dedicated ML capacity once you have multiple models in production, custom training pipelines, or regulated use cases needing full reproducibility.
If you are weighing AI options for a product build or want a second opinion on a roadmap, talk to the Brandrums team and we will scope it straight, including where AI is the wrong answer. The pricing page covers typical engagement shapes and the full service list shows where AI work sits inside a wider build.


