AI Software Development Company Checklist | Advisory Apps

The right artificial intelligence software development company for a Malaysian enterprise is one that can pick the correct model for your workload, architect RAG and guardrails around it, handle PDPA-sensitive data on-prem when needed, and prove the system works with real evaluation metrics — not demos. This checklist walks Malaysian CTOs through exactly what to test during vendor selection in 2025, so your pilot does not die between proof-of-concept and production.

What Should a Malaysian CTO Actually Look For?

AI projects fail at the boring parts: data plumbing, evaluation, and compliance. Before you look at any vendor’s demo, confirm they can answer these four questions without hand-waving:

Which foundation model fits this use case, and why not the others?
How will retrieval, prompts, and fine-tuning be combined?
Where does customer data physically live, and how is it isolated?
How do we measure accuracy, hallucination rate, and drift after launch?

A vendor that leads with “we use ChatGPT” has not done the work. A vendor that leads with your data and your evaluation harness has. Advisory Apps has shipped AI features across 200+ projects since 2012, and the pattern holds — the teams that win are the ones that treat the model as one component in a much larger system.

Which Foundation Model Is Right for the Job?

There is no universal best model in 2025 — there are tradeoffs. Your partner should be comfortable across the major frontier models and at least one open-weights family for on-prem work.

Model family	Best for	Hosting	Typical cost profile
Claude 4.5 (Anthropic)	Long-context reasoning, agentic workflows, code	Cloud API, AWS Bedrock	Higher per token, lower retries
GPT-5 (OpenAI)	General chat, structured output, tool use	Cloud API, Azure OpenAI	Mid-high, broad ecosystem
Gemini 2.5 (Google)	Multimodal, long video/doc ingestion	Vertex AI	Competitive, strong on docs
Llama 3.x / Qwen 2.5	On-prem, fine-tuning, cost control	Self-hosted GPU	Capex heavy, low marginal cost

A qualified artificial intelligence development partner should run your top three use cases through at least two of these in a bake-off, share the eval scores, and recommend based on accuracy per ringgit — not on whichever model they happen to have a reseller margin on.
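As a rough illustration of "accuracy per ringgit", the sketch below ranks bake-off results. The model labels, scores, and costs are made-up placeholders, and the minimum accuracy floor is an added assumption (so that a cheap but inaccurate model cannot win on price alone), not something the checklist prescribes.

```python
from dataclasses import dataclass


@dataclass
class BakeoffResult:
    model: str        # label for the candidate model (placeholder names below)
    correct: int      # test cases answered correctly
    total: int        # test cases attempted
    cost_myr: float   # total spend for the run, in ringgit

    @property
    def accuracy(self) -> float:
        return self.correct / self.total

    @property
    def accuracy_per_ringgit(self) -> float:
        return self.accuracy / self.cost_myr


def rank_models(results, min_accuracy=0.0):
    """Drop models below the accuracy floor, then sort by accuracy per ringgit."""
    eligible = [r for r in results if r.accuracy >= min_accuracy]
    return sorted(eligible, key=lambda r: r.accuracy_per_ringgit, reverse=True)


# Hypothetical numbers, for illustration only.
results = [
    BakeoffResult("model_a", correct=92, total=100, cost_myr=40.0),
    BakeoffResult("model_b", correct=88, total=100, cost_myr=22.0),
    BakeoffResult("model_c", correct=70, total=100, cost_myr=10.0),
]
ranked = rank_models(results, min_accuracy=0.85)
for r in ranked:
    print(f"{r.model}: accuracy={r.accuracy:.2f}, per_ringgit={r.accuracy_per_ringgit:.4f}")
```

With these placeholder figures, model_c is filtered out by the floor even though it is the cheapest, which is the kind of tradeoff a vendor's recommendation should make visible.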

RAG, Fine-Tuning, or Prompting — Who Decides?

The honest answer for most Malaysian enterprise workloads in 2025 is: start with prompting, layer in retrieval-augmented generation (RAG), and only fine-tune when you have exhausted both. Your vendor should walk you through this decision tree explicitly:

Prompting alone — good enough for summarisation, classification, and structured extraction on general content.
RAG — required once the model needs to cite internal policies, SOPs, product catalogues, or historical tickets. Vector store, chunking strategy, reranking, and citation UI all matter.
Fine-tuning — justified when tone, format, or domain vocabulary cannot be reliably prompted, and you have clean labelled data in the thousands.
Pre-training — almost never the right answer for a Malaysian SME or mid-market project.
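The decision tree above can be sketched as a small function. The parameter names and the default threshold of 1,000 labelled examples are illustrative assumptions (the text only says "in the thousands"); a real engagement would replace these flags with evidence from evals.

```python
from enum import Enum


class Approach(Enum):
    PROMPTING = "prompting"
    RAG = "rag"
    FINE_TUNING = "fine-tuning"


def recommend_approach(
    needs_internal_knowledge: bool,
    prompting_meets_style: bool,
    labelled_examples: int,
    min_labelled_examples: int = 1000,  # assumed threshold, not from the source
) -> list[Approach]:
    """Return the techniques to layer, in the order they should be tried."""
    stack = [Approach.PROMPTING]  # always start with prompting
    if needs_internal_knowledge:
        # The model must cite internal policies, SOPs, catalogues, or tickets.
        stack.append(Approach.RAG)
    if not prompting_meets_style and labelled_examples >= min_labelled_examples:
        # Tone, format, or vocabulary cannot be prompted, and clean data exists.
        stack.append(Approach.FINE_TUNING)
    return stack


print(recommend_approach(True, True, 0))
print(recommend_approach(True, False, 200))
print(recommend_approach(False, False, 5000))
```

Pre-training is deliberately absent from the enum, mirroring the point that it is almost never the right answer at this scale.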

If a vendor jumps straight to fine-tuning or proposes training a “Malaysia LLM” from scratch, push back. That is usually a sign they are selling GPU hours, not outcomes.

How Do You Handle PDPA and Data Governance?

Malaysia’s Personal Data Protection Act (PDPA) 2010 was amended in 2024 to add mandatory breach notification and a data portability right, and regulators such as Bank Negara Malaysia (BNM) and the Inland Revenue Board (LHDN) have their own expectations about where regulated data can live. Your AI vendor must be able to answer:

Is customer data used to train third-party models? (It should not be — opt out of training on every API.)
Where is the inference endpoint physically located? Singapore? US? Self-hosted in KL?
How are prompts and responses logged, for how long, and who can read them?
Is there a data processing agreement (DPA) aligned with PDPA obligations?
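One way to make these questions comparable across vendors is to record the answers in a structured form and flag the gaps. The sketch below is only an illustration: the field names, region labels, and retention limit are hypothetical values you would set from your own policy, and none of them are derived from the PDPA text itself.

```python
from dataclasses import dataclass


@dataclass
class VendorDataPosture:
    trains_on_customer_data: bool  # should be False on every API
    inference_region: str          # where the endpoint physically runs
    log_retention_days: int        # how long prompts and responses are kept
    has_dpa: bool                  # data processing agreement in place


def governance_gaps(posture, allowed_regions, max_retention_days):
    """Return a list of human-readable gaps; an empty list means none found."""
    gaps = []
    if posture.trains_on_customer_data:
        gaps.append("customer data is used to train third-party models")
    if posture.inference_region not in allowed_regions:
        gaps.append(
            f"inference endpoint in {posture.inference_region} is outside the allowed regions"
        )
    if posture.log_retention_days > max_retention_days:
        gaps.append(
            f"logs kept {posture.log_retention_days} days, above the {max_retention_days}-day limit"
        )
    if not posture.has_dpa:
        gaps.append("no data processing agreement in place")
    return gaps


# Hypothetical vendor answers and policy limits.
posture = VendorDataPosture(False, "us-east", 30, True)
gaps = governance_gaps(posture, allowed_regions={"kl-onprem", "singapore"}, max_retention_days=90)
print(gaps)
```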

For banks, insurers, healthcare, and government-linked work, the answer is often on-prem or private-cloud AI. Advisory Apps operates on-prem AI infrastructure precisely for regulated industries where data cannot leave the customer’s perimeter — a capability most local shops do not have, because it requires real GPU ops, not just an OpenAI API key.

What Does Good MLOps and Evaluation Look Like?

The difference between a pilot and a production AI system is measurement. Before signing a contract, ask to see the vendor’s evaluation harness on a past project. You want evidence of:

Golden datasets — a curated set of prompts with expected outputs, versioned in git.
Automated evals — accuracy, faithfulness, and hallucination scores run on every prompt change.
Human review loops — SME feedback captured and fed back into the eval set.
Drift monitoring — production traffic sampled and scored weekly, with alerts when quality drops.
Prompt and model versioning — every change traceable, every rollback one click.
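A minimal sketch of what such a harness might look like, assuming a golden set of prompt and expected-answer pairs and a `generate` callable that wraps whatever model is under test. The substring check is a crude stand-in for real accuracy and faithfulness scoring, and the golden cases, stub model, and 0.05 tolerance are all invented for illustration.

```python
from typing import Callable

# Hypothetical golden set; in practice this would be a versioned file in git.
GOLDEN_SET = [
    {"prompt": "How many days of annual leave do staff get?", "expected": "14 days"},
    {"prompt": "What is the refund window?", "expected": "30 days"},
    {"prompt": "Who approves travel claims?", "expected": "line manager"},
]


def run_evals(generate: Callable[[str], str], golden_set: list[dict]) -> float:
    """Return the fraction of cases whose output contains the expected text."""
    passed = sum(
        1
        for case in golden_set
        if case["expected"].lower() in generate(case["prompt"]).lower()
    )
    return passed / len(golden_set)


def drift_alert(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """True when quality has dropped by more than `tolerance` from the baseline."""
    return (baseline - current) > tolerance


# Stub standing in for a real model call, so the sketch runs offline.
CANNED = {
    "How many days of annual leave do staff get?": "Staff receive 14 days of annual leave.",
    "What is the refund window?": "Refunds are accepted within 30 days.",
}


def stub_model(prompt: str) -> str:
    return CANNED.get(prompt, "I don't know")


score = run_evals(stub_model, GOLDEN_SET)
print(f"accuracy={score:.2f}, alert={drift_alert(0.90, score)}")
```

The same `run_evals` call can be wired into CI so it runs on every prompt change, and into a scheduled job over sampled production traffic for the weekly drift check.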

A vendor that cannot show you any of this is selling you a demo. You will pay for production later, probably twice.

How Do You Prevent Hallucinations in Production?

Hallucinations are a systems problem, not a model problem. Guardrails that actually work in 2025 combine:

Retrieval grounding — force the model to cite retrieved chunks and refuse when retrieval returns nothing relevant.
Structured output — JSON schemas with validators that reject malformed responses before they hit the UI.
Secondary verification — a cheaper model (or rules engine) checks the primary model’s output for contradictions.
Human-in-the-loop for high-stakes actions — anything involving money, medical advice, or legal text gets a review queue.
Refusal training — explicit instructions and evals that reward “I don’t know” over confident guessing.
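Two of these guardrails, retrieval grounding and structured output validation, can be sketched with the standard library alone. The response shape (`answer` plus `citations`) and the chunk ids are assumptions made for this example; a production system would more likely use a schema library and return richer error information than a bare refusal.

```python
import json


def refusal() -> dict:
    """Fallback returned whenever a response cannot be trusted."""
    return {"answer": "I don't know", "citations": []}


def validate_response(raw: str, retrieved_ids: set[str]) -> dict:
    """Return the parsed response if it is well-formed and grounded, else a refusal."""
    if not retrieved_ids:
        return refusal()  # retrieval returned nothing relevant
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return refusal()  # malformed JSON never reaches the UI
    if not isinstance(data, dict):
        return refusal()
    answer = data.get("answer")
    citations = data.get("citations")
    if not isinstance(answer, str) or not isinstance(citations, list) or not citations:
        return refusal()  # wrong types, or no citations at all
    if not all(isinstance(c, str) and c in retrieved_ids for c in citations):
        return refusal()  # cites a chunk that was never retrieved
    return {"answer": answer, "citations": citations}


# Hypothetical model outputs and chunk ids.
ok = validate_response(
    '{"answer": "Refunds take 7 days", "citations": ["doc-1"]}', {"doc-1", "doc-2"}
)
ungrounded = validate_response('{"answer": "x", "citations": ["doc-9"]}', {"doc-1"})
print(ok)
print(ungrounded)
```

Defaulting to a refusal on any failed check keeps the validator conservative: a wrong answer has to pass every test, while "I don't know" is always available.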

This is the same pattern Advisory Apps uses for AI features across enterprise deployments, including document intelligence work layered on the same infrastructure that powers Perodua’s retention systems and regulated platforms such as MyJPJ’s road tax and licensing flows. The model is the cheap part; the guardrails are the product.

The Short Vendor Checklist

Use this as your first-meeting screener:

Can they name three foundation models and explain when each wins?
Do they have at least one production AI system live in Malaysia or SEA?
Can they host on-prem or in a private VPC if PDPA requires it?
Do they show eval scores, not just demos?
Do they own the prompts, guardrails, and monitoring end-to-end?
Is there a post-launch SLA for accuracy drift, not just uptime?
Are they transparent about token costs and can they forecast monthly spend?

If the vendor scores five or more out of seven, they are worth a paid discovery. If they score three or fewer, keep looking.
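The screener can be tallied with a few lines of code. The item wording below paraphrases the seven questions; a score of four is not covered by the rule above, so the sketch labels it "borderline", which is an assumption rather than part of the checklist.

```python
CHECKLIST = [
    "explains when each of three foundation models wins",
    "has a production AI system live in Malaysia or SEA",
    "can host on-prem or in a private VPC",
    "shows eval scores, not just demos",
    "owns prompts, guardrails, and monitoring end-to-end",
    "offers a post-launch SLA for accuracy drift",
    "is transparent about token costs and forecasts spend",
]


def screen_vendor(answers: list[bool]) -> tuple[int, str]:
    """Score one yes/no answer per checklist item and return (score, verdict)."""
    if len(answers) != len(CHECKLIST):
        raise ValueError(f"expected {len(CHECKLIST)} answers, got {len(answers)}")
    score = sum(answers)  # True counts as 1, False as 0
    if score >= 5:
        return score, "paid discovery"
    if score <= 3:
        return score, "keep looking"
    return score, "borderline"  # assumed label for a score of four


print(screen_vendor([True, True, True, True, True, False, False]))
```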

Ready to De-Risk Your AI Build?

Picking an artificial intelligence software development company is a multi-million-ringgit decision that will shape your data strategy for years. If you want a second opinion on your shortlist, your model choice, or your PDPA posture, book a free consultation with Advisory Apps — we will walk through your use case, your data, and the honest tradeoffs before you sign anything.