Microsoft Unveils High-Speed AI Models for Images, Transcription, and Voice Synthesis

Microsoft has launched three new artificial intelligence models designed to enhance image processing, speech transcription, and voice synthesis. The first model, MAI-Image-2, generates high-resolution images up to 1024x1024 pixels based on user instructions, using advanced non-embedding parameters. It is twice as fast as its predecessor.

MAI-Transcribe-1 transcribes speech 2.5 times faster than earlier Microsoft models with an error rate of just 3.9%, outperforming competitors such as Gemini 3.1 Flash and OpenAI’s GPT-Transcribe. It currently supports batch transcription; future updates will add real-time audio streaming and speaker diarization.
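
The announcement does not say how the 3.9% error rate is measured. Assuming it is a word error rate (WER), the metric most commonly reported for speech recognition, it would mean roughly four word-level mistakes (substitutions, deletions, or insertions) per 100 words of reference transcript. The sketch below shows how such a figure could be computed; the function name and sample sentences are illustrative and not part of any Microsoft API.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: "error rate" in the article means WER. This is a generic
    textbook calculation, not Microsoft's evaluation code.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (before the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"WER: {wer:.1%}")
```

Under this definition, lower is better, and scores are only comparable between systems when they are measured on the same test recordings with the same text normalization.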

MAI-Voice-1 creates synthetic speech from user scripts, offering both built-in AI voices and the option to use custom voices. All three models are competitively priced and available through Microsoft Foundry and other services, including Bing and PowerPoint.

Microsoft also announced the Maia 200, a new inference-optimized chip that boosts AI performance across benchmarks.