Recent advances in AI have reshaped the landscape of automatic speech recognition (ASR). Models like Whisper are praised for producing accurate transcriptions. However, questions remain about their performance under atypical conditions. How well do these systems handle regional dialects, children's speech, or non-native speakers? How accurate are they in environments with multiple speakers or significant background noise? Beyond accuracy, processing large volumes of speech data poses its own infrastructure challenges. What are the best strategies for managing such large-scale transcription tasks in a structured and efficient manner?