마티 스타니셰프스키 CEO "텍스트 시대 가고 음성(Voice)이 온다"…AI 유니콘 일레븐랩스가 한국을 택한 이유

[테크수다 기자 도안구 eyeball@techsuda.com] "우리는 지금 텍스트 중심의 인터넷 환경이 음성(Voice) 중심으로 재편되는 변곡점에 서 있습니다. 언어의 장벽이 사라지고, 모든 콘텐츠를 현지 언어와 감정 그대로 즐기는 세상이 곧 도래할 것입니다."

[발표하고 있는 마티 스타니셰프스키 공동창업자 겸 CEO. 테크수다 촬영]

글로벌 음성 AI(Voice AI) 분야의 '슈퍼 유니콘'으로 꼽히는 일레븐랩스(ElevenLabs)의 마티 스타니셰프스키(Mati Staniszewski) CEO가 한국을 찾았다.

2022년 창업한 일레븐랩스는 불과 3년 만에 기업가치 66억 달러(약 9조 2천억 원), 월간 활성 사용자 5천만 명 이상을 확보하며 AI 음성 분야의 선두주자로 떠올랐다. 포춘 500대 기업의 75%가 이 회사의 고객이라는 점은 기술력을 방증한다. 전 세계 테크 업계의 주목을 한몸에 받고 있는 그는, 아시아 시장 확장의 전초기지로 주저 없이 '한국'을 지목했다.

21일 서울 반포동에서 열린 기자간담회에서 마티 CEO는 단순히 글자를 읽어주는 TTS(Text-to-Speech)를 넘어, 인간과 기계가 소통하는 방식 자체를 혁신하겠다는 야심 찬 청사진을 내놨다. 현장에서 지켜본 일레븐랩스의 기술은 단순한 '기능'의 진보가 아닌, '경험'의 재창조에 가까웠다.

AI 음성 시장은 2025년 현재 급격한 성장세를 보이고 있다. 시장조사기관들에 따르면 글로벌 AI 음성 생성 시장은 2023년 35.6억 달러에서 2030년 217.5억 달러 규모로 성장할 전망이다. 연평균 성장률(CAGR)은 약 29~30%에 달한다. (출처 https://www.grandviewresearch.com/industry-analysis/ai-voice-generators-market-report)

경쟁도 치열하다. 구글(Google Cloud Text-to-Speech), 마이크로소프트(Azure Speech), 아마존(Amazon Polly) 등 빅테크 기업들이 클라우드 기반 음성 서비스를 제공하고 있으며, 딥그램(Deepgram), 카르테시아(Cartesia), 오픈AI(OpenAI TTS) 등 전문 스타트업들도 각자의 강점을 내세우며 시장 점유율 확대에 나섰다.

"한국어의 '결'을 읽는다"... 단순 번역 넘어선 '문화적 현지화'

이날 간담회의 백미는 단연 한국어 시연이었다. 김유정의 소설을 낭독하는 AI의 목소리에는 단순히 문장을 읽는 기계음이 아닌, 상황에 따른 '연기'가 묻어났다. 숨을 고르거나, 웃음을 터뜨리고, 문맥에 따라 어조를 바꾸는 기술은 소름이 돋을 정도였다.

마티 스타니셰프스키 CEO는 "텍스트에는 없는 감정(Emotion)과 억양(Intonation)을 음성에 담는 것이 우리 기술의 핵심"이라며 "한국어는 문맥에 따라 단어의 의미가 달라지는 고맥락 언어인 만큼, 지난 1년간 한국어 발음과 뉘앙스 최적화에 사활을 걸었다"고 설명했다.

이를 위해 일레븐랩스는 데이터의 '양'보다 '질'에 집중했다. 전문 보이스 코치들이 참여해 오디오 데이터에 감정과 어투에 대한 주석(Annotation)을 다는 방식으로 모델을 학습시켰다. 기계적인 학습을 넘어 인간의 '지도 편달'이 들어간 셈이다.

일레븐랩스의 한국 시장 공략은 이미 상당히 구체적이다. 네이버와 LG유플러스로부터 전략적 투자를 유치해 로컬라이제이션(현지화) 파트너십을 구축했다. 이미 크래프톤은 게임 내 인터랙티브 캐릭터에, MBC와 SBS는 콘텐츠 제작 및 후시 녹음(ADR) 과정에 이들의 솔루션을 도입했다.

스타니셰프스키 CEO는 AI 오디오의 미래를 세 가지 키워드로 요약했다.

첫째, 음성은 기술과 상호작용하는 주요 인터페이스가 될 것이다. 웨어러블부터 자동차까지, 모든 기기가 우리가 무엇을 말하는지, 어떻게 말하는지를 이해하게 된다.

둘째, 언어 장벽이 사라진다. 모든 음성과 콘텐츠가 실시간 번역과 완벽한 더빙을 통해 전 세계 어디서나 접근 가능해진다.

셋째, 옴니모달·옴니채널·옴니프레즌트 시대가 열린다. 오디오를 넘어 이미지와 영상까지 아우르는 완전한 크리에이티브 경험이 제공된다.

일레븐랩스는 향후 3~5년 내 기업공개(IPO)를 계획하고 있다. 스타니셰프스키 CEO는 "파트너사와 고객사, 그리고 개인 투자자들이 함께 참여할 수 있는 기회를 제공하고 싶다"며 "모든 것이 순조롭다면 3년 내에도 가능할 것"이라고 전망했다.

왜 하필 지금, 왜 한국인가?

홍상원 일레븐랩스 한국 지사장은 한국 시장을 "혁신을 가장 빠르게 수용하는 까다로운 테스트베드"라고 정의했다.

홍 지사장은 "한국은 혁신을 가장 빠르게 수용하는 시장입니다"라며 "한국은 모바일 인터넷 보급률 99%, 5G 인프라 세계 1위 수준으로 대규모 음성 트래픽을 감당할 준비가 된 나라"라고 전했다.

그는 또 "특히 K-콘텐츠의 글로벌 파급력은 우리의 다국어 더빙 기술(Dubbing Studio)을 증명할 최고의 무대입니다. 23%의 얼리어답터 비율, K-Pop과 K-Drama로 입증된 글로벌 콘텐츠 파워, 세계에서 가장 까다로운 서비스 기준이 한국을 최적의 시장으로 만들고 있습니다"라고 강조했다.

일레븐랩스가 내세운 'TTS v3' 모델은 70개 이상의 언어를 지원하면서 원작 배우의 목소리 톤과 감정선을 그대로 다른 언어로 입혀준다. 기존 더빙 방식 대비 비용은 95%, 시간은 90%까지 절감된다. 넷플릭스 '오징어 게임'의 대사를 영어, 스페인어, 프랑스어로 듣더라도 이정재 배우 특유의 음색과 감정이 유지되는 미래가 열리는 것이다.

실제로 한국은 대기업의 65.1%가 이미 AI를 도입했고, 근로자의 63.5%가 생성형 AI를 일상적으로 활용한다. 이는 글로벌 평균의 2배가 넘는 수치다. 정부도 2026년 AI 분야에 10조 1천억 원의 예산을 편성하며 AI 3대 강국 도약을 선언한 상태다.

일레븐랩스의 핵심 기술은 초저지연 AI 음성 합성이다. 0.5초 미만의 지연 속도로 인간 수준의 자연스러움을 구현하며, 7,000개 이상의 보이스와 32개 언어를 지원한다. 특히 웃음, 한숨, 감탄사, 숨소리까지 재현하는 수준의 표현력은 경쟁사들과 차별화되는 지점이다.

일레븐랩스의 차별점은 '수직 통합 환경'이다. 음성 인식(STT), 텍스트 음성 변환(TTS), 에이전트 오케스트레이션까지 전 과정을 하나의 플랫폼에서 처리한다. 스타니셰프스키 CEO는 "타사의 Speech-to-Speech 방식은 감사 및 관측이 어렵고 엔터프라이즈의 요구사항을 충족하는 데 한계가 있습니다"라며 "우리는 모델 고도화로 지연시간을 크게 감소시켰고, LLM 추론 속도도 대폭 향상시켰습니다"라고 강조했다.

AI 에이전트, 콜센터의 풍경을 바꾼다

일레븐랩스는 미디어 고객 외에 B2B 영역에서는 콜센터 시장에 우선 집중하고 있다고 전했다. 일본 시장에서도 같은 행보를 보이고 있다. 일본은 만화와 이를 기반으로 한 애니메이션 시장에서 '성우'들이 많이 활동하고 있어 이 분야에서도 급격한 성장세가 이어지고 있다고 귀띔했다.

B2B 시장에서의 파괴력도 만만치 않을 전망이다. 일레븐랩스는 이날 '대화형 AI 에이전트' 플랫폼을 집중 소개했다.

기존 ARS 고객센터가 "1번은.., 2번은.."을 듣느라 15분을 허비하게 했다면, AI 에이전트는 고객의 말을 즉시 알아듣고(STT), 생각하고(LLM), 말한다(TTS). 특히 0.5초 이하의 초저지연(Low Latency) 기술 덕분에 사람이 말을 끊고 들어와도 자연스럽게 대응이 가능하다.

마티 CEO는 "유럽의 한 디지털 은행은 우리 솔루션 도입 후 평균 상담 시간을 15분에서 2분으로 줄였고, 전체 문의의 50%를 AI가 완결짓고 있습니다"라며 2026년 한국 시장에서 가장 빠르게 성장할 분야로 '고객 경험(CS)' 시장을 꼽았다.

이 시장에는 빅테크들도 앞다퉈 진출하고 있다. AI 분야 대응이 늦은 것 아니냐는 지적을 받아온 AWS도 re:Invent 2023에서 관련 상품을 내놓으며 이 시장을 우선 공략하고 있다. 이미 음성 안내 시스템 등이 구축되어 있다 보니 이를 대체하기도 수월하다는 입장이다.

일레븐랩스는 우선 범용적으로 기술을 적용하는 시장에 집중하고 있다. 마이크로소프트가 뉘앙스(Nuance) 같은 회사를 인수해서 의료 정보나 의사들을 위한 시장으로 가고 있지만 일레븐랩스는 아직 이 영역에는 집중하지 않고 있다. 헬스케어 시장에 대해 어떻게 대응하는지 물었다.

[발표하고 있는 마티 스타니셰프스키 공동창업자 겸 CEO. 일레븐랩스 제공]

이와 관련해 마티 CEO는 "한국에서는 아직 많이 보지 못했지만 미국 시장에서는 많이 배포되고 있습니다. 예약 스케줄링 같은 것들이죠. 환자들이 전화해서 예약을 잡으면 음성 에이전트가 처리하고, 나중에 음성 에이전트가 환자들에게 다시 전화해서 예약을 상기시키거나 컨디션이 어떤지 물어봅니다. 엄청난 기회가 있지만, 한국에서는 아직 준비가 안 됐다고 생각합니다. 실시간으로 작동하게 만드는 것이 이제 막 가능해진 단계입니다. 그래서 2026년에는 훨씬 더 많이 볼 수 있을 것이라고 생각합니다."라고 답했다.

음성 이슈다 보니 자연스럽게 동시통역이나 서로 다른 언어를 쓰는 사람들 간의 대화에 대해서도 물었다. 통역 서비스를 준비하면서 하드웨어 기업들과 협업하고 있는지도 질문했다.

그는 "물론 모든 사람과 매우 빠르고 자연스럽게 소통할 수 있다면 놀라울 것입니다. 여러분의 도움 덕분에 우리는 그것을 다양한 언어로 자유롭게 할 수 있습니다. 기술적으로는 (통역해 전달하는 게) 가능할 것이라고 생각하지만, 그다음 질문은 하드웨어에 어떻게 배포하느냐입니다. 헤드폰을 사용할 것인가? 목걸이나 다른 장치를 사용할 것인가? 그리고 말의 뉘앙스가 가장 중요한 전문적인 환경에서는 아마도 AI와 인간의 조합이 필요할 것입니다. 완벽함을 보장하기 위해서죠"라며 아직 넘어야 할 산이 남아 있다는 입장을 밝혔다.

딥페이크의 그림자... "혁신엔 책임 따른다"

물론 빛이 밝을수록 그림자도 짙다. 고도로 발달한 음성 AI는 보이스피싱이나 딥페이크 범죄에 악용될 우려가 크다. 26년 차 기자의 시선에서 가장 날카롭게 파고들어야 할 부분이었다.

이에 대해 마티 CEO는 '3C 프레임워크'라는 방어 기제를 제시했다. ▲동의(Consent): 본인 인증 없는 음성 복제 불허 ▲통제(Control): AI 생성 음성 탐지 및 추적 ▲보상(Compensation): 음성 제공자에게 수익 분배 등이 골자다.

그는 "음성 주파수에 워터마크를 넣는 방식은 완벽한 해결책이 아닙니다"라면서도 "대신 99.5%의 정확도로 AI 음성을 식별하는 탐지 기술을 개발해 보안 당국 및 파트너사들에 제공하고 있습니다"라고 밝혔다. 기술의 진보를 막을 수 없다면, 이를 감시할 '방패'도 함께 팔겠다는 전략이다.

한편, 이날 마티 CEO는 IPO(기업공개) 시점에 대해 "3년에서 5년 내"라고 언급했다. 하지만 기술의 발전 속도를 볼 때 일레븐랩스가 시장에 미칠 영향력은 그보다 훨씬 빨리, 강력하게 다가올 것으로 보인다.

과거 우리는 인터넷을 하기 위해 키보드를 두드렸고, 스마트폰을 쓰기 위해 액정을 터치했다. 일레븐랩스가 보여준 미래는 명확했다. 이제 기계와 눈을 맞추거나 손가락을 쓸 필요 없이, 옆 사람에게 말하듯 대화하면 되는 세상이다.

"언어 장벽이 없는 세상을 만들겠다"는 창업자의 말은 허언이 아니었다. K-팝 스타가 유창한 스페인어로 남미 팬에게 인사를 건네고, 한국의 기자가 쓴 기사가 원어민 발음의 영어로 실시간 송출되는 시대. 일레븐랩스의 한국 상륙은 그 거대한 변화의 서막을 알리는 신호탄이다.

기자간담회 주요 질의응답

Q: 한국어 모델 최적화를 위해 어떤 학습 데이터를 사용했나?

A(스타니셰프스키 CEO): "부족한 것은 오디오 데이터의 양이 아니라 오디오에 대한 이해였다. 음성 전문 코치와 전문가들로 내부 팀을 구성해 오디오 샘플에 감정, 억양, 사투리 등의 주석을 직접 달았다. 이를 통해 모델이 맥락을 이해하고 감정을 담아 발화할 수 있게 됐다. 향후에는 명시적 태그 없이도 자동으로 이런 뉘앙스를 포착할 수 있는 수준까지 발전시킬 계획이다."

Q: 일레븐랩스의 IPO 계획과 한국 개인 투자자들의 투자 기회는?

A(스타니셰프스키 CEO): "파트너사와 고객사들이 지분을 보유하고, 개인 투자자들도 함께 참여할 수 있는 기회를 제공하고 싶어 IPO를 계획하고 있다. 모든 것이 순조롭다면 5년 내, 빠르면 3년 내에도 가능할 것으로 본다. 시장이 워낙 빠르게 움직이고 있어 예상보다 앞당겨질 수 있다."

Q: 뉴스 미디어에 일레븐랩스 기술을 적용할 계획은?

A(스타니셰프스키 CEO): "유명인이나 선호하는 인물의 목소리로 뉴스를 듣는다면 훨씬 자연스러운 경험이 될 것이다. 기술적으로는 이미 가능하며, 미국에서는 The Atlantic, Washington Post 등과 협업이 진행 중이다. 연구 과제를 넘어 실제 배포 단계에 있으며, 한국 미디어와도 협력할 의향이 충분하다."

Q: 지적재산권(IP) 및 특허 관련 리스크 대응 전략은?

A(스타니셰프스키 CEO): "우리는 특허에 대해 다른 접근을 취하고 있다. 이 분야의 발전 속도가 너무 빨라 작년의 연구가 올해는 이미 무용지물이 되기 때문에 특허 자체가 큰 의미가 없다고 판단한다. 대신 음성과 제품 관련 지적재산권 보호에 집중하고 있으며, 기술 혁신의 속도 자체를 차별화 포인트로 삼고 있다."

Q: 한국 시장 진출의 핵심 전략은?

A(홍상원 지사장): "두 가지 핵심 영역에 집중한다. 첫째는 K-콘텐츠의 진정한 글로벌화다. 70개 이상 언어를 지원하면서도 원작의 감정과 뉘앙스를 거의 완벽히 재현한다. 웃음, 한숨, 감탄사, 숨소리까지 그대로 전달하며, 화자 자동 분리, 타임라인 편집, API를 통한 대량 처리로 더빙 시간을 극적으로 단축한다. 둘째는 고객 경험의 완전한 재창조다. 500밀리초 이하 응답속도의 초저지연 음성 에이전트가 24시간 다국어로 응대하며, AI가 반복 문의의 70%를 처리하는 동안 상담사는 진짜 공감과 창의성이 필요한 복잡한 케이스에 집중할 수 있다."

다음은 위 기사를 바탕으로 구글 제미나이 3.0으로 번역한 영어 기사.

ElevenLabs CEO Mati Staniszewski: "The Era of Text is Ending, Voice is Arriving"… Why the AI Unicorn Chose Korea

By Do An-gu

November 21, 2025

[TechSuda Reporter Do An-gu | eyeball@techsuda.com]

"We are standing at an inflection point where the internet environment is shifting from text-centric to voice-centric. A world where language barriers disappear, and all content can be enjoyed with its original language and emotion intact, will soon arrive."

Mati Staniszewski, CEO of ElevenLabs—hailed as a 'Super Unicorn' in the global Voice AI sector—has visited Korea.

Founded in 2022, ElevenLabs has emerged as a leader in the AI voice field, reaching a valuation of $6.6 billion (approx. KRW 9.2 trillion) and more than 50 million monthly active users in just three years. The fact that 75% of Fortune 500 companies are among its clients is a testament to its technological prowess. With the global tech industry's spotlight on him, Staniszewski named Korea, without hesitation, as the forward base for the company's expansion into Asia.

At a press conference held in Banpo-dong, Seoul, on the 21st, CEO Staniszewski unveiled an ambitious blueprint to go beyond simple Text-to-Speech (TTS) and innovate the very way humans and machines communicate. The technology demonstrated at the scene was closer to a reinvention of 'experience' rather than a mere advancement of 'function.'

As of 2025, the AI voice market is growing rapidly. According to market research firms, the global AI voice generation market is projected to grow from $3.56 billion in 2023 to $21.75 billion by 2030, a compound annual growth rate (CAGR) of approximately 29-30%. (Source: https://www.grandviewresearch.com/industry-analysis/ai-voice-generators-market-report)
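The growth rate quoted above can be checked with the standard compound annual growth rate formula. The snippet below is only a quick arithmetic check; the function name `cagr` is ours, and the inputs are the 2023 and 2030 market sizes cited in the article.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end_value / start_value) ** (1 / years) - 1


# Market size in billions of US dollars, as cited in the article.
rate = cagr(start_value=3.56, end_value=21.75, years=2030 - 2023)

# The result should fall inside the 29-30% range the article reports.
print(f"Implied CAGR: {rate:.1%}")
```

Seven compounding years separate 2023 and 2030, so the exponent is 1/7. The implied rate lands between 29% and 30%, consistent with the cited forecast.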

Competition is also fierce. Big tech companies such as Google (Cloud Text-to-Speech), Microsoft (Azure Speech), and Amazon (Amazon Polly) are providing cloud-based voice services, while specialized startups like Deepgram, Cartesia, and OpenAI (TTS) are expanding their market share by leveraging their respective strengths.

"Reading the Grain of Korean"... Cultural Localization Beyond Simple Translation

The highlight of the press conference was undoubtedly the Korean-language demonstration. The AI voice reading a novel by Kim Yu-jeong did not sound like a machine reciting sentences; it was 'acting' suited to each scene. The technology, which paused for breath, burst into laughter, and shifted tone according to context, was goosebump-inducing.

"Capturing emotion and intonation—elements not found in text—into voice is the core of our technology," CEO Staniszewski explained. "Since Korean is a high-context language where the meaning of words changes depending on the context, we have staked everything on optimizing Korean pronunciation and nuance over the past year."

To achieve this, ElevenLabs focused on the 'quality' rather than the 'quantity' of data. They trained their models by having professional voice coaches annotate audio data with information on emotions and tone. It was a process that went beyond mechanical learning to include human 'guidance.'

ElevenLabs' strategy for the Korean market is already quite concrete. They have established localization partnerships by attracting strategic investments from Naver and LG Uplus. Krafton has already introduced their solution for interactive characters in games, and MBC and SBS have adopted it for content production and Automated Dialogue Replacement (ADR) processes.

CEO Staniszewski summarized the future of AI audio with three keywords:

First, voice will become the primary interface for interacting with technology. From wearables to cars, every device will understand what we say and how we say it.

Second, language barriers will disappear. All voice and content will become accessible from anywhere in the world through real-time translation and perfect dubbing.

Third, an omnimodal, omnichannel, and omnipresent era will begin. A complete creative experience will be offered, spanning not just audio but also images and video.

ElevenLabs is planning an Initial Public Offering (IPO) within the next 3 to 5 years. "We want to provide opportunities for our partners, clients, and individual investors to participate," Staniszewski stated. "If everything goes smoothly, it could be possible within three years."

Why Now, and Why Korea?

Hong Sang-won, Country Manager of ElevenLabs Korea, defined the Korean market as a "demanding testbed that adopts innovation fastest."

"Korea is a market that accepts innovation most rapidly," Manager Hong said. "With a mobile internet penetration rate of 99% and world-class 5G infrastructure, Korea is a country ready to handle massive voice traffic."

He emphasized, "In particular, the global influence of K-Content is the best stage to prove our multilingual Dubbing Studio technology. The 23% early adopter ratio, the global content power proven by K-Pop and K-Drama, and the world's most demanding service standards make Korea the optimal market."

The 'TTS v3' model put forward by ElevenLabs supports over 70 languages while carrying the original actor's vocal tone and emotional delivery into other languages. Compared with conventional dubbing, costs fall by up to 95% and time by up to 90%. A future is opening where lines from Netflix's Squid Game can be heard in English, Spanish, or French while retaining actor Lee Jung-jae's unique timbre and emotion.

In fact, 65.1% of large Korean companies have already adopted AI, and 63.5% of workers use generative AI routinely. This is more than double the global average. The Korean government has also allocated KRW 10.1 trillion to the AI sector for 2026 and declared its goal of becoming one of the world's top three AI powers.

ElevenLabs' core technology is ultra-low latency AI speech synthesis. It achieves human-level naturalness with a latency of less than 0.5 seconds and supports over 7,000 voices and 32 languages. Its expressiveness—reproducing laughter, sighs, exclamations, and even breathing sounds—is a key differentiator from competitors.

Another differentiator for ElevenLabs is its 'Vertical Integration Environment.' It handles the entire process from Speech-to-Text (STT), Text-to-Speech (TTS), to Agent Orchestration on a single platform. "Competitors' Speech-to-Speech methods have limitations in auditing and observability, making it hard to meet enterprise requirements," CEO Staniszewski noted. "We have significantly reduced latency through model advancement and vastly improved LLM inference speeds."

AI Agents Changing the Landscape of Call Centers

ElevenLabs stated that, aside from media clients, it is prioritizing the call center market within the B2B sector. It is taking a similar approach in Japan. The company added that Japan, where many voice actors work in manga and the animation built on it, is seeing rapid growth in that segment as well.

The disruptive power in the B2B market is expected to be significant. On this day, ElevenLabs focused on introducing its 'Conversational AI Agent' platform.

While conventional automated phone menus (ARS, known as IVR outside Korea) waste 15 minutes making customers listen to "For English, press 1...", AI agents immediately understand (STT), think (LLM), and speak (TTS). Thanks to ultra-low latency technology (under 0.5 seconds), the agent can respond naturally even when a person interrupts mid-sentence.
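The understand, think, speak loop described above can be pictured as three stages that together must fit inside the latency budget. The sketch below is purely illustrative and is not ElevenLabs' API: `transcribe`, `generate_reply`, and `synthesize` are hypothetical stubs, and the 500 ms default simply mirrors the "under 0.5 seconds" figure quoted in the article.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TurnResult:
    reply_text: str
    audio: bytes
    stage_ms: dict = field(default_factory=dict)
    within_budget: bool = False


# The three stage functions below are hypothetical stand-ins, not a vendor API.
def transcribe(audio: bytes) -> str:
    """STT stub: pretend the audio bytes are already UTF-8 text."""
    return audio.decode("utf-8")


def generate_reply(text: str) -> str:
    """LLM stub: echo the request back."""
    return f"You said: {text}"


def synthesize(text: str) -> bytes:
    """TTS stub: pretend UTF-8 bytes are audio."""
    return text.encode("utf-8")


def timed(stage_ms: dict, name: str, fn, arg):
    """Run one stage and record its wall-clock time in milliseconds."""
    start = time.perf_counter()
    out = fn(arg)
    stage_ms[name] = (time.perf_counter() - start) * 1000.0
    return out


def run_turn(audio: bytes, budget_ms: float = 500.0) -> TurnResult:
    """Run STT, LLM, and TTS in sequence and compare total time to the budget."""
    stage_ms = {}
    text = timed(stage_ms, "stt", transcribe, audio)
    reply = timed(stage_ms, "llm", generate_reply, text)
    speech = timed(stage_ms, "tts", synthesize, reply)
    total = sum(stage_ms.values())
    return TurnResult(reply, speech, stage_ms, total < budget_ms)


result = run_turn(b"hello")
print(result.reply_text, result.within_budget)
```

Running the stages strictly one after another is a simplification. A production system would more plausibly stream partial results between stages, and handling interruptions would require cancelling playback when new speech arrives, which this sketch does not attempt.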

"A digital bank in Europe reduced their average consultation time from 15 minutes to 2 minutes after adopting our solution, with AI successfully resolving 50% of all inquiries," said Staniszewski, identifying the 'Customer Experience (CS)' market as the fastest-growing sector in the Korean market for 2026.

Big tech companies are also rushing into this market. AWS, despite concerns about being late to the AI game, introduced related products at re:Invent 2023 and is prioritizing this sector. Their stance is that replacing existing systems is easier since voice guidance systems are already in place.

ElevenLabs is currently focusing on markets where the technology can be applied universally. While Microsoft is moving into markets for medical information or doctors by acquiring companies like Nuance, ElevenLabs is not yet concentrating on this area. When asked about their response to the healthcare market:

"We haven't seen it much in Korea yet, but it is being widely deployed in the US market. Things like appointment scheduling. When patients call to make an appointment, a voice agent handles it, and later the voice agent calls the patients back to remind them or ask about their condition. I think there is a huge opportunity, but Korea isn't ready yet. We are just at the stage where making it work in real-time has become possible. So, I think we will see much more of this in 2026," CEO Staniszewski replied.

Given the focus on voice, questions naturally arose regarding simultaneous interpretation or conversations between people of different languages, and whether they are collaborating with hardware companies for interpretation services.

He commented, "Of course, it would be amazing to communicate very quickly and naturally with everyone. Thanks to your help, we can do that freely in various languages. Technologically, I think it (delivering interpretation) is possible, but the next question is how to deploy it to hardware. Will we use headphones? Necklaces or other devices? And in professional environments where the nuance of speech is paramount, a combination of AI and humans will likely be needed to guarantee perfection." He acknowledged that there are still hurdles to overcome.

The Shadow of Deepfakes... "Innovation Comes with Responsibility"

Of course, the brighter the light, the darker the shadow. Highly developed Voice AI carries a high risk of being misused for voice phishing or deepfake crimes. From the perspective of a reporter with 26 years of experience, this was the area that needed the sharpest scrutiny.

In response, CEO Staniszewski presented a defense mechanism called the '3C Framework.' The key points are: ▲ Consent: No voice cloning without identity verification ▲ Control: Detection and tracking of AI-generated voice ▲ Compensation: Revenue sharing for voice providers.

"Embedding watermarks in voice frequencies is not a perfect solution," he said. "Instead, we have developed detection technology that identifies AI voice with 99.5% accuracy and are providing it to security authorities and partners." The strategy is that if technological progress cannot be stopped, they will also sell the 'shield' to monitor it.

Meanwhile, CEO Staniszewski mentioned "within 3 to 5 years" regarding the IPO timing. However, considering the speed of technological advancement, ElevenLabs' impact on the market is expected to arrive much faster and more powerfully than that.

In the past, we tapped keyboards to use the internet and touched screens to use smartphones. The future shown by ElevenLabs was clear. It is a world where we no longer need to make eye contact with machines or use our fingers, but simply converse as if talking to the person next to us.

The founder's promise to "create a world without language barriers" was not empty rhetoric. An era is coming in which a K-Pop star greets South American fans in fluent Spanish and an article written by a Korean reporter is broadcast in real time in native-sounding English. ElevenLabs' arrival in Korea is the opening signal of that massive change.

Key Q&A from the Press Conference

Q: What training data was used to optimize the Korean model?

A (CEO Staniszewski): "What was lacking was not the quantity of audio data, but the understanding of the audio. We formed an internal team of voice coaches and experts to manually annotate audio samples with emotions, intonation, and dialects. This allowed the model to understand context and speak with emotion. In the future, we plan to advance this to a level where these nuances are automatically captured without explicit tags."

Q: What are ElevenLabs' IPO plans and investment opportunities for Korean individual investors?

A (CEO Staniszewski): "We are planning an IPO because we want to offer an opportunity for our partners, clients, and individual investors to participate together. If all goes smoothly, we see it happening within 5 years, or as early as 3 years. Since the market is moving so fast, it could be earlier than expected."

Q: Are there plans to apply ElevenLabs technology to news media?

A (CEO Staniszewski): "Listening to news in the voice of a celebrity or a preferred figure would be a much more natural experience. Technologically, it is already possible, and collaborations are underway in the US with The Atlantic and The Washington Post. We are past the research phase and into actual deployment, and we are fully willing to cooperate with Korean media as well."

Q: What is the strategy for risks related to Intellectual Property (IP) and patents?

A (CEO Staniszewski): "We are taking a different approach to patents. Because the speed of development in this field is so fast, research from last year becomes obsolete this year, so we judge that patents themselves don't hold much meaning. Instead, we are focusing on protecting IP related to voices and products, and we consider the speed of technological innovation itself as our differentiation point."

Q: What is the core strategy for entering the Korean market?

A (Country Manager Hong Sang-won): "We are focusing on two key areas. First is the true globalization of K-Content. We support over 70 languages while almost perfectly reproducing the original emotion and nuance. We convey laughter, sighs, exclamations, and breathing sounds, and dramatically shorten dubbing time through automated speaker separation, timeline editing, and bulk processing via API. Second is the complete reinvention of customer experience. An ultra-low latency voice agent with a response time of under 500 milliseconds responds in multiple languages 24/7, allowing human agents to focus on complex cases requiring genuine empathy and creativity while AI handles 70% of repetitive inquiries."

이 기사를 작성하면서 네이버의 클로바 노트, 구글의 제미나이, 앤트로픽의 클로드, 오픈AI의 ChatGPT-5를 활용했습니다.

테크수다 기자 도안구 eyeball@techsuda.com