Microsoft sources local speech data for its AI products through the startup Karya, which was founded in 2021, before ChatGPT’s rise.
Preethi P. is sitting on a stool next to a sewing machine in her one-room home on a peaceful street in Agara, a small community three hours southwest of Bangalore surrounded by groundnut fields and rice paddies. She would typically embroider or fix clothing for hours at a time, earning less than $1 each day on average. But on this particular day, she is reading a line into a phone app in her native Kannada language. She reads another after a brief break.
Preethi, who goes by one name as is customary in the area, is one of 70 workers whom a startup named Karya hired in Agara and nearby villages to collect text, voice, and image data in India’s vernacular languages. She is part of a vast, invisible global workforce in countries like Kenya, the Philippines, and India that gathers and labels the data AI chatbots and virtual assistants need to provide relevant answers. Unlike many other data contractors, however, Preethi is paid well for her work, at least by local standards.
Preethi made 4,500 rupees ($54) over three days of work with Karya, more than four times what the 22-year-old high school graduate typically makes in a month as a tailor. She said the money would cover the current month’s installment on a loan she had taken out to partially repair her home’s crumbling mud walls, which had been carefully covered with colorful saris. “All I need is a phone and the internet,” she said.
Although Karya was established in 2021—prior to ChatGPT’s ascent—the tech industry’s ravenous appetite for data has only grown as a result of this year’s generative AI craze.
By 2030, there will be about a million data annotation workers in India alone, according to Nasscom, the trade association for the nation’s IT sector. Karya sets itself apart from other data vendors by paying up to 20 times the current minimum wage to its contractors, the majority of whom are women living in rural areas, in exchange for higher-quality Indian-language data that IT companies are willing to pay more for.
Manu Chopra, the startup’s 27-year-old founder and a Stanford-educated computer engineer, told Bloomberg in an interview that “big tech companies spend billions of dollars collecting training data for their AI” and machine learning models. “Poor pay for such work is an industry failure.”
If low pay is an industry failure, then Silicon Valley bears some of the blame. For years, tech corporations have outsourced jobs like data labeling and content moderation to cheaper foreign contractors. Now, however, some of Silicon Valley’s best-known companies are turning to Karya to help with one of the main problems facing their AI products: finding high-quality data to build tools that can better serve billions of prospective non-English-speaking users. These collaborations may herald a significant change in the data industry’s economics and in Silicon Valley’s relationship with data suppliers.
Microsoft Corp. has sourced local speech data for its AI products through Karya. The Bill & Melinda Gates Foundation is working with Karya to reduce gender bias in the data that feeds large language models, the technology that powers AI chatbots. And Alphabet Inc.’s Google is relying on Karya and other regional partners to collect speech data in 85 Indian districts.
In addition to developing a generative AI model for 125 Indian languages, Google intends to expand to every district to incorporate the language or dialect that is spoken there.
A disproportionate amount of English-language internet content, including books, articles, and social media posts, has been used to construct numerous AI applications.
As a result, these AI models do a poor job of representing the diversity of languages spoken by internet users in other countries, who are adopting smartphones and AI-powered apps faster than they are learning English. India alone has close to one billion potential users, and the government is pushing for the introduction of AI tools in sectors including healthcare, education, and finance.
“India is the first non-Western country we are doing this in, and we are testing Bard in nine Indian languages,” said Manish Gupta, head of Google Research in India, referring to Google’s AI chatbot Bard. He added that many Indian languages, each spoken by more than a million people, have no digital corpus at all.
According to Chopra, the goal is to fight poverty as well as to improve the availability of data. Karya’s founder was raised in Shakur Basti, an impoverished slum in West Delhi. After winning a scholarship to a prestigious school, he was teased by his peers. When Chopra arrived at Stanford to study computer science, he quickly came to detest the school’s “how you make a billion dollars” mentality.
Saikat Guha, a researcher at Microsoft Research India who specializes in the ethics of data collection, said he has also used Karya’s data for a project to help people with visual impairments find employment.
Karya’s journey doesn’t end in India. According to the company, discussions are underway to sell its platform as a service to organizations that would carry out similar work in South America and Africa.