At the forefront of the burgeoning IT sector in Bulgaria, Commetric is pushing the boundaries of media analytics and measurement by incorporating cutting-edge AI technologies. Our diverse and skilled tech teams are developing proprietary AI-driven solutions that cater to a broad spectrum of organisations worldwide.

Learn what projects Commetric is working on from the company’s innovation team: Alexander Belokapov – Director of Data Engineering; Elena Mihalska – Project Manager; Slavi Slavov – Data Scientist; Genoveva Mihaylova – Natural Language Processing Analyst; Svetlana Valeva – Transformation Director; Spyros Garyfallos – COO; Kristina Totseva – Managing Director; Ivan Uzunov – Technology Director; Maya Koleva – Head of Research and Insights; Konstantinos Karakostas – Head of Machine Learning.

This interview with our innovation team was originally published in Bulgarian on Economy.bg.

Tell us about the main project or key projects that the teams in Bulgaria are working on.

Behind the product and communication solutions and strategies of major pharmaceutical companies, global banks, consulting firms, and others, lies the analysis of large volumes of media data – big data, collected and processed by organisations like ours. Working with so much information in various languages can be a labour-intensive and slow process, but the world of news waits for no one – an image can be tarnished in a short time and have difficult-to-repair consequences.

Commetric has been developing and maintaining a series of technical solutions in the field of media studies for years, including Cogent, Siera, and others. The main project of our machine learning team is called Cerebro.

Cerebro is a unified ecosystem spanning our entire infrastructure: it serves trained models for natural language processing (NLP) and handles the retraining and fine-tuning of these models through automated machine learning (AutoML). Cerebro was created to facilitate and simplify the daily tasks of Commetric’s media analysts, covering most of the recurring human actions for each project. It aids the analysis of large volumes of news content for our clients and their competitors, making the expanding team more efficient and consistent. This is especially important now, as analysts primarily work from home rather than in the same office. It also helps us when hiring new people, as it simplifies training processes.

Would you tell us a little more about your key project – Cerebro?

We see Cerebro as a multi-level project. The first level is the engineering part, the second is automated machine learning, and the third is the set of connection points (API, ad-hoc predictions in large volumes, etc.).

The engineering part consists of a series of automated pipelines, carefully optimized to check periodically for new data and decide whether a critical mass has accumulated to start the next “life cycle” of a model, or to create a completely new model and include it in the ecosystem. To achieve this at scale, Cerebro uses proven technologies that allow us to serve the models without interruption and to monitor processes and progress.
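The decision step described above can be pictured with a minimal sketch. It is an illustration only, not Cerebro’s actual logic: the `ModelState` record, the `next_action` function, and the threshold of 1,000 new examples are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelState:
    """Bookkeeping for one deployed model (hypothetical fields)."""
    name: str
    examples_at_last_training: int


def next_action(state: Optional[ModelState],
                labelled_examples_now: int,
                min_new_examples: int = 1000) -> str:
    """Return "create", "retrain" or "wait" for one task.

    The threshold stands in for the "critical mass" of new data;
    its value here is an assumption chosen for illustration.
    """
    if state is None:
        # No model exists yet for this task.
        return "create" if labelled_examples_now >= min_new_examples else "wait"
    new_examples = labelled_examples_now - state.examples_at_last_training
    return "retrain" if new_examples >= min_new_examples else "wait"


topic_model = ModelState(name="topic-classifier", examples_at_last_training=5000)
print(next_action(topic_model, 5400))  # too little new data so far
print(next_action(topic_model, 6200))  # enough new data for a new life cycle
print(next_action(None, 1500))         # no model yet, enough data to create one
```

A scheduler would call such a check at a fixed interval for every task, so that retraining starts without anyone having to ask for it.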

In the automated machine learning part, we use recent innovations in artificial intelligence such as Transformers, which allow us to create reliable models for a variety of natural language processing tasks. The daily work of media analysts includes recurring tasks such as determining themes, commentators, and relevance, and these are exactly the tasks we automate through machine learning. These functions, combined with AutoML procedures for ongoing retraining and model tuning, are key features of Cerebro.

Another important characteristic of Cerebro is that it is fully integrated into Commetric’s main platform – Cogent, to provide news article processing in the daily work of the media analysis team and to offer them solutions derived from ML.

How many people work on each project and what kind of specialists are they?

The teams collaborating to maintain the exchange of information include Machine Learning (ML), Technology, Natural Language Processing (NLP), Data Engineering, and Media Consultants.

The Machine Learning team, which is the main working team, collaborates closely with colleagues from the technology team on integration into Cogent, the main technology platform for media analysis. It also works actively with colleagues from the natural language processing team on connecting Cerebro with Siera, Commetric’s patented NLP software. The business perspective of media analysis comes from colleagues with years of experience in our main media analysis team, who contribute important, time-tested insights into the data and the nature of the workflow. They set the direction for possible development tasks so that these meet the needs of the media analysis team. Last but not least, no ML model can be trained without data. The encapsulation, organisation, and provision of data are handled by colleagues from the data engineering team.

And if we haven’t completely confused you with how many people are involved in the “affair” called machine learning, we will say briefly – there are many, from different teams and from different nationalities, as is our entire company. The best possible environment for training artificial intelligence.

What kind of technologies do you use for different types of projects?

Cerebro uses various open-source projects as well as internal software solutions for its technology package. And for the faint-hearted, who don’t like technical details, we recommend skipping to the next paragraph, while we expect questions, discussions, and advice from “tech-lovers” for our technological Frankenstein.

PyTorch and Hugging Face Transformers are a key part of the models we develop. MLflow plays a major role in our AutoML procedures, and Elastic provides monitoring capabilities.

Last but not least, we use Docker for organisation and tuning, with Docker Swarm managing the stack’s orchestration.

The customized Transformer models are trained to better understand and represent the thematic topology of the industries to which our clients belong. During the experimentation phase, the Neural Architecture Search (NAS) technique is used to find the most suitable architecture.

Finally, since we need more than 500 ML models deployed and ready to use at any time, we developed a new hybrid approach that retains the state-of-the-art performance of Transformers while requiring minimal computational resources and leaving a significantly reduced CO2 footprint.

How is the workflow organised and what is the management structure for different projects?

The workflow is largely similar for all projects. A brief description of the project is provided. An internal team discussion is held to address any potential oversights or ambiguities. The project kick-off comes after a detailed discussion with the team that requested the project, to ensure full alignment. Depending on the project, the ML team researches the most suitable modern methodologies and technologies and draws up the corresponding plan of attack. The workflow often includes many experiments with data and with various models and technologies until the best possible result is achieved. Finally, we like discussion sessions and are always flexible enough to change things so that we get the best result in terms of both quality and time.

How do your projects evolve over the years?

Commetric and Machine Learning (ML) have been inextricably linked from the very beginning of our company. Cerebro encompasses all of our accumulated ML experience and builds on it, using the most modern technologies in the field. One of the innovations that Cerebro embodies is that it is integrated into Commetric’s media analysis platform, so media analysts use it in their daily work, often without even noticing. The system is extremely flexible and allows new models, tasks, and types of data to be integrated easily with minimal human intervention. Last but not least, its resources are effectively controlled, which allows us to reduce our CO2 footprint despite the huge number of models and to achieve great scalability.

What are the biggest challenges in working on this project or on other key projects?

Like most Data Science projects that deal with training and tuning models, this project faces the challenges of insufficient or “dirty” data, as well as scaling problems. Training NLP models on unbalanced data requires methods for artificially increasing the amount of data. In addition, working with global clients creates a need to cover more than 90 languages. That’s why we use a combination of cross-lingual models and customized solutions.
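One of the simplest ways to artificially increase unbalanced data is random oversampling, where examples of the rarer classes are duplicated until every class is the same size. The sketch below illustrates that generic technique only; it is not a description of the methods Cerebro uses, and the `oversample` function and the sample headlines are hypothetical.

```python
import random
from collections import Counter


def oversample(texts, labels, seed=0):
    """Duplicate minority-class examples until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    if not by_label:
        return [], []
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Draw random duplicates to fill the gap to the largest class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


texts = ["Profits rise", "New product launched", "Award won", "Recall announced"]
labels = ["positive", "positive", "positive", "negative"]
balanced_texts, balanced_labels = oversample(texts, labels)
print(Counter(balanced_labels))
```

Plain duplication adds no new information, which is why text-specific augmentation (for example paraphrasing or back-translation) is often preferred in practice; the sketch only shows the balancing idea.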

Another significant source of “noise” is the human factor. As most of the annotations used to train the models are generated by humans, slight discrepancies combined with subjectivity lead to data that is less than clean.

The second challenge we faced is that Transformer models require a lot of resources and are not well suited to real-time use. Therefore, we had to develop procedures and methods that deliver state-of-the-art results from Transformer-based models while still allowing us to scale and keep computation time at an acceptable level.

What are the most significant achievements you can note for the projects you are working on?

From a technical perspective, the successes are expressed in overcoming the challenges mentioned above. The tasks currently performed by Cerebro achieve the highest performance score and are expected to streamline the media analysis workflow, making it more efficient over time and ensuring the highest standards in terms of consistency and attention to detail.

From a practical point of view, we consider it a major success that Cerebro has been broadly adopted by our analysts, who recognize its benefits, facilitate its development, and suggest new features. Our direct “users” – the media analysts – are convinced that we “come in peace” and are here to support and assist their efforts.

What’s next?

Our vision for Cerebro is to expand its capabilities to cover more natural language processing tasks, including creating newsletters, summarizing media content, recognizing names of people, organisations, and communities, and extracting sentiment moments in articles that positively or negatively impact readers.

Alongside this, we plan functions such as triggering model retraining when data or concept drift is detected, to further support the AutoML procedures. The flexibility of Cerebro allows it to incorporate various new architectures that are currently gaining popularity, such as multi-task and multi-modal models.
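A drift-based retraining trigger can be sketched in a few lines. The example below assumes one simple proxy for data drift: it compares the distribution of predicted labels in a reference period with a recent period, using total variation distance. The function names and the threshold of 0.2 are hypothetical, and detecting concept drift would additionally need fresh human-labelled data, which the sketch does not cover.

```python
from collections import Counter


def label_distribution(labels):
    """Map each label to its relative frequency."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def should_retrain(reference_labels, recent_labels, threshold=0.2):
    """Flag a model for retraining when the label mix has shifted too far."""
    drift = total_variation(label_distribution(reference_labels),
                            label_distribution(recent_labels))
    return drift > threshold


reference = ["relevant"] * 50 + ["irrelevant"] * 50
recent = ["relevant"] * 90 + ["irrelevant"] * 10
print(should_retrain(reference, reference))  # identical label mix
print(should_retrain(reference, recent))     # label mix has shifted
```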
