What is Azure Synapse Analytics and how it differs from Data Factory

Challenges and innovations in the IT world at Advanced Technology Days

Advanced Technology Days was held in Zagreb for the 17th time! The conference has become a traditional gathering of IT enthusiasts in the SEE region, with an emphasis on new technologies and innovations in the field. This year Unitfly had two presenters. Our COO talked about Azure Synapse Analytics, an Azure platform that combines enterprise data warehousing and big data analytics to ensure centralized management of data lakes and warehouses. On a seemingly opposite note, our Software Engineer Dino Grgić presented the challenges of OCR, which you can read about here.

COO Alan Debijađi / Unitfly

Let's now take an in-depth look at Azure Synapse Analytics.

Bulking up with Azure Synapse Analytics

Azure Synapse Analytics is an enterprise analytics service that offers a more efficient way to gain insights across data warehouses and big data systems. It offers the key features of multiple solutions: ETL from data warehouses, big data analytics, and reporting, as well as visualization (achieved by accessing Power BI within the service).

Difference between Synapse Analytics and Data Factory

Azure Synapse Analytics can help you turn big, unstructured data into actionable insights, while Data Factory provides numerous integrations without the use of code. The main difference between the two services is that Synapse Analytics is an analytics service, and Data Factory is a hybrid data integration service that simplifies ETL at scale. Data Factory integrates different data sources, while Synapse Analytics serves as a platform from which you can manage, prepare, and serve data for BI and machine learning purposes, with reporting capabilities included.

Azure Data Factory offers features such as:

  • real-time integration
  • parallel processing
  • data chunker

On the other hand, Azure Synapse provides:

  • complete T-SQL-based analytics
  • deeply integrated Apache Spark
  • hybrid data integration

What does Azure Synapse Analytics do?

Ingest – all functionalities of Data Factory and more

Synapse Analytics offers all the capabilities of Data Factory, such as the integration of different data sources, and adds monitoring, management, alerting, and security in one place.

Explore and analyze – using Synapse SQL

Synapse SQL combines distributed query processing capabilities with Azure Storage to achieve high performance and scalability, offering serverless and dedicated resource models.

Serverless SQL pool

Serverless SQL pool is a query service over the data in your data lake. It enables you to access your data through these functionalities:

  • a familiar T-SQL syntax to query data in place without the need to copy or load data into a specialized store
  • integrated connectivity via the T-SQL interface that offers a wide range of business intelligence and ad-hoc querying tools, including the most popular drivers
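
As a rough illustration of querying data in place, the sketch below builds a T-SQL statement that reads Parquet files straight from a data lake with OPENROWSET. The storage URL and the helper name build_openrowset_query are made up for this example; in practice the resulting string would be sent to the workspace's serverless SQL endpoint through a T-SQL driver.

```python
def build_openrowset_query(data_url: str, top: int = 10) -> str:
    """Return a T-SQL query that reads Parquet files in place via OPENROWSET."""
    if top <= 0:
        raise ValueError("top must be a positive integer")
    # Escape single quotes so the URL stays a valid T-SQL string literal.
    safe_url = data_url.replace("'", "''")
    return (
        f"SELECT TOP {int(top)} *\n"
        f"FROM OPENROWSET(\n"
        f"    BULK '{safe_url}',\n"
        f"    FORMAT = 'PARQUET'\n"
        f") AS [result];"
    )


# Hypothetical data lake path; replace it with a real storage account and container.
EXAMPLE_URL = "https://examplelake.dfs.core.windows.net/invoices/2022/*.parquet"

if __name__ == "__main__":
    print(build_openrowset_query(EXAMPLE_URL, top=5))
```

Running the query is left out on purpose, since connection details differ per workspace. The point is only that no copy or load step comes before the SELECT.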

Dedicated SQL pool (formerly SQL DW)

Dedicated SQL pool (formerly SQL DW) is a collection of analytic resources that are provisioned when using Synapse SQL. The size of a dedicated SQL pool is determined by Data Warehousing Units (DWU).

The analysis results can go to worldwide reporting databases or applications. Business analysts can then gain insights to make well-informed business decisions.

The other available services are Apache Spark and Data Explorer (in preview at the time of writing).

Visualization

The main appeal of Synapse Analytics lies in the ability to do everything in one place. Thanks to the native integration with Power BI, data can be instantly visualized in the platform.

Conclusion

Azure Synapse Analytics offers a way to keep the whole end-to-end process in one place: managing, preparing, and serving data for BI and machine learning purposes. Because it needs no additional platforms to import data from different sources, it positions itself as a must-have solution for data engineers.

The challenges in building OCR models

At this year's Advanced Technology Days, Unitfly had two presenters. Our COO Alan Debijađi talked about Azure Synapse Analytics, an Azure platform that combines enterprise data warehousing and big data analytics to ensure centralized management of data lakes and warehouses. On a seemingly opposite note, our Software Engineer Dino Grgić presented the challenges of optical character recognition (OCR), the topic we will cover today.

Software Engineer Dino Grgić / Unitfly

Introduction

Optical character recognition (OCR), the process of converting an image of printed or handwritten text into machine-readable text, became widely available to the public in the early 1990s.

Since then, the technology has improved considerably. Nowadays, we can digitize handwritten documents, among other benefits of OCR.

Are today’s OCR solutions accurate enough and no longer challenging? Do they still require deep learning?

These are some of the questions our colleague Dino set out to answer in his presentation at this year’s Advanced Technology Days conference.

What is OCR?

Before we get to the bottom of the issues with OCR, let’s get to know the term Computer Vision: a field of artificial intelligence (AI) that enables computers and systems to identify and understand objects in digital images, videos, and other visual inputs, and to take actions based on that information.

Picture 1. Computer Vision system identifying objects on the street
source: https://appen.com

OCR is a subfield of Computer Vision. It recognizes text in an image and converts it into machine-readable text data. Some of the fields where OCR is used and useful are:

  • License plate recognition
  • Traffic sign recognition
  • Helping blind and visually impaired people read text
  • Converting handwritten notes to machine-readable text
  • Translation from one language to another

Picture 2. OCR – Converting notes to machine-readable text
source: https://research.aimultiple.com/handwriting-recognition/

Picture 3. OCR – Translation (Google translation using OCR)

OCR yields very good results for general use cases. However, there are many specific cases where deep learning is still required.

One example is detecting data in unstructured incoming invoices written in Croatian when your current OCR model works on English and doesn’t recognize some letters specific to Croatian (Č, Ć, Š, Ž…). This is a perfect example of a field where an OCR model needs improvement before it can reliably meet the requirement.
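
To make the gap concrete, the following sketch checks which characters of a sample invoice line fall outside the alphabet of an English-only model. Both the alphabet and the sample text are assumptions chosen for illustration, not the character set of any particular OCR engine.

```python
import string
import unicodedata

# Assumed character set of an English-only OCR model (illustrative, not from a real engine).
ENGLISH_ALPHABET = frozenset(
    string.ascii_letters + string.digits + string.punctuation + " "
)


def unsupported_characters(text, alphabet=ENGLISH_ALPHABET):
    """Return the distinct characters in `text` that the model cannot output, sorted."""
    # Normalize so that a letter with a diacritic is one code point, not letter + accent.
    normalized = unicodedata.normalize("NFC", text)
    return sorted({ch for ch in normalized if ch not in alphabet})


# Made-up Croatian invoice line.
SAMPLE = "Račun br. 15: Žičara Šibenik, plaćeno"

if __name__ == "__main__":
    print(unsupported_characters(SAMPLE))
```

Any character this check reports is one the model can never produce, no matter how clean the scan is. That is why such a model has to be retrained or extended for Croatian documents.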

How to create an OCR model

We might need to develop and train our own model if the use case is too specific. For that, we need a data set: to recognize any given incoming invoice and the data in it, the model has to be trained on examples covering different fonts and formats. More data generally leads to a better model.

An important point is that such a model can be used only in the field it was trained for (Croatian incoming invoices). It cannot be used, for example, on Arabic incoming invoices, because the data it was trained on is different.

3-step process of creating an OCR model

  1. Pre-processing
    Feeding the model an image we want the computer to learn from. Every image goes through a lot of filters before any text is detected.
  2. Text detection + text recognition
    Using bounding boxes, we detect the location of the text, and with text recognition, we train the computer to read it.
  3. Post-processing
    Converting the data processed in the previous steps and generating the output in the form we want: a document file, an Excel sheet, etc.
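
The three steps can be sketched end to end with a toy, dependency-free example. Everything here is an assumption made for illustration: the image is a small grid of grayscale values, pre-processing is a single threshold filter, and plain template matching stands in for a trained recognition model. It is not how production OCR engines or the demo from the talk work.

```python
# Tiny 3x5 glyph templates ('#' = ink). A real model would learn these from training data.
TEMPLATES = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}


def render(text, ink=30, paper=220):
    """Build a toy grayscale image (rows of 0-255 ints) with one blank column between glyphs."""
    rows = []
    for r in range(5):
        line = ".".join(TEMPLATES[ch][r] for ch in text)
        rows.append([ink if px == "#" else paper for px in "." + line + "."])
    blank = [paper] * len(rows[0])
    return [blank] + rows + [blank]


def preprocess(image, threshold=128):
    """Step 1: binarize, so dark pixels become 1 (ink) and light pixels become 0."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def detect_boxes(binary):
    """Step 2a: find boxes (left, top, right, bottom) around ink separated by empty columns."""
    width = len(binary[0])
    has_ink = [any(row[c] for row in binary) for c in range(width)]
    boxes, start = [], None
    for c in range(width + 1):
        ink_here = c < width and has_ink[c]
        if ink_here and start is None:
            start = c
        elif not ink_here and start is not None:
            rows = [r for r, row in enumerate(binary) if any(row[start:c])]
            boxes.append((start, rows[0], c - 1, rows[-1]))
            start = None
    return boxes


def recognize(binary, box):
    """Step 2b: pick the template that agrees with the cropped glyph on the most pixels."""
    left, top, right, bottom = box
    crop = [row[left:right + 1] for row in binary[top:bottom + 1]]

    def score(template):
        if len(template) != len(crop) or len(template[0]) != len(crop[0]):
            return -1  # size mismatch: cannot compare
        return sum(
            (t == "#") == bool(p)
            for t_row, p_row in zip(template, crop)
            for t, p in zip(t_row, p_row)
        )

    return max(TEMPLATES, key=lambda ch: score(TEMPLATES[ch]))


def postprocess(chars):
    """Step 3: turn recognized characters into the desired output, here a plain string."""
    return "".join(chars)


def run_ocr(image):
    binary = preprocess(image)
    return postprocess(recognize(binary, box) for box in detect_boxes(binary))


if __name__ == "__main__":
    print(run_ocr(render("LIT")))
```

The split into small functions mirrors the list above, so each stage can be swapped independently, for example replacing the template matcher with a trained model while keeping detection and post-processing as they are.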

Recognizing regular text is easy, but ongoing research focuses on irregular text: blurred text, text partly hidden by an object, text on a transparent background with noise behind it, italic text, text in bad lighting, etc.

Picture 5. Technical challenges of OCR algorithms
Source: Chenxia Li et al., “Dive into OCR”

Conclusion

There is no such thing as a 100% effective and accurate OCR model. Each OCR model is built with a specific task in mind, so the solution cannot be generalized easily. Systems that rely on OCR depend on its quality, so the OCR field will always seek improvements in model accuracy.
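
One common way to put a number on that accuracy is the character error rate (CER): the edit distance between the OCR output and the correct text, divided by the length of the correct text. The sketch below is a minimal version of that idea, and the sample strings are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if the characters match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between OCR output and ground truth, divided by ground-truth length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Made-up example: an English-only model drops the diacritic in "Račun".
    print(character_error_rate("Račun", "Racun"))
```

A lower value means better output, and tracking it on a fixed set of test documents shows whether a newly trained model is an improvement over the previous one.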

You can find the presentation, demo, and useful links regarding OCR in Dino’s GitHub repository.