According to the Stack Overflow survey, data engineer salaries increased by 5% in 2023. Given the boom in AI and data analytics, data engineering is gaining relevance. The main goal of this article is to explain common terms in the data engineering ecosystem that will help you understand the work or interact with a data engineer.
These are ten terms that I consider vital for understanding the data engineering ecosystem. The list is short, but it is powerful enough to be used as a glossary.
Most of my notes come from Joe Reis and Matt Housley, whose book summarizes the fundamentals of data engineering perfectly.
Data needs to be stored somewhere, typically in a database. The main difference between a typical database and a data warehouse is how each one is used.
A data warehouse is used for reporting and analysis, unlike a conventional transactional database. It is also used to organize and centralize data by applying techniques such as data modeling.
Typically, a software application is backed by an online transaction processing (OLTP) system. An OLTP database reads and writes individual data records and is commonly called a transactional database. Examples are MySQL, PostgreSQL, MariaDB, Microsoft SQL Server, and Oracle Database (all RDBMSs).
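To make the idea of a transactional workload concrete, here is a minimal sketch. It assumes an in-memory SQLite database as a stand-in for the systems listed above, and the `accounts` table and the `transfer` and `balances` helpers are hypothetical names made up for illustration. It shows the typical OLTP pattern: small reads and writes on individual records, grouped into a transaction that either fully succeeds or is rolled back.

```python
import sqlite3

# In-memory SQLite stands in for a transactional database (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts (id, balance) VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, source_id, target_id, amount):
    """Move `amount` between two accounts as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, target_id),
            )
            # Fails the CHECK constraint if the source account would go negative.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, source_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as {account_id: balance}."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

Calling `transfer(conn, 1, 2, 500)` on these sample rows returns `False` and leaves both balances untouched, because the credit to the target account is rolled back together with the failed debit.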
The data lifecycle has components such as ingestion, transformation, and serving. These ‘bricks’ are managed through data pipelines. A data pipeline is the combination of architecture, systems, and processes that move data through the stages of the data engineering lifecycle.
Extract, transform, and load (ETL) tasks support the ingestion, transformation, and serving phases. ETL is commonly used to load source data into the analytic repository.
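The three ETL steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SOURCE_CSV` sample data, the `orders` table, and the `extract`, `transform`, and `load` functions are hypothetical names, and an in-memory SQLite database plays the role of the analytic repository.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would come from a file, API, or database.
SOURCE_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,
3,alice,4.25
"""


def extract(text):
    """Extract: read the raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount, cast types, tidy names."""
    return [
        (int(row["order_id"]), row["customer"].title(), float(row["amount"]))
        for row in rows
        if row["amount"]
    ]


def load(conn, records):
    """Load: write the cleaned records into the analytic table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    with conn:  # commit all inserts together
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)


# SQLite in memory stands in for the analytic repository (assumption).
warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))
total_amount = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

Keeping each step in its own function makes it easy to test them separately or to swap the source and the target later.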
Batch involves processing data in bulk. Data is ingested by taking a subset of data from a source system, based either on a time interval or the size of accumulated data.
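Both ways of cutting a batch, by accumulated size or by time interval, can be sketched as follows. The function names `batch_by_size` and `batch_by_interval` are hypothetical, and the events are assumed to be `(timestamp_in_seconds, payload)` pairs.

```python
from itertools import islice


def batch_by_size(records, batch_size):
    """Yield lists of at most `batch_size` records, in source order."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def batch_by_interval(events, interval_seconds):
    """Group (timestamp, payload) pairs into fixed time windows.

    Returns {window_start: [payloads]}, where window_start is the
    timestamp rounded down to a multiple of `interval_seconds`.
    """
    batches = {}
    for timestamp, payload in events:
        window_start = timestamp - (timestamp % interval_seconds)
        batches.setdefault(window_start, []).append(payload)
    return batches
```

For example, `batch_by_size(range(7), 3)` yields two full batches and a final batch with the one remaining record, while `batch_by_interval(events, 60)` groups events into one-minute windows.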
At the core of data engineering is the design of systems to support the evolving data needs of an enterprise. This design, the data architecture, defines how to meet the needs of each stage of the data lifecycle (ingestion, storage, serving, security, orchestration, etc.).
On-Premise / Cloud
This refers to where your components, such as storage, live. On-premise means using a server hosted within your own company. Cloud means using components hosted on external servers; popular providers are AWS, GCP, and Azure.
The data model is the conceptualization of the business logic. The information must answer business questions and help with decision-making. The data model takes data from the staging layer (raw data) and shapes it according to the company's needs, so that it can feed reports, dashboards, or analyses.
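As a rough illustration of shaping raw staged data into a model, the sketch below uses a small dimension table and fact table. This layout is only one common choice that I picked for the example, and the table names (`stage_sales`, `dim_customer`, `fact_sales`) and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Stage: raw data exactly as it arrived from the source.
CREATE TABLE stage_sales (customer_name TEXT, country TEXT, amount REAL);
INSERT INTO stage_sales VALUES
    ('Alice', 'MX', 10.0),
    ('Bob',   'US', 20.0),
    ('Alice', 'MX', 5.0);

-- Modeled layer: one row per customer, and one row per sale.
CREATE TABLE dim_customer (
    customer_id INTEGER PRIMARY KEY,
    customer_name TEXT UNIQUE,
    country TEXT
);
CREATE TABLE fact_sales (
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    amount REAL
);

INSERT INTO dim_customer (customer_name, country)
SELECT DISTINCT customer_name, country FROM stage_sales ORDER BY customer_name;

INSERT INTO fact_sales (customer_id, amount)
SELECT d.customer_id, s.amount
FROM stage_sales AS s
JOIN dim_customer AS d ON d.customer_name = s.customer_name;
""")

# A report-style question the model can now answer: sales per country.
sales_by_country = dict(conn.execute("""
    SELECT d.country, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_customer AS d USING (customer_id)
    GROUP BY d.country
"""))
```

The staging table stays untouched, so the modeled tables can be rebuilt whenever the business questions change.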
Interoperability describes how various technologies or systems connect, exchange information, and interact (Reis, Housley).
Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning (Reis, Housley).
When I started on the data engineering path, it was hard to understand the whole ecosystem, even though a lot of information and definitions exist. I’m sure these ten concepts will help you in the development of your career.
Additionally, if you got here and you are a software engineer, you will be interested in this article where I talk about how to become a data professional.