What is Data Engineering and Why Is It So Important?
What is Data Engineering?
The key to understanding what data engineering is lies in the “engineering” part. Engineers design and build things. “Data” engineers design and build pipelines that transform and transport data so that, by the time it reaches the Data Scientists or other end users, it is in a highly usable state. These pipelines must take data from many disparate sources and collect it into a single warehouse that represents the data uniformly as a single source of truth.
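To make the idea of a pipeline concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a production design: the two sources (a CRM CSV export and a web app JSON feed), their field names, and the `customers` table are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import json
import sqlite3
from datetime import datetime

# Hypothetical source 1: a CSV export from a CRM, with US-style dates.
CRM_CSV = """id,Email,Signup
1,Ann@Example.com,03/15/2021
2,bob@example.com,04/02/2021
"""

# Hypothetical source 2: JSON records from a web app, with ISO dates.
APP_JSON = '[{"user_id": 3, "mail": "CARA@example.com", "created": "2021-05-20"}]'


def extract_crm(text):
    """Read CRM rows and map them onto the shared schema."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "customer_id": int(row["id"]),
            "email": row["Email"].strip().lower(),
            "signup_date": datetime.strptime(row["Signup"], "%m/%d/%Y").date().isoformat(),
            "source": "crm",
        }


def extract_app(text):
    """Read web app records and map them onto the same shared schema."""
    for rec in json.loads(text):
        yield {
            "customer_id": int(rec["user_id"]),
            "email": rec["mail"].strip().lower(),
            "signup_date": rec["created"],
            "source": "app",
        }


def load(conn, records):
    """Write uniform records into the single (stand-in) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers ("
        "customer_id INTEGER PRIMARY KEY, email TEXT, signup_date TEXT, source TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO customers "
        "VALUES (:customer_id, :email, :signup_date, :source)",
        records,
    )
    conn.commit()


def run_pipeline():
    """Extract from both sources, load them, and return the unified rows."""
    conn = sqlite3.connect(":memory:")
    load(conn, list(extract_crm(CRM_CSV)) + list(extract_app(APP_JSON)))
    return conn.execute(
        "SELECT customer_id, email, signup_date, source "
        "FROM customers ORDER BY customer_id"
    ).fetchall()


if __name__ == "__main__":
    for row in run_pipeline():
        print(row)
```

Each source arrives with its own column names, casing and date format, and leaves the pipeline in one uniform shape. Real pipelines add scheduling, validation, monitoring and far larger volumes, which is where most of the engineering effort goes.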
It sounds simple enough, but a lot of skill goes into this role. This is why Data Engineers are in such short supply and why there is confusion around the role. The figure below is one example of the activities involved in data engineering.
Data engineering activities (source: Eckerson Group)
Monica Rogati, an equity partner at Data Collective, created a now-famous data science hierarchy of needs. It depicts where data engineering falls in the roadmap to becoming a data science/AI-driven organization.
Data engineering falls into levels 2 and 3 primarily
A Data Engineer’s role is at levels 2 and 3. It’s worth noting that the bottom level, “collect”, is growing larger and larger, thereby driving the need for more Data Engineers.
(Big) Data Engineers are in Demand
Data Scientists as a professional group get a lot of attention and hype. Over the last several months, however, we’ve seen a growing interest in using our technical skills testing platform for data engineering roles.
We understand intuitively the surge in demand for Data Engineer skills testing. LinkedIn’s 2020 Emerging Jobs Report and Hired’s 2019 State of Software Engineers Report ranked Data Engineer jobs right up there with Data Scientist and Machine Learning Engineer.
However, for some companies, especially those still finding their legs in data science or AI, it’s not always apparent what data engineering is, what role Data Engineers play within the analytics team and what skills are required (and should be vetted) to do the job.
So in this brief article we thought we’d answer the question “what is data engineering?” We also explain why it is now widely recognized as being extremely important and what the role and skillsets of a Data Engineer are. It’s important to note that the definition of what data engineering is and what a Data Engineer does continues to evolve, so consider this summary a “snapshot”.
How did Data Engineering Come About?
Many would say that data engineering as a profession has been around for well over a decade, maybe two, ever since databases, Microsoft SQL Server and ETL came to be. Some would say ever since IBM popularized database management systems in the 1970s. With that, here’s a very brief history recap.
In the 1980s the term “information engineering” was coined, largely to describe database design and to include software engineering in data analysis. Sometime after the rise of the internet in the 1990s and 2000s, “big data” came to be. Yet DBAs, SQL Developers and IT professionals working in the field were not labeled “Data Engineers” at that time.
So why the new job title?
Let’s summarize by saying that a lot of huge technological changes happened which escalated big data volumes, variety, and velocity. Around 2011 the term “Data Engineer” started to crop up in the circles of new data-driven companies such as Facebook and Airbnb. Sitting on mountains of potentially valuable real-time data, software engineers at these companies needed to develop tools to handle all the data quickly and correctly.
The term “data engineering” evolved to describe a role that moved away from using traditional ETL tools and developed its own tools to handle the increasing volumes of data. As big data grew, “data engineering” came to describe a kind of software engineering that focused deeply on data – data infrastructure, data warehousing, data mining, data modeling, data crunching, and metadata management.
Why the Critical Need for Data Engineering Now?
By now you’ve heard or read about Gartner’s determination back in 2017 that 85% of big data projects fail. This was largely due to a lack of reliable data infrastructures. Data could not be trusted enough to base key business decisions on it. Fast forward to 2019 and things had not improved. The CTO of IBM said that 87% of data science projects never make it into production. Gartner reiterated its prediction, now putting the share of projects that would fail at 80%. A NewVantage Partners report produced similar stats.
So why is this?
Over the last decade, most companies have completed a digital transformation. This has produced unimaginable volumes of new types of data and much more complicated data at a higher frequency. While it was previously apparent that Data Scientists were needed to make sense of it all, it was less apparent that someone needed to organize this data and ensure its quality, security, and availability for the Data Scientists to do their jobs.
So in the early days of big data analytics, Data Scientists were very often expected to build the necessary infrastructure and data pipelines to do their work. This was not necessarily in their skill sets or expectations for the job. The result was that data modeling would not be done correctly. There would be redundant work and inconsistency in the use of data among Data Scientists. These kinds of issues prevented companies from being able to extract optimal value from their data projects, so they failed. It also led to a high rate of Data Scientist turnover that still exists today.
Today with the onslaught of completed corporate digital transformations, the Internet of Things and the race to become AI-driven, it is crystal clear that companies need Data Engineers in abundance to provide the foundation for successful data science initiatives.
This is why we will continue to see the role of Data Engineers grow in importance and breadth. Companies need teams of people whose sole focus is to process data in a way that allows them to extract value from it.
What is the Relationship and Difference between Data Scientists and Data Engineers?
Much has been written about the relationships between these two roles, so we’ll be brief. In the past, companies thought that they could get away with having Data Scientists do the role of Data Engineers. This is what has caused much of the “unicorn effect” and shortage in Data Scientist recruitment.
Some Data Scientists also sold themselves as being able to do a Data Engineer’s job. Many fell short – see the image to the right courtesy of O’Reilly.com.
Today, the volume and speed of data have driven Data Scientist and Data Engineer to become two separate and distinct roles, albeit with some overlap.
It’s now widely recognized that companies need both Data Scientists and Data Engineers in an advanced analytics team. It’s pretty difficult to do any meaningful data science without Data Engineers to support this function. There’s frequent collaboration between Data Engineers and Data Scientists; however, their priority skills and knowledge of tools are different.
Data Scientists are focused on advanced analytics of data that is generated and stored in a company’s databases. Data Engineers design, manage and optimize the flow of data between those databases and throughout the organization. So Data Scientists will be highly skilled in math and statistics, R, algorithms and machine learning techniques. Data Engineers will be more versed in SQL and NoSQL databases (MySQL, for example), architecture, cloud technologies, and working methods such as agile and scrum.
Both will likely know Python and visualization techniques, and will have other coding languages in common.
What Skills do Data Engineers Need?
Data Engineers must have specialized skills in creating software solutions around data. At the same time, it’s perhaps unrealistically expected that Data Engineers will be familiar with a breadth of tools and technologies, anywhere from 10 to 30 of them. These tools are constantly changing, and the mix varies by industry.
Some, such as SQL, have been around forever. Others such as Scala are falling out of favor over time. Still others such as AWS are in rapid ascent in terms of demand.
Jeff Hale, a published expert author and instructor on data science and data engineering topics, recently did an analysis of the most in-demand skills asked of Data Engineers on three job platforms. Below is his summary of the top 10 technology skills required.
The variety of skills needed, and the complexity of some of them, make determining the right person for the job very difficult.
The requirements to do the job of a Data Engineer have been growing rapidly over the last several years. That’s why we suggest that, as with data science, it’s best to think of a “Data Engineer” as a team of people with a portfolio of data engineering skills. Which ones you prioritize will depend on a lot of things.
With that said, important skill areas would be:
- Foundation software engineering – Agile, DevOps, architecture design, service-oriented architecture…
- Distributed systems – This would include software engineer skills and software architect skills.
- Open Frameworks – Apache Spark, Hadoop, perhaps Hive, MapReduce, Kafka and others…
- SQL – This is a database staple and remains that way.
- Programming – Python has become the favored language for working with data. Java, on the other hand, while still widely sought, has fallen out of favor with most data scientists and engineers. Scala is another relevant language, one that Apache Spark and Kafka are built on.
- Pandas – a Python library for cleaning and manipulating data.
- Cloud platforms – AWS is probably the most prevalent cloud skill set for Data Engineers to know. Google Cloud Data Engineering and Microsoft Azure are right behind.
- Analytics – While mainly the realm of data scientists, some statistical analysis skill and an understanding of the underlying mathematical or probabilistic principles are necessary to shape the data so that it is accessible to the people doing the end analysis.
- Data modeling – Data modeling knowledge is quite important now: a Data Engineer needs to know how to structure tables and partitions, where to normalize and denormalize data in the warehouse, and how to think about retrieving certain attributes.
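As one illustration of the SQL and data modeling items in the list above, the following sketch contrasts a normalized layout with a denormalized one. It is only a sketch under assumptions: the `customers`, `orders` and `orders_wide` tables and their contents are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3


def build_warehouse():
    """Create a tiny stand-in warehouse with both modeling styles."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Normalized: each customer is stored once; orders reference it by key.
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,
            name TEXT,
            country TEXT
        );
        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(customer_id),
            amount REAL
        );
        INSERT INTO customers VALUES (1, 'Ann', 'US'), (2, 'Bob', 'DE');
        INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);

        -- Denormalized: one wide table that repeats customer attributes on
        -- every order row, so analysts can query it without a join.
        CREATE TABLE orders_wide AS
        SELECT o.order_id, o.amount, c.name AS customer_name, c.country
        FROM orders AS o
        JOIN customers AS c USING (customer_id);
        """
    )
    return conn


def revenue_by_country(conn):
    """Aggregate directly on the wide table; no join needed at query time."""
    return conn.execute(
        "SELECT country, SUM(amount) FROM orders_wide "
        "GROUP BY country ORDER BY country"
    ).fetchall()


if __name__ == "__main__":
    print(revenue_by_country(build_warehouse()))
```

The trade-off is the usual one: the normalized tables avoid duplicated customer data and are easier to keep consistent, while the wide table costs extra storage and refresh logic in exchange for simpler, faster analytical queries.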
Yeah, that’s a lot. We know. But just to prove the point, here’s Jeff Hale’s Top 30 technologies required of Data Engineers.
With this variety, it’s no wonder some companies are still struggling to figure out what exactly data engineering is and how to vet and hire Data Engineers.
If you need any assistance figuring out what a data engineer does or how to test data engineering skills, we’re happy to help! QuantHub now has data engineering skills tests including Spark. To learn more about how to test for data engineering skills contact our Chief Data Scientist Nathan Black at firstname.lastname@example.org!