The key to understanding what data engineering is lies in the “engineering” part. Engineers design and build things. “Data” engineers design and build pipelines that transform and transport data so that, by the time it reaches the Data Scientists or other end users, it is in a highly usable state. These pipelines must take data from many disparate sources and collect it into a single warehouse that represents the data uniformly as a single source of truth.
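To make the pipeline idea above concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than a real system: the two sources (a CSV export and a JSON feed), their field names, and the `customers` table are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw inputs: two sources describing customers in different shapes.
CRM_CSV = "id,full_name,signup\n1,Ada Lovelace,2021-03-01\n2,Alan Turing,2021-04-15\n"
WEB_JSON = '[{"customer_id": 3, "name": "grace hopper", "created": "2021-05-20"}]'


def extract_crm(raw):
    """Parse the CSV export into uniform (id, name, signup_date, source) rows."""
    for row in csv.DictReader(io.StringIO(raw)):
        yield int(row["id"]), row["full_name"].strip().title(), row["signup"], "crm"


def extract_web(raw):
    """Parse the JSON feed into the same uniform row shape."""
    for item in json.loads(raw):
        yield int(item["customer_id"]), item["name"].strip().title(), item["created"], "web"


def load(conn, rows):
    """Write uniform rows into the single warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers ("
        "id INTEGER PRIMARY KEY, name TEXT, signup_date TEXT, source TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_pipeline():
    """Extract from both sources, transform to one shape, load into one table."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
    load(conn, extract_crm(CRM_CSV))
    load(conn, extract_web(WEB_JSON))
    return conn


warehouse = run_pipeline()
customer_names = [
    name for (name,) in warehouse.execute("SELECT name FROM customers ORDER BY id")
]
print(customer_names)
```

Each source gets its own small extract function, but both emit the same row shape, so downstream users only ever query one uniform table. A production pipeline would add scheduling, validation, and error handling on top of this skeleton.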
Sounds simple enough, but a lot of data literacy skills go into this role. This is why Data Engineers are in such short supply and why there is confusion around the role. The figure below is one example of the activities involved in data engineering.
Salary Ranges: Data & Big Data Engineers
Data engineers and big data engineers can make significantly different amounts of money based on criteria including industry, location, years of experience, and duties. However, here are some ballpark salary figures for typical compensation for these positions across industries:
Technology Sector Salary
In the IT sector, data engineers and big data engineers can make between $90,000 and $140,000 annually, with the highest-paying positions bringing in over $180,000.
Finance Sector Salary
A data engineer or big data engineer in the banking industry typically earns between $100,000 and $150,000 per year, with salaries as high as $180,000 for more senior positions.
Healthcare Sector Salary
Data engineers and big data engineers in the healthcare industry can make between $80,000 and $120,000 annually, with some more senior roles potentially earning over $165,000 per year.
Retail Sector Salary
Data engineers and big data engineers in the retail industry can earn between $70,000 and $110,000 per year. This salary range can go as high as $135,000 for those in more senior positions.
Please remember that these are merely ballpark figures, and actual compensation will vary greatly depending on experience and skillset.
Monica Rogati, an equity partner at Data Collective, created a now-famous data science hierarchy of needs. It depicts where data engineering falls in the roadmap to becoming a data science/AI-driven organization.
A Data Engineer’s role sits at levels 2 and 3. It’s worth noting that the bottom level, “collect”, is growing larger and larger, thereby driving the need for more Data Engineers.
(Big) Data Engineers are in Demand
Data Scientists as a professional group get a lot of attention and hype. Over the last several months, however, we’ve seen a growing interest in using our technical skills testing platform for data engineering roles.
Data engineering falls into levels 2 and 3 primarily
We understand intuitively the surge in demand for Data Engineer skills testing. LinkedIn’s Emerging Jobs Report and Hired’s 2019 State of Software Engineers Report ranked Data Engineer jobs right up there with Data Scientist and Machine Learning Engineer.
Data engineering jobs grew 38% in 2019
However, for some companies, especially those still finding their legs in data science or AI, it’s not always apparent what data engineering is, what role Data Engineers play within the analytics team and what skills are required (and should be vetted) to do the job.
So in this brief article we thought we’d answer the question “what is data engineering?” We also explain why it is now widely recognized as being extremely important and what the role and skillsets of a Data Engineer are. It’s important to note that the definition of what data engineering is and what a Data Engineer does continues to evolve, so consider this summary a “snapshot”.
How did Data Engineering Come About?
Many would say that data engineering as a profession has been around for well over a decade, maybe a couple, ever since databases, Microsoft SQL Server and ETL came to be. Some would say ever since IBM popularized database management systems in the 1970s. With that, here’s a very brief history recap.
In the 1980s the term “information engineering” was coined, largely to describe database design and to include software engineering in data analysis. Somewhere after the rise of the internet in the 1990s and 2000s, “big data” came to be. Yet DBAs, SQL Developers and IT professionals working in the field were not labeled “Data Engineers” at that time.
So why the new job title?
Let’s summarize by saying that a lot of huge technological changes happened which escalated big data volumes, variety, and velocity. Around 2011 the term “Data Engineer” started to crop up in the circles of new data-driven companies such as Facebook and Airbnb. Sitting on mountains of potentially valuable real-time data, software engineers at these companies needed to develop tools to handle all the data quickly and correctly.
The term “data engineering” evolved to describe a role that moved away from using traditional ETL tools and developed its own tools to handle the increasing volumes of data. As big data grew, “data engineering” came to describe a kind of software engineering that focused deeply on data – data infrastructure, data warehousing, data mining, data modeling, data crunching, and metadata management.
Why the Critical Need for Data Engineering Now?
By now you have probably heard or read about Gartner’s determination back in 2017 that 85% of big data projects fail. This was largely due to a lack of reliable data infrastructure. Fast forward to 2019 and things had not improved. The CTO of IBM said that 87% of data science projects never make it into production. Gartner revised its prediction only slightly, to 80% of projects failing. A NewVantage Partners report produced similar stats.
So why is this?
Over the last decade, most companies have completed a digital transformation. This has produced unimaginable volumes of new types of data and much more complicated data at a higher frequency. While it was previously apparent that Data Scientists were needed to make sense of it all, it was less apparent that someone needs to organize and ensure this data’s quality, security, and availability for the Data Scientists to do their jobs.
The Early Days of Big Data Analytics
In the early days of big data analytics, Data Scientists were expected to build the necessary infrastructure and data pipelines to do their work. This was not necessarily in their skill sets or expectations for the job. The result was that data modeling would not be done correctly. There would be redundant work and inconsistency in the use of data among Data Scientists. These kinds of issues prevented companies from being able to extract optimal value from their data projects, so they failed. It also led to a high rate of Data Scientist turnover that still exists today.
Today with the onslaught of completed corporate digital transformations, the Internet of Things and the race to become AI-driven, it is crystal clear that companies need Data Engineers in abundance to provide the foundation for successful data science initiatives.
This is why we will continue to see the role of Data Engineers grow in importance and breadth. Companies need teams of people whose sole focus is to process data in a way that allows them to extract value from it.
What is the Relationship and Difference between Data Scientists and Data Engineers?
In the past, companies thought that they could get away with having Data Scientists do the role of Data Engineers. This is what has caused much of the “unicorn effect” and shortage in Data Scientist recruitment.
Some Data Scientists also sold themselves as being able to do a Data Engineer’s job. Many fell short – see the image to the right courtesy of O’Reilly.com.
Today, the volume and speed of data have driven Data Scientist and Data Engineer to become two separate and distinct roles, albeit with some overlap.
Companies need both Data Scientists and Data Engineers in an advanced analytics team. It’s pretty difficult to do anything meaningful in data science without Data Engineers to support this function. There’s frequent collaboration between Data Engineers and Data Scientists; however, the priority skills and knowledge of tools are different.
Data Scientists focus on advanced analytics of data generated and stored in a company’s databases. They are highly skilled in math, statistics, R, algorithms, and machine learning techniques. Data Engineers design, manage, and optimize the flow of data to and from those databases throughout the organization. They will be more versed in SQL, MySQL, and NoSQL databases, architecture, and cloud technologies, as well as working methods such as Agile and Scrum.
Both will likely know Python and visualization techniques, and have other coding languages in common.
What Skills do Data Engineers Need?
Data engineers must have specialized skills in creating software solutions around data. At the same time, it’s perhaps unrealistically expected that Data Engineers will be familiar with a breadth of tools and technologies – anywhere from 10 to 30 of them. And these tools are constantly changing. Furthermore, the mix varies by industry.
Some, such as SQL, have been around forever. Others such as Scala are falling out of favor over time. Still others such as AWS are in rapid ascent in terms of demand.
Jeff Hale, a published expert author and instructor on data science and data engineering topics, recently did an analysis of the most in-demand skills asked of Data Engineers on three job platforms. Below is his summary of the top 10 technology skills required.
The variety of skills needed, and the complexity of some of them, make determining the right person for the job very difficult.
What Does a Data Engineer Do & Job Requirements
The requirements to do the job of a data engineer have been accelerating over the last several years. It’s best to think of a “Data Engineer” as a team of people with a portfolio of data engineering skills. Which ones you prioritize will depend on a lot of things.
With that said, important skill areas would be:
- Foundation software engineering – Agile, DevOps, architecture design, service-oriented architecture.
- Distributed systems – This would include software engineer skills and software architect skills.
- Open Frameworks – Apache Spark, Hadoop, perhaps Hive, MapReduce, Kafka and others…
- SQL – This is a database staple and remains that way.
- Programming – Python has become the favored language for working with data. Java, on the other hand, while still widely sought, has fallen out of favor with most data scientists and engineers. Scala is another language, and the one that Apache Spark and Kafka are based on.
- Pandas – a Python library for cleaning and manipulating data.
- Visualization/dashboards
- Cloud platforms – AWS is the most prevalent cloud skill set for Data Engineers to know. Google Cloud Data Engineering and Microsoft Azure are right behind.
- Analytics – While mainly the realm of data scientists, statistical analysis skills and an understanding of the underlying mathematical and probabilistic principles are necessary to shape the data so that it is accessible to the people doing the end analysis.
- Data modeling – Data modeling knowledge is quite important now in the sense that a Data Engineer needs to know how they are going to structure tables, partitions, where to normalize and denormalize data in the warehouse, etc. and how to think about retrieving certain attributes.
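The SQL and data modeling items above can be illustrated with a minimal sketch of the normalize-versus-denormalize decision. The table names, columns, and sample rows are hypothetical, and Python’s built-in SQLite driver stands in for a real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse

# Normalized model: each customer is stored once and referenced by key.
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    amount REAL
);
""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "EU"), (2, "Alan", "US")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 50.0), (11, 1, 25.0), (12, 2, 40.0)],
)

# Denormalized model: customer attributes are copied onto every order row,
# so analysts can aggregate by region without writing a join.
conn.execute("""
CREATE TABLE orders_wide AS
SELECT o.order_id, o.amount, c.name AS customer_name, c.region
FROM orders AS o JOIN customers AS c USING (customer_id)
""")

# Each result row is a (region, total) pair, which dict() turns into a mapping.
revenue_by_region = dict(conn.execute(
    "SELECT region, SUM(amount) FROM orders_wide GROUP BY region ORDER BY region"
))
print(revenue_by_region)
```

The normalized tables avoid repeating customer details and are easier to keep consistent, while the wide table trades storage and update cost for simpler, faster analytical queries. Which side of that trade-off to favor for a given table is the kind of judgment the data modeling skill refers to.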
Lots of Skills Means Lots of Rewards
Yeah, that’s a lot. We know. But just to prove the point, Jeff Hale’s analysis lists the top 30 technologies that employers require of Data Engineers.
With this variety, it’s no wonder some companies are still struggling to figure out what data engineering is. And moreover, how to vet and hire Data Engineers.
If you need any assistance figuring out what a data engineer does or how to test data engineering skills, we’re happy to help!
QuantHub now has data engineering skills tests including Spark. To learn more about how to test for data engineering skills contact our Chief Data Scientist Nathan Black at sales@quanthub.com!