Building highly reliable data pipelines @ Datadog


by DevOpsSchool.com

Rajesh Kumar

(Senior DevOps Manager & Principal Architect)


Rajesh Kumar is an award-winning academician and consultant trainer with 15+ years of experience in diverse skill management and more than a decade of experience training large, diverse groups across multiple industry sectors.

Reliability is the probability that a system will produce correct outputs up to some given time t.





Source: E.J. McCluskey & S. Mitra (2004). "Fault Tolerance," in Computer Science Handbook, 2nd ed., ed. A.B. Tucker. CRC Press.

Highly reliable data pipelines



  1. Architecture
  2. Monitoring
  3. Failure handling

Historical metric queries


Part 1: Architecture

Our big data platform architecture


Many ephemeral clusters


Total isolation


Pick the best hardware for each job


Scale up/down clusters


  • Scale up when we fall behind.
  • Scale as we grow.
  • No more waiting on loaded clusters.

Safer upgrades of EMR/Hadoop/Spark


Spot-instance clusters


How can we build highly reliable data pipelines with instances killed randomly all the time?

No long-running jobs




  • The longer the job, the more work you lose on average.
  • The longer the job, the longer it takes to recover.
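
The first point can be illustrated with a small simulation. This is a minimal sketch under a stated assumption that is not from the talk: the interruption is equally likely at any moment of the run, and everything done since the job started is lost. The function name and parameters are hypothetical.

```python
import random


def average_lost_hours(job_hours: float, trials: int = 100_000, seed: int = 0) -> float:
    """Average work lost when a job is killed at a uniformly random moment.

    Illustrative assumption: the kill time is uniform over the run and all
    progress since the start of the job is lost (no checkpoints).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Work lost equals the time the job had been running when it was killed.
        total += rng.uniform(0.0, job_hours)
    return total / trials


if __name__ == "__main__":
    for hours in (1, 3, 12):
        lost = average_lost_hours(hours)
        print(f"{hours:>2}h job: about {lost:.2f}h lost per interruption")
```

Under this model the average loss is about half the job length, so shortening jobs shrinks both the work lost and the time needed to redo it.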

Break down jobs into smaller pieces




Example


Rollups pipeline

Lessons



  • Many clusters for better isolation.
  • Break down jobs into pieces (no longer than ~3 hours).
  • Trade-off between performance and fault tolerance.
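
One way to break a job into pieces is to split its input time range into fixed spans and run each span as its own job. The sketch below assumes the job is partitioned by time, which the slides do not state; `split_window` and `piece_span` are hypothetical names. `piece_span` is the span of input data per piece, and it would need tuning so that each piece runs within the ~3 hour limit.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def split_window(
    start: datetime, end: datetime, piece_span: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive [piece_start, piece_end) ranges covering [start, end)."""
    if piece_span <= timedelta(0):
        raise ValueError("piece_span must be positive")
    cursor = start
    while cursor < end:
        # The last piece is truncated so the ranges never overshoot `end`.
        piece_end = min(cursor + piece_span, end)
        yield cursor, piece_end
        cursor = piece_end


if __name__ == "__main__":
    day_start = datetime(2024, 1, 1)
    day_end = datetime(2024, 1, 2)
    for piece_start, piece_end in split_window(day_start, day_end, timedelta(hours=4)):
        print(piece_start.isoformat(), "->", piece_end.isoformat())
```

Smaller spans lose less work per interruption but add scheduling and startup overhead, which is the performance versus fault tolerance trade-off noted above.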

Part 2: Monitoring

Reliability is the probability that a system will produce correct outputs up to some given time t.


Monitoring data pipelines


1. Is the data pipeline going to finish before the deadline?


We actively monitor three types of metrics:

  • Data lag metrics.
  • Cluster health metrics.
  • Job health metrics.
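
The deadline question can be turned into a simple projection from lag and throughput. The sketch below is illustrative only: the function names, the notion of "units" of remaining work, and the constant-rate assumption are not from the talk.

```python
from datetime import datetime, timedelta


def data_lag(now: datetime, latest_processed: datetime) -> timedelta:
    """How far the newest processed data point is behind the current time."""
    return now - latest_processed


def projected_finish(now: datetime, remaining_units: float, units_per_hour: float) -> datetime:
    """Estimate the completion time, assuming the current rate stays constant."""
    if units_per_hour <= 0:
        raise ValueError("units_per_hour must be positive")
    return now + timedelta(hours=remaining_units / units_per_hour)


def will_miss_deadline(
    now: datetime, remaining_units: float, units_per_hour: float, deadline: datetime
) -> bool:
    """True when the projected finish time falls after the deadline."""
    return projected_finish(now, remaining_units, units_per_hour) > deadline


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 0, 0)
    deadline = datetime(2024, 1, 1, 6, 0)
    print("lag:", data_lag(now, now - timedelta(minutes=90)))
    print("miss deadline:", will_miss_deadline(now, 40, 5, deadline))
```

Alerting on the projection rather than on the deadline itself leaves time to react, for example by growing the cluster.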

Monitoring data pipelines


2. Is the data produced correct?


  • Add custom counters throughout the pipelines.
    • Count records.
    • Count duplicates.
    • Count records that can’t join.
  • Ad hoc checks on the output data.
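
The counters listed above could look like the following in a single join step. This is a minimal sketch in plain Python rather than the Spark or Hadoop counters a real pipeline would use; the record shape, the `lookup` table, and the counter names are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def join_with_counters(
    records: Iterable[Tuple[str, float]], lookup: Dict[str, str]
) -> Tuple[List[Tuple[str, str, float]], Counter]:
    """Join records against a lookup table while counting what happens to each one."""
    counters: Counter = Counter()
    seen = set()
    joined = []
    for key, value in records:
        counters["records_in"] += 1
        if (key, value) in seen:
            # Exact repeat of an earlier record: count it and drop it.
            counters["duplicates"] += 1
            continue
        seen.add((key, value))
        if key not in lookup:
            # No matching row on the other side of the join.
            counters["unjoinable"] += 1
            continue
        joined.append((key, lookup[key], value))
        counters["records_out"] += 1
    return joined, counters


if __name__ == "__main__":
    records = [("a", 1.0), ("a", 1.0), ("b", 2.0), ("z", 3.0)]
    lookup = {"a": "host-1", "b": "host-2"}
    joined, counters = join_with_counters(records, lookup)
    print(dict(counters))
    # Every input record should be accounted for by exactly one outcome.
    assert counters["records_in"] == (
        counters["duplicates"] + counters["unjoinable"] + counters["records_out"]
    )
```

Publishing these counts as metrics makes it possible to alert when, say, the share of unjoinable records suddenly jumps.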

Lessons


  • Monitoring = will we finish before t? + is the data correct?
  • Measure, measure and measure!
  • Alert on meaningful and actionable metrics.

Part 3: Failure handling

Data pipelines will break


1. Recover fast

    We want to fix the issues ASAP.

2. Degrade gracefully

    We want to limit the customer-facing impact.

Recover fast




  • No long-running jobs.
  • Switch from spot to on-demand clusters.
  • Increase cluster size.
  • Easy ways to rerun jobs (not always trivial!).
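
Two of these knobs, falling back from spot to on-demand capacity and rerunning only what is missing, can be sketched as below. Everything here is hypothetical: `run_piece` stands in for whatever launches one piece on a cluster of the given capacity type and reports success, and the capacity labels are placeholders, not a real cloud API.

```python
from typing import Callable, Iterable, List, Optional, Sequence, Set


def run_with_fallback(
    piece: str,
    run_piece: Callable[[str, str], bool],
    capacity_types: Sequence[str] = ("spot", "on_demand"),
) -> Optional[str]:
    """Try each capacity type in order; return the one that succeeded, or None."""
    for capacity in capacity_types:
        if run_piece(piece, capacity):
            return capacity
    return None


def rerun_incomplete(
    pieces: Iterable[str], done: Set[str], run_piece: Callable[[str, str], bool]
) -> List[str]:
    """Rerun only the pieces not already in `done`; return those that still failed."""
    still_failed = []
    for piece in pieces:
        if piece in done:
            continue  # Finished earlier, so there is nothing to redo.
        if run_with_fallback(piece, run_piece) is None:
            still_failed.append(piece)
        else:
            done.add(piece)
    return still_failed


if __name__ == "__main__":
    def fake_runner(piece: str, capacity: str) -> bool:
        # Pretend the spot cluster is reclaimed while running piece "p2".
        return not (piece == "p2" and capacity == "spot")

    finished = {"p1"}
    print(rerun_incomplete(["p1", "p2", "p3"], finished, fake_runner), sorted(finished))
```

Skipping completed pieces safely depends on each piece writing its output atomically, which is part of why reruns are not always trivial.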

Example: rerun the rollups pipeline


Lessons




  • Think about potential issues ahead of time.
  • Have knobs ready to recover fast.
  • Have knobs ready to limit the customer-facing impact.

Conclusion


Building highly reliable data pipelines


  • Know your time constraints.
  • Break down jobs into small survivable pieces.
  • Monitor cluster metrics, job metrics and data lags.
  • Think about failures ahead of time and get prepared.

Thanks!


We’re hiring!


qf@datadoghq.com

https://jobs.datadoghq.com

DevOpsSchool Community Networks


These platforms give you the opportunity to connect with peers and DevOps industry leaders, where you can share, discuss, or get information on the latest topics and happenings in DevOps culture and grow your professional DevOps network.

Any Questions?


Thank You!


DevOpsSchool: Let's Learn, Share & Practice DevOps

www.devopsschool.com

Contact us: contact@devopsschool.com | +91 700 483 5930