How to Use DataHub: A Practical End-to-End Guide for Your Data Catalog

Ready to turn scattered data into clarity? DataHub, an open-source metadata platform originally built at LinkedIn, helps teams discover, trust, and govern data across warehouses, BI tools, orchestration systems, and more. In this hands-on guide you will go from zero to a working DataHub instance, ingest metadata, explore lineage, and set up governance without getting lost in jargon.

What you will learn at a glance:

Spin up DataHub locally in minutes

Ingest metadata from common sources (e.g., Snowflake, BigQuery, dbt)

Explore search, lineage, ownership, and documentation in the UI

Set up policies, tags, and terms for governance

Roll out a team process that works in practice

Note: This is a practical, problem-focused walkthrough designed to map to real workflows. It points to the official documentation for details and deeper dives where needed.

Quick Start: Get DataHub Running Locally

If you are experimenting with or piloting DataHub, the fastest path is the quickstart. Make sure Docker is installed, then:

Install the DataHub CLI

Start DataHub with a single command

Open the UI and log in with the defaults
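Once the steps above have run, it can help to confirm that something is actually answering on the UI port before you try to log in. The sketch below is a minimal check using only the Python standard library. It assumes the quickstart serves the UI at http://localhost:9002, which you should verify against the quickstart documentation for your version; the function name `ui_is_reachable` is made up for this example.

```python
import urllib.error
import urllib.request

# Assumption: the local quickstart exposes the UI here. Adjust if yours differs.
DEFAULT_UI_URL = "http://localhost:9002"

# Bypass any proxy settings from the environment, since this is a local check.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def ui_is_reachable(url: str = DEFAULT_UI_URL, timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at `url`, whatever the status code."""
    try:
        with _opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with an error status, so something is listening.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, name resolution failure, or timeout.
        return False
```

Calling `ui_is_reachable()` right after startup may return False for a short while as the containers come up, so retrying for a minute or two is reasonable before investigating further.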

The official quickstart documentation covers the details, commands, and defaults. The introduction explains the architecture and why DataHub uses a real-time metadata model (entities, aspects, and streaming updates) suited to modern stacks.

Smart setup tips:

Start locally, even if you plan to move to Kubernetes later. It is faster for winning buy-in and running demos.

If you already have Docker Desktop, you can typically be up and running in minutes.

Keep credentials safe, even in a sandbox. Habits built now pay off later.

Understand the Core Concepts in 5 Minutes

Before you ingest anything, get familiar with DataHub's mental model:

Entities: things such as datasets, tables, charts, dashboards, pipelines, and users

Aspects: versioned “facets” of metadata about an entity (schema, ownership, tags, glossary terms, lineage)

Graph: relationships (lineage, ownership, dependencies) that power search and discovery

This graph-based approach enables features such as impact analysis (what breaks if we change this column?), downstream lineage mapping, and trust signals (owners, tags, documentation). A concise conceptual overview is available in the introduction guide.
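To make the impact-analysis idea concrete, here is a minimal sketch of a downstream walk over a lineage graph. It is plain Python over an in-memory dictionary, not a call into DataHub. The asset names, the edge map `DOWNSTREAMS`, and the function `downstream_impact` are all hypothetical; only the general URN shape follows DataHub's convention.

```python
from collections import deque


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN string in DataHub's general shape."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Hypothetical assets for illustration only.
RAW = dataset_urn("snowflake", "analytics.raw.orders")
STG = dataset_urn("dbt", "analytics.staging.stg_orders")
FCT = dataset_urn("dbt", "analytics.marts.fct_orders")
DIM = dataset_urn("dbt", "analytics.marts.dim_customers")
DASH = "urn:li:dashboard:(looker,dashboards.42)"

# Edge map: each asset points to the assets that read from it.
DOWNSTREAMS = {
    RAW: [STG],
    STG: [FCT, DIM],
    FCT: [DASH],
    DIM: [DASH],
}


def downstream_impact(start: str, downstreams: dict) -> list:
    """Breadth-first walk returning every asset reachable downstream of `start`."""
    seen = set()
    ordered = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in downstreams.get(node, []):
            # Skip assets already visited so diamonds and cycles terminate.
            if child not in seen and child != start:
                seen.add(child)
                ordered.append(child)
                queue.append(child)
    return ordered
```

Here `downstream_impact(RAW, DOWNSTREAMS)` lists the staging model, both marts, and the dashboard, which is the set of assets to review before changing a column in the raw table.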

Ingest Metadata: UI vs. CLI (Choose Your Path)

DataHub supports both user-friendly UI-based ingestion and scriptable CLI pipelines. Pick whichever fits your workflow today; many teams use both.

Option A: UI-Based Ingestion (Fast for First Runs)

In the UI, go to Ingestion → New Source

Choose a source (e.g., Snowflake, BigQuery, dbt, Kafka, Looker, Tableau)

Enter the connection details

Test the connection

Schedule the ingestion or run it on demand

The UI flow and its steps are covered in the official UI ingestion documentation. This path suits non-engineers and teams that want to validate connectivity quickly.

Option B: CLI-Based Ingestion (Repeatable and CI-Friendly)

Create a YAML recipe that defines your source, filters, and mappings

Run: datahub ingest -c recipe.yml

Commit the recipe to version control for reproducibility

CLI ingestion and recipes are documented in detail in the official CLI ingestion guide. This approach is better for dev/prod pipelines, automation, and consistency.
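As a rough illustration of what a recipe contains, the sketch below builds one as a Python dictionary and renders it as text. Several things here are assumptions to check against the connector documentation: the Snowflake field names (`account_id`, `username`, `password`, `database_pattern`), the `datahub-rest` sink pointing at http://localhost:8080, and the `${VAR}` placeholders being expanded from environment variables when the CLI loads the file. The rendering uses JSON, which is also valid YAML for a simple document like this, so the standard library is enough. The helper names are made up.

```python
import json
from pathlib import Path


def build_recipe(account_id: str, database: str,
                 sink_server: str = "http://localhost:8080") -> dict:
    """Return a recipe as a plain dict with a source and a sink section."""
    return {
        "source": {
            "type": "snowflake",
            "config": {
                "account_id": account_id,
                # Placeholders so that no secret is committed with the recipe.
                "username": "${SNOWFLAKE_USER}",
                "password": "${SNOWFLAKE_PASSWORD}",
                # Only ingest the one database we care about at first.
                "database_pattern": {"allow": [f"^{database}$"]},
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": sink_server}},
    }


def render_recipe(recipe: dict) -> str:
    """Serialize the recipe as JSON text, which YAML parsers also accept."""
    return json.dumps(recipe, indent=2)


def write_recipe(recipe: dict, path: str = "recipe.yml") -> None:
    """Write the rendered recipe to disk for use with the CLI."""
    Path(path).write_text(render_recipe(recipe), encoding="utf-8")
```

After `write_recipe(build_recipe("my_account", "ANALYTICS"))`, the file can be passed to the command shown above (`datahub ingest -c recipe.yml`) and committed alongside your pipeline code. Most teams write the YAML by hand; generating it is mainly useful when many near-identical recipes differ only in a few values.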

Ingestion tips:

Start with the one or two most important sources (e.g., Snowflake + dbt). Quick wins build momentum.

Filter aggressively. Do not ingest every sandbox dataset on day one; it creates noise.

Add platform instance names (e.g., snowflake:prod vs. snowflake:dev) to avoid confusion.
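The filtering and platform-instance tips above can be sketched in a few lines. The first function imitates allow/deny regex filtering of the kind many source configurations offer; the exact matching rules used here (prefix match, case-insensitive, deny wins) are an assumption for illustration. The second shows how a platform instance might keep prod and dev assets apart, assuming a URN layout in which the instance is prefixed to the dataset name. Both function names are hypothetical.

```python
import re


def is_allowed(name: str, allow=(".*",), deny=()) -> bool:
    """Keep a name if it matches some allow regex and no deny regex."""
    if any(re.match(pattern, name, re.IGNORECASE) for pattern in deny):
        return False
    return any(re.match(pattern, name, re.IGNORECASE) for pattern in allow)


def dataset_urn(platform: str, name: str, env: str = "PROD",
                platform_instance: str = "") -> str:
    """Build a dataset URN, prefixing the platform instance when one is given."""
    full_name = f"{platform_instance}.{name}" if platform_instance else name
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{full_name},{env})"


# Example patterns: keep the analytics database, drop any sandbox schema.
ALLOW = [r"analytics\..*"]
DENY = [r".*\.sandbox\..*"]
```

With these patterns, `analytics.public.orders` is kept while `analytics.sandbox.tmp_orders` is dropped, and the same table ingested from two Snowflake accounts gets two distinct URNs instead of colliding.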

Explore the UI: Search, Lineage, and Ownership

Once your first ingestion completes, jump into the UI to validate the value quickly:

Universal Search: find datasets, dashboards, and pipelines by name, schema, tags, or glossary terms.

Lineage Graph: click into a dataset to see its upstream and downstream connections. This is invaluable for impact analysis.

Ownership & Documentation: add owners (teams or users) and write clear descriptions. These are the first trust signals your organization will notice.

Schema & Profiling: review column names, types, and sample stats, and catch anomalies early.
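For a sense of what “sample stats” means in the profiling item above, the following sketch computes a few per-column statistics over an in-memory sample. It is a standalone illustration, not how DataHub's profiler is implemented; the function `profile_column` and the sample rows are invented.

```python
def profile_column(rows: list, column: str) -> dict:
    """Compute simple statistics for one column of a list of row dicts."""
    values = [row.get(column) for row in rows]
    non_null = [value for value in values if value is not None]
    total = len(values)
    return {
        "row_count": total,
        "null_fraction": (total - len(non_null)) / total if total else 0.0,
        "distinct_count": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


# Invented sample rows for illustration.
SAMPLE_ROWS = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": 25.5},
    {"order_id": 4, "amount": 10.0},
]
```

A null fraction that jumps between runs, or a minimum that turns negative on an amount column, is the kind of anomaly worth catching before consumers do.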

Add Meaning: Glossary, Tags, and Domains

Raw metadata is only the starting point. You unlock real adoption by layering semantics on top:

Glossary Terms: define business-friendly concepts (Customer, ARR, Active User) and attach them to datasets and columns to standardize language.

Tags: lightweight labels (PII, Critical, Deprecated, Gold) that give quick visual cues for risk and importance.

Domains: group related assets by business function (Finance, Marketing) or by platform.

Recommended first taxonomy:

Three glossary terms that everyone understands (Customer, Order, Revenue)

A small tag set: pii, gold, deprecated, experimental

5–7 domains that map to your org chart or data platforms
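A small taxonomy like the one above stays useful only if people stick to it. One way to keep it consistent is to treat the vocabulary as data and check annotations against it, as in this sketch. The validation function is hypothetical and runs outside DataHub; the URN helpers follow the general `urn:li:tag:` and `urn:li:glossaryTerm:` shapes, which you should confirm for your deployment.

```python
# The starter vocabulary from the list above.
ALLOWED_TAGS = {"pii", "gold", "deprecated", "experimental"}
GLOSSARY_TERMS = {"Customer", "Order", "Revenue"}


def tag_urn(name: str) -> str:
    """URN string for a tag."""
    return f"urn:li:tag:{name}"


def term_urn(name: str) -> str:
    """URN string for a glossary term."""
    return f"urn:li:glossaryTerm:{name}"


def validate_annotations(tags: list, terms: list) -> list:
    """Return a sorted list of problems; an empty list means all is well."""
    problems = []
    for tag in tags:
        if tag not in ALLOWED_TAGS:
            problems.append(f"unknown tag: {tag}")
    for term in terms:
        if term not in GLOSSARY_TERMS:
            problems.append(f"unknown glossary term: {term}")
    return sorted(problems)
```

Running a check like this in code review catches near-duplicates such as `PII` vs. `pii` or `Clients` vs. `Customer` before they spread across the catalog.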

Governance That Scales: Policies and Access

DataHub supports role-based and asset-based policies, so you can control who can do what (edit documentation, add tags, manage lineage, and so on). Start simple:

Create a “Stewards” group with edit rights on docs, ownership, and tags

Give analysts access to most assets, but restrict sensitive domains

Require owners on “gold” datasets before they appear in “Top Picks”

Policies and governance live inside the platform, so the experience is consistent for editors and viewers. As your organization grows, extend them with finer-grained permissions and approval flows.
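Before configuring policies in the platform, it can help to write the intended rules down in an executable form and agree on them as a team. The sketch below models the three starter rules above. It is a toy model, not DataHub's policy engine: the `Policy` class, the action names, and the choice of Finance as the sensitive domain are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """A toy policy: a group, the actions it may take, and domains it may not touch."""
    group: str
    actions: frozenset
    denied_domains: frozenset = frozenset()


# Hypothetical starter policies mirroring the list above.
POLICIES = [
    Policy("stewards", frozenset({"view", "edit_docs", "edit_owners", "edit_tags"})),
    Policy("analysts", frozenset({"view"}), denied_domains=frozenset({"Finance"})),
]


def is_permitted(groups: set, action: str, domain: str, policies=POLICIES) -> bool:
    """True if any policy for one of the user's groups grants the action in that domain."""
    for policy in policies:
        if (policy.group in groups
                and action in policy.actions
                and domain not in policy.denied_domains):
            return True
    return False


def may_be_gold(owners: list) -> bool:
    """A dataset needs at least one owner before it can carry the gold tag."""
    return len(owners) > 0
```

Encoding the rules this way gives you a short list of cases to verify by hand in the UI after the real policies are created.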

Operational Best Practices: Make It Stick

Metadata programs fail when they feel like extra work. Make DataHub part of the normal flow:

Embed in PRs/CI: when data pipelines change, run a metadata ingest and compare schema diffs. Flag breaking changes automatically.

Align with dbt: use dbt docs, tests, and exposures, and surface them in DataHub to connect code with business context.

Create an “Adoption Playbook”: owners add docs, tags, and glossary terms during onboarding. Reward quality through scorecards.

Publish a Data Contract: for key tables, define SLA, freshness, nullability, and stability rules, and surface them in DataHub.
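The schema-diff step in the PR/CI item above can be sketched as a comparison of two column-to-type mappings. This is a minimal illustration with invented schemas and function names; it treats removed columns and changed types as breaking and added columns as safe, which is a common convention but one your team may want to adjust.

```python
def diff_schema(old: dict, new: dict) -> dict:
    """Compare two {column: type} mappings and group the differences."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(col for col in shared if old[col] != new[col]),
    }


def breaking_changes(old: dict, new: dict) -> list:
    """Describe the changes likely to break downstream consumers."""
    diff = diff_schema(old, new)
    messages = [f"removed column {col}" for col in diff["removed"]]
    messages += [
        f"type changed on {col}: {old[col]} -> {new[col]}"
        for col in diff["retyped"]
    ]
    return messages


# Invented before/after schemas for illustration.
OLD_SCHEMA = {"id": "NUMBER", "amount": "FLOAT", "note": "VARCHAR"}
NEW_SCHEMA = {"id": "NUMBER", "amount": "VARCHAR", "created_at": "TIMESTAMP"}
```

In a CI job, a non-empty result from `breaking_changes` would fail the build or post a comment on the pull request so that the owners of downstream assets can weigh in.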

From Pilot to Production: What Changes

Infrastructure: move from local Docker to a managed environment (Kubernetes, cloud services). Consider a hosted option if one is available to your organization.

Auth/SSO: integrate with your identity provider (Okta, Azure AD, etc.)

Observability: monitor ingestion jobs, graph size, and UI performance

Change Management: establish a metadata review cadence (e.g., weekly stewardship syncs)

Troubleshooting: Common Pitfalls and Fixes

“I can't see my tables”: check network rules, credentials, and source filters. Run a minimal ingestion recipe to isolate the problem.

“Lineage is incomplete”: make sure you have ingested from orchestration (Airflow), transformation (dbt), and warehouse sources. Lineage usually requires multiple connectors.

“Search feels cluttered”: tighten filters, add tags and glossary terms, and hide deprecated assets.

“Docs are out of date”: schedule regular ingestion, and encourage owners to update descriptions alongside code changes.

Example: A Fast Path to Value in 48 Hours

Day 1

Spin up DataHub locally via the quickstart

Ingest from your warehouse (Snowflake/BigQuery) using UI ingestion

Add owners and descriptions to five critical datasets

Create glossary terms for Customer and Revenue; tag those datasets as gold

Day 2

Ingest dbt metadata to connect models to tables

Verify lineage across ingestion → transformation → BI

Create a policy so that only stewards can change gold dataset docs

Demo the lineage view and search experience to stakeholders; collect feedback

Key References

Quickstart: local setup, credentials, ports, commands

Concepts and architecture overview

UI-based ingestion steps

CLI ingestion and YAML recipes

How Sider.AI Can Help

If your team researches best practices, writes dataset docs, or regularly needs readable summaries of lineage and schema changes, it is worth knowing that Sider.AI can speed up documentation and knowledge sharing. For example, you can turn dense schema diffs into human-readable change logs, or generate first-draft dataset descriptions for stewards to refine, which shortens the path from raw metadata to usable context.

Cheat Sheet: Your First 10 Actions

Launch DataHub locally via the quickstart

Add one warehouse source via UI ingestion

Ingest dbt or orchestration metadata for lineage

Add owners to 5–10 key datasets

Write concise descriptions (2–3 sentences each)

Create 3 glossary terms and 4–6 tags

Tag 5 datasets as gold and hide deprecated ones

Set one editor policy for stewards

Schedule daily ingestion

Demo the UI to 2 stakeholder teams and collect feedback

What’s Next?

Scale to Kubernetes or a managed environment

Roll out SSO and groups for governance

Expand ingestion to BI and event streams

Build scorecards for data quality and documentation completeness

Integrate with CI/CD so that schema changes are always reflected in the catalog
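The documentation-completeness scorecard mentioned in the list above can start very simply. The sketch below scores a dataset on four yes/no checks and reports what is missing. The record layout (`owners`, `description`, `tags`, `glossary_terms`) and the equal weighting are assumptions for illustration; in practice the inputs would come from your catalog rather than a hand-written dictionary.

```python
def completeness(dataset: dict) -> tuple:
    """Return (score between 0 and 1, list of failed check names) for one dataset."""
    checks = {
        "has_owner": bool(dataset.get("owners")),
        "has_description": bool((dataset.get("description") or "").strip()),
        "has_tags": bool(dataset.get("tags")),
        "has_terms": bool(dataset.get("glossary_terms")),
    }
    score = sum(checks.values()) / len(checks)
    missing = [name for name, passed in checks.items() if not passed]
    return score, missing


# Invented record for illustration.
EXAMPLE_DATASET = {
    "name": "analytics.marts.fct_orders",
    "owners": ["data-platform-team"],
    "description": "One row per order, refreshed daily.",
    "tags": [],
    "glossary_terms": [],
}
```

Averaging the score per team or per domain gives a number to track over time, and the list of missing items tells owners exactly what to fix next.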

Final Takeaways

Start small, deliver value quickly, and iterate

Use UI ingestion for speed and the CLI for repeatability

Layer in glossary terms, tags, and policies early to boost trust

Connect warehouse + dbt + BI for complete lineage

Treat documentation as part of development, not an afterthought

FAQ

Q1: What is DataHub and why should I use it?

DataHub is an open-source metadata platform for discovery, lineage, and governance across your data stack. It helps teams find trusted datasets, understand impact, and standardize documentation. Learn the fundamentals in the official introduction.

Q2: How do I install DataHub quickly?

Use the quickstart: install Docker, install the CLI, then start DataHub with a single command. You can access the UI locally and log in with the defaults to validate the setup quickly.

Q3: Should I use UI ingestion or CLI ingestion in DataHub?

Use UI-based ingestion to get started quickly or to involve non-engineers; it suits first-time connectivity checks and demos. Switch to CLI ingestion for versioned recipes, automation, and CI/CD integration.

Q4: How do I get lineage to show up in DataHub?

Ingest from multiple sources: your warehouse (e.g., Snowflake), your transformation layer (e.g., dbt), and your orchestration (e.g., Airflow). Lineage emerges as DataHub connects these pieces.

Q5: Which governance features should I enable first in DataHub?

Start with ownership, concise descriptions, a small glossary, and consistent tags such as gold, pii, and deprecated. Then add policies to control who can edit critical assets, and schedule regular ingestion.