🕷️ Crawler Inspector

URL Lookup

Direct Parameter Lookup

Raw Queries and Responses

1. Shard Calculation

Query:
Response:
Calculated Shard: 76 (from laksa118)

2. Crawled Status Check

Query:
Response:

3. Robots.txt Check

Query:
Response:

4. Spam/Ban Check

Query:
Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄 INDEXABLE
✅ CRAWLED (19 days ago)
🤖 ROBOTS SERVER UNREACHABLE
Failed to connect to robots server: Operation timed out after 2002 milliseconds with 0 bytes received

Page Info Filters

| Filter | Status | Condition | Details |
| --- | --- | --- | --- |
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.7 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
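
The five filter conditions above can be sketched as plain predicates over a single page record. This is only an illustration under stated assumptions: the `PageRecord` class, the `evaluate_filters` and `is_indexable` helpers, the 183-day approximation of "6 MONTH", and the rule that a page is indexable when every filter passes are all hypothetical and not taken from the inspector's actual implementation. The field names mirror the column names shown in the Condition column, and `fh_dont_index=0` in the example is assumed because the report only shows `ml_spam_score=0`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class PageRecord:
    """Hypothetical container for the columns named in the filter conditions."""
    download_http_code: int
    download_stamp: datetime
    history_drop_reason: Optional[str]
    fh_dont_index: int
    ml_spam_score: float
    meta_canonical: Optional[str]
    src_unparsed: str


def evaluate_filters(page: PageRecord, now: datetime) -> Dict[str, bool]:
    """Return PASS (True) / FAIL (False) for each filter row."""
    six_months = timedelta(days=183)  # assumption: rough stand-in for "6 MONTH"
    return {
        "HTTP status": page.download_http_code == 200,
        "Age cutoff": page.download_stamp > now - six_months,
        "History drop": page.history_drop_reason is None,
        "Spam/ban": page.fh_dont_index != 1 and page.ml_spam_score == 0,
        # Canonical passes when unset, empty, or pointing back at the page itself.
        "Canonical": page.meta_canonical in (None, "", page.src_unparsed),
    }


def is_indexable(page: PageRecord, now: datetime) -> bool:
    """Assumption: a page counts as indexable only when every filter passes."""
    return all(evaluate_filters(page, now).values())


# Values taken from the Page Details section; "now" is 19 days after the crawl.
example = PageRecord(
    download_http_code=200,
    download_stamp=datetime(2026, 3, 23, 8, 59, 50),
    history_drop_reason=None,
    fh_dont_index=0,  # assumed; not shown in the report
    ml_spam_score=0,
    meta_canonical=None,
    src_unparsed="https://datascienceweekly.substack.com/p/data-science-weekly-issue-616",
)
example_now = datetime(2026, 4, 11, 9, 0, 0)

for name, passed in evaluate_filters(example, example_now).items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

Keeping each condition as a named entry in a dictionary makes it easy to report which individual filter failed, which is what the Status column above does.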

Page Details

| Property | Value |
| --- | --- |
| URL | https://datascienceweekly.substack.com/p/data-science-weekly-issue-616 |
| Last Crawled | 2026-03-23 08:59:50 (19 days ago) |
| First Indexed | 2025-09-11 11:38:03 (7 months ago) |
| HTTP Status Code | 200 |
| Meta Title | Data Science Weekly - Issue 616 |
| Meta Description | Curated news, articles and jobs related to Data Science, AI, & Machine Learning |
| Meta Canonical | null |
Boilerpipe Text
Hello! Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds. And now…let's dive into some interesting links from this week. Minesweeper thermodynamics You know how sometimes you start a game of Minesweeper and immediately get stuck? Like maybe there are some cells that you know are mines, but there aren’t any places that are safe to click…In statistical mechanics, the Boltzmann distribution is a law that tells you how likely a physical system is to be in a particular state. It works in the context that your system is in equilibrium with a larger environment that acts as a ‘heat bath’, holding it at a particular temperature 𝑇…I want to apply it to Minesweeper. The idea is that our little corner of the Minesweeper grid is like a physical system within a larger environment; a ‘mine bath’… 30 minutes with a stranger In this story, we’ll go through 30 minutes of conversation between the people you see here…They are a subset of nearly 1,700 conversations between about 1,500 people as part of a research project called the CANDOR corpus. The goal was to gather a huge amount of data to spur research on how we converse… Building Vector Tiles from scratch As I add more data to the NYC Chaos Dashboard, a website that maps live urban activity, I have been looking for a more efficient way to render the map. Since I collect all of the data in one process and return the Dashboard as one HTML file, I kept wondering how I could optimize the map’s loading time by pre-processing the data as much as possible in the backend. This is where vector tiles come in… POLL Want me to mail you a paper newsletter once a month? Yes 38% No 54% Maybe 8% POLL CLOSED I recently signed up for one and it's awesome and way less stressful than opening my inbox and seeing 72 million newsletters all vying for my attention.
Right now I get 2 newsletters in paper via mail per month and it’s great! Now we’re super curious about what you all would do! Challenges in Statistics: A Dozen Challenges in Causality and Causal Inference Our goal in this discussion is to outline research directions and open problems we view as particularly promising for future work. Throughout we emphasize that advancing causal research requires a wide range of contributions, from novel theory and methodological innovations to improved software tools and closer engagement with domain scientists and practitioners… Professionals who perform at the top 10% of their respective organizations [Reddit] What’s something you do that most people around you don’t?… The two versions of Parquet A few days ago, the creators of DuckDB wrote the article: Query Engines: Gatekeepers of the Parquet File Format, which explained how the engines that process Parquet files as SQL tables are blocking the evolution of the format. This is because those engines are not fully supporting the latest specification, and without this support, the rest of the ecosystem has no incentive to adopt it… Transparent, Robust and Ultra-Sparse Trees (TRUST) It achieves comparable accuracy to state-of-the-art machine learning algorithms - including black box models like Random Forest - while remaining fully interpretable. Scroll down for a short demo of TRUST. Current version solves regression problems (variants like time series only experimentally). Extensions to multiclass classification and beta regression are already under development and I will soon make them available as well… Welcome to the synthetic data tutorial This self-paced tutorial will introduce you to the generation and evaluation of synthetic data. Synthetic data is generated data that can be used as an alternative to privacy-sensitive data, for example to enhance open science practices.
Advantages of open (synthetic) data are numerous: other researchers can re-run analyses with data that is close to the actual data, which allows them to verify the main results. Additionally, open (synthetic) data allows researchers to perform exploratory analyses that may lead to novel hypotheses, and in quite some instances performing such analyses with synthetic data yields rather accurate results… Dataframes in Haskell The goal of this document is to detail the design of a dataframe library for exploratory data analysis (EDA) in Haskell. In addition to fulfilling the usual functional requirements of a dataframe library, the library must also have many modern features learned from years of development in the space… Bayesian Inference is Just Counting Conceptual introduction to Bayesian data analysis, focusing on foundations and causal inference. Nothing really about computational details… From Frequencies to Coverage: Rethinking What “Representative” Means Whether you build an image classifier or want to estimate the average rent in Bologna, you need data. But not just any data, the data should be “representative”: A dog image classifier shouldn’t only be trained on images of dogs in spooky costumes, and the Bologna dataset shouldn’t only contain apartments above restaurants. But what exactly does “representative” mean? Let’s start with a very general definition… Simulating and Visualising the Central Limit Theorem In this post I want to interrogate and explore the CLT using simulation and visualisation in an attempt to understand how it works in practice, not in theory… How to Spot (and Fix) 5 Common Performance Bottlenecks in pandas Workflows Slow data loads, memory-intensive joins, and long-running operations—these are problems every Python practitioner has faced. They waste valuable time and make iterating on your ideas harder than it should be.
This post walks through five common pandas bottlenecks, how to recognize them, and some workarounds you can try on CPU with a few tweaks to your code… Why Netflix Struggles To Make Good Movies: A Data Explainer Why do Netflix films keep falling flat?…What genuinely interests me is finding a plausible explanation for why a $530 billion company consistently falls short in its attempts to make great movies. So today, we'll unpack what drives Netflix's underwhelming film output—and explore what purpose these streaming movies are supposed to serve… Will Amazon S3 Vectors Kill Vector Databases—or Save Them? Not too long ago, AWS dropped something new: S3 Vectors. It’s their first attempt at a vector storage solution, letting you store and query vector embeddings for semantic search right inside Amazon S3…instead of “killing” vector databases, I see it fitting into the ecosystem as a complementary piece. In fact, its real future probably lies in working with professional vector databases, not replacing them. In this post, I’ll walk you through why I think that—looking at it from three angles: the tech itself, what it can and can’t do, and what it means for the market… Data Modeling Guide for Real-Time Analytics with ClickHouse: From S3 Ingestion to Sub-Second Dashboards This article is for data engineers and practitioners who want to build analytics that deliver sub-second query responses, and who want to unlock ClickHouse’s full potential for real-time analytics demands. By the end, you’ll have a playbook for ClickHouse data modeling plus a working example that ingests NOAA weather data from S3 and visualizes it with a single configuration file… What over-engineered tool did you finally replace with something simple? A One-Page Primer on: Statistical Power DSPy 0‑to‑1 Guide: Building Self‑Improving LLM Applications from Scratch. * Based on unique clicks. ** Find last week's issue #615 here.
Against the Uncritical Adoption of 'AI' Technologies in Academia Things that screw up your causal inference The Leverage of LLMs for Individuals CompressGPT: Decrease Token Usage by ~70% Patterns, Predictions, and Actions: A story about machine learning [Book] Generalized Additive Models, A Review Untangling Sample and Population Level Estimands in Bayesian Causal Inference. Looking to get a job? Check out our “Get A Data Science Job” Course It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume. Promote yourself/organization to ~68,500 subscribers by sponsoring this newsletter. 30-40% weekly open rate. Thank you for joining us this week! :) Stay Data Science-y! All our best, Hannah & Sebastian
Markdown
[![Data Science Weekly Newsletter](https://substackcdn.com/image/fetch/$s_!I8ji!,w_40,h_40,c_fill,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3294ef46-cb03-42ea-b7d6-9f8e8b0f41f6_253x253.png)](https://datascienceweekly.substack.com/)

# [Data Science Weekly Newsletter](https://datascienceweekly.substack.com/)

# Data Science Weekly - Issue 616

### Curated news, articles and jobs related to Data Science, AI, & Machine Learning

[Data Science Weekly](https://substack.com/@datascienceweekly)

Sep 11, 2025

[![Data Science Weekly](https://substackcdn.com/image/fetch/$s_!byfl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17becea5-db12-4465-be92-858de78b9137_319x253.png)](https://substackcdn.com/image/fetch/$s_!byfl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17becea5-db12-4465-be92-858de78b9137_319x253.png)

## **Issue \#616 September 11, 2025**

***

Hello\!

**Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.**

***

Data Science Weekly Newsletter is a reader-supported publication. To receive new posts and support our work, consider becoming a free or paid subscriber.

***

***And now…let's dive into some interesting links from this week.***

***

## **Editor's Picks**

- **[Minesweeper thermodynamics](https://oscarcunningham.com/792/minesweeper-thermodynamics/)** You know how sometimes you start a game of Minesweeper and immediately get stuck? Like maybe there are some cells that you know are mines, but there aren’t any places that are safe to click…In statistical mechanics, the Boltzmann distribution is a law that tells you how likely a physical system is to be in a particular state.
It works in the context that your system is in equilibrium with a larger environment that acts as a ‘heat bath’, holding it at a particular temperature 𝑇…I want to apply it to Minesweeper. The idea is that our little corner of the Minesweeper grid is like a physical system within a larger environment; a ‘mine bath’…
- **[30 minutes with a stranger](https://pudding.cool/2025/06/hello-stranger/)** In this story, we’ll go through 30 minutes of conversation between the people you see here…They are a subset of nearly 1,700 conversations between about 1,500 people as part of a research project called the *CANDOR corpus*. The goal was to gather a huge amount of data to spur research on how we converse…
- **[Building Vector Tiles from scratch](https://www.debuisne.com/writing/geo-tiles/)** As I add more data to the *NYC Chaos Dashboard*, a website that maps live urban activity, I have been looking for a more efficient way to render the map. Since I collect all of the data in one process and return the Dashboard as one HTML file, I kept wondering how I could optimize the map’s loading time by pre-processing the data as much as possible in the backend. This is where vector tiles come in…

***

# **What’s on your mind**

## This Week’s Poll:

POLL

### Want me to mail you a paper newsletter once a month?

Yes 38%
No 54%
Maybe 8%

POLL CLOSED

I recently signed up for one and it's awesome and way less stressful than opening my inbox and seeing 72 million newsletters all vying for my attention. Right now I get 2 newsletters in paper via mail per month and it’s great\!
## Last Week’s Poll:

[![](https://substackcdn.com/image/fetch/$s_!zPrN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416a8b9e-af00-4b9a-86c8-5cdc76162e6f_578x351.png)](https://substackcdn.com/image/fetch/$s_!zPrN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416a8b9e-af00-4b9a-86c8-5cdc76162e6f_578x351.png)

Now we’re super curious about what you all would do\!

***

## Data Science Articles & Videos

- **[Challenges in Statistics: A Dozen Challenges in Causality and Causal Inference](https://arxiv.org/abs/2508.17099)** Our goal in this discussion is to outline research directions and open problems we view as particularly promising for future work. Throughout we emphasize that advancing causal research requires a wide range of contributions, from novel theory and methodological innovations to improved software tools and closer engagement with domain scientists and practitioners…
- **[Professionals who perform at the top 10% of their respective organizations \[Reddit\]](https://www.reddit.com/r/productivity/comments/1n5sca8/professionals_who_perform_at_the_top_10_of_their/)** What’s something you do that most people around you don’t?…
- **[The two versions of Parquet](https://www.jeronimo.dev/the-two-versions-of-parquet/)** A few days ago, the creators of DuckDB wrote the article: Query Engines: Gatekeepers of the Parquet File Format, which explained how the engines that process Parquet files as SQL tables are blocking the evolution of the format.
This is because those engines are not fully supporting the latest specification, and without this support, the rest of the ecosystem has no incentive to adopt it…
- **[Transparent, Robust and Ultra-Sparse Trees (TRUST)](https://adc-trust-ai.github.io/trust/)** It achieves comparable accuracy to state-of-the-art machine learning algorithms - including black box models like Random Forest - while remaining fully interpretable. Scroll down for a short demo of TRUST. Current version solves regression problems (variants like time series only experimentally). Extensions to multiclass classification and beta regression are already under development and I will soon make them available as well…
- **[Welcome to the synthetic data tutorial](https://lmu-osc.github.io/synthetic-data-tutorial/)** This self-paced tutorial will introduce you to the generation and evaluation of synthetic data. Synthetic data is generated data that can be used as an alternative to privacy-sensitive data, for example to enhance open science practices. Advantages of open (synthetic) data are numerous: other researchers can re-run analyses with data that is close to the actual data, which allows them to verify the main results. Additionally, open (synthetic) data allows researchers to perform exploratory analyses that may lead to novel hypotheses, and in quite some instances performing such analyses with synthetic data yields rather accurate results…
- **[Dataframes in Haskell](https://docs.google.com/document/d/1oIX_OWzoTXFeN9q7ZRuDuP1mQaRSvu4RhT2Tnj8uV2c/edit?pli=1&tab=t.0#heading=h.5k7zdcnx6q5e)** The goal of this document is to detail the design of a dataframe library for exploratory data analysis (EDA) in Haskell.
In addition to fulfilling the usual functional requirements of a dataframe library, the library must also have many modern features learned from years of development in the space…
- **[Bayesian Inference is Just Counting](https://www.youtube.com/watch?v=_NEMHM1wDfI)** Conceptual introduction to Bayesian data analysis, focusing on foundations and causal inference. Nothing really about computational details…
- **[From Frequencies to Coverage: Rethinking What “Representative” Means](https://mindfulmodeler.substack.com/p/the-two-cultures-of-representativeness)** Whether you build an image classifier or want to estimate the average rent in Bologna, you need data. But not just any data, the data should be “representative”: A dog image classifier shouldn’t only be trained on images of dogs in spooky costumes, and the Bologna dataset shouldn’t only contain apartments above restaurants. But what exactly does “representative” mean? Let’s start with a very general definition…
- **[Simulating and Visualising the Central Limit Theorem](https://blog.foletta.net/post/2025-07-14-clt/)** In this post I want to interrogate and explore the CLT using simulation and visualisation in an attempt to understand how it works in practice, not in theory…
- **[How to Spot (and Fix) 5 Common Performance Bottlenecks in pandas Workflows](https://developer.nvidia.com/blog/how-to-spot-and-fix-5-common-performance-bottlenecks-in-pandas-workflows/)** Slow data loads, memory-intensive joins, and long-running operations—these are problems every Python practitioner has faced. They waste valuable time and make iterating on your ideas harder than it should be.
This post walks through five common pandas bottlenecks, how to recognize them, and some workarounds you can try on CPU with a few tweaks to your code…
- **[Why Netflix Struggles To Make Good Movies: A Data Explainer](https://www.statsignificant.com/p/why-netflix-struggles-to-make-good)** Why do Netflix films keep falling flat?…What genuinely interests me is finding a plausible explanation for why a \$530 billion company consistently falls short in its attempts to make great movies. So today, we'll unpack what drives Netflix's underwhelming film output—and explore what purpose these streaming movies are supposed to serve…
- **[Will Amazon S3 Vectors Kill Vector Databases—or Save Them?](https://zilliz.com/blog/will-amazon-s3-vectors-kill-vector-databases-or-save-them)** Not too long ago, AWS dropped something new: S3 Vectors. It’s their first attempt at a vector storage solution, letting you store and query vector embeddings for semantic search right inside Amazon S3…instead of “killing” vector databases, I see it fitting into the ecosystem as a complementary piece. In fact, its real future probably lies in working with professional vector databases, not replacing them. In this post, I’ll walk you through why I think that—looking at it from three angles: the tech itself, what it can and can’t do, and what it means for the market…
- **[Data Modeling Guide for Real-Time Analytics with ClickHouse: From S3 Ingestion to Sub-Second Dashboards](https://www.ssp.sh/blog/practical-data-modeling-clickhouse/)** This article is for data engineers and practitioners who want to build analytics that deliver sub-second query responses, and who want to unlock ClickHouse’s full potential for real-time analytics demands. By the end, you’ll have a playbook for ClickHouse data modeling plus a working example that ingests NOAA weather data from S3 and visualizes it with a single configuration file…
***

## Last Week's Newsletter's 3 Most Clicked Links

- **[What over-engineered tool did you finally replace with something simple?](https://www.reddit.com/r/dataengineering/comments/1n2u1ta/what_overengineered_tool_did_you_finally_replace/)**
- **[A One-Page Primer on: Statistical Power](https://www.carlislerainey.com/blog/2025-08-30-1p-statistical-power/)**
- **[DSPy 0‑to‑1 Guide: Building Self‑Improving LLM Applications from Scratch](https://github.com/haasonsaas/dspy-0to1-guide)**

\* Based on unique clicks.
\*\* Find last week's issue \#615 [here](https://datascienceweekly.substack.com/p/data-science-weekly-issue-615).

***

## Cutting Room Floor

- **[Against the Uncritical Adoption of 'AI' Technologies in Academia](https://zenodo.org/records/17065099)**
- **[Things that screw up your causal inference](https://www.benkuhn.net/causreg/)**
- **[The Leverage of LLMs for Individuals](https://mazzzystar.com/2023/05/10/LLM-for-individual/)**
- **[CompressGPT: Decrease Token Usage by ~70%](https://musings.yasyf.com/compressgpt-decrease-token-usage-by-70/)**
- **[Patterns, Predictions, and Actions: A story about machine learning \[Book\]](https://mlstory.org/)**
- **[Generalized Additive Models, A Review](https://www.annualreviews.org/content/journals/10.1146/annurev-statistics-112723-034249)**
- **[Untangling Sample and Population Level Estimands in Bayesian Causal Inference](https://arxiv.org/abs/2508.15016)**

***

## **Whenever you're ready, 2 ways we can help:**

1. **[Looking to get a job?
Check out our](https://www.datascienceweekly.org/data-science-guides/data-science-getting-started-guide)** ***[“Get A Data Science Job”](https://www.datascienceweekly.org/data-science-guides/data-science-getting-started-guide)*** **[Course](https://www.datascienceweekly.org/data-science-guides/data-science-getting-started-guide)** It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
2. **[Promote yourself/organization to ~68,500 subscribers](https://www.datascienceweekly.org/advertising)** by sponsoring this newsletter. 30-40% weekly open rate.

***

Thank you for joining us this week! :)

Stay Data Science-y\!

All our best,
Hannah & Sebastian
#### Discussion about this post

[Michael Chavinda](https://substack.com/profile/156626454-michael-chavinda?utm_source=substack-feed-item) [Sep 17](https://datascienceweekly.substack.com/p/data-science-weekly-issue-616/comment/156810401 "Sep 17, 2025, 3:14 AM") Hi. I'm the author of "Dataframes in Haskell". I'm curious how the article was discovered.

[SM McCarthy](https://substack.com/profile/63599937-sm-mccarthy?utm_source=substack-feed-item) [Sep 11](https://datascienceweekly.substack.com/p/data-science-weekly-issue-616/comment/154941623 "Sep 11, 2025, 8:22 PM") Thank you so much!
This week's newsletter is great...especially the book "Patterns, Predictions and Actions". I just began digging in and I am so excited. I spend time, almost every week, in Duda and Hart's "Pattern Classification and Scene Analysis". I know this book is going to be GREAT\!
© 2026 datascienceweekly.org, a service of DATAYOU, LLC
Readable Markdown
[![Data Science Weekly](https://substackcdn.com/image/fetch/$s_!byfl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17becea5-db12-4465-be92-858de78b9137_319x253.png)](https://substackcdn.com/image/fetch/$s_!byfl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17becea5-db12-4465-be92-858de78b9137_319x253.png)

Hello\!

**Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.**

***And now…let's dive into some interesting links from this week.***

- **[Minesweeper thermodynamics](https://oscarcunningham.com/792/minesweeper-thermodynamics/)** You know how sometimes you start a game of Minesweeper and immediately get stuck? Like maybe there are some cells that you know are mines, but there aren’t any places that are safe to click…In statistical mechanics, the Boltzmann distribution is a law that tells you how likely a physical system is to be in a particular state. It works in the context that your system is in equilibrium with a larger environment that acts as a ‘heat bath’, holding it at a particular temperature 𝑇…I want to apply it to Minesweeper. The idea is that our little corner of the Minesweeper grid is like a physical system within a larger environment; a ‘mine bath’…
- **[30 minutes with a stranger](https://pudding.cool/2025/06/hello-stranger/)** In this story, we’ll go through 30 minutes of conversation between the people you see here…They are a subset of nearly 1,700 conversations between about 1,500 people as part of a research project called the *CANDOR corpus*.
The goal was to gather a huge amount of data to spur research on how we converse…
- **[Building Vector Tiles from scratch](https://www.debuisne.com/writing/geo-tiles/)** As I add more data to the *NYC Chaos Dashboard*, a website that maps live urban activity, I have been looking for a more efficient way to render the map. Since I collect all of the data in one process and return the Dashboard as one HTML file, I kept wondering how I could optimize the map’s loading time by pre-processing the data as much as possible in the backend. This is where vector tiles come in…

POLL

### Want me to mail you a paper newsletter once a month?

Yes 38%
No 54%
Maybe 8%

POLL CLOSED

I recently signed up for one and it's awesome and way less stressful than opening my inbox and seeing 72 million newsletters all vying for my attention. Right now I get 2 newsletters in paper via mail per month and it’s great\!

[![](https://substackcdn.com/image/fetch/$s_!zPrN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416a8b9e-af00-4b9a-86c8-5cdc76162e6f_578x351.png)](https://substackcdn.com/image/fetch/$s_!zPrN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416a8b9e-af00-4b9a-86c8-5cdc76162e6f_578x351.png)

Now we’re super curious about what you all would do\!

- **[Challenges in Statistics: A Dozen Challenges in Causality and Causal Inference](https://arxiv.org/abs/2508.17099)** Our goal in this discussion is to outline research directions and open problems we view as particularly promising for future work.
Throughout we emphasize that advancing causal research requires a wide range of contributions, from novel theory and methodological innovations to improved software tools and closer engagement with domain scientists and practitioners…
- **[Professionals who perform at the top 10% of their respective organizations \[Reddit\]](https://www.reddit.com/r/productivity/comments/1n5sca8/professionals_who_perform_at_the_top_10_of_their/)** What’s something you do that most people around you don’t?…
- **[The two versions of Parquet](https://www.jeronimo.dev/the-two-versions-of-parquet/)** A few days ago, the creators of DuckDB wrote the article: Query Engines: Gatekeepers of the Parquet File Format, which explained how the engines that process Parquet files as SQL tables are blocking the evolution of the format. This is because those engines are not fully supporting the latest specification, and without this support, the rest of the ecosystem has no incentive to adopt it…
- **[Transparent, Robust and Ultra-Sparse Trees (TRUST)](https://adc-trust-ai.github.io/trust/)** It achieves comparable accuracy to state-of-the-art machine learning algorithms - including black box models like Random Forest - while remaining fully interpretable. Scroll down for a short demo of TRUST. Current version solves regression problems (variants like time series only experimentally). Extensions to multiclass classification and beta regression are already under development and I will soon make them available as well…
- **[Welcome to the synthetic data tutorial](https://lmu-osc.github.io/synthetic-data-tutorial/)** This self-paced tutorial will introduce you to the generation and evaluation of synthetic data. Synthetic data is generated data that can be used as an alternative to privacy-sensitive data, for example to enhance open science practices.
Advantages of open (synthetic) data are numerous: other researchers can re-run analyses with data that is close to the actual data, which allows them to verify the main results. Additionally, open (synthetic) data allows researchers to perform exploratory analyses that may lead to novel hypotheses, and in quite some instances performing such analyses with synthetic data yields rather accurate results…
- **[Dataframes in Haskell](https://docs.google.com/document/d/1oIX_OWzoTXFeN9q7ZRuDuP1mQaRSvu4RhT2Tnj8uV2c/edit?pli=1&tab=t.0#heading=h.5k7zdcnx6q5e)** The goal of this document is to detail the design of a dataframe library for exploratory data analysis (EDA) in Haskell. In addition to fulfilling the usual functional requirements of a dataframe library, the library must also have many modern features learned from years of development in the space…
- **[Bayesian Inference is Just Counting](https://www.youtube.com/watch?v=_NEMHM1wDfI)** Conceptual introduction to Bayesian data analysis, focusing on foundations and causal inference. Nothing really about computational details…
- **[From Frequencies to Coverage: Rethinking What “Representative” Means](https://mindfulmodeler.substack.com/p/the-two-cultures-of-representativeness)** Whether you build an image classifier or want to estimate the average rent in Bologna, you need data. But not just any data, the data should be “representative”: A dog image classifier shouldn’t only be trained on images of dogs in spooky costumes, and the Bologna dataset shouldn’t only contain apartments above restaurants. But what exactly does “representative” mean?
Let’s start with a very general definition…
- **[Simulating and Visualising the Central Limit Theorem](https://blog.foletta.net/post/2025-07-14-clt/)** In this post I want to interrogate and explore the CLT using simulation and visualisation in an attempt to understand how it works in practice, not in theory…
- **[How to Spot (and Fix) 5 Common Performance Bottlenecks in pandas Workflows](https://developer.nvidia.com/blog/how-to-spot-and-fix-5-common-performance-bottlenecks-in-pandas-workflows/)** Slow data loads, memory-intensive joins, and long-running operations—these are problems every Python practitioner has faced. They waste valuable time and make iterating on your ideas harder than it should be. This post walks through five common pandas bottlenecks, how to recognize them, and some workarounds you can try on CPU with a few tweaks to your code…
- **[Why Netflix Struggles To Make Good Movies: A Data Explainer](https://www.statsignificant.com/p/why-netflix-struggles-to-make-good)** Why do Netflix films keep falling flat?… What genuinely interests me is finding a plausible explanation for why a \$530 billion company consistently falls short in its attempts to make great movies. So today, we'll unpack what drives Netflix's underwhelming film output—and explore what purpose these streaming movies are supposed to serve…
- **[Will Amazon S3 Vectors Kill Vector Databases—or Save Them?](https://zilliz.com/blog/will-amazon-s3-vectors-kill-vector-databases-or-save-them)** Not too long ago, AWS dropped something new: S3 Vectors. It’s their first attempt at a vector storage solution, letting you store and query vector embeddings for semantic search right inside Amazon S3… instead of “killing” vector databases, I see it fitting into the ecosystem as a complementary piece. In fact, its real future probably lies in working with professional vector databases, not replacing them.
In this post, I’ll walk you through why I think that—looking at it from three angles: the tech itself, what it can and can’t do, and what it means for the market…
- **[Data Modeling Guide for Real-Time Analytics with ClickHouse: From S3 Ingestion to Sub-Second Dashboards](https://www.ssp.sh/blog/practical-data-modeling-clickhouse/)** This article is for data engineers and practitioners who want to build analytics that deliver sub-second query responses, and who want to unlock ClickHouse’s full potential for real-time analytics demands. By the end, you’ll have a playbook for ClickHouse data modeling plus a working example that ingests NOAA weather data from S3 and visualizes it with a single configuration file…

- **[What over-engineered tool did you finally replace with something simple?](https://www.reddit.com/r/dataengineering/comments/1n2u1ta/what_overengineered_tool_did_you_finally_replace/)**
- **[A One-Page Primer on: Statistical Power](https://www.carlislerainey.com/blog/2025-08-30-1p-statistical-power/)**
- **[DSPy 0‑to‑1 Guide: Building Self‑Improving LLM Applications from Scratch](https://github.com/haasonsaas/dspy-0to1-guide)**

\* Based on unique clicks.

\*\* Find last week's issue \#615 [here](https://datascienceweekly.substack.com/p/data-science-weekly-issue-615).

- **[Against the Uncritical Adoption of 'AI' Technologies in Academia](https://zenodo.org/records/17065099)**
- **[Things that screw up your causal inference](https://www.benkuhn.net/causreg/)**
- **[The Leverage of LLMs for Individuals](https://mazzzystar.com/2023/05/10/LLM-for-individual/)**
- **[CompressGPT: Decrease Token Usage by ~70%](https://musings.yasyf.com/compressgpt-decrease-token-usage-by-70/)**
- **[Patterns, Predictions, and Actions: A story about machine learning \[Book\]](https://mlstory.org/)**
- **[Generalized Additive Models, A Review](https://www.annualreviews.org/content/journals/10.1146/annurev-statistics-112723-034249)**
- **[Untangling Sample and Population Level Estimands in Bayesian Causal Inference](https://arxiv.org/abs/2508.15016)**

1. **[Looking to get a job? Check out our *“Get A Data Science Job”* Course](https://www.datascienceweekly.org/data-science-guides/data-science-getting-started-guide)** It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
2. **[Promote yourself/organization to ~68,500 subscribers](https://www.datascienceweekly.org/advertising)** by sponsoring this newsletter. 30-40% weekly open rate.

Thank you for joining us this week! :)

Stay Data Science-y!

All our best, Hannah & Sebastian
Shard: 76 (laksa)
Root Hash: 14862242593741677076
Unparsed URL: com,substack!datascienceweekly,/p/data-science-weekly-issue-616 s443