ℹ️ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.7 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
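The five conditions in the table above can be read as simple predicates over one page record. The following is a minimal sketch of that reading, not the crawler's actual implementation: the function name `check_filters`, the 183-day approximation of "6 MONTH", the value of `fh_dont_index`, and the "current time" are all assumptions made for illustration; the field names come from the Condition column.

```python
from datetime import datetime, timedelta


def check_filters(page: dict, now: datetime) -> dict:
    """Evaluate the five filter conditions for one crawled-page record."""
    canonical = page.get("meta_canonical")
    return {
        "HTTP status": page["download_http_code"] == 200,
        # Assumption: "6 MONTH" is approximated as 183 days.
        "Age cutoff": page["download_stamp"] > now - timedelta(days=183),
        "History drop": page.get("history_drop_reason") is None,
        "Spam/ban": page.get("fh_dont_index") != 1 and page.get("ml_spam_score") == 0,
        # Passes when no canonical is set, or when it points at the page itself.
        "Canonical": canonical is None or canonical == "" or canonical == page["src_unparsed"],
    }


# Example record; values are taken from this report where shown, otherwise assumed.
page = {
    "src_unparsed": "https://datascienceweekly.substack.com/p/data-science-weekly-issue-616",
    "download_http_code": 200,
    "download_stamp": datetime(2026, 3, 23, 8, 59, 50),
    "history_drop_reason": None,
    "fh_dont_index": 0,  # assumed; the report only shows ml_spam_score
    "ml_spam_score": 0,
    "meta_canonical": None,
}

# Assumed "current time": 19 days after the last crawl.
now = datetime(2026, 4, 11, 9, 0, 0)
results = check_filters(page, now)
for name, passed in results.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

Under these assumptions every filter passes for the sample record, which matches the Status column; changing `download_http_code` or moving `download_stamp` back by more than six months flips the corresponding entry.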
| Property | Value |
|---|---|
| URL | https://datascienceweekly.substack.com/p/data-science-weekly-issue-616 |
| Last Crawled | 2026-03-23 08:59:50 (19 days ago) |
| First Indexed | 2025-09-11 11:38:03 (7 months ago) |
| HTTP Status Code | 200 |
| Meta Title | Data Science Weekly - Issue 616 |
| Meta Description | Curated news, articles and jobs related to Data Science, AI, & Machine Learning |
| Meta Canonical | null |
| Boilerpipe Text | Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Minesweeper thermodynamics
You know how sometimes you start a game of Minesweeper and immediately get stuck? Like maybe there are some cells that you know are mines, but there aren’t any places that are safe to click…In statistical mechanics, the Boltzmann distribution is a law that tells you how likely a physical system is to be in a particular state. It works in the context that your system is in equilibrium with a larger environment that acts as a “heat bath”, holding it at a particular temperature T…I want to apply it to Minesweeper. The idea is that our little corner of the Minesweeper grid is like a physical system within a larger environment; a “mine bath”…
30 minutes with a stranger
In this story, we’ll go through 30 minutes of conversation between the people you see here…They are a subset of nearly 1,700 conversations between about 1,500 people as part of a research project called the CANDOR corpus. The goal was to gather a huge amount of data to spur research on how we converse…
Building Vector Tiles from scratch
As I add more data to the NYC Chaos Dashboard, a website that maps live urban activity, I have been looking for a more efficient way to render the map. Since I collect all of the data in one process and return the Dashboard as one HTML file, I kept wondering how I could optimize the map’s loading time by pre-processing the data as much as possible in the backend. This is where vector tiles come in…
POLL
Want me to mail you a paper newsletter once a month?
Yes: 38%
No: 54%
Maybe: 8%
POLL CLOSED
I recently signed up for one and it's awesome and way less stressful than opening my inbox and seeing 72 million newsletters all vying for my attention. Right now I get 2 newsletters in paper via mail per month and it’s great!
.
Now we’re super curious about what you all would do!
.
Challenges in Statistics: A Dozen Challenges in Causality and Causal Inference
Our goal in this discussion is to outline research directions and open problems we view as particularly promising for future work. Throughout we emphasize that advancing causal research requires a wide range of contributions, from novel theory and methodological innovations to improved software tools and closer engagement with domain scientists and practitioners…
Professionals who perform at the top 10% of their respective organizations [Reddit]
What’s something you do that most people around you don’t?…
The two versions of Parquet
A few days ago, the creators of DuckDB wrote the article: Query Engines: Gatekeepers of the Parquet File Format, which explained how the engines that process Parquet files as SQL tables are blocking the evolution of the format. This is because those engines are not fully supporting the latest specification, and without this support, the rest of the ecosystem has no incentive to adopt it…
Transparent, Robust and Ultra-Sparse Trees (TRUST)
It achieves comparable accuracy to state-of-the-art machine learning algorithms - including black box models like Random Forest - while remaining fully interpretable. Scroll down for a short demo of TRUST. Current version solves regression problems (variants like time series only experimentally). Extensions to multiclass classification and beta regression are already under development and I will soon make them available as well…
Welcome to the synthetic data tutorial
This self-paced tutorial will introduce you to the generation and evaluation of synthetic data. Synthetic data is generated data that can be used as an alternative to privacy-sensitive data, for example to enhance open science practices. Advantages of open (synthetic) data are numerous: other researchers can re-run analyses with data that is close to the actual data, which allows them to verify the main results. Additionally, open (synthetic) data allows researchers to perform exploratory analyses that may lead to novel hypotheses, and in quite some instances performing such analyses with synthetic data yields rather accurate results…
Dataframes in Haskell
The goal of this document is to detail the design of a dataframe library for exploratory data analysis (EDA) in Haskell. In addition to fulfilling the usual functional requirements of a dataframe library, the library must also have many modern features learned from years of development in the space…
Bayesian Inference is Just Counting
Conceptual introduction to Bayesian data analysis, focusing on foundations and causal inference. Nothing really about computational details…
From Frequencies to Coverage: Rethinking What “Representative” Means
Whether you build an image classifier or want to estimate the average rent in Bologna, you need data. But not just any data, the data should be “representative”: A dog image classifier shouldn’t only be trained on images of dogs in spooky costumes, and the Bologna dataset shouldn’t only contain apartments above restaurants. But what exactly does “representative” mean? Let’s start with a very general definition…
Simulating and Visualising the Central Limit Theorem
In this post I want to interrogate and explore the CLT using simulation and visualisation in an attempt to understand how it works in practice, not in theory…
How to Spot (and Fix) 5 Common Performance Bottlenecks in pandas Workflows
Slow data loads, memory-intensive joins, and long-running operations - these are problems every Python practitioner has faced. They waste valuable time and make iterating on your ideas harder than it should be. This post walks through five common pandas bottlenecks, how to recognize them, and some workarounds you can try on CPU with a few tweaks to your code…
Why Netflix Struggles To Make Good Movies: A Data Explainer
Why do Netflix films keep falling flat?…What genuinely interests me is finding a plausible explanation for why a $530 billion company consistently falls short in its attempts to make great movies. So today, we'll unpack what drives Netflix's underwhelming film output - and explore what purpose these streaming movies are supposed to serve…
Will Amazon S3 Vectors Kill Vector Databases - or Save Them?
Not too long ago, AWS dropped something new: S3 Vectors. It’s their first attempt at a vector storage solution, letting you store and query vector embeddings for semantic search right inside Amazon S3…instead of “killing” vector databases, I see it fitting into the ecosystem as a complementary piece. In fact, its real future probably lies in working with professional vector databases, not replacing them. In this post, I’ll walk you through why I think that - looking at it from three angles: the tech itself, what it can and can’t do, and what it means for the market…
Data Modeling Guide for Real-Time Analytics with ClickHouse: From S3 Ingestion to Sub-Second Dashboards
This article is for data engineers and practitioners who want to build analytics that deliver sub-second query responses, and who want to unlock ClickHouse’s full potential for real-time analytics demands. By the end, you’ll have a playbook for ClickHouse data modeling plus a working example that ingests NOAA weather data from S3 and visualizes it with a single configuration file…
.
What over-engineered tool did you finally replace with something simple?
A One-Page Primer on: Statistical Power
DSPy 0-to-1 Guide: Building Self-Improving LLM Applications from Scratch
.
* Based on unique clicks.
** Find last week's issue #615 here.
Against the Uncritical Adoption of 'AI' Technologies in Academia
Things that screw up your causal inference
The Leverage of LLMs for Individuals
CompressGPT: Decrease Token Usage by ~70%
Patterns, Predictions, and Actions: A story about machine learning [Book]
Generalized Additive Models, A Review
Untangling Sample and Population Level Estimands in Bayesian Causal Inference
.
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~68,500 subscribers by sponsoring this newsletter. 30-40% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian |
| Markdown | # [Data Science Weekly Newsletter](https://datascienceweekly.substack.com/)
# Data Science Weekly - Issue 616
### Curated news, articles and jobs related to Data Science, AI, & Machine Learning
[Data Science Weekly](https://substack.com/@datascienceweekly)
Sep 11, 2025
## **Issue \#616 September 11, 2025**
***
Hello\!
**Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.**
***
Data Science Weekly Newsletter is a reader-supported publication. To receive new posts and support our work, consider becoming a free or paid subscriber.
***
***And now…let's dive into some interesting links from this week.***
***
## **Editor's Picks**
- **[Minesweeper thermodynamics](https://oscarcunningham.com/792/minesweeper-thermodynamics/)**
You know how sometimes you start a game of Minesweeper and immediately get stuck? Like maybe there are some cells that you know are mines, but there aren’t any places that are safe to click…In statistical mechanics, the Boltzmann distribution is a law that tells you how likely a physical system is to be in a particular state. It works in the context that your system is in equilibrium with a larger environment that acts as a “heat bath”, holding it at a particular temperature T…I want to apply it to Minesweeper. The idea is that our little corner of the Minesweeper grid is like a physical system within a larger environment; a “mine bath”…
- **[30 minutes with a stranger](https://pudding.cool/2025/06/hello-stranger/)** In this story, we’ll go through 30 minutes of conversation between the people you see here…They are a subset of nearly 1,700 conversations between about 1,500 people as part of a research project called the *CANDOR corpus*. The goal was to gather a huge amount of data to spur research on how we converse…
- **[Building Vector Tiles from scratch](https://www.debuisne.com/writing/geo-tiles/)** As I add more data to the *NYC Chaos Dashboard*, a website that maps live urban activity, I have been looking for a more efficient way to render the map. Since I collect all of the data in one process and return the Dashboard as one HTML file, I kept wondering how I could optimize the map’s loading time by pre-processing the data as much as possible in the backend. This is where vector tiles come in…
***
# **What’s on your mind**
## This Week’s Poll:
POLL
### Want me to mail you a paper newsletter once a month?
Yes: 38%
No: 54%
Maybe: 8%
POLL CLOSED
I recently signed up for one and it's awesome and way less stressful than opening my inbox and seeing 72 million newsletters all vying for my attention. Right now I get 2 newsletters in paper via mail per month and it’s great\!
.
## Last Week’s Poll:
[](https://substackcdn.com/image/fetch/$s_!zPrN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416a8b9e-af00-4b9a-86c8-5cdc76162e6f_578x351.png)
Now we’re super curious about what you all would do\!
.
***
## Data Science Articles & Videos
- **[Challenges in Statistics: A Dozen Challenges in Causality and Causal Inference](https://arxiv.org/abs/2508.17099)**
Our goal in this discussion is to outline research directions and open problems we view as particularly promising for future work. Throughout we emphasize that advancing causal research requires a wide range of contributions, from novel theory and methodological innovations to improved software tools and closer engagement with domain scientists and practitioners…
- **[Professionals who perform at the top 10% of their respective organizations \[Reddit\]](https://www.reddit.com/r/productivity/comments/1n5sca8/professionals_who_perform_at_the_top_10_of_their/)**
What’s something you do that most people around you don’t?…
- **[The two versions of Parquet](https://www.jeronimo.dev/the-two-versions-of-parquet/)**
A few days ago, the creators of DuckDB wrote the article: Query Engines: Gatekeepers of the Parquet File Format, which explained how the engines that process Parquet files as SQL tables are blocking the evolution of the format. This is because those engines are not fully supporting the latest specification, and without this support, the rest of the ecosystem has no incentive to adopt it…
- **[Transparent, Robust and Ultra-Sparse Trees (TRUST)](https://adc-trust-ai.github.io/trust/)** It achieves comparable accuracy to state-of-the-art machine learning algorithms - including black box models like Random Forest - while remaining fully interpretable. Scroll down for a short demo of TRUST. Current version solves regression problems (variants like time series only experimentally). Extensions to multiclass classification and beta regression are already under development and I will soon make them available as well…
- **[Welcome to the synthetic data tutorial](https://lmu-osc.github.io/synthetic-data-tutorial/)** This self-paced tutorial will introduce you to the generation and evaluation of synthetic data. Synthetic data is generated data that can be used as an alternative to privacy-sensitive data, for example to enhance open science practices. Advantages of open (synthetic) data are numerous: other researchers can re-run analyses with data that is close to the actual data, which allows them to verify the main results. Additionally, open (synthetic) data allows researchers to perform exploratory analyses that may lead to novel hypotheses, and in quite some instances performing such analyses with synthetic data yields rather accurate results…
- **[Dataframes in Haskell](https://docs.google.com/document/d/1oIX_OWzoTXFeN9q7ZRuDuP1mQaRSvu4RhT2Tnj8uV2c/edit?pli=1&tab=t.0#heading=h.5k7zdcnx6q5e)**
The goal of this document is to detail the design of a dataframe library for exploratory data analysis (EDA) in Haskell. In addition to fulfilling the usual functional requirements of a dataframe library, the library must also have many modern features learned from years of development in the space…
- **[Bayesian Inference is Just Counting](https://www.youtube.com/watch?v=_NEMHM1wDfI)** Conceptual introduction to Bayesian data analysis, focusing on foundations and causal inference. Nothing really about computational details…
- **[From Frequencies to Coverage: Rethinking What “Representative” Means](https://mindfulmodeler.substack.com/p/the-two-cultures-of-representativeness)** Whether you build an image classifier or want to estimate the average rent in Bologna, you need data. But not just any data, the data should be “representative”: A dog image classifier shouldn’t only be trained on images of dogs in spooky costumes, and the Bologna dataset shouldn’t only contain apartments above restaurants. But what exactly does “representative” mean? Let’s start with a very general definition…
- **[Simulating and Visualising the Central Limit Theorem](https://blog.foletta.net/post/2025-07-14-clt/)**
In this post I want to interrogate and explore the CLT using simulation and visualisation in an attempt to understand how it works in practice, not in theory…
- **[How to Spot (and Fix) 5 Common Performance Bottlenecks in pandas Workflows](https://developer.nvidia.com/blog/how-to-spot-and-fix-5-common-performance-bottlenecks-in-pandas-workflows/)** Slow data loads, memory-intensive joins, and long-running operations - these are problems every Python practitioner has faced. They waste valuable time and make iterating on your ideas harder than it should be. This post walks through five common pandas bottlenecks, how to recognize them, and some workarounds you can try on CPU with a few tweaks to your code…
- **[Why Netflix Struggles To Make Good Movies: A Data Explainer](https://www.statsignificant.com/p/why-netflix-struggles-to-make-good)**
Why do Netflix films keep falling flat?…What genuinely interests me is finding a plausible explanation for why a \$530 billion company consistently falls short in its attempts to make great movies. So today, we'll unpack what drives Netflix's underwhelming film output - and explore what purpose these streaming movies are supposed to serve…
- **[Will Amazon S3 Vectors Kill Vector Databases - or Save Them?](https://zilliz.com/blog/will-amazon-s3-vectors-kill-vector-databases-or-save-them)** Not too long ago, AWS dropped something new: S3 Vectors. It’s their first attempt at a vector storage solution, letting you store and query vector embeddings for semantic search right inside Amazon S3…instead of “killing” vector databases, I see it fitting into the ecosystem as a complementary piece. In fact, its real future probably lies in working with professional vector databases, not replacing them. In this post, I’ll walk you through why I think that - looking at it from three angles: the tech itself, what it can and can’t do, and what it means for the market…
- **[Data Modeling Guide for Real-Time Analytics with ClickHouse: From S3 Ingestion to Sub-Second Dashboards](https://www.ssp.sh/blog/practical-data-modeling-clickhouse/)** This article is for data engineers and practitioners who want to build analytics that deliver sub-second query responses, and who want to unlock ClickHouse’s full potential for real-time analytics demands. By the end, you’ll have a playbook for ClickHouse data modeling plus a working example that ingests NOAA weather data from S3 and visualizes it with a single configuration file…
.
***
## Last Week's Newsletter's 3 Most Clicked Links
- **[What over-engineered tool did you finally replace with something simple?](https://www.reddit.com/r/dataengineering/comments/1n2u1ta/what_overengineered_tool_did_you_finally_replace/)**
- **[A One-Page Primer on: Statistical Power](https://www.carlislerainey.com/blog/2025-08-30-1p-statistical-power/)**
- **[DSPy 0-to-1 Guide: Building Self-Improving LLM Applications from Scratch](https://github.com/haasonsaas/dspy-0to1-guide)**
.
\* Based on unique clicks.
\*\* Find last week's issue \#615 [here](https://datascienceweekly.substack.com/p/data-science-weekly-issue-615).
***
## Cutting Room Floor
- **[Against the Uncritical Adoption of 'AI' Technologies in Academia](https://zenodo.org/records/17065099)**
- **[Things that screw up your causal inference](https://www.benkuhn.net/causreg/)**
- **[The Leverage of LLMs for Individuals](https://mazzzystar.com/2023/05/10/LLM-for-individual/)**
- **[CompressGPT: Decrease Token Usage by ~70%](https://musings.yasyf.com/compressgpt-decrease-token-usage-by-70/)**
- **[Patterns, Predictions, and Actions: A story about machine learning \[Book\]](https://mlstory.org/)**
- **[Generalized Additive Models, A Review](https://www.annualreviews.org/content/journals/10.1146/annurev-statistics-112723-034249)**
- **[Untangling Sample and Population Level Estimands in Bayesian Causal Inference](https://arxiv.org/abs/2508.15016)**
.
***
## **Whenever you're ready, 2 ways we can help:**
1. **[Looking to get a job? Check out our](https://www.datascienceweekly.org/data-science-guides/data-science-getting-started-guide)** ***[“Get A Data Science Job”](https://www.datascienceweekly.org/data-science-guides/data-science-getting-started-guide)*** **[Course](https://www.datascienceweekly.org/data-science-guides/data-science-getting-started-guide)**
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
2. **[Promote yourself/organization to ~68,500 subscribers](https://www.datascienceweekly.org/advertising)** by sponsoring this newsletter. 30-40% weekly open rate.
***
Thank you for joining us this week! :)
Stay Data Science-y\!
All our best,
Hannah & Sebastian
***
#### Discussion about this post
[Michael Chavinda](https://substack.com/profile/156626454-michael-chavinda?utm_source=substack-feed-item)
[Sep 17](https://datascienceweekly.substack.com/p/data-science-weekly-issue-616/comment/156810401 "Sep 17, 2025, 3:14 AM")
Hi. I'm the author of "Dataframes in Haskell". I'm curious how the article was discovered.
[SM McCarthy](https://substack.com/profile/63599937-sm-mccarthy?utm_source=substack-feed-item)
[Sep 11](https://datascienceweekly.substack.com/p/data-science-weekly-issue-616/comment/154941623 "Sep 11, 2025, 8:22 PM")
Thank you so much! This week's newsletter is great...especially the book "Patterns, Predictions and Actions". I just began digging in and I am so excited. I spend time, almost every week, in Duda and Hart's "Pattern Classification and Scene Analysis". I know this book is going to be GREAT\! |
| Readable Markdown | Hello\!
**Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.**
***And now…let's dive into some interesting links from this week.***
- **[Minesweeper thermodynamics](https://oscarcunningham.com/792/minesweeper-thermodynamics/)**
You know how sometimes you start a game of Minesweeper and immediately get stuck? Like maybe there are some cells that you know are mines, but there aren’t any places that are safe to click…In statistical mechanics, the Boltzmann distribution is a law that tells you how likely a physical system is to be in a particular state. It works in the context that your system is in equilibrium with a larger environment that acts as a “heat bath”, holding it at a particular temperature T…I want to apply it to Minesweeper. The idea is that our little corner of the Minesweeper grid is like a physical system within a larger environment; a “mine bath”…
- **[30 minutes with a stranger](https://pudding.cool/2025/06/hello-stranger/)** In this story, we’ll go through 30 minutes of conversation between the people you see here…They are a subset of nearly 1,700 conversations between about 1,500 people as part of a research project called the *CANDOR corpus*. The goal was to gather a huge amount of data to spur research on how we converse…
- **[Building Vector Tiles from scratch](https://www.debuisne.com/writing/geo-tiles/)** As I add more data to the *NYC Chaos Dashboard*, a website that maps live urban activity, I have been looking for a more efficient way to render the map. Since I collect all of the data in one process and return the Dashboard as one HTML file, I kept wondering how I could optimize the map’s loading time by pre-processing the data as much as possible in the backend. This is where vector tiles come in…
POLL
### Want me to mail you a paper newsletter once a month?
Yes: 38%
No: 54%
Maybe: 8%
POLL CLOSED
I recently signed up for one and it's awesome and way less stressful than opening my inbox and seeing 72 million newsletters all vying for my attention. Right now I get 2 newsletters in paper via mail per month and it’s great\!
.
[](https://substackcdn.com/image/fetch/$s_!zPrN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416a8b9e-af00-4b9a-86c8-5cdc76162e6f_578x351.png)
Now we’re super curious about what you all would do\!
.
- **[Challenges in Statistics: A Dozen Challenges in Causality and Causal Inference](https://arxiv.org/abs/2508.17099)**
Our goal in this discussion is to outline research directions and open problems we view as particularly promising for future work. Throughout we emphasize that advancing causal research requires a wide range of contributions, from novel theory and methodological innovations to improved software tools and closer engagement with domain scientists and practitioners…
- **[Professionals who perform at the top 10% of their respective organizations \[Reddit\]](https://www.reddit.com/r/productivity/comments/1n5sca8/professionals_who_perform_at_the_top_10_of_their/)**
What’s something you do that most people around you don’t?…
- **[The two versions of Parquet](https://www.jeronimo.dev/the-two-versions-of-parquet/)**
A few days ago, the creators of DuckDB wrote the article: Query Engines: Gatekeepers of the Parquet File Format, which explained how the engines that process Parquet files as SQL tables are blocking the evolution of the format. This is because those engines are not fully supporting the latest specification, and without this support, the rest of the ecosystem has no incentive to adopt it…
- **[Transparent, Robust and Ultra-Sparse Trees (TRUST)](https://adc-trust-ai.github.io/trust/)** It achieves comparable accuracy to state-of-the-art machine learning algorithms - including black box models like Random Forest - while remaining fully interpretable. Scroll down for a short demo of TRUST. Current version solves regression problems (variants like time series only experimentally). Extensions to multiclass classification and beta regression are already under development and I will soon make them available as well…
- **[Welcome to the synthetic data tutorial](https://lmu-osc.github.io/synthetic-data-tutorial/)** This self-paced tutorial will introduce you to the generation and evaluation of synthetic data. Synthetic data is generated data that can be used as an alternative to privacy-sensitive data, for example to enhance open science practices. Advantages of open (synthetic) data are numerous: other researchers can re-run analyses with data that is close to the actual data, which allows them to verify the main results. Additionally, open (synthetic) data allows researchers to perform exploratory analyses that may lead to novel hypotheses, and in quite some instances performing such analyses with synthetic data yields rather accurate results…
- **[Dataframes in Haskell](https://docs.google.com/document/d/1oIX_OWzoTXFeN9q7ZRuDuP1mQaRSvu4RhT2Tnj8uV2c/edit?pli=1&tab=t.0#heading=h.5k7zdcnx6q5e)**
The goal of this document is to detail the design of a dataframe library for exploratory data analysis (EDA) in Haskell. In addition to fulfilling the usual functional requirements of a dataframe library, the library must also have many modern features learned from years of development in the space…
- **[Bayesian Inference is Just Counting](https://www.youtube.com/watch?v=_NEMHM1wDfI)** Conceptual introduction to Bayesian data analysis, focusing on foundations and causal inference. Nothing really about computational details…
- **[From Frequencies to Coverage: Rethinking What “Representative” Means](https://mindfulmodeler.substack.com/p/the-two-cultures-of-representativeness)** Whether you build an image classifier or want to estimate the average rent in Bologna, you need data. But not just any data, the data should be “representative”: A dog image classifier shouldn’t only be trained on images of dogs in spooky costumes, and the Bologna dataset shouldn’t only contain apartments above restaurants. But what exactly does “representative” mean? Let’s start with a very general definition…
- **[Simulating and Visualising the Central Limit Theorem](https://blog.foletta.net/post/2025-07-14-clt/)**
In this post I want to interrogate and explore the CLT using simulation and visualisation in an attempt to understand how it works in practice, not in theory…
- **[How to Spot (and Fix) 5 Common Performance Bottlenecks in pandas Workflows](https://developer.nvidia.com/blog/how-to-spot-and-fix-5-common-performance-bottlenecks-in-pandas-workflows/)** Slow data loads, memory-intensive joins, and long-running operations - these are problems every Python practitioner has faced. They waste valuable time and make iterating on your ideas harder than it should be. This post walks through five common pandas bottlenecks, how to recognize them, and some workarounds you can try on CPU with a few tweaks to your code…
- **[Why Netflix Struggles To Make Good Movies: A Data Explainer](https://www.statsignificant.com/p/why-netflix-struggles-to-make-good)**
Why do Netflix films keep falling flat?…What genuinely interests me is finding a plausible explanation for why a \$530 billion company consistently falls short in its attempts to make great movies. So today, we'll unpack what drives Netflix's underwhelming film output, and explore what purpose these streaming movies are supposed to serve…
- **[Will Amazon S3 Vectors Kill Vector Databases - or Save Them?](https://zilliz.com/blog/will-amazon-s3-vectors-kill-vector-databases-or-save-them)** Not too long ago, AWS dropped something new: S3 Vectors. It’s their first attempt at a vector storage solution, letting you store and query vector embeddings for semantic search right inside Amazon S3…instead of “killing” vector databases, I see it fitting into the ecosystem as a complementary piece. In fact, its real future probably lies in working with professional vector databases, not replacing them. In this post, I’ll walk you through why I think that, looking at it from three angles: the tech itself, what it can and can’t do, and what it means for the market…
- **[Data Modeling Guide for Real-Time Analytics with ClickHouse: From S3 Ingestion to Sub-Second Dashboards](https://www.ssp.sh/blog/practical-data-modeling-clickhouse/)** This article is for data engineers and practitioners who want to build analytics that deliver sub-second query responses, and who want to unlock ClickHouse’s full potential for real-time analytics demands. By the end, you’ll have a playbook for ClickHouse data modeling plus a working example that ingests NOAA weather data from S3 and visualizes it with a single configuration file…
.
- **[What over-engineered tool did you finally replace with something simple?](https://www.reddit.com/r/dataengineering/comments/1n2u1ta/what_overengineered_tool_did_you_finally_replace/)**
- **[A One-Page Primer on: Statistical Power](https://www.carlislerainey.com/blog/2025-08-30-1p-statistical-power/)**
- **[DSPy 0-to-1 Guide: Building Self-Improving LLM Applications from Scratch](https://github.com/haasonsaas/dspy-0to1-guide)**
.
\* Based on unique clicks.
\*\* Find last week's issue \#615 [here](https://datascienceweekly.substack.com/p/data-science-weekly-issue-615).
- **[Against the Uncritical Adoption of 'AI' Technologies in Academia](https://zenodo.org/records/17065099)**
- **[Things that screw up your causal inference](https://www.benkuhn.net/causreg/)**
- **[The Leverage of LLMs for Individuals](https://mazzzystar.com/2023/05/10/LLM-for-individual/)**
- **[CompressGPT: Decrease Token Usage by ~70%](https://musings.yasyf.com/compressgpt-decrease-token-usage-by-70/)**
- **[Patterns, Predictions, and Actions: A story about machine learning \[Book\]](https://mlstory.org/)**
- **[Generalized Additive Models, A Review](https://www.annualreviews.org/content/journals/10.1146/annurev-statistics-112723-034249)**
- **[Untangling Sample and Population Level Estimands in Bayesian Causal Inference](https://arxiv.org/abs/2508.15016)**
.
1. **[Looking to get a job? Check out our](https://www.datascienceweekly.org/data-science-guides/data-science-getting-started-guide)** ***[“Get A Data Science Job”](https://www.datascienceweekly.org/data-science-guides/data-science-getting-started-guide)*** **[Course](https://www.datascienceweekly.org/data-science-guides/data-science-getting-started-guide)**
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
2. **[Promote yourself/organization to ~68,500 subscribers](https://www.datascienceweekly.org/advertising)** by sponsoring this newsletter. 30-40% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y\!
All our best,
Hannah & Sebastian |
| Shard | 76 (laksa) |
| Root Hash | 14862242593741677076 |
| Unparsed URL | com,substack!datascienceweekly,/p/data-science-weekly-issue-616 s443 |