Programming Best Practices for Data Scientists

Data scientists usually write code for one of two purposes: prototyping or production. What is the difference between them? A prototype targets one very specific use case, while production code has to be general enough to handle whatever it is given. This article, "Programming Best Practices for Data Scientists", describes several techniques for turning prototype code into production code and is designed to convey the mindset behind a production data science script.

1. Create a data pipeline –

Rather than putting all the logic in a single place, create small, configurable functional modules. By configurable I mean this: suppose you have 10 steps for cleaning data. Create a property file and a data pipeline in which each step can be switched on or off by setting true or false in the property file, rather than by changing the code base. Let's understand this with an example. Suppose you delete some columns that you consider garbage in the data model. Instead of putting that logic directly into the script, create a function controlled by a configurable Boolean parameter whose value comes from the property file. If the value is set to true, the function is applied to the data. If it is set to false, the pipeline bypasses the function.
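The idea above can be sketched in a few lines of Python. This is only a minimal illustration, not a definitive implementation: the section name `cleaning`, the flag names, the `debug_id` column and the two cleaning functions are all assumptions made up for the example, and the property file is inlined as a string so the snippet runs on its own. The standard library's `configparser` does the parsing.

```python
import configparser

# Hypothetical property file contents. In a real project this text would
# live in its own file and be loaded with config.read(path).
PROPERTIES = """
[cleaning]
drop_garbage_columns = true
strip_whitespace = false
"""

GARBAGE_COLUMNS = {"debug_id"}  # hypothetical column name


def drop_garbage_columns(rows):
    """Remove the columns listed in GARBAGE_COLUMNS from every row."""
    return [
        {key: value for key, value in row.items() if key not in GARBAGE_COLUMNS}
        for row in rows
    ]


def strip_whitespace(rows):
    """Trim surrounding whitespace from string values."""
    return [
        {key: value.strip() if isinstance(value, str) else value
         for key, value in row.items()}
        for row in rows
    ]


# Ordered list of (flag name in the property file, step function).
STEPS = [
    ("drop_garbage_columns", drop_garbage_columns),
    ("strip_whitespace", strip_whitespace),
]


def run_pipeline(rows, properties_text=PROPERTIES):
    """Apply each step only if its flag is set to true in the properties."""
    config = configparser.ConfigParser()
    config.read_string(properties_text)
    for flag, step in STEPS:
        # A flag that is absent from the file is treated as false.
        if config.getboolean("cleaning", flag, fallback=False):
            rows = step(rows)
    return rows


rows = [{"name": " Ann ", "debug_id": 7}]
cleaned = run_pipeline(rows)
```

With the flags shown, only the column-dropping step runs, so the whitespace around the name is left alone. Switching `strip_whitespace` to true in the property text enables that step without touching any code.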

2. Proper Logging –

This is very important when you are writing code for production. With a prototype code base, we check the output against a data file whose contents we already know. In production, users may supply data files that differ from what we expected. Most of the time it is easy to pinpoint an issue when both the code and the data are available for testing. The scenario is completely different in production: you will often not get the data at all, because client data is confidential. In that case you rely entirely on the generated log files.
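As a rough sketch of what such logging might look like, the snippet below uses Python's standard `logging` module. The logger name `pipeline`, the function names and the choice to log only row counts and column names (never the confidential values themselves) are assumptions for illustration.

```python
import logging

logger = logging.getLogger("pipeline")  # hypothetical logger name


def configure_logging(log_path):
    """Send INFO-and-above records to a log file with timestamps."""
    handler = logging.FileHandler(log_path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)


def validate_input(rows, required_columns):
    """Log the shape of the input and any missing columns, never the values."""
    logger.info("Received %d rows", len(rows))
    columns = set().union(*(row.keys() for row in rows))
    missing = sorted(set(required_columns) - columns)
    if missing:
        logger.error("Missing required columns: %s", missing)
        return False
    logger.info("Input columns: %s", sorted(columns))
    return True
```

A script would call `configure_logging("pipeline.log")` once at start-up (the file name is just an example) and then call `validate_input` before processing. If a client's file is missing a column, the log alone tells you which one, even though you never see the data.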

3. Unit Testing –

This is not just a best practice for data scientists; it is recommended for everybody who writes code. Write unit test cases for each of your modules. In Python, I recommend the unittest module.
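Here is a small, hedged example of what a unittest test case could look like. The function under test, `fill_missing_with_mean`, is a hypothetical cleaning step invented for this illustration.

```python
import unittest


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    present = [value for value in values if value is not None]
    if not present:
        raise ValueError("cannot impute: no non-missing values")
    mean = sum(present) / len(present)
    return [mean if value is None else value for value in values]


class FillMissingWithMeanTest(unittest.TestCase):
    def test_replaces_none_with_mean(self):
        self.assertEqual(fill_missing_with_mean([1.0, None, 3.0]), [1.0, 2.0, 3.0])

    def test_leaves_complete_data_unchanged(self):
        self.assertEqual(fill_missing_with_mean([4, 5]), [4, 5])

    def test_rejects_all_missing_input(self):
        with self.assertRaises(ValueError):
            fill_missing_with_mean([None, None])
```

If this is saved in a file whose name starts with `test`, running `python -m unittest` from the same directory should discover and run the three tests. Note that the tests cover the normal case, the "nothing to do" case and the failure case, which is usually where production surprises come from.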

4. Code Optimization and performance –

Sometimes we write the code and test it only in a QA or staging environment. The data we use for testing is usually smaller than real data, so we do not see any performance issues. In production, however, the size of the data is completely in the user's hands. Sometimes their large data slows the procedures down too much and creates performance problems. So we should write the code in such a way that its time complexity and space complexity do not grow explosively (for example quadratically or exponentially) with the size of the data.
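To make the point concrete, here is a small sketch comparing two ways of finding duplicate IDs. Both functions and the task itself are assumptions chosen for illustration. The first version looks harmless on a small test file but does quadratic work; the second does a single pass using sets.

```python
def find_duplicate_ids_quadratic(ids):
    """Scan the earlier part of the list for every element: O(n^2) time."""
    duplicates = []
    for index, current in enumerate(ids):
        if current in ids[:index] and current not in duplicates:
            duplicates.append(current)
    return duplicates


def find_duplicate_ids_linear(ids):
    """Single pass with sets: time grows roughly linearly with the input."""
    seen = set()
    duplicates = set()
    for current in ids:
        if current in seen:
            duplicates.add(current)
        else:
            seen.add(current)
    return sorted(duplicates)
```

On a few hundred rows you will not notice any difference between the two. On millions of rows the first version can become the bottleneck of the whole pipeline, which is exactly the kind of issue that only shows up once the user controls the data size.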

5. Readability –

We all know this one very well: you should always write clean code. Again, this is not specific to data science. Trust me, it is not difficult. It only takes some attention and is largely a matter of habit. Take care with variable names, and write docstrings and proper comments where they are required. All of this helps when investigating bugs that were not caught in staging, because if you face an issue in production you have to resolve it in very little time. These small habits play a very important role during that investigation. It is much harder to understand someone else's code if it is not written in a standard format. I suggest following a standard variable naming convention and a common code formatter across the organization.
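As a quick illustration of how much names and docstrings matter, compare the two functions below. They do exactly the same thing. Both are made-up examples, and the docstring layout is just one common convention, not a requirement.

```python
def f(d, t):
    return [x for x in d if x[1] > t]


def filter_customers_by_min_spend(customers, min_spend):
    """Return the customers whose total spend exceeds ``min_spend``.

    Args:
        customers: iterable of (customer_id, total_spend) tuples.
        min_spend: threshold; customers at or below it are dropped.
    """
    return [
        (customer_id, total_spend)
        for customer_id, total_spend in customers
        if total_spend > min_spend
    ]
```

When something breaks in production, the second version can be understood in seconds by someone who did not write it, while the first needs to be reverse engineered.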

There are a few more programming best practices for data scientists, but the ones above are enough at the beginner and intermediate levels.


I decided to share this topic after facing the issue in real life. I had prepared a predictive model for a demo prototype. After the demo, the stakeholders decided to launch that model into the production environment very quickly. From their perspective everything was ready, but there was a tremendous amount of work remaining. From that day on, I set out to find the practices that I have documented in this article.

I hope you found this article, Programming Best Practices for Data Scientists, interesting and useful. If you would like to add other best practices on top of these, feel free to write to us. You may contact us via our social media channels or comment below.

Data Science Learner Team